Unlocking the Colonial Archive: Harnessing Artificial Intelligence for Indigenous and Spanish American Historical Collections

Lead Research Organisation: Lancaster University
Department Name: History


The Spanish empire controlled the vast majority of the western hemisphere's lands and peoples for more than three centuries. Its vast administration in the Americas depended on the work of royal notaries, Indigenous artists, and printers. They produced a prodigious number of documents, written or printed on paper, which fill archives and libraries today. Despite this extensive documentation, present-day understanding of the Spanish colonial enterprise is fragmentary. Once the initial barrier of archival access has been overcome, scholars and other publics must then decipher archaic penmanship, obscure writing conventions, and unfamiliar Indigenous imagery. This project seeks to lower these barriers by introducing artificial intelligence (AI) technologies into representative Indigenous and Spanish colonial archives in Mexico and the U.S., and training them to convert the "unreadable" archive into data accessible worldwide. The project has the potential to revolutionize how cultural institutions provide access to their colonial collections and how humanities researchers can undertake cutting-edge digital scholarship.

In a highly interdisciplinary collaboration between archaeologists, historians, web scientists, designers, and computer scientists, the "Unlocking the Colonial Archive" project will create a step-change in the way a broad spectrum of researchers and the public engage with and use countless early modern Indigenous and Spanish collections dispersed throughout the world. Using machine learning and the exceptional collections of the LLILAS Benson library (US) and the General Archive of the Nation (Mexico), the project will tackle three challenges in interconnected research areas to: (a) accomplish the automated transcription of 16th- and 17th-century historical colonial documents that combine Spanish with Indigenous languages such as Nahuatl, Mixtec, Huastec, and Otomi, among others; (b) develop methods to carry out text mining in large historical collections; and (c) develop techniques to facilitate the automated identification of iconographic and other pictorial features in Indigenous maps and printed books.

The development of such approaches will not only facilitate the searching, retrieval, and reading of these materials, but will also transform the accessibility and analysis of large textual and image collections. With a strong commitment to a decolonial approach, both in terms of archival practices and in the critical use of technologies, the project will create freely available, enhanced open digital collections. As such, "Unlocking the Colonial Archive" will work in close partnership with Mexican, UK, US, Portuguese, and Spanish researchers and institutions, training scholars and interested members of the public on transferable skills and digital methods, and it will produce innovative, reproducible workflows that Latin American scholars and cultural institutions around the world can adopt and implement.


Description: History Departmental Fund
Amount: £1,000 (GBP)
Organisation: Lancaster University
Sector: Academic/University
Country: United Kingdom
Start: 06/2021
End: 09/2021

Description: NEH-AHRC Spanish Paleography & Digital Humanities
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme?: No
Geographic Reach: International
Primary Audience: Postgraduate students
Results and Impact: To generate more ground truth transcriptions for handwritten text recognition (HTR) models, expose LLILAS Benson's digitized Spanish colonial collections to a broader public, and provide training in the machine learning tools we are using in this project, we organized two "NEH-AHRC Spanish Paleography & Digital Humanities" institutes. These 7-week online training programs took place in November-December 2021 and January-March 2022. Each institute provided practical training in the reading and visualization of 16th- to 18th-century colonial Spanish manuscripts preserved at LLILAS Benson.

The training developed skills in two areas. First, participants obtained specialized training on several free and open source tools that they can use to extract, visualize, and present data in colonial texts. These included FromThePage, Recogito, Voyant-Tools, ArcGIS, Onodo, and Transkribus. Each Monday, we led a DH workshop on a particular tool using the transcriptions participants had created for homework in previous weeks as datasets or sample texts we had already prepared.

Second, students learned and honed paleography skills for the accurate reading and transcription of these Spanish colonial manuscripts. Every Friday, we split the class randomly into 6 breakout groups to work on specific pages, chosen for their document types and handwriting styles, that we had pre-transcribed. Participants took turns reading these materials out loud and transcribing them in a shared class Google Doc. The expectation was that they would support each other in deciphering the text, with help coming especially from those who had indicated that they had advanced paleography skills in specific handwriting styles. Throughout the session, we corrected their transcriptions in the shared Google Doc or provided hints when they could not read a particular word. In the fall institute, we invited experts from Germany, Portugal, France, and Mexico to talk about specific document types; we recorded these presentations and shared them with the spring cohort.

To give students individual paleography practice, we assigned 2-4 pages of manuscript material for homework each week. Students used FromThePage, a collaborative transcription tool we host on our University of Texas Libraries servers, to complete these transcriptions. We subsequently corrected the transcriptions to give students feedback; FromThePage documents and preserves versioning, which enables students to compare the corrected version with their initial transcription. We then used the corrected transcriptions to train HTR models in Transkribus.
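
The comparison between an initial transcription and its corrected version can also be expressed as a number. The following is a minimal sketch, not the project's own tooling: it computes a character error rate (CER), a measure commonly used when evaluating HTR output, as the edit distance between two strings divided by the length of the corrected text. The function names and the two sample strings are hypothetical and are not drawn from the institute's materials.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(initial: str, corrected: str) -> float:
    """Edit distance divided by the length of the corrected text."""
    if not corrected:
        raise ValueError("corrected transcription must not be empty")
    return edit_distance(initial, corrected) / len(corrected)


# Hypothetical sample lines, invented for illustration only.
student_version = "en la cibdad de mexico"
corrected_version = "en la ciudad de mexico"
print(f"CER: {character_error_rate(student_version, corrected_version):.3f}")
```

A lower value means the initial transcription was closer to the corrected one; tracking it week by week would be one possible way to follow a student's progress.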

Collectively, we have trained or are currently training 60 scholars (35 graduate students, 8 junior faculty, 8 tenured professors, 5 archive/library professionals, and 4 independent researchers) from 11 countries and 18 U.S. states. On the last day of the fall institute, we sent out an assessment survey to help improve the training for the spring. All 24 respondents reported that the institute at least met, if not exceeded, their expectations, and that they found all the resources we created (recordings, document keys, training guides, and presentations) useful.
Year(s) Of Engagement Activity: 2021, 2022