Unlocking the Colonial Archive: Harnessing Artificial Intelligence for Indigenous and Spanish American Historical Collections

Lead Research Organisation: Lancaster University
Department Name: History

Abstract

The Spanish empire controlled the vast majority of the western hemisphere's lands and peoples for more than three centuries. Its vast administration in the Americas depended on the work of royal notaries, Indigenous artists, and printers. They produced prodigious amounts of documents, written or printed on paper, which fill archives and libraries today. Despite the extensive documentation, present-day understanding of the Spanish colonial enterprise is fragmentary. Once the initial barrier of archival access has been overcome, scholars and other publics then must decipher archaic penmanship, obscure writing conventions, and unfamiliar Indigenous imagery. This project seeks to lower these barriers by introducing artificial intelligence (AI) technologies into representative Indigenous and Spanish colonial archives in Mexico and the U.S., and training them to convert the "unreadable" archive into worldwide accessible data. The project has the potential to revolutionize how cultural institutions provide access to their colonial collections and how humanities researchers can undertake cutting-edge digital scholarship.

In a highly interdisciplinary collaboration between archaeologists, historians, web scientists, designers, and computer scientists, the "Unlocking the Colonial Archive" project will create a step-change in the way a broad spectrum of researchers and the public engage with and use countless early modern Indigenous and Spanish collections dispersed throughout the world. Using machine learning and the exceptional collections of the LLILAS Benson library (US) and the General Archive of the Nation (Mexico), the project will tackle three challenges in interconnected research areas to: (a) accomplish the automated transcription of 16th- and 17th-century historical colonial documents that combine Spanish with Indigenous languages such as Nahuatl, Mixtec, Huastec, and Otomi, among others; (b) develop methods to carry out text mining in large historical collections; and (c) develop techniques to facilitate the automated identification of iconographic and other pictorial features in Indigenous maps and printed books.

The development of such approaches will not only facilitate the searching, retrieval, and reading of these materials, but will also transform the accessibility and analysis of large textual and image collections. With a strong commitment to a decolonial approach, both in terms of archival practices and in the critical use of technologies, the project will create freely available, enhanced open digital collections. As such, "Unlocking the Colonial Archive" will work in close partnership with Mexican, UK, US, Portuguese, and Spanish researchers and institutions, training scholars and interested members of the public on transferable skills and digital methods, and it will produce innovative, reproducible workflows that Latin American scholars and cultural institutions around the world can adopt and implement.

 
Description This past year, we continued to focus on phase 1: the creation of ground truth transcriptions and handwritten text recognition (HTR) models. In year 1, we created transcriptions and models for two 16th-century font types represented in the Primeros Libros de las Americas (PLA) collection, gotica and lectura redonda. We ran the models on 43 untranscribed titles in the Primeros Libros collection to produce raw transcriptions. This past year, we developed models for the lectura bastarda and atanasia redonda types in the PLA holdings. We also created a robust corpus of ground truth transcriptions derived from 103 imprints dating from 1600 to 1630 and used it to create a model for 17th-century books that has been able to transcribe Spanish, Latin, and Nahuatl printed text accurately. We are still correcting the raw transcriptions and depositing them in our data repository so that they can be linked to the Primeros Libros project for subsequent reuse.
We have been working on the development of transcriptions and models for the various Spanish handwriting styles. In year 1, we created three HTR models for the most common style, italica cursiva, one model for each century. We also created a model for procesal. This past year, we expanded the training corpus for these models to reduce the error rate. Because procesal encadenada is the most difficult handwriting style, we are still creating ground truth transcriptions to train an HTR model for it. To accomplish this, we decided to use some of the allocated consultant fees that have not been paid out to hire paleographers experienced in these handwriting styles to generate more ground truth transcriptions. The UK team is also working with this calligraphy using the Amoxcalli Collection, which holds materials from the National Library of France. Unfortunately, we have not encountered sufficient collection materials in LLILAS Benson's digital collections to develop HTR models for two of the identified handwriting styles: humanistica redonda/cursiva and cortesana. We are still identifying representative samples from digitized collection materials from Mexico's Archivo General de la Nación (General Archive of the Nation, Mexico City) and plan to look at other digital collections to address the issue.
To expedite the creation of ground truth transcriptions, we conceived and led what we called an "NEH-AHRC Spanish Paleography & Digital Humanities Institute" November through December 2021 and January through March 2022. Considering its success and the subsequent demand from scholars for more programming, we decided to offer another round of institutes that took place August through October in 2022 and January through March in 2023. For year 2's institutes, we decided to reduce the size of the accepted cohort from 30 to 20 participants to make the workload more manageable. Given the continued positive reception, we have decided to make this a recurring program for the LLILAS Benson Digital Scholarship Office.
In year 1, participants transcribed a total of 136 documents (approximately 1140 pages) as part of their weekly paleography homework assignments. This past year, participants worked on 89 documents, transcribing approximately 500 pages. We have been correcting these digital texts and continue to publish them in the Benson Latin American Collection's data repository (https://dataverse.tdl.org/dataverse/blac). Thus far, we have published 64 document transcriptions in the repository. We have also been working with LLILAS Benson Digital Initiatives staff to ingest the scanned materials, together with the collected metadata and transcriptions, into the University of Texas Libraries' Collection portal (https://collections.lib.utexas.edu/) for broader access.
Thanks to two University of Texas doctoral students, we have also made some headway in the creation of HTR models for Indigenous language materials. A doctoral student in the Linguistics department, James Tandy, has been creating ground truth transcriptions of Poquomchi' language (a Mayan language) materials preserved at Brigham Young University and Princeton University since year 1. This past year, we created a relatively low-error-rate model based on those transcriptions. In year 1, we awarded a LLILAS Benson Digital Scholarship Fellowship to a doctoral student in the Spanish and Portuguese Department, Eduardo Gorobets, to develop an HTR model for Nahuatl language documents in the Fondo Real de Cholula to facilitate his research. Given his excellent fellowship work, we subsequently hired him as the grant project graduate research assistant (GRA) to continue this work.
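The error rates mentioned above are not defined in this report. A common way to express HTR accuracy is the character error rate (CER): the edit distance between a ground truth transcription and the model output, divided by the length of the ground truth. The following is a minimal Python sketch under that assumption; the function names are illustrative and are not taken from the project or from Transkribus.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the ground truth.
    (Assumed metric; the report does not state which one was used.)"""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, hypothesis) / len(ground_truth)
```

For instance, comparing a four-character ground truth with an output that gets one character wrong gives a CER of 0.25. In practice the measure would be averaged over many transcribed lines or pages.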

Regarding phases 2 and 3, the UK team has also been advancing the proposed work. For phase 2, the team has trained an annotation model with Natural Language Processing (NLP) techniques that allows it to automatically identify and annotate words and concepts related to 18 entities, and 49 labels associated with these categories, in documents standardized in modern Spanish and several Indigenous languages. The entities include: person, date, location, institution, ethnic group, social group, language, activity, architecture, cultural artifact, mobility, health, animal, plant, natural resource, measurement, and climate.
Delays in advancing this phase's work continue in year 2, as it depends mostly on the development in phase 1 of ground truth transcriptions for corpora with a strong thematic focus. The Benson's digital collections are significant and cover a broad range of topics; however, with the exception of the Relaciones Geográficas (geographic accounts) collection, there are very few substantial clusters of strongly related manuscripts that can benefit from phase 2 work. This past year, we focused on the transcription of some of these clusters, including Inquisition proceedings, missionary correspondence in Northern New Spain, front matter in the PLA collection, and ledgers of vows of profession to various religious orders. We are developing NLP models to extract key concepts and proper names from these materials. We are currently working with the PLA collection and the Libro de Profesiones and have achieved results with a simple Named Entity Recognition model. There is still more work to be done for other categories.
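To illustrate what entity annotation of a transcription produces, the sketch below tags terms from a small lookup list and records each match as a labelled character span. This is a deliberately simple dictionary-based stand-in, not the trained NLP model described above; the `GAZETTEER` contents, the `Annotation` type, and the `annotate` function are hypothetical, and only the label names ("location", "language") are borrowed from the entity list in this report.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the end of the span
    label: str   # entity category, e.g. "location"
    text: str    # the matched surface form


# Hypothetical gazetteer for illustration only.
GAZETTEER = {
    "location": ["Cholula", "Mexico"],
    "language": ["Nahuatl", "Otomi"],
}


def annotate(text: str, gazetteer: dict = GAZETTEER) -> list:
    """Return an Annotation for every whole-word gazetteer match in `text`,
    ordered by position."""
    annotations = []
    for label, terms in gazetteer.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text):
                annotations.append(
                    Annotation(match.start(), match.end(), label, match.group())
                )
    return sorted(annotations, key=lambda a: a.start)
```

Calling `annotate("Un documento en Nahuatl de Cholula")` yields two spans, one labelled "language" and one labelled "location". A trained model replaces the lookup list with learned patterns, but the output has the same shape: offsets plus a label.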

We are also continuing to organize the work for the online Open Linked Data platform that will hold the information created in the project. This has already resulted in the article by Candela et al., "An ontological approach for unlocking the Colonial Archive", currently under review and to be published in the Journal of Computing and Cultural Heritage. We have also carried out a series of machine learning experiments, annotating 164 colonial maps and creating an automated annotation model with varied results. This is only preliminary work, as the consultant for this part of the project was hired recently. We are starting to look at a wider sample of maps, including a collection from the AGN and AGI where, as a first experiment, we will implement a computer vision pipeline for the semantic exploration of the collection. In the next stage, we will work in consultation with several specialists to annotate the wider sample of these documents.
Exploitation Route The Machine Learning models and the documents transcribed will be used by generations of scholars to come.
We think this will have an important impact on the Spanish-speaking world, as well as on the research of several Indigenous languages of Latin America.
Sectors Digital/Communication/Information Technologies (including Software)

Culture

Heritage

Museums and Collections

 
Description Through the project work, we have developed the skill set of numerous scholars. In the last two years, we have trained or are currently training 103 scholars (55 graduate students, 20 junior faculty, 12 tenured professors, 9 archive/library professionals, and 7 independent researchers) from 15 countries and 22 U.S. states in Spanish palaeography and digital humanities tools using LLILAS Benson's digital collections. We plan to increase this worldwide and national impact by leading these institutes annually. Historians and other humanities scholars interested in colonial Latin America will directly benefit, and according to the feedback we have received from institute participants have already benefited, from the transcription and annotation of LLILAS Benson's digital Spanish colonial collections. Besides being able to query across the transcribed documents we have deposited in the LLILAS Benson data repository so far, researchers, students, and other interested audiences will be able to search, find, and retrieve these documents more easily thanks to the robust metadata we are creating for each transcribed document. Phase 2 work on developing automatic annotation models will greatly expedite this metadata creation. Additionally, scholars who have contributed transcriptions will have a published credit for their intellectual work in the data repository record. The resulting HTR and annotation models will also greatly facilitate future colonial Latin Americanist research. As a benefit, institute participants will be able to apply the HTR models we are creating to their own corpus (regardless of the manuscripts' physical repository) to transcribe it automatically. To further the impact, we will be depositing these automatic transcriptions in the LLILAS Benson repository for reuse by other scholars and audiences.
These Spanish colonial HTR models, which are among the first of their kind, will be made public in the Transkribus platform at the end of the grant for others to use on their own materials. We also plan to lead tailored workshops on Transkribus and Tagtog for collection managers so that they can apply these machine learning technologies to the processing of their Spanish colonial collections. Besides scholars and information professionals who work directly with Spanish colonial materials, computer and data scientists will also benefit from the creation of substantial training datasets of early-modern textual and visual material and of models that can serve for further experimentation in a diversity of AI applications. Our future development of one of the first linked open data repositories focused on the study of colonial Latin America, and the resulting interconnection of the collection transcriptions, will also enable cultural institutions to connect and recontextualize dispersed Spanish colonial sources.
First Year Of Impact 2021
Sector Digital/Communication/Information Technologies (including Software), Culture, Heritage, Museums and Collections
Impact Types Cultural

Societal

Economic

 
Description Catalyst Fund
Amount £4,000 (GBP)
Organisation Lancaster University 
Sector Academic/University
Country United Kingdom
Start 04/2022 
End 11/2022
 
Description History Departmental Fund
Amount £1,000 (GBP)
Organisation Lancaster University 
Sector Academic/University
Country United Kingdom
Start 06/2021 
End 09/2021
 
Description Implementing Artificial Intelligence to unlock the Library of Congress Spanish American historical collections (1500-1699)
Amount £120,585 (GBP)
Funding ID AH/X008851/1 
Organisation Arts & Humanities Research Council (AHRC) 
Sector Public
Country United Kingdom
Start 06/2023 
End 05/2024
 
Description Mesoamerican Apocalypse: A large scale analysis of the Indigenous perspective on the sixteenth-century epidemics of Colonial Mexico
Amount € 78,080 (EUR)
Organisation Heidelberg University 
Sector Academic/University
Country Germany
Start 07/2022 
End 07/2023
 
Description The New Spain Fleets: Delving into three centuries of socioeconomic colonial history through Artificial Intelligence
Amount £1,068,748 (GBP)
Funding ID ES/X013774/1 
Organisation Economic and Social Research Council 
Sector Public
Country United Kingdom
Start 03/2024 
End 03/2029
 
Title Benson Latin American Collection 
Description To expedite the creation of ground truth transcriptions, we conceived and led what we called an "NEH-AHRC Spanish Paleography & Digital Humanities Institute" November through December 2021 and January through March 2022. Considering its success and the subsequent demand from scholars for more programming, we decided to offer another round of institutes that took place August through October in 2022 and January through March in 2023. For year 2's institutes, we decided to reduce the size of the accepted cohort from 30 to 20 participants to make the workload more manageable. Given the continued positive reception, we have decided to make this a recurring program for the LLILAS Benson Digital Scholarship Office. In 2021, participants transcribed a total of 136 documents (approximately 1140 pages) as part of their weekly palaeography homework assignments. In 2022, participants worked on 89 documents, transcribing approximately 500 pages. We have been correcting these digital texts and continue to publish them in the Benson Latin American Collection's data repository (https://dataverse.tdl.org/dataverse/blac). Thus far, we have published 64 document transcriptions in the repository. We have also been working with LLILAS Benson Digital Initiatives staff to ingest the scanned materials, together with the collected metadata and transcriptions, into the University of Texas Libraries' Collection portal (https://collections.lib.utexas.edu/) for broader access. Please note that each document has an individual DOI. Only the address of the collection and the DOI of the first document are provided here, as this form would otherwise require almost 200 records to be entered manually.
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact The transcriptions of thousands of pages of historical documents are being made available for the first time in electronic format. This is broadening the materials accessible to researchers across a wide range of historical subjects.
URL https://dataverse.tdl.org/dataverse/blac
 
Title UCA 16th century Spanish Procesal HTR Model 
Description This is a Machine Learning model for Handwritten Text Recognition of documents written in the sixteenth-century calligraphy called "Procesal" found in Spanish American legal documents. 
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? No  
Impact In the future, these HTR models will unlock the contents of historical documents held in archives that usually take years to transcribe. We expect these models to change the scale at which research is carried out, as well as to expedite it.
 
Title UCA Nahuatl HTR Model 
Description This Handwritten Text Recognition model is based on the transcription of an extensive collection of Nahuatl documents from the Fondo Real de Cholula. 
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? No  
Impact When finished, it will enable the automated transcription of thousands of pages of historical documents in Classical Nahuatl. This will include archival holdings from Mexico, Guatemala, Spain, and Nicaragua, among others.
 
Title UCA Poquomchi HTR Model 
Description We are creating ground truth transcriptions of Poquomchi' language (a Mayan language) materials preserved at Brigham Young University and Princeton University. We expect to have a model to transcribe sixteenth- and seventeenth-century calligraphies for this language.
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? No  
Impact In the future, this model will enable automated palaeography for this historical Indigenous language.
 
Description US Library of Congress 
Organisation Library of Congress
Country United States 
Sector Public 
PI Contribution We have partnered to work on Handwritten Text Recognition of their collections with the models we are creating for the project. Our work in the Unlocking the Colonial Archive project has resulted in further funding.
Collaborator Contribution They are providing access to their collections and staff at the library.
Impact This is work in progress. We are planning to develop and release the models for their collections in 2024.
Start Year 2023
 
Title Humanities AI-Digital Lab 
Description We are in the process of developing an online Humanities AI-Lab, which is currently in its first version. The idea is to have an online laboratory in which to work with, annotate, and analyse historical documents and to train machine learning models on them. At the moment, it is composed of two sections. The first is an annotation tool that enables the user to define an annotation model with an ontology and annotate documents with the entities defined. The tool also facilitates the training of multiple NLP models with the data used, which enables the automated annotation of further documents. The purpose is to facilitate the extraction of information at large scale from historical collections. The second section consists of a computer vision tool that implements a method we are calling Visual Natural Language Processing. This tool enables the identification of elements in pictorial documents, including Mexican codices and maps.
Type Of Technology Webtool/Application 
Year Produced 2024 
Open Source License? Yes  
Impact The impacts are yet to be realised, but the first iteration of the software is allowing us to annotate thousands of pages with historical information.
URL https://anotador-codices.streamlit.app/
 
Description NEH-AHRC Spanish Paleography & Digital Humanities 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact As a way to generate more ground truth transcriptions for HTR models, expose LLILAS Benson's digitized Spanish colonial collections to a broader public, and impart training on the machine learning tools we are using in this project, we organized two "NEH-AHRC Spanish Paleography & Digital Humanities" institutes. These 7-week online training programs took place November-December 2021 and January-March 2022. The institute provided practical training in the reading and visualization of 16th- to 18th-century colonial Spanish manuscripts preserved at LLILAS Benson.

The training developed skills in two areas. First, participants obtained specialized training on several free and open source tools that they can use to extract, visualize, and present data in colonial texts. These included FromThePage, Recogito, Voyant-Tools, ArcGIS, Onodo, and Transkribus. Each Monday, we led a DH workshop on a particular tool using the transcriptions participants had created for homework in previous weeks as datasets or sample texts we had already prepared.

Second, students learned and honed paleography skills for the accurate reading and transcription of these Spanish colonial manuscripts. Every Friday, we would split the class randomly into 6 breakout groups to work on specific pages of document types and handwriting styles we had pre-transcribed. Everyone would take turns reading these materials out loud and transcribing them in a shared class Google Doc. The expectation was that they would support each other in deciphering the text, especially those who had indicated that they had advanced paleography skills in specific handwriting styles. Throughout the session, we would correct their transcription or provide hints in the shared Google Doc when they could not read a particular word. In the fall institute, we invited experts from Germany, Portugal, France, and Mexico to talk about specific document types; we recorded these presentations and shared them with the spring cohort.

To provide students with individual paleography practice, we assigned 2-4 pages of manuscript material for homework each week. Students used FromThePage, a collaborative transcription tool we host on our University of Texas Libraries servers, to complete these transcriptions. We subsequently corrected these transcriptions to give students feedback; FromThePage documents and preserves versioning, which enables students to compare the corrected version with their initial transcription. We then used these transcriptions to train HTR models in Transkribus.

Collectively, we have trained or are currently training 60 scholars (including 35 graduate students, 8 junior faculty, 8 tenured professors, 5 archive/library professionals, and 4 independent researchers) from 11 countries and 18 U.S. states. We sent out an institute assessment survey on the last day to improve the training for the spring. Of the 24 responses we received, all participants expressed that the institute at least met, if not exceeded, their expectations, and found all the resources we created (recordings, document keys, training guides, and presentations) useful.
Year(s) Of Engagement Activity 2021,2022