Unlocking the Colonial Archive: Harnessing Artificial Intelligence for Indigenous and Spanish American Historical Collections
Lead Research Organisation:
Lancaster University
Department Name: History
Abstract
The Spanish empire controlled the vast majority of the western hemisphere's lands and peoples for more than three centuries. Its vast administration in the Americas depended on the work of royal notaries, Indigenous artists, and printers. They produced prodigious amounts of documents, written or printed on paper, which fill archives and libraries today. Despite the extensive documentation, present-day understanding of the Spanish colonial enterprise is fragmentary. Once the initial barrier of archival access has been overcome, scholars and other publics then must decipher archaic penmanship, obscure writing conventions, and unfamiliar Indigenous imagery. This project seeks to lower these barriers by introducing artificial intelligence (AI) technologies into representative Indigenous and Spanish colonial archives in Mexico and the U.S., and training them to convert the "unreadable" archive into worldwide accessible data. The project has the potential to revolutionize how cultural institutions provide access to their colonial collections and how humanities researchers can undertake cutting-edge digital scholarship.
In a highly interdisciplinary collaboration between archaeologists, historians, web scientists, designers, and computer scientists, the "Unlocking the Colonial Archive" project will create a step-change in the way a broad spectrum of researchers and the public engage with and use countless early modern Indigenous and Spanish collections dispersed throughout the world. Using machine learning and the exceptional collections of the LLILAS Benson library (US) and the General Archive of the Nation (Mexico), the project will tackle three challenges in interconnected research areas to: (a) accomplish the automated transcription of 16th- and 17th-century historical colonial documents that combine Spanish with Indigenous languages such as Nahuatl, Mixtec, Huastec, and Otomi, among others; (b) develop methods to carry out text mining in large historical collections; and (c) develop techniques to facilitate the automated identification of iconographic and other pictorial features in Indigenous maps and printed books.
The development of such approaches will not only facilitate the searching, retrieval, and reading of these materials, but will also transform the accessibility and analysis of large textual and image collections. With a strong commitment to a decolonial approach, both in terms of archival practices and in the critical use of technologies, the project will create freely available, enhanced open digital collections. As such, "Unlocking the Colonial Archive" will work in close partnership with Mexican, UK, US, Portuguese, and Spanish researchers and institutions, training scholars and interested members of the public on transferable skills and digital methods, and it will produce innovative, reproducible workflows that Latin American scholars and cultural institutions around the world can adopt and implement.
Organisations
- Lancaster University (Lead Research Organisation)
- Library of Congress (Collaboration)
- INAH (Project Partner)
- University of Lisbon (Project Partner)
- Zurich University of the Arts (Project Partner)
- University of Alicante (Project Partner)
- Nat Inst of Anthropology & Hist (INAH) (Project Partner)
- National Autonomous University of Mexico (Project Partner)
- Potosino Inst of Sci & Tech Research (Project Partner)
- Lucentia Lab (Project Partner)
- TagTog (Project Partner)
- University of Innsbruck (Project Partner)
- Natnl Sch of Anthropol & History (ENAH) (Project Partner)
- Fordham University (Project Partner)
Publications
Candela G
(2023)
An Ontological Approach for Unlocking the Colonial Archive
in Journal on Computing and Cultural Heritage
Murrieta-Flores Patricia
(2023)
El futuro del pasado: Algunas reflexiones sobre el desarrollo de Inteligencia Artificial en la Historia Novohispana, la Arqueología Histórica, y la descolonización tecnológica.
in Ichan-Tecolotl
Moreira-Muñoz A
(2023)
GeoHumanidades. Arte y Naturaleza del Antropoceno
| Description | At the conclusion of the three-year grant project, we were able to produce several effective HTR models. For the handwritten Spanish-language corpus, we produced a model based on 145 pages containing 33466 words with a character error rate (CER) of 11.82% on the validation set (the manually transcribed pages used to test the HTR model); it works relatively well on 16th-century procesal calligraphic styles (e.g. procesal, procesal-cortesana, and procesal-encadenada). Despite this promising result, we needed to pursue this line of research further, because subsequent experiments established that we could achieve much better models by creating specific training sets for each calligraphic type. We also created a robust model that is highly effective on the 17th- and 18th-century italica cursiva calligraphic style, based on 441 pages containing 89315 words with a CER of 5.30%. We were also able to produce two HTR models based on the handwritten Indigenous-language corpus we compiled. We developed one based on 257 pages containing 44127 words with a CER of 11.40% for 16th- through 18th-century Nahuatl text. We were also able to train a base model on 32 pages containing 8056 words with a CER of 17.70% for 17th- through 18th-century Poqomchi text. We continue to create ground truth transcriptions for both languages to lower the CER of these models. Lastly, we created two precise models for 16th- and 17th-century American print types. Using 76 pages containing 19562 words, we created one with a CER of 2.30% for 16th-century Gothic typeface. We also created a general model with a CER of 1.60%, based on 212 pages containing 36349 words, that can accurately transcribe 16th- and 17th-century Italic and Roman typefaces in the atanasia redonda/bastarda and lectura redonda/bastarda typefaces. Both models have been able to accurately transcribe Spanish, Latin, and Nahuatl printed text in the PLA corpus.
An unexpected grant result was the development and cultivation of a global community of colonial Latin Americanists. When we sent out the call for applications for the NEH-AHRC Spanish Paleography + Digital Humanities Institute, academics and cultural repository professionals from all over the world responded. We were initially planning to lead only one institute, but the overwhelming response prompted us to offer four institutes in total, which took place November to December 2021, January to March 2022, August to October 2022, and January to March 2023. In all, the institute trained 103 scholars, including 55 graduate students, 20 junior faculty, 12 tenured professors, 9 archive/library professionals, and 7 independent researchers. These specialists were from academic institutions in Washington, D.C., 20 U.S. states (Alabama, Arkansas, California, DC, Delaware, Florida, Illinois, Indiana, Kansas, Louisiana, Massachusetts, New Mexico, New York, North Carolina, Oregon, Pennsylvania, Rhode Island, Tennessee, Texas, and Virginia), and 15 countries (Argentina, Canada, Chile, Colombia, Costa Rica, France, Germany, Italy, Japan, Mexico, Peru, Spain, Sweden, Switzerland, and the United Kingdom). On the last day of each institute, we sent out an assessment survey to solicit feedback for improving the training. Ninety-five percent of the respondents indicated that the institute at least met, if not exceeded, their expectations, and found all the activities and resources we designed (recordings, document keys, training guides, and presentations) useful. Given the overwhelmingly positive reception, we have decided to make this a recurring LLILAS Benson Digital Scholarship public program. Throughout the project, we also published the ground truth transcriptions we were generating from the Benson manuscripts.
Besides providing enhanced access to them through our data repository, we wanted to provide SP + DH Institute participants with a published, citable version of their work. In all, the institute cohorts transcribed a total of 225 documents (approximately 1,640 pages) as part of their weekly palaeography homework assignments. To date, we have generated meticulous item-level descriptions for 124 documents (nearly 1,500 pages) in the Benson Latin American Collection. We will continue this arduous work in the coming years. |
| Exploitation Route | We decided to continue pursuing this line of research, as there are still aspects that can be refined and we have already achieved excellent results, as explained in the ESRC project report for The New Spain Fleets. At the moment, we have granted some stakeholders access to the models for evaluation. We are now closer to releasing the latest versions of the models (end of 2025). When released, this outcome will impact all major Spanish-speaking GLAMs that may hold documents written between the 15th and the 20th century. |
| Sectors | Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections |
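The character error rate (CER) figures reported in the key findings above are given without a definition. A minimal sketch follows, assuming the common definition of CER as the character-level edit (Levenshtein) distance between a model's output and the ground truth transcription, divided by the length of the ground truth. The function names and the sample strings are illustrative only; this is not the project's or the Transkribus platform's own evaluation code.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a 9-character word.
    ground_truth = "escribano"
    model_output = "escrivano"
    print(f"CER: {character_error_rate(ground_truth, model_output):.2%}")
```

Under this definition, a CER of 5.30% would correspond to roughly five character edits per hundred characters of ground truth.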
| Description | Through the project work, we have developed the skill set of numerous scholars. In the last two years, we have trained or are currently training 103 scholars (55 graduate students, 20 junior faculty, 12 tenured professors, 9 archive/library professionals, and 7 independent researchers) from 15 countries and 22 U.S. states in Spanish palaeography and digital humanities tools using LLILAS Benson's digital collections. We plan to increase this national and worldwide impact by leading these institutes annually. Historians and other humanities scholars interested in colonial Latin America will directly benefit (and, given the feedback we have received from the institute participants, have already benefited) from the transcription and annotation of LLILAS Benson's digital Spanish colonial collections. Besides being able to query across the transcribed documents we have deposited in the LLILAS Benson data repository so far, researchers, students, and other interested audiences will be able to search, find, and retrieve these documents more easily, given the robust metadata we are creating for each transcribed document. Phase 2 work on developing automatic annotation models will greatly expedite this metadata creation. Additionally, scholars who have contributed transcriptions will have a published credit for their intellectual work in the data repository record. The resulting HTR and annotation models will also greatly facilitate future colonial Latin Americanist research. As a benefit, institute participants will be able to apply the HTR models we are creating to their own corpora (regardless of the manuscripts' physical repository) to transcribe them automatically. To further the impact, we will be depositing these automatic transcriptions in the LLILAS Benson repository for reuse by other scholars and audiences.
These Spanish colonial HTR models, which are among the first of their kind, will be made public on the Transkribus platform at the end of the grant for others to use on their own materials. Although we first planned to keep using the Tagtog platform for our work with Natural Language Processing, we have decided to move forward with the creation of humanities-NLP annotation software. While this was outside the scope of this grant, we have secured further funding from the ESRC to carry out this new goal. Besides scholars and information professionals who work directly with Spanish colonial materials, computer and data scientists are also benefitting from the creation of substantial training datasets of early-modern textual and visual material and of models that can serve for further experimentation in a diversity of AI applications. Our future development of one of the first linked open data repositories focused on the study of colonial Latin America, and the resulting interconnection of the collection transcriptions, will also enable cultural institutions to connect and recontextualize dispersed Spanish colonial sources. This is explained in the article by Candela et al., 2023. |
| First Year Of Impact | 2021 |
| Sector | Digital/Communication/Information Technologies (including Software); Culture, Heritage, Museums and Collections |
| Impact Types | Cultural, Societal, Economic |
| Description | Catalyst Fund |
| Amount | £4,000 (GBP) |
| Organisation | Lancaster University |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 04/2022 |
| End | 11/2022 |
| Description | History Departmental Fund |
| Amount | £1,000 (GBP) |
| Organisation | Lancaster University |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 06/2021 |
| End | 09/2021 |
| Description | Implementing Artificial Intelligence to unlock the Library of Congress Spanish American historical collections (1500-1699) |
| Amount | £120,585 (GBP) |
| Funding ID | AH/X008851/1 |
| Organisation | Arts & Humanities Research Council (AHRC) |
| Sector | Public |
| Country | United Kingdom |
| Start | 06/2023 |
| End | 05/2024 |
| Description | Mesoamerican Apocalypse: A large scale analysis of the Indigenous perspective on the sixteenth-century epidemics of Colonial Mexico |
| Amount | € 78,080 (EUR) |
| Organisation | Heidelberg University |
| Sector | Academic/University |
| Country | Germany |
| Start | 07/2022 |
| End | 07/2023 |
| Description | The New Spain Fleets: Delving into three centuries of socioeconomic colonial history through Artificial Intelligence |
| Amount | £1,068,748 (GBP) |
| Funding ID | ES/X013774/1 |
| Organisation | Economic and Social Research Council |
| Sector | Public |
| Country | United Kingdom |
| Start | 03/2024 |
| End | 03/2029 |
| Title | Benson Latin American Collection |
| Description | To expedite the creation of ground truth transcriptions, we conceived and led what we called an "NEH-AHRC Spanish Paleography & Digital Humanities Institute" November through December 2021 and January through March 2022. Considering its success and the subsequent demand from scholars for more programming, we decided to offer another round of institutes that took place August through October 2022 and January through March 2023. For year 2's institutes, we decided to reduce the size of the accepted cohort from 30 to 20 participants to make the workload more manageable. Given the continued positive reception, we have decided to make this a recurring program for the LLILAS Benson Digital Scholarship Office. In 2021, participants transcribed a total of 136 documents (approximately 1140 pages) as part of their weekly palaeography homework assignments. In 2022, participants worked on 89 documents, transcribing approximately 500 pages. We continue to correct and publish these digital texts in the Benson Latin American Collection's data repository (https://dataverse.tdl.org/dataverse/blac). Thus far, we have published 64 document transcriptions in the repository. We have also been working with LLILAS Benson Digital Initiatives staff to ingest the scanned materials, with the collected metadata and transcriptions, into the University of Texas Libraries' Collection portal (https://collections.lib.utexas.edu/) for broader access. Please note that each document has an individual DOI. I am providing here the address of the collection and only the DOI of the first document, as this form would otherwise require me to input almost 200 records manually. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | The transcriptions of thousands of pages of historical documents are being made available for the first time in electronic format. This is broadening the materials accessible to researchers across a wide range of historical subjects. |
| URL | https://dataverse.tdl.org/dataverse/blac |
| Title | UCA 16th century Spanish Procesal HTR Model |
| Description | This is a Machine Learning model for Handwritten Text Recognition of documents written in the sixteenth-century calligraphy called "Procesal" found in Spanish American legal documents. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2023 |
| Provided To Others? | No |
| Impact | In the future, these HTR models will unlock the contents of historical documents held in archives that usually take years to transcribe. We expect these models will change the scale at which research is carried out, as well as expedite it. |
| Title | UCA Nahuatl HTR Model |
| Description | This Handwritten Text Recognition model is based on the transcription of an extensive collection of Nahuatl documents from the Fondo Real de Cholula. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2023 |
| Provided To Others? | No |
| Impact | When finished, it will enable the automated transcription of thousands of pages of historical documents in Classical Nahuatl. This will include archival holdings from Mexico, Guatemala, Spain, and Nicaragua, among others. |
| Title | UCA Poquomchi HTR Model |
| Description | We are creating ground truth transcriptions of Poquomchi' language (a Mayan language) materials preserved at Brigham Young University and Princeton University. We expect to have a model to transcribe sixteenth- and seventeenth-century calligraphies for this language. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2023 |
| Provided To Others? | No |
| Impact | In the future, this model will support the automated palaeography of this historical Indigenous language. |
| Description | US Library of Congress |
| Organisation | Library of Congress |
| Country | United States |
| Sector | Public |
| PI Contribution | We have partnered to work on Handwritten Text Recognition of their collections with the models we are creating for the project. Our work in the Unlocking the Colonial Archive project has resulted in further funding. |
| Collaborator Contribution | They are providing access to their collections and staff at the library. |
| Impact | This is work in progress. We are planning to develop and release the models for their collections in 2024. |
| Start Year | 2023 |
| Title | Humanities AI-Digital Lab |
| Description | We are in the process of developing an online Humanities AI-Lab, which is currently in its first version. The idea behind this is to have an online laboratory in which to work with, annotate, and analyse historical documents, and to train machine learning models on them. At the moment, it is composed of two sections. The first is an annotation tool that enables the user to define an annotation model with an ontology and annotate documents with the entities defined. The tool also facilitates the training of multiple NLP models with the data used, which enables the automated annotation of further documents. The purpose is to facilitate the extraction of information at large scale from historical collections. The second section consists of a computer vision tool to implement a method we are calling Visual Natural Language Processing. This tool enables the identification of elements in pictorial documents, including Mexican codices and maps. |
| Type Of Technology | Webtool/Application |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | These are yet to be realised, but the first iteration of the software is allowing us to annotate thousands of pages with historical information. |
| URL | https://anotador-codices.streamlit.app/ |
| Description | NEH-AHRC Spanish Paleography & Digital Humanities |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | As a way to generate more ground truth transcriptions for HTR models, expose LLILAS Benson's digitized Spanish colonial collections to a broader public, and impart training on the machine learning tools we are using in this project, we organized two "NEH-AHRC Spanish Paleography & Digital Humanities" institutes. These 7-week online training programs took place November-December 2021 and January-March 2022. The institute provided practical training in the reading and visualization of 16th- to 18th-century colonial Spanish manuscripts preserved at LLILAS Benson. The training developed skills in two areas. First, participants obtained specialized training on several free and open source tools that they can use to extract, visualize, and present data in colonial texts. These included FromThePage, Recogito, Voyant-Tools, ArcGIS, Onodo, and Transkribus. Each Monday, we led a DH workshop on a particular tool using the transcriptions participants had created for homework in previous weeks as datasets or sample texts we had already prepared. Second, students learned and honed paleography skills for the accurate reading and transcription of these Spanish colonial manuscripts. Every Friday, we would split the class randomly into 6 breakout groups to work on specific pages of document types and handwriting styles we had pre-transcribed. Everyone would take turns reading these materials out loud and transcribe them in a shared class Google Doc. The expectation was that they would support each other in deciphering the text, especially those who had indicated that they have advanced paleography skills in specific handwriting styles. Throughout the session, we would correct their transcription or provide hints when they could not read a particular word in the shared Google Doc. 
In the fall institute, we invited experts from Germany, Portugal, France, and Mexico to talk about specific document types; we recorded these presentations and shared them with the spring cohort. To provide participants with individual paleography practice, we assigned 2-4 pages of manuscript material for homework each week. Students used FromThePage, a collaborative transcription tool we host on our University of Texas Libraries servers, to complete these transcriptions. We subsequently corrected these transcriptions to give the students feedback; FromThePage documents and preserves versioning, which enables students to compare the corrected version with their initial transcription. We then used these transcriptions to train HTR models in Transkribus. Collectively, we have trained or are currently training 60 scholars (including 35 graduate students, 8 junior faculty, 8 tenured professors, 5 archive/library professionals, and 4 independent researchers) from 11 countries and 18 U.S. states. We sent out an institute assessment survey on the last day to improve the training for the spring. In the 24 responses we received, all participants expressed that the institute at least met, if not exceeded, their expectations, and found all the resources we created (recordings, document keys, training guides, and presentations) useful. |
| Year(s) Of Engagement Activity | 2021, 2022 |
