Implementing Artificial Intelligence to unlock the Library of Congress Spanish American historical collections (1500-1699)

Lead Research Organisation: Lancaster University
Department Name: History

Abstract

From documents dealing with the early exploration of the Americas to numerous sources crucial to understanding the affairs of Hernando Cortés, the history of early colonial Mexico, and the activities of the Inquisition, the Library of Congress (LoC) holds a varied and significant collection of Spanish American colonial documents in its Manuscript Division. While the library has carried out the enormous task of digitising hundreds of folios in its collections and making these available online, the documents in this division still pose a significant challenge. They are written in Early Modern Spanish, in calligraphic hands that only highly specialised scholars can decipher and, therefore, access. Furthermore, without digital transcriptions, it can take scholars and specialists years to query, find, and connect the information in these documents on a large scale or across different collections. These documents contain information vital to understanding how the Spanish Empire governed the Americas, the role and establishment of the church in the viceroyalties, living conditions, acts of contestation, social structures, and knowledge of the indigenous cultures. However, this information remains "locked" and accessible only to a few.

New developments in Artificial Intelligence (AI) now allow us to train computers to carry out the automated transcription of these documents. They also facilitate new ways of identifying, querying, extracting, mapping and analysing information by annotating specific words and knowledge categories. Using linguistic methods, these approaches now enable us to automatically identify place names, personal names, dates, institutions, and other complex concepts of historical interest.

Using the Hans P. Kraus collection in the Manuscript Division of the LoC as the main case study, the proposed fellowship will:
1) Implement Handwritten Text Recognition models to accomplish the automated transcription of the digital material available in the collection from the sixteenth and seventeenth centuries;
2) Create an annotated version of the collection using Natural Language Processing techniques, enabling the identification and analysis of information in a text mining style;
3) Present these approaches, train staff, and establish a pipeline at the LoC for implementing these methods with the other Spanish American collections; and
4) Carry out a feasibility study for creating citizen science programs using AI approaches to historical collections with users and members of the LoC, the LC Labs, and the Hispanic Reading Room.
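
As a rough illustration of the annotation described in objective 2, the sketch below tags place names, personal names, institutions and years in a transcribed sentence. It is a simple rule-based stand-in, not the project's Natural Language Processing pipeline: the gazetteer entries, the label names, the year pattern (chosen to cover 1500-1699) and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical mini-gazetteer (an assumption for this sketch); a real
# workflow would rely on curated authority lists or a trained NLP model.
GAZETTEER = {
    "PLACE": ["Nueva España", "Mexico", "Sevilla"],
    "PERSON": ["Hernando Cortés"],
    "INSTITUTION": ["Real Audiencia"],
}

# Four-digit years from 1500 to 1699, the period covered by the collection.
YEAR_PATTERN = re.compile(r"\b1[56]\d{2}\b")


def annotate(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    annotations = []
    for label, names in GAZETTEER.items():
        for name in names:
            # re.escape guards against names containing regex metacharacters.
            for match in re.finditer(re.escape(name), text):
                annotations.append(
                    (match.start(), match.end(), label, match.group())
                )
    for match in YEAR_PATTERN.finditer(text):
        annotations.append((match.start(), match.end(), "DATE", match.group()))
    return sorted(annotations)


# Invented sample sentence, not taken from the collection.
sample = "Hernando Cortés escribió a la Real Audiencia desde Mexico en 1526."
for start, end, label, matched in annotate(sample):
    print(f"{start:>3}-{end:<3} {label:<12} {matched}")
```

Storing character offsets alongside each label keeps the annotations separate from the transcription itself, so the same text can later be re-annotated with a better model without being altered.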

Description The project aimed to enhance accessibility and analysis of Spanish American colonial documents in the Library of Congress (LoC) through Artificial Intelligence (AI) methods. Specifically, it sought to implement Handwritten Text Recognition (HTR) models for automated transcription, create an annotated corpus using Natural Language Processing (NLP), and explore a citizen science approach for engagement. The focus was on the Hans P. Kraus collection, a significant repository of sixteenth and seventeenth-century manuscripts.

Throughout the course of this project, significant progress was made in advancing the accessibility of the Hans P. Kraus collection through AI-driven methods. The team successfully transcribed forty historical documents into a machine-readable format, a process that not only preserved their content but also facilitated further computational analysis. In addition to transcription, an extensive annotation process was carried out, enabling the identification of key historical entities such as place names, individuals, institutions, and events. This annotated corpus enhances the usability of the collection by allowing researchers to query the texts in more sophisticated ways.
One of the major technical advancements of the project was the refinement of Handwritten Text Recognition models tailored to the specific calligraphic styles present in the collection. To achieve this, the team undertook a thorough classification of the calligraphic variations, identifying major styles such as gótica, itálica cursiva, procesal cortesana, and procesal encadenada. This classification enabled a more precise training process, ultimately improving the accuracy of the AI models used for transcription. Leveraging existing models from the Unlocking the Colonial Archive Project, the team was able to build upon prior research while significantly enhancing the reliability of AI-generated transcriptions.
We worked with 180 documents created in the 16th and 17th centuries. We prioritised documents written in Itálica cursiva calligraphy and, to a lesser extent, in Procesal simple and Redonda, as these were the styles for which we would be developing HTR models. We selected and manually transcribed 105 documents, 94 in Itálica cursiva, 11 in Procesal simple, and 3 in Redonda.

Despite facing unexpected challenges, particularly in securing high-resolution images of the full collection from the Library of Congress, the project adapted by sourcing additional materials from other archives. Although we had been in communication with the Librarian, who knew the details and requirements of the project, the team was still denied access to most of the collection. This strategic shift nevertheless allowed the team to continue refining the models while working around institutional constraints. The discovery that image resolution has a direct impact on the success of HTR models reinforced the importance of high-quality digitisation in archival AI research. Because the images available online for most documents were of low quality, our team had to digitise some documents itself in order to include them in HTR model training. We digitised 33 documents, adding 1749 individual pages/images. Some of these images and transcriptions have already been included in the training sets of the Procesal simple and Redonda HTR models; the rest will be used to refine the Itálica cursiva model.

The project aims to make substantial contributions to scholarly discourse. A research article, "Unlocking colonial records with Artificial Intelligence. Achieving the automated transcription of large-scale 16th and 17th-century Latin American historical collections", has been accepted for publication in Science & Technology of Archaeological Research and is expected to be released in 2025. Additionally, while the annotated corpus and transcriptions have not yet been made publicly available because of the delays encountered at the Library, they are scheduled for release in the summer of 2025, so that the research outputs can benefit a wider community of scholars and practitioners in the near future.
Some of our key findings are:
* Impact of Image Resolution on HTR Models: The quality of digital images significantly affects the performance of AI-based transcription. Lower-resolution images yield suboptimal results, reinforcing the necessity of high-quality digitisation standards.
* Scalability of AI Methods for Historical Documents: The project demonstrated that AI techniques, including HTR and NLP, are scalable to large archival collections and can be adapted for broader applications in digital humanities.
* Institutional Barriers to AI Adoption in Archives: While libraries recognise the potential of AI, logistical and bureaucratic challenges hinder full implementation. This underscores the need for clearer policies on AI-driven research access.
* Potential for Citizen Science in Archival AI: Preliminary assessments indicate that integrating AI with public engagement initiatives (e.g., crowdsourced annotation) could enhance both research outcomes and public historical literacy.
Exploitation Route We intend to hand over all materials and outputs to the LoC, as set out in the original proposal. We believe that, once the LoC makes this material available, hundreds of researchers will make use of it.
Sectors Creative Economy

Digital/Communication/Information Technologies (including Software)

Education

Government

Democracy and Justice

Culture

Heritage

Museums and Collections

 
Description The project has advanced the accessibility of Spanish American colonial manuscripts by bridging AI methodologies with historical research. The transcription and annotation of the Hans P. Kraus collection pave the way for broader historical inquiries into colonial governance, indigenous interactions, and social structures. Moreover, the AI models developed for this project have the potential to be applied to other underutilised historical archives, enabling more inclusive access to colonial records. Despite institutional challenges, the project successfully adapted its methodology, demonstrating resilience and innovation in AI-driven archival research. The research findings will not only inform future LoC initiatives but also contribute to broader conversations on AI, historical archives, and digital humanities. We are also working on the following future directions:
* Further refinement of HTR and NLP models to improve transcription accuracy across diverse archival materials.
* Expansion of AI applications beyond the Hans P. Kraus collection to additional Spanish American holdings at the LoC.
* Development of a citizen science initiative to involve researchers, students, and the public in historical transcription and annotation efforts.
* Advocacy for AI adoption in archival institutions, promoting policies that balance research access with preservation concerns.
First Year Of Impact 2024
Sector Education, Culture, Heritage, Museums and Collections
Impact Types Cultural

Societal

Economic

Policy & public services

 
Description The New Spain Fleets: Delving into three centuries of socioeconomic colonial history through Artificial Intelligence
Amount £1,068,748 (GBP)
Funding ID ES/X013774/1 
Organisation Economic and Social Research Council 
Sector Public
Country United Kingdom
Start 03/2024 
End 03/2029
 
Title Italica Simple Document Collection (16th-17th centuries) 
Description This is a large collection of Spanish American historical documents transcribed by hand by the team. This is serving as the main gold standard dataset for the training of Machine Learning algorithms for HTR. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? No  
Impact This dataset was essential to the development of the LoC Italica Cursiva HTR model. 
 
Title LoC Hans P. Kraus palaeography classification
Description The Hans P. Kraus collection of documents at the Library of Congress was classified by document type and palaeography type.
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? No  
Impact We are using this dataset to record and manage the Artificial Intelligence models the project is creating for the automated transcription of the collection. 
URL https://airtable.com/appOq5tr9zZw9a76o/shrtPfX3ZqWPYEr9D
 
Title LoC Italica Cursiva (16th & 17th centuries) 
Description This is a Handwritten Text Recognition model for the automated transcription of Spanish American historical documents written in Itálica cursiva calligraphy during the 16th and 17th centuries. We have achieved a character error rate (CER) of 4.80%.
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? No  
Impact We will see the full impact by the end of the project, but with this we will be able to automatically transcribe millions of historical documents written during these centuries. 
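
For readers unfamiliar with the metric, the sketch below shows the conventional definition of character error rate: the Levenshtein edit distance between a reference transcription and the model output, divided by the length of the reference. This is an assumption about how the reported figures were obtained, as the record does not state the tool or formula used; the function names and sample strings are illustrative only.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `reference` into `hypothesis`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a reference character
                    current[j - 1] + 1,      # insert a hypothesis character
                    previous[j - 1] + cost,  # substitute, or keep a match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example strings, not taken from the collection.
reference = "en la ciudad de mexico"
hypothesis = "en la cuidad de mexjco"
cer = character_error_rate(reference, hypothesis)
print(f"CER = {cer:.2%}")
```

Under this definition, a CER of 4.80% corresponds to roughly 48 character edits for every 1,000 characters of reference text.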
 
Title LoC Procesal 
Description This is a Handwritten Text Recognition model for the automated transcription of historical documents. It focuses mainly on the calligraphy known as Procesal simple, used in historical Spanish and Spanish American documents. We are still working on it, but we have already achieved a CER of 14.6%.
Type Of Material Computer model/algorithm 
Year Produced 2024 
Provided To Others? No  
Impact This is yet to be realised. 
 
Description US Library of Congress 
Organisation Library of Congress
Country United States 
Sector Public 
PI Contribution We have partnered with the Library to apply Handwritten Text Recognition to its collections, using the models we are creating for the project. Our work in the Unlocking the Colonial Archive project has resulted in further funding.
Collaborator Contribution They are providing access to their collections and staff at the library.
Impact This is work in progress. We are planning to develop and release the models for their collections in 2024.
Start Year 2023
 
Description Exhibition "Imaginar el Fin de los Tiempos" at the National Museum of Anthropology and History, INAH-Mexico 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact We were part of a panel specially organised for a national exhibition at the National Museum of Anthropology and History of Mexico.
Year(s) Of Engagement Activity 2023