Implementing Artificial Intelligence to unlock the Library of Congress Spanish American historical collections (1500-1699)
Lead Research Organisation:
Lancaster University
Department Name: History
Abstract
From documents dealing with the early exploration of the Americas to numerous sources crucial to understanding the affairs of Hernando Cortés, the history of early colonial Mexico, and the activities of the Inquisition, the Library of Congress (LoC) holds a varied and significant collection of Spanish American colonial documents in its Manuscript Division. While the library has carried out the enormous task of digitising hundreds of folios in its collections and making these available online, the documents in this division still pose a significant challenge. They are written in Early Modern Spanish, in calligraphies that only highly specialised scholars can decipher and therefore access. Furthermore, without digital transcriptions, it can take scholars and specialists years to query, find, and connect the information in these documents on a large scale or across different collections. These documents contain vital information for understanding how the Spanish Empire governed the Americas, the role and establishment of the church in the viceroyalties, living conditions, acts of contestation, social structures, and knowledge of the indigenous cultures. However, this information remains "locked" and accessible only to a few.
New developments in Artificial Intelligence (AI) now allow us to train computers to carry out the automated transcriptions of these documents. They also facilitate new ways of identifying, querying, extracting, mapping and analysing information through annotating specific words and knowledge categories. Using linguistic means, these approaches now enable us to automatically identify place names, people names, dates, institutions, and other complex concepts of historical interest.
Using the Hans P. Kraus collection in the Manuscript Division of the LoC as the main case study, the proposed fellowship will:
1) Implement Handwritten Text Recognition models to accomplish the automated transcription of the digital material available in the collection from the sixteenth and seventeenth centuries;
2) Create an annotated version of the collection using Natural Language Processing techniques, enabling the identification and analysis of information in a text mining style;
3) Present these approaches, train staff, and establish a pipeline at the LoC for implementing these methods with the other Spanish American collections; and
4) Carry out a feasibility study for creating citizen science programs using AI approaches to historical collections with users and members of the LoC, the LC Labs, and the Hispanic Reading Room.
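As a rough illustration of the annotation step in objective 2, the sketch below tags place names, personal names, and years in a transcribed sentence. It is not the project's method: it swaps in a simple gazetteer lookup plus a regular expression for a trained Natural Language Processing model, and the `GAZETTEER` entries, the labels, and the sample sentence are all hypothetical.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int   # character offset where the match begins
    end: int     # character offset just past the match
    label: str   # knowledge category, e.g. PLACE
    text: str    # the matched surface form


# Hypothetical gazetteer; a real pipeline would use a trained NER model
# or a much larger, curated list of historical names.
GAZETTEER = {
    "PLACE": ["Nueva España", "Veracruz", "Sevilla"],
    "PERSON": ["Hernando Cortés"],
}

# Four-digit years within the collection's date range (1500-1699).
YEAR = re.compile(r"\b1[56]\d{2}\b")


def annotate(text: str) -> list[Annotation]:
    """Return all gazetteer and year matches, ordered by position."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                found.append(Annotation(m.start(), m.end(), label, m.group()))
    for m in YEAR.finditer(text):
        found.append(Annotation(m.start(), m.end(), "DATE", m.group()))
    return sorted(found)


if __name__ == "__main__":
    sample = "Hernando Cortés llegó a Veracruz en 1519."
    for ann in annotate(sample):
        print(ann.label, ann.text)
```

Exact string matching like this would miss the spelling variation typical of Early Modern Spanish, which is one reason a trained model is preferable in practice.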
Description | We have now achieved an AI model that allows the automated transcription of millions of historical documents written in the calligraphy type called 'Italica Cursiva'. This is a major achievement, as the model will allow major archives and libraries, such as the Library of Congress among many others, to offer these documents and to explore information that previously took years to transcribe and study. |
Exploitation Route | When released, this outcome will impact all major Spanish-speaking GLAMs that may hold documents written between the 15th and the 20th century. |
Sectors | Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections |
Description | The New Spain Fleets: Delving into three centuries of socioeconomic colonial history through Artificial Intelligence |
Amount | £1,068,748 (GBP) |
Funding ID | ES/X013774/1 |
Organisation | Economic and Social Research Council |
Sector | Public |
Country | United Kingdom |
Start | 03/2024 |
End | 03/2029 |
Title | Italica Simple Document Collection (16th-17th centuries) |
Description | This is a large collection of Spanish American historical documents transcribed by hand by the team. This is serving as the main gold standard dataset for the training of Machine Learning algorithms for HTR. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | No |
Impact | This dataset was essential to the development of the LoC Italica Cursiva HTR model. |
Title | LoC Hans P. Kraus paleography classification |
Description | The Hans P. Kraus collection of documents at the Library of Congress was classified by document and paleography type. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | No |
Impact | We are using this dataset to record and manage the Artificial Intelligence models the project is creating for the automated transcription of the collection. |
URL | https://airtable.com/appOq5tr9zZw9a76o/shrtPfX3ZqWPYEr9D |
Title | LoC Italica Cursiva (16th & 17th centuries) |
Description | This is a Handwritten Text Recognition model for the automated transcription of Spanish American historical documents written in Italica Cursiva calligraphy during the 16th and 17th centuries. We have achieved a character error rate (CER) of 4.80%. |
Type Of Material | Computer model/algorithm |
Year Produced | 2023 |
Provided To Others? | No |
Impact | We will see the full impact by the end of the project, but with this we will be able to automatically transcribe millions of historical documents written during these centuries. |
Title | LoC Procesal |
Description | This is a Handwritten Text Recognition model for the automated transcription of historical documents. It focuses mainly on the calligraphy called Procesal Simple, used in historical Spanish and Spanish American documents. We are still working on it, but we have already achieved a CER of 14.6%. |
Type Of Material | Computer model/algorithm |
Year Produced | 2024 |
Provided To Others? | No |
Impact | This is yet to be realised. |
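The CER figures reported for the two models above are character error rates. As a minimal sketch, assuming the standard definition (Levenshtein edit distance between the model output and a hand-made reference transcription, divided by the reference length), CER can be computed as follows. The function names and the sample strings are illustrative only and are not taken from the project's tooling.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical line: gold-standard transcription vs. model output.
    gold = "en la ciudad de mexico"
    predicted = "en la ciudad de mejico"
    print(f"CER: {character_error_rate(gold, predicted):.2%}")
```

Under this definition, a CER of 4.80% corresponds to roughly five character errors per hundred characters of reference text.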
Description | US Library of Congress |
Organisation | Library of Congress |
Country | United States |
Sector | Public |
PI Contribution | We have partnered to work on Handwritten Text Recognition of their collections with the models we are creating for the project. Our work in the Unlocking the Colonial Archives project has resulted in further funding. |
Collaborator Contribution | They are providing access to their collections and staff at the library. |
Impact | This is work in progress. We are planning to develop and release the models for their collections in 2024. |
Start Year | 2023 |
Description | Exhibition "Imaginar el Fin de los Tiempos" at the National Museum of Anthropology and History, INAH-Mexico |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | We were part of a panel specially organised for a national exhibition at the National Museum of Anthropology and History of Mexico. |
Year(s) Of Engagement Activity | 2023 |