Developing a network to investigate the development of a global dataset of digitised texts

Lead Research Organisation: University of Glasgow
Department Name: School of Humanities


All around the world, libraries, archives, universities, and many other organisations are digitising collections in order to make them freely available for research and scholarship. Whilst this endeavour is making millions of texts available, much of the effort is uncoordinated, making it hard for researchers or digital scholars to make the best use of this growing corpus. In addition, organisations wishing to target their digitisation efforts efficiently are unable to easily check to see if a text has already been digitised.

These problems could be overcome by the development of an international register that contains a dataset of digitised texts, which brings together links to all openly available sources. The existence of such a dataset would deliver three big benefits:

Scholars seeking large corpora of texts could easily search and compile links to items across many sources, creating novel collections from which they can undertake research with digital methods.
Readers wishing to find a digitised text would be able to search quickly and efficiently across all potential sources to find whether a text has been digitised, and to gain access to the online text.
Libraries undertaking digitisation programmes would be able to discover already digitised texts, making their own digitisation programmes more efficient by avoiding duplication. It would also enable large-scale collection analysis for research and other purposes.

This project proposes the development of a new collaboration between HathiTrust, an existing and trusted aggregator of digitised texts, preservation agent, and access platform in the US, and key UK research libraries that are large-scale sources of digitised materials. A trial combined dataset will be created to test the idea, and sustainability options will be explored in order to provide the collaboration with an ongoing operational model.

Previous projects have sought to provide registers of digitised texts, but this network will go beyond those efforts in two ways. First, it will undertake research to identify how our proposed dataset could support discovery of digitised texts. Second, it will create a dataset that goes beyond existing projects by providing a large-scale dataset of library collection metadata from several of the largest digitising organisations in the world: whereas existing efforts are focused on aggregation and discovery, this project will provide a more comprehensive dataset that is also suitable for researchers and libraries to undertake data analysis. This will support the development of local, national and global digitisation strategies.

If successful, this model has the potential to transform access to digital texts for citizens and researchers worldwide, for the study of single texts, or for the collation of novel corpora for digital scholarship.

Planned Impact

The network will support improved reuse of metadata and associated datasets for researchers; provide benefits to libraries engaged in digitising their collections; and provide a pathway towards enhanced access to digitised texts for the public. It will benefit the following groups:


The International Library Sector:
The dataset of digitised texts will provide a prototype of the method for creating a larger, global dataset of digitised texts. It will benefit the international library sector by: enhancing their understanding of data sharing across international boundaries; indicating overlap in digitisation efforts that could, in turn, provide cost savings by avoiding duplication of effort; establishing cross-Atlantic communities of interest that can build on the prototype in potential future funding projects; and supporting knowledge exchange for the international library sector. It will also provide evidence of transatlantic library holdings that could contribute to evidence-based approaches to digitisation strategies at institutional, national, and global level. The associated case study will allow libraries to understand the rationale behind a larger dataset of digitised texts, and identify best practice for future work in this area. HathiTrust and RLUK are membership organisations that support nearly 200 research and academic libraries between them. These libraries will benefit from dissemination of project findings via the project website, and through the project partners' existing dissemination routes.

Users of digitised texts:
The network's outputs will have significant impact for users of digitised texts. First, they will indirectly benefit from the proposed case study, which could provide significant benefits to users in the form of future work to develop a single point of entry for discovery of digitised texts. Second, user engagement in the form of facilitated workshops will increase user awareness of the digitised collections of the project partners, and give participants the opportunity to play a significant role in defining the nature and scope of future collaborations that arise from the network. Third, in recent years organisations like the British Library and National Library of Scotland have successfully engaged with the creative sectors to facilitate innovative reuse of digitised materials; the dataset itself could provide opportunities for creative reuse, while the data would allow creative users to identify what has been digitised, and where, in order to identify suitable source materials. The impact for users is likely to continue after the funding period, as the long-term implications of data sharing and improved discoverability of digitised materials are felt.


The project website will act as a publicly-available archive of the network's activities, and a means of disseminating the issues arising from creating a prototype dataset of digitised texts. The PI will take responsibility for leading content creation for the website, and the project partners will use existing social media channels to disseminate information to their existing audiences. The dataset of digitised texts will be released, to the extent that is allowed, so that users are able to see the extent of digitisation activity amongst project partners.
Description Our findings are:
- There still exists no global resource within the library and heritage sector that comprehensively aggregates descriptive, preservation, or provenance metadata for all digitised texts: this impacts upon discoverability of digitised materials, and limits the impact of library digitisation efforts.
- Several clear use cases emerge that would differentiate such a resource from existing platforms. There is great overlap between the use cases we identified as "discovery and access" and the existing focus of platforms including DPLA and Europeana Collections. The cases we identified in "efficiency, cost, impact value", "provenance" and "research" provide a clear value proposition that is not fully addressed by any existing platform.
- Within these use case categories, there is a desire for a resource that allows collaborative strategic planning to take place: areas of particular interest are in clustering similar materials, digitisation planning, and the preservation of existing materials.
- Our dataset already allows us to address several use cases, but there are clear limitations that need to be addressed by enhancements to existing metadata, and by incorporating additional data fields in future iterations.
- Holdings analysis will inevitably have to play a key role in supporting use cases across the different categories, particularly in clustering of similar or duplicate manifestations, provenance, and discovery.
- Aggregation of diverse collections metadata remains a key challenge: resource-intensive methods yield better results when clustering similar records, so a trade-off between efficiency and accuracy might be necessary should aggregator-side metadata transformation be adopted.
- Information professionals were generally excited by the value proposition of the GDDNetwork approach, but we found that some scholars professed confusion about the uniqueness and value of the dataset. Further work is therefore required to define and explain the scope of a global registry for non-information professionals.
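The trade-off between efficiency and accuracy noted in the findings above can be illustrated with a minimal Python sketch. It is not the project's own method: the record fields (`source`, `title`, `year`), the function names, the sample records, and the 0.9 similarity threshold are all hypothetical and chosen only for illustration. A cheap pass groups records whose normalised keys are identical, while a slower pairwise fuzzy comparison also catches near-duplicates such as spelling variants.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def match_key(title: str, year: str) -> str:
    """Cheap key: accent-stripped, lower-cased, punctuation-free title plus year."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split()) + "|" + year


def cluster_exact(records):
    """Fast pass: group record sources whose keys are identical."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[match_key(rec["title"], rec["year"])].append(rec["source"])
    return dict(clusters)


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Slow pass: fuzzy comparison of two keys (pairwise, so quadratic overall)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Hypothetical sample records; a real aggregation would read library metadata.
records = [
    {"source": "library_a", "title": "The Wealth of Nations.", "year": "1776"},
    {"source": "library_b", "title": "the wealth of nations", "year": "1776"},
    {"source": "library_c", "title": "The Welth of Nations", "year": "1776"},
]

clusters = cluster_exact(records)
```

Here the exact pass groups the first two records but leaves the misspelt third record in a cluster of its own; only the fuzzy comparison links it to the others, at the cost of comparing keys pairwise.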
Exploitation Route The outcomes of this funding have laid the groundwork for future research and development of the dataset. We have identified the following priorities for future work, around which collaborations could emerge:
- Further work is required to create a business case for building a global dataset. This should include refining the use cases to identify the primary benefits and mission of the resource, as well as some assessment of the overall cost of such a service, and the potential for short and longer term institutional hosting and financial support.
- Further funding is required to expand the prototype: this should focus both upon enhancing the dataset to meet additional priority use cases, and expanding the network to include a more representative sample of global libraries.
- The US/UK constitution of our project inevitably focuses our findings upon specific regional and linguistic contexts. More therefore needs to be done to understand the issues that a global dataset would bring through diversification, including different cataloguing standards, multiple character sets, different languages, and questions of representativeness and comprehensiveness. This is both a research and a technical challenge that requires further collaboration between researchers, libraries, and content providers.
Sectors Creative Economy,Digital/Communication/Information Technologies (including Software),Culture, Heritage, Museums and Collections

Description The dataset from our project has been used by the National Library of Scotland, which has made it available via the Data Foundry under a CC0 licence. This means that the project dataset is now publicly available for future research, and will benefit members of the public, the research community, and the library sector through reuse and access to aggregated data around digital collections. As the dataset was uploaded in the last month, the impact so far has largely been upon the library community through the ability to share and access new forms of collections data; we expect further benefits to other communities in future. The project has also formed the basis of a prototype service to support the discovery of digitised texts. The website, entitled OpenTexts.World, is an experimental service that helps users discover free digitised text collections from around the world. It currently holds around 8 million records from at least ten libraries around the world. The service was created by Stuart Lewis and Gill Hamilton of the National Library of Scotland, who were key partners in the network, and is based upon the work of the GDDNetwork. This has provided wider social and cultural benefits by improving the accessibility of digitised library collections to researchers and members of the public.
First Year Of Impact 2020
Sector Culture, Heritage, Museums and Collections
Impact Types Cultural,Societal

Title Aggregated dataset of digitised texts from the GDD project 
Description This is an aggregated dataset of digitised records created by the Global Digitisation Dataset project in 2019, an AHRC funded project under the UK/US Digital Scholarship in Cultural Institutions networking fund. The records come from the project's members: HathiTrust, National Library of Scotland, British Library and the National Library of Wales. Each record in the dataset contains limited bibliographic metadata, along with a link to the item. The dataset was created as a proof of concept, merging records of digitised texts from different organisations together. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact None yet - uploaded in March 2020 
Description Series of blogs on project website 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The core network members produced a series of blog posts aimed at keeping the wider profession up to date with our activities. The blog posts reached a wide cross section of relevant stakeholders and triggered many expressions of interest in the project.
Year(s) Of Engagement Activity 2019,2020
Description Series of invited workshops 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We organised two workshops, each attracting roughly 35 attendees, to present the work of the research network and to invite input from relevant sectors. The events were successful in sparking discussions, and several attendees expressed a desire to contribute to further phases of the research. The events therefore laid the groundwork for further collaborations to emerge in future.
Year(s) Of Engagement Activity 2020