Digital Humanities Data Hive: Accessing Humanities Data At Scale

Lead Research Organisation: University of Glasgow
Department Name: School of Humanities


This project will scope the establishment of the Digital Humanities Data Hive ("DH2"). This is conceived as an active and interactive national data centre for the arts and humanities where the rich and complex data at the heart of research can flourish in new and unexplored ways. At the heart of the project is a respect for the hybrid and diverse array of data that are the basis for arts and humanities research. Our proposal rejects narrow definitions of data types and disciplines, and instead builds on the UK's long history of expertise in digital text, image, and multimedia collections; within our scope is any humanities data which has the potential to be transformed and have value added through the data-driven integration of research data and tools.

Our research, carried out in close collaboration with the arts and humanities and data science communities, will explore how DH2 will make possible completely new ways of doing digitally enabled work. This will enable us to scope, develop and design a suite of tightly-integrated services for both data storage and use. Our approach centres around building tools for mining, analysing, exploring and linking data alongside its storage, thus ensuring data within a future centre or centres is firmly embedded within the research data lifecycle. By so doing, our project will foster new ways of thinking about creating digitally driven research in the arts and humanities - and in parallel, make the case that the creation of tools and services for using data for scholarship must be an integrated aspect of any data-focused project in the humanities. We know that at present, the 'human infrastructure' of researchers, developers, users, and repository managers does not wish to interact with raw data stores, but instead with the holistic confluence of data and tools. Therefore, the combined data/interface/tool ecosystem that our project will design will be essential for the identification of future cross-disciplinary opportunities and new and emerging transformative uses of humanities data. Finally, by centering re-use we want to create an environment where there is greater transparency around the use and analysis of data and where methods and workflows for research are open and can be observed, critiqued, and replicated.

The DH2 proposal this project will scope will include a methodological layer of tools and services for using data at scale, offering a service that will allow people to run their own data through tools and create results based on comparisons and integration with larger bodies of data. To do this, we will develop an evidence-based, fully costed project plan and full technical specifications for a data centre which has two key elements: a Data Service that will federate new and existing data repositories via an abstraction layer, and a Data Lab, an analytical layer of tools for data manipulation, mining, re-use, and visualisation, which will draw on that abstraction layer. Our project will therefore be a crucial step towards a national data service for the arts and humanities, bringing together human infrastructures, results-oriented services, and state-of-the-art technical developments.


Description Summary of key findings
1. There is no existing centralised data infrastructure for A&H in the UK, meaning there are no reliable tools and methods for researchers to discover, use, re-use, and integrate data, or to connect tools with data. Current solutions are fragmented by disciplines and data types or siloed to individual projects or institutions. There is substantial demand for these capabilities amongst both general and expert users of A&H data: users want to be able to store, sustain, aggregate, analyse, search, and share their data in one centralised system. A comprehensive UKRI investment in data infrastructure at the heart of the UK's research and academic ecosystem is needed to address this challenge.
2. The sustainability and storage of data produced by A&H research projects is not assured, even for large-scale investments: there is no interdisciplinary A&H data centre available for these cultural assets that addresses the full life cycle of digital projects (i.e., sustaining data, and its use and reuse), resulting in a significant and quantifiable lost public investment.
3. There is a significant cost-benefit return anticipated from developing a formal data infrastructure for the UK, in both the value added to creative innovation, and losses prevented for new and existing projects. Models can be seen in the existing infrastructures and services for STEM and social science disciplines in the UK, and in A&H data infrastructures internationally, which make clear the inherent value in A&H data.
4. The current landscape lacks not just services and technical capacity, but a national focus for interdisciplinary academic experience in data, tools and methods. Such a focus could galvanise the A&H community in the uptake of innovation in next generation and translational computational approaches. These approaches not only allow the use and re-use of existing (and future) data, but allow researchers to use their own data in addressing grand challenges, maximising the scientific, societal, and economic benefits of a research infrastructure. Without the right A&H academic engagement, any future investment will recreate the existing unsustainable patchwork of tools and services.
5. Recent technological and infrastructural innovations, such as cloud-based computing, improved data visualisation tools, and machine learning technologies, have addressed previous challenges in developing a centralised A&H data service. However, future development and any design thinking need to have at their heart an awareness of the complexity of A&H data, and services and tools need to be designed around complexity at scale (including inherent rights and ethical issues, as well as data standards, models, and ontologies). This complexity is key to sustaining 'living' data with a meaningful digital afterlife beyond its creation; academic stakeholders and bespoke A&H data expertise are therefore essential.
6. While A&H researchers have access to advanced tools for data analysis and processing (e.g. NLP, data mining, text analysis, visualisation, data modelling, network analysis tools, etc.), these are still only accessible in bespoke ways for individual projects/datasets. The community does not have access to a shared tools layer that abstracts data from disparate collections and allows for their cross-comparison and re-use to address new research questions.
7. There is insufficient planned cooperation and integration between A&H and HPC and emerging scientific infrastructure initiatives. This needs to be formalised and programmed into the development of any A&H infrastructure solutions to support next-generation research.
8. There is a link between sustainability and use of data; if data is more reliably accessible, it will be re-used. Re-use of data has also become more crucial in the REF-driven landscape in the UK: there is an urgent need to be able to replicate results of data-driven scholarship in the arts and humanities, and to support interdisciplinary approaches in academic research, which necessitates interoperability and reusability of data across fields and projects.
9. There is a need for solutions that are scalable to both large and small projects. Solutions cannot just cater to large, funded projects but must also facilitate the storage and re-use of small and diverse datasets, in order to maximise the utility and impact of A&H research.

6. Top 5 recommendations
Short-term: what needs to be done now in order to make the UK R+I infrastructure competitive for cutting edge A&H research:
1. Invest in constructing a research infrastructure for A&H that will deliver an integrated suite of services and support for the full data life cycle: a Data Service capable of storage and access of interdisciplinary, hybrid, complex, and multimodal data, accessible and scalable to large and small projects, that enables data sustainability and re-use; and a Data Lab providing an abstraction layer that offers tools and methodologies for data use as services.
2. Build a structure and framework for the infrastructure that ensures connected academic leadership and continued development and scoping, including community building activity to ensure buy-in and co-production of the required tools and services to ensure that DH2 addresses researcher needs and takes advantage of emerging opportunities. Our interdisciplinary structures will be built so as to foster open, FAIR and sustainable, complex data-driven research across the A&H.
Medium term:
3. Explore approaches for increasing the automation of services, such as aggregation and ingest of data, and increase exploration of next-generation tools (ML, AI, NLP).
What should be done over the longer term (3-5 years):
4. Deliver a centralised Repository that can store and aggregate complex A&H data, and embed the use of this service across UK A&H projects.
5. Design a user layer which facilitates the use and re-use of diverse data sets held across the A&H and enables wider access and discoverability to this data.
Exploitation Route The recommendations were presented to the AHRC as the basis of future funding.
Sectors Education

Description A series of workshops with key practitioners to scope future data services required for the arts and humanities 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Data project workshops May-June 22
• Targeted workshops with identified data projects, including UK-Ireland DH Association.
• Included collaboration with other scoping projects including CONNECTED.

Scoping to ensure that the DH2 design is fit for purpose as an active and interactive national data centre for the Arts and Humanities (A&H) where the rich and complex data at the heart of research can flourish in new and unexplored ways, and as a multi-modal, interdisciplinary Research Infrastructure for the UK's A&H community that enables UK research to benefit from the essential tools, services and research synergies that are only possible with a hybrid and open approach.

Our inclusive approach ensures that we will not build another technological or disciplinary silo, but instead ensure that the Arts and Humanities research community has the necessary human and academic infrastructure, including leadership, blue-skies development, sustainability planning, and engagement with the wider RI and HPC community. This approach is comparable to what has been built elsewhere, allowing the UK to compete internationally in data-driven research in the arts and humanities. It responds to the findings of our scoping study, which showed that the UK lacks a research infrastructure of this sort for the Arts and Humanities, and that it is urgently needed.
Year(s) Of Engagement Activity 2021,2022
Description Interviews and survey of research needs: data centres in the arts and humanities 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Data expert interviews April 21-Dec 22
• Interviews with identified data experts conducted and analysed: 24 interviews, 8 countries.

Institutional & organisational interviews March-June 22
• Further consultation with a selection of institutions, research groups, individuals, IROs and heritage organisations, and REF panellists: 8 interviews.

Expert & user survey May-Nov 22
• Survey on A&H data use and re-use, storage and sustainability, and tools and methods.
• Sent to data experts and widely disseminated through digital humanities networks, including general and expert users; analysis developed.
Year(s) Of Engagement Activity 2022