Big UK Domain Data for the Arts and Humanities (BUDDAH)

Lead Research Organisation: University of London
Department Name: Inst of Historical Research

Abstract

Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself and the process of collection are poorly understood, and it is possible to draw only the broadest of conclusions from current analysis. The proposed project will focus on deeper analysis of the dataset derived from the UK domain crawl from 1996 to 2013 (that is, up to the point at which legal deposit legislation was extended to cover digital materials), totalling approximately 65 terabytes and constituting many billions of words. For the arts and humanities, this is very big data indeed.

A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. These may include, but are not limited to, sentiment analysis, topic modelling, proximity searching, link analysis, and geo-spatial analysis. The scale and inter-connectedness of the data require an analytical, big data approach rather than the rendering of individual web pages.

A major study of the history of UK web space from 1996 to 2013, including language, file formats, the development of multimedia content, shifts in power and access, and so on, will be complemented by a series of sub-projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.

Project outputs will include a suite of tools associated with the 1996-2013 dataset; a series of case studies produced by the sub-projects; an online training course for arts and humanities researchers; peer-reviewed journal articles; and a monograph on the history of the UK web during this period.

Planned Impact

The proposed research will have a significant impact on a range of audiences and sectors, including:

1. Institutions with responsibility for web archiving and digital preservation
The UK web domain dataset 1996-2013 is held by the British Library, which as of April 2013 also has responsibility for preserving data arising from a regular crawl of UK web space. This is a major extension of the BL's legal deposit role, and one which will require considerable investment in time and resource. If this work is to be of genuine value to researchers in the arts and humanities, it is vital to understand how they wish to use the web archive, both now and in the future, and the methods and tools that they will require to interrogate and represent the data effectively. This project will inform the BL's future strategy, by engaging arts and humanities researchers in the co-production of tools for analytical access to the data. It will also help curators to understand the data for which they are now responsible, and to refine their processes of collection. Arts and humanities researchers in turn will benefit from the expertise of the digital scholarship and web archive teams at the BL, and the project will also inform provision at the UK's other legal deposit libraries.

The National Archives (TNA) has responsibility for preserving UK government information published on the web, a more selective process than the wider UK domain crawl. Members of TNA staff will be invited to attend the two proposed workshops and to join the project advisory board. Arrangements for the harvesting of national web data, and provision of access to it, vary widely between countries, but many of the problems are common. Consequently, the findings of this project will feed into web archiving and digital preservation practice internationally, and particularly within the EU.

2. Memory institutions
As the web becomes an increasingly important primary source, museums, galleries and other memory institutions will face the challenge of incorporating material derived from web archives in exhibitions and other forms of public engagement (e.g. the recent British Library Propaganda exhibition made use of archived digital materials). A more accessible and well-promoted archive, with an associated body of expertise, will support the imaginative use of big web data in this context. One of the project case studies will examine material culture and approaches to exhibiting the web.

3. Government, policy-makers and charities
(Local) government, charities and other public sector organisations increasingly interact with the public online, and an understanding of the nature of that interaction and the ways in which it may be analysed is potentially of enormous interest to policy-makers, e.g. in assessing levels of political engagement over time. Relevant case studies will be disseminated widely.

4. General public
Much of the data contained in the UK web archive is generated by members of the public, and this project will establish a research framework which will both safeguard their interests and help them to understand the uses to which their data may be put. A series of short videos will be produced, presenting the project's objectives and findings in an accessible way, one of which will discuss the ethics of big data research. The videos will be published on the project website and via YouTube. The project findings will also be disseminated using the well-established social media presences of the Institute of Historical Research, the BL and the Oxford Internet Institute.

5. Journalists and other mainstream media-based researchers
There is substantial media interest in web-based and social media research, and this project will provide journalists with a greater understanding of the data itself and the context within which that research is conducted. This will in turn increase public understanding of both web archives and big data.
 
Description Big UK Domain Data for the Arts and Humanities has resulted in the creation of a full-text index of the archive of UK web space from 1996 to 2013, the point at which the British Library undertook its first Legal Deposit full domain crawl. The sophisticated interface developed by the project to allow searching of this data, SHINE, has been made available by the British Library as a prototype (see https://www.webarchive.org.uk/shine), and the work undertaken by the project team is currently informing a fundamental redesign of access mechanisms for the BL's other web archive collections. Another major output of the project is a series of 10 bursary-holder case studies, highlighting the value of the archived web for humanities research in a range of disciplines, from history to political science and communication studies. Some of this work will also feature in a forthcoming open-access monograph edited by two members of the project team.

With regard to the development of skills required to work with web archives, which was identified as a key barrier to engagement with this important primary source, the Institute of Historical Research has already organised a training course for researchers and archivists, which will form part of its ongoing training offering. A training workshop has also been organised for the annual general meeting of the International Internet Preservation Consortium in 2016.

The work of the project will be taken forward in three main ways: by the British Library as it builds on the tools already developed to improve its core services; as part of the Research Infrastructure for the Study of Archived Web Materials (RESAW) research network; and through the recently funded AHRC research network, Born Digital Data and Approaches for History and the Humanities. Further research projects and funding applications are already in development.
Exploitation Route The project findings and outputs have already been used by others in the culture and heritage sector. Specifically, the tools and knowledge developed during the project have influenced provision of and access to web archives at the British Library, and the software and processes have informed similar work in Denmark and Canada. The archiving of the web is evolving, and will continue to evolve as the scale of the data increases exponentially, and new tools and processes may build further on the pioneering work of this project.
Sectors Digital/Communication/Information Technologies (including Software), Culture, Heritage, Museums and Collections

URL http://buddah.projects.history.ac.uk/
 
Description The main impact of this project to date lies in its contribution to the enhancement of archival processes and access arrangements at the British Library. In late 2017, the British Library launched a new beta service providing access to all of its archived web content, which drew heavily on research undertaken during the Big UK Domain Data for the Arts and Humanities project. Advanced search and trends functionality are currently in development at the Library, and these too have been shaped by the BUDDAH research. The PI has been involved in user testing for the beta service throughout. The SHINE interface created during the project has also informed development of archived web search facilities at the Bibliothèque nationale de France and the Danish Royal Library.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software), Culture, Heritage, Museums and Collections
Impact Types Cultural

 
Description Judge, UK Trade and Investment's Sirius Programme
Geographic Reach National 
Policy Influence Type Membership of a guidance committee
URL http://www.siriusprogramme.com/
 
Description Presentation to the Joint Legal Deposit Committee
Geographic Reach National 
Policy Influence Type Participation in a national consultation
 
Description CLEOPATRA: Cross-lingual Event-centric Open Analytics Research Academy
Amount € 3,949,689 (EUR)
Funding ID 812997 
Organisation Marie Sklodowska-Curie Actions 
Sector Academic/University
Country Global
Start 01/2019 
End 12/2022
 
Description Research Networking Scheme
Amount £32,000 (GBP)
Funding ID AH/N006178/1 
Organisation Arts & Humanities Research Council (AHRC) 
Sector Public
Country United Kingdom
Start 03/2016 
End 02/2017
 
Description Membership of the Turing Institute Data Science and Digital Humanities Interest Group 
Organisation Alan Turing Institute
Country Unknown 
Sector Academic/University 
PI Contribution As an external researcher in the interest group, I have been able to contribute expertise in relation to working with web archives and born-digital data for historical research.
Collaborator Contribution The main aims of the group are to strengthen relationships and build collaborations at the intersection between data science and digital humanities. Our goal is to raise the profile of data-driven humanities research at the Turing, open up future collaborations, and strengthen the Turing's links with organisations such as the British Library, The National Records of Scotland and The UK National Archives. The group will show the key role that can be played by The Alan Turing Institute in the area of Digital Humanities by demonstrating that data science research can answer questions relevant to the humanities and vice versa, thus benefiting both fields. This will be achieved with meetings, workshops, and joint research projects. Translating fundamental research in data science into lasting impact in the humanities requires interdisciplinary efforts, through the sharing of perspectives, methods and knowledge. The interest group builds on the organisers' extensive experience in interdisciplinary research on historical data and brings together people from a range of different disciplines.
Impact A workshop was held at the University of Edinburgh in 2018, which fed into the UKRI infrastructure road map consultation.
Start Year 2017
 
Title Prototype SOLR-powered web archive exploration UI 
Description Shine is a web UI for browsing the contents of a Solr server. It is specifically designed to explore a search server populated with web archive data using the warc-discovery indexer. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2015 
Impact The code has already been used by other web archive researchers in Canada, as well as in the British Library's SHINE service. 
URL https://github.com/ukwa/shine
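
As an illustration of the kind of request a Solr-backed interface such as Shine might send to the search server, the following is a minimal sketch only and is not taken from the Shine code base. It builds a URL for Solr's standard /select request handler with a full-text query and a year filter. The host, the core name "webarchive" and the field names "content" and "crawl_year" are assumptions made for this example; the actual schema depends on how the indexer is configured.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query_url(base_url, text_query, year=None, rows=10):
    """Build a URL for Solr's standard /select request handler.

    base_url is the address of a Solr core, e.g. a hypothetical
    'http://localhost:8983/solr/webarchive'.
    """
    params = [
        ("q", text_query),    # main query string
        ("rows", str(rows)),  # number of results to return
        ("wt", "json"),       # ask for a JSON response
    ]
    if year is not None:
        # 'crawl_year' is an assumed field name, used here for illustration.
        params.append(("fq", f"crawl_year:{year}"))
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Example: pages mentioning the phrase "legal deposit" that were crawled in 2005.
url = build_solr_query_url(
    "http://localhost:8983/solr/webarchive",  # hypothetical core address
    'content:"legal deposit"',                # 'content' is an assumed field name
    year=2005,
)
print(url)
```

Restricting results with a filter query (fq) rather than adding the year to the main query keeps the full-text relevance scoring separate from the date restriction, which is one common way to support trend-over-time views of an archive.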
 
Description 'The future of the past' public roundtable 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact This roundtable discussion was held as part of a series of public seminars organised under the theme of 'History now and then'. It addressed how future historians might judge today's historiography, what we over- or under-emphasise, big data and big history, and how history is changing in the digital age. One of the aims of the event was to raise awareness of the changing nature of historians' primary sources in a digital age, and in particular to encourage attendees to think about how they handle their personal digital archives.
Year(s) Of Engagement Activity 2017
 
Description 'Will history survive the digital age?', BBC History magazine article 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact The article in BBC History magazine discussed the challenges for historians of working with large-scale born-digital sources, and also the actions that people can take to make sure that their own digital records are preserved for future researchers.
Year(s) Of Engagement Activity 2017
 
Description Ancient History of the UK Web 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Talk led to a discussion of the issues raised.

International dissemination of the project.
Year(s) Of Engagement Activity 2014
 
Description Being Human Festival: computer games on the IA 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact An all-day event at the Being Human Festival, in which members of the general public could try computer games archived by the Internet Archive, and learn in general about web archiving.
Year(s) Of Engagement Activity 2015
URL http://beinghumanfestival.org/
 
Description CPD25 M25 Consortium of Academic Libraries event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Presentation about the use and promotion of born digital archives at a CPD25 event on 'My Digital Tools Bring all the Researchers to the Library - Marketing your Library in the 21st Century'. The main aim of the presentation was to demonstrate how to engage humanities researchers with 'difficult' digital collections.
Year(s) Of Engagement Activity 2016
 
Description Cambridge history faculty talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Talk was followed by discussion of the issues raised.

No known notable impacts.
Year(s) Of Engagement Activity 2014
 
Description Guest lecture: Notre Dame Data Science Program 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Undergraduate students
Results and Impact 20-30 students attended a guest lecture by Helen Hockx-Yu, which was given as part of the Notre Dame University Data Science Program.
Year(s) Of Engagement Activity 2010,2018
URL http://datascience.nd.edu/
 
Description History Day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Talk sparked questions and was followed by a one-on-one clinic.

No known notable impacts.
Year(s) Of Engagement Activity 2014
 
Description History Now and Then: the Future of the Past 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact A roundtable discussion on 'The future of the past', organised as part of a series of public events on 'History Now and Then'.
Year(s) Of Engagement Activity 2017
 
Description IHR Digital History seminar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Talk produced lively discussion.

No known notable impacts.
Year(s) Of Engagement Activity 2014
 
Description IHR Postgrads talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Talk involved questions and discussion.

After the talk I was contacted by a freelance technology journalist who was interested in pitching a story about the project.
Year(s) Of Engagement Activity 2014
 
Description Introduction to web archives workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact An introduction to web archives for researchers, archivists and librarians.
Year(s) Of Engagement Activity 2015
URL http://www.history.ac.uk/events/browse/19262
 
Description JISC Digital Festival, Birmingham 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk increased knowledge of web archiving among university IT professionals and academics.

No known notable impacts.
Year(s) Of Engagement Activity 2014
 
Description Literature Festival 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Talk increased public knowledge of web archives.

No known notable impacts.
Year(s) Of Engagement Activity 2014
URL http://www.cheltenhamfestivals.com/literature/whats-on/2014/big-data-big-opportunities/
 
Description Oxford Digital Tools round table 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Presentation of the project at a round table for Oxford History faculty members and other interested university staff and students.
Year(s) Of Engagement Activity 2015
URL http://blogs.bodleian.ox.ac.uk/history/tag/digital-humanities/
 
Description Panel Presentation (IIPC) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Presentation of the project at the annual general meeting of the International Internet Preservation Consortium.
Year(s) Of Engagement Activity 2015
URL https://www.youtube.com/watch?v=o4iIdZP4rg8
 
Description Plenary panel: Big data for arts and humanities 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Stimulated thinking and discussion about web archives.

No known notable impacts.
Year(s) Of Engagement Activity 2014
 
Description Presentation to a group of visiting PhD students from University of Kent 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Talk increased knowledge of web archiving among postgraduates.

No known notable impacts.
Year(s) Of Engagement Activity 2014
 
Description Public debate 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Participation in a roundtable on the subject of 'Web Archives: truth, lies and politics in the 21st century'. The web and social media play a key role in the circulation of news in the 21st century. But increasingly it is becoming difficult to separate fact from fiction and untruth, or even to agree on what constitutes fact. These problems are heightened by the speed with which information can be shared, modified or deleted, the personalisation (both explicit and hidden) that determines which news we see online, and the difficulties of establishing authorship and provenance. This public roundtable discussed the role of web and social media archives in helping us, as digital citizens, to navigate through this complex and changing information landscape.
Year(s) Of Engagement Activity 2017
URL https://archivedweb.blogs.sas.ac.uk/digital-conversations/
 
Description Public lecture: 'Exploring the UK web' 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact A public lecture which involved lively discussion with the audience.
Year(s) Of Engagement Activity 2015
URL http://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/2015/11/23/web-archives-talk/
 
Description Publishers' conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact A more informed attitude to web archives among publishers.

No known notable impacts.
Year(s) Of Engagement Activity 2014
URL http://eventifier.com/event/Alpsp2014/alpsp?full_embed=true
 
Description Utopia and Dystopia drop-in session 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact A range of people, from computer game enthusiasts to students of computer design and game designers themselves, attended a drop-in session which allowed them to play vintage computer games from the 1980s and 1990s. The games are held in the Internet Archive, and the event explored the importance of capturing this aspect of digital cultural heritage. Many of the attendees had previously been unaware of the Internet Archive and reported that they would use it in the future.
Year(s) Of Engagement Activity 2016
 
Description Web Archiving Week 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A week of web archiving events and activities was organised in London on 12-16 June 2017. The centrepiece of the programme was a major international conference combining the second RESAW Conference and the rescheduled IIPC Web Archiving Conference, 14-16 June. The week began with a two-day Archives Unleashed hackathon, and a public debate was held on the evening of 14 June, as part of the British Library's series of Digital Conversations.

Web Archiving Week was hosted by the British Library and the School of Advanced Study, University of London, and organised with the support and assistance of the IIPC, RESAW (A Research Infrastructure for the Study of Archived Web Materials), The National Archives and Archives Unleashed.
Year(s) Of Engagement Activity 2017
URL https://archivedweb.blogs.sas.ac.uk/
 
Description Web archiving workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The workshop prompted discussion and improved participants' knowledge of web archives, including their pros and cons.

Activity led to some of the participants applying to the project for bursaries.
Year(s) Of Engagement Activity 2014