Big UK Domain Data for the Arts and Humanities (BUDDAH)
Lead Research Organisation:
University of London
Department Name: Inst of Historical Research
Abstract
Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself and the process of collection are poorly understood, and it is possible only to draw the broadest of conclusions from current analysis. The proposed project will focus on deeper analysis of the dataset derived from the UK domain crawl from 1996 to 2013 (the year in which legal deposit legislation was extended to cover digital materials), totalling approximately 65 terabytes and constituting many billions of words. For the arts and humanities, this is very big data indeed.
A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. These may include, but are not limited to, sentiment analysis, topic modelling, proximity searching, link analysis, and geo-spatial analysis. The scale and inter-connectedness of the data require an analytical, big data approach rather than the rendering of individual web pages.
A major study of the history of UK web space from 1996 to 2013, including language, file formats, the development of multimedia content, shifts in power and access, and so on, will be complemented by a series of sub-projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.
Project outputs will include a suite of tools associated with the 1996-2013 dataset; a series of case studies produced by the sub-projects; an online training course for arts and humanities researchers; peer-reviewed journal articles; and a monograph on the history of the UK web during this period.
Planned Impact
The proposed research will have a significant impact on a range of audiences and sectors, including:
1. Institutions with responsibility for web archiving and digital preservation
The UK web domain dataset 1996-2013 is held by the British Library, which as of April 2013 also has responsibility for preserving data arising from a regular crawl of UK web space. This is a major extension of the BL's legal deposit role, and one which will require considerable investment in time and resource. If this work is to be of genuine value to researchers in the arts and humanities, it is vital to understand how they wish to use the web archive, both now and in the future, and the methods and tools that they will require to interrogate and represent the data effectively. This project will inform the BL's future strategy, by engaging arts and humanities researchers in the co-production of tools for analytical access to the data. It will also help curators to understand the data for which they are now responsible, and to refine their processes of collection. Arts and humanities researchers in turn will benefit from the expertise of the digital scholarship and web archive teams at the BL, and the project will also inform provision at the UK's other legal deposit libraries.
The National Archives (TNA) has responsibility for preserving UK government information published on the web, a more selective process than the wider UK domain crawl. Members of TNA staff will be invited to attend the two proposed workshops and to join the project advisory board. Arrangements for the harvesting of national web data, and provision of access to it, vary widely between countries, but many of the problems are common. Consequently, the findings of this project will feed in to web archive and digital preservation practice internationally, and particularly within the EU.
2. Memory institutions
As the web becomes an increasingly important primary source, museums, galleries and other memory institutions will face the challenge of incorporating material derived from web archives in exhibitions and other forms of public engagement (e.g. the recent British Library Propaganda exhibition made use of archived digital materials). A more accessible and well-promoted archive, with an associated body of expertise, will support the imaginative use of big web data in this context. One of the project case studies will examine material culture and approaches to exhibiting the web.
3. Government, policy-makers and charities
(Local) government, charities and other public sector organisations increasingly interact with the public online, and an understanding of the nature of that interaction and the ways in which it may be analysed is potentially of enormous interest to policy-makers, e.g. in assessing levels of political engagement over time. Relevant case studies will be disseminated widely.
4. General public
Much of the data contained in the UK web archive is generated by members of the public, and this project will establish a research framework which will both safeguard their interests and help them to understand the uses to which their data may be put. A series of short videos will be produced, presenting the project's objectives and findings in an accessible way, one of which will discuss the ethics of big data research. The videos will be published on the project website and via YouTube. The project findings will also be disseminated using the well-established social media presences of the Institute of Historical Research, the BL and the Oxford Internet Institute.
5. Journalists and other mainstream media-based researchers
There is substantial media interest in web-based and social media research, and this project will provide journalists with a greater understanding of the data itself and the context within which that research is conducted. This will in turn increase public understanding of both web archives and big data.
Publications
Schroeder R (2019) The Sage Handbook of Web History
Taylor H (2016) Do online networks exist for the poetry community?
Winters J (2019) Negotiating the born-digital: a problem of search, in Archives and Manuscripts
Winters J (2017) Breaking in to the mainstream: demonstrating the value of internet (and web) histories, in Internet Histories
Winters J (2018) Debating New Approaches to History
Winters J (2019) The Sage Handbook of Web History
Winters JF (2019) The Historical Web and Digital Humanities: the Case of National Web Domains
Description | Big UK Domain Data for the Arts and Humanities has resulted in the creation of a full-text index of the archive of UK web space from 1996 to 2013, the point at which the British Library undertook its first Legal Deposit full domain crawl. The sophisticated interface developed by the project to allow searching of this data, SHINE, has been made available by the British Library as a prototype (see https://www.webarchive.org.uk/shine), and the work undertaken by the project team is currently informing a fundamental redesign of access mechanisms for the BL's other web archive collections. Another major output of the project is a series of 10 bursary-holder case studies, highlighting the value of the archived web for humanities research in a range of disciplines, from history to political science and communication studies. Some of this work will also feature in a forthcoming open-access monograph edited by two members of the project team. With regard to the development of skills required to work with web archives, which was identified as a key barrier to engagement with this important primary source, the Institute of Historical Research has already organised a training course for researchers and archivists, which will form part of its ongoing training offering. A training workshop has also been organised for the annual general meeting of the International Internet Preservation Consortium in 2016. The work of the project will be taken forward in three main ways: by the British Library as it builds on the tools already developed to improve its core services; as part of the Research Infrastructure for the Study of Archived Web Materials (RESAW) research network; and through the recently funded AHRC research network, Born Digital Data and Approaches for History and the Humanities. Further research projects and funding applications are already in development. |
Exploitation Route | The project findings and outputs have already been used by others in the culture and heritage sector. Specifically, the tools and knowledge developed during the project have influenced provision of and access to web archives at the British Library, and the software and processes have informed similar work in Denmark and Canada. The archiving of the web is evolving, and will continue to evolve as the scale of the data increases exponentially, and new tools and processes may build further on the pioneering work of this project. |
Sectors | Digital/Communication/Information Technologies (including Software), Culture, Heritage, Museums and Collections |
URL | http://buddah.projects.history.ac.uk/ |
Description | The main impact of this project to date lies in its contribution to the enhancement of archival processes and access arrangements at the British Library. In late 2017, the British Library launched a new beta service providing access to all of its archived web content; the service drew heavily on research undertaken during the Big UK Domain Data for the Arts and Humanities project. Advanced search and trends functionality are currently in development at the Library, and these too have been shaped by the BUDDAH research. The PI has been involved in user testing for the beta service throughout. The SHINE interface created during the project has also informed development of archived web search facilities at the Bibliothèque nationale de France and the Danish Royal Library. |
First Year Of Impact | 2015 |
Sector | Digital/Communication/Information Technologies (including Software), Culture, Heritage, Museums and Collections |
Impact Types | Cultural |
Description | Judge, UK Trade and Investment's Sirius Programme |
Geographic Reach | National |
Policy Influence Type | Participation in a guidance/advisory committee |
URL | http://www.siriusprogramme.com/ |
Description | Presentation to the Joint Legal Deposit Committee |
Geographic Reach | National |
Policy Influence Type | Contribution to a national consultation/review |
Description | CLEOPATRA: Cross-lingual Event-centric Open Analytics Research Academy |
Amount | € 3,949,689 (EUR) |
Funding ID | 812997 |
Organisation | Marie Sklodowska-Curie Actions |
Sector | Charity/Non Profit |
Country | Global |
Start | 01/2019 |
End | 12/2022 |
Description | Research Networking Scheme |
Amount | £32,000 (GBP) |
Funding ID | AH/N006178/1 |
Organisation | Arts & Humanities Research Council (AHRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2016 |
End | 02/2017 |
Description | WARCnet: Web ARChive Studies Network Researching Web Domains and Events |
Amount | 1,244,000 kr. (DKK) |
Organisation | Danish Council for Independent Research |
Sector | Public |
Country | Denmark |
Start | 01/2020 |
End | 12/2021 |
Description | Membership of the Turing Institute Data Science and Digital Humanities Interest Group |
Organisation | Alan Turing Institute |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | As an external researcher in the interest group, I have been able to contribute expertise in relation to working with web archives and born-digital data for historical research. |
Collaborator Contribution | The main aims of the group are to strengthen relationships and build collaborations at the intersection between data science and digital humanities. Our goal is to raise the profile of data-driven humanities research at the Turing, open up future collaborations, and strengthen the Turing's links with organisations such as the British Library, The National Records of Scotland and The UK National Archives. The group will show the key role that can be played by The Alan Turing Institute in the area of Digital Humanities by demonstrating that data science research can answer questions relevant to the humanities and vice versa, thus benefiting both fields. This will be achieved with meetings, workshops, and joint research projects. Translating fundamental research in data science into lasting impact in the humanities requires interdisciplinary efforts, through the sharing of perspectives, methods and knowledge. The interest group builds on the organisers' extensive experience in interdisciplinary research on historical data and brings together people from a range of different disciplines. |
Impact | A workshop was held at the University of Edinburgh in 2018, which fed in to the UKRI infrastructure road map consultation. |
Start Year | 2017 |
Title | Prototype SOLR-powered web archive exploration UI |
Description | Shine is a web UI for browsing the contents of a Solr server. It is specifically designed to explore a search server populated with web archive data using the warc-discovery indexer. |
Type Of Technology | New/Improved Technique/Technology |
Year Produced | 2015 |
Impact | The code has already been used by other web archive researchers in Canada, as well as in the British Library's SHINE service. |
URL | https://github.com/ukwa/shine |
Description | 'The future of the past' public roundtable |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | This roundtable discussion was held as part of a series of public seminars organised under the theme of 'History now and then'. It addressed how future historians might judge today's historiography, what we over- or under-emphasise, big data and big history, and how history is changing in the digital age. One of the aims of the event was to raise awareness of the changing nature of historians' primary sources in a digital age, and in particular to encourage attendees to think about how they handle their personal digital archives. |
Year(s) Of Engagement Activity | 2017 |
Description | 'Will history survive the digital age?', BBC History magazine article |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | The article in BBC History magazine discussed the challenges for historians of working with large-scale born-digital sources, and also the actions that people can take to make sure that their own digital records are preserved for future researchers. |
Year(s) Of Engagement Activity | 2017 |
Description | Ancient History of the UK Web |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Talk led to a discussion of the issues raised. International dissemination of the project. |
Year(s) Of Engagement Activity | 2014 |
Description | Being Human Festival: computer games on the IA |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | An all-day event at the Being Human Festival, in which members of the general public could try computer games archived by the Internet Archive, and learn in general about web archiving. |
Year(s) Of Engagement Activity | 2015 |
URL | http://beinghumanfestival.org/ |
Description | CPD25 M25 Consortium of Academic Libraries event |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation about the use and promotion of born digital archives at a CPD25 event on 'My Digital Tools Bring all the Researchers to the Library - Marketing your Library in the 21st Century'. The main aim of the presentation was to demonstrate how to engage humanities researchers with 'difficult' digital collections. |
Year(s) Of Engagement Activity | 2016 |
Description | Cambridge history faculty talk |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Postgraduate students |
Results and Impact | Talk was followed by discussion of the issues raised. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
Description | Guest lecture: Notre Dame Data Science Program |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Undergraduate students |
Results and Impact | 20-30 students attended a guest lecture by Helen Hockx-Yu, which was given as part of the Notre Dame University Data Science Program. |
Year(s) Of Engagement Activity | 2010,2018 |
URL | http://datascience.nd.edu/ |
Description | History Day |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Postgraduate students |
Results and Impact | Talk sparked questions and was followed by a one-on-one clinic. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
Description | History Now and Then: the Future of the Past |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | A roundtable discussion on 'The future of the past', organised as part of a series of public events on 'History Now and Then'. |
Year(s) Of Engagement Activity | 2017 |
Description | IHR Digital History seminar |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | Talk produced lively discussion. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
Description | IHR Postgrads talk |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Talk involved questions and discussion. After the talk I was contacted by a freelance technology journalist who was interested in pitching a story about the project. |
Year(s) Of Engagement Activity | 2014 |
Description | Introduction to web archives workshop |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | An introduction to web archives for researchers, archivists and librarians. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.history.ac.uk/events/browse/19262 |
Description | JISC Digital Festival, Birmingham |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Talk increased knowledge of web archiving among university IT professionals and academics. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
Description | Literature Festival |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | Talk increased public knowledge of web archives. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
URL | http://www.cheltenhamfestivals.com/literature/whats-on/2014/big-data-big-opportunities/ |
Description | Oxford Digital Tools round table |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Postgraduate students |
Results and Impact | Presentation of the project at a round table for Oxford History faculty members and other interested university staff and students. |
Year(s) Of Engagement Activity | 2015 |
URL | http://blogs.bodleian.ox.ac.uk/history/tag/digital-humanities/ |
Description | Panel Presentation (IIPC) |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Presentation of the project at the annual general meeting of the International Internet Preservation Consortium |
Year(s) Of Engagement Activity | 2015 |
URL | https://www.youtube.com/watch?v=o4iIdZP4rg8 |
Description | Plenary panel: Big data for arts and humanities |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Postgraduate students |
Results and Impact | Stimulated thinking and discussion about web archives. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
Description | Presentation to a group of visiting PhD students from University of Kent |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Postgraduate students |
Results and Impact | Talk increased knowledge of web archiving among postgraduates. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
Description | Public debate |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Public/other audiences |
Results and Impact | Participation in a roundtable on the subject of 'Web Archives: truth, lies and politics in the 21st century'. The web and social media play a key role in the circulation of news in the 21st century. But increasingly it is becoming difficult to separate fact from fiction and untruth, or even to agree on what constitutes fact. These problems are heightened by the speed with which information can be shared, modified or deleted, the personalisation (both explicit and hidden) that determines which news we see online, and the difficulties of establishing authorship and provenance. This public roundtable discussed the role of web and social media archives in helping us, as digital citizens, to navigate through this complex and changing information landscape. |
Year(s) Of Engagement Activity | 2017 |
URL | https://archivedweb.blogs.sas.ac.uk/digital-conversations/ |
Description | Public lecture: 'Exploring the UK web' |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | A public lecture which involved lively discussion with the audience. |
Year(s) Of Engagement Activity | 2015 |
URL | http://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/2015/11/23/web-archives-talk/ |
Description | Publishers' conference |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Media (as a channel to the public) |
Results and Impact | A more informed attitude to web archives among publishers. No known notable impacts. |
Year(s) Of Engagement Activity | 2014 |
URL | http://eventifier.com/event/Alpsp2014/alpsp?full_embed=true |
Description | Utopia and Dystopia drop-in session |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | A range of people, from computer game enthusiasts to students of computer design and game designers themselves, attended a drop-in session which allowed them to play vintage computer games from the 1980s and 1990s. The games are held in the Internet Archive, and the event explored the importance of capturing this aspect of digital cultural heritage. Many of the attendees had previously been unaware of the Internet Archive and reported that they would use it in the future. |
Year(s) Of Engagement Activity | 2016 |
Description | Web Archiving Week 2017 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | A week of web archiving events and activities was organised in London on 12-16 June 2017. The centrepiece of the programme was a major international conference combining the second RESAW Conference and the rescheduled IIPC Web Archiving Conference, 14-16 June. The week began with a two-day Archives Unleashed hackathon, and a public debate was held on the evening of 14 June, as part of the British Library's series of Data Conversations. Web Archiving Week was hosted by the British Library and the School of Advanced Study, University of London, and organised with the support and assistance of the IIPC, RESAW (A Research Infrastructure for the Study of Archived Web Materials), The National Archives and Archives Unleashed. |
Year(s) Of Engagement Activity | 2017 |
URL | https://archivedweb.blogs.sas.ac.uk/ |
Description | Web archiving workshop |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Other academic audiences (collaborators, peers etc.) |
Results and Impact | The workshop prompted discussion and increased knowledge of web archives, including their pros and cons. It led some of the participants to apply to the project for bursaries. |
Year(s) Of Engagement Activity | 2014 |