Traces through Time: Prosopography in practice across Big Data

Lead Research Organisation: National Archives
Department Name: Technology

Abstract

The growing availability of large digital historical datasets, coupled with the emergence of new methodologies and computer algorithms, has the potential to revolutionise research in the Arts and Humanities. The right Big Data tools and approaches will make it possible to conduct research on the scale of entire populations - addressing key research questions and offering new insights.

Significant investment has already been sunk into the creation of large-scale digital resources. This investment is delivering historical big data of a variety, complexity and coverage that is beyond the scope of existing analytical tools and techniques. The tools themselves, however, have not yet been the subject of large investment. Researchers in this field now require rapid innovation to extend the Big Data approaches pioneered for scientific and business applications, adapting and refining these to deliver practical analytical tools to support large-scale exploration of big historical datasets.

This innovative, multi-disciplinary project will address this challenge, bringing together international research experience in the digital humanities, natural language processing, information science, data mining and linked data, with large, complex and diverse 'big data' spanning over 500 years of British history.

The project's technical outputs will be a methodology and supporting toolkit that identify individuals within and across historical datasets, allowing people to be traced through the records and enabling their stories to emerge from the data. The tools will handle the 'fuzzy' nature of historical data, including aliases, incomplete information, spelling variations and the errors that are inevitably encountered in official records. The toolkit will be open and configurable, offering the flexibility to formulate and ask interesting questions of the data, exploring it in ways that were not imagined when the records were created. The open approach will create opportunities for further enhancement or re-use and offers the further potential to deliver the outputs as a service, extensible to new datasets as these become available. This brings the vision of 'bring your own data' closer, to find and link individuals in new combinations of datasets, from the widest range of historical sources.
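
As an illustration of what handling 'fuzzy' names can involve, the following is a minimal sketch in Python of comparing names that differ in spelling, punctuation or recorded alias. It is not the project's toolkit: the function names (`normalise`, `name_similarity`, `best_alias_similarity`) are hypothetical, and the similarity measure is simply the standard library's `difflib.SequenceMatcher`, chosen here for convenience.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case the name, drop punctuation and digits, collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(kept.split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; 1.0 means identical after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_alias_similarity(names_a: list[str], names_b: list[str]) -> float:
    """Compare every recorded alias of one person with every alias of another
    and keep the best score (hypothetical helper for illustration only)."""
    return max(name_similarity(a, b) for a in names_a for b in names_b)


if __name__ == "__main__":
    # A spelling variant scores higher than an unrelated name.
    print(name_similarity("Jon Smyth", "John Smith"))
    print(name_similarity("Jon Smyth", "Mary Jones"))
```

A score of this kind would be only one input to a linking decision; dates, places and other attributes would normally be compared as well.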

The project will benefit academic and leisure historians alike, across the whole spectrum of digital history:
* It will assist historians seeking evidence of life-events through a collective study of individual biographies.
* It will help genealogists find and trace the paths of their ancestors across the landscape of the official record.
* It will help researchers by signposting routes between historical collections, enabling links between datasets at a deep level and creating opportunities for discovery.
* For cultural organisations it will illuminate effective approaches to creation and curation of new digital datasets to optimise their potential for linking and re-use.
* It will provide evidence to support policy making, helping balance the demands of Data Protection and information assurance with those of open data and Freedom of Information.
* It will provide a methodology to underpin the creation of new tools and resources, supporting the digital economy.

The project aims to extend the boundaries of current research in three important directions: to increase the extent and diversity of the data that can be handled; to improve support for inconsistent or fuzzy data; and to enable confidence measures to be tailored to fit specific research aims. These advances will extend the practical application of data linking techniques, enabling them to be applied to the large, diverse datasets that are continually emerging, to help answer historical research questions at a macro and micro scale. Our vision is to create a generic, extensible approach to tracing the lives of real people: through time and across the documentary evidence that survives them.

Planned Impact

The 2012 Heritage Counts survey highlights that successful heritage organisations are 'anchor institutions' - the civic, cultural and intellectual institutions that help make places resilient. Creating routes into Big Data and equipping users to engage with our shared documentary heritage is vital and supports that resilience into a new digital age. We propose differing but complementary strategies, to maximise impact.

The 2012 ARA/CIPFA Public Services Quality Group report states that 56% of visitors to archives described the purpose of their visit as 'family history research'. Developing new tools for navigating Big Data will equip family and leisure historians to find and trace the paths of individuals across the landscape of the official record and beyond. Regular talks and events will highlight findings; users will benefit from new exhibitions showcasing the stories that emerge from the data; and updated research guidance will support them in applying the tools in pursuit of their own investigations.
TNA's User Advisory Group will be consulted to surface user needs. Articles submitted to popular history magazines will further engage genealogists and local history organisations. The research team will post updates and podcasts and blog on popular themes such as the First World War, where we anticipate a wealth of stories emerging from the records. We will hold a 'hack day' for researchers to work on their own datasets alongside our developers. The Institute of Historical Research will host the final dissemination event, for scholars, user groups and educators across disciplines. An advisory board will be established, drawing on expertise from previous work, such as the AHRC funded Fine Rolls and ESRC funded Integrated census microdata project.
TNA's press office, licensing and commercial partners are used to handling significant attention around file releases and public interest stories. All partners will draw on their wide network of international media, publishing and business contacts to best promote the project and its findings.

As a government department and an executive agency of the Ministry of Justice, TNA is well placed to facilitate connections with policy makers. This research has the potential to impact policy in the Open Government Data, Transparency, Data Protection and privacy arenas by offering a robust approach to linking individuals across datasets, and providing evidence to underpin policy development.

For archives and cultural heritage organisations, this work will inform future digitisation, data creation and capture methodologies and will encourage archive services and other data holders to work together to enable exploration across their holdings. For public sector data holders, it will enable both greater access to records that need not be closed, and greater assurance that records are not being opened inappropriately. The latter has enormous implications for the support of information security and assurance.

TNA has close ties with UK public sector bodies and cultural organisations. We work closely with the Imperial War Museum and British Library to share knowledge. These relationships will help extend the methodology to ensure application to other collections. The availability of a robust methodology for linking personal entities has the potential to dramatically change retention, preservation and cataloguing practices.

The research team will use their links to schools (TNA's award winning Education & Outreach team) to work with pupils and teachers to delve into the stories behind the history and to engage the smartphone generation. Through our case studies, students will learn to engage with Big Data, both from a historical perspective and as it impacts the lives of young people growing up in a digital age. This will contribute to government's aspiration to develop digital knowledge and skills for the future, which, in turn, supports UKplc to become more agile and innovative.
 
Description Method and tools for linking records about individual people working with Big Data from different types of historical records from different time periods. Statistically robust methods for assigning confidence measures to the links that are discovered.
Exploitation Route The findings are being taken forward by The National Archives, both to further refine the methods and algorithms that have been created and also to deploy them to improve operational public services. The methods have been deployed by The National Archives commercial partners to accelerate the release of public records for use as a research resource. The data model will be incorporated into The National Archives data model for presenting and enabling re-use of data from born-digital records.
Sectors Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Culture, Heritage, Museums and Collections

 
Description The National Archives (TNA) has applied the algorithms and tools developed by this research project to enhance access to public records via TNA's Discovery service, providing new routes to navigate the collections. This service launched in Spring 2016. A further enhancement in March 2017 saw the addition of links from TNA's records to records held at other public sector archival institutions, where these records are likely to relate to the same individual. The National Archives' commercial partners have applied the approaches developed by the project to link individuals in newly digitised records with individuals in external datasets, including death registration data. This will accelerate the release of these records into the public domain, which will in turn extend access to these records for academic researchers and the general public.
First Year Of Impact 2016
Sector Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Culture, Heritage, Museums and Collections
Impact Types Cultural, Economic

 
Description A commitment to further investigate how best to manage uncertainty in our data about records and the risks that result, building on lessons from the Traces Through Time research project and the work done to both develop and communicate probabilistic links between records and entities.
Geographic Reach National 
Policy Influence Type Citation in other policy documents
 
Title Data linking algorithms and statistical model 
Description A statistical model for connected archival data and associated algorithms to support probabilistic data linking of personal data with robust measure of confidence in each link. The model and approach are generic and extensible to linking other entities and other identifying attributes. 
Type Of Material Computer model/algorithm 
Provided To Others? No  
Impact The model and algorithms enable The National Archives (and other heritage institutions) to connect individual objects and collections on the basis of a range of entities of interest. The model has been implemented for linking by Person, but is readily extensible to other entities such as Place, Time or Event. 
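
To illustrate what a confidence measure for a probabilistic link can look like, the following is a minimal sketch in Python. It uses the classical Fellegi-Sunter style of record linkage as a stand-in; the record does not state which statistical model the project adopted, so this should not be read as the project's method. The attribute names, the `FIELD_PARAMS` values (m- and u-probabilities) and the prior are all hypothetical, chosen only to make the arithmetic concrete.

```python
import math

# Hypothetical parameters per attribute:
#   m = P(attribute agrees | records refer to the same person)
#   u = P(attribute agrees | records refer to different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "forename": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum the log-likelihood ratios for each compared attribute.
    Attributes absent from `agreements` contribute no evidence."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]
        if agrees:
            total += math.log(m / u)
        else:
            total += math.log((1 - m) / (1 - u))
    return total


def link_confidence(agreements: dict[str, bool], prior: float = 0.001) -> float:
    """Combine the prior probability of a match with the evidence
    and return the posterior probability that the link is correct."""
    log_odds = math.log(prior / (1 - prior)) + match_weight(agreements)
    return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    full = {"surname": True, "forename": True, "birth_year": True}
    partial = {"surname": True, "forename": False, "birth_year": True}
    print(link_confidence(full), link_confidence(partial))
```

Because the evidence for each attribute is additive in log-odds, the same structure extends to further attributes or to other entity types by adding entries to the parameter table.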
 
Description Implementation of research outputs at TNA 
Organisation The National Archives
Country United Kingdom 
Sector Public 
PI Contribution Continuation of the project beyond the initial AHRC funded period, with support from the National Archives. This phase of the project shifted the focus of the project from research to 'R&D' for practical implementation of the research findings in an operational environment. There is a continued strong research element for further improvements to the algorithms developed in the earlier phase.
Collaborator Contribution TNA's project team continued to work on this project, with the addition of Software-Engineering and User Experience Design skills from TNA.
Impact Engagement activities - listed separately; Software engineering - to embed the research outputs into TNA's Discovery service (to be released to the general public in Spring 2016); Software engineering - to refactor the research code to deliver an efficient, enterprise-level linker, capable of processing high volumes of data; Research - further improvements and refinement of linking algorithms and the underlying statistical approaches; Economic - collaboration with external genealogy service provider to apply linking techniques to matching data in commercially digitized records; User Experience Design - to surface the linked data as a new feature for TNA's Discovery service to provide an intuitive and highly usable service for researchers; Data - listed separately
Start Year 2015
 
Title Data model 
Description RDF Data Model for describing an instance of a 'person' entity in a public record or other data source. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Schema is available for re-use, the team have invited comments and feedback. 
URL https://github.com/nationalarchives/traces-through-time/tree/master/schema
 
Title Traces through Time - Linker v1 (Leiden) 
Description Linker software which accepts information about people (in a format defined by the TTT data model), links occurrences which may relate to the same individual and assigns a confidence score to those links. 
Type Of Technology Software 
Year Produced 2015 
Impact Software is available for re-use. 
URL https://github.com/nationalarchives/traces-through-time/tree/master/Leiden
 
Description 1st International SEAHA Conference: Mining historical documents - technical and enterprise perspectives 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presentation on 'Mining historical documents - technical and enterprise perspectives'. Discussion of the Traces through Time architecture, comparison with previous project (CHARTEX) and discussion of the applications of deep vs shallow analysis of small collections vs 'big data' and tools for academic researchers vs enterprise level tools for the general public.
Year(s) Of Engagement Activity 2015
URL http://www.seaha-cdt.ac.uk/conference-programme/
 
Description Academic outreach via The National Archives academic newsletter 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Raised the profile of the project amongst academic historians (including digital historians). This is a core group of users at the National Archives and a key target audience for the outputs of this project.
Year(s) Of Engagement Activity 2014
URL http://www.nationalarchives.gov.uk/documents/research-newsletter-spring-2014.pdf
 
Description Attendance at Digital Panopticon data linkage workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Discussion of methods and approaches and early results of both projects. Discussion of sharing and future collaboration.

Expected future collaboration.
Year(s) Of Engagement Activity 2014
URL http://www.digitalpanopticon.org/?p=669
 
Description Blog post on The National Archives website 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Blog post about some of the interesting names and patterns of names that were discovered during the course of the research. The blog received 8 positive comments from the general public as feedback.
Year(s) Of Engagement Activity 2015
URL http://blog.nationalarchives.gov.uk/blog/whats-name/
 
Description Engagement with Government Data Science Community of Interest 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Data Science Community of Interest group - Sonia Ranade, PI, and Mark Bell, project researcher, represent the project on the Data Science Community of Interest Group, a cross-government group that shares best practice on big data research and showcases Big Data projects across Government. The discussion at this group has sparked follow-up from Government Departments - and led to the partnership with the Government Actuary's Department.
Year(s) Of Engagement Activity 2013, 2014, 2015, 2016, 2017
 
Description GDS Data-science show-and-tell 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Discussion of challenges and analytical tools and techniques

Raised awareness of the project within the Government data science community. Raised awareness of the impacts of policy on digital archives and digital history research.
Year(s) Of Engagement Activity 2014
 
Description Government Data Science competition 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact Open-table session: presentation of the project, its technical approach, aims and outcomes. Discussion of further work and possible future collaborations.
Year(s) Of Engagement Activity 2015
 
Description Government Heads of Analysis conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Heads of Analysis Conference, 2 July 2014 - Mark Bell presented at the Heads of Analysis conference - a conference about Big Data in government, which brought together leaders in the economics, operational research, science and engineering, social research, statistics and actuarial services professions.

The talk gave other practitioners useful insights into approaches and challenges in using historical Big Data. For the project it helped form contacts within the government analytical professions which should result in future partnerships.
Year(s) Of Engagement Activity 2014
 
Description IHR Winter Conference (Senate House, London) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Conference on the production of the archive: Presentation on the project, the challenges of linking fuzzy data and implications for the way that digital and digitized records are produced. Plans to embed the research outputs into TNA's online Discovery service were discussed.
Year(s) Of Engagement Activity 2015
URL http://winterconference.history.ac.uk/
 
Description Institute of Historical Research: Digital History / Archives and Society seminar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Talk about the project's aims and outcomes, technical approach, examples and case-studies drawn from The National Archives collections. Discussion of how best to embed the new linking features into TNA's Discovery service.
Year(s) Of Engagement Activity 2015
URL http://www.history.ac.uk/events/seminars/321
 
Description Lunchtime seminar (technical) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Discussion of the project's technical approach and its implications for future digitisation work at The National Archives
Year(s) Of Engagement Activity 2015
 
Description Lunchtime seminar for National Archives staff 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Presentation about the project - an introduction to probabilistic linking, techniques for working with fuzzy data, current challenges, examples and case-studies drawn from The National Archives records, discussion on future work.
Year(s) Of Engagement Activity 2015
 
Description Presentation and discussion at workshop (Graphical display: Challenges for Humanists) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Presentation to an audience of scholars in Digital Humanities - exploring the potential of computational and statistical techniques to transform access to archival (and other Culture & Heritage / GLAM sector) collections. Discussion on the potential application of the techniques to Digital Humanities research.
Year(s) Of Engagement Activity 2015
URL http://www.crassh.cam.ac.uk/events/26096
 
Description Presentation at 'Common Ground' - the first national event of the AHRC Commons 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Discussion, debate and sharing of ideas, resources and good practice with other members of the AHRC commons engaged in data science / digital humanities work.
Year(s) Of Engagement Activity 2016
URL http://www.ahrc.ac.uk/about/ahrc-commons/
 
Description Presentation at DCDC (Discovering Collections, Discovering Communities) conference, October 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Formal presentation of the project's research, outputs and demonstration of the new features created for TNA's catalogue. Q&A and interest from delegates including discussion of how the linking might be extended to other nationally important collections.
Year(s) Of Engagement Activity 2016
URL http://dcdcconference.com/
 
Description Presentation at TNA academic research day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Mark Bell, project researcher, spoke by invitation at the National Archives academic research seminar.

The talk communicated the project and our early findings and outcomes to a forum of academic historians.
Year(s) Of Engagement Activity 2014
URL http://www.nationalarchives.gov.uk/about%5Cresearch-scholarship.htm
 
Description Presentation at TNA's 'catalogue day' 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Presentation at TNA's catalogue day describing the project, its outputs and benefits to the general public. The new features demonstrated were well received, with many requests and suggestions for extending the feature to additional collections.
Year(s) Of Engagement Activity 2016
URL http://www.nationalarchives.gov.uk/about/visit-us/whats-on/events/
 
Description Presentation at The National Archives 'Big Ideas' seminar 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Presentation and discussion of the project, its outputs and future work as part of TNA's 'Big Ideas' series of seminars.
Year(s) Of Engagement Activity 2016
 
Description Presentation to TNA's ASD team 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Presentation of the project to the team at The National Archives who lead engagement and advice to the wider archives sector. An introduction to the project, its aims and outcomes. Implications for the way digital and digitised archival collections are created and managed. Implications for new modes of access to archival collections. Ensuring the team is well-equipped to share these ideas and open up discussions with the wider sector.
Year(s) Of Engagement Activity 2015
 
Description Project announcement 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Announcement, 6 February 2014 - The National Archives issued a press release when the award from AHRC was formally announced (the press release included quotes from Ministry of Justice Minister Simon Hughes), and ensured Ministers were formally briefed on the project's scope and ambitions. The press release was uploaded onto The National Archives' website. We have also created a project page for Traces through Time on the National Archives website which has received around 350 visits so far in 2014.

Contact from members of the public asking how to get involved (received via TNA webmaster).
Year(s) Of Engagement Activity 2014
URL http://blog.nationalarchives.gov.uk/blog/big-data-funding-success/
 
Description Project cited on an external blog 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Raised profile of the project within the academic sector
Year(s) Of Engagement Activity 2014
URL http://languageofaccess.org/2014/04/25/visualisation-workshop/
 
Description Report to the Lord Chancellor's forum on Historical Research 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact Formal report to ensure that the forum is up to date on the use of archives as data, and the research questions and issues that arise from this.
Year(s) Of Engagement Activity 2014
 
Description South West and Wales Digital Humanities academic meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact The presentation explained how TNA uses digital resources to help solve problems and highlighted some of the projects that have been developed, including Traces through Time. Discussion of digital collaborations with the academic sector (some of which have been AHRC funded). Other ideas and proposals were discussed as areas that we might explore further, such as network analysis and relationship mapping.
Year(s) Of Engagement Activity 2015
 
Description TALK magazine 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Article in The National Archives staff magazine about plans to embed the project's outputs into TNA's Discovery service. Part of ensuring that public-facing staff are aware of these changes and able to offer advice.
Year(s) Of Engagement Activity 2015
 
Description TECHNE congress in digital humanities 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact The presentation explained how TNA uses digital resources to help solve problems and highlighted some of the projects that have been developed, including Traces through Time. Discussion of future digital collaborations with the academic sector and other ideas and proposals that we might explore further, such as network analysis and relationship mapping.
Year(s) Of Engagement Activity 2015
 
Description Traces through Time - end of project conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact End-of-project conference giving the opportunity to present and discuss the work of the project in greater detail and to set this in the context of other key 'big data' projects that were funded under the same call. Discussion of future work and how best to re-use the research outputs of the project.
Year(s) Of Engagement Activity 2015
 
Description Traces through Time - workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Workshop allowing participants to use the Traces through Time tools hands-on, to experiment with linking data and to see their own data processed by the linking algorithms.
Year(s) Of Engagement Activity 2015
 
Description Workshop at UK Archives Discovery forum 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A workshop aiming to introduce the potential of data-science approaches to interacting with archival collections to members of the archives and records management profession. The workshop gave an overview of tools and approaches for linking people in historical big data and showcased the methods developed by the Traces through Time project to identify when two records relate to the same individual. Participants were invited to "Find out more about what was involved in this work, and take a look through the 'viewer', to get a sense of how the algorithms work". The aims were to introduce data science ideas to the profession, give a concrete demonstration of the type of innovation that is possible when these techniques are applied, show how this can be applied to development of new public services and help the participants understand the routes to implementing these approaches for their own collections.
Year(s) Of Engagement Activity 2016
URL http://media.nationalarchives.gov.uk/index.php/tag/ukad-forum-2016/?order=asc