Information Extraction and Entity Linkage in Historical Crime Records

Lead Research Organisation: University of Sheffield
Department Name: History

Abstract

This project will develop and refine information extraction techniques by
working with one of the most intractable and largely unstructured sources in
the humanities: historical newspapers. Addressing a challenge identified
during the recently completed Digital Panopticon (DP) project, this project
will develop methods of extracting information about crimes and police
court trials from the newspapers for linkage to the existing 'life archives' of
convicts in the DP. First, the student will conduct a user analysis by
interviewing current researchers using these sources. Second, a sample of
newspaper texts will be marked up manually, as a basis for machine learning
of relevant linguistic patterns. These techniques will then be refined in an
iterative process. Third, algorithms will be developed to allow the
information identified to be linked to the DP Life Archives. Finally, the results
of this process will be evaluated by re-interviewing the user group. Project
outcomes will be 1) advances in information extraction and entity linkage
techniques which can be applied to a much wider range of datasets in other
subject domains, and 2) greater depth and comprehensiveness of content in
the DP which will enhance scholarship in history and criminology, as well as
its public impact.
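
As a rough illustration of the extraction and linkage steps described above, the following minimal Python sketch pulls a defendant mention out of a report and matches it against candidate life archives. Everything in it is an assumption made for illustration: the sample report and archive records are invented, the names `LifeArchive`, `extract` and `link` are hypothetical, and a hand-written regular expression stands in for the machine-learned patterns the project proposes. It is not the project's method or the DP's data model.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Invented sample text, not taken from any real newspaper.
REPORT = "John Smyth, 24, labourer, was charged at Bow Street with stealing a watch."


@dataclass
class LifeArchive:
    """Hypothetical, much simplified stand-in for a DP life archive record."""
    archive_id: str
    name: str
    birth_year: int


# Invented candidate records for the linkage step.
ARCHIVES = [
    LifeArchive("la-001", "John Smith", 1846),
    LifeArchive("la-002", "Joan Smythe", 1820),
    LifeArchive("la-003", "Thomas Brown", 1846),
]

# Hand-written pattern standing in for a learned extractor.
DEFENDANT = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), (?P<age>\d{1,2}),.*?"
    r"charged at (?P<court>[A-Z][a-z]+(?: [A-Z][a-z]+)*) "
    r"with (?P<offence>[^.]+)\."
)


def extract(report):
    """Return name, age, court and offence as strings, or None if no match."""
    match = DEFENDANT.search(report)
    return match.groupdict() if match else None


def link(mention, report_year, archives, threshold=0.8, year_tolerance=2):
    """Pick the archive whose name is most similar to the mention.

    Candidates are first filtered by estimated birth year, then ranked by
    character-level similarity; no link is made below the threshold.
    """
    estimated_birth = report_year - int(mention["age"])
    best, best_score = None, 0.0
    for archive in archives:
        if abs(archive.birth_year - estimated_birth) > year_tolerance:
            continue
        score = SequenceMatcher(
            None, mention["name"].lower(), archive.name.lower()
        ).ratio()
        if score > best_score:
            best, best_score = archive, score
    return best if best_score >= threshold else None


mention = extract(REPORT)
linked = link(mention, report_year=1870, archives=ARCHIVES)
```

Filtering on an estimated birth year before comparing names is one simple way to tolerate spelling variants such as "Smyth" and "Smith" without linking people of clearly different ages. The threshold and tolerance values here are arbitrary and would need tuning against manually marked-up samples.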

The project aligns with the EPSRC's Digital Economy research theme,
specifically the 'Content and Consumption' theme, which addresses
'research into how digital technologies enable the creation, co-creation and
exchange of content for social, cultural or business purposes', in particular by
focusing on 'research into tools, processes and platforms'. A subtheme of
the Digital Economy is 'Data Information and Knowledge', focusing on the
development of digital technologies, including natural language processing,
'designed to produce, capture, manage, understand and interpret large
amounts of data in specific application domains'.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/R513313/1                                   01/10/2018  30/09/2023
2277610            Studentship   EP/R513313/1  01/10/2019  31/07/2024  Callum Booth