Identifying & Classifying Bias in Cultural Heritage Catalogues: Applying Natural Language Processing to University of Edinburgh Archival Descriptions

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

The objective of this project is to develop a context-informed approach to bias detection, executed as a series of case studies beginning with the University of Edinburgh's Archive. Motivated by separate yet related strands of research in the fields of Natural Language Processing (NLP) and Cultural Heritage, the project identifies an opportunity to improve large-scale, automated bias detection. Taking a cross-disciplinary approach, the project applies NLP and data visualisation to archival descriptions. NLP techniques such as topic modelling and sentiment analysis will be used to analyse and classify the language of the Archive's descriptions. Because bias is context-dependent, data visualisation offers a suitable approach to presenting the results of the NLP analysis: interactive visualisations will situate the results in their associated geographic areas and time periods, enabling people to see the associations that Archive items have with different types of bias. The project will propose a visualisation framework for presenting bias in human language content, which, to the author's knowledge, has yet to be attempted. Rather than eliminate bias, the project seeks to identify and classify it, arguing that bias deserves a place in cultural heritage institutions.
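
To make the proposed pipeline concrete, the following is a minimal sketch of topic modelling and sentiment analysis over description text, assuming scikit-learn's LDA implementation and NLTK's VADER analyser; the sample descriptions and all parameter choices are illustrative inventions, not records or settings from the Archive.

```python
# Sketch: topic modelling and sentiment analysis over archival descriptions.
# Requires scikit-learn and NLTK; the descriptions below are invented examples,
# not records from the University of Edinburgh's Archive.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("vader_lexicon", quiet=True)

descriptions = [
    "Letters from a missionary describing the natives of the region.",
    "Photographs of university staff and buildings, 1930-1935.",
    "Papers of a professor's wife, catalogued with her husband's estate.",
]

# Topic modelling: learn word distributions that group descriptions by theme.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top_terms)}")

# Sentiment analysis: score the language of each description.
analyzer = SentimentIntensityAnalyzer()
for text in descriptions:
    score = analyzer.polarity_scores(text)["compound"]
    print(f"{score:+.2f}  {text}")
```

In the full pipeline described above, topic assignments and sentiment scores would be joined with each item's place and date metadata, so that the interactive visualisations can plot them across geographic areas and time periods.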

Bias, though problematic when one-sided, is informative when presented transparently. Bias communicates the perspective of specific groups of people during specific periods in history; recording historical biases informs understandings of societal evolution and of the various perspectives that have existed on a topic [1]. Identifying different types of bias also helps researchers understand how representative their dataset is: the presence of more types of bias suggests a more representative dataset. This project seeks to develop techniques for identifying and classifying bias that will bring value to cultural heritage institutions and the public they serve, making bias transparent in human language content anywhere from an archival description to a social media post.

The project seeks to develop bias-detecting technology beginning with a case study of free-text, human-written archival descriptions. Cataloguers first wrote archival descriptions on paper in the 1930s, then in databases from the 1970s onwards. Explicitly, the language of archival descriptions reflects its historical context, using terms considered racist, sexist, or otherwise inappropriately biased today. Implicitly, information missing from archival descriptions about certain groups of people reflects historical biases. Both types of bias can be found in textual data beyond cultural heritage catalogues, such as newspapers and social media posts. As a result, while improving the transparency of the Archive's descriptions, the outcomes of this project could also inform research on returning representative search results [5], implementing fair algorithms [2], and identifying bias in social media [3, 4].
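
As a rough illustration of the explicit case only, a description could be flagged when it contains terms from a curated lexicon of dated or contested terminology. This lexicon-matching sketch is not the project's stated method; the term list, matching strategy, and descriptions below are hypothetical placeholders.

```python
# Sketch: flagging explicitly biased language with a curated term lexicon.
# CONTESTED_TERMS is a hypothetical placeholder; a real study would use a
# vetted vocabulary and context-sensitive matching rather than bare tokens.
import re

CONTESTED_TERMS = {"natives", "primitive", "spinster"}  # illustrative only

def flag_description(text: str) -> list[str]:
    """Return any contested terms appearing in a description."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(CONTESTED_TERMS & tokens)

descriptions = [
    "Letters from a missionary describing the natives of the region.",
    "Minutes of the university court, 1871-1880.",
]

for text in descriptions:
    hits = flag_description(text)
    if hits:
        print(f"flagged {hits}: {text}")
```

Implicit bias, by contrast, cannot be caught by matching terms that are present: it would require comparing what the descriptions record, and omit, about different groups of people.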

References

1. Holterhoff, K. (2017). "From Disclaimer to Critique: Race and the Digital Image Archivist." Digital Humanities Quarterly 11(3). URL: http://digitalhumanities.org:8081/dhq/vol/11/3/000324/000324.html

2. IEEE. (2016). Ethically Aligned Design: A Vision for Prioritizing Human Wellbeing with Artificial Intelligence and Autonomous Systems. Version 1. URL: http://standards.ieee.org/develop/indconn/ec/autonomous%20systems.html (accessed 12.05.2018)

3. Recasens, M., Danescu-Niculescu-Mizil, C., Jurafsky, D. (2013). "Linguistic Models for Analyzing and Detecting Biased Language." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 1650-1659.

Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/R513209/1      |              |              | 01/10/2018 | 30/09/2023 |
2356289           | Studentship  | EP/R513209/1 | 01/04/2020 | 30/09/2023 | Lucy Havens