Person-specific automatic speaker recognition: understanding the behaviour of individual speakers for applications of ASR

Lead Research Organisation: University of York
Department Name: Language and Linguistic Science

Abstract

Automatic speaker recognition (ASR) software processes and analyses speech to make decisions about whether two voices belong to the same or different individuals. Such technology is becoming an increasingly important part of our lives: it is used as a security measure when gaining access to personal accounts (e.g. banks), or as a means of tailoring content to a specific person on smart devices. Around the world, ASR systems are commonly used for investigative and forensic purposes, to analyse recordings of criminal voices where identity is unknown. Yet systems perform better with some voices than with others. Therefore, a fundamental question remains: what makes a particular voice easy or difficult for ASR to recognise?

State-of-the-art systems, using techniques from artificial intelligence (AI), have shown marked improvements in performance compared with older approaches. However, three issues remain. Firstly, ASR research has focused on minimising the effects of well-known technical factors, such as channel (e.g. mobile vs. landline telephone), recording quality and microphones. In resolving these technical challenges, large improvements in systems have been achieved. Yet little is known about how speakers themselves affect ASR performance. Secondly, ASR research has been interested in reducing overall error rates. Yet, in the real world (where innocence and guilt may be at stake), the key question is: what is the chance the system has made an error in this specific instance? Finally, while AI approaches have undoubtedly brought improvements in overall performance, such algorithms make it more difficult to know what information systems are relying on to make decisions. This is problematic for forensic experts, who must explain their methods to non-expert end users, such as judges, juries, lawyers and police.

This project is the first to systematically assess how individual speakers perform within and across ASR systems and to compare speaker effects, in terms of linguistic properties of voices or speaker demographics (e.g. accent, ethnicity, gender), with well-studied technical effects. The aim is to use this knowledge to improve ASR systems by flagging potentially problematic speakers and by developing methods to handle them. We will use novel, interdisciplinary methods, bringing together expertise from speech technology, linguistics, and forensic speech science. Our collaboration with commercial ASR vendor Oxford Wave Research allows us to adapt and change systems to assess the effects on results for individual speakers. We will also use highly controlled, small-scale experiments to assess speaker effects in isolation, as well as much larger datasets of more forensically realistic recordings, provided by our project partners, the UK Ministry of Defence and the Netherlands Forensic Institute. The availability of a variety of datasets also allows us to assess the generalisability of results across a range of voices.

This project is entirely driven by real-world issues and so the results will deliver considerable impact to a wide range of stakeholders. By understanding more about individuals, our results have the capability to improve overall ASR performance. This will be of benefit to users and developers of ASR systems. The results will also have specific implications for forensic and investigative applications, guiding data collection for validating methods (something which experts are under increasing regulatory pressure to do) and providing a framework for combining ASR and linguistic analysis. In doing so, through engagement with the legal community, we aim to effect a change in the status of ASR in England and Wales, such that it is admissible as expert evidence. We will deliver impact via knowledge exchange with a Forensic Advisory Panel consisting of representatives from forensic speech science, law enforcement, and the legal community.
 
Description Towards the growth of forensic speech science
Amount £323,596 (GBP)
Organisation University of York 
Sector Academic/University
Country United Kingdom
Start 08/2024 
End 08/2026
 
Description YorVoice
Amount £300,000 (GBP)
Organisation University of York 
Sector Academic/University
Country United Kingdom
Start 08/2023 
End 08/2025
 
Title PASR WorkPackage 1 Dataset 
Description This is a dataset of recordings of (currently) nine phoneticians producing a fixed, read text in 29 speaking conditions relevant to the project (voice qualities, vocal settings, accent guises, disguises). Each participant produced each condition three times across three or four recording sessions (with at least a week between sessions). The recordings were made simultaneously in four technical conditions (head-band microphone, near microphone, far microphone and landline-to-VoIP).
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
Impact We are currently using this dataset as the basis for training automatic voice quality classifiers and as potential adaptation data for fine-tuning automatic speaker recognition systems. This work is still in its infancy and more will be shared once further progress has been made. 
 
Description Netherlands Forensic Institute: Project Partner 
Organisation Netherlands Forensic Institute
Country Netherlands 
Sector Public 
PI Contribution The project and team provide intellectual input on the use of automatic speaker recognition systems for the purposes of forensic casework at the Netherlands Forensic Institute.
Collaborator Contribution The Netherlands Forensic Institute provide guidance on the overall and specific directions of each of the work packages. Our contact at NFI regularly attends our project meetings. He also provides comments and feedback on papers for publication and presentation at conferences.
Impact Sensitivity of x-vectors and automatic speaker recognition scores to vocal variation. Automatic speaker recognition with variation across vocal conditions: a controlled experiment with implications for forensics.
Start Year 2022
 
Description IAFPA 2022 - Person-specific automatic speaker recognition: understanding the behaviour of individuals for applications of ASR 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact We introduced the project to an international audience of almost 100 people, made up of professional practitioners, academics and students who regularly conduct forensic casework.
Year(s) Of Engagement Activity 2022
 
Description IAFPA 2023 - Effects of vocal variation on the output of an automatic speaker recognition system 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Paper presented to an international audience of professional practitioners, academics and students who regularly conduct forensic casework. The aim was to share research findings, leading to new research directions informed by practice, and to influence current practice.
Year(s) Of Engagement Activity 2023
 
Description IAFPA 2023 - Impact of the changes in long-term acoustic features upon different-speaker ASR scores 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Paper presented to an international audience of professional practitioners, academics and students who regularly conduct forensic casework. The aim was to share research findings, leading to new research directions informed by practice, and to influence current practice.
Year(s) Of Engagement Activity 2023