Humans and machines: novel methods for testing speaker recognition performance

Lead Research Organisation: University of York
Department Name: Language and Linguistic Science

Abstract

As humans, we regularly use the voice as a means of recognising people - for example, when someone calls us on the telephone, or shouts to us from another room. While humans are relatively good at recognising familiar, or at least predictable voices, identifying unfamiliar voices is much more difficult. This is often the task in forensic and investigative contexts. In such cases, a comparison is made of the voices of an unknown criminal and a known suspect, with the ultimate aim of assessing the likelihood that they belong to the same individual. Increasingly, around the world, speaker recognition machines (i.e. pieces of software) are used for these purposes. However, a critical question remains unanswered: do machines recognise speakers in the way that humans do?

This question has received relatively little attention in the literature. The studies that have examined this issue are all small-scale and simply compare the results of human recognition with those of machine recognition using overall error rates. However, what is much more important is understanding the contexts in which one method might outperform the other, and whether there is any benefit in combining the approaches. In addressing these issues, our research will provide a better understanding of how speaker recognition machines work and how they might be improved. Further, previous work has overlooked the many factors that may affect human recognition performance, such as cognitive bias. In this project, we assess how human judgements vary as a function of different amounts of contextual information. This is especially relevant in the context of a criminal trial, where other information pertinent to the case, or even a forensic expert providing voice evidence, could influence the decision-making process involved in the speaker recognition task.

In order to compare and combine human and machine responses, we will develop a bespoke computer game that elicits human judgements that are conceptually equivalent to those produced by the machine. In doing so, we will also test the viability of using the voice as the central element in a computer game, an area of game development that has received relatively little attention.

The project has a number of specific research questions:
1. How do humans and machines perform at speaker recognition relative to each other, and can we improve performance by combining the two approaches? To what extent, therefore, do these methods capture the same information?
2. In what contexts (using speakers with different regional accents and diverse speech samples with varying durations and recording quality) do humans outperform machines?
3. How do different listener groups perform in speaker comparison tasks? Does familiarity with the regional accent improve performance?
4. To what extent are human judgements affected by contextual information that may occur in a forensic case, such as (i) the knowledge that it is a criminal case, (ii) other evidence from the case, or (iii) a forensic expert's opinion?

Planned Impact

The proposed research is strongly applied in nature. The research questions stem from real world issues relating to speaker recognition, particularly in forensic contexts. Thus, our results will be of benefit to a range of non-academic users:

(1) Judicial contexts

The primary beneficiaries of our research will be forensic phoneticians who conduct speaker recognition for the purposes of casework. The results will allow experts to assess the conditions under which automatic systems may perform better or worse, and help experts to explain what such systems are actually doing, so that courts can make more informed decisions about the evidence. Our results can also be used by experts to make the delivery of their expert opinion to courts more accessible in cases involving earwitnesses. We will be able to disseminate our results to practising forensic phoneticians at J P French Associates (the UK's largest and longest established forensic speech and audio laboratory), given their long-standing relationship with our University Department. We will also share our results at the International Association of Forensic Phonetics and Acoustics (IAFPA) conference. IAFPA is the world's only association for forensic phonetics, and the conference attracts a large non-academic audience from private and governmental laboratories, security and law enforcement agencies, and police forces.

Since our research will provide insights into the errors made by speaker recognition systems, the results will also be of value to commercial developers of such systems, principally in helping to improve their performance in the future. As part of the project, we will discuss our results with Nuance Communications (the largest commercial supplier of speaker recognition technology in the world, and the producer of the system we use in this project).

Detailed results of the comparison of the effects of verbal vs numerical expert witness statements on decision-making will be directed specifically to beneficiaries in the legal domain. We will discuss our results with relevant end users so that the courts are made aware of the potential effect of the wording of expert witness statements on jurors.

(2) Game designers

A central part of this project is to develop a game that has voice as its central component. The voice is an under-utilised modality for computer games and so our project acts as a test of the viability of including voice elements more centrally into games. We will present the findings of our research to the co-director of 'Betajester' (Adam Boyne), a commercial company that develops immersive and interactive digital experiences with a focus on game development. We will also discuss the potential for developing our game for projects in the future.

(3) Teaching and schools

The task of speaker recognition is familiar to the general public through TV shows and films. Its real world applications also make it a fascinating topic for students. We intend to further maximise the impact of this project by engaging with school teachers and students in various ways. We will include a session on the topic of speaker recognition (using findings from the project) as part of a CPD course aimed at A-level English Language teachers. Such courses have run previously at York, and Llamas has been centrally involved in their development. We will also contact schools directly to offer talks about our project for students. Finally, we will offer our findings as a case study for the York English Language Toolkit site; this site provides resources on current research from the Department for teachers and students of A-level English Language.
 
Description In this project we tested the speaker recognition performance of human listeners and computer software (known as automatic speaker recognition systems), both in isolation and in combination. We also examined the effects of potential sources of variability on listener judgements - factors which would not affect an automatic system. Our research has three key findings:

- Automatic speaker recognition systems generally outperform human listeners with unfamiliar voices. However, human listeners are better at judging that two samples belong to different speakers when the accents of the two voices do not match. Combining an automatic system with judgements from the very best listeners can improve overall performance, which indicates that humans and automatic systems are capturing different information.

- Listeners judged pairs of voices to be more similar in an immersive jury context within our game than in a plain, non-game level. This resulted in more false positives in the jury level compared with the plain level. Some of this effect is due to order effects within our experiment (overall, listeners gradually heard voices as more similar), but it could also be due to the effect of being 'on a jury'.

- Listeners generally respond to expert evidence in predictable ways: they will override their own opinion about whether a pair of voices belongs to the same or different speakers when presented with a conclusion from a forensic phonetician. We also see the 'weak evidence effect' in our results: when presented with a weak positive conclusion from an expert, listeners interpret it as a weak negative, and vice versa. This effect has previously been demonstrated for other forms of forensic evidence.
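To illustrate the first finding, combining machine and human responses can be sketched as score fusion. The sketch below uses logistic-regression fusion, a standard calibration and fusion technique in speaker recognition; it is an assumption for illustration only, not the project's actual combination method, and all data and names in it are hypothetical.

```python
import math

def fuse(machine_llrs, human_scores, labels, lr=0.1, epochs=2000):
    """Learn weights w1, w2 and bias b so that
    sigmoid(w1*machine + w2*human + b) predicts same-speaker (1)
    vs different-speaker (0). The fitted log-odds then serve as a
    combined score drawing on both sources of information."""
    w1 = w2 = b = 0.0
    n = len(labels)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for m, h, y in zip(machine_llrs, human_scores, labels):
            # predicted probability of "same speaker" for this pair
            p = 1.0 / (1.0 + math.exp(-(w1 * m + w2 * h + b)))
            g1 += (p - y) * m
            g2 += (p - y) * h
            gb += (p - y)
        # average-gradient descent step
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

# Toy example: machine log-likelihood-ratio scores and human
# similarity ratings (0-1) for six voice pairs (hypothetical data).
machine = [2.1, 1.5, -1.8, -2.5, 0.3, -0.2]
human = [0.9, 0.7, 0.2, 0.1, 0.8, 0.3]
labels = [1, 1, 0, 0, 1, 0]  # 1 = same speaker, 0 = different
w1, w2, b = fuse(machine, human, labels)
fused = [w1 * m + w2 * h + b for m, h in zip(machine, human)]
```

On this toy data the fused score is positive for same-speaker pairs and negative for different-speaker pairs; the learned weights show how much each source contributes, which is one way the "different information" captured by humans and machines can be exploited jointly.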
Exploitation Route The game-like experimental tool will be used for data elicitation in our future projects. We believe that this approach allows us to collect more ecologically valid data, since participants are immersed in a narrative where the stakes are raised as much as possible. This will help forensic phoneticians (and those in other forensic disciplines) better estimate how jurors will respond in court. Our game will also provide a basis for future collaboration with DC Labs and others in the digital creativity space.

The results themselves provide a basis for demonstrating the validity of using automatic methods for speaker recognition in forensic casework. This challenges the commonly held belief among lawyers and judges that lay people are equally well placed to listen to voices and arrive at a conclusion in a courtroom. The results also inform how best to present expert conclusions so that the strength of evidence is correctly interpreted by jurors.
Sectors Digital/Communication/Information Technologies (including Software), Government, Democracy and Justice

URL https://sites.google.com/york.ac.uk/humans-machines/
 
Description The project is just coming to an end, so non-academic impacts arising from our results are still in their early stages. However, we have disseminated our work to forensic speech science practitioners at the conference of the International Association for Forensic Phonetics and Acoustics. The results provide a basis for the argument that voice analysis is a form of forensic evidence that requires expert analysis, rather than relying on lay people such as those sitting on juries. The results also provide a basis for experts to be able to explain what information about a speaker is being captured by automatic systems, which is important when explaining evidence to courts. We have also engaged with researchers, and with representatives from industry and government, within the field of digital creativity.
First Year Of Impact 2022
Sector Digital/Communication/Information Technologies (including Software), Government, Democracy and Justice
Impact Types Societal

 
Title SoundJury Game 
Description As part of our project, we developed an immersive platform for collecting online data relevant to jury decisions in criminal cases. We have also reported this in the 'software' section. The game allows us to collect different data from what would ordinarily be collected via a Qualtrics-style survey, by maximising the stakes of participants' responses. As part of the project, we have been validating our game-based elicitation tool and are in the process of writing a methods paper describing its development and the type of data it produces. 
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? No  
Impact As we have only just finished the project, we haven't had a chance to share the game or information about it, beyond presenting it at academic conferences (IAFPA, NWAV). As noted above, we are in the process of writing a methods paper to describe the game and how it was developed. 
URL https://brave-field-0e50b7b03.azurestaticapps.net/
 
Title SoundJury: Speaker Recognition by Humans and Machines 
Description As part of the project we created an online app. The app is a gamified speaker recognition experiment used to collect data from participants as part of our research. The app was developed in collaboration with Digital Creativity Labs (https://digitalcreativity.ac.uk/) at York and with a freelance illustrator. 
Type Of Technology Webtool/Application 
Year Produced 2022 
Impact The game has been used extensively for our own data collection. In the future, we hope that it will provide a template for gamification of linguistic experiments and inform decisions around the use of voices within computer games. 
URL https://brave-field-0e50b7b03.azurestaticapps.net/
 
Description A game-based approach to eliciting and evaluating likelihood ratios for speaker recognition 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk given at the IAFPA conference to describe our game and explain the underlying methods. An international audience of around 80, including a large number of forensic practitioners.

Hughes, V., Llamas, C. and Kettig, T. (2022) A game-based approach to eliciting and evaluating likelihood ratios for speaker recognition. International Association for Forensic Phonetics and Acoustics (IAFPA) Conference, Prague, Czechia. 10-13 July 2022.
Year(s) Of Engagement Activity 2022
 
Description Assessing the speaker recognition performance of humans and machines: implications for forensic voice comparison. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Online talk for the LITHME Cost Action, Working Group 8 (Language Variation in the Human-Machine Era). Around 20 attendees from around the world made up of academics, students and forensic practitioners.

Hughes, V. (2022) Assessing the speaker recognition performance of humans and machines: implications for forensic voice comparison. LITHME Cost Action, Working Group 8 (Language Variation in the Human-Machine Era). 6 May 2022.
Year(s) Of Engagement Activity 2022
 
Description SoundJury: gamification of a speaker recognition experiment 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Talk and panel at the Digital Creativity conference held at York for industry professionals and policy makers. We discussed the gamification of academic experiments and the potential of using the voice within games.

Hughes, V., Llamas, C., Kettig, T., Cutting, J. and Slawson, D. (2022). SoundJury: gamification of a speaker recognition experiment. Digital Creativity, Industry and Culture (DCIC) Conference, York, UK. 20 September 2022.
Year(s) Of Engagement Activity 2022
 
Description Speaker recognition by humans and machines: some findings from a game-like elicitation task 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk at the UCL Speech Science Forum for students and staff. Around 30 attendees, both in person and online, from around the world.

Hughes, V. (2022) Speaker recognition by humans and machines: some findings from a game-like elicitation task. UCL Speech Science Forum. 1 December 2022.
Year(s) Of Engagement Activity 2022