Person-specific automatic speaker recognition: understanding the behaviour of individual speakers for applications of ASR
Lead Research Organisation: University of York
Department Name: Language and Linguistic Science
Abstract
Automatic speaker recognition (ASR) software processes and analyses speech to make decisions about whether two voices belong to the same or different individuals. Such technology is becoming an increasingly important part of our lives, used as a security measure when accessing personal accounts (e.g. banking) or as a means of tailoring content to a specific person on smart devices. Around the world, ASR systems are commonly used for investigative and forensic purposes, to analyse recordings of criminal voices where identity is unknown. Yet systems perform better or worse with certain voices. A fundamental question therefore remains: what makes a particular voice easy or difficult for ASR to recognise?
State-of-the-art systems, using techniques from artificial intelligence (AI), have shown marked improvements in performance compared with older approaches. However, issues remain. Firstly, ASR research has focused on minimising the effects of well-known technical factors, such as channel (e.g. mobile vs. landline telephone), recording quality and microphones. Resolving these technical challenges has produced large improvements in systems, yet little is known about how speakers themselves affect ASR performance. Secondly, ASR research has concentrated on reducing overall error rates. Yet, in the real world (where innocence and guilt may be at stake), the key question is: what is the chance the system has made an error in this specific instance? Finally, while AI approaches have undoubtedly brought improvements in overall performance, such algorithms make it more difficult to know what information systems rely on to make decisions. This is problematic for forensic experts, who must explain their methods to non-expert end users, such as judges, juries, lawyers and police.
This project is the first to systematically assess how individual speakers perform within and across ASR systems and to compare speaker effects, in terms of linguistic properties of voices or speaker demographics (e.g. accent, ethnicity, gender), with well-studied technical effects. The aim is to use this knowledge to improve ASR systems by flagging potentially problematic speakers and by developing methods to handle them. We will use novel, interdisciplinary methods, bringing together expertise from speech technology, linguistics, and forensic speech science. Our collaboration with commercial ASR vendor Oxford Wave Research allows us to adapt and change systems to assess the effects on results for individual speakers. We will also use highly controlled, small-scale experiments to assess speaker effects in isolation, as well as much larger datasets of more forensically realistic recordings provided by our project partners, the UK Ministry of Defence and the Netherlands Forensic Institute. The availability of a variety of datasets also allows us to assess the generalisability of results across a range of voices.
This project is entirely driven by real-world issues, so the results will deliver considerable impact to a wide range of stakeholders. By understanding more about individuals, our results have the capability to improve overall ASR performance, benefiting both users and developers of ASR systems. The results will also have specific implications for forensic and investigative applications, guiding data collection for validating methods (something which experts are under increasing regulatory pressure to do) and providing a framework for combining ASR and linguistic analysis. In doing so, through engagement with the legal community, we aim to effect a change in the status of ASR in England and Wales, such that it is admissible as expert evidence. We will deliver impact via knowledge exchange with a Forensic Advisory Panel consisting of representatives from forensic speech science, law enforcement, and the legal community.
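As a purely illustrative sketch of the comparison task described in the abstract (not the project's own systems or code), the snippet below compares two fixed-length speaker embeddings with a cosine score and turns the score into a same/different-speaker decision. The embedding dimensionality, the threshold value and the random placeholder vectors are assumptions for illustration only.

```python
# Minimal sketch of a same/different-speaker comparison (illustrative only).
# The random vectors stand in for embeddings that a real system
# (e.g. an x-vector network) would extract from two audio recordings.
import numpy as np

rng = np.random.default_rng(0)
emb_questioned = rng.normal(size=512)  # embedding of the unknown (questioned) voice
emb_known = rng.normal(size=512)       # embedding of the known (reference) voice

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Higher scores indicate greater similarity between the two voices."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine_score(emb_questioned, emb_known)
# A same/different decision (or, forensically, a likelihood ratio) is then
# derived from the score, e.g. by comparison against a calibrated threshold.
threshold = 0.5  # illustrative value only
print(f"score={score:.3f} -> {'same speaker' if score >= threshold else 'different speakers'}")
```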
Publications
Wang BX (2024) Balancing validity and reliability as a function of sampling variability in forensic voice comparison. Science & Justice: Journal of the Forensic Science Society.
| Description | We have conducted validation of AI-based speaker recognition systems for use as a form of forensic identification evidence. This involves extensive testing of the systems under a range of relevant conditions, varying factors such as duration, background noise, channel, and various system-level settings (an illustrative sketch of this kind of score-based testing follows this section). It puts us in a position to carry out forensic casework using speaker recognition systems. We have developed a protocol to validate human-based linguistic methods for forensic speaker identification; in the coming year, we will test two established forensic experts in what will be the only large-scale test of human-based speaker identification. We have also identified the types of voices that automatic speaker recognition systems typically struggle most with. Specifically, we have found that voices that use whisper or involve substantial deviations in the vocal tract apparatus produce the poorest performance. This provides a way of identifying which voices are likely to result in an error when using automatic speaker recognition systems. |
| Exploitation Route | Academic - our work will be used to inform further testing of human- and AI-based methods for speaker identification. It will be used for method development and as a way of understanding what information new generations of automatic speaker recognition systems are sensitive to. Non-academic - our work provides essential benchmarking data for human- and AI-based methods for speaker identification used in the criminal justice system. This is validation work, which is a key part of forensic science standards and regulation. Having our results available takes a considerable burden away from practitioners, who do not need to conduct the same level of large-scale testing themselves; they can use our results as a basis and then demonstrate that they personally can use the methods competently. |
| Sectors | Aerospace, Defence and Marine; Government, Democracy and Justice; Security and Diplomacy |
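Below is a minimal, hedged sketch (not the project's actual validation code) of the kind of score-based testing described under Description above: given same-speaker and different-speaker comparison scores obtained under one test condition, it sweeps a decision threshold and reports the equal error rate (EER). The score distributions are random placeholders standing in for real system output.

```python
# Illustrative EER computation for one test condition (duration, noise,
# channel, etc.); the scores below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)
same_scores = rng.normal(loc=1.0, scale=0.5, size=1000)   # placeholder same-speaker scores
diff_scores = rng.normal(loc=-1.0, scale=0.5, size=1000)  # placeholder different-speaker scores

def eer(same: np.ndarray, diff: np.ndarray) -> float:
    """Sweep a threshold over all observed scores and return the point where
    the false-reject and false-accept rates are (approximately) equal."""
    thresholds = np.sort(np.concatenate([same, diff]))
    fr = np.array([(same < t).mean() for t in thresholds])   # false rejections
    fa = np.array([(diff >= t).mean() for t in thresholds])  # false acceptances
    i = int(np.argmin(np.abs(fr - fa)))
    return float((fr[i] + fa[i]) / 2)

print(f"EER = {eer(same_scores, diff_scores):.2%}")
```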
| Description | Towards the growth of forensic speech science |
| Amount | £323,596 (GBP) |
| Organisation | University of York |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 08/2024 |
| End | 08/2026 |
| Description | YorVoice |
| Amount | £300,000 (GBP) |
| Organisation | University of York |
| Sector | Academic/University |
| Country | United Kingdom |
| Start | 08/2023 |
| End | 08/2025 |
| Title | PASR WorkPackage 1 Dataset |
| Description | This is a dataset of recordings of (currently) nine phoneticians producing a fixed, read text in 29 speaking conditions relevant to the project (voice qualities, vocal settings, accent guises, disguises). Each participant produced each condition three times across three or four recording sessions, with at least a week between sessions. The recordings were made simultaneously under four technical conditions (head-band microphone, near microphone, far microphone, and landline-to-VoIP). |
| Type Of Material | Database/Collection of data |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | We are currently using this dataset as the basis for training automatic voice quality classifiers and as potential adaptation data for fine-tuning automatic speaker recognition systems (a purely illustrative sketch of the classifier idea follows this entry). This work is still in its infancy; more will be shared once further progress has been made. |
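As a purely hypothetical sketch of the voice-quality-classifier idea mentioned above (not the project's code), the snippet below trains a small classifier to label a recording's speaking condition from feature vectors. The condition labels and synthetic features are assumptions standing in for acoustic measures that would in practice be extracted from the WorkPackage 1 recordings.

```python
# Hypothetical voice-quality classification sketch: synthetic 20-dimensional
# feature vectors stand in for per-recording acoustic measures.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
conditions = ["modal", "whispery", "creaky"]  # illustrative condition labels
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(100, 20)) for i in range(3)])
y = np.repeat(conditions, 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```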
| Description | Netherlands Forensic Institute: Project Partner |
| Organisation | Netherlands Forensic Institute |
| Country | Netherlands |
| Sector | Public |
| PI Contribution | The project and team provide intellectual input on the use of automatic speaker recognition systems for the purposes of forensic casework at the Netherlands Forensic Institute. |
| Collaborator Contribution | The Netherlands Forensic Institute provide guidance on the overall and specific directions of each of the workpackages. Our contact at NFI regularly attends our project meetings. He also provides comments and feedback on papers for publication and presentation at conferences. |
| Impact | Sensitivity of x-vectors and automatic speaker recognition scores to vocal variation. Automatic speaker recognition with variation across vocal conditions: a controlled experiment with implications for forensics. |
| Start Year | 2022 |
| Description | Defence Science and Technology Laboratory - In-person visit day |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Professional Practitioners |
| Results and Impact | Members of the project team visited DSTL in person to discuss the project and how its findings could be implemented in practice. |
| Year(s) Of Engagement Activity | 2024 |
| Description | Defence Science and Technology Laboratory - Speaker Series |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Professional Practitioners |
| Results and Impact | I gave a talk on the PASR project as part of the Defence Science and Technology Laboratory speaker series. The audience was made up of professionals conducting research and casework on behalf of the UK Government. The talk was given online. |
| Year(s) Of Engagement Activity | 2024 |
| Description | IAFPA 2022 - Person-specific automatic speaker recognition: understanding the behaviour of individuals for applications of ASR |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | We introduced the project to an international audience of almost 100 professional practitioners, academics and students who regularly conduct forensic casework. |
| Year(s) Of Engagement Activity | 2022 |
| Description | IAFPA 2023 - Effects of vocal variation on the output of an automatic speaker recognition system |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | Paper presented to an international audience of professional practitioners, academics and students who regularly conduct forensic casework. The aim was to share research findings, leading to new research directions informed by practice, and to influence current practice. |
| Year(s) Of Engagement Activity | 2023 |
| Description | IAFPA 2023 - Impact of the changes in long-term acoustic features upon different-speaker ASR scores |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | Paper presented to an international audience of professional practitioners, academics and students who regularly conduct forensic casework. The aim was to share research findings, leading to new research directions informed by practice, and to influence current practice. |
| Year(s) Of Engagement Activity | 2023 |
| Description | IAFPA 2024 - Exploring solutions to vocal mismatch for automatic speaker recognition systems |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | Paper presented to an international audience of professional practitioners, academics and students who regularly conduct forensic casework. The aim was to share research findings, leading to new research directions informed by practice, and to influence current practice. |
| Year(s) Of Engagement Activity | 2024 |
| Description | IAFPA 2024 - Method validation of phonetic forensic voice comparison |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | Paper presented to an international audience of professional practitioners, academics and students who regularly conduct forensic casework. The aim was to share research findings, leading to new research directions informed by practice, and to influence current practice. |
| Year(s) Of Engagement Activity | 2024 |
| Description | Project Advisory Panel Meeting 2023 |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | The meeting involved reporting back to the project Advisory Panel on progress so far. At the meeting were 10 external panel members from private practice, government, industry and academia. This meeting took place in-person in York. |
| Year(s) Of Engagement Activity | 2023 |
| Description | Project Advisory Panel Meeting 2024 |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Professional Practitioners |
| Results and Impact | The meeting involved reporting back to the project Advisory Panel on progress from 2023-24. At the meeting were 13 external panel members from private practice, government, industry and academia. This meeting took place online. |
| Year(s) Of Engagement Activity | 2024 |
