Person-specific automatic speaker recognition: understanding the behaviour of individual speakers for applications of ASR

Lead Research Organisation: University of York


Automatic speaker recognition (ASR) software processes and analyses speech to decide whether two voices belong to the same individual or to different individuals. Such technology is becoming an increasingly important part of our lives: it is used as a security measure when accessing personal accounts (e.g. bank accounts), and as a means of tailoring content to a specific person on smart devices. Around the world, ASR systems are commonly used for investigative and forensic purposes, analysing recordings in which the speaker's identity is unknown. Yet systems perform better with some voices than with others. A fundamental question therefore remains: what makes a particular voice easy or difficult for ASR to recognise?

State-of-the-art systems, built on techniques from artificial intelligence (AI), have shown marked improvements in performance over older approaches. However, issues remain. Firstly, ASR research has focused on minimising the effects of well-known technical factors, such as channel (e.g. mobile vs. landline telephone), recording quality and microphones. Resolving these technical challenges has yielded large improvements in systems, yet little is known about how speakers themselves affect ASR performance. Secondly, ASR research has concentrated on reducing overall error rates. Yet, in the real world (where innocence and guilt may be at stake), the key question is: what is the chance that the system has made an error in this specific instance? Finally, while AI approaches have undoubtedly brought improvements in overall performance, such algorithms make it harder to know what information a system relies on to make its decisions. This is problematic for forensic experts, who must explain their methods to non-expert end users, such as judges, juries, lawyers and police.
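The per-instance question above is commonly framed in forensic voice comparison as a likelihood ratio: how much more probable is the observed comparison score if the two recordings come from the same speaker than if they come from different speakers? A minimal sketch, assuming Gaussian score distributions for the two hypotheses (the means and standard deviations below are illustrative made-up parameters, not values from any real system):

```python
import math

def normal_pdf(x, mean, sd):
    """Gaussian density; Gaussian-shaped score distributions are an assumption here."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def likelihood_ratio(score, same_mean=2.0, same_sd=1.0, diff_mean=-2.0, diff_sd=1.0):
    """LR = P(score | same speaker) / P(score | different speakers).

    The distribution parameters are hypothetical calibration values chosen
    purely for illustration.
    """
    return normal_pdf(score, same_mean, same_sd) / normal_pdf(score, diff_mean, diff_sd)

# A score of 1.5 sits close to the same-speaker distribution, so the
# evidence strongly favours the same-speaker hypothesis.
print(round(likelihood_ratio(1.5), 2))  # → 403.43
```

An LR above 1 supports the same-speaker hypothesis and an LR below 1 the different-speakers hypothesis; a court would combine the LR with prior odds rather than read it as a probability of guilt.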

This project is the first to systematically assess how individual speakers perform within and across ASR systems, and to compare speaker effects, whether linguistic properties of voices or speaker demographics (e.g. accent, ethnicity, gender), with well-studied technical effects. The aim is to use this knowledge to improve ASR systems by flagging potentially problematic speakers and to develop methods for handling them. We will use novel, interdisciplinary methods, bringing together expertise from speech technology, linguistics and forensic speech science. Our collaboration with the commercial ASR vendor Oxford Wave Research allows us to adapt and modify systems to assess the effects on results for individual speakers. We will use highly controlled, small-scale experiments to assess speaker effects in isolation, as well as much larger datasets of more forensically realistic recordings provided by our project partners, the UK Ministry of Defence and the Netherlands Forensic Institute. This variety of datasets also allows us to assess the generalisability of results across a wide range of voices.

This project is driven entirely by real-world issues, and its results will deliver considerable impact to a wide range of stakeholders. By understanding more about individual speakers, our results have the potential to improve overall ASR performance, benefiting both users and developers of ASR systems. The results will also have specific implications for forensic and investigative applications, guiding data collection for validating methods (something experts are under increasing regulatory pressure to do) and providing a framework for combining ASR and linguistic analysis. Through engagement with the legal community, we aim to effect a change in the status of ASR in England and Wales, such that it is admissible as expert evidence. We will deliver impact via knowledge exchange with a Forensic Advisory Panel consisting of representatives from forensic speech science, law enforcement, and the legal community.
