Towards linguistically-informed automatic speaker recognition

Lead Research Organisation: University of York
Department Name: Language and Linguistic Science

Abstract

This project will investigate how Automatic Speaker Recognition (ASR) systems work and how they can be improved. ASR systems recognise speakers from their voice alone and are widely used by banks and government institutions such as HMRC. Such systems have improved markedly in recent years thanks to continual refinement of methods and the availability of large databases of recordings on which to test them. State-of-the-art systems now produce few errors even with short, poor-quality recordings. However, ASR systems are a 'black box': we know that they analyse a speaker's voice, but we do not know what linguistic information they rely on to make their decisions. This project is exciting because it builds on a small but growing body of research at the intersection of linguistics and speech technology. It is also the first systematic study of the inner workings of such systems, and its outcomes will benefit society by improving the reliability of speaker recognition systems used in security applications, banking and the courtroom.

I am interested in this project because I have experience working with similar systems that recognise individuals from written data. However, those systems are not 'black boxes' like these ASR systems, which use spoken data. I am therefore driven to open up the 'black box' of ASR systems to ensure that systems using written and spoken data are equally reliable. Understanding this 'black box' is crucial because it will allow us to improve ASR systems further, particularly by revealing which 'types' of voices they find difficult. The systems' decisions must also be explainable to lay people, e.g. jury members, in legal cases where their output is used in evidence. The project will ask three research questions devoted to understanding and improving ASR systems:

RQ.1: To what extent do ASR systems capture tangible linguistic properties of a voice?

Firstly, we will investigate which tangible linguistic properties of voice map onto the abstract representations that ASR systems already extract. I hypothesise that many properties will be pertinent, e.g. vowel formants: the consistent resonant frequencies of different vowel sounds, which are uniquely shaped by each speaker's vocal tract and accent.
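To illustrate the kind of acoustic measurement involved, the sketch below estimates a resonance (formant-like) frequency from a signal using linear predictive coding (LPC), a standard technique in phonetic analysis. This is a minimal, self-contained illustration with a synthetic resonance standing in for real speech; it is not part of the project's methodology, and all numbers are invented.

```python
import numpy as np

def lpc(signal, order):
    """LPC coefficients via the autocorrelation (Yule-Walker) method."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))      # prediction polynomial A(z)

def strongest_resonance(signal, fs, order=8):
    """Frequency (Hz) of the LPC pole closest to the unit circle."""
    roots = np.roots(lpc(signal * np.hamming(len(signal)), order))
    roots = roots[np.imag(roots) > 0]       # one of each conjugate pair
    peak = roots[np.argmax(np.abs(roots))]  # narrowest-bandwidth resonance
    return np.angle(peak) * fs / (2 * np.pi)

# Synthetic signal: white noise through a 700 Hz two-pole resonator,
# a crude stand-in for one vowel formant.
fs, f0, pole_r = 8000, 700.0, 0.98
a1, a2 = -2 * pole_r * np.cos(2 * np.pi * f0 / fs), pole_r ** 2
x = np.random.default_rng(0).standard_normal(2048)
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = x[n] - a1 * y[n - 1] - a2 * y[n - 2]

estimate = strongest_resonance(y, fs)       # close to the true 700 Hz
```

The estimate recovers the built-in resonance to within a few tens of hertz, which is the sense in which formants are "regular and consistent" acoustic properties that a system could exploit.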

RQ.2: Can we predict which speakers will be problematic for the system?

Secondly, we will identify groups of speakers who may be problematic for ASR systems, so that we can improve the systems based on why these groups pose issues. Some accents have less vowel variation than others; their speakers could therefore be at greater risk of being mistaken for another speaker with the same accent, because there are fewer variables with which to identify each speaker uniquely.
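The intuition can be shown with a toy simulation (invented numbers, not project data): if speakers are points in a two-dimensional F1/F2 vowel space, an accent with a more compressed vowel space leaves each speaker closer to their nearest neighbour, and hence easier to confuse.

```python
import numpy as np

rng = np.random.default_rng(42)

def nearest_neighbour_distance(speakers):
    """Mean distance from each speaker to the closest other speaker."""
    d = np.linalg.norm(speakers[:, None, :] - speakers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # ignore self-distance
    return d.min(axis=1).mean()

# 50 simulated speakers per accent, as (F1, F2) vowel means in Hz.
# The spreads are invented: one accent varies 4x more than the other.
wide   = rng.normal([500.0, 1500.0], [80.0, 160.0], size=(50, 2))
narrow = rng.normal([500.0, 1500.0], [20.0,  40.0], size=(50, 2))

gap_wide = nearest_neighbour_distance(wide)
gap_narrow = nearest_neighbour_distance(narrow)
# gap_narrow < gap_wide: less vowel variation leaves less room
# to tell speakers of that accent apart.
```

The smaller nearest-neighbour gap in the compressed vowel space is a simple proxy for the elevated misrecognition risk hypothesised for such accent groups.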

RQ.3: Can linguistic information be used to improve the performance of ASR?

Finally, we will use linguistic speech analysis to improve ASR systems. By identifying the linguistic features that ASR systems rely on, we can tailor the systems to focus on those features and so improve their reliability.

This project uses a state-of-the-art speaker recognition system (VoiSentry) developed by the commercial partner, Aculab. My methodology will involve testing the VoiSentry software on voices that have been manipulated in controlled ways, e.g. changing the acoustic properties of the vowel sounds, and observing how each manipulation affects the output score. If the score changes, we will know that ASR systems capture tangible linguistic properties of voice, and we can therefore tailor these systems to focus on detecting those features. Aculab's involvement is crucial because they will allow us to examine the underlying code, access that no other ASR system permits, so we can perform targeted manipulations and trace their effect on the result. Overall, this research will have societal value because it will ensure that speaker recognition systems used by banks and government institutions are as reliable as possible.
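The shape of such an experiment can be sketched as follows. Since VoiSentry is proprietary, the embedding and comparison score below are crude stand-ins (an average log-magnitude spectrum and cosine similarity), and the "manipulation" naively rescales all frequencies rather than shifting individual formants; the signal itself is synthetic. The point is only the experimental logic: compare the score for an unmodified recording against the score after a controlled acoustic change.

```python
import numpy as np

def embedding(signal, n_fft=512):
    """Average log-magnitude spectrum: a crude stand-in for the
    internal speaker embedding of a real system such as VoiSentry."""
    frames = signal[:len(signal) // n_fft * n_fft].reshape(-1, n_fft)
    mags = np.abs(np.fft.rfft(frames * np.hamming(n_fft), axis=1))
    return np.log(mags + 1e-8).mean(axis=0)

def score(a, b):
    """Cosine similarity, standing in for the system's output score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def shift_frequencies(signal, factor):
    """Controlled manipulation: scale all frequencies by `factor`
    via naive resampling (real work would target single formants)."""
    idx = np.arange(len(signal)) * factor
    return np.interp(idx, np.arange(len(signal)), signal, right=0.0)

# Stand-in 'recording': pseudo-formant sinusoids at 500/1500/2500 Hz.
fs = 8000
t = np.arange(4096) / fs
voice = sum(np.sin(2 * np.pi * f * t) for f in (500.0, 1500.0, 2500.0))
voice += 0.05 * np.random.default_rng(1).standard_normal(len(t))

ref = embedding(voice)
score_same = score(ref, embedding(voice))
score_shifted = score(ref, embedding(shift_frequencies(voice, 1.1)))
# A drop in score under the manipulation shows that this (stand-in)
# system is sensitive to the manipulated acoustic property.
```

In the project itself, the same comparison would be run through VoiSentry's actual scoring, with manipulations targeting specific linguistic features such as individual vowel formants.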
