Voice and Identity: source, filter, biometric

Lead Research Organisation: University of York
Department Name: Language and Linguistic Science

Abstract

The voice has fascinated philosophers, writers and scientists throughout history, especially as a marker of human identity. Formal analysis of voice is conducted in strikingly different frameworks, some developed largely in arts & humanities disciplines (linguistics and phonetics), and others in the sciences (engineering, physics, computer science). These frameworks differ in focus, assumptions, and in methods used to test their performance and reliability. Remarkably little work has sought to integrate them, resulting in only limited understanding of their respective benefits and drawbacks, and of their degree of compatibility. This is of particular importance in the applied domain of forensic speech science, where the individual properties of the voice are treated as a biometric. Forensic voice (or speaker) comparison (FVC) is increasingly called for in courts worldwide. Typically, comparison is made between the voice of a suspect, and a voice recorded in criminal activity (e.g. via covert surveillance of drug deals or terrorist plots). The aim of FVC is to aid the court in assessing the likelihood that the speaker in the recordings is the same person, as opposed to a different person. There is a growing consensus that an integrated approach is needed for significant progress to be made towards a more reliable and robust procedure for FVC.

In this proposal we seek to assess the comparative performance of voice analysis based on linguistic-phonetic methods, and automatic (computational) systems. We will explore the performance of the methods on the same data to assess their relative strengths, the consistency of their results and error patterns, and thus the potential for phonetic and automatic methods to be integrated. The ultimate aim is to improve methods in FVC, taking a major step towards the development of a methodology that is more transparent, validated, and replicable. This outcome will benefit academics and forensic practitioners, the public, judicial systems, and investigative/security agencies. More generally, the project answers recent calls to improve the quality of forensic evidence of all kinds, making forensic sciences more transparent and more carefully regulated (e.g. Law Commission of England & Wales 2011).

The project addresses the AHRC's focus theme, Science in Culture: it (i) explores the capacity of linguistic-phonetic techniques for advancing scientific methods; (ii) explores how methods developed in both the sciences and arts might be integrated; (iii) improves understanding of the comparative roles of expertise from the sciences and humanities in FVC; and (iv) aims to improve public confidence in forensic evidence.

Planned Impact

The outcomes of this research will also have a number of economic and societal beneficiaries:

1. Commercial private sector - forensic voice analysts

The findings of this project will have an immediate impact on the analytic procedures applied to forensic voice comparison cases. Beneficiaries will include both private and government laboratories. In the private sector, the benefits will be felt directly by J P French Associates, as collaborators in the project. Dissemination to other national and international labs will take place at the wide range of conferences and specialist meetings outlined in the proposal. In particular, the project will improve expert analyses by identifying the global outputs from the source and filter which offer the greatest speaker-discriminatory power, which can then be applied in casework, as well as providing an understanding of the types of errors made by different approaches. The research will also benefit private companies developing ASR systems (incl. Agnitio, manufacturers of the leading commercially available ASR software, BATVOX), by offering different automatic-style variables for analysis and ultimately improving current error rates.

2. Government agencies

This research will provide the same benefits to both national and international government agencies as those outlined for the commercial private sector. In particular, the research will improve the forensic voice comparison analyses performed by governmental forensics laboratories. The applicants have close working relationships with many other agencies, including MI5, the Netherlands Forensic Institute, Bundeskriminalamt (German State Labs), the Royal Canadian Mounted Police, and the Estonian Forensic Science Institute.

3. Regulators/policy makers

This project will be of direct relevance to the UK Forensic Science Regulator (FSR), established in 2012 to regulate practice in forensic disciplines in the UK. Co-PI French is the chair of the Forensic Speech and Audio committee of the FSR, ensuring the research can be disseminated to all registered practitioners, and, if appropriate, the implications for forensic practice can be enshrined in the regulations for expert witness policy and practice. As part of the dissemination of this research, we will also contribute towards the Home Office Biometrics Working Group. Therefore, the findings will contribute towards the advancement, improvement and standardisation (in particular the development of best practices) of forensic voice comparison evidence presented to UK courts.

4. Judicial system

As end-users of expert evidence, the findings of this project will necessarily have implications for the judicial system at large. The results are intended to improve the quality of forensic voice comparison evidence in courts, both by improving the results of forensic speech analysis and also leading to better understanding of the limitations of different methods. The research will also indirectly help to develop the level of understanding of speech evidence by the courts by increasing the degree of standardisation across experts and analyses.

5. Public

The general public benefit from the project, since justice is determined by the quality of forensic evidence presented to the courts. It is also hoped that the findings will increase public understanding of technical analysis as applied to forensic evidence, in particular to assist in understanding what is often regarded as opaque 'black-box' evidence such as that produced by ASR systems.
 
Description The project has three key outcomes.

1. Forensic speaker comparison (voice identification) cases involve comparing voices in criminal recordings with those of known suspects. There are two main approaches to this task. One involves a trained phonetician listening to the voices analytically and taking measurements of various physical parameters of the speech signal. They base their decision on identity/non-identity of the voices (expressed as a likelihood ratio) on correspondences or differences found between the recordings. The second involves the use of automatic speaker recognition (ASR) software. ASR systems entail uploading the voice samples to a computer, which automatically performs on them highly complex mathematical transformations, and reduces them to statistical voice models (based on mel-frequency cepstral coefficients, or MFCCs). The criminal and suspect voice models are then compared with each other in order to assess the degree of similarity between them, and with models from a 'reference population' of speakers, in order to assess just how widespread the features common to them are in the wider population. Again, the outcome of the ASR analysis of identity/non-identity of speakers is expressed as a likelihood ratio.
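As a rough illustration of what "expressed as a likelihood ratio" means, the sketch below divides the probability of an observed similarity score under the same-speaker hypothesis by its probability under the different-speaker hypothesis. The Gaussian score models, the function names and all numbers are assumptions made for illustration only; they are not the project's method or the internals of any ASR product.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(score, same_mean, same_sd, diff_mean, diff_sd):
    """p(score | same speaker) / p(score | different speakers).

    The two Gaussians stand in for score distributions that would, in
    practice, be estimated from a reference population (an assumption here).
    """
    numerator = gaussian_pdf(score, same_mean, same_sd)
    denominator = gaussian_pdf(score, diff_mean, diff_sd)
    return numerator / denominator


# Hypothetical similarity score and hypothetical model parameters.
lr = likelihood_ratio(score=2.0, same_mean=2.0, same_sd=1.0,
                      diff_mean=0.0, diff_sd=1.0)
log10_lr = math.log10(lr)
```

A ratio above 1 supports the same-speaker hypothesis, a ratio below 1 supports the different-speaker hypothesis, and a ratio of exactly 1 is neutral.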

Both approaches are susceptible to error. Error rates for the 'human based' approach are not known, but those of the automatic approach vary between 1 - 5%. A central question for those working in this area is whether the human based and automatic approaches can be combined, the strengths of one approach compensating for the weaknesses of the other, and thereby improving the rate of accuracy overall.
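The record does not state which error metric lies behind the quoted figures. As one common way of reporting such errors, the sketch below counts false rejections (same-speaker pairs scored below a decision threshold) and false acceptances (different-speaker pairs scored at or above it). The scores, the threshold and the function name are hypothetical.

```python
def error_rates(same_speaker_scores, different_speaker_scores, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold."""
    false_rejections = sum(s < threshold for s in same_speaker_scores)
    false_acceptances = sum(s >= threshold for s in different_speaker_scores)
    return (false_rejections / len(same_speaker_scores),
            false_acceptances / len(different_speaker_scores))


# Hypothetical comparison scores; higher means "more similar".
same = [2.1, 1.8, 0.4, 2.5]
different = [-1.0, 0.2, 0.9, -0.3]
frr, far = error_rates(same, different, threshold=0.5)
```

Moving the threshold trades one kind of error against the other, which is why a single error figure always presupposes a chosen operating point.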

A danger of combining the approaches, however, is that the features of the voice assessed within one approach are also being assessed in the other, albeit in methodologically different ways. Thus, one could end up overestimating - 'double weighting' - the strength of evidence if the same aspects of speech are being factored into one's conclusion twice.
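The double-weighting risk can be made concrete with a small sketch. Multiplying likelihood ratios (equivalently, summing their base-10 logarithms) is only justified when the pieces of evidence are independent; the values and function name below are hypothetical.

```python
def combine_assuming_independence(log10_lrs):
    """Sum log10 likelihood ratios, i.e. multiply the ratios themselves.

    This is only valid if the underlying features are independent.
    """
    return sum(log10_lrs)


# Hypothetical case: a phonetic measure and an automatic measure that in fact
# capture the same property of the voice, each giving an LR of 10.
phonetic_log10_lr = 1.0
automatic_log10_lr = 1.0

naive_combined = combine_assuming_independence(
    [phonetic_log10_lr, automatic_log10_lr]
)
naive_lr = 10 ** naive_combined
```

Here the naive combination reports a likelihood ratio of 100, although the evidence, counted once, supports only 10. Features that are genuinely independent of those used by the automatic system do not suffer from this inflation.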

The research shows that, firstly, there are some features of speech examined in the human based approach that are related to those used by the automatic systems and that combining these into the automatic approach is of no benefit - in fact, they decrease overall accuracy of performance. These concern the acoustic resonances of vowel sounds (formants). Second, however, we have found that there are other features concerning voice quality, the overall 'timbre' or individual colouring of a voice, which are assessed perceptually within the human based approach, that are quite independent of those processed by the automatic systems and may be both legitimately and usefully combined in reducing error rates of the systems to zero.

This is a methodological breakthrough and paves the way forward for integration of aspects of the human based approach into automatic analyses.

The work involved in this aspect of the project has also yielded new resources that will benefit the speech science community. We have assessed the performance of phonetic and ASR methods on the same data, using modified versions of the DyViS corpus, a collection of recordings of 100 young men in simulated police interviews (Nolan et al 2009, ESRC award no. RES-000-23-1248).

A number of different versions of the corpus have been produced using editing and sound processing techniques. These are being made publicly available and will facilitate use of the corpus for substantively and methodologically different research purposes.

New versions of DyViS include:
- manually edited boundaries to identify speech (of the target speaker)
- extraction of silences
- time alignment of the near-end and far-end of the telephone recordings
- segmentation of the speech into vowels and consonants
- resampling of audio to different sample rates
- 3G GSM mobile processing of audio with different bit-rates

We plan to share the relevant versions and scripts on an open platform in due course.

2. The description of voice quality mentioned under 1 may involve the use of a system of categories, known as the Vocal Profile Analysis (VPA) scheme, originally designed for speech and language therapy applications. The use of this system has been subject to a degree of scepticism on the grounds that it is an impressionistic scheme and may therefore be applied differently by different analysts.

To test the approach we have conducted an inter-rater comparison of perceptual vocal setting analysis using the VPA (San Segundo et al 2019), the most extensive such study to date. A key finding of this research is that, through a process of calibration, it is possible to achieve a high degree of inter-rater agreement among properly trained analysts across VPA categories applied to different voices. This finding provides validation for the scheme and has import for those using it within, for example, sociophonetics, speech and language therapy, as well as forensic speech science.
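The agreement statistics used in the published study are not given in this record. As a generic illustration of how agreement between two analysts can be quantified beyond raw percentage agreement, the sketch below computes Cohen's kappa, which corrects for agreement expected by chance. The labels and ratings are invented for the example and are not VPA data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater assigned labels at their own base rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical voice-quality labels from two analysts for four voices.
analyst_1 = ["creaky", "breathy", "creaky", "neutral"]
analyst_2 = ["creaky", "breathy", "neutral", "neutral"]
kappa = cohens_kappa(analyst_1, analyst_2)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; a calibration exercise of the kind described above would aim to raise such a figure.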

3. The third key finding is in respect of our work on magnetic resonance imaging (MRI) to explore how the configuration of the vocal tract varies when speakers adopt different vocal settings, and how those changes relate to the acoustic output of the vocal tract. This has enabled us to develop methodologies for establishing how perceptual voice quality, as assessed by analysts trained in the use of the VPA scheme, relates to real, biological settings, movements of the articulatory organs and physical measurements of the distances between them. Owing to the resource-intensive nature of this work, we have only been able to process a limited amount of MRI data. However, the work has considerable methodological importance, as it incorporates procedures for the capture and representation of dynamic two-dimensional MRI data and three-dimensional images of the vocal tract. In particular, novel statistical procedures have been developed, which are now being used in a follow-on project.
Exploitation Route
- The new versions of corpora we have developed will be of wide value in speech science.
- Further development of our pilot MRI work is already in train via a British Academy Postdoctoral Fellowship awarded to Amelia Gully, "Anatomy, Acoustics and the Individual".
- We have already secured two PhD scholarships to further develop work exploring the contribution of phonetic criteria to automatic speech recognition systems (funded by an AHRC Collaborative Doctoral Award in partnership with Aculab PLC, and by the University of York Graduate Research School scholarships for overseas students).
- There are practical implications for forensic casework by private and government laboratories, as evidenced in the section on 'non-academic impacts'.
Sectors Aerospace, Defence and Marine

Digital/Communication/Information Technologies (including Software)

Electronics

Government, Democracy and Justice

Security and Diplomacy

 
Description
- Our work on this project has led directly to collaborative research with the ASR companies Nuance, Aculab and Phonexia.
- Research has been presented to non-academic partners at several conferences on forensic speech science, general forensic science, and security. These include ASR companies, government laboratories, the National Crime Agency, and security agencies.
- We have presented aspects of the research to the general public (e.g. French, Café Scientifique).
- Aspects of the research are already influential in forensic casework at JP French Associates (JPFA), the UK's largest provider of forensic speech and audio services. JPFA provides services to law enforcement agencies, government departments and independent law firms. The company is presently undergoing preparation for accreditation under ISO 17025, as required by the Home Office Forensic Regulator. Part of the accreditation involves demonstrating that the analytic methods and staff training procedures in place are robust, reliable and have been validated. In forensic speaker comparison cases ('voice identification'), which make up 70% or more of the company's work, use is made of a scheme for the auditory-perceptual analysis of voice quality ('timbre'). The scheme, known as the Vocal Profile Analysis (VPA) scheme, was the subject of research conducted under the present grant and published in the peer reviewed article by San Segundo et al (2018). The article affirmed the use of the scheme and described a procedure for calibration among analysts. While JPFA have for many years been undertaking cross-analyst calibration, the published research provides an impetus and a model for formalising their calibration methods, both for their forensic casework and for their training procedures for new analysts entering the field.
The research will figure in the validation documentation provided by JPFA to UKAS - the UK's National Accreditation Body, responsible for determining applications for accreditation under ISO 17025 (due to be completed in 2020).
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software),Education,Government, Democracy and Justice,Security and Diplomacy
Impact Types Societal

Policy & public services

 
Description Criminal voices: an outline of forensic speech science 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Seminar, Università degli Studi di Bergamo, Italy, 10 April.
Year(s) Of Engagement Activity 2017
 
Description Forensic speech science: principles and cases, Undergraduate Linguistics Association of Britain 2015 - Peter French 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact tbc
Year(s) Of Engagement Activity 2015
 
Description French, P. Aspects of Forensic Speech Science. Forty Years of English Language and Linguistics Celebration Colloquium talk, York St John University, 26th October 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Undergraduate students
Results and Impact talk
Year(s) Of Engagement Activity 2016
 
Description French, P. Did HE really say THAT? Café Scientifique, Stockton, 26th May 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Café Scientifique is a monthly conversation about current issues in science and technology in a relaxed café setting.
Year(s) Of Engagement Activity 2016
 
Description French, P. Forensic Speaker Comparison in the UK: present and future. National Crime Agency Voice Analytics Conference 9th November 2016, Wellington Barracks, London 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The audience consisted of National Crime Agency (NCA) officers and technical staff, police and personnel from various UK government security departments, leading developers and providers of automatic speaker recognition software as well as representatives from foreign embassies. After the presentation I was approached by a number of delegates about the University providing training in speech analysis and about whether the Department of Language and Linguistics could provide students who had undergone postgraduate training in forensic speech science to fill posts as analysts at the NCA. The NCA are to visit the Department in June in order to explain these posts and their work more generally to doctoral students and students undertaking the MSc in forensic speech science.
Year(s) Of Engagement Activity 2016
 
Description Perceptual, spectral and prosodic correlates of vocal tract tension, Lab Meeting, 23rd February 2017, Phonetics Laboratory, University of Zurich, Switzerland; E San Segundo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact lab meeting
Year(s) Of Engagement Activity 2017
 
Description Seminar "Fonética Experimental, Prosodia y Entonación del Español. Perspectivas Investigadoras y Futuro Profesional", University of Seville, Spain 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Engagement activity.
Year(s) Of Engagement Activity 2016
URL http://www.siff.us.es/web/?p=14206
 
Description Seminar NOSH: New Observations in Speech and Hearing, Institute of Phonetics and Speech Processing, Ludwig Maximilian University of Munich, Germany 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited talk
Year(s) Of Engagement Activity 2015
URL http://www.phonetik.uni-muenchen.de/~hoole/kurse/mampf/mampf.html
 
Description Speaker-similarity perception of Spanish twins and non-twins by native speakers of Spanish, German and English 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster, IAFPA 26, Split, Croatia.
Year(s) Of Engagement Activity 2017
URL http://www.ffst.unist.hr/znanost/konferencije/iafpa
 
Description The Tarnished Silver Tongue: casework and research in forensic speech science, University of Sussex, School of English, Open Lecture - Peter French 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Undergraduate students
Results and Impact tbc
Year(s) Of Engagement Activity 2015,2016
URL http://sussexlinguists.blogspot.co.uk/2015/10/rolls-peter-french-on-forensic.html
 
Description The complementarity of automatic, semi-automatic, and phonetic measures of vocal tract output in forensic voice comparison 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Hughes, V., Harrison, P., Foulkes, P., French, P., Kavanagh, C. & San Segundo, E. (2017) The complementarity of automatic, semi-automatic, and phonetic measures of vocal tract output in forensic voice comparison. IAFPA 26, Split, Croatia.
Year(s) Of Engagement Activity 2017
 
Description Voice and Identity: applications and limitations of the voice as a biometric 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Kavanagh, C., Foulkes, P., French, P., Harrison, P., Hughes, V. & San Segundo, E. (2017) Voice and Identity: applications and limitations of the voice as a biometric. CREST Workshop: Current state of the art, and future directions for linguistic analysis in a security context. London, 2 October.
Year(s) Of Engagement Activity 2017
 
Description Workshop on Verbal Voice Profiling, University of Campinas, Brazil 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Collaboration with Brazilian and Swedish colleagues
Year(s) Of Engagement Activity 2016