Understanding speech in the presence of other speech: Perceptual mechanisms for auditory scene analysis in human listeners

Lead Research Organisation: Aston University
Department Name: Sch of Life and Health Sciences

Abstract

We take it for granted that we can converse with other people in daily life and be understood with little, if any, noticeable effort. However, it is fairly unusual to hear the speech of a particular talker in isolation; speech is typically heard in the presence of interfering sounds, such as the voices of other talkers. The human auditory system, which is responsible for our sense of hearing, therefore faces the challenge of identifying which parts of the sounds reaching our ears have originated from the same environmental source. This involves separating those sound elements coming from one source (e.g., the voice of one talker) from those arising from other sources, and grouping them in ways that can be interpreted by higher-level processes in the brain (such as those involved in our understanding of speech). Without a solution to this "auditory scene analysis" problem, our perceptions of speech (and other sounds) would not correspond to the events that produced them. Humans have been exposed to a variety of complex listening environments over the course of evolution, and so we are generally very successful at understanding the speech of one person in the presence of other talkers. This contrasts with attempts to develop listening machines, which often fail catastrophically when confronted with complex listening environments, such as an open-plan office or a crowded party. Human listeners with hearing impairment also find these environments very difficult, even when using the latest developments in hearing-aid or cochlear-implant technology.

So far, most research on auditory scene analysis has focussed on relatively simple sounds and has identified a number of general principles for the perceptual grouping and separation of sound elements. However, at least as currently understood, these principles seem inadequate to explain fully the perceptual grouping of speech. This is because the speech signal consists of a diverse and rapidly changing stream of sounds. The speech of our native language is also a highly familiar stimulus, and so by adulthood our auditory system has had many years to learn about its speech-specific properties. These properties may also assist in the successful perceptual grouping of speech.

Much of the information necessary to understand speech is carried by the changes in frequency over time of a few broad peaks in the frequency spectrum of the speech signal, known as formants. The aim of this project is to investigate how human listeners presented with speech sound mixtures are able to group together the appropriate formants, and to reject others, such that the speech of the talker we want to listen to can be understood. We will do so using perceptual experiments with human listeners, in which we measure how the intelligibility of target speech (measured, for example, as the number of words reported correctly) changes under a variety of conditions. The project will explore the roles of general-purpose grouping factors (i.e., those that apply to a wide variety of sounds) and of speech-specific grouping factors, including higher-level constraints associated with the articulation of speech (i.e., the way our tongue, lips, and jaw move when we speak) and with the rules of our language. Our approach is to generate artificial speech-like stimuli with precisely controlled properties, to mix target speech with carefully designed "competitors" that offer alternative grouping possibilities, and to measure how manipulating the properties of these competitors affects the ability of human listeners to recognise the target speech in the mixture. The results of this project will not only improve our understanding of how human listeners separate speech from interfering sounds, but will also help to refine computer models of listening. Such refinements will in turn provide ways of improving the performance of devices such as hearing aids and automatic speech recognisers when they operate in noisy environments.
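As a rough illustration of this approach (a minimal sketch, not the project's actual synthesis method; all sample rates, contour values, and amplitudes below are hypothetical), the following Python fragment builds a "tonal" formant analogue - a single sinusoid whose instantaneous frequency follows a time-varying formant contour - and mixes it with a competitor whose contour is inverted about the geometric-mean frequency:

```python
# Illustrative sketch only: parameter values are hypothetical.
import numpy as np

FS = 44100  # sample rate (Hz)

def tonal_formant(freq_contour, amp_contour, fs=FS):
    """Synthesise a 'tonal' formant analogue: one sinusoid whose
    instantaneous frequency follows the given formant contour."""
    phase = 2 * np.pi * np.cumsum(freq_contour) / fs  # integrate frequency -> phase
    return amp_contour * np.sin(phase)

n = FS                                    # 1-s stimulus
f2 = np.linspace(1100.0, 1700.0, n)       # hypothetical F2 glide (Hz)
a2 = np.full(n, 0.1)                      # flat amplitude contour
target_f2 = tonal_formant(f2, a2)

# A competitor offering an alternative grouping possibility: the same
# contour inverted about its geometric-mean frequency.
gm = np.exp(np.mean(np.log(f2)))
competitor = tonal_formant(np.exp(2 * np.log(gm) - np.log(f2)), a2)

mixture = target_f2 + competitor
# Intelligibility would then be measured behaviourally (e.g., keywords
# reported correctly) with the mixture presented alongside the remaining
# target formants.
```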

Planned Impact

Private-sector companies that develop robust automatic speech recognition (ASR) devices and techniques for speech enhancement:

Robust performance in unpredictable noise remains a key problem for ASR and mobile telephony. The project will provide the perceptual data needed by researchers in computational auditory scene analysis (CASA) to further develop models and algorithms for separating speech from interfering sounds. This in turn offers the prospect of improved front-end processors for ASR and speech enhancement systems, which is likely to improve the performance of commercial systems. Our industrial advisor, Audience, is one of the leading commercial developers of speech enhancement technology. Our research findings will be communicated to Audience via email and at regular review meetings, timed to coincide with jointly attended international conferences. A report will also be made available to other key companies in ASR and speech enhancement technology.

Private-sector companies that develop hearing aids and cochlear implants:

Effective separation of a target voice from interfering sounds is one of the key problems facing designers of hearing aids and cochlear implant (CI) processors. Our findings on the perceptual grouping of formants will potentially improve the coding of speech in CI processors. The first benefits of enhanced CASA solutions for improving hearing aids and CI processors are likely to emerge after about 5 years, which is a typical development lead-time for such devices. Our findings will be communicated to hearing aid and CI manufacturers by sending them a bespoke report, by visits to selected companies, and through meetings with their representatives at conferences.

Charities that support the hearing impaired:

Our research findings will be communicated to major UK charities that support the hearing impaired (Action on Hearing Loss, Deafness Research UK). An improved understanding of the perceptual processes underpinning successful auditory grouping will encourage the development of collaborative psychophysical and modelling studies. Such cross-disciplinary interaction is likely to stimulate funding from these charities for research projects on advanced hearing-aid and CI algorithms. Our findings will be communicated to hearing charities via a report tailored to their interests, and by conference presentations.

The general public:

Indirectly, the impact on the beneficiaries listed above will also benefit the general public within a timescale of 5-10 years. Improvements to hearing aid technology will benefit the estimated 9 million deaf and hard-of-hearing people in the UK. Similarly, there are about 180,000 CI users worldwide who would benefit from better techniques for encoding noisy speech in CI processors. Improved ASR and speech enhancement for mobile telephony, which may arise from our project via its impact on enhanced CASA solutions, will impact on quality of life through improved speech-based communication with machines, and enhanced electronically-mediated vocal communication between individuals. The outcomes of the project will also be communicated directly to the public via a website, which will include a lay summary of our findings, and by public lectures.

The research fellow:

Summers will receive communication-skills training from the Royal Society to prepare him for public lectures, report writing, and company visits. This will provide transferable skills relevant to many professional careers. His involvement in writing technical reports and in visits to leading hearing-technology companies will foster links with the commercial sector that are likely to enhance his prospects of future employment in either academic or industrial research.

The project team has the expertise needed to implement this impact plan. We have already established contacts with CASA researchers and with major commercial organisations with interests in ASR, mobile communications, and hearing prostheses.

 
Description Ten experiments were completed during the course of this project. To summarise the key outcomes, these experiments are considered in five groups (each corresponding to either a published or anticipated journal article):

[EXPERIMENTS 1-2] Roberts, B., Summers, R.J., and Bailey, P.J. (2015). "Acoustic source characteristics, across-formant integration, and speech intelligibility under competitive conditions," Journal of Experimental Psychology: Human Perception & Performance, 41, 680-691.

Key Outcomes: The results indicate that the contribution of a formant to the phonetic identity of a speech sound is governed by the nature of that formant's acoustic source properties, rather than by whether or not it matches the source properties of the other formants. This outcome is incompatible with a major role for target-masker similarity in determining across-formant grouping, as might have been expected based on studies of informational masking using non-speech materials. The results add to a growing body of evidence from studies and simulations of combined acoustic and electro-acoustic hearing that listeners can integrate phonetic information across radically different modes of stimulation.

[EXPERIMENTS 3-4] Roberts, B., and Summers, R.J. (2015). "Informational masking of monaural target speech by a single contralateral formant," Journal of the Acoustical Society of America, 137, 2726-2736.

Key Outcomes: The results indicate that a single formant presented in the contralateral ear can produce substantial informational masking of target speech, despite the availability of a "clean" signal at the auditory periphery. The impact of an extraneous interfering formant on speech intelligibility depends primarily on the extent of variation of its frequency contour; variation of its amplitude contour has relatively little effect on the interference produced. There is no evidence that "speech-like variation" per se - i.e., distinctive acoustical correlates of particular articulatory movements - influences across-formant grouping and interference.
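To make "extent of variation of its frequency contour" concrete: one simple way to manipulate it is to scale the depth of the contour's excursions about its geometric-mean frequency. The sketch below is illustrative only (the function name and scale factors are assumptions, not the study's exact procedure):

```python
import numpy as np

def scale_contour_depth(freq_contour, scale):
    """Scale the depth of frequency variation about the contour's geometric
    mean: scale=0 flattens the contour, 1 leaves it unchanged, and values
    greater than 1 exaggerate its excursions."""
    log_f = np.log(np.asarray(freq_contour, dtype=float))
    mean_log = log_f.mean()               # log of geometric-mean frequency
    return np.exp(mean_log + scale * (log_f - mean_log))

# e.g., a family of interferer contours differing only in depth of variation:
# variants = [scale_contour_depth(contour, s) for s in (0.0, 0.5, 1.0, 2.0)]
```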

[EXPERIMENTS 5-7] Summers, R.J., Bailey, P.J., and Roberts, B. (2016). "Across-formant integration and speech intelligibility: Effects of acoustic source properties in the presence and absence of a contralateral interferer," Journal of the Acoustical Society of America, 140, 1227-1238.

Key Outcomes: The results extend those from our earlier research using dichotic targets. Acoustic source type and competition, rather than acoustic similarity, govern the phonetic contribution of a formant, even when target and interfering formants that differ in bandwidth (harmonic = wide, tonal = narrow) are matched for equal loudness rather than for equal RMS power. Furthermore, the integration of phonetic information across formants with different source characteristics may be greatly affected not only by the presence of interferers, but also by the spatial configuration of formants. In particular, the informational masking produced by an interfering formant may be exacerbated under circumstances requiring the integration of target formants across ears. Such a situation may arise for cochlear-implant listeners with residual low-frequency hearing in the non-implanted ear.

[EXPERIMENTS 8-9] Summers, R.J., Bailey, P.J., and Roberts, B. (2017). "Informational masking and the effects of differences in fundamental frequency and fundamental-frequency contour on phonetic integration in a formant ensemble," Hearing Research, 344, 295-303.

Key Outcomes: In the absence of interference, a mismatch in F0 (pitch) contour between one target formant (F2) and the others (F1+F3) has no detrimental effect on intelligibility. Intelligibility is reduced when an interfering formant is added whose F0 contour matches that of F1+F3. As the difference in F0 between F2 and the other formants increases, intelligibility falls further. Where F0 differences between formants arise from differences in time-varying F0 contours, the fall in intelligibility depends on the mean difference in F0 between contours rather than differences in contour shape per se. There is no evidence that the natural variation of voice pitch over the course of a sentence increases the likelihood that a particular formant contributes to the speech percept.
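A plausible way to quantify the "mean difference in F0 between contours" is the average interval between two time-aligned contours in semitones. The sketch below shows that metric as an assumption, not necessarily the exact computation used in the study:

```python
import numpy as np

def mean_f0_difference_semitones(f0_target, f0_interferer):
    """Mean interval between two time-aligned F0 contours, in semitones
    (positive values mean the interferer is higher on average)."""
    f0_target = np.asarray(f0_target, dtype=float)
    f0_interferer = np.asarray(f0_interferer, dtype=float)
    return float(np.mean(12.0 * np.log2(f0_interferer / f0_target)))
```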

[EXPERIMENT 10] Roberts, B., and Summers, R.J. (2020). "Informational masking of speech depends on masker spectro-temporal variation but not on its coherence," Journal of the Acoustical Society of America, 148, 2416-2428.

Key Outcomes: The impact of a time-varying interferer on intelligibility depends critically on the overall extent of its formant-frequency variation, but not on its spectro-temporal coherence. Specifically, the extent to which the interfering formant reduces intelligibility depends neither on the segmentation of the amplitude contour (unbroken vs. divided into 100- or 200-ms-long segments) nor on the randomization of segment order applied to the frequency contour (coherent vs. incoherent). This outcome suggests that an extraneous formant may act as an interferer primarily by increasing the overall cognitive load on the listener, rather than by the intrusion of specific acoustic-phonetic properties of the extraneous formant into the target speech percept. Once again, there is no evidence that "speech-like variation" influences across-formant grouping and interference.
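The coherence manipulation described above can be sketched as follows: a sampled contour is divided into fixed-duration segments whose order is then randomised, destroying spectro-temporal coherence while leaving the overall extent of variation unchanged. Implementation details (sampling, handling of any partial final segment) are assumptions:

```python
import numpy as np

def segment_and_shuffle(contour, fs, seg_ms=200, seed=None):
    """Divide a sampled contour into seg_ms-long segments and randomise
    their order; any partial segment at the end is left in place."""
    contour = np.asarray(contour, dtype=float)
    rng = np.random.default_rng(seed)
    seg_len = int(fs * seg_ms / 1000)
    n_seg = len(contour) // seg_len
    body = contour[:n_seg * seg_len].reshape(n_seg, seg_len).copy()
    rng.shuffle(body)                     # randomise segment order (rows)
    return np.concatenate([body.reshape(-1), contour[n_seg * seg_len:]])
```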
Exploitation Route The results obtained during this project suggest approaches by which engineers and computer scientists might improve the performance of devices such as hearing aids and automatic speech recognizers when they are operating in noisy environments.
Sectors Digital/Communication/Information Technologies (including Software), Healthcare

URL http://www.aston.ac.uk/lhs/staff/az-index/robertsb/understanding-speech-in-the-presence-of-other-speech/
 
Description This research project was primarily theoretical, and so its economic and societal impact at this stage is limited. Nonetheless, two routes of potential impact are beginning to emerge: (1) Scientists interested in computational solutions for auditory scene analysis (CASA) are beginning to use the results of this project to inform developments in these solutions. Notably, we are in periodic contact with Martin Cooke (University of the Basque Country, Spain) and DeLiang Wang (Ohio State University, USA). In the longer term, improved CASA solutions offer the prospect of enhanced performance by hearing prostheses and automatic speech recognition systems operating in noisy environments. Such enhancements would in turn yield healthcare and societal benefits. (2) Scientists employed by private-sector companies that develop hearing aids and cochlear implants are beginning to consider the results of this project in guiding their own research and development projects. We have carried out two dissemination visits based on our funded research, one to Phonak in Switzerland (2-3/Feb/2016) and one to Oticon in Denmark (23-24/Feb/2016). As a result, we now have established contacts with their research teams (Stefan Launer and Michael Boretzki at Phonak; Niels Pontoppidan and Lars Bramslow at Oticon). The link with Oticon (Eriksholm Research Centre) has been maintained in the context of our recently completed ESRC-funded project (ES/N014383/1).
Sector Digital/Communication/Information Technologies (including Software), Healthcare
Impact Types Societal, Economic

 
Title Dataset for published article in Hearing Research by Summers, Bailey, and Roberts (2017). 
Description These datasets comprise listeners' transcriptions of sentence-length speech stimuli for Experiments 1 and 2 of the article of the same title (Summers, Bailey, and Roberts, 2017, Hearing Research). Each spreadsheet comprises two summary worksheets and the raw data for each listener. The summary worksheets contain aggregated scores (keywords correct by tight and loose scoring, see below) for each listener in each condition, with relevant demographic information. Subsequent worksheets comprise the raw data for each listener and stimulus.
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published.
URL http://dx.doi.org/10.17036/030af3e1-064c-4b80-b478-f17fa0e64842
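A note on the "tight" and "loose" scoring mentioned in these dataset descriptions: the sketch below reflects common usage in the speech-intelligibility literature (tight scoring requires an exact match on each keyword; loose scoring also accepts simple morphological variants) and is illustrative only, not the exact rules applied to these datasets:

```python
def score_keywords(response_words, keywords, loose=False):
    """Count keywords reported correctly. Tight scoring requires an exact
    (case-insensitive) match; loose scoring also accepts simple morphological
    variants, crudely approximated here by stripping common suffixes."""
    def stem(word):
        word = word.lower()
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word
    norm = stem if loose else str.lower
    reported = {norm(w) for w in response_words}
    return sum(norm(k) in reported for k in keywords)

# e.g., score_keywords(["dogs", "barked"], ["dog", "bark"], loose=True) -> 2
```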
 
Title Dataset for published article in JASA by Roberts and Summers (2015). 
Description These datasets comprise listeners' transcriptions of sentence-length speech stimuli for Experiments 1 and 2 of the article by Roberts and Summers (2015). Each spreadsheet comprises a summary worksheet and the raw data for each listener. The summary worksheet contains aggregated scores (keywords correct by tight scoring) for each listener in each condition, with relevant demographic information. Subsequent worksheets comprise the raw data for each listener and stimulus. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published. 
URL http://dx.doi.org/10.17036/Roberts_20150427_A01
 
Title Dataset for published article in JASA by Summers, Bailey, and Roberts (2016). 
Description These datasets comprise listeners' transcriptions of sentence-length speech stimuli for Experiments 1, 2, and 3 of the article of the same title (Summers, Bailey, and Roberts, 2016, Journal of the Acoustical Society of America). Each spreadsheet comprises a summary worksheet and the raw data for each listener. The summary worksheet contains aggregated scores (keywords correct by tight scoring, see above) for each listener in each condition, with relevant demographic information. Subsequent worksheets comprise the raw data for each listener and stimulus.
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published.
URL http://dx.doi.org/10.17036/832e5972-0344-4ae5-b38c-0ff066de3a5f
 
Title Dataset for published article in JEP:HPP by Roberts, Summers, and Bailey (2015). 
Description These datasets comprise listeners' transcriptions of sentence-length speech stimuli for Experiments 1 and 2 of the article by Roberts, Summers, and Bailey (2015). Each spreadsheet comprises a summary worksheet and the raw data for each listener. The summary worksheet contains aggregated scores (keywords correct by tight scoring) for each listener in each condition, with relevant demographic information. Subsequent worksheets comprise the raw data for each listener and stimulus. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published.
URL http://dx.doi.org/10.17036/7428fbe0-d7fe-41f5-a53a-32e9726254cd
 
Title Entries in the UK Data Service (ReShare) repository. 
Description Datasets for all ten experiments completed for this grant are available on the ReShare repository; these datasets comprise listeners' transcriptions of sentence-length speech stimuli. Each spreadsheet comprises a summary worksheet and the raw data for each listener. The summary worksheets contain aggregated scores (keywords correct by tight and/or loose scoring) for each listener in each condition. Subsequent worksheets comprise the raw data for each listener and stimulus. Each dataset is accompanied by a short text description; in cases where the associated article has not yet been published, a PDF summary report is also provided.
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact None at this stage. 
 
Title Informational masking of speech depends on masker spectro-temporal variation but not on its coherence 
Description Dataset for the published article of the same title (Roberts and Summers, 2020, Journal of the Acoustical Society of America).
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://researchdata.aston.ac.uk/id/eprint/477
 
Description Pathways to Impact - two presentations at the Big Bang Young Scientists and Engineers Fair (NEC, Birmingham, 11-14 March 2015 and 16-19 March 2016).
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Involvement of the general public - mainly families with children - in interactive demonstrations involving the presentation of synthetic speech, plus discussion of the meaning and importance of research in this area to a lay audience.

High level of engagement with the general public - considerable interest throughout each day in our interactive exhibit, which used materials created as part of our ESRC-funded project.
Year(s) Of Engagement Activity 2015,2016
URL http://www.thebigbangfair.co.uk/Play-your-part/Volunteer-roles/
 
Description Pathways to Impact - Dissemination visit to Oticon's Eriksholm Research Centre (Snekkersten, Denmark, 23-24 February 2016) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Oticon is one of Europe's leading hearing-technology companies, and a major manufacturer of hearing aids and cochlear implants. We (Brian Roberts and Rob Summers) were hosted at Oticon's Eriksholm Research Centre by Niels Pontoppidan (Group Manager, Advanced Algorithms) and Lars Bramslow (Project Manager, Competing Voices). Our visit involved giving a presentation on our ESRC-funded research, including extensive round-table discussion, and receiving a briefing on related research and development taking place at Oticon. In addition to extending our relationship with this company, we identified areas for further consideration that might form the basis for future collaboration. The next stage will involve discussions with Dr Huw Cooper (Consultant Clinical Scientist, Audiology) at University Hospital Birmingham.
Year(s) Of Engagement Activity 2016
 
Description Pathways to Impact - Dissemination visit to Phonak HQ (Staefa, Switzerland, 2-3 February 2016) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Phonak is one of Europe's leading hearing-technology companies, and a major manufacturer of hearing aids and cochlear implants. We were hosted at Phonak HQ by Stefan Launer (Vice President for Advanced Concepts and Technologies) and Michael Boretzki (research scientist with interests in psychophysics, experimental psychology, and speech and language pathology). Our visit involved giving a presentation on our ESRC-funded research, including extensive round-table discussion, and receiving a briefing on related research and development taking place at Phonak. In addition to extending our relationship with this company, we identified areas for further consideration that might form the basis for future collaboration.
Year(s) Of Engagement Activity 2016
 
Description Presentation at the Big Bang Young Scientists and Engineers Fair (NEC, Birmingham, 13-16 March 2019). 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Interactive demonstrations at the Big Bang Young Scientists and Engineers Fair under the umbrella theme of "superpowers". Our contribution - part of Aston University's stand - concerns our amazing abilities to understand speech under adverse listening conditions. The event begins in the current reporting period but ends in the next, so at this point it is not possible to identify specific outcomes/impacts arising. The option chosen below is based on our previous experience with similar events.
Year(s) Of Engagement Activity 2019
URL https://www.thebigbangfair.co.uk/get-involved/volunteer-with-us/