Interference in spoken communication: Evaluating the corrupting and disrupting effects of other voices

Lead Research Organisation: Aston University
Department Name: Sch of Life and Health Sciences

Abstract

In everyday life, talking with other people is important not only for sharing knowledge and ideas, but also for maintaining a sense of belonging to a community. Most people take it for granted that they can converse with others with little or no effort. Successful communication involves understanding what is being said and being understood, but it is quite rare to hear the speech of a particular talker in isolation. Speech is typically heard in the presence of interfering sounds, which are often the voices of other talkers. The human auditory system, which is responsible for our sense of hearing, therefore faces the challenge of identifying which parts of the sounds reaching our ears have come from which talker.

Solving this "auditory scene analysis" problem involves separating those sound elements arising from one source (e.g., the voice of the talker to whom you are attending) from those arising from other sources, so that the identity and meaning of the target source can be interpreted by higher-level processes in the brain. Over the course of evolution, humans have been exposed to a variety of complex listening environments, and so we are generally very successful at understanding the speech of one person in the presence of other talkers. This contrasts with attempts to develop listening machines, which often fail when confronted with adverse conditions, such as automatic transcription of a conversation in an open-plan office. Human listeners with hearing impairment often find these environments especially difficult, even when using the latest developments in hearing-aid or cochlear-implant design, and so can struggle to communicate effectively in such conditions.

Much of the information necessary to understand speech (acoustic-phonetic information) is carried by the changes in frequency over time of a few broad peaks in the frequency spectrum of the speech signal, known as formants. The project aims to investigate how listeners presented with mixtures of target speech and interfering formants are able to group together the appropriate formants, and to reject others, such that the speech of the talker we want to listen to can be understood. Interfering sounds can have two kinds of effect: energetic masking, in which the neural response of the ear to the target is swamped by the response to the masker; and informational masking, in which the "auditory brain" fails to separate readily detectable parts of the target from the masker. The project will explore the informational masking component of interference (often the primary factor limiting speech intelligibility) using stimulus configurations that eliminate energetic masking. We will do so using perceptual experiments in which we measure how our ability to understand speech (e.g., the number of words reported correctly) changes under a variety of conditions.
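By way of illustration, the sketch below (Python with numpy; the sampling rate, centre frequency, and depth of frequency variation are illustrative assumptions rather than stimulus parameters from the project) renders a single time-varying formant track as a sine-wave analogue, one simple form of artificial speech-like stimulus:

```python
import numpy as np

def sine_formant(freq_track_hz, fs):
    """Render a formant-frequency track as a sine-wave analogue: a tone
    whose instantaneous frequency follows the track. The phase is the
    cumulative integral of frequency (approximated by a running sum)."""
    freq = np.asarray(freq_track_hz, dtype=float)
    phase = 2.0 * np.pi * np.cumsum(freq) / fs
    return np.sin(phase)

fs = 44100                         # sampling rate (assumed)
t = np.arange(int(0.5 * fs)) / fs  # 500 ms of signal
# Hypothetical second-formant (F2) track: 1500 Hz centre frequency
# with a slow glide of +/- 200 Hz.
f2_track = 1500.0 + 200.0 * np.sin(2.0 * np.pi * 2.0 * t)
f2_analogue = sine_formant(f2_track, fs)
```

Summing several such analogues (e.g., for F1-F3) yields a fully specified speech-like signal whose components can be manipulated independently or assigned to different ears.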

The project will examine how acoustic-phonetic information is combined across formants. It will also explore how a speech-like interferer affects intelligibility, distinguishing the circumstances in which the interferer takes up some of the available perceptual processing capacity from those in which specific properties of the interferer intrude into the perception of the target speech. Our approach is to use artificial speech-like stimuli with precisely controlled properties, to mix target speech with carefully designed interferers that offer alternative grouping possibilities, and to measure how manipulating the properties of these interferers affects listeners' abilities to recognise the target speech in the mixture. The results will improve our understanding of how human listeners separate speech from interfering sounds and the constraints on that separation, helping to refine computational models of listening. Such refinements will in turn provide ways of improving the performance of devices such as hearing aids and automatic speech recognisers when they operate in adverse listening conditions.
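Intelligibility in experiments of this kind is typically quantified as the proportion of keywords reported correctly. A minimal sketch of the "tight" scoring rule mentioned in the dataset descriptions below (Python; the keywords and transcript are hypothetical) might look as follows:

```python
def keywords_correct(transcript, keywords):
    """Tight scoring: a keyword counts only if reported verbatim,
    and each reported token can credit at most one keyword."""
    reported = transcript.lower().split()
    score = 0
    for kw in keywords:
        if kw.lower() in reported:
            reported.remove(kw.lower())  # credit each token once
            score += 1
    return score

# Hypothetical trial: three keywords, one misheard by the listener.
keywords = ["glove", "lay", "table"]
transcript = "the glove lay on the chair"
print(keywords_correct(transcript, keywords))  # -> 2
```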

Planned Impact

Private-sector companies who develop hearing aids and cochlear implants:

Effective separation of a target voice from interfering speech is one of the main problems facing designers of hearing aids and cochlear implant (CI) processors. The project will provide greater understanding of how human listeners combine acoustic-phonetic information across formants and improved characterisation of the informational component of speech-on-speech interference. This will inform the development of enhanced hearing prostheses, and may improve speech coding in CI processors. One of our Project Partners, Starkey, invests in longer-term issues in hearing research and is the largest manufacturer of hearing aids in the USA. We have also established contacts with the two major suppliers of hearing prostheses in the UK (Oticon and Phonak). Outcomes will be communicated to hearing aid and CI manufacturers by sending technical reports of our findings, by visiting selected companies, and by meeting their representatives at conferences.

Private-sector companies who develop robust automatic speech recognition (ASR) devices and techniques for speech enhancement:

Robust performance in the presence of interferers remains a key problem for ASR. The project will provide perceptual data for researchers in computational auditory scene analysis (CASA) to develop further models and algorithms for separating speech from interfering sounds; we have already established contacts with CASA researchers. This offers the prospect of improved front-end processors for ASR and speech enhancement systems, which is likely to improve the performance of commercial systems. Our other Project Partner, Audience, is one of the leading commercial developers of speech enhancement technology. Our research findings will be communicated to Audience via email and at regular meetings. A report will also be made available to other key companies in ASR and enhancement technology.

Charities that support the hearing impaired:

Our research findings will be communicated to the main UK charity, Action on Hearing Loss. Improved understanding of the perceptual processes underpinning successful separation of target speech from interfering speech will encourage the development of collaborative behavioural and modelling studies. Such cross-disciplinary interaction may stimulate the funding of research projects on advanced hearing-aid and CI algorithms of likely interest to the charity. As well as articles in specialist journals, our findings will be communicated via a report tailored to the charity's interests.

The general public:

Indirectly, the benefits to the groups above will reach the general public within 5-10 years. Improvements to hearing aid technology will benefit the ~10 million deaf and hard-of-hearing people in the UK. Similarly, there are ~320,000 CI users worldwide who would benefit from better techniques for encoding noisy speech in CI processors. Improved ASR and speech enhancement, which may arise from our project via its impact on CASA solutions, will enhance quality of life through improved speech-based communication with machines and better electronically mediated vocal communication between individuals. The outcomes of the project will be communicated directly to the public via a lay summary on our project website, and through public engagement events.

The research fellow:

The RF will receive training on public engagement for researchers to facilitate his contributions to public lectures, report writing, and company visits. His involvement in writing journal articles and developing grant applications will put him in an excellent position for fellowship applications. His involvement in writing technical reports and visiting leading hearing-technology companies will foster links with the commercial sector that are likely to enhance his prospects of future employment in either academic or industrial research.

 
Description Twelve experiments were completed during this project. To summarise the key outcomes, these experiments are considered in six groups (A-F, each corresponding to a published or an anticipated journal article).

The following points are relevant to the key findings presented below: (1) Understanding speech requires the perceptual integration of acoustic-phonetic information carried by the three major spectro-temporal components (formants F1-F3) of the speech signal, and the perceptual exclusion of interfering sounds. (2) There are two broad ways in which interfering speech may lower the intelligibility of the target speech: energetic masking, in which the response of the auditory nerve to the target is swamped by the response to the masker; and informational masking (IM), which is of central origin. (3) The masking experienced when target speech is accompanied by one or two interfering voices is often primarily informational. (4) IM may arise because the interferer disrupts processing of the target (e.g., through capacity limitations) or corrupts it (e.g., through intrusions into the target percept). (5) Successful speech perception involves interpreting incoming sensory information in the context of stored linguistic knowledge.

[A. EXPERIMENTS 1-2:] Roberts, B., and Summers, R.J. (2018). "Informational masking of speech by time-varying competitors: Effects of frequency region and number of interfering formants," Journal of the Acoustical Society of America, 143, 891-900; doi:10.1121/1.5023476. Key Outcomes: The results indicate that the effect on intelligibility of an extraneous formant in the other ear depended relatively little on frequency region; the effect was broadly tuned around the second formant. It was also shown that increasing the number of extraneous formants moving independently of one another increased the extent of informational masking. The impact on intelligibility depended primarily on the overall extent of formant-frequency variation in each interfering formant and the number of extraneous formants present. These factors had independent and additive effects on speech intelligibility.
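To unpack the claim of independent and additive effects: one conventional check is to compare an additive linear model against a model that also includes the interaction term. The sketch below (Python with pandas and statsmodels) uses synthetic data and invented factor levels purely for illustration; it is not the analysis code used in the published article:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Synthetic long-format data: one keyword score per listener per
# combination of interferer formant-frequency variation depth and
# number of extraneous formants (factor levels are invented).
rows = [(depth, n, 70 - 0.1 * depth - 4 * n + rng.normal(0, 5))
        for depth in (0, 50, 100)
        for n in (1, 2, 3)
        for _ in range(20)]
df = pd.DataFrame(rows, columns=["depth", "n_formants", "score"])

# Additive vs. full (interaction) model: independent, additive effects
# correspond to the interaction terms adding no explanatory power.
additive = smf.ols("score ~ C(depth) + C(n_formants)", data=df).fit()
full = smf.ols("score ~ C(depth) * C(n_formants)", data=df).fit()
print(anova_lm(additive, full))  # F-test on the interaction terms
```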

[B. EXPERIMENTS 3-4:] Roberts, B., and Summers, R.J. (2019). "Dichotic integration of acoustic-phonetic information: Competition from extraneous formants increases the effect of second-formant attenuation on intelligibility," Journal of the Acoustical Society of America, 145, 1230-1240; doi:10.1121/1.5091443. Key Outcomes: The results indicate that differences in ear of presentation and level do not necessarily prevent effective integration of concurrent speech formants, even for open-set sentence-length utterances. However, the presence of a contralateral interferer providing an alternative candidate for F2 (F2C) substantially increases the effect on intelligibility of attenuating the target F2. Factors likely to contribute to this interaction include informational masking from F2C acting to swamp the acoustic-phonetic information carried by F2, and interaural inhibition from F2C acting to reduce the effective level of F2.

[C. EXPERIMENTS 5-6:] Summers, R.J., and Roberts, B. (2020). "Informational masking of speech by acoustically similar intelligible and unintelligible interferers," Journal of the Acoustical Society of America, 147, 1113-1125; doi:10.1121/10.0000688. Key Outcomes: The results indicate that when acoustic differences between corresponding intelligible and unintelligible contralateral interferers are minimised, the intelligible interferers generate more interference than their unintelligible counterparts. Overall, interference with acoustic-phonetic processing of the target speech can explain much of the impact of these interferers on intelligibility, but linguistic factors, particularly the intrusion of words from an intelligible interferer, also make an important contribution to informational masking of speech.

[D. EXPERIMENTS 7-8:] "Ganong shifts and reaction times when categorising continua of vocoded monosyllables and the effects of spatial uncertainty and contralateral interferers." A preliminary report of experiment 7 was given by Roberts, B., and Summers, R.J. (2018, poster presentation, "Ganong shifts for noise- and sine-vocoded speech continua in normal-hearing listeners", see Other Outputs and Knowledge / Future Steps); experiment 8 has not yet been reported. Key Outcomes: Lexical bias is the tendency to perceive an ambiguous speech sound as a phoneme that completes a word rather than a non-word; the more ambiguous the auditory signal, the greater the reliance on lexical knowledge. The extent of lexical bias also increases when the listener experiences a high cognitive load, but the effect of IM (higher perceptual load) has not previously been explored. The results indicate that the Ganong shift, a widely used measure of lexical bias, is strongly affected by stimulus naturalness. However, the presence of a contralateral interferer with spectro-temporal variation that was intended to increase perceptual load did not reliably increase the magnitude of the Ganong shift. In contrast, this type of interferer led to slower reaction times for identifying the initial phoneme. This outcome suggests that reaction times may provide a more useful measure of perceptual load than Ganong shifts.
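For readers unfamiliar with the measure, a Ganong shift can be quantified as the displacement of the 50% category boundary between identification functions for a word-endpoint and a non-word-endpoint continuum. The sketch below (Python with numpy and scipy; the continuum steps and identification proportions are invented for illustration) shows one common way to compute it:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Psychometric function: probability of one category at step x."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def category_boundary(steps, p_response):
    """Fit the logistic and return the 50% point (x0)."""
    (x0, _k), _cov = curve_fit(logistic, steps, p_response,
                               p0=[np.mean(steps), -1.0])
    return x0

steps = np.arange(1, 8)  # hypothetical 7-step phonetic continuum
# Invented identification proportions for two continua, one whose
# endpoint forms a word and one whose endpoint forms a non-word.
p_word = np.array([0.98, 0.95, 0.90, 0.70, 0.40, 0.10, 0.02])
p_nonword = np.array([0.97, 0.90, 0.75, 0.45, 0.15, 0.05, 0.01])

# Ganong shift: displacement of the category boundary toward the
# lexically consistent interpretation in the word context.
shift = category_boundary(steps, p_word) - category_boundary(steps, p_nonword)
print(f"Ganong shift: {shift:.2f} continuum steps")
```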

[E. EXPERIMENTS 9-11:] Roberts, B., Summers, R.J., and Bailey, P.J. (2019, poster presentation, "Mandatory dichotic integration of second-formant information: Mismatched contralateral sine bleats have predictable effects on place judgments in consonant-vowel syllables", see Other Outputs and Knowledge / Future Steps). Key Outcomes: The disrupting effects of extraneous speech acting as an informational masker may be relatively non-specific, but its corrupting effects should produce predictable errors in understanding the target speech. The results indicate that accompanying a consonant-vowel syllable with a carefully crafted sine-tone interferer in the contralateral ear leads to errors in judgments of consonant place of articulation that can be predicted from the interferer's specific acoustic properties. This outcome indicates mandatory dichotic integration of F2 information, despite the grouping cues disfavouring this integration.

[F. EXPERIMENT 12:] Roberts, B., and Summers, R.J. (2020). "Informational masking of speech depends on masker spectro-temporal variation but not on its coherence," Journal of the Acoustical Society of America, 148, 2416-2428. This experiment (together with an unpublished experiment from a previous ESRC grant, ES/K004905/1) is reported in this journal article. Key Outcomes: The results indicate that the extent to which an interfering formant in the contralateral ear lowered target-speech intelligibility depended on the extent (depth) of formant-frequency variation in the interferer and, if the interferer was cut into short segments, on whether those segments were time-varying or constant; it did not depend on segment order (in order vs. scrambled). This outcome indicates that the impact on intelligibility depends critically on the overall amount of frequency variation in the interferer, but not on its spectro-temporal coherence.
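To clarify the "in order vs. scrambled" manipulation: the sketch below (Python with numpy; the segment duration, sampling rate, and track shape are illustrative assumptions) cuts an interfering-formant frequency track into short fixed-duration segments and shuffles their order, destroying spectro-temporal coherence while leaving the overall amount of frequency variation essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)

def scramble_segments(track, fs, seg_dur=0.1):
    """Cut a formant-frequency track into fixed-duration segments and
    shuffle their order; samples beyond a whole number of segments are
    discarded. The pool of frequency values (and hence the depth of
    variation) is preserved; only the temporal order changes."""
    seg_len = int(seg_dur * fs)
    n_segs = len(track) // seg_len
    segs = [track[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]
    order = rng.permutation(n_segs)
    return np.concatenate([segs[i] for i in order])

fs = 44100
t = np.arange(int(1.0 * fs)) / fs
# Hypothetical interfering-formant track with sinusoidal variation.
track = 1500.0 + 300.0 * np.sin(2.0 * np.pi * 1.5 * t)
scrambled = scramble_segments(track, fs)
```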
Exploitation Route The results obtained during this project suggest approaches by which engineers and computer scientists might improve the performance of devices such as hearing aids and automatic speech recognisers when operating in noisy environments.
Sectors Digital/Communication/Information Technologies (including Software),Healthcare

 
Description This research project was primarily theoretical, and so its economic and societal impact at this stage is limited. Nonetheless, there are two routes of potential impact: (1) Scientists developing computational solutions for auditory scene analysis (CASA) are interested in the results of this project to inform developments in these solutions. Specifically, we are in periodic contact with Martin Cooke (University of the Basque Country, Spain) and DeLiang Wang (Ohio State University, USA). In the longer term, improved CASA solutions offer the prospect of enhanced performance by hearing prostheses and automatic speech recognition systems operating in noisy environments. Such enhancements would yield benefits for healthcare and for society more broadly. (2) Scientists employed by private-sector companies who develop hearing aids and cochlear implants are beginning to consider the results of this project in guiding their own research and development projects. Notably, we carried out a dissemination visit on our ESRC-funded research to Oticon's Eriksholm Research Centre in Denmark towards the end of the grant (1-2 July 2019). The visit was an important part of maintaining our already-established contacts with their relevant research team leaders, Niels Pontoppidan and Lars Bramslow. These researchers are now considering the feasibility of exploring the extent to which our findings with normal-hearing listeners are applicable to listeners with mild to moderate hearing loss.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Societal,Economic

 
Title Dataset for published article in JASA by Roberts and Summers (2018). 
Description These datasets comprise listeners' transcriptions of sentence-length speech analogues for Experiments 1 and 2 of the article of the same title (Roberts and Summers, 2018; Journal of the Acoustical Society of America). There are two spreadsheets for each experiment: one comprising keyword scores and one comprising phonemic scores. Each spreadsheet comprises a summary worksheet and the raw data for each listener. The summary worksheet contains aggregated scores (keywords correct by tight scoring, or phonemic scores) for each listener in each condition, with relevant demographic information. Subsequent worksheets comprise the raw data for each listener and stimulus.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published. 
URL http://doi.org/10.17036/researchdata.aston.ac.uk.00000309
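For researchers who obtain such a workbook, reading and aggregating the summary worksheet might look as follows (a minimal pandas sketch; the filename, sheet name, and column names are assumptions, so consult the repository entry above for the workbook's actual layout):

```python
import pandas as pd

# Hypothetical filename and sheet name; the real workbook comprises
# one summary worksheet followed by raw data per listener.
summary = pd.read_excel("experiment1_keyword_scores.xlsx",
                        sheet_name="Summary")

# Mean keywords-correct score per condition, averaged across listeners
# (column names are assumptions, not taken from the actual file).
print(summary.groupby("condition")["keywords_correct"].mean())
```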
 
Title Dataset for published article in JASA by Roberts and Summers (2019). 
Description These datasets comprise listeners' transcriptions of sentence-length speech analogues for Experiments 1 and 2 of the article of the same title (Roberts and Summers, 2019; Journal of the Acoustical Society of America). There are two spreadsheets for each experiment: one comprising keyword scores and one comprising phonemic scores. Each spreadsheet comprises a summary worksheet and the raw data for each listener. The summary worksheet contains aggregated scores (keywords correct by tight scoring, or phonemic scores) for each listener in each condition, with relevant demographic information. Subsequent worksheets comprise the raw data for each listener and stimulus.
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published. 
URL http://doi.org/10.17036/researchdata.aston.ac.uk.00000396
 
Title Dataset for published article in JASA by Summers and Roberts (2020). 
Description These datasets comprise listeners' transcriptions of sentence-length speech analogues, and the scores and error counts derived from them, for Experiments 1 and 2 of the article of the same title (Summers and Roberts, 2020; Journal of the Acoustical Society of America). 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published. 
URL https://doi.org/10.17036/researchdata.aston.ac.uk.00000459
 
Title Effects of stimulus naturalness and contralateral interferers on lexical bias in consonant identification 
Description These datasets comprise listeners' judgments of consonant voicing and their reaction times for Experiments 1 and 2 of the article of the same title (Roberts, Summers, and Bailey, 2022; Journal of the Acoustical Society of America). 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact None at this stage. These datasets have only recently been published.
URL https://researchdata.aston.ac.uk/id/eprint/539
 
Title Entries in the UK Data Service (ReShare) repository. 
Description Datasets for all twelve experiments completed for this grant have been submitted to the ReShare repository. These datasets comprise listeners' transcriptions or other judgments of speech stimuli, ranging from sentence-length materials to consonant-vowel syllables, in the presence or absence of various interfering sounds. Each spreadsheet comprises one or more summary worksheets and worksheets containing the raw data for each listener. For the sentence-length materials, the summary worksheets contain aggregated scores (keywords correct by tight and/or loose scoring) for each listener in each condition. Each dataset is accompanied by a short text description; in cases where the associated article has not yet been published, a PDF summary report is also provided.
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
Impact None at this stage. 
 
Title Informational masking of speech depends on masker spectro-temporal variation but not on its coherence 
Description These datasets comprise listeners' transcriptions of sentence-length speech analogues for the article of the same title (Roberts and Summers, 2020; Journal of the Acoustical Society of America).
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://researchdata.aston.ac.uk/id/eprint/477
 
Title Mandatory dichotic integration of second-formant information: Contralateral sine bleats have predictable effects on consonant place judgments 
Description These datasets comprise listeners' judgments of consonant place of articulation for Experiments 1-3 of the article of the same title (Roberts, Summers, and Bailey, 2021; Journal of the Acoustical Society of America).
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact None at this stage. These datasets have only recently been published. 
URL https://researchdata.aston.ac.uk/id/eprint/525
 
Description Pathways to Impact - Dissemination visit to Oticon's Eriksholm Research Centre (Snekkersten, Denmark, 1-2 July 2019) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Oticon is one of Europe's leading hearing-technology companies, and a major manufacturer of hearing aids and cochlear implants. We (Brian Roberts and Rob Summers) were hosted at Oticon's Eriksholm Research Centre by Niels Pontoppidan (Group Manager, Advanced Algorithms) and Lars Bramslow (Project Manager, Competing Voices). Our visit involved giving a presentation on our most recent ESRC-funded research, including extensive round-table discussion, and receiving a briefing on related research and development taking place at Oticon. In addition to continuing our relationship with this company, we identified areas of research arising from our own work that Oticon might consider for in-house investigation using participants with mild-to-moderate hearing loss.
Year(s) Of Engagement Activity 2019
 
Description Presentation at the Big Bang Young Scientists and Engineers Fair (NEC, Birmingham, 13-16 March 2019). 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Interactive demonstrations at the Big Bang Young Scientists and Engineers Fair under the umbrella theme of "superpowers". Our contribution, part of Aston University's stand, concerns our remarkable ability to understand speech under adverse listening conditions. The event begins in the current reporting period but ends in the next, so at this point it is not possible to identify specific outcomes or impacts arising. The option chosen below is based on our previous experience with similar events.
Year(s) Of Engagement Activity 2019
URL https://www.thebigbangfair.co.uk/get-involved/volunteer-with-us/