The perceptual organization of speech: Contributions of general and speech-specific factors

Lead Research Organisation: Aston University
Department Name: Sch of Life and Health Sciences

Abstract

Spoken communication is a fundamental human activity. However, it is fairly uncommon in everyday life for us to hear the speech of a single talker in the absence of background sounds, and so our auditory system is faced with the challenge of grouping together those sound elements that come from one source and segregating them from those arising from other sources. Without a solution to this auditory scene analysis problem, our perceptions of speech and other sounds would not correspond to the events that produced them. The fact that we can focus our attention on one person speaking in the presence of other talkers indicates that our auditory perceptual system is generally successful at grouping together the sound elements from a source in a complex auditory scene, and segregating them from other sound sources, but our understanding of how this is achieved remains limited. Most research on auditory scene analysis has focused on relatively simple sounds and has identified a number of general principles for the grouping of sound elements. However, at least as currently understood, these principles seem inadequate to explain the perceptual grouping of speech, because speech has acoustic properties that are diverse and rapidly changing. Furthermore, speech is a highly familiar stimulus, and so our auditory system has had the opportunity to learn about speech-specific properties that may assist in the successful perceptual grouping of speech. The aim of this project is to explore how much of our ability to segregate a talker's speech from a sound mixture depends on general-purpose auditory grouping principles that are applicable to all sounds, and how much depends on grouping principles that are specific to speech sounds. The approach is to generate artificial speech-like stimuli with precisely controlled properties, to mix target utterances with carefully designed competitors that offer alternative grouping possibilities, and to measure how manipulating the acoustic properties of these competitors affects the ability of listeners to recognize the target utterance in the mixture. The results of this project will improve our understanding of the perceptual organization of speech and suggest ways to improve the performance of devices such as hearing aids and automatic speech recognizers when they are operating in noisy environments.

Publications

 
Description In everyday life, it is uncommon to hear the speech of a single talker in the absence of other sounds. That we can focus our attention on one person talking in a crowd indicates that our auditory system is usually successful at grouping together the sound elements from a source in a complex auditory scene, and segregating them from other sounds, but we still know relatively little about how this is achieved. Most research on this "scene analysis" problem has focused on simple sounds and has identified a number of general principles for the grouping of sound elements. However, these principles often seem inadequate to explain the perceptual grouping of speech, because speech has acoustic properties that are diverse and rapidly changing. Also, speech is a highly familiar stimulus, and so our auditory system has the opportunity to learn about speech-specific properties that may assist in the successful perceptual grouping of speech.

This project's aim was to explore how much of our ability to separate a talker's speech from a mixture depends on general grouping principles, applicable to all sounds, and how much depends on speech-specific principles. Our approach was to generate artificial speech-like stimuli with precisely controlled properties, particularly the spectral prominences called formants. These are important because they arise as a result of resonances in the air-filled cavities of the talker's vocal tract. Variation in the frequency and amplitude of a formant is an inevitable consequence of change in the size of its associated cavity as the tongue, lips, and jaw move when the talker produces speech. Hence, knowledge of formant frequencies and their change over time is of great benefit to listeners trying to understand a spoken message, and so choosing the right set of formants from a mixture is critical for intelligibility. Simplified versions of target sentences were synthesised and mixed with carefully designed "competitors" offering alternative grouping possibilities for the formants in the target sentence. The impact of these competitors on listeners' recognition of the target sentence in the mixture was measured as the properties of the competitors were manipulated.
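
As a rough illustration of what "precisely controlled formants" can mean in practice, the sketch below uses a simple source-filter scheme: a monotone impulse-train source is passed through cascaded second-order resonators, one per formant. This is not the synthesiser used in the project; the tools (Python with numpy/scipy), the resonator design, and all parameter values are illustrative assumptions, and the project's stimuli additionally varied the formant frequencies over time to follow target sentences.

```python
# Minimal illustrative sketch (not the project's synthesiser): a vowel-like
# sound built from a monotone impulse-train source filtered by three cascaded
# second-order resonators, one per formant. All values are assumptions.
import numpy as np
from scipy.signal import lfilter

FS = 16000   # sample rate (Hz); assumed
F0 = 120     # fundamental frequency (Hz); assumed monotone voicing
DUR = 0.5    # duration (s)

def resonator(x, freq, bw, fs=FS):
    """Filter x with a second-order resonator (centre frequency and bandwidth in Hz)."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]   # denominator coefficients
    b = [sum(a)]                                # scale for unity gain at DC
    return lfilter(b, a, x)

# Crude glottal source: impulse train at F0.
n = int(DUR * FS)
source = np.zeros(n)
source[::int(FS / F0)] = 1.0

# Hypothetical fixed formant (frequency, bandwidth) pairs in Hz; in the real
# stimuli the formant frequencies changed over time to follow a sentence.
formants = [(500, 80), (1500, 100), (2500, 120)]

signal = source
for freq, bw in formants:
    signal = resonator(signal, freq, bw)
signal /= np.max(np.abs(signal))   # normalise amplitude
```

In a scheme of this kind, the frequency contour of each resonator can be specified directly, which is what makes it possible to construct competitors that offer alternative grouping possibilities for the formants of a target sentence.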

The key findings of the project are:
(a) Modulation of the formant-frequency contour, but not the amplitude contour, is critical for across-formant grouping.
(b) The ability of listeners to reject a competitor formant declines as either the rate or depth of modulation of its frequency contour increases relative to that of the target sentence (see the illustrative sketch after this list).
(c) The impact of a competitor does not depend on whether its pattern of variation in formant frequency is plausibly speech-like.
(d) The ability of listeners to reject a competitor increases as the pitch difference between the target and competitor formants increases.
(e) Formant-frequency variation conveys information important for speech intelligibility, even in contexts often regarded as conveying information about speech-sound identity mainly through other cues.
In summary, the results have shown that our ability to segregate a talker's speech from a mixture depends heavily on general-purpose grouping principles, and rather less on speech-specific principles than some researchers have suggested.
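
As a purely illustrative aside (not code or parameter values from the project), the "rate" and "depth" manipulation in finding (b) can be pictured as sinusoidal modulation of a competitor formant's frequency contour about a fixed centre frequency, with the two parameters set independently; Python with numpy is assumed here.

```python
# Hypothetical sketch of a competitor formant-frequency contour whose
# modulation rate and depth can be varied independently. Illustrative only.
import numpy as np

fs_contour = 200     # contour sample rate (frames per second); assumed
dur = 1.0            # duration (s)
t = np.arange(int(dur * fs_contour)) / fs_contour

centre = 1500.0      # centre frequency of the competitor formant (Hz); assumed
depth_hz = 300.0     # modulation depth (peak frequency deviation, Hz); assumed
rate_hz = 4.0        # modulation rate (Hz); assumed

competitor_contour = centre + depth_hz * np.sin(2 * np.pi * rate_hz * t)
```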
Exploitation Route The results obtained during this project suggest approaches by which engineers and computer scientists might improve the performance of devices such as hearing aids and automatic speech recognizers when they are operating in noisy environments.
Sectors Digital/Communication/Information Technologies (including Software), Healthcare

URL http://www.aston.ac.uk/lhs/staff/az-index/robertsb/perceptual-organization-of-speech/
 
Description There are no wider social and economic impacts that can be attributed specifically and unequivocally to this project. However, in more general terms, the results obtained during this project suggest approaches by which engineers and computer scientists might improve the performance of devices such as hearing aids and automatic speech recognizers when they are operating in noisy environments.
 
Title Dataset for published article in Hearing Research by Stachurski, Summers, and Roberts (2017). 
Description The accompanying files comprise data derived from listeners' responses to concurrent and single sequences of repeated stimulus words for Experiments 1 and 2 of the article of the same title (Stachurski, Summers, and Roberts, 2017, Hearing Research). Each spreadsheet comprises a demographics worksheet and separate worksheets of individual listeners' data and summary data for the following measures: number of verbal transformations, number of verbal forms, time to first verbal transformation, and dwell time of the initial form. For Experiment 1 only, an additional "Indices" worksheet provides individual listeners' data and summary data for three customised measures of the relationship between responses to two concurrent sequences - the dependency index, temporal overlap index, and intervening responses index. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? No  
Impact None at this stage. These datasets have only recently been published. 
URL http://doi.org/10.17036/researchdata.aston.ac.uk.00000278