CHIME: Computational Hearing in Multisource Environments

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

In everyday environments it is the norm for multiple sound sources to compete for the listener's attention. Understanding any one of the jumble of sounds arriving at our ears requires hearing it separately from the other sounds arriving at the same time. For example, understanding what someone is saying when a television is on in the same room requires separating their voice from the television audio. The lack of an adequate computational solution to this problem prevents hearing technologies from working reliably in typical noisy human environments -- often the situations where they could be most useful. Computational hearing algorithms designed to operate in multisource environments would enable a whole range of listening applications: robust speech interfaces, intelligent hearing aids, and audio-based monitoring and surveillance systems.

The CHIME project will develop a framework for computational hearing in multisource environments. Our approach exploits two levels of processing that combine to simultaneously separate and interpret sound sources. The first processing level exploits the continuity of sound source properties, such as location, pitch, and spectral profile, to clump the acoustic mixture into pieces (`fragments') belonging to individual sources. Such properties are largely common to all sounds and can be modelled without having to first identify the sound source. The second processing level uses statistical models of specific sound sources expected to be in the environment. These models are used to separate fragments belonging to the acoustic foreground (i.e. the `attended' source) from fragments belonging to the background. For example, in a speech recognition context, this second stage will recruit sound fragments which string together to form a valid utterance. This second stage both separates foreground from background and provides an interpretation of the foreground.

The CHIME project aims to investigate and develop key aspects of the proposed two-level hearing framework: we will develop statistical models that use multiple signal properties to represent sound source continuity; we will develop approaches for combining statistical models of the attended `foreground' and the unattended `background' sound sources; we will investigate approximate search techniques that allow acoustic scenes containing complex sources such as speech to be processed in real time; and we will investigate strategies for learning about individual sound sources directly from noisy audio data. The results of this research will be built into a single demonstration system simulating a home-automation application with a speech-driven interface that operates reliably in a noisy domestic environment.
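To make the first processing level concrete, the sketch below reduces fragment formation to its simplest possible form: threshold a log-spectrogram and treat each connected region of high-energy time-frequency bins as a candidate fragment. This is a minimal illustration assuming an energy-based grouping cue; the framework described above instead groups bins using continuity of location, pitch and spectral profile, and the function names here are purely hypothetical.

```python
# Toy version of processing level one: carve a spectrogram into local
# time-frequency `fragments'. Real fragment generation uses continuity of
# pitch, location and spectral profile; here a simple energy threshold plus
# connected-component labelling stands in for those cues.
import numpy as np
from scipy import ndimage

def spectrogram_fragments(spec_db, threshold_db=-40.0):
    """Label contiguous high-energy regions of a (freq x time) dB spectrogram.

    Returns (labels, n): `labels' has the same shape as `spec_db', with
    0 marking background bins and 1..n identifying each fragment.
    """
    mask = spec_db > threshold_db          # keep bins carrying enough energy
    labels, n = ndimage.label(mask)        # 4-connected regions by default
    return labels, n

# Example: a random "spectrogram" yields many small candidate fragments.
rng = np.random.default_rng(0)
labels, n = spectrogram_fragments(rng.uniform(-80.0, 0.0, (64, 100)), -10.0)
```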

 
Description The CHiME project was concerned with building speech recognisers that can operate reliably in everyday `acoustically cluttered' environments. For example, imagine attempting to communicate with a home automation system by speaking across a room while the television is on, children are playing and traffic noise is coming through an open window. Current speech recognition technology performs extremely poorly in such conditions.

The research project built on an existing framework known as `speech fragment decoding'. This intuitively simple approach, inspired by the `scene analysis' account of auditory perception, operates in two stages: first, signal processing techniques are used to split the acoustic mixture into local time-frequency `fragments' of individual sound sources; second, statistical models of speech and other sources are employed to match and select fragments belonging to the target speech source while rejecting fragments coming from distracting sound sources.
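As an illustration of that second stage, here is a deliberately brute-force sketch: every fragment is hypothesised as either foreground speech or background, and the labelling that best fits two statistical models wins. The diagonal-Gaussian models and exhaustive enumeration are illustrative assumptions only; the actual speech fragment decoder folds this search into the recogniser's Viterbi decoding so that the hypothesis space stays tractable.

```python
# Toy version of processing stage two: score every foreground/background
# labelling of the fragments under a `speech' model and a `background' model.
from itertools import product
import numpy as np

def log_gauss(x, mean, var):
    """Log-likelihood of bins x under a toy Gaussian model, summed."""
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var)
                                + (x - mean) ** 2 / var)))

def best_labelling(spec, fragments, speech, background):
    """Exhaustively assign fragments to speech (1) or background (0).

    spec:       (freq x time) log-spectrogram of the mixture.
    fragments:  list of boolean masks over `spec', one per fragment.
    speech, background: (mean, var) parameters of the two toy models.
    """
    best_labels, best_ll = None, -np.inf
    for labels in product((0, 1), repeat=len(fragments)):
        ll = sum(log_gauss(spec[m], *(speech if lab else background))
                 for m, lab in zip(fragments, labels))
        if ll > best_ll:
            best_labels, best_ll = labels, ll
    return best_labels, best_ll
```

Scoring all 2^N labellings is exponential in the number of fragments, which is precisely why a practical decoder performs the search with dynamic programming rather than evaluating hypotheses independently as above.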

The project had several important objectives, all of which were met within its three-year running period. First, a collection of over 50 hours of audio data was made from recordings in real domestic living spaces. Using this data, a noisy speech recognition challenge was designed that has since been used as the basis of an international robust speech recognition evaluation campaign. Second, novel approaches were developed that combine sound-direction and speech-pitch cues to better locate the individual sound source fragments. Third, new ways of combining models of speech and models of the noise background were developed that are better able to distinguish between speech and masking noise. Finally, a new, improved version of our software framework was produced and is publicly available on request for use by other researchers.
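One way to picture the second of these outcomes, combining sound-direction and speech-pitch cues, is as the fusion of two soft time-frequency masks before fragments are formed. The weighted geometric mean below is a generic fusion rule chosen for illustration, and all names are hypothetical; it is not the specific estimator developed in the project.

```python
# Illustrative fusion of two per-bin evidence maps: one spatial (e.g. derived
# from interaural time differences) and one pitch-based (harmonicity at a
# tracked fundamental frequency). Both masks take values in [0, 1].
import numpy as np

def fuse_cues(location_mask, pitch_mask, w_location=0.5):
    """Weighted geometric mean of the two soft masks."""
    eps = 1e-9  # stop zero-valued bins from annihilating the product
    return ((location_mask + eps) ** w_location
            * (pitch_mask + eps) ** (1.0 - w_location))
```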

The project has been successful in raising the profile of the `multisource environment' hearing problem. Beyond the usual dissemination via journal and conference publications, in 2011 a CHiME workshop and ASR evaluation were organised with multisource environments as their central theme. The workshop has inspired a special issue of the journal Computer Speech and Language, to be published next year. A 2nd CHiME evaluation and workshop is planned for 2013, this time with industrial support and financial sponsorship (http://spandh.dcs.shef.ac.uk/chime_workshop/).
Exploitation Route The CHiME project's research objectives have influenced the wider speech recognition and signal processing community. There is now a well-established series of speech recognition evaluations, called the CHiME challenges, that targets the problem of distant microphone speech recognition in everyday environments.

http://spandh.dcs.shef.ac.uk/chime_challenge/

These challenges have attracted significant industrial interest, in the form of both sponsorship and participation.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://spandh.dcs.shef.ac.uk/projects/chime/index.html
 
Description The CHiME project has inspired a number of workshops and speech recognition evaluation events based around the distant microphone research challenges that lie at the project's heart:

- The 1st CHiME International Workshop on Machine Listening in Multisource Environments, Florence, 2011
- The 2nd CHiME International Workshop on Machine Listening in Multisource Environments, Vancouver, 2013
- The 3rd CHiME Challenge Evaluation Campaign, ASRU, Scottsdale, Arizona, 2015
- The 4th CHiME Challenge Workshop, Google, Mountain View, CA, 2016

These events have exclusively employed data collected with the funding provided by the CHiME project. Each event has reached beyond academia to attract significant industrial input. For example, the workshops and challenges have been guided by an industrial advisory committee. In particular, the most recent completed evaluation had participation from NTT, Hitachi, Mitsubishi and A-Star, and was hosted by Google. These events have stimulated engagement between academia and industry and are influencing the development of speech technology in major speech technology companies in Europe, the US and Asia.

A 5th CHiME Challenge event is now being planned for 2018. This edition is being informed by the findings of the previous challenges but will involve a much larger scale data collection (approximately 100 hours of conversational speech). The collection and preparation of this data is a large undertaking and will require considerable funding. We are currently in negotiation with Google, who have expressed an interest in providing financial sponsorship and who view the challenge as an opportunity to compare their latest algorithms with those of their competitors. We hope to start data recording during the spring and summer of 2017 for a launch by the end of the year.
First Year Of Impact 2011
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Cultural, Economic