Explore new approaches to distant microphone speech recognition that combine information across multiple microphone array devices

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

It is becoming common for speech to be used to communicate with digital devices. In the last few years, devices such as Google Home and Amazon Alexa have arrived in millions of homes. Getting speech recognition to work well in home environments is very challenging. The home is often a noisy place: if the device is placed in the kitchen, for example, the washing machine may be running and people may be talking in the background. Moreover, the person speaking is often several metres away from the device (the 'distant microphone' scenario). This is a problem because the speech signal can easily be dominated by other sound sources that are closer to the microphones.

This project will develop novel solutions to the distant microphone speech recognition problem. It will be conducted within the Speech and Hearing Research Group under the supervision of Prof. Jon Barker. It will take advantage of a new data set ('CHiME-5') that has been acquired by Prof. Barker's research team with support from Google (http://spandh.dcs.shef.ac.uk/chime_challenge/). CHiME-5 is a set of recordings of parties taking place in real homes. The data were captured with multiple recording devices, each of which records video and four synchronised microphone channels. This unique data set provides an opportunity to address new research questions lying outside the scope of current speech technology.

Research questions
Two key research directions will be prioritised:

Visually-driven beamforming algorithms: The most successful approach to distant microphone speech recognition is to use multiple microphones and apply techniques that enhance the signals coming from some directions while suppressing those coming from others. This requires detecting and tracking which directions are important. The project will investigate how this information might be extracted from the video signal (e.g., using person-tracking techniques).
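To make the idea concrete, below is a minimal delay-and-sum beamforming sketch in Python/NumPy. It assumes a uniform linear four-microphone array (the spacing and sample rate are illustrative, not the actual CHiME-5 device geometry) and a steering angle supplied externally, for example by a video person tracker; the names mic_signals and angle_deg are hypothetical.

import numpy as np

def delay_and_sum(mic_signals, angle_deg, spacing=0.05, fs=16000, c=343.0):
    """mic_signals: (n_mics, n_samples) array; angle_deg: direction of arrival."""
    n_mics, n_samples = mic_signals.shape
    # Relative time delay of arrival at each microphone for a far-field
    # source at the given angle (mic 0 is the reference).
    delays = np.arange(n_mics) * spacing * np.cos(np.deg2rad(angle_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for sig, tau in zip(mic_signals, delays):
        # Apply a fractional-sample delay as a phase shift in the frequency
        # domain, aligning the target direction across channels before summing.
        spectrum = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau)
        out += np.fft.irfft(spectrum, n=n_samples)
    return out / n_mics

Summing the aligned channels reinforces the signal from the steered direction while interferers from other directions add incoherently and are attenuated.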

Speech recognition with multiple microphone arrays: The 'beamforming' described above requires synchronised microphones with known positions relative to each other. It therefore cannot easily be applied across multiple devices whose relative locations are uncertain (e.g., combining the outputs of two Google Homes in the same room). The CHiME-5 data has up to six devices within the same acoustic space and therefore provides a unique opportunity to find new solutions to this problem. A starting point would be to explore techniques for weighting and fusing the outputs of independent recognition systems.
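As an illustration of the fusion idea, the sketch below implements a much-simplified, ROVER-style weighted vote. It assumes (unrealistically) that the device hypotheses are already word-aligned; a real system would first align them with dynamic programming, and the per-device weights, which are hypothetical here, might come from a signal quality estimate.

from collections import defaultdict

def fuse_hypotheses(hypotheses, weights):
    """hypotheses: equal-length word lists, one per device;
    weights: per-device reliability scores."""
    fused = []
    for slot in zip(*hypotheses):
        votes = defaultdict(float)
        for word, w in zip(slot, weights):
            votes[word] += w                      # accumulate weighted votes
        fused.append(max(votes, key=votes.get))   # keep the best-scoring word
    return fused

# Example: the more reliable device wins the disputed slot.
# fuse_hypotheses([["turn", "on", "the", "light"],
#                  ["turn", "in", "the", "light"]], [0.7, 0.3])
# -> ["turn", "on", "the", "light"]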

Methodology
Speech recognition systems have evolved into hugely complex pieces of software. Fortunately, speech research has been effectively open-sourced, with the community now centred on the Kaldi speech recognition toolkit. The CHiME-5 data set will be published with an open-source Kaldi 'baseline' that will represent a state-of-the-art single-device, audio-only system. It will also provide a set of 'rules' for training systems that allows fair comparison between research groups. This will provide a robust reference against which to compare the performance of audio-visual and multi-device extensions.
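Comparisons against the baseline will be made in terms of word error rate (WER). Kaldi ships its own scoring tools; purely for illustration, the sketch below computes WER as a word-level Levenshtein distance normalised by the reference length.

import numpy as np

def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # all deletions
    d[0, :] = np.arange(len(hyp) + 1)   # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)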

The research will require a mixture of methods: video face and person tracking; beamforming algorithms; speech recognition fusion strategies; and signal quality assessment techniques. In addition, it will be necessary to develop a fuller understanding of the state-of-the-art techniques employed in the baseline recogniser, including convolutional neural networks, i-vector analysis, speaker-adaptive training and neural network language modelling. Fortunately, there are many excellent textbooks, tutorial papers and review papers that cover these areas.
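As one concrete example of the video side, the sketch below runs OpenCV's stock Haar-cascade face detector over a video stream. The file name is hypothetical, and a real system would use a stronger detector plus tracking; each detected box could then be mapped to a steering angle for the beamformer, given the camera geometry.

import cv2

# Load the frontal-face Haar cascade shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("chime5_device_video.mp4")  # hypothetical file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Each (x, y, w, h) box locates a face in image coordinates.
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        print("face at", x, y, w, h)
cap.release()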

CHiME-5 is a complex 'conversational' speech recognition task, so training and testing the recognition systems will be computationally demanding. Modern speech recognisers use 'deep learning', which requires specialist GPU hardware.
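For a sense of scale, the sketch below defines a deliberately tiny convolutional acoustic model in PyTorch and moves it to a GPU when one is available. All dimensions are illustrative and are not those of the actual CHiME-5 baseline.

import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    def __init__(self, n_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # over time x frequency
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(32, n_states),  # posteriors over acoustic states
        )

    def forward(self, feats):  # feats: (batch, 1, frames, mel_bins)
        return self.net(feats)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyAcousticModel().to(device)
print(model(torch.randn(8, 1, 100, 40).to(device)).shape)  # -> (8, 2000)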


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/N509735/1                                      01/10/2016   30/09/2021
2112956             Studentship    EP/N509735/1   01/10/2018   21/04/2022   Jack Deadman
EP/R513313/1                                      01/10/2018   30/09/2023
2112956             Studentship    EP/R513313/1   01/10/2018   21/04/2022   Jack Deadman