Corpus-Based Speech Separation

Lead Research Organisation: Queen's University Belfast
Department Name: Computer Science

Abstract

In this project, we will develop new techniques for restoring clear speech from noisy recordings. We will focus on two problems: (1) retrieving speech from background noise, and (2) separating sentences spoken simultaneously by different speakers. For convenience, we refer to both problems as speech separation.

Over the past decades, many techniques have been developed for speech separation. While they appear in different forms, most can be viewed as a filter that aims to pass the frequencies of the target speech with minimum distortion while blocking the frequencies of the noise. Building such a filter therefore requires knowledge of the frequency structure of the noise. For applications in which the noise remains relatively constant, one may estimate the noise structure from data observed at a time without speech, and then use that estimate to predict the noise structure in the data containing mixed speech and noise. Based on the prediction, a filter can be formed to remove the noise and hence restore the speech. Unfortunately, this strategy fails if the noise changes quickly and is therefore unpredictable. Examples of fast-varying noises include crosstalk speech and the background noises in mobile/Internet communications, which are often complex, highly dynamic, and thus difficult to predict.

In this research, we will investigate a new method for speech separation, aiming for the capability to handle unpredictable noise. We will use a pre-recorded speech corpus, consisting of clean sentences spoken by various speakers, to remove the requirement for information about the noise. The new method consists of four major components. First, we compare the noisy sentence, containing mixed speech and noise, with each corpus sentence to find all their matching parts. Second, we combine the longest matching parts from the clean corpus sentences to form a new sentence, as a reconstruction of the target speech. Because of their richer and more distinctive contexts, longer speech utterances are less easily confused with noise and can thus be recognised with fewer errors than shorter utterances; synthesising the target speech from the longest recognised parts therefore minimises the effect of noise on the restoration. The third component is a novel technique that reduces the sensitivity to noise when finding the matching speech parts between the noisy and corpus sentences. The last component uses the speaker characteristics associated with the individual corpus sentences to help separate mixed sentences spoken by different speakers. Combining these components, the new method can separate speech from noise, and separate mixed sentences spoken by different speakers, without having to predict the noise or crosstalk.
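The classical filtering strategy described above can be illustrated with a minimal spectral-subtraction sketch: estimate the noise magnitude spectrum from frames observed at a time without speech, then subtract that estimate from every frame of the noisy signal. This is an illustration only; the frame length, hop size, and the assumption that the first ten frames are speech-free are all illustrative choices, not details of any particular system described here.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10):
    """Enhance a 1-D signal, assuming its first `noise_frames` STFT
    frames contain noise only (an illustrative assumption)."""
    window = np.hanning(frame_len)
    # Short-time Fourier transform of the noisy signal.
    frames = np.array([noisy[i:i + frame_len] * window
                       for i in range(0, len(noisy) - frame_len, hop)])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)

    # Estimate the noise magnitude from the leading speech-free frames,
    # then subtract it; floor at zero to avoid negative magnitudes.
    noise_mag = mag[:noise_frames].mean(axis=0)
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    # Resynthesise with the noisy phase, via overlap-add.
    out = np.zeros(len(noisy))
    cleaned = np.fft.irfft(clean_mag * np.exp(1j * phase), frame_len, axis=1)
    for k, frame in enumerate(cleaned):
        out[k * hop:k * hop + frame_len] += frame
    return out
```

As noted above, this strategy fails when the noise changes too quickly to be predicted from speech-free data, which is what motivates the corpus-based method.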
 
Description In the real world, speech rarely occurs in isolation; it is usually accompanied by other acoustic interference. The two most common scenarios are: (1) speech is accompanied by background noise, e.g., cocktail-party noise, background music, street noise or any other environmental noise, and (2) one speaker's voice is masked by other speakers' voices, which happens when two or more people speak simultaneously. Severe interference can make speech unintelligible. Restoring clear speech from noise and separating crosstalk voices are two major, unsolved problems in signal processing research. The problems become extremely difficult if the noise is fast-varying and hence potentially unpredictable, and if the crosstalk voices are arbitrary in language, vocabulary and structure. They are further compounded when there is only a single microphone to record the noisy or mixed voices (the 'single-channel' problem).

This project has developed a radically different and effective solution to the above problems: a corpus-based approach to single-channel speech enhancement and separation. The new method uses speech corpora as examples, combined with novel signal modelling techniques, to provide an accurate model of speech, both of what it sounds like (the human aspect) and of how it evolves over time (the language aspect). This model of speech can reach a level of accuracy (or sharpness) that was previously unattainable with existing techniques. It proves to be extremely effective in extracting speech buried in noise and in separating simultaneous voices from different speakers, and it does so with effectively no limitations on the complexity of the noise or the language. Extensive experiments have demonstrated that the new method significantly outperforms existing techniques in dealing with fast-varying noise and arbitrary crosstalk, and it has hence raised the state-of-the-art performance to a new level.
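A toy sketch of the corpus-matching idea may help make this concrete: compare the noisy utterance with each corpus sentence frame by frame, keep the longest contiguous runs of matching frames, and stitch those corpus fragments together as the reconstruction (components 1 and 2 of the abstract). Everything here is an illustrative assumption: the plain Euclidean frame distance, the threshold `tau`, and the greedy stitching. The noise-robust matching and the speaker characteristics of the full method are omitted.

```python
import numpy as np

def longest_match(noisy_feats, corpus_feats, tau=1.0):
    """Return (noisy_start, corpus_start, length) of the longest run of
    aligned frames whose Euclidean distance stays below `tau`."""
    n, m = len(noisy_feats), len(corpus_feats)
    best = (0, 0, 0)
    for offset in range(-m + 1, n):            # every diagonal alignment
        run_len, run_start = 0, None
        for j in range(m):
            i = offset + j
            if 0 <= i < n and np.linalg.norm(noisy_feats[i] - corpus_feats[j]) < tau:
                if run_len == 0:
                    run_start = (i, j)
                run_len += 1
                if run_len > best[2]:
                    best = (run_start[0], run_start[1], run_len)
            else:
                run_len = 0
    return best

def reconstruct(noisy_feats, corpus, tau=1.0):
    """Greedily cover the noisy utterance with the longest corpus matches."""
    out = [None] * len(noisy_feats)
    matches = sorted((longest_match(noisy_feats, c, tau) + (c,) for c in corpus),
                     key=lambda t: -t[2])       # longest matches first
    for i0, j0, length, c in matches:
        for k in range(length):
            if out[i0 + k] is None:             # longer matches claim frames first
                out[i0 + k] = c[j0 + k]
    return out
```

Here `noisy_feats` and each corpus sentence would be sequences of spectral feature vectors (e.g., log-magnitude frames), and frames left uncovered would fall back to the noisy observation. The actual project uses far more sophisticated matching; the sketch only conveys why longer matched fragments, with their richer context, are more reliably identified in noise.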
Exploitation Route (1) NTT (Nippon Telegraph & Telephone Corp.) used our method in their speech recognition entry to the International Competition for Machine Listening in Multisource Environments (CHiME 2011), in which they took 1st place. NTT has further exploited our method for speech de-reverberation (Interspeech 2011).

(2) CSR rated the significance of the outcomes of the joint KTS project to their organisation's future performance as 5 out of 5. A patent was filed in August 2012 (International Application No. PCT/EP2012/066549).

(3) A collaboration agreement was signed between QUB and Cambridge Silicon Radio (CSR) for potential commercial development of the new technology for in-car and wireless communication applications. The researcher on the grant is undertaking a secondment to CSR, supported by a grant from the follow-on EPSRC KTS scheme and by the technology transfer facilities available in the QUB ECIT institute (2011-2012).

(4) Collaboration with Vitalograph Ltd, a company delivering healthcare monitoring systems, has led to a robust, acoustic-based monitoring system for medical inhalers, with potential for multi-million pound savings in NHS budgets. This project was awarded the 2013 InterTradeIreland Fusion Project Exemplar award. Further ongoing collaboration with the company includes the development of an automatic cough detection system based on the robust speech processing techniques arising from the EPSRC project.

(5) In 2014, techniques developed in this project were used by Microsoft in their system for speech bandwidth expansion in noise (ICASSP 2014).
Sectors Digital/Communication/Information Technologies (including Software), Healthcare

URL http://www.ecit.qub.ac.uk/Research/SpeechVisionSystems/SpeechSeparation/
 
Description Our findings from this research have led to impact on a range of fronts, mainly: (i) Collaboration with CSR, a leading $1 billion consumer electronics company, has shaped its R&D agenda in speech enhancement, inspired ideas for new product improvements, and helped establish Belfast as an audio research centre of excellence within the company. (ii) Collaboration with Vitalograph Ltd, a company delivering healthcare monitoring systems, has led to a robust, acoustic-based monitoring system for medical inhalers, with potential for multi-million pound savings in NHS budgets. (iii) NTT (Nippon Telegraph & Telephone Corp.) used our method in their speech recognition entry to the International Competition for Machine Listening in Multisource Environments (CHiME 2011), in which they took 1st place. (iv) In 2014, Microsoft used a technique developed in this project in their system for noisy speech bandwidth expansion; for details, see H. Seo, H. Kang, and F. Soong, "A maximum a posterior-based reconstruction approach to speech bandwidth expansion in noise," ICASSP 2014, pp. 6128-6132.
Sector Digital/Communication/Information Technologies (including Software), Healthcare
Impact Types Economic

 
Description EPSRC Knowledge Transfer Secondments (KTS)
Amount £38,671 (GBP)
Funding ID KTS-1117 (Queen's University Belfast) 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 10/2011 
End 09/2012
 
Description Research Impact Showcase 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Our work on speech enhancement and separation (funded by EPSRC and EPSRC KTS, in collaboration with CSR) was selected for exhibition in the Research Impact Showcase at Queen's on 27 November 2013. The work was featured in 'The DNA of Innovation, Volume 3: Creative Connections', in an interview with BBC Radio Ulster News, and in the Belfast Telegraph.
Year(s) Of Engagement Activity 2013