SpeechWave

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Speech recognition has made major advances in the past few years. Error rates have been reduced by more than half on standard large-scale tasks such as Switchboard (conversational telephone speech), MGB (multi-genre broadcast recordings), and AMI (multiparty meetings). These research advances have quickly translated into commercial products and services: speech-based applications and assistants such as Apple's Siri, Amazon's Alexa, and Google voice search have become part of daily life for many people. Underpinning the improved accuracy of these systems are advances in acoustic modelling, with deep learning having had a profound influence on the field.

However, speech recognition is still very fragile: it has been successfully deployed in specific acoustic conditions and task domains - for instance, voice search on a smartphone - and degrades severely when the conditions change. This is because speech recognition is highly vulnerable to additive noise from multiple acoustic sources, and to reverberation. In both cases, acoustic conditions which have essentially no effect on the accuracy of human speech recognition can have a catastrophic impact on the accuracy of a state-of-the-art automatic system. A reason for such brittleness is the lack of a strong model of acoustic robustness. Robustness is usually addressed through multi-condition training, in which the training set comprises speech examples across the many required acoustic conditions, often constructed by mixing speech with noise at different signal-to-noise ratios. For a limited set of acoustic conditions these techniques can work well, but they are inefficient, and they offer neither a model of multiple acoustic sources nor a factorisation of the causes of variability. For instance, the best reported word error rate for transcription of the AMI corpus test set using single distant microphone recordings is about 38% (for non-overlapped speech), compared to an error rate of about 5% for human listeners.
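To make the multi-condition training recipe above concrete, a minimal sketch of mixing a clean utterance with noise at a chosen signal-to-noise ratio is given below; the function name and the use of NumPy are illustrative assumptions rather than project code.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a speech waveform with noise at a target SNR in dB.

    Illustrative sketch of multi-condition training data construction;
    assumes 1-D float arrays sampled at the same rate.
    """
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Scale the noise so the mixture has the requested SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    return speech + noise

# A multi-condition training set would typically repeat this over a range
# of SNRs (e.g. 0, 5, 10 and 20 dB) and many different noise recordings.
```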

In the past few years there have been several approaches that have tried to address these problems: explicitly learning to separate multiple sources; factorised acoustic models using auxiliary features; and learned spectral masks for multi-channel beamforming. SpeechWave will pursue an alternative approach to robust speech recognition: the development of acoustic models which learn directly from the speech waveform. The motivation to operate directly in the waveform domain arises from the insight that redundancy in speech signals is highly likely to be a key factor in the robustness of human speech recognition. Current approaches to speech recognition separate non-adaptive signal processing components from the adaptive acoustic model, and in so doing lose the redundancy - and, typically, information such as the phase - present in the speech waveform. Waveform models are particularly exciting as they combine the previously distinct signal processing and acoustic modelling components.

In SpeechWave, we shall explore novel waveform-based convolutional and recurrent networks which combine speech enhancement and recognition in a factorised way, and approaches based on kernel methods and on recent research advances in sparse signal processing and speech perception. Our research will be evaluated on standard large-scale speech corpora. In addition we shall participate in, and organise, international challenges to assess the performance of speech recognition technologies. We shall also validate our technologies in practice, in the context of the speech recognition challenges faced by our project partners BBC, Emotech, Quorate, and SRI.
 
Description We are currently halfway through this three-year project.

The main objectives of the project are to explore approaches to speech recognition using the raw waveform, and to develop a deeper theoretical understanding of such approaches.

The key findings so far include:

1/ Development of state-of-the-art baseline systems for waveform-based speech recognition using the SincNet architecture, which enables signal processing algorithms to be learned from data (a minimal illustrative sketch of such a learnable filterbank is given after this list).

2/ Development of a windowed attention model for end-to-end speech recognition.

3/ Theoretical analysis of the statistical normalisation of bottleneck features for speech recognition.

4/ Development of an automatic adaptation approach for waveform-based speech recognition, which demonstrated the ability to adapt a system trained on adult speech to successfully recognise children's speech, using a limited amount of child data.

5/ Detailed theoretical and experimental investigation of different learnable filters in waveform-based speech recognition.

6/ Development of a dynamic subsampling approach for end-to-end speech recognition, enabling the model to skip redundant data.
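
To give a flavour of item 1/, the following is a minimal sketch of a SincNet-style layer in which each convolutional filter is a band-pass filter whose cutoff frequencies are learned directly from the raw waveform. It assumes PyTorch, and the class name, initialisation, and hyperparameters are illustrative simplifications rather than the project's actual implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """Simplified SincNet-style layer: instead of learning all filter taps
    freely, each filter is a band-pass filter parameterised only by a learnable
    low cutoff and bandwidth (in Hz). Illustrative sketch; the published SincNet
    layer differs in its initialisation and parameterisation details."""

    def __init__(self, num_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Learnable cutoffs, spread roughly across the usable frequency range.
        self.low_hz = nn.Parameter(torch.linspace(30.0, sample_rate / 2 - 200.0, num_filters))
        self.band_hz = nn.Parameter(torch.full((num_filters,), 100.0))
        # Fixed symmetric time axis and Hamming window (not learned).
        n = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        self.register_buffer("t", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, waveform):
        # waveform: (batch, 1, samples) raw audio.
        f1 = torch.abs(self.low_hz)                  # low cutoff frequencies (Hz)
        f2 = f1 + torch.abs(self.band_hz)            # high cutoff frequencies (Hz)

        def lowpass(f):
            # Ideal low-pass filter with cutoff f: 2f * sinc(2 f t).
            arg = 2 * math.pi * f.unsqueeze(1) * self.t.unsqueeze(0)
            return 2 * f.unsqueeze(1) * torch.sinc(arg / math.pi)

        # Band-pass = difference of two low-pass filters, windowed and normalised.
        filters = (lowpass(f2) - lowpass(f1)) * self.window
        filters = filters / (filters.abs().max(dim=1, keepdim=True).values + 1e-8)
        return F.conv1d(waveform, filters.unsqueeze(1), padding=self.kernel_size // 2)

# Example: a batch of two 1-second, 16 kHz waveforms -> (2, 40, 16000) feature maps.
# feats = SincConv()(torch.randn(2, 1, 16000))
```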
Exploitation Route Our systems are being released as open-source software and may be applied to speech recognition problems.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Healthcare; Culture, Heritage, Museums and Collections

 
Description Within SpeechWave we have taken measures to maximise the impact of our research. These will be in two main areas:

1. Broadcast and Media (via project partners BBC and Quorate). The focus of this work is to develop robust media transcription prototypes, able to cope with the diverse range of broadcast media. Media transcription has direct benefits (for example supporting accessibility through automatic subtitling), as well as enabling intelligent processing of broadcast media through natural language processing and text analytics.

2. Distant Speech Recognition (via project partner Emotech). The focus of this work is to develop prototype software for speech recognition in personal robots. Speech is perhaps the most natural communication modality for such robots, but the acoustic conditions can be extremely challenging due to reverberation and competing acoustic sources. Improving speech recognition accuracy for such devices in challenging environments is likely to have a significant impact on their usability and uptake.

We also plan to enhance the global impact of our research through project partner SRI International, who have a specific R&D interest in speech recognition in highly challenging acoustic environments. We have also begun new collaborations with Toshiba and with Samsung in the area of end-to-end speech recognition.
First Year Of Impact 2019
Sector Creative Economy; Digital/Communication/Information Technologies (including Software); Culture, Heritage, Museums and Collections
Impact Types Cultural, Economic

 
Description Adapting end-to-end speech recognition systems (year 1)
Amount £137,365 (GBP)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 12/2018 
End 11/2019
 
Description Adapting end-to-end speech recognition systems (year 2)
Amount £113,989 (GBP)
Organisation Samsung 
Sector Private
Country Korea, Republic of
Start 12/2019 
End 11/2020
 
Description BBC Data Science Partnership 
Organisation British Broadcasting Corporation (BBC)
Department BBC Research & Development
Country United Kingdom 
Sector Public 
PI Contribution Development of speech and language technology applied to broadcasting and media production
Collaborator Contribution R&D work from BBC researchers; data sharing.
Impact MGB Challenge; iCASE studentships; EPSRC SCRIPT Project
Start Year 2017
 
Description Emotech 
Organisation EmoTech Ltd
Country United Kingdom 
Sector Private 
PI Contribution We are developing models and algorithms for raw-waveform based speech recognition with the aim of significantly improving robustness to acoustic conditions.
Collaborator Contribution We shall work with Emotech on evaluating our models and algorithms using data collected by Emotech and made available to the project researchers. Furthermore we plan to conduct experiments using Emotech's Olly platform, and to this end Emotech will donate two devices to the project along with the required software development platform. Through the collaboration with Emotech we shall be able to evaluate the novel contributions provided by SpeechWave against the current state-of-the-art in realistic circumstances.
Impact 1/ Development, analysis, and evaluation of convolutional and recurrent network speech recognition systems 2/ Development of end-to-end speech recognition systems, including the development of novel algorithms for windowed attention
Start Year 2018
 
Description Quorate 
Organisation Quorate Technology
Country United Kingdom 
Sector Private 
PI Contribution We are developing models and algorithms for raw-waveform based speech recognition with the aim of significantly improving robustness to acoustic conditions.
Collaborator Contribution Quorate has a state-of-the-art product for multi-genre media transcription, and we are working with them to explore the use of the approaches developed in the project in the context of broadcast speech recognition. Quorate are currently jointly supporting a PhD student at Edinburgh, in the area of robust transcription of broadcast speech, and there are strong synergies between that project and SpeechWave.
Impact 1/ Development, analysis, and evaluation of convolutional and recurrent network speech recognition systems 2/ Development of end-to-end speech recognition systems, including the development of novel algorithms for windowed attention
Start Year 2018
 
Description SRI 
Organisation SRI International (inc)
Country United States 
Sector Charity/Non Profit 
PI Contribution We are developing models and algorithms for raw-waveform based speech recognition with the aim of significantly improving robustness to acoustic conditions.
Collaborator Contribution SRI is concerned with the development of robust speech recognition within the DARPA RATS program, and this provides a platform for the evaluation of the technology developed in this project.
Impact 1/ Development, analysis, and evaluation of convolutional and recurrent network speech recognition systems 2/ Development of end-to-end speech recognition systems, including the development of novel algorithms for windowed attention
Start Year 2018