SpeechWave

Lead Research Organisation: King's College London
Department Name: Informatics

Abstract

Speech recognition has made major advances in the past few years. Error rates have been reduced by more than half on standard large-scale tasks such as Switchboard (conversational telephone speech), MGB (multi-genre broadcast recordings), and AMI (multiparty meetings). These research advances have quickly translated into commercial products and services: speech-based applications and assistants such as Apple's Siri, Amazon's Alexa, and Google voice search have become part of daily life for many people. Underpinning the improved accuracy of these systems are advances in acoustic modelling, with deep learning having had a profound influence on the field.

However, speech recognition is still very fragile: it has been successfully deployed in specific acoustic conditions and task domains - for instance, voice search on a smartphone - and degrades severely when the conditions change. This is because speech recognition is highly vulnerable to additive noise caused by multiple acoustic sources, and to reverberation. In both cases, acoustic conditions which have essentially no effect on the accuracy of human speech recognition can have a catastrophic impact on the accuracy of a state-of-the-art automatic system. A reason for such brittleness is the lack of a strong model for acoustic robustness. Robustness is usually addressed through multi-condition training, in which the training set comprises speech examples across the many required acoustic conditions, often constructed by mixing speech with noise at different signal-to-noise ratios. For a limited set of acoustic conditions these techniques can work well, but they are inefficient and do not offer a model of multiple acoustic sources, nor do they factorise the causes of variability. For instance, the best reported speech recognition result for transcription of the AMI corpus test set using single distant microphone recordings is about 38% word error rate (for non-overlapped speech), compared to about 5% error rate for human listeners.
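
For concreteness, multi-condition training data of the kind described above is typically constructed by mixing clean speech with noise at a sampled signal-to-noise ratio. The short Python/NumPy sketch below is purely illustrative (the function name and usage are hypothetical, not project code):

    # Minimal sketch of multi-condition data construction: scale a noise
    # waveform so that the speech-to-noise power ratio matches a target SNR.
    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        # Tile or trim the noise to match the speech length.
        if len(noise) < len(speech):
            reps = int(np.ceil(len(speech) / len(noise)))
            noise = np.tile(noise, reps)
        noise = noise[:len(speech)]

        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Gain chosen so that 10*log10(speech_power / (gain^2 * noise_power)) == snr_db.
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + gain * noise

    # Usage: draw an SNR per utterance to cover the required acoustic conditions, e.g.
    # noisy = mix_at_snr(clean_waveform, babble_noise, snr_db=np.random.uniform(0, 20))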

In the past few years there have been several approaches to addressing these problems: explicitly learning to separate multiple sources; factorised acoustic models using auxiliary features; and learned spectral masks for multi-channel beamforming. SpeechWave will pursue an alternative approach to robust speech recognition: the development of acoustic models which learn directly from the speech waveform. The motivation to operate directly in the waveform domain arises from the insight that redundancy in speech signals is highly likely to be a key factor in the robustness of human speech recognition. Current approaches to speech recognition separate non-adaptive signal processing components from the adaptive acoustic model, and in so doing lose the redundancy - and, typically, information such as the phase - present in the speech waveform. Waveform models are particularly exciting as they combine the previously distinct signal processing and acoustic modelling components.
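
To illustrate what "learning directly from the waveform" means in practice, the PyTorch sketch below replaces a fixed front-end (e.g. a mel filterbank) with convolutions applied to raw samples. It is a toy example under our own assumptions - the class name, layer sizes, kernel widths, and strides are arbitrary choices for illustration, not the SpeechWave architecture:

    import torch
    import torch.nn as nn

    class WaveformFrontEnd(nn.Module):
        """Learned, phase-aware front-end operating on raw 16 kHz samples."""
        def __init__(self, n_filters=64, out_dim=256):
            super().__init__()
            # A long first kernel with a large stride plays the role of a
            # learned filterbank producing frame-level outputs (~10 ms hop).
            self.conv1 = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
            self.conv2 = nn.Conv1d(n_filters, out_dim, kernel_size=5, stride=1, padding=2)
            self.act = nn.ReLU()

        def forward(self, wav):            # wav: (batch, samples)
            x = wav.unsqueeze(1)           # -> (batch, 1, samples)
            x = self.act(self.conv1(x))    # -> (batch, n_filters, frames)
            x = self.act(self.conv2(x))    # -> (batch, out_dim, frames)
            return x.transpose(1, 2)       # frame-level features for the acoustic model

    # Example: frames = WaveformFrontEnd()(torch.randn(8, 16000))  # 8 one-second utterances

Because the filters are learned jointly with the rest of the acoustic model, the signal processing and acoustic modelling stages are no longer separate, which is the combination the paragraph above refers to.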

In SpeechWave, we shall explore novel waveform-based convolutional and recurrent networks which combine speech enhancement and recognition in a factorised way, and approaches based on kernel methods and on recent research advances in sparse signal processing and speech perception. Our research will be evaluated on standard large-scale speech corpora. In addition we shall participate in, and organise, international challenges to assess the performance of speech recognition technologies. We shall also validate our technologies in practice, in the context of the speech recognition challenges faced by our project partners BBC, Emotech, Quorate, and SRI.

Planned Impact

Robust speech recognition is a key technology needed to make digital infrastructure simple, accessible, invisible, and reliable, supporting a range of mobile and other applications. It is driven in particular by the need for new interfaces to small and mobile devices in which traditional modalities like touch-screens may be inappropriate or not physically possible.

The EPSRC delivery plan highlights four "prosperity outcomes" to which SpeechWave will contribute:

1. Productivity: Robust speech recognition can significantly enhance productivity across many industries, and is a key enabler for new smart technologies built around speech interfaces.

2. Connectedness: Robust speech recognition enables natural interaction with novel devices and applications, and can unlock the audio and video data that makes up 80% of the web.

3. Resilience: The core of the project is to develop speech technology that is robust and resilient to changing conditions of use.

4. Health: Robust speech recognition enables assistive technologies and accessibility.

SpeechWave is well-aligned to EPSRC's Cross-ICT Priorities 2017-2020. Speech recognition technologies that are robust to realistic and natural acoustic conditions help to underpin the Future Intelligent Technologies and the People at the Heart of ICT priorities, enabling broad utilisation of intelligent spoken interfaces. SpeechWave's collaboration between researchers from signal processing, speech technology, and machine learning addresses the priority Cross-Disciplinarity and Co-Creation.
The UK has a vibrant and expanding speech technology sector built on a healthy ecosystem comprising multinational companies with a UK R&D base (including Amazon, Apple, and Google), home-grown medium-size companies, and exciting startups (many located in Edinburgh or London). SpeechWave will help the UK maintain this world-leading research activity, through its collaboration with project partners, and will increase innovation potential.

Within SpeechWave we have taken measures to maximise the impact of our research. These will be in two main areas:

1. Broadcast and Media (via project partners BBC and Quorate). The focus of this work is to develop robust media transcription prototypes, able to cope with the diverse range of broadcast media. Media transcription has direct benefits (for example supporting accessibility through automatic subtitling), as well as enabling intelligent processing of broadcast media through natural language processing and text analytics.

2. Distant Speech Recognition (via project partner Emotech). The focus of this work is to develop prototype software for speech recognition in personal robots. Speech is perhaps the most natural communication modality for such robots, but the acoustic conditions can be extremely challenging due to reverberation and competing acoustic sources. Improving speech recognition accuracy for such devices in challenging environments is likely to have a significant impact on their usability and uptake.

We also plan to enhance the global impact of our research through project partner SRI International who have a specific R&D interest in speech recognition in highly challenging acoustic environments.

Publications

 
Description There are several key findings:

1) Speech recognition in high-dimensional spaces of acoustic waveforms, or in other high-dimensional representations that incur no loss of information, achieves higher accuracy on benchmark tasks than standard low-dimensional features, especially in the presence of noise.

2) Variational inference achieves state-of-the-art results on benchmark speech recognition tasks, which is a new insight.

3) A theoretical framework that characterises data augmentation as an instance of vicinal risk minimisation, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples (sketched in notation after this list).

4) An effective data augmentation technique in the domain of acoustic speech waveforms that improves the generalisation of automatic speech recognition to unseen conditions.

5) Speech modelling in the acoustic waveform leads to significant improvements in the accuracy of dysarthric speech recognition.
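
For finding 3 above, the vicinal risk minimisation view can be written compactly; this is our gloss in standard notation, not a formula quoted from the project's publications:

    \hat{R}_{\mathrm{emp}}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
    \qquad\longrightarrow\qquad
    \hat{R}_{\mathrm{vic}}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n}
        \mathbb{E}_{(\tilde{x},\,\tilde{y}) \sim \nu(\cdot \mid x_i, y_i)}
        \Big[ \ell\big(f(\tilde{x}), \tilde{y}\big) \Big]

Here \ell is the training loss and \nu is a vicinal density placed around each training pair (x_i, y_i), approximating the population density in its neighbourhood. Data augmentation, such as the waveform-domain technique in finding 4, then amounts to drawing samples from \nu rather than reusing only the original training points.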
Exploitation Route We are pursuing two unorthodox approaches to speech recognition: variational inference and high-dimensional representations. So far we have demonstrated state-of-the-art accuracy with both paradigms, and we will pursue them further to achieve step improvements. We have also developed a theoretical underpinning of data augmentation as an instance of vicinal risk minimisation and used it to design a novel data augmentation technique that demonstrated unprecedented generalisation of automatic speech recognition systems to unseen conditions. These results have the potential to impact academic research, towards developing fundamental insights and theoretical frameworks underpinning robust automatic speech recognition, and at the same time to improve the robustness of practical automatic speech recognition systems.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Healthcare; Pharmaceuticals and Medical Biotechnology

URL https://arxiv.org/abs/2110.08634
 
Description Acoustic waveform modelling paradigm
Geographic Reach Multiple continents/international 
Policy Influence Type Contribution to new or improved professional practice
 
Title ASR in the acoustic waveform domain software 
Description Conceptual developments of the SpeechWave project were validated practically using standard software platforms and libraries for automatic speech recognition. The key platforms were Kaldi for speech recognition and decoding, and PyTorch for deep learning training and inference. We developed in-house code and libraries to expand the scope of current mainstream systems towards raw waveform and raw signal modelling. All code and libraries developed by our team were written in Python and were made publicly available or shared with the scientific community upon request.
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact We are not aware of a major impact yet. 
 
Description Sheffield 
Organisation University of Sheffield
Department Department of Computer Science
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaborative research.
Collaborator Contribution Collaborative research.
Impact One joint journal publication and one joint conference publication on robust dysarthric speech recognition.
Start Year 2021