Generative Kernels and Score Spaces for Classification of Speech
Lead Research Organisation:
University of Cambridge
Department Name: Engineering
Abstract
The aim of this project is to significantly improve the performance of automatic speech recognition systems across a wide range of environments, speakers and speaking styles. The performance of state-of-the-art speech recognition systems is often acceptable under fairly controlled conditions and where the levels of background noise are low. However, for many realistic situations there can be high levels of background noise, for example in-car navigation, or widely ranging channel conditions and speaking styles, such as observed on YouTube-style data. This fragility of speech recognition systems is one of the primary reasons that speech recognition systems are not more widely deployed and used. It limits the possible domains in which speech can be reliably used, and increases the cost of developing applications as systems must be tuned to limit the impact of this fragility. This includes collecting domain-specific data and significant tuning of the application itself.
The vast majority of research for speech recognition has concentrated on improving the performance of hidden Markov model (HMM) based systems. HMMs are an example of a generative model and are currently used in state-of-the-art speech recognition systems. A large number of approaches have been developed to improve the performance of these systems under speaker and noise changes. Despite these approaches, systems are not sufficiently robust to allow speech recognition systems to achieve the level of impact that the naturalness of the interface should allow. This project will combine the current generative models developed in the speech community with discriminative classifiers used in both the speech and machine learning communities. An important, novel aspect of the proposed approach is that the generative models are used to define a score-space that can be used as features by the discriminative classifiers. This approach has a number of advantages.
It is possible to use current state-of-the-art adaptation and robustness approaches to compensate the acoustic models for particular speakers and noise conditions. As well as enabling any advances in these approaches to be incorporated into the scheme, it is not necessary to develop approaches that adapt the discriminative classifiers to speakers, style and noise. One of the major problems with speech recognition is that variable-length data sequences must be classified. Using generative models also allows the dynamic aspects of speech data to be handled without having to alter the discriminative classifier. The final advantage is the nature of the score-space obtained from the generative model. Generative models such as HMMs have underlying conditional independence assumptions that, whilst enabling them to efficiently represent data sequences, do not accurately represent the dependencies in data sequences such as speech. The score-space associated with a generative model does not have the same conditional independence assumptions as the original generative model. This allows more accurate modelling of the dependencies in the speech data.
The combination of generative and discriminative classifiers will be investigated on two very difficult forms of data that current systems perform badly on. The first task is adverse environment recognition of speech. In these situations there are very high levels of background noise, which cause severe degradation in system performance. Data of interest for this task will be specified in collaboration with Toshiba Research Europe Ltd. The second task of interest is large vocabulary speech recognition of data from a wide range of speaking styles and conditions. Google has supplied transcribed data from YouTube to allow evaluation of systems on highly diverse data. The project will yield significant performance gains over current state-of-the-art approaches for both tasks.
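The score-space idea described in the abstract can be illustrated with a minimal sketch. It assumes a single one-dimensional Gaussian as the generative model, which is far simpler than the HMMs the project targets, and the function name `score_space` is hypothetical. The feature vector holds the log-likelihood and its derivatives with respect to the mean and variance, so sequences of any length map to vectors of the same size that a discriminative classifier could consume.

```python
import math


def score_space(observations, mean, var):
    """Map a variable-length sequence to a fixed-length feature vector.

    Features: [log-likelihood, d(log-lik)/d(mean), d(log-lik)/d(var)]
    under a single Gaussian (an illustrative stand-in for an HMM).
    """
    log_lik = 0.0
    d_mean = 0.0
    d_var = 0.0
    for o in observations:
        diff = o - mean
        log_lik += -0.5 * (math.log(2.0 * math.pi * var) + diff * diff / var)
        d_mean += diff / var
        d_var += 0.5 * (diff * diff / (var * var) - 1.0 / var)
    return [log_lik, d_mean, d_var]


# Two sequences of different lengths yield feature vectors of equal size.
short_seq = [0.2, -0.1, 0.4]
long_seq = [0.2, -0.1, 0.4, 1.3, 0.7, -0.5, 0.9]
short_features = score_space(short_seq, mean=0.0, var=1.0)
long_features = score_space(long_seq, mean=0.0, var=1.0)
print(len(short_features), len(long_features))
```

The derivative features sum over all frames of the sequence, which is one way to see why the resulting features need not obey the frame-level independence assumptions of the underlying model.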
Planned Impact
The growth of business based on speech-enabled technology has been slower than predicted. A major contributing factor to this slow growth is that speech recognition systems are still not sufficiently robust to changing background noise conditions, speaker styles, and accents. This results in unacceptable performance for too many users. Furthermore, the cost of developing these applications is large, as data is typically collected for the specific target domain and the application tuned to reduce the impact of the current fragility of speech recognition systems. Any approach that yields significant improvements in robustness (to noise, speaker and domain changes) would therefore be of enormous direct benefit to the speech industry, making many new applications of the technology feasible. A range of companies in the UK would benefit in this case, from core speech technology providers, such as Autonomy and Toshiba Research Europe Ltd, to application providers, such as Telephonetics and Acuvoice, through to application designers, such as VoxGen and Edify. Companies such as major airlines, banks and new media, who would like to make further use of speech recognition to reduce operating costs and enable new applications, would also benefit. The outcome of this research will be shared in the first instance with providers of core speech recognition technology. The Speech Group at CUED has close collaborations with a number of UK and international speech companies including Toshiba Research Europe Ltd (TREL), Google and IBM. Data from existing collaborations with TREL and Google will be used to benchmark the technology created within this research project. This will allow the companies to easily identify technical advances over their existing technology. In addition to research publications including conference papers and technical reports, software implementations of the research outcomes will be made available. The software will be released as an extension to the existing HTK toolkit via the HTK website. This will enable broader industry to replicate the results on publicly available databases.
People |
ORCID iD |
Mark Gales (Principal Investigator) |
Publications
Jingzhou Yang (Author)
(2013)
Infinite Support Vector Machines in Speech Recognition
Mark Gales (Author)
(2012)
Efficient decoding with continuous rational kernels using the expectation semiring
Van Dalen R
(2011)
A variational perspective on noise-robust speech recognition
Van Dalen R
(2015)
Structured discriminative models using deep neural-network features
Van Dalen R C
(2015)
Annotating large lattices with the exact word error
Van Dalen R C
(2015)
Structured discriminative models using deep neural-network features
Yang J
(2016)
System Combination with Log-Linear Models
Description | This research has demonstrated that, under challenging speech recognition environments, extracting rich features from the audio yields performance gains. Furthermore, an efficient algorithm for extracting these rich features has been proposed and is currently being evaluated. This work has continued under a Google-funded project. |
Exploitation Route | The features that can be extracted can be incorporated into a range of classifiers, including those based on deep learning. This has been investigated under a Google Research Award. |
Sectors | Digital/Communication/Information Technologies (including Software) |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html |
Description | Google Research Awards |
Amount | $77,283 (USD) |
Organisation | |
Department | Research at Google |
Sector | Private |
Country | United States |
Start | 02/2014 |
Description | NTT Collaboration Funding |
Amount | £10,000 (GBP) |
Funding ID | RG78437 |
Organisation | Nippon Telegraph and Telephone Corporation |
Sector | Private |
Country | Japan |
Start | 09/2016 |
End | 02/2017 |
Description | NTT Research Collaboration |
Organisation | Nippon Telegraph and Telephone Corporation |
Country | Japan |
Sector | Private |
PI Contribution | Visitor from NTT for a year to the Speech Group at Cambridge University. |
Collaborator Contribution | Fully funded salary, bench fees and compute equipment. |
Impact | The outcome has been in the form of papers (conference papers and a journal paper). Currently negotiating a longer-term agreement with NTT. Initial discussions are for £20,000 for collaborative research. |
Start Year | 2013 |
Title | Cross-entropy for model compensation - Python program |
Description | This software is a Python program that implements a non-parametric method that, given speech and noise distributions and a mismatch function, computes the corrupted-speech likelihood. It uses sampling and is exact in the limit. It therefore gives a theoretical bound for model compensation. |
Type Of Technology | Software |
Year Produced | 2011 |
Open Source License? | Yes |
Impact | This software was used for the results described in: R. C. van Dalen and M. J. F. Gales (2013). "Importance Sampling to Compute Likelihoods of Noise-Corrupted Speech." In Computer Speech and Language 27 (1), pp. 322-349. |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#source |
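The sampling idea in the software description above can be sketched as follows. This is not the released program: it is a plain Monte Carlo estimator rather than the importance sampling scheme of the cited paper, and it assumes one-dimensional Gaussian speech and noise in the log-spectral domain with the phase-free mismatch function y = log(exp(x) + exp(n)). The names `gauss_pdf` and `corrupted_likelihood` are hypothetical.

```python
import math
import random


def gauss_pdf(v, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def corrupted_likelihood(y, speech, noise, num_samples=20000, seed=0):
    """Monte Carlo estimate of p(y) for y = log(exp(x) + exp(n)).

    speech and noise are (mean, std) pairs (an assumption of this sketch).
    Noise samples are drawn from the noise prior; for each sample n < y the
    speech value x that explains y is recovered, and its density is
    weighted by the Jacobian |dx/dy| = 1 / (1 - exp(n - y)).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        n = rng.gauss(noise[0], noise[1])
        if n >= y:
            continue  # noise alone exceeds y: no speech value can explain it
        gap = -math.expm1(n - y)  # equals 1 - exp(n - y), positive here
        x = y + math.log(gap)
        total += gauss_pdf(x, speech[0], speech[1]) / gap
    return total / num_samples


estimate = corrupted_likelihood(1.0, speech=(0.0, 1.0), noise=(-1.0, 0.5))
print(estimate)
```

As the number of samples grows the estimate converges to the integral over the noise prior, which is the sense in which a sampling method of this kind is exact in the limit.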
Title | Flipsta library: manipulate finite-state automata in C++ and Python. |
Description | The Flipsta library deals with finite-state automata. These are concise representations of, say, many word sequences, with probabilities attached to them. Many algorithms in text and speech processing can be expressed in terms of a handful of automaton operations. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | No impact to date (just released) |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#source |
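To illustrate what expressing an algorithm as an automaton operation can look like, here is a minimal sketch in plain Python. It does not use the Flipsta API; the arc list, the `accumulate` function and the example words are all hypothetical. The same traversal of a small acyclic automaton gives the total probability of all word sequences or the probability of the best one, depending only on which "plus" operation is supplied.

```python
import operator
from collections import defaultdict

# Arcs of a small acyclic automaton: (source, target, word, probability).
# States are assumed to be numbered in topological order.
ARCS = [
    (0, 1, "recognise", 0.6),
    (0, 1, "wreck a nice", 0.4),
    (1, 2, "speech", 0.7),
    (1, 2, "beach", 0.3),
]


def accumulate(arcs, start, final, plus, times, one, zero):
    """Combine the weights of all paths from start to final.

    plus merges alternative paths, times extends a path by one arc.
    """
    weight = defaultdict(lambda: zero)
    weight[start] = one
    # Visiting arcs by source state means every state's weight is complete
    # before any arc leaving it is followed.
    for source, target, _word, w in sorted(arcs, key=lambda arc: arc[0]):
        weight[target] = plus(weight[target], times(weight[source], w))
    return weight[final]


# Sum over paths: total probability mass of all word sequences.
total = accumulate(ARCS, 0, 2, operator.add, operator.mul, 1.0, 0.0)
# Max over paths: probability of the single most likely word sequence.
best = accumulate(ARCS, 0, 2, max, operator.mul, 1.0, 0.0)
print(total, best)
```

Swapping the pair of operations while keeping the traversal fixed is the design choice that lets a handful of automaton operations cover many text and speech algorithms.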
Description | Dirichlet Process Mixture of Experts Models in Speech Recognition |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Type Of Presentation | poster presentation |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Poster presentation on non-parametric Bayesian approaches for speech recognition. This was the UK & IE speech meeting. No significant changes. |
Year(s) Of Engagement Activity | 2012 |
Description | Efficient Decoding with Generative Score-Spaces Using the Expectation Semiring |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Type Of Presentation | poster presentation |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Description of efficient feature extraction for use in discriminative models. No notable direct impact. |
Year(s) Of Engagement Activity | 2012 |
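For readers unfamiliar with the expectation semiring named in the poster title above, the following is a minimal sketch of the general idea, not the project's implementation. Each arc carries a pair (probability, probability times feature value), and one pass over an acyclic automaton yields the expected feature value over all paths. The arc list, feature values and function names are hypothetical.

```python
# Expectation semiring over pairs (p, r): p is probability mass and
# r is the probability-weighted feature total.
E_ZERO = (0.0, 0.0)
E_ONE = (1.0, 0.0)


def e_plus(a, b):
    """Merge two alternative sets of paths."""
    return (a[0] + b[0], a[1] + b[1])


def e_times(a, b):
    """Extend paths a with paths b (product rule on the second element)."""
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])


# Arcs: (source, target, probability, feature value on this arc).
# States are assumed to be numbered in topological order.
ARCS = [
    (0, 1, 0.6, 1.0),
    (0, 1, 0.4, 3.0),
    (1, 2, 0.7, 1.0),
    (1, 2, 0.3, 2.0),
]


def expected_feature(arcs, start, final):
    """Expected total feature value over all paths from start to final."""
    weight = {start: E_ONE}
    for source, target, prob, value in sorted(arcs):
        arc_weight = (prob, prob * value)
        through = e_times(weight.get(source, E_ZERO), arc_weight)
        weight[target] = e_plus(weight.get(target, E_ZERO), through)
    total_prob, total_value = weight[final]
    return total_value / total_prob


expectation = expected_feature(ARCS, 0, 2)
print(expectation)
```

The pass touches each arc once, so the expectation is obtained without enumerating paths, which is what makes this style of feature extraction efficient.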
Description | Generative Kernels and Score-Spaces for Classification of Speech: Progress Report |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | On-line publication for milestone report after year 1 of the project. This is a milestone from the project and made available via the project web-page. |
Year(s) Of Engagement Activity | 2012 |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html |
Description | Generative Kernels and Score-Spaces for Classification of Speech: Progress Report II |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Publication of the second milestone report for the project. No direct impact. Report referenced in successful application for Google Research Award. |
Year(s) Of Engagement Activity | 2013 |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html |
Description | Generative Kernels and Score-Spaces for Classification of Speech: Progress Report III |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Participants in your research and patient groups |
Results and Impact | Final milestone report for the project. No significant external impact. Formed the basis of a paper submission for ASRU 2015. |
Year(s) Of Engagement Activity | 2015 |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#publications |
Description | Monoids: efficient segmental features for speech recognition. |
Form Of Engagement Activity | A magazine, newsletter or online publication |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Technical report publication (referenced in subsequent papers). Note that downloads of this paper were not tracked. No notable impact. |
Year(s) Of Engagement Activity | 2013 |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html |
Description | Presentation at Google Visit |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Participants in your research and patient groups |
Results and Impact | Discussion of collaboration opportunities with Google to continue research in this area. Plans to visit Google Research in London. |
Year(s) Of Engagement Activity | 2015 |
Description | Structured Discriminative Models for Speech Recognition |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Type Of Presentation | keynote/invited speaker |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Keynote speech at International Symposium on Chinese Spoken Language Processing 2012. Increased interest from colleagues in discriminative models and sequence kernels. |
Year(s) Of Engagement Activity | 2012 |
Description | Structured Discriminative Models for Speech Recognition |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | Yes |
Geographic Reach | International |
Primary Audience | Participants in your research and patient groups |
Results and Impact | This talk was related to an invitation to visit NTT CS Lab in Kyoto, Japan, after ICASSP 2012. An overview of discriminative models and the use of score-spaces derived from generative models was presented. Initiated collaboration with NTT. Visitor to Cambridge, Dr Takuya Yoshioka, in 2013-2014. |
Year(s) Of Engagement Activity | 2012 |
Description | The exact word error for a lattice - Poster presentation Google |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Participants in your research and patient groups |
Results and Impact | Increased interest in features extracted from lattices using semirings. Google Research Award obtained. |
Year(s) Of Engagement Activity | 2013 |
Description | UK Speech - Annotating large lattices with the exact word error |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Participants in your research and patient groups |
Results and Impact | Informed the UK Speech community of ongoing research on the use of finite-state automata in acoustic modelling for speech recognition. Talk prompted a series of questions and informal discussions after the presentation. No significant impact to date. |
Year(s) Of Engagement Activity | 2015 |
URL | http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html |
Description | UK Speech - Infinite Structured Support Vector Machines for Speech Recognition |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Participants in your research and patient groups |
Results and Impact | Dissemination of information about non-parametric Bayesian classifiers. No notable impact. |
Year(s) Of Engagement Activity | 2014 |