Generative Kernels and Score Spaces for Classification of Speech

Lead Research Organisation: University of Cambridge
Department Name: Engineering

Abstract

The aim of this project is to significantly improve the performance of automatic speech recognition systems across a wide range of environments, speakers and speaking styles. The performance of state-of-the-art speech recognition systems is often acceptable under fairly controlled conditions where the levels of background noise are low. However, in many realistic situations there can be high levels of background noise, for example in in-car navigation, or widely varying channel conditions and speaking styles, such as those observed on YouTube-style data. This fragility is one of the primary reasons that speech recognition systems are not more widely deployed and used. It limits the domains in which speech can be reliably used, and increases the cost of developing applications, as systems must be tuned to limit the impact of this fragility; this includes collecting domain-specific data and significant tuning of the application itself.

The vast majority of research in speech recognition has concentrated on improving the performance of hidden Markov model (HMM) based systems. HMMs are an example of a generative model and are currently used in state-of-the-art speech recognition systems. A wide range of approaches has been developed to improve the performance of these systems under speaker and noise changes. Despite these approaches, systems are not sufficiently robust to allow speech recognition to achieve the level of impact that the naturalness of the interface should allow. This project will combine the generative models developed in the speech community with discriminative classifiers used in both the speech and machine learning communities. An important, novel aspect of the proposed approach is that the generative models are used to define a score-space whose elements can be used as features by the discriminative classifiers. This approach has a number of advantages. First, current state-of-the-art adaptation and robustness approaches can be used to compensate the acoustic models for particular speakers and noise conditions; as well as enabling any advances in these approaches to be incorporated into the scheme, this means it is not necessary to develop approaches that adapt the discriminative classifiers to speakers, styles and noise. Second, one of the major problems in speech recognition is that variable-length data sequences must be classified; using generative models allows the dynamic aspects of speech data to be handled without having to alter the discriminative classifier. The final advantage is the nature of the score-space obtained from the generative model. Generative models such as HMMs have underlying conditional independence assumptions that, whilst enabling them to represent data sequences efficiently, do not accurately represent the dependencies in data sequences such as speech. The score-space associated with a generative model does not have the same conditional independence assumptions as the original generative model, which allows more accurate modelling of the dependencies in the speech data.

The combination of generative and discriminative classifiers will be investigated on two very difficult forms of data on which current systems perform badly. The first task is recognition of speech in adverse environments, where very high levels of background noise cause severe degradation in system performance. Data of interest for this task will be specified in collaboration with Toshiba Research Europe Ltd. The second task is large-vocabulary speech recognition of data from a wide range of speaking styles and conditions. Google has supplied transcribed data from YouTube to allow evaluation of systems on highly diverse data. The project is expected to yield significant performance gains over current state-of-the-art approaches for both tasks.
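
As a minimal illustration of the score-space idea described in this abstract (a sketch under assumed parameters, not the project's implementation), the Python fragment below maps a variable-length observation sequence to a fixed-length score-space vector for a diagonal-covariance Gaussian mixture model: the sequence log-likelihood together with its derivatives with respect to the component means. The model, parameter shapes and function name are illustrative assumptions.

import numpy as np

def gmm_log_score_space(X, weights, means, variances):
    """X: (T, D) observation sequence; weights: (K,); means, variances: (K, D).
    Returns a (1 + K*D,) score-space vector: [log-likelihood, d/d mu_1, ..., d/d mu_K]."""
    T, D = X.shape
    K = means.shape[0]
    # Per-frame log-density of each diagonal-covariance component: (T, K)
    log_dens = np.empty((T, K))
    for k in range(K):
        diff = X - means[k]
        log_dens[:, k] = (-0.5 * np.sum(diff ** 2 / variances[k], axis=1)
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[k])))
    log_joint = log_dens + np.log(weights)               # component log-joints
    log_like = np.logaddexp.reduce(log_joint, axis=1)    # per-frame log-likelihood
    resp = np.exp(log_joint - log_like[:, None])         # component posteriors
    # d log p(X) / d mu_k = sum_t resp[t, k] * (x_t - mu_k) / var_k
    grads = [(resp[:, k:k + 1] * (X - means[k]) / variances[k]).sum(axis=0)
             for k in range(K)]
    # Fixed-length feature, whatever the sequence length T
    return np.concatenate([np.array([log_like.sum()])] + grads)

# Example with arbitrary parameters: K=2 components, D=3 features, T=50 frames
rng = np.random.default_rng(0)
feature = gmm_log_score_space(rng.normal(size=(50, 3)),
                              weights=np.array([0.5, 0.5]),
                              means=rng.normal(size=(2, 3)),
                              variances=np.ones((2, 3)))
print(feature.shape)   # (7,) -- independent of the 50-frame sequence length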

Planned Impact

The growth of business based on speech-enabled technology has been slower than predicted. A major contributing factor to this slow growth is that speech recognition systems are still not sufficiently robust to changing background noise conditions, speaking styles and accents. This results in unacceptable performance for too many users. Furthermore, the cost of developing these applications is large, as data is typically collected for the specific target domain and the application tuned to reduce the impact of the current fragility of speech recognition systems. Any approach that yields significant improvements in robustness (to noise, speaker and domain changes) would therefore be of enormous direct benefit to the speech industry, making many new applications of the technology feasible.

A range of companies in the UK would benefit in this case, from core speech technology providers, such as Autonomy and Toshiba Research Europe Ltd, to application providers, such as Telephonetics and Acuvoice, through to application designers, such as VoxGen and Edify. Companies such as major airlines, banks and new media organisations, which would like to make further use of speech recognition to reduce operating costs and enable new applications, would also benefit.

The outcome of this research will be shared in the first instance with providers of core speech recognition technology. The Speech Group at CUED has close collaborations with a number of UK and international speech companies including Toshiba Research Europe Ltd (TREL), Google and IBM. Data from existing collaborations with TREL and Google will be used to benchmark the technology created within this research project. This will allow the companies to easily identify technical advances over their existing technology. In addition to research publications, including conference papers and technical reports, software implementations of the research outcomes will be made available. The software will be released as an extension to the existing HTK toolkit via the HTK website. This will enable the broader industry to replicate the results on publicly available databases.
 
Description This research has demonstrated that, in challenging speech recognition environments, extracting rich features from the audio yields performance gains. Furthermore, an efficient algorithm for extracting these rich features has been proposed and is currently being evaluated. This work has continued under a Google-funded project.
Exploitation Route The extracted features can be incorporated into a range of classifiers, including those based on deep learning. This has been investigated under a Google Research Award.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html
 
Description Google Research Awards
Amount $77,283 (USD)
Organisation Google 
Department Research at Google
Sector Private
Country United States
Start 02/2014 
 
Description NTT Collaboration Funding
Amount £10,000 (GBP)
Funding ID RG78437 
Organisation Nippon Telegraph and Telephone Corporation 
Sector Private
Country Japan
Start 09/2016 
End 02/2017
 
Description NTT Research Collaboration 
Organisation Nippon Telegraph and Telephone Corporation
Country Japan 
Sector Private 
PI Contribution Hosted a visitor from NTT for a year in the Speech Group at Cambridge University.
Collaborator Contribution Fully funded the visitor's salary, bench fees and computing equipment.
Impact The outcome has been in the form of papers (conference papers and a journal paper). A longer-term agreement with NTT is currently being negotiated; initial discussions are for £20,000 of collaborative research.
Start Year 2013
 
Title Cross-entropy for model compensation - Python program 
Description This software is a Python program that implements a non-parametric method which, given speech and noise distributions and a mismatch function, computes the corrupted-speech likelihood. It uses sampling and is exact in the limit of infinite samples. It therefore gives a theoretical bound for model compensation. 
Type Of Technology Software 
Year Produced 2011 
Open Source License? Yes  
Impact This software was used for the results described in: R. C. van Dalen and M. J. F. Gales (2013). "Importance Sampling to Compute Likelihoods of Noise-Corrupted Speech." In Computer Speech and Language 27 (1), pp. 322-349. 
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#source
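
As a simplified, one-dimensional illustration of the sampling approach described above (a sketch only, not the released program), the fragment below estimates the corrupted-speech likelihood by Monte Carlo under assumed Gaussian speech and noise distributions, the standard log-add mismatch function and a small Gaussian observation term; the estimate becomes exact as the number of samples grows.

import numpy as np

rng = np.random.default_rng(0)

def mismatch(x, n):
    # Log-spectral-domain mismatch function: log-add of speech and noise power
    return x + np.log1p(np.exp(n - x))

def corrupted_likelihood(y, mu_x, var_x, mu_n, var_n, obs_var=0.01, samples=100000):
    """Estimate p(y) by sampling speech x and noise n from their prior distributions."""
    x = rng.normal(mu_x, np.sqrt(var_x), samples)
    n = rng.normal(mu_n, np.sqrt(var_n), samples)
    f = mismatch(x, n)
    # p(y | x, n) = N(y; f(x, n), obs_var), averaged over the prior samples
    log_p = -0.5 * ((y - f) ** 2 / obs_var + np.log(2.0 * np.pi * obs_var))
    return np.exp(log_p).mean()

# Likelihood of observing corrupted log-spectral value y = 1.0
print(corrupted_likelihood(1.0, mu_x=0.0, var_x=1.0, mu_n=-2.0, var_n=0.5))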
 
Title Flipsta library: manipulate finite-state automata in C++ and Python. 
Description The Flipsta library deals with finite-state automata: concise representations of, for example, large sets of word sequences with probabilities attached to them. Many algorithms in text and speech processing can be expressed in terms of a handful of automaton operations. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact No impact to date (just released) 
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#source
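
The fragment below is purely illustrative and does not use the Flipsta API: it builds a tiny acyclic weighted acceptor in plain Python and performs the generic "sum the weights of all paths" operation to which many of the text and speech algorithms mentioned above reduce.

from collections import defaultdict

# arcs: state -> list of (word, probability, next state); a small acyclic acceptor
arcs = defaultdict(list)
arcs[0] = [("the", 0.6, 1), ("a", 0.4, 1)]
arcs[1] = [("cat", 0.7, 2), ("cap", 0.3, 2)]
final = {2: 1.0}   # final state and its final weight

def total_weight(state, weight=1.0):
    """Sum of path weights from `state` to any final state (acyclic case only)."""
    total = weight * final.get(state, 0.0)
    for _word, prob, nxt in arcs[state]:
        total += total_weight(nxt, weight * prob)
    return total

print(total_weight(0))   # 1.0: the acceptor encodes a distribution over four word sequences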
 
Description Dirichlet Process Mixture of Experts Models in Speech Recognition 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation poster presentation
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Poster presentation on non-parametric Bayesian approaches for speech recognition.

The poster was presented at the UK & Ireland speech meeting.

No significant changes
Year(s) Of Engagement Activity 2012
 
Description Efficient Decoding with Generative Score-Spaces Using the Expectation Semiring 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation poster presentation
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Description of efficient feature extraction for use in discriminative models.

No notable direct impact
Year(s) Of Engagement Activity 2012
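
As an illustrative sketch of the expectation semiring named in the title (not the code used in the reported work), the fragment below pairs a probability with an accumulated feature value, so that expected features can be computed over alternative paths using ordinary semiring plus and times operations.

class Expectation:
    """Expectation-semiring element: (probability mass, probability-weighted feature sum)."""
    def __init__(self, p, v):
        self.p, self.v = p, v
    def __add__(self, other):   # semiring plus: combine alternative paths
        return Expectation(self.p + other.p, self.v + other.v)
    def __mul__(self, other):   # semiring times: extend a path
        return Expectation(self.p * other.p,
                           self.p * other.v + other.p * self.v)

# An arc carrying probability p and feature value f contributes Expectation(p, p * f).
path_a = Expectation(0.6, 0.6 * 2.0)
path_b = Expectation(0.4, 0.4 * 5.0)
combined = path_a + path_b
print(combined.v / combined.p)   # expected feature value over both paths: 3.2

# Extending a path multiplies probabilities and sums features along the path:
extended = path_a * Expectation(0.5, 0.5 * 1.0)
print(extended.v / extended.p)   # 3.0 = 2.0 + 1.0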
 
Description Generative Kernels and Score-Spaces for Classification of Speech: Progress Report 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Online publication of the milestone report after the first year of the project.

This is a project milestone and is made available via the project web page.
Year(s) Of Engagement Activity 2012
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html
 
Description Generative Kernels and Score-Spaces for Classification of Speech: Progress Report II 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Publication of the second milestone report for the project.

No direct impact. The report was referenced in a successful application for a Google Research Award.
Year(s) Of Engagement Activity 2013
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html
 
Description Generative Kernels and Score-Spaces for Classification of Speech: Progress Report III 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Participants in your research and patient groups
Results and Impact Final milestone report for the project.

No significant external impact. It formed the basis of a paper submission to ASRU 2015.
Year(s) Of Engagement Activity 2015
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#publications
 
Description Monoids: efficient segmental features for speech recognition. 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Technical report publication (referenced in subsequent papers). Note that downloads of this report were not tracked.

No notable impact.
Year(s) Of Engagement Activity 2013
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html
 
Description Presentation at Google Visit 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Participants in your research and patient groups
Results and Impact Discussion of collaboration opportunities with Google to continue research in this area.

This led to plans to visit Google Research in London.
Year(s) Of Engagement Activity 2015
 
Description Structured Discriminative Models for Speech Recognition 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote speech at International Symposium on Chinese Spoken Language Processing 2012.

Increased interest from colleagues in discriminative models and sequence kernels.
Year(s) Of Engagement Activity 2012
 
Description Structured Discriminative Models for Speech Recognition 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Participants in your research and patient groups
Results and Impact This talk was related to an invitation to visit the NTT CS Lab in Kyoto, Japan, after ICASSP 2012. An overview of discriminative models and the use of score-spaces derived from generative models was presented.

This initiated the collaboration with NTT; Dr Takuya Yoshioka visited Cambridge in 2013-2014.
Year(s) Of Engagement Activity 2012
 
Description The exact word error for a lattice - Poster presentation at Google 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Participants in your research and patient groups
Results and Impact Increased interest in extracting features from lattices using semirings.

A Google Research Award was subsequently obtained.
Year(s) Of Engagement Activity 2013
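
As a hedged illustration of the underlying quantity, the fragment below computes the exact word error (Levenshtein distance) between a reference and a single hypothesis; the lattice case discussed in the poster generalises this dynamic programme to every path in a lattice via semiring operations.

def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(ref)][len(hyp)]

print(word_errors("the cat sat", "the cat sat down"))   # 1 (one insertion)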
 
Description UK Speech - Annotating large lattices with the exact word error 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Participants in your research and patient groups
Results and Impact Informed the UK Speech community of ongoing research on the use of finite-state automata in acoustic modelling for speech recognition. The talk prompted a series of questions and informal discussions after the presentation.

No significant impact to date.
Year(s) Of Engagement Activity 2015
URL http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html
 
Description UK Speech - Infinite Structured Support Vector Machines for Speech Recognition 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Participants in your research and patient groups
Results and Impact Dissemination of information about non-parametric Bayesian classifiers

No notable impact
Year(s) Of Engagement Activity 2014