Robust Syllable Recognition in the Acousic-Waveform Domain

Lead Research Organisation: King's College London

Department Name: Electronic Engineering

Abstract

This proposal is concerned with robust classification/recognition of speech units (phonemes and consonant-vowel syllables) in the domain of acoustic waveforms. The motivation for this research comes from the idea that speech units should be much better separated in the high-dimensional spaces formed by acoustic waveforms than in the smaller representation spaces which are used in state-of-the-art speech recognition systems and which involve significant compression and dimension reduction. Hence, recognition/classification in the acoustic waveform domain should exhibit a higher level of robustness to additive noise than classification in low-dimensional feature spaces.In the first phase of the project we will investigate classification of speech units in the acoustic waveform domain under severe noise conditions, around 0dB signal-to-noise ratio and below, while in the second phase we will study techniques which would make classification robust also to linear filtering. The particular tasks that will be tackled in the first phase can be summarized as follows:1. Study the detailed structure of the sets of acoustic waveforms of individual speech units; in particular their intrinsic dimensions, and the existence of possible nonlinear surfaces on which the data are concentrated.2. Guided by the findings from item 1 above, estimate statistical models of the distribution of speech units in the acoustic waveform domain. We will then design and systematically assess so-called generative classifiers, whose defining property is that they are based on such statistical models.3. Investigate classification of speech units in the acoustic waveform domain using discriminative classification techniques (artificial neural networks, support vector machines, and relevance vector machines). These can be a useful alternative to generative techniques because they focus directly on the classification problem without building explicit models of waveform distributions for each speech unit.4. Construct classifiers by grouping speech units hierarchically. Top-level classifiers will be constructed to distinguish between a small of groups of similar speech units, followed by classifiers separating groups into subgroups and so on. Different methods for defining subgroups will be explored, including confusion matrices of the classifiers from item 3, appropriate distance measures between the statistical models obtained in item 2, and possibly perceptual experiments.A potential argument against our approach is that classification in the acoustic waveform domain will break down in the presence of linear filtering. However, this can be avoided by considering narrow-band signals: for these, the effect of linear filtering is approximately equivalent to amplitude scaling and time delay. In the second phase of the project, we will therefore consider speech classification using narrow-band components of acoustic waveforms. For classification of signals in individual sub-bands, the techniques investigated in the first phase of the project will be considered. A new issue is then how to combine the results of sub-band classifiers to minimize the overall classification error. Here recently developed machine learning techniques will be used, as specified in the case for support.As explained, individual sub-band classifiers should be robust to linear filtering because the latter does not significantly alter the shape of narrow-band signals. On the other hand, the dimension of the spaces of sub-band waveforms will be still high enough to facilitate classification robust to additive noise. Hence, the overall scheme is expected to be robust to both additive noise and linear fitering.

Funded Value:

£207,533

Funded Period:

Sep 06 - Mar 10

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/D053005/1

Principal Investigator:

Zoran Cvetkovic

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Human Communication in ICT (75%)

Music & Acoustic Technology (25%)

Organisations

People	ORCID iD
Zoran Cvetkovic (Principal Investigator)
Peter Sollich (Co-Investigator)

Publications

Author Name

Title Publication Date Published

|< < 1 2 > >|

10 25 50

Ager M (2010) High-dimensional linear representations for robust speech recognition

Ager M (2011) Combined waveform-cepstral representation for robust speech recognition

Ager M. (2008) Towards robust phoneme classification: Augmentation of PLP models with acoustic waveforms in European Signal Processing Conference

J Yousafzai (2008) Combined PLP - acoustic waveform classification for robust phoneme recognition using support vector machines in European Signal Processing Conference, EUSIPCO 2008

Jibran Yousafzai (Author) (2009) Costume designed SVM kernels for improved robustness of phoneme classification in European Signal Processing Conference, EUSIPCO 2009

Matthew Ager (Author) (2009) Robust phoneme classification: exploiting the adaptability of acoustic waveform models in European Signal Processing Conference, EUSIPCO 2009

Yousafzai J (2008) Discriminative and generative machine learning approaches towards robust phoneme classification

Yousafzai J (2010) Subband acoustic waveform front-end for robust speech recognition using support vector machines

Yousafzai J (2011) Combined Features and Kernel Design for Noise Robust Phoneme Classification Using Support Vector Machines in IEEE Transactions on Audio, Speech, and Language Processing

Yousafzai J. (2009) Tuning support vector machines for robust phoneme classification with acoustic waveforms in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Key Findings
Impact Summary
Further Funding
Collaboration


Description	The project has demonstrated a significant potential of solving the long standing issue of the lack of robustness of automatic speech recognition (ASR) systems by posing the problem in high-dimensional spaces of acoustic waveforms of speech, possibly transformed by some linear orthogonal transforms. Further, for that purpose generative and discriminative (support vector machine) models are developed on this project.
Exploitation Route	The findings of this research are relevant for products and systems in defence, healthcare, and various other telecommunications and information systems, where speech is or can be used as mode of human-computer interaction. Findings of the project open up a new direction in the area of ASR as well as issues of learning in high dimensions and kernel methods, which are area of intense activity is several academic communities, including statistics, computer science, signal processing
Sectors	Aerospace Defence and Marine Digital/Communication/Information Technologies (including Software) Electronics Healthcare Leisure Activities including Sports Recreation and Tourism Security and Diplomacy


Description	The project was a proof-of-concept study, exploring a paradigm shift concept in automatic speech recognition. As such, findings of the project are still of fundamental scientific nature, but have gained us partnership with researchers from the University of California, Berkeley, and SRI (Stanford Research Institute) International for further exploration of the developed concepts to the point where they could be successfully deployed in practical automatic speech recognition systems.


Description	Travel grant
Amount	£21,059 (GBP)
Funding ID	EP/K034626/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	03/2013
End	03/2014


Description	Edinburgh
Organisation	University of Edinburgh
Country	United Kingdom
Sector	Academic/University
PI Contribution	Intellectual input.
Collaborator Contribution	Intellectual input.
Impact	Preparation of a joint grant proposal.
Start Year	2016


Description	SRI
Organisation	SRI International (inc)
Country	United States
Sector	Charity/Non Profit
PI Contribution	Exchange of ideas and technical discussions.
Collaborator Contribution	Exchange of ideas and technical discussions. They were co-sponsoring one visit of Prof Cvetkovic in 2012, and they were hosting him for 4 months (full or part time) at the lab in 2014.
Impact	We formulated a grant proposal, submitted to EPSRC, with SRI as a formal partner. It is a multidisciplinary project involving signal processing, statistics, and machine learning, applied to a problem in speech technologies.
Start Year	2012

Abstract

Organisations

People

ORCID iD

Publications