Multi-Modal Blind Source Separation for Robot Audition

Lead Research Organisation: University of Surrey

Department Name: Vision Speech and Signal Proc CVSSP

Abstract

This proposal draws on expertise in blind source separation and multimodal (audio-visual) speech processing within the Centre for Vision, Speech and Signal Processing at the University of Surrey. The objective is to separate target speech in the presence of multiple competing sound sources in room environments, and thereby ultimately to make progress towards automatic machine perception of auditory scenes within an uncontrolled natural environment. The fundamental novelty of this work is to exploit visual cues to enhance the operation of frequency-domain blind source separation algorithms. Such audio-visual processing is targeted at mitigating the permutation problem, the underdetermined problem (i.e. when the number of sources is greater than the number of microphones), and the reverberation problem, which currently limit the practical applicability of blind source separation algorithms. The focus of the work is therefore on the signal processing algorithms and software tools that can be used to perform automatic separation of sound signals, e.g., for a robot. The body of work in this proposal is underpinned by the substantial experience of the investigators, two from the areas of blind source separation and digital speech processing, and one from the area of computer vision and pattern recognition. The outcomes of the proposed research will be of considerable value to the UK defence industry, especially in the areas of target separation, detection and multi-path mitigation (or dereverberation), with applications in, for example, human-robot interaction, security surveillance and human-computer interaction.

Funded Value:

£115,288

Funded Period:

Oct 09 - Oct 12

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/H012842/1

Principal Investigator:

Wenwu Wang

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Digital Signal Processing (100%)

Organisations

People	ORCID iD
Wenwu Wang (Principal Investigator)
Philip J B Jackson (Co-Investigator)
Josef Kittler (Co-Investigator)	http://orcid.org/0000-0002-8110-9205

Publications

Alinaghi A (2014) Joint Mixing Vector and Binaural Model Based Stereo Source Separation in IEEE/ACM Transactions on Audio, Speech, and Language Processing

Jan T (2011) A multistage approach to blind separation of convolutive speech mixtures in Speech Communication

Liu Q (2014) Interference Reduction in Reverberant Speech Separation With Visual Voice Activity Detection in IEEE Transactions on Multimedia

Liu Q (2012) Use of bimodal coherence to resolve the permutation problem in convolutive BSS in Signal Processing

Liu Q (2011) Blind source separation and visual voice activity detection for target speech extraction

Liu Q (2012) Reverberant speech separation based on audio-visual dictionary learning and binaural cues

Liu Q (2010) Latent Variable Analysis and Signal Separation

Liu Q (2013) Source Separation of Convolutive and Noisy Mixtures Using Audio-Visual Dictionary Learning and Probabilistic Time-Frequency Masking in IEEE Transactions on Signal Processing

Liu Q (2010) Bimodal coherence based scale ambiguity cancellation for target speech extraction and enhancement in Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010

Liu Q (2011) A visual voice activity detection method with adaboosting

Key Findings
Impact Summary
Further Funding
Collaboration
Software and Technical Products
Engagement Activities


Description	In this research, we have attempted to use both the audio and visual modalities to separate target speech from acoustic mixtures acquired in room environments, which contain multiple competing speech interferers and other sound sources. The outcomes of this research offer new insights into machine perception of auditory scenes within an uncontrolled natural environment, in particular into cross-modal fusion and interaction within the blind source separation (BSS) framework. The key findings of this research include:
1. Visual information is helpful for mitigating the scale and permutation ambiguities associated with traditional audio BSS algorithms.
Source separation of convolutive speech mixtures is often performed in the time-frequency domain using, e.g., the short-time Fourier transform (STFT), where the convolutive BSS problem is converted into multiple instantaneous BSS problems over different frequency channels, which are then solved by applying, e.g., independent component analysis (ICA) algorithms at each frequency bin. However, due to the indeterminacies inherent in the classical ICA model, the order and amplitudes of the source components estimated at these frequency channels may not be consistent, leading to the well-known permutation and scale ambiguities of frequency-domain BSS. We found that the visual information from concurrent video signals can be used as an additional cue for correcting the permutation and scale ambiguities of audio source separation algorithms. To use the visual information, we have developed a two-stage method comprising offline training and online separation.
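As an illustration of the permutation ambiguity described above, the following minimal sketch builds per-bin amplitude envelopes for two sources, swaps their order at random in some frequency bins, and then restores a consistent order using a reference activity envelope. The reference envelope, the on/off activity patterns and the function names are hypothetical, and plain Pearson correlation is used here as a stand-in for a cross-modal coherence measure; it is not the project's method.

```python
import random


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / ((var_x * var_y) ** 0.5 + 1e-12)


def align_permutations(bins, reference):
    """Order the two outputs of every frequency bin so that output 0 is the
    one whose envelope correlates best with the reference envelope."""
    aligned = []
    for out_a, out_b in bins:
        if correlation(out_a, reference) >= correlation(out_b, reference):
            aligned.append((out_a, out_b))
        else:
            aligned.append((out_b, out_a))
    return aligned


def demo(num_bins=8, num_frames=200, seed=0):
    """Simulate randomly permuted bins and check that alignment fixes them."""
    rng = random.Random(seed)
    # Hypothetical on/off activity patterns of the target and the interferer.
    target = [1.0 if (t // 20) % 2 == 0 else 0.0 for t in range(num_frames)]
    interferer = [1.0 if (t // 33) % 2 == 1 else 0.0 for t in range(num_frames)]

    bins, target_envelopes, swapped = [], [], []
    for _ in range(num_bins):
        # Per-bin envelopes: the activity pattern with random frame gains.
        env_t = [a * (0.5 + rng.random()) for a in target]
        env_i = [a * (0.5 + rng.random()) for a in interferer]
        swap = rng.random() < 0.5  # the separator's output order is arbitrary
        swapped.append(swap)
        target_envelopes.append(env_t)
        bins.append((env_i, env_t) if swap else (env_t, env_i))

    aligned = align_permutations(bins, target)
    # True for a bin when output 0 is the target envelope after alignment.
    correct = [aligned[k][0] is target_envelopes[k] for k in range(num_bins)]
    return swapped, correct


if __name__ == "__main__":
    swapped, correct = demo()
    print("bins swapped before alignment:", sum(swapped))
    print("bins correct after alignment:", sum(correct), "of", len(correct))
```

In this toy setting the reference envelope plays the role that a visual cue could play: it is tied to one source only, so it gives every frequency bin the same criterion for deciding which output belongs to the target.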
In the offline training stage, we statistically characterise the audio-visual (AV) coherence by mapping the AV data into a feature space: we take the Mel-frequency cepstral coefficients (MFCCs) as audio features and the lip width and height as visual features, and combine them to form an audio-visual feature space. We then model the features using, e.g., Gaussian mixture models (GMMs) and estimate their parameters with an adapted expectation maximisation (AEM) algorithm. In the online separation stage, we have developed a novel iterative sorting scheme based on coherence maximisation and majority voting to correct the permutation ambiguities of the frequency components. To address the scaling ambiguity, we use a group of scaling parameters, calculated in each Mel-frequency band using the bimodal coherence and interpolated across the adjacent frequency bands, which are then applied directly to the ICA-separated spectral components in each frequency bin. We have also adopted a robust feature selection scheme to improve the performance of the proposed AV-BSS system on data corrupted by outliers, such as background noise and room reverberation.
2. Visual information is helpful for detecting voice activity and for separating sources from noisy mixtures.
Voice activity, indicating whether a speaker is talking or silent, provides useful information about the number of speakers concurrently active in the auditory scene and therefore indicates whether the BSS problem is determined, over-determined or underdetermined (i.e. the number of sources is greater than the number of sensors). Detecting the voice activity of the speakers is an important and very challenging problem in robot audition research. Most research in voice activity detection (VAD) is conducted in the audio domain, where performance deteriorates severely in multi-source and noisy environments.
We found that visual information from the video signals accompanying the audio can be used to considerably improve audio-domain VAD performance. We have proposed a new visual VAD approach which combines lip-reading with binary classification to determine speech activity. More specifically, we have developed a new lip-reading method that is robust to head rotations and changes in illumination. In our lip extraction algorithm, greedy active contour models (ACM) drive the landmark points towards the lip contours; template matching is used to cope with head rotations, and shape energy constraints are applied to prevent points from bending abruptly (which often occurs if the image resolution is very low). Using the lip features obtained in lip-reading, we then form a binary VAD classifier based on the AdaBoost technique, which combines or boosts a set of 'weak' classifiers to obtain a 'strong' classifier with a lower error rate. We also found that the visual VAD can further improve the performance of the aforementioned AV-BSS algorithm. This is achieved as follows. First, in the offline training stage, we apply the AdaBoost training algorithm to the labelled visual features, which are extracted from the video signal associated with a target speaker. The trained AdaBoost model is then used for visual VAD to detect the silent periods in the target speech, using the accompanying video. Finally, these periods of the signal are suppressed by the multi-band spectral subtraction algorithm, as a post-processing stage for the proposed AV-BSS algorithm.
3. Dictionary learning based sparse coding provides an alternative way of modelling audio-visual coherence, offering improved BSS performance (over the feature-based technique) for separating reverberant and noisy speech mixtures acquired in real room environments.
We found that both speech signals and lip movements are 'sparse' by nature or can be made sparse in a transform domain, where 'sparse' means that only a few values in the signals (or their transformed coefficients) are non-zero. Using sparse representations, we could potentially design more effective BSS algorithms for noisy, reverberant, and/or underdetermined speech mixtures because, under such a representation, (i) the noise components or coefficients become less prominent compared with the signal components, and (ii) the speech sources are less likely to overlap with each other. Under the sparse coding framework, we have proposed a novel audio-visual dictionary learning (AVDL) technique for modelling the AV coherence. This method attempts to code the local temporal-spatial (TS) structures of an AV sequence, resembling the technique of locality-constrained linear coding. We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as scalability and computational complexity. Our AVDL algorithm follows the commonly employed two-stage coding-learning process, but makes new contributions in both the coding and learning stages, including a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and a varying sparsity for different modalities. Each AV atom in our dictionary contains both an audio atom and a visual atom spanning the same temporal length.
The audio atom is the magnitude spectrum of an audio segment, which is found to be more robust to convolutive noise than time-domain representations. The visual atom is composed of several consecutive frames of image patches, focusing on the movement of the whole mouth region. The AVDL algorithm has been applied in the offline training stage of the AV-BSS algorithm, as an alternative to the aforementioned feature-based AEM algorithm. We have also developed a new time-frequency masking technique using the AVDL, in which two parallel mask generation processes are combined to derive an AV mask, which is then used to separate the source from the mixtures. The audio mask can be obtained using conventional BSS techniques based on ICA, or time-frequency techniques based on various cues (spatial, statistical, temporal, or spectral) evaluated using the EM algorithm. The visual mask is generated by comparing the audio sequence reconstructed by the AVDL algorithm with the observed (recorded) AV sequence; it conveys the reliability of, and confidence in, the likelihood suggested by the audio mask that each time-frequency unit is occupied by a specific source. The visual mask is used to re-weight the audio mask, resulting in an AV mask that is effective in suppressing the adverse effects of noise and room reverberation on the separation results. We have extensively evaluated our AVDL-based AV-BSS algorithm on real speech and video data, using performance metrics such as signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), and perceptual evaluation of audio source separation (PEASS). We have observed considerably improved separation performance compared with state-of-the-art baselines, including both audio-only and audio-visual BSS methods.
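The mask re-weighting step described above can be sketched as follows. The description does not state the exact combination rule, so the element-wise product used here is an assumption made purely for illustration, as are the function names and the toy mask values; masks are taken to be lists of rows (one row per time frame, one column per frequency bin) with values in [0, 1].

```python
def reweight_mask(audio_mask, visual_mask):
    """Combine two time-frequency masks element by element.

    The product rule is a hypothetical choice for this sketch: a unit keeps
    a high weight only if both the audio mask and the visual mask support it.
    """
    return [
        [a * v for a, v in zip(row_a, row_v)]
        for row_a, row_v in zip(audio_mask, visual_mask)
    ]


def apply_mask(mask, mixture_magnitude):
    """Scale each time-frequency unit of a magnitude spectrogram by the mask."""
    return [
        [m * x for m, x in zip(row_m, row_x)]
        for row_m, row_x in zip(mask, mixture_magnitude)
    ]


if __name__ == "__main__":
    # Toy example: 2 frames x 2 frequency bins.
    audio_mask = [[0.9, 0.2], [0.6, 0.8]]
    visual_mask = [[1.0, 1.0], [0.5, 0.0]]  # low values: little visual support
    mixture = [[2.0, 2.0], [2.0, 2.0]]

    av_mask = reweight_mask(audio_mask, visual_mask)
    print(av_mask)
    print(apply_mask(av_mask, mixture))
```

With a product rule, units that the audio mask assigns to the target but that the visual mask does not support are attenuated, which is one simple way of expressing the idea that the visual mask carries confidence information about the audio mask.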
Exploitation Route	The research results of this project could be used by several UK (and/or international) industry sectors, such as defence (target detection and tracking), security (automated crime detection and security surveillance), healthcare (assisted living), and creative (human-computer interaction) industries, where techniques for multi-modal data fusion, multi-channel signal separation and deconvolution, and enhancement of corrupted sensor signals are commonly required. This research has the potential to be commercialised by these sectors if further development can be supported by, e.g., a Knowledge Transfer Partnership (KTP), Knowledge Transfer Accounts (KTA), and/or the Centre for Defence Enterprise.
Sectors	Digital/Communication/Information Technologies (including Software), Healthcare, Security and Diplomacy
URL	http://www.see.ed.ac.uk/drupal/sites/default/files/CDE%20UDRC%20Poster%20(O11).pdf


Description	The project has attracted follow-up funding from Samsung Electronics and the EPSRC Impact Acceleration Account to further develop the proposed algorithm (implemented in Matlab) into demonstration software (in real-time C) that could potentially be deployed on smart phones.
First Year Of Impact	2013
Sector	Creative Economy, Digital/Communication/Information Technologies (including Software)
Impact Types	Economic


Description	Enhancing speech quality using lip tracking
Amount	£58,000 (GBP)
Organisation	Samsung
Sector	Private
Country	Korea, Republic of
Start	09/2013
End	03/2014


Description	Collaboration with Imperial College London
Organisation	Imperial College London
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have established a collaboration with Dr Wei Dai at Imperial College London to investigate sparsity-based techniques for blind source separation, thanks to the regular meetings and interactive events organised by the MoD University Defence Research Centre in Signal Processing.
Start Year	2011


Description	Joint development of audio-visual speech enhancement demonstration software for smart phones
Organisation	Samsung
Country	Korea, Republic of
Sector	Private
PI Contribution	We contributed extensive tests of the audio-visual speech enhancement on real-life audio-visual recordings made with smart phones.
Collaborator Contribution	Converted the Matlab code of the lip tracking algorithms into C code and tested it on mobile phones.
Impact	A software toolkit for lip tracking written in C. Data collected with a Samsung S4 smart phone.
Start Year	2013


Title	Software packages
Description	We have developed software packages for implementing the proposed multimodal blind source separation systems described in our publications.
Type Of Technology	Software


Description	Poster presentation at the BBC Audio Research Partnership Launch Meeting
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Primary Audience
Results and Impact	We presented the poster "Audio and Audio-Visual Source Separation for Machine Listening" at the BBC Audio Research Partnership Launch Meeting at MediaCityUK, Manchester. The poster contains some results from this project.
Year(s) Of Engagement Activity	2011


Description	Seminar presented at Beihang University
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience
Results and Impact	Some of the results of this project were presented in a seminar at Beihang University, Beijing, China.
Year(s) Of Engagement Activity	2011
