Multi-Modal Blind Source Separation for Robot Audition

Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP


This proposal draws on expertise in blind source separation and multimodal (audio-visual) speech processing within the Centre for Vision Speech and Signal Processing at the University of Surrey. The objective is to separate target speech from multiple competing sound sources in room environments, and thereby to make progress towards automatic machine perception of auditory scenes in uncontrolled natural environments. The fundamental novelty of this work is to exploit visual cues to enhance the operation of frequency-domain blind source separation algorithms. Such audio-visual processing is targeted at mitigating the permutation problem, the underdetermined problem (i.e. when the number of sources is greater than the number of microphones), and the reverberation problem, which currently limit the practical applicability of blind source separation algorithms. The focus of the work is therefore on the signal processing algorithms and software tools that can be used to perform automatic separation of sound signals, e.g. for a robot. The body of work in this proposal is underpinned by the substantial experience of the investigators, two from the areas of blind source separation and digital speech processing, and one from the area of computer vision and pattern recognition. The outcomes of the proposed research will be of considerable value to the UK defence industry, especially in the areas of target separation, detection and multi-path mitigation (or dereverberation), with applications in, for example, human-robot interaction, security surveillance and human-computer interaction.
Description: In this research, we have used both the audio and visual modalities to separate target speech from acoustic mixtures acquired in room environments, which contain multiple competing speech interferers and other sound sources. The outcomes of this research offer new insights into machine perception of auditory scenes in uncontrolled natural environments, in particular into cross-modal fusion and interaction within the blind source separation (BSS) framework. The key findings of this research include:

1. Visual information is helpful for mitigating the scale and permutation ambiguities associated with traditional audio BSS algorithms

Source separation of convolutive speech mixtures is often performed in the time-frequency domain using, e.g., the short-time Fourier transform (STFT). The convolutive BSS problem is thereby converted into multiple instantaneous BSS problems, one per frequency channel, each of which is solved using e.g. an independent component analysis (ICA) algorithm. However, due to the inherent indeterminacies of the classical ICA model, the order and amplitudes of the source components estimated in different frequency channels may not be consistent, leading to the well-known problems of frequency-domain BSS, namely the permutation and scale ambiguities.
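
The two ambiguities can be written compactly. The following is a sketch in notation that is assumed here rather than taken from the project: f is the frequency bin, t the time frame, s the vector of source spectra, x the microphone spectra, H(f) the mixing matrix and W(f) the unmixing matrix estimated by ICA in bin f.

```latex
\mathbf{x}(f,t) = \mathbf{H}(f)\,\mathbf{s}(f,t), \qquad
\mathbf{y}(f,t) = \mathbf{W}(f)\,\mathbf{x}(f,t) = \mathbf{P}(f)\,\mathbf{D}(f)\,\mathbf{s}(f,t)
```

Here P(f) is a permutation matrix and D(f) a diagonal matrix of complex scale factors. Even when separation in each bin is ideal, ICA leaves both undetermined, and they generally differ from bin to bin. A source can only be reconstructed across frequency once P(f) has been made the same for every f and the scales in D(f) have been normalised.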

We found that the visual information from concurrent video signals can be used as an additional cue for correcting the permutation and scale ambiguities of audio source separation algorithms. To use the visual information, we have developed a two-stage method comprising off-line training and online separation. In the off-line training stage, we statistically characterise the audio-visual (AV) coherence by mapping the AV data into a feature space: Mel-frequency cepstral coefficients (MFCCs) serve as audio features, the lip width and height serve as visual features, and the two are combined to form an audio-visual feature space. We then model the features using e.g. Gaussian mixture models (GMMs) and estimate their parameters using an adapted expectation maximisation (AEM) algorithm. In the online separation stage, we have developed a novel iterative sorting scheme based on coherence maximisation and majority voting, in order to correct the permutation ambiguities of the frequency components. To address the scaling ambiguity, we have used a group of scaling parameters, calculated in each Mel-frequency band using the bi-modal coherence and interpolated across adjacent frequency bands, which are then applied directly to the ICA-separated spectral components in each frequency bin. We have also adopted a robust feature selection scheme to improve the performance of the proposed AV-BSS system for data corrupted by outliers, such as background noise and room reverberation.
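
The idea of sorting frequency components by coherence and majority voting can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the project's iterative scheme: the coherence scores are taken as given (in practice they would come from the trained AV model), each time block casts one vote per frequency bin, and the function names `target_index` and `align_bins` are hypothetical.

```python
from collections import Counter

def target_index(scores):
    """Pick the component most coherent with the target's video in one bin.

    scores[t][k] is an assumed AV coherence score for component k in time
    block t. Each block votes for its highest-scoring component and the
    majority wins.
    """
    votes = Counter(max(range(len(row)), key=row.__getitem__) for row in scores)
    return votes.most_common(1)[0][0]

def align_bins(components, coherence):
    """Reorder the components of every bin so the target comes first."""
    aligned = []
    for comps, scores in zip(components, coherence):
        k = target_index(scores)
        aligned.append([comps[k]] + [c for i, c in enumerate(comps) if i != k])
    return aligned

# Two bins, two components each; ICA returned them in a different order per bin.
components = [["target@bin0", "interferer@bin0"],
              ["interferer@bin1", "target@bin1"]]
# Hypothetical coherence scores: three time blocks per bin, one score per component.
coherence = [
    [[0.9, 0.1], [0.8, 0.3], [0.4, 0.6]],
    [[0.2, 0.7], [0.1, 0.9], [0.6, 0.5]],
]
aligned = align_bins(components, coherence)
```

Voting over several blocks makes the decision for a bin tolerant of individual blocks in which the coherence estimate is unreliable, as in the third block of each bin above.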

2. Visual information is helpful for detecting voice activity and for separating sources from noisy mixtures

Voice activity, indicating whether a speaker is talking or silent, provides useful information about the number of speakers concurrently active in the auditory scene, and therefore indicates whether the BSS problem is determined, over-determined or underdetermined (i.e. the number of sources is greater than the number of sensors). Detecting the voice activity of the speakers is an important and very challenging problem in robot audition research. Most research in voice activity detection (VAD) is conducted in the audio domain, but the performance of audio-only VAD deteriorates severely in multi-source and noisy environments.

We found that visual information from the video signals that accompany the audio can be used to improve the audio-domain VAD performance considerably. We have proposed a new visual VAD approach which combines lip-reading with binary classification to determine the activity of speech utterances. More specifically, we have developed a new lip-reading method which is robust to head rotations and changes in illumination. In our proposed lip extraction algorithm, greedy active contour models (ACM) are used to drive the landmark points towards the lip contours, template matching is used to cope with head rotations, and shape energy constraints are applied to prevent points from bending abruptly (which often occurs if the image resolution is very low). Using the lip features obtained in lip-reading, we then form a binary VAD classifier based on the AdaBoost technique, which combines or boosts a set of 'weak' classifiers to obtain a 'strong' classifier with a lower error rate.
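
The boosting step can be sketched with the textbook discrete AdaBoost algorithm and threshold 'stumps' as weak classifiers. This is an illustration only: the two features (lip height and frame-to-frame change in lip height), the toy values and the function names are assumptions, and the project's actual features and weak classifiers may differ.

```python
import math

def stump_predict(x, j, thr, pol):
    """Weak classifier: threshold a single feature; returns +1 or -1."""
    return pol if x[j] >= thr else -pol

def train_stump(X, y, w):
    """Exhaustively find the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for wi, x, yi in zip(w, X, y)
                          if stump_predict(x, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=5):
    """Discrete AdaBoost; labels are +1 (speaking) and -1 (silent)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        model.append((alpha, j, thr, pol))
        # Increase the weights of misclassified frames, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, thr, pol))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Strong classifier: sign of the weighted vote of all stumps."""
    score = sum(a * stump_predict(x, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

# Toy frames: (lip height, change in lip height since the previous frame).
X = [(0.40, 0.01), (0.42, 0.02), (0.38, 0.00),
     (0.41, 0.09), (0.39, 0.12), (0.43, 0.15)]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost(X, y)
```

In this toy set the static lip height does not separate the classes but its frame-to-frame change does, so boosting settles on a stump over the second feature.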

We also found that the visual VAD can be used to further improve the performance of the aforementioned AV-BSS algorithm. This is achieved as follows. First, in the off-line training stage, we apply the AdaBoost training algorithm to the labelled visual features, which are extracted from the video signal associated with a target speaker. The trained AdaBoost model is then used as a visual VAD to detect the silent periods in the target speech, using the accompanying video. Finally, the signal in these periods is suppressed by the multi-band spectral subtraction algorithm, as a post-processing stage for the proposed AV-BSS algorithm.
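
The post-processing stage can be sketched as follows. This is a simplified illustration, not the project's implementation: it works on magnitudes rather than power spectra, takes the over-subtraction factor of each band as a fixed input, and uses the hypothetical function name `vad_spectral_subtraction`. The residual interference is estimated from the frames that the visual VAD marks as silent and is then subtracted band by band.

```python
def vad_spectral_subtraction(mag, vad, bands, alphas, floor=0.05):
    """Suppress residual interference using VAD-labelled silent frames.

    mag[t][f] : magnitude spectrogram of the separated target
    vad[t]    : True when the visual VAD says the target is speaking
    bands     : list of (start, stop) frequency-bin ranges
    alphas    : over-subtraction factor for each band (assumed given)
    floor     : spectral floor, as a fraction of the input magnitude
    """
    silent = [frame for frame, active in zip(mag, vad) if not active]
    if not silent:                       # nothing to estimate the residual from
        return [list(frame) for frame in mag]
    n_bins = len(mag[0])
    # Average magnitude in silent frames is the residual estimate per bin.
    residual = [sum(frame[f] for frame in silent) / len(silent)
                for f in range(n_bins)]
    out = []
    for frame in mag:
        cleaned = list(frame)
        for (start, stop), alpha in zip(bands, alphas):
            for f in range(start, stop):
                cleaned[f] = max(frame[f] - alpha * residual[f],
                                 floor * frame[f])
        out.append(cleaned)
    return out

# Three frames, four bins; only the first frame contains target speech.
mag = [[1.0, 2.0, 0.5, 0.4],
       [0.2, 0.1, 0.2, 0.1],
       [0.2, 0.3, 0.2, 0.3]]
vad = [True, False, False]
cleaned = vad_spectral_subtraction(mag, vad,
                                   bands=[(0, 2), (2, 4)],
                                   alphas=[1.0, 2.0])
```

The spectral floor keeps every bin slightly above zero, a common way of limiting the 'musical noise' that plain subtraction tends to produce.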

3. Dictionary learning based sparse coding provides an alternative way for audio-visual coherence modelling, offering improved BSS performance (over the feature-based technique) for separating reverberant and noisy speech mixtures acquired in real room environments

We found that both speech signals and lip movements are 'sparse' by nature or can be made sparse in a transform domain, where 'sparse' means that only a few values in the signals (or their transformed coefficients) are non-zero. Using sparse representations, we could potentially design more effective BSS algorithms for noisy, reverberant, and/or underdetermined speech mixtures, since under such a representation (i) the noise components or coefficients become less prominent compared with the signal components, and (ii) the likelihood that speech sources overlap with each other is reduced.

Under the sparse coding framework, we have proposed a novel audio-visual dictionary learning (AVDL) technique for modelling the AV coherence. This new method attempts to code the local temporal-spatial (TS) structures of an AV sequence, resembling the technique of locality-constrained linear coding. We address several challenges associated with AVDL, including, for example, cross-modality differences in size, dimension and sampling rate, as well as issues of scalability and computational complexity. Our proposed AVDL algorithm follows a commonly employed two-stage coding-learning process, but features new contributions in both the coding and learning stages including, for example, a bi-modality balanced and scalable matching criterion, a size and dimension adaptive dictionary, a fast search index for efficient coding, and a varying sparsity for different modalities. Each AV atom in our dictionary contains both an audio atom and a visual atom spanning the same temporal length. The audio atom is the magnitude spectrum of an audio segment, which is found to be more robust to convolutive noise than time-domain representations. The visual atom is composed of several consecutive frames of image patches, focusing on the movement of the whole mouth region. The AVDL algorithm has been applied in the off-line training stage of the AV-BSS algorithm, as an alternative to the aforementioned feature-based AEM algorithm.

We have also developed a new time-frequency masking technique using the AVDL, where two parallel mask generation processes are combined to derive an AV mask, which is then used to separate the source from the mixtures. The audio mask can be obtained by using conventional BSS techniques based on ICA, or time-frequency techniques based on various cues, such as spatial, statistical, temporal, or spectral cues, evaluated using the EM algorithm. The visual mask is generated by comparing the audio sequence reconstructed using the AVDL algorithm with the observed (recorded) AV sequence; it carries information about how reliable the audio mask's suggestion is that each time-frequency unit is occupied by a specific source. The visual mask is used to re-weight the audio mask, resulting in an AV mask that is effective in suppressing the adverse effect of noise and room reverberation on the separation results. We have extensively evaluated our AVDL-based AV-BSS algorithm on real speech and video data, using performance metrics such as signal to distortion ratio (SDR), signal to interference ratio (SIR), signal to noise ratio (SNR), perceptual evaluation of speech quality (PESQ), and perceptual evaluation of audio source separation (PEASS). We have observed considerably improved separation performance compared with state-of-the-art baselines, including both audio-only and audio-visual BSS methods.
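
The re-weighting of the audio mask can be illustrated with a toy example. The text does not state the exact rule, so the element-wise product used below is an assumption, as are the function names `combine_masks` and `apply_mask`; both masks are taken to hold values between 0 and 1 for each time-frequency unit.

```python
def combine_masks(audio_mask, visual_mask):
    """Re-weight the audio mask by the visual mask (element-wise product).

    Both masks are indexed [t][f]. The product rule is an assumption made
    for illustration only.
    """
    return [[a * v for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(audio_mask, visual_mask)]

def apply_mask(mask, mixture):
    """Scale each time-frequency unit of the (complex) mixture STFT."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]

# Two frames, two frequency bins.
audio_mask = [[1.0, 0.5], [0.25, 0.0]]   # from the audio-only separation stage
visual_mask = [[0.5, 1.0], [1.0, 0.5]]   # confidence derived from the video
mixture = [[2 + 2j, 4j], [8.0, 1.0]]     # mixture STFT coefficients

av_mask = combine_masks(audio_mask, visual_mask)
target = apply_mask(av_mask, mixture)
```

With a product rule, a unit keeps its full weight only when the audio and visual evidence agree; a low visual confidence attenuates units that the audio mask alone would have passed.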
Exploitation Route: The research results of this project could be used by several UK (and/or international) industry sectors, such as defence (target detection and tracking), security (automated crime detection and security surveillance), health-care (assisted living), and creative (human-computer interaction) industries, where techniques for multi-modal data fusion, multi-channel signal separation and deconvolution, and enhancement of corrupted sensor signals are commonly required. This research has the potential to be commercialised by these industry sectors, if further development activities can be supported by e.g. a Knowledge Transfer Partnership (KTP), Knowledge Transfer Accounts (KTA), and/or the Centre for Defence Enterprise.
Sectors: Digital/Communication/Information Technologies (including Software), Healthcare, Security and Diplomacy

Description: The project has attracted follow-up funding from Samsung Electronics and an EPSRC Impact Acceleration Account to further develop the proposed algorithm (implemented in Matlab) into demonstration software (in real-time C) that could potentially be deployed on smart phones.
First Year Of Impact: 2013
Sector: Creative Economy, Digital/Communication/Information Technologies (including Software)
Impact Types: Economic

Description: Enhancing speech quality using lip tracking
Amount: £58,000 (GBP)
Organisation: Samsung
Sector: Private
Country: Global
Start: 10/2013
End: 03/2014

Description: Impact Acceleration Account
Amount: £20,000 (GBP)
Funding ID: EP/H012842/1
Organisation: Engineering and Physical Sciences Research Council (EPSRC)
Sector: Academic/University
Country: United Kingdom
Start: 12/2014
End: 06/2015

Description: Collaboration with Imperial College London
Organisation: Imperial College London
Country: United Kingdom
Sector: Academic/University
PI Contribution: We have established a collaboration with Dr Wei Dai at Imperial College London to investigate sparsity-based techniques for blind source separation, thanks to the regular meetings and interactive events organised by the MoD University Defence Research Centre in Signal Processing.
Start Year: 2011

Description: Joint development of audio-visual speech enhancement demonstration software for smart phones
Organisation: Samsung
Country: Global
Sector: Private
PI Contribution: We contributed extensive tests of the audio-visual speech enhancement on real-life audio-visual recordings made with smart phones.
Collaborator Contribution: Converted the Matlab code of the lip tracking algorithms into C code and tested it on mobile phones.
Impact: A software toolkit for lip tracking written in C; data collected using a Samsung S4 smart phone.
Start Year: 2013

Title: Software packages
Description: We have developed software packages implementing the proposed multimodal blind source separation systems described in our publications.
Type Of Technology: Software

Description: Poster presentation at the BBC Audio Research Partnership Launch Meeting
Form Of Engagement Activity: A formal working group, expert panel or dialogue
Part Of Official Scheme?: No
Primary Audience:
Results and Impact: We presented the poster "Audio and Audio-Visual Source Separation for Machine Listening" at the BBC Audio Research Partnership Launch Meeting at MediaCityUK, Manchester. The poster contains some results from this project.
Year(s) Of Engagement Activity: 2011

Description: Seminar presented at Beihang University
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme?: No
Geographic Reach: International
Primary Audience:
Results and Impact: Some of the results of this project were presented in a seminar at Beihang University, Beijing, China.
Year(s) Of Engagement Activity: 2011