Audio and Video Based Speech Separation for Multiple Moving Sources Within a Room Environment

Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP

Abstract

Human beings have developed a unique ability to communicate within a noisy environment, such as at a cocktail party. This skill is dependent upon the use of both the aural and visual senses together with sophisticated processing within the brain. To mimic this ability within a machine is very challenging, particularly if the humans are moving, such as in a teleconferencing context, when human speakers are walking around a room. In the field of signal processing, researchers have developed techniques to separate one speech signal from a mixture of such signals, as would be measured by a number of microphones, on the basis of only audio information, with the assumption that the humans are static and that typically no more than two humans are within the room. Such approaches have generally been found to fail, however, when the human speakers are moving and when there are more than two in number. Fundamentally new approaches are therefore necessary to advance the state-of-the-art in the field. Professor Chambers and his team at Loughborough University were the first in the UK to propose a new approach on the basis of combined audio and video processing to solve the source separation problem, but their preliminary approach identified major challenges in audio-visual speaker localization, tracking and separation which must be solved to provide a practical solution for speech separation for multiple moving sources within a room environment. These findings motivate this new project in which world-leading teams at the University of Surrey, led by Professor Kittler, and at the GIPSA Lab, Grenoble, France, headed by Professor Jutten, are ready to work with Professor Chambers and his team at Loughborough University to advance the state-of-the-art in the field.

In this new project, two postdoctoral researchers will be employed, one at Loughborough and another at Surrey. The first will focus on the development of fundamentally new speech source separation algorithms for moving speakers by using geometrical room acoustic information (for example, the location and number of sources and descriptions of their movement) provided by the second researcher. The research team at Grenoble will provide technical guidance on the basis of their considerable experience in source separation throughout the project and will work on providing an acoustic noise model for the room environment which will also aid the speech separation process. To achieve these tasks, frequency domain based beamforming algorithms will be developed which exploit microphone arrays having more microphones than speakers so that new data independent superdirective robust beamformer design methods can be exploited using mathematical convex optimization. Additionally, further geometric information will be exploited to introduce robustness to errors in the localization information describing the desired source and the interference. To improve the localization information, an array of collaborative cameras will be used, and both audio and visual information will be exploited. Advanced methods from particle filtering and probabilistic data association will be exploited for improving the tracking performance. Finally, visual voice activity detection will be used to determine the active sources within the beamforming operations. We emphasize that this work is not implementation-driven, so computational complexity for real-time realization will not be a focus; this would be the subject of a future project.

All of the new algorithms will be evaluated both in terms of objective and subjective performance measures on labelled audio and visual datasets acquired at Loughborough and Surrey, and from the CHIL seminar room at the Karlsruhe University (UKA), Germany. To ensure this pioneering work has maximum impact on the UK and international academic and research communities, all the algorithms and datasets will be made available through the project website.

Publications

 
Description The major original objectives of this joint project between the Advanced Signal Processing group at Loughborough University and the Centre for Vision, Speech and Signal Processing at Surrey University were to



• design novel audio and video based advanced signal processing algorithms for speech separation of multiple active moving speakers by exploiting additional geometrical room acoustic information within the framework of convex-optimization-based beamformer design.

• propose multi-model solutions for human detection, localization, and tracking, to include situations where the sources exhibit complex motions including occlusions and interactions, within a room environment.



We have successfully progressed these topics and our key findings have been:



1. A novel multimodal source separation approach was proposed for physically moving and stationary sources which exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. The challenge of separating moving sources, including higher reverberation time (RT) even for physically stationary sources, is that the mixing filters are time varying; as such, the unmixing filters should also be time varying, but these are difficult to determine from only audio measurements. Therefore, in the proposed approach, the visual modality was used to facilitate the separation for both stationary and moving sources. The movement of the sources was detected by a three-dimensional tracker based on a Markov Chain Monte Carlo particle filter. The audio separation was performed by a robust least squares frequency invariant data-independent beamformer. The uncertainties in source localisation and direction of arrival information obtained from the 3D video-based tracker were controlled by using a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources were further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results showed that, by using the visual modality, the proposed algorithm can not only achieve performance better than conventional frequency-domain source separation algorithms, but also provide acceptable separation performance for moving sources.
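As a rough illustration of data-independent beamforming with a circular array, the sketch below steers a plain delay-and-sum beamformer towards a direction of arrival such as one supplied by a video tracker. This is a deliberately simpler technique than the robust least squares frequency invariant design described above, and it involves no convex optimisation. The array size, radius, frequency, speed of sound and all function names are assumptions made for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed room-temperature value


def circular_array_positions(num_mics, radius):
    """Return (x, y) positions of microphones spaced evenly on a circle."""
    return [
        (radius * math.cos(2 * math.pi * m / num_mics),
         radius * math.sin(2 * math.pi * m / num_mics))
        for m in range(num_mics)
    ]


def steering_vector(positions, azimuth, freq_hz):
    """Far-field steering vector for a plane wave arriving from `azimuth` (radians)."""
    wavenumber = 2 * math.pi * freq_hz / SPEED_OF_SOUND
    ux, uy = math.cos(azimuth), math.sin(azimuth)
    return [cmath.exp(1j * wavenumber * (x * ux + y * uy)) for x, y in positions]


def delay_and_sum_weights(positions, look_azimuth, freq_hz):
    """Weights that phase-align the look direction and average the channels."""
    d = steering_vector(positions, look_azimuth, freq_hz)
    return [value / len(d) for value in d]


def beam_response(weights, positions, azimuth, freq_hz):
    """Magnitude of the beamformer output for a unit plane wave from `azimuth`."""
    d = steering_vector(positions, azimuth, freq_hz)
    return abs(sum(w.conjugate() * value for w, value in zip(weights, d)))


if __name__ == "__main__":
    mics = circular_array_positions(num_mics=8, radius=0.1)
    weights = delay_and_sum_weights(mics, look_azimuth=0.0, freq_hz=2000.0)
    for degrees in (0, 45, 90, 180):
        gain = beam_response(weights, mics, math.radians(degrees), 2000.0)
        print(f"azimuth {degrees:3d} deg: gain {gain:.3f}")
```

The look direction passes with unit gain and other directions are attenuated. A robust design would additionally constrain the response over a range of angles around the look direction to tolerate tracker errors.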



2. A video-aided model-based source separation algorithm was designed for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localized individual speech sources in the enclosure and then estimated their directions. The interaural spatial cues (the interaural phase difference and the interaural level difference) and the mixing vectors were probabilistically modelled. The models made use of the source direction information and were evaluated at discrete time-frequency points. The model parameters were refined with the well-known expectation-maximization (EM) algorithm. The algorithm generated time-frequency masks that were used to reconstruct the individual sources. Simulation results showed that, by utilizing the visual modality, the proposed algorithm could produce better time-frequency masks, thereby giving improved source estimates. We tested the proposed algorithm experimentally in different scenarios, compared it with other audio-only and audio-visual algorithms, and achieved improved performance on both synthetic and real data. We also included dereverberation based pre-processing in our algorithm in order to suppress the late reverberant components from the observed stereo mixture and further enhance the overall output of the algorithm. This advantage makes our algorithm a suitable candidate for use in under-determined highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
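The interaural cues mentioned above can be illustrated at a single time-frequency point. The sketch below computes the level and phase differences from the two channels' complex spectral coefficients, then assigns the point to whichever candidate source predicts the closest phase difference. That hard assignment is a simplification standing in for the probabilistic model and EM refinement described above. The function names and the candidate interaural delays (imagined here as coming from video-based localisation) are assumptions for the example.

```python
import cmath
import math


def interaural_cues(left, right):
    """Return (ILD in dB, IPD in radians) for one time-frequency point.

    `left` and `right` are the complex spectral coefficients of the two
    channels at the same time-frequency point; both are assumed non-zero.
    """
    ratio = left / right
    ild_db = 20 * math.log10(abs(ratio))
    ipd = cmath.phase(ratio)
    return ild_db, ipd


def wrap_phase(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return cmath.phase(cmath.exp(1j * angle))


def nearest_source(ipd_observed, freq_hz, candidate_delays):
    """Index of the candidate delay whose predicted IPD is closest to the observation.

    Each candidate delay (seconds) predicts an IPD of 2*pi*f*delay, wrapped.
    """
    distances = [
        abs(wrap_phase(ipd_observed - 2 * math.pi * freq_hz * delay))
        for delay in candidate_delays
    ]
    return distances.index(min(distances))


if __name__ == "__main__":
    ild, ipd = interaural_cues(2 * cmath.exp(0.5j), 1 + 0j)
    print(f"ILD {ild:.2f} dB, IPD {ipd:.2f} rad")
    # Two hypothetical sources on opposite sides of the listener.
    print("assigned source:", nearest_source(0.9, 500.0, [0.0003, -0.0003]))
```

Repeating the assignment over every time-frequency point gives one binary mask per source. The probabilistic approach replaces this hard decision with soft posterior weights.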



3. Novel solutions to the following challenges in visual tracking of multiple human speakers in an office environment were proposed: (1) robust and computationally efficient modelling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialisation of the trackers, or re-initialisation when the trackers have lost lock, for example because of limited camera views. First, we developed new algorithms for appearance modelling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries were used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function was then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) was proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model was proposed to track multiple speakers whilst dealing with occlusions. This model was updated online using Maximum a Posteriori (MAP) adaptation, where we controlled the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialisation of the visual trackers, we exploited audio information, the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provided, a priori, the number of speakers and constrained the search space for the speakers' faces.
The proposed system was tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects.
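The role of the likelihood function in the measurement step of a particle filter can be sketched in one dimension. In the example below a Gaussian likelihood around a noisy position measurement stands in for the SVM-based appearance likelihood described above, and the random-walk motion model, noise levels and function names are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_likelihood(particle, observation, sigma):
    """Placeholder likelihood; the real system would score image patches instead."""
    return math.exp(-0.5 * ((particle - observation) / sigma) ** 2)


def particle_filter_step(particles, observation, rng,
                         motion_sigma=0.2, obs_sigma=0.5):
    """One predict / weight / resample cycle of a bootstrap particle filter."""
    # Prediction: propagate each particle with a random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_sigma) for p in particles]

    # Measurement step: weight each particle by the likelihood of the observation.
    weights = [gaussian_likelihood(p, observation, obs_sigma) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # All particles are implausible; fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # State estimate: weighted mean of the predicted particles.
    estimate = sum(w * p for w, p in zip(weights, predicted))

    # Resampling: draw a new particle set in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def track(observations, num_particles=500, seed=0):
    """Run the filter over a sequence of observations and return the estimates."""
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, 10.0) for _ in range(num_particles)]
    estimates = []
    for observation in observations:
        particles, estimate = particle_filter_step(particles, observation, rng)
        estimates.append(estimate)
    return estimates


if __name__ == "__main__":
    for step, estimate in enumerate(track([5.0] * 10), start=1):
        print(f"step {step:2d}: estimated position {estimate:.2f}")
```

Swapping `gaussian_likelihood` for a classifier score computed on the image patch under each particle gives the structure described above. Audio-derived DOA information could likewise narrow the range from which the initial particles are drawn.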



4. A novel method was developed for determining coarse head pose orientation purely from audio information, exploiting the direct-to-reverberant speech energy ratio (DRR) within a reverberant room environment. Our hypothesis was that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This method has the advantage of actually exploiting the reverberations within a room rather than trying to suppress them. This also has the practical advantage that most everyday enclosed spaces, such as meeting rooms or offices, are highly reverberant environments. In order to test this hypothesis, we also collected a new data set featuring 39 subjects adopting 8 different head poses in 4 different room positions, captured with a 16-element microphone array. We believe that this data set is unique and will make a significant contribution to further work in the area of audio head pose estimation.
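To make the DRR hypothesis concrete, the sketch below computes the ratio from a room impulse response by splitting the energy around the strongest peak (taken as the direct path) from the energy that follows. It assumes the impulse response is known, which is a simplification: how the project estimated DRR from speech recordings is not described here. The window length, the toy impulse responses and the function name are assumptions for the example.

```python
import math


def direct_to_reverberant_ratio_db(impulse_response, sample_rate,
                                   direct_window_ms=2.5):
    """DRR in dB: energy near the direct-path peak over the energy after it."""
    # Take the largest-magnitude sample as the direct-path arrival.
    peak = max(range(len(impulse_response)),
               key=lambda n: abs(impulse_response[n]))
    half_window = int(round(direct_window_ms * 1e-3 * sample_rate))
    start = max(0, peak - half_window)
    end = min(len(impulse_response), peak + half_window + 1)

    direct_energy = sum(x * x for x in impulse_response[start:end])
    reverberant_energy = sum(x * x for x in impulse_response[end:])
    if reverberant_energy == 0.0:
        return math.inf  # no reverberant tail at all
    return 10 * math.log10(direct_energy / reverberant_energy)


if __name__ == "__main__":
    # Toy impulse responses at a 1 kHz sample rate: same tail, different direct path.
    facing_towards = [0.0, 1.0, 0.0, 0.5, 0.5]
    facing_away = [0.0, 0.6, 0.0, 0.5, 0.5]
    for label, response in (("towards", facing_towards), ("away", facing_away)):
        drr = direct_to_reverberant_ratio_db(response, 1000, direct_window_ms=1.0)
        print(f"facing {label}: DRR {drr:.2f} dB")
```

In the toy data the weaker direct path yields the lower DRR, which is the ordering the hypothesis predicts for a speaker who turns away from the microphone.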



9th December 2013
Exploitation Route The research results feed into a follow-up EPSRC/Dstl project. Dstl and the industrial partners of this Programme Grant will provide a route to exploitation of the research results. The industrial partners include Selex, Thales, Texas Instruments and MathWorks.



Apart from the above, a potential use of the signal processing solutions is in next generation hearing aids. This would have both commercial and societal impact. Companies interested in this technology have been contacted to gauge their interest.

-The research results will be used by the investigators in new research.



-The research results will be used by the academic community to build on with a view to developing new solutions for the blind source separation problem.



Sectors Electronics

 
Description EPSRC Programme Grant
Amount £6,104,265 (GBP)
Funding ID EP/N007743/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 01/2016 
End 12/2020
 
Description S3A: future spatial audio for an immersive listener experience at home
Amount £5,800,000 (GBP)
Funding ID EP/L000539/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 12/2013 
End 11/2017
 
Description Signal processing for the networked battlespace
Amount £3,800,000 (GBP)
Funding ID EP/K014307/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 04/2013 
End 03/2018
 
Title Audio-Visual Data Set 
Description A new dataset was recorded in the audio media engineering lab for evaluating the head pose estimation algorithm. More than 39 subjects took part in the data collection. To our knowledge, this is the first dataset of real audio recordings for head pose recognition tasks. The dataset may be accessed at http://www.cvssp.org/avbss/
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact The published data was used in experiments conducted by peer groups. 
URL http://www.cvssp.org/avbss/
 
Description Loughborough University
Organisation Loughborough University
Country United Kingdom 
Sector Academic/University 
PI Contribution Enhanced collaborations with Loughborough University which have led to the success of the bid for the EPSRC/Dstl funded project "Signal processing solutions for a networked battlespace".
Start Year 2009
 
Description MILES 
Organisation University of Surrey
Country United Kingdom 
Sector Academic/University 
PI Contribution Internal inter-department collaboration was initiated with the Department of Computing and the School of Psychology, and a small feasibility study fund was awarded by the MILES (Models and Mathematics in Life and Social Sciences) project (12/2012-12/2013).
Start Year 2011
 
Description Spatial Audio 
Organisation University of Salford
Country United Kingdom 
Sector Academic/University 
PI Contribution External collaborations with the University of Southampton and the University of Salford were established during the project period, which have contributed to the design of work package 3 of the newly funded EPSRC project "S3A: future spatial audio for an immersive listener experience at home", where robust algorithms for audio-visual audio object separation, localisation and tracking will be studied.
Start Year 2012
 
Description Spatial Audio 
Organisation University of Southampton
Country United Kingdom 
Sector Academic/University 
PI Contribution External collaborations with the University of Southampton and the University of Salford were established during the project period, which have contributed to the design of work package 3 of the newly funded EPSRC project "S3A: future spatial audio for an immersive listener experience at home", where robust algorithms for audio-visual audio object separation, localisation and tracking will be studied.
Start Year 2012