Audio and Video Based Speech Separation for Multiple Moving Sources Within a Room Environment
Lead Research Organisation:
University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP
Abstract
Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.
Publications
Gu F
(2013)
Generalized generating function with tucker decomposition and alternating least squares for underdetermined blind identification
in EURASIP Journal on Advances in Signal Processing
Huber P
(2015)
Fitting 3D Morphable Face Models using local features
Josef Kittler (Author)
(2013)
Audio-visual face detection for tracking in a meeting room environment
Khan M
(2013)
Video-Aided Model-Based Source Separation in Real Reverberant Rooms
in IEEE Transactions on Audio, Speech, and Language Processing
Kilic V
(2015)
Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering
in IEEE Transactions on Multimedia
Kilic V
(2016)
Mean-Shift and Sparse Sampling-Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking
in IEEE Transactions on Multimedia
Kilic V
(2013)
Audio constrained particle filter based visual tracking
Liu Q
(2013)
Source Separation of Convolutive and Noisy Mixtures Using Audio-Visual Dictionary Learning and Probabilistic Time-Frequency Masking
in IEEE Transactions on Signal Processing
Mohsen Naqvi S
(2012)
Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking
in IET Signal Processing
Description | The major original objectives of this joint project between the Advanced Signal Processing group at Loughborough University and the Centre for Vision, Speech and Signal Processing at the University of Surrey were to:
• design novel audio and video based advanced signal processing algorithms for speech separation of multiple active moving speakers by exploiting additional geometrical room acoustic information within the framework of convex-optimization-based beamformer design;
• propose multi-modal solutions for human detection, localization, and tracking, to include situations where the sources exhibit complex motions including occlusions and interactions, within a room environment.
We have successfully progressed these topics and our key findings have been:
1. A novel multimodal source separation approach was proposed for physically moving and stationary sources which exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. The challenge in separating moving sources, and even physically stationary sources when the reverberation time (RT) is higher, is that the mixing filters are time varying; the unmixing filters should therefore also be time varying, but these are difficult to determine from audio measurements alone. In the proposed approach, the visual modality was therefore used to facilitate the separation for both stationary and moving sources. The movement of the sources was detected by a three-dimensional tracker based on a Markov Chain Monte Carlo particle filter. The audio separation was performed by a robust least squares frequency invariant data-independent beamformer. The uncertainties in source localisation and direction of arrival information obtained from the 3D video-based tracker were controlled by using a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources were further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results showed that by using the visual modality, the proposed algorithm can not only achieve performance better than conventional frequency-domain source separation algorithms, but also provide acceptable separation performance for moving sources.
2. A video-aided model-based source separation algorithm was designed for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localized individual speech sources in the enclosure and then estimated their directions. The interaural spatial cues (the interaural phase difference and the interaural level difference) as well as the mixing vectors were probabilistically modelled. The models made use of the source direction information and were evaluated at discrete time-frequency points. The model parameters were refined with the well-known expectation-maximization (EM) algorithm. The algorithm generated time-frequency masks that were used to reconstruct the individual sources. Simulation results showed that by utilizing the visual modality the proposed algorithm could produce better time-frequency masks, thereby giving improved source estimates. We tested the proposed algorithm experimentally in different scenarios, compared it with other audio-only and audio-visual algorithms, and achieved improved performance on both synthetic and real data. We also included dereverberation based pre-processing in our algorithm in order to suppress the late reverberant components from the observed stereo mixture and further enhance the overall output of the algorithm. This makes our algorithm a suitable candidate for use in under-determined highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
3. Novel solutions to the following challenges in visual tracking of multiple human speakers in an office environment were proposed: (1) robust and computationally efficient modelling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialisation of the trackers, or re-initialisation when the trackers have lost lock, caused by e.g. the limited camera views. First, we developed new algorithms for appearance modelling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries were used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function was then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) was proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model was proposed to track multiple speakers whilst dealing with occlusions. This model was updated online using Maximum a Posteriori (MAP) adaptation, where we controlled the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialisation of the visual trackers, we exploited audio information, the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provided, a priori, the number of speakers and constrained the search space for the speakers' faces. The proposed system was tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects.
4. A novel method was developed for determining coarse head pose orientation purely from audio information, exploiting the direct to reverberant speech energy ratio (DRR) within a reverberant room environment. Our hypothesis was that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This method has the advantage of actually exploiting the reverberations within a room rather than trying to suppress them. It also has a practical advantage, since most everyday enclosed spaces, such as meeting rooms or offices, are highly reverberant environments. In order to test this hypothesis we also collected a new data set featuring 39 subjects adopting 8 different head poses in 4 different room positions, captured with a 16 element microphone array. We believe that this data set is unique and will make a significant contribution to further work in the area of audio head pose estimation.
9th December 2013 |
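The DRR hypothesis in finding 4 can be illustrated with a minimal Python sketch. It is not the project's estimator, which worked from recorded speech; here the DRR is computed directly from a known room impulse response, using the common definition of direct-path energy divided by the energy of everything that arrives later. The function names, the 2.5 ms direct-path window, and the synthetic impulse response (a single direct spike plus an exponentially decaying noise tail) are all assumptions made for illustration. A speaker facing away from the microphone is modelled simply as a weaker direct path with an unchanged reverberant tail.

```python
import numpy as np


def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB from a room impulse response.

    The direct path is taken as a short window around the largest peak;
    everything after that window counts as reverberant energy.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(direct_window_ms * 1e-3 * fs))
    start = max(peak - half, 0)
    stop = min(peak + half + 1, rir.size)
    direct_energy = np.sum(rir[start:stop] ** 2)
    reverb_energy = np.sum(rir[stop:] ** 2)
    eps = np.finfo(float).tiny  # avoids division by zero for an anechoic response
    return 10.0 * np.log10(direct_energy / (reverb_energy + eps))


def synthetic_rir(direct_gain, fs=16000, length_s=0.4, rt60_s=0.5, seed=0):
    """Toy impulse response (hypothetical): direct spike plus decaying noise tail."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    # Amplitude envelope that falls by 60 dB after rt60_s seconds.
    envelope = np.exp(-np.log(1000.0) * t / rt60_s)
    rir = 0.02 * rng.standard_normal(n) * envelope
    rir[: int(0.005 * fs)] = 0.0  # no reflections in the first 5 ms
    rir[0] = direct_gain          # direct path arrives at sample 0
    return rir


fs = 16000
# Assumption: turning away from the microphone weakens only the direct path.
drr_facing = drr_db(synthetic_rir(direct_gain=1.0, fs=fs), fs)
drr_away = drr_db(synthetic_rir(direct_gain=0.3, fs=fs), fs)
print(f"DRR facing microphone: {drr_facing:.2f} dB")
print(f"DRR facing away:       {drr_away:.2f} dB")
```

Because both toy responses share the same reverberant tail, the gap between the two DRR values depends only on the direct-path gains. In a real room the reverberant field also changes with head orientation, so this sketch shows the direction of the effect rather than its size.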
Exploitation Route | The research results feed into a follow-up EPSRC/Dstl project. Dstl and the industrial partners of this Programme Grant will provide a route to exploitation of the research results. The industrial partners include Selex, Thales, Texas Instruments and MathWorks. Apart from the above, a potential use of the signal processing solutions is in next generation hearing aids. This would have both commercial and societal impact. Companies interested in this technology have been contacted to gauge their interest. The research results will also be used by the investigators in new research, and by the academic community as a basis for developing new solutions to the blind source separation problem. |
Sectors | Electronics |
Description | EPSRC Programme Grant |
Amount | £6,104,265 (GBP) |
Funding ID | EP/N007743/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 01/2016 |
End | 12/2020 |
Description | S3A: future spatial audio for an immersive listener experience at home |
Amount | £5,800,000 (GBP) |
Funding ID | EP/L000539/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 12/2013 |
End | 11/2017 |
Description | Signal processing for the networked battlespace |
Amount | £3,800,000 (GBP) |
Funding ID | EP/K014307/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2013 |
End | 03/2018 |
Title | Audio-Visual Data Set |
Description | A new dataset was recorded in the audio media engineering lab for evaluating the head pose estimation algorithm. More than 39 subjects took part in the data collection. To our knowledge, this is the first dataset of real audio recordings for head pose recognition tasks. The dataset may be accessed at http://www.cvssp.org/avbss/ |
Type Of Material | Database/Collection of data |
Year Produced | 2013 |
Provided To Others? | Yes |
Impact | The published data was used in experiments conducted by peer groups. |
URL | http://www.cvssp.org/avbss/ |
Description | Loughborough University |
Organisation | Loughborough University |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Enhanced collaborations with Loughborough University which have led to the success of the bid for the EPSRC/Dstl funded project "Signal processing solutions for a networked battlespace". |
Start Year | 2009 |
Description | MILES |
Organisation | University of Surrey |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Internal inter-department collaboration was initiated with Department of Computing and School of Psychology, and a small feasibility study fund was awarded by the MILES (Models and Mathematics in Life and Social Sciences) project (12/2012-12/2013). |
Start Year | 2011 |
Description | Spatial Audio |
Organisation | University of Salford |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | External collaborations with University of Southampton and University of Salford were established during the project period, which have contributed to the design of the work package 3 of the newly funded EPSRC project "S3A: future spatial audio for an immersive listener experience at home", where robust algorithms for audio-visual audio object separation, localisation and tracking will be studied. |
Start Year | 2012 |
Description | Spatial Audio |
Organisation | University of Southampton |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | External collaborations with University of Southampton and University of Salford were established during the project period, which have contributed to the design of the work package 3 of the newly funded EPSRC project "S3A: future spatial audio for an immersive listener experience at home", where robust algorithms for audio-visual audio object separation, localisation and tracking will be studied. |
Start Year | 2012 |