Audio and Video Based Speech Separation for Multiple Moving Sources Within a Room Environment

Lead Research Organisation: University of Surrey

Department Name: Vision Speech and Signal Proc CVSSP

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Funded Value:

£359,529

Funded Period:

Sep 10 - Sep 13

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/H050000/1

Principal Investigator:

Josef Kittler

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Digital Signal Processing (50%)

Image & Vision Computing (25%)

Music & Acoustic Technology (25%)

Organisations

People	ORCID iD
Josef Kittler (Principal Investigator)	http://orcid.org/0000-0002-8110-9205
Wenwu Wang (Co-Investigator)

Publications

Gu F (2013) Generalized generating function with tucker decomposition and alternating least squares for underdetermined blind identification in EURASIP Journal on Advances in Signal Processing

Huber P (2015) Fitting 3D Morphable Face Models using local features

Josef Kittler (Author) (2013) Audio-visual face detection for tracking in a meeting room environment

Khan M (2013) Video-Aided Model-Based Source Separation in Real Reverberant Rooms in IEEE Transactions on Audio, Speech, and Language Processing

Kilic V (2015) Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering in IEEE Transactions on Multimedia

Kilic V (2016) Mean-Shift and Sparse Sampling-Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking in IEEE Transactions on Multimedia

Kilic V (2013) Audio constrained particle filter based visual tracking

Liu Q (2012) Reverberant speech separation based on audio-visual dictionary learning and binaural cues

Liu Q (2013) Source Separation of Convolutive and Noisy Mixtures Using Audio-Visual Dictionary Learning and Probabilistic Time-Frequency Masking in IEEE Transactions on Signal Processing

Mohsen Naqvi S (2012) Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking in IET Signal Processing

Key Findings
Further Funding
Research Databases and Models
Collaboration


Description	The major original objectives of this joint project between the Advanced Signal Processing group at Loughborough University and the Centre for Vision, Speech and Signal Processing at the University of Surrey were to:
• design novel audio and video based advanced signal processing algorithms for speech separation of multiple active moving speakers by exploiting additional geometrical room acoustic information within the framework of convex-optimization-based beamformer design;
• propose multi-model solutions for human detection, localization, and tracking, to include situations where the sources exhibit complex motions including occlusions and interactions, within a room environment.
We have successfully progressed these topics, and our key findings are:
1. A novel multimodal source separation approach was proposed for physically moving and stationary sources, which exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. The challenge in separating moving sources, and even physically stationary sources at higher reverberation time (RT), is that the mixing filters are time varying; the unmixing filters should therefore also be time varying, but these are difficult to determine from audio measurements alone. In the proposed approach, the visual modality was therefore used to facilitate the separation for both stationary and moving sources. The movement of the sources was detected by a three-dimensional tracker based on a Markov Chain Monte Carlo particle filter. The audio separation was performed by a robust least squares frequency invariant data-independent beamformer. The uncertainties in source localisation and direction of arrival information obtained from the 3D video-based tracker were controlled by using a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources were further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results showed that, by using the visual modality, the proposed algorithm can not only achieve performance better than conventional frequency-domain source separation algorithms, but also provide acceptable separation performance for moving sources.
2. A video-aided model-based source separation algorithm was designed for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localized individual speech sources in the enclosure and then estimated their directions. The interaural spatial cues (the interaural phase difference and the interaural level difference) and the mixing vectors were probabilistically modelled. The models made use of the source direction information and were evaluated at discrete time-frequency points. The model parameters were refined with the well-known expectation-maximization (EM) algorithm. The algorithm generated time-frequency masks that were used to reconstruct the individual sources. Simulation results showed that, by utilizing the visual modality, the proposed algorithm could produce better time-frequency masks, thereby giving improved source estimates. We tested the proposed algorithm experimentally in different scenarios, compared it with other audio-only and audio-visual algorithms, and achieved improved performance on both synthetic and real data. We also included dereverberation-based pre-processing in our algorithm in order to suppress the late reverberant components of the observed stereo mixture and further enhance the overall output of the algorithm. This makes our algorithm a suitable candidate for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
3. Novel solutions to the following challenges in visual tracking of multiple human speakers in an office environment were proposed: (1) robust and computationally efficient modelling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) dealing with full or partial occlusions when multiple speakers cross or come into very close proximity; (3) automatic initialisation of the trackers, or re-initialisation when the trackers have lost lock because of, e.g., the limited camera views. First, we developed new algorithms for appearance modelling of the moving speakers based on dictionary learning (DL), using an off-line training process. In the tracking phase, the histograms (coding coefficients) of the image patches derived from the learned dictionaries were used to generate the likelihood functions based on Support Vector Machine (SVM) classification. This likelihood function was then used in the measurement step of the classical particle filtering (PF) algorithm. To improve the computational efficiency of generating the histograms, a soft voting technique based on approximate Locality-constrained Soft Assignment (LcSA) was proposed to reduce the number of dictionary atoms (codewords) used for histogram encoding. Second, an adaptive identity model was proposed to track multiple speakers whilst dealing with occlusions. This model was updated online using Maximum a Posteriori (MAP) adaptation, where we controlled the adaptation rate using the spatial relationship between the subjects. Third, to enable automatic initialisation of the visual trackers, we exploited audio information, namely the Direction of Arrival (DOA) angle, derived from microphone array recordings. Such information provided, a priori, the number of speakers and constrained the search space for the speakers' faces. The proposed system was tested on a number of sequences from three publicly available and challenging data corpora (AV16.3, EPFL pedestrian data set and CLEAR) with up to five moving subjects.
4. A novel method was developed for determining coarse head pose orientation purely from audio information, exploiting the direct-to-reverberant speech energy ratio (DRR) within a reverberant room environment. Our hypothesis was that a speaker facing towards a microphone will have a higher DRR and a speaker facing away from the microphone will have a lower DRR. This method has the advantage of actually exploiting the reverberations within a room rather than trying to suppress them. It also has the practical advantage that most everyday enclosed spaces, such as meeting rooms or offices, are highly reverberant environments. In order to test this hypothesis we also collected a new data set featuring 39 subjects adopting 8 different head poses in 4 different room positions, captured with a 16-element microphone array. We believe that this data set is unique and will make a significant contribution to further work in the area of audio head pose estimation.
9th December 2013
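The DRR hypothesis in finding 4 can be illustrated with a minimal sketch. It assumes a room impulse response is available and uses a synthetic one here; the project itself worked from speech recordings, and its estimation method is not reproduced. The names `drr_db` and `toy_rir`, the 2.5 ms direct-path window, the 0.5 s reverberation time and the direct-path gains are all illustrative assumptions, not values from the project.

```python
import numpy as np

def drr_db(h, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB of an impulse response h.

    Energy in a short window around the strongest peak is treated as
    'direct'; all energy after that window is treated as 'reverberant'.
    The window length is an assumed, commonly used order of magnitude.
    """
    h = np.asarray(h, dtype=float)
    peak = int(np.argmax(np.abs(h)))
    half = int(round(direct_window_ms * 1e-3 * fs))
    start = max(peak - half, 0)
    stop = min(peak + half + 1, len(h))
    direct = np.sum(h[start:stop] ** 2)
    reverb = np.sum(h[stop:] ** 2)
    return 10.0 * np.log10(direct / reverb)

def toy_rir(direct_gain, fs=16000, rt60=0.5, seed=0):
    """Hypothetical impulse response: one direct spike plus a noise tail.

    The tail decays exponentially so that its energy has dropped by
    60 dB after rt60 seconds (amplitude factor exp(-6.908) = 1e-3).
    """
    rng = np.random.default_rng(seed)
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    tail = 0.02 * rng.standard_normal(n) * np.exp(-6.908 * t / rt60)
    h = np.zeros(n)
    delay = int(0.005 * fs)           # 5 ms direct-path delay
    h[delay] = direct_gain            # direct sound
    h[delay + 1:] = tail[: n - delay - 1]  # reverberant tail
    return h

fs = 16000
# Assumption: turning away from the microphone mainly weakens the direct
# path, while the reverberant tail stays roughly the same.
facing_db = drr_db(toy_rir(direct_gain=1.0, fs=fs), fs)
away_db = drr_db(toy_rir(direct_gain=0.3, fs=fs), fs)
print(f"facing: {facing_db:.1f} dB, facing away: {away_db:.1f} dB")
```

In this toy setting the speaker facing the microphone yields the higher DRR, which is the ordering the hypothesis predicts; a coarse pose estimate could then compare DRR values across the microphones of an array.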
Exploitation Route	The research results feed into a follow-up EPSRC/Dstl project. Dstl and the industrial partners of this Programme Grant, who include Selex, Thales, Texas Instruments and MathWorks, will provide a route to exploitation of the research results. A further potential use of the signal processing solutions is in next-generation hearing aids, which would have both commercial and societal impact. Companies interested in this technology have been contacted to gauge their interest. The research results will also be used by the investigators in new research, and by the academic community to develop new solutions for the blind source separation problem.
Sectors	Electronics


Description	EPSRC Programme Grant
Amount	£6,104,265 (GBP)
Funding ID	EP/N007743/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	01/2016
End	12/2020


Description	S3A: future spatial audio for an immersive listener experience at home
Amount	£5,800,000 (GBP)
Funding ID	EP/L000539/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	12/2013
End	11/2017


Description	Signal processing for the networked battlespace
Amount	£3,800,000 (GBP)
Funding ID	EP/K014307/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	03/2013
End	03/2018


Title	Audio-Visual Data Set
Description	A new dataset was recorded in the audio media engineering lab for evaluating the head pose estimation algorithm. More than 39 subjects took part in the data collection. To our knowledge, this is the first dataset of real audio recordings for head pose recognition tasks. The dataset may be accessed at http://www.cvssp.org/avbss/
Type Of Material	Database/Collection of data
Year Produced	2013
Provided To Others?	Yes
Impact	The published data was used in experiments conducted by peer groups.
URL	http://www.cvssp.org/avbss/


Description	Loughborough University
Organisation	Loughborough University
Country	United Kingdom
Sector	Academic/University
PI Contribution	Enhanced collaborations with Loughborough University which have led to the success of the bid for the EPSRC/Dstl funded project "Signal processing solutions for a networked battlespace".
Start Year	2009


Description	MILES
Organisation	University of Surrey
Country	United Kingdom
Sector	Academic/University
PI Contribution	Internal inter-department collaboration was initiated with the Department of Computing and the School of Psychology, and a small feasibility study fund was awarded by the MILES (Models and Mathematics in Life and Social Sciences) project (12/2012-12/2013).
Start Year	2011


Description	Spatial Audio
Organisation	University of Salford
Country	United Kingdom
Sector	Academic/University
PI Contribution	External collaborations with the University of Southampton and the University of Salford were established during the project period, which have contributed to the design of work package 3 of the newly funded EPSRC project "S3A: future spatial audio for an immersive listener experience at home", where robust algorithms for audio-visual audio object separation, localisation and tracking will be studied.
Start Year	2012


Description	Spatial Audio
Organisation	University of Southampton
Country	United Kingdom
Sector	Academic/University
PI Contribution	External collaborations with the University of Southampton and the University of Salford were established during the project period, which have contributed to the design of work package 3 of the newly funded EPSRC project "S3A: future spatial audio for an immersive listener experience at home", where robust algorithms for audio-visual audio object separation, localisation and tracking will be studied.
Start Year	2012
