Audio and Video Based Speech Separation for Multiple Moving Sources Within a Room Environment

Lead Research Organisation: Loughborough University


Human beings have developed a unique ability to communicate within a noisy environment, such as at a cocktail party. This skill is dependent upon the use of both the aural and visual senses together with sophisticated processing within the brain. To mimic this ability within a machine is very challenging, particularly if the humans are moving, such as in a teleconferencing context, when human speakers are walking around a room. In the field of signal processing, researchers have developed techniques to separate one speech signal from a mixture of such signals, as would be measured by a number of microphones, on the basis of audio information alone and with the assumption that the humans are static and that typically no more than two humans are within the room. Such approaches have generally been found to fail, however, when the human speakers are moving and when there are more than two of them. Fundamentally new approaches are therefore necessary to advance the state-of-the-art in the field. Professor Chambers and his team at Loughborough University were the first in the UK to propose a new approach on the basis of combined audio and video processing to solve the source separation problem, but their preliminary approach identified major challenges in audio-visual speaker localization, tracking and separation which must be solved to provide a practical solution for speech separation for multiple moving sources within a room environment. These findings motivate this new project in which world-leading teams at the University of Surrey, led by Professor Kittler, and at the GIPSA Lab, Grenoble, France, headed by Professor Jutten, are ready to work with Professor Chambers and his team at Loughborough University to advance the state-of-the-art in the field.
In this new project, two postdoctoral researchers will be employed, one at Loughborough and another at Surrey.
The first will focus on the development of fundamentally new speech source separation algorithms for moving speakers by using geometrical room-acoustic information (for example, the location and number of sources and descriptions of their movement) provided by the second researcher. The research team at Grenoble will provide technical guidance throughout the project on the basis of their considerable experience in source separation, and will work on providing an acoustic noise model for the room environment which will also aid the speech separation process. To achieve these tasks, frequency-domain beamforming algorithms will be developed which exploit microphone arrays having more microphones than speakers, so that new data-independent, robust superdirective beamformer design methods can be exploited using mathematical convex optimization. Additionally, further geometric information will be exploited to introduce robustness to errors in the localization information describing the desired source and the interference. To improve the localization information, an array of collaborative cameras will be used, and both audio and visual information will be exploited. Advanced methods from particle filtering and probabilistic data association will be exploited for improving the tracking performance. Finally, visual voice activity detection will be used to determine the active sources within the beamforming operations. We emphasize that this work is not implementation-driven, so computational complexity for real-time realization will not be a focus; this would be the subject of a future project.
All of the new algorithms will be evaluated both in terms of objective and subjective performance measures on labelled audio and visual datasets acquired at Loughborough and Surrey, and from the CHIL seminar room at the Karlsruhe University (UKA), Germany.
To ensure this pioneering work has maximum impact on the UK and international academic and research communities all the algorithms and datasets will be made available through the project website.
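The data-independent superdirective beamforming idea mentioned above can be illustrated with a minimal sketch. It is not the project's design method: the project proposes designs obtained by convex optimization, whereas this sketch swaps in the textbook closed-form solution with diagonal loading as a simple stand-in for robustness. The array geometry, the loading value, the speed of sound and all function names are assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def circular_array(num_mics, radius):
    """Microphone (x, y) positions on a circle centred at the origin."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)


def steering_vector(mic_xy, azimuth, freq_hz):
    """Far-field steering vector for a source at the given azimuth (radians)."""
    direction = np.array([np.cos(azimuth), np.sin(azimuth)])
    # Microphones closer to the source receive the wavefront earlier.
    delays = -(mic_xy @ direction) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def diffuse_coherence(mic_xy, freq_hz):
    """Coherence matrix of a spherically diffuse noise field."""
    dists = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=2)
    # np.sinc(x) is sin(pi x) / (pi x), so this is sin(2 pi f d / c) / (2 pi f d / c).
    return np.sinc(2 * freq_hz * dists / SPEED_OF_SOUND)


def superdirective_weights(mic_xy, azimuth, freq_hz, loading=1e-2):
    """Closed-form weights: minimise diffuse-noise power with unit gain at the source."""
    gamma = diffuse_coherence(mic_xy, freq_hz) + loading * np.eye(len(mic_xy))
    d = steering_vector(mic_xy, azimuth, freq_hz)
    gamma_inv_d = np.linalg.solve(gamma, d)
    return gamma_inv_d / (d.conj() @ gamma_inv_d)


mics = circular_array(num_mics=8, radius=0.1)
w = superdirective_weights(mics, azimuth=0.0, freq_hz=1000.0)
# Beamformer output for one frequency bin x (length-8 vector) would be np.vdot(w, x).
```

In a tracking context, the azimuth passed to `superdirective_weights` would be updated from the audio-visual localization estimates as the speaker moves; the loading term trades directivity for tolerance to errors in that estimate.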

Planned Impact

The project entails fundamental algorithmic research and evaluation. We are therefore not involving an industrial partner directly nor providing a commercial roadmap for exploitation; however, as listed below, our industrial contacts provide clear routes for longer term engagement with industry.
Who will benefit from this research? The PDRAs working on the project and the research students in the associated laboratories at Loughborough and Surrey will benefit directly. The UK and international academic research communities working in the field of combined audio and video processing will be major beneficiaries, in particular those with an interest in speech separation. In the longer term, UK industries in the areas of automatic speech recognition and human machine interfaces, defence and security (MoD), and healthcare (NHS) are likely to benefit from the work, but this is expected to happen after the three-year duration of this project.
How will they benefit from this research? The project intends to advance the state-of-the-art in terms of audio and video-based algorithms for localization, tracking and source separation of moving sources within a room environment. The signal processing algorithms developed and the related audio-video datasets used for evaluation will become important research tools for the UK and international academic and industrial research communities. Industrial contacts, such as through QinetiQ, for which Professor Chambers is the first QinetiQ Visiting Fellow, and with BAE Systems, for which Professor Kittler is currently managing collaborative projects, will ensure that the route remains open to commercialization of new technological breakthroughs in the longer term, beyond the three-year period of the project, particularly in the defence and security areas. In the wider context of the digital economy, the research has the potential to improve the quality of life of those who have disabilities and wish to remain living independently.
What will be done to ensure they benefit from the research? The PDRAs will be trained by the investigators on the project, all of whom have outstanding track records in their respective research areas. The research results will be published regularly in the foremost journals and presented at international conferences. We will attend other key international conferences to disseminate our results to the academic and industrial research communities. Quarterly meetings will be set up with our international collaborators to review the project progress and to transfer knowledge and skills. A website dedicated to our audio and video based research will be developed with the aim of attracting a wide audience from academia and industrial research laboratories.


Description In this joint project between the Advanced Signal Processing Group at Loughborough University and the Centre for Vision, Speech and Signal Processing at the University of Surrey we have developed new methods for solving the machine cocktail party problem in an enclosed environment; namely, to mimic the ability of a human to separate sounds from moving speakers using both their ears and eyes. In the machine these sounds are measured at multiple microphones and visual information is acquired by video cameras. One technique we have proposed exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. Use of the video modality allows the processing to adapt to whether the sources are static or moving. The second method limits the number of microphones to two, in the same way that a human only uses two ears, whilst retaining the ability to separate multiple sources. The processing is based upon exploiting both audio and visual information, that is, interaural level and phase difference cues, together with video-informed mixing terms in the form of probabilistic models. Given that the sources can be moving, the development of tracking algorithms has been crucial to the work, and we have developed new solutions that (1) provide robust and computationally efficient modelling and classification of the changing appearance of the speakers in a variety of different lighting conditions and camera resolutions; (2) deal with full or partial occlusions when multiple speakers cross or come into very close proximity; and (3) automatically initialise the trackers, or perform re-initialisation, when the trackers have lost lock owing to, for example, the limited camera views. Finally, we have developed an audio-only method for estimation of head pose orientation. All of the methods have had success on real datasets and new databases have been recorded for future work in the field.
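As an illustration of the interaural level and phase difference cues mentioned above, the following minimal sketch computes both cues per time-frequency bin from a two-microphone recording. It shows only cue extraction, not the project's probabilistic separation model or its video-informed terms; the frame length, hop size, Hann window and function names are assumptions chosen for the example.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Simple short-time Fourier transform; returns an array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def interaural_cues(left, right, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for every time-frequency bin."""
    L, R = stft(left), stft(right)
    # Level difference: ratio of magnitudes; eps guards against division by zero.
    ild_db = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    # Phase difference: phase of the left channel relative to the right, in (-pi, pi].
    ipd_rad = np.angle(L * np.conj(R))
    return ild_db, ipd_rad


# Usage with a synthetic pair: the right channel is the left channel at half amplitude.
rng = np.random.default_rng(0)
left = rng.standard_normal(4096)
right = 0.5 * left
ild, ipd = interaural_cues(left, right)
```

A separation method would then group time-frequency bins whose cues are consistent with one source position and build a mask per source from those groups.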
Exploitation Route The natural interface between man and machines is through speech and therefore our research findings have attracted broad interest both nationally and internationally. Internet and telecommunication companies are very keen to progress the field, for example in expanding the use of speech interfaces in noisy and challenging environments. We believe the use of video in this domain is crucial to generate transformative solutions, and the work we have undertaken is providing a basis for part of the research activity in our new five-year £4.4M project entitled "Signal Processing for the Networked Battlespace" funded by Dstl and EPSRC. The research findings have been published in leading international journals published by the IEEE, such as the Transactions on Audio, Speech and Language Processing, on Signal Processing, and on Multimedia, and presented at key UK and international conferences. Collaboration with the University of Grenoble through funding from Dstl and DGA (France) has allowed international exchange of knowledge and transfer to key defence stakeholders. Research staff and students have benefitted from the very best training in research. This quality training has been the platform for success in the transfer of personnel to both academic and new research positions at the end of the project.
Sectors Digital/Communication/Information Technologies (including Software)

Description The algorithms developed have been published in international conferences and journals.
Sector Digital/Communication/Information Technologies (including Software)