Unifying audio signal processing and machine learning: a fundamental framework for machine hearing
Lead Research Organisation:
University of Cambridge
Department Name: Engineering
Abstract
Modern technology is leading to a flood of audio data. For example, over seventy-two hours of unstructured and unlabelled sound-tracks are uploaded to internet sites every minute. Automatic systems are urgently needed for recognising audio content so that these sound-tracks can be tagged for categorisation and search. Moreover, an increasing proportion of recordings are made on hand-held devices in challenging environments that contain multiple sound sources and noise. Such uncurated and noisy data necessitate automatic systems for cleaning the audio content and separating sources from mixtures. On a related note, devices for the hearing impaired currently perform poorly in noise. In fact, this is a major reason why six million people in the UK who would benefit from a hearing aid do not use one (a market worth £18 billion p.a.). Patients fitted with cochlear implants suffer from similar limitations, and as the population ages more people are affected.
It is clear that audio recognition and enhancement methods are required to stop us drowning in audio-data, for processing in hearing devices, and to support new technological innovations. Current approaches to these problems use a combination of audio signal processing (which places the audio data into a convenient format and reduces the data-rate) and machine learning (which removes noise, separates sources, or classifies the content). It is widely believed that these two fields must become increasingly integrated in the future. However, this union is currently a troubled one, suffering from four problems.
Inefficiency: The methods are too inefficient when we have vast amounts of data (as is the case for audio-tracks on the web) or for real-time applications (as required in hearing aids).
Impoverished models: The machine learning modules tend to be statistically limited.
Unadapted: The signal processing modules are unadapted despite evidence from other fields, like computer vision, which suggests that automatic tuning leads to significant performance gains.
Distorted mixtures: The signal processing modules introduce non-linear distortions which are not captured by the machine learning modules.
In this project we address these four limitations by introducing a new theoretical framework which unifies signal processing and machine learning. The key step is to view the signal processing module as solving an inference problem. Since the machine-learning modules are often framed in this way, the two modules can be combined into a single coherent approach, allowing technologies from the two fields to be fully integrated. In the project we will then use the new approach to develop efficient, rich, adaptive, and distortion-free approaches to audio denoising, source separation and recognition. We will evaluate the noise reduction and source separation algorithms with hearing-impaired listeners, and the audio recognition algorithms on audio sound-track data.
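As a rough illustration of what viewing a signal processing step as inference can mean, the sketch below treats one frequency band of a noisy recording as a Gaussian signal plus Gaussian noise. Under that assumption, the posterior mean obtained from Bayes' rule coincides with the classical Wiener gain applied to the observation. The function names, the per-band variances and the observed values are hypothetical and chosen only for illustration; this is not the framework developed in the project.

```python
def posterior_mean(y, signal_var, noise_var):
    """Posterior mean of x given y = x + noise.

    Assumes x ~ N(0, signal_var) and noise ~ N(0, noise_var), and applies
    Bayes' rule in precision form (prior precision + likelihood precision).
    """
    posterior_precision = 1.0 / signal_var + 1.0 / noise_var
    return (y / noise_var) / posterior_precision


def wiener_gain(signal_var, noise_var):
    """Classical Wiener gain for a single frequency band."""
    return signal_var / (signal_var + noise_var)


# Hypothetical per-band observations and variances (illustrative only).
observations = [0.8, -1.5, 2.0]
signal_vars = [4.0, 1.0, 0.25]
noise_vars = [1.0, 1.0, 1.0]

bands = list(zip(observations, signal_vars, noise_vars))

# Route 1: probabilistic inference (posterior mean per band).
inferred = [posterior_mean(y, s, n) for y, s, n in bands]

# Route 2: classical signal processing (Wiener gain times observation).
filtered = [wiener_gain(s, n) * y for y, s, n in bands]

for a, b in zip(inferred, filtered):
    print(f"inference: {a:+.4f}   filter: {b:+.4f}")
```

In this toy setting the two routes give the same estimate, which is the sense in which a filtering step can be read as the solution to an inference problem. Richer signal models would no longer reduce to a fixed gain.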
We believe this new framework will form a foundation of the emerging field of machine hearing. In the future, machine hearing will be deployed in a vast range of applications from music processing tasks to augmented reality systems (in conjunction with technologies from computer vision). We believe that this project will kick-start this proliferation.
Planned Impact
The signal processing and machine learning methods developed in this project are keystone technologies upon which upstream research depends. In particular, the project will have a significant impact for the health-care and digital industries. The following specific groups will benefit from the research:
The hearing impaired: hearing aid users
Six million people in the UK who would benefit from a hearing aid do not use one (a market worth £18 billion p.a.). This group of people is expanding rapidly as the population ages (the number of people aged 65 or older is expected to double by 2050). One of the main reasons why hearing aids are not as widely used as they should be is that they perform poorly in noisy environments. The efficient and adaptive noise removal systems developed for hearing aids in this project will address this key issue. This proposal will therefore contribute to the EPSRC Healthcare Technologies theme. Prof. Moore will be involved in translating the research into hearing aids, including providing access to hearing impaired patients for testing.
The hearing impaired: cochlear implantees
Cochlear implants allow the profoundly deaf, who get little or no benefit from a normal hearing aid, to gain awareness of environmental sounds and, in most cases, understand speech without lip-reading. 8000 people in the UK currently use cochlear implants and there are 1000 new implantees each year. Again, perhaps the major limitation of the current devices is their poor performance in noisy environments. The efficient and adaptive noise removal systems developed in this project aim to address this key issue thereby contributing to the EPSRC Healthcare Technologies theme. Dr. Carlyon will be involved in translating the research into cochlear implants, including providing access to implanted patients for testing.
Audio search and information retrieval: digital-industries and society
Many companies preside over large, uncurated collections of audio data and they would benefit from methods for searching and categorising the data. For example, over seventy-two hours of unstructured and unlabelled sound-tracks are uploaded to internet sites every minute and this number continues to grow. Automatic systems are urgently needed for recognising audio content so that these sound-tracks can be tagged (possibly at precise times throughout the clips) for categorisation and search. Often the audio-tracks are recorded with video and so the audio tags can also be used to search the video. The audio recognition technology developed in this project therefore has wide commercial application to information retrieval. Similarly, the BBC's public space project is attempting to organise and unlock the archives of the BBC, making them accessible to the public. The audio-recognition technologies will make significant contributions to this project, thereby improving public services and enhancing life.
Audio denoising and source separation: digital-industries and society
Poor quality audio data is becoming commonplace. For example, 3 hours of recordings made on hand-held devices are uploaded to YouTube every minute. These recordings are often made in challenging environments that contain multiple sound sources and noise. Such noisy data necessitate automatic systems for cleaning the audio content and separating sources from mixtures. This project will provide tools for this purpose. Moreover, upstream technologies such as Automatic Speech Recognition (ASR) and Audio Diarisation (AD) systems perform poorly in noisy environments. As such, the noise removal methods developed in the project can be coupled with these approaches to improve performance. Since modern approaches to ASR and AD are probabilistic, this raises the possibility of integrated approaches that jointly estimate the noise at the same time as interpreting the content. The advisory group, and in particular Prof. Gales, will advise on possible technology transfer.
People |
ORCID iD |
Richard Turner (Principal Investigator) |
Publications
Adel T.
(2020)
CONTINUAL LEARNING WITH ADAPTIVE WEIGHTS (CLAW)
in 8th International Conference on Learning Representations, ICLR 2020
Alexander G. De G. Matthews
(2018)
Gaussian Process Behaviour in Wide Deep Neural Networks
Alexandre K. W. Navarro
(2017)
The Multivariate Generalised von Mises Distribution: Inference and Applications
Arno Solin
(2018)
Infinite-Horizon Gaussian Processes
Ashman M
(2020)
Sparse Gaussian Process Variational Autoencoders
Bauer M
(2017)
Discriminative k-shot learning using probabilistic models
Bronskill J.
(2020)
Tasknorm: Rethinking batch normalization for meta-learning
in 37th International Conference on Machine Learning, ICML 2020
Bui T
(2014)
Tree-structured Gaussian Process Approximations
Bui T
(2017)
Streaming Sparse Gaussian Process Approximations
Bui T.D.
(2016)
Deep Gaussian processes for regression using approximate expectation propagation
in 33rd International Conference on Machine Learning, ICML 2016
Bui Thang D.
(2017)
A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation
in JOURNAL OF MACHINE LEARNING RESEARCH
Choromanski K
(2018)
Structured Evolution with Compact Architectures for Scalable Policy Optimization
De G. Matthews A.G.
(2016)
On sparse variational methods and the Kullback-Leibler divergence between stochastic processes
in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016
Foong A.Y.K.
(2020)
Meta-learning stationary stochastic process prediction with convolutional neural processes
in Advances in Neural Information Processing Systems
Foong A.Y.K.
(2020)
On the expressiveness of approximate inference in Bayesian neural networks
in Advances in Neural Information Processing Systems
Gomersall PA
(2016)
Perception of stochastic envelopes by normal-hearing and cochlear-implant listeners.
in Hearing research
Gordon J
(2018)
Meta-Learning Probabilistic Inference For Prediction
Gordon J.
(2020)
CONVOLUTIONAL CONDITIONAL NEURAL PROCESSES
in 8th International Conference on Learning Representations, ICLR 2020
Gordon J.
(2019)
Meta-learning probabilistic inference for prediction
in 7th International Conference on Learning Representations, ICLR 2019
Gu S
(2015)
Neural Adaptive Sequential Monte Carlo
Description | We have made the following contributions: 1. We have shown a theoretical connection between probabilistic machine learning approaches and classical signal processing methods, which brings these fields closer together and makes it simpler and more efficient to combine methods from both. This has led to better methods for removing noise from audio and filling in missing data. This contribution has been published in IEEE Transactions on Signal Processing. 2. We have scaled up a fundamental machine learning tool, called a Gaussian process, so that it can handle large-scale datasets containing millions of datapoints. Previously these methods were limited to handling ten thousand datapoints. These contributions have been published in the Neural Information Processing Systems Conference, the Journal of Machine Learning Research, and the International Conference on Machine Learning. 3. We have developed new generally applicable machine learning tools. |
Exploitation Route | We are pursuing the following opportunities: -- applying the technology to improving audio uploaded to the internet and automatically recognising content (together with Google) -- applying methods to improving hearing devices (including cochlear implants and hearing aids) |
Sectors | Digital/Communication/Information Technologies (including Software) Healthcare |
URL | http://cbl.eng.cam.ac.uk/Public/Turner/ResearchAreas |
Description | The grant led to technology being developed that formed the basis for a successful EPSRC Research Grant application. 1. efficient methods for removing noise from audio and recognising content for tagging video sound tracks (in collaboration with Prof. Brian Moore) 2. improving hearing devices for the hearing impaired (together with Dr. Robert Carlyon, Prof. Brian Moore and Dr. David Baguley) In addition, it also led to a strong collaboration with Google which has resulted in follow-up funding (worth roughly £120k). The fundamental machine learning methods have led to strong engagement with Microsoft Research and Toyota. |
Sector | Digital/Communication/Information Technologies (including Software),Healthcare |
Impact Types | Societal |
Description | Amazon research award |
Amount | $80,000 (USD) |
Organisation | Amazon.com |
Sector | Private |
Country | United States |
Start | 03/2019 |
Description | Baroness de Turckhiem Fund Award |
Amount | £22,233 (GBP) |
Organisation | University of Cambridge |
Department | Trinity College Cambridge |
Sector | Academic/University |
Country | United Kingdom |
Start | 01/2017 |
End | 01/2019 |
Description | Google European Doctoral Programme |
Amount | £40,000 (GBP) |
Organisation | |
Sector | Private |
Country | United States |
Start | 08/2015 |
End | 09/2017 |
Description | Google Focussed Research Award |
Amount | £101,455 (GBP) |
Organisation | |
Sector | Private |
Country | United States |
Start | 04/2017 |
End | 09/2021 |
Description | Machine Learning for Tomorrow: Efficient, Flexible, Robust and Automated |
Amount | £3,100,000 (GBP) |
Funding ID | EP/T005637/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 01/2020 |
End | 01/2025 |
Description | Microsoft Research Azure Computer Credits Award |
Amount | $50,000 (USD) |
Organisation | Microsoft Research |
Sector | Private |
Country | Global |
Start | 02/2018 |
End | 02/2019 |
Description | Collaboration with Prof. Brian Moore |
Organisation | University of Cambridge |
Department | Department of Psychology |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Based on the research in grant EP/L000776/1, Prof. Moore approached us to apply for an EPSRC research grant which we were subsequently awarded (grant number EP/G050821/1). |
Collaborator Contribution | Together we devised the new research programme (based around intelligent hearing devices and intelligent hearing tests). Prof. Brian Moore brought expertise in hearing and hearing devices and industrial collaborators (Phonak). I provided expertise in audio time series modelling and machine learning. |
Impact | Awarded EPSRC research grant (grant number EP/G050821/1) Multi-disciplinary: machine learning, signal processing, time-series modelling, hearing, hearing aids, deafness research |
Start Year | 2015 |
Description | Microsoft Research |
Organisation | Microsoft Research |
Country | Global |
Sector | Private |
PI Contribution | Joint supervision of PhD students and Postdoctoral Research Associates; Joint projects with applications within Microsoft and beyond. |
Collaborator Contribution | Joint supervision of PhD students and Postdoctoral Research Associates; Joint projects with applications within Microsoft and beyond. |
Impact | The partnership has existed for 5 months. In that time we have had three joint papers written and accepted together. In the longer term we expect the project to have economic and societal impact in the fields of health, gaming, and intelligent software systems. We have applied for a Prosperity Partnership grant together in order to expand and strengthen the existing partnership. |
Start Year | 2018 |
Description | Appeared on BBC Radio 5 Live |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Public/other audiences |
Results and Impact | Appeared on the BBC Radio 5 Live programme The Naked Scientists, talking about hearing and my research. The show was conducted in front of a live audience of school children. |
Year(s) Of Engagement Activity | 2014 |
URL | http://www.thenakedscientists.com/HTML/interviews/interview/1001043/ |