Weakly-Supervised Learning for Classifying Acoustic Scenes and Events

Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP

Abstract

The aim of this research is to develop supervised learning methods in order to classify audio signals for which the labels are weak or noisy. Audio classification minimally entails identifying what sounds (classes) are present in an audio signal. When it is also necessary to determine when the sounds occur, it is known as sound event detection (SED). In either case, the most effective approach is to use a dataset of example audio signals that are already labelled and then learn how to perform classification using these labelled examples. This is known as supervised learning. Unfortunately, collecting large amounts of labelled data can be prohibitively expensive. For sound event detection, it is common for only the class labels to be provided and not the timestamps. In other situations, the labels can even be inaccurate (noisy) because of the effort required to otherwise verify them. We refer to both as weakly-supervised learning problems, and formulate them under a single framework.

This research focuses on deep learning methods for audio classification. Broadly speaking, deep learning refers to the use of modern neural network architectures and large amounts of training data. This approach has produced state-of-the-art results in many tasks, including audio classification. However, we believe that deep learning methods for weakly-supervised learning are underdeveloped for general-purpose audio classification. Our contribution would be to introduce techniques that exploit the characteristics of audio, in a way that adapts to the specific application. We also want to unify the treatment of weakly-supervised problems so that knowledge from seemingly different areas can be consolidated. This is something that is currently lacking in the audio domain.

The following objectives have been identified:
- Use an architecture based on convolutional neural networks for classification. Recent advances such as residual networks and Inception-style networks will be investigated.
- Consider how to design a network that is robust to label noise. We will look at pseudolabelling and loss function weighting conditioned on the examples.
- Consider how to design a network for sound event detection in which labels are weak. The most promising direction is to look at attention mechanisms that learn to weight the different segments of an audio clip. By labelling each segment the same as the audio clip, we can also consider the labels as noisy and use techniques that are robust to noise.
- Generally speaking, we want to formulate the various problems using probabilistic reasoning so that the solutions fit under a common framework and do not appear ad-hoc.
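To make the attention idea in the objectives above concrete, the sketch below is a minimal NumPy illustration (all function names and shapes are our own, not from the project): per-segment class scores are pooled into a single clip-level prediction by attention weights learned over time, so that only clip-level (weak) labels are needed during training.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(segment_probs, attention_scores):
    """Pool per-segment class probabilities (T segments x C classes)
    into one clip-level prediction of shape (C,).

    Each class attends over time: segments with higher attention
    scores contribute more to the clip prediction, which is how the
    network can localise events despite seeing only clip labels.
    """
    weights = softmax(attention_scores, axis=0)   # normalise over time
    return (weights * segment_probs).sum(axis=0)  # convex combination per class
```

Because the attention weights are non-negative and sum to one over time, the clip-level prediction for each class is a convex combination of that class's segment scores; in a real system both the segment scores and the attention scores would be produced by a neural network and trained end-to-end.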

Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/N509772/1                                      30/09/2016   29/09/2021
1976218             Studentship    EP/N509772/1   30/09/2017   28/02/2021   TURAB IQBAL
 
Description This work has involved developing algorithms that can teach machines to classify sound events using potentially mislabelled or incomplete training data. The problem we are trying to solve is how to train the machine (classifier) effectively despite the presence of incorrectly-labelled data. The motivation is that manually verifying data can be prohibitively costly, so omitting this process can result in much more data being available for training.

To tackle this problem, the main technique that we have used is known as pseudo-labelling, where a new label is estimated for an audio clip when the given label is suspected to be incorrect. To detect incorrect labels and estimate new labels, we have proposed using an auxiliary classifier that is trained on data that is known to be labelled correctly. Experiments were carried out on several audio datasets that were known to have some level of label corruption. Using a neural network to implement both the primary classifier and the auxiliary classifier, it was found that the performance of the primary classifier improved significantly when using pseudo-labelling. This suggests that the auxiliary classifier is able to substitute fairly well for a human annotator. This work helped us achieve 3rd place (out of 558 participants on Kaggle) in Task 2 of the DCASE 2018 challenge. The work was also published in the DCASE 2018 workshop.
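The relabelling step can be sketched in a few lines (a minimal NumPy illustration; the function name and the confidence threshold are our own, not taken from the published system): a given label is overridden only when the auxiliary classifier confidently disagrees with it.

```python
import numpy as np

def pseudo_label(aux_probs, given_labels, threshold=0.9):
    """Correct suspect labels using an auxiliary classifier.

    aux_probs:    (N, C) class probabilities from an auxiliary classifier
                  trained only on verified (correctly labelled) data.
    given_labels: (N,) possibly noisy integer labels.
    threshold:    confidence required before overriding a given label.

    Returns the corrected labels and a mask of which examples changed.
    """
    preds = aux_probs.argmax(axis=1)
    confidence = aux_probs.max(axis=1)
    # Override only when the auxiliary classifier confidently disagrees.
    relabelled = (confidence >= threshold) & (preds != given_labels)
    return np.where(relabelled, preds, given_labels), relabelled
```

The threshold trades off between correcting genuine label errors and overwriting labels that were in fact correct; a high value keeps the relabelling conservative.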

In addition, we have focused on an instance of this problem where the data is out-of-domain (OOD). This means that some of the audio clips in the training data do not belong to any of the target classes (i.e. the classes defined for the classification problem), but have been incorrectly labelled as such. We have shown that even in this scenario, relabelling the OOD data, rather than discarding it, can improve the performance of the classifier. In other words, training with OOD instances can actually be beneficial provided they are labelled appropriately. This is because some of the OOD instances possess overlapping properties with the target classes, e.g. an oboe sound is similar to a clarinet sound even if it is OOD. We proposed using an auxiliary classifier to detect and relabel the OOD instances based on how similar they are to the target classes. Experiments on the FSDnoisy18k dataset showed that the hypothesis is correct and that the proposed method improves classification performance. A paper was written detailing this work and was accepted for the ICASSP 2020 conference.
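The keep-and-relabel idea can be illustrated with a small sketch (NumPy; the function name, threshold, and interface are illustrative assumptions, not taken from the ICASSP 2020 paper): an OOD clip is kept and assigned the most similar target class when the auxiliary classifier's similarity is high enough, and dropped otherwise.

```python
import numpy as np

def relabel_ood(aux_probs, given_labels, ood_mask, threshold=0.5):
    """Relabel out-of-domain (OOD) clips instead of discarding them all.

    aux_probs:    (N, C) target-class probabilities from an auxiliary
                  classifier trained on in-domain data.
    given_labels: (N,) integer labels as originally provided.
    ood_mask:     (N,) True where a clip is flagged as out-of-domain.
    threshold:    minimum similarity to a target class for an OOD clip
                  to be kept (and relabelled) for training.

    Returns (labels, keep): corrected labels, and a mask of clips to
    keep. OOD clips resembling no target class are dropped.
    """
    labels = given_labels.copy()
    similarity = aux_probs.max(axis=1)
    relabel = ood_mask & (similarity >= threshold)
    labels[relabel] = aux_probs.argmax(axis=1)[relabel]
    keep = ~ood_mask | (similarity >= threshold)
    return labels, keep
```

This captures the oboe/clarinet intuition above: an OOD oboe clip that looks sufficiently like the clarinet class is retained as extra clarinet training data rather than thrown away.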
Exploitation Route This work looks at using large amounts of noisily labelled data with machine learning algorithms. The research extends beyond the audio domain and is applicable in any area where large amounts of labelled data are available for training a classifier but are too costly to verify manually. This is becoming increasingly relevant due to the rapid expansion of the internet; there is a large amount of data on the internet, but it is typically collected in uncontrolled environments. Our research will hopefully help in developing state-of-the-art techniques that can utilise large amounts of noisily labelled data effectively.
Sectors Environment

Healthcare

Transport

 
Title dcase2018_task2 
Description This is the source code for the system described in the paper 'General-Purpose Audio Tagging from Noisy Labels using Convolutional Neural Networks'. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact 19 stars and 2 forks on GitHub. Several people have asked questions related to the code and have evidently used it. 
URL https://github.com/tqbl/dcase2018_task2
 
Title gccaps 
Description This is the source code for the system described in the paper 'Capsule Routing for Sound Event Detection'. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact 12 stars and 1 fork on GitHub. 
URL https://github.com/tqbl/gccaps
 
Title ood_audio 
Description This is the source code for the system described in the paper 'Learning with Out-of-Distribution Data for Audio Classification'. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact 9 stars and 2 forks on GitHub. 
URL https://github.com/tqbl/ood_audio