Self-supervised autoencoder framework for salient sensory feature extraction
Lead Research Organisation: Imperial College London
Department Name: Electrical and Electronic Engineering
Abstract
The natural world is full of noise, but the brain's capacity for information transmission is severely limited. Discarding irrelevant information contained in sensory inputs while retaining salient features, i.e. those related to the identity (label) of the input, is therefore key to survival. Numerous studies suggest that the brain may achieve this in part by implementing information bottlenecks, which optimise the trade-off between compression and preservation of salient information. Implementing an information bottleneck requires a way of measuring information. A commonly adopted measure in neuroscience is mutual information, an information-theoretic quantity that describes how much a neuronal response can tell us about a stimulus. However, this metric and many existing approaches for estimating it suffer from the curse of dimensionality.
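For reference, these are the standard textbook forms of the quantities involved (the discrete definition of mutual information and the information bottleneck objective); the notation is ours and is not taken from the project itself:

```latex
% Mutual information between a stimulus S and a neural response R
I(S;R) = \sum_{s,r} p(s,r) \, \log \frac{p(s,r)}{p(s)\,p(r)}

% Information bottleneck: compress the input X into a representation Z
% while preserving information about the relevant variable Y
\min_{p(z \mid x)} \; I(X;Z) - \beta \, I(Z;Y)
```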
In recent years, this challenge has been approached by framing the estimation of mutual information as a minimax optimisation problem in an adversarial setting, in which two neural networks play a game where an improvement in one network's performance worsens the other's. This method has been found to scale with dimension and sample size, unlike traditional mutual information estimators. Building on this research, we propose an adversarially inspired autoencoder framework for salient sensory feature extraction.
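As a concrete illustration of this line of work, the sketch below implements a neural mutual information estimator based on the Donsker-Varadhan lower bound (in the spirit of MINE). It is a minimal, representative formulation chosen by us, not the specific estimator used in this project; network sizes and the toy data are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T(x, z): assigns a scalar score to an (x, z) pair."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_lower_bound(T, x, z):
    """Donsker-Varadhan lower bound on I(X;Z).

    The joint term uses the paired (x_i, z_i) samples; the marginal term
    shuffles z across the batch to break the pairing.
    """
    n = z.size(0)
    joint = T(x, z).mean()
    z_shuffled = z[torch.randperm(n)]
    marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - torch.log(torch.tensor(float(n)))
    return joint - marginal  # maximise w.r.t. T's parameters to tighten the bound

# Toy usage: a correlated Gaussian pair; the estimate grows towards the true MI.
x = torch.randn(512, 4)
z = x + 0.5 * torch.randn(512, 4)
T = StatisticsNetwork(x_dim=4, z_dim=4)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(200):
    loss = -mine_lower_bound(T, x, z)  # minimise the negative bound
    opt.zero_grad(); loss.backward(); opt.step()
print(float(mine_lower_bound(T, x, z)))
```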
The proposed framework consists of three neural networks: the encoder, the decoder, and the classifier. The objective is for the encoder to learn to compress the data such that the classifier can accurately classify the input, but the decoder cannot fully reconstruct the original input. The auxiliary classification task is implemented to help condition the latent space of the encoder to capture salient features.
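The sketch below shows one way such a three-network set-up could be wired together, as we read it from the description above: the decoder is trained to reconstruct the input from the latent code, while the encoder and classifier are trained to classify well and to hinder reconstruction. Layer sizes, losses, the MNIST-style 784-dimensional input and the adversarial weight `beta` are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim=32, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
    def forward(self, z):
        return self.net(z)

class Classifier(nn.Module):
    def __init__(self, latent_dim=32, n_classes=10):
        super().__init__()
        self.net = nn.Linear(latent_dim, n_classes)
    def forward(self, z):
        return self.net(z)

enc, dec, clf = Encoder(), Decoder(), Classifier()
opt_enc_clf = torch.optim.Adam(list(enc.parameters()) + list(clf.parameters()), lr=1e-3)
opt_dec = torch.optim.Adam(dec.parameters(), lr=1e-3)
beta = 0.1  # assumed weight on the adversarial (anti-reconstruction) term

def train_step(x, y):
    # 1) Decoder: reconstruct the input from the (detached) latent code.
    #    zero_grad here also clears any decoder gradients left over from step 2.
    z = enc(x).detach()
    rec_loss = F.mse_loss(dec(z), x)
    opt_dec.zero_grad(); rec_loss.backward(); opt_dec.step()

    # 2) Encoder + classifier: classify accurately while maximising the
    #    decoder's reconstruction error (the adversarial objective).
    z = enc(x)
    cls_loss = F.cross_entropy(clf(z), y)
    adv_loss = -F.mse_loss(dec(z), x)  # negative: encoder hinders reconstruction
    loss = cls_loss + beta * adv_loss
    opt_enc_clf.zero_grad(); loss.backward(); opt_enc_clf.step()
    return cls_loss.item(), rec_loss.item()
```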
Preliminary results obtained by training the framework on the MNIST dataset and subsets of the CIFAR10 dataset show that the framework discards irrelevant information from image data. Furthermore, it appears to perform figure-ground separation, a phenomenon that enables perception of shapes and objects by segmenting visual scenes into object- and background-like regions. In this project, we aim to confirm these findings by training the framework on more complex image datasets and to investigate whether it can reproduce other feature-selective mechanisms involved in visual information processing.
We also aim to extend the application of the framework to other sensory modalities, starting with speech processing. Human listeners have the remarkable ability to separate the speech of one speaker from a noisy background, a phenomenon known as the cocktail party problem. However, our understanding of what information is discarded in this process is limited. Physiological studies often rely on features borrowed from other fields, which are difficult to relate to auditory neurophysiology. Similarly, modelling studies often rely on annotated linguistic feature spaces rather than on the spectrotemporal dynamics of speech, even though using those dynamics directly has been shown to improve feature extraction in models. By using representations obtained directly from the input data, our framework may therefore provide a more precise account of neural responses to stimuli.
The first step will be to train the framework on a simple speech-processing task, such as classifying vowels, and to compare the learned features with the existing literature on auditory neurophysiology. Our results could then be used to generate new, testable hypotheses about the features underlying noise-robust speech processing in the brain. The features learned by the framework could also be used to improve speech recognition systems.
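One plausible way to prepare spectrotemporal inputs for such a vowel-classification task is sketched below, using a log-mel spectrogram; the use of torchaudio, the sampling rate and the transform parameters are our assumptions for illustration, not stated choices of the project.

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000  # assumed sampling rate

# Log-mel spectrogram: a common spectrotemporal representation of speech.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=512, hop_length=160, n_mels=64)

def to_features(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a mono waveform of shape (1, n_samples) into a log-mel
    spectrogram of shape (1, n_mels, n_frames) for the encoder."""
    spec = mel(waveform)
    return torch.log(spec + 1e-6)

# Example with a synthetic placeholder signal (a 0.5 s, 220 Hz tone):
t = torch.arange(0, 0.5, 1.0 / SAMPLE_RATE)
waveform = torch.sin(2 * torch.pi * 220.0 * t).unsqueeze(0)
features = to_features(waveform)
print(features.shape)  # (1, n_mels, n_frames)
```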
People | ORCID iD
---|---
Dan Goodman (Primary Supervisor) |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/W524323/1 | | | 30/09/2022 | 29/09/2028 |
2894189 | Studentship | EP/W524323/1 | 29/09/2023 | 29/03/2027 |