Enhanced AI Perception through Unified Joint Embedding of Multimodal Sensory Data
Lead Research Organisation:
UNIVERSITY OF OXFORD
Department Name: Computer Science
Abstract
Artificial Intelligence (AI), a field that aims to create machines able to think and learn like humans, and its subset Deep Learning, which uses multi-layered neural networks loosely modeled on the human brain, rely heavily on large collections of labeled training data. This dependency often hinders their application in dynamic, real-world scenarios. In contrast, humans natively process and intertwine multiple senses: hearing the symphony of urban sounds, feeling an object's fine texture, and judging distance visually. This natural ability to blend our senses and interpret our surroundings may be the missing link in AI's evolution. It raises the central research question: can the integration of multisensory data close the gap between human cognition and machine learning, so that machines learn more from natural sensory experience and less from extensive labeled data?
The objectives of this research are as follows. First, this research aims to develop multimodal computational models capable of extracting structure from diverse sensory inputs. To address the scarcity of labeled multimodal datasets, we leverage naturally occurring paired data to disentangle modality-specific information and then integrate it, drawing inspiration from human learning processes initiated with audio-visual cues (e.g., we match the sound of a specific bird to its photo). Second, this research seeks to extend AI's perception range by incorporating novel sensory modalities, including thermal data, tactile signals, and spatial depth, augmenting AI's perceptual range beyond the conventional modalities of vision and audio. Third, this research aims to establish a holistic perception system: the developed multimodal computational model will be trained to process a wide array of sensory modalities concurrently, including text, visual cues, auditory signals, spatial depth, thermal readings, IMU data, and tactile signals.
The novelty of the research methodology is as follows. First, this research uses contrastive learning, a technique that enables models to identify similarities and differences across data points from different modalities. Consequently, the models can correlate patterns and build connections across multiple modalities. Second, this research explores and associates novel modality pairs, such as visual-depth and visual-touch, which have not been extensively researched before. Moreover, this research aims to move beyond traditional dual-modality embeddings and to develop a unified embedding landscape from diverse, naturally co-occurring modalities. In this context, embeddings are mathematical representations that condense complex data into a form machines can interpret. By leveraging state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), we aim to identify features specific to each rare modality, match the pairs, and establish connections between them. This is facilitated by the zero-shot learning capabilities of LLMs/VLMs, which enable models to interpret and execute tasks they have never encountered before. Consequently, the model can better encode information from various sensory modalities into a unified, joint embedding space.
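The cross-modal contrastive idea described above can be sketched as follows. This is a minimal illustration of a symmetric InfoNCE-style loss over naturally paired samples (e.g., an image and its depth map), not the project's actual implementation; the function and parameter names are ours.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss for a batch of paired embeddings.

    emb_a, emb_b: (batch, dim) arrays whose i-th rows describe the same
    underlying event observed through two different modalities. The loss
    pulls matching pairs together and pushes mismatched pairs apart.
    """
    # L2-normalise so that dot products become cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(l):
        # softmax cross-entropy where the true pair sits on the diagonal
        exp = np.exp(l - l.max(axis=1, keepdims=True))
        probs = exp / exp.sum(axis=1, keepdims=True)
        return -np.log(np.diag(probs)).mean()

    # average both retrieval directions: A -> B and B -> A
    return (cross_entropy_on_diagonal(logits)
            + cross_entropy_on_diagonal(logits.T)) / 2
```

In practice the two inputs would come from per-modality encoders (vision, depth, touch, etc.) whose outputs are trained so that co-occurring observations land near each other in the shared embedding space.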
This project falls within the EPSRC's Artificial Intelligence Technologies research area. We aspire to improve AI's perception through the exploration of joint embedding of multimodal sensory data. Our refined approach aims to develop a more interconnected and enriched embedding space, enhancing its adaptability to a variety of downstream tasks. For example, in the field of robotic automation, the unified joint embedding derived from this research has the potential to significantly improve robots' perception and revolutionize operational efficiency across different scenarios.
Organisations
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/W524311/1 | | | 30/09/2022 | 29/09/2028 |
2874479 | Studentship | EP/W524311/1 | 30/09/2023 | 30/03/2027 | Chenyang Ma