Enhanced AI Perception through Unified Joint Embedding of Multimodal Sensory Data
Lead Research Organisation:
UNIVERSITY OF OXFORD
Department Name: Computer Science
Abstract
Artificial Intelligence (AI), a field that aims to create machines able to think and learn like humans, and its subset Deep Learning, which uses multi-layered neural networks loosely modeled on the human brain, rely heavily on large collections of labeled training data. This dependency often hinders their application in dynamic, real-world scenarios. In contrast, humans natively process and intertwine multiple senses: hearing the symphony of urban sounds, feeling an object's fine texture, and judging distance visually. This natural ability to blend our senses and interpret our surroundings may be the missing link in AI's evolution. It raises the central research question: can the integration of multisensory data close the gap between human cognition and machine learning, so that machines learn more from natural sensory experience and less from extensive labeled data?
The objectives of this research are as follows. First, this research aims to develop multimodal computational models capable of extracting structure from diverse sensory inputs. To address the scarcity of labeled multimodal datasets, we leverage naturally occurring paired data to disentangle modality-specific information and then integrate it, drawing inspiration from human learning processes initiated with audio-visual cues (e.g., we match the sound of a specific bird to its photo). Second, this research seeks to extend AI's perception range by incorporating novel sensory modalities, including thermal data, tactile signals, and spatial depth, augmenting AI's perceptual range beyond the conventional modalities of vision and audio. Third, this research aims to establish a holistic perception system: the developed multimodal computational model will be trained to process a wide array of sensory modalities concurrently, including text, visual cues, auditory signals, spatial depth, thermal readings, IMU data, and tactile signals.
The novelty of the research methodology is as follows. First, this research uses contrastive learning, a technique that enables models to identify similarities and differences across data points from different modalities. Consequently, the models can correlate patterns and build connections across multiple modalities. Second, this research explores and associates novel modality pairs, such as visual-depth and visual-touch, which have not been extensively researched before. Moreover, this research aims to move beyond traditional dual-modality embeddings and to develop a unified embedding landscape from diverse, naturally co-occurring modalities. In this context, embeddings are mathematical representations that condense complex data into a form machines can interpret. By leveraging state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), we aim to identify features specific to each rare modality, match the pairs, and establish connections between them. This is facilitated by the zero-shot learning capabilities of LLMs/VLMs, which enable models to interpret and execute tasks they have never encountered before. Consequently, the model can better encode information from various sensory modalities into a unified, joint embedding space.
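The cross-modal contrastive idea described above can be sketched as follows. This is a minimal illustration of a symmetric InfoNCE-style loss over naturally paired samples (e.g., an image and its depth map), not the project's actual implementation; the function and parameter names are ours.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss for a batch of paired embeddings.

    emb_a, emb_b: (batch, dim) arrays whose i-th rows describe the same
    underlying event observed through two different modalities. The loss
    pulls matching pairs together and pushes mismatched pairs apart.
    """
    # L2-normalise so that dot products become cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(l):
        # softmax cross-entropy where the true pair sits on the diagonal
        exp = np.exp(l - l.max(axis=1, keepdims=True))
        probs = exp / exp.sum(axis=1, keepdims=True)
        return -np.log(np.diag(probs)).mean()

    # average both retrieval directions: A -> B and B -> A
    return (cross_entropy_on_diagonal(logits)
            + cross_entropy_on_diagonal(logits.T)) / 2
```

In practice the two inputs would come from per-modality encoders (vision, depth, touch, etc.) whose outputs are trained so that co-occurring observations land near each other in the shared embedding space.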
This project falls within the EPSRC's Artificial Intelligence Technologies research area. We aspire to improve AI's perception through the exploration of joint embedding of multimodal sensory data. Our refined approach aims to develop a more interconnected and enriched embedding space, enhancing its adaptability to a variety of downstream tasks. For example, in the field of robotic automation, the unified joint embedding derived from this research has the potential to significantly improve robots' perception and revolutionize operational efficiency across different scenarios.
Organisations
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/W524311/1 | | | 30/09/2022 | 29/09/2028 |
2874479 | Studentship | EP/W524311/1 | 30/09/2023 | 30/03/2027 | Chenyang Ma