Audio-Visual Egocentric Video Understanding

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

By nature, human learning is multi-modal. We combine information from multiple senses, such as touch, sound, and sight, to gain a better understanding of the world, and we commonly combine these modalities in order to learn how to complete tasks. For example, when learning a musical instrument, we typically use both sight and hearing to understand how different keys on a piano, or frets on a guitar, produce different sounds and therefore create music. Furthermore, one modality can help us follow the continuation of an action even when another modality shifts: when watching a chef fry something in a frying pan in a cooking video, if the camera moves and no longer shows what is in the pan, we can still use the sizzling sound to understand that the frying is continuing. In the context of deep learning, however, the auditory data linked to the video stream is a commonly underutilised resource, and the potential performance gains from integrating this audio data are often neglected. It therefore seems logical to design and optimise audio-visual models for video understanding tasks, both to better model multi-modal human learning and to improve performance over uni-modal solutions.

However, this is no trivial challenge, as it is not simply a case of optimising each modality separately and then combining them. There are multiple nuances and considerations in combining the modalities, and these change between video understanding tasks, datasets, architectures, and other aspects of deep learning. They include: how do we fuse the audio and video streams? At what point in the model do we fuse them? Once they are fused, how do we allow the modalities to communicate with each other? In our work, we seek to answer these questions, investigating a wide spectrum of audio-visual action recognition methods with the aim of improving recognition accuracy, whilst developing and training models on the large-scale egocentric dataset Epic-Kitchens. This project falls within the image and vision computing research area of EPSRC; its most obvious real-world application is robot learning, which is further assisted by the egocentric (first-person) nature of the video data we use. However, the work can apply to any real-world application that involves computer vision and is not necessarily restricted to robotics.
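To make the fusion questions above concrete, the sketch below contrasts two common strategies in PyTorch: mid-level (feature-level) fusion by concatenation and late (score-level) fusion by averaging class scores. It is a minimal illustration only; the module names, feature dimensions, and class count are assumptions for the example and do not describe the project's actual architecture, and richer answers to the "communication" question (e.g. cross-modal attention) would replace the simple concatenation with attention layers.

```python
# Minimal sketch of two audio-visual fusion strategies for action recognition.
# All names, dimensions and the class count are illustrative assumptions.
import torch
import torch.nn as nn


class MidFusionClassifier(nn.Module):
    """Feature-level fusion: concatenate audio and visual features, classify jointly."""

    def __init__(self, dim_audio=128, dim_visual=2048, num_classes=100):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(dim_audio + dim_visual, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, dim_audio), visual_feat: (B, dim_visual)
        return self.fuse(torch.cat([audio_feat, visual_feat], dim=-1))


class LateFusionClassifier(nn.Module):
    """Score-level fusion: classify each modality separately, average the scores."""

    def __init__(self, dim_audio=128, dim_visual=2048, num_classes=100):
        super().__init__()
        self.audio_head = nn.Linear(dim_audio, num_classes)
        self.visual_head = nn.Linear(dim_visual, num_classes)

    def forward(self, audio_feat, visual_feat):
        return 0.5 * (self.audio_head(audio_feat) + self.visual_head(visual_feat))


if __name__ == "__main__":
    audio = torch.randn(4, 128)    # e.g. pooled log-mel spectrogram features
    visual = torch.randn(4, 2048)  # e.g. pooled video backbone features
    for model in (MidFusionClassifier(), LateFusionClassifier()):
        logits = model(audio, visual)
        print(type(model).__name__, logits.shape)  # class scores, shape (4, 100)
```

The two classes differ only in where the modalities meet: mid-level fusion lets the classifier learn cross-modal interactions from joint features, while late fusion keeps each stream independent until the final scores, which is simpler but limits how the modalities can inform one another.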

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/T517872/1      |              |              | 01/10/2020 | 30/09/2025 |
2615061           | Studentship  | EP/T517872/1 | 01/10/2021 | 31/03/2025 | Jacob Chalk