Cross-modal egocentric activity recognition and zero-shot learning

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

The availability of low-cost wearable cameras has renewed interest in first-person human activity analysis. Recognising first-person activities poses important challenges, such as rapid changes in illumination, significant camera motion and complex hand-object manipulations. In recent years, advances in deep learning have significantly influenced the computer vision community, as convolutional networks have given impressive results in tasks such as object recognition and detection, scene understanding and image segmentation. Convolutional networks have also been used successfully in first-person activity recognition. Before the emergence of deep learning, the first-person vision community focused on engineering egocentric features that capture properties of the first-person point of view, such as hand-object interactions and gaze. Convolutional networks allow such features to be learnt automatically from large amounts of data, eliminating the need for hand-designed features.

In this work, we focus on activity recognition with convolutional networks. Influenced by the recent success of multi-stream architectures, we investigate their applicability to egocentric videos by employing multiple modalities to train the models. An important observation that motivates us is that humans combine their senses, such as hearing and vision, to understand concepts of the world. To this end, we propose to employ both video and audio towards more accurate activity recognition. Specifically, we will investigate how shared, aligned representations can be learnt using the multi-stream paradigm. Moreover, we are interested in temporal feature pooling methods that leverage information spanning the whole video, as in many cases the whole video must be observed in order to discriminate between similar actions. Our final goal is to employ these ideas in zero-shot learning. Zero-shot learning is the ability to solve a task without having received any training examples of that task; for example, recognising activities without having seen any video of those activities during training. This can be done by combining the knowledge of classifiers trained on other classes (not the ones to be predicted in the zero-shot setting) with additional knowledge about the new classes.
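As a concrete illustration of the zero-shot idea above, the following is a minimal, hypothetical sketch in PyTorch: a learned projection maps video features into a semantic embedding space containing class embeddings (e.g. word vectors of class names), and an unseen class is predicted by nearest-neighbour search among those embeddings. The names, dimensions and random embeddings are illustrative assumptions only, not the project's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions: feat_dim for video features, embed_dim for the
# semantic space, num_classes including classes unseen during training.
feat_dim, embed_dim, num_classes = 512, 300, 8
video_to_semantic = nn.Linear(feat_dim, embed_dim)  # learned projection

# Placeholder class embeddings; in practice these would come from external
# knowledge (e.g. word vectors), not random initialisation.
class_embeddings = F.normalize(torch.randn(num_classes, embed_dim), dim=1)

def predict_zero_shot(video_feature):
    """Predict the class whose embedding is closest (cosine similarity)."""
    z = F.normalize(video_to_semantic(video_feature), dim=-1)
    return (z @ class_embeddings.t()).argmax(dim=-1)

print(predict_zero_shot(torch.randn(2, feat_dim)))  # e.g. tensor([3, 6])
```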

Publications


Studentship Projects

Project Reference | Relationship | Related To | Start | End | Student Name
EP/N509619/1 | | | 01/10/2016 | 30/09/2021 |
1971464 | Studentship | EP/N509619/1 | 01/10/2017 | 31/03/2021 | Evangelos Kazakos
 
Description In my research, I have developed novel ways of combining audio and vision for egocentric action recognition in videos, similar to how humans combine their senses to understand the world. We found that audio is very important for recognising egocentric actions and can play a key role in assisting wearable technologies. Combining audio and vision in videos is particularly challenging because, beyond deciding how the two modalities should be combined, it is also important to decide which temporal parts of each modality should be combined; this is called binding. I developed a model for audio-visual binding in egocentric action recognition which significantly improved recognition performance compared to models that use vision alone.
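The following is a minimal, hypothetical sketch of the binding idea described above, written in PyTorch. It is not the published Temporal Binding Network: the tiny encoders, layer sizes and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalBindingSketch(nn.Module):
    """For each temporal window, the RGB frame and the audio spectrogram sampled
    inside that window are encoded, their features are concatenated (the binding
    step) and fused; per-window predictions are then averaged over the video."""

    def __init__(self, num_classes=10, feat_dim=128):
        super().__init__()
        # Tiny placeholder encoders (real models would use deep backbones).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.fusion = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames, spectrograms):
        # frames: (B, T, 3, H, W); spectrograms: (B, T, 1, mel_bins, time)
        logits = []
        for t in range(frames.shape[1]):
            v = self.visual(frames[:, t])
            a = self.audio(spectrograms[:, t])
            fused = self.fusion(torch.cat([v, a], dim=1))  # bind within window t
            logits.append(self.classifier(fused))
        return torch.stack(logits, dim=1).mean(dim=1)      # aggregate over windows

model = TemporalBindingSketch()
out = model(torch.randn(2, 3, 3, 112, 112), torch.randn(2, 3, 1, 64, 100))
print(out.shape)  # torch.Size([2, 10])
```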

Furthermore, I explored architectures that draw inspiration from neuroscience, focusing on recognising events/actions from audio alone. In neuroscience, auditory information is processed along two streams: the ventral stream for recognising sounds and the dorsal stream for localising them. We designed an architecture inspired by this idea and obtained better results than previous methods.
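A minimal, hypothetical sketch of such a two-stream audio architecture is given below (PyTorch). It is not the released model; the channel counts, subsampling rate and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamAudioSketch(nn.Module):
    """A 'slow' stream with more channels sees a temporally subsampled
    spectrogram (coarse timing, semantic content), while a 'fast' stream with
    fewer channels sees the full temporal resolution (fine timing); their
    pooled features are concatenated and classified."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.slow = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fast = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(64 + 8, num_classes)

    def forward(self, spectrogram):             # (B, 1, mel_bins, time)
        slow_input = spectrogram[:, :, :, ::4]  # subsample the time axis
        features = torch.cat([self.slow(slow_input), self.fast(spectrogram)], dim=1)
        return self.classifier(features)

model = TwoStreamAudioSketch()
print(model(torch.randn(2, 1, 64, 400)).shape)  # torch.Size([2, 10])
```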
Exploitation Route Models and features from my work have been used by other researchers, including for action retrieval and domain adaptation. Open-source code and models are publicly available, and the publications are well cited.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description Software and models released for industry and research
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Title EPIC-KITCHENS-100 
Description Extended footage for the EPIC-KITCHENS dataset, to 100 hours. For automatic annotations, see the separate dataset at: https://doi.org/10.5523/bris.3l8eci2oqgst92n14w2yqi5ytu (10/09/2020). **N.b. please also see the ERRATUM published at https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/README.md#erratum**
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Widely used by the computer vision community. Open for research. There are several open challenges in which users can participate, building methods to address the open problems the dataset poses.
URL https://data.bris.ac.uk/data/dataset/2g1n6qdydwa9u22shpxqzp0t8m/
 
Title EPIC-Kitchens 
Description The largest dataset in first-person vision, fully annotated, with open challenges for object detection, action recognition and action anticipation
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Open challenges with 15 different universities and research centres competing to win them.
URL http://epic-kitchens.github.io/
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Catania
Country Italy 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact ECCV 2018 publication, TPAMI publication under review
Start Year 2017
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Toronto
Country Canada 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact ECCV 2018 publication, TPAMI publication under review
Start Year 2017
 
Description University of Oxford - Audio-visual Fusion for Egocentric Videos 
Organisation University of Oxford
Country United Kingdom 
Sector Academic/University 
PI Contribution Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani
Collaborator Contribution ICCV 2019 publication and code base
Impact (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV).
Start Year 2018
 
Title Auditory Slow-Fast Networks 
Description A multi-stream audio convolutional network architecture for audio recognition 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact Early release; too early to say.
 
Title Temporal Binding Network (TBN) 
Description A Convolutional Network based model for Audio-Visual Egocentric Action Recognition 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact The software became popular: 78 stars and 17 forks on GitHub.
URL http://openaccess.thecvf.com/content_ICCV_2019/papers/Kazakos_EPIC-Fusion_Audio-Visual_Temporal_Bind...
 
Description Attendance of 2020 Conference on Neural Information Processing Systems, Dec 6, 2020 - Dec 12, 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I attended several of the conference talks to strengthen my knowledge of the more theoretical aspects of computer vision and machine learning.
Year(s) Of Engagement Activity 2020
 
Description Attendance of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I attended the conference, which covers a slightly different area from my main focus, i.e. acoustics, speech and signal processing, to improve my knowledge of audio processing and architectures, as my work focuses on audio-visual fusion. I wanted to strengthen my knowledge in the audio domain.
Year(s) Of Engagement Activity 2020
 
Description BMVA Symposium 2019 in London 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We attended a computer vision symposium in London as a team, with my advisor and other PhD students from my group. There were talks from senior researchers/professors, and there was also a poster session where I presented my work.
Year(s) Of Engagement Activity 2019
 
Description PAISS 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I attended the PAISS Artificial Intelligence Summer School.
Year(s) Of Engagement Activity 2018
URL https://project.inria.fr/paiss/home-2018/
 
Description Poster presentation in ICCV 2019 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave a poster presentation for my paper, which was accepted at the main conference.
Year(s) Of Engagement Activity 2019
URL http://iccv2019.thecvf.com