Cross-modal egocentric activity recognition and zero-shot learning

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

The availability of low-cost wearable cameras has renewed interest in first-person human activity analysis. Recognising first-person activities poses important challenges, such as rapid illumination changes, significant camera motion and complex hand-object manipulations. In recent years, advances in deep learning have significantly influenced the computer vision community, as convolutional networks have produced impressive results in tasks such as object recognition and detection, scene understanding and image segmentation. Convolutional networks have also been used successfully in first-person activity recognition. Before the emergence of deep learning, the first-person vision community focused on engineering egocentric features that capture properties of the first-person point of view, such as hand-object interactions and gaze. Convolutional networks allow such features to be learnt automatically from large amounts of data, eliminating the need for hand-designed features.

In this work, we focus on activity recognition with convolutional networks. Influenced by the recent success of multi-stream architectures, we are investigating their applicability to egocentric videos by employing multiple modalities to train the models. An important observation that motivates us is that humans combine their senses, such as hearing and vision, to understand the world. To this end, we propose employing both video and audio towards more accurate activity recognition. Specifically, we will investigate how shared, aligned representations can be learnt using the multi-stream paradigm. Moreover, we are interested in temporal feature pooling methods that leverage information spanning the whole video, as in many cases the whole video must be observed to discriminate between similar actions. Our final goal is to apply these ideas to zero-shot learning: solving a task despite having received no training examples of it, for instance recognising activities without having seen any video of those activities during training. This can be done by transferring the knowledge of classifiers trained on other classes, combined with additional knowledge about the new classes to be predicted.
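The zero-shot idea above can be illustrated with a minimal NumPy sketch: a projection learnt on seen classes maps a video feature into a shared semantic (attribute) space, and an unseen class is predicted by nearest-neighbour matching against its attribute description. All class names, attribute vectors and dimensions below are hypothetical, chosen purely for illustration; this is not the project's actual model.

```python
import numpy as np

# Hypothetical unseen classes, each described only by semantic attributes
# (e.g. "uses hands", "makes sound", "involves an appliance"). No training
# videos of these classes are assumed to exist.
unseen_class_attributes = {
    "open_fridge": np.array([1.0, 0.0, 1.0]),
    "cut_vegetable": np.array([1.0, 1.0, 0.0]),
}

def zero_shot_predict(video_feature, projection, class_attributes):
    """Project a video feature into attribute space (projection assumed
    learnt on seen classes) and pick the unseen class whose attribute
    vector is most similar (cosine similarity)."""
    predicted = projection @ video_feature
    best_class, best_score = None, -np.inf
    for name, attrs in class_attributes.items():
        score = predicted @ attrs / (
            np.linalg.norm(predicted) * np.linalg.norm(attrs))
        if score > best_score:
            best_class, best_score = name, score
    return best_class

# Toy projection (3 attributes x 2 feature dims) and a toy video feature.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])
feature = np.array([2.0, 0.1])
print(zero_shot_predict(feature, W, unseen_class_attributes))
```

The key design point is that the projection and the attribute descriptions are the only bridge between seen and unseen classes; the quality of that shared representation determines zero-shot accuracy.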

Publications


Studentship Projects

Project Reference | Relationship | Related To | Start | End | Student Name
EP/N509619/1 | | | 01/10/2016 | 30/09/2021 |
1971464 | Studentship | EP/N509619/1 | 18/09/2017 | 31/03/2021 | Evangelos Kazakos
 
Description In my research, I have discovered novel ways of combining audio and vision for egocentric action recognition in videos, similar to how humans combine their senses to understand the world. We found that audio is very important for recognising egocentric actions and can play a key role in assisting wearable technologies. Combining audio and vision in videos is particularly challenging because, beyond deciding how the two modalities should be combined, it is also important to decide which temporal parts of each modality to combine, a problem known as binding. I developed a model for audio-visual binding in egocentric action recognition which significantly improved recognition performance compared to models that use only vision.
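The binding idea described above can be sketched in a few lines of NumPy: audio and visual features sampled from the same temporal window are fused mid-network, each window is classified, and scores are averaged over the video. This is only an illustrative sketch of the temporal-binding principle, not the released TBN implementation; all weights and dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_within_window(visual_feat, audio_feat, W_fusion):
    """Mid-level fusion: concatenate visual and audio features sampled
    from the SAME temporal window, then apply a shared fusion layer."""
    joint = np.concatenate([visual_feat, audio_feat])
    return np.maximum(W_fusion @ joint, 0.0)  # ReLU

def predict_video(visual_feats, audio_feats, W_fusion, W_cls):
    """Fuse each temporally aligned (visual, audio) pair, classify each
    window, then average class scores across the whole video."""
    scores = [W_cls @ fuse_within_window(v, a, W_fusion)
              for v, a in zip(visual_feats, audio_feats)]
    return np.mean(scores, axis=0)

# Toy setup: 4 temporal windows, 8-d visual features, 6-d audio features,
# 5 action classes. In practice these would come from trained CNN streams.
visual = [rng.standard_normal(8) for _ in range(4)]
audio = [rng.standard_normal(6) for _ in range(4)]
W_fusion = rng.standard_normal((16, 14)) * 0.1
W_cls = rng.standard_normal((5, 16)) * 0.1
video_scores = predict_video(visual, audio, W_fusion, W_cls)
print(video_scores.shape)  # one score per action class
```

The point of binding is visible in `fuse_within_window`: only features from the same temporal window are combined, so the model learns which temporally aligned audio and visual evidence belongs together before aggregating over the video.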
Exploitation Route -
Sectors Digital/Communication/Information Technologies (including Software)

 
Description Software and models released for industry and research
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Title EPIC-Kitchens 
Description Largest dataset in first-person vision, fully annotated with open challenges for object detection, action recognition and action anticipation 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Open challenges with 15 different universities and research centres competing for winning the relevant challenges. 
URL http://epic-kitchens.github.io/
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Catania
Country Italy 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact ECCV 2018 publication, TPAMI publication under review
Start Year 2017
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Toronto
Country Canada 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact ECCV 2018 publication, TPAMI publication under review
Start Year 2017
 
Description University of Oxford - Audio-visual Fusion for Egocentric Videos 
Organisation University of Oxford
Country United Kingdom 
Sector Academic/University 
PI Contribution Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani
Collaborator Contribution ICCV 2019 publication and code base
Impact (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV).
Start Year 2018
 
Title Temporal Binding Network (TBN) 
Description A Convolutional Network based model for Audio-Visual Egocentric Action Recognition 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Software became popular: 43 stars in GitHub, and 11 forks 
URL http://openaccess.thecvf.com/content_ICCV_2019/papers/Kazakos_EPIC-Fusion_Audio-Visual_Temporal_Bind...
 
Description PAISS 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Artificial Intelligence Summer School
Year(s) Of Engagement Activity 2018
URL https://project.inria.fr/paiss/home-2018/
 
Description Poster presentation in ICCV 2019 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave a poster presentation for my paper, which was accepted to the main conference
Year(s) Of Engagement Activity 2019
URL http://iccv2019.thecvf.com