Cross-modal egocentric activity recognition and zero-shot learning
Lead Research Organisation:
University of Bristol
Department Name: Computer Science
Abstract
The availability of low-cost wearable cameras has renewed interest in first-person human activity analysis. Recognising first-person activities poses important challenges, such as rapid changes in illumination, significant camera motion and complex hand-object manipulations. In recent years, advances in deep learning have significantly influenced the computer vision community, as convolutional networks have produced impressive results in tasks such as object recognition and detection, scene understanding and image segmentation. Convolutional networks have also been used successfully in first-person activity recognition. Before the emergence of deep learning, the first-person computer vision community focused on engineering egocentric features that capture properties of the first-person point of view, such as hand-object interactions and gaze. Convolutional networks allow such features to be learnt automatically from large amounts of data, eliminating the need for hand-designed features.
In this work, we focus on activity recognition with convolutional networks. Influenced by the recent success of multi-stream architectures, we investigate their applicability to egocentric videos by employing multiple modalities for training the models. An important observation that motivates us is that humans combine their senses, such as hearing and vision, to understand concepts of the world. To this end, we propose employing both video and sound for more accurate activity recognition. Specifically, we will investigate how shared, aligned representations can be learnt using the multi-stream paradigm. Moreover, we are interested in temporal feature pooling methods that leverage information spanning the whole video, as in many cases the whole video must be observed in order to discriminate between similar actions. Our final goal is to employ these ideas in zero-shot learning: solving a task despite having received no training examples of it, for example recognising activities without having seen any video of those activities during training. This can be done by combining the knowledge of classifiers trained on other classes (not the ones to be predicted in the zero-shot setting) with additional knowledge about the new classes.
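To make the multi-stream idea concrete, the following is a minimal sketch in PyTorch, assuming two off-the-shelf modality encoders; all names and shapes here are illustrative assumptions, not the project's code. Each modality is encoded per snippet, snippet features are pooled over the whole video (the temporal feature pooling mentioned above), and the pooled features are concatenated for classification.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Illustrative two-stream audio-visual model with whole-video pooling."""
    def __init__(self, visual_encoder: nn.Module, audio_encoder: nn.Module,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.visual_encoder = visual_encoder  # e.g. a CNN over RGB frames
        self.audio_encoder = audio_encoder    # e.g. a CNN over log-mel spectrograms
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, frames, spectrograms):
        # frames: (B, T, C, H, W) video snippets; spectrograms: (B, T, 1, F, S)
        B, T = frames.shape[:2]
        v = self.visual_encoder(frames.flatten(0, 1)).view(B, T, -1)
        a = self.audio_encoder(spectrograms.flatten(0, 1)).view(B, T, -1)
        # Temporal feature pooling: average snippet features across the whole
        # video so that evidence spread over time contributes to the prediction.
        v, a = v.mean(dim=1), a.mean(dim=1)
        return self.classifier(torch.cat([v, a], dim=-1))
```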
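The zero-shot step can be sketched similarly, assuming class-level semantic vectors (e.g. word embeddings of activity names) and a projection learnt on the seen classes; function and parameter names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(video_feats, projection, unseen_class_embeddings):
    """video_feats: (B, D) features from a model trained on seen classes only.
    projection: a module mapping D -> E, fitted on the seen classes.
    unseen_class_embeddings: (K, E) semantic vectors of the unseen classes."""
    z = F.normalize(projection(video_feats), dim=-1)   # (B, E)
    c = F.normalize(unseen_class_embeddings, dim=-1)   # (K, E)
    # Nearest unseen class by cosine similarity in the shared semantic space.
    return (z @ c.t()).argmax(dim=-1)
```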
People | ORCID iD |
---|---|
Evangelos Kazakos (Student) | |
Publications
Damen D (2018) Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
Damen D (2021) The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines. in IEEE Transactions on Pattern Analysis and Machine Intelligence
Kazakos E (2021) Slow-Fast Auditory Streams for Audio Recognition
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
EP/N509619/1 | | | 30/09/2016 | 29/09/2021 | |
1971464 | Studentship | EP/N509619/1 | 30/09/2017 | 30/03/2021 | Evangelos Kazakos |
Description | In my research, I have developed novel ways of combining audio and vision for egocentric action recognition in videos, similar to how humans combine their senses to understand the world. We found that audio is very important for recognising egocentric actions and can play a key role in assisting wearable technologies. Combining audio and vision in videos is particularly challenging because, besides deciding how the two modalities should be combined, it is also important to decide which temporal parts of each modality should be combined, a problem known as binding. I developed a model for audio-visual binding in egocentric action recognition which significantly improved recognition performance compared to models that use vision alone (see the sketch after this table). Furthermore, I explored architectures that draw inspiration from neuroscience, focusing solely on recognising events/actions from audio. In neuroscience, auditory information is processed by two streams, the ventral ("what") stream for recognition and the dorsal ("where") stream for localisation. We designed an architecture inspired by this idea and obtained better results than previous methods. |
Exploitation Route | Models and features from my work have been used by other researchers, including for action retrieval and domain adaptation. Open-source code and models are publicly available, and the publications are well cited. |
Sectors | Digital/Communication/Information Technologies (including Software) |
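The binding idea described above can be sketched roughly as follows, assuming PyTorch and per-window features: instead of fusing whole-video summaries, roughly synchronised audio and visual features are paired within short temporal windows, fused per window, and only then aggregated over time. The code is an illustrative assumption, not the released Temporal Binding Network implementation.

```python
import torch
import torch.nn as nn

class TemporalBindingFusion(nn.Module):
    """Illustrative within-window audio-visual fusion before temporal aggregation."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (B, W, D) -- one feature pair per binding
        # window, sampled so the two modalities are roughly synchronised.
        fused = self.fuse(torch.cat([visual_feats, audio_feats], dim=-1))  # (B, W, D)
        # Fuse within each window first, then aggregate windows over time.
        return self.classifier(fused.mean(dim=1))
```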
Description | Software and models released for industry and research |
First Year Of Impact | 2019 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Economic |
Title | EPIC-KITCHENS-100 |
Description | Extended Footage for EPIC-KITCHENS dataset, to 100 hours of footage. For automatic annotations, see separate dataset at: https://doi.org/10.5523/bris.3l8eci2oqgst92n14w2yqi5ytu 10/09/2020 **N.b. please also see ERRATUM published at https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/README.md#erratum** |
Type Of Material | Database/Collection of data |
Year Produced | 2020 |
Provided To Others? | Yes |
Impact | Widely used by the computer vision community and open for research. There are several open challenges in which users can participate, building methods to address the open problems the dataset poses. |
URL | https://data.bris.ac.uk/data/dataset/2g1n6qdydwa9u22shpxqzp0t8m/ |
Title | EPIC-Kitchens |
Description | Largest dataset in first-person vision, fully annotated with open challenges for object detection, action recognition and action anticipation |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Impact | Open challenges with 15 different universities and research centres competing to win the relevant challenges. |
URL | http://epic-kitchens.github.io/ |
Description | EPIC-Kitchens Dataset Collection |
Organisation | University of Catania |
Country | Italy |
Sector | Academic/University |
PI Contribution | Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities |
Collaborator Contribution | Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna) |
Impact | ECCV 2018 publication, TPAMI publication under review |
Start Year | 2017 |
Description | EPIC-Kitchens Dataset Collection |
Organisation | University of Toronto |
Country | Canada |
Sector | Academic/University |
PI Contribution | Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities |
Collaborator Contribution | Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna) |
Impact | ECCV 2018 publication, TPAMI publication under review |
Start Year | 2017 |
Description | University of Oxford - Audio-visual Fusion for Egocentric Videos |
Organisation | University of Oxford |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani |
Collaborator Contribution | ICCV 2019 publication and code base |
Impact | (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV). |
Start Year | 2018 |
Title | Auditory Slow-Fast Networks |
Description | A multi-stream audio convolutional network architecture for audio recognition (see the sketch after this record)
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | Early release; too early to say. |
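A schematic sketch of a two-stream audio network in this spirit, assuming PyTorch and log-mel spectrogram input; the layer sizes and strides are illustrative assumptions, not the released Auditory Slow-Fast architecture. One stream trades temporal resolution for channel capacity, while the other keeps the full temporal rate with fewer channels.

```python
import torch
import torch.nn as nn

class AuditoryTwoStream(nn.Module):
    """Illustrative two-stream audio CNN over log-mel spectrograms."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Slow stream: temporally strided (coarse in time) but wide (64 channels).
        self.slow = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=(4, 1), padding=1),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        # Fast stream: full temporal resolution but narrow (8 channels).
        self.fast = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=(1, 1), padding=1),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64 + 8, num_classes)

    def forward(self, spec):
        # spec: (B, 1, T, F) log-mel spectrogram (time x frequency).
        s = self.slow(spec).flatten(1)   # (B, 64)
        f = self.fast(spec).flatten(1)   # (B, 8)
        return self.classifier(torch.cat([s, f], dim=-1))
```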
Title | Temporal Binding Network (TBN) |
Description | A Convolutional Network based model for Audio-Visual Egocentric Action Recognition |
Type Of Technology | Software |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | The software became popular: 78 stars on GitHub and 17 forks |
URL | http://openaccess.thecvf.com/content_ICCV_2019/papers/Kazakos_EPIC-Fusion_Audio-Visual_Temporal_Bind... |
Description | Attendance of 2020 Conference on Neural Information Processing Systems, Dec 6, 2020 - Dec 12, 2020 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | I attended several talks at the conference to strengthen my knowledge of the more theoretical aspects of computer vision and machine learning.
Year(s) Of Engagement Activity | 2020 |
Description | Attendance of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | I attended this conference, whose focus on acoustics, speech and signal processing lies slightly outside my main area, in order to strengthen my knowledge of audio processing and architectures, as my work focuses on audio-visual fusion.
Year(s) Of Engagement Activity | 2020 |
Description | BMVA Symposium 2019 in London |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | We attended a computer vision symposium in London as a team, with my advisor and other PhD students from my group. There were talks from senior researchers and professors, and a poster session where I presented my work.
Year(s) Of Engagement Activity | 2019 |
Description | PAISS |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Artificial Intelligence Summer School |
Year(s) Of Engagement Activity | 2018 |
URL | https://project.inria.fr/paiss/home-2018/ |
Description | Poster presentation in ICCV 2019 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | I gave a poster presentation for my paper, which was accepted at the main conference
Year(s) Of Engagement Activity | 2019 |
URL | http://iccv2019.thecvf.com |