Cross-modal egocentric activity recognition and zero-shot learning
Lead Research Organisation:
University of Bristol
Department Name: Computer Science
Abstract
The availability of low-cost wearable cameras has renewed interest in first-person human activity analysis. Recognising first-person activities poses important challenges, such as rapid changes in illumination, significant camera motion and complex hand-object manipulations. In recent years, advances in deep learning have significantly influenced the computer vision community, as convolutional networks have produced impressive results in tasks such as object recognition and detection, scene understanding and image segmentation. Convolutional networks have also been used successfully in first-person activity recognition. Before the emergence of deep learning, the first-person computer vision community focused on engineering egocentric features that capture properties of the first-person point of view, such as hand-object interactions and gaze. Convolutional networks allow such features to be learnt automatically from large amounts of data, eliminating the need for hand-designed features.
In this work, we focus on activity recognition with convolutional networks. Influenced by the recent success of multi-stream architectures, we investigate their applicability to egocentric videos by employing multiple modalities for training the models. An important observation that motivates us is that humans combine their senses, such as hearing and vision, to understand concepts of the world. To this end, we propose employing both video and sound for more accurate activity recognition. Specifically, we will investigate how shared, aligned representations can be learnt using the multi-stream paradigm. Moreover, we are interested in temporal feature pooling methods that leverage information spanning the whole video, as in many cases the whole video must be observed in order to discriminate between similar actions. Our final goal is to employ these ideas in zero-shot learning: solving a task despite having received no training examples of it, for example recognising activities without having seen any video of those activities during training. This can be done by combining the knowledge of classifiers trained on other classes (not the ones to be predicted in the zero-shot setting) with additional knowledge about the new classes.
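To make the multi-stream idea concrete, the following is a minimal sketch in PyTorch, assuming two off-the-shelf modality encoders; all names and shapes here are illustrative assumptions, not the project's code. Each modality is encoded per snippet, snippet features are pooled over the whole video (the temporal feature pooling mentioned above), and the pooled features are concatenated for classification.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Illustrative two-stream audio-visual model with whole-video pooling."""
    def __init__(self, visual_encoder: nn.Module, audio_encoder: nn.Module,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.visual_encoder = visual_encoder  # e.g. a CNN over RGB frames
        self.audio_encoder = audio_encoder    # e.g. a CNN over log-mel spectrograms
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, frames, spectrograms):
        # frames: (B, T, C, H, W) video snippets; spectrograms: (B, T, 1, F, S)
        B, T = frames.shape[:2]
        v = self.visual_encoder(frames.flatten(0, 1)).view(B, T, -1)
        a = self.audio_encoder(spectrograms.flatten(0, 1)).view(B, T, -1)
        # Temporal feature pooling: average snippet features across the whole
        # video so that evidence spread over time contributes to the prediction.
        v, a = v.mean(dim=1), a.mean(dim=1)
        return self.classifier(torch.cat([v, a], dim=-1))
```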
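The zero-shot step can be sketched similarly, assuming class-level semantic vectors (e.g. word embeddings of activity names) and a projection learnt on the seen classes; function and parameter names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(video_feats, projection, unseen_class_embeddings):
    """video_feats: (B, D) features from a model trained on seen classes only.
    projection: a module mapping D -> E, fitted on the seen classes.
    unseen_class_embeddings: (K, E) semantic vectors of the unseen classes."""
    z = F.normalize(projection(video_feats), dim=-1)   # (B, E)
    c = F.normalize(unseen_class_embeddings, dim=-1)   # (K, E)
    # Nearest unseen class by cosine similarity in the shared semantic space.
    return (z @ c.t()).argmax(dim=-1)
```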
People | ORCID iD |
---|---|
Evangelos Kazakos (Student) | |
Publications
Damen D (2018) Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
Damen D (2021) The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines. in IEEE Transactions on Pattern Analysis and Machine Intelligence
Kazakos E (2021) Slow-Fast Auditory Streams for Audio Recognition
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
EP/N509619/1 | | | 30/09/2016 | 29/09/2021 | |
1971464 | Studentship | EP/N509619/1 | 30/09/2017 | 30/03/2021 | Evangelos Kazakos |
Description | In my research, I have developed novel ways of combining audio and vision for egocentric action recognition in videos, similar to how humans combine their senses to understand the world. We found that audio is very important for recognising egocentric actions and can play a key role in assisting wearable technologies. Combining audio and vision in videos is particularly challenging because, besides deciding how the two modalities should be combined, it is also important to decide which temporal parts of each modality should be combined, a problem known as binding. I developed a model for audio-visual binding in egocentric action recognition which significantly improved recognition performance compared to models that use vision alone (see the sketch after this table). Furthermore, I explored architectures that draw inspiration from neuroscience, focusing solely on recognising events/actions from audio. In neuroscience, auditory information is processed by two streams, the ventral ("what") stream for recognition and the dorsal ("where") stream for localisation. We designed an architecture inspired by this idea and obtained better results than previous methods. |
Exploitation Route | Models and features from my work have been used by other researchers, including for action retrieval and domain adaptation. Open-source code and models are publicly available, and the publications are well cited. |
Sectors | Digital/Communication/Information Technologies (including Software) |
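The binding idea described above can be sketched roughly as follows, assuming PyTorch and per-window features: instead of fusing whole-video summaries, roughly synchronised audio and visual features are paired within short temporal windows, fused per window, and only then aggregated over time. The code is an illustrative assumption, not the released Temporal Binding Network implementation.

```python
import torch
import torch.nn as nn

class TemporalBindingFusion(nn.Module):
    """Illustrative within-window audio-visual fusion before temporal aggregation."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (B, W, D) -- one feature pair per binding
        # window, sampled so the two modalities are roughly synchronised.
        fused = self.fuse(torch.cat([visual_feats, audio_feats], dim=-1))  # (B, W, D)
        # Fuse within each window first, then aggregate windows over time.
        return self.classifier(fused.mean(dim=1))
```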
Description | Software and models released for industry and research |
First Year Of Impact | 2019 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Economic |
Title | EPIC-KITCHENS-100 |
Description | Extended Footage for EPIC-KITCHENS dataset, to 100 hours of footage. For automatic annotations, see separate dataset at: https://doi.org/10.5523/bris.3l8eci2oqgst92n14w2yqi5ytu 10/09/2020 **N.b. please also see ERRATUM published at https://github.com/epic-kitchens/epic-kitchens-100-annotations/blob/master/README.md#erratum** |
Type Of Material | Database/Collection of data |
Year Produced | 2020 |
Provided To Others? | Yes |
Impact | Widely used by the computer vision community and open for research. There are several open challenges in which users can participate, building methods to address the open problems the dataset poses. |
URL | https://data.bris.ac.uk/data/dataset/2g1n6qdydwa9u22shpxqzp0t8m/ |
Title | EPIC-Kitchens |
Description | Largest dataset in first-person vision, fully annotated with open challenges for object detection, action recognition and action anticipation |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Impact | Open challenges with 15 different universities and research centres competing to win the relevant challenges. |
URL | http://epic-kitchens.github.io/ |
Description | EPIC-Kitchens Dataset Collection |
Organisation | University of Catania |
Country | Italy |
Sector | Academic/University |
PI Contribution | Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities |
Collaborator Contribution | Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna) |
Impact | ECCV 2018 publication, TPAMI publication under review |
Start Year | 2017 |
Description | EPIC-Kitchens Dataset Collection |
Organisation | University of Toronto |
Country | Canada |
Sector | Academic/University |
PI Contribution | Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities |
Collaborator Contribution | Effort time of partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella) in addition to time of their research team members (Dr Antonino Furnari and Mr David Acuna) |
Impact | ECCV 2018 publication, TPAMI publication under review |
Start Year | 2017 |
Description | University of Oxford - Audio-visual Fusion for Egocentric Videos |
Organisation | University of Oxford |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani |
Collaborator Contribution | ICCV 2019 publication and code base |
Impact | (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV). |
Start Year | 2018 |
Title | Auditory Slow-Fast Networks |
Description | A multi-stream audio convolutional network architecture for audio recognition (see the sketch after this record)
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | Early release; too early to say. |
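A schematic sketch of a two-stream audio network in this spirit, assuming PyTorch and log-mel spectrogram input; the layer sizes and strides are illustrative assumptions, not the released Auditory Slow-Fast architecture. One stream trades temporal resolution for channel capacity, while the other keeps the full temporal rate with fewer channels.

```python
import torch
import torch.nn as nn

class AuditoryTwoStream(nn.Module):
    """Illustrative two-stream audio CNN over log-mel spectrograms."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Slow stream: temporally strided (coarse in time) but wide (64 channels).
        self.slow = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=(4, 1), padding=1),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        # Fast stream: full temporal resolution but narrow (8 channels).
        self.fast = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=(1, 1), padding=1),
            nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64 + 8, num_classes)

    def forward(self, spec):
        # spec: (B, 1, T, F) log-mel spectrogram (time x frequency).
        s = self.slow(spec).flatten(1)   # (B, 64)
        f = self.fast(spec).flatten(1)   # (B, 8)
        return self.classifier(torch.cat([s, f], dim=-1))
```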
Title | Temporal Binding Network (TBN) |
Description | A Convolutional Network based model for Audio-Visual Egocentric Action Recognition |
Type Of Technology | Software |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | The software became popular: 78 stars on GitHub and 17 forks |
URL | http://openaccess.thecvf.com/content_ICCV_2019/papers/Kazakos_EPIC-Fusion_Audio-Visual_Temporal_Bind... |
Description | Attendance of 2020 Conference on Neural Information Processing Systems, Dec 6, 2020 - Dec 12, 2020 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | I attended several talks at the conference to strengthen my knowledge of the more theoretical aspects of computer vision and machine learning.
Year(s) Of Engagement Activity | 2020 |
Description | Attendance of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | I attended this conference, whose focus on acoustics, speech and signal processing lies slightly outside my main area, in order to strengthen my knowledge of audio processing and architectures, as my work focuses on audio-visual fusion.
Year(s) Of Engagement Activity | 2020 |
Description | BMVA Symposium 2019 in London |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | We attended a computer vision symposium in London as a team, with my advisor and other PhD students from my group. There were talks from senior researchers and professors, and a poster session where I presented my work.
Year(s) Of Engagement Activity | 2019 |
Description | PAISS |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Artificial Intelligence Summer School |
Year(s) Of Engagement Activity | 2018 |
URL | https://project.inria.fr/paiss/home-2018/ |
Description | Poster presentation in ICCV 2019 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | I gave a poster presentation for my paper, which was accepted at the main conference
Year(s) Of Engagement Activity | 2019 |
URL | http://iccv2019.thecvf.com |