UMPIRE: United Model for the Perception of Interactions in visuoauditory REcognition

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

Humans interact with tens of objects daily, at home (e.g. cooking/cleaning) or outdoors (e.g. ticket machines/shopping bags), during working (e.g. assembly/machinery) or leisure hours (e.g. playing/sports), individually or collaboratively. When observing people interacting with objects, our vision, assisted by our sense of hearing, is the main tool we use to perceive these interactions. Take the example of boiling water in a kettle. We observe the actor press a button, wait, hear the water boil and see the kettle's light go off before the water is used for, say, preparing tea. The perception process is formed from understanding intentional interactions (called ideomotor actions) as well as reactive actions to dynamic stimuli in the environment (referred to as sensorimotor actions). As observers, we understand and can ultimately replicate such interactions using our sensory input, along with the underlying complex cognitive processes of event perception. Evidence in the behavioural sciences demonstrates that these human cognitive processes are highly modularised, and that these modules collaborate to achieve our outstanding human-level perception.

However, current approaches in artificial intelligence lack this modularity and, accordingly, these capabilities. To achieve human-level perception of object interactions, including online perception when the interaction results in mistakes (e.g. water is spilled) or risks (e.g. boiling water is spilled), this fellowship focuses on informing computer vision and machine learning models, including deep learning architectures, with well-studied cognitive behavioural frameworks.

Deep learning architectures have achieved superior performance, compared to their hand-crafted predecessors, on video-level classification; however, their performance on fine-grained understanding within a video remains modest. Current models are easily fooled by similar motions or incomplete actions, as recent research has shown. This fellowship focuses on empowering these models through modularisation, a long-standing principle in cognitive science, articulated in Fodor's The Modularity of Mind and frequently studied by cognitive psychologists in controlled lab environments. Modularity of high-level perception, combined with the power of deep learning architectures, will bring a previously unexplored understanding to video analysis.

The targeted perception of daily and rare object interactions will lay the foundations for applications including assistive technologies using wearable computing, and robot imitation learning. We will work closely with three industrial partners to pave potential knowledge-transfer paths towards applications.

Additionally, the fellowship will actively engage the international research community through workshops, benchmarks and public challenges on large datasets, encouraging other researchers to address problems related to fine-grained perception in video understanding.

Planned Impact

The fellowship focuses on learning a model for understanding human-object interactions, using visual and auditory sensors, with novel capabilities. The model will be capable of understanding the actor's hierarchy of goals and of predicting upcoming interactions. The model will also be able to map the perceived interaction into a set of steps that could be replicated by a robot, tested within a simulated environment.

By enhancing the capabilities of computer vision models for recognising human-object interactions, the fellowship can have far-reaching impact on future technologies. The economic and societal impacts are intertwined: industry would be the prime beneficiary, building new technology, while individuals would be the end users. I summarise the potential through three application areas, with impact on the UK's national capabilities across several industries and opening up previously unexplored opportunities.

1) Assistive Technologies
Every individual can benefit from assistive technologies for object interactions. For example, reminding a person whether they have added salt to their meal or securely closed a water tap are capabilities of the UMPIRE model. Further assistance specialised for the elderly or people with impairments can be envisaged, where alarms are raised in cases of unsafe interactions. Several start-ups have attempted to use assistive technologies in daily interactions. These, however, rely on specialised sensors integrated with every instrument (e.g. one sensor per tap to detect running water). Instead, this project promises human-level cognition using general visuo-auditory sensors that are not specialised for the action. Through a model that can understand and detect the interaction's consequences and changes to the environment (e.g. if water is still pouring then the water source has not been secured), the potential for assistive technologies will be widely enhanced. To realise this impact, the fellowship will engage with the Samsung AI Centre Cambridge, where assistive wearable technologies are under development.

2) Robotics and Beyond
A key capability of the UMPIRE model is actionable perception, i.e. a step-by-step procedure for an artificial agent to replicate the object interaction. This capability will be impactful for people working on vision for robotics. Teaching a robot how to 'open a can' by demonstrating the interaction is a main objective for effective household robotics. In this fellowship, I work closely with NVIDIA, originators of the open-source simulation development kits Isaac and PhysX, to prepare for this impact.

3) Entertainment and Gaming
Virtual and augmented reality games can now integrate a three-dimensional avatar into our home, running around our sofas and tables. However, object interaction perception would enhance the ability to integrate these games with our everyday tasks, combining life with fun. Through perceiving object interactions, avatars would be able to simulate opening your kitchen tap with augmented water flowing. Currently, such potential requires hand-coded graphics. Using a model for interaction perception would enable novel entertainment applications.

In this fellowship, I will engage with the first two impact areas, but note gaming as an area for further exploration. Due to the large commercial potential, the fellowship will have a commercialisation plan, developed through consultation with Ultrahaptics and SAIC, towards a spin-out and/or knowledge transfer.

In addition to its economic and societal impact, the fellowship has an impact on integrating two very active research communities, particularly in the UK: cognitive behavioural science and data-driven computer vision. New research directions can emerge, introducing data-driven research tools to cognitive psychologists.
 
Description (2023) Fine-grained understanding of object transformations has been thoroughly explored through a new set of annotations for Video Object Segmentations and Hand-Object Relations. The research conducted this year has enabled an extensive study of how the two hands interact with the same or different objects during activities, as well as of tool use. These findings were not available to the research community prior to this year's progress on the fellowship.
====
(2022) A new understanding of the multi-modal nature of hand-object interactions has been achieved by this award. How asynchronous audio-visual data can contribute to understanding ongoing actions in long videos will change the potential for assistive technologies. New methods and prototypes have been built.
====
(2021) An interpretable model of the importance of every frame in a video for deciding which action is taking place has been published, with an interactive dashboard available from: http://play-fair.uksouth.cloudapp.azure.com/?uid=137966&n-frames=10
This work challenges a common assumption in video models, namely that sampling more frames always improves a model's performance.
Additionally, we have published a large number of models for the EPIC-KITCHENS dataset, available for researchers to compare their methods on the same benchmark.
Exploitation Route The large-scale benchmark VISOR is now publicly available: http://epic-kitchens.github.io/VISOR/
The large-scale dataset EGO4D is now publicly available: https://ego4d-data.org/
Currently, 5 published benchmarks are available for researchers to compare their methods on a hidden test set. Winners of the first round will be announced in June 2021 alongside a workshop at CVPR 2021: https://epic-kitchens.github.io/2021#challenges
Sectors Creative Economy, Digital/Communication/Information Technologies (including Software), Leisure Activities, including Sports, Recreation and Tourism

 
Description One aspect of this project has now contributed to industrial impact: the recently released massive-scale dataset Ego4D. Read more here: https://www.bristol.ac.uk/news/2021/october/ego4d.html
First Year Of Impact 2022
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description Consultancy to DeepMind
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
 
Description Visual AI: An Open World Interpretable Visual Transformer
Amount £5,912,096 (GBP)
Funding ID EP/T028572/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 12/2020 
End 11/2025
 
Title EPIC-KITCHENS VISOR 
Description We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked, where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife and pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. Data are published under the Creative Commons Attribution-NonCommercial 4.0 International License. A minimal annotation-loading sketch is given after this entry.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact The first dataset of video object segmentations during object interactions in which objects undergo drastic transformations. This work tests the limits of previous approaches to tracking and segmentation. An ongoing open challenge is available to the research community.
URL https://data.bris.ac.uk/data/dataset/2v6cgv1x04ol22qp9rm9x2j6a7/
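The sketch below illustrates how polygon-style segment annotations of the kind VISOR provides could be rasterised into per-entity binary masks. It is a minimal illustration only: the JSON layout and the field names "image_size", "annotations", "segments" and "name" are assumed placeholders, not the official VISOR schema, so the real keys should be taken from the dataset documentation.

```python
# Minimal sketch: rasterise polygon segment annotations into per-entity binary masks.
# The JSON structure and field names below are assumed placeholders, not the
# official VISOR schema; consult the dataset documentation for the real layout.
import json

import numpy as np
from PIL import Image, ImageDraw


def polygon_to_mask(polygon, width, height):
    """Fill a list of (x, y) vertices into a boolean mask of shape (height, width)."""
    canvas = Image.new("L", (width, height), 0)
    ImageDraw.Draw(canvas).polygon([tuple(p) for p in polygon], outline=1, fill=1)
    return np.array(canvas, dtype=bool)


def load_entity_masks(annotation_file):
    """Return a dict mapping entity name (e.g. 'onion', 'left hand') to its mask."""
    with open(annotation_file) as f:
        record = json.load(f)
    width, height = record["image_size"]        # assumed field name
    masks = {}
    for entity in record["annotations"]:        # assumed field name
        mask = np.zeros((height, width), dtype=bool)
        for polygon in entity["segments"]:      # assumed field name
            mask |= polygon_to_mask(polygon, width, height)
        masks[entity["name"]] = mask            # assumed field name
    return masks
```

The official starter code for the released annotations is linked under the Video Object Segmentation software entry below (https://github.com/epic-kitchens/VISOR-VIS).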
 
Title EPIC-KITCHENS-100 
Description Extended Footage for EPIC-KITCHENS dataset, to 100 hours of footage. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact 5 open benchmarks are available for researchers to utilise. To date, the dataset has been downloaded more than 2.3K times by researchers from 42 different countries. 
URL http://epic-kitchens.github.io/
 
Title Frame Attributions in Video Models - Interactive Dashboard 
Description Interactive dashboard to assess the impact of individual frames in a video on current recognition models. A simplified attribution sketch follows this entry. 
Type Of Material Data analysis technique 
Year Produced 2020 
Provided To Others? Yes  
Impact
URL https://play-fair.willprice.dev
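To make the idea behind frame attribution concrete, the sketch below scores each frame by how much the model's confidence in the predicted class drops when that frame is removed. This is only a simplified leave-one-frame-out approximation under an assumed `model` callable (frames in, class probabilities out); the published work uses a more principled, Shapley-style attribution, for which the dashboard and code linked above should be consulted.

```python
# Simplified leave-one-frame-out attribution (illustrative only; the published
# method uses a Shapley-style formulation rather than single-frame ablation).
# `model` is an assumed placeholder: it maps a sequence of frames to a vector
# of class probabilities.
from typing import Callable, List, Sequence

import numpy as np


def frame_attributions(
    model: Callable[[Sequence], np.ndarray],
    frames: List,
    target_class: int,
) -> np.ndarray:
    """Attribution of frame i = score(all frames) - score(all frames except i)."""
    full_score = model(frames)[target_class]
    attributions = np.zeros(len(frames))
    for i in range(len(frames)):
        ablated = frames[:i] + frames[i + 1:]
        attributions[i] = full_score - model(ablated)[target_class]
    return attributions
```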
 
Description Ego4D Consortium Collaboration 
Organisation Carnegie Mellon University
Country United States 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation Facebook
Country United States 
Sector Private 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation Georgia Institute of Technology
Country United States 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation Indian Institute of Technology Hyderabad
Country India 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation Indiana University Bloomington
Country United States 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation King Abdullah University of Science and Technology (KAUST)
Department KAUST Supercomputing Laboratory
Country Saudi Arabia 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation Massachusetts Institute of Technology
Country United States 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation National University of Singapore
Country Singapore 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation Universidad de Los Andes, Chile
Country Chile 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation University of Catania
Country Italy 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation University of Minnesota
Country United States 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation University of Pennsylvania
Country United States 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description Ego4D Consortium Collaboration 
Organisation University of Tokyo
Country Japan 
Sector Academic/University 
PI Contribution Collecting the largest and most diverse dataset of egocentric videos
Collaborator Contribution The project was inspired by my prior EPIC-KITCHENS project and I am a founding member of this consortium
Impact Public dataset for research and commercial purposes of 3670 hours collected by 923 participants in 74 cities around the world
Start Year 2021
 
Description University of Oxford - Audio-visual Fusion for Egocentric Videos 
Organisation University of Oxford
Department Department of Engineering Science
Country United Kingdom 
Sector Academic/University 
PI Contribution Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani
Collaborator Contribution ICCV 2019 publication and code base
Impact (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV). (2021) E Kazakos, A Nagrani, A Zisserman, D Damen. Slow-Fast Auditory Streams for Audio Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021) E Kazakos, J Huh, A Nagrani, A Zisserman, D Damen. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition. British Machine Vision Conference (BMVC).
Start Year 2018
 
Description VISOR Benchmark: VIdeo Segmentations and Object Relations 
Organisation Procter & Gamble
Country United States 
Sector Private 
PI Contribution Working to collect a new benchmark of pixel-level objects and relations
Collaborator Contribution Established and leading the collaboration.
Impact Ongoing - both dataset and research paper expected this summer
Start Year 2021
 
Description VISOR Benchmark: VIdeo Segmentations and Object Relations 
Organisation University of Michigan
Country United States 
Sector Academic/University 
PI Contribution Working to collect a new benchmark of pixel-level objects and relations
Collaborator Contribution Established and leading the collaboration.
Impact Ongoing - both dataset and research paper expected this summer
Start Year 2021
 
Description VISOR Benchmark: VIdeo Segmentations and Object Relations 
Organisation University of Toronto
Country Canada 
Sector Academic/University 
PI Contribution Working to collect a new benchmark of pixel-level objects and relations
Collaborator Contribution Established and leading the collaboration.
Impact Ongoing - both dataset and research paper expected this summer
Start Year 2021
 
Title Auditory Slow-Fast 
Description Recognising actions using the auditory signal only. An illustrative preprocessing sketch follows this entry. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact The paper won an outstanding paper award at ICASSP 2021 (3 papers selected out of 1400). The code is well referenced (46 GitHub stars). In follow-up work by DeepMind [https://arxiv.org/pdf/2111.12124.pdf], which extends this work to speech and music audio, it is described as: "We find the Slowfast architecture is good at learning rich representations required by different domains". 
URL https://github.com/ekazakos/auditory-slow-fast
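As a small illustration of recognising actions from sound alone, the sketch below converts a waveform into a log-mel spectrogram, the typical input representation for spectrogram-based networks of this kind. The parameter values (sample rate, FFT size, hop length, number of mel bins) are illustrative defaults, not necessarily those used in the repository; the linked code should be consulted for the actual pipeline.

```python
# Illustrative preprocessing for audio-only action recognition: waveform to
# log-mel spectrogram. Parameter values are illustrative defaults, not
# necessarily those used by the Auditory Slow-Fast repository.
import librosa
import numpy as np


def log_mel_spectrogram(path, sr=24000, n_fft=1024, hop_length=256, n_mels=128):
    """Load an audio file and return a (n_mels, time) log-mel spectrogram in dB."""
    waveform, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)
```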
 
Title Explainable Video Understanding 
Description Frame Attributions in Video Models 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact A corresponding interactive dashboard is available for people to experiment with explainable models. 
URL http://play-fair.uksouth.cloudapp.azure.com/?uid=137966&n-frames=10
 
Title Multimodal Temporal Context Network (MTCN) 
Description Audio-Visual Recognition of Object Interactions - New Architecture and Modes 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact Used as baseline by other researchers 
URL https://github.com/ekazakos/MTCN
 
Title Temporal-Relational Cross-Transformers (TRX) 
Description Software suite for few-shot action recognition with novel cross-transformer architecture and model (CVPR 2021 paper) 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact The code is well received by the community (62 GitHub stars), and the method has already been compared against in 10 different follow-up works. 
URL https://github.com/tobyperrett/trx
 
Title Video Object Segmentation 
Description Software for Video object segmentation and tracking throughout transformations 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Starter code for using EPIC-KITCHENS VISOR annotations 
URL https://github.com/epic-kitchens/VISOR-VIS
 
Description 10th EPIC Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The 10th iteration of our international workshop, featuring a round of international challenges; winners were announced alongside a technical report and a round table.
Year(s) Of Engagement Activity 2022
URL https://epic-workshop.org/EPIC_CVPR22/
 
Description Compositional and Multimodal Perception of Object Interactions 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote at International Challenge on Compositional and Multimodal Perception held alongside European Conference on Computer Vision (ECCV)
Year(s) Of Engagement Activity 2020
URL https://www.youtube.com/watch?v=zgwg1K77LBs&feature=youtu.be
 
Description Human-Centric Object Interactions - A Fine-Grained Perspective from Egocentric Videos 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote at the first international workshop on deep learning for human-centric activity understanding, held alongside International Conference on Pattern Recognition (ICPR)
Year(s) Of Engagement Activity 2020
URL http://staff.ustc.edu.cn/~tzzhang/dl-hau2020/program.html
 
Description Human-Centric Object Interactions - A Fine-Grained Perspective from Egocentric Videos 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at 1st International Workshop On Human-Centric Multimedia Analysis held alongside ACM Multimedia
Year(s) Of Engagement Activity 2020
URL https://hcma2020.github.io
 
Description Naturally Limited Videos of Fine-Grained Actions 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact In this talk, I'll present the case for collecting unscripted video datasets in their native environments, introducing naturally long-tailed datasets. Using such resources, I will present my group's approaches to zero-shot action retrieval [ICCV 2019], few-shot recognition [CVPR 2020], domain adaptation [CVPR 2020, ArXiv] and unsupervised learning [CVPR 2022].
Year(s) Of Engagement Activity 2022
URL https://sites.google.com/view/l3d-ivu/program
 
Description Research Visit: Berkeley AI Research Laboratory 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Research Visit at BAIR for extending research collaboration and engaging in interesting discussions with researchers in Computer Vision, AI and Robotics
Year(s) Of Engagement Activity 2023
 
Description Seventh International Workshop on Egocentric Perception, Interaction and Computing (EPIC) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact More than 200 researchers attended a full day workshop on egocentric perception, contributing talks, keynotes and poster presentations.
Year(s) Of Engagement Activity 2020
URL https://eyewear-computing.org/EPIC_ECCV20/
 
Description Sixth International Workshop on Egocentric Perception, Interaction and Computing 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 150 researchers from academia and industry attended a virtual international workshop where the latest research on fine-grained action recognition was discussed and presented.
Year(s) Of Engagement Activity 2020
URL https://eyewear-computing.org/EPIC_CVPR20/
 
Description Talk: Learning from Narrated Videos of Everyday Tasks 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk on Learning from Narrated Videos of Everyday Tasks at the CVPR2020 workshop on Instructional Videos
Year(s) Of Engagement Activity 2020
URL https://drive.google.com/file/d/1nMr6wanv9fQFjbJNP9ZjDQBMNVq8kUIT/view
 
Description Video Understanding - an Egocentric Perspective 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presentations at the 6th Summer School on AI
Year(s) Of Engagement Activity 2022
URL https://cvit.iiit.ac.in/summerschool2022/
 
Description Video Understanding: A Tutorial 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Participation in the International Computer Vision Summer School
Year(s) Of Engagement Activity 2022
URL https://iplab.dmi.unict.it/icvss2022/