UMPIRE: United Model for the Perception of Interactions in visuoauditory REcognition

Lead Research Organisation: University of Bristol
Department Name: Computer Science


Humans interact with tens of objects daily, at home (e.g. cooking/cleaning) or outdoors (e.g. ticket machines/shopping bags), during working (e.g. assembly/machinery) or leisure hours (e.g. playing/sports), individually or collaboratively. When observing people interacting with objects, vision, assisted by hearing, is our main tool for perceiving these interactions. Take the example of boiling water in a kettle. We observe the actor press a button, wait, hear the water boil and see the kettle's light go off before the water is used for, say, preparing tea. The perception process is formed from understanding intentional interactions (called ideomotor actions) as well as reactive actions to dynamic stimuli in the environment (referred to as sensorimotor actions). As observers, we understand and can ultimately replicate such interactions using our sensory input, along with our underlying complex cognitive processes of event perception. Evidence from the behavioural sciences demonstrates that these human cognitive processes are highly modularised, and that these modules collaborate to achieve our outstanding human-level perception.

However, current approaches in artificial intelligence lack this modularity and, accordingly, these capabilities. To achieve human-level perception of object interactions, including online perception when the interaction results in mistakes (e.g. water is spilled) or risks (e.g. boiling water is spilled), this fellowship focuses on informing computer vision and machine learning models, including deep learning architectures, with well-studied cognitive behavioural frameworks.

Deep learning architectures have achieved superior performance, compared to their hand-crafted predecessors, on video-level classification; however, their performance on fine-grained understanding within the video remains modest. Current models are easily fooled by similar motions or incomplete actions, as shown by recent research. This fellowship focuses on empowering these models through modularisation, a principle articulated in Fodor's The Modularity of Mind (1983) and frequently studied by cognitive psychologists in controlled lab environments. Modularity of high-level perception, combined with the power of deep learning architectures, will bring a new, previously unexplored understanding to video analysis.

The targeted perception of daily and rare object interactions will lay the foundations for applications including assistive technologies using wearable computing, and robot imitation learning. We will work closely with three industrial partners to pave potential knowledge transfer paths towards these applications.

Additionally, the fellowship will actively engage international researchers through workshops, benchmarks and public challenges on large datasets, to encourage other researchers to address problems related to fine-grained perception in video understanding.

Planned Impact

The fellowship focuses on learning a model for understanding human object interactions, using visual- and auditory-sensors, with novel capabilities. The model will be capable of understanding the actor's hierarchy of goals and predicting upcoming interactions. The model will also be able to map the perceived interaction into a set of steps that could be replicated by a robot, tested within a simulated environment.
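As a purely illustrative sketch (the class and field names below are hypothetical and not part of the proposal), the actionable output described above, an actor's hierarchy of goals whose leaves are steps a robot could replicate, might be represented as:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """A leaf-level action an artificial agent could replicate."""
    verb: str
    obj: str

@dataclass
class Goal:
    """A node in the actor's hierarchy of goals (hypothetical schema)."""
    name: str
    subgoals: list["Goal"] = field(default_factory=list)
    steps: list[Step] = field(default_factory=list)

    def flatten(self) -> list[Step]:
        """Expand subgoals depth-first, then this goal's own steps,
        yielding the ordered step sequence for replication."""
        out: list[Step] = []
        for g in self.subgoals:
            out.extend(g.flatten())
        out.extend(self.steps)
        return out

# The kettle example from the summary, encoded in this toy schema:
boil = Goal("boil water", steps=[Step("fill", "kettle"), Step("press", "button")])
tea = Goal("prepare tea", subgoals=[boil], steps=[Step("pour", "water")])
print([f"{s.verb} {s.obj}" for s in tea.flatten()])
# → ['fill kettle', 'press button', 'pour water']
```

The flattened sequence is the kind of step-by-step procedure that could be handed to a robot in a simulated environment.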

By enhancing the capabilities of computer vision models for recognising human-object interactions, the fellowship can have far-reaching impact on future technologies. The economic and societal impacts are intertwined: industry would be the prime beneficiary, building new technology, while individuals would be the end users. I summarise the potential through three application areas, each impactful on the UK's national capabilities across several industries and opening opportunities previously unexplored.

1) Assistive Technologies
Every individual can benefit from assistive technologies for object interactions. For example, reminding a person whether they have added salt to their meal, or securely closed a water tap, are capabilities of the UMPIRE model. Further assistance specialised for the elderly or people with impairments can be envisaged, where alarms are raised in cases of unsafe interactions. Several start-ups have attempted to use assistive technologies in daily interactions. These, however, rely on specialised sensors integrated with every instrument (e.g. one sensor per tap to detect running water). Instead, this project promises human-level cognition using general visuo-auditory sensors, not specialised for the action. Through a model that can understand and detect the interaction's consequences and changes to the environment (e.g. if water is still pouring then the water source has not been secured), the potential for assistive technologies will be widely enhanced. To realise this impact, the fellowship will engage with the Samsung AI Centre Cambridge, where assistive wearable technologies are under development.
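A toy sketch of the kind of consequence check described above (the event labels and function name are hypothetical illustrations, not the proposed model, which would learn such reasoning from visuo-auditory input):

```python
def tap_secured(observations: list[str]) -> bool:
    """Hypothetical rule: if water is still observed pouring after the
    last 'close tap' action, the water source has not been secured."""
    # Find the index of the most recent 'close tap' event, if any.
    last_close = max(
        (i for i, obs in enumerate(observations) if obs == "close tap"),
        default=-1,
    )
    if last_close == -1:
        return False  # the tap was never closed
    # Secured only if no pouring is observed after the closing action.
    return "water pouring" not in observations[last_close + 1:]

# A successful closure vs. an unsafe one that should raise an alarm:
print(tap_secured(["open tap", "water pouring", "close tap"]))                   # → True
print(tap_secured(["open tap", "water pouring", "close tap", "water pouring"]))  # → False
```

In the envisaged system, such checks would be the output of learned perception over general sensors rather than hand-written rules, which is precisely what distinguishes UMPIRE from one-sensor-per-instrument approaches.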

2) Robotics and Beyond
A key capability of the UMPIRE model is actionable perception, i.e. a step-by-step procedure for an artificial agent to replicate the object interaction. This capability will be impactful for people working on vision for robotics. Teaching a robot how to 'open a can' by demonstrating the interaction is a main objective for effective household robotics. In this fellowship, I work closely with NVIDIA, originators of the open-source simulation development kits Isaac and PhysX, to prepare for this impact.

3) Entertainment and Gaming
Virtual and augmented reality games can now integrate a three-dimensional avatar into our homes, running around our sofas and tables. However, object interaction perception would enhance the ability to integrate these games with our everyday tasks, combining life with fun. Through perceiving object interactions, avatars would be able to simulate opening your kitchen tap with augmented water flowing out. Currently, such potential requires hand-coded graphics. A model for interaction perception would enable novel entertainment applications.

In this fellowship, I will engage with the first two impact areas, but note gaming as a potential for further exploration. Due to the large commercial potential, the fellowship will have a commercialisation plan, developed through consultation with Ultrahaptics and SAIC towards a spin-out and/or knowledge transfer.

In addition to the economic and societal impact, the fellowship will help integrate two very active research communities, particularly in the UK: cognitively-inspired human behaviour and data-driven computer vision. New research directions can emerge, introducing data-driven tools to cognitive psychologists.


Publications

Damen, D. et al. (2020). The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence.