LOCATE: LOcation adaptive Constrained Activity recognition using Transfer learning

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

It is estimated that there are six million surveillance cameras in the UK, with only 17% of them publicly operated. Increasingly, people are installing CCTV cameras in their homes for security or for remote monitoring of the elderly, infants or pets. Despite this increase, the use of the overwhelming majority of these cameras is limited to evidence gathering or live viewing. These sensors are currently incapable of providing smart monitoring - identifying an infant in danger or a dehydrated elderly person, for example. Similarly, CCTV in public places is mostly used for evidence gathering.

Following years of research, methods capable of automatically recognising activities of interest, such as a person leaving a service station without paying for fuel, or someone tampering with a fuel dispenser, are now available, achieving acceptable levels of success with low false-alarm rates. Although such systems run automatically once installed, installation requires not only putting the hardware in place but also an expert studying the footage and designing a model suited to the monitored location. At each new location, e.g. each new service station, a new model is needed, requiring the effort and time of an expert. This is expensive, difficult to scale and at times impractical, for home monitoring for example. This requirement to build location-specific models is currently limiting the adoption of automatic activity recognition, despite its potential benefits.

This project, LOCATE, proposes an algorithmic solution capable of taking a pre-built model to a different location and adapting it by simply observing the new scene for a few days. The solution is inspired by the human ability to intelligently apply previously acquired knowledge to solve new challenges. The researchers will work with senior scientists from two leading UK video analytics industrial partners, QinetiQ and Thales. Using these partners' expertise, the project will provide practical and valuable insight that can further boost the strong UK video analytics industry. The United Kingdom is currently a global player in the video analytics market, and the leading country in the Europe, Middle East and Africa (EMEA) region.

The method will be applicable to various domains, including home monitoring and CCTV in public places. To evaluate the proposed approach for home monitoring, LOCATE will work alongside the EPSRC-funded project SPHERE, which aims to develop and deploy a sensor-based platform for residential healthcare in and around Bristol. The findings of LOCATE will be integrated within the SPHERE platform, towards automatic monitoring of activities of daily living in a new home, such as preparing a meal, eating or taking medication.

The targeted plug-and-play approach will enable a non-expert user to set up a camera and automatically detect, for example, whether an elderly person in the home has had their meal and medication. A shop owner can similarly detect pickpocketing attempts in their store. The community can thus make better use of the already-installed network of visual sensors.

Planned Impact

A) Economic Impact:
The LOCATE framework attempts to enable plug-and-play automatic activity recognition using visual sensors. This is central to the already strong and growing UK industry in video analytics. Current approaches require hand-crafted location-specific activity representation models, increasing costs and at times limiting the applicability of automatic monitoring. A location-adaptive solution would (i) decrease installation costs and enable wider adoption of automatic activity recognition, (ii) extend solutions to domains where location-specific models are difficult to obtain such as homes and highly-sensitive security environments, and (iii) encourage other established companies and start-ups to enter the video analytics market, as a result of the decrease in cost and increase in applicability.

Recently, a number of UK and international SMEs have focused on developing mobile applications that perform basic image processing, such as motion detection, to raise intrusion alarms in residential environments. The LOCATE framework could encourage these SMEs to expand their approaches to more advanced computer vision methods capable of detecting activities such as an infant in danger or a pet unable to access food or water. This could further grow the customer base of these SMEs. During the follow-up phase of the project, links will be established to these SMEs and start-ups.

LOCATE primarily aims to make the most of the already installed and functioning network of wired and wireless cameras in the UK. Empowering this infrastructure to detect and prevent, rather than merely record footage for evidence gathering or occasional live viewing, makes better use of available resources.

B) Societal Impact:
Following from the economic impact, wider adoption of automatic monitoring and its extension to novel domains brings us one step closer to a healthier and safer society.
In healthcare monitoring, Activities of Daily Living (ADLs) are an established measure of a person's functional status and quality of life. Automatic monitoring of ADLs would allow better assessment of a person's health as well as intervention when needed.
When used for surveillance, automatic detection of activities of interest will enable intervention towards saving belongings as well as lives.
Automatic understanding of a person's activities can also encourage the development of smarter, more responsive approaches to human-computer and human-robot interaction.

C) Academic Impact:
The project contributes to two research areas: visual activity recognition and relational-knowledge transfer learning, establishing a novel area of research in relational-knowledge transductive transfer learning for visual activity recognition. A challenge will be released to encourage other researchers to propose solutions to this problem.
The LOCATE project aims to establish the PI as a leading researcher in this novel area, continuing what is already a successful career in video analysis and activity recognition. The project will establish working collaborations between the PI and the current project partners as well as new extended collaborations.
At least one postdoctoral researcher and one PhD candidate will become proficient in transfer learning approaches - a skill in high demand in research laboratories worldwide, with profound effects for applications in Computer Vision and beyond.

Publications

 
Description The award focused on the ability to deploy machine learning for recognising activities in new environments, without the need for additional annotations or manual intervention. This was achieved during the award through the following key steps:
*) A dataset collected in 45 different home environments, providing the largest benchmark for hand-object interactions from wearable cameras.
*) A temporal model learnt jointly from multiple domains/environments. This provided the first attempt to share temporal knowledge between domains, published in CVPR 2019.
*) The first solution to unsupervised domain adaptation for fine-grained actions - that is, the task of learning in a new environment from unlabelled data. The paper showcased the ability to utilise multiple views (modalities) of the same data for better alignment during domain adaptation, published in CVPR 2020; a simplified sketch of this idea is given after this list.
*) The first benchmark for other researchers to compare their methods on the same dataset. This is currently an open challenge, and the first set of winners will be announced at CVPR 2021.
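To make the multi-modal adaptation idea above concrete, the following is a minimal sketch of unsupervised domain adaptation for action recognition with adversarial alignment of two modalities (RGB and optical flow). It is an illustrative outline under stated assumptions, not the released code: the PyTorch module names, feature dimensions, the linear stand-in backbones and the DANN-style gradient-reversal discriminators are placeholders, and the published CVPR 2020 method additionally exploits self-supervised modality correspondence.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses and scales gradients in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class MultiModalDA(nn.Module):
    def __init__(self, in_dim=1024, feat_dim=256, num_classes=8):
        super().__init__()
        # One feature extractor per modality (stand-ins for pretrained video backbones).
        self.rgb_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.flow_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * feat_dim, num_classes)
        # One domain discriminator per modality, trained through gradient reversal.
        self.rgb_domain = nn.Linear(feat_dim, 2)
        self.flow_domain = nn.Linear(feat_dim, 2)

    def forward(self, rgb, flow, lambd=1.0):
        f_rgb, f_flow = self.rgb_net(rgb), self.flow_net(flow)
        logits = self.classifier(torch.cat([f_rgb, f_flow], dim=1))
        d_rgb = self.rgb_domain(GradReverse.apply(f_rgb, lambd))
        d_flow = self.flow_domain(GradReverse.apply(f_flow, lambd))
        return logits, d_rgb, d_flow

def training_step(model, src, tgt, opt):
    # src: labelled source-domain batch; tgt: unlabelled target-domain batch.
    ce = nn.CrossEntropyLoss()
    logits_s, ds_rgb, ds_flow = model(src["rgb"], src["flow"])
    _, dt_rgb, dt_flow = model(tgt["rgb"], tgt["flow"])
    cls_loss = ce(logits_s, src["label"])  # supervised loss on source labels only
    src_dom = torch.zeros(src["rgb"].size(0), dtype=torch.long)
    tgt_dom = torch.ones(tgt["rgb"].size(0), dtype=torch.long)
    dom_loss = (ce(ds_rgb, src_dom) + ce(ds_flow, src_dom)
                + ce(dt_rgb, tgt_dom) + ce(dt_flow, tgt_dom))
    # Gradient reversal pushes per-modality features to be domain-indistinguishable.
    loss = cls_loss + dom_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

In this sketch, the classifier is supervised only by labelled source-domain clips, while the per-modality domain discriminators, trained through gradient reversal, encourage features that transfer to the unlabelled target environment.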
Exploitation Route The Unsupervised Domain Adaptation benchmark, published as part of this award, is now available for researchers to compare various methods against the same hidden test set. This is available for all researchers worldwide at: https://competitions.codalab.org/competitions/26096
Sectors Digital/Communication/Information Technologies (including Software)

 
Description The ability to adapt action understanding models to new environments for ultimate deployability was a novel topic before this award. The work conducted in the LOCATE award was ground-breaking within the research community and beyond. Four key impact avenues from this award are noted below:
1. The work of Munro and Damen (2020) in CVPR, entitled Multi-Modal Domain Adaptation for Fine-Grained Action Recognition, provided the first benchmark with relevant evaluation metrics for the problem statement of the LOCATE grant. Since its publication, the work has been cited more than 140 times, critically showcasing that video is distinct from images in this setting, with self-supervision and multi-modality key to achieving success.
2. As a result of (1), a larger-scale Unsupervised Domain Adaptation challenge was set up for the EPIC-KITCHENS dataset and has been running annually, with winners awarded at CVPR. Winning teams come from research labs such as A*STAR Singapore as well as international universities such as the University of Tokyo, the University of Amsterdam and Politecnico di Torino. Successful approaches again utilise the multi-modality proposed in our original work (1).
3. The work on EPIC-KITCHENS, particularly the diversity of locations, triggered the establishment of a 13-university consortium, called Ego4D, where a large-scale dataset was collected in 74 places around the world, capturing 3,670 hours of daily activity. The dataset (http://ego4d-data.org/), also published as a dataset and benchmark paper at CVPR 2022, is another impactful outcome of the initial ideas in the LOCATE grant. It is now key to video understanding and is used by all key industrial labs, with commercial licensing signed with the University of Bristol; these companies include Meta, Apple, Samsung AI and Google, amongst others.
4. Also due to the work on LOCATE, I was invited to serve as a consultant to the international company Cookpad for six months, helping them build and design the goals for their computer vision and machine learning team.
The impact of this grant continues: another round of the Unsupervised Domain Adaptation challenge will run, with winners announced at CVPR 2023 this June. The findings of this grant have influenced the ongoing EPSRC Fellowship UMPIRE (EP/T004991/1), where the notions of adaptation and generalisation of activities are being further explored as one of the fellowship's goals.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description Jean Golding Institute Seedcorn funding
Amount £4,740 (GBP)
Organisation University of Bristol 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 07/2018
 
Description UMPIRE: United Model for the Perception of Interactions in visuoauditory REcognition
Amount £1,001,838 (GBP)
Funding ID EP/T004991/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 02/2020 
End 01/2025
 
Title EPIC-KITCHENS-100 
Description Extended Footage for EPIC-KITCHENS dataset, to 100 hours of footage. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Five open benchmarks are available for researchers to utilise. To date, the dataset has been downloaded more than 2,300 times by researchers from 42 different countries. 
URL http://epic-kitchens.github.io/
 
Title EPIC-Kitchens 
Description Largest dataset in first-person vision, fully annotated with open challenges for object detection, action recognition and action anticipation 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Open challenges with 15 different universities and research centres competing to win the relevant challenges. 
URL http://epic-kitchens.github.io
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Catania
Department Department of Mathematics and Computer Science
Country Italy 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort and time of the partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella), in addition to the time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact Dataset annotation ongoing and paper draft in preparation
Start Year 2017
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Toronto
Department Department of Computer Science
Country Canada 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort and time of the partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella), in addition to the time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact Dataset annotation ongoing and paper draft in preparation
Start Year 2017
 
Description Naver Labs Europe 
Organisation NAVER LABS Europe
Country France 
Sector Public 
PI Contribution Research internship secured for PhD student over summer 2017
Collaborator Contribution Work was carried out to finalise the details of a PhD student visit to XRCE over the summer, working on semantic embedding for action recognition. During this work, XRCE was acquired by Naver Labs Europe. This has resulted in one publication (ICCV 2019). Another student (Jonathan Munro) is also following up with an internship in April 2020, continuing this collaboration.
Impact Agreement signed, internship details finalised.
Start Year 2017
 
Description Naver Labs Europe 
Organisation Xerox Corporation
Department Xerox Research Centre Europe - XRCE
Country France 
Sector Private 
PI Contribution Research internship secured for PhD student over summer 2017
Collaborator Contribution Work was carried out to finalise the details of a PhD student visit to XRCE over the summer, working on semantic embedding for action recognition. During this work, XRCE was acquired by Naver Labs Europe. This has resulted in one publication (ICCV 2019). Another student (Jonathan Munro) is also following up with an internship in April 2020, continuing this collaboration.
Impact Agreement signed, internship details finalised.
Start Year 2017
 
Description University of Oxford - Audio-visual Fusion for Egocentric Videos 
Organisation University of Oxford
Department Department of Engineering Science
Country United Kingdom 
Sector Academic/University 
PI Contribution Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani
Collaborator Contribution ICCV 2019 publication and code base
Impact (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV). (2021) E Kazakos, A Nagrani, A Zisserman, D Damen. Slow-Fast Auditory Streams for Audio Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021) E Kazakos, J Huh, A Nagrani, A Zisserman, D Damen. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition. British Machine Vision Conference (BMVC).
Start Year 2018
 
Title EPIC-Kitchens Starters Kit 
Description Starter Toolkit for using EPIC-Kitchens Dataset 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Used in open challenges for the dataset 
URL https://github.com/epic-kitchens/starter-kit-action-recognition
 
Description Chair - BMVA Symposium on Transfer Learning in Computer Vision 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Computer Vision community is in need of moving beyond dataset or task-specific methods towards those that can efficiently adapt to new tasks or domains in a supervised, semi-supervised or unsupervised manner. We aim in this technical meeting to bring together leading researchers, at various levels in their career, with expertise or strong interest in TL for Computer Vision problems, in order to discuss current challenges and propose future directions including potentially establishing a continuous forum or a workshop series.

The symposium invited keynote speakers and researchers to present short talks and posters that address the motivation, methodologies, challenges and applications of using TL in Computer Vision. The day concluded with an hour of discussions by key researchers, with conclusions to be published in a report by BMVA
Year(s) Of Engagement Activity 2017
URL https://www.cs.bris.ac.uk/~damen/TLCV/
 
Description Conference Keynote: A Fine-grained Perspective onto Object Interactions from First-person Views 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote at the international VISIGRAPP conference in Prague, targeting researchers in both academia and industry with interests in computer vision, visualisation and graphics.
Year(s) Of Engagement Activity 2019
URL http://www.visigrapp.org/KeynoteSpeakers.aspx?y=2019#4
 
Description Invited Talk - BMVA symposium on Analysis and Processing of RGBD Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Dr Dima Damen gave an invited talk at the BMVA symposium on Analysis and Processing of RGBD Data in London. The talk focused on challenges and opportunities for Action and Activity Recognition using RGBD Data, alongside two prominent professors in the UK (Prof Ling Shao, University of East Anglia and Prof Adrian Hilton, University of Surrey). The day was well attended by graduate students, academics and representatives of industry.
Year(s) Of Engagement Activity 2017
URL https://www.eventbrite.co.uk/e/bmva-technical-meeting-analysis-and-processing-of-rgbd-data-registrat...
 
Description Pint of Science - What can a Wearable Camera Know About Me? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Public talk with the following abstract: "Mobile cameras are everywhere; wearable cameras are coming! Current computer vision technology can summarise your day, figure out your routine, even teach you how to perform a new task and remind you if you forgot to switch off the hob after cooking. What are the potentials and limitations of such technology? How mature is it, and when does it fail? This talk will not discuss privacy concerns. It offers a bright outlook into our tech-enhanced future." The talk resulted in an active debate on the potential and limitations of the current technology.
Year(s) Of Engagement Activity 2017
URL https://pintofscience.co.uk/event/rage-against-the-machine-vision
 
Description Scaling Egocentric Vision 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote at the Extreme Vision Modelling Workshop, alongside the International Conference on Computer Vision (ICCV) in South Korea. The conference had 7,000 attendees, and my talk was attended by around 200 of them, including academics, postgraduate students and industry representatives.
Year(s) Of Engagement Activity 2019
URL https://sites.google.com/view/extremevision
 
Description The Lifetime of an Object in Egocentric Vision 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk on long-term monitoring of object interactions at an international workshop alongside ICCV.
Year(s) Of Engagement Activity 2017
URL http://www.eyewear-computing.org/EPIC_ICCV17/program.asp