LOCATE: LOcation adaptive Constrained Activity recognition using Transfer learning

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

It is estimated that there are six million surveillance cameras in the UK, with only 17% of them publicly operated. Increasingly, people are installing CCTV cameras in their homes for security or for remote monitoring of the elderly, infants or pets. Despite this increase, the use of the overwhelming majority of these cameras is limited to evidence gathering or live viewing. These sensors are currently incapable of providing smart monitoring - identifying an infant in danger or a dehydrated elderly person, for example. Similarly, CCTV in public places is mostly used for evidence gathering.

Following years of research, methods capable of automatically recognising activities of interest, such as a person leaving a service station without paying for fuel, or someone tampering with a fuel dispenser, are now available, achieving acceptable levels of success with low false-alarm rates. Although such systems run automatically once installed, installation requires not only putting the hardware in place but also an expert studying the footage and designing a model suited to the monitored location. At each new location, e.g. each new service station, a new model is needed, requiring the effort and time of an expert. This is expensive, difficult to scale and at times impractical, for home monitoring for example. This requirement to build location-specific models is currently limiting the adoption of automatic activity recognition, despite its potential benefits.

This project, LOCATE, proposes an algorithmic solution capable of taking a pre-built model to a different location and adapting it by simply observing the new scene for a few days. The solution is inspired by the human ability to intelligently apply previously acquired knowledge to solve new challenges. The researchers will work with senior scientists from two leading UK video analytics industrial partners, QinetiQ and Thales. Using these partners' expertise, the project will provide practical and valuable insight that can further boost the strong UK video analytics industry. The United Kingdom is currently a global player in the video analytics market, and the leading country in the Europe, Middle East and Africa (EMEA) region.

The method will be applicable to various domains, including home monitoring and CCTV in public places. To evaluate the proposed approach for home monitoring, LOCATE will work alongside the EPSRC-funded project SPHERE, which aims to develop and deploy a sensor-based platform for residential healthcare in and around Bristol. The findings of LOCATE will be integrated within the SPHERE platform, towards automatic monitoring of activities of daily living in a new home, such as preparing a meal, eating or taking medication.

The targeted plug-and-play approach will enable a non-expert user to set up a camera and automatically detect, for example, whether an elderly person in the home has had their meal and medication. A shop owner can similarly detect pickpocketing attempts in their store. The community can thus make better use of the already-installed network of visual sensors.

Planned Impact

A) Economic Impact:
The LOCATE framework attempts to enable plug-and-play automatic activity recognition using visual sensors. This is central to the already strong and growing UK industry in video analytics. Current approaches require hand-crafted location-specific activity representation models, increasing costs and at times limiting the applicability of automatic monitoring. A location-adaptive solution would (i) decrease installation costs and enable wider adoption of automatic activity recognition, (ii) extend solutions to domains where location-specific models are difficult to obtain such as homes and highly-sensitive security environments, and (iii) encourage other established companies and start-ups to enter the video analytics market, as a result of the decrease in cost and increase in applicability.

Recently, a number of UK and international SMEs have focused on developing mobile applications that perform basic image processing, such as motion detection, to raise intrusion alarms in residential environments. The LOCATE framework could encourage these SMEs to expand their approaches to more advanced computer vision methods capable of detecting activities such as an infant in danger or a pet unable to access food or water. This could further grow the customer base of these SMEs. During the follow-up phase of the project, links will be established to these SMEs and start-ups.

LOCATE primarily aims to make the most of the already installed and functioning network of wired and wireless cameras in the UK. Empowering this infrastructure to detect and prevent, rather than merely record footage for evidence gathering or occasional live viewing, makes better use of available resources.

B) Societal Impact:
Following from the economic impact, wider adoption of automatic monitoring and its extension to novel domains brings us one step closer to a healthier and safer society.
In healthcare monitoring, Activities of Daily Living (ADLs) are an established measure of a person's functional status and quality of life. Automatic monitoring of ADLs would allow better assessment of a person's health as well as intervention when needed.
When used for surveillance, automatic detection of activities of interest will enable intervention towards saving belongings as well as lives.
Automatic understanding of a person's activities can also encourage the development of smarter, more responsive approaches to human-computer and human-robot interaction.

C) Academic Impact:
The project contributes to two research areas: visual activity recognition and relational-knowledge transfer learning, establishing a novel area of research in relational-knowledge transductive transfer learning for visual activity recognition. A challenge will be released to encourage other researchers to propose solutions to this problem.
The LOCATE project aims to establish the PI as a leading researcher in this novel area, continuing what is already a successful career in video analysis and activity recognition. The project will establish working collaborations between the PI and the current project partners as well as new extended collaborations.
At least one postdoctoral researcher and one PhD candidate will become proficient in transfer learning approaches - a skill in high demand in research laboratories worldwide, with profound effects for applications in Computer Vision and beyond.

Publications

 
Description The award focused on the ability to deploy machine learning for recognising activities in new environments, without the need for additional annotations or manual intervention. This was achieved during the award through the following key steps:
*) A dataset collected in 45 different home environments, providing the largest benchmark for hand-object interactions from wearable cameras.
*) A temporal model learnt jointly from multiple domains/environments. This provided the first attempt to share temporal knowledge between domains, published in CVPR 2019.
*) The first solution to unsupervised domain adaptation for fine-grained actions - that is, the task of learning in a new environment from unlabelled data. The paper showcased the ability to utilise multiple views (modalities) of the same data for better alignment during domain adaptation, published in CVPR 2020; a simplified sketch of this idea is given after this list.
*) The first benchmark for other researchers to compare their methods on the same dataset. This is currently an open challenge, and the first set of winners will be announced at CVPR 2021.
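To make the multi-modal adaptation idea above concrete, the following is a minimal sketch of unsupervised domain adaptation for action recognition with adversarial alignment of two modalities (RGB and optical flow). It is an illustrative outline under stated assumptions, not the released code: the PyTorch module names, feature dimensions, the linear stand-in backbones and the DANN-style gradient-reversal discriminators are placeholders, and the published CVPR 2020 method additionally exploits self-supervised modality correspondence.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses and scales gradients in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class MultiModalDA(nn.Module):
    def __init__(self, in_dim=1024, feat_dim=256, num_classes=8):
        super().__init__()
        # One feature extractor per modality (stand-ins for pretrained video backbones).
        self.rgb_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.flow_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * feat_dim, num_classes)
        # One domain discriminator per modality, trained through gradient reversal.
        self.rgb_domain = nn.Linear(feat_dim, 2)
        self.flow_domain = nn.Linear(feat_dim, 2)

    def forward(self, rgb, flow, lambd=1.0):
        f_rgb, f_flow = self.rgb_net(rgb), self.flow_net(flow)
        logits = self.classifier(torch.cat([f_rgb, f_flow], dim=1))
        d_rgb = self.rgb_domain(GradReverse.apply(f_rgb, lambd))
        d_flow = self.flow_domain(GradReverse.apply(f_flow, lambd))
        return logits, d_rgb, d_flow

def training_step(model, src, tgt, opt):
    # src: labelled source-domain batch; tgt: unlabelled target-domain batch.
    ce = nn.CrossEntropyLoss()
    logits_s, ds_rgb, ds_flow = model(src["rgb"], src["flow"])
    _, dt_rgb, dt_flow = model(tgt["rgb"], tgt["flow"])
    cls_loss = ce(logits_s, src["label"])  # supervised loss on source labels only
    src_dom = torch.zeros(src["rgb"].size(0), dtype=torch.long)
    tgt_dom = torch.ones(tgt["rgb"].size(0), dtype=torch.long)
    dom_loss = (ce(ds_rgb, src_dom) + ce(ds_flow, src_dom)
                + ce(dt_rgb, tgt_dom) + ce(dt_flow, tgt_dom))
    # Gradient reversal pushes per-modality features to be domain-indistinguishable.
    loss = cls_loss + dom_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

In this sketch, the classifier is supervised only by labelled source-domain clips, while the per-modality domain discriminators, trained through gradient reversal, encourage features that transfer to the unlabelled target environment.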
Exploitation Route The Unsupervised Domain Adaptation benchmark, published as part of this award, is now available for researchers to compare various methods against the same hidden test set. This is available for all researchers worldwide at: https://competitions.codalab.org/competitions/26096
Sectors Digital/Communication/Information Technologies (including Software)

 
Description The ability to adapt action understanding models to new environments for ultimate deployability was a novel topic before this award. The work conducted in the LOCATE award was ground-breaking within the research community and beyond. Four key impact avenues from this award are noted below:
1. The work of Munro and Damen (2020) in CVPR, entitled Multi-Modal Domain Adaptation for Fine-Grained Action Recognition, provided the first benchmark with relevant evaluation metrics for the problem statement of the LOCATE grant. Since its publication, the work has been cited more than 140 times, critically showcasing that video is distinct from images in this setting, with self-supervision and multi-modality key to achieving success.
2. As a result of (1), a larger-scale Unsupervised Domain Adaptation challenge was set up for the EPIC-KITCHENS dataset and has been running annually, with winners awarded at CVPR. Winning teams come from research labs such as A*STAR Singapore as well as international universities such as the University of Tokyo, the University of Amsterdam and Politecnico di Torino. Successful approaches again utilise the multi-modality proposed in our original work (1).
3. The work on EPIC-KITCHENS, particularly the diversity of locations, triggered the establishment of a 13-university consortium, called Ego4D, where a large-scale dataset was collected in 74 places around the world, capturing 3,670 hours of daily activity. The dataset (http://ego4d-data.org/), also published as a dataset and benchmark paper at CVPR 2022, is another impactful outcome of the initial ideas in the LOCATE grant. It is now key to video understanding and is used by all key industrial labs, with commercial licensing signed with the University of Bristol; these companies include Meta, Apple, Samsung AI and Google, amongst others.
4. Also due to the work on LOCATE, I was invited to serve as a consultant to the international company Cookpad for six months, helping them build and design the goals for their computer vision and machine learning team.
The impact of this grant continues: another round of the Unsupervised Domain Adaptation challenge will run, with winners announced at CVPR 2023 this June. The findings of this grant have influenced the ongoing EPSRC Fellowship UMPIRE (EP/T004991/1), where the notions of adaptation and generalisation of activities are being further explored as one of the fellowship's goals.
First Year Of Impact 2019
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description Jean Golding Institute Seedcorn funding
Amount £4,740 (GBP)
Organisation University of Bristol 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 07/2018
 
Description UMPIRE: United Model for the Perception of Interactions in visuoauditory REcognition
Amount £1,001,838 (GBP)
Funding ID EP/T004991/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 02/2020 
End 01/2025
 
Title EPIC-KITCHENS-100 
Description Extended Footage for EPIC-KITCHENS dataset, to 100 hours of footage. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Five open benchmarks are available for researchers to utilise. To date, the dataset has been downloaded more than 2,300 times by researchers from 42 different countries. 
URL http://epic-kitchens.github.io/
 
Title EPIC-Kitchens 
Description Largest dataset in first-person vision, fully annotated with open challenges for object detection, action recognition and action anticipation 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact Open challenges with 15 different universities and research centres competing to win the relevant challenges. 
URL http://epic-kitchens.github.io
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Catania
Department Department of Mathematics and Computer Science
Country Italy 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort and time of the partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella), in addition to the time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact Dataset annotation ongoing and paper draft in preparation
Start Year 2017
 
Description EPIC-Kitchens Dataset Collection 
Organisation University of Toronto
Department Department of Computer Science
Country Canada 
Sector Academic/University 
PI Contribution Collaboration to collect the largest cross-location dataset of egocentric non-scripted daily activities
Collaborator Contribution Effort and time of the partners (Dr Sanja Fidler and Dr Giovanni Maria Farinella), in addition to the time of their research team members (Dr Antonino Furnari and Mr David Acuna)
Impact Dataset annotation ongoing and paper draft in preparation
Start Year 2017
 
Description Naver Labs Europe 
Organisation NAVER LABS Europe
Country France 
Sector Public 
PI Contribution Research internship secured for PhD student over summer 2017
Collaborator Contribution Work was carried out to finalise the details of a PhD student visit to XRCE over the summer, working on semantic embedding for action recognition. During this work, XRCE was acquired by Naver Labs Europe. This has resulted in one publication (ICCV 2019). Another student (Jonathan Munro) is also following up with an internship in April 2020, continuing this collaboration.
Impact Agreement signed, internship details finalised.
Start Year 2017
 
Description Naver Labs Europe 
Organisation Xerox Corporation
Department Xerox Research Centre Europe - XRCE
Country France 
Sector Private 
PI Contribution Research internship secured for PhD student over summer 2017
Collaborator Contribution Work was carried out to finalise the details of a PhD student visit to XRCE over the summer, working on semantic embedding for action recognition. During this work, XRCE was acquired by Naver Labs Europe. This has resulted in one publication (ICCV 2019). Another student (Jonathan Munro) is also following up with an internship in April 2020, continuing this collaboration.
Impact Agreement signed, internship details finalised.
Start Year 2017
 
Description University of Oxford - Audio-visual Fusion for Egocentric Videos 
Organisation University of Oxford
Department Department of Engineering Science
Country United Kingdom 
Sector Academic/University 
PI Contribution Shared publication and code base with Prof Zisserman and PhD student Arsha Nagrani
Collaborator Contribution ICCV 2019 publication and code base
Impact (2019) E Kazakos, A Nagrani, A Zisserman, D Damen. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. International Conference on Computer Vision (ICCV). (2021) E Kazakos, A Nagrani, A Zisserman, D Damen. Slow-Fast Auditory Streams for Audio Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021) E Kazakos, J Huh, A Nagrani, A Zisserman, D Damen. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition. British Machine Vision Conference (BMVC).
Start Year 2018
 
Title EPIC-Kitchens Starters Kit 
Description Starter Toolkit for using EPIC-Kitchens Dataset 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Used in open challenges for the dataset 
URL https://github.com/epic-kitchens/starter-kit-action-recognition
 
Description Chair - BMVA Symposium on Transfer Learning in Computer Vision 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Computer Vision community is in need of moving beyond dataset or task-specific methods towards those that can efficiently adapt to new tasks or domains in a supervised, semi-supervised or unsupervised manner. We aim in this technical meeting to bring together leading researchers, at various levels in their career, with expertise or strong interest in TL for Computer Vision problems, in order to discuss current challenges and propose future directions including potentially establishing a continuous forum or a workshop series.

The symposium invited keynote speakers and researchers to present short talks and posters that address the motivation, methodologies, challenges and applications of using TL in Computer Vision. The day concluded with an hour of discussions by key researchers, with conclusions to be published in a report by BMVA
Year(s) Of Engagement Activity 2017
URL https://www.cs.bris.ac.uk/~damen/TLCV/
 
Description Conference Keynote: A Fine-grained Perspective onto Object Interactions from First-person Views 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote at the international VISIGRAPP conference in Prague, targeting researchers in both academia and industry with interests in computer vision, visualisation and graphics.
Year(s) Of Engagement Activity 2019
URL http://www.visigrapp.org/KeynoteSpeakers.aspx?y=2019#4
 
Description Invited Talk - BMVA symposium on Analysis and Processing of RGBD Data 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Dr Dima Damen gave an invited talk at the BMVA symposium on Analysis and Processing of RGBD Data in London. The talk focused on challenges and opportunities for Action and Activity Recognition using RGBD Data, alongside two prominent professors in the UK (Prof Ling Shao, University of East Anglia and Prof Adrian Hilton, University of Surrey). The day was well attended by graduate students, academics and representatives of industry.
Year(s) Of Engagement Activity 2017
URL https://www.eventbrite.co.uk/e/bmva-technical-meeting-analysis-and-processing-of-rgbd-data-registrat...
 
Description Pint of Science - What can a Wearable Camera Know About Me? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Public talk with the following abstract: "Mobile cameras are everywhere; wearable cameras are coming! Current computer vision technology can summarise your day, figure out your routine, even teach you how to perform a new task and remind you if you forgot to switch off the hob after cooking. What are the potentials and limitations of such technology? How mature is it, and when does it fail? This talk will not discuss privacy concerns. It offers a bright outlook into our tech-enhanced future." The talk resulted in an active debate on the potential and limitations of the current technology.
Year(s) Of Engagement Activity 2017
URL https://pintofscience.co.uk/event/rage-against-the-machine-vision
 
Description Scaling Egocentric Vision 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote at the Extreme Vision Modelling Workshop, alongside the International Conference on Computer Vision (ICCV) in South Korea. The conference had 7,000 attendees, and my talk was attended by around 200 of them, including academics, postgraduate students and industry representatives.
Year(s) Of Engagement Activity 2019
URL https://sites.google.com/view/extremevision
 
Description The Lifetime of an Object in Egocentric Vision 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Talk on long-term monitoring of object interactions at an international workshop alongside ICCV.
Year(s) Of Engagement Activity 2017
URL http://www.eyewear-computing.org/EPIC_ICCV17/program.asp