Creating a Powerful Navigation Tool for image and video datasets

Lead Research Organisation: University of Oxford

Department Name: Engineering Science

Abstract

Understanding the scene in an image is a trivial task for humans, and one that children master at a young age, yet object recognition and its extension to scene understanding remain the core problems that computer vision researchers are trying to solve. In recent years, the use of convolutional neural networks (a complex and powerful type of computer vision algorithm) and machine learning has resulted in computers rivalling humans at certain vision tasks.
The ability of a computer to understand who people are and what they are doing in images and video has many potential applications. For example, suppose you wanted to find the scene in a film where two particular characters are shouting at each other, or where one character is laughing; if a computer could understand human actions and interactions in a scene, then a powerful video navigation tool could be built. Among many other potential applications are smart glasses for autistic people, which could label human expressions and emotions for the wearer to help them better understand their surroundings. The aim of my research is to use convolutional neural networks and machine learning methods to create a powerful navigation tool for image and video datasets, by improving the ability of computers to understand who people are and what they are doing in a scene.
The key objectives of my research are to train and implement computer vision algorithms for recognising identities using facial recognition, for recognising human pose and actions, and for recognising human emotions and interactions. These will culminate in a powerful navigation tool for image and video datasets that can understand complex instructions from a human user with regard to a particular scene that they want to find. The project aims to answer the question: can computers understand a scene of human action and interaction as well as a human can?
The novel engineering methodology in this project is two-fold. Firstly, some of these objectives have either never been tackled by computer vision researchers or have hardly been tackled, such as teaching computers to recognise interactions between multiple humans. The work will therefore involve curating novel datasets (which will be made freely available to the international research community) for training algorithms, and establishing the first benchmark results. Secondly, I will improve upon current standards for more popular tasks such as facial recognition, which will involve researching novel neural network architectures and machine learning techniques in order to perform these tasks better.
This project falls within the EPSRC 'engineering' research area. No companies or collaborators are involved.

Student:

Andrew Brown

Period of Study:

Sep 18 - Mar 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2118163

Research Topic:

Unclassified

Organisations

University of Oxford (Lead Research Organisation)

People	ORCID iD
Andrew Zisserman (Primary Supervisor)
Andrew Brown (Student)

Publications

Author Name

Title Publication Date Published

Brown A (2020) Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/N509711/1			30/09/2016	29/09/2021
2118163	Studentship	EP/N509711/1	30/09/2018	30/03/2022	Andrew Brown
EP/R513295/1			30/09/2018	29/09/2023
2118163	Studentship	EP/R513295/1	30/09/2018	30/03/2022	Andrew Brown

Key Findings
Impact Summary


Description	Several discoveries have been made as a result of the work funded through this award. These discoveries are reported across multiple publications that have been accepted to conferences as a result of this funding. The goal of the work funded by this award is to create a powerful video and image navigation tool using machine learning and computer vision technology, specifically neural networks. The main discovery that we made relates to the training of neural networks. Neural networks are 'trained' by showing them large amounts of data (images or video) and giving them a 'learning objective' for this data. For example, the 'learning objective' can be to predict the breed of a dog, given many images of dogs as input data. The neural networks can then learn to perform this same task on unseen data. Our goal was to train neural networks to retrieve images of an object or item, given a single image of that item. In relation to the goal of the award-funded work, this makes it possible to navigate video and image collections effectively through image matching: the user shows an image to the neural network, and the neural network navigates to the point in the video or image collection that shows the same object or item. We discovered a new mathematical formulation of a learning objective for training neural networks, with which the networks outperformed all prior work.
Exploitation Route	Our research into training neural networks can be used by the wider community for implementing state-of-the-art video and image navigation tools. Importantly, all of the software associated with the research funded by this award has been made open source. This means that anyone can access it and use it for their research, or in their own video and image navigation software.
Sectors	Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Healthcare; Culture, Heritage, Museums and Collections
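
The image-matching navigation described under Key Findings can be illustrated with a minimal sketch. It assumes that a trained network has already mapped the query image and every frame in the collection to a feature vector (an embedding); the vectors, frame names, and function names below are made up for illustration and are not taken from the project's software. Frames are ranked by cosine similarity to the query, and the ranking is scored with average precision, a common retrieval metric. The sketch does not reproduce the learning objective developed in the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery keys ordered from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )


def average_precision(ranking, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, name in enumerate(ranking, start=1):
        if name in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


# Hypothetical embeddings: in practice these would come from a trained network.
gallery = {
    "frame_012": [0.9, 0.1, 0.0],
    "frame_087": [0.1, 0.9, 0.2],
    "frame_145": [0.8, 0.2, 0.1],
    "frame_230": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

# Frames assumed (for this toy example) to show the same item as the query.
relevant = {"frame_012", "frame_145"}

ranking = rank_gallery(query, gallery)
score = average_precision(ranking, relevant)
print(ranking, score)
```

In this toy example the two frames assumed to show the query item are ranked first, so the average precision is at its maximum. A learning objective for retrieval aims to train the network so that real embeddings behave this way on unseen images.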


Description	Our research into person-oriented video retrieval has been used by the British Library (https://blogs.bl.uk/digital-scholarship/2020/10/bl-labs-public-award-runner-up-research-2019-automated-labelling-of-people-in-video-archives.html). The research in question is a method for automatically labelling people in videos, so that the videos can be navigated quickly and efficiently simply by searching for people. The British Library have very large video archives. These archives serve as an important and fascinating resource for researchers and the general public alike. However, the sheer scale of the data, coupled with a lack of relevant metadata, makes indexing, analysing and navigating this content an increasingly difficult task. Relying on human annotation is no longer feasible, and without an effective way to navigate these videos, this bank of knowledge is largely inaccessible. The British Library used our method for automated person labelling in order to navigate, and therefore make use of, these large and important archives.
First Year Of Impact	2020
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Cultural
