Person-centric Story Understanding

Lead Research Organisation: University of Oxford
Department Name: Engineering Science


Summary: The aim of this project is to advance personalised, human-centric understanding of narrative in videos. This project falls within the EPSRC Information and Communication Technologies research area.

Story understanding is a complex problem, relying not only on short-term, low-level knowledge (e.g. the action being performed right now), but also on the ability to infer higher-level semantic information such as human identities, motivations, and intents. Most work on story understanding aggregates information temporally -- that is, it summarises video segments over time and attempts to solve character-related tasks by reasoning over these temporal summaries. In contrast, our objective is to summarise videos by character rather than by time. The uses for a system capable of such reasoning are manifold: from improving the accessibility of video platforms by automatically generating a narrative for each person appearing in the many videos uploaded each day, to increasing productivity by enabling retrieval of video segments specific to individual characters.

As part of our work, we have developed a neural-network-based computational model that can retrieve and condense information about a specific individual. This is orthogonal to prevailing methods, which focus on generic (non-personal) understanding of such information. For example, most modern computer-vision methods, such as those used in smartphone photo libraries, can find images or videos on your phone where people are skiing, or can recognise all images of a particular person, but not both at the same time. Our method, in contrast, can find not just the videos where people are skiing, but specifically the videos where the phone owner is skiing. We achieve this by not only training the model to recognise every person appearing in the data, but also allowing it to infer identities from outside sources of information, such as images of the person. In other words, even if our model does not recognise the phone owner by name, a single image is enough to find the relevant images or videos. To evaluate this capability, we designed our own evaluation dataset, which can also be used to evaluate future methods developed in this area. This research was published at the British Machine Vision Conference 2022, where it was awarded an oral presentation.
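The idea of combining a content query with an identity query can be illustrated with a minimal retrieval sketch. This is not the published model; it is a hypothetical example assuming each clip already has an action embedding and a face embedding, and that a single query photo supplies the identity embedding. Clips are scored jointly, so only those matching both the action and the person rank highly.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalise(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings: 4 clips, 8-dimensional vectors (stand-ins for real features).
clip_action = normalise(rng.normal(size=(4, 8)))  # what happens in each clip
clip_face = normalise(rng.normal(size=(4, 8)))    # who appears in each clip

# Queries: an action description (e.g. "skiing") and a single photo of the
# person of interest. Here we reuse clip 2's embeddings as stand-in queries.
query_action = clip_action[2]
query_face = clip_face[2]

# Joint score: product of the two cosine similarities, so a clip must match
# both the action and the identity to score highly.
score = (clip_action @ query_action) * (clip_face @ query_face)
best = int(np.argmax(score))
print(best)  # clip 2 matches both queries
```

Multiplying the two similarities (rather than scoring on either alone) is what distinguishes "videos of this person skiing" from "videos of anyone skiing" or "any videos of this person".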
To challenge our methods further, we intend to extend them to longer temporal windows and eventually to entire movie plot lines, which will require designing a whole new set of algorithms and presents a substantial technological challenge. We also plan to advance human-centric story understanding in videos by developing our own tasks and datasets, as no benchmark datasets exist at present.



Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/R513295/1                                      01/10/2018   30/09/2023
2863164             Studentship    EP/R513295/1   01/10/2020   31/03/2024   Bruno Korbar
EP/T517811/1                                      01/10/2020   30/09/2025
2863164             Studentship    EP/T517811/1   01/10/2020   31/03/2024   Bruno Korbar