Exploiting narrative structure in the generation of audio description of video

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

We are interested in the task of semi-automatically generating audio
descriptions for video. Audio description is "an additional audio
commentary developed primarily to enable people who are blind or have
sight loss to access audiovisual content" (Ofcom UK Accessibility
Guidelines, 2024). Major UK broadcasters are legally required to audio
describe 10% of their programmes, and in line with policies to make more
digital content accessible, this quota is expected to rise in the near
future. A sizeable creative industry already exists to produce audio
description. A single show can take a team multiple days to describe: it
is a skilled task that goes beyond identifying actions in the current
scene, as it draws not only on the video, but also on knowledge of the
script, characterisation and the overall narrative.

Audio description has received computational treatment from the computer
vision community. There exist systems that take short clips as input and
generate verbal descriptions. The state-of-the-art approach involves
encoding the visual frames with one neural network (the visual encoder)
and learning to decode into the verbal domain (with a large language
model). Such systems have been augmented with surrounding
dialogue/narration, other audio and external knowledge sources (e.g.,
knowledge of casting and images of the actors). These systems are a
major milestone for the task, but the resulting (stitched-together)
audio description is not engaging: it encodes no sense of narrative, a
central component of any story. This project aims to tackle this
problem. We ask: what data structures can narratives take such that they
are (a) learnable by automatic methods and (b) useful to the task of
generating audio description?
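
To make the pipeline described above concrete, here is a minimal sketch
of the encoder-decoder pattern, assuming pooled frame features are
projected into "prefix" embeddings that condition a small causal
language model. All module names, dimensions and the toy vocabulary are
illustrative stand-ins, not the actual systems discussed.

import torch
import torch.nn as nn

class VideoToTextSketch(nn.Module):
    """Hypothetical encoder-decoder: frame features become prefix
    embeddings that condition a toy causal language model."""

    def __init__(self, frame_dim=512, lm_dim=768, vocab_size=32000, n_prefix=8):
        super().__init__()
        # Visual encoder stand-in: maps per-frame features into the LM space.
        self.visual_encoder = nn.Sequential(
            nn.Linear(frame_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        # Pooled video representation -> a fixed number of prefix embeddings.
        self.prefix_proj = nn.Linear(lm_dim, n_prefix * lm_dim)
        self.n_prefix, self.lm_dim = n_prefix, lm_dim
        # Decoder stand-in: a small causal Transformer over a toy vocabulary.
        self.token_emb = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, frames, token_ids):
        # frames: (batch, n_frames, frame_dim); token_ids: (batch, seq_len)
        pooled = self.visual_encoder(frames).mean(dim=1)  # pool over frames
        prefix = self.prefix_proj(pooled).view(-1, self.n_prefix, self.lm_dim)
        x = torch.cat([prefix, self.token_emb(token_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=causal)
        return self.lm_head(h[:, self.n_prefix:])  # next-token logits

model = VideoToTextSketch()
logits = model(torch.randn(2, 16, 512), torch.randint(0, 32000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 32000])

In the systems surveyed, the visual encoder would be a pretrained video
backbone and the decoder a full large language model; prefix
conditioning is just one common way to bridge the two.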

In this project, the novel engineering will be to develop
self-supervised methods that model video-form narrative. Possible
directions include operationalising theoretical approaches to narrative
structure or modelling the causal relationships that build up a
narrative. These methods would provide an efficient way to encode
narrative for the task of generating audio description, and may also
serve to validate particular theories of narrative. Our ultimate goal is to
semi-automatically generate compelling, sensitive and perhaps even
personalised audio description.
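
As one illustration of what a self-supervised narrative objective could
look like, the sketch below trains a model to recover the original
order of shuffled scene embeddings, a pretext task whose supervision
comes from the video itself. The task, dimensions and names are
hypothetical, not the project's committed method.

import torch
import torch.nn as nn

class SceneOrderSketch(nn.Module):
    """Hypothetical pretext task: given shuffled scene embeddings,
    predict each scene's original position in the narrative."""

    def __init__(self, scene_dim=512, n_scenes=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(scene_dim, nhead=8, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.position_head = nn.Linear(scene_dim, n_scenes)  # logits over positions

    def forward(self, shuffled_scenes):
        # shuffled_scenes: (batch, n_scenes, scene_dim)
        return self.position_head(self.context(shuffled_scenes))

# One training step on synthetic data; the only "label" is the
# permutation we applied ourselves, which makes the task self-supervised.
model, loss_fn = SceneOrderSketch(), nn.CrossEntropyLoss()
scenes = torch.randn(4, 5, 512)  # embeddings of five scenes in story order
perm = torch.stack([torch.randperm(5) for _ in range(4)])
shuffled = torch.gather(scenes, 1, perm.unsqueeze(-1).expand(-1, -1, 512))
loss = loss_fn(model(shuffled).reshape(-1, 5), perm.reshape(-1))
loss.backward()
print(float(loss))

A model that solves this well must pick up on temporal and causal
regularities of the story, which is why ordering tasks are a common
proxy for narrative structure.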

People

Igor Sterner (Student)

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/W524384/1      |              |              | 30/09/2022 | 29/09/2028 |
2923920           | Studentship  | EP/W524384/1 | 31/08/2024 | 29/02/2028 | Igor Sterner