Exploiting narrative structure in the generation of audio description of video
Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics
Abstract
We are interested in the task of semi-automatically generating audio
descriptions for video. Audio description is "an additional audio
commentary developed primarily to enable people who are blind or have
sight loss to access audiovisual content" (Ofcom UK Accessibility
Guidelines, 2024). Major UK broadcasters are legally required to audio
describe 10% of their programmes, and in line with policies to make more
digital content accessible, this is expected to expand in the near
future. A sizeable creative industry already exists to produce audio
description. A single show can take multiple days for a team to describe
- it is a skilled task that goes beyond identifying actions in the
current scene, as it draws not only the video, but also knowledge of the
script, characterisation and the overall narrative.
Audio description has received computational treatment from the computer
vision community. There exist systems that take short clips as input and
generate verbal descriptions. The state-of-the-art approach involves
encoding the visual frames with one neural network (the visual encoder)
and learning to decode into the verbal domain (with a large language
model). Such systems have been augmented with surrounding
dialogue/narration, other audio and external knowledge sources (e.g.,
knowledge of casting and images of the actors). These systems are a
major milestone for the task. But the resulting (stitched-together)
audio description is not engaging: it encodes no sense of narrative, a
central component of any story. This project aims to tackle this
problem. We ask: what data structures can narratives take such that they
are (a) learnable by automatic methods and (b) useful to the task of
generating audio description?
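As a concrete (and deliberately simplified) illustration of the
encoder-decoder pipeline described above, the sketch below wires a
stand-in visual encoder to a small autoregressive decoder in PyTorch.
All module sizes and names are placeholder assumptions, and the toy
decoder stands in for a pretrained LLM; this is not the architecture
of any particular published system.

```python
# Minimal sketch, assuming placeholder dimensions and a toy decoder in
# place of a pretrained LLM; not the implementation of any cited system.
import torch
import torch.nn as nn

class ClipDescriber(nn.Module):
    def __init__(self, vis_dim=512, lm_dim=768, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained visual encoder: maps per-frame
        # features into the decoder's embedding space.
        self.visual_proj = nn.Linear(vis_dim, lm_dim)
        # Stand-in for a large language model decoder.
        self.token_emb = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerDecoderLayer(d_model=lm_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, frame_feats, token_ids):
        # frame_feats: (batch, n_frames, vis_dim); token_ids: (batch, seq).
        memory = self.visual_proj(frame_feats)   # visual "prefix" tokens
        tgt = self.token_emb(token_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(
            token_ids.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(hidden)              # next-token logits

model = ClipDescriber()
logits = model(torch.randn(1, 16, 512), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 32000])
```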
In this project, the novel engineering will be to develop
self-supervised methods that model video-form narrative. Possible
directions include operationalizing theoretical approaches to narrative
structure or modelling the causal relationships that build up a
narrative. These methods will serve as an efficient way to encode
narrative for the task of generating audio description; they may also
serve to validate particular theories of narrative. Our ultimate goal is to
semi-automatically generate compelling, sensitive and perhaps even
personalized audio description.
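One possible form such a self-supervised objective could take (an
illustrative assumption on our part, not a committed design) is
next-scene prediction: a scorer is trained to prefer the true
continuation of a story over a scene drawn from elsewhere, which
forces its scene embeddings to capture narrative progression.

```python
# Minimal sketch of a hypothetical next-scene objective; the scene
# embeddings, dimensions and margin are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextSceneScorer(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Bilinear compatibility between story-so-far and candidate scene.
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, context, candidate):
        return self.score(context, candidate).squeeze(-1)

scorer = NextSceneScorer()
ctx = torch.randn(8, 768)                # embeddings of the story so far
true_next = torch.randn(8, 768)          # embeddings of the true next scenes
neg_next = true_next[torch.randperm(8)]  # stand-in negatives; in practice,
                                         # scenes from other programmes

# Margin loss: true continuations should outscore the negatives.
loss = F.relu(1.0 - scorer(ctx, true_next) + scorer(ctx, neg_next)).mean()
loss.backward()
```

In practice the scene embeddings would come from a visual encoder such
as the one sketched earlier, and the bilinear scorer is simply the
smallest compatibility function that could be swapped for something richer.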
People
| Name | Role | ORCID iD |
|---|---|---|
| Igor Sterner | Student | |
Studentship Projects
| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/W524384/1 | | | 30/09/2022 | 29/09/2028 | |
| 2923920 | Studentship | EP/W524384/1 | 31/08/2024 | 29/02/2028 | Igor Sterner |