Generating video descriptions for visually impaired users

Lead Research Organisation: University of Oxford
Department Name: Engineering Science

Abstract

Advances in image and video understanding based on deep learning, together with improvements in processing and generating text using machine learning, give rise to new opportunities. These opportunities include, but are not limited to:

improving the results of retrieval systems - e.g. given a text query, finding the image or video best described by that text, or vice versa (a minimal sketch of this retrieval step follows this list)
generating denser captions for images and videos
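
As a rough illustration of the retrieval opportunity above, the sketch below ranks a collection of videos against a text query by cosine similarity in a shared embedding space. The embeddings here are random stand-ins; a real system would obtain them from trained text and video encoders that project both modalities into the same space.

    import numpy as np

    def rank_videos(text_embedding, video_embeddings):
        """Return video indices sorted from best to worst match for the text query.

        text_embedding:   (d,) vector from a text encoder (random stand-in here)
        video_embeddings: (n, d) matrix, one row per video, from a video encoder
        Both are assumed to live in the same learned joint embedding space.
        """
        # L2-normalise so that the dot product equals cosine similarity
        t = text_embedding / np.linalg.norm(text_embedding)
        v = video_embeddings / np.linalg.norm(video_embeddings, axis=1, keepdims=True)
        similarities = v @ t               # (n,) similarity of each video to the query
        return np.argsort(-similarities)   # best match first

    # Toy usage with random embeddings in place of real encoder outputs
    query = np.random.randn(512)
    videos = np.random.randn(1000, 512)
    top_five = rank_videos(query, videos)[:5]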

In a more practical setting, better text and image/video understanding can be put to use in helping visually impaired users "see" by generating audio descriptions of what is being depicted. This can have a great impact on people who are visually impaired since, at the moment, very little of the vast video content available online or offline (DVDs, cassettes) is accessible to such users. The main reason so little accessible content exists is that describing videos is expensive, in both money and time. Additionally, only a few people are properly trained to generate such descriptions. This is because the describer needs to know the best time to add a description, must not speak over dialogue or background music that might contain clues about what is going on, must describe scenes objectively, must describe only what is most relevant, must not give away future developments, and so on.
Therefore, using machine learning to aid this process could have a great impact on making more content accessible to visually impaired users. Additionally, having computers better understand what is being shown on screen can lead to improved retrieval systems. However, several challenges still need to be overcome in working towards this goal.

Some of these challenges, which form the objectives of this research, are:

reducing bias in machine learning tasks - For example, bias can be very problematic for face identification systems, but it is also a problem when images are described incorrectly.
better evaluating the correctness of descriptions - Metrics have been developed for evaluating machine-generated translations and machine-generated image descriptions (e.g. BLEU, METEOR, CIDEr). However, no evaluation metric specific to video captioning exists (a simple illustrative metric is sketched after this list).
describing videos and images more densely - This might be achieved by using the compositionality of text to associate nouns and noun phrases with their visual counterparts in the images or videos. Starting from these identified connections, external knowledge such as dictionary definitions of nouns can be used to improve descriptions.
obtaining good resources - These resources usually take the form of manually generated dense descriptions for images or videos, but producing them is expensive. Additional useful data can come from scripts used when shooting movies or TV shows, although these generally do not provide dense enough descriptions. Lastly, some dedicated websites provide described videos or images, which require processing before they can be put to good use.
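
As a hedged illustration of the evaluation challenge above, the sketch below computes a simple clipped n-gram precision between a generated caption and a single reference caption, in the spirit of BLEU. Real captioning metrics (BLEU, METEOR, CIDEr) add smoothing, brevity penalties, multiple references and consensus weighting, and, as noted, none of them is designed specifically for video.

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def ngram_precision(candidate, reference, n):
        """Clipped n-gram precision of a candidate caption against one reference."""
        cand = ngrams(candidate.lower().split(), n)
        if not cand:
            return 0.0
        ref_counts = Counter(ngrams(reference.lower().split(), n))
        cand_counts = Counter(cand)
        # Each candidate n-gram is credited at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        return matches / len(cand)

    candidate = "a man is playing a guitar on stage"
    reference = "a man plays the guitar on a stage"
    print(ngram_precision(candidate, reference, 1))  # unigram precision
    print(ngram_precision(candidate, reference, 2))  # bigram precision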


Steps currently being taken towards these goals are:

detecting the best times to introduce descriptions, starting from existing video-caption datasets
testing various losses to better address bias (one such loss is sketched after this list)
using cyclic learning to generate dense descriptions for videos in an unsupervised manner
searching for and generating more data in the form of video-caption pairs, and associating text with specific parts of the video frames
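
One simple family of losses relevant to the bias objective above reweights training examples so that under-represented classes or groups contribute more to the objective. The sketch below shows a class-weighted cross-entropy in PyTorch as a minimal, hedged example of this idea; it is an illustration only, not the specific set of losses being tested in the project.

    import torch
    import torch.nn.functional as F

    def weighted_cross_entropy(logits, targets, class_counts):
        """Cross-entropy with inverse-frequency class weights.

        logits:       (batch, num_classes) raw model outputs
        targets:      (batch,) integer class labels
        class_counts: (num_classes,) how often each class occurs in the training data
        """
        # Rarer classes receive larger weights, so errors on them are penalised more
        weights = class_counts.sum() / (class_counts.float() + 1e-8)
        weights = weights / weights.sum()
        return F.cross_entropy(logits, targets, weight=weights)

    # Toy usage: a heavily imbalanced 4-class problem
    logits = torch.randn(8, 4)
    targets = torch.randint(0, 4, (8,))
    counts = torch.tensor([1000, 500, 50, 10])
    loss = weighted_cross_entropy(logits, targets, counts)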

Publications

Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/R513295/1                                      01/10/2018   30/09/2023
2285346             Studentship    EP/R513295/1   01/10/2019   30/06/2023   Andreea-Maria Oncescu
 
Description Through this award I have focused on better understanding multimodal content (video, audio) through text. During my first years of funding I collected two new datasets that allow other researchers to train and test their video/audio understanding models. Additionally, together with my supervisors I proposed a new task of audio retrieval using free-form text. This means that users can now use our work - for which we have also created a demo accessible online - to search through large collections of audio content that have no labels. The search works by understanding the content of the audio files and relating it to the words the user has searched for. Although this kind of search is already common on platforms such as Google Search or YouTube, which let users find relevant text/video content, no similar work had been done for audio files in the deep learning era. Our work was accepted and presented at conferences (ICASSP 2021, INTERSPEECH 2021) and workshops (WASPAA 2021) and published in a journal (IEEE Transactions on Multimedia, 2022). Our audio search work was also shortlisted for the best student paper award at INTERSPEECH 2021. We are currently working on expanding our research to return even more relevant search results for both audio and video queried with free-form text.
Exploitation Route The outcomes of this funding will help build higher-accuracy search engines for videos, audio, and images using text, and will allow better models for generating descriptions of video and audio content, which can be used to help visually impaired or hearing impaired users understand content that is otherwise inaccessible to them. Therefore, this research can be used for entertainment purposes (e.g. film makers can generate audio descriptions), to search large collections of image/video data (e.g. in museum archives), and to help communities of people with various impairments enjoy activities that might otherwise be inaccessible.
Sectors Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections

URL https://www.robots.ox.ac.uk/~oncescu/
 
Description My work on text-to-audio retrieval introduced a new research area. It raised awareness of how audio can help with video understanding through the use of text. A year later, more people had started working on this topic and improving results relative to the benchmark we initially proposed. To help popularise this task, we also created and presented a demo, accessible online, that gives a sense of the abilities and limitations of the original approach.
First Year Of Impact 2022
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Policy & public services