Generating video descriptions for visually impaired users

Lead Research Organisation: University of Oxford
Department Name: Engineering Science


Advancements in image and video understanding using deep learning, together with improvements in processing and generating text using machine learning, give rise to new opportunities. Such opportunities include, but are not limited to:

improving the results of retrieval systems - e.g. given a text query, finding the image or video best described by that text, or vice versa
generating denser captions for images and videos
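The text-to-video retrieval task above can be sketched as a nearest-neighbour search in a shared embedding space. This is a minimal illustration, assuming hypothetical text and video encoders have already produced fixed-size embedding vectors; the toy 4-dimensional vectors are invented for demonstration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_embedding, video_embeddings):
    """Rank videos by similarity to a text query in a shared embedding
    space. Returns video indices, most similar first."""
    scores = [cosine_similarity(query_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy example: in a real system these embeddings would come from
# learned text and video encoders (hypothetical here).
query = np.array([1.0, 0.0, 1.0, 0.0])
videos = [
    np.array([0.9, 0.1, 1.1, 0.0]),  # close to the query
    np.array([0.0, 1.0, 0.0, 1.0]),  # orthogonal to the query
]
print(retrieve(query, videos))  # index of the closest video comes first
```

The same ranking works in either direction: embedding a video and ranking candidate texts gives video-to-text retrieval.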

In a more practical setting, better text and image/video understanding can be put to use in helping visually impaired users "see" by generating audio descriptions of what is being depicted. This can have a great impact on people who are visually impaired since, at the moment, very little of the vast video content available online or offline (DVDs, cassettes) is accessible to such users. The main reason so little content is accessible is that captioning videos is expensive in terms of both money and time. Additionally, only a few people are properly trained to generate such descriptions: the describer needs to know the best time to add a description, avoid talking over dialogue or background music that might contain clues about what is going on, describe scenes objectively, describe only what is most relevant, and not give away future developments.
Therefore, using machine learning to aid this process could potentially have a great impact on making more content accessible to visually impaired users. Additionally, having computers better understand what is being shown on the screen can lead to improved retrieval systems. However, challenges still need to be overcome while working towards this goal.

Some of these challenges, which form the objectives of this research, are:

reducing bias in machine learning tasks - For example, bias can be very problematic for face identification systems, but it is also a problem when images are wrongly described.
better evaluating the correctness of descriptions - Metrics have been developed for evaluating machine-generated translations and machine-generated image descriptions; however, no evaluation metric specific to video captioning exists.
describing videos and images more densely - This might be achieved by using the compositionality of text to associate nouns and noun phrases with their visual counterparts in images or videos. Starting from these identified connections, external knowledge such as dictionary definitions of nouns can be used to improve descriptions.
obtaining good resources - These resources usually take the form of manually generated dense descriptions for images or videos; however, generating them is expensive. Additional useful data can come from scripts used when shooting films or TV shows, but these generally do not provide dense enough descriptions. Lastly, some dedicated websites provide described videos or images, which require processing before they can be put to good use.
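To make the evaluation challenge concrete, the sketch below computes clipped n-gram precision, the core ingredient of translation and captioning metrics such as BLEU. It is a minimal illustration, not any specific published metric; the example sentences are invented.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a candidate description against a
    reference, in the spirit of BLEU. Inputs are lists of tokens."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n])
                         for i in range(len(reference) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears
    # in the reference ("clipping").
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

cand = "a man walks into the room".split()
ref = "a man slowly walks into the dark room".split()
print(ngram_precision(cand, ref, n=1))  # 1.0: every word appears in the reference
print(ngram_precision(cand, ref, n=2))  # 0.6: only 3 of 5 bigrams match
```

A shortcoming visible even in this toy example is that n-gram overlap rewards surface matches rather than semantic correctness or good timing, which is part of why a video-specific metric is needed.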

Steps currently taken towards achieving these goals are:

detecting the best times to introduce descriptions, starting from existing video-caption datasets
testing various losses to better address bias
using cyclic learning to generate dense descriptions for videos in an unsupervised manner
collecting and generating more data in the form of video-caption pairs, and associating text with the actual parts of the video frames it describes
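The first step above, finding good moments for descriptions, can be sketched from caption timestamps alone: gaps between dialogue captions that are long enough to fit a spoken description are candidate slots. This is a simplified assumption (real systems would also consider background music and scene content); the interval data and the 2-second threshold are invented for illustration.

```python
def find_description_slots(caption_intervals, min_gap=2.0):
    """Find silent gaps between dialogue captions long enough for an
    audio description. `caption_intervals` is a list of (start, end)
    times in seconds, assumed sorted and non-overlapping."""
    slots = []
    for (_, end1), (start2, _) in zip(caption_intervals, caption_intervals[1:]):
        if start2 - end1 >= min_gap:
            slots.append((end1, start2))
    return slots

# Toy dialogue timings from a hypothetical captions file.
captions = [(0.0, 3.5), (4.0, 7.0), (10.5, 12.0)]
print(find_description_slots(captions, min_gap=2.0))  # [(7.0, 10.5)]
```

The 0.5-second pause between the first two captions is skipped because a description would not fit, while the 3.5-second gap before the last caption is kept.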



Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/R513295/1                                      01/10/2018   30/09/2023
2285346             Studentship    EP/R513295/1   01/10/2019   31/03/2023   Andreea-Maria Oncescu