Visual AI: An Open World Interpretable Visual Transformer

Lead Research Organisation: University of Oxford
Department Name: Engineering Science


With the advent of deep learning and the availability of big data, it is now possible to train machine learning algorithms for a multitude of visual tasks, such as tagging personal image collections in the cloud, recognizing faces, and 3D shape scanning with phones. However, each of these tasks currently requires training a neural network on a very large image dataset specifically collected and labelled for that task. The resulting networks are good experts for the target task, but they only understand the 'closed world' experienced during training and can 'say' nothing useful about other content; nor can they be applied to other tasks without retraining, nor do they have any ability to explain their decisions or to recognise their limitations. Furthermore, current visual algorithms are usually 'single modal': they 'close their ears' to other modalities (audio, text) that may be readily available.

The core objective of the Programme is to develop the next generation of audio-visual algorithms that do not have these limitations. We will carry out fundamental research to develop a Visual Transformer capable of visual analysis with the flexibility and interpretability of the human visual system, aided by the other 'senses' - audio and text. It will be able to continually learn from raw data streams without requiring the traditional 'strong supervision' of a new dataset for each new task, and to deliver and distill semantic and geometric information over a multitude of data types (for example, videos with audio, very large scale image and video datasets, and medical images with text records).

The Visual Transformer will be a key component of next generation AI, able to address multiple downstream audio-visual tasks, significantly superseding the current limitations of computer vision systems, and enabling new and far reaching applications.

A second objective addresses transfer and translation. We seek impact in a variety of other academic disciplines and industries which today greatly under-utilise the power of the latest computer vision ideas. We will target these disciplines to enable them to leapfrog the divide between what they use (or do not use) today, which is dominated by manual review and highly interactive frame-by-frame analysis, to a new era where automated visual analytics of very large datasets becomes the norm. In short, our goal is to ensure that the newly developed methods are used by industry and by academic researchers in other areas, and turned into products for societal and economic benefit. To this end, open source software, datasets, and demonstrators will be disseminated on the project website.

The ubiquity of digital images and videos means that every UK citizen may potentially benefit from the Programme research in different ways. One example is smart audio-visual glasses that can pay attention to a person talking, using their lip movements to mask out other ambient sounds. A second is an app that can answer visual questions (or retrieve matches) for text queries over large-scale audio-visual collections, such as a person's entire personal video library. A third is AI-guided medical screening that can aid a minimally trained healthcare professional in performing medical scans.

Planned Impact

The proposed programme encompasses new methodology and applied research in computer vision and other modalities (audio, text) that will enable analysis and search of image and video content while continually learning, with human-like flexibility and interpretability. These capabilities will encourage end-user take-up of computer vision technologies and commercial interest in embedding these technologies in products.

The Programme will have Economic and Societal impact by
1. Enabling UK industry to leverage AI in their activities with a key strategic advantage.
2. Developing new and improved computer vision technologies that will require substantially less training data to solve problems and are thus suitable for commercialisation by a wide range of companies.
3. Enhancing the visual and audio capabilities and knowledge base of UK industries, including small ones.
4. Enhancing quality of life by improving, for instance, healthcare capabilities, surveillance, environmental monitoring, and the means of accessing and enjoying personal digital media.
5. Reducing the cost and risk of collecting manual annotations for deploying AI technology, especially for sensitive data such as medical records.
6. Collaborating directly with companies and organizations that we have already identified, and will work with over the course of the Programme.
7. Training the next generation of computer vision researchers who will be equipped to support the imaging needs of science, technology and wider society for the future.

Impact on Knowledge includes
1. Realisation of new approaches to essential computer vision technology, and the dissemination of research findings through publications, conference presentations, summer school teaching, and the distribution of open source software and image databases.
2. Sharing knowledge with industrial collaborators via Transfer and Application Projects (TAPs) and other activities leading to adoption of advanced computer vision methods across many disciplines of science, engineering and medicine that currently do not use them.
3. Communication of advances to a public audience through website articles, Show and Tell events, social and broadcast media, and other co-ordinated public understanding activities.
