Detailed and Deep Image Understanding

Lead Research Organisation: University of Oxford
Department Name: Engineering Science


Computer vision, the technology that allows machines to understand the content of images automatically, is fuelling a revolution in digital image processing. For example, it is now possible to use computers to search billions of images and millions of hours of video on the Internet for particular content (Google Goggles), interpret gestures and body motions to play games (Microsoft Kinect), automatically focus cameras on faces, or build smart cameras that can monitor hazardous industrial equipment around the clock.

If not for their scale, these tasks would appear trivial to a human. However, vision is computationally very challenging, to the extent that more than half of our brain is dedicated to this function alone. Since this complexity cannot be met by hand-crafting software, vision architectures are nowadays learned automatically from millions of example images, leveraging advanced machine learning and optimisation technologies. Despite recent terrific successes, however, machine vision still pales in comparison to vision in humans. Probably the most disappointing restriction is that these systems can address only a single task at a time, such as deciding whether a particular image contains, say, a person. Recognising a different concept, for example a dog, or addressing a different task, for example outlining rather than recognising the person, requires learning a new system from scratch, wasting time and effort.

My research idea is to transform existing architectures into repositories of 'visual knowledge' that can be reused and extended incrementally to address multiple tasks and domains, greatly improving the efficiency, scalability, and flexibility of the technology. The key scientific challenge is to understand how visual information is encoded in state-of-the-art vision systems. Since these systems are learned automatically rather than hand-crafted, it is currently unclear what information they capture and how it is represented. An in-depth investigation will explicate this formally and quantitatively, and will provide the basis for sharing and integrating visual knowledge between a growing number of concepts and tasks, including ones not addressed by the initial design of the system. At the same time, identifying fine-grained information will allow a system to obtain a more detailed, comprehensive, and meaningful understanding of the content of images.

The potential for impact is huge as the proposed research will enhance core computer vision technology that already powers countless applications. For example, computers will be able to search images by matching more detailed queries expressed using a far richer visual vocabulary; software will be extensible to new domains and tasks with minimal effort; and computer vision systems will be able to explain in explicit, intuitive terms how they understand images.

The research outcomes will be evaluated in the most rigorous manner on international benchmark data and protocols. Research results will be made available to a widespread technical audience by distributing open source software implementing the new technology. The project is also likely to have a strong academic impact, consolidating the leadership of the UK in computer vision, a strategic competitive area in the digital economy.

Planned Impact

Automatic image understanding has a tremendous impact on a wide spectrum of cutting-edge applications. Recent computer vision and machine learning breakthroughs have made it possible for companies such as Google and the BBC to offer tools that can search and organise very large media collections automatically, in some cases as large as the whole Internet (Google Goggles); at the same time, this technology can be used to index personal photo collections (Google+). Vision technologies are now deployed in public spaces in intelligent surveillance cameras that can automatically match thousands of people to a particular description. The Microsoft Kinect sensor, which was largely developed in the UK, allows players to interact with video games using gestures and body motion rather than a controller. Major enterprises that have historically relied on manual inspection to verify the safety of their plants are now looking at computer vision as a way to make visual inspection work continuously, on a large scale, and on a quantitative foundation. These are just a few examples; by constructing machines that can interpret the content of images automatically, we can open major avenues to innovation in countless application domains and create new business opportunities.

The key benefit of the proposed research is the creation of a new generation of computer vision systems that will be more powerful and flexible, capable of efficient adaptation to an ever-expanding array of problems and application domains. These systems will also be capable of extracting more refined information from images; for example, where current technology may be able to detect a person, a vehicle, or some other object in an image, the proposed advances will make it possible to extract detailed information about these objects as well, such as particular facial or body features, or whether the wheels of a car are steered in a particular direction. The impact of these advances is broad: for example, in a content search application it will be possible to formulate more refined queries to pinpoint exactly the desired content, or to index new types of searchable content with minimal modification to an existing system. In an industrial inspection application, faults may not only be detected, but also diagnosed and explained to an engineer by highlighting salient features in an image.

Areas that can ultimately benefit from the proposed advances are exemplified by current collaborations of the PI with industrial partners such as XRCE/Xerox, BBC, and BP. The PI is transferring existing state-of-the-art vision technology to these partners to: recognise detailed properties of objects such as the breed of an animal (XRCE), search large-scale video databases for specific content (BBC), and perform industrial monitoring (BP). The proposed research will enable substantial improvements in the underlying techniques, ultimately supporting finer-grained characterisation of visual objects, an understanding of more types of visual content, and the ability to mark image features relevant to a particular visual assessment. The PI will pursue follow-up collaborations with these partners to create practical applications of the technology developed in this proposal to the EPSRC.

A successful outcome of this research will have not just a national but also an international impact. As suggested above, this research is likely to be of interest to international business and research centres such as XRCE/Xerox. At the same time, it is likely to attract the interest of the international academic community.


Description This project focused on developing artificial intelligence technologies for the automatic understanding of images. Recently, a new generation of artificial neural networks, called deep networks, has revolutionised the field, demonstrating exceptional performance in classifying images by content. This project investigated the extent of the understanding generated by such models. In particular, our goal was to demonstrate that, in order to perform image classification, deep neural networks implicitly learn that images are composed of individual objects, leading to object recognition, and can capture not just static information but also dynamics, leading to action recognition.

There are several technical outcomes of this research, published in top-tier international venues. One significant result is a novel neural network architecture that can automatically learn about objects in the world given only a weak supervisory signal at the level of the image as a whole. This problem, called weakly-supervised detection, is significant because a major bottleneck in image understanding is our ability to explicitly teach algorithms to recognise hundreds of thousands of object types. This bottleneck is removed if algorithms can learn about objects with little or no supervision, for example by browsing the Internet or watching videos automatically. Our ideas are supported by hard evidence, as our algorithm has achieved state-of-the-art performance in weakly-supervised detection. We also showed that similar neural networks can learn powerful models of dynamics in videos that also lead to state-of-the-art results in understanding human actions.
Exploitation Route This research is likely to have a high impact in the area of computer vision, the discipline that studies the problem of automatic image understanding. A pre-print of our latest work on weakly-supervised learning of visual objects has already been noticed by the international community and will be presented at top-tier venues this summer.

More broadly, our research is a stepping stone towards increasingly powerful automatic computer vision systems, which will be a key component of future applications in Internet-scale image search, augmented reality, medical imaging, autonomy, and similar areas. These systems are required to learn about thousands of different object categories in order to make sense of the general imagery encountered in applications; by showing that modern neural networks can learn automatically about such objects, we have demonstrated that these methods are plausible candidates to scale up to current and future application challenges.
Sectors Digital/Communication/Information Technologies (including Software),Electronics

Description Our work on weakly-supervised object detection was noted by Continental, a leading automotive business, due to their interest in object detection combined with the availability of massive amounts of unsupervised data collected from their test vehicles. We are now engaged with Continental in a multi-year project on extending our research on weakly-supervised learning to their particular area of interest: intelligent transportation. The postdoctoral researcher who was hired to work on this project is now a faculty member in engineering science at the University of Edinburgh.
First Year Of Impact 2017
Sector Transport
Impact Types Cultural,Societal,Economic

Description ERC Starting Grant
Amount € 1,500,000 (EUR)
Organisation European Research Council (ERC) 
Sector Public
Country European Union (EU)
Start 08/2015 
End 09/2020