ROSSINI: Reconstructing 3D structure from single images: a perceptual reconstruction approach

Lead Research Organisation: University of Surrey
Department Name: Vision, Speech and Signal Processing (CVSSP)

Abstract

Consumers enjoy the immersive experience of 3D content in cinema, TV and virtual reality (VR), but it is expensive to produce. Filming a 3D movie requires two cameras to simulate the viewer's two eyes. A common but expensive alternative is to film a single view, then have video artists create the left- and right-eye views in post-production. What if a computer could automatically produce a 3D model (and binocular images) from 2D content: 'lifting images into 3D'? This is the overarching aim of this project. Lifting into 3D has many uses, such as route planning for robots and obstacle avoidance for autonomous vehicles, alongside applications in VR and cinema.

Estimating 3D structure from a 2D image is difficult because, in principle, the image could have been created by an infinite number of 3D scenes. Identifying which of these possible worlds is correct is very hard, yet humans interpret 2D images as 3D scenes all the time. We do this every time we look at a photograph, watch TV or gaze into the distance, where binocular depth cues are weak. Although we make some errors in judging distances, our ability to quickly understand the layout of a scene enables us to navigate through and interact with almost any environment.

Computer scientists have built machine vision systems for lifting to 3D by incorporating scene constraints. A popular technique is to train a deep neural network with a collection of 2D images and associated 3D range data. However, to be successful this approach requires a very large dataset, which can be expensive to acquire. Furthermore, performance is only as good as the dataset's coverage: if the system encounters a type of scene or geometry that is not represented in the training data, it will fail. Most methods have been trained for specific situations - e.g. indoor or street scenes - and are typically less effective for rural scenes, and less flexible and robust than humans. Finally, such systems provide a single reconstructed output without any measure of uncertainty. The user must assume that the 3D reconstruction is correct, which can be a costly assumption.
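For illustration only, the supervised approach described above amounts to minimising a pixel-wise error between the network's predicted depth map and the measured range data. The sketch below (plain Python/NumPy, not the project's own method) shows one such loss, the mean absolute depth error; the validity mask is an assumption reflecting the common case where many pixels have no range reading.

```python
import numpy as np

def depth_l1_loss(pred, gt, valid=None):
    """Mean absolute error between predicted and ground-truth depth maps.

    `valid` masks out pixels with no range data (typical of LiDAR- or
    Kinect-derived ground truth, where coverage is incomplete).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if valid is None:
        valid = np.ones_like(gt, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())

# Toy example: a 2x2 "image" where one pixel has no range reading (0.0).
gt = np.array([[2.0, 4.0], [6.0, 0.0]])
pred = np.array([[2.5, 3.0], [6.0, 1.0]])
mask = gt > 0
print(depth_l1_loss(pred, gt, mask))  # 0.5
```

Training then consists of adjusting the network's weights to drive this loss down over the dataset, which is why coverage matters: scene types absent from the data contribute nothing to the loss and are never learned.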

Computer vision systems are designed and evaluated based on their accuracy with respect to the real world. However, the ultimate goal of lifting into 3D is not perfect accuracy; rather, it is to deliver a 3D representation that provides a useful and compelling visual experience for a human observer, or that guides a robot around obstacles. Importantly, humans are expert at interacting with 3D environments, even though our perception can deviate substantially from true metric depth. This suggests that human-like representations are both achievable and sufficient, across all environments.

ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. Focussing on static images, we will identify key structural features that are important to humans. We will combine neural networks with computer vision methods to form human-like descriptions of scenes and 3D scene models. Our aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct, (ii) do so for all types of scene, and (iii) express the uncertainty inherent in each reconstruction. To this end we will collect data on human interpretation of images and incorporate this information into our network. Our novel training method will learn from both humans and existing ground-truth datasets, with the training algorithm selecting the most useful human tasks (e.g. judging depth within a particular image) to maximise learning. Importantly, including human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset. Moreover, when fully trained, our system will produce 3D reconstructions alongside information about the reliability of its depth estimates.
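As a hedged sketch of how the ingredients above might fit together (the exact training scheme is the project's to define), human judgments are often collected as relative-depth comparisons ('which of these two points is closer?') and scored with an ordinal ranking loss, while ground-truth depth can be scored with an uncertainty-aware likelihood whose predicted scale doubles as a per-pixel reliability estimate. All function names, the margin, and the weighting below are illustrative assumptions, not the project's published formulation.

```python
import numpy as np

def ranking_loss(d_a, d_b, human_says_a_closer, margin=0.1):
    """Ordinal (hinge) loss on one human relative-depth judgment.

    Zero when the predicted depths d_a, d_b agree with the human's
    ordering by at least `margin`; grows as the ordering is violated.
    """
    sign = 1.0 if human_says_a_closer else -1.0
    return max(0.0, margin - sign * (d_b - d_a))

def laplace_nll(pred, gt, log_b):
    """Uncertainty-aware depth loss: negative log-likelihood of the
    ground truth under a Laplace distribution with predicted scale
    exp(log_b). Large predicted uncertainty down-weights the residual
    but pays a log penalty, so the network cannot claim ignorance
    everywhere; exp(log_b) is then a per-pixel reliability estimate."""
    b = np.exp(log_b)
    return float(np.mean(np.abs(pred - gt) / b + log_b))

# Combined objective over one image (the 0.5 weight is illustrative).
geometric = laplace_nll(np.array([2.0, 5.0]),   # predicted depths
                        np.array([2.2, 4.5]),   # ground-truth depths
                        np.array([0.0, 0.0]))   # predicted log-scales
ordinal = ranking_loss(d_a=2.0, d_b=5.0, human_says_a_closer=True)
total = geometric + 0.5 * ordinal
```

An active-learning loop of the kind described would then, at each round, present to human observers the image pairs where losses like these disagree most or where predicted uncertainty is highest, so that each judgment collected is maximally informative.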

Publications

 
Description Aston University 
Organisation Aston University
Department Department of Psychology
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaborative research as part of ROSSINI project with Dr Andrew Schofield at Aston University
Collaborator Contribution Collaborative, multidisciplinary research project
Impact Not yet, ongoing
Start Year 2019
 
Description Southampton University 
Organisation University of Southampton
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaborative research as part of ROSSINI project with Prof Wendy Adams at Southampton University
Collaborator Contribution Collaborative, multidisciplinary research project
Impact Not yet, ongoing
Start Year 2019
 
Description York University Canada 
Organisation York University Toronto
Country Canada 
Sector Academic/University 
PI Contribution Collaborative research as part of ROSSINI project with Prof James Elder at York University Canada
Collaborator Contribution Direct input into program of research at both an advisory and technical level
Impact Not yet, ongoing
Start Year 2019
 
Description One Day BMVA Symposium: 3D worlds from 2D images in humans and machines. 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact When humans view a photograph they perceive the 3D world that generated the image. They can, for example, describe the depth relationships between objects, plan a route through the scene and imagine the scene from a different viewpoint. This process is automatic and obligatory: for example, even though humans possess size constancy, they will readily misinterpret the size of a person in order to make sense of the rest of the scene as a 3D world. State-of-the-art computer vision systems are now also very good at reconstructing 3D layout from 2D images (3D uplift), although, unlike humans, they are often restricted to specific domains or require multiple views. This workshop considered recent developments in 3D uplift as well as our current knowledge of scene understanding in human vision.
Year(s) Of Engagement Activity 2020
URL https://britishmachinevisionassociation.github.io/meetings