ROSSINI: Reconstructing 3D structure from single images: a perceptual reconstruction approach

Lead Research Organisation: Aston University

Department Name: College of Health and Life Sciences

Abstract

Consumers enjoy the immersive experience of 3D content in cinema, TV and virtual reality (VR), but it is expensive to produce. Filming a 3D movie requires two cameras to simulate the two eyes of the viewer. A common but expensive alternative is to film a single view, then use video artists to create the left and right eyes' views in post-production. What if a computer could automatically produce a 3D model (and binocular images) from 2D content: 'lifting images into 3D'? This is the overarching aim of this project. Lifting into 3D has multiple uses, such as route planning for robots, obstacle avoidance for autonomous vehicles, alongside applications in VR and cinema.

Estimating 3D structure from a 2D image is difficult because in principle, the image could have been created from an infinite number of 3D scenes. Identifying which of these possible worlds is correct is very hard, yet humans interpret 2D images as 3D scenes all the time. We do this every time we look at a photograph, watch TV or gaze into the distance, where binocular depth cues are weak. Although we make some errors in judging distances, our ability to quickly understand the layout of any scene enables us to navigate through and interact with any environment.

Computer scientists have built machine vision systems for lifting to 3D by incorporating scene constraints. A popular technique is to train a deep neural network with a collection of 2D images and associated 3D range data. However, to be successful, this approach requires a very large dataset, which can be expensive to acquire. Furthermore, performance is only as good as the dataset is complete: if the system encounters a type of scene or geometry that does not conform to the training dataset, it will fail. Most methods have been trained for specific situations - e.g. indoor, or street scenes - and these systems are typically less effective for rural scenes and less flexible and robust than humans. Finally, such systems provide a single reconstructed output, without any measure of uncertainty. The user must assume that the 3D reconstruction is correct, which will be a costly assumption in many cases.

Computer systems are designed and evaluated based upon their accuracy with respect to the real world. However, the ultimate goal of lifting into 3D is not perfect accuracy - rather it is to deliver a 3D representation that provides a useful and compelling visual experience for a human observer, or to guide a robot whilst avoiding obstacles. Importantly, humans are expert at interacting with 3D environments, even though our perception can deviate substantially from true metric depth. This suggests that human-like representations are both achievable and sufficient, in any and all environments.

ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. Focussing on static images, we will identify key structural features that are important to humans. We will combine neural networks with computer vision methods to form human-like descriptions of scenes and 3D scene models. Our aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct (ii) do so for all types of scene and (iii) express the uncertainty inherent in each reconstruction. To this end we will collect data on human interpretation of images and incorporate this information into our network. Our novel training method will learn from humans and existing ground truth datasets; the training algorithm selecting the most useful human tasks (i.e. judge depth within a particular image) to maximise learning. Importantly, the inclusion of human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset. Moreover, when fully trained, our system will produce 3D reconstructions alongside information about the reliability of the depth estimates.

Planned Impact

The principle non-academic beneficiaries of this work will be:

i) Film and television technology providers.
This group will benefit from new methods for generating and manipulating Film/TV and image content that can be built into post-production software suites. 3D content is a strong revenue stream for the film industry but it remains difficult and expensive to create and is not seen as offering a new way to 'tell stories' or transform the viewer experience. Thus 3D versions of films are often generated in post-production using Rotoscope methods in which humans manually identify objects to be shown in 3D relief from selected frames within the footage. A reliable automated method for lifting 2D content into 3D would eliminate this manual step while still allowing footage to be captured in 2D.

ii) Film and Television content providers, CGI and computer games industries.
This group will also benefit from an enhanced ability to create 3D footage / imagery / models from 2D content (see letter of support from DNeg). The project deliverables may speed up post-production pipelines and result in enhanced footage that is more visually acceptable to the viewer than footage developed with current automated methods. The decrease in time and cost associated with producing 3D film variants should allow more existing and new films and TV shows to be presented in 3D. Additionally, understanding of perceptual relevance will dramatically impact the real-time rendering process, allowing the games industry to reach wider audiences by achieving higher perceived fidelity with reduced computational resources.

iii) Virtual- and augmented-reality (VR,AR) equipment and software manufacturers.
This group will benefit from enhanced methods for generating 3D content. The project may provide solutions enabling end users to upload a 2D image for display in 3D within the VR or AR headset. In the case of augmented reality they will benefit from enhanced methods for stitching introduced 3D objects into the local real-world scene (see letter of support from Microsoft HoloLens). In the case where a distal object needs to be captured and transmitted for rendering into the local scene it may not be possible to capture the object in 3D or bandwidth limitations may reduce the update speed to unacceptable levels. Transmission of 2D imagery for lifting into 3D at the local terminal may be a better option. This same process would also allow archival 2D images of 3D objects to be converted to 3D for AR display.

iv) Manufacturers of intelligent robots / self-driven vehicles.
This group will benefit from a new method to create 3D scene descriptions for route planning and obstacle avoidance. Current robot and driverless technologies currently use a range of sensors including structure light and LiDAR sensors to establish 3D layout. However, these technologies remain expensive. Solutions that enable 3D layout to be inferred from a standard 2D camera would drastically reduce the cost of such systems, enabling low cost personal or home assistant robots (see letter of support from CrossWing) and reducing the cost barriers for self-driven vehicles.

v) Members of the public:
Members of the public will ultimately benefit from improved experiences with uplifted 3D content and improved interactions with robot and self-driven vehicle technology. They may also benefit from low cost household or personal assistant robots and cheaper self-driven cars.

vi) Research staff on the project.
Industry increasingly looks for employees with diverse skill sets that are relevant to multiple technologies. This project will provide the researchers involved with an excellent multi-disciplinary training, and skills with wide-ranging applications.

Funded Value:

£409,880

Funded Period:

Mar 19 - Jun 23

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/S016260/1

Principal Investigator:

Andrew Schofield

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Image & Vision Computing (67%)

Vision & Senses - ICT appl. (33%)

Organisations

People	ORCID iD
Andrew Schofield (Principal Investigator)

Publications

Author Name

Title Publication Date Published

10 25 50

Qian CS (2022) Surface attitude judgements with haptic and visual response.

Qian CS (2021) Surface Attitude Judgements in monocular and stereo textures: a method evaluation

Qian CS (2022) Surface attitude judgements in synthetic textures and natural images: a method evaluation.

Qian S (2022) Surface Attitude Judgements in synthetic textures and real-world images: a method evaluation in Journal of Vision

Skog E (2023) What surprises the Mona Lisa? The relative importance of the eyes and eyebrows for detecting surprise in briefly presented face stimuli. in Vision research

Skog Emil (2021) Eye regions dominate judgements of surprise in static face stimuli in PERCEPTION

Spencer J (2023) The Second Monocular Depth Estimation Challenge

Spencer J (2023) The Monocular Depth Estimation Challenge

Stella Qian C. (2021) Surface Attitude Judgements in monocular and stereo textures: a method evaluation in PERCEPTION

Key Findings
Collaboration


Description	This award was part of a larger grant held across 3 universities. At Aston we set out to establish how people perceive the 3D layout of scenes and the shapes of objects in photographs in terms of surface orientation measured as slant (the degree to which the surface is oriented towards or away from the observer) and tilt (the direction of slant). We first tested a number of methods for measuring slant and tilt. A 2D and 3D gauge figures that attempted to represent slant and tilt directly via visual cues, a dial-method that provided an indirect representation of slant and tilt and a paddle device that participants could physically orient to align with perceived surfaces and which could be visible or invisible while settings were made. We found that all methods represented the tilt of artificial surface well and with a high degree of accuracy. For slant participants were about equally sensitive and precise when using the visible paddle gauge figures and dial methods but were less precise when the paddle was unseen. Slant estimates tended to be lower overall for the paddle and higher overall for the gauge figures. For the visible paddle and gauge figure methods participants tended to overestimate small slants and underestimate large slants. Conversely, the dial method underestimated low slants and overestimated high slants. We next set out to explore the perception of slant and tilt in images of real scenes. Here we found that participants were less good at judging slant than for the synthetic scenes but still produced results that broadly correlated with the real slants presented. Both the dial and gauge figure methods underestimated slant. However, participants were very poor at judging tilt - this latter result is the opposite of what we found for synthetic surfaces. People tended to assign tilt to one of four key directions, horizontal-up (floor), horizontal-down (ceiling), vertical-left and vertical-right (walls). This was true even for outdoor scenes except for the absence of ceilings. This may to some extent be due to the prevalence of such tilt profiles in real scenes especially when man-made objects are included. Tilt estimates did not correlate between measurement methods suggesting that participants were not consistent with themselves. Given very small patches the choice of tilt direction was quite random, for larger sections of images and whole images the selection was usually the key direction closest to the true direction. We conclude that while humans are very good at judging tilt when there are sufficient cues available, real scenes lack sufficient cues to tilt and in the absence of this information humans assume one of four key directions approximating a 'carpented' world made up of flat surfaces such as the walls of a room. This is true even for outdoor surfaces. Other partners on the project considered the relative depth of surfaces on either side of edges (Southampton) and machine vision methods for reconstructing 3D scene layout from 2d images (Surrey).
Exploitation Route	The Aston project strongly suggests that humans do not perceive the tilt of real world surfaces accurately and that 3D rendering based on simple flat room like structures may be adequate to represent 3D scene layout to human observers.
Sectors	Creative Economy Digital/Communication/Information Technologies (including Software)


Description	Crosswing
Organisation	CrossWing Inc
Country	Canada
Sector	Private
PI Contribution	We share information with the partner
Collaborator Contribution	The partner attends steering group meetings.
Impact	The is a collaboration between robotics and psychology.
Start Year	2019


Description	Microsoft
Organisation	Microsoft Research
Department	Microsoft Research Cambridge
Country	United Kingdom
Sector	Private
PI Contribution	We share information with the partner
Collaborator Contribution	The partner attends meetings of the project steering group and has hosted such meetings.
Impact	Collaboration between computer science and psychology
Start Year	2019


Description	Northwestern Polytechnical University
Organisation	Northwestern Polytechnical University
Country	China
Sector	Academic/University
PI Contribution	We attend monthly project meetings
Collaborator Contribution	The partner attends occasional project meetings
Impact	Collaboration between engineering and psychology
Start Year	2019


Description	University of Southampton
Organisation	University of Southampton
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have attended monthly project meetings to discuss progress on the grant.
Collaborator Contribution	University of Southampton is a partner organisation on the grant, receiving funding from a joint grant. They have attended monthly project meetings to discuss progress on the project.
Impact	None
Start Year	2019


Description	University of Surrey
Organisation	University of Surrey
Department	Centre for Vision, Speech and Signal Processing
Country	United Kingdom
Sector	Academic/University
PI Contribution	We attend monthly project meetings with the partner
Collaborator Contribution	University of Surrey is a project partner with a linked grant. They attend monthly project managements meetings.
Impact	Collaboration between Psychology and Engineering
Start Year	2019


Description	York University
Organisation	York University Toronto
Country	Canada
Sector	Academic/University
PI Contribution	We attend monthly management meeting with the partner.
Collaborator Contribution	Prof James Elder is a visiting researcher on the project wo also attends monthly project management meetings giving his time for free.
Impact	The collaboration is multi-disciplinary between Psychology and Engineering.
Start Year	2019

Abstract

Planned Impact

Organisations

People

ORCID iD

Publications