Dense Monocular Reconstruction and Semantic Segmentation of 3D Environments

Lead Research Organisation: University of Oxford
Department Name: Engineering Science

Abstract

Most applications of robotics and augmented reality (AR) rely on some form of 3D model of the environment in which they operate. At the most basic level, these models describe where things are, allowing systems to interact with their surroundings accurately and safely. Conventionally, there has been a trade-off between the quality and the density of the model: accurate models typically consist of a cloud of points, with no information about the regions between the points. This sparseness typically arises from the use of 3D scanners, such as lasers. Scanners that produce dense information do exist, but they are either limited to indoor operation or have an extremely short range (typically 1-5m). If more meaningful interaction is desired, then semantic information, which describes the contents of the environment, is also required.

The ability to quickly generate accurate models of 3D spaces has a variety of impacts, from making it easier for automated systems to navigate and explore the world, to improving how AR-generated visuals interact with the real world. There are also potential applications in disability and access: buildings or areas could be quickly and easily mapped, and scale models printed, to help those with impaired vision navigate them. Semantic information allows systems to understand the world and answer questions about it. For example, with semantic information, we can ask questions like "Where are the chairs in this room?"

We aim to investigate systems both for reconstructing dense 3D models of environments and for generating semantic segmentations of those models. We hope to develop a system capable of generating these models in real time, and ultimately on embedded and mobile devices such as an iPhone. Further, we aim to generate these models in places where current sensor-based systems cannot, i.e. outdoors and at ranges greater than 5m.

To improve on existing techniques for reconstructing 3D models from monocular images, we plan to use convolutional neural networks (CNNs) alongside existing geometric methods. However, rather than computing a depth image, we propose to have the network directly compute a full 3D model. Although this requires significantly more memory, we believe it will allow better integration of the available information. We expect the direct use of 3D information to be particularly important for semantic segmentation, where certain viewing angles of objects can be misleading (e.g. a chair seen from above looks much like a table). We also plan to investigate the potential of recurrent neural networks to improve the quality of the reconstruction over a sequence of images, as this input pipeline mirrors the one most likely encountered in real-world data acquisition; a sketch of this idea follows below.
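To make the proposed approach concrete, the following is a minimal sketch in PyTorch (an assumed choice of framework): a 2D CNN encodes each frame, a GRU fuses the per-frame evidence across the sequence, and a 3D deconvolutional decoder predicts a voxel occupancy grid directly rather than a depth image. The class name, layer sizes, and the 32x32x32 grid resolution are illustrative assumptions, not the project's actual design.

# Minimal sketch of per-frame encoding, recurrent fusion, and direct
# voxel prediction. All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class RecurrentVoxelNet(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        # 2D encoder: one RGB frame -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent),
        )
        # Recurrent fusion of evidence across the image sequence
        self.rnn = nn.GRUCell(latent, latent)
        # 3D decoder: fused state -> voxel occupancy logits (4^3 -> 32^3)
        self.fc = nn.Linear(latent, 128 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        h = frames.new_zeros(b, self.rnn.hidden_size)
        for i in range(t):  # fold each new view into the hidden state
            h = self.rnn(self.encoder(frames[:, i]), h)
        vol = self.fc(h).view(b, 128, 4, 4, 4)
        return self.decoder(vol)  # (batch, 1, 32, 32, 32) logits

model = RecurrentVoxelNet()
logits = model(torch.randn(2, 5, 3, 128, 128))  # 5-frame sequences
occupancy = torch.sigmoid(logits) > 0.5  # binary voxel grid

Note the memory trade-off discussed above: the decoder's activations grow with the cube of the grid resolution, which is why direct 3D prediction is considerably more expensive than producing a depth image of the same input.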

This project falls within the EPSRC Information and Communication Technologies theme, specifically the Image and Vision Computing research area.

Publications

Description This grant has led to the publication of two papers: one exploring how deep neural networks can be compressed to occupy less space in hardware and be deployed more economically, and the other exploring the applications and limitations of a new method of storing and representing 3D information.
Exploitation Route The published papers, together with the forthcoming thesis, will form part of the literature that others can draw on in future work.
Sectors Digital/Communication/Information Technologies (including Software)