Immersive Audio-Visual 3D Scene Reproduction Using a Single 360 Camera

Lead Research Organisation: University of Southampton
Department Name: Sch of Electronics and Computer Sci

Abstract

The COVID-19 pandemic has changed our lifestyles and created a high demand for remote communication and experience. Many organisations have had to set up remote work systems with video conferencing platforms. However, current video conferencing systems do not meet basic requirements for remote collaboration due to the lack of eye contact, gaze awareness and spatial audio synchronisation. Reproduction of a real space as an audio-visual 3D model allows users to remotely experience real-time interaction in real environments, so it can be widely utilised in applications such as healthcare, teleconferencing, education and entertainment. The goal of this project is to develop a simple and practical solution to estimate the geometrical structure and acoustic properties of general scenes, allowing spatial audio to be adapted to the environment and listener location to give an immersive rendering of the scene and improve the user experience.

Existing 3D scene reproduction systems have two problems. (i) Audio and vision systems have been researched separately. Computer vision research has mainly focused on improving the visual side of scene reconstruction. In an immersive display, such as a VR system, the experience is not perceived as "realistic" by users if the sound does not match the visual cues. On the other hand, audio research has used only audio sensors to measure acoustic properties, without considering the complementary effect of visual sensors. (ii) Current capture and recording systems for 3D scene reproduction require setups that are too invasive, and too much professional expertise, to be deployed by users in their private spaces. A LiDAR sensor is expensive and requires a long scanning time. Perspective images require a large number of photos to cover the whole scene.

The objective of this research is to develop an end-to-end audio-visual 3D scene reproduction pipeline using a single shot from a consumer 360 (panoramic) camera. To make the system easily accessible to ordinary users in their own private spaces, an automatic solution using computer vision and artificial intelligence algorithms must be included in the back-end. A deep neural network (DNN) jointly trained for semantic scene reconstruction and acoustic property prediction of the captured environment will be developed. This process includes inference of regions invisible to the camera. Impulse Responses (IRs) characterising the acoustic attributes of an environment allow the acoustics of the space to be reproduced with any sound source. They also allow the original (dry) sound to be extracted by removing acoustic effects from a recording, so that the source can be re-rendered in new environments with different acoustic effects. A simple and efficient method to estimate acoustic IRs from the captured single 360 photo will be investigated.
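As a minimal illustration of how an estimated IR is applied in practice (a sketch for clarity, not the project's own code), the Python snippet below convolves a placeholder dry source with a placeholder room IR to render it with the acoustics of the space; all signals, lengths and the sample rate are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000  # sample rate in Hz (assumed)

# Hypothetical inputs: a dry (anechoic) source and an estimated room IR.
dry = np.random.randn(fs)                            # 1 s of placeholder source signal
t = np.arange(int(0.5 * fs)) / fs                    # 0.5 s impulse response time axis
ir = np.random.randn(len(t)) * 10.0 ** (-3.0 * t)    # noise with a decaying envelope
ir[0] = 1.0                                          # direct sound

# Rendering: the reverberant signal is the dry source convolved with the IR.
wet = fftconvolve(dry, ir)[: len(dry)]
wet /= np.max(np.abs(wet)) + 1e-12                   # normalise to avoid clipping
```

Extracting the dry source from a recording is the inverse operation (deconvolution by the same IR), which is what allows a recorded sound to be re-rendered in a different environment.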

This semantic scene data is used to provide an immersive audio-visual experience to users. Two display scenarios will be considered: a personalised display system such as a VR headset with headphones, and a communal display system (e.g., TV or projector) with loudspeakers. Real-time 3D human pose tracking using a single 360 camera will be developed to accurately render the 3D audio-visual scene at the users' locations. Delivering binaural sound to listeners using loudspeakers is a challenging task. Audio beam-forming techniques aligned with human pose tracking for multiple loudspeakers will be investigated in collaboration with the project partners in audio processing.
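As a rough sketch of how loudspeaker beam-forming could be tied to the tracked listener position (illustrative only, not the partners' method), the snippet below computes per-loudspeaker delays and gains for simple delay-and-sum focusing so that the array's signals arrive in phase at the listener; the speaker layout, listener position and sample rate are hypothetical.

```python
import numpy as np

C = 343.0    # speed of sound (m/s)
FS = 48000   # sample rate in Hz (assumed)

def focus_at_listener(speaker_pos, listener_pos):
    """Per-speaker delays (in samples) and gains for delay-and-sum focusing
    at the listener position reported by the pose tracker."""
    dists = np.linalg.norm(speaker_pos - listener_pos, axis=1)
    delays = (dists.max() - dists) / C        # delay nearer speakers so arrivals align
    gains = dists / dists.max()               # scale so contributions arrive at similar level
    return np.round(delays * FS).astype(int), gains

# Hypothetical 4-speaker line array and a tracked listener position (metres).
speakers = np.array([[-1.0, 0.0, 0.0], [-0.33, 0.0, 0.0],
                     [0.33, 0.0, 0.0], [1.0, 0.0, 0.0]])
listener = np.array([0.2, 1.5, 0.0])

delays, gains = focus_at_listener(speakers, listener)

# One feed per loudspeaker: the mono signal, delayed and scaled.
mono = np.random.randn(FS)
feeds = [g * np.pad(mono, (d, 0)) for d, g in zip(delays, gains)]
```

In practice the delays would be updated continuously as the pose tracker reports new listener positions.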

The resulting system would have a significant impact on innovation in VR and multimedia systems, and open up new and interesting applications for their deployment. This award should provide the foundation for the PI to establish and lead a group with a unique research direction that is aligned with national priorities and will address a major long-term research challenge.
 
Description The main goal of this project is to develop a practical system for the reproduction of visually and acoustically plausible 3D scenes from a simple capture. Our strategy was to design the whole pipeline, develop the individual components, and then integrate, optimise and test the system. The system was composed of six key research components: depth estimation, material recognition, semantic scene completion, acoustic room modelling, human pose estimation and interactive rendering. Owing to the rapid progress in AI and active research collaboration with the project partners, we were able to complete all six research objectives. The results were published at 15 international conferences and presented in 3 public demonstrations. Three more papers have been submitted to international journals (under review).

In depth estimation, the biggest problem was the lack of training datasets for 360 images. We developed domain adaptation methods that allow synthetic data to be used for training models that work on real images, an approach that can be reused in many other cases with limited training data. We also proposed incorporating physical constraints such as gravity and environmental structure.

Material recognition was one of the most challenging parts of this project because even humans cannot reliably recognise materials from visual input alone. It is common to use an additional sensor such as hyperspectral imaging (HSI). In this project, we developed a new AI architecture to synthesise HSI images from normal RGB colour input, so that HSI-based material recognition algorithms could be applied to ordinary colour images.
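A minimal sketch of this two-stage idea (hypothetical stand-in networks, not the architecture developed in the project): an RGB image is first mapped to a synthesised hyperspectral cube, which is then fed to a per-pixel material classifier.

```python
import torch
import torch.nn as nn

class RGB2HSI(nn.Module):
    """Hypothetical RGB-to-hyperspectral synthesis network: maps a 3-channel
    RGB image to a multi-band hyperspectral cube."""
    def __init__(self, bands=31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, bands, 3, padding=1),
        )

    def forward(self, rgb):
        return self.net(rgb)

class MaterialClassifier(nn.Module):
    """Hypothetical per-pixel material classifier operating on the synthesised bands."""
    def __init__(self, bands=31, n_materials=10):
        super().__init__()
        self.head = nn.Conv2d(bands, n_materials, 1)

    def forward(self, hsi):
        return self.head(hsi)

rgb = torch.rand(1, 3, 128, 256)                 # placeholder RGB crop from a 360 image
logits = MaterialClassifier()(RGB2HSI()(rgb))    # per-pixel material logits
```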

Semantic scene completion reconstructs the areas of a scene that are invisible to the camera using AI. We found that this is a relatively new area with limited resources; even a standard evaluation metric had not been established. We provided a taxonomy of and survey on this research area, as well as our own algorithms.

In acoustic room modelling, the biggest issue was parametrising room impulse responses, since these are continuous (analogue) signals. Two parametrisation methods exist: RSAO and SIRR. We analysed and evaluated both through thorough experiments and developed a new method combining RSAO and SIRR to overcome the limitations of each.
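As a hedged illustration of one standard ingredient of such a parametrisation (reverberation-time estimation, not the RSAO/SIRR implementation itself), the sketch below estimates RT60 from an IR via Schroeder backward integration and a linear fit to the energy decay curve; the IR and sample rate are placeholders.

```python
import numpy as np

def rt60_schroeder(ir, fs, fit_db=(-5.0, -35.0)):
    """Estimate RT60 from a room impulse response using Schroeder backward
    integration and a linear fit to the energy decay curve (in dB)."""
    energy = ir.astype(float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]                 # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)

    t = np.arange(len(ir)) / fs
    lo, hi = fit_db
    mask = (edc_db <= lo) & (edc_db >= hi)              # fit the -5 dB to -35 dB range
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)     # decay rate in dB per second
    return -60.0 / slope                                # time to decay by 60 dB

# Placeholder IR: decaying noise with a true RT60 of about 0.5 s.
fs = 48000
t = np.arange(int(0.8 * fs)) / fs
ir = np.random.randn(len(t)) * 10.0 ** (-1.5 * t / 0.5)
print(f"Estimated RT60: {rt60_schroeder(ir, fs):.2f} s")
```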

3D human pose estimation from a single image is also very challenging unless a full 3D training set is available. We developed a real-time 3D human pose estimation model based on an unsupervised method. This can be useful in various applications as it does not require any 3D reference data.

All components were integrated into one end-to-end pipeline on the Unity 3D platform. The pipeline runs all components and delivers a dynamic audio-visual VR scene to a VR headset from a single input image with one click.

The individual techniques developed in this project can be used in various fields. Through this project, we found new research partners and launched two international collaboration projects. One is to develop humanoid robot interaction using environment analysis and real-time human pose estimation. The other is to develop an eXtended Reality (XR) system for remote user interaction with additional sensing (audio and haptic).

The PDRA on this project went on to work on related research at the ISVR (our project partner) after the project ended.
Exploitation Route The project was originally focused on developing environment understanding and 3D visualisation technology for Virtual Reality and Augmented Reality applications. However, the key methods developed in this project have potential in various other fields such as healthcare, entertainment and robotics.

We tried to build strong cross-disciplinary links and transfer our technology to neighbouring research areas. The most successful cases were the two international collaborative research projects mentioned in Key Findings.

We have also sought collaborations in healthcare and medicine. We provided our machine learning and AI techniques for audio-visual signal processing to Innovate Physics Limited, a company based on the Isle of Wight. As a consortium with Innovate Physics Ltd, we applied for Innovate UK funding in 2023 (TSB Application number: 10086279) to develop a wearable toolkit for early detection of Alzheimer's disease. Though the bid was unsuccessful, we are still trying to find alternative funding for this project.

We are also collaborating with the medicine and psychology fields to support older people using our 3D human tracking and pose estimation technique. Our bid to the MRC in 2023 (Application number 13463) was unsuccessful and we are looking for other funding opportunities.

We have started a collaboration with our university's Department of Psychology on human perception, which is one of the main factors in immersive scene rendering. We will begin co-supervising two PhD students this year (2024).
Sectors Digital/Communication/Information Technologies (including Software)

Education

Healthcare

Culture

Heritage

Museums and Collections

URL http://www.3dkim.com/Eng/publication.html
 
Description This project was focused on academic research and its outcomes remain at the pilot-system stage, so its impacts on society, the economy and the environment are limited so far. However, as mentioned in the Outcomes section, we are trying to unlock the creative potential of our techniques to deliver a step change in various applications including healthcare, humanoid robots, human understanding and communication. We have tried to increase the visibility of our research through various seminars and public open events.
First Year Of Impact 2021
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Cultural

Economic

 
Description International Collaborative Research 
Organisation Electronics and Telecommunications Research Institute (ETRI)
Country Korea, Republic of 
Sector Public 
PI Contribution This research collaboration aims to develop an immersive eXtended Reality (XR) environment for realistic egocentric interaction between users in the XR space. The main contributions are developing algorithms for human interaction between remote users in the XR environment, free-view visualisation of the remote environment, and 3D environment analysis/visualisation.
Collaborator Contribution They provide a direct financial contribution.
Impact Algorithms; co-authored international conference papers
Start Year 2024
 
Description International Collaborative Research 
Organisation Korea Institute of Science and Technology
Country Korea, Republic of 
Sector Public 
PI Contribution This collaboration aims to develop a 3D environment understanding and 3D human detection/pose estimation system using a single omni-directional camera and several depth sensors. The main target application is providing integrated 3D scene understanding to a digital human or humanoid robot being developed at KIST. The key technique we contributed was implementing 3D human tracking/visualisation using a camera and additional sensors. Based on experience from the NIA project, we are providing new algorithms for their system.
Collaborator Contribution They made a direct financial contribution in the form of an international collaboration project.
Impact Algorithms; co-authored international conference papers
Start Year 2022
 
Description Career Development Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Gave a talk about research career pathways and joined a panel discussion for postgraduate students in science and engineering. Around 60 postgraduate students from across the UK attended.
Year(s) Of Engagement Activity 2023
 
Description Invited Talks 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave 9 invited talks about my research on this NIA project at various conferences, research institutes and universities in different countries between 2021 and 2024. Audience sizes varied from 20 to 200 people.
Year(s) Of Engagement Activity 2021,2022,2023,2024
 
Description Keynote speech 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave 4 keynote speeches on the NIA research topic at international conferences between 2022 and 2024.
- KOSEN Bridge Forum, Seoul, South Korea, Oct. 2022
- ICCE-Asia, Yeosu, South Korea, Oct. 2022
- CVPR Workshop on Omnidirectional Computer Vision, Vancouver, Canada, June 2023
- ICEIC, Taipei, Taiwan, Jan. 2024
Year(s) Of Engagement Activity 2022,2023,2024
 
Description Science and Engineering Day 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact The University of Southampton run Science and Engineering Day for local public engagement.
I made demonstration for public including kids in 2022 and 2023.
Year(s) Of Engagement Activity 2022,2023
URL https://www.sotsef.co.uk/wider-festival/explore/