Immersive Audio-Visual 3D Scene Reproduction Using a Single 360 Camera

Lead Research Organisation: University of Southampton
Department Name: Sch of Electronics and Computer Sci

Abstract

The COVID-19 pandemic has changed our lifestyles and created a high demand for remote communication and experience. Many organisations have had to set up remote work systems with video conferencing platforms. However, current video conferencing systems do not meet basic requirements for remote collaboration due to the lack of eye contact, gaze awareness and spatial audio synchronisation. Reproduction of a real space as an audio-visual 3D model allows users to remotely experience real-time interaction in real environments, so it can be widely utilised in applications such as healthcare, teleconferencing, education and entertainment. The goal of this project is to develop a simple and practical solution to estimate the geometrical structure and acoustic properties of general scenes, allowing spatial audio to be adapted to the environment and listener location to give an immersive rendering of the scene and improve the user experience.

Existing 3D scene reproduction systems have two problems. (i) Audio and vision systems have been researched separately. Computer vision research has mainly focused on improving the visual side of scene reconstruction. In an immersive display, such as a VR system, the experience is not perceived as "realistic" by users if the sound does not match the visual cues. On the other hand, audio research has used only audio sensors to measure acoustic properties, without considering the complementary effect of visual sensors. (ii) Current capture and recording systems for 3D scene reproduction require setups that are too invasive, and too much professional expertise, to be deployed by users in their private spaces. A LiDAR sensor is expensive and requires a long scanning time. Perspective images require a large number of photos to cover the whole scene.

The objective of this research is to develop an end-to-end audio-visual 3D scene reproduction pipeline using a single shot from a consumer 360 (panoramic) camera. To make the system easily accessible to ordinary users in their own private spaces, an automatic solution using computer vision and artificial intelligence algorithms must be included in the back-end. A deep neural network (DNN) jointly trained for semantic scene reconstruction and acoustic property prediction of the captured environment will be developed. This process includes inference of regions invisible to the camera. Impulse Responses (IRs) characterising the acoustic attributes of an environment allow the acoustics of the space to be reproduced with any sound source. They also allow the original (dry) sound to be extracted by removing acoustic effects from a recording, so that the source can be re-rendered in new environments with different acoustic effects. A simple and efficient method to estimate acoustic IRs from the captured single 360 photo will be investigated.
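As a minimal illustration of how an estimated IR is applied in practice (a sketch for clarity, not the project's own code), the Python snippet below convolves a placeholder dry source with a placeholder room IR to render it with the acoustics of the space; all signals, lengths and the sample rate are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000  # sample rate in Hz (assumed)

# Hypothetical inputs: a dry (anechoic) source and an estimated room IR.
dry = np.random.randn(fs)                            # 1 s of placeholder source signal
t = np.arange(int(0.5 * fs)) / fs                    # 0.5 s impulse response time axis
ir = np.random.randn(len(t)) * 10.0 ** (-3.0 * t)    # noise with a decaying envelope
ir[0] = 1.0                                          # direct sound

# Rendering: the reverberant signal is the dry source convolved with the IR.
wet = fftconvolve(dry, ir)[: len(dry)]
wet /= np.max(np.abs(wet)) + 1e-12                   # normalise to avoid clipping
```

Extracting the dry source from a recording is the inverse operation (deconvolution by the same IR), which is what allows a recorded sound to be re-rendered in a different environment.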

This semantic scene data is used to provide an immersive audio-visual experience to users. Two display scenarios will be considered: a personalised display system such as a VR headset with headphones, and a communal display system (e.g., TV or projector) with loudspeakers. Real-time 3D human pose tracking using a single 360 camera will be developed to accurately render the 3D audio-visual scene at the users' locations. Delivering binaural sound to listeners using loudspeakers is a challenging task. Audio beam-forming techniques aligned with human pose tracking for multiple loudspeakers will be investigated in collaboration with the project partners in audio processing.
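As a rough sketch of how loudspeaker beam-forming could be tied to the tracked listener position (illustrative only, not the partners' method), the snippet below computes per-loudspeaker delays and gains for simple delay-and-sum focusing so that the array's signals arrive in phase at the listener; the speaker layout, listener position and sample rate are hypothetical.

```python
import numpy as np

C = 343.0    # speed of sound (m/s)
FS = 48000   # sample rate in Hz (assumed)

def focus_at_listener(speaker_pos, listener_pos):
    """Per-speaker delays (in samples) and gains for delay-and-sum focusing
    at the listener position reported by the pose tracker."""
    dists = np.linalg.norm(speaker_pos - listener_pos, axis=1)
    delays = (dists.max() - dists) / C        # delay nearer speakers so arrivals align
    gains = dists / dists.max()               # scale so contributions arrive at similar level
    return np.round(delays * FS).astype(int), gains

# Hypothetical 4-speaker line array and a tracked listener position (metres).
speakers = np.array([[-1.0, 0.0, 0.0], [-0.33, 0.0, 0.0],
                     [0.33, 0.0, 0.0], [1.0, 0.0, 0.0]])
listener = np.array([0.2, 1.5, 0.0])

delays, gains = focus_at_listener(speakers, listener)

# One feed per loudspeaker: the mono signal, delayed and scaled.
mono = np.random.randn(FS)
feeds = [g * np.pad(mono, (d, 0)) for d, g in zip(delays, gains)]
```

In practice the delays would be updated continuously as the pose tracker reports new listener positions.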

The resulting system would have a significant impact on innovation in VR and multimedia systems, and open up new and interesting applications for their deployment. This award should provide the foundation for the PI to establish and lead a group with a unique research direction that is aligned with national priorities and will address a major long-term research challenge.
 
Description The main goal of this project is to develop a practical system for the reproduction of visually and acoustically plausible 3D scenes from a simple capture. Our strategy was to design the whole pipeline, develop the individual components, and then integrate, optimise and test the system. The system was composed of six key research components: depth estimation, material recognition, semantic scene completion, acoustic room modelling, human pose estimation and interactive rendering. Owing to the rapid progress in AI and active research collaboration with the project partners, we were able to complete all six research objectives. The results were published at 15 international conferences and presented in 3 public demonstrations. Three more papers have been submitted to international journals (under review).

In depth estimation, the biggest problem was the lack of training datasets for 360 images. We developed domain adaptation methods that allow synthetic data to be used for training models that work on real images, an approach that can be reused in many other cases with limited training data. We also proposed incorporating physical constraints such as gravity and environmental structure.

Material recognition was one of the most challenging parts of this project because even humans cannot reliably recognise materials from visual input alone. It is common to use an additional sensor such as hyperspectral imaging (HSI). In this project, we developed a new AI architecture to synthesise HSI images from normal RGB colour input, so that HSI-based material recognition algorithms could be applied to ordinary colour images.
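A minimal sketch of this two-stage idea (hypothetical stand-in networks, not the architecture developed in the project): an RGB image is first mapped to a synthesised hyperspectral cube, which is then fed to a per-pixel material classifier.

```python
import torch
import torch.nn as nn

class RGB2HSI(nn.Module):
    """Hypothetical RGB-to-hyperspectral synthesis network: maps a 3-channel
    RGB image to a multi-band hyperspectral cube."""
    def __init__(self, bands=31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, bands, 3, padding=1),
        )

    def forward(self, rgb):
        return self.net(rgb)

class MaterialClassifier(nn.Module):
    """Hypothetical per-pixel material classifier operating on the synthesised bands."""
    def __init__(self, bands=31, n_materials=10):
        super().__init__()
        self.head = nn.Conv2d(bands, n_materials, 1)

    def forward(self, hsi):
        return self.head(hsi)

rgb = torch.rand(1, 3, 128, 256)                 # placeholder RGB crop from a 360 image
logits = MaterialClassifier()(RGB2HSI()(rgb))    # per-pixel material logits
```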

Semantic scene completion reconstructs the areas of a scene that are invisible to the camera using AI. We found that this is a relatively new area with limited resources; even a standard evaluation metric had not been established. We provided a taxonomy of and survey on this research area, as well as our own algorithms.

In acoustic room modelling, the biggest issue was parametrising room impulse responses, since these are continuous (analogue) signals. Two parametrisation methods exist: RSAO and SIRR. We analysed and evaluated both through thorough experiments and developed a new method combining RSAO and SIRR to overcome the limitations of each.
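As a hedged illustration of one standard ingredient of such a parametrisation (reverberation-time estimation, not the RSAO/SIRR implementation itself), the sketch below estimates RT60 from an IR via Schroeder backward integration and a linear fit to the energy decay curve; the IR and sample rate are placeholders.

```python
import numpy as np

def rt60_schroeder(ir, fs, fit_db=(-5.0, -35.0)):
    """Estimate RT60 from a room impulse response using Schroeder backward
    integration and a linear fit to the energy decay curve (in dB)."""
    energy = ir.astype(float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]                 # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)

    t = np.arange(len(ir)) / fs
    lo, hi = fit_db
    mask = (edc_db <= lo) & (edc_db >= hi)              # fit the -5 dB to -35 dB range
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)     # decay rate in dB per second
    return -60.0 / slope                                # time to decay by 60 dB

# Placeholder IR: decaying noise with a true RT60 of about 0.5 s.
fs = 48000
t = np.arange(int(0.8 * fs)) / fs
ir = np.random.randn(len(t)) * 10.0 ** (-1.5 * t / 0.5)
print(f"Estimated RT60: {rt60_schroeder(ir, fs):.2f} s")
```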

3D human pose estimation from a single image is also very challenging unless a full 3D training set is available. We developed a real-time 3D human pose estimation model based on an unsupervised method. This can be useful in various applications as it does not require any 3D reference data.

All components were integrated into one end-to-end pipeline on the Unity 3D platform. The pipeline runs all components and delivers a dynamic audio-visual VR scene to a VR headset from a single input image with one click.

The individual techniques developed in this project can be used in various fields. Through this project, we found new research partners and launched two international collaboration projects. One is to develop humanoid robot interaction using environment analysis and real-time human pose estimation. The other is to develop an eXtended Reality (XR) system for remote user interaction with additional sensing (audio and haptic).

The PDRA on this project went on to work on related research at the ISVR (our project partner) after the project ended.
Exploitation Route The project was originally focused on developing environment understanding and 3D visualisation technology for Virtual Reality and Augmented Reality applications. However, the key methods developed in this project have potential in various other fields such as healthcare, entertainment and robotics.

We tried to build strong cross-disciplinary links and transfer our technology to neighbouring research areas. The most successful cases were the two international collaborative research projects mentioned in Key Findings.

We have also sought collaborations in healthcare and medicine. We provided our machine learning and AI techniques for audio-visual signal processing to Innovate Physics Limited, a company based on the Isle of Wight. As a consortium with Innovate Physics Ltd, we applied for Innovate UK funding in 2023 (TSB Application number: 10086279) to develop a wearable toolkit for early detection of Alzheimer's disease. Though the bid was unsuccessful, we are still trying to find alternative funding for this project.

We are also collaborating with the medicine and psychology fields to support older people using our 3D human tracking and pose estimation technique. Our bid to the MRC in 2023 (Application number 13463) was unsuccessful and we are looking for other funding opportunities.

We have started a collaboration with our university's Department of Psychology on human perception, which is one of the main factors in immersive scene rendering. We will begin co-supervising two PhD students this year (2024).
Sectors Digital/Communication/Information Technologies (including Software)

Education

Healthcare

Culture

Heritage

Museums and Collections

URL http://www.3dkim.com/Eng/publication.html
 
Description This project was focused on academic research and its outcomes remain at the pilot-system stage, so its impacts on society, the economy and the environment are limited so far. However, as mentioned in the Outcomes section, we are trying to unlock the creative potential of our techniques to deliver a step change in various applications including healthcare, humanoid robots, human understanding and communication. We have tried to increase the visibility of our research through various seminars and public open events.
First Year Of Impact 2021
Sector Digital/Communication/Information Technologies (including Software),Healthcare
Impact Types Cultural

Economic

 
Description International Collaborative Research 
Organisation Electronics and Telecommunications Research Institute (ETRI)
Country Korea, Republic of 
Sector Public 
PI Contribution This research collaboration aims to develop an immersive eXtended Reality (XR) environment for realistic egocentric interaction between users in the XR space. The main contributions are developing algorithms for human interaction between remote users in the XR environment, free-view visualisation of the remote environment, and 3D environment analysis/visualisation.
Collaborator Contribution They provide a direct financial contribution.
Impact Algorithms; co-authored international conference papers
Start Year 2024
 
Description International Collaborative Research 
Organisation Korea Institute of Science and Technology
Country Korea, Republic of 
Sector Public 
PI Contribution This collaboration aims to develop a 3D environment understanding and 3D human detection/pose estimation system using a single omni-directional camera and several depth sensors. The main target application is providing integrated 3D scene understanding to a digital human or humanoid robot being developed at KIST. The key technique we contributed was implementing 3D human tracking/visualisation using a camera and additional sensors. Based on experience from the NIA project, we are providing new algorithms for their system.
Collaborator Contribution They made a direct financial contribution in the form of an international collaboration project.
Impact Algorithms; co-authored international conference papers
Start Year 2022
 
Description Career Development Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Gave a talk about research career pathways and joined a panel discussion for postgraduate students in science and engineering. Around 60 postgraduate students from across the UK attended.
Year(s) Of Engagement Activity 2023
 
Description Invited Talks 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave 9 invited talks about my research on this NIA project at various conferences, research institutes and universities in different countries between 2021 and 2024. Audience sizes varied from 20 to 200 people.
Year(s) Of Engagement Activity 2021,2022,2023,2024
 
Description Keynote speech 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I gave 4 keynote speeches on the NIA research topic at international conferences between 2022 and 2024.
- KOSEN Bridge Forum, Seoul, South Korea, Oct. 2022
- ICCE-Asia, Yeosu, South Korea, Oct. 2022
- CVPR Workshop on Omnidirectional Computer Vision, Vancouver, Canada, June 2023
- ICEIC, Taipei, Taiwan, Jan. 2024
Year(s) Of Engagement Activity 2022,2023,2024
 
Description Science and Engineering Day 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact The University of Southampton run Science and Engineering Day for local public engagement.
I made demonstration for public including kids in 2022 and 2023.
Year(s) Of Engagement Activity 2022,2023
URL https://www.sotsef.co.uk/wider-festival/explore/