Deep Learning for Free-Viewpoint Video in Sports and Immersive VR Experiences
Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP
Abstract
The increase in popularity of virtual reality (VR) and augmented reality (AR) experiences has given rise to a need for high-quality immersive content. Until now these experiences have predominantly featured artist-made content, but creating realistic models and textures this way is challenging and time-consuming. There has been recent uptake in the use of vision-based methods for creating virtual content from real images, including Free-Viewpoint Video (FVV). FVV allows us to reconstruct real-world scenes and present them in a virtual medium. Most FVV methods require large volumes of geometry and texture data, and can require significant processing capabilities for real-time rendering. On top of these challenges, sports scenes are an especially difficult subject for FVV due to the unconstrained environment, sizeable performance volume and large number of people.
The overall focus of this work is to investigate methods for the offline processing and real-time rendering of FVV of highly dynamic scenes such as sports. FVV pipelines employ a geometric representation of the scene data, which is textured using camera images. The geometry is usually obtained using 3D reconstruction from multiple camera viewpoints, which may be either general or model-based. General multi-view reconstruction methods are not reliant on prior information about the scene structure, and many have been applied to the generation of FVV content. Model-based reconstruction methods involve fitting a shape model to real-world data to obtain a geometrical proxy of the scene. We believe that model-based methods can be employed in the production of FVV of sports and dynamic scenes, providing multiple benefits over existing methods:
- Rather than a unique geometry per person and frame, a model-based representation requires only a set of model parameters plus the overhead of the body model. This means that model-based FVV would be easily scalable to multiple people.
- Since the output of model-based reconstruction is intrinsically consistent over time, video compression on the texture maps is more effective, further increasing compactness.
- Model-based reconstruction is more robust to the errors that general reconstruction methods are prone to. Prior knowledge of human shape and pose embedded in the model assists in overcoming visual ambiguities present in the input data, providing a whole-body reconstruction without missing regions.
- The model may allow finer detail to be inferred where other capture methods are insufficient, such as around the face and hands.
- If the body model features an articulated skeleton, the reconstruction can be easily refined or reanimated by an artist.
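The compactness argument in the first bullet can be made concrete with a rough back-of-the-envelope sketch. All figures below (vertex count, parameter count, number of players, frame rate) are hypothetical values chosen for illustration and are not taken from this project; the sketch also ignores mesh connectivity and texture data, and assumes 32-bit floats.

```python
BYTES_PER_FLOAT = 4  # assumption: 32-bit floats


def per_frame_mesh_bytes(num_people, num_frames, verts_per_person):
    """Unique geometry per person and frame: 3 floats (x, y, z) per vertex."""
    return num_people * num_frames * verts_per_person * 3 * BYTES_PER_FLOAT


def model_based_bytes(num_people, num_frames, params_per_frame, template_verts):
    """One shared template (the body-model overhead) plus per-frame parameters."""
    template = template_verts * 3 * BYTES_PER_FLOAT
    params = num_people * num_frames * params_per_frame * BYTES_PER_FLOAT
    return template + params


# Hypothetical scenario: 20 players, 10 seconds at 25 frames per second.
people, frames = 20, 250
# Hypothetical mesh resolution and number of model parameters per frame.
verts, params = 7000, 85

mesh = per_frame_mesh_bytes(people, frames, verts)
model = model_based_bytes(people, frames, params, verts)

print(f"per-frame meshes: {mesh / 1e6:.1f} MB")
print(f"model-based:      {model / 1e6:.2f} MB")
print(f"ratio:            {mesh / model:.0f}x")
```

Under these assumed numbers the per-frame cost of the model-based representation grows with the number of parameters rather than the number of vertices, which is the sense in which it scales more easily to many people.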
The merits of model-based reconstruction make it suitable for FVV, especially for sports scenes. This work focuses on employing it in the production of immersive content for VR and AR. The aim is to achieve real-time FVV rendering with temporally coherent representations for compact streaming of dynamic scene data. A particular focus will be on the application to sports, both for player performance and analytics and for the creation of immersive content for VR and AR experiences. FVV production for sports scenes is an especially challenging application since the environment is uncontrolled. The associated difficulties include inaccurate calibration, uncontrolled illumination, poor segmentation, and very wide-baseline cameras.
Organisations
People | ORCID iD |
Adrian Hilton (Primary Supervisor) | |
LEWIS BRIDGEMAN (Student) | |
Publications
Bridgeman L (2019) Full-body Performance Capture of Sports from Multi-view Video (Short Paper)
Bridgeman L (2019) Multi-Person 3D Pose Estimation and Tracking in Sports
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
EP/N509772/1 | | | 30/09/2016 | 29/09/2021 | |
1976240 | Studentship | EP/N509772/1 | 30/09/2017 | 29/06/2021 | LEWIS BRIDGEMAN |
Description | We have identified that model-based reconstruction of people from multi-view video provides a lightweight but realistic method of generating full-body reconstructions of people in constrained environments. We have produced FVV rendering results on single people in constrained studio environments that can be played back in virtual reality (VR) and augmented reality (AR). Model-based reconstruction is able to capture finer details (such as fingers and facial details) where other reconstruction methods fail. However, there is still room for improvement in capturing details not present within the model, such as clothing. Our work in "Multi-Person 3D Pose Estimation and Tracking in Sports" provides a stepping stone to FVV of multiple people in sports scenes. This work focuses on sorting and tracking pose estimations of multiple people from multiple camera views in sports scenes; pose estimations are a critical component of the model-based reconstruction pipeline. This work provides a new method for: correcting errors in pose estimations using multi-view consensus; associating 2D pose estimations between camera viewpoints; and sorting associated poses between frames to generate tracked 3D skeletons. Our approach achieves a significant improvement in speed over the state of the art. "Full-body Performance Capture of Sports from Multi-view Video" extends our previous work by using the sorted pose estimations in a model-based reconstruction of multiple people in sports environments. We demonstrate results for our method on a soccer dataset comprising over 20 subjects. These initial results show that model-based reconstruction has the potential to provide smooth, temporally consistent reconstructions of multiple people on challenging sports datasets. |
Exploitation Route | Our intermediate work on multiple person 3D pose estimation could find applications in a range of fields: motion capture & animation; sports player analysis; or even gait analysis in healthcare. The extension of this work in multi-person reconstruction could prove useful in the creative industry: the method allows us to generate 4D reconstructions of real human performances from video cameras, which could help to save hours of animators' and digital artists' time. |
Sectors | Creative Economy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism |
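The Description mentions sorting associated poses between frames to generate tracked 3D skeletons. As a purely illustrative sketch of what such a step involves, the following shows a simple greedy nearest-neighbour association between consecutive frames. This is a generic baseline and not the method of the publications listed above; the function names, the skeleton format (a list of `(x, y, z)` joints) and the `max_dist` threshold are all assumptions made for the example.

```python
import math


def mean_joint_distance(a, b):
    """Mean Euclidean distance between corresponding 3D joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def associate(prev_tracks, detections, max_dist=0.5):
    """Greedily match current-frame skeletons to existing tracks.

    prev_tracks: dict of track_id -> skeleton from the previous frame
    detections:  list of skeletons from the current frame
    max_dist:    hypothetical gating threshold (same units as the joints)
    Returns a dict of track_id -> skeleton for the current frame.
    """
    # Score every (track, detection) pair and visit the closest pairs first.
    pairs = sorted(
        (mean_joint_distance(skel, det), tid, i)
        for tid, skel in prev_tracks.items()
        for i, det in enumerate(detections)
    )
    assigned, used = {}, set()
    for dist, tid, i in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even further apart
        if tid in assigned or i in used:
            continue  # each track and each detection is used at most once
        assigned[tid] = detections[i]
        used.add(i)
    # Detections that matched no track start new tracks.
    next_id = max(prev_tracks, default=-1) + 1
    for i, det in enumerate(detections):
        if i not in used:
            assigned[next_id] = det
            next_id += 1
    return assigned


# Two tracked people in the previous frame (two joints each, for brevity).
prev = {
    0: [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    1: [(5.0, 0.0, 0.0), (5.0, 0.0, 1.0)],
}
# Current frame: same people in a different order, plus a newcomer.
dets = [
    [(5.1, 0.0, 0.0), (5.1, 0.0, 1.0)],
    [(0.1, 0.0, 0.0), (0.1, 0.0, 1.0)],
    [(9.0, 9.0, 0.0), (9.0, 9.0, 1.0)],
]
tracks = associate(prev, dets)
print(sorted(tracks))
```

A greedy match of this kind is easy to follow but can make poor choices when players are close together; it is shown only to make the idea of frame-to-frame sorting concrete.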