Audio-visual object-based dynamic scene representation from monocular video

Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP


This research will investigate the transformation of monocular audio-visual video into a spatially localised, object-based audio-visual representation. Self-supervised and weakly supervised deep learning will be investigated for the transformation of general scenes into semantically labelled and localised objects. This will build on recent advances in deep-learning-based monocular reconstruction of general dynamic scenes and of objects with known semantic labels, such as people. Multi-modal information sources, including audio and text subtitles, will be employed to support weakly supervised learning for semantic labelling and object-based reconstruction. The goal of this research is to generalise to unconstrained video sequences of complex real-world scenes with multiple interacting people. The research will investigate approaches for transferring multi-modal or additional information to support object-based scene reconstruction, and will evaluate the relative importance of the different information sources. The approach should achieve plausible reconstruction of unknown or unmodelled object classes, together with complete reconstruction of modelled object classes. Learning on in-the-wild and BBC archive datasets will be investigated to support generalisation to complex scenes. Specific use cases, such as sports and programme recommendation, will also be investigated for evaluation in constrained contexts. The approach will be evaluated on both live and legacy content.



Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/W522272/1                                   30/09/2021  29/09/2026
2701695            Studentship   EP/W522272/1  31/03/2022  29/03/2026  Asmar Nadeem