Deep Learning from Crawled Spatio-Temporal Representations of Video (DECSTER)
Lead Research Organisation: University College London
Department Name: Electronic and Electrical Engineering
Abstract
Video has been one of the most pervasive forms of online media for some time, and industry traffic forecasts project that it will dominate IP networks within the next five years. Yet video remains one of the least manageable elements of the big-data ecosystem. This project argues that the difficulty stems primarily from the fact that advanced computer vision and machine learning algorithms view video as a stream of frames of picture elements, even though pixel-domain representations are notoriously difficult to manage in machine learning systems, mainly due to their high volume, the high redundancy between successive frames, and artifacts stemming from camera calibration under varying illumination.
We propose to abandon pixel representations and instead consider spatio-temporal activity information that is directly extractable from compressed video bitstreams or neuromorphic vision sensing (NVS) hardware. The first key outcome of the project will be deep neural networks (DNNs) that ingest such activity information to deliver state-of-the-art classification, action recognition and retrieval results within large video datasets, at record-breaking speed and with accuracy comparable to the best DNN designs that rely on pixel-domain video representations and/or optical-flow calculations. The second key outcome will be the design and prototyping of a crawler-based bitstream parsing and analysis service, in which part of the parsing and processing is carried out by a bitstream crawler running on a remote repository, while the back-end processing is carried out by high-performance servers in the cloud. This will enable, for the first time, the continuous parsing of large compressed video content libraries and NVS repositories with new and improved versions of crawlers, in order to derive continuously improved semantics or to track changes and new content elements, much as search engine bots continuously crawl web content. These outcomes will pave the way for exabyte-scale video datasets to be discovered and analysed over commodity hardware.
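As a concrete illustration of the first outcome, the following minimal sketch (in PyTorch, and not the project's actual MV-CNN design) shows the kind of network that could ingest a per-macroblock motion-vector field, rasterised as a 2-channel grid of horizontal and vertical displacements, instead of decoded pixels. The layer sizes, class count and grid shapes are illustrative assumptions.

```python
# Illustrative sketch only: a small CNN that classifies a video clip from
# its compressed-domain motion-vector field, i.e. a (time, 2, H/16, W/16)
# tensor of per-macroblock (dx, dy) displacements -- no pixel decoding.
import torch
import torch.nn as nn

class MotionVectorCNN(nn.Module):
    def __init__(self, num_classes: int = 400):  # class count is a placeholder
        super().__init__()
        # 2 input channels: horizontal and vertical displacement per macroblock.
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, mv: torch.Tensor) -> torch.Tensor:
        # mv: (batch, time, 2, H/16, W/16). Fold time into the batch,
        # score every frame, then average the logits over the clip.
        b, t, c, h, w = mv.shape
        x = self.features(mv.view(b * t, c, h, w)).flatten(1)
        return self.classifier(x).view(b, t, -1).mean(dim=1)

model = MotionVectorCNN()
clip = torch.randn(1, 16, 2, 68, 120)  # dummy 16-frame MV field (1080p gives a 68x120 macroblock grid)
print(model(clip).shape)  # torch.Size([1, 400])
```

Because the bitstream carries motion vectors per block rather than per pixel, such an input is orders of magnitude smaller than the decoded frames, which is where the targeted throughput gains would come from.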
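The second outcome, the crawler/back-end split, could be prototyped along the lines of the sketch below: a lightweight front end runs next to the video repository, reduces each bitstream to a compact activity descriptor without decoding pixels, and ships it to cloud servers that run the DNNs. The descriptor contents, file layout and endpoint are all hypothetical placeholders, not the project's service.

```python
# Hypothetical crawler front end: walk a repository, reduce each bitstream
# to a compact per-clip descriptor, and POST it to a cloud back end that
# runs DNN inference and returns semantics for indexing.
import json
import pathlib
import urllib.request

BACKEND_URL = "http://backend.example.com/ingest"  # hypothetical endpoint

def activity_descriptor(path: pathlib.Path) -> dict:
    # Placeholder: a real crawler would parse motion vectors, block modes
    # and GOP structure from the bitstream here, without pixel decoding.
    return {"file": path.name, "bytes": path.stat().st_size}

def crawl(repo_root: str) -> None:
    for path in sorted(pathlib.Path(repo_root).rglob("*.mp4")):
        payload = json.dumps(activity_descriptor(path)).encode("utf-8")
        request = urllib.request.Request(
            BACKEND_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()  # semantics returned by the back end

if __name__ == "__main__":
    crawl("/data/videos")  # illustrative repository root
```

Under such a split, re-crawling a library with an improved analysis version only requires redeploying the lightweight front end and the back-end models; the stored bitstreams themselves never have to be decoded or transferred in full.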
Planned Impact
Industrial stakeholders and the general public will benefit from this research via advanced video classification and retrieval services at scale and resource levels that are impossible to achieve with conventional pixel-based video analysis systems. The project outcomes may therefore enable a wide range of new and emerging consumer video and Internet-of-Things (IoT) applications, helping to meet public expectations for the future of advanced visual computing systems. The role of industry, and in particular of our industrial partners, will be of paramount importance here, especially given the significance of the widespread adoption of new media processing technologies in vertical sectors such as advertising, surveillance and recommendation services. The dissemination of our research outputs to standardisation bodies, such as the ongoing work of ISO/IEC MPEG on CDVA and the ISO ISAN extensions, will facilitate this impact.
Our industrial partners are in a leading position to exploit the research outcomes within their products and services (e.g., Soundmouse for the creative industries sector and Yamaha Motor for smart vehicles), and the planned interactions with them will substantially facilitate this. Overall, as detailed in the Impact document, this encompasses three large areas: creative content production and management systems, cloud computing services for media processing, and IoT-oriented vehicle and surveillance systems.
Organisations
University College London (Lead Research Organisation)
People
Yiannis Andreopoulos (Principal Investigator)
Publications
Jubran M (2022) Sequence-Level Reference Frames in Video Coding, in IEEE Transactions on Circuits and Systems for Video Technology
Jubran M (2020) Rate-Accuracy Trade-Off in Video Classification With Deep Convolutional Neural Networks, in IEEE Transactions on Circuits and Systems for Video Technology
Abbas A (2020) Biased Mixtures of Experts: Enabling Computer Vision Inference Under Data Transfer Limitations, in IEEE Transactions on Image Processing
Chadha A (2019) Improved Techniques for Adversarial Discriminative Domain Adaptation, in IEEE Transactions on Image Processing
Bi Y (2020) Graph-based Spatio-Temporal Feature Learning for Neuromorphic Vision Sensing, in IEEE Transactions on Image Processing
Kordopatis-Zilos G (2022) DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval, in International Journal of Computer Vision
Piccione A (2022) Predicting resistive wall mode stability in NSTX through balanced random forests and counterfactual explanations, in Nuclear Fusion
Piccione A (2020) Physics-guided machine learning approaches to predict the ideal stability properties of fusion plasmas, in Nuclear Fusion
Description | In our publications, we have shown that, with minimal loss of accuracy against the state of the art in video classification, action/object localisation and recognition, and video retrieval, up to a 100-fold increase in processing throughput is achieved, while data transfer requirements can be reduced to as low as 3 kilobits per second. We have also derived state-of-the-art performance for classification based on neuromorphic vision sensing. These results were published in top-tier venues, including IEEE ICCV and IEEE Transactions on Image Processing. Beyond these initially foreseen outcomes, we have shown that some of our neural-network-based techniques are applicable to high-speed classification of disruption events in fusion reactors (where the inputs are either visual or time-series data). Parts of this research are still ongoing, and initial results have been published in the Nuclear Fusion journal in collaboration with Columbia University in the US. |
Exploitation Route | Our industrial partners are in a leading position to exploit the research outcomes within their products and services (e.g., Soundmouse for the creative industries sector and Yamaha Motor for smart vehicles), and the planned interactions with them will substantially facilitate this. Overall, as detailed in the Impact document, this encompasses three large areas: creative content production and management systems, cloud computing services for media processing, and IoT-oriented vehicle and surveillance systems. |
Sectors | Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Energy
URL | https://github.com/rate-accuracy-mvcnn/main |
Description | 1) Open-source software has been released (the PIX2NVS and MV-CNN projects on GitHub) and is now in use for industrial and academic R&D. 2) Follow-on funding from Innovate UK (QUEMAT, project 104408) has been secured in collaboration with FOCAL International, Dithen Limited and the UCL Media Institute. 3) The PI has carried out expert-witness work related to the topics of this proposal. |
First Year Of Impact | 2018 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Societal; Economic