Deep Learning from Crawled Spatio-Temporal Representations of Video (DECSTER)

Lead Research Organisation: University College London
Department Name: Electronic and Electrical Engineering

Abstract

Video has been one of the most pervasive forms of online media for some time, and traffic forecasts indicate that video will dominate IP networks within the next five years. Yet, video remains one of the least-manageable elements of the big data ecosystem. This project argues that this difficulty stems primarily from the fact that advanced computer vision and machine learning algorithms view video as a stream of frames of picture elements (pixels), despite pixel-domain representations being notoriously difficult to manage in machine learning systems, mainly due to their high volume, the high redundancy between successive frames, and artifacts stemming from camera calibration under varying illumination.

We propose to abandon pixel representations and instead consider spatio-temporal activity information that is directly extractable from compressed video bitstreams or from neuromorphic vision sensing (NVS) hardware. The first key outcome of the project will be the design of deep neural networks (DNNs) that ingest such activity information in order to derive state-of-the-art classification, action recognition and retrieval results within large video datasets. This will be achieved at record-breaking speed and with accuracy comparable to the best DNN designs that utilise pixel-domain video representations and/or optical flow calculations. The second key outcome will be the design and prototyping of a crawler-based bitstream parsing and analysis service, in which part of the parsing and processing is carried out by a bitstream crawler running on a remote repository, while the back-end processing is carried out by high-performance servers in the cloud. This will enable, for the first time, the continuous parsing of large compressed video content libraries and NVS repositories with new and improved versions of crawlers, in order to derive continuously improved semantics or to track changes and new content elements, in a manner similar to how search engine bots continuously crawl web content. These outcomes will pave the way for exabyte-scale video datasets to be newly discovered and analysed over commodity hardware.
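
As a rough illustration of the kind of compressed-domain input involved, the sketch below shows how motion vectors could be pulled from a compressed bitstream without full pixel reconstruction and rasterised into a coarse two-channel activity map that a DNN could ingest. The use of PyAV (FFmpeg's export_mvs side data), the mv_activity_map name and the 16x16 grid size are assumptions made for this example only; they are not taken from the proposal or the project codebase.

import av  # PyAV: Python bindings for FFmpeg (assumed available)
import numpy as np

GRID = 16  # assumed cell size: one activity vector per 16x16 block


def mv_activity_map(path):
    """Yield a coarse (2, H/GRID, W/GRID) motion-activity map per frame.

    Illustrative only: approximates the kind of compressed-domain
    spatio-temporal activity input described in the abstract.
    """
    with av.open(path) as container:
        stream = container.streams.video[0]
        # Ask the decoder to export motion vectors as frame side data.
        stream.codec_context.options = {"flags2": "+export_mvs"}
        rows = stream.codec_context.height // GRID
        cols = stream.codec_context.width // GRID
        for frame in container.decode(stream):
            grid = np.zeros((2, rows, cols), dtype=np.float32)
            mvs = frame.side_data.get("MOTION_VECTORS")
            if mvs is not None:
                for mv in mvs.to_ndarray():
                    c = min(max(int(mv["dst_x"]) // GRID, 0), cols - 1)
                    r = min(max(int(mv["dst_y"]) // GRID, 0), rows - 1)
                    scale = max(int(mv["motion_scale"]), 1)
                    grid[0, r, c] = float(mv["motion_x"]) / scale
                    grid[1, r, c] = float(mv["motion_y"]) / scale
            yield grid  # a per-frame tensor a classification DNN could consume

Stacking such maps over a short temporal window would give a compact clip-level input, which is what allows analysis to proceed without decoding or transferring pixels.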

Planned Impact

Industrial stakeholders and the general public will benefit from the results of this research via the development of advanced video classification and retrieval services at scales and resource levels that are impossible to achieve with conventional pixel-based video analysis systems. The project outcomes may therefore enable a wide range of new and emerging consumer video and Internet-of-Things (IoT) applications, helping to meet public expectations for the future of advanced visual computing systems. The role of industry, and in particular of our industrial partners, will be of paramount importance here, especially in view of the significance of the widespread adoption of new media processing technologies in numerous vertical sectors such as advertising, surveillance and recommendation services. The dissemination of our research outputs to standardisation bodies, such as the ongoing work of ISO/IEC MPEG on CDVA and the ISO ISAN extensions, will facilitate this impact.

Our industrial partners are in a leading position to exploit the research outcomes within their products and services (e.g., Soundmouse for the creative industries sector and Yamaha Motor for smart vehicles), and the planned interactions with them will substantially facilitate this. Overall, as detailed in the Impact document, this encompasses three large areas: creative content production and management systems, cloud computing services for media processing, and IoT-oriented vehicle and surveillance systems.
 
Description

In our publications, we have shown that, with minimal loss of accuracy against the state of the art in video classification, action/object localisation and recognition, and video retrieval, a processing-throughput increase of up to 100-fold is achieved. We can also reduce the data transfer requirements to as low as 3 kilobits per second. In addition, we have derived state-of-the-art performance for classification based on neuromorphic vision sensing. These results were published in top-tier conferences and journals, including IEEE ICCV and IEEE Transactions on Image Processing. Beyond these initially foreseen outcomes, we have also shown that some of our neural-network-based techniques are applicable to high-speed classification of disruption events in fusion reactors (where the inputs are either visual or time-series data). Parts of this research are still ongoing, and initial results have been published in the journal Nuclear Fusion in collaboration with Columbia University in the US.
Exploitation Route

Our industrial partners are in a leading position to exploit the research outcomes within their products and services (e.g., Soundmouse for the creative industries sector and Yamaha Motor for smart vehicles), and the planned interactions with them will substantially facilitate this. Overall, as detailed in the Impact document, this encompasses three large areas: creative content production and management systems, cloud computing services for media processing, and IoT-oriented vehicle and surveillance systems.
Sectors Aerospace, Defence and Marine

Digital/Communication/Information Technologies (including Software)

Energy

URL https://github.com/rate-accuracy-mvcnn/main
 
Description

1) Open-source software has been released (the PIX2NVS and MV-CNN projects on GitHub) and is now in use for industrial and academic R&D. 2) Follow-on funding from Innovate UK (QUEMAT, project 104408) has been secured in collaboration with FOCAL International, Dithen Limited and the UCL Media Institute. 3) The PI has carried out expert-witness work related to the topics of this project.
First Year Of Impact 2018
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal, Economic