Deep Learning from Crawled Spatio-Temporal Representations of Video (DECSTER)
Lead Research Organisation:
Queen Mary University of London
Department Name: Sch of Electronic Eng & Computer Science
Abstract
Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.
People
Ioannis Patras (Principal Investigator)
Publications
Apostolidis E (2021) "AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization", in IEEE Transactions on Circuits and Systems for Video Technology
Apostolidis E (2021) "Video Summarization Using Deep Neural Networks: A Survey", in Proceedings of the IEEE
Apostolidis E (2020) "Performance over Random"
Apostolidis E (2021) "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection"
Apostolidis E (2022) "Combining Global and Local Attention with Positional Encoding for Video Summarization"
| Description | The work has focused on Deep Learning methods for action recognition and action localisation. We have concentrated in particular on fine-grained recognition and have developed baselines for action localisation, as outlined in the original project description. A key finding underlying all related publications is that fine-grained temporal analysis, i.e., analysis at increased temporal resolutions, is important for higher performance. We have developed methods for action recognition and action retrieval that rely on mechanisms for feature extraction at high temporal resolution, and on mechanisms for temporal alignment for estimating similarities/distances between videos. We have shown that this leads to higher performance in comparison to crude video-level representations. This has been extended to fine-grained (temporal) localisation of actions in long, untrimmed image sequences. A second key finding is that, by using a framework called knowledge distillation, in which networks are used to train each other, it is possible to achieve different trade-offs between accuracy, speed and storage requirements. In a parallel direction, our work on video summarisation has shown the limitations of the current evaluation protocols, and how variations of deep learning methods can keep improving the state of the art. |
| Exploitation Route | We have developed methods for video recognition, action localisation and video summarisation that are published. We also provide code and datasets that are also in the public domain. Those can be used by others to benchmark their methods, to train their models and to improve on the methods that we have developed. |
| Sectors | Creative Economy; Healthcare; Culture, Heritage, Museums and Collections |
| Description | We have developed methods that have been widely used by researchers in the field, and have provided the code, models and data that we used. In addition, in collaboration with CERTH-ITI, we have made publicly available a dataset that has been widely used in the field. |
| First Year Of Impact | 2020 |
| Sector | Other |
| Description | AI4Media |
| Amount | € 12,000,000 (EUR) |
| Funding ID | 951911 |
| Organisation | European Commission |
| Sector | Public |
| Country | Belgium |
| Start | 08/2020 |
| End | 09/2024 |
| Title | CA-SUM pretrained models |
| Description | This dataset contains pretrained models of the CA-SUM network architecture for video summarization, which is presented in our work titled "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", in Proc. ACM ICMR 2022. Method overview: In our ICMR 2022 paper we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, which relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the limited ability to parallelize the training of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks on the main diagonal of the attention matrix, and enriches the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates of the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.
File format: The "pretrained_models.zip" file provided on this Zenodo page contains a set of pretrained models of the CA-SUM network architecture. After downloading and unpacking this file, in the created "pretrained_models" folder you will find two sub-directories, one for each of the benchmark datasets (SumMe and TVSum) used in our experimental evaluations. Within each of these sub-directories we provide the pretrained model (.pt file) for each data split (split0-split4), where the name of the provided .pt file indicates the training epoch and the value of the length regularization factor of the selected pretrained model. The models have been trained in full-batch mode (i.e., the batch size is equal to the number of training samples) and were automatically selected after the end of the training process, based on a methodology that relies on transductive inference (described in Section 4.2 of [1]). Finally, the data splits we used for performing inference with the provided pretrained models, and the source code that can be used for training your own models of the proposed CA-SUM network architecture, can be found at: https://github.com/e-apostolidis/CA-SUM. License and Citation: These resources are provided for academic, non-commercial use only. If you find these resources useful in your work, please cite the following publication, where they are introduced: [1] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", Proc. of the 2022 Int. Conf. on Multimedia Retrieval (ICMR '22), June 2022, Newark, NJ, USA. https://doi.org/10.1145/3512527.3531404 Software available at: https://github.com/e-apostolidis/CA-SUM |
| Type Of Material | Database/Collection of data |
| Year Produced | 2022 |
| Provided To Others? | Yes |
| Impact | The GitHub page had 29 stars and 11 forks (as of 03/2025), indicating good usage of the code by the community. |
| URL | https://zenodo.org/record/6562992 |
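The concentrated attention idea described in the CA-SUM record above can be illustrated with a minimal sketch. This is a hypothetical simplification for illustration, not the released CA-SUM code: plain dot-product self-attention whose attention matrix is masked to non-overlapping blocks on the main diagonal; the block size and toy feature dimensions are assumptions.

```python
import numpy as np

def concentrated_attention(features, block_size):
    """Toy sketch of concentrated attention: self-attention restricted to
    non-overlapping blocks on the main diagonal of the attention matrix
    (a simplification of the CA-SUM idea, not the authors' implementation)."""
    n, d = features.shape
    scores = features @ features.T / np.sqrt(d)   # (n, n) attention logits
    # Mask everything outside the diagonal blocks with -inf.
    mask = np.full((n, n), -np.inf)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        mask[start:end, start:end] = 0.0
    scores = scores + mask
    # Row-wise softmax; masked entries get zero weight.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features   # attended frame representations

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))   # 8 toy "frames", 4-dim features
out = concentrated_attention(frames, block_size=4)
print(out.shape)  # prints (8, 4)
```

Because each frame only attends within its own diagonal block, the effective attention computation (and, in the learned version, the parameter count) shrinks relative to full global attention.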
| Title | PGL-SUM pretrained models |
| Description | This dataset contains pretrained models of the PGL-SUM network architecture for video summarization, which is presented in our work titled "Combining Global and Local Attention with Positional Encoding for Video Summarization", in Proc. IEEE ISM 2021. This work introduces a new method for supervised video summarization, which aims to overcome drawbacks of existing RNN-based summarization architectures that relate to the modeling of long-range frame dependencies and the limited ability to parallelize the training process. The proposed PGL-SUM network architecture relies on self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model the frames' dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frames' dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames, which is of major importance when producing a video summary. Experiments on two benchmark datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. File format: The provided "pretrained_models.zip" file contains two sets of pretrained models of the PGL-SUM network architecture. After downloading and unpacking this file, in the created "pretrained_models" folder you will find the sub-directories "table3_models" and "table4_models". The sub-directory "table3_models" contains models of the PGL-SUM network architecture that have been trained in single-batch mode and were manually selected based on the observed summarization performance on the videos of the test set.
The average performance of these models (over the five utilized data splits) is reported in Table III of [1]. The sub-directory "table4_models" contains models of the PGL-SUM network architecture that have been trained in full-batch mode and were automatically selected after the end of the training process, based on the recorded training losses and the application of the designed model selection criterion (described in Section IV.B of [1]). The average performance of these models (over the five utilized data splits) is reported in Table IV of [1]. Each of these sub-directories contains the pretrained model (.pt file) for each benchmark dataset ({SumMe, TVSum}) and each data split ({0, 1, 2, 3, 4}). The name of each provided .pt file indicates the training epoch associated with the selected pretrained model. Finally, the data splits we used for performing inference with the provided pretrained models, and the source code that can be used for training your own models of the proposed PGL-SUM network architecture, can be found at: https://github.com/e-apostolidis/PGL-SUM. License and Citation: This dataset is provided for academic, non-commercial use only. If you find this dataset useful in your work, please cite the following publication, where it is introduced: [1] E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, "Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. 23rd IEEE Int. Symposium on Multimedia (ISM), Dec. 2021. Software available at: https://github.com/e-apostolidis/PGL-SUM Acknowledgements: This work was supported by the EU Horizon 2020 programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2021 |
| Provided To Others? | Yes |
| Impact | The GitHub page has 87 stars and 33 forks, indicating good utilisation of the code by the community. |
| URL | https://zenodo.org/record/5635735 |
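The combination of global and segment-wise local attention with positional encoding, as described in the PGL-SUM record above, can be sketched as follows. This is a toy illustration under assumptions (sinusoidal positional encoding of the Transformer flavour, simple additive fusion of the two attention outputs, single-head attention); it is not the authors' implementation.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def attend(q, k, v):
    """Plain single-head dot-product attention."""
    return softmax_rows(q @ k.T / np.sqrt(q.shape[1])) @ v

def global_local_attention(frames, num_segments):
    """Toy sketch: one global attention over the whole sequence plus
    local attention within each temporal segment, on positionally
    encoded inputs (a simplification of the PGL-SUM idea)."""
    n, d = frames.shape
    # Assumed sinusoidal positional encoding (Transformer-style).
    pos = np.arange(n)[:, None] / (10000 ** (np.arange(d)[None, :] / d))
    x = frames + np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    out_global = attend(x, x, x)                     # whole sequence
    out_local = np.zeros_like(x)
    for seg in np.array_split(np.arange(n), num_segments):
        out_local[seg] = attend(x[seg], x[seg], x[seg])  # within segment
    return out_global + out_local                    # simple additive fusion

rng = np.random.default_rng(1)
y = global_local_attention(rng.standard_normal((12, 6)), num_segments=4)
print(y.shape)  # prints (12, 6)
```

The local branch captures fine-grained dependencies inside each segment, while the global branch preserves long-range context; the positional encoding is what lets the permutation-invariant attention distinguish temporal order.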
| Title | ViSiL code and models |
| Description | This repository contains the TensorFlow implementation of the paper "ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning", ICCV 2019, a method for video retrieval. It provides code for the calculation of similarities between the query and database videos given by the user, and it contains an evaluation script to reproduce the results of the paper. The video similarity calculation is achieved by applying a frame-to-frame function that respects the spatial within-frame structure of videos, and a learned video-to-video similarity function that also considers the temporal structure of videos. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2020 |
| Provided To Others? | Yes |
| Impact | ViSiL has drawn attention since it was made publicly available, with 25 forks (people/groups that started building upon it) and 114 GitHub stars. Various researchers have also contributed pull requests, for instance porting the framework to PyTorch. |
| URL | https://github.com/MKLab-ITI/visil |
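The frame-to-frame and video-to-video similarity functions described in the ViSiL record above can be sketched with Chamfer similarity over region vectors and over the resulting frame-similarity matrix. This is a minimal illustration of the idea, not the released ViSiL code: the learned refinement network is omitted, and the region counts and feature sizes are assumptions.

```python
import numpy as np

def chamfer_similarity(sim):
    """Mean over rows of the per-row maximum of a similarity matrix."""
    return sim.max(axis=1).mean()

def frame_to_frame(regions_a, regions_b):
    """Frame similarity respecting within-frame spatial structure:
    dot products between L2-normalised region vectors, reduced by
    Chamfer similarity (sketch of the ViSiL idea)."""
    a = regions_a / np.linalg.norm(regions_a, axis=1, keepdims=True)
    b = regions_b / np.linalg.norm(regions_b, axis=1, keepdims=True)
    return chamfer_similarity(a @ b.T)

def video_to_video(video_a, video_b):
    """Videos as lists of per-frame region-feature arrays; ViSiL's
    learned video-level refinement is omitted here for brevity."""
    sims = np.array([[frame_to_frame(fa, fb) for fb in video_b]
                     for fa in video_a])
    return chamfer_similarity(sims)

rng = np.random.default_rng(2)
vid = [rng.standard_normal((9, 5)) for _ in range(3)]  # 3 frames, 9 regions each
print(round(video_to_video(vid, vid), 3))  # prints 1.0 (self-similarity)
```

Because the region vectors are unit-normalised, a video compared with itself scores exactly 1.0, while unrelated videos score lower; the two-level Chamfer reduction is what lets the measure respect both spatial and temporal structure.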
| Description | Collaboration with the Institute of Telematics and Informatics |
| Organisation | Centre for Research and Technology Hellas (CERTH) |
| Country | Greece |
| Sector | Academic/University |
| PI Contribution | QMUL has a long-standing collaboration with the Institute of Telematics and Informatics, Centre for Research and Technology Hellas (CERTH-ITI). During the period of the Decster project, Georgios Kordopatis-Zilos has been performing research on video retrieval under the supervision of Ioannis Patras. The work has been aligned with the aims of the DECSTER project so as to address action recognition, and more specifically action retrieval; it has resulted in two publications in selective Computer Vision and Multimedia Analysis venues. |
| Collaborator Contribution | The Informatics and Telematics Institute (ITI-CERTH, Greece) funded the salary and fees of the researcher, and provided equipment, travel costs and co-supervision of the research. |
| Impact | The collaboration is long-standing, dating back to 2009. Within the Decster project, in the period until 01-2020, there have been two publications aligned with the goals of the project. |
| Start Year | 2018 |
| Title | Few-Shot Action Localization without Knowing Boundaries |
| Description | The repository contains the implementation of "Few-Shot Action Localization without Knowing Boundaries" (2021 International Conference on Multimedia Retrieval), and provides the training and the evaluation code for reproducing the reported results |
| Type Of Technology | Software |
| Year Produced | 2021 |
| Open Source License? | Yes |
| Impact | The code has been very recently released. |
| URL | https://github.com/June01/WFSAL-icmr21 |
| Title | Performance over Random -- repository |
| Description | The repository contains the implementation of "Performance over Random: A Robust Evaluation Protocol for Video Summarization Methods" (28th ACM International Conference on Multimedia (MM '20)) and can be used for evaluating the summaries of a video summarization method using the PoR evaluation protocol. |
| Type Of Technology | Software |
| Year Produced | 2020 |