Deep Learning from Crawled Spatio-Temporal Representations of Video (DECSTER)
Lead Research Organisation:
Queen Mary University of London
Department Name: Sch of Electronic Eng & Computer Science
Abstract
Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.
People
Ioannis Patras (Principal Investigator)
Publications
Apostolidis E (2021) "AC-SUM-GAN: Connecting Actor-Critic and Generative Adversarial Networks for Unsupervised Video Summarization", in IEEE Transactions on Circuits and Systems for Video Technology
Apostolidis E (2021) "Video Summarization Using Deep Neural Networks: A Survey", in Proceedings of the IEEE
Apostolidis E (2020) "Performance over Random"
Apostolidis E (2021) "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection"
Apostolidis E (2022) "Combining Global and Local Attention with Positional Encoding for Video Summarization"
| Description | The work has focused on Deep Learning methods for action recognition and action localisation. We have concentrated in particular on fine-grained recognition and have developed baselines for action localisation, as outlined in the original project description. A key finding underlying all related publications is that fine-grained temporal analysis, i.e., analysis at increased temporal resolutions, is important for higher performance. We have developed methods for action recognition and action retrieval that rely on mechanisms for feature extraction at high temporal resolution, and on mechanisms for temporal alignment for estimating similarities/distances between videos. We have shown that this leads to higher performance in comparison to crude video-level representations. This has been extended to fine-grained (temporal) localisation of actions in long, untrimmed image sequences. A second key finding is that, by using a framework called knowledge distillation, in which networks are used to train each other, it is possible to achieve different trade-offs between accuracy, speed and storage requirements. In a parallel direction, our work on video summarisation has shown the limitations of the current evaluation protocols, and how variations of deep learning methods can keep improving the state of the art. |
| Exploitation Route | We have developed methods for video recognition, action localisation and video summarisation that are published. We also provide code and datasets that are also in the public domain. Those can be used by others to benchmark their methods, to train their models and to improve on the methods that we have developed. |
| Sectors | Creative Economy; Healthcare; Culture, Heritage, Museums and Collections |
| Description | We have developed methods that have been widely used by researchers in the field, and have provided the code, models and data that we used. In addition, in collaboration with CERTH-ITI, we have made publicly available a dataset that has been widely used in the field. |
| First Year Of Impact | 2020 |
| Sector | Other |
| Description | AI4Media |
| Amount | € 12,000,000 (EUR) |
| Funding ID | 951911 |
| Organisation | European Commission |
| Sector | Public |
| Country | Belgium |
| Start | 08/2020 |
| End | 09/2024 |
| Title | CA-SUM pretrained models |
| Description | This dataset contains pretrained models of the CA-SUM network architecture for video summarization, which is presented in our work titled "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", in Proc. ACM ICMR 2022. Method overview: In our ICMR 2022 paper we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, which relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the limited ability to parallelize the training of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks on the main diagonal of the attention matrix, and enriches the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates of the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.
File format: The "pretrained_models.zip" file provided on this Zenodo page contains a set of pretrained models of the CA-SUM network architecture. After downloading and unpacking this file, in the created "pretrained_models" folder you will find two sub-directories, one for each of the benchmark datasets (SumMe and TVSum) used in our experimental evaluations. Within each of these sub-directories we provide the pretrained model (.pt file) for each data split (split0-split4), where the name of the provided .pt file indicates the training epoch and the value of the length regularization factor of the selected pretrained model. The models have been trained in full-batch mode (i.e., the batch size is equal to the number of training samples) and were automatically selected after the end of the training process, based on a methodology that relies on transductive inference (described in Section 4.2 of [1]). Finally, the data splits we used for performing inference with the provided pretrained models, and the source code that can be used for training your own models of the proposed CA-SUM network architecture, can be found at: https://github.com/e-apostolidis/CA-SUM. License and Citation: These resources are provided for academic, non-commercial use only. If you find these resources useful in your work, please cite the following publication, where they are introduced: [1] E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras, "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", Proc. of the 2022 Int. Conf. on Multimedia Retrieval (ICMR '22), June 2022, Newark, NJ, USA. https://doi.org/10.1145/3512527.3531404 Software available at: https://github.com/e-apostolidis/CA-SUM |
| Type Of Material | Database/Collection of data |
| Year Produced | 2022 |
| Provided To Others? | Yes |
| Impact | The GitHub page had 29 stars and 11 forks (as of 03/2025), indicating good usage of the code by the community. |
| URL | https://zenodo.org/record/6562992 |
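The concentrated attention idea described in the CA-SUM record above can be illustrated with a minimal sketch. This is a hypothetical simplification for illustration, not the released CA-SUM code: plain dot-product self-attention whose attention matrix is masked to non-overlapping blocks on the main diagonal; the block size and toy feature dimensions are assumptions.

```python
import numpy as np

def concentrated_attention(features, block_size):
    """Toy sketch of concentrated attention: self-attention restricted to
    non-overlapping blocks on the main diagonal of the attention matrix
    (a simplification of the CA-SUM idea, not the authors' implementation)."""
    n, d = features.shape
    scores = features @ features.T / np.sqrt(d)   # (n, n) attention logits
    # Mask everything outside the diagonal blocks with -inf.
    mask = np.full((n, n), -np.inf)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        mask[start:end, start:end] = 0.0
    scores = scores + mask
    # Row-wise softmax; masked entries get zero weight.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features   # attended frame representations

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))   # 8 toy "frames", 4-dim features
out = concentrated_attention(frames, block_size=4)
print(out.shape)  # prints (8, 4)
```

Because each frame only attends within its own diagonal block, the effective attention computation (and, in the learned version, the parameter count) shrinks relative to full global attention.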
| Title | PGL-SUM pretrained models |
| Description | This dataset contains pretrained models of the PGL-SUM network architecture for video summarization, which is presented in our work titled "Combining Global and Local Attention with Positional Encoding for Video Summarization", in Proc. IEEE ISM 2021. This work introduces a new method for supervised video summarization, which aims to overcome drawbacks of existing RNN-based summarization architectures that relate to the modeling of long-range frame dependencies and the limited ability to parallelize the training process. The proposed PGL-SUM network architecture relies on self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches that model the frames' dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frames' dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames, which is of major importance when producing a video summary. Experiments on two benchmark datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. File format: The provided "pretrained_models.zip" file contains two sets of pretrained models of the PGL-SUM network architecture. After downloading and unpacking this file, in the created "pretrained_models" folder you will find the sub-directories "table3_models" and "table4_models". The sub-directory "table3_models" contains models of the PGL-SUM network architecture that have been trained in single-batch mode and were manually selected based on the observed summarization performance on the videos of the test set.
The average performance of these models (over the five utilized data splits) is reported in Table III of [1]. The sub-directory "table4_models" contains models of the PGL-SUM network architecture that have been trained in full-batch mode and were automatically selected after the end of the training process, based on the recorded training losses and the application of the designed model selection criterion (described in Section IV.B of [1]). The average performance of these models (over the five utilized data splits) is reported in Table IV of [1]. Each of these sub-directories contains the pretrained model (.pt file) for each benchmark dataset ({SumMe, TVSum}) and each data split ({0, 1, 2, 3, 4}). The name of each provided .pt file indicates the training epoch associated with the selected pretrained model. Finally, the data splits we used for performing inference with the provided pretrained models, and the source code that can be used for training your own models of the proposed PGL-SUM network architecture, can be found at: https://github.com/e-apostolidis/PGL-SUM. License and Citation: This dataset is provided for academic, non-commercial use only. If you find this dataset useful in your work, please cite the following publication, where it is introduced: [1] E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, "Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. 23rd IEEE Int. Symposium on Multimedia (ISM), Dec. 2021. Software available at: https://github.com/e-apostolidis/PGL-SUM Acknowledgements: This work was supported by the EU Horizon 2020 programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2021 |
| Provided To Others? | Yes |
| Impact | The GitHub page has 87 stars and 33 forks, indicating good utilisation of the code by the community. |
| URL | https://zenodo.org/record/5635735 |
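The combination of global and segment-wise local attention with positional encoding, as described in the PGL-SUM record above, can be sketched as follows. This is a toy illustration under assumptions (sinusoidal positional encoding of the Transformer flavour, simple additive fusion of the two attention outputs, single-head attention); it is not the authors' implementation.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def attend(q, k, v):
    """Plain single-head dot-product attention."""
    return softmax_rows(q @ k.T / np.sqrt(q.shape[1])) @ v

def global_local_attention(frames, num_segments):
    """Toy sketch: one global attention over the whole sequence plus
    local attention within each temporal segment, on positionally
    encoded inputs (a simplification of the PGL-SUM idea)."""
    n, d = frames.shape
    # Assumed sinusoidal positional encoding (Transformer-style).
    pos = np.arange(n)[:, None] / (10000 ** (np.arange(d)[None, :] / d))
    x = frames + np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    out_global = attend(x, x, x)                     # whole sequence
    out_local = np.zeros_like(x)
    for seg in np.array_split(np.arange(n), num_segments):
        out_local[seg] = attend(x[seg], x[seg], x[seg])  # within segment
    return out_global + out_local                    # simple additive fusion

rng = np.random.default_rng(1)
y = global_local_attention(rng.standard_normal((12, 6)), num_segments=4)
print(y.shape)  # prints (12, 6)
```

The local branch captures fine-grained dependencies inside each segment, while the global branch preserves long-range context; the positional encoding is what lets the permutation-invariant attention distinguish temporal order.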
| Title | ViSiL code and models |
| Description | This repository contains the TensorFlow implementation of the paper "ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning", ICCV 2019, a method for video retrieval. It provides code for the calculation of similarities between the query and database videos given by the user, and it contains an evaluation script to reproduce the results of the paper. The video similarity calculation is achieved by applying a frame-to-frame function that respects the spatial within-frame structure of videos, and a learned video-to-video similarity function that also considers the temporal structure of videos. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2020 |
| Provided To Others? | Yes |
| Impact | ViSiL has drawn attention since it was made publicly available, with 25 forks (people/groups that started building upon it) and 114 GitHub stars. Various researchers have also contributed pull requests, for instance porting the framework to PyTorch. |
| URL | https://github.com/MKLab-ITI/visil |
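The frame-to-frame and video-to-video similarity functions described in the ViSiL record above can be sketched with Chamfer similarity over region vectors and over the resulting frame-similarity matrix. This is a minimal illustration of the idea, not the released ViSiL code: the learned refinement network is omitted, and the region counts and feature sizes are assumptions.

```python
import numpy as np

def chamfer_similarity(sim):
    """Mean over rows of the per-row maximum of a similarity matrix."""
    return sim.max(axis=1).mean()

def frame_to_frame(regions_a, regions_b):
    """Frame similarity respecting within-frame spatial structure:
    dot products between L2-normalised region vectors, reduced by
    Chamfer similarity (sketch of the ViSiL idea)."""
    a = regions_a / np.linalg.norm(regions_a, axis=1, keepdims=True)
    b = regions_b / np.linalg.norm(regions_b, axis=1, keepdims=True)
    return chamfer_similarity(a @ b.T)

def video_to_video(video_a, video_b):
    """Videos as lists of per-frame region-feature arrays; ViSiL's
    learned video-level refinement is omitted here for brevity."""
    sims = np.array([[frame_to_frame(fa, fb) for fb in video_b]
                     for fa in video_a])
    return chamfer_similarity(sims)

rng = np.random.default_rng(2)
vid = [rng.standard_normal((9, 5)) for _ in range(3)]  # 3 frames, 9 regions each
print(round(video_to_video(vid, vid), 3))  # prints 1.0 (self-similarity)
```

Because the region vectors are unit-normalised, a video compared with itself scores exactly 1.0, while unrelated videos score lower; the two-level Chamfer reduction is what lets the measure respect both spatial and temporal structure.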
| Description | Collaboration with the Institute of Telematics and Informatics |
| Organisation | Centre for Research and Technology Hellas (CERTH) |
| Country | Greece |
| Sector | Academic/University |
| PI Contribution | QMUL has a long-standing collaboration with the Institute of Telematics and Informatics, Centre for Research and Technology Hellas (CERTH-ITI). During the period of the Decster project, Georgios Kordopatis-Zilos has been performing research on video retrieval under the supervision of Ioannis Patras. The work has been aligned with the aims of the DECSTER project so as to address action recognition, and more specifically action retrieval; it has resulted in two publications in selective Computer Vision and Multimedia Analysis venues. |
| Collaborator Contribution | The Informatics and Telematics Institute (ITI-CERTH, Greece) funded the salary and fees of the researcher, and provided equipment, travel costs and co-supervision of the research. |
| Impact | The collaboration is long-standing, dating back to 2009. Within the Decster project, in the period until 01-2020, there have been two publications aligned with the goals of the project. |
| Start Year | 2018 |
| Title | Few-Shot Action Localization without Knowing Boundaries |
| Description | The repository contains the implementation of "Few-Shot Action Localization without Knowing Boundaries" (2021 International Conference on Multimedia Retrieval), and provides the training and the evaluation code for reproducing the reported results |
| Type Of Technology | Software |
| Year Produced | 2021 |
| Open Source License? | Yes |
| Impact | The code has been very recently released. |
| URL | https://github.com/June01/WFSAL-icmr21 |
| Title | Performance over Random -- repository |
| Description | The repository contains the implementation of "Performance over Random: A Robust Evaluation Protocol for Video Summarization Methods" (28th ACM International Conference on Multimedia (MM '20)) and can be used for evaluating the summaries of a video summarization method using the PoR evaluation protocol. |
| Type Of Technology | Software |
| Year Produced | 2020 |