Multimodal Video Search by Examples (MVSE)

Lead Research Organisation: University of Ulster
Department Name: Sch of Computing & Mathematical Sci

Abstract

How to effectively and efficiently search for content in large video archives, such as BBC TV programmes, is a significant challenge. Search is typically done via keyword queries over pre-defined metadata such as titles, tags and viewers' notes. However, it is difficult to use keywords to search for specific moments in a video where a particular speaker talks about a specific topic at a particular location. Most videos have little or no metadata about their content, and automatic metadata extraction is not yet sufficiently reliable. Furthermore, metadata may change over time and cannot cover all content. Therefore, keyword search is not a desirable approach for a comprehensive and long-lasting video search solution.

Video search by examples is a desirable alternative, as it allows searching for content using one or more examples of the content of interest, without having to express that interest as keywords. However, video search by examples is notoriously challenging, and its performance is still poor. To improve search performance, multiple modalities should be considered - image, sound, voice and text - as each modality provides a separate search cue, and multiple cues together should identify more relevant content. This is multimodal video search by examples (MVSE). It is an emerging area of research; the current state of the art is far from what is needed, so there is a long way to go, and no commercial MVSE service exists.

This proposal has been co-created with BBC R&D through the BBC Data Science Partnership via a number of online meetings and one face-to-face meeting involving all partners. The proposal has been informed by recent unpublished ethnographic research on how current BBC staff (producers, journalists, archivists) search for media content. It found that they were very interested in knowledge retrieval from archives and other sources, but that they required richer metadata and cataloguing of non-verbal data.

In this proposal we will study efficient, effective, scalable and robust MVSE where video archives are large, historical and dynamic, and the modalities are person (face or voice), context and topic. The aim is to develop a framework for MVSE and validate it through the development of a prototype search tool. Such a tool will be useful for organisations such as the BBC and the British Library, which maintain large collections of video archives and want to provide search for their own staff as well as for the public. It will also be useful for companies such as YouTube, which host videos from the public and want to enable video search by examples. We will address key challenges in the development of an efficient, effective, scalable and robust MVSE solution, including video segmentation, content representation, hashing, ranking and fusion.
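To make the fusion challenge concrete, the following minimal Python sketch shows one generic way per-modality similarity scores could be combined at query time through weighted late fusion. The modality names, weights and scores are illustrative assumptions only and do not represent the project's actual design.

    # Minimal late-fusion sketch: combine per-modality similarity scores into one ranking.
    # Modalities, weights and scores below are illustrative assumptions only.
    def fuse_scores(modality_scores, weights):
        """modality_scores: {modality: {segment_id: score in [0, 1]}}
           weights:         {modality: weight}, assumed to sum to 1."""
        fused = {}
        for modality, scores in modality_scores.items():
            w = weights.get(modality, 0.0)
            for segment_id, score in scores.items():
                fused[segment_id] = fused.get(segment_id, 0.0) + w * score
        # Rank segments by fused score, highest first.
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

    # Example: a query combining a face example and a topic example.
    ranking = fuse_scores(
        {"face": {"seg1": 0.9, "seg2": 0.4}, "topic": {"seg1": 0.7, "seg3": 0.8}},
        {"face": 0.5, "topic": 0.5},
    )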

This proposal is planned for three years, involving three institutions (Cambridge, Surrey, Ulster) and one partner (the BBC) who will contribute significant resources (estimated at £128.4k) to the project (see Letter of Support from the BBC).

Planned Impact

The project's objective is to provide scalable next-generation 'search by example' functionality across national video archives. The project will advance the state of the art in video segmentation and in content representation, matching and ranking, and these outputs are intended to have a positive, disruptive impact on multimedia search capability across the media industry nationally and internationally.

The beneficiaries of this project's outputs will include academics, journalists, broadcasters, TV viewers, multimedia companies and organisations hosting and managing large video or multimedia repositories.

Journalists and broadcasters will benefit directly from time savings and the rapid discovery of relevant content when using this new technology. This will in turn enable better, more relevant and more enriched TV programming in less time, delivering economic savings. It will also benefit TV viewers, who will enjoy more relevant TV programmes through the effective repurposing of content within big media archives. As a key partner, the immediate beneficiary will be the BBC, who will likely adopt and integrate the new technology within their workflows to improve the discovery of media content when producing TV programmes. However, the technologies developed are transferable to other broadcasters and indeed to major online companies such as YouTube that rely on semantically enriched search technologies.

Academics will benefit from the dissemination of the project's new research findings and search technologies for rapidly discovering relevant video/multimedia content based on new intelligent algorithms.

The pathways to impact document provides an outline of a series of activities including co-creation workshops and licensing to increase the likelihood of research impact and adoption of the novel, disruptive technologies produced in this project.

Publications

 
Description This project is undertaken by 4 partners -- Queen's University Belfast, Cambridge University, Surrey University, and the BBC. The objectives of the MVSE project are the following:
1. To develop state-of-the-art methods for content-based video segmentation in order to index the video by content at the right time and for the right duration.

This objective has been achieved. We can now segment videos in four or more modalities -- face and scene (Surrey), and speaker and speech (topic and emotion) (Cambridge). Video segments are then represented for subsequent hashing and indexing. Segmentation and representation are now integrated in the project demonstrator.

2. To develop state-of-the-art methods for multimodal, variation (age, lighting, pose and quality) invariant content representation.

This is ongoing and is expected to be completed in 2023. We have designed a deep learning architecture for this, implemented a simplified version of it, and conducted an initial experiment. The initial results are very encouraging, giving us confidence in the architecture. We are now implementing the full architecture.

3. To develop state-of-the-art methods for content ranking based on hash codes and feature vectors for effective video search.

This is largely achieved. We have completed one study on content ranking based on feature vectors from different models, and the results have been published at the BMVC 2022 Workshop. We have recently completed another study on content ranking based on hash codes and feature vectors, achieving state-of-the-art performance. We have written a draft paper and aim to submit it to a high-quality journal.
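For illustration only, the sketch below shows a generic two-stage ranking scheme of the kind described here: a coarse shortlist is produced by Hamming distance on binary hash codes and then re-ranked by cosine similarity on feature vectors. The array shapes, shortlist size and scoring choices are assumptions, not the method reported in the paper.

    import numpy as np

    # Illustrative two-stage ranking: coarse filtering by Hamming distance on binary
    # hash codes, then re-ranking the shortlist by cosine similarity on feature vectors.
    def search(query_code, query_feat, codes, feats, shortlist=100):
        # codes: (N, B) array of 0/1 hash bits; feats: (N, D) float feature vectors.
        hamming = np.count_nonzero(codes != query_code, axis=1)
        candidates = np.argsort(hamming)[:shortlist]            # coarse stage
        q = query_feat / np.linalg.norm(query_feat)
        f = feats[candidates]
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        cosine = f @ q                                          # fine stage
        return candidates[np.argsort(-cosine)]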

4. To develop state-of-the-art methods for incremental content hashing in order to allow the tool to efficiently handle dynamic archives, large or small.

This is achieved. We have designed a state-of-the-art method for incremental content hashing, which is already implemented in the project demonstrator.
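As a rough illustration of what "incremental" means here, the sketch below keeps a fixed hash function and simply hashes and appends newly arrived segments, so existing archive entries never need re-hashing. The random-projection hash is a stand-in assumption and not the method designed in the project.

    import numpy as np

    # Minimal sketch of an incrementally updatable hash index: new video segments are
    # hashed and appended without re-hashing the existing archive.
    class IncrementalHashIndex:
        def __init__(self, dim, bits, seed=0):
            rng = np.random.default_rng(seed)
            self.proj = rng.normal(size=(dim, bits))    # fixed (assumed) hash function
            self.codes = np.empty((0, bits), dtype=np.uint8)
            self.ids = []

        def add(self, segment_ids, features):
            # Hash only the newly arrived segments and append their codes.
            new_codes = (features @ self.proj > 0).astype(np.uint8)
            self.codes = np.vstack([self.codes, new_codes])
            self.ids.extend(segment_ids)

        def query(self, feature, k=10):
            code = (feature @ self.proj > 0).astype(np.uint8)
            hamming = np.count_nonzero(self.codes != code, axis=1)
            return [self.ids[i] for i in np.argsort(hamming)[:k]]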

5. To establish a network of researchers and practitioners in the area of example-based multimedia retrieval.

This is ongoing. We organised a workshop in 2022, associated with BMVC 2022. We are planning to organise a Video Search Challenge associated with a major international conference. A data licensing issue is under discussion with the BBC.
Exploitation Route The BBC is a partner, so they will be the first users. Other broadcasters and libraries are possible users. YouTube is another potential user.

In light of the recent success of ChatGPT, a new opportunity has arisen for the project. We will explore that opportunity in the near future.
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections

URL http://mvse.ares.ecit:8001/
 
Description Multimodal Video Search by Examples (MVSE)
Amount £720,502 (GBP)
Funding ID EP/V002740/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 04/2021 
End 03/2024