Visual Sense. Tagging visual data with semantic descriptions

Lead Research Organisation: University of Surrey
Department Name: Vision, Speech and Signal Processing (CVSSP)

Abstract

Recent years have witnessed an unprecedented growth in the number of image and video collections, partly due to the increased popularity of photo and video sharing websites; one such website alone (Flickr) stores billions of images. Nor is this the only way in which visual content is present on the Web: most web pages contain some form of visual content. However, while traditional tools for search and retrieval can successfully handle textual content, they are not equipped to handle heterogeneous documents that mix text and visual content. This new type of content demands new, efficient tools for search and retrieval.


The large number of readily accessible multimedia data collections poses both an opportunity and a challenge. The opportunity lies in the potential to mine this data to automatically discover mappings between visual and textual content. The challenge is to develop tools to classify, filter, browse and search such heterogeneous data. In brief, the data is available, but the tools to make sense of it are missing.

The Visual Sense project aims to automatically mine the semantic content of visual data to enable "machine reading" of images. In recent years, we have witnessed significant advances in the automatic recognition of visual concepts. These advances have allowed the creation of systems that can automatically generate keyword-based image annotations. However, such annotations, e.g. "man" and "pot", fall far short of the more meaningful descriptive captions needed for indexing and retrieval of images, for example "Man cooking in kitchen". The goal of this project is to move a step forward and predict semantic image representations that can be used to generate more informative sentence-based image annotations, thus facilitating search and browsing of large multi-modal collections. It will address the following key open research challenges:

1) Develop methods that can derive a semantic representation of visual content. Such representations must go beyond the detection of objects and scenes and also include a wide range of object relations.
2) Extend state-of-the-art natural language techniques to the tasks of mining large collections of multi-modal documents and generating image captions using both semantic representations of visual content and object/scene type models derived from semantic representations of the textual component of multi-modal documents.
3) Develop learning algorithms that can exploit available multi-modal data to discover mappings between visual and textual content. These algorithms should be able to leverage 'weakly' annotated data and be robust to large amounts of noise (a minimal illustrative sketch of such a cross-modal mapping follows this list).
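
As a concrete illustration of challenge 3, the sketch below shows one standard way to learn a joint visual-textual embedding from weakly paired image/caption features. It is not the project's actual method; the use of PyTorch, the feature dimensions, the two linear projections and the hinge ranking loss are all assumptions chosen for brevity.

```python
# Minimal sketch (assumptions, not the project's method): learn a joint
# visual-textual embedding from weakly paired image/caption feature vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects image and text features into a shared semantic space."""
    def __init__(self, img_dim=4096, txt_dim=300, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feats, txt_feats):
        # L2-normalise so that dot products are cosine similarities.
        v = F.normalize(self.img_proj(img_feats), dim=1)
        t = F.normalize(self.txt_proj(txt_feats), dim=1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    # Hinge ranking loss: a true image/caption pair should outscore the
    # mismatched pairs in the batch by at least the margin.
    scores = v @ t.t()                      # pairwise cosine similarities
    pos = scores.diag().unsqueeze(1)        # similarity of the true pairs
    cost = (margin + scores - pos).clamp(min=0)
    mask = 1.0 - torch.eye(scores.size(0))  # ignore the diagonal (true pairs)
    return (cost * mask).mean()

# Toy usage with random "weakly annotated" features.
model = JointEmbedding()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
img = torch.randn(32, 4096)   # e.g. pre-extracted CNN image features
txt = torch.randn(32, 300)    # e.g. averaged word embeddings of a caption
v, t = model(img, txt)
loss = ranking_loss(v, t)
loss.backward()
optimiser.step()
```

Because the loss only requires matching pairs to outscore mismatched pairs within a batch, it is comparatively tolerant of the occasional mismatched (noisy) pair found in weakly annotated data.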

Thus, the main focus of the Visual Sense project is the development of machine learning methods for knowledge and information extraction from large collections of visual and textual content and for the fusion of this information across modalities. The tools and techniques developed in this project will have a variety of applications. To demonstrate them, we will address three case studies: 1) evaluation of generated descriptive image captions in established international image annotation benchmarks, 2) re-ranking for improved image search and 3) automatic illustration of articles with images.

To address these broad challenges, the project will build on expertise from multiple disciplines, including computer vision, machine learning and natural language processing (NLP). It brings together four research groups from the University of Surrey (Surrey, UK), Institut de Robotica i Informatica Industrial (IRI, Spain), Ecole Centrale de Lyon (ECL, France), and the University of Sheffield (Sheffield, UK), each with well-established and complementary expertise in their respective areas of research.

Planned Impact

Innovations in rich annotation of visual data directly impact the European ICT, Digital Media and Creative industries, as well as having broader societal impact for any individual seeking to make sense of the huge amount of valuable, yet unstructured, image and video content published online. Specifically, the project will:

Reinforce the position of European ICT and Digital Media research, widening market opportunities especially for technology-providing SMEs that consume, produce or need to search or aggregate visual data.
Stimulate greater creativity through technologies and tools to search professional and user-generated digital media content.
Provide digital media/service search engines with innovative offers for interactive and personalised digital media.
Enhance opportunities for education through illustrative images and videos complementing text descriptions.
This project's technologies focus on visual data search, and metadata is the key to improving image and video search capability. The development of metadata and associated tools is in its infancy, and here lies an opportunity within both the ICT and Digital Media sectors. Although metadata frequently accompanies image and video sources, it is often patchy and limited to technical data on media capture rather than on the content itself. When editorial metadata is present, it is often interest-specific and in a non-standardised form. As NEM notes in its 2009 Strategic Research Agenda (NEM-SRA), "without metadata [visual data] content is almost valueless". Our core innovations include the automated annotation of visual data to enrich source-supplied metadata with additional metadata derived automatically from visual and text content, so enabling search and aggregation. The NEM-SRA identifies "automatic video indexing" as one of five key promising topics for further research, another being the fusion of visual, aural and metadata in multimedia search. These topics are explicitly addressed by WP2-WP5. Not only will enhanced techniques for semantic metadata annotation add value to Digital Media, but visual recognition also benefits other areas of ICT, e.g. surveillance, robotics, sports and medical analysis, automated manufacturing, and assisted living.

ViSen is timely in that the opportunities to exploit its outcomes are just beginning to appear. Professional customers are starting to use textual metadata tools for searching archives and managing stored content; they are already acutely aware of the need for similar and more advanced tools for live content. Consumers are becoming accustomed to searching social network sites for visual content. Other professional users have expressed interest in the ability to search live feeds for specified content or contexts. Market conditions are favourable and will be actively monitored during the final year of ViSen through a living website reflecting routes to market and exploitation opportunities.
Project partners are in close collaboration with the BBC and SMEs, e.g. Omniperception UK, which offers multimedia search tools. We will also investigate commercial exploitation possibilities arising from ViSen with the assistance of the University's Commercialisation of IP Team and via its agreement with Fusion IP plc (http://www.fusionip.co.uk), whose mission is to seek commercial avenues to exploit IP generated in research projects by the University of Sheffield.
 
Description The Visual Sense project aims to automatically mine the semantic content of visual data to enable "machine reading" of images. In recent years, we have witnessed significant advances in the automatic recognition of visual concepts. These advances have allowed the creation of systems that can automatically generate keyword-based image annotations. The goal of this project is to move a step forward and predict semantic image representations that can be used to generate more informative sentence-based image annotations, thus facilitating search and browsing of large multi-modal collections. More specifically, the project targets three case studies, namely image annotation, re-ranking for image search, and automatic image illustration of articles. It addresses the following key open research challenges:

1. It develops methods that can predict a semantic representation of visual content. This representation goes beyond the detection of objects and scenes and also recognizes a wide range of object relations.

2. It extends state-of-the-art natural language techniques to the tasks of mining large collections of multi-modal documents and generating image captions using both semantic representations of visual content and object/scene type models derived from semantic representations of the multi-modal documents.

3. It develops learning algorithms that can exploit available multi-modal data to discover mappings between visual and textual content. These algorithms should be able to leverage 'weakly' annotated data and be robust to large amounts of noise.
Exploitation Route Image Annotation Benchmarks can be used to evaluate approaches to the annotation and retrieval of visual documents. This work provides quantitative results in comparison to currently available methods.

Re-ranking for Image Search is a system that can improve image retrieval quality on complex queries such as 'a man cooking pasta'. The search engine generates a rich semantic representation of each database image and uses it to re-rank the output of a baseline image search engine.
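
By way of illustration only (the function names and the linear score combination are assumptions, not the project's implementation), such a re-ranker might combine the baseline retrieval score with how well an image's predicted semantic triplets cover the parsed query:

```python
# Illustrative sketch: re-rank a baseline search engine's results using
# predicted semantic triplets (subject, relation, object) for each image.
def rerank(query_triplets, baseline_results, image_triplets, alpha=0.5):
    """baseline_results: list of (image_id, baseline_score), best first.
    image_triplets: dict mapping image_id -> set of predicted triplets."""
    def semantic_score(image_id):
        preds = image_triplets.get(image_id, set())
        if not query_triplets:
            return 0.0
        # Fraction of the query's triplets matched by the image's predictions.
        return len(query_triplets & preds) / len(query_triplets)

    rescored = [
        (img, alpha * base + (1 - alpha) * semantic_score(img))
        for img, base in baseline_results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Example: the query "a man cooking pasta" parsed into semantic triplets.
query = {("man", "cooking", "pasta"), ("man", "in", "kitchen")}
baseline = [("img1", 0.9), ("img2", 0.8), ("img3", 0.7)]
predicted = {
    "img1": {("man", "holding", "guitar")},
    "img2": {("man", "cooking", "pasta"), ("man", "in", "kitchen")},
    "img3": {("woman", "eating", "pasta")},
}
print(rerank(query, baseline, predicted))  # img2 moves to the top
```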

Illustrating Articles involves creating a system that can take an article or a blog entry and find images that are likely to be good illustrations for it. This requires generating useful textual and visual content representations and mapping between them.
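
Assuming articles and candidate images have already been mapped into a shared embedding space (for instance by a model like the one sketched alongside the abstract above), illustration reduces to nearest-neighbour search. The following minimal sketch uses cosine similarity and is illustrative rather than the project's implementation.

```python
# Minimal sketch: pick the images whose embeddings lie closest to an
# article's embedding in a shared visual-textual space.
import numpy as np

def illustrate(article_vec, image_vecs, image_ids, k=3):
    """Return the k image ids whose embeddings have the highest cosine
    similarity to the article embedding."""
    a = article_vec / np.linalg.norm(article_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ a                      # cosine similarity to the article
    top = np.argsort(-sims)[:k]          # indices of the k best matches
    return [(image_ids[i], float(sims[i])) for i in top]

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
article = rng.normal(size=256)
images = rng.normal(size=(100, 256))
ids = [f"img_{i}" for i in range(100)]
print(illustrate(article, images, ids))
```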

These techniques broaden market opportunities, especially for technology-providing SMEs that consume, produce or need to search or aggregate visual data.
They can stimulate greater creativity through technologies and tools to search professional and user-generated digital media content.
They can provide digital media/service search engines with innovative offers for interactive and personalised digital media.
They can enhance opportunities for education through illustrative images and videos complementing text descriptions.
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Environment; Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Retail; Transport

URL https://sites.google.com/site/visenproject/
 
Description Our findings, software and datasets have been used in the scientific community and may also have impact on the creative industries.
Sector Creative Economy; Digital/Communication/Information Technologies (including Software); Education
Impact Types Economic

 
Description Interactive Perception-Action-Learning for Modelling Objects
Amount £397,394 (GBP)
Funding ID EP/S032398/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 03/2019 
End 03/2022
 
Title BreakingNews: Article Annotation by Image and Text Processing 
Description BreakingNews is a novel dataset of approximately 100K news articles including images, text and captions, enriched with heterogeneous metadata (such as GPS coordinates and user comments). We show this dataset to be well suited for exploring tasks at the intersection of computer vision and natural language processing, such as automatic captioning and image retrieval, where recent work has achieved unprecedented breakthroughs.
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact This is the largest existing database for analysing visual and natural language content in the form of news articles rather than short captions. The database has already been used and referenced by other research labs.
URL http://www.iri.upc.edu/people/aramisa/BreakingNews/
 
Title ImageCLEF 
Description ImageCLEF 2015, 2016 benchmark data for image annotation tasks. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Used by many internationally recognised research labs for evaluation of their computer vision approaches. 
URL http://www.imageclef.org/2015/annotation
 
Description ECL 
Organisation École Centrale de Lyon (ECL)
Country France 
Sector Academic/University 
PI Contribution Development of machine learning methods for image annotations
Collaborator Contribution Development of image feature extractors and object detectors for image annotation and retrieval.
Impact Software tools for image annotation and the ImageCLEF benchmark dataset.
Start Year 2013
 
Description IRI 
Organisation Polytechnic University of Catalonia
Country Spain 
Sector Academic/University 
PI Contribution Development of new methods for text and image analysis in news articles.
Collaborator Contribution Collection of a large news dataset for development and evaluation of computer vision and natural language processing methods.
Impact Publications, datasets, evaluation benchmarks.
Start Year 2013
 
Description Sheffield 
Organisation University of Sheffield
Country United Kingdom 
Sector Academic/University 
PI Contribution Development of image annotation methods for vision and language
Collaborator Contribution Development of natural language processing methods for generating natural image captions.
Impact Software and datasets such as Deep Canonical Correlation, ImageCLEF benchmark.
Start Year 2013
 
Title BOLD 
Description Low-level image descriptors for large-scale matching and retrieval
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Has been used by other researchers in matching and retrieval applications
URL https://github.com/vbalnt/bold