VisualSense

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.
 
Description The following are the key findings of the award from our perspective:
1. A proposal for how to better evaluate system-generated image descriptions in comparison with human-authored gold standard descriptions. This addresses the problem of how to score system descriptions which are at a different level of granularity from those in a human gold standard: for example, when a human suggests "dog" as an image description, a system that says "poodle" is better than one that says "horse". Having meaningful evaluation measures is important for assessing whether our systems are getting better or worse, and is particularly important when these measures become objective functions that learning methods try to optimise. See: J. Wang, F. Yan, A. Aker and R. Gaizauskas (2014). "A Poodle or a Dog? Evaluating Automatic Image Annotation Using Human Descriptions at Different Levels of Granularity". In Proceedings of the 3rd Workshop on Vision and Language (VL'14).
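As a rough illustration of the kind of granularity-aware scoring this finding motivates, the sketch below compares a system label with a gold label using WordNet similarity, so that "poodle" scores higher than "horse" against the gold label "dog". The use of NLTK and Wu-Palmer similarity here is an assumption for illustration only; the measure proposed in the paper may differ.

    # Granularity-aware label scoring via WordNet (illustrative assumption,
    # not necessarily the measure used in the VL'14 paper).
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)   # one-off corpus download
    nltk.download("omw-1.4", quiet=True)

    def label_similarity(system_label, gold_label):
        # Best Wu-Palmer similarity over all noun senses of the two labels.
        sys_senses = wn.synsets(system_label, pos=wn.NOUN)
        gold_senses = wn.synsets(gold_label, pos=wn.NOUN)
        scores = [s.wup_similarity(g) or 0.0 for s in sys_senses for g in gold_senses]
        return max(scores, default=0.0)

    print(label_similarity("poodle", "dog"))   # high: a poodle is a kind of dog
    print(label_similarity("horse", "dog"))    # lower: related only via higher-level classes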
2. A proposal for a specific subtask, which we called "content selection", as part of the process of generating image descriptions. Content selection is the choice of which elements of an image, as recognised by an object detection system, should be included in a natural language description of the image. We defined the task, developed baseline and state-of-the-art systems to carry out the task, and proposed and ran an international shared task evaluation challenge around it. Addressing this task is an important part of understanding which elements in an image humans choose to talk about and of designing systems to do this automatically (a toy ranking sketch follows the references below). See:
o J. Wang and R. Gaizauskas (2015). "Generating Image Descriptions with Gold Standard Visual Inputs: Motivation, Evaluation and Baselines". In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), 117--126.
o A. Gilbert, L. Piras, J. Wang, F. Yan, E. Dellandrea, R. Gaizauskas, M. Villegas, K. Mikolajczyk (2015). "Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task". In CLEF2015 Working Notes.
o Josiah Wang and Robert Gaizauskas (2016). "Don't Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation". In Proceedings of the 9th International Natural Language Generation Conference (INLG16).
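As a rough sketch of the content selection idea above, the toy example below ranks objects detected in an image by a hand-set salience score (detector confidence, relative box size, centrality) and keeps the top few for mention in a description. The features and weights are illustrative assumptions, not the learning-to-rank model described in the papers.

    # Toy content selection: rank detections and keep the top-k for mention.
    from dataclasses import dataclass

    @dataclass
    class Detection:
        label: str
        confidence: float    # detector score in [0, 1]
        box: tuple           # (x, y, w, h) in pixels

    def salience(d, img_w, img_h):
        x, y, w, h = d.box
        area = (w * h) / (img_w * img_h)                  # relative size
        cx, cy = x + w / 2, y + h / 2
        centrality = 1.0 - (abs(cx - img_w / 2) / img_w + abs(cy - img_h / 2) / img_h)
        return 0.5 * d.confidence + 0.3 * area + 0.2 * centrality  # hand-set weights

    def select_content(detections, img_w, img_h, k=3):
        ranked = sorted(detections, key=lambda d: salience(d, img_w, img_h), reverse=True)
        return [d.label for d in ranked[:k]]

    dets = [Detection("dog", 0.95, (100, 200, 300, 250)),
            Detection("shoe", 0.80, (20, 450, 40, 30)),
            Detection("frisbee", 0.70, (350, 150, 80, 80))]
    print(select_content(dets, 640, 480, k=2))   # the small, peripheral shoe is dropped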
3. A proposal for how to define visually descriptive language and a set of annotation guidelines based on this definition to allow humans to annotate examples of visually descriptive language. This is important because, to date, the language used to train automatic image description software has largely been derived from existing image captions, which are a limited and noisy resource. Our scheme allows visually descriptive language to be annotated in any sort of text and could be used to train classifiers to recognise visually descriptive language in arbitrary text, allowing it to be harvested for training the next generation of image description software. See: R. Gaizauskas, J. Wang and A. Ramisa (2015). "Defining Visually Descriptive Language". In Proceedings of the Fourth Workshop on Vision and Language, 10--17.
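As a minimal sketch of the kind of classifier such annotations could support, the example below trains a simple text classifier to distinguish visually descriptive from non-descriptive sentences. The toy training sentences and the bag-of-words model are illustrative assumptions only.

    # Toy classifier for visually descriptive language, assuming sentence-level labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "A red kite drifted over the white cliffs.",               # visually descriptive
        "Two children in yellow raincoats splashed in puddles.",   # visually descriptive
        "The committee postponed its decision until June.",        # not descriptive
        "She had always believed in working hard.",                # not descriptive
    ]
    train_labels = [1, 1, 0, 0]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(train_sentences, train_labels)

    # Classify a new sentence (with such tiny toy data the output is only illustrative).
    print(clf.predict(["A tall man in a blue coat stood by the gate."]))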
4. A method for using knowledge about visual and semantic similarity between object classes to improve the adaptation of image classifiers into image detectors, allowing better object detection with deep learning techniques. See: Y. Tang, J. Wang, B. Gao, E. Dellandrea, R. Gaizauskas and L. Chen (2016). "Large Scale Semi-supervised Object Detection using Visual and Semantic Knowledge Transfer". In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition (CVPR 2016); and Y. Tang, J. Wang, X. Wang, B. Gao, E. Dellandrea, R. Gaizauskas and L. Chen (2017). "Visual and Semantic Knowledge Transfer for Large Scale Semi-supervised Object Detection". IEEE Transactions on Pattern Analysis and Machine Intelligence.
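The schematic numpy sketch below illustrates the knowledge-transfer idea: a class that has only an image classifier borrows the classifier-to-detector adaptation of its most similar classes that do have detectors, weighted by a combined visual/semantic similarity. The class names, dimensions and similarity scores are made-up assumptions; the published method is a deep network trained end to end.

    # Schematic classifier-to-detector knowledge transfer (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    D = 64                                    # feature dimension (illustrative)

    strong_classes = ["dog", "cat", "horse"]  # classes with classifier and detector weights
    W_cls = {c: rng.normal(size=D) for c in strong_classes + ["poodle"]}
    W_det = {c: W_cls[c] + rng.normal(scale=0.1, size=D) for c in strong_classes}

    # Similarity of the weak class "poodle" to each strong class, e.g. derived from
    # word embeddings and visual feature similarity (made-up numbers here).
    sim = np.array([0.9, 0.4, 0.2])          # vs. dog, cat, horse
    sim = sim / sim.sum()

    # Transfer: adapt the poodle classifier by the similarity-weighted average of
    # the (detector - classifier) offsets of the strong classes.
    offset = sum(s * (W_det[c] - W_cls[c]) for s, c in zip(sim, strong_classes))
    W_det["poodle"] = W_cls["poodle"] + offset
    print(W_det["poodle"].shape)             # (64,)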
5. A proposal for the task of predicting the preposition that best expresses the relation between a subject and an object ("man on horse"). This is important for the subsequent task of automatically generating textual descriptions for an image. We constructed a dataset containing instances of (subject, preposition, object) constructions extracted from human-authored image descriptions, thus allowing us to explore real-world usage of prepositions. We proposed methods to tackle the preposition prediction task given two objects, combining different cues -- text, image and the geometric arrangement between objects. We found that all three cues play a role in preposition prediction. See: A. Ramisa, J. Wang, Y. Lu, E. Dellandrea, F. Moreno-Noguer and R. Gaizauskas (2015). "Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions". In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).
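The sketch below illustrates the geometric cue mentioned above: simple features describing how the subject's bounding box relates to the object's (centre offset, relative size, overlap). The feature choices are illustrative assumptions; in the published work such geometric information is combined with textual and visual features in a trained classifier.

    # Toy geometric features between a subject box and an object box.
    def geometric_features(subj_box, obj_box):
        # Boxes are (x, y, w, h); returns a small geometric feature vector.
        sx, sy, sw, sh = subj_box
        ox, oy, ow, oh = obj_box
        # normalised offset between box centres
        dx = ((sx + sw / 2) - (ox + ow / 2)) / ow
        dy = ((sy + sh / 2) - (oy + oh / 2)) / oh
        # relative scale
        area_ratio = (sw * sh) / (ow * oh)
        # intersection-over-union
        ix = max(0, min(sx + sw, ox + ow) - max(sx, ox))
        iy = max(0, min(sy + sh, oy + oh) - max(sy, oy))
        inter = ix * iy
        union = sw * sh + ow * oh - inter
        iou = inter / union if union else 0.0
        return [dx, dy, area_ratio, iou]

    # "man on horse": the man's box sits above and overlaps the horse's box,
    # which shows up as a negative vertical offset and a non-zero IoU.
    print(geometric_features((120, 40, 60, 120), (90, 120, 160, 140)))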
Exploitation Route Our methods for evaluating system-generated image descriptions in comparison with human-authored gold standard descriptions could be used or further developed by anyone wanting to improve the quality of automatic image description (and, in turn, web-based image retrieval).
Our work on the content selection and preposition prediction tasks could likewise be taken up by anyone wanting to improve the quality of automatic image description, especially those interested in developing compositional approaches to image description generation.
As indicated above, building classifiers to recognise visually descriptive language, as we have defined it, could increase the volume of data available to train/inform image description generators by several orders of magnitude.
Our findings on how to use knowledge of object similarities to improve the adaptation of image classifiers into image detectors can be used by anyone wishing to build better object detectors.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Education; Culture, Heritage, Museums and Collections; Security and Diplomacy

URL https://sites.google.com/site/visenproject/home/publications
 
Title ImageCLEF2015 
Description We contributed to the ImageCLEF2015 Scalable Concept Image Annotation Challenge dataset. Our contribution was (1) to select concepts to be identified within images by challenge participants, (2) to annotate 5000 images at the image level with the concepts from (1), (3) to annotate 5000 images with full sentence descriptions of image content, and (4) to annotate ~1000 images with correspondences between bounding boxes around instances of concept categories in the image and terms in the associated full sentence descriptions from (3). 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact The dataset was used by 11 participants from 7 countries who participated in the ImageCLEF2015 Scalable Concept Image Annotation Challenge. Its design is being re-used in the ImageCLEF 2016 international challenge, as is an extended version of the data. 
URL http://www.imageclef.org/2015/annotation
 
Title ImageCLEF2016 
Description As a continuation of the ImageCLEF2015 challenge (http://doi.org/10.5281/zenodo.1038546), we also contributed to the ImageCLEF2016 Scalable Concept Image Annotation Challenge dataset (http://imageclef.org/2016/annotation). The dataset from the 2015 edition was extended to cover a new "teaser task" for illustrating text in news articles. The new tasks aim at further stimulating and encouraging multimodal research that uses both text and visual data for image annotation and retrieval. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact The dataset was used by 7 participants from 5 countries who participated in the ImageCLEF2016 Scalable Concept Image Annotation Challenge. 
URL https://zenodo.org/record/1038554
 
Title Visen Prepositions Dataset 
Description The Visen Prepositions Dataset is a dataset of (entity1, preposition, entity2) instances together with the bounding box localisations of entity1 and entity2 in the corresponding picture (e.g. "boy on sled" and the bounding boxes for "boy" and "sled" in the image being described). This dataset has two main appeals: (i) it is extracted from two large-scale image datasets with human-authored descriptions, with a reasonable amount of noise as extraction was performed automatically; (ii) the prepositions are based on real-world usage as used by humans in image descriptions, making it attractive for exploring prepositional usage specifically in image descriptions. The dataset is used in Ramisa et al. (2015) and is publicly available at http://preposition.github.io/. See: Arnau Ramisa*, Josiah Wang*, Ying Lu, Emmanuel Dellandrea, Francesc Moreno-Noguer and Robert Gaizauskas (2015). "Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions". In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact This database will advance research in natural language processing and computer vision by bridging the gap between the text and image modalities, based on how humans use prepositions in an image description in the context of two entities and how these entities are actually depicted in a picture. 
URL http://preposition.github.io/