Learning the Memorable Information from Images

Lead Research Organisation: University of York
Department Name: Computer Science

Abstract

Memories are very important to people, and what we can memorise from images defines information that is specifically important to humans. During a memorisation experiment, people were shown images from a database and were then asked to indicate what, if anything, they remembered from those images. A detailed explanation of the memorisation experiment and the resulting dataset, VISCHEMA, is provided at: https://www.cs.york.ac.uk/. The resulting dataset defines the areas of an image that cause that image to be remembered or falsely remembered. Together, these areas are known as the 'visual memory schema' of the image, which is related to the idea of a psychological schema: that is, certain learned arrangements in images affect memorability.

This PhD project will use a combination of deep learning techniques, namely Convolutional Neural Networks, Variational Autoencoders, and Generative Adversarial Networks, to further explore the idea of a 'visual schema'. The purpose of utilising these machine learning techniques is to expose the potential hidden structure beneath these visual schemas, and to improve the generation of visual schemas for images that do not have a human-defined baseline. Such generation is currently restricted to drastically reduced-resolution visual schema maps produced by fully convolutional networks. Learning to generate visual schemas with a generative model is important: the model should learn the underlying features that align with the human-held schemas that cause an image to be remembered or falsely remembered. This has clear commercial and educational applications, such as creating advertisements that match a particularly memorable schema.

The initial research will involve generating more baseline visual schemas from psychological experiments, following the previous VISCHEMA methodology, in order to improve the effectiveness of machine learning models trained on this data. This data will be used to train a variational autoencoder that has already been pre-trained on scene datasets (transfer learning, intended to mitigate the risk of overfitting on the relatively small VISCHEMA dataset). This generative model will be constrained to learn to generate visual schema maps using the VISCHEMA data. Various encoding/decoding architectures will be explored. The output of this model will be compared to the visual memory schemas obtained from human participants to determine how well the model has learned the features that describe memorable regions. The learned latent space may lead to a better understanding of visual schemas beyond the memorability maps obtained by the VISCHEMA experiment.
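
As a rough illustration of the training and comparison steps described above, the following Python sketch shows a standard variational autoencoder objective (squared reconstruction error plus a KL term) and a Pearson correlation between a generated schema map and a human-derived one. The project text does not state which loss or comparison metric is used, so both choices, along with the function names and the `beta` parameter, are assumptions made for illustration only.

```python
import math

def flatten(schema_map):
    """Turn a 2D map (list of rows) into a flat list of values."""
    return [value for row in schema_map for value in row]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, exp(log_var)) to N(0, 1), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def vae_loss(target_map, reconstructed_map, mu, log_var, beta=1.0):
    """Assumed objective: squared reconstruction error plus beta-weighted KL term."""
    target = flatten(target_map)
    recon = flatten(reconstructed_map)
    reconstruction_error = sum((t - r) ** 2 for t, r in zip(target, recon))
    return reconstruction_error + beta * kl_to_standard_normal(mu, log_var)

def pearson_correlation(predicted_map, human_map):
    """Assumed comparison metric: Pearson correlation between two 2D maps."""
    a, b = flatten(predicted_map), flatten(human_map)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant map; report 0 here
    return covariance / math.sqrt(var_a * var_b)

# Toy 2x2 maps standing in for real schema maps (hypothetical values).
human = [[0.0, 0.2], [0.8, 1.0]]
inverted = [[1.0, 0.8], [0.2, 0.0]]
print(pearson_correlation(human, human))     # close to 1 for identical maps
print(pearson_correlation(human, inverted))  # close to -1 for an inverted map
```

A correlation near 1 would indicate that the generated map highlights the same regions as the human baseline; in practice several complementary metrics would likely be reported.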

In later stages, the machine learning models which generate visual schemas will be expanded to generate memorable images instead, using more complex architectures such as generative adversarial networks. Sufficiently complex latent encodings may capture the features which correspond to memorable regions, and hence can be used to generate new, realistic images for a given class which contain these regions. Additional psychological experiments will be run to determine whether generated visual schemas agree with human baselines, and whether generated memorable images truly are more memorable to human observers.

Description

Image memorability prediction aims to develop computational algorithms capable of determining how memorable any given image might be to the average person. Understanding memorability has far-reaching applications across commercial, educational and medical interests. However, most research on image memorability treats this property as a single score assigned to a given image. Recently, datasets driven by human data have arisen which capture memorability as a two-dimensional property across the image, rather than as a single "score". In our work, we seek to understand two-dimensional image memorability via neural models, and to develop techniques to generate images whose memorability we can define and manipulate. These two-dimensional image memorability maps are known as "Visual Memory Schema" (VMS) maps.

Our main outcomes are:

1.) We significantly enhanced the availability of VMS-based datasets. Initially, only a dataset of 800 image/VMS map pairs existed, which made it difficult to determine whether any algorithm based on this dataset was constrained by the limits of the algorithm or by the limits of the dataset. We extended this existing dataset by a further 800 images and VMS maps by following the original VMS experiment paradigm, then increased the available data further via a crowdsourced repeat-recognition experiment, to a total of 4,261 image/VMS map pairs.

2.) We employ these datasets to make advances in the automatic prediction of VMS maps for a wide array of scene images. These advances have been made possible via the application and development of several different classes of deep neural network. We initially focused on retasking Variational Autoencoders for the purposes of VMS map reconstruction, surpassing prior work in this field. Extending this, we develop and analyse a wide array of neural techniques and their application to VMS map prediction, with results superior to our own prior work. Our two-dimensional memorability models employ multi-scale information, depth information, and self-attention, and we find that the non-local detection of features granted by self-attention modules improves memorability map prediction. Finally, we investigate a novel method for combining single-score datasets with two-dimensional datasets to reach even higher levels of VMS map prediction accuracy.
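
To make the notion of non-local feature detection more concrete, the sketch below shows plain scaled dot-product self-attention over a list of spatial positions, written with the Python standard library only. It is a simplified illustration, not the architecture used in this work: the learned query, key and value projections found in real self-attention modules are replaced by the identity (an assumption made to keep the example short), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Scaled dot-product self-attention with identity projections (assumption).

    `features` is a list of positions, each a list of channel values, e.g. a
    flattened H*W grid of feature vectors. Every output position is a weighted
    average of ALL input positions, which is what makes the operation non-local.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(dim)
        ])
    return outputs

# Two positions with different features: each output mixes information from both.
mixed = self_attention([[1.0, 0.0], [0.0, 1.0]])
print(mixed)
```

Because each output position attends to every other position, a feature in one corner of the image can influence the predicted memorability of a region elsewhere, which a purely local convolution cannot do in a single layer.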

3.) We combine VMS maps with certain generative models, and attempt to synthesise brand new scene images whose memorability we can control through modulation of an input "target" VMS map. By utilising our previous predictor models combined with state-of-the-art generative adversarial networks, we attempt to evaluate the memorability of generated images, and force the network to generate scene images that employ specific visual memory schemas. We evaluate our generated images with human observers via a psychological experiment, and find that images generated to be more memorable appear to be rated as such by participants. We also note that a certain image quality is necessary for all generated images; poor-quality memorability-modulated generated images do not cause the same human-perceived difference in memorability, due to the lack of clear semantic features in the images.
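
One way such a constraint could be expressed, offered here only as a hedged sketch, is to add a schema term to the generator's loss: a predictor model estimates the VMS map of the generated image, and the generator is penalised when that estimate departs from the target map. The exact loss used in this work is not given in the text, so the non-saturating adversarial term, the mean squared error schema term, the `schema_weight` parameter and all function names below are assumptions.

```python
import math

def mean_squared_error(predicted_map, target_map):
    """Mean squared difference between two 2D maps of equal shape."""
    pairs = [
        (p, t)
        for p_row, t_row in zip(predicted_map, target_map)
        for p, t in zip(p_row, t_row)
    ]
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

def generator_loss(discriminator_score, predicted_vms, target_vms, schema_weight=10.0):
    """Assumed combined loss for one generated image.

    discriminator_score: probability in (0, 1] that the image is real.
    predicted_vms: VMS map a predictor model assigns to the generated image.
    target_vms: the input "target" VMS map the image should realise.
    """
    adversarial_term = -math.log(discriminator_score)  # non-saturating GAN loss
    schema_term = mean_squared_error(predicted_vms, target_vms)
    return adversarial_term + schema_weight * schema_term

# Hypothetical 2x2 maps: the second prediction matches the target more closely.
target = [[0.0, 1.0], [0.0, 1.0]]
far_prediction = [[1.0, 0.0], [1.0, 0.0]]
near_prediction = [[0.1, 0.9], [0.0, 1.0]]
print(generator_loss(0.5, far_prediction, target))
print(generator_loss(0.5, near_prediction, target))
```

With a fixed discriminator score, the loss falls as the predicted map approaches the target, so gradient-based training would push the generator towards images that realise the requested schema.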

Exploitation Route

The most straightforward way in which others could benefit from this outcome is by performing further analysis on the dataset that we have developed here. We structure our results in such a fashion as to invite easy comparison with other methods, and provide a wide array of metrics should others wish to build on our work with memorability predictive models.

Future work may also focus on the generative models; we are limited by available computational power and time, and hence have a hard limit on the resolution at which we can generate our memorable scene images. Academically, others may be interested in enhancing the output resolution of our models, in extending the categories of images that the model is capable of generating, and in investigating different methods for applying a VMS-based constraint to a neural network.

There are commercial applications of a refined variant of this technique; some may be interested in generating memorable backdrops for advertisements, or in other marketing-related applications. Medical researchers may be interested in clinical applications of VMS-based memorability; questions remain on how neurodegenerative conditions affect the two-dimensional memorability of scene images, and answering them could lead to methods to track disease-related or age-related cognitive decline.

Sectors

Digital/Communication/Information Technologies (including Software), Healthcare, Other