Cross-modal egocentric activity recognition and zero-shot learning

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

The availability of low-cost wearable cameras has renewed interest in first-person human activity analysis. Recognising first-person activities poses important challenges, such as rapid changes in illumination, significant camera motion, and complex hand-object manipulations. In recent years, advances in deep learning have significantly influenced the computer vision community, as convolutional networks have produced impressive results in tasks such as object recognition and detection, scene understanding, and image segmentation. Convolutional networks have also been used successfully in first-person activity recognition. Before the emergence of deep learning, the first-person computer vision community focused on engineering egocentric features that capture properties of the first-person point of view, such as hand-object interactions and gaze. Convolutional networks allow such features to be learnt automatically from large amounts of data, eliminating the need for hand-designed features.

In this work, we focus on activity recognition with convolutional networks. Motivated by the recent success of multi-stream architectures, we investigate their applicability to egocentric videos by employing multiple modalities to train the models. An important observation that motivates us is that humans combine their senses, for example hearing and vision, to understand concepts of the world. To this end, we propose employing both video and audio for more accurate activity recognition. Specifically, we will investigate how shared, aligned representations can be learnt using the multi-stream paradigm. Moreover, we are interested in temporal feature pooling methods that leverage information spanning the whole video, as in many cases the whole video must be observed in order to discriminate between similar actions. Our final goal is to employ these ideas in zero-shot learning, that is, the ability to solve a task without having received any training examples of that task; an example is to recognize activities without having seen any video of these activities during training. This can be done by combining the knowledge of classifiers trained on other classes (not the ones to be predicted in the zero-shot setting) with additional knowledge about the new classes.
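To make the multi-stream, audio-visual idea above concrete, the following is a minimal PyTorch sketch. Everything in it (the toy backbones, feature dimensions, fusion by concatenation, and average pooling over time) is an illustrative assumption rather than the project's actual architecture.

```python
# Minimal two-stream, audio-visual sketch with temporal pooling over the whole
# video. Backbones, feature sizes and the fusion strategy are placeholders.
import torch
import torch.nn as nn


class AudioVisualNet(nn.Module):
    """Per-clip visual and audio features are fused, pooled over time,
    and classified into activity classes."""

    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        # Visual stream: a toy CNN over RGB frames (stand-in for a deeper backbone).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Audio stream: a toy CNN over single-channel log-mel spectrograms.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Classifier on the fused (concatenated) representation.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, frames: torch.Tensor, spectrograms: torch.Tensor) -> torch.Tensor:
        # frames:       (batch, time, 3, H, W)  sampled RGB frames
        # spectrograms: (batch, time, 1, F, T)  per-clip audio spectrograms
        b, t = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).view(b, t, -1)       # (b, t, feat)
        a = self.audio(spectrograms.flatten(0, 1)).view(b, t, -1)  # (b, t, feat)
        fused = torch.cat([v, a], dim=-1)                          # (b, t, 2*feat)
        pooled = fused.mean(dim=1)  # temporal average pooling over the whole video
        return self.classifier(pooled)


if __name__ == "__main__":
    model = AudioVisualNet(num_classes=10)
    frames = torch.randn(2, 8, 3, 64, 64)        # 2 videos, 8 clips each
    spectrograms = torch.randn(2, 8, 1, 64, 64)
    print(model(frames, spectrograms).shape)     # torch.Size([2, 10])
```

A second short sketch illustrates the zero-shot step described above: a representation learnt on seen classes is projected into a semantic space and compared against embeddings of unseen classes. The embedding values here are random placeholders, and the projection layer would in practice be trained on the seen classes only.

```python
# Zero-shot classification sketch: pick the unseen class whose semantic
# embedding is closest to the projected video feature. All values are dummies.
import torch
import torch.nn.functional as F

feat_dim, embed_dim = 512, 300
video_feature = torch.randn(feat_dim)                # from a model trained on seen classes
unseen_class_embeddings = torch.randn(5, embed_dim)  # semantic knowledge about 5 new classes
projection = torch.nn.Linear(feat_dim, embed_dim)    # learnt on seen classes only

projected = projection(video_feature)                # map the video into the semantic space
scores = F.cosine_similarity(projected.unsqueeze(0), unseen_class_embeddings, dim=-1)
predicted_class = scores.argmax().item()             # nearest unseen class embedding
print(predicted_class)
```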

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509619/1                                   01/10/2016  30/09/2021
1971464            Studentship   EP/N509619/1  18/09/2017  31/03/2021  Evangelos Kazakos