Continuous prediction of emotions in human-human interactions using audio-visual data

Lead Research Organisation: Imperial College London
Department Name: Dept of Computing

Abstract

This research applies deep neural architectures to audio-visual data in order to predict human emotions, expressed as continuous valence and arousal values. To this end, multimodal input data, audio and visual, are used: features from the visual modality are extracted with a pre-trained Convolutional Neural Network (CNN), while audio features are extracted both with a temporal model (a Bidirectional Long Short-Term Memory network, BLSTM) and with a CNN. The two modalities are then fused through different forms of combination, from simple addition to more complex learned relationships, in order to make the final prediction. Several loss functions have also been explored, as well as architectures that predict valence and arousal either separately or jointly.
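A minimal sketch of the kind of fusion model the abstract describes: a pre-extracted CNN visual embedding is concatenated with the output of a BiLSTM run over frame-level audio features, and a small head jointly regresses valence and arousal. All layer sizes, feature dimensions, and the concatenation-based fusion head are assumptions for illustration, not the project's actual architecture.

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Illustrative late-fusion model (dimensions are assumed, not from the
    project): visual features from a pre-trained CNN and audio features from
    a BiLSTM are concatenated and mapped to [valence, arousal]."""

    def __init__(self, visual_dim=512, audio_dim=40, hidden=64):
        super().__init__()
        # BiLSTM over per-frame audio features (e.g. log-mel energies)
        self.audio_rnn = nn.LSTM(audio_dim, hidden,
                                 batch_first=True, bidirectional=True)
        # Fusion head: simple concatenation of the two modality embeddings
        self.head = nn.Sequential(
            nn.Linear(visual_dim + 2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # joint prediction: [valence, arousal]
            nn.Tanh(),             # continuous annotations often lie in [-1, 1]
        )

    def forward(self, visual_feats, audio_frames):
        # visual_feats: (batch, visual_dim) CNN embedding per video clip
        # audio_frames: (batch, time, audio_dim) frame-level audio features
        rnn_out, _ = self.audio_rnn(audio_frames)
        audio_emb = rnn_out[:, -1, :]          # final BiLSTM time step
        fused = torch.cat([visual_feats, audio_emb], dim=1)
        return self.head(fused)

model = FusionRegressor()
pred = model(torch.randn(4, 512), torch.randn(4, 100, 40))
print(pred.shape)  # one (valence, arousal) pair per batch element
```

Predicting the two dimensions separately would simply replace the final two-unit layer with two single-unit heads; more complex fusion schemes replace the plain concatenation.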

Publications


Studentship Projects

Project Reference  Relationship  Related To     Start       End         Student Name
EP/R512540/1                                    01/10/2017  30/09/2021
2021107            Studentship   EP/R512540/1   01/10/2017  30/09/2021  Thomas Uriot