Learning by Imitating via Convolutional Neural Networks and Reinforcement Learning.

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

Robots are increasingly becoming part of many aspects of human life: homes, offices and hospitals. Most of these robots have humanoid appearances and interact with the surrounding environment, objects and people using their cameras and actuators. However, the behaviour of these robots is usually pre-programmed, meaning that developers need to hard-code movements; consequently, the robots have a limited repertoire of possible actions. One alternative is to teleoperate the robot using devices such as cameras, headsets, joysticks and inertial sensors. While these devices can provide highly accurate measurements, they can be quite expensive and their usage is non-trivial. Moreover, this does not resemble human learning at all: when a person replicates a task, they observe a demonstrator and try to imitate their actions. Consequently, in order to learn by imitation, an ideal robotic system would require no additional components aside from the robot's own camera.
We treat the problem of imitation learning as a composition of two different fields: perception (i.e., vision) and action.

In Computer Vision, Convolutional Neural Networks (CNNs) can achieve impressive performance on recognition and localization tasks. Interestingly, the same architectures used for these tasks can be adapted for pose estimation and tracking. While some neural networks rely on older approaches, such as depth images, there have also been several innovations for RGB images, such as 3D pose estimation from a single RGB image without additional information. In this project, we rely on recent developments in CNNs - such as temporal convolutions (which enforce temporal constraints on the predicted positions of the joints) and mesh regression - to perform 3D tracking of the upper body (arms and torso). In this way, we use a single camera to solve the perception problem.
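The temporal constraint mentioned above can be illustrated with a minimal sketch: a fixed 1-D convolution applied along the time axis of a sequence of predicted 3-D joint positions. In the actual models the kernel weights are learned end-to-end; the uniform (moving-average) kernel and the array shapes below are illustrative assumptions only.

```python
import numpy as np

def temporal_smooth(joints, kernel_size=5):
    """Apply a 1-D temporal convolution to a sequence of 3-D joint positions,
    enforcing smoothness of each coordinate over time.

    joints: array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    Returns a smoothed array of the same shape.
    """
    # Uniform kernel as a stand-in; a learned kernel would replace this.
    kernel = np.ones(kernel_size) / kernel_size
    pad = kernel_size // 2
    # Pad along the time axis by repeating the boundary frames, so the
    # output keeps the same number of frames as the input.
    padded = np.pad(joints, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    smoothed = np.empty(joints.shape, dtype=float)
    T, J, C = joints.shape
    for j in range(J):
        for c in range(C):
            smoothed[:, j, c] = np.convolve(padded[:, j, c], kernel, mode="valid")
    return smoothed
```

A constant pose passes through unchanged, while frame-to-frame jitter in the per-frame CNN predictions is averaged out, which is the effect the temporal constraint is meant to achieve.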
On the other hand, the action side of imitation requires a different kind of approach, so that the robot does not only match known inputs with known outputs, but also learns to acquire novel actions in an online way. Reinforcement Learning (RL) is currently a state-of-the-art approach for locomotion in virtual avatars and robots. In particular, DeepMimic, developed at the University of California, Berkeley, shows how an RL algorithm based on Policy Gradient methods can be used to teach a simulated robot to imitate the motion performed by a virtual avatar.
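As a toy illustration of the policy-gradient idea behind this kind of imitation reward (a heavily simplified sketch, not DeepMimic itself: the two-joint target pose, Gaussian policy, learning rates and reward below are all assumptions), REINFORCE with a running baseline can drive a policy towards a demonstrated pose:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: the policy outputs two joint angles and is rewarded
# for matching a demonstrated target pose (an imitation-style reward).
target = np.array([0.5, -0.3])   # demonstrated joint angles
mean = np.zeros(2)               # learned policy parameters (Gaussian mean)
sigma = 0.1                      # fixed exploration noise
lr = 0.05                        # learning rate
baseline = 0.0                   # running reward baseline (variance reduction)

for step in range(3000):
    # Sample an action from the Gaussian policy N(mean, sigma^2 * I).
    action = mean + sigma * rng.standard_normal(2)
    # Imitation reward: negative squared distance to the demonstrated pose.
    reward = -np.sum((action - target) ** 2)
    # REINFORCE: grad of log pi(action) w.r.t. mean is (action - mean) / sigma^2.
    advantage = reward - baseline
    mean += lr * advantage * (action - mean) / sigma**2
    baseline += 0.1 * (reward - baseline)
```

In expectation the update moves the policy mean towards the demonstrated pose; DeepMimic applies the same principle with a deep policy network, a physics simulation and a richer motion-matching reward.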

The final aim of this project is to teach a robot Sign Language through interaction with a human, using the previously described methods. By learning how to replicate human motion through RL, we provide a way for the robot to learn different gestures that correspond to different concepts (e.g., eat, drink, home). Moreover, the usage of CNNs enables us to observe a demonstrator in real time and translate the motion into keypoints that the robot can use to replicate the demonstrated motion. In this way, the robot is able to learn different signs in real time, in contrast to current scenarios where digital avatars or robots are pre-trained and deployed with a fixed set of gestures.

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/R513131/1      |              |              | 01/10/2018 | 30/09/2023 |
2169204           | Studentship  | EP/R513131/1 | 01/01/2019 | 31/12/2021 | Federico Tavella