Efficient Robotic Reinforcement Learning via Off-Policy and Meta-Learning

Lead Research Organisation: University of Oxford

Department Name: Computer Science

Abstract

This project falls within the EPSRC Artificial Intelligence and Robotics research areas.
Research Context
Deep reinforcement learning is an increasingly active research direction in robotics because it promises more flexible control policies with less manual engineering. In traditional robotics methods, highly specialized experts specify the control policy for each task separately. Learning algorithms, by contrast, can acquire general behaviors from their own experience, much as many biological organisms do. Deep learning models have been shown to generalize well when trained on diverse datasets, and the key to their success lies in their ability to fit millions of parameters to large amounts of training data. One of the major limitations of real-world robotic learning is that a single experiment cannot collect a dataset large enough for "ImageNet-scale" generalization.
Potential Impact
While robots are becoming increasingly affordable to average consumers, the set of tasks they can carry out is limited by the difficulty of designing robust control policies. Robots could reduce the human burden in many everyday tasks, such as cooking at home, elderly care in assisted-living communities, surgery in hospitals, or rescue operations in dangerous disaster zones. Algorithms that can learn and generalize efficiently are crucial to bringing useful low-cost robots to wider audiences.
Objectives and Research Methodology
For reinforcement learning algorithms to evolve into practical methods for complex real-world tasks, we must design novel algorithms that overcome data scarcity. One possible way is to make better use of existing historical data. Towards this goal, we propose to investigate how to (1) better utilize off-policy data, that is, data collected outside the specific robot experiment, and (2) meta-learn policies that can adapt quickly to new tasks.

There is an abundance of previously collected robotic data, which already constitutes a large and diverse body of experience for learning robotic skills. If this experience can be incorporated into reinforcement learning, policies could truly generalize across different objects, environments, scenes, and possibly even across different robots. For example, we could use the RoboNet dataset to improve single-task or multi-task reinforcement learning. We would define one or more tasks manually and relabel all the RoboNet data with these task rewards. We would then run an off-policy reinforcement learning algorithm, such as Soft Actor-Critic (SAC), on the large set of relabeled RoboNet data together with a modest amount of new data for each task. This should allow the policies to generalize better and learn faster.
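A minimal sketch of the relabeling step and of mixing prior data with new task data for an off-policy learner, assuming prior RoboNet-style transitions are stored as (observation, action, next observation) tuples without rewards. `Transition`, `reach_reward`, `MixedReplayBuffer`, and the 75/25 prior-to-new sampling ratio are hypothetical illustrations, not RoboNet's actual format or the project's design; the Soft Actor-Critic update that would consume each sampled batch is not shown.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    obs: Vector
    action: Vector
    next_obs: Vector
    reward: float = 0.0  # assumption: prior data arrives without a task reward


def relabel(data: Sequence[Transition],
            reward_fn: Callable[[Vector, Vector, Vector], float]) -> List[Transition]:
    """Attach a manually defined task reward to every prior transition."""
    return [
        Transition(t.obs, t.action, t.next_obs, reward_fn(t.obs, t.action, t.next_obs))
        for t in data
    ]


def reach_reward(obs: Vector, action: Vector, next_obs: Vector,
                 goal: Vector = (0.5, 0.0)) -> float:
    """Hypothetical reaching task: negative squared distance of the first two state dims to a goal."""
    return -sum((x - g) ** 2 for x, g in zip(next_obs[:2], goal))


class MixedReplayBuffer:
    """Draws a fixed fraction of each batch from relabeled prior data, the rest from new task data."""

    def __init__(self, prior: List[Transition], prior_fraction: float = 0.5, seed: int = 0):
        self.prior = prior  # assumed non-empty
        self.new: List[Transition] = []
        self.prior_fraction = prior_fraction
        self.rng = random.Random(seed)

    def add(self, t: Transition) -> None:
        self.new.append(t)

    def sample(self, batch_size: int) -> List[Transition]:
        # Until any new data exists, the whole batch comes from the prior data.
        n_prior = round(batch_size * self.prior_fraction) if self.new else batch_size
        batch = [self.rng.choice(self.prior) for _ in range(n_prior)]
        batch += [self.rng.choice(self.new) for _ in range(batch_size - n_prior)]
        return batch


prior = [
    Transition((0.0, 0.0), (0.1,), (0.5, 0.0)),
    Transition((0.0, 0.0), (0.1,), (1.5, 0.0)),
]
buffer = MixedReplayBuffer(relabel(prior, reach_reward), prior_fraction=0.75)
buffer.add(Transition((0.0, 0.0), (0.2,), (0.4, 0.0), reward=-0.01))
batch = buffer.sample(8)  # 6 relabeled prior transitions followed by 2 new ones
```

Keeping the prior fraction fixed means the small amount of new task data is not drowned out by the much larger relabeled dataset, whatever their relative sizes.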

Meta-reinforcement learning algorithms allow agents to adapt rapidly to new tasks by exploiting structural similarities in previously collected experience. Existing meta-learning algorithms operate mainly in a setting where all the experience is available to the learner in a single batch. In the real world, however, agents typically encounter tasks sequentially, so the current meta-learning formulation should be extended to support streaming experience. Another interesting direction would be to formulate versions of meta-reinforcement learning in which the MDPs do not necessarily all share the same state and action spaces. This would require developing new model architectures that can take inputs from heterogeneous state spaces and produce outputs in heterogeneous action spaces.
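To make the streaming setting concrete, here is a minimal sketch in which tasks arrive one at a time and the meta-parameters are updated after each one, without ever holding a batch of all past experience. It swaps in a first-order, Reptile-style meta-update on a hypothetical family of scalar quadratic tasks; `quadratic_grad`, `adapt`, `streaming_reptile`, and all step sizes are illustrative assumptions rather than the project's algorithm, and a real meta-RL agent would replace the quadratic loss with a reinforcement learning objective.

```python
import random
from typing import Callable, Iterable

Grad = Callable[[float], float]


def quadratic_grad(optimum: float) -> Grad:
    """Gradient of the hypothetical toy task loss (phi - optimum) ** 2."""
    def grad(phi: float) -> float:
        return 2.0 * (phi - optimum)
    return grad


def adapt(theta: float, grad: Grad, steps: int = 5, lr: float = 0.1) -> float:
    """Inner loop: a few gradient steps on one task, starting from the meta-parameters."""
    phi = theta
    for _ in range(steps):
        phi -= lr * grad(phi)
    return phi


def streaming_reptile(task_optima: Iterable[float], theta: float = 0.0,
                      meta_lr: float = 0.2) -> float:
    """Updates the meta-parameters once per task as tasks arrive; past tasks are never revisited."""
    for optimum in task_optima:
        phi = adapt(theta, quadratic_grad(optimum))
        theta += meta_lr * (phi - theta)  # Reptile-style step toward the adapted parameters
    return theta


rng = random.Random(0)
stream = (3.0 + rng.uniform(-1.0, 1.0) for _ in range(500))  # task optima arrive one at a time
theta = streaming_reptile(stream)  # ends close to 3.0, the center of the task distribution
```

Because each meta-update needs only the current task, memory does not grow with the number of tasks seen; the trade-off is that the meta-parameters weight recent tasks more heavily than older ones.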

Student:

Kristian Haritkainen

Period of Study:

Oct 19 - Mar 23

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2285275

Research Topic:

Unclassified

Organisations

University of Oxford (Lead Research Organisation)

People	ORCID iD
Shimon Whiteson (Primary Supervisor)
Kristian Haritkainen (Student)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/R513295/1			01/10/2018	30/09/2023
2285275	Studentship	EP/R513295/1	01/10/2019	31/03/2023	Kristian Haritkainen