Improved analysis of policy gradient methods in reinforcement learning.

Lead Research Organisation: Imperial College London

Department Name: Mathematics

Abstract

Reinforcement learning is a popular branch of machine learning that aims to solve sequential decision-making problems in which an agent interacts with an environment. It has a wide variety of applications, including autonomous driving, robotics, recommendation systems and healthcare. In some of these applications, the cost of a wrong decision could be dramatic. In autonomous driving in particular, human lives are at stake. As such, it is of crucial importance that we understand how the methods work and whether they really do work in the way that was intended.

However, the methods that are used in practice are often only poorly understood. The theory describing these methods is currently unable to explain the huge successes that reinforcement learning has enjoyed in practice. The aim of this project is to provide improved theoretical guarantees for policy gradient methods, which form the basis for many practical implementations of reinforcement learning. These methods are used in particular for the large-scale problems that are often faced in practice.

Specifically, theory on algorithms of this type takes the form of convergence bounds. That is, the algorithm aims to output an optimal solution to the problem, and we are interested in understanding how quickly it outputs something close to this optimal solution, where the notion of closeness is mathematically precise. An improved analysis here means a proof that an algorithm converges faster than was previously proven.
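To make the notion of a convergence bound concrete, the following is a generic sketch rather than a result of this project. The notation is an assumption introduced for illustration only: \pi_k is the policy after k iterations, \pi^\star is an optimal policy, V^{\pi}(\rho) is the expected return of \pi from an initial state distribution \rho, and C and \gamma are unspecified constants.

```latex
% Generic shape of a convergence bound: the suboptimality gap after k
% iterations is controlled by a function that vanishes as k grows.
\[
  V^{\pi^\star}(\rho) - V^{\pi_k}(\rho) \;\le\; \varepsilon(k),
  \qquad \varepsilon(k) \to 0 \ \text{as}\ k \to \infty .
\]
% An improved analysis shows that the same algorithm satisfies the bound
% with a faster-decaying \varepsilon(k), for example
\[
  \varepsilon(k) = \frac{C}{k} \quad \text{(sublinear rate)}
  \qquad \text{versus} \qquad
  \varepsilon(k) = C\,\gamma^{k}, \ \gamma \in (0,1) \quad \text{(linear rate)}.
\]
```

Here "closeness" is the gap on the left-hand side, and "faster" means that the right-hand side shrinks more quickly with k.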

Recently, a particular type of policy gradient method in a specific setting has been studied from a new perspective known as policy mirror descent. The details are not important here, except that mirror descent is a concept from optimisation theory that has been heavily studied in that setting. As such, tools and methods of analysis may be translated from optimisation theory to this reinforcement learning framework. This can be exploited to achieve improved convergence guarantees, which is one of the avenues that we are pursuing in this project.
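For readers who want a concrete picture, the following is a minimal sketch of one common instance of policy mirror descent, assuming a small tabular problem with exact policy evaluation and a KL-divergence mirror step (so each update multiplies the current policy by exp(eta * Q) and renormalises). The function names, the step size eta and the two-state example are hypothetical choices made for illustration; the abstract does not specify the setting studied in the project.

```python
import numpy as np


def policy_evaluation(P, r, pi, gamma):
    """Exact Q and V of policy pi in a tabular MDP.

    P has shape (S, A, S), r and pi have shape (S, A).
    """
    n_states = r.shape[0]
    # State-to-state transition matrix and expected reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    # Solve the Bellman equation (I - gamma * P_pi) V = r_pi.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)
    return Q, V


def policy_mirror_descent(P, r, gamma=0.9, eta=1.0, n_iters=200):
    """Illustrative policy mirror descent with a KL mirror step.

    Update: pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta * Q_k(s, a)).
    Returns the final policy and the value of each iterate before its update.
    """
    n_states, n_actions = r.shape
    # Start from the uniform policy, stored in log space for stability.
    log_pi = np.full((n_states, n_actions), -np.log(n_actions))
    values = []
    for _ in range(n_iters):
        Q, V = policy_evaluation(P, r, np.exp(log_pi), gamma)
        values.append(V)
        # Mirror step, then renormalise each row with a log-sum-exp.
        log_pi = log_pi + eta * Q
        m = log_pi.max(axis=1, keepdims=True)
        log_pi = log_pi - (m + np.log(np.exp(log_pi - m).sum(axis=1, keepdims=True)))
    return np.exp(log_pi), np.array(values)


def two_state_example():
    """Hypothetical toy MDP: action 0 stays, action 1 switches state.

    Only staying in state 1 pays a reward of 1.
    """
    P = np.zeros((2, 2, 2))
    P[0, 0, 0] = 1.0
    P[0, 1, 1] = 1.0
    P[1, 0, 1] = 1.0
    P[1, 1, 0] = 1.0
    r = np.array([[0.0, 0.0], [1.0, 0.0]])
    return P, r


if __name__ == "__main__":
    P, r = two_state_example()
    pi, values = policy_mirror_descent(P, r)
    print("final policy:\n", pi)
    print("value of last evaluated iterate:", values[-1])
```

In this toy problem the optimal policy switches out of state 0 and stays in state 1, so the rows of `values` can be compared against the optimal values to see how the gap shrinks with the iteration count. That gap is exactly the quantity a convergence bound controls.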

This project is part of the StatML CDT, which is a joint CDT between Imperial College London and the University of Oxford. It falls within the EPSRC statistics and applied probability research area. Although the project is heavily linked to optimisation, it remains very statistical in nature, because we are interested in using inherently random data to solve the decision-making problem of reinforcement learning.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Student:

Emmeran Johnson

Period of Study:

Oct 21 - Aug 25

Funder:

EPSRC

Project Status:

Active

Project Category:

Studentship

Project Reference:

2602524

Research Topic:

Unclassified

Organisations

Imperial College London (Lead Research Organisation)

People	ORCID iD
Axel Gandy (Primary Supervisor)
Emmeran Johnson (Student)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/S023151/1			01/04/2019	30/09/2027
2602524	Studentship	EP/S023151/1	02/10/2021	30/08/2025	Emmeran Johnson