Developing robust and scalable reinforcement learning algorithms

Lead Research Organisation: University of Oxford

Abstract

Reinforcement learning (RL) is a powerful paradigm for applying machine learning to complex decision-making tasks. However, current RL algorithms are brittle and have numerous failure mechanisms, making them difficult to apply beyond the simple synthetic environments used in research. Over my PhD I aim to help develop robust and scalable RL algorithms, so that RL can be more easily applied to large, complex, real-world problems.

One promising approach is offline RL, a broad set of methods that use pre-existing datasets to remove the need for the online data collection required by standard RL methods. The use of large-scale datasets has driven the transformative recent advances in supervised learning, and offline RL offers a way of enabling similar scaling progress in RL. In addition, offline RL is a practical choice in many real-world domains where online data collection is costly or poses safety risks, such as robotics or healthcare. The offline approach, however, introduces the extra problem of needing to evaluate the effectiveness of actions or behaviour not covered in the dataset. This additional challenge has been shown to cause existing online RL algorithms to fail, and hence represents a significant barrier to robust, scalable offline RL algorithms.

In my recent project we investigated this issue and identified the "edge-of-reach" failure mechanism in offline model-based reinforcement learning. Based on these insights we were able to propose a "value uncertainty-aware" approach which resolves this issue. As an extension, I plan to investigate whether meta-learning can be used to discover an offline RL algorithm that deals with value uncertainty even more effectively.
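
To make the idea of value uncertainty concrete, the sketch below shows one common, generic form of pessimism: penalising disagreement across an ensemble of value estimates. It is an illustration only and is not necessarily the approach proposed in the project; the function names, the ensemble values, and the parameters beta and gamma are all hypothetical.

```python
from statistics import mean, pstdev


def pessimistic_value(q_estimates, beta=1.0):
    """Ensemble mean minus beta times the ensemble's population standard deviation.

    q_estimates: value predictions for one state-action pair, one per
    ensemble member (hypothetical numbers in this sketch).
    beta: assumed penalty weight; larger values mean more pessimism.
    """
    return mean(q_estimates) - beta * pstdev(q_estimates)


def pessimistic_target(reward, next_q_estimates, gamma=0.99, beta=1.0):
    """One-step bootstrapped target using the pessimistic next-state value."""
    return reward + gamma * pessimistic_value(next_q_estimates, beta)


# Ensemble members agree: the state-action pair is well covered by the data.
covered = [1.0, 1.0, 1.0]
# Ensemble members disagree: the pair lies outside the dataset's coverage.
uncovered = [0.0, 1.0, 2.0]

print(pessimistic_value(covered))
print(pessimistic_value(uncovered))
```

Both ensembles have the same mean, but the second is penalised for its disagreement, so a policy trained on such values is discouraged from exploiting poorly supported estimates.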

Another barrier to large-scale RL is the phenomenon of "plasticity loss", whereby the inherent non-stationarity of RL training has been shown to cause deep RL agents to gradually lose the ability to learn from new data. I am currently investigating the mechanisms underlying this phenomenon, with the hope that this will inform the development of more effective mitigation techniques.

Finally, recent advances in large language models present a huge opportunity for using language as a tool for instilling real-world knowledge and inductive biases into RL agents. The most effective approach to this integration is an open question, and I plan to explore it. In summary, RL has huge potential which is not yet realized by current approaches, and during my PhD I aim to contribute towards developing RL algorithms which are robust, scalable, and effective when applied to complex real-world problems.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Student:

Anya Sims

Period of Study:

Oct 22 - Sep 26

Funder:

EPSRC

Project Status:

Active

Project Category:

Studentship

Project Reference:

2740739

Research Topic:

Unclassified

Organisations

University of Oxford (Lead Research Organisation)

People	ORCID iD
Yee Teh (Primary Supervisor)
Anya Sims (Student)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/S023151/1			01/04/2019	30/09/2027
2740739	Studentship	EP/S023151/1	01/10/2022	30/09/2026	Anya Sims