Delays in Reinforcement Learning

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

Reinforcement learning is an area of machine learning concerned with sequential decision making. Here, an agent interacts with its environment as follows. First, the environment presents a situation, and the agent must choose an action to take. The agent then receives a numerical reward and finds itself in a new situation selected by the environment. What makes the problem challenging is that actions influence situations and, in turn, the possible future rewards.

The goal of an agent is to learn the optimal policy, which is the mapping from any situation to the action maximising the expected long-term reward. However, the environment is often unknown and complicated. Therefore, the agent must carefully balance exploiting what it does know and exploring the environment. Many high-stakes applications, such as healthcare, finance and education, require algorithms addressing this exploration-exploitation dilemma. Such algorithms should perform well empirically and have strong theoretical guarantees.
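
One common way to make the notion of an optimal policy precise is sketched below. The notation is an assumption introduced here for illustration and is not taken from the project description: a finite-horizon task with H steps per episode, state s_h, action a_h, and reward function r_h at step h.

```latex
\[
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{h=1}^{H} r_h(s_h, a_h) \;\middle|\; s_1 = s \right],
\qquad
\pi^{*} \in \arg\max_{\pi} V^{\pi}(s) \quad \text{for every initial state } s .
\]
```

Under this formulation, the value of a policy is the expected total reward it collects over one episode, and an optimal policy is one that maximises this value from every starting situation.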

This research aims to relax some assumptions made in the theoretical literature. At present, the focus is on relaxing the immediate reward assumption in episodic tasks, where the agent interacts with the environment by taking a fixed number of actions before starting the next episode. Here, we allow the reward associated with each decision to arrive at some unknown time in the future. The first objective is to quantify the effect of these delays on the performance of existing reinforcement learning algorithms. It is of interest to know whether provably efficient algorithms can still find an optimal policy in this setting and, if so, how the delays affect their performance. Doing so requires developing novel theoretical techniques that adapt whole classes of algorithms to the delayed reward setting. These techniques should provide worst-case guarantees for all algorithms in the considered class, giving insight into the worst-case effect of delays on the algorithm of choice.

Using the episodic setting as a stepping stone, we seek to develop new algorithms for handling delayed rewards in the more general setting of stochastic shortest paths. Here, an agent must make decisions until it reaches some predetermined goal. In healthcare applications, for example, the agent might have to continue selecting treatments until it cures the patient. The rewards associated with each action arrive with a delay, as the effect of a treatment is not immediately observable. In the meantime, the agent can still gain information about how the environment works. When developing a new algorithm, it will be crucial to ensure the agent quickly learns a model of the world while accounting for the fact that it has observed fewer rewards than it would without delays. Accounting for this will involve developing novel statistical techniques.

A theoretical treatment of the subject matter will give insight into the effect of delayed rewards and is a piece in the puzzle that will allow confidence in automated systems, enabling their application to real-world problems in healthcare, finance and online recommendation. This project falls within the EPSRC Artificial Intelligence and Robotics and Mathematical Sciences research areas.
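
The delayed reward setting described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the project's method: the class `DelayedRewardBuffer`, the function `run`, the uniform random rewards, and the choice to measure delays in whole episodes are all hypothetical and introduced only for illustration.

```python
import heapq
import random


class DelayedRewardBuffer:
    """Holds rewards until their (hypothetical) arrival time."""

    def __init__(self):
        # Heap entries: (arrival_time, counter, episode, step, reward).
        self._heap = []
        self._counter = 0  # tie-breaker so entries with equal arrival times stay ordered

    def push(self, now, delay, episode, step, reward):
        """Schedule a reward generated at time `now` to arrive after `delay`."""
        heapq.heappush(self._heap, (now + delay, self._counter, episode, step, reward))
        self._counter += 1

    def pop_arrived(self, now):
        """Return all (episode, step, reward) triples whose arrival time is <= now."""
        arrived = []
        while self._heap and self._heap[0][0] <= now:
            _, _, episode, step, reward = heapq.heappop(self._heap)
            arrived.append((episode, step, reward))
        return arrived

    def __len__(self):
        return len(self._heap)


def run(num_episodes=5, horizon=3, max_delay=4, seed=0):
    """Simulate episodes where each reward arrives after a random delay.

    Assumption: delays are measured in episodes and drawn uniformly
    from {0, ..., max_delay}; rewards are uniform in [0, 1).
    """
    rng = random.Random(seed)
    buffer = DelayedRewardBuffer()
    observed = []
    generated = 0
    for episode in range(num_episodes):
        for step in range(horizon):
            reward = rng.random()
            delay = rng.randint(0, max_delay)
            buffer.push(episode, delay, episode, step, reward)
            generated += 1
        # At the end of the episode, the agent sees only the rewards that have arrived.
        observed.extend(buffer.pop_arrived(now=episode))
    return generated, len(observed), len(buffer)


if __name__ == "__main__":
    generated, seen, pending = run()
    print(f"generated={generated} observed={seen} still delayed={pending}")
```

The point of the sketch is the gap between `generated` and `observed`: at any time the agent has taken more actions than it has received rewards for, which is the missing information a learning algorithm in this setting has to account for.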

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships and placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Publications

Studentship Projects

| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/S023151/1 | | | 01/04/2019 | 30/09/2027 | |
| 2445745 | Studentship | EP/S023151/1 | 03/10/2020 | 30/09/2024 | Benjamin Howson |