Statistical Methods in Offline Reinforcement Learning
Lead Research Organisation:
London School of Economics and Political Science
Department Name: Statistics
Abstract
Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier conference in machine learning), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in depth and in breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were developed for online settings (e.g., video games), and their generalisations to healthcare applications remain largely unexplored. We also remark that our solutions will be transferable to other fields (e.g., robotics).
A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. Solving this question faces at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited. With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are broadly applicable to state-of-the-art RL algorithms and improve their statistical efficiency. For a given initial policy computed by an existing algorithm, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating over many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients may not share a common distribution. We will study existing transfer learning methods in RL and develop new approaches designed for healthcare applications, drawing on our expertise in statistics.
Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption made in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data and can be violated in observational datasets arising from healthcare applications. Moreover, given the limited sample size, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications the outcome distribution is skewed and heavy-tailed, and criteria such as quantiles are more sensible than the mean. We will develop methodologies to learn the quantile curve of the return under a target policy and construct its associated confidence band.
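To make the OPE objective concrete, here is a minimal Python sketch, assuming logged trajectories of (state, action, reward) triples and known behaviour and target action probabilities, that estimates a target policy's value by per-decision importance sampling and attaches a 95% normal-approximation confidence interval. It illustrates the problem setup only; it is not one of the estimators the proposed research will develop, and it does not handle unmeasured confounding or heavy-tailed returns.

```python
# Toy per-decision importance-sampling OPE with a 95% normal-approximation CI.
# The trajectory format and the behaviour/target probability functions are
# illustrative assumptions, not part of the proposed methodology.
import numpy as np

def trajectory_value_is(trajectory, behaviour_prob, target_prob, gamma=0.99):
    """Importance-weighted discounted return of one logged trajectory."""
    weight, value, discount = 1.0, 0.0, 1.0
    for state, action, reward in trajectory:
        weight *= target_prob(state, action) / behaviour_prob(state, action)
        value += discount * weight * reward      # per-decision importance sampling
        discount *= gamma
    return value

def ope_estimate_with_ci(trajectories, behaviour_prob, target_prob, gamma=0.99):
    """Point estimate and 95% normal-approximation CI for the target policy's value."""
    estimates = np.array([trajectory_value_is(tau, behaviour_prob, target_prob, gamma)
                          for tau in trajectories])
    mean = estimates.mean()
    half_width = 1.96 * estimates.std(ddof=1) / np.sqrt(len(estimates))
    return mean, (mean - half_width, mean + half_width)
```

Such importance-sampling estimators can suffer from very high variance when the offline sample is small, which is one motivation for the statistically efficient and confounding-robust methods described above.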
People |
| Chengchun Shi (Principal Investigator) |
Publications
Bian Z
(2024)
Off-Policy Evaluation in Doubly Inhomogeneous Environments
in Journal of the American Statistical Association
Cai H.
(2023)
Jump Interval-Learning for Individualized Decision Making with Continuous Treatments
in Journal of Machine Learning Research
Ge L.
(2023)
A Reinforcement Learning Framework for Dynamic Mediation Analysis
in Proceedings of Machine Learning Research
Li M
(2025)
Testing Stationarity and Change Point Detection in Reinforcement Learning
in The Annals of Statistics
Li T
(2024)
Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing
in Journal of the American Statistical Association
Li T.
(2023)
Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making
in Advances in Neural Information Processing Systems
| Description | Our research focuses on reinforcement learning (RL), a cutting-edge area of artificial intelligence (AI) that teaches machines how to make sequential decisions over time to achieve the best outcomes. RL has been widely studied in computer science, with applications ranging from teaching computers to play complex games to training robots. However, the field of statistics had not fully explored RL until recently. Thanks to this grant, our research has successfully bridged the gap between statistics and RL, substantially enriching the statistical tools available for RL and making RL algorithms more reliable, interpretable, and effective. The main tools developed include: 1. Hypothesis Testing: The performance of RL algorithms depends heavily on two key data assumptions: (i) the Markov assumption (that future states depend on the history only through the present) and (ii) the stationarity assumption (that the environment does not change over time). Our research has developed rigorous statistical tests to verify these assumptions. More importantly, by leveraging these tests, we can successfully improve the performance of modern RL algorithms. 2. Confidence Intervals: In many real-world applications, newly developed policies must be evaluated offline before being deployed. This process, known as off-policy evaluation (OPE), often requires not just a point estimator of a target policy's return but also a measure of uncertainty around that estimate. Our research has developed advanced methods for constructing confidence intervals in OPE, quantifying the reliability of these estimates. 3. Experimental Design: While much attention has been given to developing OPE methods in RL, far less has been paid to how to generate the experimental data used for evaluation. Our research has introduced innovative algorithms for designing online experiments, ensuring that the data collected is optimal for accurate policy evaluation. Additionally, we have adapted the aforementioned methods for A/B testing in two-sided marketplaces, such as ride-sharing and e-commerce platforms, that involve sequential decision making over time. This research has highlighted the importance of statistics in RL, encouraging more statisticians to engage with this field and amplifying the voice of statisticians within the RL and broader AI communities. It has been presented at leading universities in the UK and abroad, including Stanford, Berkeley, Oxford, and Peking University, and has resulted in over 20 high-impact publications, including 10 papers in top statistics journals (Annals of Statistics, Journal of the Royal Statistical Society Series B, and Journal of the American Statistical Association) and publications in prestigious machine learning venues (ICML, NeurIPS, JMLR, and AISTATS). The research has also been recognized with prestigious awards, including the 2024 IMS Tweedie Award and the 2023 ICSA Outstanding Young Research Award. |
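The formal tests referred to above are developed in the associated publications. As a much cruder illustration of the first assumption, the sketch below (a toy diagnostic under an assumed linear ridge predictor, not the project's testing procedure) compares held-out prediction error when the next observation of a scalar series is predicted from one lag versus two lags; a large gain from the second lag would cast doubt on the first-order Markov assumption.

```python
# Crude heuristic check of the first-order Markov assumption for a scalar series:
# compare held-out prediction error using one lag versus two lags.
# Illustration only, not the formal tests developed in the project.
import numpy as np

def lagged_design(x, order):
    """Build (X, y) where each row of X holds `order` lags predicting the next value."""
    X = np.column_stack([x[order - k - 1 : len(x) - k - 1] for k in range(order)])
    y = x[order:]
    return X, y

def holdout_mse(x, order, train_frac=0.7, ridge=1e-6):
    """Ridge least-squares fit on the first part of the series, MSE on the rest."""
    X, y = lagged_design(np.asarray(x, dtype=float), order)
    X = np.column_stack([np.ones(len(X)), X])            # intercept column
    split = int(train_frac * len(y))
    beta = np.linalg.solve(X[:split].T @ X[:split] + ridge * np.eye(X.shape[1]),
                           X[:split].T @ y[:split])
    resid = y[split:] - X[split:] @ beta
    return float(np.mean(resid ** 2))

rng = np.random.default_rng(0)
series = rng.standard_normal(500).cumsum()               # toy trajectory
print("1-lag MSE:", holdout_mse(series, 1), " 2-lag MSE:", holdout_mse(series, 2))
```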
| Exploitation Route | The outcomes of this funding have the potential to be widely adopted across academia and industry, advancing both research and practical applications. Key ways these outcomes might be taken forward include: Academic Collaboration: The developed statistical tools, such as hypothesis testing, confidence intervals, and experimental design for RL, bridge statistics and RL, fostering collaboration between statisticians and AI researchers. Statisticians can apply their expertise to enrich RL methodologies, while RL researchers can employ these tools for algorithm development and refinement. Two-Sided Marketplaces: Ride-sharing and e-commerce platforms can use the proposed A/B testing tools to evaluate new policies. Case studies in our published papers demonstrate the effectiveness of the proposed approaches in ride-sharing companies such as Uber and Lyft. Open-source code is also available for direct use by industry practitioners. Healthcare: The developed RL algorithms can be deployed for personalized treatment recommendations and drug evaluation in pharmaceuticals and medical biotechnology. Our collaboration with health scientists has demonstrated their use in mobile health for devising and evaluating optimal treatment strategies. Published code allows health scientists to directly apply these tools. In summary, this research makes RL algorithms more reliable, interpretable, and effective, paving the way for safer and more efficient AI systems, ultimately benefiting society as a whole. |
| Sectors | Digital/Communication/Information Technologies (including Software) Healthcare Pharmaceuticals and Medical Biotechnology Transport |
| Description | The proposed research has pioneered a new area at the intersection of statistics and reinforcement learning (RL), sparking a wave of subsequent work in developing statistical methods for RL and their adaptations for A/B testing in two-sided marketplaces. |
| Title | 2FEOPE |
| Description | This software package contains the implementation of the algorithm developed by the JASA paper "Off-Policy Evaluation in Doubly Inhomogeneous Environments". It aims to predict the value of a target policy using a pre-collected dataset. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | Most existing software packages for off-policy evaluation require the assumption of no unmeasured confounding, which can easily be violated in practice. Our implementation is motivated by the two-way fixed-effects model, widely used in the econometrics literature to handle unmeasured confounding across time and subjects, and therefore allows for unmeasured confounding. A toy sketch of the two-way fixed-effects idea follows this entry. |
| URL | https://github.com/ZeyuBian/2FEOPE |
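As noted in the Impact field, a toy sketch of the two-way fixed-effects idea follows. It simply double-demeans a simulated subject-by-time outcome matrix so that additive subject and time effects, one simple form of unmeasured confounding, are removed; the 2FEOPE algorithm itself operates in the sequential decision-making setting and is considerably more general.

```python
# Toy two-way (subject and time) fixed-effects demeaning of an outcome matrix.
# Illustrative only; the 2FEOPE package implements a far more general OPE method.
import numpy as np

def two_way_demean(Y):
    """Remove additive row (subject) and column (time) effects from Y."""
    row_means = Y.mean(axis=1, keepdims=True)
    col_means = Y.mean(axis=0, keepdims=True)
    grand_mean = Y.mean()
    return Y - row_means - col_means + grand_mean

rng = np.random.default_rng(1)
subject_effect = rng.normal(size=(20, 1))     # unobserved subject-level confounder
time_effect = rng.normal(size=(1, 30))        # unobserved time-level confounder
Y = subject_effect + time_effect + 0.1 * rng.standard_normal((20, 30))
residual = two_way_demean(Y)
print("residual std after demeaning:", residual.std())  # close to the 0.1 noise level
```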
| Title | COPE |
| Description | This software contains the implementation for the paper "Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process" (JASA, 2022) in Python. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software package is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data in infinite-horizon settings. Most existing packages assume that no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and the technology industry. |
| URL | https://github.com/Mamba413/cope |
| Title | COPP |
| Description | This software contains the official implementation of the AISTATS paper "Conformal Off-Policy Prediction". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing software packages focus on evaluating the expected return and provide a point estimator only. This software implements a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. The procedure accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. |
| URL | https://github.com/yyzhangecnu/COPP |
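The split-conformal idea that underlies interval prediction can be sketched in its simplest on-policy regression form as follows; the data and point predictor are simulated assumptions, and the sketch does not address the off-policy distribution shift that COPP is designed to handle.

```python
# Toy split-conformal prediction intervals for returns given a fitted point predictor.
# Illustrative only; COPP additionally corrects for the distribution shift induced
# by evaluating a policy different from the one that generated the data.
import numpy as np

def split_conformal_intervals(predict, X_calib, y_calib, X_test, alpha=0.1):
    """Symmetric intervals with approximately (1 - alpha) marginal coverage."""
    scores = np.abs(y_calib - predict(X_calib))            # calibration residuals
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)        # conformal quantile rank
    q = np.sort(scores)[k - 1]
    preds = predict(X_test)
    return preds - q, preds + q

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 1))
y = 2 * X[:, 0] + rng.standard_normal(300)
coeffs = np.polyfit(X[:200, 0], y[:200], deg=1)            # simple point predictor
predict = lambda Z: np.polyval(coeffs, Z[:, 0])
lo, hi = split_conformal_intervals(predict, X[200:250], y[200:250], X[250:])
print("empirical coverage:", np.mean((y[250:] >= lo) & (y[250:] <= hi)))
```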
| Title | CQSTVCM |
| Description | This software officially implements the A/B testing algorithm developed in the JASA paper "Dynamic conditional quantile treatment effects evaluation with applications to ridesharing". |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | Many modern tech companies, such as Google, Uber, and Didi, utilize online experiments (also known as A/B testing) to evaluate new policies against existing ones. While most existing software packages concentrate on average treatment effects, situations with skewed and heavy-tailed outcome distributions may benefit from alternative criteria, such as quantiles. However, assessing dynamic quantile treatment effects (QTE) remains a challenge, particularly when dealing with data from ride-sourcing platforms that involve sequential decision-making across time and space. Our package bridges the aforementioned gap by allowing practitioners and end-users to evaluate dynamic QTEs for A/B testing. |
| URL | https://github.com/BIG-S2/CQSTVCM |
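In its simplest static form, a quantile treatment effect is just the difference between the outcome quantiles of the treated and control arms, as in the toy sketch below with simulated heavy-tailed outcomes; CQSTVCM addresses the much harder dynamic, sequential-assignment setting, which this snippet does not attempt to capture.

```python
# Toy (static) quantile treatment effect: difference of empirical outcome quantiles
# between treatment and control arms. Illustrative only; CQSTVCM handles the
# dynamic, sequential-decision setting described above.
import numpy as np

def quantile_treatment_effect(y_treat, y_control, taus=(0.25, 0.5, 0.75, 0.9)):
    """QTE(tau) = Q_treat(tau) - Q_control(tau) at each requested quantile level."""
    taus = np.asarray(taus)
    return taus, np.quantile(y_treat, taus) - np.quantile(y_control, taus)

rng = np.random.default_rng(3)
y_control = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # heavy-tailed outcome
y_treat = rng.lognormal(mean=0.1, sigma=1.0, size=5000)
for tau, qte in zip(*quantile_treatment_effect(y_treat, y_control)):
    print(f"tau={tau:.2f}  QTE={qte:.3f}")
```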
| Title | CUSUM-RL |
| Description | This software contains the implementation for the AoS paper "Testing Stationarity and Change Point Detection in Reinforcement Learning" in Python (and R for plotting). It implements a consistent procedure to test the stationarity of the optimal policy based on pre-collected historical data, without additional online data collection. Based on this test, it further implements a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimisation in nonstationary environments. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software is concerned with reinforcement learning (RL) methods in offline nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption, which requires the system transition and the reward function to be constant over time. However, this assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. This software allows researchers and practitioners to apply the hypothesis testing and change point detection algorithms to estimate optimal policies in nonstationary environments. |
| URL | https://github.com/limengbinggz/CUSUM-RL |
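A heavily simplified version of the CUSUM idea, applied to the mean of a simulated scalar reward stream rather than to the optimal policy or Q-function studied in the paper, is sketched below to convey the flavour of the scan statistic.

```python
# Toy CUSUM-type scan for a single change point in the mean of a reward sequence.
# Illustrative only; the CUSUM-RL package tests stationarity of the optimal
# policy / Q-function, not just of a scalar mean.
import numpy as np

def cusum_change_point(rewards):
    """Return the split point maximising the standardised two-sample mean difference."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    best_stat, best_t = -np.inf, None
    for t in range(10, n - 10):                      # avoid tiny segments
        left, right = r[:t], r[t:]
        se_diff = np.sqrt(left.var(ddof=1) / t + right.var(ddof=1) / (n - t))
        stat = abs(left.mean() - right.mean()) / se_diff
        if stat > best_stat:
            best_stat, best_t = stat, t
    return best_t, best_stat

rng = np.random.default_rng(4)
rewards = np.concatenate([rng.normal(0.0, 1, 300), rng.normal(0.5, 1, 200)])
print(cusum_change_point(rewards))                   # change point near t = 300
```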
| Title | CausalMRL |
| Description | This repository contains the implementation for the AoAS paper "A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-Sided Markets" in Python. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | Two-sided markets such as ride-sharing platforms often involve a group of subjects making sequential decisions across time and/or location. With the rapid development of smartphones and the internet of things, these markets have substantially transformed the transportation landscape. This software implements policy evaluation algorithms that address the following challenges: (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. Existing algorithms often fail due to these challenges. |
| URL | https://github.com/RunzheStat/CausalMARL |
| Title | Data_Combination |
| Description | This software contains the implementation for the ICML paper "Combining Experimental and Historical Data for Policy Evaluation" in R. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | This software implements policy evaluation and A/B testing with multiple data sources, in scenarios with one two-arm experimental dataset complemented by a historical dataset generated under a single control arm. Unlike existing software that relies mainly on the experimental dataset, our software enables data integration, effectively combining the historical and experimental data to enhance policy evaluation and/or A/B testing. A toy illustration of pooling the two control arms follows this entry. |
| URL | https://github.com/tingstat/Data_Combination |
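One elementary ingredient of data integration is pooling the experimental control arm with a historical control dataset by inverse-variance weighting, as in the toy sketch below with simulated data; the ICML paper develops estimators that additionally guard against incompatibility between the two sources, which this sketch does not.

```python
# Toy inverse-variance pooling of the control-arm mean from an experiment with the
# mean from a historical control dataset. Illustrative only; it assumes the two
# sources are compatible, an assumption the paper's estimators do not require.
import numpy as np

def pooled_control_mean(y_exp_control, y_hist_control):
    """Precision-weighted combination of the two control-mean estimates."""
    means = np.array([np.mean(y_exp_control), np.mean(y_hist_control)])
    variances = np.array([np.var(y_exp_control, ddof=1) / len(y_exp_control),
                          np.var(y_hist_control, ddof=1) / len(y_hist_control)])
    weights = (1 / variances) / np.sum(1 / variances)
    return float(weights @ means), float(1 / np.sum(1 / variances))

rng = np.random.default_rng(5)
exp_control = rng.normal(1.0, 1.0, 200)      # small experimental control arm
hist_control = rng.normal(1.0, 1.0, 2000)    # larger historical control dataset
print(pooled_control_mean(exp_control, hist_control))  # tighter than either alone
```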
| Title | Double-CUSUM-RL |
| Description | This software contains the implementation for the paper "A Robust Test for the Stationarity Assumption in Sequential Decision Making" (ICML 2023) in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Reinforcement learning (RL) is a powerful technique that allows an autonomous agent to learn an optimal policy to maximize the expected return. The optimality of various RL algorithms implemented by existing software packages relies on the stationarity assumption, which requires time-invariant state transition and reward functions. However, deviations from stationarity over extended periods often occur in real-world applications like robotics control, health care and digital marketing, resulting in sub-optimal policies learned under stationary assumptions. This software implements a model-based doubly robust procedure for testing the stationarity assumption and detecting change points in offline RL settings. The procedure is robust to model misspecifications and can effectively control type-I error while achieving high statistical power, especially in high-dimensional settings. |
| URL | https://github.com/jtwang95/Double_CUSUM_RL |
| Title | IVMDP |
| Description | This software contains the official implementation of the ICML paper titled "An Instrumental Variable Approach to Confounded Off-Policy Evaluation". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. Most existing software requires the assumption of no unmeasured confounders, which can be easily violated. This software develops an instrumental variable (IV)-based algorithm for consistent OPE in the presence of unmeasured confounders. |
| URL | https://github.com/YangXU63/IVMDP |
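The instrumental-variable idea can be conveyed in its classical one-shot (non-MDP) form via two-stage least squares on simulated data, as sketched below; IVMDP extends IV reasoning to sequential, confounded off-policy evaluation, which this snippet does not attempt.

```python
# Toy two-stage least squares (2SLS): estimate the effect of a confounded action A
# on outcome Y using an instrument Z. Illustrative of the IV idea only; IVMDP
# handles the sequential MDP setting rather than this one-shot regression.
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
u = rng.standard_normal(n)                 # unmeasured confounder
z = rng.standard_normal(n)                 # instrument: affects A but not Y directly
a = 0.8 * z + u + 0.5 * rng.standard_normal(n)
y = 2.0 * a - 1.5 * u + rng.standard_normal(n)   # true causal effect of A is 2.0

# Stage 1: project A onto the instrument; Stage 2: regress Y on the projection.
Z = np.column_stack([np.ones(n), z])
a_hat = Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
beta_2sls = np.linalg.lstsq(np.column_stack([np.ones(n), a_hat]), y, rcond=None)[0]
beta_naive = np.linalg.lstsq(np.column_stack([np.ones(n), a]), y, rcond=None)[0]
print("naive OLS effect:", round(beta_naive[1], 3),
      " 2SLS effect:", round(beta_2sls[1], 3))
```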
| Title | MDP_design |
| Description | This software contains the implementation for the NeurIPS paper "Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making" in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | A/B testing is critical for modern technology companies to evaluate the effectiveness of newly developed products against standard baselines. Despite the popularity of A/B testing software, the design of the online experiments themselves has received less attention in the existing literature. This is the gap our software aims to fill. We offer implementations of two designs, assuming the data are generated by a Markov decision process and a non-MDP, respectively. A toy illustration of variance-aware treatment allocation follows this entry. |
| URL | https://github.com/tingstat/MDP_design |
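As flagged in the Impact field, a classical static precursor to such designs is Neyman allocation, which assigns samples across two arms in proportion to their estimated outcome standard deviations so as to minimise the variance of the estimated difference. The toy sketch below uses simulated pilot data; the paper's designs target the far harder sequential (MDP and non-MDP) settings.

```python
# Toy Neyman allocation for a static two-arm experiment: allocate the remaining
# budget proportionally to each arm's estimated outcome standard deviation.
# Illustrative only; the MDP_design package targets sequential experiments.
import numpy as np

def neyman_allocation(pilot_a, pilot_b, remaining_budget):
    """Split the remaining sample budget to minimise Var(mean_a - mean_b)."""
    sd_a, sd_b = np.std(pilot_a, ddof=1), np.std(pilot_b, ddof=1)
    n_a = int(round(remaining_budget * sd_a / (sd_a + sd_b)))
    return n_a, remaining_budget - n_a

rng = np.random.default_rng(7)
pilot_a = rng.normal(0.0, 1.0, 50)          # low-variance arm
pilot_b = rng.normal(0.0, 3.0, 50)          # high-variance arm gets more samples
print(neyman_allocation(pilot_a, pilot_b, remaining_budget=1000))  # roughly (250, 750)
```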
| Title | MediationRL |
| Description | This software is the official implementation of the ICML 2023 paper "A Reinforcement Learning Framework for Dynamic Mediation Analysis" in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Mediation analysis learns the causal effect transmitted from treatments to outcomes via mediator variables and has received increasing attention in various scientific domains for elucidating causal relations. Most existing software focuses on point-exposure studies where each subject receives only one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Employing a reinforcement learning (RL) framework, this software evaluates dynamic mediation effects over multiple time points. A toy single-time-point contrast is sketched after this entry. |
| URL | https://github.com/linlinlin97/MediationRL |
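For contrast with the dynamic setting handled by the package, the classical single-time-point linear decomposition estimates the indirect effect as the product of the treatment-to-mediator and mediator-to-outcome regression coefficients, as in the simulated sketch below; it is not the package's method.

```python
# Toy single-time-point linear mediation analysis: indirect effect via the
# product-of-coefficients rule. Illustrative only; the MediationRL package
# evaluates dynamic mediation effects over multiple time points.
import numpy as np

rng = np.random.default_rng(8)
n = 5000
treatment = rng.binomial(1, 0.5, n).astype(float)
mediator = 1.5 * treatment + rng.standard_normal(n)                  # A -> M
outcome = 2.0 * mediator + 0.5 * treatment + rng.standard_normal(n)  # M -> Y and A -> Y

def ols_coeffs(regressors, y):
    """Coefficients from ordinary least squares with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

a_coef = ols_coeffs([treatment], mediator)[1]               # effect of A on M
b_coef = ols_coeffs([treatment, mediator], outcome)[2]      # effect of M on Y given A
direct = ols_coeffs([treatment, mediator], outcome)[1]      # direct effect of A on Y
print("indirect effect:", round(a_coef * b_coef, 3), " direct effect:", round(direct, 3))
```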
| Title | MedtimeRL |
| Description | This software contains the official implementation of the AoS paper "Multivariate Dynamic Mediation Analysis Under a Reinforcement Learning Framework" in Python. It implements a novel multivariate dynamic mediation analysis approach when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. |
| Type Of Technology | Software |
| Year Produced | 2025 |
| Impact | To our knowledge, no existing software conducts mediation analysis in settings with multivariate and conditionally dependent mediators observed over multiple time points. Our package enables researchers and practitioners to infer the individual effect of each mediator across time. |
| URL | https://github.com/jtwang95/MedtimeRL/blob/main/README.md |
| Title | PBL |
| Description | This software contains the implementation for the AISTATS paper "Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach" in Python. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software implements a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes (DTRs) in the offline setting. Most existing algorithms are implemented in MDP settings, and DTRs have received less attention. Meanwhile, existing DTR algorithms and software packages developed in the statistics literature do not employ the pessimistic principle to address the distributional shift between the optimal policy and the behavior policy that generates the offline data. This software is built to address this gap. A toy illustration of the pessimistic principle follows this entry. |
| URL | https://github.com/yunzhe-zhou/PBL |
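As flagged in the Impact field, the pessimistic principle can be conveyed with a simple lower-confidence-bound rule on simulated offline data: act according to an uncertainty-penalised value estimate rather than the raw estimate. The sketch below is a bandit-style caricature, not the Bayesian DTR method the package implements.

```python
# Toy illustration of the pessimistic principle in offline learning: choose the
# action with the highest lower confidence bound on its estimated value, so that
# rarely observed actions are not selected purely through estimation noise.
# Illustrative only; PBL implements a Bayesian learning method for full DTRs.
import numpy as np

def pessimistic_action(rewards_by_action, penalty=2.0):
    """Pick the argmax of mean reward minus penalty times its standard error."""
    lcbs = {}
    for action, rewards in rewards_by_action.items():
        rewards = np.asarray(rewards, dtype=float)
        lcbs[action] = rewards.mean() - penalty * rewards.std(ddof=1) / np.sqrt(len(rewards))
    return max(lcbs, key=lcbs.get), lcbs

rng = np.random.default_rng(9)
offline_data = {"standard care": rng.normal(0.5, 1.0, 500),   # well-covered action
                "new regime": rng.normal(0.6, 1.0, 5)}        # barely observed action
print(pessimistic_action(offline_data))   # typically favours the well-covered action
```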
| Title | ROOM |
| Description | This software package contains the implementation of the AISTATS paper "Robust Offline Reinforcement Learning with Heavy-Tailed Rewards". |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | This software package considers offline reinforcement learning (RL) algorithms with heavy-tailed rewards. It implements two practical algorithms, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization, respectively. Although there are existing offline RL software packages available, their algorithms are less robust to heavy-tailed rewards when compared to ours. |
| URL | https://github.com/Mamba413/ROOM |
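The robustness issue with heavy-tailed rewards can be illustrated by the median-of-means estimator, which is far less sensitive to extreme rewards than the sample mean; a toy comparison on simulated data follows. ROAM and ROOM embed robust estimation inside full off-policy evaluation and policy-optimisation procedures; this sketch only shows why the plain sample mean is fragile.

```python
# Toy median-of-means estimate of an expected reward with heavy-tailed noise,
# compared against the ordinary sample mean. Illustrative only; it does not
# reproduce the ROAM/ROOM algorithms.
import numpy as np

def median_of_means(x, n_blocks=20):
    """Split the sample into equal-ish blocks, average each, take the median."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(10)
rewards = rng.standard_t(df=1.5, size=10_000)     # heavy-tailed, centred at 0
print("sample mean:", round(rewards.mean(), 3),
      " median of means:", round(median_of_means(rewards), 3))
```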
| Title | SEAL |
| Description | This software contains the official implementation of the JASA paper "Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons". |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software implements a state-of-the-art offline reinforcement learning algorithm, addressing a gap as most existing tools focus on online RL. It implements a sample-efficient advantage learning framework, enabling researchers and practitioners to improve the performance of existing offline Q-learning algorithms. |
| URL | https://github.com/leyuanheart/SEAL |
| Title | STVCM |
| Description | This package contains the official implementation of the JRSS-B paper "Policy Evaluation for Temporal and/or Spatial Dependent Experiments". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | This software package implements an A/B testing algorithm for technology companies based on data collected from temporal and/or spatial dependent experiments. Many existing software packages do not account for temporal or spatial interference effects, making them inapplicable for policy evaluation when applied to these experiments. Our software addresses this gap, enabling accurate A/B testing in these settings. |
| URL | https://github.com/anneyang0060/STVCM |
| Title | SUGAR |
| Description | This software contains the official implementation for the JASA paper "Testing Directed Acyclic Graph via Structural, Supervised and Generative Adversarial Learning" in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | This software package implements a new hypothesis testing method for directed acyclic graphs (DAGs). While there is a rich class of DAG estimation software, there is a relative paucity of DAG inference algorithms. Moreover, existing algorithms often impose specific model structures, such as linear or additive models, and assume independent data observations. Our test instead allows the associations among the random variables to be nonlinear and the data to be time-dependent. The test is implemented using highly flexible neural network learners. |
| URL | https://github.com/yunzhe-zhou/SUGAR |
| Title | Two-way-Deconfounder |
| Description | This software implements the two-way-deconfounder algorithm developed in the NeurIPS paper "Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning". |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | This software implements an advanced deconfounding-type algorithm for off-policy evaluation, which aims to estimate the expected return of a given target policy using data collected from a possibly different behavior policy. Most existing software either considers settings without unmeasured confounders or imposes strong structural assumptions. In contrast, our package allows for more flexible assumptions regarding these unmeasured confounders. |
| URL | https://github.com/fsmiu/Two-way-Deconfounder |
| Title | VEPO |
| Description | The software implements the reinforcement learning algorithm developed in the JASA paper "Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Most existing reinforcement learning algorithms consider online settings. Our software implements an offline reinforcement learning algorithm, which learns optimal policies from a pre-collected offline dataset. Given an initial policy computed by any existing offline algorithm, our software is designed to enhance its value. |
| URL | https://github.com/dc-wangjn/VEPO |
| Title | markov_test |
| Description | This software contains the implementation for the JRSSB paper "Testing for the Markov Property in Time Series via Deep Conditional Generative Learning " in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | The Markov property is widely imposed in time series analysis. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, is of paramount importance. This software implements a nonparametric testing procedure for the Markov property in high-dimensional time series via deep conditional generative learning. To our knowledge, limited software is available for testing the Markov assumption, particularly in high-dimensional settings. |
| URL | https://github.com/yunzhe-zhou/markov_test |