Statistical Methods in Offline Reinforcement Learning
Lead Research Organisation: London School of Economics and Political Science
Department Name: Statistics
Abstract
Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier conference in machine learning), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL both in depth and in breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL domains. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were motivated by online settings (e.g., video games), and how well they generalise to healthcare applications remains unknown. We also remark that our solutions will be transferable to other fields (e.g., robotics).
A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. Addressing this question involves at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited. With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are generally applicable to state-of-the-art RL algorithms and improve their statistical efficiency. For a given initial policy computed by an existing algorithm, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients might not share a common distribution function. We will study existing transfer learning methods in RL and develop new approaches designed for healthcare applications, based on our expertise in statistics.
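To make the offline setting concrete, the sketch below runs a generic tabular batch Q-iteration on a fixed set of logged transitions, with no further interaction with the environment. It is not the "value enhancement" method of the proposed research; the function names, the two-state toy dataset and the discount factor are all assumptions chosen for illustration.

```python
from collections import defaultdict


def fitted_q_iteration(transitions, n_actions, gamma=0.9, n_iters=200):
    """Tabular batch Q-iteration using only logged (s, a, r, s_next) tuples."""
    q = defaultdict(float)  # Q(s, a); pairs never seen in the data stay at 0
    for _ in range(n_iters):
        targets = defaultdict(list)
        for s, a, r, s_next in transitions:
            # Bellman optimality target computed from the previous iterate.
            best_next = max(q[(s_next, b)] for b in range(n_actions))
            targets[(s, a)].append(r + gamma * best_next)
        new_q = defaultdict(float)
        for key, values in targets.items():
            new_q[key] = sum(values) / len(values)  # average over logged samples
        q = new_q
    return q


def greedy_policy(q, states, n_actions):
    """Pick the action with the largest estimated Q-value in each state."""
    return {s: max(range(n_actions), key=lambda a: q[(s, a)]) for s in states}


# Hypothetical offline dataset: two states, two actions, deterministic dynamics.
transitions = [
    (0, 0, 0.0, 0),  # stay in state 0, no reward
    (0, 1, 0.0, 1),  # move to state 1, no reward
    (1, 0, 0.0, 0),  # fall back to state 0, no reward
    (1, 1, 1.0, 1),  # stay in state 1, reward 1
]

q = fitted_q_iteration(transitions, n_actions=2)
policy = greedy_policy(q, states=[0, 1], n_actions=2)
print(policy)
```

In this toy problem every state-action pair appears in the data. With limited or heterogeneous data, as discussed above, many pairs would be poorly covered, which is where statistical efficiency becomes the central concern.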
Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to estimate a target policy's expected return (value) using a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data and can be violated in observational datasets generated from healthcare applications. Moreover, because sample sizes are limited, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications, the outcome distribution is skewed and heavy-tailed, and criteria such as quantiles are more sensible than the mean. We will develop methodologies to learn the quantile curve of the return under a target policy and construct its associated confidence band.
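As a baseline illustration of OPE with an uncertainty estimate, the sketch below applies standard trajectory-wise importance sampling and a normal-approximation CI. It relies on exactly the assumptions the proposed research aims to relax: no unmeasured confounding and a known behaviour policy. The policies, reward probabilities and simulated data are hypothetical.

```python
import math
import random
import statistics


def is_value_estimate(trajectories, target_prob, behaviour_prob, gamma=1.0, z=1.96):
    """Importance sampling value estimate with a normal-approximation CI.

    trajectories: list of trajectories, each a list of (state, action, reward).
    """
    weighted_returns = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Cumulative ratio of target to behaviour action probabilities.
            weight *= target_prob(s, a) / behaviour_prob(s, a)
            ret += (gamma ** t) * r
        weighted_returns.append(weight * ret)
    estimate = statistics.fmean(weighted_returns)
    se = statistics.stdev(weighted_returns) / math.sqrt(len(weighted_returns))
    return estimate, (estimate - z * se, estimate + z * se)


rng = random.Random(0)  # fixed seed so the simulation is reproducible


def behaviour_prob(s, a):
    return 0.5  # assumed: logged data were generated by a uniform policy


def target_prob(s, a):
    return 0.9 if a == 1 else 0.1  # assumed target policy to be evaluated


def simulate(n, horizon=2):
    """Generate n trajectories under the behaviour policy (single state 0)."""
    data = []
    for _ in range(n):
        traj = []
        for _ in range(horizon):
            a = 1 if rng.random() < 0.5 else 0
            r = 1.0 if rng.random() < (0.8 if a == 1 else 0.2) else 0.0
            traj.append((0, a, r))
        data.append(traj)
    return data


trajectories = simulate(5000)
estimate, (lower, upper) = is_value_estimate(trajectories, target_prob, behaviour_prob)
true_value = 2 * (0.9 * 0.8 + 0.1 * 0.2)  # 1.48 under the target policy
print(estimate, (lower, upper), true_value)
```

If a latent variable influenced both the logged actions and the rewards, the ratios above would no longer correct for the difference between the two policies, and the interval could fail to cover the true value.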
People
Chengchun Shi (Principal Investigator)
Publications
Cai H (2023) Jump Interval-Learning for Individualized Decision Making, in Journal of Machine Learning Research
Li T (2024) Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing, in Journal of the American Statistical Association
Luo S (2024) Policy evaluation for temporal and/or spatial dependent experiments, in Journal of the Royal Statistical Society Series B: Statistical Methodology
Shi C (2022) Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process, in Journal of the American Statistical Association
Shi C (2023) Testing Directed Acyclic Graph via Structural, Supervised and Generative Adversarial Learning, in Journal of the American Statistical Association
Shi C (2023) Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization, in Journal of the American Statistical Association
Shi C (2023) A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets, in The Annals of Applied Statistics