Statistical Methods in Offline Reinforcement Learning
Lead Research Organisation:
London School of Economics and Political Science
Department Name: Statistics
Abstract
Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier conference in machine learning), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in depth and in breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were developed for online settings (e.g., video games), and their generalisations to healthcare applications remain largely unexplored. We also remark that our solutions will be transferable to other fields (e.g., robotics).
A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. Solving this question faces at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited. With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are broadly applicable to state-of-the-art RL algorithms and improve their statistical efficiency. For a given initial policy computed by an existing algorithm, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating over many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients may not share a common distribution. We will study existing transfer learning methods in RL and develop new approaches designed for healthcare applications, drawing on our expertise in statistics.
Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption made in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data and can be violated in observational datasets arising from healthcare applications. Moreover, given the limited sample size, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications the outcome distribution is skewed and heavy-tailed, and criteria such as quantiles are more sensible than the mean. We will develop methodologies to learn the quantile curve of the return under a target policy and construct its associated confidence band.
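To make the OPE objective concrete, here is a minimal Python sketch, assuming logged trajectories of (state, action, reward) triples and known behaviour and target action probabilities, that estimates a target policy's value by per-decision importance sampling and attaches a 95% normal-approximation confidence interval. It illustrates the problem setup only; it is not one of the estimators the proposed research will develop, and it does not handle unmeasured confounding or heavy-tailed returns.

```python
# Toy per-decision importance-sampling OPE with a 95% normal-approximation CI.
# The trajectory format and the behaviour/target probability functions are
# illustrative assumptions, not part of the proposed methodology.
import numpy as np

def trajectory_value_is(trajectory, behaviour_prob, target_prob, gamma=0.99):
    """Importance-weighted discounted return of one logged trajectory."""
    weight, value, discount = 1.0, 0.0, 1.0
    for state, action, reward in trajectory:
        weight *= target_prob(state, action) / behaviour_prob(state, action)
        value += discount * weight * reward      # per-decision importance sampling
        discount *= gamma
    return value

def ope_estimate_with_ci(trajectories, behaviour_prob, target_prob, gamma=0.99):
    """Point estimate and 95% normal-approximation CI for the target policy's value."""
    estimates = np.array([trajectory_value_is(tau, behaviour_prob, target_prob, gamma)
                          for tau in trajectories])
    mean = estimates.mean()
    half_width = 1.96 * estimates.std(ddof=1) / np.sqrt(len(estimates))
    return mean, (mean - half_width, mean + half_width)
```

Such importance-sampling estimators can suffer from very high variance when the offline sample is small, which is one motivation for the statistically efficient and confounding-robust methods described above.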
People |
| Chengchun Shi (Principal Investigator) |
Publications
Bian Z
(2024)
Off-Policy Evaluation in Doubly Inhomogeneous Environments
in Journal of the American Statistical Association
Cai H.
(2023)
Jump Interval-Learning for Individualized Decision Making with Continuous Treatments
in Journal of Machine Learning Research
Ge L.
(2023)
A Reinforcement Learning Framework for Dynamic Mediation Analysis
in Proceedings of Machine Learning Research
Li M
(2025)
Testing Stationarity and Change Point Detection in Reinforcement Learning
in The Annals of Statistics
Li T
(2024)
Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing
in Journal of the American Statistical Association
Li T.
(2023)
Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making
in Advances in Neural Information Processing Systems
| Description | Our research focuses on reinforcement learning (RL), a cutting-edge area of artificial intelligence (AI) that teaches machines how to make sequential decisions over time to achieve the best outcomes. RL has been widely studied in computer science, with applications ranging from teaching computers to play complex games to training robots. However, the field of statistics had not fully explored RL until recently. Thanks to this grant, our research has successfully bridged the gap between statistics and RL, substantially enriching the statistical tools available for RL and making RL algorithms more reliable, interpretable, and effective. The main tools developed include: 1. Hypothesis Testing: The performance of RL algorithms depends heavily on two key data assumptions: (i) the Markov assumption (that future states depend on the history only through the present) and (ii) the stationarity assumption (that the environment does not change over time). Our research has developed rigorous statistical tests to verify these assumptions. More importantly, by leveraging these tests, we can successfully improve the performance of modern RL algorithms. 2. Confidence Intervals: In many real-world applications, newly developed policies must be evaluated offline before being deployed. This process, known as off-policy evaluation (OPE), often requires not just a point estimator of a target policy's return but also a measure of uncertainty around that estimate. Our research has developed advanced methods for constructing confidence intervals in OPE, quantifying the reliability of these estimates. 3. Experimental Design: While much attention has been given to developing OPE methods in RL, far less has been paid to how to generate the experimental data used for evaluation. Our research has introduced innovative algorithms for designing online experiments, ensuring that the data collected is optimal for accurate policy evaluation. Additionally, we have adapted the aforementioned methods for A/B testing in two-sided marketplaces, such as ride-sharing and e-commerce platforms, that involve sequential decision making over time. This research has highlighted the importance of statistics in RL, encouraging more statisticians to engage with this field and amplifying the voice of statisticians within the RL and broader AI communities. It has been presented at leading universities in the UK and abroad, including Stanford, Berkeley, Oxford, and Peking University, and has resulted in over 20 high-impact publications, including 10 papers in top statistics journals (Annals of Statistics, Journal of the Royal Statistical Society Series B, and Journal of the American Statistical Association) and publications in prestigious machine learning venues (ICML, NeurIPS, JMLR, and AISTATS). The research has also been recognized with prestigious awards, including the 2024 IMS Tweedie Award and the 2023 ICSA Outstanding Young Research Award. |
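The formal tests referred to above are developed in the associated publications. As a much cruder illustration of the first assumption, the sketch below (a toy diagnostic under an assumed linear ridge predictor, not the project's testing procedure) compares held-out prediction error when the next observation of a scalar series is predicted from one lag versus two lags; a large gain from the second lag would cast doubt on the first-order Markov assumption.

```python
# Crude heuristic check of the first-order Markov assumption for a scalar series:
# compare held-out prediction error using one lag versus two lags.
# Illustration only, not the formal tests developed in the project.
import numpy as np

def lagged_design(x, order):
    """Build (X, y) where each row of X holds `order` lags predicting the next value."""
    X = np.column_stack([x[order - k - 1 : len(x) - k - 1] for k in range(order)])
    y = x[order:]
    return X, y

def holdout_mse(x, order, train_frac=0.7, ridge=1e-6):
    """Ridge least-squares fit on the first part of the series, MSE on the rest."""
    X, y = lagged_design(np.asarray(x, dtype=float), order)
    X = np.column_stack([np.ones(len(X)), X])            # intercept column
    split = int(train_frac * len(y))
    beta = np.linalg.solve(X[:split].T @ X[:split] + ridge * np.eye(X.shape[1]),
                           X[:split].T @ y[:split])
    resid = y[split:] - X[split:] @ beta
    return float(np.mean(resid ** 2))

rng = np.random.default_rng(0)
series = rng.standard_normal(500).cumsum()               # toy trajectory
print("1-lag MSE:", holdout_mse(series, 1), " 2-lag MSE:", holdout_mse(series, 2))
```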
| Exploitation Route | The outcomes of this funding have the potential to be widely adopted across academia and industry, advancing both research and practical applications. Key ways these outcomes might be taken forward include: Academic Collaboration: The developed statistical tools, such as hypothesis testing, confidence intervals, and experimental design for RL, bridge statistics and RL, fostering collaboration between statisticians and AI researchers. Statisticians can apply their expertise to enrich RL methodologies, while RL researchers can employ these tools for algorithm development and refinement. Two-Sided Marketplaces: Ride-sharing and e-commerce platforms can use the proposed A/B testing tools to evaluate new policies. Case studies in our published papers demonstrate the effectiveness of the proposed approaches in ride-sharing companies such as Uber and Lyft. Open-source code is also available for direct use by industry practitioners. Healthcare: The developed RL algorithms can be deployed for personalized treatment recommendations and drug evaluation in pharmaceuticals and medical biotechnology. Our collaboration with health scientists has demonstrated their use in mobile health for devising and evaluating optimal treatment strategies. Published code allows health scientists to directly apply these tools. In summary, this research makes RL algorithms more reliable, interpretable, and effective, paving the way for safer and more efficient AI systems, ultimately benefiting society as a whole. |
| Sectors | Digital/Communication/Information Technologies (including Software) Healthcare Pharmaceuticals and Medical Biotechnology Transport |
| Description | The proposed research has pioneered a new area at the intersection of statistics and reinforcement learning (RL), sparking a wave of subsequent work in developing statistical methods for RL and their adaptations for A/B testing in two-sided marketplaces. |
| Title | 2FEOPE |
| Description | This software package contains the implementation of the algorithm developed by the JASA paper "Off-Policy Evaluation in Doubly Inhomogeneous Environments". It aims to predict the value of a target policy using a pre-collected dataset. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | Most existing software packages for off-policy evaluation require the assumption of no unmeasured confounding, which can easily be violated in practice. Our implementation is motivated by the two-way fixed-effects model, widely used in the econometrics literature to handle unmeasured confounding across time and subjects, and therefore allows for unmeasured confounding. A toy sketch of the two-way fixed-effects idea follows this entry. |
| URL | https://github.com/ZeyuBian/2FEOPE |
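As noted in the Impact field, a toy sketch of the two-way fixed-effects idea follows. It simply double-demeans a simulated subject-by-time outcome matrix so that additive subject and time effects, one simple form of unmeasured confounding, are removed; the 2FEOPE algorithm itself operates in the sequential decision-making setting and is considerably more general.

```python
# Toy two-way (subject and time) fixed-effects demeaning of an outcome matrix.
# Illustrative only; the 2FEOPE package implements a far more general OPE method.
import numpy as np

def two_way_demean(Y):
    """Remove additive row (subject) and column (time) effects from Y."""
    row_means = Y.mean(axis=1, keepdims=True)
    col_means = Y.mean(axis=0, keepdims=True)
    grand_mean = Y.mean()
    return Y - row_means - col_means + grand_mean

rng = np.random.default_rng(1)
subject_effect = rng.normal(size=(20, 1))     # unobserved subject-level confounder
time_effect = rng.normal(size=(1, 30))        # unobserved time-level confounder
Y = subject_effect + time_effect + 0.1 * rng.standard_normal((20, 30))
residual = two_way_demean(Y)
print("residual std after demeaning:", residual.std())  # close to the 0.1 noise level
```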
| Title | COPE |
| Description | This software contains the implementation for the paper "Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process" (JASA, 2022) in Python. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software package is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data in infinite-horizon settings. Most existing packages assume that no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and the technology industry. |
| URL | https://github.com/Mamba413/cope |
| Title | COPP |
| Description | This software contains the official implementation of the AISTATS paper "Conformal Off-Policy Prediction". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing software packages focus on evaluating the expected return and provide a point estimator only. This software implements a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. The procedure accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. |
| URL | https://github.com/yyzhangecnu/COPP |
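The split-conformal idea that underlies interval prediction can be sketched in its simplest on-policy regression form as follows; the data and point predictor are simulated assumptions, and the sketch does not address the off-policy distribution shift that COPP is designed to handle.

```python
# Toy split-conformal prediction intervals for returns given a fitted point predictor.
# Illustrative only; COPP additionally corrects for the distribution shift induced
# by evaluating a policy different from the one that generated the data.
import numpy as np

def split_conformal_intervals(predict, X_calib, y_calib, X_test, alpha=0.1):
    """Symmetric intervals with approximately (1 - alpha) marginal coverage."""
    scores = np.abs(y_calib - predict(X_calib))            # calibration residuals
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)        # conformal quantile rank
    q = np.sort(scores)[k - 1]
    preds = predict(X_test)
    return preds - q, preds + q

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 1))
y = 2 * X[:, 0] + rng.standard_normal(300)
coeffs = np.polyfit(X[:200, 0], y[:200], deg=1)            # simple point predictor
predict = lambda Z: np.polyval(coeffs, Z[:, 0])
lo, hi = split_conformal_intervals(predict, X[200:250], y[200:250], X[250:])
print("empirical coverage:", np.mean((y[250:] >= lo) & (y[250:] <= hi)))
```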
| Title | CQSTVCM |
| Description | This software officially implements the A/B testing algorithm developed in the JASA paper "Dynamic conditional quantile treatment effects evaluation with applications to ridesharing". |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | Many modern tech companies, such as Google, Uber, and Didi, utilize online experiments (also known as A/B testing) to evaluate new policies against existing ones. While most existing software packages concentrate on average treatment effects, situations with skewed and heavy-tailed outcome distributions may benefit from alternative criteria, such as quantiles. However, assessing dynamic quantile treatment effects (QTE) remains a challenge, particularly when dealing with data from ride-sourcing platforms that involve sequential decision-making across time and space. Our package bridges the aforementioned gap by allowing practitioners and end-users to evaluate dynamic QTEs for A/B testing. |
| URL | https://github.com/BIG-S2/CQSTVCM |
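In its simplest static form, a quantile treatment effect is just the difference between the outcome quantiles of the treated and control arms, as in the toy sketch below with simulated heavy-tailed outcomes; CQSTVCM addresses the much harder dynamic, sequential-assignment setting, which this snippet does not attempt to capture.

```python
# Toy (static) quantile treatment effect: difference of empirical outcome quantiles
# between treatment and control arms. Illustrative only; CQSTVCM handles the
# dynamic, sequential-decision setting described above.
import numpy as np

def quantile_treatment_effect(y_treat, y_control, taus=(0.25, 0.5, 0.75, 0.9)):
    """QTE(tau) = Q_treat(tau) - Q_control(tau) at each requested quantile level."""
    taus = np.asarray(taus)
    return taus, np.quantile(y_treat, taus) - np.quantile(y_control, taus)

rng = np.random.default_rng(3)
y_control = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # heavy-tailed outcome
y_treat = rng.lognormal(mean=0.1, sigma=1.0, size=5000)
for tau, qte in zip(*quantile_treatment_effect(y_treat, y_control)):
    print(f"tau={tau:.2f}  QTE={qte:.3f}")
```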
| Title | CUSUM-RL |
| Description | This software contains the implementation for the AoS paper "Testing Stationarity and Change Point Detection in Reinforcement Learning" in Python (and R for plotting). It implements a consistent procedure to test the stationarity of the optimal policy based on pre-collected historical data, without additional online data collection. Based on this test, it further implements a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimisation in nonstationary environments. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software is concerned with reinforcement learning (RL) methods in offline nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption, which requires the system transition and the reward function to be constant over time. However, this assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. This software allows researchers and practitioners to apply the hypothesis testing and change point detection algorithms to estimate optimal policies in nonstationary environments. |
| URL | https://github.com/limengbinggz/CUSUM-RL |
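A heavily simplified version of the CUSUM idea, applied to the mean of a simulated scalar reward stream rather than to the optimal policy or Q-function studied in the paper, is sketched below to convey the flavour of the scan statistic.

```python
# Toy CUSUM-type scan for a single change point in the mean of a reward sequence.
# Illustrative only; the CUSUM-RL package tests stationarity of the optimal
# policy / Q-function, not just of a scalar mean.
import numpy as np

def cusum_change_point(rewards):
    """Return the split point maximising the standardised two-sample mean difference."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    best_stat, best_t = -np.inf, None
    for t in range(10, n - 10):                      # avoid tiny segments
        left, right = r[:t], r[t:]
        se_diff = np.sqrt(left.var(ddof=1) / t + right.var(ddof=1) / (n - t))
        stat = abs(left.mean() - right.mean()) / se_diff
        if stat > best_stat:
            best_stat, best_t = stat, t
    return best_t, best_stat

rng = np.random.default_rng(4)
rewards = np.concatenate([rng.normal(0.0, 1, 300), rng.normal(0.5, 1, 200)])
print(cusum_change_point(rewards))                   # change point near t = 300
```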
| Title | CausalMRL |
| Description | This repository contains the implementation for the AoAS paper "A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-Sided Markets" in Python. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | Two-sided markets such as ride-sharing platforms often involve a group of subjects making sequential decisions across time and/or location. With the rapid development of smartphones and the internet of things, these markets have substantially transformed the transportation landscape. This software implements policy evaluation algorithms that address the following challenges: (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. Existing algorithms often fail due to these challenges. |
| URL | https://github.com/RunzheStat/CausalMARL |
| Title | Data_Combination |
| Description | This software contains the implementation for the ICML paper "Combining Experimental and Historical Data for Policy Evaluation" in R. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | This software implements policy evaluation and A/B testing with multiple data sources, in scenarios with one two-arm experimental dataset complemented by a historical dataset generated under a single control arm. Unlike existing software that relies mainly on the experimental dataset, our software enables data integration, effectively combining the historical and experimental data to enhance policy evaluation and/or A/B testing. A toy illustration of pooling the two control arms follows this entry. |
| URL | https://github.com/tingstat/Data_Combination |
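One elementary ingredient of data integration is pooling the experimental control arm with a historical control dataset by inverse-variance weighting, as in the toy sketch below with simulated data; the ICML paper develops estimators that additionally guard against incompatibility between the two sources, which this sketch does not.

```python
# Toy inverse-variance pooling of the control-arm mean from an experiment with the
# mean from a historical control dataset. Illustrative only; it assumes the two
# sources are compatible, an assumption the paper's estimators do not require.
import numpy as np

def pooled_control_mean(y_exp_control, y_hist_control):
    """Precision-weighted combination of the two control-mean estimates."""
    means = np.array([np.mean(y_exp_control), np.mean(y_hist_control)])
    variances = np.array([np.var(y_exp_control, ddof=1) / len(y_exp_control),
                          np.var(y_hist_control, ddof=1) / len(y_hist_control)])
    weights = (1 / variances) / np.sum(1 / variances)
    return float(weights @ means), float(1 / np.sum(1 / variances))

rng = np.random.default_rng(5)
exp_control = rng.normal(1.0, 1.0, 200)      # small experimental control arm
hist_control = rng.normal(1.0, 1.0, 2000)    # larger historical control dataset
print(pooled_control_mean(exp_control, hist_control))  # tighter than either alone
```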
| Title | Double-CUSUM-RL |
| Description | This software contains the implementation for the paper "A Robust Test for the Stationarity Assumption in Sequential Decision Making" (ICML 2023) in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Reinforcement learning (RL) is a powerful technique that allows an autonomous agent to learn an optimal policy to maximize the expected return. The optimality of various RL algorithms implemented by existing software packages relies on the stationarity assumption, which requires time-invariant state transition and reward functions. However, deviations from stationarity over extended periods often occur in real-world applications like robotics control, health care and digital marketing, resulting in sub-optimal policies learned under stationary assumptions. This software implements a model-based doubly robust procedure for testing the stationarity assumption and detecting change points in offline RL settings. The procedure is robust to model misspecifications and can effectively control type-I error while achieving high statistical power, especially in high-dimensional settings. |
| URL | https://github.com/jtwang95/Double_CUSUM_RL |
| Title | IVMDP |
| Description | This software contains the official implementation of the ICML paper titled "An Instrumental Variable Approach to Confounded Off-Policy Evaluation". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. Most existing software requires the assumption of no unmeasured confounders, which can be easily violated. This software develops an instrumental variable (IV)-based algorithm for consistent OPE in the presence of unmeasured confounders. |
| URL | https://github.com/YangXU63/IVMDP |
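The instrumental-variable idea can be conveyed in its classical one-shot (non-MDP) form via two-stage least squares on simulated data, as sketched below; IVMDP extends IV reasoning to sequential, confounded off-policy evaluation, which this snippet does not attempt.

```python
# Toy two-stage least squares (2SLS): estimate the effect of a confounded action A
# on outcome Y using an instrument Z. Illustrative of the IV idea only; IVMDP
# handles the sequential MDP setting rather than this one-shot regression.
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
u = rng.standard_normal(n)                 # unmeasured confounder
z = rng.standard_normal(n)                 # instrument: affects A but not Y directly
a = 0.8 * z + u + 0.5 * rng.standard_normal(n)
y = 2.0 * a - 1.5 * u + rng.standard_normal(n)   # true causal effect of A is 2.0

# Stage 1: project A onto the instrument; Stage 2: regress Y on the projection.
Z = np.column_stack([np.ones(n), z])
a_hat = Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
beta_2sls = np.linalg.lstsq(np.column_stack([np.ones(n), a_hat]), y, rcond=None)[0]
beta_naive = np.linalg.lstsq(np.column_stack([np.ones(n), a]), y, rcond=None)[0]
print("naive OLS effect:", round(beta_naive[1], 3),
      " 2SLS effect:", round(beta_2sls[1], 3))
```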
| Title | MDP_design |
| Description | This software contains the implementation for the NeurIPS paper "Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making" in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | A/B testing is critical for modern technology companies to evaluate the effectiveness of newly developed products against standard baselines. Despite the popularity of A/B testing software, the design of the online experiments themselves has received less attention in the existing literature. This is the gap our software aims to fill. We offer implementations of two designs, assuming the data are generated by a Markov decision process and a non-MDP, respectively. A toy illustration of variance-aware treatment allocation follows this entry. |
| URL | https://github.com/tingstat/MDP_design |
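As flagged in the Impact field, a classical static precursor to such designs is Neyman allocation, which assigns samples across two arms in proportion to their estimated outcome standard deviations so as to minimise the variance of the estimated difference. The toy sketch below uses simulated pilot data; the paper's designs target the far harder sequential (MDP and non-MDP) settings.

```python
# Toy Neyman allocation for a static two-arm experiment: allocate the remaining
# budget proportionally to each arm's estimated outcome standard deviation.
# Illustrative only; the MDP_design package targets sequential experiments.
import numpy as np

def neyman_allocation(pilot_a, pilot_b, remaining_budget):
    """Split the remaining sample budget to minimise Var(mean_a - mean_b)."""
    sd_a, sd_b = np.std(pilot_a, ddof=1), np.std(pilot_b, ddof=1)
    n_a = int(round(remaining_budget * sd_a / (sd_a + sd_b)))
    return n_a, remaining_budget - n_a

rng = np.random.default_rng(7)
pilot_a = rng.normal(0.0, 1.0, 50)          # low-variance arm
pilot_b = rng.normal(0.0, 3.0, 50)          # high-variance arm gets more samples
print(neyman_allocation(pilot_a, pilot_b, remaining_budget=1000))  # roughly (250, 750)
```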
| Title | MediationRL |
| Description | This software is the official implementation of the ICML 2023 paper "A Reinforcement Learning Framework for Dynamic Mediation Analysis" in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Mediation analysis learns the causal effect transmitted from treatments to outcomes via mediator variables and has received increasing attention in various scientific domains for elucidating causal relations. Most existing software focuses on point-exposure studies where each subject receives only one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Employing a reinforcement learning (RL) framework, this software evaluates dynamic mediation effects over multiple time points. A toy single-time-point contrast is sketched after this entry. |
| URL | https://github.com/linlinlin97/MediationRL |
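For contrast with the dynamic setting handled by the package, the classical single-time-point linear decomposition estimates the indirect effect as the product of the treatment-to-mediator and mediator-to-outcome regression coefficients, as in the simulated sketch below; it is not the package's method.

```python
# Toy single-time-point linear mediation analysis: indirect effect via the
# product-of-coefficients rule. Illustrative only; the MediationRL package
# evaluates dynamic mediation effects over multiple time points.
import numpy as np

rng = np.random.default_rng(8)
n = 5000
treatment = rng.binomial(1, 0.5, n).astype(float)
mediator = 1.5 * treatment + rng.standard_normal(n)                  # A -> M
outcome = 2.0 * mediator + 0.5 * treatment + rng.standard_normal(n)  # M -> Y and A -> Y

def ols_coeffs(regressors, y):
    """Coefficients from ordinary least squares with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

a_coef = ols_coeffs([treatment], mediator)[1]               # effect of A on M
b_coef = ols_coeffs([treatment, mediator], outcome)[2]      # effect of M on Y given A
direct = ols_coeffs([treatment, mediator], outcome)[1]      # direct effect of A on Y
print("indirect effect:", round(a_coef * b_coef, 3), " direct effect:", round(direct, 3))
```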
| Title | MedtimeRL |
| Description | This software contains the official implementation of the AoS paper "Multivariate Dynamic Mediation Analysis Under a Reinforcement Learning Framework" in Python. It implements a novel multivariate dynamic mediation analysis approach when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. |
| Type Of Technology | Software |
| Year Produced | 2025 |
| Impact | To our knowledge, no existing software conducts mediation analysis in settings with multivariate and conditionally dependent mediators observed over multiple time points. Our package enables researchers and practitioners to infer the individual effect of each mediator across time. |
| URL | https://github.com/jtwang95/MedtimeRL/blob/main/README.md |
| Title | PBL |
| Description | This software contains the implementation for the AISTATS paper "Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach" in Python. |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software implements a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes (DTRs) in the offline setting. Most existing algorithms are implemented in MDP settings, and DTRs have received less attention. Meanwhile, existing DTR algorithms and software packages developed in the statistics literature do not employ the pessimistic principle to address the distributional shift between the optimal policy and the behavior policy that generates the offline data. This software is built to address this gap. A toy illustration of the pessimistic principle follows this entry. |
| URL | https://github.com/yunzhe-zhou/PBL |
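As flagged in the Impact field, the pessimistic principle can be conveyed with a simple lower-confidence-bound rule on simulated offline data: act according to an uncertainty-penalised value estimate rather than the raw estimate. The sketch below is a bandit-style caricature, not the Bayesian DTR method the package implements.

```python
# Toy illustration of the pessimistic principle in offline learning: choose the
# action with the highest lower confidence bound on its estimated value, so that
# rarely observed actions are not selected purely through estimation noise.
# Illustrative only; PBL implements a Bayesian learning method for full DTRs.
import numpy as np

def pessimistic_action(rewards_by_action, penalty=2.0):
    """Pick the argmax of mean reward minus penalty times its standard error."""
    lcbs = {}
    for action, rewards in rewards_by_action.items():
        rewards = np.asarray(rewards, dtype=float)
        lcbs[action] = rewards.mean() - penalty * rewards.std(ddof=1) / np.sqrt(len(rewards))
    return max(lcbs, key=lcbs.get), lcbs

rng = np.random.default_rng(9)
offline_data = {"standard care": rng.normal(0.5, 1.0, 500),   # well-covered action
                "new regime": rng.normal(0.6, 1.0, 5)}        # barely observed action
print(pessimistic_action(offline_data))   # typically favours the well-covered action
```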
| Title | ROOM |
| Description | This software package contains the implementation of the AISTATS paper "Robust Offline Reinforcement Learning with Heavy-Tailed Rewards". |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | This software package considers offline reinforcement learning (RL) algorithms with heavy-tailed rewards. It implements two practical algorithms, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization, respectively. Although there are existing offline RL software packages available, their algorithms are less robust to heavy-tailed rewards when compared to ours. |
| URL | https://github.com/Mamba413/ROOM |
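The robustness issue with heavy-tailed rewards can be illustrated by the median-of-means estimator, which is far less sensitive to extreme rewards than the sample mean; a toy comparison on simulated data follows. ROAM and ROOM embed robust estimation inside full off-policy evaluation and policy-optimisation procedures; this sketch only shows why the plain sample mean is fragile.

```python
# Toy median-of-means estimate of an expected reward with heavy-tailed noise,
# compared against the ordinary sample mean. Illustrative only; it does not
# reproduce the ROAM/ROOM algorithms.
import numpy as np

def median_of_means(x, n_blocks=20):
    """Split the sample into equal-ish blocks, average each, take the median."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(10)
rewards = rng.standard_t(df=1.5, size=10_000)     # heavy-tailed, centred at 0
print("sample mean:", round(rewards.mean(), 3),
      " median of means:", round(median_of_means(rewards), 3))
```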
| Title | SEAL |
| Description | This software contains the official implementation of the JASA paper "Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons". |
| Type Of Technology | Software |
| Year Produced | 2022 |
| Impact | This software implements a state-of-the-art offline reinforcement learning algorithm, addressing a gap as most existing tools focus on online RL. It implements a sample-efficient advantage learning framework, enabling researchers and practitioners to improve the performance of existing offline Q-learning algorithms. |
| URL | https://github.com/leyuanheart/SEAL |
| Title | STVCM |
| Description | This package contains the official implementation of the JRSS-B paper "Policy Evaluation for Temporal and/or Spatial Dependent Experiments". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | This software package implements an A/B testing algorithm for technology companies based on data collected from temporal and/or spatial dependent experiments. Many existing software packages do not account for temporal or spatial interference effects, making them inapplicable for policy evaluation when applied to these experiments. Our software addresses this gap, enabling accurate A/B testing in these settings. |
| URL | https://github.com/anneyang0060/STVCM |
| Title | SUGAR |
| Description | This software contains the official implementation for the JASA paper "Testing Directed Acyclic Graph via Structural, Supervised and Generative Adversarial Learning" in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | This software package implements a new hypothesis testing method for directed acyclic graphs (DAGs). While there is a rich class of DAG estimation software, there is a relative paucity of DAG inference algorithms. Moreover, existing algorithms often impose specific model structures, such as linear or additive models, and assume independent data observations. Our test instead allows the associations among the random variables to be nonlinear and the data to be time-dependent. The test is implemented using highly flexible neural network learners. |
| URL | https://github.com/yunzhe-zhou/SUGAR |
| Title | Two-way-Deconfounder |
| Description | This software implements the two-way-deconfounder algorithm developed in the NeurIPS paper "Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning". |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Impact | This software implements an advanced deconfounding-type algorithm for off-policy evaluation, which aims to estimate the expected return of a given target policy using data collected from a possibly different behavior policy. Most existing software either considers settings without unmeasured confounders or imposes strong structural assumptions. In contrast, our package allows for more flexible assumptions regarding these unmeasured confounders. |
| URL | https://github.com/fsmiu/Two-way-Deconfounder |
| Title | VEPO |
| Description | The software implements the reinforcement learning algorithm developed in the JASA paper "Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization". |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | Most existing reinforcement learning algorithms consider online settings. Our software implements an offline reinforcement learning algorithm, which learns optimal policies from a pre-collected offline dataset. Given an initial policy computed by any existing offline algorithm, our software is designed to enhance its value. |
| URL | https://github.com/dc-wangjn/VEPO |
| Title | markov_test |
| Description | This software contains the implementation for the JRSSB paper "Testing for the Markov Property in Time Series via Deep Conditional Generative Learning " in Python. |
| Type Of Technology | Software |
| Year Produced | 2023 |
| Impact | The Markov property is widely imposed in time series analysis. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, is of paramount importance. This software implements a nonparametric testing procedure for the Markov property in high-dimensional time series via deep conditional generative learning. To our knowledge, limited software is available for testing the Markov assumption, particularly in high-dimensional settings. |
| URL | https://github.com/yunzhe-zhou/markov_test |