Offline reinforcement learning

Lead Research Organisation: University of Warwick
Department Name: Mathematics

Abstract

The context of the research
Offline deep reinforcement learning (offline-DRL) uses a static data set to optimize sequential decision-making in an environment through a data-driven formulation (Levine et al., 2020). Mathematically, the environment is described as a Markov decision process, and the goal is to maximize the long-term reward (Sutton and Barto, 2018). Offline-DRL can be more useful than (online) DRL when collecting new data is time-consuming, expensive and/or dangerous. However, offline-DRL introduces new challenges, such as selecting actions that do not appear in the data set (known as extrapolation error or distributional shift) (Fujimoto et al., 2019). Offline-DRL can be seen as a version of off-policy DRL in which there is no longer any interaction with the environment. Simply taking state-of-the-art off-policy (online) algorithms such as deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) and soft actor-critic (SAC) (Haarnoja et al., 2018) and removing their exploration capabilities causes overfitting (Fu et al., 2019) and issues with out-of-distribution actions (Kumar et al., 2019).

Many approaches to date use a variational auto-encoder (VAE) (Kingma and Welling, 2013) to sample transitions from the same distribution as the given data set. To alleviate the problems of overfitting and distributional shift, many algorithms focus on adapting existing model-free and model-based algorithms to the offline setting using ensemble techniques (Agarwal et al., 2020), policy constraints (Siegel et al., 2020) and conservative estimation of action quality (Kumar et al., 2020). These algorithms do not completely solve the problem, however, and there is still considerable room for improvement in the performance of the extracted policy. The success of an algorithm is judged by how much it can improve on the underlying behaviour policy of the data; this can lead to improving on what are considered expert decision-making policies, e.g. in long-term treatment planning.

According to a recent review (Nguyen and La, 2019), previous methods using DRL for robotics have mostly been online methods, and there is a need for more efficient algorithms and methods. Directly applying DRL to real-world robotics requires a suitable reward system as well as an efficient DRL method (Zhu et al., 2020). Robotics is an excellent application for offline-DRL as there are large data sets available of robots performing different tasks. This means that offline-DRL could be an efficient way of learning new tasks without the potentially high cost of collecting more data.
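
As a rough illustration of the extrapolation error described above, the following minimal Python sketch runs tabular Q-iteration over a fixed data set of transitions. It is a toy analogue only, not any of the cited algorithms: the function name `offline_q_iteration`, the three-state environment, and the initial value `q_init=5.0` (standing in for an erroneous over-estimate at state-action pairs that never appear in the data) are all assumptions made for the example. Restricting the maximisation to actions observed in the data is loosely in the spirit of the policy-constraint idea.

```python
import numpy as np


def offline_q_iteration(dataset, n_states, n_actions, gamma=0.9,
                        q_init=0.0, constrain_to_data=False, n_sweeps=50):
    """Tabular Q-iteration over a fixed list of (s, a, r, s_next, done) tuples.

    Assumes deterministic transitions, so each update overwrites Q(s, a)
    with its bootstrapped target (equivalent to a learning rate of 1).
    """
    q = np.full((n_states, n_actions), q_init, dtype=float)

    # Record which state-action pairs actually occur in the data set.
    seen = np.zeros((n_states, n_actions), dtype=bool)
    for s, a, _r, _s_next, _done in dataset:
        seen[s, a] = True

    for _ in range(n_sweeps):
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                next_q = q[s_next]
                if constrain_to_data and seen[s_next].any():
                    # Only bootstrap from actions that appear in the data.
                    next_q = next_q[seen[s_next]]
                target = r + gamma * next_q.max()
            q[s, a] = target
    return q


# Hypothetical data set: action 0 is the only action ever taken.
# State 0 -> state 1 (reward 0), state 1 -> terminal state 2 (reward 1).
dataset = [(0, 0, 0.0, 1, False), (1, 0, 1.0, 2, True)]

q_naive = offline_q_iteration(dataset, 3, 2, q_init=5.0)
q_constrained = offline_q_iteration(dataset, 3, 2, q_init=5.0,
                                    constrain_to_data=True)

# Naive update bootstraps from the never-corrected Q(1, 1) = 5.0,
# so Q(0, 0) = 0.9 * 5.0 = 4.5 although no such return is achievable.
print(q_naive[0, 0])
# Constrained update bootstraps from Q(1, 0) = 1.0, so Q(0, 0) = 0.9.
print(q_constrained[0, 0])
```

In the naive run the value of the unseen action is never corrected, because no transition in the data set contradicts it, and the error propagates backwards through the bootstrapped targets. With a neural-network critic the same effect arises from function-approximation error rather than from initialisation.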

The proposed research aims to review existing offline deep reinforcement learning (DRL) algorithms by implementing them in code, to see how they work and where they cease to be viable. This will involve a comprehensive literature review of the current state-of-the-art model-free and model-based offline DRL algorithms, exploring their strengths and weaknesses. With this knowledge, the next step will be to improve and develop these algorithms with the goal of creating our own novel offline DRL algorithm. This will require a fundamental grounding in mathematics as well as implementing and publishing the algorithm in code (Python). The novel algorithm will be evaluated against current offline methods using standard datasets (D4RL). Once it exceeds the performance of current algorithms, it will be applied to a real-world problem in robotics. This will involve devising a suitable reward design and collaborating with the partner to understand the capabilities of the robot.

This research falls into the "Artificial intelligence technologies" remit. The research will advance the algorithms involved in offline deep reinforcement learning with a focus on leveraging hardware advances and unlocking new mission-critical applications.
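
Results on the D4RL datasets mentioned above are conventionally reported as a normalised score, where 0 corresponds to a random policy and 100 to an expert policy. The sketch below shows that calculation in Python. The function name `normalized_score` and all the numeric returns are made up for illustration; they are not D4RL reference values or results from this project.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Map an average episode return to the 0-100 scale used for D4RL results.

    0 corresponds to the random-policy return, 100 to the expert return.
    """
    if expert_return == random_return:
        raise ValueError("expert and random returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns for one task (illustrative values only).
RANDOM_RETURN = 0.0
EXPERT_RETURN = 200.0

# Hypothetical average returns of the behaviour policy that generated
# the data and of a policy extracted by an offline algorithm.
behaviour_score = normalized_score(80.0, RANDOM_RETURN, EXPERT_RETURN)
learned_score = normalized_score(120.0, RANDOM_RETURN, EXPERT_RETURN)

# Improvement over the behaviour policy, in normalised-score points.
improvement = learned_score - behaviour_score
print(behaviour_score, learned_score, improvement)
```

Comparing the extracted policy with the behaviour policy on the same scale gives one simple way to quantify how much an algorithm improves on the data it was given.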

Research areas: Digital economy, Healthcare technologies

Planned Impact

In the 2018 Government Office for Science report, 'Computational Modelling: Technological Futures', Greg Clark, the Secretary of State for Business, Energy and Industrial Strategy, wrote "Computational modelling is essential to our future productivity and competitiveness, for businesses of all sizes and across all sectors of the economy". With its focus on computational models, the mathematics that underpins them, and their integration with complex data, the MathSys II CDT will generate diverse impacts beyond academia. This includes impacts on skills, on the economy, on policy and on society.

Impacts on skills.
MathSys II will produce a minimum of 50 PhD graduates to support the growing national demand for advanced mathematical modelling and data analysis skills. The CDT will provide each of them with broad core skills in the MSc, a deep knowledge of their chosen research specialisation in the PhD and a complementary qualification in transferable skills integrated throughout. Graduates will thus acquire the profiles needed to form the next generation of leaders in business, government and academia. They will be supported by an integrated pastoral support framework, including a diverse group of accessible leadership role models. The cohort-based environment of the CDT provides a multiplier effect by encouraging cohorts to forge long-lasting professional networks whose value and influence will long outlast the CDT itself. MathSys II will seek to maximise the influence of these networks by providing topical training in Responsible Research and Innovation, by maintaining a robust Equality, Diversity & Inclusion policy, and by integration with Warwick's global network of international partnerships.

Economic impacts.
The research outputs from many MathSys II PhD projects will be of direct economic value to commercial, public sector and charitable external partners. Engagement with CDT partners will facilitate these impacts. This includes co-supervision of PhD and MSc projects, co-creation of Research Study Groups, and a strong commitment to provide placements/internships for CDT students. When commercial innovations or IP are generated, we will work with Warwick Ventures, the commercial arm of the University of Warwick, to commercialise/license IP where appropriate. Economic impact may also come from the creation of new companies by CDT graduates. MathSys II will present entrepreneurship as a viable career option to students. One external partner, Spectra Analytics, was founded by graduates of the preceding Complexity Science CDT, thus providing accessible role models. We will also provide in-house entrepreneurship training via Warwick Ventures and host events by external start-up accelerator Entrepreneur First.

Impacts on policy.
The CDT will influence policy at the national and international level by working with external partners operating in policy. UK examples include the Department of Health, Public Health England and DEFRA. International examples include the World Health Organisation (WHO) and the European Commission for the Control of Foot-and-mouth Disease (EuFMD). MathSys students will also utilise the recently announced UKRI policy internships scheme.

Impacts on society.
Public engagement will allow CDT students to promote the value of their research to society at large. Aside from social media, suitable local events include DataBeers, Cafe Scientifique, and the Big Bang Fair. MathSys will also promote a socially-oriented ethos of technology for the common good. Concretely, this includes the creation of open-source software, the integration of software and data carpentry into our computational and data-driven research training, and championing open access to research. We will also contribute to the 'innovation culture and science' strand of Coventry's 2021 City of Culture programme.

Publications

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/S022244/1                                   01/10/2019  31/03/2028
2431593            Studentship   EP/S022244/1  01/10/2020  30/09/2024  Charles Hepburn