Random Policy Search and Stochastic control

Lead Research Organisation: University of Oxford

Abstract

The paradigm of model free reinforcement learning has proved successful in a multitude of tasks in automated decision making. This paradigm assumes limited information about the environment it operates in and tries to learn good policies by interacting with its environment over time. The quality of a policy is judged by its performance on an appropriately chosen objective function. Usually, the learning algorithms rely on estimating either a value function or directly the policy or a mixture of both.

A specific challenge is the learning of policies in environments with continuous action spaces for which computer scientists have introduced several simulation-based benchmark problems (e.g. MuJoCo). A simple algorithm that produced remarkably good results in these benchmarks is the random policy search algorithm with linear parametrization. For this algorithm, our goal is to find a good policy directly by learning parametrized policies linear in the state instead of searching the entire function space of admissible policies. The fitting is done by approximating the gradient of the value function in the directions of the parameter with a Monte-Carlo scheme. Given the gradient, a gradient ascent step is performed. This procedure is repeated until a suitable policy is found and only requires a sampling oracle.

In spite of the simplicity of the algorithm, little is known about its theoretical properties. A good starting point for an analysis is stochastic control theory. Already, authors used these tools to establish convergence to the optimal policy in cases where the algorithm is applied to the linear quadratic regulator (LQR). We would like to build on the same approach to establish the behaviour of the algorithm in other control frameworks and deepen the understanding of the interplay between representation and optimization. This research is of interest since it potentially improves algorithm selection for practical problems and gives new insights in deeper aspects of learning.

For the first part, we apply the random policy search algorithm to a time-continuous optimal queueing problem that has a direct application to optimal execution in limit order book markets. For this problem and a specified linear policy on discretized time intervals, we can give the closed form expression for the value function and its gradient. We hope to find a parameter region for which the algorithm is guaranteed to converge to an optimal time-discrete policy, if initialised within the region. For that purpose, one could establish a gradient dominance condition for a gradient-descent algorithm with the known gradient and then generalize to a situation where the gradient is approximated. Furthermore, since the value function for the optimal time-continuous policy can be calculated as well we hope to make a statement about discretization errors. Finally, we want to apply the algorithm to real world limit order book data and extend it to a two sided queue corresponding to the market maker problem. I work with Roel Oomen from Deutsche Bank on this project.

Further, we would like to better understand the capability of the algorithm to learn an appropriate state representation while simultaneously optimizing. As a first instructive problem, we could consider the LQR case where the appropriate state can be almost written as a combination of bases functions where the coefficients are assumed to have a low dimensional structure. The optimal result in this framework would establish a model selection property for a proximal gradient descent algorithm. Finally, we hope to investigate the role of state representation in an application of this algorithm in the context of backward stochastic differential equation solvers.

This project falls within the EPSRC Artificial intelligence technologies research area.

Planned Impact

Probabilistic modelling permeates the Financial services, healthcare, technology and other Service industries crucial to the UK's continuing social and economic prosperity, which are major users of stochastic algorithms for data analysis, simulation, systems design and optimisation. There is a major and growing skills shortage of experts in this area, and the success of the UK in addressing this shortage in cross-disciplinary research and industry expertise in computing, analytics and finance will directly impact the international competitiveness of UK companies and the quality of services delivered by government institutions.
By training highly skilled experts equipped to build, analyse and deploy probabilistic models, the CDT in Mathematics of Random Systems will contribute to
- sharpening the UK's research lead in this area and
- meeting the needs of industry across the technology, finance, government and healthcare sectors

MATHEMATICS, THEORETICAL PHYSICS and MATHEMATICAL BIOLOGY

The explosion of novel research areas in stochastic analysis requires the training of young researchers capable of facing the new scientific challenges and maintaining the UK's lead in this area. The partners are at the forefront of many recent developments and ideally positioned to successfully train the next generation of UK scientists for tackling these exciting challenges.
The theory of regularity structures, pioneered by Hairer (Imperial), has generated a ground-breaking approach to singular stochastic partial differential equations (SPDEs) and opened the way to solve longstanding problems in physics of random interface growth and quantum field theory, spearheaded by Hairer's group at Imperial. The theory of rough paths, initiated by TJ Lyons (Oxford), is undergoing a renewal spurred by applications in Data Science and systems control, led by the Oxford group in conjunction with Cass (Imperial). Pathwise methods and infinite dimensional methods in stochastic analysis with applications to robust modelling in finance and control have been developed by both groups.
Applications of probabilistic modelling in population genetics, mathematical ecology and precision healthcare, are active areas in which our groups have recognized expertise.

FINANCIAL SERVICES and GOVERNMENT

The large-scale computerisation of financial markets and retail finance and the advent of massive financial data sets are radically changing the landscape of financial services, requiring new profiles of experts with strong analytical and computing skills as well as familiarity with Big Data analysis and data-driven modelling, not matched by current MSc and PhD programs. Financial regulators (Bank of England, FCA, ECB) are investing in analytics and modelling to face this challenge. We will develop a novel training and research agenda adapted to these needs by leveraging the considerable expertise of our teams in quantitative modelling in finance and our extensive experience in partnerships with the financial institutions and regulators.

DATA SCIENCE:

Probabilistic algorithms, such as Stochastic gradient descent and Monte Carlo Tree Search, underlie the impressive achievements of Deep Learning methods. Stochastic control provides the theoretical framework for understanding and designing Reinforcement Learning algorithms. Deeper understanding of these algorithms can pave the way to designing improved algorithms with higher predictability and 'explainable' results, crucial for applications.
We will train experts who can blend a deeper understanding of algorithms with knowledge of the application at hand to go beyond pure data analysis and develop data-driven models and decision aid tools
There is a high demand for such expertise in technology, healthcare and finance sectors and great enthusiasm from our industry partners. Knowledge transfer will be enhanced through internships, co-funded studentships and paths to entrepreneurs

Publications

10 25 50

Studentship Projects

Project Reference Relationship Related To Start End Student Name
EP/S023925/1 01/04/2019 30/09/2027
2435718 Studentship EP/S023925/1 01/10/2020 30/09/2024 Michael Giegrich