research into Reformulating reinforcement learning

Lead Research Organisation: University of Warwick

Department Name: Mathematics

Abstract

Reinforcement learning (RL) can be viewed as optimisation in an unknown environment, where the goal is to balance exploration (visiting new states/actions in the environment) and exploitation (revisiting states and actions with large rewards). The power of RL has been demonstrated many times in recent years, for instance with AlphaGo defeating the world's best Go player.
The objective of this project is to assess the applicability to RL of a recent approach [1] for representing the dual nature of uncertainty: random and deterministic. The former refers to the usual probabilistic approach and the latter corresponds to the case where something (e.g. a parameter in a statistical model) is fixed but unknown. This is relevant for RL where the environment can be truly random or simply unknown, such as with the strategy of an opponent or with the probability of reward of a given action.
The framework of [1], based on the measure-theoretic notion of outer measure, naturally bridges the gap between the frequentist and Bayesian approaches. This can also be crucial for RL where the two approaches currently coexist.
There are two main research questions:
1) What are the practical benefits of introducing a more faithful representation of uncertainty in terms of performance and computational efficiency?
2) Can the theoretical guarantees developed for frequentist and Bayesian techniques be extended to algorithms based on outer measures?
To answer these questions, the special case of the multi-armed bandits (MABs) will first be considered. MABs are sufficiently simple to allow for theoretical guarantees to be derived while still presenting the fundamental dilemma of RL regarding exploration vs exploitation.
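For readers unfamiliar with MABs, the exploration vs exploitation dilemma can be illustrated with a minimal sketch. It uses the standard epsilon-greedy heuristic on a simulated Bernoulli bandit; it is not the outer-measure approach of [1], and the function name `run_epsilon_greedy`, the arm reward probabilities and the parameter values are all assumptions chosen purely for illustration.

```python
import random


def run_epsilon_greedy(true_means, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on a Bernoulli bandit (illustrative only).

    true_means: hypothetical reward probability of each arm, unknown to the learner.
    Returns (pull counts, estimated means, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of times each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward of each arm
    total_reward = 0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Exploration: try an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pull the arm with the best current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The environment is "truly random": reward is a Bernoulli draw.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Incremental update of the sample mean for the pulled arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


# Assumed example: three arms with reward probabilities 0.2, 0.5 and 0.8.
counts, estimates, total_reward = run_epsilon_greedy([0.2, 0.5, 0.8])
print(counts, [round(e, 3) for e in estimates], total_reward)
```

In a run like this one would typically expect most pulls to concentrate on the arm with the highest reward probability, while the fixed exploration rate keeps sampling the other arms. The reward probabilities play the role of quantities that are fixed but unknown to the learner, which is the kind of uncertainty the abstract distinguishes from randomness.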
[1] J. Houssineau. Parameter estimation with a class of outer probability measures. arXiv:1801.00569, 2018.
Naval Group is a French company specialised in naval-based defence. The sequential resource allocation problems that MABs and RL solve appear in a number of crucial aspects in military operations. For instance, military vessels are equipped with a range of sensors that can operate in various modes; controlling these sensors to fulfil different objectives is a challenging problem that requires dealing with a complex environment where different types of uncertainty arise.
The context of the research - Reinforcement learning (RL) can be viewed as optimisation in an unknown environment, where the goal is to balance exploration (visiting new states/actions in the environment) and exploitation (revisiting states and actions with large rewards).
The aims and objectives of the research - The objective of this project is to assess the applicability to RL of a recent approach based on a combination of possibility and probability theory. This is relevant for RL where the environment can be truly random or simply unknown, such as with the strategy of an opponent or with the probability of reward of a given action.
The novelty of the research methodology - The proposed approach is based on a new formulation of Bayesian inference using the tools of possibility theory to model parameter uncertainty. This approach lends itself to RL problems where the initial absence of knowledge must be faithfully represented.
The potential impact, applications, and benefits - By extending the standard statistical framework, the proposed approach has the potential to lead to new solutions for MABs and for RL in general. Within statistics, the considered approach allows for explaining several standard heuristics and hence providing formal ground to understand their properties and limitations. There is therefore a potential for bringing new insights into existing techniques.
How the research relates to the remit - The proposed research is under the Mathematical Sciences theme of the EPSRC, in particular under Statistics and Applied Probability and under Theoretical Computer Science.
Research Area: Mathematical Sciences
External Partner - Naval Group

Planned Impact

In the 2018 Government Office for Science report, 'Computational Modelling: Technological Futures', Greg Clark, the Secretary of State for Business, Energy and Industrial Strategy, wrote "Computational modelling is essential to our future productivity and competitiveness, for businesses of all sizes and across all sectors of the economy". With its focus on computational models, the mathematics that underpin them, and their integration with complex data, the MathSys II CDT will generate diverse impacts beyond academia. This includes impacts on skills, on the economy, on policy and on society.

Impacts on skills.
MathSys II will produce a minimum of 50 PhD graduates to support the growing national demand for advanced mathematical modelling and data analysis skills. The CDT will provide each of them with broad core skills in the MSc, a deep knowledge of their chosen research specialisation in the PhD and a complementary qualification in transferable skills integrated throughout. Graduates will thus acquire the profiles needed to form the next generation of leaders in business, government and academia. They will be supported by an integrated pastoral support framework, including a diverse group of accessible leadership role models. The cohort-based environment of the CDT provides a multiplier effect by encouraging cohorts to forge long-lasting professional networks whose value and influence will long outlast the CDT itself. MathSys II will seek to maximise the influence of these networks by providing topical training in Responsible Research and Innovation, by maintaining a robust Equality, Diversity & Inclusion policy, and by integration with Warwick's global network of international partnerships.

Economic impacts.
The research outputs from many MathSys II PhD projects will be of direct economic value to commercial, public sector and charitable external partners. Engagement with CDT partners will facilitate these impacts. This includes co-supervision of PhD and MSc projects, co-creation of Research Study Groups, and a strong commitment to provide placements/internships for CDT students. When commercial innovations or IP are generated, we will work with Warwick Ventures, the commercial arm of the University of Warwick, to commercialise/license IP where appropriate. Economic impact may also come from the creation of new companies by CDT graduates. MathSys II will present entrepreneurship as a viable career option to students. One external partner, Spectra Analytics, was founded by graduates of the preceding Complexity Science CDT, thus providing accessible role models. We will also provide in-house entrepreneurship training via Warwick Ventures and host events by external start-up accelerator Entrepreneur First.

Impacts on policy.
The CDT will influence policy at the national and international level by working with external partners operating in policy. UK examples include the Department of Health, Public Health England and DEFRA. International examples include the World Health Organisation (WHO) and the European Commission for the Control of Foot-and-mouth Disease (EuFMD). MathSys students will also utilise the recently announced UKRI policy internships scheme.

Impacts on society.
Public engagement will allow CDT students to promote the value of their research to society at large. Aside from social media, suitable local events include DataBeers, Cafe Scientifique, and the Big Bang Fair. MathSys will also promote a socially-oriented ethos of technology for the common good. Concretely, this includes the creation of open-source software, the integration of software and data carpentry into our computational and data-driven research training, and championing open access to research. We will also contribute to the 'innovation culture and science' strand of Coventry's 2021 City of Culture programme.

Student:

Jake Thomas

Period of Study:

Oct 19 - Jul 24

Funder:

EPSRC

Project Status:

Active

Project Category:

Studentship

Project Reference:

2271308

Research Topic:

Unclassified

Organisations

People	ORCID iD
Jeremie Houssineau (Primary Supervisor)
Theodoros Damoulas (Primary Supervisor)
Jake Thomas (Student)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/S022244/1			01/10/2019	31/03/2028
2271308	Studentship	EP/S022244/1	01/10/2019	31/07/2024	Jake Thomas