Information Geometry and Reflexive Reinforcement Learning

Lead Research Organisation: University of Edinburgh

Department Name: School of Informatics

Abstract

Experimentation in a reinforcement learning agent is a process in which actions are drawn and each outcome carries an evaluative signal. This is not dissimilar to how humans acquire knowledge: we try actions to determine the effect they have. Where the agent has a task to accomplish, this paradigm requires it to select both actions that are maximally interesting in terms of information gain and actions that reduce the prediction error, i.e., actions of which it already has some knowledge or understanding. This tension is often called the "exploration-exploitation dilemma". There is an inherent duality in these two problems, as both are performed over the same state-action space. We therefore hope to use the duality present in Information Geometry, a subfield of information theory, to develop a more theoretical approach to solving the exploration-exploitation dilemma.
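
As an illustration of this trade-off only, and not of the project's own method, the following minimal Python sketch runs a Bernoulli bandit in which each action is scored by its estimated reward (exploitation) plus a bonus proportional to the posterior standard deviation of that estimate (a simple stand-in for information gain). The function name `run_bandit`, the Beta(1, 1) prior and the `bonus_weight` parameter are assumptions introduced here.

```python
import random


def run_bandit(true_probs, steps=2000, bonus_weight=1.0, seed=0):
    """Bernoulli bandit: pick the arm with the best mean-plus-uncertainty score.

    Hypothetical helper for illustration; returns the pull count per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Beta(1, 1) prior per arm, stored as pseudo-counts of wins and losses.
    wins = [1.0] * n_arms
    losses = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(steps):
        scores = []
        for a in range(n_arms):
            total = wins[a] + losses[a]
            mean = wins[a] / total  # exploitation term: estimated reward
            var = wins[a] * losses[a] / (total ** 2 * (total + 1))  # Beta variance
            scores.append(mean + bonus_weight * var ** 0.5)  # exploration bonus
        arm = max(range(n_arms), key=lambda a: scores[a])  # ties go to the lowest index
        reward = 1 if rng.random() < true_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Pull counts per arm for three arms with assumed success probabilities.
    print(run_bandit([0.2, 0.5, 0.8]))
```

With `bonus_weight=0` the agent is purely greedy; larger values make it keep trying poorly understood actions for longer.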

In doing so we have developed what we believe to be a novel approach to Reinforcement Learning, which we call Reflexive Reinforcement Learning, whereby an agent makes better use of the evaluative signals generated over the trial-and-error learning process. We hope that combining this with Inverse Reinforcement Learning (IRL), which is an ill-posed problem, will lead to an adaptive expert agent that can change its policy over time to improve the efficacy of IRL.

Planned Impact

The Centre will have immediate short-term impacts on people skills and pipeline, alongside advances in scientific knowledge and techniques. However, with the strength of the programme's training emphasis on innovation and social/societal challenges, we also target longer-term economic and societal benefits.
People: Centre graduates will be grounded in fundamental RAS topics and acquire advanced specialist scientific knowledge of crucial interaction themes. They will be skilled at teamwork, with a broader appreciation of RAS ethical issues. They will have international contacts and experience, with public presentation experience. Most importantly, they will be Innovation Ready - skilled in the principles of how technical and commercial disruption occurs, understanding how finance and organization realize new products and services in startup, SME and corporate situations. Their economic impact will be as industrial leaders of the future, foundational in realizing new products and services. This impact will be accelerated by our #Cauldron training programme in the interlinked areas of Scientific Cohesion, Research and Creativity Skills, Social and Societal Challenges, and programmed engagements and activities with our User Partners who shape the Centre's direction.
Science: The Centre will realize scientific advances, e.g. greater understanding of AI vs biomimetic approaches to persistent autonomy, advanced empathetic multimodal interaction between people and machines in smart spaces, advanced robotic micro-sensing and computing in soft embodiments, adaptive compliant actuation at a multitude of scales and form factors, semantic understanding of environments from noisy sensor data, and more. The research methods and practice needed to achieve these advances will also be realized, e.g. hardware-in-the-loop architectures for re-usability and easy, low-cost experimentation. The impact of these advances will be enhanced by strongly supported opportunities for dissemination, including conference presentations and publications (with training in presentation and writing skills), reciprocal secondments with Associate Research Partners, international student robot competitions, public outreach activities, and CDT-hosted international researcher visitors and workshops.
Society: Robotic and autonomous systems decrease cost and risk and increase productivity, while removing human operators from the 'dull, dirty and dangerous' tasks across the industries of our User Partners. Centre graduates and technology will contribute to maintaining UK business competitiveness and exports in this emerging €15.5 billion market, whilst improving quality of life through, for example, a) more interesting (and prestigious) day-to-day employment for workers, b) assisted healthcare for an ageing population (including the Centre Directors), and c) greater awareness of environmental impacts and changes, leading to policy and legislation.

Student:

William Lyons

Period of Study:

Aug 17 - May 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

1944359

Research Topic:

Unclassified

Organisations

University of Edinburgh (Lead Research Organisation)

People	ORCID iD
J Herrmann (Primary Supervisor)
William Lyons (Student)

Publications

Herrmann J (2020) Reflexive Reinforcement Learning: Methods for Self-Referential Autonomous Learning

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
NE/W50287X/1			31/03/2021	30/03/2022
1944359	Studentship	NE/W50287X/1	31/08/2017	14/05/2022	William Lyons

Key Findings


Description	In reinforcement learning, a great deal of information is encountered over the trial-and-error learning process, but much of it is lost when the agent reduces it to an "expert policy" that simply takes the optimal action in a given state. We believed this information may have additional uses in a dynamic environment and so should be retained rather than discarded. By retaining it, we felt we may be able to improve the efficacy of "Inverse Reinforcement Learning", a subset of reinforcement learning in which an agent learns a reward function implied by the behaviour of some expert. Inverse Reinforcement Learning as it stands has an inherent issue: multiple reward functions can represent the same observed behaviours. Using our novel informational approach, we have developed an agent capable of dynamically changing its behaviour, even to suboptimal behaviours, to better teach another agent the true values in an environment. We have a paper to submit for this shortly.
Exploitation Route	My personal hope for this work is that, rather than having to retrain agents every time a new platform is developed, we will be able to train an agent once. Then, as rapid prototyping takes place, we will be able to have a new agent observe the current expert agent in a "teaching" behaviour, in which the expert performs both optimal and suboptimal actions to show the true value of the environment with less time and training. I think this will be particularly useful in highly dynamic environments.
Sectors	Environment, Retail, Transport
URL	https://www.scitepress.org/PublicationsDetail.aspx?ID=9NOSxB6sHK4=&t=1
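
The ambiguity noted in the Description above, that several reward functions can explain the same expert behaviour, can be illustrated with a minimal Python sketch. It is a toy built on assumptions and not the project's method: a four-state deterministic chain, a hypothetical `greedy_policy` helper that runs value iteration, and three hand-picked reward vectors (the reward is received on entering a state). All three produce the same optimal policy, so an observer of that policy alone cannot tell them apart.

```python
def greedy_policy(rewards, gamma=0.9, sweeps=200):
    """Value iteration on a deterministic chain; returns the greedy action per state.

    Hypothetical helper for illustration. rewards[s] is paid on entering state s.
    """
    n = len(rewards)
    actions = (-1, +1)  # step left or right, clipped at the ends of the chain

    def step(s, a):
        return min(max(s + a, 0), n - 1)

    def q(s, a, values):
        nxt = step(s, a)
        return rewards[nxt] + gamma * values[nxt]

    values = [0.0] * n
    for _ in range(sweeps):
        # Synchronous Bellman optimality backup over all states.
        values = [max(q(s, a, values) for a in actions) for s in range(n)]
    return [max(actions, key=lambda a: q(s, a, values)) for s in range(n)]


if __name__ == "__main__":
    sparse = [0, 0, 0, 1]       # reward only at the goal state
    affine = [1, 1, 1, 3]       # 2 * sparse + 1, a positive affine transform
    shaped = [0, 0.5, 0.7, 1]   # a differently shaped reward
    for name, r in (("sparse", sparse), ("affine", affine), ("shaped", shaped)):
        print(name, greedy_policy(r))
```

Because the optimal behaviour is identical in all three cases, telling these rewards apart needs extra information. This is one way to read the motivation given above for an expert that also demonstrates suboptimal actions.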