Reward Design for Safe Reinforcement Learning
Lead Research Organisation: University of Oxford
Department Name: Computer Science
Abstract
In my DPhil, I intend to focus on the safe development of autonomous systems: algorithms that must make sequences of decisions and whose deployment changes their environment. One popular paradigm for creating decision-making agents is reinforcement learning (RL). Training an RL agent involves two stages: (1) designing the reward signal used to 'score' behaviour and (2) using that reward signal to train a high-scoring agent. Much previous research has focussed on the challenge of training an agent to achieve a high reward. However, specifying a reward that captures exactly what designers want is itself extremely challenging, especially in complex, real-world environments. If the reward function is misspecified, competent optimisers can learn to behave in unpredictable and undesirable ways.
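To make this failure mode concrete, below is a minimal sketch of my own (an illustration only, not a result of the project): a one-dimensional gridworld whose proxy reward omits the hazard penalty the designer intended, so the proxy-optimal policy found by value iteration walks straight through the hazard.

```python
# Hedged toy illustration (my own construction): a 1-D gridworld in which the
# designer intends the agent to reach the goal while avoiding a hazard tile,
# but the proxy reward only pays out at the goal. A competent optimiser of
# the proxy happily walks through the hazard.
import numpy as np

N_STATES, GOAL, HAZARD, GAMMA = 5, 4, 2, 0.9
ACTIONS = (-1, +1)  # step left / step right


def step(s: int, a: int) -> int:
    """Deterministic transition on the line, clipped to the grid."""
    return min(max(s + a, 0), N_STATES - 1)


def proxy_reward(s_next: int) -> float:
    """Misspecified reward: +1 at the goal, no penalty for the hazard."""
    return 1.0 if s_next == GOAL else 0.0


# Value iteration under the proxy reward (goal treated as absorbing).
V = np.zeros(N_STATES)
for _ in range(100):
    for s in range(N_STATES):
        if s == GOAL:
            continue
        V[s] = max(proxy_reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS)

# Greedy rollout of the proxy-optimal policy from the start state.
s, trajectory = 0, [0]
while s != GOAL:
    best = max(ACTIONS, key=lambda a: proxy_reward(step(s, a)) + GAMMA * V[step(s, a)])
    s = step(s, best)
    trajectory.append(s)

print("proxy-optimal trajectory:", trajectory)
print("visits the hazard the designer wanted avoided:", HAZARD in trajectory)
```

Even in this five-state example, the gap between the intended behaviour and the rewarded behaviour is invisible to the optimiser, which only ever sees the proxy.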
In recent years, reward learning has become a popular way to specify rewards in complicated environments. For example, ChatGPT uses a reward model trained on human labels. Such reward models only approximate the designers' intentions, and agents trained against them may learn to exploit their errors to collect reward for undesirable actions. A better understanding of how inaccuracies in ChatGPT's reward models influence its behaviour may be an important step towards avoiding unsafe or antisocial behaviour.
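As a hedged illustration of what a reward model trained on human labels typically looks like (a generic preference-learning sketch, not the actual ChatGPT pipeline; RewardModel, the feature vectors and preference_loss are assumptions made for this example), one can fit a scalar reward model to pairwise human comparisons with a Bradley-Terry loss:

```python
# Hedged sketch of preference-based reward learning: fit a scalar reward model
# so that the sample humans preferred scores higher than the rejected one.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a trajectory/response feature vector to a scalar reward."""

    def __init__(self, feature_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss on pairwise comparisons."""
    return -torch.log(torch.sigmoid(rm(preferred) - rm(rejected))).mean()


# Toy usage: random stand-ins for human-labelled comparisons over 8-d features.
rm = RewardModel(feature_dim=8)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 8), torch.randn(32, 8)
for _ in range(100):
    opt.zero_grad()
    loss = preference_loss(rm, preferred, rejected)
    loss.backward()
    opt.step()
```

A pairwise loss of this form is a common way to learn rewards from human comparisons; the errors such a model inevitably makes are precisely the kind of misspecification this project studies.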
I want to further develop the theory of reward function design to create safe decision-making systems. My aims and objectives are as follows:
1. To develop the theory of how agents fail when their reward functions are misspecified. For example, we can study ways to softly optimise an imperfect reward function to avoid unsafe behaviour. Alternatively, we can try to derive bounds on an agent's loss in performance in terms of the error in its reward model (a textbook-style example of such a bound is sketched after this list).
2. To develop the theory of ways to design safer or more accurately specified reward functions. We can investigate whether some reward misspecification leads to more benign behaviours than others or find ways to improve reward learning methods.
3. To investigate alternative training methods that side-step the need for a reward function. One such method is cooperative inverse reinforcement learning, which asks agents to model their uncertainty about their goals and to ask questions when they are uncertain. Another is to train goal-conditioned agents.
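To make aim 1 concrete, a textbook-style bound (stated here only as an illustration of the kind of result sought, not a contribution of this project) relates reward-model error to lost performance: if the learned reward is uniformly within $\epsilon$ of the true reward, every policy's discounted value changes by at most $\epsilon/(1-\gamma)$, so a policy optimised for the learned reward is at most $2\epsilon/(1-\gamma)$ suboptimal under the true reward:

\[
\sup_{s,a}\big|\hat R(s,a)-R(s,a)\big|\le\epsilon
\;\Longrightarrow\;
\big|V^{\pi}_{\hat R}-V^{\pi}_{R}\big|\le\frac{\epsilon}{1-\gamma}
\;\Longrightarrow\;
V^{\hat\pi^*}_{R}\ge V^{\pi^*}_{R}-\frac{2\epsilon}{1-\gamma},
\]

where $R$ is the true reward, $\hat R$ the learned reward model, $\gamma$ the discount factor, and $\pi^*$, $\hat\pi^*$ the policies optimal under $R$ and $\hat R$ respectively. Bounds of this worst-case kind are standard; the interest in aim 1 is in whether tighter or more structured guarantees hold for the errors that actual reward-learning methods make.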
The novelty of this research direction is its focus on the design of the reward rather than on the training process, and on the safety rather than the competence of agents. Because RL has historically been applied in small or toy environments, the complexities of reward design were obscured relative to the challenge of learning to score a high reward. I instead aim to abstract away learning to score a high reward by asking: if agents were very competent at doing whatever we reward them for doing, how do we reward them for the right behaviours? I intend to develop previous work from the OxCAV group on reward theory, such as work on impact regularisation, reward gaming and Goodhart's Law. This project falls within the EPSRC Artificial Intelligence Technologies research area.
Organisations
People | ORCID iD
---|---
Alessandro Abate (Primary Supervisor) |
Charlie Griffin (Student) |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/W524311/1 | | | 30/09/2022 | 29/09/2028 |
2872672 | Studentship | EP/W524311/1 | 30/09/2023 | 30/03/2027 | Charlie Griffin