Reward Design for Safe Reinforcement Learning
Lead Research Organisation: University of Oxford
Department Name: Computer Science
Abstract
In my DPhil, I intend to focus on the safe development of autonomous systems: algorithms that must make sequences of decisions and whose deployment changes their environment. One popular paradigm for creating decision-making agents is reinforcement learning (RL). Training an RL agent involves two stages: (1) designing the reward signal used to 'score' behaviour and (2) using that reward signal to train a high-scoring agent. Much previous research has focussed on the challenge of training an agent to achieve a high reward. However, specifying a reward that captures exactly what designers want is itself extremely challenging, especially in complex, real-world environments. If the reward function is misspecified, competent optimisers can learn to behave in unpredictable and undesirable ways.
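To make this failure mode concrete, below is a minimal sketch of my own (an illustration only, not a result of the project): a one-dimensional gridworld whose proxy reward omits the hazard penalty the designer intended, so the proxy-optimal policy found by value iteration walks straight through the hazard.

```python
# Hedged toy illustration (my own construction): a 1-D gridworld in which the
# designer intends the agent to reach the goal while avoiding a hazard tile,
# but the proxy reward only pays out at the goal. A competent optimiser of
# the proxy happily walks through the hazard.
import numpy as np

N_STATES, GOAL, HAZARD, GAMMA = 5, 4, 2, 0.9
ACTIONS = (-1, +1)  # step left / step right


def step(s: int, a: int) -> int:
    """Deterministic transition on the line, clipped to the grid."""
    return min(max(s + a, 0), N_STATES - 1)


def proxy_reward(s_next: int) -> float:
    """Misspecified reward: +1 at the goal, no penalty for the hazard."""
    return 1.0 if s_next == GOAL else 0.0


# Value iteration under the proxy reward (goal treated as absorbing).
V = np.zeros(N_STATES)
for _ in range(100):
    for s in range(N_STATES):
        if s == GOAL:
            continue
        V[s] = max(proxy_reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS)

# Greedy rollout of the proxy-optimal policy from the start state.
s, trajectory = 0, [0]
while s != GOAL:
    best = max(ACTIONS, key=lambda a: proxy_reward(step(s, a)) + GAMMA * V[step(s, a)])
    s = step(s, best)
    trajectory.append(s)

print("proxy-optimal trajectory:", trajectory)
print("visits the hazard the designer wanted avoided:", HAZARD in trajectory)
```

Even in this five-state example, the gap between the intended behaviour and the rewarded behaviour is invisible to the optimiser, which only ever sees the proxy.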
In recent years, reward learning has become a popular way to specify rewards in complicated environments. For example, ChatGPT uses a reward model trained on human labels. Such reward models only approximate the designers' intentions, and agents trained against them may learn to exploit their errors to collect reward for undesirable actions. A better understanding of how inaccuracies in ChatGPT's reward models influence its behaviour may be an important step towards avoiding unsafe or antisocial behaviour.
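As a hedged illustration of what a reward model trained on human labels typically looks like (a generic preference-learning sketch, not the actual ChatGPT pipeline; RewardModel, the feature vectors and preference_loss are assumptions made for this example), one can fit a scalar reward model to pairwise human comparisons with a Bradley-Terry loss:

```python
# Hedged sketch of preference-based reward learning: fit a scalar reward model
# so that the sample humans preferred scores higher than the rejected one.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a trajectory/response feature vector to a scalar reward."""

    def __init__(self, feature_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss on pairwise comparisons."""
    return -torch.log(torch.sigmoid(rm(preferred) - rm(rejected))).mean()


# Toy usage: random stand-ins for human-labelled comparisons over 8-d features.
rm = RewardModel(feature_dim=8)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 8), torch.randn(32, 8)
for _ in range(100):
    opt.zero_grad()
    loss = preference_loss(rm, preferred, rejected)
    loss.backward()
    opt.step()
```

A pairwise loss of this form is a common way to learn rewards from human comparisons; the errors such a model inevitably makes are precisely the kind of misspecification this project studies.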
I want to further develop the theory of reward function design to create safe decision-making systems. My aims and objectives are as follows:
1. To develop the theory of how agents fail when their reward functions are misspecified. For example, we can study ways to softly optimise an imperfect reward function to avoid unsafe behaviour. Alternatively, we can try to derive bounds on an agent's loss in performance in terms of the error in its reward model (a textbook-style example of such a bound is sketched after this list).
2. To develop the theory of ways to design safer or more accurately specified reward functions. We can investigate whether some reward misspecification leads to more benign behaviours than others or find ways to improve reward learning methods.
3. To investigate alternative training methods that side-step the need for a reward function. One such method is cooperative inverse reinforcement learning, which asks agents to model their uncertainty about their goals and to ask questions when they are uncertain. Another is to train goal-conditioned agents.
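To make aim 1 concrete, a textbook-style bound (stated here only as an illustration of the kind of result sought, not a contribution of this project) relates reward-model error to lost performance: if the learned reward is uniformly within $\epsilon$ of the true reward, every policy's discounted value changes by at most $\epsilon/(1-\gamma)$, so a policy optimised for the learned reward is at most $2\epsilon/(1-\gamma)$ suboptimal under the true reward:

\[
\sup_{s,a}\big|\hat R(s,a)-R(s,a)\big|\le\epsilon
\;\Longrightarrow\;
\big|V^{\pi}_{\hat R}-V^{\pi}_{R}\big|\le\frac{\epsilon}{1-\gamma}
\;\Longrightarrow\;
V^{\hat\pi^*}_{R}\ge V^{\pi^*}_{R}-\frac{2\epsilon}{1-\gamma},
\]

where $R$ is the true reward, $\hat R$ the learned reward model, $\gamma$ the discount factor, and $\pi^*$, $\hat\pi^*$ the policies optimal under $R$ and $\hat R$ respectively. Bounds of this worst-case kind are standard; the interest in aim 1 is in whether tighter or more structured guarantees hold for the errors that actual reward-learning methods make.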
The novelty of this research direction is its focus on the design of the reward rather than on the training process, and on the safety rather than the competence of agents. Because RL has historically been applied in small or toy environments, the complexities of reward design were obscured relative to the challenge of learning to score a high reward. I instead aim to abstract away learning to score a high reward by asking: if agents were very competent at doing whatever we reward them for doing, how do we reward them for the right behaviours? I intend to develop previous work from the OxCAV group on reward theory, such as work on impact regularisation, reward gaming and Goodhart's Law. This project falls within the EPSRC Artificial Intelligence Technologies research area.
Organisations
People | ORCID iD
---|---
Alessandro Abate (Primary Supervisor) |
Charlie Griffin (Student) |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/W524311/1 | | | 30/09/2022 | 29/09/2028 |
2872672 | Studentship | EP/W524311/1 | 30/09/2023 | 30/03/2027 | Charlie Griffin