Learning Flexibility: Deep Transfer Reinforcement Learning
Lead Research Organisation:
University of Oxford
Department Name: Computer Science
Abstract
Reinforcement learning is a powerful form of learning inspired by the dopamine-controlled processes that many animals use to learn rewarding behaviours in novel environments [1]. Although it can find optimal solutions to simple Markov decision processes on its own, reinforcement learning's potential over complex tasks is only realised when combined with deep neural network architectures; such systems can approach or exceed human-level performance in specific contexts [2][3].
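To make the first point concrete, the following is a minimal sketch of tabular Q-learning solving a toy chain Markov decision process. It is a hypothetical illustration (states, rewards, and hyperparameters are invented for this example); the project itself concerns deep function approximators rather than tables.

```python
import random

# Toy chain MDP: states 0..4; actions 0 = left, 1 = right;
# reward 1.0 on reaching the terminal state 4.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    """One environment transition: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection, breaking ties randomly
            if rng.random() < EPSILON or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s2, r, done = step(s, a)
            # temporal-difference (Q-learning) update
            q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# Greedy policy after training: 1 (move right) in every non-terminal state.
policy = [1 if q[s][1] > q[s][0] else 0 for s in range(GOAL)]
print(policy)  # expected: [1, 1, 1, 1]
```

The tabular update converges here because the state space is tiny; for the high-dimensional tasks this project targets, the Q-table is replaced by a deep neural network, which is what introduces the hierarchical representations discussed below.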
These performance improvements arise from allowing an agent to represent information hierarchically, with varying degrees of abstraction, in ways similar to techniques used in computer vision [4]. How agents achieve this remains a largely unexplored aspect of deep reinforcement learning. In this project we propose to address this question by examining the structures of neural networks trained to perform similar tasks. We hope to identify underlying, higher-level neural structures common to these tasks, and to explore the possibility of transferring them to novel but similar tasks. We anticipate that these underlying structures will also capture the essence of each type of problem, aiding the classification of tasks as well as marking a step towards human-level flexibility.
This project falls within the EPSRC ICT research area.
[1] Sutton RS, Barto AG (1998). Reinforcement Learning: An Introduction. MIT Press.
[2] Tesauro G (1995). 'Temporal Difference Learning and TD-Gammon'. Communications of the ACM 38(3), pp. 58-68.
[3] Mnih V, Kavukcuoglu K, Silver D et al. (2015). 'Human-level Control through Deep Reinforcement Learning'. Nature 518, pp. 529-533.
[4] Zeiler MD, Fergus R (2014). 'Visualizing and Understanding Convolutional Networks'. Lecture Notes in Computer Science 8689, pp. 818-833.
Organisations
People
Shimon Whiteson (Primary Supervisor)
Matthew Fellows (Student)
Publications
Fellows M (2018). 'Fourier Policy Gradients'.
Fellows M G (2019). 'VIREL: A Variational Inference Framework for Reinforcement Learning'.
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/N509711/1 | | | 30/09/2016 | 29/09/2021 |
1735385 | Studentship | EP/N509711/1 | 02/10/2016 | 30/03/2020 | Matthew Fellows
Description | Several existing reinforcement learning algorithms have been consolidated into a single framework with improved theoretical properties. Most importantly, the work has demonstrated that the optimal policy these algorithms attempt to recover is the true optimal policy for the reinforcement learning objective, a result that had been missing for several empirically successful algorithms such as maximum a posteriori policy optimisation. For widely used algorithms based on the maximum entropy principle (such as soft actor-critic) that do not fit into our framework, we have provided a theoretical demonstration that these algorithms may never recover an optimal policy. Moreover, we have provided evidence that algorithms from our framework match or even outperform those derived from the maximum entropy principle. |
Exploitation Route | Our framework provides state-of-the-art performance in reinforcement learning control environments with the added benefit of theoretical guarantees, allowing others to apply it in any reinforcement learning setting. Going forward, we are investigating the convergence properties of these algorithms. |
Sectors | Digital/Communication/Information Technologies (including Software); Financial Services and Management Consultancy; Healthcare; Transport |
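For context on the findings above, the maximum entropy principle augments the standard expected-return objective with a policy-entropy bonus. A common formulation from the literature (not taken from this entry; the temperature parameter α and symbols are the standard ones) is:

```latex
J_{\text{MaxEnt}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right]
```

The key findings state that optimising this entropy-regularised objective, as soft actor-critic does, can bias the recovered policy away from the optimum of the unregularised reinforcement learning objective whenever the entropy bonus is active.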