Hierarchical reinforcement learning in large-scale domains

Lead Research Organisation: University of Bath
Department Name: Computer Science

Abstract

Temporal abstraction is valuable for an intelligent system. Representing knowledge over multiple timescales provides a means of partitioning the state space, which can accelerate learning and allow behaviour to be transferred to different tasks. Humans constantly plan and behave using temporally extended actions, breaking any particular task down into a sequence of salient waypoints, or subgoals. Hierarchical reinforcement learning rests upon a set of theoretically sound approaches for learning and planning with temporally extended actions [1], [2], [3]. Although these frameworks are richly expressive for exploiting a given hierarchy of actions, a major open problem remains: how to autonomously discover the hierarchical structure of a given domain. This problem is known as skill discovery. Many good approaches to skill discovery exist, some based on graph theory, some on mining the trajectories of a reinforcement learning agent's experience, and others on gradient-based, end-to-end optimisation. However, current methods are not immediately compatible with all types of problems, and have not been demonstrated to scale well.
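To make these ideas concrete, the following is a minimal Python sketch, assuming the options formulation of [2]: an option bundles an initiation set, an intra-option policy and a termination condition, and a single SMDP Q-learning backup updates the value of executing an option over several time steps. The names (Option, smdp_q_update) are illustrative rather than drawn from any existing library.

```python
# A minimal sketch of a temporally extended action in the options framework of [2].
# All names here are illustrative, not from an existing library.
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

State = Hashable

@dataclass
class Option:
    """An option o = (I, pi, beta): initiation set, intra-option policy, termination."""
    initiation: Callable[[State], bool]    # I: states where the option may start
    policy: Callable[[State], int]         # pi: primitive action to take in each state
    termination: Callable[[State], float]  # beta: probability of terminating in a state

def smdp_q_update(Q: Dict[Tuple[State, str], float],
                  s: State, option_name: str, cumulative_reward: float,
                  s_next: State, k: int, options: Dict[str, Option],
                  alpha: float = 0.1, gamma: float = 0.99) -> None:
    """One SMDP Q-learning backup after an option ran for k steps and
    accumulated a (within-option discounted) reward of cumulative_reward."""
    available = [name for name, o in options.items() if o.initiation(s_next)]
    best_next = max((Q.get((s_next, name), 0.0) for name in available), default=0.0)
    target = cumulative_reward + (gamma ** k) * best_next
    key = (s, option_name)
    Q[key] = Q.get(key, 0.0) + alpha * (target - Q.get(key, 0.0))
```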

Rubik's cube is an iconic puzzle with a reputation for being difficult to solve without prior knowledge. There are currently no solutions that use reinforcement learning starting from an arbitrary scrambled state. An obvious element of the problem is the need to satisfy competing objectives simultaneously, i.e. correctly placing the different pieces of the cube. Another key source of difficulty is the property of non-serialisable subgoals: when a sequence of subgoals is used to arrive at the solution, previously achieved subgoals must sometimes be temporarily violated before later ones can be reached. Whilst it is known that Rubik's cube can be solved in 20 moves or fewer from any of its 43 quintillion states, 'cubists' who can solve the cube typically use many more moves by employing a variety of macro operators [4]. These macro operators leave part of the state invariant to their effects, which allows cubists to manipulate only certain parts of the cube during each stage of their solve.
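To illustrate why such macro operators are useful, the sketch below (a toy example, not a faithful 48-sticker cube encoding) treats primitive moves as permutations and composes a commutator-style macro; the resulting operator disturbs only a few positions and leaves the rest untouched, which is exactly the invariance property described above.

```python
# Macro operators as permutations, in the spirit of [4]; toy permutations only.
from typing import List, Tuple

def compose(p: List[int], q: List[int]) -> List[int]:
    """Apply permutation p first, then q (each maps position i to p[i])."""
    return [q[p[i]] for i in range(len(p))]

def macro(moves: List[List[int]]) -> List[int]:
    """Compose a fixed sequence of primitive moves into a single macro operator."""
    result = list(range(len(moves[0])))
    for m in moves:
        result = compose(result, m)
    return result

def invariant_positions(perm: List[int]) -> Tuple[int, ...]:
    """Positions the operator leaves fixed: the part of the state it does not disturb."""
    return tuple(i for i, j in enumerate(perm) if i == j)

# Two toy 'face turns' on an 8-element state: a cycles positions 0-3, b cycles 3-5.
a = [1, 2, 3, 0, 4, 5, 6, 7]
b = [0, 1, 2, 4, 5, 3, 6, 7]
a_inv = [a.index(i) for i in range(len(a))]
b_inv = [b.index(i) for i in range(len(b))]

# The commutator a b a^-1 b^-1 moves only positions 2, 3 and 5 and fixes the rest,
# which is what lets a solver work on one part of the state without undoing earlier work.
commutator = macro([a, b, a_inv, b_inv])
print(invariant_positions(commutator))  # (0, 1, 4, 6, 7)
```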

This research will focus on the question of how a reinforcement learning agent may learn a hierarchical policy for Rubik's cube. Preliminary work has identified a key property of the state space. Possible future directions include discovering macro operators from direct experience, developing ways to restrict initiation sets, and exploiting symmetries of the problem. Careful consideration will be needed to design effective methods of function approximation, both at the top level of control and also for the temporally extended actions themselves.
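As one hedged illustration of what restricting initiation sets could look like, the sketch below gates a top-level controller so that each macro operator is only selectable once the sub-structures it assumes solved are in place. The stage names ("cross", "f2l"), the macro names and the staging itself are hypothetical placeholders rather than a finalised design.

```python
# A minimal sketch, assuming a staged solve, of restricted initiation sets gating
# which macro operators a top-level controller may select. All names are hypothetical.
import random
from typing import Dict, FrozenSet, List, Tuple

# Each hypothetical macro declares the progress flags it assumes already hold:
# its restricted initiation set, expressed over solved sub-structures of the cube.
MACRO_INITIATION: Dict[str, FrozenSet[str]] = {
    "place_cross_edge":  frozenset(),
    "insert_corner":     frozenset({"cross"}),
    "orient_last_layer": frozenset({"cross", "f2l"}),
}

def available_macros(solved: FrozenSet[str]) -> List[str]:
    """Macros whose initiation set admits the current stage of progress."""
    return [name for name, req in MACRO_INITIATION.items() if req <= solved]

def choose_macro(Q: Dict[Tuple[FrozenSet[str], str], float],
                 solved: FrozenSet[str], epsilon: float = 0.1) -> str:
    """Epsilon-greedy top-level control restricted to admissible macros."""
    candidates = available_macros(solved)
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda m: Q.get((solved, m), 0.0))

# Example: with the cross solved but not the first two layers, only the first
# two macros are admissible.
print(available_macros(frozenset({"cross"})))
```

In practice the tabular value table in this sketch would be replaced by function approximation over a richer state representation, which is one of the design questions noted above.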

Beyond Rubik's cube, there are many other permutation puzzles that could be solved with the methods this research will create. More generally, combinatorial optimisation problems are widespread throughout science and engineering, and are increasingly being addressed using reinforcement learning [5]. The aim is to incorporate methods arising from this research into this wider body of work.

[1] Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10. Cambridge, MA: MIT Press.
[2] Sutton, R. S., Precup, D., and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), pp. 181-211.
[3] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, pp. 227-303.
[4] Korf, R. E. 1985. Macro-operators: A weak method for learning. Artificial Intelligence, 26(1), pp. 35-77.
[5] Li, Y., Kang, H., Ye, K., Yi, S., and Li, X. 2018. FoldingZero: Protein Folding from Scratch in Hydrophobic-Polar Model. In Advances in Neural Information Processing Systems.
