Unsupervised Learning of World Models

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Artificial Intelligence research has made breathtaking progress in the last 50 years. The most general paradigm for learning to act intelligently is reinforcement learning: the agent is trained to map perceptions coming from its sensors to actions that maximise reward. The revival of reinforcement learning came only in the last decade, when progress in deep learning was exploited to approximate this mapping with a neural network, and when the exponentially growing amounts of data and computational power needed to train such networks became available. Given abundant data and compute, machines can now play Atari games, Go (AlphaZero) and Dota 2 (OpenAI Five).
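As an illustration of this paradigm (not part of the original abstract), the sketch below implements the agent-environment loop with a tabular value function on a toy problem; the environment, names and hyperparameters are illustrative assumptions, and deep reinforcement learning would replace the table with a neural network:

```python
# Minimal sketch of the reinforcement-learning loop: an agent maps
# observations to actions so as to maximise reward. Everything here
# (the toy environment, the tabular value function) is a stand-in.

import random
from collections import defaultdict

class ToyEnv:
    """Two states; the action matching the current state pays off."""
    def reset(self):
        self.state = random.randint(0, 1)
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = random.randint(0, 1)
        return self.state, reward

env = ToyEnv()
q = defaultdict(float)     # q[(state, action)] -> estimated value
alpha, epsilon = 0.1, 0.1  # learning rate, exploration rate

state = env.reset()
for _ in range(1000):
    # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
    if random.random() < epsilon:
        action = random.randint(0, 1)
    else:
        action = max((0, 1), key=lambda a: q[(state, a)])
    next_state, reward = env.step(action)
    # One-step update towards the observed reward (discounting is
    # omitted because rewards in this toy problem are immediate).
    q[(state, action)] += alpha * (reward - q[(state, action)])
    state = next_state

print({k: round(v, 2) for k, v in q.items()})  # matching actions near 1.0
```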

However, this setting is rarely available, making current AI algorithms hard to use in the real world. First, there are sample-efficiency problems: to teach a self-driving car with current algorithms, the car would have to drive off a cliff many times to learn that this is not good behaviour. Moreover, in many applications the required data is impossible to collect (requiring systems to extrapolate, for example when the distribution changes over time), and collecting many trials incurs intolerably high costs in money and time. Secondly, although reinforcement learning was originally intended as a method that requires little human intervention and that scales with increased data and computation, it requires a careful specification of rewards that is sometimes harder to provide than demonstrating the correct behaviour directly. Furthermore, in contrast to narrow tasks such as games, where rewards are easy to define (e.g. +1 for winning), reward specification for general intelligence is far from clear: if rewards and values are misaligned, the AI will not behave as humans intended. Finally, current systems cannot solve even slightly more complex environments (such as the Atari game Montezuma's Revenge), where the reward comes only after a long sequence of actions that is unlikely to occur during random exploration, and where it is hard to assign credit to each action taken.

In contrast to current machines, humans have a remarkable ability to learn and exploit a model of the world that can be used to aid subsequent learning, to make inferences that go beyond the available data, and to extrapolate from sparse information. For example, given a visual scene from an Atari game, we naturally decompose the pixel input into a set of objects, inferring properties and relationships that cannot be directly observed. Our research aims to replicate this human success by building general learning methods that accomplish the same tasks.

1. We aim to embed priors and build machines that achieve generalisation performance similar to humans, while keeping the algorithms scalable so that they can continually improve their representations and models of the world as more data and compute become available.

2. Algorithms should learn in a purely unsupervised fashion, without requiring human annotation, and should be trainable end-to-end with gradient descent. The latter requires building new methods that are fully differentiable.

3. The learnt model should be a generative causal model of the world that can explain what will happen next as a consequence of the agent's actions. Once the model is learnt, it can be used for planning and policy learning, and should outperform model-free methods (a minimal sketch of such a model appears after this list).

4. The world model should also help the agent to tackle partial observability of the environment, since the sensors do not capture all the information contained in the history of observations.

5. The model should encode very high-dimensional observations into a small latent space, helping to tackle the computational and sample-complexity challenges of policy selection. Our goal is to build a system that reasons compositionally and that generalises well even when the distribution shifts considerably, e.g. we should be able to train the model on one set of scenes and have it generalise to novel compositions of objects.
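To make aims 2, 3 and 5 concrete, here is a minimal, hypothetical PyTorch sketch of a latent world model: an encoder compresses high-dimensional observations into a small latent space, a transition network predicts the next latent state from the current latent and the action, and a decoder reconstructs observations so that everything trains end-to-end with gradient descent. All architecture sizes and names are illustrative assumptions, not the project's actual design:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, LATENT_DIM = 64, 4, 8

class WorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: high-dimensional observation -> small latent code.
        self.encoder = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(),
                                     nn.Linear(32, LATENT_DIM))
        # Transition: (latent, action) -> predicted next latent state.
        self.transition = nn.Sequential(nn.Linear(LATENT_DIM + ACT_DIM, 32),
                                        nn.ReLU(),
                                        nn.Linear(32, LATENT_DIM))
        # Decoder: latent -> reconstructed observation.
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(),
                                     nn.Linear(32, OBS_DIM))

    def forward(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_next_pred = self.transition(torch.cat([z, action], dim=-1))
        # Two fully differentiable losses: reconstruct the current
        # observation and predict the next one through the latent dynamics.
        recon_loss = (self.decoder(z) - obs).pow(2).mean()
        pred_loss = (self.decoder(z_next_pred) - next_obs).pow(2).mean()
        return recon_loss + pred_loss

model = WorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train on random stand-in data; a real agent would use logged
# (observation, action, next observation) transitions.
for _ in range(100):
    obs, next_obs = torch.randn(16, OBS_DIM), torch.randn(16, OBS_DIM)
    action = torch.randn(16, ACT_DIM)
    loss = model(obs, action, next_obs)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, such a model can be rolled out in latent space to evaluate candidate action sequences, which is the basis of the model-based planning described in aim 3.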

Publications


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/R513209/1                                      01/10/2018   30/09/2023
2420831             Studentship    EP/R513209/1   01/09/2020   30/04/2024   Titas Anciukevicius
EP/T517884/1                                      01/10/2020   30/09/2025
2420831             Studentship    EP/T517884/1   01/09/2020   30/04/2024   Titas Anciukevicius