Delay and Cooperation in Bandits and Reinforcement Learning

Lead Research Organisation: Imperial College London

Department Name: Mathematics

Abstract

Bandits and Reinforcement Learning are subfields of machine learning concerned with sequential decision-making. Here, the agent must balance exploiting what they know to maximise the objective function, and exploring to gain more information about the environment to improve future decision-making. Many high-stakes applications, such as healthcare, finance and education require algorithms addressing this exploration-exploitation dilemma. Such algorithms should perform well empirically and have strong theoretical guarantees.

Currently, most research in bandits and reinforcement learning assumes that feedback from the environment arrives immediately and there is a single decision-maker. However, the assumption of immediate feedback is unrealistic in practice and many applications involve a network of distributed decision-makers that can cooperate.

Firstly, we consider a variant of the reinforcement learning problem where feedback arrives at some unknown future time-point. Here, we develop generic black-box algorithms and use them as a theoretical tool to quantify the impact of delayed feedback. Generally speaking, we were able to show that the delays increase the regret of the agent by an additive penalty depending on the expected delay. Secondly, we consider a variant of the bandit problem where the reward function is a generalised linear model, which has found applications in recommender systems and healthcare. Once again, we assume that the feedback is delayed and develop a novel but simple algorithm whose regret increases by an additive penalty involving the expected delay. Indeed, the technical contributions are likely applicable to the full reinforcement learning problem under appropriate assumptions. Thirdly, we study the cooperative stochastic bandit problem, where a network of decision-makers collaborate to find the optimal action. We provide a black-box reduction that allows us to extend any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions, we prove that our reduction transfers the regret guarantees of the single-agent algorithm to the multi-agent setting. Our reduction and theoretical results are also general and apply to many different bandit settings. By plugging in appropriate single-player algorithms, we can easily develop provably efficient algorithms for many multi-player settings such as heavy-tailed bandits, duelling bandits and bandits with local differential privacy, among others.

This project falls within the EPSRC Artificial Intelligence and Robotics and Mathematical Sciences research areas.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their career. There are a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Student:

Benjamin Howson

Period of Study:

Oct 20 - Mar 25

Funder:

EPSRC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2445745

Research Topic:

Unclassified

Organisations

Imperial College London (Lead Research Organisation)

People	ORCID iD
Benjamin Howson (Student)

Publications

Author Name

Title Publication Date Published

10 25 50

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
EP/S023151/1			31/03/2019	29/09/2027
2445745	Studentship	EP/S023151/1	02/10/2020	30/03/2025	Benjamin Howson