Machine Learning-Driven Generation of Congestion Control and Flow Scheduling Algorithms for Improving Data Centre Performance

Lead Research Organisation: University of Sussex
Department Name: Sch of Engineering and Informatics

Abstract

Data centres consist of racks of commodity servers interconnected by multiple high-bandwidth links. The Transmission Control Protocol (TCP), the de facto protocol for reliably transmitting data, is not well suited to data centres: it is single-path, and its congestion control algorithm, which regulates senders so that network congestion is minimised while resources are utilised efficiently and fairly, follows a one-size-fits-all approach. Data centres, on the other hand, offer high-speed, low-latency links and multiple paths among servers, and host applications with diverse data transport requirements and constantly changing network workloads.

This project aims to use machine learning techniques, such as Bayesian optimisation and Gaussian processes, the parallelism offered by modern Graphics Processing Units (GPUs), and Software Defined Networking (SDN) to automate congestion control and flow scheduling, so that data centre network resources are utilised efficiently even under dynamically changing network traffic.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509784/1                                   01/10/2016  30/09/2021
1804241            Studentship   EP/N509784/1  01/10/2016  30/09/2020  Luca Giacomoni
 
Description - We are developing a congestion control learning framework that combines several technologies to provide researchers (and potentially practitioners) with a scalable, flexible and easy-to-use training platform. The system was initially distributed, so that processing power could scale up to accommodate large network simulations, and was built on top of one of the de facto network simulators in the field (OMNeT++) and the widely used TensorFlow and Keras libraries. After further experimentation, we decided to replace the TensorFlow Python component with LibTorch (the C++ API for PyTorch) and to centralise our learning framework. The advantages of using LibTorch in our case are threefold: full control over determinism and reproducibility of results; support for on-line learning algorithms, in which the actor and critic are updated at each interaction step with the environment; and the ability to implement the learning algorithm in C++, which reduces the overhead of cross-language (Python/C++) communication.
The ability to implement and evaluate any learning algorithm for generating congestion control protocols within our framework opens up a large number of interesting research questions.


- Preliminary results using Reinforcement Learning (specifically, the Deterministic Policy Gradient algorithm) to train protocols on simple networking scenarios have shown good convergence properties, and the generated protocols perform near-optimally. We are currently experimenting with non-trivial networking scenarios to measure the convergence time and the performance of the generated protocols. Convergence time appears to depend largely on the complexity of the networking scenario, ranging from 4 hours for a trivial scenario to more than 24 hours for more complicated networks. To speed up convergence, we plan to offload training to a GPU and the simulators to an HPC system. We were awarded the Amazon Research Grant and plan to use the award for extensive experimentation on CPU-optimised and GPU servers.

- The project initially proposed to use Bayesian Optimisation and Gaussian Processes to optimise the performance of congestion control protocols. After a systematic literature review, we opted for a more general approach (Reinforcement Learning), since it readily allows extensions and modifications of the learning algorithm and can also accommodate Bayesian theory and stochastic processes in the training process. Reinforcement learning has also been shown to be effective in many decision-making problems in other domains.
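
As a rough illustration of the per-step actor and critic updates mentioned above, the following Python sketch trains a deterministic policy on a toy single-bottleneck task. It is not the project's LibTorch/OMNeT++ implementation: the reward function, the linear critic, the function name `train_online_actor_critic` and all step sizes are assumptions chosen only to keep the example self-contained.

```python
import random


def train_online_actor_critic(capacity=0.8, steps=5000, seed=0):
    """Toy per-step actor-critic loop with a deterministic policy.

    Hypothetical setting (not from the project): one sender, one
    bottleneck link with normalised capacity `capacity`, and
    reward = -(rate - capacity)^2.
    """
    rng = random.Random(seed)   # fixed seed -> reproducible run
    theta = 0.2                 # actor: deterministic sending rate
    w, v = 0.0, 0.0             # critic: Q(a) = v + w * (a - theta)
    noise = 0.3                 # exploration range around the policy
    lr_actor, lr_w, lr_v = 0.002, 1.0, 0.1   # assumed step sizes

    for _ in range(steps):
        # Interact: act with exploration noise and observe the reward.
        e = rng.uniform(-noise, noise)
        rate = theta + e
        reward = -(rate - capacity) ** 2

        # Critic update: one-step error (no bootstrapping, since this
        # toy task has a single state).
        delta = reward - (v + w * e)
        v += lr_v * delta
        w += lr_w * delta * e

        # Actor update: follow dQ/da, which equals w for this critic.
        theta += lr_actor * w

    return theta, w, v


if __name__ == "__main__":
    theta, w, v = train_online_actor_critic()
    print(f"learned rate: {theta:.3f} (capacity 0.8)")
```

With these assumed settings the learned rate should settle close to the link capacity, and the fixed seed makes repeated runs produce identical results, in the spirit of the determinism noted above.
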
Exploitation Route - Our framework allows experimentation with different learning algorithms and a very diverse set of target networks. Researchers in both the computer networks and reinforcement learning fields could benefit from our system. In addition, a systematic study of the applicability of reinforcement learning algorithms to the congestion control problem would provide key insights for future extensions and a move towards more intelligent networks. Finally, good protocol performance and short protocol generation times, compared with human-designed protocols, would be of interest to practitioners, particularly data centre administrators.
Sectors - Digital/Communication/Information Technologies (including Software)