Decentralised, Large-scale Resource Management in Modern Data Centres

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

The backbone of modern, world-wide Information Technology (IT) and Cloud infrastructure consists of a global network of data centres (DCs), each equipped with thousands of server machines. Modern large DCs contain 50,000 to 100,000 server machines and run a diverse set of application workloads. Reports show that about three million DCs containing 12 million server machines run all US online operations. We therefore face an environment for application deployment of unprecedented scale, both in the number of server machines and in the number of applications.

The enormous scale of modern DCs dramatically affects their capital and operational costs. Capital costs include all initial spending on DC equipment, including server machines; operational costs cover the DCs' daily operation, including electricity consumption and the salaries of management personnel. The costs of running DCs are enormous: reports show that world-wide spending on DC systems was $170 billion in 2015 and was expected to grow by 3% to $175 billion in 2016.

Given this high expenditure, it is of paramount importance that modern DCs operate cost-effectively, i.e. server machines are fully utilised by running applications, and applications are adequately provisioned to meet their performance goals. However, numerous reports show that machines in DCs are on average only 10-15% CPU-utilised. The main cause of low utilisation has been the practice of over-provisioning applications with resources to match even their most demanding workloads, however rare they might be. As workloads are typically time-varying with unknown variations, this practice has led to dramatic under-utilisation of modern DC resources and consequently to excess DC expenditure. Furthermore, practitioners report that current management frameworks are inadequate for performing scalable operational tasks in large-scale environments such as the Cloud. It is therefore an open challenge to tackle the resource management problem in modern large-scale DCs and increase overall resource utilisation while satisfying applications' performance demands.

We propose a new decentralised resource management approach to tackle the under-utilisation problem of DCs.
We envisage a decentralised scheme where resource schedulers are distributed across the DC and each scheduler controls the resource allocation of a subset of the DC machines, referred to as a cluster, containing a few hundred servers. The use of cluster schedulers aims to increase the effective utilisation of machines within a cluster in a timely fashion. Global resource planning across all DC servers is achieved through decentralised coordination of all schedulers: schedulers communicate to exchange the resource utilisation of their clusters and application performance information, so that the system converges globally. To increase overall utilisation, the goal is to balance the load across all clusters while avoiding hotspots and under-utilisation. The novelty of this work lies in the coordination of the distributed set of cluster schedulers for global resource planning; we aim to use a distributed optimisation and control approach.
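To illustrate the kind of neighbour-to-neighbour coordination envisaged, the sketch below is a simplified toy model, not the project's implementation: the scheduler count, per-cluster loads and ring topology are hypothetical. Each cluster scheduler exchanges its utilisation only with its two ring neighbours and repeatedly averages, so per-cluster loads converge to the global mean without any central controller.

```python
# Minimal sketch (assumed topology and loads): decentralised load
# balancing among cluster schedulers arranged in a ring. Equal 1/3
# weights on self and both neighbours make the update doubly
# stochastic, so the total load is preserved while individual loads
# converge to the global average.

def balance_step(loads):
    """One synchronous gossip round on a ring of schedulers."""
    n = len(loads)
    new = []
    for i in range(n):
        left, right = loads[(i - 1) % n], loads[(i + 1) % n]
        new.append((loads[i] + left + right) / 3.0)
    return new

def balance(loads, rounds=200):
    """Repeatedly gossip until loads are (numerically) balanced."""
    for _ in range(rounds):
        loads = balance_step(loads)
    return loads

if __name__ == "__main__":
    # Hypothetical per-cluster CPU loads (fraction of capacity).
    loads = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3, 0.6, 0.4]
    print(balance(loads))  # every entry close to the mean, 0.475
```

Because the averaging matrix is doubly stochastic, no load is created or destroyed; the schedulers merely agree, in a fully decentralised way, on how much each cluster should carry.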

The potential impact of this work is substantial. We anticipate an impact on the Economy of the DC sector and in the domains of People and Knowledge, as the proposed work will assist the development of IT administrators' skills.
The ultimate beneficiary is Society, in particular developers and end-users of Cloud and IT applications. The UK currently holds the largest European data centre market; the proposed research has the potential to significantly strengthen the UK's position in this important sector and enhance its international standing.

Planned Impact

There are three main beneficiaries from this research which seeks to develop a new decentralised and large-scale resource management approach for modern data centres.

1. Data Centre Infrastructure Sector.
We identify this sector as the primary beneficiary group of the proposed research. The impact in this sector
is on the Economy of data centres, by potentially reducing their capital and operational costs.

The data centre management sector includes management software and IT management personnel that deal with
the purchasing, allocation and management of machine resources for hosted applications. Reports show that, of the
£5.7 billion UK data centre market, 51.2% is spent on IT staff and 17.7% on infrastructure software, with
£10.9 billion in total allocated to this sector. This high amount reflects the relative importance of
the management domain in delivering efficient data centre services to applications. The potential impact of this
proposal would be to reduce management expenditure by laying the ground for an autonomic, decentralised
management framework with which data centre administrators can carry out their work and increase server machines' resource utilisation.

Increasing servers' effective utilisation has an immediate effect on the number of servers used, as more
applications can be consolidated onto fewer servers. It is well known that power and cooling expenses in
data centres account on average for 30% of total spending. Using fewer servers therefore reduces
a data centre's energy footprint. The proposed work has the potential to reduce the number
of server machines used within a data centre to host applications at large scale, and so to reduce the
energy footprint of data centres.

2. Data Centre Management Personnel.
Currently, there is a lack of trained personnel to tackle the resource management challenges in
large-scale modern data centres. Today's approach to infrastructure management remains ad hoc, lacks scalability and operates in an offline fashion: for example, to identify load hotspots, administrators may use proprietary management tools, stand-alone data analysis systems or custom processing scripts to explore data logged over a finite time window such as one hour. Fundamentally, these approaches are unable to analyse terabytes of performance data across the entire infrastructure. One of the project's objectives is for our approach to perform management actions autonomically, and so assist administrators in managing resources at large scale.

The expected impact in this case is in the domains of People and Knowledge, as the proposed work will help develop the skills of IT administrators in intelligently managing large-scale data centres. Furthermore, the
proposed work will produce new techniques and scientifically advance the field of large-scale distributed systems in resource management, as further detailed in the Academic Impact section.

3. End-users and developers of Cloud Applications.
The ultimate beneficiary is Society and in particular developers and end-users of Cloud and IT applications.
Facilitating and enhancing resource management in data centres will ultimately impact the quality of service that end-users receive from applications. For example, when there is a sudden surge in user demand on a particular application, it is very important that the application is provisioned properly to cope with the surge. Our proposal on decentralised control aims to improve an application's responsiveness to increased demand and so
deliver good quality of service to end-users at all times. In addition, with the support of automated tools, application
developers can concentrate more on feature development rather than low-level infrastructure management, and so deliver better services to modern society.
 
Description The work of this grant is a multi-disciplinary approach to large-scale management of data centres using distributed optimisation. Work carried out mostly after the end of the grant showed that we can control resource allocation at large scale using distributed coordination mechanisms and distributed optimisation techniques such as ADMM. Evaluation results from a simulator show that our work scales to 10,000 nodes and converges in just a few iterations.
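To make the ADMM-based coordination concrete, here is a minimal global-consensus ADMM sketch; it is an illustrative toy under stated assumptions (quadratic local costs, and hypothetical loads, penalty parameter and iteration count), not the grant's prototype. Each scheduler i minimises a local cost (x_i - l_i)^2 around its measured cluster load l_i, and the consensus variable z, agreed by all schedulers, converges to the global average load in a few iterations.

```python
# Minimal sketch (assumed costs and parameters): global-consensus ADMM
# across cluster schedulers. Local cost for scheduler i is
# f_i(x) = (x - l_i)^2 with consensus constraint x_i = z for all i.

def consensus_admm(loads, rho=1.0, iters=50):
    n = len(loads)
    x = [0.0] * n   # local estimates, one per scheduler
    u = [0.0] * n   # scaled dual variables
    z = 0.0         # consensus (global) variable
    for _ in range(iters):
        # x-update: closed form for the quadratic local cost.
        x = [(2.0 * l + rho * (z - ui)) / (2.0 + rho)
             for l, ui in zip(loads, u)]
        # z-update: average of local estimates plus duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # dual update: accumulate the consensus residual.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z

if __name__ == "__main__":
    # Hypothetical per-cluster loads (fraction of capacity).
    loads = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3]
    print(consensus_admm(loads))  # converges to the mean load
```

In this toy the error in z shrinks by a constant factor per iteration, which is consistent with the few-iteration convergence reported above, although the project's actual objective functions and coordination protocol are richer than this quadratic example.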
Exploitation Route Our work can be used by others in two main ways. First, researchers and practitioners can directly use our work through the prototype Kubernetes implementation. Second, once our work is published, others can extend our ideas and potentially make new contributions to the resource management field, especially at large scale.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description EPSRC Impact Acceleration Accounts fund
Amount £20,575 (GBP)
Organisation University of Cambridge 
Sector Academic/University
Country United Kingdom
Start 03/2023 
End 03/2023
 
Description Alan Turing Institute and University of Cambridge, Mr. A. Grammenos 
Organisation Alan Turing Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution I have included Mr Grammenos in my collaboration with Aalto, as Mr Grammenos provides valuable contributions to problem formulation, problem solving and experimental evaluation. In 2022 Mr Grammenos completed his PhD with The Alan Turing Institute and the University of Cambridge. I no longer have a collaboration with The Alan Turing Institute, as Mr Grammenos is now a Visiting Researcher at the University of Cambridge. I am no longer a fellow at The Turing Institute.
Collaborator Contribution Mr A. Grammenos has now completed his PhD at the Alan Turing Institute and the University of Cambridge, Department of Computer Science and Technology. Mr Grammenos is working on this project on problem formulation, problem solving and experimental evaluation.
Impact We have submitted one paper to the TCNS journal, and two more papers are about to be submitted to the CDC 2021 conference by the 18th of March.
Start Year 2020
 
Description Dr Andreas Grammenos is a visiting post-doctoral researcher at the Department of Computer Science and Technology, University of Cambridge working on topics related to this grant. 
Organisation University of Cambridge
Country United Kingdom 
Sector Academic/University 
PI Contribution Dr Grammenos is working closely with me and Dr Charalambous on data center management using distributed optimisation and in some cases federated learning. Dr Grammenos is an expert in systems and federated learning.
Collaborator Contribution Dr Grammenos is working closely with me and Dr Charalambous on data center management using distributed optimisation and in some cases federated learning. Dr Grammenos is an expert in systems and federated learning.
Impact Several papers included in the list of papers associated with this grant.
Start Year 2021
 
Description University of Cyprus (2021 - now), Dr T. Charalambous, distributed optimization in data center management 
Organisation University of Cyprus
Country Cyprus 
Sector Academic/University 
PI Contribution I initiated this collaboration as Dr T. Charalambous (partner) is an expert in distributed optimization, which is essential to solving the main problem of the grant as initially envisaged. My contribution is in problem and model formulation, solution and results calibration, and paper writing. Together with Dr Charalambous and partners from KU Leuven and the Universita Campus Bio-Medico di Roma, we applied for a HORIZON-EIC-2023-PATHFINDEROPEN-01 European grant to further continue this work (proposal submitted in March 2023). Previously I had also applied for CHIST-ERA European funding and an ERC Advanced Grant on this topic; both applications were rejected.
Collaborator Contribution Dr Charalambous is responsible for applying existing distributed optimization techniques and analytical models to solve our problems. Inspired by our work, Dr Charalambous is also working on proposing novel distributed optimization algorithms to solve the current and related problems.
Impact The collaboration is multi-disciplinary, as it combines the disciplines of Computer Science (distributed systems, performance management) with distributed optimization. We are currently working on three different papers. In 2023 we had a paper accepted in TNSE, and two papers were accepted at CDC 2021.
Start Year 2020