Algorithmic Support for Massive Scale Distributed Systems

Lead Research Organisation: University of Leeds
Department Name: School of Computing

Abstract

Resource scheduling in massive-scale distributed systems is the process of matching demand with supply. Demand is associated with requests for resources to execute workloads such as jobs, tasks and applications. Typical resources in a distributed computing system include servers within a data centre cluster. A scheduler aims to achieve several goals, for example maximising system throughput, minimising response time, or optimising energy usage. These goals may conflict (e.g. throughput versus latency), and the scheduler needs to strike a suitable compromise depending on the user's needs and objectives.
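As an illustration only, such a compromise can be expressed as a weighted scoring rule over candidate servers. The class names, objective terms and weights in the sketch below are hypothetical and are shown purely to make the idea concrete, not to describe the project's algorithms:

    from dataclasses import dataclass

    @dataclass
    class Server:
        free_cpu: float    # unused CPU capacity (e.g. cores)
        free_mem: float    # unused memory (e.g. GB)
        power_cost: float  # relative energy cost of placing work here

    @dataclass
    class Job:
        cpu: float         # requested CPU capacity
        mem: float         # requested memory

    def schedule(job, servers, w_pack=0.7, w_energy=0.3):
        """Pick a feasible server that best balances tight packing
        (utilisation) against energy cost."""
        feasible = [s for s in servers
                    if s.free_cpu >= job.cpu and s.free_mem >= job.mem]
        if not feasible:
            return None  # demand cannot currently be matched with supply
        def score(s):
            # smaller leftover capacity = tighter packing; smaller power_cost = cheaper
            leftover = (s.free_cpu - job.cpu) + (s.free_mem - job.mem)
            return w_pack * leftover + w_energy * s.power_cost
        return min(feasible, key=score)

    # Example: place a 2-core / 4 GB job on the better of two servers.
    best = schedule(Job(cpu=2.0, mem=4.0),
                    [Server(8.0, 16.0, 1.0), Server(4.0, 8.0, 0.4)])

Shifting weight between the packing term and the energy term moves the compromise between utilisation and energy usage; production cluster schedulers face the same trade-off, but over far more servers and under much tighter decision-time budgets.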
In a data centre with hundreds of thousands of distributed servers, massive scale is characterised by a number of factors that contribute to system complexity:
- the number of server nodes in the cluster, the interconnections between resources, and the heterogeneity of resources (different types of CPU, memory and local storage);
- the number of concurrent jobs in the system and their arrival rate;
- heterogeneity of jobs (different requirements for CPU, memory and local storage; different patterns of resource usage; long-running vs short-lived jobs; urgent jobs vs jobs with loose deadlines).
The key requirement for the system is its scalability: the ability to sustain the required throughput level (such as operations per second) while confining perceived response latencies to a level similar to that of a small or medium-sized system.
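As a minimal, purely illustrative sketch (the function name, thresholds and percentile below are assumptions made for this example, not project deliverables), the scalability requirement can be read as a simple acceptance test: did a measured run sustain the target throughput while tail latency stayed within a budget comparable to that of a small system?

    def meets_scalability_target(latencies_ms, duration_s,
                                 target_ops_per_s, latency_slo_ms,
                                 percentile=0.99):
        """True if the run sustained the target throughput while the chosen
        tail-latency percentile stayed within the latency budget."""
        if not latencies_ms or duration_s <= 0:
            return False
        throughput = len(latencies_ms) / duration_s          # completed operations per second
        ordered = sorted(latencies_ms)
        idx = min(len(ordered) - 1, int(percentile * len(ordered)))
        tail_latency = ordered[idx]                           # e.g. 99th-percentile latency
        return throughput >= target_ops_per_s and tail_latency <= latency_slo_ms

Applying the same test to clusters of a few hundred servers and of hundreds of thousands of servers captures the informal requirement above: throughput grows with the system while the latency budget remains essentially unchanged.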
In our project, we aim to address the following challenges:
(a) scheduling at scale (to make prompt scheduling decisions at a rapid rate);
(b) resource utilisation at scale (to improve utilisation of resources while maintaining high quality of service);
(c) Quality-of-Service provision at scale (to satisfy requirements of diverse workloads).
Existing scheduling algorithms developed for practical systems are often designed largely on the basis of empirical knowledge, experience and best effort. Owing to this lack of theoretical foundation, the performance of those algorithms cannot always be guaranteed. On the other hand, scheduling algorithms proposed by the theoretical community are usually based on oversimplified abstract system models. Theoretically sound algorithms, with guaranteed accuracy and time complexity, are often impractical because their system models do not reflect the complexity of real systems, and even minor adjustments of the models towards real systems render the algorithms inapplicable.
In our project, theoretical and applied experts will join forces to conduct an interdisciplinary study, overcoming the shortcomings of isolated research. Overall, our project is 1) methodologically driven, aiming to extend the applicability of the most powerful techniques of mathematical optimisation; 2) application driven, where the challenges of massive-scale distributed systems motivate new developments in scheduling methodology; and 3) practice driven, where the research direction is grounded in the hands-on experience of distributed systems specialists.

Planned Impact

This project will address scientific and practical challenges in resource scheduling and decision making for Cloud-based cluster management systems by pulling together theoretical and applied research: 1) enabling scalable and prompt scheduling decisions in modern massive-scale distributed systems, 2) allowing intelligent and autonomous management by combining toolkits of optimisation algorithms with advanced machine learning techniques, and 3) enhancing overall utilisation and reducing operational costs through appropriate resource over-subscription, autonomous system failover and energy-optimised models.
This project will proactively engage with academia, industry and the general public in order to ensure that it delivers a substantial impact on the UK's Digital Economy and on society at large. This impact will be particularly significant for the upcoming generation of massive-scale, AI-based and big-data-driven computing systems.
Academic impact will be delivered through
- publications in high impact journals and presentations at leading conferences,
- delivering a series of dissemination and training workshops at the University of Leeds, at the premises of our project partners Alibaba Group (China) and Edgetic Ltd (Leeds), and under the umbrella of the Data Centre Alliance,
- making the source code of our software prototypes publicly available,
- creating and maintaining the project website, where we will systematically present our findings; publish working papers, preprints and presentation slides; and maintain a library of the software products and benchmarks used in our experiments.
Industrial impact will be delivered through
- close collaboration with the British and international Cloud Data Centre industry (Edgetic Ltd, Alibaba Group and the Data Centre Alliance, which brings together over 120 companies across Europe and over 20 companies in China),
- publishing peer-reviewed industrial case studies verified by our project's industrial partners,
- publishing project findings on the project website and through the partners' extensive professional networks, including the Data Central Portal maintained by the Data Centre Alliance,
- organising project-themed workshops and tutorials at international conferences and within professional institutions such as the IEEE and the British Computer Society.
Economic and societal engagement will be achieved through
- knowledge transfer and commercialisation of research results,
- organising new skills training courses for Cloud and AI professionals,
- taking steps towards creating a future Doctoral Training Centre in Big Data and Intelligent Manufacturing at the University of Leeds and a university spin-out company, and enhancing partnerships with UK and international companies in the field.
