Pin the Tail: Understanding Straggler Manifestation in Internet-based Distributed Systems

Lead Research Organisation: Lancaster University
Department Name: Computing & Communications

Abstract

Distributed systems form the foundation of Internet infrastructure and are critical for fulfilling the technological and societal needs of the digital age. Comprising Cloud datacenters, compute clusters, and the Internet of Things, these systems are responsible for the effective provisioning and execution of a multitude of parallelizable applications. The increased complexity and scale of these systems have resulted in the manifestation of an emergent phenomenon that substantially degrades overall system performance and cannot be solved by simply increasing the number of compute nodes. This phenomenon is known as The Long Tail Problem, whereby a small proportion of task stragglers - a small subset of tasks that execute abnormally slowly - impede overall job completion time; it is systemic to all distributed systems that operate at sufficient scale. While work within this area attempts to address this problem through straggler detection or mitigation, its effectiveness depends on understanding the precise underlying causes of straggler manifestation and, importantly, on determining what system conditions influence their occurrence. However, achieving this understanding is incredibly challenging given the multitude of possible straggler root-causes, all of which can stem from diverse sub-system operational characteristics and their interactions with other sub-systems. As current understanding of straggler manifestation is restricted to qualitative, high-level detail, it is presently impossible to determine which system operational conditions (e.g. cluster resource contention, temperature, failures) are highly likely to create a "perfect storm" for straggler occurrence. Determining the system conditions which influence the probability of straggler occurrence in different operational scenarios is vital to achieving predictable and rapid parallel application execution, given the continued increase in system size and complexity.
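As a purely illustrative sketch of the definitions above (it is not part of the proposed framework), the short Python example below shows why a single slow task dominates the completion time of a parallel job, and why adding more tasks makes a straggler more likely rather than less. The function names, the cutoff of 1.5 times the median duration, the sample durations, and the assumption that tasks straggle independently with a fixed probability are all hypothetical choices made for illustration.

```python
from statistics import median


def find_stragglers(durations, threshold=1.5):
    """Return indices of tasks slower than `threshold` times the median.

    The 1.5x-median cutoff is an assumed, illustrative definition of
    "abnormally slow"; it is not taken from the project itself.
    """
    cutoff = threshold * median(durations)
    return [i for i, d in enumerate(durations) if d > cutoff]


def job_completion_time(durations):
    """A job of parallel tasks finishes only when its slowest task finishes."""
    return max(durations)


def prob_job_has_straggler(p, n):
    """Probability that at least one of n tasks straggles.

    Assumes (hypothetically) that each task straggles independently
    with the same probability p.
    """
    return 1 - (1 - p) ** n


# Hypothetical task durations in seconds for one five-task job.
durations = [10.0, 11.0, 9.5, 10.5, 42.0]

stragglers = find_stragglers(durations)          # indices of slow tasks
completion = job_completion_time(durations)      # set by the slowest task
p_100 = prob_job_has_straggler(0.01, 100)        # job with 100 tasks
p_1000 = prob_job_has_straggler(0.01, 1000)      # job with 1000 tasks

print(stragglers, completion, round(p_100, 3), round(p_1000, 3))
```

Under these assumptions, four of the five tasks finish in roughly ten seconds, yet the job takes as long as its one slow task. With a 1% per-task straggler probability, a 100-task job has roughly a 63% chance of containing at least one straggler, and the chance approaches certainty as the task count grows, which is one way to see why scaling out alone does not remove the long tail.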

The vision of this proposed research is to address our limited understanding of straggler manifestation by conducting in-depth analysis and modelling of Internet-based distributed systems to quantify the precise relationship between straggler occurrence and system behaviour. This study will involve analysing and modelling stragglers within real systems, performed through comprehensive experimentation to identify and extract key system parameters from virtual and physical sub-system operation across the entire distributed system architecture. A framework capable of automated analysis will be constructed to determine straggler root-cause within production systems; it will interface with an event-based simulation engine for determining the optimal system conditions for avoiding stragglers.

By working with leading international industrialists in massive-scale distributed systems, this work represents a significant step change towards solving The Long Tail Problem by providing much sought-after knowledge to truly understand straggler manifestation. As this problem is systemic across every type of large-scale distributed system, this work will have far-reaching implications for both academia and industry, and will provide direct benefit to the competitiveness of the UK's digital economy in both the short and long term. This grant represents the first step towards realizing the research ambition of scientifically understanding the operation of massive-scale Internet infrastructure, enabling the design of fault-tolerant techniques for future systems at unprecedented scale - a crucial objective towards realizing key emergent technologies for the future.

Planned Impact

The proposed work will provide key knowledge to fully understand precisely how and why stragglers manifest under different system conditions in Internet-driven distributed systems. This will allow computer scientists at all career levels to exploit the developed model and framework to evaluate proposed techniques for fault-tolerance, scheduling, and speculative execution without omitting assumptions about straggler manifestation and its impact within computing infrastructure. The research findings are foreseen to provide the following direct impact:

Cloud datacenter industry: Internet service providers such as Microsoft, Google, Facebook and Alibaba operate massive-scale distributed systems, and all seek a solution to The Long Tail Problem. The analysis findings and developed model will give them greater insight into the operation of their own systems and into how stragglers manifest. The framework can be used to determine the prevalence of stragglers and the impact they impose on their datacenters, allowing focused development effort on mitigation. This will lead to economic gains and a competitive advantage for providers capable of increasing the likelihood of predictable service execution, giving enhanced user experience and reduced operational costs. Likewise, UK-based datacenters are increasingly offering Big Data services to consumers; the proposed simulation framework therefore provides guidance for calculating the optimal system size with respect to performance gains for parallel application execution.

Big Data Tools: Analysis findings will inform the design and development of Big Data services that heavily exploit parallel execution, improving their ability to minimize straggler occurrence, which directly impacts application performance. This will make it possible to provision timely service, yielding greater economic gain.

General Public: Enhanced user experience and performance gains for Internet services which use parallel application execution. This is particularly important when service demand is extremely high due to popular or significant world events which drive unexpected application demand.

National: The ability to create larger-scale Internet-based distributed systems with an increased likelihood of meeting timing guarantees - an essential requirement for Fog computing and emergent technology areas including Smart Cities and the Internet of Things. Moreover, this work will provide a foundation for Internet service providers to enhance user experience by reducing performance slowdown caused by stragglers.

Business: Any organisation or business which operates a distributed system that leverages parallel execution can exploit the analysis findings or framework to perform complex failure root-cause analytics, as well as quantify the monetary and performance impact of stragglers. This will result in economic benefit due to automated root-cause analysis.

Publications

Description: Computing systems and the Internet are increasingly complex, making it difficult to understand and track what happens when things go wrong. Slow search engine results or webpage loading may stem from a vast number of causes: failure, energy management, data transfer, monitoring systems, network traffic. Similar to problems that occur within the human body, successfully isolating the precise causes of slowdown is very difficult due to the high complexity and stochastic behaviour of large computing systems. Such slowdowns are termed stragglers, and they remain an unsolved issue within Computer Science.

Through this award, we have achieved the following:

1) Identification of different types of stragglers which manifest within resource schedulers at scale (specifically, the impact of stragglers within computing resource oversubscription) from real-world computing systems.

2) Design and implementation of a new kind of speculative resource manager and oversubscription technique that mitigates complex phenomena (including stragglers), and that has been used by Alibaba Group, China, which provides service to hundreds of millions of customers.

3) Determined that straggler manifestation results from differing utilization properties (CPU processing), and that whilst it is not possible to perfectly replay each executed job, the environmental conditions which aggravate straggler occurrence can be precisely controlled.

4) Designed an automated experiment framework (PRISM) to deploy, monitor, collect, and analyse straggler behaviour from distributed systems.

Exploitation Route: The resource scheduler designed and published can be leveraged by any tech company operating a large-scale cluster, as well as by academic researchers within distributed systems research.

The PRISM framework enables Computer Science researchers to accelerate experiment design and analysis.

Identification of queue blocking as a potential cause of straggler manifestation within containerized cluster environments.

Sectors: Digital/Communication/Information Technologies (including Software)

 
Description: Findings from this research (e.g. over-subscription, ROSE) have been deployed and integrated into large-scale computing systems operated by Alibaba Group, enabling higher performance and resilience when provisioning digital services to a billion users.
First Year Of Impact: 2019
Sector: Digital/Communication/Information Technologies (including Software)
Impact Types: Economic

 
Description: Reducing the Global ICT Footprint via Self-adaptive Large-scale ICT Systems
Amount: £832,044 (GBP)
Funding ID: EP/V007092/1
Organisation: Engineering and Physical Sciences Research Council (EPSRC)
Sector: Public
Country: United Kingdom
Start: 05/2021
End: 05/2025