Dynamic Scaling of Distributed Dataflows under Uncertainty

Lead Research Organisation: Newcastle University
Department Name: Sch of Computing

Abstract

The key objective/aims of the research.

Stream processing engines are a common execution platform for a variety of contemporary datacentre applications. A minor degradation in application performance can have a high penalty on cloud operators. Performance variability in streaming systems arise from rapid changes in offered load, workload skew and performance interference observed when running atop public cloud infrastructures. State-of-the-art scaling controllers fail to model these interactions, instead making strong assumptions (e.g. linearity of performance under scaling) which compromise their effectiveness. We advocate for the development of approaches capable of making robust operating decisions under uncertainty.
Improvements are being developed in collaboration between Newcastle University, ETH Zurich and Oracle Labs, in automated scaling decisions for distributed streaming dataflows and demonstrating the strong potential for stream processing systems as execution platforms for performant and expressive network control. Motivated by these findings, this studentship will extend these works - and strengthen these collaborations - by incorporating more sophisticated treatment of sources of uncertainty.

What questions does the project intend to answer?

Characterisation of streaming workloads
Understanding the properties of streaming workloads strongly affect the efficient operation and management of contemporary datacenter applications, and the management of underlying infrastructure (including traffic engineering and provisioning). Naturally, these workloads vary depending on customer base, operator offering and geographical scope. We wish to study the statistical characteristics of industrial datacenter workloads and subsequently develop a model which faithfully capture those characteristics.

Instrumentating streaming systems
The state of streaming systems and the underlying infrastructure evolves rapidly, so having fresh data is critical. We will leverage and extend prior work with Red Hat in non-invasive instrumentation of streaming systems, to obtain real-time telemetry data to support online decision making. An open challenge will be addressed in developing a scalable processing architecture to ingest and process telemetry information from large deployments.

Appropriating risk models for dynamic scaling
Seeking inspiration from algorithmic trading strategies and brokerage risk models, this studentship will develop and evaluate dynamic scaling strategies which are robust to performance and workload variability, capable of modelling known volatility characteristics for model inputs.

Evaluation of stream processing engines
Evaluation of the developed approaches will be performed for prominent stream processing engines, including Apache Flink, Apache Heron and Timely Dataflow. Further opportunities for applying the developed models have been identified with Red Hat for pod scaling in the Kubernetes container orchestration system. Candidate workloads will include Nexmark benchmarks, and additional workload traces obtained from industry partners and research projects. We anticipate methodological contributions - and identification of best practice - in performance evaluation of stream processing systems.

The novel science/engineering methodology that will be carried out during the course of the project.
The appropriation of statistical methods and modelling approaches previously developed and implemented within the domain of algorithmic financial trading strategies and brokerage risk quantification and management.

Publications

10 25 50

Studentship Projects

Project Reference Relationship Related To Start End Student Name
EP/N509528/1 01/10/2016 31/03/2022
2281202 Studentship EP/N509528/1 01/10/2019 30/09/2022 Stuart Jamieson