Network Measurement as a Service

Lead Research Organisation: University of Edinburgh
Department Name: School of Informatics

Abstract

Recent advances in server and network virtualisation have given rise to the Infrastructure-as-a-Service paradigm where businesses can lease resources from cloud datacentre operators, thus enabling the outsourcing of ICT. Such businesses can themselves be application and service providers who act as tenants of a shared data centre infrastructure. The tenants resize their ICT footprint through the pay-as-you-go pricing model, thereby maintaining low capital (and operational) expenditure and increasing their profit margin. This infrastructural abstraction allows tenants to focus solely on their business delivery model while leaving the infrastructure maintenance to the operators.
However, the resulting lack of visibility into the dynamic state of the underlying infrastructure can severely hurt tenants' services when infrastructure performance fluctuates over short timescales. This deters more businesses from migrating to the cloud and forces them instead to maintain their own in-house infrastructure.

Adding to the problem, security risks are more acute in the cloud. Attackers can leverage cloud servers to launch DDoS attacks against other tenants or to run faster port scans that identify vulnerable services. Tenants in particular are completely excluded from detecting security threats and from taking remedial action autonomously as incidents unfold. Vulnerable services can end up consuming immense amounts of compute and network resources, leading to unsustainable bills for tenants, who may ultimately have to withdraw their services from the cloud. Existing measurement and monitoring approaches are inadequate because they are architected specifically for accounting, traffic engineering or offline debugging. Measurements from these approaches give no indication of whether an application is suffering from self-induced congestion or a cyber-attack, whether other flows or applications are at fault, whether unacceptable latencies stem from long queueing delays at a particular switch or application component, or how many flows are affected. While addressing these problems is important to cloud operators, doing so in a timely fashion is often simply impossible, because software and hardware updates take time and new pathological traffic patterns may arise as applications evolve.

The overarching goal of this project is to design and develop a native Network Measurement-as-a-Service (NMaaS) framework that will allow tenants to express their measurement needs, and to subsequently synthesise the corresponding complex service-level performance functions out of simple monitoring primitives. The required primitive measurement components will be dynamically and transparently instantiated when and where required throughout the infrastructure, exploiting the temporarily available capacity of servers and network nodes. In particular, we aim to:

- devise novel server and switch instrumentation capabilities for traffic monitoring and make them a native part of the underlying infrastructure, so that they can support diverse measurement functions while reducing measurement errors and uncertainties

- develop a network-wide, centrally-orchestrated algorithm for the synthesis of complex metrics through the optimal placement of server-based and switch-based measurement functions in virtual and physical network components

- design and develop measurement requirement description APIs to parse high-level measurement specifications issued by tenants and transform them into low-level measurement indicators.
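As a purely illustrative sketch of the last two aims, the Python below turns a tenant's high-level request into simple primitives and assigns each primitive to a node with spare capacity. Everything in it is an assumption made for illustration: the metric names, the primitive catalogue, the `Node` type, the comma-separated request format and the greedy placement rule are hypothetical and are not the project's API or algorithm. In particular, a greedy rule stands in for the optimal placement the project proposes.

```python
from dataclasses import dataclass

# Hypothetical catalogue: high-level metric -> the simple primitives it is built from.
PRIMITIVES = {
    "flow_latency": ["ingress_timestamp", "egress_timestamp"],
    "loss_rate": ["packet_counter_in", "packet_counter_out"],
    "queue_delay": ["queue_depth_sample"],
}


@dataclass
class Node:
    name: str            # e.g. a server or a switch (hypothetical identifiers)
    spare_capacity: int  # abstract units of currently free CPU/memory


def parse_spec(spec: str) -> list[str]:
    """Map a request such as 'flow_latency, loss_rate' to an ordered,
    de-duplicated list of low-level primitives."""
    primitives: list[str] = []
    for metric in (m.strip() for m in spec.split(",")):
        if metric not in PRIMITIVES:
            raise ValueError(f"unknown metric: {metric!r}")
        for primitive in PRIMITIVES[metric]:
            if primitive not in primitives:
                primitives.append(primitive)
    return primitives


def place(primitives: list[str], nodes: list[Node], cost: int = 1) -> dict[str, str]:
    """Greedy stand-in for placement: put each primitive on the node that
    currently has the most spare capacity, charging `cost` units each time."""
    remaining = {n.name: n.spare_capacity for n in nodes}
    placement: dict[str, str] = {}
    for primitive in primitives:
        best = max(remaining, key=remaining.get)
        if remaining[best] < cost:
            raise RuntimeError("insufficient spare capacity")
        placement[primitive] = best
        remaining[best] -= cost
    return placement


if __name__ == "__main__":
    wanted = parse_spec("flow_latency, loss_rate")
    print(place(wanted, [Node("server-1", 2), Node("switch-1", 3)]))
```

A real orchestrator would also have to account for where the relevant traffic actually flows and how capacity changes over time, which this sketch ignores.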

Ultimately, we aim to demonstrate that the proposed framework will contribute significantly to maintaining the desired application performance while at the same time improving the utilisation of cloud resources. Given that the cloud is still a rapidly growing global business, we anticipate that the research outcome will greatly benefit the wider IT industry.

Planned Impact

Our research will enable new measurement services for users of virtualised computing infrastructures such as the cloud, allowing them to monitor performance at a finer granularity and to troubleshoot their applications in a timely manner. As such, the major beneficiaries of our research (apart from academic researchers) are industry parties who hold a stake in the development and operation of future cloud infrastructures, or who run services in the cloud. In a broad context, our research can enrich the ecosystem that surrounds virtualised infrastructures.

Cloud users (i.e., tenants) can significantly reduce the cost of building and maintaining their own monitoring tools, and manage their services in a more predictable manner. Cloud operators should be able to improve infrastructure utilisation without negatively impacting the performance of tenant applications. Software and hardware vendors can equip their products with new resource-aware monitoring and measurement functions for virtualised infrastructures, making them more competitive in their respective markets.

Description: Cloud infrastructures are becoming larger and more complex. To serve cloud users' applications and online services effectively, it is critical to monitor and manage these infrastructures well. In the research funded by this grant, we investigated current monitoring and management practices for such large infrastructures and found that existing research proposals fall short: they cannot achieve fine-grained network monitoring because of a lack of either resources (e.g., CPU and memory) or visibility (e.g., actual events are not monitored). Our observation led to a new design that addresses the shortcomings of current solutions by enabling distributed monitoring functionality in the cloud. We are currently working on a proof of concept for the design and on its validation.
Exploitation Route: Our work can improve the state of the art in cloud network monitoring and management. In particular, it complements emerging programmable switch architectures, so researchers and engineers may be interested in extending it by exploiting programmable switch approaches. For example, Barefoot Networks, a startup in the US, has a cloud network monitoring solution based on In-band Network Telemetry (INT), which could directly benefit from our work to further improve its performance and efficacy.
Sectors: Digital/Communication/Information Technologies (including Software)