Intelligent Management of Big Data Storage

Lead Research Organisation: Imperial College London
Department Name: Dept of Computing

Abstract

The continuing revolutionary growth of data volumes and the increasing diversity of data-intensive applications demand an urgent investigation of effective means for efficient storage management. In the summer of 2012, the volume of data in the world was around 10^21 bytes (one zettabyte), about 1.1TB per internet user, and this volume continues to increase at a compound annual growth rate of about 50%. It has been said that "By 2013, storage systems will no longer be manually tunable for performance or manual data placement. Similar to virtual memory management, the storage array's algorithms will determine data placement" (The Future of Storage Management, Gartner 2010). Meeting service-level objective/agreement (SLO/SLA) requirements for data-intensive applications is not straightforward and will become increasingly challenging. In particular, there is an increasing need for intelligent mechanisms to manage the underlying architectures' infrastructure, taking into account the advent of new device technologies.

To cope with this challenge, we propose a research programme in the mainstream of EPSRC's theme "Towards an intelligent information infrastructure (TI3)", specifically with reference to the "deluge of data" and the exploration of "emerging technologies for low power, high speed, high density, low cost memory and storage solutions". Today, with the widespread distribution of storage, for example in cloud storage solutions, it is difficult for an infrastructure provider to decide where data resides, on what type of device, co-located with what other data owned by which other (maybe competing) user, and even in what country. The need to meet energy-consumption targets compounds this problem. These decision problems motivate the present research proposal, which aims to develop new model-based techniques and algorithms to facilitate the effective administration of data-intensive applications and their underlying storage device infrastructure.

We propose to develop techniques and tools for the quantitative analysis and optimisation of multi-tiered data storage systems. The primary objective is to develop novel modelling approaches to define and facilitate the most appropriate data placement and data migration strategies. These strategies share the common aim of placing data on the most effective target device in a tiered storage architecture. In the proposed research, the allocation algorithm will decide the placement strategy and trigger data migrations so as to optimise an appropriate utility function. Our research will also take into account the likely quantitative impact of evolving storage and energy-efficiency technologies, by developing suitable models of these and integrating them into our tier-allocation methodologies. In essence, our models will be specialised for different storage and power technologies (e.g. fossil fuel, solar, wind).
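As an illustration only of the kind of decision such an allocation algorithm makes, the following minimal sketch places data items on tiers with a simple greedy heuristic and reports the migrations a new placement would trigger. It is not the project's method: the tier names, capacities, latencies, workload figures, the greedy rule and the mean-latency utility are all hypothetical assumptions chosen for the example, and item sizes are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_gb: float
    latency_ms: float  # assumed mean access latency of the tier


@dataclass
class Item:
    name: str
    size_gb: float       # assumed positive
    accesses_per_s: float


def place(items, tiers):
    """Greedy heuristic: the hottest data per GB goes to the fastest tier with room."""
    free = {t.name: t.capacity_gb for t in tiers}
    by_speed = sorted(tiers, key=lambda t: t.latency_ms)
    placement = {}
    hottest_first = sorted(
        items, key=lambda i: i.accesses_per_s / i.size_gb, reverse=True
    )
    for item in hottest_first:
        for tier in by_speed:
            if free[tier.name] >= item.size_gb:
                free[tier.name] -= item.size_gb
                placement[item.name] = tier.name
                break
        else:
            raise ValueError(f"no tier has room for {item.name}")
    return placement


def mean_latency_ms(items, tiers, placement):
    """Utility to minimise: access-rate-weighted mean latency."""
    latency = {t.name: t.latency_ms for t in tiers}
    total_rate = sum(i.accesses_per_s for i in items)
    weighted = sum(i.accesses_per_s * latency[placement[i.name]] for i in items)
    return weighted / total_rate


def migrations(old, new):
    """Items whose tier changes between two placements: name -> (from, to)."""
    return {
        name: (old[name], tier)
        for name, tier in new.items()
        if name in old and old[name] != tier
    }


# Hypothetical two-tier system and workload.
tiers = [Tier("flash", 100, 0.1), Tier("sata", 1000, 8.0)]
items = [Item("logs", 80, 5), Item("db", 60, 200), Item("archive", 500, 1)]
plan = place(items, tiers)
print(plan, mean_latency_ms(items, tiers, plan))
```

With these made-up figures the database volume lands on flash and the other two items on SATA, since the log volume no longer fits on flash once the database is placed. A model-based optimiser of the kind proposed would replace the greedy rule and the fixed latencies with predictions from a performance model.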

The models, optimisers and methodologies that we produce will be tested in pilot implementations on our in-house cloud (already purchased); on Amazon EC2 resources; and finally in an industrial, controlled production environment as part of our collaboration with NetApp. This will provide feedback to enable us to refine, enhance and extend our techniques, and hence to further improve the utility of the largest storage systems.

Planned Impact

Both short- and medium-term benefits will arise from the efficient performance modelling and optimisation methodologies proposed, and from their application in our case studies to large-scale tiered storage, through enhanced performance, energy-efficiency, scalability and reliability. These benefits will be felt in both academia and industry, through the novel theoretical research and its practical exploitation. Moreover, longer-term, indirect benefits will also accrue to the competitiveness of UK industry, the environment and the public at large. For example, one case study application will be in the design, construction and geographical placement of datacentres, together with efficient usage of data storage by applications with respect to performability (performance, availability, reliability and energy-efficiency).

Our methodology solves models for QoS efficiently and feeds their output into both static and dynamic optimisers. This represents a significant advance over traditional methods, based mainly on heuristics, possibly backed up by simulation, which are reliable only for systems that are small relative to today's highly complex, distributed storage architectures and access methods. Virtualisation technologies in particular need to make scheduling decisions in real time and so require models that quantify performance rapidly, on the fly. The project will provide excellent research training for the RA and for a PhD student (to be funded internally), both of whom should emerge from the project with a good knowledge of the theory and practice of performance engineering.

As the second largest distributor of storage systems in the world, deploying the largest storage management operating system, our industrial partner NetApp will be ideally placed to realise the practical potential of our methodologies and case studies, with open access to our project output. Through our dissemination channels, UK developers and vendors will thereby acquire a distinct advantage, based on the local research at Imperial and the joint development with NetApp. NetApp will assist us directly in year 3 by hosting our RA and testing our techniques in their own software that controls real storage systems in a controlled industrial environment, as well as by providing real-world test case scenarios. Both NetApp and Citrix (our other partner with a local research presence in Cambridge) are eager to take suitable industrial placement students for periods of six months on a regular basis.

Communication of project outputs to a wider audience and engagement of relevant stakeholders with the project will take place through a variety of channels:
* We will host a workshop on the theme of efficient large-scale data management in order to disseminate our work, to obtain timely feedback, to investigate the feasibility of the integration of our ideas into real products, and to establish further academic collaborations;
* A (virtual) project web server will be set up to promote and communicate the goals of the project, and to encourage wider collaboration;
* Project results and case studies will be openly published in journals and presented at international conferences, as described in the "Case for Support" and "Justification of Resources";
* Project results will also be accessible through open source tool development as described in the "Case for Support". This aims to encourage the adoption of our methods in both industry and academia, as well as the application of our methods to case studies in domains beyond storage management.

Publications


Casale G (2017) Accelerating Performance Inference over Closed Systems by Asymptotic Methods in ACM SIGMETRICS Performance Evaluation Review

Casale G (2016) QRF: An Optimization-Based Framework for Evaluating Complex Stochastic Networks in ACM Transactions on Modeling and Computer Simulation

Harrison P (2016) Energy-Performance Trade-Offs via the EP Queue in ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Harrison P (2019) Managing Response Time Tails by Sharding in ACM Transactions on Modeling and Performance Evaluation of Computing Systems


 
Description

New queueing models have been developed that take into account the bursty nature of workload and address the energy-performance trade-off. A paper was published in the ACM journal TOMPECS last year on this topic.

We have also investigated the benefits of replicating tasks (for reasons of both reliability and latency) and of waiting for only a subset of tasks to complete, as in a striped storage system with erasure coding (sharding), for example. Several papers have already been published on this topic.

We have also conducted experimental work on the in-house tiered storage network, purchased partly with the funding provided. This has three physical tiers, two flash and one SATA disk, and we have investigated, in particular, the effectiveness of a second-level cache, comparing the results with our analytical models. Due to continued staffing problems, this work could not be completed to the level we expected, but very promising results were obtained by an MSc student in his project in 2016.

A third branch of our research has been an investigation into the way multi-core processors handle storage transactions, where substantial delays can occur. The key requirement here is to model the scheduling strategy used, which is typically close to a form of discriminatory processor sharing (DPS). Where response times are concerned, DPS is much harder to analyse than FIFO; novel results were published in a series of conference papers.
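The latency benefit of waiting for only a subset of tasks to complete, noted above for erasure-coded striping, can be illustrated with a standard order-statistics result. The sketch below is a deliberately simplified illustration and not the model used in the project's papers: it assumes n tasks started together with independent, exponentially distributed completion times of a common rate and no queueing, in which case the expected time until the k-th completion is (1/rate) * (1/n + 1/(n-1) + ... + 1/(n-k+1)). The function name and the 10-shard example are hypothetical.

```python
from fractions import Fraction


def expected_kth_completion(n, k, rate=1.0):
    """Expected time until k of n i.i.d. exponential(rate) tasks have finished.

    Uses E[X_(k:n)] = (1/rate) * sum_{i=n-k+1}^{n} 1/i, which holds only under
    the stated assumptions (independent exponential tasks, no queueing).
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    harmonic_tail = sum(Fraction(1, i) for i in range(n - k + 1, n + 1))
    return float(harmonic_tail) / rate


# Hypothetical stripe of 10 shards with unit-rate tasks.
wait_for_all = expected_kth_completion(10, 10)    # every shard must return
wait_for_eight = expected_kth_completion(10, 8)   # any 8 of 10 suffice
print(wait_for_all, wait_for_eight)
```

Under these assumptions, needing any 8 of 10 shards roughly halves the expected wait compared with needing all 10, because the slowest two tasks no longer lie on the critical path.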

Finally, we developed techniques, based on Markov decision processes and the central limit theorem, to optimise energy and performance together using constrained optimisation procedures. The optimiser switches power levels, taking into account variability in energy usage rates due to external conditions and demand. It will have application not only in storage systems but also in sensor networks. A paper on this, jointly by the PI and Naresh Patel of NetApp, is to be presented at the ACM conference ICPE in April 2018.

Exploitation Route

The storage industry and any organisation with a large quantity of data will benefit from the use of our modelling methodology, and from any open-source software we produce in the future. It is intended at some point, perhaps through PhD students or other student projects, to apply our results to the (near) optimal placing and migration of files in storage systems with "flash-cache", i.e. two levels. This will require access to measurements and specific design details of the local operating systems, and hence highly beneficial collaboration with industry.

Sectors

Digital/Communication/Information Technologies (including Software); Education; Energy; Financial Services, and Management Consultancy; Healthcare; Government, Democracy and Justice; Manufacturing, including Industrial Biotechnology; Retail