Peta-5: A National Facility for Petascale Data Intensive Computation and Analytics

Lead Research Organisation: University of Cambridge
Department Name: Physics

Abstract

The Peta-5 proposal from the University of Cambridge brings together 15 world-leading HPC system and application experts from 10 different institutions to lead the creation of a breakthrough HPC and data analytics capability that will deliver significant National impact to the UK research, industry and health sectors.

Peta-5 aims to make a significant contribution towards the establishment and sustainability of a new EPSRC Tier 2 HPC network. The Cambridge Tier 2 Centre, working in collaboration with other Tier 1, Tier 2 and Tier 3 stakeholders, aims to form a coherent, coordinated and productive National e-Infrastructure (Ne-I) ecosystem. This greatly strengthened computational research support capability will enable a significant increase in computational and data-centric research outputs, driving growth in both academic research discovery and the wider UK knowledge economy.

The Peta-5 system will be one of the largest heterogeneous data intensive HPC systems available to EPSRC research in the UK. To create the critical mass of system capability and capacity needed to make an impact at National level, Cambridge have pooled funding and equipment resources from the University, STFC DiRAC and this EPSRC Tier 2 proposal to create a total capital equipment value of £11.5M; the request to EPSRC is £5M. The University will guarantee to cover all operational costs of the system for 4 years from the service start date, with the option to run for a fifth year to be discussed. Cambridge will ensure that 80% of the EPSRC funded element of Peta-5 is deployed on EPSRC research projects: 65% will be made available free of charge to any UK EPSRC funded project through a lightweight resource allocation committee (RAC), and 15% will go to Cambridge EPSRC research. The remaining 20% will be sold to UK industry to drive the UK knowledge economy.
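
The funding and allocation figures in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the variable names are illustrative and not part of the proposal, and the only inputs are the figures quoted above.

```python
# Figures quoted in the abstract; all names here are illustrative only.
TOTAL_CAPITAL_GBP_M = 11.5  # pooled capital equipment value (GBP million)
EPSRC_REQUEST_GBP_M = 5.0   # amount requested from EPSRC (GBP million)

# Planned use of the EPSRC funded element, in percent.
allocation = {
    "UK EPSRC projects via RAC (free of charge)": 65,
    "Cambridge EPSRC research": 15,
    "UK industry (sold)": 20,
}

# Share deployed on EPSRC research projects (RAC plus Cambridge EPSRC research).
epsrc_research_share = (
    allocation["UK EPSRC projects via RAC (free of charge)"]
    + allocation["Cambridge EPSRC research"]
)

# EPSRC request as a fraction of the total pooled capital value.
epsrc_fraction_of_capital = EPSRC_REQUEST_GBP_M / TOTAL_CAPITAL_GBP_M

print(f"EPSRC research share: {epsrc_research_share}%")
print(f"Total allocated: {sum(allocation.values())}%")
print(f"EPSRC request as share of capital: {epsrc_fraction_of_capital:.1%}")
```

The three allocation shares account for the whole EPSRC funded element, and the EPSRC request is a little under half of the pooled capital value.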

The Peta-5 system will be the most capable HPC system in operation in the UK when it enters service in May 2017. In total Peta-5 will provide 3 petaflops (PF) of sustained performance derived from 3 heterogeneous compute elements, 1PF Intel x86, 1PF Intel KNL and 1PF NVIDIA Pascal GPU (Peta-1), connected via a Pb/s HPC fabric (Peta-2) to an extreme I/O solid state storage pool (Peta-3), a petascale data analytics (Machine Learning + Hadoop) pool (Peta-4) and a large 15 PB tiered storage solution (Peta-5), all under a single execution environment. This creates a new HPC capability in the UK specifically designed to meet the requirements of both affordable petascale simulation and data intensive workloads combined with complex data analytics. It is the combination of these features which unlocks a new generation of computational science research.

The core science justification for the Peta-5 service is based on three broad science themes: Materials Science and Computational Chemistry; Computational Engineering and Smart Cities; Health Informatics. These themes were chosen because they represent significant EPSRC research areas that benefit strongly from the data intensive HPC capability of Peta-5. The service will clearly be valuable for many other areas of heterogeneous computing and data intensive science, so a fourth, horizontal theme of "Heterogeneous - Data Intensive Science" is included. Initial theme allocation in the RAC will be: Materials 30%, Engineering 30%, Health 20%, Heterogeneous - Data Intensive 20%.
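
To illustrate how the initial theme allocation divides a pool of resource, the sketch below applies the percentages above to a purely hypothetical pool. The function name `split_by_theme` and the figure of 1000 units are assumptions made for illustration; the proposal does not state the unit in which time is allocated.

```python
# Initial theme allocation quoted above, in percent.
theme_share = {
    "Materials Science and Computational Chemistry": 30,
    "Computational Engineering and Smart Cities": 30,
    "Health Informatics": 20,
    "Heterogeneous - Data Intensive Science": 20,
}


def split_by_theme(total_units):
    """Divide a pool of resource units between themes by percentage share."""
    return {theme: total_units * pct / 100 for theme, pct in theme_share.items()}


# Hypothetical pool of 1000 units (e.g. node-hours); not a figure from the proposal.
for theme, units in split_by_theme(1000).items():
    print(f"{theme}: {units:.0f} units")
```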

The Peta-5 facility will drive research discovery and impact at national level, creating the largest and most cost-effective petascale HPC resource in the UK and bringing petascale simulation within the reach of a wide range of research projects and UK companies. Peta-5 is also the first UK HPC system specifically designed for large scale machine learning and data analytics, combining the areas of HPC and Big Data and promising to unlock both knowledge and economic benefit from the Big Data revolution.

Planned Impact

As an innovative HPC service for data intensive science, Peta-5 will impact significantly on the research communities who make use of its resources. However, in addition to the expected science outcomes (e.g. papers in high-impact, peer-reviewed journals; keynote presentations at international conferences, etc.), Peta-5 will deliver impact in a number of other key areas:

1) Peta-5 will create one of the most powerful academic UK supercomputer facilities.

2) Peta-5 will provide the most cost-effective petascale simulation capability in the UK, providing unrivalled price performance. This unlocks sustainable HPC for academia and industry, demonstrating affordable petascale simulation capability. This is a game-changing capability, widening access and opening new possibilities out of reach for many research projects or company budgets.

3) Peta-5 is currently the only HPC system in the UK aimed at data intensive computing, combining state of the art extreme I/O solid state storage technologies with emerging machine learning and data analytics frameworks. This provides a new capability for tackling the largest "Big Data" problems in UK research and industry.

In particular, Peta-5 will:

1) Enable new petascale academic research projects
Cambridge will proactively seek UK academic usage of the Peta-5 system by opening the system up to UK EPSRC researchers free of charge, with strong user support, low inertia application processes and particular emphasis on new users. Cambridge are well connected to all levels of the Ne-I and, through their involvement in many existing HPC academic networks, will promote the uptake of the Peta-5 system.

2) Enable industrial use of petascale HPC capability
Cambridge have a long-established and successful industry engagement activity called CORE. CORE will proactively seek industry HPC use cases, promoting the use of HPC and advanced data analytics to drive industrial R&D.

3) Enable new extreme I/O and high performance data analytics capability
The Peta-5 architecture provides new extreme I/O capability combined with emerging machine learning and data analytics capability at a scale not available anywhere else in the UK. This will enable UK research projects and industry to develop new approaches to solving the largest "Big Data" problems addressed to date.

4) Cambridge have a specific partnership with the Alan Turing Institute (ATI) to develop novel big data analytic methods and solutions to implement on the Peta-5 system. The ATI will then help disseminate the capability and train both academic and industrial beneficiaries.

5) Enable new advances in health informatics
Peta-5 will provide the advanced data analytics technologies and data safe havens for interdisciplinary research in health informatics, linking leading EPSRC research projects in this domain with the ATI, Addenbrooke's and Genomics England (GEL). This combination of linkage and capability will result in ground-breaking health informatics capability with potential use within the clinical setting. Partners such as Addenbrooke's and GEL provide a direct route to patient health outcomes from the methods developed from the interdisciplinary research undertaken on Peta-5. Such outcomes can then be adopted nationally.
 
Description The Peta-5 installation is complete and the performance of the system exceeds the original planned performance. The facility is now known as CSD3. Excellent use of the facility has been made, with widening participation, especially from new research areas utilising the University of Cambridge's time allocation. The pandemic has led to a focussing of effort on simulations and analysis of relevance to the pandemic.
Exploitation Route The findings are only preliminary, based on the installation rather than the use of the system. However, the experience gained in building such a system to budget, on time and with excellent performance will be of interest to others involved in HPC or data centre installation.
Sectors Digital/Communication/Information Technologies (including Software)

URL https://www.hpc.cam.ac.uk/CSD3/csd3-platform
 
Title Data supporting 'Numerical Investigation of full helicopter with and without the ground effect' 
Description In the present work, the aerodynamic performance of the full helicopter PSP in hover flight is investigated using a simplified concept of the multiple reference frame (MRF) technique in the context of a high-order Monotone Upstream Centred Scheme for Conservation Laws (MUSCL) cell-centred finite volume method. The predictions were obtained for two ground distances and several collective pitch angles at a tip Mach number of 0.585. The calculations were made for both out-of-ground-effect (OGE) and in-ground-effect (IGE) cases and compared with experimental data in terms of pressure distribution, integrated thrust and torque, and vortex system.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://cord.cranfield.ac.uk/articles/dataset/Data_supporting_Numerical_Investigation_of_full_helico...
 
Title Research data supporting "Computational Investigation of Copper Phosphides as Conversion Anodes for Lithium-Ion Batteries" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/307307
 
Title Research data supporting "Computational Investigation of Copper Phosphides as Conversion Anodes for Lithium-Ion Batteries" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/308174
 
Title Research data supporting "High-throughput discovery of high-temperature conventional superconductors" 
Description Crystal structures of the materials listed in Table 1 of "High-throughput discovery of high-temperature conventional superconductors", generated using ab initio random structure searching (AIRSS). These are the structures found to exhibit high-Tc superconductivity after an initial geometry optimization at the listed pressure. They are provided in the CASTEP .cell format and can be easily converted to a number of different formats using the C2x software (https://www.c2x.org.uk/).
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/326388
 
Title Research data supporting 'Physics-driven coarse-grained model for biomolecular phase separation with near-quantitative accuracy' 
Description This file is part of the supporting data for the manuscript 'Physics-driven coarse-grained model for biomolecular phase separation with near-quantitative accuracy', and contains scripts and code for running all-atom and coarse-grained simulations of proteins described in the manuscript. The included README file outlines the structure of the archive and contains LAMMPS installation instructions for running the remaining code supplied. There are five directories in this archive. Four of these correspond to figures in the results section of the manuscript, and each one contains a separate README file detailing its contents. They include GROMACS and LAMMPS scripts with example simulation set-ups to run simulations corresponding to figures 2 [potentials of mean force of all-atom models], 4 [radius of gyration simulations of coarse-grained models], 5 [direct-coexistence simulations used to determine the phase diagrams] and 6 [multiphase equilibria] of the manuscript. The directories contain input scripts and corresponding parameter and simulation configuration files, as well as example simulation output to benchmark against. The final directory, 'all-model-parameters', contains LAMMPS parameter files not only for the Mpipi potential, but also for all the other models we have benchmarked in the paper. These parameter files can be used instead of the Mpipi parameters in the simulations provided in the other directories.
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/329039
 
Title Research data supporting 'Quantum-mechanical exploration of the phase diagram of water' 
Description We provide DFT input files, example ice configurations studied, a Mathematica notebook used to collate the results, and numerical results of the free-energy computations. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/315122
 
Title Scientific OpenStack 
Description The software has been developed as part of the SKA Science Data Processor Platform. It adds new functionality to the OpenStack platform to enable high performance workflows and other monitoring. The software has been added back into the main OpenStack repository for general use. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact This software has wide applicability outside radio astronomy and the SKA project for which it was developed, and at least one company, StackHPC, is taking it forward in a commercial context.
URL http://ska-sdp.org