Cambridge Service for Data Driven Discovery (CSD3) - A National Data Intensive Science Cloud for Converged Simulation, AI & Analytics

Lead Research Organisation: University of Cambridge
Department Name: Physics

Abstract

The high-level aims of CSD3 are threefold:
Firstly, to provide the EPSRC community with an accessible, scalable and innovative world-class research facility, with 80% of the system freely available via an open national call. This will enable a larger and more diverse group of EPSRC researchers to exploit advanced simulation, data analytics and AI capability, significantly increasing impactful scientific outputs across the entire EPSRC portfolio.
Secondly, to kickstart a significant innovation partnership with industry that will investigate emerging exascale technologies, feeding back into the UKRI HPC roadmap and science programmes and helping to keep the UK at the forefront of HPC & AI technologies.
Thirdly, to proactively drive a more coordinated Tier-2 network, with greatly improved vertical integration within the wider UKRI National e-Infrastructure (Ne-I).

To achieve these aims, the proposal builds on the success of our previous 2016 Tier-2 proposal, in which the University of Cambridge (UoC) combined its funds with those from STFC DiRAC and EPSRC Tier-2 to form "CSD3", the UK's largest data intensive national HPC service. CSD3 has seen by far the largest throughput of open-access Tier-2 projects of all current Tier-2 centres. This proposal again leverages significant additional non-EPSRC funding, providing 144% match funding from UoC, STFC and industry. These investments will be used to enhance the EPSRC Tier-2 funding and expand four key elements of CSD3, which will be integrated with the existing CSD3 to form a significantly enhanced capability.

These new service elements are described briefly here:
1) Platform: Retain the highly successful data intensive heterogeneous architecture already deployed and significantly enhance its capacity. Firstly, grow the Intel x86 cluster from 36K cores to 72K. Secondly, create one of the world's first large-scale pre-exascale prototype systems for AI & simulation, doubling the number of NVIDIA GPUs from 360 to 720 and, importantly, using next-generation cards that are significantly faster than the current V100 generation. In total, for just £4M, this will increase the EPSRC simulation capability of CSD3 by 4X and its AI capability by 9X.

2) Service support: Substantially increase the user support, RSE, training, and Tier-2 management and outreach capability. Also, in collaboration with N8-CIR, Supercomputing Wales, STFC DiRAC and STFC IRIS, CSD3 will initiate a new collaboration with three key US HPC centres involved in the US XSEDE HPC program to co-develop, integrate & test a world-leading software ecosystem for allocation management, reporting, impact analysis and improved accessibility of federated e-Infrastructure resources.

3) Accessibility: Improve the user access layer built on OpenStack by creating an ISO 27001-certified environment for holding sensitive data, implementing application-specific portals to aid new disciplines and users, and developing, with an exemplar user community, a community-focused scientific gateway using the gateway toolkit developed at TACC.

4) Open Exascale Lab: An industry-funded and jointly resourced partnership to undertake major innovation activity, investigating emerging exascale technologies across networking, accelerators, programming environments, and novel storage and filesystem technologies. This will transform the UK's access to emerging HPC technologies, feeding science programmes and the UKRI Ne-I roadmap.

Thus this proposal will produce significant outputs and impact in four areas:
1) helping to deliver impactful EPSRC science across a large number of projects (targeted at 600) and increasing UK industrial competitiveness;

2) greatly enhancing access to HPC resources and significantly increasing training and upskilling for both academic and industrial users;

3) greatly enhancing and coordinating the Tier-2 programme with effort and software tools;

4) undertaking significant technology innovation via the industry-funded Open Exascale Lab.

Planned Impact

Increasing the impact of the EPSRC UK research community. On installation, the expanded CSD3 will be the most powerful academic supercomputer in the UK. With multi-petascale performance, CSD3 will deliver world-leading simulation, data analytics and AI capability to both the UK EPSRC research community and industry, bringing multi-petascale capability within the mainstream reach of the wider community and establishing a path to pre-exascale exploitation. CSD3 will incorporate the fastest UK Intel system, the fastest UK GPU system, the world's fastest HPC storage and the UK's fastest AI platform. As CSD3 will be one of the first large-scale systems in the world to use NVIDIA next-generation "Volta-Next" technology, it represents one of the world's first exascale prototype platforms, giving the UK science community leadership status in exascale system architecture.

Increasing the ties to the UK industrial community. Cambridge has a long-established and successful industry engagement activity through the specialist SME, CORE Advantage. Through CORE, CSD3 has successfully engaged with clients in the aerospace, automotive, maritime, and oil and gas industries, providing access both to petascale capability and to the expertise needed to exploit advanced computational services.

Increasing the ability of the UK EPSRC community and industry to develop new approaches to solving large-scale data-intensive problems by providing extreme I/O capability through our world-leading data accelerator, combined with innovative 'Intel Optane' memory technology. CSD3 works with the Alan Turing Institute and with the recently announced Cambridge Centre for Data-Driven Discovery industry club to directly promote and enable the exploitation of CSD3's expanded capability, leveraging the university's many industry partners and the wider EPSRC community in the broad area of data-driven discovery.

Increasing horizontal integration amongst Tier 2 centres. We will co-develop the next generation of software for resource allocation and management, impact measurement, reporting, and user accessibility across a National e-Infrastructure. These tools will help to ensure that the Ne-I becomes an increasingly integrated national service, without compromising its ability to deliver an efficient, responsive service to the community while maintaining the highest standards of research quality.

Increasing ties with leading international supercomputing centres. To develop the tools to deliver an integrated Ne-I, CSD3 will collaborate with University of Ohio, University at Buffalo, and Texas Advanced Computing Center on the open-source tools ColdFront, Open XDMoD and Open OnDemand. This co-development activity will further raise the profile of the capability within the UK Ne-I, potentially leading to further collaborations with the dynamic US XSEDE program.

Increasing breadth of research projects available through EPSRC and UKRI CDTs. CSD3 will engage with EPSRC CDTs specifically in health, engineering, AI and materials. We have already had positive and fruitful discussions with a range of CDTs on how Tier 2 HPC centres can engage more effectively to bring multi-petascale capability to the grass roots of the UK EPSRC research community, building transferable skills in the area of HPC exploitation and software development.

Finally, the CSD3 project has stimulated significant industrial investment into the Cambridge "Open Exascale Lab", providing a large critical mass of emerging exascale technologies such as accelerators, programming models, networking and storage. These technologies will be stood up as test beds and made available to the UK science community as early access technology demonstrators, keeping the UK science community at the forefront of the emerging technology curve. This is of relevance to UKAEA and its involvement in the ExCALIBUR project, and UKAEA is contributing 2 FTE of RSE effort to the Lab.
