ExCALIBUR H&ES: Intel Xeon GPU Max Pre-Exascale Testbed

Lead Research Organisation: University of Cambridge
Department Name: Applied Maths and Theoretical Physics

Abstract

In 2018, the Exascale Computing ALgorithms & Infrastructures for the Benefit of UK Research (ExCALIBUR) programme was proposed by the Met Office, CCFE and EPSRC (on behalf of UKRI). The goal of ExCALIBUR is to redesign high-priority computer codes and algorithms, keeping UK research and development at the forefront of high-performance simulation science. The challenge spans many disciplines, so the programme of research will be delivered through a partnership between the Met Office and UKRI Research Councils. Research software engineers and scientists will work together to future-proof the UK against the fast-moving changes in supercomputer designs. This combined scientific expertise will push the boundaries of science across a wide range of fields, delivering transformational change at the cutting edge of scientific supercomputing. DiRAC proposed the inclusion in the ExCALIBUR business case of a request for £4.5M in capital funding over 4.5 years to develop a hardware fore-sighting programme. Industry co-funding for the programme will be sought where possible.

The £4.5M capital is intended to provide a testbed area that uses pre-commercial equipment for software prototyping and development. It has two main purposes: (1) to enable the software community to be ready to use commercial products effectively as soon as they come on to the market; and (2) to provide the UKRI HPC community with the ability to influence industry and the necessary knowledge to guide their purchase decisions. This will ensure that facilities and the future UK National e-Infrastructure are in a position to maximise value for money by getting the most powerful systems exactly suited to the communities' needs. This double-pronged approach will give UK researchers a competitive advantage internationally.

ExCALIBUR will now establish a set of modest-sized, adaptable clusters dedicated solely to this purpose and embedded within established HPC environments. Although small, they need to be of a scale capable of carrying out meaningful performance studies. They are expected to be co-funded with industry partners and will initially require investments of £200k-£300k each. They will allow a range of future hardware to be assessed for its relevance to the delivery of UKRI science and innovation. The pre-commercial equipment will be refreshed and added to on a regular, likely annual, basis. This agile tactic is designed to take advantage of the different approaches across industry: some companies, e.g. NVIDIA, tend to have a short pre-commercial window of less than 3 months, while for others this can be up to a year.

ExCALIBUR can use the hardware piloting systems to drive software innovation across the UKRI research community. Researchers are rightly reluctant to invest time in code development to take advantage of new hardware which may not be available at scale for several years or may even prove not to have longevity: scientific leadership demands that research funding is used to deliver science results now. Because DiRAC and others will offer funded RSE effort to support the development work, combined with access to novel technologies within modest-sized systems, ExCALIBUR can lower the bar for engaging with the process of software re-engineering and encourage researchers to make the necessary (modest) investments of their time. In some cases, there may also be the potential for some immediate science outputs by exploiting the proof-of-concept systems.

ExCALIBUR will thus be able to provide an incentive for greater software innovation across the UKRI research communities and help to ensure that when novel technology is included in national services, there are workflows that are already able to exploit it optimally. This will increase productivity across all UKRI computing services and enable UK researchers to use the latest hardware to deliver the largest and most complex calculations, ensuring international leadership.
 
Description ExCALIBUR Hardware and Enabling Software (H&ES): In-situ Visualisation and Unified Programming across Accelerator Architectures at Exascale
Amount £115,810 (GBP)
Funding ID ST/W001667/1 
Organisation Science and Technology Facilities Council (STFC) 
Sector Public
Country United Kingdom
Start 03/2021 
End 03/2025
 
Description Intel oneAPI Centre of Excellence
Amount $150,000 (USD)
Organisation Intel Corporation 
Sector Private
Country United States
Start 11/2023 
End 10/2024
 
Title Intel Xeon GPU Max Pre-Exascale Testbed 
Description Procure, commission and operate a multi-accelerator system based on the Intel oneAPI-enabled Ponte Vecchio GPU (Xeon Data Center GPU Max) together with the 4th gen Sapphire Rapids (SPR) CPUs (Xeon CPU Max). The purpose of this testbed is to act as a demonstrator (with over half a petaflop of peak performance) for code porting, optimisation and evaluation by the investigators and the wider UK HPC Community, especially as it pertains to future exascale systems. 
Type Of Material Improvements to research infrastructure 
Year Produced 2024 
Provided To Others? Yes  
Impact The ExCALIBUR Intel PVC testbed has been co-designed in collaboration with Lenovo (main supplier) and Intel (key technology). It has gone through procurement, is currently being commissioned and will be operational shortly. The testbed system is hosted by DAMTP. Its purpose is to explore performance improvements by offloading compute loads to the new Intel GPU accelerators, and it will be made available to the wider DiRAC and ExCALIBUR H&ES community for testing and benchmarking. Preparatory impacts include the focus on the GRTeclyn numerical relativity code and on in-situ visualisation (both described elsewhere in this report, with GRTeclyn publicly released for use on GPU platforms: https://github.com/GRTLCollaboration/GRTeclyn), and training of students in the University of Cambridge CDT and MPhil in Data Intensive Science, both in SYCL open-platform programming for GPUs and in the use of oneAPI, PyTorch and other libraries on GPU systems. The testbed includes several nodes of both NVIDIA and Intel GPUs, as well as several AMD and Intel CPU nodes, for the purposes of direct performance comparisons. 
 
Description ExCALIBUR 
Organisation Science and Technology Facilities Council (STFC)
Department Distributed Research Utilising Advanced Computing
Country United Kingdom 
Sector Academic/University 
PI Contribution Development of the public numerical relativity code GRTeclyn (originally developed in the Cambridge CTC/GR group), the porting of GRTeclyn to large GPU supercomputers, and general RSE support for the members of the STFC consolidated grant and the ExCALIBUR grant. This includes technical and scientific workshops, training days, technical support and procurement of computational resources.
Collaborator Contribution The development of GRTeclyn is supported by Miren Radia and Kacper Kornet as DiRAC RSEs. Note that they are funded not by the STFC or ExCALIBUR award but by a separate DiRAC RSE grant. Intel software engineers provide substantial technical and RSE support as part of the Cambridge Intel oneAPI Centre of Excellence hosted by the CTC/GR group. This also includes funding support for RSE efforts.
Impact Several important developments to GRTeclyn have been made, including:
• The introduction of new features, e.g. cell-centered quartic interpolation.
• The improvement and refactoring of existing features, e.g. handling of derived variables.
• Advances in in-situ visualisation using ParaView/Catalyst.
These are essential features that improve its accuracy and also make the code and outputs more accessible for researchers.
AMReX at Exascale workshop, 24-27 June 2024: a meeting exploring themes of open-platform programming and performance portability, particularly in the context of AMReX and AMR applications. This meeting was run by the CTC/GR group, together with the GRTeclyn collaboration. It included international speakers from large U.S. and European supercomputer centres. We submitted an Intel blogpost summarising outcomes of our work.
Start Year 2022
 
Description Intel oneAPI Centre of Excellence 
Organisation Intel Corporation
Country United States 
Sector Private 
PI Contribution As a part of our ExCALIBUR research grant, we have designed and commissioned a testbed to explore possible performance improvements by offloading compute loads to accelerated devices such as GPUs. The testbed includes several nodes of both NVIDIA and Intel GPUs as well as several AMD and Intel CPU nodes. The cluster is hosted by DAMTP but open to the wider DiRAC community for testing and benchmarking before using a larger system such as Dawn. We have also developed a Klein-Gordon solver within the AMReX framework with in-situ visualisation support for the ExCALIBUR community. This provides a test case and training example for ExCALIBUR members looking either to contribute to the GRTeclyn codebase or to get an introduction to ParaView Catalyst.
Collaborator Contribution The previous ExCALIBUR RSE, Miren Radia, has overseen several contributions to this ExCALIBUR project, such as:
* Implementing in-situ visualisation for community codes, e.g. JOREK.
* Contributing towards porting of the binary black hole example from GRChombo to GRTeclyn.
* Developing a Continuous Integration/Continuous Testing (CI/CT) pipeline for GRTeclyn.
* Refactoring of the GRChombo code for GRTeclyn.
Kacper Kornet (DiRAC RSE) has also made several contributions:
* Preliminary benchmark runs for GRTeclyn on 1 NVIDIA A100 vs 1 Intel Ice Lake CPU.
Impact Our major outcome is the release of the open source code GRTeclyn: https://github.com/GRTLCollaboration/GRTeclyn
We have also been active in various knowledge exchange engagements, such as:
* Presentation of a talk at CI:UK in Manchester, Dec 2023: https://excalibur.ac.uk/resources/preparing-for-exascale-computing-with-grteclyn/
* Presentation of a poster at DiRAC Day in Liverpool, Dec 2023.
Start Year 2014
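
The Klein-Gordon test case mentioned in the PI Contribution above can be illustrated with a minimal sketch. This is not the AMReX-based solver from the project: it is a plain-Python, single-grid, 1D leapfrog scheme for phi_tt = phi_xx - m^2 phi on a periodic domain, checked against an exact plane wave. The function name klein_gordon_leapfrog and all of its parameters are hypothetical and chosen only for illustration.

```python
import math


def klein_gordon_leapfrog(n=128, length=2.0 * math.pi, mass=1.0,
                          mode=1, t_end=1.0, cfl=0.5):
    """Evolve phi_tt = phi_xx - mass^2 * phi on a periodic 1D grid.

    Illustrative only: uniform grid, no mesh refinement, no GPU offload.
    Returns (max_error, steps), where max_error is measured against the
    exact plane wave cos(k x - omega t) with omega^2 = k^2 + mass^2.
    """
    dx = length / n
    # Pick a whole number of steps so the final time is exactly t_end.
    steps = max(1, math.ceil(t_end / (cfl * dx)))
    dt = t_end / steps
    k = 2.0 * math.pi * mode / length
    omega = math.sqrt(k * k + mass * mass)
    x = [i * dx for i in range(n)]

    def exact(t):
        return [math.cos(k * xi - omega * t) for xi in x]

    prev = exact(-dt)   # field one step before the start time
    curr = exact(0.0)   # field at the start time
    for _ in range(steps):
        nxt = []
        for i in range(n):
            # Second-order centred Laplacian; index -1 wraps periodically.
            lap = (curr[(i + 1) % n] - 2.0 * curr[i] + curr[i - 1]) / (dx * dx)
            nxt.append(2.0 * curr[i] - prev[i]
                       + dt * dt * (lap - mass * mass * curr[i]))
        prev, curr = curr, nxt

    max_error = max(abs(a - b) for a, b in zip(curr, exact(t_end)))
    return max_error, steps


if __name__ == "__main__":
    err, steps = klein_gordon_leapfrog()
    print(f"steps={steps}, max error vs plane wave={err:.2e}")
```

A production code would replace the inner loop with a kernel launched over AMR patches, but the update formula and the plane-wave check carry over as a convenient correctness test when porting between CPU and GPU back ends.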
 
Title GRTeclyn numerical relativity simulation code 
Description A sophisticated adaptive mesh refinement (AMR) solver designed to solve Einstein's equations for General Relativity to describe black hole mergers and other non-linear gravitational interactions. This is built upon the AMReX libraries (LBNL), superseding the previous GRChombo code because it can run on heterogeneous architectures including CPU and GPU (NVIDIA, Intel and AMD). In addition, we have enabled in-situ visualisation in the ParaView/Catalyst framework, with the Intel oneAPI Rendering Toolkit, which incorporates OSPRay ray-tracing with access to AMR data. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2023 
Open Source License? Yes  
Impact The primary impact of this work is to enable large-scale numerical relativity simulations to run more efficiently on high performance computers (HPC), notably on GPU systems including new technologies like the Intel Xeon GPU Max. Further impacts include the adoption of high quality in-situ visualisation that allows simulations to be observed "on the fly" without the output of intermediate data for post-processing. 
URL https://github.com/GRTLCollaboration/GRTeclyn
 
Description Preparing for Exascale Computing with GRTeclyn. Talk by Dr Juliana Kwan at Computing Insight UK, 7-8 December 2023, at Manchester Central. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Description of the GRTeclyn numerical relativity code and preparations for use on exascale systems powered by GPUs. Topics covered: challenges with porting large open-source codes, the use of the AMReX libraries, preliminary results for binary black hole mergers, and innovations in in-situ visualisation with ParaView/Catalyst/OSPRay on HPC systems. Source code: https://github.com/GRTLCollaboration/GRTeclyn
Year(s) Of Engagement Activity 2023
URL http://excalibur.ac.uk/resources/preparing-for-exascale-computing-with-grteclyn/