Exascale Data Testbed for Simulation, Data Analysis & Visualisation

Lead Research Organisation: UNIVERSITY OF CAMBRIDGE
Department Name: Chemistry

Abstract

In 2018, the Exascale Computing ALgorithms & Infrastructures for the Benefit of UK Research (ExCALIBUR) programme was proposed by the Met Office, CCFE and EPSRC (on behalf of UKRI). The goal of ExCALIBUR is to redesign high-priority computer codes and algorithms, keeping UK research and development at the forefront of high-performance simulation science. The challenge spans many disciplines and, as such, the programme of research will be delivered through a partnership between the Met Office and the UKRI Research Councils. Research software engineers and scientists will work together to future-proof the UK against the fast-moving changes in supercomputer designs. This combined scientific expertise will push the boundaries of science across a wide range of fields, delivering transformational change at the cutting edge of scientific supercomputing. DiRAC proposed the inclusion in the ExCALIBUR business case of a request for £4.5M in capital funding over 4.5 years to develop a hardware foresighting programme. Industry co-funding for the programme will be sought where possible.
The £4.5M capital is intended to provide a testbed area that uses pre-commercial equipment for software prototyping and development. It has two main purposes: (1) to enable the software community to be ready to use commercial products effectively as soon as they come on to the market; and (2) to provide the UKRI HPC community with the ability to influence industry and the knowledge needed to guide its purchase decisions. This will ensure that facilities and the future UK National e-Infrastructure are in a position to maximise value for money by procuring the most powerful systems exactly suited to the communities' needs. This two-pronged approach will give UK researchers a competitive advantage internationally.
ExCALIBUR will now establish a set of modest-sized, adaptable clusters dedicated solely to this purpose and embedded within established HPC environments. Although small, these clusters need to be of a scale capable of supporting meaningful performance studies. They are expected to be co-funded with industry partners, will initially require investments of £200k-£300k each, and will allow a range of future hardware to be assessed for its relevance to the delivery of UKRI science and innovation. The pre-commercial equipment will be refreshed and added to on a regular, likely annual, basis. This agile approach is designed to take advantage of the different approaches across industry: some companies, e.g. NVIDIA, tend to have a short (less than three-month) pre-commercial window, while for others this can be up to a year.
ExCALIBUR can use the hardware piloting systems to drive software innovation across the UKRI research community. Researchers are rightly reluctant to invest time in code development to take advantage of new hardware which may not be available at scale for several years, or may even prove not to have longevity; scientific leadership demands that research funding is used to deliver science results now. Because DiRAC and others will offer funded RSE effort to support the development work, combined with access to novel technologies within modest-sized systems, ExCALIBUR can lower the barrier to engaging with software re-engineering and encourage researchers to make the necessary (modest) investments of their time. In some cases, there may also be the potential for immediate science outputs from exploiting the proof-of-concept systems.
ExCALIBUR will thus provide an incentive for greater software innovation across the UKRI research communities and help to ensure that, when novel technology is included in national services, workflows already exist that can exploit it optimally. This will increase productivity across all UKRI computing services and enable UK researchers to use the latest hardware to deliver the largest and most complex calculations, ensuring international leadership.

Publications

 
Description Building on the previous work, we investigated the effect of a new version (2.2) of the DAOS high-performance object store, both by running the io500 benchmark and, in collaboration with the Georgia Institute of Technology, by examining the effect of this platform on the integration of codes that use the ADIOS middleware to optimise I/O patterns. We have also extended the testbed platforms to use other filesystems besides Lustre and DAOS, namely BeeGFS, Spectrum Scale and WekaFS, all set up on identical hardware. With the exception of WekaFS, small node-count runs of the io500 benchmark have been carried out on these filesystems with the objective of determining how fully each platform can utilise the client network capability. We have also investigated the impact of enabling GPUDirect Storage on benchmarks against the Lustre filesystem.
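
For context, the ADIOS-based codes referred to above perform their I/O through the ADIOS2 library rather than raw POSIX calls, which is what allows the underlying storage (Lustre, a dfuse-mounted DAOS container, etc.) to be swapped without changing application code. The sketch below is a minimal illustration assuming the ADIOS2 high-level Python API (adios2.open, as in the 2.x-era releases); the output name, variable name and array size are illustrative, and on the testbed the output path would sit on the filesystem under test.

    import numpy as np
    import adios2

    # Illustrative data: a 1-D field written as a single global block
    field = np.random.rand(1_000_000)

    # "checkpoint.bp" is an illustrative name; pointing it at a Lustre or
    # dfuse (DAOS) mount is how the same code exercises different backends.
    with adios2.open("checkpoint.bp", "w") as fh:
        fh.write("pressure", field, [field.size], [0], [field.size])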

We demonstrated that the then-new release of DAOS did not have any appreciable effect on performance as measured by the io500 benchmark. In addition, we used the POSIX interface of the DAOS software to measure how code that is not optimised for DAOS might perform. This was discouraging: the interface performed poorly, with code failing to complete in the vast majority of cases. This limitation in the software is expected to be mitigated in the now-current 2.4.1 release of the filesystem. Further DAOS work involved a collaboration with Greg Eisenhauer and Sarpangala Venkatesh of the Georgia Institute of Technology, who used the Cambridge platform to investigate the effect of DAOS metadata configuration on the performance of both bespoke benchmark code and real-world applications (WarpX and E3SM). This work resulted in acknowledgements in workshop papers at Supercomputing 2023 and ISC 2024.
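
The unoptimised path exercised here is ordinary POSIX I/O issued against a dfuse mount that exposes a DAOS container as a directory. A minimal sketch of such an access pattern is below; the mount point, file name and transfer sizes are all illustrative.

    import os

    # Hypothetical dfuse mount point exposing a DAOS pool/container as POSIX
    path = "/mnt/dfuse/testpool/testcont/posix_test.dat"

    block = b"\0" * (1 << 20)            # 1 MiB transfers (illustrative)
    with open(path, "wb") as f:
        for _ in range(1024):            # ~1 GiB written in total
            f.write(block)
        f.flush()
        os.fsync(f.fileno())             # push the data through the dfuse layer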

A comparison of three parallel filesystems was made using the io500 benchmark. These were simple two-node runs against filesystems built on four servers. The aim was to see how well the clients could utilise the available network when the servers were not the limiting factor. The results, as measured by the io500 'score', show that BeeGFS was best able to maximise I/O in this situation, with Lustre second and Spectrum Scale third. In terms of bandwidth, the best client-side utilisation was 90% of the maximum and the worst 72%; by this measure the best-performing filesystem was Spectrum Scale and the worst BeeGFS.
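
To make the client-side utilisation figure concrete: it is the measured aggregate bandwidth divided by the combined line rate of the client NICs. The sketch below shows the calculation with hypothetical numbers; the NIC rate and measured bandwidth are not the testbed's actual figures.

    # Hypothetical figures: two clients, each with a 100 Gb/s (12.5 GB/s) NIC
    n_clients = 2
    nic_rate_gb_per_s = 12.5            # GB/s per client NIC (illustrative)
    measured_bw_gb_per_s = 22.5         # aggregate bandwidth from the run (illustrative)

    peak = n_clients * nic_rate_gb_per_s
    utilisation = measured_bw_gb_per_s / peak
    print(f"client-side utilisation: {utilisation:.0%}")   # -> 90%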

The testbed was used to set up an NVMe-backed Lustre filesystem (Lustre 2.15.1) that supported GPUDirect Storage (GDS). Two I/O benchmarks that support GDS, ior and elbencho, were run against both local NVMe and the Lustre filesystem. Comparisons between the GDS-enabled and non-GDS-enabled runs showed that I/O speedups from the use of GDS were limited to particular combinations of application parameters, such as the chunk size and whether MPI or CUDA threads were used. This was particularly true where Lustre was the target filesystem.
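
For illustration, the sketch below shows what the GDS read path looks like from application code, using the kvikio Python bindings to NVIDIA's cuFile library. The file path and buffer size are illustrative; when GDS is unavailable, kvikio falls back to a POSIX read plus a host-to-device copy, which is essentially the baseline the non-GDS benchmark runs represent.

    import cupy as cp
    import kvikio

    # Hypothetical file on the GDS-enabled, NVMe-backed Lustre filesystem
    path = "/lustre/gds_test/data.bin"

    buf = cp.empty(256 * 1024 * 1024, dtype=cp.uint8)   # 256 MiB GPU-resident buffer

    # With GDS active, cuFile moves data NVMe -> GPU memory without staging
    # through a host bounce buffer; otherwise kvikio's compatibility mode is used.
    f = kvikio.CuFile(path, "r")
    f.read(buf)          # blocking read directly into device memory
    f.close()
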
Exploitation Route The DAOS work has been used to advance the integration of DAOS with WarpX and E3SM, which will be of use to the communities that use these codes.
Sectors Digital/Communication/Information Technologies (including Software)

URL https://sc23.supercomputing.org/proceedings/workshops/workshop_pages/ws_pdsw111.html
 
Description Exascale Data Testbed for Simulation, Data Analysis & Visualisation
Amount £200,000 (GBP)
Funding ID ST/V006282/1 
Organisation Science and Technologies Facilities Council (STFC) 
Sector Public
Country United Kingdom
Start 03/2021 
End 03/2022
 
Description SPF ExCALIBUR EX20- 6: I/O & Storage: ExcaliStore
Amount £741,553 (GBP)
Organisation Meteorological Office UK 
Sector Academic/University
Country United Kingdom
Start 05/2021 
End 05/2024
 
Title Exascale Data Testbed 
Description The Data Accelerator (DAC) consists of 24 Dell PowerEdge R740xd servers, each with 12 1.5TB NVMe disks. Using open-source software written in-house, they interface with Slurm's burst buffer plugin via etcd, a key-value store for distributed systems. When a user requests a buffer in their job script, a Lustre filesystem is created on demand using enough NVMe disks to satisfy the user's size requirement. Once the Slurm job ends, the filesystem is destroyed and the resources are released. The DAC nodes are connected to the ToR switches in the Cascade Lake racks; the 24 DAC servers are located in pairs across 12 Cascade Lake racks and are connected via low-latency, high-bandwidth networks. 
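
To illustrate the workflow from the user's side, the sketch below generates and submits a batch job that requests a buffer. The burst-buffer directive shown uses the DataWarp-style "#DW jobdw" syntax supported by Slurm's burst-buffer plugins; the capacity, node count, script name and environment variable are illustrative, and the exact directive accepted by the DAC depends on the local plugin configuration, so treat this as a sketch rather than a recipe.

    import subprocess
    import textwrap

    # Illustrative batch script; the burst-buffer directive is site-specific.
    job_script = textwrap.dedent("""\
        #!/bin/bash
        #SBATCH --nodes=10
        #SBATCH --time=01:00:00
        #DW jobdw capacity=10TiB access_mode=striped type=scratch

        # The plugin exports the mount point of the on-demand Lustre
        # filesystem created from the DAC's NVMe pool for this job.
        ./my_io_workload.sh "$DW_JOB_STRIPED"
    """)

    with open("dac_job.sh", "w") as f:
        f.write(job_script)

    subprocess.run(["sbatch", "dac_job.sh"], check=True)
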
Type Of Technology New/Improved Technique/Technology 
Year Produced 2022 
Open Source License? Yes  
Impact We wanted to re-run the IO500 to check that performance was as expected. To do this, we submitted a Slurm job to execute the IO500 script on 10 compute nodes against a 336TiB buffer, and re-ran it to tweak values so that the tests ran long enough for a valid result. At the time, some DAC nodes were being used for other tests, so only 20 of the 24 were available. The 336TiB available gave us a Lustre filesystem with 240 OSTs and 20 MDTs. Compared to our 2019 IO500 submission (https://io500.org/submissions/view/78), each node here had only one NIC (in our 2019 submission we were using nodes with 2 NICs). In general we felt the performance from these tests looked as expected and gave a very similar result to our previous run. The buffer we were creating had 20 MDTs, and by setting this value to 20 in the benchmark configuration we saw consistently better mdtest results. Our results show a 23% performance improvement (compared to the 2019 OPA result) on the IO500 benchmark on the DAC platform with Mellanox installed. (Note that the version of Lustre used differed between the OPA run (v2.13 development branch) and the Mellanox run (v2.12.5 production branch).) 
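
As an aside on the mdtest tuning, spreading the benchmark's directories across all 20 MDTs is one natural way to exploit that metadata layout. The sketch below assumes the tuned value corresponds to a directory stripe count across MDTs, and the path is hypothetical.

    import subprocess

    # Hypothetical directory inside the job's burst buffer
    testdir = "/dac/buffer/io500-datafiles"

    # "lfs setdirstripe -c 20" creates a directory striped across 20 MDTs
    # (Lustre DNE striped directories), so mdtest's metadata load is spread
    # over all of the buffer's metadata targets rather than a single one.
    subprocess.run(["lfs", "setdirstripe", "-c", "20", testdir], check=True)
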
URL https://excalibur.ac.uk/projects/exascale-data-testbed/