Crop Diversity GPU - Growing Plant Understanding
Lead Research Organisation:
James Hutton Institute
Department Name: Information & Computational Sciences
Abstract
Crop Diversity GPU - Growing Plant Understanding for resilient food systems and global plant biodiversity conservation (CD-GPU)
Comprehensive collections of plant biological materials that capture wide genetic diversity have been carefully established and curated at institutes across the UK, such as the national collections curated by our project partners at the Natural History Museum and the Royal Botanic Gardens in Kew and Edinburgh, as well as diverse materials for crop pre-breeding at NIAB, the James Hutton Institute (JHI) and Scotland's Rural College (SRUC).
Such resources are increasingly being exploited through high-throughput 'omics' approaches that capture large and complex datasets describing the features they display, such as DNA sequence ('genomics'), plant characteristics (from the level of whole fields down to individual plant tissues and cells; 'phenomics'), and the amount and nature of gene and protein expression ('transcriptomics' and 'proteomics'). Such approaches underpin broad areas of plant R&D, from landscape-scale analysis all the way down to fundamental research on individual genes and their variants. Further, there is a technological revolution underway within industrial crop production systems, whereby increased crop monitoring combined with robotic technologies is being exploited to streamline crop production - providing further opportunity for joint academic-industrial research underpinned by detailed temporal and spatial datasets from real production scenarios. Fully exploiting the resulting datasets and information, both within these disciplines and at the research interface between them, requires powerful hardware. Specifically, computing infrastructures must incorporate graphics processing units (GPUs), alongside central processing units (CPUs) and storage, to provide the task parallelisation and memory bandwidth required for the analysis of datasets at this scale, as well as to support their analysis and interpretation via emerging Machine Learning (ML) and Artificial Intelligence (AI) approaches.
CD-GPU will provide compute capabilities to a consortium of seven UK institutes working in complementary areas of plant and crop science. Specifically, it will deliver the following hardware:
1) A 300% increase in GPU capacity for artificial intelligence and machine learning methods
2) A 65% increase in storage capacity for 'omics' data and associated modelling
3) A 55% increase in CPU capacity to meet the demand for high-memory bioinformatics applications such as plant genome assemblies at pangenome scale
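To make the percentages above concrete: an N% increase corresponds to a multiplier of 1 + N/100 on existing capacity, so a 300% increase means four times the current GPU capacity. The short Python sketch below works this through. The function name is illustrative, and because the baseline capacities are not stated in this record, only relative multipliers are computed.

```python
def capacity_multiplier(percent_increase: float) -> float:
    """Return the factor by which capacity grows for a given percentage increase."""
    return 1 + percent_increase / 100


# Percentage increases quoted in the abstract (baselines are not given there).
increases = {"GPU": 300, "storage": 65, "CPU": 55}
multipliers = {name: capacity_multiplier(pct) for name, pct in increases.items()}

for name, factor in multipliers.items():
    print(f"{name}: {factor:.2f}x existing capacity")
```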
The resource will provide a step-change in the research work we undertake to help underpin sustainable food production, and to understand and reverse plant biodiversity loss across the world. Importantly, by tailoring CD-GPU to our common research needs, and via the provision of associated technical support and training to the users of the resource, this project will build a strong user community with complementary research aims and help encourage collaboration and innovation between our seven institutes.
Finally, while there is an environmental cost to running such a resource, the plant, crop and agri-tech science it will enable supports the drive to net zero. Shared infrastructure located at a single site, rather than separate resources at each institute, rationalises compute provision, so reducing the environmental impact of installing and maintaining the resource. We prioritise hardware on performance per unit of energy used: rather than processors with fewer, very fast cores, we select ones with many cores that run more efficiently - a good fit for many of our targeted analysis tasks, where parallelising jobs can realise large cost and performance benefits. Similarly, a single GPU with its thousands of parallel cores can - when appropriate - perform the same work both faster and more efficiently than hundreds of CPUs.
Technical Summary
We will purchase a High-Performance Computing (HPC) cluster tailored to provide significant Graphics Processing Unit (GPU) capabilities. This configuration will enhance our ability to (i) process and store the large data volumes produced by cutting-edge 'omics, imaging and sensing technologies, and (ii) apply artificial intelligence, deep learning and computer vision approaches to gain a deeper understanding of such data at high throughput. The platform will:
- Deliver GPU acceleration via 16 NVIDIA A100 cards, each providing ~7,000 CUDA cores and 40-80GB of memory. A-series GPUs enable up to 20x the Tensor floating-point operations per second of the previous V-series cards, offer twice the memory bandwidth, and support multi-instance GPU partitioning, allowing many users to access a single card simultaneously.
- Build on existing petabyte-scale storage by providing much-needed additional space (1PB primary, 1PB backup), essential for the large volumes of data being generated, while increasing performance via the scale-up/scale-out benefits of the BeeGFS filesystem and a shift from 25Gbps to 100Gbps networking. Included is 200TB of all-flash storage ("scratch space") for accelerating file operations during active job processing.
- Include 16 compute nodes, each providing two AMD EPYC processors with 32 cores each (1,024 cores total), 512GB memory (8TB total) and a 1TB SSD, delivering a 40% performance increase over existing hardware while also improving performance-per-watt.
It will use open-source solutions, including: Rocky Linux (operating system); BeeGFS (storage); SLURM (job scheduling); Bioconda (informatics); Apptainer/Docker (containers); Ansible (configuration management); and Prometheus (monitoring). We will build and deploy bespoke tools and pipelines across the platform to benefit both local users and public-facing services, and provide expert technical support and training to users and collaborators, supporting collaboration and innovation across our institutes and research sectors.
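The cluster-wide totals quoted above follow directly from the per-node figures. As a minimal sketch, the Python below recomputes them. The class and variable names are illustrative, and it assumes 1TB = 1024GB, which is the convention that reproduces the 8TB memory figure. The aggregate GPU memory is given as a range because the summary states 40-80GB per card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node hardware figures taken from the technical summary."""
    sockets: int
    cores_per_socket: int
    memory_gb: int


NODE_COUNT = 16
NODE = NodeSpec(sockets=2, cores_per_socket=32, memory_gb=512)

# CPU cores and memory across all compute nodes.
total_cores = NODE_COUNT * NODE.sockets * NODE.cores_per_socket
total_memory_tb = NODE_COUNT * NODE.memory_gb / 1024  # assumes 1TB = 1024GB

# Aggregate GPU memory, as a (low, high) range for 40GB vs 80GB cards.
GPU_COUNT = 16
gpu_memory_range_gb = (GPU_COUNT * 40, GPU_COUNT * 80)

print(f"CPU cores: {total_cores}")
print(f"Node memory: {total_memory_tb:g} TB")
print(f"GPU memory: {gpu_memory_range_gb[0]}-{gpu_memory_range_gb[1]} GB")
```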
Publications
Adams TM (2023) HISS: Snakemake-based workflows for performing SMRT-RenSeq assembly, AgRenSeq and dRenSeq for the discovery of novel plant disease resistance genes, in BMC Bioinformatics
Alström P (2023) Systematics of the avian family Alaudidae using multilocus and genomic data, in Avian Research
Alström P (2023) Integrative taxonomy reveals unrecognised species diversity in African Corypha larks (Aves: Alaudidae), in Zoological Journal of the Linnean Society
Brown MJM (2023) Re-evaluating the importance of threatened species in maintaining global phytoregions, in New Phytologist
Delisle Z (2023) Modelling density surfaces of intraspecific classes using camera trap distance sampling, in Methods in Ecology and Evolution
Ding G (2023) The Dissection of Nitrogen Response Traits Using Drone Phenotyping and Dynamic Phenotypic Analysis to Explore N Responsiveness and Associated Genetic Loci in Wheat, in Plant Phenomics
Elliott T (2023) Global analysis of Poales diversification - parallel evolution in space and time into open and closed habitats, in New Phytologist
Foster P (2023) Recoding Amino Acids to a Reduced Alphabet may Increase or Decrease Phylogenetic Accuracy, in Systematic Biology
Li S (2023) Alternative splicing impacts the rice stripe virus response transcriptome, in Virology