Crop Diversity GPU - Growing Plant Understanding
Lead Research Organisation: James Hutton Institute
Department Name: Information & Computational Sciences
Abstract
Crop Diversity GPU - Growing Plant Understanding for resilient food systems and global plant biodiversity conservation (CD-GPU)
Comprehensive collections of plant biological materials that capture wide genetic diversity have been carefully established and curated at institutes across the UK, such as the national collections curated by our project partners at the Natural History Museum and the Royal Botanic Gardens in Kew and Edinburgh, as well as diverse materials for crop pre-breeding at NIAB, the James Hutton Institute (JHI) and Scotland's Rural College (SRUC).
Such resources are increasingly being exploited through high-throughput 'omics' approaches that capture large and complex datasets describing the features they display, such as DNA sequence ('genomics'), plant characteristics (from the level of whole fields down to individual plant tissues and cells; 'phenomics'), and the amount and nature of gene and protein expression ('transcriptomics' and 'proteomics'). Such approaches underpin broad areas of plant R&D, from landscape-scale analysis all the way down to fundamental research on individual genes and their variants. Further, there is a technological revolution underway within industrial crop production systems, whereby increased crop monitoring combined with robotic technologies is being exploited to streamline crop production - providing further opportunity for joint academic-industrial research underpinned by detailed temporal and spatial datasets from real production scenarios. Fully exploiting the resulting datasets and information, both within these disciplines and at the research interface between them, requires powerful hardware. Specifically, computing infrastructures must incorporate graphics processing units (GPUs), alongside central processing units (CPUs) and storage, to provide the task parallelisation and memory bandwidth required for the analysis of datasets at this scale, as well as to support their analysis and interpretation via emerging Machine Learning (ML) and Artificial Intelligence (AI) approaches.
CD-GPU will provide compute capabilities to a consortium of seven UK institutes working in complementary areas of plant and crop science. Specifically, it will deliver the following hardware:
1) A 300% increase in GPU capacity for artificial intelligence and machine learning methods
2) A 65% increase in storage capacity for 'omics' data and associated modelling
3) A 55% increase in CPU capacity to meet the demand for high memory bioinformatics applications such as plant genome assemblies at pangenome scale
The resource will provide a step-change in the research work we undertake to help underpin sustainable food production, and to understand and reverse plant biodiversity loss across the world. Importantly, by tailoring CD-GPU to our common research needs, and via the provision of associated technical support and training to the users of the resource, this project will build a strong user community with complementary research aims and help encourage collaboration and innovation between our seven institutes.
Finally, while there is an environmental cost to running such a resource, the plant, crop and agri-tech science it will enable helps the drive to net zero. Shared infrastructure located at a single site, rather than separate resources at each institute, rationalises compute provision, so reducing the environmental impact of installing and maintaining the resource. The hardware we select is prioritised on performance per unit of energy used: rather than processors with fewer very fast cores, we select ones with many cores that run more efficiently - a perfect fit for many of our targeted analysis tasks, where parallelising jobs can realise huge cost and performance benefits. Similarly, a single GPU with its thousands of parallel cores can - when appropriate - perform the same work both faster and more efficiently than hundreds of CPUs.
Technical Summary
We will purchase a High-Performance Computing (HPC) cluster tailored to provide significant Graphics Processing Unit (GPU) capabilities. This configuration will enhance our ability to (i) process and store the large data volumes produced by cutting-edge 'omics, imaging and sensing technologies, and (ii) apply artificial intelligence, deep learning and computer vision approaches to gain a deeper understanding of such data at high throughput. The platform will:
- Deliver GPU acceleration via 16 NVIDIA A100 cards, each providing ~7,000 CUDA cores and 40-80GB of memory. Compared with the previous-generation V100 cards, A100 GPUs offer up to 20x the Tensor floating-point operations per second and twice the memory bandwidth, and they support multi-instance GPU (MIG) partitioning, which allows many users to access a single card simultaneously.
- Build on existing petabyte-scale storage by providing much-needed additional space (1PB primary, 1PB backup), essential for the large volumes of data being generated, while increasing performance via the scale-up/scale-out benefits of the BeeGFS filesystem and a shift from 25Gbps to 100Gbps networking. Included is 200TB of all-flash storage ("scratch space") for accelerating file operations during active job processing.
- Include 16 compute nodes, each providing two 32-core AMD EPYC processors (1,024 cores in total), 512GB of memory (8TB in total) and a 1TB SSD, delivering a 40% performance increase over existing hardware while also improving performance-per-watt.
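The aggregate figures quoted in the list above can be checked with some simple arithmetic. The sketch below is illustrative only: the per-node and per-card numbers come from the text, while the limit of seven MIG instances per A100 is taken from NVIDIA's published specification rather than from this summary, and the transfer times assume an ideal link running at full line rate with no protocol or filesystem overhead.

```python
# Back-of-envelope totals for the hardware described above.
NODES = 16
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 32
MEM_PER_NODE_GB = 512

GPUS = 16
GPU_MEM_GB_RANGE = (40, 80)   # the text quotes 40-80GB per card
MAX_MIG_PER_A100 = 7          # assumption: NVIDIA's stated MIG limit, not from this summary

OLD_LINK_GBPS = 25
NEW_LINK_GBPS = 100


def transfer_seconds(size_tb: float, link_gbps: float) -> float:
    """Ideal time to move size_tb terabytes over a link at full line rate.

    Uses decimal units (1 TB = 8000 gigabits) and ignores all overhead,
    so real transfers will be slower.
    """
    return size_tb * 8000 / link_gbps


total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_mem_tb = NODES * MEM_PER_NODE_GB / 1024
gpu_mem_gb = tuple(GPUS * m for m in GPU_MEM_GB_RANGE)
mig_slices = GPUS * MAX_MIG_PER_A100

print(f"CPU cores in total:     {total_cores}")
print(f"Node memory in total:   {total_mem_tb:.0f} TB")
print(f"GPU memory in total:    {gpu_mem_gb[0]}-{gpu_mem_gb[1]} GB")
print(f"Max concurrent MIG use: {mig_slices} instances")
print(f"1 TB at {OLD_LINK_GBPS} Gbps:  {transfer_seconds(1, OLD_LINK_GBPS):.0f} s")
print(f"1 TB at {NEW_LINK_GBPS} Gbps: {transfer_seconds(1, NEW_LINK_GBPS):.0f} s")
```

The core and memory totals agree with the 1,024 cores and 8TB stated above, and the move from 25Gbps to 100Gbps cuts the ideal transfer time by a factor of four.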
It will use open-source solutions, including: Rocky Linux (operating system); BeeGFS (storage); SLURM (job scheduling); bioconda (informatics); Apptainer/Docker (containers); Ansible (configuration management); and Prometheus (monitoring). We will build and deploy bespoke tools and pipelines across the platform to benefit both local users and public-facing services, and provide expert technical support and training to users/collaborators, supporting collaboration and innovation across our institutes and research sectors.
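To make the software stack concrete, the following is a minimal sketch of how a containerised GPU job might be described for SLURM and Apptainer. It is not taken from the project's own tooling: the `GpuJob` class, the `render_sbatch` helper, the partition name `gpu`, the image `vision.sif` and the command `python train.py` are all hypothetical placeholders. Only the `#SBATCH` options and Apptainer's `--nv` flag are standard.

```python
from dataclasses import dataclass


@dataclass
class GpuJob:
    """Hypothetical description of one GPU job (illustration only)."""
    name: str
    command: str
    image: str                   # path to a container image (placeholder)
    partition: str = "gpu"       # assumed partition name; site-specific in practice
    gpus: int = 1
    cpus: int = 8
    mem_gb: int = 64
    time_limit: str = "04:00:00"


def render_sbatch(job: GpuJob) -> str:
    """Render a SLURM batch script that runs the command inside Apptainer."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --partition={job.partition}",
        f"#SBATCH --gres=gpu:{job.gpus}",
        f"#SBATCH --cpus-per-task={job.cpus}",
        f"#SBATCH --mem={job.mem_gb}G",
        f"#SBATCH --time={job.time_limit}",
        "",
        "# --nv makes the host's NVIDIA driver and devices visible in the container",
        f"apptainer exec --nv {job.image} {job.command}",
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch(
    GpuJob(name="leaf-seg", command="python train.py", image="vision.sif")
)
print(script)
```

The rendered text would be saved to a file and submitted with `sbatch`. Requesting only the GPUs, cores and memory a job actually needs is what lets a scheduler share a cluster like this fairly between institutes.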
Publications
Adams TM
(2023)
HISS: Snakemake-based workflows for performing SMRT-RenSeq assembly, AgRenSeq and dRenSeq for the discovery of novel plant disease resistance genes.
in BMC Bioinformatics
Alström P
(2024)
Integrative taxonomy reveals unrecognised species diversity in African Corypha larks (Aves: Alaudidae)
in Zoological Journal of the Linnean Society
Alström P
(2023)
Systematics of the avian family Alaudidae using multilocus and genomic data
in Avian Research
Bachman S
(2024)
Extinction risk predictions for the world's flowering plants to support their conservation
in New Phytologist
Bates HJ
(2024)
Comparative genomics and transcriptomics reveal differences in effector complement and expression between races of Fusarium oxysporum f.sp. lactucae.
in Frontiers in Plant Science
Bennici S
(2024)
The origin and the genetic regulation of the self-compatibility mechanism in clementine (Citrus clementina Hort. ex Tan.)
in Frontiers in Plant Science
Bilton T
(2024)
Construction of relatedness matrices in autopolyploid populations using low-depth high-throughput sequencing data
in Theoretical and Applied Genetics
Braichenko S
(2024)
Polymorphism-Aware Models in RevBayes: Species Trees, Disentangling Balancing Selection, and GC-Biased Gene Conversion.
in Molecular Biology and Evolution
Brown MJM
(2023)
Re-evaluating the importance of threatened species in maintaining global phytoregions.
in New Phytologist
| Description | BioSS procurement of High Performance Computing (HPC) Cluster |
| Amount | £839,299 (GBP) |
| Organisation | Department for Business, Energy & Industrial Strategy |
| Sector | Public |
| Country | United Kingdom |
| Start | 05/2023 |
| End | 06/2024 |
| Description | HPC Equipment |
| Amount | £1,300,000 (GBP) |
| Organisation | Department for Business, Energy & Industrial Strategy |
| Sector | Public |
| Country | United Kingdom |
| Start | 05/2023 |
| End | 06/2024 |
| Title | A high-quality reference genome of the widely-farmed banded cricket (Gryllodes sigillatus) |
| Description | ABSTRACT: Farmed insects have gained attention as an alternative, sustainable source of protein with a lower carbon footprint than traditional livestock. We present a high-quality reference genome for one of the most commonly farmed insects, the banded cricket Gryllodes sigillatus. In addition to its agricultural importance, G. sigillatus is also a model in behavioural and evolutionary ecology research on reproduction and mating systems. We report comparative genomic analyses that clarify the banded cricket's evolutionary history, identify gene family expansions and contractions unique to this lineage, associate these with agriculturally important traits, and identify targets for genome-assisted breeding efforts. The high-quality G. sigillatus genome assembly plus accompanying comparative genomic analyses serve as foundational resources for both applied and basic research on insect farming and behavioural biology, enabling researchers to pinpoint trait-associated genetic variants, unravel functional pathways governing those phenotypes, and accelerate selective breeding efforts to increase the efficacy of large-scale insect farming operations. This repository includes annotation files to the genome assembly. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2025 |
| Provided To Others? | Yes |
| URL | https://zenodo.org/doi/10.5281/zenodo.14617131 |
| Title | A high-quality reference genome of the widely-farmed banded cricket (Gryllodes sigillatus) |
| Description | ABSTRACT: Farmed insects have gained attention as an alternative, sustainable source of protein with a lower carbon footprint than traditional livestock. We present a high-quality reference genome for one of the most commonly farmed insects, the banded cricket Gryllodes sigillatus. In addition to its agricultural importance, G. sigillatus is also a model in behavioural and evolutionary ecology research on reproduction and mating systems. We report comparative genomic analyses that clarify the banded cricket's evolutionary history, identify gene family expansions and contractions unique to this lineage, associate these with agriculturally important traits, and identify targets for genome-assisted breeding efforts. The high-quality G. sigillatus genome assembly plus accompanying comparative genomic analyses serve as foundational resources for both applied and basic research on insect farming and behavioural biology, enabling researchers to pinpoint trait-associated genetic variants, unravel functional pathways governing those phenotypes, and accelerate selective breeding efforts to increase the efficacy of large-scale insect farming operations. This repository includes annotation files to the genome assembly. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2025 |
| Provided To Others? | Yes |
| URL | https://zenodo.org/doi/10.5281/zenodo.14617130 |
| Title | HPC Scripts |
| Description | Collections of scripts and utilities used by our HPC resource that have been made available to others as part of the publication: https://doi.org/10.1002/ppp3.10607 |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Weekly reporting has improved how efficiently our users use the HPC resource. |
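The record above does not say what the weekly reports contain. As an illustration only, and not a reproduction of the published scripts, one common measure is CPU efficiency: the CPU time a job actually consumed divided by the core-time it reserved. The sketch below computes that per user from a few made-up job records; the `JobRecord` fields and all of the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one finished job."""
    user: str
    alloc_cpus: int      # cores reserved for the job
    elapsed_s: float     # wall-clock run time in seconds
    cpu_time_s: float    # CPU time consumed, summed over all cores


def weekly_summary(jobs: list[JobRecord]) -> dict[str, float]:
    """Return each user's CPU efficiency: used core-seconds / reserved core-seconds."""
    used: dict[str, float] = {}
    reserved: dict[str, float] = {}
    for job in jobs:
        used[job.user] = used.get(job.user, 0.0) + job.cpu_time_s
        reserved[job.user] = (
            reserved.get(job.user, 0.0) + job.alloc_cpus * job.elapsed_s
        )
    # Skip users whose jobs reserved no core-time, to avoid dividing by zero.
    return {u: used[u] / reserved[u] for u in used if reserved[u] > 0}


# Made-up example data for two users.
jobs = [
    JobRecord("alice", alloc_cpus=16, elapsed_s=3600, cpu_time_s=28800),
    JobRecord("alice", alloc_cpus=4, elapsed_s=1800, cpu_time_s=7200),
    JobRecord("bob", alloc_cpus=32, elapsed_s=7200, cpu_time_s=23040),
]

summary = weekly_summary(jobs)
for user, eff in sorted(summary.items()):
    print(f"{user}: {eff:.0%} of reserved CPU time used")
```

A low figure suggests a job asked for far more cores than it could keep busy, which is the kind of feedback a regular report can give users.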
| Description | HPC Training (Introduction to Linux) |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Professional Practitioners |
| Results and Impact | We run regular training courses, held both online and in person, that introduce users to our cluster, working with Linux, and submitting jobs on a shared system (via SLURM). Each course lasts approximately one day and is attended by 20-30 people at a time. As our partnership involves multiple institutions all accessing the same HPC cluster, these courses are also a good chance for people from different organisations to meet, establish new contacts, and discover shared research areas. |
| Year(s) Of Engagement Activity | 2022,2023,2024 |
| URL | https://github.com/cropgeeks/hpc-training |
