CLIMB-BIG-DATA: A Cloud Infrastructure for Big-Data Microbial Bioinformatics

Lead Research Organisation: Institute of Food Research
Department Name: Microbes in the Food Chain

Abstract

High-throughput sequencing has transformed microbiology, delivering an explosion in genomic and metagenomic big data. However, many microbiologists remain unable to exploit large genomics datasets to address key questions in microbiology, because they lack access to the relevant computational resources, bioinformatics tools or expertise in data analysis. To address this problem, six years ago we launched CLIMB, a pioneering British cloud-computing infrastructure project funded by the MRC that has supported >900 users. As CLIMB comes to an end, we propose a unique new partnership, CLIMB-BIG-DATA (Cloud Infrastructure for Big-Data Microbial Bioinformatics), to meet the bioinformatic needs of the UK microbiology community as we head into the 2020s. This new CLIMB-BIG-DATA partnership will occupy a distinctive position in the UK, underpinning research in the academic sector alongside the front-line work of government agencies and the health service, while also supporting research that maps on to a wide variety of national/UKRI and international strategic priorities and Official Development Assistance objectives.

In response to community needs (as evidenced by >160 signatories), the proposed partnership will maintain the existing CLIMB infrastructure to support hundreds of research projects, including high-profile efforts to track the spread of Ebola or Zika virus. However, we also promise to deliver a step-change in the scale and scope of what we can offer to users. We will adopt a matrix model, in which a range of activities will be mapped on to strategically important themes championed by our investigators: Antimicrobial Resistance; Emerging Infectious Disease and Global Health; Microbial Genomics for Public Health; Microbial Communities and Metagenomics; Pathogen Biology and Functional Genomics; and Sequencing Technologies.

Activities aimed at community engagement will include bioinformatics workshops, hackathons and symposia. Activities focused on tools and integration will include enhanced support for sharing software and data, workflow integration and migration between clouds; enhanced support and security for clinical applications; plus integration with large datasets at external facilities, such as the European Nucleotide Archive. Activities focused on infrastructure include: provision of graphics processing units and enhanced storage; maintenance of our original cloud-computing infrastructure to support microbial bioinformatics; plus incorporation of cloud infrastructures from the MRC unit in the Gambia and from the Quadram Institute. The CLIMB-BIG-DATA partnership will run as a UKRI-supported project for five years, with the expectation that the project will become self-financing through robust pathways to sustainability and expansion. The partnership will draw upon a diverse team of partners from multiple research organisations and collaborators from government agencies, and it will be run from the Quadram Institute in Norwich, which as a strategically funded UKRI research institute will provide a first-class stable and resilient environment for the project's future.

Technical Summary

The CLIMB-BIG-DATA partnership will provide a substantial computational resource that will enhance UK capability and infrastructure in microbial bioinformatics, building on our highly successful CLIMB project. Our computational infrastructure will feature an OpenStack cloud architecture with >10,000 virtual CPU cores spanning six research organisations (incorporating clouds from the MRC unit in the Gambia and the Quadram Institute), with access to the Ceph platform to implement object storage. A dedicated web portal, Bryn, will allow users to gain easy access to their own virtual machines, preconfigured with powerful user-friendly bioinformatics tools. We will add newly requisitioned specialised servers aimed at memory-intensive tasks (e.g. metagenomic assembly) or compute-intensive tasks (e.g. GPU nanopore analyses) and we will add substantial additional storage (>3 petabytes). Other features will include a freely accessible database of relevant workflows, pipelines, scripts, programs, preconfigured virtual machine images and containers, curated to support strategically relevant themed activities; an accreditation-compliant computational infrastructure for linking sensitive human and animal health metadata with microbial sequence data; support for containerisation via the Docker Engine and Singularity; and a capability to share VMs, containers, data and software across the entire CLIMB-BIG-DATA infrastructure and with public cloud providers (with cloud bursting on to public clouds, should demand spike on our own infrastructure). We also promise an ambitious and exciting programme of training and community engagement, featuring hackathons, workshops, and modules suitable for a wide range of users, from undergraduate students to professional bioinformaticians, in the UK and more widely. We will build protocols for demand management and for charging users as we move towards becoming self-sustaining, and will also improve integration with public facilities and new potential partner sites.
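
As an illustration only of the cloud-bursting behaviour mentioned above, the following minimal Python sketch shows one way a placement decision could be made. It is not the project's implementation: the Site class, the place_request function, the site names and the core counts are all hypothetical, and the policy (prefer the private site with the most free cores, burst to a public cloud only when no private site can fit the request) is an assumption chosen for simplicity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical private cloud site with a fixed pool of virtual CPU cores."""
    name: str
    total_vcpus: int
    used_vcpus: int = 0

    def free_vcpus(self) -> int:
        return self.total_vcpus - self.used_vcpus


def place_request(vcpus: int, private_sites: list[Site],
                  public_name: str = "public-cloud") -> str:
    """Return the name of the site that should host a request for `vcpus` cores.

    Assumed policy: use the private site with the most free cores if any
    site can fit the request; otherwise burst to the public cloud.
    """
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    candidates = [s for s in private_sites if s.free_vcpus() >= vcpus]
    if not candidates:
        return public_name  # no private capacity left: burst
    best = max(candidates, key=lambda s: s.free_vcpus())
    best.used_vcpus += vcpus  # reserve the cores on the chosen site
    return best.name


if __name__ == "__main__":
    sites = [Site("site-a", 64, used_vcpus=60), Site("site-b", 32, used_vcpus=8)]
    print(place_request(16, sites))  # fits on site-b
    print(place_request(16, sites))  # nothing fits any more, so it bursts
```

A production scheduler would also need to weigh data locality, cost and the accreditation requirements for sensitive metadata, none of which this sketch models.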

Planned Impact

This research will benefit a range of groups beyond the academic disciplines that take in microbiology:

1. Clinical and veterinary microbiologists, vets and governmental organisations such as APHA, FSA and local agencies, including health services such as PHE/PHW, who have a role in tracking zoonotic disease and tracing pathogens through the food chain. These users will be able to use our computational infrastructure to integrate informatics systems, animal and human health metadata, epidemiological disease patterns and microbial (meta)genomic data to elucidate modes and routes of transmission, detect outbreaks and explore the relationships between potential pathogens and disease, with impacts on animal health, welfare, and disease prevention. This system will also provide an infrastructure that will bring new opportunities for productive engagement between organisations focused on animal health and the academic sector, so that research findings and approaches can be more easily translated into outcomes that impact food security and human and animal health.

2. Industrial users stand to benefit in several ways. The tools around the characterisation and development of novel antimicrobials and metagenomics are of wide interest to industrial beneficiaries as these tools will be invaluable for the identification of new targets and the rational design of probiotic treatments for the prevention of microbial disease in farmed animals. Industrial users will also benefit from the tools and data that the infrastructure will make available. These will allow the rapid contextualisation and characterisation of bacteria of industrial importance (for example in product spoilage), information that can then be used to design interventions or better optimise preservative selection.

3. Commercial beneficiaries include sequencing companies, computer companies and private laboratories, who stand to benefit from increased demand for their products and opportunities for innovation and spread of best practice (NB: both Solexa and Oxford Nanopore Sequencing were developed within the UK, with benefits to our economy).

4. Anyone planning a large cloud-based computing project will be able to draw on the example and precedent we set here.

5. Policy makers will benefit from grounding their public policy and legislation, e.g. on food safety or pandemic preparedness, in a more solid understanding of bacterial evolution, epidemiology, population genetics and taxonomy.

6. The wider public will benefit from the positive impacts on food security, reduced preservative use, and increased profitability of UK companies, resulting in stronger tax revenues for the UK.

This work will also make a decisive contribution, through employment and training, to enhancing the professional and research skills base of the United Kingdom, contributing to the development of the knowledge economy through the training of undergraduates and postgraduates in data-intensive research techniques, using the common CLIMB-BIG-DATA platform.
