EBI Metagenomics - enabling the reconstruction of microbial populations

Lead Research Organisation: Earlham Institute
Department Name: Research Faculty


Microorganisms inhabit practically all environments on Earth. For example, there are more microbes in the ocean than stars in the known universe, with complex communities living in vastly different niches, from the tropics to the polar waters and from well-lit surface waters to the deep abyss. They harvest and transduce solar energy, and it is estimated that they contribute 50-90% to global primary production, turning light into biomass through photosynthesis, making them vital to the world's food chain. Microbes produce and consume most greenhouse gases (carbon dioxide, nitrous oxide and methane), which is of particular importance in relation to man-made climate change. They are also responsible for over half of all oxygen production on Earth. Within ecosystems, microbes catalyse the key biogeochemical transformations of nutrients and trace elements that sustain organic productivity. Understanding these processes would bring many potential benefits. For example, working out the mechanisms by which microbes unlock organic phosphate to a soluble form that can be absorbed by plants could reduce the use of fertilizers and increase agricultural yields.
Within each environment, the microbial population contains a vast and dynamic reservoir of genetic variability, much of which is yet to be studied. Current biological databases do not represent the vast majority of environmental organisms, as traditional genome sequencing approaches require isolation and culturing. Metagenomics, the sequencing of the entire collection of DNA found within an environmental sample, circumvents this need. As a result, we have begun to answer some of the key questions about which organisms are found in which environments. There has been a huge uptake of the approach across a broad range of disciplines. Nevertheless, the majority of metagenomics projects produced over the past decade have given only a fragmentary picture of the underlying microorganisms' genomes, as larger volumes of sequencing are required to improve the level of genomic detail.
In the era of data-driven science, and with widespread access to sequencing technology and ever-diminishing costs, huge volumes of sequence data present an amazing opportunity to understand the microbial world at a more detailed level. However, the field of metagenomics faces the following issues: 1) given the vast data volumes, specialist expert-built pipelines are required for efficient, high-throughput analysis; 2) bioinformatics analysis of results is costly to produce and requires expert knowledge; 3) to extract maximum knowledge from experiments, there is a need to systematically capture the associated experimental data along with the sequence data; 4) there is a lack of consistency between different analysis approaches, affecting comparability. The EBI Metagenomics (EMG) resource solves these issues by offering a free service for the analysis and archiving of all metagenomic data.
With advances in algorithms and methods, it is now possible to piece together the fragments that make up an individual organism's genome. In this project, we will not only continue the provision of the EMG, but also develop the analysis, archiving, tools and data presentation frameworks required to generate genomes from metagenomes. Due to the unique position of EMG, we will also be able to combine data across different projects that contain similar microbial communities. This important data reuse will enable us to generate the highest quality genomes, allow us to detect different strains of bacteria and ensure that we capitalize on previous investments. Our genomes will enrich the current tree of life, and we will extend the EMG interfaces to accommodate the new data that we will produce. This will empower research and innovation in the environment, bioindustries, agriculture and medicine (human and livestock). We will work closely with biotechnological industries to enable them to harness the huge potential for discovery.

Technical Summary

Metagenomics is a widely used approach to investigate the composition and function of microbial communities. With the development of modern sequencing platforms, data generation is rarely the bottleneck, but rather its analysis. Even when researchers have access to large-scale computing facilities, two metagenomics datasets are rarely analysed in the same way and the workflows used to produce results are virtually impossible to reconstruct. The EBI metagenomics (EMG) resource solves all of the above problems by providing a freely available service for the analysis and archiving (via the European Nucleotide Archive, ENA) of metagenomics data. It also provides a platform for the discovery of analysed metagenomics datasets. As these are uniformly analysed, it enables comparability and meta-analysis across projects and biomes. Unlike any other public analysis service, EMG has an archiving remit. The capture of rich, contextual metadata associated with the sequencing data ensures maximal data longevity and reuse. Over and above this, EMG is also a data generator, in terms of functional and taxonomic annotations, and has already analysed a world-leading 100,000 publicly available datasets.

To date, EMG has focused entirely on annotating raw reads. While this provides a comprehensive analysis of all sampled microorganisms, the disconnected and fragmentary nature of the data has some limitations, e.g. lack of full-length peptides. To overcome this, we will expand the service to include assembly of metagenomics data. We will build reproducible workflows (deployable within multiple cloud environments) and develop tools to reveal near-complete genome maps for the more abundant organisms found within a sample, or that occur commonly across samples. ENA will be extended to allow more comprehensive capture of this assembly data. We will extend EMG to include a catalogue of metagenome-assembled genomes, offering insights into tens of thousands of novel microbial genomes.

Planned Impact

Metagenomics is a rapidly expanding field and the depth and breadth of data are constantly increasing. At the same time, experimental approaches for investigating different microbiomes are constantly improving, providing deeper insights into microbes occupying particular environments. The use of metagenomics is widespread in research projects associated with BBSRC strategic priorities - agriculture and food security, industrial biotechnology and bioscience for health - and the field represents the epitome of data-driven biology. This proposal will contribute to the continued support and development of the world-leading EBI metagenomics (EMG) resource. Moreover, its expansion to offer assembly (and genomic reconstruction) as a public service will make EMG unique in the world of metagenomics analysis provision. In addition, the application of assembly workflows will be taken to an unprecedented level of scale, scope and precision, allowing even deeper insights into the microbial world. This will enable the scientific community to make the leap from correlative observations to mechanistic hypothesis generation. Such deep knowledge will be of particular importance for cross-cutting themes, such as understanding antimicrobial resistance, discovery of new secondary metabolites (e.g. antimicrobial agents), host-microbe interactions (plant/animal) and microbial ecology.

The scientific community benefits from EMG in many ways. Primarily it provides freely available services for analysis and archiving (via the ENA) of microbiome sequence data, helping democratise the research field by overcoming limitations of compute and informatics expertise. It also provides a platform for discovery of analysed metagenomics data, already amassing over 100,000 datasets (representing nearly a petabyte of processed data). These are uniformly analysed, enabling comparability and meta-analysis across projects and biomes. Archiving of sequence data with rich experimental metadata also encourages data re-use. Beyond this, EMG outputs will have applications in a wide range of academic and industrial fields, including enzyme discovery, environmental science, diagnostics and animal/human health, as assembly begins to provide a more complete picture of microbial communities.

The results of the project will be of exceptional value to the commercial sector, and the benefits will eventually feed through to the public, in the form of new antibiotics for humans and livestock, higher agricultural yields from the understanding of socio-ecological interplay (e.g., food chain microbes) and expanded discovery of novel enzymes capable of operating at extremes, such as psychrophilic enzymes for detergents, or with novel catalytic functionality (e.g., anaerobic digestion pathways in biofuel production). Industrial partnering has demonstrated that EMG data outputs have increased translation rates within this sector, and continued support for the resource will enhance this.

There are also many technical developments within this project that will have far reaching impacts and can be applied to other analytical disciplines. For example, the use of workflows and containerisation of software for Cloud compute infrastructures will enable a new level of reproducibility and sharing.

We will ensure impact to all academic and industrial audiences by the publication of software, workflows, compute containers and peer reviewed articles. To address the skills shortages in the field of metagenomics informatics, we will also deliver training and webinars.

Metagenomics is pivotal to the notion of One Health - the collaborative effort of multiple disciplines working at national and international levels to attain optimal health for people, animals and the environment. This proposal (and EMG) encapsulates this philosophy, serving the major UK and international communities, and will deliver a cost-effective resource that will become the world's leading microbiome data service.



Related Projects

Project Reference  Relationship  Related To    Start       End         Award Value
BB/R015171/1                                   06/09/2018  29/09/2020  £192,609
BB/R015171/2       Transfer      BB/R015171/1  30/09/2020  05/09/2021  £15,051
Description We developed a pipeline during this project to extract genomes of microbes from time series of DNA extracted directly from communities. This enables us to understand what organisms are in a community and what they are doing. We have applied this pipeline in many areas of research relevant to health and biotechnology, including studies of dietary treatments for Crohn's disease and industrial biotechnology.
Exploitation Route Other individuals can use our pipeline in their research projects whenever microbial communities are studied, and the scientific conclusions from our study will be relevant in both medicine and engineering.
Sectors Agriculture, Food and Drink; Energy; Healthcare

Description We have generated a database of over 2,000 genomes from UK industrial anaerobic digestion reactors. This has enabled us to profile the community dynamics of these reactors over time and relate them to operating conditions. The results are relevant to the industrial operators and were fed back to them during a workshop in 2021.
First Year Of Impact 2020
Sector Energy
Impact Types Economic

Title Metahood 
Description Metahood is a Snakemake-based metagenomics pipeline. It performs the following steps: sample quality check and trimming; assemblies and co-assemblies; binning (CONCOCT/MetaBAT2); de novo tree construction for MAGs; DIAMOND annotation and profiles; output of annotated ORF graphs (derived from the assembly graph, marked TO_FIX); strain resolution (DESMAN).
Type Of Technology Software 
Year Produced 2020 
Impact This pipeline has been used for generating a large collection of genomes from anaerobic digesters.
URL https://github.com/Sebastien-Raguideau/Metahood