Enriching MGnify Genomes to capture the full spectrum of the microbiota and bolster taxonomic classifications

Lead Research Organisation: European Bioinformatics Institute
Department Name: Genome Assembly and Annotation

Abstract

Microbes (viruses, bacteria and single-celled eukaryotes) are ubiquitous in nature and perform key roles essential to sustain life, e.g. oxygenation of the planet by marine microbes, soil nutrient cycling to support plant growth or facilitating animal digestion, especially in humans. Increasing knowledge about microbial ecosystems has accompanied a broadening scope of environments analysed, such as anaerobic digesters, food production systems and the built environment (extending as far as the International Space Station). Metagenomics is a culture-independent method that applies modern DNA sequencing technologies to study the genomes of the organisms present in a microbiome. The latest approaches combine advanced sequencing technologies, high throughput and bioinformatics techniques to enable the assembly of short DNA fragments (produced by sequencing machines) into larger chromosomal fragments. Subsequently, these fragments are classified into sets belonging to an individual species, i.e. metagenome-assembled genomes (MAGs). While the first MAG was reported in 2004, the first large-scale study applying these techniques was published only in 2015. Since then, there has been an explosion in the number of MAGs reported, which not only provides novel insights into the ~99% of organisms yet to be experimentally cultured but also dramatically expands the Tree of Life. In addition to capturing the biodiversity of microbes, these MAGs facilitate a genome-centric understanding of their functional role within the community, and how they interact with each other and their surroundings. A substantial body of applied research leverages these findings to restore perturbed microbiomes to a healthy state or to harness the enzymes they encode.

This proposal focuses on MGnify, a resource that already performs four major roles in microbial community research: (i) it facilitates the capture of petabytes of sequence data being generated currently; (ii) it provides users access to the computational resources to conduct metagenomic assembly; (iii) it generates new knowledge by analysing microbiome derived sequence data and presenting this via a website and API to the user community; (iv) it has initiated capture of prokaryotic MAGs. In this proposal, we will extend MGnify to recover Eukaryotic MAGs using innovative new methodologies and capture the viruses in the MGnify assemblies. These non-redundant catalogues of Eukaryotic and viral genomes will be used to supplement the existing MGnify genomes. To perfect the MAG generation process, we propose to develop additional pipelines that will identify and remove the contaminants found in the prokaryotic MAGs. In addition to generating high-quality MAGs that cover the entire range of microbial taxa, we will harmonise efforts with the Genome Taxonomy Database (GTDB) to ensure that this newly discovered bacterial diversity is properly represented therein, as it is one of the most widely used resources for taxonomic classification. Underpinning this, we will enhance the metagenomic sequence submission systems to better cater for all data types and improve the internal mechanisms for data exchange, so that MGnify can perform submission on behalf of the users and gain access to all data types, whether the data is public or private (prepublication), given the appropriate user consent. Finally, in addition to updating the reference databases in our analysis pipelines, we will also improve the annotation of carbohydrate metabolism enzymes, which are poorly represented in databases currently.

Collectively, these developments will reinforce MGnify's crucial importance to the microbiome research community. It will serve as the foundational knowledgebase that propels integrative microbiome research and its translation to real world applications.

Technical Summary

Three major new areas of activity are proposed to enrich MGnify and meet the evolving demands of microbiome research: (i) improve the MGnify bacterial genomes and enable their incorporation into the Genome Taxonomy Database (GTDB); (ii) develop pipelines to facilitate the recovery of Eukaryotic genomes; (iii) identify and annotate viruses found in MGnify assemblies to enrich MGnify genomes. This proposal also describes significant updates to the MGnify analysis pipelines and the infrastructure underpinning the resource. To achieve this, we will undertake the following key developments:
1. Incorporate the latest biological information by updating the reference DB used in the MGnify analysis pipelines and the associated FAIR workflow descriptions.
2. Develop and apply an improved profile HMM library for the detection of CAZymes, utilising metagenomic sequences to improve its sensitivity. The library will be integrated into an annotation system that will also help to detect polysaccharide utilisation loci.
3. Extend client side validation tools and interfaces to enable easier submission of metagenomics datasets, including MAGs, and enrich internal access and control mechanisms between ENA and MGnify.
4. Assemble a pipeline that extends beyond the standard single copy marker genes to facilitate the systematic detection of contaminating contigs within MAGs, to produce a refined set of prokaryotic MAGs.
5. Co-develop a cloud based framework to generate the non-redundant set of MGnify MAGs and the GTDB taxonomy, and extend GTDB to incorporate MAGs, thus accurately reflecting the taxonomic diversity of prokaryotes.
6. Initiate a collection of Eukaryotic MAGs by developing a novel binning and refinement workflow.
7. Systematically detect and cluster viral sequences, enriching them with taxonomy, functional annotations and environmental metadata to produce a viral catalogue. Use computational methods to link phages to bacterial hosts, thereby connecting catalogues.
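
The "non-redundant set" of MAGs mentioned in item 5 can be illustrated with a minimal sketch of greedy dereplication. This is not the MGnify or GTDB pipeline; the function name `dereplicate`, the quality scores, the pairwise similarity values and the 0.95 threshold are all hypothetical, chosen only to show the idea of keeping one representative genome per cluster of near-identical genomes.

```python
from typing import Dict, FrozenSet, List


def dereplicate(
    quality: Dict[str, float],
    similarity: Dict[FrozenSet[str], float],
    threshold: float = 0.95,
) -> Dict[str, List[str]]:
    """Greedily cluster genomes, keeping the best-quality genome as representative.

    quality    -- hypothetical quality score per genome (higher is better)
    similarity -- hypothetical pairwise similarity in [0, 1], keyed by an
                  unordered pair of genome names; missing pairs count as 0.0
    threshold  -- assumed similarity cut-off for treating genomes as redundant
    """
    representatives: List[str] = []
    clusters: Dict[str, List[str]] = {}
    # Visit genomes from best to worst quality (name breaks ties) so that
    # each cluster is represented by its highest-quality member.
    for genome in sorted(quality, key=lambda g: (-quality[g], g)):
        for rep in representatives:
            if similarity.get(frozenset((genome, rep)), 0.0) >= threshold:
                clusters[rep].append(genome)  # redundant with an existing representative
                break
        else:
            representatives.append(genome)  # no close match: new representative
            clusters[genome] = [genome]
    return clusters


# Illustrative input only; these genomes and values are made up.
quality = {"MAG_A": 98.0, "MAG_B": 91.5, "MAG_C": 95.0, "MAG_D": 80.0}
similarity = {
    frozenset(("MAG_A", "MAG_B")): 0.99,
    frozenset(("MAG_A", "MAG_C")): 0.80,
    frozenset(("MAG_B", "MAG_C")): 0.82,
    frozenset(("MAG_C", "MAG_D")): 0.97,
}
clusters = dereplicate(quality, similarity)
```

Here the keys of `clusters` form the non-redundant set and each value lists the genomes that the representative stands in for. A production workflow would compute similarities from the sequences themselves rather than take them as input.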

Publications

Burgin J (2023) The European Nucleotide Archive in 2022. in Nucleic Acids Research

Gurbich T (2023) MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues in Journal of Molecular Biology

Richardson L (2023) MGnify: the microbiome sequence data analysis resource in 2023. in Nucleic Acids Research

 
Description Member, Resilience Frontiers Technology Advisory Group of the UN Climate Change Secretariat (UNFCCC)
Geographic Reach Multiple continents/international 
Policy Influence Type Participation in a guidance/advisory committee
 
Description Member, UKRI Pool of Experts
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
 
Description Panel Member, UKRI-IKC National Biofilms Innovation Centre
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
 
Description Scientific Advisory Board Member for the NFDI4Microbiota Consortium
Geographic Reach Europe 
Policy Influence Type Participation in a guidance/advisory committee
URL https://nfdi4microbiota.de/consortium/international-partners
 
Description "What metagenomic data can tell us about healing the planet" talk at the Life Science Across the Globe - talks on science and culture 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Talk by PI Rob Finn on MGnify at the "Learning from the planet to heal the planet: Microbial Ecosystems" online seminar series (hosted by EMBL and HHMI Janelia Research Campus).
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=Hc89Rrs_ykY&ab_channel=HHMI%27sJaneliaResearchCampus
 
Description 26th Annual Meeting EDF Plenary Guest Lecture "Role of microbial communities in skin health and disease" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Plenary guest lecture by PI Rob Finn at the 26th Annual Meeting of the European Dermatology Forum.
Year(s) Of Engagement Activity 2023
URL https://www.edf-meeting.com/en/program/plenary-guest-lectures
 
Description BIOCEV Special Lecture "Genome resolved metagenomics analysis for understanding the composition of the human gut microbiome" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Special Lecture by PI Rob Finn at the "Microbial Communities: Function, Structure, and Complexity" conference, which was held at BIOCEV (Vestec).
Year(s) Of Engagement Activity 2022
URL https://www.biocev.eu/en/about/events/microbial-communities-function-structure-and-complexity.294?ty...
 
Description BIOPROSP_23 Keynote talk "Genome Resolved Metagenomics - Understanding the potential of marine microbial communities for novel product discovery" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote talk by PI Rob Finn at the BIOPROSP-23 conference held at Tromsø, Norway. BIOPROSP is the international biennial scientific conference on marine biotechnology, which aims to translate basic research into applied research with industrial application.
Year(s) Of Engagement Activity 2023
URL https://www.tekna.no/en/events/bioprosp_23-42323/Program/?info=156913
 
Description EMBL-CSIC workshop talk "Multi-kingdom genome resolved metagenomics from different environments" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk by MGnify PI Rob Finn at the EMBL-CSIC (Consejo Superior de Investigaciones Científicas; Spanish National Research Council) Workshop 'One Health: Microbes in a changing world' held in Spain.
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=_25Yxl48-iY&ab_channel=CSICEventos
 
Description EMBL-EBI News "2.4 billion sequences now available in the latest MGnify protein database release" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Newsletter announcing a new release of the MGnify protein database, which contains 2.4 billion non-redundant sequences, including new annotations provided by Google AI.
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/about/news/updates-from-data-resources/2-4-billion-sequences-now-available-in-...
 
Description ETIM 2022 talk "Genome resolved metagenomics: understanding the metabolic potential of microbial communities" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk by MGnify PI Rob Finn at the ETIM 2022 meeting on Artificial Intelligence and Bioinformatics held in Essen.
Year(s) Of Engagement Activity 2022
URL https://etim.uk-essen.de
 
Description ICG-17 Keynote talk "Genome-level resolution metagenomics: from viruses to eukaryotes" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote speech by PI Rob Finn at the ICG-17 Conference held at Riga, Latvia.
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=x8WJysdL5zA&ab_channel=ICG-17Riga
 
Description ISME 18 Roundtable "What does it take to be FAIR?" by the National Microbiome Data Collaborative 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Roundtable organised by the National Microbiome Data Collaborative at ISME18. PI Rob Finn was an expert panelist on the roundtable. Discussions covered attitude shifts required for microbiome data sharing, what constitutes good metadata and other points.
Year(s) Of Engagement Activity 2022
URL https://twitter.com/MicrobiomeData/status/1559210668485640194
 
Description Public engagement talk "Microbes, genomes and communities" at the Saffron Walden Rotary Club 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Public engagement talk by PI Rob Finn, who spoke to members of the Saffron Walden Rotary Club about the MGnify microbiome resource he administers at EMBL-EBI and how the data can be leveraged to provide new insights into microbial diversity.
Year(s) Of Engagement Activity 2022
 
Description Virtual training course "Genome-resolved metagenomics bioinformatics" 2022 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Annual EMBL-EBI course delivered by the Microbiome Informatics Team, which administers the MGnify microbiome resource. Participants learnt about the tools, processes and analysis approaches used in the field of genome-resolved metagenomics. Course materials: https://www.ebi.ac.uk/training/materials/genome-resolved-metagenomics-bioinformatics-materials/
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/training/events/metagenomics-bioinformatics-2022/#vf-tabs__section--tab1