Enriching MGnify Genomes to capture the full spectrum of the microbiota and bolster taxonomic classifications

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

Three major new areas of activity are proposed to enrich MGnify and meet the evolving demands of microbiome research: (i) improve the MGnify bacterial genomes and enable their incorporation into the Genome Taxonomy Database (GTDB); (ii) develop pipelines to facilitate the recovery of eukaryotic genomes; (iii) identify and annotate viruses found in MGnify assemblies to enrich MGnify genomes. This proposal also describes significant updates to the MGnify analysis pipelines and the infrastructure underpinning the resource. To achieve this, we will undertake the following key developments:
1. Incorporate the latest biological information by updating the reference databases used in the MGnify analysis pipelines and the associated FAIR workflow descriptions.
2. Develop and apply an improved profile HMM library for the detection of CAZymes, built from metagenomic sequences to increase sensitivity. These profiles will be integrated into an annotation system that will also help to detect polysaccharide utilisation loci.
3. Extend client-side validation tools and interfaces to enable easier submission of metagenomics datasets, including MAGs, and enrich internal access and control mechanisms between ENA and MGnify.
4. Assemble a pipeline that extends beyond the standard single-copy marker genes to facilitate the systematic detection of contaminating contigs within MAGs, producing a refined set of prokaryotic MAGs.
5. Co-develop a cloud-based framework to generate the non-redundant set of MGnify MAGs and the GTDB taxonomy, and extend GTDB to incorporate MAGs, thus accurately reflecting the taxonomic diversity of prokaryotes.
6. Initiate a collection of eukaryotic MAGs by developing a novel binning and refinement workflow.
7. Systematically detect and cluster viral sequences, enriching them with taxonomy, functional annotations and environmental metadata to produce a viral catalogue. Use computational methods to link phages to bacterial hosts, thereby connecting catalogues.
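Development 4 looks beyond single-copy marker genes for contamination signals. As a hedged illustration of the kind of per-contig signal such a pipeline can exploit, the sketch below flags contigs whose GC content or read coverage deviates sharply from the rest of the bin. The function name, thresholds, and inputs are invented for illustration; this is not the project's actual method.

```python
# Illustrative only: flag putative contaminant contigs in a MAG by how far
# their GC content and mean read coverage deviate from the bin-wide median.
from statistics import median

def flag_outlier_contigs(contigs, gc_tol=0.05, cov_fold=3.0):
    """contigs: dict of name -> (gc_fraction, mean_coverage).
    Returns names of contigs outside the tolerances around the bin median."""
    gc_med = median(gc for gc, _ in contigs.values())
    cov_med = median(cov for _, cov in contigs.values())
    flagged = []
    for name, (gc, cov) in contigs.items():
        gc_outlier = abs(gc - gc_med) > gc_tol
        cov_outlier = cov > cov_fold * cov_med or cov * cov_fold < cov_med
        if gc_outlier or cov_outlier:
            flagged.append(name)
    return flagged

bin1 = {
    "contig_1": (0.42, 30.0),
    "contig_2": (0.43, 28.0),
    "contig_3": (0.60, 29.0),   # GC outlier
    "contig_4": (0.41, 250.0),  # coverage outlier
}
print(flag_outlier_contigs(bin1))  # ['contig_3', 'contig_4']
```

Real refinement tools combine many more signals (tetranucleotide composition, taxonomy of individual genes), but the median-deviation idea above is the simplest version of the same principle.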

Publications

 
Description The primary aim of this project is to extend the functionality of MGnify and to develop tools for microbiota analysis. MGnify (https://www.ebi.ac.uk/metagenomics) is a resource extensively used for submitting, analysing, and comparing microbiome data, including metagenomic data. Metagenomics is the study of all of the genetic material recovered from an environment, allowing the taxa present in that environment to be identified and functionally characterised.

So far this study has contributed to tools for metagenomic analysis in several ways:
- Microbial genomes can be constructed from metagenomic data; these are referred to as metagenome-assembled genomes (MAGs). One way to improve the quality of MAGs is multi-coverage binning (https://pubmed.ncbi.nlm.nih.gov/37386187/). Including multi-coverage binning in the MGnify pipelines could improve the quality of the MAGs they generate, but the method is computationally expensive, so we assessed whether it could be run efficiently within those pipelines. We concluded that while it could not be included in the standard pipelines, it could be applied selectively, on a case-by-case basis, depending on the composition of the data submitted to MGnify.
- KEGG is a commonly used reference database that groups proteins into functional units called KEGG orthologs (KOs). We developed an open-source pipeline, KOunt (https://github.com/WatsonLab/KOunt), which allows users to calculate KO abundance in metagenomic samples. It improves on previous pipelines by also calculating the abundance of RNA KOs and by clustering proteins by sequence identity, allowing users to gauge the scale of diversity within each KO.
- We created a tool that allows users to better link the microbial genome and protein information stored in MGnify.
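The core output of a KO abundance pipeline is an aggregation of per-protein read coverage into per-KO totals. The sketch below is a minimal stand-in for that final step, not the KOunt implementation; the protein names, KO assignments, and coverage values are invented for illustration.

```python
# Minimal sketch: sum per-protein read coverage into per-KO abundances.
# Proteins with no KO assignment are pooled under 'NoHit', mirroring the
# labelling convention described for KOunt.
from collections import defaultdict

def ko_abundance(protein_coverage, protein_to_ko):
    """protein_coverage: protein -> mean read coverage.
    protein_to_ko: protein -> KO identifier (missing keys mean no hit)."""
    totals = defaultdict(float)
    for protein, cov in protein_coverage.items():
        totals[protein_to_ko.get(protein, "NoHit")] += cov
    return dict(totals)

coverage = {"prot_1": 12.5, "prot_2": 7.5, "prot_3": 4.0}
annotations = {"prot_1": "K00001", "prot_2": "K00001"}  # prot_3 unannotated
print(ko_abundance(coverage, annotations))
# {'K00001': 20.0, 'NoHit': 4.0}
```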
Exploitation Route The tools developed during this project are all open-access or contribute to publicly-available resources (MGnify). They will expand the ability of microbiota researchers to produce high-quality metagenomic data and to functionally analyse such data.
Sectors Agriculture

Food and Drink

Environment

Healthcare

 
Title KOunt 
Description KOunt is a Snakemake pipeline that calculates the abundance of KEGG orthologues (KOs) in metagenomic sequence data. Starting from raw paired-end reads, KOunt quality-trims the reads, assembles them, predicts proteins and annotates the proteins with KofamScan. The reads are then mapped back to the assembly and per-protein coverage is calculated. Users have the option of calculating the coverage evenness of the proteins and filtering the KofamScan proteins to remove unevenly covered ones. The proteins annotated by KofamScan are clustered at 100%, 90% and 50% identity within each KO to quantify their diversity; because evenness filtering reduces the number of these proteins, we do not recommend enabling it if the clustering results are of interest. All predicted proteins that lack a KO hit, or that are excluded by evenness filtering, are labelled 'NoHit'. The NoHit proteins are blasted against a custom KO-annotated UniProt database, and their nucleotide sequences against a custom RNA database. Reads mapped to NoHit proteins that remain unannotated, together with unmapped reads, are blasted against the KOunt databases, and RNA is quantified in the remaining reads. 
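The optional evenness filter above scores how uniformly reads cover each protein. As a hedged sketch of one common evenness score (in the style of Oexle 2016, not necessarily the exact formula KOunt uses), the function below returns 1.0 for perfectly uniform depth and values near 0 when coverage is piled up in one spot; the depth arrays are invented for illustration.

```python
# Sketch of a coverage-evenness score: 1 - (total shortfall below the mean
# depth) / (mean depth * length). Uniform coverage scores 1.0.
def evenness(depths):
    """depths: per-base read depth along one protein's coding sequence."""
    mean_depth = sum(depths) / len(depths)
    if mean_depth == 0:
        return 0.0
    shortfall = sum(mean_depth - d for d in depths if d < mean_depth)
    return 1.0 - shortfall / (mean_depth * len(depths))

uniform = [10, 10, 10, 10]
spiky = [40, 0, 0, 0]  # same mean depth, very uneven
print(evenness(uniform))  # 1.0
print(evenness(spiky))    # 0.25
```

A filter would then drop proteins scoring below some cutoff before the KO clustering step.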
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? Yes  
Impact KOunt is a reproducible, open-source workflow that uses freely available software to calculate KO abundance in metagenomic sequence data, taking multiple approaches to improve the annotation of proteins and reads that initially lack a hit. Unlike other KO abundance tools, KOunt gives the user the option to calculate the abundance of RNA KOs in the metagenomes and to cluster the proteins by sequence identity, reporting the diversity within each KO. Currently cited by https://doi.org/10.1101/2023.12.18.572173. 
URL https://github.com/WatsonLab/KOunt
 
Title KOunt_databases_v1.tar 
Description Reference database for the KEGG Abundance tool KOunt 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact As of February 2024, this database has been downloaded 83 times. 
URL https://figshare.com/articles/online_resource/KOunt_databases_v1_tar/21269715/1
 
Title MGYG to MGYP accession mapping 
Description Enables EBI users to map MGYG (genome) accessions to MGYP (protein) accessions. 
Type Of Material Data analysis technique 
Year Produced 2023 
Provided To Others? Yes  
Impact There was previously no method available for mapping between EBI genome and protein files. This resource allows users to do so, linking proteins with their genome of origin, and vice versa. 
URL http://ftp.ebi.ac.uk/pub/databases/metagenomics/temp/MGYP_mappings/human_gut_mgyg_to_mgyp.tsv.gz
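As an illustration of using a mapping file like the one above, the sketch below builds a genome-to-proteins lookup from a gzipped two-column TSV. The column order (MGYG accession first, MGYP second) and the example accessions are assumptions about the file layout, not verified against the actual download.

```python
# Illustrative loader for a gzipped MGYG -> MGYP mapping TSV.
import csv
import gzip
import io
from collections import defaultdict

def load_mapping(tsv_gz_bytes):
    """Parse gzipped TSV bytes of (genome, protein) rows into a lookup."""
    genome_to_proteins = defaultdict(list)
    with gzip.open(io.BytesIO(tsv_gz_bytes), "rt") as handle:
        for genome, protein in csv.reader(handle, delimiter="\t"):
            genome_to_proteins[genome].append(protein)
    return genome_to_proteins

# Tiny in-memory stand-in for the real file; accessions are hypothetical.
rows = "MGYG000000001\tMGYP000000000001\nMGYG000000001\tMGYP000000000002\n"
mapping = load_mapping(gzip.compress(rows.encode()))
print(mapping["MGYG000000001"])
# ['MGYP000000000001', 'MGYP000000000002']
```

In practice the file would be downloaded from the FTP URL above and opened directly with `gzip.open(path, "rt")`.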
 
Title watson_and_mattock_v1.tar.gz 
Description Data in support of "A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination", Mick Watson and Jennifer Mattock, 2022 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL https://figshare.com/articles/dataset/watson_and_mattock_v1_tar_gz/19733509