2020 BBSRC-NSF/BIO: REDEFINE - Development of efficient, large-scale metagenomics sequence comparison algorithms to facilitate novel genomic insights

Lead Research Organisation: European Bioinformatics Institute
Department Name: Genome Assembly and Annotation

Abstract

Microbes are ubiquitous and perform essential roles that help sustain life on Earth, for example environmental oxygenation, soil nutrient cycling to support plant growth, and facilitating animal digestion. They also cause many diseases in plants and animals and can evolve rapidly to exploit new niches and/or combat antimicrobials. Metagenomics, a relatively new field, is a culture-independent method that applies sophisticated DNA sequencing technologies to analyse the total microbial genetic material from any environment. It is now possible to reassemble the millions of short DNA sequences to produce representations of the microbial genomes in a sample, termed metagenome-assembled genomes (MAGs), especially for bacteria. While this approach remains computationally expensive, the computer algorithms used to recover these genomes have been substantially improved to increase the accuracy of MAGs. In the past five years alone, many large-scale studies, including our own, have successfully applied these techniques to cumulatively generate millions of MAGs. This has provided scientists with novel insights into the ~99% of organisms that have yet to be experimentally cultured and has dramatically expanded the Tree of Life. These MAGs are reshaping our understanding of microbial community structure and the functional capacities of constituent members.

This explosion in MAG numbers nevertheless presents new challenges. These large-scale analyses can generate genomes in numbers that match GenBank's large genome collection, which is derived from traditional techniques of sequencing experimentally isolated microbes. Such genome collections have taken decades to build and are managed by large data centres. Yet there is now a need for groups to routinely perform comparisons between new MAG collections and such large reference genome collections. We propose to use a particular class of algorithm called MinHash, which rapidly estimates the similarity between two sets based on the number of shared entities, in our case short sequences. Most implementations of this approach have focused on the rapid comparison of one genome to another. In this proposal, we aim to use a range of computational techniques to enable the comparison of a large query dataset to a large reference database, so that the approach can be applied to microbial genomes, MAG collections and metagenomic sequences. We will develop this tool and apply it to a range of datasets, particularly those housed in MGnify, a leading database of metagenomic data. The key applications are the identification of errors in MAGs that were introduced by the computational methods, data reduction by identifying duplicate MAGs between datasets, the rapid incorporation of MAGs into catalogues of genomes that have been found in a particular environment, taxonomic classification of MAGs (by converting similarity distances to evolutionary distances), and the profiling of metagenome datasets to determine which genomes are likely to be found. These profiles will also enable the delineation of datasets that are poorly characterised by MAG/genome collections and their prioritisation for analysis (i.e. MAG generation).
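
As an illustration of the MinHash idea described above, the following Python sketch estimates the Jaccard similarity of two sequences from the smallest hashed k-mers of each. It is a minimal sketch under stated assumptions, not the project's implementation: the function names, the k-mer length of 21, the sketch size of 500 and the use of BLAKE2b as the hash are illustrative choices, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib


def kmers(seq, k=21):
    """Return the set of all length-k substrings of a DNA sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using a stable hash (illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=500):
    """Bottom-`size` MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def estimate_jaccard(sketch_a, sketch_b, size=500):
    """Estimate the Jaccard similarity of two sequences from their sketches.

    Take the `size` smallest hashes of the merged sketches and count the
    fraction of them that occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Because only a fixed number of hashes is kept per genome, the cost of a comparison depends on the sketch size rather than on the genome length, which is what makes this family of methods attractive for large collections.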

The outputs of this proposal are manifold. The first is a suite of software tools and associated workflows that can be installed and run on the computer command line. The application of the tool will lead to multiple new data outputs (refined MAGs, improved catalogues and metagenomic profiles) which will be made available via MGnify's web interfaces. To provide rapid access to these MAG catalogues, we will also deploy new web interfaces (implementing the new tools) that allow users to compare their own MAGs against established collections. This will not only democratise scientific research but also reduce the need for data duplication. We will also use specific use cases to demonstrate the utility of our tools and provide training and support for their use.

Technical Summary

Over the past five years, assembled metagenomic sequence data have been generated at such accelerated rates that they have overtaken the volume of sequence data from isolate microbial genomes. Furthermore, the numbers of distinct bacterial strains recovered from metagenomes already match the orders of magnitude of isolate strains (100,000s). This massive and continual expansion in the size and number of metagenomics datasets is increasingly yielding metagenome-assembled genomes (MAGs). Thus, there is an urgent need to produce new tools and resources that enable large-scale genome comparisons, as no existing approaches scale sufficiently to deal with both large queries and large reference databases. In this proposal, we will extend the functionality of sourmash, a widely used tool for performing sequence comparison using MinHash approaches, and apply it to datasets in MGnify. To achieve the necessary scalability we will speed up search through algorithmic optimisations (e.g. heuristics, caching), precalculation of sketches for reference databases and horizontal scaling using multiple compute nodes. Additional improvements will be achieved via implementations in Rust, the multi-paradigm programming language. Collectively, these will allow us to perform large-scale database comparisons that will enable:
(1) Detection of contaminating contigs in MAGs and reference genomes;
(2) Removal of redundancies between MAG collections by intra- and inter-dataset comparisons;
(3) Assignment of taxonomy to MAGs by converting MinHash distances into evolutionary distances;
(4) Profiling of short-read metagenomics datasets to detect novelty and permit matching to MAG collections.
These tools will be made available to the community as standalone software packages and via new web interfaces in MGnify, which will provide unparalleled access to the MGnify MAG catalogues. These tools will also be applied to MGnify data to improve MAG quality and help prioritise datasets for analysis.
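
The large-query versus large-reference comparison and the distance conversion in aim (3) can be sketched in Python as follows. This is a hedged illustration only and does not use the sourmash API. It assumes each sketch is a plain set of integer hashes selected by a fixed hash threshold, so that overlaps between sketches approximate overlaps between genomes. The function names, the 0.1 containment cut-off and k = 21 are hypothetical. The conversion shown is the Mash distance, D = -(1/k) ln(2j / (1 + j)), one commonly used way of turning a Jaccard estimate j into an approximate per-base mutation rate; the proposal does not state which conversion will be used.

```python
import math
from collections import defaultdict


def mash_distance(jaccard, k=21):
    """Convert a Jaccard estimate into the Mash distance for k-mer length k."""
    if jaccard <= 0.0:
        return 1.0
    if jaccard >= 1.0:
        return 0.0
    return min(1.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


def build_index(reference_sketches):
    """Precompute an inverted index: hash value -> names of references holding it."""
    index = defaultdict(set)
    for name, hashes in reference_sketches.items():
        for h in hashes:
            index[h].add(name)
    return index


def search(query_sketches, index, reference_sketches, min_containment=0.1):
    """Compare many queries to many references via the precomputed index.

    Only references that share at least one hash with a query are visited,
    so unrelated query/reference pairs cost nothing.
    """
    hits = []
    for qname, qhashes in query_sketches.items():
        shared = defaultdict(int)
        for h in qhashes:
            for rname in index.get(h, ()):
                shared[rname] += 1
        for rname, n in shared.items():
            containment = n / len(qhashes)  # fraction of the query found in the reference
            if containment >= min_containment:
                union = len(qhashes) + len(reference_sketches[rname]) - n
                hits.append((qname, rname, containment, mash_distance(n / union)))
    return hits
```

Building the index once for a reference collection and reusing it for every query is a simple form of the sketch precalculation mentioned above; the same hit list could in principle feed dereplication (near-zero distances) or flag query genomes with no close reference.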

Publications

Gurbich T (2023) MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. in Journal of Molecular Biology

Harrison PW (2023) Ensembl 2024. in Nucleic acids research

 
Description 26th Annual Meeting EDF Plenary Guest Lecture "Role of microbial communities in skin health and disease" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Plenary guest lecture by PI Rob Finn at the 26th Annual Meeting of the European Dermatology Forum.
Year(s) Of Engagement Activity 2023
URL https://www.edf-meeting.com/en/program/plenary-guest-lectures
 
Description BIOCEV Special Lecture "Genome resolved metagenomics analysis for understanding the composition of the human gut microbiome" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Special Lecture by PI Rob Finn at the "Microbial Communities: Function, Structure, and Complexity" conference, which was held at BIOCEV (Vestec).
Year(s) Of Engagement Activity 2022
URL https://www.biocev.eu/en/about/events/microbial-communities-function-structure-and-complexity.294?ty...
 
Description BioSB Computational Metagenomics Course talk "MGnify and metagenome resources at EBI" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk by MGnify Bioinformatician Tatiana Gurbich at the metagenomics course in Wageningen, Netherlands (online).
Year(s) Of Engagement Activity 2022
URL https://www.dtls.nl/courses/computational-metagenomics/
 
Description ETIM 2022 talk "Genome resolved metagenomics: understanding the metabolic potential of microbial communities" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk by MGnify PI Rob Finn at the ETIM 2022 meeting on Artificial Intelligence and Bioinformatics held in Essen.
Year(s) Of Engagement Activity 2022
URL https://etim.uk-essen.de
 
Description ICG-17 Keynote talk "Genome-level resolution metagenomics: from viruses to eukaryotes" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote speech by PI Rob Finn at the ICG-17 Conference held in Riga, Latvia.
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=x8WJysdL5zA&ab_channel=ICG-17Riga
 
Description ISME 18 Poster "MGnify Genomes: a resource for biome-specific genome catalogues" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster by MGnify Bioinformatician Tatiana Gurbich at the 18th International Symposium on Microbial Ecology conference.
ABSTRACT
MGnify provides a free-to-use platform for assembly, analysis and archiving of microbiome data from multiple environments. Recently we expanded the resource with the release of biome-specific non-redundant microbial genome catalogues that were generated using isolate and metagenome-assembled genomes that were assembled by MGnify or submitted to the European Nucleotide Archive by third parties. All genomes within a biome-specific catalogue are dereplicated to remove equivalences at the strain level. For species that contain multiple conspecific genomes after dereplication, we choose the highest quality genome as the species representative, always prioritising an isolate genome over a metagenome-assembled genome. For each catalogue we provide genomic sequences, functional annotations, pan-genomes for species that contain multiple conspecific genomes, a protein catalogue, a kraken2 database, and assembly statistics and metadata. The genomes and functional annotations can be browsed on the MGnify website or downloaded from the FTP server along with the rest of the data. We also provide a suite of search tools that allow users to compare their own gene sequences, whole genomes, or sets of genomes against the catalogues. At the time of writing, we have catalogues available for four biomes (human gut, human oral, cow rumen, and marine). In total, these catalogues are made up of nearly 300,000 genomes that are clustered into a total of 9,421 species representatives. The resource will continue to expand with the addition of new catalogues and updates to the existing catalogues.
Year(s) Of Engagement Activity 2022
URL https://isme18.isme-microbes.org/poster-program
 
Description Virtual training course "Genome-resolved metagenomics bioinformatics" 2022 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Annual EMBL-EBI course delivered by the Microbiome Informatics Team, which administers the MGnify microbiome resource. Participants learnt about the tools, processes and analysis approaches used in the field of genome-resolved metagenomics. Course materials: https://www.ebi.ac.uk/training/materials/genome-resolved-metagenomics-bioinformatics-materials/
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/training/events/metagenomics-bioinformatics-2022/#vf-tabs__section--tab1