19-BBSRC-NSF/BIO genomeRxiv: a microbial whole-genome database & diagnostic marker design resource for classification, identification & data sharing

Lead Research Organisation: University of Strathclyde
Department Name: Inst of Pharmacy and Biomedical Sci

Abstract

Precise identification of microorganisms that impact on society and the environment is a prerequisite for maintaining a healthy society and a healthy environment and for combating diseases, in addition to providing a sound empirical core for understanding microbiology. The DNA sequencing revolution has created the opportunity to use genome sequences of cultured and uncultured microorganisms for fast and precise identification. However, precise identification is impossible without reference databases that precisely circumscribe classes of microorganisms with their unique characteristics, and rapid identification is impossible without fast algorithms that can handle the deluge of genome sequences being generated. Therefore, we will enhance our current web server to develop genomeRxiv, which will provide a database of hundreds of thousands of accurately catalogued and classified public genome sequences, supplying the basic and applied research community with precise and accurate identification of unknown isolates based on their genome sequences alone.

A unique new feature will give the academic, industrial, and government communities the ability to identify, and announce, sequenced genomes without having to share the sequences themselves, providing confidentiality for commercially and ethically sensitive organisms (e.g. production engineering strains, potential bioweapons, and organisms subject to benefit sharing with indigenous communities). genomeRxiv will also enable practical application of its classification scheme by providing the capability to design molecular diagnostic tools to detect specific groupings of bacteria, including high impact microorganisms, directly in the environment.

We are uniquely placed to develop genomeRxiv by leveraging the computational tools and platforms that we have already developed and by integrating them into the new web server. We will combine the highly-resolved classification framework of Life Identification Numbers (PIs Vinatzer and Heath), the speed and computational efficiency of sourmash (PI Brown), and the precision and filtering of pyani (PI Pritchard), with the collaborative crowdsourcing framework of the LINbase web server (PIs Vinatzer and Heath).

Technical Summary

We will extend and enhance the capabilities of LINbase to produce the genomeRxiv web server, providing:

1. Greatly increased capacity and functionality for genome classification and identification.
2. Novel capabilities, e.g. users may instantly and easily obtain and share precise identity of newly sequenced genomes without revealing the genome sequence, even to genomeRxiv, maintaining confidentiality for commercially or otherwise sensitive organisms while retaining findability.

LINbase circumscribes groups of organisms by assigning Life Identification Numbers (LINs) to genome sequences in the database. LINs express genome similarity based on average nucleotide identity (ANI), providing a neutral genome similarity framework (conceptually similar to GPS coordinates) independent of taxonomic rank, to which users can "pin" circumscriptions of any named species or any other monophyletic genome-similarity group (from now on simply referred to as "group") below the rank of genus. These permit precise identification by placing newly-sequenced genomes within them.
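
The idea of a LIN as a positional code can be sketched roughly as follows. This is an illustration only, not the LINbase implementation: the ANI thresholds, the function names, and the rule for choosing the next integer are all assumptions made for the example. Each LIN position corresponds to one ANI threshold, a new genome inherits the LIN of its most similar neighbour up to the last threshold their ANI meets, and a group is "pinned" to a LIN prefix.

```python
from bisect import bisect_right

# Hypothetical ANI thresholds (percent), one per LIN position.
# The thresholds used by the real scheme may differ.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0, 99.9]


def shared_positions(ani):
    """Number of leading LIN positions two genomes share at this ANI."""
    # Counts how many thresholds are less than or equal to the ANI value.
    return bisect_right(THRESHOLDS, ani)


def assign_lin(ani_to_nearest, nearest_lin, existing_lins):
    """Assign a LIN to a new genome from its ANI to the most similar known genome."""
    k = shared_positions(ani_to_nearest)
    if k == len(THRESHOLDS):
        # Indistinguishable at the finest resolution: reuse the same LIN.
        return tuple(nearest_lin)
    prefix = tuple(nearest_lin[:k])
    # Integers already used at position k by genomes sharing this prefix.
    used = {lin[k] for lin in existing_lins if tuple(lin[:k]) == prefix}
    new_value = max(used, default=-1) + 1
    # Positions after the first differing one start again at zero.
    return prefix + (new_value,) + (0,) * (len(THRESHOLDS) - k - 1)


def in_group(lin, group_prefix):
    """A group is circumscribed by a LIN prefix; membership is a prefix match."""
    return tuple(lin[: len(group_prefix)]) == tuple(group_prefix)


if __name__ == "__main__":
    database = [(0, 0, 0, 0, 0, 0)]
    new_lin = assign_lin(96.2, database[0], database)
    print(new_lin, in_group(new_lin, (0, 0, 0, 0)))
```

Because membership is a prefix comparison, identifying a new genome against a pinned group does not depend on the group's taxonomic rank, which is the property the paragraph above describes.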

We will maximise database utility by making improvements in capacity, precision, and functionality to turn it into genomeRxiv:

1. Increase the number of genome sequences from approximately 8,000 to all prokaryotic genomes in NCBI's GenBank and JGI's Integrated Microbial Genomes (IMG) System (almost 500,000) and automatically import new genomes as they are released.
2. Maximise precision of classification and identification by pushing the resolution of LINs towards outbreak-level resolution.
3. Automatically classify bacteria based on validly published named species, genome phylogeny-based species clusters, and genome similarity-based clusters (cliques).
4. Automated diagnostic marker design specific to genomeRxiv classifications.
5. Increase speed of genome identification, and number of simultaneous users.
6. Improve the user interface.
 
Description We developed and implemented the genome-based prokaryotic classification and identification web server genomeRxiv (https://genomerxiv.cs.vt.edu/index.php), which integrates taxonomy with strain typing by expanding hierarchical taxonomy towards strain level for all prokaryotes, using a unified quantitative approach. The LINflow pipeline that classifies microbial genomes was rewritten, and we released new versions of the underpinning sourmash (https://sourmash.readthedocs.io/en/latest/index.html) and pyani (https://github.com/pyani-plus/pyani-plus) software as standalone tools. The pyani software specifically has been reimplemented to better take advantage of high performance computing clusters, with additional genome comparison methods, a new graph theory-based classification approach (cliques), and a shareable, reusable database backend.

The genomeRxiv service, and the underlying LINgroup methodology, is being used by the scientific community, US government agencies, and in teaching. In proof-of-concept investigations we have demonstrated that the resolution of our approach is generally high enough to distinguish strains causing different outbreaks, making it sufficient for outbreak detection but not outbreak investigations. The research supported four PhD students directly and a further five indirectly at Virginia Tech, a postdoctoral researcher and three undergraduate projects at UC Davis, and (in total) three postdoctoral researchers at Strathclyde. The work contributed to educational materials for use in undergraduate teaching at Strathclyde, MBL STAMPS in the US, and elsewhere.
Exploitation Route genomeRxiv is an open, public resource available to all. It provides rapid, accurate classification of prokaryotes on the basis of a draft or complete genome sequence. This primarily supports scientific efforts to rapidly classify pathogens and other microbes, but is also useful as an educational tool. The genome-based, quantitative classification system is independent of and complementary to existing taxonomies (e.g. NCBI and GTDB), and provides a mechanism to resolve disagreements between alternative classification schemes.

The LIN (Life Identification Number) scheme underlying genomeRxiv is of general utility and is gaining community traction/usage, having been employed and adapted in published work by our own and other groups describing studies of specific sets of organisms, or organisms in particular environments. We are exploring integration with the PhytoBacExplorer project through LP's position on the Scientific Advisory Board. The sourmash and pyani tools that underpin the service are downloaded thousands of times a month and are cited in hundreds of publications spanning a wide range of applications in microbial classification and microbiology in general. All software - genomeRxiv, sourmash, and pyani - is made available under permissive licences via either GitHub or GitLab; teaching and reference materials, and datasets are shared under Creative Commons licences.

The success of the genomeRxiv collaboration has led to an application for renewed funding to improve and expand genomeRxiv and its underpinning software to help accelerate research in microbial diversity, ecology, and evolution - alongside applied research in biotechnology, biosecurity, and plant, animal, and human health.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Education; Environment; Manufacturing, including Industrial Biotechnology

 
Description PhytoBacExplorer
Geographic Reach Multiple continents/international 
Policy Influence Type Participation in a guidance/advisory committee
 
Title genomeRxiv 
Description genomeRxiv is a public, web-accessible database of public genome sequences, accurately catalogued and classified by whole-genome similarity independent of their taxonomic affiliation. This resource integrates the highly-resolved classification framework of Life Identification Numbers (LINs) with the speed and computational efficiency of sourmash and k-mer hashing algorithms, and the precision and filtering of average nucleotide identity (ANI). We aim to construct a single genome-based indexing scheme that extends from phylum to strain, enabling the unique and consistent placement of any sequenced prokaryote genome. 
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? Yes  
Impact The LIN approach that underpins genomeRxiv has been adopted by a number of research groups for their own datasets, with a diverse set of impacts. These include an update to the BIGSdb resource for analysis of bacterial isolates (doi:10.1101/2024.03.11.584534), generation of the Pneumococcal Genome Library (doi:10.1099/mgen.0.001280), and revised classifications for complex pathogen groups such as Pseudomonas (doi:10.1038/s41597-024-03003-x), Streptococcus (doi:10.1099/mgen.0.001278), and Klebsiella at the Pathogenwatch service (e.g. https://cgps.gitbook.io/pathogenwatch/technical-descriptions/typing-methods/klebsiella-lin-codes#introduction). 
URL https://genomerxiv.cs.vt.edu/
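
The k-mer hashing mentioned in the description above can be illustrated with a small bottom-k MinHash sketch. This is a generic, textbook-style sketch and not sourmash's API or implementation: the function names, the k-mer size, the sketch size, and the choice of hash function are assumptions made for the example, and reverse-complement handling is deliberately omitted.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=50):
    """Keep only the `size` smallest hash values: a bottom-k MinHash sketch."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=50):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    # The smallest hashes of the merged sketches act as a random sample of the union.
    union = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for value in union if value in shared) / len(union)


if __name__ == "__main__":
    a = "ACGTTGCAAGGCTTAACCGGTTAACGT"
    b = "ACGTTGCAAGGCTTAACCGGTTAACGA"
    print(jaccard_estimate(sketch(a), sketch(b)))
```

Comparing two fixed-size sketches is much cheaper than comparing the full genomes, which is why this kind of hashing is useful for a fast first pass before a more precise ANI comparison.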
 
Title 16S rRNA phylogeny and clustering is not a reliable proxy for genome-based taxonomy in Streptomyces 
Description This file is intended as supplementary information for a forthcoming publication: 16S rRNA phylogeny and clustering is not a reliable proxy for genome-based taxonomy in Streptomyces. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://zenodo.org/record/8223787
 
Title 16S rRNA phylogeny and clustering is not a reliable proxy for genome-based taxonomy in Streptomyces 
Description This file is intended as supplementary information for a forthcoming publication: 16S rRNA phylogeny and clustering is not a reliable proxy for genome-based taxonomy in Streptomyces. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://zenodo.org/record/8223786
 
Title Supplementary data for Kiepas et al. (2024) "16S taxonomy and clustering is not a proxy for taxonomy in Streptomyces" 
Description This repository contains all supplementary information for analyses reported in Kiepas et al. (2024) describing inconsistencies between taxonomies inferred using 16S and whole-genome identities in Streptomyces. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact This data supports research establishing that 16S metabarcoding is not sufficient for species-level resolution in Streptomyces, which has implications for metabarcoding studies in many contexts. 
URL https://github.com/sipbs-compbiol/Kiepas_et_al_2024_16S
 
Description PhytoBacExplorer Scientific Advisory Board 
Organisation Leibniz Association
Department Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures
Country Germany 
Sector Public 
PI Contribution I sit on the PhytoBacExplorer Scientific Advisory board, having been appointed in 2022. We meet to discuss the progress and implementation of PhytoBacExplorer, which includes discussion of genome-based taxonomy and nomenclature. Our experience with genomeRxiv and pyani helps inform choices made for classification of plant pathogenic microbes in PhytoBacExplorer.
Collaborator Contribution My colleagues on the PhytoBacExplorer Scientific Advisory board share their expertise and experience, including the progress and implementation of PhytoBacExplorer, and discussion of genome-based taxonomy and nomenclature. This informs our approaches for pyani and genomeRxiv.
Impact The scientific advisory board has input which influences the PhytoBacExplorer online resource for analysis and visualisation of genomic variation in plant-pathogenic bacteria. The service can be found at https://phytobacexplorer.warwick.ac.uk/
Start Year 2022
 
Description PhytoBacExplorer Scientific Advisory Board 
Organisation University of Warwick
Country United Kingdom 
Sector Academic/University 
PI Contribution I sit on the PhytoBacExplorer Scientific Advisory board, having been appointed in 2022. We meet to discuss the progress and implementation of PhytoBacExplorer, which includes discussion of genome-based taxonomy and nomenclature. Our experience with genomeRxiv and pyani helps inform choices made for classification of plant pathogenic microbes in PhytoBacExplorer.
Collaborator Contribution My colleagues on the PhytoBacExplorer Scientific Advisory board share their expertise and experience, including the progress and implementation of PhytoBacExplorer, and discussion of genome-based taxonomy and nomenclature. This informs our approaches for pyani and genomeRxiv.
Impact The scientific advisory board has input which influences the PhytoBacExplorer online resource for analysis and visualisation of genomic variation in plant-pathogenic bacteria. The service can be found at https://phytobacexplorer.warwick.ac.uk/
Start Year 2022
 
Description What's in a name? Fit-for-purpose bacterial nomenclature 
Organisation Microbiology Society
Country United Kingdom 
Sector Learned Society 
PI Contribution I was co-organiser of an international focused meeting on bacterial nomenclature, classification, and phylogenomics, titled "What's in a name? Fit-for-purpose bacterial nomenclature", held in Glasgow in September 2023. This brought together representatives from the International Committee on Systematics of Prokaryotes, Microbiology Society, SeqCode, American Society for Microbiology, The National Collection of Type Cultures, and industry stakeholders (design and manufacture of clinical diagnostic equipment). I presented an account of genomeRxiv's motivation, capabilities, and implementation, and participated actively in the discussion as we attempted to plan a route for resolving nomenclatural issues for prokaryotes. AK, a PhD student now employed as PDRA on this project, presented their work on classification of Streptomyces using tools generated by this project.
Collaborator Contribution The Microbiology Society funded the meeting, including travel, accommodation, and a meal for the participants.
Impact Outputs are in preparation, including a publication report and a whitepaper with recommendations for assigning prokaryotic nomenclature and for resolving conflicts in cases such as diagnostic equipment in a clinical setting.
Start Year 2023
 
Title pyani-plus 
Description pyani-plus is a rewritten and reimplemented update of pyani, incorporating multiple novel capabilities including: storage of results in a lightweight, shareable database backend; implementation of novel Overall Genome Relatedness Index (OGRI) measures (e.g. dnadiff, sourmash, tANI); graph theory-based clique identification; new visualisation outputs; the ability to compare results obtained by a range of OGRI methods on the same input dataset. 
Type Of Technology Software 
Year Produced 2024 
Open Source License? Yes  
Impact None to report as yet. 
URL https://github.com/pyani-plus/pyani-plus
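
The graph theory-based clique identification mentioned in the description above can be pictured as follows: genomes are nodes, two genomes are connected when their pairwise identity meets a threshold, and a group is a clique when every pair within it is connected. The sketch below is a minimal illustration with made-up identity values and hypothetical function names; it does not reflect how pyani-plus implements this.

```python
from itertools import combinations


def build_graph(identity, threshold):
    """Adjacency sets for genomes whose pairwise identity meets the threshold."""
    adjacency = {genome: set() for pair in identity for genome in pair}
    for (a, b), value in identity.items():
        if value >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def components(adjacency):
    """Connected components, found with an iterative depth-first search."""
    seen, result = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


def is_clique(adjacency, group):
    """True when every pair of genomes in the group is directly connected."""
    return all(b in adjacency[a] for a, b in combinations(group, 2))


if __name__ == "__main__":
    # Made-up pairwise identities for four hypothetical genomes.
    identity = {("A", "B"): 0.97, ("A", "C"): 0.96,
                ("B", "C"): 0.93, ("C", "D"): 0.80}
    for threshold in (0.95, 0.93):
        graph = build_graph(identity, threshold)
        for group in components(graph):
            print(threshold, sorted(group), is_clique(graph, group))
```

A connected component that is not a clique signals that at least one pair of genomes in the group falls below the threshold, so the grouping is not internally consistent at that resolution.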
 
Title widdowquinn/pyani: v0.2.11 
Description This release fixes issues due to pandas API changes: exceptions used in pyani are now found in pandas.errors, not pandas.io.common; changes to the testing API (will not affect most users). 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact pyani is widely used internationally for definitive assignment of microbial taxonomy, and has contributed to the improved classification of numerous microbes of importance industrially and as pathogens. Over 150 such publications cited the pyani software in 2021; as pyani has been downloaded over 17,000 times (averaging over 800 downloads a month as of March 2022) and software is not always cited appropriately in literature, we expect the undocumented use to be more extensive than this. 
URL https://zenodo.org/record/5013461
 
Description Podcast appearance and interview 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I appeared as a guest on the MicroBinfie podcast, whose topic is microbial bioinformatics. My interview/appearance was split across two episodes (numbers 67 and 68), broadcast originally in November and December 2021. The programme's focus was on the influence of whole-genome based taxonomy and classification on modern microbiology, and the intended purpose was to inform and update the listening community, which is expected to include microbiologists, bioinformaticians, students (postgrad and undergrad) and any interested parties. In particular, my intent was to promote the nomenclature-free classification we are building in the genomeRxiv project, to raise awareness and promote discussion.

After the episode I was contacted by other researchers interested in discussing the topic. The other guest (Conor Meehan) and I discussed plans for writing and hosting a whole-genome classification training course.
Year(s) Of Engagement Activity 2021
URL https://soundcloud.com/microbinfie/tracks