19-BBSRC-NSF/BIO genomeRxiv: a microbial whole-genome database & diagnostic marker design resource for classification, identification & data sharing

Lead Research Organisation: University of Strathclyde
Department Name: Inst of Pharmacy and Biomedical Sci

Abstract

Precise identification of microorganisms that impact on society and the environment is a prerequisite for maintaining a healthy society and a healthy environment and for combating diseases, in addition to providing a sound empirical core for understanding microbiology. The DNA sequencing revolution has created the opportunity to use genome sequences of cultured and uncultured microorganisms for fast and precise identification. However, precise identification is impossible without reference databases that precisely circumscribe classes of microorganisms with their unique characteristics, and rapid identification is impossible without fast algorithms that can handle the deluge of genome sequences being generated. Therefore, we will enhance our current web server to develop genomeRxiv, which will provide a database of hundreds of thousands of accurately catalogued and classified public genome sequences, supplying the basic and applied research community with precise and accurate identification of unknown isolates based on their genome sequences alone.

A unique new feature will be provision of the academic, industrial, and government communities with the ability to identify, and announce, sequenced genomes without having to share sequences themselves, providing confidentiality for commercially- and ethically-sensitive organisms (e.g. production engineering strains, potential bioweapons, and benefit sharing with indigenous communities). genomeRxiv will also enable practical application of its classification scheme by providing the capability to design molecular diagnostic tools to detect specific groupings of bacteria, including high impact microorganisms, directly in the environment.

We are uniquely placed to develop genomeRxiv by leveraging the computational tools and platforms that we have already developed and by integrating them into the new web server. We will combine the highly-resolved classification framework of Life Identification Numbers (PIs Vinatzer and Heath), the speed and computational efficiency of sourmash (PI Brown), and the precision and filtering of pyani (PI Pritchard), with the collaborative crowdsourcing framework of the LINbase web server (PIs Vinatzer and Heath).

Technical Summary

We will extend and enhance the capabilities of LINbase to produce the genomeRxiv web server, providing:

1. Greatly increased capacity and functionality for genome classification and identification.
2. Novel capabilities, e.g. users may instantly and easily obtain and share the precise identity of newly sequenced genomes without revealing the genome sequence, even to genomeRxiv, maintaining confidentiality for commercially or otherwise sensitive organisms while retaining findability.

LINbase circumscribes groups of organisms by assigning Life Identification Numbers (LINs) to genome sequences in the database. LINs express genome similarity based on average nucleotide identity (ANI), providing a neutral genome similarity framework (conceptually similar to GPS coordinates) independent of taxonomic rank, to which users can "pin" circumscriptions of any named species or any other monophyletic genome-similarity group (from now on simply referred to as "group") below the rank of genus. These permit precise identification by placing newly-sequenced genomes within them.
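
As a rough illustration of how a LIN prefix can act as a group circumscription, the Python sketch below treats a LIN as a tuple of integers, one position per ANI threshold. The threshold values, the example LINs, and the function names are hypothetical and chosen only for illustration; they are not LINbase's actual scheme or values.

```python
from typing import Sequence, Tuple

# Hypothetical ANI thresholds (percent), one per LIN position, in increasing order.
# The real LINbase thresholds are not given in the text above.
THRESHOLDS: Tuple[float, ...] = (70.0, 80.0, 90.0, 95.0, 99.0, 99.9)

LIN = Tuple[int, ...]


def shared_prefix_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the leading positions at which two LINs agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def min_ani_implied(a: LIN, b: LIN) -> float:
    """Lowest ANI (percent) implied by the shared prefix; 0.0 if nothing is shared."""
    n = shared_prefix_length(a, b)
    return THRESHOLDS[n - 1] if n else 0.0


def in_group(lin: LIN, group_prefix: LIN) -> bool:
    """A genome belongs to a group if its LIN starts with the group's prefix."""
    return tuple(lin[: len(group_prefix)]) == tuple(group_prefix)


if __name__ == "__main__":
    genome_a: LIN = (0, 1, 0, 2, 0, 0)
    genome_b: LIN = (0, 1, 0, 2, 1, 0)
    species_x: LIN = (0, 1, 0, 2)  # hypothetical group "pinned" at the 95% position

    print(shared_prefix_length(genome_a, genome_b))  # 4
    print(min_ani_implied(genome_a, genome_b))       # 95.0
    print(in_group(genome_b, species_x))             # True
```

Under these assumptions, identifying a new genome reduces to a prefix comparison: once a LIN has been assigned, membership of any pinned group can be checked without recomputing genome similarity.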

We will maximise database utility by making improvements in capacity, precision, and functionality to turn it into genomeRxiv:

1. Increase the number of genome sequences from approximately 8,000 to all prokaryotic genomes in NCBI's GenBank and JGI's Integrated Microbial Genomes (IMG) System (almost 500,000) and automatically import new genomes as they are released.
2. Maximise precision of classification and identification by pushing the resolution of LINs towards outbreak-level resolution.
3. Automatically classify bacteria based on validly published named species, genome phylogeny-based species clusters, and genome similarity-based clusters (cliques).
4. Automate diagnostic marker design specific to genomeRxiv classifications.
5. Increase speed of genome identification, and number of simultaneous users.
6. Improve the user interface.
 
Title genomeRxiv: a microbial whole-genome database for classification, identification, and data sharing 
Description This is a poster submission, presented at the Microbiology Society Annual Conference 2021, announcing the genomeRxiv project and describing its overall goals and progress. The poster was made available online, as the conference was held remotely due to COVID.
Type Of Art Artefact (including digital) 
Year Produced 2021 
Impact The poster session prompted multiple informal discussions about microbial classification with other microbiologists and bioinformaticians. This gave me the opportunity to promote the idea, and I hope to convince others, that a classification framework that does not involve binomial nomenclature and arbitrary rankings would be useful for organising microbial genomes and would enable more productive genomic analyses.
URL https://figshare.com/articles/poster/genomeRxiv_a_microbial_whole-genome_database_for_classification...
 
Title genomeRxiv: a microbial whole-genome database for classification, identification, and data sharing 
Description genomeRxiv is a newly-funded US-UK collaboration to provide a public, web-accessible database of public genome sequences, accurately catalogued and classified by whole-genome similarity independent of their taxonomic affiliation. Our goal is to supply the basic and applied research community with rapid, precise and accurate identification of unknown isolates based on genome sequence alone, and with molecular tools for environmental analysis.

The DNA sequencing revolution enabled the use of cultured and uncultured microorganism genomes for fast and precise identification. However, precise identification is impossible without:

1. reference databases that precisely circumscribe classes of microorganisms, and label these with their uniquely-shared characteristics
2. fast algorithms that can handle the volumes of genome data

Our approach integrates the highly-resolved classification framework of Life Identification Numbers (LINs) with the speed and computational efficiency of sourmash and k-mer hashing algorithms, and the precision and filtering of average nucleotide identity (ANI). We aim to construct a single genome-based indexing scheme that extends from phylum to strain, enabling the unique and consistent placement of any sequenced prokaryote genome.

genomeRxiv includes protocols for confidentiality, allowing groups to identify and announce the identities of newly-sequenced organisms without sharing genome data directly. This protects communities working with commercially- and ethically-sensitive organisms (e.g. production engineering strains, potential bioweapons, and benefit sharing with indigenous communities).

genomeRxiv will also provide online capability to design molecular diagnostic tools for metabarcoding and qPCR, to enable tracking of specific groupings of bacteria directly in the environment.
Type Of Art Image 
Year Produced 2021 
URL https://figshare.com/articles/poster/genomeRxiv_a_microbial_whole-genome_database_for_classification...
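
The k-mer hashing idea mentioned in the description above can be sketched in plain Python. This is not sourmash's implementation or API: the k-mer size, the sampling factor, the use of SHA-256, and all function names are assumptions made for illustration, and reverse-complement handling is omitted for brevity. Each genome is reduced to a set of hash values, and genomes are compared by the overlap of those sets.

```python
import hashlib
from typing import Set

K = 5            # hypothetical k-mer size; real tools typically use much larger k
SCALED = 2       # keep roughly 1 in SCALED hashes (illustrative value only)
MAX_HASH = 2**64


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer using the first 8 bytes of its SHA-256 digest."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def sketch(sequence: str, k: int = K, scaled: int = SCALED) -> Set[int]:
    """Return the set of k-mer hashes that fall below MAX_HASH / scaled."""
    cutoff = MAX_HASH // scaled
    seq = sequence.upper()
    hashes = (kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < cutoff}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two sketches; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    genome_1 = "ACGTACGTTAGCCGATAGCTAGGCTA"
    genome_2 = "ACGTACGTTAGCCGATAGCTAGGCTT"  # differs only at the final base
    # scaled=1 keeps every hash, which suits sequences this short.
    s1 = sketch(genome_1, scaled=1)
    s2 = sketch(genome_2, scaled=1)
    print(jaccard(s1, s1))  # 1.0
    print(jaccard(s1, s2))
```

Because only hash sets are compared, a scheme of this general kind is one plausible way to estimate genome similarity without exchanging the sequences themselves. The text above does not specify which confidentiality protocol genomeRxiv actually uses.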
 
Title widdowquinn/pyani: v0.2.11 
Description This release fixes issues due to pandas API changes: (1) exceptions used in pyani are now found in pandas.errors, not pandas.io.common; (2) changes to the testing API (will not affect most users).
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact pyani is widely used internationally for definitive assignment of microbial taxonomy, and has contributed to the improved classification of numerous microbes of importance industrially and as pathogens. Over 150 such publications cited the pyani software in 2021; as pyani has been downloaded over 17,000 times (averaging over 800 downloads a month as of March 2022) and software is not always cited appropriately in literature, we expect the undocumented use to be more extensive than this. 
URL https://zenodo.org/record/5013461
 
Description Podcast appearance and interview 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I appeared as a guest on the MicroBinfie podcast, whose topic is microbial bioinformatics. My interview/appearance was split across two episodes (numbers 67 and 68), broadcast originally in November and December 2021. The programme's focus was on the influence of whole-genome based taxonomy and classification on modern microbiology, and the intended purpose was to inform and update the listening community, which is expected to include microbiologists, bioinformaticians, students (postgrad and undergrad) and any interested parties. In particular, my intent was to promote the nomenclature-free classification we are building in the genomeRxiv project, to raise awareness and promote discussion.

After the episode I was contacted by other researchers interested in discussing the topic. The other guest (Conor Meehan) and I discussed plans for writing and hosting a whole-genome classification training course.
Year(s) Of Engagement Activity 2021
URL https://soundcloud.com/microbinfie/tracks