19-BBSRC-NSF/BIO genomeRxiv: a microbial whole-genome database & diagnostic marker design resource for classification, identification & data sharing
Lead Research Organisation:
University of Strathclyde
Department Name: Inst of Pharmacy and Biomedical Sci
Abstract
Precise identification of microorganisms that impact on society and the environment is a prerequisite for maintaining a healthy society and a healthy environment and for combating diseases, in addition to providing a sound empirical core for understanding microbiology. The DNA sequencing revolution has created the opportunity to use genome sequences of cultured and uncultured microorganisms for fast and precise identification. However, precise identification is impossible without reference databases that precisely circumscribe classes of microorganisms with their unique characteristics, and rapid identification is impossible without fast algorithms that can handle the deluge of genome sequences being sequenced. Therefore, we will enhance our current web server to develop genomeRxiv, which will provide a database of hundreds of thousands of accurately catalogued and classified public genome sequences supplying the basic and applied research community with precise and accurate identification of unknown isolates based on their genome sequences alone.
A unique new feature will be provision of the academic, industrial, and government communities with the ability to identify, and announce, sequenced genomes without having to share sequences themselves, providing confidentiality for commercially- and ethically-sensitive organisms (e.g. production engineering strains, potential bioweapons, and benefit sharing with indigenous communities). genomeRxiv will also enable practical application of its classification scheme by providing the capability to design molecular diagnostic tools to detect specific groupings of bacteria, including high impact microorganisms, directly in the environment.
We are uniquely placed to develop genomeRxiv by leveraging the computational tools and platforms that we have already developed and by integrating them into the new web server. We will combine the highly-resolved classification framework of Life Identification Numbers (PIs Vinatzer and Heath), the speed and computational efficiency of sourmash (PI Brown), and the precision and filtering of pyani (PI Pritchard), with the collaborative crowdsourcing framework of the LINbase web server (PIs Vinatzer and Heath).
Technical Summary
We will extend and enhance the capabilities of LINbase to produce the genomeRxiv web server, providing:
1. Greatly increased capacity and functionality for genome classification and identification.
2. Novel capabilities, e.g. users may instantly and easily obtain and share precise identity of newly sequenced genomes without revealing the genome sequence, even to genomeRxiv, maintaining confidentiality for commercially or otherwise sensitive organisms while retaining findability.
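The idea of identifying a genome without revealing its sequence can be illustrated with k-mer hashing. The sketch below is a minimal bottom-k MinHash in pure Python, written for illustration only: it does not reproduce sourmash's own hashing or file formats, and the function names, the SHA-1 hash, and the defaults for `k` and `size` are all assumptions rather than genomeRxiv's protocol. It shows how two parties could compare compact lists of hashes instead of exchanging sequence data.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Yield strand-independent k-mers, skipping any with ambiguous bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            # The lexicographically smaller of a k-mer and its reverse
            # complement represents both strands.
            yield min(kmer, revcomp(kmer))


def sketch(seq, k=21, size=500):
    """Keep the `size` smallest 64-bit hashes of the canonical k-mers.

    k=21 and size=500 are illustrative defaults, not project settings.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in canonical_kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b):
    """Estimate k-mer Jaccard similarity from two bottom-k sketches."""
    size = max(len(a), len(b))
    if size == 0:
        return 0.0
    union = sorted(set(a) | set(b))[:size]
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)
```

Under this assumption, only the output of `sketch(...)` would need to leave the user's machine; a server holding reference sketches could then rank candidate matches with `jaccard_estimate` without ever receiving the underlying sequence.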
LINbase circumscribes groups of organisms by assigning Life Identification Numbers (LINs) to genome sequences in the database. LINs express genome similarity based on average nucleotide identity (ANI), providing a neutral genome similarity framework (conceptually similar to GPS coordinates) independent of taxonomic rank, to which users can "pin" circumscriptions of any named species or any other monophyletic genome-similarity group (from now on simply referred to as "group") below the rank of genus. These permit precise identification by placing newly-sequenced genomes within them.
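As an illustration of how a LIN can encode ANI-based similarity, the following minimal sketch assigns a LIN to a new genome from its closest match in the database. The threshold values, function names, and the handling of a match above the highest threshold are assumptions made for this example and are not LINbase's actual parameters or code. Each LIN position corresponds to one ANI threshold; the new genome keeps the positions for every threshold its ANI meets, takes the next unused number at the first position it fails, and zeros thereafter.

```python
# Hypothetical ANI thresholds (percent), one per LIN position, coarse to fine.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0, 99.9]


def shared_positions(ani, thresholds=THRESHOLDS):
    """Count the leading LIN positions shared with the closest genome."""
    count = 0
    for threshold in thresholds:
        if ani < threshold:
            break
        count += 1
    return count


def assign_lin(closest_lin, ani, existing_lins, thresholds=THRESHOLDS):
    """Derive a LIN for a new genome from its closest match and their ANI."""
    n = shared_positions(ani, thresholds)
    if n == len(thresholds):
        # Assumption: a genome matching at every threshold shares the LIN.
        return tuple(closest_lin)
    prefix = tuple(closest_lin[:n])
    # Numbers already used at position n by genomes sharing this prefix.
    used = [lin[n] for lin in existing_lins if tuple(lin[:n]) == prefix]
    next_value = max(used, default=-1) + 1
    return prefix + (next_value,) + (0,) * (len(thresholds) - n - 1)


def belongs_to(lin, group_prefix):
    """Test whether a LIN falls inside a group pinned to a LIN prefix."""
    return tuple(lin[:len(group_prefix)]) == tuple(group_prefix)
```

In this sketch a group "pinned" to the framework is simply a LIN prefix, so placing a newly-sequenced genome in a group reduces to a prefix comparison with `belongs_to`.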
We will maximise database utility by making improvements in capacity, precision, and functionality to turn it into genomeRxiv:
1. Increase the number of genome sequences from approximately 8,000 to all prokaryotic genomes in NCBI's GenBank and JGI's Integrated Microbial Genomes (IMG) System (almost 500,000) and automatically import new genomes as they are released.
2. Maximise precision of classification and identification by pushing the resolution of LINs towards outbreak-level resolution.
3. Automatically classify bacteria based on validly published named species, genome phylogeny-based species clusters, and genome similarity-based clusters (cliques).
4. Automate the design of diagnostic markers specific to genomeRxiv classifications.
5. Increase speed of genome identification, and number of simultaneous users.
6. Improve the user interface.
Publications
Pritchard L
(2022)
Could a Focus on the "Why" of Taxonomy Help Taxonomy Better Respond to the Needs of Science and Society?
in Frontiers in microbiology
Title | genomeRxiv: a microbial whole-genome database for classification, identification, and data sharing |
Description | genomeRxiv is a newly-funded US-UK collaboration to provide a public, web-accessible database of public genome sequences, accurately catalogued and classified by whole-genome similarity independent of their taxonomic affiliation. Our goal is to supply the basic and applied research community with rapid, precise and accurate identification of unknown isolates based on genome sequence alone, and with molecular tools for environmental analysis. The DNA sequencing revolution enabled the use of cultured and uncultured microorganism genomes for fast and precise identification. However, precise identification is impossible without (1) reference databases that precisely circumscribe classes of microorganisms, and label these with their uniquely-shared characteristics, and (2) fast algorithms that can handle the volumes of genome data. Our approach integrates the highly-resolved classification framework of Life Identification Numbers (LINs) with the speed and computational efficiency of sourmash and k-mer hashing algorithms, and the precision and filtering of average nucleotide identity (ANI). We aim to construct a single genome-based indexing scheme that extends from phylum to strain, enabling the unique and consistent placement of any sequenced prokaryote genome. genomeRxiv includes protocols for confidentiality, allowing groups to identify and announce the identities of newly-sequenced organisms without sharing genome data directly. This protects communities working with commercially- and ethically-sensitive organisms (e.g. production engineering strains, potential bioweapons, and benefit sharing with indigenous communities). genomeRxiv will also provide online capability to design molecular diagnostic tools for metabarcoding and qPCR, to enable tracking of specific groupings of bacteria directly in the environment. |
Type Of Art | Image |
Year Produced | 2021 |
URL | https://figshare.com/articles/poster/genomeRxiv_a_microbial_whole-genome_database_for_classification... |
Description | PhytoBacExplorer |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Participation in a guidance/advisory committee |
Description | What's in a name? Fit-for-purpose bacterial nomenclature |
Organisation | Microbiology Society |
Country | United Kingdom |
Sector | Learned Society |
PI Contribution | I was co-organiser of an international focused meeting on bacterial nomenclature, classification, and phylogenomics, titled "What's in a name? Fit-for-purpose bacterial nomenclature", held in Glasgow in September 2023. This brought together representatives from the International Committee on Systematics of Prokaryotes, Microbiology Society, SeqCode, American Society for Microbiology, The National Collection of Type Cultures, and industry stakeholders (design and manufacture of clinical diagnostic equipment). I presented an account of genomeRxiv's motivation, capabilities, and implementation, and participated actively in the discussion as we attempted to plan a route for resolving nomenclatural issues for prokaryotes. AK, a PhD student now employed as PDRA on this project, presented their work on classification of Streptomyces using tools generated by this project. |
Collaborator Contribution | The Microbiology Society funded the meeting, including travel, accommodation, and a meal for the participants. |
Impact | Outputs are in preparation, including a publication report and a whitepaper with recommendations on how to assign prokaryotic nomenclature and how to resolve conflicts in cases such as diagnostic equipment in a clinical setting. |
Start Year | 2023 |
Title | widdowquinn/pyani: v0.2.11 |
Description | This release fixes issues due to pandas API changes: (1) exceptions used in pyani are now found in pandas.errors, not pandas.io.common; (2) changes to the testing API (will not affect most users). |
Type Of Technology | Software |
Year Produced | 2021 |
Open Source License? | Yes |
Impact | pyani is widely used internationally for definitive assignment of microbial taxonomy, and has contributed to the improved classification of numerous microbes of importance industrially and as pathogens. Over 150 such publications cited the pyani software in 2021. Because pyani has been downloaded over 17,000 times (averaging over 800 downloads a month as of March 2022) and software is not always cited appropriately in the literature, we expect undocumented use to be more extensive than this. |
URL | https://zenodo.org/record/5013461 |
Description | Podcast appearance and interview |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | I appeared as a guest on the MicroBinfie podcast, whose topic is microbial bioinformatics. My interview was split across two episodes (numbers 67 and 68), originally broadcast in November and December 2021. The programme's focus was on the influence of whole-genome based taxonomy and classification on modern microbiology, and the intended purpose was to inform and update the listening community, which is expected to include microbiologists, bioinformaticians, students (postgrad and undergrad) and any interested parties. In particular, my intent was to promote the nomenclature-free classification we are building in the genomeRxiv project, to raise awareness and promote discussion. After the episodes, other researchers contacted me to discuss the topic. The other guest (Conor Meehan) and I discussed plans for writing and hosting a whole-genome classification training course. |
Year(s) Of Engagement Activity | 2021 |
URL | https://soundcloud.com/microbinfie/tracks |