EnteroBase: A Powerful, User-Friendly Online Resource for Analyzing and Visualizing Genomic Variation within Escherichia coli and Salmonella enterica

Lead Research Organisation: University of Warwick
Department Name: Warwick Medical School

Abstract

It is hard to think of two organisms that are more important to scientists, policy makers and the public than E. coli and S. enterica. Both have been studied extensively in the laboratory as models of how bacterial cells function, behave and evolve. However, both are also important causes of human and animal INFECTION and are seldom out of the news, particularly given their propensity to cause outbreaks. The E. coli outbreak that hit Germany in 2011, with >4,000 cases and >50 deaths, amply illustrates the power of these organisms to devastate even a wealthy advanced society. In 2013, Salmonella gained media coverage in England when >200 people fell ill after a spice festival in Newcastle.

It is important to recognise that no single strain can capture the essence of either species. Instead, what we see in nature is a riotous profusion of diversity. For example, some strains of E. coli live harmlessly in our bowels, while others cause diarrhoea, urinary tract infection or even bloodstream infection. Two E. coli strains may differ by 1/3 of their genetic make-up (genome). Both Salmonella and E. coli undergo relentless evolution, including spread of ANTIBIOTIC RESISTANCE. The huge diversity already present, twinned with ongoing evolution and spread of new lineages, creates tremendous problems for microbiologists, other scientists and policy makers in recognising and classifying strain types. Yet such classification into well-defined, scientifically robust populations is essential before scientific, clinical or even political conclusions can be generalised across sub-types or species.

Fortunately, we have been presented with an exciting new opportunity to capture and analyse within-species diversity in bacteria in the form of HIGH-THROUGHPUT SEQUENCING, a set of innovative technologies that make bacterial genome sequencing (a process of capturing all the DNA sequences within the cell) easier, cheaper and quicker than ever before. However, this sudden availability of new data creates a fresh challenge, the DRINKING-FROM-A-FIRE-HOSE problem: how to store, visualise and analyse all the new data on genomic diversity generated by this technology. In addition, while expert bioinformaticians can use command-line tools to analyse genomes, lab-based bacteriologists depend on the creation of new user-friendly web-based resources if they are not to miss out on this opportunity.

To address this problem we will create a new, powerful but user-friendly online database called ENTEROBASE, which will act as a one-stop shop for anyone interested in analysing and visualising genetic diversity in E. coli and Salmonella. EnteroBase will incorporate ENTEROTOOLS, a set of modular, open-source, web-based tools compatible with data formats and standards from both current and future sequencing technologies. Together, these two resources will allow bacteriologists who work in the laboratory and lack high-level computer skills to perform incisive and sophisticated computer-based analyses of bacterial DNA sequence data. Users will be able to upload and analyse their own data, as well as exploit the cumulative knowledge of the microbiology community, not just to look at global patterns of diversity within these species but also to perform speedy, near-real-time analyses of ongoing or recent outbreaks.

Principal investigator Achtman has spearheaded efforts to replace outdated 19th- and 20th-century approaches to the typing and classification of these bacteria with more modern approaches; co-investigator Pallen has applied innovative approaches to analyse the German E. coli outbreak. Both will bring to this project 1000s of users of previous similar, well-established but less powerful databases. This project will also help maintain and enhance the UK skills base and make our country the destination of choice for the brightest and best scientists.

Technical Summary

EnteroBase will present a scalable, structured, curated database containing data from 100,000s of genomes and their temporal and geographic metadata from ourselves, our users and public databases. It will support analyses ranging from 7-gene multi-locus sequence typing (MLST) to whole genomes. EnteroBase databases will only include high-quality sequences from E. coli and S. enterica, but EnteroTools will also support analyses of genomic data from other bacterial groups.
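
As a rough illustration of what 7-gene MLST involves, the sketch below maps a set of allele numbers at seven housekeeping loci to a sequence type (ST) by table lookup. It is a minimal sketch, not EnteroBase code: the function name `call_sequence_type` is hypothetical, the locus names follow the widely used seven-locus E. coli scheme, and the allele and ST numbers in the table are invented for illustration only.

```python
from typing import Dict, Optional, Tuple

# Seven housekeeping loci. Any seven-locus scheme works the same way.
LOCI: Tuple[str, ...] = ("adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA")

Profile = Tuple[int, ...]

# Invented profile table: these allele and ST numbers are NOT real assignments.
PROFILE_TABLE: Dict[Profile, int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
}


def call_sequence_type(
    alleles: Dict[str, int],
    table: Dict[Profile, int] = PROFILE_TABLE,
) -> Optional[int]:
    """Return the ST for an allelic profile, or None if the profile is unknown."""
    # Order the allele numbers by the fixed locus order before the lookup.
    profile = tuple(alleles[locus] for locus in LOCI)
    return table.get(profile)


if __name__ == "__main__":
    isolate = dict(zip(LOCI, (1, 2, 1, 1, 3, 1, 1)))
    print(call_sequence_type(isolate))
```

Whole-genome schemes extend the same idea from seven loci to thousands, which is why the profile is kept as an ordered tuple rather than a fixed-size record.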

The public interface to EnteroBase will be a customised instance of Galaxy, which is a powerful, but flexible, web-based sequence analysis and workflow management system. Initially, we will adopt Galaxy's existing graphical user interface and existing tools in order to port basic components from our xBASE and MLST facilities.

Subsequently, we will enhance EnteroBase's capabilities with EnteroTools, a set of open-source, user-friendly Galaxy tools compatible with both current and future data formats. We will incorporate other resources, such as MEGA and BIGSdb, include links to specialised external databases for identifying repetitive and mobile elements, and encourage crowd-sourcing of novel solutions by letting users publish their workflows. EnteroTools will allow users to:

->upload and analyse sequence reads, assemble and annotate genomes and align whole genomes or genes.

->visualise relationships between bacterial genotypes; drill down to genotype clusters; perform population genetics and real-time epidemiological analyses.

->evaluate and visualise the contributions of SNPs, indels, transpositions, recombination and selection, as well as details of changes in the core and accessory genomes.

->access processed data easily in the context of associated metadata, including bidirectional links between metadata in the genomic and MLST databases, thus providing a facility for scanning the metadata from genetically related isolates that share MLST or rMLST alleles.

Planned Impact

The proposed project will benefit anyone in the UK or overseas academic sector with an interest in E. COLI OR SALMONELLA AS PATHOGENS OR MODEL ORGANISMS (including those interested in systems biology or synthetic biology), or with interests in bacterial genome evolution or population genetics or epidemiology. More generally, the resource we create here will be of interest to ANYONE INTERESTED IN EXPLOITING COMPARATIVE SEQUENCE DATA from any bacterial species.

We anticipate bringing across 1000s of users of our existing MLST and xBASE facilities to this new resource.

The proposed project will benefit anyone within the commercial private sector who is interested in developing NEW DRUGS, VACCINES OR DIAGNOSTIC TESTS for E. coli or Salmonella. Industrial users could benefit from using EnteroBase to explore genotypic--and by implication phenotypic--diversity within these species when evaluating novel vaccine or drug targets. EnteroBase will allow users to explore how ANTIMICROBIAL RESISTANCE EVOLVES AND SPREADS within these species. Similarly, companies that sell sequencing technologies stand to benefit from exploitation of and demand for high-throughput sequence data (both Solexa and Oxford nanopore sequencing were developed within the UK, with benefits to our economy).

The delineation of epidemic or highly pathogenic lineages is of KEY INTEREST TO POLICY MAKERS, whether addressing FOOD SECURITY, FOOD SAFETY, HUMAN HEALTHCARE, HEALTH AND SAFETY AT WORK OR BIOTERRORISM (note that certain E. coli and Salmonella lineages are even defined within the UK's Anti-terrorism, Crime and Security Act 2001).

EnteroBase will also assist in increasing the effectiveness of public services and policy by facilitating analyses that will GROUND POLICY DECISIONS IN A SOLID UNDERSTANDING of bacterial evolution, epidemiology, population genetics and taxonomy. The UK food industry needs detailed knowledge about the diversity and sources of Salmonella infection, such as the Agona outbreak that spread to the UK via products from an Irish food producer. Achtman has been at the forefront of efforts to replace classification of these bacteria by serovar with a MORE RATIONAL AND DISCRIMINATORY SYSTEM OF CLASSIFICATION; these efforts are likely to lead to changes in international regulations governing Salmonella and E. coli infections in animals impacting on the human food chain. Our analyses already influence the policies of organisations such as the ECDC (Stockholm), which coordinates European efforts to stop outbreaks of salmonellosis and listeriosis.

EnteroBase will help microbiologists, bioinformaticians, epidemiologists and population geneticists to integrate bacterial genomics with epidemiological disease patterns and to elucidate genetic relationships between S. enterica and E. coli from domestic animals and human patients, with IMPACTS ON DISEASE PREVENTION, MANAGEMENT OF INFECTION AND QUALITY OF LIFE. Obvious beneficiaries within the public sector include those employed in the HEALTH SERVICES, including the NHS and Public Health England, who will gain an improved understanding of the links between population biology, taxonomy and diagnosis/prognosis for these species.

The proposed resource will also enhance the UK'S REPUTATION AS A CENTRE OF EXCELLENCE, attracting highly skilled students, academics and collaborators from foreign countries. The research and professional skills in bioinformatics gained by staff working on the project will help ADDRESS THE NATIONAL SKILLS SHORTAGE in this area; similarly, the training provided more widely as part of the project will help improve bioinformatics and genomics skills among UK bacteriologists.

Publications

 
Description The development of EnteroBase has been funded since 1 August 2014, and the EnteroBase website was opened for use by the general public in December 2015. EnteroBase now offers databases of genomic sequence assemblies and genotyping assignments for the bacterial genera Salmonella, Escherichia, Yersinia, Moraxella, Clostridioides, Helicobacter, Streptococcus and Vibrio. Of these, the largest are Salmonella (>280,000 genomes) and Escherichia (>155,000). EnteroBase also includes legacy data from sequencing of a few housekeeping genes (7-gene MLST) for these genera, and the original databases serving those data have now been subsumed into EnteroBase.
Exploitation Route EnteroBase now has >3,800 users from around the globe, many of whom are associated with national reference laboratories. EnteroBase has been tested by EFSA and ECDC against competing websites for food-safety functionality, and the whole genome MLST scheme implemented by EnteroBase for Salmonella and Escherichia has been cited as the preferred reference method by PulseNet International and BioNumerics. The University of Warwick has committed to long-term bioinformatic support for EnteroBase, independent of external soft funding. Similarly, national laboratories in France (Institut Pasteur) and Germany (DSMZ) have committed to sub-branches of EnteroBase which focus on Salmonella and Clostridioides difficile, respectively.
Sectors Agriculture, Food and Drink; Healthcare

URL http://enterobase.warwick.ac.uk
 
Description EnteroBase now provides hierarchical clustering of cgMLST STs. This is being used as an epidemiological tool for outbreaks of gastrointestinal disease caused by Escherichia coli and Salmonella enterica. We know of extensive use of this facility by the national microbiological reference laboratory for Scotland and by the Institut Pasteur in Paris, France, and it is beginning to be used for this purpose by PHE, Colindale. We have also been advised by one of the regional laboratories of the FDA in Florida that they wish to use it as well. These changes are so dramatic that some of these laboratories are now dependent on the continuing existence of EnteroBase for their epidemiological investigations.
First Year Of Impact 2017
Sector Agriculture, Food and Drink
Impact Types Policy & public services
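
The record above does not say how the hierarchical clustering of cgMLST STs is computed. As a hedged illustration only, the sketch below groups STs by single-linkage clustering on the number of differing alleles; running it at increasing thresholds yields nested (hierarchical) clusters. The choice of single linkage, the treatment of allele 0 as missing data, and the names `allelic_distance` and `single_linkage` are assumptions made for this example, and the profiles are invented.

```python
from itertools import combinations
from typing import Dict, Sequence


def allelic_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count loci where both alleles are called (non-zero) and differ."""
    return sum(1 for x, y in zip(a, b) if x and y and x != y)


def single_linkage(profiles: Dict[str, Sequence[int]], threshold: int) -> Dict[str, str]:
    """Label each ST with the smallest ST name in its single-linkage cluster."""
    parent = {name: name for name in profiles}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the smaller name as root so cluster labels are stable.
                if rb < ra:
                    ra, rb = rb, ra
                parent[rb] = ra
    return {name: find(name) for name in profiles}


if __name__ == "__main__":
    # Invented five-locus profiles; real cgMLST schemes use thousands of loci.
    profiles = {
        "ST1": (1, 1, 1, 1, 1),
        "ST2": (1, 1, 1, 1, 2),
        "ST3": (1, 1, 1, 2, 2),
        "ST4": (5, 5, 5, 5, 5),
    }
    for threshold in (0, 1, 5):
        print(threshold, single_linkage(profiles, threshold))
```

Comparing all pairs is quadratic in the number of STs, so a database of this size would need indexing or incremental assignment rather than this direct approach.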

 
Description Investigator Award
Amount £2,200,000 (GBP)
Funding ID 202792/Z/16/Z 
Organisation Wellcome Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 09/2016 
End 08/2021
 
Title GrapeTree 
Description A software GUI, operating in standalone mode or in combination with a database, that can display a minimum spanning tree of the relationships among up to 100,000 genomes. GrapeTree has been included within EnteroBase.
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact GrapeTree has now been added to the BIGSdb environment and is also available at PubMLST.
URL https://bitbucket.org/enterobase/enterobase-web/wiki/GrapeTree
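
For readers unfamiliar with minimum spanning trees of genotypes, the following is a minimal sketch using the classic Prim's algorithm over pairwise allelic (Hamming) distances. It is not GrapeTree's own implementation, which this record does not describe; the names `hamming` and `minimum_spanning_tree` and the example profiles are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(
    profiles: Dict[str, Sequence[int]],
) -> List[Tuple[str, str, int]]:
    """Return MST edges as (tree_node, new_node, distance) using Prim's algorithm."""
    names = list(profiles)
    if not names:
        return []
    root = names[0]
    # best[n] = (distance, tree node) for the cheapest known link from n to the tree.
    best = {n: (hamming(profiles[n], profiles[root]), root) for n in names[1:]}
    edges: List[Tuple[str, str, int]] = []
    while best:
        # Pick the closest node outside the tree; ties are broken by name.
        nxt = min(best, key=lambda n: (best[n][0], n))
        dist, src = best.pop(nxt)
        edges.append((src, nxt, dist))
        # The newly added node may offer cheaper links to the remaining nodes.
        for n in best:
            d = hamming(profiles[n], profiles[nxt])
            if d < best[n][0]:
                best[n] = (d, nxt)
    return edges


if __name__ == "__main__":
    profiles = {"A": (1, 1, 1), "B": (1, 1, 2), "C": (1, 2, 2), "D": (3, 3, 3)}
    for src, dst, dist in minimum_spanning_tree(profiles):
        print(src, dst, dist)
```

This version computes distances on demand and runs in quadratic time, which is adequate for small examples but would need optimisation for trees on the scale mentioned above.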
 
Title EnteroBase 
Description EnteroBase assembles 10,000s of genomes from public short read archives as well as from short reads uploaded by its users, and associates them with metadata on the bacterial strain. It provides a user-friendly web browser interface for examining data as well as a computer-friendly API for high-throughput data access and uploading. EnteroBase contains assemblies from all publicly available short read archives for Escherichia and Shigella, Salmonella, Yersinia and Moraxella catarrhalis. Genotypes are called automatically from the genomes for MLST, rMLST and CRISPR. EnteroBase uses state-of-the-art methods for assembling genomes and calling genotypes, thus providing a unique service for non-IT specialists.
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact An initial overview of the genetic diversity of four species causing enteric diseases in humans and animals 
URL http://enterobase.warwick.ac.uk
 
Title GrapeTree 
Description A software GUI, operating in standalone mode or in combination with a database, that can display a minimum spanning tree of the relationships among up to 100,000 genomes. GrapeTree has been included within EnteroBase.
Type Of Material Data analysis technique 
Year Produced 2017 
Provided To Others? Yes  
Impact GrapeTree has now been added to the BIGSdb environment and is also available at PubMLST.
URL https://bitbucket.org/enterobase/enterobase-web/wiki/GrapeTree
 
Description DSMZ, Braunschweig, Germany 
Organisation Leibniz Association
Department Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures
Country Germany 
Sector Public 
PI Contribution Legal agreement to share IP with DSMZ and transfer control to them when funding expires
Collaborator Contribution DSMZ will help to develop EnteroBase and will take over financial and administrative responsibility once University of Warwick releases control due to lack of funding
Impact None yet
Start Year 2017
 
Description Univ of Oxford 
Organisation University of Oxford
Country United Kingdom 
Sector Academic/University 
PI Contribution Plan to link rMLST of Salmonella enterica and Escherichia coli with BIGSdb
Collaborator Contribution Plan to provide direct link to BIGSdb allele server for automated uploads and downloads of rMLST alleles for EnteroBase
Impact None yet
Start Year 2014
 
Title MGplacer 
Description Assigns metagenomic data onto a phylogeny based on 1000s of isolates.
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None 
URL https://sourceforge.net/projects/mgplacer/