Analytical methodology to perform genome-wide association studies in bacteria

Lead Research Organisation: Imperial College London
Department Name: School of Public Health

Abstract

Bacteria are extremely diverse organisms: some live freely in the oceans, some live in the soil, some infect animals, some colonize the human gut, where we could not live without them, whereas others cause serious diseases. In that last category, we find the causes of some of the most deadly threats that mankind has ever faced, such as the Black Death epidemic that killed half of the European population in the Middle Ages, or the global tuberculosis pandemic that currently affects a third of the worldwide population, killing over a million people every year. Significant diversity is often observed within a given bacterial species and not just between species. For example, E. coli is well known as a source of serious food-associated disease outbreaks, like the one that occurred in Germany in 2011. However, E. coli is also a normal, inoffensive inhabitant of our guts, and can even benefit us by producing a vitamin or by protecting us, for example, from Salmonella infections. Another example is S. aureus, which can cause hospital-acquired infections, but is also carried asymptomatically by a third of healthy humans.

Understanding the cause of this diversity in virulence is important, but there are also many other properties of interest. For example, resistance or sensitivity to a given antibiotic can make the difference between a treatment being successful or not. Another important property is host specificity, which for species like Salmonella or Campylobacter determines the likelihood of infecting a cow, pig or bird and therefore of ending up in the human food chain. There are indeed many such properties that vary within each bacterial species, in the same way that we see many important differences from one person to another in the human population. In bacteria, many of these properties are of major biomedical importance, and yet their genetic basis is not always fully understood.

Bacterial genomes are relatively short compared to those of other organisms, being made of one to ten million letters. The first bacterial genome to be fully sequenced was H. influenzae in 1995, shortly followed by E. coli in 1997. Sequencing a single representative genome from a species is, however, of little use for studying how a property varies within the species. In the past few years, sequencing technology has made huge progress in cost, speed and accuracy, to the point that very large numbers of genomes can now be sequenced. This genomic revolution enables a new strategy to study the genetic basis of properties, and this is the strategy we are proposing to implement. The basic idea is simply to take some bacteria that exhibit the property, and some that do not, and scan their genomes to find the elements that are present in the former and not in the latter.
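The basic scan described above can be sketched in a few lines. This is only a minimal illustration of the naive strategy, not the method proposed here: it assumes that each genome has already been reduced to presence/absence calls (1 or 0) for a set of genetic elements, and it scores each element with a two-sided Fisher exact test. The function names, the element names geneA and geneB, and the toy data are all hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: genomes with / without the property.
    Columns: element present / absent.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def naive_scan(presence, phenotype):
    """Return {element: p-value}; presence maps element -> list of 0/1 calls per genome."""
    results = {}
    for element, calls in presence.items():
        a = sum(1 for g, y in zip(calls, phenotype) if g and y)
        b = sum(1 for g, y in zip(calls, phenotype) if not g and y)
        c = sum(1 for g, y in zip(calls, phenotype) if g and not y)
        d = sum(1 for g, y in zip(calls, phenotype) if not g and not y)
        results[element] = fisher_exact_two_sided(a, b, c, d)
    return results


# Toy data (hypothetical): 4 genomes with the property, 4 without.
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
presence = {
    "geneA": [1, 1, 1, 1, 0, 0, 0, 0],  # found only in genomes with the property
    "geneB": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the property
}
p_values = naive_scan(presence, phenotype)
for element, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(f"{element}: p = {p:.4f}")
```

In this toy example geneA receives a much smaller p-value than geneB. A scan of this kind treats every genome as an independent observation, which is one reason why a naive implementation can give wrong results.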

If this simple strategy were implemented naively, there are several reasons why it could lead to wrong results. We have identified these pitfalls and suggest ways to avoid them. To ensure that the method we will develop reliably gives correct results, simulations will be used in which an artificial dataset is generated where we know what is causing the property, and we apply our method as if we did not know it. This will allow us to determine exactly when the method works and when it does not, and to improve it accordingly.
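The simulation idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration: genomes are simulated as independent presence/absence vectors (so there is no population structure or recombination, unlike real bacterial data), one element is designated as causal, and the score is a simple difference in carriage frequency standing in for whatever association method is being evaluated. The function and parameter names are hypothetical.

```python
import random


def simulate_dataset(n_genomes, n_elements, causal, effect, rng):
    """Simulate presence/absence genomes and a binary property driven by one causal element.

    effect = 0 means the causal element has no influence; effect = 1 means the
    property is fully determined by the presence of the causal element.
    """
    genomes = [[rng.randint(0, 1) for _ in range(n_elements)] for _ in range(n_genomes)]
    phenotype = []
    for genome in genomes:
        p_trait = 0.5 + effect / 2 if genome[causal] else 0.5 - effect / 2
        phenotype.append(1 if rng.random() < p_trait else 0)
    return genomes, phenotype


def association_scores(genomes, phenotype):
    """Score each element by the absolute difference in carriage frequency
    between genomes with and without the property (a stand-in for a real test)."""
    n_elements = len(genomes[0])
    cases = [g for g, y in zip(genomes, phenotype) if y]
    controls = [g for g, y in zip(genomes, phenotype) if not y]
    if not cases or not controls:
        return [0.0] * n_elements
    scores = []
    for j in range(n_elements):
        freq_cases = sum(g[j] for g in cases) / len(cases)
        freq_controls = sum(g[j] for g in controls) / len(controls)
        scores.append(abs(freq_cases - freq_controls))
    return scores


def recovery_rate(effect, replicates=200, n_genomes=100, n_elements=20, causal=3, seed=1):
    """Fraction of simulated datasets in which the known causal element scores highest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        genomes, phenotype = simulate_dataset(n_genomes, n_elements, causal, effect, rng)
        scores = association_scores(genomes, phenotype)
        if scores.index(max(scores)) == causal:
            hits += 1
    return hits / replicates


for effect in (0.0, 0.2, 0.5, 1.0):
    print(f"effect = {effect}: causal element recovered in {recovery_rate(effect):.0%} of replicates")
```

Because the true cause is known in every simulated dataset, the recovery rate shows directly under which conditions the scoring works and under which it fails. A realistic benchmark would also need to simulate the shared ancestry of the genomes.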

The method will be applied to two large genomic datasets in Campylobacter and E. coli, two important human pathogens for which many questions are still open. Most importantly, the method will be released freely online as a user-friendly software package called BassoMapper. This will enable other scientists to apply the method to other genomic datasets, to help them discover the genetic basis for many other properties in many organisms.

Technical Summary

Genome-wide association studies (GWAS) aim at discovering the genetic basis for a phenotypic trait of interest. GWAS have been hugely popular and successful in human genetics, but not yet so in bacterial genetics. This is because the GWAS methods used by human geneticists are not directly applicable to bacteria, due to important differences in recombination, population structure and genome plasticity.
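As a hedged illustration of why population structure matters, the sketch below uses hypothetical toy data with two lineages, A and B. The element called marker is carried by every genome of lineage A and by none of lineage B, and the phenotype happens to be more common in lineage A. A pooled comparison makes the marker look strongly associated with the phenotype, whereas a comparison made within each lineage shows no association at all. The stratified comparison is a generic device chosen for illustration and is not the methodology proposed here; all names are assumptions.

```python
def carriage_difference(calls, phenotype):
    """Carriage frequency among genomes with the trait minus that among genomes without it."""
    cases = [g for g, y in zip(calls, phenotype) if y]
    controls = [g for g, y in zip(calls, phenotype) if not y]
    if not cases or not controls:
        return 0.0
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def stratified_difference(calls, phenotype, lineage):
    """Average of within-lineage carriage differences, weighted by lineage size."""
    n = len(calls)
    total = 0.0
    for name in sorted(set(lineage)):
        idx = [i for i, label in enumerate(lineage) if label == name]
        diff = carriage_difference([calls[i] for i in idx], [phenotype[i] for i in idx])
        total += diff * len(idx) / n
    return total


# Hypothetical toy data: six genomes from lineage A, six from lineage B.
lineage = ["A"] * 6 + ["B"] * 6
phenotype = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # trait is common in A, rare in B
marker = [1] * 6 + [0] * 6  # lineage marker with no causal role

pooled = carriage_difference(marker, phenotype)
within = stratified_difference(marker, phenotype, lineage)
print(f"pooled difference: {pooled:.3f}")
print(f"within-lineage difference: {within:.3f}")
```

In clonal bacterial populations, many elements are inherited together along the same lineage, so pooled comparisons of this kind can flag large numbers of non-causal elements.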

We propose to develop novel methodology specifically designed to perform GWAS in bacteria, which will account for these differences. The new method will compare a sample of genomes to determine the genetic elements and events that may be causing a phenotype of interest. The method will be applied to simulated datasets, in which the presence or absence of causative links between genotypes and phenotypes is known exactly. These simulations will be used to guide the development of the method by providing a benchmark for the optimisation of the results. They will also reveal the strengths and weaknesses of the final method, under a variety of conditions.

The methodology will be applied to two state-of-the-art datasets in Campylobacter jejuni and Escherichia coli, each of which consists of about a thousand genomes. These two applications should reveal interesting new mechanisms, for example about the adaptation of Campylobacter to different hosts or about the differences between environmental and clinical E. coli. These applications will also ensure that the method is generally applicable to the large genomic datasets that are increasingly becoming available. The methodology will be made available via the internet as a free and open-source software package called BassoMapper. BassoMapper will be developed so that it can be used without bioinformatics or statistical expertise. This will guarantee that the new methodology is available to microbiologists, who can apply it to a wide variety of systems.

Planned Impact

The project will generate impact for a wide range of academic and non-academic beneficiaries. In particular:

(1) Academic researchers who work with bacterial genomic data. This includes researchers with a wide variety of backgrounds, ranging from statistics to bioinformatics and microbiology. These researchers typically have specific questions they want to investigate about a given organism, and have collected large amounts of genomic data to answer them. They will directly benefit from the implementation of our methodology in the software package BassoMapper, since they will be able to apply it to their data.

(2) Non-academic microbiologists and health professionals. As the new method is applied to a variety of systems, it will reveal new insights into many important microbiological processes. For example, our application to C. jejuni will help to understand the source of campylobacteriosis and therefore provide evidence for which public health measures are likely to limit its incidence. Our application to E. coli should likewise provide a better understanding of the pathogenicity and evolution of this pathogen, which could help prevent future outbreaks. Most of the impact will, however, come from applications of the methodology by other academic researchers rather than ourselves. For example, our proposed methodology should be directly applicable to investigating the genetic basis of antimicrobial resistance in various bacterial species. Such applications will reveal how antimicrobial resistance arises and spreads, which will have an impact on antimicrobial stewardship measures.

(3) The general public. Application of the proposed methodology will create a better understanding of the evolution, ecology, epidemiology and pathogenicity of many bacterial pathogens. This will allow more effective measures to be taken to limit their burden on public health and will therefore benefit the general public.

Individuals in category (1) will be directly impacted by the proposed research, even before the research programme is completed, since they will be able to download and apply BassoMapper while it is still in development. Members of categories (2) and (3) will be less directly and less immediately impacted by the proposed research than members of category (1). This future impact on categories (2) and (3) is hard to predict and measure, but could nevertheless be very far reaching. For example, the applicant developed the software ClonalFrame to infer relationships between bacterial isolates in a way that accounts for recombination, and this software had its most direct impact on the academic researchers who applied it to their molecular datasets. Over the past five years, there have been over 300 published studies that have applied ClonalFrame to a wide range of bacterial species. These studies have had a significant impact on our understanding of the population biology of these species, many of which are important human pathogens.

Publications

Description We have developed a new methodology for performing microbial genome-wide association studies. The new methods are implemented in a software package called TreeWAS, which is already available online at https://github.com/caitiecollins/treeWAS
Exploitation Route not applicable this year
Sectors Agriculture, Food and Drink, Healthcare

 
Title ClonalFrameML 
Description Inference of bacterial phylogeny in the presence of recombination. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Cited 10 times in first year. 
URL https://github.com/xavierdidelot/ClonalFrameML
 
Title TreeWAS 
Description Microbial Genome Wide Association Mapping 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact NA 
URL https://github.com/caitiecollins/treeWAS