The evolutionary characterisation of bacterial diversity from DNA sequence data

Lead Research Organisation: Imperial College London
Department Name: Life Sciences

Abstract

DNA sequence data are being increasingly used to characterise biodiversity, not least in groups of organisms in which traditional taxonomic approaches have proved of limited use. One group that is particularly dependent on DNA approaches, and particularly challenging, is the bacteria. Only a tiny fraction of bacteria are culturable and the true species richness of bacteria, as defined by current methodology, could number in the billions. However, although a wealth of sequence data for bacteria is becoming available, there remain major theoretical challenges to characterising the diversity of bacteria. First and foremost, bacteria have proved difficult to accommodate within traditional species definitions developed for plants and animals, because of important differences in their mode of inheritance. Bacteria are clonal (they reproduce by simple division of cells), yet they can exchange DNA by a variety of mechanisms, some of which occur most often between closer relatives whereas others occur between distantly related strains. However, the same basic processes cause diversification in bacteria as in plants and animals: the question is to what extent these processes act together to produce units equivalent to species, rather than acting separately on different genes to produce a more complex pattern of diversity. On top of this problem, methods for identifying evolutionarily and biologically meaningful units of diversity from DNA data are in their infancy: most studies use crude thresholds of DNA divergence to delimit species, or graphical approaches to delimit species by eye, rather than statistical models to test for the action of different processes known to be important for causing diversity to evolve. This project will develop new methods for characterising the diversity of bacteria and use them to test whether bacteria do fall into simple units of diversity equivalent to species, or whether a more complex model of diversity is needed.
First, we will develop a broadly applicable suite of new methods for identifying units of diversity from DNA sequence data. The methods will range from those suitable when only a single gene region has been sequenced from each individual, to those suitable when several genes have been sequenced from each individual, called multi-locus sequence analysis (MLSA). Software will be made freely available to enable other researchers to apply our methods in a broad range of applications. The software will be tested in relation to two existing databases, one compiling sequences of a single gene region (16S rRNA) from several hundred thousand isolates of bacteria, and one sampling bacterial genomes from environmental samples. To answer our central question concerning the simplicity or complexity of bacterial diversity, we will generate a new dataset compiling gene sequence data for the Bacillus cereus species complex. This group includes strains with beneficial roles in soil and on plant surfaces, such as nutrient cycling and blocking of plant pathogens, whereas genetically similar strains are disease agents in humans, other mammals or insects. Going beyond previous studies, we will sequence genes with important ecological functions, such as those involved in attacking host defences, as well as the so-called 'house-keeping' genes involved in basic biological processes that are normally used in MLSA studies. This will allow a more comprehensive test of different scenarios for diversification, in particular comparing functional units with different ecological attributes. The results will indicate whether simple units of diversity exist or whether the pattern of diversification is different depending on which set of genes or which aspect of diversity is being considered. The outputs will establish new methods and evidence both for practical delimitation of bacterial diversity and for theoretical debates on the evolution of diversity and the nature of species.

Technical Summary

DNA sequence data are increasingly used to characterise biodiversity, not least in groups of organisms in which traditional taxonomic approaches have limited use. However, methods for identifying biologically meaningful units of diversity from DNA data are in their infancy. This project will develop new methods for characterising diversity in a group that is particularly dependent on DNA data, and particularly challenging, namely the bacteria. We will test whether bacteria fall into simple units of diversity equivalent to species, or whether a more complex model of diversity is needed. Recent work has explored the nature of bacterial species using multi-locus sequence analysis (MLSA) of house-keeping genes. We will go beyond this by comparing patterns of diversity between core genes and those with key ecological functions, including genes on plasmids. What are the arenas of drift, natural selection and recombination for these different genes? Statistical methods testing directly for the signature of these processes will be developed from methods recently devised by the PI and collaborator (Track Record). The methods will be implemented in open source software and their use on large-scale datasets (e.g. 16S rRNA surveys) explored. Finally, we will apply the methods to a case study sampling multi-locus core and ecological genes in the Bacillus cereus species complex, a group that includes strains with beneficial roles in soil and the phytosphere and genetically similar strains that are pathogenic to humans, other mammals or insects. The results will establish new methods and evidence both for practical delimitation of bacterial diversity and for theoretical debates on the evolution of diversity and the nature of species.

Publications

 
Description 1) We devised and validated (with simulations) a revised version of the GMYC method for delimiting species from single-locus data. One paper describing the new version and the simulation results has been submitted; another implements the method for studies of the meiofauna (Tang et al. 2012); and a third is in preparation, demonstrating its application to bacterial datasets (building on the PI's previous paper showing how the prototype version of the method could be used on bacteria, Barraclough et al. 2009 Biol Lett 5:425-428).


2) We have developed methods to delimit units of diversity based on: i) arenas of recombination, by extension of the Infinite Alleles Model; and ii) patterns of divergent selection. These methods are not yet published but will be presented in the analyses of the Bacillus cereus dataset.


3) We have assembled a dataset for Bacillus cereus from two localities (Silwood Park and Oxford), for 7 house-keeping genes and 6 ecological genes. The data are still being analysed, but initial indications are that the ecological genes do indeed display a different pattern of diversification, as hypothesised in the original proposal. The first paper is now in its second revision at Systematic Biology.
Exploitation Route Surveys of bacterial diversity, and of any other taxa dependent on DNA-based methods, can use the methods to delimit units of diversity. The PI is furthering the research with applications to both environmental bacteria and bacteria from human guts. Software for implementing the GMYC method is freely available at http://r-forge.r-project.org/projects/splits and is being widely used by researchers internationally.


Software for the further analyses of multilocus data will be made available following publication.
Sectors Environment,Healthcare

 
Description The method for species delimitation is being used to delimit species in a range of groups of organisms, and the new paper published as a result of this work in 2013 has already been cited over 100 times by a variety of studies on different organisms.
First Year Of Impact 2013
Sector Environment
Impact Types Societal

 
Title Generalized Mixed Yule Coalescent method for species delimitation 
Description The package is used to delimit species, and the GMYC method in particular can estimate species boundaries from genetic data collected for multiple individuals from a large clade. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact The software provided new options for the method that solved technical problems in earlier versions. 
URL http://r-forge.r-project.org/projects/splits
 
Title tr2 - Multilocus species delimitation using a trinomial distribution model 
Description The algorithm takes a set of gene trees from multiple loci, sampled across a clade, and searches for the optimal delimitation of non-recombining groups based on concordance of triplets. It is intended for delimitation of large samples of individuals and loci, where more exact methods of multilocus delimitation are prohibitively slow.
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Application to animals and bacteria described in the paper. 
URL https://bitbucket.org/tfujisawa/tr2-delimitation/
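
The idea of scoring triplet concordance against a trinomial model can be illustrated with a minimal sketch. This is not the tr2 code, and it does not reproduce tr2's model comparison or its search over delimitations; it swaps in a standard G-test (likelihood-ratio goodness-of-fit test) against a uniform trinomial. The assumption, stated here for illustration only, is that for three individuals from one freely recombining group the three possible rooted topologies should appear about equally often across loci, whereas a strong excess of one topology suggests the triplet spans distinct groups. The function names and the 5% critical value are hypothetical choices made for this example.

```python
from math import lgamma, log

# Chi-square critical value for 2 degrees of freedom at the 5% level
# (an illustrative cut-off chosen for this sketch).
CHI2_CRIT_2DF = 5.991


def trinomial_log_likelihood(counts, probs):
    """Log-probability of the observed topology counts under a trinomial."""
    n = sum(counts)
    ll = lgamma(n + 1)
    for k, p in zip(counts, probs):
        ll -= lgamma(k + 1)
        if k > 0:
            ll += k * log(p)
    return ll


def triplet_g_statistic(counts):
    """G statistic for one triplet of individuals.

    counts holds the number of loci whose gene tree shows each of the
    three rooted topologies: (ab|c, ac|b, bc|a).
    """
    n = sum(counts)
    if n == 0:
        raise ValueError("at least one locus is required")
    uniform = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
    fitted = tuple(k / n for k in counts)  # maximum-likelihood proportions
    return 2.0 * (
        trinomial_log_likelihood(counts, fitted)
        - trinomial_log_likelihood(counts, uniform)
    )


def triplet_is_concordant(counts, critical=CHI2_CRIT_2DF):
    """True when one topology is over-represented beyond the chosen cut-off."""
    return triplet_g_statistic(counts) > critical


if __name__ == "__main__":
    # Ten loci with nearly even topology counts, then ten loci dominated
    # by a single topology.
    print(triplet_is_concordant((4, 3, 3)))
    print(triplet_is_concordant((9, 1, 0)))
```

In this sketch a triplet is scored in isolation; a full delimitation method would combine scores over many triplets and compare alternative groupings of individuals.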
 
Description Public engagement at Science Uncovered event at the Natural History Museum, 28th September 2012 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Public learned scientific information about microbial diversity and evolution in the context of digestive health

Public expressed interest and learned new facts
Year(s) Of Engagement Activity 2012