Statistical Methods for Genomic Analysis of Species Divergences

Lead Research Organisation: University College London
Department Name: Genetics Evolution and Environment

Abstract

Our evolutionary history is written in our genomes. By comparing DNA sequences from different species, we can work out how the species are related. By comparing the DNA sequences of multiple individuals from the same species, we can estimate the population size and infer demographic changes (such as population bottlenecks) of the species. Such studies fall into the domains of phylogenetics and population genetics. Genomic sequence data from multiple individuals of several closely related species allow powerful inference at the interface of phylogenetics and population genetics. One can use such data to estimate species divergence times and ancestral population sizes, accounting for lineage sorting, and to detect gene flow at the time of speciation or to test different models of speciation. Such data also allow delimitation of species (for example, to decide whether the sampled individuals belong to one or two species).

To achieve those goals, powerful statistical methods and computational algorithms are necessary. In this project we will implement such methods within two well-established statistical frameworks: maximum likelihood and Bayesian inference.

We will develop maximum likelihood methods for estimating migration rates between populations, and design likelihood ratio tests to test whether there is gene flow at the time of speciation (that is, whether speciation is clean). We will implement models that allow the migration rate to decrease over time since species divergence. Those methods will be useful for testing different speciation models such as allopatric and parapatric speciation. Computational difficulties will limit our likelihood methods to 2 or 3 sequences at each sampled locus. However, the methods can accommodate a huge number of loci (indeed the whole genome), and with population data at some loci and species data at other loci, powerful inference is feasible. We will use computer simulations to examine the statistical properties of the new methods, and apply the methods to genomic datasets from the hominoids.
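The logic of a likelihood ratio test for gene flow can be illustrated with a minimal sketch. It is not the project's implementation: the function names and the log-likelihood values are hypothetical, and it assumes the alternative model adds a single migration-rate parameter whose null value (zero) lies on the boundary of the parameter space, so that the null distribution of the test statistic is taken to be a 50:50 mixture of a point mass at zero and a chi-square with one degree of freedom.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if x <= 0.0:
        return 1.0
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def lrt_gene_flow(lnl_null: float, lnl_alt: float, boundary: bool = True):
    """Compare a no-migration model (null) with a migration model (alternative).

    lnl_null and lnl_alt are maximized log-likelihoods (hypothetical inputs).
    Returns the test statistic 2 * (lnL1 - lnL0) and an approximate p-value.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = chi2_sf_1df(stat)
    if boundary and stat > 0.0:
        # Assumed 50:50 mixture null: halve the chi-square tail probability.
        p_value *= 0.5
    return stat, p_value


# Made-up log-likelihoods, for illustration only.
stat, p_value = lrt_gene_flow(lnl_null=-1000.0, lnl_alt=-998.0)
print(f"2*dlnL = {stat:.2f}, p = {p_value:.4f}")
```

If the alternative model has more than one extra parameter, or parameters that are not identifiable under the null, the mixture assumed here would not apply, and the null distribution would need to be checked by simulation.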

We will introduce significant improvements and extensions to a Bayesian model-comparison approach to delimiting species using genomic sequence data. Published a year ago (Yang and Rannala 2010 Proc Natl Acad Sci USA 107:9264-9269), this method has attracted much attention among evolutionary biologists. The method uses an algorithm called reversible-jump Markov chain Monte Carlo (rjMCMC) to sample different species-delimitation models, such as the one-species model (which assumes that all sampled individuals are from one single species) and the two-species model (which assumes that the sampled individuals are from two distinct species). However, our current implementation in the computer program BPP has serious limitations and is inefficient for intermediate or large datasets. A major objective of this project is to improve the rjMCMC algorithm so that the program becomes feasible for analysis of large genomic-scale datasets. We will also parallelize the programs to improve the computational efficiency.

Technical Summary

We will improve the previous maximum likelihood and Bayesian MCMC methods developed by the PI and collaborators for analysis of genomic sequence data from several closely related species to understand the speciation process and the migration patterns and to estimate migration rates between populations. Those methods are superior to most existing methods in that they are able to accommodate ancestral polymorphism and lineage sorting, gene tree-species tree conflicts, and uncertainties and errors in gene trees due to limited information in the sequence data. We will improve a previous likelihood ratio test of speciation with gene flow, and also implement new migration models in which the migration rate varies over time to reflect the build-up of reproductive isolation since species divergence.

We will extend the BPP program for Bayesian phylogeographic analysis, to implement a model of sequence errors and a method for resolving phase in diploid SNP data produced by new sequencing methodologies. We will refine the reversible-jump MCMC (rjMCMC) algorithms so that the Markov chain mixes better, enabling the method to be used with large datasets. We will introduce MCMC moves to modify the guide tree. As a result, species delimitation and species phylogeny will be jointly inferred by the algorithm (although within the search space defined by the population assignments in the initial guide tree). We will also make an effort to parallelize the likelihood and Bayesian programs 3S and BPP.

Planned Impact

Understanding the speciation process and delimiting species are of vital importance to assessing current biodiversity, to understanding the impact of environmental and societal changes on species extinctions, and to developing effective conservation policies. The methods developed in this project, for delimiting species and inferring the common mechanisms of speciation, provide powerful tools for analysis of genomic datasets, and results obtained from such analyses will be critical to effective decision making concerning biodiversity management and conservation.
