Phylogeographic inference using genomic sequence data under the multispecies coalescent model

Lead Research Organisation: University College London
Department Name: Genetics Evolution and Environment

Abstract

Our evolutionary history is written in our genomes. By comparing DNA sequences from different species or multiple individuals of the same species we can work out how the species are related, when they diverged from each other, whether there was introgression between the species, and whether the population size of a species went through a bottleneck or other demographic changes. DNA sequences can also be used to identify species and delineate species boundaries. To address such exciting questions, powerful statistical methods and computational algorithms are necessary. In this project we will develop new statistical models and computer algorithms for efficient analysis of genomic sequence data within two well-established statistical frameworks: maximum likelihood and Bayesian inference. We will develop a maximum likelihood method for estimating the species tree that accommodates the random process of biological reproduction and genetic sequence evolution, as well as introgression or hybridisation that may be common between closely related species, especially during radiative speciations. We will introduce significant improvements and extensions to our Bayesian model-comparison approach to delimiting species using genomic sequence data. We will implement sophisticated models to describe the evolutionary process of DNA sequences and to allow changes in the evolutionary rate among lineages so that the program can be applied to estimate species phylogenies for distantly related species, such as different orders of mammals. We will parallelize the program to improve the computational efficiency.

Technical Summary

We will improve the maximum likelihood and Bayesian MCMC methods developed by the PI and collaborators for analysis of genomic sequence data from multiple species to infer the species phylogeny under the multispecies coalescent model. Those methods are superior to existing heuristic methods in that they are able to accommodate ancestral polymorphism and incomplete lineage sorting, gene tree-species tree conflicts, and uncertainties and errors in gene trees due to limited information in the sequence data. We will extend our program 3S to develop a maximum likelihood method of species tree inference under the multispecies coalescent model with introgression, which is expected to be very useful for inferring species phylogenies when the species are closely related and introgression is common. We will extend our Bayesian MCMC program BPP to implement sophisticated mutation models (such as GTR+G) and to relax the molecular clock so that the method can be applied to distantly related species. We will implement and evaluate novel MCMC proposal kernels to improve the mixing efficiency of the transmodel MCMC algorithms. We will parallelize the program to make efficient use of modern multi-processor, multi-core computer hardware.
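
To illustrate why incomplete lineage sorting produces gene tree-species tree conflicts, the following minimal sketch computes gene tree probabilities for three species with one sequence each. It is an illustration only, not code from 3S or BPP. It assumes the species tree ((A,B),C), an internal branch of length delta_tau measured in expected mutations per site, and an ancestral population size parameter theta = 4Nμ, so that a pair of lineages coalesces at rate 2/theta. The function and parameter names are hypothetical.

```python
import math


def gene_tree_probs(delta_tau, theta):
    """Gene tree topology probabilities for species tree ((A,B),C).

    Assumptions (illustrative): one sequence per species, no introgression,
    delta_tau = internal branch length in expected mutations per site,
    theta = 4*N*mu for the population ancestral to A and B.
    """
    if theta <= 0 or delta_tau < 0:
        raise ValueError("theta must be positive and delta_tau non-negative")
    # Probability that the A and B lineages fail to coalesce on the
    # internal branch (coalescent rate 2/theta per unit of mutation time).
    p_no_coalescence = math.exp(-2.0 * delta_tau / theta)
    # If they fail, all three lineages enter the root population, where
    # each of the three topologies is equally likely.
    p_each_mismatch = p_no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_each_mismatch,
        "((A,C),B)": p_each_mismatch,
        "((B,C),A)": p_each_mismatch,
    }


if __name__ == "__main__":
    for delta_tau in (0.0001, 0.001, 0.01):
        print(delta_tau, gene_tree_probs(delta_tau, theta=0.002))
```

Under these assumptions, a short internal branch relative to theta leaves the matching gene tree only slightly more probable than each mismatching tree, which is the setting in which methods that ignore the coalescent process are most likely to be misled.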

Planned Impact

Delimiting species boundaries and inferring species phylogenies are of vital importance to assessing current biodiversity, to understanding the impact of environmental and societal changes on species extinctions, and to developing effective conservation policies. The methods developed in this project, for delimiting and identifying species, provide powerful tools for analysis of genomic datasets, and results obtained from such analyses will be critical to effective decision making concerning biodiversity management and conservation. Because the methods can identify species from sequence data, they are also useful for tracking illegal wildlife trade.

Publications

Description We developed a Bayesian method for inferring the species phylogeny under the multispecies coalescent (MSC) model. To improve the mixing properties of the Markov chain Monte Carlo (MCMC) algorithm that traverses the space of species trees, we implemented two efficient MCMC proposals: the first is based on the Subtree Pruning and Regrafting (SPR) algorithm and the second is based on a node-slider algorithm. Like the Nearest-Neighbor Interchange (NNI) algorithm we implemented previously, both new algorithms propose changes to the species tree, while simultaneously altering the gene trees at multiple genetic loci to automatically avoid conflicts with the newly proposed species tree. The method integrates over gene trees, naturally taking account of the uncertainty of gene tree topology and branch lengths given the sequence data. A simulation study was performed to examine the statistical properties of the new method. The method was found to show excellent statistical performance, inferring the correct species tree with near certainty when 10 loci were included in the dataset. The results suggest that the Bayesian coalescent-based method is statistically more efficient than heuristic methods based on summary statistics, and that our implementation is computationally more efficient than alternative full-likelihood methods under the MSC. Parameter estimates for the rattlesnake data suggest drastically different evolutionary dynamics between the nuclear and mitochondrial loci, even though they support largely consistent species trees. We discuss the different challenges facing the marginal likelihood calculation and transmodel MCMC as alternative strategies for estimating posterior probabilities for species trees.
Exploitation Route Scientists working in different species groups can use our software to conduct comparative genomic data analysis.
Sectors Environment; Healthcare; Culture, Heritage, Museums and Collections
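
As a rough illustration of the kind of topology move mentioned in the description above, the sketch below enumerates the Nearest-Neighbor Interchange (NNI) neighbours of a rooted binary species tree. It is a simplified, hypothetical example and not the BPP implementation: it handles only the species tree topology, represented here as nested Python tuples, and omits the coordinated changes to gene trees, the node ages, and the Metropolis-Hastings acceptance step that a full MCMC proposal requires.

```python
def nni_neighbors(tree):
    """Yield every rooted tree one NNI move away from `tree`.

    A tree is either a tip label (str) or a pair (left, right) of subtrees.
    This representation is an assumption made for the sketch.
    """
    if isinstance(tree, str):
        return  # a tip has no internal branch to rearrange
    left, right = tree
    if not isinstance(left, str):
        a, b = left
        # Swap the sibling `right` with each child of `left` in turn.
        yield ((a, right), b)
        yield ((right, b), a)
    if not isinstance(right, str):
        a, b = right
        # Swap the sibling `left` with each child of `right` in turn.
        yield (a, (left, b))
        yield (b, (a, left))
    # Apply the same moves deeper in the tree, keeping the rest fixed.
    for new_left in nni_neighbors(left):
        yield (new_left, right)
    for new_right in nni_neighbors(right):
        yield (left, new_right)


if __name__ == "__main__":
    for neighbor in nni_neighbors(((("A", "B"), "C"), "D")):
        print(neighbor)
```

Each internal branch contributes two neighbours, so a rooted tree with n tips has 2(n - 2) NNI neighbours. SPR and node-slider moves reach more distant trees in a single step, which is one reason such proposals can improve mixing over the space of species trees.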

 
Title Phase resolution of heterozygous sites in diploid genomes is important to phylogenomic analysis under the multispecies coalescent model 
Description Genome sequencing projects routinely generate haploid consensus sequences from diploid genomes, which are effectively chimeric sequences with the phase at heterozygous sites resolved at random. The impact of phasing errors on phylogenomic analyses under the multispecies coalescent (MSC) model is largely unknown. Here we conduct a computer simulation to evaluate the performance of four phase-resolution strategies (the true phase resolution, the diploid analytical integration algorithm which averages over all phase resolutions, computational phase resolution using the program PHASE, and random resolution) on estimation of the species tree and evolutionary parameters in analysis of multi-locus genomic data under the MSC model. We found that species tree estimation is robust to phasing errors. Estimation of parameters under the MSC model with and without introgression may be affected by phasing errors, especially at high mutation rate or when many sequences are sampled from the same species. In particular, random phase resolution causes serious overestimation of population sizes for modern species and biased estimation of cross-species introgression probability. Use of phased sequences inferred by the PHASE program produced very small biases in parameter estimates. We suggest that genome sequencing projects should produce unphased diploid genotype sequences instead of haploid consensus sequences, which have heterozygous sites phased at random. In cases where phased data are not directly provided by next-generation sequencing, we recommend the analytical integration algorithm or computational phasing (e.g., using the PHASE program) prior to population genomic analyses. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL http://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncrd
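
The random phase resolution described in this record can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: it assumes an unphased diploid genotype written as a string in which heterozygous sites carry two-base IUPAC ambiguity codes, and the function names are hypothetical.

```python
import random

# IUPAC ambiguity codes for heterozygous sites with two distinct bases.
IUPAC_HET = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def random_phase(genotype, rng):
    """Resolve each heterozygous site independently and at random.

    Returns two haploid sequences. With more than one heterozygous site,
    the result may be a chimera of the two true haplotypes.
    """
    hap1, hap2 = [], []
    for code in genotype.upper():
        if code in IUPAC_HET:
            x, y = IUPAC_HET[code]
            if rng.random() < 0.5:
                x, y = y, x  # coin flip decides which base goes to hap1
            hap1.append(x)
            hap2.append(y)
        else:
            # Homozygous (or otherwise unresolved) site: copy to both.
            hap1.append(code)
            hap2.append(code)
    return "".join(hap1), "".join(hap2)


def count_phase_resolutions(genotype):
    """Number of distinct unordered haplotype pairs consistent with the genotype."""
    k = sum(code in IUPAC_HET for code in genotype.upper())
    return 1 if k == 0 else 2 ** (k - 1)


if __name__ == "__main__":
    rng = random.Random(2020)
    print(random_phase("ACRTTYGAK", rng))
    print(count_phase_resolutions("ACRTTYGAK"))
```

With k heterozygous sites there are 2^(k-1) distinct haplotype pairs, so independent coin flips recover the true pair with probability 1/2^(k-1). An analytical integration approach would instead average over all of these resolutions rather than committing to one.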
 
Title The asymptotic behavior of bootstrap support values in molecular phylogenetics 
Description The phylogenetic bootstrap is the most commonly used method for assessing statistical confidence in estimated phylogenies by non-Bayesian methods such as maximum parsimony and maximum likelihood (ML). It is observed that bootstrap support tends to be high in large genomic datasets whether or not the inferred trees and clades are correct. Here we study the asymptotic behavior of bootstrap support for the ML tree in large datasets when the competing phylogenetic trees are equally right or equally wrong. We consider phylogenetic reconstruction as a problem of statistical model selection when the compared models are nonnested and misspecified. The bootstrap is found to have qualitatively different dynamics from Bayesian inference, and does not exhibit the polarized behavior of posterior model probabilities, consistent with the empirical observation that the bootstrap is more conservative than Bayesian probabilities. Nevertheless, bootstrap support similarly shows fluctuations among large datasets, with no convergence to a point value, when the compared models are equally right or equally wrong. Thus, in large datasets, strong support for wrong trees or models is likely to occur. Our analysis provides a partial explanation for the high bootstrap support values for incorrect clades observed in empirical data analysis.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL http://datadryad.org/stash/dataset/doi:10.5061/dryad.7m0cfxprw
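
The bootstrap procedure discussed in this record can be illustrated with a toy sketch. It is not the analysis from the study: in place of ML tree estimation it uses a deliberately simple stand-in, choosing among the three rooted topologies for three species by counting the informative sites that support each one. The topology labels, the input format (one label per informative site), and the function names are assumptions made for the example.

```python
import random
from collections import Counter

TOPOLOGIES = ("12|3", "13|2", "23|1")


def best_topology(counts, rng):
    """Pick the topology supported by the most sites, breaking ties at random."""
    top = max(counts[t] for t in TOPOLOGIES)
    return rng.choice([t for t in TOPOLOGIES if counts[t] == top])


def bootstrap_support(sites, n_reps=1000, seed=0):
    """Nonparametric bootstrap: resample sites with replacement.

    `sites` is a list in which each entry names the topology that one
    informative site supports. Returns the proportion of bootstrap
    replicates in which each topology is the estimate.
    """
    rng = random.Random(seed)
    wins = Counter()
    n = len(sites)
    for _ in range(n_reps):
        resampled = rng.choices(sites, k=n)  # sample n sites with replacement
        wins[best_topology(Counter(resampled), rng)] += 1
    return {t: wins[t] / n_reps for t in TOPOLOGIES}


if __name__ == "__main__":
    # Two topologies that are "equally right" in expectation; the observed
    # counts differ only by sampling noise.
    data_rng = random.Random(7)
    sites = data_rng.choices(["12|3", "13|2"], k=100_000)
    print(bootstrap_support(sites, n_reps=200))
```

Rerunning the example with different values passed to `random.Random` for the simulated data gives a rough feel for how support behaves when two competing trees are equally well supported in expectation.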