Bayesian implementation of the multispecies-coalescent-with-introgression (MSci) model for analysis of population genomic data
Lead Research Organisation:
UNIVERSITY COLLEGE LONDON
Department Name: Genetics Evolution and Environment
Abstract
Genomes from different species contain rich information about the evolutionary history of those species. By comparing DNA sequences from different species or from different individuals of the same species, we can work out how the species are related, when they diverged from each other, and whether and when cross-species hybridisation occurred. However, extracting this information from genomes requires powerful statistical models and efficient computational algorithms. The multispecies-coalescent-with-introgression (MSci) model provides a natural framework for comparative analysis of genomic sequence data, accommodating the random fluctuations of biological reproduction as genetic material is passed from one generation to the next, the random accumulation of genetic mutations, and possible cross-species hybridisation events. We will implement the MSci model in our Bayesian Markov chain Monte Carlo simulation program, so that it can be used to estimate species phylogenies, species divergence times, ancestral population sizes, and the time and rate of hybridisation. Those parameters will provide important insights into the origin of species. We will apply our newly developed methods to analyse genomic datasets from Heliconius butterflies, Malagasy mouse lemurs, and lizards, generated by our collaborators.
Technical Summary
We will implement the multispecies-coalescent-with-introgression (MSci) model in our Bayesian Markov chain Monte Carlo (MCMC) program BPP, and improve the computational and mixing efficiency of the MCMC algorithms. The MSci model can be used to estimate species phylogenies and species divergence times, ancestral population sizes, and the time and magnitude of hybridisation events. Those parameters will provide important insights into the process of species formation. The Bayesian methods are superior to heuristic methods in that they are able to accommodate ancestral polymorphism and incomplete lineage sorting, gene tree-species tree conflicts, and uncertainties and errors in the gene trees due to limited information in the sequence data. We will develop and evaluate novel MCMC proposals to improve the mixing efficiency of the trans-model MCMC algorithms. We will parallelize the program to make efficient use of modern multi-processor multi-core computer hardware. We will design a friendly web-based graphical user interface (GUI). We will apply our newly developed methods to analyse genomic datasets from Heliconius butterflies, Malagasy mouse lemurs, and lizards, in collaboration with evolutionary biologists.
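The gene tree-species tree conflict mentioned above can be made concrete with the standard three-species result from coalescent theory. The sketch below is illustrative only and is not code from BPP: the function name and parameter names are hypothetical, and it assumes one sequence per species, no gene flow, divergence times (tau) measured in expected substitutions per site, and theta = 4Nμ for the ancestral population on the internal branch.

```python
import math


def discordance_probability(tau_root: float, tau_inner: float, theta: float) -> float:
    """Probability that a three-species gene tree disagrees with the species tree.

    Assumes species tree ((A, B), C), one sequence per species and no gene flow.
    tau_root and tau_inner are the ages of the root and of the (A, B) ancestor;
    theta is the population size parameter of the (A, B) ancestral population.
    These names are illustrative assumptions, not BPP identifiers.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    if tau_root < tau_inner:
        raise ValueError("the root cannot be younger than the internal node")
    # Two lineages coalesce at rate 2/theta per unit of mutation-scaled time,
    # so the internal branch spans this many coalescent units.
    branch_in_coalescent_units = 2.0 * (tau_root - tau_inner) / theta
    # With probability exp(-T) the A and B lineages fail to coalesce on the
    # internal branch (incomplete lineage sorting); in the root population the
    # three lineages then coalesce in random order, and 2 of the 3 possible
    # gene trees mismatch the species tree.
    return (2.0 / 3.0) * math.exp(-branch_in_coalescent_units)


if __name__ == "__main__":
    for gap in (0.0, 0.001, 0.005, 0.02):
        p = discordance_probability(0.01 + gap, 0.01, theta=0.01)
        print(f"internal branch {gap:.3f}: P(discordant gene tree) = {p:.4f}")
```

A short internal branch relative to theta gives a discordance probability approaching 2/3, which is why methods that treat each estimated gene tree as the species tree can be misled, and why the full-likelihood approach models the coalescent process explicitly.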
Planned Impact
Delimiting species boundaries and inferring species phylogenies are of vital importance to assessing the current biodiversity, to understanding the impact of environmental and societal changes on species extinctions and loss of biodiversity, and to developing effective conservation policies. Methods for inferring species phylogenies and cross-species introgression events to be developed in this project will become powerful tools for analysis of genomic datasets, and results obtained from such analyses will be critical to effective decision making concerning biodiversity management and conservation. The methods can also be used to identify species, and are useful for tracking illegal wildlife trade.
Publications
Finger N
(2022)
Genome-Scale Data Reveal Deep Lineage Divergence and a Complex Demographic History in the Texas Horned Lizard (Phrynosoma cornutum) throughout the Southwestern and Central United States.
in Genome Biology and Evolution
Flouri T
(2020)
A Bayesian Implementation of the Multispecies Coalescent Model with Introgression for Phylogenomic Analysis
in Molecular Biology and Evolution
Flouri T
(2022)
Bayesian Phylogenetic Inference using Relaxed-clocks and the Multispecies Coalescent.
in Molecular Biology and Evolution
Flouri T
(2023)
Efficient Bayesian inference under the multispecies coalescent with migration.
in Proceedings of the National Academy of Sciences of the United States of America
Huang J
(2022)
Phase Resolution of Heterozygous Sites in Diploid Genomes is Important to Phylogenomic Analysis under the Multispecies Coalescent Model.
in Systematic Biology
Huang J
(2020)
A Simulation Study to Examine the Information Content in Phylogenomic Data Sets under the Multispecies Coalescent Model.
in Molecular Biology and Evolution
Huang J
(2022)
Inference of Gene Flow between Species under Misspecified Models.
in Molecular Biology and Evolution
| Title | Bayesian inference under the multispecies coalescent with ancient DNA sequences |
| Description | Ancient DNA (aDNA) is increasingly being used to investigate questions such as the phylogenetic relationships and divergence times of extant and extinct species. If aDNA samples are sufficiently old, expected branch lengths (in units of DNA substitutions) are reduced relative to contemporary samples. This can be accounted for by incorporating sample ages into phylogenetic analyses. Existing methods that use tip (sample) dates infer gene trees rather than species trees, which can lead to incorrect or biased inferences of the species tree. Methods using a multispecies coalescent (MSC) model overcome these issues. We developed an MSC model with tip dates and implemented it in the program bpp. The method performed well for a range of biologically realistic scenarios, estimating calibrated divergence times and mutation rates precisely. Simulations suggest that estimation precision can be best improved by prioritizing sampling of many loci and more ancient samples. Incorrectly treating ancient samples as contemporary in analyzing simulated data, mimicking a common practice of empirical analyses, led to large systematic biases in model parameters, including divergence times. Two genomic datasets of mammoths and elephants were analyzed, demonstrating the method's empirical utility. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| URL | https://datadryad.org/stash/dataset/doi:10.5061/dryad.4mw6m90h0 |
| Title | Data from: Inference of cross-species gene flow using genomic data depends on the methods: Case study of gene flow in Drosophila |
| Description | Analysis of genomic data in the past two decades has highlighted the prevalence of introgression as an important evolutionary force in both plants and animals. The genus Drosophila has received much attention recently, with an analysis of genomic sequence data detailing widespread introgression across the species phylogeny for the genus. However, the methods used in the study are based on data summaries for species triplets and are unable to infer gene flow between sister lineages or to identify the direction of gene flow. Hence, we reanalyze a subset of the data using the Bayesian program bpp, which is a full-likelihood implementation of the multispecies coalescent (MSC) model and can provide more powerful inference of gene flow between species, including its direction, timing, and strength. While our analysis supports the presence of gene flow in the species group, the results differ from the previous study: we infer gene flow between sister lineages undetected previously whereas most gene-flow events inferred in the previous study are rejected in our tests. To verify our conclusions, we performed simulations to examine the properties of Bayesian and summary methods. Bpp was found to have high power to detect gene flow, high accuracy in estimated rates of gene flow, and robustness under misspecification of the mode of gene flow. In contrast, summary methods had low power and produced biased estimates of introgression probability. Our results suggest that likelihood methods under the MSC models of gene flow provide important complements to summary methods for characterizing the rich history of species divergence and introgression using genomic data. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| URL | https://datadryad.org/stash/dataset/doi:10.5061/dryad.ngf1vhj33 |
| Title | Estimation of species divergence times in presence of cross-species gene flow |
| Description | Cross-species introgression can have significant impacts on phylogenomic reconstruction of species divergence events. Here, we used simulations to show how the presence of even a small amount of introgression can bias divergence time estimates when gene flow is ignored in the analysis. Using advances in analytical methods under the multispecies coalescent (MSC) model, we demonstrate that by accounting for incomplete lineage sorting and introgression using large phylogenomic data sets this problem can be avoided. The multispecies-coalescent-with-introgression (MSci) model is capable of accurately estimating both divergence times and ancestral effective population sizes, even when only a single diploid individual per species is sampled. We characterize some general expectations for biases in divergence time estimation under three different scenarios: 1) introgression between sister species, 2) introgression between non-sister species, and 3) introgression from an unsampled (i.e., ghost) outgroup lineage. We also conducted simulations under the isolation-with-migration (IM) model, and found that the MSci model assuming episodic gene flow was able to accurately estimate species divergence times despite high levels of continuous gene flow. We estimated divergence times under the MSC and MSci models from two published empirical datasets with previous evidence of introgression, one of 372 target enrichment loci from baobabs (Adansonia), and another of 1,000 transcriptome loci from fourteen species of the tomato relative, Jaltomata. The empirical analyses not only confirm our findings from simulations, demonstrating that the MSci model can reliably estimate divergence times, but also show that divergence time estimation under the MSC can be robust to the presence of small amounts of introgression in empirical datasets with extensive taxon sampling. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2021 |
| Provided To Others? | Yes |
| URL | http://datadryad.org/stash/dataset/doi:10.5061/dryad.zs7h44j8x |
| Title | Hierarchical heuristic species delimitation under the multispecies coalescent model with migration |
| Description | The multispecies coalescent (MSC) model accommodates genealogical fluctuations across the genome and provides a natural framework for comparative analysis of genomic sequence data to infer the history of species divergence and gene flow. Given a set of populations, hypotheses of species delimitation (and species phylogeny) may be formulated as instances of MSC models (e.g., MSC for one species versus MSC for two species) and compared using Bayesian model selection. This approach, implemented in the program bpp, has been found to be prone to over-splitting. Alternatively, heuristic criteria based on population parameters under the MSC model (such as population/species divergence times, population sizes, and migration rates) estimated from genomic sequence data may be used to delimit species. Here we extend the approach of species delimitation using the genealogical divergence index to develop hierarchical merge and split algorithms for heuristic species delimitation and implement them in a Python pipeline called hhsd. Applied to data simulated under a model of isolation by distance, the approach was able to recover the correct species delimitation, whereas model comparison by bpp failed. Analyses of empirical datasets suggest that the procedure may be less prone to over-splitting. We discuss possible strategies for accommodating paraphyletic species in the procedure, as well as the challenges of species delimitation based on heuristic criteria. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| URL | https://datadryad.org/stash/dataset/doi:10.5061/dryad.jm63xsjhc |
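
The genealogical divergence index used in the last dataset description can be illustrated with a short sketch. This is not code from hhsd or bpp: it assumes the simple no-migration form of the index, 1 - exp(-2τ/θ), and the 0.2 and 0.7 cut-offs often quoted in the species-delimitation literature. The function names and defaults are hypothetical, and with migration in the model the index generally has to be obtained in other ways (for example by simulation).

```python
import math


def genealogical_divergence_index(tau: float, theta: float) -> float:
    """No-migration form of the index: 1 - exp(-2 * tau / theta).

    tau is the population divergence time and theta the size parameter of the
    population being assessed, both on the mutation scale. This is the
    probability that two sequences from that population coalesce before
    reaching the divergence time (looking backwards in time).
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return 1.0 - math.exp(-2.0 * tau / theta)


def classify(index: float, low: float = 0.2, high: float = 0.7) -> str:
    """Apply heuristic cut-offs (assumed defaults, not taken from this page)."""
    if index < low:
        return "single species"
    if index > high:
        return "distinct species"
    return "ambiguous"


if __name__ == "__main__":
    for tau in (0.0005, 0.005, 0.02):
        value = genealogical_divergence_index(tau, theta=0.01)
        print(f"tau={tau}: index={value:.3f} -> {classify(value)}")
```

A hierarchical merge procedure of the kind described would repeatedly evaluate such a criterion for sister populations on the guide tree and merge those that fall below the lower cut-off; the sketch covers only the criterion itself.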
