Bayesian inference of the mode of speciation and gene flow using genomic data

Lead Research Organisation: UNIVERSITY COLLEGE LONDON
Department Name: Genetics Evolution and Environment

Abstract

Genomic sequences from contemporary species contain rich information about the history of species divergence and the origin of species. By comparing DNA sequences from different species we can work out how the species are related, when they diverged from each other, and whether they have been exchanging genes. However, powerful statistical models and efficient computational algorithms are necessary to extract this information from genomic data. The multispecies coalescent model provides a natural framework for comparative analysis of genomic sequence data, as it accommodates the random fluctuations of biological reproduction when genetic material is passed between generations, the random accumulation of mutations, and possible hybridisation or introgression between species. We will extend our Bayesian inference program BPP to allow continuous gene flow between species, covering models of isolation with migration, isolation with initial migration, secondary contact, and complete isolation. These models represent different biological hypotheses about the mode of speciation, and can be compared to further our understanding of the speciation process. The Bayesian methods to be developed in the project extract information from genomic datasets efficiently while accommodating important biological processes, such as polymorphism in the ancestral species and uncertainty in the gene genealogical trees caused by the limited information in sequence data from a short genomic segment. We will apply our newly developed methods to analyse genomic datasets, generated by our collaborators, from the big cats in the Panthera genus, Heliconius butterflies, Malagasy mouse lemurs, and North-American lizards. We will infer the history of species divergences, the direction and timing of gene flow, and evolutionary parameters such as species divergence times and population sizes.

Technical Summary

We will implement the multispecies-coalescent-with-migration (MSC-M) model in our Bayesian MCMC program BPP, including variants that represent different biological hypotheses about speciation, such as isolation with migration (IM), isolation with initial migration (IIM), and secondary contact (SC).

We will develop within-model MCMC algorithms to change parameters in the MSC-M model. These include MCMC proposals to change the migration rates and the migration times on the gene trees, and the gene-tree SPR move to average over the gene tree at each locus. We will modify our rubber-band algorithm for changing the species divergence time in the presence of migration events on the gene trees.
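As a rough illustration of what a within-model proposal for a positive parameter such as a migration rate can look like, the sketch below runs a generic Metropolis-Hastings update with a multiplier (log-scale sliding window) proposal. It is not the BPP implementation: the function names, the step size, and the toy Gamma(2, 10) target that stands in for a posterior are all assumptions made for the example.

```python
import math
import random


def log_gamma_kernel(x, shape, rate):
    """Unnormalised log density of Gamma(shape, rate) at x > 0 (toy target)."""
    return (shape - 1.0) * math.log(x) - rate * x


def multiplier_step(x, log_target, step, rng):
    """One Metropolis-Hastings update of a positive parameter x.

    Proposal: x_new = x * exp(step * (u - 0.5)) with u ~ Uniform(0, 1).
    The proposal (Hastings) ratio for this move is x_new / x.
    """
    u = rng.random()
    x_new = x * math.exp(step * (u - 0.5))
    log_accept = log_target(x_new) - log_target(x) + math.log(x_new / x)
    if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
        return x_new, True
    return x, False


def run_chain(n_iter=20000, burn_in=2000, x0=1.0, step=2.0, seed=1):
    """Sample the toy target; returns post-burn-in samples and acceptance rate."""
    rng = random.Random(seed)

    def log_target(x):
        # Hypothetical stand-in for the log posterior of a migration rate.
        return log_gamma_kernel(x, shape=2.0, rate=10.0)

    x, samples, accepted = x0, [], 0
    for i in range(n_iter):
        x, ok = multiplier_step(x, log_target, step, rng)
        accepted += ok
        if i >= burn_in:
            samples.append(x)
    return samples, accepted / n_iter


if __name__ == "__main__":
    samples, acc = run_chain()
    print("posterior mean estimate:", sum(samples) / len(samples))
    print("acceptance rate:", acc)
```

The multiplier move keeps the parameter positive without reflection at zero, which is why it is a common choice for rates; in a real MSC-M analysis the target would be the joint posterior over gene trees and parameters rather than a closed-form density.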

We will develop cross-model MCMC algorithms to explore different MSC-M models. We will implement a proposal to insert or delete a migration rate parameter when the species tree is fixed, and another move to extend our species-tree NNI/SPR algorithms to work under MSC-M. The algorithm involves identifying so-called affected nodes on the gene trees which are pruned off and reattached to the gene-tree backbone by simulation.

The migration and introgression models implemented in BPP represent different biological hypotheses concerning the speciation process, and can be compared using genomic sequence data to learn about speciation. We will develop models of variable rates of gene flow across the genome, which can be used to identify genomic regions with extreme migration rates or extreme genealogical trees, which are candidates for adaptive introgression. We will apply our newly developed methods to analyse genomic datasets from Heliconius butterflies, Malagasy mouse lemurs, North-American lizards, and big cats in the Panthera genus, generated by our collaborators. We hope to resolve long-standing phylogenetic problems in those species groups.
 
Title Bayesian inference under the multispecies coalescent with ancient DNA sequences 
Description Ancient DNA (aDNA) is increasingly being used to investigate questions such as the phylogenetic relationships and divergence times of extant and extinct species. If aDNA samples are sufficiently old, expected branch lengths (in units of DNA substitutions) are reduced relative to contemporary samples. This can be accounted for by incorporating sample ages into phylogenetic analyses. Existing methods that use tip (sample) dates infer gene trees rather than species trees, which can lead to incorrect or biased inferences of the species tree. Methods using a multispecies coalescent (MSC) model overcome these issues. We developed an MSC model with tip dates and implemented it in the program bpp. The method performed well for a range of biologically realistic scenarios, estimating calibrated divergence times and mutation rates precisely. Simulations suggest that estimation precision can be best improved by prioritizing sampling of many loci and more ancient samples. Incorrectly treating ancient samples as contemporary in analyzing simulated data, mimicking a common practice of empirical analyses, led to large systematic biases in model parameters, including divergence times. Two genomic datasets of mammoths and elephants were analyzed, demonstrating the method's empirical utility. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.4mw6m90h0
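The description above notes that old samples have shorter expected branch lengths than contemporary ones. A minimal arithmetic sketch of that effect, assuming a strict molecular clock, is given below. The function name, the rate, and the ages are illustrative assumptions, not values from the study, and this is not how bpp computes its likelihood.

```python
def expected_branch_length(node_age_years, sample_age_years, rate_per_site_per_year):
    """Expected substitutions per site between a dated sample and an ancestor.

    Under a strict clock this is rate * (time elapsed between the two).
    """
    if sample_age_years < 0 or sample_age_years > node_age_years:
        raise ValueError("sample age must lie between 0 and the node age")
    return rate_per_site_per_year * (node_age_years - sample_age_years)


if __name__ == "__main__":
    rate = 1e-9            # hypothetical substitutions per site per year
    node_age = 1_000_000   # hypothetical age of the ancestral node, in years
    modern = expected_branch_length(node_age, 0, rate)
    ancient = expected_branch_length(node_age, 100_000, rate)
    print("contemporary sample:", modern)
    print("100,000-year-old sample:", ancient)
```

Treating the ancient sample as contemporary would attribute its shorter branch to a lower rate or a younger node, which is the kind of bias the description reports.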
 
Title Data from: Inference of cross-species gene flow using genomic data depends on the methods: Case study of gene flow in Drosophila 
Description Analysis of genomic data in the past two decades has highlighted the prevalence of introgression as an important evolutionary force in both plants and animals. The genus Drosophila has received much attention recently, with an analysis of genomic sequence data detailing widespread introgression across the species phylogeny for the genus. However, the methods used in the study are based on data summaries for species triplets and are unable to infer gene flow between sister lineages or to identify the direction of gene flow. Hence, we reanalyze a subset of the data using the Bayesian program bpp, which is a full-likelihood implementation of the multispecies coalescent (MSC) model and can provide more powerful inference of gene flow between species, including its direction, timing, and strength. While our analysis supports the presence of gene flow in the species group, the results differ from the previous study: we infer gene flow between sister lineages undetected previously whereas most gene-flow events inferred in the previous study are rejected in our tests. To verify our conclusions, we performed simulations to examine the properties of Bayesian and summary methods. Bpp was found to have high power to detect gene flow, high accuracy in estimated rates of gene flow, and robustness under misspecification of the mode of gene flow. In contrast, summary methods had low power and produced biased estimates of introgression probability. Our results suggest that likelihood methods under the MSC models of gene flow provide important complements to summary methods for characterizing the rich history of species divergence and introgression using genomic data.
Type Of Material Database/Collection of data 
Year Produced 2024 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.ngf1vhj33
 
Title Hierarchical heuristic species delimitation under the multispecies coalescent model with migration 
Description The multispecies coalescent (MSC) model accommodates genealogical fluctuations across the genome and provides a natural framework for comparative analysis of genomic sequence data to infer the history of species divergence and gene flow. Given a set of populations, hypotheses of species delimitation (and species phylogeny) may be formulated as instances of MSC models (e.g., MSC for one species versus MSC for two species) and compared using Bayesian model selection. This approach, implemented in the program bpp, has been found to be prone to over-splitting. Alternatively, heuristic criteria based on population parameters under the MSC model (such as population/species divergence times, population sizes, and migration rates) estimated from genomic sequence data may be used to delimit species. Here we extend the approach of species delimitation using the genealogical divergence index (gdi) to develop hierarchical merge and split algorithms for heuristic species delimitation and implement them in a Python pipeline called hhsd. Applied to data simulated under a model of isolation by distance, the approach was able to recover the correct species delimitation, whereas model comparison by bpp failed. Analyses of empirical datasets suggest that the procedure may be less prone to over-splitting. We discuss possible strategies for accommodating paraphyletic species in the procedure, as well as the challenges of species delimitation based on heuristic criteria.
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.jm63xsjhc
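For readers unfamiliar with the genealogical divergence index used in the record above, the sketch below computes it in the simplest case of no migration, where it is commonly defined as 1 - exp(-2τ/θ) for divergence time τ and population size parameter θ. The formula and the rule-of-thumb thresholds (0.2 and 0.7) come from the wider species-delimitation literature rather than from this record, the function names are hypothetical, and this is not the hhsd code, which also has to handle migration.

```python
import math


def gdi_no_migration(tau, theta):
    """Genealogical divergence index without migration: 1 - exp(-2 * tau / theta).

    tau   : divergence time, in expected substitutions per site
    theta : population size parameter (4 * N * mu) of the focal population
    """
    if tau < 0 or theta <= 0:
        raise ValueError("tau must be >= 0 and theta must be > 0")
    return 1.0 - math.exp(-2.0 * tau / theta)


def classify(gdi, lower=0.2, upper=0.7):
    """Heuristic reading of a gdi value using commonly cited thresholds."""
    if gdi < lower:
        return "single species"
    if gdi > upper:
        return "distinct species"
    return "ambiguous"


if __name__ == "__main__":
    # Hypothetical parameter values chosen only to exercise each outcome.
    for tau, theta in [(0.001, 0.01), (0.005, 0.01), (0.01, 0.01)]:
        g = gdi_no_migration(tau, theta)
        print(f"tau={tau}, theta={theta}: gdi={g:.3f} -> {classify(g)}")
```

A hierarchical merge procedure of the kind described would apply a test like this to pairs of populations on a guide tree and merge those that fall below the lower threshold; a split procedure works in the opposite direction.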
 
Title The power of coalescent methods for inferring recent and ancient gene flow in endangered Bactrian camels 
Description Genomic sequence data harbour valuable information concerning the history of species divergence and interspecific gene flow, and may offer important insights into conservation of endangered species. However, extracting such information from genomic data requires powerful statistical inference methods. A recent analysis of genomic sequence data found little evidence for gene flow from domestic Bactrian camels into the endangered wild Bactrian species. Nevertheless, the methods used to infer gene flow are based on data summaries and lack the power and precision to represent the complex phylogenetic history of the species with gene flow. Here we apply newly developed Bayesian methods to genomic sequence data to test for both recent and ancient gene flow among the three species in the genus Camelus, and to estimate the strength and timing of gene flow. We detect a strong signal of gene flow from domestic into wild Bactrian camels, confirming early evidence based on mitochondrial DNA and the Y chromosome. Overall, gene flow appears to affect the autosomal genome uniformly, with similar effective rates of gene flow for exonic and noncoding regions. Estimation of species divergence times is seriously affected if gene flow is not accommodated in the analysis. Our results highlight the power of the coalescent model in the analysis of genomic data and the utility of the coding as well as noncoding parts of the genome in elucidating the evolutionary history of modern species.
Type Of Material Database/Collection of data 
Year Produced 2025 
Provided To Others? Yes  
URL https://datadryad.org/dataset/doi:10.5061/dryad.3xsj3txrk