Bayesian inference of the mode of speciation and gene flow using genomic data

Lead Research Organisation: University College London
Department Name: Genetics Evolution and Environment

Abstract

Genomic sequences from contemporary species contain rich information about the history of species divergence and the origin of species. By comparing DNA sequences from different species we can work out how the species are related, when they diverged from each other, and whether they have been exchanging genes. However, powerful statistical models and efficient computational algorithms are necessary to extract this information from genomic data. The multispecies coalescent model provides a natural framework for comparative analysis of genomic sequence data, as it accommodates the random fluctuations of biological reproduction as genetic material is passed from one generation to the next, the random accumulation of mutations, and possible hybridisation or introgression between species. We will extend our Bayesian inference program BPP to allow continuous gene flow between species, covering isolation-with-migration, isolation-with-initial-migration, and secondary-contact models as well as complete isolation. These models represent different biological hypotheses about the modes of speciation, and can be compared to further our understanding of the speciation process. The Bayesian methods to be developed in the project extract information from genomic datasets efficiently while accommodating the important biological processes involved, such as polymorphism in ancestral species and uncertainty in the gene genealogical trees caused by the limited information in sequence data from a short genomic segment. We will apply our newly developed methods to analyse genomic datasets from the big cats in the Panthera genus, Heliconius butterflies, Malagasy mouse lemurs, and North-American lizards, generated by our collaborators. We will infer the history of species divergences, the direction and timing of gene flow, and evolutionary parameters such as species divergence times and population sizes.

Technical Summary

We will implement the multispecies-coalescent-with-migration (MSC-M) model in our Bayesian MCMC program BPP, including variants that represent different biological hypotheses about speciation, such as isolation with migration (IM), isolation with initial migration (IIM), and secondary contact (SC).

We will develop within-model MCMC algorithms to update parameters in the MSC-M model. These include MCMC proposals to change the migration rates and the migration times on the gene trees, and the gene-tree SPR move to average over the gene tree at each locus. We will modify our rubber-band algorithm for changing the species divergence time so that it works in the presence of migration events on the gene trees.
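As an illustration of what a within-model proposal for a migration rate might look like, the following is a minimal sketch and not BPP's implementation. It assumes that, conditional on the gene trees and their migration histories, the migration rate w enters the likelihood only through the factor w^n exp(-w T), where n is the number of migration events and T is the total lineage-time exposed to migration, and that w has a gamma prior. All function and parameter names are hypothetical. The move is a Metropolis-Hastings update with a multiplier (log-scale sliding window) proposal.

```python
import math
import random


def log_posterior(w, n_events, exposure, prior_shape, prior_rate):
    """Unnormalised log posterior of the migration rate w.

    Assumed likelihood factor: w**n_events * exp(-w * exposure).
    Assumed prior: Gamma(prior_shape, prior_rate).
    """
    if w <= 0.0:
        return -math.inf
    return ((prior_shape - 1.0 + n_events) * math.log(w)
            - (prior_rate + exposure) * w)


def sample_migration_rate(n_events, exposure, prior_shape=2.0, prior_rate=10.0,
                          n_iter=20000, step=1.0, seed=1):
    """Metropolis-Hastings sampler for w using a multiplier proposal."""
    rng = random.Random(seed)
    w = prior_shape / prior_rate  # start at the prior mean
    samples = []
    for _ in range(n_iter):
        # Propose w' = w * c with c = exp(step * (u - 0.5)), u ~ U(0, 1).
        c = math.exp(step * (rng.random() - 0.5))
        w_new = w * c
        # The proposal (Hastings) ratio for the multiplier move is c = w'/w.
        log_ratio = (log_posterior(w_new, n_events, exposure, prior_shape, prior_rate)
                     - log_posterior(w, n_events, exposure, prior_shape, prior_rate)
                     + math.log(c))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            w = w_new
        samples.append(w)
    return samples


if __name__ == "__main__":
    draws = sample_migration_rate(n_events=12, exposure=30.0)
    kept = draws[2000:]  # discard burn-in
    print("posterior mean estimate:", sum(kept) / len(kept))
```

Under these assumptions the posterior is conjugate, Gamma(prior_shape + n, prior_rate + T), so the sampled mean can be checked against (prior_shape + n) / (prior_rate + T). In a full sampler the counts n and T would themselves change as the gene trees and migration histories are updated by the other moves.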

We will develop cross-model MCMC algorithms to explore different MSC-M models. We will implement a proposal to insert or delete a migration rate parameter when the species tree is fixed, and another move that extends our species-tree NNI/SPR algorithms to work under MSC-M. The latter algorithm involves identifying so-called affected nodes on the gene trees, which are pruned off and reattached to the gene-tree backbone by simulation.

The migration and introgression models implemented in BPP represent different biological hypotheses concerning the speciation process, and can be compared using genomic sequence data to learn about speciation. We will develop models of variable rates of gene flow across the genome, which can be used to identify genomic regions with extreme migration rates or extreme genealogical trees as candidates for adaptive introgression. We will apply our newly developed methods to analyse genomic datasets from Heliconius butterflies, Malagasy mouse lemurs, North-American lizards, and big cats in the Panthera genus, generated by our collaborators. We hope to resolve long-standing phylogenetic problems in those species groups.