Efficient Bayesian phylogenomic dating with new models of trait evolution and rich diversities of living and fossil species

Lead Research Organisation: Queen Mary University of London
Department Name: Sch of Biological & Behavioural Sciences

Abstract

As species diverge, they accumulate nucleotide substitutions in their genomes at a rate approximately constant in time. Thus, substitutions serve as timepieces to infer species divergences. By incorporating information from the fossil record, the inferred speciation timings can be calibrated to geological time. This method, known as molecular-clock dating, has broad applications in evolutionary biology, such as studying the timing of spread of viral pandemics, ancient rates of diversification in animals and plants, the relationship of species evolution with past climate or extinction events, human evolution, or the origin of agriculture and animal domestication. Indeed, evolutionary timetrees provide much richer information about species histories than trees without temporal information, thus allowing the formulation and testing of hypotheses on evolutionary timescales.
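The clock idea above can be made concrete with a small worked calculation. The sketch below is illustrative only: it assumes a strict (constant-rate) clock and a single fossil-calibrated species pair, and the function names and all numerical inputs are hypothetical rather than taken from this project.

```python
def calibrate_rate(distance, fossil_age_myr):
    """Substitution rate (per site per Myr) from one calibrated species pair.

    The pairwise distance accumulates along both lineages since the split,
    hence the factor of 2.
    """
    if fossil_age_myr <= 0:
        raise ValueError("calibration age must be positive")
    return distance / (2.0 * fossil_age_myr)


def divergence_time(distance, rate):
    """Divergence time (Myr) of an uncalibrated pair under a strict clock."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical inputs, chosen only for illustration:
# a calibrated pair 0.20 substitutions/site apart that split 100 Myr ago,
# and an uncalibrated pair 0.05 substitutions/site apart.
rate = calibrate_rate(distance=0.20, fossil_age_myr=100.0)
age = divergence_time(distance=0.05, rate=rate)
print(f"rate = {rate:.4f} substitutions/site/Myr, age = {age:.1f} Myr")
```

With these made-up numbers the calibrated rate is 0.001 substitutions per site per Myr and the second pair is dated at about 25 Myr. Real analyses replace the point estimates with probability distributions over rates, times and fossil ages.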

Currently, Bayesian methods are the state of the art in molecular-clock dating as they allow flexible modelling of evolutionary processes and integration of fossil uncertainties in the analysis. Advances in Bayesian clock-dating include stochastic models of rate variation among lineages (so-called relaxed-clock models), modelling of trait evolution in extant and extinct taxa, and development of "soft-bounds" and flexible fossil calibration densities. While these advances have made the Bayesian method attractive for clock-dating, Bayesian computation relies on MCMC sampling, which requires computationally expensive stochastic simulation and so precludes use of the Bayesian method for analysis of large-scale datasets. This is unfortunate since large-scale molecular datasets are now commonplace: several high-throughput genome sequencing projects have been announced or are in progress, and we expect a flood of genome-scale data for several thousand species (e.g. the 10K animal genomes and the UK's 66K eukaryotic genomes projects). This deluge of genome data has been accompanied by an explosive increase in the number of morphological datasets, driven by a computational revolution in comparative anatomy: the widespread deployment of X-ray tomography and photogrammetry has produced vast databases of trait data, and MorphoBank and Phenome10K now store over 64,200 surface scans for over 7,000 species. Computational tools capable of exploiting these newly generated datasets are now urgently required. For example, with current methods, inference of a 66K-species timetree would require at least 55 years of computing time (extrapolating from some of our previous analyses). Evidently, the efficiency of analytic methods has not kept pace with the volume of data available and increasingly required to tackle large-scale questions in evolutionary biology.

In this project we will overcome two major challenges in Bayesian clock dating of species divergences: (i) the mixing and computational limitations of MCMC algorithms in analyses of large datasets, and (ii) the limitations of current trait models of evolution in timetree inference. We will design novel MCMC algorithms that improve mixing efficiency by drawing on new ideas in MCMC algorithm design, and we will improve computational efficiency through code improvement and parallelisation. We will incorporate advanced trait models to infer timetrees of extant and fossil species. In particular, we will adapt trait models to analyse large genomic trait datasets such as RNA-seq expression data. The newly developed algorithms will be implemented in our MCMCtree software, and applied to several large-scale empirical datasets with densely sampled extant and fossil species. The data analyses will provide important motivations for method development and serve to showcase our new software by addressing fundamental questions in evolutionary biology. Our proposal addresses the BBSRC's strategic priorities of "data driven biology" and "systems approaches to the biosciences".

Technical Summary

Timetrees provide much richer information than undated trees about patterns of species diversification and their associations with past climate and major geological events. Bayesian relaxed-clock dating is the method of choice for deriving timetrees as it naturally integrates information from molecules and fossils. However, the method relies on stochastic MCMC sampling of the posterior distribution of times and rates and is computationally too expensive for large datasets. Recently, several large-scale genome sequencing projects have been announced, such as the 66,000 UK eukaryotic genomes project. This genomic revolution is now accompanied by a computed tomography (CT) revolution that is generating vast amounts of scan data for thousands of museum specimens. Thus, methods that can integrate analysis of genomic data from high-throughput sequencing projects with trait data from CT scans are urgently needed. The main advantage of CT-scan data is that the rich diversities of fossil species in museum collections can now be integrated in the dating analysis, providing a more robust calibration of the molecular clock and improving the amount of information in timetrees about past diversification events. The two main aims of this project are: (i) to improve the computational efficiency of MCMC sampling in timetree inference in large phylogenies and (ii) to develop new models of trait evolution for co-analysis of genomic and trait data. We will achieve (i) by developing new proposal algorithms to improve the mixing efficiency of the MCMC, and by improving the C code in the MCMCtree software through vectorisation and parallelisation. Our preliminary data indicate we can reduce computing time 2- to 5-fold, bringing analysis of thousands of species within reach. The new software and models will be tested in several high-profile real-data analyses.
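To illustrate why the choice of proposal matters for mixing, the following is a minimal sketch of a Metropolis sampler for a single node age. It is not the MCMCtree implementation: the sliding-window proposal, the simplified "soft-bound" prior (flat between two bounds with exponentially decaying tails, unnormalised), the normal approximation to the likelihood, and every name and number in it are assumptions made for illustration.

```python
import math
import random


def log_soft_bound_prior(t, lo, hi, tail_scale=2.0):
    """Unnormalised log prior: flat on [lo, hi], exponential tails outside.

    A simplified stand-in for a soft-bound fossil calibration
    (assumption, not the density used by any particular program).
    """
    if t <= 0.0:
        return -math.inf
    if t < lo:
        return -(lo - t) / tail_scale
    if t > hi:
        return -(t - hi) / tail_scale
    return 0.0


def log_likelihood(t, distance, rate, sd):
    """Normal approximation: observed distance ~ N(2 * rate * t, sd^2)."""
    expected = 2.0 * rate * t
    return -0.5 * ((distance - expected) / sd) ** 2


def metropolis(n_iter, step, t0, lo, hi, distance, rate, sd, seed=1):
    """Sample a node age t; returns (samples, acceptance_rate)."""
    rng = random.Random(seed)
    t = t0
    log_post = log_soft_bound_prior(t, lo, hi) + log_likelihood(t, distance, rate, sd)
    samples = []
    accepted = 0
    for _ in range(n_iter):
        # Symmetric sliding-window proposal of total width `step`.
        t_new = t + rng.uniform(-step / 2.0, step / 2.0)
        log_post_new = (log_soft_bound_prior(t_new, lo, hi)
                        + log_likelihood(t_new, distance, rate, sd))
        # Metropolis acceptance: accept with probability min(1, ratio).
        if rng.random() < math.exp(min(0.0, log_post_new - log_post)):
            t, log_post = t_new, log_post_new
            accepted += 1
        samples.append(t)
    return samples, accepted / n_iter


# Hypothetical settings: bounds of 20-30 Myr, distance 0.05, rate 0.001/site/Myr.
settings = dict(t0=25.0, lo=20.0, hi=30.0, distance=0.05, rate=0.001, sd=0.005)
for step in (0.5, 5.0, 50.0):
    samples, acc = metropolis(n_iter=20000, step=step, **settings)
    kept = samples[2000:]  # discard burn-in
    print(f"step={step:5.1f}  acceptance={acc:.2f}  mean age={sum(kept) / len(kept):.2f}")
```

Very small steps are accepted often but move slowly, while very large steps are mostly rejected; both give highly autocorrelated samples. Tuning or redesigning the proposal is one route to better mixing, separate from speeding up each likelihood evaluation.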

Planned Impact

We will implement the methods and algorithms to be developed in this project in the MCMCTREE program in the PAML software package, and distribute it at its web site, free of charge to academics. We aim to disseminate our new models and software in accordance with the Data Driven Biology and Systems Approaches to the Biosciences BBSRC priority areas. In particular, our new software will allow the analysis of very large datasets from complex phylogenetic ensembles. We will champion integration of rich fossil diversities together with high-throughput sequencing data to infer evolutionary timelines in large phylogenies, thus providing the tools urgently required to analyse the explosive amounts of sequence and phenotype trait data now available.

We will attend national and international meetings to present our research results. Methodological advances will be disseminated in this way, as well as through teaching in the world-leading MSc Palaeobiology at Bristol, and the advanced workshop on Computational Molecular Evolution (funded by the Wellcome Trust and EMBO) that is organized and co-instructed by Yang. These courses will provide much-needed training to our academic beneficiaries on how to use our software and models. We will apply for funds from the Royal Society to run a 2-day Discussion Meeting in London (which is open to scientists and the general public) and an associated satellite workshop at the Royal Society's Chicheley Hall. The focus will be on integrating biological and geological timescales to elucidate the co-evolution of Earth and Life. The Chicheley Hall workshop will aim to train evolutionary biologists, bioinformaticians, palaeontologists and Earth System modellers to conduct molecular clock dating (and Earth System Modelling) using cutting-edge methods, showcasing the new models and new algorithms to be developed in this project. We will research and design a schools outreach module on the tree of life, evolutionary timescales and evolutionary rates, to be delivered through the GeoBus and Bristol Dinosaur Project STEM engagement projects, and we will make the teaching materials freely available to science teachers. We will engage the broader public in our science and its deliverables through a science-art collaboration, achieved by hosting an artist in residence at Bristol University and a touring display of their work.

Publications

Description We developed new fast computational methodologies to estimate evolutionary timelines using large genomic datasets. We applied our new technology to unravel the pattern of diversification of mammals with respect to the end-Cretaceous mass extinction event.
Exploitation Route Biologists can use our software to infer species evolutionary timelines using their genomic datasets.
Sectors Environment, Other

 
Title Data for "A Species-Level Timeline of Mammal Evolution Integrating Phylogenomic Data" 
Description Supplementary data (alignments, phylogenies, etc.) for "A Species-Level Timeline of Mammal Evolution Integrating Phylogenomic Data" by Álvarez-Carretero et al.
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact The molecular alignment of genetic data for mammals will be useful to academic beneficiaries seeking to estimate mammal evolutionary rates and mammal phylogeny. The inferred species-level, time-calibrated phylogeny will be useful to academic beneficiaries doing macroevolutionary studies. 
URL https://figshare.com/articles/dataset/Data_for_A_Species-Level_Timeline_of_Mammal_Evolution_Integrat...