Improving Bayesian methods for estimating divergence times integrating genomic and trait data

Lead Research Organisation: University of Bristol
Department Name: Earth Sciences

Technical Summary

Molecular clock dating methods have recently been improved to accommodate violations of the clock, through relaxed-clock models, and to incorporate uncertainties in fossil calibrations, through soft bounds. Yet representing the errors and uncertainties of the fossil record in a molecular dating analysis remains a challenging task. In this project, we will implement models of trait evolution to conduct Bayesian MCMC analysis of morphological traits in fossil and extant species. The resulting posterior distribution of divergence times will be used as calibration densities for molecular clock dating. The new models and methods will be implemented in the MCMCtree program in the PAML package, and will be applied to large datasets to date divergence events in metazoans, in hominoids and primates, and in flowering plants. We will also analyse skull measurements of fossil and modern hominoid species to generate posterior estimates of hominoid divergence times, which will be used in a multispecies coalescent analysis of hominoid genomic sequence data to estimate the human-chimpanzee divergence time and the mutation rate. Our mutation rate estimates will be vitally important for testing hypotheses concerning the origin and migration patterns of modern humans. We will use the same trait-evolution models to analyse viral phenotype (such as influenza virus epitopes) and its correlation with the evolutionary rate of the bird flu protein hemagglutinin.
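
To illustrate what a soft-bounded fossil calibration can look like, the sketch below evaluates a calibration density that is flat between a minimum and a maximum age and lets a small probability mass leak beyond each bound. The tail shapes (a power-law tail below the minimum, an exponential tail above the maximum), the function name and the default tail probability of 0.025 are assumptions chosen for illustration; this is not taken from the MCMCtree source and is not the project's implementation.

```python
import math


def soft_bound_density(t, t_lower, t_upper, tail=0.025):
    """Density of a uniform calibration with soft lower and upper bounds.

    Illustrative assumption: a power-law tail below t_lower and an
    exponential tail above t_upper, each carrying probability `tail`.
    """
    if t <= 0:
        return 0.0
    width = t_upper - t_lower
    core = (1.0 - 2.0 * tail) / width  # flat density between the bounds
    if t <= t_lower:
        # theta1 is chosen so the density is continuous at t_lower
        theta1 = core * t_lower / tail
        return tail * (theta1 / t_lower) * (t / t_lower) ** (theta1 - 1.0)
    if t <= t_upper:
        return core
    # theta2 is chosen so the density is continuous at t_upper
    theta2 = core / tail
    return tail * theta2 * math.exp(-theta2 * (t - t_upper))


def trapezoid_integral(f, a, b, n=20000):
    """Trapezoid-rule integral of f on [a, b], used only as a sanity check."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


if __name__ == "__main__":
    # Hypothetical calibration: minimum age 6, maximum age 10 (arbitrary time units).
    for age in (5.5, 6.0, 8.0, 10.0, 10.5):
        print(age, soft_bound_density(age, 6.0, 10.0))
    print(trapezoid_integral(lambda t: soft_bound_density(t, 6.0, 10.0), 0.0, 20.0))
```

Because the tails never reach zero, a posterior estimate can fall outside the stated bounds when the sequence data strongly conflict with the fossil evidence, which a hard-bounded uniform density would not allow.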

Planned Impact

We will implement the methods and algorithms developed in this project in the MCMCtree program in the PAML software package, and distribute it through the PAML web site, free of charge to academics. We will also develop a project-specific website, including YouTube-hosted video manuals for the software.

We will attend local and international meetings to present our research results. Methodological advances will be disseminated in this way, as well as through teaching in the world-leading MSc Palaeobiology at Bristol, and the advanced workshop on Computational Molecular Evolution (funded by the Wellcome Trust and the EMBO) that is organized and co-instructed by Yang.

Publications

 
Title Data & Scripts from Beavan et al. (2020): Performance of a priori and a posteriori calibration strategies in divergence time estimation. 
Description The data and scripts include all information necessary to recreate the simulated data of Beavan et al. (2020). They also include all the parameters and scripts used to analyse the data, and the original alignments used in the study. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://data.bris.ac.uk/data/dataset/uopumskkuech206ueqdxpcrif/
 
Title Data from: Anatomy of the Ediacaran rangeomorph Charnia masoni 
Description The Ediacaran macrofossil Charnia masoni Ford is perhaps the most iconic member of the Rangeomorpha: a group of seemingly sessile, frondose organisms that dominates late Ediacaran benthic, deep-marine fossil assemblages. Despite C. masoni exhibiting broad palaeogeographical and stratigraphical ranges, there have been few morphological studies that consider the variation observed among populations of specimens derived from multiple global localities. We present an analysis of C. masoni that evaluates specimens from the UK, Canada and Russia, representing the largest morphological study of this taxon to date. We describe substantial morphological variation within C. masoni and present a new morphological model for this species that has significant implications both for interpretation of rangeomorph architecture, and potentially for existing taxonomic schemes. Previous reconstructions of Charnia include assumptions regarding the presence of structures seen in other rangeomorphs (e.g. an internal stalk) and of homogeneity in higher order branch morphology; observations that are not borne out by our investigations. We describe variation in the morphology of third and fourth order branches, as well as variation in gross structure near the base of the frond. The diagnosis of Charnia masoni is emended to take account of these new features. These findings highlight the need for large-scale analyses of rangeomorph morphology in order to better understand the biology of this long-enigmatic group. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.fg14s2r
 
Title Data from: Bayesian and likelihood phylogenetic reconstructions of morphological traits are not discordant when taking uncertainty into consideration: a comment on Puttick et al 
Description Puttick et al. (2017 Proc. R. Soc. B 284, 20162290 (doi:10.1098/rspb.2016.2290)) performed a simulation study to compare accuracy among methods of inferring phylogeny from discrete morphological characters. They report that a Bayesian implementation of the Mk model (Lewis 2001 Syst. Biol. 50, 913-925 (doi:10.1080/106351501753462876)) was most accurate (but with low resolution), while a maximum-likelihood (ML) implementation of the same model was least accurate. They conclude by strongly advocating that Bayesian implementations of the Mk model should be the default method of analysis for such data. While we appreciate the authors' attempt to investigate the accuracy of alternative methods of analysis, their conclusion is based on an inappropriate comparison of the ML point estimate, which does not consider confidence, with the Bayesian consensus, which incorporates estimation credibility into the summary tree. Using simulation, we demonstrate that ML and Bayesian estimates are concordant when confidence and credibility are comparably reflected in summary trees, a result expected from statistical theory. We therefore disagree with the conclusions of Puttick et al. and consider their prescription of any default method to be poorly founded. Instead, we recommend caution and thoughtful consideration of the model or method being applied to a morphological dataset. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.dh0dv
 
Title Data from: Bayesian methods outperform parsimony but at the expense of precision in the estimation of phylogeny from discrete morphological data 
Description Different analytical methods can yield competing interpretations of evolutionary history and, currently, there is no definitive method for phylogenetic reconstruction using morphological data. Parsimony has been the primary method for analysing morphological data, but there has been a resurgence of interest in the likelihood-based Mk-model. Here, we test the performance of the Bayesian implementation of the Mk-model relative to both equal and implied-weight implementations of parsimony. Using simulated morphological data, we demonstrate that the Mk-model outperforms equal-weights parsimony in terms of topological accuracy, and implied-weights performs the most poorly. However, the Mk-model produces phylogenies that have less resolution than parsimony methods. This difference in the accuracy and precision of parsimony and Bayesian approaches to topology estimation needs to be considered when selecting a method for phylogeny reconstruction. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.10qf3
 
Title Data from: Bayesian methods outperform parsimony but at the expense of precision in the estimation of phylogeny from discrete morphological data 
Description Simulated data matrices from 'Bayesian methods outperform parsimony but at the expense of precision in the estimation of phylogeny from discrete morphological data' 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
URL http://datadryad.org/resource/doi:10.5061/dryad.10qf3/1
 
Title Data from: Evolution of fungal phenotypic disparity 
Description Organismal grade multicellularity has been achieved only in animals, plants, and fungi. All three kingdoms manifest phenotypically disparate body plans, but their evolution has only been considered in detail for animals. Here we seek to test the general relevance of hypotheses on the evolution of animal body plans by characterising the evolution of fungal phenotypic variety (disparity). The distribution of living fungal form is defined by four distinct morphotypes: flagellated, zygomycetous, sac-bearing, and club-bearing. The discontinuity between morphotypes is a consequence of the extinction of phylogenetic intermediates, indicating that a complete record of fungal disparity would present a much more homogeneous distribution of form. Fungal phenotypic variety gradually expands through time for the most part but sharply increases with the emergence of multicellular body plans. Simulations show these temporal trends to be decidedly non-random, and at least partially shaped by hierarchical contingency. Fungal phenotypic distance is decoupled from changes in gene number, genome size, and taxonomic diversity. Only differences in organismal complexity, the number of traits that constitute an organism, at the cellular and multicellular levels present a meaningful relationship with fungal disparity. Both animals and fungi exhibit a gradual increase in disparity through time, resulting in distributions of form made discontinuous by the extinction of phylogenetic intermediates. These congruences hint at a common mode of multicellular body plan evolution. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
URL http://datadryad.org/stash/dataset/doi:10.5061/dryad.wwpzgmsm9
 
Title Data from: Probabilistic methods outperform parsimony in the phylogenetic analysis of data simulated without a probabilistic model 
Description In order to understand patterns and processes of the diversification of life we require an accurate understanding of taxa interrelationships. Recent studies have suggested that analyses of morphological character data using the Bayesian and Maximum likelihood Mk model provide phylogenies of higher accuracy compared to parsimony methods. These studies have proved controversial, particularly simulating morphology-data under Markov models that assume shared branch lengths for characters, as it is claimed this leads to bias favouring the Bayesian or Maximum likelihood Mk model over parsimony models which do not explicitly make this assumption. We avoid these potential issues by employing a simulation protocol in which character states are randomly assigned to tips, but datasets are constrained to an empirically-realistic distribution of homoplasy as measured by the Consistency Index. Datasets were analysed with equal-weights and implied weights parsimony, and the Maximum Likelihood and Bayesian Mk model. We find that consistent (low homoplasy) datasets render method choice largely irrelevant, as all methods perform well with high consistency (low homoplasy) datasets, but the largest discrepancies in accuracy occur with low consistency datasets (high homoplasy). In such cases, the Bayesian Mk model is significantly more accurate than alternative models, and Implied weights parsimony never significantly out-performs the Bayesian Mk model. When poorly-supported branches are collapsed, the Bayesian Mk model recovers trees with higher resolution compared to other methods. Since it is not possible to assess homoplasy independently of a tree estimate, the Bayesian Mk model emerges as the most reliable method for categorical morphological analyses. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.h8r2629
 
Title Data from: Probabilistic methods surpass parsimony when assessing clade support in phylogenetic analyses of discrete morphological data 
Description Fossil taxa are critical to inferences of historical diversity and the origins of modern biodiversity, but realizing their evolutionary significance is contingent on restoring fossil species to their correct position within the tree of life. For most fossil species, morphology is the only source of data for phylogenetic inference; this has traditionally been analysed using parsimony, the predominance of which is currently challenged by the development of probabilistic models that achieve greater phylogenetic accuracy. Here, based on simulated and empirical datasets, we explore the relative efficacy of competing phylogenetic methods in terms of clade support. We characterize clade support using bootstrapping for parsimony and Maximum Likelihood, and intrinsic Bayesian posterior probabilities, collapsing branches that exhibit less than 50% support. Ignoring node support, Bayesian inference is the most accurate method in estimating the tree used to simulate the data. After assessing clade support, Bayesian and Maximum Likelihood exhibit comparable levels of accuracy, and parsimony remains the least accurate method. However, Maximum Likelihood is less precise than Bayesian phylogeny estimation, and Bayesian inference recaptures more correct nodes with higher support compared to all other methods, including Maximum Likelihood. We assess the effects of these findings on empirical phylogenies. Our results indicate probabilistic methods should be favoured over parsimony. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.8dd39
 
Title Data from: Testing the molecular clock using mechanistic models of fossil preservation and molecular evolution 
Description Molecular sequence data provide information about relative times only, and fossil-based age constraints are the ultimate source of information about absolute times in molecular clock dating analyses. Thus, fossil calibrations are critical to molecular clock dating, but competing methods are difficult to evaluate empirically because the true evolutionary time scale is never known. Here, we combine mechanistic models of fossil preservation and sequence evolution in simulations to evaluate different approaches to constructing fossil calibrations and their impact on Bayesian molecular clock dating, and the relative impact of fossil versus molecular sampling. We show that divergence time estimation is impacted by the model of fossil preservation, sampling intensity and tree shape. The addition of sequence data may improve molecular clock estimates, but accuracy and precision is dominated by the quality of the fossil calibrations. Posterior means and medians are poor representatives of true divergence times; posterior intervals provide a much more accurate estimate of divergence times, though they may be wide and often do not have high coverage probability. Our results highlight the importance of increased fossil sampling and improved statistical approaches to generating calibrations, which should incorporate the non-uniform nature of ecological and temporal fossil species distributions. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.5706p
 
Title Data from: Testing the molecular clock using mechanistic models of fossil preservation and molecular evolution 
Description This repository contains simulated data and Bayesian MCMC output from "Testing the molecular clock using mechanistic models of fossil preservation and molecular evolution" by Rachel CM Warnock, Ziheng Yang and Philip CJ Donoghue (2017), Proc. R. Soc. B 284 (1857). This data is associated with the following paper: http://rspb.royalsocietypublishing.org/content/284/1857/20170227. This data is also associated with code available on dryad: http://datadryad.org/resource/doi:10.5061/dryad.5706p. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title Data from: The effect of fossil sampling on the estimation of divergence times with the fossilised birth death process 
Description Timescales are of fundamental importance to evolutionary biology as they facilitate hypothesis tests of historical evolutionary processes. Through the incorporation of fossil occurrence data, the fossilised birth-death (FBD) process provides a framework for estimating divergence times using more palaeontological data than traditional node calibration approaches have allowed. The inclusion of more data can refine evolutionary timescale estimates, but for many taxonomic groups it is computationally infeasible to include all fossil occurrence data. Here, we utilise both empirical data and a simulation framework to identify approaches to subsampling fossil occurrence data that result in the most accurate estimates of divergence times. To achieve this we assess the performance of the FBD-Skyline model when implementing multiple approaches to incorporating subsampled fossil occurrences. Our results demonstrate that it is necessary to account for all available fossil occurrence data to achieve the most accurate estimates of clade age. We show that this can be achieved if an empirical Bayes approach to account for fossil sampling through time is applied to the FBD process. Random subsampling of occurrence data can lead to estimates of clade age that are incompatible with fossil evidence if no control over the affinities of fossil occurrences is enforced. Our results call into question the accuracy of previous divergence time studies incorporating the FBD process that have used only a subsample of all available fossil occurrence data. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.g7s0hk3
 
Title Data from: The efficacy of consensus tree methods for summarising phylogenetic relationships from a posterior sample of trees estimated from morphological data 
Description Consensus trees are required to summarise trees obtained through MCMC sampling of a posterior distribution, providing an overview of the distribution of estimated parameters such as topology, branch lengths and divergence times. Numerous consensus tree construction methods are available, each presenting a different interpretation of the tree sample. The rise of morphological clock and sampled-ancestor methods of divergence time estimation, in which times and topology are co-estimated, has increased the popularity of the maximum clade credibility (MCC) consensus tree method. The MCC method assumes that the sampled, fully resolved topology with the highest clade credibility contains an adequate summary of the most probable clades, with parameter estimates from compatible sampled trees used to obtain the marginal distributions of parameters such as clade ages and branch lengths. Using both simulated and empirical data, we demonstrate that MCC trees, and trees constructed using the similar maximum a posteriori (MAP) method, often include poorly supported and incorrect clades when summarising diffuse posterior samples of trees. We demonstrate that the paucity of information in morphological datasets contributes to the inability of MCC and MAP trees to present an accurate summary of the posterior distribution. Conversely, majority-rule consensus (MRC) trees report a lower proportion of incorrect nodes when summarising the same posterior samples of trees. Thus, we advocate the use of MRC trees, in place of MCC or MAP trees, in attempts to summarise the results of Bayesian phylogenetic analyses of morphological data. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.66s9h
 
Title Data from: The impact of fossil stratigraphic ranges on tip-calibration, and the accuracy and precision of divergence time estimates 
Description The molecular clock provides the only viable means of establishing realistic evolutionary timescales but it remains unclear how best to calibrate divergence time analyses. Calibrations can be applied to the tips and/or to the nodes of a phylogeny. Tip-calibration is an attractive approach since it allows fossil species to be included alongside extant relatives in molecular clock analyses. However, most fossil species are known from multiple stratigraphical horizons and it remains unclear how such age ranges should be interpreted to codify tip-calibrations. We use simulations and empirical data to explore the impact on precision and accuracy of different approaches to informing tip-calibrations. In particular, we focus on the effect of using tip-calibrations defined using the oldest vs youngest stratigraphic occurrences, the full stratigraphical range, as well as confidence intervals on these data points. The results of our simulations show that using different calibration approaches leads to different divergence-time estimates and demonstrate that concentrating tip-calibrations near the root of the dated phylogeny improves both precision and accuracy of estimated divergence times. Finally, our results indicate that the highest levels of accuracy and precision are achieved when fossil tips are calibrated based on the fossil occurrence from which the morphological data were derived. These trends were corroborated by analysis of an empirical dataset for Ursidae. Overall, we conclude that tip-dating analyses should, in particular, employ tip calibrations close to the root of the tree and they should be calibrated based on the age of the fossil used to inform the morphological data used in Total Evidence Dating. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.7kh57t5
 
Title Data from: Tips and nodes are complementary not competing approaches to the calibration of molecular clocks 
Description Molecular clock methodology provides the best means of establishing evolutionary timescales, the accuracy and precision of which remain reliant on calibration, traditionally based on fossil constraints on clade (node) ages. Tip calibration has been developed to obviate undesirable aspects of node calibration, including the need for maximum age constraints that are invariably very difficult to justify. Instead, tip calibration incorporates fossil species as dated tips alongside living relatives, potentially improving the accuracy and precision of divergence time estimates. We demonstrate that tip calibration yields node calibrations that violate fossil evidence, contributing to unjustifiably young and ancient age estimates, less precise and (presumably) accurate than conventional node calibration. However, we go on to show that node and tip calibrations are complementary, producing meaningful age estimates, with node minima enforcing realistic ages and fossil tips interacting with node calibrations to objectively define maximum age constraints on clade ages. Together, tip and node calibrations may yield evolutionary timescales that are better justified, more precise and accurate than either calibration strategy can achieve alone. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
URL https://datadryad.org/stash/dataset/doi:10.5061/dryad.2q3k2
 
Title Electronic supplementary material from Parsimony and maximum-likelihood phylogenetic analyses of morphology do not generally integrate uncertainty in inferring evolutionary history. A response to Brown et al. 
Description Literature reviewed 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Electronic_supplementary_material_from_Parsimony_and_maximu...
 
Title Literature reviewed from Parsimony and maximum-likelihood phylogenetic analyses of morphology do not generally integrate uncertainty in inferring evolutionary history: a response to Brown et al.
Description ESM 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Electronic_supplementary_material_from_Parsimony_and_maximu...
 
Title Puttick_et_al_R_script from Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data 
Description The R script used to generate the simulated data on which this analysis was based 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
URL https://rs.figshare.com/articles/dataset/Puttick_et_al_R_script_from_Uncertain-tree_discriminating_a...
 
Description Conference presentation 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Conference presentation
Year(s) Of Engagement Activity 2017
 
Description Elizabeth Pennisi: Yu et al. (2024) NEE Science Magazine https://www.science.org/content/article/slimy-hagfish-help-solve-mysteries-genome-duplication 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Interview with a journalist resulting in the following output:

Elizabeth Pennisi: Yu et al. (2024) NEE Science Magazine
https://www.science.org/content/article/slimy-hagfish-help-solve-mysteries-genome-duplication
Year(s) Of Engagement Activity 2024
URL https://www.science.org/content/article/slimy-hagfish-help-solve-mysteries-genome-duplication
 
Description Press release and associated interviews 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact Press releases associated with three consecutive papers and associated interviews
Year(s) Of Engagement Activity 2018
 
Description Public 'Great Debate' at Oxford University Museum of Natural History on 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact 250 people, who had booked ahead, were present on the evening, with others watching the event streamed and still others watching it offline. E-polls taken before and after the event, on the timing and nature of the Cambrian Explosion, showed that people had changed their views.
Year(s) Of Engagement Activity 2020
URL https://oumnh.ox.ac.uk/event/the-first-animals-when-where-and-how
 
Description Westbury on Trym C of E primary Academy 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Science Week school presentation
Year(s) Of Engagement Activity 2017