# Representation and Incorporation of Fossil Data in Molecular Dating of Species Divergences

Lead Research Organisation:
University of Bristol

Department Name: Earth Sciences

### Abstract

If our genes accumulate changes over time at a constant rate, the genetic distance between two species, measured by the number of changes accumulated, will be proportional to the time of species divergence. Thus molecules can serve as a clock, keeping time of species divergence by the accumulated changes. If fossil records or geological events can be used to assign an absolute geological time to a species divergence event on the phylogenetic tree, one can convert all calculated genetic distances into absolute geological times. This rationale for molecular clock dating has recently been extended to deal with local variation in evolutionary rate. Critical to molecular dating is the use of fossil information to calibrate the clock. In this project, we will develop statistical models and computer algorithms to accurately represent and incorporate fossil calibration information in molecular dating analysis. We will also implement models that explicitly consider errors in fossil calibrations. The new methods will be applied to analyze large sequence datasets to estimate divergence times among primates and vertebrates.
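
The clock rationale above can be made concrete with a small worked example. The sketch below assumes a strict clock and the common convention that the distance between two species is twice the rate times the divergence time, because both lineages accumulate changes after the split. The function names and all numbers are hypothetical and are not taken from the project.

```python
def clock_rate(distance, calibration_age):
    """Rate (substitutions per site per Myr) implied by one calibrated split.

    Assumes a strict clock with distance = 2 * rate * time, since both
    descendant lineages accumulate changes after the divergence.
    """
    return distance / (2.0 * calibration_age)


def divergence_time(distance, rate):
    """Convert a genetic distance into an absolute age under the same clock."""
    return distance / (2.0 * rate)


# Hypothetical calibration: a fossil dates one split at 30 Myr, and the two
# descendant species differ by 0.12 substitutions per site.
rate = clock_rate(distance=0.12, calibration_age=30.0)

# Hypothetical uncalibrated pair with half that distance: under a strict
# clock its divergence is half as old as the calibrated one.
t = divergence_time(distance=0.06, rate=rate)
print(f"rate = {rate:.4f} per site per Myr, divergence = {t:.1f} Myr")
```

A single point calibration like this ignores both rate variation among lineages and uncertainty in the fossil age, which are the two issues the project addresses.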

### Technical Summary

Two major improvements have recently been made to molecular clock dating methods: (i) relaxation of the clock assumption through local-clock models and (ii) incorporation of uncertainties in fossil calibrations. Furthermore, modern dating methods can analyze multiple genes and use multiple calibrations simultaneously. Nevertheless, representing errors and uncertainties in the fossil record in a molecular dating analysis remains a challenging task. In this project, we will study models of clade divergences and species preservation in the fossil record to improve our representation of fossil calibration information for molecular clock dating. We will implement models that explicitly account for errors in the fossil record. We will conduct computer simulations to examine the impact of fossil calibrations on divergence time estimation. The new models and methods will be applied to large datasets to estimate the divergence times among primates and among vertebrates.

### Publications

Barba-Montoya J
(2018)

*Constraining uncertainty in the timescale of angiosperm evolution and the veracity of a Cretaceous Terrestrial Revolution.* in The New phytologist
Clark JW
(2019)

*Origin of horsetails and the role of whole-genome duplication in plant macroevolution.* in Proceedings. Biological sciences
Clark JW
(2017)

*Constraining the timing of whole genome duplication in plant evolutionary history.* in Proceedings. Biological sciences
Clark JW
(2018)

*Whole-Genome Duplication and Plant Macroevolution.* in Trends in plant science
Clarke JT
(2011)

*Establishing a time-scale for plant evolution.* in The New phytologist
Cunningham JA
(2017)

*The origin of animals: Can molecular clocks and the fossil record be reconciled?* in BioEssays : news and reviews in molecular, cellular and developmental biology
De Baets K
(2015)

*Constraining the Deep Origin of Parasitic Flatworms and Host-Interactions with Fossil Evidence.* in Advances in parasitology
Dong X
(2016)

*Developmental biology of the early Cambrian cnidarian Olivooides* in Palaeontology
Donoghue P
(2019)

*Evolution: The Flowering of Land Plant Evolution.* in Current biology : CB
Dos Reis M
(2012)

*Phylogenomic datasets provide both precision and accuracy in estimating the timescale of placental mammal phylogeny.* in Proceedings. Biological sciences

Description | Objectives listed in the original grant proposal

1. To develop flexible statistical distributions to accurately represent fossil calibration information by studying clade divergences and species preservation in the fossil record.
2. To develop explicit models of errors in the fossil record for use in Bayesian estimation of species divergence times and to implement the new models and methods in the MCMCTREE program.
3. To conduct computer simulations to examine the impact of fossil calibration on divergence time estimation.
4. To apply the methods to large datasets to infer divergence times among primates and vertebrates.

Objective 1. To develop flexible statistical distributions to accurately represent fossil calibration information by studying clade divergences and species preservation in the fossil record.

We implemented a truncated Cauchy distribution to describe minimum-age bounds, the most common type of fossil calibration. This was described in a paper we published in Systematic Biology [1]. In that paper, we also examined the impact of different procedures for incorporating the same fossil calibration information in Bayesian dating programs, and made the disturbingly surprising finding that the different procedures can produce very different prior and posterior time estimates.

In addition to the truncated Cauchy, we implemented several flexible statistical distributions to describe uncertain fossil calibrations, including the skew normal, the skew t, and the two-moded mixture of two skew normal distributions. In particular, the skew t is very flexible, able to faithfully represent a sharp minimum bound and a soft maximum bound on the same calibration node.
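
To illustrate how a skewed density can encode a sharp minimum bound with a long tail toward older ages, here is a minimal sketch of the standard skew normal density using only the Python standard library. It is not the MCMCTREE implementation; the parameter names (`location`, `scale`, `shape`) and the numerical values are assumptions chosen for illustration. The skew t mentioned above additionally has a degrees-of-freedom parameter that gives heavier tails, which is not shown here.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(t, location, scale, shape):
    """Skew normal density: (2 / scale) * phi(z) * Phi(shape * z)."""
    z = (t - location) / scale
    return 2.0 / scale * normal_pdf(z) * normal_cdf(shape * z)


def integrate(f, lo, hi, n=20000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Hypothetical calibration: oldest fossil near 60 Myr; a large positive shape
# puts almost all prior mass above that age, with a gradual decline beyond it.
location, scale, shape = 60.0, 10.0, 20.0


def density(t):
    return skew_normal_pdf(t, location, scale, shape)


total_mass = integrate(density, 0.0, 200.0)
mass_below_minimum = integrate(density, 0.0, location)
print(f"total mass: {total_mass:.4f}")
print(f"prior mass younger than the fossil: {mass_below_minimum:.4f}")
```

With a large positive shape parameter only a small fraction of the prior mass falls below the location, so the fossil age acts as a nearly hard minimum while the maximum stays soft.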
The new suite of flexible calibration densities has been implemented in the MCMCTREE program and has already been used in analyses of various real datasets. It allows the integration of fossil information in novel ways. For example, together with my collaborator on the grant, Simon Tavare, and others, we fitted a model of species genesis and of fossil preservation and discovery to primate fossil occurrence data [2]. In the first step of the analysis, no molecular sequence data were used. Instead, a nonhomogeneous branching process was used to describe speciation, and a model of fossil preservation and discovery rates was used to conduct a Bayesian analysis of fossil finds throughout the world. The analysis led to posterior estimates of divergence times for two nodes in the primate phylogeny, which were then summarized using the skew-t densities mentioned above. In the second step of the analysis, the two skew-t distributions were used as calibrations in a more conventional molecular dating analysis using two genomic regions from 15 primate species. The skew-t calibrations incorporate far more information than simple minimum node age constraints, and the reduced uncertainty led to improved precision in the posterior time estimates in the molecular analysis. This work represents the first integrated analysis of fossil and molecular data using proper statistical methods, and is published in Systematic Biology [2]. Objective 1 has now been completed.

Objectives 2 and 3: To develop explicit models of errors in the fossil record for use in Bayesian estimation of species divergence times and to implement the new models and methods in the MCMCTREE program. To conduct computer simulations to examine the impact of fossil calibration on divergence time estimation.

In collaboration with Professor Bruce Rannala, we have developed an explicit model of errors in the fossil record and validated our implementation of the model in the MCMCTREE program.
The model assumes that each fossil has a probability pE of being in error and thus of being excluded from the molecular dating analysis. A beta prior is assigned to pE. The method uses probability theory to evaluate the conflicts among fossils and between fossils and molecules, producing a posterior probability that each fossil is in error. It thus allows dating analysis to proceed even in the presence of unreliable fossil data. This approach appears to be superior to the existing heuristic method of delete-one cross-validation, in which each fossil is removed and the resulting time estimates are examined to assess the impact of that fossil. We have conducted simulation studies to examine the performance of the new Bayesian model, which show that the new method is much more robust to errors in fossil calibrations than the previous method, which assumes no calibration errors. We are currently analyzing two real datasets to compare the new and old methods empirically. This work is being written up for publication [3]. Objectives 2 and 3 are thus at the final stage of completion.

Objective 4: To apply the methods to large datasets to infer divergence times among primates and vertebrates.

We have applied the newly developed methods in several dating analyses. Besides estimating the divergence times in primates mentioned above [2], we also used the new method to estimate divergence times in Foraminifera, in collaboration with M. Groussin of Lyon and J. Pawlowski of Geneva. We used multiple calibration points from the rich fossil record of Foraminifera to calibrate a foraminiferan phylogeny reconstructed from 18S rRNA and three protein-coding genes in a relaxed-clock Bayesian context. We discuss the effects of prior specifications on the posterior time estimation. The results are broadly in agreement with the fossil record but highlight a few interesting possible cases of conflict. This is the most comprehensive study so far on protist molecular dating.
The work has been published in Molecular Phylogenetics and Evolution [4].

We have undertaken two further studies, establishing novel calibrations for plant and turtle phylogeny, exploring the impact of different time priors on divergence dating, and evaluating and developing methods of assessing calibration quality. The study on plant phylogeny, published in New Phytologist [5], establishes a new standard for the quality of evidence required for constraining molecular clock analyses of plant evolution, and is the first to explore calibration consistency when minimum and maximum constraints are used in divergence dating. Although all of the nodes in the tree are constrained by minimum and maximum constraints derived from fossil data, the results allow us to reject a post-Cambrian origin of land plants and a post-Jurassic origin of angiosperms.

We also studied the timescale of turtle phylogeny because it has been the focus of debate over the quality of fossil calibrations and over how fossil data should be interpreted and implemented in calibration. This analysis, in collaboration with W. Joyce (Tübingen), Tyler Lyson (Yale) and Jim Parham (Alabama), showed that the practice of evaluating the relative quality of calibrations based on their consistency is misguided, and demonstrated the importance of establishing the time prior probability and of the perturbation of the initial time priors when the joint prior is established. The report of this study is about to be submitted to Molecular Biology & Evolution [6].

Because of the dramatic improvement in the computational speed of the MCMCTREE program (see below), we were able to conduct molecular dating analyses using very large genome sequence alignments. Mammals (and particularly primates) have been the group that has attracted most interest in molecular dating analysis. Currently, 36 mammal genomes have either been completely sequenced or are at an advanced sequencing stage (www.ensembl.org).
We have conducted an unprecedented revision of mammal divergence times, using >14,000 nuclear gene alignments (about 20 million sites) across the 36 species, and using 25 fossil calibrations (many of which are newly updated) [13]. In addition to the nuclear genomic data, our analysis included 272 fully sequenced mammal mitochondrial genomes. In a similar way to the primate work mentioned above [2], we used the skew t distribution to summarize the posterior times estimated in the 36-species phylogeny, to be used as calibrations in the larger mitochondrial phylogeny, effectively propagating the information on posterior times and substitution rates from the 36-species to the 272-species dataset. As a result, this is the most thorough work on mammal divergence times to date. Our results agree with previous work in that most mammal orders diversified before the K-T event (the catastrophic mass extinction 65 million years ago when dinosaurs were wiped out). However, our estimated dates are younger than in previous reports. The use of large alignments and multiple calibrations has led to robust estimation and much narrower credibility intervals around the node ages. This work is currently in its final stage of preparation for submission [7].

In collaboration with Professor Richard Brown of Liverpool, we conducted a computer simulation study to examine the accuracy of Bayesian divergence time estimation on shallow phylogenies where DNA sequence divergence is low. We analyzed both simulated and real sequences to evaluate dating methods in phylogenies with mid-late Miocene roots. Our study demonstrates the impact of the prior on divergence times in shallow phylogenies and shows that prior intervals on nodes should be assessed as a prerequisite to a dating analysis, and that >1 kb of quite rapidly evolving sequence may be required to obtain accurate posterior means and usefully narrow posterior intervals. This work has been published in Systematic Biology [8].
In short, we have analyzed more datasets than originally planned, partly because of collaboration requests received from empirical biologists. Objective 4 is thus completed.

Research outputs not originally planned in the proposal

We have also worked on a few problems that are highly relevant to the project but were not planned in the original proposal. First, statistical analysis of genomic sequence data from closely related species or potential species has been an ongoing research interest of Yang's. As there is a huge amount of interest among empirical biologists in the use of genomic sequence data to delimit species boundaries, Bruce Rannala and Yang extended our earlier methods (Yang 2002, Genetics 162: 1811-1823; Rannala and Yang 2003, Genetics 164: 1645-1656), which are research outputs from previous BBSRC funding, to produce a new Bayesian method of species delimitation. Unlike existing methods, the new method accounts for uncertainties in gene genealogies (gene trees) at loci, and deals with ancestral lineage sorting and other population genetic processes that can lead to discordant gene trees. The posterior probabilities of species delimitation models are evaluated through a reversible-jump MCMC algorithm. The method is implemented in a program called BP&P. The work was published in PNAS in 2010 [9] and has attracted much attention among evolutionary and speciation biologists. Furthermore, in collaboration with colleagues at the Institute of Zoology, Chinese Academy of Sciences in Beijing, we conducted a computer simulation study to examine the statistical performance of the new Bayesian method of species delimitation, in particular when there is hybridization. The work has been published in Systematic Biology [10]. Those two projects were not planned in the original proposal, and both published papers acknowledge the BBSRC support.
Second, to improve the computational efficiency of the Markov chain Monte Carlo (MCMC) algorithms for molecular dating analysis, we developed and evaluated several methods for approximate calculation of the likelihood function. We explored the use of parameter transforms (square root, logarithm, and arcsine) in the Taylor expansion to approximate the log likelihood curve. We found that the new methods, particularly the arcsine-based transform, provided very good approximations under relaxed clock models. The approximation is about 100 times faster than the conventional likelihood calculation based on Felsenstein's pruning algorithm, and opens up opportunities for Bayesian dating analysis of large datasets with >1000 species. The new methods are implemented in the MCMCTREE program. This work has been published in Molecular Biology and Evolution [11] and acknowledges the BBSRC support.

A successful interdisciplinary collaboration

One of the major achievements of this project is the establishment of a successful collaboration between the two co-PIs, one a theoretical molecular phylogeneticist and the other a palaeontologist. We attended a workshop on molecular clock dating at the 71st Society of Vertebrate Palaeontology Meeting, which was held in Bristol in September 2010, the outcome of which is an article on best practice in developing fossil calibrations for molecular dating analysis, co-authored by the participants of the workshop. This article is now in press in Systematic Biology [12]. We have both learned a lot through this collaboration, Yang gaining further insight into the nature and complexities of fossil data, and Donoghue gaining better insight into the calibration requirements of molecular dating analyses, and both of us learning of the potential for statistical interpretation of the fossil data and construction of informative calibrations for clock dating.
This collaboration will, however, also have a beneficial impact on palaeontology, helping the field to move away from an obsession with finding the earliest fossil evidence of lineages, and towards rigorous probabilistic modelling and statistical analysis of fossil data, with the aim of providing objective and precise calibration information for molecular clock dating.

Through this collaborative project, Donoghue and Yang have been jointly supervising a Ph.D. student, Rachel Warnock, with funding from a NERC algorithm studentship. Rachel is receiving training in both palaeontology and molecular phylogenetics. We have finished a joint paper, in which we updated and extended fossil calibrations on an arthropod phylogeny, and explored the impact of different approaches to expressing uncertainties in the fossil record in molecular clock dating analysis. Our analysis demonstrated that the parameters in the calibration densities had a major impact upon the prior and posterior of the divergence times, and that it was critically important for the user to examine the joint prior distribution of divergence times used by the dating program. We illustrate a procedure to derive calibration densities in the Bayesian dating analysis through the use of soft maximum constraints. This work has been submitted [13]. Rachel has contributed to two further studies: Clarke et al. [5] and Warnock et al. [6].

Publications that result directly from this new collaboration include Inoue et al. [1], Parham et al. [12], Warnock et al. [13], Clarke et al. [5], the manuscript in preparation on mammalian divergence [7] and the manuscript on turtle phylogeny [6].

Training provided by the project

The project has provided excellent opportunities for training postdoctoral researchers and Ph.D. students. Dr Jun Inoue, the PDRA initially appointed on the project, greatly improved his bioinformatics and statistics skills as well as his skills in writing papers for publication.
He has taken up a junior group leader position at the University of Tokyo. Through this project, Dr Mario dos Reis has gained considerable skills in probabilistic modelling and statistical analysis, and expanded his research areas in molecular evolution and phylogenetics. He has developed a new collaboration with colleagues in China working on viral evolution. The Ph.D. student associated with this project and funded by a NERC studentship, Rachel Warnock, is receiving training in both palaeontology and molecular phylogenetics. She has co-authored two papers that are now in press, and she is working on two others that are near submission.

Summary

In summary, the objectives listed in the original proposal have been completed. We have also solved a few problems not planned originally. The approximate likelihood calculation is a huge bonus for the molecular dating method. The new collaboration established between the co-PIs is bearing fruit, and we look forward to continuing our productive collaboration. We have published at least twice as many papers from this project as the estimate (5-6) in the original proposal.

Publications that have resulted from this project

1. Inoue, J., P.C.J. Donoghue and Z. Yang, The impact of the representation of fossil calibrations on Bayesian estimation of species divergence times. Syst. Biol., 2010. 59: p. 74-89.
2. Wilkinson, R.D., M.E. Steiper, C. Soligo, R.D. Martin, Z. Yang and S. Tavare, Dating primate divergences through an integrated analysis of palaeontological and molecular data. Syst. Biol., 2011. 60: p. 16-31.
3. Yang, Z. and B. Rannala, Bayesian estimation of species divergence times under a model of fossil errors. In preparation.
4. Groussin, M., J. Pawlowski and Z. Yang, Bayesian relaxed clock estimation of divergence times in Foraminifera. Mol. Phylogenet. Evol., 2011. 61: p. 157-166.
5. Clarke, J.T., R.C.M. Warnock and P.C.J. Donoghue, Establishing a timescale for plant evolution. New Phytologist, 2011. 192: p. 266-301.
6. Warnock, R.C.M., W.A. Joyce, T. Lyson, J. Parham and P.C.J. Donoghue, Calibration quality and turtle phylogeny. Mol. Biol. Evol., 2012. In preparation.
7. dos Reis, M., J. Inoue, M. Hasegawa, R. Asher, P.C. Donoghue and Z. Yang, Phylogenomic data sets provide both precision and accuracy in estimating the timescale of placental mammal evolution. 2012. In preparation.
8. Brown, R.P. and Z. Yang, Bayesian dating of shallow phylogenies with a relaxed clock. Syst. Biol., 2010. 59: p. 119-131.
9. Yang, Z. and B. Rannala, Bayesian species delimitation using multilocus sequence data. Proc. Natl. Acad. Sci. U.S.A., 2010. 107: p. 9264-9269.
10. Zhang, C., D.-X. Zhang, T. Zhu and Z. Yang, Evaluation of a Bayesian coalescent method of species delimitation. Syst. Biol., 2011. 60: p. 747-761.
11. dos Reis, M. and Z. Yang, Approximate likelihood calculation for Bayesian estimation of divergence times. Mol. Biol. Evol., 2011. 28: p. 2161-2172.
12. Parham, J., et al., Best practices for applying paleontological data to molecular divergence dating analyses. Syst. Biol., 2012. In press.
13. Warnock, R.C.M., Z. Yang and P.C.J. Donoghue, Exploring uncertainty in the calibration of the molecular clock. Biol. Lett., 2011. In press. |

Exploitation Route | Viral evolution Publications and presentations |

Sectors | Education, Healthcare |

Title | Data from: Testing the molecular clock using mechanistic models of fossil preservation and molecular evolution |

Description | This repository contains simulated data and Bayesian MCMC output from *Testing the molecular clock using mechanistic models of fossil preservation and molecular evolution* by Rachel CM Warnock, Ziheng Yang and Philip CJ Donoghue. (2017) **Proc. R. Soc. B** 284 (1857). This data is associated with the following paper: http://rspb.royalsocietypublishing.org/content/284/1857/20170227. This data is also associated with code available on dryad: http://datadryad.org/resource/doi:10.5061/dryad.5706p. |

Type Of Material | Database/Collection of data |

Year Produced | 2017 |

Provided To Others? | Yes |

Title | Molecular clock fossil calibration database |

Description | A database of fully researched and evidenced fossil calibrations for molecular clock analyses. |

Type Of Material | Database/Collection of data |

Year Produced | 2014 |

Provided To Others? | Yes |

Impact | A number of launch publications are associated with the database. It is changing best practice in divergence time estimation and fostering links between palaeontologists and molecular biologists. |

URL | http://www.nescent.org/science/awards_summary.php?id=259 |

Description | Conference presentation |

Form Of Engagement Activity | A talk or presentation |

Part Of Official Scheme? | No |

Geographic Reach | International |

Primary Audience | Professional Practitioners |

Results and Impact | Conference presentation |

Year(s) Of Engagement Activity | 2017 |