Fast supertree construction using quartet joining

Lead Research Organisation: The Natural History Museum
Department Name: Life Sciences

Abstract

That all kinds of organisms that have ever lived are related through common ancestry and descent in one Tree of Life is one of the major insights of biological science. Knowledge of these phylogenetic relationships helps scientists to understand how the great diversity of life we see today originated, provides a framework for inferring how living things have evolved, and allows testing of hypotheses that seek to explain this diversity and identify the mechanisms that have generated it. Phylogenetic relationships can be inferred using morphology but are increasingly inferred from DNA or amino acid sequence data. However, the inferred phylogeny of a single gene may differ from (be incongruent with) the true species phylogeny, either due to errors in the inference or because the gene tree is not identical to the species tree. The latter can arise when, for example, genes are transferred horizontally between species, as has happened in the development of antibiotic resistance in some bacteria, or when genes are duplicated and subsequently lost. This raises the question of how best to do phylogenomics (the phylogenetic analysis of genomic-scale data), with two alternative strategies currently being pursued: (1) combining all genes into a single analysis and (2) building a supertree, a synthesis of the individual gene trees. Supertree methods can be considered a 'divide-and-conquer' approach in which a large phylogenetic problem is decomposed into smaller problems whose solutions are then combined to give a global solution. Underpinning this is the expectation that individual gene trees can be more easily or effectively analysed because they are smaller and because they include only those taxa for which particular genes are available.
This also assumes that the information in the individual trees can be combined efficiently, but unfortunately the supertree methods that are currently most relied upon in practice have a number of clearly undesirable properties, such as producing supertrees that contradict relationships that hold in every input tree (and which therefore must be true if any input tree is true). We propose to develop a new supertree method that uses logical inference to make species phylogenies from collections of gene trees, to implement it in software, and to test it with simulations and empirical data. In this method a supertree is grown by adding leaves; the placement of each new leaf is inferred from 'quartets' (four-leaf subtrees) in the input trees, which can be considered the quanta of phylogenetic information. The new method is needed to enable researchers to make best use of the rapidly expanding number of complete genome sequences, which may be relevant to understanding the evolution of metabolic pathways and of drug resistance, and to drug discovery, epidemiology, and diversification studies linked to historical climate change. Technical advances have brought massive increases in the rate of production of new genomic data; complete genomes of prokaryotes can now be produced in an afternoon. Advances are now needed in the methods used to analyse this flood of data, and we aim to replace ad hoc methods with better-founded alternatives. Based on its logical foundation, its flexibility, and the speed of its computation, we expect that this will be a method of choice in phylogenomic analysis, but this needs to be confirmed through simulation to show its properties and determine its error rates, and through empirical tests that will provide proof of concept.
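
To make the idea of quartets as 'quanta' of phylogenetic information concrete, the following minimal Python sketch lists the quartets displayed by a small tree. It is an illustration only, not the project's software: the nested-tuple tree encoding and the function names `leaf_set`, `clusters` and `quartets` are assumptions introduced here. The tree is treated as unrooted, and a quartet ab|cd is counted when some internal edge separates a and b from c and d.

```python
from itertools import combinations

def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clusters(tree):
    """Return the leaf set below every internal node; each one marks an edge."""
    if isinstance(tree, str):
        return []
    found = [leaf_set(tree)]
    for child in tree:
        found.extend(clusters(child))
    return found

def quartets(tree):
    """Return every resolved quartet ab|cd displayed by the tree."""
    sides = clusters(tree)
    result = set()
    for four in combinations(sorted(leaf_set(tree)), 4):
        a, b, c, d = four
        pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
        for pair, rest in pairings:
            # The quartet pair|rest is displayed if some edge has exactly
            # one of the two pairs on one of its sides.
            if any(set(four) & side in (set(pair), set(rest)) for side in sides):
                result.add(frozenset([frozenset(pair), frozenset(rest)]))
    return result

# Hypothetical five-taxon gene tree used only for illustration.
gene_tree = ((("A", "B"), ("C", "D")), "E")
for quartet in sorted(sorted("".join(sorted(side)) for side in q)
                      for q in quartets(gene_tree)):
    print("|".join(quartet))
```

This brute-force enumeration checks every set of four leaves, so it is only practical for small trees; it is meant to show what a quartet is, not how a fast method would extract or sample them.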

Technical Summary

Supertree methods are widely used for constructing large phylogenetic trees, and have become central to reconstructing the tree of life. Existing supertree methods are ad hoc and have a number of drawbacks. For example, the most widely used supertree method, matrix representation with parsimony (MRP), has biases associated with the shapes of the input trees, may produce a supertree with relationships that conflict with all the input trees, and may show 'unsupported groups' that are not present in any of the input trees. Most supertree methods, including MRP, require searching tree space. A recently proposed supertree method appears to be a much-needed fast and flexible alternative. This method, quartet joining (QJ), grows a supertree by using the information contained in quartets in the input trees to infer the placement of new leaves. It is very fast, with a complexity of O(n log n). Initial testing appears promising, and we propose to: enhance the efficiency of the method by allowing it to use more of the information contained in the input trees; increase its speed through parallelization and by allowing grafting of subtrees onto the growing supertree; increase the realism of the construction by allowing it to use support information in the input trees; allow optimization of the trade-off between speed and accuracy via dataset-dependent automatic tuning of the number of quartets from the input trees consulted to place a new leaf; and enhance ease of use by producing standalone applications. We will test the method and evaluate the effects of these enhancements using simulated and empirical data, and compare its performance and accuracy with other supertree methods, especially the widely used MRP.
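
The core step described above, placing a new leaf using quartets from the input trees, can be sketched in Python as follows. This is a deliberately simplified illustration under stated assumptions, not the QJ algorithm itself: it scores every possible attachment point exhaustively, so it does not achieve the O(n log n) complexity quoted above, and the nested-tuple tree encoding and the names `displays`, `placements` and `best_placement` are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*map(leaves, tree))

def displays(tree, quartet):
    """Return True if the tree displays quartet ((a, b), (c, d)), read as ab|cd."""
    (a, b), (c, d) = quartet
    four = {a, b, c, d}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        # The edge above this node separates the leaves below it from the rest.
        inside = four & leaves(node)
        if inside == {a, b} or inside == {c, d}:
            return True
        stack.extend(node)
    return False

def placements(tree, new_leaf):
    """Yield every tree obtained by attaching new_leaf beside one subtree."""
    yield (tree, new_leaf)
    if not isinstance(tree, str):
        for i, child in enumerate(tree):
            for grown in placements(child, new_leaf):
                yield tree[:i] + (grown,) + tree[i + 1:]

def best_placement(tree, new_leaf, quartets):
    """Return the candidate tree that agrees with the most input quartets."""
    return max(placements(tree, new_leaf),
               key=lambda candidate: sum(displays(candidate, q) for q in quartets))

# Hypothetical example: quartets involving new leaf E, as if read from gene trees.
supertree = (("A", "B"), ("C", "D"))
evidence = [(("E", "A"), ("C", "D")),
            (("E", "A"), ("B", "C")),
            (("E", "A"), ("B", "D"))]
print(best_placement(supertree, "E", evidence))
```

In this sketch every quartet counts equally and all supplied quartets are consulted. The proposed enhancements would correspond to weighting quartets by support in the input trees and to tuning how many quartets are consulted for each placement.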

Publications

Akanni WA (2015) Horizontal gene flow from Eubacteria to Archaebacteria and what it means for our understanding of eukaryogenesis. in Philosophical transactions of the Royal Society of London. Series B, Biological sciences

Williams TA (2017) Integrative modeling of gene and genome evolution roots the archaeal tree of life. in Proceedings of the National Academy of Sciences of the United States of America

 
Description: Construction of an inclusive Tree of Life is a daunting technical problem. The "supertree" approach has advantages because it allows previously completed smaller analyses to be included in the new whole. However, there are many methods for constructing supertrees, including quartet joining, the original subject of this project. Developing methods for supertree construction allowed us to examine and implement newly described model-based methods, including a novel Bayesian implementation that I have written. These new methods have a sound statistical basis and allow flexibility in the model.
Exploitation Route: This project has resulted in fundamental methods development. We have now shown proof of concept, and are moving on to apply model-based supertree methods to the question of the origin of eukaryotes. We expect these applications will also be successful and will lead to wider adoption of these methods.

A PhD project by Wasiu Akanni was completed, with one paper published, another submitted (with the Bayesian supertree method described), and another in preparation (where the Bayesian supertree method will be put to use).

A Marie Curie fellowship for Patrick Kück, hosted by Mark Wilkinson and co-hosted by myself, will use quartet joining.
Countering confounding heterogeneity in phylogenetics through non-parametric analyses of quartet split patterns.
EU Marie Curie - Intra-European Fellowship of Dr P. Kück. MEIF-CT-2013-629706. €221,606.
Sectors: Other