Improved phylogenetic tools for gene content data

Lead Research Organisation: University of Liverpool
Department Name: Sch of Biological Sciences

Abstract

The traditional approach to constructing an evolutionary tree from molecular data is to obtain the sequence of nucleotides or amino acids that makes up a single gene, and build a tree that explains the differences in these sequences among species. Since the mid-1990s, a large number of prokaryotic organisms (bacteria and archaea: single-celled organisms without a nucleus) have had their genomes sequenced. These data have suggested that the traditional approach is unreliable for two reasons. First, the sequence of a single gene may be quite short, so statistical variability may limit the accuracy of a tree derived from it. Second, different genes may genuinely have different evolutionary histories, because genes can be transferred between prokaryotic organisms, even when they are not closely related.

An alternative approach exists. Genes can be rapidly gained and lost from the genome, and closely related organisms tend to have similar patterns of presence and absence of genes. This suggests that we could use data on the presence and absence of genes to construct an evolutionary tree that reflects the overall history of the genome. There are many existing methods that attempt to do this. However, most either lack a sound theoretical justification and are known to give the wrong answer in some cases, or do not account for variability in the evolutionary process over time. A new method known as conditioned genome reconstruction has a sound theoretical justification and performs well even when the rates of gene gain and loss vary over time. We will develop new statistical methods that widen the range of genomes to which this method can be applied.

However, as with most other gene content methods, conditioned genome reconstruction does not perform well for parasitic bacteria. This may be an artefact of the way that databases of gene presence/absence are constructed. We will explore this by comparisons across databases using different methods. Another explanation is that the problem is a consequence of increased loss rates of genes that are unnecessary for parasitic organisms that live inside other cells. For these parasites, many genes that are essential for life in the external environment may suddenly become unimportant and may be lost. This loss could occur independently in unrelated lineages of parasites, resulting in these lineages having similar gene content. We will determine whether this is the case by separately analysing essential and dispensable sets of genes.

Finally, we will determine whether our new methods are really better by simulating large and realistic sets of genome data, and comparing the performance of new and existing methods on these data. This work will increase our understanding of the evolution of whole genomes, and will provide new tools for the construction of evolutionary trees.
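
To make the idea of gene content data concrete, the sketch below represents each genome as the set of gene families it contains and computes a simple pairwise distance from shared presence. The genome names, gene family labels and the choice of the Jaccard distance are all illustrative assumptions, not data or methods taken from the project; the Jaccard distance stands in for the kind of ad-hoc measure that the proposal regards as lacking theoretical justification.

```python
# Hypothetical toy data: each genome is the set of gene families present in it.
genomes = {
    "free_living_A": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "free_living_B": {"g1", "g2", "g3", "g4", "g5", "g7"},
    "parasite_C": {"g1", "g2", "g8"},
    "parasite_D": {"g1", "g2", "g9"},
}


def jaccard_distance(a, b):
    """Ad-hoc gene content distance: 1 - |shared families| / |families in either|."""
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genome_sets):
    """Return a dict mapping (name_x, name_y) to the pairwise Jaccard distance."""
    names = sorted(genome_sets)
    return {
        (x, y): jaccard_distance(genome_sets[x], genome_sets[y])
        for x in names
        for y in names
    }


distances = distance_matrix(genomes)
for (x, y), d in sorted(distances.items()):
    if x < y:  # print each unordered pair once
        print(f"{x} vs {y}: {d:.3f}")
```

A distance table of this kind could be passed to any distance-based tree-building method; the sketch says nothing about whether the resulting tree would be correct.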

Technical Summary

As the availability of completely sequenced genomes has increased, we have learned that a single gene may not accurately reflect the evolutionary history of the organism, and that the rates at which genes are gained and lost may be comparable to the rate of nucleotide substitutions. We could therefore use data on the presence/absence of orthologous gene families to construct phylogenies that accurately reflect organismal evolutionary history. There are many existing methods, but all are unsatisfactory. Ad-hoc distances are not tree-additive evolutionary distances, and as a result trees based on these distances will sometimes be wrong, even with infinite amounts of data. Maximum parsimony assumes that the tree requiring the smallest number of gene gains and losses is the correct tree, and thus performs badly when evolutionary rates are high or variable. Stationary maximum likelihood distances can deal with high rates of evolution, but perform badly when the evolutionary process changes over the tree. One of the most promising methods is the conditioned genome reconstruction approach. Our modified version of this method performs well on small simulated data sets, but in common with most other methods, it artefactually produces a clade of parasites when applied to real genome data. This is probably either a database artefact, or because there is a common subset of dispensable genes in unrelated parasites. Existing conditioned genome reconstruction methods also fail for some data because of sampling variability. We will develop new statistical methods to reduce the effects of sampling variability, use cross-database comparisons to determine the extent of database artefacts, apply network methods and data partitioning to determine whether the parasites problem is due to increased loss rates in a subset of genes in parasites, and compare the performance of new and existing gene content methods for phylogenies using large and realistic simulated data sets.
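
As a rough sketch of the idea behind a conditioned distance, the code below restricts the comparison of two genomes to the gene families present in a third, conditioning genome, so that families absent from both compared genomes become countable, and then computes a paralinear (LogDet-style) distance from the resulting 2x2 table. This is an assumption-laden illustration rather than the project's method: the function name, the toy gene families and the exact normalisation are hypothetical. Returning `None` when the determinant is not positive shows one way a distance of this form can be undefined for finite data; whether this is the sampling-variability failure referred to above is also an assumption.

```python
import math


def conditioned_logdet(genome_x, genome_y, conditioning):
    """Paralinear-style distance over families present in `conditioning`.

    All three arguments are sets of gene family identifiers (hypothetical
    representation). Returns None when the distance is undefined.
    """
    # counts[i][j]: i is the state in x, j the state in y (0 = present, 1 = absent)
    counts = [[0, 0], [0, 0]]
    for family in conditioning:
        i = 0 if family in genome_x else 1
        j = 0 if family in genome_y else 1
        counts[i][j] += 1

    total = len(conditioning)
    if total == 0:
        return None

    # Joint frequencies of the four presence/absence patterns.
    f = [[c / total for c in row] for row in counts]
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    row_sums = [f[0][0] + f[0][1], f[1][0] + f[1][1]]
    col_sums = [f[0][0] + f[1][0], f[0][1] + f[1][1]]
    norm = math.sqrt(row_sums[0] * row_sums[1] * col_sums[0] * col_sums[1])

    # With finite data the determinant can be zero or negative,
    # in which case the logarithm is undefined.
    if det <= 0 or norm == 0:
        return None
    return -math.log(det / norm)


# Hypothetical example: ten families in the conditioning genome.
conditioning = {f"g{k}" for k in range(1, 11)}
genome_x = {f"g{k}" for k in range(1, 9)}            # g1..g8
genome_y = {f"g{k}" for k in range(1, 7)} | {"g9"}   # g1..g6 and g9

print(conditioned_logdet(genome_x, genome_y, conditioning))
print(conditioned_logdet(conditioning, genome_y, conditioning))  # None: undefined
```

In the second call the first genome contains every conditioning family, so one row of the table is empty and the function reports the distance as undefined rather than raising an error.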
