Fine-scale phylogeny using a mathematical model of the dynamics of rDNA repeat sequence evolution

Lead Research Organisation: John Innes Centre
Department Name: Computational and Systems Biology

Abstract

Ever since Charles Darwin introduced his theory of evolution, biologists have been interested in reconstructing the Tree of Life, the tree representing the evolutionary history of all present-day species (http://www.phylo.org/). This is an extremely ambitious goal, and so biologists usually concentrate on constructing evolutionary trees for small sub-collections of present-day species. In the past, the construction of such trees was based on particular characteristics of the species, such as properties of their skeleton or anatomy. However, now that we are able to sequence parts of genomes, or in some cases whole genomes, it is commonplace for biologists to construct evolutionary trees using DNA data. Consequently, in recent years a whole new theory, called 'Phylogenetics', has grown around building such trees. The most important DNA sequence used in phylogenetics is that of the ribosomal DNA repeat unit (or rDNA), a section of DNA which is present in all species, and which has been used to construct the 'universal' tree-of-life. Thanks to recent large-scale genome sequencing projects, which have revealed the DNA codes of many organisms, including very closely related ones such as various yeast strains (or sub-species), we now have data available to construct far more detailed phylogenies. In particular, some DNA sequences, such as the rDNA sequences that we plan to use in this project, vary within genomes as well as between them. These DNA sequences will enable us to uncover the relationships between closely related organisms, such as our yeast strains, much more clearly than we have been able to do before now. However, new tools will be required to carry out the analytical processes involved. Recent advances in computational biology mean we now have the ability to build rapid and efficient tools to achieve this goal. The aim of this project is to build a new mathematical tool to analyse rDNA sequence variation and dynamics at the most basic level. 
The tool will be applied to yeast and, if possible, plant data, allowing much more detailed phylogenies to be constructed than hitherto possible. Yeast genomes provide excellent models for understanding genome dynamics in plants and in other eukaryotic genomes, including humans. Therefore, our new tool can also be used by scientists who wish to analyse datasets of other species groups.

Technical Summary

We have recently quantified a new form of sequence variation, partial single nucleotide polymorphism (pSNP), within a newly developed yeast genome resequencing dataset. pSNPs possess the exciting potential to resolve phylogenies for closely related organisms. However, testing and evaluation of new mathematical models of pSNP evolution are required urgently. Here, we propose to develop a new prototype web-based tool for the extraction and analysis of cryptic rDNA sequence variation, bringing together new datasets, software engineering and mathematical modelling processes in a novel way. We will focus on five main tasks: (i) We will develop an automated computational pipeline to extract and process rDNA sequence data from sequencing databases. Adoption of algorithms developed during a preliminary study in Saccharomyces will expedite this process. We will apply the new pipeline to data recently developed in the Saccharomyces Genome Resequencing Project and we will populate a new yeast pSNP database, based on the GMOD software suite, to be mounted on a new pSNP website initiated by us. (ii) We will develop a simple graphical front-end to the yeast pSNP database, allowing the user to view both SNPs and pSNPs along the rDNA array. (iii) We will integrate or interface, as appropriate, mathematical models (of pSNP evolution) and phylogenetic analysis software with the pSNP tool to enable phylogenetic analysis to be carried out on the entire rDNA region or subregions of it. (iv) We will develop a simple simulation tool in the Java programming language, to model the process of sequence variation in repeated regions, and we will use it in a carefully controlled manner to carry out a small evaluation study of different mathematical models. (v) We will analyse the yeast SNP/pSNP data using the tools and knowledge developed above.
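
As an illustration of the kind of simulation described in task (iv), the following is a minimal Java sketch of how variation at a single site might spread through an array of rDNA repeat copies. It is not the project's model: the class name PsnpSimulation, the per-step mutation and copy-over ("conversion") probabilities, the array size of 150 copies and the classification of a site as "none", "pSNP" or "SNP" are all assumptions introduced here for illustration only.

```java
import java.util.Random;

/**
 * Hypothetical toy model (not the project's tool): one nucleotide site is
 * tracked across every copy of a tandem repeat array. A value of true means
 * the copy carries the variant base, false means the ancestral base.
 */
public class PsnpSimulation {

    /** Fraction of repeat copies carrying the variant base at the tracked site. */
    static double variantFraction(boolean[] copies) {
        int count = 0;
        for (boolean carriesVariant : copies) {
            if (carriesVariant) {
                count++;
            }
        }
        return (double) count / copies.length;
    }

    /**
     * Illustrative classification: a variant in no copies is "none", in all
     * copies a fixed "SNP", and in only some copies a partial SNP ("pSNP").
     */
    static String classify(double fraction) {
        if (fraction == 0.0) {
            return "none";
        }
        if (fraction == 1.0) {
            return "SNP";
        }
        return "pSNP";
    }

    /**
     * Runs the toy process for a number of steps. In each step, with
     * probability mutationRate one random copy flips its base, and with
     * probability conversionRate one random copy is overwritten by another
     * random copy (a crude stand-in for homogenisation between repeats).
     */
    static boolean[] simulate(int numCopies, int steps, double mutationRate,
                              double conversionRate, long seed) {
        Random rng = new Random(seed);
        boolean[] copies = new boolean[numCopies]; // all ancestral at the start
        for (int step = 0; step < steps; step++) {
            if (rng.nextDouble() < mutationRate) {
                int target = rng.nextInt(numCopies);
                copies[target] = !copies[target];
            }
            if (rng.nextDouble() < conversionRate) {
                int donor = rng.nextInt(numCopies);
                int recipient = rng.nextInt(numCopies);
                copies[recipient] = copies[donor];
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        // Parameter values are arbitrary and chosen only to exercise the code.
        boolean[] copies = simulate(150, 10000, 0.01, 0.5, 42L);
        double fraction = variantFraction(copies);
        System.out.printf("variant fraction = %.3f (%s)%n", fraction, classify(fraction));
    }
}
```

Fixing the random seed makes runs repeatable, which suits the controlled comparison of alternative models that the summary describes. A realistic evaluation would need many sites, biologically motivated rates and a population of genomes rather than a single array.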