Inference of Evolutionary Histories of Mobile DNAs

Lead Research Organisation: University of Nottingham
Department Name: Sch of Biology

Abstract

One of the major discoveries made in the study of the genomes of higher organisms is that only a small fraction of the DNA in the chromosomes consists of genes, and an even smaller fraction of the DNA is involved in specifying the amino acid sequences of proteins. In organisms with large genomes, such as man, it remains true that the majority of the DNA is of unknown function. Of this unexplained DNA, a large fraction (around 40% of the genome in total) consists of repeated DNAs that have become scattered throughout the chromosomes as a result of their capacity to 'transpose', that is, to move to new chromosomal locations. These transposable elements, or TEs, replicate as they move to new locations, and so, over time, their numbers in the chromosomes will tend to build up. For this reason, they are now usually viewed as primarily selfish DNAs, increasing their abundance in the chromosomes in which they 'live', but only rarely conferring any advantage on the organism in which they are found. The capacity to move between chromosomal locations has the effect that copies of these TEs, found at different sites, share common ancestry, which could have consisted of a common ancestor just one fruit fly's generation ago, or could have been a common ancestor fifty million years ago, in the case of mammalian TEs. The evolutionary relationships between these sequences tell us about the process through which they have come to spread themselves through the chromosomes, either in the past, or in an ongoing process. The science of population genetics interprets data on DNA sequence variation that we see today, and uses this to reconstruct evolutionary events in the past. This can be used to interpret the variation in TE families, and has been used in this way by the Principal Investigator.
This project will allow inferences of the evolutionary histories of families of elements to be made in a more formal way, assessing the probabilities of various evolutionary scenarios on the basis of the transposable element sequence data that we now see. However, the situation is complex, and many different mathematical approaches can be used in this process of inference. This project will produce more sophisticated methods, which will combine features of earlier models, and will allow us to say, from a collection of DNA sequences, how likely various differing histories of these sequences are. As these mobile DNAs, the TEs, constitute more than forty per cent of our DNA, and as they are often being put to use in the creation of new adaptive functions, a full understanding of eukaryotic evolution, and human health, requires us to be able to see where these elements are derived from, and what, if any, purposes they now serve in our genomes. The project will make the sophisticated mathematical approaches that are developed accessible through user-friendly software, so that workers worldwide who have discovered new TE families in any genome will be able to draw inferences about the families' evolutionary histories.

Technical Summary

Transposable elements (TEs) form a population of replicating lineages which are found in chromosomes, and the elements of a given family have homologous DNA sequences due to their shared descent. The variation in the sequences can be used to infer what their evolutionary histories have been, in the light of models of their proliferation through chromosomes. However, it is still not clear how the modelling of these sequences should be carried out. The coalescent methods that dominate inference in population genetics assume that only a small fraction of the population of lineages is sampled, which will not always be true for TEs given the complete genomic information that is available. Branching process methods, which are used to look at, for example, variations in speciation rates, are less effective when sampling is incomplete, and sampling can never be complete, in that element insertion sites can be created, can be the donors for further transposition events, and can then be lost. The project will create hybrid, although computationally demanding, methods to carry out assessment of evolutionary processes, using Markov chain Monte Carlo (MCMC) methods. Selection and horizontal transfer of elements will be incorporated in these models of inference, and the evidence for selection for transposable element function found in some data sets will be quantified. While the methods to be developed are inspired primarily by mammalian TE families, the software to be generated will allow workers interested in TEs from any group to draw evolutionary inferences from current sequence data.
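The point about incomplete sampling can be illustrated with a toy simulation. The sketch below is not the project's method; it is a minimal linear birth-death process, assuming each element copy transposes at a constant rate and is lost at a constant rate. The function name, parameter names and rate values are hypothetical and chosen only for illustration. It reports how many copies were ever created and how many survive to be observed, the gap between the two being the lineages that no present-day sample can contain.

```python
import random


def simulate_te_family(birth_rate, loss_rate, t_max, seed=None, max_copies=10_000):
    """Toy Gillespie simulation of a linear birth-death process for one TE family.

    birth_rate: per-copy transposition (duplication) rate (assumed constant).
    loss_rate:  per-copy loss rate (assumed constant).
    Returns (copies_ever_created, copies_surviving_at_t_max).
    """
    rng = random.Random(seed)
    t = 0.0
    surviving = 1  # start from a single founding insertion
    created = 1
    per_copy_rate = birth_rate + loss_rate
    while 0 < surviving < max_copies and per_copy_rate > 0:
        # Waiting time to the next event, summed over all current copies.
        t += rng.expovariate(surviving * per_copy_rate)
        if t > t_max:
            break
        if rng.random() < birth_rate / per_copy_rate:
            surviving += 1  # a copy acts as donor for a new insertion
            created += 1
        else:
            surviving -= 1  # an existing insertion is lost
    return created, surviving


created, surviving = simulate_te_family(1.0, 0.5, t_max=5.0, seed=1)
print(f"created={created}, surviving={surviving}, unobservable={created - surviving}")
```

Whenever the loss rate is non-zero, some lineages that acted as donors may be absent from the final set, which is why a method assuming complete sampling of the family's history would be expected to be biased.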

Planned Impact

The work will create scientific papers and computer software which will allow researchers to investigate the evolutionary history of families of mobile DNAs, using, as the basis of these interpretations, the DNA sequences that are being derived in increasing numbers from genome projects on various species and individuals. In addition to scientific publications, the PI and the PDRA will also make presentations at scientific meetings. Interpretation of genomic information is a burgeoning field with applications in agriculture and human and animal health. In the case of investigations, as here, of the mobile, interspersed repetitive, component of genomes, another possible spin-off will be that a greater understanding of the evolutionary dynamics of these sequences will form an important input into attempts to genetically modify wild populations of insect vectors. These might be either vectors of human disease, or, potentially, vectors of plant or animal pathogens, such as bluetongue. The PI will continue his ongoing work on outreach to young people and through media interactions to promote the public understanding of science, a task that is of particular importance for researchers in evolutionary biology. Another important impact of the work will come from the training of the post-doctoral researcher in skills in quantitative population and evolutionary genetics and their application to genomic problems, which the PI can supply.
 
Description While DNA sequences in the genome are usually thought of simply as having evolved by a process of natural selection, in which mutations that improve the adaptation of organisms to their environment are retained and spread through populations, we have known, since before the start of the genomic era, that there are many sequences in our genomes that are not simply explained in this way. It has become clear that a large fraction (almost half) of the human genome consists of DNA sequences that were once, in our evolutionary history, mobile in the chromosomes. These DNA sequences have, in the past, spread through our chromosomes by copying themselves to new locations. This process of spread does not require that these DNA sequences confer any benefits on the organism in whose genome they are found. Rather, they might be able to spread even if they are harmful to the organism. For this reason, we regard these sequences as "parasitic" or "selfish" DNAs. This project has used the DNA sequences of hundreds of thousands of these mobile elements, currently residing in the human genome, to reveal the evolutionary history of the sequences. What we have found is that, for one major class of elements, their proliferation was an early event in the genomes of ancestral mammals, and they were very soon inactivated, so they were no longer mobile. Since then they have remained in mammalian, and, ultimately, human chromosomes, and one important question that we have also addressed is whether these formerly parasitic sequences have, over these tens or hundreds of millions of years, evolved new activities and functions, which actually benefit their mammalian "hosts". Here the data form interesting contrasts. We have found some highly significant evidence that one type of sequence has indeed evolved new and useful functions.
On the other hand, a more detailed and thorough re-analysis of one sequence family (a particularly abundant type called the "Alu" sequences) has revealed that the evidence which has led others in the past to conclude that these are functional was probably over-simplified. By looking at individual elements in more detail, we see no signs of retention of these sequences in the human genome by natural selection, which suggests that they may not have any useful function.
Exploitation Route There are no results from our research that lead directly to changes in healthcare. However, an important question in human healthcare is which DNA sequences are responsible for human genetic disorders. The genes which, when mutated, give rise to "single-gene" disorders such as Duchenne muscular dystrophy and cystic fibrosis have almost all been identified. The main focus of medical genetics is now the identification of the changes in the genomes that underlie multifactorial diseases such as heart disease, bipolar disorder, and diabetes. These diseases are called "multifactorial" because they are affected by variation in many different genes and also by the environment (diet etc.). A major unsolved question about multifactorial diseases is which DNA sequences can influence their occurrence. While we think of the genes that code for the proteins of the body as being the important parts of the genome, these protein-coding genes form less than five per cent of the DNA in the genome. So, in considering the determination of disease, there is clearly an important question, which is whether the remaining 95%+ of DNA sequences that do not code for proteins are important in making us what we are, such that changes in these DNAs may well be a cause of these multifactorial diseases. One view is that most of the DNA that does not code for proteins is "junk DNA", DNA that has accumulated over evolution, but which serves no purpose today. Another view is that the so-called "junk DNA" is merely DNA whose functions have not yet been discovered. If so, it may well be that it is in this DNA that the determinants of human multifactorial disease are found.
The DNAs postulated to potentially be "junk DNA" are mainly DNAs derived from mobile DNA ancestors, DNA sequences which were once parasites in the chromosomes of ancestral mammals. Our results, revealing the historical changes to these sequences, and seeing whether they have now evolved activities and functions that are useful to their human hosts, are relevant to whether these should now be seen as "junk DNAs". If they still have no useful function, they can be eliminated as candidate DNA sequences that might be responsible for human disease, thus narrowing down the search for medically important DNAs, among those that do not code for proteins, to those that do not have ancestry as mobile sequences.
Sectors Healthcare

 
Description The finding will be of interest to those who have an academic interest in the evolution of the genome. There is a debate about the importance of those parts of the genome outside structural genes. If these non-genic DNAs are selectively important, then mutations in them will be subject to purifying selection. In the human context, this purifying selection will typically manifest itself as a medical outcome.
First Year Of Impact 2011
Impact Types Cultural