Tools for motif recognition in fungi

Lead Research Organisation: University of Liverpool
Department Name: Computer Science

Abstract

Recent technical advances in molecular genetics have made it possible to efficiently sequence the full complement of an organism's DNA (the genome), giving us the code that defines the makeup of that species. This will have many benefits in medicine, agriculture and biotechnology, as we are in a much better position to understand the fundamental processes and components that underlie the biology all around us. However, the mass of data is overwhelming, with each organism's genome extending to a code of many millions of bases. It is therefore important that we develop new and efficient approaches to uncover the coded information it contains. For a long time we have understood the basics of how a gene is arranged, and in particular how the three-letter nucleic acid code can be interpreted as an amino acid code that makes up the proteins of the cell. But this is only part of the story, and even finding genes can be difficult at times. This project aims to uncover the essential information that determines when genes are switched on and at what level. These regulatory sequences are also encoded, often near the genes themselves, but are much harder to identify. We hope to apply and develop computational analyses that allow us to rapidly compare relevant parts of a number of genome sequences simultaneously, and thereby advance our understanding of the complex higher-order code that defines gene expression. Our chosen organisms are the Aspergilli, which are responsible for various human and animal diseases, crop damage and the production of toxins, as well as playing an important role in both the biotechnology and food industries. Because of their importance, many have been sequenced, providing us with an ideal system for developing and testing new computational approaches. Tools for experimental testing of computer-generated hypotheses are also readily available.

Technical Summary

The number of full genome sequences is rapidly increasing, and with continuing advances in sequencing technology this trend is likely to continue. For the research community as a whole this is a fantastic resource. However, realising the full potential of these genome data demands the development of effective bioinformatic tools. Fundamental to this is the development of computational algorithms aimed at identifying regulatory motifs and promoter signatures, and their refinement with respect to the available genome data. Only with effective tools will the 'transcriptional regulatory code' be unravelled. Most computational motif-finding algorithms model a motif as either a string or a probability matrix, although new algorithms combining both approaches are emerging. We have been involved in some of the latest developments. First, we will test and develop these approaches on Aspergillus genomes with the aim of optimising their use. Secondly, we will apply AI techniques to investigate whether DNA features such as flexibility and nucleosome positioning can be used to differentiate between functional and non-functional motifs. Finally, we aim to utilise comparative genomics and transcriptomics data sets to investigate the motifs, associated features and higher-order patterns that determine transcriptional regulation. Novel motifs, and their mappings against Gene Ontology terms, will offer valuable clues to the functions of presently unannotated proteins. This project focuses on sequenced Aspergillus genomes, nine of which are already available, extending over an evolutionary distance equivalent to that separating humans and fish and including organisms which are important in medicine, agriculture, food storage, biotechnology and fundamental research.
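
The distinction drawn above between modelling a motif as a string and as a probability matrix can be illustrated with a minimal Python sketch. It is not the project's algorithm: the aligned sites and the scanned sequence are invented for illustration, the function names (build_pwm, consensus, log_odds, best_hit) are hypothetical, and a uniform background of 0.25 per base with a pseudocount of 1 is assumed.

```python
import math

# Hypothetical aligned binding sites, invented for illustration only.
SITES = ["TGACTC", "TGAGTC", "TGACTA", "TGACTC"]
ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Return one dict of base probabilities per alignment column."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def consensus(pwm):
    """String model: the most probable base in each column."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in pwm)


def log_odds(pwm, window, background=0.25):
    """Matrix model: summed log2 ratio of motif to background probability."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window in sequence."""
    width = len(pwm)
    scored = [(log_odds(pwm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored)
    return start, score


PWM = build_pwm(SITES)
CONSENSUS = consensus(PWM)
```

With these assumptions, an exact string search for CONSENSUS would miss the variant site TGAGTC, whereas log_odds(PWM, "TGAGTC") still gives it a positive score, which is the practical difference between the two representations.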

Publications
