Next generation computational tools for the analysis and prediction of protein disorder and related gene function

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

With many genomes now completely sequenced, life scientists face the challenge of characterizing the biological role of the encoded proteins so as to advance our understanding of cell physiology. Over the past decades, several experimental studies reinforced the view that the three-dimensional structure of a protein is a prerequisite to its function. Evolution optimized the relative positions of specific protein atoms so that they can perform different tasks, including ligand binding and catalysis of reactions. Recently this view has been revised in light of additional data from eukaryotic species. Indeed, several observations show that a large number of their proteins include highly flexible segments, which assume a fixed conformation only when they recognize their biological partners. These fragments - or whole proteins - are usually called natively unfolded, intrinsically unstructured or disordered and are predominantly found in multi-cellular organisms, where they play key roles in signalling and regulatory processes by binding to proteins, nucleotides, nucleic acids and metal ions. The identification and functional characterization of disordered proteins has drawn increasing attention. Different assays can produce systematic information on the location of disordered regions. However, these techniques suffer from intrinsic limitations and cannot be reasonably applied to all the proteins that a typical eukaryotic organism expresses. On the other hand, thorough analyses showed that the amino acid sequences of these proteins are characterized by clear patterns, so computer programs can distinguish them fairly accurately and quickly. The classification of a protein as natively unfolded and the location of its disordered regions are valuable information. Yet this is not enough to describe in detail what molecular actions the protein performs and what biological processes they relate to.
Although we know that some functional categories are particularly enriched in disordered regions, we cannot afford to experimentally test all possible alternatives. A reasonable solution is to use computers to analyze these proteins further and then to perform far fewer lab assays to validate the results of the computational analyses. This project aims at the development of a web server that will be accessible to everyone through the Internet and that will output functional predictions - i.e. hints - for disordered proteins. The program will first locate disordered regions within the input sequence using a method we previously developed. It will then exploit the chemical and physical features of some ligands - such as DNA and metal ions - to assess the likelihood that the input protein sequence interacts with them through its potential disordered regions. The results are expected to improve our knowledge of this important class of proteins and to help prioritize experiments aimed at characterizing their functions.

Technical Summary

The main aim of this project is to make use of machine learning and simplified molecular simulations of protein disorder-order transitions in the presence of certain ligands (particularly metals, small peptides and DNA) to produce new software tools which can better predict functionally relevant disordered regions and proteins in eukaryotic genomes. The first part of this project will entail the development of an improved predictor for generic binding within disordered regions. It is clear that the largest class of function associated with protein disorder is related to the binding of ligands, peptides and proteins. It is also clear that improving our ability to distinguish between non-functional regions of disorder and such functional regions will be vital in improving the quality and usefulness of disorder prediction to the general biological community. The main novel aspect here will be the use of disordered domain linkers as control data, along with sequence analysis of evolutionarily conserved disordered regions, thought to be functional modules. The largest part of the project will entail running simulations of disordered protein segments both with and without likely binding ligands present. The ligands we intend to focus on will be metal ions, DNA and small peptides. From statistical analysis of these simulated structural ensembles, we plan to derive statistical models that can predict the likely functional class of the region under study. By integrating all of these results into a single computational tool (available via a Web server and as standalone software), we hope to produce a new generation of protein disorder prediction tool which is able not only to predict disordered regions more accurately, but also to assign functional significance to these regions and thus provide key functional insight for proteins of which a large fraction are functionally uncharacterised.

Planned Impact

The immediate beneficiaries of this research are the broad community of bench biologists interested in analysing their proteins of interest for regions of disorder, and from this deriving new insights into the possible function of the proteins. Both academic and industry scientists will benefit in a similar way as the Web services developed as a result of this research will be available freely to all users. Commercial scientists with sensitive data will be able to license the software through UCL Business so that they can exploit the tools without revealing their research interests to other users. Being able to determine even some clue as to the function of the 40% of functionally uncharacterised proteins in model organism genomes can have significant impact in a broad variety of areas e.g. drug, antibody and vaccine design, biochemical engineering, protein design and even nanotechnology. Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.

Publications

 
Description We initially extended the DISOPRED method with new machine learning based modules to predict long intrinsically disordered regions. The results of community-wide independent blind tests show that DISOPRED3 is one of the most effective tools for disordered residue prediction, and our analyses indicate that it gives considerably more accurate predictions than its predecessor. We also trained an additional predictor to classify disordered regions that fold upon protein binding, so as to provide some functional clues about the molecular activities performed by these regions. Using stringent benchmarking experiments, we showed that combining evolutionary information and other sequence-derived data about the target disordered region is more effective than using amino acid sequence or evolutionary information alone. We also found that the resulting classifier performs better than other publicly available tools for the same task.

We have also explored the conformational landscape of well-known disordered protein regions, which fold partly or completely upon binding their biological partners. This allowed us to gather summary statistics about the states sampled by such regions, including the propensity for different transient secondary structures, residue contact frequencies at different sequence separations, the level of compaction observed for each conformation, etc. We initially ran classical molecular dynamics simulations, but the critical need for fast and large-scale sampling led us to employ more coarse-grained approaches based on amino-acid specific dihedral angles observed in coil regions, and simplified geometric constraints and energy functions. We are currently finalizing the analysis of the correlations between the features we have derived and different sets of disordered regions, e.g. those involved in ligand binding, those acting as flexible linkers, and those labelled with broad functional categories in manually curated databases.
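
As an illustration only, two of the summary statistics mentioned above (level of compaction and residue contact frequency at each sequence separation) could be computed from a simulated ensemble along the following lines. This is a minimal sketch and not the project's code: the function names, the use of one coordinate per residue, equal residue masses, and the 8 Angstrom contact cutoff are all assumptions made for the example.

```python
import math
from collections import defaultdict


def radius_of_gyration(coords):
    """Compaction measure: RMS distance of residues from their centroid.

    `coords` is a list of (x, y, z) tuples, one per residue (assumed
    representation); all residues are given equal mass (assumption).
    """
    n = len(coords)
    centroid = [sum(p[axis] for p in coords) / n for axis in range(3)]
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)


def contact_frequency_by_separation(ensemble, cutoff=8.0):
    """Fraction of residue pairs in contact, grouped by sequence separation.

    `ensemble` is a list of conformations (each a list of coordinates).
    A pair (i, j) counts as a contact when its distance is <= `cutoff`;
    the 8.0 default is an illustrative choice, not a value from the project.
    """
    contacts = defaultdict(int)
    totals = defaultdict(int)
    for coords in ensemble:
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                separation = j - i
                totals[separation] += 1
                if math.dist(coords[i], coords[j]) <= cutoff:
                    contacts[separation] += 1
    return {sep: contacts[sep] / totals[sep] for sep in sorted(totals)}


# Toy ensemble: one fully extended and one compact four-residue chain.
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

frequencies = contact_frequency_by_separation([extended, compact])
```

In this toy ensemble the compact conformation has the smaller radius of gyration, and the two end residues are in contact in only one of the two conformations. On real ensembles, per-conformation values of this kind would be the inputs to the correlation analysis described above.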
Exploitation Route DISOPRED3 is already available online through the PSIPRED Protein Analysis Workbench webpages, and we expect that experimentalists involved in small-scale structural biology or functional genomics projects will find it useful to generate testable hypotheses about intrinsically disordered regions. Because the lack of experimental information on protein function is a major bottleneck in most areas of modern biology, we expect that our results can make a valuable contribution to functional annotation efforts.

The program is also available for download, and this will allow for further investigations into the implications of protein disorder at higher levels of biological complexity. Our group is keen to re-assess the correlations between broad functional categories (as described by the Gene Ontology) and patterns of predicted disordered regions and protein binding sites within them in human and other model organisms. We plan to use these data to build improved statistical models that are expected to enhance the accuracy of protein function prediction when standard approaches based on sequence similarity transfers are not viable. Furthermore, intrinsically disordered regions have been linked to the organization and re-wiring of protein-protein interaction networks, to increased proteome diversity through alternative splicing across tissues and organisms, as well as to human and animal diseases such as cancer. We would expect that DISOPRED3 and the findings of the ongoing simulation efforts will help complete this emerging picture, especially once they are combined with genome-wide datasets reporting protein-protein interaction or gene expression profiles under different conditions.
Sectors Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL http://bioinf.cs.ucl.ac.uk/psipred
 
Description The software developed in this project has recently been added to our general workbench of tools that can be used by both academics and commercial users. So far we are seeing around 50 DISOPRED3 jobs per day from users in the commercial sector (judged by IP address/domain name). The training and production of skilled research staff with appropriate transferable skills is probably the other most significant delivered item of impact from this project. Dr Domenico Cozzetto has moved on to another BBSRC funded project in the lab, and is building up an impressive CV which should lead to long term employment in either academia or industry.
First Year Of Impact 2014
Sector Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title DISOPRED 
Description A sizeable fraction of eukaryotic proteins contain intrinsically disordered regions (IDRs), which act in unfolded states or by undergoing transitions between structured and unstructured conformations. Over time, sequence-based classifiers of IDRs have become fairly accurate and currently a major challenge is linking IDRs to their biological roles from the molecular to the systems level. DISOPRED3 extends its predecessor with new modules to predict IDRs and protein binding sites within them. Based on recent CASP evaluation results, DISOPRED3 can be regarded as state of the art in the identification of IDRs, and our self-assessment shows that it significantly improves over DISOPRED2 because its predictions are more specific across the board and more sensitive to IDRs longer than 20 amino acids. Predicted IDRs are annotated as protein binding through a novel SVM-based classifier, which uses profile data and additional sequence-derived features. Based on benchmarking experiments with full cross-validation, we show that this predictor generates precise assignments of disordered protein binding regions and that it compares well with other publicly available tools.
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact None so far. 
URL http://bioinf.cs.ucl.ac.uk/psipred
 
Title FFPRED 
Description This server is designed to predict Gene Ontology Biological Process and Molecular Function terms for orphan and unannotated protein sequences. The prediction method has been optimised for performance on these 'difficult to annotate' targets using a protein feature based method that does not require prior identification of protein sequence homologues. The method is best suited to annotating proteins with general function classes rather than deciphering specific annotations of proteins with many homologues. 
Type Of Technology Webtool/Application 
Year Produced 2013 
Impact Currently used by an estimated 10-20 commercial sector users per day and 80-100 academic users. 
URL http://bioinf.cs.ucl.ac.uk/web_servers