Further development of the PSIPRED server into an integrated tool for systems biology and functional genomics researchers

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Computer Science

Abstract

The completion of the first draft of the human genome in 2001, after years of effort, was heralded as a major breakthrough that would finally enable researchers throughout the world to answer intriguing and elusive questions relating to the mechanisms that govern complex biological processes. Now the genome of a human can be sequenced in a matter of weeks and we will soon have the complete genomes of many thousands of different organisms. The hope is that the information generated from this explosion of genome data worldwide will be harnessed to further our understanding and applied to beneficial and therapeutic use through computer-aided biological research. Most genes code for specific proteins which have useful functions in the body. Proteins are essentially strings of simpler molecules, called amino acids, and these strings can self-assemble into a complex 3-D structure as soon as the protein is formed by the protein-making machinery (ribosomes) in the cell. It is this unique structure which determines the precise chemical function of the protein (i.e. what it does in the cell and how it does it). By firing X-rays at crystallised proteins, scientists can determine their structure, but this process can take many months or even years. With hundreds of thousands of proteins for which the native structure is unknown, it is not surprising that scientists want to find a clever shortcut to working out the structure of proteins. We, like many other scientists, have been trying to 'crack the code' of protein structure, i.e. working out the rules which govern how the protein finds its unique structure and then trying to program a computer with these rules to allow scientists to quickly 'predict' what the structure of their protein of interest might be. The PSIPRED service is a collection of Web servers maintained at UCL which does just this: it allows biologists to predict protein structure from amino acid sequence. 
Over the years it has helped many thousands of scientists with their work by providing these services, and we now wish not only to upgrade and maintain these existing servers but also to implement new methods which allow the structures of even the most difficult proteins to be deduced by computer simulations. More recently, for example, we have been building upon the original PSIPRED service to cover other important problems in biology. Probably the biggest of these problems is the prediction of the biological function of sequenced genes. Relationships between protein structure and function have been well documented over the last 30 years; however, the diversity and complexity presented by nature pose several challenging problems. Gene products from different species may exhibit matching biological functions but show little or no sequence similarity, perhaps due to convergent evolution. It may be that, although there is little overall structural and sequence similarity between two proteins, key properties of the active sites (e.g. overall charge or approximate shape) are conserved, allowing similar functions to be carried out. Analyses of functional regions within protein structures on a large scale will not only allow the development of more reliable genome annotation tools but also enhance the knowledge base of the biological role of proteins at a cellular level. Such understanding will be a key stepping stone in the development of techniques and pharmaceuticals to target diseased genes and their products as well as proteins from pathogenic organisms.

Technical Summary

The Jones Group at University College London has been maintaining a suite of web-based tools based on a number of cutting-edge protein structure prediction methods since 1999. The methods allow users to predict a variety of protein structural features, including secondary structure and natively disordered regions, protein domain boundaries and 3D models of tertiary structure. More recently we have been developing new services to assist users in predicting gene function and protein-protein interactions, all of which we believe are vital developments to make PSIPRED more useful to systems biologists. The current web servers employ a number of features to help users become familiar with the software, e.g. online tutorials and a common look and feel. However, we have until now stopped short of fully integrating the suite of tools; this will be addressed in the proposed project. These developments would result in the only single server worldwide which provides all of the following prediction services to biologists: comparative modelling, fold recognition, ab initio (new fold) prediction, transmembrane protein structure prediction, disorder prediction, domain boundary prediction, binding hotspot prediction, ligand binding site prediction, and several novel approaches to gene function prediction. In addition to maintaining and improving the usability of the PSIPRED services, we also plan to add important new functionality. The main area we wish to address is dealing with high-throughput sequencing data efficiently, providing users with functional and structural information relating to sequence variations in large data sets. In addition we will develop new approaches to predicting ligand-binding sites and new transmembrane prediction tools.

Planned Impact

SUMMARY OF RESOURCE

This proposal is to maintain and further develop a set of Web-accessible tools and services that has been developed at UCL (the PSIPRED server portal). This portal provides a wide variety of tools to the general biomedical research community, and is available for use to both academic and commercial researchers. In many independent tests, these tools have proven to be amongst the very best worldwide, and are even used by other resources around the world as part of their own pipelines and workflows.

IMPACT OVERVIEW

The PSIPRED portal was used a total of 183,000 times in the last year, and had nearly 85,000 unique visitors. Users are spread across the globe, with 22% of users coming from the US and 21% of users from the UK. This testifies to the importance of this resource, particularly to the UK bioscience community. Users typically also come from a wide variety of scientific research areas. Based on our user support enquiries and user surveys, we can identify users in areas across the whole BBSRC remit e.g. bio-energy, ageing research, biotechnology, synthetic biology, vaccine design, plant biology, animal health and even nanotechnology. In summary, the immediate beneficiaries of this research are the broad community of experimental biologists needing additional functional or structural clues for proteins of interest. Both academic and industry scientists will benefit in a similar way as the results of this research will be available freely to all users. Commercial scientists with sensitive data will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the structure or function of uncharacterised proteins can have significant impact in the broad variety of areas mentioned above. 
Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and their products do, and how the proteins interact, can have wider implications for understanding the workings of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms. We also note that many users of our servers use the resources for teaching purposes. It is clearly vital that, for maximum impact, the next generations of graduates and postgraduates in the biosciences be trained in advanced computational biology techniques. We are therefore pleased that our tools, because of our focus on good quality visual output and speed of returning jobs, find use in teaching laboratories around the world.

Funded Value:

£302,891

Funded Period:

Sep 11 - Sep 14

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/I026014/1

Principal Investigator:

David Jones

Research Subject:

Biomolecules & biochemistry (56%)

Omic sciences & technologies (14%)

Tools, technologies & methods (14%)

Research Topic:

Protein expression (28%)

Protein folding / misfolding (14%)

Proteomics (14%)

Structural biology (14%)

Theoretical biology (14%)

Organisations

UNIVERSITY COLLEGE LONDON (Lead Research Organisation)

People	ORCID iD
David Jones (Principal Investigator)

Publications

Buchan DW (2013) Scalable web services for the PSIPRED Protein Analysis Workbench. in Nucleic Acids Research

Lewis T (2015) Genome3D: exploiting structure to help users understand their sequences. in Nucleic Acids Research

Lewis TE (2013) Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. in Nucleic Acids Research

Key Findings
Impact Summary
Software and Technical Products


Description	Through the work funded by this grant, the PSIPRED server for bioinformatics analysis of proteins has been substantially overhauled and streamlined, with many new features added. Firstly, the web server implementation has been improved by rationalising the user interface. Several different landing pages, corresponding to different packages included in the PSIPRED server, have been replaced by one home page for sequence-based analysis and one home page for access to structure-based tools. This corresponds to a more centralised organisation of the code in the server's backend, which made it possible to eliminate redundant runs of time-consuming and memory-intensive software in response to complex user requests. In parallel to this, the presentation of results to users has been greatly improved, for example by providing a Summary page that offers an overview of the results, as well as clickable tabs with detailed sections about the output obtained by individual analysis tools. Also, following feedback from the user community, asynchronous web services were implemented within the PSIPRED server. These allow users who need to perform analysis on a larger scale to submit jobs to the server using automated programs, without having to go through the web interface manually. The analysis options of the server have been enhanced, especially in the fields of membrane protein analysis and automated domain-based homology modelling, using new software that became available in the group. From the technical point of view, job execution has been accelerated with the addition of new data processing machines and new data caching algorithms, so that the PSIPRED server can now count on 40 dedicated CPUs (and a total of 160GB memory) for processing user jobs and remains one of the fastest comparable services in the world. 
Additionally, several adjustments have been put in place that make the server much more resilient to several kinds of external problems, such as power failures and network breakages. These include frontend and backend code that automatically compensates for internal hardware failures, and automated procedures that take care of emergency situations and catastrophic events such as power shutdowns.
Exploitation Route	The PSIPRED server remains a popular workbench of bioinformatics tools, and it can be easily and freely accessed by researchers via its web interfaces. It is used around 800 times a day by researchers around the world. Moreover, the current resilient design means that the server requires minimal maintenance, which will make it easy for other researchers in the group to both maintain and streamline the server further. It will also free up time to consider possible major improvements, or further development of the server as necessary to cope with a fast-changing field of research. One example of this is the ongoing scrutiny of the performance of multiple sequence analysis using the popular BLAST+ package. Due to the increasing size of public biological (and especially sequence) databases, the running time and memory requirements of this package, which is used frequently by the PSIPRED server, may eventually hamper the smooth operation of the server; for this reason, alternative solutions may be sought and included at a later time. Also, the core software developed in this grant has been made freely available to the community under appropriate open source licenses so that it can be easily exploited or improved by other researchers. This should lead to even better (free) tools for the research community, either through work from our own lab or from collaborative efforts around the world.
Sectors	Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
URL	http://bioinf.cs.ucl.ac.uk/psipred
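
As an illustration only, the asynchronous job submission and the avoidance of redundant runs mentioned in the description above can be sketched as a submit-then-poll pattern with a cache for a shared expensive step. This is a minimal in-memory Python sketch under stated assumptions, not the PSIPRED implementation or its API: `ToyJobServer`, `wait_for`, the tool names, and the placeholder computation are all hypothetical.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


class ToyJobServer:
    """Hypothetical in-memory stand-in for an asynchronous analysis service."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}   # job id -> Future holding the tool result
        self._cache = {}  # sequence digest -> Future of the shared expensive step
        self.expensive_runs = 0

    def _expensive_step(self, sequence):
        # Placeholder for a costly computation shared by several tools
        # (assumption: something like a database search over the sequence).
        time.sleep(0.05)
        return len(sequence)

    def _run_tool(self, shared, tool):
        # Each tool reuses the shared result instead of recomputing it.
        return f"{tool}: processed {shared.result()} residues"

    def submit(self, sequence, tool):
        """Return a job id immediately; the work continues in the background."""
        key = hashlib.sha256(sequence.encode()).hexdigest()
        if key not in self._cache:
            # Only the first request for a given sequence triggers the costly step.
            self.expensive_runs += 1
            self._cache[key] = self._pool.submit(self._expensive_step, sequence)
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = self._pool.submit(self._run_tool, self._cache[key], tool)
        return job_id

    def status(self, job_id):
        return "complete" if self._jobs[job_id].done() else "running"

    def result(self, job_id):
        return self._jobs[job_id].result()


def wait_for(server, job_id, interval=0.01, timeout=5.0):
    """Poll the server until the job completes, then fetch its result."""
    deadline = time.monotonic() + timeout
    while server.status(job_id) != "complete":
        if time.monotonic() > deadline:
            raise TimeoutError(job_id)
        time.sleep(interval)
    return server.result(job_id)


if __name__ == "__main__":
    server = ToyJobServer()
    sequence = "MKTAYIAKQR"  # made-up example sequence
    job_ids = [server.submit(sequence, tool)
               for tool in ("secondary_structure", "disorder")]
    for job_id in job_ids:
        print(job_id, wait_for(server, job_id))
    print("expensive runs:", server.expensive_runs)
```

In this sketch the client never blocks on submission, and two analyses of the same sequence share one run of the expensive step. A real service would add persistence, authentication, and error handling.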


Description	The publicly available computational tools developed in this project are used by both academic and commercial users, and we estimate that around 15% of our 800 user jobs per day are from the commercial sector. The RA employed on this project has also received training that can be valuable both in academia and the commercial sector. Indeed, the original RA, Dr Dan Buchan, left the project to join a real estate company to help apply some of the data mining technologies he used in the project to improve the accuracy of property pricing in that commercial sector. This is an excellent example of how general IT skills can be transferred directly from an academic project in Life Sciences to a quite distinct commercial area.
First Year Of Impact	2011
Sector	Agriculture, Food and Drink; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Title	PSIPRED Server
Description	The PSIPRED Protein Sequence Analysis Workbench aggregates several UCL structure prediction methods into one location. Users can submit a protein sequence, perform the predictions of their choice and receive the results of the prediction via e-mail or the web.
Type Of Technology	Webtool/Application
Impact	The web portal is used over 800 times a day and we estimate that around 15% of this usage is from the commercial sector.
URL	http://bioinf.cs.ucl.ac.uk/psipred