Expansion and Further Development of the PSIPRED Protein Structure and Function Bioinformatics Workbench

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Computer Science

Abstract

With many genomes now completely sequenced, life scientists face the challenge of characterizing the biological role of the encoded proteins so as to advance our understanding of cell physiology. Most genes are designed to code for proteins which have useful functions in an organism. Proteins are essentially strings of simpler molecules, called amino acids, and these strings can self-assemble into a complex 3-D structure as soon as the protein is formed by the protein-making machinery (ribosomes) in the cell. It is this unique structure which determines the precise chemical function of the protein (i.e. what it does in the cell and how it does it). By firing X-rays at crystallised proteins, scientists can determine their structure, but this process can take many months or even years. With hundreds of thousands of proteins for which the native structure is unknown, it is not surprising that scientists want to find a clever shortcut to working out the structure of proteins. We, like many other scientists, have been trying to "crack the code" of protein structure, i.e. working out the rules which govern how the protein finds its unique structure and then trying to program a computer with these rules to allow scientists to quickly "predict" what the structure of their protein of interest might be.

The PSIPRED Workbench is a collection of Web servers maintained at UCL which does just this, i.e. it allows biologists to predict the structure of their protein given just its amino acid sequence. Over the years it has helped many thousands of scientists with their work by providing these services, and we now wish not only to upgrade and maintain these existing servers but also to implement new methods which allow the structures of even the most difficult proteins to be deduced by computer simulations.

More recently, however, PSIPRED has been given a wider range of features to cover other important problems in biology. For example, using PSIPRED, a scientist can predict which proteins do not fold into stable shapes (called disordered proteins) or which chemical substances are likely to bind to a protein. Even where a protein does not appear to fold into a single stable structure, PSIPRED can still help scientists deduce what the function of their protein is likely to be. Generating such information on a large scale using computer algorithms can help expand our knowledge base of the biological role of proteins at a cellular level, and such understanding will be a key stepping stone in the development of techniques and pharmaceuticals to target diseased genes and their products as well as proteins from pathogenic organisms such as bacteria or viruses. In a similar way, knowledge of the function of certain bacterial genes can, for example, help develop new industrial processes by modifying the genes to make them produce novel chemical compounds, or even help to detoxify industrial waste by producing friendly bacteria that can use the poisonous chemicals as food.

Technical Summary

The Jones Group at UCL has been developing a widely-used suite of web-based tools based on a number of cutting edge protein structure prediction methods since 1998. The methods allow users to predict a variety of protein structural features, including secondary structure and natively disordered regions, protein domain boundaries and 3D models of tertiary structure. More recently we have been developing new services to assist users in predicting gene function and protein-protein interactions - all of which we believe are vital developments to make PSIPRED an essential and unique tool for biologists.

PSIPRED employs a number of features to help users become familiar with the software e.g. online tutorials and common look and feel. Through work done in the original BBR grant, we have successfully integrated our suite of tools, resulting in the only single site worldwide which, after learning one simple user interface, provides all of the following prediction services to biologists: comparative modelling, fold recognition, ab initio (new fold) prediction, transmembrane protein structure prediction, disorder prediction, domain boundary prediction, binding hotspot prediction, ligand binding site prediction, and several novel approaches to gene function prediction.

These improvements have come at a price - namely a doubling of usage over the past two years, which has caused significant strain on the service. We plan to address this by completely redesigning the server architecture. In doing this, we also plan to rationalise our software engineering processes by rewriting our code base according to best industry practice, and making the code easily accessible and modifiable to 3rd party developers.

In addition to improving the architecture and code base, we also plan to add important new functionality such as co-evolution analysis, new transmembrane protein modelling tools, very fast protein fold recognition, and structure-based analysis of amino acid mutations.

Planned Impact

SUMMARY OF RESOURCE

This proposal is to maintain and further develop a set of Web-accessible tools and services that has been developed at UCL (namely the PSIPRED Workbench - originally called the PSIPRED Server) since 2001 (and at the University of Warwick since 1998). This portal provides a wide variety of very well-known tools (e.g. PSIPRED/DISOPRED/GenTHREADER/MEMSAT/FFPRED) to the general life science research community, and is available for use (free of charge) to both academic and commercial researchers. In many independent tests (e.g. every CASP experiment since 1994), these tools have proven to be amongst the very best worldwide, and are widely used by other resources around the world as part of their own pipelines and workflows. The PSIPRED Workbench is probably one of the most widely accepted and used bioinformatics resources that is operated from a UK University, and is frequently referenced in many textbooks and training courses. The close association between a world-class bioinformatics research group and such a widely-used tool means that the methods are kept fully up to date with changing technological and demand-based trends.

IMPACT OVERVIEW

The PSIPRED portal was used a total of 134,127 times in the last year, with over 950 jobs handled per day during busy periods, and had over 62,000 unique visitors. Although the job count is lower than it was 3 years ago, the total number of jobs has more than doubled due to the fact that users can now generate multiple requests in a single session thanks to work done in the original grant. Users are spread further across the globe than before, with 18% of users coming from the US and 9% of users from the UK. This testifies to the importance of this resource, particularly to the UK bioscience community given the ratio of researcher headcounts in the two countries. Users typically also come from a wide variety of scientific research areas. Based on our user support enquiries and user surveys, we can identify users in areas across the whole BBSRC remit e.g. bio-energy, ageing research, biotechnology, synthetic biology, vaccine design, plant biology, animal health and even nanotechnology.
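The usage figures quoted above can be put on a common per-day footing with some simple arithmetic. The following is only an illustrative sketch: it assumes that each of the 134,127 uses corresponds to one job, that the year had 365 days, and it treats "over 62,000" unique visitors as exactly 62,000. The variable names are hypothetical and not part of any PSIPRED software.

```python
# Figures quoted in the text above.
USES_PER_YEAR = 134_127
PEAK_JOBS_PER_DAY = 950      # "over 950 jobs handled per day during busy periods"
UNIQUE_VISITORS = 62_000     # "over 62,000", so treated here as a lower bound

# Assumption: the year in question had 365 days.
DAYS_PER_YEAR = 365

# Assumption: one "use" of the portal equals one job.
mean_jobs_per_day = USES_PER_YEAR / DAYS_PER_YEAR

# How much busier a peak day is than an average day.
peak_to_mean = PEAK_JOBS_PER_DAY / mean_jobs_per_day

# Upper bound on uses per visitor, because the visitor count is a lower bound.
uses_per_visitor = USES_PER_YEAR / UNIQUE_VISITORS

print(f"mean jobs per day:  {mean_jobs_per_day:.1f}")
print(f"peak-to-mean ratio: {peak_to_mean:.2f}")
print(f"uses per visitor:   {uses_per_visitor:.2f}")
```

Under these assumptions the service handles roughly 370 jobs on an average day, so a busy period is between two and three times the average load.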

In summary, the immediate beneficiaries of this research are the broad community of experimental biologists needing additional functional or structural clues for proteins of interest. Both academic and industry scientists will benefit in a similar way as the results of this research will be available freely to all users. Commercial scientists with sensitive data, or companies wishing to release closed-source code, will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the structure or function of uncharacterised proteins can have significant impact in the broad variety of areas mentioned above.

Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.

We also note that many users of our servers use the resources for teaching purposes. It's clearly vital that for maximum impact, the next generations of graduates and postgraduates in the biosciences be trained in advanced computational biology techniques. We are therefore pleased that our tools, because of our focus on good quality visual output and speed of returning jobs, find use in teaching laboratories around the world.

Funded Value:

£417,813

Funded Period:

May 15 - May 20

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/M011712/1

Principal Investigator:

David Jones

Research Subject:

Omic sciences & technologies (24%)

Tools, technologies & methods (72%)

Research Topic:

Bioinformatics (24%)

Proteomics (24%)

Theoretical biology (24%)

eScience (24%)

Organisations

UNIVERSITY COLLEGE LONDON (Lead Research Organisation)

People	ORCID iD
David Jones (Principal Investigator)

Publications

Buchan D (2019) The PSIPRED Protein Analysis Workbench: 20 years on. in Nucleic Acids Research

Buchan DWA (2017) EigenTHREADER: analogous protein fold recognition by efficient contact map threading. in Bioinformatics (Oxford, England)

Buchan DWA (2020) Learning a functional grammar of protein domains using natural language word embedding techniques. in Proteins

Buchan DWA (2018) Improved protein contact predictions with the MetaPSICOV2 server in CASP12. in Proteins

Greener J (2018) Design of metalloproteins and novel protein folds using variational autoencoders

Greener JG (2018) Design of metalloproteins and novel protein folds using variational autoencoders. in Scientific reports

Ian Sillitoe (2019) Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation

Kosciolek T (2017) Predictions of Backbone Dynamics in Intrinsically Disordered Proteins Using De Novo Fragment-Based Protein Structure Predictions. in Scientific reports

Sillitoe I (2020) Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation. in Nucleic acids research

Key Findings
Impact Summary


Description	The PSIPRED Workbench is an internationally recognized collection of Web servers maintained at UCL which allows biologists to predict the structure and functions of their protein sequences. Over the years it has helped many thousands of scientists with their work by providing these services, and by means of this grant we were able not only to upgrade and maintain these existing services but also to implement new methods which allow the structures of even the most difficult proteins to be deduced by computer simulations. At the start of the grant, a prediction would be returned to the user within a few hours to a day. Thanks to the algorithmic and hardware developments enabled by this grant, results are now usually returned within a minute or two, a speedup of over 50 times. This makes the whole user experience vastly better and also allows a larger number of calculations to be carried out in the same time, which is essential given the huge increases in the sizes of data sets that biologists study these days. By recoding the whole system in a more modern Python-based framework, the reliability and ease of maintenance have also hugely improved. This means that it is far easier now for us to deal with problems with a very small staff compared to major centres such as the European Bioinformatics Institute. This massively increases the value for money for our service whilst providing better functionality for users. These improvements have also improved reliability, both in terms of server uptime and in terms of the number of failed analyses for users (currently below 0.01% of analyses fail). Alongside these technical upgrades we also rewrote the documentation for the predictive methods, making it more up to date. In addition to the key reliability and speed improvements made on the service, we also implemented important new functionality which involved completing published research work.
All of these new methods are available both as new PSIPRED server features and as open-source software, which can benefit other scientists who wish to develop our methods further. The highlight of the new functionality is a new suite of co-evolution based predictive methods now hosted on the web server. DeepMetaPSICOV, the third iteration of our contact prediction suite, allows users to accurately predict protein intra-chain contacts from families of protein sequences. We additionally published a new method called EigenTHREADER, which allows users to search a library of these contact maps to recognise protein folds in a homology-free manner. Building on the contact prediction work, we have a new protein structure prediction method, DMPfold, which uses co-evolution analysis to build very accurate protein 3D models. This method also has excellent performance in the task of predicting the structure of membrane proteins. Also, the 4th release of the PSIPRED secondary structure prediction method was released in 2017. This new method, with a modified, deeper neural network architecture, achieves a predictive accuracy of 84.2%, at least matching the state of the art in secondary structure prediction.
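To make two of the terms used above concrete, the sketch below shows what an intra-chain contact map is and how a per-residue secondary structure accuracy can be computed. It is not the code of DeepMetaPSICOV, EigenTHREADER or PSIPRED. It assumes the common convention that two residues are in contact when their representative atoms lie within 8 angstroms, ignores pairs fewer than 3 positions apart, and assumes the accuracy is a three-state (helix H, strand E, coil C) percentage; the text does not specify any of these. All function names and toy data are hypothetical.

```python
import math

# Assumption: contact cutoff in angstroms (a common convention, not from the text).
CONTACT_CUTOFF = 8.0
# Assumption: ignore residue pairs closer than this along the chain.
MIN_SEPARATION = 3


def contact_map(coords, cutoff=CONTACT_CUTOFF, min_sep=MIN_SEPARATION):
    """Return a symmetric boolean matrix marking residue pairs in contact."""
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical toy coordinates (angstroms), one point per residue.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]
toy_map = contact_map(toy_coords)
print(toy_map[0][4])  # residues 0 and 4 are about 5.4 angstroms apart

# Hypothetical predicted and observed secondary structure strings.
print(three_state_accuracy("HHHHCCEEEE", "HHHCCCEEEE"))
```

A contact predictor aims to estimate such a matrix from sequence information alone, and a fold recognition method that works on contact maps compares a predicted map against maps computed from known structures in this way.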
Exploitation Route	The immediate beneficiaries of this research are the broad community of experimental biologists needing additional functional or structural clues for proteins of interest. Both academic and industry scientists will benefit in a similar way as the results of this research will be available freely to all users. Commercial scientists with sensitive data, or companies wishing to release closed-source code, will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the structure or function of uncharacterised proteins can have significant impact in the broad variety of areas mentioned above. We have made all the code for the new web server open source for the first time during this grant, which allows interested 3rd parties to install or mirror our entire server. Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms. We also note that many users of our servers now use the resources for educational purposes. It's clearly vital that for maximum impact, the next generations of graduates and postgraduates in the biosciences be trained in advanced computational biology techniques. We are therefore pleased that our tools, because of our focus on good quality visual output and speed of returning jobs, find use in teaching laboratories around the world.
Sectors	Agriculture, Food and Drink; Education; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology


Description	One key impact of this grant, I would say, is the training and career development of the original RA, Daniel Buchan. Through his work on the grant, Daniel has moved on to a Lectureship at Goldsmiths College. Beyond this, our user support requests show a high uptake of our services in the pharma and biotech industries. For a grant of this type it is impossible to identify specific applications where the services have had an impact; however, we estimate that approximately 10% of our serv
First Year Of Impact	2017
Sector	Agriculture, Food and Drink; Chemicals; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
Impact Types	Economic