Accelerating and enhancing the PSIPRED Workbench with deep learning

Lead Research Organisation: Goldsmiths University of London

Department Name: Computing Department

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

The Jones Group at UCL has been developing a widely used suite of web-based tools, built on cutting-edge protein structure prediction methods, since 1998. The methods allow users to predict a variety of protein structural features, including secondary structure, natively disordered regions, protein domain boundaries and 3D models of tertiary structure. More recently we have been developing new services to assist users in predicting gene function and protein-protein interactions, all of which we believe are essential developments to make PSIPRED a vital and unique tool for biologists.

PSIPRED employs a number of features to help users become familiar with the software, e.g. online tutorials. Through work done in the original BBR grant, we have successfully integrated our suite of tools. The result is the only single site worldwide that, once users have learned one simple interface, provides all of the following prediction services to biologists: comparative modelling, fold recognition, ab initio (new fold) prediction, transmembrane protein structure prediction, disorder prediction, domain boundary prediction, binding hotspot prediction, ligand binding site prediction, and several novel approaches to gene function prediction.

The key bioinformatics developments in this proposal will be to harness the power of deep learning methods to hugely accelerate the PSIPRED toolset. In particular, we plan to circumvent the need for running time-consuming databank searches using PSI-BLAST or HHblits by using sequence-sequence learning models to maintain a continuously updated embedding of UniProt. This means that, for any given sequence with a related sequence in UniProt, we will be able to extract sequence neighbourhood information almost instantaneously, compared to the minutes or even hours needed by standard sequence databank searching methods. Furthermore, we wish to provide user-friendly implementations of the recent breakthroughs in deep learning-based covariation modelling.
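
As a rough illustration of the embedding idea described above, the following Python sketch precomputes one vector per database sequence and then answers a query by ranking the stored vectors by cosine similarity, so no alignment search is run at query time. Everything in it is an assumption made for illustration: the amino-acid composition vector merely stands in for a learned model, and the function names and the three toy sequences are hypothetical and are not part of PSIPRED or UniProt.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Toy stand-in for a learned encoder (assumption): a unit-length
    amino-acid composition vector with one entry per standard residue."""
    counts = Counter(sequence.upper())
    vec = [float(counts.get(aa, 0)) for aa in AMINO_ACIDS]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(database):
    """Embed every database sequence once, ahead of any query."""
    return {name: embed(seq) for name, seq in database.items()}


def nearest(index, query, k=3):
    """Return up to k (name, cosine similarity) pairs, best match first."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), name)
        for name, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for demonstration only.
    toy_database = {
        "seqA": "MKTAYIAKQR",
        "seqB": "GGGGSGGGGS",
        "seqC": "MKTAYIAKQRQISFVK",
    }
    index = build_index(toy_database)
    for name, score in nearest(index, "MKTAYIAKQR", k=2):
        print(name, round(score, 3))
```

The design point is that the expensive step (embedding the database) happens once and is updated incrementally, while each query costs only one embedding plus a similarity ranking. A production system would presumably replace the linear scan with an approximate nearest-neighbour index, but that choice is not specified in the summary above.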

Planned Impact

SUMMARY OF RESOURCE

This proposal is to maintain and further develop a set of Web-accessible tools and services that has been developed at UCL (namely the PSIPRED Workbench, originally called the PSIPRED Server) since 2001 (and at the University of Warwick since 1998). This portal provides a wide variety of well-known tools (e.g. PSIPRED/DISOPRED/GenTHREADER/MEMSAT/FFPRED) to the general life science research community, and is available for use (free of charge) to both academic and commercial researchers. In many independent tests (e.g. every CASP experiment since 1994), these tools have proven to be amongst the very best worldwide, and they are widely used by other resources around the world as part of their own pipelines and workflows. The PSIPRED Workbench is probably one of the most widely accepted and used bioinformatics resources operated from a UK university, and is frequently referenced in textbooks and training courses. The close association between a world-class bioinformatics research group and such a widely used tool means that the methods are kept fully up to date with changing technological and demand-based trends.

IMPACT OVERVIEW

The PSIPRED portal was used over 170,000 times in the last year, with nearly 1000 jobs handled per day during busy periods, and has over 5,200 unique visitors per month. The overall usage is up 20% since our last application to the BBR Fund, which demonstrates a clear growth in demand. Users are spread further across the globe than before, with 18% of users coming from the US and 9% of users from the UK. This testifies to the importance of this resource, particularly to the UK bioscience community given the ratio of researcher headcounts in the two countries. Users typically also come from a wide variety of scientific research areas. Based on our user support enquiries and user surveys, we can identify users in areas across the whole BBSRC remit e.g. bio-energy, ageing research, biotechnology, synthetic biology, vaccine design, plant biology, animal health and even nanotechnology.

In summary, the immediate beneficiaries of this research are the broad community of experimental biologists needing additional functional or structural clues for proteins of interest. Both academic and industry scientists will benefit in a similar way, as the results of this research will be available freely to all users. Commercial scientists with sensitive data, or companies wishing to release closed-source code, will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the structure or function of uncharacterised proteins can have significant impact in the broad variety of areas mentioned above.

Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.

We also note that many users of our servers use the resources for teaching purposes. It is clearly vital that, for maximum impact, the next generations of graduates and postgraduates in the biosciences be trained in advanced computational biology techniques. We are therefore pleased that our tools, because of our focus on good-quality visual output and fast job turnaround, find use in teaching laboratories around the world.

Funded Value:

£113,802

Funded Period:

Oct 20 - Apr 22

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/T019379/1

Principal Investigator:

Daniel Buchan

Research Subject:

Omic sciences & technologies (14%)

Tools, technologies & methods (77%)

Research Topic:

Bioinformatics (35%)

Proteomics (14%)

Theoretical biology (14%)

Tools for the biosciences (14%)

eScience (14%)

People

Daniel Buchan (Principal Investigator)

Key Findings

Description	The award funds a free, open source web service that offers AI methods for predicting structural and functional features of proteins. The service is open to the public and to any interested user, and is used by researchers in the biosciences around the world to accelerate their work. We estimate that the web service contributes to thousands of research studies across the globe, helping to save money and resources by allowing researchers to better design their biochemistry and molecular biology experiments.
Exploitation Route	We know that the predictions our web server produces are used by biochemists and molecular biologists working on all topics in these fields, and that our service and the software it provides are critical to the work of many other researchers. The service is additionally used extensively as a teaching tool by academics at institutions across the world. We also know that researchers frequently download our software to complete large-scale studies of protein structure and function that are not possible using the web site.
Sectors	Education; Healthcare; Manufacturing, including Industrial Biotechnology; Other

Impact Summary

Description	Our web service is used by researchers around the world, many of whom are not academic researchers. The service is also used widely in educational settings.
Sector	Education; Pharmaceuticals and Medical Biotechnology; Other

Collaboration

Description	ELIXIR-UK Node Service
Organisation	ELIXIR
Department	ELIXIR UK
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	The project provides a highly used bioinformatics resource to the worldwide community and to European researchers. Our collaboration allies our service with other such services and resources across Europe to provide a commonly supported collaborative network.
Collaborator Contribution	ELIXIR is a Europe-wide initiative that enables life science laboratories, resources and services across Europe to share and store their research data as part of an organised network. ELIXIR Europe provides the management and infrastructure to enable this coordination, with the ultimate goal of providing a single infrastructure that makes the exchange of data, expertise and best practices easier.
Impact	No direct outcomes from this collaboration at this point in the project.
Start Year	2021

Software and Technical Products

Title	The PSIPRED web server maintenance
Description	Maintenance and bug fixes for the PSIPRED web server (http://bioinf.cs.ucl.ac.uk/psipred/) in line with the grant deliverables.
Type Of Technology	Webtool/Application
Year Produced	2021
Open Source License?	Yes
Impact	Users of the service will experience fewer bugs and fewer analyses will fail. The failure rate was already below 1 in 10,000 and should now be lower.
URL	http://bioinf.cs.ucl.ac.uk/psipred/
