Accelerating and enhancing the PSIPRED Workbench with deep learning

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Computer Science

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Planned Impact

SUMMARY OF RESOURCE

This proposal is to maintain and further develop a set of Web-accessible tools and services that has been developed at UCL (namely the PSIPRED Workbench - originally called the PSIPRED Server) since 2001 (and at the University of Warwick since 1998). This portal provides a wide variety of very well-known tools (e.g. PSIPRED/DISOPRED/GenTHREADER/MEMSAT/FFPRED) to the general life science research community, and is available for use (free of charge) to both academic and commercial researchers. In many independent tests (e.g. every CASP experiment since 1994), these tools have proven to be amongst the very best worldwide, and are widely used by other resources around the world as part of their own pipelines and workflows. The PSIPRED Workbench is probably one of the most widely accepted and used bioinformatics resources that is operated from a UK University, and is frequently referenced in many textbooks and training courses. The close association between a world-class bioinformatics research group and such a widely-used tool means that the methods are kept fully up to date with changing technological and demand-based trends.

IMPACT OVERVIEW

The PSIPRED portal was used over 170,000 times in the last year, with nearly 1000 jobs handled per day during busy periods, and has over 5,200 unique visitors per month. The overall usage is up 20% since our last application to the BBR Fund, which demonstrates a clear growth in demand. Users are spread further across the globe than before, with 18% of users coming from the US and 9% of users from the UK. This testifies to the importance of this resource, particularly to the UK bioscience community given the ratio of researcher headcounts in the two countries. Users typically also come from a wide variety of scientific research areas. Based on our user support enquiries and user surveys, we can identify users in areas across the whole BBSRC remit e.g. bio-energy, ageing research, biotechnology, synthetic biology, vaccine design, plant biology, animal health and even nanotechnology.

In summary, the immediate beneficiaries of this research are the broad community of experimental biologists needing additional functional or structural clues for proteins of interest. Both academic and industry scientists will benefit in a similar way, as the results of this research will be available freely to all users. Commercial scientists with sensitive data, or companies wishing to release closed-source code, will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the structure or function of uncharacterised proteins can have significant impact in the broad variety of areas mentioned above.

Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.

We also note that many users of our servers use the resources for teaching purposes. It's clearly vital that for maximum impact, the next generations of graduates and postgraduates in the biosciences be trained in advanced computational biology techniques. We are therefore pleased that our tools, because of our focus on good quality visual output and speed of returning jobs, find use in teaching laboratories around the world.

Funded Value:

£80,723

Funded Period:

Apr 22 - Sep 25

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/T019379/2

Principal Investigator:

Daniel Buchan

Research Subject:

Omic sciences & technologies (14%)

Tools, technologies & methods (77%)

Research Topic:

Bioinformatics (35%)

Proteomics (14%)

Theoretical biology (14%)

Tools for the biosciences (14%)

eScience (14%)

Organisations

People	ORCID iD
Daniel Buchan (Principal Investigator)

Publications

Lau AM (2023) Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. in Nature Communications

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
BB/T019379/1			30/09/2020	12/04/2022	£113,803
BB/T019379/2	Transfer	BB/T019379/1	13/04/2022	29/09/2025	£80,724

Key Findings
Impact Summary
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products


Description	The award funds a free, open-source web service which offers AI methods for predicting structural and functional features of proteins. The service is open to the public and any interested user, and is used by researchers in the biosciences around the world to accelerate their work. We estimate that the web service contributes to thousands of research studies across the globe, helping to save money and resources by allowing researchers to better design their biochemistry and molecular biology experiments.
Exploitation Route	We know that the predictions our web server produces are used by biochemists and molecular biologists working in all topics in these fields, and that our service and the software it provides are critical to the work of many other researchers. The service is additionally used extensively as a teaching tool by academics in institutions across the world. We also know that researchers frequently download our software to complete large-scale studies of protein structure and function that are not possible using the web site.
Sectors	Education, Healthcare, Manufacturing including Industrial Biotechnology, Other


Description	Our web service is used by researchers around the world, many of whom are not academic researchers. The service is also used widely in educational settings.
Sector	Education, Pharmaceuticals and Medical Biotechnology, Other
Impact Types	Economic, Policy & public services


Title	Online access to GsRCL prediction tool
Description	Gaussian noise augmentation-based scRNA-seq contrastive learning method (GsRCL) is a deep learning method designed for cell type identification from transcriptomics data. A large-scale computational evaluation suggests that GsRCL successfully outperformed other state-of-the-art predictive methods on difficult cell type identification tasks, while the conventional random genes masking augmentation-based contrastive learning method also improved the accuracy of easy cell type identification tasks in general.
Type Of Material	Improvements to research infrastructure
Year Produced	2025
Provided To Others?	Yes
Impact	At the time of writing this tool has only just been made available (as of March 2025) on our web server so we are not able to assess the impact at this point.
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	Online access to Merizo Search prediction tool
Description	Merizo-search is a method that builds on the original Merizo (Lau et al., 2023) by combining state-of-the-art domain segmentation with fast embedding-based searching. Specifically, Merizo-search makes use of an EGNN-based method called Foldclass, which embeds a structure and its sequence into a fixed size 128-length vector. This vector is then searched against a pre-encoded library of domains, and the top-k matches in terms of cosine similarity are used for confirmatory TM-align runs to validate the search. Merizo-search also supports searching larger-than-memory databases of embeddings using the Faiss library.
Type Of Material	Improvements to research infrastructure
Year Produced	2025
Provided To Others?	Yes
Impact	At the time of writing this tool has only just been made available (as of Jan 2024) on our web server so we are not able to assess the impact at this point.
URL	http://bioinf.cs.ucl.ac.uk/psipred/
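The embedding search step described above (rank a pre-encoded library by cosine similarity to the query vector and keep the top-k matches) can be illustrated with a minimal pure-Python sketch. It is not the Merizo-search or Foldclass implementation: the function names, the toy library and the 4-dimensional vectors standing in for the 128-length embeddings are all hypothetical, and the confirmatory TM-align step and the Faiss-based search are not shown.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k_matches(
    query: Sequence[float],
    library: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the query, best first."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in library.items())
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Hypothetical toy library; real embeddings would be 128-length vectors.
toy_library = {
    "domain_A": [1.0, 0.0, 0.0, 0.0],
    "domain_B": [0.9, 0.1, 0.0, 0.0],
    "domain_C": [0.0, 1.0, 0.0, 0.0],
}
hits = top_k_matches([1.0, 0.0, 0.0, 0.0], toy_library, k=2)
```

In this toy case `hits` lists `domain_A` first and `domain_B` second. In the workflow described above, such candidates would then be passed on for structural validation rather than reported directly.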


Title	Online access to Merizo prediction tool
Description	Merizo is a fast and accurate deep learning method for domain segmentation in complex protein structures. Notably, it makes use of invariant point attention (IPA) to read a protein structure into a latent representation. Domains are predicted via an affinity learning approach in which residues belonging to the same domain are encouraged towards similar embeddings, and residues belonging to different domains are discouraged from having similar embeddings.
Type Of Material	Improvements to research infrastructure
Year Produced	2024
Provided To Others?	Yes
Impact	At the time of writing this tool has only just been made available (as of late 2024) on our web server so we are not able to assess the impact at this point.
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	Online access to S4Pred prediction tool
Description	S4PRED is a tool for the accurate prediction of a protein chain's secondary structure. Unlike PSIPRED, it is optimized to predict using only the amino acid chain of the protein in question, without relying on any additional evolutionary information.
Type Of Material	Improvements to research infrastructure
Year Produced	2023
Provided To Others?	Yes
Impact	At the time of writing this tool has only just been made available (as of March 2023) on our web server so we are not able to assess the impact at this point.
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	Foldclass databases for protein structural domains in CATH and TED
Description	This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3. Foldclass and Merizo-search use two formats for databases. The default format uses a PyTorch tensor and a pickled list of Python tuples to store the data. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library. The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, the database should be stored on the fastest storage hardware available to you. IMPORTANT: We recommend going into each folder and downloading the files individually; if you attempt to download a whole folder in one go, you will receive a zip file which will need to be decompressed. This is particularly an issue when downloading the TED database, as you will need roughly twice the storage space compared to downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
Impact	At the time of writing this tool has only just been made available (as of early 2025) on our web server so we are not able to assess the impact at this point.
URL	https://rdr.ucl.ac.uk/articles/dataset/Foldclass_databases_for_protein_structural_domains_in_CATH_an...
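The storage advice above can be turned into a quick pre-download check. The following is a minimal sketch using only the Python standard library; it is not the convenience script from the GitHub repository. The function names are hypothetical, the sizes are the approximate figures quoted in the description, and decimal gigabytes (10^9 bytes) are assumed.

```python
import shutil

# Approximate sizes quoted in the description above, in GB.
APPROX_SIZE_GB = {"cath": 1.4, "ted": 885.0}


def required_gb(database: str, as_zip: bool = False) -> float:
    """Space needed for a database; roughly doubled if fetched as one zip."""
    size = APPROX_SIZE_GB[database]
    return size * 2 if as_zip else size


def has_enough_space(path: str, database: str, as_zip: bool = False) -> bool:
    """Compare free space on the filesystem holding `path` with the estimate."""
    free_gb = shutil.disk_usage(path).free / 1e9  # assumes 1 GB = 10^9 bytes
    return free_gb >= required_gb(database, as_zip)


# Example: check the current directory before fetching the CATH database.
cath_ok = has_enough_space(".", "cath")
```

Calling `required_gb("ted", as_zip=True)` gives 1770.0, which reflects the "roughly twice the storage" warning for whole-folder zip downloads.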


Description	Elixir-UK Node Service
Organisation	ELIXIR
Department	ELIXIR UK
Country	United Kingdom
Sector	Charity/Non Profit
PI Contribution	The project provides a highly used bioinformatics resource to the worldwide community and European researchers. Our collaboration allies our service with other such services and resources across Europe to provide a commonly supported collaborative network.
Collaborator Contribution	ELIXIR is a Europe-wide initiative that allows life science laboratories, resources and services across Europe to share and store their research data as part of an organised network. ELIXIR Europe provides the management and infrastructure to enable this coordination, with the ultimate goal of providing a single infrastructure to make exchange of data, expertise and best practices easier.
Impact	No direct outcomes from this collaboration at this point of the project
Start Year	2021


Title	DMPFold 2.0 Web server
Description	DMPfold uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces accurate residue-residue contact predictions even for shallow sequence alignments, and works just as well for transmembrane proteins.
Type Of Technology	Webtool/Application
Year Produced	2024
Open Source License?	Yes
Impact	At the time of writing we don't have statistics on DMPfold2 usage. Prior usage of DMPfold was of the order of 3,000 analyses per year.
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	DMPmetal web server
Description	DMPmetal is a deep learning-based method for predicting metal binding sites from amino acid sequences. It follows the approach of using a large (1.2 billion parameter) pre-trained transformer encoder protein language model (pLM) to embed the target sequences and to provide the features for a simple feed-forward classifier. One difference from many other pLMs is that the DMPmetal pLM was jointly pre-trained on both sequences and structures through training on the UniRef50 subset of the AlphaFold Database (Varadi et al., 2022). From a user perspective, the input to the model is a single amino acid sequence, and the output probabilities relate to each of the 29 ChEBI metal codes.
Type Of Technology	Webtool/Application
Year Produced	2024
Open Source License?	Yes
Impact	At the time of writing the method does not have usage statistics available
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	GsRCL web application
Description	Gaussian noise augmentation-based scRNA-seq contrastive learning method (GsRCL) is a contrastive deep learning method which attempts to learn discriminative feature representations for cell type identification tasks. The input is transcriptomics data over multiple genes for a set of unknown cells, and the output is a set of predictions of which cell types the unknown cells may be. A large-scale computational evaluation suggests that GsRCL successfully outperformed other state-of-the-art predictive methods on difficult cell type identification tasks, while the conventional random genes masking augmentation-based contrastive learning method also improved the accuracy of easy cell type identification tasks in general.
Type Of Technology	Webtool/Application
Year Produced	2025
Open Source License?	Yes
Impact	At the time of writing we do not have usage statistics for this new tool
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	Merizo Search Web Application
Description	Merizo-search is a method that builds on the original Merizo (Lau et al., 2023) by combining state-of-the-art domain segmentation with fast embedding-based searching. Specifically, Merizo-search makes use of an EGNN-based method called Foldclass, which embeds a structure and its sequence into a fixed size 128-length vector. This vector is then searched against a pre-encoded library of domains, and the top-k matches in terms of cosine similarity are used for confirmatory TM-align runs to validate the search. Merizo-search also supports searching larger-than-memory databases of embeddings using the Faiss library.
Type Of Technology	Webtool/Application
Year Produced	2025
Open Source License?	Yes
Impact	At the time of writing we do not have usage statistics for this new tool
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	Merizo web application
Description	Merizo is a deep learning-based method for predicting the location of domains in protein structures. It consists of an encoder-decoder architecture which makes use of an invariant point attention encoder, that leverages both structure coordinates as well as sequence, to generate an embedding of the model. Residues are individually assigned into domains by the decoder, which additionally handles residues not part of domains (a.k.a. non-domain residues, NDRs).
Type Of Technology	Webtool/Application
Year Produced	2024
Open Source License?	Yes
Impact	At the time of writing we do not have usage statistics for this new tool
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	ReactJS based webserver code
Description	This is an implementation of the PSIPRED webserver frontend code in the React JS framework. It replaces the old web server website, which was implemented in Ractive.
Type Of Technology	Webtool/Application
Year Produced	2024
Open Source License?	Yes
Impact	This brand new code makes it quicker and easier for us to deploy more predictive methods on our web server and greatly reduces coder time to add new methods.
URL	http://bioinf.cs.ucl.ac.uk/psipred/


Title	S4Pred online service
Description	S4PRED is a state-of-the-art single-sequence protein secondary structure method. It is used to provide accurate secondary structure modelling for the challenging but important class of proteins, single orphan proteins. Accordingly, the model takes only a protein's amino acid sequence as input, with no additional homology information, and subsequently returns 3-state secondary structure predictions for the sequence. Similarly to PSIPRED, S4PRED prediction results are presented with the confidence score, a cartoon representation, 3-state prediction assignment, and the original amino acid sequence. The model's architecture is an ensemble of five 3-layered recurrent deep neural networks. It is trained using a semi-supervised learning approach to massively supplement the available number of protein sequences that can be trained on. This results in a training set in excess of a million examples. This set combines real-labelled examples, where a sequence and its secondary structure are known, and artificially labelled examples, where only the primary amino acid sequence is known. S4PRED has a Q3 secondary structure prediction accuracy of 75.3%. This is a significant improvement over our cutting edge PSIPRED method, which achieves a Q3 accuracy of 70.6% when tested on single sequences without any provided homology information.
Type Of Technology	Webtool/Application
Year Produced	2023
Open Source License?	Yes
Impact	The tool is available on the PSIPRED workbench and is used by researchers around the world and saw 3,000 uses in the first year of operation
URL	http://bioinf.cs.ucl.ac.uk/psipred/
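The Q3 figures quoted above are per-residue accuracies over the three secondary structure states. As a minimal sketch of how such a score is computed, assuming the usual H (helix), E (strand) and C (coil) labels: the function name and the example strings below are hypothetical and are not S4PRED output.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 8-residue example: 6 of 8 positions agree.
example_q3 = q3_accuracy("HHHECCCC", "HHHHCCCE")
```

Here `example_q3` is 75.0, since six of the eight positions agree. Benchmark values such as those reported above are computed over a whole test set rather than a single chain.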


Title	The PSIPRED webserver maintenance
Description	Maintenance and bug fixes for the PSIPRED web server (http://bioinf.cs.ucl.ac.uk/psipred/) in line with the grant deliverables
Type Of Technology	Webtool/Application
Year Produced	2021
Open Source License?	Yes
Impact	Users of the service will experience fewer bugs and fewer analyses will fail. Failure rate was already below 1 in 10,000 and should now be lower
URL	http://bioinf.cs.ucl.ac.uk/psipred/