Transforming the Structural Landscape of CATH to Aid Variant Analyses in Human and Agricultural Organisms and their Pathogens
Lead Research Organisation:
UNIVERSITY COLLEGE LONDON
Department Name: Structural Molecular Biology
Abstract
Proteins are Nature's molecular machines involved in most biochemical processes in living systems. Mutations in proteins can affect their stability and/or shape or chemical properties, altering their function. Knowing the 3D structure of the protein can be extremely helpful in understanding whether and how these mutations have this effect. Proteins are typically made up of multiple 'domains' - important functional modules - each associated with a distinct globular shape. Our CATH classification groups domains according to evolutionary ancestry. Relatives are recognised because they have similar structures in their core and often functional features in common, though variations outside the core can modify function. We therefore sub-classify relatives into functional families if they have highly similar structures and functions.
Experimental techniques for determining protein structures are challenging, and fewer than 1% of known proteins have experimental structures. However, AI technologies for predicting structures have been improving immensely. The best of these use information from millions of protein sequences (1D strings of molecules (residues)) to predict how proteins will fold up in 3D. The massive increase in sequence data (more than one billion sequences now known), obtained by sampling diverse environments, has empowered new methods (DeepMind's AlphaFold2) to predict model structures that are as good as experimental structures. DeepMind will provide ~138 million protein structures in 2022, ~200 times more than exist now. We will transform knowledge in our CATH evolutionary classification by bringing in this vast 3D data, and we will also bring in the sequences involved in predicting the structures. This even vaster sequence data will reveal evolutionarily conserved sites highly likely to be linked to function.
To handle this massive amount of data we will build powerful new methods. Our recent trials using a new approach (CATHe) correctly assigned domain sequences to their evolutionary family ~90% of the time. Where we have an AlphaFold2 structure for the domain we will apply accurate structure comparisons to validate the classification.
A major aim will be to use this new 3D data and more accurately predicted functional sites to understand how mutations in pathogens (e.g. SARS-CoV-2) can lead to increased virulence or transmission. We'll do this through our CATH-FunVar platform, which examines where mutations lie on the protein structure. Proximity to functional sites means the mutation may damage or enhance the function. We have started using FunVar to analyse variants of concern in SARS-CoV-2. We will extend it to other organisms and pathogens linked to human health and well-being, e.g. crops like wheat and rice that are essential for food security and where knowledge of variant impacts can guide selection and engineering of hardier or faster growing varieties.
To improve FunVar we will improve the accuracy of our predicted functional families and detection of conserved functional sites in them. To do this we will exploit the vast structure and sequence data and adapt our new AI methods to make them even more powerful for this challenging task. We will build tools to analyse structure - function relationships in these families and develop powerful new visualisations for displaying these insights. Since we'll need to handle massive expansions in the data coming into CATH and lots of new methods for processing it - and since some new data is now captured in a way our computer programs can't read - we will completely re-engineer existing pipelines for classifying domains in CATH.
We have already built preliminary pipelines that brought over a quarter of a million AlphaFold2 models into CATH. This project will allow us to make these methods more robust and then apply them to bring in at least 100 fold more models to expand FunVar and determine the impacts of variants that could impact on human health and food security.
Technical Summary
CATH classifies protein domains into evolutionary superfamilies to better understand sequence-structure-function relationships and improve prediction of protein functions and functional sites.
New tools and computational workflows will be developed (using NextFlow) to harness the massive expansions in protein structure data (3D-models from AlphaFold2 (AF2)) and sequence data (e.g. metagenome sequences in MGnify).
We will extend our webpages and APIs to provide the expanded CATH data to the biological and biomedical communities.
Major tools and workflows we will develop:
- Powerful new homologue detection methods (e.g. CATHe, recently successfully piloted) using deep learning (DL) strategies exploiting sequence embeddings based on natural language models (e.g. Prot-BERT-T5).
- Novel workflows for comparing AF2 3D-models against CATH 3D fold libraries. These will exploit in-house methods (e.g. SSAP) and powerful approaches using DL strategies employing 3D graphlets or shape-mers (e.g. FoldSeek, Geometricus).
- Novel workflows combining the new DL-based sequence and structure-based methods (above) to improve CATH functional families (FunFams) and functional site prediction.
- New tools for CATH-Plus (providing derived data), e.g. novel structure-function relationships including comprehensive multiple structure alignments of AF2 models and visualisations of conserved 3D and 1D motifs.
- Re-engineered platform for CATH-Classify (in NextFlow) integrating the new workflows (above) and handling massive expansions in the data.
All new methods will be benchmarked robustly.
The PDRA will also run and maintain CATH's computational platforms, i.e. the software, hardware, databases and web services required to process a constantly increasing amount of data; manually validate remote homologues and new folds; and generate derived data for CATH-Plus (e.g. multiple sequence alignments). Time will also be spent improving CATH tutorials/videos and extending them to explain the new data.
People |
ORCID iD |
| Christine Orengo (Principal Investigator) |
Publications
Bonello J
(2024)
FunPredCATH: An ensemble method for predicting protein function using CATH
in Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics
Bordin N
(2024)
Clustering protein functional families at large scale with hierarchical approaches.
in Protein science : a publication of the Protein Society
Bordin N
(2023)
Large-scale clustering of AlphaFold2 3D models shines light on the structure and function of proteins.
in Molecular cell
Lau AM
(2024)
Exploring structural diversity across the protein universe with The Encyclopedia of Domains.
in Science (New York, N.Y.)
Lin W
(2024)
Enhancing missense variant pathogenicity prediction with protein language models using VariPred
in Scientific Reports
Waman VP
(2024)
CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds.
in Journal of molecular biology
Waman VP
(2025)
CATH v4.4: major expansion of CATH by experimental and predicted structural data.
in Nucleic acids research
| Description | The CATH domain structure classification (https://www.cathdb.info/) has a wide user-base across 180 countries. Established 28 years ago, it uses both automated protocols and manual curation to combine highly detailed knowledge of protein structural domains with data on related protein sequences and their functional annotations. As a result, CATH provides unique insights into evolutionary mechanisms and structural and functional relationships across >500k proteomes. The structure of a protein illuminates the mechanism by which a protein works. Furthermore, structural data can help interpret the impact of genetic variations in proteins. Examining these residue mutations, insertions and deletions in their 3D context helps explain how they enhance the protein function or lead to disease. CATH focuses on protein domains as these are primary evolutionary and functional units. We use in-house and publicly available protein structure comparison and sequence analysis algorithms to classify known structures in the Protein Databank (wwPDB) (obtained through X-ray crystallography and NMR experiments) and good quality predicted structures (from the AlphaFold Database at EBI). CATH-AlphaFlow: Our group developed a new protocol (CATH-Assign) and workflow (CATH-AlphaFlow) for processing AFDB models. This includes domain detection using a deep learning method (Chainsaw), evaluation of model quality and subsequent classification in CATH. CATH-Assign uses a combination of approaches involving new deep-learning methods for protein structure comparison (Foldseek) and protein language model based methods for remote homology detection (CATHe, developed in-house). It was used to perform a preliminary analysis of AFDB models from 21 model organisms. Application of CATH-AlphaFlow to the PDB and AFDB structures expanded CATH by 112% (1,060,659 domain structures) and brought 349 new folds into CATH (253 from PDB structures and 96 from AFDB).
CATH database v4.4 and TED resource: Following the release of AlphaFold Protein Structure Database v2, the CATH 4.4 release provided high-quality domain assignments for 90M AFDB models, resulting in a ~180-fold expansion of the CATH data (published in NAR, 2024). CATH employed a consensus approach to domain detection from structure, using AI/deep-learning approaches from 3 algorithms [Chainsaw (https://doi.org/10.1101/2023.07.19.549732), MERIZO (https://doi.org/10.1101/2023.02.19.529114), UniDoc (https://doi.org/10.1093/bioinformatics/btad070)]. The consensus domain assignment protocol was developed in collaboration with the David Jones group at UCL (Science, 2024) and the data were recently made publicly available via the sister resource, The Encyclopedia of Domains (TED) (https://ted.cathdb.info/, Science, 2024). CATHe (CATH embeddings): Deep-learning methods are also being used to detect evolutionary and functional relationships in CATH. We developed CATHe (CATH embeddings) for identification of remote homologues (<20% sequence identity). CATHe employs embeddings from ProtT5 as input to train machine learning models to classify protein sequences into CATH superfamilies (https://doi.org/10.1093/bioinformatics/btad029). Improvements to FunFam algorithm: CATH provides a comprehensive classification of protein sequence and structural domains into homologous superfamilies that have been further subclassified into functional families (FunFams) using the automated family classification protocol, FunFHMMer. CATH Functional Families (FunFams) are coherent subsets of CATH protein families where a conserved function is shared across all members. We recently developed CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, for classifying functions.
Prior to CATH-eMMA, we generated FunFams by building a tree of relationships between clusters of protein domains and, using a tool that assessed the presence of differentially conserved residues, traversing the tree to obtain groups of sequences where differentially conserved residues are conserved across all members. This method, while precise, is very computationally expensive. CATH-eMMA reduces the overhead in the tree building step by encoding protein sequences into embeddings from protein language models and calculating the relationships based on Euclidean distances between them. CATH-FunVar webpages (https://funvar.cathdb.info/) uniquely provide information on variant impacts on structure linked to functional modification and disease. |
| Exploitation Route | In 2024, CATH was recognised as a Global Core BioData Resource (GCBR) and is one of the few national resources to be endorsed in this way. CATH also has 2 sister resources: Gene3D and TED. CATH-Gene3D, provides predicted domains in UniProt entries using HMM based assignments. The TED resource, developed by the Jones and Orengo groups, provides domain assignments for the AFDB protein structures together with annotations for CATH superfamilies and novel fold groups. CATH thus provides globally unique datasets (processed from PDB and AlphaFold Database) facilitating biological research and providing benchmarks for novel algorithms. Its datasets are widely used by researchers to explore fundamental questions related to protein structure and evolution. The structural data provided by CATH has been instrumental in the design and benchmarking of computational methods and deep-learning algorithms for protein structure prediction, protein design and prediction of protein function. Examples include Google's DeepMind AlphaFold2, RestNet, and ScanNet. CATH datasets also are part of PDBench, a software package for evaluating methods for protein sequence design. CATH datasets are applied for the study of relationships between sequences, structure and function in microbial proteins (e.g, Nature Communications 2021) and to understand the classification and evolution of enzyme families (e.g. Biophysical Review 2022). CATH is used for research and development of drugs. It was one of four foundational resources used by the start up company Inpharmatica, which provided protein structures and functional data for several large pharmaceutical companies. Additionally, CATH has been used in a variety of commercial applications, such as Gene Tools LL and BrainMicro LLC to identify domains in enzymes as therapeutic targets for parasitic infections. 
Importantly, CATH v4.4 is now expanded by ~200 fold, by bringing in data not only from experimental structures but also from AlphaFold Protein Structure Database (please see our recent release/publication: https://doi.org/10.1093/nar/gkae1087). The expansion of information on CATH superfamilies and FunFams with high quality predicted domain structures opens the door for future research on understanding structural diversity and mechanisms underpinning functional divergence across protein superfamilies (Science, 2024, NAR, 2024). |
| Sectors | Agriculture Food and Drink Chemicals Education Environment Healthcare |
| URL | https://www.cathdb.info/ |
| Description | Outside academia, CATH is widely used across the global pharmaceutical industry for drug design and research and development. It is also used to assess impacts of mutations in proteins supporting clinical diagnostics (e.g. hypercholesterolemia). CATH has informed policy on the host range of SARS-CoV2, and led to efficiencies in drug discovery. CATH functional families (FunFams) can facilitate drug repurposing to target disease genes, by providing valuable data for pharmaceutical companies interested in repurposing as a cost effective mechanism for selecting drugs. FunFams can also identify drug targets which are less likely to be associated with side effects, providing information that is valuable for drug design. CATH methods and data are being exploited in the NHS-funded Genomics England Functional Effects Domain, and in a large-scale analysis of lung cancer data to uncover mechanisms of cancer evolution: the GBP 14m, 9-year Cancer Research UK-funded TracerX project. FunFam classification also allows accurate detection of functionally important sites to guide mutagenesis experiments for synthetic biology and is being used to enhance functional sites in bacterial enzymes capable of degrading plastics and pesticides. It has also highlighted sites involved in SARS-CoV-2 infection. Recent machine learning algorithms have exploited FunFams to improve detection of functional sites (FunSites) and FunSites are being incorporated in two highly accessed resources (PDBe, with 367,655 users per month and UniProt, with ~900,000 users per month at the European Bioinformatics Institute). Both capture this data to facilitate disease diagnostics and personalised medicine. CATH is the only resource which is capable of performing functional sub-classification on such a large scale, identifying 220,000 families each with at least one experimentally characterised protein. 
Validation has shown high structural and functional coherence across FunFams, allowing much more accurate predictions to be made. CATH methods ranked in the top three (out of 150) in international assessments of molecular function prediction, and first in 2020. CATH data is also disseminated via the web portal of the international protein structure resource, the Protein Databank (PDBe), with over 4,411,871 unique users/year, and UniProt, a major source of protein functional data with over 10,800,000 unique users/year (2,220,000 of which are from industry). Further links to CATH are provided by many international web-based computational biology resources, for example Pfam and BRENDA. CATH FunFam analyses of binding sites involved in SARS-CoV-2 infection of animal hosts were used by the WHO and a UN Food and Agriculture Organisation policy unit in strategy discussions on animals at risk from infection, or which are likely to become reservoirs for the virus. These sites constitute major mechanisms of infection which are targetable by drugs. The work was also reported in several newspapers globally. |
| Title | CATH-AlphaFlow |
| Description | CATH-Assign is a computational workflow for processing predicted 3D protein structures in the AlphaFold Database (AFDB). This includes computational modules for domain detection, evaluation of model quality and subsequent classification in CATH. CATH-Assign uses a combination of approaches involving new deep-learning methods for protein structure comparison (Foldseek) and protein language model based methods for remote homology detection (CATHe, developed in-house). It was first used to perform a preliminary analysis of AFDB models from 21 model organisms. Subsequently CATH-Assign was integrated in a more comprehensive workflow (CATH-AlphaFlow) which encodes all major steps of the CATH-Assign protocol in a NextFlow workflow. CATH-AlphaFlow comprises a series of Python modules created to perform consistent processing of protein chains (either from models or from experimental data), many of which have been orchestrated in NextFlow (https://github.com/UCLOrengoGroup/cath-alphaflow). CATH-AlphaFlow has been applied to all novel structures in the PDB not currently classified in CATH. It has also been applied to all predicted structures in the AFDB structures to detect domain boundaries and classify domains into CATH superfamilies. CATH-AlphaFlow is robust and fast and has assisted in mining the vast data released by AFDB and related platforms (e.g. 3D-Beacons https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/). The novel domain assignments and fold groups enabled by CATH-Assign/AlphaFlow are available from the CATH-beta daily snapshot. CATH-AlphaFlow has been used to populate a new sister resource for CATH - The Encyclopaedia of Domains (TED). Domains identified in the AFDB predicted structures can be viewed via the TED web resource (URL) and also via the AFDB web portal (URL). |
| Type Of Material | Improvements to research infrastructure |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | Using CATH-AlphaFlow we will be able to keep up to date with the PDB and analyse all the predicted structures in the AlphaFold database (AFDB). |
| URL | https://github.com/UCLOrengoGroup/cath-alphaflow |
| Title | CATH-eMMA: Protein functional classification using embedding from protein language models |
| Description | CATH Functional Families (FunFams) are coherent subsets of CATH protein families where a conserved function is shared across all members. Prior to CATH-eMMA, we generated FunFams by building a tree of relationships between clusters of protein domains and, using a tool that assessed the presence of differentially conserved residues, traversing the tree to obtain groups of sequences where differentially conserved residues are conserved across all members. This method, while precise, is very computationally expensive. CATH-eMMA reduces the overhead in the tree building step by encoding protein sequences into embeddings from protein language models and calculating the relationships based on Euclidean distances between them. |
| Type Of Material | Improvements to research infrastructure |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | CATH-eMMA has been applied successfully to very large enzyme families from metagenomes, discovering novel plastic degrading enzymes. |
| URL | https://github.com/UCLOrengoGroup/eMMA |
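The tree-building idea described for CATH-eMMA (embeddings compared by Euclidean distance, then grouped into a tree) can be illustrated with a minimal sketch. This is not the eMMA implementation: the average-linkage merging rule, the function names `euclidean` and `build_tree`, and the two-dimensional toy vectors are all assumptions made for illustration. Real protein language model embeddings have far more dimensions.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_tree(embeddings):
    """Build a nested-tuple tree by repeatedly merging the closest clusters.

    `embeddings` maps a domain name to its vector. Average linkage is an
    assumption of this sketch, not a documented eMMA setting.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in embeddings]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(euclidean(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical toy embeddings: two tight pairs that lie far apart.
toy = {
    "domA": [0.0, 0.0],
    "domB": [0.1, 0.0],
    "domC": [5.0, 5.0],
    "domD": [5.0, 5.2],
}
tree = build_tree(toy)
print(tree)
```

In this toy case the two nearby pairs are joined first and only then merged at the root, which is the kind of grouping a functional-family partition would subsequently cut from the tree.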
| Title | CATHe |
| Description | CATHe (short for CATH embeddings) is a deep learning tool designed to detect remote homologues (up to 20% sequence similarity) for superfamilies in the CATH database. CATHe consists of an artificial neural network model which was trained on sequence embeddings from the ProtT5 protein Language Model (pLM). It achieved an accuracy of 85.6% ± 0.4% and outperformed the baseline models derived from both simple machine learning algorithms, such as Logistic Regression, and homology-based inference using BLAST. |
| Type Of Material | Improvements to research infrastructure |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | The CATHe dataset is made available via Zenodo and has been downloaded over 100 times. The CATHe-based dataset has been used for training other deep learning-based models (such as DeepRHD) for detecting remote homologues. |
| URL | https://zenodo.org/records/6327572 |
| Title | Chainsaw |
| Description | Chainsaw is a deep-learning based method we developed to detect globular domains within protein structure chains. Using an end-to-end CNN, Chainsaw outperforms current state-of-the-art methods for domain detection, while being very fast (0.4s per chain on GPU). Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. Currently, the most performant computational methods for parsing protein domains are structure-based methods that rely on unsupervised heuristics to determine the optimal segmentation. Given the availability of large curated databases of domain annotations, we revisited structure-based approaches but adopted a supervised learning approach. Chainsaw is a novel supervised learning approach to domain parsing that achieved accuracy surpassing state-of-the-art methods when evaluated in 2023. The Chainsaw method uses a residual convolutional neural network which is trained to predict the adjacency matrix over residues, where entry A_ij is an estimate for the probability that residues i and j are in the same domain. Domain assignments are then derived from this soft-adjacency matrix using a non-learned algorithm that is equivalent to maximising the likelihood of the assignments given the set of pairwise probabilities expressed in the soft-adjacency matrix. We benchmarked the method against recent state-of-the-art structure-based methods and showed that it outperforms them, achieving an NDO score of 0.87 vs 0.80 for the next closest method when predicting a non-redundant test set of multi-domain proteins.
In addition to improved accuracy on labeled PDB structures, we proposed several modifications that optimise Chainsaw for increased performance on AlphaFold predicted structures, which are often characterised by long disordered regions that are not well-handled by existing methods. The method was designed to be fast with minimal dependencies and can predict non-domain residues and discontinuous domains. We demonstrated that the approach performs well at identifying completely novel domains not observed during training. Chainsaw has been applied to detect domains in all experimental structures in the PDB and all predicted protein structures in the AlphaFold database (AFDB). |
| Type Of Material | Improvements to research infrastructure |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | Chainsaw has been applied to update the domain assignments from structures in the PDB. It has also been applied to segment all predicted protein structure chains in the AFDB. Chainsaw is a component module in the CATH-AlphaFlow workflow used to classify domains into CATH superfamilies. |
| URL | https://www.biorxiv.org/content/10.1101/2023.07.19.549732v1 |
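The likelihood criterion in the Chainsaw description can be made concrete with a small sketch that scores a candidate domain assignment against a soft-adjacency matrix. It covers only the scoring step, not the search Chainsaw uses to find the best assignment. The function name `assignment_log_likelihood`, the `eps` clamp, the integer domain labels and the toy matrix are assumptions for illustration, and handling of non-domain residues is left out.

```python
import math


def assignment_log_likelihood(adjacency, labels, eps=1e-9):
    """Log-likelihood of a domain assignment under pairwise probabilities.

    adjacency[i][j] estimates the probability that residues i and j share a
    domain; labels[i] is the domain label of residue i. Each residue pair
    contributes log(p) if the assignment puts the two residues in the same
    domain and log(1 - p) otherwise. Probabilities are clamped by `eps`
    (an assumption of this sketch) to avoid taking log(0).
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = min(max(adjacency[i][j], eps), 1.0 - eps)
            same_domain = labels[i] == labels[j]
            total += math.log(p) if same_domain else math.log(1.0 - p)
    return total


# Hypothetical 4-residue chain: residues 0-1 and 2-3 look like two domains.
A = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
two_domains = [1, 1, 2, 2]
one_domain = [1, 1, 1, 1]

ll_two = assignment_log_likelihood(A, two_domains)
ll_one = assignment_log_likelihood(A, one_domain)
print(ll_two > ll_one)
```

For this toy matrix the two-domain split scores higher than a single domain, because every residue pair then agrees with its predicted probability.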
| Title | GOBeacon: An Ensemble Model for Protein Function Prediction Enhanced by Contrastive Learning |
| Description | Here we present GOBeacon, an ensemble model integrating structure-aware protein language models with protein interaction networks to predict protein function. Designed to address limitations in capturing structure-function-evolution relationships, GOBeacon achieves superior performance on the CAFA3 benchmark, outperforming existing methods such as DeepGOPlus. Notably, it matches specialized structure-based tools (e.g., DeepFRI) in structure-aware predictions despite no explicit structural training. Current applications include sequence-based and structure-based functional annotation, with potential for expansion through its modular architecture to incorporate additional biological data types. This tool advances automated protein annotation by bridging sequence-structure-function relationships, offering critical support for biological discovery. |
| Type Of Material | Improvements to research infrastructure |
| Year Produced | 2024 |
| Provided To Others? | No |
| Impact | GOBeacon advances protein function prediction by outperforming leading methods (e.g., DeepGOPlus) on CAFA3 benchmarks and matching structure-based tools (e.g., DeepFRI) without structural training, while its modular design enables scalable integration of biological data for improved annotation. |
| URL | https://github.com/wlin16/GOBeacon.git |
| Title | VariPred: Enhancing missense variant pathogenicity prediction with protein language models |
| Description | We introduce VariPred, a deep learning framework using pre-trained protein language models (ESM-1b) to predict genetic variant pathogenicity. Designed to bypass traditional feature engineering, VariPred requires only raw protein sequences as input, eliminating dependencies on structural data or multiple sequence alignments. It outperforms existing tools (e.g., Polyphen-2, REVEL, FATHMM) across six variant impact benchmarks while avoiding data pre-processing. Current applications focus on broad-spectrum variant analysis, with future versions expanding to cancer genomics and pathogen-host interactions, aiming to enhance clinical variant prioritization and integrate quantitative confidence metrics. |
| Type Of Material | Improvements to research infrastructure |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | VariPred enables accurate, sequence-only prediction of genetic variant pathogenicity, outperforming traditional feature-dependent methods and accelerating clinical variant prioritization without structural data requirements. |
| URL | https://github.com/wlin16/VariPred.git |
| Title | CATH-KinFams |
| Description | CATH-KinFams are protein kinase domain families classified according to functional similarity based on specificity-determining positions (SDPs). In this deposition we make available 2,210 KinFams sequence alignments, alongside Hidden Markov Models built from them, to be used with HMMER3. |
| Type Of Material | Data analysis technique |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | This dataset was downloaded by members of the research community over 40 times to be used in protein kinase research. |
| URL | https://zenodo.org/records/7575924 |
| Title | TED-The Encyclopedia of Domains |
| Description | This dataset contains 365 million CATH domain assignments and structures for 214 million protein structure models from the AlphaFold Protein Structure Database, covering the proteomes of over 600,000 organisms. We include PDB files for 40 model organisms and global health proteomes, novel folds and a table containing metadata on domain quality and assignments. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2024 |
| Provided To Others? | Yes |
| Impact | Since the release of this data, there have been over 4,000 dataset downloads from Zenodo. This research dataset provides access to high-quality curated domains from AFDB, for all proteomes from the UniProt database. This has provided an opportunity to (i) identify novel folds; (ii) investigate remote homologous relationships illuminated by structural information (please refer to doi: 10.1016/j.molcel.2023.10.039); (iii) provide functional annotations of CATH superfamilies and pathogen domains/drug targets. The dataset is available via doi: 10.5281/zenodo.10788942. The TED domain assignments are also linked to AFDB structures at https://alphafold.ebi.ac.uk/ |
| URL | https://ted.cathdb.info/ |
| Title | Understanding structural and functional diversity of ATP-PPases using protein domains and functional families in CATH database |
| Description | The dataset of AF2-predicted HUP domains with overall pLDDT > 90, culled at 90% identity. |
| Type Of Material | Database/Collection of data |
| Year Produced | 2023 |
| Provided To Others? | Yes |
| Impact | The research was published in the journal Structure in 2025. The work was selected for an oral presentation at the UCL-ISMB Symposium. |
| URL | https://zenodo.org/record/8346482 |
| Description | InterPro |
| Organisation | EMBL European Bioinformatics Institute (EMBL - EBI) |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource: structural annotations from CATH; structural annotations from Genome3D; and a mapping between the CATH and SCOP protein structure classifications. CATH-Gene3D provides domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, CATH-Resolve-Hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. BBSRC BBR funding extended the mapping between SCOP and CATH, integrated annotations in InterPro for selected model organisms, and provided a 3D viewer for the structural annotations. Current collaborations involve evaluation of novel deep learning strategies for providing CATH superfamily annotations via InterPro. |
| Collaborator Contribution | Annotations from other sources, manual curations, central database and web site. |
| Impact | Publications. Community resource to further biological research. |
| Start Year | 2007 |
| Description | PDBe |
| Organisation | EMBL European Bioinformatics Institute (EMBL - EBI) |
| Country | United Kingdom |
| Sector | Academic/University |
| PI Contribution | Our resource CATH provides high-quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D initiative (now renamed as 3D-Beacons) have resulted in a high-quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. As mentioned under the 3D-Beacons collaboration, we are also contributing predicted domain structures generated for the 214 million predicted 3D structures in the AlphaFold database (AFDB). We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the 3D-Beacons portal. |
| Collaborator Contribution | Host, maintain and curate the central PDBe resource and website. |
| Impact | Publications. Community resources to further scientific research. |
| Start Year | 2006 |
| Description | ProtFunAI - Collaboration with Burkhard Rost Team |
| Organisation | Technical University of Munich |
| Country | Germany |
| Sector | Academic/University |
| PI Contribution | Development of deep learning algorithms for protein function prediction, protein classification and analysis |
| Collaborator Contribution | Training in deep learning protocols and protein language models. Contributions to project design. Novel protein language models to generate protein embeddings for protein function prediction and other protein based prediction tasks. |
| Impact | Project has just started so no outputs yet |
| Start Year | 2024 |
| Description | "Understanding structural and functional diversity of PP-ATPases: insights using CATH Functional families", ISMB Retreat (Cambridge, UK) |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Other audiences |
| Results and Impact | This talk was presented as part of UCL's ISMB symposium. |
| Year(s) Of Engagement Activity | 2023 |
| Description | 25th Bologna Winter School: Artificial Intelligence, Deep Learning and protein functional annotation: the state-of-the-art |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Undergraduate students |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2024 |
| Description | Biocenter Oulu Day (Finland). "How much of protein space will AlphaFold illuminate?". |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | Bioinformatics and deep learning for biodata analysis workshop |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | Birkbeck College (UK) - "AI for protein structure and Function" - Virtual (Microsoft Teams) |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | Invited lecture for the Birkbeck Masters in Bioinformatics |
| Year(s) Of Engagement Activity | 2024 |
| Description | Birkbeck College Lecture (UK). "The impact of AI on protein structure and function". |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2024 |
| Description | BlueRemediomics workshop: Advances in large-scale enzyme annotations for metagenome sequences |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | CABD 20th Anniversary (Spain). "How much of protein space will AlphaFold illuminate?". |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | CATH DB - protein folds and structural family resources |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | CNB-CSIC Madrid symposium: AI brings structural insights on protein fold and function evolution |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Careers Encounter event at Wallington County Grammar School, Wallington, UK |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | Local |
| Primary Audience | Schools |
| Results and Impact | As part of National Careers Week in the UK (on Wednesday, 5th March), the school held its annual Careers Encounter for students in Years 8 and 10. Dr. Waman introduced students to protein structures, bioinformatics, CATH and applications of this field. Students were keen to understand which subject skills are required so that they could make informed choices about subject selections. |
| Year(s) Of Engagement Activity | 2025 |
| Description | Centre International de Rencontres Mathématiques (CIRM) (France). "How much of protein space will AlphaFold illuminate?". |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | EBI Structural Bioinformatics Workshop |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | EMBO Conference on AI in Structural Biology, Heidelberg, Germany |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | EuroBiotech 2024 (Poland). "The Encyclopedia of Domains". Krakow Conference Centre |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | European COST Action "ML4NGP" workshop on non-globular proteins and Machine Learning |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | ISCB 3DSIG Webinar. "The Encyclopaedia of Domains". Virtual (Zoom) |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | ISMB 2024 Tech Track (Canada). "CATH and TED. Protein structure classification in the age of AI" |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | ISMB/ECCB2023 NIH/ELIXIR Special Track (France). CATH - Protein Structure Classification Database. |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | ISMB/ECCB2023 Tech Track (France). "Scaling up Protein Classification. CATH-Alphaflow and Chainsaw". |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | ISMB/ECCB2023 Tutorials Track (France). CATH Alphaflow Tutorial. |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | This tutorial was held as part of the symposium. |
| Year(s) Of Engagement Activity | 2023 |
| Description | ISMB2024 3DSIG (Canada). "The Encyclopedia of Domains". Montreal Convention Centre |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | In2Science UK 16th August-29th August 2023 |
| Form Of Engagement Activity | Participation in an activity, workshop or similar |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Schools |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | International conference in Barcelona PRBB organised by Koret-UC Berkeley-Tel Aviv University Initiative in Computational Biology and Bioinformatics (KBT) and Centre for Genomic Regulation (CRG): 'Advances in Computational Biology' |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Interplay between AI and mathematical modelling in the post-structural genomics era, CIRM, Marseille, France |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | Keynote for Cambridge CD23 Symposium |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Postgraduate students |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | Keynote for the 16th International Symposium on Health Informatics and Bioinformatics (HIBIT'23) |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | ML4NGP Montpellier (France). "Novel pipelines and tools for discoveries in protein structure space". |
| Form Of Engagement Activity | A formal working group, expert panel or dialogue |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | MRC Laboratory of Molecular Biology (UK). "Shining light on function and structure in the dark protein universe". |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Novozymes Prize Symposium: A million shades of green: Understanding and harnessing plant metabolic diversity |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Protein Evolution Conference |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Postgraduate students |
| Results and Impact | n/a |
| Year(s) Of Engagement Activity | 2023 |
| Description | Quadram Institute (UK). "The Encyclopaedia of Domains" |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | National |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Quest for Orthologs Conference in Montreal: AlphaFold structures expand our understanding of functional divergence in protein families |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Spanish Society of Computational Biology and Bioinformatics in Valencia. Plenary Keynote: AlphaFold predicted structures expand our understanding of functional divergence in protein families |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |
| Description | Università degli Studi di Padova (Italy). "Shining light on function and structure in the dark protein universe". Aula Magna Vallisneri. |
| Form Of Engagement Activity | A talk or presentation |
| Part Of Official Scheme? | No |
| Geographic Reach | International |
| Primary Audience | Other audiences |
| Results and Impact | not known |
| Year(s) Of Engagement Activity | 2024 |