Transforming the Structural Landscape of CATH to Aid Variant Analyses in Human and Agricultural Organisms and their Pathogens
Lead Research Organisation:
UNIVERSITY COLLEGE LONDON
Department Name: Structural Molecular Biology
Abstract
Proteins are Nature's molecular machines involved in most biochemical processes in living systems. Mutations in proteins can affect their stability and/or shape or chemical properties, altering their function. Knowing the 3D structure of the protein can be extremely helpful in understanding whether and how these mutations have this effect. Proteins are typically made up of multiple 'domains' - important functional modules - each associated with a distinct globular shape. Our CATH classification groups domains according to evolutionary ancestry. Relatives are recognised because they have similar structures in their core and often functional features in common, though variations outside the core can modify function. We therefore sub-classify relatives into functional families if they have highly similar structures and functions.
Experimental techniques for determining protein structures are challenging: fewer than 1% of known proteins have experimental structures. However, AI technologies for predicting structures have been improving immensely. The best methods use information from millions of protein sequences (1D strings of molecules (residues)) to predict how proteins will fold up in 3D. The massive increase in sequence data (> one billion sequences now known) obtained by sampling diverse environments has empowered new methods (DeepMind's AlphaFold2) to predict model structures that are as good as experimental structures. DeepMind will provide ~138 million protein structures in 2022, ~200 times more than exist now. We will transform knowledge in our CATH evolutionary classification by bringing in this vast 3D data - and we will also bring in the sequences involved in predicting the structures. This even vaster sequence data will reveal evolutionarily conserved sites highly likely to be linked to function.
To handle this massive amount of data we will build powerful new methods. Our recent trials using a new approach (CATHe) correctly assigned domain sequences to their evolutionary family ~90% of the time. Where we have an AlphaFold2 structure for the domain we will apply accurate structure comparisons to validate the classification.
A major aim will be to use this new 3D data and more accurately predicted functional sites to understand how mutations in pathogens (e.g. SARS-CoV-2) can lead to increased virulence or transmission. We'll do this through our CATH-FunVar platform, which examines where mutations lie on the protein structure. Proximity to functional sites means the mutation may damage or enhance the function. We have started using FunVar to analyse variants of concern in SARS-CoV-2. We will extend it to other organisms and pathogens linked to human health and well-being, e.g. crops like wheat and rice that are essential for food security and where knowledge of variant impacts can guide selection and engineering of hardier or faster-growing varieties.
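The proximity idea described above can be illustrated with a minimal sketch. It is not the FunVar implementation: the function names, the 8 angstrom cutoff, and the toy coordinates are all assumptions chosen for illustration. It simply measures the shortest distance between a mutated residue and a set of predicted functional-site residues, given one representative 3D coordinate per residue.

```python
import math

def min_distance_to_sites(coords, mutated_residue, site_residues):
    """Smallest Euclidean distance between the mutated residue and any
    functional-site residue. `coords` maps residue number -> (x, y, z)."""
    origin = coords[mutated_residue]
    return min(math.dist(origin, coords[r]) for r in site_residues)

def is_near_functional_site(coords, mutated_residue, site_residues, cutoff=8.0):
    """True if the mutation lies within `cutoff` of a functional site.
    The default cutoff is an illustrative assumption, not a FunVar setting."""
    return min_distance_to_sites(coords, mutated_residue, site_residues) <= cutoff

# Hypothetical toy coordinates: one representative atom per residue.
toy_coords = {
    10: (0.0, 0.0, 0.0),   # mutated residue
    20: (3.0, 4.0, 0.0),   # predicted functional-site residue, close by
    30: (30.0, 0.0, 0.0),  # predicted functional-site residue, far away
}

print(min_distance_to_sites(toy_coords, 10, [20, 30]))
print(is_near_functional_site(toy_coords, 10, [30]))
```

In practice the coordinates would come from an experimental structure or an AlphaFold2 model, and a real analysis would likely consider all atoms and the reliability of the predicted site rather than a single distance threshold.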
To improve FunVar we will improve the accuracy of our predicted functional families and detection of conserved functional sites in them. To do this we will exploit the vast structure and sequence data and adapt our new AI methods to make them even more powerful for this challenging task. We will build tools to analyse structure-function relationships in these families and develop powerful new visualisations for displaying these insights. Since we'll need to handle massive expansions in the data coming into CATH and lots of new methods for processing it - and since some new data is now captured in a way our computer programs can't read - we will completely re-engineer existing pipelines for classifying domains in CATH.
We have already built preliminary pipelines that brought over a quarter of a million AlphaFold2 models into CATH. This project will allow us to make these methods more robust and then apply them to bring in at least 100-fold more models to expand FunVar and determine the effects of variants that could impact human health and food security.
Technical Summary
CATH classifies protein domains into evolutionary superfamilies to better understand sequence-structure-function relationships and improve prediction of protein functions and functional sites.
New tools and computational workflows will be developed (using NextFlow) to harness the massive expansions in protein structure data (3D-models from AlphaFold2 (AF2)) and sequence data (e.g. metagenome sequences in MGnify).
We will extend our webpages and APIs to provide the expanded CATH data to the biological and biomedical communities.
Major tools and workflows we will develop:
-Powerful new homologue detection methods (e.g. CATHe, recently successfully piloted) using deep learning (DL) strategies exploiting sequence embeddings based on natural language models (e.g. Prot-BERT-T5).
-Novel workflows for comparing AF2 3D-models against CATH 3D fold libraries. These will exploit in-house methods (e.g. SSAP) and powerful approaches using DL strategies employing 3D graphlets or shape-mers (e.g. FoldSeek, Geometricus).
-Novel workflows combining the new DL-based sequence- and structure-based methods (above) to improve CATH functional families (FunFams) and functional site prediction.
-New tools for CATH-Plus (providing derived data), e.g. novel structure-function relationships including comprehensive multiple structure alignments of AF2 models and visualisations of conserved 3D and 1D motifs.
-Re-engineered platform for CATH-Classify (in NextFlow) integrating the new workflows (above) and handling massive expansions in the data.
All new methods will be benchmarked robustly.
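As a rough illustration of embedding-based homologue detection as listed above, the sketch below assigns a query domain to the superfamily whose mean embedding is most similar by cosine similarity. This is a deliberately simplified nearest-centroid stand-in, not the CATHe method: the function names, the similarity threshold, the placeholder superfamily labels, and the two-dimensional toy vectors are all assumptions. Real protein language model embeddings have many more dimensions and would be produced by a separate model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def assign_superfamily(query, labelled, min_similarity=0.5):
    """Return (label, score) for the most similar superfamily centroid.
    The label is None when the best score is below `min_similarity`
    (an illustrative threshold, not a published setting)."""
    centroids = {label: centroid(vecs) for label, vecs in labelled.items()}
    best_label, best_score = max(
        ((label, cosine(query, c)) for label, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (best_label if best_score >= min_similarity else None), best_score

# Hypothetical toy embeddings for two placeholder superfamilies.
toy_labelled = {
    "superfamily_A": [[1.0, 0.0], [0.9, 0.1]],
    "superfamily_B": [[0.0, 1.0], [0.1, 0.9]],
}

print(assign_superfamily([1.0, 0.05], toy_labelled))
```

Returning no label below the threshold mirrors the need, stated in the summary, to validate uncertain assignments by other means such as structure comparison.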
The PDRA will also run and maintain CATH's computational platforms, i.e. the software, hardware, databases and web services required to process a constantly increasing amount of data; manually validate remote homologues and new folds; and generate derived data for CATH-Plus (e.g. multiple sequence alignments). Time will also be spent improving CATH tutorials/videos and extending them to explain the new data.
People |
ORCID iD |
Christine Orengo (Principal Investigator) |
Publications

Bonello J (2024) FunPredCATH: An ensemble method for predicting protein function using CATH, in Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics

Bordin N (2023) Large-scale clustering of AlphaFold2 3D models shines light on the structure and function of proteins, in Molecular cell

Waman VP (2024) CATH v4.4: major expansion of CATH by experimental and predicted structural data, in Nucleic acids research
Title | CATH-AlphaFlow |
Description | The Orengo group has developed a new protocol (CATH-Assign) for processing AFDB models. This includes domain detection, evaluation of model quality and subsequent classification in CATH. CATH-Assign uses a combination of approaches involving new deep-learning methods for protein structure comparison (Foldseek) and protein language model based methods for remote homology detection (CATHe, developed in-house). It was used to perform a preliminary analysis of AFDB models from 21 model organisms. We have developed a novel pipeline, CATH-AlphaFlow, which encodes the major steps of the CATH-Assign protocol in a NextFlow workflow. CATH-AlphaFlow is a series of Python modules created to perform consistent processing of protein chains (either from models or from experimental data), many of which have been orchestrated in NextFlow (https://github.com/UCLOrengoGroup/cath-alphaflow). CATH-AlphaFlow has been applied to all novel structures in the PDB not currently classified in CATH. It was also applied to the AFDB structures from the 21 model organisms to refine domain boundaries and improve classification of domains. CATH-AlphaFlow is robust and fast and has assisted in mining the vast data released by AFDB and related platforms (e.g. 3D-Beacons https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/). The novel domain assignments and fold groups enabled by CATH-Assign/AlphaFlow are available from the CATH-beta daily snapshot. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2024 |
Provided To Others? | Yes |
Impact | Using CATH-AlphaFlow we will be able to keep up to date with the PDB and we will also be able to analyse all the predicted structures in the AlphaFold database (AFDB). |
URL | https://github.com/UCLOrengoGroup/cath-alphaflow |
Title | Chainsaw |
Description | Chainsaw is a deep-learning-based method we developed to detect globular domains within protein chains. Using an end-to-end CNN, Chainsaw outperforms current state-of-the-art methods for domain detection, while being very fast (0.4s per chain on GPU). |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | Chainsaw has been applied to the portion of PDB currently unclassified in CATH and identified an additional 200k domains, and is part of the TED classification pipeline for the detection and classification of protein domains in AlphaFoldDB. |
URL | https://www.biorxiv.org/content/10.1101/2023.07.19.549732v1 |
Title | CATH AlphaFold2 structural domains for 21 model organisms |
Description | This dataset contains CATH domain assignments and structures for the first release of AlphaFold Database covering the proteomes of 21 model organisms. We include PDB files and a table containing metadata on domain quality and assignments. |
Type Of Material | Data analysis technique |
Year Produced | 2021 |
Provided To Others? | Yes |
Impact | This dataset was downloaded by members of the research community over 680 times for analyses and benchmark purposes. |
URL | https://zenodo.org/records/7404988 |
Title | CATH-Gene3D |
Description | Please note that this research database is still being continuously developed and improved. CATH-Gene3D is a domain family classification. As of 2024, over 200 million protein domain sequences are classified into evolutionary superfamilies. Within these, relatives are further classed into groups in which relatives share very similar 3D-structures and functional properties. These groupings are described as functional families, or FunFams. The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.3, https://www.cathdb.info). The resource comprises over 500,000 domain structures and over 150 million protein domains classified into over 5000 homologous superfamilies. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 50,000 additional protein domains. Gene3D is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing over 5000 CATH superfamilies. The current Gene3D (v22) release has expanded its domain assignments to ~20 000 cellular genomes and over 200 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family (FunFam) annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. |
Type Of Material | Database/Collection of data |
Provided To Others? | Yes |
Impact | CATH-Gene3D is widely used by biologists for teaching and research. There are ~1 million webpage accesses per month from ~9,000 unique visitors. CATH-Gene3D is a member database of InterPro, which receives more than 5 million web page accesses per month. It is also linked to from other major public sites including Pfam, PDB, PDBe. |
URL | https://www.cathdb.info |
Title | CATH-KinFams |
Description | CATH-KinFams are protein kinase domain families classified according to functional similarity based on SDP (specificity determining positions). In this deposition we make available 2,210 KinFams sequence alignments alongside Hidden Markov Models built from them to be used with HMMER3. |
Type Of Material | Data analysis technique |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | This dataset was downloaded by members of the research community over 40 times to be used in protein kinase research. |
URL | https://zenodo.org/records/7575924 |
Title | TED-The Encyclopedia of Domains |
Description | This dataset contains 324 million CATH domain assignments and structures for 188 million protein structure models from the AlphaFold Protein Structure Database, covering the proteomes of over 600,000 organisms. We include PDB files for 40 model organisms and global health proteomes, novel folds and a table containing metadata on domain quality and assignments. |
Type Of Material | Data analysis technique |
Year Produced | 2024 |
Provided To Others? | Yes |
Impact | This research dataset provides access to high-quality curated domains from AFDB, for all proteomes from the UniProt database. This has provided an opportunity to (i) identify novel folds; (ii) investigate remote homologous relationships illuminated by structural information (please refer to doi: 10.1016/j.molcel.2023.10.039); (iii) provide functional annotations of CATH superfamilies and pathogen domains/drug targets. |
URL | http://zenodo.org/records/10788942 |
Title | Understanding structural and functional diversity of ATP-PPases using protein domains and functional families in CATH database |
Description | The dataset of AF2-predicted HUP domains with overall pLDDT > 90, culled at 90% identity. |
Type Of Material | Database/Collection of data |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | We designed a protocol to analyse AlphaFold2 domains to understand the functional diversity of a protein superfamily called ATP-PPases. The computational protocol designed in this study will be used to analyse other important superfamilies using data obtained from TED analyses. |
URL | https://zenodo.org/record/8346481 |
Description | InterPro |
Organisation | EMBL European Bioinformatics Institute (EMBL - EBI) |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource: structural annotations from CATH; structural annotations from Genome3D; and a mapping between the CATH and SCOP protein structure classifications. CATH-Gene3D provides domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, CATH-Resolve-Hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. BBSRC BBR funding extended the mapping between SCOP and CATH, integrated annotations in InterPro for selected model organisms, and provided a 3D viewer for the structural annotations. Current collaborations involve evaluation of novel deep learning strategies for providing CATH superfamily annotations via InterPro. |
Collaborator Contribution | Annotations from other sources, manual curations, central database and web site. |
Impact | Publications; community resource to further biological research. |
Start Year | 2007 |
Description | PDBe |
Organisation | EMBL European Bioinformatics Institute (EMBL - EBI) |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Our resource CATH provides high quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D initiative (now renamed as 3D-Beacons) have resulted in a high quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. As mentioned under the 3D-Beacons collaboration, we are also contributing predicted domain structures generated for the 214 million predicted 3D-structures in the AlphaFold database (AFDB). We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the 3D-Beacons portal. |
Collaborator Contribution | Host, maintain and curate the central PDBe resource and website. |
Impact | Publications; community resources to further scientific research. |
Start Year | 2006 |
Description | ProtFunAI |
Organisation | Technical University of Munich |
Country | Germany |
Sector | Academic/University |
PI Contribution | Development of deep learning algorithms for protein function prediction, protein classification and analysis |
Collaborator Contribution | Training in deep learning protocols and protein language models. Contributions to project design. Novel protein language models to generate protein embeddings for protein function prediction and other protein based prediction tasks. |
Impact | Project has just started so no outputs yet |
Start Year | 2024 |
Description | "Understanding structural and functional diversity of PP-ATPases: insights using CATH Functional families", ISMB Retreat (Cambridge, UK) |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Other audiences |
Results and Impact | This talk was presented as part of UCL's ISMB symposium. |
Year(s) Of Engagement Activity | 2023 |
Description | Biocenter Oulu Day (Finland). "How much of protein space will AlphaFold illuminate?". |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Bioinformatics and deep learning for biodata analysis workshop |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Birkbeck College Lecture (UK). "The impact of AI on protein structure and function". |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2024 |
Description | CABD 20th Anniversary (Spain). "How much of protein space will AlphaFold illuminate?". |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Centre International de Rencontres Mathematiques (CIRM) (France). "How much of protein space will AlphaFold illuminate?". |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | EBI Structural Bioinformatics Workshop |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | EMBO Conference on AI in Structural Biology, Heidelberg, Germany |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | ISMB/ECCB2023 NIH/ELIXIR Special Track (France). CATH - Protein Structure Classification Database. |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | ISMB/ECCB2023 Tech Track (France). "Scaling up Protein Classification. CATH-Alphaflow and Chainsaw". |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | ISMB/ECCB2023 Tutorials Track (France). CATH Alphaflow Tutorial. |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | This was the tutorial held as part of the symposium. |
Year(s) Of Engagement Activity | 2023 |
Description | In2Science UK 16th August-29th August 2023 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Schools |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Interplay between AI and mathematical modelling in the post-structural genomics era, CIRM, Marseille, France |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Keynote for Cambridge CD23 Symposium |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Postgraduate students |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Keynote for the 16th International Symposium on Health Informatics and Bioinformatics (HIBIT'23) |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | ML4NGP Montpellier (France). "Novel pipelines and tools for discoveries in protein structure space". |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Other audiences |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |
Description | Protein Evolution Conference |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | n/a |
Year(s) Of Engagement Activity | 2023 |