Transforming the Structural Landscape of CATH to Aid Variant Analyses in Human and Agricultural Organisms and their Pathogens

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Proteins are Nature's molecular machines involved in most biochemical processes in living systems. Mutations in proteins can affect their stability and/or shape or chemical properties, altering their function. Knowing the 3D structure of the protein can be extremely helpful in understanding whether and how these mutations have this effect. Proteins are typically made up of multiple 'domains' - important functional modules - each associated with a distinct globular shape. Our CATH classification groups domains according to evolutionary ancestry. Relatives are recognised because they have similar structures in their core and often functional features in common, though variations outside the core can modify function. We therefore sub-classify relatives into functional families if they have highly similar structures and functions.

Experimental techniques for determining protein structures are challenging: <1% of known proteins have experimental structures. However, AI technologies for predicting structures have been improving immensely. The best methods use information from millions of protein sequences (1D strings of molecules called residues) to predict how proteins will fold up in 3D. The massive increase in sequence data (> one billion sequences now known), obtained by sampling diverse environments, has empowered new methods (DeepMind's AlphaFold2) to predict model structures that are as good as experimental structures. DeepMind will provide ~138 million protein structures in 2022, ~200 times more than exist now. We will transform knowledge in our CATH evolutionary classification by bringing in this vast 3D data - and we will also bring in the sequences involved in predicting the structures. This even vaster sequence data will reveal evolutionarily conserved sites highly likely to be linked to function.
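To illustrate how conserved sites can be read off aligned sequences, here is a minimal sketch that scores each alignment column by Shannon entropy. It is an illustration only, not the scoring method used in CATH; the function name `column_conservation`, the toy alignment, and the normalisation by log2(20) are all assumptions made for this example.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column from 0 (variable) to 1 (fully conserved).

    Hypothetical helper: uses 1 - normalised Shannon entropy, ignoring gaps.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # assumption: 20 standard amino acids
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # column of gaps carries no signal
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Toy alignment (made up): columns 0 and 2 are invariant.
toy_alignment = ["MKHA", "MRHG", "MKHS"]
scores = column_conservation(toy_alignment)
```

In this toy case the invariant columns receive the top score, while the column with three different residues scores lowest. Real analyses would use far deeper alignments and weight sequences to correct for redundancy.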

To handle this massive amount of data we will build powerful new methods. Our recent trials using a new approach (CATHe) correctly assigned domain sequences to their evolutionary family ~90% of the time. Where we have an AlphaFold2 structure for the domain, we will apply accurate structure comparisons to validate the classification.

A major aim will be to use this new 3D data and more accurately predicted functional sites to understand how mutations in pathogens (e.g. SARS-CoV-2) can lead to increased virulence or transmission. We'll do this through our CATH-FunVar platform, which examines where mutations lie on the protein structure. Proximity to functional sites means the mutation may damage or enhance the function. We have started using FunVar to analyse variants of concern in SARS-CoV-2. We will extend it to other organisms and pathogens linked to human health and well-being, e.g. crops like wheat and rice that are essential for food security, where knowledge of variant impacts can guide selection and engineering of hardier or faster-growing varieties.
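The proximity idea can be sketched in a few lines. The example below is a hedged illustration, not FunVar's actual implementation: it assumes one representative 3D coordinate per residue (for instance the C-alpha atom), and the function names, the toy coordinates and the 8 Å cutoff are hypothetical choices made for this example.

```python
import math


def min_distance_to_site(coords, mutated_residue, site_residues):
    """Smallest distance (in the units of coords) from the mutated residue
    to any functional-site residue."""
    origin = coords[mutated_residue]
    return min(math.dist(origin, coords[r]) for r in site_residues)


def is_near_functional_site(coords, mutated_residue, site_residues, cutoff=8.0):
    """Flag a mutation as lying close to a functional site.

    The 8.0 default cutoff is an assumption for illustration only.
    """
    return min_distance_to_site(coords, mutated_residue, site_residues) <= cutoff


# Toy data (made up): residue number -> (x, y, z) coordinate in angstroms.
toy_coords = {
    10: (0.0, 0.0, 0.0),
    42: (3.0, 4.0, 0.0),   # hypothetical functional-site residue
    97: (30.0, 0.0, 0.0),
}
functional_site = {42}

near = is_near_functional_site(toy_coords, 10, functional_site)
far = is_near_functional_site(toy_coords, 97, functional_site)
```

Here residue 10 sits 5 Å from the site and is flagged, while residue 97 is not. A flag of this kind only suggests that a mutation may affect function; it says nothing about whether the effect is damaging or enhancing.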

To improve FunVar we will improve the accuracy of our predicted functional families and the detection of conserved functional sites in them. To do this we will exploit the vast structure and sequence data and adapt our new AI methods to make them even more powerful for this challenging task. We will build tools to analyse structure-function relationships in these families and develop powerful new visualisations for displaying these insights. Since we'll need to handle massive expansions in the data coming into CATH and lots of new methods for processing it - and since some new data is now captured in a way our computer programs can't read - we will completely re-engineer existing pipelines for classifying domains in CATH.

We have already built preliminary pipelines that brought over a quarter of a million AlphaFold2 models into CATH. This project will allow us to make these methods more robust and then apply them to bring in at least 100-fold more models, to expand FunVar and determine the effects of variants that could affect human health and food security.

Technical Summary

CATH classifies protein domains into evolutionary superfamilies to better understand sequence-structure-function relationships and improve prediction of protein functions and functional sites.
New tools and computational workflows will be developed (using Nextflow) to harness the massive expansions in protein structure data (3D-models from AlphaFold2 (AF2)) and sequence data (e.g. metagenome sequences in MGnify).

We will extend our webpages and APIs to provide the expanded CATH data to the biological and biomedical communities.

Major tools and workflows we will develop:
- Powerful new homologue detection methods (e.g. CATHe, recently successfully piloted) using deep learning (DL) strategies exploiting sequence embeddings based on natural language models (e.g. Prot-BERT-T5).
- Novel workflows for comparing AF2 3D-models against CATH 3D fold libraries. These will exploit in-house methods (e.g. SSAP) and powerful approaches using DL strategies employing 3D graphlets or shape-mers (e.g. FoldSeek, Geometricus).
- Novel workflows combining the new DL-based sequence and structure-based methods (above) to improve CATH functional families (FunFams) and functional site prediction.
- New tools for CATH-Plus (providing derived data), e.g. novel structure-function relationships including comprehensive multiple structure alignments of AF2 models and visualisations of conserved 3D and 1D motifs.
- Re-engineered platform for CATH-Classify (in Nextflow) integrating the new workflows (above) and handling massive expansions in the data.

All new methods will be benchmarked robustly.

The PDRA will also run and maintain CATH's computational platforms, i.e. the software, hardware, databases and web services required to process a constantly increasing amount of data; manually validate remote homologues and new folds; and generate derived data for CATH-Plus (e.g. multiple sequence alignments). Time will also be spent improving CATH tutorials/videos and extending them to explain the new data.
