BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Proteins play a major role in most important processes in life, such as the digestion of nutrients, immune response, and cellular regulation. They are long polymers that fold into compact globular units known as domains. Most proteins have at least two domains and some are composed of dozens. Domains tend to be associated with specific functions, although sometimes an important function results from a combination of multiple domains. 3D structure data and models are particularly valuable for detecting the pockets and surface features linked to domain function. Determining the structure and orientations of the constituent domains is important for understanding the overall function of the protein and the dynamic conformational changes linked to that function.

Until recently, structural data for proteins was very sparse, with <1% of all known proteins experimentally characterised. Whilst structures can be predicted with reasonable accuracy when the structure of a close relative is known, for a significant proportion of proteins no such data existed. Even for important organisms like humans or wheat, <50% of proteins had structural data accurate enough to understand the structural impacts of changes in the genes encoding the proteins.

This situation changed dramatically in 2021 when DeepMind's AlphaFold AI system succeeded in predicting protein structures of a quality comparable to experimentally determined structures. In August 2022, DeepMind released >214 million predicted structures covering all known proteins. Recent analyses have shown that in some cases AlphaFold models are not accurate enough for detailed studies, largely because the data needed to make the prediction is still too sparse. Nevertheless, the AlphaFold data massively increases the amount of high-quality structural data available for understanding the mechanisms by which proteins function.

Identifying constituent domains in a protein is not trivial. This project will exploit powerful AI technologies to more accurately predict domain boundaries. Preliminary studies are already showing significant improvements. We will apply multiple domain detection algorithms independently developed by two world-renowned protein domain classification teams (ECOD, CATH), both of whom have long track records in successfully automating domain detection. Their methods employ complementary strategies that can be combined to give a consensus prediction where agreement in assignments reflects higher confidence levels.
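As an illustration of the consensus idea described above, the following is a minimal sketch, not the ECOD or CATH method. It assumes each method reports domain boundaries as residue positions and that two boundaries "agree" when they fall within a chosen tolerance; the function name, the tolerance of 8 residues, and the "high"/"low" labels are all hypothetical.

```python
from typing import List, Tuple


def consensus_boundaries(
    bounds_a: List[int],
    bounds_b: List[int],
    tolerance: int = 8,  # assumed agreement window, in residues
) -> List[Tuple[int, str]]:
    """Merge boundary positions predicted by two independent methods.

    A boundary from method A with a partner from method B within
    `tolerance` residues is reported at their midpoint (integer division)
    with "high" confidence. Unmatched boundaries from either method are
    kept with "low" confidence.
    """
    results: List[Tuple[int, str]] = []
    unmatched_b = sorted(bounds_b)
    for a in sorted(bounds_a):
        # Nearest boundary from method B that has not been matched yet.
        partner = min(unmatched_b, key=lambda b: abs(b - a), default=None)
        if partner is not None and abs(partner - a) <= tolerance:
            results.append(((a + partner) // 2, "high"))
            unmatched_b.remove(partner)
        else:
            results.append((a, "low"))
    # Boundaries seen only by method B.
    results.extend((b, "low") for b in unmatched_b)
    return sorted(results)
```

For example, `consensus_boundaries([120, 250], [124, 400])` pairs 120 with 124 and leaves 250 and 400 as single-method predictions that might be flagged for manual review.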

Another major challenge will be coping with the scale of the data. Even allowing for a 50% loss due to poor model quality, the data represents a >200-fold increase over the data already classified in these evolutionary resources. An existing domain assignment and classification pipeline (3D-SCAFOLD), built to integrate experimental domain data from two resources (SCOP, CATH), will be re-engineered to incorporate ECOD (which is much more comprehensive than SCOP) and capture the vast predicted data from AlphaFold. This will require new and more efficient workflows that parallelise the processes. Furthermore, the pipeline will be more complex, as additional steps will be necessary to determine the model quality and remove poor models. We will also adapt access to the webpages and APIs to allow users to request targeted subsets and perform the more complex queries necessitated by the increase in the scale of the data.
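The quality-filtering step mentioned above could look roughly like the sketch below. It is an assumption-laden illustration, not the 3D-SCAFOLD implementation: it supposes each model comes with a list of per-residue confidence scores on a 0 to 100 scale, and the function name and the threshold of 70 are hypothetical choices.

```python
from statistics import mean
from typing import Dict, List


def filter_models(
    confidences: Dict[str, List[float]],
    min_mean: float = 70.0,  # assumed cut-off, not taken from the project
) -> List[str]:
    """Return the IDs of models whose mean per-residue confidence passes.

    Models with no scores at all are treated as poor and dropped.
    """
    kept: List[str] = []
    for model_id, scores in confidences.items():
        if scores and mean(scores) >= min_mean:
            kept.append(model_id)
    return kept
```

Because each model is assessed independently, a step like this splits naturally into batches that can run in parallel.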

In addition, we expect that many larger, more complex multidomain proteins will be very challenging, leading to discrepancies between the results provided by the different resources. We will hold workshops for the teams to agree on consensus assignments.

To cope with the scale of the data, we will initially target proteins in pathogenic organisms, crops essential for food security, and protein families linked to human health and well-being, including enzyme families important for environmental remediation and the production of commercially valuable compounds.

Technical Summary

Protein domains are fundamental units of life associated with most biological processes. Recognising constituent domains in proteins is critical for understanding protein function, protein dynamics and the effects of genetic variations. This project will facilitate collaborations between world-leading protein domain resources to:

(1) Develop automated domain detection algorithms that can handle predicted protein structures (e.g., from AlphaFold and ESMAtlas). We will build on existing methods from the ECOD (DPAM method) and CATH (CRH method) teams that exploit homology information. DPAM also uses estimated errors (PAE) provided by AlphaFold. We will also design novel AI methods that exploit protein sequence and structure embeddings. Preliminary work by the CATH team shows that sequence embeddings from the Prot-T5 protein language model enable higher accuracy than machine learning methods exploiting homology or protein biophysical data. We will build a meta-predictor that uses machine learning to combine outputs from all predictors. The performance will be benchmarked rigorously, e.g., using curated domain datasets and the performance metrics established independently by the CASP evaluation team.
(2) Since >800 million predicted protein structures are available from AlphaFold and ESMAtlas, we will re-engineer an existing domain detection platform developed by PDBe that provides domain assignments for experimental structures, to handle this large-scale predicted data.
(3) To prioritise domain detection for organisms and protein families most important for human health, food security and wealth, we will develop computational workflows in Nextflow that rank targets. For example, organisms and protein families will be prioritised based on PubMed citations. To expand our understanding of protein structures and thereby improve protein design and variant analysis, we will build workflows to identify protein clusters distant from experimentally characterised structures.
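To make the benchmarking mentioned in item (1) concrete, here is a minimal sketch of one common way to score predicted domain boundaries against a curated reference. It is not the CASP evaluation metric; the function name and the tolerance of 8 residues are assumptions made for illustration.

```python
from typing import List, Tuple


def boundary_precision_recall(
    predicted: List[int],
    reference: List[int],
    tolerance: int = 8,  # assumed matching window, in residues
) -> Tuple[float, float]:
    """Score predicted boundary positions against reference positions.

    A predicted boundary counts as a hit if it lies within `tolerance`
    residues of a reference boundary that has not already been matched.
    Returns (precision, recall); an empty list yields 0.0 for its ratio.
    """
    remaining = sorted(reference)
    hits = 0
    for p in sorted(predicted):
        # First unmatched reference boundary inside the tolerance window.
        match = next((r for r in remaining if abs(r - p) <= tolerance), None)
        if match is not None:
            hits += 1
            remaining.remove(match)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

With predictions at residues 100, 205 and 300 and reference boundaries at 102 and 200, two of the three predictions are hits and both reference boundaries are recovered.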
