BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Technical Summary

Protein domains are fundamental units of life associated with most biological processes. Recognising constituent domains in proteins is critical for understanding protein function, protein dynamics and the effects of genetic variations. This project will facilitate collaborations between world-leading protein domain resources to:

(1) Develop automated domain detection algorithms that can handle predicted protein structures (e.g., from AlphaFold and ESMAtlas). We will build on existing methods from the ECOD (DPAM method) and CATH (CRH method) teams that exploit homology information. DPAM also uses estimated errors (PAE) provided by AlphaFold. We will also design novel AI methods that exploit protein sequence and structure embeddings. Preliminary work by the CATH team shows that sequence embeddings from the Prot-T5 protein language model enable higher accuracy than machine learning methods exploiting homology or protein biophysical data. We will build a meta-predictor that uses machine learning to combine outputs from all predictors. The performance will be benchmarked rigorously, e.g., using curated domain datasets and the performance metrics established independently by the CASP evaluation team.
(2) Since >800 million predicted protein structures are available from AlphaFold and ESMAtlas, we will re-engineer an existing domain detection platform developed by PDBe that provides domain assignments for experimental structures, to handle this large-scale predicted data.
(3) To prioritise domain detection for the organisms and protein families most important for human health, food security and wealth, we will develop computational workflows in Nextflow that rank targets. For example, organisms and protein families will be prioritised based on PubMed citations. To expand our understanding of protein structures and thereby improve protein design and variant analysis, we will also build workflows to identify protein clusters distant from experimentally characterised structures.
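
As an illustration of the citation-based prioritisation in (3), the ranking step might be sketched as follows. This is a minimal sketch, not the project's workflow: the Target record, the rank_targets function, the alphabetical tie-break and the citation counts are all assumptions made for the example, and a real pipeline would obtain the counts from PubMed and run the step inside a Nextflow process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A candidate organism or protein family (hypothetical record layout)."""
    name: str
    pubmed_citations: int  # assumed to be pre-computed elsewhere


def rank_targets(targets):
    """Return targets ordered from most to least cited; ties broken by name."""
    return sorted(targets, key=lambda t: (-t.pubmed_citations, t.name))


# Made-up counts, for illustration only.
example = [
    Target("family_A", 120),
    Target("family_B", 4500),
    Target("family_C", 120),
]
ranked_names = [t.name for t in rank_targets(example)]
print(ranked_names)
```

Sorting on a single citation count keeps the sketch simple; other signals, such as distance from experimentally characterised structures, could be added to the sort key in the same way.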
