ProtFunAI: AI based methods for functional annotation of proteins in crop genomes

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Our project will 'build on existing links and deepen existing relationships' between the two groups pioneering the development of AI/Deep-Learning models for proteins (Rost Group, TUM) and the application of these to protein domain families (Orengo Group, UCL). It will leverage world-leading expertise in protein Language Models (pLMs) to accelerate the scientific discovery of protein functions in the genomes of key agricultural crops important for food security. Our approaches will nonetheless be generic and will be rolled out to all UniProt proteins through existing collaborations.

Synergies between the two groups have evolved over several collaborations. Since 2019, ground-breaking results have come from tuning the pLMs developed in the Rost Group (e.g. the ProtTrans series, incl. ProtT5, ProtTucker) with protein family and functional family data (CATH superfamilies and FunFams) generated and maintained by the Orengo Group.

The partnership proposed here would allow researchers in the Rost and Orengo Groups to intensify exchanges by visiting each other's labs and interacting more comprehensively to design more effective protocols that enhance (1) protein homologue detection, (2) protein function prediction, and (3) protein functional site prediction.

The Orengo and Rost Groups began collaborating in 2000, working together on protein family analysis for target identification in the NIH-funded USA Structural Genomics initiative (PSI), which ended in 2015 [21-23]. Subsequently, funding from the German BMBF (Federal German Research Ministry) and DFG (German Research Foundation) supported visits of PhD and Masters students from both groups and resulted in the development of new approaches for protein function prediction [14,15]. This application seeks funds to continue these collaborations and leverage the latest advances in AI/Deep Learning. The Rost Group recently enhanced their pLMs significantly (ProstT5 [18]), and the funding would allow us to apply ProstT5 to exploit the hugely expanded CATH classification, which is currently integrating hundreds of millions of predicted protein structures from the AlphaFold portal (AFDB).

The application is very timely, as it will address key BBSRC strategic priorities around data-intensive biology and AI and the important challenge of food security. We will apply improved function prediction methods to significantly increase the functional annotations of plant genomes. This will bring 'new knowledge about key biological principles and mechanisms using AI-based approaches', bring 'AI in sustainable agriculture and food', and enable 'smart agriculture' by identifying genes implicated in biological systems associated with growth and stress resistance, e.g. drought and antimicrobial resistance. Most genes (typically >90%) from plants valuable as crops (e.g. wheat, maize, rice, sorghum) are experimentally uncharacterized or very poorly annotated. Our state-of-the-art methods will provide accurate predictions to guide experimental validation.

We will disseminate the annotations using our established web-based CATH resource, which is accessed by over 27,000 users/month. Since CATH data is also disseminated by PDB, UniProt and InterPro, the predictions will be accessible to more than 900,000 users/month. We will also work closely with collaborators in the UK researching plant genomes to get feedback and solicit experimental validation where possible.

The project will significantly enhance the AI/ML skills of UK-based researchers in the Orengo Group, whose prior training was largely in biology. Conversely, the more AI-focused members of the Rost Group will deepen their understanding of individual proteins, organisms, and evolution. The German researchers will also gain a deeper understanding of the workings of UK-based resources.
