ProtFunAI: AI based methods for functional annotation of proteins in crop genomes
Lead Research Organisation:
UNIVERSITY COLLEGE LONDON
Department Name: Structural Molecular Biology
Abstract
Our project will 'build on existing links and deepen existing relationships' between the two groups pioneering the development of AI/Deep-Learning models for proteins (Rost Group, TUM) and the application of these to protein domain families (Orengo Group, UCL). It will leverage world-leading expertise in protein Language Models (pLMs) to accelerate the scientific discovery of protein functions in the genomes of key agricultural crops important for food security. Our approaches will nevertheless be generic and will be rolled out to all UniProt proteins through existing collaborations.
Synergies between the two groups have evolved over several collaborations. Since 2019, joint work has produced ground-breaking results by tuning the pLMs developed in the Rost Group (e.g. the ProtTrans series, incl. ProtT5, ProtTucker) with protein family and functional family data (CATH superfamilies and FunFams) generated and maintained by the Orengo Group.
The partnership proposed here would allow researchers in the Rost and Orengo Groups to intensify exchanges by visiting each other's labs and interacting more comprehensively to design more effective protocols that enhance (1) protein homologue detection, (2) protein function prediction and (3) protein functional site prediction.
The Orengo and Rost Groups began collaborating in 2000 when working together on protein family analysis for target identification in the NIH-funded USA Structural Genomics initiative (PSI), which ended in 2015 [21-23]. Subsequently, funding from the German BMBF (Federal German Research Ministry) and DFG (German Research Foundation) supported visits of PhD and Masters students from both groups and resulted in the development of new approaches for protein function prediction [14,15]. This application seeks funds to continue these collaborations and to leverage the latest advances in AI/Deep Learning. The Rost Group recently enhanced their pLMs significantly (ProstT5 [18]), and the funding would allow us to apply ProstT5 to exploit the hugely expanded CATH classification, which is currently integrating hundreds of millions of predicted protein structures from the AlphaFold portal (AFDB).
The application is very timely as it will address key BBSRC strategic priorities around data-intensive biology and AI and the important challenge of food security. We will apply improved function prediction methods to significantly increase the functional annotations of plant genomes. This will bring 'new knowledge about key biological principles and mechanisms using AI-based approaches', bring 'AI in sustainable agriculture and food' and enable 'smart agriculture' by identifying genes implicated in biological systems associated with growth and stress resistance, e.g. drought and antimicrobial resistance. Most genes (typically >90%) from plants valuable as crops (e.g. wheat, maize, rice, sorghum) are experimentally uncharacterized or very poorly annotated. Our methods will be state-of-the-art, so that their predictions can accurately guide experimental validation.
We will disseminate the annotations using our established web-based CATH resource, accessed by over 27,000 users/month. Since CATH data is also disseminated by PDB, UniProt and InterPro, the predictions will be accessible to >900,000 users/month. We will also work closely with collaborators in the UK researching plant genomes to get feedback and solicit experimental validation where possible.
The project will significantly enhance the AI/ML skills of UK-based researchers in the Orengo Group, whose prior training was largely in biology. Conversely, the more AI-focused members of the Rost Group will deepen their understanding of individual proteins, organisms, and evolution. German scholars will also gain a deeper understanding of the workings of UK-based resources.
| People | ORCID iD |
| Christine Orengo (Principal Investigator) | |
| Description | ProFam combines different types of information (sequence, structure, and function) to better understand protein families. By learning from these diverse sources, ProFam can more accurately predict protein characteristics and generate new protein sequences with desired properties. This approach is more effective than previous methods that rely on a single type of information. |
| Exploitation Route | • Guided Sequence Design: Generate novel, functional, and stable protein sequences, accelerating the design of new proteins like enzymes or antibodies. • Exploring Family Diversity: Identify mutation-tolerant regions and discover novel functionalities within protein families. • Improving Fitness Prediction: Use ProFam's likelihood scores to guide experimental efforts in directed evolution or library design. • De novo Protein Design: Design entirely new protein families with desired properties based on structural and functional constraints. |
| Sectors | Education, Healthcare |
| Title | ProFam model |
| Description | ProFam is a novel multimodal generative model for enhanced protein family modeling. It uses a decoder-only transformer architecture to integrate sequence, structure, and functional annotations for defining protein families. It leverages data from TED, FoldSeek, and CATH FunFams. ProFam employs flash attention and sequence packing for efficient training on large datasets. The model can be used for family classification, fitness prediction, and novel sequence generation. The model is still being evaluated and enhanced with new features, and the method for generating it will be submitted for publication once the model is complete. The model will be made available together with the manuscript. |
| Type Of Material | Computer model/algorithm |
| Year Produced | 2024 |
| Provided To Others? | No |
| Impact | ProFam integrates multiple family definitions, enables multimodal control in generation, and employs efficient training. It outperforms existing methods in fitness prediction, family classification, and sequence generation. It captures a more comprehensive understanding of protein families. |
| Description | ProtFunAI - Collaboration with Burkhard Rost Team |
| Organisation | Technical University of Munich |
| Country | Germany |
| Sector | Academic/University |
| PI Contribution | Development of deep learning algorithms for protein function prediction, protein classification and analysis |
| Collaborator Contribution | Training in deep learning protocols and protein language models. Contributions to project design. Novel protein language models to generate protein embeddings for protein function prediction and other protein based prediction tasks. |
| Impact | Project has just started so no outputs yet |
| Start Year | 2024 |