SENSE - Screening of ENvironmental SEquences to discover novel protein functions, using informatics target selection and high-throughput validation

Lead Research Organisation: University College London

Department Name: Structural Molecular Biology

Technical Summary

This project will enable very large-scale discovery of novel enzymes and bacteriocins from assembled metagenomics sequence data by developing new computational and experimental platforms. Significant technical improvements will emerge from cycles of computational/experimental work, as results from experimental validation will inform algorithm refinements. Our predictions will be concomitantly captured in widely used databases. The scale of experimental validation will be extremely large compared to conventional approaches, enabling increased sampling of sequence space to identify functional novelty.
To sample metagenomic sequences, we will exploit existing and new assemblies to extract sequence data from a variety of sampled biomes. We will make major adaptations to existing bioinformatic platforms to functionally sub-classify metagenomic sequences and apply them to two cases, alpha/beta-hydrolases and bacteriocins. We will develop algorithms characterising key functional determinant residues to score the likelihood of new families having substantially different functionality. Transferring this to bacteriocins will be more challenging as these are often small peptides requiring accessory genes, which can be hard to detect and/or are functionally uncharacterised. Providing sensitive and accurate bacteriocin gene cluster identification and classification will require new methods to identify all components of the gene cluster through expanded homology and contextual models, prior to sub-family classification.
Key to our proposal will be the ability to perform very large-scale experimental validation of the bioinformatics predictions. This will be facilitated by novel gene synthesis platforms that can synthesise thousands of genes for screening, so that both the target sequences from the bioinformatics predictions and their sequence neighbours can be tested. Furthermore, high-throughput microfluidic droplet technology permits cost-effective and timely testing.

Planned Impact

This project will enable large-scale detection of functional biomolecules (proteins), the discovery of which impacts diverse spheres, including biotechnology and biocatalysis, development of new materials, food security and medical applications. It will impact on four BBSRC strategic areas related to metagenomes, synthetic biology, antibiotic resistance and data driven biology.

Firstly, we will analyse the available sequence data more efficiently using a combination of novel bioinformatic and experimental platforms allowing unprecedented throughput. Secondly, newly identified hydrolases and bacteriocins may be valorised as novel functional proteins for the benefit of academic and industry communities. Thirdly, we hope to have educational impact by training researchers in this project in a consortium that will traverse traditional boundaries between in silico biology, microengineering, high-throughput screening and classical enzymology.

The first objective will develop powerful new methods for exploring the vast sequence data being captured by metagenome initiatives. The Finn team manage EMBL-EBI's MGnify resource and have developed robust platforms for handling data on this scale and providing high-quality sequence outputs. Leveraging RF and CO's extensive experience in family classification, we will develop new techniques to detect relatives with a high likelihood of functional novelty. Putative targets will be experimentally validated on novel platforms that allow unprecedented throughput and additionally probe neighbours in sequence space to detect more stable mutants and further expand knowledge of functional determinants. Importantly, there will be cycles of bioinformatic analysis and prediction followed by experimental validation.

Although we will develop the protocols using two important classes of biomolecules, i.e. enzymes and bacteriocins, the methods will be generic and publicly available to apply to other families expanded by metagenomic data. Our tools will be made widely available to the large community of groups analysing this data, increasing impact. RF and CO coordinate different ELIXIR communities and will have opportunities to publicise the work and promote adoption of these techniques.

The novel hydrolases and bacteriocins have commercial value and relevance to human health. Bacterial alpha/beta-hydrolases are widely used in many industries, including dairy, pharmaceutical, and laundry, as they are easy to cultivate, nontoxic, and eco-friendly. Bacteriocins have value in both food security and human health, e.g. producing strains can be applied in food to extend preservation times. Bacteriocins can also be added directly to foods as a preservative, incorporated into bioactive packaging, added to animal feed as an anti-pathogen additive to protect livestock against pathogen damage, or used to help balance the bacteria in the digestive tract of livestock and humans to reduce gastrointestinal diseases. They have the potential to replace existing antibiotics (especially those to which resistance has emerged) and have been indicated as novel anticancer drugs.

The interdisciplinary aspect of our project will provide additional training opportunities and distinguish the staff development in this project from more conventional training, expanding interdisciplinary skills in the UK. Thus, the postdocs in this project will receive training that positions them to obtain jobs in small or large biotechnology enterprises. This aspect of the project will be accompanied by interactions with the institutions' technology transfer offices (CE, EMBLEM, and UCL Business) and industrial stakeholders, so that information is initially protected and then shared and commercialised.

Funded Value:

£229,346

Funded Period:

Jan 20 - Jan 23

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/T002735/1

Principal Investigator:

Christine Orengo

Research Subject:

Bioengineering (18%)

Biomolecules & biochemistry (18%)

Tools, technologies & methods (54%)

Research Topic:

Bioinformatics (24%)

Chemical Biology (18%)

Environmental Informatics (12%)

Protein engineering (18%)

Tools for the biosciences (18%)

Organisations

University College London (Lead Research Organisation)

People	ORCID iD
Christine Orengo (Principal Investigator)

Publications

Bordin N (2023) Novel machine learning approaches revolutionize protein knowledge in Trends in Biochemical Sciences

Bordin N (2023) AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms in Communications Biology

Lam SD (2020) SARS-CoV-2 spike protein predicted to form complexes with host receptor protein orthologues from a broad range of mammals in Scientific Reports

Rauer C (2021) Computational approaches to predict protein functional families and functional sites in Current Opinion in Structural Biology

Sillitoe I (2021) CATH: increased structural coverage of functional space in Nucleic Acids Research

Key Findings
Impact Summary
Research Tools and Methods
Engagement Activities


Description	We have developed two new algorithms, CATH FunFam-Fran and CATH-eMMA, which allow us to analyse large datasets of protein sequences from metagenomes. FunFam-Fran allowed us to search for novel enzymes able to degrade plastics, in particular a family of enzymes called PETases. We mined the metagenome data in MGnify and identified more than 20,000 putative novel PETases. We also built computational workflows (SiteTuner) that allow us to examine sequence and structure features of the active sites and identify the residues most likely to enhance enzyme activity. We selected a number of enzymes that our experimental collaborator has tested. Preliminary results showed activity in cells but some problems with solubility, so we developed an AI-based approach for detecting which residues should be mutated to improve solubility. Subsequently, a new approach, FD-motifs, refined the selection of functional determinants and, together with CATH-eMMA, allowed us to select further putative PETases. Because solubility matters for industrial application, we also revised the selection of putative PETases to include enzymes with surface site properties that should improve solubility. Experimental validation identified three novel enzymes from the metagenomes with PETase activity. A manuscript is being prepared for submission.
Exploitation Route	The data from our analyses would allow researchers in the biotech industries to design novel enzymes that are more effective at degrading plastics.
Sectors	Environment; Manufacturing, including Industrial Biotechnology


Description	The poster titled "Identifying novel plastic degrading enzymes using computational methods" was presented at UCL ISCB 2021. The work won the best poster prize.
First Year Of Impact	2023


Title	CATH-FRAN - a randomized splitting algorithm for the classification of Functional Families based on CATH-Gardener
Description	CATH-FRAN is an incremental update to CATH-Gardener, a pipeline that classifies sequences into Functional Families after an initial partitioning of the dataset according to their Multi-Domain Architecture (MDA). CATH-FRAN further splits the initial dataset into random partitions, allowing large superfamilies in CATH and metagenome sets to be processed.
Type Of Material	Improvements to research infrastructure
Year Produced	2020
Provided To Others?	No
Impact	CATH-FRAN allowed the group to create Functional Families from datasets that could not be processed with the previous version of the algorithm (CATH-Gardener). These datasets include promising sequence sets from metagenomic data.
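
The random splitting step described for CATH-FRAN can be illustrated with a minimal sketch. The record does not say how the partitions are sized or seeded, so the function name `random_partitions`, the `max_size` parameter and the fixed `seed` are all assumptions made for illustration; this is not the CATH-FRAN code.

```python
import random
from typing import List, Sequence


def random_partitions(seq_ids: Sequence[str], max_size: int, seed: int = 0) -> List[List[str]]:
    """Shuffle sequence identifiers and cut them into chunks of at most max_size.

    Hypothetical helper: the partition size and seeding are assumptions,
    not details taken from CATH-FRAN.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    shuffled = list(seq_ids)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + max_size] for i in range(0, len(shuffled), max_size)]


if __name__ == "__main__":
    # Toy identifiers standing in for the sequences of one MDA partition.
    ids = [f"seq{i}" for i in range(10)]
    for number, part in enumerate(random_partitions(ids, max_size=4, seed=42)):
        print(number, part)
```

Each partition can then be classified separately, which bounds the size of the input handed to the downstream step.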


Title	CATH-eMMA: Protein functional classification using embedding from protein language models
Description	CATH Functional Families (FunFams) are coherent subsets of CATH protein families in which all members share a conserved function. Before CATH-eMMA, FunFams were generated by building a tree of relationships between clusters of protein domains and then traversing it with a tool that assesses differentially conserved residues, yielding groups of sequences in which those residues are conserved across all members. This method, while precise, is computationally very expensive. CATH-eMMA reduces the overhead of the tree-building step by encoding protein sequences as embeddings from protein language models and calculating the relationships from the Euclidean distances between them.
Type Of Material	Improvements to research infrastructure
Year Produced	2024
Provided To Others?	Yes
Impact	CATH-eMMA has been applied successfully to very large enzyme families from metagenomes, discovering novel plastic degrading enzymes.
URL	https://github.com/UCLOrengoGroup/eMMA
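
As a rough illustration of building a tree from Euclidean distances between embeddings, the sketch below merges the two closest clusters repeatedly. It is not the eMMA implementation: the centroid-based merging rule, the function names and the two-dimensional toy vectors are assumptions, and real embeddings would come from a protein language model, which is not shown here.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Merge = Tuple[Tuple[str, ...], Tuple[str, ...], float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_merge_order(embeddings: Dict[str, Vector]) -> List[Merge]:
    """Greedy agglomerative tree: merge the two clusters with the closest centroids.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    The centroid rule is an assumption chosen for brevity.
    """
    clusters: Dict[Tuple[str, ...], List[float]] = {
        (name,): list(vec) for name, vec in embeddings.items()
    }
    merges: List[Merge] = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: euclidean(clusters[pair[0]], clusters[pair[1]]),
        )
        dist = euclidean(clusters[a], clusters[b])
        # Size-weighted mean keeps the centroid exact for the merged cluster.
        na, nb = len(a), len(b)
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(clusters[a], clusters[b])]
        del clusters[a], clusters[b]
        clusters[a + b] = merged
        merges.append((a, b, dist))
    return merges


if __name__ == "__main__":
    # Toy 2-D vectors standing in for high-dimensional sequence embeddings.
    toy = {"A": [0.0, 0.0], "B": [0.0, 1.0], "C": [10.0, 10.0]}
    for left, right, dist in build_merge_order(toy):
        print(left, right, round(dist, 3))
```

The naive search over all cluster pairs is only suitable for small examples; datasets at metagenome scale would need an indexed or approximate nearest-neighbour search.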


Title	PETase Functional determinant motifs: PETase-FDmotifs
Description	This program analyses functional families (FunFams) within a CATH superfamily to determine residue positions that are differentially conserved between these FunFams. Differentially conserved residues within 8 Å of the catalytic residues for the superfamily are deemed putative functional determinants (FDs) and are used to construct distinct functional motifs for each family, which can be used to identify further relatives.
Type Of Material	Improvements to research infrastructure
Year Produced	2024
Provided To Others?	No
Impact	PETase Functional determinant motifs (PETase-FDmotifs) is used for functional families identified within the Alpha-Beta Hydrolase (ABH) superfamily.
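
The selection of putative functional determinants can be sketched as two filters: positions conserved within each family but differing between families, then a distance cutoff from the catalytic residues. Only the 8 Å cutoff comes from the description above. The 90% conservation threshold, the function names, and the per-column coordinates (for example C-alpha positions from a representative structure) are assumptions for illustration, not the PETase-FDmotifs code.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def conserved_residue(column: str, min_fraction: float) -> Optional[str]:
    """Return the dominant residue of an alignment column if it is conserved enough."""
    counts = Counter(c for c in column if c != "-")
    if not counts:
        return None
    residue, n = counts.most_common(1)[0]
    return residue if n / len(column) >= min_fraction else None


def differentially_conserved(
    fam_a: Sequence[str], fam_b: Sequence[str], min_fraction: float = 0.9
) -> List[int]:
    """Columns conserved in both families but with different residues.

    Both families must share one alignment; min_fraction is an assumed threshold.
    """
    positions = []
    for i in range(len(fam_a[0])):
        res_a = conserved_residue("".join(s[i] for s in fam_a), min_fraction)
        res_b = conserved_residue("".join(s[i] for s in fam_b), min_fraction)
        if res_a and res_b and res_a != res_b:
            positions.append(i)
    return positions


def near_catalytic(
    positions: Sequence[int],
    coords: Dict[int, Coord],
    catalytic: Sequence[int],
    cutoff: float = 8.0,
) -> List[int]:
    """Keep positions within `cutoff` angstroms of any catalytic position."""
    return [
        p for p in positions
        if any(math.dist(coords[p], coords[c]) <= cutoff for c in catalytic)
    ]


if __name__ == "__main__":
    # Toy aligned families and made-up coordinates along one axis.
    fam_a = ["SAGHK", "SAGHK", "SAGHK"]
    fam_b = ["SWGHE", "SWGHE", "SWGDE"]
    coords = {0: (0.0, 0.0, 0.0), 1: (3.0, 0.0, 0.0), 2: (6.0, 0.0, 0.0),
              3: (9.0, 0.0, 0.0), 4: (20.0, 0.0, 0.0)}
    candidates = differentially_conserved(fam_a, fam_b)
    print(near_catalytic(candidates, coords, catalytic=[0]))
```

In the toy data, columns 1 and 4 are differentially conserved, but only column 1 lies within the cutoff of the catalytic position, so it alone is kept as a putative functional determinant.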


Title	VariPred
Description	Program which predicts the pathogenic impact of a residue mutation in a protein. The method exploits sequence embeddings from the ESM-1b protein language model.
Type Of Material	Improvements to research infrastructure
Year Produced	2023
Provided To Others?	Yes
Impact	VariPred is a novel and simple framework that leverages the power of pre-trained protein language models to predict variant pathogenicity.
URL	https://github.com/wlin16/VariPred


Description	Poster: Identifying novel plastic degrading enzymes using computational methods
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	This work was presented at the UCL ISCB 2021 conference.
Year(s) Of Engagement Activity	2021