SENSE - Screening of ENvironmental SEquences to discover novel protein functions using informatics target selection and high-throughput validation

Lead Research Organisation: European Bioinformatics Institute
Department Name: Genome Assembly and Annotation

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

This project will enable very large-scale discovery of novel enzymes and bacteriocins from assembled metagenomics sequence data by developing new computational and experimental platforms. Significant technical improvements will emerge from cycles of computational/experimental work, as results from experimental validation will inform algorithm refinements. Our predictions will be concomitantly captured in widely used databases. The scale of experimental validation will be extremely large compared to conventional approaches, enabling increased sampling of sequence space to identify functional novelty.
To sample metagenomic sequences, we will exploit existing and new assemblies to extract sequence data from a variety of sampled biomes. We will make major adaptations to existing bioinformatic platforms to functionally sub-classify metagenomic sequences and apply them to two cases, alpha/beta-hydrolases and bacteriocins. We will develop algorithms characterising key functional determinant residues to score the likelihood of new families having substantially different functionality. Transferring this to bacteriocins will be more challenging as these are often small peptides requiring accessory genes, which can be hard to detect and/or are functionally uncharacterised. Providing sensitive and accurate bacteriocin gene cluster identification and classification will require new methods to identify all components of the gene cluster through expanded homology and contextual models, prior to sub-family classification.
Key to our proposal will be the ability to perform very large-scale experimental validation of the bioinformatics predictions. This will be facilitated by novel gene synthesis platforms that can synthesise thousands of genes for screening, so that both the target sequences provided by bioinformatics predictions and their sequence neighbours can be tested. Furthermore, the use of high-throughput microfluidic droplet technology permits testing in a very cost-effective and timely way.
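
As an illustration only, the idea of scoring candidate sequences on key functional determinant residues can be sketched as below. This is a minimal toy, not the project's algorithm: the function name `determinant_divergence`, the `TRIAD` column positions and the two example sequences are all hypothetical, and the sketch assumes each candidate has already been aligned to the family so that alignment columns are comparable.

```python
from typing import Dict, Set


def determinant_divergence(aligned_seq: str, determinants: Dict[int, Set[str]]) -> float:
    """Fraction of key alignment columns where the candidate's residue
    falls outside the residues seen in characterised family members.

    aligned_seq  -- candidate sequence as a row of a family alignment
                    (gaps written as '-')
    determinants -- hypothetical map of 0-based alignment column to the
                    set of residues observed at that functional position
    """
    if not determinants:
        raise ValueError("at least one determinant position is required")
    mismatches = 0
    for column, allowed in determinants.items():
        residue = aligned_seq[column].upper()
        # A gap or an unexpected residue both count as divergence.
        if residue not in allowed:
            mismatches += 1
    return mismatches / len(determinants)


# Hypothetical columns standing in for a Ser / acid / His catalytic triad.
TRIAD = {2: {"S"}, 5: {"D", "E"}, 8: {"H"}}

conserved = determinant_divergence("GGSGGDAAHL", TRIAD)  # all three positions match
divergent = determinant_divergence("GGCGG-AAHL", TRIAD)  # two of three positions differ
```

A higher score flags a sequence whose key positions differ from characterised relatives. Under the assumptions above, such a sequence would be a candidate for synthesis and screening rather than a confirmed functional variant.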

Planned Impact

This project will enable large-scale detection of functional biomolecules (proteins), the discovery of which impacts diverse spheres, including biotechnology and biocatalysis, development of new materials, food security and medical applications. It will impact four BBSRC strategic areas: metagenomes, synthetic biology, antibiotic resistance and data-driven biology.

Firstly, we will analyse the available sequence data more efficiently using a combination of novel bioinformatic and experimental platforms allowing unprecedented throughput. Secondly, newly identified hydrolases and bacteriocins may be valorised as novel functional proteins for the benefit of academic and industry communities. Thirdly, we hope to have educational impact by training researchers in this project in a consortium that will traverse traditional boundaries between in silico biology, microengineering, high-throughput screening and classical enzymology.

The first objective will develop powerful new methods for exploring the vast sequence data being captured by metagenome initiatives. The Finn team manage EMBL-EBI's MGnify resource and have developed robust platforms for handling data on this scale and providing high-quality sequence outputs. Leveraging RF and CO's extensive experience in family classification, we will develop new techniques to detect relatives with a high likelihood of functional novelty. Putative targets will be experimentally validated by novel experimental platforms that allow unprecedented throughput and additionally probe neighbours in sequence space to detect more stable mutants and further expand knowledge of functional determinants. Importantly, there will be cycles of bioinformatic analysis and prediction followed by experimental validation.

Although we will develop the protocols using two important classes of biomolecules, i.e. enzymes and bacteriocins, the methods will be generic and publicly available to apply to other families expanded by metagenomic data. Our tools will be made widely available to the large community of groups analysing this data, increasing impact. RF and CO coordinate different ELIXIR communities and will have opportunities to publicise the work and promote adoption of these techniques.

The novel hydrolases and bacteriocins have commercial value and relevance to human health. Bacterial alpha/beta-hydrolases are widely used in many industries, including dairy, pharmaceutical, and laundry, as they are easy to cultivate, nontoxic, and eco-friendly. Bacteriocins have value in both food security and human health, e.g. producing strains can be applied in food to extend preservation times. Bacteriocins can also be added directly to foods as a preservative, incorporated into bioactive packaging, added to animal feed as an anti-pathogen additive to protect livestock against pathogen damage, or used to help balance the bacteria in the digestive tract of livestock and humans to reduce gastrointestinal diseases. They have the potential to replace existing antibiotics (especially those to which resistance has emerged) and have been proposed as novel anticancer drugs.

The interdisciplinary aspect of our project will provide additional training opportunities and distinguish the staff development in this project from more conventional training, expanding interdisciplinary skills in the UK. Thus, the postdocs in this project will receive training that positions them to obtain jobs in small or large biotechnology enterprises. This aspect of the project will be accompanied by interactions with the institutions' technology transfer offices (CE, EMBLEM, and UCL Business) and industrial stakeholders, so that information is initially protected and then shared and commercialised.

Publications

Richardson L (2023) MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research

 
Description We have explored the diversity of the alpha/beta-hydrolase superfamily present in metagenomic samples, with a specific focus on PETases, the enzymes responsible for the degradation of polyethylene terephthalate (PET plastic). We have found that metagenomic samples are substantially enriched for this enzyme class, especially marine samples. To enrich the first tranche, and in response to our collaborators' request, we have also assembled and analysed datasets from extreme temperatures (hot and cold) for PETases. These provided a supplementary set of enzymes that interleaved with the previous set. These results are allowing us to understand both immediate changes around the enzyme active site and allosteric interactions that may also affect enzyme activity and substrate affinity. To complement this activity, we are developing the infrastructure to improve the ability to select subsets of the MGnify protein database, and to combine this with both the sample metadata and the genomic context in which a protein may be found. The size of the protein database, i.e. >2.5 billion non-redundant sequences, has presented major technical challenges in exactly how it is stored and accessed. We have now overcome these major obstacles and have released an updated version of the database which additionally provides Pfam annotations on all sequences. These are supplemented by annotations provided by ProtENN2, a sequence embedding approach for functional annotation developed by Google AI. The database is made available as flatfiles, but the underlying relational database contains the contextual information for every protein sequence, which allows us to trace genomic context and sample information. Finally, we have provided our predictions for ribosomally synthesised and post-translationally modified peptides (RiPPs) to our collaborators to facilitate the evaluation of bacterial growth inhibition by these novel RiPPs.
Assay results have demonstrated that we have indeed correctly identified novel RiPPs, since activity was observed against both Staphylococcus aureus and Escherichia coli, with the latter occurring at levels that are useful for applications in the biotechnology and/or pharmaceutical industries.
Exploitation Route Metagenomics is providing unprecedented access to the 99% of microbes that are yet to be experimentally isolated and cultured. Consequently, such datasets are providing a wealth of new sequences and novel insights into the ability of microbes to exploit different niches. However, mining metagenomics data is complex because of the sheer data volumes and the fundamental need for specialist computational pipelines to assemble and analyse the data. This project has paved the way to increase the accessibility of this data by allowing a multidisciplinary set of scientists to gather collections of sequences to understand enzyme evolution, and how this data can be utilised for rational enzyme design. Natural products research is a growing area, and metagenomics represents an important new source of potential products. The scale of data housed in MGnify indicates that tens of thousands of novel natural products are awaiting discovery and experimental analysis. We have focused on RiPPs in the SENSE project, as they are typically produced by small genomic regions that can be relatively easily synthesised and screened using high-throughput techniques. RiPPs have been widely used in a variety of different industries, from food preservation to healthcare. They also tend to have a fairly narrow spectrum of activity, and understanding their mode of action can have major impacts on overcoming societal threats, such as antimicrobial resistance.
Sectors Agriculture, Food and Drink; Chemicals; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description Member, UKRI Knowledge Transfer Network (KTN) Microbiome Innovation Network
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
 
Title MGnify protein database 
Description The MGnify protein database, which contains 2.4 billion non-redundant sequences. This latest release is more than double the size of the previous release of 1.1 billion sequences. Sequences are clustered at 90% coverage and 90% identity to generate 620 million clusters. ProtENN2 annotations by Google AI are included in this release.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact Information relating to the biome and genomic context for the protein sequences provided here is crucial for downstream analyses and applications for proteins of interest. Moreover, the majority of proteins identified in metagenomics studies and included in this release are not covered by other major protein resources, thus representing a significant novel source of proteins, especially from uncultured organisms, with a wide range of applications.
URL http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/README.txt
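
For readers unfamiliar with clustering thresholds, the criterion of 90% coverage and identity can be sketched as a simple pairwise test. This is a minimal illustration under stated assumptions, not the MGnify pipeline: the function name `meets_threshold` and the example numbers are hypothetical, identity is taken as matching positions over aligned positions, and coverage as aligned positions over the longer sequence. Clustering tools differ in how they define both quantities.

```python
def meets_threshold(matches: int, aligned_len: int, len_a: int, len_b: int,
                    min_identity: float = 0.9, min_coverage: float = 0.9) -> bool:
    """Return True if an aligned pair would be grouped under the assumed criterion.

    matches     -- number of identical positions in the alignment
    aligned_len -- number of aligned positions
    len_a/len_b -- full lengths of the two sequences
    """
    if aligned_len <= 0 or min(len_a, len_b) <= 0:
        return False
    identity = matches / aligned_len
    # Assumption: coverage is measured against the longer sequence, so a
    # short fragment does not join the cluster of a full-length protein.
    coverage = aligned_len / max(len_a, len_b)
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical pairs: two near-identical full-length proteins, and a fragment.
same_cluster = meets_threshold(matches=285, aligned_len=300, len_a=310, len_b=305)
fragment = meets_threshold(matches=150, aligned_len=150, len_a=150, len_b=310)
```

In this toy, the fragment is perfectly identical over its aligned region but fails the coverage test, which is why a coverage threshold is applied alongside identity.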
 
Description "What metagenomic data can tell us about healing the planet" talk at the Life Science Across the Globe - talks on science and culture 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Talk by PI Rob Finn on MGnify at the Learning from the planet to heal the planet: Microbial Ecosystems online seminar series (hosted by EMBL and HHMI Janelia Research Campus).
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=Hc89Rrs_ykY&ab_channel=HHMI%27sJaneliaResearchCampus
 
Description BIOPROSP_23 Keynote talk "Genome Resolved Metagenomics - Understanding the potential of marine microbial communities for novel product discovery" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote talk by PI Rob Finn at the BIOPROSP-23 conference held in Tromsø, Norway. BIOPROSP is the international biennial scientific conference on marine biotechnology, which aims to translate basic research into applied research with industrial application.
Year(s) Of Engagement Activity 2023
URL https://www.tekna.no/en/events/bioprosp_23-42323/Program/?info=156913
 
Description EMBL-EBI News "2.4 billion sequences now available in the latest MGnify protein database release" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Newsletter announcing MGnify's new release of its protein database, which contains 2.4 billion non-redundant sequences, including new annotations provided by Google AI.
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/about/news/updates-from-data-resources/2-4-billion-sequences-now-available-in-...
 
Description ISME 18 Roundtable "What does it take to be FAIR?" by the National Microbiome Data Collaborative 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Roundtable organised by the National Microbiome Data Collaborative at ISME18. PI Rob Finn was an expert panelist on the roundtable. Discussions covered attitude shifts required for microbiome data sharing, what constitutes good metadata and other points.
Year(s) Of Engagement Activity 2022
URL https://twitter.com/MicrobiomeData/status/1559210668485640194
 
Description MGnify public engagement at the Love Nature Festival - February half-term activities
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Public engagement activity at the Love Nature Festival half-term event organised by the Ipswich Museum at Christchurch Mansion. MGnify had a stall where they presented recent research work on plastic-degrading enzymes from bacteria.
Year(s) Of Engagement Activity 2023
URL https://twitter.com/MGnifyDB/status/1625858534637436930
 
Description Meta AI Research blogpost "ESM Metagenomic Atlas: The first view of the 'dark matter' of the protein universe" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Meta AI blogpost describing the release of the ESM Metagenomic Atlas of 600+ million proteins, with predictions for nearly the entire MGnify90 database, a public resource cataloguing metagenomic sequences.
Year(s) Of Engagement Activity 2022
URL https://ai.facebook.com/blog/protein-folding-esmfold-metagenomics/
 
Description Nature News "Meta just dropped 600+ million protein structure predictions, made using a large language model." 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nature news coverage of Meta AI's new ESM Metagenomic Atlas: "AlphaFold's new rival? Meta AI predicts shape of 600 million proteins"
Year(s) Of Engagement Activity 2022
URL https://www.nature.com/articles/d41586-022-03539-1