SENSE - Screening of ENvironmental SEquences to discover novel protein functions using informatics target selection and high-throughput validation

Lead Research Organisation: European Bioinformatics Institute
Department Name: Genome Assembly and Annotation

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

This project will enable very large-scale discovery of novel enzymes and bacteriocins from assembled metagenomics sequence data by developing new computational and experimental platforms. Significant technical improvements will emerge from cycles of computational/experimental work, as results from experimental validation will inform algorithm refinements. Our predictions will be concomitantly captured in widely used databases. The scale of experimental validation will be extremely large compared to conventional approaches, enabling increased sampling of sequence space to identify functional novelty.
To sample metagenomic sequences, we will exploit existing and new assemblies to extract sequence data from a variety of sampled biomes. We will make major adaptations to existing bioinformatic platforms to functionally sub-classify metagenomic sequences and apply them to two cases, alpha/beta-hydrolases and bacteriocins. We will develop algorithms characterising key functional determinant residues to score the likelihood of new families having substantially different functionality. Transferring this to bacteriocins will be more challenging as these are often small peptides requiring accessory genes, which can be hard to detect and/or are functionally uncharacterised. Providing sensitive and accurate bacteriocin gene cluster identification and classification will require new methods to identify all components of the gene cluster through expanded homology and contextual models, prior to sub-family classification.
Key to our proposal will be the ability to perform very large-scale experimental validation of the bioinformatics predictions. This will be facilitated by novel gene synthesis platforms that can synthesise thousands of genes for screening, so that both the target sequences provided by the bioinformatics predictions and their sequence neighbours can be tested. Furthermore, high-throughput microfluidic droplet technology permits testing in a very cost-effective and timely way.
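As an illustration only of the idea of scoring families by their key functional determinant residues, a minimal sketch follows. It is not the project's algorithm. It assumes pre-aligned sequences and a known list of determinant alignment columns; the function names, the toy sequences and the simple "fraction of differing consensus residues" score are all hypothetical.

```python
from collections import Counter


def column_consensus(seqs, col):
    """Return the most common non-gap residue in an alignment column (None if all gaps)."""
    counts = Counter(s[col] for s in seqs if s[col] != "-")
    return counts.most_common(1)[0][0] if counts else None


def novelty_score(reference, candidate, determinant_cols):
    """Fraction of determinant columns whose consensus residue differs between two families.

    Hypothetical scoring scheme: 0.0 means identical determinants, 1.0 means all differ.
    """
    if not determinant_cols:
        raise ValueError("need at least one determinant column")
    differing = sum(
        column_consensus(reference, c) != column_consensus(candidate, c)
        for c in determinant_cols
    )
    return differing / len(determinant_cols)


# Toy aligned sequences, invented purely for illustration.
reference_family = ["GHSMG", "GHSMG", "GHSLG"]
candidate_family = ["GDSMG", "GDSMA", "GDSMG"]
determinants = [1, 2, 4]  # assumed determinant columns (0-based)

score = novelty_score(reference_family, candidate_family, determinants)
print(f"novelty score: {score:.2f}")
```

A real method would need to weight positions, handle alignment uncertainty and be calibrated against experimental results, which is the role of the validation cycles described above.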

Planned Impact

This project will enable large-scale detection of functional biomolecules (proteins), the discovery of which impacts diverse spheres, including biotechnology and biocatalysis, development of new materials, food security and medical applications. It will impact on four BBSRC strategic areas related to metagenomes, synthetic biology, antibiotic resistance and data driven biology.

Firstly, we will analyse the available sequence data more efficiently using a combination of novel bioinformatic and experimental platforms allowing unprecedented throughput. Secondly, newly identified hydrolases and bacteriocins may be valorised as novel functional proteins for the benefit of academic and industry communities. Thirdly, we hope to have educational impact by training researchers in this project in a consortium that will traverse traditional boundaries between in silico biology, microengineering, high-throughput screening and classical enzymology.

The first objective will develop powerful new methods for exploring the vast sequence data being captured by metagenome initiatives. The Finn team manage EMBL-EBI's MGnify resource and have developed robust platforms for handling data on this scale and providing high quality sequence outputs. Leveraging RF and CO's extensive experience in family classification, we will develop new techniques to detect relatives with a high likelihood of functional novelty. Putative targets will be experimentally validated by novel experimental platforms that allow high-throughput at an unprecedented level and additionally probe neighbours in sequence space to detect more stable mutants and further expand knowledge of functional determinants. Importantly, there will be cycles of bioinformatic analysis and prediction followed by experimental validation.

Although we will develop the protocols using two important classes of biomolecules, i.e. enzymes and bacteriocins, the methods will be generic and publicly available to apply to other families expanded by metagenomic data. Our tools will be made widely available to the large community of groups analysing this data, increasing impact. RF and CO coordinate different ELIXIR communities and will have opportunities to publicise the work and promote adoption of these techniques.

The novel hydrolases and bacteriocins have commercial value and relevance to human health. Bacterial alpha/beta-hydrolases are widely used in many industries, including dairy, pharmaceutical, and laundry, as the producing bacteria are easy to cultivate, nontoxic, and eco-friendly. Bacteriocins have value in both food security and human health, e.g. producing strains can be applied in food to extend preservation times. Bacteriocins can also be added directly to foods as a preservative, incorporated into bioactive packaging, added to animal feed as an anti-pathogen additive to protect livestock against pathogen damage, or used to help balance the bacteria in the digestive tract of livestock and humans to reduce gastrointestinal diseases. They have the potential to replace existing antibiotics (especially those to which resistance has emerged) and have been indicated as novel anticancer drugs.

The interdisciplinary aspect of our project will provide additional training opportunities and distinguish the staff development in this project from more conventional training, expanding interdisciplinary skills in the UK. Thus, the postdocs in this project will receive training that positions them to obtain jobs in small or large biotechnology enterprises. This aspect of the project will be accompanied by interactions with the institutions' technology transfer offices (CE, EMBLEM, and UCL Business) and industrial stakeholders, so that information is initially protected and then shared and commercialised.

Publications

Richardson L (2023) MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research

 
Description We have explored the diversity of the alpha/beta-hydrolase superfamily present in metagenomic samples with a specific focus on PETases, the enzymes responsible for the degradation of polyethylene terephthalate (PET plastic). We have found that metagenomic samples are substantially enriched for this enzyme class, especially in marine samples. To enrich the first tranche, and in response to our collaborators' request, we have assembled and analysed datasets from extreme temperatures (hot and cold) for PETases. These provided a supplementary set of enzymes that interleaved with the previous set. These results are allowing us to understand both immediate changes around the enzyme active site and allosteric interactions that may also impact enzyme activity and substrate affinity. To complement this activity, we are improving the infrastructure to select subsets of the MGnify protein database, and to combine these with both the sample metadata and the genomic context in which a protein may be found. The size of the protein database, i.e. >2.4 billion non-redundant sequences, has presented major technical challenges in how it is stored and accessed. We have now overcome these major obstacles and have released an updated version of the database which additionally provides Pfam annotations on all sequences. These are supplemented by annotations provided by ProtENN2, a sequence embedding approach for functional annotation developed by Google AI.
The database is made available as flat-files, but the underlying relational database contains the contextual information for every protein sequence, which allows us to trace genomic context and sample information.
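To make the idea of combining protein records with sample metadata concrete, here is a minimal sketch of selecting a subset of proteins by biome. It does not reflect the actual MGnify schema or flat-file layout: the field names (`accession`, `sample_id`, `biome`), the identifiers and the biome strings are all hypothetical, and at the scale of billions of sequences this would be done in a database rather than in memory.

```python
def select_by_biome(proteins, samples, biome_prefix):
    """Return protein records whose source sample's biome starts with biome_prefix.

    proteins: list of dicts with hypothetical keys "accession" and "sample_id"
    samples:  dict mapping sample_id to a dict with a hypothetical "biome" key
    """
    matching_samples = {
        sample_id
        for sample_id, meta in samples.items()
        if meta["biome"].startswith(biome_prefix)
    }
    return [p for p in proteins if p["sample_id"] in matching_samples]


# Invented toy records, for illustration only.
samples = {
    "S1": {"biome": "Environmental:Aquatic:Marine"},
    "S2": {"biome": "Environmental:Terrestrial:Soil"},
    "S3": {"biome": "Environmental:Aquatic:Marine:Hydrothermal vents"},
}
proteins = [
    {"accession": "P0001", "sample_id": "S1"},
    {"accession": "P0002", "sample_id": "S2"},
    {"accession": "P0003", "sample_id": "S3"},
]

marine = select_by_biome(proteins, samples, "Environmental:Aquatic:Marine")
print([p["accession"] for p in marine])
```

Building the set of matching samples first keeps the protein pass to a single linear scan, which mirrors how a relational join on a sample key would behave.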

This database was the foundation for the creation of ESMAtlas, a collection of over 650 million protein structural models, which were produced using the ESMFold algorithm (Lin Z. et al. Science 2023). Coupled with MGnify, ESMAtlas is amplifying the use of our proteins across different areas of science, and across the academic and industrial sectors. It is not possible to estimate the usage of our data at this stage, because the MGnify proteins are in ESMAtlas, which has in turn been propagated to widely used structural comparison tools, such as FoldSeek, where MGnify-ESM30 is one of the default databases alongside AlphaFold and PDB.

Finally, we have provided our predictions for Ribosomally synthesised and post-translationally modified peptides (RiPPs) to our collaborators to facilitate the evaluation of bacterial growth inhibition by these novel RiPPs. Assay results have demonstrated that we have indeed correctly identified novel RiPPs since activity was observed against both Staphylococcus aureus and Escherichia coli, with the latter occurring at levels that are useful for applications in biotechnology and/or pharmaceutical industries.
Exploitation Route Metagenomics is providing unprecedented access to the 99% of microbes that are yet to be experimentally isolated and cultured. Consequently, such datasets are providing a wealth of new sequences that offer novel insights into the ability of microbes to exploit different niches. However, mining metagenomics data is complex due to the magnitude of the data volume and the fundamental need for specialist computational pipelines to assemble and analyse the data. This project has paved the way to increase the accessibility of this data by allowing a multidisciplinary set of scientists to gather collections of sequences to understand enzyme evolution, and how this data can be utilised for rational enzyme design. Natural products research is a growing area and metagenomics represents an important new source of potential products. The scale of data housed in MGnify indicates that tens of thousands of novel natural products are awaiting discovery and experimental analysis. We have focused on RiPPs in the SENSE project, as they are typically produced by small genomic regions that can be relatively easily synthesised and screened using high-throughput techniques. RiPPs have been widely used in a variety of different industries, from food preservation to healthcare. They also tend to have a fairly narrow spectrum of activity, and understanding their mode of action can have major impacts on overcoming societal threats, such as antimicrobial resistance.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Chemicals; Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

 
Description Member, UKRI Knowledge Transfer Network (KTN) Microbiome Innovation Network
Geographic Reach National 
Policy Influence Type Participation in a guidance/advisory committee
 
Description BlueRemediomics: Harnessing the marine microbiome for novel sustainable biogenics and ecosystem services
Amount € 7,649,827 (EUR)
Funding ID 101082304 
Organisation European Commission 
Sector Public
Country Belgium
Start 12/2022 
End 11/2026
 
Description Novel Plastizymes: discovery and improvement of plastic-degrading enzymes by integrated cycles of computational and experimental approaches
Amount £3,024,438 (GBP)
Funding ID BB/X00306X/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 03/2023 
End 03/2028
 
Title MGnify protein database 
Description MGnify protein database that contains 2.4 billion non-redundant sequences. This latest release is more than double the size of the previous release of 1.1 billion sequences. Sequences are clustered at 90% coverage and identity to generate 620 million clusters. ProtENN2 annotations by Google AI are included in this release. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact Information relating to the biome and genomic context for the protein sequences provided here is crucial for downstream analyses and applications for proteins of interest. Moreover, the majority of proteins identified in metagenomics studies and included in this release are not covered by other major protein resources, thus representing a significant novel source of proteins, especially from uncultured organisms, with a wide range of applications. 
URL http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/README.txt
 
Title Metagenomic non-redundant protein database 
Description Database of protein sequences produced from assembly of metagenomic datasets. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
Impact This database initially supported the discovery of novel enzymes by an SME biotech company (BioCatalysts) as part of an InnovateUK BBSRC grant. Since this initial dataset was produced, it has grown to over 2.4 billion non-redundant sequences. This has been made available as a Google BigQuery resource to make it accessible to the research community. Another major impact has been the role of this MGnify dataset in the production of AlphaFold and ESMFold. 
URL https://www.ebi.ac.uk/metagenomics/sequence-search/search/phmmer
 
Description "What metagenomic data can tell us about healing the planet" talk at the Life Science Across the Globe - talks on science and culture 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Talk by PI Rob Finn on MGnify at the Learning from the planet to heal the planet: Microbial Ecosystems online seminar series (hosted by EMBL and HHMI Janelia Research Campus).
Year(s) Of Engagement Activity 2022
URL https://www.youtube.com/watch?v=Hc89Rrs_ykY&ab_channel=HHMI%27sJaneliaResearchCampus
 
Description BIOPROSP_23 Keynote talk "Genome Resolved Metagenomics - Understanding the potential of marine microbial communities for novel product discovery" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote talk by PI Rob Finn at the BIOPROSP-23 conference held at Tromsø, Norway. BIOPROSP is the international biennial scientific conference on marine biotechnology, which aims to translate basic research into applied research with industrial application.
Year(s) Of Engagement Activity 2023
URL https://www.tekna.no/en/events/bioprosp_23-42323/Program/?info=156913
 
Description Business Insider Interview titled "Scientists are racing to explore more of the ocean hoping to discover medical breakthroughs and address the climate crisis" 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Business Insider interview with MGnify PI Dr Robert Finn on using MGnify outputs for propelling novel Biodiscovery in the marine domain. Dr Finn coordinates an EU HORIZON initiative BlueRemediomics that is centred around the MGnify microbiome resource data and technology. The interview was reposted by Business Insider India https://www.businessinsider.in/science/news/scientists-are-racing-to-explore-more-of-the-ocean-hoping-to-discover-medical-breakthroughs-and-address-the-climate-crisis/articleshow/102210340.cms
Year(s) Of Engagement Activity 2023
URL https://www.businessinsider.com/marine-biodiscovery-could-unlock-answers-health-climate-crises-2023-...
 
Description EMBL-EBI News "2.4 billion sequences now available in the latest MGnify protein database release" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Newsletter announcing MGnify's new release of its protein database, which contains 2.4 billion non-redundant sequences, including new annotations provided by Google AI.
Year(s) Of Engagement Activity 2022
URL https://www.ebi.ac.uk/about/news/updates-from-data-resources/2-4-billion-sequences-now-available-in-...
 
Description ISME 18 Roundtable "What does it take to be FAIR?" by the National Microbiome Data Collaborative 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Roundtable organised by the National Microbiome Data Collaborative at ISME18. PI Rob Finn was an expert panelist on the roundtable. Discussions covered attitude shifts required for microbiome data sharing, what constitutes good metadata and other points.
Year(s) Of Engagement Activity 2022
URL https://twitter.com/MicrobiomeData/status/1559210668485640194
 
Description Invited talk titled "MGnify - a hub for the archiving, analysis, and discovery of microbiome derived sequence data" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk by PI Dr Robert Finn titled "MGnify - a hub for the archiving, analysis, and discovery of microbiome derived sequence data" at the 2023 RoBioinfo Conference organised by the Romanian Society of Bioinformatics (RSBI) in Bucharest, May 2023. The 2023 conference focused on two main themes, namely human genomics and biodiversity-microbiome, and Dr Finn's talk was featured in the latter.
Year(s) Of Engagement Activity 2023
URL https://rsbi.ro/evenimente/2023-robioinfo-conference/
 
Description MGnify public engagement at the Love Nature Festival- February Half-term activities 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact Public engagement activity at the Love Nature Festival half-term event organised by the Ipswich Museum at the Christchurch Mansions. MGnify had a stall where they presented recent research work on plastic-degrading enzymes from bacteria.
Year(s) Of Engagement Activity 2023
URL https://twitter.com/MGnifyDB/status/1625858534637436930
 
Description Meta AI Research blogpost "ESM Metagenomic Atlas: The first view of the 'dark matter' of the protein universe" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Meta AI blogpost describing the release of the ESM Metagenomic Atlas of 600+ million proteins, with predictions for nearly the entire MGnify90 database, a public resource cataloging metagenomic sequences.
Year(s) Of Engagement Activity 2022
URL https://ai.facebook.com/blog/protein-folding-esmfold-metagenomics/
 
Description Nature News "Meta just dropped 600+ million protein structure predictions, made using a large language model." 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Nature news coverage of Meta AI's new ESM Metagenomic Atlas: "AlphaFold's new rival? Meta AI predicts shape of 600 million proteins"
Year(s) Of Engagement Activity 2022
URL https://www.nature.com/articles/d41586-022-03539-1
 
Description Talk titled "Exploring the diversity of microbial proteins" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk by MGnify PI Dr Robert Finn at the 16th Brazilian Symposium on Bioinformatics BSB 2023 held in Brazil in June 2023. This talk was part of a special session during the conference.
Year(s) Of Engagement Activity 2023
URL https://bsb.sbc.org.br/2023/program/
 
Description Talk titled "Microbiome research at EMBL" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Talk by MGnify PI Dr Robert Finn at the annual meeting of EMBL with European Commission delegates. The talk provided updates on microbiome research being undertaken at EMBL and perspectives of harnessing this for applications that benefit society.
Year(s) Of Engagement Activity 2023
 
Description Talk titled "Mining microbial communities for novel functions and biodiversity" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk by MGnify PI at the Scientific Advisory Board (SAB) Meeting for the Latvian Biomedical Research and Study Centre (LBMC). Dr Finn is a SAB member and is at the forefront of establishing strategic connections with the LBMC as a wider European initiative. His talk was focused on identifying new approaches to collaborating with LBMC researchers and capacity building.
Year(s) Of Engagement Activity 2024