SENSE - Screening of ENvironmental SEquences to discover novel protein functions using informatics target selection and high-throughput validation

Lead Research Organisation: European Bioinformatics Institute

Department Name: Genome Assembly and Annotation

Technical Summary

This project will enable very large-scale discovery of novel enzymes and bacteriocins from assembled metagenomics sequence data by developing new computational and experimental platforms. Significant technical improvements will emerge from cycles of computational/experimental work, as results from experimental validation will inform algorithm refinements. Our predictions will be concomitantly captured in widely used databases. The scale of experimental validation will be extremely large compared to conventional approaches, enabling increased sampling of sequence space to identify functional novelty.
To sample metagenomic sequences, we will exploit existing and new assemblies to extract sequence data from a variety of sampled biomes. We will make major adaptations to existing bioinformatic platforms to functionally sub-classify metagenomic sequences and apply them to two cases, alpha/beta-hydrolases and bacteriocins. We will develop algorithms characterising key functional determinant residues to score the likelihood of new families having substantially different functionality. Transferring this to bacteriocins will be more challenging as these are often small peptides requiring accessory genes, which can be hard to detect and/or are functionally uncharacterised. Providing sensitive and accurate bacteriocin gene cluster identification and classification will require new methods to identify all components of the gene cluster through expanded homology and contextual models, prior to sub-family classification.
Key to our proposal will be the ability to perform very large-scale experimental validation of the bioinformatics predictions. This will be facilitated by using novel gene synthesis platforms that can synthesise thousands of genes for screening, so as to test both the target sequences provided by bioinformatics predictions and their sequence neighbours. Furthermore, the use of high-throughput microfluidic droplet technology permits testing in a very cost-effective and timely way.
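
As a rough illustration of scoring candidates by their key functional determinant residues, the sketch below compares a candidate with the consensus of an aligned reference family at a few chosen columns. It is a minimal sketch under stated assumptions, not the project's algorithm: the function names, the toy alignment and the key positions are all hypothetical, and sequences are assumed to be pre-aligned to equal length.

```python
from collections import Counter


def consensus_at(alignment, positions):
    """Most common residue at each key column of an aligned reference family."""
    return {
        p: Counter(seq[p] for seq in alignment).most_common(1)[0][0]
        for p in positions
    }


def novelty_score(candidate, alignment, positions):
    """Fraction of key positions where the candidate differs from the consensus.

    A higher value suggests (but does not prove) altered functionality.
    """
    consensus = consensus_at(alignment, positions)
    differing = sum(1 for p in positions if candidate[p] != consensus[p])
    return differing / len(positions)


# Hypothetical toy data: three aligned family members and three key columns.
family = ["GHSMGG", "GHSLGG", "GHSMGA"]
key_positions = [1, 2, 5]

score = novelty_score("GDSMGA", family, key_positions)
print(f"novelty score: {score:.2f}")
```

A real scoring scheme would weight positions by conservation and structural context. The unweighted fraction here is only meant to make the idea concrete.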

Planned Impact

This project will enable large-scale detection of functional biomolecules (proteins), the discovery of which impacts diverse spheres, including biotechnology and biocatalysis, development of new materials, food security and medical applications. It will impact on four BBSRC strategic areas related to metagenomes, synthetic biology, antibiotic resistance and data driven biology.

Firstly, we will analyse the available sequence data more efficiently using a combination of novel bioinformatic and experimental platforms allowing unprecedented throughput. Secondly, newly identified hydrolases and bacteriocins may be valorised as novel functional proteins for the benefit of academic and industry communities. Thirdly, we hope to have educational impact by training researchers in this project in a consortium that will traverse traditional boundaries between in silico biology, microengineering, high-throughput screening and classical enzymology.

The first objective will develop powerful new methods for exploring the vast sequence data being captured by metagenome initiatives. The Finn team manage EMBL-EBI's MGnify resource and have developed robust platforms for handling data on this scale and providing high-quality sequence outputs. Leveraging RF and CO's extensive experience in family classification, we will develop new techniques to detect relatives with a high likelihood of functional novelty. Putative targets will be experimentally validated by novel experimental platforms that allow throughput at an unprecedented level and additionally probe neighbours in sequence space to detect more stable mutants and further expand knowledge of functional determinants. Importantly, there will be cycles of bioinformatic analysis and prediction followed by experimental validation.

Although we will develop the protocols using two important classes of biomolecules, i.e. enzymes and bacteriocins, the methods will be generic and publicly available to apply to other families expanded by metagenomic data. Our tools will be made widely available to the large community of groups analysing this data, increasing impact. RF and CO coordinate different ELIXIR communities and will have opportunities to publicise the work and promote adoption of these techniques.

The novel hydrolases and bacteriocins have commercial value and relevance to human health. Bacterial alpha/beta-hydrolases are widely used in many industries, including dairy, pharmaceutical, and laundry, as they are easy to cultivate, nontoxic, and eco-friendly. Bacteriocins have value in both food security and human health, e.g. producing strains can be applied in food to extend preservation times. Bacteriocins can also be added directly to foods as a preservative, incorporated into bioactive packaging, added to animal feed as an anti-pathogen additive to protect livestock against pathogen damage, or used to help balance the bacteria in the digestive tract of livestock and humans to reduce gastrointestinal diseases. They have the potential to replace existing antibiotics (especially those to which resistance has developed) and have been indicated as novel anticancer drugs.

The interdisciplinary aspect of our project will provide additional training opportunities and distinguish the staff development in this project from more conventional training, expanding interdisciplinary skills in the UK. Thus, the postdocs in this project will receive training that positions them to obtain jobs in small or large biotechnology enterprises. This aspect of the project will be accompanied by interactions with the institutions' technology transfer offices (CE, EMBLEM, and UCL Business) and industrial stakeholders, so that information is initially protected and then shared and commercialised.

Funded Value:

£216,677

Funded Period:

May 20 - Apr 23

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/T000902/1

Principal Investigator:

Robert Finn

Research Subject:

Bioengineering (18%)

Biomolecules & biochemistry (18%)

Tools, technologies & methods (54%)

Research Topic:

Bioinformatics (24%)

Chemical Biology (18%)

Environmental Informatics (12%)

Protein engineering (18%)

Tools for the biosciences (18%)

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Robert Finn (Principal Investigator)

Publications

Richardson L (2023) MGnify: the microbiome sequence data analysis resource in 2023. in Nucleic acids research

Key Findings
Policy Influence
Further Funding
Research Databases and Models
Engagement Activities


Description	We have explored the diversity of the alpha/beta-hydrolase superfamily present in metagenomic samples, with a specific focus on PETases, the enzymes responsible for the degradation of polyethylene terephthalate (PET plastic). We have found that metagenomic samples are substantially enriched for this enzyme class, especially in marine samples. To enrich the first tranche, and in response to our collaborators' request, we have assembled and analysed datasets from extreme temperatures (hot and cold) for PETases. These provided a supplementary set of enzymes that interleaved with the previous set. These results are allowing us to understand both immediate changes around the enzyme active site and allosteric interactions that may also impact enzyme activity and substrate affinity. To complement this activity, we are improving the infrastructure to select subsets of the MGnify protein database and to combine these with both the sample metadata and the genomic context in which a protein may be found. The size of the protein database, i.e. >2.4 billion non-redundant sequences, has presented major technical challenges in how it is stored and accessed. We have now overcome these obstacles and have released an updated version of the database which additionally provides Pfam annotations on all sequences. These are supplemented by annotations provided by ProtENN2, a sequence embedding approach for functional annotation developed by Google AI. The database is made available as flat files, but the underlying relational database contains the contextual information for every protein sequence, which allows us to trace genomic context and sample information. This database was the foundation for the creation of ESMAtlas, a collection of over 650 million protein structural models, which were produced using the ESMFold algorithm (Lin Z. et al. Science 2023).
Coupled with MGnify, ESMAtlas is amplifying the use of our proteins across different areas of science, and across the academic and industrial sectors. It is not possible to estimate the usage of our data at this stage because the MGnify proteins are in ESMAtlas, which has in turn been propagated to widely used structural comparison tools, such as FoldSeek, where MGnify-ESM30 is one of the default databases alongside AlphaFold and PDB. Finally, we have provided our predictions for ribosomally synthesised and post-translationally modified peptides (RiPPs) to our collaborators to facilitate the evaluation of bacterial growth inhibition by these novel RiPPs. Assay results have demonstrated that we have indeed correctly identified novel RiPPs, since activity was observed against both Staphylococcus aureus and Escherichia coli, with the latter occurring at levels that are useful for applications in the biotechnology and/or pharmaceutical industries.
Exploitation Route	Metagenomics is providing unprecedented access to the 99% of microbes that are yet to be experimentally isolated and cultured. Consequently, such datasets are providing a wealth of new sequences and novel insights into the ability of microbes to exploit different niches. However, mining metagenomics data is complex due to the magnitude of the data volume and the fundamental need for specialist computational pipelines to assemble and analyse the data. This project has paved the way to increase the accessibility of this data by allowing a multidisciplinary set of scientists to gather collections of sequences to understand enzyme evolution, and how this data can be utilised for rational enzyme design. Natural products are a growing area of research and metagenomics represents an important new source of potential products. The scale of data housed in MGnify indicates that tens of thousands of novel natural products are awaiting discovery and experimental analysis. We have focused on RiPPs in the SENSE project, as they are typically produced by small genomic regions that can be relatively easily synthesised and screened using high-throughput techniques. RiPPs have been widely used in a variety of different industries, from food preservation to healthcare. They also tend to have a fairly narrow spectrum of activity, and understanding their mode of action can have major impacts on overcoming societal threats, such as antimicrobial resistance.
Sectors	Aerospace, Defence and Marine; Agriculture, Food and Drink; Chemicals; Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology


Description	Member, UKRI Knowledge Transfer Network (KTN) Microbiome Innovation Network
Geographic Reach	National
Policy Influence Type	Participation in a guidance/advisory committee


Description	BlueRemediomics: Harnessing the marine microbiome for novel sustainable biogenics and ecosystem services
Amount	€ 7,649,827 (EUR)
Funding ID	101082304
Organisation	European Commission
Sector	Public
Country	Belgium
Start	12/2022
End	11/2026


Description	Novel Plastizymes: discovery and improvement of plastic-degrading enzymes by integrated cycles of computational and experimental approaches
Amount	£3,024,438 (GBP)
Funding ID	BB/X00306X/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	03/2023
End	03/2028


Title	MGnify protein database
Description	MGnify protein database containing 2.4 billion non-redundant sequences. This latest release is more than double the size of the previous release of 1.1 billion sequences. Sequences are clustered at 90% coverage and identity to generate 620 million clusters. ProtENN2 annotations by Google AI are included in this release.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	Information relating to the biome and genomic context for the protein sequences provided here is crucial for downstream analyses and applications for proteins of interest. Moreover, the majority of proteins identified in metagenomics studies and included in this release are not covered by other major protein resources, thus representing a significant novel source of proteins, especially from uncultured organisms, with a wide range of applications.
URL	http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/README.txt
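
To illustrate what clustering "at 90% coverage and identity" means, the sketch below runs a greedy, longest-first clustering over a few toy sequences. It is an illustrative sketch only: the record does not say which tool produced the MGnify clusters, and the identity and coverage measures here are crude stand-ins (based on Python's `difflib`) for a real sequence alignment. The function names and example sequences are hypothetical, and sequences are assumed to be non-empty.

```python
from difflib import SequenceMatcher


def identity_and_coverage(a, b):
    """Crude proxies: matched residues / shorter length, and shorter / longer length."""
    short, long_ = sorted((a, b), key=len)
    matcher = SequenceMatcher(None, short, long_, autojunk=False)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return matches / len(short), len(short) / len(long_)


def greedy_cluster(sequences, min_identity=0.9, min_coverage=0.9):
    """Assign each sequence to the first representative passing both thresholds."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            ident, cov = identity_and_coverage(seq, rep)
            if ident >= min_identity and cov >= min_coverage:
                clusters[rep].append(seq)
                break
        else:
            # No representative was close enough, so start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Hypothetical toy sequences: two near-identical 20-mers and one unrelated 14-mer.
toy = ["MKTAYIAKQRQISFVKSHFS", "MKTAYIAKQRQISFVKSHFA", "MSLLPEDWGNAVRT"]
for rep, members in greedy_cluster(toy).items():
    print(rep, len(members))
```

At the scale of billions of sequences, all-against-representative comparison like this is not feasible; production tools rely on k-mer prefiltering and other heuristics.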


Title	Metagenomic non-redundant protein database
Description	Database of protein sequences produced from assembly of metagenomic datasets.
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes
Impact	This database initially supported the discovery of novel enzymes by an SME biotech company (BioCatalysts) as part of an InnovateUK BBSRC grant. Since this initial dataset was produced, it has grown to over 2.4 billion non-redundant sequences. This has been made available as a Google BigQuery resource to make it accessible to the research community. Another major impact has been the role of this MGnify dataset in the production of AlphaFold and ESMFold.
URL	https://www.ebi.ac.uk/metagenomics/sequence-search/search/phmmer


Description	"What metagenomic data can tell us about healing the planet" talk at the Life Science Across the Globe - talks on science and culture
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Policymakers/politicians
Results and Impact	Talk by PI Rob Finn on MGnify at the Learning from the planet to heal the planet: Microbial Ecosystems online seminar series (hosted by EMBL and HHMI Janelia Research Campus).
Year(s) Of Engagement Activity	2022
URL	https://www.youtube.com/watch?v=Hc89Rrs_ykY&ab_channel=HHMI%27sJaneliaResearchCampus


Description	BIOPROSP_23 Keynote talk "Genome Resolved Metagenomics - Understanding the potential of marine microbial communities for novel product discovery"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Keynote talk by PI Rob Finn at the BIOPROSP-23 conference held at Tromsø, Norway. BIOPROSP is the international biennial scientific conference on marine biotechnology, which aims to translate basic research into applied research with industrial application.
Year(s) Of Engagement Activity	2023
URL	https://www.tekna.no/en/events/bioprosp_23-42323/Program/?info=156913


Description	Business Insider Interview titled "Scientists are racing to explore more of the ocean hoping to discover medical breakthroughs and address the climate crisis"
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Business Insider interview with MGnify PI Dr Robert Finn on using MGnify outputs for propelling novel Biodiscovery in the marine domain. Dr Finn coordinates an EU HORIZON initiative BlueRemediomics that is centred around the MGnify microbiome resource data and technology. The interview was reposted by Business Insider India https://www.businessinsider.in/science/news/scientists-are-racing-to-explore-more-of-the-ocean-hoping-to-discover-medical-breakthroughs-and-address-the-climate-crisis/articleshow/102210340.cms
Year(s) Of Engagement Activity	2023
URL	https://www.businessinsider.com/marine-biodiscovery-could-unlock-answers-health-climate-crises-2023-...


Description	EMBL-EBI News "2.4 billion sequences now available in the latest MGnify protein database release"
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Newsletter announcing the new release of the MGnify protein database, which contains 2.4 billion non-redundant sequences, including new annotations provided by Google AI.
Year(s) Of Engagement Activity	2022
URL	https://www.ebi.ac.uk/about/news/updates-from-data-resources/2-4-billion-sequences-now-available-in-...


Description	ISME 18 Roundtable "What does it take to be FAIR?" by the National Microbiome Data Collaborative
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Roundtable organised by the National Microbiome Data Collaborative at ISME18. PI Rob Finn was an expert panelist on the roundtable. Discussions covered attitude shifts required for microbiome data sharing, what constitutes good metadata and other points.
Year(s) Of Engagement Activity	2022
URL	https://twitter.com/MicrobiomeData/status/1559210668485640194


Description	Invited talk titled "MGnify - a hub for the archiving, analysis, and discovery of microbiome derived sequence data"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Invited talk by PI Dr Robert Finn titled "MGnify - a hub for the archiving, analysis, and discovery of microbiome derived sequence data" at the 2023 RoBioinfo Conference organised by the Romanian Society of Bioinformatics (RSBI) in Bucharest, May 2023. The 2023 conference focused on two main themes, namely human genomics and biodiversity-microbiome, and Dr Finn's talk was featured in the latter.
Year(s) Of Engagement Activity	2023
URL	https://rsbi.ro/evenimente/2023-robioinfo-conference/


Description	MGnify public engagement at the Love Nature Festival - February half-term activities
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	Public engagement activity at the Love Nature Festival half-term event organised by the Ipswich Museum at the Christchurch Mansions. MGnify had a stall where they presented recent research work on plastic-degrading enzymes from bacteria.
Year(s) Of Engagement Activity	2023
URL	https://twitter.com/MGnifyDB/status/1625858534637436930


Description	Meta AI Research blogpost "ESM Metagenomic Atlas: The first view of the 'dark matter' of the protein universe"
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Meta AI blogpost describing the release of the ESM Metagenomic Atlas of 600+ million proteins, with predictions for nearly the entire MGnify90 database, a public resource cataloguing metagenomic sequences.
Year(s) Of Engagement Activity	2022
URL	https://ai.facebook.com/blog/protein-folding-esmfold-metagenomics/


Description	Nature News "Meta just dropped 600+ million protein structure predictions, made using a large language model."
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Nature coverage of Meta AI's new ESM Metagenomic Atlas: "AlphaFold's new rival? Meta AI predicts shape of 600 million proteins"
Year(s) Of Engagement Activity	2022
URL	https://www.nature.com/articles/d41586-022-03539-1


Description	Talk titled "Exploring the diversity of microbial proteins"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk by MGnify PI Dr Robert Finn at the 16th Brazilian Symposium on Bioinformatics BSB 2023 held in Brazil in June 2023. This talk was part of a special session during the conference.
Year(s) Of Engagement Activity	2023
URL	https://bsb.sbc.org.br/2023/program/


Description	Talk titled "Microbiome research at EMBL"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Policymakers/politicians
Results and Impact	Talk by MGnify PI Dr Robert Finn at the annual meeting of EMBL with European Commission delegates. The talk provided updates on microbiome research being undertaken at EMBL and perspectives of harnessing this for applications that benefit society.
Year(s) Of Engagement Activity	2023


Description	Talk titled "Mining microbial communities for novel functions and biodiversity"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk by MGnify PI at the Scientific Advisory Board (SAB) Meeting for the Latvian Biomedical Research and Study Centre (LBMC). Dr Finn is a SAB member and is at the forefront of establishing strategic connections with the LBMC as a wider European initiative. His talk was focused on identifying new approaches to collaborating with LBMC researchers and capacity building.
Year(s) Of Engagement Activity	2024