Scaling the next generation of protein sequence searches to enable rapid discovery of novel actives

Lead Research Organisation: University of Cambridge
Department Name: Medicine

Abstract

Metagenomics investigates the collective genetic material from microorganisms within a specific environment. The advent of modern sequencing technologies has enabled sufficiently deep sequencing of microbial communities to recover large contiguous sequences (contigs). Using in silico approaches, contigs can be binned into sets originating from common species, which can yield high-quality metagenome-assembled genomes (MAGs). The EMBL-EBI team behind MGnify are assembling metagenomes at scale, identifying >5,000 novel MAGs, alongside millions of contigs encoding billions of proteins. After redundancy removal, this growing metagenomics database (MGDB) already comprises >850m protein sequences (grouping into ~280m clusters), with <1% of sequences in common with UniProtKB.

In work with the SME BioCatalysts, a significantly smaller version of this database was mined for commercial benefit, identifying novel enzymes for the food industry. MGDB represents an invaluable opportunity for Unilever in the search for anti-microbial (AMP) actives (e.g. for preservation) or host-targeting effectors. However, major technical challenges remain in providing interactive searches, presenting results and linking metadata so that informed target selections can be made.

Heuristics and in-memory solutions have helped the HMMER webserver achieve interactive search speeds; however, the MGDB is over 200GB, which is too large for in-memory solutions to be tractable. Furthermore, there is a data presentation issue, as searches against the MGDB can return tens of thousands of matches. Identifying the most relevant result, which may not be the top hit, requires the development of a complex search infrastructure with multiple facets for filtering.

Objectives: We will develop technology to expose and interrogate the MGDB through EMBL-EBI and use it to identify new actives for BPC in real time. This will extend beyond sequence similarity searches by (i) linking searches to find multiple genes pertaining to an operon or gene cluster; (ii) enabling filtering of search results based on original sample metadata; (iii) enabling retrieval of source contigs for further analysis; (iv) delivering real-time search speeds through technical innovation. This tool-set will be applied in two Unilever case studies. The search infrastructure will be delivered to scientists via the TRON/BD4BS Bio-platform, which will utilise the MGnify API and MGDB alongside internal data and analysis tools.
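To make objectives (i) and (ii) concrete, the following is a minimal sketch of how search hits might be filtered on sample metadata and then linked by source contig to find candidate gene clusters. It is illustrative only and is not the project's implementation: the `Hit` record, its fields (`query`, `contig`, `biome`, `evalue`) and both helper functions are hypothetical names introduced here, and the hits are made-up toy data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search match (hypothetical record; all field names are assumptions)."""
    query: str     # name of the query protein
    contig: str    # identifier of the contig encoding the matched protein
    biome: str     # sample metadata attached to the contig
    evalue: float  # significance score reported by the search


def filter_by_biome(hits, biome):
    """Keep only hits whose source sample comes from the given biome."""
    return [h for h in hits if h.biome == biome]


def contigs_with_all_queries(hits, queries):
    """Return the contigs on which every query has at least one hit."""
    seen = defaultdict(set)
    for h in hits:
        seen[h.contig].add(h.query)
    wanted = set(queries)
    return sorted(c for c, qs in seen.items() if wanted <= qs)


# Toy data: two query genes searched against three contigs.
hits = [
    Hit("geneA", "contig_1", "soil", 1e-30),
    Hit("geneB", "contig_1", "soil", 1e-12),
    Hit("geneA", "contig_2", "marine", 1e-40),
    Hit("geneB", "contig_3", "soil", 1e-8),
]

soil_hits = filter_by_biome(hits, "soil")
print(contigs_with_all_queries(soil_hits, ["geneA", "geneB"]))  # ['contig_1']
```

Grouping by contig is only a coarse proxy for an operon or gene cluster; a fuller treatment would presumably also consider gene order and distance along the contig.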

Strategic outcomes:
- As MGDB is the only existing resource of its kind, Unilever can exploit it in its entirety at an early stage.
- MGDB includes eukaryotes, which represent a vast untapped source of novel sequences.
- Unilever-driven use-cases will ensure that the search infrastructure meets FMCG business needs.
- Public datasets of interest to Unilever will be prioritised for representation in the MGnify MGDB.
- Web services will expose data types of interest to Unilever.
- EMBL-EBI deployment will be available for all researchers.
- Portability & scalability for academia and industry, allowing search of similar in-house datasets.
- Creation of novel solutions for other resources, such as UniProtKB, to provide searches as their data volumes expand.

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/T508391/1                                   14/10/2019  13/04/2024
2290636            Studentship   BB/T508391/1  14/10/2019  13/04/2024  Felix Langer