SENSE - Screening of ENvironmental SEquences to discover novel protein functions using informatics target selection and high-throughput validation

Lead Research Organisation: University of Cambridge
Department Name: Biochemistry

Abstract

As species diverge and new strains emerge, their proteins evolve through mutations in their sequences that alter functional properties. Very cheap and robust technologies have enabled the sequencing of genomes from many diverse bacterial communities, e.g. different soils, oceans and human body sites. Proteins encoded in the genomes of these bacteria have enabled adaptation to different environments, e.g. extremes of temperature. Although we possess extensive information about protein sequences (UniProtKB contains >100 million sequences, of which <0.5% are experimentally characterised), the new sequence data from metagenomes is ten-fold larger, providing a valuable treasure trove in which to hunt for proteins with novel functionality. Yet it is challenging to predict protein function from sequence alone, which is why we will combine finer-grained prediction with high-throughput experimental testing.

Handling this vast data is challenging but our project benefits from outputs already produced by the MGnify metagenomics analysis platform. We will introduce new strategies to classify this data and focus additional analyses on biomes containing greater functional diversity.

To unearth proteins whose functions are very different from any observed previously, we will classify related proteins into evolutionary families and then sub-classify them into functional families (called FunFams). RF and CO already have methods for doing this, but they need to be adapted to handle the vast metagenomic data. Aligning the sequences in a FunFam reveals residue positions that are highly conserved throughout evolution, indicating that they are important for function. Residue positions conserved in different ways between different FunFams are particularly interesting, as these are sites that change to enable different functions. The massive metagenomic sequence data will make these key functional determinants (FDs) easier to discover, as conservation patterns will be much clearer.
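
The idea of residue positions "conserved in different ways" between two FunFams can be illustrated with a minimal sketch. It is not the project's actual method: the function names, the 90% conservation threshold and the toy alignments are all assumptions made for illustration, and it presumes both families have already been aligned to a shared column numbering.

```python
from collections import Counter


def column_consensus(alignment, min_fraction=0.9):
    """Return {position: residue} for columns where a single non-gap
    residue occurs in at least min_fraction of the sequences."""
    n_seqs = len(alignment)
    consensus = {}
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / n_seqs >= min_fraction:
            consensus[pos] = residue
    return consensus


def candidate_fds(funfam_a, funfam_b, min_fraction=0.9):
    """Positions conserved within both families, but as different residues."""
    cons_a = column_consensus(funfam_a, min_fraction)
    cons_b = column_consensus(funfam_b, min_fraction)
    shared = sorted(cons_a.keys() & cons_b.keys())
    return {pos: (cons_a[pos], cons_b[pos])
            for pos in shared if cons_a[pos] != cons_b[pos]}


# Hypothetical toy alignments (same column numbering in both families).
funfam_a = ["GHSMG", "GHSLG", "GHSMG", "GHSQG"]
funfam_b = ["GDSMG", "GDSLG", "GDSAG", "GDSMG"]

fds = candidate_fds(funfam_a, funfam_b)
print(fds)  # {1: ('H', 'D')}
```

Here positions 0, 2 and 4 are conserved identically in both families, position 3 is variable in both, and only position 1 (H in one family, D in the other) is flagged as a candidate FD. With more sequences per family, as metagenomic data would provide, the per-column fractions become more reliable estimates.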

We will develop new tools to characterise chemical features of these FDs and to score differences in the properties of FDs between FunFams, in order to find new FunFams in metagenomes that are very likely to have novel functions. The outcomes of experimental tests will give further insights, e.g. on whether specificity or efficiency can be ascribed to FDs, making our searches more likely to predict function successfully. Two exemplar classes of biomolecules will be investigated: (1) alpha/beta hydrolases, proteins used for making drugs and laundry detergents; (2) bacteriocins, small antibacterial peptides with valuable applications in novel antibiotic discovery and food preservation. Bacteriocins are more complicated, as they are produced as part of a cluster of genes (and hence proteins) on the genome that is involved in processing the bacteriocin and rendering the bacteria immune to their own bacteriocin. We will adapt our FD-based methods to analyse key sequence differences across multiple proteins to identify novel bacteriocin functionality.
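
One simple way to score differences in the chemical properties of FDs between two FunFams is sketched below. The scoring scheme is an assumption chosen for illustration, not the tools the project will develop: it combines the Kyte-Doolittle hydropathy scale with a crude formal charge at neutral pH, weights the two terms equally, and uses a hypothetical function name and hypothetical input pairs.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude formal charge at neutral pH (assumption: histidine treated as neutral).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def fd_difference_score(fd_pairs):
    """Mean per-position difference between the FD residues of two families.

    fd_pairs maps alignment position -> (residue in family A, residue in family B).
    Each term is scaled to [0, 1]; the equal weighting is arbitrary.
    """
    if not fd_pairs:
        return 0.0
    total = 0.0
    for res_a, res_b in fd_pairs.values():
        # Hydropathy scale spans -4.5 to 4.5, so divide by 9.0.
        hydropathy_gap = abs(KD_HYDROPATHY[res_a] - KD_HYDROPATHY[res_b]) / 9.0
        # Charge difference spans 0 to 2, so divide by 2.0.
        charge_gap = abs(CHARGE.get(res_a, 0) - CHARGE.get(res_b, 0)) / 2.0
        total += hydropathy_gap + charge_gap
    return total / len(fd_pairs)


# Hypothetical FD comparisons between a known family and a candidate family.
conservative = {12: ("L", "I"), 47: ("D", "E")}  # chemically similar swaps
radical = {12: ("L", "R"), 47: ("D", "K")}       # polarity and charge changes

print(fd_difference_score(conservative) < fd_difference_score(radical))  # True
```

A candidate family whose FDs differ radically from those of every characterised family would rank higher under such a score, which is the kind of prioritisation that experimental results could then be used to recalibrate.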

Unlike previous analyses of enzyme superfamilies and bacteriocins, we will test our predictions of functional novelty through novel experimental platforms that can verify the predictions on an unprecedented scale. We will exploit a microfluidic technology that screens the function of >1 million proteins in minute droplets in one afternoon, and use it for functionally scanning the gene neighbourhood of predictions (after randomisation), e.g. for discovering mutants with better stability, specificity and evolvability. We will also test predictions using genes obtained via array-based gene assembly, at a cost 50-fold lower than currently possible. We will thus be experimentally exploring protein sequence space from metagenome communities at an unprecedented scale. We will deliver powerful new computational and experimental technologies, tested on biomolecules important for industry and human health but applicable to many protein families and secondary metabolite gene clusters.

Technical Summary

This project will enable very large-scale discovery of novel enzymes and bacteriocins from assembled metagenomics sequence data by developing new computational and experimental platforms. Significant technical improvements will emerge from cycles of computational/experimental work, as results from experimental validation will inform algorithm refinements. Our predictions will be concomitantly captured in widely used databases. The scale of experimental validation will be extremely large compared to conventional approaches, enabling increased sampling of sequence space to identify functional novelty.

To sample metagenomic sequences, we will exploit existing and new assemblies to extract sequence data from a variety of sampled biomes. We will make major adaptations to existing bioinformatic platforms to functionally sub-classify metagenomic sequences and apply them to two cases, alpha/beta-hydrolases and bacteriocins. We will develop algorithms characterising key functional determinant residues to score the likelihood of new families having substantially different functionality. Transferring this to bacteriocins will be more challenging, as these are often small peptides requiring accessory genes, which can be hard to detect and/or are functionally uncharacterised. Providing sensitive and accurate bacteriocin gene cluster identification and classification will require new methods to identify all components of the gene cluster through expanded homology and contextual models, prior to sub-family classification.

Key to our proposal will be the ability to perform very large-scale experimental validation of the bioinformatics predictions. This will be facilitated by novel gene synthesis platforms that can synthesise thousands of genes for screening, so as to test both the target sequences provided by bioinformatics predictions and their sequence neighbours. Furthermore, the use of high-throughput microfluidic droplet technology permits testing in a very cost-effective and timely way.

Planned Impact

This project will enable large-scale detection of functional biomolecules (proteins), the discovery of which impacts diverse spheres, including biotechnology and biocatalysis, development of new materials, food security and medical applications. It will impact on four BBSRC strategic areas related to metagenomes, synthetic biology, antibiotic resistance and data driven biology.

Firstly, we will analyse the available sequence data more efficiently using a combination of novel bioinformatic and experimental platforms allowing unprecedented throughput. Secondly, newly identified hydrolases and bacteriocins may be valorised as novel functional proteins for the benefit of academic and industry communities. Thirdly, we hope to have educational impact by training researchers in this project in a consortium that will traverse traditional boundaries between in silico biology, microengineering, high-throughput screening and classical enzymology.

The first objective will develop powerful new methods for exploring the vast sequence data being captured by metagenome initiatives. The Finn team manage EMBL-EBI's MGnify resource and have developed robust platforms for handling data on this scale and providing high quality sequence outputs. Leveraging RF and CO's extensive experience in family classification, we will develop new techniques to detect relatives with a high likelihood of functional novelty. Putative targets will be experimentally validated by novel experimental platforms that allow high-throughput at an unprecedented level and additionally probe neighbours in sequence space to detect more stable mutants and further expand knowledge of functional determinants. Importantly, there will be cycles of bioinformatic analysis and prediction followed by experimental validation.

Although we will develop the protocols using two important classes of biomolecules, i.e. enzymes and bacteriocins, the methods will be generic and publicly available to apply to other families expanded by metagenomic data. Our tools will be made widely available to the large community of groups analysing this data, increasing impact. RF and CO coordinate different ELIXIR communities and will have opportunities to publicise the work and promote adoption of these techniques.

The novel hydrolases and bacteriocins have commercial value and relevance to human health. Bacterial alpha/beta-hydrolases are widely used in many industries, including dairy, pharmaceutical, and laundry, as they are easy to cultivate, nontoxic, and eco-friendly. Bacteriocins have value in both food security and human health, e.g. producing strains can be applied in food to extend preservation times. Bacteriocins can also be added directly to foods as a preservative, incorporated into bioactive packaging, added to animal feed as an anti-pathogen additive to protect livestock against pathogen damage, or used to help balance the bacteria in the digestive tract of livestock and humans to reduce gastrointestinal diseases. They have the potential to replace existing antibiotics (especially those to which resistance has emerged) and have been indicated as novel anticancer drugs.

The interdisciplinary aspect of our project will provide additional training opportunities and distinguish the staff development in this project from more conventional training, expanding interdisciplinary skills in the UK. Thus, the postdocs in this project will receive training that positions them to obtain jobs in small or large biotechnology enterprises. This aspect of the project will be accompanied by interactions with the institutions' technology transfer offices (CE, EMBLEM, and UCL Business) and industrial stakeholders, so that information is initially protected and then shared and commercialised.
