Machine Learning Algorithms for Actionable Knowledge Discovery in Synthetic Biology

Lead Research Organisation: Newcastle University
Department Name: Sch of Computing

Abstract

Synthetic biology applies engineering principles to design biological systems that do not exist in the natural world, so as to achieve desired properties within a given organism. This approach is of great value to society because it can be used to produce high-value materials such as fine chemicals, pharmaceuticals and bio-fuels, and to support applications such as bio-remediation. However, the inability to predict the behaviour of biological systems largely hinders progress in bioengineering applications. While domain knowledge alone fails to predict the effect of genotype changes on phenotype, the development of machine learning techniques and the large volumes of data generated by omics technologies have made such prediction possible. This project therefore envisions innovative computational methods to discover actionable knowledge that can be fed into synthetic biology experiments and exploited in industry. The benefits are twofold: (1) meaningful biological findings deduced from omics information; and (2) novel machine learning models capable of extracting high-level information from high-throughput datasets.

More specifically, the main biological tasks are identifying biomarkers for a particular biological state, typically one related to risk, and constructing a biological network whose nodes represent genes, proteins and metabolites and whose edges indicate complex relations, which can be functional or regulatory. For example, the first part of this project looks at how bacteria, which are often used as the host organism for designing genetic circuits in synthetic biology, adjust their transcriptomes to adapt to different environmental stimuli. This is important because bacteria almost always experience a diverse range of stresses while growing in different conditions, which may affect their own growth as well as the desired properties. To our knowledge, no previous research has characterised the genetic changes underpinning the different biological states an organism may exhibit in various conditions, nor have compensatory genetic circuits to relieve these stresses been explored. This research will learn the phenotypic landscape of bacteria growing in various conditions, identify the genes responding to general stress conditions (i.e. the biomarkers), predict the cell state from the expression levels of these biomarkers (i.e. the gene fingerprint) and ultimately answer the question of how bacteria adjust their transcriptomes to adapt to different conditions, in the form of a biological network. The same set of routines can be replicated to address similar questions for other purposes.

An automated pipeline of data mining techniques will be designed to extract the desired information from noisy, high-dimensional omics data. As this is entirely data-driven research intended to complement the detailed mechanistic understanding provided by domain knowledge, statistical tests and unsupervised learning algorithms, such as differential expression analysis, dimension reduction and clustering, will first be applied to tackle the data dimensionality and extract interesting patterns; supervised learning will then build on these results. Biomarker identification will be achieved by devising feature selection methods, embedded within classifiers, that are robust to small sample sizes. While biological networks can take many forms, the most widely studied are association networks, in which entities are only known to be functionally connected in some way. We aim to go beyond mere association to causation by exploiting the structure and various representations of the machine learning models used to describe the biological processes, preferably in a probabilistic way.
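
As a rough illustration of the unsupervised-then-supervised flow described above, the following Python sketch filters genes by variance, clusters samples with a two-cluster k-means, and ranks the retained genes by how well they separate the clusters. It is a minimal sketch under stated assumptions, not the project's implementation: the gene names, expression values, variance threshold and function names are all hypothetical, the variance filter stands in for differential expression analysis and dimension reduction, and the effect-size ranking is a simple filter-style stand-in for the classifier-embedded feature selection the project proposes.

```python
from statistics import mean, pvariance

# Hypothetical toy data: log-expression values, one row per sample, one column per gene.
GENES = ["geneA", "geneB", "geneC", "geneD"]
SAMPLES = [
    [1.0, 5.0, 2.0, 3.0],
    [1.2, 5.1, 2.1, 3.0],
    [0.9, 4.9, 1.9, 3.1],
    [4.0, 5.0, 0.5, 3.0],
    [4.2, 5.2, 0.4, 2.9],
    [3.9, 4.8, 0.6, 3.0],
]


def filter_by_variance(samples, min_var):
    """Return indices of genes whose variance across samples is at least min_var."""
    columns = list(zip(*samples))
    return [j for j, col in enumerate(columns) if pvariance(col) >= min_var]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, max_iter=100):
    """Plain k-means with k=2 and a deterministic start:
    the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min((0, 1), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels


def rank_genes(samples, labels, gene_idx, eps=1e-9):
    """Order genes by |difference of cluster means| / pooled within-cluster std.
    Assumes both clusters are non-empty."""
    scores = {}
    for j in gene_idx:
        g0 = [s[j] for s, lab in zip(samples, labels) if lab == 0]
        g1 = [s[j] for s, lab in zip(samples, labels) if lab == 1]
        pooled_sd = ((pvariance(g0) + pvariance(g1)) / 2) ** 0.5
        scores[j] = abs(mean(g0) - mean(g1)) / (pooled_sd + eps)
    return sorted(gene_idx, key=lambda j: scores[j], reverse=True)


kept = filter_by_variance(SAMPLES, min_var=0.1)       # drop near-constant genes
reduced = [[s[j] for j in kept] for s in SAMPLES]     # lower-dimensional view
labels = two_means(reduced)                           # candidate cellular states
ranking = [GENES[j] for j in rank_genes(SAMPLES, labels, kept)]
print(labels, ranking)
```

On real data each stage would likely be replaced by a dedicated method (for example a proper differential expression test and a regularised classifier); the sketch is only meant to show how the output of the unsupervised stage can supply the labels used by the gene-ranking stage.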

Publications

Description We identified a small set of genes whose expression levels are indicative of cellular growth states in Bacillus subtilis. To achieve this, we designed a machine learning model that is able to discover different cellular states and their corresponding biomarker genes from gene expression profiles measured under a diverse range of conditions.
Exploitation Route Based on the biomarker genes we identified, genetic circuits can be designed to report the cellular growth state of bacteria, for example whether the bacteria are undergoing stress while growing in an industrial biotechnology setting.
The machine learning model we devised can be applied to a wider range of datasets for different organisms.
Sectors Digital/Communication/Information Technologies (including Software), Manufacturing, including Industrial Biotechnology

 
Title Biomarker Recommendation System: prioritising a robust transcriptional biomarker panel for multi-stress sensing in bacteria
Description We designed computational models that together form a biomarker recommendation system, which ranks candidate biomarker panels based on complementary information from a machine learning model, a prior-known network and a data-inferred network.
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact We applied the models to identify a robust transcriptional biomarker gene that can discriminate a wide range of cellular stress states exhibited by B. subtilis and can generalise to an extended set of stress conditions in independent datasets.
URL https://github.com/neverbehym/biomaker-recommendation-system
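
The idea of ranking candidate panels from several complementary evidence sources can be sketched as a simple rank aggregation. The Python example below is only an illustrative assumption and is not taken from the linked repository: the panel names, scores, source labels and function names are hypothetical, and mean-rank (Borda-style) aggregation is one plausible way to combine the sources, not necessarily the one the system uses.

```python
# Hypothetical scores for three candidate panels from three evidence sources
# (higher is better). None of these values come from the project.
EVIDENCE = {
    "ml_model":         {"panel_1": 0.91, "panel_2": 0.88, "panel_3": 0.70},
    "prior_network":    {"panel_1": 0.40, "panel_2": 0.75, "panel_3": 0.60},
    "inferred_network": {"panel_1": 0.55, "panel_2": 0.80, "panel_3": 0.30},
}


def ranks(scores):
    """Map each candidate to its rank (1 = best) within one evidence source.
    Tied scores keep their input order, so ties are broken arbitrarily."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def recommend(evidence):
    """Order candidates by mean rank across sources (Borda-style aggregation).
    Assumes every source scores the same set of candidates."""
    per_source = [ranks(s) for s in evidence.values()]
    candidates = list(per_source[0])
    mean_rank = {
        c: sum(r[c] for r in per_source) / len(per_source) for c in candidates
    }
    # Sort by mean rank, then by name so the order is deterministic.
    order = sorted(candidates, key=lambda c: (mean_rank[c], c))
    return order, mean_rank


order, mean_rank = recommend(EVIDENCE)
print(order)
```

Aggregating ranks rather than raw scores avoids having to put the three sources on a common scale, at the cost of discarding how large the score differences are.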
 
Title Machine learning model to construct transcriptional landscape with condition-dependent transcriptomes
Description We designed a machine learning model for constructing a transcriptional landscape from a compendium of transcriptomes containing diverse gene expression profiles under different phenotypes. In the resulting transcriptional landscape, objects with similar transcriptional states are positioned close together. The model also identifies clusters of objects that can be linked to similar cellular states, and a reduced set of genes that can serve as biomarkers to indicate these cellular states. The pipeline of data mining techniques includes differential expression analysis, feature engineering, dimension reduction, clustering and feature selection.
Type Of Material Data analysis technique 
Year Produced 2020 
Provided To Others? Yes  
Impact We applied this model to identify gene biomarkers for different cellular growth states in a model organism, Bacillus subtilis. It can be applied to many other organisms if their condition-dependent transcriptomes are available.
URL https://github.com/neverbehym/transcriptional-biomarkers-subtilis