Machine Learning Algorithms for Actionable Knowledge Discovery in Synthetic Biology

Lead Research Organisation: Newcastle University

Department Name: School of Computing

Abstract

Synthetic biology applies engineering principles to design biological systems that do not exist in the natural world, so as to achieve desired properties within a given organism. This approach is of great value to society since it can be used to produce high-value materials such as fine chemicals, pharmaceuticals and bio-fuels, and to support applications such as bio-remediation. However, the inability to predict the behaviour of biological systems largely hinders progress in bioengineering applications. While domain knowledge alone fails to predict the effect of genotype changes on phenotype, the development of machine learning techniques and the tremendous amounts of data generated by omics technologies have made such prediction possible. This project thus envisions innovative computational methods to discover actionable knowledge that can be fed into synthetic biology experiments and exploited in industry. The benefits are twofold: (1) meaningful biological findings deduced from omics information; (2) novel machine learning models capable of extracting high-level information from high-throughput datasets.

More specifically, the main biological tasks are identifying biomarkers for a particular biological state, typically one related to risk, and constructing the biological network whose nodes represent genes, proteins and metabolites and whose edges indicate complex relations, which can be functional or regulatory. For example, the first part of this project looks at how bacteria, which are often used as the host organism for genetic circuits in synthetic biology, adjust their transcriptome to adapt to different environmental stimuli. This is very important because bacteria almost always experience a diverse range of stresses while growing in different conditions, which may affect their own growth as well as the desired properties. To our knowledge, no previous research has characterised the genetic changes underpinning the different biological states an organism may exhibit in various conditions, nor has a compensatory genetic circuit to relieve the stresses been explored. This research will learn the phenotypic landscape of bacteria growing in various conditions, identify the genes responding to general stress conditions (i.e. the biomarkers), predict the cell state from the expression of these biomarkers (i.e. the gene fingerprint) and ultimately answer the question of how bacteria adjust their transcriptome to adapt to different conditions, in the form of a biological network. This work can be extended to similar questions for different purposes, since the same set of routines can be reused.

An automatic pipeline of data mining techniques will be designed to extract the desired information from noisy, high-dimensional omics data. As entirely data-driven research that complements the detailed mechanistic understanding offered by domain knowledge, the pipeline will first apply statistical tests and unsupervised learning algorithms, such as differential expression analysis, dimension reduction and clustering, to tackle the data dimensionality and extract interesting data patterns; supervised learning will then build on these patterns. Biomarker identification will be achieved by devising feature selection methods, embedded in a classifier, that are robust to small sample sizes. While biological networks can be much more flexible, the most widely studied ones are association networks, in which entities are only known to be functionally connected in some way. We aim to go beyond mere association to causation by exploiting the structure and the various representations of the machine learning models used to describe the biological processes, preferably in a probabilistic way.
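
As a rough illustration of the first, statistical-test stage of such a pipeline, the sketch below ranks genes by Welch's t-statistic between a control and a stress condition. It is a minimal sketch only, not the project's actual method: the function names (welch_t, rank_genes), the gene names and the expression values are all hypothetical and made up for illustration, and a real differential expression analysis would also need normalisation and multiple-testing correction.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_genes(control, stress):
    """Rank genes by absolute Welch t-statistic (stress vs control), largest first.

    control, stress: dicts mapping a gene name to a list of expression values.
    """
    scores = {gene: welch_t(stress[gene], control[gene]) for gene in control}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy log-expression values, invented for illustration only.
control = {
    "geneA": [5.0, 5.2, 4.9, 5.1],
    "geneB": [7.0, 7.1, 6.9, 7.0],
    "geneC": [3.0, 3.5, 2.8, 3.2],
}
stress = {
    "geneA": [8.0, 8.3, 7.9, 8.1],
    "geneB": [7.0, 6.9, 7.1, 7.0],
    "geneC": [3.1, 3.4, 2.9, 3.3],
}

ranking = rank_genes(control, stress)
for gene, t in ranking:
    print(f"{gene}: t = {t:.2f}")
```

Genes at the top of such a ranking would be candidates to pass on to the later dimension reduction, clustering and feature selection stages; the threshold for keeping a gene is a design choice this sketch leaves open.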

Student:

Yiming Huang

Period of Study:

Oct 18 - Jul 22

Funder:

NERC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2132169

Research Topic:

Unclassified

Organisations

Newcastle University (Lead Research Organisation)

People	ORCID iD
Jaume Bacardit (Primary Supervisor)
Yiming Huang (Student)

Key Findings

Description	We identified a small set of genes whose expression levels are indicative of cellular growth states in Bacillus subtilis. To achieve this, we designed a machine learning model that is able to discover different cellular states and the corresponding biomarker genes from gene expression profiles measured under a diverse range of conditions.
Exploitation Route	Based on the biomarker genes we identified, genetic circuits can be designed to report the cellular growth state of bacteria, for example whether the bacteria are undergoing stress while growing in an industrial biotechnology setting. The machine learning model we devised can be applied to a wider range of datasets for different organisms.
Sectors	Digital/Communication/Information Technologies (including Software),Manufacturing, including Industrial Biotechnology

Research Databases and Models

Title	Biomarker Recommendation System: prioritising a robust transcriptional biomarker panel for multi-stress sensing in bacteria
Description	We designed computational models that together form a biomarker recommendation system, which ranks candidate biomarker panels using complementary information from a machine learning model, a prior-knowledge network and a data-inferred network.
Type Of Material	Computer model/algorithm
Year Produced	2022
Provided To Others?	Yes
Impact	We applied the models to identify a robust transcriptional biomarker gene that can discriminate a wide range of cellular stress states in B. subtilis and can generalise to an extended set of stress conditions in independent datasets.
URL	https://github.com/neverbehym/biomaker-recommendation-system


Title	Machine learning model to construct transcriptional landscape with condition-dependent transcriptomes
Description	We designed a machine learning model for constructing a transcriptional landscape from a compendium of transcriptomes containing diverse gene expression profiles under different phenotypes. In the resulting transcriptional landscape, objects with similar transcriptional states are positioned close together. The model also identifies clusters of objects, which can be linked to similar cellular states, and a reduced set of genes that can serve as biomarkers to indicate these cellular states. The pipeline of data mining techniques includes differential expression analysis, feature engineering, dimension reduction, clustering and feature selection.
Type Of Material	Data analysis technique
Year Produced	2020
Provided To Others?	Yes
Impact	We applied this model to identify gene biomarkers for different cellular growth states in a model organism, Bacillus subtilis. It can be applied to many other organisms if their condition-dependent transcriptomes are available.
URL	https://github.com/neverbehym/transcriptional-biomarkers-subtilis
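
To make the clustering step of such a landscape concrete, the sketch below groups samples with similar expression profiles using plain k-means. It is an illustrative stand-in under stated assumptions, not the code in the repository above: the record does not say which clustering algorithm was used, and the kmeans function, its deterministic farthest-point initialisation and the two-gene toy samples are all hypothetical.

```python
from math import dist  # requires Python 3.8+


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns (labels, centroids).

    Initialisation takes the first point, then repeatedly the point farthest
    from the centroids chosen so far, so the sketch is reproducible.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return labels, centroids


# Each sample is a (gene1, gene2) expression pair; values are invented.
samples = [
    (1.0, 1.2), (0.9, 1.0), (1.1, 0.8),  # e.g. one growth state
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),  # e.g. another growth state
]

labels, centroids = kmeans(samples, 2)
print(labels)
```

In a real compendium each sample would have thousands of genes, so the dimension reduction and feature selection steps named in the description would normally come before or alongside clustering.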
