Facilitating Deep Learning with Domain-Specific Knowledge

Lead Research Organisation: Wellcome Sanger Institute

Department Name: Cancer Genetics and Genomics

Abstract

Machine Learning (ML) and Artificial Intelligence (AI) are exciting developments in large-scale data analysis. Deep learning is a set of ML techniques that use layered neural networks to learn structure and patterns in complex data. These techniques are particularly suited to analysis problems where we do not know the exact structure of the data a priori, and thus have to infer the interesting features from the data themselves. This ability to learn complex problems makes deep learning a good conceptual fit to modern biology and healthcare challenges. As neural networks learn structure from data during a training phase, the quality of the result relies heavily on the quality of the training data. As with any ML task, there is a constant danger of overfitting, in which specific (and often irrelevant) features of the training set are learned as discriminatory. Overfitting is especially problematic when the training set is too small, or is not representative of the real-world problem we are attempting to address.

This project will allow us to apply recent advances in deep learning to our local biological and health-related datasets. I will therefore aim to:
- Build novel software to integrate domain knowledge into ML using publicly available data; and
- Apply this software to research data in Leeds to derive new insights and treatment possibilities.

I will use microarray and RNAseq data from the public domain and from four ongoing collaborations within the University of Leeds. These collaborative datasets fall into two categories: two transcriptomic datasets from relatively homogeneous diseases of the eye, and two large, genetically heterogeneous cancer datasets. These data will allow me to investigate the general utility of tailored topologies in deep neural networks. By including multiple datasets early on, I aim to explore the diverse range of possible embedding strategies, to determine which are most appropriate for each data type, and to avoid developing a technique that is not generally applicable.

Technical Summary

Although deep learning is a good conceptual fit, current methods do not perform particularly well on biological data. One reason for this is the paucity of samples: we simply do not have enough data to train a neural network of sufficient size or complexity to learn the intricacies of our input datasets. Although we have relatively few biological samples, we have a much richer understanding of the underlying structure of the observations, as we have extensive prior knowledge about the interactions between genes. I plan to use this rich prior knowledge to enhance the efficiency of the training process in two ways:

1) Appropriate data embedding
A common technique in ML is to transform the input data into a form that is easier for the neural network to learn from (embedding). For example, high-dimensional data could be reduced to a lower-dimensional form using principal component analysis (PCA) before use, thus removing unnecessary noise before training. I propose to develop similar methods, using the large body of expertise in transcriptomic data analysis in Leeds to define suitable embedding strategies for preprocessing the input datasets.
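
The PCA example mentioned above can be illustrated with a minimal sketch. It is not the embedding strategy the project proposes to develop, only the generic dimensionality-reduction step the text uses as an analogy. The function name `pca_embed`, the matrix sizes and the random toy data are assumptions made for illustration; real input would be a samples-by-genes expression matrix.

```python
import numpy as np


def pca_embed(expression, n_components):
    """Project a samples x genes matrix onto its top principal components.

    Returns the embedded data (samples x n_components) and the variance
    captured by each retained component.
    """
    # Centre each gene (column) so that components describe variation
    # around the mean expression level.
    centred = expression - expression.mean(axis=0)
    # SVD of the centred matrix: rows of vt are the principal axes,
    # ordered from largest to smallest singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]
    variances = (singular_values ** 2) / (expression.shape[0] - 1)
    return centred @ components.T, variances[:n_components]


# Hypothetical toy data: 20 samples x 50 genes of random values.
rng = np.random.default_rng(0)
expression = rng.normal(size=(20, 50))

embedded, variances = pca_embed(expression, n_components=5)
print(embedded.shape)
```

The embedded matrix, rather than the raw expression values, would then be passed to the network, so the first layer sees 5 inputs per sample instead of 50.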

2) Biologically appropriate network topologies
A standard neural network consists of several hidden layers of neurones connected (both inside and between layers) in a regular manner. I propose a similar approach to gene expression analysis, in which we use the vast amount of prior knowledge available in online repositories (such as the Gene Ontology, KEGG and STRING) to build intermediate neural network layers that model the known biological pathways and interactions in the data. This way, the ML approach does not need to learn these already-known interactions.
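
One possible way to realise such a layer is to mask a dense weight matrix so that each hidden unit stands for one pathway and receives input only from its member genes. The sketch below assumes this masking approach; the source does not specify the implementation. The gene and pathway names are placeholders, and `build_mask` and `pathway_layer` are hypothetical helpers. Real membership lists would be drawn from repositories such as the Gene Ontology, KEGG or STRING.

```python
import numpy as np

# Placeholder annotation: which genes belong to which pathway.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
pathways = {
    "pathway_1": ["gene_a", "gene_b"],
    "pathway_2": ["gene_b", "gene_c", "gene_d"],
}


def build_mask(genes, pathways):
    """Return a genes x pathways 0/1 matrix of known memberships."""
    index = {gene: i for i, gene in enumerate(genes)}
    mask = np.zeros((len(genes), len(pathways)))
    for j, members in enumerate(pathways.values()):
        for gene in members:
            mask[index[gene], j] = 1.0
    return mask


def pathway_layer(x, weights, bias, mask):
    """Forward pass of a dense layer restricted to known connections.

    Multiplying the weights by the mask zeroes every gene-to-pathway
    connection that is absent from the annotation; ReLU is then applied.
    """
    return np.maximum(0.0, x @ (weights * mask) + bias)


rng = np.random.default_rng(0)
mask = build_mask(genes, pathways)
weights = rng.normal(size=mask.shape)
bias = np.zeros(mask.shape[1])

# Hypothetical batch of 3 samples, one expression value per gene.
x = rng.normal(size=(3, len(genes)))
activations = pathway_layer(x, weights, bias, mask)
print(activations.shape)
```

Because the mask is applied on every forward pass, masked connections contribute nothing to the output and would receive zero gradient during training, so the number of parameters that effectively need to be learned drops from genes x pathways to the number of annotated memberships.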

Funded Value: £185,651

Funded Period: Apr 19 - Mar 21

Funder: MRC

Project Status: Closed

Project Category: Fellowship

Project Reference: MR/S00386X/2

Principal Investigator: Alastair Droop

Health Category: Unclassified

People	ORCID iD
Alastair Droop (Principal Investigator / Fellow)	http://orcid.org/0000-0001-7695-7480

Publications

Christodoulou E (2021) Analysis of CRISPR-Cas9 screens identifies genetic dependencies in melanoma. in Pigment cell & melanoma research

Close HJ (2020) Expression profiling of single cells and patient cohorts identifies multiple immunosuppressive pathways and an altered NK cell phenotype in glioblastoma. in Clinical and experimental immunology

Da Silva B (2019) Chemically induced neurite-like outgrowth reveals a multicellular network function in patient-derived glioblastoma cells. in Journal of cell science

Ferreira I (2021) The clinicopathologic spectrum and genomic landscape of de-/trans-differentiated melanoma. in Modern pathology : an official journal of the United States and Canadian Academy of Pathology, Inc

Filia A (2019) High-Resolution Copy Number Patterns From Clinically Relevant FFPE Material. in Scientific reports

Packer JR (2020) Notch signalling is a potential resistance mechanism of progenitor cells within patient-derived prostate cultures following ROS-inducing treatments. in FEBS letters

Riva L (2020) The mutational signature profile of known and suspected human carcinogens in mice. in Nature genetics

Tanner G (2021) Benchmarking pipelines for subclonal deconvolution of bulk tumour sequencing data. in Nature communications

Tanner G (2019) Simulation of heterogeneous tumour genomes with HeteroGenesis and in silico whole exome sequencing. in Bioinformatics (Oxford, England)

Van Der Weyden L (2021) CRISPR activation screen in mice identifies novel membrane proteins enhancing pulmonary metastatic colonisation. in Communications Biology
