Facilitating Deep Learning with Domain-Specific Knowledge

Lead Research Organisation: Wellcome Sanger Institute
Department Name: Cancer Genetics and Genomics

Abstract

Machine Learning (ML) and Artificial Intelligence (AI) are exciting developments in large data analysis. Deep learning is a set of techniques in ML which use layered neural networks to learn structure and patterns in complex data. These techniques are particularly suited to analysis problems where we do not know the exact structure of the data in advance, and thus have to infer the interesting features from the data itself. This ability to learn complex structure makes deep learning a good conceptual fit to modern biology and healthcare challenges. As neural networks learn structure from data during a training phase, the quality of the result is heavily reliant on the quality of the training data. As with any ML task, there is a constant danger of overfitting, in which specific (and often irrelevant) features of the training set are learned as discriminatory. Overfitting is especially problematic when the training dataset is too small, or is not representative of the real-world problem we are attempting to address.

This project will allow us to apply recent advances in deep learning to our local biological and health-related datasets. I will therefore aim to:
- Build novel software to integrate domain knowledge into ML using publicly available data; and
- Apply this software to research data in Leeds to derive new insight and treatment possibilities.

I will use microarray and RNAseq data from both the public domain and four ongoing collaborations within the University of Leeds. These collaborative datasets fall into two categories: two transcriptomic datasets from relatively homogeneous diseases of the eye, and two large, genetically heterogeneous cancer datasets. These data will allow me to investigate the general utility of tailored topologies in deep neural networks. By including multiple datasets early on, I aim to explore the diverse range of possible embedding strategies, to determine which are most appropriate for each data type, and to avoid developing a technique that is not generally applicable.

Technical Summary

Although a good conceptual fit, current methods in deep learning do not perform particularly well on biological data. One reason for this is the paucity of samples: we simply do not have enough data to train a neural network of sufficient size or complexity to learn the intricacies of our input datasets. Although biological samples are comparatively few, we have a much richer understanding of the underlying structure of the observations, as we have extensive prior knowledge about the interactions between genes. I am planning to use this rich prior knowledge to enhance the efficiency of the training process in two ways:

1) Appropriate data embedding
A common technique in ML is to transform the input data into a form that is easier for the neural network to train on (embedding). For example, high-dimensional data could be reduced to a lower-dimensional form using principal component analysis (PCA) before use, thus removing some of the noise before training. I propose to develop similar methods, using the large body of expertise in transcriptomic data analysis in Leeds, to define suitable embedding strategies for preprocessing the input datasets.
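As an illustration of the PCA example mentioned above (and not of the embedding strategies this project will develop), a minimal sketch of reducing an expression matrix before training might look as follows. It assumes NumPy is available; the function name `pca_embed`, the random toy matrix and the choice of five components are all hypothetical.

```python
import numpy as np


def pca_embed(expression, n_components):
    """Project samples (rows) onto the top principal components.

    Returns the embedded samples and the fraction of total variance
    explained by each retained component.
    """
    # Centre each gene (column) so the components describe variance.
    centred = expression - expression.mean(axis=0)
    # SVD of the centred matrix: the rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]
    embedded = centred @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return embedded, explained[:n_components]


# Hypothetical toy data: 20 samples x 500 genes of random values,
# standing in for a real (normalised) expression matrix.
rng = np.random.default_rng(0)
expression = rng.normal(size=(20, 500))

embedded, explained = pca_embed(expression, n_components=5)
print(embedded.shape)  # (20, 5)
```

The embedded matrix, rather than the full set of gene-level measurements, would then be passed to the network. How many components to keep is a modelling choice that would need to be validated on real data.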

2) Biologically appropriate network topologies
A standard neural network consists of several hidden layers of neurones connected in a regular manner, typically with every neurone in one layer linked to every neurone in the next. For gene expression analysis, I propose instead to use the vast amount of prior knowledge available in online repositories (such as the Gene Ontology, KEGG and STRING) to build intermediate neural network layers that model the known biological pathways and interactions in the data. This way, the ML approach does not need to learn these already-known interactions.
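One possible way to encode such prior knowledge, sketched here under stated assumptions rather than as the project's actual design, is to mask the weights of a layer so that each hidden unit represents a pathway and receives input only from its member genes. The gene names, pathway memberships and the helpers `pathway_mask` and `masked_layer` below are hypothetical placeholders; in practice the memberships would be derived from a repository such as those named above.

```python
import numpy as np


def pathway_mask(genes, pathways):
    """Binary (n_genes x n_pathways) matrix: 1 where a gene is in a pathway."""
    index = {gene: i for i, gene in enumerate(genes)}
    mask = np.zeros((len(genes), len(pathways)))
    for j, members in enumerate(pathways.values()):
        for gene in members:
            if gene in index:  # ignore members not measured in this dataset
                mask[index[gene], j] = 1.0
    return mask


def masked_layer(x, weights, bias, mask):
    """Forward pass in which only mask-permitted gene-to-pathway links count."""
    # Multiplying by the mask zeroes every connection not backed by
    # prior knowledge; ReLU is applied as the activation.
    return np.maximum(0.0, x @ (weights * mask) + bias)


# Hypothetical genes and pathway memberships, for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
pathways = {
    "pathway_1": ["GENE_A", "GENE_B"],
    "pathway_2": ["GENE_C", "GENE_D"],
}

mask = pathway_mask(genes, pathways)
rng = np.random.default_rng(0)
weights = rng.normal(size=mask.shape)
bias = np.zeros(mask.shape[1])

x = rng.normal(size=(3, len(genes)))  # 3 samples x 5 genes
hidden = masked_layer(x, weights, bias, mask)
print(hidden.shape)  # (3, 2)
```

Because the mask multiplies the weights inside the forward pass, an automatic differentiation framework would also give zero gradient to the masked connections, so they stay switched off during training. In this toy example GENE_E belongs to no pathway and therefore has no influence on the hidden layer.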

Publications

Christodoulou E (2021) Analysis of CRISPR-Cas9 screens identifies genetic dependencies in melanoma. Pigment Cell & Melanoma Research.

Ferreira I (2021) The clinicopathologic spectrum and genomic landscape of de-/trans-differentiated melanoma. Modern Pathology.