Machine learning for environmental analytics

Lead Research Organisation: University of Cambridge

Department Name: Computer Science and Technology

Abstract

This PhD project develops machine learning, and especially deep learning, methodologies for integrating heterogeneous data in both medical and environmental contexts. So far, this kind of data has been analysed with simple methods that cannot discover relevant profiles and events across all data types, levels and scales. The work will also involve creating more interpretable deep learning models, so that physicians, for example, can understand how ML systems make decisions and visualise correlations in the data to aid them in making a diagnosis and offering personalised treatments.
Another potential study area would involve integrating multi-omics data (for example, metabolomics, transcriptomics and phenotype data) in order to reveal the mechanisms of complex drug exposures in the environment.

Starting directions include building on my existing research on multimodal deep learning, which could be highly applicable in these scenarios because it enables communication between feature extractors that was not previously possible. During a previous research degree, the MPhil in Advanced Computer Science, deep learning architectures were developed that allow cross-modal dataflow between the feature extractors, thereby extracting more interpretable features and obtaining a better representation than unimodal learning for the same amount of training data. These models achieved state-of-the-art results on two benchmark datasets and can usefully exploit correlations between audio and visual data, which have different dimensionality and are therefore not trivially exchangeable.
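
The idea of cross-modal dataflow between feature extractors can be illustrated with a minimal sketch. It assumes two toy streams (audio and visual) with different hidden sizes, and hypothetical projection matrices `a2v` and `v2a` that map each stream's intermediate features into the other stream's space before they are added. All names, layer shapes and weights here are illustrative assumptions, not the architectures developed in the MPhil work.

```python
def linear(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def cross_modal_forward(audio, visual, params):
    """One layer per stream, followed by an exchange of intermediate features.

    params is a hypothetical dict of weight matrices:
      audio_in, visual_in : per-stream feature extractors
      a2v, v2a            : cross-connections between the streams
    """
    # Each stream first extracts its own (unimodal) features.
    h_audio = relu(linear(params["audio_in"], audio))
    h_visual = relu(linear(params["visual_in"], visual))

    # Cross-connections: project each stream's features into the other
    # stream's hidden size, because the two sizes differ.
    audio_to_visual = linear(params["a2v"], h_audio)
    visual_to_audio = linear(params["v2a"], h_visual)

    # Each stream continues with its own features plus the other stream's.
    h_audio_mixed = relu(add(h_audio, visual_to_audio))
    h_visual_mixed = relu(add(h_visual, audio_to_visual))

    # Joint representation: concatenation of both streams.
    return h_audio_mixed + h_visual_mixed
```

Setting `a2v` and `v2a` to zero matrices reduces the sketch to two independent unimodal extractors whose outputs are concatenated, which makes the contribution of the cross-connections easy to isolate.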

Another interesting direction for this kind of research would be to integrate the deep learning models to be developed with a novel model checking approach. This could provide guarantees for the behaviour and statistical properties of the models, depending on what kind of data they are processing. Existing work in this field includes specification-based monitoring of cyber-physical systems (CPS), whose purpose can be monitoring and/or controlling various biological processes and medical devices. This research made it possible to study mechanisms such as blood cell specialisation and the delivery of insulin to patients with type-1 diabetes using artificial pancreas control systems. The latter is also an application of a new SMT solver-based synthesis method for Proportional-Integral-Derivative (PID) controllers for stochastic hybrid systems.
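
For readers unfamiliar with PID control, the control law itself can be sketched as follows. This is a minimal discrete-time illustration only: it does not attempt the SMT solver-based synthesis or the stochastic hybrid system models mentioned above. The gains, the time step and the toy first-order plant in `simulate` are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete-time PID controller (illustrative, hand-picked gains)."""
    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    previous_error: float = 0.0

    def step(self, setpoint, measurement, dt):
        """Return the control signal for one time step of length dt."""
        error = setpoint - measurement
        self.integral += error * dt                      # accumulate error
        derivative = (error - self.previous_error) / dt  # finite difference
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(controller, setpoint, steps=300, dt=0.1):
    """Drive a toy plant dx/dt = -x + u with forward Euler; return the final x."""
    x = 0.0
    for _ in range(steps):
        u = controller.step(setpoint, x, dt)
        x += dt * (-x + u)
    return x
```

With the hypothetical gains `PID(kp=2.0, ki=1.0, kd=0.0)`, the toy plant settles close to the setpoint. A synthesis method would instead search for gains that provably satisfy a specification rather than relying on hand tuning.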

Automated reasoning can also be used for modelling biological systems, in order to select models consistent with experimental observations and to identify suitable parameters. Cardiac disorders and cell models can then be approached from this perspective, as can the modelling of personalised prostate cancer therapies.

Other recent findings from the neuroscience domain could benefit this research and potentially bring the models developed closer to realistic learning capabilities. For example, models could be constructed according to the discovery that the brain contains multi-dimensional geometrical structures operating in up to 11 dimensions, or could exploit the conclusions of a study on reconsolidation, a brain process that occurs when the learning task is slightly modified. Another potentially useful finding is that brains learn on a single-cell basis, rather than through exploiting an entire neural network.

Student:

Catalina Cangea

Period of Study:

Oct 17 - Dec 20

Funder:

NERC

Project Status:

Closed

Project Category:

Studentship

Project Reference:

2221169

Research Topic:

Unclassified

Organisations

University of Cambridge (Lead Research Organisation)

People	ORCID iD
Pietro Lio (Primary Supervisor)
Catalina Cangea (Student)

Studentship Projects

Project Reference	Relationship	Related To	Start	End	Student Name
NE/M009009/1			05/10/2015	31/12/2022
2221169	Studentship	NE/M009009/1	01/10/2017	31/12/2020	Catalina Cangea

Key Findings


Description	We have developed general methodological foundations for working with graph-structured and multimodal data. The graph machine learning algorithms comprise two different coarsening approaches for predicting a property of an entire graph (e.g. protein class) and a graph visualisation method that provides insights into both the data being explored and the machine learning classifiers/models applied to it. All methods have been tested on chemical and social data. The cross-modal approaches will allow integration of various types of data when modelling and predicting quantities of interest in environmental contexts (e.g. the level of risk associated with a certain development) and medical contexts (e.g. early onset of diseases and the likelihood of risk factors producing changes). The third method developed builds on Neural Processes and is useful in few-shot, multi-task and label-scarce regimes: given a few labelled points from a dataset, we can predict labels for the rest of the dataset and produce uncertainty estimates for these predictions, while using graph neural networks to exploit the relations between the samples. This method is suited to a variety of settings, with great potential in real-world environmental and healthcare scenarios, where labelled data is often limited and it is desirable to model the uncertainty in predictions about the behaviour of a system or setting. Additionally, we have worked on a specific problem with implications in environmental and health settings: classifying chemicals according to putative modes of action (MOAs). This is of paramount importance for risk assessment, and current methods can handle only a very small proportion of existing chemicals. We proposed an integrative deep learning architecture that learns a joint representation from the molecular structures of drugs and their effects on human cells.
Our choice of architecture was motivated by the significant influence of a drug's chemical structure on its MOA. We improved on the already strong ability of a unimodal architecture (F1 score of 0.803) to classify drugs by their toxic MOAs (Verhaar scheme) by adding another learning stream that processes the transcriptional responses of human cells affected by drugs. Our integrative model achieved higher classification performance on the LINCS L1000 dataset, reducing the error by 4.6%. However, a follow-up study with data from another LINCS phase showed that the chemical fingerprints alone achieved better performance across both tasks, so future work will look at using only the graph structure of the drug molecules to produce the MOA classification.
Exploitation Route	We believe that our drug risk assessment method (https://arxiv.org/abs/1811.09714) can be used to extend the current Verhaar scheme and form a basis for fast drug validation and risk assessment. Remaining work includes using the graph structure of the drug molecules to determine whether performance can be improved further. The general methodological output for graph-structured data (https://arxiv.org/abs/1811.01287, https://arxiv.org/abs/2002.03864, https://arxiv.org/abs/2009.13895) and multimodal data (https://ieeexplore.ieee.org/abstract/document/8894404) can be applied to medical and environmental scenarios that require cross-modal and graph data integration.
Sectors	Chemicals, Environment, Healthcare, Pharmaceuticals and Medical Biotechnology
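
To make the notion of graph coarsening for whole-graph prediction more concrete, the sketch below shows one common form of coarsening: scoring nodes with a projection vector, keeping the top-scoring fraction, and averaging the surviving node features into a single graph-level vector. This is an illustrative assumption about what such a step can look like. The record does not specify the coarsening approaches used in the project, and the function names, the `ratio` parameter and the projection vector are hypothetical.

```python
import math


def top_k_pool(features, adjacency, projection, ratio=0.5):
    """Coarsen a graph by keeping its ceil(ratio * n) highest-scoring nodes.

    features   : list of per-node feature vectors
    adjacency  : dense n x n adjacency matrix (list of lists)
    projection : scoring vector (hypothetical; learned in a real model)
    """
    n = len(features)
    norm = math.sqrt(sum(p * p for p in projection))
    # Score each node by projecting its features onto the scoring vector.
    scores = [sum(x * p for x, p in zip(row, projection)) / norm
              for row in features]
    k = max(1, math.ceil(ratio * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original node order
    # Gate the surviving features by their squashed scores.
    pooled_features = [[math.tanh(scores[i]) * x for x in features[i]]
                       for i in keep]
    # Keep only the edges between surviving nodes.
    pooled_adjacency = [[adjacency[i][j] for j in keep] for i in keep]
    return pooled_features, pooled_adjacency, keep


def mean_readout(features):
    """Average node features into one vector describing the whole graph."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]
```

The vector returned by `mean_readout` would then be passed to a classifier that predicts a label for the entire graph, such as a protein class.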
