HSM: Construction of graph-based network longitudinal algorithms to identify screening and prognostic biomarkers and therapeutic targets (GBNLA)

Lead Research Organisation: University College London
Department Name: Women's Cancer


The development of much sought-after personalised approaches in Systems Medicine has reached a point at which the volume of Big Data overwhelms the methods available to analyse it. Typical Big Data is high-dimensional and includes parameters that can be continuous or categorical. Additionally, the data can come from serial measurements taken over time from the same patients. Hence, there is a need to develop methodology able to analyse changes in high-dimensional data containing categorical and longitudinal continuous parameters in order to identify prognostic, diagnostic and therapeutic targets, e.g., to solve the task of classifying diseased and healthy patients, particularly for early diagnosis. The main aim of this application is to develop a methodology for representing serial data containing categorical and continuous parameters in the form of networks, for the longitudinal analysis of network dynamics and, hence, for the construction of longitudinal network biomarkers, generating diagnostic, prognostic and druggable targets. This becomes possible if we utilise the idea of parenclitic network analysis. The main advantage of parenclitic network analysis is that it enables the construction of a graph without any a priori knowledge of the interactions between the parameters. An algorithm to build parenclitic networks, able to establish links between parameters/nodes without any a priori knowledge of their interactions, was first described by Zanin and Boccaletti. Parenclitic networks have been successfully applied to the problem of detecting key genes and metabolites in different diseases. Recently we have applied this methodology to implement a machine learning classification of human DNA methylation data carrying signatures of cancer development.
We have also described an improved algorithm to construct parenclitic networks and provided a simple case study of a protein dataset from ovarian cancer case and control serum samples. In the present project we plan to develop and investigate this algorithm to represent serial high-dimensional data, both continuous and categorical, in the form of connected networks without a priori knowledge of analyte-analyte links, and to use this representation to search for diagnostic and prognostic markers and druggable targets via the construction of longitudinal network biomarker models. The planned research will include the combination of parenclitic network analysis with our previously developed algorithms for serial data analysis. Serial data analysis has shown its advantage over typical snapshot analysis: because it can detect time-dependent changes in the data, it enables earlier diagnosis than the analysis of non-serial data. Methodological development will include testing of algorithms with different available clinical and epidemiological data, such as proteomic, genetic, and DNA methylation data, in order to identify data-specific features of the methodology under investigation. Moreover, our network analysis will be combined with other recently developed network analysis methods, including community detection and deep learning algorithms. In order to investigate the advantages and disadvantages of the developed methodology and precisely estimate its efficiency, we will generate synthetic data which closely mimic cancer screening data. We will investigate the application of our methodology not only to the early diagnosis of diseases, but also to the personalised differentiation of the diseases detected. We will communicate the work to other scientists contributing to the development of network machine learning algorithms, and to clinicians, in order to discuss different clinical data and the implementation of our methodology in clinical practice.

Technical Summary

The methodology developed will be based on the previously proposed idea of parenclitic network analysis. This method will enable the representation of parameters corresponding to individual patients and controls in graph form without a priori knowledge of the interactions between them. The strength of the links will be estimated based on how different they are from those in healthy control subjects. In addition to the original estimation based on linear regression, we will use estimates based on two-dimensional kernel density estimation. The generated graphs or networks will be analysed using topological indices widely used for characterising complex networks, e.g. mean, variance, and maximal values of edge weights, vertex degree, shortest path lengths, Kleinberg, diameter of the graph, degree centrality, network efficiency, betweenness centrality, maximal Google PageRank index, number of communities and so on. These metrics, each corresponding to one moment in time, will be computed for all available serial measurements and then passed to serial algorithms such as change-point detection methods, parametric empirical Bayes and methods of mean trends. We have already applied all these methods successfully to data analysis. In order to combine and compare our methodology with other machine learning network algorithms, we will link it to community detection analysis and various types of shallow and deep artificial neural networks. We will also compare our methods with basic algorithms and models such as Bayesian networks, decision trees and forests, and advanced SVM methods. In order to identify data-specific features of the methodology we will test all newly developed algorithms on very different datasets, testing their potential for different tasks in clinical and biotechnological practice.
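
The linear-regression variant of the construction described above can be illustrated with a minimal sketch. It is not the project's software: the function names `parenclitic_adjacency` and `topological_indices`, the choice to scale each deviation by the standard deviation of the control residuals, and the synthetic data are all assumptions made for illustration. For every pair of parameters, a line is fitted to the healthy controls, and the edge weight for one subject is that subject's distance from the line in units of the control residual spread.

```python
import numpy as np


def parenclitic_adjacency(controls, subject):
    """Weighted adjacency matrix of a parenclitic network for one subject.

    controls: (n_controls, n_features) array of healthy reference samples.
    subject:  (n_features,) array of measurements for one individual.
    The weight of edge (i, j) is the absolute deviation of the subject from
    the control regression line of feature j on feature i, divided by the
    standard deviation of the control residuals (an illustrative choice).
    """
    controls = np.asarray(controls, dtype=float)
    subject = np.asarray(subject, dtype=float)
    n_features = controls.shape[1]
    weights = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            x, y = controls[:, i], controls[:, j]
            slope, intercept = np.polyfit(x, y, 1)  # least-squares line
            residuals = y - (slope * x + intercept)
            sigma = residuals.std(ddof=2)  # two fitted parameters
            deviation = abs(subject[j] - (slope * subject[i] + intercept))
            w = deviation / sigma if sigma > 0 else 0.0
            weights[i, j] = weights[j, i] = w
    return weights


def topological_indices(weights):
    """A few simple indices of the weighted graph (a small subset only)."""
    upper = weights[np.triu_indices_from(weights, k=1)]  # each edge once
    strength = weights.sum(axis=1)  # weighted vertex degree
    return {
        "mean_edge_weight": float(upper.mean()),
        "var_edge_weight": float(upper.var()),
        "max_edge_weight": float(upper.max()),
        "max_vertex_strength": float(strength.max()),
    }


# Synthetic illustration (hypothetical data, not from the study):
# three parameters that are linearly related in the controls.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
controls = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    -x + rng.normal(scale=0.1, size=200),
])
typical = np.array([1.0, 2.0, -1.0])    # follows the control relationships
atypical = np.array([1.0, -2.0, -1.0])  # second parameter breaks them

print(topological_indices(parenclitic_adjacency(controls, typical)))
print(topological_indices(parenclitic_adjacency(controls, atypical)))
```

Computing such indices separately at each sampling time for the same subject gives one time series per index, which is the kind of input the serial algorithms named above would take. The kernel-density variant would replace the regression line with a two-dimensional density estimated from the controls.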

Planned Impact

Development of the methodology to find longitudinal network biomarkers will provide solutions for the early detection of important diseases such as ovarian, breast and pancreatic cancer, diseases linked to microbiota, ulcerative colitis, pathological neural disorders linked to ageing and many others. We will deliver novel methodology for interrogating Big Data, not only by derivation of new algorithms for the early detection of disease, but also providing methods for understanding the underlying biology of disease progression.
Our team includes academics with expertise in computational mathematics and clinical practice, particularly cancer screening, who will communicate with the corresponding communities. The network machine-learning algorithms that we will develop belong to the world's most complex and interesting research challenges, so our results will be of utmost interest for the community of scientists developing network machine-learning algorithms. Indeed, in 2017, machine learning algorithms and artificial intelligence were identified as a top priority in the development of new strategies for cancer treatment. However, our algorithms will be universal and will be helpful not only for finding new methods of early cancer detection, but also for a wide spectrum of other diseases and biotechnological problems in which serial multi-dimensional data must be analysed to predict an abrupt change in a complex biological system.


Title New methodology for the feature analysis of proteomic data 
Description A new transformation of the data has been found for use in the subsequent network analysis. The transformation, which identifies new and previously hidden features, can be applied to the longitudinal data from the ovarian cancer case-control study. Our data were obtained using two different analyte measurement systems (the Olink data are unit-less values, whilst all other measurements are serum concentrations from ELISA-based assays).
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? No  
Impact A transformation that normalizes the two different types of data and reduces them to a single structure has been established. The advantages and disadvantages of each data set (before and after normalization) have been compared in terms of applying different parenclitic networks to them. The methodology developed will be helpful for conducting any longitudinal analysis of parenclitic networks based on proteomic data for ovarian cancer.
Title New methodology for the preprocessing feature analysis of EPIC array DNAm data. 
Description We developed a new approach that uses the intensity signals in the GREEN and RED channels of the EPIC array to find the causes of patterns arising in specific DNA regions (such as SNPs). We have shown that it is the genetic sequence that results in different intensity scales for probes. For example, it was found that the presence of a larger number of nucleotide bases in the DNA sequence (target of the sample) contributes to better amplification of the probe with the target and, as a result, leads to stronger intensity readings. The reliability of the obtained beta-values was studied as a function of intensity strength, taking into account the presence of technical noise (it was shown that the lower the sum of the intensities in the channels, the lower the reliability).
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? No  
Impact A new set of probes whose behaviour matches that of probes on SNPs, and which is not listed in any accessible SNP database, has been obtained. An approach has been developed that allows one to judge the reliability of beta-values at all intensity levels, avoiding the widespread (and uninformative) approach to determining the reliability of the probe signal. For this we have used only the distribution of control probes. A detailed analysis has shown that the analysis of DNAm EPIC array data can use not only the beta-values but also the intensity values, as an additional data source. The methodology developed will be useful for all researchers using modern chips to analyse DNA methylation.
Title Generalised Parenclitic Network Algorithm Implementation 
Description The implemented algorithm makes it possible to run parenclitic network analysis with any chosen machine-learning kernel. The software supports parallel computation.
Type Of Technology Software 
Year Produced 2019 
Impact The software will be useful to anyone in academia or industry who runs parenclitic network analysis, i.e., to all data analysts working with high-dimensional data.
URL https://github.com/mike-live/parenclitic
Description Italian-Russian-British Workshop on DNA methylation analysis in Bologna 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact We organised a one-day Italian-Russian-British Workshop on DNA methylation analysis in Bologna, Italy, to bring postgraduate students together to hear talks.
Year(s) Of Engagement Activity 2019
URL https://www.ucl.ac.uk/~rmjbale/Workshop.html