HSM: Construction of graph-based network longitudinal algorithms to identify screening and prognostic biomarkers and therapeutic targets (GBNLA)

Lead Research Organisation: UNIVERSITY COLLEGE LONDON

Department Name: Women's Cancer

Abstract

The development of highly sought-after personalised approaches in Systems Medicine has reached a point at which the volume of Big Data overwhelms the methods available to analyse it. Typical Big Data is high-dimensional and includes parameters which can be continuous or categorical. Additionally, the data can come from serial measurements taken over time from the same patients. Hence, there is a need to develop methodology able to analyse changes in high-dimensional data containing categorical and longitudinal continuous data in order to identify prognostic, diagnostic and therapeutic targets, e.g., to solve the task of classifying diseased and healthy patients, particularly for early diagnosis. The main aim of this application is to develop a methodology for the representation of serial data containing categorical and continuous parameters in the form of networks, the longitudinal analysis of network dynamics, and hence the construction of longitudinal network biomarkers, generating diagnostic, prognostic and druggable targets. This becomes possible if we utilise the idea of parenclitic network analysis. The main advantage of parenclitic network analysis is that it enables the construction of a graph without any a priori knowledge of the interactions between the parameters. An algorithm to build parenclitic networks, establishing links between parameters (nodes) in this way, was first described by Zanin and Boccaletti. Parenclitic networks have been successfully applied to the problem of detecting key genes and metabolites in different diseases. Recently we have applied this methodology to implement a machine learning classification of human DNA methylation data carrying signatures of cancer development.
We have also described an improved algorithm to construct parenclitic networks and provided a simple case study of a protein dataset from ovarian cancer case and control serum samples. In the present project we plan to develop and investigate this algorithm to represent serial high-dimensional data, both continuous and categorical, in the form of connected networks without a priori knowledge of analyte-analyte links, and to use this representation to search for diagnostic and prognostic markers and druggable targets via the construction of longitudinal network biomarker models. The planned research will combine parenclitic network analysis with our previously developed algorithms for serial data analysis. Serial data analysis has shown its advantage over typical snapshot analysis: by detecting time-dependent changes in the data, it enables earlier diagnosis than is possible with non-serial data. Methodological development will include testing the algorithms on different available clinical and epidemiological data, such as proteomic, genetic, and DNA methylation data, in order to identify data-specific features of the methodology under investigation. Moreover, our network analysis will be combined with other recently developed network analysis methods, including community detection and deep learning algorithms. In order to investigate the advantages and disadvantages of the developed methodology and to estimate its efficiency precisely, we will generate synthetic data which closely mimic cancer screening data. We will investigate the application of our methodology not only to the early diagnosis of diseases, but also to the personalised differentiation of the diseases detected. The work will be communicated to other scientists contributing to the development of network machine learning algorithms, as well as to clinicians, in order to discuss different clinical data and the implementation of our methodology in clinical practice.
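To illustrate why serial measurements can reveal a change that a single snapshot cannot, the following is a minimal sketch of one generic change-point technique: a least-squares search for a single shift in the mean of a time series. It is not the project's serial algorithm; the function name, the `min_segment` parameter and the example values are assumptions made for illustration only.

```python
from statistics import mean


def best_single_change_point(series, min_segment=2):
    """Return (index, cost) of the split that minimises the summed
    within-segment squared deviation from each segment's mean.

    The series is split into series[:index] and series[index:].
    This always returns some split; a real method would also test
    whether the reduction in cost is statistically meaningful.
    """
    def sse(segment):
        m = mean(segment)
        return sum((v - m) ** 2 for v in segment)

    best_index, best_cost = None, float("inf")
    for k in range(min_segment, len(series) - min_segment + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost


# Hypothetical serial values of one marker (or one network index) for one patient.
marker_over_time = [1.0, 1.2, 0.9, 1.1, 1.0, 3.1, 3.4, 2.9, 3.2]
change_index, _ = best_single_change_point(marker_over_time)
print("estimated change point at visit index", change_index)
```

Any single value from the second half of this series might still fall inside a population reference range, whereas the shift relative to the patient's own earlier values is unambiguous.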

Technical Summary

The methodology developed will be based on the previously proposed idea of parenclitic network analysis. This method will enable the representation of parameters corresponding to individual patients and controls in graph form without a priori knowledge of the interactions between them. The strength of each link will be estimated from how far the relationship between the two linked parameters deviates from that seen in healthy control subjects. In addition to the original estimation based on linear regression, we will use estimates based on two-dimensional kernel density estimation. The generated graphs or networks will be analysed using topological indices widely used for characterising complex networks, e.g. the mean, variance, and maximal values of edge weights, vertex degree, shortest path lengths, Kleinberg centrality, diameter of the graph, degree centrality, network efficiency, betweenness centrality, maximal Google PageRank index, number of communities and so on. These metrics, each describing the network at one moment in time, will be computed for all available serial measurements and then passed to serial algorithms such as change-point detection methods, parametric empirical Bayes and methods of mean trends. All these methods have already been successfully applied by us for data analysis. In order to combine and compare our methodology with other machine learning network algorithms, we will link it to community detection analysis and various types of shallow and deep artificial neural networks. We will also compare our methods with basic algorithms and models such as Bayesian networks, decision trees and forests, and advanced SVM methods. In order to identify data-specific features of the methodology we will test all newly developed algorithms on very different datasets, testing their potential for different tasks in clinical and biotechnological practice.
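The regression-based construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the row-per-subject data layout, the choice to scale each residual by the standard deviation of the control residuals, and the example values are all hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_reference(controls):
    """Fit one regression per pair of parameters on the control group.

    controls: list of rows, one row of parameter values per control subject.
    Returns {(i, j): (slope, intercept, residual_sd)}.
    """
    models = {}
    for i, j in combinations(range(len(controls[0])), 2):
        xs = [row[i] for row in controls]
        ys = [row[j] for row in controls]
        slope, intercept = fit_line(xs, ys)
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        models[(i, j)] = (slope, intercept, pstdev(residuals))
    return models


def parenclitic_edges(subject, models):
    """Edge weight (assumed form): the subject's absolute residual from the
    control regression line, in units of the control residual SD."""
    edges = {}
    for (i, j), (slope, intercept, sd) in models.items():
        expected = slope * subject[i] + intercept
        edges[(i, j)] = abs(subject[j] - expected) / sd
    return edges


def summarise(edges, n_parameters):
    """A few simple topological indices of the weighted graph."""
    strength = [0.0] * n_parameters  # weighted vertex degree
    for (i, j), w in edges.items():
        strength[i] += w
        strength[j] += w
    weights = list(edges.values())
    return {
        "mean_weight": mean(weights),
        "max_weight": max(weights),
        "max_strength": max(strength),
    }


# Hypothetical data: four controls, three parameters each.
controls = [[1.0, 2.1, 0.9], [2.0, 3.9, 2.2], [3.0, 6.2, 2.8], [4.0, 7.8, 4.1]]
models = fit_reference(controls)
typical = [2.5, 5.0, 2.5]    # follows the control relationships
atypical = [2.5, 9.0, 2.5]   # second parameter is out of line with the others
typical_summary = summarise(parenclitic_edges(typical, models), 3)
atypical_summary = summarise(parenclitic_edges(atypical, models), 3)
print(typical_summary)
print(atypical_summary)
```

In this toy example only the edges touching the second parameter become heavy for the atypical subject, so indices such as the maximal edge weight or maximal vertex strength separate the two subjects even though no interaction between parameters was specified in advance.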

Planned Impact

Development of the methodology to find longitudinal network biomarkers will provide solutions for the early detection of important diseases such as ovarian, breast and pancreatic cancer, diseases linked to microbiota, ulcerative colitis, pathological neural disorders linked to ageing and many others. We will deliver novel methodology for interrogating Big Data, not only by deriving new algorithms for the early detection of disease, but also by providing methods for understanding the underlying biology of disease progression.
Our team includes academics with expertise in computational mathematics and clinical practice, particularly cancer screening, who will communicate with the corresponding communities. The network machine-learning algorithms that we will develop are among the world's most complex and interesting research challenges, so our results will be of great interest to the community of scientists developing network machine-learning algorithms. Indeed, in 2017, machine learning algorithms and artificial intelligence were identified as a top priority in the development of new strategies for cancer treatment. However, our algorithms will be universal and will be helpful not only for finding new methods of early cancer detection, but also for a wide spectrum of other diseases and biotechnological problems in which serial multi-dimensional data must be analysed to predict an abrupt change in a complex biological system.

Funded Value:

£469,754

Funded Period:

Jan 19 - May 22

Funder:

MRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

MR/R02524X/1

Principal Investigator:

Alexey Zaikin

Health Category:

Unclassified

Organisations

People	ORCID iD
Alexey Zaikin (Principal Investigator)
Mahesh Parmar (Co-Investigator)	http://orcid.org/0000-0003-0166-1700
John Timms (Co-Investigator)
Harry James Whitwell (Co-Investigator)
Usha Menon (Co-Investigator)
Peter DiMaggio (Co-Investigator)

Publications

Abrego L (2021) Estimating integrated information in bidirectional neuron-astrocyte communication. in Physical review. E

Alexey Zaikin (2022) Longitudinal, deep, and network biomarkers: Parenclitic and synolitic network analysis

Alexey Zaikin (2022) Parenclitic and Synolitic Networks

Blyuss O (2020) Development of PancRISK, a urine biomarker-based risk score for stratified screening of pancreatic cancer patients. in British journal of cancer

Chen S (2020) Editorial: Multiscale Modeling of Rhythm, Pattern and Information Generation: from Genome to Physiome. in Frontiers in physiology

De Marco M (2020) Comment on: 'Development of PancRISK, a urine biomarker-based risk score for stratified screening of pancreatic cancer patients'. in British journal of cancer

Demichev V (2022) A proteomic survival predictor for COVID-19 patients in intensive care. in PLOS digital health

Demichev V (2021) A time-resolved proteomic and prognostic map of COVID-19. in Cell systems

Di Blasi R (2021) Non-Histone Protein Methylation: Biological Significance and Bioengineering Potential. in ACS chemical biology

Gentry-Maharaj A (2020) Multi-Marker Longitudinal Algorithms Incorporating HE4 and CA125 in Ovarian Cancer Screening of Postmenopausal Women. in Cancers

Research Databases and Models
Software and Technical Products
Engagement Activities


Title	Additional file 2 of Technical and biological sources of unreliability of Infinium probes on Illumina methylation microarrays
Description	Additional file 2 Results two-sided Fisher test, basis for Fig. 3b.
Type Of Material	Database/Collection of data
Year Produced	2024
Provided To Others?	Yes
URL	https://springernature.figshare.com/articles/dataset/Additional_file_2_of_Technical_and_biological_s...


Title	New methodology for the feature analysis of proteomic data
Description	A new transformation of the data has been found for use in subsequent network analysis. The transformation, which identifies new and previously hidden features, can be applied to the longitudinal data from the ovarian cancer case-control study. Our data were obtained using two different analyte measurement systems (the Olink data represent unit-less values, whilst all other measurements are serum concentrations from ELISA-based assays).
Type Of Material	Computer model/algorithm
Year Produced	2020
Provided To Others?	No
Impact	A transformation that normalizes the two different types of data and reduces them to one structure has been established. The advantages and disadvantages of each data set (before and after normalization) have been compared in terms of applying different parenclitic networks to them. The methodology developed will be helpful for conducting any longitudinal analysis of parenclitic networks based on proteomic data for ovarian cancer.


Title	New methodology for the preprocessing feature analysis of EPIC array DNAm data.
Description	We developed a new approach that uses the intensity signals in the GREEN and RED channels of the EPIC array to find the causes of patterns arising in specific DNA regions (such as SNPs). We have shown that it is the genetic sequence that results in different intensity scales for probes. For example, it was found that the presence of a larger number of nucleotide bases in the DNA sequence (target of the sample) contributes to better amplification of the probe with the target and, as a result, leads to stronger intensity indicators. Depending on the strength of the intensities, the reliability of the obtained beta-values was studied taking into account the presence of technical noise (it was shown that the lower the sum of the intensities in the channels, the lower the reliability).
Type Of Material	Computer model/algorithm
Year Produced	2020
Provided To Others?	No
Impact	A new set of probes matching the behavior-on-SNP, not indicated by any accessible SNP database, has been obtained. An approach has been developed that allows the reliability of beta-values to be judged at all intensity levels, avoiding the widespread (and uninformative) approach to determining the reliability of the probe signal. For this we have used only the distribution of control probes. A detailed analysis has shown that it is possible to use, in the analysis of DNAm EPIC array data, not only the beta-values but also the intensity values as an additional data source. The methodology developed will be useful for all researchers using modern chips to analyse DNA methylation.
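The link between low total intensity and unreliable beta-values can be illustrated with the commonly used definition of the beta-value as the methylated intensity divided by the sum of methylated and unmethylated intensities plus a small offset. The sketch below is not the method of this record: the function names, the offset of 100 (a widely used convention) and the example intensities are assumptions chosen for illustration.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta-value from methylated and unmethylated signal intensities."""
    return methylated / (methylated + unmethylated + offset)


def beta_shift(methylated, unmethylated, noise):
    """Absolute change in beta when `noise` is added to the methylated signal."""
    noisy = beta_value(methylated + noise, unmethylated)
    return abs(noisy - beta_value(methylated, unmethylated))


# The same additive noise applied to a dim probe and to a bright probe
# (hypothetical intensities, both half-methylated before the offset).
noise = 100.0
shift_low_intensity = beta_shift(300.0, 300.0, noise)
shift_high_intensity = beta_shift(3000.0, 3000.0, noise)
print("shift at low total intensity: ", round(shift_low_intensity, 4))
print("shift at high total intensity:", round(shift_high_intensity, 4))
```

The same amount of technical noise moves the beta-value of the dim probe several times further than that of the bright probe, which is why the total intensity carries information about reliability that the beta-value alone does not.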


Title	Generalised Parenclitic Network Algorithm Implementation
Description	The implemented algorithm makes it possible to run parenclitic network analysis with any chosen machine-learning kernel. The software supports parallel computation.
Type Of Technology	Software
Year Produced	2019
Impact	The software developed will be useful to anyone in academia or industry who runs parenclitic network analysis, i.e., to all data analysts working with high-dimensional data.
URL	https://github.com/mike-live/parenclitic


Description	Invited Keynote talk, "Analysis of Medical Data with Synolitic Networks", Workshop in Artificial Intelligence, Data Analysis and Modelling (AIDAM), Leicester, February 23, 2024.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Other audiences
Results and Impact	Invited Keynote talk, "Analysis of Medical Data with Synolitic Networks", Workshop in Artificial Intelligence, Data Analysis and Modelling (AIDAM), Leicester, February 23, 2024.
Year(s) Of Engagement Activity	2024


Description	Invited talk, "Merging Nonlinear Dynamics, Graphs and Artificial Intelligence: Synolitic Networks and Noise-induced AI", the IUTAM Symposium on Data-driven Nonlinear and Stochastic Dynamics with the Control, chaired by Yong Xu and Juergen Kurths, during June 5-9, 2023, in Xi'an, China.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Invited talk, "Merging Nonlinear Dynamics, Graphs and Artificial Intelligence: Synolitic Networks and Noise-induced AI", the IUTAM Symposium on Data-driven Nonlinear and Stochastic Dynamics with the Control, chaired by Yong Xu and Juergen Kurths, during June 5-9, 2023, in Xi'an, China.
Year(s) Of Engagement Activity	2023


Description	Invited talk. "Analysis of Medical Data with Synolitic Networks", AMS-UMI International Meeting 2024, Palermo, Italy, July 23-26, 2024.
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	Invited talk. "Analysis of Medical Data with Synolitic Networks"
Year(s) Of Engagement Activity	2024
URL	https://umi.dm.unibo.it/jm-umi-ams/


Description	Italian-Russian-British Workshop on DNA methylation analysis in Bologna
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	We organised a one-day Italian-Russian-British Workshop on DNA methylation analysis in Bologna, Italy, to bring postgraduate students together to hear talks.
Year(s) Of Engagement Activity	2019
URL	https://www.ucl.ac.uk/~rmjbale/Workshop.html