Learning and assessing the robustness of Bayesian networks for biological data

Lead Research Organisation: University of Leeds
Department Name: Statistics

Abstract

Bayesian Network models are an increasingly popular way to analyse biological data. In biological applications, a Bayesian Network can be viewed as a machine that produces outputs from inputs. Here the inputs are the data, the hyper-parameters and the distributional assumptions, and the outputs are biologically meaningful results such as the probability of a variable's expression and its links to phenotype. While studying the construction of Bayesian Networks has led to many insights for target gene discovery and drug development, such models are typically computationally expensive. This computational cost causes three main problems when Bayesian Networks are applied to biological data. Firstly, the sensitivity of a Bayesian Network's outputs to changes in its inputs is rarely assessed. Secondly, biologists normally expect a faster analysis method, which limits the application of Bayesian Networks to first-hand data. Last but not least, there is a gap in optimising the features of such models to achieve more biologically meaningful results, because optimisation requires repeated experiments and experimental design over the input space. A faster way to obtain Bayesian Network results is therefore needed. We propose to develop a robust Bayesian emulator that efficiently mimics the output of the Bayesian Network model, which could reduce the computational cost of Bayesian Networks. We will also explore novel biological references to validate the credibility of our model. This would allow us to carry out experimental design over different inputs and to optimise the performance of Bayesian Networks. The improved method has potential for use in general biological analysis practice.
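As an illustration of the emulation idea only, the sketch below fits a cheap surrogate to a handful of runs of a slow model and then queries the surrogate instead of the model. It assumes a Gaussian-process-style emulator with a squared-exponential kernel, which is one common choice and not necessarily the emulator proposed here. The function `expensive_model`, the kernel length scale and the design points are hypothetical stand-ins.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def expensive_model(h):
    """Hypothetical stand-in for one slow Bayesian Network run at hyper-parameter h."""
    return math.sin(3.0 * h) + 0.5 * h

def fit_emulator(xs, ys, jitter=1e-8):
    """Return a function giving the emulator's mean prediction at a new input."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve(K, ys)  # weights of the kernel expansion

    def predict(x_new):
        return sum(rbf(x_new, x) * w for x, w in zip(xs, alpha))

    return predict

# A small design over the input space: 11 evenly spaced points on [0, 2].
design = [i / 5 for i in range(11)]
runs = [expensive_model(h) for h in design]   # the only slow calls
emulate = fit_emulator(design, runs)

# Cheap prediction at an input that was never run through the slow model.
print(emulate(1.1), expensive_model(1.1))
```

Once fitted, the surrogate can be evaluated many times at negligible cost, which is what makes sensitivity analysis and design over the input space feasible.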

Publications

Description I have developed a novel machine learning algorithm, scalable Bigraphical Lasso, to infer the conditional dependencies across both instances and features from high-dimensional count data. For single-cell RNA sequencing data, this method can be used to learn the gene regulatory network and the temporal structure of cells simultaneously. The algorithm solved the tensor-decomposition-based problem 300 times faster than previous approaches while maintaining high accuracy.
I have also investigated a simulation-based approach to sensitivity and reliability analysis for Bayesian Hierarchical Models (Bayesian Networks).
As a result of my research, I have published two papers at top peer-reviewed conferences in data science and statistics, and I have presented my work at various international conferences.
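To make "dependency across both instances and features" concrete, the sketch below builds the Kronecker-sum precision matrix used by the original Bigraphical Lasso formulation, in which one graph links instances (cells) and another links features (genes). It shows only that structure; it is not the scalable inference algorithm described above, nor its extension to count data. The matrices `psi` and `theta` are made-up toy values.

```python
def identity(n):
    """n x n identity matrix as a list of lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(A, B):
    """Kronecker product of two dense matrices."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def kron_sum(psi, theta):
    """Kronecker sum: psi (x) I_p + I_n (x) theta."""
    n, p = len(psi), len(theta)
    left = kron(psi, identity(p))
    right = kron(identity(n), theta)
    return [[l + r for l, r in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]

# Toy example: 2 cells and 3 genes (hypothetical values).
psi = [[2.0, -0.5],
       [-0.5, 2.0]]          # dependencies between cells
theta = [[1.0, 0.3, 0.0],
         [0.3, 1.0, 0.0],
         [0.0, 0.0, 1.0]]    # dependencies between genes

omega = kron_sum(psi, theta)  # 6 x 6 precision over (cell, gene) pairs
p = len(theta)

def index(cell, gene):
    """Row or column of omega for a given (cell, gene) pair."""
    return cell * p + gene

# Same cell, different genes: the entry comes from the gene graph.
print(omega[index(0, 0)][index(0, 1)])
# Same gene, different cells: the entry comes from the cell graph.
print(omega[index(0, 0)][index(1, 0)])
# Different cell and different gene: no direct conditional dependence.
print(omega[index(0, 0)][index(1, 1)])
```

Under this structure, two (cell, gene) entries are directly dependent only when they share a cell and their genes are linked, or share a gene and their cells are linked, so the two small graphs describe the whole dependency pattern.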
Exploitation Route The code for Scalable Bigraphical Lasso can be used by other researchers to infer the gene regulatory network and the temporal structure of cells simultaneously from their gene expression data, with potential applications in personalised medicine. It can also be applied to other real-world problems with complicated dependency structures across features and instances. Furthermore, the approach has the potential to be extended to a K-dimensional model (K>2) for problems with multi-way coordinates.
The pipeline we used for simulation-based sensitivity and reliability analysis of a Bayesian Hierarchical Model can be applied to any Bayesian model that researchers want to investigate. Moreover, our analysis of BASiCS, a classic Bayesian Hierarchical Model for single-cell RNA sequencing data, has shown that it underestimates certain parameters and needs a better posterior point estimate. Such insights could help to improve the method itself and general practice in Bayesian Hierarchical Modelling in the future.
Both contributions are general to applied statistics and machine learning, and would therefore be useful in both academic and non-academic environments.
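The general shape of a simulation-based sensitivity check can be sketched as follows: simulate data from known parameter values, fit the model under different prior settings, and compare the posterior point estimates with the truth. The example uses a hypothetical conjugate normal model with known noise variance, chosen only because its posterior mean has a closed form. It is not BASiCS, and it makes no claim about why BASiCS underestimates particular parameters.

```python
import random
import statistics

def posterior_mean(data, prior_mean, prior_var, noise_var=1.0):
    """Posterior mean of a normal mean under a normal prior (known noise variance)."""
    n = len(data)
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + sum(data) / noise_var) / precision

def bias_under_prior(true_mu, prior_mean, prior_var,
                     n_obs=20, n_reps=500, seed=0):
    """Average (estimate - truth) over repeated simulated data sets."""
    rng = random.Random(seed)  # fixed seed so the study is reproducible
    estimates = []
    for _ in range(n_reps):
        data = [rng.gauss(true_mu, 1.0) for _ in range(n_obs)]
        estimates.append(posterior_mean(data, prior_mean, prior_var))
    return statistics.mean(estimates) - true_mu

# Sensitivity of the point estimate to one hyper-parameter (the prior variance).
for prior_var in (0.01, 0.1, 1.0, 10.0):
    bias = bias_under_prior(true_mu=2.0, prior_mean=0.0, prior_var=prior_var)
    print(f"prior_var={prior_var:<5} bias={bias:+.3f}")
```

In this toy setting a tight prior centred away from the truth pulls the estimate towards the prior mean, so the parameter is systematically underestimated. Systematic error of this kind is what a simulation study with known ground truth is designed to expose.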
Sectors Digital/Communication/Information Technologies (including Software), Other

URL https://arxiv.org/abs/2203.07912