Advanced Stochastic Computation for Inference from Tree, Graph and Network Models

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

As a result of recent experimental advances, large amounts of biological data are becoming available for humans and other organisms.
Such data pose inference problems well beyond the capabilities of standard statistical tools. Experimental advances must be accompanied by the development of suitable biostatistical and bioinformatic tools that will make efficient use of the complex data to improve our understanding of the genetic forces shaping the evolution of genome organization.

Molecular evolution and comparative genomics are no longer fields where collecting data is the main obstacle to progress. Many probabilistic models proposed in biology try to capture the evolutionary mechanisms and reflect the data generating mechanism. Whilst these models still have limitations, there have been substantial improvements in recent years.
We contend that progress is mainly limited by a lack of adequate computational tools for extracting information from existing data and performing inference for complex models. A major bottleneck in the application of probabilistic models to biology is that their calibration is computationally expensive and in many instances not possible with current techniques. Thus, researchers often prefer to use simple summary statistics to characterize the underlying biological process; this approach is unsatisfactory. For example, topological summary statistics capture basic characteristics of binary interaction networks, but they are affected by several types of bias, severe enough that caution must be taken when drawing conclusions. A major challenge for systems biology today is to develop statistical and bioinformatics tools, within the rigorous framework of probabilistic modelling, that will allow for a better and more comprehensive understanding of cellular functions.

In the last few decades a wealth of research has been performed on model-based inference for molecular data, accompanied by an explosion of research in developing computationally efficient methods to facilitate it. Broadly speaking, there are three main approaches to statistical inference in molecular biology: (i) importance sampling (IS) for likelihood evaluation, (ii) Markov chain Monte Carlo (MCMC) methods and (iii) Approximate Bayesian Computation (ABC). In this proposal we concentrate on a combination of advanced IS (or, more generally, Sequential Monte Carlo (SMC)) and MCMC methods focussed upon Markov models in genetics and bioinformatics. In particular, building upon these techniques, we aim to develop a general framework for approximate inference which is theoretically sound, computationally feasible and still able to accurately reflect the complexity of the underlying stochastic model. The main life science applications on which we will concentrate are genealogical trees, protein networks and phylogenetic trees.

Planned Impact

The project will develop new methods for inference from post-genomic data. Its benefits will be in the greater interpretability of statistical models derived from such data, generating new biological insights and hypotheses from existing and future projects. The methods will be equally applicable throughout post-genomic science and thus many groups of scientists throughout academia, government, the biomedical community and industry will gain from the new techniques. A comprehensive and multi-faceted approach to feedback and engagement with the public and relevant researchers will be employed by the team, in collaboration with UCL Media Relations, which has regular contact with local and national media. The results of our research will be widely disseminated to the public (e.g. via the UCL central press office). All aspects of the work will be published in peer-reviewed journals and disseminated via presentations at appropriate conferences and seminars. We will also target publications relating to biology generally in the relevant industrial and more popular press, and we will use our previous experience in providing briefings to the press on complex scientific issues.
The non-academic beneficiaries of the project will range across many groups:
(i) Scientists from government agencies, non-government organisations (e.g. charities) and industries will benefit from an improved ability to analyse data from post-genomic investigations, enabling them to fulfill their research duties.
(ii) Scientists from the chemical, pharmaceutical and food industries, who invest in new technologies and employ 'omic' methods in their daily research, will have improved data analysis tools available. This will enhance the competitiveness of these industries in developing new products such as drugs.
(iii) Clinical practitioners will benefit in the long run from a better understanding of biological mechanisms underlying disease, which has profound health and wealth implications. UCL has committed to developing and supporting the highest quality of interdisciplinary and translational research.
(iv) The general public is becoming more aware of post-genomic technologies/projects (e.g. the human genome project) and this research will make an excellent example of how the new technologies can be combined with advanced computing methods to generate new biological insights.

A great opportunity for dissemination is offered by graduate courses such as the MSc at UCL and NUS. We will create a project website to disseminate our methodological development and a suite of statistical and visualization tools accessible to the widest possible audience. Extensive documentation will be provided for the web resources produced, including a user manual and integrated help pages. A mailing list will be established (allowing researchers to register interest in the project) and this will be used to provide updates on data availability during the lifetime of the project, as well as changes/improvements. The number of hits will be monitored and reported, giving us a measure of the success of these outreach activities. A workshop will be held at UCL at the end of the project, focusing on exchanging knowledge between project participants and the wider scientific community. Any potential commercial exploitation of the outcomes will be examined in collaboration with UCL Consultants Ltd., who can give expert advice on, for example, patent applications, software licensing, material transfer and other legal matters.

Publications

Heine K (2018) Bridging trees for posterior inference on ancestral recombination graphs. in Proceedings. Mathematical, physical, and engineering sciences

Jasra A (2015) Bayesian inference for duplication-mutation with complementarity network models. in Journal of computational biology : a journal of computational molecular cell biology

Persing A (2015) A simulation approach for change-points on phylogenetic trees. in Journal of computational biology : a journal of computational molecular cell biology

Wang J (2014) Computational methods for a class of network models. in Journal of computational biology : a journal of computational molecular cell biology

 
Description We have developed novel statistical and computational methods for the analysis of phylogenetic trees (which describe how species relate to each other), for network models (which describe how different biological processes relate to each other) and for the ancestral recombination graph. We have provided new computational solutions to these inference problems, some based on the originally proposed stopping principle and some on novel ideas based on the bridging of trees. The methods developed have been tested in simulations and on real data applications. The types of model considered are very popular in applied sciences such as biology, and the computational ideas developed can be easily extended to other fields. We have also developed a software package, ARBORES, which is publicly available.
Exploitation Route A general academic audience has access to our papers, and there are opportunities to extend our inferential frameworks to other areas. We have presented our results at internal conferences and have organised a workshop to engage an academic audience. During the life of the project we have provided software which implements our methods.
Sectors Environment, Healthcare, Pharmaceuticals and Medical Biotechnology, Other

 
Description Knowledge Discovery via Dependency Networks
Amount $595,000 (SGD)
Organisation Government of Singapore 
Department Ministry of Education
Sector Public
Country Singapore
Start 08/2020 
End 07/2023
 
Title Methods for Protein Interaction Networks 
Description We observe an undirected graph G without multiple edges and self-loops, which is to represent a protein-protein interaction (PPI) network. We assume that G evolved under the duplication-mutation with complementarity (DMC) model from a seed graph, G0, and we also observe the binary forest Γ that represents the duplication history of G. A posterior density for the DMC model parameters is established, and we outline a sampling strategy by which one can perform Bayesian inference; that sampling strategy employs a particle marginal Metropolis-Hastings (PMMH) algorithm. We test our methodology on numerical examples to demonstrate a high accuracy and precision in the inference of the DMC model's mutation and homodimerization parameters.
Type Of Material Computer model/algorithm 
Year Produced 2015 
Provided To Others? Yes  
Impact The paper has been published. We propose a novel algorithm in the context of the duplication-mutation model which is efficient and allows for full posterior inference. 
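To make the generative model concrete, the following is a minimal sketch of forward simulation under one common formulation of the DMC model. The function name `simulate_dmc`, the parameter names `q_mod` (mutation) and `q_con` (homodimerization), and the two-node seed graph are illustrative assumptions, not taken from the published method; the sketch covers only simulation, not the posterior inference described above.

```python
import random


def simulate_dmc(n_nodes, q_mod, q_con, seed=None):
    """Grow a graph to n_nodes under an assumed DMC formulation.

    Returns an adjacency dict {node: set_of_neighbours}.
    q_mod and q_con are hypothetical names for the mutation and
    homodimerization probabilities.
    """
    rng = random.Random(seed)
    # Seed graph G0: two nodes joined by one edge (an illustrative choice).
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_nodes:
        new = len(adj)
        anchor = rng.randrange(new)  # node chosen uniformly for duplication
        # Duplication: the copy inherits every neighbour of the anchor.
        adj[new] = set(adj[anchor])
        for nb in adj[new]:
            adj[nb].add(new)
        # Mutation with complementarity: for each shared neighbour, with
        # probability q_mod delete the edge from exactly one of the two
        # copies, so the neighbour always keeps at least one of the edges.
        for nb in list(adj[new]):
            if rng.random() < q_mod:
                loser = anchor if rng.random() < 0.5 else new
                adj[loser].discard(nb)
                adj[nb].discard(loser)
        # Homodimerization: link the anchor and its copy with probability q_con.
        if rng.random() < q_con:
            adj[anchor].add(new)
            adj[new].add(anchor)
    return adj
```

For example, `simulate_dmc(50, q_mod=0.4, q_con=0.3, seed=1)` returns a 50-node graph with no self-loops or multiple edges. The order in which nodes are duplicated is what a duplication-history forest would record.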
 
Title Network Model 
Description In this work we provide an exposition of exact computational methods to perform parameter inference from partially observed network models. In particular, we consider the duplication attachment (DA) model, which has a likelihood function that typically cannot be evaluated in any reasonable computational time. We consider a number of importance sampling (IS) and sequential Monte Carlo (SMC) methods for approximating the likelihood of the network model for a fixed parameter value. It is well known that for IS, the relative variance of the likelihood estimate typically grows at an exponential rate in the time parameter (here this is associated with the size of the network): we prove that, under assumptions, the SMC method will have relative variance which can grow only polynomially. In order to perform parameter estimation, we develop particle Markov chain Monte Carlo (PMCMC) algorithms to perform Bayesian inference. Such algorithms use the aforementioned SMC algorithms within the transition dynamics. The approaches are illustrated numerically.
Type Of Material Data analysis technique 
Year Produced 2014 
Provided To Others? Yes  
Impact The work has been published and will be presented at international conferences 
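The combination of an SMC likelihood estimate with an MCMC update can be illustrated on a much simpler problem than a network model. The sketch below is an assumption-laden toy: it uses a scalar linear-Gaussian state-space model, a bootstrap particle filter and a random-walk particle marginal Metropolis-Hastings update for a single parameter `phi` with a uniform prior on (-1, 1). All function names and the model itself are hypothetical and are not the algorithms of the published work.

```python
import math
import random


def simulate_data(phi, n_obs, rng):
    """Draw observations from the toy model
    x_0 ~ N(0, 1), x_t = phi * x_{t-1} + N(0, 1), y_t = x_t + N(0, 1)."""
    x = rng.gauss(0.0, 1.0)
    ys = []
    for _ in range(n_obs):
        x = phi * x + rng.gauss(0.0, 1.0)
        ys.append(x + rng.gauss(0.0, 1.0))
    return ys


def particle_loglik(ys, phi, n_particles, rng):
    """Bootstrap particle filter estimate of log p(ys | phi)."""
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    loglik = 0.0
    for y in ys:
        # Propagate every particle through the state transition.
        particles = [phi * x + rng.gauss(0.0, 1.0) for x in particles]
        # Weight by the observation density N(y; x, 1), in log space.
        logw = [-0.5 * (y - x) ** 2 - 0.5 * math.log(2.0 * math.pi)
                for x in particles]
        top = max(logw)
        w = [math.exp(lw - top) for lw in logw]
        # Accumulate the log of the average unnormalised weight.
        loglik += top + math.log(sum(w) / n_particles)
        # Multinomial resampling keeps the particle system from degenerating.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return loglik


def pmmh(ys, n_iters, n_particles=100, step=0.1, seed=0):
    """Random-walk PMMH for phi under a uniform prior on (-1, 1)."""
    rng = random.Random(seed)
    phi = 0.0
    ll = particle_loglik(ys, phi, n_particles, rng)
    chain = []
    for _ in range(n_iters):
        prop = phi + rng.gauss(0.0, step)
        if -1.0 < prop < 1.0:  # proposals outside the prior support are rejected
            ll_prop = particle_loglik(ys, prop, n_particles, rng)
            # Symmetric proposal and flat prior: the acceptance ratio is the
            # ratio of the two estimated likelihoods.
            if rng.random() < math.exp(min(0.0, ll_prop - ll)):
                phi, ll = prop, ll_prop
        chain.append(phi)
    return chain
```

A typical call is `pmmh(simulate_data(0.8, 30, random.Random(1)), n_iters=200, n_particles=50)`. The stored estimate `ll` for the current state is reused rather than recomputed at each iteration, which is the feature that distinguishes the pseudo-marginal construction from a naive plug-in approach.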
 
Title Phylogenetic Trees
Description We observe n sequences at each of m sites, and assume that they have evolved from an ancestral sequence that forms the root of a binary tree of known topology and branch lengths, but the sequence states at internal nodes are unknown. The topology of the tree and branch lengths are the same for all sites, but the parameters of the evolutionary model can vary over sites. We assume a piecewise constant model for these parameters, with an unknown number of change-points and hence a trans-dimensional parameter space over which we seek to perform Bayesian inference. We propose two novel ideas to deal with the computational challenges of such inference. Firstly, we approximate the model based on the time machine principle: the top nodes of the binary tree (near the root) are replaced by an approximation of the true distribution; as more nodes are removed from the top of the tree, the cost of computing the likelihood is reduced linearly in n. The approach introduces a bias, which we investigate empirically. Secondly, we develop a particle marginal Metropolis-Hastings (PMMH) algorithm that employs a sequential Monte Carlo (SMC) sampler and can use the first idea. Our time-machine PMMH algorithm copes well with one of the bottlenecks of standard computational algorithms: the trans-dimensional nature of the posterior distribution. The algorithm is implemented on simulated and real data examples, and we empirically demonstrate its potential to outperform competing methods based on approximate Bayesian computation (ABC) techniques.
Type Of Material Data analysis technique 
Year Produced 2014 
Provided To Others? Yes  
Impact The work has been published and will be presented at international conferences 
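For a tree of known topology and branch lengths with unobserved internal states, the likelihood at a single site is commonly computed by a recursion from the leaves to the root. The sketch below assumes Felsenstein's pruning recursion and a Jukes-Cantor substitution model with a uniform root distribution; neither is named in the description above, and the tuple-based tree encoding and function names are hypothetical. It shows only the exact single-site computation, not the time machine approximation or the change-point model.

```python
import math

STATES = "ACGT"


def jc69_prob(same, t):
    """Jukes-Cantor transition probability over a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return [P(data below node | state at node = s) for s in STATES].

    A leaf is a one-character string; an internal node is a tuple
    (left, left_branch_length, right, right_branch_length).
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    left, t_left, right, t_right = node
    below_left = partial_likelihoods(left)
    below_right = partial_likelihoods(right)
    out = []
    for i in range(4):
        # Sum over the unknown state at the far end of each child branch.
        sum_left = sum(jc69_prob(i == j, t_left) * below_left[j]
                       for j in range(4))
        sum_right = sum(jc69_prob(i == j, t_right) * below_right[j]
                        for j in range(4))
        out.append(sum_left * sum_right)
    return out


def site_likelihood(tree):
    """Likelihood of one site, averaging over a uniform root state."""
    return sum(0.25 * p for p in partial_likelihoods(tree))
```

For instance, `site_likelihood((("A", 0.1, "C", 0.2), 0.3, "A", 0.4))` evaluates a three-leaf tree. With independent sites, the log-likelihood of an alignment is the sum of the per-site log-likelihoods, which is where site-specific parameters would enter.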
 
Description Collaboration with National University of Singapore 
Organisation National University of Singapore
Country Singapore 
Sector Academic/University 
PI Contribution We have weekly teleconferences with Dr Jasra to discuss the research and have already published 3 papers. The postdoctoral RA Adam Persing has visited the Dept of Statistics at NUS twice already, to work on specific aspects of the project. The PI Maria De Iorio visited the Dept of Statistics at NUS in June 2015.
Collaborator Contribution Dr Jasra has visited the Dept of Statistical Science at UCL several times since the beginning of the project to work on the science: Dec 2013, April 2014, July 2014, Dec 2014, May 2015, Aug 2015.
Impact 3 publications
Start Year 2013
 
Title ARBORES 
Description The software ARBORES implements Markov chain Monte Carlo algorithms for simulating ancestral recombination graphs from a Bayesian posterior distribution for given DNA polymorphism data.
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Academic and non-academic users can now download the software and employ our method to perform ancestral inference. Visualization and analysis tools are also provided.
URL https://github.com/heinekmp/Arbores
 
Description 1 day RSS workshop on particle filters. 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at a highly specialised workshop. Generated research discussion with peers.
Year(s) Of Engagement Activity 2014
 
Description ERCIM Conference 
Form Of Engagement Activity Scientific meeting (conference/symposium etc.)
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact A. Beskos will present the results of the research as an invited speaker at the 7th International Conference of the ERCIM WG on Computational and Methodological Statistics (ERCIM 2014), University of Pisa, Italy, 6-8 December 2014.
Year(s) Of Engagement Activity 2014
 
Description Greek Stochastics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at a conference. Generated discussion and further interest
Year(s) Of Engagement Activity 2014
 
Description Greek Stochastics 
Form Of Engagement Activity Scientific meeting (conference/symposium etc.)
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Dr Ajay Jasra is an invited speaker at the Greek Stochastics Conference, Networks: Theory, Methods and Applications, Athens, Greece, 20-22 December 2014. He will give a talk on network results from our project.
Year(s) Of Engagement Activity 2014
 
Description INI 2014 programme on recent advances in Monte Carlo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk at a conference. Generated discussion with peers.
Year(s) Of Engagement Activity 2014
 
Description ISBA 2016 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Dr Heine gave the talk "Tree Bridging Markov Chain Monte Carlo for Ancestral inference" at the ISBA World Meeting, June 2016.
Year(s) Of Engagement Activity 2016
 
Description Imperial Conference
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The Postdoctoral researcher attended the "Computational Methods Workshop for Massive/Complex Data" at Imperial in June 2014. The aim of the workshop was to bring together experts on state-of-the-art data analysis (data mining/machine learning/(bio)statistics) of large-scale datasets and on computational methods. The RA benefitted from attending talks, and exchanging ideas with other researchers.

The postdoc got useful input for future research directions
Year(s) Of Engagement Activity 2014
 
Description Invited Seminar at Collegio Carlo Alberto, Turin, Italy: Sequential Monte Carlo Samplers for Applications in High Dimensions 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Alexandros Beskos presented outcomes of research he has been involved with to a group of scientists in the area of Statistics based at the University of Turin. The talk attracted the interest of the participants and sparked questions and discussion.
Year(s) Of Engagement Activity 2014
 
Description Invited Seminar at the Medical Research Council Biostatistics Unit, Cambridge: Sequential Monte Carlo methods: Some Developments and Applications
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk involved disseminating research on advanced computational statistics methods to researchers in applied medical or epidemiological statistics at the Medical Research Council Biostatistics Unit in Cambridge.
Year(s) Of Engagement Activity 2014
 
Description Invited talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Dr Kari Heine was an invited speaker at the workshop, where he disseminated the main results of the grant. This generated discussion with leading scientists in the field.
Year(s) Of Engagement Activity 2016
 
Description Oxford Conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The postdoctoral research assistant attended the "Biological Sequence Analysis and Probabilistic Models" meeting at Oxford University in July 2014. The focus of the meeting was methods and inference in population genetics, phylogenetics and functional genomics. The meeting brought together leaders from several areas of biological sequence analysis, with an emphasis on advancing the underlying theoretical models that many problems share. This was an important opportunity for the RA, as he could get a grasp of the most interesting problems for biologists and the inferential methods used to solve such problems.

The knowledge acquired at the conference gave us an idea for a new research direction regarding the Ancestral Recombination Graph.
Year(s) Of Engagement Activity 2014
 
Description Poster - Newcastle 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster at the European Mathematical Genetics Meeting 2016, Newcastle University.
Year(s) Of Engagement Activity 2016
 
Description UCL Workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Outreach activity. We organised a workshop, "Advanced Computational Methods for Complex Models in Biology", at UCL to present the main results of the research. Leading scientists accepted our invitation to give talks. It was a successful day that generated interesting discussions.
Year(s) Of Engagement Activity 2016