Learning and Analysing Discrete Geometric Structure in Statistical Models

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

The quantity and dimensionality of genetic data have been rapidly increasing in recent decades. Phylogenetic trees are a popular tool for summarising the underlying mutations inferred from genetic data, but we lack a rigorous statistical framework with which to study large tree data. The main existing tools are the Robinson-Foulds metric and BHV tree space. The Robinson-Foulds metric is particularly popular for its computational efficiency but lacks sensitivity; BHV space, by contrast, can fully capture the rich geometry of tree space, but calculating distances in it is computationally intensive.
In 2004, Speyer and Sturmfels established an equivalence between tree space and the tropical Grassmannian. This formulation embeds tree space in the tropical projective torus via the tropical Plücker relations. The tropical projective torus is a Banach space whose dimensionality increases quadratically with the number of leaves, providing a computationally tractable ambient space for trees. The geometry of tree space within the tropical projective torus is also well studied from an algebraic and geometric perspective. More recently, this tropical tree space has been studied for its statistical potential, and initial investigations using influenza data have shown it to offer more efficient statistical summaries of tree data than BHV space.
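As a rough illustration of why this ambient space is computationally tractable, the sketch below flattens the pairwise leaf distances of a tree into a vector with n(n-1)/2 coordinates and compares two such vectors using the tropical metric, d(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i). The function names and the two four-leaf toy trees are assumptions made for illustration only; they are not taken from the project.

```python
from itertools import combinations


def tree_to_vector(dist, leaves):
    """Flatten pairwise leaf distances into a vector of length n*(n-1)/2.

    `dist` maps frozenset({leaf_a, leaf_b}) to the path length between
    the two leaves. The coordinate order follows combinations(leaves, 2).
    """
    return [dist[frozenset(pair)] for pair in combinations(leaves, 2)]


def tropical_distance(u, v):
    """Tropical metric: max minus min of the coordinatewise differences.

    Adding the same constant to every coordinate of u (or of v) leaves
    the value unchanged, so this is well defined on the quotient by
    scalar shifts, i.e. on the tropical projective torus.
    """
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


# Hypothetical toy data: two four-leaf trees with unit edge lengths.
leaves = ["A", "B", "C", "D"]

# Tree 1 groups (A, B) against (C, D).
tree1 = {frozenset(p): 3 for p in combinations(leaves, 2)}
tree1[frozenset(("A", "B"))] = 2
tree1[frozenset(("C", "D"))] = 2

# Tree 2 groups (A, C) against (B, D).
tree2 = {frozenset(p): 3 for p in combinations(leaves, 2)}
tree2[frozenset(("A", "C"))] = 2
tree2[frozenset(("B", "D"))] = 2

u = tree_to_vector(tree1, leaves)
v = tree_to_vector(tree2, leaves)
d = tropical_distance(u, v)
```

Computing the distance costs one pass over the coordinates, so the work grows quadratically with the number of leaves.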
This project will focus on establishing the probabilistic groundwork on the tropical projective torus and tropical tree space to develop a rich statistical theory for tree datasets. We aim to formalise probabilistic distances for measures on different spaces, allowing us to compare tree datasets with different taxa. We will also study the behaviour of Fréchet means on the tropical projective torus, both in terms of their intersection with tropical tree space and their limiting behaviour for empirical measures.
We hope to carry this work through to practical application, establishing a methodology for hypothesis testing for the comparison of different tree datasets. We will be able to use these novel methods to study the clonal evolution of acute myeloid leukaemia using data from the Haematopoietic Stem Cell Laboratory at the Francis Crick Institute. The study of this data will highlight the impact of this research by establishing a rigorous and detailed statistical study of the evolutionary patterns exhibited by cancers.
This project falls within the EPSRC Mathematical Biology and Statistics and Applied Probability research areas. The project is supervised by Dr Anthea Monod (Imperial College London) and Prof. Mathias Drton (Technical University of Munich). It is part of the ICL-TUM Joint Academy of Doctoral Studies, which fosters collaboration between our two research groups. We also have the support of Dominique Bonnet at the Francis Crick Institute, whose Haematopoietic Stem Cell Laboratory has pledged proprietary data on the clonal evolution of AML.

Planned Impact

Probabilistic modelling permeates the financial services, healthcare, technology and other service industries crucial to the UK's continuing social and economic prosperity. These industries are major users of stochastic algorithms for data analysis, simulation, systems design and optimisation. There is a major and growing shortage of experts in this area, and the UK's success in addressing this shortage in cross-disciplinary research and industry expertise in computing, analytics and finance will directly affect the international competitiveness of UK companies and the quality of services delivered by government institutions.
By training highly skilled experts equipped to build, analyse and deploy probabilistic models, the CDT in Mathematics of Random Systems will contribute to:
- sharpening the UK's research lead in this area, and
- meeting the needs of industry across the technology, finance, government and healthcare sectors.

MATHEMATICS, THEORETICAL PHYSICS and MATHEMATICAL BIOLOGY

The explosion of novel research areas in stochastic analysis requires the training of young researchers capable of facing the new scientific challenges and maintaining the UK's lead in this area. The partners are at the forefront of many recent developments and ideally positioned to successfully train the next generation of UK scientists for tackling these exciting challenges.
The theory of regularity structures, pioneered by Hairer (Imperial), has generated a ground-breaking approach to singular stochastic partial differential equations (SPDEs) and opened the way to solve longstanding problems in physics of random interface growth and quantum field theory, spearheaded by Hairer's group at Imperial. The theory of rough paths, initiated by TJ Lyons (Oxford), is undergoing a renewal spurred by applications in Data Science and systems control, led by the Oxford group in conjunction with Cass (Imperial). Pathwise methods and infinite dimensional methods in stochastic analysis with applications to robust modelling in finance and control have been developed by both groups.
Applications of probabilistic modelling in population genetics, mathematical ecology and precision healthcare are active areas in which our groups have recognised expertise.

FINANCIAL SERVICES and GOVERNMENT

The large-scale computerisation of financial markets and retail finance and the advent of massive financial data sets are radically changing the landscape of financial services. This requires new profiles of experts with strong analytical and computing skills as well as familiarity with Big Data analysis and data-driven modelling, a combination not matched by current MSc and PhD programmes. Financial regulators (Bank of England, FCA, ECB) are investing in analytics and modelling to face this challenge. We will develop a novel training and research agenda adapted to these needs by leveraging the considerable expertise of our teams in quantitative modelling in finance and our extensive experience in partnerships with financial institutions and regulators.

DATA SCIENCE

Probabilistic algorithms, such as stochastic gradient descent and Monte Carlo Tree Search, underlie the impressive achievements of Deep Learning methods. Stochastic control provides the theoretical framework for understanding and designing Reinforcement Learning algorithms. A deeper understanding of these algorithms can pave the way to designing improved algorithms with higher predictability and 'explainable' results, which are crucial for applications.
We will train experts who can blend a deeper understanding of algorithms with knowledge of the application at hand to go beyond pure data analysis and develop data-driven models and decision aid tools.
There is a high demand for such expertise in technology, healthcare and finance sectors and great enthusiasm from our industry partners. Knowledge transfer will be enhanced through internships, co-funded studentships and paths to entrepreneurs

People

ORCID iD

Roan Talbut (Student)

Publications


Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S023925/1      |              |              | 01/04/2019 | 30/09/2027 |
2602130           | Studentship  | EP/S023925/1 | 01/10/2021 | 30/09/2025 | Roan Talbut