Statistical and Computational Challenges in High-dimensional Data Analysis

Lead Research Organisation: University of Cambridge
Department Name: Pure Maths and Mathematical Statistics

Abstract

We are living in an age of information: scientists, businesses and governments are collecting datasets of unprecedented size and complexity at an ever-increasing rate, with the hope of using statistics to discover patterns and help inform decisions that will shape the future of our society. Typically, datasets consist of observations (e.g. patients) on which a number of variables have been measured (e.g. height, weight). Whilst modern datasets can have many observations, the trend today is towards datasets with a very large number of variables. This is particularly true in genomics, where scientific advances have allowed researchers to collect detailed genetic information on patients amounting to thousands or even hundreds of thousands of variables. More generally, automated data collection has given rise to so-called high-dimensional datasets across a variety of disciplines. For example, in healthcare analytics, aspects of a patient's history can give rise to datasets with a huge number of variables indicating which combinations of drugs were prescribed at particular times.

The field of high-dimensional statistics is a response to the challenges posed by these sorts of datasets which often render infeasible more traditional approaches designed for settings with only a handful of carefully chosen variables. Whilst much progress has been made, there remain several challenges, and this proposal will address some key outstanding methodological problems. Our methods will be applicable in a wide variety of settings, but two areas of application we will explore in collaboration are genomics and healthcare analytics. Our proposal consists of three projects which are described below.

Often, along with the variables measured on a number of observations, we have an outcome or response of interest whose relationship with the variables we wish to learn from the data. In many cases this relationship can be complex and depend on interactions between several groups of variables. Searching for combinations of variables that only together contribute to the response presents a serious computational challenge, as the number of subsets of variables to search through grows rapidly with the size of the subset. Even examining interacting pairs of variables can be computationally infeasible when the number of variables is in the tens of thousands. A key contribution of our research will be to develop new methods that can scale efficiently to capture high-order interactions in high-dimensional data.
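
To give a rough sense of this combinatorial explosion, the short R sketch below simply counts the candidate interactions of each order for a hypothetical 20,000-variable dataset; the figure of 20,000 is illustrative only and is not taken from the proposal.

    # The number of candidate interactions of order k among p variables is choose(p, k).
    p <- 20000                     # hypothetical number of variables
    for (k in 1:4) {
      cat(sprintf("order-%d interactions: %.3g\n", k, choose(p, k)))
    }
    # An exhaustive pairwise (k = 2) scan already needs around 2e8 evaluations,
    # and k = 3 exceeds 1e12, which is why naive search does not scale.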

Uncertainty quantification for high-dimensional data, for instance producing p-values quantifying the significance of variables in determining the response, is crucial in order to avoid drawing false conclusions from data. However, research on this important topic is still in its infancy, with many existing approaches highly unstable in practical settings. Our proposal will develop new robust and computationally efficient methods for p-value construction and other forms of uncertainty quantification for a variety of models.

In some settings we do not have a distinguished response but rather would like to understand relationships between the variables themselves. Graphical models provide a useful way to model such dependencies, but the available methods are often not scalable to the size of datasets now faced by many practitioners. We will use new computational techniques to develop randomised algorithms that avoid explicitly assessing each pair of variables to determine their relationship, but can still deliver estimates of the strongest dependencies. The method will have broad applicability; with biological data, for example, it can help to learn the network of dependencies governing the underlying biological processes.
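
For context, the R sketch below shows the naive baseline that such randomised algorithms are designed to avoid: screening every one of the p(p-1)/2 pairs of variables for dependence (here via simple sample correlations on synthetic data). This is purely an assumed illustration of the quadratic cost, not the proposed methodology.

    set.seed(1)
    n <- 100; p <- 2000                        # hypothetical data dimensions
    X <- matrix(rnorm(n * p), n, p)
    C <- cor(X)                                # computes all p(p-1)/2 pairwise correlations
    pairs <- which(upper.tri(C), arr.ind = TRUE)
    vals  <- C[upper.tri(C)]                   # same (column-major) order as 'pairs'
    top10 <- order(abs(vals), decreasing = TRUE)[1:10]
    # The ten most strongly correlated variable pairs:
    print(data.frame(i = pairs[top10, 1], j = pairs[top10, 2], cor = vals[top10]))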

Planned Impact

High-dimensional data are becoming increasingly common across a range of scientific disciplines and industry sectors. This trend is set to continue, with new technologies and automated data collection strategies enabling a greater number of measurements or properties to be recorded on each observation. Progress on the core methodological problems of interaction detection, nonparametric modelling and uncertainty quantification, which this proposal will tackle, therefore has the potential for wide impact. Software implementations of the methods we develop will be made publicly available to enable rapid deployment on real-world problems. In order to ensure the practical relevance of our research and the usability of our software, we will work closely with industry and academic partners.

Among the sciences, genomics is perhaps the biggest consumer of methodology for high-dimensional data. As part of our proposal, we will apply our methods to a number of biological datasets in collaboration with Sylvia Richardson's group at the MRC Biostatistics Unit, with the aim of advancing understanding of cancer subtypes. In addition, work on detecting interactions between transcription factors in collaboration with Sumanta Basu at Cornell University will help to better understand the regulatory architecture of transcriptional machinery. These applications will also serve to disseminate our methods more widely to the genomics community, which faces a number of key challenges addressed by our work including the detection of gene-gene interactions and gene-environment interactions. In the long-term, improved understanding of biological processes, enabled in part by our new methodology, may help to guide future drug development.

More generally, there are a variety of academic disciplines routinely using high-dimensional statistics that stand to benefit from our research. For example, the PI's previous research in this area has been used in areas ranging from ecology to neuroscience, in addition to genomics.

Industry sectors where analysis of high-dimensional data is a familiar challenge include healthcare, e-commerce, finance and fraud detection, to name just a few. As a concrete example, the predictive analytics team at QuintilesIMS studies datasets with large numbers of variables encoding detailed patient histories in order to find undiagnosed patients with rare diseases. By working closely with their team we will be able to tailor our research and software to meet their needs and guide the application of our methods to these problems. The improved predictive capabilities of the resulting models will contribute to increasing the diagnosis rate among those suffering from these rare diseases.
 
Description Determining whether one input variable is important for predicting another after accounting for potential confounders, formally known as testing for conditional independence, is a central problem in many areas of statistics such as regression analysis and causal inference. However, we show that this problem of fundamental importance is also fundamentally impossible, in the sense that there exist no non-trivial tests for conditional independence. The conclusion is that further modelling assumptions are always required to perform the test.

On the positive side, we show that certain classical tests, with some small modifications, are valid for conditional independence testing beyond the restrictive models for which they were designed. We also argue that in many cases it may be convenient to frame the modelling assumptions required for conditional independence testing in terms of the performance of user-chosen machine learning methods for predicting the variables of interest given the confounders. We develop a testing framework that allows for this, and provide formal guarantees for its validity. (A small illustrative sketch of this residual-based idea appears after this entry.)
Exploitation Route Conditional independence testing is ubiquitous in many areas of science and industry. The tests developed in our work can be used, for example, for causal discovery, for falsifying a proposed causal mechanism for generating data, or for selecting variables in a regression analysis.
Sectors Environment; Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology
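
The following R sketch illustrates the residual-based idea on synthetic data, using plain linear-model regressions as the user-chosen prediction method; it is an assumed, simplified illustration rather than the actual procedure or guarantees developed in this work.

    # Test whether X is independent of Y given Z by regressing each on Z and
    # checking whether the product of the residuals has mean zero.
    set.seed(1)
    n <- 500
    Z <- rnorm(n)
    X <- Z + rnorm(n)                       # X depends on Z only
    Y <- Z^2 + rnorm(n)                     # Y depends on Z only, so the null holds
    rX <- resid(lm(X ~ Z))                  # residuals from predicting X given Z
    rY <- resid(lm(Y ~ poly(Z, 2)))         # residuals from predicting Y given Z
    R  <- rX * rY
    Tstat <- sqrt(n) * mean(R) / sd(R)      # approximately N(0, 1) under the null
    p.value <- 2 * pnorm(-abs(Tstat))
    cat("test statistic:", Tstat, " p-value:", p.value, "\n")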

 
Description The conditional independence test, the "generalised covariance measure", developed as part of this work has been used by A.P. Moller Maersk to test the validity of structural causal models used in pricing.
First Year Of Impact 2021
Sector Transport
Impact Types Economic

 
Title GRPtests 
Description Methodology for testing nonlinearity in the conditional mean function in low- or high-dimensional generalized linear models, and the significance of (potentially large) groups of predictors. Details on the algorithms can be found in the paper by Jankova, Shah, Buehlmann and Samworth (2019). 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Too early to say 
URL https://CRAN.R-project.org/package=GRPtests
 
Title GeneralisedCovarianceMeasure: Test for Conditional Independence Based on the Generalized Covariance Measure (GCM) 
Description A statistical hypothesis test for conditional independence. It performs nonlinear regressions of the variables of interest on the conditioning variable and then tests for a vanishing covariance between the resulting residuals. It can be applied to both univariate random variables and multivariate random vectors. Details of the method can be found in Rajen D. Shah and Jonas Peters (2018). (A brief usage sketch follows this entry.) 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Used by A.P. Moller Maersk in testing whether structural causal models relating to pricing can be falsified. 
URL https://CRAN.R-project.org/package=GeneralisedCovarianceMeasure
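
A brief usage sketch on synthetic data is given below. It assumes the package's exported test function is gcm.test, taking the two variables of interest and the conditioning variable as its first three arguments; the exact argument names, regression-method options and returned fields should be checked against the package's CRAN reference manual.

    # install.packages("GeneralisedCovarianceMeasure")   # available from CRAN
    library(GeneralisedCovarianceMeasure)

    set.seed(1)
    n <- 300
    Z <- matrix(rnorm(n), ncol = 1)         # conditioning variable
    X <- Z[, 1] + 0.3 * rnorm(n)            # depends on Z only
    Y <- Z[, 1]^2 + 0.3 * rnorm(n)          # depends on Z only, so X and Y are
                                            # conditionally independent given Z
    # The test regresses X and Y on Z and checks whether the covariance of the
    # resulting residuals vanishes; the returned object includes a p-value.
    res <- gcm.test(X, Y, Z)
    print(res)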
 
Title dipw: Debiased Inverse Propensity Score Weighting 
Description Estimation of the average treatment effect when controlling for high-dimensional confounders using debiased inverse propensity score weighting (DIPW). DIPW relies on the propensity score following a sparse logistic regression model, but the regression curves are not required to be estimable. Despite this, our package also allows users to estimate the regression curves and take the estimated curves as input to our methods. Details of the methodology can be found in Yuhao Wang and Rajen D. Shah (2020) "Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders". (A short sketch of plain inverse propensity weighting, for orientation, follows this entry.) 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Too early to say. 
URL https://cran.r-project.org/package=dipw
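
For orientation only, the R sketch below implements plain (non-debiased) inverse propensity weighting of the average treatment effect on synthetic low-dimensional data; the dipw package's debiased, high-dimensional methodology is not reproduced here, and the data-generating choices are purely illustrative.

    set.seed(1)
    n <- 2000
    Z <- matrix(rnorm(n * 5), n, 5)                        # confounders
    pscore <- plogis(drop(Z %*% c(1, -1, 0.5, 0, 0)))      # true propensity score
    A <- rbinom(n, 1, pscore)                              # treatment assignment
    Y <- drop(2 * A + Z %*% c(1, 1, 0, 0, 0) + rnorm(n))   # true treatment effect is 2

    phat <- fitted(glm(A ~ Z, family = binomial))          # estimated propensity scores
    ate_ipw <- mean(A * Y / phat - (1 - A) * Y / (1 - phat))
    cat("plain IPW estimate of the average treatment effect:", ate_ipw, "\n")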
 
Title ghcm: Functional Conditional Independence Testing with the GHCM 
Description A statistical hypothesis test for conditional independence. Given residuals from a sufficiently powerful regression, it tests whether the covariance of the residuals is vanishing. It can be applied to both discretely-observed functional data and multivariate data. Details of the method can be found in Anton Rask Lundborg, Rajen D. Shah and Jonas Peters (2020). 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact Too early to say. 
URL https://CRAN.R-project.org/package=ghcm