Statistical and Computational Challenges in High-dimensional Data Analysis

Lead Research Organisation: University of Cambridge
Department Name: Pure Maths and Mathematical Statistics

Abstract

We are living in an age of information: scientists, businesses and governments are collecting datasets of unprecedented size and complexity at an ever-increasing rate, with the hope of using statistics to discover patterns and help inform decisions that will shape the future of our society. Typically, datasets consist of observations (e.g. patients) on which a number of variables have been measured (e.g. height, weight). Whilst modern datasets can have many observations, the trend today is towards datasets with a very large number of variables. This is particularly true in genomics, where scientific advances have allowed researchers to collect detailed genetic information on patients amounting to thousands or even hundreds of thousands of variables. More generally, automated data collection has given rise to so-called high-dimensional datasets across a variety of disciplines. For example, in healthcare analytics, aspects of a patient's history can give rise to datasets with a huge number of variables indicating which combinations of drugs were prescribed at particular times.

The field of high-dimensional statistics is a response to the challenges posed by these sorts of datasets which often render infeasible more traditional approaches designed for settings with only a handful of carefully chosen variables. Whilst much progress has been made, there remain several challenges, and this proposal will address some key outstanding methodological problems. Our methods will be applicable in a wide variety of settings, but two areas of application we will explore in collaboration are genomics and healthcare analytics. Our proposal consists of three projects which are described below.

Often, along with the variables measured on a number of observations, we have an outcome or response of interest whose relationship with the variables we wish to learn from the data. In many cases this relationship can be complex and depend on interactions between several groups of variables. Searching for combinations of variables that only together contribute to the response presents a serious computational challenge, as the number of subsets of variables to search through grows rapidly with the size of the subset. Even examining interacting pairs of variables can be computationally infeasible when the number of variables is in the tens of thousands. A key contribution of our research will be to develop new methods that can scale efficiently to capture high-order interactions in high-dimensional data.
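
To give a rough sense of this combinatorial explosion, the short R sketch below simply counts the candidate interactions of each order for a hypothetical 20,000-variable dataset; the figure of 20,000 is illustrative only and is not taken from the proposal.

    # The number of candidate interactions of order k among p variables is choose(p, k).
    p <- 20000                     # hypothetical number of variables
    for (k in 1:4) {
      cat(sprintf("order-%d interactions: %.3g\n", k, choose(p, k)))
    }
    # An exhaustive pairwise (k = 2) scan already needs around 2e8 evaluations,
    # and k = 3 exceeds 1e12, which is why naive search does not scale.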

Uncertainty quantification for high-dimensional data, for instance producing p-values quantifying the significance of variables in determining the response, is crucial in order to avoid drawing false conclusions from data. However, research on this important topic is still in its infancy, with many existing approaches highly unstable in practical settings. Our proposal will develop new robust and computationally efficient methods for p-value construction and other forms of uncertainty quantification for a variety of models.

In some settings we do not have a distinguished response but rather would like to understand relationships between the variables themselves. Graphical models provide a useful way to model such dependencies, but the available methods are often not scalable to the size of datasets now faced by many practitioners. We will use new computational techniques to develop randomised algorithms that avoid explicitly assessing each pair of variables to determine their relationship, but can still deliver estimates of the strongest dependencies. The method will have broad applicability; with biological data, for example, it can help to learn the network of dependencies governing the underlying biological processes.
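
For context, the R sketch below shows the naive baseline that such randomised algorithms are designed to avoid: screening every one of the p(p-1)/2 pairs of variables for dependence (here via simple sample correlations on synthetic data). This is purely an assumed illustration of the quadratic cost, not the proposed methodology.

    set.seed(1)
    n <- 100; p <- 2000                        # hypothetical data dimensions
    X <- matrix(rnorm(n * p), n, p)
    C <- cor(X)                                # computes all p(p-1)/2 pairwise correlations
    pairs <- which(upper.tri(C), arr.ind = TRUE)
    vals  <- C[upper.tri(C)]                   # same (column-major) order as 'pairs'
    top10 <- order(abs(vals), decreasing = TRUE)[1:10]
    # The ten most strongly correlated variable pairs:
    print(data.frame(i = pairs[top10, 1], j = pairs[top10, 2], cor = vals[top10]))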

Planned Impact

High-dimensional data are becoming increasingly common across a range of scientific disciplines and industry sectors. This trend is set to continue, with new technologies and automated data collection strategies enabling a greater number of measurements or properties to be recorded on each observation. Progress on the core methodological problems of interaction detection, nonparametric modelling and uncertainty quantification, which this proposal will tackle, therefore has the potential for wide impact. Software implementations of the methods we develop will be made publicly available to enable rapid deployment on real-world problems. In order to ensure the practical relevance of our research and the usability of our software, we will work closely with industry and academic partners.

Among the sciences, genomics is perhaps the biggest consumer of methodology for high-dimensional data. As part of our proposal, we will apply our methods to a number of biological datasets in collaboration with Sylvia Richardson's group at the MRC Biostatistics Unit, with the aim of advancing understanding of cancer subtypes. In addition, work on detecting interactions between transcription factors in collaboration with Sumanta Basu at Cornell University will help to better understand the regulatory architecture of transcriptional machinery. These applications will also serve to disseminate our methods more widely to the genomics community, which faces a number of key challenges addressed by our work including the detection of gene-gene interactions and gene-environment interactions. In the long-term, improved understanding of biological processes, enabled in part by our new methodology, may help to guide future drug development.

More generally, there are a variety of academic disciplines routinely using high-dimensional statistics that stand to benefit from our research. For example, the PI's previous research in this area has been used in areas ranging from ecology to neuroscience, in addition to genomics.

Industry sectors where analysis of high-dimensional data is a familiar challenge include healthcare, e-commerce, finance and fraud detection, to name just a few. As a concrete example, the predictive analytics team at QuintilesIMS studies datasets with large numbers of variables encoding detailed patient histories in order to find undiagnosed patients with rare diseases. By working closely with their team we will be able to tailor our research and software to meet their needs and guide the application of our methods to these problems. The improved predictive capabilities of the resulting models will contribute to increasing the diagnosis rate among those suffering from these rare diseases.
 
Description Determining whether one input variable is important for predicting another after accounting for potential confounders, formally known as testing for conditional independence, is a central problem in many areas of statistics such as regression analysis and causal inference. However, we show that this problem of fundamental importance is also fundamentally impossible, in the sense that there exist no non-trivial tests for conditional independence. The conclusion is that further modelling assumptions are always required to perform the test.

On the positive side, we show that certain classical tests, with some small modifications, are valid for conditional independence testing beyond the restrictive models for which they were designed. We also argue that in many cases it may be convenient to frame the modelling assumptions required for conditional independence testing in terms of the performance of user-chosen machine learning methods for predicting the variables of interest given the confounders. We develop a testing framework that allows for this, and provide formal guarantees for its validity. (A small illustrative sketch of this residual-based idea appears after this entry.)
Exploitation Route Conditional independence testing is ubiquitous in many areas of science and industry. The tests developed in our work can be used, for example, for causal discovery, for falsifying a proposed causal mechanism for generating data, or for selecting variables in a regression analysis.
Sectors Environment; Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology
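
The following R sketch illustrates the residual-based idea on synthetic data, using plain linear-model regressions as the user-chosen prediction method; it is an assumed, simplified illustration rather than the actual procedure or guarantees developed in this work.

    # Test whether X is independent of Y given Z by regressing each on Z and
    # checking whether the product of the residuals has mean zero.
    set.seed(1)
    n <- 500
    Z <- rnorm(n)
    X <- Z + rnorm(n)                       # X depends on Z only
    Y <- Z^2 + rnorm(n)                     # Y depends on Z only, so the null holds
    rX <- resid(lm(X ~ Z))                  # residuals from predicting X given Z
    rY <- resid(lm(Y ~ poly(Z, 2)))         # residuals from predicting Y given Z
    R  <- rX * rY
    Tstat <- sqrt(n) * mean(R) / sd(R)      # approximately N(0, 1) under the null
    p.value <- 2 * pnorm(-abs(Tstat))
    cat("test statistic:", Tstat, " p-value:", p.value, "\n")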

 
Description The conditional independence test, the "generalised covariance measure", developed as part of this work has been used by A.P. Moller Maersk to test the validity of structural causal models used in pricing.
First Year Of Impact 2021
Sector Transport
Impact Types Economic

 
Title GRPtests 
Description Methodology for testing nonlinearity in the conditional mean function in low- or high-dimensional generalized linear models, and the significance of (potentially large) groups of predictors. Details on the algorithms can be found in the paper by Jankova, Shah, Buehlmann and Samworth (2019). 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Too early to say 
URL https://CRAN.R-project.org/package=GRPtests
 
Title GeneralisedCovarianceMeasure: Test for Conditional Independence Based on the Generalized Covariance Measure (GCM) 
Description A statistical hypothesis test for conditional independence. It performs nonlinear regressions of the variables of interest on the conditioning variable and then tests for a vanishing covariance between the resulting residuals. It can be applied to both univariate random variables and multivariate random vectors. Details of the method can be found in Rajen D. Shah and Jonas Peters (2018). (A brief usage sketch follows this entry.) 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Used by A.P. Moller Maersk in testing whether structural causal models relating to pricing can be falsified. 
URL https://CRAN.R-project.org/package=GeneralisedCovarianceMeasure
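
A brief usage sketch on synthetic data is given below. It assumes the package's exported test function is gcm.test, taking the two variables of interest and the conditioning variable as its first three arguments; the exact argument names, regression-method options and returned fields should be checked against the package's CRAN reference manual.

    # install.packages("GeneralisedCovarianceMeasure")   # available from CRAN
    library(GeneralisedCovarianceMeasure)

    set.seed(1)
    n <- 300
    Z <- matrix(rnorm(n), ncol = 1)         # conditioning variable
    X <- Z[, 1] + 0.3 * rnorm(n)            # depends on Z only
    Y <- Z[, 1]^2 + 0.3 * rnorm(n)          # depends on Z only, so X and Y are
                                            # conditionally independent given Z
    # The test regresses X and Y on Z and checks whether the covariance of the
    # resulting residuals vanishes; the returned object includes a p-value.
    res <- gcm.test(X, Y, Z)
    print(res)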
 
Title dipw: Debiased Inverse Propensity Score Weighting 
Description Estimation of the average treatment effect when controlling for high-dimensional confounders using debiased inverse propensity score weighting (DIPW). DIPW relies on the propensity score following a sparse logistic regression model, but the regression curves are not required to be estimable. Despite this, our package also allows users to estimate the regression curves and take the estimated curves as input to our methods. Details of the methodology can be found in Yuhao Wang and Rajen D. Shah (2020) "Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders". (A short sketch of plain inverse propensity weighting, for orientation, follows this entry.) 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Too early to say. 
URL https://cran.r-project.org/package=dipw
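
For orientation only, the R sketch below implements plain (non-debiased) inverse propensity weighting of the average treatment effect on synthetic low-dimensional data; the dipw package's debiased, high-dimensional methodology is not reproduced here, and the data-generating choices are purely illustrative.

    set.seed(1)
    n <- 2000
    Z <- matrix(rnorm(n * 5), n, 5)                        # confounders
    pscore <- plogis(drop(Z %*% c(1, -1, 0.5, 0, 0)))      # true propensity score
    A <- rbinom(n, 1, pscore)                              # treatment assignment
    Y <- drop(2 * A + Z %*% c(1, 1, 0, 0, 0) + rnorm(n))   # true treatment effect is 2

    phat <- fitted(glm(A ~ Z, family = binomial))          # estimated propensity scores
    ate_ipw <- mean(A * Y / phat - (1 - A) * Y / (1 - phat))
    cat("plain IPW estimate of the average treatment effect:", ate_ipw, "\n")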
 
Title ghcm: Functional Conditional Independence Testing with the GHCM 
Description A statistical hypothesis test for conditional independence. Given residuals from a sufficiently powerful regression, it tests whether the covariance of the residuals is vanishing. It can be applied to both discretely-observed functional data and multivariate data. Details of the method can be found in Anton Rask Lundborg, Rajen D. Shah and Jonas Peters (2020). 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact Too early to say. 
URL https://CRAN.R-project.org/package=ghcm