Statistical methodology and theory for the Big Data era (Ext.)

Lead Research Organisation: University of Cambridge
Department Name: Pure Maths and Mathematical Statistics

Abstract

This is an extension of the Fellowship 'New challenges in
high-dimensional statistical inference' (EP/J017213/1).

The project will involve developing methodology and theory for a range
of modern statistical problems from high-dimensional and nonparametric
statistical inference. Such problems lie at the heart of modern
data science, an area where the UK urgently needs to grow its capacity.

Specific challenges to be addressed include the development of methods
for dimension reduction, variable selection and uncertainty
quantification, among several others. These are some of the
fundamental problems faced by practitioners in the Big Data era, and
demand highly innovative solutions. Our proposed techniques will be
proven to be robust via appropriate theoretical justification, and
will be implemented in free, open source software so as to maximise
the impact of the research.

Planned Impact

Big Data are transforming the way we live. These days, we can access
an almost unimaginably large wealth of knowledge by typing a few
keywords into an internet search engine and use apps on our mobile
telephones or other smart devices to monitor or improve our health.
Many recent advances in healthcare are partly due to improved, highly
data-intensive scanning equipment in hospitals, and the development of
new, effective drug treatments, which have been the result of
extensive scientific study with data at its core.

This proposal addresses some of the fundamental and important
statistical challenges involved in handling the modern data sets
that routinely arise in the applications above, as well as many
others. These challenges include dimension reduction, variable
selection and uncertainty quantification. It therefore has the
potential for high societal and economic impact, both in the immediate
applications considered and through later transfer of the innovative
new methods that will be developed.

In the UK, we are currently facing a great shortage of well-trained
data scientists, across both science and industry. This proposal will
go some way towards addressing this deficiency, since the two
post-doctoral research associates employed will be exposed to
cutting-edge problems in the field, and will acquire the crucial
skills that are in so much demand in many sectors of the economy,
including academia, and the technology and pharmaceuticals sectors.

I am the main organiser of a six-month Isaac Newton Institute
programme on 'Statistical Scalability' that will form one of the major
impact activities of the proposal. End users of modern statistical
methodology and theory from science and industry will be integrated
into the programme (for instance, through an 'Open for Business' day
that will be co-organised with the Turing Gateway for Mathematics) to
ensure wide dissemination and cross-fertilisation of the ideas. They
will therefore benefit from exposure to the ideas of the top
researchers who will participate in the programme, and can translate
the key messages into their own communities.

Through the collaboration with Martin Bogsted and his cancer genetics
group, we will improve understanding of the prognostic value of gene
expressions from various B-cell malignancies. In particular, we will
ascertain which genes are associated with short survival for specific
diseases and which have a common association with short survival for
multiple diseases. The societal benefit will be shortcuts in
identifying the applicability of drugs, which have the potential to be
incorporated into clinical trials.

Publications

Banerjee M (2018) A Conversation with Jon Wellner in Statistical Science

Barber R (2020) Robust inference with knockoffs in The Annals of Statistics

Berrett T (2020) The Conditional Permutation Test for Independence While Controlling for Confounders in Journal of the Royal Statistical Society Series B: Statistical Methodology

Berrett TB (2021) USP: an independence test that improves on Pearson's chi-squared and the G-test in Proceedings. Mathematical, physical, and engineering sciences

Berrett TB (2021) Optimal rates for independence testing via U-statistic permutation tests in The Annals of Statistics

 
Description I have provided fundamental understanding of shape-constrained methods in Statistics, including log-concave density estimation and isotonic regression.

I have pioneered a new approach to the estimation of change points in high-dimensional time series.

I have developed a new way of estimating statistical functionals such as entropy and proved its efficiency; I have also shown how such techniques can be used to derive new tests of independence of random vectors. This has led to a major breakthrough: a new test to replace Pearson's chi-squared test of independence.

I have pioneered a data perturbation approach to high-dimensional statistical inference.
Exploitation Route I gave a talk at Jump Trading, where I discussed the use of these methods in the financial sector. I have also disseminated my work in 13 plenary/keynote lectures over the course of the grant.
Sectors Energy, Environment, Financial Services and Management Consultancy, Healthcare

URL http://www.statslab.cam.ac.uk/~rjs57/Research.html
 
Description The results of the paper I co-authored on 'Screening of healthcare workers for SARS-CoV-2 highlights the role of asymptomatic carriage in COVID-19 transmission' (eLife, 9:e58728) were presented to the SAGE committee, and were influential in their ongoing strategy. The results were also reviewed by NHS England, Public Health England, and the Health/Social Care committee. As a result of this, and subsequent manuscripts such as 'Single-dose BNT162b2 vaccine protects against asymptomatic SARS-CoV-2 infection' (eLife, 10:e68808), visits to the testing facility were undertaken by the national director of mass testing Alex Cooper, and the Prime Minister's health adviser William Warr. Furthermore, both studies led to recruitment of COVID-positive staff and patients into the NIHR COVID Bioresource, which has been widely valuable for research by multiple Cambridge-based and national laboratories.
First Year Of Impact 2020
Sector Healthcare
Impact Types Societal, Policy & public services

 
Title IndepTest 
Description R package for independence testing 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/IndepTest/index.html
 
Title InspectChangepoint 
Description R package for high-dimensional changepoint estimation. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/InspectChangepoint/index.html
 
Title LogConcComp 
Description GitHub Python code for computing the log-concave maximum likelihood estimator
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://github.com/wenyuC94/LogConcComp
 
Title MCARtest: Optimal Nonparametric Testing of Missing Completely at Random 
Description R package 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/MCARtest/index.html
 
Title MissInspect 
Description GitHub R functions for changepoint estimation with heterogeneous missingness
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://github.com/wangtengyao/MissInspect
 
Title SPCAvRP 
Description R package for sparse PCA 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/SPCAvRP/index.html
 
Title Sshaped 
Description R package for fitting S-shaped functions 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/Sshaped/index.html
 
Title USP 
Description R package for independence testing 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/USP/index.html
 
Title ocd 
Description R package for online changepoint detection 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/ocd/index.html
 
Title ocd_CI 
Description R functions on GitHub for online changepoint detection.
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://github.com/yudongchen88/ocd_CI
 
Title primePCA 
Description R package on CRAN for high-dimensional PCA with heterogeneous missingness 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact None as yet. 
URL https://cran.r-project.org/web/packages/primePCA/index.html