New challenges in high-dimensional statistical inference

Lead Research Organisation: University of Cambridge
Department Name: Pure Maths and Mathematical Statistics

Abstract

As a society, more and more of the activities that we take for granted
rely on sophisticated technology, and are dependent on the fast and
efficient handling of large quantities of data. Obvious examples
include the use of internet search engines and mobile telephones.
Similarly, recent advances in healthcare are partly due to improved,
highly data-intensive scanning equipment in hospitals, and the
development of new, effective drug treatments, which have been the
result of extensive scientific study with data at its core.

Nevertheless, such advances can only be achieved through the
development of appropriate statistical models and methods which enable
practitioners to extract useful information from these vast quantities
of data. In order to capture the complexity of the data generating
processes, these models are inevitably high-dimensional, and have been
the topic of an enormous amount of research in Statistics over the
last 15 years or so.

This proposal addresses some of the fundamental and important
challenges in handling the huge data sets that routinely arise in the
applications above, as well as many others. For instance, in
high-dimensional models, sparse estimators are crucial for stability and
interpretability. But these give only a point estimate of a
parameter, and typically practitioners require more sophisticated
inferential statements to assess uncertainty. We will show how this
can be done by proposing easy-to-use and robust p-values
based on these sparse estimators.

One of the most important applications of sparse estimators is in
biotechnology. Indeed, we will apply our methodology described above
in a high-dimensional cancer study carried out by Danish
biostatisticians that uses microarray techniques. We will select,
with an associated quantification of uncertainty, a handful of stable
distinguishing genes for diffuse large B-cell lymphomas, thereby
enhancing our understanding of these cancers.

Another application area facing high-dimensional challenges is
neuroscience, and we will work on a study of dyscalculia that uses the
brain imaging technique of Electroencephalography (EEG). Dyscalculia
is a mathematical disability that prevents normal arithmetic function.
Here, existing statistical techniques used by experimental
psychologists in this area are inadequate, and modern high-dimensional
methods have the potential to improve dramatically our understanding
of this disability.

In classification problems, the challenge is to assign an observation
to one of two or more classes based on its similarity to (labelled)
data from each of these classes. They are some of the most frequently
encountered high-dimensional statistical problems, particularly in
fields such as machine learning and areas of computer science such as
computer vision and robotics. We will provide a simple and robust
improvement to perhaps the most popular method (the k-nearest
neighbour classifier), by weighting the nearest neighbours in an
optimal fashion. We will also give a quantification of the
improvement. A related problem we will study is to quantify the
uncertainty of a classifier constructed from training data. As an
example this could be used to give a doctor a measure of uncertainty
in a diagnosis.
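
As a rough illustration of the weighting idea described above, the following Python sketch classifies a point by a weighted vote among its nearest neighbours. The function name, the toy data and the particular weights are hypothetical choices made for this example; they are not the optimal weights that the proposal sets out to derive.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, weights):
    """Classify x by a weighted vote among its len(weights) nearest neighbours.

    weights[0] applies to the nearest neighbour, weights[1] to the next, etc.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    weights = np.asarray(weights, dtype=float)
    k = len(weights)
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # Indices of the k nearest training points, nearest first
    nearest = np.argsort(dists, kind="stable")[:k]
    classes = np.unique(y_train)
    # Total weight received by each class among the k neighbours
    votes = np.array([weights[y_train[nearest] == c].sum() for c in classes])
    return classes[np.argmax(votes)]

# Toy one-dimensional data (hypothetical)
X = [[0.0], [1.0], [1.2], [5.0]]
y = [0, 1, 1, 0]

# Equal weights reproduce the standard 3-nearest-neighbour vote
unweighted = weighted_knn_predict(X, y, [0.4], [1/3, 1/3, 1/3])
# Decreasing weights (illustrative only) give the closest point more say
weighted = weighted_knn_predict(X, y, [0.4], [0.6, 0.25, 0.15])
```

With equal weights the two more distant neighbours outvote the closest one; with decreasing weights the closest neighbour decides the class. How best to choose such weights is the question the proposal addresses.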

The final main issue we will address concerns model misspecification.
This is a particularly important issue in high-dimensional statistical
problems, where it is almost inevitable that our model misses some
important effects, or does not model them in the correct way. We will
provide understanding of how statistical procedures perform in such
circumstances and develop new ones that are robust to model
misspecification. A particularly important application will be to
Independent Component Analysis models that are very popular in
statistical signal processing for analysing data arising from multiple
sources, including microarray and brain imaging data.

Planned Impact

As a society, more and more of the activities that we take for granted
rely on sophisticated technology, and are dependent on the fast and
efficient handling of large quantities of data. Obvious examples
include the use of internet search engines and mobile telephones.
Similarly, recent advances in healthcare are partly due to improved,
highly data-intensive scanning equipment in hospitals, and the
development of new, effective drug treatments, which have been the
result of extensive scientific study with data at its core.

This proposal addresses some of the fundamental and important
challenges in handling the huge data sets that routinely arise in the
applications above, as well as many others. It therefore has the
potential for high societal and economic impact, both in the immediate
applications considered and through later transfer of the innovative
new methods which will be developed.

Firstly, the proposal will be of great benefit to the four post-doctoral
researchers, who will have acquired crucial skills that are much
in-demand in many sectors of the economy, including the technology
sector and pharmaceuticals industry. Conversely, the UK economy
will benefit from having well-trained individuals who are able to cope
with the challenges that handling huge data sets presents.

In our collaboration with Martin Bogsted, we will analyse
high-dimensional data from sophisticated microarray experiments to
improve our understanding of the genes associated with so-called
B-cell cancers (e.g. lymphomas, multiple myeloma and leukaemia). Our
hope is that this will benefit society by leading to improved
diagnosis and treatment for these types of cancer.

Another example of societal impact concerns the study of dyscalculia
through the analysis of high-dimensional data from a brain imaging
technique called Electroencephalography (EEG). Dyscalculia is a
poorly understood mathematical disability affecting 3-6% of the
population that prevents normal arithmetic function. Our aim is to
understand better the way in which brain function differs in people
with dyscalculia in order to improve clinical and classroom
intervention techniques.

Our work on Independent Component Analysis (ICA) algorithms will
benefit, for instance, people in the telecommunications industry who
need to design multiple access systems so that people can
communicate despite the fact that other users occupy the same
resources, possibly simultaneously.

Publications

Ali JM (2015) Analysis of ischemia/reperfusion injury in time-zero biopsies predicts liver allograft outcomes. in Liver transplantation : official publication of the American Association for the Study of Liver Diseases and the International Liver Transplantation Society

Cannings T (2017) Random-projection Ensemble Classification in Journal of the Royal Statistical Society Series B: Statistical Methodology

Chen Y (2016) Generalized Additive and Index Models with Shape Constraints in Journal of the Royal Statistical Society Series B: Statistical Methodology

Cribben I (2017) Estimating Whole-Brain Dynamics by Using Spectral Clustering in Journal of the Royal Statistical Society Series C: Applied Statistics

 
Description I have developed a new method for high-dimensional variable selection called Complementary Pairs Stability Selection, with provable error control guarantees. I have also proved theoretical guarantees for a related method called knockoffs.

I have developed a new, optimal algorithm for nearest-neighbour classification, and also shown how nearest-neighbour methods can be used to provide excellent estimators of entropy. Such methods lead to new methodology, which I have also proposed, for testing independence.

I have developed new methodology for shape-constrained estimation problems, as well as providing new and fundamental insights into their behaviour.

I have contributed to a new and rapidly emerging area that seeks to understand the fundamental trade-offs between statistical and computational efficiency. This area provides fascinating connections between Statistics and Theoretical Computer Science.

In collaboration with Danish biostatisticians, I have developed a refined classification system for diffuse large B-cell lymphoma, based on subset-specific B-cell-associated gene signatures (BAGS) in the normal B-cell hierarchy.

So far, I have employed seven PDRAs on the grant, and have been a close mentor for another in the department. Of these, one is now a Lecturer in Statistics at Lancaster University, one is an Assistant Professor in Statistics at LSE, one is a Lecturer in Statistical Science at the University of Bristol, one is a post-doc at the University of Southern California and has an offer of a lectureship at a UK university, one is an Assistant Professor at Jilin University, China and another is an Assistant Professor at Sungshin Women's University. Of the remaining two, one has an offer of a lectureship from a UK university, and one will be on the job market next year or the year after.
Exploitation Route Several further projects have emerged from the research I have already carried out. For instance, following my work on classification, I am now working on methods for high-dimensional classification based on random projections of the data into lower-dimensional spaces.

The work on high-dimensional variable selection has led me to consider the problem of constructing confidence intervals for parameters in high-dimensional statistical models (e.g. the Cox proportional hazards model).

The work on classifying lymphoma has diagnostic and prognostic value for healthcare professionals.

I am regularly invited to speak at top international conferences (increasingly as a plenary speaker), which aids dissemination of my work. I have also given talks in other departments (e.g. the signal processing and machine learning groups) as well as at academia/industry events to facilitate knowledge transfer.
Sectors Healthcare

URL http://www.statslab.cam.ac.uk/~rjs57/Research.html
 
Description The methods I have recently developed have already been used in a huge range of applications, from gene expression classification, to the cultivation of disaster donors and ischemic heart screening. This provides both societal and economic impact. Further economic and cultural impact of my work arises from the PDRAs I have employed on my grant and trained in the latest research methods. Seven of the nine PDRAs employed are non-UK nationals, and yet they bring enormous expertise and skills to the country, in an area where the UK desperately needs to recruit and retain top talent. All nine of these PDRAs have obtained tenured or tenure-track positions (five in the UK, at Warwick (x2), LSE (x2) and Edinburgh). This is a major sense in which the grant has had a significant impact on the UK Statistics landscape.
Sector Healthcare
Impact Types Cultural, Societal, Economic

 
Description Philip Leverhulme Prize
Amount £100,000 (GBP)
Organisation The Leverhulme Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 12/2014 
End 11/2017
 
Title An R package called SPCAvRP 
Description An R package to implement a method for Sparse Principal Component Analysis proposed in Gataric, Wang and Samworth (2017) 
Type Of Technology Software 
Year Produced 2017 
Impact Too early to say. 
URL https://cran.r-project.org/web/packages/SPCAvRP/index.html
 
Title R package called 'RPEnsemble' 
Description This package implements the methodology proposed in the paper Cannings, T. I. and Samworth, R. J. (2015) Random projection ensemble classification. Available at http://arxiv.org/abs/1504.04595. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact This package has been very favourably received by the community. 
URL https://cran.r-project.org/web/packages/RPEnsemble/index.html
 
Title R package called 'fcd' 
Description Publicly-available software for community detection. 
Type Of Technology Software 
Year Produced 2013 
Open Source License? Yes  
Impact The package has been downloaded by users. 
URL http://cran.r-project.org/
 
Title R package called 'scar' 
Description R package for fitting generalised additive and index models with shape constraints. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The package has been downloaded by practitioners. 
URL http://cran.r-project.org/
 
Title R package called IndepTest 
Description An R package to implement an independence test called MINT, proposed in Berrett and Samworth (2017) 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Too early to say. 
URL https://cran.r-project.org/web/packages/IndepTest/index.html
 
Description Contact organiser for Statistical Laboratory's participation in the 2013 International Year of Statistics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact I organised four talks on Statistics, given by high-profile people (David Spiegelhalter, John Beddington, Tim Harford, Michael Rawlins). These were followed by dinners in St John's College.

Year(s) Of Engagement Activity 2013
URL http://www.statslab.cam.ac.uk/IYS2013/
 
Description Faculty Open Day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Schools
Results and Impact I have now given five Open Day talks in Cambridge, usually to around 400 students. The aim is to introduce them to an exciting problem or area of mathematics, to stimulate their thinking, and encourage them to apply to study mathematics at the University of Cambridge.

I have had extremely positive feedback from the talks, from both pupils and colleagues who attended.
Year(s) Of Engagement Activity 2012, 2014
URL http://www.maths.cam.ac.uk/undergrad/admissions/openday/
 
Description School visit (North London Collegiate School for Girls) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Around 80 pupils attended a maths talk I gave. This was followed by questions and discussion, as well as lunch.

Several pupils from the school applied to study mathematics at the University of Cambridge.
Year(s) Of Engagement Activity 2012
URL http://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCMQFjAA&url=http%3A%2F%2Fwww....
 
Description Talks at Maths Masterclasses 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact I gave two talks at Maths Masterclass days, designed to stimulate thinking in sixth-form pupils, and encourage them to apply to the University of Cambridge.

Of those who attended and then applied to study maths at the University of Cambridge, 38 of 87 applications (about 44%) were successful in 2013 and 25 of 59 (about 42%) in 2014. This is a much higher success rate than the overall level, which is around 20-25%.
Year(s) Of Engagement Activity 2012, 2013
URL http://www.study.cam.ac.uk/undergraduate/events/masterclasses.html
 
Description Talks at Sutton Trust and Linacre Institute summer schools 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact I gave talks to encourage sixth-form students from disadvantaged backgrounds to consider applying to Cambridge.
Year(s) Of Engagement Activity 2015, 2016