Sparse-group models in high-dimensional statistics
Lead Research Organisation: Imperial College London
Department Name: Mathematics
Abstract
The focus of my PhD thesis has been on applying and optimising sparse-group models in high-dimensional settings, where the number of variables is larger than the number of observations. The work is primarily motivated by problems in genetics, which are high-dimensional in nature. The broad goal is to use information about the grouping structure of genes (pathways) to guide model optimisation, preferably with some theoretical guarantees.
The main contribution of my thesis is a novel bi-level (variable and group) selection approach. The key objective was to produce a method that can accurately select relevant pathways (groups of genes) and genes as being linked to disease outcome, whilst reducing the number of false discoveries. These links are crucial in the fields of preventative and personalised medicine. First, finding such links using a statistical method gives biological and medical practitioners a subset of genes to investigate, allowing them to verify the findings through real-life experiments. Additionally, knowing which pathways and genes are linked to diseases will improve the accuracy of preventative screening. However, three key challenges arise when developing such a method.
The first challenge is that the number of genes tends to be far greater than the number of individuals sampled (known as high-dimensional data). Several approaches have been proposed in the literature to deal with this. Perhaps the most popular class is penalised regression, which performs variable selection as part of model fitting. The most popular penalisation approach is the least absolute shrinkage and selection operator (lasso).
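As a brief illustration of why an L1 penalty performs variable selection, the sketch below shows the soft-thresholding operator, which gives the lasso solution coordinate-wise in the special case of an orthonormal design. This is a textbook simplification rather than the method developed in the thesis, and the function name `soft_threshold` is chosen here for illustration only.

```python
def soft_threshold(z, lam):
    """Soft-thresholding: shrink z towards zero by lam, and set it to
    exactly zero when |z| <= lam.

    Under an orthonormal design (an assumption made for this sketch),
    applying this to each least-squares coefficient gives the lasso
    estimate with penalty level lam.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients for four variables.
ols_coefficients = [3.0, -0.5, 0.2, -3.0]
lasso_coefficients = [soft_threshold(z, 1.0) for z in ols_coefficients]
```

Coefficients whose magnitude falls below the penalty level are set exactly to zero, which is what removes the corresponding variables from the model.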
A second challenge comes from the need to limit the number of false discoveries made, known as controlling the false discovery rate (FDR). Traditional approaches, such as the lasso, offer no guarantee of FDR control. As a remedy, sorted L-one penalised estimation (SLOPE) has been proposed in the literature; it extends the lasso to provide FDR control by linking the lasso to multiple testing. Group SLOPE further extends SLOPE to perform group selection, controlling the group FDR.
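To make the link between SLOPE and multiple testing more concrete, the sketch below computes the sorted L1 norm, in which the largest coefficient magnitude is paired with the largest penalty weight, together with a Benjamini-Hochberg-type weight sequence of the kind used in the SLOPE literature. It is a minimal sketch under those assumptions; the function names and the target level `q` are illustrative and are not taken from the thesis or its R packages.

```python
from statistics import NormalDist


def bh_weights(p, q):
    """Benjamini-Hochberg-type weights: lambda_i = Phi^{-1}(1 - i*q/(2p)).

    p is the number of variables and q (0 < q < 1) the target FDR level.
    The sequence is non-increasing in i.
    """
    normal = NormalDist()
    return [normal.inv_cdf(1 - i * q / (2 * p)) for i in range(1, p + 1)]


def sorted_l1_norm(beta, weights):
    """Sorted L1 norm: sum_i weights[i] * |beta|_(i), where |beta|_(1)
    is the largest magnitude. weights must be non-increasing."""
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(w * m for w, m in zip(weights, magnitudes))


weights = bh_weights(p=5, q=0.1)
penalty = sorted_l1_norm([0.0, 2.5, -1.0, 0.0, 0.3], weights)
```

When all weights are equal, the sorted L1 norm reduces to the ordinary lasso penalty, which is the sense in which SLOPE extends the lasso.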
Crucially, group SLOPE does not allow for bi-level selection. As a consequence, it inherently assumes that every gene in a significant pathway is also significant, which is unrealistic in a genetics setting. This leads to the final challenge: we want to be able to select both pathways and genes as being causal (known as a sparse-group setting). Therefore, my proposal was to incorporate SLOPE into the sparse-group setting, to form sparse-group SLOPE (SGS). SGS overcomes these three key challenges by working for high-dimensional data, controlling the variable and group FDR, and selecting both variables and groups.
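The sparse-group idea can be illustrated with the penalty of the sparse-group lasso (SGL), which mixes a variable-level L1 term with a group-level L2 term. The sketch below uses the commonly cited SGL form with group weights equal to the square root of the group size; it is not the SGS definition, which, as described above, brings SLOPE-style sorted penalties into this setting. The function name, the mixing parameter `alpha`, and the example data are assumptions made for illustration.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse-group lasso penalty (a sketch of the standard form):

        lam * (alpha * ||beta||_1
               + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2)

    groups is a list of index lists partitioning the variables, and
    alpha in [0, 1] balances variable-level and group-level sparsity.
    """
    l1_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for indices in groups:
        group_norm = math.sqrt(sum(beta[j] ** 2 for j in indices))
        group_term += math.sqrt(len(indices)) * group_norm
    return lam * (alpha * l1_term + (1 - alpha) * group_term)


# Two hypothetical pathways of two genes each; the second is inactive.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
value = sgl_penalty(beta, groups, lam=1.0, alpha=0.5)
```

Setting `alpha=1` recovers the lasso penalty and `alpha=0` the group lasso penalty; intermediate values allow a whole group to be dropped while still zeroing individual variables within the groups that remain.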
The other contributions have been extending SGS and the sparse-group model framework. I developed screening rules which allow SGS to be fitted much more efficiently and extended these to also work for the sparse-group lasso (SGL). Finally, I am exploring the problem of model selection with regard to sparse-group models.
My research falls within the EPSRC statistics and applied probability research area. The methods I have developed were motivated by applied problems but are formulated in general settings. All of them are available in corresponding R packages, making them easily accessible.
Planned Impact
The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. It will be relevant research, addressing the key questions behind real world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.
Organisations
| People | ORCID iD |
|---|---|
| Fabio Feser (Student) | |
Studentship Projects
| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/S023151/1 | | | 31/03/2019 | 29/09/2027 | |
| 2602754 | Studentship | EP/S023151/1 | 01/10/2021 | 29/09/2025 | Fabio Feser |