Under discussion

Lead Research Organisation: Imperial College London
Department Name: Mathematics

Abstract

As part of my PhD thesis, I am proposing a novel bi-level (variable and group) selection approach. The key objective is to produce a method that can accurately identify the pathways (groups of genes) and individual genes that are linked to disease outcome, whilst reducing the number of false discoveries. These links are crucial in the fields of preventative and personalised medicine. First, finding such links using a statistical method gives biological and medical practitioners a subset of genes to investigate, allowing them to verify the findings through real-life experiments. Additionally, knowing which pathways and genes are linked to diseases will improve the accuracy of preventative screening. However, developing such a method presents three key challenges.

The first challenge is that the number of genes tends to be far greater than the number of individuals sampled (a setting known as high-dimensional data). Several approaches have been proposed in the literature to deal with this. Perhaps the most popular class is penalised regression, which adds a penalty term to the regression objective so that variable selection can occur. The most popular penalisation approach is the least absolute shrinkage and selection operator (lasso).
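To illustrate how a penalty leads to variable selection, the following is a minimal sketch of the lasso in the simplified case of an orthonormal design, where the solution is known to reduce to soft-thresholding each least-squares coefficient. The orthonormal assumption is made purely for illustration (it does not hold for high-dimensional data), and the function names and input values are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """Lasso solution under an ASSUMED orthonormal design.

    Minimising 0.5 * ||y - X b||^2 + lam * ||b||_1 with X'X = I amounts to
    soft-thresholding each least-squares coefficient separately.
    """
    return [soft_threshold(z, lam) for z in ols_coefs]


# Hypothetical least-squares coefficients for four genes.
coefs = lasso_orthonormal([3.0, -0.4, 0.1, -2.5], lam=0.5)

# Variables whose coefficient is not exactly zero are the "selected" ones.
selected = [j for j, b in enumerate(coefs) if b != 0.0]
```

Coefficients whose magnitude is below the penalty level are set exactly to zero, which is what allows the lasso to select variables rather than merely shrink them.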

However, a second challenge comes from the need to reduce the number of false discoveries made, also known as controlling the false discovery rate (FDR). Traditional approaches, such as the lasso, have no guarantees of FDR control. As a remedy, sorted L-one penalised estimation (SLOPE) has been proposed in the literature; it extends the lasso to provide FDR control by linking the lasso to multiple testing. Group SLOPE further extends SLOPE by allowing it to perform group selection, controlling the group FDR. Crucially, however, group SLOPE does not allow for bi-level selection. As a consequence, it inherently assumes that every gene in a significant pathway is also significant, which is unrealistic in a genetics setting. This leads on to the final challenge: we want to be able to select both pathways and genes as being causal (known as a sparse-group setting). Therefore, my proposal is to incorporate SLOPE into the sparse-group setting, to form sparse-group SLOPE (SGS). SGS overcomes these three key challenges by working for high-dimensional data, controlling the variable and group FDR, and selecting both variables and groups.
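The penalties described above can be sketched in a few lines. The sorted L-one norm pairs the largest coefficient magnitudes with the largest weights, and a Benjamini-Hochberg-style weight sequence is one common choice from the SLOPE literature. The sparse-group combination shown here (a convex mix of a variable-level and a group-level sorted penalty, controlled by a hypothetical parameter `alpha`) is only an assumed form for illustration; the abstract does not specify the SGS penalty, and the weight sequences used in the project may differ.

```python
import math
from statistics import NormalDist


def bh_weights(p, q):
    """Non-increasing sequence lambda_i = Phi^{-1}(1 - i*q/(2p)), i = 1..p."""
    nd = NormalDist()
    return [nd.inv_cdf(1 - i * q / (2 * p)) for i in range(1, p + 1)]


def sorted_l1(values, weights):
    """Sorted L-one norm: sum of weights[i] times the i-th largest |value|."""
    mags = sorted((abs(v) for v in values), reverse=True)
    return sum(w * m for w, m in zip(weights, mags))


def sparse_group_penalty(beta, groups, alpha, q_var, q_grp):
    """ASSUMED sparse-group form, not taken from the abstract.

    alpha * (sorted L1 of the coefficients)
    + (1 - alpha) * (sorted L1 of the size-scaled group L2 norms).
    `groups` is a list of index lists partitioning the coefficients.
    """
    var_part = sorted_l1(beta, bh_weights(len(beta), q_var))
    group_norms = [
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    ]
    grp_part = sorted_l1(group_norms, bh_weights(len(groups), q_grp))
    return alpha * var_part + (1 - alpha) * grp_part


# Hypothetical example: two pathways of two genes each, one fully inactive.
beta = [1.2, 0.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
penalty = sparse_group_penalty(beta, groups, alpha=0.5, q_var=0.1, q_grp=0.1)
```

With constant weights the sorted L-one norm reduces to the ordinary lasso penalty, which is the sense in which SLOPE extends the lasso.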

This project falls within the EPSRC statistics and applied probability research area. The project will first develop SGS in a general setting, so that it can also have applications outside of genetics. Developing it in a general setting will involve typical method development work - including optimisation, proving mathematical results, and coding. Additionally, an R package will be released, so that the method becomes widely available. Once simulation studies have verified the performance of SGS, it will be applied directly to real-life genetics data, with the hope of identifying relationships between genes and disease outcomes that have not previously been reported in the literature.

Finally, the approach has considerable potential for future improvements. First, the method can be reformulated in a Bayesian setting (SLOPE has already had this treatment in the literature). Using Bayesian modelling will allow prior biological information to be incorporated, and the uncertainty of any discoveries made by SGS to be quantified. Second, SGS can be extended to discover non-linear relationships. In the literature, most approaches are developed in the simplified general setting of discovering linear relationships. However, many such approaches have already been extended to finding more complex, non-linear relationships, and SGS could be extended in the same way.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This will be relevant research, addressing the key questions behind real-world problems. The research will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximise reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Publications


Studentship Projects

| Project Reference | Relationship | Related To | Start | End | Student Name |
| --- | --- | --- | --- | --- | --- |
| EP/S023151/1 | | | 01/04/2019 | 30/09/2027 | |
| 2602754 | Studentship | EP/S023151/1 | 02/10/2021 | 30/08/2025 | Fabio Feser |