FORGING: Fortuitous Geometries and Compressive Learning

Lead Research Organisation: University of Birmingham
Department Name: School of Computer Science

Abstract

Statistical machine learning has been instrumental in providing algorithms that enable us to draw valid conclusions from empirical data. Its successes rely crucially on a rigorous mathematical theory.
Unfortunately, as modern data sets are increasingly high dimensional, new challenges gathered under the term 'curse of dimensionality' render many of the existing data analysis methods inadequate, questionable, or inefficient, and much of the existing theory becomes uninformative. Mitigating the curse of dimensionality currently receives a great deal of research attention; however, many fundamental questions remain unresolved. The aim of this project is to answer two of these:

Q1: What kinds of data distributions make a given high dimensional learning problem easier or harder to solve?

Q2: What kinds of learning problems can be approximately solved compressively, on a low dimensional subspace?

We propose a stance complementary to efforts that look for ways to counter the various observed detrimental effects of the dimensionality curse: we shall exploit some very generic properties of high dimensional probability spaces to develop a unified theory, together with its algorithmic implications, and to unearth precise conditions that enable us to solve high dimensional problems in low dimensions. These conditions will depend on the geometry of the problem. We will use a new notion of problem-dependent compressive distortion that we have started developing, which builds on a so far unexploited connection between random projections and empirical process theory.

The expected outcome will be applicable across a range of machine learning and data mining problems, and we will validate this in case studies.

Planned Impact

This project is expected to provide answers to two fundamental open questions in high dimensional machine learning and data mining, along with a generic methodology for resolving them in various learning settings. As such, it will provide a new way of thinking about high dimensional data problems that shifts the focus away from case-by-case solutions to the observed detrimental effects of the curse of dimensionality, and is instead based on exploiting some very general properties of high dimensional data spaces. Without this qualitative shift, given the unprecedented increase in the dimensionality of data sets in many areas of science and engineering, we risk losing the performance guarantees that theory can provide for practice.

Researchers and practitioners in data mining and machine learning will benefit from this research and from the generality of our approach. Much research effort has been spent in recent years on mitigating the detrimental effects of the curse of dimensionality. The time has come when it is feasible to build up the theoretical foundations and to eliminate inefficient case-by-case trial-and-error strategies. Fundamental research is essential to achieve this, and this is what we propose to do.

We expect to make societal impact: the project will assign a research student to work on developing the PI's ideas into algorithms for the medical domain, in a collaboration envisaged with Prof. Tom Marshall and colleagues in the School of Health and Population Sciences (College of Medical and Dental Sciences) at the University of Birmingham. By applying our results to the UK national primary care database we expect to contribute to improving early diagnosis of patients, which is an important problem in public health medicine.

High dimensional data problems are ubiquitous in many areas of science, engineering and business, and machine learning is an enabling technology in many of these. Therefore this project will have an indirect impact on all of these; here are some concrete examples: (i) in genomics and proteomics, where high dimensional measurements are made routinely and inexpensively while the number of subjects with a specific condition is limited; (ii) in biomedical imaging and computational neuroscience, where the resolution of measuring devices is ever increasing, and different modalities of measurement are possible and available; (iii) in web, multimedia and market basket analysis, for example customer preference prediction from purchase logs, where more and more sophisticated customer profiles are feasible, with many descriptors, giving rise to ever higher data dimensionality.

Our approach is necessarily cross-disciplinary, as it will integrate results and techniques from several areas of mathematics -- high dimensional probability theory and concentration of measure, empirical process theory, functional analysis, theoretical computer science, computational geometry, information theory, and random matrix theory -- to produce a new analytic strategy able to resolve fundamental issues in machine learning and data mining. Our results will naturally feed back upstream to researchers whose mathematical results we use, and this may stimulate new research at the boundaries between disciplines.
 
Description We developed new uses and new understanding of random projections in statistical learning, including the following:

- A general framework to uncover hidden low-complexity structures that give both generalisation guarantees for compressive learning, and new insights into how learning can succeed in arbitrary high dimensions.

- An analytic strategy to uncover distributional conditions of statistical near-optimality in large random projection ensembles of averaging type.

- An approach to exploit set-heterogeneity structure in high probability uniform bounds. This yielded tightened norm-preservation guarantees for random projections of infinite sets, and also provided a sound justification for heterogeneous weighted ensembles in statistical learning (of which the weighted random projection ensemble is one example).
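To make the averaging-type random projection ensembles above concrete, here is a minimal sketch. The synthetic data, parameter choices, and the plain least-squares base learner are illustrative stand-ins, not the actual methods developed in the project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic high dimensional binary classification problem (d >> k).
n, d, k = 200, 500, 50        # sample size, ambient dim, projected dim
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)       # labels generated by a linear rule

def fit_ls(Z, y):
    """Ridge-regularised least-squares fit in the compressed space."""
    return np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(Z.shape[1]), Z.T @ y)

# Average the scores of M base learners, each trained on an
# independently random-projected copy of the data.
M = 30
scores = np.zeros(n)
for _ in range(M):
    R = rng.normal(size=(k, d)) / np.sqrt(k)  # Gaussian random projection
    Z = X @ R.T                               # compressed data, n x k
    scores += Z @ fit_ls(Z, y)

acc = np.mean(np.sign(scores) == y)
print("ensemble training accuracy:", acc)
```

In the weighted heterogeneous variants studied in the project, the uniform average above would be replaced by member-specific weights; each base learner still only ever sees a k-dimensional view of the d-dimensional data.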

We determined the number of random projections required for a random matrix theoretic covariance regulariser to be usable in practice.
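To convey the flavour of such a regulariser, here is a Monte Carlo sketch in the spirit of the Marzetta-Tucci-Simon construction: the inverse of a singular sample covariance is replaced by an average of compressive inverses. The dimensions and the number of projections below are illustrative guesses, not the values derived in our analysis:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, k = 30, 100, 20    # n < d, so the sample covariance is singular
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n        # rank-deficient sample covariance

# Each compressive inverse R^T (R S R^T)^{-1} R is well defined whenever
# k <= rank(S); averaging over many random projections R yields an
# invertible, full-rank precision-matrix estimate.
M = 200
Omega = np.zeros((d, d))
for _ in range(M):
    R = rng.normal(size=(k, d)) / np.sqrt(d)
    Omega += R.T @ np.linalg.solve(R @ S @ R.T, R) / M

print("rank of S:", np.linalg.matrix_rank(S))         # below d
print("rank of estimate:", np.linalg.matrix_rank(Omega))
```

The practical question addressed by our analysis is precisely how large M must be for such an average to be reliable; here M = 200 merely happens to suffice in this toy setting.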

We have studied generalisation under model compression, and the role of model approximability in learning, yielding new insights and new learning algorithms that provably perform well both in their full-precision and approximated forms.

We conducted an in-depth analysis of the statistical hardness of learning with asymmetric label noise under various distributional conditions.

We have identified a parallel between distribution-dependent and loss-and-data-dependent analyses of learning efficiency, which helps bridge the statistical and the machine learning ways of thinking about inference from finite samples. We demonstrated this on a problem of classification under label noise and a problem of multi-output prediction, respectively.

We carried the new understanding gained from our research into the area of optimisation, yielding new and theoretically motivated algorithms for high dimensional black-box optimisation and for noisy optimisation.

My team included two PDRFs employed on the grant. Of these, one is now a Lecturer in Statistical Science at the University of Bristol, and the other is now a Data Scientist at INRIX, after having secured a 3-month secondment to enhance research and knowledge exchange.

I gave 7 invited research talks at international institutions and events, a summer school tutorial, and several departmental seminars at UK institutions. My PDRF gave invited talks at top institutions including Cambridge StatLab, Edinburgh, ENSAE Paris, as well as several invited conference session talks, and fostered lasting connections and networks.

I gave a talk organised by LightOn.ai in Paris, and discussed the transformative future of random matrices and random projections in conjunction with the optical hardware manufactured at the company.
Exploitation Route The published findings from this project have been cited by other academics in the areas of Machine Learning theory and practice, Theoretical Statistics, Biostatistics, AI, Neural networks, hyperspectral imaging, environmental sciences, uncertainty quantification, security and privacy, IoT, and distributionally robust optimisation.
Sectors Digital/Communication/Information Technologies (including Software),Energy,Environment,Healthcare,Transport,Other

URL https://www.cs.bham.ac.uk/~axk/papers_by_yr.htm
 
Title MatLab implementation of heterogeneous weighted ensemble of random projections 
Description This is a MATLAB implementation of the work presented in: Reeve H, Kaban A, Bootkrajang J. Heterogeneous Sets in Dimensionality Reduction and Ensemble Learning. Machine Learning. 2022 Sep 20. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2022 
Open Source License? Yes  
Impact Research code made available for reproducibility. 
URL https://link.springer.com/article/10.1007/s10994-022-06254-0
 
Description DSAI One-Day Meeting: Sampling Techniques in High Dimensional Spaces by Institute for Interdisciplinary Data Science and AI 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Research talk
Title: Taming the dimensionality curse in machine learning: A compressive sampling approach
Abstract: High dimensional learning problems are increasingly prevalent in practice, but typically ill-posed without some fairly specific prior knowledge such as sparsity, margin, intrinsic low dimension, and others. We are interested in the question of how to discover and exploit such benign structural traits of learning problem instances without knowing what they are in advance. In this talk we look at this question through the lens of compressive sampling -- a computationally inexpensive dimensionality reduction method whereby data is observed in a randomly projected form. Compressive sampling is the cornerstone of compressive sensing, and random projection has also been well studied in theoretical computer science for low-distortion embeddings. However, the goal in learning is very different: we do not need to recover, or even to approximate, the seen data; instead we need to produce accurate predictions for unseen data. Our strategy is to develop a general notion of compressive distortion, or compressibility, which yields conditions of a geometric nature that guarantee effective learning from randomly projected data by an ERM algorithm. We instantiate and interpret the compressive distortion in various model settings, demonstrating its ability to capture benign traits of problem instances. Moreover, in generalised linear models, we find distributional conditions under which a compressive ERM ensemble algorithm exhibits near-optimality in a precise statistical (i.e. minimax) sense. Interestingly, the conditions we find have rather little in common with compressive sensing conditions, and they also differ from the classic learning theoretic conditions. The talk draws on joint works with Henry Reeve, and with Bob Durrant.
Year(s) Of Engagement Activity 2021
URL https://www.eventbrite.co.uk/e/dsai-one-day-meeting-sampling-techniques-in-high-dimensional-spaces-t...
 
Description Data Centric Engineering Reading Group - talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Study participants or study members
Results and Impact Research talk given by Henry Reeve (Postdoctoral fellow) on "Optimistic bounds for multi-output prediction" on 6 May 2021 at the Turing Institute (online).
Year(s) Of Engagement Activity 2021
URL https://dce-rg.github.io/
 
Description I organised the ICDM Workshop series on High Dimensional Data Mining (HDM) in conjunction with the IEEE International Conference on Data Mining 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I organised the ICDM Workshop series on High Dimensional Data Mining (HDM) in conjunction with the IEEE International Conference on Data Mining, in each year during the grant.
The URL below is the last edition to date; all previous editions are linked from that page.
Year(s) Of Engagement Activity 2017,2018,2019,2020,2021,2022
URL https://www.cs.bham.ac.uk/~axk/HDM22.htm
 
Description Shonan meeting on Data Dependent Dissimilarity Measures (Co-organiser with Profs. Kai Ming Ting and Takashi Washio) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact The aims of the meeting were the following:
- Discuss recent development in data dependent dissimilarity measures,
- Plan for future research directions in the next 2-5 years, and
- Establish research collaboration towards the planned research
Here is a report of the meeting: https://shonan.nii.ac.jp/docs/No-123.pdf
Year(s) Of Engagement Activity 2018
URL https://shonan.nii.ac.jp/seminars/123/
 
Description Tutorial at the IEEE CIS Summer School on Data-Driven Artificial/Computational Intelligence 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Tutorial talk at the IEEE CIS Summer School on Data-Driven Artificial/Computational Intelligence
Title: Random Projection Meets Learning Theory and Beyond
Abstract: High dimensional problems are increasingly prevalent in machine learning, and often the space of data features or the parameter space has a dimensionality that exceeds the sample size. This tutorial will start by developing some intuition about high dimensional data spaces, and will then focus on a simple yet powerful dimensionality reduction method, Random Projection (RP), that may be used to overcome the curse of dimensionality. RP is oblivious to the data, yet it provides low-distortion guarantees with high probability that depend on the underlying unknown geometric structure of the data. The tutorial will cover some underlying principles of the theory behind RP, and will delve into its effects on generalisation guarantees for subsequent learning tasks. We will also touch upon the use of RP in high dimensional search problems.
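The low-distortion guarantee referred to in the abstract can be demonstrated in a few lines. The sizes below are illustrative, and a Gaussian projection is just one of several constructions with this property:

```python
import numpy as np

rng = np.random.default_rng(42)

n, d, k = 100, 10_000, 500   # target dim k grows with log n, not with d
X = rng.normal(size=(n, d))

R = rng.normal(size=(k, d)) / np.sqrt(k)  # Gaussian random projection
Z = X @ R.T                               # projected data, n x k

def pdist2(A):
    """All pairwise squared Euclidean distances."""
    G = A @ A.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G

# With high probability, every pairwise distance is preserved up to a
# small multiplicative distortion, even though k << d.
mask = ~np.eye(n, dtype=bool)
ratios = pdist2(Z)[mask] / pdist2(X)[mask]
print("worst pairwise distortion:", np.abs(ratios - 1).max())
```

Note that the projection is drawn without looking at X; this data-obliviousness is what makes RP cheap, and it is the starting point for the generalisation questions the tutorial goes on to discuss.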
Year(s) Of Engagement Activity 2021
URL https://sites.google.com/view/ss-ddaci/home