FORGING: Fortuitous Geometries and Compressive Learning

Lead Research Organisation: University of Birmingham
Department Name: School of Computer Science

Abstract

Statistical machine learning has been instrumental in providing algorithms that enable us to draw valid conclusions from empirical data. Its successes rely crucially on a rigorous mathematical theory.
Unfortunately, as modern data sets are increasingly high dimensional, new challenges gathered under the term 'curse of dimensionality' render many of the existing data analysis methods inadequate, questionable, or inefficient, and much of the existing theory becomes uninformative. Mitigating the curse of dimensionality currently receives a great deal of research attention; however, many fundamental questions remain unresolved. The aim of this project is to answer two of these:

Q1: What kinds of data distributions make a given high dimensional learning problem easier or harder to solve?

Q2: What kinds of learning problems can be approximately solved compressively, on a low dimensional subspace?

We propose a stance complementary to efforts that look for ways to counter the various observed detrimental effects of the dimensionality curse: we shall exploit some very generic properties of high dimensional probability spaces to develop a unified theory, together with its algorithmic implications, and to unearth precise conditions that enable us to solve high dimensional problems in low dimensions. These conditions will depend on the geometry of the problem. We will use a new notion of problem-dependent compressive distortion that we have started developing, which builds on a so far unexploited connection between random projections and empirical process theory.

The expected outcome will be applicable across a range of machine learning and data mining problems, and we will validate this in case studies.

Planned Impact

This project is expected to provide answers to two fundamental open questions in high dimensional machine learning and data mining, along with a generic methodology for resolving them in various learning settings. As such, it will provide a new way of thinking about high dimensional data problems that shifts the focus away from case-by-case solutions to the observed detrimental effects of the curse of dimensionality, and is instead based on exploiting some very general properties of high dimensional data spaces. Without this qualitative shift, given the unprecedented increase in the dimensionality of data sets in many areas of science and engineering, we risk losing the performance guarantees that theory can provide for practice.

Researchers and practitioners in data mining and machine learning will benefit from this research and from the generality of our approach. Much research effort has been spent in recent years on mitigating the detrimental effects of the curse of dimensionality. The time has come when it is feasible to build up the theoretical foundations and to eliminate inefficient case-by-case trial-and-error strategies. Fundamental research is essential to achieve this, and this is what we propose to do.

We expect to make societal impact: the project will assign a research student to work on developing the PI's ideas into algorithms for the medical domain, in a collaboration envisaged with Prof. Tom Marshall and colleagues in the School of Health and Population Sciences (College of Medical and Dental Sciences) at the University of Birmingham. By applying our results to the UK national primary care database we expect to contribute to improving early diagnosis of patients, which is an important problem in public health medicine.

High dimensional data problems are ubiquitous in many areas of science, engineering and business, and machine learning is an enabling technology in many of these. Therefore this project will have an indirect impact on all of these; here are some concrete examples: (i) in genomics and proteomics, where high dimensional measurements are made routinely and inexpensively while the number of subjects with a specific condition is limited; (ii) in biomedical imaging and computational neuroscience, where the resolution of measuring devices is ever increasing, and different modalities of measurement are possible and available; (iii) in web, multimedia and market basket analysis, for example customer preference prediction from purchase logs, where more and more sophisticated customer profiles are feasible, with many descriptors, giving rise to ever higher data dimensionality.

Our approach is necessarily cross-disciplinary, as it will integrate results and techniques from several areas of mathematics -- high dimensional probability theory and concentration of measure, empirical process theory, functional analysis, theoretical computer science, computational geometry, information theory, and random matrix theory -- to produce a new analytic strategy able to resolve fundamental issues in machine learning and data mining. Our results will naturally feed back upstream to researchers whose mathematical results we use, and this may stimulate new research at the boundaries between disciplines.
 
Description We developed new uses and new understanding of random projections in statistical learning, including the following:

- A general framework to uncover hidden low-complexity structures that give both generalisation guarantees for compressive learning, and new insights into how learning can succeed in arbitrary high dimensions.

- An analytic strategy to uncover distributional conditions of statistical near-optimality in large random projection ensembles of averaging type.

- An approach to exploit set-heterogeneity structure in high probability uniform bounds. This yielded tightened norm-preservation guarantees for random projections of infinite sets, and also provided a sound justification for heterogeneous weighted ensembles in statistical learning (of which the weighted random projection ensemble is one example).
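To make the averaging-type random projection ensembles above concrete, here is a minimal sketch. The synthetic data, parameter choices, and the plain least-squares base learner are illustrative stand-ins, not the actual methods developed in the project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic high dimensional binary classification problem (d >> k).
n, d, k = 200, 500, 50        # sample size, ambient dim, projected dim
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)       # labels generated by a linear rule

def fit_ls(Z, y):
    """Ridge-regularised least-squares fit in the compressed space."""
    return np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(Z.shape[1]), Z.T @ y)

# Average the scores of M base learners, each trained on an
# independently random-projected copy of the data.
M = 30
scores = np.zeros(n)
for _ in range(M):
    R = rng.normal(size=(k, d)) / np.sqrt(k)  # Gaussian random projection
    Z = X @ R.T                               # compressed data, n x k
    scores += Z @ fit_ls(Z, y)

acc = np.mean(np.sign(scores) == y)
print("ensemble training accuracy:", acc)
```

In the weighted heterogeneous variants studied in the project, the uniform average above would be replaced by member-specific weights; each base learner still only ever sees a k-dimensional view of the d-dimensional data.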

We determined the number of random projections required for a random matrix theoretic covariance regulariser to be usable in practice.
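To convey the flavour of such a regulariser, here is a Monte Carlo sketch in the spirit of the Marzetta-Tucci-Simon construction: the inverse of a singular sample covariance is replaced by an average of compressive inverses. The dimensions and the number of projections below are illustrative guesses, not the values derived in our analysis:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, k = 30, 100, 20    # n < d, so the sample covariance is singular
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n        # rank-deficient sample covariance

# Each compressive inverse R^T (R S R^T)^{-1} R is well defined whenever
# k <= rank(S); averaging over many random projections R yields an
# invertible, full-rank precision-matrix estimate.
M = 200
Omega = np.zeros((d, d))
for _ in range(M):
    R = rng.normal(size=(k, d)) / np.sqrt(d)
    Omega += R.T @ np.linalg.solve(R @ S @ R.T, R) / M

print("rank of S:", np.linalg.matrix_rank(S))         # below d
print("rank of estimate:", np.linalg.matrix_rank(Omega))
```

The practical question addressed by our analysis is precisely how large M must be for such an average to be reliable; here M = 200 merely happens to suffice in this toy setting.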

We have studied generalisation under model compression, and the role of model approximability in learning, yielding new insights and new learning algorithms that provably perform well both in their full-precision and approximated forms.

We conducted an in-depth analysis of the statistical hardness of learning with asymmetric label noise under various distributional conditions.

We have identified a parallel between distribution-dependent and loss-and-data-dependent analyses of learning efficiency, which helps bridge the statistical and the machine learning ways of thinking about inference from finite samples. We demonstrated this on a problem of classification under label noise and a problem of multi-output prediction, respectively.

We carried the new understanding gained from our research into the area of optimisation, yielding new and theoretically motivated algorithms for high dimensional black-box optimisation and for noisy optimisation.

My team included two PDRFs employed on the grant. Of these, one is now a Lecturer in Statistical Science at the University of Bristol, and the other is now a Data Scientist at INRIX, after having secured a 3-month secondment to enhance research and knowledge exchange.

I gave 7 invited research talks at international institutions and events, a summer school tutorial, and several departmental seminars at UK institutions. My PDRF gave invited talks at top institutions including Cambridge StatLab, Edinburgh, ENSAE Paris, as well as several invited conference session talks, and fostered lasting connections and networks.

I gave a talk organised by LightOn.ai in Paris, and discussed the transformative future of random matrices and random projections in conjunction with the optical hardware manufactured at the company.
Exploitation Route The published findings from this project have been cited by other academics in the areas of Machine Learning theory and practice, Theoretical Statistics, Biostatistics, AI, Neural networks, hyperspectral imaging, environmental sciences, uncertainty quantification, security and privacy, IoT, and distributionally robust optimisation.
Sectors Digital/Communication/Information Technologies (including Software),Energy,Environment,Healthcare,Transport,Other

URL https://www.cs.bham.ac.uk/~axk/papers_by_yr.htm
 
Title MatLab implementation of heterogeneous weighted ensemble of random projections 
Description This is a MATLAB implementation of the work presented in: Reeve H, Kaban A, Bootkrajang J. Heterogeneous Sets in Dimensionality Reduction and Ensemble Learning. Machine Learning. 2022 Sep 20. 
Type Of Technology New/Improved Technique/Technology 
Year Produced 2022 
Open Source License? Yes  
Impact Research code made available for reproducibility. 
URL https://link.springer.com/article/10.1007/s10994-022-06254-0
 
Description DSAI One-Day Meeting: Sampling Techniques in High Dimensional Spaces by Institute for Interdisciplinary Data Science and AI 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Research talk
Title: Taming the dimensionality curse in machine learning: A compressive sampling approach
Abstract: High dimensional learning problems are increasingly prevalent in practice, but typically ill-posed without some fairly specific prior knowledge such as sparsity, margin, intrinsic low dimension, and others. We are interested in the question of how to discover and exploit such benign structural traits of learning problem instances without knowing what they are in advance. In this talk we look at this question through the lens of compressive sampling -- a computationally inexpensive dimensionality reduction method whereby data is observed in a randomly projected form. Compressive sampling is the cornerstone of compressive sensing, and random projection has also been well studied in theoretical computer science for low-distortion embeddings. However, the goal in learning is very different: we do not need to recover, or even to approximate, the seen data; instead we need to produce accurate predictions for unseen data. Our strategy is to develop a general notion of compressive distortion, or compressibility, which yields conditions of a geometric nature that guarantee effective learning from randomly projected data by an ERM algorithm. We instantiate and interpret the compressive distortion in various model settings, demonstrating its ability to capture benign traits of problem instances. Moreover, in generalised linear models, we find distributional conditions under which a compressive ERM ensemble algorithm exhibits near-optimality in a precise statistical (i.e. minimax) sense. Interestingly, the conditions we find have rather little in common with compressive sensing conditions, and they also differ from the classic learning theoretic conditions. The talk draws on joint works with Henry Reeve, and with Bob Durrant.
Year(s) Of Engagement Activity 2021
URL https://www.eventbrite.co.uk/e/dsai-one-day-meeting-sampling-techniques-in-high-dimensional-spaces-t...
 
Description Data Centric Engineering Reading Group - talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Study participants or study members
Results and Impact Research talk given by Henry Reeve (Postdoctoral fellow) on "Optimistic bounds for multi-output prediction" on 6 May 2021 at the Turing Institute (online).
Year(s) Of Engagement Activity 2021
URL https://dce-rg.github.io/
 
Description I organised the ICDM Workshop series on High Dimensional Data Mining (HDM) in conjunction with the IEEE International Conference on Data Mining 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact I organised the ICDM Workshop series on High Dimensional Data Mining (HDM) in conjunction with the IEEE International Conference on Data Mining, in each year during the grant.
The URL below is the last edition to date; all previous editions are linked from that page.
Year(s) Of Engagement Activity 2017,2018,2019,2020,2021,2022
URL https://www.cs.bham.ac.uk/~axk/HDM22.htm
 
Description Shonan meeting on Data Dependent Dissimilarity Measures (Co-organiser with Profs. Kai Ming Ting and Takashi Washio) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact The aims of the meeting were the following:
- Discuss recent development in data dependent dissimilarity measures,
- Plan for future research directions in the next 2-5 years, and
- Establish research collaboration towards the planned research
Here is a report of the meeting: https://shonan.nii.ac.jp/docs/No-123.pdf
Year(s) Of Engagement Activity 2018
URL https://shonan.nii.ac.jp/seminars/123/
 
Description Tutorial at the IEEE CIS Summer School on Data-Driven Artificial/Computational Intelligence 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Tutorial talk at the IEEE CIS Summer School on Data-Driven Artificial/Computational Intelligence
Title: Random Projection Meets Learning Theory and Beyond
Abstract: High dimensional problems are increasingly prevalent in machine learning, and often the space of data features or the parameter space has a dimensionality that exceeds the sample size. This tutorial will start by developing some intuition about high dimensional data spaces, and will then focus on a simple yet powerful dimensionality reduction method, Random Projection (RP), that may be used to overcome the curse of dimensionality. RP is oblivious to the data, yet it provides low-distortion guarantees with high probability that depend on the underlying unknown geometric structure of the data. The tutorial will cover some underlying principles of the theory behind RP, and will delve into its effects on generalisation guarantees for subsequent learning tasks. We will also touch upon the use of RP in high dimensional search problems.
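The low-distortion guarantee referred to in the abstract can be demonstrated in a few lines. The sizes below are illustrative, and a Gaussian projection is just one of several constructions with this property:

```python
import numpy as np

rng = np.random.default_rng(42)

n, d, k = 100, 10_000, 500   # target dim k grows with log n, not with d
X = rng.normal(size=(n, d))

R = rng.normal(size=(k, d)) / np.sqrt(k)  # Gaussian random projection
Z = X @ R.T                               # projected data, n x k

def pdist2(A):
    """All pairwise squared Euclidean distances."""
    G = A @ A.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G

# With high probability, every pairwise distance is preserved up to a
# small multiplicative distortion, even though k << d.
mask = ~np.eye(n, dtype=bool)
ratios = pdist2(Z)[mask] / pdist2(X)[mask]
print("worst pairwise distortion:", np.abs(ratios - 1).max())
```

Note that the projection is drawn without looking at X; this data-obliviousness is what makes RP cheap, and it is the starting point for the generalisation questions the tutorial goes on to discuss.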
Year(s) Of Engagement Activity 2021
URL https://sites.google.com/view/ss-ddaci/home