Understanding Generalization of Interpolators

Lead Research Organisation: University of Oxford
Department Name: Statistics

Abstract

When a machine learning system recommends to a radiologist whether or not a particular scan contains a tumor, the algorithm behind the system makes its decision based on the data it has previously seen. The scan at hand is completely new - the algorithm has never seen it before, only scans similar to it. The ability to make predictions on new, previously unseen data, based on known training data, is called generalization. Understanding when machine learning systems generalize well, and when they do not, is a fundamental problem of today's research, with implications across society.
Neural networks in particular often exhibit surprising generalization phenomena, for example acceptable generalization even when trained to zero training error on randomly corrupted data (Zhang et al., 2017). A machine learning model that fits the training data perfectly is called an interpolator. Motivated by the empirical success and generalization performance of interpolating neural networks, understanding when and why interpolating methods generalize well has recently become a topic of interest in statistical learning theory. However, systematically connecting interpolating methods to achievable notions of optimality has received only partial attention.
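As a minimal, hypothetical sketch of this phenomenon (not the experiment of Zhang et al., which used deep networks on image data), an overparametrized linear model on synthetic Gaussian features is already enough to drive the training error to zero on purely random labels:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 200                                 # more parameters than samples
X = rng.standard_normal((n, d))                # synthetic features
y_random = rng.integers(0, 2, n) * 2.0 - 1.0   # completely random +/-1 "labels"

# With d > n the linear system Xw = y_random is underdetermined, so an
# exact solution exists; lstsq returns one (the minimum-norm one).
w = np.linalg.lstsq(X, y_random, rcond=None)[0]

train_error = np.mean(np.sign(X @ w) != y_random)
print(train_error)  # 0.0 - the model interpolates even pure noise
```

Since the labels carry no signal, this perfect training fit says nothing about test performance, which is exactly why interpolation makes classical uniform-convergence arguments uninformative here.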
This area of research has established results for a variety of models. For instance, in kernel regression, Liang and Rakhlin (2020) provide a data-dependent upper bound on the generalization performance of the minimum-norm interpolator. By analyzing this bound, they show that the minimum-norm interpolator achieves small generalization error in a regime with favourable curvature of the kernel, a particular decay of the eigenvalues of the kernel and data population covariance matrices and, importantly, in an overparametrized setting. In random features regression, Mei and Montanari (2019) showed that for a large signal-to-noise ratio, in the limit of large overparametrization, the optimal regularization is zero - i.e. the optimal ridge regressor is an interpolator. Liang and Sur (2020) characterized the precise high-dimensional asymptotic generalization of interpolating minimum-l1-norm classifiers and of boosting algorithms that maximize the l1 margin. Bartlett et al. (2020) isolated a setting of benign overfitting in linear regression, dependent on notions of effective rank of the population covariance matrix, in which the minimum-norm interpolator has small generalization error. Again, this regime of benign overfitting occurs under large overparametrization.
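To make the recurring objects above concrete, here is a small numerical sketch (with made-up Gaussian data, not any of the cited settings) of the minimum-norm interpolator in overparametrized linear regression, and of the fact that ridge regression recovers it in the limit of vanishing regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # overparametrized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolator: among all w with Xw = y, the one with the
# smallest l2 norm, given by the Moore-Penrose pseudoinverse.
w_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_min, y)      # zero training error: it interpolates

# Ridge regression in its dual form w = X^T (X X^T + lam I)^{-1} y;
# as lam -> 0 this converges to the minimum-norm interpolator.
lam = 1e-8
w_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
print(np.linalg.norm(w_ridge - w_min))  # tiny: ridge -> min-norm as lam -> 0
```

The dual form is used because `X @ X.T + lam * np.eye(n)` is an n-by-n, well-conditioned system even when the primal matrix `X.T @ X + lam * np.eye(d)` is nearly singular for small lam.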
A major focus of the interpolation literature has so far been to theoretically study if and when interpolating methods based on classical techniques such as ridge regression and gradient descent can have optimal or near-optimal generalization (Bartlett, Montanari, and Rakhlin, 2021).
The focus of my proposed project is to study what is the optimal way to interpolate, in terms of generalization, for overparametrized machine learning models, using procedures that exploit neither the realization of the data-generating process nor the realization of the noise. By designing new ways to interpolate that are directly tied to notions of optimality, we hope to provide a stepping stone towards designing new, better ways to do machine learning in general.
This project falls within the EPSRC Mathematical Sciences theme.

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/T517811/1                                   01/10/2020  30/09/2025
2595528            Studentship   EP/T517811/1  01/10/2021  31/12/2026  Eduard Oravkin