Generalisation and expressiveness for over-parameterised neural networks

Lead Research Organisation: University of Oxford
Department Name: Statistics

Abstract

More than half a century has passed since the mathematical analysis of learning processes truly began, when F. Rosenblatt proposed the first model of a learning machine, the Perceptron. Since then, enormous progress has been made in learning theory, from both theoretical and practical perspectives.
However, the development of deep neural networks, which perform far better than any theoretical expectation, has highlighted the lack of a complete mathematical theory capable of giving clear insight into the mechanisms underlying the learning process.
The basic idea behind learning theory is to quantify the generalisation capacity of an algorithm, that is, its ability to extrapolate, from a finite and relatively small training dataset, an unknown rule that coherently classifies data which are not part of the original training set.
A learning algorithm is usually evaluated through a loss function, which measures the discrepancy between the correct answer and the answer given by the learning machine. The expectation of this function over the whole data distribution is called the risk functional. In practice, however, there is no way to evaluate it directly, and one has to work with its approximation computed on the training set alone, the empirical risk.
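In symbols, writing h for a hypothesis, \ell for the loss, \mathcal{D} for the (unknown) data distribution and (x_1, y_1), \dots, (x_n, y_n) for the training sample (a generic notation chosen here for illustration), the risk functional and the empirical risk read

\[
R(h) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\ell(h(X), Y)\big],
\qquad
\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i).
\]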
An effective learning theory should be able to give tight probability bounds on the difference between the risk functional and the empirical risk, the generalisation error.
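Such a guarantee typically takes the following form: for any confidence level \delta \in (0,1), with probability at least 1 - \delta over the draw of the training sample,

\[
R(h) \le \hat{R}_n(h) + \varepsilon(n, \delta),
\]

where the complexity term \varepsilon(n, \delta) should be small for the bound to be informative.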
The Vapnik-Chervonenkis theory, which was the main direction of theoretical work for about three decades (from the 1960s to the 1990s), cannot explain the high generalisation capacity of modern deep neural networks, where the machine fits many more parameters than there are available training data.
Among the theories developed in the last 20 years, of particular interest is the PAC-Bayes approach, PAC standing for Probably Approximately Correct. The main idea in PAC-Bayes is to consider a probability distribution (the prior) on the hypothesis space (the space encoding all the possible "classification rules" that the algorithm might choose) which is independent of the training data. Another distribution (the posterior) is then chosen after seeing the data and, thanks to a well-known result by McAllester, it is possible to bound the distance between the empirical risk and the risk functional in terms of the KL divergence between the posterior and the prior.
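For concreteness, one common form of McAllester's bound (stated here for losses taking values in [0,1]; the exact constants vary across versions) says that, with probability at least 1 - \delta over the training sample, simultaneously for all posteriors Q,

\[
\mathbb{E}_{h\sim Q}\big[R(h)\big]
\le
\mathbb{E}_{h\sim Q}\big[\hat{R}_n(h)\big]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
\]

where P is the prior, n the sample size and \mathrm{KL} the Kullback-Leibler divergence.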
In recent work, Dziugaite and Roy have shown that, by optimising a PAC-Bayes bound, it is possible to compute nonvacuous numerical bounds on the generalisation error.
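The sketch below illustrates the general idea of treating a PAC-Bayes bound as a training objective. It is a minimal, hypothetical example (a diagonal Gaussian posterior over the weights of a tiny network, a surrogate cross-entropy loss, and a McAllester-style complexity term), not Dziugaite and Roy's actual procedure, and all names and hyperparameters are illustrative.

```python
# Minimal sketch (illustrative only): minimise a McAllester-style PAC-Bayes
# objective over a diagonal Gaussian posterior Q on the weights of a small
# stochastic network. The prior P is a zero-mean Gaussian with fixed std.
import math
import torch
import torch.nn.functional as F

n = 1000          # training-set size (assumed)
delta = 0.05      # confidence parameter
d_in, d_h = 20, 50
sigma_p = 0.1     # prior standard deviation (assumed)

# Posterior parameters: mean and pre-softplus std for each weight matrix.
params = {
    "w1_mu": torch.randn(d_in, d_h) * 0.1,
    "w1_rho": torch.full((d_in, d_h), -3.0),
    "w2_mu": torch.randn(d_h, 1) * 0.1,
    "w2_rho": torch.full((d_h, 1), -3.0),
}
for p in params.values():
    p.requires_grad_(True)

def kl_to_prior():
    """KL(Q || P) for diagonal Gaussians with prior N(0, sigma_p^2 I)."""
    kl = 0.0
    for name in ("w1", "w2"):
        mu, sigma = params[name + "_mu"], F.softplus(params[name + "_rho"])
        kl = kl + 0.5 * torch.sum(
            (sigma ** 2 + mu ** 2) / sigma_p ** 2
            - 1.0
            - 2.0 * torch.log(sigma / sigma_p)
        )
    return kl

def stochastic_forward(x):
    """Sample weights from Q (reparameterisation trick) and run the network."""
    w1 = params["w1_mu"] + F.softplus(params["w1_rho"]) * torch.randn(d_in, d_h)
    w2 = params["w2_mu"] + F.softplus(params["w2_rho"]) * torch.randn(d_h, 1)
    return torch.relu(x @ w1) @ w2

def pac_bayes_objective(x, y):
    """Surrogate empirical risk plus a McAllester-style complexity term."""
    emp_risk = F.binary_cross_entropy_with_logits(stochastic_forward(x), y)
    complexity = torch.sqrt(
        (kl_to_prior() + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    )
    return emp_risk + complexity

# Toy data and a few optimisation steps on the bound itself.
x = torch.randn(n, d_in)
y = (x[:, :1] > 0).float()
opt = torch.optim.Adam(params.values(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = pac_bayes_objective(x, y)
    loss.backward()
    opt.step()
```

The point, as in Dziugaite and Roy's work, is that the complexity term is differentiable in the posterior parameters, so the bound itself can be minimised by gradient descent and then evaluated numerically on the optimised posterior.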
The main goal of this doctoral project is to use the PAC-Bayes framework to obtain new theoretical insights leading to a better understanding of the learning mechanisms underlying deep neural networks.

Publications


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/N509711/1                                      01/10/2016   30/09/2021
2278529             Studentship    EP/N509711/1   01/10/2019   21/04/2023   Eugenio Clerico
EP/R513295/1                                      01/10/2018   30/09/2023
2278529             Studentship    EP/R513295/1   01/10/2019   21/04/2023   Eugenio Clerico