# Optimization methods for deep learning: training and testing in machine learning

Lead Research Organisation:
University of Oxford

Department Name: Industrially Focused Maths Modelling CDT

### Abstract

This project falls within the EPSRC numerical analysis research area.

This project is undertaken with Industrial Partner NAG.

Optimization problems form the modelling and numerical core of machine learning and statistical methodologies, such as the training of supervised machine learning classifiers. In such applications, a large amount of data (feature vectors) is available that has already been classified (i.e. labelled). A parameterised classifier is then selected and trained on this data: values of the parameters are calculated so that the output of the classifier at the given features matches their labels in some optimal way. The resulting classifier is then used for testing, in order to label/classify unseen data. The training problem is formulated as an optimization problem that minimises, for example, the average number of errors that the classifier makes on the training set. Various formulations of the optimization problem are used, most commonly taking some continuous loss function to measure the error at each data point and either a (deterministic) finite sum or a (probabilistic) expectation of the loss terms as the total error; the latter may be convex (as in binary classification) but, increasingly, it is nonconvex owing to the prevalence of deep learning applications. The scale of the resulting optimization problem is commonly huge, with millions of parameters and millions of terms in the objective sum. This makes the calculation of even a single function or gradient value prohibitively expensive and leads to the need for inexact optimization algorithms that effectively exploit problem structure. The practical method of choice in ML applications is the (batch) stochastic gradient method (Robbins-Monro, 1951), which computes the gradients of only a small, randomly chosen subset of the loss terms and controls both variance and convergence by means of a predefined stepsize.
A grand challenge in this area is how to augment stochastic gradient methods with inexact second-order derivative information, so as to obtain more efficient methods, especially in the nonconvex case of deep learning, in terms of both higher accuracy and robustness to ill-conditioning.
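As an illustration of the batch stochastic gradient method described above, the following is a minimal sketch on a synthetic finite-sum least-squares problem; the quadratic loss, function names, and parameter values are illustrative assumptions, not part of the project:

```python
import numpy as np

def minibatch_sgd(A, b, stepsize=0.1, batch_size=8, iters=500, seed=0):
    """Minimise the finite-sum loss f(x) = (1/n) * sum_i (a_i^T x - b_i)^2
    using mini-batch stochastic gradient steps with a fixed (predefined) stepsize."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=batch_size, replace=False)   # random mini-batch
        # gradient of the sampled loss terms only: (2/|B|) * A_B^T (A_B x - b_B)
        grad = 2.0 / batch_size * A[idx].T @ (A[idx] @ x - b[idx])
        x -= stepsize * grad
    return x

# tiny synthetic problem: recover x_true from noiseless linear measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = minibatch_sgd(A, b)
```

Each iteration touches only `batch_size` of the `n` loss terms, which is the source of the method's low per-iteration cost on large-scale problems.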

In this project, we will investigate ways to approximate second-order information in the finite-sum structure of ML optimization problems, from subsampling second-order derivatives to approximating them by differences of batch gradients, such as in block (stochastic) quasi-Newton approaches and Gauss-Newton methods. In place of the usual predefined stepsize, we will consider the impact of more sophisticated stepsize techniques from classical optimization that adapt to the local optimization landscape, such as variable/adaptive trust-region radius, regularization and line search. We will also consider preconditioning techniques that incorporate inexpensive second-derivative information so as to improve the performance of first-order methods. Finally, we will investigate parallel and decentralised implementations of these methods, which is a challenge especially for higher-order techniques.
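One concrete instance of these ideas is a subsampled Newton step with regularization, sketched below under illustrative assumptions (a least-squares loss, independent gradient and Hessian mini-batches, and a fixed regularization parameter; all names and values are hypothetical):

```python
import numpy as np

def subsampled_newton(A, b, reg=1e-3, grad_batch=32, hess_batch=32, iters=50, seed=0):
    """Subsampled Newton for f(x) = (1/n) * sum_i (a_i^T x - b_i)^2.
    The gradient is estimated on one mini-batch and the Hessian on an
    independent mini-batch; the term reg*I guards against ill-conditioned
    (or, for nonconvex losses, indefinite) subsampled Hessians."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        gi = rng.choice(n, size=grad_batch, replace=False)
        hi = rng.choice(n, size=hess_batch, replace=False)
        g = 2.0 / grad_batch * A[gi].T @ (A[gi] @ x - b[gi])   # batch gradient
        H = 2.0 / hess_batch * A[hi].T @ A[hi]                 # subsampled Hessian
        x += np.linalg.solve(H + reg * np.eye(d), -g)          # regularised Newton step
    return x

# same synthetic recovery problem as a quick check
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = subsampled_newton(A, b)
```

The curvature information lets the step length adapt to the local landscape automatically, in contrast to the predefined stepsize of plain stochastic gradient; in large-scale practice the dense solve would be replaced by inexact (e.g. conjugate-gradient or quasi-Newton) approximations.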

Potential outcomes:

(i) State-of-the-art deep learning and optimization formulations and methods for machine learning.

(ii) Novel optimization methods that use inexact (deterministic/stochastic) problem information.

(iii) Evaluation of methods for deep neural net training and testing.

## People

Constantin Metherall Puiu (Student)

### Studentship Projects

Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/R513295/1 | | | 01/10/2018 | 30/09/2023 |
2282418 | Studentship | EP/R513295/1 | 01/10/2019 | 30/09/2023 | Constantin Metherall Puiu