# Optimization for Machine Learning

Lead Research Organisation: University of Oxford

Department Name: Computer Science

### Abstract

This project falls within the EPSRC Artificial Intelligence research area, and thereby within the Information and Communication Technologies (ICT) research theme.

Machine learning is currently a very active area of research. It is a pervasive field that has found applications in almost every area of knowledge, which means that new ideas, or improvements to the current state of the art, can have a rapid impact in many areas. Many heuristic algorithms are used in practice, and a major challenge is understanding why and in which contexts these heuristics work. Such an understanding would allow one, among other things, to select better algorithms for a particular problem, would facilitate the design of new algorithms that improve on the current state of the art, or could alternatively yield guarantees showing that no improvement is possible within a certain regime.

The aim of this project is to research open problems in optimization for machine learning. Unlike in other branches of optimization, the interest here is in computing only an approximation of the optimum value of the functions to be optimized. Moreover, these functions have a particular structure. A notable example is the training of neural networks in supervised machine learning. Here, one is interested in learning (minimizing) a function over a labeled instance space, computed as an expectation of a loss function under an unknown distribution from which samples can be drawn. Usually one takes the Empirical Risk Minimization (ERM) approach and approximately minimizes the average of the loss function over several samples. Getting arbitrarily close to a global optimum is impractical. Naturally, there arises the problem of finding algorithms that provably find, with high probability, local minima of smooth, possibly nonconvex functions with smooth Hessians. It is an open problem to give an algorithm whose complexity matches a lower bound, both in the case in which one has a deterministic oracle that returns function and gradient evaluations at a point, and in the case in which one has a stochastic oracle, which can be modeled as the deterministic oracle plus zero-mean noise with bounded variance.

The stochastic oracle model is particularly important, since it can model the computation of a noisy gradient obtained by summing the gradients of a random subset of the functions comprising the ERM objective, which is convenient for efficiency and can be implemented in a distributed manner. In particular, algorithms whose complexity does not depend on the number of samples in ERM can be obtained using the stochastic oracle.
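To make the oracle models concrete, here is a minimal sketch (all names, the least-squares losses, and the constants are hypothetical illustration choices, not part of the project): the deterministic oracle returns the exact ERM gradient, while the stochastic oracle returns an unbiased gradient estimate from a random minibatch, whose per-step cost is independent of the sample count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ERM instance: least-squares losses f_i(w) = 0.5 * (x_i @ w - y_i)^2
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    """Deterministic oracle: exact gradient of the ERM objective (cost grows with n)."""
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w, batch_size=32):
    """Stochastic oracle: unbiased gradient from a random subset of samples.
    It equals the deterministic oracle plus zero-mean, bounded-variance noise."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return X[idx].T @ (X[idx] @ w - y[idx]) / batch_size

# Plain SGD driven by the stochastic oracle; the cost per step does not depend on n.
w = np.zeros(d)
for _ in range(2000):
    w -= 0.01 * stochastic_gradient(w)
```

On this well-conditioned toy problem the iterate ends up close to the planted parameter vector, illustrating how the stochastic oracle suffices even though no step ever touches the full dataset.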

The methodology will make use of new theoretical tools and insights developed in this branch of optimization. Some examples are: a better understanding of acceleration phenomena via a coupling of dual and primal algorithms; new solutions for the online eigenvector problem that allow for faster identification of negative curvature of functions and fast detection and avoidance of saddle points; and noise analysis to transform iterative optimization methods that use Hessian-vector products into methods that use first-order information only, without deteriorating the complexity of the original method.
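As a concrete illustration of the last two ideas combined, here is a minimal sketch (the quadratic `A`, the shift `c`, and all function names are hypothetical choices for this sketch): negative curvature at a saddle point is found by power iteration on the shifted operator cI - H, with each Hessian-vector product approximated using first-order information only, via a finite difference of gradients.

```python
import numpy as np

# Hypothetical nonconvex quadratic f(w) = 0.5 * w^T A w with a saddle at the
# origin: A has one negative eigenvalue (-0.5), so e3 is a descent direction.
A = np.diag([1.0, 2.0, -0.5])

def grad(w):
    return A @ w

def hvp(w, v, eps=1e-5):
    """Hessian-vector product from first-order information only:
    H v is approximated by (g(w + eps*v) - g(w - eps*v)) / (2*eps)."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def negative_curvature_direction(w, c=3.0, iters=200, seed=0):
    """Power iteration on the shifted operator (c*I - H): for c above the top
    eigenvalue of H, its leading eigenvector is the eigenvector of H with the
    smallest eigenvalue, i.e. the direction of most negative curvature."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = c * v - hvp(w, v)
        v /= np.linalg.norm(v)
    lam = v @ hvp(w, v)  # Rayleigh quotient: estimated smallest eigenvalue of H
    return v, lam

v, lam = negative_curvature_direction(np.zeros(3))
```

Here `lam` comes out near -0.5 and `v` aligns with the negative-curvature eigenvector, so a step along ±v escapes the saddle, all without ever forming the Hessian explicitly.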

My main collaborators are my supervisor and co-supervisor: Varun Kanade and Patrick Rebeschini.

### Studentship Projects

Project Reference | Relationship | Related To | Start | End | Student Name
---|---|---|---|---|---
EP/N509711/1 | | | 01/10/2016 | 30/09/2021 |
2053152 | Studentship | EP/N509711/1 | 01/10/2018 | 30/09/2022 | David Martinez