Bayesian Optimization in Structured Domains

Lead Research Organisation: University of Cambridge
Department Name: Engineering

Abstract

Motivation
Bayesian optimization (BO) is a technique from machine learning and statistics that has recently gained considerable popularity for the gradient-free optimization of complex and costly-to-evaluate black-box objective functions, a problem of central interest in a wide range of application domains, including robotics, environmental monitoring, automated machine learning, and drug design. Most of the existing literature on BO focuses on modeling and optimizing objective functions defined on a box-constrained subset of the real coordinate space. However, many real-world optimization problems in applied mathematics, engineering, and the natural sciences involve highly complex and structured input spaces. Despite its practical significance, efficient black-box optimization over structured domains has received little attention and remains a central open challenge, largely because of the combinatorial explosion of the input space. While BO has recently been applied successfully to fully discrete functions and to mixed discrete/continuous functions, it is not clear how to deal with more richly structured, potentially high-dimensional inputs such as graphs, images, or text, which are ubiquitous in real-world problems.

Goals and Proposed Approach
Our goal is to develop a novel, principled BO algorithm for efficient optimization over structured input domains. To this end, we follow recent work on automatic chemical design and propose to employ a probabilistic generative model, e.g. a variational autoencoder (VAE), that learns a low-dimensional continuous latent representation of the structured inputs (e.g. protein sequences), in conjunction with a predictive model mapping from this latent representation to a target property to be optimized (e.g. fluorescence). BO is then performed in latent space, using the predictive model as a surrogate of the objective. For this approach to be effective, we need to address the following challenges:
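The overall loop can be sketched as follows. This is a minimal toy illustration, not the proposed method: `decode` and `objective` are placeholders (standing in for, e.g., a trained VAE decoder and a laboratory fluorescence measurement), and the surrogate is a plain Gaussian-process regressor with an RBF kernel and an upper-confidence-bound (UCB) acquisition.

```python
# Toy sketch: Bayesian optimization in a continuous latent space.
import numpy as np

rng = np.random.default_rng(0)
DIM = 2  # latent dimensionality (illustrative)

def decode(z):
    # Placeholder decoder: identity map standing in for latent -> structure.
    return z

def objective(x):
    # Toy costly black-box property to maximize (optimum at x = 0.5).
    return -np.sum((x - 0.5) ** 2)

def rbf(A, B, ls=0.3):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(Z, y, Zq, noise=1e-4):
    # Standard GP regression (zero prior mean, unit prior variance).
    L = np.linalg.cholesky(rbf(Z, Z) + noise * np.eye(len(Z)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(Z, Zq)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

# Initial design: a few random latent points with known objective values.
Z = rng.uniform(0, 1, size=(5, DIM))
y = np.array([objective(decode(z)) for z in Z])

for _ in range(15):
    cand = rng.uniform(0, 1, size=(256, DIM))   # candidate latent points
    mu, sigma = gp_posterior(Z, y - y.mean(), cand)
    ucb = (mu + y.mean()) + 2.0 * sigma          # UCB acquisition
    z_next = cand[np.argmax(ucb)]                # most promising latent point
    Z = np.vstack([Z, z_next])
    y = np.append(y, objective(decode(z_next)))  # costly evaluation
```

The surrogate stands in for the predictive model described above; in the proposed method, each evaluation would decode a latent point to a structured input before measuring the target property.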

1. Generative Model Design. We need to design a generative model whose latent space is as informative as possible for the optimization. We aim to exploit effective generative models specifically designed for certain structured input types, e.g. graphs, DNA sequences, or more general discrete data following a formal grammar. We must also ensure that the model does not underfit, i.e., ignore the data and merely replicate the prior distribution (e.g. a zero-mean Gaussian), which would render the latent encoding meaningless. We aim to leverage recent advances in using Hamiltonian Monte Carlo (HMC) to sample from the latent space. To estimate the uncertainty of the generative model when decoding from the latent space (which is needed for BO), we propose a Bayesian treatment of the decoder model, e.g. via stochastic gradient HMC (SGHMC), resulting in a Bayesian generative model.
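The role of the Bayesian decoder can be illustrated with a toy model. Here the posterior weight samples are drawn i.i.d. purely as a stand-in for draws that SGHMC would produce, and the linear "decoder" is a placeholder for a real network; the point is only that the spread of decodings across posterior samples yields an epistemic uncertainty estimate.

```python
# Toy sketch: decoder uncertainty from posterior samples of its weights.
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, OUT_DIM, K = 2, 4, 50

# Stand-in for K posterior samples of the decoder weight matrix
# (SGHMC would produce such samples from the true weight posterior).
W_samples = rng.normal(0.0, 0.1, size=(K, OUT_DIM, LATENT_DIM))
W_samples += np.eye(OUT_DIM, LATENT_DIM)  # shared posterior mean

def decode_with_uncertainty(z):
    # Decode z under every posterior weight sample; the spread across
    # samples quantifies the decoder's epistemic uncertainty at z.
    outs = np.stack([W @ z for W in W_samples])
    return outs.mean(axis=0), outs.std(axis=0)

mean_near, std_near = decode_with_uncertainty(np.zeros(LATENT_DIM))
mean_far, std_far = decode_with_uncertainty(4.0 * np.ones(LATENT_DIM))
# In this linear toy model the uncertainty grows with distance from the
# latent origin, i.e. far from where the prior places its mass.
```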

2. Bayesian Optimization Algorithm. For the BO procedure, we aim to adapt effective and robust methods developed for continuous input spaces to our setting. In particular, we must prevent the optimization from progressing into regions of the input space far from the training data, as this might result in latent points that decode to invalid or undesired (i.e., semantically and/or syntactically meaningless) inputs. At the same time, we want to find inputs that are sufficiently novel (i.e., not too close to anything observed so far) and achieve a high value of the target objective. To manage this trade-off as effectively as possible, we aim to devise an acquisition function that guides the optimization using not only the uncertainty estimates of the predictive surrogate model, but also the uncertainty estimates of the generative model when decoding from the latent space.
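One simple way to realize this trade-off, shown here purely as an illustration rather than the proposed acquisition, is a UCB-style score on the predictive surrogate discounted by the decoder's uncertainty, so that the search avoids latent regions likely to decode to meaningless inputs. The weights `beta` and `gamma` are illustrative assumptions.

```python
# Toy sketch: acquisition combining surrogate and decoder uncertainty.
import numpy as np

def acquisition(mu, sigma_surrogate, sigma_decoder, beta=2.0, gamma=1.0):
    # Reward high predicted value and surrogate uncertainty (novelty),
    # penalize decoder uncertainty (risk of invalid decodings).
    return mu + beta * sigma_surrogate - gamma * sigma_decoder

# Two hypothetical candidates: equally promising under the surrogate,
# but the decoder is far less certain about the second one.
mu = np.array([0.8, 0.8])
s_sur = np.array([0.1, 0.1])
s_dec = np.array([0.05, 0.60])
scores = acquisition(mu, s_sur, s_dec)
picked = int(np.argmax(scores))  # -> 0, the candidate safe to decode
```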

Publications


Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/R513180/1                                      01/10/2018   30/09/2023
2196582             Studentship    EP/R513180/1   04/01/2019   01/01/2022   Erik Daxberger