Towards Robust, Explainable and Interpretable Foundation Models for Chemistry
Lead Research Organisation: Imperial College London
Department Name: Mathematics
Abstract
Large language models, such as the GPT series, have revolutionised the field of natural language processing. One of the key reasons for the success of these transformer-based models is their scalability: there is empirical evidence that scaling transformers, by increasing model parameters, dataset size and compute, leads to consistent reductions in test loss. Furthermore, these foundation models are highly generalisable: by pretraining on a large unlabelled corpus, one can finetune on a downstream task that may have only a small number of labelled examples.
These features of foundation models are highly attractive in the chemistry domain, in particular for molecular property prediction, where the goal is to predict a specific property of a molecule, such as toxicity or solubility, and where conducting laboratory experiments to measure these properties is prohibitively costly. A foundation model pretrained on large, publicly available collections of molecules can therefore be finetuned on a small dataset of molecules labelled with the desired property. Several such foundation models, such as ChemBERTa and MolFormer, have been released as open source.
However, one of the drawbacks of large foundation models is that, with increased scale and complexity, understanding how the model makes predictions becomes a challenge. This matters because, when models are used to aid decision-making, there needs to be an understanding of the risk attached to each prediction. Indeed, large language models have been shown to hallucinate facts and to be overconfident, especially in unfamiliar domains.
Therefore, foundation models in chemistry need to be robust when making predictions in unfamiliar regions of chemical space, to offer human-understandable explanations of their predictions, and to be interpretable so that the reliability of those predictions can be verified. Developing techniques to address these needs is the goal of this PhD project.
My first project considers one aspect of robust and explainable models, namely the accurate quantification of uncertainty in predictions. Traditional approaches to principled uncertainty quantification are based on Bayesian statistics, where the goal is to find the posterior distribution over parameters given the observed data. However, a foundation model has orders of magnitude more parameters than a standard Bayesian model, so training a fully Bayesian version incurs a large computational cost. To address this, we use Low-Rank Adaptation (LoRA), a technique applied to language models during finetuning to reduce the number of trainable parameters; with far fewer trainable parameters, approximating the posterior becomes tractable. We name this approach Variational-LoRA, as it combines variational inference with low-rank adaptation.
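To make the idea concrete, the sketch below shows one way such a scheme could look in PyTorch: a frozen pretrained linear layer is adapted by a low-rank update whose factors carry a mean-field Gaussian variational posterior. The class name, initialisation choices and standard-normal prior are illustrative assumptions for exposition, not the project's actual implementation.

```python
# Minimal illustrative sketch (not the project's code): a frozen pretrained
# linear layer adapted by a low-rank update B @ A, with a mean-field Gaussian
# variational posterior over the low-rank factors only.
import torch
import torch.nn as nn

class VariationalLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # Variational parameters (mean and log-std) for the factors A and B.
        self.A_mu = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.A_logstd = nn.Parameter(torch.full((rank, d_in), -5.0))
        self.B_mu = nn.Parameter(torch.zeros(d_out, rank))
        self.B_logstd = nn.Parameter(torch.full((d_out, rank), -5.0))

    def _sample(self, mu, logstd):
        # Reparameterisation trick: sample = mu + std * eps, eps ~ N(0, I).
        return mu + torch.exp(logstd) * torch.randn_like(mu)

    def forward(self, x):
        A = self._sample(self.A_mu, self.A_logstd)
        B = self._sample(self.B_mu, self.B_logstd)
        return self.base(x) + x @ A.t() @ B.t()

    def kl(self):
        # KL(q || p) against a standard normal prior, summed over both factors.
        kl = 0.0
        for mu, logstd in [(self.A_mu, self.A_logstd), (self.B_mu, self.B_logstd)]:
            var = torch.exp(2 * logstd)
            kl = kl + 0.5 * torch.sum(var + mu**2 - 1.0 - 2 * logstd)
        return kl
```

Under this kind of scheme, the finetuning objective would be the usual task loss plus the summed KL terms (a standard evidence lower bound), and at test time drawing several samples of A and B yields an ensemble of predictions whose spread can be read as an uncertainty estimate.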
The second project in this PhD concerns the interpretability of chemical foundation models. Currently, there is little understanding of how a model takes a string encoding of a molecule and identifies the features within the molecule that are relevant to a particular property. Developing methods to understand this process will not only improve the scientific understanding of these properties but also allow for more trustworthy predictions.
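As a simple illustration of the kind of question this project asks, the hypothetical sketch below scores each token of a SMILES string with a gradient-times-input attribution for a sequence-classification model. The checkpoint name is a placeholder, and this baseline is only an example of existing attribution tooling, not the method under development.

```python
# Hypothetical sketch: gradient-times-input attribution over SMILES tokens
# for a transformer property predictor (placeholder checkpoint name).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "org/chemical-foundation-model"  # placeholder: any SMILES-pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

smiles = "CCO"  # ethanol, written as a SMILES string
inputs = tokenizer(smiles, return_tensors="pt")

# Detach the token embeddings so gradients accumulate on them directly.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"]).logits
logits[0, 0].backward()  # gradient of one output w.r.t. each token embedding

# Per-token relevance: gradient times embedding, summed over the hidden dimension.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{token:>8}  {score.item():+.4f}")
```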
My primary PhD supervisor is Dr Yingzhen Li, who specialises in probabilistic modelling and approximate inference techniques, with an interest in developing reliable machine learning systems. I also regularly meet with researchers from my industry sponsor BASF, Dr Miriam Mathea and Dr Jochen Sieg, who provide domain expertise in chemistry and cheminformatics.
Organisations
People
| Name | ORCID iD |
|---|---|
| Shavindra Jayasekera (Student) | |
Studentship Projects
| Project Reference | Relationship | Related To | Start | End | Student Name |
|---|---|---|---|---|---|
| EP/S023151/1 | | | 31/03/2019 | 29/09/2027 | |
| 2891802 | Studentship | EP/S023151/1 | 30/09/2023 | 29/09/2027 | Shavindra Jayasekera |