# Active Learning for Computational Polymorph Landscape Analysis

Lead Research Organisation:
University of Southampton

Department Name: Sch of Chemistry

### Abstract

The proposed research will develop advanced computational methods for predicting the possible crystal structures of drug-like molecules. The work is motivated by the importance of anticipating the occurrence of polymorphism, where a molecule can crystallise in more than one crystal structure, depending on the conditions used for its crystallisation. In the context of pharmaceutical materials, we must know when polymorphs exist that we have not yet characterised. These present a risk related to property control; a change in crystal structure can dramatically alter important properties of a crystalline drug, affecting its processing, tabletting and bioavailability. Hence, there has been a huge investment in crystal structure prediction methods. Predicted structures could guide experimental screening: where to focus effort and, in the long run, which experimental variables to vary to maximise the likelihood of isolating new structures.

Structure prediction has progressed impressively but has still not made the expected impact on assessing risk. A root cause is the problem of over-prediction. Current methods always predict many competing crystal forms, most of which are never observed. Accordingly, all candidate drug molecules appear to have significant uncertainty as to the expected extent of polymorphism, and this adversely impacts risk analysis.

The root of the problem is that the underlying lattice energy surface, on which local minima represent possible structures, is extremely complex and current methods for predicting polymorphism do not provide a sufficiently detailed description of this energy surface. We will develop the use of statistical learning methods to guide crystal structure calculations to efficiently map out the global features of lattice energy surfaces in a way that is not possible using current computational methods.

Two lines of study are proposed: to improve the fidelity of energetic assessment and, more importantly, to map the energy landscape of structures more globally. A starting point is to develop advanced statistical learning methods for correcting approximate computational models that are used for assessing lattice energies of predicted crystal structures. Our goal is to reduce the uncertainty in ranking of predicted structures at a controlled computational cost. We will then move to a completely unexplored problem: learning more detailed features of the lattice energy surface, such as the depth, shape and connectivity of energy basins. Key to this work is the development of multi-fidelity (multiple models of known accuracy and computational cost) and multi-objective Bayesian optimisation approaches to make use of the hierarchy of energy models (a series of approximate energy models with known, ordered accuracy) used in crystal structure prediction.
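As an illustration of the first line of study, the idea of correcting an approximate energy model with a statistical model can be sketched as follows. This is a minimal toy example and not the project's implementation: it assumes each structure is summarised by a single scalar descriptor, uses made-up functions `e_low` and `e_high` in place of real cheap and expensive lattice energy models, and fits a Gaussian process with a fixed squared-exponential kernel to the difference between them on a handful of structures.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalar descriptors."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def gp_fit(xs, ys, noise=1e-6):
    """Build the kernel matrix and the weights alpha = K^{-1} y."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    return K, solve(K, ys)

def gp_predict(x_new, xs, K, alpha):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    k = [rbf(x_new, a) for a in xs]
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve(K, k)
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v))
    return mean, max(var, 0.0)

# Hypothetical stand-ins for a cheap and an expensive energy model.
def e_low(x):
    return math.sin(x)

def e_high(x):
    return math.sin(x) + 0.2 * math.cos(0.5 * x) + 0.1 * x

# Only these few structures are evaluated with the expensive model.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
deltas = [e_high(x) - e_low(x) for x in train_x]
K, alpha = gp_fit(train_x, deltas)

# Correct the cheap energy of a structure not in the training set.
x_test = 2.5
delta_mean, delta_var = gp_predict(x_test, train_x, K, alpha)
err_low = abs(e_high(x_test) - e_low(x_test))
err_corrected = abs(e_high(x_test) - (e_low(x_test) + delta_mean))
print(f"error without correction: {err_low:.4f}")
print(f"error with GP correction: {err_corrected:.4f}")
print(f"predictive variance of the correction: {delta_var:.2e}")
```

The predictive variance returned alongside the mean is what an active learning scheme could use to decide which structure to evaluate next with the expensive model; a realistic application would also need a descriptor of the full crystal structure and fitted kernel hyperparameters.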

The objective is to judge the thermodynamic robustness and kinetic accessibility of individual predicted crystal structures and address the polymorphism over-prediction problem. This is completely new in the area and can be transformative in guiding experimental screening.

Thus, the vision is that active learning methods will guide the computer simulations that, in turn, will provide guidance to experimental polymorph screening.

### Planned Impact

The crystal form that a drug molecule adopts has an important impact on its solubility, dissolution rate (and bioavailability), shelf-life and mechanical properties. A complete understanding of the possible crystal forms of a drug is a regulatory requirement for pharmaceutical registration. Currently, pharmaceutical companies rely on high throughput screening of different crystallisation conditions in the hope of identifying all stable polymorphs and making a choice of which to formulate, but with no guarantee of success. A particular risk is a late-appearing stable and hence insoluble solid form, while a change of form to a more soluble material can have toxic effects. In the case of Ritonavir, an HIV drug already on the market where polymorphism suddenly became apparent, Abbott Laboratories had to reallocate over 600 scientists onto the case for more than 1 year. Abbott lost an estimated $250 million in sales as well as hundreds of millions of dollars to recover the original, patented polymorphic form. Consequently, there are now significant requirements concerning polymorphism that are imposed by healthcare regulatory bodies before a solid form drug can be marketed. Around 90% of prescribed drugs are essentially administered in the solid form and so the pharmaceutical industry is actively searching for approaches to accelerating the polymorph screening process.

Computational methods have been developed to supplement experimental polymorph screening by applying algorithms to find all energetically stable ways that a molecule can be packed into a crystalline structure. These crystal structure prediction (CSP) methods have enjoyed rapid development in recent years, but have yet to transform the field of polymorph screening. This project aims to develop the computational methods that could be transformative in how CSP is used to assess risk of polymorphism of drug molecules by developing statistical learning methods to guide the simulation and exploration of the energy surface that describes all possible crystal structures. The primary beneficiaries of this research are therefore industrial drug preformulators and formulators, and ultimately patients. The goal of developing better drugs whose solid form selection is guided by predictive computational methods, leading to pharmaceutical materials that are readily processed, tableted and consumed, contributes to better public health and a more productive UK.

The results of this feasibility study will benefit all scientists involved with crystallisation phenomena and have practical application in guiding solid form choice in drug formulation.

Hierarchical experiments are also commonplace in many other areas of science and technology, where computational or physical data can be collected at differing levels of cost and accuracy; for example, drug development through laboratory, pilot plant and manufacturing scales, materials development with multi-scale mathematical modelling, and epidemiological studies with multiple different computational models of disease spread (e.g. compartmental and agent-based). Hence the methods developed on this project for the construction and exploitation of hierarchical statistical learning models will have impact in scientific areas well beyond crystal structure prediction.

Impacts of the project include: the training of two postdoctoral research scientists in both high-level research skills and multi-disciplinary working; creation of knowledge, in the form of new methodologies in statistical learning and new insights into crystal structure prediction; economic impact for the pharmaceutical industry, through better risk assessment of polymorphs; and societal impact, through faster development and regulatory approval of new medicines.

### Publications

*Multifidelity Statistical Machine Learning for Molecular Crystal Structure Prediction.* in The Journal of Physical Chemistry A

Description | The project has found that a machine learning method, Gaussian Process Regression, can be applied to improve the accuracy of energy calculations on sets of predicted crystal structures at a small fraction of the computational cost that the calculations would have previously required. A second discovery is that a related machine learning method can be used to generate realistic pathways between pairs of crystal structures, allowing us to further characterise the energy landscapes of predicted crystal structures. |

Exploitation Route | The outcomes will be used in crystal structure prediction studies that are used in several other research projects, aimed at polymorph discovery of pharmaceutical molecules and the discovery of functional molecular materials. |

Sectors | Chemicals, Energy, Pharmaceuticals and Medical Biotechnology |

Description | The machine learning methods have contributed to reducing the expense of performing computational prediction of crystal structures and properties, in particular for the pharmaceutical sector. |

First Year Of Impact | 2021 |

Sector | Pharmaceuticals and Medical Biotechnology |

Impact Types | Economic |

Description | (ADAM) - Autonomous Discovery of Advanced Materials |

Amount | € 9,999,283 (EUR) |

Funding ID | 856405 |

Organisation | European Commission |

Sector | Public |

Country | European Union (EU) |

Start | 09/2020 |

End | 09/2026 |

Title | Machine learning method for lattice energy evaluation |

Description | A new model for approximating high level quantum mechanical energy evaluations of crystals from Gaussian Process Regression on a small number of crystal structures. |

Type Of Material | Computer model/algorithm |

Year Produced | 2020 |

Provided To Others? | No |

Impact | No impact yet, as the method is only just being made publicly available. |

Description | Collaboration between chemistry and statistics |

Organisation | University of Southampton |

Department | Southampton Statistical Sciences Research Institute |

Country | United Kingdom |

Sector | Academic/University |

PI Contribution | Our contribution is an application that is promising for the use of statistical learning methods. Our expertise is in simulation methods and in generating data that can be used to train statistical models. |

Collaborator Contribution | Partners brought expertise in statistical learning methods, which are being applied to accelerate computational chemistry methods. |

Impact | This is multi-disciplinary, involving chemistry and statistics. |

Start Year | 2018 |

Title | Continued development of Global Lattice Energy Explorer code |

Description | Software for predicting the crystal structures of organic molecules. |

Type Of Technology | Software |

Year Produced | 2018 |

Impact | The software is not yet widely used. Outside of our research group, the software is used in two other academic labs and one company. |

Description | Participation in EPSRC Network+ |

Form Of Engagement Activity | A talk or presentation |

Part Of Official Scheme? | No |

Geographic Reach | National |

Primary Audience | Professional Practitioners |

Results and Impact | Research presentation at a Network+ meeting, attended by approximately 50 network members, including postgraduate and undergraduate students, industry members and other academic researchers. |

Year(s) Of Engagement Activity | 2019 |