Learning to learn how to design drugs

Lead Research Organisation: Brunel University
Department Name: Computer Science


A key step in developing a new drug is to learn quantitative structure activity relationships (QSARs). These are mathematical functions that predict how well chemical compounds will act as drugs. QSARs are used to guide the synthesis of new drugs.

The current situation is:
1) There is a vast range of approaches to learning QSARs.
2) It is clear from theory and practice that the best QSAR approach depends on the type of problem.
3) Currently the QSAR scientist has little to guide her/him on which QSAR approach to choose for a specific problem.

We therefore propose to make a step-change in QSAR research. We will utilise newly available public domain chemoinformatic databases, and in-house datasets, to systematically run extensive comparative QSAR experiments. We will then generalise these results to learn which target-type / compound-type / compound-representation / learning-method combinations work best together.

We do not propose to develop any new QSAR method. Rather, we will learn how to better apply existing QSAR methods. This approach is called "meta-learning": using machine learning to learn about QSAR learning.

We will make the knowledge we learn publicly available to guide and improve future QSAR learning.


Panov P (2014) Ontology of core data mining entities in Data Mining and Knowledge Discovery

Panov P (2016) Generic ontology of datatypes in Information Sciences

Description We worked in close collaboration with our project partners from the University of Manchester and the University of Dundee. We also established a close collaboration with the OpenML Team from the University of Technology, Eindhoven and, within Brunel University, with Dr Crina Grosan, an expert in machine learning. Below is a summary of the key outputs.

1. Annotation scheme. We have developed a scheme for annotating the machine learning experiments that are typically used to predict the biological activities of chemical compounds. The annotation scheme consists of descriptors for datasets, machine learners, their predictions and also drug targets. While several formalisms for the description of datasets and machine learning algorithms already exist, they are generic and not tuned for the prediction of biological activities necessary for drug discovery. For example, the ontology DMOP (Data Mining Optimization), developed within the European e-LICO project (http://www.e-lico.eu/DMOP.html), is sufficient to describe such properties of a dataset as number of data items, feature correlation, etc. However, DMOP and other formalisms do not provide descriptors that capture information about, for example, the diversity of a chemical space. There also exists a classification of drug targets (see ChEMBL: https://www.ebi.ac.uk/chembl/target/browser), but it does not capture the functionality of targets or their similarity. Our annotation scheme captures all the essential properties of datasets, predictions, and drug targets.

The results of preliminary work on an annotation scheme for drug discovery have been published as a use case in this paper:

Panov, P., Soldatova, L.N., Dzeroski, S. (2014) Ontology of Core Data Mining Entities. J. of Data Mining and Knowledge Discovery. 28/5-6: 1222-1265.

2. Software development. We have developed and successfully tested the software infrastructure to run QSAR and meta-QSAR experiments.

3. Software integration. We have integrated the workflow of our project into the OpenML platform (http://openml.org/). OpenML is a popular platform where datasets and predictions made by various machine learning algorithms are stored, shared and compared. This open approach to sharing machine learning experiments is designed to save time and effort, as there is no need to repeat computational experiments. We have agreed with the OpenML Team that the platform will have a section dedicated to drug discovery. In this way the drug discovery research community will benefit from this already established platform and its functionality (i.e. the comparison of different predictions across different datasets).

We have adjusted our software to enable the deposition of our datasets and the results of machine learning runs directly to OpenML. We have worked together with the developers of the OpenML platform, who implemented secure access to the stored datasets.

Dr Sadawi and Prof. King (the University of Manchester) participated in the OpenML Workshop at the University of Technology Eindhoven in October 2014.
By the end of the project all data and machine learning predictions will be available at OpenML.

4. QSAR learning. We have done initial trials of QSAR and meta-QSAR learning. We are currently running the main QSAR learning experiments.

5. Meta-QSAR learning. Our meta-QSAR approach has proven to be more successful than the typical approaches used in drug activity prediction. The meta-QSAR predictions have outperformed random forests, support vector machines, and other popular algorithms in the majority of cases. We extracted 2,750 targets from ChEMBL, each with a very different number of chemical compounds. For the meta-learning stage we formulated a classification problem that indicates which QSAR method should be used for a particular QSAR problem. The meta-level training dataset is formed from meta-features extracted from the datasets of the base learning level; these are based on target properties (hydrophobicity, molecular weight, aliphatic index, etc.) and on statistical and information-theoretic measures (mean, mutual information, entropy, etc.). The hypothesis that there is no single best way to learn QSARs has been confirmed. We have obtained sufficient experimental evidence that meta-QSAR learning correctly suggests, for almost all targets, which QSAR method should be used.
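
The meta-learning stage described in item 5 can be illustrated with a minimal sketch. It is not the project's implementation: the dataset layout, the three meta-features, the method names and the nearest-neighbour meta-classifier are all assumptions chosen only to show the idea of mapping a QSAR problem to a recommended QSAR method.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def meta_features(dataset):
    """Map one base-level dataset to a small meta-feature vector.

    `dataset` is a hypothetical dict with 'activities' (binned activity
    labels) and 'target_weight' (target molecular weight in kDa).
    """
    return (
        label_entropy(dataset["activities"]),    # information-theoretic
        math.log10(len(dataset["activities"])),  # dataset size
        dataset["target_weight"],                # target property
    )


def recommend(meta_train, query, k=1):
    """Vote among the k nearest (meta_features, best_method) examples.

    Features are used unscaled here for brevity; a real study would
    normalise them so that no single meta-feature dominates the distance.
    """
    ranked = sorted(meta_train, key=lambda item: math.dist(item[0], query))
    votes = Counter(method for _, method in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical meta-level training set: one row per previously studied target,
# labelled with the QSAR method that performed best on it.
meta_train = [
    ((0.2, 2.0, 40.0), "random_forest"),
    ((0.9, 3.5, 120.0), "svm"),
    ((1.0, 3.0, 110.0), "svm"),
]

new_problem = {"activities": ["active", "inactive"] * 500, "target_weight": 115.0}
suggested = recommend(meta_train, meta_features(new_problem), k=3)
print(suggested)
```

In this toy setting the new problem lies closest to the two targets for which a support vector machine worked best, so that method is suggested.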

6. In addition to the planned work, we have worked on transfer learning. We developed a novel transfer learning approach that uses the evolutionary distance between targets to improve standard QSAR learning through the use of data from related targets.
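
One simple way to use a distance between targets in transfer learning is instance weighting: data from related targets is pooled with the target's own data, but down-weighted as the evolutionary distance grows. The sketch below shows only that general idea. The exponential weighting, the `tau` parameter, the single numeric descriptor and the one-dimensional linear model are assumptions for illustration; the source does not state how the project's method uses the distance.

```python
import math


def similarity_weight(distance, tau=1.0):
    """Turn an evolutionary distance (>= 0) into a weight in (0, 1]."""
    return math.exp(-distance / tau)


def pool_related_targets(own, related, tau=1.0):
    """Combine a target's own (x, y) data with data from related targets.

    `own` is a list of (descriptor, activity) pairs, each given weight 1.
    `related` is a list of (distance, data) pairs, where `data` has the
    same layout as `own`; its points are down-weighted by distance.
    """
    points = [(x, y, 1.0) for x, y in own]
    for distance, data in related:
        w = similarity_weight(distance, tau)
        points.extend((x, y, w) for x, y in data)
    return points


def weighted_linear_fit(points):
    """Weighted least squares for y = a * x + b over (x, y, w) triples."""
    sw = sum(w for _, _, w in points)
    mx = sum(w * x for x, _, w in points) / sw
    my = sum(w * y for _, y, w in points) / sw
    sxx = sum(w * (x - mx) ** 2 for x, _, w in points)
    sxy = sum(w * (x - mx) * (y - my) for x, y, w in points)
    a = sxy / sxx
    return a, my - a * mx


# Hypothetical data: a sparsely sampled target plus one close and one
# distant relative.
own = [(0.0, 0.1), (1.0, 0.9)]
related = [
    (0.2, [(2.0, 2.1), (3.0, 2.9)]),  # close relative, weight ~0.82
    (5.0, [(2.0, 9.0)]),              # distant relative, weight ~0.007
]
slope, intercept = weighted_linear_fit(pool_related_targets(own, related))
print(round(slope, 2), round(intercept, 2))
```

Because the distant relative receives a weight close to zero, its outlying measurement has little influence on the fitted model, while the close relative contributes almost as much as the target's own data.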


1) Intermediate project results have been presented by Dr Soldatova at the 20th Euro-QSAR conference in St Petersburg, Russia in September, 2014 as an oral communication "Meta QSAR" (http://www.ldorganisation.com/v2/produits.php?langue=english&cle_menus=1238915734).
Slides of the presentation are available at the project website: http://www.meta-qsar.org/pubs.html

2) Initial project results were presented by Prof. King at the OpenML Workshop, University of Technology Eindhoven, in October 2014. His talk is available on YouTube:

3) The work on meta-QSAR learning has been presented at the ECML PKDD (European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases) conference (http://www.ecmlpkdd2015.org/) at the MetaSel - Meta-learning & Algorithm Selection workshop (http://metasel2015.inesctec.pt/) in Porto in September 2015. The extended abstract "Meta-QSAR: learning how to learn QSARs" by Iván Olier, Crina Grosan, Noureddin Sadawi, Larisa Soldatova and Ross King is available in the Proceedings at: http://ceur-ws.org/Vol-1455/paper-11.pdf
A video record of the presentation is available at: https://www.youtube.com/watch?v=wb6aOmpp8mQ

4) The work on transfer learning has been presented at the ECML PKDD conference (http://www.ecmlpkdd2015.org/) at the BigTargets: Big Multi-Target Prediction workshop (http://www.kermit.ugent.be/big-multi-target-prediction/index.php) in Porto in September 2015. An extended abstract "Multiple Task Learning for Quantitative Structure Activity Relationship Learning: Use of a Natural Metric" by Iván Olier, Crina Grosan, Noureddin Sadawi, Larisa Soldatova and Ross King is available in the Proceedings at: http://www.kermit.ugent.be/big-multi-target-prediction/files/abstracts/Sadawi.pdf

5) A paper "Auditing Redundant Import in Reuse of a Top Level Ontology for the Drug Discovery Investigations Ontology." by Zhe He, Christopher Ochs, Larisa N Soldatova, Yehoshua Perl, Sivaram Arabandi, James Geller has been presented at ICBO (International Conference on Biomedical Ontology)/ VDOS (Vaccine and Drug Ontology Studies) in 2013. The presentation is available at: http://www.columbia.edu/~zh2132/VDOS2013-Zhe-Slides.pdf
The paper is available at: http://www2.unb.ca/csas/data/ws/semantic-trilogy-workshops/papers/vdos/vdos2013_submission_4.pdf (cited by 5).

6) A paper about meta-QSAR learning is due to be submitted in April 2016 to a Special Issue on Meta-Learning and Algorithm Selection in the Machine Learning Journal.
Exploitation Route Our results will enable the better design of drugs by academic and commercial laboratories.
The problem of how best to learn QSARs is of great industrial and medical importance. Drug development is arguably one of the most important applications of science in the UK. The average cost to bring a new drug to market is ~£500 million. A successful drug can earn billions of pounds a year, and as patent protection is time-limited, even an extra week of protection can be of great financial significance. The UK (both academia and industry) is a leader in QSAR research and chemoinformatics in general, as can be seen from its publication record. This project aims to help maintain this lead.
Sectors Chemicals, Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://www.meta-qsar.org/index.html
Description The partner on this project, Prof. Andrew Hopkins from the University of Dundee, is chief executive and majority shareholder of the company Exscientia. The University of Manchester, the lead organisation on this project, sub-contracted the University of Dundee. The approach developed within this project influenced approaches used by Exscientia. Exscientia is one of the most successful AI companies: https://www.bbc.co.uk/news/business-40708043 https://www.bbc.co.uk/news/uk-scotland-scotland-business-47667125
First Year Of Impact 2017
Sector Chemicals, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

Title QSAR models in OpenML platform 
Description A collection of QSAR predictive models. All models will be made publicly available at OpenML after a paper has been published.
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? No  
Impact QSAR models will enable the better design of drugs by academic and commercial laboratories 
URL http://www.openml.org/
Description OpenML 
Organisation Eindhoven University of Technology
Country Netherlands 
Sector Academic/University 
PI Contribution We worked together on the development of software for depositing datasets and the results of machine learning experiments, modifying the OpenML platform to suit the needs of our project. Our project will benefit from the use of this well-established and popular platform.
Collaborator Contribution Our project will contribute datasets and models to the OpenML platform.
Impact The collaboration is multidisciplinary: it involves researchers from biochemistry, software engineering and machine learning. Expected output: a QSAR-specific version of the OpenML platform.
Start Year 2013
Description Horizons article 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Ross King was interviewed for an article in Cartlidge, E. "Let the Robots do the tedious work", Horizons (Swiss magazine for Scientific Research); Vol 113; pg 10-11
Year(s) Of Engagement Activity 2017
URL http://www.snf.ch/SiteCollectionDocuments/horizonte/Horizonte_gesamt/SNSF_horizons_113_en.pdf
Description Science Museum Antenna Live Event 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact We presented the Robot Scientist and discussed the types of experiments it was capable of undertaking, within the framework of 'how to think like a scientist'. We also had an interactive (computer simulation) demonstration of drug design in which visitors could ascertain what features of a compound rendered it as a 'good' or a 'bad' drug. Both activities provoked significant interest and enthusiasm from members of the public of all ages, from 8 to 80! The robot itself sparked more general discussion about the potential uses of a robot scientist, as well as the technicalities of how it operates, whereas the computer simulated demonstration enabled those who took part to think about the characteristics looked for in drug design, which also generated much discussion. Visitor records to this specific exhibit recorded more than 3500 visitors either 'spectating' or actually 'engaging' with the scientists presenting the robot.

We anticipate that impact from this event will be long-term and on-going. For example, the event will certainly have increased public awareness as to what a Robot Scientist is capable of (i.e. it is more than just a technical operator, but is also capable of thinking like a human scientist), and we saw evidence of increased discussion around this subject between friends and families. We also anticipate increased interest in STEM subjects at secondary and higher education levels, given the curios
Year(s) Of Engagement Activity 2015
URL https://www.youtube.com/watch?v=wMIcMrzDgNc