Machine Learning Approaches to Predict Enzyme Function

Lead Research Organisation: University of St Andrews
Department Name: Chemistry

Abstract

Proteins are amongst the most important of all molecules in biological systems. Organisms use them to carry out a huge variety of essential functions: catalysis, transport, storage, motor functions, signalling, chaperoning of folding, regulation, molecular recognition, structural roles, and DNA repair. Because proteins are so ubiquitous in biology, understanding their properties is essential if we want to understand biological processes. This project focuses on one of the most significant of all protein functions: enzyme catalysis. Enzymes catalyse, or facilitate, the chemical reactions that occur in living organisms. Understanding how they work is both interesting in itself and useful in areas as diverse as drug design, diagnostics, biofuels, food science and laundry. This project is about the relationship between the structure of a protein and the enzyme function it carries out. We aim to predict the catalytic function from a knowledge of the protein structure. To achieve this, we will use machine learning methods, in particular a technique called Random Forest. The forest consists of several hundred 'decision trees', each of which is essentially a flow diagram. We will train them to learn patterns in the known properties of existing enzyme structures and the chemistry of the steps comprising the reactions they catalyse. The way in which we generate the trees involves computer-simulated dice-rolling, which ensures that they are all different, though based on the same underlying information. Each decision tree then makes a prediction of the unknown catalytic function of a protein. These predictions are treated as votes as to the function of the protein. This voting process produces a consensus of many decision trees and maximises the use of the information contained in the underlying data, generating results which are much more accurate than those of any one decision tree.
The prediction of enzyme function is immensely important for a number of reasons. Firstly, being able to predict enzyme function more accurately will improve the functional annotation of genomes and reduce the current risk of misannotations being propagated through bioinformatics databases. Rapid developments in structural genomics (the high-throughput structure determination of diverse proteins from a wide variety of organisms) mean that many structures are available for enzymes whose functions are not yet known. Secondly, this project will allow us to recognise chemical similarities between evolutionarily unrelated enzymes that catalyse similar steps, though not necessarily similar overall reactions. Thirdly, this work will help us to understand the key determinants of the complex relationship between protein structure, function and evolution, particularly in terms of catalysis of reaction steps. Fourthly, the project will facilitate the design of new enzymes with either novel functions or carefully modified versions of existing functions. This project sits at an interface between disciplines, combining chemistry, biology and computer science. A wide range of skills and expertise is necessary to increase our understanding of catalysis, which has long been an important academic goal. Commercially, this work lays a foundation which is directly useful to the pharmaceutical and biotechnology industries, where enzymes are used both as diagnostics and therapeutics; to the agrochemical industry, whose products often target enzymes; to the development of biofuels, which needs robust enzymes to improve productivity and reduce costs; to laundry, where enzymes are already used in everyday products; and to the nutrition and food industries. In particular, this project will aid in the design of new and repurposed enzymes.

Technical Summary

The key idea in our work is to identify the reaction mechanism, if any, catalysed enzymatically by a protein of known structure. Here, the reaction mechanisms are the 260 distinct entries in MACiE. The possible predictions are each of these reactions, or that the protein catalyses no enzyme reaction in our knowledge base. Our work, including a study of convergently evolved analogous pairs of enzymes, suggests that the full stepwise chemical reaction mechanism contains information critical to recognising similarities between enzymes. Our main machine learning method is Random Forest, simply a forest made out of many different randomly created decision trees. Randomness is introduced in two ways. Firstly, each tree is based on a bootstrap sample of N out of the N known proteins, chosen with replacement, so that some proteins will appear more than once and others not at all in the set from which a given tree is built. Secondly, the descriptors used for making the split at each node are chosen from a fresh, small random subset of the descriptors. Once grown, the trees then predict unseen data. Random Forest can predict either a categorical or a continuous variable. Here, our interest is in classification; the class assigned to a new protein is the one given the most votes amongst the trees in the forest. After predicting the reaction mechanism, we will apply chemoinformatics, docking and Ultrafast Shape Recognition to suggest substrates for each enzyme reaction identified. Docking is a computational filter, reducing the number of candidates by more than an order of magnitude. Rescoring will use our novel Random Forest based RF-Score function. We will use fingerprint-based chemoinformatics methods to retain only molecules with the correct chemical functionalities needed to undergo the reaction mechanisms identified, and Ultrafast Shape Recognition as a scaffold-hopping method to identify molecules of suitable shape.
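The two sources of randomness and the voting step described above can be illustrated with a minimal sketch in Python. It is not the project's software: each "tree" is cut down to a single split (a decision stump), so the random descriptor subset is drawn once per tree rather than once per node, and the descriptor values, the class labels mechanism_A and mechanism_B, and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Return the most common label (the winner of a vote)."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, rng, n_try):
    """Fit a one-split tree, searching only a random subset of n_try descriptors."""
    best = None  # (errors, descriptor index, threshold, left label, right label)
    for f in rng.sample(range(len(X[0])), n_try):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold midway between neighbours
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        lab = majority(y)
        return (0, float("inf"), lab, lab)
    return best[1:]

def fit_forest(X, y, n_trees=200, n_try=1, seed=0):
    """Grow n_trees stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw N of the N proteins with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng, n_try))
    return forest

def predict(forest, row):
    """Each tree casts one vote; the class with the most votes wins."""
    votes = [l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return majority(votes)

# Hypothetical toy data: two numerical descriptors per protein.
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.4, 0.3],
     [9.1, 9.3], [9.4, 9.0], [9.2, 9.5], [9.0, 9.1]]
y = ["mechanism_A"] * 4 + ["mechanism_B"] * 4

forest = fit_forest(X, y)
print(predict(forest, [0.3, 0.2]))
print(predict(forest, [9.2, 9.2]))
```

A real Random Forest grows full trees and redraws the descriptor subset at every node; the stump keeps the sketch short while still showing bootstrap sampling, random descriptor selection and majority voting.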

Planned Impact

The key beneficiaries are companies in the pharmaceutical, biotechnology, and medical technology sectors; other possible beneficiary fields are biofuels, foods, agrochemicals, and 'home and personal care'. This work centres on new aspects of function prediction, complementary to those used elsewhere, and we envisage that our methods will take their place amongst the arsenal of tools in the workflows for protein function prediction and gene annotation. We expect our methods to be most valuable when used alongside other state-of-the-art techniques for predicting protein function from sequence and structure. One element of the strategy for increasing the impact of our function prediction work is to encourage its use in private sector R&D. This naturally includes large pharmaceutical companies, but we are particularly keen to see SMEs, biotechnology and smaller medical technology companies, many of which lack the funds for large in-house computational resources, make use of our predictive models. A key aspect of this is eliminating any IP-related barriers to the commercial use both of our predictive models and also of MACiE. Our function prediction software and models will be freely available under a Creative Commons licence. The IP status is that all data in MACiE are public domain. Almost always, these are published, or very soon to be published, by their authors. We are prepared to embargo data pre-publication, but not afterwards. The database itself is copyrighted, and we may in future include a light-touch Open Data Commons licence. This is intended only to prevent extreme cases of plagiarism, such as copying the entire database and passing it off as the work of others, and we positively encourage the use of our predictive models, and also data from MACiE, in commercial research and development. The second part of our strategy is to increase visibility.
Dr Mitchell is in the fortunate position of receiving regular invitations to speak both directly to pharmaceutical, chemical and other commercial organisations (Pfizer, GSK, Unilever, Syngenta, Schering-Plough etc.), and also at conferences designed for the pharmaceutical industry (Improving Solubility 2008; ADMET 2009; Improving Solubility 2009; ADMET Europe 2010; UK-QSAR spring 2010). Here we can discuss our work in formal presentations and through informal networking. Other ideas for impact in the shorter term include authoring articles describing our work on function prediction and the related work on MACiE. There would be three specific target groups. One of these is research and development scientists in pharmaceutical and biotechnology companies. To reach them, an article in Drug Discovery Today or a similar 'trade magazine' would be an appropriate medium. The second target audience is young people (particularly the 16-21 age group) with an interest in science. While this already happens on a small scale via UCAS open days and the like, we are particularly interested in opening up discussion of science through the blogosphere (see, for instance, http://baoilleach.blogspot.com/2009/05/how-do-enzyme-mechanisms-evolve.html ). We are also very aware of the benefits of including relevant parts of our own research in undergraduate teaching material. The third target group is the broader public, who could be reached by general interest magazine or newspaper articles, as well as by blogs and other internet-based content. We also hope to secure a slot to present the work to the public at a local event such as those organised by Cafe Science Dundee. The University of St Andrews is a partner in the 'Create and Inspire' public engagement training days for young scientists at Sensation science centre. We believe that MACiE has potential as an educational resource.
As well as undergraduate teaching, it could also be a valuable resource for year 13 chemistry teaching in school sixth forms (e.g., Salters' A-level module 'Thread of Life').

Publications

Alderson RG (2012) Enzyme informatics. in Current topics in medicinal chemistry

Barker D (2013) 4273p: bioinformatics education on low cost ARM hardware. in BMC bioinformatics

Beattie KE (2015) Why do Sequence Signatures Predict Enzyme Mechanism? Homology versus Chemistry. in Evolutionary bioinformatics online

Boobier S (2017) Can human experts predict solubility better than computers? in Journal of cheminformatics

Mitchell JB (2014) Machine learning methods in chemoinformatics. in Wiley interdisciplinary reviews. Computational molecular science

Mussa HY (2015) The Parzen Window method: In terms of two vectors and one matrix. in Pattern recognition letters

Mussa HY (2015) A note on utilising binary features as ligand descriptors. in Journal of cheminformatics

 
Description We have demonstrated machine learning methods that can give high predictivity of enzyme mechanism from sequence. By splitting mechanism labels to a finer granularity, which includes the role of the protein chain in the overall enzyme complex, the method can predict at 96% accuracy (and 99.9% macro-averaged recall) the mechanism definitions of 248 proteins available in the databases. We find that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also find that incorporating Catalytic Site Atlas attributes does not seem to provide additional accuracy. We also find that the majority of the predictive information comes from evolutionary fossil signals present in sequence, and only a small proportion from the catalytic residues themselves.
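Accuracy and macro-averaged recall, the two figures quoted above, weight the classes differently: accuracy counts every protein equally, while macro-averaged recall first computes recall within each mechanism class and then averages across classes, so rare mechanisms count as much as common ones. The following is a minimal sketch in Python with invented labels; it does not reproduce the project's data or results.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of all examples whose predicted label matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Unweighted mean over classes of (correct in class) / (total in class)."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical labels: class "A" is common, "B" and "C" are rare.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "B", "A"]

print(accuracy(y_true, y_pred))      # 5 of 6 examples correct
print(macro_recall(y_true, y_pred))  # mean of per-class recalls 1, 1 and 0
```

In this toy case a single error on a rare class lowers macro-averaged recall far more than it lowers accuracy, which is why the two measures are usually reported side by side.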
Exploitation Route * Integration into web-based protein function prediction software.

* Potential use in the development of new classes of antibiotic by identifying the chemical mechanisms of pathogens' enzymes.

* Possible uses in enzyme engineering and redesign.
Sectors Agriculture, Food and Drink; Chemicals; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL http://chemistry.st-andrews.ac.uk/staff/jbom/group/
 
Description The software associated with this project has been made publicly available at https://sourceforge.net/projects/ml2db/ for use by SMEs and larger companies, for instance in the biotech and pharmaceuticals sectors. The findings are also being incorporated into teaching material at university level.
First Year Of Impact 2014
Sector Agriculture, Food and Drink; Chemicals; Education; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title DLS-100 Solubility Dataset 
Description 100 molecules with measured and reported intrinsic aqueous solubilities, together with a suggested 75-25 training-test split.
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title The natural history of biocatalytic mechanisms (dataset) 
Description  
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
URL https://risweb.st-andrews.ac.uk:443/portal/en/datasets/the-natural-history-of-biocatalytic-mechanism...
 
Description Fife Science Festival 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? Yes
Geographic Reach Local
Primary Audience Schools
Results and Impact Engaged children with science
Year(s) Of Engagement Activity 2011,2012
 
Description International Science Summer School 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Schools
Results and Impact Talk well-received by audience

Year(s) Of Engagement Activity 2013,2014
 
Description Sutton Trust Summer School 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach Regional
Primary Audience Schools
Results and Impact Talk well-received by audience

Intention is to stimulate applications to universities from attendees
Year(s) Of Engagement Activity 2012,2013
 
Description iGEM 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact Team won gold medal
Year(s) Of Engagement Activity 2010,2011,2012
URL http://2012.igem.org/Team:St_Andrews