Advancing Machine Learning Methodology for New Classes of Prediction Problems

Lead Research Organisation: University of East Anglia

Department Name: Computing Sciences

Abstract

The last few decades have seen enormous progress in the development of machine learning and pattern recognition algorithms for data classification. This has resulted in considerable advances in a number of applied fields, with some of these algorithms forming the core of ubiquitous deployed technologies. However, there exist very many important applications, for example in biomedicine, which are highly non-standard prediction problems, and there is an urgent need to develop appropriate and effective classification techniques for such applications. For example, at NIPS 2006 Girolami & Zhong reported state-of-the-art prediction accuracy for a protein fold classification problem which stands at a modest 62%. While this may partly be due to overlaps between classes of fold, it is also clear that some of the fundamental assumptions made by most classification algorithms are not valid in this application. In particular, most algorithms make assumptions about the structure of the data that are not met in reality: that the data (both training and test) are independent and identically distributed (i.i.d.), that the labels are unbiased (i.e. the relative proportions of positive and negative examples are approximately balanced), and that noise on both the input data and the labels can be largely ignored. Recent advances in machine learning, such as kernel-based methods and the availability of efficient computational methods for Bayesian inference, hold great promise that classification problems in non-standard situations can be addressed in a principled way. The development of effective classification tools is all the more urgent given the daunting pace at which technological advances are producing novel data sets. This is particularly true in the life sciences, where advances in molecular biology and proteomics are leading to the production of vast amounts of data, necessitating the development of methods for high-throughput automated analysis.
Improving classification accuracy may remove what is currently the bottleneck in the analysis of this type of data, leading to real impact in furthering biomedical research and in the quality of life of millions of people. At present most classifiers used in life sciences applications, especially those deployed as bioinformatics web services, adopt and adapt traditional machine learning approaches, quite often in an ad hoc manner, e.g. employing Artificial Neural Networks and Support Vector Machines. However, in reality many of these applications are highly non-standard classification problems, in the sense that a number of the fundamental underlying assumptions of pattern classification and decision theory (e.g. identical sampling distributions for 'training' and 'test' data, perfect noiseless labeling in the discrete case, object representations which can be embedded in a common feature space) are violated, and this has a direct and potentially highly negative impact on achievable performance. To make much-needed and significant progress on a wide range of important applications, there is an urgent requirement to systematically address the associated methodological issues within a common framework, and this is what motivates the current proposal.

Funded Value:

£101,469

Funded Period:

Feb 08 - Feb 11

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/F010508/1

Principal Investigator:

Gavin Cawley

Research Subject:

Info. & commun. Technol. (80%)

Medical & health interface (5%)

Omic sciences & technologies (5%)

Tools, technologies & methods (10%)

Research Topic:

Artificial Intelligence (80%)

Bioinformatics (10%)

Genomics (5%)

Medical science & disease (5%)

People	ORCID iD
Gavin Cawley (Principal Investigator)
Geoffrey Robert Moore (Co-Investigator)
Steven Hayward (Co-Investigator)

Publications

Cawley G (2011) Sparse Bayesian prediction of disordered residues and disordered regions based on amino-acid composition

Cawley GC (2014) Kernel learning at the first level of inference, in Neural Networks: the official journal of the International Neural Network Society

Cawley G (2010) Over-fitting in model selection and subsequent selection bias in performance evaluation, in Journal of Machine Learning Research

Key Findings
Impact Summary
Collaboration


Description	This project has made progress in adapting existing machine learning techniques for non-standard learning problems of the kind described in the abstract, where fundamental assumptions of pattern classification (e.g. identical sampling distributions for training and test data, noiseless labels) are violated. We investigated active learning, variable sampling distributions and disparate class frequencies via two case studies in computational biology: first, the detection of natively unfolded regions in proteins, and later, the prediction of the binding affinity of peptides to the major histocompatibility complex (MHC) molecule (a key step in predicting immune response). Additionally, we found that over-fitting in model selection is a key issue in such applications of machine learning algorithms and can cause substantial losses in generalisation performance. We have developed methods to deal with this problem.
We have produced a software package for use by our collaborators at the Netherlands Cancer Institute to assist them in the design of peptide therapies.
Exploitation Route	The VaDIS software we have developed during the project has application in the design of peptide therapies for a number of illnesses, and we are currently seeking to exploit our work in conjunction with the Netherlands Cancer Institute (NKI). Machine learning and data mining are widely used in science and industry. The key algorithmic developments resulting from the project can be directly employed in applications of these methods to improve performance and to obtain unbiased performance estimates.
Sectors	Digital/Communication/Information Technologies (including Software); Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
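
The point about over-fitting in model selection and unbiased performance estimates can be illustrated with a small sketch. The record does not describe the methods the project developed, so the example below is not the project's method: it uses nested cross-validation, a widely used general technique, on assumed synthetic data with a hypothetical 1-D nearest-neighbour classifier. All function names, the data and the candidate values of k are illustrative assumptions.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def folds(data, n_folds):
    """Yield (train, held_out) splits; fold i holds out every n_folds-th item."""
    for i in range(n_folds):
        held_out = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, held_out

def cv_accuracy(data, k, n_folds):
    """Mean held-out accuracy of k-NN over the folds of `data`."""
    scores = [accuracy(tr, te, k) for tr, te in folds(data, n_folds)]
    return sum(scores) / len(scores)

def select_k(data, candidates, n_folds):
    """Model selection: choose k by cross-validation on `data` only."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, n_folds))

def nested_cv(data, candidates, outer_folds=5, inner_folds=4):
    """Outer loop estimates performance; inner loop repeats model selection."""
    scores = []
    for train, test in folds(data, outer_folds):
        k = select_k(train, candidates, inner_folds)  # outer test fold is not used here
        scores.append(accuracy(train, test, k))
    return sum(scores) / len(scores)

# Assumed synthetic data: two overlapping 1-D Gaussian classes, 50 points each.
rng = random.Random(0)
data = [(rng.gauss(y, 1.0), y) for y in [0, 1] * 50]
rng.shuffle(data)

candidates = [1, 3, 5, 9, 15]  # odd values avoid tied votes
best_k = select_k(data, candidates, 5)
selection_score = cv_accuracy(data, best_k, 5)  # same folds that picked best_k
nested_score = nested_cv(data, candidates)

print("selected k:", best_k)
print("CV accuracy reused from model selection:", selection_score)
print("nested CV accuracy:", nested_score)
```

The first figure reuses the folds that chose k, so it is the maximum over the candidates and tends to be optimistic. The nested figure repeats the selection inside each outer training set, so the outer test fold never influences the choice of k.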


Description	We have produced a software package called VaDIS, which has been used in the design of peptide therapies at the Netherlands Cancer Institute (NKI).
First Year Of Impact	2011
Sector	Pharmaceuticals and Medical Biotechnology


Description	Vaccine Design In Silico
Organisation	John Innes Centre
Country	United Kingdom
Sector	Academic/University
PI Contribution	We have established a research collaboration with the Chemical Biology Laboratory of Huib Ovaa at the Netherlands Cancer Institute (NKI), and with Dr Richard Morris of the John Innes Centre (JIC), to develop algorithms and software to assist in the design of peptide therapies. This task involves many non-standard elements, such as active learning, covariate shift and disparate class frequencies, and so is an ideal secondary case study for the project. The software we have produced (VaDIS) has proved highly effective in the work conducted at NKI. Publications on this work are currently in progress.
Start Year	2009
