Imbalanced Data Set Modelling and Classification for Life Threatening/ Safety Critical Applications

Lead Research Organisation: University of Reading

Department Name: Sch of Systems Engineering

Abstract

Machine learning from imbalanced data sets is relevant to a broad range of important problems in many engineering and scientific disciplines, e.g. medical diagnostics, signal detection and machine/material fault detection. Apart from its high practical value, learning from imbalanced data sets is also of high theoretical interest. Because the performance metrics used in conventional classifier construction may break down when applied to imbalanced data sets, considerable research in the machine learning community has been aimed at a variety of learning methodologies for such data.
Despite significant research in machine learning for imbalanced data, there is still a lack of general methodologies that are able to deliver the knowledge discovery capability demanded by many hugely important applications. For example, it is highly beneficial to discover new noninvasive biological markers from clinical data, which can improve early medical diagnostic results so that treatment of a cancer can start early. The motivation of the proposed research can be illustrated by another example. In materials science, suppose that new materials with exceptional properties, e.g. strength, are required for new mechanical structures, e.g. military vehicles. For this purpose, a sample of experimental trials is performed to obtain a new material together with measurements of its properties. It is highly desirable that the properties/behaviours can be discovered by data modelling using a small sample, rather than by performing many more unnecessary and very expensive engineering experiments (a large sample).
This proposal is concerned with the development of a new modelling approach which builds upon state-of-the-art nonlinear modelling methodologies and is specifically designed for pattern recognition using imbalanced data sets.
The objectives of the research include modelling, classification, class probability (risk) prediction and knowledge discovery from the imbalanced data sets which are commonly found in many associated applications.
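
The remark that conventional performance metrics may break down on imbalanced data can be illustrated with a small sketch. It is not taken from the project: the 10-to-990 class split, the function names and the choice of the geometric mean of sensitivity and specificity (G-mean) as the alternative metric are all assumptions made for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels; 1 is the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp

def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical data: 10 minority samples and 990 majority samples.
y_true = [1] * 10 + [0] * 990
# A trivial classifier that always predicts the majority class.
y_pred_majority = [0] * 1000

print(accuracy(y_true, y_pred_majority))  # 0.99
print(g_mean(y_true, y_pred_majority))    # 0.0
```

The trivial classifier scores 99% accuracy while detecting none of the minority samples, which is why a class-balanced measure such as the G-mean is often preferred in this setting.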

Funded Value:

£102,020

Funded Period:

Oct 09 - Sep 12

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/G026858/1

Principal Investigator:

Xia Hong

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (100%)

Organisations

University of Reading (Lead Research Organisation)

People	ORCID iD
Xia Hong (Principal Investigator)
Chris Harris (Co-Investigator)

Publications

Gao M (2012) Probability density function estimation based over-sampling for imbalanced two-class problems

Gao M (2012) A neurofuzzy classifier for two class problems

Gao M (2011) A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems in Neurocomputing

Gao M (2011) On combination of SMOTE and particle swarm optimization based radial basis function classifier for imbalanced problems

Hong X (2012) Using zero-norm constraint for sparse probability density function estimation in International Journal of Systems Science

Hong X (2013) Elastic-Net Prefiltering for Two-Class Classification in IEEE Transactions on Cybernetics

Hong X (2012) An Elastic Net Orthogonal Forward Regression Algorithm in IFAC Proceedings Volumes

Key Findings


Description	All the original objectives were achieved. We proposed a number of efficient learning algorithms for building two-class classifiers from imbalanced data sets. We investigated oversampling techniques to balance the class distributions and validated the concept by combining them with our previous state-of-the-art classifier construction algorithm over a large number of benchmark data sets. We initially applied the well-known SMOTE oversampling technique and then proposed a novel oversampling approach based on probability density function estimation for this strand of research. The work is detailed in two conference papers and a journal paper:
M. Gao, X. Hong, S. Chen and C. J. Harris: "Probability density function estimation based over-sampling for imbalanced two-class problems", in International Joint Conference on Neural Networks, 10-15 Jun, Brisbane, Australia (2012).
M. Gao, X. Hong, S. Chen and C. J. Harris: "On Combination of SMOTE and Particle Swarm Optimization based Radial Basis Function for Imbalanced Problem", in Proc. IJCNN 2011, San Jose, USA, July 30-Aug 5, pp. 1146-1153 (2011).
M. Gao, X. Hong, S. Chen and C. J. Harris: "A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems", Neurocomputing, Vol. 74, No. 17, pp. 3456-3466 (2011).
A neurofuzzy classifier identification algorithm was introduced for two-class problems. The advantage of a neurofuzzy classifier is that it eases knowledge discovery. We introduced the new ideas of using Gaussian mixture models for the input partitions and a logistic model to produce the class probability output, further improving model transparency. This work also extends our previous work on the subspace orthogonal least squares algorithm for rule selection. Benchmark examples show classification performance comparable to results in the literature, but our approach provides users with more internal information about the system. This work is published in a conference paper:
M. Gao, X. Hong and C. J. Harris: "A neurofuzzy classifier for two class problems", in the 12th Annual Workshop on Computational Intelligence (UKCI2012), 5-7 Sep, Edinburgh, UK (2012).
The problems of sparse modelling are fundamental to many machine learning and pattern recognition tasks. We investigated two modes of sparse modelling aimed at improved generalisation capabilities, i.e. supervised modelling for classifier construction using elastic net parameter regularization, and probability density estimation using zero-norm constraints. These works are published as two journal papers:
X. Hong, S. Chen and C. J. Harris: "Elastic net prefiltering for two class classification", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (2012, In Press).
X. Hong, S. Chen and C. J. Harris: "Using zero-norm constraint for sparse probability density function estimation", International Journal of Systems Science, Vol. 43, No. 11, pp. 2107-2113 (2012).
Exploitation Route	The proposed algorithms are very practical, so users can apply them to many problems immediately. The results are disseminated through journal and conference papers.
Sectors	Digital/Communication/Information Technologies (including Software)
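
The SMOTE oversampling step named in the description can be sketched as follows. This is a minimal illustration of the general technique and not the project's implementation: the function name `smote`, its parameters and the toy data are hypothetical, and for brevity the base sample is drawn uniformly at random rather than generating a fixed number of synthetic points per minority sample. Each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority-class neighbours.

    minority: list of equal-length tuples of floats (minority-class samples only).
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification (assumption): pick the base sample uniformly at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=6, k=2, seed=1)
print(len(new_points))  # 6
```

The synthetic points are appended to the minority class before a classifier is trained, so that the two classes are closer to balanced. Because every new point lies between existing minority samples, the method does not extend the minority region beyond the hull of the observed data.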
