Imbalanced Data Set Modelling and Classification for Life Threatening/Safety Critical Applications

Lead Research Organisation: University of Reading
Department Name: Sch of Systems Engineering

Abstract

Machine learning from imbalanced data sets is relevant to a broad range of important problems in many engineering and scientific disciplines, e.g. medical diagnostics, signal detection and machine/material fault detection. Apart from its high practical value, learning from imbalanced data sets is also of considerable theoretical interest. The performance metrics used in conventional classifier construction may break down when applied to imbalanced data sets, and this has motivated considerable research in the machine learning community into a variety of learning methodologies for such data.

Despite this significant research effort, there is still a lack of general methodologies that can deliver the knowledge discovery capability demanded by many hugely important applications. For example, it is highly beneficial to discover new noninvasive biological markers from clinical data, since these can improve early medical diagnosis and allow treatment of a cancer to start early. The motivation of the proposed research can be illustrated by another example, from materials science. Suppose that new materials with exceptional properties, e.g. strength, are required for new mechanical structures, e.g. military vehicles. For this purpose, a sample of experimental trials is performed to obtain a new material together with measurements of its properties. It is highly desirable that these properties/behaviours be discovered by data modelling from a small sample, rather than by performing many more unnecessary and very expensive engineering experiments (a large sample).

This proposal is concerned with the development of a new modelling approach which builds upon state-of-the-art nonlinear modelling methodologies and is specifically designed for pattern recognition using imbalanced data sets. The objectives of the research include modelling, classification, class probability (risk) prediction and knowledge discovery from the imbalanced data sets that are commonly found in many associated applications.
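
The abstract's remark that conventional performance metrics can break down on imbalanced data can be illustrated with a small worked example. The sketch below is not taken from the project; the data (10 positive cases among 1000 samples), the always-negative "classifier" and the function names are all hypothetical. It contrasts plain accuracy with the geometric mean of the two per-class recalls, one commonly used alternative.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 as the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical data: 10 positive cases among 1000 samples.
y_true = [1] * 10 + [0] * 990
# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.99
print(g_mean(y_true, y_pred))    # 0.0
```

The trivial classifier scores 99% accuracy while detecting none of the minority cases, which is why a metric that weighs both classes, such as the geometric mean used here, is more informative in this setting.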
 
Description All the original objectives were achieved. We proposed a number of efficient learning algorithms for building two-class classifiers from imbalanced data sets.



We investigated oversampling techniques to balance the class distributions, and validated the concept by combining them with our previous state-of-the-art classifier construction algorithm over a large number of benchmark data sets. We initially applied the well-known SMOTE oversampling technique and then proposed a novel oversampling approach based on probability density function estimation for this strand of research. This work is detailed in two conference papers and a journal paper:



M. Gao, X. Hong, S. Chen and C. J. Harris: "Probability density function estimation based over-sampling for imbalanced two-class problems", in International Joint Conference on Neural Networks, Brisbane, Australia, 10-15 June (2012)



M. Gao, X. Hong, S. Chen and C. J. Harris: "On Combination of SMOTE and Particle Swarm Optimization based Radial Basis Function for Imbalanced Problem", in Proc. IJCNN 2011, San Jose, USA, 30 July - 5 August, pp. 1146-1153 (2011)



M. Gao, X. Hong, S. Chen and C. J. Harris: "A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems", Neurocomputing, Vol. 74, No. 17, pp. 3456-3466 (2011)
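
The SMOTE oversampling step mentioned above can be sketched in a few lines. This is a minimal illustration of the generic SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the implementation used in the cited papers; the function name `smote`, its parameters and the toy data are assumptions made for this example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature lists (at least two samples).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # For each minority sample, the indices of its k nearest minority neighbours.
    neighbours = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(x, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))   # pick a minority sample at random
        j = rng.choice(neighbours[i])      # pick one of its nearest neighbours
        gap = rng.random()                 # position along the segment x_i -> x_j
        x, nb = minority[i], minority[j]
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Hypothetical minority class with four 2-D samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2, seed=1)
print(len(new_points))  # 6
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class is enlarged without simply duplicating points. The synthetic samples would then be appended to the training set before the classifier is constructed.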



A neurofuzzy classifier identification algorithm was introduced for two-class problems. The advantage of a neurofuzzy classifier is that it eases knowledge discovery. We introduced the new ideas of using Gaussian mixture models for the input partitions and a logistic model to produce the class probability output, further improving model transparency. This work also extends our previous subspace orthogonal least squares algorithm for rule selection. On benchmark examples, the classification performance is comparable with results in the literature, while our approach provides users with more internal information about the system. This work is published in a conference paper:



M. Gao, X. Hong and C. J. Harris: "A neurofuzzy classifier for two class problems", in the 12th Annual Workshop on Computational Intelligence (UKCI2012), Edinburgh, UK, 5-7 September (2012)
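
To make the structure described above more concrete, the sketch below shows one plausible reading of "Gaussian mixture input partitions with a logistic output": the input is softly assigned to mixture components, and a logistic function of the weighted memberships gives the class probability. This is an assumption made for illustration and not the algorithm from the cited paper; isotropic components, all parameter values and the names `memberships` and `class_probability` are hypothetical, and rule selection is not shown.

```python
import math

def memberships(x, centres, variances, mix_weights):
    """Normalised Gaussian-mixture memberships of x (they sum to 1)."""
    d = len(x)
    log_terms = []
    for c, var, pi in zip(centres, variances, mix_weights):
        sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        # Log of (mixing weight * isotropic Gaussian density).
        log_terms.append(math.log(pi) - 0.5 * sq / var
                         - 0.5 * d * math.log(2 * math.pi * var))
    m = max(log_terms)  # subtract the maximum for numerical stability
    unnorm = [math.exp(t - m) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def class_probability(x, centres, variances, mix_weights, rule_weights, bias=0.0):
    """Logistic output of the membership-weighted rule consequents."""
    phi = memberships(x, centres, variances, mix_weights)
    z = bias + sum(w * p for w, p in zip(rule_weights, phi))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-rule model: one partition per class region.
centres = [[0.0, 0.0], [3.0, 3.0]]
variances = [1.0, 1.0]
mix_weights = [0.5, 0.5]
rule_weights = [-2.0, 2.0]

p_low = class_probability([0.0, 0.0], centres, variances, mix_weights, rule_weights)
p_high = class_probability([3.0, 3.0], centres, variances, mix_weights, rule_weights)
print(p_low < 0.5 < p_high)  # True
```

In a model of this kind the memberships show which partition (rule) is active for a given input and the rule weights show how each partition pushes the class probability, which is the sort of internal information the description refers to.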



Sparse modelling problems are fundamental to many machine learning and pattern recognition tasks. We investigated two modes of sparse modelling aimed at improved generalisation capability, i.e. supervised classifier construction using elastic net parameter regularization, and probability density estimation using zero-norm constraints. This work is published in two journal papers:



X. Hong, S. Chen, C. J. Harris: "Elastic net prefiltering for two class classification", IEEE Transactions on Systems, Man, and Cybernetics--Part B: Cybernetics, (2012, In Press)



X. Hong, S. Chen, C. J. Harris: "Using zero-norm constraint for sparse probability density function estimation", International Journal of Systems Science, Vol. 43, No. 11, pp. 2107-2113 (2012)
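
The sparsity produced by elastic net regularization can be illustrated with a small coordinate descent solver. The sketch below minimises the generic elastic net objective (1/2n)||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||^2 for a linear model; it is not the prefiltering algorithm of the cited paper, and the function names, penalty values and toy data are assumptions made for this example.

```python
def soft_threshold(v, t):
    """Shrink v towards zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def elastic_net_cd(X, y, l1, l2, n_sweeps=100):
    """Coordinate descent for elastic net penalised least squares."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # y - Xw, with w initialised to zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            denom = col_sq[j] + l2
            new_wj = soft_threshold(rho, l1) / denom if denom > 0 else 0.0
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w

# Hypothetical data: the target depends only on the first feature.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]
w = elastic_net_cd(X, y, l1=0.5, l2=0.25)
print(w)  # first weight close to 1.2, second weight exactly 0.0
```

The l1 term sets the weight of the irrelevant feature to exactly zero, giving a sparse model, while the l2 term shrinks the remaining weight (here from 2 towards roughly 1.2) and keeps the update well conditioned when features are correlated.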
Exploitation Route The proposed algorithms are practical and can be applied by users to many problems immediately. They are described in the journal and conference papers listed above.
Sectors Digital/Communication/Information Technologies (including Software)