Comparing Classifiers Correctly

Lead Research Organisation: Brunel University London
Department Name: Computer Science

Abstract

Finding an effective way to predict defect-prone parts of software systems has been a longstanding challenge for software engineering researchers. Such predictions would assist the allocation of scarce software testing resources and also guide decision-making concerning when to release software systems. Thus it is no surprise that defect prediction has attracted a great deal of research attention, with many hundreds of published studies.

In order to combine these studies, researchers need a sound means of comparing predictive performance, and this is currently surprisingly problematic: the popular methods are biased (and therefore unreliable) and also difficult to interpret.

By addressing this need, the proposed research will help unlock the potential value of the large number of studies, primarily using machine learning methods, on defect prediction. This will yield more trustworthy and therefore actionable results, meaning that practitioners can be better guided as to how to find and therefore fix defects.

This travel grant will enable Profs Shepperd and MacDonell to collaborate closely and work intensively on this problem.

Planned Impact

Software is now ubiquitous and impinges upon almost every aspect of our lives. Naturally, then, the quality of software is hugely important, and hence the ability to predict potentially defect-prone parts of such systems is vital. Accurate costs of software defects are hard to come by; however, NIST estimated in 2012 that defects cost the US economy of the order of $59bn annually. Elsewhere it is estimated that North American businesses lose $26.5bn in revenue each year due to IT downtime (Information Week, 24th May 2011). This of course ignores the even more serious non-monetary costs of software failure, namely harm to life and health.

So it is no surprise that software engineering researchers are harnessing developments in machine learning and data science to attack the challenge of pinpointing software defects. Many techniques have been proposed, but they must be compared experimentally. However, such work is undermined if we cannot adequately compare techniques or assess likely predictive performance.

There are two areas of impact. In the short term the impact will be upon academic practice, but the intention, in the medium term, is for industrial impact once we can make more reliable recommendations as to which techniques are most suitable for defect prediction. One vehicle for this is the existing EPSRC-funded network "Fault analyses in industry and academic research" at Brunel, which provides a forum for both researchers and practitioners.

Publications

Shepperd M (2018) The role and value of replication in empirical software engineering results. Information and Software Technology.

Shepperd M (2018) Replication studies considered harmful. arXiv e-prints.


 
Description There has been a great deal of research into developing prediction systems, e.g. for defects or costs, in software engineering. For us to make sense of all the studies we need findings to be replicable. In order to make comparisons fairly, the prediction performance metrics need to be unbiased. We also need to consider what we mean by successfully replicable. A re-analysis of replication studies showed that, due to lack of experimental power, it is easy to confirm an original result, but this is not informative because the prediction limits are so wide that almost any result is confirmatory. The underlying problem is that many experiments are poorly designed and therefore substantially under-powered. In addition we have developed new automated benchmarking procedures for researchers.
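
As an illustration of the bias issue (the summary above does not name a specific metric), the Matthews correlation coefficient (MCC) is one commonly recommended unbiased alternative to the F-measure for comparing classifiers. Below is a minimal sketch in R, chosen because R is the language mentioned for the forthcoming benchmarking code; the function name and the example confusion-matrix counts are hypothetical.

# Hypothetical illustration: Matthews correlation coefficient (MCC) computed
# from the four cells of a binary confusion matrix. Unlike the F-measure,
# MCC uses all four cells and ranges from -1 to +1.
mcc <- function(tp, fp, tn, fn) {
  num <- tp * tn - fp * fn
  den <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (den == 0) return(0)   # conventional value when any marginal total is zero
  num / den
}

# Example: a defect predictor evaluated on a held-out test set (made-up counts)
mcc(tp = 40, fp = 10, tn = 120, fn = 30)   # approximately 0.54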
Exploitation Route The findings are mainly aimed at researchers in software engineering. I am finalising R code which can be shared so that researchers can automate the process of benchmarking their prediction systems. The findings should also encourage researchers to check the power of their experiments a priori.
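
As a sketch of the kind of a priori power check mentioned above (the effect size and power target here are purely illustrative, not taken from the project), base R's power.t.test reports the sample size needed to detect a given standardised effect.

# Illustrative a priori power calculation using base R (stats::power.t.test).
# How many observations per group are needed to detect a standardised effect
# of 0.3 between two classifiers with 80% power at the 5% significance level?
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8,
             type = "two.sample", alternative = "two.sided")
# Reports n of roughly 175 per group, which helps explain why many small
# experiments are substantially under-powered.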
Sectors Digital/Communication/Information Technologies (including Software)