Lead Research Organisation: Brunel University
Department Name: Computer Science


SIGNIFICANCE: Faults in software code are a significant cost to companies, as well as a risk to human safety and business success. Finding and fixing faults in code costs the UK software industry billions of pounds every year. Significant cost savings are available with even small improvements in our capability to find faults before systems are delivered to users.

BACKGROUND: Our previous work shows that during the last 10 years, 208 studies have published hundreds of different fault prediction models. These studies typically involve researchers applying one or more of the many modelling techniques to one or more of the many available data sets, then applying performance measures to report how well the resulting model predicts faults.

PROBLEM: Models do not perform consistently above the current predictive performance ceiling of about 80% recall. We propose that an important contributor to this underperformance is that models treat all faults as homogeneous. No previous attempt has been made to understand what characteristics make a fault predictable or what features a model needs in order to predict faults with particular characteristics.
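For readers unfamiliar with the measure, recall is the fraction of genuinely faulty code units that a model flags as faulty. The following is a minimal sketch of that standard definition only; the function name and the file names are hypothetical and are not taken from the project.

```python
def recall(actual_faulty, predicted_faulty):
    """Fraction of actually faulty units that the model flagged as faulty."""
    actual = set(actual_faulty)
    if not actual:
        raise ValueError("recall is undefined when there are no faulty units")
    true_positives = len(actual & set(predicted_faulty))
    return true_positives / len(actual)


# Hypothetical data: 10 faulty files, of which the model flags 8 (plus one false alarm).
actual = {f"file{i}.java" for i in range(10)}
predicted = {f"file{i}.java" for i in range(8)} | {"clean.java"}
print(recall(actual, predicted))  # 0.8
```

A model at the ceiling described above would therefore still miss roughly one in five faulty units.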

AIM: To build a fault prediction model ensemble which is focused on the characteristics of faults and which consistently performs above the current performance ceiling.

METHOD: This 36-month project is based on analysing the code and fault data from six commercial systems and six open source systems. We will conduct detailed quantitative and qualitative analysis of the characteristics of the faults in these systems, identifying, for example, whether faults stem from problems in code interfaces, algorithmic problems, structural problems, typographic problems, etc. We will construct a set of prediction models with a large variety of features (e.g. different modelling techniques, different independent variables, etc.). We will use these models to empirically identify relationships between fault characteristics and the features of individual models, that is, which features of prediction models predict faults with particular characteristics. We will then build ensembles of models whose features cover the widest range of fault characteristics, and evaluate them on industrial systems in collaboration with a company.
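One way to picture the ensemble-building step is as a coverage problem: given which fault characteristics each model predicts well, choose a small set of models that together cover as many characteristics as possible. The sketch below uses a simple greedy set-cover heuristic purely for illustration. It is not the project's actual method, and the model names, the characteristic labels and the `select_ensemble` function are all hypothetical.

```python
def select_ensemble(model_coverage, max_models):
    """Greedily pick models that together cover the most fault characteristics.

    model_coverage maps a model name to the set of fault characteristics
    that model is assumed to predict well (hypothetical input).
    """
    chosen, covered = [], set()
    remaining = dict(model_coverage)
    while remaining and len(chosen) < max_models:
        # Pick the model that adds the most not-yet-covered characteristics;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining model adds anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical coverage data, for illustration only.
coverage = {
    "naive_bayes_static_metrics": {"interface", "typographic"},
    "random_forest_process_metrics": {"algorithmic", "structural", "interface"},
    "svm_static_metrics": {"structural"},
}
models, covered = select_ensemble(coverage, max_models=2)
print(models)
print(sorted(covered))
```

With this toy input the heuristic first picks the model covering three characteristics, then the one that adds the remaining characteristic, and skips the model that adds nothing new.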

Planned Impact

Faults are hugely costly to the UK software industry. Resourcing developers to find and fix faults is very expensive, and the opportunity cost of developers doing this work is significant. The costs of failing to find and fix faults can be catastrophic to both human life and business success. Finding faults early in the lifecycle reduces the cost of fixing them and mitigates the risk to humans and businesses. Research that improves our capability to find faults therefore offers companies huge potential benefits. Indeed, because the cost of finding and fixing faults is so significant, even a small improvement in fault-finding capability will save a large amount of money. Consequently, our proposal is very important to the UK software industry.

Despite the potential importance of fault prediction to the software industry, companies have been slow to take it up. This is predominantly because, as described earlier, the predictive performance of models typically does not go beyond 80% recall. Companies want better predictive performance to justify an investment in fault prediction models. In addition, developing in-house fault prediction expertise is not straightforward: the field is complex and few companies have such expertise available. As a result, industry does not currently have much appetite for fault prediction.

Our work will make a significant impact on increasing industrial take-up of fault prediction models for two reasons. First, our model ensemble will offer improved predictive performance, which in itself will make it more attractive to companies. Second, our impact and dissemination strategy is designed to explicitly target industrial take-up of our model ensemble. That strategy centres on producing genuinely useful and usable tools for industry and on effectively drawing these tools to the attention of industry.

Our impact strategy has the following elements: delivering our model ensemble in the form of a highly usable IDE/ANT plug-in tool; developing a fault analysis web site to which companies can submit code for automatic fault analysis; an industry workshop on fault prediction; and both industry-oriented and academic publications.
Description We have extensively analysed faults in commercial and open source systems. Our findings are that:
- commercial systems seem to contain fewer faults than open source systems.
- ensemble approaches seem to predict faults better than single model approaches.
- code cleaning could improve the prediction of faults.
Exploitation Route Our findings and tools will take the fault prediction community forward and, we hope, will be used by companies. Development of our tools is ongoing.
Sectors Digital/Communication/Information Technologies (including Software)

Description Our company partners have used the tools we developed in their software engineering process. Our tools are also available as open source tools for anyone to use.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

Title ELFF Defect prediction tool 
Description A software defect prediction tool for use by researchers and developers. 
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact The tool is currently in evaluation with our industrial partner (Sky Plc). We are also talking to other companies about transferring the tool into industrial practice. 
Title ELFF software defect prediction tool 
Description Our ELFF defect prediction tool is about to be licensed to a commercial company. The licence negotiations are currently at an advanced stage.
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted
Licensed No
Impact This is all pending and depends on the outcome of the licensing negotiations.