Automatic repair of natural source code

Lead Research Organisation: University of York
Department Name: Computer Science

Abstract

Fixing software defects is a time-consuming and costly activity. One of the main reasons why software debugging is so expensive is that it still remains mainly a manual activity. Fixing a bug is a complex process, which consists of many different steps including finding and understanding the underlying cause of the bug, identifying a set of changes that address the bug correctly, and finally verifying that those changes are correct. Automating this process (or parts of it) can potentially reduce the time, cost and effort it takes to fix bugs, and therefore the quality of the produced software.

Automatic Program repair (APR) is defined as the activity of removing a software defect during software maintenance without human intervention. The most popular approaches to automatic program repair rely on test suites as proxies of the behavioural specification of the system under repair. Although these approaches have demonstrated promising results, such as the synthesis of a fix for the famous Hearbleed bug, they face serious challenges. The main one is the correctness of the generated patches. Test suites are an imperfect metric of program correctness. Therefore techniques, which rely on test suites to evaluate the correctness of the generated patches, tend to overfit the test suite. Additionally, test-driven APR techniques can have scalability limitations in terms of execution time. The space of possible patches is vast and correct patches occur sparsely. In test-driven APR approaches the search time is mainly dominated by the validation of the patches, i.e. by the execution of the test suite. Existing approaches try to address this issue by employing various strategies such as parallel execution of test suites or the use of just a subset of the test suite (sample of positive test cases and all the negative ones). Although, such strategies improve the execution time, scalability of APR approaches remains a challenge.

The motivation of the project is to address overfitting and scalability limitations of APR techniques. The main hypothesis is the following: Statistical properties of source code can improve the state-of-the-art generate-and-validate techniques for APR in terms of patch correctness and execution time.

Statistical properties of source code (i.e. naturalness of source code) can help address the challenges of APR techniques in the following ways:

1) Heuristics based on the statistical properties of source code can be used to rank the results of fault localisation algorithms, and therefore augment their ability to correctly identify faulty lines of code.
2) Execution time of test-driven APR techniques can be reduced by discarding unnatural patches without executing the test suite.
3) Correctness of candidate patches can be improved by using the naturalness of code as part of the fitness function.

To measure the naturalness of source code, this project will mine existing software repositories such as GitHub and it will develop language models, similar to those used in the Natural Language Processing (NLP) domain. These models will be used to assess the naturalness of proposed patches.

Planned Impact

The potential impact of automatic program repair is substantial. Automatic repair of bugs has the potential to lower the cost of software development and improve the quality of the produced software. In the long run the automatic repair of software will likely be a necessity. History suggests that the systems built are becoming more and more complex and interconnected, and as a result finding and correcting defects in them becomes more and more challenging.
The impact of the project can be decomposed into four different areas:

- Economy: Engineering high quality software at low cost can be a potential competitive advantage for the UK software engineering sector. The results of the project will contribute towards this goal, and therefore they will be disseminated to companies developing software in the UK. The industrial collaborator of the project, is expected to use the developed tool internally in projects. Moreover, the results of the project will be disseminated to other industrial collaborators of the DEIS research group. Finally, we plan to expand the industrial user-base of the approach, by publishing the results in journals, which target an industrial audience. To strengthen the industrial impact of the project, we plan to make a number of visits to interested companies in order to promote the project and train engineers on the developed tools and techniques.

- Society: Software is transforming various domains of everyday life. For example, autonomous vehicles can provide mobility for individuals, who cannot normally drive; home automation enables ageing population to live independently for longer; and smart medical devices can improve the life of patients. Approaches which automate the removal of software defects can potentially bring value to society by reducing the overall cost of software development and by increasing the reliability of software. To increase awareness of software quality and debugging, we plan to reach the general public by publishing the results of the project in appropriate outlets, such as The Conversation website. Moreover, the results of the project will be used to establish future research in the important area of reliable software through funding bids, and research projects.

- Knowledge: We expect that the results of the project can enhance the research capacity, knowledge, and skills of businesses and organisations in software debugging. In order to ensure the impact of the proposed project several steps will be taken. First, close collaboration with engineers from the industrial partner (IBM) will enable us the dissemination of the produced knowledge to them. Moreover, to achieve wider dissemination of the results, the prototypical implementation, relevant datasets, and a set of demonstrators of the techniques will be made available via a public website. Moreover, a workshop on software debugging, which will bring together interested academics and commercial organisations, will be held at the end of the project. In addition to the workshop, the results of the project will be published in high-impact venues such as the ASE conference, and tutorial sessions will be organised at these events.

- People: One of the main aspirations of the project is to further develop the skills of engineers industry-wide by making them familiar with novel software engineering techniques and tools. To this end, the investigators will propose talks and presentations at local and national engineering user-groups such as SoCraTes UK.
 
Description The stochastic nature of the state-of-the-art syntax-based automatic program repair (APR) techniques limits the repetitiveness of the patching process. In the experiments conducted in the course of the project, a number of bugs prior reported to be fixed by the considered tools has not been successfully patched using the same tools despite repeating the APR process hundreds of times. Hence, we concluded that the stochastic nature of the known techniques questions their applicability by practitioners.

We have analysed the patches automatically generated by the most popular APR tools and found that, in reality, they do not alter more than two items (modification points, MPs) in software code, even if they have been reported to modify more MPs. We concluded that using techniques capable of fixing many MPs that need to browse an enormous search space of patch candidates is hence unjustified. We have estimated the search space for patches altering just a single MP at a time and found that for the vast majority of benchmark bugs it can be browsed exhaustively in a reasonable time (shorter than one day on a typical PC computer). We have implemented exhaustive search of the patch candidates and integrated it with a popular APR framework. Using this approach, we managed to plausibly patch the largest number of bugs in the popular Defects4J bug benchmark to date, outperforming the second-best solution by 15%. In contrast to that technique, the proposed one is fully deterministic and is suitable for parallel processing at various granularity levels. As the patches found are minimal (just one modification of the original software code), they are less likely to be harmful to the software functionality.
Exploitation Route Fixing software bugs automatically is a silver bullet for software engineering and it can potentially have huge economic and societal impact. This research area investigated the limitations of the state-of-the art techniques to program repair and we propose novel techniques for improving the performance and accuracy of the repair process. Our techniques are generic and can be adopted and incorporated by other search-based program repair techniques, in order to shrink the search space in order to make the search process more efficient.
Sectors Digital/Communication/Information Technologies (including Software)

URL https://github.com/Manatee-EPSRC-Project
 
Description Assuring Dependability In Collective Autonomous Systems (ATLAS)
Amount £88,600 (GBP)
Funding ID ACC2006195 
Organisation Defence Science & Technology Laboratory (DSTL) 
Sector Public
Country United Kingdom
Start 09/2019 
End 06/2020
 
Description Automated Robustness Testing Of Safety-Critical Multi-Robot Inspection Applications (SAFEMUV)
Amount £207,000 (GBP)
Organisation Lloyd's Register Foundation 
Sector Charity/Non Profit
Country United Kingdom
Start 03/2020 
End 08/2021
 
Description Secure and Safe Multi-Robot Systems
Amount € 6,999,786 (EUR)
Funding ID 101017258 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 01/2021 
End 12/2023
 
Title One- and two-degree edit patch finder 
Description Adding a new mode to the state-of-the-art open-source Automatic Program Repair (APR) tool named ARJA for searching one- and two-degree edit patches. Depending on the search-space assessment, the tool performs either an exhaustive or random search. The tool is capable of benefiting from data independence and facilities parallel processing. The number of analysed modification point is automatically adapted based on the predicted computational cost. For two-degree edit patches, there is also dynamic filtering of the modification point pairs to be analysed based on the search space size and the predicted computational cost. 
Type Of Material Improvements to research infrastructure 
Year Produced 2020 
Provided To Others? No  
Impact The tool has been used to generate the largest to date set of plausible patches for the open-source Java projects from the popular Defects4J benchmark. A few bugs from the benchmark have been plausibly patched for the first time, whereas for several other bugs smaller patches (in terms of the number of edited modification points) have been found that the ones published to date. 
 
Title n-gram model guidance for genetic-algorithm-based patch searching 
Description The majority of the state-of-the-art syntax-based Automatic Program Repair (APR) tools employ genetic algorithms (GAs) as a metaheuristic to find plausible patches. GA generates subsequent populations of patch candidates, assigns the fitness value to each of the candidates, selects the candidates with the highest fitness values and performs their mutations in order to improve the patch candidates' quality. This approach has been reported to lead to obtaining plausible patches in many cases. Traditionally, the fitness value depends solely on the number of failed tests from the associated test suite. Due to some adverse properties of such traditional fitness function, mainly the lack of smooth gradient in the search space landscape in the majority of bug cases, new techniques for the fitness function improvement are sought after. The developed method applies some principles from the Natural Language Processing (NLP) domain. According to the literature, a buggy extract of code is rated as more improbable by language models. In both the contexts of natural and programming language processing, the most popular language models are based on n-grams, i.e., contiguous sequences of n items (e.g. words or tokens) and a text probability can be expressed using cross-entropy. In this method, cross-entropy has been used to support the traditional fitness function in a syntax-based APR process. Such a compound fitness function assigns a fitness value to each of the patch candidates. The traditional genetic-algorithm operators, such as selection, mutation or elitism, are still used to generate subsequent populations. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? No  
Impact After applying the new method it appeared that, contrary to the prior results reported in the literature, the correlation between code perplexity and its correctness is statistically insignificant. Hence, applying the perplexity measures is rather unlikely to improve the patch search process in a GA, which has been experimentally confimed. This result, despite being negative, is likely to have a notable impact on the research domain as it demonstrates the unsuitability of simple language models for enhancing fitness functions used in APR. 
 
Title Database of single-degree edit patches 
Description The database includes a set of 7689 plausible patches for 66 software bugs from projects JFreeChart, Apache commons-lang, Apache commons-math and Joda-Time from the most popular Java benchmark set in the automatic program repair (APR) research field. The database, both in terms of the number of fixed bugs and the number of generated patches, is the largest Defects4J patch database created to date. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? No  
Impact Defects4J is the primary benchmark for assessing the quality of automatic program repair (APR) tools. The database, being the largest plausible patch databases for this benchmark created to date, can be used as a reference for other APR tools in terms of the efficiency in plausible patch searching. Similarly, as the patches in the database are minimal in terms of the number of edited modification points in the software, they can be used as the reference set for minimising the patch sizes generated by APR tools (patch size minimisation is an important part of the state-of-the-art syntax-based APR tools such as GenProg or ARJA). As the single edit patches have been generated in an exhaustive manner, the database includes all modification points whose alteration can lead to passing the entire unit test set. Software functionalities containing bugs that can be patched with a large number of single-edit plausible patches can be viewed as being insufficiently tested and hence the database can be employed to identify software elements that need additional unit tests written by developers. 
 
Title n-gram Java language models 
Description Five n-gram Java programming language models (for n=2,3,...,6) constructed based on almost 10,000 popular Java repositories including more than 2,200,000 Java source codes, mined from GitHub. The Java tokens such as keywords, keyword literals, separators and operators have been treated as separate items for the n-gram models, whereas all the identifiers, integer literals, floating-point literals and string literals have been abstracted and treated as the same item for each of these token types. 
Type Of Material Computer model/algorithm 
Year Produced 2019 
Provided To Others? No  
Impact The database can be used to determine the so-called normalness of a Java code. Since a buggy code has been shown in the literature to be more improbable by a language model. Using these language models, potential bugs can be identified by computing the perplexity of subsequent tokens, for example using the cross-entropy or log-likelihood measures. 
 
Description Collaboration with CROSSMINER 
Organisation University of York
Country United Kingdom 
Sector Academic/University 
PI Contribution We have collaborated with colleagues from the University of York, who are working on the H2020 European Union project called CROSSMINER. They are developing a platform for mining software engineering data from software repositories. Our team extended their tool with specific functionality, which enables the application of specific filters during the extraction of the online data. Such an extension was needed in our project, in order to mine data from software projects, which meet specific criteria. The results of this work are integrated in their tool and is made available as open source. This collaboration resulted in two publications in the Mining Software Repositories conference - one in 2019 and one to appear in the 2020 version of the conference.
Collaborator Contribution Our partners have provided the knowledge and the software tool, which enabled us to extract quality data from online software repositories.
Impact Two publications have resulted from this collaboration so far: * Kolovos, D., Neubauer, P., Barmpis, K., Matragkas, N. and Paige, R., 2019, May. Crossflow: a framework for distributed mining of software repositories. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) (pp. 155-159). IEEE. *Barmpis,K., Neubauer, P., Matragkas, N., Paige, R. and Kolovos, D., 2020, (To appear). Polyglot and Distributed Software Repository Mining with Crossflow. In 20120 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR). IEEE.
Start Year 2019
 
Title CROSSFLOW 
Description Mining software repositories at a large scale typically requires substantial computational and storage resources. This creates an increasing need for repository mining programs to be executed in a distributed manner, such that remote collaborators can contribute local computational and storage resources. Crossflow is a novel framework for building polyglot distributed repository mining programs. It offers delegation of mining jobs to remote workers and can cache their results. Workers are able to implement advanced behavior like load balancing and rejecting jobs they either cannot perform or would execute sub-optimally. Workers can be written in different programming languages like Java and Python. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact The final version of Crossflow has only been released in end of December. IBM has chosen to adapt and extend Crossflow for one of the KTP projects it runs on data analytics (KTP010852). 
URL https://github.com/crossminer/scava/tree/master/crossflow
 
Description Participation in the 2019 Mining Software Repositories Conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Researchers and practitioners attended event with a focus on presenting and discussing about the recent advances on mining software repositories.
Year(s) Of Engagement Activity 2019
URL https://conf.researchr.org/home/msr-2019
 
Description Participation in the 2020 IEEE Congress on Evolutionary Computation (CEC) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Researchers and practitioners attended event with focus on recent advances on evolutionary computation.
Year(s) Of Engagement Activity 2020
URL https://wcci2020.org/cec-sessions/
 
Description Participation in the 2020 Mining Software Repositories Conference 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Researchers and practitioners attended event with a focus on presenting and discussing about the recent advances on mining software repositories.
Year(s) Of Engagement Activity 2020
URL https://conf.researchr.org/home/msr-2020
 
Description Participation in the 62nd CREST Open Workshop related to Automated Program Repair and Genetic Improvement 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Two members of the research team participated in the 62nd CREST Open Workshop related to Automated Program Repair and Genetic Improvement on 20th and 21st January 2020 in London, organised by UCL. The event has attracted about 50 participants, mainly top researchers in the domain of Automated Program Repair and Genetic Improvement. The members of the research team conducted a series of informal talk with the participants, advertised and discussed the research related to the award.
Year(s) Of Engagement Activity 2020
URL http://crest.cs.ucl.ac.uk/cow/62/