Statistical Natural Language Processing Methods for Computer Program Source Code

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

Complex software systems involve many components and make use of many external libraries. Programmers who work on such software must remember the protocols for using all of those components correctly, and the process of learning to use a new component can be time-consuming and a source of bugs.

We believe that there is a major untapped resource that can help address this problem. Billions of lines of code are readily available on the Internet, much of which is of professional quality. Hidden within this code is a large amount of knowledge about good coding practices, for example, about avoiding error-prone constructs or about the best protocol for using a particular library. We envision a new class of programming tools, which could be called data-driven development tools, that aggregate knowledge about programming from a large corpus of mature software projects for presentation within the development environment. Just as the current generation of IDEs helps developers to manage their code, the next generation of IDEs will help developers to learn how to write better code.

Fortunately, there is a research field that has already developed a large body of sophisticated tools for analyzing large amounts of text: namely, statistical natural language processing. The long-term strategic goal of this project is to develop new natural language processing techniques aimed at analyzing computer program source code, in order to help programmers learn coding techniques from the code of others. There is a large area for research here that has been almost completely unexplored.

As a first step in this research area, in this project we will focus on automatically identifying short code fragments, which we call idioms, that occur repeatedly across different software projects. An example of an idiom is the typical construct for iterating over an array in Java. Although they are ubiquitous in source code, idioms of this form have not to our knowledge been systematically studied, and we are unaware of any techniques for automatically identifying idioms. The main objective of this project is to develop new statistical NLP methods with the goal of automatically identifying idioms from a corpus of source code text. We call this research problem idiom mining, and it is to our knowledge a new research problem.
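For illustration, the array-iteration construct mentioned above usually takes the form of the index-based loop in the sketch below. The class name, the method name and the summing loop body are hypothetical choices made for this example; only the loop header and the indexed access are meant to show the kind of short, recurring fragment that idiom mining would aim to discover.

```java
class ArrayIterationIdiom {
    // Sums an array using the common index-based loop.
    // The "for (int i = 0; i < values.length; i++)" header together with
    // the "values[i]" access is the recurring fragment; the body varies
    // from project to project.
    static int sum(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3};
        System.out.println(sum(values)); // prints 6
    }
}
```

The same loop skeleton appears with different variable names and bodies across unrelated projects, which is why a purely textual search for exact duplicates would tend to miss it.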

This is an interdisciplinary project that draws from statistical NLP, machine learning, and software engineering. The research work of this project is primarily in statistical NLP and machine learning, and will involve developing new statistical methods for finding idioms in programming language text.

Planned Impact

The work in this proposal has the potential for substantial economic benefits in the long term. This project is an applied one with the general goal of building tools that help developers to create software more accurately and at lower cost. The UK has one of the strongest software sectors in Europe. For example, in 2008 the UK accounted for 25% of European software companies. By making it possible to develop software at lower cost, we hope to benefit companies that sell software. We hope that these tools will be of special benefit to the many companies that develop custom software systems for their own in-house use, by lowering the cost of these infrastructural projects.

We also hope that software developers themselves will benefit, by having new tools that make their jobs easier and more enjoyable. In our personal experience of professional software development, programmers find it extremely important to have good development tools, and even enjoy using them. This is evidenced by the fact that in many cases programmers voluntarily spend their own time working on tools; many of the most successful open source projects are development tools, such as the GNU command-line utilities, gcc, and Eclipse. We hope that development tools based on code mining can have an elusive "magic" quality, by finding patterns that programmers recognize but didn't realize existed, with the effect of making software developers happier and more productive.

Publications

 
Description We have developed new methods for finding patterns in millions of lines of computer programs on the web. These patterns contain information on how to write programs that are easier for other programmers to read and debug.
Exploitation Route This project has led the development of a new research area, focused around taking methods that have been used to help computers analyze text (such as those in Bing and Google Translate) and applying them to computer programs, which are also text. Therefore we expect that a growing number of researchers will take our methods forward and extend them in future work.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://homepages.inf.ed.ac.uk/csutton/
 
Description We have been collaborating with one of the leading international software development companies to trial the use of our suggestion tools within a well-known software development environment. An initial implementation has been developed, but user testing is still underway.
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description EPSRC Responsive Mode Grant
Amount £614,000 (GBP)
Funding ID EP/P005314/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 04/2017 
End 03/2020
 
Description Microsoft Research PhD Scholarship
Amount £75,000 (GBP)
Organisation Microsoft Research 
Department Microsoft Research Cambridge
Sector Private
Country United Kingdom
Start 09/2017 
End 08/2020
 
Title GitHub Java corpus
Description We curated a data set of open source computer programs from GitHub in order to provide a common platform for future research in the area.
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact The paper associated with this data set has been cited 95 times, indicating interest from the research community in building on this data. 
URL http://groups.inf.ed.ac.uk/cup/javaGithub/
 
Description Microsoft Research 
Organisation Microsoft Research
Department Microsoft Research Cambridge
Country United Kingdom 
Sector Private 
PI Contribution We developed new methods for finding patterns in computer program source code that represent _coding conventions_, which are stylistic aspects of the way that programmers express themselves when they write software.
Collaborator Contribution Microsoft sponsored a PhD studentship in an area related to our grant award on finding patterns in computer program source code. This student collaborated closely with the research staff on the project, and also worked as a researcher on the project for a few months after his PhD project was complete. We also collaborated with researchers in software engineering from Microsoft Research Redmond on several of our papers in this area.
Impact This collaboration has resulted in our publication at the Foundations of Software Engineering conference in 2014.
Start Year 2012
 
Title TASSAL: Autofolding for Source Code Summarization 
Description TASSAL is a demonstration system that allows software developers to quickly skim code by showing only the regions of code that are most relevant to the overall code's purpose, as judged by a statistical model. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact This demo was released in 2016. The source code is available at the demonstration web site.
URL http://groups.inf.ed.ac.uk/cup/tassal/demo.html