Statistical Natural Language Processing Methods for Computer Program Source Code
Lead Research Organisation:
University of Edinburgh
Department Name: Sch of Informatics
Abstract
Complex software systems involve many components and make use of many external libraries. Programmers who work on such software must remember the protocols for using all of those components correctly, and the process of learning to use a new component can be time consuming and a source of bugs.
We believe that there is a major untapped resource that can help address this problem. Billions of lines of code are readily available on the Internet, much of which is of professional quality. Hidden within this code is a large amount of knowledge about good coding practices, for example, about avoiding error-prone constructs or about the best protocol for using a particular library. We envision a new type of programming tool, which could be called a data-driven development tool, that aggregates knowledge about programming from a large corpus of mature software projects for presentation within the development environment. Just as the current generation of IDEs helps developers to manage their code, the next generation of IDEs will help developers to learn how to write better code.
Fortunately, there is a research field that has already developed a large body of sophisticated tools for analyzing large amounts of text: namely, statistical natural language processing. The long-term strategic goal of this project is to develop new natural language processing techniques aimed at analyzing computer program source code, in order to help programmers learn coding techniques from the code of others. There is a large area for research here that has been almost completely unexplored.
As a first step in this research area, in this project we will focus on automatically identifying short code fragments, which we call idioms, that occur repeatedly across different software projects. An example of an idiom is the typical construct for iterating over an array in Java. Although they are ubiquitous in source code, idioms of this form have not to our knowledge been systematically studied, and we are unaware of any techniques for automatically identifying idioms. The main objective of this project is to develop new statistical NLP methods with the goal of automatically identifying idioms from a corpus of source code text. We call this research problem idiom mining, and it is to our knowledge a new research problem.
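To make the array example concrete, the following is a minimal illustration of the kind of fragment meant by an idiom. The class and method names are hypothetical and chosen only for this sketch; only the loop shapes themselves are the point.

```java
// Hypothetical illustration: the names here are not taken from the project.
class ArrayIterationIdiom {

    // The index-based loop header below recurs, with different variable
    // names, across a very large number of Java projects.
    static int sumWithIndexLoop(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // The enhanced for loop is a second common way to express the same iteration.
    static int sumWithForEach(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 3, 4};
        System.out.println(sumWithIndexLoop(sample)); // prints 10
        System.out.println(sumWithForEach(sample));   // prints 10
    }
}
```

Both methods compute the same result; what makes each loop an idiom is that its syntactic shape is reused almost verbatim wherever an array is traversed.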
This is an interdisciplinary project that draws from statistical NLP, machine learning, and software engineering. The research work of this project is primarily in statistical NLP and machine learning, and will involve developing new statistical methods for finding idioms in programming language text.
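As a rough intuition for what finding idioms across projects involves, the sketch below counts how many projects contain each normalized token sequence. This is a deliberately naive frequency baseline written for illustration, not the statistical method developed in this project; the class name, the small keyword list, and the tokenizer are all assumptions of the sketch.

```java
import java.util.*;
import java.util.regex.*;

// Naive baseline for illustration only; all names here are hypothetical.
class NaiveIdiomCounter {
    // Crude tokenizer: identifiers/keywords, integer literals, or single symbols.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+|\\S");
    // Small illustrative subset of Java keywords that are kept verbatim.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("for", "while", "if", "else", "return", "int"));

    // Replace identifiers with ID and numbers with NUM so that fragments
    // differing only in naming map to the same token sequence.
    static List<String> normalize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            if (KEYWORDS.contains(t)) {
                tokens.add(t);
            } else if (Character.isDigit(t.charAt(0))) {
                tokens.add("NUM");
            } else if (Character.isLetter(t.charAt(0)) || t.charAt(0) == '_') {
                tokens.add("ID");
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // For every n-token window, count the number of projects that contain it.
    static Map<String, Integer> projectSupport(List<String> projects, int n) {
        Map<String, Integer> support = new HashMap<>();
        for (String source : projects) {
            List<String> tokens = normalize(source);
            Set<String> seen = new HashSet<>(); // count each project at most once
            for (int i = 0; i + n <= tokens.size(); i++) {
                seen.add(String.join(" ", tokens.subList(i, i + n)));
            }
            for (String gram : seen) {
                support.merge(gram, 1, Integer::sum);
            }
        }
        return support;
    }

    public static void main(String[] args) {
        List<String> projects = Arrays.asList(
                "for (int i = 0; i < xs.length; i++) { s += xs[i]; }",
                "for (int j = 0; j < data.length; j++) { print(data[j]); }");
        String header = "for ( int ID = NUM ; ID < ID . ID ; ID + + )";
        // The 17-token loop header is shared by both toy projects.
        System.out.println(projectSupport(projects, 17).get(header));
    }
}
```

Raw frequency of this kind tends to surface fragments that are merely common rather than meaningful, which is one reason a statistical model is a more natural fit for the problem than simple counting.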
Planned Impact
The work in this proposal has the potential for substantial economic benefits in the long term. This project is an applied one with the general goal of building tools that help developers to create software more accurately and at lower cost. The UK has one of the strongest software sectors in Europe. For example, in 2008 the UK accounted for 25% of European software companies. By making it possible to develop software at lower cost, we hope that this work will benefit companies that sell software. We hope that these tools would be of special benefit to the many companies that develop custom software systems for their own in-house use, by lowering the cost of these infrastructural projects.
We also hope that software developers themselves will benefit from new tools that make their jobs easier and more enjoyable. It has been our personal experience in professional software development that programmers find it extremely important to have good development tools, and even enjoy using them. This is evidenced by the fact that in many cases programmers voluntarily spend their own time working on tools; development tools comprise many of the most successful open source projects, such as the GNU command-line utilities, gcc, and Eclipse. We hope that development tools based on code mining have the potential to have an elusive "magic" quality, by finding patterns that programmers recognize but didn't realize existed, with the effect of making software developers happier and more productive.
People |
ORCID iD |
Charles Sutton (Principal Investigator) |
Publications
Allamanis M
(2016)
A Convolutional Attention Network for Extreme Summarization of Source Code
Allamanis M
(2018)
Mining Semantic Loop Idioms
in IEEE Transactions on Software Engineering
Allamanis M
(2018)
A Survey of Machine Learning for Big Code and Naturalness
in ACM Computing Surveys
Allamanis M
(2014)
Mining idioms from source code
Allamanis M
(2014)
Learning natural coding conventions
Allamanis M
(2016)
Learning Continuous Semantic Representations of Symbolic Expressions
Allamanis M
(2015)
Suggesting accurate method and class names
Allamanis M.
(2016)
A convolutional attention network for extreme summarization of source code
in 33rd International Conference on Machine Learning, ICML 2016
Allamanis Miltiadis
(2014)
Learning Natural Coding Conventions
in arXiv e-prints
Description | We have developed new methods for finding patterns in millions of lines of computer programs on the web. These patterns contain information for how to write programs that are easier for other computer programs to read and debug. |
Exploitation Route | This project has led the development of a new research area, focused around taking methods that have been used to help computers analyze text (such as those in Bing and Google Translate) and applying them to computer programs, which are also text. Therefore we expect that a growing number of researchers will take our methods forward and extend them in future work. |
Sectors | Digital/Communication/Information Technologies (including Software) |
URL | http://homepages.inf.ed.ac.uk/csutton/ |
Description | We have been collaborating with one of the leading international software development companies to trial the use of our suggestion tools within a well-known software development environment. An initial implementation has been developed, but user testing is still underway. |
First Year Of Impact | 2016 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Economic |
Description | EPSRC Responsive Mode Grant |
Amount | £614,000 (GBP) |
Funding ID | EP/P005314/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 03/2017 |
End | 03/2020 |
Description | Microsoft Research PhD Scholarship |
Amount | £75,000 (GBP) |
Organisation | Microsoft Research |
Department | Microsoft Research Cambridge |
Sector | Private |
Country | United Kingdom |
Start | 08/2017 |
End | 08/2020 |
Title | Github Java corpus |
Description | We curated a data set of open source computer programs from GitHub in order to provide a common platform for future research in the area. |
Type Of Material | Database/Collection of data |
Year Produced | 2013 |
Provided To Others? | Yes |
Impact | The paper associated with this data set has been cited 95 times, indicating interest from the research community in building on this data. |
URL | http://groups.inf.ed.ac.uk/cup/javaGithub/ |
Description | Microsoft Research |
Organisation | Microsoft Research |
Department | Microsoft Research Cambridge |
Country | United Kingdom |
Sector | Private |
PI Contribution | We developed new methods for finding patterns in computer program source code that represent _coding conventions_, which are stylistic aspects of the way that programmers express themselves when they write software. |
Collaborator Contribution | Microsoft sponsored a PhD studentship in an area related to our grant award on finding patterns in computer program source code. This student collaborated closely with the research staff on the project, and also worked as a researcher on the project for a few months after his PhD project was complete. We also collaborated with researchers in software engineering from Microsoft Research Redmond for several of our papers in this area. |
Impact | This collaboration has resulted in our publication at the Foundations of Software Engineering conference in 2014. |
Start Year | 2012 |
Title | TASSAL: Autofolding for Source Code Summarization |
Description | TASSAL is a demonstration system that allows software developers to quickly skim code by showing only the regions of code that are most relevant to the overall code's purpose, as judged by a statistical model. |
Type Of Technology | Webtool/Application |
Year Produced | 2016 |
Impact | This demo has just been released this year. The source code is available at the demonstration web site. |
URL | http://groups.inf.ed.ac.uk/cup/tassal/demo.html |