Statistical Natural Language Processing Methods for Computer Program Source Code

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Informatics

Abstract

Complex software systems involve many components and make use of many external libraries. Programmers who work on such software must remember the protocols for using all of those components correctly, and the process of learning to use a new component can be time-consuming and a source of bugs.

We believe that there is a major untapped resource that can help address this problem. Billions of lines of code are readily available on the Internet, much of which is of professional quality. Hidden within this code is a large amount of knowledge about good coding practices, for example, about avoiding error-prone constructs or about the best protocol for using a particular library. We envision a new type of programming tool, which could be called a data-driven development tool, that aggregates knowledge about programming from a large corpus of mature software projects and presents it within the development environment. Just as the current generation of IDEs helps developers to manage their code, the next generation of IDEs will help developers to learn how to write better code.

Fortunately, there is a research field that has already developed a large body of sophisticated tools for analyzing large amounts of text: namely, statistical natural language processing. The long-term strategic goal of this project is to develop new natural language processing techniques aimed at analyzing computer program source code, in order to help programmers learn coding techniques from the code of others. There is a large area for research here that has been almost completely unexplored.

As a first step in this research area, in this project we will focus on automatically identifying short code fragments, which we call idioms, that occur repeatedly across different software projects. An example of an idiom is the typical construct for iterating over an array in Java. Although they are ubiquitous in source code, idioms of this form have not to our knowledge been systematically studied, and we are unaware of any techniques for automatically identifying idioms. The main objective of this project is to develop new statistical NLP methods with the goal of automatically identifying idioms from a corpus of source code text. We call this research problem idiom mining, and it is to our knowledge a new research problem.
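To make the array-iteration example concrete, here is a minimal sketch of what such an idiom might look like in Java. The abstract does not show the construct itself, so the code below assumes it refers to the familiar indexed for loop (and its enhanced-for variant); the class and method names are hypothetical and are not taken from the project.

```java
// Illustrative only: the class and method names are hypothetical.
final class ArrayIterationIdiom {

    // The indexed-loop idiom: visit every element of an array in order.
    static int sumIndexed(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // The enhanced-for variant of the same idiom.
    static int sumForEach(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] sample = {3, 1, 4, 1, 5};
        System.out.println(sumIndexed(sample)); // prints 14
        System.out.println(sumForEach(sample)); // prints 14
    }
}
```

The loop skeleton `for (int i = 0; i < values.length; i++) { ... values[i] ... }` recurs across unrelated projects with only the loop body changing, which is the kind of repeated fragment that idiom mining would aim to surface.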

This is an interdisciplinary project that draws from statistical NLP, machine learning, and software engineering. The research work of this project is primarily in statistical NLP and machine learning, and will involve developing new statistical methods for finding idioms in programming language text.

Planned Impact

The work in this proposal has the potential for substantial economic benefits in the long term. This project is an applied one with the general goal of building tools that help developers to create software more accurately and at lower cost. The UK has one of the strongest software sectors in Europe. For example, in 2008 the UK accounted for 25% of European software companies. By making it possible to develop software at lower cost, we hope to benefit companies that sell software. We hope that these tools would be of special benefit to the many companies that develop custom software systems for their own in-house use, by lowering the cost of these infrastructural projects.

We also hope that software developers themselves benefit, by having new tools that make their jobs easier and more enjoyable. In our personal experience of professional software development, programmers find it extremely important to have good development tools, and even enjoy using them. This is evidenced by the fact that in many cases, programmers voluntarily spend their own time working on tools; development tools comprise many of the most successful open source projects, such as the GNU command-line utilities, gcc, and Eclipse. We hope that development tools based on code mining have the potential to have an elusive "magic" quality, by finding patterns that programmers recognize but didn't realize existed, with the effect of making software developers happier and more productive.

Funded Value:

£375,601

Funded Period:

Oct 13 - Mar 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/K024043/1

Principal Investigator:

Charles Sutton

Research Subject:

Info. & commun. Technol. (70%)

Linguistics (30%)

Research Topic:

Artificial Intelligence (30%)

Comput./Corpus Linguistics (30%)

Fundamentals of Computing (10%)

Software Engineering (30%)

People

Charles Sutton (Principal Investigator)

Publications

Allamanis M (2016) Learning Continuous Semantic Representations of Symbolic Expressions

Allamanis M (2014) Mining idioms from source code

Allamanis M (2018) Mining Semantic Loop Idioms in IEEE Transactions on Software Engineering

Allamanis M (2014) Learning natural coding conventions

Allamanis M (2018) A Survey of Machine Learning for Big Code and Naturalness in ACM Computing Surveys

Allamanis M (2016) A Convolutional Attention Network for Extreme Summarization of Source Code

Allamanis M (2015) Suggesting accurate method and class names

Allamanis M. (2016) A convolutional attention network for extreme summarization of source code in 33rd International Conference on Machine Learning, ICML 2016

Allamanis Miltiadis (2014) Mining Idioms from Source Code in arXiv e-prints

Key Findings

Description	We have developed new methods for finding patterns in millions of lines of computer programs on the web. These patterns contain information for how to write programs that are easier for other computer programs to read and debug.
Exploitation Route	This project has led the development of a new research area, focused around taking methods that have been used to help computers analyze text (such as those in Bing and Google Translate) and applying them to computer programs, which are also text. Therefore we expect that a growing number of researchers will take our methods forward and extend them in future work.
Sectors	Digital/Communication/Information Technologies (including Software)
URL	http://homepages.inf.ed.ac.uk/csutton/

Impact Summary

Description	We have been collaborating with one of the leading international software development companies to trial the use of our suggestion tools within a well-known software development environment. An initial implementation has been developed, but user testing is still underway.
First Year Of Impact	2016
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Economic

Further Funding

Description	EPSRC Responsive Mode Grant
Amount	£614,000 (GBP)
Funding ID	EP/P005314/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	04/2017
End	03/2020

Description	Microsoft Research PhD Scholarship
Amount	£75,000 (GBP)
Organisation	Microsoft Research
Department	Microsoft Research Cambridge
Sector	Private
Country	United Kingdom
Start	09/2017
End	08/2020

Research Databases and Models

Title	Github Java corpus
Description	We curated a data set of open source computer programs from Github in order to provide a common platform for future research in the area.
Type Of Material	Database/Collection of data
Year Produced	2013
Provided To Others?	Yes
Impact	The paper associated with this data set has been cited 95 times, indicating interest from the research community in building on this data.
URL	http://groups.inf.ed.ac.uk/cup/javaGithub/

Collaboration

Description	Microsoft Research
Organisation	Microsoft Research
Department	Microsoft Research Cambridge
Country	United Kingdom
Sector	Private
PI Contribution	We developed new methods for finding patterns in computer program source code that represent _coding conventions_, which are stylistic aspects of the way that programmers express themselves when they write software.
Collaborator Contribution	Microsoft sponsored a PhD studentship in an area related to our grant award on finding patterns in computer program source code. This student collaborated closely with the research staff on the project, and also worked as a researcher on the project for a few months after his PhD project was complete. We also collaborated with researchers in software engineering from Microsoft Research Redmond for several of our papers in this area.
Impact	This collaboration has resulted in our publication at the Foundations of Software Engineering conference in 2014.
Start Year	2012

Software and Technical Products

Title	TASSAL: Autofolding for Source Code Summarization
Description	TASSAL is a demonstration system that allows software developers to quickly skim code by showing only the regions of code that are most relevant to the overall code's purpose, as judged by a statistical model.
Type Of Technology	Webtool/Application
Year Produced	2016
Impact	This demo was released this year. The source code is available at the demonstration web site.
URL	http://groups.inf.ed.ac.uk/cup/tassal/demo.html
