LUCID: Clearer Software by Integrating Natural Language Analysis into Software Engineering

Lead Research Organisation: University of Edinburgh

Department Name: School of Informatics

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Funded Value:

£306,726

Funded Period:

Apr 17 - Sep 20

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/P005314/1

Principal Investigator:

Charles Sutton

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (25%)

Fundamentals of Computing (25%)

Software Engineering (50%)

Organisations

University of Edinburgh (Lead Research Organisation)

People
Charles Sutton (Principal Investigator)
Miltiadis Allamanis (Researcher)

Publications

Allamanis M (2017) A Survey of Machine Learning for Big Code and Naturalness

Allamanis M (2018) A Survey of Machine Learning for Big Code and Naturalness in ACM Computing Surveys

Annie Louis (2020) Where should I comment my code? A dataset and model for predicting locations that need comments

Louis A (2020) Where should I comment my code?

Louis Annie (2018) Deep Learning to Detect Redundant Method Comments in arXiv e-prints

Pârțachi P (2020) Flexeme: untangling commits using lexical flows

Key Findings
Research Databases and Models
Engagement Activities


Description	Source code combines two channels: a formal channel that specifies an algorithm for a computer, and a natural language channel that explains that algorithm to developers. Both channels have been studied extensively in isolation. Lucid established a new line of research focused explicitly on how they interact. Some of these interactions form dual channel constraints; Lucid showed how to exploit these constraints to solve software engineering problems. Among its achievements, Lucid tackled the problem of overloading built-in types rather than defining problem-specific types (Refinym, FSE'18), advanced the state of the art in comment placement and quality (Detecting Redundant Comments, arXiv 2019, and Where Should I Comment my Code?, ICSE NIER'20), and showed the utility of probabilistic type inference.
Good comments speed the understanding and maintenance of code. One class of bad comments consists of those that merely repeat the code. Lucid built a technique and a tool to detect such comments. Lucid also surfaced and investigated the problem that precedes writing any specific comment: the question of where to add a comment. Explicitly handling this problem promises to ease the more important and much harder problem of generating comments. Lucid built an effective tool to solve this problem and contributed a dataset for researchers to use.
Dynamic languages, such as JavaScript and Python, are well suited to writing prototypes, but code written in them is expensive to maintain, partly because it lacks type annotations. Therefore, many companies, including Google, Facebook, and Microsoft, have invested in adding static typing to these languages. Static typing, however, requires developers to add type annotations, which can be a monumental effort for a large codebase. Type inference, the traditional solution to this problem, cannot soundly deduce the precise type of many expressions in dynamic languages.
Lucid's DeepTyper project showed that probabilistic type inference can usefully ease this annotation burden by inferring types from local lexical context alone. Types follow a fat-tailed Zipfian distribution, however, and DeepTyper does not handle infrequent types well. Our Typilus work, published at PLDI, uses meta-learning to infer infrequent types.
Lucid has been a productive project, leading to eight papers published at top-tier venues in software engineering and programming languages: FSE, PLDI, ICSE, TSE, and ICSE's NIER.
Exploitation Route	Titans of the software industry, namely Google, Microsoft, Facebook, and Amazon, have all made substantial investments in applying machine learning and natural language processing techniques to improve developer productivity. Their teams are currently working on commercialising tools and techniques that exploit dual channel constraints, some inspired by approaches and perspectives pioneered by Lucid. These approaches tackle a range of software engineering problems, including autocompletion, commit untangling, code search, and probabilistic type inference for dynamically typed languages.
Sectors	Digital/Communication/Information Technologies (including Software)
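
To make the notion of a comment that "redundantly repeats the code" concrete, the following is a minimal sketch of a token-overlap heuristic. It is not Lucid's technique or tool (the project's redundant-comment work used deep learning); the function names, the stopword list, and the scoring rule are all assumptions introduced purely for illustration.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "for", "and", "this", "that", "is"}


def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def comment_words(comment: str) -> set[str]:
    """Return the lowercase content words of a comment, minus stopwords."""
    words = re.findall(r"[A-Za-z]+", comment.lower())
    return {w for w in words if w not in STOPWORDS}


def redundancy_score(comment: str, method_name: str) -> float:
    """Fraction of the comment's content words already in the method name.

    A score near 1.0 suggests the comment adds little beyond the name itself.
    """
    words = comment_words(comment)
    if not words:
        return 0.0
    name_words = set(split_identifier(method_name))
    return len(words & name_words) / len(words)


if __name__ == "__main__":
    print(redundancy_score("Get the user name.", "getUserName"))
    print(redundancy_score("Returns None when the cache is stale", "getUserName"))
```

Under these assumptions, a comment that only restates the method name scores high, while one that documents behaviour the name cannot convey scores low. The heuristic exploits the same dual-channel observation described above: the identifier (formal channel) and the comment (natural language channel) can be compared directly.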


Title	Comment locator dataset
Description	This dataset contains a curated set of open source C projects, from which we have extracted the inline comments and added their locations as test targets. This is to facilitate research on predicting comment locations from code.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	This is fairly recent work at the time of writing (March 2021), so there has not been enough time to observe notable impacts from this dataset.
URL	http://groups.inf.ed.ac.uk/cup/comment-locator/


Description	Huawei Shenzhen Dec 2017
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Gave tech talk about our research to software engineers at Huawei, one of the largest companies in China
Year(s) Of Engagement Activity	2017


Description	sourced Tech Talk
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Hour-long tech talk for a workshop aimed at software engineers in Eastern Europe
Year(s) Of Engagement Activity	2017
URL	https://blog.sourced.tech/post/ml_talks_moscow/
