LUCID: Clearer Software by Integrating Natural Language Analysis into Software Engineering

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

Developers spend most of their time maintaining code, with little tool support.
To maintain code, one must understand it. Clear code is easier to read and
understand, and therefore less expensive and risky to evolve and maintain; it
is also notoriously difficult to write. We will help developers write clearer
code to speed maintenance and increase developer productivity. Source code
unites two channels - the programming language and natural language - to
describe algorithms. LUCID will advance the state of the art in software
engineering by developing new analyses that exploit the interconnections
between these channels to find uninformative names, stale comments, and bugs
that manifest as discrepancies between the two channels.

Planned Impact

LUCID attacks a core software engineering concern; it will build tools that
help developers to maintain software more quickly, with less risk and less
cost. Thus, the work in this proposal has the potential for enormous economic
benefits in the long term. The UK has one of the strongest software sectors in
Europe. For example, in 2008 the UK accounted for 25% of European software
companies. By making software maintenance cheaper, this project will benefit
companies that sell software by lowering the costs of evolving their code and
releasing new versions. These tools will also benefit the many companies that
evolve and maintain custom software systems for their own in-house use, by
lowering the cost of these infrastructural projects.

Publications

 
Description Source code combines two channels, a formal channel that specifies an algorithm for a computer and a natural language channel that explains that algorithm to developers. Both channels have been extensively studied in isolation. Lucid established a new line of research explicitly focused on how they interact. Some of these interactions form dual channel constraints; Lucid showed how to exploit these constraints to solve software engineering problems.

Among its achievements, Lucid tackled the problem of overloading built-in types rather than defining problem-specific types (Refinym, FSE'18), advanced the state of the art in comment placement and quality (Detecting Redundant Comments, arXiv 2019, and Where Should I Comment my Code?, ICSE NIER'20), and showed the utility of probabilistic type inference.

Good comments speed the understanding and maintenance of code. One class of bad comments is those that redundantly repeat the code. Lucid built a technique and a tool to detect such comments. Lucid also surfaced and investigated the problem that precedes writing any specific comment: the question of where to add a comment. Explicitly handling this problem promises to ease the more important and much harder problem of generating comments. Lucid built an effective tool to solve this problem and contributed a data set for researchers to use.

Dynamic languages, like JavaScript and Python, are well suited to writing prototypes, but programs written in them are expensive to maintain, partly because they lack type annotations. Therefore, many companies, including Google, Facebook, and Microsoft, have invested in adding static typing to them. Static typing, however, requires developers to add type annotations, which can be a monumental effort for a large codebase. Type inference, the traditional solution to this problem, cannot soundly deduce the precise type of many expressions in dynamic languages. Lucid's DeepTyper project showed that probabilistic type inference can usefully infer types, just from local lexical context. Types follow a fat-tailed Zipfian distribution, and DeepTyper does not handle infrequent types well. Our Typilus work, published at PLDI, uses meta-learning to infer infrequent types.
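To make the notion of a redundant comment concrete, the sketch below scores how much of a comment's vocabulary already appears in the code it annotates. It is a toy lexical-overlap heuristic written for illustration only, not the technique Lucid published; the function names, the stop-word list, and the 0.8 threshold are all assumptions.

```python
import re
from typing import Set

# Illustrative stop-word list (an assumption, not taken from Lucid).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "this", "is"}


def subtokens(text: str) -> Set[str]:
    """Split text into lowercase words, breaking camelCase and snake_case."""
    return {word.lower() for word in re.findall(r"[A-Za-z][a-z]*", text)}


def redundancy(comment: str, code: str) -> float:
    """Fraction of the comment's content words that already occur in the code."""
    comment_words = subtokens(comment) - STOP_WORDS
    if not comment_words:
        return 0.0
    return len(comment_words & subtokens(code)) / len(comment_words)


def is_redundant(comment: str, code: str, threshold: float = 0.8) -> bool:
    """Flag a comment whose words mostly repeat the code (threshold is illustrative)."""
    return redundancy(comment, code) >= threshold


if __name__ == "__main__":
    print(is_redundant("# increment the counter", "counter.increment()"))
    print(is_redundant("# retry because the server drops idle connections",
                       "client.reconnect()"))
```

A comment that only restates identifier names scores close to 1, while a comment that explains intent shares few words with the code and scores close to 0.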

Lucid was a productive project, leading to eight papers published at top-tier venues in software engineering and programming languages: FSE, PLDI, ICSE, TSE, CSUR, and ICSE's NIER. Two further Lucid papers, both aiming to speed up and smooth continuous integration workflows, are under review.
Exploitation Route Titans of the software industry, namely Google, Microsoft, Facebook, and Amazon, have all made substantial investments into applying machine learning and natural language processing techniques to improve developer productivity. Their teams are currently working on commercialising tools and techniques, some inspired by approaches and perspectives pioneered by Lucid, that exploit dual channel constraints. These approaches tackle a range of software engineering problems, including autocompletion, commit untangling, code search, probabilistic type inference for dynamically typed languages, and program synthesis.
Sectors Digital/Communication/Information Technologies (including Software), Financial Services and Management Consultancy, Security and Diplomacy

URL http://ttendency.cs.ucl.ac.uk/lucid
 
Description Lucid advanced the state of the art by developing new analyses that exploit the interconnection between natural language and formal notation in source code to find uninformative names, stale comments, and bugs that manifest as discrepancies between the channels. Lucid was part of the worldwide research agenda on language for code, or AI for code. Lucid's publications and its researchers' interactions with the large AI for code community contributed to the momentum of this larger research effort. AI for code is transforming software engineering and society as a whole, in the form of large language models, like GPT-4 and GitHub Copilot. So, while Lucid has not had direct industrial impact, it has indirectly contributed to this vast and ongoing transformation.
First Year Of Impact 2020
Sector Digital/Communication/Information Technologies (including Software), Security and Diplomacy
Impact Types Societal, Economic

 
Description Committee Member in Program Committee within the ECOOP Research Papers-track
Geographic Reach Multiple continents/international 
Policy Influence Type Participation in a guidance/advisory committee
URL https://2018.ecoop.org/committee/ecoop-2018-research-track-program-committee
 
Description Committee Member in Program Committee within the ISSTA Technical Papers-track
Geographic Reach Multiple continents/international 
Policy Influence Type Participation in a guidance/advisory committee
URL https://2018.ecoop.org/committee/issta-2018-technical-papers-program-committee
 
Description Earl Barr on National Science Foundation panel for Software and Hardware Foundations (SHF) Program
Geographic Reach North America 
Policy Influence Type Participation in a guidance/advisory committee
URL https://nsf.gov/
 
Description Collaboration with Miltos Allamanis, Microsoft Research Cambridge 
Organisation Microsoft Research
Department Microsoft Research Cambridge
Country United Kingdom 
Sector Private 
PI Contribution We are introducing new conceptual types in programs by studying how identifiers flow into one another through assignments.
Collaborator Contribution Miltos is helping us learn new types for these identifiers by learning over the data on flows across assignments.
Impact No outputs yet
Start Year 2017
 
Title Flexeme 
Description This project provides several implementations for commit untangling and proposes a new representation of git patches by projecting the patch onto a program dependence graph (PDG).
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Advancing the state of the art in commit untangling via two new structures, the delta nameflow graph and delta PDG. 
URL https://github.com/PPPI/Flexeme
 
Title POSIT 
Description This is a project to simultaneously provide language ID tags and part-of-speech or compiler tags (the latter taken from Clang compilations of C and C++ code). The corpus is either code with comments, annotated with Clang compiler information and universal PoS tags for English, or StackOverflow. For StackOverflow we start from the data dump (which can be found here), and use a frequency-based heuristic to annotate code snippets. The frequency data is made available under ./data/corpora/SO/frequency_map.json. To generate training data from StackOverflow, please use the scripts under ./src/preprocessor together with the Posts.xml file from the data dump linked above. Our model is a BiLSTM neural network with a linear CRF and Viterbi decoding to go from LSTM state to tags or language IDs. We use the same LSTM network and change only the CRF on top for the two tasks. We linearly combine the two objectives in the loss, with a slightly smaller weight given to language IDs. We do not condition tag output on language IDs in this version of the model.
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Enabled the study of mixed languages in software artefacts, advancing the state of the art.
URL https://github.com/PPPI/POSIT
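
The POSIT description above mentions Viterbi decoding over CRF scores and a linearly weighted combination of two losses. The following is a minimal sketch of both ideas in plain Python, not POSIT's actual code: the tag names, the score values, the function names `viterbi` and `combined_loss`, and the weight 0.9 are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence


def viterbi(emissions: Sequence[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain model.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    tags = list(emissions[0])
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def combined_loss(tag_loss: float, lang_id_loss: float,
                  lang_id_weight: float = 0.9) -> float:
    """Linear combination of the two objectives (the weight is illustrative)."""
    return tag_loss + lang_id_weight * lang_id_loss


if __name__ == "__main__":
    emissions = [
        {"english": 2.0, "code": 0.0},
        {"english": 0.5, "code": 0.6},
        {"english": 0.0, "code": 2.0},
    ]
    transitions = {
        "english": {"english": 0.5, "code": -0.5},
        "code": {"english": -0.5, "code": 0.5},
    }
    print(viterbi(emissions, transitions))
```

In this sketch the transition scores reward staying in the same language, so the decoder prefers contiguous runs of English or code tokens over rapid switching.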
 
Description Dr Earl Barr presentation at 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact AIFORSE Conference 2017, the first global conference on Artificial Intelligence (AI) for Software Engineering (SE), will be held on the 10th of November 2017 in Barcelona.
The main purpose of the conference is to build a bridge between the most significant players of the software engineering industry on one side and the most advanced adopters of cutting-edge applied artificial intelligence technologies on the other.
Leaders and experts of software engineering and pioneering innovators of artificial intelligence in SE will meet on the communication stage to accelerate development and increase the efficiency of operations in the industry.
The programme offers 12 hours of networking, discussions, and unique reports from the 10 best speakers in the industry. The speakers are representatives of companies from around the world who already apply AI to software engineering. They will not only disclose the tools that help solve problems faster and decrease costs, but will also define the development vector of the software engineering industry.
Year(s) Of Engagement Activity 2017
URL http://aiforse.org/conference-2017
 
Description Earl Barr - talk at Semmle 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Earl Barr presents his paper To Type or Not to Type: Quantifying Detectable Bugs in JavaScript at Semmle
http://earlbarr.com/publications/typestudy.pdf
Year(s) Of Engagement Activity 2018
URL https://semmle.com/
 
Description Earl Barr attending invitation-only workshop on software security, National Cyber Security Centre
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Invitation-only workshop to discuss the requirements for an international collaborative effort and infrastructure to support large-scale empirical research on software security. The workshop was held in London on the 17th of December and was supported by the National Cyber Security Centre.

The collective goal for the day was to develop a shared understanding of the challenges faced by research on software code analysis for cybersecurity and outline a roadmap for an international testbed infrastructure for large-scale experimental research on software security.
Year(s) Of Engagement Activity 2018
 
Description Earl Barr invited speaker CHOOSE forum in Switzerland 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The CHOOSE Forum 2018 is organized by the Zurich Empirical Software engineering Team (ZEST) at the University of Zurich, on behalf of CHOOSE.
Earl Barr presented his paper Bimodal Software Engineering
Year(s) Of Engagement Activity 2018
URL https://choose.swissinformatics.org/events/choose-forum-2018-software-engineering-and-machine-learni...
 
Description Earl Barr presents Bimodal Software Engineering at FLOC 2018: FEDERATED LOGIC CONFERENCE 2018, MLP PROGRAM 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Earl Barr presents Bimodal Software Engineering at FLOC 2018: FEDERATED LOGIC CONFERENCE 2018 in the MLP PROGRAM
Year(s) Of Engagement Activity 2018
URL https://easychair.org/smart-program/FLoC2018/MLP-program.html
 
Description Earl Barr presents his paper Mining Semantic Loop Idioms @ FSE 2018 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Earl Barr presents his paper Mining Semantic Loop Idioms @ FSE 2018 in the Journal-First track
Sun 4 - Fri 9 November 2018 Lake Buena Vista, Florida, United States
Year(s) Of Engagement Activity 2018
URL https://2018.fseconference.org/event/fse-2018-journal-first-mining-semantic-loop-idioms
 
Description Earl Barr research visit to Monash University and University of Adelaide, Australia 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Earl Barr visited Monash University and University of Adelaide, Australia. He presented Bimodal Software Engineering at both Universities.
Tuesday (Oct. 2nd) - Monash University
09:00-10:00 KEYNOTE (General Seminar, Earl Barr, UCL) - Bimodal Software Engineering
https://www.monash.edu/it/our-research/research-seminars/events/events/2018/earl-barr-bimodal-software-engineering
Year(s) Of Engagement Activity 2018
URL https://www.monash.edu/it/our-research/research-seminars/events/events/2018/earl-barr-bimodal-softwa...
 
Description Huawei Workshop - Dr Earl Barr 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact Automated Programming Workshop funded by Huawei on 11th Dec 2017
Professor Daniel Kröning: University of Oxford/DiffBlue
Professor Earl Barr: UCL
Professor Charles Sutton: University of Edinburgh
Professor Hong Zhu: Oxford Brookes University
Dr. Ian Bayley: Oxford Brookes University
Professor Mark Harman: UCL/Facebook
Professor Peter O'Hearn: UCL/Facebook
Professor Alastair F. Donaldson: Imperial College London
Professor Philippa Gardner: Imperial College London
Dr. David White: UCL
Dr. David Kelly: UCL
Dr. Zheng Gao: UCL
Mr. Laifa Zhang: President of RDCC (R&D Competence Center)
Mr. Tony Chang: Chief Scientist, VP of RDCC in US
Mr. Ni Huang (Eric): Senior Director of RDCC Technology Planning Dept.
Mr. Xuewen Gong (Sean): Director of RDCC Technology Cooperation Dept.
Professor Qianxiang Wang: Director of the software analysis lab of Huawei, vice chair of ACM CSOFT (China chapter of SIGSOFT), secretary-general of CCF TCSE (Technical Committee of Software Engineering, China Computer Federation).
Mr. Michael Hill-King: Collaboration Director, Huawei Cambridge Research Centre.
Mr. Duo Wu: Collaboration Manager, Huawei Cambridge Research Centre.
Miss Yuncong Zou: Collaboration Assistant, Huawei Cambridge Research Centre.
Year(s) Of Engagement Activity 2017
 
Description Participation in an activity, workshop or similar - Dagstuhl - SE4ML 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Multiple research disciplines, from cognitive sciences to biology, finance, physics, and the social sciences, as well as many companies, believe that data-driven and intelligent solutions are necessary. Unfortunately, current artificial intelligence (AI) and machine learning (ML) technologies are not sufficiently democratized - building complex AI and ML systems requires deep expertise in computer science and extensive programming skills to work with various machine reasoning and learning techniques at a rather low level of abstraction. It also requires extensive trial and error exploration for model selection, data cleaning, feature selection, and parameter tuning. Moreover, there is a lack of theoretical understanding that could be used to abstract away these subtleties. Conventional programming languages and software engineering paradigms have also not been designed to address challenges faced by AI and ML practitioners.

The goal of this Dagstuhl Seminar is to bring two rather disjoint communities together, software engineering and programming languages (PL/SE) and artificial intelligence and machine learning (AI-ML) to discuss open problems on how to improve the productivity of data scientists, software engineers, and AI-ML practitioners in industry. The issues addressed in the seminar will include the following:

What challenges do people building AI-ML-based systems face?
How do we re-think software development tools such as debugging, testing, and verification tools for complex AI-ML-based systems?
How do we reason about correctness, explainability, repeatability, traceability, and fairness while building AI-ML pipelines?
What are innovative paradigms that seamlessly embed, reuse, and chain models, while abstracting away most low-level details?
The topics of the seminar address pressing demands from industry; the research questions are very relevant for practical software systems development that leverages artificial intelligence (AI) and machine learning (ML). In 2016, companies invested $26-39 billion in AI and McKinsey predicts that investments will be growing over the next few years. Any AI- and ML-based systems will need to be built, tested, and maintained, yet there is a lack of established engineering practices in industry for such systems because they are fundamentally different from traditional software systems. Ideas brainstormed in the seminar will contribute to a new suite of ML-relevant software development tools such as debuggers, testers and verification tools that increase developer productivity in building complex AI systems. Furthermore, we will also discuss new innovative AI and ML abstractions that improve programmability in designing intelligent systems.
Year(s) Of Engagement Activity 2020
URL https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=20091
 
Description Talk at source{d} paper reading club - Madrid, 16 November 2018
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Deep Learning for Programming Language Type Inference
On Friday, November 16th, as part of source{d} paper reading club [1], we are going to talk about a paper that was recently published at FSE'18: Deep Learning Type Inference [2].

ABSTRACT
Dynamically typed languages such as JavaScript and Python are
increasingly popular, yet static typing has not been totally eclipsed:
Python now supports type annotations and languages like TypeScript
offer a middle-ground for JavaScript: a strict superset of
JavaScript, to which it transpiles, coupled with a type system that
permits partially typed programs. However, static typing has a cost:
adding annotations, reading the added syntax, and wrestling with
the type system to fix type errors. Type inference can ease the
transition to more statically typed code and unlock the benefits of
richer compile-time information, but is limited in languages like
JavaScript as it cannot soundly handle duck-typing or runtime evaluation
via eval. We propose DeepTyper, a deep learning model
that understands which types naturally occur in certain contexts
and relations and can provide type suggestions, which can often
be verified by the type checker, even if it could not infer the type
initially.
Year(s) Of Engagement Activity 2018
URL https://github.com/src-d/reading-club
 
Description The 55th CREST Open Workshop - Bimodal Program Analysis 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Overview:

Software is bimodal: it interlinks two channels, an algorithmic channel aimed at devices and a natural language channel aimed at developers. Most research has focused on one channel or the other, not their interplay. Simultaneously considering both channels promises a new source of constraints for improving program analysis and software engineering tools. For example, names in program text can be exploited to refine a type lattice. The CREST Open Workshop on PL and NLP will explore how to identify and exploit these cross-channel connections.

Organisers:

Earl Barr, CREST Centre, SSE Group, Department of Computer Science, UCL, UK

Santanu Dash, CREST Centre, SSE Group, Department of Computer Science, UCL, UK
Year(s) Of Engagement Activity 2017
URL http://crest.cs.ucl.ac.uk/cow/55/
 
Description The 60th CREST Open Workshop - Those were the DAASE 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Overview:

DAASE has advanced the state of the art in numerous directions: novel, principled technique for handling class imbalance, optimising energy consumption, automated program repair, automatically generating product roadmaps, to name a few.

DAASE has achieved breakthroughs, including automated software transplantation, the first approach for transplanting code that dynamically adapts it for a new context, and a human-competitive multi-objective software effort estimator that balances accuracy against variance, both of which won Hummies at GECCO. DAASE has pioneered a new field of research called genetic improvement and produced award-winning work on fitness landscape analysis and visualisation.

DAASE has spawned a number of start-ups, most notably Sapienz which Facebook acquired and which now tests and automatically repairs code at Internet scale. Heathrow's plane scheduling now relies on a bespoke optimisation algorithm devised by DAASE researchers. Automated software repair using genetic improvement is also now a part of Janus Manager, a management software for rehabilitation centres in Iceland.

Join us to review and celebrate these accomplishments, and discuss how to carry them forward.

Day 1 - Monday 3 Dec

10:45 - Pastries

11:15 - Introductions - Earl Barr

11:30 - Bill Langdon, CREST Centre, SSE Group, Department of Computer Science, UCL, UK

Genetic Improvement by Evolving Program Data

12:00 - Darrell Whitley, Colorado State University, USA

Optimal Neuron Selection and Ensemble Based Learning

12:30 - Lunch

13:30 - Mark Harman, CREST Centre, SSE Group, Department of Computer Science, UCL, UK and Facebook

Deploying Search Based Software Engineering with Sapienz at Facebook

14:00 - John Woodward, School of Electronic Engineering & Computer Science, Queen Mary University of London, UK

Genetic Improvement in a Live System

14:30 - Earl Barr, CREST Centre, SSE Group, Department of Computer Science, UCL, UK

Bimodal software engineering

15:00 - Refreshments

15:30 - John Clark, Department of Computer Science, University of Sheffield, UK

Pushing the searchboat out: from quantum software simulation to digital twinning

16:00 - Jeff Kramer, Department of Computing, Imperial College London, UK

The challenge of change

16:30 - Close of day

Day 2 - Tuesday 4 Dec

11:00 - Pastries

11:30 - Justyna Petke, CREST Centre, SSE Group, Department of Computer Science, UCL, UK

Specialising Software Using Genetic Improvement and Code Transplantation

12:00 - Leandro Minku, School of Computer Science, University of Birmingham, UK

A Novel Automated Approach for Software Effort Estimation Based on Data Augmentation

12:30 - Lunch

13:30 - Gabriela Ochoa, Computing Science and Mathematics, University of Stirling, UK

LON Maps: Recent Advances in Local Optima Networks

14:00 - Erwin Pesch, Faculty of Economics and Business Administration, University in Siegen, Germany

Preventing Crane Interferences at Automated Container Terminals

14:30 - David R. White, Department of Computer Science, University of Sheffield, UK

Gin: a Tool for Program Improvement

15:00 - Refreshments

15:30 - Closing remarks

16:00 - Close of day
Year(s) Of Engagement Activity 2019
URL http://crest.cs.ucl.ac.uk/cow/60/
 
Description Visit to Luxembourg University 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact 8 and 9 April 2019
Research visit to Jacques Klein at Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
Year(s) Of Engagement Activity 2019
URL https://wwwen.uni.lu/research/fstc/computer_science_and_communications_research_unit/members/jacques...