LUCID: Clearer Software by Integrating Natural Language Analysis into Software Engineering

Lead Research Organisation: University College London

Department Name: Computer Science

Abstract

Developers spend most of their time maintaining code, with little tool support.
To maintain code, one must understand it. Clear code is easier to read and
understand, and therefore less expensive and risky to evolve and maintain; it
is also notoriously difficult to write. We will help developers write clearer
code to speed maintenance and increase developer productivity. Source code
unites two channels - the programming language and natural language - to
describe algorithms. LUCID will advance the state of the art in software
engineering by developing new analyses that exploit the interconnections
between these channels to find uninformative names, stale comments, and bugs
that manifest as discrepancies between the two channels.

Planned Impact

LUCID attacks a core software engineering concern; it will build tools that
help developers to maintain software more quickly, with less risk and less
cost. Thus, the work in this proposal has the potential for enormous economic
benefits in the long term. The UK has one of the strongest software sectors in
Europe. For example, in 2008 the UK accounted for 25% of European software
companies. By making software maintenance cheaper, this project will benefit
companies that sell software by lowering the costs of evolving their code and
releasing new versions. These tools will also benefit the many companies that
evolve and maintain custom software systems for their own in-house use, by
lowering the cost of these infrastructural projects.

Funded Value:

£337,410

Funded Period:

Dec 16 - Jan 20

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/P005659/1

Principal Investigator:

Earl Barr

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (25%)

Fundamentals of Computing (25%)

Software Engineering (50%)

Organisations

People	ORCID iD
Earl Barr (Principal Investigator)	http://orcid.org/0000-0003-0771-7891

Publications

Author Name Title Publication Date Published

Hellendoorn V (2018) Deep learning type inference

Allamanis M (2018) A Survey of Machine Learning for Big Code and Naturalness in ACM Computing Surveys

Dash S (2018) RefiNym: using names to refine types

Pârtachi P (2020) POSIT

Partachi, P-P (2020) POSIT: Simultaneously Tagging Natural and Programming Languages

Louis A (2020) Where should I comment my code?

Casalnuovo C (2020) A theory of dual channel constraints

Pârțachi P (2020) Flexeme: untangling commits using lexical flows

Allamanis M (2020) Typilus: neural type hints

Menéndez HD (2021) Getting Ahead of the Arms Race: Hothousing the Coevolution of VirusTotal with a Packer. in Entropy (Basel, Switzerland)

Key Findings
Impact Summary
Policy Influence
Collaboration
Software and Technical Products
Engagement Activities


Description	Source code combines two channels: a formal channel that specifies an algorithm for a computer and a natural language channel that explains that algorithm to developers. Both channels have been extensively studied in isolation. Lucid established a new line of research explicitly focused on how they interact. Some of these interactions form dual channel constraints; Lucid showed how to exploit these constraints to solve software engineering problems. Among its achievements, Lucid tackled the problem of overloading built-in types rather than defining problem-specific types (RefiNym, FSE'18); advanced the state of the art in comment placement and quality (Detecting Redundant Comments, arXiv 2019, and Where Should I Comment my Code?, ICSE NIER'20); and showed the utility of probabilistic type inference.
Good comments speed understanding and maintaining code. One class of bad comments redundantly repeats the code. Lucid built a technique and a tool to detect such comments. Lucid also surfaced and investigated the problem that precedes writing any specific comment: the question of where to add a comment. Explicitly handling this problem promises to ease the more important and much harder problem of generating comments. Lucid built an effective tool to solve this problem and contributed a data set for researchers to use.
Dynamic languages, like JavaScript and Python, are well-suited to writing prototypes, but are expensive to maintain, partly because they lack type annotations. Therefore, many companies, including Google, Facebook, and Microsoft, have invested in adding static typing to them. Static typing, however, requires developers to add type annotations, which can be a monumental effort for a large codebase. Type inference, the traditional solution to this problem, cannot soundly deduce the precise type of many expressions in dynamic languages.
Lucid's DeepTyper project showed that probabilistic type inference can usefully infer types from local lexical context alone. Types follow a fat-tailed Zipfian distribution, and DeepTyper does not handle infrequent types well. Our Typilus work, published at PLDI, uses meta-learning to infer infrequent types.
Lucid was a productive project, leading to eight papers published at top-tier venues in software engineering and programming languages: FSE, PLDI, ICSE, TSE, CSUR, and ICSE's NIER. Two further Lucid papers, both aiming to speed and smooth continuous integration workflows, are under review.
Exploitation Route	Titans of the software industry, namely Google, Microsoft, Facebook, and Amazon, have all made substantial investments into applying machine learning and natural language processing techniques to improve developer productivity. Their teams are currently working on commercialising tools and techniques, some inspired by approaches and perspectives pioneered by Lucid, that exploit dual channel constraints. These approaches tackle a range of software engineering problems, including autocompletion, commit untangling, code search, probabilistic type inference for dynamically typed languages, and program synthesis.
Sectors	Digital/Communication/Information Technologies (including Software), Financial Services and Management Consultancy, Security and Diplomacy
URL	http://ttendency.cs.ucl.ac.uk/lucid
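
The notion of a comment that "redundantly repeats the code" can be illustrated with a very small heuristic. The sketch below is not Lucid's technique or tool; it is a toy approximation that assumes a comment is redundant when nearly all of its content words already appear as identifier subtokens in the code it annotates. The function names, the stop-word list, and the 0.8 threshold are all hypothetical.

```python
import re

# Hypothetical stop-word list; a real tool would need a far better one.
STOP_WORDS = {"the", "a", "an", "of", "to", "by", "is", "this", "and"}


def subtokens(text):
    """Split text into lowercase word-like subtokens.

    Identifiers such as `userCount` or `user_count` become {"user", "count"}.
    """
    # Insert a space at camelCase boundaries, then pull out alphabetic runs.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return {t.lower() for t in re.findall(r"[A-Za-z]+", spaced)}


def redundancy(comment, code):
    """Fraction of the comment's content words that already occur in the code."""
    words = subtokens(comment) - STOP_WORDS
    if not words:
        return 0.0
    return len(words & subtokens(code)) / len(words)


def is_redundant(comment, code, threshold=0.8):
    """Flag a comment whose content is (almost) entirely present in the code."""
    return redundancy(comment, code) >= threshold


# A comment that restates the code versus one that explains intent.
restating = is_redundant("increment the counter", "counter.increment()")
explaining = is_redundant("retry because the server rate-limits bursts",
                          "time.sleep(backoff)")
```

Lexical overlap is only a crude proxy for redundancy: a comment can share every word with the code and still add information, so a heuristic like this would at best shortlist candidates for review.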


Description	Lucid advanced the state of the art by developing new analyses that exploit the interconnection between natural language and formal notation in source code to find uninformative names, stale comments, and bugs that manifest as discrepancies between the channels. Lucid was a research project within the worldwide research agenda of language for code, or AI for code. Lucid's publications and its researchers' interactions with the large AI for code community contributed to the momentum of this larger research effort. AI4Code is transforming software engineering and society as a whole, in the form of large language models, like GPT-4 and GitHub Copilot. So, while Lucid has not had direct industrial impact, it has indirectly contributed to this vast and ongoing transformation.
First Year Of Impact	2020
Sector	Digital/Communication/Information Technologies (including Software),Security and Diplomacy
Impact Types	Societal, Economic


Description	Committee Member in Program Committee within the ECOOP Research Papers-track
Geographic Reach	Multiple continents/international
Policy Influence Type	Participation in a guidance/advisory committee
URL	https://2018.ecoop.org/committee/ecoop-2018-research-track-program-committee


Description	Committee Member in Program Committee within the ISSTA Technical Papers-track
Geographic Reach	Multiple continents/international
Policy Influence Type	Participation in a guidance/advisory committee
URL	https://2018.ecoop.org/committee/issta-2018-technical-papers-program-committee


Description	Earl Barr on National Science Foundation panel for Software and Hardware Foundations (SHF) Program
Geographic Reach	North America
Policy Influence Type	Participation in a guidance/advisory committee
URL	https://nsf.gov/


Description	Collaboration with Miltos Allamanis, Microsoft Research Cambridge
Organisation	Microsoft Research
Department	Microsoft Research Cambridge
Country	United Kingdom
Sector	Private
PI Contribution	We are introducing new conceptual types in programs by studying how identifiers flow to each other through assignments in programs.
Collaborator Contribution	Miltos is helping us learn new types for these identifiers by learning over the data on flows across assignments
Impact	No outputs yet
Start Year	2017


Title	Flexeme
Description	This project provides several implementations for commit untangling and proposes a new representation of git patches by projecting the patch onto a PDG.
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	Advancing the state of the art in commit untangling via two new structures, the delta nameflow graph and delta PDG.
URL	https://github.com/PPPI/Flexeme


Title	POSIT
Description	This is a project to simultaneously provide language ID tags and Part-Of-Speech or compiler tags (which are taken from CLANG compilations of C and C++ code). The corpus is either code with comments annotated with CLANG compiler information and universal PoS tags for English, or StackOverflow. For StackOverflow we start from the data dump (which can be found here), and use a frequency-based heuristic to annotate code snippets. The frequency data is made available under ./data/corpora/SO/frequency_map.json. To generate training data from StackOverflow, please use the scripts under ./src/preprocessor together with the Posts.xml file from the data dump linked above. Our model is a BiLSTM neural network with a linear CRF and Viterbi decoding to go from LSTM state to tags or language IDs. We use the same LSTM network and change only the CRF on top for the two tasks. We linearly combine the two objectives in the loss, with a slightly smaller weight given to language IDs. We do not condition tag output on language IDs in this version of the model.
Type Of Technology	Software
Year Produced	2020
Open Source License?	Yes
Impact	Enabled the study of mixed languages in software artefacts, advancing the state of the art.
URL	https://github.com/PPPI/POSIT
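
The decoding step described above (per-position scores from the network, transition scores from a linear-chain CRF, then Viterbi to pick the best tag sequence) can be sketched in plain Python. This is not code from the POSIT repository: the function names, the score layout, and the example numbers are assumptions, and the 0.9 weight in `combined_loss` merely stands in for the "slightly smaller weight" the description mentions without stating a value.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence for a linear-chain model.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    n_tags = len(transitions)
    best = list(emissions[0])  # best[j]: best score of any path ending in tag j
    back = []                  # back[t-1][j]: previous tag on that best path
    for scores in emissions[1:]:
        prev, best, pointers = best, [], []
        for j in range(n_tags):
            i = max(range(n_tags), key=lambda k: prev[k] + transitions[k][j])
            best.append(prev[i] + transitions[i][j] + scores[j])
            pointers.append(i)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(range(n_tags), key=lambda k: best[k])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


def combined_loss(tag_loss, lang_loss, lang_weight=0.9):
    """Linear combination of the two objectives; the weight is illustrative."""
    return tag_loss + lang_weight * lang_loss


# Two tags, three tokens; switching tags costs 1 (hypothetical scores).
emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 4.0]]
transitions = [[0.0, -1.0], [-1.0, 0.0]]
best_path = viterbi(emissions, transitions)
```

Sharing the network and swapping only the CRF layer, as the description says, would mean calling a decoder like this twice on the same emissions source with two different transition matrices, one for tags and one for language IDs.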


Description	Dr Earl Barr presentation at AIFORSE Conference 2017
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	AIFORSE Conference 2017, the first global conference on Artificial Intelligence (AI) for Software Engineering (SE), will be held on the 10th of November 2017 in Barcelona. The main purpose of the conference is to build a bridge between the most significant players in the software engineering industry on one side and the most advanced adopters of cutting-edge applied artificial intelligence technologies on the other. Leaders and experts in software engineering and pioneering innovators of artificial intelligence in SE will meet to accelerate development and increase the efficiency of operations in the industry. The programme comprises 12 hours of networking, discussions, and reports from the 10 best speakers in the industry. The speakers represent companies from around the world that already apply AI to software engineering. They will not only disclose the tools that help solve problems faster and decrease costs, but will also define the development vector of the software engineering industry.
Year(s) Of Engagement Activity	2017
URL	http://aiforse.org/conference-2017


Description	Earl Barr - talk at Semmle
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Earl Barr presents his paper To Type or Not to Type: Quantifying Detectable Bugs in JavaScript at Semmle http://earlbarr.com/publications/typestudy.pdf
Year(s) Of Engagement Activity	2018
URL	https://semmle.com/


Description	Earl Barr attending invitation-only workshop on software security, National Cyber Security Centre
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Industry/Business
Results and Impact	Invitation-only workshop to discuss the requirements for an international collaborative effort and infrastructure to support large-scale empirical research on software security. The workshop was held in London on the 17th of December and was supported by the National Cyber Security Centre. The collective goal for the day was to develop a shared understanding of the challenges faced by research on software code analysis for cybersecurity and to outline a roadmap for an international testbed infrastructure for large-scale experimental research on software security.
Year(s) Of Engagement Activity	2018


Description	Earl Barr invited speaker CHOOSE forum in Switzerland
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The CHOOSE Forum 2018 is organized by the Zurich Empirical Software engineering Team (ZEST) at the University of Zurich, on behalf of CHOOSE. Earl Barr presented his paper Bimodal Software Engineering
Year(s) Of Engagement Activity	2018
URL	https://choose.swissinformatics.org/events/choose-forum-2018-software-engineering-and-machine-learni...


Description	Earl Barr presents Bimodal Software Engineering at FLOC 2018: FEDERATED LOGIC CONFERENCE 2018, MLP PROGRAM
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Earl Barr presents Bimodal Software Engineering at FLOC 2018: FEDERATED LOGIC CONFERENCE 2018 in the MLP PROGRAM
Year(s) Of Engagement Activity	2018
URL	https://easychair.org/smart-program/FLoC2018/MLP-program.html


Description	Earl Barr presents his paper Mining Semantic Loop Idioms @ FSE 2018
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Earl Barr presents his paper Mining Semantic Loop Idioms @ FSE 2018 in the Journal-First track Sun 4 - Fri 9 November 2018 Lake Buena Vista, Florida, United States
Year(s) Of Engagement Activity	2018
URL	https://2018.fseconference.org/event/fse-2018-journal-first-mining-semantic-loop-idioms


Description	Earl Barr research visit to Monash University and University of Adelaide, Australia
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Earl Barr visited Monash University and University of Adelaide, Australia. He presented Bimodal Software Engineering at both Universities. Tuesday (Oct. 2nd) - Monash University 09:00-10:00 KEYNOTE (General Seminar, Earl Barr, UCL) - Bimodal Software Engineering https://www.monash.edu/it/our-research/research-seminars/events/events/2018/earl-barr-bimodal-software-engineering
Year(s) Of Engagement Activity	2018
URL	https://www.monash.edu/it/our-research/research-seminars/events/events/2018/earl-barr-bimodal-softwa...


Description	Huawei Workshop - Dr Earl Barr
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	Automated Programming Workshop funded by Huawei on 11th Dec 2017. Participants: Professor Daniel Kröning, University of Oxford/DiffBlue; Professor Earl Barr, UCL; Professor Charles Sutton, University of Edinburgh; Professor Hong Zhu, Oxford Brookes University; Dr. Ian Bayley, Oxford Brookes University; Professor Mark Harman, UCL/Facebook; Professor Peter O'Hearn, UCL/Facebook; Professor Alastair F. Donaldson, Imperial College London; Professor Philippa Gardner, Imperial College London; Dr. David White, UCL; Dr. David Kelly, UCL; Dr. Zheng Gao, UCL; Mr. Laifa Zhang, President of RDCC (R&D Competence Center); Mr. Tony Chang, Chief Scientist, VP of RDCC in US; Mr. Ni Huang (Eric), Senior Director of RDCC Technology Planning Dept.; Mr. Xuewen Gong (Sean), Director of RDCC Technology Cooperation Dept.; Professor Qianxiang Wang, Director of the software analysis lab of Huawei, vice chair of ACM CSOFT (China chapter of SIGSOFT), secretary-general of CCF TCSE (Technical Committee of Software Engineering, China Computer Federation); Mr. Michael Hill-King, Collaboration Director, Huawei Cambridge Research Centre; Mr. Duo Wu, Collaboration Manager, Huawei Cambridge Research Centre; Miss Yuncong Zou, Collaboration Assistant, Huawei Cambridge Research Centre.
Year(s) Of Engagement Activity	2017


Description	Participation in an activity, workshop or similar - Dagstuhl - SE4ML
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Multiple research disciplines, from cognitive sciences to biology, finance, physics, and the social sciences, as well as many companies, believe that data-driven and intelligent solutions are necessary. Unfortunately, current artificial intelligence (AI) and machine learning (ML) technologies are not sufficiently democratized - building complex AI and ML systems requires deep expertise in computer science and extensive programming skills to work with various machine reasoning and learning techniques at a rather low level of abstraction. It also requires extensive trial and error exploration for model selection, data cleaning, feature selection, and parameter tuning. Moreover, there is a lack of theoretical understanding that could be used to abstract away these subtleties. Conventional programming languages and software engineering paradigms have also not been designed to address challenges faced by AI and ML practitioners. The goal of this Dagstuhl Seminar is to bring two rather disjoint communities together, software engineering and programming languages (PL/SE) and artificial intelligence and machine learning (AI-ML) to discuss open problems on how to improve the productivity of data scientists, software engineers, and AI-ML practitioners in industry. The issues addressed in the seminar will include the following: What challenges do people building AI-ML-based systems face? How do we re-think software development tools such as debugging, testing, and verification tools for complex AI-ML-based systems? How do we reason about correctness, explainability, repeatability, traceability, and fairness, while building AI-ML pipeline? What are innovative paradigms that seamlessly embed, reuse, and chain models, while abstracting away most low-level details? 
The topics of the seminar address pressing demands from industry; the research questions are very relevant for practical software systems development that leverages artificial intelligence (AI) and machine learning (ML). In 2016, companies invested $26-39 billion in AI and McKinsey predicts that investments will be growing over the next few years. Any AI- and ML-based systems will need to be built, tested, and maintained, yet there is a lack of established engineering practices in industry for such systems because they are fundamentally different from traditional software systems. Ideas brainstormed in the seminar will contribute to a new suite of ML-relevant software development tools such as debuggers, testers and verification tools that increase developer productivity in building complex AI systems. Furthermore, we will also discuss new innovative AI and ML abstractions that improve programmability in designing intelligent systems.
Year(s) Of Engagement Activity	2020
URL	https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=20091


Description	Talk at source{d} paper reading club - Madrid, 16 November 2018
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Deep Learning for Programming Language Type Inference On Friday, November 16th, as part of source{d} paper reading club [1], we are going to talk about a paper that was recently published at FSE'18: Deep Learning Type Inference [2]. ABSTRACT Dynamically typed languages such as JavaScript and Python are increasingly popular, yet static typing has not been totally eclipsed: Python now supports type annotations and languages like TypeScript offer a middle-ground for JavaScript: a strict superset of JavaScript, to which it transpiles, coupled with a type system that permits partially typed programs. However, static typing has a cost: adding annotations, reading the added syntax, and wrestling with the type system to fix type errors. Type inference can ease the transition to more statically typed code and unlock the benefits of richer compile-time information, but is limited in languages like JavaScript as it cannot soundly handle duck-typing or runtime evaluation via eval. We propose DeepTyper, a deep learning model that understands which types naturally occur in certain contexts and relations and can provide type suggestions, which can often be verified by the type checker, even if it could not infer the type initially.
Year(s) Of Engagement Activity	2018
URL	https://github.com/src-d/reading-club
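
The idea of learning "which types naturally occur in certain contexts" and offering them as suggestions can be illustrated with a deliberately simple stand-in. The sketch below is not DeepTyper, which is a deep learning model; it assumes a plain frequency table over (identifier, type) pairs and only suggests a type when the estimated probability clears a threshold. The training pairs, function names, and the 0.6 threshold are all made up for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each identifier name is annotated with each type."""
    counts = defaultdict(Counter)
    for name, type_name in examples:
        counts[name][type_name] += 1
    return counts


def suggest(counts, name, min_confidence=0.6):
    """Return (type, probability) for a name, or None when evidence is weak."""
    seen = counts.get(name)
    if not seen:
        return None
    type_name, hits = seen.most_common(1)[0]
    confidence = hits / sum(seen.values())
    return (type_name, confidence) if confidence >= min_confidence else None


# Made-up training pairs of (identifier, annotated type).
corpus = [("count", "number"), ("count", "number"), ("count", "string"),
          ("name", "string"), ("callback", "Function")]
model = train(corpus)
known = suggest(model, "name")
unknown = suggest(model, "total")
```

Because such suggestions are probabilistic rather than sound, they are best treated as candidates to be confirmed, for example by running the type checker on the annotated program, as the abstract above notes.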


Description	The 55th CREST Open Workshop - Bimodal Program Analysis
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Overview: Software is bimodal: it interlinks two channels, an algorithmic channel aimed at devices and a natural language channel aimed at developers. Most research has focused on one channel or the other, not their interplay. Simultaneously considering both channels promises a new source of constraints for improving program analysis and software engineering tools. For example, names in program text can be exploited to refine a type lattice. The CREST Open Workshop on PL and NLP will explore how to identify and exploit these cross-channel connections. Organisers: Earl Barr, CREST Centre, SSE Group, Department of Computer Science, UCL, UK Santanu Dash, CREST Centre, SSE Group, Department of Computer Science, UCL, UK
Year(s) Of Engagement Activity	2017
URL	http://crest.cs.ucl.ac.uk/cow/55/


Description	The 60th CREST Open Workshop - Those were the DAASE
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Overview: DAASE has advanced the state of the art in numerous directions: novel, principled technique for handling class imbalance, optimising energy consumption, automated program repair, automatically generating product roadmaps, to name a few. DAASE has achieved breakthroughs, including automated software transplantation, the first approach for transplanting code that dynamically adapts it for a new context, and a human-competitive multi-objective software effort estimator that balances accuracy against variance, both of which won Hummies at GECCO. DAASE has pioneered a new field of research called genetic improvement and produced award-winning work on fitness landscape analysis and visualisation. DAASE has spawned a number of start-ups, most notably Sapienz which Facebook acquired and which now tests and automatically repairs code at Internet scale. Heathrow's plane scheduling now relies on a bespoke optimisation algorithm devised by DAASE researchers. Automated software repair using genetic improvement is also now a part of Janus Manager, a management software for rehabilitation centres in Iceland. Join us to review and celebrate these accomplishments, and discuss how to carry them forward. 
Day 1 - Monday 3 Dec
10:45 - Pastries
11:15 - Introductions - Earl Barr
11:30 - Bill Langdon, CREST Centre, SSE Group, Department of Computer Science, UCL, UK: Genetic Improvement by Evolving Program Data
12:00 - Darrell Whitley, Colorado State University, USA: Optimal Neuron Selection and Ensemble Based Learning
12:30 - Lunch
13:30 - Mark Harman, CREST Centre, SSE Group, Department of Computer Science, UCL, UK and Facebook: Deploying Search Based Software Engineering with Sapienz at Facebook
14:00 - John Woodward, School of Electronic Engineering & Computer Science, Queen Mary University of London, UK: Genetic Improvement in a Live System
14:30 - Earl Barr, CREST Centre, SSE Group, Department of Computer Science, UCL, UK: Bimodal software engineering
15:00 - Refreshments
15:30 - John Clark, Department of Computer Science, University of Sheffield, UK: Pushing the searchboat out: from quantum software simulation to digital twinning
16:00 - Jeff Kramer, Department of Computing, Imperial College London, UK: The challenge of change
16:30 - Close of day
Day 2 - Tuesday 4 Dec
11:00 - Pastries
11:30 - Justyna Petke, CREST Centre, SSE Group, Department of Computer Science, UCL, UK: Specialising Software Using Genetic Improvement and Code Transplantation
12:00 - Leandro Minku, School of Computer Science, University of Birmingham, UK: A Novel Automated Approach for Software Effort Estimation Based on Data Augmentation
12:30 - Lunch
13:30 - Gabriela Ochoa, Computing Science and Mathematics, University of Stirling, UK: LON Maps: Recent Advances in Local Optima Networks
14:00 - Erwin Pesch, Faculty of Economics and Business Administration, University in Siegen, Germany: Preventing Crane Interferences at Automated Container Terminals
14:30 - David R. White, Department of Computer Science, University of Sheffield, UK: Gin: a Tool for Program Improvement
15:00 - Refreshments
15:30 - Closing remarks
16:00 - Close of day
Year(s) Of Engagement Activity	2019
URL	http://crest.cs.ucl.ac.uk/cow/60/


Description	Visit to Luxembourg University
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	8 + 9 of April 2019 Research visit to Jacques Klein at Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
Year(s) Of Engagement Activity	2019
URL	https://wwwen.uni.lu/research/fstc/computer_science_and_communications_research_unit/members/jacques...
