LUCID: Clearer Software by Integrating Natural Language Analysis into Software Engineering
Lead Research Organisation:
University College London
Department Name: Computer Science
Abstract
Developers spend most of their time maintaining code, with little tool support.
To maintain code, one must understand it. Clear code is easier to read and
understand, and therefore less expensive and risky to evolve and maintain; it
is also notoriously difficult to write. We will help developers write clearer
code to speed maintenance, and increase developer productivity. Source code
unites two channels - the programming language and natural language - to
describe algorithms. LUCID will advance the state of the art in software
engineering by developing new analyses that exploit the interconnections
between these channels to find uninformative names, stale comments, and bugs
that manifest as discrepancies between the two channels.
Planned Impact
LUCID attacks a core software engineering concern; it will build tools that
help developers to maintain software more quickly, with less risk and less
cost. Thus, the work in this proposal has the potential for enormous economic
benefits in the long term. The UK has one of the strongest software sectors in
Europe. For example, in 2008 the UK accounted for 25% of European software
companies. By making software maintenance cheaper, this project will benefit
companies that sell software by lowering the costs of evolving their code and
releasing new versions. These tools will also benefit the many companies that
evolve and maintain custom software systems for their own in house use, by
lowering the cost of these infrastructural projects.
Publications
Hellendoorn V
(2018)
Deep learning type inference
Allamanis M
(2018)
A Survey of Machine Learning for Big Code and Naturalness
in ACM Computing Surveys
Dash S
(2018)
RefiNym: using names to refine types
Pârtachi P
(2020)
POSIT
Partachi, P-P
(2020)
POSIT: Simultaneously Tagging Natural and Programming Languages
Louis A
(2020)
Where should I comment my code?
Casalnuovo C
(2020)
A theory of dual channel constraints
Pârțachi P
(2020)
Flexeme: untangling commits using lexical flows
Allamanis M
(2020)
Typilus: neural type hints
Menéndez HD
(2021)
Getting Ahead of the Arms Race: Hothousing the Coevolution of VirusTotal with a Packer.
in Entropy (Basel, Switzerland)
Description | Source code combines two channels: a formal channel that specifies an algorithm for a computer and a natural language channel that explains that algorithm to developers. Both channels have been extensively studied in isolation. Lucid established a new line of research explicitly focused on how they interact. Some of these interactions form dual channel constraints; Lucid showed how to exploit these constraints to solve software engineering problems. Among its achievements, Lucid tackled the problem of overloading builtin types rather than defining problem-specific types (RefiNym, FSE'18); advanced the state of the art in comment placement and quality (Detecting Redundant Comments, arXiv 2019, and Where Should I Comment my Code?, ICSE NIER'20); and showed the utility of probabilistic type inference. Good comments speed understanding and maintaining code. One class of bad comments consists of those that redundantly repeat the code. Lucid built a technique and a tool to detect such comments. Lucid also surfaced and investigated the problem that precedes writing any specific comment: the question of where to add a comment. Explicitly handling this problem promises to ease the more important and much harder problem of generating comments. Lucid built an effective tool to solve this problem and contributed a data set for researchers to use. Dynamic languages, like JavaScript and Python, are well-suited to writing prototypes, but are expensive to maintain, partly because they lack type annotations. Therefore, many companies, including Google, Facebook, and Microsoft, have invested in adding static typing to them. Static typing, however, requires developers to add type annotations, which can be a monumental effort for a large codebase. Type inference, the traditional solution to this problem, cannot soundly deduce the precise type of many expressions in dynamic languages.
Lucid's DeepTyper project showed that probabilistic type inference can usefully infer types from local lexical context alone. Types follow a fat-tailed Zipfian distribution, and DeepTyper does not handle infrequent types well. Our Typilus work, published at PLDI, uses metalearning to infer infrequent types. Lucid was a productive project, leading to eight papers published at top-tier venues in software engineering and programming languages: FSE, PLDI, ICSE, TSE, CSUR, and ICSE's NIER. Two further Lucid papers, both aiming to speed and smooth continuous integration workflows, are under review. |
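To make the idea of a comment that "redundantly repeats the code" concrete, here is a minimal Python sketch. It is not Lucid's technique or tool, whose method is not described in this record. It is a simple word-overlap heuristic, and the names `redundancy_score`, `is_redundant`, the stopword list, and the 0.8 threshold are all assumptions made for illustration.

```python
import re
from typing import Set

# Hypothetical stopword list; a real tool would need something richer.
STOPWORDS: Set[str] = {"the", "a", "an", "to", "of", "by", "and", "is", "this"}


def words(text: str) -> Set[str]:
    """Split text into lowercase words, breaking camelCase and snake_case."""
    return {w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text)}


def redundancy_score(comment: str, code: str) -> float:
    """Fraction of the comment's content words that already occur in the code."""
    content = words(comment) - STOPWORDS
    if not content:
        return 0.0
    return len(content & words(code)) / len(content)


def is_redundant(comment: str, code: str, threshold: float = 0.8) -> bool:
    """Flag a comment whose words are almost all present in the code itself."""
    return redundancy_score(comment, code) >= threshold
```

Under this heuristic, the comment "increment the counter" on `counter.increment()` is flagged because every content word already appears in the code, while a comment that explains intent, such as "retry because the server drops idle connections" on `connect(host)`, shares no words with the code and is not flagged.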
Exploitation Route | Titans of the software industry, namely Google, Microsoft, Facebook, and Amazon, have all made substantial investments into applying machine learning and natural language processing techniques to improve developer productivity. Their teams are currently working on commercialising tools and techniques, some inspired by approaches and perspectives pioneered by Lucid, that exploit dual channel constraints. These approaches tackle a range of software engineering problems, including autocompletion, commit untangling, code search, probabilistic type inference for dynamically typed languages, and program synthesis. |
Sectors | Digital/Communication/Information Technologies (including Software) Financial Services and Management Consultancy Security and Diplomacy |
URL | http://ttendency.cs.ucl.ac.uk/lucid |
Description | Lucid advanced the state of the art by developing new analyses that exploit the interconnection between natural language and formal notation in source code to find uninformative names, stale comments, and bugs that manifest as discrepancies between the channels. Lucid was a research project in the worldwide research agenda of language for code, or AI for code. Lucid's publications and its researchers' interactions with the large AI for code community contributed to the momentum of this larger research effort. AI4Code is transforming software engineering and society as a whole, in the form of large language models, like GPT-4 and GitHub Copilot. So, while Lucid has not had direct industrial impact, it has indirectly contributed to this vast and ongoing transformation. |
First Year Of Impact | 2020 |
Sector | Digital/Communication/Information Technologies (including Software),Security and Diplomacy |
Impact Types | Societal Economic |
Description | Committee Member in Program Committee within the ECOOP Research Papers-track |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Participation in a guidance/advisory committee |
URL | https://2018.ecoop.org/committee/ecoop-2018-research-track-program-committee |
Description | Committee Member in Program Committee within the ISSTA Technical Papers-track |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Participation in a guidance/advisory committee |
URL | https://2018.ecoop.org/committee/issta-2018-technical-papers-program-committee |
Description | Earl Barr on National Science Foundation panel for Software and Hardware Foundations (SHF) Program |
Geographic Reach | North America |
Policy Influence Type | Participation in a guidance/advisory committee |
URL | https://nsf.gov/ |
Description | Collaboration with Miltos Allamanis, Microsoft Research Cambridge |
Organisation | Microsoft Research |
Department | Microsoft Research Cambridge |
Country | United Kingdom |
Sector | Private |
PI Contribution | We are introducing new conceptual types in programs by studying how identifiers flow to each other through assignments in programs. |
Collaborator Contribution | Miltos is helping us learn new types for these identifiers by learning over the data on flows across assignments |
Impact | No outputs yet |
Start Year | 2017 |
Title | Flexeme |
Description | This project provides several implementations for commit untangling and proposes a new representation of git patches by projecting the patch onto a PDG. |
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | Advancing the state of the art in commit untangling via two new structures, the delta nameflow graph and delta PDG. |
URL | https://github.com/PPPI/Flexeme |
Title | POSIT |
Description | This is a project to simultaneously provide language ID tags and Part-Of-Speech or compiler tags (which are taken from CLANG compilations of C and C++ code). The corpus is either code with comments annotated with CLANG compiler information and universal PoS tags for English, or StackOverflow. For StackOverflow we start from the data dump (which can be found here), and use a frequency-based heuristic to annotate code snippets. The frequency data is made available under ./data/corpora/SO/frequency_map.json. To generate training data from StackOverflow, please use the scripts under ./src/preprocessor together with the Posts.xml file from the data dump linked above. Our model is a BiLSTM neural network with a linear CRF and Viterbi decoding to go from LSTM state to tags or language IDs. We use the same LSTM network and change only the CRF on top for the two tasks. We linearly combine the two objectives in the loss, with a slightly smaller weight given to language IDs. We do not condition tag output on language IDs in this version of the model. |
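The decoding step and the combined objective described above can be sketched in plain Python. This is an illustrative sketch, not code from the POSIT repository: the function names, the score layout, and the default weight of 0.9 are assumptions (the description says only that language IDs receive a "slightly smaller weight").

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][j]   : score of label j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    # best[j] holds the best score of any path ending in label j so far.
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # Choose the previous label that maximises the path score into j.
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Start from the best final label and follow the backpointers.
    path = [max(range(n_labels), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def combined_loss(tag_loss: float, lang_id_loss: float,
                  lang_id_weight: float = 0.9) -> float:
    """Linearly combine the two objectives; the 0.9 weight is hypothetical."""
    return tag_loss + lang_id_weight * lang_id_loss
```

In a setup like the one described, the same emission scores would come from the shared LSTM, with a separate transition matrix for each task (tags and language IDs).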
Type Of Technology | Software |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | Enabled the study of mixed languages in software artefacts, advancing the state of the art. |
URL | https://github.com/PPPI/POSIT |
Description | Dr Earl Barr presentation at |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | AIFORSE Conference 2017, the first global conference on Artificial Intelligence (AI) for Software Engineering (SE), takes place on the 10th of November 2017 in Barcelona. The main purpose of the conference is to build a bridge between the most significant players of the software engineering industry on one side and the most advanced adopters of cutting-edge applied artificial intelligence technologies on the other. The leaders and experts of software engineering and pioneering innovators of artificial intelligence in SE will meet to accelerate development and increase the efficiency of operations in the industry. The programme offers 12 hours of networking, discussions, and reports from the 10 best speakers in the industry. The speakers are representatives of companies from around the world who already apply AI to software engineering. They will not only disclose the tools that help solve problems faster and decrease costs, but will also define the development vector of the software engineering industry. |
Year(s) Of Engagement Activity | 2017 |
URL | http://aiforse.org/conference-2017 |
Description | Earl Barr - talk at Semmle |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Earl Barr presents his paper To Type or Not to Type: Quantifying Detectable Bugs in JavaScript at Semmle http://earlbarr.com/publications/typestudy.pdf |
Year(s) Of Engagement Activity | 2018 |
URL | https://semmle.com/ |
Description | Earl Barr attending invitation only workshop on software security, National Cyber Security Centre |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Industry/Business |
Results and Impact | Invitation only workshop to discuss the requirements for an international collaborative effort and infrastructure to support large-scale empirical research on software security. The workshop was held in London on the 17th of December and was supported by the National Cyber Security Centre. The collective goal for the day was to develop a shared understanding of the challenges faced by research on software code analysis for cybersecurity and outline a roadmap for an international testbed infrastructure for large-scale experimental research on software security. |
Year(s) Of Engagement Activity | 2018 |
Description | Earl Barr invited speaker CHOOSE forum in Switzerland |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The CHOOSE Forum 2018 is organized by the Zurich Empirical Software engineering Team (ZEST) at the University of Zurich, on behalf of CHOOSE. Earl Barr presented his paper Bimodal Software Engineering |
Year(s) Of Engagement Activity | 2018 |
URL | https://choose.swissinformatics.org/events/choose-forum-2018-software-engineering-and-machine-learni... |
Description | Earl Barr presents Bimodal Software Engineering at FLOC 2018: FEDERATED LOGIC CONFERENCE 2018, MLP PROGRAM |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Earl Barr presents Bimodal Software Engineering at FLOC 2018: FEDERATED LOGIC CONFERENCE 2018 in the MLP PROGRAM |
Year(s) Of Engagement Activity | 2018 |
URL | https://easychair.org/smart-program/FLoC2018/MLP-program.html |
Description | Earl Barr presents his paper Mining Semantic Loop Idioms @ FSE 2018 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Earl Barr presents his paper Mining Semantic Loop Idioms @ FSE 2018 in the Journal-First track Sun 4 - Fri 9 November 2018 Lake Buena Vista, Florida, United States |
Year(s) Of Engagement Activity | 2018 |
URL | https://2018.fseconference.org/event/fse-2018-journal-first-mining-semantic-loop-idioms |
Description | Earl Barr research visit to Monash University and University of Adelaide, Australia |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Earl Barr visited Monash University and University of Adelaide, Australia. He presented Bimodal Software Engineering at both Universities. Tuesday (Oct. 2nd) - Monash University 09:00-10:00 KEYNOTE (General Seminar, Earl Barr, UCL) - Bimodal Software Engineering https://www.monash.edu/it/our-research/research-seminars/events/events/2018/earl-barr-bimodal-software-engineering |
Year(s) Of Engagement Activity | 2018 |
URL | https://www.monash.edu/it/our-research/research-seminars/events/events/2018/earl-barr-bimodal-softwa... |
Description | Huawei Workshop - Dr Earl Barr |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | Automated Programming Workshop funded by Huawei on 11th Dec 2017. Participants: Professor Daniel Kröning (University of Oxford/DiffBlue); Professor Earl Barr (UCL); Professor Charles Sutton (University of Edinburgh); Professor Hong Zhu (Oxford Brookes University); Dr. Ian Bayley (Oxford Brookes University); Professor Mark Harman (UCL/Facebook); Professor Peter O'Hearn (UCL/Facebook); Professor Alastair F. Donaldson (Imperial College London); Professor Philippa Gardner (Imperial College London); Dr. David White (UCL); Dr. David Kelly (UCL); Dr. Zheng Gao (UCL); Mr. Laifa Zhang, President of RDCC (R&D Competence Center); Mr. Tony Chang, Chief Scientist, VP of RDCC in US; Mr. Ni Huang (Eric), Senior Director of RDCC Technology Planning Dept.; Mr. Xuewen Gong (Sean), Director of RDCC Technology Cooperation Dept.; Professor Qianxiang Wang, Director of software analysis LAB of HUAWEI, vice chair of ACM CSOFT (China chapter of SIGSOFT), secretary-general of CCF TCSE (Technical Committee of Software Engineering, China Computer Federation); Mr. Michael Hill-King, Collaboration Director, Huawei Cambridge Research Centre; Mr. Duo Wu, Collaboration Manager, Huawei Cambridge Research Centre; Miss. Yuncong Zou, Collaboration Assistant, Huawei Cambridge Research Centre. |
Year(s) Of Engagement Activity | 2017 |
Description | Participation in an activity, workshop or similar - Dagstuhl - SE4ML |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Multiple research disciplines, from cognitive sciences to biology, finance, physics, and the social sciences, as well as many companies, believe that data-driven and intelligent solutions are necessary. Unfortunately, current artificial intelligence (AI) and machine learning (ML) technologies are not sufficiently democratized - building complex AI and ML systems requires deep expertise in computer science and extensive programming skills to work with various machine reasoning and learning techniques at a rather low level of abstraction. It also requires extensive trial and error exploration for model selection, data cleaning, feature selection, and parameter tuning. Moreover, there is a lack of theoretical understanding that could be used to abstract away these subtleties. Conventional programming languages and software engineering paradigms have also not been designed to address challenges faced by AI and ML practitioners. The goal of this Dagstuhl Seminar is to bring two rather disjoint communities together, software engineering and programming languages (PL/SE) and artificial intelligence and machine learning (AI-ML) to discuss open problems on how to improve the productivity of data scientists, software engineers, and AI-ML practitioners in industry. The issues addressed in the seminar will include the following: What challenges do people building AI-ML-based systems face? How do we re-think software development tools such as debugging, testing, and verification tools for complex AI-ML-based systems? How do we reason about correctness, explainability, repeatability, traceability, and fairness, while building AI-ML pipeline? What are innovative paradigms that seamlessly embed, reuse, and chain models, while abstracting away most low-level details? 
The topics of the seminar address pressing demands from industry; the research questions are very relevant for practical software systems development that leverages artificial intelligence (AI) and machine learning (ML). In 2016, companies invested $26-39 billion in AI and McKinsey predicts that investments will be growing over the next few years. Any AI- and ML-based systems will need to be built, tested, and maintained, yet there is a lack of established engineering practices in industry for such systems because they are fundamentally different from traditional software systems. Ideas brainstormed in the seminar will contribute to a new suite of ML-relevant software development tools such as debuggers, testers and verification tools that increase developer productivity in building complex AI systems. Furthermore, we will also discuss new innovative AI and ML abstractions that improve programmability in designing intelligent systems. |
Year(s) Of Engagement Activity | 2020 |
URL | https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=20091 |
Description | Talk at source{d} paper reading club - Madrid, 16 November 2018 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Deep Learning for Programming Language Type Inference. On Friday, November 16th, as part of source{d} paper reading club [1], we are going to talk about a paper that was recently published at FSE'18: Deep Learning Type Inference [2]. Abstract: Dynamically typed languages such as JavaScript and Python are increasingly popular, yet static typing has not been totally eclipsed: Python now supports type annotations and languages like TypeScript offer a middle-ground for JavaScript: a strict superset of JavaScript, to which it transpiles, coupled with a type system that permits partially typed programs. However, static typing has a cost: adding annotations, reading the added syntax, and wrestling with the type system to fix type errors. Type inference can ease the transition to more statically typed code and unlock the benefits of richer compile-time information, but is limited in languages like JavaScript as it cannot soundly handle duck-typing or runtime evaluation via eval. We propose DeepTyper, a deep learning model that understands which types naturally occur in certain contexts and relations and can provide type suggestions, which can often be verified by the type checker, even if it could not infer the type initially. |
Year(s) Of Engagement Activity | 2018 |
URL | https://github.com/src-d/reading-club |
Description | The 55th CREST Open Workshop - Bimodal Program Analysis |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Overview: Software is bimodal: it interlinks two channels, an algorithmic channel aimed at devices and a natural language channel aimed at developers. Most research has focused on one channel or the other, not their interplay. Simultaneously considering both channels promises a new source of constraints for improving program analysis and software engineering tools. For example, names in program text can be exploited to refine a type lattice. The CREST Open Workshop on PL and NLP will explore how to identify and exploit these cross-channel connections. Organisers: Earl Barr, CREST Centre, SSE Group, Department of Computer Science, UCL, UK Santanu Dash, CREST Centre, SSE Group, Department of Computer Science, UCL, UK |
Year(s) Of Engagement Activity | 2017 |
URL | http://crest.cs.ucl.ac.uk/cow/55/ |
Description | The 60th CREST Open Workshop - Those were the DAASE |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Overview: DAASE has advanced the state of the art in numerous directions: novel, principled technique for handling class imbalance, optimising energy consumption, automated program repair, automatically generating product roadmaps, to name a few. DAASE has achieved breakthroughs, including automated software transplantation, the first approach for transplanting code that dynamically adapts it for a new context, and a human-competitive multi-objective software effort estimator that balances accuracy against variance, both of which won Hummies at GECCO. DAASE has pioneered a new field of research called genetic improvement and produced award-winning work on fitness landscape analysis and visualisation. DAASE has spawned a number of start-ups, most notably Sapienz which Facebook acquired and which now tests and automatically repairs code at Internet scale. Heathrow's plane scheduling now relies on a bespoke optimisation algorithm devised by DAASE researchers. Automated software repair using genetic improvement is also now a part of Janus Manager, a management software for rehabilitation centres in Iceland. Join us to review and celebrate these accomplishments, and discuss how to carry them forward. 
Day 1 - Monday 3 Dec 10:45 - Pastries 11:15 - Introductions - Earl Barr 11:30 - Bill Langdon, CREST Centre, SSE Group, Department of Computer Science, UCL, UK Genetic Improvement by Evolving Program Data 12:00 - Darrell Whitley, Colorado State University, USA Optimal Neuron Selection and Ensemble Based Learning 12:30 - Lunch 13:30 - Mark Harman, CREST Centre, SSE Group, Department of Computer Science, UCL, UK and Facebook Deploying Search Based Software Engineering with Sapienz at Facebook 14:00 - John Woodward, School of Electronic Engineering & Computer Science, Queen Mary University of London, UK Genetic Improvement in a Live System 14:30 - Earl Barr, CREST Centre, SSE Group, Department of Computer Science, UCL, UK Bimodal software engineering 15:00 - Refreshments 15:30 - John Clark, Department of Computer Science, University of Sheffield, UK Pushing the searchboat out: from quantum software simulation to digital twinning 16:00 - Jeff Kramer, Department of Computing, Imperial College London, UK The challenge of change 16:30 - Close of day Day 2 - Tuesday 4 Dec 11:00 - Pastries 11:30 - Justyna Petke, CREST Centre, SSE Group, Department of Computer Science, UCL, UK Specialising Software Using Genetic Improvement and Code Transplantation 12:00 - Leandro Minku, School of Computer Science, University of Birmingham, UK A Novel Automated Approach for Software Effort Estimation Based on Data Augmentation 12:30 - Lunch 13:30 - Gabriela Ochoa, Computing Science and Mathematics, University of Stirling, UK LON Maps: Recent Advances in Local Optima Networks 14:00 - Erwin Pesch, Faculty of Economics and Business Administration, University in Siegen, Germany Preventing Crane Interferences at Automated Container Terminals 14:30 - David R. White, Department of Computer Science, University of Sheffield, UK Gin: a Tool for Program Improvement 15:00 - Refreshments 15:30 - Closing remarks 16:00 - Close of day |
Year(s) Of Engagement Activity | 2019 |
URL | http://crest.cs.ucl.ac.uk/cow/60/ |
Description | Visit to Luxembourg University |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Research visit on 8 and 9 April 2019 to Jacques Klein at the Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg |
Year(s) Of Engagement Activity | 2019 |
URL | https://wwwen.uni.lu/research/fstc/computer_science_and_communications_research_unit/members/jacques... |