CaMELot: Catching and Mitigating Event-Loop Concurrency Issues

Lead Research Organisation: University of Kent

Department Name: Sch of Computing

Abstract

Most modern computer applications depend in some way on computations performed by server applications on the internet. More and more of these server applications are now built as so-called microservices. Microservices have become popular because they allow developers to gradually update or fix unrelated parts of a larger application independently. Many microservices avoid certain types of concurrency issues by design. Unfortunately, they still suffer from other kinds of concurrency issues, for example when multiple online customers try to reserve the same seats at the same time.

For software engineers, it is hard to test for all possible concurrent interactions. In practice, this means that only simple concurrency issues are reliably detected during testing. Complex issues, however, can easily slip through into deployed server applications, which then handle client requests incorrectly. One example of such a concurrency issue occurred at Nasdaq when Facebook stock was traded for the first time, resulting in losses of millions of dollars.

Our goal is to develop techniques that detect concurrency issues automatically at run time, so that they can be circumvented, and that help developers fix them using the detailed information gathered during detection. Researchers have shown that one can detect and avoid such issues, for instance by changing the order in which client requests are processed. In practice, however, current techniques slow server applications down significantly, which makes them too costly to use. Our aim is to dynamically balance the need for accurate information against the need to minimize slowdown. We conjecture that we can obtain most of the practical benefits while only rarely tracking precise details of how program code executes. In addition to automatically preventing concurrency issues from causing problems, we will use the obtained information to give developers feedback so that they can fix the underlying issue in their software.

Overall, the goal of this research project is thus to make server applications, and specifically microservices, more robust and resilient to software bugs that are hard to test for and therefore typically remain undiscovered until they cause major issues for customers or companies.

Our work will result in adaptive techniques that detect concurrency issues and automatically trade off accuracy against run-time overhead, so that they are usable in practice. Furthermore, the detection techniques will provide actionable input to software developers, so that concurrency issues can be fixed and thereby reliably prevented in the future.

To evaluate this work, we will collect various types of concurrency issues and make them openly available. This collection will be drawn from issues in industrial systems as well as derived from theoretical scenarios for highly complex bugs. We include these theoretical scenarios because such complex bugs are hard to diagnose and test for. They therefore likely remain undiagnosed and undocumented in practice, yet have the potential to cause major disruptions.

Finally, we will build and evaluate our proposed techniques based on a system designed for concurrency research. The system uses the GraalVM technology of Oracle Labs, which allows us to prototype at the level of state-of-the-art systems, while keeping the development effort manageable for a small team.

Funded Value:

£209,756

Funded Period:

Mar 2021 - Mar 2024

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/V007165/1

Principal Investigator:

Stefan Marr

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Fundamentals of Computing (80%)

Software Engineering (20%)

People	ORCID iD
Stefan Marr (Principal Investigator)	http://orcid.org/0000-0001-9059-5180

Publications


Larose O (2023) AST vs. Bytecode: Interpreters in the Age of Meta-Compilation in Proceedings of the ACM on Programming Languages

Pimás J (2023) Live Objects All The Way Down: Removing the Barriers between Applications and Virtual Machines in The Art, Science, and Engineering of Programming

Marr S (2022) Execution vs. Parse-Based Language Servers: Tradeoffs and Opportunities for Language-Agnostic Tooling for Dynamic Languages

Burchell H (2023) Don't Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers

Ugawa T (2022) Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems

Kaleba S (2022) Who You Gonna Call: Analyzing the Run-Time Call-Site Behavior of Ruby Applications

Larose O (2023) Dynamic Library Compartmentalization

Huang W (2023) Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm

Key Findings

Description	One of the important questions of this work was whether concurrency bug detection can be applied to large-scale applications. As a step towards this goal, we analyzed large-scale Ruby applications. Our findings show that their behavior is similar to that of the smaller applications studied previously, which means that our approach to optimizations can likely be applied to them as well. However, the large scale of these applications causes major overheads when using just-in-time compilation techniques such as inlining and splitting. While these techniques are needed to achieve the best performance, they mean that large applications can take hours instead of minutes to reach peak performance. With a novel splitting strategy, we can reduce oversplitting in such systems by up to 70%, which reduces the amount of compilation the system has to do. Furthermore, we investigated how the run-time representation in virtual machines can be optimized to reduce memory use and variability. With this technique, we are able to minimize the impact of our detection technique on memory use and performance. Our latest work further investigated the tradeoffs between basic interpreter techniques. It found that, with the meta-compilation technology available today, we can build and optimize language implementations based on tree representations, which reduces engineering effort, and still reach the same performance as classic instruction-oriented representations.
Exploitation Route	Our current results will be of use to programming language implementers, such as large companies building browsers and compilers.
Sectors	Digital/Communication/Information Technologies (including Software)

Further Funding

Description	Industry Fellowship
Amount	£120,003 (GBP)
Funding ID	INF\R1\211001
Organisation	The Royal Society
Sector	Charity/Non Profit
Country	United Kingdom
Start	08/2021
End	08/2024

Collaboration

Description	Debugging Technology and User Evaluations
Organisation	Vrije Universiteit Brussel
Country	Belgium
Sector	Academic/University
PI Contribution	We contribute expertise on language implementation techniques, compilers, optimizations, and concurrency bug detection.
Collaborator Contribution	Our partners contribute expertise on debugging, user studies and empirical evaluation, distributed systems, and bugs in distributed systems.
Impact	The first outcome of the collaboration is a user study. The resulting paper is currently under revision after the first review rounds.
Start Year	2021


Description	Large-Scale Automated Testing
Organisation	University of Ghent
Country	Belgium
Sector	Academic/University
PI Contribution	We contributed our expertise on defining and implementing debugger protocols, which enabled a new approach to automated testing of large-scale systems.
Collaborator Contribution	Prof. Scholliers and his group built and evaluated a new system for large-scale testing of embedded devices based on our debugging and debugger testing technology. They further refined the approach, including first steps on formalizing the semantics of the overall approach.
Impact	The collaboration resulted in the implementation of a new software system, Latch, as well as a draft paper that is currently under revision.
Start Year	2022


Description	Virtual Machine Design and Application Integration
Organisation	University of Buenos Aires
Country	Argentina
Sector	Academic/University
PI Contribution	We contributed expertise on Virtual Machine Design and language implementation to this project.
Collaborator Contribution	J. Pimás designed and implemented a novel system that combines the virtual machine and the application in the same codebase and programming language, to reap the benefits of uniform tooling and representation across the full software stack. Traditionally, the virtual machine is intended to completely insulate the application from its implementation details. However, in his research, he found that violating this general premise can have major benefits for small software teams.
Impact	We published the paper "Live Objects All The Way Down: Removing the Barriers between Applications and Virtual Machines", DOI: https://doi.org/10.22152/programming-journal.org/2024/8/5
Start Year	2022
