CaMELot: Catching and Mitigating Event-Loop Concurrency Issues

Lead Research Organisation: University of Kent

Department Name: Sch of Computing

Abstract

Most modern computer applications depend in one way or another on computations performed by server applications on the internet. More and more of these server applications are now built as so-called microservices, which have become popular because they allow developers to gradually update or fix issues in unrelated parts of a larger application. Many of these microservices avoid certain types of concurrency issues by design. Unfortunately, they still suffer from other kinds of concurrency issues, for example when multiple online customers try to reserve the same seats at the same time.
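
The seat-reservation case can be illustrated with a minimal sketch of a single-threaded event loop. It assumes a Python asyncio service; the names seats, check_payment, and reserve are hypothetical and not taken from the project. No two handlers ever run in parallel, yet the check and the update are separated by an await, so a second request can interleave between them.

```python
import asyncio

# Hypothetical in-memory seat map; None means the seat is free.
seats = {"12A": None}

async def check_payment(customer):
    # Stand-in for a call to another service (an assumption for this sketch).
    # Awaiting it hands control back to the event loop, so other
    # requests can run before this handler continues.
    await asyncio.sleep(0)
    return True

async def reserve(customer, seat):
    if seats[seat] is not None:       # check: is the seat still free?
        return False
    await check_payment(customer)     # other requests may interleave here
    seats[seat] = customer            # act: based on a possibly stale check
    return True

async def main():
    # Two customers request the same seat at the same time.
    return await asyncio.gather(reserve("alice", "12A"), reserve("bob", "12A"))

results = asyncio.run(main())
print(results, seats)
```

Both requests report success although only one customer ends up holding the seat. Re-checking the seat after the await, or reserving it before awaiting, would avoid this particular interleaving.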

For software engineers, it is hard to test for all possible concurrent interactions. In practice, this means that only simple concurrency issues are reliably detected during testing. Complex issues, however, can easily slip through into deployed server applications, which then handle client requests incorrectly. One example of such a concurrency issue appeared at Nasdaq when the Facebook stock was traded for the first time, resulting in the loss of millions of dollars.

Our goal is to develop techniques that detect concurrency issues automatically at run time, so that they can be circumvented, and that enable developers to fix them using the detailed information gathered during detection. Researchers have shown that one can detect and avoid issues, for instance by changing the order in which client requests are processed. In practice, however, current techniques slow server applications down significantly, which makes them too costly to use. Our aim is to dynamically balance the need for accurate information against the slowdown that gathering it causes. We conjecture that we can get most practical benefits while only rarely tracking precise details of how program code executes. In addition to automatically preventing concurrency issues from causing problems, we will also use the obtained information to provide feedback to developers so that they can fix the underlying issue in their software.

Thus, overall the goal of this research project is to make server applications, and specifically microservices, more robust and resilient to software bugs that are hard to test for and therefore typically remain undiscovered until they cause major issues for customers or companies.

Our work will result in the development of adaptive techniques that detect concurrency issues and automatically trade off accuracy against run-time overhead, so that they are usable in practice. Furthermore, the detection techniques will be used to provide actionable input to software developers, so that a concurrency issue can be fixed and therefore prevented reliably in the future.

To evaluate this work, we will collect different types of concurrency issues and make them openly available. This collection will be based on issues from industrial systems and on issues derived from theoretical scenarios for highly complex bugs. We include these theoretical scenarios because such complex bugs are hard to diagnose and test for; they likely remain undiagnosed and undocumented in practice, but have the potential to cause major disruptions.

Finally, we will build and evaluate our proposed techniques based on a system designed for concurrency research. The system uses the GraalVM technology of Oracle Labs, which allows us to prototype at the level of state-of-the-art systems, while keeping the development effort manageable for a small team.

Funded Value:

£209,756

Funded Period:

Apr 21 - Mar 24

Funder:

EPSRC

Project Status:

Active

Project Category:

Research Grant

Project Reference:

EP/V007165/1

Principal Investigator:

Stefan Marr

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Fundamentals of Computing (80%)

Software Engineering (20%)

Organisations

People	ORCID iD
Stefan Marr (Principal Investigator)	http://orcid.org/0000-0001-9059-5180

Publications


Larose O (2023) AST vs. Bytecode: Interpreters in the Age of Meta-Compilation in Proceedings of the ACM on Programming Languages

Larose O (2023) Dynamic Library Compartmentalization

Marr S (2022) Execution vs. Parse-Based Language Servers: Tradeoffs and Opportunities for Language-Agnostic Tooling for Dynamic Languages

Huang W (2023) Optimizing the Order of Bytecode Handlers in Interpreters using a Genetic Algorithm

Ugawa T (2022) Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems

Kaleba S (2022) Who You Gonna Call: Analyzing the Run-Time Call-Site Behavior of Ruby Applications

Key Findings

Description	One of the important questions of this work was whether we can apply concurrency bug detection in large-scale applications. As a step towards this goal, we analyzed large-scale Ruby applications and found that they behave similarly enough to the smaller applications studied previously that we can use our optimizations for them as well. Furthermore, we investigated how the run-time representation in virtual machines can be optimized to reduce memory use and variability. With this technique, we are able to minimize the impact of our detection technique on memory use and performance.
Exploitation Route	Our current results will be of use to programming language implementers, such as large companies building browsers and compilers.
Sectors	Digital/Communication/Information Technologies (including Software)

Further Funding

Description	Industry Fellowship
Amount	£120,003 (GBP)
Funding ID	INF\R1\211001
Organisation	The Royal Society
Sector	Charity/Non Profit
Country	United Kingdom
Start	09/2021
End	08/2024

Collaboration

Description	Debugging Technology and User Evaluations
Organisation	Vrije Universiteit Brussel
Country	Belgium
Sector	Academic/University
PI Contribution	We contribute expertise on language implementation techniques, compilers, optimizations, and concurrency bug detection.
Collaborator Contribution	Our partners contribute expertise on debugging, user studies and empirical evaluation, distributed systems, and bugs in distributed systems.
Impact	The first outcome of the collaboration is a user study that has been conducted; the publication process is not yet complete.
Start Year	2021
