CaMELot: Catching and Mitigating Event-Loop Concurrency Issues

Lead Research Organisation: University of Kent
Department Name: School of Computing

Abstract

Most modern computer applications depend in one way or another on computations performed by server applications on the internet. More and more of these server applications are now built as so-called microservices, which have become popular because they allow developers to gradually update or fix issues in unrelated parts of a larger application. Many of these microservices avoid certain types of concurrency issues by design. Unfortunately, they still suffer from other kinds of concurrency issues, for example when multiple online customers try to reserve the same seats at the same time.

For software engineers, it is hard to test for all possible concurrent interactions. In practice, this means that only simple concurrency issues are reliably detected during testing. Complex issues, however, can easily slip through into server applications, which then handle client requests incorrectly. One example of such a concurrency issue appeared at Nasdaq when the Facebook stock was traded for the first time, resulting in the loss of millions of dollars.

Our goal is to develop techniques that detect concurrency issues automatically at run time, so that the issues can be circumvented and developers can fix them using the detailed information gathered by the detection techniques. Researchers have shown that one can detect and avoid issues, for instance by changing the order in which client requests are processed. In practice, however, current techniques slow server applications down significantly, which makes them too costly to use. Our aim is to dynamically balance the need for accurate information against the need to minimize slowdown. We conjecture that we can get most practical benefits while only rarely tracking precise details of how program code executes. In addition to automatically preventing concurrency issues from causing problems, we will also use the obtained information to provide feedback to developers so that they can fix the underlying issue in their software.

Thus, overall the goal of this research project is to make server applications, and specifically microservices, more robust and resilient to software bugs that are hard to test for and therefore typically remain undiscovered until they cause major issues for customers or companies.

Our work will result in the development of adaptive techniques that detect concurrency issues and automatically trade off accuracy and run-time overhead, so that they are usable in practice. Furthermore, the detection techniques will be used to provide actionable input to software developers, so that a concurrency issue can be fixed and therefore reliably prevented in the future.

To evaluate this work, we will collect various types of concurrency issues and make them openly available. This collection will be based on issues from industrial systems and derived from theoretical scenarios for highly complex bugs. We include these theoretical scenarios because such complex bugs are hard to diagnose and test for; they likely remain undiagnosed and undocumented in practice, but have the potential to cause major disruptions.

Finally, we will build and evaluate our proposed techniques based on a system designed for concurrency research. The system uses the GraalVM technology of Oracle Labs, which allows us to prototype at the level of state-of-the-art systems, while keeping the development effort manageable for a small team.
 
Description One of the important questions of this work was whether we can apply concurrency bug detection in large-scale applications.
As a step towards this goal, we analyzed large-scale Ruby applications and found that they behave similarly enough to the smaller applications studied previously that we can use our optimizations for them as well.

Furthermore, we investigated how the run-time representation in virtual machines can be optimized to reduce memory use and variability. With this technique, we are able to minimize the impact of our detection technique on memory use and performance.
Exploitation Route Our current results will be of use to programming language implementers, such as large companies building browsers and compilers.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description Industry Fellowship
Amount £120,003 (GBP)
Funding ID INF\R1\211001 
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 09/2021 
End 08/2024
 
Description Debugging Technology and User Evaluations 
Organisation Vrije Universiteit Brussel
Country Belgium 
Sector Academic/University 
PI Contribution We contribute expertise on language implementation techniques, compilers, optimizations, and concurrency bug detection.
Collaborator Contribution Our partners contribute expertise on debugging, user studies and empirical evaluation, distributed systems, and bugs in distributed systems.
Impact The first outcome of the collaboration is a completed user study; its publication process is not yet complete.
Start Year 2021