CaMELot: Catching and Mitigating Event-Loop Concurrency Issues

Lead Research Organisation: University of Kent
Department Name: Sch of Computing

Abstract

Most modern computer applications depend in some way or another on computations that are performed by server applications on the internet. More and more of these server applications are now built as so-called microservices, which allow developers to gradually update or fix issues in unrelated parts of a larger application, and therefore, have become popular. Many of these microservices avoid certain types of concurrency issues by design. Unfortunately, they still suffer from other kinds of concurrency issues, for example when multiple online customers try to reserve the same seats at the same time.

For software engineers, it is hard to test for all possible concurrent interactions. In practice, this means that only simple concurrency issues are reliably detected during testing. Complex issues can however easily slip through and make it into server applications and then handle client requests incorrectly. One example of such a concurrency issue appeared at Nasdaq when the Facebook stock was traded for the first time, resulting in the loss of millions of dollars.

Our goal is to develop techniques that detect concurrency issues automatically at run time, to be able to circumvent them, and enable developers to fix them, using detailed information gathered by the detection techniques. Researchers have shown that one can detect and avoid issues, for instance by changing the order in which client requests are processed. In practice however, current techniques slow server applications down significantly, which make these techniques too costly to be used. Our aim is to dynamically balance the need for accurate information and minimize slow down. We conjecture that we can get most practical benefits while only rarely tracking precise details of how program code executes. In addition to automatically preventing concurrency issues to cause problems, we will also use the obtained information to provide feedback to developers so that they can fix the underlying issue in their software.

Thus, overall the goal of this research project is to make server applications, and specifically microservices, more robust and resilient to software bugs that are hard to test for and therefore typically remain undiscovered until they cause major issues for customers or companies.

Our work will result in the development of adaptive techniques that detect concurrency issues, and automatically tradeoff accuracy and run-time overhead, to be usable in practice. Furthermore, the detection techniques will be used to provide actionable input to the software developers, so that the concurrency issue can be fixed and therefore be prevented reliably in the future.

To evaluate this work, we will collect various different types of concurrency issues and make them openly available. This collection will be based on issues from industrial systems and derived from theoretical scenarios for highly complex bugs. We include these theoretical scenarios, since such complex bugs are hard to diagnose and test for, they likely remain undiagnosed and undocumented in practice, but have the potential of causing major disruptions.

Finally, we will build and evaluate our proposed techniques based on a system designed for concurrency research. The system uses the GraalVM technology of Oracle Labs, which allows us to prototype at the level of state-of-the-art systems, while keeping the development effort manageable for a small team.
 
Description One of the important questions of this work was whether we can apply concurrency bug detection on large scale applications.
As a step towards this goal, we analyzed large-scale Ruby applications.
Our findings show that their behavior is similar to the smaller applications that were studied previously, which means that we our approach to optimizations can likely be applied to them as well. However, the large scale of these applications causes major overheads when using just-in-time compilation techniques such as inlining and splitting. While these are needed to achieve best performance, it means large application will take hours instead of minutes to reach best performance. With a novel splitting strategy, we can reduce oversplitting by up to 70% in such systems, which reduces the amount of compilation the system has to do.

Furthermore, we investigated how the run-time representation in virtual machines can be optimized to reduce memory use and variability. With this technique, we are able to minimize the impact of our detection technique on memory use and performance. Our latest work further investigated the tradeoffs between basic interpreter techniques and found that with the meta-compilation technology available today, we can build and optimize languages based on tree-representation, which reduce engineering effort, and reach the same performance as classic instruction-oriented representations.
Exploitation Route Our current results will be of use to programming language implementers, such as large companies building browsers and compilers.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description Industry Fellowship
Amount £120,003 (GBP)
Funding ID INF\R1\211001 
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 08/2021 
End 08/2024
 
Description Debugging Technology and User Evaluations 
Organisation Vrije Universiteit Brussel
Country Belgium 
Sector Academic/University 
PI Contribution We contribute expertise on language implementation techniques, compilers, optimizations, and concurrency bug detection.
Collaborator Contribution Our partners contribute expertise on debugging, user studies and empirical evaluation, distributed systems, and bugs in distributed systems.
Impact The first outcome of the collaboration is a user study that was conducted, but the paper is currently under revision after first review rounds.
Start Year 2021
 
Description Large-Scale Automated Testing 
Organisation University of Ghent
Country Belgium 
Sector Academic/University 
PI Contribution We contributed our expertise on defining and implementing debugger protocols, which enabled a new approach to automated testing of large-scale systems.
Collaborator Contribution Prof. Scholliers and his group built and evaluated a new system for large-scale testing of embedded devices based on our debugging and debugger testing technology. They further refined the approach, including first steps on formalizing the semantics of the overall approach.
Impact The collaboration resulted in the implementation of a new software system Latch, as well as a draft paper, which currently is under revision.
Start Year 2022
 
Description Virtual Machine Design and Application Integration 
Organisation University of Buenos Aires
Country Argentina 
Sector Academic/University 
PI Contribution We contributed expertise on Virtual Machine Design and language implementation to this project.
Collaborator Contribution J. Pimás design and implemented a novel system that combines the virtual machine and application in the same codebase and programming language to reap the benefits of uniform tooling and representation across the full software stack. Traditionally, the virtual machine is intended to completely insulate the application from the implementation details. However, in his research, he found that violating this general premise can have major benefits for small software teams.
Impact We published a paper titled: Live Objects All The Way Down: Removing the Barriers between Applications and Virtual Machines DOI: https://doi.org/10.22152/programming-journal.org/2024/8/5
Start Year 2022