Automatic Binary Parallelisation

Lead Research Organisation: University of Cambridge
Department Name: Computer Laboratory

Abstract

Since the turn of the century, multicore processors have become commonplace in general-purpose computing systems. Starting with IBM's POWER 4 in 2001, multicores soon became mainstream with the release of x86 chips from Intel and AMD targeting desktop and servers; ARM's Cortex-A9 MPCore pushed multicores into the mobile space soon after. The failure of Dennard scaling (whereby, with each technology generation, integrated circuits could contain more transistors operating at a higher frequency for the same power budget) forced manufacturers to switch their focus from high-cost extraction of instruction-level parallelism (ILP) to efficient execution of thread-level parallelism (TLP) to avoid hitting the "power wall". This enables greater performance for separate tasks running together, either from totally independent programs or collaborating threads in a parallel application.

Unfortunately, writing parallel code is still seen as hard (John Hennessy called it "a problem that's as hard as any that computer science has faced"). The programming language and runtime communities have risen to this challenge by providing new languages and constructs to aid parallel programming, significantly boosting programmer productivity. Whilst important and useful for new applications, studies suggest that developers primarily use threads to structure their code, without fully exploiting the parallelism available from the underlying hardware. In addition, users of single-threaded applications where the source code is lost, unavailable or cannot easily be recompiled, are not able to take advantage of the TLP now available to them.

Within this context, parallelisation of application binaries becomes a seductive proposition. Regardless of the source languages used to create the program, or the availability of the code, an application can be restructured within its binary form to split off tasks into separate threads, manage communication between them and combine their results back together when required. Although almost impossible to perform effectively by hand, automatic tools have the ability to extract the inherent parallelism lurking in many sequential applications through sophisticated analysis and transformations to keep tasks as independent as possible. This obviates the need to spend time and effort reworking code into parallel form and allows users to obtain the benefits of parallel processors without writing a single line of code, opening up the performance potential of multicore processors so it is available to all.

This project represents a significant step towards a general-purpose binary parallelisation tool. To achieve its aims it will draw on prior research into binary analysis, compiler-based automatic parallelisation, dynamic binary translation, and software transactional memory. A concrete output from this work will be a tool capable of extracting and exploiting the thread-level parallelism available in sequential applications, to achieve speed-ups of at least 2x on commodity quad-core processors.

Planned Impact

This novel project seeks to develop a general-purpose, automatic binary parallelisation tool. The overall aim is to identify and extract thread-level parallelism from application binaries using a combination of static analysis and dynamic transformations. Although this tool would be useful in a variety of contexts, its primary benefit will be gained where users do not have access to the source code of the single-threaded applications they run, yet wish to take advantage of the underlying multicore hardware. Furthermore, it will also be advantageous to users who do have access to the source, but do not have the ability, skill or desire to alter the code to introduce multithreading.

One environment containing these types of users is a data centre. Users submit jobs to the data centre using applications written in a variety of languages. In an academic or research environment, some of the code they will have written, but they do not have the time to parallelise due to pressures on getting results quickly and the complexities involved in identifying and exploiting the parallelism available. Other codes may be written in scripting languages which only have basic support for parallelism, and where custom versions of the language are undesirable due to publication requirements for data reproducibility. Yet more applications are third-party programs where, even if the source code is available, users have even less incentive to create parallel versions since there is a significantly high barrier to entry in simply understanding the source in the first place.

To this end, the PI has partnered with the Cancer Research UK Cambridge Institute (CRUK-CI), to use their data centre and users' applications as a test-bed for the binary parallelisation tool. CRUK-CI will provide example applications and scripts that are regularly used in their data centre for analysis and parallelisation. In addition, we will talk with users about their applications to understand how the codes work and their key characteristics, helping to focus our analysis and parallelism extraction transformations. Once the tool is robust and able to extract a good level of performance from these codes, we aim to trial it with users in the data centre.

The impacts from this project fit directly into EPSRC's Delivery Plan outcome of a "Productive Nation" with the introduction of innovative and disruptive technologies. Partnership with CRUK-CI will ensure that even early prototypes of this technology will benefit researchers seeking to develop understanding of and treatments for one of the biggest health challenges of our time. Collaboration will enable a significant reduction in simulation latency for key scientific experiments, ultimately resulting in more productive researchers and a faster convergence to promising solutions.
 
Description We have created a prototype binary parallelisation tool, Janus, capable of splitting applications into threads that run in parallel. It is also capable of taking advantage of modern vector ISAs within general-purpose microprocessors to increase the amount of code executed in parallel.
Exploitation Route People can use the Janus tool to parallelise their own applications, even if they don't have the source code, enabling better use of the underlying microprocessor and faster program execution.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance
Amount £1,091,792 (GBP)
Funding ID EP/W00576X/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 08/2021 
End 03/2025
 
Title Research Data supporting "Compendia: Reducing Virtual-Memory Costs via Selective Densification" 
Description A Linux kernel patch to extend the Compendia simulator, using the same methodology and extending the code of BadgerTrap. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact
URL https://www.repository.cam.ac.uk/handle/1810/322525
 
Title Research data supporting "CHERIvoke: Characterising Pointer Revocation using CHERI Capabilities for Temporal Memory Safety" 
Description Source code for sweeper and modified dlmalloc (called dlmalloc_cherivoke) that implements the CHERIvoke technique. See the file README.md for a detailed description and usage instructions. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Research data supporting "Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring" 
Description Cinnamon compiler and backends for Janus, Pin and Dyninst to compile Cinnamon programs. Also the Cinnamon branch of the Janus repository. See the README.md file for more detailed information and usage instructions. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
Impact
URL https://www.repository.cam.ac.uk/handle/1810/321769
 
Title Research data supporting "Duplo: A Framework for OCaml Post-Link Optimisation" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/307491
 
Title Research data supporting "HALO: Post-Link Heap-Layout Optimisation" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/300136
 
Title Research data supporting "MineSweeper: A "Clean Sweep" for Drop-In Use-After-Free Prevention" 
Description Contains the MineSweeper implementation, an allocator extension implemented on top of JeMalloc to mitigate use-after-free attacks, together with scripts to evaluate its running time and memory overheads on the SPEC CPU2006 benchmarks. 
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact
URL https://www.repository.cam.ac.uk/handle/1810/332941
 
Title Research data supporting "The Janus Triad: Exploiting Parallelism Through Dynamic Binary Modification" 
Description A static binary analysis tool, profile information, a DynamoRIO client and benchmarks to show binary parallelisation using Janus. For more information, see the README.txt file. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title AN APPARATUS AND METHOD FOR SPECULATIVELY VECTORISING PROGRAM CODE 
Description An apparatus and method are provided for speculatively vectorising program code. The apparatus comprises processing circuitry for executing program code, the program code including an identified code region comprising at least a plurality of speculative vector memory access instructions. Execution of each speculative vector memory access instruction is employed to perform speculative vectorisation of a series of scalar memory access operations using a plurality of lanes of processing. Tracking storage is used to maintain, for each speculative vector memory access instruction, tracking information providing an indication of a memory address being accessed within each lane. Checking circuitry then references the tracking information during execution of the identified code region by the processing circuitry, in order to detect any inter lane memory hazard resulting from the execution of the plurality of speculative vector memory access instructions. For at least a first type of inter lane memory hazard, a status storage element is used to maintain an indication of each lane for which the checking circuitry has determined the presence of that type of memory hazard. Replay determination circuitry is then arranged, when an end of the identified code region is reached, to be responsive to the status storage element identifying at least one lane as having an inter lane memory hazard, to trigger re-execution of the identified code region for each lane identified by the status storage element. Such an approach can significantly increase the ability to vectorise scalar code, hence resulting in significant performance improvements. 
IP Reference WO2021001641 
Protection Patent granted
Year Protection Granted 2021
Licensed No
Impact Arm internal evaluation
 
Title Cinnamon 
Description The Cinnamon domain-specific language and compiler targets DynamoRIO (via the Janus tool), Pin and Dyninst binary instrumentation systems to allow flexible and framework-independent ways of building binary profiling and monitoring tools. 
Type Of Technology Software 
Year Produced 2021 
Open Source License? Yes  
Impact On-going 
URL https://www.cl.cam.ac.uk/~tmj32/data/
 
Title Janus 
Description Janus is a binary parallelisation tool, also capable of automatically vectorising applications, as well as inserting useful software prefetches. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact We published two papers on Janus and it is the building block for on-going research in the group. 
URL https://www.cl.cam.ac.uk/~tmj32/data/