Automatic Binary Parallelisation

Lead Research Organisation: University of Cambridge
Department Name: Computer Laboratory

Abstract

Since the turn of the century, multicore processors have become commonplace in general-purpose computing systems. Starting with IBM's POWER 4 in 2001, multicores soon became mainstream with the release of x86 chips from Intel and AMD targeting desktop and servers; ARM's Cortex-A9 MPCore pushed multicores into the mobile space soon after. The failure of Dennard scaling (whereby, with each technology generation, integrated circuits could contain more transistors operating at a higher frequency for the same power budget) forced manufacturers to switch their focus from high-cost extraction of instruction-level parallelism (ILP) to efficient execution of thread-level parallelism (TLP) to avoid hitting the "power wall". This enables greater performance for separate tasks running together, either from totally independent programs or collaborating threads in a parallel application.

Unfortunately, writing parallel code is still seen as hard (John Hennessy called it "a problem that's as hard as any that computer science has faced"). The programming language and runtime communities have risen to this challenge by providing new languages and constructs to aid parallel programming, significantly boosting programmer productivity. Whilst important and useful for new applications, studies suggest that developers primarily use threads to structure their code, without fully exploiting the parallelism available from the underlying hardware. In addition, users of single-threaded applications where the source code is lost, unavailable or cannot easily be recompiled, are not able to take advantage of the TLP now available to them.

Within this context, parallelisation of application binaries becomes a seductive proposition. Regardless of the source languages used to create the program, or the availability of the code, an application can be restructured within its binary form to split off tasks into separate threads, manage communication between them and combine their results back together when required. Although almost impossible to perform effectively by hand, automatic tools have the ability to extract the inherent parallelism lurking in many sequential applications through sophisticated analysis and transformations to keep tasks as independent as possible. This obviates the need to spend time and effort reworking code into parallel form and allows users to obtain the benefits of parallel processors without writing a single line of code, opening up the performance potential of multicore processors so it is available to all.

This project represents a significant step towards a general-purpose binary parallelisation tool. To achieve its aims it will draw on prior research into binary analysis, compiler-based automatic parallelisation, dynamic binary translation, and software transactional memory. A concrete output from this work will be a tool capable of extracting and exploiting the thread-level parallelism available in sequential applications, to achieve speed-ups of at least 2x on commodity quad-core processors.

Planned Impact

This novel project seeks to develop a general-purpose, automatic binary parallelisation tool. The overall aim is to identify and extract thread-level parallelism from application binaries using a combination of static analysis and dynamic transformations. Although this tool would be useful in a variety of contexts, its primary benefit will be gained where users do not have access to the source code of the single-threaded applications they run, yet wish to take advantage of the underlying multicore hardware. Furthermore, it will also be advantageous to users who do have access to the source, but do not have the ability, skill or desire to alter the code to introduce multithreading.

One environment containing these types of users is a data centre. Users submit jobs to the data centre using applications written in a variety of languages. In an academic or research environment, some of the code they will have written, but they do not have the time to parallelise due to pressures on getting results quickly and the complexities involved in identifying and exploiting the parallelism available. Other codes may be written in scripting languages which only have basic support for parallelism, and where custom versions of the language are undesirable due to publication requirements for data reproducibility. Yet more applications are third-party programs where, even if the source code is available, users have even less incentive to create parallel versions since there is a significantly high barrier to entry in simply understanding the source in the first place.

To this end, the PI has partnered with the Cancer Research UK Cambridge Institute (CRUK-CI), to use their data centre and users' applications as a test-bed for the binary parallelisation tool. CRUK-CI will provide example applications and scripts that are regularly used in their data centre for analysis and parallelisation. In addition, we will talk with users about their applications to understand how the codes work and their key characteristics, helping to focus our analysis and parallelism extraction transformations. Once the tool is robust and able to extract a good level of performance from these codes, we aim to trial it with users in the data centre.

The impacts from this project fit directly into EPSRC's Delivery Plan outcome of a "Productive Nation" with the introduction of innovative and disruptive technologies. Partnership with CRUK-CI will ensure that even early prototypes of this technology will benefit researchers seeking to develop understanding of and treatments for one of the biggest health challenges of our time. Collaboration will enable a significant reduction in simulation latency for key scientific experiments, ultimately resulting in more productive researchers and a faster convergence to promising solutions.

Publications

10 25 50
 
Description We have created a prototype binary parallelisation tool, Janus, capable of splitting applications into threads that run in parallel. It is also capable of taking advantage of modern vector ISAs within general-purpose microprocessors to increase the amount of code executed in parallel.
Exploitation Route People can use the Janus tool to parallelise their own applications, even if they don't have the source code, enabling better use of the underlying microprocessor and faster program execution.
Sectors Digital/Communication/Information Technologies (including Software)

 
Title Research data supporting "CHERIvoke: Characterising Pointer Revocation using CHERI Capabilities for Temporal Memory Safety" 
Description Source code for sweeper and modified dlmalloc (called dlmalloc_cherivoke) that implements the CHERIvoke technique. See the file README.md for a detailed description and usage instructions. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Research data supporting "Duplo: A Framework for OCaml Post-Link Optimisation" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/307491
 
Title Research data supporting "HALO: Post-Link Heap-Layout Optimisation" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/300136
 
Title Research data supporting "The Janus Triad: Exploiting Parallelism Through Dynamic Binary Modification" 
Description A static binary analysis tool, profile information, a DynamoRIO client and benchmarks to show binary parallelisation using Janus. For more information, see the README.txt file. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Janus 
Description Janus is a binary parallelisation tool, also capable of automatically vectorising applications, as well as inserting useful software prefetches. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact We published two papers on Janus and it is the building block for on-going research in the group. 
URL https://www.cl.cam.ac.uk/~tmj32/data/