Automatic Binary Parallelisation

Lead Research Organisation: University of Cambridge

Department Name: Computer Science and Technology

Abstract

Since the turn of the century, multicore processors have become commonplace in general-purpose computing systems. Starting with IBM's POWER 4 in 2001, multicores soon became mainstream with the release of x86 chips from Intel and AMD targeting desktop and servers; ARM's Cortex-A9 MPCore pushed multicores into the mobile space soon after. The failure of Dennard scaling (whereby, with each technology generation, integrated circuits could contain more transistors operating at a higher frequency for the same power budget) forced manufacturers to switch their focus from high-cost extraction of instruction-level parallelism (ILP) to efficient execution of thread-level parallelism (TLP) to avoid hitting the "power wall". This enables greater performance for separate tasks running together, either from totally independent programs or collaborating threads in a parallel application.

Unfortunately, writing parallel code is still seen as hard (John Hennessy called it "a problem that's as hard as any that computer science has faced"). The programming language and runtime communities have risen to this challenge by providing new languages and constructs to aid parallel programming, significantly boosting programmer productivity. Whilst important and useful for new applications, studies suggest that developers primarily use threads to structure their code, without fully exploiting the parallelism available from the underlying hardware. In addition, users of single-threaded applications where the source code is lost, unavailable or cannot easily be recompiled, are not able to take advantage of the TLP now available to them.

Within this context, parallelisation of application binaries becomes a seductive proposition. Regardless of the source languages used to create the program, or the availability of the code, an application can be restructured within its binary form to split off tasks into separate threads, manage communication between them and combine their results back together when required. Although almost impossible to perform effectively by hand, automatic tools have the ability to extract the inherent parallelism lurking in many sequential applications through sophisticated analysis and transformations to keep tasks as independent as possible. This obviates the need to spend time and effort reworking code into parallel form and allows users to obtain the benefits of parallel processors without writing a single line of code, opening up the performance potential of multicore processors so it is available to all.

This project represents a significant step towards a general-purpose binary parallelisation tool. To achieve its aims it will draw on prior research into binary analysis, compiler-based automatic parallelisation, dynamic binary translation, and software transactional memory. A concrete output from this work will be a tool capable of extracting and exploiting the thread-level parallelism available in sequential applications, to achieve speed-ups of at least 2x on commodity quad-core processors.

Planned Impact

This novel project seeks to develop a general-purpose, automatic binary parallelisation tool. The overall aim is to identify and extract thread-level parallelism from application binaries using a combination of static analysis and dynamic transformations. Although this tool would be useful in a variety of contexts, its primary benefit will be gained where users do not have access to the source code of the single-threaded applications they run, yet wish to take advantage of the underlying multicore hardware. Furthermore, it will also be advantageous to users who do have access to the source, but do not have the ability, skill or desire to alter the code to introduce multithreading.

One environment containing these types of users is a data centre. Users submit jobs to the data centre using applications written in a variety of languages. In an academic or research environment, some of the code they will have written, but they do not have the time to parallelise due to pressures on getting results quickly and the complexities involved in identifying and exploiting the parallelism available. Other codes may be written in scripting languages which only have basic support for parallelism, and where custom versions of the language are undesirable due to publication requirements for data reproducibility. Yet more applications are third-party programs where, even if the source code is available, users have even less incentive to create parallel versions since there is a significantly high barrier to entry in simply understanding the source in the first place.

To this end, the PI has partnered with the Cancer Research UK Cambridge Institute (CRUK-CI), to use their data centre and users' applications as a test-bed for the binary parallelisation tool. CRUK-CI will provide example applications and scripts that are regularly used in their data centre for analysis and parallelisation. In addition, we will talk with users about their applications to understand how the codes work and their key characteristics, helping to focus our analysis and parallelism extraction transformations. Once the tool is robust and able to extract a good level of performance from these codes, we aim to trial it with users in the data centre.

The impacts from this project fit directly into EPSRC's Delivery Plan outcome of a "Productive Nation" with the introduction of innovative and disruptive technologies. Partnership with CRUK-CI will ensure that even early prototypes of this technology will benefit researchers seeking to develop understanding of and treatments for one of the biggest health challenges of our time. Collaboration will enable a significant reduction in simulation latency for key scientific experiments, ultimately resulting in more productive researchers and a faster convergence to promising solutions.

Funded Value:

£850,187

Funded Period:

Jun 17 - Apr 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/P020011/1

Principal Investigator:

Timothy Jones

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Computer Sys. & Architecture (100%)

Organisations

People	ORCID iD
Timothy Jones (Principal Investigator)	http://orcid.org/0000-0002-4114-7661

Publications

Author Name

Title Publication Date Published

|< < 1 2 3 > >|

10 25 50

Ainsworth S (2021) ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance

Ainsworth S (2020) MuonTrap: Preventing Cross-Domain Spectre-Like Attacks by Capturing Speculative State

Ainsworth S (2021) Compendia: reducing virtual-memory costs via selective densification

Ainsworth S (2020) MarkUs: Drop-in use-after-free prevention for low-level languages

Ainsworth S (2020) Prefetching in functional languages

Ainsworth S (2020) The Guardian Council

Arif M (2021) Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring

Falsafi B (2022) MineSweeper: a "clean sweep" for drop-in use-after-free prevention

Helwani F (2022) ACIA: A Methodology for identification of Architectural Design Patterns that support Continuous Integration based on Continuous Assessment

Ho H (2021) Timed hyperproperties

Key Findings
Further Funding
Research Databases and Models
Intellectual Property
Software and Technical Products


Description	We have created a prototype binary parallelisation tool, Janus, capable of splitting applications into threads that run in parallel. It is also capable of taking advantage of modern vector ISAs within general-purpose microprocessors to increase the amount of code executed in parallel.
Exploitation Route	People can use the Janus tool to parallelise their own applications, even if they don't have the source code, enabling better use of the underlying microprocessor and faster program execution. In addition, we have used the Janus tool in on-going research into binary translation systems, including a language for binary modification (called Cinnamon) and a tool under development that adds security mechanisms to a binary.
Sectors	Digital/Communication/Information Technologies (including Software)


Description	ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance
Amount	£1,091,792 (GBP)
Funding ID	EP/W00576X/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	02/2022
End	03/2025


Title	Research Data supporting "Compendia: Reducing Virtual-Memory Costs via Selective Densification"
Description	A Linux kernel patch to extend the Compendia simulator, using the same methodology and extending the code of BadgerTrap.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	-
URL	https://www.repository.cam.ac.uk/handle/1810/322525


Title	Research data supporting "CHERIvoke: Characterising Pointer Revocation using CHERI Capabilities for Temporal Memory Safety"
Description	Source code for sweeper and modified dlmalloc (called dlmalloc_cherivoke) that implements the CHERIvoke technique. See the file README.md for a detailed description and usage instructions.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes


Title	Research data supporting "Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring"
Description	Cinnamon compiler and backends for Janus, Pin and Dyninst to compile Cinnamon programs. Also the Cinnamon branch of the Janus repository. See the README.md file for more detailed information and usage instructions.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
Impact	-
URL	https://www.repository.cam.ac.uk/handle/1810/321769


Title	Research data supporting "Duplo: A Framework for OCaml Post-Link Optimisation"
Description
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/307491


Title	Research data supporting "HALO: Post-Link Heap-Layout Optimisation"
Description
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/300136


Title	Research data supporting "MineSweeper: A "Clean Sweep" for Drop-In Use-After-Free Prevention"
Description	Contains the MineSweeper implementation, an allocator extension implemented on top of JeMalloc to mitigate use-after-free attacks, together with scripts to evaluate its running time and memory overheads on the SPEC CPU2006 benchmarks.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	-
URL	https://www.repository.cam.ac.uk/handle/1810/332941


Title	Research data supporting "ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance"
Description	Gem5 modifications to support performance evaluation of using heterogeneous cores to perform processor error correction when undervolting or overclocking (the ParaDox technique). See the README.md file for more detailed information and usage instructions.
Type Of Material	Database/Collection of data
Year Produced	2021
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/316858


Title	Research data supporting "Quantifying the Semantic Gap Between Serial and Parallel Programming"
Description	Source code for the VIA tool used to analyse applications. Please see README file for details
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	Publication of the research paper "Quantifying the Semantic Gap Between Serial and Parallel Programming"
URL	https://www.repository.cam.ac.uk/handle/1810/340561


Title	Research data supporting "The Guardian Council: Parallel Programmable Hardware Security"
Description	This artefact contains the simulator, a sample benchmark, guardian kernels, and scripts to reproduce and plot experiments from the ASPLOS 2020 paper by S. Ainsworth and T. M. Jones -- The Guardian Council: Parallel Programmable Hardware Security, which uses a many-core support architecture to provide programmable security analysis for conventional processors. It can be used standalone for a short run, or with a copy of SPEC CPU2006 to fully re-evaluate the experiments from the original paper.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/300491


Title	Research data supporting "The Janus Triad: Exploiting Parallelism Through Dynamic Binary Modification"
Description	A static binary analysis tool, profile information, a DynamoRIO client and benchmarks to show binary parallelisation using Janus. For more information, see the README.txt file.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes


Title	AN APPARATUS AND METHOD FOR SPECULATIVELY VECTORISING PROGRAM CODE
Description	An apparatus and method are provided for speculatively vectorising program code. The apparatus comprises processing circuitry for executing program code, the program code including an identified code region comprising at least a plurality of speculative vector memory access instructions. Execution of each speculative vector memory access instruction is employed to perform speculative vectorisation of a series of scalar memory access operations using a plurality of lanes of processing. Tracking storage is used to maintain, for each speculative vector memory access instruction, tracking information providing an indication of a memory address being accessed within each lane. Checking circuitry then references the tracking information during execution of the identified code region by the processing circuitry, in order to detect any inter lane memory hazard resulting from the execution of the plurality of speculative vector memory access instructions. For at least a first type of inter lane memory hazard, a status storage element is used to maintain an indication of each lane for which the checking circuitry has determined the presence of that type of memory hazard. Replay determination circuitry is then arranged, when an end of the identified code region is reached, to be responsive to the status storage element identifying at least one lane as having an inter lane memory hazard, to trigger re-execution of the identified code region for each lane identified by the status storage element. Such an approach can significantly increase the ability to vectorise scalar code, hence resulting in significant performance improvements.
IP Reference	WO2021001641
Protection	Patent granted
Year Protection Granted	2021
Licensed	No
Impact	Arm internal evaluation


Title	Chromium
Description	Web browser
Type Of Technology	Webtool/Application
Year Produced	2020
Open Source License?	Yes
Impact	Influenced the design of the StarScan memory safety algorithms in both Chrome (Google's closed source browser) and Chromium (open-source). A YouTube discussion on an early version of this is available here: https://www.youtube.com/watch?v=ohlxw5kDn-k (mention of MarkUs being the influence at 29 minutes) and a blog post on the techniques here: https://security.googleblog.com/2022/05/retrofitting-temporal-memory-safety-on-c.html
URL	https://www.chromium.org/


Title	Cinnamon
Description	The Cinnamon domain-specific language and compiler targets DynamoRIO (via the Janus tool), Pin and Dyninst binary instrumentation systems to allow flexible and framework-independent ways of building binary profiling and monitoring tools.
Type Of Technology	Software
Year Produced	2021
Open Source License?	Yes
Impact	On-going
URL	https://www.cl.cam.ac.uk/~tmj32/data/


Title	Janus
Description	Janus is a binary parallelisation tool, also capable of automatically vectorising applications, as well as inserting useful software prefetches.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	We published two papers on Janus and it is the building block for on-going research in the group.
URL	https://www.cl.cam.ac.uk/~tmj32/data/

Abstract

Planned Impact

Organisations

People

ORCID iD

Publications