ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

Since the turn of the century, multicore processors have become commonplace in almost all computing domains. Instead of performance coming solely from the extraction of instruction-level parallelism (ILP), it now also requires software developers or compilers to break applications into multiple streams of instructions to exploit coarse-grained thread-level parallelism (TLP). Whilst this is extremely beneficial for a large class of programs, single-threaded performance still matters greatly, especially during sequential parts of an application, where execution speed can dominate overall program performance (sometimes dubbed "Amdahl's cruel law"). In addition, improvements in single-threaded performance benefit all applications, as each thread experiences a performance uplift, thus impacting all parts of the code, both sequential and parallel.

However, improving single-threaded performance is hard. The move to multicore was driven by the power limitations of complex out-of-order hardware schemes to extract ILP (caused by the failure of Dennard scaling in the underlying transistor technologies). While designers do still increase the out-of-order instruction window, unfortunately this only makes a marginal difference and future designs are expected to be limited by Pollack's rule and the fundamental limits of ILP (the ILP wall). Conversely, although many applications would see a major performance boost from taking advantage of TLP, actually extracting it remains a challenge (John Hennessy said writing parallel code is "a problem that's as hard as any that computer science has faced").

This project takes a radically different approach. Instead of going back to the future with elaborate schemes for out-of-order execution, it explores the space between ILP and the coarse-grained TLP exploited by modern multicores. In particular, it focuses on the extraction of fine-grained TLP from a single stream of instructions within and across cores. On the one hand, it will investigate schemes to identify and spin up independent short-running threads (hardware threadlets) transparently to the application, so as to boost single-threaded performance. On the other, it will research compiler techniques to indicate this parallelism, with the hardware able to exploit it within and across multiple tightly coupled cores. If successful, this project would lead to a step change in the performance of high-performance cores, driven by increased utilisation of core resources and the ability to increase those resources in a scalable manner. It would also open up a broader design space, trading the out-of-order pipeline complexity needed to extract ILP against increased TLP, to find better balances between area, efficiency and application-domain suitability.

Publications

 
Description ARM 
Organisation Arm Limited
Country United Kingdom 
Sector Private 
PI Contribution Three PhD students work on topics related to the research grant. They have mentors in ARM to ensure that their work is relevant and follows the direction in which industry is heading.
Collaborator Contribution Funding for PhD students and the appointment of mentors for them. Students spend three months over the course of their studies within ARM Research. For the rest of the time they meet with their mentors once a month, on average.
Impact Two patent filings, not yet granted. Numerous publications. A follow-on grant.
Start Year 2013
 
Title OptiWISE 
Description OptiWISE is a tool for application analysis. It is a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OptiWISE then combines this information to give a highly detailed per-instruction CPI (cycles per instruction) metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. 
Type Of Technology Webtool/Application 
Year Produced 2024 
Open Source License? Yes  
Impact None as yet, but it is helping us advance our own research 
URL https://github.com/CompArchCam/optiwise
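
The description above says that per-instruction CPI is obtained as the ratio of samples to execution counts. A minimal sketch of that arithmetic follows. It is not OptiWISE's actual code or interface: the function names, the dictionary inputs keyed by instruction address, and the `cycles_per_sample` parameter (the assumed number of cycles between samples, used to scale sample counts into cycles) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_instruction_cpi(samples, exec_counts, cycles_per_sample):
    """Estimate CPI for each instruction address.

    samples:            dict mapping address -> samples landing on it (sampling run)
    exec_counts:        dict mapping address -> times it executed (instrumented run)
    cycles_per_sample:  assumed sampling period in cycles (hypothetical parameter)
    """
    cpi = {}
    for addr, count in exec_counts.items():
        if count == 0:
            continue  # never executed, so CPI is undefined
        cycles = samples.get(addr, 0) * cycles_per_sample
        cpi[addr] = cycles / count
    return cpi


def aggregate_cpi(samples, exec_counts, cycles_per_sample, group_of):
    """Aggregate CPI per group (for example loop, source line or function).

    group_of is a hypothetical callable mapping an address to its group key.
    The result is total estimated cycles divided by total executed instructions.
    """
    cycles = defaultdict(int)
    instrs = defaultdict(int)
    for addr, count in exec_counts.items():
        key = group_of(addr)
        cycles[key] += samples.get(addr, 0) * cycles_per_sample
        instrs[key] += count
    return {key: cycles[key] / instrs[key] for key in instrs if instrs[key] > 0}


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    samples = {0x10: 30, 0x14: 10}
    exec_counts = {0x10: 1000, 0x14: 1000, 0x18: 0}
    print(per_instruction_cpi(samples, exec_counts, cycles_per_sample=100))
    print(aggregate_cpi(samples, exec_counts, 100, group_of=lambda addr: "f"))
```

Aggregating by summing cycles and instructions before dividing, rather than averaging per-instruction CPI values, weights each instruction by how often it ran.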
 
Title Research data supporting "OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis" 
Description OptiWISE is a profiling tool for x86-64 and AArch64 processors running Linux. It aims to help a user understand a target program's performance characteristics by reporting detailed statistics such as per-instruction CPI. It achieves this by running the application twice: once with low-overhead sampling and once with high-overhead dynamic instrumentation. This archive contains the tool's source code at v0.9.0, as well as a copy of the tool built for x86-64. It also contains scripts that use the tool to run several experiments demonstrating the tool's output and overheads. These experiments generate Figures 1, 7, and 10 of the corresponding publication, "OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis".
Type Of Technology Software 
Year Produced 2023 
Impact None as yet, but it is helping in our own research 
URL https://www.repository.cam.ac.uk/handle/1810/361537