ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance

Lead Research Organisation: University of Cambridge
Department Name: Computer Laboratory

Abstract

Since the turn of the century, multicore processors have become commonplace in almost all computing domains. Instead of performance coming solely from the extraction of instruction-level parallelism (ILP), it now also requires software developers or compilers to break applications into multiple streams of instructions to exploit coarse-grained thread-level parallelism (TLP). Whilst this is extremely beneficial for a large class of programs, single-threaded performance still matters greatly, especially during sequential parts of an application where execution speed can dominate overall program performance (sometimes dubbed "Amdahl's cruel law"). In addition, improvements in single-threaded performance benefit all applications, as each thread experiences a performance uplift, thus impacting all parts of the code, both sequential and parallel.

However, improving single-threaded performance is hard. The move to multicore was driven by the power limitations of the complex out-of-order hardware schemes used to extract ILP (limitations caused by the failure of Dennard scaling in the underlying transistor technologies). While designers do still increase the out-of-order instruction window, this makes only a marginal difference, and future designs are expected to be limited by Pollack's rule and the fundamental limits of ILP (the ILP wall). Conversely, although many applications would see a major performance boost from taking advantage of TLP, actually extracting it remains a challenge (John Hennessy said writing parallel code is "a problem that's as hard as any that computer science has faced").

This project takes a radically different approach. Instead of going back to the future with elaborate schemes for out-of-order execution, it explores the space between ILP and the coarse-grained TLP exploited by modern multicores. In particular, it focuses on the extraction of fine-grained TLP from a single stream of instructions within and across cores. On the one hand, it will investigate schemes to identify and spin up independent short-running threads (hardware threadlets) transparently to the application, so as to boost single-threaded performance. On the other, it will research compiler techniques to indicate this parallelism, with the hardware able to exploit it within and across multiple tightly coupled cores. If successful, this project would lead to a step change in the performance of high-performance cores, driven by increased utilisation of core resources and the ability to increase those resources in a scalable manner. It would also open up a broader design space, trading the out-of-order pipeline complexity used to extract ILP against increased TLP, to find better balances between area, efficiency and application-domain suitability.

Publications
