Compiler and Architectural Support for Ultra-Fine-Grained Parallelism

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

The main idea is to split an application into tiny fragments of code, each performing a small fraction of the computation of the original program. In the extreme, each fragment would load a value from memory, perform a simple computation, then store the result back. These fragments are then scheduled dynamically onto a sea of low-power processors, based on the flow of data through the program. The aim is to achieve significant application speedups through parallel execution of the code.
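
As an illustration of what such a fragment might look like (a hypothetical example, not taken from the project), an element-wise multiply could be split into N independent fragments, each loading two values, performing one multiply, and storing one result, with the data flow between fragments driving the schedule:

    // Hypothetical example: the loop
    //
    //     for (int i = 0; i < N; ++i)
    //         c[i] = a[i] * b[i];
    //
    // could be split into N independent fragments, each of roughly this shape.
    void fragment_i(const float* a, const float* b, float* c, int i) {
        float x = a[i];   // load
        float y = b[i];   // load
        c[i] = x * y;     // simple computation, then store back
    }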
The technique would be achieved by analysing an application during compilation to determine its data dependences. A transformation pass would create the tiny code fragments, then link them together according to those dependences. At runtime, a pool of ready fragments (those with all dependences satisfied) would be maintained, and any ready fragment could be scheduled on any of the low-power processors. Once a fragment had finished execution, it would check its dependent fragments and place into the pool any that had no remaining unexecuted dependences.

There are several challenges to address in order to achieve a functioning and high-performing prototype system. The main challenge is in splitting up an application into code fragments. The optimal size of each fragment needs to be determined, as does its composition in terms of the types of instruction it contains. For example, should each fragment contain at most one store, are multiple loads a good idea, and should some code be duplicated across fragments? Linking the fragments together then requires the definition of a binary format, and sufficiently precise dependence analysis of the original program to keep the number of dependents per fragment as low as possible.
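
The runtime bookkeeping described above could be realised with a per-fragment counter of unmet dependences. The sketch below is a minimal, hypothetical illustration of that scheme (the names Fragment, ReadyPool and run_one are invented for the example, and a single global queue is assumed for simplicity): when a fragment finishes, it decrements the counter of each dependent and pushes any that reach zero into the ready pool.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <vector>

    struct Fragment {
        std::function<void()> body;         // the tiny piece of computation
        std::atomic<int> unmet_deps{0};     // dependences not yet satisfied
        std::vector<Fragment*> dependents;  // fragments waiting on this one
    };

    struct ReadyPool {                      // simplest option: one global queue
        std::mutex m;
        std::deque<Fragment*> ready;

        void push(Fragment* f) {
            std::lock_guard<std::mutex> lock(m);
            ready.push_back(f);
        }
        Fragment* pop() {
            std::lock_guard<std::mutex> lock(m);
            if (ready.empty()) return nullptr;
            Fragment* f = ready.front();
            ready.pop_front();
            return f;
        }
    };

    // Executed by each low-power processor: run a fragment, then decrement the
    // counters of its dependents and enqueue any that become ready.
    void run_one(ReadyPool& pool, Fragment* f) {
        f->body();
        for (Fragment* d : f->dependents) {
            if (d->unmet_deps.fetch_sub(1) == 1)  // last dependence satisfied
                pool.push(d);
        }
    }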
At runtime, the main challenge is an efficient implementation of the pool of ready fragments. The pool could be a global data structure, local to each processor, or hierarchical, and it needs to support fast access. A method is needed to dispatch fragments into the pool efficiently and to pop them off again. To schedule fragments, each processor must decide whether to take the oldest ready fragment, the one that involves the least data transfer, or one that satisfies some other metric, such as the number of dependents. This choice will be influenced, to an extent, by the implementation of the ready pool.
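
One possible organisation, sketched below purely for illustration, is a local pool per processor with idle processors stealing work from the others; the names LocalPool and next_fragment are hypothetical. The pop policy shown is the classic work-stealing one (the owner takes the youngest ready fragment, thieves take the oldest); selecting by least data transfer or fewest dependents instead would change only the choice made in pop() and steal().

    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct Fragment;  // as in the previous sketch

    class LocalPool {
        std::mutex m;
        std::deque<Fragment*> q;
    public:
        void push(Fragment* f) {
            std::lock_guard<std::mutex> l(m);
            q.push_back(f);
        }
        Fragment* pop() {    // owner takes the most recently produced fragment
            std::lock_guard<std::mutex> l(m);
            if (q.empty()) return nullptr;
            Fragment* f = q.back(); q.pop_back(); return f;
        }
        Fragment* steal() {  // thieves take the oldest fragment instead
            std::lock_guard<std::mutex> l(m);
            if (q.empty()) return nullptr;
            Fragment* f = q.front(); q.pop_front(); return f;
        }
    };

    // Each processor first drains its own pool, then tries to steal from others.
    Fragment* next_fragment(std::vector<LocalPool>& pools, std::size_t self) {
        if (Fragment* f = pools[self].pop()) return f;
        for (std::size_t i = 0; i < pools.size(); ++i)
            if (i != self)
                if (Fragment* f = pools[i].steal()) return f;
        return nullptr;
    }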

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509620/1                                   01/10/2016  30/09/2022
1940174            Studentship   EP/N509620/1  01/10/2017  31/03/2021  Joseph Isaacs