Hybrid Static / Dynamic Optimisations for Many-Cores: Breaking the Memory Wall

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

Fuelled by an exponential growth in computing capacity, society's reliance on information and computer systems has grown to encompass most activities and aspects of our lives. Mobile computing devices alone are set to outnumber humans in 2014, each such device providing more computational power than the supercomputers used to put men on the moon. Yet despite this exponential growth, society still hungers for more: consumers want new and improved smartphones and computers; scientists want to run more precise simulations, solve bigger problems or get results faster; we explore new computationally intensive technologies to bring forward the next generation of tools, from self-driving cars to robotic surgery. However, the "free lunch" is over: processors are no longer getting faster. Hard physical limits, such as energy and power density, have brought this success story to an abrupt end. To avoid this impasse, all major manufacturers are now offering an exponentially increasing number of processing units, or cores, per chip.

With ubiquitous parallel hardware and stagnating single-processor performance, the only way forward is to improve the scalability and efficiency of parallel execution. In this context, the critical impediment to performance is the widening gap between the computing capabilities of many-core architectures and their limited off-chip memory bandwidth: the memory wall. Traditional solutions attempt to reduce off-chip communications by improving data locality, which means that computations are scheduled in such a way that intermediate results are not sent back to main memory and then re-loaded on chip. Common approaches are either static, with an optimising compiler determining a program schedule such that all intermediate results fit within the available cache memory, or dynamic, with a runtime environment building the schedule during execution.
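A generic sketch (illustrative only, not code from this project) of the static approach: an optimising compiler can fuse a producer loop with its consumer loop, so each intermediate value is consumed while still cache-resident rather than written out as a whole array and re-loaded from main memory.

```python
# Illustrative sketch of static locality optimisation by loop fusion.
# The function names and the toy computation are invented for
# illustration; real compilers apply this to loop nests over arrays.

def unfused(a):
    # Producer loop: materialises the full intermediate array,
    # which must round-trip through the memory hierarchy.
    b = [x * 2 for x in a]
    # Consumer loop: re-reads every intermediate from memory.
    return [x + 1 for x in b]

def fused(a):
    # Fused schedule: each intermediate value lives only briefly
    # (in a register or cache line) and is never stored as an array.
    return [x * 2 + 1 for x in a]

# Both schedules compute the same result; only locality differs.
assert unfused(list(range(8))) == fused(list(range(8)))
```

The compiler can only perform this transformation when static analysis proves the two loops have compatible iteration spaces and no intervening dependences, which is the kind of restriction the next paragraph alludes to.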

The main criticisms of these techniques are that: (1) static approaches are very limited in scope, only applicable to restricted classes of computations; and (2) dynamic approaches are unaffordable, as the cost of finding an optimal schedule generally outweighs the gain. Current state-of-the-art approaches favour inexpensive, albeit imprecise and sub-optimal, dynamic heuristics. The research hypothesis of this project is that near-optimal dynamic schedules can be built efficiently, provided that the work is partitioned between the compiler and the runtime system. Alone, the compiler cannot optimise many important programs due to static analysis shortcomings when dealing with recursion or complex, dynamically allocated, data structures. However, if the compiler performs best-effort static optimisations and prepares the ground by providing the runtime environment with tractable subproblems, which are cheap to solve with the accurate information typically available during execution, then such a hybrid optimisation scheme becomes profitable and will lead to significant performance and scalability gains.
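The division of labour in the hybrid scheme can be sketched as follows (a minimal, invented example, not the project's implementation): a "compile-time" pass summarises each task's dependences once, leaving the runtime only a cheap ready-list scheduling subproblem that it solves with exact dynamic information.

```python
# Hypothetical sketch of the static/dynamic split. Task names and
# the dependence graph are illustrative.

def compile_time_summary(dag):
    """Static pass: dag maps each task to its set of predecessor
    tasks; precompute the number of unresolved inputs per task."""
    return {t: len(preds) for t, preds in dag.items()}

def runtime_schedule(dag, pending):
    """Dynamic pass: repeatedly run any task whose inputs are ready.
    O(edges) overall -- cheap enough to perform online."""
    order = []
    ready = [t for t, n in pending.items() if n == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        # Resolving t's output may make successors ready.
        for succ, preds in dag.items():
            if t in preds:
                pending[succ] -= 1
                if pending[succ] == 0:
                    ready.append(succ)
    return order

dag = {"load": set(), "fft": {"load"}, "sum": {"fft"}}
print(runtime_schedule(dag, compile_time_summary(dag)))
# -> ['load', 'fft', 'sum']
```

The static pass does the work that is expensive but input-independent; the runtime pass is left with a decision that is trivial once actual execution state is known, which is the essence of the profitability argument above.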

Planned Impact

The outcome of this project will deliver performance and scalability gains to software development companies that target multi- and many-core architectures, and as a consequence, it will also enhance consumer experience on computers and mobile devices equipped with new chips, providing faster and less power-hungry applications, an essential gain on smartphones and tablets. The mobile economy is not only one of the fastest growing economic sectors, but it represents a fundamental societal shift.

The most direct beneficiaries, financially speaking, are chip manufacturers. They will increasingly struggle to sell chips as more cores often provide no performance gains to existing applications. One of their main issues will be to find a way to show that new chips, with more cores, have an edge over previous generations, and this crucially depends on the capability of programmers to exploit vast numbers of additional cores. This project will enhance the scalability of parallel programs, which is as relevant to established chip manufacturers, such as ARM, IBM, Intel or Qualcomm, as to start-ups producing massively parallel many-core processors, such as Kalray with its MPPA architecture.
 
Description Dynamic task parallelism is an increasingly popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems, an unavoidable evolution for scalability.

We found that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. We designed and implemented run-time algorithms for NUMA-aware task and data placement that are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes in application behaviour. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatisation of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data.

We demonstrated that the benefits of dynamically managing data placement outweigh the privatisation cost, even when compared with target-specific optimisations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5x compared to a NUMA-aware hierarchical work-stealing baseline.
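The core of such a placement decision can be sketched in a few lines (a hypothetical illustration, not the OpenStream runtime's code; the names `pick_node` and `node_of` are invented): schedule each ready task on the NUMA node that already holds the largest share of its input data, so that most of its memory accesses stay local.

```python
# Illustrative NUMA-aware placement heuristic: weigh each candidate
# node by the bytes of task input data resident on it.
from collections import Counter

def pick_node(task_inputs, node_of):
    """task_inputs: list of (buffer_id, size_bytes) pairs;
    node_of: maps buffer_id -> NUMA node holding its pages."""
    weight = Counter()
    for buf, size in task_inputs:
        weight[node_of[buf]] += size
    # The node owning the most input bytes wins.
    return weight.most_common(1)[0][0]

# Buffers "b" and "c" (12 KiB of 16 KiB total) live on node 1,
# so the task is placed there.
node_of = {"a": 0, "b": 1, "c": 1}
inputs = [("a", 4096), ("b", 8192), ("c", 4096)]
print(pick_node(inputs, node_of))  # -> 1
```

A real runtime combines such a locality score with load balancing, as a node that owns all the data but is already saturated may still be a poor choice.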
Exploitation Route Our scheduling and memory allocation techniques could benefit most other asynchronous many-task (AMT) parallel programming environments, such as OpenMP, StarSs, Cilk, Habanero/X10 or TBB.

Furthermore, we have included these optimisations in the runtime system of the OpenStream programming language (http://openstream.info), so any application implemented using this language will directly benefit from our results.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://www.openstream.info/
 
Description Cross Site Proposal
Amount € 3,300 (EUR)
Organisation EUROLAB4HPC 
Sector Public
Country European Union (EU)
Start 03/2016 
End 04/2016
 
Description H2020 FET-HPC
Amount € 780,069 (EUR)
Funding ID H2020-671578 
Organisation European Research Council (ERC) 
Sector Public
Country Belgium
Start 10/2015 
End 09/2018
 
Description University Research Fellowship
Amount £457,406 (GBP)
Organisation Royal Academy of Engineering 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2015 
End 12/2019
 
Title Aftermath 
Description Aftermath is a graphical tool for the trace-based performance analysis of OpenMP and OpenStream programs. The tool allows programmers to relate performance data from a parallel execution to the programming model, e.g. to loops and tasks in OpenMP. Aftermath supports the interactive exploration of trace files, the on-the-fly generation of detailed statistics for arbitrary subsets of a trace, and the detailed inspection of individual events. Aftermath enables fine-grained, model-centric analysis of parallel programs, in particular allowing programmers to detect performance anomalies and to find correlations between low-level hardware, runtime or operating system events and the anomalies observed during parallel execution. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The Aftermath tool for parallel program execution analysis has attracted significant interest from a wide range of academic and industrial partners, in particular attracting 40 participants at a tutorial organised jointly with the HiPEAC 2017 conference in Stockholm. It is currently used for teaching parallel programming at institutions in France and Germany. The tool was instrumental in enabling research that resulted in a Best Paper Award at PACT 2016. 
URL https://www.aftermath-tracing.com/
 
Description Aftermath Tutorial at HiPEAC 2017 in Stockholm 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Trace-based, post-mortem analysis has been established as a key technique for the performance analysis of parallel programs. It provides insight into the dynamic aspects of program execution that are inaccessible to static analysis, such as the interaction between the application, the run-time system, the operating system and the hardware. Existing tools for trace-based analysis provide various performance metrics, including function call profiles, hardware performance counter data and memory allocation, but only a few tools are able to relate performance data to the programming and execution model.
This tutorial introduced and provided hands-on experience with Aftermath, a graphical tool for the analysis of OpenMP and OpenStream programs, specifically designed for cross-layer, language-centric performance analysis, relating low-level performance data to high-level concepts of the programming language, run-time, and application.

Out of the ~40 registered attendees, two academics mentioned that they had already started using the tool for research and for teaching, and two expressed their intent to use the tool in their courses. One industry attendee discussed plans to develop an R&D project around the tool.
Year(s) Of Engagement Activity 2017
URL https://www.hipeac.net/events/activities/7447/aftermath/