Hybrid Static / Dynamic Optimisations for Many-Cores: Breaking the Memory Wall

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

Fuelled by an exponential growth in computing capacity, society's reliance on information and computer systems has grown to encompass most activities and aspects of our lives. Mobile computing devices alone are set to outnumber humans in 2014, each such device providing more computational power than the supercomputers used to put men on the moon. Yet despite this exponential growth, society still hungers for more: consumers want new and improved smartphones and computers; scientists want to run more precise simulations, solve bigger problems or get results faster; we explore new computationally intensive technologies to bring forward the next generation of tools, from self-driving cars to robotic surgery. However, the "free lunch" is over: processors are no longer getting faster. Hard physical limits, such as energy and power density, have brought this success story to an abrupt end. To avoid this impasse, all major manufacturers are now offering an exponentially increasing number of processing units, or cores, per chip.

With ubiquitous parallel hardware and stagnating single-processor performance, the only way forward is to improve the scalability and efficiency of parallel execution. In this context, the critical impediment to performance is the widening gap between the computing capabilities of many-core architectures and their limited off-chip memory bandwidth: the memory wall. Traditional solutions attempt to reduce off-chip communications by improving data locality, which means that computations are scheduled in such a way that intermediate results are not sent back to main memory and then re-loaded on chip. Common approaches are either static, with an optimising compiler determining a program schedule such that all intermediate results fit within the available cache memory, or dynamic, with a runtime environment building the schedule during execution.
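A generic sketch (illustrative only, not code from this project) of the static approach: an optimising compiler can fuse a producer loop with its consumer loop, so each intermediate value is consumed while still cache-resident rather than written out as a whole array and re-loaded from main memory.

```python
# Illustrative sketch of static locality optimisation by loop fusion.
# The function names and the toy computation are invented for
# illustration; real compilers apply this to loop nests over arrays.

def unfused(a):
    # Producer loop: materialises the full intermediate array,
    # which must round-trip through the memory hierarchy.
    b = [x * 2 for x in a]
    # Consumer loop: re-reads every intermediate from memory.
    return [x + 1 for x in b]

def fused(a):
    # Fused schedule: each intermediate value lives only briefly
    # (in a register or cache line) and is never stored as an array.
    return [x * 2 + 1 for x in a]

# Both schedules compute the same result; only locality differs.
assert unfused(list(range(8))) == fused(list(range(8)))
```

The compiler can only perform this transformation when static analysis proves the two loops have compatible iteration spaces and no intervening dependences, which is the kind of restriction the next paragraph alludes to.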

The main criticisms of these techniques are that: (1) static approaches are very limited in scope, only applicable to restricted classes of computations; and (2) dynamic approaches are unaffordable, as the cost of finding an optimal schedule generally outweighs the gain. Current state-of-the-art approaches favour inexpensive, albeit imprecise and sub-optimal, dynamic heuristics. The research hypothesis of this project is that near-optimal dynamic schedules can be built efficiently, provided that the work is partitioned between the compiler and the runtime system. Alone, the compiler cannot optimise many important programs due to static analysis shortcomings when dealing with recursion or complex, dynamically allocated, data structures. However, if the compiler performs best-effort static optimisations and prepares the ground by providing the runtime environment with tractable subproblems, which are cheap to solve with the accurate information typically available during execution, then such a hybrid optimisation scheme becomes profitable and will lead to significant performance and scalability gains.
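The division of labour in the hybrid scheme can be sketched as follows (a minimal, invented example, not the project's implementation): a "compile-time" pass summarises each task's dependences once, leaving the runtime only a cheap ready-list scheduling subproblem that it solves with exact dynamic information.

```python
# Hypothetical sketch of the static/dynamic split. Task names and
# the dependence graph are illustrative.

def compile_time_summary(dag):
    """Static pass: dag maps each task to its set of predecessor
    tasks; precompute the number of unresolved inputs per task."""
    return {t: len(preds) for t, preds in dag.items()}

def runtime_schedule(dag, pending):
    """Dynamic pass: repeatedly run any task whose inputs are ready.
    O(edges) overall -- cheap enough to perform online."""
    order = []
    ready = [t for t, n in pending.items() if n == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        # Resolving t's output may make successors ready.
        for succ, preds in dag.items():
            if t in preds:
                pending[succ] -= 1
                if pending[succ] == 0:
                    ready.append(succ)
    return order

dag = {"load": set(), "fft": {"load"}, "sum": {"fft"}}
print(runtime_schedule(dag, compile_time_summary(dag)))
# -> ['load', 'fft', 'sum']
```

The static pass does the work that is expensive but input-independent; the runtime pass is left with a decision that is trivial once actual execution state is known, which is the essence of the profitability argument above.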

Planned Impact

The outcome of this project will deliver performance and scalability gains to software development companies that target multi- and many-core architectures, and as a consequence, it will also enhance consumer experience on computers and mobile devices equipped with new chips, providing faster and less power-hungry applications, an essential gain on smartphones and tablets. The mobile economy is not only one of the fastest growing economic sectors, but it represents a fundamental societal shift.

The most direct beneficiaries, financially speaking, are chip manufacturers. They will increasingly struggle to sell chips as more cores often provide no performance gains to existing applications. One of their main issues will be to find a way to show that new chips, with more cores, have an edge over previous generations, and this crucially depends on the capability of programmers to exploit vast numbers of additional cores. This project will enhance the scalability of parallel programs, which is as relevant to established chip manufacturers, such as ARM, IBM, Intel or Qualcomm, as to start-ups producing massively parallel many-core processors, such as Kalray with its MPPA architecture.
 
Description Dynamic task parallelism is an increasingly popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems, an unavoidable evolution for scalability.

We found that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. We designed and implemented run-time algorithms for NUMA-aware task and data placement that are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes in application behaviour. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatisation of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data.

We demonstrated that the benefits of dynamically managing data placement outweigh the privatisation cost, even when compared with target-specific optimisations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5x compared to a NUMA-aware hierarchical work-stealing baseline.
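The core of such a placement decision can be sketched in a few lines (a hypothetical illustration, not the OpenStream runtime's code; the names `pick_node` and `node_of` are invented): schedule each ready task on the NUMA node that already holds the largest share of its input data, so that most of its memory accesses stay local.

```python
# Illustrative NUMA-aware placement heuristic: weigh each candidate
# node by the bytes of task input data resident on it.
from collections import Counter

def pick_node(task_inputs, node_of):
    """task_inputs: list of (buffer_id, size_bytes) pairs;
    node_of: maps buffer_id -> NUMA node holding its pages."""
    weight = Counter()
    for buf, size in task_inputs:
        weight[node_of[buf]] += size
    # The node owning the most input bytes wins.
    return weight.most_common(1)[0][0]

# Buffers "b" and "c" (12 KiB of 16 KiB total) live on node 1,
# so the task is placed there.
node_of = {"a": 0, "b": 1, "c": 1}
inputs = [("a", 4096), ("b", 8192), ("c", 4096)]
print(pick_node(inputs, node_of))  # -> 1
```

A real runtime combines such a locality score with load balancing, as a node that owns all the data but is already saturated may still be a poor choice.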
Exploitation Route Our scheduling and memory allocation techniques could benefit most other asynchronous many-task (AMT) parallel programming environments, such as OpenMP, StarSs, Cilk, Habanero/X10 or TBB.

Furthermore, we have included these optimisations in the runtime system of the OpenStream programming language (http://openstream.info), so any application implemented using this language will directly benefit from our results.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://www.openstream.info/
 
Description Cross Site Proposal
Amount € 3,300 (EUR)
Organisation EUROLAB4HPC 
Sector Public
Country European Union (EU)
Start 03/2016 
End 04/2016
 
Description H2020 FET-HPC
Amount € 780,069 (EUR)
Funding ID H2020-671578 
Organisation European Research Council (ERC) 
Sector Public
Country Belgium
Start 10/2015 
End 09/2018
 
Description University Research Fellowship
Amount £457,406 (GBP)
Organisation Royal Academy of Engineering 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2015 
End 12/2019
 
Title Aftermath 
Description Aftermath is a graphical tool for the trace-based performance analysis of OpenMP and OpenStream programs. The tool allows programmers to relate performance data from a parallel execution to the programming model, e.g. to loops and tasks in OpenMP. Aftermath supports the interactive exploration of trace files, the on-the-fly generation of detailed statistics for arbitrary subsets of a trace, and the detailed inspection of individual events. Aftermath enables fine-grained, model-centric analysis of parallel programs, in particular allowing programmers to detect performance anomalies and to find correlations between low-level hardware, runtime or operating system events and the anomalies observed during parallel execution. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The Aftermath tool for parallel program execution analysis has attracted significant interest from a wide range of academic and industrial partners, in particular attracting 40 participants at a tutorial organised jointly with the HiPEAC 2017 conference in Stockholm. It is currently used for teaching parallel programming at institutions in France and Germany. The tool was instrumental in enabling research that resulted in a Best Paper Award at PACT 2016. 
URL https://www.aftermath-tracing.com/
 
Description Aftermath Tutorial at HiPEAC 2017 in Stockholm 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Trace-based, post-mortem analysis has been established as a key technique for the performance analysis of parallel programs. It provides insight into the dynamic aspects of program execution that are inaccessible to static analysis, such as the interaction between the application, the run-time system, the operating system and the hardware. Existing tools for trace-based analysis provide various performance metrics, including function call profiles, hardware performance counter data and memory allocation, but only a few tools are able to relate performance data to the programming and execution model.
This tutorial introduced and provided hands-on experience with Aftermath, a graphical tool for the analysis of OpenMP and OpenStream programs, specifically designed for cross-layer, language-centric performance analysis, relating low-level performance data to high-level concepts of the programming language, run-time, and application.

Out of the ~40 registered attendees, two academics mentioned that they had already started using the tool for research and for teaching, and two expressed their intent to use the tool in their courses. One industry attendee discussed plans to develop an R&D project around the tool.
Year(s) Of Engagement Activity 2017
URL https://www.hipeac.net/events/activities/7447/aftermath/