Hybrid Static/Dynamic Scheduling for Task Dataflow Parallel Programs

Lead Research Organisation: Queen's University Belfast

Department Name: Sch of Electronics, Elec Eng & Comp Sci

Abstract

Traditionally, software development has benefit tremendously from the exponential performance increase that processors, the central computing units in computers, have witnessed. Up until about 2004, processor performance doubled about every 18 to 24 months. This trend could however not be sustained due to physical limitations, most importantly constraints on energy consumption. For this reason, processor manufacturars have switched to integrating multiple processor cores on a chip. These processors still allow an overall performance growth at similar rates as before 2004. However, software must be rewritten to utilize all processing cores in order to benefit from this performance potential. The pressure is now on software development as software must have several independent threads of execution, i.e., software must be parallel (or concurrent). The development of high-performance parallel software is non-trivial and is a specialisation of its own. The key problem with parallelism is that it must be taken into account throughout the design of software. Moreover, optimising the performance of parallel software requires many code changes that often increase performance only on specific computers. Parallelism in software imposes a dual expertise on software developers: expertise in the problem domain and expertise in parallel programming. Such a dual expertise is counterproductive in many respects and potentially leads to more costly, less effective and less functional software.

This project aims to alleviate the dual expertise problem by advancing knowledge and technology on parallel programming models based on task dataflow. These programming models separate the specification of the program from the detection of parallelism, thus shifting the focus towards correctness of software and ease of development. Task dataflow models however depend on dynamic analysis of parallelism, which adds to the execution time overhead and restricts the model to programs with coarse-grain parallelism. In contrast, it is known that statically scheduled programs (where parallelism has been decided and mapped out before the program executes) allow considerably finer-grain parallelism.

This project will investigate techniques to reconcile the benefits of dynamically scheduled task dataflow programs with the benefits of static scheduling. To this end, we will investigate compilation techniques and extensions to dynamic schedulers that allow embedding statically scheduled fine-grain parallel components inside coarse-grain dynamically scheduled programs.

If successful, this project will generate both scientific knowledge and long-term practical value for the ICT industry. This research programme will also make initial steps in the philosophically important issue of recompiling parallel programs, an issue that has been largely ignored in the past due to its sheer complexity. This research programme furthermore aligns with the EPSRC ICT capability priority on Many-core architectures and concurrency in distributed and embedded systems".

Planned Impact

The main long-term goal of this project is to improve the programmability of many-core processors, which means that better parallel software may be written with less programming effort. The term 'better' is very wide and may include higher performance, better energy-efficiency, more or better functionality, less bugs especially those related to concurrency, etc. In order to arrive at such better parallel software, tools will need to be developed based on the results of the proposed research to support software developers.

Thus, contributing to improving programmability impacts the United Kingdom's ICT industry and society in several ways, using EPSRC nomenclature:
1. Enhancing the effectiveness and sustainability of organisations including public services and businesses:
Computing is a means to enhancing services and increasing revenue: it is not computing itself but the adoption of new processes and techniques enabled by computing that creates productivity growth of businesses. Often, but not always, computing requirements strain the capabilities of current computing systems and the adoption of high-performance and parallel computing techniques becomes a requirement. In the current state of affairs, business that develop software of rmany-core processors will need to invest in people that specialise in parallelizing and optimizing software for these processors. As a result, programmers need to develop a dual expertise: expertise in the products that they develop and expertise in parallel programming. This will reflect negatively on the cost of software development and may pose problems to attract or train the required staff. Moreover, the quality of the software may also be reduced because parallel software is susceptible to a whole range of subtle concurrency bugs that do not exist in single-threaded software.

The goal of this project is to reduce the expertise in parallel programming required. This will improve the effectiveness and concurrential position of companies that adopt programming languages with improved programmability.

2. Commercialisation and exploitation:
The proposed research may be commercialised in the form of tools (compilers) and execution environment (runtime system). These will be used by the industry described above to construct better software. Many tool companies already exist in the UK that have carved out a particular problem domain and market tools that aid to solve problems related to programmability. Such companies include ARM Ltd., CriticalBlue, Maxeler and Codeplay, to name just a few. Such tool companies can benefit from the proposed research by commercialising and exploiting research results of this project.

3. Wealth creation, economic prosperity and regeneration:
Better tools lead to better software products with more functionality, better performance, better energy-efficiency and better quality (fewer bugs). Better software again supports industry to perform better in whatever use they have of software, be it in manufacturing and design, science, financial markets, communications, digital rendering or media and entertainment. Moreover, the results of this project will help the IT sector as a whole to keep up with the growth rate of IT functionality and pervasiveness that we have grown accustomed to, which has been put at threat by the many-core revolution. As such, wealth creation and economic prosperity will follow from improved software for a range of companies.

Following the analysis above, the direct industrial beneficiaries of this project are tool companies. These tool companies will be addressed through publishing and showcasing our research results and through making the project's scheduler and compiler available as open source software. This way, evidence of the effectiveness of the developed techniques as well as present potential ways to implement them are presented.

Funded Value:

£96,235

Funded Period:

Dec 14 - Mar 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/L027402/1

Principal Investigator:

Hans Vandierendonck

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Computer Sys. & Architecture (50%)

Fundamentals of Computing (50%)

Organisations

Queen's University Belfast (Lead Research Organisation)

People	ORCID iD
Hans Vandierendonck (Principal Investigator)	http://orcid.org/0000-0001-5868-9259

Publications

Author Name

Title Publication Date Published

|< < 1 2 > >|

10 25 50

Arif M (2021) Reducing the burden of parallel loop schedulers for many-core processors in Concurrency and Computation: Practice and Experience

Arif M (2018) Reducing the burden of parallel loop schedulers for many-core processors

Arif M (2018) Reducing the burden of parallel loop schedulers for many-core processors in ACM SIGPLAN Notices

Arif M (2016) A scalable and composable map-reduce system

Arif M (2015) OpenMP: Heterogenous Execution and Data Movements

Sun J (2020) Fast load balance parallel graph analytics with an automatic graph data structure selection algorithm in Future Generation Computer Systems

Sun J (2017) GraphGrind

Sun J (2017) Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning

Vandierendonck H (2016) HPTA: High-performance text analytics

Vandierendonck, H. (2015) Compiler and Runtime Support for Hybrid Static/Dynamic Scheduling

Key Findings
Further Funding
Software and Technical Products
Engagement Activities


Description	The current state-of-the-practice is that programmers need to explicitly encode parallelism in software in order to exploit the increasing number of processing cores available on modern processors. This has a number of ramifications on software and the software development process, in particular, that software developers need to spent time and effort to decide what pieces of the software should be adapted for parallel execution and in what way the parallelism is best expressed. This project has investigated new ways for programmers to express parallelism in software. This new approach does not explicitly require programmers to express parallelism. Instead, programmers record properties of code from which parallelism is deduced. Programmers need to encode the "footprint" of tasks, i.e., all variables or memory locations that may be accessed by a well-defined piece of code, typically a procedure in a program. These properties are known by programmers anyway as they form part of the functionality and the contract of the code. Expressing them explicitly is thus a minor overhead. The project has made following discoveries: 1. Complex, long-running programs may contain a variety of coarse-grain parallel loops and fine-grain parallel loops. Coarse-grain parallel loops contain large tasks that take a relatively long time to execute while fine-grain parallel loops contain short-running tasks. While it is technically easy to schedule coarse-grain tasks on processing cores, there exist real applications contain fine-grain parallel loops where tasks take little time in comparison to the overhead of scheduling the execution of these tasks. This overhead is known as the burden of the scheduler. We have observed this overhead on a range of applications in graph analytics, data analytics and high-performance computing. 2. We propose static scheduling as a means to reduce the scheduling burden. This static scheduling is calculated quickly at execution time prior to starting a loop, yet cannot adapt to workload imbalance and performance variability as general (dynamic) schedulers do. We have demonstrated that (i) static scheduling consistently improves the performance of fine-grain parallel loops on a 48-core system (i.e., load imbalance is not an issue in these loops) and (ii) a hybrid scheduler combining static and dynamic scheduling can be built with negligible overhead to switch between the modes of execution. 3. We found that reducing scheduler burden improves performance portability. Performance portability means that a program that executes fast on one processor type or configuration will also execute fast on another. We have found that (i) migrating parallel software from a low-core count machine to a high-core count machine may degrade performance for fine-grain loops; (ii) static scheduling is more robust to retain a speedup when performing this migration; (iii) static scheduling scales to larger core counts compared to dynamic schedulers in the case of fine-grain paralle loops. 4. We have developed and implemented the Swan parallel programming language, an extension of the MIT/Intel Cilkplus language, to experiment with scheduling strategies. We have implemented hybrid scheduling and NUMA-aware scheduling in Swan and demonstrated the efficiency of Swan on a range of applications, including numerical analysis, text analytics and graph analytics.
Exploitation Route	Our findings have lead to new scheduling policies that have been implemented in the Swan language and runtime system. Compiler and runtime support is available. As such, anyone with programming skills can take the Swan software systems and enhance the speed of their software using the new scheduling policies. This is mostly relevant to software performance engineers working in high-performance computing, data analytics/big data and data-intensive embedded systems. We will work on the pathways to impact in the coming years by engaging in first instance with the high-performance computing community and seeking collaborations to apply Swan in real high-performance computing projects. We have been discussing with ECMWF, who are in the process of evaluating a modified version of the OpenMP runtime that implements our fine-grain scheduling techniques. This and similar collaborations will provide us with additional insight that we can use to enhance and extend Swan. We will furthermore use our links to the OpenMP language committee which drafts the OpenMP standard in search of adoption of research outcomes of this project.
Sectors	Digital/Communication/Information Technologies (including Software) Other


Description	(Entrans) - Energy Efficient Transprecision Techniques for Linear Solver
Amount	€ 183,455 (EUR)
Funding ID	798209
Organisation	European Commission
Sector	Public
Country	European Union (EU)
Start	05/2018
End	05/2020


Description	Mini-project/Meeting Fund of ReCoVER LWEC Network
Amount	£2,600 (GBP)
Funding ID	RFFMPM019
Organisation	ReCoVER Network, EPSRC
Sector	Academic/University
Country	United Kingdom
Start	06/2017
End	09/2017


Description	Software for the Future Call II
Amount	£963,929 (GBP)
Funding ID	EP/M01147X/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	03/2015
End	03/2018


Title	Extension of Clang (LLVM) compiler for task dataflow programs
Description	The Clang/LLVM compiler is a generic and extensible compiler for a variety of C-like languages. It has been extended to compile the Swan language extensions. Swan is an extension of the MIT Cilk/Intel Cilkplus language and adds dataflow annotations to tasks. The software recognises now special variables that signify dataflow dependences between tasks and generates code for checking and enforcing dependences at runtime. The generated code interfaces with the Intel Cilkplus runtime, which has been also modified to schedule the task dataflow programs. License info LLVM and Clang: https://opensource.org/licenses/UoI-NCSA.php URLs: https://github.com/hvdieren/swan_clang https://github.com/hvdieren/swan_llvm
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The software is still under development. We are using it in ongoing research on scheduling of parallel programs and in a PhD project on optimising graph analytics algorithms. Partners in the EU FP7 project ASAP have been using it in the context of multi-engine data analytics.
URL	https://github.com/hvdieren/swan_clang


Title	Software - Swan usage examples
Description	A collection of examples to use the Swan language for parallel programs, using the task dataflow annotations, NUMA-aware scheduling hints or a fine-grain scheduling hint. These examples require the Swan compiler and runtime system. URLs: https://github.com/hvdieren/swan_tests
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The software is still under development. We are using it in ongoing research on scheduling of parallel programs and in a PhD project on optimising graph analytics algorithms. Partners in the EU FP7 project ASAP have been using it in the context of multi-engine data analytics.
URL	https://github.com/hvdieren/swan_tests


Title	Software - data analytics operators implemented in the Swan language
Description	The implementation of example operations in data analytics, in particular K-means clustering and term frequency - inverse document frequency (TF-IDF) analysis. These operations are commonly used in data analytics and are implemented here in the Swan programming language. The software furthermore contains a library of common operations and data structures used in text analytics. The software demonstrates the application of the Swan language to efficiently parallelise data-intensive programs.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The software is still under development. We are using it in ongoing research on scheduling of parallel programs and in a PhD project on optimising graph analytics algorithms. Partners in the EU FP7 project ASAP have been using it in the context of multi-engine data analytics.
URL	https://github.com/hvdieren/asap_operators


Title	Swan - Extension of Intel Cilkplus runtime for parallel programs
Description	This project has developed the Swan programming language for parallel programs using the task dataflow model. Swan is an extension of the MIT Cilk/Intel Cilkplus language and adds dataflow annotations to tasks. The software recognises now special variables that signify dataflow dependences between tasks and generates code for checking and enforcing dependences at runtime. The runtime is intended to be used along with a compiler that generates appropriate calls to the runtime system. The Intel Cilkplus runtime comes under multiple licenses; BSD-3 license is appropriate URLs: https://github.com/hvdieren/swan_runtime
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The software is still under development. We are using it in ongoing research on scheduling of parallel programs and in a PhD project on optimising graph analytics algorithms. Partners in the EU FP7 project ASAP have been using it in the context of multi-engine data analytics.
URL	https://github.com/hvdieren/swan_runtime


Description	Research Impact Showcase - DNA of Innovation
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	The DNA of Innovation is a publication targeting the general public and media to showcase innovative research performed at Queen's. It is accompanied with a showcase event where researchers featured in the booklet man stands and explain their research to the general public.
Year(s) Of Engagement Activity	2015


Description	Talk at IET On Campus event "Meet the Lecturer"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Undergraduate students
Results and Impact	Around 40 students attended talks by academics and post-docs in the School of Electronics, Electrical Engineering and Computer Science, which sparked question and discussion afterwards.
Year(s) Of Engagement Activity	2017


Description	W5 STEM Outreach Event (career talks, speed networking)
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	Pupils from various schools in Northern Ireland visited W5 for a STEM speed networking event. Main audience is pre-A levels. Discussions sparked interest in the subject area and clarified appropriate subjects for A levels to prepare for studies in this area.
Year(s) Of Engagement Activity	2016

Abstract

Planned Impact

Organisations

People

ORCID iD

Publications