SANDeRS: Smart, Adaptive Compilation for Dark Silicon

Lead Research Organisation: Lancaster University

Department Name: Computing & Communications

Abstract

We live in an era of multi-cores: computing processors are no longer marketed by their clock speeds but by their number of cores. The fundamental limits of energy and power density of processors will soon push us further into an age of dark silicon, where only a small portion of the chip can be powered at any time. In such a setting, putting more of the same processing cores on a chip (i.e. homogeneity) gives no advantage. This has forced computer architects to introduce heterogeneous many-core systems built around distinct processors -- which have different energy and performance characteristics and are each specialised for a certain class of applications. Computer architects now hope that software will find ways to unlock the potential of heterogeneous many-cores. Software developers, however, are struggling to cope with this dramatic increase in complexity; and the current compiler tools, whose role is to enable software to make effective use of the underlying hardware, are simply inadequate to the task.

It is already a daunting task to build optimising compilers for homogeneous multi-cores consisting of identical cores, even when targeting performance alone (i.e. making programs faster). It typically takes several generations of a compiler to start to exploit a processor's potential effectively, by which time a new processor appears and the process starts again. It will be a fundamentally more difficult task to design efficient compiler heuristics for optimising both energy (i.e. reducing energy consumption) and performance on heterogeneous many-cores, especially given the subtle interactions between different cores and interconnections. Even if this is achieved, the task of compiler design will likely have to start again when moving to a newly released processor. This never-ending game of catch-up inevitably delays time to market, meaning that we rarely fully exploit the hardware in its lifetime. If no solution is found, we will be faced with software stagnation and will be unable to offer scalable computing performance -- a driving force that has dramatically changed our society over the past 50 years.

What is needed is an approach that evolves and adapts to future changes in hardware architecture and delivers scalable performance over hardware generations. This project offers precisely that. It will achieve this by bringing together two distinct areas of computer science, parallel compiler design and machine learning, to develop a new paradigm for energy and performance optimisation. Our key insight is that the best optimisation strategies can be learned from similar software/hardware settings, and that the learnt knowledge can be constantly refreshed without human involvement. This project will deliver such a smart, adaptive compilation system. We will use machine learning to acquire knowledge of workloads, applications and the underlying hardware, testing new compilation strategies, learning how each individual program should be optimised for each specific computing environment, and constantly improving the optimisation heuristics over time.

As knowledge of the application environment grows, our system will make programs faster and more energy efficient; for example, software will respond more quickly and batteries will last longer on mobile phones. It will reduce time to market for software products and deliver scalable performance as hardware advances. If successful, such a programme of work will help to address the looming software crisis of dark silicon, which will be of benefit to academics and UK industry, and to system software researchers and developers worldwide.

Planned Impact

The immediate beneficiaries of this work will be computing systems and systems software providers, users of data analytics applications and smartphones. The academic beneficiaries will be researchers in the areas of programming languages, compilers, operating systems and computer architecture. We will also train postgraduate and undergraduate students.

We identify 11 activities through which the potential industrial, academic, economic and societal impact of this work will be realised.

[A. Industrial Impact]

*1. Prototypes:
This work will develop system software tool-chains for heterogeneous computing, including workload profiling and program synthesis tools, heterogeneous many-core compiler heuristics and a continuous optimisation framework. These will be released under an open source license and used as demonstrators of ideas and potential.

*2. Industrial Engagement:
We will visit our industrial partners and encourage our PhD student to take up internships with our partners to deliver technology transfer.

*3. Industrial Workshop:
In the second year of this project, we will organise a workshop in conjunction with the annual HiPEAC conference to disseminate the results to the international computing systems industry. The PI has already successfully organised one such industrial workshop at HiPEAC.

*4. Technology Licensing:
IP for heterogeneous many-core software development tools is a viable path for commercial exploitation through technology licensing. The Business Partnerships & Enterprise Team (BPET) in Lancaster provides commercialisation services to university members. We will make full use of BPET for exploitation.

[B. Economic Impact]
We will work closely with our partners to realise the potential economic impact in two specific areas:

*5. Big Data Analytics:
The results of this work can help to improve the energy efficiency and performance of big-data analytics technologies, which are currently a $16 bn market. We will work with Freescale, Herta Security and the Barcelona Supercomputing Center to exploit the results in this direction.

*6. Energy-efficient Mobile Devices:
Battery life is a major concern for billions of mobile users, who often find their phone has died at the most inconvenient times. The techniques developed in this work can improve energy efficiency and performance for each user's mobile device. We will collaborate with our partners, Movidius and CodePlay, to exploit the energy-aware compilation techniques created in this work for mobile computing.

[C. Academic Impact]

*7. Publications:
We aim to publish our results in the best conferences and journals in the areas of computing systems research, compilation, and parallel computing (PLDI, CGO, PPoPP, PACT, HiPEAC, LCTES, ASPLOS, ICS, ACM TACO, ACM TOPLAS, IEEE TPDS). Whenever possible, publications and research results will be made available on the project web site.

*8. Demos, Tutorials and Workshop:
Research prototypes will be disseminated to academic collaborators by giving tutorials and platform demonstrations in major technical conferences. We will continue the highly successful COSMIC (Code OptimiSation for MultI and many Cores) international workshop where we will present the key results of our work.

*9. Academic Visits:
We will regularly visit other systems groups in the UK working on relevant topics to disseminate research findings and build up collaboration.

[D. Societal Impact]

*10. Public Engagement:
We will use the web, social media and news media for public engagement. The project will use CompuCast (the world's first and only podcast for computer scientists), which was co-founded by the PI, to engage with a wider audience.

*11. Student Training:
We will design projects for postgraduate and undergraduate students in areas within the project's research agenda, providing students with much needed skills in software development for heterogene

Funded Value:

£98,612

Funded Period:

Jun 15 - Nov 17

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/M01567X/1

Principal Investigator:

Zheng Wang

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Fundamentals of Computing (100%)

Organisations

People	ORCID iD
Zheng Wang (Principal Investigator)	http://orcid.org/0000-0001-6157-0662

Publications

Cummins C (2017) Synthesizing benchmarks for predictive modeling

Chang L (2018) SleepGuard: Capturing Rich Sleep Information Using Smartwatch Sensing Data in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies

Chang L (2018) Towards Large-Scale RFID Positioning: A Low-cost, High-precision Solution Based on Compressive Sensing

Chen D (2019) Optimizing Sparse Matrix-Vector Multiplications on an ARMv8-based Many-Core Architecture in International Journal of Parallel Programming

Chen X (2019) Sensing Our World Using Wireless Signals in IEEE Internet Computing

Cummins C (2017) End-to-End Deep Learning of Optimization Heuristics

Ye GX (2017) Cracking Android Pattern Lock in Five Attempts

Ren J (2017) Optimise web browsing on heterogeneous mobile platforms: a machine learning based approach

Kuang K (2016) Exploiting Dynamic Scheduling for VM-Based Code Obfuscation

Kuang K (2017) Exploit dynamic data flows to protect software against semantic attacks

Key Findings
Impact Summary
Further Funding
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Key Findings
Description	We have shown that energy optimisation on heterogeneous many-core systems is non-trivial, but that the benefit is significant if the right choices are made. We have shown that the compiler plays a key role in power and performance optimisations on heterogeneous architectures. We have developed a tool based on the LLVM compiler infrastructure and shown that, by correctly optimising the program, we can achieve up to a 3x speedup or a 2x reduction in energy consumption over the standard compiler setting on a heterogeneous CPU-GPU mixed platform. We have shown that optimising and scheduling the code in different ways achieves different performance and energy trade-offs on heterogeneous multi-core architectures. This demonstrates that compiler-based techniques can play a key role in energy and performance optimisation for heterogeneous multi- and many-core systems. We are among the first to show that deep learning can be used to replace compiler heuristics, leading to far better performance on parallel GPGPU programs.
Exploitation Route	We have made our tool publicly available on GitHub: https://github.com/zwang4/dividend. We have also published our results in over 10 papers, from which the research community can benefit.
Sectors	Digital/Communication/Information Technologies (including Software),Energy,Other


Impact Summary
Description	Our work on code size reduction was licensed to a RISC-V processor IP company and is being productised by a major IT company.
First Year Of Impact	2019
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Economic


Further Funding
Description	EPSRC iCASE Studentship
Amount	£35,000 (GBP)
Organisation	Arm Limited
Sector	Private
Country	United Kingdom
Start	01/2016
End	06/2019


Description	Royal Society
Amount	£12,000 (GBP)
Organisation	The Royal Society
Sector	Charity/Non Profit
Country	United Kingdom
Start	03/2017
End	03/2019


Research Tools and Methods
Title	DeepTune - a deep learning based compiler optimisation tool
Description	DeepTune is an open-source framework for building compiler optimisation heuristics using deep learning techniques. DeepTune uses a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation.
Type Of Material	Improvements to research infrastructure
Year Produced	2017
Provided To Others?	Yes
Impact	DeepTune is the world's first deep-learning-based autotuner for compiler heuristics. It opens up a new research field in using deep learning to model program structures for performance optimisation. A range of follow-up works have built upon DeepTune. It also helped to secure follow-up industrial funding of over £500K.
URL	https://github.com/ChrisCummins/paper-end2end-dl


Title	HSA auto-tuning framework
Description	A compiler-based auto-tuning tool for HSA applications. It is the first automatic tool for tuning HSA applications.
Type Of Material	Improvements to research infrastructure
Year Produced	2016
Provided To Others?	Yes
Impact	Two research groups (the project partners), led by Albert Cohen at Inria, France, and Alexandru Amaricai at Politehnica University of Timișoara, Romania, are using our tool.
URL	https://github.com/zwang4/dividend


Collaboration
Description	Collaboration with Dionasys
Organisation	Peking University
Department	School of Electronics Engineering and Computer Science
Country	China
Sector	Academic/University
PI Contribution	We are collaborating on a project funded by the Royal Society. The project mines open-source repositories such as GitHub to automatically detect bugs and generate fixes. The Lancaster team contributes compiler and code analysis expertise to the project.
Collaborator Contribution	The Peking University team contributes staff time and expertise on natural language processing to the project.
Impact	The project has just started and no outcomes have been generated yet.
Start Year	2017


Description	Collaboration with Peking University
Organisation	Peking University
Department	School of Electronics Engineering and Computer Science
Country	China
Sector	Academic/University
PI Contribution	We are working on a joint project to mine open-source projects on GitHub to detect and repair bugs. We contribute our expertise on code analysis to the project.
Collaborator Contribution	The collaborative partner contributes their expertise on natural language processing to the project. The partner team involves two academics and three postgraduate students.
Impact	This collaborative work has led to two joint publications (DOI: 10.18653/v1/P17-1040 and "Scale Up Event Extraction Learning via Automatic Training Data Generation").
Start Year	2017


Description	HSA collaboration with AMD
Organisation	Advanced Micro Devices (AMD)
Country	United States
Sector	Private
PI Contribution	This work has led to a collaboration with AMD, a main contributor to the Heterogeneous System Architecture (HSA) Foundation. We are currently working on building a compiler-based HSA auto-tuner for the LLVM HSAIL compiler developed by AMD.
Collaborator Contribution	AMD has given us access to their internal version of the HSA driver and provides technical support for their HSA architecture.
Impact	This has led to a prototype HSA auto-tuner released on GitHub: https://github.com/zwang4/dividend
Start Year	2016


Software and Technical Products
Title	HSA Auto-tuning tool
Description	A compiler-based auto-tuning tool for HSA applications.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The first auto-tuning tool for HSA programs.
URL	https://github.com/zwang4/dividend


Engagement Activities
Description	NDSS paper
Form Of Engagement Activity	A magazine, newsletter or online publication
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Public/other audiences
Results and Impact	Our research into Android Pattern Lock security has received wide media coverage. The news appeared in most UK national newspapers and was reported on by media outlets around the world to a potential audience of millions (as reported by the press office at Lancaster University)
Year(s) Of Engagement Activity	2016
URL	http://www.thetimes.co.uk/edition/news/scientists-finger-security-flaw-on-smartphone-lock-dmql3hdp3
