M3: Managing Many-Cores for the Masses

Lead Research Organisation: University of Cambridge

Department Name: Computer Science and Technology

Abstract

Since the invention of the microprocessor in the early 1970s, information and communication technologies have rapidly improved almost all aspects of our daily lives. Unrelenting growth in computing power has been the enabler for significant scientific advances and transformed our society in the way we work, play and interact with each other. Fundamentally, this has been achieved through continuous strides in semiconductor technology, repeatedly doubling the number of transistors available on a chip every 18 months (famously known as Moore's Law). At the same time, Dennard scaling ensured that overall chip power and area stayed constant, allowing computer architects to design more and more elaborate processors to take advantage of the increased transistor counts and provide faster and faster systems.

All this was performed transparently to the programmer, who simply had to switch to a new processor to make their applications run faster. There was no need to rewrite or modify parts of the program, or worry about the characteristics of the underlying platform. There was, in effect, a contract between the hardware and software, whereby the architect would provide more advanced systems and the programmer would concentrate on producing more sophisticated and innovative applications.

Serendipitous transistor scaling couldn't last forever though, and in the early part of this century we hit the "power wall". Transistors could no longer be made smaller yet achieve the same power efficiency gains as in previous technology generations. In response, industry switched to multicore designs, aiming to continue performance increases through thread-level parallelism. This, in effect, broke the contract, requiring software writers to create programs with an eye on performance, as well as functionality. Although there are some application domains that can benefit from this parallelism (e.g. games), there is little evidence that developers have risen to the challenge of writing multi-threaded code and even when they do, the runtime environment doesn't necessarily allow their threads to actually be executed in parallel.

The failure of Dennard scaling means that many-core chips are the logical next step. However, it also means that the energy problem will not go away, and key industry figures warn of "dark silicon", where we have abundant transistors on chip but lack the ability to power them all on at once. We may soon find that significant parts of each multicore chip must be turned off, due to power limitations. Added to this, as transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening the lifespan of the systems built from them. Unless we take drastic action, this convergence of challenges could seriously undermine our ability to continue advancing the systems that have revolutionised our lives.

This fellowship is a step towards meeting these challenges by reinstating the hardware / software contract, allowing the system to deal with each issue and leaving the application developer free to continue innovating. It will bring together a world-class research team that will investigate a variety of holistic schemes to achieve this aim based on automatic parallelisation, heterogeneous architectures and optical networks-on-chip, directly tackling the challenges laid out in the EPSRC cross-ICT priority of Many-Core Architectures and Concurrency in Distributed and Embedded Systems.

Planned Impact

The advent of mainstream multicore processors requires software developers to write multi-threaded applications to achieve high performance. At the same time, energy consumption is still a major challenge, with key industry figures warning of "dark silicon", where we have abundant transistors on chip but lack the ability to power them all on at once. Furthermore, as transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening the lifespan of the systems built from them. Unless we take drastic action, this convergence of challenges could seriously undermine our ability to continue advancing the systems that have revolutionised our lives.

The main goal of this project is to tackle all three of these challenges transparently to the software developer, increasing the performance, reducing the energy consumption and ensuring the reliability of future many-core processors. The primary beneficiaries of this work will be academia and industry, for whom we will carry out state-of-the-art research into automatic parallelisation, heterogeneous architectures and reliability techniques for many-core architectures.

This work is of significant value to industry, as demonstrated by the letter of support received from ARM, who will fund the three Ph.D. students due to work on this project. Dr. Jones has good links with the different divisions within ARM and these provide an obvious route to dissemination of ideas, as well as feedback on the developed techniques.

Furthermore, at an early stage in the project the core IP that will be generated from the research will be identified. With this in hand, the potential for further economic gain can be evaluated. The University of Cambridge has an excellent track record in commercial exploitation, for example XenSource.

As with any long-term research, it is difficult to express its impact on the wider research community and society. However, EPSRC's vision for a digital economy is underpinned by the need for high-performance, reliable and efficient electronics and systems. This proposal will contribute significantly towards this goal and to the EPSRC cross-ICT priority on Many-Core Architectures and Concurrency in Distributed and Embedded Systems.

Funded Value:

£1,212,276

Funded Period:

Sep 13 - Mar 19

Funder:

EPSRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

EP/K026399/1

Principal Investigator:

Timothy Jones

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Computer Sys. & Architecture (50%)

Electronic Devices & Subsys. (40%)

System on Chip (10%)

Organisations

People	ORCID iD
Timothy Jones (Principal Investigator / Fellow)

Publications

Ainsworth S (2018) An Event-Triggered Programmable Prefetcher for Irregular Workloads

Ainsworth S (2020) Prefetching in functional languages

Ainsworth S (2020) MarkUs: Drop-in use-after-free prevention for low-level languages

Ainsworth S (2021) ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance

Ainsworth S (2018) An Event-Triggered Programmable Prefetcher for Irregular Workloads in ACM SIGPLAN Notices

Ainsworth S (2017) Software prefetching for indirect memory accesses

Ainsworth S (2020) The Guardian Council

Ainsworth S (2019) Software Prefetching for Indirect Memory Accesses A Microarchitectural Perspective in ACM Transactions on Computer Systems

Ainsworth S (2018) Parallel Error Detection Using Heterogeneous Cores

Ainsworth S (2016) Graph Prefetching Using Data Structure Knowledge

Ainsworth S (2020) MuonTrap: Preventing Cross-Domain Spectre-Like Attacks by Capturing Speculative State

Ainsworth S (2019) ParaMedic: Heterogeneous Parallel Error Correction

Campanoni S (2014) HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs

Campanoni S (2017) Automatically accelerating non-numerical programs by architecture-compiler co-design in Communications of the ACM

Dubach C (2013) Dynamic microarchitectural adaptation using machine learning in ACM Transactions on Architecture and Code Optimization

Erdos M (2022) MineSweeper: a "clean sweep" for drop-in use-after-free prevention

Hadade I (2020) Software Prefetching for Unstructured Mesh Applications in ACM Transactions on Parallel Computing

Hadade I (2018) Software Prefetching for Unstructured Mesh Applications

Ho H (2021) Timed hyperproperties in Information and Computation

Kohn T (2020) Dynamic pattern matching with Python

Madarbux M (2014) Towards zero latency photonic switching in shared memory networks in Concurrency and Computation: Practice and Experience

Madarbux M (2016) Energy Efficient And Low Latency Interconnection Network For Multicast Invalidates In Shared Memory Systems

Mitropoulou K (2016) Lynx

Mitropoulou K (2016) COMET

Murphy N (2016) Performance implications of transient loop-carried data dependences in automatically parallelized loops

Porpodas V (2015) Throttling Automatic Vectorization: When Less is More

Porpodas V (2015) PSLP: Padded SLP automatic vectorization

Savage J (2020) HALO: post-link heap-layout optimisation

Soman J (2017) High performance fault tolerance through predictive instruction re-execution

Soman J (2015) REPAIR: Hard-error recovery via re-execution

Sun P (2021) Speculative Vectorisation with Selective Replay

Valero A (2017) On Microarchitectural Mechanisms for Cache Wearout Reduction in IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Valero A (2016) Enhancing the L1 Data Cache Design to Mitigate HCI in IEEE Computer Architecture Letters

Xia H (2019) CHERIvoke

Zhang X (2021) Quantifying the Semantic Gap Between Serial and Parallel Programming

Zhou R (2019) The janus triad: exploiting parallelism through dynamic binary modification

Zhou R (2019) Janus: Statically-Driven and Profile-Guided Automatic Dynamic Binary Parallelisation

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Collaboration
Intellectual Property
Software and Technical Products


Description	We have developed schemes to accelerate parallel code within multicore processors, extended the reach of vectorisation, and found that many transient and permanent errors within processors can be overcome through the addition of an array of small, power-efficient cores capable of re-executing code in parallel. These can also improve the performance of many HPC applications by bringing data into the core in advance of it being required.
Exploitation Route	Our work can be used by industry to improve the performance and reliability of microprocessors.
Sectors	Digital/Communication/Information Technologies (including Software),Electronics
URL	http://www.cl.cam.ac.uk/~tmj32/


Description	Close collaboration with Arm Ltd is on-going and exploring the long-term benefits of our work. We have jointly filed three patent applications (one accepted) and Arm are evaluating our work for a future product.
First Year Of Impact	2015
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Economic


Description	EPSRC Responsive Mode Grant
Amount	£1,062,734 (GBP)
Funding ID	EP/P020011/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	07/2017
End	12/2020


Description	Heterogeneous Parallel Reliability
Amount	£686,644 (GBP)
Organisation	Huawei Technologies Research and Development UK Ltd
Sector	Private
Country	United Kingdom
Start	07/2022
End	07/2026


Description	ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance
Amount	£1,091,792 (GBP)
Funding ID	EP/W00576X/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2021
End	03/2025


Description	Programmable Real-Time Security
Amount	£732,581 (GBP)
Organisation	Huawei Technologies Research and Development UK Ltd
Sector	Private
Country	United Kingdom
Start	10/2020
End	03/2024


Title	Research data supporting "An Event-Triggered Programmable Prefetcher for Irregular Workloads"
Description	Source code for the LLVM passes for automating programmable prefetching, as well as code modifications to gem5 to evaluate programmable prefetching, and associated benchmarks.
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes


Title	Research data supporting "CHERIvoke: Characterising Pointer Revocation using CHERI Capabilities for Temporal Memory Safety"
Description	Source code for sweeper and modified dlmalloc (called dlmalloc_cherivoke) that implements the CHERIvoke technique. See the file README.md for a detailed description and usage instructions.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes


Title	Research data supporting "HALO: Post-Link Heap-Layout Optimisation"
Description
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/300136


Title	Research data supporting "High Performance Fault Tolerance Through Predictive Instruction Re-Execution"
Description	Source code for simulator modules to implement schemes in the paper.
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes


Title	Research data supporting "Quantifying the Semantic Gap Between Serial and Parallel Programming"
Description	Source code for the VIA tool used to analyse applications. Please see README file for details
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/340561


Title	Research data supporting "The Janus Triad: Exploiting Parallelism Through Dynamic Binary Modification"
Description	A static binary analysis tool, profile information, a DynamoRIO client and benchmarks to show binary parallelisation using Janus. For more information, see the README.txt file.
Type Of Material	Database/Collection of data
Year Produced	2019
Provided To Others?	Yes


Description	ARM
Organisation	Arm Limited
Country	United Kingdom
Sector	Private
PI Contribution	Three PhD students work on topics related to the research grant. They have mentors in ARM to ensure that their work is relevant and goes in the direction that industry is heading.
Collaborator Contribution	Funding for PhD students and appointing mentors for them. Students spend three months over the course of their studies within ARM Research. For the rest of the time they meet with their mentors once a month, on average.
Impact	Two patent filings, not yet granted. Numerous publications. A follow-on grant.
Start Year	2013


Description	HiPEAC
Organisation	European Commission
Department	Seventh Framework Programme (FP7)
Country	European Union (EU)
Sector	Public
PI Contribution	Attending meetings to disseminate results and interact with other researchers in the same area.
Collaborator Contribution	A visit by a PostDoc from another member for 4 months.
Impact	The network is on High-Performance and Embedded Architectures and Compilers
Start Year	2011


Description	NESUS
Organisation	Network for Sustainable Ultrascale Computing (NESUS)
Country	Global
Sector	Academic/University
PI Contribution	Visit to initial kick-off meeting to discuss our work.
Collaborator Contribution	The whole consortium aims to meet the challenges of sustainable ultrascale computing.
Impact	The collaboration crosses disciplines within Computer Science. These are programming languages, compilers, runtimes, computer architecture and networks.
Start Year	2014


Title	AN APPARATUS AND METHOD FOR SPECULATIVELY VECTORISING PROGRAM CODE
Description	An apparatus and method are provided for speculatively vectorising program code. The apparatus comprises processing circuitry for executing program code, the program code including an identified code region comprising at least a plurality of speculative vector memory access instructions. Execution of each speculative vector memory access instruction is employed to perform speculative vectorisation of a series of scalar memory access operations using a plurality of lanes of processing. Tracking storage is used to maintain, for each speculative vector memory access instruction, tracking information providing an indication of a memory address being accessed within each lane. Checking circuitry then references the tracking information during execution of the identified code region by the processing circuitry, in order to detect any inter lane memory hazard resulting from the execution of the plurality of speculative vector memory access instructions. For at least a first type of inter lane memory hazard, a status storage element is used to maintain an indication of each lane for which the checking circuitry has determined the presence of that type of memory hazard. Replay determination circuitry is then arranged, when an end of the identified code region is reached, to be responsive to the status storage element identifying at least one lane as having an inter lane memory hazard, to trigger re-execution of the identified code region for each lane identified by the status storage element. Such an approach can significantly increase the ability to vectorise scalar code, hence resulting in significant performance improvements.
IP Reference	WO2021001641
Protection	Patent granted
Year Protection Granted	2021
Licensed	No
Impact	Arm internal evaluation
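
The idea in the description above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the loop body (`mem[dst[i]] = mem[src[i]] + 1`), the function names and the lane count are all hypothetical, and the model commits lanes in order and reloads flagged lanes as it goes, whereas the description replays flagged lanes once the end of the code region is reached. It shows the three ingredients named in the text: speculative vector loads, per-lane address tracking to detect inter-lane read-after-write hazards, and re-execution limited to the flagged lanes.

```python
def scalar_loop(mem, src, dst):
    """Reference semantics: mem[dst[i]] = mem[src[i]] + 1, one iteration at a time."""
    mem = list(mem)
    for s, d in zip(src, dst):
        mem[d] = mem[s] + 1
    return mem


def speculative_vector_loop(mem, src, dst, lanes=4):
    """Toy model of speculative vectorisation (hypothetical, illustration only)."""
    mem = list(mem)
    replayed = 0
    for base in range(0, len(src), lanes):
        group = range(base, min(base + lanes, len(src)))

        # Speculative vector load: every lane reads before any lane writes.
        loaded = {i: mem[src[i]] for i in group}

        # Tracking information: addresses written by earlier lanes of this group.
        written = set()
        hazard_lanes = set()
        for i in group:
            if src[i] in written:  # lane i read a stale value (read-after-write)
                hazard_lanes.add(i)
            written.add(dst[i])

        # Commit in lane order; only flagged lanes redo their load.
        for i in group:
            if i in hazard_lanes:
                loaded[i] = mem[src[i]]
                replayed += 1
            mem[dst[i]] = loaded[i] + 1
    return mem, replayed


if __name__ == "__main__":
    memory = list(range(8))
    result, replays = speculative_vector_loop(memory, [0, 4, 2, 3], [4, 5, 6, 7])
    print(result, "lanes replayed:", replays)
```

When no lane reads an address written by an earlier lane, the model replays nothing; with one such dependence, only that lane is re-executed and the result still matches the scalar loop.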


Title	Event triggered programmable prefetcher
Description	A main processor 4 executes a main program and has an associated cache 6. Event detection circuitry 12 detects events consequent upon execution of the main program and indicative of data to be used by the main processor. A programmable further processor 16 and/or 18 is triggered by the detected events to execute a further program. The data to be used by the main processor are pre-fetched by circuitry 28 responsive to the further program. Event detection circuit 12 comprises programmable filters to detect memory operations directed to memory addresses within programmable ranges. Statistical analysis circuit 20 provides memory latency values that can be used by the further processor. Detected events comprise: cache fetch, fill, eviction, hit, miss events; branch prediction; memory snoops; system events and interrupt signals. Prefetch performance for irregular patterns, e.g. pointer chasing or compressed sparse matrices, is improved over static hardware pre-fetch mechanisms, e.g. stride prediction.
IP Reference	GB2544474
Protection	Patent granted
Year Protection Granted	2017
Licensed	No
Impact	None
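
As a rough illustration of the mechanism described above, the following toy model counts demand misses for an indirect access pattern (`data[index[i]]`) with and without an event-triggered prefetch handler. Everything here is an assumption made for the sketch: the `ToyCache` class, the address layout, the prefetch distance, and the fact that timing is ignored entirely (a prefetch is treated as complete the moment it is issued). The handler reads `index[j]` directly, standing in for the further program reading the value it has just fetched.

```python
class ToyCache:
    """Unbounded cache of single-word lines with address-range event filters."""

    def __init__(self):
        self.lines = set()
        self.demand_misses = 0
        self.filters = []  # (low, high, handler) triples

    def prefetch(self, addr):
        self.lines.add(addr)

    def load(self, addr):
        if addr not in self.lines:
            self.demand_misses += 1
            self.lines.add(addr)
        # Event detection: only demand loads inside a programmed range fire.
        for low, high, handler in self.filters:
            if low <= addr < high:
                handler(addr)


def run_indirect_sum(index, data, use_prefetcher, distance=2):
    """Sum data[index[i]] and report the demand misses seen by the toy cache."""
    index_base, data_base = 0, 1000  # hypothetical address layout
    cache = ToyCache()

    if use_prefetcher:
        def on_index_load(addr):
            # Look `distance` iterations ahead and fetch both levels of the
            # indirection before the main loop asks for them.
            j = addr - index_base + distance
            if j < len(index):
                cache.prefetch(index_base + j)
                cache.prefetch(data_base + index[j])

        cache.filters.append((index_base, index_base + len(index), on_index_load))

    total = 0
    for i in range(len(index)):
        cache.load(index_base + i)
        cache.load(data_base + index[i])
        total += data[index[i]]
    return total, cache.demand_misses


if __name__ == "__main__":
    idx = [3, 1, 4, 0, 5, 2, 7, 6]
    values = list(range(10, 18))
    print(run_indirect_sum(idx, values, use_prefetcher=False))
    print(run_indirect_sum(idx, values, use_prefetcher=True))
```

In this model the only remaining misses are those in the first `distance` iterations, before the handler has had a chance to run ahead; a stride prefetcher would cover the `index` array but not the data-dependent accesses.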


Title	Main processor error detection using checker processors
Description	A main processor 6, which may support out-of-order execution, executes a main stream of program instructions (30, fig. 2) and two or more checker processors 20, e.g. in-order processors, execute checker parallel streams of instructions (34, fig. 2) corresponding to different portions of the main stream. Errors are detected when a mismatch is identified between the outcome of a given portion of the main stream and the outcome of the corresponding checker stream. Lower-performance checker cores 20 may be shared between multiple main high-performance CPUs 6. Each checker stream may correspond to a portion of the main stream executed by the main processor between two successive checking boundary events, e.g. exceptions, control-flow-changing instructions or checking barrier instructions. Checkpoint entries, indicative of an architectural state of the main processor, may be stored and an error recovery operation based on a checkpoint entry may be triggered in response to detection of an error. The main processor 6 may defer a store instruction targeting a particular address until error checks are complete or commit a transaction to memory before error detection circuitry 28 has detected whether an error occurred for that store instruction.
IP Reference	GB2555628
Protection	Patent granted
Year Protection Granted	2018
Licensed	No
Impact	None
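
The checking scheme described above can be sketched in software as follows. This is a minimal model under assumptions, not the patented design: the two-operation register machine, the function names and the injected single-bit fault are all hypothetical. It shows the core idea that each portion of the main stream is re-executed independently from its starting checkpoint, so the portions can be checked in parallel, and an error is reported wherever the re-executed outcome disagrees with the main core's.

```python
from concurrent.futures import ThreadPoolExecutor


def run_segment(state, segment):
    """Execute (op, dst, a, b) instructions on a copy of the register state."""
    regs = dict(state)
    for op, dst, a, b in segment:
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs


def main_core(state, segments, faulty_segment=None):
    """Run segments in order, recording a (start, end) checkpoint pair for each."""
    checkpoints = []
    for number, segment in enumerate(segments):
        end = run_segment(state, segment)
        if number == faulty_segment:
            end["r0"] ^= 1  # injected fault: flip the low bit of r0
        checkpoints.append((state, end))
        state = end
    return checkpoints


def check_in_parallel(checkpoints, segments, workers=2):
    """Re-execute every segment from its start checkpoint; return mismatching ones."""
    def verify(item):
        (start, end), segment = item
        return run_segment(start, segment) == end

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(verify, zip(checkpoints, segments)))
    return [number for number, ok in enumerate(results) if not ok]


if __name__ == "__main__":
    program = [
        [("add", "r0", "r0", "r1")],
        [("mul", "r0", "r0", "r1")],
        [("add", "r1", "r0", "r1")],
    ]
    registers = {"r0": 1, "r1": 2}
    print(check_in_parallel(main_core(registers, program, faulty_segment=1), program))
```

Because each checker starts from the main core's own checkpoint, a fault is attributed to the portion in which it occurred: later portions begin from the already-corrupted state and therefore check out as consistent.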


Title	Chromium
Description	Web browser
Type Of Technology	Webtool/Application
Year Produced	2020
Open Source License?	Yes
Impact	Influenced the design of the StarScan memory safety algorithms in both Chrome (Google's closed source browser) and Chromium (open-source). A YouTube discussion on an early version of this is available here: https://www.youtube.com/watch?v=ohlxw5kDn-k (mention of MarkUs being the influence at 29 minutes) and a blog post on the techniques here: https://security.googleblog.com/2022/05/retrofitting-temporal-memory-safety-on-c.html
URL	https://www.chromium.org/


Title	DynamoRIO
Description	A dynamic binary instrumentation tool
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	Contributed to the ARM 64-bit architecture port of DynamoRIO, allowing anybody to instrument AArch64 binaries using this tool.
URL	http://dynamorio.org/


Title	Janus
Description	Janus is a binary parallelisation tool, also capable of automatically vectorising applications, as well as inserting useful software prefetches.
Type Of Technology	Software
Year Produced	2019
Open Source License?	Yes
Impact	We published two papers on Janus and it is the building block for on-going research in the group.
URL	https://www.cl.cam.ac.uk/~tmj32/data/


Title	The Lynx Queue
Description	Lynx is a very fast single-producer, single-consumer software queue.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	We have used this queue to develop faster soft-error detection techniques. It has been downloaded 21 times by others.
URL	http://www.cl.cam.ac.uk/~tmj32/data/
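
For readers unfamiliar with the data structure, a single-producer, single-consumer queue is commonly built as a ring buffer in which the producer alone advances the tail index and the consumer alone advances the head index. The sketch below shows that generic index discipline only; it is not the Lynx implementation, the class and method names are hypothetical, and it is exercised single-threaded. A concurrent native version would additionally need atomic indices with appropriate memory ordering.

```python
class SpscRingQueue:
    """Generic bounded SPSC ring buffer (illustrative, not the Lynx queue)."""

    def __init__(self, capacity):
        # One slot is always left empty so that full and empty are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer

    def try_push(self, item):
        """Producer side: returns False instead of blocking when the queue is full."""
        next_tail = (self._tail + 1) % len(self._slots)
        if next_tail == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = next_tail  # publish the item after it has been written
        return True

    def try_pop(self):
        """Consumer side: returns (False, None) when the queue is empty."""
        if self._head == self._tail:
            return False, None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return True, item


if __name__ == "__main__":
    queue = SpscRingQueue(capacity=2)
    print(queue.try_push("a"), queue.try_push("b"), queue.try_push("c"))
    print(queue.try_pop(), queue.try_pop(), queue.try_pop())
```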
