M3: Managing Many-Cores for the Masses

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

Since the invention of the microprocessor in the early 1970s, information and communication technologies have rapidly improved almost all aspects of our daily lives. Unrelenting growth in computing power has been the enabler for significant scientific advances and transformed our society in the way we work, play and interact with each other. Fundamentally, this has been achieved through continuous strides in semiconductor technology, repeatedly doubling the number of transistors available on a chip every 18 months (famously known as Moore's Law). At the same time, Dennard scaling ensured that overall chip power and area stayed constant, allowing computer architects to design more and more elaborate processors to take advantage of the increased transistor counts and provide faster and faster systems.

All this was performed transparently to the programmer, who simply had to switch to a new processor to make their applications run faster. There was no need to rewrite or modify parts of the program, or worry about the characteristics of the underlying platform. There was, in effect, a contract between the hardware and software, whereby the architect would provide more advanced systems and the programmer would concentrate on producing more sophisticated and innovative applications.

Serendipitous transistor scaling couldn't last forever though, and in the early part of this century we hit the "power wall". Transistors could no longer be made smaller yet achieve the same power efficiency gains as in previous technology generations. In response, industry switched to multicore designs, aiming to continue performance increases through thread-level parallelism. This, in effect, broke the contract, requiring software writers to create programs with an eye on performance, as well as functionality. Although there are some application domains that can benefit from this parallelism (e.g. games), there is little evidence that developers have risen to the challenge of writing multi-threaded code and even when they do, the runtime environment doesn't necessarily allow their threads to actually be executed in parallel.

The failure of Dennard scaling means that many-core chips are the logical next step. However, it also means that the energy problem will not decline, and key industry figures warn of "dark silicon" when we have abundant transistors on chip but lack the ability to power them all on at once. We may soon find that significant parts of each multicore chip must be turned off, due to power limitations. Added to this, as transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening the lifespan of the systems built from them. Unless we take drastic action, this convergence of challenges could seriously undermine our ability to continue advancing the systems that have revolutionised our lives.

This fellowship is a step towards meeting these challenges by reinstating the hardware / software contract, allowing the system to deal with each issue and leaving the application developer free to continue innovating. It will bring together a world-class research team that will investigate a variety of holistic schemes to achieve this aim based on automatic parallelisation, heterogeneous architectures and optical networks-on-chip, directly tackling the challenges laid out in the EPSRC cross-ICT priority of Many-Core Architectures and Concurrency in Distributed and Embedded Systems.

Planned Impact

The advent of mainstream multicore processors requires software developers to write multi-threaded applications to achieve high performance. At the same time, energy consumption is still a major challenge, with key industry figures warning of "dark silicon" when we have abundant transistors on chip but lack the ability to power them all on at once. Furthermore, as transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening the lifespan of the systems built from them. Unless we take drastic action, this convergence of challenges could seriously undermine our ability to continue advancing the systems that have revolutionised our lives.

The main goal of this project is to tackle all three of these challenges transparently to the software developer, increasing the performance, reducing the energy consumption and ensuring the reliability of future many-core processors. The primary beneficiaries of this work will be academia and industry, which will benefit from our state-of-the-art research into automatic parallelisation, heterogeneous architectures and reliability techniques for many-core architectures.

This work is of significant value to industry, as demonstrated by the letter of support received from ARM, who will fund the three Ph.D. students due to work on this project. Dr. Jones has good links with the different divisions within ARM and these provide an obvious route to dissemination of ideas, as well as feedback on the developed techniques.

Furthermore, at an early stage in the project the core IP that will be generated from the research will be identified. With this in hand, the potential for further economic gain can be evaluated. The University of Cambridge has an excellent track record in commercial exploitation, for example XenSource.

As with any long-term research, it is difficult to express its impact on the wider research community and society. However, EPSRC's vision for a digital economy is underpinned by the need for high-performance, reliable and efficient electronics and systems. This proposal will contribute significantly towards this goal and towards the EPSRC cross-ICT priority on Many-Core Architectures and Concurrency in Distributed and Embedded Systems.

Publications

Valero A (2017) On Microarchitectural Mechanisms for Cache Wearout Reduction in IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Mitropoulou K (2016) Lynx

Valero A (2016) Enhancing the L1 Data Cache Design to Mitigate HCI in IEEE Computer Architecture Letters

Dubach C (2013) Dynamic microarchitectural adaptation using machine learning in ACM Transactions on Architecture and Code Optimization

Mitropoulou K (2016) COMET: Communication-optimised multi-threaded error-detection technique in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES 2016

Mitropoulou K (2016) COMET

Xia H (2019) CHERIvoke

Ainsworth S (2018) An Event-Triggered Programmable Prefetcher for Irregular Workloads in ACM SIGPLAN Notices

 
Description We have developed schemes to accelerate parallel code within multicore processors, extended the reach of vectorisation, and found that many transient and permanent errors within processors can be overcome through the addition of an array of small, power-efficient cores capable of re-executing code in parallel. These can also improve the performance of many HPC applications by bringing data into the core in advance of it being required.
Exploitation Route Our work can be used by industry to improve the performance and reliability of microprocessors.
Sectors Digital/Communication/Information Technologies (including Software)

Electronics

URL http://www.cl.cam.ac.uk/~tmj32/
 
Description Close collaboration with Arm Ltd is ongoing, exploring the long-term benefits of our work. We have jointly filed three patent applications (one accepted), and Arm are evaluating our work for a future product.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description EPSRC Responsive Mode Grant
Amount £1,062,734 (GBP)
Funding ID EP/P020011/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 06/2017 
End 12/2020
 
Description Heterogeneous Parallel Reliability
Amount £686,644 (GBP)
Organisation Huawei Technologies Research and Development UK Ltd 
Sector Private
Country United Kingdom
Start 06/2022 
End 07/2026
 
Description ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance
Amount £1,091,792 (GBP)
Funding ID EP/W00576X/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 02/2022 
End 03/2025
 
Description Programmable Real-Time Security
Amount £732,581 (GBP)
Organisation Huawei Technologies Research and Development UK Ltd 
Sector Private
Country United Kingdom
Start 09/2020 
End 03/2024
 
Title Research data supporting "An Event-Triggered Programmable Prefetcher for Irregular Workloads" 
Description Source code for the LLVM passes for automating programmable prefetching, as well as code modifications to gem5 to evaluate programmable prefetching, and associated benchmarks. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Research data supporting "CHERIvoke: Characterising Pointer Revocation using CHERI Capabilities for Temporal Memory Safety" 
Description Source code for sweeper and modified dlmalloc (called dlmalloc_cherivoke) that implements the CHERIvoke technique. See the file README.md for a detailed description and usage instructions. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Title Research data supporting "HALO: Post-Link Heap-Layout Optimisation" 
Description  
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
URL https://www.repository.cam.ac.uk/handle/1810/300136
 
Title Research data supporting "High Performance Fault Tolerance Through Predictive Instruction Re-Execution" 
Description Source code for simulator modules to implement schemes in the paper. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Research data supporting "Quantifying the Semantic Gap Between Serial and Parallel Programming" 
Description Source code for the VIA tool used to analyse applications. Please see the README file for details.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact Publication of the research paper "Quantifying the Semantic Gap Between Serial and Parallel Programming" 
URL https://www.repository.cam.ac.uk/handle/1810/340561
 
Title Research data supporting "The Janus Triad: Exploiting Parallelism Through Dynamic Binary Modification" 
Description A static binary analysis tool, profile information, a DynamoRIO client and benchmarks to show binary parallelisation using Janus. For more information, see the README.txt file. 
Type Of Material Database/Collection of data 
Year Produced 2019 
Provided To Others? Yes  
 
Description ARM 
Organisation Arm Limited
Country United Kingdom 
Sector Private 
PI Contribution Three PhD students work on topics related to the research grant. They have mentors in ARM to ensure that their work is relevant and goes in the direction that industry is heading.
Collaborator Contribution Funding for PhD students and appointing mentors for them. Students spend three months over the course of their studies within ARM Research. For the rest of the time they meet with their mentors once a month, on average.
Impact Two patent filings, not yet granted. Numerous publications. A follow-on grant.
Start Year 2013
 
Description HiPEAC 
Organisation European Commission
Department Seventh Framework Programme (FP7)
Country European Union (EU) 
Sector Public 
PI Contribution Attending meetings to disseminate results and interact with other researchers in the same area.
Collaborator Contribution A four-month visit by a postdoc from another member.
Impact The network is on High-Performance and Embedded Architectures and Compilers
Start Year 2011
 
Description NESUS 
Organisation Network for Sustainable Ultrascale Computing (NESUS)
Country Global 
Sector Academic/University 
PI Contribution Visit to initial kick-off meeting to discuss our work.
Collaborator Contribution The whole consortium aims to meet the challenges of sustainable ultrascale computing.
Impact The collaboration crosses disciplines within Computer Science. These are programming languages, compilers, runtimes, computer architecture and networks.
Start Year 2014
 
Title AN APPARATUS AND METHOD FOR SPECULATIVELY VECTORISING PROGRAM CODE 
Description An apparatus and method are provided for speculatively vectorising program code. The apparatus comprises processing circuitry for executing program code, the program code including an identified code region comprising at least a plurality of speculative vector memory access instructions. Execution of each speculative vector memory access instruction is employed to perform speculative vectorisation of a series of scalar memory access operations using a plurality of lanes of processing. Tracking storage is used to maintain, for each speculative vector memory access instruction, tracking information providing an indication of a memory address being accessed within each lane. Checking circuitry then references the tracking information during execution of the identified code region by the processing circuitry, in order to detect any inter lane memory hazard resulting from the execution of the plurality of speculative vector memory access instructions. For at least a first type of inter lane memory hazard, a status storage element is used to maintain an indication of each lane for which the checking circuitry has determined the presence of that type of memory hazard. Replay determination circuitry is then arranged, when an end of the identified code region is reached, to be responsive to the status storage element identifying at least one lane as having an inter lane memory hazard, to trigger re-execution of the identified code region for each lane identified by the status storage element. Such an approach can significantly increase the ability to vectorise scalar code, hence resulting in significant performance improvements. 
IP Reference WO2021001641 
Protection Patent granted
Year Protection Granted 2021
Licensed No
Impact Arm internal evaluation
 
Title Event triggered programmable prefetcher 
Description A main processor 4 executes a main program and has an associated cache 6. Event detection circuitry 12 detects events consequent upon execution of the main program and indicative of data to be used by the main processor. A programmable further processor 16 and/or 18 is triggered by the detected events to execute a further program. The data to be used by the main processor are pre-fetched by circuitry 28 responsive to the further program. Event detection circuit 12 comprises programmable filters to detect memory operations directed to memory addresses within programmable ranges. Statistical analysis circuit 20 provides memory latency values that can be used by the further processor. Detected events comprise: cache fetch, fill, eviction, hit, miss events; branch prediction; memory snoops; system events and interrupt signals. Prefetch performance for irregular patterns, e.g. pointer chasing or compressed sparse matrices, is improved over static hardware pre-fetch mechanisms, e.g. stride prediction. 
IP Reference GB2544474 
Protection Patent granted
Year Protection Granted 2017
Licensed No
Impact None
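
The event-triggered prefetching idea in the record above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the patented design or the evaluated gem5 model: the `Cache` class, the `count_misses` and `shuffled_indices` functions, the address layout, the cache capacity and the lookahead distance are all hypothetical. It models a main loop performing an indirect access `data[index[i]]`, an address-range filter that raises an event on each load from the index array, and a small helper routine that reacts to the event by prefetching the data that will be needed a few iterations later.

```python
import random
from collections import OrderedDict

INDEX_BASE = 0        # hypothetical address of index[0]
DATA_BASE = 100_000   # hypothetical address of data[0]


class Cache:
    """Tiny fully associative LRU cache that only tracks resident addresses."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def _fill(self, addr):
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used

    def load(self, addr):
        """Demand access by the main program; counts a miss if not resident."""
        if addr not in self.lines:
            self.misses += 1
        self._fill(addr)

    def prefetch(self, addr):
        """Access issued by the helper program; never counted as a miss."""
        self._fill(addr)


def shuffled_indices(n, seed=0):
    """An irregular access pattern: a seeded random permutation of range(n)."""
    index = list(range(n))
    random.Random(seed).shuffle(index)
    return index


def count_misses(index, lookahead, use_prefetcher):
    """Run the loop `data[index[i]]` and return the number of demand misses."""
    cache = Cache()
    lo, hi = INDEX_BASE, INDEX_BASE + len(index)  # programmable address filter

    def on_index_load(addr):
        # Helper program: look `lookahead` elements ahead in the index array
        # and prefetch both that element and the data item it points to.
        i = addr - INDEX_BASE + lookahead
        if i < len(index):
            cache.prefetch(INDEX_BASE + i)
            cache.prefetch(DATA_BASE + index[i])

    for i in range(len(index)):
        addr = INDEX_BASE + i
        cache.load(addr)
        if use_prefetcher and lo <= addr < hi:  # event detected by the filter
            on_index_load(addr)
        cache.load(DATA_BASE + index[i])
    return cache.misses
```

In this toy model, with a 1,000-element shuffled index array and a lookahead of 4, the demand misses drop from 2,000 without the helper to 8 with it (only the first four iterations miss). The numbers say nothing about real hardware; they only show why a stride predictor, which cannot follow `index[i]`, is outperformed by a helper that can read the index array itself.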
 
Title Main processor error detection using checker processors 
Description A main processor 6, which may support out-of-order execution, executes a main stream of program instructions (30, fig. 2) and two or more checker processors 20, e.g. in-order processors, execute checker parallel streams of instructions (34, fig. 2) corresponding to different portions of the main stream. Errors are detected when a mismatch is identified between the outcome of a given portion of the main stream and the outcome of the corresponding checker stream. Lower-performance checker cores 20 may be shared between multiple main high-performance CPUs 6. Each checker stream may correspond to a portion of the main stream executed by the main processor between two successive checking boundary events, e.g. exceptions, control-flow-changing instructions or checking barrier instructions. Checkpoint entries, indicative of an architectural state of the main processor, may be stored and an error recovery operation based on a checkpoint entry may be triggered in response to detection of an error. The main processor 6 may defer a store instruction targeting a particular address until error checks are complete or commit a transaction to memory before error detection circuitry 28 has detected whether an error occurred for that store instruction.
IP Reference GB2555628 
Protection Patent granted
Year Protection Granted 2018
Licensed No
Impact None
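
The checker-processor scheme in the record above can be sketched in software. The following is a minimal illustration under stated assumptions, not the patented hardware: the two-register toy instruction set, the function names (`step`, `run_segment`, `run_with_checkers`), the fixed segment length used as a checking boundary, and the single-bit fault model are all hypothetical. The main stream is cut into segments, a checkpoint of the register state is stored at each boundary, checker workers re-execute the segments in parallel from those checkpoints, and a mismatch triggers recovery from the checkpoint of the first failing segment.

```python
from concurrent.futures import ThreadPoolExecutor


def step(regs, instr):
    """Execute one toy instruction (op, dest, src, immediate) in place."""
    op, rd, rs, imm = instr
    if op == "addi":
        regs[rd] = regs[rs] + imm
    elif op == "muli":
        regs[rd] = regs[rs] * imm
    else:
        raise ValueError(f"unknown op {op!r}")


def run_segment(start_regs, segment, flip_at=None):
    """Run a segment from a copy of start_regs; optionally inject a bit flip."""
    regs = dict(start_regs)
    for n, instr in enumerate(segment):
        step(regs, instr)
        if n == flip_at:
            regs[instr[1]] ^= 1  # hypothetical fault: flip bit 0 of the result
    return regs


def run_with_checkers(program, segment_len, fault=None, checkers=2):
    """Return (final_regs, indices of segments whose check failed).

    fault is None or (segment_index, instruction_index) for an injected error.
    """
    segments = [program[i:i + segment_len]
                for i in range(0, len(program), segment_len)]
    regs = {"r0": 1, "r1": 0}
    log = []  # (checkpoint at segment start, segment, main core's end state)
    for s, seg in enumerate(segments):
        flip = fault[1] if fault and fault[0] == s else None
        end = run_segment(regs, seg, flip)
        log.append((dict(regs), seg, end))
        regs = end

    # Checker workers re-execute every segment in parallel from its checkpoint.
    with ThreadPoolExecutor(max_workers=checkers) as pool:
        redone = list(pool.map(lambda e: run_segment(e[0], e[1]), log))
    bad = [i for i, (e, r) in enumerate(zip(log, redone)) if e[2] != r]

    if bad:  # recovery: roll back to the first failing checkpoint and redo
        regs = log[bad[0]][0]
        for seg in segments[bad[0]:]:
            regs = run_segment(regs, seg)
    return regs, bad
```

Because each checker starts from the main core's own checkpoint, a segment that merely inherits a corrupted state still checks out; only the segment in which the fault occurred is flagged, which is what makes the parallel re-execution possible.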
 
Title Chromium 
Description Web browser 
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact Influenced the design of the StarScan memory safety algorithms in both Chrome (Google's closed source browser) and Chromium (open-source). A YouTube discussion on an early version of this is available here: https://www.youtube.com/watch?v=ohlxw5kDn-k (mention of MarkUs being the influence at 29 minutes) and a blog post on the techniques here: https://security.googleblog.com/2022/05/retrofitting-temporal-memory-safety-on-c.html 
URL https://www.chromium.org/
 
Title DynamoRIO 
Description A dynamic binary instrumentation tool 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Contributed to the ARM 64-bit architecture port of DynamoRIO, allowing anybody to instrument AArch64 binaries using this tool. 
URL http://dynamorio.org/
 
Title Janus 
Description Janus is a binary parallelisation tool, also capable of automatically vectorising applications, as well as inserting useful software prefetches. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact We published two papers on Janus and it is the building block for on-going research in the group. 
URL https://www.cl.cam.ac.uk/~tmj32/data/
 
Title The Lynx Queue 
Description Lynx is a very fast single-producer, single-consumer software queue. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact We have used this queue to develop faster soft-error detection techniques. It has been downloaded 21 times by others. 
URL http://www.cl.cam.ac.uk/~tmj32/data/
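
For readers unfamiliar with the data structure, a single-producer, single-consumer queue can be sketched as a bounded ring buffer. This is a generic textbook-style illustration, not the Lynx implementation, whose specific optimisations are not described in this record; the `SpscQueue` class, its method names and the `transfer` helper are hypothetical. The key property is that the producer is the only writer of the tail index and the consumer is the only writer of the head index, so no lock is needed. In a language such as C or C++ the index updates would additionally need appropriate atomic memory ordering, which this Python sketch does not model.

```python
import threading
import time


class SpscQueue:
    """Bounded ring buffer for exactly one producer and one consumer thread."""

    def __init__(self, capacity):
        # One slot is left unused so that full and empty can be told apart.
        self._buf = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_push(self, item):
        """Producer side: return False if the queue is full."""
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:
            return False
        self._buf[self._tail] = item
        self._tail = nxt  # publish the item after it has been stored
        return True

    def try_pop(self):
        """Consumer side: return (True, item), or (False, None) if empty."""
        if self._head == self._tail:
            return False, None
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        return True, item


def transfer(items, capacity=4):
    """Send items from a producer thread to a consumer thread, in order."""
    queue = SpscQueue(capacity)
    received = []

    def producer():
        for item in items:
            while not queue.try_push(item):
                time.sleep(0)  # queue full: yield to the consumer

    def consumer():
        while len(received) < len(items):
            ok, item = queue.try_pop()
            if ok:
                received.append(item)
            else:
                time.sleep(0)  # queue empty: yield to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

A queue of this kind is the natural channel between a main thread and a checker thread in software error detection: the main thread pushes the values to be verified and the checker pops and compares them.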