M3: Managing Many-Cores for the Masses
Lead Research Organisation:
University of Cambridge
Department Name: Computer Science and Technology
Abstract
Since the invention of the microprocessor in the early 1970s, information and communication technologies have rapidly improved almost all aspects of our daily lives. Unrelenting growth in computing power has been the enabler for significant scientific advances and transformed our society in the way we work, play and interact with each other. Fundamentally, this has been achieved through continuous strides in semiconductor technology, repeatedly doubling the number of transistors available on a chip every 18 months (famously known as Moore's Law). At the same time, Dennard scaling ensured that overall chip power and area stayed constant, allowing computer architects to design more and more elaborate processors to take advantage of the increased transistor counts and provide faster and faster systems.
All this was performed transparently to the programmer, who simply had to switch to a new processor to make their applications run faster. There was no need to rewrite or modify parts of the program, or worry about the characteristics of the underlying platform. There was, in effect, a contract between the hardware and software, whereby the architect would provide more advanced systems and the programmer would concentrate on producing more sophisticated and innovative applications.
Serendipitous transistor scaling couldn't last forever though, and in the early part of this century we hit the "power wall". Transistors could no longer be made smaller yet achieve the same power efficiency gains as in previous technology generations. In response, industry switched to multicore designs, aiming to continue performance increases through thread-level parallelism. This, in effect, broke the contract, requiring software writers to create programs with an eye on performance, as well as functionality. Although there are some application domains that can benefit from this parallelism (e.g. games), there is little evidence that developers have risen to the challenge of writing multi-threaded code and even when they do, the runtime environment doesn't necessarily allow their threads to actually be executed in parallel.
The failure of Dennard scaling means that many-core chips are the logical next step. However, it also means that the energy problem will not decline, and key industry figures warn of "dark silicon", where we have abundant transistors on chip but lack the ability to power them all on at once. We may soon find that significant parts of each multicore chip must be turned off, due to power limitations. Added to this, as transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening the lifespan of the systems built from them. Unless we take drastic action, this convergence of challenges could seriously undermine our ability to continue advancing the systems that have revolutionised our lives.
This fellowship is a step towards meeting these challenges by reinstating the hardware / software contract, allowing the system to deal with each issue and leaving the application developer free to continue innovating. It will bring together a world-class research team that will investigate a variety of holistic schemes to achieve this aim based on automatic parallelisation, heterogeneous architectures and optical networks-on-chip, directly tackling the challenges laid out in the EPSRC cross-ICT priority of Many-Core Architectures and Concurrency in Distributed and Embedded Systems.
Planned Impact
The advent of mainstream multicore processors requires software developers to write multi-threaded applications to achieve high performance. At the same time, energy consumption is still a major challenge, with key industry figures warning of "dark silicon", where we have abundant transistors on chip but lack the ability to power them all on at once. Furthermore, as transistors shrink they become more susceptible to manufacturing variability, wearout and in-field faults, decreasing their reliability and shortening the lifespan of the systems built from them. Unless we take drastic action, this convergence of challenges could seriously undermine our ability to continue advancing the systems that have revolutionised our lives.
The main goal of this project is to tackle all three of these challenges transparently to the software developer, increasing the performance, reducing the energy consumption and ensuring the reliability of future many-core processors. The primary beneficiaries from this work will be academics and industry, where we will perform state-of-the-art research into automatic parallelisation, heterogeneous architectures and reliability techniques for many-core architectures.
This work is of significant value to industry, as demonstrated by the letter of support received from ARM, who will fund the three Ph.D. students due to work on this project. Dr. Jones has good links with the different divisions within ARM and these provide an obvious route to dissemination of ideas, as well as feedback on the developed techniques.
Furthermore, at an early stage in the project the core IP that will be generated from the research will be identified. With this in hand, the potential for further economic gain can be evaluated. The University of Cambridge has an excellent track record in commercial exploitation, for example XenSource.
As with any long-term research, it is difficult to express its impact on the wider research community and society. However, EPSRC's vision for a digital economy is underpinned by the need for high-performance, reliable and efficient electronics and systems. This proposal will significantly contribute towards this goal and to the EPSRC cross-ICT priority on Many-Core Architectures and Concurrency in Distributed and Embedded Systems.
People | Timothy Jones (Principal Investigator / Fellow) |
Publications
Madarbux M
(2014)
Towards zero latency photonic switching in shared memory networks
in Concurrency and Computation: Practice and Experience
Ho H
(2021)
Timed hyperproperties
in Information and Computation
Porpodas V
(2015)
Throttling Automatic Vectorization: When Less is More
Ainsworth S
(2020)
The Guardian Council
Sun P
(2021)
Speculative Vectorisation with Selective Replay
Hadade I
(2020)
Software Prefetching for Unstructured Mesh Applications
in ACM Transactions on Parallel Computing
Hadade I
(2018)
Software Prefetching for Unstructured Mesh Applications
Ainsworth S
(2019)
Software Prefetching for Indirect Memory Accesses: A Microarchitectural Perspective
in ACM Transactions on Computer Systems
Ainsworth S
(2017)
Software prefetching for indirect memory accesses
Soman J
(2015)
REPAIR: Hard-error recovery via re-execution
Porpodas V
(2015)
PSLP: Padded SLP automatic vectorization
Ainsworth S
(2020)
Prefetching in functional languages
Ainsworth S
(2019)
ParaMedic: Heterogeneous Parallel Error Correction
Ainsworth S
(2018)
Parallel Error Detection Using Heterogeneous Cores
Ainsworth S
(2021)
ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance
Ho H.-M.
(2019)
On verifying timed hyperproperties
in Leibniz International Proceedings in Informatics, LIPIcs
Ho H
(2018)
On Verifying Timed Hyperproperties
Valero A
(2017)
On microarchitectural mechanisms for cache wearout reduction
Valero A
(2017)
On Microarchitectural Mechanisms for Cache Wearout Reduction
in IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Ainsworth S
(2020)
MarkUs: Drop-in use-after-free prevention for low-level languages
Mitropoulou K
(2016)
Lynx
Savage J
(2020)
HALO: post-link heap-layout optimisation
Ainsworth S
(2016)
Graph Prefetching Using Data Structure Knowledge
Valero A
(2016)
Enhancing the L1 Data Cache Design to Mitigate HCI
in IEEE Computer Architecture Letters
Kohn T
(2020)
Dynamic pattern matching with Python
Dubach C
(2013)
Dynamic microarchitectural adaptation using machine learning
in ACM Transactions on Architecture and Code Optimization
Mitropoulou K.
(2016)
COMET: Communication-optimised multi-threaded error-detection technique
in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES 2016
Mitropoulou K
(2016)
COMET
Xia H
(2019)
CHERIvoke
Campanoni S
(2017)
Automatically accelerating non-numerical programs by architecture-compiler co-design
in Communications of the ACM
Ainsworth S
(2018)
An Event-Triggered Programmable Prefetcher for Irregular Workloads
in ACM SIGPLAN Notices
Description | We have developed schemes to accelerate parallel code within multicore processors, extended the reach of vectorisation, and found that many transient and permanent errors within processors can be overcome through the addition of an array of small, power-efficient cores capable of re-executing code in parallel. These can also improve the performance of many HPC applications by bringing data into the core in advance of it being required. |
Exploitation Route | Our work can be used by industry to improve the performance and reliability of microprocessors. |
Sectors | Digital/Communication/Information Technologies (including Software), Electronics |
URL | http://www.cl.cam.ac.uk/~tmj32/ |
Description | Close collaboration with Arm Ltd is on-going and exploring the long-term benefits of our work. We have jointly filed three patent applications (one accepted) and Arm are evaluating our work for a future product. |
First Year Of Impact | 2015 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Economic |
Description | EPSRC Responsive Mode Grant |
Amount | £1,062,734 (GBP) |
Funding ID | EP/P020011/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 06/2017 |
End | 12/2020 |
Description | Heterogeneous Parallel Reliability |
Amount | £686,644 (GBP) |
Organisation | Huawei Technologies Research and Development UK Ltd |
Sector | Private |
Country | United Kingdom |
Start | 06/2022 |
End | 07/2026 |
Description | ParaSol: Fine-Grained Thread-Level Parallelism for Single-Threaded Performance |
Amount | £1,091,792 (GBP) |
Funding ID | EP/W00576X/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2022 |
End | 03/2025 |
Description | Programmable Real-Time Security |
Amount | £732,581 (GBP) |
Organisation | Huawei Technologies Research and Development UK Ltd |
Sector | Private |
Country | United Kingdom |
Start | 09/2020 |
End | 03/2024 |
Title | Research data supporting "An Event-Triggered Programmable Prefetcher for Irregular Workloads" |
Description | Source code for the LLVM passes for automating programmable prefetching, as well as code modifications to gem5 to evaluate programmable prefetching, and associated benchmarks. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Title | Research data supporting "CHERIvoke: Characterising Pointer Revocation using CHERI Capabilities for Temporal Memory Safety" |
Description | Source code for sweeper and modified dlmalloc (called dlmalloc_cherivoke) that implements the CHERIvoke technique. See the file README.md for a detailed description and usage instructions. |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Title | Research data supporting "HALO: Post-Link Heap-Layout Optimisation" |
Description | |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
URL | https://www.repository.cam.ac.uk/handle/1810/300136 |
Title | Research data supporting "High Performance Fault Tolerance Through Predictive Instruction Re-Execution" |
Description | Source code for simulator modules to implement schemes in the paper. |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
Title | Research data supporting "Quantifying the Semantic Gap Between Serial and Parallel Programming" |
Description | Source code for the VIA tool used to analyse applications. Please see the README file for details. |
Type Of Material | Database/Collection of data |
Year Produced | 2022 |
Provided To Others? | Yes |
Impact | Publication of the research paper "Quantifying the Semantic Gap Between Serial and Parallel Programming" |
URL | https://www.repository.cam.ac.uk/handle/1810/340561 |
Title | Research data supporting "The Janus Triad: Exploiting Parallelism Through Dynamic Binary Modification" |
Description | A static binary analysis tool, profile information, a DynamoRIO client and benchmarks to show binary parallelisation using Janus. For more information, see the README.txt file. |
Type Of Material | Database/Collection of data |
Year Produced | 2019 |
Provided To Others? | Yes |
Description | ARM |
Organisation | Arm Limited |
Country | United Kingdom |
Sector | Private |
PI Contribution | Three PhD students work on topics related to the research grant. They have mentors in ARM to ensure that their work is relevant and goes in the direction that industry is heading. |
Collaborator Contribution | Funding for PhD students and appointing mentors for them. Students spend three months over the course of their studies within ARM Research. For the rest of the time they meet with their mentors once a month, on average. |
Impact | Two patent filings, not yet granted. Numerous publications. A follow-on grant. |
Start Year | 2013 |
Description | HiPEAC |
Organisation | European Commission |
Department | Seventh Framework Programme (FP7) |
Country | European Union (EU) |
Sector | Public |
PI Contribution | Attending meetings to disseminate results and interact with other researchers in the same area. |
Collaborator Contribution | A visit by a PostDoc from another member for 4 months. |
Impact | The network is on High-Performance and Embedded Architectures and Compilers |
Start Year | 2011 |
Description | NESUS |
Organisation | Network for Sustainable Ultrascale Computing (NESUS) |
Country | Global |
Sector | Academic/University |
PI Contribution | Visit to initial kick-off meeting to discuss our work. |
Collaborator Contribution | The whole consortium aims to meet the challenges of sustainable ultrascale computing. |
Impact | The collaboration crosses disciplines within Computer Science. These are programming languages, compilers, runtimes, computer architecture and networks. |
Start Year | 2014 |
Title | AN APPARATUS AND METHOD FOR SPECULATIVELY VECTORISING PROGRAM CODE |
Description | An apparatus and method are provided for speculatively vectorising program code. The apparatus comprises processing circuitry for executing program code, the program code including an identified code region comprising at least a plurality of speculative vector memory access instructions. Execution of each speculative vector memory access instruction is employed to perform speculative vectorisation of a series of scalar memory access operations using a plurality of lanes of processing. Tracking storage is used to maintain, for each speculative vector memory access instruction, tracking information providing an indication of a memory address being accessed within each lane. Checking circuitry then references the tracking information during execution of the identified code region by the processing circuitry, in order to detect any inter lane memory hazard resulting from the execution of the plurality of speculative vector memory access instructions. For at least a first type of inter lane memory hazard, a status storage element is used to maintain an indication of each lane for which the checking circuitry has determined the presence of that type of memory hazard. Replay determination circuitry is then arranged, when an end of the identified code region is reached, to be responsive to the status storage element identifying at least one lane as having an inter lane memory hazard, to trigger re-execution of the identified code region for each lane identified by the status storage element. Such an approach can significantly increase the ability to vectorise scalar code, hence resulting in significant performance improvements. |
IP Reference | WO2021001641 |
Protection | Patent granted |
Year Protection Granted | 2021 |
Licensed | No |
Impact | Arm internal evaluation |
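The selective-replay idea in the description above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuitry: the loop body `a[dst[i]] = a[src[i]] + 1`, the four-lane width, the conservative hazard rule (any read-after-write, write-after-write or write-after-read against an earlier lane) and all function names are hypothetical choices made for illustration. In this model, stores from flagged lanes are held back and those lanes are then re-executed in lane order.

```python
from typing import List, Set, Tuple

LANES = 4  # assumed vector width for this sketch


def scalar_reference(a: List[int], src: List[int], dst: List[int]) -> List[int]:
    """Sequential semantics of the loop: a[dst[i]] = a[src[i]] + 1."""
    out = list(a)
    for s, d in zip(src, dst):
        out[d] = out[s] + 1
    return out


def find_hazard_lanes(src: List[int], dst: List[int]) -> Set[int]:
    """Use per-lane address tracking to flag lanes that conflict with an earlier lane."""
    flagged: Set[int] = set()
    seen_reads: Set[int] = set()
    seen_writes: Set[int] = set()
    for lane, (s, d) in enumerate(zip(src, dst)):
        # RAW (s written earlier), WAW (d written earlier) or WAR (d read earlier).
        if s in seen_writes or d in seen_writes or d in seen_reads:
            flagged.add(lane)
        seen_reads.add(s)
        seen_writes.add(d)
    return flagged


def speculative_vector_loop(
    a: List[int], src: List[int], dst: List[int], lanes: int = LANES
) -> Tuple[List[int], int]:
    """Run the loop in vector groups; return (final memory, number of replayed lanes)."""
    out = list(a)
    replayed = 0
    for base in range(0, len(src), lanes):
        s_grp = src[base:base + lanes]
        d_grp = dst[base:base + lanes]
        flagged = find_hazard_lanes(s_grp, d_grp)
        # Speculative vector step: gather every lane first, then commit
        # only the stores of lanes that were not flagged.
        loaded = [out[s] for s in s_grp]
        for lane, d in enumerate(d_grp):
            if lane not in flagged:
                out[d] = loaded[lane] + 1
        # Selective replay: re-execute only the flagged lanes, in lane order.
        for lane in sorted(flagged):
            out[d_grp[lane]] = out[s_grp[lane]] + 1
            replayed += 1
    return out, replayed
```

For `src = [0, 1, 2, 3]` and `dst = [1, 2, 3, 4]`, lane 0 completes speculatively and lanes 1 to 3 are replayed, giving the same memory contents as the scalar loop.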
Title | Event triggered programmable prefetcher |
Description | A main processor 4 executes a main program and has an associated cache 6. Event detection circuitry 12 detects events consequent upon execution of the main program and indicative of data to be used by the main processor. A programmable further processor 16 and/or 18 is triggered by the detected events to execute a further program. The data to be used by the main processor are pre-fetched by circuitry 28 responsive to the further program. Event detection circuit 12 comprises programmable filters to detect memory operations directed to memory addresses within programmable ranges. Statistical analysis circuit 20 provides memory latency values that can be used by the further processor. Detected events comprise: cache fetch, fill, eviction, hit, miss events; branch prediction; memory snoops; system events and interrupt signals. Prefetch performance for irregular patterns, e.g. pointer chasing or compressed sparse matrices, is improved over static hardware pre-fetch mechanisms, e.g. stride prediction. |
IP Reference | GB2544474 |
Protection | Patent granted |
Year Protection Granted | 2017 |
Licensed | No |
Impact | None |
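As a rough illustration of the event-triggered scheme described above, the following toy model counts demand misses for an indirect access pattern `data[index[i]]`. It is a sketch only: the `ToyCache` class, the unbounded cache, the look-ahead `distance` parameter and the handler are all hypothetical, timing is ignored (a prefetch is assumed to complete instantly), and the handler reads `index` directly rather than waiting for a fill event.

```python
from typing import Callable, List, Optional, Tuple

Address = Tuple[str, int]  # (region name, element offset): a stand-in for a real address


class ToyCache:
    """Unbounded cache that reports every demand load to an optional event handler."""

    def __init__(self, on_load: Optional[Callable[["ToyCache", Address], None]] = None):
        self.lines = set()
        self.missed: List[Address] = []
        self.on_load = on_load

    def prefetch(self, addr: Address) -> None:
        self.lines.add(addr)

    def load(self, addr: Address) -> None:
        if addr not in self.lines:
            self.missed.append(addr)
            self.lines.add(addr)
        if self.on_load is not None:
            self.on_load(self, addr)  # event: a demand load was observed


def data_misses(index: List[int], distance: int) -> int:
    """Traverse data[index[i]] and return the number of demand misses on `data`."""

    def handler(cache: ToyCache, addr: Address) -> None:
        region, i = addr
        if region != "index":  # programmable filter: react only to index-array loads
            return
        ahead = i + distance
        if ahead < len(index):
            cache.prefetch(("data", index[ahead]))

    cache = ToyCache(on_load=handler if distance > 0 else None)
    for i, target in enumerate(index):
        cache.load(("index", i))
        cache.load(("data", target))
    return sum(1 for region, _ in cache.missed if region == "data")
```

With `index = [5, 2, 9, 7, 1, 8]`, a look-ahead distance of 2 leaves only the first two `data` accesses as demand misses in this model, whereas all six miss when the handler is disabled (`distance = 0`).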
Title | Main processor error detection using checker processors |
Description | A main processor 6, which may support out-of-order execution, executes a main stream of program instructions (30, fig. 2) and two or more checker processors 20, e.g. in-order processors, execute checker parallel streams of instructions (34, fig. 2) corresponding to different portions of the main stream. Errors are detected when a mismatch is identified between the outcome of a given portion of the main stream and the outcome of the corresponding checker stream. Lower-performance checker cores 20 may be shared between multiple main high-performance CPUs 6. Each checker stream may correspond to a portion of the main stream executed by the main processor between two successive checking boundary events, e.g. exceptions, control-flow-changing instructions or checking barrier instructions. Checkpoint entries, indicative of an architectural state of the main processor, may be stored and an error recovery operation based on a checkpoint entry may be triggered in response to detection of an error. The main processor 6 may defer a store instruction targeting a particular address until error checks are complete or commit a transaction to memory before error detection circuitry 28 has detected whether an error occurred for that store instruction. |
IP Reference | GB2555628 |
Protection | Patent granted |
Year Protection Granted | 2018 |
Licensed | No |
Impact | None |
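The checkpoint-and-compare scheme described above can be sketched in software. This is a minimal model under stated assumptions, not the patented design: the single-accumulator instruction set, the fixed segment length, the injected bit flip and every function name are hypothetical. Each segment is re-executed from the main processor's starting checkpoint, so segments can be checked independently of one another.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, int]              # (opcode, operand) for a one-register toy machine
Segment = Tuple[int, int, List[Instr]]  # (start checkpoint, end state, instructions)


def step(state: int, instr: Instr) -> int:
    """Execute one instruction on the accumulator."""
    op, operand = instr
    if op == "add":
        return state + operand
    if op == "mul":
        return state * operand
    raise ValueError(f"unknown opcode: {op}")


def run_main(
    program: List[Instr], segment_len: int, fault_at: Optional[int] = None
) -> List[Segment]:
    """Main processor: run the program, recording a checkpoint per segment.

    If fault_at is given, the result of that instruction has its lowest bit
    flipped to imitate a transient error.
    """
    state = 0
    segments: List[Segment] = []
    for base in range(0, len(program), segment_len):
        segment = program[base:base + segment_len]
        start = state
        for offset, instr in enumerate(segment):
            state = step(state, instr)
            if fault_at == base + offset:
                state ^= 1  # injected single-bit flip
        segments.append((start, state, segment))
    return segments


def check_segments(segments: List[Segment]) -> List[int]:
    """Checker cores: re-execute each segment and report mismatching segment indices."""
    bad: List[int] = []
    for i, (start, end, segment) in enumerate(segments):
        state = start
        for instr in segment:
            state = step(state, instr)
        if state != end:
            bad.append(i)
    return bad
```

In this model an error is reported only for the segment in which it occurred, because later segments are checked from the main processor's own (already corrupted) checkpoint; recovery would roll back to the start of the failing segment.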
Title | Chromium |
Description | Web browser |
Type Of Technology | Webtool/Application |
Year Produced | 2020 |
Open Source License? | Yes |
Impact | Influenced the design of the StarScan memory safety algorithms in both Chrome (Google's closed source browser) and Chromium (open-source). A YouTube discussion on an early version of this is available here: https://www.youtube.com/watch?v=ohlxw5kDn-k (mention of MarkUs being the influence at 29 minutes) and a blog post on the techniques here: https://security.googleblog.com/2022/05/retrofitting-temporal-memory-safety-on-c.html |
URL | https://www.chromium.org/ |
Title | DynamoRIO |
Description | A dynamic binary instrumentation tool |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | Contributed to the ARM 64-bit architecture port of DynamoRIO, allowing anybody to instrument AArch64 binaries using this tool. |
URL | http://dynamorio.org/ |
Title | Janus |
Description | Janus is a binary parallelisation tool, also capable of automatically vectorising applications, as well as inserting useful software prefetches. |
Type Of Technology | Software |
Year Produced | 2019 |
Open Source License? | Yes |
Impact | We published two papers on Janus and it is the building block for on-going research in the group. |
URL | https://www.cl.cam.ac.uk/~tmj32/data/ |
Title | The Lynx Queue |
Description | Lynx is a very fast single-producer, single-consumer software queue. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | We have used this queue to develop faster soft-error detection techniques. It has been downloaded 21 times by others. |
URL | http://www.cl.cam.ac.uk/~tmj32/data/ |