Compiling for Energy Efficiency in Multicore Memory Hierarchies

Lead Research Organisation: University of Edinburgh
Department Name: Institute for Computing Systems Architecture

Abstract

Over the past few years, processor manufacturers have switched from single-core designs to multicore architectures. In these new devices, two or more processing cores are placed on a single chip and linked together to enable several applications to run at exactly the same time. Examples of current multicore architectures include the Intel Core 2 Quad and the Cell Broadband Engine.

On each processing core, several threads of execution can run in parallel with each other. Each thread is simply a stream of instructions from a program that must be executed in a particular order so that a certain task is performed. For example, one thread might be loading a web page in a browser whilst another is playing some music. Manufacturers are relying on this thread-level parallelism to maintain the performance gains that have been achieved in each new generation of processors over the last 40 years. However, power efficiency continues to be a major issue for the processor industry as manufacturers seek to maximise the usage of the transistors on-chip, delivering high performance with low energy.

The cache hierarchy is one element of a multicore system where tackling these challenges can make a significant difference. A cache is a fast memory, usually on the same chip as the processing cores themselves. Each cache stores a copy of the frequently used instructions and data so that the processor has easy access to them, instead of having to wait for a slow, off-chip memory. The caches occupy a significant fraction of the total chip area and thus consume a large percentage of the total system power. Here also, threads interact with each other, competing for resources and consuming a significant amount of electrical energy.

This proposal seeks to address these issues by using the compiler to drive energy efficiency. The compiler is the tool that converts a program from a human-readable format into the 1s and 0s that run on the actual machine. Along the way it performs some analysis and optimisation to make the program run as fast as possible. This proposal will consider the impact of compiler-inferred knowledge during compilation and runtime, enabling the generation of energy-efficient programs that can automatically influence energy saving in the underlying environment.

The proposal will consider two complementary project themes: level 2 cache management and D-NUCA designs. The first will consider energy saving schemes that can place parts of the second level cache into low power sleep modes. These schemes will be able to use both state-preserving (i.e. the data is retained) and state-destroying (i.e. the data is lost) techniques, with the compiler turning off parts of the cache at both a coarse granularity (e.g. each cache bank) and a finer one (e.g. individual cache lines). This work will consider the trade-offs between static energy savings and increased dynamic energy consumption through extra cache misses.

The second topic in this research will consider an emerging cache architecture: D-NUCA (Dynamic Non-Uniform Cache Architecture) designs. As the name suggests, this type of cache has a variable latency to access different data within it. This proposal will develop a technique to influence the data management policy of the cache to maintain the high performance and flexibility of this paradigm, yet also provide opportunities for static energy reduction. Furthermore, the scheme will proactively leverage the existing data migration infrastructure to move certain information around the cache, when beneficial, for increased static energy savings.
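
The trade-off between static energy savings and extra dynamic energy in the first theme can be illustrated with a small break-even model. This is a minimal sketch under stated assumptions, not the project's method: the `SleepMode` type, the `net_saving_nj` function and every numeric value are hypothetical and chosen only to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class SleepMode:
    """Hypothetical description of a low-power mode for one cache bank."""
    name: str
    leakage_fraction: float      # share of full leakage power still drawn while asleep
    transition_energy_nj: float  # energy to enter and leave the mode
    preserves_state: bool        # True: data retained; False: data lost


def net_saving_nj(mode, idle_ns, leakage_mw, extra_misses, miss_energy_nj):
    """Net energy saved (nJ) by sleeping a bank for idle_ns nanoseconds.

    Static saving: leakage avoided while asleep (1 mW * 1 ns = 1 pJ).
    Dynamic cost: transition energy, plus refill misses if the data was lost.
    A negative result means sleeping the bank wastes energy.
    """
    saved_nj = leakage_mw * idle_ns * (1.0 - mode.leakage_fraction) / 1000.0
    refill_nj = 0.0 if mode.preserves_state else extra_misses * miss_energy_nj
    return saved_nj - mode.transition_energy_nj - refill_nj


# Illustrative (made-up) modes: one state-preserving, one state-destroying.
drowsy = SleepMode("state-preserving", 0.25, 2.0, True)
gated = SleepMode("state-destroying", 0.0, 5.0, False)

for idle_ns in (1_000, 100_000):
    for mode in (drowsy, gated):
        saving = net_saving_nj(mode, idle_ns, leakage_mw=10.0,
                               extra_misses=50, miss_energy_nj=10.0)
        print(f"{mode.name:>16} idle={idle_ns:>7} ns net={saving:8.1f} nJ")
```

With these made-up numbers, the state-destroying mode only pays off when the idle period is long enough for the avoided leakage to outweigh the cost of refetching the lost data. Estimating such idle periods is the kind of decision the proposal suggests a compiler could inform.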

Planned Impact

A) Beneficiaries

The beneficiaries of this work fall broadly into two categories: processor and compiler manufacturers; and the wider population.

1) Industry

Around the UK are a number of companies working within the processor and compiler sectors. These businesses will benefit through the incorporation of the compiler analysis and microarchitectural optimisations into their products or design solutions, giving them an energy advantage over their rivals. Further afield, international companies such as Intel, IBM, Microsoft and ACE will all directly benefit from this research. These benefits will take several years to realise fully due to the lead times inherent in this market sector.

2) The Wider Population

Given the ubiquity of multicore processors in the wide range of computing environments, this research will have an impact on the majority of the population. This can be summarised within the following list:

*) Users of mobile devices requiring one or more computer processors will find that battery life is improved due to the energy savings achieved by the optimisations proposed in this work;

*) A reduction in the energy required to power desktop machines means that manufacturers can reduce the cooling solutions built into and on top of the processor. The removal of the fan, in particular, would reduce the noise emitted by the machine, increasing quality of life and productivity for the user;

*) Modern supercomputers and data centres consume a large amount of power. Any reduction in their energy requirements would increase the number of devices that could be placed in a given area, increasing processing power. This would, in turn, provide faster, cheaper services and additional functionality to the customers of the businesses that rely on these environments.

A reduction in the energy required to operate a processor translates directly into savings in electricity bills. This has both financial implications and environmental benefits.
B) Providing Opportunities to Benefit

Maximum impact of the project outcomes will be achieved through collaborations and commercialisation.

1) Collaborations

Over the past six years the principal investigator has built up collaborations with researchers around the world and these will continue throughout this project. In particular, collaborations with Dr Avi Mendelson (Microsoft Research, Israel) and Professor Antonio Gonzalez will be hugely beneficial and enable effective exploitation of the project results via industrial routes. In addition to this, Dr Jones has extensive industrial contacts obtained through the HiPEAC network of excellence. As technical leader of adaptive compilation within this European network, the principal investigator is ideally placed to exploit these contacts to advance the research within this project. Finally, within the UK, the PI has contacts with XMOS in Bristol. One main feature of the processors developed by this company is the transfer of decision-making from runtime to the compiler. This work will enable further tasks to be shifted from processor to compiler, enabling XMOS to produce chips that are even more energy-efficient than their own current designs.

2) Commercialisation

The primary intellectual property output from this project will be an understanding of how the compiler can be used to influence power and energy in a multicore processor's cache hierarchy. The University of Edinburgh has a proven track record in the commercialisation of its inventions and other IP emerging from its strong research base. The management of the University's IP is undertaken by Edinburgh Research & Innovation (ERI), the commercialisation organisation for the University. This operates an ambitious commercialisation strategy which focuses on identifying, protecting and subsequently exploiting such inventions and IP.

Publications

Lira J (2012) The migration prefetcher: Anticipating data promotion in dynamic NUCA caches, in ACM Transactions on Architecture and Code Optimization

 
Description: Within the field of microprocessor caching, we have considered work in two directions. Firstly, we have looked at reducing the overheads of coherence with sparse directories. We have shown that only the locations of shared data within the system caches need to be tracked, and that private data can be ignored. This allows increased performance and a reduction in power and network traffic. Secondly, we have considered moving data around in a dynamic non-uniform cache architecture (D-NUCA). We have developed simple schemes to predict which cores will need the data and proactively move it towards those cores before it is required. This brings performance improvements and reductions in access latency.
Exploitation Route: Processor designers, when implementing their memory systems with sparse directories for cache coherence, can use the results of this research to reduce the overheads of their directories. Likewise, those implementing systems with NUCA caches can use our technique to speed up accesses to frequently-used data.
Sectors: Digital/Communication/Information Technologies (including Software)
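
The sparse-directory finding in the description above (track shared data, ignore private data) can be illustrated by counting how many cache blocks in an access trace are touched by more than one core. This is only a back-of-envelope sketch: the `directory_demand` function and the trace are hypothetical, and it does not model how the project's scheme identifies private data in hardware or software.

```python
from collections import defaultdict


def directory_demand(trace):
    """Count blocks a directory would track, for (core_id, block_addr) accesses.

    Returns (all_blocks, shared_blocks): entries needed if every cached block
    is tracked, versus entries needed if only blocks touched by two or more
    cores are tracked.
    """
    sharers = defaultdict(set)
    for core, block in trace:
        sharers[block].add(core)
    shared = sum(1 for cores in sharers.values() if len(cores) > 1)
    return len(sharers), shared


# Made-up trace: only block 0x100 is accessed by more than one core.
trace = [(0, 0x100), (0, 0x140), (1, 0x100), (1, 0x180), (2, 0x1C0), (0, 0x140)]
all_blocks, shared_blocks = directory_demand(trace)
print(f"track everything: {all_blocks} entries; shared only: {shared_blocks} entries")
```

In this toy trace, three of the four blocks are private to a single core, so a directory that ignores them needs far fewer entries. Smaller directories are one plausible source of the reported reductions in power and network traffic.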

 
Description: Intel have evaluated the use of the developed scheme.
First Year Of Impact: 2012
Sector: Digital/Communication/Information Technologies (including Software)
Impact Types: Economic