Garbage Collection for Multicore Platforms

Lead Research Organisation: University of Kent
Department Name: Sch of Computing

Abstract

Developers are increasingly turning to languages like Java and C# for their ease of development, deployment and maintenance. Most applications for the foreseeable future will be written in languages supported by managed runtimes, running on multicore hardware. Particular benefits of managed runtimes include support for automatic dynamic memory management, or 'garbage collection' (GC), and threads. GC allows programs to recycle unused memory automatically, without error-prone programmer intervention. Threading allows a program to run different sequences of instructions in parallel; for instance, a web server might employ a separate thread for each incoming request from internet browsers.

One of the most significant recent developments for language implementers is the arrival of multicore processors, with the number of cores deployed in commodity platforms expected to increase significantly over the next 5 years. The complexity of the processor's access to memory has also increased, both in the levels of the memory hierarchy and in the technology interconnecting processors. However, modern runtime technology has not evolved as fast as hardware technology, and how to exploit hardware parallelism fully remains an open question.

This research asks: how can we exploit hardware parallelism, in particular by running multiple user program ('mutator') and GC threads? How can we avoid paying penalties for non-local memory access, yet still benefit from multiple paths to memory? How can we take proactive advantage of locality? How can we minimise synchronisation between mutator and GC threads?

Today's concurrent GC techniques avoid relocating live objects so as to minimise the need for this synchronisation, but this leads to poor use of memory, with many small holes but nowhere to accommodate larger objects ('fragmentation'). The standard solution to fragmentation, periodic compaction phases, has high overheads, and often lacks portability or leads to throughput slumps before mutator threads can operate at full speed again. Memory management will be a bottleneck for the next generation of increasingly thread-parallel software unless the problem of high-performance GC for multicore can be solved. This proposal aims to address this key problem, reconciling compaction with concurrency and performance.

We believe that the key to exploiting modern multicore architectures is a judicious division of effort between mutator and GC threads, in order not simply to avoid paying the price of accessing non-local memory, but proactively to process data while it is in the cache. It is almost always worth executing a few more instructions in order to avoid accessing non-local data. Thus, if a mutator thread is about to access data (and hence that data is, or soon will be, in the cache), it should perform some GC work on that data immediately. Other data should be left to separate GC threads. At no time should all mutator threads be halted waiting for the collector. Our aim is therefore to provide high throughput and very low pauses, using parallel copying collector threads that run concurrently with, and are carefully coupled to, the mutators in order to exploit locality.

This research will benefit GC researchers by broadening the design space and by devising and evaluating new concurrent GC techniques, and developers in the broader community by enabling them to tune their applications and GCs to modern architectures.
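The idea that a mutator should perform GC work on the data it is about to touch can be sketched, very roughly, as follows. This is an illustrative toy only: all class and method names are hypothetical, synchronisation between threads is elided, and it is not the design proposed here.

    // Toy sketch of the division of labour described above (hypothetical
    // names): a mutator about to use an object does the GC work for that
    // object itself, while the object is likely to be cache-resident;
    // everything else is left to concurrent collector threads.
    final class SelfScavenge {
        static final class Obj {
            int payload;
            Obj forwarded;                // set once the object has been copied
            Obj(int payload) { this.payload = payload; }
        }

        // Barrier run by a mutator thread on each reference it is about to follow.
        static Obj access(Obj ref) {
            if (ref == null) return null;
            if (ref.forwarded == null) {
                // The object is hot: copy it now rather than leaving it for a
                // collector thread to pull into its own cache later.
                ref.forwarded = new Obj(ref.payload);
            }
            return ref.forwarded;
        }
    }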
A high-performance GC tuned to modern multicore hardware will also lower the barrier to deployment of future software applications that expect to exploit multicore hardware fully. We will make all code developed freely available under an Open Source license. As well as disseminating our results through journals and conferences, we shall organise two workshops in order to build UK research strength in this field.

Planned Impact

The ubiquitous platform for the medium-term future (say, the next ten years) will almost certainly be multicore, multi-socket hardware executing a managed runtime for a language like Java or C#. Such hardware is already ubiquitous on modern desktops and laptops, and will soon appear in more mobile markets (netbooks, high-end phones). It is essential that the next generation of runtimes, such as Java virtual machines in general and their memory management subsystems in particular, can fully exploit the hardware parallelism on offer in order to provide a better experience for end users.

JikesRVM is the most widely used platform for research into the implementation of Java virtual machines. Currently, JikesRVM/MMTk provides no support for concurrent collectors. If successful, the research proposed here will open up the design space of concurrent, copying collectors. It will also provide a platform for further GC research, by embodying the knowledge gathered in a redesigned MMTk toolkit for building GCs.

The 2008 CPHC study of the IT labour market in the UK revealed that, while employers need more technical staff (and IT managers), the supply of computer science graduates is falling, putting the UK's position in the IT industry at risk. A supply of highly skilled human capital to UK technology companies is essential. This project will train a research assistant and a PhD student, thus contributing to the UK's skills base.

We plan to disseminate the results of our research in a variety of effective and appropriate ways. The PI has a track record of promoting research and the exchange of ideas. He is the author of the definitive book on GC, co-founded the International Symposium on Memory Management, co-established the ACM SIGPLAN/EAPLS workshop on Semantics, Program Analysis and Computing Environments for Memory Management as a regular workshop series, edited a special issue on memory management of the journal Science of Computer Programming, and maintains the widely used online GC bibliography, www.cs.ukc.ac.uk/~rej/gcbib/gcbib.html. As coordinator of the UK Memory Management Network (GR/R57140), he regularly brings together academic and industrial researchers and developers. The intensity of interest in this area was exemplified by the oversubscribed 2008 Workshop on Language and Runtime Support for Concurrent Systems, which attracted attendees from the UK, Europe and the USA. This proposal plans two further workshops on the same theme, which will provide an opportunity not only for researchers to exchange ideas but also to build UK research strength in this field.
 
Description 1) By comparing the cache behaviour of mutator insertion and deletion write barriers for concurrent/incremental collectors, in a VM-, GC- and hardware-agnostic manner, we demonstrate for the first time that deletion barriers generate more work for a concurrent GC than insertion barriers. However, we also find that the time between triggering a write barrier on an object and subsequently using that object can be much lower with a deletion barrier than with an insertion barrier, suggesting that deletion barriers may lead to better cache performance than has hitherto been expected.
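For illustration, the two barrier styles compared here can be sketched as follows. This is toy code with hypothetical names and no real VM integration: an insertion (Dijkstra-style) barrier shades the new target of a pointer write, while a deletion (Yuasa-style) barrier shades the old target that is about to be overwritten.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Toy object graph node and the two barrier flavours, for illustration only.
    final class Node {
        Node field;                      // a single reference field
        boolean marked;                  // collector's mark bit
    }

    final class WriteBarriers {
        // Work list of objects the concurrent collector still has to scan.
        static final Deque<Node> greyQueue = new ArrayDeque<>();

        // Insertion (Dijkstra-style) barrier: shade the NEW target of the write,
        // so the collector cannot miss an object freshly inserted into the graph.
        static void insertionWrite(Node src, Node newTarget) {
            if (newTarget != null && !newTarget.marked) {
                newTarget.marked = true;
                greyQueue.add(newTarget);
            }
            src.field = newTarget;
        }

        // Deletion (Yuasa-style) barrier: shade the OLD target before it is
        // overwritten, preserving a snapshot of the graph at collection start.
        static void deletionWrite(Node src, Node newTarget) {
            Node old = src.field;
            if (old != null && !old.marked) {
                old.marked = true;
                greyQueue.add(old);
            }
            src.field = newTarget;
        }
    }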



2) Garbage collectors must update all references to the objects they move. Updating is a lengthy operation, yet the updates must be transparent to the mutator running concurrently with the collector; the consequence is that no space can be reclaimed until all references have been updated. One solution is to replace direct references to objects with handles: these eliminate the updating problem and allow the space used by evacuated objects to be reused immediately. However, the execution-time overhead of handles has led to their being abandoned by most modern systems. We demonstrate optimisations for handles that nearly eliminate their overhead compared with other widely used real-time collectors.
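The handle mechanism can be illustrated with the following sketch (hypothetical names, not the implementation evaluated here): the mutator reaches an object only through its handle slot, so evacuating the object requires updating that single slot, and the vacated space can be reused at once.

    // Illustrative only: objects are reached via a handle table, so moving an
    // object means updating one handle slot instead of every reference to it.
    final class HandleTable {
        private final Object[] slots;
        HandleTable(int capacity) { slots = new Object[capacity]; }

        // A "reference" held by the mutator is just an index into the table.
        int allocateHandle(Object obj) {
            for (int i = 0; i < slots.length; i++) {
                if (slots[i] == null) { slots[i] = obj; return i; }
            }
            throw new OutOfMemoryError("handle table full");
        }

        // Every mutator access pays one extra indirection...
        Object deref(int handle) { return slots[handle]; }

        // ...but evacuation needs only this single update; no reference in the
        // mutator's roots or heap has to be found and patched.
        void relocate(int handle, Object newCopy) { slots[handle] = newCopy; }
    }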

3) We found, for the first time, that the hardware support for transactions provided by recent Intel processors (Transactional Synchronization Extensions, TSX) can be used to improve the performance of a fully concurrent, replicating garbage collector, provided that (i) care is taken to move as much work as possible outside the transaction, and (ii) there is sufficient work to amortise the cost of setting up a transaction. Using this approach, we obtained improvements in object copying speed of 48-100%. However, we found that it is also possible to match these gains using novel software transactional memory techniques.
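As a rough illustration of the synchronisation such a copying collector needs (hypothetical names; a compare-and-swap on a forwarding slot stands in here for the hardware or software transaction), several threads may race to copy the same object, and only a short atomic step should decide whose copy is installed; the bulk of the work is kept outside that step.

    import java.util.concurrent.atomic.AtomicReference;

    // A toy stand-in (not the project's code) for the synchronisation a
    // concurrent copying collector needs when several threads may try to copy
    // the same object. With hardware or software transactions, the short
    // install step (and possibly the copy itself) would run inside a
    // transaction; here a compare-and-swap plays that role.
    final class CopyProtocol {
        static final class Obj {
            final int[] fields;
            final AtomicReference<Obj> forward = new AtomicReference<>();
            Obj(int[] fields) { this.fields = fields; }
        }

        static Obj copy(Obj from) {
            Obj existing = from.forward.get();
            if (existing != null) return existing;        // already copied
            // Do as much work as possible OUTSIDE the critical/transactional step:
            Obj to = new Obj(from.fields.clone());
            // The short critical step: atomically install the copy. If another
            // thread beats us, discard our copy and use theirs.
            return from.forward.compareAndSet(null, to) ? to : from.forward.get();
        }
    }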

4) The Java language provides programmers with references of several degrees of 'strength'. However, the rules governing the reclamation of objects reachable from weak references are complex and difficult to implement in a fully concurrent garbage collector, a task made harder by the fact that the rules are explained only informally, in English. We formalised these rules mathematically, and implemented a high-performance, fully concurrent collector that handles reference types correctly (in doing so, fixing bugs in a widely used Java virtual machine).
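The behaviour being formalised can be seen in miniature with the standard java.lang.ref API. The short, illustrative program below shows a WeakReference being cleared and enqueued once its referent is no longer strongly reachable; whether and when this happens is at the collector's discretion, which is part of what makes the rules subtle.

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;

    // Standard weak-reference behaviour: once the referent is no longer
    // strongly reachable, the collector may clear the WeakReference and
    // enqueue it on its ReferenceQueue.
    public class WeakRefDemo {
        public static void main(String[] args) throws InterruptedException {
            ReferenceQueue<Object> queue = new ReferenceQueue<>();
            Object referent = new Object();
            WeakReference<Object> weak = new WeakReference<>(referent, queue);

            referent = null;          // drop the only strong reference
            System.gc();              // request a collection (a hint only)

            Reference<?> cleared = queue.remove(1000);   // wait up to 1s
            System.out.println(cleared == weak
                    ? "weak reference was cleared and enqueued"
                    : "not enqueued yet: clearing is at the collector's discretion");
        }
    }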

5) Building garbage collectors that are both performant and correct is hard; building fully concurrent, defragmenting collectors is harder still. Our design methodology emphasises a tricolour abstraction (representing the collector's knowledge of each object in the heap), invariants, expressed in terms of that abstraction, which both the user program and the garbage collector must maintain, and simple model checking. We demonstrate that this methodology is effective in constructing even the most complex garbage collectors, and gives assurance of their correctness.
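The tricolour abstraction itself is standard and can be sketched as follows (toy code, not the collector developed here): white objects are unvisited, grey objects are visited but not yet scanned, and black objects are fully scanned.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Illustration of the tricolour abstraction. The invariants the collector
    // and barriers maintain are phrased in these terms, e.g. "no black object
    // points directly to a white one".
    final class Tricolour {
        enum Colour { WHITE, GREY, BLACK }

        static final class Obj {
            Colour colour = Colour.WHITE;
            List<Obj> fields = List.of();
        }

        // One marking step: take a grey object, shade its white children grey,
        // then blacken it. Repeating until the grey set is empty leaves exactly
        // the reachable objects black.
        static void markFrom(Obj root) {
            Deque<Obj> grey = new ArrayDeque<>();
            root.colour = Colour.GREY;
            grey.add(root);
            while (!grey.isEmpty()) {
                Obj o = grey.remove();
                for (Obj child : o.fields) {
                    if (child.colour == Colour.WHITE) {
                        child.colour = Colour.GREY;
                        grey.add(child);
                    }
                }
                o.colour = Colour.BLACK;
            }
        }
    }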

6) Increasing levels of hardware parallelism are one of the main challenges for managed runtimes. Any concurrency or scalability improvement must be evaluated experimentally, but the application benchmarks available today may not reflect the highly concurrent applications we anticipate in the future, and may also behave in ways that VM developers do not expect. We provide a set of platform-independent, concurrency-related metrics and an in-depth observational study of current state-of-the-art benchmarks, discovering how concurrent they really are, how they scale their work, and how they synchronise and communicate via shared memory.
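As a flavour of what a simple, platform-independent concurrency measurement might look like (this is an illustrative sketch using the standard java.lang.management API, not the toolchain used in the study), the probe below samples how many threads are actually runnable over time; a benchmark that claims to be highly concurrent should keep several threads runnable most of the time.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Rough concurrency probe: periodically count RUNNABLE threads in this JVM.
    public class ConcurrencyProbe {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (int sample = 0; sample < 10; sample++) {
                ThreadInfo[] infos = threads.getThreadInfo(threads.getAllThreadIds());
                long runnable = 0;
                for (ThreadInfo info : infos) {
                    if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                        runnable++;
                    }
                }
                System.out.println("runnable threads: " + runnable);
                Thread.sleep(100);   // sampling interval
            }
        }
    }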



7) Experimental evaluation is key to systems research. Because modern systems, and especially concurrent systems, are complex and non-deterministic, good experimental methodology forces researchers to account for uncertainty. Unfortunately, the standards of reporting in computer science are often poor by comparison with other disciplines. One cause may be researchers' reluctance to spend large amounts of time conducting experiments in order to account sufficiently for variation. This work provides, for the first time, a statistically rigorous yet time-efficient methodology for the repetition and summarisation of results. Its time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the common worst case suggested by published corner-case studies. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations of a program). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment that is necessary and sufficient to obtain a given level of precision. We present our methodology as a cookbook that guides researchers on how to obtain reliable results.
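As a minimal illustration of the kind of summary involved (the methodology itself covers several levels of repetition and effect-size confidence intervals; this sketch shows only a single level with a normal approximation and made-up example numbers):

    // Summarising repeated measurements with a mean and an approximate 95%
    // confidence interval; single-level case only, for illustration.
    public class SimpleCI {
        public static void main(String[] args) {
            double[] executionMeans = {102.3, 99.8, 101.1, 100.4, 98.9};  // example data

            int n = executionMeans.length;
            double mean = 0;
            for (double x : executionMeans) mean += x;
            mean /= n;

            double variance = 0;
            for (double x : executionMeans) variance += (x - mean) * (x - mean);
            variance /= (n - 1);                       // sample variance

            // Normal-approximation half-width; with so few samples a Student-t
            // quantile would be more appropriate, but 1.96 keeps the sketch short.
            double halfWidth = 1.96 * Math.sqrt(variance / n);
            System.out.printf("mean = %.2f, 95%% CI = [%.2f, %.2f]%n",
                    mean, mean - halfWidth, mean + halfWidth);
        }
    }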
Exploitation Route Our work is of significance to any vendor of a managed runtime for languages such as Java or C#.

Our formalisation of Java's weak reference types complements the informal description found only in the class documentation for java.lang.ref.Reference. This precise, mathematical specification allows Java VM implementers to confirm their understanding of the rules that apply to Java's weak reference types.

Our development methodology is applicable to any garbage collector. It provides an effective way to develop correct collectors.

Our recommendations for rigorous benchmarking demonstrate not only how to make robust performance comparisons between different versions of a system, but also how to do so at the least cost in developer time.

We plan to make our tools and data open source in order to provide the greatest benefit to the community, academic and non-academic.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://www.cs.kent.ac.uk/projects/gc/mirrorgc
 
Description Our findings are potentially useful to any organisation developing high-performance garbage collectors. Our finding that, in certain special cases, software transactional memory can offer the same performance as Intel's hardware transactional memory extensions, while offering the advantage of portability, may have wider application. We have also demonstrated that model checking is a fairly simple technique, easy for programmers to use, for assuring the correctness of concurrent garbage collectors.
First Year Of Impact 2010
Sector Digital/Communication/Information Technologies (including Software),Education
Impact Types Economic

 
Title Rigorous benchmarking in reasonable time 
Description We describe a method for experimental computer science that is rigorous yet minimises the experiment time required, by minimising the number of repetitions needed at each level of an experiment. We also show how to use the method to quantify performance changes with effect-size confidence intervals. While the use of effect sizes is common in other scientific disciplines, it has largely been ignored in computer science. 
Type Of Material Improvements to research infrastructure 
Year Produced 2013 
Provided To Others? Yes  
Impact The method has been used by other researchers, e.g. Tratt's group at King's College London. The software has also been made available on request. 
URL http://www.cs.kent.ac.uk/~rej/
 
Title Tools for black-box understanding of concurrency of Java programs 
Description The "Type of research tool or method" response above is inappropriate but the least bad match. researchfish should fix this. We provide tools for analysing the scalability and concurrency bottlenecks in Java programs 
Type Of Material Improvements to research infrastructure 
Year Produced 2012 
Provided To Others? Yes  
Impact None known beyond our OOPSLA 2012 paper 
URL http://www.cs.kent.ac.uk/projects/gc/dacapo/
 
Title Tools for black-box understanding of concurrency of Java programs 
Description Source code of the tools used in our paper, A Black-box Approach to Understanding Concurrency in DaCapo, Tomas Kalibera, Matthew Mole, Richard Jones and Jan Vitek, OOPSLA 2012, doi: 10.1145/2398857.2384641, ACM 
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact OOPSLA 2012 paper 
URL http://www.cs.kent.ac.uk/projects/gc/dacapo/