Enabling High-performance Statistical Computing in R on Hybrid GPU and Multicore Architectures

Lead Research Organisation: Imperial College London
Department Name: Dept of Mathematics

Abstract

The R system for statistical computing is a popular, open-source platform used world-wide by a very large community of statisticians, engineers, physicists and other scientists. R is used across a diverse range of application areas including bioinformatics/genomics, cosmology, particle physics, astronomy and image processing. A key reason for R's popularity is its high-level, easy-to-learn language for programming with data, which enables users to perform many common data analytic tasks, including organization and manipulation of data sets, fitting statistical models, producing graphics, and executing a vast range of numerical computations and simulations. Although R's impact on many data-driven applications is undeniable, the recent huge increase in the size of modern data sets, coupled with the continuing development of highly sophisticated but computationally intensive data analytic techniques, has led to a serious data analysis bottleneck.

This project aims to enhance the capabilities of the R system for statistical computing by developing a framework for high-performance computing that takes full advantage of the specialized processing capabilities offered by the latest generation of 'commodity' hardware. Specifically, we recognize that almost all of today's statistical and analytical approaches rely heavily on common mathematical computations such as matrix algebra operations, and we seek to contribute to the development of a new generation of extremely fast mathematical libraries that will utilize the special processing capabilities of the modern multicore processors and graphics processing units (GPUs) that are now found in all personal computers and laptops. Given the widespread use of R across many scientific disciplines, we believe that this project will have immediate and far-reaching consequences, enabling high-performance statistical computing for the masses.

Planned Impact

1) Acceleration in the performance of R for large-scale statistical modeling: Our focus on accelerating the R system using GPU-based processing stems from our desire to deliver the last mile of high performance computing and put the computing potential of modern processors into the hands of end users: an estimated 250,000 or more statistical programmers, analysts, scientists and engineers use R on a regular basis. We plan to build on state-of-the-art research into linear algebra libraries that operate on GPU and hybrid architecture processors, and then develop R extensions that give R users transparent access to these libraries. Any user of R with an acceptable hardware configuration that includes a relatively recent graphics card will be able to make immediate use of this work and see the benefit in faster computation and increased productivity.

2) Deployment to e-Science and other data-intensive research communities: Many potential users of this work are scientists and engineers for whom learning how to program in R is low on their priority list. The e-Science community has made many advances in making programs written in R and other systems reusable by packaging them as modules for use in scientific workflow systems such as Discovery Net or Taverna. In these environments, non-expert users compose workflows to rapidly solve data analysis problems. By embedding the higher performance R developed in the project into a workflow system, we reach a larger group of users than R programmers alone.

3) Application of high performance computing in R to challenges in emerging and data-rich scientific fields: Much of the motivation for this project has been derived from observing application stakeholders in disciplines such as astrostatistics and statistical medical imaging. These are research areas where the size of the data and the complexity of the modelling mean that statistical analysis is a significant bottleneck to the pace of research. Working with such groups throughout this project to benchmark their current processes, prioritize the statistical functions that would most affect those processes, and accelerate these functions provides a key success metric for the project, as well as case studies and methodologies for research groups in similar areas.

4) Industrial uptake: The current growth in popularity of the R system has also attracted the attention of large industrial users such as Oracle, GE and Netezza. All three of these organizations are exploring ways of making maximum use of hardware in their own software products. Uptake of this work by organizations such as these would enable this research to be applied in an industrial context and, either via R or the underlying MAGMA libraries, be directly embedded into future database appliances and advanced analytics.

5) Training: Two PDRAs will be involved in this project, both working on making use of existing High Performance Computing (HPC) technology while aiming to make it available to a large number of users. The importance of this focus should not be understated. The rate of software improvement lags well behind the pace of hardware development. Unless more people are trained in delivering HPC to end users, this gap will widen. We anticipate that the RAs involved in this project will continue to work in the R community, continuing to add to this highly valuable open-source resource.

Publications

 
Description The R system for statistical computing is a popular, open-source platform used world-wide by a very large community of statisticians, engineers, physicists and other scientists. R is used across a diverse range of application areas including bioinformatics and finance. In this project we have developed software libraries that enhance the R system so that certain calculations can be performed more efficiently (and hence require less computing time) by leveraging consumer graphics cards (graphics processing units). This has enabled R users to achieve much higher performance using inexpensive hardware.
Exploitation Route We have developed open-source software that can be easily installed and used by any R user. R users can then build their own applications on top of our libraries, without additional effort, and achieve much higher computational performance.
Sectors Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology

 
Description NVIDIA collaboration 
Organisation NVIDIA
Country Global 
Sector Private 
PI Contribution We made no contributions to the company.
Collaborator Contribution NVIDIA is the world leader in visual computing and a producer of high-performance computing cards. NVIDIA contributed to the project by donating four graphics cards.
Impact The main outcome was a software library called HiPLAR (High Performance Linear Algebra in R), which delivers high performance linear algebra (LA) routines for the R platform for statistical computing using the latest software libraries for heterogeneous architectures. A secondary output was a publication that made use of the HiPLAR library: Wang Z. and Montana G. (2014) The graph-guided group lasso for genome-wide association studies. In "Regularization, Optimization, Kernels, and Support Vector Machines", Johan A.K. Suykens et al. (Editors).
Start Year 2012
 
Title HiPLAR
Description CUDA Library for 
Type Of Technology Software 
Year Produced 2011 
Open Source License? Yes  
Impact The software provides accelerated libraries for the R language for statistical computing. It has been used by a number of research groups and is featured on NVIDIA's website: https://developer.nvidia.com/hiplar.
URL https://developer.nvidia.com/hiplar