Enabling High-performance Statistical Computing in R on Hybrid GPU and Multicore Architectures

Lead Research Organisation: Imperial College London

Department Name: Mathematics

Abstract

The R system for statistical computing is a popular, open-source platform used worldwide by a very large community of statisticians, engineers, physicists and other scientists. R is used across a diverse range of application areas including bioinformatics/genomics, cosmology, particle physics, astronomy and image processing. A key reason for R's popularity is its high-level, easy-to-learn language for programming with data, which enables users to perform many common data analytic tasks, including organization and manipulation of data sets, fitting statistical models, producing graphics, and executing a vast range of numerical computations and simulations. Although R's impact on many data-driven applications is undeniable, the recent huge increase in the size of modern data sets, coupled with the continuing development of highly sophisticated but computationally intensive data analytic techniques, has led to a serious data analysis bottleneck.

This project aims to enhance the capabilities of the R system for statistical computing by developing a framework for high-performance computing that takes full advantage of the specialized processing capabilities offered by the latest generation of 'commodity' hardware. Specifically, we recognize that almost all of today's statistical and analytical approaches rely heavily on common mathematical computations such as matrix algebra operations, and we seek to contribute to the development of a new generation of extremely fast mathematical libraries that will utilize the special processing capabilities of modern multicore processors and graphics processing units (GPUs), which are now found in all personal computers and laptops. Given the widespread use of R across many scientific disciplines, we believe that this project will have immediate and far-reaching consequences, enabling high-performance statistical computing for the masses.

Planned Impact

1) Acceleration in the performance of R for large-scale statistical modeling: Our focus on accelerating the R system using GPU-based processing stems from our desire to deliver the last mile of high-performance computing and put the computing potential of modern processors into the hands of end users: an estimated 250,000 or more users, including statistical programmers, analysts, scientists and engineers, use R on a regular basis. We plan to build on state-of-the-art research into linear algebra libraries that operate on GPU and hybrid-architecture processors and then develop R extensions that give R users transparent access to these libraries. Any user of R with an acceptable hardware configuration that includes a relatively recent graphics card will be able to make immediate use of this work and see the benefit in faster computation and increased productivity.

2) Deployment to e-Science and other data-intensive research communities: Many potential users of this work are scientists and engineers for whom learning how to program in R is low on their priority list. The e-Science community has made many advances in making programs written in R and other systems reusable by packaging them as modules for use in scientific workflow systems such as Discovery Net or Taverna. In these environments, non-expert users compose workflows to rapidly solve data analysis problems. By embedding the higher-performance R developed in the project into a workflow system, we reach a larger group of users than R programmers alone.

3) Application of high-performance computing in R to challenges in emerging and data-rich scientific fields: Much of the motivation for this project has been derived from observing application stakeholders in disciplines such as astrostatistics and statistical medical imaging. These are research areas where the size of the data and the complexity of the modelling mean that statistical analysis is a significant bottleneck to the pace of research. Working with such groups throughout this project to benchmark their current processes, prioritize the statistical functions that would most affect these processes, and accelerate those functions provides a key success metric for the project, as well as case studies and methodologies for research groups in similar areas.

4) Industrial Uptake: The current growth in popularity of the R system has also attracted the attention of large industrial users such as Oracle, GE and Netezza. All three of these organizations are exploring ways of making maximum use of hardware in their own software products. Uptake of this work by organizations such as these would enable this research to be applied in an industrial context and, either via R or the underlying MAGMA libraries, to be directly embedded into future database appliances and advanced analytics.

5) Training: Two PDRAs will be involved in this project, both investigating how to make existing high-performance computing technology available to a wide number of users. This focus of the project should not be understated. The rate of software improvement lags desperately behind the pace of hardware development. Unless more people are trained in delivering HPC to end users, this gap will widen. We anticipate that the RAs involved in this project will continue to work in the R community, continuing to add to this highly valuable open-source resource.

Funded Value:

£346,272

Funded Period:

Oct 11 - Sep 13

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/I030638/1

Principal Investigator:

Giovanni Montana

Research Subject:

Info. & commun. Technol. (25%)

Mathematical sciences (25%)

Tools, technologies & methods (50%)

Research Topic:

Computer Sys. & Architecture (25%)

High Performance Computing (50%)

Statistics & Appl. Probability (25%)

Organisations

People	ORCID iD
Giovanni Montana (Principal Investigator)	http://orcid.org/0000-0003-3942-3900
Jack Dongarra (Co-Investigator)
Yi-Ke Guo (Co-Investigator)

Publications


Key Findings
Collaboration
Software and Technical Products


Description	The R system for statistical computing is a popular, open-source platform used worldwide by a very large community of statisticians, engineers, physicists and other scientists. R is used across a diverse range of application areas including bioinformatics and finance. In this project we have developed software libraries that enhance the R system so that certain calculations can be performed more efficiently (and hence require less computing time) by leveraging consumer graphics cards (graphics processing units). This has enabled R users to achieve much higher performance using inexpensive hardware.
Exploitation Route	We have developed open-source software that can be easily installed and used by any R user. R users can then build their own applications on top of our libraries, without additional effort, and achieve much higher computational performance.
Sectors	Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology


Description	NVIDIA collaboration
Organisation	NVIDIA
Country	Global
Sector	Private
PI Contribution	We made no contributions to the company.
Collaborator Contribution	NVIDIA is the world leader in visual computing and a producer of high-performance computing cards. NVIDIA contributed to the project by donating four graphics cards.
Impact	The main outcome was a software library called HiPlaR (High Performance Linear Algebra in R) to deliver high performance linear algebra (LA) routines for the R platform for statistical computing using the latest software libraries for heterogeneous architectures. A secondary output was a publication that made use of the HiPlaR library: Wang Z. and Montana G. (2014) The graph-guided group lasso for genome-wide association studies. In "Regularization, Optimization, Kernels, and Support Vector Machines", Johan A.K. Suykens et al (Editors).
Start Year	2012


Title	HiPlaR
Description	CUDA Library for
Type Of Technology	Software
Year Produced	2011
Open Source License?	Yes
Impact	The software provides accelerated libraries for the R language for statistical computing. It has been used by a number of research groups and is featured on NVIDIA's website: https://developer.nvidia.com/hiplar.
URL	https://developer.nvidia.com/hiplar
