The SPRINT approach to network biology

Lead Research Organisation: University of Edinburgh

Department Name: Sch of Biomedical Sciences

Abstract

The aim of this project is to promote the accessibility and usability of the Simple Parallel R INTerface (SPRINT). SPRINT is an innovative parallel tool kit for performing computationally challenging analysis workflows on post-genomic data using High Performance Computing (HPC). Specifically, we propose to improve the support for manipulating very large datasets, such as those from next generation sequencing, and the functionality required to implement machine learning approaches, such as clustering, to analyse complex, data-dependent experiments such as time series.

SPRINT and R are open source tools, free for all to use. SPRINT does not require any expert knowledge of HPC or parallel programming. It allows R users to run their analysis workflows on any HPC platform with minimal alterations to their existing scripts while still obtaining maximum performance.
SPRINT is implemented using the standard parallel programming tools C and MPI. It has two main components: an intelligent HPC harness that manages all aspects of accessing and working with HPC and a library of parallel R functions. SPRINT is fully scalable and portable. SPRINT is designed to use any number of nodes, from two to thousands. It runs on any HPC platform, from multi-core desktop to server, local cluster, supercomputer or cloud. SPRINT can tackle very large datasets including datasets larger than the internal memory of the computer. SPRINT is flexible and allows the addition of further functions to its library. SPRINT is open to external contributions from the research community. The functions currently included in SPRINT have been selected for their importance in the analysis of highly parallel, high-throughput post-genomic data, and network biology in general, and through prioritisation by R users.
The objective of improving the accessibility and usability of SPRINT will be achieved by working on four different levels: functionality, access, availability and dissemination.

- New functionality will be added to support machine learning approaches and next generation sequencing.
- Central, ready-to-use SPRINT installations, accessible to all, will be set up on public HPC resources: HECToR, the UK supercomputing service, and the Amazon EC2 cloud service. Help with local installation will also be provided.
- Technical improvements will be made to allow SPRINT to run on different types of computers, such as UNIX/Linux, Apple Mac or Windows platforms.
- The use and usability of SPRINT will be promoted through an active programme of dissemination. This will be achieved through hands-on workshops across the UK to train users, advise research groups and promote SPRINT use.

SPRINT is the first application to recognise that parallelisation and HPC access for biologists are the key next step in supporting high quality bioinformatics resources that can respond to next generation sequencing and other high-throughput technologies. SPRINT stands out as the only tool that combines ease of use, the ability to perform complex statistical analyses including data-dependent problems, the capacity to tackle very large datasets, even those larger than the physical memory of the computer, and the ability to run on any HPC platform, including the cloud, with good scalability.

This project will benefit the biological community, in both academia and industry, by providing a tool kit that enables researchers to exploit HPC and to perform currently intractable analyses on large high-throughput genomic data. The wider R community will also benefit from these technology developments because the SPRINT functionality can be used in generic statistical analyses.

Technical Summary

SPRINT is an R package that allows easy access to HPC for the analysis of high throughput "omics" data using the statistical programming language R. SPRINT provides functional interfaces that are as close as possible to existing R user interfaces so that biologists can obtain maximum performance from HPC platforms with minimal changes to existing analysis workflows and without requiring specialist knowledge of HPC.
This project aims to promote the accessibility and usability of SPRINT. This will be achieved by working on four different levels: functionality, access, availability and dissemination.
New functions will be added to deliver a step-change in the functionality needed for machine learning algorithms. We will implement the distance function that is core to many clustering algorithms; implement the Hamming distance for use with string data such as that produced by genotyping or next generation sequencing; and optimise a standard function, the Robust Multi-Array Average (RMA) expression measure.
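As a rough illustration of the Hamming distance mentioned above, the following Python sketch counts the positions at which two equal-length strings differ and builds the matrix of distances between all pairs. It is not SPRINT's implementation, which is a parallel function exposed to R; the function names and the example reads are hypothetical and chosen only for this sketch. For n strings there are n(n-1)/2 independent pairs, which is why the calculation lends itself to parallelisation.

```python
from itertools import combinations
from typing import List


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(seqs: List[str]) -> List[List[int]]:
    """Return the symmetric matrix of Hamming distances between all pairs."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    # Each pair (i, j) is independent of every other pair, so the pairs
    # could be divided among workers; here they are computed serially.
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming_distance(seqs[i], seqs[j])
    return dist


# Hypothetical short reads, for illustration only.
reads = ["ACGT", "ACGA", "TCGA"]
matrix = pairwise_hamming(reads)
```

A matrix of this kind can be passed to a clustering routine that accepts precomputed distances.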
Better access will be offered by providing ready to use SPRINT installations on the national supercomputing service HECToR and on the Amazon EC2 cloud. Direct assistance with installation and user training will also be offered to research groups wanting to exploit their local HPC facilities.
Availability of SPRINT will be improved on Linux/UNIX and non-Linux/UNIX platforms. SPRINT currently relies on the MPICH2 MPI library; we will adapt the software to ensure it can also run on other implementations of MPI-2, in particular OpenMPI. SPRINT has been successfully installed on the Apple Mac architecture, and we will improve and fully port the software to it. We will also investigate the installation and porting of SPRINT onto Windows platforms.
Extensive efforts will be made to promote the use of SPRINT through an active programme of dissemination, with hands-on workshops across the UK to train users and advise research groups on installing and using SPRINT.

Planned Impact

The SPRINT framework is an R package which aims to overcome limitations on data size and analysis time by providing easy access to High Performance Computing (HPC). The statistical programming language and environment R is commonly used by both industry and academia and is becoming the lingua franca of statistical computing. R has been extended with a large number of problem-specific packages; it is distributed free under a GNU General Public License and is available from the Comprehensive R Archive Network (CRAN).
The main beneficiaries of these technology developments are bioscience researchers, in both industry and academia, who use R to process data from high throughput, highly parallel "omics" experiments. These are now essential tools of biological research. Technologies such as microarrays and next generation sequencing are becoming routinely used in life science laboratories for many applications, for example the discovery and validation of new drug targets, fundamental research into the complex nature of and relationships between the various levels of an organism, and system and network biology approaches to further our understanding of the healthy system. These technologies generate an unprecedented amount of data, which is increasingly difficult to store and process due to the lack of appropriate tools. The requirements for the analysis and interpretation of such data are also particularly sophisticated and specialised. It is of crucial importance that adequate resources are provided to the community to fully analyse and extract biological knowledge from these data. Failure to do so will reduce possible economic and societal benefits and so limit the practical and applied advances of genomics.
SPRINT provides the biological community with a tool kit enabling them to exploit HPC to perform currently intractable analyses on large high throughput "omics" experiments. SPRINT removes current bottlenecks in the analysis of very large datasets such as those from time series and next generation sequencing experiments. It allows the processing of datasets previously too large to tackle and enables analyses that previously took too long to perform or relied on algorithms that were computationally too demanding. In particular, the proposed developments aim to provide support for machine learning algorithms.
SPRINT is user friendly, aimed at biological scientists, not HPC experts. However, SPRINT is designed by HPC experts giving the software an industrial quality not usually found in open source software. The scalable aspect of SPRINT offers the added potential of future proofing analysis workflows from increasing data size. SPRINT is platform independent and can be used on multi-core desktops, local clusters, servers, supercomputers or in the cloud.
Moreover, the R community from all disciplines and scientific community at large will also benefit from this technology development because the analysis methods considered here are generic and can be applied to a wide variety of areas.

Funded Value:

£257,288

Funded Period:

Oct 12 - Sep 14

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/J019283/1

Principal Investigator:

Peter Ghazal

Research Subject:

Tools, technologies & methods (100%)

Research Topic:

High Performance Computing (100%)

Organisations

University of Edinburgh (Lead Research Organisation)

People	ORCID iD
Peter Ghazal (Principal Investigator)
Terence Sloan (Co-Investigator)

Publications

Hill J (2018) Exploiting Parallel R in the Cloud with SPRINT in Methods of Information in Medicine

Lloyd A (2013) Embedded systems for global e-Social Science: Moving computation rather than data in Future Generation Computer Systems

Mitchell L (2014) Parallel classification and feature selection in microarray data using SPRINT. in Concurrency and computation : practice & experience

O'Driscoll A (2015) HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool. in Journal of biomedical informatics

Robertson, K (2012) SPRINT: more runners, fewer hurdles in Edinburgh Parallel Computing Centre

Sloan T. M. (2014) Parallel Optimisation of Bootstrapping in R in arXiv e-prints

Sloan, T (2013) SPRINT: taking biomedical analysis from the desktop to supercomputers and the cloud in EPCC, University of Edinburgh

Troup, E (2014) Using SPRINT and parallelised functions for analysis of large data on multi-core Mac and HPC platforms in The R User Conference 2014

Key Findings
Impact Summary
Software and Technical Products
Engagement Activities


Description	The SPRINT project (www.r-sprint.org) has developed a software package called 'sprint' that successfully enables users of statistical analysis software ("R") to perform computationally challenging analyses on very large data sets. By providing parallelised code and a simple usage model for important analysis functions, we are able to help users unfamiliar with high performance computing to nonetheless make use of either supercomputers or simply multi-processor computers (Mac or Linux). Users of this software can therefore perform types of analyses that would otherwise be beyond their computational ability, and lengthy standard analysis times can be reduced by an order of magnitude.

Primary achievements within this grant are:

- increasing availability of SPRINT on supercomputing platforms, compute clusters and individual computers of research institutions
- expanding the availability and utility of SPRINT to other operating systems and the "Cloud"
- increasing potential research capabilities of interested institutions through the roll-out of SPRINT training courses
- expanding the set of available analysis functions to address new biological high-throughput laboratory technologies

As a result of this grant, we now know:

- that complex genomic data analyses (or those for any other large data sets) can successfully be reduced from hours to minutes for users of R (Mac or Unix-based operating systems) who are not specialists in high-performance computing
- how to successfully develop and disseminate parallelised code for use by the research community and for specific research data problems
- after testing, that MS Windows OS is not easily compatible with parallelisation solutions for R outside a commercial context
- after interaction with multiple groups and users, that data analysis challenges will increase in the biological sciences due to now widespread next-generation sequencing

In terms of specific key developments, we have:

- parallelised a function to compute the Hamming distance ("similarity") between pairs of strings, as they would occur in large numbers in next-generation sequencing studies
- installed SPRINT on the UK supercomputers HECToR and ARCHER, on the compute clusters and individual workstations of several UK biomedical research institutions (European Bioinformatics Institute, Cancer Research UK, MRC Institute for Genetics and Molecular Medicine, Francis Crick Institute), and as a successful proof of concept on the Amazon EC2 "cloud"
- developed SPRINT up to version 1.0.7, which differs from versions prior to this grant in that it works on multi-core Mac OS computers, supports the latest version of R, is compatible with a second, frequently used Message Passing Interface implementation (OpenMPI), and includes bug fixes
- published our analysis functions and software engineering approaches in peer-reviewed journals
- promoted SPRINT and instructed in its use through two training courses, multiple seminars and interactions with user/research groups

With the development of this software package, we expect individual research groups other than ours to benefit from increased computational capabilities without a requirement to collaborate with high-performance computing programmers. We are also using the expertise and developed code to formulate grant applications to solve specific computational analysis problems in biomedical research and diagnostics.
Exploitation Route	In line with a shift from providing the community with parallelised functions to targeting this development at relevant biomedical research problems, we are now using the developed expertise and code to obtain funding for solving specific biomedical analysis problems. In the first instance this is in the area of diagnostic 'omics' for neonatal sepsis. We have also established a long-standing collaboration between the Division of Pathway Medicine and the Edinburgh Parallel Computing Centre, which will be of great use in addressing future research analysis problems arising from next-generation sequencing technology. Independently, since SPRINT is freely available and fully functional open source software, other research groups or industrial users can use the package to extend their own analysis workflows and reduce run times. User interactions through our mailing list and at our training courses show this to be the case.
Sectors	Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology,Other
URL	http://www.r-sprint.org


Description	SPRINT is highly specialised software, and we do not foresee it having immediately applicable benefits beyond academia. However, we envisage that by enhancing the "knowledge economy", specifically the ability of researchers with very large data sets (now common in biology, and increasing in medicine) to analyse these data better and faster, we incrementally advance biological and medical findings, ultimately impacting health care provision. For example, funding now being sought will use the software and knowledge acquired, in the first instance, to identify 'omic' biomarkers for neonatal bacterial sepsis through analysis models that require the use of high performance computing solutions. SPRINT is also "Open Source" software, so commercial companies interested in developing high-performance computing solutions of their own may be able to identify parallelisation solutions based on it.
First Year Of Impact	2010
Sector	Digital/Communication/Information Technologies (including Software),Other
Impact Types	Societal,Economic


Title	SPRINT release 1.0.5 (E. Troup et al, November 2013)
Description	SPRINT is an R (www.r-project.org) package that allows users to easily make use of parallelised versions of analysis functions on any multi-core computer or supercomputing platform (both with the exception of MS Windows). This release introduces a Hamming distance function and various bug fixes.
Type Of Technology	Software
Year Produced	2013
Open Source License?	Yes
Impact	No actual Impacts realised to date
URL	http://www.ed.ac.uk/schools-departments/pathway-medicine/our-research/ghazal-group/pathway-informati...


Description	Cancer Research UK Cambridge workshop (Milestone 13.1) 26th Apr 2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	A one-day group meeting between SPRINT and Cancer Research UK Cambridge (led by Peter Maccallum, Head of Scientific IT & Computing) took place on 26th April 2013 in Cambridge (milestone M13-1), resulting in project pointers and test installs. Presentation materials (slides). No actual impacts realised to date.
Year(s) Of Engagement Activity	2013


Description	Edinburgh Cancer Research Centre Seminar 29th Oct 2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Geographic Reach	International
Primary Audience	Participants in your research or patient groups
Results and Impact	SPRINT was presented in a seminar to the Edinburgh Cancer Research Centre on 29th October 2013, preceding a planned SPRINT tutorial in the first quarter of 2014. Presentation materials (slides). No actual impacts realised to date.
Year(s) Of Engagement Activity	2013


Description	SPRINT version 1.0.5 release announcement on www.r-sprint.org
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	A new version of the SPRINT package, SPRINT v1.0.5, was made available on the www.r-sprint.org website. No actual impacts realised to date.
Year(s) Of Engagement Activity	2013
URL	http://www.r-sprint.org