Optimizing R with SparkR on a commodity cluster for biomedical research. (December 2016)
- Record Type:
- Journal Article
- Title:
- Optimizing R with SparkR on a commodity cluster for biomedical research. (December 2016)
- Main Title:
- Optimizing R with SparkR on a commodity cluster for biomedical research
- Authors:
- Sedlmayr, Martin
Würfl, Tobias
Maier, Christian
Häberle, Lothar
Fasching, Peter
Prokosch, Hans-Ulrich
Christoph, Jan
- Abstract:
- Highlights: R is a popular environment for clinical data analysis, but it does not directly support big data workloads. Both the Message Passing Interface (MPI) and SparkR make it possible to parallelize computationally demanding workloads on clusters. SparkR offers elastic resources even on non-dedicated hardware and tight integration with Hadoop distributed services. SparkR requires minimal changes to the original R code in order to utilize parallel execution. Computation in SparkR scales better than with MPI due to optimized data communication.
Abstract: Background and Objectives: Medical researchers are challenged today by the enormous amount of data collected in healthcare. Analysis methods such as genome-wide association studies (GWAS) are often computationally intensive and thus require enormous resources to be performed in a reasonable amount of time. While dedicated clusters and public clouds may deliver the desired performance, their use requires upfront financial investment or anonymized data, which is often not possible for preliminary or occasional tasks. We explored the possibility of building a private, flexible cluster for processing R scripts on the commodity, non-dedicated hardware of our department.
Methods: To this end, a GWAS calculation in R was compared on a single desktop computer, an MPI cluster, and a SparkR cluster with regard to performance, scalability, quality, and simplicity.
Results: The original script had a projected runtime of three years on a single desktop computer. Optimizing the script in R already reduced the computing time significantly, to 2 weeks. By using R-MPI and SparkR, we were able to parallelize the computation and reduce the time to less than three hours (2.6 h) on already available, standard office computers. While MPI is a proven approach in high-performance clusters, it requires rather static, dedicated nodes. SparkR and its Hadoop siblings allow for a dynamic, elastic environment with automated failure handling. SparkR also scales better with the number of nodes in the cluster than MPI due to optimized data communication.
Conclusion: R is a popular environment for clinical data analysis. The new SparkR solution offers elastic resources and supports big data analysis using R even on non-dedicated resources with minimal change to the original code. To unleash the full potential, additional effort should be invested in customizing and improving the algorithms, especially with regard to data distribution.
- Is Part Of:
- Computer methods and programs in biomedicine. Volume 137(2016)
- Journal:
- Computer methods and programs in biomedicine
- Issue:
- Volume 137(2016)
- Issue Display:
- Volume 137, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 137
- Issue:
- 2016
- Issue Sort Value:
- 2016-0137-2016-0000
- Page Start:
- 321
- Page End:
- 328
- Publication Date:
- 2016-12
- Subjects:
- Computing methodologies -- Genome-wide association study -- Big data -- Cluster computing -- SparkR
Medicine -- Computer programs -- Periodicals
Biology -- Computer programs -- Periodicals
Computers -- Periodicals
Medicine -- Periodicals
Médecine -- Logiciels -- Périodiques
Biologie -- Logiciels -- Périodiques
Biology -- Computer programs
Medicine -- Computer programs
Periodicals
Electronic journals
610.28
- Journal URLs:
- http://www.sciencedirect.com/science/journal/01692607
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.cmpb.2016.10.006
- Languages:
- English
- ISSNs:
- 0169-2607
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.095000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 21087.xml