Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning. (5th September 2020)
- Record Type:
- Journal Article
- Title:
- Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning. (5th September 2020)
- Main Title:
- Towards optimizing the execution of spark scientific workflows using machine learning‐based parameter tuning
- Authors:
- de Oliveira, Douglas
Porto, Fábio
Boeres, Cristina
de Oliveira, Daniel
- Abstract:
- Summary: In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute‐ and data‐intensive workflows in distinct areas like biology and astronomy. Although Spark is an easy‐to‐install framework, it has more than one hundred parameters to be set, in addition to the domain‐specific parameters of each workflow. Thus, to execute Spark‐based workflows efficiently, the user has to fine‐tune a myriad of Spark and workflow parameters (eg, partitioning strategy, the average size of a DNA sequence). This configuration task cannot be manually performed in a trial‐and‐error manner since it is tedious and error‐prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (ie, decision trees) and then extracting useful rules (ie, patterns) from these models that can be applied to configure the parameters of future executions of the workflow and Spark for nonexpert users. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach introduced here reduced the number of parameters to be configured by identifying, in the predictive model, the most relevant domain‐specific ones related to workflow performance.
- Is Part Of:
- Concurrency and computation. Volume 33:Number 5(2021)
- Journal:
- Concurrency and computation
- Issue:
- Volume 33:Number 5(2021)
- Issue Display:
- Volume 33, Issue 5 (2021)
- Year:
- 2021
- Volume:
- 33
- Issue:
- 5
- Issue Sort Value:
- 2021-0033-0005-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2020-09-05
- Subjects:
- Apache spark -- machine learning -- scientific workflows -- Spark parameter tuning
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.5972
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 15777.xml