A generic approach to scheduling and checkpointing workflows. (November 2019)
- Record Type:
- Journal Article
- Title:
- A generic approach to scheduling and checkpointing workflows. (November 2019)
- Main Title:
- A generic approach to scheduling and checkpointing workflows
- Authors:
- Han, Li
Le Fèvre, Valentin
Canon, Louis-Claude
Robert, Yves
Vivien, Frédéric
- Other Names:
- Dongarra Jack guest-editor.
Tourancheau Bernard guest-editor.
- Abstract:
- This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time and Min Min and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (Ckpt All), which is overkill when failures are rare events, and checkpointing no task (Ckpt None), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both Ckpt All and Ckpt None for a wide variety of workflows.
- Is Part Of:
- International journal of high performance computing applications. Volume 33:Number 6(2019)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 33:Number 6(2019)
- Issue Display:
- Volume 33, Issue 6 (2019)
- Year:
- 2019
- Volume:
- 33
- Issue:
- 6
- Issue Sort Value:
- 2019-0033-0006-0000
- Page Start:
- 1255
- Page End:
- 1274
- Publication Date:
- 2019-11
- Subjects:
- Workflow -- checkpoint -- fail-stop error -- resilience
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342019866891
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11258.xml