Unified fault-tolerance framework for hybrid task-parallel message-passing applications. (September 2018)
- Record Type:
- Journal Article
- Title:
- Unified fault-tolerance framework for hybrid task-parallel message-passing applications. (September 2018)
- Main Title:
- Unified fault-tolerance framework for hybrid task-parallel message-passing applications
- Authors:
- Subasi, Omer
Martsinkevich, Tatiana
Zyulkyarov, Ferad
Unsal, Osman
Labarta, Jesus
Cappello, Franck
- Other Names:
- Bland Wesley guest-editor.
Erez Mattan guest-editor.
Hidalgo J Ignacio guest-editor.
Fernández de Vega Francisco guest-editor.
Mercier Guillaume guest-editor.
- Abstract:
- We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any Message Passing Interface (MPI) calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
- Is Part Of:
- International journal of high performance computing applications. Volume 32:Number 5(2018)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 32:Number 5(2018)
- Issue Display:
- Volume 32, Issue 5 (2018)
- Year:
- 2018
- Volume:
- 32
- Issue:
- 5
- Issue Sort Value:
- 2018-0032-0005-0000
- Page Start:
- 641
- Page End:
- 657
- Publication Date:
- 2018-09
- Subjects:
- Fault-tolerance -- message logging -- checkpoint/restart -- task-based programming model -- optimal checkpointing interval
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342016669416
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 8501.xml