Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI. (23rd July 2013)
- Record Type:
- Journal Article
- Title:
- Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI. (23rd July 2013)
- Main Title:
- Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI
- Authors:
- Bland, Wesley
Du, Peng
Bouteiller, Aurelien
Herault, Thomas
Bosilca, George
Dongarra, Jack J.
- Other Names:
- Bougé, Luc (guest editor)
Lengauer, Christian (guest editor)
- Abstract:
- SUMMARY: Most predictions of exascale machines picture billion‐way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: (i) traditional checkpoint‐based approaches incur a steep overhead on failure‐free operations and (ii) the dominant programming paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support for software‐level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high‐quality implementation, as defined by the current MPI Standard, to enable advanced forward recovery techniques, without incurring the overhead of customary periodic checkpointing. With our approach, when failure strikes, applications regain control to make a checkpoint before quitting execution. This checkpoint is in reaction to the failure occurrence rather than periodic. This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application‐based recovery technique to repair the failure‐damaged dataset. The validity and performance of this approach are evaluated on large‐scale systems, using the QR factorization as an example. Published 2013. This article is a US Government work and is in the public domain in the USA.
- Is Part Of:
- Concurrency and computation. Volume 25:Number 17(2013:Dec.)
- Journal:
- Concurrency and computation
- Issue:
- Volume 25:Number 17(2013:Dec.)
- Issue Display:
- Volume 25, Issue 17 (2013)
- Year:
- 2013
- Volume:
- 25
- Issue:
- 17
- Issue Sort Value:
- 2013-0025-0017-0000
- Page Start:
- 2381
- Page End:
- 2393
- Publication Date:
- 2013-07-23
- Subjects:
- fault tolerance -- message passing interface -- ABFT -- Checkpoint‐on‐Failure
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.3100
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 371.xml