The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints. (9th September 2018)
- Record Type:
- Journal Article
- Title:
- The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints. (9th September 2018)
- Main Title:
- The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints
- Authors:
- Levy, Scott
Ferreira, Kurt B.
Widener, Patrick
- Other Names:
- Jin, Hai (guest editor)
Shen, Xipeng (guest editor)
Lovas, Robert (guest editor)
Liao, Xiaofei (guest editor)
Skjellum, Anthony (guest editor)
Bangalore, Purushotham V. (guest editor)
Grant, Ryan E. (guest editor)
- Abstract:
- Summary: Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large‐scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next‐generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
- Is Part Of:
- Concurrency and computation. Volume 32: Number 3 (2020)
- Journal:
- Concurrency and computation
- Issue:
- Volume 32: Number 3 (2020)
- Issue Display:
- Volume 32, Issue 3 (2020)
- Year:
- 2020
- Volume:
- 32
- Issue:
- 3
- Issue Sort Value:
- 2020-0032-0003-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2018-09-09
- Subjects:
- checkpoint/restart -- fault tolerance -- MPI
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.4890
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 12612.xml