A scalable and extensible checkpointing scheme for massively parallel simulations. (July 2019)
- Record Type:
- Journal Article
- Title:
- A scalable and extensible checkpointing scheme for massively parallel simulations. (July 2019)
- Main Title:
- A scalable and extensible checkpointing scheme for massively parallel simulations
- Authors:
- Kohl, Nils
Hötzer, Johannes
Schornbaum, Florian
Bauer, Martin
Godenschwager, Christian
Köstler, Harald
Nestler, Britta
Rüde, Ulrich
- Abstract:
- Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems, it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to 40 billion computational cells executing on more than 400 billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up to more than 260,000 (2^18) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated in a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the material sciences and with a lattice Boltzmann method implementation.
- Is Part Of:
- International journal of high performance computing applications. Volume 33:Number 4(2019)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 33:Number 4(2019)
- Issue Display:
- Volume 33, Issue 4 (2019)
- Year:
- 2019
- Volume:
- 33
- Issue:
- 4
- Issue Sort Value:
- 2019-0033-0004-0000
- Page Start:
- 571
- Page End:
- 589
- Publication Date:
- 2019-07
- Subjects:
- Resilience -- checkpoint-restart -- supercomputing -- scalable parallel algorithms -- parallel performance -- HPC -- ULFM -- MPI
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342018767736
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11469.xml