Efficient checkpoint/verification patterns. (January 2017)
- Record Type:
- Journal Article
- Title:
- Efficient checkpoint/verification patterns. (January 2017)
- Main Title:
- Efficient checkpoint/verification patterns
- Authors:
- Benoit, Anne
Raina, Saurabh K
Robert, Yves
- Abstract:
- Errors have become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (mean time between failures) is large compared with the other parameters (checkpointing, verification and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared with the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
- Is Part Of:
- International journal of high performance computing applications. Volume 31:Number 1(2017)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 31:Number 1(2017)
- Issue Display:
- Volume 31, Issue 1 (2017)
- Year:
- 2017
- Volume:
- 31
- Issue:
- 1
- Issue Sort Value:
- 2017-0031-0001-0000
- Page Start:
- 52
- Page End:
- 65
- Publication Date:
- 2017-01
- Subjects:
- High-performance computing -- fault tolerance -- checkpointing -- verification -- silent error -- silent data corruption
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342015594531
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 7494.xml