A cooperative partial snapshot algorithm for checkpoint‐rollback recovery of large‐scale and dynamic distributed systems and experimental evaluations. (3rd January 2020)
- Record Type:
- Journal Article
- Main Title:
- A cooperative partial snapshot algorithm for checkpoint‐rollback recovery of large‐scale and dynamic distributed systems and experimental evaluations
- Authors:
- Nakamura, Junya
Kim, Yonghwan
Katayama, Yoshiaki
Masuzawa, Toshimitsu
- Abstract:
- Summary: A distributed system consisting of a huge number of computational entities is prone to faults because faults in a few nodes cause the entire system to fail. Consequently, fault tolerance of distributed systems is a critical issue. Checkpoint‐rollback recovery is a universal and representative technique for fault tolerance; it periodically records the entire system state (configuration) to non‐volatile storage, and the system restores itself using the recorded configuration when the system fails. To record a configuration of a distributed system, a specific algorithm known as a snapshot algorithm is required. However, many snapshot algorithms require coordination among all nodes in the system; thus, frequent executions of snapshot algorithms require unacceptable communication cost, especially if the systems are large. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a partial snapshot (instead of a global snapshot). However, if two or more partial snapshot algorithms are concurrently executed, and their snapshot domains overlap, they should coordinate, so that the partial snapshots (taken by the algorithms) are consistent. In this paper, we propose a new efficient partial snapshot algorithm with the aim of reducing communication for the coordination. In a simulation, we show that the proposed algorithm drastically outperforms the existing partial snapshot algorithm, in terms of message and time complexity.
- Is Part Of:
- Concurrency and computation. Volume 33:Number 12(2021)
- Journal:
- Concurrency and computation
- Issue:
- Volume 33:Number 12(2021)
- Issue Display:
- Volume 33, Issue 12 (2021)
- Year:
- 2021
- Volume:
- 33
- Issue:
- 12
- Issue Sort Value:
- 2021-0033-0012-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2020-01-03
- Subjects:
- checkpoint‐rollback -- distributed algorithm -- fault‐tolerance -- snapshot
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.5647
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 18234.xml