Dynamic cluster strategy for hierarchical rollback‐recovery protocols in MPI HPC applications. (24th May 2017)
- Record Type:
- Journal Article
- Title:
- Dynamic cluster strategy for hierarchical rollback‐recovery protocols in MPI HPC applications. (24th May 2017)
- Main Title:
- Dynamic cluster strategy for hierarchical rollback‐recovery protocols in MPI HPC applications
- Authors:
- Liao, Xiaofei
Zheng, Long
Zhang, Binsheng
Zhang, Yu
Jin, Hai
Shi, Xuanhua
Lin, Yi
- Other Names:
- Jin, Hai (guest editor)
Shen, Xipeng (guest editor)
Lovas, Robert (guest editor)
Liao, Xiaofei (guest editor)
Skjellum, Anthony (guest editor)
Bangalore, Purushotham V. (guest editor)
Grant, Ryan E. (guest editor)
- Abstract:
- Summary: Fault tolerance in parallel computing is becoming increasingly important with the significant rise of high‐performance computing systems. Coordinated checkpointing and message logging protocols are commonly used fault‐tolerance mechanisms for message‐passing applications. However, each mechanism has severe drawbacks when used alone. Hierarchical rollback‐recovery protocols, which combine coordinated checkpointing with message logging, are a better solution. However, such protocols may not achieve adequate efficiency because the communication pattern of an application may vary between stages at runtime. To improve the efficiency of hierarchical rollback‐recovery protocols, we propose a dynamic cluster strategy that uses a prediction scheme to adapt to runtime variation in the communication pattern. Finally, the efficiency and scalability of the dynamic cluster strategy are evaluated using two static process partition algorithms on the High‐Performance Linpack benchmark.
- Is Part Of:
- Concurrency and computation. Volume 32:Number 3(2020)
- Journal:
- Concurrency and computation
- Issue:
- Volume 32:Number 3(2020)
- Issue Display:
- Volume 32, Issue 3 (2020)
- Year:
- 2020
- Volume:
- 32
- Issue:
- 3
- Issue Sort Value:
- 2020-0032-0003-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2017-05-24
- Subjects:
- fault tolerance -- high‐performance computing -- message‐passing interface -- rollback‐recovery protocols
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.4173
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 12605.xml