Tree‐based fault‐tolerant collective operations for MPI. (15th June 2020)
- Record Type:
- Journal Article
- Title:
- Tree‐based fault‐tolerant collective operations for MPI. (15th June 2020)
- Main Title:
- Tree‐based fault‐tolerant collective operations for MPI
- Authors:
- Margolin, Alexander
Barak, Amnon
- Abstract:
- Summary: With the increase in size and complexity of high‐performance computing systems, the probability of failures and the cost of recovery grow. Parallel applications running on these systems should be able to continue running in spite of node failures at arbitrary times. Collective operations are essential for many parallel MPI applications, and are often the first to detect such failures. This work presents tree‐based fault‐tolerant collective operations, which combine fault detection and recovery as an integral part of each operation. We do this by extending existing tree‐based algorithms to allow a collective operation to succeed despite nodes failing before or during its run. This differs from other approaches, where recovery takes place after such operations have failed. The article includes a comparison between the performance of the proposed algorithm and other approaches, as well as a simulator‐based analysis of performance at scale.
- Is Part Of:
- Concurrency and computation. Volume 33:Number 14(2021)
- Journal:
- Concurrency and computation
- Issue:
- Volume 33:Number 14(2021)
- Issue Display:
- Volume 33, Issue 14 (2021)
- Year:
- 2021
- Volume:
- 33
- Issue:
- 14
- Issue Sort Value:
- 2021-0033-0014-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2020-06-15
- Subjects:
- Allreduce -- collective operations -- fault‐tolerance -- MPI
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.5826
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 17351.xml