Evaluating and extending user-level fault tolerance in MPI applications. (August 2016)
- Record Type:
- Journal Article
- Main Title:
- Evaluating and extending user-level fault tolerance in MPI applications
- Authors:
- Laguna, Ignacio
Richards, David F
Gamblin, Todd
Schulz, Martin
de Supinski, Bronis R
Mohror, Kathryn
Pritchard, Howard
- Abstract:
- The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.
- Is Part Of:
- International journal of high performance computing applications. Volume 30:Number 3(2016:Autumn)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 30:Number 3(2016:Autumn)
- Issue Display:
- Volume 30, Issue 3 (2016)
- Year:
- 2016
- Volume:
- 30
- Issue:
- 3
- Issue Sort Value:
- 2016-0030-0003-0000
- Page Start:
- 305
- Page End:
- 319
- Publication Date:
- 2016-08
- Subjects:
- MPI -- fault tolerance -- failure recovery models -- checkpointing -- molecular dynamics simulation
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342015623623
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 6801.xml