Application health monitoring for extreme‐scale resiliency using cooperative fault management. (25th July 2019)
- Record Type:
- Journal Article
- Title:
- Application health monitoring for extreme‐scale resiliency using cooperative fault management. (25th July 2019)
- Main Title:
- Application health monitoring for extreme‐scale resiliency using cooperative fault management
- Authors:
- Agarwal, Pratul K.
Naughton, Thomas
Park, Byung H.
Bernholdt, David E.
Hursey, Joshua J.
Geist, Al
- Abstract:
- Summary: Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application‐driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns as an indicator of an application's health, together with the knowledge that violations of these patterns could indicate faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.
- Is Part Of:
- Concurrency and computation. Volume 32:Number 2(2020)
- Journal:
- Concurrency and computation
- Issue:
- Volume 32:Number 2(2020)
- Issue Display:
- Volume 32, Issue 2 (2020)
- Year:
- 2020
- Volume:
- 32
- Issue:
- 2
- Issue Sort Value:
- 2020-0032-0002-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2019-07-25
- Subjects:
- exascale resiliency -- fault tolerance -- heterogeneous systems -- molecular dynamics -- quantum chemistry calculations -- silent errors
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.5449
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 23525.xml