Programmer-guided reliability for extreme-scale applications. (September 2018)
- Record Type:
- Journal Article
- Main Title:
- Programmer-guided reliability for extreme-scale applications
- Authors:
- Bernholdt, David E
Elwasif, Wael R
Kartsaklis, Christos
Lee, Seyong
Mintz, Tiffany M
- Other Names:
- Bland, Wesley (guest editor)
Erez, Mattan (guest editor)
Hidalgo, J Ignacio (guest editor)
Fernández de Vega, Francisco (guest editor)
Mercier, Guillaume (guest editor)
- Abstract:
- We present "programmer-guided reliability" (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Finally, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.
- Is Part Of:
- International journal of high performance computing applications. Volume 32, Number 5 (2018)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 32, Number 5 (2018)
- Issue Display:
- Volume 32, Issue 5 (2018)
- Year:
- 2018
- Volume:
- 32
- Issue:
- 5
- Issue Sort Value:
- 2018-0032-0005-0000
- Page Start:
- 598
- Page End:
- 612
- Publication Date:
- 2018-09
- Subjects:
- Applications -- error detection -- fault tolerance -- resilience -- soft errors
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342016667625
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 8501.xml