Failure prediction for HPC systems and applications: Current situation and open issues. (August 2013)
- Record Type:
- Journal Article
- Title:
- Failure prediction for HPC systems and applications: Current situation and open issues. (August 2013)
- Main Title:
- Failure prediction for HPC systems and applications
- Authors:
- Gainaru, Ana
Cappello, Franck
Snir, Marc
Kramer, William
- Abstract:
- As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering prediction is by the analysis of system logs generated during production by large-scale systems. Current research in this field has a number of limitations that make existing approaches unusable on real production high-performance computing (HPC) systems. Based on our observations that different failures have different distributions and behaviours, we propose a novel hybrid approach that combines signal analysis with data mining in order to overcome current limitations. We show that by analysing each event according to its specific behaviour, our prediction provides a precision of over 90% and is able to discover about 50% of all failures in a system, a result which allows its integration in proactive fault tolerance protocols.
- Is Part Of:
- International journal of high performance computing applications. Volume 27:Number 3(2013)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 27:Number 3(2013)
- Issue Display:
- Volume 27, Issue 3 (2013)
- Year:
- 2013
- Volume:
- 27
- Issue:
- 3
- Issue Sort Value:
- 2013-0027-0003-0000
- Page Start:
- 273
- Page End:
- 282
- Publication Date:
- 2013-08
- Subjects:
- failure prediction -- fault tolerance -- signal analysis
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342013488258
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23897.xml