Epidemic failure detection and consensus for extreme parallelism. (September 2018)
- Record Type:
- Journal Article
- Title:
- Epidemic failure detection and consensus for extreme parallelism. (September 2018)
- Main Title:
- Epidemic failure detection and consensus for extreme parallelism
- Authors:
- Katti, Amogh
Di Fatta, Giuseppe
Naughton, Thomas
Engelmann, Christian
- Other Names:
- Bland, Wesley, guest editor.
Erez, Mattan, guest editor.
Hidalgo, J. Ignacio, guest editor.
Fernández de Vega, Francisco, guest editor.
Mercier, Guillaume, guest editor.
- Abstract:
- Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm, failures are then disseminated to all the fault-free processes in the system and consensus on the failures is detected using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that the stochastic pinging detects all the failures in the system. In all the algorithms, the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
- Is Part Of:
- International journal of high performance computing applications. Volume 32: Number 5 (2018)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 32: Number 5 (2018)
- Issue Display:
- Volume 32, Issue 5 (2018)
- Year:
- 2018
- Volume:
- 32
- Issue:
- 5
- Issue Sort Value:
- 2018-0032-0005-0000
- Page Start:
- 729
- Page End:
- 743
- Publication Date:
- 2018-09
- Subjects:
- Fault-tolerant MPI -- user-level failure mitigation -- failure detection -- consensus -- Gossip protocols
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342017690910
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 8501.xml