Partial differential equations preconditioner resilient to soft and hard faults. (September 2018)
- Record Type:
- Journal Article
- Title:
- Partial differential equations preconditioner resilient to soft and hard faults. (September 2018)
- Main Title:
- Partial differential equations preconditioner resilient to soft and hard faults
- Authors:
- Rizzi, F
Morris, K
Sargsyan, K
Mycek, P
Safta, C
Le Maître, O
Knio, O
Debusschere, B
- Other Names:
- Bland Wesley guest-editor.
Erez Mattan guest-editor.
Hidalgo J Ignacio guest-editor.
Fernández de Vega Francisco guest-editor.
Mercier Guillaume guest-editor.
- Abstract:
- We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. This reformulation allows us to recast the problem as a set of independent tasks, and exploit data locality to reduce global communication. We discuss two different parallel implementations: (a) a single program multiple data (SPMD) version based on a one-to-one mapping between subdomain and MPI processes responsible for both state and computation; and (b) an asynchronous server–client implementation where all state information is held by the servers and clients are designed solely as computational units. We present a scalability comparison of both implementations under nominal conditions, showing efficiency within ~80% for up to 12,000 cores. We present a resilience analysis under different fault scenarios based on the server–client implementation. This framework provides resiliency to hard faults such that if a client crashes, it stops asking for work, and the servers simply distribute the work among all of the other clients alive. Erroneous subdomain solves (e.g. due to soft faults) appear as corrupted data, which is either rejected if that causes a task to fail, or is seamlessly filtered out during the regression stage through a suitable noise model. Three different types of faults are modeled: hard faults modeling nodes (or clients) crashing; soft faults occurring during the communication of the tasks between server and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE, and explore the effect of the faults at various failure rates.
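The hard-fault behaviour described in the abstract (a crashed client stops asking for work, and the remaining clients absorb its tasks) can be illustrated with a small, purely sequential simulation. This is a minimal sketch and not the authors' implementation: the function name `run_pull_scheduler`, the `crash_after` fault model, and the assumption that the server learns of a lost task immediately (a real system would more plausibly rely on a timeout) are all hypothetical choices made for illustration.

```python
from collections import deque


def run_pull_scheduler(num_tasks, num_clients, crash_after):
    """Simulate pull-based task distribution with crashing clients.

    crash_after maps a client id to the number of tasks it completes
    before crashing (hypothetical fault model); clients not listed
    never crash. Returns a dict mapping task id -> completing client id.
    """
    pending = deque(range(num_tasks))  # server-side state: tasks still to do
    completed = {}                     # task id -> client that finished it
    done_count = {c: 0 for c in range(num_clients)}
    alive = set(range(num_clients))

    while pending and alive:
        # sorted() copies the set, so removing clients inside the loop is safe.
        for client in sorted(alive):
            if not pending:
                break
            task = pending.popleft()   # the client asks the server for work
            if done_count[client] >= crash_after.get(client, float("inf")):
                # The client crashes during this task: no result arrives,
                # the server requeues the task, and the client never asks again.
                alive.discard(client)
                pending.append(task)
                continue
            completed[task] = client
            done_count[client] += 1
    return completed


if __name__ == "__main__":
    result = run_pull_scheduler(num_tasks=10, num_clients=3, crash_after={1: 2})
    print(sorted(result.items()))
```

Because all task state lives on the server side (`pending` and `completed`), losing a client costs only the task it was holding, which is requeued. If every client crashes, the loop ends with tasks still pending rather than hanging.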
- Is Part Of:
- International journal of high performance computing applications. Volume 32:Number 5(2018)
- Journal:
- International journal of high performance computing applications
- Issue:
- Volume 32:Number 5(2018)
- Issue Display:
- Volume 32, Issue 5 (2018)
- Year:
- 2018
- Volume:
- 32
- Issue:
- 5
- Issue Sort Value:
- 2018-0032-0005-0000
- Page Start:
- 658
- Page End:
- 673
- Publication Date:
- 2018-09
- Subjects:
- Resilience -- scientific computing -- supercomputers -- parallel programming -- distributed computing -- client–server systems -- high-performance computing -- parallel algorithms -- software engineering -- fault tolerance -- message passing -- fault-tolerant systems -- partial differential equations
High performance computing -- Periodicals
Supercomputers -- Periodicals
004.1105
- Journal URLs:
- http://hpc.sagepub.com
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
- DOI:
- 10.1177/1094342016684975
- Languages:
- English
- ISSNs:
- 1094-3420
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 8501.xml