Resilient parallel computing on volunteer PC grids. (24th April 2018)
- Record Type:
- Journal Article
- Title:
- Resilient parallel computing on volunteer PC grids. (24th April 2018)
- Main Title:
- Resilient parallel computing on volunteer PC grids
- Authors:
- Subhlok, Jaspal
Nguyen, Hien
Gabriel, Edgar
Rahman, Mohammad Tanvir
- Abstract:
- Summary: Volunteer PC hosts represent massive computation capacity at a low cost but are challenging to employ for general parallel computing. This paper presents the design, execution model, implementation, and evaluation of the Volpex framework for robust execution of parallel codes on volunteer PC grids characterized by system and network heterogeneity, varying availability, and frequent failures. The communication model is based on one‐sided Put/Get calls to an abstract global shared space enhanced to support multiple autonomous instances of the same process at different stages of execution. Our approach customizes and combines the use of replication, checkpointing, and host selection. This presents formidable challenges that are addressed in this work: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and application‐specific host selection. The integrated runtime system is shown to effectively execute moderate‐size, coarse‐grain, communicating codes on a worldwide distributed volunteer environment, a new milestone in volunteer computing. Extensive evaluation is conducted with example scientific codes on a pool of around 600 volunteer hosts. The results demonstrate the trade‐offs in deploying checkpointing, redundancy, and host selection, and how these methods combine to provide application performance that is close to the ideal failure‐free performance.
- Is Part Of:
- Concurrency and computation. Volume 30:Number 18(2018)
- Journal:
- Concurrency and computation
- Issue:
- Volume 30:Number 18(2018)
- Issue Display:
- Volume 30, Issue 18 (2018)
- Year:
- 2018
- Volume:
- 30
- Issue:
- 18
- Issue Sort Value:
- 2018-0030-0018-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2018-04-24
- Subjects:
- checkpointing -- fault tolerance -- host selection -- parallel execution -- replication -- tuplespace -- volunteer computing
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.4478
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 7458.xml