Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice. (25th September 2021)
- Record Type:
- Journal Article
- Title:
- Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice. (25th September 2021)
- Main Title:
- Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice
- Authors:
- Jayasekara, Sachini
Karunasekera, Shanika
Harwood, Aaron - Abstract:
- Abstract: Fault‐tolerance is an essential part of a stream processing system that guarantees data analysis could continue even after failures. State‐of‐the‐art distributed stream processing systems use checkpointing to support fault‐tolerance for stateful computations where the state of the computations is periodically persisted. However, the frequency of performing checkpoints impacts the performance (utilization, latency, and throughput) of the system as the checkpointing process consumes resources and time that can be used for actual computations. In practice, systems are often configured to perform checkpoints based on crude values ignoring factors such as checkpoint and restart costs, leading to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes the system utilization for stream processing systems to minimize the impact of checkpointing on system performance. In this article, we investigate the practical benefits of our proposed theoretical optimal by conducting experiments in a real‐world cloud setting using different streaming applications; we use Apache Flink, a well‐known stream processing system for our experiments. The experiment results demonstrate that an optimal interval can achieve better utilization, confirming the practicality of the theoretical model when applied to real‐world applications. We observed utilization improvements from 10% to 200% for a range of failure rates from 0.3 failures perAbstract: Fault‐tolerance is an essential part of a stream processing system that guarantees data analysis could continue even after failures. State‐of‐the‐art distributed stream processing systems use checkpointing to support fault‐tolerance for stateful computations where the state of the computations is periodically persisted. However, the frequency of performing checkpoints impacts the performance (utilization, latency, and throughput) of the system as the checkpointing process consumes resources and time that can be used for actual computations. In practice, systems are often configured to perform checkpoints based on crude values ignoring factors such as checkpoint and restart costs, leading to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes the system utilization for stream processing systems to minimize the impact of checkpointing on system performance. In this article, we investigate the practical benefits of our proposed theoretical optimal by conducting experiments in a real‐world cloud setting using different streaming applications; we use Apache Flink, a well‐known stream processing system for our experiments. The experiment results demonstrate that an optimal interval can achieve better utilization, confirming the practicality of the theoretical model when applied to real‐world applications. We observed utilization improvements from 10% to 200% for a range of failure rates from 0.3 failures per hour to 0.075 failures per minute. Moreover, we explore how performance measures: latency and throughput are affected by the optimal interval. Our observations demonstrate that significant improvements can be achieved using the optimal interval for both latency and throughput. … (more)
- Is Part Of:
- Software, practice & experience. Volume 52:Number 1(2022)
- Journal:
- Software, practice & experience
- Issue:
- Volume 52:Number 1(2022)
- Issue Display:
- Volume 52, Issue 1 (2022)
- Year:
- 2022
- Volume:
- 52
- Issue:
- 1
- Issue Sort Value:
- 2022-0052-0001-0000
- Page Start:
- 296
- Page End:
- 315
- Publication Date:
- 2021-09-25
- Subjects:
- checkpointing -- fault‐tolerance -- optimization -- stream processing
Computer software -- Periodicals
Computer programming -- Periodicals
Computer programs -- Periodicals
005.3 - Journal URLs:
- http://onlinelibrary.wiley.com/ ↗
- DOI:
- 10.1002/spe.3021 ↗
- Languages:
- English
- ISSNs:
- 0038-0644
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 8321.453000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store - Ingest File:
- 19985.xml