Skewed distributions in semi-stream joins: How much can caching help?. (March 2017)
- Record Type:
- Journal Article
- Title:
- Skewed distributions in semi-stream joins: How much can caching help?. (March 2017)
- Main Title:
- Skewed distributions in semi-stream joins: How much can caching help?
- Authors:
- Naeem, M.Asif
Dobbie, Gillian
Lutteroth, Christof
Weber, Gerald
- Abstract:
- Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing, where a stream of transactions is joined with master data before it is loaded into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance optimizations possible. We propose two such optimization techniques: (1) a caching technique for frequently used master data and (2) a technique for selective load shedding of stream tuples. The caching technique is fine-grained, operating at the tuple level. Furthermore, it is generic in the sense that it can be applied to different semi-stream join algorithms to deal with data skew. We analyze it by combining it with various well-known semi-stream joins, and show that it improves the service rate by more than 40% for typical data with skewed distributions. The load shedding technique sheds the fraction of the stream that is most expensive to join. In contrast to existing approaches, the service rate improves under load shedding. We present experimental data showing significant improvements as compared to related approaches and perform a sensitivity analysis for various internal parameters.
Highlights: We present a generic front-stage for tuple-level caching that can improve the performance of many well-known semi-stream join algorithms. We present a novel load shedding technique that sheds the tuples that are most expensive to process, thus measurably increasing the service rate. We perform a sensitivity analysis with respect to various parameters. …
- Is Part Of:
- Information systems. Volume 64(2017)
- Journal:
- Information systems
- Issue:
- Volume 64(2017)
- Issue Display:
- Volume 64, Issue 2017 (2017)
- Year:
- 2017
- Volume:
- 64
- Issue:
- 2017
- Issue Sort Value:
- 2017-0064-2017-0000
- Page Start:
- 63
- Page End:
- 74
- Publication Date:
- 2017-03
- Subjects:
- Semi-stream processing -- Join -- Front-stage cache -- Performance optimization
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2016.09.007
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 1232.xml
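
Illustrative note on the abstract: the front-stage caching idea it describes, keeping frequently used master-data tuples in memory so that most stream tuples from a skewed input never reach the disk-based join, might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: the class name `FrontStageCache`, the `lookup_master` callback standing in for the underlying semi-stream join, and the frequency-based admission rule are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


class FrontStageCache:
    """Hypothetical tuple-level cache placed in front of a semi-stream join."""

    def __init__(self, lookup_master: Callable[[int], Optional[str]], capacity: int):
        # lookup_master is a stand-in (assumption) for the disk-based join stage.
        self.lookup_master = lookup_master
        self.capacity = capacity
        self.cache: Dict[int, str] = {}
        self.freq: Counter = Counter()  # how often each join key has been seen
        self.hits = 0
        self.misses = 0

    def join(self, stream: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, str, str]]:
        for key, payload in stream:
            self.freq[key] += 1
            if key in self.cache:
                # Frequent key: joined in memory, no disk access needed.
                self.hits += 1
                yield key, payload, self.cache[key]
                continue
            self.misses += 1
            master = self.lookup_master(key)  # expensive path
            if master is None:
                continue  # inner join: no matching master tuple
            self._admit(key, master)
            yield key, payload, master

    def _admit(self, key: int, master: str) -> None:
        # Assumed policy: keep the keys that have been seen most often so far.
        if len(self.cache) < self.capacity:
            self.cache[key] = master
            return
        coldest = min(self.cache, key=lambda k: self.freq[k])
        if self.freq[key] > self.freq[coldest]:
            del self.cache[coldest]
            self.cache[key] = master


# Usage with made-up data: key 1 dominates the stream, so it stays cached.
master_data = {1: "A", 2: "B", 3: "C"}
front = FrontStageCache(master_data.get, capacity=1)
stream = [(1, "t1"), (1, "t2"), (2, "t3"), (1, "t4"), (3, "t5"), (2, "t6"), (9, "t7")]
results = list(front.join(stream))
```

With a skewed stream, even a very small cache absorbs a large share of the lookups, which is the effect the abstract attributes to the caching technique. The selective load shedding it also mentions is not modelled here.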