The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet. (9th September 2019)
- Record Type:
- Journal Article
- Title:
- The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet. (9th September 2019)
- Main Title:
- The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet
- Authors:
- Ivanov, Todor
Pergolesi, Matteo
- Abstract:
- Summary: Columnar file formats provide an efficient way to store data to be queried by SQL‐on‐Hadoop engines. Related works consider the performance of the processing engine and the file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx‐BB), a standardized application‐level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.
- Is Part Of:
- Concurrency and computation. Volume 32:Number 5(2020)
- Journal:
- Concurrency and computation
- Issue:
- Volume 32:Number 5(2020)
- Issue Display:
- Volume 32, Issue 5 (2020)
- Year:
- 2020
- Volume:
- 32
- Issue:
- 5
- Issue Sort Value:
- 2020-0032-0005-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2019-09-09
- Subjects:
- BigBench -- big data benchmarking -- columnar file formats -- Hive -- ORC -- Parquet -- SparkSQL -- SQL‐on‐Hadoop
Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.5523
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 12795.xml