Pipeline provenance for cloud‐based big data analytics. (5th September 2019)
- Record Type:
- Journal Article
- Title:
- Pipeline provenance for cloud‐based big data analytics. (5th September 2019)
- Main Title:
- Pipeline provenance for cloud‐based big data analytics
- Authors:
- Wang, Ruoyu
Sun, Daniel
Li, Guoqiang
Wong, Raymond
Chen, Shiping
- Other Names:
- Ranjan, Rajiv (guest editor)
Villari, Massimo (guest editor)
Shen, Haiying (guest editor)
Rana, Omer (guest editor)
Buyya, Rajkumar (guest editor)
- Abstract:
- Summary: Provenance is information about the origin and creation of data. In data science and engineering related with cloud environment, such information is useful and sometimes even critical. In data analytics, it is necessary for making data‐driven decisions to trace back history and reproduce final or intermediate results, even to tune models and adjust parameters in a real‐time fashion. Particularly, in cloud, users need to evaluate data and pipeline trustworthiness. In this paper, we propose a solution: LogProv, toward realizing these functionalities for big data provenance, which needs to renovate data pipelines or some of big data software infrastructure to generate structured logs for pipeline events, and then stores data and logs separately in cloud space. The data are explicitly linked to the logs, which implicitly record pipeline semantics. Semantic information can be retrieved from the logs easily since they are well defined and structured beforehand. We implemented and deployed LogProv in Nectar Cloud, associated with Apache Pig, Hadoop ecosystem, and adopted Elasticsearch to provide query service. LogProv was evaluated and empirically case studied. The results show that LogProv is efficient since the performance overhead is no more than 10%; the query can be responded within 1 second; the trustworthiness is marked clearly; and there is no impact on the data processing logic of original pipelines.
- Is Part Of:
- Software, practice & experience. Volume 50:Number 5(2020)
- Journal:
- Software, practice & experience
- Issue:
- Volume 50:Number 5(2020)
- Issue Display:
- Volume 50, Issue 5 (2020)
- Year:
- 2020
- Volume:
- 50
- Issue:
- 5
- Issue Sort Value:
- 2020-0050-0005-0000
- Page Start:
- 658
- Page End:
- 674
- Publication Date:
- 2019-09-05
- Subjects:
- big data -- cloud -- data analytic pipeline -- Elasticsearch -- Hadoop -- log -- Nectar -- Pig -- provenance
Computer software -- Periodicals
Computer programming -- Periodicals
Computer programs -- Periodicals
005.3
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/spe.2744
- Languages:
- English
- ISSNs:
- 0038-0644
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 8321.453000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 13133.xml