Discovering and merging related analytic datasets. (July 2020)
- Record Type:
- Journal Article
- Title:
- Discovering and merging related analytic datasets. (July 2020)
- Main Title:
- Discovering and merging related analytic datasets
- Authors:
- Liu, Rutian
Simon, Eric
Amann, Bernd
Gançarski, Stéphane
- Abstract:
- The production of analytic datasets is a significant big data trend and has gone well beyond the scope of traditional IT-governed dataset development. Analytic datasets are now created by data scientists and data analysts using big data frameworks and agile data preparation tools. However, despite the profusion of available datasets, it remains quite difficult for a data analyst to start from a dataset at hand and customize it with additional attributes coming from other existing datasets. This article describes a model and algorithms that exploit automatically extracted and user-defined semantic relationships for extending analytic datasets with new atomic or aggregated attribute values. Our framework is implemented as a REST service in SAP HANA and includes a careful theoretical analysis and practical solutions for several complex data quality issues.
Highlights: Attribute graphs for literal functional dependencies in hierarchical dimensions. Schema augmentation and reduction operations to eliminate tuple multiplication. Quality criteria and automatic repair operations for merging schema augmentations. Detailed description of the implementations within SAP HANA platform. Experimental evaluations on real datasets and usage scenarios.
- Is Part Of:
- Information systems. Volume 91(2020)
- Journal:
- Information systems
- Issue:
- Volume 91(2020)
- Issue Display:
- Volume 91, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 91
- Issue:
- 2020
- Issue Sort Value:
- 2020-0091-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-07
- Subjects:
- Schema augmentation -- Schema complement -- Data quality -- SAP HANA
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2020.101495
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13537.xml