Detecting coherent explorations in SQL workloads. (September 2020)
- Record Type:
- Journal Article
- Title:
- Detecting coherent explorations in SQL workloads. (September 2020)
- Main Title:
- Detecting coherent explorations in SQL workloads
- Authors:
- Peralta, Verónika
Marcel, Patrick
Verdeaux, Willeme
Diakhaby, Aboubakar Sidikhy
- Abstract:
- This paper presents a proposal aiming at better understanding a workload of SQL queries and detecting coherent explorations hidden within the workload. In particular, our work investigates SQLShare (Jain et al., 2016), a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of (Jain et al., 2016), this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries and we investigate three different machine learning approaches to use these features to separate sequences of SQL queries into meaningful explorations. The first approach is unsupervised and based only on similarity between contiguous queries. The second approach uses transfer learning to apply a model trained over a dataset where ground truth is available. The last approach uses weak labeling to predict the most probable segmentation from heuristics meant to label a training set. We ran several tests over various query workloads to evaluate and compare the proposed methods.
Highlights:
We propose an approach for segmenting a SQL workload into meaningful, coherent database explorations.
We describe each query of a SQL workload with a set of features that are intrinsic to a query or that relate the query to its neighbor in the workload.
We define a set of similarity indexes, exploiting query features, that quantify the similarity between contiguous queries from several points of view.
We proposed three methods for workload segmentation: unsupervised learning, supervised learning and weak supervision. …
- Is Part Of:
- Information systems. Volume 92(2020)
- Journal:
- Information systems
- Issue:
- Volume 92(2020)
- Issue Display:
- Volume 92, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 92
- Issue:
- 2020
- Issue Sort Value:
- 2020-0092-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-09
- Subjects:
- Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2019.101479
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13592.xml