Similarity query support in big data management systems. (February 2020)
- Record Type:
- Journal Article
- Title:
- Similarity query support in big data management systems. (February 2020)
- Main Title:
- Similarity query support in big data management systems
- Authors:
- Kim, Taewoo
Li, Wenhai
Behm, Alexander
Cetindil, Inci
Vernica, Rares
Borkar, Vinayak
Carey, Michael J.
Li, Chen
- Abstract:
- Similarity query processing is becoming increasingly important in many applications such as data cleaning, record linkage, Web search, and document analytics. In this paper we study how to provide end-to-end similarity query support natively in a parallel database system. We discuss how to express a similarity predicate in its query language, how to build indexes, how to answer similarity queries (selections and joins) efficiently in the runtime engine, possibly using indexes, and how to optimize similarity queries. One particular challenge is how to incorporate existing similarity join algorithms, which often require a series of steps to achieve a high efficiency, including collecting token frequencies, finding matching record id pairs, and reassembling result records based on id pairs. We present a novel approach that uses existing runtime operators to implement such complex join algorithms without reinventing the wheel; doing so positions the system to automatically benefit from future improvements to those operators. The approach includes a technique to transform a similarity join plan into an efficient operator-based physical plan during query optimization by using a template expressed largely in the system's user-level query language; this technique greatly simplifies the specification of such a transformation rule. We use Apache AsterixDB, a parallel Big Data management system, to illustrate and validate our techniques. We conduct an experimental study using several large, real datasets on a parallel computing cluster to assess the similarity query support. We also include experiments involving three other parallel systems and report the efficacy and performance results.
Highlights: Extends the existing query language of a parallel DBMS to support similarity queries. Uses existing operators in the system to implement state-of-the-art techniques. Presents a novel framework called the "AQL+" to optimize similarity queries. Includes empirical similarity query experiments using several large, real datasets. Compares the approach with three other parallel systems to show its relative efficacy.
- Is Part Of:
- Information systems. Volume 88(2020)
- Journal:
- Information systems
- Issue:
- Volume 88(2020)
- Issue Display:
- Volume 88, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 88
- Issue:
- 2020
- Issue Sort Value:
- 2020-0088-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-02
- Subjects:
- Similarity query -- Parallel database -- Optimization
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2019.101455
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 12595.xml