Tree edit distance: Robust and memory-efficient. (March 2016)
- Record Type:
- Journal Article
- Title:
- Tree edit distance: Robust and memory-efficient. (March 2016)
- Main Title:
- Tree edit distance: Robust and memory-efficient
- Authors:
- Pawlik, Mateusz
Augsten, Nikolaus - Abstract:
- Abstract: Hierarchical data are often modelled as trees. An interesting query identifies pairs of similar trees. The standard approach to tree similarity is the tree edit distance, which has successfully been applied in a wide range of applications. In terms of runtime, the state-of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast independent of the tree shape. Unfortunately, this algorithm requires up to twice the memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for the tree edit distance computation. In this paper we present a new, memory efficient algorithm for the tree edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without trading in memory efficiency. This is achieved by releasing memory early during the first step of the algorithm, which computes a decomposition strategy for the actual distance computation. We show the correctness of our approach and prove an upper bound for the memory usage. The strategy computed by AP-TED is optimal in the class of all-path strategies, which subsumes the class of LRH strategies used in RTED. We further present the AP-TED + algorithm, which requires less computational effort for very small subtrees and improves the runtime of the distance computation. Our experimental evaluation confirms the low memory requirements and the runtime efficiency of our approach. Abstract : Highlights: We address the memory problemAbstract: Hierarchical data are often modelled as trees. An interesting query identifies pairs of similar trees. The standard approach to tree similarity is the tree edit distance, which has successfully been applied in a wide range of applications. In terms of runtime, the state-of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast independent of the tree shape. Unfortunately, this algorithm requires up to twice the memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for the tree edit distance computation. In this paper we present a new, memory efficient algorithm for the tree edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without trading in memory efficiency. This is achieved by releasing memory early during the first step of the algorithm, which computes a decomposition strategy for the actual distance computation. We show the correctness of our approach and prove an upper bound for the memory usage. The strategy computed by AP-TED is optimal in the class of all-path strategies, which subsumes the class of LRH strategies used in RTED. We further present the AP-TED + algorithm, which requires less computational effort for very small subtrees and improves the runtime of the distance computation. Our experimental evaluation confirms the low memory requirements and the runtime efficiency of our approach. Abstract : Highlights: We address the memory problem of the strategy computation in the RTED algorithm for the tree edit distance. We prove an upper bound which guarantees that the strategy computation never uses more memory than the distance computation. We compute the optimal strategy in the class of all-path strategies which subsumes the LRH strategies used before. We develop new single-path functions which are better in terms of runtime and memory than the previously used functions. … (more)
- Is Part Of:
- Information systems. Volume 56(2016)
- Journal:
- Information systems
- Issue:
- Volume 56(2016)
- Issue Display:
- Volume 56, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 56
- Issue:
- 2016
- Issue Sort Value:
- 2016-0056-2016-0000
- Page Start:
- 157
- Page End:
- 173
- Publication Date:
- 2016-03
- Subjects:
- Tree edit distance -- Similarity search -- Approximate matching
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7 - Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379 ↗
http://www.elsevier.com/journals ↗ - DOI:
- 10.1016/j.is.2015.08.004 ↗
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 1609.xml