Mining Compressing Sequential Patterns. (23rd May 2013)
- Record Type:
- Journal Article
- Title:
- Mining Compressing Sequential Patterns. (23rd May 2013)
- Main Title:
- Mining Compressing Sequential Patterns
- Authors:
- Lam, Hoang Thanh
Mörchen, Fabian
Fradkin, Dmitriy
Calders, Toon - Abstract:
- <abstract abstract-type="main" xml:lang="en"> <title>Abstract</title> <p>Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm based on the <italic>minimum</italic><italic>description length</italic> (MDL) principle was shown to be very effective in solving the redundancy issue in descriptive pattern mining. However, for sequence data, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL‐based algorithms for mining non‐redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first of these uses a two‐phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until no additional compression benefit of adding the extension into the dictionary. Since checks for additional compression benefit of an extension are computationally expensive we propose a dependency test which only chooses related events for extending<abstract abstract-type="main" xml:lang="en"> <title>Abstract</title> <p>Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm based on the <italic>minimum</italic><italic>description length</italic> (MDL) principle was shown to be very effective in solving the redundancy issue in descriptive pattern mining. However, for sequence data, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL‐based algorithms for mining non‐redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first of these uses a two‐phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until no additional compression benefit of adding the extension into the dictionary. Since checks for additional compression benefit of an extension are computationally expensive we propose a dependency test which only chooses related events for extending a given pattern. This technique improves the efficiency of the GoKrimp algorithm significantly while it still preserves the quality of the set of patterns. We conduct an empirical study on eight datasets to show the effectiveness of our approach in comparison to the state‐of‐the‐art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy using the discovered patterns as features for different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013</p> </abstract> … (more)
- Is Part Of:
- Statistical analysis and data mining. Volume 7:Number 1(2014)
- Journal:
- Statistical analysis and data mining
- Issue:
- Volume 7:Number 1(2014)
- Issue Display:
- Volume 7, Issue 1 (2014)
- Year:
- 2014
- Volume:
- 7
- Issue:
- 1
- Issue Sort Value:
- 2014-0007-0001-0000
- Page Start:
- 34
- Page End:
- 52
- Publication Date:
- 2013-05-23
- Subjects:
- Data mining -- Statistical methods -- Periodicals
006.312 - Journal URLs:
- http://www3.interscience.wiley.com/journal/112701062/home ↗
http://onlinelibrary.wiley.com/ ↗ - DOI:
- 10.1002/sam.11192 ↗
- Languages:
- English
- ISSNs:
- 1932-1864
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 8447.424100
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 3305.xml