Method evaluation, parameterization, and result validation in unsupervised data mining: A critical survey. (29th July 2019)
- Record Type:
- Journal Article
- Title:
- Method evaluation, parameterization, and result validation in unsupervised data mining: A critical survey. (29th July 2019)
- Main Title:
- Method evaluation, parameterization, and result validation in unsupervised data mining: A critical survey
- Authors:
- Zimmermann, Albrecht
- Abstract:
- Machine Learning (ML) and Data Mining (DM) build tools intended to help users solve data‐related problems that are infeasible for "unaugmented" humans. Tools need manuals, however, and in the case of ML/DM methods, this means guidance with respect to which technique to choose, how to parameterize it, and how to interpret derived results to arrive at knowledge about the phenomena underlying the data. While such information is available in the literature, it has not yet been collected in one place. We survey three types of work for clustering and pattern mining: (1) comparisons of existing techniques, (2) evaluations of different parameterization options and studies providing guidance for setting parameter values, and (3) work comparing mining results with the ground truth. We find that although interesting results exist, as a whole the body of work on these questions is too limited. In addition, we survey recent studies in the field of community detection, as a contrasting example. We argue that an objective obstacle for performing needed studies is a lack of data and survey the state of available data, pointing out certain limitations. As a solution, we propose to augment existing data by artificially generated data, review the state‐of‐the‐art in data generation in unsupervised mining, and identify shortcomings. In more general terms, we call for the development of a true "Data Science" that, based on work in other domains, results in ML, and existing tools, develops needed data generators and builds up the knowledge needed to effectively employ unsupervised mining techniques.
This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Ensemble Methods > Structure Discovery; Internet > Society and Culture; Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining.
Abstract: Numerous comparisons of frequent itemset mining techniques have failed to establish a clear performance hierarchy.
- Is Part Of:
- Wiley interdisciplinary reviews. Volume 10:Number 2(2020)
- Journal:
- Wiley interdisciplinary reviews
- Issue:
- Volume 10:Number 2(2020)
- Issue Display:
- Volume 10, Issue 2 (2020)
- Year:
- 2020
- Volume:
- 10
- Issue:
- 2
- Issue Sort Value:
- 2020-0010-0002-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2019-07-29
- Subjects:
- algorithmic comparison -- clustering -- parameter selection -- pattern mining -- result verification
Data mining -- Periodicals
006.31205
- Journal URLs:
- http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1942-4795
http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/widm.1330
- Languages:
- English
- ISSNs:
- 1942-4787
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 23782.xml