An efficient entropy based dissimilarity measure to cluster categorical data. (March 2023)
- Record Type:
- Journal Article
- Title:
- An efficient entropy based dissimilarity measure to cluster categorical data. (March 2023)
- Main Title:
- An efficient entropy based dissimilarity measure to cluster categorical data
- Authors:
- Kar, Amit Kumar
Mishra, Amaresh Chandra
Mohanty, Sraban Kumar
- Abstract:
- Clustering is an unsupervised learning technique that discovers intrinsic groups based on proximity between data points. The performance of clustering techniques therefore relies mainly on the proximity measure used to compute the (dis)similarity between data objects. In general, it is relatively easy to compute the distance between numerical data points, as numerical operations can be applied directly to values along features. For categorical datasets, however, computing the (dis)similarity between data objects is a non-trivial problem. In this paper, we therefore propose a new distance metric, based on an information-theoretic approach, to compute the dissimilarity between categorical data points. We compute entropy along each feature to capture the intra-attribute statistical information, based on which the significance of attributes is decided during clustering. The proposed measure is free from any domain-dependent parameters and does not rely on the distribution of data points. Experiments are conducted over diversified benchmark data sets, considering six competing proximity measures with three popular clustering algorithms, and the clustering results are compared in terms of RI (Rand Index), ARI (Adjusted Rand Index), CA (Clustering Accuracy) and CDM (Cluster Discrimination Matrix). On over 85 percent of the data sets, the proposed metric embedded in K-Mode and Weighted K-Mode achieves higher clustering accuracy than its counterparts. Approximately 0.2951 s is needed by the proposed metric to cluster a data set having 10,000 data points with 8 attributes and 2 clusters on a standard desktop machine. Overall, experimental results demonstrate the efficacy of the proposed metric in handling complex real datasets of different characteristics.
- Is Part Of:
- Engineering applications of artificial intelligence. Volume 119(2023)
- Journal:
- Engineering applications of artificial intelligence
- Issue:
- Volume 119(2023)
- Issue Display:
- Volume 119, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 119
- Issue:
- 2023
- Issue Sort Value:
- 2023-0119-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-03
- Subjects:
- Distance metric -- Dissimilarity metric for categorical data -- Entropy based dissimilarity measure -- Proximity measure for clustering -- Dissimilarity measure for clustering
Engineering -- Data processing -- Periodicals
Artificial intelligence -- Periodicals
Expert systems (Computer science) -- Periodicals
Ingénierie -- Informatique -- Périodiques
Intelligence artificielle -- Périodiques
Systèmes experts (Informatique) -- Périodiques
Artificial intelligence
Engineering -- Data processing
Expert systems (Computer science)
Periodicals
620.00285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/09521976
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.engappai.2022.105795
- Languages:
- English
- ISSNs:
- 0952-1976
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3755.704500
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 25681.xml
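
- Illustrative note:
- The abstract describes computing entropy along each feature and using it to decide the significance of attributes when measuring dissimilarity between categorical data points. This record does not give the paper's actual formula, so the following is only a minimal Python sketch of the general idea, not the authors' metric. The function names (attribute_entropies, entropy_weights, weighted_mismatch) are hypothetical, and the weighting rule (weight proportional to the attribute's Shannon entropy, applied to a simple mismatch count) is an assumption made for illustration.

```python
import math
from collections import Counter


def attribute_entropies(records):
    """Shannon entropy (base 2) of each attribute (column) of the data set."""
    n = len(records)
    n_attrs = len(records[0])
    entropies = []
    for j in range(n_attrs):
        counts = Counter(row[j] for row in records)
        # H = sum_v p(v) * log2(1 / p(v)), with p(v) = count(v) / n
        h = sum((c / n) * math.log2(n / c) for c in counts.values())
        entropies.append(h)
    return entropies


def entropy_weights(records):
    """Normalise entropies into attribute weights that sum to 1.

    ASSUMPTION: higher entropy means a larger weight. The paper may use a
    different mapping from entropy to attribute significance.
    """
    ent = attribute_entropies(records)
    total = sum(ent)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / len(ent)] * len(ent)
    return [h / total for h in ent]


def weighted_mismatch(x, y, weights):
    """Sum the weights of the attributes on which x and y disagree."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


if __name__ == "__main__":
    data = [
        ("red", "small", "x"),
        ("red", "medium", "x"),
        ("blue", "small", "x"),
        ("blue", "medium", "x"),
    ]
    w = entropy_weights(data)
    print(w)
    print(weighted_mismatch(data[0], data[3], w))
```

In this sketch a constant attribute (the third column in the sample data) has zero entropy and so contributes nothing to the distance. A measure of this kind could be substituted for plain mismatch counting inside a K-Mode style algorithm, which is the setting the abstract evaluates.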