Coarse-refinement dilemma: On generalization bounds for data clustering. (1st December 2021)
- Record Type:
- Journal Article
- Title:
- Coarse-refinement dilemma: On generalization bounds for data clustering. (1st December 2021)
- Main Title:
- Coarse-refinement dilemma: On generalization bounds for data clustering
- Authors:
- Vaz, Yule
de Mello, Rodrigo Fernandes
Grossi Ferreira, Carlos Henrique - Abstract:
- Highlights: Topological definition of the data clustering and hierarchical clustering. Presentation and demonstration of the Coarse-Refinement Dilemma. Presentation of homology consistency based on multidimensional persistent homology. Formulation of generalization bounds for homology consistency. Abstract: The data clustering problem is of central importance for the area of machine learning, given its usefulness to represent data structural similarities from input spaces. Although, data clustering counts on scarse literature of a theoretical framework with generalization guarantees. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions on which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of DC problem by relying on topological spaces, instead of metric ones as typically approached in the literature. From that, we show that the data clustering problem presents an analogous dilemma to the bias-variance one, which is here referred to as the coarse-refinement dilemma, from which we conclude that: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting). The coarse-refinement dilemma suggests the need of a relaxation of Kleinberg's richness axiom, as such axiom allows the production of unstable or unrepresentative partitions. ExperimentalHighlights: Topological definition of the data clustering and hierarchical clustering. Presentation and demonstration of the Coarse-Refinement Dilemma. Presentation of homology consistency based on multidimensional persistent homology. Formulation of generalization bounds for homology consistency. Abstract: The data clustering problem is of central importance for the area of machine learning, given its usefulness to represent data structural similarities from input spaces. Although, data clustering counts on scarse literature of a theoretical framework with generalization guarantees. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions on which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of DC problem by relying on topological spaces, instead of metric ones as typically approached in the literature. From that, we show that the data clustering problem presents an analogous dilemma to the bias-variance one, which is here referred to as the coarse-refinement dilemma, from which we conclude that: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting). The coarse-refinement dilemma suggests the need of a relaxation of Kleinberg's richness axiom, as such axiom allows the production of unstable or unrepresentative partitions. Experimental exploration considering different clustering refinements can, then, depict such partitions. … (more)
- Is Part Of:
- Expert systems with applications. Volume 184(2021)
- Journal:
- Expert systems with applications
- Issue:
- Volume 184(2021)
- Issue Display:
- Volume 184, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 184
- Issue:
- 2021
- Issue Sort Value:
- 2021-0184-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-12-01
- Subjects:
- Data clustering -- Topology -- Persistent homology -- Multidimensional persistence -- Algorithm stability
Expert systems (Computer science) -- Periodicals
Systèmes experts (Informatique) -- Périodiques
Electronic journals
006.33 - Journal URLs:
- http://www.sciencedirect.com/science/journal/09574174 ↗
http://www.elsevier.com/journals ↗ - DOI:
- 10.1016/j.eswa.2021.115399 ↗
- Languages:
- English
- ISSNs:
- 0957-4174
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 3842.004220
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 18643.xml