Stochastic Estimations of the Total Number of Classes for a Clustering having Extremely Large Samples to be Included in the Clustering Engine. Issue 5 (26th March 2021)
- Record Type:
- Journal Article
- Title:
- Stochastic Estimations of the Total Number of Classes for a Clustering having Extremely Large Samples to be Included in the Clustering Engine. Issue 5 (26th March 2021)
- Main Title:
- Stochastic Estimations of the Total Number of Classes for a Clustering having Extremely Large Samples to be Included in the Clustering Engine
- Authors:
- Utimula, Keishu
Prayogo, Genki I.
Nakano, Kousuke
Hongo, Kenta
Maezono, Ryo
- Abstract:
- Numerous reports have elucidated the classification of a large amount of data using various clustering techniques. However, an increase in data size hinders the applicability of these methods. Here, it is investigated how to deal with the exploding number of possibilities to be sorted into irreducible classes by using a clustering technique when its input capacity cannot accommodate the total number of possibilities. This can be exemplified by atomic substitutions in the supercell modeling of alloys. The number of possibilities is sometimes equal to trillions, which is extremely large to be accommodated in a cluster. Thus, it is not practically feasible to identify directly how many irreducible classes exist even though several techniques are available to perform the clustering. In this regard, a stochastic framework is developed to avoid the shortage limitations, providing a method to estimate the total number of irreducible classes (the order of classes), as a statistical estimate. The main conclusion is that the statistical variation of the number of classes, at each sampling trial, can serve as a promising measure to estimate the total number of irreducible classes. Characteristics of this approach are also discussed by comparing with the conventional one based on Polya's theorem.
For such a huge dataset (approx. several billion), which cannot be accommodated by a given clustering tool, it is difficult to identify the total number of clusters. This difficulty can be resolved stochastically by a newly proposed framework that repeats the clustering with affordable sample sizes randomly taken from the whole pool. …
- Is Part Of:
- Advanced theory and simulations. Volume 4:Issue 5(2021)
- Journal:
- Advanced theory and simulations
- Issue:
- Volume 4:Issue 5(2021)
- Issue Display:
- Volume 4, Issue 5 (2021)
- Year:
- 2021
- Volume:
- 4
- Issue:
- 5
- Issue Sort Value:
- 2021-0004-0005-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2021-03-26
- Subjects:
- classification algorithms -- clustering techniques -- irreducible classes -- machine learning -- stochastic frameworks
Science -- Simulation methods -- Periodicals
Science -- Methodology -- Periodicals
Engineering -- Simulation methods -- Periodicals
Engineering -- Methodology -- Periodicals
507.21
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/adts.202000301
- Languages:
- English
- ISSNs:
- 2513-0390
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 0696.935575
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 16901.xml