Textual data summarization using the Self-Organized Co-Clustering model. (July 2020)
- Record Type:
- Journal Article
- Title:
- Textual data summarization using the Self-Organized Co-Clustering model. (July 2020)
- Main Title:
- Textual data summarization using the Self-Organized Co-Clustering model
- Authors:
- Selosse, Margot
Jacques, Julien
Biernacki, Christophe
- Abstract:
- Highlights: The SOCC model is a novel approach for clustering textual data sets. It consists in co-clustering the classic document-term frequency matrix, i.e., in simultaneously clustering the documents and the terms they are made of. The crossing of a document-cluster and a term-cluster is called a block. A structure with meaningful and non-meaningful blocks is proposed. The resulting co-clustering offers highly interpretable results for the user.
Abstract: Recently, different studies have demonstrated the use of co-clustering, a data mining technique which simultaneously produces row-clusters of observations and column-clusters of features. The present work introduces a novel co-clustering model to easily summarize textual data in a document-term format. In addition to highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant co-clusters, which is particularly useful for sparse document-term matrices. Furthermore, our model proposes a structure among the significant co-clusters, thus providing improved interpretability to users. The approach proposed contends with state-of-the-art methods for document and term clustering and offers user-friendly results. The model relies on the Poisson distribution and on a constrained version of the Latent Block Model, which is a probabilistic approach for co-clustering. A Stochastic Expectation-Maximization algorithm is proposed to run the model's inference, as well as a model selection criterion to choose the number of co-clusters. Both simulated and real data sets illustrate the efficiency of this model by its ability to easily identify relevant co-clusters.
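Illustrative note: the abstract describes blocks formed by crossing a document-cluster with a term-cluster, with counts modelled by a Poisson distribution. The sketch below shows that idea only in its simplest form, assuming one free Poisson rate per block. It is not the SOCC model itself: the constrained block structure, the Stochastic Expectation-Maximization inference, and the model selection criterion are not reproduced, and the function names `block_rates` and `poisson_block_loglik` are hypothetical.

```python
import math
import numpy as np


def block_rates(X, z, w, K, L):
    """Estimate one Poisson rate per block as the mean count in that block.

    X: (n, d) document-term count matrix
    z: length-n document-cluster labels in 0..K-1
    w: length-d term-cluster labels in 0..L-1
    """
    X = np.asarray(X, dtype=float)
    Z = np.eye(K)[np.asarray(z)]            # (n, K) one-hot document labels
    W = np.eye(L)[np.asarray(w)]            # (d, L) one-hot term labels
    sums = Z.T @ X @ W                      # (K, L) total count per block
    sizes = np.outer(Z.sum(axis=0), W.sum(axis=0))  # number of cells per block
    return sums / np.maximum(sizes, 1.0)    # guard against empty clusters


def poisson_block_loglik(X, z, w, rates, eps=1e-12):
    """Poisson log-likelihood of X given the two partitions and block rates."""
    X = np.asarray(X, dtype=float)
    lam = rates[np.ix_(np.asarray(z), np.asarray(w))]  # (n, d) rate per cell
    log_fact = np.vectorize(math.lgamma)(X + 1.0)      # log(x!) for each cell
    return float(np.sum(X * np.log(lam + eps) - lam - log_fact))
```

Under these assumptions, a partition that matches the block pattern of the matrix should score a higher log-likelihood than one that mixes documents from different blocks, which is the quantity a co-clustering procedure of this kind would try to increase.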
- Is Part Of:
- Pattern recognition. Volume 103(2020:Jul.)
- Journal:
- Pattern recognition
- Issue:
- Volume 103(2020:Jul.)
- Issue Display:
- Volume 103 (2020)
- Year:
- 2020
- Volume:
- 103
- Issue Sort Value:
- 2020-0103-0000-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-07
- Subjects:
- Co-Clustering -- Document-term matrix -- Latent block model
Pattern perception -- Periodicals
Perception des structures -- Périodiques
Patroonherkenning
006.4
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00313203
http://www.sciencedirect.com/
- DOI:
- 10.1016/j.patcog.2020.107315
- Languages:
- English
- ISSNs:
- 0031-3203
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13455.xml