A modified two-stage Markov clustering algorithm for large and sparse networks. Issue 135 (October 2016)
- Record Type:
- Journal Article
- Title:
- A modified two-stage Markov clustering algorithm for large and sparse networks. Issue 135 (October 2016)
- Main Title:
- A modified two-stage Markov clustering algorithm for large and sparse networks
- Authors:
- Szilágyi, László
Szilágyi, Sándor M. - Abstract:
- Highlights: A fast and memory efficient, two-stage Markov clustering algorithm is proposed. The algorithm is designed for the analysis of large and sparse graphs. The algorithm was validated using real and synthetic protein sequence data. It can accurately process a million-node BLAST similarity network in 100 minutes. Abstract: Background: Graph-based hierarchical clustering algorithms become prohibitively costly in both execution time and storage space, as the number of nodes approaches the order of millions. Objective: A fast and highly memory efficient Markov clustering algorithm is proposed to perform the classification of huge sparse networks using an ordinary personal computer. Methods: Improvements compared to previous versions are achieved through adequately chosen data structures that facilitate the efficient handling of symmetric sparse matrices. Clustering is performed in two stages: the initial connected network is processed in a sparse matrix until it breaks into isolated, small, and relatively dense subgraphs, which are then processed separately until convergence is obtained. An intelligent stopping criterion is also proposed to quit further processing of a subgraph that tends toward completeness with equal edge weights. The main advantage of this algorithm is that the necessary number of iterations is separately decided for each graph node. Results: The proposed algorithm was tested using the SCOP95 and large synthetic protein sequence data sets. TheHighlights: A fast and memory efficient, two-stage Markov clustering algorithm is proposed. The algorithm is designed for the analysis of large and sparse graphs. The algorithm was validated using real and synthetic protein sequence data. It can accurately process a million-node BLAST similarity network in 100 minutes. Abstract: Background: Graph-based hierarchical clustering algorithms become prohibitively costly in both execution time and storage space, as the number of nodes approaches the order of millions. Objective: A fast and highly memory efficient Markov clustering algorithm is proposed to perform the classification of huge sparse networks using an ordinary personal computer. Methods: Improvements compared to previous versions are achieved through adequately chosen data structures that facilitate the efficient handling of symmetric sparse matrices. Clustering is performed in two stages: the initial connected network is processed in a sparse matrix until it breaks into isolated, small, and relatively dense subgraphs, which are then processed separately until convergence is obtained. An intelligent stopping criterion is also proposed to quit further processing of a subgraph that tends toward completeness with equal edge weights. The main advantage of this algorithm is that the necessary number of iterations is separately decided for each graph node. Results: The proposed algorithm was tested using the SCOP95 and large synthetic protein sequence data sets. The validation process revealed that the proposed method can reduce 3–6 times the processing time of huge sequence networks compared to previous Markov clustering solutions, without losing anything from the partition quality. Conclusions: A one-million-node and one-billion-edge protein sequence network defined by a BLAST similarity matrix can be processed with an upper-class personal computer in 100 minutes. Further improvement in speed is possible via parallel data processing, while the extension toward several million nodes needs intermediary data storage, for example on solid state drives. … (more)
- Is Part Of:
- Computer methods and programs in biomedicine. Issue 135(2016)
- Journal:
- Computer methods and programs in biomedicine
- Issue:
- Issue 135(2016)
- Issue Display:
- Volume 135, Issue 135 (2016)
- Year:
- 2016
- Volume:
- 135
- Issue:
- 135
- Issue Sort Value:
- 2016-0135-0135-0000
- Page Start:
- 15
- Page End:
- 26
- Publication Date:
- 2016-10
- Subjects:
- Hierarchical clustering -- Markov clustering -- Efficient computing -- Sparse matrix -- Protein sequence networks
Medicine -- Computer programs -- Periodicals
Biology -- Computer programs -- Periodicals
Computers -- Periodicals
Medicine -- Periodicals
Médecine -- Logiciels -- Périodiques
Biologie -- Logiciels -- Périodiques
Biology -- Computer programs
Medicine -- Computer programs
Periodicals
Electronic journals
610.28 - Journal URLs:
- http://www.sciencedirect.com/science/journal/01692607 ↗
http://www.elsevier.com/journals ↗ - DOI:
- 10.1016/j.cmpb.2016.07.007 ↗
- Languages:
- English
- ISSNs:
- 0169-2607
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 3394.095000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 1650.xml