Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis. (June 2021)
- Record Type:
- Journal Article
- Title:
- Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis. (June 2021)
- Main Title:
- Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis
- Authors:
- Jha, Preeti
Tiwari, Aruna
Bharill, Neha
Ratnaparkhe, Milind
Mounika, Mukkamalla
Nagendra, Neha
- Abstract:
- Highlights: Kernelized fuzzy clustering algorithms based on Apache Spark for the clustering of huge Single Nucleotide Polymorphism (SNP) sequences. SNP preprocessing is used for feature extraction. The complexity of the kernelized algorithms is linear.
Abstract: This paper introduces a kernel-based fuzzy clustering approach for non-linearly separable problems. It applies Radial Basis Function (RBF) kernels, which map the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomics data is extremely challenging for bioinformatics researchers working on genome analysis. To support investigations in bioinformatics, specifically genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering Single Nucleotide Polymorphism (SNP) sequences. The paper proposes Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which inherently uses another proposed algorithm, Kernelized Scalable Literal Fuzzy c-Means (KSLFCM). Both approaches are fully adapted to the Apache Spark cluster framework through a localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we propose a preprocessing approach that generates numeric feature vectors for huge SNP sequences and is made scalable by executing it on an Apache Spark cluster. It is applied to real-world SNP datasets taken from open-internet repositories for two plant species, soybean and rice. Comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows a significant improvement in time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments are performed on various SNP datasets to show the effectiveness of the proposed KSRSIO-FCM in comparison with the proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM and SLFCM. …
- Is Part Of:
- Computational biology and chemistry. Volume 92(2021)
- Journal:
- Computational biology and chemistry
- Issue:
- Volume 92(2021)
- Issue Display:
- Volume 92, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 92
- Issue:
- 2021
- Issue Sort Value:
- 2021-0092-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-06
- Subjects:
- High-dimensional -- Non-linear -- Apache Spark -- SNP sequences -- Kernelized fuzzy clustering
Chemistry -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
Biochemistry -- Data processing
Biology -- Data processing
Molecular biology -- Data processing
Periodicals
Electronic journals
542.85
- Journal URLs:
- http://www.sciencedirect.com/science/journal/14769271
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compbiolchem.2021.107454
- Languages:
- English
- ISSNs:
- 1476-9271
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3390.576700
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 18236.xml