Predicting structural class for protein sequences of 40% identity based on features of primary and secondary structure using Random Forest algorithm. (February 2020)
- Record Type:
- Journal Article
- Title:
- Predicting structural class for protein sequences of 40% identity based on features of primary and secondary structure using Random Forest algorithm. (February 2020)
- Main Title:
- Predicting structural class for protein sequences of 40% identity based on features of primary and secondary structure using Random Forest algorithm
- Authors:
- Apurva, Mehta
Mazumdar, Himanshu
- Abstract:
- Highlights: The latest and largest dataset, SCOPe 2.07, is used along with the benchmark datasets ASTRAL 1.73, 25PDB and FC699. The Random Forest algorithm uses feature vectors based on primary and secondary structure for prediction. Species-based features are recognized as sensitive features for prediction. The performance for classes α/β and α + β increased significantly.
Abstract: At present, the growth rate of tertiary structure discovery lags far behind that of primary structure discovery. Predicting protein structural class with Machine Learning techniques can help reduce this gap. The Structural Classification of Proteins – extended (SCOPe 2.07) is the latest and largest dataset available at present. Protein sequences with less than 40% identity to each other are used for predicting the α, β, α/β and α + β SCOPe classes. Sensitive features are extracted from the primary and secondary structure representations of proteins. Features are extracted experimentally from secondary structure with respect to its frequency, pitch and spatial arrangement. Primary-structure-based features contain species information for a protein sequence. The species parameters are further validated against the UniRef100 dataset using TaxId. Protein tertiary structure is known to be a manifestation of function, and functional differences are observed between species. Hence, species are expected to correlate strongly with structural class, and the current work finds such a correlation. It enhances prediction accuracy by 7%–10%. A Random Forest classifier is trained on a subset of SCOPe 2.07 using a 65-dimensional feature vector. Testing on the rest of the set gives a consistent accuracy of better than 95%. The accuracy achieved on the benchmark datasets ASTRAL 1.73, 25PDB and FC699 is better than 86%, 91% and 97% respectively, which is, to our knowledge, the best reported.
- Is Part Of:
- Computational biology and chemistry. Volume 84(2020)
- Journal:
- Computational biology and chemistry
- Issue:
- Volume 84(2020)
- Issue Display:
- Volume 84, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 84
- Issue:
- 2020
- Issue Sort Value:
- 2020-0084-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-02
- Subjects:
- Evolutionary -- Low identity -- Protein secondary structure -- Protein structural class -- Random Forest -- Structural classification of proteins (SCOP)
Chemistry -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
Biochemistry -- Data processing
Biology -- Data processing
Molecular biology -- Data processing
Periodicals
Electronic journals
542.85
- Journal URLs:
- http://www.sciencedirect.com/science/journal/14769271 ↗
http://www.elsevier.com/journals ↗
- DOI:
- 10.1016/j.compbiolchem.2019.107164 ↗
- Languages:
- English
- ISSNs:
- 1476-9271
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 3390.576700
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 12624.xml
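
The abstract in this record mentions features drawn from the frequency and spatial arrangement of secondary structure elements. As a rough illustration only, the sketch below computes a few such features from a three-state secondary structure string (H = helix, E = strand, C = coil). The function name, the feature names and the choice of features are assumptions made for this example; they are not the authors' 65-dimensional feature vector. The helix/strand alternation count is a common heuristic for telling α/β folds (alternating segments) from α + β folds (segregated segments), and is likewise an assumption rather than something the record states.

```python
from itertools import groupby


def secondary_structure_features(ss):
    """Return simple composition and segment features for an H/E/C string.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not ss:
        raise ValueError("empty secondary structure string")
    n = len(ss)
    # Collapse runs of identical states into (state, run_length) segments.
    segments = [(state, len(list(run))) for state, run in groupby(ss)]
    helix = [length for state, length in segments if state == "H"]
    strand = [length for state, length in segments if state == "E"]
    # Order of helix and strand segments with coil segments dropped.
    order = [state for state, _ in segments if state in ("H", "E")]
    # Number of times consecutive segments switch between helix and strand.
    switches = sum(1 for a, b in zip(order, order[1:]) if a != b)
    return {
        "frac_H": ss.count("H") / n,
        "frac_E": ss.count("E") / n,
        "n_H_segments": len(helix),
        "n_E_segments": len(strand),
        "max_H_len": max(helix, default=0),
        "max_E_len": max(strand, default=0),
        "HE_switches": switches,
    }


if __name__ == "__main__":
    print(secondary_structure_features("CHHHHCCEEECCHHHHHC"))
```

A dictionary like this could be flattened into a numeric vector and passed to any Random Forest implementation (for example scikit-learn's `RandomForestClassifier`) together with class labels.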