Correlation and redundancy on machine learning performance for chemical databases. (19th February 2018)
- Record Type:
- Journal Article
- Title:
- Correlation and redundancy on machine learning performance for chemical databases. (19th February 2018)
- Main Title:
- Correlation and redundancy on machine learning performance for chemical databases
- Authors:
- Li, Hongzhi
Li, Wenze
Pan, Xuefeng
Huang, Jiaqi
Gao, Ting
Hu, LiHong
Li, Hui
Lu, Yinghua
- Abstract:
- Variable reduction is an essential step for establishing a robust, accurate, and generalized machine learning model. Variable correlation and redundancy/total correlation are the primary considerations in many variable reduction methods given that they directly impact model performance. However, their effects vary from one class of databases to another. To clarify their effects on regression models built on small chemical databases, a series of calculations is performed. Regression models are built on features with various correlation coefficients and redundancies by 4 machine learning methods: random forest, support vector machine, extreme learning machine, and multiple linear regression. The results suggest that the correlation is, as expected, closely related to the prediction accuracy; i.e., features with large correlation coefficients with respect to the response variable generally yield better regression models than those with lower ones. For redundancy, however, no trend in the performance of the regression models is observed. This may indicate that, for these chemical molecular databases, redundancy is not a primary concern.
Feature correlation and redundancy, quasi-pairwise factors in machine learning modeling, are widely considered in variable reduction methods. Their effects on regression models are not uniform across databases in different areas. Therefore, they are investigated for 4 types of regression models (random forest, support vector machine, extreme learning machine, and multiple linear regression) based on small chemical databases with quantum chemical and structural molecular descriptors. The correlation is closely related to the prediction, and higher correlation generally leads to better predictions; the effect of redundancy shows no clear trend, which means that redundancy does not necessarily degrade regression models based on chemical databases. On the basis of regression and density functional theory calculations, an optimal setting for obtaining quantum chemical descriptors is suggested for regression modeling of similar databases.
- Is Part Of:
- Journal of chemometrics. Volume 32:Number 7(2018)
- Journal:
- Journal of chemometrics
- Issue:
- Volume 32:Number 7(2018)
- Issue Display:
- Volume 32, Issue 7 (2018)
- Year:
- 2018
- Volume:
- 32
- Issue:
- 7
- Issue Sort Value:
- 2018-0032-0007-0000
- Page Start:
- n/a
- Page End:
- n/a
- Publication Date:
- 2018-02-19
- Subjects:
- chemical databases -- correlation -- density functional theory (DFT) -- machine learning regression -- redundancy/total correlation
Chemistry -- Mathematics -- Periodicals
Chemistry -- Statistical methods -- Periodicals
542.85
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cem.3023
- Languages:
- English
- ISSNs:
- 0886-9383
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4957.380000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9296.xml