RESI: A Region-Splitting Imputation method for different types of missing data. (15th April 2021)
- Record Type:
- Journal Article
- Title:
- RESI: A Region-Splitting Imputation method for different types of missing data. (15th April 2021)
- Main Title:
- RESI: A Region-Splitting Imputation method for different types of missing data
- Authors:
- Peng, Dunlu
Zou, Mengping
Liu, Cong
Lu, Jing
- Abstract:
- A certain degree of data loss seriously affects the accuracy and availability of data, and especially the results of subsequent in-depth data analysis and mining. It is of great value in practical applications to construct a data imputation model that is suitable for completing different types of missing data, including numerical-only, categorical-only and mixed-type data, and that has a strong capability of generalization. To address this issue, this paper defines a new metric, mean integrity rate, to measure the missing degree of a dataset, and proposes RESI, a novel tuple-based REgion-Splitting Imputation model, to impute different types of missing data. We first select features and assign weights to each attribute by using the entropy weight method, and then partition the tuples into a subset of complete tuples and several subsets of incomplete tuples based on their integrity rate, which is formulated with the weights of attributes and the missing degree of tuples. The model performs training iterations on the complete tuple subset. In each iteration, the trained model is used to impute the next missing subset, and the imputed subset is merged into the complete subset for training the next model. To improve the imputation accuracy, we leverage k-fold cross validation to correct errors. Besides imputing diverse types of missing data, extensive experimental results have shown that our model, RESI, significantly outperforms the state-of-the-art methods in the sensitivity to missing rate and accuracy of imputed data.
Highlights: Imputing only numerical or only categorical missing data is impractical. Region-splitting iterative framework (RESI) excels at completing mixed-type missing data. Mean integrity rate is defined to measure the whole missing degree of a dataset. The k-fold validation can alleviate the deviation caused by under- and overlearning.
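The pipeline outlined in the abstract (entropy weights, integrity rate, region splitting, iterative imputation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, the data is numeric-only and non-negative, the integrity rate is taken to be the weighted share of observed attributes, and a column-mean imputer stands in for the trained model. The k-fold error-correction step is omitted.

```python
import math


def entropy_weights(rows):
    """Entropy weight method on non-negative numeric columns.

    Only observed (non-None) values contribute. Assumption: values are
    already on a comparable, non-negative scale.
    """
    n_cols = len(rows[0])
    divergences = []
    for j in range(n_cols):
        col = [r[j] for r in rows if r[j] is not None]
        total = sum(col)
        if len(col) < 2 or total == 0:
            divergences.append(0.0)
            continue
        probs = [v / total for v in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(col))
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    if norm == 0:
        return [1.0 / n_cols] * n_cols  # fall back to equal weights
    return [d / norm for d in divergences]


def integrity_rate(row, weights):
    """Assumed definition: summed weight of the attributes that are observed."""
    return sum(w for v, w in zip(row, weights) if v is not None)


def split_regions(rows, weights, n_regions):
    """Separate complete tuples, then group incomplete ones by descending integrity rate."""
    complete = [list(r) for r in rows if all(v is not None for v in r)]
    incomplete = sorted(
        (r for r in rows if any(v is None for v in r)),
        key=lambda r: integrity_rate(r, weights),
        reverse=True,
    )
    size = max(1, math.ceil(len(incomplete) / n_regions))
    regions = [incomplete[i:i + size] for i in range(0, len(incomplete), size)]
    return complete, regions


def impute_iteratively(rows, n_regions=2):
    """Fit on the complete subset, impute the next region, merge, and repeat."""
    weights = entropy_weights(rows)
    complete, regions = split_regions(rows, weights, n_regions)
    if not complete:
        raise ValueError("at least one complete tuple is required")
    for region in regions:
        # Placeholder "model": column means of the current complete subset.
        means = [sum(col) / len(col) for col in zip(*complete)]
        filled = [[m if v is None else v for v, m in zip(r, means)] for r in region]
        complete.extend(filled)  # imputed tuples join the training subset
    return complete


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [None, 6.0], [5.0, None]]
    for row in impute_iteratively(data):
        print(row)
```

Processing the regions in descending order of integrity rate means the tuples with the least weighted missingness are imputed first, so each later and sparser region is imputed from a larger training subset.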
- Is Part Of:
- Expert systems with applications. Volume 168(2021)
- Journal:
- Expert systems with applications
- Issue:
- Volume 168(2021)
- Issue Display:
- Volume 168, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 168
- Issue:
- 2021
- Issue Sort Value:
- 2021-0168-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-04-15
- Subjects:
- Data mining -- Missing data imputation -- Region-splitting -- k-fold cross validation
Expert systems (Computer science) -- Periodicals
Systèmes experts (Informatique) -- Périodiques
Electronic journals
006.33
- Journal URLs:
- http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.eswa.2020.114425
- Languages:
- English
- ISSNs:
- 0957-4174
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3842.004220
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 15532.xml