Imputation methods for high-dimensional mixed-type datasets by nearest neighbors. (August 2021)
- Record Type:
- Journal Article
- Title:
- Imputation methods for high-dimensional mixed-type datasets by nearest neighbors. (August 2021)
- Main Title:
- Imputation methods for high-dimensional mixed-type datasets by nearest neighbors
- Authors:
- Faisal, Shahla
Tutz, Gerhard
- Abstract:
- In modern biomedical research, the data often contain a large number of variables of mixed data types (continuous, multi-categorical, or binary) but on some variables observations are missing. Imputation is a common solution when the downstream analyses require a complete data matrix. Several imputation methods are available that work under specific distributional assumptions. We propose an improvement over the popular non-parametric nearest neighbor imputation method which requires no particular assumptions. The proposed method makes practical and effective use of the information on the association among the variables. In particular, we propose a weighted version of the L_q distance for mixed-type data, which uses the information from a subset of important variables only. The performance of the proposed method is investigated using a variety of simulated and real data from different areas of application. The results show that the proposed methods yield smaller imputation error and better performance when compared to other approaches. It is also shown that the proposed imputation method works efficiently even when the number of samples is smaller than the number of variables.
Highlights: Missing values are a common fact in modern medical research of complex diseases. Imputation becomes more challenging in high-dimensional mixed-type variables. We propose a new approach to impute missing values in high-dimensional mixed data. We have compared it with existing approaches for missing data imputation. Our method works efficiently in simulations as well as real dataset.
- Is Part Of:
- Computers in biology and medicine. Volume 135(2021)
- Journal:
- Computers in biology and medicine
- Issue:
- Volume 135(2021)
- Issue Display:
- Volume 135, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 135
- Issue:
- 2021
- Issue Sort Value:
- 2021-0135-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-08
- Subjects:
- Weighted nearest neighbors -- Mixed-type data -- Missing values -- High-dimensional data
Medicine -- Data processing -- Periodicals
Biology -- Data processing -- Periodicals
610.285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/00104825/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.compbiomed.2021.104577
- Languages:
- English
- ISSNs:
- 0010-4825
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.880000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 18878.xml