Semi-supervised encoding for outlier detection in clinical observation data. (November 2019)
- Record Type:
- Journal Article
- Title:
- Semi-supervised encoding for outlier detection in clinical observation data. (November 2019)
- Main Title:
- Semi-supervised encoding for outlier detection in clinical observation data
- Authors:
- Estiri, Hossein
Murphy, Shawn N.
- Abstract:
- Highlights: A semi-supervised encoding is proposed for outlier detection of clinical observation data. Semi-supervised encoders (super-encoders) outperformed autoencoders in outlier detection. Outlier detection performance decreases by adding depth and complexity to the Neural Network. At least one super-encoder had a Youden's J index higher than 0.9999 for all 30 observations. Adding 'age at observation' to 'observation value' as features improves outlier detection.
Abstract: Background and Objective: Electronic Health Record (EHR) data often include observation records that are unlikely to represent the "truth" about a patient at a given clinical encounter. Due to their high throughput, examples of such implausible observations are frequent in records of laboratory test results and vital signs. Outlier detection methods can offer low-cost solutions to flagging implausible EHR observations. This article evaluates the utility of a semi-supervised encoding approach (super-encoding) for constructing non-linear exemplar data distributions from EHR observation data and detecting non-conforming observations as outliers. Methods: Two hypotheses are tested using experimental design and non-parametric hypothesis testing procedures: (1) adding demographic features (e.g., age, gender, race/ethnicity) can increase precision in outlier detection, (2) sampling small subsets of the large EHR data can increase outlier detection by reducing noise-to-signal ratio. The experiments involved applying 492 encoder configurations (involving different input features, architectures, sampling ratios, and error margins) to a set of 30 datasets of EHR observations, including laboratory tests and vital sign records extracted from the Research Patient Data Registry (RPDR) from Partners HealthCare. Results: Results are obtained from 14,760 (30 × 492) encoders. The semi-supervised encoding approach (super-encoding) outperformed conventional autoencoders in outlier detection. Adding the age of the patient at the observation (encounter) to the baseline encoder, which included only the observation value as the input feature, slightly improved outlier detection. The top nine performing encoders are introduced. The best outlier detection performance was from a semi-supervised encoder, with observation value as the single feature and a single hidden layer, built on one percent of the data and one percent reconstruction error.
At least one encoder configuration had a Youden's J index higher than 0.9999 for all 30 observation types. Conclusion: Given the multiplicity of distributions for a single observation in EHR data (i.e., the same observation represented with different names or units), as well as the non-linearity of human observations, encoding holds great promise for outlier detection in large-scale data repositories. https://github.com/hestiri/superencoder
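The abstract reports performance as Youden's J index (sensitivity + specificity − 1) and refers to a reconstruction error margin. The following is a minimal illustrative sketch of those two ideas, not the authors' implementation: the function names `flag_outliers` and `youden_j` are hypothetical, the reconstructed values are assumed to come from some already-trained encoder, and treating the margin as a relative error threshold is an assumption the abstract does not confirm.

```python
from typing import List, Sequence


def flag_outliers(values: Sequence[float],
                  reconstructions: Sequence[float],
                  margin: float = 0.01) -> List[bool]:
    """Flag a value as an outlier when its relative reconstruction
    error exceeds `margin` (assumption: margin is a relative error)."""
    flags = []
    for v, r in zip(values, reconstructions):
        denom = abs(v) if v != 0 else 1.0  # avoid division by zero
        flags.append(abs(v - r) / denom > margin)
    return flags


def youden_j(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Youden's J = sensitivity + specificity - 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true outlier and one true inlier")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical body-temperature readings; the third is implausible.
values = [98.6, 99.1, 986.0, 97.9]
reconstructions = [98.5, 99.0, 99.0, 98.0]  # assumed encoder output
truth = [False, False, True, False]

flags = flag_outliers(values, reconstructions, margin=0.01)
score = youden_j(truth, flags)
```

A J of 1 means every implausible value was flagged and no plausible value was; a J of 0 means the flags are no better than chance.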
- Is Part Of:
- Computer methods and programs in biomedicine. Volume 181(2020)
- Journal:
- Computer methods and programs in biomedicine
- Issue:
- Volume 181(2020)
- Issue Display:
- Volume 181, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 181
- Issue:
- 2020
- Issue Sort Value:
- 2020-0181-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2019-11
- Subjects:
- Neural Networks -- Encoding -- Semi-supervised encoding -- Outlier detection -- Data quality -- Electronic Health Records
Medicine -- Computer programs -- Periodicals
Biology -- Computer programs -- Periodicals
Computers -- Periodicals
Medicine -- Periodicals
Médecine -- Logiciels -- Périodiques
Biologie -- Logiciels -- Périodiques
Biology -- Computer programs
Medicine -- Computer programs
Periodicals
Electronic journals
610.28
- Journal URLs:
- http://www.sciencedirect.com/science/journal/01692607
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.cmpb.2019.01.002
- Languages:
- English
- ISSNs:
- 0169-2607
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.095000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 12168.xml