Preliminary exploration of topic modelling representations for Electronic Health Records coding according to the International Classification of Diseases in Spanish. (15th October 2022)
- Record Type:
- Journal Article
- Title:
- Preliminary exploration of topic modelling representations for Electronic Health Records coding according to the International Classification of Diseases in Spanish. (15th October 2022)
- Main Title:
- Preliminary exploration of topic modelling representations for Electronic Health Records coding according to the International Classification of Diseases in Spanish
- Authors:
- Lebeña, Nuria
Blanco, Alberto
Pérez, Alicia
Casillas, Arantza
- Abstract:
- In this work, we cope with the classification of Electronic Health Records (EHR) in Spanish according to the International Classification of Diseases (ICD). We employ Topic Models representing each document as a probabilistic distribution over topics, offering a low-dimensional representation of documents. The trend is to turn to an embedding text representation, but these approaches require large amounts of textual data. We found Topic Models to be a suitable alternative approach to deal with the few resources available for Spanish clinical text mining. Besides, they are interpretable and aid explainability in artificial intelligence (XAI). We explored two different methods, known as Latent Dirichlet Allocation (LDA) and Partially Labelled Latent Dirichlet Allocation (PLDA), the supervised approach of the former. We assessed the results attained in Spanish with an analogous task in English as a reference. Evaluation methods were applied directly to the representation, with metrics to determine topic coherence and the relationship between topics and ICD labels. We learned that PLDA was able to discover topics associated with the ICD. This finding means that this representation itself can reveal ICD codes prior to classification. Also, this representation was used as predictive features to feed a conventional classifier to show their competence in a downstream task. We conclude that in a context with a lack of big data availability, PLDA emerges as a versatile candidate, able to offer a competitive representation of EHRs. While other works are primarily concerned with supervised categorization and do not pay attention to the representation, LDA and PLDA offer an interpretable approach that can be associated with ICDs. Moreover, compared with those that employ LDA, we demonstrate how its supervised version, PLDA, can be more intuitive as it shows a closer relation with the ICDs.
Highlights: Classification of Electronic Health Records in Spanish and English using XAI methods. Interpretable representation of the text as a fixed-length numerical vector. PLDA can discover topics associated with the ICD without a classifier. Experiments in Spanish and English to seize topic-coherence and ICD association. …
- Is Part Of:
- Expert systems with applications. Volume 204(2022)
- Journal:
- Expert systems with applications
- Issue:
- Volume 204(2022)
- Issue Display:
- Volume 204, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 204
- Issue:
- 2022
- Issue Sort Value:
- 2022-0204-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-10-15
- Subjects:
- Multi-label classification -- Document classification -- Electronic Health Records -- ICD classification -- Topic models -- Partially labelled Dirichlet allocation
Expert systems (Computer science) -- Periodicals
Systèmes experts (Informatique) -- Périodiques
Electronic journals
006.33
- Journal URLs:
- http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.eswa.2022.117303
- Languages:
- English
- ISSNs:
- 0957-4174
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3842.004220
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 21799.xml