UMLS-based data augmentation for natural language processing of clinical research literature. (23rd December 2020)
- Record Type:
- Journal Article
- Title:
- UMLS-based data augmentation for natural language processing of clinical research literature. (23rd December 2020)
- Main Title:
- UMLS-based data augmentation for natural language processing of clinical research literature
- Authors:
- Kang, Tian
Perotte, Adler
Tang, Youlan
Ta, Casey
Weng, Chunhua
- Abstract:
- Objective: The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. Materials and Methods: We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. Results: UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). Conclusions: This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
- Is Part Of:
- Journal of the American Medical Informatics Association. Volume 28:Number 4(2021)
- Journal:
- Journal of the American Medical Informatics Association
- Issue:
- Volume 28:Number 4(2021)
- Issue Display:
- Volume 28, Issue 4 (2021)
- Year:
- 2021
- Volume:
- 28
- Issue:
- 4
- Issue Sort Value:
- 2021-0028-0004-0000
- Page Start:
- 812
- Page End:
- 823
- Publication Date:
- 2020-12-23
- Subjects:
- Unified Medical Language System -- UMLS -- natural language processing -- NLP -- machine learning -- evidence based medicine -- named entity recognition -- data augmentation
Medical informatics -- Periodicals
Information Services -- Periodicals
Medical Informatics -- Periodicals
Médecine -- Informatique -- Périodiques
Informatica
Geneeskunde
Informatique médicale
Computer network resources
Electronic journals
610.285
- Journal URLs:
- http://jamia.bmj.com/
http://www.jamia.org
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=76
http://www.sciencedirect.com/science/journal/10675027
http://jamia.oxfordjournals.org/
http://www.oxfordjournals.org/en/
- DOI:
- 10.1093/jamia/ocaa309
- Languages:
- English
- ISSNs:
- 1067-5027
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4689.025000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 17160.xml