Development and application of a high throughput natural language processing architecture to convert all clinical documents in a clinical data warehouse into standardized medical vocabularies. (30th May 2019)
- Record Type:
- Journal Article
- Title:
- Development and application of a high throughput natural language processing architecture to convert all clinical documents in a clinical data warehouse into standardized medical vocabularies. (30th May 2019)
- Main Title:
- Development and application of a high throughput natural language processing architecture to convert all clinical documents in a clinical data warehouse into standardized medical vocabularies
- Authors:
- Afshar, Majid
Dligach, Dmitriy
Sharma, Brihat
Cai, Xiaoyuan
Boyda, Jason
Birch, Steven
Valdez, Daniel
Zelisko, Suzan
Joyce, Cara
Modave, François
Price, Ron - Abstract:
- Abstract: Objective: Natural language processing (NLP) engines such as the clinical Text Analysis and Knowledge Extraction System are a solution for processing notes for research, but optimizing their performance for a clinical data warehouse remains a challenge. We aim to develop a high throughput NLP architecture using the clinical Text Analysis and Knowledge Extraction System and present a predictive model use case. Materials and Methods: The CDW was comprised of 1 103 038 patients across 10 years. The architecture was constructed using the Hadoop data repository for source data and 3 large-scale symmetric processing servers for NLP. Each named entity mention in a clinical document was mapped to the Unified Medical Language System concept unique identifier (CUI). Results: The NLP architecture processed 83 867 802 clinical documents in 13.33 days and produced 37 721 886 606 CUIs across 8 standardized medical vocabularies. Performance of the architecture exceeded 500 000 documents per hour across 30 parallel instances of the clinical Text Analysis and Knowledge Extraction System including 10 instances dedicated to documents greater than 20 000 bytes. In a use–case example for predicting 30-day hospital readmission, a CUI-based model had similar discrimination to n-grams with an area under the curve receiver operating characteristic of 0.75 (95% CI, 0.74–0.76). Discussion and Conclusion: Our health system's high throughput NLP architecture may serve as a benchmark forAbstract: Objective: Natural language processing (NLP) engines such as the clinical Text Analysis and Knowledge Extraction System are a solution for processing notes for research, but optimizing their performance for a clinical data warehouse remains a challenge. We aim to develop a high throughput NLP architecture using the clinical Text Analysis and Knowledge Extraction System and present a predictive model use case. Materials and Methods: The CDW was comprised of 1 103 038 patients across 10 years. The architecture was constructed using the Hadoop data repository for source data and 3 large-scale symmetric processing servers for NLP. Each named entity mention in a clinical document was mapped to the Unified Medical Language System concept unique identifier (CUI). Results: The NLP architecture processed 83 867 802 clinical documents in 13.33 days and produced 37 721 886 606 CUIs across 8 standardized medical vocabularies. Performance of the architecture exceeded 500 000 documents per hour across 30 parallel instances of the clinical Text Analysis and Knowledge Extraction System including 10 instances dedicated to documents greater than 20 000 bytes. In a use–case example for predicting 30-day hospital readmission, a CUI-based model had similar discrimination to n-grams with an area under the curve receiver operating characteristic of 0.75 (95% CI, 0.74–0.76). Discussion and Conclusion: Our health system's high throughput NLP architecture may serve as a benchmark for large-scale clinical research using a CUI-based approach. … (more)
- Is Part Of:
- Journal of the American Medical Informatics Association. Volume 26:Number 11(2019)
- Journal:
- Journal of the American Medical Informatics Association
- Issue:
- Volume 26:Number 11(2019)
- Issue Display:
- Volume 26, Issue 11 (2019)
- Year:
- 2019
- Volume:
- 26
- Issue:
- 11
- Issue Sort Value:
- 2019-0026-0011-0000
- Page Start:
- 1364
- Page End:
- 1369
- Publication Date:
- 2019-05-30
- Subjects:
- natural language processing -- unstructured data -- clinical text and knowledge extraction system -- data architecture -- unified medical language system
Medical informatics -- Periodicals
Information Services -- Periodicals
Medical Informatics -- Periodicals
Médecine -- Informatique -- Périodiques
Informatica
Geneeskunde
Informatique médicale
Computer network resources
Electronic journals
610.285 - Journal URLs:
- http://jamia.bmj.com/ ↗
http://www.jamia.org ↗
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=76 ↗
http://www.sciencedirect.com/science/journal/10675027 ↗
http://jamia.oxfordjournals.org/ ↗
http://www.oxfordjournals.org/en/ ↗ - DOI:
- 10.1093/jamia/ocz068 ↗
- Languages:
- English
- ISSNs:
- 1067-5027
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 4689.025000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store - Ingest File:
- 21985.xml