Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation. Issue 3 (14th October 2020)
- Record Type:
- Journal Article
- Title:
- Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation. Issue 3 (14th October 2020)
- Main Title:
- Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation
- Authors:
- Odisho, Anobel Y
Park, Briton
Altieri, Nicholas
DeNero, John
Cooperberg, Matthew R
Carroll, Peter R
Yu, Bin
- Abstract:
- Objective: Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and evaluate performance in low data settings. Materials and methods: Our data comes from the Urologic Outcomes Database at UCSF which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks, involving a wide range of pathologic features. To handle the diverse range of fields, we required 2 statistical models, a document classification method for pathologic features with a small set of possible values and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model's estimates of its likelihood of being correct. Results: Our best document classifier method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. The performance saturates as a function of dataset size with as few as 128 data points. Furthermore, while our document classifier methods have reliable uncertainty estimates, our extraction-based methods do not, but after isotonic calibration, expected calibration error drops to below 0.03 for all extraction fields. Conclusions: We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.
- Is Part Of:
- JAMIA open. Volume 3:Issue 3(2020)
- Journal:
- JAMIA open
- Issue:
- Volume 3:Issue 3(2020)
- Issue Display:
- Volume 3, Issue 3 (2020)
- Year:
- 2020
- Volume:
- 3
- Issue:
- 3
- Issue Sort Value:
- 2020-0003-0003-0000
- Page Start:
- 431
- Page End:
- 438
- Publication Date:
- 2020-10-14
- Subjects:
- pathology -- natural language processing -- information extraction -- cancer -- prostate cancer -- machine learning
Medical informatics -- Periodicals
610.285
- Journal URLs:
- http://www.oxfordjournals.org/
https://academic.oup.com/jamiaopen
- DOI:
- 10.1093/jamiaopen/ooaa029
- Languages:
- English
- ISSNs:
- 2574-2531
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 14851.xml