Using word embeddings to improve the privacy of clinical notes. (10th May 2020)
- Record Type:
- Journal Article
- Title:
- Using word embeddings to improve the privacy of clinical notes. (10th May 2020)
- Main Title:
- Using word embeddings to improve the privacy of clinical notes
- Authors:
- Abdalla, Mohamed
Abdalla, Moustafa
Rudzicz, Frank
Hirst, Graeme
- Abstract:
- Abstract: Objective: In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques). Materials and Methods: We employ a new "random replacement" paradigm (replacing each token in clinical notes with neighboring word vectors from the embedding space) to achieve 100% recall on the removal of sensitive information, unachievable with current "search-and-secure" paradigms. We demonstrate the utility of this paradigm on multiple corpora in a diverse set of classification tasks. Results: We empirically evaluate the effect of our anonymization technique both on upstream and downstream natural language processing tasks to show that our perturbations, while increasing security (ie, achieving 100% recall on any dataset), do not greatly impact the results of end-to-end machine learning approaches. Discussion: As long as current approaches utilize precision and recall to evaluate deidentification algorithms, there will remain a risk of overlooking sensitive information. Inspired by differential privacy, we sought to make it statistically infeasible to recreate the original data, although at the cost of readability. We hope that the work will serve as a catalyst to further research into alternative deidentification methods that can address current weaknesses. Conclusion: Our proposed technique can secure clinical texts at a low cost and extremely high recall with a readability trade-off while remaining useful for natural language processing classification tasks. We hope that our work can be used by risk-averse data holders to release clinical texts to researchers.
- Is Part Of:
- Journal of the American Medical Informatics Association. Volume 27:Number 6(2020)
- Journal:
- Journal of the American Medical Informatics Association
- Issue:
- Volume 27:Number 6(2020)
- Issue Display:
- Volume 27, Issue 6 (2020)
- Year:
- 2020
- Volume:
- 27
- Issue:
- 6
- Issue Sort Value:
- 2020-0027-0006-0000
- Page Start:
- 901
- Page End:
- 907
- Publication Date:
- 2020-05-10
- Subjects:
- privacy -- data anonymization -- natural language processing -- personal health records
Medical informatics -- Periodicals
Information Services -- Periodicals
Medical Informatics -- Periodicals
Médecine -- Informatique -- Périodiques
Informatica
Geneeskunde
Informatique médicale
Computer network resources
Electronic journals
610.285
- Journal URLs:
- http://jamia.bmj.com/
http://www.jamia.org
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=76
http://www.sciencedirect.com/science/journal/10675027
http://jamia.oxfordjournals.org/
http://www.oxfordjournals.org/en/
- DOI:
- 10.1093/jamia/ocaa038
- Languages:
- English
- ISSNs:
- 1067-5027
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4689.025000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 15710.xml