A machine learning based approach to identify protected health information in Chinese clinical text. (August 2018)
- Record Type:
- Journal Article
- Title:
- A machine learning based approach to identify protected health information in Chinese clinical text. (August 2018)
- Main Title:
- A machine learning based approach to identify protected health information in Chinese clinical text
- Authors:
- Du, Liting
Xia, Chenxi
Deng, Zhaohua
Lu, Gary
Xia, Shuxu
Ma, Jingdong
- Abstract:
- Highlights: Chinese clinical text contains a high volume of diverse PHI, which is unevenly distributed across various medical institutions. We trained and tested a machine learning based approach to identify PHI in Chinese clinical corpus with double annotated PHI tags. The CRF based approach has achieved performance as high as 98.78% measured by F-score when applied to Chinese clinical text.
Abstract:
Background: With the increasing use of electronic health records (EHRs) worldwide, protecting private information in clinical text has drawn extensive attention from healthcare providers and researchers alike. De-identification, the process of identifying and removing protected health information (PHI) from clinical text, has been central to the discourse on medical privacy since 2006. While de-identification is becoming the global norm for handling medical records, there is a paucity of studies on its application to Chinese clinical text. Without efficient and effective privacy protection algorithms in place, the use of indispensable clinical information would be restricted.
Objectives: We aimed to (i) describe the current process for PHI in China, (ii) propose a machine learning based approach to identify PHI in Chinese clinical text, and (iii) validate the effectiveness of the machine learning algorithm for de-identification in Chinese clinical text.
Methods: Based on 14,719 discharge summaries from regional health centers in Ya'an City, Sichuan Province, China, we built a conditional random fields (CRF) model to identify PHI in clinical text, and then used regular expressions to improve the recognition results for the PHI categories with fewer samples.
Results: We constructed a Chinese clinical text corpus with PHI tags through substantial manual annotation; descriptive statistics of the corpus showed that PHI is wide-ranging and falls into diverse categories. With an F-measure of 0.9878, the evaluation showed that our CRF-based model performed well in identifying PHI in Chinese clinical text.
Conclusion: The rapid adoption of EHRs in the health sector has created an urgent need for tools that can parse patient-specific information from Chinese clinical text. Our application of CRF algorithms for de-identification has shown the potential to meet this need by offering a highly accurate and flexible solution for analyzing Chinese clinical text.
- Is Part Of:
- International journal of medical informatics. Volume 116(2018)
- Journal:
- International journal of medical informatics
- Issue:
- Volume 116(2018)
- Issue Display:
- Volume 116, Issue 2018 (2018)
- Year:
- 2018
- Volume:
- 116
- Issue:
- 2018
- Issue Sort Value:
- 2018-0116-2018-0000
- Page Start:
- 24
- Page End:
- 32
- Publication Date:
- 2018-08
- Subjects:
- Protected health information -- De-identification -- Electronic health records -- Conditional random fields
Medical informatics -- Periodicals
Information science -- Periodicals
Computers -- Periodicals
Medical technology -- Periodicals
Medical Informatics -- Periodicals
Technology, Medical -- Periodicals
Computers
Information science
Medical informatics
Medical technology
Electronic journals
Periodicals
Electronic journals
610.285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/13865056
http://www.clinicalkey.com/dura/browse/journalIssue/13865056
http://www.clinicalkey.com.au/dura/browse/journalIssue/13865056
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ijmedinf.2018.05.010
- Languages:
- English
- ISSNs:
- 1386-5056
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4542.345250
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 9438.xml
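
The abstract's Methods section describes a two-stage design: a CRF model tags PHI, and regular expressions then improve recognition for the PHI categories with fewer samples. The record does not give the actual rules, so the following is only a minimal sketch of how such a rule pass might supplement a statistical tagger. The category names (`DATE`, `ID_NUMBER`, `PHONE`), the patterns themselves, and the helper functions `regex_spans`, `merge_spans`, and `redact` are all assumptions made for illustration, not the authors' implementation. The CRF output is represented here simply as a list of character spans.

```python
import re
from typing import List, Tuple

# A span is (start, end, label) over character offsets, end exclusive.
Span = Tuple[int, int, str]

# Hypothetical patterns for illustration only; the paper's real rules
# are not given in this record.
PHI_PATTERNS = {
    # Dates such as 2016年3月5日 or 2016-03-05
    "DATE": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{1,2}-\d{1,2}"),
    # 18-character identity numbers (17 digits plus a digit or X)
    "ID_NUMBER": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # 11-digit mobile numbers beginning with 1
    "PHONE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def regex_spans(text: str) -> List[Span]:
    """Return every rule-based match in the text as a sorted list of spans."""
    spans: List[Span] = []
    for label, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def merge_spans(model_spans: List[Span], rule_spans: List[Span]) -> List[Span]:
    """Keep all model spans; add rule spans that overlap none of them."""
    merged = list(model_spans)
    for start, end, label in rule_spans:
        if all(end <= m_start or start >= m_end for m_start, m_end, _ in model_spans):
            merged.append((start, end, label))
    return sorted(merged)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a bracketed label, skipping overlapping spans."""
    parts: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # overlaps a span that was already replaced
        parts.append(text[cursor:start])
        parts.append(f"[{label}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


if __name__ == "__main__":
    sample = "患者于2016年3月5日入院，联系电话13812345678。"
    # Pretend a statistical tagger found the date but missed the phone number.
    model_output = [(3, 12, "DATE")]
    combined = merge_spans(model_output, regex_spans(sample))
    print(redact(sample, combined))
```

In this sketch the model's spans take precedence and the rules only fill gaps, which is one plausible way to combine the two stages; a real system would need patterns validated against annotated data.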