Privacy-enhancing ETL-processes for biomedical data. (June 2019)
- Record Type:
- Journal Article
- Title:
- Privacy-enhancing ETL-processes for biomedical data. (June 2019)
- Main Title:
- Privacy-enhancing ETL-processes for biomedical data
- Authors:
- Prasser, Fabian
Spengler, Helmut
Bild, Raffael
Eicher, Johanna
Kuhn, Klaus A.
- Abstract:
- Highlights: Anonymization is important in biomedical research, especially when data is pooled. We present an integration of formal anonymization methods into an ETL platform. Our plugin allows protecting data from complex threats with little configuration. Important properties of data are preserved and very large datasets can be handled. We show this by presenting results of an extensive experimental evaluation.
Abstract:
Background: Modern data-driven approaches to medical research require patient-level information at comprehensive depth and breadth. To create the required big datasets, information from disparate sources can be integrated into clinical and translational warehouses. This is typically implemented with Extract, Transform, Load (ETL) processes, which access, harmonize and upload data into the analytics platform.
Objective: Privacy-protection needs careful consideration when data is pooled or re-used for secondary purposes, and data anonymization is an important protection mechanism. However, common ETL environments do not support anonymization, and common anonymization tools cannot easily be integrated into ETL workflows. The objective of the work described in this article was to bridge this gap.
Methods: Our main design goals were (1) to base the anonymization process on expert-level risk assessment methodologies, (2) to use transformation methods which preserve both the truthfulness of data and its schematic properties (e.g. data types), (3) to implement a method which is easy to understand and intuitive to configure, and (4) to provide high scalability.
Results: We designed a novel and efficient anonymization process and implemented a plugin for the Pentaho Data Integration (PDI) platform, which enables integrating data anonymization and re-identification risk analyses directly into ETL workflows. By combining different instances into a single ETL process, data can be protected from multiple threats. The plugin supports very large datasets by leveraging the streaming-based processing model of the underlying platform. We present results of an extensive experimental evaluation and discuss successful applications.
Conclusions: Our work shows that expert-level anonymization methodologies can be integrated into ETL workflows. Our implementation is available under a non-restrictive open source license and it overcomes several limitations of other data anonymization tools.
- Is Part Of:
- International journal of medical informatics. Volume 126(2019)
- Journal:
- International journal of medical informatics
- Issue:
- Volume 126(2019)
- Issue Display:
- Volume 126, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 126
- Issue:
- 2019
- Issue Sort Value:
- 2019-0126-2019-0000
- Page Start:
- 72
- Page End:
- 81
- Publication Date:
- 2019-06
- Subjects:
- Clinical data warehousing -- Extract Transform Load -- Privacy -- Anonymization
Medical informatics -- Periodicals
Information science -- Periodicals
Computers -- Periodicals
Medical technology -- Periodicals
Medical Informatics -- Periodicals
Technology, Medical -- Periodicals
Computers
Information science
Medical informatics
Medical technology
Electronic journals
Periodicals
610.285
- Journal URLs:
- http://www.sciencedirect.com/science/journal/13865056
http://www.clinicalkey.com/dura/browse/journalIssue/13865056
http://www.clinicalkey.com.au/dura/browse/journalIssue/13865056
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ijmedinf.2019.03.006
- Languages:
- English
- ISSNs:
- 1386-5056
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4542.345250
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 10068.xml