A data cleaning method for heterogeneous attribute fusion and record linkage. (5th August 2019)
- Record Type:
- Journal Article
- Title:
- A data cleaning method for heterogeneous attribute fusion and record linkage. (5th August 2019)
- Main Title:
- A data cleaning method for heterogeneous attribute fusion and record linkage
- Authors:
- Zhu, Hui-Juan
Jiang, Tong-Hai
Wang, Yi
Cheng, Li
Ma, Bo
Zhao, Fan
- Abstract:
- In the big data era, massive heterogeneous data are generated from various data sources, and cleaning dirty data is critical for reliable data analysis. Existing rule-based methods are generally developed for a single data source environment, and issues such as data standardisation and duplicate detection for attributes of different data types are not fully studied. To address these challenges, we first introduce a method based on dynamically configurable rules that integrates data detection, modification and transformation. Second, we propose type-based blocking and a varying window size selection mechanism based on the classic sorted-neighbourhood algorithm. We present a reference implementation of our method in a real-life data fusion system and validate its effectiveness and efficiency using recall and precision metrics. Experimental results indicate that our method is suitable for scenarios involving multiple data sources with heterogeneous attribute properties.
- Is Part Of:
- International journal of computational science and engineering. Volume 19:Number 3(2019)
- Journal:
- International journal of computational science and engineering
- Issue:
- Volume 19:Number 3(2019)
- Issue Display:
- Volume 19, Issue 3 (2019)
- Year:
- 2019
- Volume:
- 19
- Issue:
- 3
- Issue Sort Value:
- 2019-0019-0003-0000
- Page Start:
- 311
- Page End:
- 324
- Publication Date:
- 2019-08-05
- Subjects:
- big data -- varying window -- data cleaning -- record linkage -- record similarity -- SNM -- type-based blocking
Computer science -- Mathematics -- Periodicals
Computer simulation -- Mathematical aspects -- Periodicals
Computational intelligence -- Periodicals
004.015105
- Journal URLs:
- http://www.inderscience.com/jhome.php?jcode=ijcse
http://www.inderscience.com/
- Languages:
- English
- ISSNs:
- 1742-7185
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 11262.xml