A novel ensemble learning approach to unsupervised record linkage. (November 2017)
- Record Type:
- Journal Article
- Title:
- A novel ensemble learning approach to unsupervised record linkage. (November 2017)
- Main Title:
- A novel ensemble learning approach to unsupervised record linkage
- Authors:
- Jurek, Anna
Hong, Jun
Chi, Yuan
Liu, Weiru
- Abstract:
- Highlights: A novel unsupervised approach to record linkage has been proposed. The approach combines ensemble learning and automatic self learning. An ensemble of diverse self learning models is generated through application of different string similarity metrics schemes. Application of ensemble learning alleviates the problem of having to select the most suitable similarity metric scheme and improves the performance of an individual self learning model. The proposed method obtained comparable results with the supervised methods.
Abstract: Record linkage is a process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available hence manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset have been applied to record linkage. These techniques reduce the requirement on the manual labelling of the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. In order to further improve the automatic self-learning process we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on 4 publicly available datasets which are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques. In 3 out of 4 datasets it also achieves comparable results to those of the supervised approaches.
- Is Part Of:
- Information systems. Volume 71(2017)
- Journal:
- Information systems
- Issue:
- Volume 71(2017)
- Issue Display:
- Volume 71, Issue 2017 (2017)
- Year:
- 2017
- Volume:
- 71
- Issue:
- 2017
- Issue Sort Value:
- 2017-0071-2017-0000
- Page Start:
- 40
- Page End:
- 54
- Publication Date:
- 2017-11
- Subjects:
- Unsupervised record linkage -- Data matching -- Classification -- Ensemble learning
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2017.06.006
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11477.xml
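
The abstract above describes an ensemble of self-learning models, each built on a different string similarity scheme. The following is only a minimal sketch of that general idea, not the authors' method: the three similarity functions, the `hi`/`lo` seed thresholds, the nearest-centroid classifier, the majority vote, and all function names are assumptions made for illustration. Field weighting, the unsupervised diversity measure, and contribution ratios mentioned in the abstract are not modelled here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def seq_ratio(a, b):
    """Similarity from difflib's matching-blocks ratio, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def bigram_dice(a, b):
    """Dice coefficient over sets of character bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 1.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def self_learn(vectors, hi=0.8, lo=0.3, max_rounds=10):
    """Label each similarity vector True (match) or False (non-match).

    Seeds are chosen automatically: pairs whose mean similarity is at
    least `hi` start as matches, at most `lo` as non-matches (assumed
    thresholds). A nearest-centroid classifier is then retrained on its
    own predictions until the labels stop changing.
    """
    means = [sum(v) / len(v) for v in vectors]
    pos = [v for v, m in zip(vectors, means) if m >= hi]
    neg = [v for v, m in zip(vectors, means) if m <= lo]
    if not pos or not neg:
        # No seeds on one side: fall back to a plain midpoint threshold.
        return [m >= (hi + lo) / 2 for m in means]
    labels = None
    for _ in range(max_rounds):
        cp, cn = centroid(pos), centroid(neg)
        new = [sq_dist(v, cp) < sq_dist(v, cn) for v in vectors]
        if new == labels:
            break
        labels = new
        pos = [v for v, is_match in zip(vectors, labels) if is_match]
        neg = [v for v, is_match in zip(vectors, labels) if not is_match]
        if not pos or not neg:
            break
    return labels


def link(records, schemes=(seq_ratio, token_jaccard, bigram_dice)):
    """Return index pairs that a majority of the per-scheme models call a match."""
    pairs = list(combinations(range(len(records)), 2))
    votes = [0] * len(pairs)
    for sim in schemes:
        # One similarity value per field, using this scheme's measure.
        vectors = [[sim(x, y) for x, y in zip(records[i], records[j])]
                   for i, j in pairs]
        for k, is_match in enumerate(self_learn(vectors)):
            votes[k] += is_match
    return [p for p, v in zip(pairs, votes) if 2 * v > len(schemes)]
```

Calling `link` on a list of equal-length tuples of strings (for example, name and city fields) returns the pairs of record indices judged to refer to the same entity. Comparing every pair is quadratic in the number of records, so a real deployment would normally add a blocking step first.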