A new approach and gold standard toward author disambiguation in MEDLINE. (8th April 2019)
- Record Type:
- Journal Article
- Title:
- A new approach and gold standard toward author disambiguation in MEDLINE. (8th April 2019)
- Main Title:
- A new approach and gold standard toward author disambiguation in MEDLINE
- Authors:
- Vishnyakova, Dina
Rodriguez-Esteban, Raul
Rinaldi, Fabio - Abstract:
- Abstract: Objective: Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation. Materials and Methods: Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked to create a full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms that would not have been discoverable with previously existing gold standards. Results: We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms. Discussion and Conclusion: AnAbstract: Objective: Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation. Materials and Methods: Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked to create a full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms that would not have been discoverable with previously existing gold standards. Results: We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms. Discussion and Conclusion: An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/ … (more)
- Is Part Of:
- Journal of the American Medical Informatics Association. Volume 26:Number 10(2019)
- Journal:
- Journal of the American Medical Informatics Association
- Issue:
- Volume 26:Number 10(2019)
- Issue Display:
- Volume 26, Issue 10 (2019)
- Year:
- 2019
- Volume:
- 26
- Issue:
- 10
- Issue Sort Value:
- 2019-0026-0010-0000
- Page Start:
- 1037
- Page End:
- 1045
- Publication Date:
- 2019-04-08
- Subjects:
- author name disambiguation -- MEDLINE -- text mining -- gold standard -- machine learning
Medical informatics -- Periodicals
Information Services -- Periodicals
Medical Informatics -- Periodicals
Médecine -- Informatique -- Périodiques
Informatica
Geneeskunde
Informatique médicale
Computer network resources
Electronic journals
610.285 - Journal URLs:
- http://jamia.bmj.com/ ↗
http://www.jamia.org ↗
http://www.pubmedcentral.nih.gov/tocrender.fcgi?journal=76 ↗
http://www.sciencedirect.com/science/journal/10675027 ↗
http://jamia.oxfordjournals.org/ ↗
http://www.oxfordjournals.org/en/ ↗ - DOI:
- 10.1093/jamia/ocz028 ↗
- Languages:
- English
- ISSNs:
- 1067-5027
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - 4689.025000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store - Ingest File:
- 15078.xml