Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification. Issue 1 (December 2016)
- Record Type:
- Journal Article
- Title:
- Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification. Issue 1 (December 2016)
- Main Title:
- Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification
- Authors:
- Mehryary, Farrokh
Kaewphan, Suwisa
Hakala, Kai
Ginter, Filip
- Abstract:
- Background: Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Methods: Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The top most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. Results: The method is evaluated on the official test set of BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 of potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can be thus used to improve the quality of any event extraction system or database. Availability: The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/
- Is Part Of:
- Journal of biomedical semantics. Volume 7:Issue 1(2016)
- Journal:
- Journal of biomedical semantics
- Issue:
- Volume 7:Issue 1(2016)
- Issue Display:
- Volume 7, Issue 1 (2016)
- Year:
- 2016
- Volume:
- 7
- Issue:
- 1
- Issue Sort Value:
- 2016-0007-0001-0000
- Page Start:
- 1
- Page End:
- 13
- Publication Date:
- 2016-12
- Subjects:
- BioNLP -- Event extraction -- Trigger detection -- Word embeddings
Semantics -- Periodicals
Medicine -- Research -- Periodicals
Biology -- Research -- Periodicals
Computer systems -- Periodicals
Bioinformatics -- Periodicals
570.285
- Journal URLs:
- http://www.jbiomedsem.com/
http://link.springer.com/
- DOI:
- 10.1186/s13326-016-0070-4
- Languages:
- English
- ISSNs:
- 2041-1480
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 10192.xml