Building a multi-domain comparable corpus using a learning to rank method. (15th June 2016)
- Record Type:
- Journal Article
- Title:
- Building a multi-domain comparable corpus using a learning to rank method. (15th June 2016)
- Main Title:
- Building a multi-domain comparable corpus using a learning to rank method
- Authors:
- RAHIMI, RAZIEH
SHAKERY, AZADEH
DADASHKARIMI, JAVID
ARIANNEZHAD, MOZHDEH
DEHGHANI, MOSTAFA
ESFAHANI, HOSSEIN NASR
- Editors:
- Rapp, Reinhard
Sharoff, Serge
Zweigenbaum, Pierre
- Abstract:
- Comparable corpora are key translation resources for both languages and domains with limited linguistic resources. The existing approaches for building comparable corpora are mostly based on ranking candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each evidence for similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus which covers twelve different domains obtained from Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both quality and coverage of alignments.
- Is Part Of:
- Natural language engineering. Volume 22:Part 4(2016)
- Journal:
- Natural language engineering
- Issue:
- Volume 22:Part 4(2016)
- Issue Display:
- Volume 22, Issue 4, Part 4 (2016)
- Year:
- 2016
- Volume:
- 22
- Issue:
- 4
- Part:
- 4
- Issue Sort Value:
- 2016-0022-0004-0004
- Page Start:
- 627
- Page End:
- 653
- Publication Date:
- 2016-06-15
- Subjects:
- Natural language processing (Computer science) -- Periodicals
Software engineering -- Periodicals
006.35
- Journal URLs:
- http://journals.cambridge.org/action/displayJournal?jid=NLE
- DOI:
- 10.1017/S1351324916000164
- Languages:
- English
- ISSNs:
- 1351-3249
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 14465.xml