Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework. Issue 2 (March 2016)
- Record Type:
- Journal Article
- Title:
- Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework. Issue 2 (March 2016)
- Main Title:
- Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework
- Authors:
- Rahimi, Razieh
Shakery, Azadeh
King, Irwin
- Abstract:
- Highlights: Proposing a language modeling method to extract translations from comparable corpora. Comparing two similarity functions for deriving bilingual word correlations. Improving translation quality by integrating co-occurrence relations into word models. Comparing different estimations of translation probabilities from word correlations. Showing the significant impact of probability estimation methods on CLIR performance.
Abstract: A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects the retrieval performance. Among different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora, without requiring any additional linguistic resources, for the CLIR task. In the first step, translations are extracted by deriving correlations between source–target word pairs. These correlations are used to estimate word translation probabilities in the second step. We propose a language modeling approach for the first step, where modeling based on probability distribution provides two key advantages. First, our approach can be tuned easier in comparison with heuristically adjusted previous work. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an indication, we integrate monolingual relations of word co-occurrences into the process of translation extraction, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English–Persian comparable corpus show that our method outperforms the previous approaches in terms of both translation quality and the performance of CLIR. Indeed, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of word translation probabilities, estimated in the second step of our approach, on the performance of CLIR.
- Is Part Of:
- Information processing & management. Volume 52:Issue 2(2016:Mar.)
- Journal:
- Information processing & management
- Issue:
- Volume 52:Issue 2(2016:Mar.)
- Issue Display:
- Volume 52, Issue 2 (2016)
- Year:
- 2016
- Volume:
- 52
- Issue:
- 2
- Issue Sort Value:
- 2016-0052-0002-0000
- Page Start:
- 299
- Page End:
- 318
- Publication Date:
- 2016-03
- Subjects:
- Translation model -- Bilingual lexicon -- Comparable corpora -- Cross-Language Information Retrieval -- Language modeling framework
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2015.08.001
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 7616.xml