An expectation-maximization algorithm for query translation based on pseudo-relevant documents. Issue 2 (March 2017)
- Record Type:
- Journal Article
- Title:
- An expectation-maximization algorithm for query translation based on pseudo-relevant documents. Issue 2 (March 2017)
- Main Title:
- An expectation-maximization algorithm for query translation based on pseudo-relevant documents
- Authors:
- Dadashkarimi, Javid
Shakery, Azadeh
Faili, Heshaam
Zamani, Hamed
- Abstract:
- Highlights: A query translation method based on an expectation-maximization algorithm is proposed. The method (EM4QT) exploits pseudo-relevant documents in source and target languages. EM4QT extracts a number of hidden variables for each translation pair. EM4QT employs an expectation-maximization algorithm for estimating the parameters. EM4QT outperforms competitive baselines in cross-language information retrieval.
Abstract: Query translation in cross-language information retrieval (CLIR) can be done by employing dictionaries, aligned corpora, or machine translators. The scarcity of aligned corpora for various domains in many language pairs increases the importance of dictionary-based CLIR, which motivates us to use only a bilingual dictionary and two independent collections in the source and target languages for query translation. We exploit pseudo-relevant documents for a given query in the source language and pseudo-relevant documents for a translation of the query in the target language with a proposed expectation-maximization algorithm for improving query translation. The proposed method (called EM4QT) assumes that each target term either is translated from the source pseudo-relevant documents or has come from a noisy collection. Since EM4QT does not directly consider term coherency, which is defined as fluency of the target translation, we investigate a crucial question: can EM4QT be improved using either coherency-based methods or token-to-token translation ones? To address this question, we combine different translation models via simple linear interpolation and a proposed divergence minimization method. Evaluations over four CLEF collections in Persian, French, Spanish, and German indicate that EM4QT significantly outperforms competitive baselines in all the collections. Our experiments also reveal that, since EM4QT indirectly considers term coherency, combining the method with coherency-based models cannot significantly improve the retrieval performance. On the other hand, the query-by-query results support the view that EM4QT usually gives a relatively high weight to one translation; combining it with the proposed token-to-token translation model, which is obtained by running EM4QT for each query term separately, mitigates this effect and yields better results for many queries. Comparing the method with a competitive word-embedding baseline reveals the superiority of the proposed model.
- Is Part Of:
- Information processing & management. Volume 53:Issue 2(2017:Mar.)
- Journal:
- Information processing & management
- Issue:
- Volume 53:Issue 2(2017:Mar.)
- Issue Display:
- Volume 53, Issue 2 (2017)
- Year:
- 2017
- Volume:
- 53
- Issue:
- 2
- Issue Sort Value:
- 2017-0053-0002-0000
- Page Start:
- 371
- Page End:
- 387
- Publication Date:
- 2017-03
- Subjects:
- Dictionary-based cross-language information retrieval -- Query translation -- Expectation maximization -- Pseudo-relevant documents
00-01 -- 99-00
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2016.11.007
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 639.xml
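
Illustrative note on the abstract: the abstract describes two ideas that a short example may make more concrete, namely that each target term is treated as coming either from a topical source or from a noisy collection, and that translation models can be combined by simple linear interpolation. The sketch below is not EM4QT and is not taken from the article. It is a generic two-component mixture fitted with expectation maximization, plus a plain interpolation helper. The function names `mixture_em` and `interpolate`, the fixed `noise_weight`, and the toy counts are all assumptions made for illustration only.

```python
def mixture_em(counts, background, noise_weight=0.5, iterations=50):
    """Fit a topical term distribution with EM (hypothetical helper).

    Each term occurrence is assumed to be drawn either from the topical
    model (weight 1 - noise_weight) or from a fixed background model
    (weight noise_weight). Only the topical model is re-estimated.
    """
    vocab = list(counts)
    total = sum(counts.values())
    # Initialise the topical model with relative frequencies.
    topic = {w: counts[w] / total for w in vocab}
    for _ in range(iterations):
        # E-step: posterior probability that an occurrence of w is topical.
        posterior = {}
        for w in vocab:
            t = (1.0 - noise_weight) * topic[w]
            b = noise_weight * background.get(w, 1e-9)
            posterior[w] = t / (t + b) if (t + b) > 0 else 0.0
        # M-step: renormalise the expected topical counts.
        expected = {w: counts[w] * posterior[w] for w in vocab}
        norm = sum(expected.values())
        if norm == 0:
            break
        topic = {w: expected[w] / norm for w in vocab}
    return topic


def interpolate(p, q, alpha=0.5):
    """Linear interpolation alpha * p + (1 - alpha) * q (hypothetical helper)."""
    keys = set(p) | set(q)
    return {k: alpha * p.get(k, 0.0) + (1.0 - alpha) * q.get(k, 0.0) for k in keys}


if __name__ == "__main__":
    # Toy data, invented for this example.
    counts = {"bank": 10, "river": 2, "the": 30}
    background = {"the": 0.6, "bank": 0.05, "river": 0.01}
    topic = mixture_em(counts, background)
    for term, prob in sorted(topic.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{prob:.3f}")
```

In this toy setting, a term that is frequent in the background model (here "the") is partly explained by the noise component, so the topical model gives it less weight than its raw relative frequency. `interpolate` returns a proper distribution whenever both inputs are distributions and `alpha` lies in [0, 1].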