Linguistically-augmented perplexity-based data selection for language models. (July 2015)
- Record Type:
- Journal Article
- Title:
- Linguistically-augmented perplexity-based data selection for language models. (July 2015)
- Main Title:
- Linguistically-augmented perplexity-based data selection for language models
- Authors:
- Toral, Antonio
Pecina, Pavel
Wang, Longyue
van Genabith, Josef
- Abstract:
- Highlights: Word-level linguistic information for perplexity-based data selection. Evaluation and analysis for four languages: English, Spanish, Czech and Chinese. Combination of models leads to lower perplexity than the state-of-the-art baseline.
Abstract: This paper explores the use of linguistic information for the selection of data to train language models. We start from the state-of-the-art method in perplexity-based data selection and extend it to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge as well as the surface forms: (1) naïve selection of the top-ranked sentences selected by each method, and (2) linear interpolation of the datasets selected by the different methods. The paper presents detailed results and analysis for four languages with different levels of morphological complexity (English, Spanish, Czech and Chinese). The interpolation-based combination outperforms the purely statistical baseline in all scenarios, resulting in language models with lower perplexity. In relative terms the improvements are similar regardless of the language, with perplexity reductions in the range 7.72–13.02%. In absolute terms the reduction is higher for languages with high type-token ratio (Chinese, 202.16) or rich morphology (Czech, 81.53) and lower for the remaining languages, Spanish (55.2) and English (34.43 on the English side of the same parallel dataset as for Czech and 61.90 on the same parallel dataset as for Spanish).
- Is Part Of:
- Computer speech & language. Volume 32(2015)
- Journal:
- Computer speech & language
- Issue:
- Volume 32(2015)
- Issue Display:
- Volume 32, Issue 2015 (2015)
- Year:
- 2015
- Volume:
- 32
- Issue:
- 2015
- Issue Sort Value:
- 2015-0032-2015-0000
- Page Start:
- 11
- Page End:
- 26
- Publication Date:
- 2015-07
- Subjects:
- Data selection -- Language modelling -- Computational linguistics
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2014.10.002
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 5431.xml
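
Illustrative note on the abstract: the selection procedure it summarises can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the baseline ranks candidate sentences by cross-entropy difference between an in-domain and a general-domain language model (the abstract does not name the criterion), and it swaps in add-one smoothed unigram models for brevity. All function names are hypothetical. Each sentence is a list of tokens, which may be surface forms, lemmas, named entity categories or part-of-speech tags. Only the first combination method (naïve selection of the top-ranked sentences from each ranking) is shown.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (counts, total) for a corpus given as a list of token lists."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return -log_prob / len(sentence)


def rank_by_ce_difference(pool, in_domain, general):
    """Rank pool indices by H_in(s) - H_gen(s); lower scores come first.

    Assumption: cross-entropy difference is the selection criterion.
    """
    vocab = {tok for corpus in (pool, in_domain, general)
             for sent in corpus for tok in sent}
    v = len(vocab)
    m_in, m_gen = train_unigram(in_domain), train_unigram(general)
    scores = [cross_entropy(s, m_in, v) - cross_entropy(s, m_gen, v)
              for s in pool]
    return sorted(range(len(pool)), key=lambda i: scores[i])


def naive_combination(rankings, k):
    """Union of the top-k pool indices taken from each ranking."""
    selected = set()
    for ranking in rankings:
        selected.update(ranking[:k])
    return sorted(selected)
```

To use it, one would run rank_by_ce_difference once per representation of the same sentence pool (for example once on surface forms and once on lemmas) and pass the resulting rankings to naive_combination.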