Linguistically-augmented perplexity-based data selection for language models. (July 2015)

Record Type:: Journal Article
Title:: Linguistically-augmented perplexity-based data selection for language models. (July 2015)
Main Title:: Linguistically-augmented perplexity-based data selection for language models
Authors:: Toral, Antonio
Pecina, Pavel
Wang, Longyue
van Genabith, Josef
Abstract:: Highlights: Word-level linguistic information for perplexity-based data selection. Evaluation and analysis for four languages: English, Spanish, Czech and Chinese. Combination of models leads to lower perplexity than the state-of-the-art baseline. Abstract: This paper explores the use of linguistic information for the selection of data to train language models. We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge as well as the surface forms (1, naïve selection of the top ranked sentences selected by each method; 2, linear interpolation of the datasets selected by the different methods). The paper presents detailed results and analysis for four languages with different levels of morphologic complexity (English, Spanish, Czech and Chinese). The interpolation-based combination outperforms the purely statistical baseline in all the scenarios, resulting in language models with lower perplexity. In relative terms the improvements are similar regardless of the language, with perplexity reductions achieved in the range 7.72–13.02%. In absolute terms the reduction is higher for languages with high type-token ratio (Chinese, 202.16) or rich morphology (Czech, 81.53) and lower for the remaining languages, Spanish (55.2) …
Is Part Of:: Computer speech & language. Volume 32(2015)
Journal:: Computer speech & language
Issue:: Volume 32(2015)
Issue Display:: Volume 32, Issue 2015 (2015)
Year:: 2015
Volume:: 32
Issue:: 2015
Issue Sort Value:: 2015-0032-2015-0000
Page Start:: 11
Page End:: 26
Publication Date:: 2015-07
Subjects:: Data selection -- Language modelling -- Computational linguistics
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
Journal URLs:: http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
DOI:: 10.1016/j.csl.2014.10.002
Languages:: English
ISSNs:: 0885-2308
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 5431.xml