Novel textual features for language modeling of intra-sentential code-switching data. (November 2020)
- Record Type:
- Journal Article
- Title:
- Novel textual features for language modeling of intra-sentential code-switching data. (November 2020)
- Main Title:
- Novel textual features for language modeling of intra-sentential code-switching data
- Authors:
- Ganji, Sreeram
Dhawan, Kunal
Sinha, Rohit
- Abstract:
- Highlights: An improved parts-of-speech labeling scheme for code-switching data is proposed. A novel code-switching location feature is proposed to locate code-switch instances. Evaluation is done on Hindi-English and Mandarin-English code-switching datasets.
Abstract: Code-switching refers to the frequent use of non-native language words/phrases by speakers while conversing in their native languages. Traditionally, for training a language model (LM) for code-switching data, one is required to tediously collect a large text corpus in the respective code-switching domain. Alternatively, we recently proposed a more viable approach that adapts an existing native LM to handle the code-switching data. In this work, we present our efforts for language modeling of code-switching data following both the traditional and the proposed approaches. The salient contributions of this paper include: (i) creation of the Hindi-English code-switching text corpus, (ii) an improved parts-of-speech (POS) labeling scheme for accurate tagging of non-native words embedded in the code-switching data, and (iii) the proposal of a novel textual feature, referred to as the code-switching location (CSL) feature, that allows LMs to predict the code-switching instances. The evaluation of the proposed features has been done on two code-switching datasets: Hindi-English and Mandarin-English. On experimental evaluation, a substantial reduction in perplexity is achieved with the use of the improved POS features. It is also observed that the proposed CSL features provide an independent and additive improvement over the POS features in terms of perplexity.
- Is Part Of:
- Computer speech & language. Volume 64(2020)
- Journal:
- Computer speech & language
- Issue:
- Volume 64(2020)
- Issue Display:
- Volume 64, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 64
- Issue:
- 2020
- Issue Sort Value:
- 2020-0064-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-11
- Subjects:
- Code-switching -- Textual features -- Factored language modeling
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2020.101099
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13392.xml