SwitchNet: Learning to switch for word-level language identification in code-mixed social media text. (3rd May 2022)
- Record Type:
- Journal Article
- Title:
- SwitchNet: Learning to switch for word-level language identification in code-mixed social media text. (3rd May 2022)
- Main Title:
- SwitchNet: Learning to switch for word-level language identification in code-mixed social media text
- Authors:
- Sarma, Neelakshi
Sanasam Singh, Ranbir
Goswami, Diganta
- Abstract:
- Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification show two important observations. First, the local context is an important indicator of the language of a word when a word is valid in multiple languages. Second, considering the word in isolation from its context leads to more effective language classification when a word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that makes use of a dynamic switching mechanism for effective language classification of both words that are borrowed or embedded from other languages as well as words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction either towards the prediction obtained by the contextual information or that obtained by the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extendible to newer languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, as well as an ensemble of the two classifiers.
- Is Part Of:
- Natural language engineering. Volume 28: Number 3 (2022)
- Journal:
- Natural language engineering
- Issue:
- Volume 28: Number 3 (2022)
- Issue Display:
- Volume 28, Issue 3 (2022)
- Year:
- 2022
- Volume:
- 28
- Issue:
- 3
- Issue Sort Value:
- 2022-0028-0003-0000
- Page Start:
- 337
- Page End:
- 359
- Publication Date:
- 2022-05-03
- Subjects:
- Language identification -- Code-mixing -- Social media text -- Multilingual
Natural language processing (Computer science) -- Periodicals
Software engineering -- Periodicals
006.35
- Journal URLs:
- http://journals.cambridge.org/action/displayJournal?jid=NLE
- DOI:
- 10.1017/S1351324921000115
- Languages:
- English
- ISSNs:
- 1351-3249
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 21231.xml