Automatic Language Identification in Code-Switched Hindi-English Social Media Text. (25th June 2021)
- Record Type:
- Journal Article
- Title:
- Automatic Language Identification in Code-Switched Hindi-English Social Media Text. (25th June 2021)
- Main Title:
- Automatic Language Identification in Code-Switched Hindi-English Social Media Text
- Authors:
- Nguyen, Li
Bryant, Christopher
Kidwai, Sana
Biberauer, Theresa
- Abstract:
- Natural Language Processing (NLP) tools typically struggle to process code-switched data and so linguists are commonly forced to annotate such data manually. As this data becomes more readily available, automatic tools are increasingly needed to help speed up the annotation process and improve consistency. Last year, such a toolkit was developed to semi-automatically annotate transcribed bilingual code-switched Vietnamese-English speech data with token-based language information and POS tags (hereafter the CanVEC toolkit, L. Nguyen & Bryant, 2020). In this work, we extend this methodology to another language pair, Hindi-English, to explore the extent to which we can standardise the automation process. Specifically, we applied the principles behind the CanVEC toolkit to data from the International Conference on Natural Language Processing (ICON) 2016 shared task, which consists of social media posts (Facebook, Twitter and WhatsApp) that have been annotated with language and POS tags (Molina et al., 2016). We used the ICON-2016 annotations as the gold-standard labels in the language identification task. Ultimately, our tool achieved an F1 score of 87.99% on the ICON-2016 data. We then evaluated the first 500 tokens of each social media subset manually, and found almost 40% of all errors were caused entirely by problems with the gold-standard, i.e., our system was correct. It is thus likely that the overall accuracy of our system is higher than reported. This shows great potential for effectively automating the annotation of code-switched corpora, on different language combinations, and in different genres. We finally discuss some limitations of our approach and release our code and human evaluation together with this paper.
- Is Part Of:
- Journal of open humanities data. Volume 7(2021)
- Journal:
- Journal of open humanities data
- Issue:
- Volume 7(2021)
- Issue Display:
- Volume 7, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 7
- Issue:
- 2021
- Issue Sort Value:
- 2021-0007-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-06-25
- Subjects:
- code-switching -- language identification -- automatic annotation -- Hindi -- English -- Vietnamese
Humanities -- Periodicals
001.3
- Journal URLs:
- http://openhumanitiesdata.metajnl.com/
- DOI:
- 10.5334/johd.44
- Languages:
- English
- ISSNs:
- 2059-481X
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 16205.xml