Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM. (January 2021)
- Record Type:
- Journal Article
- Title:
- Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM. (January 2021)
- Main Title:
- Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM
- Authors:
- AlKhwiter, Wasan
Al-Twairesh, Nora
- Abstract:
- Highlights: POS taggers are developed for MSA and GLF variants of the Arabic language using CRF and BiLSTM. The gold standard annotated datasets that have been constructed for POS tagging are made accessible to the research community. An exploratory analysis of the behavior of using hashtags in Arabic tweets is presented, and this can be leveraged in future studies. The POS tagger for Arabic tweets using the BiLSTM achieves the best performance. Experiments show that there is no need for a dialect-specific POS tagger.
Abstract: Over the past few years, Twitter has experienced massive growth and the volume of its online content has increased rapidly. This content has been a rich source for several studies that focused on natural language processing (NLP) research. However, Twitter data pose numerous challenges and obstacles to NLP tasks. For the English language, Twitter has an NLP tool that provides tweet-specific NLP tasks, which present significant opportunities for English NLP research and applications. Part-of-speech (POS) tagging for English tweets is one of the tasks that is offered and facilitated by such a tool. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, which is one of the NLP tasks that directly affects the performance of other subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the 'Mixed,' 'MSA,' and 'GLF' datasets with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of the behavior of using hashtags in Arabic tweets, which is a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers that are developed based on two approaches: Conditional Random Fields and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves state-of-the-art results for the 'Mixed' dataset with 96.5% accuracy. However, the dialect-specific taggers trained on the 'MSA' and 'GLF' datasets achieve accuracies of 95.6% and 95%, respectively. The results for the 'Mixed' dataset indicate the effectiveness of developing a joint POS tagger without the need for a dialect-specific POS tagger.
- Is Part Of:
- Computer speech & language. Volume 65(2021)
- Journal:
- Computer speech & language
- Issue:
- Volume 65(2021)
- Issue Display:
- Volume 65, Issue 2021 (2021)
- Year:
- 2021
- Volume:
- 65
- Issue:
- 2021
- Issue Sort Value:
- 2021-0065-2021-0000
- Page Start:
- Page End:
- Publication Date:
- 2021-01
- Subjects:
- Part-of-speech (POS) tagging -- Conditional random fields -- Bidirectional Long Short-Term Memory -- Arabic Tweets
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2020.101138
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 16886.xml