AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling. Issue 4 (July 2022)
- Record Type:
- Journal Article
- Title:
- AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling. Issue 4 (July 2022)
- Main Title:
- AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling
- Authors:
- El Mekki, Abdellah
El Mahdaouy, Abdelkader
Berrada, Ismail
Khoumsi, Ahmed
- Abstract:
- Dialectal Arabic (DA) refers to varieties of everyday spoken languages in the Arab world. These dialects differ according to the country and region of the speaker, and their textual content is constantly growing with the rise of social media networks and web blogs. Although research on Natural Language Processing (NLP) on standard Arabic, namely Modern Standard Arabic (MSA), has witnessed remarkable progress, research efforts on DA are rather limited. This is due to numerous challenges, such as the scarcity of labeled data as well as the nature and structure of DA. While some recent works have reached decent results on several DA sentence classification tasks, other complex tasks, such as sequence labeling, still suffer from weak performance on DA varieties with either a limited amount of labeled data or unlabeled data only. Besides, it has been shown that zero-shot transfer learning from models trained on MSA does not perform well on DA. In this paper, we introduce AdaSL, a new unsupervised domain adaptation framework for Arabic multi-dialectal sequence labeling, leveraging unlabeled DA data, labeled MSA data, and existing multilingual and Arabic Pre-trained Language Models (PLMs). The proposed framework relies on four key components: (1) domain adaptive fine-tuning of multilingual/MSA language models on unlabeled DA data, (2) sub-word embedding pooling, (3) iterative self-training on unlabeled DA data, and (4) iterative DA and MSA distribution alignment. We evaluate our framework on multi-dialectal Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The overall results show that zero-shot transfer learning, using our proposed framework, boosts the performance of the multilingual PLMs by 40.87% in macro-F1 score for the NER task, while it boosts the accuracy by 6.95% for the POS tagging task. For the Arabic PLMs, our proposed framework increases performance by 16.18% macro-F1 for the NER task and 2.22% accuracy for the POS tagging task, thus achieving new state-of-the-art zero-shot transfer learning performance for Arabic multi-dialectal sequence labeling.
- Highlights:
- We present AdaSL, an unsupervised framework for dialectal Arabic sequence labeling. We introduce a sub-word pooling aggregator for the full representation of words. We apply AdaSL to multilingual and Arabic Transformer pre-trained language models. We validate AdaSL on Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. We achieve new SOTA zero-shot performance for dialectal Arabic sequence labeling.
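The sub-word pooling aggregator named in the highlights can be illustrated with a minimal sketch: transformer tokenizers often split a word into several sub-word pieces, so to label whole words the piece vectors must be aggregated back into one vector per word before the labeling layer. The function below is an illustrative mean-pooling variant under assumed inputs (a list of piece vectors plus a parallel word-index map); it is not the authors' implementation, and all names and toy data are assumptions.

```python
def mean_pool_subwords(subword_vectors, word_ids):
    """Average the sub-word piece vectors belonging to each word.

    subword_vectors: one vector (list of floats) per sub-word piece.
    word_ids: parallel list mapping each piece to its word index;
              None marks special tokens (e.g. [CLS]/[SEP]-style) to skip.
    Returns one pooled vector per word, in word order.
    """
    groups = {}
    for vec, wid in zip(subword_vectors, word_ids):
        if wid is None:  # ignore special tokens with no source word
            continue
        groups.setdefault(wid, []).append(vec)
    pooled = []
    for wid in sorted(groups):
        pieces = groups[wid]
        dim = len(pieces[0])
        # component-wise mean over the pieces of this word
        pooled.append([sum(v[d] for v in pieces) / len(pieces) for d in range(dim)])
    return pooled

# Toy example: three pieces covering two words, with 2-dim embeddings.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_ids = [0, 0, 1]  # the first two pieces belong to word 0
print(mean_pool_subwords(vectors, word_ids))  # [[2.0, 3.0], [5.0, 6.0]]
```

Mean pooling is only one possible aggregator; taking the first piece's vector or a max over pieces are common alternatives in sequence-labeling pipelines.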
- Is Part Of:
- Information processing & management. Volume 59, Issue 4 (2022)
- Journal:
- Information processing & management
- Issue:
- Volume 59, Issue 4 (2022)
- Issue Display:
- Volume 59, Issue 4 (2022)
- Year:
- 2022
- Volume:
- 59
- Issue:
- 4
- Issue Sort Value:
- 2022-0059-0004-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-07
- Subjects:
- Dialectal Arabic -- Arabic natural language processing -- Domain adaptation -- Multi-dialectal sequence labeling -- Named entity recognition -- Part-of-speech tagging -- Zero-shot transfer learning
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.ipm.2022.102964
- Languages:
- English
- ISSNs:
- 0306-4573
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 22245.xml