Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study. (5th November 2020)

Record Type:: Journal Article
Title:: Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study. (5th November 2020)
Main Title:: Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study
Authors:: Tarmom, Taghreed
Teahan, William
Atwell, Eric
Alsalka, Mohammad Ammar
Editors:: Zampieri, Marcos
Nakov, Preslav
Abstract:: The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, and a traditional machine learning classifier sequential minimal optimization (SMO), implemented in Waikato Environment for Knowledge Analysis, working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching among the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved a higher accuracy rate than SMO with 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved a lower accuracy rate than SMO with 53.2% versus 60.2%. Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA as these use the same character …
Is Part Of:: Natural language engineering. Volume 26:Part 6(2020)
Journal:: Natural language engineering
Issue:: Volume 26:Part 6(2020)
Issue Display:: Volume 26, Issue 6, Part 6 (2020)
Year:: 2020
Volume:: 26
Issue:: 6
Part:: 6
Issue Sort Value:: 2020-0026-0006-0006
Page Start:: 663
Page End:: 676
Publication Date:: 2020-11-05
Subjects:: Arabic -- Corpus linguistics -- Language resources -- Machine learning -- Sublanguages and controlled languages -- Text segmentation
Natural language processing (Computer science) -- Periodicals
Software engineering -- Periodicals
006.35
Journal URLs:: http://journals.cambridge.org/action/displayJournal?jid=NLE
DOI:: 10.1017/S135132492000011X
Languages:: English
ISSNs:: 1351-3249
Deposit Type:: Legaldeposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library HMNTS - ELD Digital store
Ingest File:: 15569.xml