Fine-grained analysis of language varieties and demographics. (10th November 2020)
- Record Type:
- Journal Article
- Title:
- Fine-grained analysis of language varieties and demographics. (10th November 2020)
- Main Title:
- Fine-grained analysis of language varieties and demographics
- Authors:
- Rangel, Francisco
Rosso, Paolo
Zaghouani, Wajdi
Charfi, Anis
- Editors:
- Zampieri, Marcos
Nakov, Preslav
- Abstract:
- The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users and this could be useful in several domains beyond security and forensics such as marketing, for example. In this paper, we focus on a fine-grained analysis of language varieties while considering also the authors' demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with the best performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best performing team at PAN 2017. We also analyse the relationship of the language variety identification with the authors' gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authors' age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.
- Is Part Of:
- Natural language engineering. Volume 26:Part 6(2020)
- Journal:
- Natural language engineering
- Issue:
- Volume 26:Part 6(2020)
- Issue Display:
- Volume 26, Issue 6, Part 6 (2020)
- Year:
- 2020
- Volume:
- 26
- Issue:
- 6
- Part:
- 6
- Issue Sort Value:
- 2020-0026-0006-0006
- Page Start:
- 641
- Page End:
- 661
- Publication Date:
- 2020-11-10
- Subjects:
- Language variety identification, -- Demographics, -- Gender, -- Age, -- Author profiling, -- Cybersecurity, -- Arabic
Natural language processing (Computer science) -- Periodicals
Software engineering -- Periodicals
006.35
- Journal URLs:
- http://journals.cambridge.org/action/displayJournal?jid=NLE
- DOI:
- 10.1017/S1351324920000108
- Languages:
- English
- ISSNs:
- 1351-3249
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 15569.xml