Combining n-grams and deep convolutional features for language variety classification. (18th July 2019)
- Record Type:
- Journal Article
- Title:
- Combining n-grams and deep convolutional features for language variety classification. (18th July 2019)
- Main Title:
- Combining n-grams and deep convolutional features for language variety classification
- Authors:
- Martinc, Matej
Pollak, Senja - Abstract:
- Abstract: This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of- n -grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easilyAbstract: This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of- n -grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easily available also to the non-programming members of the research community. … (more)
- Is Part Of:
- Natural language engineering. Volume 25:Part 5(2019)
- Journal:
- Natural language engineering
- Issue:
- Volume 25:Part 5(2019)
- Issue Display:
- Volume 25, Issue 5, Part 5 (2019)
- Year:
- 2019
- Volume:
- 25
- Issue:
- 5
- Part:
- 5
- Issue Sort Value:
- 2019-0025-0005-0005
- Page Start:
- 607
- Page End:
- 632
- Publication Date:
- 2019-07-18
- Subjects:
- language variety, -- author profiling, -- text classification, -- convolutional neural network, -- bag-of-n-grams
Natural language processing (Computer science) -- Periodicals
Software engineering -- Periodicals
006.35 - Journal URLs:
- http://journals.cambridge.org/action/displayJournal?jid=NLE ↗
- DOI:
- 10.1017/S1351324919000299 ↗
- Languages:
- English
- ISSNs:
- 1351-3249
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 11658.xml