An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis. Issue 6 (November 2020)

Record Type:: Journal Article
Title:: An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis. Issue 6 (November 2020)
Main Title:: An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis
Authors:: Mehmood, Khawar
Essam, Daryl
Shafi, Kamran
Malik, Muhammad Kamran
Abstract:: Text normalization is the task of transforming lexically variant words to their canonical forms. The importance of text normalization becomes apparent while developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN utilizes the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words to their canonical forms. It consists of three interlinked modules: transliteration based encoder, filter module and hash code ranker. The encoder generates all possible hash-codes for a single Roman Hindi/Urdu word. The next component filters the irrelevant codes, while the third module ranks the filtered hash-codes based on their relevance. The aim of this study is not only to normalize the text but also to examine its impact on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate from the baseline. TERUN was then enhanced from a corpus specific to a corpus independent text normalization technique. To this end, a parallel …
Is Part Of:: Information processing & management. Volume 57:Issue 6(2020:Nov.)
Journal:: Information processing & management
Issue:: Volume 57:Issue 6(2020:Nov.)
Issue Display:: Volume 57, Issue 6 (2020)
Year:: 2020
Volume:: 57
Issue:: 6
Issue Sort Value:: 2020-0057-0006-0000
Page Start:
Page End:
Publication Date:: 2020-11
Subjects:: Machine learning -- Natural language processing -- Pattern recognition -- Sentiment analysis
Information storage and retrieval systems -- Periodicals
Information science -- Periodicals
Systèmes d'information -- Périodiques
Sciences de l'information -- Périodiques
Information science
Information storage and retrieval systems
Periodicals
658.4038
Journal URLs:: http://www.sciencedirect.com/science/journal/03064573
http://www.elsevier.com/journals
DOI:: 10.1016/j.ipm.2020.102368
Languages:: English
ISSNs:: 0306-4573
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 4493.893000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 14754.xml