An analysis of machine learning models for sentiment analysis of Tamil code-mixed data. (November 2022)
- Record Type:
- Journal Article
- Title:
- An analysis of machine learning models for sentiment analysis of Tamil code-mixed data. (November 2022)
- Main Title:
- An analysis of machine learning models for sentiment analysis of Tamil code-mixed data
- Authors:
- Shanmugavadivel, Kogilavani
Sampath, Sai Haritha
Nandhakumar, Pramod
Mahalingam, Prasath
Subramanian, Malliga
Kumaresan, Prasanna Kumar
Priyadharshini, Ruba
- Abstract:
- Nowadays, more and more people are sharing and expressing their feelings through social media platforms such as Twitter, Facebook and YouTube. Sentiment analysis is a process that explores, identifies and categorizes content. People who belong to multilingual communities tend to communicate through multiple regional languages. Text of this type, written in a mixture of languages, is known as code-mixed data.
The proposed system utilizes a code-mixed data set of Tamil–English languages from FIRE 2021. To handle the class imbalance problem, re-sampling is performed and its impact is analyzed. Pre-processing of input text data can play a vital role in code-mixed data classification by removing unnecessary content. This research work aims to explore the impact of pre-processing on Tamil code-mixed data by employing various pre-processing steps such as emoji removal; repeated character removal; and punctuation, symbol and number removal. The pre-processed text is applied to traditional machine learning, deep learning, transfer learning and hybrid deep learning models, and the accuracy of all these models before and after pre-processing is compared. Traditional machine learning models depend on various weighting schemes for the feature selection process. The main objective of this research work is to build hybrid deep learning models combining a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM) and a CNN with Bidirectional Long Short-Term Memory (BiLSTM) in order to capture the local and global features implicitly from the code-mixed data for conducting sentiment analysis, and then to classify the Tamil code-mixed data into positive, negative, mixed_feelings and unknown_state. The performance of the hybrid deep learning models was evaluated by comparing them with state-of-the-art methods that include various traditional machine learning techniques such as random forest, multinomial Naive Bayes, logistic regression and linear Support Vector Classification (SVC); deep learning techniques such as LSTM, BiLSTM, BiGRU (Bidirectional Gated Recurrent Unit) and CNN; and a transfer learning method, IndicBERT. This research work also summarizes the precision, recall, F1-score, accuracy, macro-average, weighted-average and confusion matrix for all mentioned models.
The results indicate that, among all the models employed, the hybrid deep learning models perform better, especially the CNN+BiLSTM model, which reaches an accuracy of 0.66 on pre-processed Tamil code-mixed data.
Highlights: To perform sentiment analysis on Tamil code-mixed data, learning methods are employed. To improve the accuracy of the models, various pre-processing techniques are explored. To avoid the class imbalance problem in the dataset, class re-sampling is performed. To capture local and global features, hybrid deep learning models are implemented.
- Is Part Of:
- Computer speech & language. Volume 76(2022)
- Journal:
- Computer speech & language
- Issue:
- Volume 76(2022)
- Issue Display:
- Volume 76, Issue 2022 (2022)
- Year:
- 2022
- Volume:
- 76
- Issue:
- 2022
- Issue Sort Value:
- 2022-0076-2022-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-11
- Subjects:
- Code-mixing -- Natural Language Processing -- Sequence classification
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2022.101407
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 21757.xml