Authors :
Himani H. Patel; Om Mahalle; Anish Shetty; Sachin Hugar
Volume/Issue :
Volume 11 - 2026, Issue 6 - June
Google Scholar :
https://tinyurl.com/3yewdpjv
Scribd :
https://tinyurl.com/mscnwy94
DOI :
https://doi.org/10.38124/ijisrt/26jun364
Abstract :
Sentiment analysis of social media text plays a vital role in understanding public opinion, consumer behavior, and
societal trends. Twitter, as a microblogging platform, poses a particular challenge for sentiment analysis because tweets
are highly noisy and informal. This paper presents an efficient and scalable sentiment analysis system built on the
Sentiment140 dataset, which comprises 1.6 million tweets labeled automatically using emoticons. The system applies
lightweight text preprocessing and then converts tweets into a numerical representation using Term Frequency-Inverse
Document Frequency (TF-IDF) with unigrams, bigrams, and sublinear term-frequency scaling. Three classical machine
learning classifiers, Multinomial Naive Bayes, Logistic Regression (tuned using GridSearchCV), and Linear SVM, are
combined using a hard voting classifier. Experiments with a 75:25 train-test split show that the combined classifier
achieves higher accuracy, precision, recall, and F1-measure than its individual component classifiers. The system is also
computationally efficient. Error analysis indicates that slang, sarcasm, and emojis remain the major sources of
misclassification. The results indicate that classical linear models, when trained on large-scale data and combined
effectively, provide a strong, scalable baseline for Twitter sentiment analysis that is suitable for real-time deployment.
Future work includes incorporating emoji-aware features and contextual embeddings to handle linguistic nuance.
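
The pipeline described in the abstract can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the authors' implementation: the toy tweets, the 0/1 labels, the helper name build_ensemble, the grid of C values, the 2-fold cross-validation, and the reliance on the vectorizer's default lowercasing as the "lightweight" preprocessing are all assumptions made for the example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for Sentiment140 (1 = positive, 0 = negative).
TWEETS = [
    "i love this phone so much", "what a great day today",
    "best coffee ever, so happy", "really enjoyed the movie tonight",
    "this is awesome news", "feeling good and happy",
    "love my new shoes", "great game, awesome win",
    "i hate this traffic", "worst service ever, so sad",
    "this is terrible news", "feeling sick and tired",
    "really bad day at work", "my phone broke, so annoyed",
    "hate waiting in line", "awful movie, total waste",
]
LABELS = [1] * 8 + [0] * 8


def build_ensemble():
    """TF-IDF (unigrams + bigrams, sublinear tf) feeding a hard-voting ensemble."""
    # Only Logistic Regression is tuned here; the grid values are assumed.
    tuned_lr = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},
        cv=2,
    )
    voter = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", tuned_lr),
            ("svm", LinearSVC()),
        ],
        voting="hard",  # majority vote over the predicted class labels
    )
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
    return make_pipeline(vectorizer, voter)


# 75:25 train-test split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    TWEETS, LABELS, test_size=0.25, stratify=LABELS, random_state=0
)
model = build_ensemble()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(X_test, predictions.tolist())))
```

Hard voting needs only each classifier's predicted label, which is why LinearSVC, which does not produce class probabilities, can take part without calibration. With three voters and two classes, a tie is not possible.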