Aspect Based Sentiment Analysis: Feature Extraction using Latent Dirichlet Allocation (LDA) and Term Frequency - Inverse Document Frequency (TF-IDF) in Machine Learning (ML)


  • Shakirah Mohd Sofi, Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia Kuala Lumpur, Jalan Sultan Yahya Petra, Kuala Lumpur 54100, Malaysia
  • Ali Selamat, Center for Basic and Applied Research, Faculty of Informatics and Management, University of Hradec Kralove, Rokitanskeho 62, 50003 Hradec Kralove, Czech Republic



Aspect-Based Sentiment Analysis, Opinion Mining, Feature Extraction, Topic Modeling, LDA, Count Vectorizer, TF-IDF, SVM, NB


The growth of social networks, blogs, forums, and e-commerce websites has produced a tremendous and rapidly increasing volume of data, notably textual data. Twitter is one of the most popular social media platforms; during the COVID-19 pandemic, people all around the world used social media to share their opinions and concerns about a pandemic that changed their lives. This produced a significant rise in tweets about the coronavirus, including positive, negative, and neutral tweets about the virus's impact. Sentiment analysis of such data faces challenges: sparse data limits understanding, while topic coherence and interpretability need improvement to yield clearer insights. The primary goal of this paper is to improve the accuracy and effectiveness of sentiment analysis of COVID-19 tweets through the application of advanced techniques and classifiers. We experiment with classifiers such as Support Vector Machines (SVM) and Naive Bayes (NB) on Twitter data to build high-accuracy machine learning models. Using Latent Dirichlet Allocation (LDA) for feature extraction, we aim to capture comprehensive aspects and topics for sentiment analysis. Additionally, we explore Count Vectorizer and Term Frequency - Inverse Document Frequency (TF-IDF) as text vectorization techniques. The main objectives are to extract topics, understand public concerns about COVID-19, and compare classifier performance in Aspect-Based Sentiment Analysis on COVID-19 tweets. This paper applies LDA, Count Vectorizer, and SVM to enable more nuanced sentiment analysis during the COVID-19 pandemic, achieving 85% accuracy with SVM classification.
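To make the difference between the two vectorization schemes concrete, the following is a minimal sketch in plain Python. It is an illustration only, not the authors' implementation: the toy tweets are hypothetical, and the weighting uses the textbook form tf × log(N / df), which may differ from the smoothed variant a given library applies.

```python
import math
from collections import Counter

# Hypothetical toy tweets; not taken from the paper's dataset.
tweets = [
    "vaccine rollout gives hope",
    "lockdown again no hope",
    "vaccine side effects worry",
]

# Whitespace tokenization is an assumption; real pipelines usually clean text first.
docs = [t.split() for t in tweets]
vocab = sorted({w for d in docs for w in d})

def count_vector(doc):
    """Raw term counts over the shared vocabulary (Count Vectorizer style)."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def idf(word):
    """Textbook IDF: log(N / df), where df is the number of docs containing the word."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tfidf_vector(doc):
    """Term frequency (count / document length) multiplied by IDF."""
    counts = Counter(doc)
    return [counts[w] / len(doc) * idf(w) for w in vocab]

count_matrix = [count_vector(d) for d in docs]
tfidf_matrix = [tfidf_vector(d) for d in docs]

# Compare how the first tweet weights a rare word against a shared one.
for word in ("rollout", "hope"):
    i = vocab.index(word)
    print(word, count_matrix[0][i], round(tfidf_matrix[0][i], 3))
```

In the first tweet both "rollout" and "hope" have a raw count of 1, but TF-IDF weights "rollout" higher because it occurs in only one tweet, whereas "hope" occurs in two. Either matrix could then be passed to a classifier such as SVM or NB.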










How to Cite

Aspect Based Sentiment Analysis: Feature Extraction using Latent Dirichlet Allocation (LDA) and Term Frequency - Inverse Document Frequency (TF-IDF) in Machine Learning (ML). (2024). Malaysian Journal of Information and Communication Technology (MyJICT), 8(2), 169-179.