Influence of Imbalanced Data on Text Classification Using Recurrent Neural Network

Rina Septiriana; Tursina Tursina

doi:10.47065/bulletincsr.v6i4.996

Authors

Rina Septiriana Universitas Tanjungpura, Pontianak, Indonesia
Tursina Tursina Universitas Tanjungpura, Pontianak, Indonesia

DOI:

https://doi.org/10.47065/bulletincsr.v6i4.996

Keywords:

Deep Learning; Imbalanced Data; Recurrent Neural Networks; Gated Recurrent Unit; Resampling

Abstract

Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are designed for sequential data, but their performance in emotion detection is often compromised by class imbalance, which biases predictions toward majority classes and undermines the reliability of standard metrics. This study systematically compares LSTM and GRU architectures for classifying emotions in a dataset of 4,386 Indonesian tweets. The dataset exhibits a mild imbalance (approximately 1.7:1) across five classes: Anger, Happy, Sadness, Love, and Fear. Both models are evaluated under several resampling techniques, including Random Oversampling, SMOTE, and Near-Miss. The results show that LSTM consistently outperforms GRU in capturing complex emotional patterns. The LSTM model combined with Random Oversampling was the most robust configuration, achieving a Macro-F1 score of 71% and an accuracy of 73%. Random Oversampling improved minority-class recognition without overfitting, whereas SMOTE and Near-Miss introduced significant performance trade-offs. These results offer practical guidance for selecting architectures and resampling strategies that mitigate imbalance-related bias in sequential classification tasks.
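To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows random oversampling (duplicating minority-class samples until all classes are the same size) and the Macro-F1 score (the unweighted mean of per-class F1). The function names `random_oversample` and `macro_f1`, the toy tweets, and the 5:3 class split are all assumptions chosen for illustration. The sketch uses only the Python standard library.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally
    regardless of how many samples it has (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed): 5 "anger" vs 3 "fear" tweets, roughly a 1.7:1 ratio.
texts = [f"tweet {i}" for i in range(8)]
labels = ["anger"] * 5 + ["fear"] * 3

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))

# A classifier that always predicts the majority class looks acceptable
# on accuracy but poor on Macro-F1, because "fear" gets an F1 of zero.
always_majority = ["anger"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_majority)) / len(labels)
print(accuracy, macro_f1(labels, always_majority))
```

In practice, resampling is usually applied to the training split only, so that the evaluation data keeps its original class distribution; whether the study followed this protocol is not stated in the abstract.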


