Analysis of the Performance Limitations of Automated Essay Scoring in the ESAO Application Based on the BLEU and ROUGE Metrics

Akhmam Fahmi; Nuraini Nuraini; Maulana Fakih Latief

doi:10.47065/bulletincsr.v6i4.1154

Authors

Akhmam Fahmi, Sekolah Tinggi Teknologi Terpadu Nurul Fikri, Depok, Indonesia
Nuraini Nuraini, Sekolah Tinggi Teknologi Terpadu Nurul Fikri, Depok, Indonesia
Maulana Fakih Latief, Sekolah Tinggi Teknologi Terpadu Nurul Fikri, Depok, Indonesia

Keywords:

Automated Essay Grading; Generative Artificial Intelligence; BLEU; ROUGE; ESAO

Abstract

The development of generative AI (GenAI) has encouraged the use of automated essay scoring through various platforms, one of which is the ESAO (Essay Analytic Online) application. Although this LLM-based system can automatically generate narrative assessment feedback, standardizing the evaluation methods used to measure the reliability of these texts remains a significant challenge. This study tests the suitability of the Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics as instruments for measuring the extratextual performance of the ESAO application. The study compared ESAO feedback texts with authentic lecturer assessment drafts for exam materials with three different characteristics, using dataset condition analysis, descriptive statistics, and correlation and regression analysis. The tests yielded a mean BLEU score of 0.0522 and a mean ROUGE score of 0.1255. These low scores do not represent a functional failure of the ESAO application; rather, they indicate fundamental limitations of rigid lexical (word-based) metrics for assessing dynamic generative text. Because BLEU and ROUGE rely heavily on exact n-gram overlap, they fail to capture the semantic similarity, academic reasoning context, and linguistic variation in the feedback generated by ESAO. The study concludes that traditional evaluation metrics such as BLEU and ROUGE are unsuitable as the sole benchmark for generative AI performance in educational assessment, and that a transition to semantic-based metrics is needed in future work.
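The reliance on exact n-gram overlap described in the abstract can be illustrated with a minimal sketch. This is a simplified illustration and not the evaluation pipeline used in the study: it computes only a BLEU-style clipped n-gram precision (without the brevity penalty or geometric mean of full BLEU) and a ROUGE-N recall, and the two sentences are hypothetical examples invented here, not data from ESAO or the lecturers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """BLEU-style modified n-gram precision (no brevity penalty)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall: share of reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical paraphrases (assumption: invented for illustration only).
lecturer = "the answer explains the main concept clearly".split()
system = "this response describes the core idea well".split()

p1 = clipped_precision(system, lecturer, 1)
p2 = clipped_precision(system, lecturer, 2)
r1 = rouge_n_recall(system, lecturer, 1)

print(f"unigram precision: {p1:.3f}")  # 0.143
print(f"bigram precision:  {p2:.3f}")  # 0.000
print(f"ROUGE-1 recall:    {r1:.3f}")  # 0.143
```

Although the two sentences convey roughly the same judgment, they share only the word "the", so every overlap score is close to zero. This is the behavior the abstract points to when it attributes the low BLEU and ROUGE averages to the metrics rather than to the feedback itself.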
