Text Normalization on Indonesian-English Code-Mixed Twitter Text using UFAL ByT5
Abstract
Social media has grown rapidly across the global community. This includes Twitter, which continues to gain both users and created content. However, Twitter imposes a character limit per tweet, which has changed the writing patterns of its users: they began modifying their writing from formal words to non-formal words, one form of which is code-mixed language. For tweet analysis, text normalization is required to transform non-formal words into formal ones and thereby support the analysis process. The recent state of the art for Indonesian-English code-mixed Twitter text normalization uses statistical machine translation (SMT) models; however, SMT models still have weaknesses in word recognition. This research focuses on normalizing Indonesian-English code-mixed Twitter text using a transformer model, UFAL ByT5. Two UFAL ByT5 models were used, one for Indonesian and one for English. The results show that the UFAL ByT5 model outperforms the SMT model on text normalization by a margin of 0.88 BLEU points.
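The reported 0.88-point gap between the two models is measured with BLEU. As a rough illustration of how such a comparison is scored, below is a minimal sentence-level BLEU sketch (uniform 4-gram weights, a brevity penalty, and a crude smoothing floor). The example sentences are invented for illustration, not drawn from the paper's dataset, and a real evaluation would use an established implementation such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        # tiny floor so one empty n-gram order does not zero the whole score
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    geo_mean = math.exp(sum(log_precisions) / max_n)
    # brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * geo_mean

# Invented example: a normalized tweet scored against a reference normalization.
reference = "aku sedang on the way ke kampus sekarang"
perfect = sentence_bleu(reference, "aku sedang on the way ke kampus sekarang")
partial = sentence_bleu(reference, "aku lagi on the way ke kampus skrg")
```

A perfect normalization scores 1.0, while a hypothesis that leaves some colloquial tokens ("lagi", "skrg") unnormalized scores strictly lower, which is the kind of difference the 0.88-point gap summarizes over a whole test set.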
Article Details
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.