EVALUATING SIMILARITY MEASURES FOR MALAY NOISY TEXT NORMALIZATION: PERFORMANCE AND THRESHOLD ANALYSIS

Authors

  • Azilawati Azizan, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Perak Branch, Malaysia
  • Nurkhairizan Khairudin, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Perak Branch, Malaysia
  • Muhammad Fitri Shazwan Fadzely, Pembangunan Sumber Manusia Berhad, Malaysia
  • Nursyahidah Alias, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Perak Branch, Malaysia
  • Norshuhani Zamin, College of Computer Studies, De La Salle University, Philippines
  • Norlina Mohd Sabri, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA Terengganu, Malaysia

Keywords

Noisy Text; Text Normalization; Malay Noisy Text; Similarity Measure; Threshold.

Abstract

Noisy text normalization is a critical preprocessing step in natural language processing (NLP), particularly for user-generated content (UGC), which is rich in slang, abbreviations, and typographical errors. This extended study investigates the performance of multiple similarity measures in normalizing Malay noisy text, addressing gaps in prior studies that predominantly relied on rule-based approaches and single similarity measures. By systematically evaluating token-based, edit-based, and sequence-based similarity measures across various thresholds, this study provides a comprehensive analysis of their effectiveness and computational efficiency. The methodology comprises a two-phase experiment: an initial phase that identifies optimal thresholds using a small dataset, and a second phase that generalizes the findings on a larger dataset. Key findings reveal that edit-based measures, such as Levenshtein Distance and Damerau-Levenshtein, consistently outperform other measures at lower thresholds, achieving normalization success rates exceeding 83%. Ratcliff/Obershelp emerged as the most effective sequence-based measure, while token-based measures such as Jaccard and Cosine demonstrated limited performance. The study also highlights the critical role of the threshold in balancing normalization accuracy and flexibility. Additionally, computational time analysis underscores the trade-offs between accuracy and efficiency across similarity categories. These findings pave the way for more robust and adaptable text normalization strategies, particularly for Malay language studies.
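
To make the idea of threshold-based normalization concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a noisy token is replaced by its most similar lexicon word only when the similarity score reaches a chosen threshold. The function names, the three-word lexicon, the sample token "mkn", and the character-bigram form of Jaccard are all illustrative assumptions and are not taken from the study's dataset or code. The sequence-based measure uses Python's difflib.SequenceMatcher, which is a close variant of Ratcliff/Obershelp rather than an exact reproduction.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit-based: distance rescaled to [0, 1] by the longer length (an assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def sequence_similarity(a: str, b: str) -> float:
    """Sequence-based: difflib's Ratcliff/Obershelp-style ratio."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Token-based: Jaccard overlap of character n-grams (an assumed tokenization)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def normalize(token, lexicon, measure, threshold):
    """Return the best lexicon match if it clears the threshold, else the token unchanged."""
    best = max(lexicon, key=lambda word: measure(token, word))
    return best if measure(token, best) >= threshold else token


# Hypothetical lexicon and noisy token, for illustration only.
lexicon = ["makan", "minum", "main"]
print(normalize("mkn", lexicon, edit_similarity, threshold=0.5))  # accepted at a low threshold
print(normalize("mkn", lexicon, edit_similarity, threshold=0.9))  # rejected at a high threshold
print(jaccard_similarity("mkn", "makan"))                         # no shared bigrams
```

In this toy case the abbreviation shares no character bigrams with its intended word, so the bigram Jaccard score is zero while the edit-based score is 0.6. Raising the threshold makes the normalizer more conservative: at 0.9 the token is left unchanged, which illustrates the balance between accuracy and flexibility that the abstract describes.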


Published

2025-08-01

How to Cite

Azizan, A., Khairudin, N., Fadzely, M. F. S., Alias, N., Zamin, N., & Sabri, N. M. (2025). EVALUATING SIMILARITY MEASURES FOR MALAY NOISY TEXT NORMALIZATION: PERFORMANCE AND THRESHOLD ANALYSIS. Malaysian Journal of Computer Science, 38. Retrieved from https://samudera.um.edu.my/index.php/MJCS/article/view/63760