EVALUATING SIMILARITY MEASURES FOR MALAY NOISY TEXT NORMALIZATION: PERFORMANCE AND THRESHOLD ANALYSIS

Azilawati Azizan; Nurkhairizan Khairudin; Muhammad Fitri Shazwan Fadzely; Nursyahidah Alias; Norshuhani Zamin; Norlina Mohd Sabri

doi:10.22452/mjcs.vol38spc.1

Authors

Azilawati Azizan, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Perak Branch, Malaysia (Corresponding Author)
Nurkhairizan Khairudin, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Perak Branch, Malaysia
Muhammad Fitri Shazwan Fadzely, Pembangunan Sumber Manusia Berhad, Malaysia
Nursyahidah Alias, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, Perak Branch, Malaysia
Norshuhani Zamin, College of Computer Studies, De La Salle University, Philippines
Norlina Mohd Sabri, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA Terengganu, Malaysia

DOI:

https://doi.org/10.22452/mjcs.vol38spc.1

Keywords:

Noisy Text; Text Normalization; Malay Noisy Text; Similarity Measure; Threshold.

Abstract

Noisy text normalization is a critical preprocessing step in natural language processing (NLP), particularly for user-generated content (UGC), which is rich in slang, abbreviations, and typographical errors. This extended study investigates the performance of multiple similarity measures in normalizing Malay noisy text, addressing a gap in prior studies, which predominantly relied on rule-based approaches and a single similarity measure. By systematically evaluating token-based, edit-based, and sequence-based similarity measures across a range of thresholds, the study analyzes both their effectiveness and their computational efficiency. The methodology comprises a two-phase experiment: a first phase that identifies optimal thresholds on a small dataset, and a second phase that tests whether those findings generalize to a larger dataset. Key findings reveal that edit-based measures, such as Levenshtein and Damerau-Levenshtein distance, consistently outperform the other measures at lower thresholds, achieving normalization success rates exceeding 83%. Ratcliff/Obershelp emerged as the most effective sequence-based measure, while token-based measures such as Jaccard and Cosine showed limited performance. The study also highlights the critical role of the threshold in balancing normalization accuracy and flexibility. Additionally, the computational time analysis underscores the trade-offs between accuracy and efficiency across similarity categories. These findings pave the way for more robust and adaptable text normalization strategies, particularly for Malay language studies.
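The idea of threshold-based normalization described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy lexicon, the noisy token "mkn", and the choice to scale Levenshtein distance to a similarity as 1 - d / max(len) are all assumptions made for illustration. The sequence-based measure uses the standard library's difflib.SequenceMatcher, which is a gestalt pattern matching approach in the Ratcliff/Obershelp family, and the Jaccard variant here is computed over character sets, which is only one of several possible formulations.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed scaling of the distance to [0, 1]: 1 - d / max(len)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def sequence_similarity(a: str, b: str) -> float:
    """Sequence-based score from difflib (Ratcliff/Obershelp-style)."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over character sets (an assumed formulation)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def normalize(token, lexicon, measure, threshold):
    """Return the best-scoring lexicon word, or the token unchanged
    if no candidate reaches the threshold."""
    best_word, best_score = token, 0.0
    for word in lexicon:
        score = measure(token, word)
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= threshold else token


# Hypothetical toy lexicon and noisy token, for illustration only.
lexicon = ["makan", "minum", "tidur"]
print(normalize("mkn", lexicon, levenshtein_similarity, 0.5))
print(normalize("mkn", lexicon, levenshtein_similarity, 0.8))
```

With this toy data, the lower threshold accepts "makan" as the replacement for "mkn", while the stricter threshold leaves the token unchanged. That mirrors the trade-off the abstract describes: a lower threshold normalizes more tokens but is more permissive about what counts as a match.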
