Exhaustive Affix Stripping And A Malay Word Register To Solve Stemming Errors And Ambiguity Problem In Malay Stemmers

Authors

  • Salhana Amad Darwis, Faculty of Computer Science and Information Technology, University of Malaya
  • Rukaini Abdullah, Faculty of Computer Science and Information Technology, University of Malaya
  • Norisma Idris, Faculty of Computer Science and Information Technology, University of Malaya

Keywords:

Malay language, stemming, Malay language stemmers, Malay word register, ambiguity problem, under-stemming, over-stemming

Abstract

Stemmers, or word stemming algorithms, reduce a derivative word to its root word by removing all of its affixes. The complexity of Malay Language (ML) morphological rules and of the Malay lexicon makes stemming Malay words difficult. There is no fixed method for determining which affix to remove from a derivative word to produce the correct root word. Furthermore, a derivative word may contain one or more valid root words. Stemming errors persist in previous Malay Language Stemmers (MLS): regardless of the approach used, they rely on the first affix matched or the first root word found. Hence, some words are under-stemmed or over-stemmed, while words with several valid root words are not stemmed to the correct root word. This multiple-root-word problem, or ambiguity problem, has never been addressed by previous MLS. To resolve over-stemming and under-stemming errors, we propose an approach that exhaustively strips all matched affixes to ensure that a valid root word is extracted. In addition, we propose the use of a Malay Word Register to address the ambiguity problem of determining the correct root word. We tested the proposed approach on words from newspaper articles, a Malay translation of the Quran, history essays, and words incorrectly stemmed by previous stemmers. The results show that the stemmer achieves 99.8% accuracy. There were no stemming errors; the shortfall from 100% is due to the approach used for the ambiguity problem.
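The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix lists, the tiny root dictionary, the register entry, and the function names (`candidate_roots`, `stem`) are all assumptions made for illustration, and Malay spelling alternations (for example the nasal forms of the prefix meN-) are ignored. The sketch explores every matched affix instead of stopping at the first one, collects every valid root it reaches, and consults a word register when more than one root is found.

```python
# Illustrative sketch only; all data below are assumed, not taken from the paper.
PREFIXES = ("ber", "me", "di", "ter", "pe", "ke", "se")  # assumed, simplified
SUFFIXES = ("kan", "an", "i")                            # assumed, simplified

ROOTS = {"makan", "beri", "ikan", "ajar"}  # toy root-word dictionary (assumed)
REGISTER = {"berikan": "beri"}             # toy word register (assumed format)


def candidate_roots(word, roots=ROOTS):
    """Strip every matched affix in every order; return all valid roots reached."""
    found, seen, stack = set(), set(), [word]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in roots:
            found.add(current)
        # Try all matching prefixes and suffixes, not just the first match.
        for prefix in PREFIXES:
            if current.startswith(prefix) and len(current) > len(prefix):
                stack.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) > len(suffix):
                stack.append(current[:-len(suffix)])
    return found


def stem(word, roots=ROOTS, register=REGISTER):
    """Return one root: the register decides when several roots are valid."""
    word = word.lower()
    if word in register:
        return register[word]
    candidates = candidate_roots(word, roots)
    if not candidates:
        return word  # no valid root reached; leave the word unchanged
    # Fallback for ambiguous words missing from the register: pick the
    # longest root (an arbitrary choice for this sketch, not from the paper).
    return max(sorted(candidates), key=len)


print(sorted(candidate_roots("berikan")))  # ['beri', 'ikan']
print(stem("berikan"))                     # beri
print(stem("makanan"))                     # makan
```

Here "berikan" can be read as "beri" + "-kan" or as "ber-" + "ikan", so affix stripping alone yields two valid roots; a stemmer that stops at the first match would return whichever it happened to reach first, while the register entry selects "beri".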

Published

2012-12-01

How to Cite

Darwis, S. A., Abdullah, R., & Idris, N. (2012). Exhaustive Affix Stripping And A Malay Word Register To Solve Stemming Errors And Ambiguity Problem In Malay Stemmers. Malaysian Journal of Computer Science, 25(4), 196–209. Retrieved from https://samudera.um.edu.my/index.php/MJCS/article/view/6717