FEATURE ENGINEERING WITH SENTENCE SIMILARITY USING THE LONGEST COMMON SUBSEQUENCE FOR EMAIL CLASSIFICATION

Authors

  • Aruna Kumara B School of Computing and Information Technology, REVA University, 560064 Bengaluru, India
  • Mallikarjun M Kodabagi School of Computing and Information Technology, REVA University, 560064 Bengaluru, India

DOI:

https://doi.org/10.22452/mjcs.sp2022no2.6

Keywords:

Email Classification, Feature Engineering, Sentence Similarity, Similarity Measure, Imbalanced Learning, Feature Selection

Abstract

Feature selection plays a prominent role in email classification, since selecting the most relevant features improves the accuracy and performance of the learning classifier. With email usage growing exponentially, classifying emails has become a significant problem that calls for a proper classification system. Such a system requires an efficient feature selection method that identifies the most relevant features for accurate classification. This paper proposes a novel feature selection method based on sentence similarity, using the longest common subsequence, for email classification. The proposed method works in two main phases. First, it builds a longest common subsequence vector of features by comparing each email with all other emails in the dataset. Second, a template is constructed for each class using the closest features of the emails in that class. These templates are then used to classify unseen emails. The performance of the proposed method is compared with traditional feature selection methods such as TF-IDF, Information Gain, Chi-square, and a semantic approach. The experimental results show that the proposed method performs well, achieving 96.61% accuracy.
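
As an illustration of the idea described in the abstract, the following is a minimal Python sketch of token-level longest common subsequence (LCS) similarity and a template-based class assignment. It is not the authors' implementation: the abstract does not specify how the LCS feature vector or the class templates are built, so the normalisation by the longer sequence, the single-template-per-class representation, and the names lcs_length, lcs_similarity, classify and templates are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # prev[j] holds the LCS length of the part of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def classify(email: Sequence[str], templates: Dict[str, List[str]]) -> str:
    """Pick the class whose template is most similar to the email (hypothetical rule)."""
    return max(templates, key=lambda label: lcs_similarity(email, templates[label]))


# Toy templates, invented for illustration only; not data from the paper.
templates = {
    "meeting": "please join the project meeting on monday".split(),
    "invoice": "please find the attached invoice for payment".split(),
}
email = "join the meeting on monday morning".split()
print(lcs_length(email, templates["meeting"]))  # 5
print(classify(email, templates))               # meeting
```

Comparing tokens rather than characters keeps word order as the signal, which is what distinguishes an LCS-based measure from bag-of-words weightings such as TF-IDF. The dynamic programme runs in time proportional to the product of the two sequence lengths, so comparing every email with every other email, as the first phase describes, scales quadratically in the number of emails.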

Published

2022-12-06

How to Cite

B, A. K., & Kodabagi, M. M. (2022). FEATURE ENGINEERING WITH SENTENCE SIMILARITY USING THE LONGEST COMMON SUBSEQUENCE FOR EMAIL CLASSIFICATION. Malaysian Journal of Computer Science, 65–78. https://doi.org/10.22452/mjcs.sp2022no2.6