
Feature Selection Methods for Classifying Email Messages: Analysis, Proposal, and Comparative Study


Affiliations
1 Department of Informatics Research, Electronics Research Institute, Cairo, Egypt
 

Spam email messages are a serious problem for both users and Internet service providers. Such messages may carry viruses and harmful content, and they occupy a large amount of mailbox space. Email classification is therefore an important problem to analyze and discuss. This research work aims to classify email messages as either spam or non-spam. A dataset of email messages can be represented as a matrix whose rows are the instances (messages) and whose columns are the features of those instances. Two classifiers, K-Nearest Neighbor (KNN) and Naïve Bayes (NB), are used to classify the email messages. The proposed approach is based on partitioning the dataset into segments, and it is compared with the adopted approach. In addition, feature selection methods are used to retain the significant features and eliminate the others, which avoids processing overhead. The choice of relevant features plays an important role in classification accuracy. In this work, several existing feature selection methods are adopted, analyzed, and applied, and their performance is compared. A new feature selection method is also proposed and discussed, and its performance is compared with that of the adopted methods. The experiments are run on a dataset obtained from the Internet. The dataset contains about four thousand messages with fifty-eight features, and it includes a target feature representing the class labels. The practical experiments show that the proposed method performs better than the adopted ones. The proposed method is also expected to be applicable to datasets from other application domains.
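
The pipeline summarized above (a message-by-feature matrix, a feature selection step, then KNN and NB classification) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the data is a small synthetic stand-in rather than the dataset used in this work, the selection score is a generic filter criterion (squared difference of class means over summed class variances) and not the method proposed here, and all function names are hypothetical.

```python
import math
import random

def make_toy_dataset(n=200, n_features=6, seed=0):
    """Synthetic stand-in data: rows are messages, columns are features.
    Only columns 0 and 1 carry class signal; the rest are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = spam, 0 = non-spam (alternating, so classes are balanced)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 3.0 * label
        row[1] -= 3.0 * label
        X.append(row)
        y.append(label)
    return X, y

def mean_var(values):
    """Mean and (slightly smoothed) population variance of a list."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v + 1e-9

def select_features(X, y, k):
    """Generic filter selection (an assumption, not the paper's method):
    score each column by class-mean separation and keep the top k."""
    scores = []
    for j in range(len(X[0])):
        m0, v0 = mean_var([r[j] for r, c in zip(X, y) if c == 0])
        m1, v1 = mean_var([r[j] for r, c in zip(X, y) if c == 1])
        scores.append(((m0 - m1) ** 2 / (v0 + v1), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def project(X, cols):
    """Keep only the selected columns of the matrix."""
    return [[row[j] for j in cols] for row in X]

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted((math.dist(r, x), c) for r, c in zip(X_train, y_train))[:k]
    votes = sum(c for _, c in nearest)
    return 1 if 2 * votes > k else 0

def nb_fit(X, y):
    """Gaussian Naive Bayes: per-class log prior and per-feature mean/variance."""
    model = {}
    for c in (0, 1):
        rows = [r for r, lab in zip(X, y) if lab == c]
        stats = [mean_var([r[j] for r in rows]) for j in range(len(X[0]))]
        model[c] = (math.log(len(rows) / len(X)), stats)
    return model

def nb_predict(model, x):
    """Pick the class with the larger log posterior."""
    def log_post(c):
        prior, stats = model[c]
        return prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, (m, v) in zip(x, stats)
        )
    return max(model, key=log_post)

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

X, y = make_toy_dataset()
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

cols = select_features(X_train, y_train, k=2)  # scores use training rows only
Xtr, Xte = project(X_train, cols), project(X_test, cols)

knn_acc = accuracy([knn_predict(Xtr, y_train, x) for x in Xte], y_test)
nb_model = nb_fit(Xtr, y_train)
nb_acc = accuracy([nb_predict(nb_model, x) for x in Xte], y_test)

print("selected columns:", cols)
print("KNN accuracy:", knn_acc, "| NB accuracy:", nb_acc)
```

Computing the selection scores on the training rows only keeps the held-out messages from influencing which features are retained, so the reported accuracies are not inflated by the selection step.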

Keywords

Spam Messages, Classification Algorithms, Feature Selection Methods, Text Representation, Performance Evaluation

Authors

Sanaa Abou Elhamayed
Department of Informatics Research, Electronics Research Institute, Cairo, Egypt
Samah Osama M. Kamel
Department of Informatics Research, Electronics Research Institute, Cairo, Egypt
