K-spectrum Support Vector Machine classifier for spam filtering
University of New Brunswick
Machine learning approaches to spam filtering, including the Support Vector Machine (SVM), have traditionally used the bag-of-words (BOW) text representation for their features. However, this representation discards word-order information and is not suitable for languages that do not use white space as a word delimiter. It is therefore appealing to take a string-based approach and treat every email as a string of symbols. In this report, we implement a contiguous string-based approach, the k-spectrum kernel, for use with an SVM in a discriminative approach to the spam classification problem. With the k-spectrum SVM spam classifier, email texts are implicitly mapped into a high-dimensional feature space. The classifier produces a decision boundary in this feature space, and emails are classified according to whether they map to the positive (spam) or negative (non-spam) side of the boundary. Our experimental results demonstrate that the k-spectrum SVM spam classifier could offer an effective and accurate alternative to commonly used spam filtering approaches, such as Naive Bayesian classifiers and BOW-based SVM classifiers.
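To make the idea of a contiguous string-based kernel concrete, the following is a minimal sketch of a k-spectrum kernel: each text is represented by the counts of all its contiguous length-k substrings, and the kernel value is the inner product of two such count vectors. The function names `spectrum_features` and `spectrum_kernel`, the character-level treatment of the text, and the use of raw (unnormalized) counts are assumptions made for illustration and are not taken from the report.

```python
from collections import Counter


def spectrum_features(text, k):
    """Count every contiguous substring of length k in `text`.

    Texts shorter than k yield an empty Counter.
    """
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t.

    The feature space has one dimension per possible k-mer, but it is
    never built explicitly: only k-mers that occur in both strings
    contribute to the sum.
    """
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    # Iterate over the smaller map; Counter returns 0 for missing keys.
    if len(fs) > len(ft):
        fs, ft = ft, fs
    return sum(count * ft[kmer] for kmer, count in fs.items())


if __name__ == "__main__":
    # Shared 3-mers are "fre", "ree" and "ee ", each occurring once per text.
    print(spectrum_kernel("free money", "free offer", 3))  # 3
```

A kernel of this form could be evaluated on all pairs of training emails to build a Gram matrix for an SVM implementation that accepts precomputed kernels. The sign of the resulting decision function then places an email on the spam or non-spam side of the boundary.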