Applying effective feature selection techniques with hierarchical mixtures of experts for spam classification

Belsis, Petros; Fragos, Kostas; Gritzalis, Stefanos; Skourlas, Christos

doi:10.3233/JCS-2009-0377

Issue title: Best papers of the Security Track at the 2006 ACM Symposium on Applied Computing

Guest editors: Giampaolo Bella and Peter Y.A. Ryan

Article type: Research Article

Authors: Belsis, Petros^{a; b; *} | Fragos, Kostas^c | Gritzalis, Stefanos^a | Skourlas, Christos^b

Affiliations: [a] Department of Information and Communication Systems Engineering, University of the Aegean, Samos, 83200 Greece | [b] Department of Informatics, Technological Education Institute of Athens, Egaleo, 12210 Greece | [c] Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, 15771 Greece

Correspondence: [*] Corresponding author: Department of Information and Communication Systems Engineering, University of the Aegean, Karlovassi, Samos, 83200 Greece. Tel.: +30 22730 82234; Fax: +30 22730 82009; E-mail: pbelsis@aegean.gr.

Abstract: E-mail abuse has increased steadily over the last decade. E-mail users are targeted by massive quantities of unsolicited bulk e-mail, which often contains offensive language or has fraudulent intent. Internet Service Providers (ISPs), on the other hand, face considerable system overload as incoming mail consumes network and storage resources. Among the many proposed solutions, text filtering approaches are the most prominent in terms of cost efficiency and complexity. Most of these approaches model the problem with linear statistical models. Despite the popularity of such models, which stems from their simplicity and relative ease of interpretation, the linearity assumption for the data samples is inappropriate in practice, mainly because linear models cannot capture the apparent non-linear relationships that characterize these samples. In this paper, we propose a margin-based feature selection approach integrated with a Hierarchical Mixtures of Experts (HME) system, which attempts to overcome limitations common to other machine-learning based approaches. After reducing the data dimensionality with effective feature selection algorithms, we evaluated our system on publicly available e-mail corpora characterized by very high similarity between legitimate and bulk e-mail (and thus low discriminative potential). We experimented with two different architectures, a hierarchical HME and a perceptron HME. The results confirm that our Spam Filtering HME (SF-HME) method outperforms other machine learning approaches, which achieve lower recall, as well as traditional rule-based approaches, which fall considerably short in precision.

Keywords: Spam mail, machine learning based processing, hierarchical mixtures of experts

Journal: Journal of Computer Security, vol. 17, no. 3, pp. 239-268, 2009

Published: 15 April 2009
