Cross-dataset email classification

Morales, Valentin; Gomez, Juan Carlos; Van Amerongen, Saskia

doi:10.3233/JIFS-179890

Issue title: Special section: Selected papers of LKE 2019

Guest editors: David Pinto, Vivek Singh and Fernando Perez

Article type: Research Article

Authors: Morales, Valentin | Gomez, Juan Carlos [*] | Van Amerongen, Saskia

Affiliations: Departamento de Ingeniería Electrónica, División de Ingenierías Campus Irapuato-Salamanca, Universidad de Guanajuato, Salamanca, México

Correspondence: [*] Corresponding author: Juan Carlos Gomez, Departamento de Ingeniería Electrónica, División de Ingenierías Campus Irapuato-Salamanca, Universidad de Guanajuato, Salamanca, México. Tel.: +524646479940; E-mail: jc.gomez@ugto.mx.

Abstract: Email is one of the most popular means of communication. Nevertheless, it is also a potential tool to deceive users and flood them with unwanted publicity, which reduces productivity. To alleviate this problem, a common solution has been to build machine learning models based on the content of emails to automatically separate them (spam vs. ham). In this work, a study of a set of machine learning models and content-based features for the problem of cross-dataset email classification is presented. This problem consists of training and testing the models using different datasets, given that the datasets were collected under different, independent setups. The purpose is to simulate future variable or unpredictable conditions in the email content distributions, as could happen in a real setting, where models are trained using emails from a certain period of time, group of users, or accounts, but tested with emails from other users or accounts. Experiments were conducted with the models and features using different datasets and two setups, same-dataset and cross-dataset, to show the complexity of the latter. The performance was evaluated using the Area Under the ROC Curve, a common metric in email classification. The results provide insights into the problem.
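The cross-dataset setup described in the abstract can be illustrated with a minimal sketch: a model is fitted on one email collection and scored on a different one, and the ranking quality is measured with the Area Under the ROC Curve. Everything below is an assumption made for illustration only: the bag-of-words Naive Bayes scorer stands in for the (unspecified here) models and features studied in the article, the function names `train_nb`, `roc_auc`, and `evaluate` are hypothetical, and the two tiny datasets are invented toy examples, not data from the study.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization on lowercased text (a simplifying assumption).
    return text.lower().split()


def train_nb(emails):
    """Fit a word-count Naive Bayes scorer.

    emails: list of (text, label) pairs, label 1 = spam, 0 = ham.
    Returns a function mapping a text to a log-odds score (higher = more spam-like).
    """
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in emails:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    prior = math.log(docs[1] / docs[0])

    def score(text):
        s = prior
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace-smoothed word likelihoods for each class.
            p_spam = (counts[1][tok] + 1) / (totals[1] + len(vocab))
            p_ham = (counts[0][tok] + 1) / (totals[0] + len(vocab))
            s += math.log(p_spam / p_ham)
        return s

    return score


def roc_auc(labels, scores):
    """AUC as the fraction of (spam, ham) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(train, test):
    """Train on one collection and report the AUC on another."""
    score = train_nb(train)
    return roc_auc([y for _, y in test], [score(t) for t, _ in test])


# Invented toy collections standing in for two independently gathered datasets.
dataset_a = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("meeting agenda for monday", 0),
    ("project report attached", 0),
]
dataset_b = [
    ("cheap pills online discount", 1),
    ("claim your free discount", 1),
    ("lunch on monday", 0),
    ("see attached invoice", 0),
]

# Cross-dataset: train on A, test on B (and the reverse direction).
cross_ab = evaluate(dataset_a, dataset_b)
cross_ba = evaluate(dataset_b, dataset_a)
print(f"A -> B AUC: {cross_ab:.3f}, B -> A AUC: {cross_ba:.3f}")
```

A same-dataset run would call `evaluate` with a training split and a held-out split drawn from the same collection. The toy data are far too small to say anything about the relative difficulty of the two setups; the sketch only shows where the train and test collections differ.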

Keywords: Email classification, data mining, machine learning, cross-dataset classification

Journal: Journal of Intelligent & Fuzzy Systems, vol. 39, no. 2, pp. 2279-2290, 2020

Published: 31 August 2020
