An intelligent data pre-processing of complex datasets

Abdul-Rahman, Shuzlina; Bakar, Azuraliza Abu; Mohamed-Hussein, Zeti-Azura

doi:10.3233/IDA-2012-0525


Article type: Research Article

Authors: Abdul-Rahman, Shuzlina^a | Bakar, Azuraliza Abu^{a; *} | Mohamed-Hussein, Zeti-Azura^b

Affiliations: [a] Department of Science and System Management, Faculty of Information Science and Technology, The National University of Malaysia, Bangi, Selangor, Malaysia | [b] School of Biosciences & Biotechnology, Faculty of Science & Technology, The National University of Malaysia, Bangi, Selangor, Malaysia

Correspondence: [*] Corresponding author: Azuraliza Abu Bakar, Department of Science and System Management, Faculty of Information Science and Technology, The National University of Malaysia, 43600 Bangi, Selangor, Malaysia. Tel.: +603 89216794; E-mail: aab@ftsm.ukm.my

Keywords: Classification, data mining, feature selection, machine learning, optimisation, particle swarm optimisation

DOI: 10.3233/IDA-2012-0525

Journal: Intelligent Data Analysis, vol. 16, no. 2, pp. 305-325, 2012

Published: 1 March 2012


Abstract

Pre-processing plays a vital role in classification tasks, particularly when complex features are involved, and such cases demand an intelligent method. In bioinformatics, where datasets typically have complex features, pre-processing is unavoidable. In this paper, we propose a framework that integrates the filter and wrapper approaches to select discriminatory features from protein sequences prior to classification. In the first phase, several state-of-the-art multivariate filters were explored to remove unwanted features that contributed to noise; in the wrapper phase, particle swarm optimisation (PSO) with a support vector machine (SVM) was adopted to produce an optimal feature subset. Several PSO variants were investigated in the wrapper phase to identify the most suitable one for the problem domain. The results of both phases were analysed in terms of classification accuracy, number of selected features, modelling time and area under the curve on the main dataset and on five benchmark machine learning datasets of similar complexity. The proposed framework achieved higher and highly reliable classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.
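
The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and every name in it is hypothetical. To stay dependency-free it makes several simplifying assumptions: the data are synthetic rather than protein sequences, the filter is a simple univariate class-separation score rather than one of the multivariate filters the paper explores, a nearest-centroid classifier stands in for the SVM, and the wrapper uses a plain binary PSO with arbitrarily chosen coefficients rather than any specific variant from the paper.

```python
import math
import random

def make_data(n=120, informative=3, noise=17, seed=0):
    """Synthetic two-class data; only the first `informative` features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(1.5 * label, 1.0) for _ in range(informative)]
        row += [rng.gauss(0.0, 1.0) for _ in range(noise)]
        X.append(row)
        y.append(label)
    return X, y

def filter_phase(X, y, keep):
    """Filter stand-in: keep the `keep` features with the largest class-mean gap."""
    scores = []
    for j in range(len(X[0])):
        a = [r[j] for r, t in zip(X, y) if t == 0]
        b = [r[j] for r, t in zip(X, y) if t == 1]
        col = a + b
        mu = sum(col) / len(col)
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)) or 1.0
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)) / sd)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]

def accuracy(X, y, feats):
    """Hold-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    if not feats:
        return 0.0
    cut = 2 * len(X) // 3  # first two thirds train, last third test
    cents = {}
    for c in (0, 1):
        rows = [r for r, t in zip(X[:cut], y[:cut]) if t == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    hits = 0
    for r, t in zip(X[cut:], y[cut:]):
        d = {c: sum((r[j] - m) ** 2 for j, m in zip(feats, cents[c])) for c in cents}
        hits += min(d, key=d.get) == t
    return hits / (len(X) - cut)

def wrapper_phase(X, y, cand, particles=10, iters=20, seed=1):
    """Binary PSO over the filtered candidates; each bit switches one feature on."""
    rng = random.Random(seed)
    dim = len(cand)

    def fitness(mask):
        feats = [f for f, bit in zip(cand, mask) if bit]
        # Accuracy with a small penalty that favours smaller subsets.
        return accuracy(X, y, feats) - 0.01 * len(feats) / dim

    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pfit = [fitness(p) for p in pos]
    g = max(range(particles), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                v = (0.7 * vel[i][d]
                     + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                     + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp the velocity
                # The sigmoid of the velocity is the probability that the bit is 1.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            score = fitness(pos[i])
            if score > pfit[i]:
                pbest[i], pfit[i] = pos[i][:], score
                if score > gfit:
                    gbest, gfit = pos[i][:], score
    return [f for f, bit in zip(cand, gbest) if bit]

X, y = make_data()
candidates = filter_phase(X, y, keep=8)          # phase 1: discard noisy features
selected = wrapper_phase(X, y, candidates)       # phase 2: search the remainder
print(len(X[0]), "->", len(candidates), "->", len(selected), "features")
print("hold-out accuracy:", accuracy(X, y, selected))
```

The point of the sketch is the division of labour: the cheap filter shrinks the search space so that the more expensive wrapper, which evaluates a classifier for every candidate subset, has fewer bits to search. For brevity the filter here scores features on all samples, including the hold-out portion; a careful evaluation would keep feature selection inside the training folds of a cross-validation.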
