Article type: Research Article
Authors: Hu, Ya-Han [a] | Lin, Wei-Chao [b] | Tsai, Chih-Fong [c, *] | Ke, Shih-Wen [d] | Chen, Chih-Wen [e]
Affiliations: [a] Department of Information Management, National Chung Cheng University, Taiwan | [b] Department of Computer Science and Information Engineering, Hwa Hsia University of Technology, Taiwan | [c] Department of Information Management, National Central University, Taiwan | [d] Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan | [e] Department of Pharmacy, Kaohsiung Municipal Chinese Medical Hospital, Taiwan
Correspondence: [*] Corresponding author: Chih-Fong Tsai, Department of Information Management, National Central University, Taiwan. Tel.: +886 3 422 7151; Fax: +886 3 4254604; E-mail: cftsai@mgt.ncu.edu.tw.
Abstract: Background: Medical datasets are usually very large, which directly affects the computational cost of the data mining process. Instance selection is a data preprocessing step in the knowledge discovery process that can reduce storage requirements while maintaining mining quality. It aims to filter out outliers (or noisy data) from a given training dataset. However, when the dataset is very large, the instance selection task itself takes more time. Objective: In this paper, we introduce an efficient data preprocessing approach (EDP), which is composed of two steps. In the first step, instance selection is performed on a small amount of the training data, and a model is trained on the result. In the second step, the model is used to identify good and noisy data in the remaining, much larger part of the training data. Methods: Experiments are conducted on two medical datasets, for breast cancer and protein homology prediction problems, that contain over 100,000 data samples. Three well-known instance selection algorithms are used: IB3, DROP3, and genetic algorithms. Three popular classification techniques are used to construct the learning models for comparison: the CART decision tree, k-nearest neighbor (k-NN), and support vector machine (SVM). Results: The results show that our proposed approach not only reduces the computational cost by nearly a factor of two or three relative to three other state-of-the-art algorithms, but also maintains the final classification accuracy. Conclusions: Directly executing existing instance selection algorithms over large-scale medical datasets incurs a large computational cost. Our proposed EDP approach addresses this problem by training a learning model to recognize good and noisy data. Considering both computational complexity and final classification accuracy, the proposed EDP is shown to be efficient and effective for the large-scale instance selection problem.
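The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions: edited nearest neighbour (ENN) filtering stands in for IB3, DROP3, or a genetic algorithm; the "good vs. noisy" detector is a plain k-NN model; and the detector's input is the feature vector with the class label appended. The function names (`knn_predict`, `enn_flags`, `edp_select`) and the toy data are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority target among the k points in train nearest to query.

    train is a list of (feature_tuple, target) pairs.
    """
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point[0], query))

    nearest = sorted(train, key=sq_dist)[:k]
    return Counter(target for _, target in nearest).most_common(1)[0][0]


def enn_flags(data, k=3):
    """Stand-in instance selector (ENN, an assumption): an instance is 'good'
    when the majority label of its k nearest other instances matches its own."""
    flags = []
    for i, (x, y) in enumerate(data):
        others = data[:i] + data[i + 1:]
        flags.append(knn_predict(others, x, k) == y)
    return flags


def edp_select(small, rest, k_select=3, k_detect=1):
    """Two-step selection sketch.

    Step 1: run the (expensive) selector on the small subset only.
    Step 2: train a detector on the good/noisy flags and apply it to the rest.
    """
    flags = enn_flags(small, k_select)
    kept = [inst for inst, good in zip(small, flags) if good]

    # Assumption: the detector sees the features plus the class label.
    detector_train = [(x + (float(y),), good)
                      for (x, y), good in zip(small, flags)]

    for x, y in rest:
        if knn_predict(detector_train, x + (float(y),), k_detect):
            kept.append((x, y))
    return kept


# Toy 1-D data: class 0 near 0..4, class 1 near 10..14, plus mislabelled points.
small = ([((float(v),), 0) for v in range(5)]
         + [((float(v),), 1) for v in range(10, 15)]
         + [((2.4,), 1), ((12.4,), 0)])          # two noisy instances
rest = [((0.5,), 0), ((2.5,), 1), ((3.5,), 0),   # (2.5, 1) is noisy
        ((11.5,), 1), ((12.5,), 0), ((13.5,), 1)]  # (12.5, 0) is noisy

kept = edp_select(small, rest)
print(len(kept), "of", len(small) + len(rest), "instances kept")
```

In this sketch the pairwise selector touches only the small subset, while each remaining instance costs a single detector query. That is the kind of saving the abstract describes, although the actual speed-up depends on the selector, the detector, and the subset size used.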
Keywords: Data preprocessing, instance selection, medical data mining, breast cancer, protein homology
DOI: 10.3233/THC-140887
Journal: Technology and Health Care, vol. 23, no. 2, pp. 153-160, 2015