Improving pattern classification of DNA microarray data by using PCA and logistic regression
Abstract
DNA microarrays are a technology that can be used to diagnose cancer and other diseases. To automate the analysis of such data, pattern recognition and machine learning algorithms can be applied. However, the curse of dimensionality is unavoidable: there are very few samples to train on and many attributes in each sample. Because the predictive accuracy of supervised classifiers decays in the presence of irrelevant and redundant features, a dimensionality reduction process is essential. The main idea is to retain only the genes that are most influential in the classification of the disease. In this paper, a new methodology based on Principal Component Analysis and Logistic Regression is proposed. Our method enables the selection of particular genes that are relevant for classification. Experiments were run using eight different classifiers on two benchmark datasets: Leukemia and Lymphoma. The results show that our method not only reduces the number of required attributes, but also increases the classification accuracy by more than 10% in all the cases we tested.
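The abstract does not say how Principal Component Analysis and logistic regression are combined, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes that genes are ranked by the absolute value of their loading on the first principal component, that the top-ranked genes are kept, and that a plain logistic regression is then trained on those genes. The data are synthetic (not Leukemia or Lymphoma), and every name here (`make_data`, `first_component`, `fit_logistic`, `keep`, the choice of three genes) is hypothetical.

```python
import math
import random


def make_data(n_samples=40, n_genes=30, n_informative=3, seed=0):
    """Synthetic 'expression' matrix; only the first n_informative genes depend on the class."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += 3.0 if label else -3.0
        X.append(row)
        y.append(label)
    return X, y


def first_component(X, iters=200):
    """Leading principal axis of X, found by power iteration on the centered data."""
    n, p = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[v - m for v, m in zip(row, means)] for row in X]
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        # Multiply by Xc^T Xc without forming the p x p covariance matrix.
        scores = [sum(a * b for a, b in zip(row, w)) for row in Xc]
        w = [sum(s * row[j] for s, row in zip(scores, Xc)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Unregularized logistic regression trained by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, target in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - target
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Predict class 1 when the linear score is positive (probability above 0.5)."""
    return [1 if b + sum(wi * xi for wi, xi in zip(w, row)) > 0 else 0 for row in X]


X, y = make_data()
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]

# Assumed selection rule: rank genes by |loading| on the first principal component.
loadings = first_component(X_train)
selected = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)[:3]


def keep(rows):
    """Restrict each sample to the selected genes."""
    return [[row[j] for j in selected] for row in rows]


w, b = fit_logistic(keep(X_train), y_train)
pred = predict(keep(X_test), w, b)
accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print("selected genes:", sorted(selected), "held-out accuracy:", accuracy)
```

The component is fitted on the training split only, so the gene ranking does not see the held-out samples. On real microarray data the number of genes to keep and the number of components to inspect would have to be chosen by validation rather than fixed in advance.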