ARCADE: A Prediction Method for Nominal Variables

Costa, J.F.P.; Lerman, I.C.

doi:10.3233/IDA-1998-2402

Article type: Research Article

Authors: Costa, J.F.P.^{a; *} | Lerman, I.C.^{b; 1}

Affiliations: [a] Departamento de Matemática Aplicada (Fac. de Ciências) LIACC, Univ. do Porto, Rua das Taipas, 135, 4050 Porto, Portugal | [b] IRISA-INRIA-Rennes, France

Correspondence: [*] Corresponding author. E-mail: jpcosta@ncc.up.pt.

Note: [1] E-mail: lerman@irisa.fr.

Abstract: The main problem considered in this paper is the binarization of categorical (nominal) attributes that have a very large number of values (204 in our application). A small number of relevant binary attributes are extracted from each initial attribute. Suppose that we want to binarize a categorical attribute $v$ with $L$ values, where $L$ is large or very large. The total number of binary attributes that can be extracted from $v$ is $2^{L-1}-1$, which is prohibitive when $L$ is large. Our idea is to select only those binary attributes that are predictive; these constitute a small fraction of all possible binary attributes. To do this, we group the $L$ values of a categorical attribute by means of a hierarchical clustering method, which requires defining a similarity between values that is associated with their predictive power. By clustering the $L$ values into a small number $J$ of clusters, we define a new categorical attribute with only $J$ values. The hierarchical clustering method that we use, AVL, allows us to choose a significant value for $J$. We could now consider using all the $2^{J-1}-1$ binary attributes associated with this new categorical attribute. However, because we have used a hierarchical clustering method, the $J$ values are tree-structured. We take advantage of this and consider only about $2 \times J$ binary attributes. If $L$ is extremely large, we might not be able to apply a clustering algorithm directly, for complexity and statistical reasons. In this case, we start by “factorizing” $v$ into a pair $(v_1, v_2)$, each component having about $\sqrt{L}$ values. As a simple example, consider an attribute $v$ with only four values $m_1, m_2, m_3, m_4$. Obviously, there is no need to factorize the set of values of $v$ in this example, because it is very small. Nevertheless, for illustration purposes, $v$ could be decomposed (factorized) into two attributes with only two values each; the correspondence between the values of $v$ and $(v_1, v_2)$ would be

$$
\begin{array}{c|cc}
v & v_1 & v_2 \\ \hline
m_1 & 1 & 1 \\
m_2 & 1 & 2 \\
m_3 & 2 & 1 \\
m_4 & 2 & 2
\end{array}
$$

We then apply the clustering method to the sets of values of $v_1$ and of $v_2$, thereby defining a new synthetic pair $(\bar{v}_1, \bar{v}_2)$. Next, we “multiply” these new attributes to obtain another attribute v10 with $J_1 \times J_2$ values, where $J_1$ (resp. $J_2$) is the number of values of $\bar{v}_1$ (resp. $\bar{v}_2$). Finally, we apply a clustering to the values of v10 and proceed as above. The solution that we propose is independent of the number of classes and can be applied to various situations. The application of ARCADE to the protein secondary structure prediction problem demonstrates the validity of our approach.
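
The counting argument and the factorization example in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the order-based assignment of codes in `factorize`, and the nested-tuple tree with clusters `c1` to `c5` are assumptions made only for illustration, and the AVL clustering step itself is not reproduced.

```python
import math


def n_binary_attributes(n_values: int) -> int:
    """Number of ways to split n_values values into two non-empty groups."""
    return 2 ** (n_values - 1) - 1


def factorize(values):
    """Map each value to a pair (v1, v2) of codes, about sqrt(n) codes each.

    Assumption: codes are assigned by position in the list; the paper may
    use a different correspondence.
    """
    side = math.isqrt(len(values) - 1) + 1  # ceil(sqrt(n)) for n >= 1
    return {m: (i // side + 1, i % side + 1) for i, m in enumerate(values)}


def leaves(tree):
    """Set of leaf labels under a node of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaves(child) for child in tree))
    return frozenset([tree])


def subtree_attributes(tree):
    """Leaf sets of all proper subtrees of the hierarchy.

    Each set S defines one candidate binary attribute: "the value lies in S".
    """
    found = []

    def visit(node, is_root):
        if not is_root:
            found.append(leaves(node))
        if isinstance(node, tuple):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return found


# Exhaustive binarization is out of reach for L = 204 values.
print(n_binary_attributes(204) > 10 ** 60)

# The four-value example from the abstract.
print(factorize(["m1", "m2", "m3", "m4"]))

# Hypothetical hierarchy over J = 5 clusters.
hierarchy = ((("c1", "c2"), "c3"), ("c4", "c5"))
print(len(subtree_attributes(hierarchy)), "tree-based candidates versus",
      n_binary_attributes(5), "exhaustive ones")
```

In this sketch a binary hierarchy over $J$ clusters has $2J-1$ nodes, so excluding the root leaves $2J-2$ candidate subsets, in line with the "about $2 \times J$" figure quoted above.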

Keywords: Decision trees, Binarization, Complexity reduction, Categorical attributes, Hierarchical clustering

Journal: Intelligent Data Analysis, vol. 2, no. 4, pp. 265-286, 1998

Received 28 February 1998

Revision received 3 May 1998

Accepted 28 May 1998

Published: 1 October 1998
