Article type: Research Article
Authors: Fushimi, Takayasu [a] | Saito, Kazumi [b, c, *] | Motoda, Hiroshi [d]
Affiliations: [a] School of Computer Science, Tokyo University of Technology, Tokyo, Japan | [b] Faculty of Science, Kanagawa University, Kanagawa, Japan | [c] Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan | [d] Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan
Correspondence: [*] Corresponding author: Kazumi Saito, Faculty of Science, Kanagawa University, Kanagawa, Japan. E-mails: k-saito@kanagawa-u.ac.jp, kazumi.saito@riken.jp.
Abstract: We propose a new method for constructing a variable-bin-width histogram that accommodates an unbalanced sample distribution while retaining, as a whole, the good aspects of both equal-width (EW) and equal-area (EA) histograms, which are widely used for data visualization and analysis. We formulate this as an optimal change point detection problem in which the bin boundaries are determined by minimizing the sum of the absolute error or the squared error in each bin. The former, based on Distance Minimization (DM), is new; the latter, based on Variance Minimization (VM), is considered the state of the art. The constructed histograms can be used effectively to detect and visualize hidden outliers/anomalies by applying the interquartile range method in each bin. The final histograms are obtained by adjusting the bin boundaries and heights accordingly after removing the detected outliers/anomalies. We further propose a method to annotate the constructed bins when annotation data is given for each sample as a set of nominal variables, using the z-score with respect to their distribution within each bin. We applied our method to both real vinyl greenhouse datasets and two different sets of three synthetic datasets, and confirmed that both the DM and VM methods work as intended, that both can represent the sample distribution with fewer bins than the EW and EA methods, that the interquartile range method can detect anomalies as well as outliers, and that the terms selected for annotation are interpretable and reasonable. The EW and EA methods have contrasting properties. The DM and VM methods lie in between, with the former closer to the EA method and the latter closer to the EW method. The DM method runs substantially faster than the VM method and performs slightly better in outlier detection and annotation tasks.
Keywords: Histogram, variable bin width, error minimization, change point detection, outlier detection
DOI: 10.3233/IDA-216316
Journal: Intelligent Data Analysis, vol. 27, no. 1, pp. 5-29, 2023
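Illustrative sketch: The abstract frames bin construction as optimal change point detection followed by per-bin interquartile range screening. The Python sketch below is one possible reading of that idea, not the authors' algorithm. It assumes that the DM error of a bin is the sum of absolute deviations from the bin median and that the VM error is the sum of squared deviations from the bin mean; the abstract does not define either. The function names (`segment_cost`, `optimal_bins`, `iqr_outliers`), the fixed bin count `k`, the conventional 1.5 IQR factor, and the brute-force dynamic program are all hypothetical choices made for illustration.

```python
import numpy as np


def segment_cost(x, i, j, kind):
    """Error of placing the sorted samples x[i:j] in a single bin."""
    seg = x[i:j]
    if kind == "DM":
        # Assumption: DM error = sum of absolute deviations from the bin median.
        return float(np.abs(seg - np.median(seg)).sum())
    # Assumption: VM error = sum of squared deviations from the bin mean.
    return float(((seg - seg.mean()) ** 2).sum())


def optimal_bins(values, k, kind="DM"):
    """Split the sorted samples into k contiguous bins with minimal total error.

    Returns the sorted samples and k + 1 cut indices; bin b holds
    x[cuts[b]:cuts[b + 1]].
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cost = np.full((k + 1, n + 1), np.inf)  # cost[b, j]: best error for x[:j] in b bins
    back = np.zeros((k + 1, n + 1), dtype=int)  # start index of the last bin
    cost[0, 0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1, i] + segment_cost(x, i, j, kind)
                if c < cost[b, j]:
                    cost[b, j] = c
                    back[b, j] = i
    # Walk the back-pointers from the last sample to recover the change points.
    cuts = [n]
    for b in range(k, 0, -1):
        cuts.append(int(back[b, cuts[-1]]))
    return x, cuts[::-1]


def iqr_outliers(seg, factor=1.5):
    """Samples of one bin lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    seg = np.asarray(seg, dtype=float)
    q1, q3 = np.percentile(seg, [25, 75])
    iqr = q3 - q1
    return seg[(seg < q1 - factor * iqr) | (seg > q3 + factor * iqr)]


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 10.0, 10.1, 9.9]
    x, cuts = optimal_bins(data, k=3, kind="DM")
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        print(x[lo:hi], iqr_outliers(x[lo:hi]))
```

This version recomputes every candidate bin error from scratch, so it is only practical for small samples. Because it minimizes different quantities for `kind="DM"` and `kind="VM"` under the stated assumptions, it says nothing about the relative running times reported in the abstract.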