Issue title: Special Issue – SAS Global Forum 2018
Guest editors: Jennifer Waller and Tyler Smith
Article type: Research Article
Authors: Sloan, Stephen [a], * | Lafler, Kirk Paul [b]
Affiliations: [a] Cream Ridge, NJ 08514, USA | [b] Spring Valley, CA 91978, USA
Correspondence: [*] Corresponding author: Stephen Sloan, 42 Tower Road, Cream Ridge, NJ 08514, USA. Tel.: +1 917 375 2937; Fax: +1 609 758 5240; E-mail: Stephen.b.sloan@accenture.com.
Abstract: Data comes in all forms, shapes, sizes and complexities. SAS® users know all too well that data stored in files and data sets can be, and often is, problematic and plagued with a variety of issues. Although today’s statistical software programs are extremely powerful, they are typically not designed to overcome poor quality data. This paper describes and recommends a comprehensive data preparation and fuzzy matching process that enables improved statistical modeling. Statistical techniques are also available for comparing the results of the process. Most statistical software users are aware that two or more data files can be joined, or combined, without a problem when the data files have identifiers with unique and reliable values. However, many files do not have unique identifiers, or “keys”, and need to be joined using character values, like names or E-mail addresses. To add to the difficulty and confusion, these identifiers might be spelled differently, or use different abbreviation or capitalization protocols. This paper describes a versatile 6-step approach to handling data preparation and fuzzy matching issues for improved statistical modeling. The steps include identifying and understanding potential matching scenarios; exploring data values and data types; data cleaning and validation; data transformation; traditional merge and join techniques; and an assortment of techniques to successfully merge, join and match less than perfect, or “messy”, data by doing phonetic matching with the SOUNDEX algorithm and fuzzy matching with the SPEDIS, COMPLEV, and COMPGED special-purpose character-handling functions. Although the programming techniques described in this paper are illustrated using SAS code, many, if not most, of the techniques can be applied to any software platform that supports character-handling capabilities.
Keywords: SAS, fuzzy matching, character-handling functions, phonetic matching, SOUNDEX, SPEDIS, edit distance, Levenshtein, COMPLEV, COMPGED
DOI: 10.3233/MAS-180447
Journal: Model Assisted Statistics and Applications, vol. 13, no. 4, pp. 367-375, 2018
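
The abstract notes that the techniques carry over to any platform with character-handling capabilities. As a rough illustration only, the following Python sketch shows the idea behind Levenshtein-based fuzzy matching after a simple cleaning step. It is not the authors' SAS code and does not reproduce the exact behavior of COMPLEV; the function names `levenshtein`, `normalize`, and `fuzzy_match`, the `max_distance` threshold, and the sample names are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance: insertions, deletions and substitutions cost 1 each."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalize(value: str) -> str:
    """Hypothetical cleaning step: trim, collapse repeated whitespace, uppercase."""
    return " ".join(value.split()).upper()


def fuzzy_match(left, right, max_distance=2):
    """Pair each left value with its closest right value, if within max_distance."""
    if not right:
        return []
    matches = []
    for l_value in left:
        best = min(right, key=lambda r: levenshtein(normalize(l_value), normalize(r)))
        distance = levenshtein(normalize(l_value), normalize(best))
        if distance <= max_distance:
            matches.append((l_value, best, distance))
    return matches


if __name__ == "__main__":
    # Made-up sample values with spelling and capitalization differences.
    left_names = ["Jon Smith", "Acme Corp."]
    right_names = ["JOHN SMITH", "ACME CORP", "Zenith Ltd"]
    for pair in fuzzy_match(left_names, right_names):
        print(pair)
```

Normalizing case and whitespace before measuring distance reflects the order of steps in the abstract, where cleaning and transformation precede fuzzy matching. The threshold of 2 is arbitrary and would need tuning against real data.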