Affiliations: [a] NORC at the University of Chicago, Bethesda, MD, USA | [b] Centers for Disease Control and Prevention, National Center for Health Statistics, Hyattsville, MD, USA
Correspondence: [*] Corresponding author: Scott R. Campbell, NORC at the University of Chicago, 4350 East-West Highway, 8th Floor, Bethesda, MD, 20814, USA. Tel.: +1 301 634 9431; E-mail: Campbell-Scott@norc.org.
Abstract: Record linkage enables survey data to be integrated with other data sources, expanding the analytic potential of both sources. However, depending on the number of records being linked, the processing time can be prohibitive. This paper describes a case study using a supervised machine learning algorithm known as the Sequential Coverage Algorithm (SCA). The SCA was used to develop the join strategy for linking two data sources: the National Center for Health Statistics’ (NCHS) 2016 National Hospital Care Survey (NHCS) and the Centers for Medicare & Medicaid Services (CMS) Enrollment Database (EDB). Because of the size of the CMS data, common record joining methods (i.e., blocking) were used to reduce the number of pairs that need to be evaluated while still identifying the vast majority of matches. NCHS conducted a case study examining how the SCA improved the efficiency of blocking, and this paper describes how the SCA was used to design the blocking for this linkage.
Keywords: National Center for Health Statistics, Centers for Medicare & Medicaid Services, National Hospital Care Survey, record linkage, blocking, machine learning