Tobacco use status from clinical notes using Natural Language Processing and rule based algorithm

Hegde, Harshad; Shimpi, Neel; Glurich, Ingrid; Acharya, Amit

doi:10.3233/THC-171127

Article type: Research Article

Authors: Hegde, Harshad | Shimpi, Neel | Glurich, Ingrid | Acharya, Amit^*

Affiliations: Center for Oral and Systemic Health, Marshfield Clinic Research Institute, Marshfield Clinic, Marshfield, WI-54449, USA

Correspondence: [*] Corresponding author: Amit Acharya, BDS, MS, PhD, Executive Director, Research Scientist, Center for Oral and Systemic Health, Marshfield Clinic Research Institute, Marshfield Clinic, 1000 North Oak Avenue, Marshfield, WI 54449, USA. Tel.: +1 715 221 6423; E-mail: acharya.amit@marshfieldresearch.org.

Abstract: BACKGROUND: This cross-sectional retrospective study used Natural Language Processing (NLP) to extract tobacco-use-associated variables from clinical notes documented in the Electronic Health Record (EHR). OBJECTIVE: To develop a rule-based algorithm for determining the present status of the patient’s tobacco use. METHODS: Clinical notes (n = 5,371 documents) from 363 patients were mined and classified by NLP software into four classes: “Current Smoker”, “Past Smoker”, “Nonsmoker” and “Unknown”. Two coders manually classified these documents into the same four classes, producing the document-level gold standard classification (DLGSC). The same two coders then derived a tobacco-use status for each patient from the statuses of that patient’s individual documents, producing the patient-level gold standard classification (PLGSC). The DLGSC and PLGSC were compared with the results of the NLP and the rule-based algorithm, respectively. RESULTS: The initial Cohen’s kappa (n = 1,000 documents) was 0.9448 (95% CI = 0.9281–0.9615), indicating strong agreement between the two raters. For a subsequent set of 371 documents, Cohen’s kappa was 0.9889 (95% CI = 0.979–1.000). The F-measures for the four classes were 0.700, 0.753, 0.839 and 0.988 for the document-level classification and 0.580, 0.771, 0.730 and 0.933 for the patient-level classification, respectively. CONCLUSIONS: NLP and the rule-based algorithm showed utility for deriving the present tobacco-use status of patients. Ongoing work targets further improvement in precision to enhance the translational value of the tool.
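As an illustration of the two evaluation measures named in the abstract, the following minimal Python sketch computes Cohen's kappa between two raters and a per-class F-measure against a gold standard. It is not the authors' code: the function names `cohen_kappa` and `f_measure` are hypothetical, and the six labels per rater are made-up toy values, not data from the study. The F-measure is assumed here to be the balanced F1 (harmonic mean of precision and recall).

```python
from collections import Counter

# The four classes named in the abstract.
CLASSES = ["Current Smoker", "Past Smoker", "Nonsmoker", "Unknown"]


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def f_measure(gold, predicted, label):
    """Balanced F-measure (F1) for one class against a gold standard."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for six hypothetical documents (NOT data from the study).
rater_a = ["Current Smoker", "Past Smoker", "Nonsmoker",
           "Unknown", "Nonsmoker", "Current Smoker"]
rater_b = ["Current Smoker", "Past Smoker", "Nonsmoker",
           "Unknown", "Unknown", "Current Smoker"]

kappa = cohen_kappa(rater_a, rater_b)
# Treat rater_a as the gold standard and rater_b as the system output.
scores = {c: f_measure(rater_a, rater_b, c) for c in CLASSES}

print(f"kappa = {kappa:.4f}")
for cls, score in scores.items():
    print(f"F-measure[{cls}] = {score:.3f}")
```

Computing the F-measure per class, rather than as a single pooled score, mirrors the way the abstract reports four separate values at each of the document and patient levels.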

Keywords: Data mining, decision support systems clinical, health information systems, smoking, electronic health records, information storage and retrieval

Journal: Technology and Health Care, vol. 26, no. 3, pp. 445-456, 2018

Received 11 November 2017

Accepted 18 January 2018

Published: 29 June 2018
