Pulmonary nodules are commonly encountered in clinical

Similar documents
Validation of two models to estimate the probability of malignancy in patients with solitary pulmonary nodules

Comparison of three mathematical prediction models in patients with a solitary pulmonary nodule

2018 OPTIONS FOR INDIVIDUAL MEASURES: REGISTRY ONLY. MEASURE TYPE: Process

Approach to Pulmonary Nodules

DENOMINATOR: All final reports for CT imaging studies with a finding of an incidental pulmonary nodule for patients aged 35 years and older

Veterans Health Administration Lung Cancer Screening Demonstration Project: Results & Lessons Learned

What to know and what to make of it

PULMONARY NODULES DETECTED INCIDENTALLY OR BY SCREENING: LOTS OF GUIDELINES BUT WHERE IS THE EVIDENCE?

A Clinical Model To Estimate the Pretest Probability of Lung Cancer in Patients With Solitary Pulmonary Nodules*

An Automated Method for Identifying Individuals with a Lung Nodule Can Be Feasibly Implemented Across Health Systems

Published Pulmonary Nodule Guidelines A Synthesis

CT Screening for Lung Cancer for High Risk Patients

May-Lin Wilgus. A. Study Purpose and Rationale

Pulmonary Nodules. Michael Morris, MD

Robert J. McKenna M.D. Chief, Thoracic Surgery Cedars Sinai Medical Center

Role of Chest Low-dose Computed Tomography in Elderly Patients with Suspected Acute Pulmonary Infection in the Emergency Room

Lung Cancer Screening. Eric S. Papierniak, DO NF/SG VHA UF Health

Lung Cancer Screening

A Summary from the 2013World Conference on Lung Cancer Sydney, Australia

SERRATUS ANTERIOR MUSCLE

Using Electronic Medical Records to Identify Complex Health Outcomes

LUNG NODULES: MODERN MANAGEMENT STRATEGIES

Time to Colonoscopy After a Positive Fecal Test and Risk of Colorectal Cancer Outcomes

Lung Cancer Risk Associated With New Solid Nodules in the National Lung Screening Trial

Lung Cancer Screening: Benefits and limitations to its Implementation

Keeping Abreast of Breast Imagers: Radiology Pathology Correlation for the Rest of Us

Screening Programs background and clinical implementation. Denise R. Aberle, MD Professor of Radiology and Engineering

Proportion and characteristics of transient nodules in a retrospective analysis of pulmonary nodules

U"lity of Screening and Surveillance Using Nuclear Medicine Methodology. (Lung Cancer) Michael M. Graham, PhD, MD University of Iowa

Christine Argento, MD Interventional Pulmonology Emory University

HCC Screening Compliance Module: A Tool for Automated Compliance Monitoring, Clinician Notification, and Efficacy Research

BIOSTATISTICAL METHODS

After primary tumor treatment, 30% of patients with malignant

Outcomes in the NLST. Health system infrastructure needs to implement screening

Adam J. Hansen, MD UHC Thoracic Surgery

Age-adjusted vs conventional D-dimer thresholds in the diagnosis of venous thromboembolism

Lung Cancer Screening

American College of Radiology ACR Appropriateness Criteria

Comparing ICD9-Encoded Diagnoses and NLP-Processed Discharge Summaries for Clinical Trials Pre-Screening: A Case Study

New Criteria for Ventilation-Perfusion Lung Scan Interpretation: A Basis for Optimal Interaction with Helical CT Angiography 1

Endobronchial Ultrasound in the Diagnosis & Staging of Lung Cancer

Pulmonary Nodules: When to worry, when to chill. Douglas Arenberg Associate Professor Pulmonary & Critical Care

A Comprehensive Cancer Center Designated by the National Cancer Institute

LUNG CANCER SCREENING: LUNG CANCER SCREENING: THE TIME HAS COME LUNG CANCER: A NATIONAL EPIDEMIC

Screening for Lung Cancer: Are We There Yet?

The Virtual Lung Nodule Clinic

VHA Demonstration Project for Lung Cancer Screening Using Low-Dose Chest CT Screening

Equity Issues with Lung Cancer Screening Prevent Cancer - Quantitative Imaging Workshop November 5-6, 2018 Alexandria, VA

An Update: Lung Cancer

PET CT for Staging Lung Cancer

Medical Policy. MP Molecular Testing in the Management of Pulmonary Nodules

MANAGEMENT RECOMMENDATIONS

London Medical Imaging & Artificial Intelligence Centre for Value-Based Healthcare. Professor Reza Razavi Centre Director

DOES SMOKING MARIJUANA INCREASE THE RISK OF CHRONIC OBSTRUCTIVE PULMONARY DISEASE?

Role of CT imaging to evaluate solitary pulmonary nodule with extrapulmonary neoplasms

SCBT-MR 2015 Incidentaloma on Chest CT

The solitary pulmonary nodule: Assessing the success of predicting malignancy

Lung Cancer Screening

NOTICE OF SUBSTANTIAL AMENDMENT

Pulmonologist s Perspective

None

Pulmonary Sarcoidosis - Radiological Evaluation

Larry Tan, MD Thoracic Surgery, HSC. Community Cancer Care Educational Conference October 27, 2017

Critical Review Form Clinical Decision Analysis

Big Data Phenomics in the VA. Outline

DISCLOSURE. Dr. Plummer has declared no conflicts of interest related to the content of his presentation.

Original Article. Emergency Department Evaluation of Ventricular Shunt Malfunction. Is the Shunt Series Really Necessary? Raymond Pitetti, MD, MPH

Noninvasive Differential Diagnosis of Pulmonary Nodules Using the Standardized Uptake Value Index

Lung Cancer Screening: To Screen or Not to Screen?

Ethnic Disparities in the Treatment of Stage I Non-small Cell Lung Cancer. Juan P. Wisnivesky, MD, MPH, Thomas McGinn, MD, MPH, Claudia Henschke, PhD,

Learning Objectives. 1. Identify which patients meet criteria for annual lung cancer screening

Low birthweight and respiratory disease in adulthood: A population-based casecontrol

Measure #405: Appropriate Follow-up Imaging for Incidental Abdominal Lesions National Quality Strategy Domain: Effective Clinical Care

Section II: Detailed Measure Specifications

Multidisciplinary quality improvement initiative to standardize reporting of lung cancer screening

Radiographic Assessment of Response An Overview of RECIST v1.1

MEDICAL POLICY SUBJECT: LOW-DOSE COMPUTED TOMOGRAPHY (LDCT) FOR LUNG CANCER SCREENING. POLICY NUMBER: CATEGORY: Technology Assessment

CT screening for lung cancer. Should it be done in the Indian context?

Screening for High Blood Pressure in Adults During Ambulatory Nonprimary Care Visits: Opportunities to Improve Hypertension Recognition

NCCN Guidelines as a Model of Extended Criteria for Lung Cancer Screening

Goals of Presentation

LUNG CANCER SCREENING WHAT S THE IMPACT? Nitra Piyavisetpat, MD Department of Radiology Chulalongkorn University

Radiological staging of lung cancer. Shukri Loutfi,MD,FRCR Consultant Thoracic Radiologist KAMC-Riyadh

SCBT-MR 2016 Lung Cancer Screening in Practice: State of the Art

MEASUREMENT OF EFFECT SOLID TUMOR EXAMPLES

Indications and methods of surgical treatment of solitary pulmonary nodule

I9 COMPLETION INSTRUCTIONS

PET/CT in lung cancer

Diagnostic challenge: Sclerosing Hemangioma of the Lung. Department of Medicine, Division of Pulmonary and Critical Care, Lincoln Medical and

Subject: Virtual Bronchoscopy and Electromagnetic Navigational Bronchoscopy for Evaluation of Peripheral Pulmonary Lesions

2018 OPTIONS FOR INDIVIDUAL MEASURES: REGISTRY ONLY. MEASURE TYPE: Efficiency

Supplemental Figure 1. Gating strategies for flow cytometry and intracellular cytokinestaining

When to suspect Wegener Granulomatosis: A radiologic review

The Itracacies of Staging Patients with Suspected Lung Cancer

Diffuse Interstitial Lung Diseases: Is There Really Anything New?

Validation of Clinical Outcomes in Electronic Data Sources

Positron Emission Tomography in Lung Cancer

Projected Outcomes Using Different Nodule Sizes to Define a Positive CT Lung Cancer Screening Examination

Lung Cancer Diagnosis for Primary Care

Transcription:

ORIGINAL ARTICLE Automated Identification of Patients With Pulmonary Nodules in an Integrated Health System Using Administrative Health Plan Data, Radiology Reports, and Natural Language Processing Kim N. Danforth, ScD,* Megan I. Early, MPH,* Sharon Ngan, MD, Anne E. Kosco, MD Chengyi Zheng, PhD,* Michael K. Gould, MD, MS* Introduction: Lung nodules are commonly encountered in clinical practice, yet little is known about their management in community settings. An automated method for identifying patients with lung nodules would greatly facilitate research in this area. Methods: Using members of a large, community-based health plan from 2006 to 2010, we developed a method to identify patients with lung nodules, by combining five diagnostic codes, four procedural codes, and a natural language processing algorithm that performed free text searches of radiology transcripts. An experienced pulmonologist reviewed a random sample of 116 radiology transcripts, providing a reference standard for the natural language processing algorithm. Results: With the use of an automated method, we identified 7112 unique members as having one or more incident lung nodules. The mean age of the patients was 65 years (standard deviation 14 years). There were slightly more women (54%) than men, and Hispanics and non-whites comprised 45% of the lung nodule cohort. Thirtysix percent were never smokers whereas 11% were current smokers. Fourteen percent of the patients were subsequently diagnosed with lung cancer. The sensitivity and specificity of the natural language processing algorithm for identifying the presence of lung nodules were 96% and 86%, respectively, compared with clinician review. Among the true positive transcripts in the validation sample, only 35% were solitary and unaccompanied by one or more associated findings, and 56% measured 8 to 30 mm in diameter. *Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA; Keck School of Medicine, University of Southern California, Los Angeles, CA; Department of Internal Medicine, Kaiser Permanente Southern California, Lancaster, CA; and Department of Radiology, Kaiser Permanente Southern California, Los Angeles, CA. Disclosure: Dr. Gould has a grant with the National Cancer Institute, and a pending grant with the Patient-Centered Outcomes Research Institute, related to lung nodules. The other authors declare no conflicts of interest. This project was funded in part by the Los Angeles Basin Clinical and Translational Science Institute (LAB-CTSI, grant 1UL1RR031986-01); and in part by the Department of Research and Evaluation within the Southern California Permanente Medical Group with funds from the Medical Group and the Kaiser Permanente Community Benefit Fund. Address for correspondence: Kim N. Danforth, ScD, Department of Research and Evaluation, Kaiser Permanente Southern California, S Los Robles Ave, 2nd Floor, Pasadena, CA 91101. E-mail: kim.n.danforth@kp.org Copyright 2012 by the International Association for the Study of Lung Cancer ISSN: 1556-0864/12/0708-1257 Conclusions: A combination of diagnostic codes, procedural codes, and a natural language processing algorithm for free text searching of radiology reports can accurately and efficiently identify patients with incident lung nodules, many of whom are subsequently diagnosed with lung cancer. Key Words: lung nodules, pulmonary coin lesion, lung cancer, natural language processing (NLP), validation study (J Thorac Oncol. 2012;7: 1257 1262) Pulmonary nodules are commonly encountered in clinical practice, but their management in community settings has not been well studied. Alternatives for the clinical management of patients with screening-detected or incidentally identified lung nodules include imaging tests, nonsurgical biopsy, surgical resection, and computed tomography (CT) surveillance, each with its own potential benefits and harm. 1 3 The relative safety, effectiveness, and efficiency of these different approaches have not been determined. Current guidelines reflect expert consensus, based largely on uncontrolled studies of diagnostic accuracy or the risk of procedure-related complications. Furthermore, although the solitary pulmonary nodule (SPN) was once the most common presentation, nodules detected in the CT era are often small, multiple, and accompanied by other findings. 4 The frequency of different clinical presentations in the CT era and the potential implications for clinical management have not been described. Research to date has been limited, in part, by the difficulty of identifying patients with pulmonary nodules in the general population. In theory, the use of diagnostic codes might be an efficient means to identify such patients. However, pulmonary nodules are captured under several different International Classification of Diseases-Ninth Revision (ICD-9) codes, each of which is also used to code other conditions of the lung. Virtually all lung nodules are either discovered or confirmed by chest CT, the use of which can be ascertained by Current Procedural Terminology (CPT) codes, but even the combined use of ICD-9 and CPT codes results in an impractically large number of patient records to review through chart abstraction, many of which will not include any reference to lung nodules. Journal of Thoracic Oncology Volume 7, Number 8, August 2012 1257

Danforth et al. Journal of Thoracic Oncology Volume 7, Number 8, August 2012 During the past decade, automatic text processing has been used successfully to extract data from electronic radiology and other reports for a variety of purposes. For instance, natural language processing (NLP) tools have been applied in health research for prescreening in clinical trials, 5 case identification, 6 adverse-event detection, 7 and chart-review facilitation. 8 Therefore, we sought to develop a method to identify patients with incident lung nodules using electronic administrative records and automatic text processing of diagnostic radiology reports within a large, integrated health system. The successful development of a validated, automated method would remove a significant obstacle to this area of research, thereby facilitating future studies of comparative effectiveness. To illustrate the usefulness of the method, we provide descriptive information about the large cohort of patients with pulmonary nodules identified in this study. METHODS We used a multistep process to identify patients with lung nodules. Briefly, this entailed identification of healthplan members who underwent chest CT within 30 days of receiving a diagnosis (ICD-9) code that indicated the possible presence of one or more lung nodules, development of an NLP algorithm to scan the relevant radiology transcripts and classify them regarding the presence or absence of nodules, refinement of the NLP algorithm based on preliminary validation work, validation of the final NLP algorithm against clinician review of dictated transcripts, and application of the final NLP program to identify and then describe (all) patients with incident lung nodules diagnosed in 2006 2010. Setting Kaiser Permanente Southern California (KPSC) is an integrated health care system that currently serves more than 3.4 million members. KPSC members are racially, ethnically, and socioeconomically diverse, and reside in urban, suburban, and rural communities from Bakersfield to San Diego. Implementation of the KPSC electronic health record was largely complete by 2006. Diagnostic and procedural codes, radiology transcripts, and information about patient demographic and behavioral characteristics key variables of interest for the current study are available electronically in these records. Study Population Individuals who were KPSC members at any point during the period from 2006 to 2010 were considered for inclusion in this study. We assembled five nonmutually exclusive subcohorts of members with possible lung nodules during this period by using five ICD-9 codes identified a priori by clinical experience and judgment: 793.1 (nonspecific [abnormal] findings on radiological and other examinations of lung field), 786.6 (swelling, mass or lump in chest), 518.89 (other diseases of lung not classified elsewhere), 519.8 (other diseases of respiratory system not classified elsewhere), and 519.9 (unspecified disease of the respiratory system). 9 A corresponding CPT code for a CT scan within 30 days of the FIGURE 1. Identification of a cohort of 7112 Kaiser Permanente Southern California patients with incident lung nodules during 2006 2010, using an automated process based on diagnosis codes, procedural codes, and natural language processing of radiology transcripts. diagnosis (ICD-9) date was required for inclusion in a given subcohort. We identified relevant CT scans by using the following four CPT codes: 71250 (CT thorax without contrast), 71260 (CT thorax with contrast), 71270 (CT thorax without and with contrast), and 71275 (CT angiography chest). The earlier of the ICD-9 or CT scan dates was used as the reference date for exclusions unless otherwise specified. Within each ICD-9 group, we used the first relevant diagnostic code between 2006 and 2010 to identify CT transcripts for scanning by the NLP algorithm. We excluded individuals with a (possible) lung nodule diagnosis before January 1, 2006, the start of our study period. These exclusions were made within each ICD-9 group independently of the other groups to allow each candidate ICD-9 code to be evaluated separately. We also excluded individuals with a diagnosis of lung cancer (ICD-9 codes 162.2, 162.3, 162.4, 162.5, 162.8, or 162.9) on or before the dates of the relevant diagnosis or CT scan (Fig. 1). Development and Validation of the NLP Algorithm We extracted radiology transcripts for CT scans performed 30 days before or after the reference date. The NLP rules were developed via an iterative process. First, preliminary NLP rules were developed in consultation with a pulmonologist (M.K.G.). Essentially, the algorithm searched for positive key words, such as nodule, mass, opacity, SPN, and GGO, in the presence of explicit dimensions of 30 mm or less or similar qualitative information. These sentences were considered to be positive as long as they did not include a reference to a non-lung site (e.g., lymph, hepatic, kidney ) that was considered to be a negative key word in the NLP algorithm. If a sentence contained a negative key word, the sentence was still considered to be positive if the sentence also included terms indicating a possible lung 1258 Copyright 2012 by the International Association for the Study of Lung Cancer

Journal of Thoracic Oncology Volume 7, Number 8, August 2012 Identification of Patients With Pulmonary Nodules replacement. However, some strata contained fewer than five individuals and thus only 116 patients were included in the validation sample; in particular, ICD-9 codes 519.8 and 519.9 were infrequently used in our setting. A positive transcript was defined as one that noted one or more nodules (but <10), with none measuring more than 30 mm in diameter, and with or without associated pneumonia, atelectasis, mediastinal lymph node enlargement, pleural effusion, or pleural thickening. Performance of the NLP algorithm was determined for each ICD-9 code separately to assess whether it varied by the ICD-9 code. On the basis of the NLP performance characteristics and frequency of nodules in each ICD-9 subcohort, a final method to identify incident lung nodules was selected and applied to the KPSC study population to assemble and describe a community-based cohort of patients with incident lung nodules. FIGURE 2. Natural language processing (NLP) algorithm for processing radiology transcripts from computed tomography scans of the chest. nodule (e.g., lung, thorax, pulmonary ). At the transcript level, a positive sentence resulted in the transcript being considered a positive NLP result unless any other sentence within the transcript referred to a pulmonary lesion that measured larger than 30 mm in widest diameter, indicating a lung mass. The transcripts were extracted retrospectively, thus no attempt was made to standardize the reading or notating of the CT scans. Instead, we designed our algorithm to capture a wide range of commonly used terms that indicate the presence of a lung nodule, as noted by the radiologist based on his or her judgment. We compared the performance of the initial NLP algorithm against a clinician review of transcripts from 50 randomly selected patients in the ICD-9 793.1 and 786.6 subcohorts (selected for preliminary testing because a priori they were considered the best candidates in whom lung nodules could be identified). The initial algorithm had a sensitivity of 93% and a specificity of 74% for lung nodule identification compared with the clinician review. Minor adjustments were then made to the algorithm to further improve its performance. For instance, the algorithm was revised to allow the use of explicit dimensions in the sentence immediately after a positive key word instead of requiring the dimensions to occur within the same sentence. The revised NLP algorithm (Fig. 2) was then evaluated against clinician review of a new, stratified random sample of 116 patients, with stratification on ICD-9 group and calendar year. Up to five cases per year per ICD-9 group were eligible (maximum sample size of 125) and were randomly selected using equal probability weighting and sampling without RESULTS From 2006 to 2010, a total of 12,389 individuals had 13,091 radiology reports from CT scans performed within 30 days of the diagnostic codes of interest (Fig. 1). Approximately 10% of individuals appeared in more than one ICD-9 category. Most individuals came from ICD-9 groups 793.1 (n = 4204 individuals), 786.6 (n = 3488 individuals), or 518.89 (n = 5922 individuals). Relatively few individuals came from ICD-9 groups 519.8 (n = 42 individuals) or 519.9 (n = 31 individuals). The frequency of a positive radiology transcript varied by ICD-9 code from 26% for ICD-9 code 519.9 to 80% for code 518.89 (Table 1), as determined by clinician review of a random sample of up to 25 radiology transcripts in each ICD-9 subcohort. Thus, ICD-9 code 518.89, despite being used to code other diseases of lung not elsewhere classified, was both the largest group and had the highest percentage of transcripts indicating the presence of a lung nodule. Application of the final iteration of the NLP algorithm to all radiology transcripts from each of the ICD-9 subcohorts yielded a positive lung nodule result in a total of 7112 unique patients (Fig. 1). Validation of the Final NLP Algorithm Compared with clinician review, the sensitivity of the NLP algorithm ranged from 93% to % across the five ICD-9 codes (Table 1). Specificity ranged from 79% to %, and the positive predictive value ranged from 78% to %. Overall, the NLP algorithm performed with 96% sensitivity (95% CI: 88 99%) and 86% specificity (95% CI: 75 93%), with an 87% positive predictive value (95% CI: 77% 93%) and a 96% negative predictive value (95% CI: 87% 99%). That is, if the NLP algorithm determined that a radiology transcript did not contain a reference to a lung nodule, the NLP algorithm was almost always correct (96%). If the NLP algorithm indicated that a transcript referenced a lung nodule, it was slightly less often correct (87%) by design, given our greater willingness to accept false-positive results from the NLP algorithm than false-negative results. Within the validation sample, eight false-positive cases were identified by the NLP algorithm compared with clinician review. The most common reason for an NLP false-positive was Copyright 2012 by the International Association for the Study of Lung Cancer 1259

Danforth et al. Journal of Thoracic Oncology Volume 7, Number 8, August 2012 TABLE 1. Performance of an NLP Algorithm for Lung Nodule Identification Compared With Clinician Review of Computed Tomography Scan Transcripts From 116 Patients ICD-9 Subcohort Sample Size for Validation Positive Transcripts a % (N) 793.1 25 44 (11) (11/11) 786.6 25 56 (14) 93 (13/14) 518.89 25 80 (20) 95 (19/20) 519.8 22 32 (7) (7/7) 519.9 19 26 (5) (5/5) All Groups 116 49 (57) 96 (55/57) NLP Accuracy Compared With Clinician Review Sensitivity % (N) Specificity % (N) PPV % (N) NPV % (N) 79 (11/14) 82 (9/11) (5/5) 87 (13/15) 93 (13/14) 86 (51/59) 79 (11/14) 87 (13/15) (19/19) 78 (7/9) 83 (5/6) 87 (55/63) (11/11) 90 (9/10) 83 (5/6) (13/13) (13/13) 96 (51/53) a As determined by expert clinician review, the reference standard in this study. NLP, natural language processing; ICD-9, International Classification of Diseases-Ninth Revision; N, sample size; PPV, positive predictive value; NPV, negative predictive value. TABLE 2. Characteristics of 7112 Patients With Lung Nodules a in Kaiser Permanente Southern California, 2006 2010 Characteristics N=7112 Mean age at diagnosis, yrs (standard deviation) 65 (14) Female (%) 54 Race/Ethnicity (%) Non-Hispanic white 55 Non-Hispanic black 10 Hispanic 19 Asian/Pacific Islander 8 Other 8 Smoking status (%) Current smoker 11 Former smoker 35 Never smoker 36 Missing/other 18 Comorbidities (%) Asthma 13 Chronic obstructive pulmonary disease 21 Coronary heart disease 37 Chronic kidney disease 14 Heart failure 8 Charlson comorbidity index 0 37 1 29 2 16 3+ 18 Lung cancer during follow-up (%) 14 a As determined by an automated process based on diagnosis codes, procedural codes, and natural language processing of radiology transcripts. the presence of a positive key word and dimension followed by a nonspecific statement qualifying the previous reference to a lung nodule. For instance, references to a pulmonary nodule were followed by statements such as likely representing granulomatous disease, likely representing artifact, or even no longer visualized. Two false-negative cases were missed by the NLP algorithm, neither of which indicated that revisions to the NLP algorithm would be warranted. For instance, there was human error in documenting the nodule size, which was possible to assess by reading the transcript as a whole but which the NLP algorithm appropriately excluded based on the rules. Patient and Nodule Characteristics There were 7112 (unique) members with at least one radiology report indicating the presence of a lung nodule (Table 2). On average, these patients were 65 years old, with 14% being younger than 50 years old. There were slightly more women (54%) than men. The majority of members were non-hispanic and white (55%) but almost 20% were Hispanic, with smaller numbers of non-hispanic blacks (10%) and Asian/Pacific Islanders (8%). Approximately 36% of the sample never smoked, and 11% were current smokers (Table 2). In the positive lung nodule cohort, most KPSC members (63%) had a Charlson comorbidity index score of 1 or higher, with almost 20% of patients having a score of 3 or higher (Table 2). Other comorbidities of particular relevance were also common, with 21% of patients diagnosed with chronic airway obstruction and 13% with asthma. Coronary heart disease was present among almost 40% of this population, and 8% had been diagnosed with heart failure. On average, patients with lung nodules diagnosed between 2006 and 2010 (as determined by our automated method) had a median postdiagnosis follow-up of 2.9 years through December 31, 2011. The average follow-up (membership) time was longer for those diagnosed in 2006 (median, 5.1 years) compared with those diagnosed in 2010 (median, 1.9 years). Among the 1010 (14%) who were diagnosed with lung cancer by December 31, 2011, the median time between lung nodule identification and lung cancer diagnosis was 0.2 years (interquartile range [IQR], 0.09 0.42 years). Eight percent of lung cancer cases were diagnosed more than 2 years after lung nodule identification. 1260 Copyright 2012 by the International Association for the Study of Lung Cancer

Journal of Thoracic Oncology Volume 7, Number 8, August 2012 Identification of Patients With Pulmonary Nodules Table 3. Radiographic Findings for 57 Nodules Identified by Clinician Review of a Random Sample of CT Transcripts Nodule Characteristics from CT Transcripts a ICD-9 Subcohort 793.1 786.6 518.89 519.8 519.9 Total Single nodule 7 7 5 1 0 20 Multiple nodules 4 5 11 5 4 29 (2 9 nodules) b Enlarged mediastinal 2 3 1 6 lymph nodes Pneumonia/Atelectasis 1 3 2 2 8 Pleural effusion 1 1 2 Apical pleuralparenchymal 2 2 scar Total 11 14 20 7 5 57 a Characteristics presented for the largest nodule 8 30 mm in diameter; categories are not mutually exclusive. b Transcripts that described the presence of 10 nodules, diffuse nodules, innumerable nodules, or too many nodules to count were not considered to be positive. CT, computed tomography; ICD9, International Classification of Diseases-Ninth Revision. Among 57 members with a true positive transcript for one or more lung nodules, the largest nodule measured 8 to 30 mm in 56%. Only 20 transcripts (35%) noted a solitary nodule without other associated findings (Table 3). Among the remaining transcripts, 29 noted multiple (<10) nodules, eight referenced evidence of pneumonia or atelectasis, six had enlarged mediastinal lymph nodes, and four noted pleural effusion and/or apical pleural thickening. DISCUSSION In this community-based sample of members of a large health plan, we developed, validated, and applied a new method for identifying incident lung nodules in communitybased settings. We were able to accurately identify patients who had one or more pulmonary nodules noted on a radiology transcript by using a combination of ICD-9 codes, CPT codes, and a novel algorithm for NLP of free text from these transcripts. Application of this algorithm enabled us to report the sociodemographic and clinical characteristics of the largest community-based sample of patients with incidentally detected lung nodules yet to be described. The inability to readily identify patients with lung nodules in community settings has been a key factor limiting research on how best to manage these patients. Most prior studies of lung nodules were performed in tertiary care settings, and many enrolled nonrepresentative samples in uncontrolled studies of diagnostic accuracy, screening, or surgical outcomes. 10 Our study confirms the inadequacy of ICD-9 and CPT codes alone to identify lung nodules accurately, and underlines the importance of using multiple ICD-9 codes to capture incident lung nodules within a large health system. In a random sample of 116 members who were identified by a combination of ICD-9 and CPT codes, clinician review of radiology transcripts identified one or more lung nodules in only half of them. The proportion of transcripts noting one or more lung Copyright 2012 by the International Association for the Study of Lung Cancer nodules varied across ICD-9 groups and ranged from 26% to 80%. Although the combination of ICD-9 code 518.89 and CT scan dates may have performed adequately for some research purposes (80% positive for lung nodules by clinician review for that group), a substantial number of lung nodules would have been missed given the large number of lung nodules captured by ICD-9 codes 793.1 and 786.6. However, although the size of the patient groups identified by ICD-9 codes 793.1 and 786.6 were large, the frequencies of lung nodules identified were relatively low (44% and 56%, respectively). Looking forward, the creation of pulmonary nodule-specific ICD-9 (793.11) and ICD-10 (R91.1) codes since October 2011 may mitigate, if not eliminate, this problem. 9,11 The addition of the NLP algorithm improved the specificity and positive predictive value of nodule identification considerably for all five ICD-9 groups, and our final method for identifying lung nodules performed well enough for it to facilitate research in two primary ways. First, a fully automated process using electronic data could be used to study the incidence and prevalence of lung nodules in large populations, with the caveat that approximately 13% of cases identified by the automated method would not meet our definition of one or more nodules (e.g., be false-positives). However, in this study we favored sensitivity over specificity, and the NLP algorithm could be modified to improve its specificity. Alternatively, the method developed here could be used as a sensitive first step to be followed by more specific review of radiology transcripts or actual imaging studies. This two-step approach would reduce investigator burden by identifying a smaller number of transcripts that are more likely to mention the presence of one or more lung nodules, and thereby facilitate studies that require information contained in the radiology transcript or the even more detailed information provided by the actual images. Both these methods provide an efficient and relatively accurate way to identify lung nodules of interest. As expected, the large sample that we identified included a racially and ethnically diverse mix of primarily older individuals, many of whom were never smokers. However, comorbidities were common, including coronary heart disease, chronic kidney disease, asthma, and obstructive lung disease. Of note, among the 57 positive transcripts reviewed in our validation sample, almost two-thirds of positive transcripts contained reference to more than one nodule and/or one or more associated findings, such as pneumonia, atelectasis, or mediastinal lymph node enlargement. This highlights that the classical solitary pulmonary nodule is but one of many different ways in which a lung nodule may present in the current era of CT scanning. Accordingly, future research and recommendations for practice should distinguish between solitary nodules, oligonodular or paucinodular involvement, diffuse nodules, and nodules accompanied by associated findings. Our study has several limitations. First, although we observed that the NLP algorithm had excellent sensitivity for identifying transcripts that referenced one or more lung nodules, our use of ICD-9 codes to select transcripts for NLP review may have resulted in large numbers of missed cases. This seems especially likely because our method unexpectedly detected more larger nodules (56%) than smaller, subcentimeter 1261

Danforth et al. Journal of Thoracic Oncology Volume 7, Number 8, August 2012 nodules, which are generally recognized to be more common. In retrospect, we believe that the requirement for the presence of a nodule-related diagnostic code skewed the distribution of nodule size toward larger nodules. On average, a busy clinician might be less inclined to enter one of the nonspecific ICD-9 codes for patients with small, incidental nodules. However, the assignment of a diagnosis code reflects a real-world judgment: that a nodule is clinically important, and thus the identification of these nodules and their clinical follow-up and outcomes is of value even if some nodules are missed. Although we excluded patients with prior diagnoses (ICD-9 codes) that indicated a possible lung nodule so that only incident nodules could be identified, it is possible that some nodules were previously observed but did not rise to the level of being documented via a diagnosis code. Future research should examine the trade-off between improved sensitivity and greater burden of review that would result from applying the NLP algorithm to less restricted samples of radiology transcripts, i.e., by eliminating the requirement for a diagnostic code. Another limitation is that we performed our validation by reviewing a relatively small sample of radiology transcripts. In subsequent validation work, it will be important to include larger samples. Future research should evaluate the generalizability of our method to other health care settings, as it is possible that coding patterns and even terminology used to dictate radiology reports may differ across providers, regions, and systems. However, no attempt was made to standardize the reading or notation of the CT scans within our health system, encompassing 14 hospitals and multiple radiologists, and therefore our study reflects some of the heterogeneity that might be expected in other settings. Furthermore, we incorporated several standard terms in our algorithm, which are frequently used to indicate the presence of a lung nodule (e.g., nodule, opacity, SPN, and GGO ) and are likely to be used in other health care settings as well. In summary, by using a combination of ICD-9 codes, CPT codes, and NLP of full-text radiology reports, we developed an automated method that seemed to have excellent sensitivity and good specificity for identifying incident lung nodules. With further refinement and subsequent validation, such a method will help to identify patients with lung nodules in large populations more efficiently, thereby facilitating research that compares alternative practices for lung nodule management in community-based settings. ACKNOWLEDGMENTS We thank Qiaowu Li and Margo Sidell for their contributions to the analyses. REFERENCES 1. Gould MK, Fletcher J, Iannettoni MD, et al.; American College of Chest Physicians. Evaluation of patients with pulmonary nodules: when is it lung cancer?: ACCP evidence-based clinical practice guidelines (2 nd edition). Chest 2007;132(3 Suppl):108S 130S. 2. Ost D, Fein AM, Feinsilver SH. Clinical practice. The solitary pulmonary nodule. N Engl J Med 2003;348:2535 2542. 3. Ost DE Gould MK. Decision making in patients with pulmonary nodules. Am J Respir Crit Care Med 2012;185:363 372. 4. MacMahon H, Austin JH, Gamsu G, et al.; Fleischner Society. Guidelines for management of small pulmonary nodules detected on CT scans: a statement from the Fleischner Society. Radiology 2005;237:395 400. 5. Li L, Chase HS, Patel CO, Friedman C, Weng C. Comparing ICD9-encoded diagnoses and NLP-processed discharge summaries for clinical trials prescreening: a case study. AMIA Annu Symp Proc 2008;2008:404 408. 6. Gundlapalli AV, South BR, Phansalkar S, et al. Application of Natural Language Processing to VA Electronic Health Records to Identify Phenotypic Characteristics for Clinical and Research Purposes. Summit on Translat Bioinforma 2008;2008:36 40. 7. Melton GB Hripcsak G. Automated detection of adverse events using natural language processing of discharge summaries. J Am Med Inform Assoc 2005;12:448 457. 8. Womack JA, Scotch M, Gibert C, et al. A comparison of two approaches to text processing: facilitating chart reviews of radiology reports in electronic medical records. Perspect Health Inf Manag 2010;7:1a. 9. World Health Organization. International Classification of Diseases. Available at: http://www.who.int/classifications/icd/en/. Accessed December 15, 2011. 10. Wahidi MM, Govert JA, Goudar RK, Gould MK, McCrory DC; American College of Chest Physicians. Evidence for the treatment of patients with pulmonary nodules: when is it lung cancer?: ACCP evidence-based clinical practice guidelines (2 nd edition). Chest 2007;132(3 Suppl):94S 107S. 11. Centers for Disease Control and Prevention. ICD-9-CM Coordination and Maintenance Committee Meeting March 9 10, 2011. Available at: http://www.cdc.gov/nchs/data/icd9/topicpacketformarch2011_ha1.pdf. Accessed February 6, 2012. 1262 Copyright 2012 by the International Association for the Study of Lung Cancer