Determinants of Smoking and Quitting in HIV-Infected Individuals

Background: Cigarette smoking is widespread among HIV-infected patients, who confront increased risk of smoking-related co-morbidities. The effects of HIV infection and HIV-related variables on smoking and smoking cessation are incompletely understood. We investigated the correlates of smoking and quitting in an HIV-infected cohort using a validated natural language processor to determine smoking status.
Method: We developed and validated an algorithm using natural language processing (NLP) to ascertain smoking status from electronic health record data. The algorithm was applied to records for a cohort of 3487 HIV-infected patients from a large health care system in Boston, USA, and 9446 uninfected control patients matched 3:1 on age, gender, race and clinical encounters. NLP was used to identify and classify smoking-related portions of free-text notes. These classifications were combined into patient-year smoking status and used to classify patients as ever versus never smokers and current smokers versus non-smokers. Generalized linear models were used to assess associations of HIV with 3 outcomes: ever smoking, current smoking, and current smoking in analyses limited to ever smokers (persistent smoking), while adjusting for demographics, cardiovascular risk factors, and psychiatric illness. Analyses were repeated within the HIV cohort, with the addition of CD4 cell count and HIV viral load to assess associations of these HIV-related factors with the smoking outcomes.
Results: Using the natural language processing algorithm to assign annual smoking status yielded sensitivity of 92.4%, specificity of 86.2%, and AUC of 0.89 (95% confidence interval [CI] 0.88–0.91). Ever and current smoking were more common in HIV-infected patients than controls (54% vs. 44% and 42% vs. 30%, respectively, both P<0.001). In multivariate models HIV was independently associated with ever smoking (adjusted rate ratio [ARR] 1.18, 95% CI 1.13–1.24, P<0.001), current smoking (ARR 1.33, 95% CI 1.25–1.40, P<0.001), and persistent smoking (ARR 1.11, 95% CI 1.07–1.15, P<0.001). Within the HIV cohort, having a detectable HIV RNA was significantly associated with all three smoking outcomes.
Conclusions: HIV was independently associated with both smoking and not quitting smoking, using a novel algorithm to ascertain smoking status from electronic health record data and accounting for multiple confounding clinical factors. Further research is needed to identify HIV-related barriers to smoking cessation and develop aggressive interventions specific to HIV-infected patients.


Introduction
Smoking is highly prevalent among HIV-infected patients [1][2][3][4][5][6] and is strongly associated with increased prevalence of smoking-related chronic diseases. [5,7,8] Cardiovascular disease (CVD) risk, which is known to be heightened in HIV disease, [9][10][11][12][13] has been shown to decrease with increased time since quitting smoking in an HIV cohort. [14] Smoking-related characteristics, including degree of nicotine dependence, [15,16] readiness to quit, [3,15] and frequency of quit attempts, [15] have been explored for HIV-infected patients. HIV-infected patients have been cited as a high-priority group for intervention by a major tobacco guideline. [17] Understanding the impact of HIV and HIV-related parameters on smoking will help to develop smoking cessation strategies tailored to this group.
The challenge of obtaining reliable smoking data from electronic health record (EHR) data sources represents a barrier to studying smoking among HIV populations in clinical care. [18,19] Natural language processing (NLP) tools have been developed to identify and classify smoking-related portions of text in medical records [20][21][22] and represent a novel approach to this problem. However, individual NLP classifications must be integrated to create a clinically meaningful smoking status for a patient at a specific point in time that is appropriate for clinical research use.
We investigated smoking outcomes in a health care system-based longitudinal observational cohort of HIV-infected patients and matched controls. To determine smoking status in this large cohort, we developed and validated an algorithm to assign smoking status using NLP data. While current smoking prevalence has been demonstrated to be elevated among HIV-infected patients, the extent to which this is due to greater smoking initiation or reduced smoking cessation among this group is unclear. We assessed whether HIV infection is independently associated with ever smoking and current smoking. In order to assess the effect of HIV status on smoking cessation, we also examined the outcome of current smoking in analyses limited to ever smokers (persistent smoking or failure to quit). We controlled for cardiovascular risk factors because they are elevated among patients with HIV and diagnosis with cardiovascular disease has been associated with smoking cessation. In addition, we controlled for mood disorders and schizophrenia, which have been associated with high smoking prevalence and difficulty quitting. [23,24] We then examined specific correlates of the three smoking outcomes within the HIV-infected group to assess whether HIV-related clinical characteristics, which have been associated with cardiovascular outcomes, [25][26][27] may impact likelihood of smoking and ability to quit. We sought to provide a comprehensive investigation of the impact of HIV on smoking behaviors, specifically examining whether HIV-related clinical characteristics affect smoking outcomes independently of potentially confounding clinical factors.

Patient population
The cohort comprised HIV-infected patients (cases) matched to HIV-uninfected patients (controls) on the basis of age, gender, race, and number of medical encounters in a 3:1 ratio. Data were obtained from the Partners HealthCare System (PHS) Research Patient Data Registry (RPDR), a comprehensive database of administrative, billing and electronic health record (EHR) information including inpatient and outpatient encounters for over 4.5 million patients. Patients were eligible to be included as cases if they received care at Brigham and Women's Hospital or Massachusetts General Hospital between 2005 and 2007. HIV infection was determined by inpatient or outpatient diagnosis of HIV (International Classification of Diseases, 9th Revision, Clinical Modification [ICD-9-CM] codes 042 and all subtypes, V08, and corresponding electronic health record codes). Exclusion criteria for both groups included diagnosis of coronary heart disease (CHD) prior to 2008, age <18 years, and death prior to January 1, 2008. The study period spanned the time of the earliest documented clinical encounter through October 31, 2008. This study was approved by the Partners Human Research Committee. Informed consent of study subjects was not obtained. The IRB approval included a waiver of the requirement to obtain informed consent because the risk to study subjects, including risk to privacy, was deemed to be minimal, obtaining informed consent of study subjects was not feasible and the rights and welfare of the subjects would not be adversely affected by the waiver.

Smoking NLP algorithm validation
We used an NLP tool [28] to scan free text portions of the medical record, identify portions of text, or "tokens," that contain smoking-related information, and classify each token as indicating a non-smoker, current smoker or former smoker. The performance of the classifier in categorizing individual tokens has been validated previously. [28] However, a single patient's medical record may contain multiple tokens with discrepant classifications. We applied an aggregation rule for combining token classifications to assign a smoking status to a patient for a given calendar year (S1 Text). To validate the full algorithm, a sample of 250 HIV cases and 250 controls was randomly selected from among those with NLP data available. For each calendar year from the patient's first encounter to the last, the reviewer classified the patient's smoking status for the period as smoker, nonsmoker or unknown. We calculated sensitivity, specificity and AUC comparing the NLP-based algorithm to the gold standard of clinician medical record review for ever versus never and current versus not current smoking. We assessed the performance of the algorithm by patient characteristics that might be expected to affect physician documentation (HIV status, gender, age, cardiovascular risk factors) as well as time (calendar year) and the number of tokens found. We compared AUC using a nonparametric test. [29]

Covariate ascertainment
Data extracted from the RPDR included demographic data (age, gender and self-reported race), ICD-9 diagnostic codes, laboratory test results, medication prescriptions, and free text notes. Patients were classified as having hypertension, diabetes, dyslipidemia, coronary heart disease, depression, anxiety, bipolar disorder and schizophrenia if a relevant ICD-9 code was found (see Table 1 for specific codes).
Patients were considered to have used pharmacotherapy for smoking cessation if a prescription for varenicline or an outpatient prescription for nicotine replacement therapy (NRT) was found. Inpatient NRT use was not included because it is commonly used for temporary abstinence during hospitalizations. Because bupropion is indicated for both depression and smoking cessation, this medication was not considered a cessation aid. For cases, we obtained the most recent CD4 cell count and HIV RNA laboratory results. HIV RNA results are presented as percent detectable (≥400 copies/ml) versus not detectable, and among those with detectable results, the mean log-transformed HIV RNA.
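The step of combining token classifications into a patient-year smoking status, described under "Smoking NLP algorithm validation", can be illustrated with a minimal sketch. The actual aggregation rule is specified in S1 Text and is not reproduced here; the majority-vote rule, the tie-breaking order, the use of the most recent year to define current smoking, and all function names below are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# The three token classes named in the text.
NON, CURRENT, FORMER = "non-smoker", "current smoker", "former smoker"


def patient_year_status(tokens):
    """Collapse (calendar_year, label) tokens into one label per year.

    Hypothetical rule (not the rule from S1 Text): the most frequent
    label within a year wins; ties go to CURRENT, then FORMER, then NON.
    """
    by_year = defaultdict(Counter)
    for year, label in tokens:
        by_year[year][label] += 1
    priority = {CURRENT: 2, FORMER: 1, NON: 0}
    return {
        year: max(counts, key=lambda lab: (counts[lab], priority[lab]))
        for year, counts in sorted(by_year.items())
    }


def ever_smoker(status_by_year):
    """Ever smoker: any year classified as current or former smoker."""
    return any(s in (CURRENT, FORMER) for s in status_by_year.values())


def current_smoker(status_by_year):
    """Current smoker: assumed here to mean the most recent year is CURRENT."""
    return status_by_year[max(status_by_year)] == CURRENT


# Illustrative tokens for one hypothetical patient.
tokens = [
    (2005, CURRENT), (2005, CURRENT), (2005, NON),
    (2007, FORMER), (2007, CURRENT), (2007, FORMER),
]
status = patient_year_status(tokens)
print(status, ever_smoker(status), current_smoker(status))
```

Under these assumptions, a patient who is an ever smoker and also a current smoker would count toward the persistent smoking outcome, while the hypothetical patient above would count as having quit.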
To explore the influence of HIV-infection on smoking cessation, we repeated the model with current smoking as the outcome, but limited the analysis to ever smokers. The outcome of this analysis can be interpreted as persistent smoking, or failure to quit.
All models included the cardiovascular risk and mood disorder variables plus schizophrenia while controlling for age (as a continuous variable), gender, and race (white vs. other). The models predicting persistent smoking also included a term for ever use of smoking cessation medication (varenicline or outpatient NRT). We constructed models for each outcome including HIV status as a correlate, and then repeated them for HIV cases only, adding dichotomous variables for ever use of antiretroviral therapy (ART), CD4 cell count (<200 vs. ≥200) and HIV RNA (<400 copies/ml vs. ≥400 copies/ml) at the most recent observation. For CD4 and HIV RNA, additional categories were created for patients with missing laboratory data. Sensitivity analyses were conducted substituting nadir CD4 for recent CD4 cell count, continuous HIV RNA (log transformed) for dichotomous HIV RNA, and duration of ART use for ever ART use. Additional analyses in the overall and HIV-only persistent smoking model were conducted limiting to patients with at least 12 months between the first and last smoking status. We present adjusted rate ratios (RR) and 95% confidence intervals (CI). All tests were 2-sided with P values <0.05 considered significant. All analyses were conducted in Stata
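As an illustration of the covariate coding described above, the following sketch derives the HIV-related categories, including the separate category for missing laboratory data. It is written in Python for readability and is not the authors' Stata code. It assumes the comparison groups are CD4 <200 versus ≥200 and HIV RNA <400 versus ≥400 copies/ml, and it assumes a base-10 logarithm for the log-transformed viral load, since the text does not state the base. All function names are hypothetical.

```python
import math


def cd4_category(cd4_count):
    """Categorize the most recent CD4 count; None means no result on record."""
    if cd4_count is None:
        return "missing"
    return "<200" if cd4_count < 200 else ">=200"


def rna_category(copies_per_ml, threshold=400):
    """Categorize the most recent HIV RNA as detectable or not (assumed >=400)."""
    if copies_per_ml is None:
        return "missing"
    return "detectable" if copies_per_ml >= threshold else "undetectable"


def log10_rna(copies_per_ml):
    """Log-transform a detectable viral load (base 10 is an assumption)."""
    return math.log10(copies_per_ml)


# Illustrative values only, not study data.
for cd4, rna in [(150, 25000), (640, 50), (None, None)]:
    print(cd4_category(cd4), rna_category(rna))
```

Keeping "missing" as its own category, rather than dropping patients without laboratory results, retains those patients in the model at the cost of a less interpretable coefficient for that category.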

Cohort characteristics
The overall cohort included 3487 HIV and 9446 control patients. NLP identified at least 1 smoking-related token for 2868 cases (82%) and 6915 controls (73%). Among those with >1 token available, the median time between the first and last observation was 55 months. Table 1 presents the demographic and clinical characteristics of the entire cohort, patients with NLP data available, and the validation sample randomly drawn from those with NLP data available.

Smoking algorithm validation
Smoking status was ascertained by both the NLP-based algorithm and the medical record reviewer for 500 patients during a total of 1591 patient-years. For current smoking, the NLP-based algorithm had a sensitivity of 92%, specificity of 86% and AUC of 0.89 (95% CI 0.88-0.91). For ever smoking, the NLP-based algorithm had a sensitivity of 94%, specificity of 73% and AUC of 0.84 (95% CI 0.81-0.87). The performance of the NLP-based algorithm as compared to medical record review in specific subgroups of patients is presented in Table 2.
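The comparison against medical record review can be sketched as follows. The example computes sensitivity and specificity from paired classifications, skipping periods the reviewer marked unknown (an assumption about how unknowns were handled). It also uses the general fact that, for a dichotomous classifier with a single operating point, the area under the ROC curve equals the mean of sensitivity and specificity. The text does not state how the AUC was computed, but the reported values are consistent with this identity. The paired labels below are invented for illustration and are not study data.

```python
def sensitivity_specificity(predicted, reference):
    """Compare algorithm output with the reference standard.

    Both arguments are sequences of booleans (True = smoker); a reference
    value of None means the reviewer recorded the status as unknown, and
    such pairs are skipped (an assumption).
    """
    tp = fp = tn = fn = 0
    for pred, ref in zip(predicted, reference):
        if ref is None:
            continue
        if ref and pred:
            tp += 1
        elif ref and not pred:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def single_point_auc(sensitivity, specificity):
    """ROC area for a binary classifier with one operating point."""
    return (sensitivity + specificity) / 2


# Invented patient-year labels for illustration.
reference = [True, True, True, False, False, None, False, True]
predicted = [True, True, False, False, True, True, False, True]
sens, spec = sensitivity_specificity(predicted, reference)
print(round(sens, 2), round(spec, 2))

# Reported operating points for current and ever smoking.
print(round(single_point_auc(0.92, 0.86), 2))
print(round(single_point_auc(0.94, 0.73), 3))
```

Applied to the reported operating points, the identity gives 0.89 for current smoking and 0.835 for ever smoking, in line with the AUCs of 0.89 and 0.84 stated above.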

Prevalence and correlates of smoking
Using the NLP-based algorithm, overall smoking prevalence was 47% for ever smoking and 33% for current smoking. Smoking was more prevalent in HIV-infected patients compared to controls (54% vs. 44% for ever smoking, 42% vs. 30% for current smoking, P<0.001 for both comparisons). Persistent smoking (among ever smokers) was documented in 71% of the overall group, 77% of the HIV-infected patients, and 68% of the control patients. NRT was used by 7% (N = 333) of ever smokers, with no difference between cases and controls (8% vs. 7%, p = 0.111).
In multivariate modeling adjusted for age, gender, race, cardiovascular risk index, mood disorder, and schizophrenia, HIV infection was significantly associated with ever smoking (RR 1.18, 95% CI 1.13-1.24, P<0.001), current smoking (RR 1.33, 95% CI 1.25-1.40, P<0.001), and persistent smoking (RR 1.11, 95% CI 1.07-1.15, P<0.001) (Table 3). Male gender and being diagnosed with schizophrenia were the only other factors to show this consistent pattern across all three outcomes. The number of cardiovascular risk factors diagnosed was not associated with ever smoking, but each additional diagnosis was associated with a decrease of roughly 10% in the prevalence of current smoking (RR 0.91, 95% CI 0.88-0.94, P<0.001) and with a similar decrease in persistent smoking, indicating more quitting (RR 0.91 for persistent smoking, 95% CI 0.89-0.93, P<0.001). The presence of a mood disorder was associated with ever and current smoking but not with quitting smoking.
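A short worked calculation helps relate the crude prevalences reported above to the adjusted rate ratios. The sketch below computes unadjusted prevalence ratios from the reported percentages and compares them with the adjusted estimates. It also shows how a per-diagnosis RR of 0.91 compounds across several cardiovascular risk factors, assuming the model is multiplicative on the ratio scale (a log link, which the text does not state explicitly). Function names are hypothetical.

```python
def prevalence_ratio(p_hiv, p_control):
    """Unadjusted ratio of two prevalences."""
    return p_hiv / p_control


def compound_rr(rr_per_unit, units):
    """Rate ratio for `units` additional diagnoses under a multiplicative model."""
    return rr_per_unit ** units


# Prevalences (HIV-infected, controls) as reported in the text.
crude = {
    "ever": prevalence_ratio(0.54, 0.44),
    "current": prevalence_ratio(0.42, 0.30),
    "persistent": prevalence_ratio(0.77, 0.68),
}
# Adjusted rate ratios as reported in the text.
adjusted = {"ever": 1.18, "current": 1.33, "persistent": 1.11}

for outcome in crude:
    print(f"{outcome}: crude {crude[outcome]:.2f}, adjusted {adjusted[outcome]:.2f}")

# Three additional risk-factor diagnoses at RR 0.91 each.
print(f"{compound_rr(0.91, 3):.2f}")
```

Each crude ratio (about 1.23, 1.40, and 1.13) is somewhat larger than its adjusted counterpart, which is what one would expect if part of the crude difference is explained by the covariates in the model.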
In analyses repeated within the HIV-infected group only, having a detectable recent HIV RNA was significantly associated with ever smoking, current smoking, and persistent smoking (Table 4). Having a CD4 cell count less than 200/mm³ was associated with being less likely to quit smoking, although this association did not achieve statistical significance. The performance of the other factors followed a similar pattern between the HIV-infected only and overall models.
In further sensitivity analyses among the HIV-infected patients, we investigated the effects of CD4 cell count nadir, HIV RNA expressed as a continuous variable, and ART duration on the three outcomes in order to assess different aspects of HIV disease severity. Results were similar, with the exception of ART duration which was significantly associated with history of ever smoking (RR 1.02, 95% CI 1.01-1.03, P<0.001).

Discussion
In a large clinical care cohort of HIV-infected and matched control patients, we found HIV to be a significant correlate of current and ever smoking with an effect size comparable to that for associations of smoking with male gender or schizophrenia, while controlling for cardiovascular risk factors and mental health disorders. We also showed being HIV infected to be independently associated with decreased likelihood of quitting smoking. Despite extensive data supporting a high prevalence of smoking among HIV-infected individuals, whether HIV infection is independently associated with smoking after accounting for multiple potentially confounding clinical factors has not been clearly established. Our finding that HIV infection is independently associated with both smoking and decreased likelihood of quitting strongly establishes HIV-infected patients as an extremely high-risk group meriting targeted smoking cessation intervention.
HIV-infected patients demonstrate extremely high smoking prevalence across multiple geographic and clinical settings, with a recent study demonstrating higher mortality attributable to smoking than to HIV itself. [30] Smoking prevalence among HIV-infected patients has ranged from 43% to 64% in a series of earlier cohort studies, [1-4, 15, 31] and was higher relative to matched control patients in French and Danish cohorts. [30,32] A recent study comparing smoking prevalence in HIV-infected patients versus the general US population found current smoking prevalence of 42% for HIV compared with 21% for the general US population. [6] Our findings, which showed current smoking prevalence of 42% versus 30% for HIV-infected compared with matched control patients in longitudinal clinical care, are highly consistent with results from this national cross-sectional survey.
HIV-infected patients were also found to be significantly less likely to quit smoking, despite higher prevalence of pharmacologic smoking cessation aids. Prior studies have reported relatively high motivation to quit smoking among HIV-infected patients [15] and high rates of quit attempts. [15] Yet this apparent readiness does not appear to translate into successful smoking cessation, as demonstrated in a recent study. [6] Several smoking cessation trials utilizing intensive counseling and cellular telephone interventions have demonstrated efficacy, [33][34][35] but were limited by short follow-up or non-randomized design. A recent study demonstrated increased smoking cessation rates following implementation of a training program for HIV clinicians. [36] Our findings reinforce the need for studies of intensive yet feasible smoking cessation interventions tailored to HIV-infected patients that can be readily applied within current care models.
Within the group of HIV-infected patients, we conducted several analyses exploring factors associated with smoking and smoking cessation to identify HIV subgroups that might be targeted for more intensive intervention. Patients with a detectable HIV viral load were significantly more likely to smoke and less likely to quit compared to those who were virologically suppressed, even after accounting for the presence of mental health disorders. While CD4 cell count was not associated with ever or current smoking, having a low CD4 cell count tended to be associated with not having quit smoking (P = 0.068). Importantly, these HIV-related factors appear to be more important in predicting patients' ability to quit smoking than mood disorders or schizophrenia, which were not significant risk factors. The group of patients with less well controlled HIV infection might represent those not yet meeting previous criteria for antiretroviral treatment (as guidelines recommending treatment for all HIV-infected patients are relatively recent [37]) or those who are not adhering to prescribed therapy. Socio-demographic and clinical factors which affect medication adherence and lead to detectable viral load measurements might also represent barriers to smoking cessation.
Our findings are consistent with established risk factors for smoking in the general population, in which smoking prevalence is typically higher in men [38] and in patients with psychiatric disorders, [39] identified as a high risk group. [40] The presence of HIV infection coupled with a psychiatric disorder is likely to confer a heightened risk of smoking, given individual increased risks of 30% conferred by HIV infection, 25% by a mood disorder, and 20% by schizophrenia. This subgroup of HIV-infected patients with mental illness represents a particularly high risk group for whom aggressive smoking cessation intervention is warranted.
To optimize smoking data for our cohort, we developed and validated a novel algorithm to identify smoking status from EHR data using NLP. Several NLP tools for smoking status have been developed [20][21][22] and used to assess physician adherence to evidence-based guidelines [41] or to assign smoking status on the patient level [19]. Our purpose was to develop a method to use NLP token classifications in a way that reflected the longitudinal nature of our cohort and that could capture changes in smoking status over time. The algorithm performed well, yielding sensitivity of 92% and specificity of 86% for annual current smoking status. Moreover, the algorithm performance remained consistent when evaluated according to multiple characteristics reflecting variation in patient characteristics and clinical care delivery.
The study was limited by several factors intrinsic to observational data. It was a retrospective observational study and therefore potentially subject to confounding, despite the demographic matching of the control group. We were unable to control for socioeconomic status and other substance use, variables likely to influence smoking behavior that might differ between HIV-infected patients and controls. The validation study was conducted using detailed medical record data by a trained clinical research nurse, as patient self-reported smoking data were not available. Our algorithm was by necessity validated in the cohort in which it was derived, rather than an external validation cohort, because the NLP tool we used was developed for and is specific to the Partners HealthCare System. While the algorithm we developed is applicable to EHR data in the Partners HealthCare System, the process by which it was generated is applicable to other health care systems, in which it might serve as a model for the development of analogous algorithms.
The implications of the study extend to both HIV management care and HIV clinical research methodology. We demonstrated that HIV is independently and significantly associated with history of smoking, current smoking, and decreased likelihood of quitting smoking. Additionally, we show that having less well-controlled HIV disease represents a barrier to quitting smoking with a stronger association than having a mental health disorder. Moreover, the development of an automated algorithm to identify smoking status from EHR data represents an innovative approach which can be translated to other settings and serve as a paradigm in a research era of increasing reliance on clinical care data. By substantiating the link between HIV infection and smoking and identifying HIV subgroups with lower likelihood of quitting, the data from this study provide strong support for intensifying efforts at the provider and public health level for HIV-specific smoking cessation strategies.
Supporting Information
S1 Text. Token Aggregation Rule Selection. (DOCX)

group for facilitating use of their database and natural language processing tool and to Jo Ann David-Kasdan for medical record review.