Validation of Six Short and Ultra-short Screening Instruments for Depression for People Living with HIV in Ontario: Results from the Ontario HIV Treatment Network Cohort Study

Objective: Major depression affects up to half of people living with HIV. However, among HIV-positive patients, depression goes unrecognized 60–70% of the time in non-psychiatric settings. We sought to evaluate three screening instruments and their short forms to facilitate the recognition of current depression in HIV-positive patients attending HIV specialty care clinics in Ontario.
Methods: A multi-centre validation study was conducted in Ontario to examine the validity and accuracy of three instruments (the Center for Epidemiologic Studies Depression Scale [CESD20], the Kessler Psychological Distress Scale [K10], and the Patient Health Questionnaire depression scale [PHQ9]) and their short forms (CESD10, K6, and PHQ2) in diagnosing current major depression among 190 HIV-positive patients in Ontario. Results from the three instruments and their short forms were compared to results from the gold standard measured by the Mini International Neuropsychiatric Interview (the “M.I.N.I.”).
Results: Overall, the three instruments identified depression with excellent accuracy and validity (area under the curve [AUC]>0.9) and good reliability (Kappa statistics: 0.71–0.79; Cronbach’s alpha: 0.87–0.93). The AUCs did not differ significantly between instrument pairs (p-value>0.09) or between the instruments and their short forms (p-value>0.3). Except for the PHQ2, the instruments showed good-to-excellent sensitivity (0.86–1.0) and specificity (0.81–0.87), excellent negative predictive value (>0.90), and moderate positive predictive value (0.49–0.58) at their optimal cut-points.
Conclusion: Among people in HIV care in Ontario, Canada, the three instruments and their short forms performed equally well and accurately. When further in-depth assessments become available, shorter instruments might find greater clinical acceptance. This could lead to clinical benefits in fast-paced speciality HIV care settings and better management of depression in HIV-positive patients.

Data Availability Statement: We obtained our data from the OHTN Cohort Study. There are ethical restrictions on the dataset as it contains patient information that may pose a risk of residual disclosure of the HIV-positive participants. A deidentified dataset will be made available to all interested researchers upon request to the OHTN Cohort Study Governance Committee. Full details regarding the application process are provided at www.ohtncohortstudy.ca. Interested readers may contact Ms. Madison Kopansky-Giles (OCS

Introduction
Depression affects up to half of people living with HIV [1][2][3][4]. However, depression goes unrecognized in about 60-70% of HIV-positive patients in non-psychiatric healthcare settings [5][6][7][8]. When depression is left untreated in HIV-positive patients, it can reduce immune activity [9][10][11][12], increase the risk of co-morbidities and mortality [13,14], and reduce quality of life [15]. Given the advancements made by highly active antiretroviral therapy (HAART), HIV-positive patients are living longer, and physicians and patients are facing long-term challenges in managing depression [16]. Because of the substantive negative impacts of depression on clinical outcomes among HIV-positive patients, recent guidelines from Canada, the U.K., and the U.S. recommend that screening be undertaken if follow-up in-depth assessments are available [17][18][19].
Over the past several decades, numerous short and ultra-short screening instruments have been developed to assist in examining depressive symptomatology in non-psychiatric healthcare settings [20,21]. Despite ongoing debates about the effectiveness of these instruments, a recent meta-analysis of 113 studies has shown that most instruments demonstrate adequate performance when used in the initial assessment of depression among patients with physical illness [20].
The 9-item Patient Health Questionnaire (PHQ9), the 20-item Center for Epidemiologic Studies Depression Scale (CESD20), and the 10-item Kessler Psychological Distress Scale (K10) are three screening instruments commonly used with HIV-positive patients [21,22]. The PHQ9 has earned acceptance in primary care and research settings because it is half the length of most other instruments but maintains comparable sensitivity and specificity [23]. Each item of the PHQ9 also corresponds to a specific Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) criterion for the diagnosis of depression [23]. The CESD20 has the longest history of measuring depression in both HIV-positive patients and the general population [21,22]. It was originally designed for community surveys and has extensively demonstrated its reliability and validity [20,24]. The K10 is a short instrument that can broadly screen for both anxiety and depressive disorders [25]. It has strong psychometric properties for distinguishing DSM-IV disorders, and its diagnostic accuracy has been shown to have no significant bias by gender or education level [26,27].
Although these three instruments have been extensively evaluated in the general population [24] and in patients with physical illness [20], evaluations of the instruments among HIV-positive patients have been performed mainly in limited-resource settings (i.e., Sub-Saharan Africa) [21,22]. However, the characteristics of HIV-positive patients in Sub-Saharan Africa (for instance, their literacy levels and their understanding and expression of mental health issues) might be quite different from those of North Americans and might affect the evaluation of the instruments. As a result, the psychometric properties of the three instruments and their comparability to a "gold standard" remain unknown for HIV-positive patients in well-resourced settings such as Canada and the United States.
Our multi-centre study sought to determine and compare the diagnostic accuracy and reliability of the three instruments (CESD20, K10, and PHQ9) and their short forms (CESD10, K6, and PHQ2) for current major depression against a gold standard as measured by the Mini International Neuropsychiatric Interview (the "M.I.N.I."). The study focused on HIV-positive patients receiving HIV primary care in Ontario. Additional study objectives were to determine the optimal cut-points for each screening instrument and to examine potential factors that might affect the diagnostic accuracy of the instruments.

Study design
We conducted a cross-sectional validation study nested within a larger cohort of participants in HIV care. The Ontario HIV Treatment Network Cohort Study (OCS) is a multi-site, HIV-positive, clinical cohort. Full details regarding the cohort design can be found in a previous publication [28]. Briefly, participants are HIV-positive patients aged 16 years or older receiving care at one of ten specialty HIV clinics in Ontario. Clinical data recorded during the participants' routine health care visits are abstracted from clinic records and, since 2008, participants have been interviewed annually.
Three OCS sites were included in this validation study: Maple Leaf Medical Centre in Toronto, St. Joseph's Health Care in London, Ontario, and Windsor Regional Hospital. Participants who agreed to take part in the study received a $20 CAD honorarium. Ethical approval was received from the University of Toronto Human Subjects Review Committee and from the individual study sites (i.e. Ottawa Health Science Network Research Ethics Board, The University of Western Ontario Research Ethics Board for Health Sciences Research involving Human Subjects, St. Michael's Hospital Research Ethics Board, the Research Ethics Board of Health Sciences North, Sunnybrook Health Sciences Centre Research Ethics Board, University Health Network Research Ethics Board, and Windsor Regional Hospital Research Ethics Board). Our consent procedure was approved by all the ethics boards involved and written informed consent was obtained from each participant.
The M.I.N.I. is a short and widely adopted structured interview that takes about 15 minutes to complete and can be easily administered by a lay interviewer [29]. The M.I.N.I. has high sensitivity (94-96%) and specificity (79-88%) for identifying major depressive disorder when compared to the Structured Clinical Interview for the DSM-IV (SCID) and the International Classification of Diseases, 10th revision (ICD-10) criteria [29][30][31]. Nurses and participants were blinded to the results of the M.I.N.I. interviews.

Covariates
We also assessed whether certain characteristics of patients might affect the diagnostic accuracy of the screening instruments. Patient information was obtained through interviews administered by the nurses on the study date or during a previous appointment [28]. Measurement details for key characteristics are provided in Table 3.

Statistical Analysis
After the data were collected and de-identified, results from the M.I.N.I. diagnoses and total scores for the three screening instruments were generated at the OCS office by the lead investigator (S. C.), who was independent of the data collection. Our statistical analysis plan was fourfold: 1) to examine the diagnostic accuracy of the three screening instruments and their short forms; 2) to identify optimal cut-points for the screening instruments; 3) to examine the effects of seven previously documented somatic symptoms of HIV infection [32] on the diagnostic accuracy and performance of the screening instruments; and 4) to examine inter-rater agreement for pairs of the three instruments and the internal consistency of each instrument. We first used descriptive statistics to describe baseline characteristics, scores of the screening instruments and their short forms, and the prevalence of DSM-IV defined psychiatric disorders among study participants. We also assessed the differences by age (Student's t-test) and by sex (Pearson's chi-squared test) between our sample and the rest of the participants who are currently active in the OCS.
We then used non-parametric crude and adjusted Receiver Operating Characteristic (ROC) analyses to examine the criterion validity and accuracy of the three screening instruments and their short forms as compared to the M.I.N.I. First, the overall psychometric property of each instrument was described by a global measure: the area under the ROC curve (AUC). In general, AUC values (range: 0.5 to 1) greater than 0.8 indicated good performance and values greater than 0.9 indicated excellent performance. Second, we used the non-parametric Mann-Whitney U-test to assess the equality of the ROC curves of the instruments [33]. For each screening instrument, several criterion validity statistics were reported at each pre-defined cut-point: sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), and negative likelihood ratio (LR-). Finally, adjusted multivariable non-parametric ROC analyses were performed [34] because some covariates may have an impact on the accuracy of the instruments. Bivariate analyses were first performed to examine crude associations between the ROC curve of each instrument and each covariate. Covariates with a p-value<0.25 were entered into the final multivariable model [35]. Coefficients of the adjusted multivariable model reflect the impact of a specific covariate on the adjusted ROC curve, assuming a linear relationship between the diagnostic accuracy of the instruments and each covariate. A value of zero indicates no effect. We also assessed the overall impact of the covariates by comparing crude and adjusted AUCs for each instrument.
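The crude quantities named above can be illustrated with a minimal Python sketch. It is not the Stata code used in the study: the function names, the toy scores, and the rule that a score at or above the cut-point counts as screen-positive are all assumptions made for illustration. The AUC is computed in its Mann-Whitney form, as the proportion of depressed/non-depressed patient pairs in which the depressed patient has the higher score, with ties counted as one half.

```python
from itertools import product


def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as P(score of a depressed patient > score of a non-depressed one),
    with ties counted as 0.5 (Mann-Whitney U divided by n_pos * n_neg)."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def validity_at_cutpoint(scores, depressed, cut):
    """Criterion validity at one cut-point.

    Assumption: a patient screens positive when score >= cut.
    `depressed` holds the reference-standard diagnosis (True/False).
    A cell count of zero can make a ratio undefined; this sketch does
    not guard against that.
    """
    pairs = list(zip(scores, depressed))
    tp = sum(1 for s, d in pairs if s >= cut and d)
    fn = sum(1 for s, d in pairs if s < cut and d)
    fp = sum(1 for s, d in pairs if s >= cut and not d)
    tn = sum(1 for s, d in pairs if s < cut and not d)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "Se": se,
        "Sp": sp,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "LR+": se / (1 - sp),
        "LR-": (1 - se) / sp,
    }


# Hypothetical toy data, not study data.
toy_scores = [3, 5, 6, 8, 9, 10, 12, 15, 18, 20]
toy_depressed = [False, False, False, False, True, False, False, True, True, True]

toy_auc = auc_mann_whitney(
    [s for s, d in zip(toy_scores, toy_depressed) if d],
    [s for s, d in zip(toy_scores, toy_depressed) if not d],
)
toy_stats = validity_at_cutpoint(toy_scores, toy_depressed, cut=9)
```

In this toy example every depressed patient scores at or above 9, so sensitivity is 1 while specificity is 4/6, which shows how a low cut-point trades false positives for missed cases.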
There are many criteria for determining optimal cut-points for screening instruments [36][37][38][39][40]. In our study, we adopted three common criteria: Youden index (YI) [defined as Se+Sp-1] [41], distance (PROC01) between the optimal point on the ROC curve and the point of (0, 1), which is an ideal point corresponding to a sensitivity and specificity equal to 1 [defined as (NPV-1)^2 + (PPV-1)^2] [37,39], and diagnostic odds ratios (DOR) [defined as LR+/LR-] [40,42]. The YI (range: -1 to 1) is a single index that balances sensitivity and specificity; the greater its value, the better the validity of the cut-point. The PROC01 (range: 0 to 2) is a single index that balances the NPV and PPV; its minimum value indicates the best validity for the cut-point. The DOR (range: 0 to infinity) is a summary statistic that expresses the odds of a positive screening result in a patient with depression relative to the odds of a positive result in a patient without depression. A greater DOR indicates better predictive performance. Because we were evaluating the predictive performance of each screening instrument, we made our final decision on the optimal cut-point based on the following order: DOR, PROC01, and YI. We further examined the diagnostic accuracy of the screening instruments by removing some items (i.e., fatigue, sleep, appetite, not being able to shake the blues, feeling bothered, feeling depressed, and lack of concentration) from the instruments that have been previously reported as somatic symptoms of HIV. It is possible that these items might inflate depression scores [32]. For each instrument, we repeated the adjusted ROC analysis with items related to the somatic symptoms removed. We then used a Wald test to assess the equality of the adjusted ROC curves of the original instruments and their corresponding reduced scales. The standard error of the hypothesis test was obtained from a bias-corrected bootstrap method [43,44].
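The three cut-point criteria can be sketched in Python using the definitions given in the text. The candidate values below are hypothetical and are not taken from Table 7. Reading the stated order (DOR, then PROC01, then YI) as a strict priority, in which a later criterion is consulted only to break ties on an earlier one, is an assumption; the text does not spell out the exact decision rule.

```python
def cutpoint_criteria(se, sp, ppv, npv):
    """Youden index, PROC01, and diagnostic odds ratio for one cut-point."""
    yi = se + sp - 1
    proc01 = (npv - 1) ** 2 + (ppv - 1) ** 2
    # LR+ is infinite when specificity is exactly 1.
    lr_pos = se / (1 - sp) if sp < 1 else float("inf")
    # Assumes sp > 0; a specificity of zero is not handled in this sketch.
    lr_neg = (1 - se) / sp
    # DOR is infinite when sensitivity is exactly 1 (LR- equals 0).
    dor = lr_pos / lr_neg if lr_neg > 0 else float("inf")
    return {"YI": yi, "PROC01": proc01, "DOR": dor}


def pick_optimal(candidates):
    """Assumed rule: largest DOR, then smallest PROC01, then largest YI."""
    scored = {cut: cutpoint_criteria(**vals) for cut, vals in candidates.items()}
    return min(
        scored,
        key=lambda c: (-scored[c]["DOR"], scored[c]["PROC01"], -scored[c]["YI"]),
    )


# Hypothetical candidate cut-points with made-up validity statistics.
toy_candidates = {
    8: {"se": 0.90, "sp": 0.80, "ppv": 0.45, "npv": 0.98},
    10: {"se": 0.85, "sp": 0.90, "ppv": 0.60, "npv": 0.96},
    12: {"se": 0.60, "sp": 0.95, "ppv": 0.70, "npv": 0.93},
}
best_cut = pick_optimal(toy_candidates)
```

With these made-up inputs the DORs are roughly 36, 51, and 28.5, so the middle cut-point is selected on the first criterion alone.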
Finally, we used Cohen's Kappa statistic (ranged: -1 to 1; 0.6-0.7, 0.8-0.9 and >0.9 representing good, very good and excellent agreement, respectively) to examine the inter-rater agreement of each instrument pair by dichotomizing total scores of the instrument at the optimal cut-points. Cronbach's alpha (ranged: 0 to 1; 0.7-0.9 and >0.9 representing good and excellent consistency, respectively) was used to examine internal consistency of the instruments.
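Both reliability statistics have standard closed forms, illustrated here with a minimal Python sketch. The function names and toy data are hypothetical. The kappa function takes two lists of 0/1 screening results, such as two instruments dichotomized at their optimal cut-points; the alpha function takes one row of item scores per patient.

```python
from statistics import variance


def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) classifications of the same patients."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Agreement expected by chance from the two marginal positive rates.
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)


def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` is a list of per-patient lists of item scores."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data, not study data.
screen_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
screen_b = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
toy_kappa = cohen_kappa(screen_a, screen_b)

toy_items = [[0, 1, 0], [1, 1, 1], [2, 2, 1], [3, 2, 3], [3, 3, 3]]
toy_alpha = cronbach_alpha(toy_items)
```

Here the two toy screens agree on 8 of 10 patients, but chance agreement is 0.52, so kappa is about 0.58, noticeably lower than the raw agreement.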
All reported 95% confidence intervals were constructed using the bias-corrected bootstrap method with 2000 replicates [45]. All statistical tests were 2-sided, with statistical significance defined as a p-value less than 0.05, and all analyses were performed using Stata/IC v.13.1 [46].
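A bias-corrected bootstrap interval can be sketched as follows. This is a generic Python illustration of the technique and not the Stata routine used in the study; the function name, the seed, and the toy data are assumptions. Patients are resampled with replacement, the statistic is recomputed on each resample, and the percentile limits are shifted by a correction term z0 derived from the share of bootstrap estimates that fall below the original estimate.

```python
import random
from statistics import NormalDist, mean


def bc_bootstrap_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Bias-corrected (BC) bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    boots = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Share of bootstrap estimates below the original estimate, kept away
    # from 0 and 1 so that the inverse normal CDF stays finite.
    prop_below = sum(1 for b in boots if b < theta_hat) / n_boot
    prop_below = min(max(prop_below, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    nd = NormalDist()
    z0 = nd.inv_cdf(prop_below)
    z = nd.inv_cdf(1 - (1 - level) / 2)
    # Bias-corrected percentile positions.
    lo_p = nd.cdf(2 * z0 - z)
    hi_p = nd.cdf(2 * z0 + z)
    lo = boots[min(n_boot - 1, max(0, int(lo_p * n_boot)))]
    hi = boots[min(n_boot - 1, max(0, int(hi_p * n_boot)))]
    return lo, hi


# Hypothetical total scores, not study data; the statistic here is the mean,
# whereas the study bootstrapped quantities such as the AUC.
toy_totals = [14, 3, 22, 8, 0, 31, 5, 17, 9, 12, 27, 4, 11, 6, 19, 2, 25, 7, 13, 10]
toy_ci = bc_bootstrap_ci(toy_totals, mean)
```

When z0 is zero, which happens when exactly half of the bootstrap estimates fall below the original estimate, the interval reduces to the ordinary percentile interval.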

Sample size calculation
Based on two receiver operating characteristics (ROC) curves power analysis, we would have required 177 individuals with complete data to achieve an 80% statistical power (assuming a prevalence of 17% and a difference of 0.15 in AUC to be detected between two ROC curves) [47,48].

Results
Two hundred and thirty-seven HIV-positive patients (aged 18 years or older) agreed to participate in the validation study. When we compared the characteristics of the validation study participants to the remainder of the cohort, we found that participants were slightly younger (mean age: 47 v. 51 years; p-value: 0.02) and somewhat more likely to be male (86 v. 82%; p-value: 0.08), although the latter difference did not reach statistical significance.
Of the 237 HIV-positive patients initially included, we excluded 47 participants because information was missing from either the M.I.N.I. or one of the screening instruments. Our final analytical sample was 190 patients. Of these, 179 had provided demographic, psychosocial, and behavioural information during a regular OCS interview conducted before the validation study began. Table 4 presents baseline characteristics and the prevalence of DSM-defined psychiatric disorders of the sample. Among the 179 patients who provided demographic information, the mean age was 47 (SD = 11) years and 87% were male. Based on DSM-IV criteria from the M.I.N.I., twenty-nine patients (16%) were identified with current major depression within the past two weeks. The means (standard deviations) of the total scores of the CESD20, K10, PHQ9, CESD10, K6, and PHQ2 were 14 (13), 18 (8), 5 (5), 8 (7), 11 (5), and 1 (2), respectively. About half of the HIV-positive patients reported annual household incomes of less than $20,000 CAD and about half were recipients of Ontario Disability Support Program subsidies. About 40% of patients had at least one of the nine psychiatric disorders that we examined (Table 5).

Overall Psychometric Properties and Criterion Validity from ROC Analysis
Among the 179 patients who provided demographic information, our multivariable ROC analysis indicated that the receipt of Ontario Disability Support Program subsidies might weaken the discriminatory ability of the CESD10 and PHQ9 (Table 6). Additionally, although the ROC curves and AUCs after controlling for covariates were similar to those without the adjustment, there were some differences between the crude and adjusted ROC curves for each instrument (Fig 2).

Optimal Cut-points
Table 7 presents results for the diagnostic accuracy of the instruments at a range of possible cut-points evaluated in prior studies. Based on the best results for DOR, PROC01, and YI, we identified optimal cut-points of 22  2 , these instruments showed an excellent NPV (>0.90) for ruling-out major depression, but moderate PPV (0.49-0.51) for ruling-in the condition at their optimal cut-points. Although the PHQ2 showed a moderate PPV (0.7), its sensitivity was poor (0.45); hence, it was likely to miss some depression cases.

Impacts of Somatic Symptoms of HIV Infection on Diagnostic Accuracy
When we removed items (i.e., fatigue, sleep, appetite, not being able to shake the blues, feeling bothered, feeling depressed, and lack of concentration) [32] that were previously reported as somatic symptoms of HIV infection from the original screening instruments and their short forms for current major depression, we found that the adjusted AUCs of the CESD20 (p-value = 0.0019), CESD10 (p-value = 0.017) and PHQ2 (p-value = 0.023) were significantly reduced (Fig 3).

Discussion
To our knowledge, this is the first study to examine and compare the diagnostic accuracy and reliability of three common depression screening instruments (CESD20, K10, and PHQ9) and their short forms against a DSM-IV defined gold standard in an HIV-positive population. Overall, each of the screening instruments identified depression with excellent accuracy and reliability. The diagnostic accuracy of the three instruments and their short forms was comparable. Except for the PHQ2, each of the instruments showed good-to-excellent sensitivity and specificity, excellent negative predictive value, and moderate positive predictive value at optimal cut-points. The diagnostic accuracy of all instruments may vary according to the presence or absence of physical and mental disability. Previously reported somatic symptoms of HIV infection might have affected the diagnostic accuracy of the CESD20, CESD10, and PHQ2.
Our results for overall performance are generally consistent with findings previously reported for HIV-positive patients. First, the AUCs and criterion validity statistics of the CESD20 and PHQ9 were similar to prior findings from HIV-positive patients in Uganda [48]. Although our results were better than the pooled estimates (Se: 0.82; Sp: 0.73) reported in a recent meta-analysis, substantive between-study heterogeneity was reported in that analysis [22]. Second, the short forms of the three instruments performed comparably, a finding that is consistent with a recent systematic review [21]. Third, as with other studies, most of our test instruments showed moderate rates of false positives when ruling-in for depression [20].
A few differences were noted when we compared our results to the studies conducted in Sub-Saharan Africa. First, unlike Akena et al. (2013) [48], we found that none of the three instruments was diagnostically superior according to AUC values among HIV-positive patients. Additionally, unlike the recent meta-analysis of 113 studies for patients with chronic physical illness, we did not find that the PHQ9 was the most sensitive [20]. However, the psychometric properties we observed for the PHQ9 were generally comparable to those reported for the general population (Se = 0.88; Sp = 0.88) [23]. Second, the performance of the K10 in OCS participants was better than previous findings of sensitivity (0.67-0.83) and specificity (0.72-0.77) reported by Akena et al. (2013) and Spies et al. (2009) [48,49]. This may be due to systematic differences between the HIV-positive populations in Sub-Saharan Africa and Canada [48,49].
In terms of the optimal cut-points, our results differ from prior findings. For the PHQ9, our optimal cut-point was a total score of 8; previously reported optimal cut-points have typically been a score of 10 [23,48]. However, results from a recent meta-analysis have shown that cut-points between 8 and 11 all yield acceptable diagnostic properties for identifying major depression [50]. For the CESD20, our optimal cut-point was slightly higher than those previously reported (i.e., between 16 and 22) among HIV-positive patients [21,22,48], but an optimal point of 23 has also been reported in diabetic populations [51,52]. For the K10, our optimal cut-point was within the range reported in prior studies [48,49]. These differences may be due to the different criteria that we used when identifying the optimal cut-points. Our optimal cut-points were determined based on three common criteria: 1) diagnostic odds ratios; 2) PROC01; and 3) Youden index. The criteria used in prior Sub-Saharan Africa studies focused on maximizing sensitivity and specificity; however, these two measures capture only one aspect of diagnostic accuracy and do not focus on evaluating the predictive performance of a screening instrument.
Our results suggest that shorter instruments are desirable in primary HIV care settings because resource constraints are often found in these settings. Therefore, shorter instruments may find greater acceptance and yield larger clinical benefits. However, like the original screening instruments, the shorter screening instruments also have moderate positive predictive values, indicating that false positives are likely. We advise that the screening instruments only be administered when in-depth follow-up assessments are available to properly diagnose depression.
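The link between a moderate positive predictive value and the prevalence of depression can be illustrated with Bayes' rule. The Python sketch below is an illustration only: it uses the observed prevalence (29 of 190 patients) together with a sensitivity of 0.86 and a specificity of 0.87, which are endpoints of the ranges reported in this study rather than the values of any single instrument, and the function name is hypothetical.

```python
def predictive_values(se, sp, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence."""
    tp = se * prevalence              # true positives per patient screened
    fp = (1 - sp) * (1 - prevalence)  # false positives
    fn = (1 - se) * prevalence        # false negatives
    tn = sp * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Illustrative inputs drawn from the reported ranges (see the lead-in).
ppv_obs, npv_obs = predictive_values(se=0.86, sp=0.87, prevalence=29 / 190)

# Same accuracy at a hypothetical prevalence of 50% for comparison.
ppv_half, npv_half = predictive_values(se=0.86, sp=0.87, prevalence=0.5)
```

Under these inputs the PPV is a little above 0.5 while the NPV exceeds 0.95; at a hypothetical prevalence of 50% the same sensitivity and specificity would give a PPV of about 0.87. This is consistent with the pattern of excellent rule-out and moderate rule-in performance described above.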
Our results from multivariable ROC analysis indicated that, in general, the presence of physical and mental disability may reduce the diagnostic accuracy of screening instruments, thereby making it more difficult for the instruments to detect depression cases. It is possible that the patients who are eligible for the Ontario Disability Support Program (ODSP) are sicker and have more severe physical and mental conditions than patients who are not eligible for the program. Similar to prior evidence [20,32], our results may imply that symptoms of chronic conditions overlap with symptoms of depression, especially among patients who have received ODSP subsidies. This would inflate the total scores for the screening instruments and cause a higher number of false positives, which would lead to a lower PPV for detecting depression. As we showed in our further analysis, after we removed some items related to HIV somatic symptoms from the screening instruments, the diagnostic accuracy indicated by the adjusted AUCs was reduced. Therefore, our results suggest that careful consideration must be taken and in-depth follow-up assessments should be available when applying these instruments to patients with chronic illness, especially those with severe physical and mental impairments.
Our study has several strengths. First, this was a multi-centre study whose participants may represent typical HIV-positive patients receiving care in Ontario [28]. Second, this is the first study to compare three common screening instruments for depression in a developed country. Unlike Akena et al. (2013) [48], our sample size calculation allowed for detecting differences between AUCs of the instruments, thereby allowing for direct comparison of their diagnostic accuracy. Comparing instruments within a single sample may overcome the heterogeneity issues that have been reported in recent reviews [21,22]. Third, our analysis also considered the potential impacts of somatic symptoms of HIV infection on the diagnostic accuracy of the instruments [32]. Finally, we adopted advanced statistical techniques to examine the impacts of potential factors that might affect the performance of the instruments [34].
There may be some limitations to our results. First, although the M.I.N.I. has frequently been adopted as a "gold standard" for validation studies among the general population and HIV-positive patients [20,22], it is an abbreviated structured interview for psychiatric diagnoses; therefore, it is imperfect when compared to the SCID or ICD-10. This may affect the discriminatory accuracy of the instruments. However, prior evidence has shown the M.I.N.I. to have high sensitivity (94-96%) and specificity (79-88%) for identifying major depressive disorders when compared against SCID or ICD-10 criteria [29][30][31]. Misclassification from use of the M.I.N.I. as the gold standard would have produced underestimates of sensitivity and specificity. Second, interviewer bias is likely because the M.I.N.I. interviews were conducted by nurses familiar with the clinical histories of their patients. It is possible that the nurses recalled the mental health conditions of their patients from previous appointments and that these recollections affected the interviews. Third, the completion of the screening instruments may have had a positive impact on the performance of the M.I.N.I. through priming (i.e., exposure to the screening instruments may have influenced how participants responded to the M.I.N.I.). This implies that the subsequent M.I.N.I. may have been more likely to detect depression. Future studies should replicate our results by randomizing the order of the M.I.N.I. and the screening instruments to determine whether priming is a possibility. Fourth, our study might have been under-powered when testing for equality of the AUCs of the instruments because the difference in AUCs (0.15) that we assumed from Akena et al. (2013) was larger than the differences observed in our study [48]. Replication with a larger sample is desirable. Fifth, although efforts were made to ensure that our sample represented typical HIV-positive patients in Ontario, differences have been noted between the overall OCS cohort and non-OCS participants [53].
Despite the limitations noted above, our findings demonstrate excellent diagnostic accuracy and reliability of the CESD20, K10, and PHQ9 for current major depression in HIV-positive patients in Ontario. Additionally, the diagnostic accuracy of the three instruments and their short forms was comparable. When follow-up assessments become available, shorter instruments may find greater acceptance and yield clinical benefits in relation to depression when incorporated into fast-paced speciality HIV care.