Factorial validity and invariance of the Patient Health Questionnaire (PHQ)-9 among clinical and non-clinical populations

The Patient Health Questionnaire-9 (PHQ-9) is commonly used to screen for depressive disorder and to monitor depressive symptoms. However, there are mixed findings regarding its factor structure (i.e., whether it has a unidimensional, two-dimensional, or bi-factor structure). Furthermore, its measurement invariance between non-clinical and clinical populations, and between patients with major depressive disorder (MDD) and those with MDD and comorbid anxiety disorder (AD), is unknown. Japanese adults with MDD (n = 406), MDD with AD (n = 636), and no psychiatric disorders (non-clinical population; n = 1,163) answered this questionnaire on the Internet. Confirmatory factor analyses showed that the bi-factor model had a better fit than the unidimensional and two-dimensional factor models did. The results of a multi-group confirmatory factor analysis indicated scalar invariance between the non-clinical and only MDD groups, and between the only MDD and MDD with AD groups. In conclusion, the bi-factor model with two specific factors was supported among the non-clinical, only MDD, and MDD with AD groups. The scalar measurement invariance model was supported between the groups, which indicates that the total and sub-scale scores are comparable between groups.


PLOS ONE | https://doi.org/10.1371/journal.pone.0199235 | July 19, 2018

Introduction
Depression is an exceedingly common comorbid condition in several mental disorders. To date, numerous self-report measures of depression have been developed, many of which are commonly used in clinical practice and research [1]. In particular, the Patient Health Questionnaire (PHQ) is one of the most useful measures for monitoring depressive symptoms and for screening for major depressive disorder (MDD) [2]. The 9-item version of the PHQ (PHQ-9) [3] is a brief and simple self-administered measure that corresponds with the nine diagnostic criteria for depressive disorder of the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). The PHQ-9 has been found to have high reliability and validity in Western populations, and can be used as a one- or two-item measure. The guidelines of the National Institute for Health and Clinical Excellence recommend the PHQ-9 for assessing the severity of depressive symptoms in clinical practice [4].
Despite its numerous advantages, the PHQ-9 has at least two main issues stemming from a lack of previous research, namely, its unclear factor structure and measurement invariance. First, there are mixed findings regarding the factor structure of the PHQ-9 in Western populations. Some studies using clinical samples (both in primary care and psychiatric settings) have determined that the PHQ-9 has a unidimensional factor structure [5]. In contrast, other studies on psychiatric patients or individuals with physical illness and depression have determined that a two-dimensional factor structure (including somatic and cognitive/affective symptom factors) had a better fit [6,7,8]. However, no studies have yet examined a possible bi-factor model, which represents the existence of both a general factor and specific sub-factors called group factors [9,10]. For example, similar mixed results regarding the factorial unidimensionality of the Symptom Checklist-90-Revised have been resolved by testing a bi-factor model [11]. Hence, we hypothesized that the bi-factor model would show a better fit than the unidimensional or two-dimensional factor models.
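To make the three competing structures concrete, they can be written as measurement equations. The notation here is an assumption introduced for illustration and does not come from the study: x_i is the response to item i, s(i) denotes the specific factor (somatic or cognitive/affective) to which item i is assigned, item intercepts are omitted, and the mutually uncorrelated factors in the bi-factor model reflect the conventional specification rather than anything stated in the text.

```latex
\begin{align*}
\text{Unidimensional:} \quad & x_i = \lambda_i\,\xi + \varepsilon_i \\
\text{Two-dimensional:} \quad & x_i = \lambda_i\,\xi_{s(i)} + \varepsilon_i,
  \qquad s(i) \in \{\mathrm{som}, \mathrm{cog}\},\quad
  \operatorname{Cov}(\xi_{\mathrm{som}}, \xi_{\mathrm{cog}}) \text{ free} \\
\text{Bi-factor:} \quad & x_i = \lambda_i^{G}\,\xi_G + \lambda_i^{S}\,\xi_{s(i)} + \varepsilon_i,
  \qquad \operatorname{Cov}(\xi_G, \xi_{\mathrm{som}}) = \operatorname{Cov}(\xi_G, \xi_{\mathrm{cog}})
  = \operatorname{Cov}(\xi_{\mathrm{som}}, \xi_{\mathrm{cog}}) = 0
\end{align*}
```

Under this reading, every item in the bi-factor model loads on the general factor and on exactly one specific factor, which is how a single instrument can show both unidimensional and two-dimensional behavior.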
Second, previous studies have not examined the measurement invariance [12] of the PHQ-9 between non-clinical and clinical populations. Additionally, no studies have yet examined whether the factor structure of the PHQ-9 can be assumed to be equivalent between patients with only MDD and those with MDD who have comorbid anxiety disorder (AD). Given the high co-occurrence rate of depression and anxiety [13], it is necessary to examine the factor structure of the PHQ-9 between patients with only MDD and those with MDD who have comorbid AD.
In the present study, using an existing large dataset of non-clinical and clinical populations in Japan, we (1) compared the fit of the unidimensional, two-dimensional, and bi-factor models of the PHQ-9 via a confirmatory factor analysis; and (2) examined the measurement invariance between non-clinical, only MDD, and MDD with AD groups using a multi-group confirmatory factor analysis.

Participants and procedure
This study was part of a larger web-based survey examining the emotions and psychopathology of Japanese clinical and non-clinical populations [14,15]. We recruited participants from panelists registered with Macromill Incorporation. This company is one of the largest Japanese internet marketing research companies and has been used in previous studies [16]. A total of 2,830 individuals (1,547 females, 1,283 males; mean age = 42.44 years; SD = 10.39 years; range = 19−79 years) were selected randomly according to their age, gender, and living area from each population, including 619 individuals with MDD, 576 with social anxiety disorder, 619 with panic disorder, 645 with obsessive-compulsive disorder, and 371 without any psychiatric disorder (i.e., non-clinical population). The participants self-reported their diagnoses by answering items regarding their current diagnoses and treatment of mental disorders, such as the following: "Are you currently diagnosed as having Major Depressive Disorder and being treated for the problem in a medical setting?" Similarly, they were asked to respond "yes" or "no" to questions about diagnoses of social anxiety disorder, panic disorder, and obsessive-compulsive disorder. This study was approved by the institutional review board of the National Center of Neurology and Psychiatry (approval number: A2013-002).

Measurements
Japanese version of the Patient Health Questionnaire-9 (J-PHQ-9). The J-PHQ-9 assesses the frequency with which the nine symptoms of depression occurred over the last two weeks [17]. The participants rate each of the nine items on a scale ranging from 0 (not at all) to 3 (nearly every day). The reliability of the English version of the PHQ-9 is excellent, as evidenced by previous reports of internal reliability (Cronbach's α of .86 to .91) [3,5] and test-retest reliability [3]. The construct validity of the English version of the PHQ-9 has been confirmed by previous studies, which reported that increasing PHQ-9 scores were associated with worsening function [3,18], increasing depression assessed using other measures [6,18], increasing anxiety [6], and decreasing psychological well-being [6]. Additionally, in the present study, the internal reliability of the J-PHQ-9 was excellent, as evidenced by Cronbach's α values of .93, .84, and .91 for the total score, somatic score, and cognitive/affective score, respectively. The J-PHQ-9 also had good convergent validity, as it was associated with the Japanese versions of the Kessler Psychological Distress Scale (K6) [19] (r = .81) and the Center for Epidemiologic Studies Depression Scale [20] (r = .86). In this study, we used the sum of the item scores of this scale for our analyses.
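The scoring and internal reliability described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function names `phq9_total` and `cronbach_alpha` are hypothetical, and the responses in `demo_rows` are made up for demonstration and are not study data. Cronbach's α is computed with the standard formula, α = k / (k − 1) × (1 − Σ item variances / variance of total scores).

```python
from statistics import variance


def phq9_total(row):
    """Sum the nine item ratings (each 0-3) for one respondent."""
    if len(row) != 9 or any(v not in (0, 1, 2, 3) for v in row):
        raise ValueError("expected nine ratings, each 0, 1, 2, or 3")
    return sum(row)


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item ratings."""
    k = len(rows[0])                      # number of items
    columns = list(zip(*rows))            # one tuple of ratings per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Made-up responses for illustration only (not study data).
demo_rows = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 2, 1, 1, 0, 1, 1, 0],
    [2, 2, 2, 1, 2, 1, 2, 1, 0],
    [3, 2, 3, 2, 3, 2, 2, 2, 1],
    [3, 3, 3, 3, 3, 2, 3, 3, 1],
]

totals = [phq9_total(r) for r in demo_rows]
alpha = cronbach_alpha(demo_rows)
print(totals, round(alpha, 3))
```

Passing only the columns assigned to one sub-scale to `cronbach_alpha` would give the corresponding sub-scale reliability; the item-to-sub-scale assignment is not listed in the text, so it is left out here.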

Statistical analysis
First, we conducted a confirmatory factor analysis of the PHQ-9 using the data collected from the entire sample (n = 2,205). In this analysis, we determined and compared the fit of the three factor models described above using the full information maximum likelihood method. In the unidimensional factor model, all items loaded onto a single factor (Fig 1) [21]. In the two-dimensional factor model, items loaded onto one of the latent factors of somatic and cognitive/affective symptoms (Fig 2) [8]. Finally, in the bi-factor model, we designated somatic and cognitive/affective symptoms as specific group factors, and the sum of the item scores as the general factor (Fig 3).
Second, to examine the measurement invariance across non-clinical, only MDD, and MDD with AD populations, we conducted a multi-group confirmatory factor analysis [22]. We examined the measurement invariance of the PHQ-9 scores between the non-clinical and only MDD groups, and between the only MDD and MDD with AD groups. We constructed the following five increasingly restrictive models: where all parameters were free (Model 1: configural invariance); where loadings were invariant (Model 2: metric invariance); where loadings and intercepts were invariant (Model 3: scalar invariance); where loadings, intercepts, and residuals were invariant (Model 4: error variance invariance); and where loadings, intercepts, residuals, and factor means were invariant (Model 5: factor variance invariance).
We used the following fit indices to evaluate the models: chi-square, root mean square error of approximation (RMSEA), Akaike information criterion (AIC), Bayesian information criterion (BIC), comparative fit index (CFI), and standardized root mean square residual (SRMR). Goodness-of-fit indices were examined in light of the following standards used in past literature [23]: the chi-square test (χ²) should not be significant; the RMSEA should be < .10 for acceptable fit and < .06 for good fit; the CFI should be ≥ .90 for acceptable fit and > .95 for good fit; and the SRMR should be < .10 for acceptable fit and < .08 for good fit. The following criterion was used for adopting a model: a difference of less than .01 in the CFI (ΔCFI) supports the less parameterized model [24].
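The cut-offs and the ΔCFI rule in this section can be expressed as a small decision procedure. The sketch below rests on several assumptions: the function names are hypothetical, the CFI cut-off for acceptable fit is read as "at least .90", each model is compared with the immediately preceding, less restrictive one, and the CFI values in `hypothetical_cfi` are invented for illustration and are not the values in Table 3.

```python
def rate_fit(rmsea, cfi, srmr):
    """Rate each index as 'good', 'acceptable', or 'poor' per the stated cut-offs."""

    def lower_is_better(value, good_below, acceptable_below):
        if value < good_below:
            return "good"
        if value < acceptable_below:
            return "acceptable"
        return "poor"

    if cfi > 0.95:
        cfi_rating = "good"
    elif cfi >= 0.90:
        cfi_rating = "acceptable"
    else:
        cfi_rating = "poor"

    return {
        "RMSEA": lower_is_better(rmsea, 0.06, 0.10),
        "CFI": cfi_rating,
        "SRMR": lower_is_better(srmr, 0.08, 0.10),
    }


def adopt_invariance_model(cfi_values, threshold=0.01):
    """Return the number of the most restrictive model that is retained.

    cfi_values lists the CFI of Model 1, Model 2, ... in order of
    increasing restriction. A more restrictive model is adopted while
    its CFI drops by less than `threshold` relative to the previous one.
    """
    adopted = 1
    for number in range(2, len(cfi_values) + 1):
        drop = cfi_values[number - 2] - cfi_values[number - 1]
        if drop < threshold:
            adopted = number
        else:
            break
    return adopted


# Invented CFI values for Models 1-5 (illustration only, not Table 3).
hypothetical_cfi = [0.980, 0.978, 0.975, 0.950, 0.940]

print(rate_fit(rmsea=0.05, cfi=0.96, srmr=0.04))
print(adopt_invariance_model(hypothetical_cfi))
```

With these invented values the procedure stops at Model 3, because the step from Model 3 to Model 4 lowers the CFI by more than .01.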

Multi-group confirmatory factor analysis
First, we conducted a multi-group confirmatory factor analysis (of the bi-factor model) for the non-clinical and only MDD groups (Table 3). According to the criterion for adopting the model, Model 3 (scalar invariance) showed the best fit. Second, we conducted a multi-group confirmatory factor analysis using the only MDD and MDD with AD groups (Table 3). Similar to the findings pertaining to the non-clinical and only MDD groups, Model 3 (scalar invariance) showed the best fit. Scalar invariance indicates that differences in item means between groups reflect differences in the factor means.
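The interpretation of scalar invariance stated above can be made explicit with a short derivation. The notation is an assumption introduced for illustration: a single latent factor is shown for simplicity (the bi-factor case adds one such term per factor), x_{ig} is the response to item i in group g, τ is an intercept, λ a loading, and κ_g the factor mean in group g.

```latex
\begin{gather*}
x_{ig} = \tau_{ig} + \lambda_{ig}\,\xi_g + \varepsilon_{ig},
  \qquad E[\xi_g] = \kappa_g, \quad E[\varepsilon_{ig}] = 0 \\
\text{metric invariance: } \lambda_{ig} = \lambda_i
  \qquad
\text{scalar invariance: } \lambda_{ig} = \lambda_i,\ \tau_{ig} = \tau_i \\
E[x_{i1}] - E[x_{i2}]
  = (\tau_i + \lambda_i \kappa_1) - (\tau_i + \lambda_i \kappa_2)
  = \lambda_i\,(\kappa_1 - \kappa_2)
\end{gather*}
```

Because the intercepts cancel only when they are equal across groups, a difference in observed item means can be attributed to a difference in factor means only once scalar invariance holds.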

Discussion
In this study, we compared a bi-factor model of the PHQ-9 with unidimensional and two-dimensional factor models and examined the measurement invariance of the PHQ-9 across non-clinical, only MDD, and MDD with AD groups. Among both non-clinical and clinical populations, we found that the bi-factor model had the best fit. This explains the mixed results found in previous studies, which reported that the PHQ-9 has either a unidimensional or a two-dimensional factor structure [8,21]. The bi-factor model allows one to use the unidimensional factor model of the PHQ-9; that is, we can use the cut-off point and the total score as a single variable. Additionally, we can use the two-dimensional factor model of the PHQ-9 for assessing more detailed symptoms. Moreover, the general PHQ-9 factor accounted for over 40% of the common variance. Thus, using both the total score and the sub-scale scores allows us to assess patients' symptoms more precisely. In addition, we may be able to detect changes in patients' symptoms due to treatment more fully by using the PHQ-9 regularly during treatment. This in turn will aid the implementation of appropriate treatment or the modification of the treatment according to the patients' needs.
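The share of common variance attributed to the general factor, mentioned above, is often summarized with the explained common variance (ECV) index. The text does not say which index was used, so the following is only a sketch under that assumption: the function name is hypothetical, and the standardized loadings are invented for illustration rather than taken from the study.

```python
def explained_common_variance(general_loadings, specific_loadings):
    """ECV: squared general-factor loadings over all squared loadings."""
    general = sum(l * l for l in general_loadings)
    specific = sum(l * l for l in specific_loadings)
    return general / (general + specific)


# Invented standardized loadings for nine items (illustration only).
# Each item has one loading on the general factor and one on its
# specific factor (somatic or cognitive/affective).
hypothetical_general = [0.6] * 9
hypothetical_specific = [0.5] * 9

ecv = explained_common_variance(hypothetical_general, hypothetical_specific)
print(round(ecv, 2))
```

A larger ECV means the total score carries most of the shared information, while a smaller one leaves more room for the sub-scale scores to add detail.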
According to the results of the measurement invariance analysis, the scalar invariance model showed the best fit between the non-clinical and only MDD groups, and between the only MDD and MDD with AD groups, which means that we can compare the latent means of the PHQ-9 between these populations. Although the PHQ-9 total, somatic, and cognitive/affective scores of the MDD with AD group were higher than those of the only MDD and non-clinical groups, these populations responded to each item similarly.
This study has several limitations. First, we might have obtained a biased sample because we conducted a web-based survey. For example, patients with more severe depressive symptoms and those who do not use the Internet frequently might have been excluded from this web-based survey. Second, participants were asked to report their own diagnoses and were not interviewed to assess whether they actually had MDD/AD. In other words, some of the participants might not have met the required diagnostic criteria for MDD/AD. This is in part supported by the lower mean PHQ-9 scores in the present study (M = 12.42 for MDD only, M = 15.86 for MDD/AD) compared with those found in previous studies conducted in Western countries. For example, Kroenke [3] reported a mean score of 17.1 among 41 patients with MDD, and Petersen [8] reported a mean score of 17.3 among 626 such patients. Future studies must test the higher-order factorial model and assess measurement invariance with participants diagnosed using a structured interview. Finally, we used only Japanese non-clinical and clinical populations, making it unclear whether these results are applicable to Western populations.