Factor structure and measurement invariance across various demographic groups and over time for the PHQ-9 in primary care patients in Spain

  • César González-Blanch ,

    Roles Funding acquisition, Writing – original draft, Writing – review & editing

    Affiliation Mental Health Centre, University Hospital “Marqués de Valdecilla”- IDIVAL, Santander, Spain

  • Leonardo Adrián Medrano,

    Roles Conceptualization, Formal analysis, Writing – review & editing

    Affiliation Faculty of Psychology, University Siglo 21, Córdoba, Argentina

  • Roger Muñoz-Navarro,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Basic Psychology, Faculty of Psychology, University of Valencia, Valencia, Spain

  • Paloma Ruíz-Rodríguez,

    Roles Writing – review & editing

    Affiliation Castilla La Nueva Primary Care Centre, Health Service of Madrid, Madrid, Spain

  • Juan Antonio Moriana,

    Roles Writing – review & editing

    Affiliation Department of Psychology, University of Córdoba/ Maimónides Institute for Research in Biomedicine of Cordoba-IMIBIC/Reina Sofía University Hospital, Córdoba, Spain

  • Joaquín T. Limonero,

    Roles Writing – review & editing

    Affiliation Department of Basic Psychology, Autonomous University of Barcelona, Bellaterra, Barcelona, Spain

  • Florian Schmitz,

    Roles Methodology, Writing – review & editing

    Affiliation Department of Psychology, Ulm University, Ulm, Germany

  • Antonio Cano-Vindel,

    Roles Conceptualization, Funding acquisition, Supervision, Writing – review & editing

    Affiliation Department of Basic Psychology, University Complutense of Madrid, Madrid, Spain

  • on behalf of the PsicAP Research Group

    Membership of the PsicAP Research Group is provided in the Acknowledgments


The Patient Health Questionnaire (PHQ-9) is a widely-used screening tool for depression in primary care settings. The purpose of the present study is to identify the factor structure of the PHQ-9 and to examine the measurement invariance of this instrument across different sociodemographic groups and over time in a sample of primary care patients in Spain. Data came from 836 primary care patients enrolled in a randomized controlled trial (PsicAP study) and a subsample of 218 patients who participated in a follow-up assessment at 3 months. Confirmatory factor analysis (CFA) was used to test one- and two-factor structures identified in previous studies. Analyses of multiple-group invariance were conducted to determine the extent to which the factor structure is comparable across various demographic groups (i.e., gender, age, marital status, level of education, and employment situation) and over time. Both one-factor and two-factor re-specified models met all the pre-established fit criteria. However, because the factors identified in the two-factor model were highly correlated (r = .86), the one-factor model was preferred for its parsimony. Multi-group CFA indicated measurement invariance across different demographic groups and across time. The present findings suggest that physicians in Spain can use the PHQ-9 to obtain a global score for depression severity in different demographic groups and to reliably monitor changes over time in the primary care setting.


Depression is one of the most common mental disorders around the world. Globally, the total number of people with depression was estimated to exceed 300 million in 2015 [1]. Depression is responsible for more ‘years lost’ to disability than any other condition in the world (according to WHO estimates, 7.5% of all years lived with disability in 2015), and it is a major contributor to the overall global burden of disease [1]. Depression is ranked by WHO as the single largest contributor to global disability. The lifetime prevalence for major depressive disorder (MDD) has been estimated at 12.8% in Europe (ESEMeD project) [2] and 10.6% in Spain [3]. The Diagnostic and Assessment Study of Mental Disorders in Primary Care (DASMAP), based on 3,815 patients from 77 primary care centres in Spain, found that almost 30% reported a lifetime history of MDD, with 9.6% experiencing MDD in the past 12 months [4]. Nevertheless, epidemiological and clinical studies show that general practitioners (GPs) fail to diagnose a large proportion of patients suffering from depression [5]. In fact, studies conducted in Spanish primary care settings show that nearly 78% of the patients with depression are misdiagnosed, indicating that only approximately one-quarter of cases diagnosed as depressed by GPs are correctly diagnosed [6]. Moreover, a high proportion of these individuals remain untreated [7]: in Spain, only approximately one-third of patients with MDD receive "minimally-adequate" treatment [8].

The Patient Health Questionnaire Depression Scale (PHQ-9) [9] is a well-validated instrument used to quickly diagnose depression, assess severity, and monitor treatment response. The PHQ-9 includes the nine signs/symptoms for the diagnosis of MDD in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), and it is a widely-used measure to assess depression in various settings [10], particularly in primary care [11], where most patients with depression are first diagnosed and treated [12].

The Spanish version of the PHQ-9 is a reliable and valid measure of depression in hospitalized patients [13] and in primary care patients [14] in Spain. However, no studies have yet examined the underlying factor structure of this instrument in the Spanish population. Studies conducted in other Spanish-speaking populations have reported mixed results that support both a one-factor structure [15–18] as well as a two-factor structure that includes both affective and somatic symptoms [19, 20]. For instance, Huang et al. [17] investigated the factor structure of the PHQ-9 in a large sample of 5,053 primary care patients, finding that one general factor accounted for the communality of all PHQ-9 items. Additionally, a Spanish version of the scale developed in Mexico presented a unidimensional factor structure in a large sample (n = 55,555) of Mexican women [16]. Some studies in different countries and settings also support a unidimensional structure [21, 22], although other studies have identified a two-factor solution [23, 24].

When validating a scale such as the PHQ-9, it is important to examine whether that scale remains structurally stable over time and across different groups (i.e., measurement invariance). Examination (and confirmation) of factorial invariance is critical to assure comparability of clinical outcomes. If invariance is not given, the interpretation of manifest scores (such as feeling down or little pleasure in doing things) does not necessarily reflect the same latent factor (i.e., depression) in different groups or at different times [25]. With regard to the PHQ-9, little research has been conducted to determine whether this scale is invariant across various demographic groups (other than gender) or over time. Petersen et al. [24] found that the PHQ-9 was invariant in a sample of male and female primary care patients, and Chilcot et al. [26] showed that the factor structure of this scale was invariant over time within a palliative care population. To the best of our knowledge, no studies have yet investigated the invariance of the PHQ-9 factor structure in any Spanish version of this instrument.

Therefore, the present study had two primary aims. The first aim was to identify the factor structure of the PHQ-9 in a sample of primary care patients in Spain. The second aim was to examine the invariance of structural properties across various sociodemographic groups (i.e., gender, age, marital status, level of education, and employment situation) and over time.



This study was conducted at 22 primary care centres included in the Psychology in Primary Care (PsicAP) study [27] between the months of January and July (inclusive), 2016. The PsicAP study is a randomized controlled trial developed to test the effectiveness of a group-delivered transdiagnostic cognitive behavioral therapy (TD-CBT) versus treatment as usual (TAU) in the treatment of emotional disorders in primary care settings in Spain. During the study inclusion period, all patients aged 18 to 65 years who consulted with their GP at any of these 22 PsicAP centers and who presented a diagnosis or suspected diagnosis of an emotional disorder (anxiety, depression or somatization disorder) were invited to participate in the clinical trial. A total of 836 participants agreed to participate and were therefore included in the baseline sample used to study the factor structure of the PHQ-9. Of these 836 patients, a subgroup of 218 participants (who were finally included in the RCT and re-evaluated with the PHQ-9 three months later) was used to assess factorial invariance over time. At first contact, all candidates were given a patient information sheet containing full details about the study purpose. Subjects were required to sign an informed consent form to participate. The socio-demographics of the participants are shown in Table 1.

A total of 218 participants completed the PHQ-9 again 3 months after the initial baseline assessment. The demographic characteristics of this smaller subsample were similar to those of the original sample, with no significant differences in gender distribution (χ2 = 0.55, df = 1, p = .46), age (t = 1.18, df = 1051, p = .24), or employment condition (χ2 = 1.95, df = 6, p = .92) (see Table 1). However, significant differences were observed in the distribution of marital status (χ2 = 11.75, df = 5, p = .03) and level of education (χ2 = 12.74, df = 5, p = .03). Given the large sample size and the sensitivity of the χ2 statistic to sample size, these differences should not be over-interpreted. In fact, the effect sizes of these differences were small (Cramer’s V level of education = .10; Cramer’s V marital status = .11). We further compared whether missing cases significantly differed from non-missing cases on each PHQ-9 item. We found statistically significant differences (without adjusting for multiple comparisons) in item 1 (loss of interest) [t (834) = 2.48, p = 0.01], and item 9 (suicidal ideation) [t (834) = 2.20, p = 0.03]. However, the effect sizes for these differences were small (Hedges' g < 0.2) and thus unlikely to affect the results.
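The relation between the reported χ2 values and Cramér's V can be illustrated with a minimal sketch. It assumes a 2 × 6 contingency table (two samples by six categories, matching df = 5) and a combined N of 836 + 218 = 1054; both are assumptions made for illustration, since the exact table layout and N used by the authors are not stated here. The function name `cramers_v` is hypothetical.

```python
import math


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V for an r x c contingency table: sqrt(chi2 / (n * min(r-1, c-1)))."""
    k = min(n_rows - 1, n_cols - 1)
    if n <= 0 or k <= 0:
        raise ValueError("need a positive sample size and at least a 2 x 2 table")
    return math.sqrt(chi2 / (n * k))


# ASSUMPTION: N and table shape are illustrative, not taken from the paper.
N_ASSUMED = 836 + 218
v_marital = cramers_v(11.75, N_ASSUMED, n_rows=2, n_cols=6)
v_education = cramers_v(12.74, N_ASSUMED, n_rows=2, n_cols=6)
print(f"marital status V = {v_marital:.3f}, education V = {v_education:.3f}")
```

Under these assumptions both values fall close to .1, in line with the small effect sizes reported above; small discrepancies would be expected if the true N or table layout differs.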


Patient Health Questionnaire-9 (PHQ-9) [9]. The PHQ-9 is part of the PHQ and consists of nine items designed to assess the nine DSM-IV diagnostic criteria for MDD. The scale checks for the presence of the following symptoms over the previous two weeks: (a) depressed mood; (b) anhedonia; (c) sleep problems; (d) feelings of tiredness; (e) changes in appetite or weight; (f) feelings of guilt or worthlessness; (g) difficulty concentrating; (h) feelings of sluggishness or worry; and (i) suicidal ideation. Items are answered on a four-point Likert scale from 0–3 as follows: 0 (never), 1 (several days), 2 (more than half of the days), and 3 (most days). The Spanish version of the PHQ-9 was used in the present study. This version has demonstrated good psychometric properties, as follows: internal consistency, McDonald’s ω = .89; 88% sensitivity; 80% specificity; and positive and negative predictive values of 92% and 72%, respectively [14].
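As a minimal sketch of how a global severity score is obtained from the nine items, the total can be computed as the sum of the item responses, giving a range of 0 to 27. The function name `phq9_total` and the example responses are hypothetical and not taken from the study data.

```python
from typing import Sequence


def phq9_total(responses: Sequence[int]) -> int:
    """Sum nine PHQ-9 item responses, each coded 0-3, into a 0-27 total."""
    if len(responses) != 9:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each item must be coded 0, 1, 2 or 3")
    return sum(responses)


# Hypothetical respondent, for illustration only.
example_total = phq9_total([1, 2, 0, 3, 1, 0, 2, 1, 0])
print(example_total)
```

Validating the length and coding before summing guards against silently scoring an incomplete questionnaire.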


GPs at the participating centers were asked to identify patients who presented signs or symptoms of anxiety, depression, or physical symptoms for which no biological cause could be found. The GP then asked these candidates to participate in the study. Patients who agreed then signed the informed consent form and were scheduled to meet with a clinical psychologist, who again provided the patients with verbal and written details about the study to be sure they fully understood it. At this same appointment, the participants were asked to complete a computer-based version of the PHQ and the other study measures (n = 836). Patients with impaired vision received assistance in completing the questionnaires. Paper versions of these instruments were provided to patients who had difficulties using the computer.

A subsample of 218 cases was assessed at the 3-month follow-up. This subsample received one of the two treatments tested in the randomized controlled trial: (i) TAU, mainly pharmacological treatment by the GP, or (ii) seven 90-minute group sessions of TD-CBT delivered over 12 to 14 weeks (for further details see [27]).

Ethical aspects

The study was conducted in accordance with the Declaration of Helsinki. This project is supported by the Psicofundación (Spanish Foundation for the Promotion, Scientific and Professional Development of Psychology) and approved by the Corporate Clinical Research Ethics Committee of primary care of Valencia (CEIC-APCV), Spain, as the national research ethics committee coordinator, and the Spanish Agency of Medicines and Medical Devices (AEMPS) (EUDRACT: 2013-001955-11 and Protocol Code: ISRCTN58437086).

The study was conducted in accordance with the Spanish Law on Data Protection. Patient participation in the study was voluntary and participants were able to withdraw at any time without explanation and without negative consequences for future medical care.

Statistical analysis

Initially, an exploratory analysis of the data was performed to explore the behaviour of the variables, to evaluate the quality of the data, and to check that requirements of statistical tests were met. We checked for the existence of atypical cases, missing values and compliance with the statistical assumptions (linearity, and univariate and multivariate normality). A confirmatory factor analysis (CFA) was performed using maximum-likelihood (ML) as an estimation method; the AMOS 20 program [28] was used for these analyses. Two models were specified: a one-factor model in which the nine PHQ-9 items loaded on a single factor (called "depression"), and an alternative two-factor model in which 3 items loaded on a "somatic" factor and 6 items on a "cognitive-affective" factor, corresponding with previous research on PHQ structure.

Multiple indices were examined to determine model fit: (a) The chi-squared statistic (χ2) was reported following conventions. Given its sensitivity to sample size, it is usually significant in large samples; however, differences in χ2 can be interpreted to compare nested models. Additionally, we considered (b) the comparative fit index (CFI); (c) the Tucker-Lewis index (TLI); and (d) the root mean square error of approximation (RMSEA). To interpret these indices, we used the critical values previously recommended [29, 30]. Specifically, values > .90 and .95 for the CFI and TLI were considered benchmarks for acceptable and good fit, respectively; and RMSEA values of < .08 and .06 were benchmarks for acceptable and good fit, respectively.
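The benchmarks above can be expressed as a small decision rule. The sketch below is illustrative only: requiring all three indices to meet a benchmark jointly is an assumption, and the function name `classify_fit` is hypothetical. The example uses the values reported for the gender configural model (RMSEA = .058, CFI = .95, TLI = .93).

```python
def classify_fit(cfi: float, tli: float, rmsea: float) -> str:
    """Classify model fit using CFI/TLI > .95 and RMSEA < .06 as 'good',
    CFI/TLI > .90 and RMSEA < .08 as 'acceptable', otherwise 'poor'.

    ASSUMPTION: all three indices must meet a benchmark jointly.
    """
    if cfi > 0.95 and tli > 0.95 and rmsea < 0.06:
        return "good"
    if cfi > 0.90 and tli > 0.90 and rmsea < 0.08:
        return "acceptable"
    return "poor"


# Values reported for the configural model across gender.
gender_configural = classify_fit(cfi=0.95, tli=0.93, rmsea=0.058)
print(gender_configural)
```

Because the comparisons are strict, a CFI of exactly .95 counts as acceptable rather than good, which matches the "adequate fit" wording used for that model.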

Analyses of multiple-group invariance were conducted to determine the extent to which the factor structure was comparable across various sociodemographic groups (i.e., gender, age, marital status, level of education and employment situation) and over time. We followed the measurement invariance procedures outlined by Brown [31]. Factorial invariance is essential to provide meaningful comparisons of scores across groups or across time. Therefore, four levels of measurement invariance were sequentially tested (configural, weak, strong, and strict invariance), where each level introduces more equality constraints across groups. Configural invariance implies that the pattern of fixed and free factor loadings is equivalent. Weak factorial invariance examines the equivalence of factor loadings (i.e., items assess the latent variable in the same way across groups or time). Strong factorial invariance examines the equality of item intercepts, implying that any differences in means on the scale are due to true differences in latent means across groups or time. Finally, strict invariance (the most restrictive level of factorial invariance) examines the invariance of item residual variances; this implies that group differences in variances of scale scores are due only to group differences in depression variances, since error variances are constant across groups. Configural invariance is supported if the same unconstrained factor structure, fitted simultaneously to the split groups, yields a good fit. After testing configural invariance, we examined weak/metric invariance. The fit of the restricted model (equal factor loadings across groups) and the fit of the free model were compared in terms of their χ2 values. A non-significant increase in the χ2 value (relative to df) in the constrained model relative to the unconstrained model indicated that the constraints across groups were tenable. As an additional criterion, the change in the CFI coefficient was considered. If the drop in CFI of the constrained model relative to the unconstrained model did not exceed 0.01, the constrained model was accepted [32]. The ∆CFI criterion was argued to be superior to ∆χ2, as it is less sensitive to sample size [33]. We proceeded analogously to test strong/scalar invariance and strict invariance.
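The sequential ∆CFI rule described above can be sketched as follows. This is an illustrative simplification: it assumes the configural model already fits adequately, compares each model only with the preceding, less constrained one, and ignores the ∆χ2 test. The function name and the CFI values in the example are hypothetical, not results from this study.

```python
from typing import List, Tuple

DELTA_CFI_MAX = 0.01  # maximum tolerated drop in CFI between nested models
_EPS = 1e-9           # guards against floating-point noise at the boundary


def highest_invariance_level(cfi_by_level: List[Tuple[str, float]]) -> str:
    """Return the most constrained level whose CFI drop, relative to the
    preceding model, does not exceed DELTA_CFI_MAX.

    `cfi_by_level` is ordered from least to most constrained and starts
    with the configural model, which is assumed to fit adequately.
    """
    if not cfi_by_level:
        raise ValueError("at least the configural model is required")
    supported = cfi_by_level[0][0]
    for (_, prev_cfi), (name, cfi) in zip(cfi_by_level, cfi_by_level[1:]):
        if prev_cfi - cfi > DELTA_CFI_MAX + _EPS:
            break  # constraints worsen fit too much; stop at previous level
        supported = name
    return supported


# HYPOTHETICAL CFI values, for illustration only.
models = [("configural", 0.950), ("weak", 0.948), ("strong", 0.942), ("strict", 0.935)]
print(highest_invariance_level(models))
```

Stopping at the first failed comparison reflects the nested logic of the procedure: a stricter level is only meaningful if every weaker level holds.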


Exploratory and descriptive analysis

The Z scores for each item were calculated and univariate values considered atypical were those values outside the Z ± 3 range [34]. The existence of atypical multivariate cases was assessed using the Mahalanobis distance (D2) statistical procedure. There were no univariate atypical cases and only 7 cases showed a statistically significant distance from the centroid of the group (p < .001). In the analysis of missing values, the items presented < 5% of lost cases. Considering the low proportion of missing treatment values, "listwise deletion" was applied in accordance with published recommendations [35].
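The univariate screening step can be sketched in a few lines. The example below is a minimal illustration that flags values whose Z score lies outside ±3; it uses the sample standard deviation, which is an assumption, and the function name and data are hypothetical. The multivariate (Mahalanobis) check is not shown.

```python
import statistics
from typing import List, Sequence


def univariate_outliers(values: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Return the indices of values whose |Z| exceeds `threshold`.

    ASSUMPTION: Z scores are computed with the sample standard deviation.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # no variation, so nothing can be atypical
    return [i for i, v in enumerate(values) if abs((v - mean) / sd) > threshold]


# HYPOTHETICAL item responses (0-3 scale), for illustration only.
balanced_item = [0, 1, 2, 3] * 10
skewed_item = [0] * 50 + [3]
print(univariate_outliers(balanced_item), univariate_outliers(skewed_item))
```

With a four-point response scale, values beyond ±3 standard deviations can only arise when an item's distribution is strongly skewed, as in the second example.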

Based on the criteria proposed by George and Mallery [34], all items presented a distribution that was close to normal given that asymmetry and kurtosis values were between ±2 (see Table 2). Multivariate normality showed a Mardia index of 7.24, indicating there was no substantial deviation from normal distribution. To test the linearity assumption of the relations, linear and curvilinear estimates were calculated between pairs of items. In all cases, the linear function was superior to the curvilinear function, thus confirming the linearity assumption. S1 Table shows the inter-item correlation matrix for the PHQ-9 items.
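The ±2 rule for asymmetry and kurtosis can be illustrated with a short sketch. It uses simple moment-based estimators (with excess kurtosis, so a normal distribution scores 0); statistical packages often apply bias-corrected formulas, so this is an assumption and values may differ slightly from those in Table 2. Function names and data are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def skewness_and_excess_kurtosis(values: Sequence[float]) -> Tuple[float, float]:
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def close_to_normal(values: Sequence[float], limit: float = 2.0) -> bool:
    """Apply the rule of thumb that both statistics should lie within ±limit."""
    skew, kurt = skewness_and_excess_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit


# HYPOTHETICAL item responses, for illustration only.
print(close_to_normal([0, 1, 2, 3] * 25), close_to_normal([0] * 95 + [3] * 5))
```

The second example shows an item endorsed by very few respondents, which is the typical way a 0 to 3 item would violate the rule.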

Confirmatory factor analysis

Since normality assumptions were met, the maximum-likelihood (ML) was used as an estimation method. Both the one-factor and two-factor models presented acceptable values for the CFI and GFI indexes, but exceeded an RMSEA of .08. The model fit could be substantially improved when error terms of PHQ-9 items 1 and 2 were allowed to be correlated, reflecting that both items shared similarity that is not explained by the general depression factor. The fit indexes obtained for both models are summarized in Table 3. All items displayed substantial factor saturation, as indicated by their high factor loading (all λ .54-.77; all p< 0.05; see Fig 1). The two-factor model yielded a better fit in both the original model and the models that additionally allowed for the error correlations of PHQ items 1 and 2. However, the "somatic" and "cognitive-affective" factors were found to be highly correlated (r = .86). This indicates substantial overlap between the two factors and complicates the interpretation of corresponding test scores for diagnostic purposes.

Invariance across sociodemographic groups

Given the known issues of using the ML estimation method with ordinal data, we used the ML and the Weighted Least-Squares (WLS) methods to perform the CFA. The results obtained with both estimation methods (ML and WLS) were highly similar and therefore we report only the results of the ML estimation. To determine whether patient gender affected the measurement model, the sample was split into men (23.6%) and women (76.4%), and constraints were introduced to test whether parameters could be held equal across both groups. Then, we compared the fit of the constrained model to that of the free model, where parameters were estimated independently in each group. Both models were compared in terms of ∆χ2 and, more importantly, by ∆CFI: the more constrained model was accepted if the constraints did not significantly deteriorate model fit [29, 32]. Following Brown [31], we first examined the fit of the single-sample one-factor solutions within the specific subsamples (e.g., men-only and women-only subsample) separately. Next, we tested four levels of invariance (configural invariance, weak invariance, strong invariance, and strict invariance) using a series of increasingly restrictive models. The single-sample solutions for the one-factor model based on the men-only and women-only subsamples are shown in Table 4. Both subsamples were associated with good model fit. The test of configural invariance was supported, as evidenced by fit indices meeting the benchmarks for adequate fit (RMSEA = .058, CFI = .95, TLI = .93). All levels of invariance up to strict invariance could be assumed across gender, as evidenced by a non-significant drop in model fit (∆CFI < .01) for the successively more constrained models (Table 4).

Table 4. Fit statistics for multi-group confirmatory factor analysis by gender, age, marital status, level of education, and employment situation.

Using the same procedure, the model was evaluated to check for invariance across age (Table 4). To this end, participants were divided into a younger adult group (20 to 39 years of age) and an older adult group (40 to 59 years old) following the criteria recommended by Martín Ruiz [36]. Adolescents (< 20 years) and elderly (> 60 years) were not included in this analysis because they did not reach the minimum group sample size (> 200 cases) [37]. After dividing patients into younger adults (n = 316) and older adults (n = 433), successively stricter constraints were applied to test for configural, weak, strong, and strict invariance. The single-sample solutions for the one-factor model based on the younger adult group and the older adult group subsamples are shown in Table 4. A good model fit was obtained for both subsamples. Configural invariance was supported by fit indices meeting benchmarks for adequate fit (RMSEA = .064, CFI = .94, TLI = .92). Weak, strong, and strict invariance could be assumed across age groups, as evidenced by a non-significant drop in model fit (Δχ2 = n.s. and ∆CFI < .01) for the successively stricter models (Table 4).

To evaluate invariance across different groups of marital status, participants were split into a "paired group" (married and cohabiting participants; n = 522) and an "unpaired group" (divorced, separated, widowed, and unmarried; n = 314). Successively stricter constraints were tested to evaluate configural, weak, strong, and strict invariance. The single-sample solutions for the one-factor model based on the paired group and unpaired group subsamples are shown in Table 4, each showing a good model fit. Configural invariance was supported by fit indices meeting benchmarks for adequate fit (RMSEA = .063, CFI = .94, TLI = .92). Weak, strong, and strict invariance could be assumed across pairing status, as evidenced by a non-significant drop in model fit for the stricter models (∆CFI < .01) (Table 4).

To evaluate invariance across levels of education, participants were split into three groups: a basic education group (n = 248), a secondary education group (n = 335), and a higher education group (n = 238). Successively stricter constraints were tested to evaluate configural, weak, strong, and strict invariance. Configural invariance was supported by fit indices meeting the benchmarks for adequate fit (RMSEA = .047, CFI = .95, TLI = .93). Weak, strong, and strict invariance could be assumed across educational levels, as evidenced by a non-significant drop in model fit for the stricter models (∆CFI < .01) (Table 4).

To test invariance across employment conditions, participants were divided into two groups: an employed group (full and part-time employment; n = 422) and an unemployed group (unemployed, incapacity to work, and retired; n = 408). Successively stricter constraints were tested across groups to evaluate configural, weak, strong, and strict invariance. Model fits obtained for the employed subsample and for the unemployed subsample are shown in Table 4. These were good for both subsamples. Configural invariance was supported by fit indices meeting the benchmarks for adequate fit (RMSEA = .063, CFI = .94, TLI = .92). Weak, strong, and strict invariance could be assumed across employment conditions, as evidenced by a non-significant drop in model fit for the stricter models (∆CFI < .01) (Table 4).

Longitudinal invariance

Temporal stability, also known as longitudinal invariance, is important to make sure the instrument in question measures the same latent constructs in the same manner over time. Using an analogous procedure, we evaluated whether the measurement model was invariant across time (i.e., a 3-month interval from baseline to the 3-month assessment). Applying successively stricter parameter constraints, the analyses supported weak (∆χ2 = 60.70, df = 8, p = .57; ∆CFI < .01), strong (∆χ2 = 117.87, df = 17, p < .01; ∆CFI < .01), and strict invariance (∆χ2 = 117.87, df = 26, p < .01; ∆CFI < .01) across time.


To our knowledge, this is the first study to examine the factor structure of the Spanish version of the PHQ-9 in the Spanish population. The results of the confirmatory factor analyses in the data collected from this primary care sample identified one-factor and two-factor models, both of which met all the pre-established fit criteria. However, given that the "somatic" and "cognitive-affective" factors identified in the two-factor model were highly correlated (r = .86), the unidimensional model is more parsimonious and, hence, the preferred solution. Importantly, this one-factorial structure was found to be invariant across various demographic groups, including gender, age, marital status, level of education, and employment situation. Thus, the PHQ-9 is applicable in the Spanish population and derived scores can be validly compared without need for specific sociodemographic adjustments. Additionally, the one-factor structure was found to be stable over a 3-month period.

Consistent with the results reported in previous studies of Mexican females [16] and a US-based Spanish-speaking Latino population [15, 17, 18], our findings support the one-factor solution for the Spanish version of the PHQ-9. Most studies examining the factor structure of the PHQ-9 have corroborated the unidimensionality of the scale [21, 22, 38–40]. However, some authors have reported a two-factor structure comprising a cognitive–affective and a somatic dimension [23, 24, 26, 41, 42]. This discrepancy between studies is likely due to differences in patient populations. The current study mainly consisted of individuals with mild to moderate emotional distress in a primary care setting, which is the kind of heterogeneous sample for which the PHQ-9 was originally developed and validated to diagnose depression [9]. By contrast, studies that have found a two-factor solution have been conducted in populations that predominantly present comorbid physical conditions such as spinal cord injury [42] or cancer [26, 41]; therefore, somatic factor loading may be attributable to possible confounding effects of the physical illness [43].

Petersen et al. [24] found that a two-factor model with five ‘somatic’ items and four ‘affective/non-somatic’ items yielded the best fit in a sample of primary care patients. The one-factor and two-factor structures examined in our study displayed a poor model fit in the study carried out by Petersen et al. [24]. However, it is important to note that all the patients in their sample had a diagnosis of major depression (PHQ-9 > 9) for which antidepressive treatment was indicated. As those authors suggested, their relatively homogeneous sample may have resulted in range restriction in the measures, thereby attenuating correlations among variables. When the PHQ-9 is used with more heterogeneous samples, it is more likely to produce a one-factor solution because the variance is larger and therefore the items are more likely to load on one factor [24]. Alternatively, it may well be that somatic features are more relevant in the diagnosis of depression in samples that include individuals with moderate to severe clinical conditions. It is worth noting that while the one-factor model seems to be a more parsimonious solution in our study, the two-factor model (with a factor of 3 ‘somatic’ items) also displayed a decent model fit.

Gender invariance, which may be considered a prerequisite for making quantitative comparisons, adds important support for the validity of the PHQ-9 as a self-report screening instrument because it indicates that the measurement model of the latent depression construct is comparable in both sexes. This implies that differences in observed test scores between men and women reflect true differences in depression rather than an artefact of the measurement method. Epidemiological studies have consistently shown a higher prevalence of MDD in women than in men [44, 45]. In Spain, the ESEMeD study found 12-month prevalence rates for MDD of 2.2% for males and 5.6% for females. Moreover, this difference was even more pronounced for "any depressive disorder": 2.3% for males vs. 6.3% for females [3]. Although we found a statistically significant difference between men and women in mean PHQ-9 scores, the effect size was small.

Apart from gender, we also tested invariance across other sociodemographic characteristics such as age, marital status, level of education, and employment situation. Invariance could be assumed for all of these variables when patients were divided into comparable groups according to the criteria recommended by Martín Ruiz [36] and Barret [37]. These results strengthen the validity of the PHQ-9 as a screening tool in settings (such as primary care) where population heterogeneity is substantial.

To precisely measure true change and inter-individual differences, it is also critical to examine whether the PHQ-9 consistently measures the same construct over time. Previous studies have reported mixed results for invariance across time. One study [26] found that the PHQ-9 was invariant over time in a sample of patients newly referred to a palliative care service. In that study, patients were assessed within one week of referral (time 1) and then again 4 weeks later (time 2). The observed factor structure of the PHQ-9 appeared to be stable in this setting since the two-factor model had good fit at both time points. By contrast, other studies in patients with spinal cord injuries have reported considerable instability of the factor structure over time [42, 43]. In our results, the factor structure was equivalent at two different time points (baseline and at 3 months) during which patients received pharmacological and/or psychological treatment for emotional disorders in a primary care setting [27]. This invariance indicates that changes in PHQ-9 scores over time reflect true changes in the underlying latent construct (i.e., depression) rather than changes in the assessment or structure of the scale. The effect size for this change was large (g = 0.78) and we believe this change can be reasonably attributed to the effects of treatment.
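For readers unfamiliar with Hedges' g, the following minimal sketch shows one common way to compute it: Cohen's d with a pooled standard deviation, multiplied by a small-sample correction. It treats the two sets of scores as independent groups, which is an assumption; the text does not state which variant was used for the baseline versus follow-up comparison. The function name and the scores are hypothetical.

```python
import math
import statistics
from typing import Sequence


def hedges_g(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Hedges' g for two independent groups (pooled SD, small-sample correction)."""
    n1, n2 = len(group_a), len(group_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (n1 - 1) * statistics.variance(group_a)
        + (n2 - 1) * statistics.variance(group_b)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; g is undefined")
    d = (statistics.fmean(group_a) - statistics.fmean(group_b)) / math.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return d * correction


# HYPOTHETICAL total scores at two time points, for illustration only.
baseline_scores = [12, 14, 16, 18, 20]
follow_up_scores = [8, 10, 12, 14, 16]
print(round(hedges_g(baseline_scores, follow_up_scores), 3))
```

The correction factor approaches 1 as the sample grows, so for samples of the size analysed here g and Cohen's d would be nearly identical.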

Study strengths and limitations

The main strength of the present study is the relatively large, heterogeneous sample of patients with emotional disorders from 22 primary care centres across Spain. It is worth noting that in Spain, as in many other countries, the vast majority of patients with emotional disorders are diagnosed and treated only in the primary care setting [46]. Thus, our sample can be considered representative of persons with mild-to-moderate emotional disorders seeking help. One limitation of our study is that the PHQ-9 may have a different structure in individuals in need of specialized services, or those with more severe clinical conditions; this possibility could not be tested in the current study. We did not assess the concurrent validity of the Spanish version of the PHQ-9 against a semi-structured clinical interview to establish empirically-derived cutoff levels. However, the good sensitivity and specificity of the PHQ-9 have already been demonstrated in different populations [47], including the Spanish cultural context [13, 14]. The subsample of patients assessed at follow-up was part of an ongoing RCT to compare group CBT to TAU in a primary care setting. As a result, the subsample assessed at follow-up may not have accurately reflected the overall sample assessed at baseline. Indeed, some small but significant differences were observed in terms of marital status and level of education between baseline and follow-up assessments. We also found small but statistically significant differences between the baseline sample and the subsample used to assess factorial invariance over time on two items of the PHQ-9. Although these differences were small (Hedges' g < 0.2) and significance was not adjusted for multiple comparisons, the findings obtained from this subsample (i.e., the results of invariance over time) need to be interpreted with caution. Finally, a potential limitation regarding the data used to calculate invariance over time is that these data came from an RCT, which implies an intervention between two time-points (i.e., baseline and assessment). However, most patients who seek treatment for psychological distress in primary care will receive some kind of pharmacological or psychological treatment from their GP, and thus the fact that invariance over time was based on a subsample of patients included in an RCT (and who, therefore, received some type of intervention) may actually reflect the real-world context in which the PHQ-9 is likely to be used. Thus, it might be said that the PHQ-9 structure is invariant regardless of treatment for emotional disorders. Indeed, the fact that strict invariance holds provides additional support for the robustness of the latent structure of the scale.

Clinical and research implications

Our findings have several important clinical and research implications. First, given the high prevalence rates of emotional disorders and, particularly, depression among primary care patients, GPs need well-validated screening tools that are easy to administer, score, and interpret. The unidimensional factor structure of the PHQ-9 allows clinicians to use the scale without the need for complicated scoring algorithms. Thus, clinicians and researchers can rely on a global score and a single cut-off score. Second, measurement invariance across multiple groups provides empirical support to allow clinicians and researchers to interpret between-group comparisons as true differences in depression intensity and not a measurement artefact. For instance, establishing gender invariance (or non-invariance) is critical for studies with a gender perspective. Finally, measurement invariance across time further supports the utility of the PHQ-9 as a simple tool for monitoring treatment response.
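
To make the "global score and a single cut-off" idea concrete, the following is a minimal sketch of sum scoring for the PHQ-9, assuming the standard response format of nine items each scored 0 to 3 (total range 0 to 27). The function names are hypothetical, and the cut-off is deliberately left as a required argument: the value of 10 in the usage lines is only an illustrative assumption, not a threshold derived in this study.

```python
def phq9_total(item_scores):
    """Return the PHQ-9 global score as the simple sum of the nine item scores."""
    if len(item_scores) != 9:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2 or 3")
    return sum(item_scores)


def screens_positive(item_scores, cutoff):
    """Apply a single cut-off to the global score (cut-off chosen by the user)."""
    return phq9_total(item_scores) >= cutoff


# Hypothetical responses, for illustration only.
responses = [2, 1, 3, 1, 0, 2, 1, 0, 1]
print(phq9_total(responses))
print(screens_positive(responses, cutoff=10))  # cut-off value is an assumption
```

Because the one-factor model treats all nine items as indicators of a single construct, no subscale scores or weighting scheme are needed in this sketch.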


The current study provides important evidence regarding the construct validity and multi-group factorial and time invariance of the Spanish version of the PHQ-9 in primary care settings in Spain. In the present study, both the one-factor and the two-factor models displayed good model fit. The two-factor model (with a factor of three ‘somatic’ items highly correlated to the ‘cognitive-affective’ factor) displayed slightly better fit, while the one-factor model was preferred for its parsimony. The one-factor model was additionally demonstrated to be invariant across demographic groups and across time. The findings presented here suggest that primary care physicians in Spain can use the PHQ-9 to assess depression severity and to reliably monitor changes over time.

Supporting information

S1 Table. PHQ-9 inter-item correlation matrix.



Membership of the PsicAP Research Group:

Antonio Cano-Vindel (lead author), Juan Antonio Moriana, Francisco Santolaya Ochando, Antonio Capafons Bonet, Mª. Dolores Gómez Castillo, Fernando Chacón Fuertes, Gonzalo Jiménez Cabré, Estefanía Salgado Kvedaras, Jorge Perpiñá González, Carmen Abellán Maeso, Ana Mª. Agudo Rodrigo, Mª. del Mar Aguilar Uceda, Laura Agüero, Mónica Rodriguez-Enríquez, Mª Rosa Pizà, Ignasi Ramírez, Manuel Aires López, Ana María Alayeto Sánchez, Dolores Alfonso Doménech, Vicenta Almonacid Guinot, Sinuhe Alvarado Torres, Sonia Álvarez Gómez, Luisa Mª. Andrés Arreaza, Ana María Arnaiz Kompanietz, Mª. Pilar Arranz García, Eugenia Avelino Hidalgo, Mª. José Ávila Sánchez, Josune Barbero Goicoetxea, Manuel Barragán Solís, Pilar Barroso López, Sofía Bauer Izquierdo, Carmen Benavente Torres, Pilar Bermejo Ortega, Greta Borrás Moreno, Alejandro Buendía Romero, Carlos Buiza Aguado, José Caballero Moral, Nieves Caparrós Ezpeleta, Alejandro Casado Martínez, Cristina Casado Rodríguez, Consuelo Castiblanque Ballesteros, Francisca Ceinós Vicente, Patricia Cordero Andrés, Ana Costa Alcaraz, Isabel de Andrés Cara, Victoria de la Riva Casares, Bárbara Díaz Gómez, Mª. Dolores Domínguez Manrique, Soledad Escolar Llamazares, Encarnación Espinosa de los Monteros Zayas, Aurora Fabero Jiménez, Mª. Antonia Font Payeras, Mª. Rosa Fraile Gómez, Mariona Fuster Forteza, Montserrat Gallart Aliu, Olvido García Jaén, Carmen García Palacios, Mar García-Moreno, Yolanda Garnica Cascales, Miguel Gárriz Vera, Mª. Concepción Gómez Martín-Sonseca, Mª. Francisca Gómez Rodríguez, César González-Blanch, Ana Mª. Hermosilla Pasamar, Fernando Hernández de Hita, Margarita Herrero Delgado, Josefa Jaimez Moreno, Amale Jáuregui Larrabeiti, Juan Agustín Jiménez Luque, Antonio Jiménez Moreno, Antonio León Dugo, Carlos Lillo De la Quintana, Joaquim L. Limonero García, María Lleras de Frutos, Anna Llorca Mestre, Francisco López Ortiz, Lourdes Luceño Moreno, Pilar Madrid Almoguera, Sonia Martín de la Sierra Fuentes, Rebeca Martínez Bustos, Ainoa Mateu Mullor, Mercedes Matilla Caballero, Rafael Medina Reyes, Itsaso Mendizábal Gallestegui, Susana Merino Martín, Gonzalo David Moneva Vicente, Mª. Isabel Montejo Villa, Antonio Montero Cantero, Julio Montoya Fernández, Rosario Morales Moreno, Laura Morante Hernández, Eliana Mª. Moreno Osella, Luciana Moretti, Francisco Javier Muñoz Mora, Rafael Muñoz Sánchez-Villacañas, Juan Elías Murillo Céspedes, Soledad Nevado Roldán, Rosario Ortíz Fernández, Elia Peiró Martí, María Olga Peña Peña, Montserrat Pérez Fernández, Juana Pérez Girón, Nathalie Pérez Lizeretti, Aranzazu Pérez Medina, Pilar Pérez Ortín, José Jorge Pérez Pascual, Bartolomé Pérez Pérez, Teresa Perez Poo, María Pineda Alonso, Almudena Pinilla Carrasco, Jorge Juan Prada Pérez, Marta Quintanilla Santamaría, Ana Isabel Quiñones Gómez, María Teresa Recio García, Marcelino Requena Gallego, Mercedes Ricote Belinchón, Yolanda Rincón Villareal, Geoffrey Ritho Luhunga, Mª. Ángeles Rivas Marra, Mariano Robres Oliete, Virginia Rodríguez Coronado, Mercedes Rojo Tardón, Ana Mª. Roldán Villalobos, Mª. Teresa Rubio Rubio, Mª. Jesús Ruiz Hernández, Jesús Ruiz López, María Ruiz Torres, Lorenza Ruz Torres, Manuel Salcedo Espinosa, Monika Salgueiro, María Luisa Sánchez Benitez de Soto, Emilio Sánchez Caballero, María Teresa Sánchez Villares Rodríguez, Isabel Sepúlveda Gómez, María Serrano Miralles, Victor Julián Suberviola Collados, Beatriz Talavera Velasco, Javier Torres Ailhaud, Olga Umaran Algageme, Alazne Unanue Ortega, Iñigo Valdivielso Moneo, Cristina Valle García, Antonio Varo Soriano, José Fernando Venceslá Martínez, Mª. Jesús Villa Pérez, Esperanza Villar Coloma, Laura Yuste Hidalgo.


  1. World Health Organization. Depression and Other Common Mental Disorders: Global Health Estimates. Geneva: World Health Organization; 2017.
  2. Alonso J, Angermeyer MC, Bernert S, Bruffaerts R, Brugha TS, Bryson H, et al. Prevalence of mental disorders in Europe: results from the European Study of the Epidemiology of Mental Disorders (ESEMeD) project. Acta Psychiatr Scand Suppl. 2004;(420):21–7. pmid:15128384.
  3. Haro JM, Palacín C, Vilagut G, Martínez M, Bernal M, Luque I, et al. Prevalencia de los trastornos mentales y factores asociados: resultados del estudio ESEMeD-España. Med Clin (Barc). 2006;126(12):445–51.
  4. Serrano-Blanco A, Palao DJ, Luciano JV, Pinto-Meza A, Lujan L, Fernandez A, et al. Prevalence of mental disorders in primary care: results from the diagnosis and treatment of mental disorders in primary care study (DASMAP). Soc Psychiatry Psychiatr Epidemiol. 2010;45(2):201–10. pmid:19452110.
  5. Kamphuis MH, Stegenga BT, Zuithoff NP, King M, Nazareth I, de Wit NJ, et al. Does recognition of depression in primary care affect outcome? The PREDICT-NL study. Fam Pract. 2012;29(1):16–23. pmid:21859837.
  6. Fernández A, Pinto-Meza A, Bellón JA, Roura-Poch P, Haro JM, Autonell J, et al. Is major depression adequately diagnosed and treated by general practitioners? Results from an epidemiological study. Gen Hosp Psychiatry. 2010;32(2):201–9. pmid:20302995.
  7. Codony M, Alonso J, Almansa J, Vilagut G, Domingo A, Pinto-Meza A, et al. [Psychotropic medications use in Spain. Results of the ESEMeD-Spain study]. Actas Esp Psiquiatr. 2007;35 Suppl 2:29–36. pmid:18264867.
  8. Fernández A, Haro JM, Codony M, Vilagut G, Martínez-Alonso M, Autonell J, et al. Treatment adequacy of anxiety and depressive disorders: Primary versus specialised care in Spain. J Affect Disord. 2006;96(1–2):9–20. pmid:16793140.
  9. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9. J Gen Intern Med. 2001;16(9):606–13. pmid:11556941.
  10. Manea L, Gilbody S, McMillan D. A diagnostic meta-analysis of the Patient Health Questionnaire-9 (PHQ-9) algorithm scoring method as a screen for depression. Gen Hosp Psychiatry. 2015;37(1):67–75. pmid:25439733.
  11. Mitchell AJ, Yadegarfar M, Gill J, Stubbs B. Case finding and screening clinical utility of the Patient Health Questionnaire (PHQ-9 and PHQ-2) for depression in primary care: a diagnostic meta-analysis of 40 studies. British Journal of Psychiatry Open. 2016;2(2):127–38. pmid:27703765.
  12. Kroenke K, Spitzer RL, Williams JBW, Löwe B. The Patient Health Questionnaire Somatic, Anxiety, and Depressive Symptom Scales: a systematic review. Gen Hosp Psychiatry. 2010;32(4):345–59. pmid:20633738.
  13. Diez-Quevedo C, Rangil T, Sanchez-Planell L, Kroenke K, Spitzer RL. Validation and Utility of the Patient Health Questionnaire in Diagnosing Mental Disorders in 1003 General Hospital Spanish Inpatients. Psychosom Med. 2001;63(4):679–86. pmid:11485122.
  14. Munoz-Navarro R, Cano-Vindel A, Medrano LA, Schmitz F, Ruiz-Rodriguez P, Abellan-Maeso C, et al. Utility of the PHQ-9 to identify major depressive disorder in adult patients in Spanish primary care centres. BMC Psychiatry. 2017;17(1):291. pmid:28793892; PubMed Central PMCID: PMC5550940.
  15. Donlan W, Lee J. Screening for depression among indigenous mexican migrant farmworkers using the Patient Health Questionnaire-9. Psychol Rep. 2010;106(2):419–32. pmid:20524542.
  16. Familiar I, Ortiz-Panozo E, Hall B, Vieitez I, Romieu I, Lopez-Ridaura R, et al. Factor structure of the Spanish version of the Patient Health Questionnaire-9 in Mexican women. Int J Methods Psychiatr Res. 2015;24(1):74–82. pmid:25524806.
  17. Huang FY, Chung H, Kroenke K, Delucchi KL, Spitzer RL. Using the patient health questionnaire-9 to measure depression among racially and ethnically diverse primary care patients. J Gen Intern Med. 2006;21(6):547–52. pmid:16808734.
  18. Merz EL, Malcarne VL, Roesch SC, Riley N, Sadler GR. A multigroup confirmatory factor analysis of the Patient Health Questionnaire-9 among English- and Spanish-speaking Latinas. Cultural Diversity and Ethnic Minority Psychology. 2011;17(3):309–16. pmid:21787063.
  19. Granillo MT. Structure and Function of the Patient Health Questionnaire-9 Among Latina and Non-Latina White Female College Students. Journal of the Society for Social Work and Research. 2012;3(2):80–93.
  20. Zhong Q, Gelaye B, Rondon M, Sanchez SE, Garcia PJ, Sanchez E, et al. Comparative performance of Patient Health Questionnaire-9 and Edinburgh Postnatal Depression Scale for screening antepartum depression. J Affect Disord. 2014;162:1–7. pmid:24766996; PubMed Central PMCID: PMC4040145.
  21. Kocalevent R-D, Hinz A, Brähler E. Standardization of the depression screener Patient Health Questionnaire (PHQ-9) in the general population. Gen Hosp Psychiatry. 2013;35(5):551–5. pmid:23664569.
  22. Titov N, Dear BF, McMillan D, Anderson T, Zou J, Sunderland M. Psychometric comparison of the PHQ-9 and BDI-II for measuring response during treatment of depression. Cogn Behav Ther. 2011;40(2):126–36. pmid:25155813.
  23. Beard C, Hsu KJ, Rifkin LS, Busch AB, Björgvinsson T. Validation of the PHQ-9 in a psychiatric sample. J Affect Disord. 2016;193:267–73. pmid:26774513.
  24. Petersen JJ, Paulitsch MA, Hartig J, Mergenthal K, Gerlach FM, Gensichen J. Factor structure and measurement invariance of the Patient Health Questionnaire-9 for female and male primary care patients with major depression in Germany. J Affect Disord. 2015;170:138–42. pmid:25240840.
  25. Meredith W, Teresi JA. An Essay on Measurement and Factorial Invariance. Med Care. 2006;44(Suppl 3):S69–S77. pmid:17060838.
  26. Chilcot J, Rayner L, Lee W, Price A, Goodwin L, Monroe B, et al. The factor structure of the PHQ-9 in palliative care. J Psychosom Res. 2013;75(1):60–4. pmid:23751240.
  27. Cano-Vindel A, Muñoz-Navarro R, Wood CM, Limonero JT, Medrano LA, Ruiz-Rodríguez P, et al. Transdiagnostic Cognitive Behavioral Therapy Versus Treatment as Usual in Adult Patients With Emotional Disorders in the Primary Care Setting (PsicAP Study): Protocol for a Randomized Controlled Trial. JMIR Research Protocols. 2016;5(4):e246. pmid:28011446.
  28. Arbuckle JL. Amos 21 user’s guide. Chicago: SPSS; 2012.
  29. Byrne BM. Structural equation modelling with AMOS: Basic concepts, applications, and programming. New York: Routledge; 2016.
  30. Kline T. Psychological Testing: A Practical Approach to Design and Evaluation: SAGE Publications, Inc.; 2005.
  31. Brown TA. Confirmatory factor analysis for applied research. New York: Guilford Publications; 2014.
  32. Cheung GW, Rensvold RB. Evaluating Goodness-of-Fit Indexes for Testing Measurement Invariance. Structural Equation Modeling: A Multidisciplinary Journal. 2002;9(2):233–55.
  33. Barrera-Barrera R, Navarro-García A, Peris-Ortiz M. El papel de la invarianza factorial en la validación del constructo calidad de servicio electrónico. Revista Europea de Dirección y Economía de la Empresa. 2015;24(1):13–24.
  34. George D, Mallery M. Using SPSS for Windows step by step: a simple guide and reference. Boston: Allyn & Bacon; 2010.
  35. Gallart GC, Saris WE, Foguet JMB. Temas avanzados en modelos de ecuaciones estructurales. Madrid: La Muralla; 2005.
  36. Martín Ruiz JF. Los factores definitorios de los grandes grupos de edad de la población: tipos, subgrupos y umbrales. Scripta Nova: revista electrónica de geografía y ciencias sociales. 2005;9:181–204.
  37. Barrett P. Structural equation modelling: Adjudging model fit. Pers Individ Dif. 2007;42(5):815–24.
  38. Cameron IM, Crawford JR, Lawton K, Reid IC. Psychometric comparison of PHQ-9 and HADS for measuring depression severity in primary care. Br J Gen Pract. 2008;58(546):32–6. pmid:18186994.
  39. Erbe D, Eichert H-C, Rietz C, Ebert D. Interformat reliability of the patient health questionnaire: Validation of the computerized version of the PHQ-9. Internet Interventions. 2016;5:1–4.
  40. Ryan TA, Bailey A, Fearon P, King J. Factorial invariance of the Patient Health Questionnaire and Generalized Anxiety Disorder Questionnaire. Br J Clin Psychol. 2013;52(4):438–49. pmid:24117915.
  41. Hinz A, Mehnert A, Kocalevent R-D, Brähler E, Forkmann T, Singer S, et al. Assessment of depression severity with the PHQ-9 in cancer patients and in the general population. BMC Psychiatry. 2016;16(1). pmid:26831145.
  42. Krause JS, Reed KS, McArdle JJ. Factor structure and predictive validity of somatic and nonsomatic symptoms from the patient health questionnaire-9: a longitudinal study after spinal cord injury. Arch Phys Med Rehabil. 2010;91(8):1218–24. pmid:20684902.
  43. Richardson EJ, Richards JS. Factor structure of the PHQ-9 screen for depression across time since injury among persons with spinal cord injury. Rehabil Psychol. 2008;53:243–9.
  44. Angst J, Gamma A, Gastpar M, Lépine JP, Mendlewicz J, Tylee A. Gender differences in depression. Eur Arch Psychiatry Clin Neurosci. 2002;252(5):201–9. pmid:12451460.
  45. Kuehner C. Gender differences in unipolar depression: an update of epidemiological findings and possible explanations. Acta Psychiatr Scand. 2003;108(3):163–74. pmid:12890270.
  46. Wittchen HU, Jacobi F, Rehm J, Gustavsson A, Svensson M, Jönsson B, et al. The size and burden of mental disorders and other disorders of the brain in Europe 2010. Eur Neuropsychopharmacol. 2011;21(9):655–79. pmid:21896369.
  47. Wittkampf K, van Ravesteijn H, Baas K, van de Hoogen H, Schene A, Bindels P, et al. The accuracy of Patient Health Questionnaire-9 in detecting depression and measuring depression severity in high-risk groups in primary care. Gen Hosp Psychiatry. 2009;31(5):451–9. pmid:19703639.