Assessing the dimensionality of the CES-D using multi-dimensional multi-level Rasch models

Objectives The CES-D is a widely used depression screening instrument. While numerous studies have analysed its psychometric properties using exploratory and various kinds of confirmatory factor analyses, only a few studies have used Rasch models and none a multidimensional one. Methods The present study applies a multidimensional Rasch model using a sample of 518 respondents representative of the Austrian general population aged 18 to 65. A one-dimensional model, a four-dimensional model reflecting the subscale structure suggested by [1], and a four-dimensional model with the background variables gender and age were applied. Results While the one-dimensional model showed relatively good fit, the four-dimensional model fitted much better. EAP reliability indices were generally satisfactory and the latent correlations varied between 0.31 and 0.88. In the analysis involving background variables, we found a limited effect of the participants’ gender. DIF effects were found, unveiling some peculiarities. The two-item subscale Interpersonal Difficulties showed severe weaknesses, and the Positive Affect subscale with the reversed item wordings also showed unexpected results. Conclusions While a one-dimensional over-all score might still contain helpful information, the differentiation according to the latent dimensions is strongly preferable. Altogether, the CES-D can be recommended as a screening instrument; however, some modifications seem indicated.

In contrast, a much smaller number of studies applied IRT models: For example, Stansbury et al. [65] applied a Rasch model (RM) to a sample of approximately 2,500 community-dwelling elderly, found the reverse scored items (4, 8, 12, and 16) not in line with a one-dimensional latent construct and, therefore, eliminated them. But even the reduced set of 16 items still showed deviations from a uni-dimensional construct. Pickard et al. [66] analysed a sample of 101 stroke and 366 primary care patients with the RM, reporting generally good fit except for five items (2, 11, 15, 17, and 19). Gay et al. [67] applied a Rasch analysis to a sample of 347 adults with HIV/AIDS, revealing five items (2, 4, 8, 11, and 16) as problematic; however, even their omission would not improve the overall performance of the scale. Kim and Park [68] found in a convenience sample of 183 Korean stroke survivors that items 2, 8, and 11 misfit the RM. Covic et al. [69] and Covic et al. [70] investigated samples of Rheumatoid Arthritis patients with an RM, promoting a 13-item short version of the CES-D (omitting items 2, 4, 8, 11, 12, 16, and 18) and rescoring the remaining items to a three-category response format (merging the two middle categories). Two further studies applying an IRT model to the CES-D [71,72] focussed on linking scores of various depression assessments and were therefore not considered in the present article. Table 1 summarizes problematic items identified in the cited studies.

Research question
These results of the IRT analyses indicate that a one-dimensional model does not seem to adequately describe the data generating mechanism. The often applied CFA approach already allows for a multidimensional analysis (and the results of these studies indeed support a multi-dimensional structure of the CES-D); however, the CFA model was originally developed for interval scaled data, assuming linear relationships and a multivariate normal distribution. Although extensions covering ordered categorical data and non-normality exist, the IRT family of models is specifically designed for (ordered) categorical data as obtained from questionnaires like the CES-D. Amongst other things, the IRT approach allows for a detailed analysis of items and item categories, specifically taking into account the categorical response format (for a direct comparison of the various approaches see [73]). To the authors' knowledge, the CES-D has so far not been analysed with a multidimensional IRT model. Moreover, the present study is the first to also take background variables into account. Misfit of the models applied so far could very well be due to the fact that the CES-D has been applied in specific populations (HIV/AIDS, community-dwelling elderly, stroke & primary care patients, and stroke survivors), although it was originally designed for "general population surveys" [1] (p. 386). Hence, the results obtained so far are of limited value, as it remains unclear whether they also apply to the general population. To shed light on this open question, the present study uses a representative sample from the general population. To the authors' knowledge, this study is the first one analyzing the CES-D on the basis of a representative sample using a multi-dimensional IRT model.

Sample
The sample consisted of 518 respondents randomly selected from a large Austrian address broker's data base of phone numbers covering approximately 75% of the Austrian population according to the seller's information. The sample covered persons aged 18-65 years. Because no population register was available to us, a simple random sample was not feasible, and we applied a complex sampling scheme: Austria has 9 provinces, which have key responsibilities in certain public health issues relevant to our research question. Therefore, we decided to represent them accordingly in the sample by stratification. As the data collection was based on face-to-face interviews, the routes to the households had to be taken into account. Therefore, we used within each stratum a cluster sampling scheme based on districts, which are available in the data base. Based on logistic and financial capabilities, we decided to sample a total of 40 districts, which were drawn at random taking proper shares of urban vs. rural regions into account. The required number of respondents per district was determined proportionally to the respective gender shares and district sizes. The resulting number of male and female respondents per district was drawn at random from the districts' addresses in the data base. The sample size was chosen in line with general recommendations, for example as given by [74], stating that 500 establishes a "Size for most purposes" even under "Adverse Circumstances" (p. 328).
First, a notification letter informing about study aims and processes was sent to the selected respondents. Then, study workers called each person by phone and asked for permission to visit them for performing the interviews and filling out the questionnaire. Those agreeing to the interview were visited at home. Persons who could not be reached (e.g., due to change of address or phone number) or who refused study participation were replaced by further addresses from a back-up list sampled in the same way as the primary list.
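The proportional allocation of respondents to districts described above can be illustrated with a small sketch. It is not the authors' procedure: the largest-remainder rounding rule, the function name `allocate_proportionally`, the district names, the district sizes, and the total of 50 interviews are all hypothetical and chosen only to make the arithmetic visible.

```python
def allocate_proportionally(sizes, total):
    """Split `total` interviews across units proportionally to their sizes.

    Uses largest-remainder rounding (an assumption, not taken from the study)
    so that the integer allocations add up exactly to `total`.
    """
    grand_total = sum(sizes.values())
    quotas = {unit: total * size / grand_total for unit, size in sizes.items()}
    allocation = {unit: int(quota) for unit, quota in quotas.items()}
    leftover = total - sum(allocation.values())
    # Give the remaining interviews to the units with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda u: quotas[u] - allocation[u], reverse=True)
    for unit in by_remainder[:leftover]:
        allocation[unit] += 1
    return allocation


# Hypothetical district sizes (number of listed addresses), for illustration only.
district_sizes = {"district_A": 12500, "district_B": 30100, "district_C": 7400}
allocation = allocate_proportionally(district_sizes, total=50)
print(allocation)
```

The same function could be applied a second time within each district to split its allocation by gender shares.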

Assessments
Psychiatric case identification was performed by using the SCAN 2.0, the Schedules for Clinical Assessment in Neuropsychiatry [75]. The SCAN is a semi-structured clinical interview designed for use by psychiatrists and clinical psychologists. Every symptom in SCAN is defined in detail [76] and wording is suggested for eliciting each symptom. However, interviewers had to continue inquiring until they had sufficient information to decide whether or not symptom definitions were fulfilled. Its feasibility and reliability have been tested in international field trials [75]. Diagnoses were given according to ICD-10 [77] using a computer algorithm provided for SCAN. Only current disorders (occurring during the 4 weeks before interview) were evaluated in the present study. Eleven psychologists were recruited as interviewers and were trained by experienced staff from one of the WHO-designated SCAN training centres. All interviewers performed several pilot SCAN interviews before data collection started.
Study participants could decide whether they wanted to start with the questionnaire or the research interview. Either way, interviewers were not aware of the CES-D results. Study participants were included only if they had signed the informed consent. The study was approved by the Ethics Committee of the Medical University of Vienna.

Model
The response format of the CES-D provides four categories, requiring a polytomous version of the Rasch model. One frequently applied model of this kind is the partial credit model (PCM; [78]). However, the PCM is a one-dimensional Rasch model, i.e., we cannot describe more than one subscale at a time. Multidimensional IRT models, which assume more than one latent dimension to generate the responses, are also available (cf. [79]). A versatile multidimensional formulation is the multidimensional random coefficients multinomial logit model (MRCMLM; [80]). It covers multidimensionality and allows for controlling for background variables, which each latent factor can be regressed upon. We used a between-item-multidimensional formulation, i.e., each item is associated with exactly one latent factor (cf. [79]). Our analysis strategy was to apply first a one-dimensional model and contrast it to (a) the four-dimensional model and (b) the four-dimensional model with background variables. Finally, we performed a differential item functioning analysis (DIF; [81]) to identify potentially problematic items.
For assessing model fit, we use the infit measure [82,83], the ideal value of which is one. Values larger than one indicate an increasing amount of responses differing from what the model would predict. Values below one indicate responses showing less variability than expected. Critical limits for the infit measure were chosen at 0.7 and 1.3 (cf. [84]). Further, the MRCMLM provides the EAP reliability index (based on Expected A Posteriori parameter estimates, cf. [85,86]) for each latent scale, which can be seen as an equivalent to the classical reliability measure, but for Rasch models; its value should be close to one. For comparing models we use the information based indices AIC [87], the bias corrected AIC (AICc; [88,89]), the Bayesian information criterion (BIC; [90]), the adjusted BIC (aBIC; [91]), and the consistent AIC (CAIC; [92]). Information based indices allow for comparing competing models applied to the same data set, with smaller values indicating better over-all model fit. Moreover, we compare nested models with the likelihood ratio test (LRT; [93]).
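To make the polytomous Rasch formulation more concrete, the following minimal sketch computes the category probabilities of a single four-category item under the PCM in its usual step-parameter form. It is an illustration only and not the TAM implementation used in the study; the function name `pcm_probabilities` and the step parameter values are hypothetical.

```python
import math


def pcm_probabilities(theta, deltas):
    """Category probabilities of one item under the partial credit model.

    theta  : person parameter on the latent dimension
    deltas : step parameters delta_1..delta_m (the item has m + 1 categories)
    """
    # Cumulative sums of (theta - delta_j); category 0 corresponds to an empty sum.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Normalise with a numerically stable softmax.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical step parameters for a four-category item (categories 0 to 3).
probs = pcm_probabilities(theta=0.0, deltas=[-0.5, 0.4, 1.2])
print([round(p, 3) for p in probs])
```

In a between-item multidimensional model, the same expression would apply item by item, with `theta` taken from the one latent dimension to which the item is assigned.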
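As a rough illustration of the infit measure, the sketch below computes the classical information-weighted mean square for one item: the sum of squared residuals divided by the sum of model variances. This is a simplified textbook form that assumes the model-implied category probabilities per person are already known; TAM's own fit routine may differ in detail. The function names and the toy data are hypothetical.

```python
def infit_mean_square(responses, category_probs):
    """Information-weighted mean square (infit) for one item.

    responses      : observed category (0, 1, 2, ...) for each person
    category_probs : for each person, the model-implied category probabilities
    """
    squared_residuals = 0.0
    variances = 0.0
    for observed, probs in zip(responses, category_probs):
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals += (observed - expected) ** 2
        variances += variance
    return squared_residuals / variances


def within_limits(infit, lower=0.7, upper=1.3):
    """Check an infit value against the critical limits named in the text."""
    return lower <= infit <= upper


# Toy data: four persons, a dichotomous item, each category equally likely.
toy_probs = [[0.5, 0.5]] * 4
toy_infit = infit_mean_square([0, 1, 0, 1], toy_probs)
print(toy_infit, within_limits(toy_infit))
```

Values above one arise when the squared residuals exceed the variance the model expects, values below one when the responses are more predictable than the model implies.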
We used R [94] for all calculations and graphics and the R-package Test Analysis Module (TAM; [95]) for the MRCMLM. A critical alpha of 5% (0.05) was applied for inferential assessment.
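The information based indices named above can all be derived from a model's deviance, its number of estimated parameters, and the sample size. The sketch below uses the commonly cited definitions; it is not taken from the TAM source code. The deviance values are hypothetical, and the parameter counts (61 and 70) rest on the assumption of 60 item step parameters plus 1 variance or 10 covariance-matrix entries, respectively.

```python
import math


def information_criteria(deviance, n_params, n_persons):
    """Common information criteria from deviance (-2 log-likelihood)."""
    p, n = n_params, n_persons
    aic = deviance + 2 * p
    return {
        "AIC": aic,
        "AICc": aic + 2 * p * (p + 1) / (n - p - 1),
        "BIC": deviance + math.log(n) * p,
        "aBIC": deviance + math.log((n + 2) / 24) * p,
        "CAIC": deviance + (math.log(n) + 1) * p,
    }


# Hypothetical deviances; only the sample size of 518 comes from the text.
one_dim = information_criteria(deviance=20000.00, n_params=61, n_persons=518)
four_dim = information_criteria(deviance=19601.23, n_params=70, n_persons=518)
for name in one_dim:
    better = "four-dimensional" if four_dim[name] < one_dim[name] else "one-dimensional"
    print(f"{name}: smaller value for the {better} model")
```

All five indices add a penalty for the number of parameters to the deviance and differ only in how strongly that penalty grows with sample size.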

The one-dimensional model
First, a one-dimensional Rasch model for polytomous data (i.e., a PCM) was applied. This model constitutes the reference model, against which the more complex approaches will be tested. The EAP reliability index of the latent scale of this model was 0.795.
Fig 1 shows the person-item-map of the one-dimensional model with the Thurstonian category thresholds. The latent dimension represents general depression (in contrast to the specific depression facets in the next model). From the histogram in the upper part we learn that the majority of the sample exhibits low depression values. In contrast, we find the majority of the thresholds in the higher regions of this latent dimension, indicating that only respondents with higher depression values are likely to choose the according response categories. Especially for items 2 (appetite), 9 (failure), 10 (fearful), 15 (unfriendly), and 19 (dislike), even the threshold between categories 0 and 1 is located considerably high. This means that these items are "difficult" from a psychometric point of view, thus requiring a higher latent score to endorse them. Accordingly, the thresholds of subscale I (Positive Affect), i.e., items 4 (good), 8 (hopeful), 12 (happy), and 16 (enjoy), are located in the lower regions of the latent dimension. One peculiarity becomes evident: The thresholds of items 3 (blues), 4 (good), and 9 (failure) are considerably close to each other, indicating that these items do not differentiate very much across the latent dimension.
Fig 2 shows the infit measures and the thresholds of the 20 CES-D items. Most of the values appear in the vicinity of 1; hence, the global impression is good. However, some items show peculiarities: The four items of subscale I show elevated item infit with statistically significantly deviating thresholds; thresholds 2 and 3 of the items 4 (good) and 8 (hopeful) are significant and three of them also lie above the upper limit of 1.3; further, thresholds 1 of items 12 (happy) and 16 (enjoy) are below the ideal value of 1 and were significant. In subscale II, item 6 (depressed) was close to the lower limit and significant; its first threshold was significant as well. The same applies to item 18 (sad). Finally, in subscale III, item 11 (sleep) was larger than 1 and significant.
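For readers unfamiliar with Thurstonian category thresholds: the threshold of category k is commonly defined as the point on the latent dimension at which the probability of responding in category k or higher equals 0.5. The sketch below locates these points by bisection for a single PCM item. It is an illustration under that definition, not the routine used in the study; the function names and step parameters are hypothetical.

```python
import math


def pcm_probabilities(theta, deltas):
    """Category probabilities of one item under the partial credit model."""
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def thurstonian_threshold(deltas, k, lo=-10.0, hi=10.0, tol=1e-8):
    """Find theta with P(X >= k | theta) = 0.5 by bisection.

    P(X >= k) increases monotonically in theta under the PCM, so bisection
    converges as long as the threshold lies inside [lo, hi].
    """
    def p_at_least_k(theta):
        return sum(pcm_probabilities(theta, deltas)[k:])

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_at_least_k(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical step parameters for a four-category item.
steps = [-0.5, 0.4, 1.2]
thresholds = [thurstonian_threshold(steps, k) for k in (1, 2, 3)]
print([round(t, 3) for t in thresholds])
```

Unlike PCM step parameters, Thurstonian thresholds are always ordered within an item, which is why they lend themselves to display in a person-item-map.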

The four-dimensional model
Next, we applied a four-dimensional model according to the item allocation as proposed by [1]. The EAP reliability indices for the 4 latent dimensions were 0.699 for Positive Affect (henceforth termed subscale I), 0.730 for Negative Affect (subscale II), 0.727 for Somatic Symptoms (III), and 0.451 for Interpersonal Difficulties (IV). Table 2 lists the information based indices indicating that the four-dimensional model describes the data better than the one-dimensional model.
Also, the direct model comparison via the likelihood ratio test (LRT) identified the four-dimensional model to fit the data significantly better than the one-dimensional one (χ² = 398.77; df = 9; p < 1e-10). Fig 3 shows the person-item-map of the four-dimensional model.
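The reported likelihood ratio test can be retraced with a few lines of code. The sketch assumes that the two models share the same item parameters and differ only in the latent covariance structure: one variance in the one-dimensional model versus 4 variances plus 6 covariances in the four-dimensional model, which is consistent with the reported df = 9. The helper `chi2_sf` is a hypothetical stand-alone function using the closed-form upper tail of the chi-square distribution for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series.
    tail = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total


n_dims = 4
covariance_params_four_dim = n_dims * (n_dims + 1) // 2  # 4 variances + 6 covariances
covariance_params_one_dim = 1                            # a single variance
df_difference = covariance_params_four_dim - covariance_params_one_dim

lrt_statistic = 398.77  # deviance difference reported in the text
p_value = chi2_sf(lrt_statistic, df_difference)
print(df_difference, p_value < 1e-10)
```

With a statistic this far beyond the 5% critical value for 9 degrees of freedom (about 16.9), the p-value is far smaller than the reported bound of 1e-10.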
The histogram of the person parameter estimates shows again that most respondents exhibit low values of depression, with subscale II (Negative Affect) covering a wider range than the other three subscales. The item category thresholds show a similar pattern as in the one-dimensional case. However, the thresholds of the four-dimensional model cover a much broader range of values. Nevertheless, items 4 (good) and 8 (hopeful) still show thresholds considerably close to each other, which means that these two items still do not discriminate very well across the spectrum of depression, i.e., respondents chose predominantly either category 0 (not at all) or category 3 (all the time). Fig 4 shows the infit indices for the 20 CES-D items. Again, we find a few peculiarities in scale I, yet to a lesser degree: The thresholds 2 and 3 of item 4 (good) and threshold 3 of item 8 (hopeful) are still significant, but the infit measure is below the critical limit of 1.3. Interestingly, now the items 3 (blues), 9 (failure), and 10 (fearful) show infit measures above the critical limit of 1.3. Again, item 6 (depressed) and item 11 (sleep) have thresholds deviating significantly from the ideal value of 1. Table 3 shows the correlation matrix of the four latent dimensions (main diagonal entries denote the variances of each latent dimension).
The highest correlation was found between Negative Affect and Somatic Symptoms (.88) while the weakest correlation occurred between Positive Affect and Interpersonal Difficulties (.31); the remaining correlation coefficients were mediocre (between 0.48

The four-dimensional model with background variables
Finally, the multidimensional model has been extended by the two background variables gender and age. Regarding model-fit, we find the person-item-map almost identical to that of the four-dimensional model without background variables (therefore not presented here; the same applies to the infit plot; interested readers can request a copy of these plots from the authors). The EAP reliability indices for the four latent dimensions were marginally better than for the previous model (I: 0.702; II: 0.740; III: 0.730; IV: 0.455). A direct model comparison using information based indices or the LRT is not possible, because this model was applied to a different data set (with the two background variables added).
The most interesting results of this model are the regression coefficients of the two background variables upon the four latent dimensions (see Table 4).
Regarding the impact of gender upon the subscales, we find two effects for the latent dimensions Negative Affect and Interpersonal Difficulties. In contrast, the respondents' age did not reveal any notable influence. From these results, we learn that gender but not age seems to play a role for the CES-D. This will be pursued further in the following DIF-analysis, which delivers more detailed insights.

DIF analysis
We used gender and a diagnosis of depression within the last month as split criteria for the DIF analysis: the former because it proved to be influential as a background variable, and the latter because the CES-D has been developed to measure depression in the general population. Therefore, it is of particular interest whether there are items operating differently in depressed than in non-depressed people. We used the four-dimensional model without background variables for the DIF analyses, because controlling for gender or depression would eliminate possible effects we are looking for in this analysis step.
First, we will focus upon the global DIF-effect. Here, we find a weak general DIF-effect for gender (global effect parameter -0.103; 95% CI = -0.13/-0.07), i.e., women were slightly (but significantly) more likely to endorse all items. Because such an over-all effect is not very informative, we turn to an item-wise analysis. Fig 5 presents the item-wise DIF-effects according to gender (solid line).
Seven items show a significant yet moderate DIF effect. The Positive Affect subscale is affected the most with three out of four items (hopeful, happy, enjoy) showing DIF in favour of men (i.e., men are more likely to endorse these items than women). There is a DIF-effect in favour of women for two of the Negative Affect subscale items (failure, cry) and in favour of men for two items of the Somatic Symptoms subscale (appetite, talk).
For the second DIF analysis, we split the sample into respondents with vs. without a diagnosis of depression according to SCAN. Other diagnoses were excluded for this step, resulting in a slight sample reduction (n_red = 452). Item 2 (appetite) had to be excluded from the analysis for technical reasons (response category 3 did not occur in the reduced sample). There was a global effect, with depressed respondents being more likely to endorse all items (effect parameter -0.656; 95% CI = -0.69/-0.62). Fig 5 shows the item-wise DIF-effects (dashed line). We find significant effects for 10 items: For depressed respondents, it was more difficult to endorse items 15 (unfriendly) and 19 (dislike), among others, and easier to endorse items 3 (blues), 6 (depressed), 9 (failure), 10 (fearful), and 7 (effort). Although most of these effects were statistically significant, they can be considered small from a substantive perspective. The largest effect was observed for items 15 (unfriendly), 19 (dislike), and 13 (talk), which were more difficult to endorse for respondents fulfilling depression criteria.

Discussion
The present study analysed the CES-D with a multi-dimensional IRT model in a sample representative of the general population. A one-dimensional solution was contrasted to a four-dimensional model reflecting the subscales as asserted by [1]. Interestingly, the fit of the one-dimensional model was already quite good. Only item 1 (bothered) showed an infit value outside the usual limits of acceptability, and a few thresholds of the remaining items reached statistical significance. The EAP reliability measure of this model was 0.8, which can be regarded as fairly satisfactory. Hence, we can conclude that an overall score would deliver quite useful information. This finding supports the view of Radloff [1] advocating the use of the total score of the CES-D, albeit on a much more elaborate methodological foundation. This could be advantageous, for example, when using the CES-D as a screening instrument in a multistep diagnostic process, where a single total score with a certain cut-off value would be preferable. However, the fit of the four-dimensional model was by far (and significantly) better than the fit of the one-dimensional model. It is also in line with the meta study of Shafer (2006), who also found "strongest support (. . .) for the four-factor structure of the CES-D" [97] (p. 136). The reliability coefficients of the subscales revealed that subscales I (Positive Affect), II (Negative Affect), and III (Somatic Symptoms) achieve values in the vicinity of 0.7, which is satisfactory, while subscale IV (Interpersonal Difficulties) was mediocre at best (0.45). When comparing reliability indices of the four- and the one-dimensional model, we have to keep in mind that reliability depends, amongst other things, on scale length as well. In the one-dimensional model, a common scale is built from all 20 items, while the subscales of the four-dimensional model are much shorter; therefore, the subscale indices are lower for technical reasons.
Taking this into account, we consider the reliability indices of the subscales I-III as sufficiently high. The poor result of subscale IV implies that two items would not suffice to establish a meaningful subscale. Such short scales are rather useful for screenings in the first step of a two-step screening procedure fostering a decision regarding further diagnostic procedures (cf. [98][99][100]). However, they are hardly suitable for the quantitative assessment of a trait. In the present case, Interpersonal Difficulties, which is a rather complex construct, would be measured with a score consisting of two items and a total value ranging from 0 to 6. Hence, the interpretation of this scale is very limited and should be done with great caution (if at all).
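The dependence of reliability on scale length can be made tangible with the Spearman-Brown formula from classical test theory. This is only a rough heuristic and not part of the reported analysis: it assumes parallel items, and applying it to EAP reliabilities is an additional simplification. The function name is hypothetical; the input value 0.795 is the EAP reliability of the one-dimensional 20-item model reported above.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when scale length is multiplied by `length_factor`.

    Classical test theory heuristic assuming parallel items; used here only
    to illustrate the effect of shortening a scale.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


full_scale_reliability = 0.795  # one-dimensional model, 20 items (from the text)

# Projected values if a subscale were a random subset of the 20 items.
projected_four_items = spearman_brown(full_scale_reliability, 4 / 20)
projected_two_items = spearman_brown(full_scale_reliability, 2 / 20)
print(round(projected_four_items, 3), round(projected_two_items, 3))
```

Under these assumptions, shortening the scale to four or two items would by itself push reliability well below the full-scale value, which illustrates why the shorter subscales cannot be expected to match it.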
Comparing the present results to those of the previously reported IRT-based studies, we find largely agreeing and some interesting new results: Generally, the one-dimensional model rendered seven items suspicious (five with significant infit plus two with significant thresholds only), whereas the four-dimensional model only showed significant infit for three items and suspicious thresholds for another three items. This is in line with the previous results, again showing the four-dimensional model to be superior to the one-dimensional model. We will, therefore, focus on this model in the discussion of item fit: Regarding subscale I, Positive Affect, item 4 (good) proved most problematic, as not only its infit measure was significant, but also thresholds 2 and 3. Items 8 (hopeful) and 16 (enjoy) had one problematic threshold each. Interestingly, item 12 (happy) worked well here, in contrast to [66] and [70,71]. For subscale II, Negative Affect, we find diverging results, as the suspicious items 3 (blues), 6 (depressed), and 9 (failure) have not been reported as problematic in the previous studies. Taking into consideration that these items cover core symptoms of depression, our results might reflect the different populations in which the CES-D was used. Our study covered the general population, where these statements may play a different role compared to the specific populations reported in the previous studies. The DIF analysis discussed below will shed further light on this issue. For subscale III, Somatic Symptoms, the situation is fairly clear: Item 11 (sleep) was suspicious, which is in line with four out of the five reported studies. Item 2 (appetite), however, was inconspicuous, in contrast to [67,68,70,71]. Interestingly, the infit measures of the two items of subscale IV, Interpersonal Difficulties, were satisfactory in our study. Further details regarding the results and the discussion of our analyses can be found in the online supplemental material S1 File.
As a limitation, we have to take into account that the sample relies on a phone number data base, which will not cover the entire population of a country. Therefore, slight peculiarities may still exist. However, we consider this limitation tolerable for two reasons: First, it is unlikely that our results are severely biased as the data base still covers an enormous portion of the entire population. Second, Rasch models are "sample independent" [101], which, in short, describes the fact that item parameter estimates do not depend on the person parameter distribution and vice versa [102,103]. We therefore regard our results as dependable.
Concluding, we can state that the one-dimensional modelling approach proved clearly inferior to the multidimensional one. This is in line with previous studies: For example, Gay et al. [67] also used the PCM approach and found violations of the one-dimensionality assumption for all 20 items of the CES-D. Moreover, we found subscale IV, Interpersonal Difficulties, to exhibit severe limitations from a psychometric point of view. Therefore, it should be handled with care. Apart from that and a few limitations deserving further elaboration, analyses of the subscales yielded convincing results supporting the subscale structure of the CES-D. Therefore, although not entirely dismissing the overall score, we advocate the use of a subscale based interpretation due to its superior psychometric qualities.