The Center for Epidemiologic Studies Depression Scale: A Review with a Theoretical and Empirical Examination of Item Content and Factor Structure

Background The Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977) is a commonly used freely available self-report measure of depressive symptoms. Despite its popularity, several recent investigations have called into question the robustness and suitability of the commonly used 4-factor 20-item CES-D model. The goal of the current study was to address these concerns by confirming the factorial validity of the CES-D. Methods and Findings Differential item functioning estimates were used to examine sex biases in item responses, and confirmatory factor analyses were used to assess prior CES-D factor structures and new models heeding current theoretical and empirical considerations. Data used for the analyses included undergraduate (n = 948; 74% women), community (n = 254; 71% women), rehabilitation (n = 522; 53% women), clinical (n = 84; 77% women), and National Health and Nutrition Examination Survey (NHANES; n = 2814; 56% women) samples. Differential item functioning identified an item as inflating CES-D scores in women. Comprehensive comparison of the several models supported a novel, psychometrically robust, and unbiased 3-factor 14-item solution, with factors (i.e., negative affect, anhedonia, and somatic symptoms) that are more in line with current diagnostic criteria for depression. Conclusions Researchers and practitioners may benefit from using the novel factor structure of the CES-D and from being cautious in interpreting results from the originally proposed scale. Comprehensive results, implications, and future research directions are discussed.


Introduction
The Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision [1] characterizes depression as a multidimensional construct comprising negative emotion (i.e., negative affect; Criterion A1), an absence of positive emotions (i.e., anhedonia; Criterion A2), and a cluster of physical symptoms (i.e., somatisation; Criteria A3-5). The Center for Epidemiologic Studies Depression Scale (CES-D) [2] is among the most popular measures of depressive symptoms, likely because it is free and generally comparable [3][4][5] with the well-established Beck Depression Inventories [6], [7]. Despite its popularity, the CES-D raises concerns, particularly regarding its latent factor structure and item content.
The CES-D was originally posited as having a 4-factor structure representing depressed affect, absence of positive affect or anhedonia, somatic activity or inactivity, and interpersonal challenges [2]. The CES-D items and structure were not designed a priori to reflect diagnostic criteria at the time of its development [8] and recent investigations have called into question the robustness and stability of the original 4-factor 20-item structure [9][10][11]. Indeed, over 20 alternative factor solutions have been reported (Table 1) and have suggested the presence of one, two, three, and four factors [12][13][14]. The majority of factor-analytic studies of the CES-D have employed principal component analysis with orthogonal rotation [4], an analytic approach that may have theoretically improbable assumptions and biased factor solutions [15]. The shift away from such approaches is not a shift away from exploratory factor analyses, but a shift towards the best practices for such analyses; that said, exploratory factor analyses tend to be exploratory. In the case of constructs that are established (e.g., depression), confirmatory factor analyses may be more informative as measures are designed to fit a construct, instead of naming constructs to fit the results from a measure.
Many researchers have also questioned the validity and psychometric properties of several items on the CES-D [16][17][18][19][20][21][22][23][24][25]. Items potentially assessing somatic concerns (e.g., "I felt that everything I did was an effort") may artificially inflate CES-D scores for elderly or chronic pain populations [26], [27]. Two socially-focused items (i.e., "People were unfriendly" and "I felt that people disliked me") are believed to potentially confound the validity of the CES-D by assessing other constructs (e.g., perceived social competence) and symptoms of other disorders (e.g., Social Anxiety Disorder) [4], [11], [14], [21]. For at least one item (i.e., "I had crying spells"), there appears to be a robust sex difference in responses, leading to inappropriate inflation of women's CES-D scores due to cultural norms regarding emotional expression, rather than actual differences in depressive symptoms [19], [20], [25], [28], [29]. Furthermore, the CES-D also includes four reverse-worded items (e.g., "I was happy") designed "…to break tendencies toward response set as well as to assess positive affect (or its absence)" [2]; however, these two purposes are at odds and may lead to misrepresentation of response patterns or biased estimations of positive affect [4], [30]. Research suggests that depression marked by absence of positive affect (i.e., anhedonia) may be qualitatively and quantitatively different from depression resulting from heightened negative affect [31][32][33], implying that measures of depression should assess this dimension directly.
The aims of the current study were to (1) identify any sex biases within the item content of the CES-D, (2) explore which of the many prior factor solutions for the CES-D (Table 1) would demonstrate the best factorial validity, and (3) test whether a new theory-driven solution would exhibit the best fit. The ability of items to predict depression similarly among men and women (i.e., differential validity) was assessed by using an application of item response theory. The factorial validity of the CES-D was examined using a series of confirmatory factor analyses (CFAs) that tested previously established models, as well as new models based on theory and empirical research. This approach is in line with conclusions from a recent meta-analysis [4] suggesting that the use of CFAs would be an appropriate next step in solidifying the optimal factor structure of the CES-D; that is, the use of CFAs will circumvent the almost exclusive prior use of exploratory factor analytic techniques with the CES-D [4], [15]. The present study performed these analyses using five different samples (i.e., undergraduate, community, rehabilitation, clinical with a history of depression, and a nationally representative sample from the National Health and Nutrition Examination Survey; NHANES) to permit generalizability of the findings across several applications (e.g., epidemiological, clinical), while addressing the overuse of data from specialized samples in this area (e.g., adolescent, geriatric).

Ethics Statement
The present study has been ethically approved by the University of Regina Research Ethics Board. The study uses archival data Using this type of sample generally ensures a wide range of responses, whereas an entirely clinical sample might provide a restricted range of relatively higher responses [15], [34]. Participants identified their ethnicity as White/Caucasian (89%), First Nations (i.e., Canadian Aboriginal; 3%), Asian (4%), or other (4%). Most reported being single (84%), while others were married or cohabiting (13%), separated or divorced (1%), or chose not to answer (2%). Undergraduates were recruited via campus advertisements directing them to a secure website for completion of an online questionnaire package.
The third sample was a rehabilitation sample of tertiary level rehabilitation patients (n = 522) from a government-sponsored rehabilitation program who completed the CES-D as part of tertiary assessment for issues related to injuries sustained in motor vehicle or workplace accidents (246 men, 18-85 years [M age = 42.5; SD = 12.5] and 276 women, 18-79 years [M age = 43.2; SD = 12.5]). The rehabilitation sample was included to provide a comparatively broad range of responses from a treatment-seeking sample that is very likely distressed, but not necessarily depressed. Ethnicity data were not recorded for the rehabilitation sample, but the sample can be assumed to be primarily Caucasian based on population demographics. Most reported being married or cohabiting (57%), while others were single (27%), separated or divorced (13%), or widowed (3%). Education levels were not available for this sample.
The fourth sample, described as a clinical sample, included community members (n = 84) from across Canada ( [1], [8]. The public access data are from the National Institute of Mental Health and we are grateful for the NHANES contribution. Comprehensive descriptions of the data collection are available directly online from the Centers for Disease Control and Prevention. Many of the NHANES participants reported having completed Grade 12 (37%) or having at least some postsecondary education (32%), and most reported being employed (52% full-time, 11% part-time) or working as homemakers (33%). NHANES participants identified their ethnicity as Caucasian (91%), African American (8%), or other (1%). The majority reported being married or cohabiting (79%), while others reported being single (7%), separated or divorced (8%), or widowed (6%).

Measures
The CES-D is a 20-item measure assessing symptoms of depression with items phrased as self-statements (e.g., "I felt hopeful about the future"). Respondents rate how frequently each item applied to them over the course of the past week. Ratings are made on a 4-point Likert scale ranging from 0 (rarely or none of the time [less than 1 day]) to 3 (most or all of the time [5-7 days]).
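The scoring implied by this description can be sketched as follows. This is a minimal illustration rather than the authors' scoring procedure. The positions of the four positively worded items (4, 8, 12, and 16) are assumed from the standard CES-D form and are not stated in the text above; the function name `score_cesd` is hypothetical.

```python
# Positions of the four positively worded items in the standard 20-item
# CES-D (an assumption; the text above does not list them).
REVERSE_ITEMS = {4, 8, 12, 16}
MAX_RATING = 3  # ratings run from 0 to 3


def score_cesd(responses):
    """Return the CES-D total (0-60) from 20 ratings given in item order."""
    if len(responses) != 20:
        raise ValueError("expected 20 item ratings")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if rating not in (0, 1, 2, 3):
            raise ValueError(f"item {item_number}: rating must be 0-3")
        # Reverse-key positively worded items so that a higher value
        # always indicates more depressive symptomatology.
        if item_number in REVERSE_ITEMS:
            rating = MAX_RATING - rating
        total += rating
    return total
```

Under these assumptions, a respondent who answers 0 to every item still receives 12 points, because "rarely" feeling hopeful or happy is keyed as symptomatic.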

Analyses
Descriptive statistics and differential item functioning. Descriptive statistics were calculated for each item within each of the samples (Table 2). Means on each of the items for men and women were compared by t-tests across samples as an initial index of differential validity. Differential item functioning was subsequently estimated to assess whether men and women differed in their responses to each item along the continuum of CES-D scores. Differential item functioning occurs when individuals with the same latent trait (i.e., depression) or total score (e.g., on the CES-D) respond to items differently due to test characteristics (e.g., paper and pencil vs. computerised) or biases (e.g., due to sex or race [35], [36]). Estimates of differential item functioning can illustrate, for example, that men and women may respond similarly to an item when they have relatively low CES-D scores, but respond differently to the item when they are severely depressed. Differential item functioning was estimated using an item response theory approach rather than a Mantel-Haenszel approach because it provides a more accurate estimate of non-uniform differential item functioning (e.g., if it occurs only at more severe levels of depression [37]). Non-parametric item characteristic curves were rendered using jMetrik 2.1.0 [38] and were smoothed using a Gaussian kernel. Item characteristic curves are an integral part of item response theory; they plot which response option (e.g., 0, 1, 2, or 3 on a Likert scale) is most likely to be endorsed by an individual with a certain total score. In the absence of differential item functioning on the CES-D, men and women with similar levels of depression should endorse the same option on each item of the CES-D (e.g., severely depressed men and women would both choose the highest option), and therefore exhibit very similar item characteristic curves.
The distance between the curves for each sex was examined manually to identify potential differential item functioning. An item was only confidently deemed to exhibit differential item functioning if the curves for men and women were grossly dissimilar either in slope or intercept. Item response theory analyses require both relatively large samples and a range of scores spanning the full continuum of potential scores on the measure [35]; consequently, all five samples were combined for these, but not subsequent analyses. Item characteristic curves were plotted based on total CES-D scores, rather than latent depression, given the aforementioned difficulties associated with the latent structure of the CES-D.
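The idea of a kernel-smoothed, non-parametric item characteristic curve can be sketched as below. This is a generic Nadaraya-Watson estimate with a Gaussian kernel, not a reproduction of jMetrik's implementation; the function names, the bandwidth, and the evaluation grid are all assumptions made for illustration. The `max_curve_gap` helper is a hypothetical numerical aid; in the study the curves were compared by visual inspection.

```python
import math


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel weight for a standardised distance u."""
    return math.exp(-0.5 * u * u)


def option_curve(total_scores, item_responses, option, grid, bandwidth):
    """Smoothed probability of choosing `option` at each total score in `grid`.

    total_scores   -- one total CES-D score per respondent
    item_responses -- the same respondents' ratings (0-3) on a single item
    """
    curve = []
    for point in grid:
        weights = [gaussian_kernel((s - point) / bandwidth) for s in total_scores]
        weight_sum = sum(weights)
        hits = sum(w for w, r in zip(weights, item_responses) if r == option)
        # If every weight underflows to zero there is no information here.
        curve.append(hits / weight_sum if weight_sum > 0 else float("nan"))
    return curve


def max_curve_gap(curve_a, curve_b):
    """Largest absolute difference between two curves on the same grid."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))
```

Computing `option_curve` separately for men and for women on the same grid, once per response option, yields the pairs of curves whose separation signals potential differential item functioning.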
Testing and modifying previous factor solutions. A series of CFAs was conducted to replicate and test selected factor structures published in previous studies (Table 1) and to extend these previous models by excluding potentially problematic items as suggested by previous research. Specifically, there appears to be consensus throughout the literature that items 15 (i.e., "People were unfriendly") and 19 (i.e., "I felt that people disliked me") may warrant removal as they reflect interpersonal difficulties, a dimension not consistent with contemporary diagnostic criteria for depression [1], [4], [11], [14], [21]. Similarly, item 17 (i.e., "I had crying spells") may warrant removal as it produces robust sex differences in endorsement [25], [28], [29]. Accordingly, previously demonstrated factor structures were tested with and without items 15, 17, and 19. Several previous analyses have also suggested that 2-item factors within the CES-D (Table 1) are inherently unstable [15], [39]. Given the challenges associated with 2-item factors, models including a 2-item factor (e.g., [21], [40]) were tested with and without the 2-item factor utilizing the same procedures (i.e., testing with and without items 15, 17, and 19).
CFAs were conducted separately in each sample to determine whether the structure of the CES-D is generalizable and stable across different applications. The size of the clinical sample was not optimal for CFAs, but research supports the applicability of CFAs in samples as small as 51 participants [41]; moreover, the reliability of the factors and the strength of the communalities between the items facilitate the use of CFAs in this sample. The CFAs were performed with AMOS 18, and data from each of the five samples were analyzed using maximum likelihood estimation. Because the data did not exhibit multivariate normality, Bollen-Stine bootstrap chi-square tests and bootstrapped parameter estimates [45], [46] were also computed and compared with the maximum likelihood estimates; the results were comparable and are excluded for brevity. Each model was evaluated using the following fit indices with 90% confidence intervals (when applicable): 1) chi-square (values should not be significant); 2) chi-square/df ratio (values should be less than 2.0); 3) Comparative Fit Index (CFI; values must be greater than .90, and ideal fits approach or are greater than .95); 4) the Standardized Root Mean Square Residual (SRMR; values must be less than .10 and ideal fits approach or are less than .05); 5) Root Mean Square Error of Approximation (RMSEA; values must be less than .08 and ideal fits approach or are less than .05, with 90% confidence interval values below .10); and 6) Expected Cross-Validation Index (ECVI; when comparing these scores across different models, lower values indicate a closer fit) [42], [43]. Evaluations emphasized the latter four fit indices (i.e., CFI, SRMR, RMSEA, and ECVI) [44]. Given the large number of models that were tested, only fit indices for solutions where the CFI exceeded .92 in at least three of the five samples were included for presentation.
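The minimum cut-offs and the presentation screen described above can be expressed as a short check. This sketch only encodes the thresholds stated in the text; the function names and the example values are hypothetical and are not results from the study. ECVI is omitted because it is interpreted only relative to other models.

```python
def evaluate_fit(chi_square, df, cfi, srmr, rmsea, rmsea_ci_upper):
    """Check one model's fit indices against the minimum cut-offs in the text."""
    return {
        "chi_square_df": chi_square / df < 2.0,
        "cfi": cfi > 0.90,
        "srmr": srmr < 0.10,
        "rmsea": rmsea < 0.08 and rmsea_ci_upper < 0.10,
    }


def passes_cfi_screen(cfi_by_sample, cutoff=0.92, min_samples=3):
    """True when the CFI exceeds the cutoff in at least `min_samples` samples."""
    return sum(cfi > cutoff for cfi in cfi_by_sample.values()) >= min_samples


# Hypothetical values for illustration only; not results from the study.
example = evaluate_fit(chi_square=150.0, df=74, cfi=0.96, srmr=0.04,
                       rmsea=0.045, rmsea_ci_upper=0.06)
```

In this made-up example the chi-square/df ratio (about 2.03) misses its cut-off while CFI, SRMR, and RMSEA meet theirs, which is the kind of mixed pattern that motivates emphasizing the latter indices.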

Internal Consistency
Internal consistency was acceptable for the current undergraduate (Cronbach's α = .91), community (Cronbach's α = .94), rehabilitation (Cronbach's α = .92), clinical (Cronbach's α = .85), and NHANES (Cronbach's α = .85) samples. The average inter-item Pearson correlation with the reverse-scored items (i.e., positive affect/anhedonia) was .34 for the undergraduate sample, .43 for the community sample, .38 for the rehabilitation sample, .23 for the clinical sample, and .26 for the NHANES sample. The average inter-item Pearson correlation without the reverse-scored items (i.e., positive affect/anhedonia) was .37 for the undergraduate sample, .44 for the community sample, .40 for the rehabilitation sample, .25 for the clinical sample, and .33 for the NHANES sample. In all cases the average inter-item correlation was relatively low, indicating diversity among the items and supporting notions of more than one latent construct. The lowest inter-item correlation was for the clinical sample and suggests that there may be substantial variation among clinical presentations of these symptoms for persons with a history of depression. Such variation is implicitly supported by DSM-IV-TR diagnostic criteria that allow for high levels of negative affect or high levels of anhedonia to qualify as hallmark criteria for major depressive disorder (i.e., "(1) depressed mood or (2) loss of interest or pleasure"; page 356 [1]).
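For readers who wish to reproduce these two statistics on their own data, a minimal sketch follows. It uses the standard formulas for Cronbach's alpha and the Pearson correlation; the function names and the input layout (one list of item scores per respondent, with reverse-keying already applied) are assumptions, not a description of the software used in the study.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mean_inter_item_correlation(rows):
    """Average Pearson correlation over all distinct pairs of items."""
    k = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(k)]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return mean([pearson(columns[i], columns[j]) for i, j in pairs])
```

Because alpha grows with the number of items, a 20-item scale can reach a high alpha even when the mean inter-item correlation is modest, which is consistent with the pattern reported above.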

Sex Differences on CES-D Items
Across all samples, persons with missing data (i.e., fewer than 1%) were excluded from the analyses. The t-tests comparing men's and women's responses from all samples combined suggested that women reported statistically significantly higher scores (p < .05) on most CES-D items (i.e., 1, 2, 3, 5, 6, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20); however, the effect sizes (i.e., the percentage of variance accounted for, r²) were negligible (i.e., r² < .01) for most, but not all items (i.e., items 3, 5, 6, 20, r² = .02; item 14, r² = .03; item 18, r² = .04; item 17, r² = .07). Item 17 (i.e., "I had crying spells") was the only item with item characteristic curves that differed markedly between men and women, suggesting it has significant differential item functioning. An item with nil or negligible differential item functioning (i.e., item 20) is presented in Figure 1 (i.e., Item characteristic curves) alongside item 17 for illustrative purposes. The item characteristic curves demonstrate that men and women respond similarly to item 17 when depression levels are low or slightly above average (-2.5 SD to +0.5 SD), with both sexes choosing 0 (rarely or none of the time); however, as depression levels increase, women are more likely to choose a higher response option compared to men. Indeed, even the most depressed men are most likely to choose 1 (some or a little of the time), while the most depressed women are more likely to choose 2 (occasionally or a moderate amount of the time) or 3 (most or all of the time). The item characteristic curve plots for all items are not displayed for brevity, but are available from the authors upon request.
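The contrast between statistical significance and effect size can be illustrated with a short sketch. It assumes that r² was obtained from the independent-samples t statistic through the common conversion r² = t² / (t² + df); the text does not state how r² was computed, so this is an assumption, and the function name is hypothetical.

```python
from statistics import mean, variance


def t_and_r_squared(group_a, group_b):
    """Pooled-variance t statistic and the share of variance it explains."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    standard_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    t = (mean(group_a) - mean(group_b)) / standard_error
    # Assumed conversion from t to proportion of variance accounted for.
    r_squared = t * t / (t * t + df)
    return t, r_squared
```

With several thousand respondents, df is large, so even a t value well beyond the p < .05 threshold can correspond to an r² below .01.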

Structural Analyses: CFA Results
The fit indices for each of the previously reported models, as evaluated with data from each sample, are presented in Table 3 (where the model CFI exceeded .92 in at least three out of the five samples). The results were interpreted to suggest that five models might have the factorial validity to provide utility in divergent populations, as many of the fit indices met acceptable standards across the different samples. However, all of these models included item 17 and/or failed to include items that assess positive affect, which is inconsistent with current theory and diagnostic approaches concerning depression [1]. Of all the newly derived models (i.e., with items 15, 17, and 19 removed and without 2-item factors [if relevant]), only one exhibited acceptable fit indices within each sample, included positive affect items, and did not include item 17. The model with the best fit indices was a revision of the one proposed by Radloff [47], which also excluded items 9, 10, and 13. Relevant fit indices and inter-factor correlations for this newly derived model are reported in Table 4. The original model proposed by Radloff [47] (Table 7).

Discussion
Despite the popularity of the CES-D, there has been considerable debate regarding the optimal factor structure and item content for the measure (see Table 1). The current study sought to summarize and address these issues by assessing the differential validity of the CES-D and comparing the previously proposed factor solutions for the CES-D to a novel, theoretically-driven model. The results support a 14-item, 3-factor model that is relatively more congruent with current diagnostic criteria for depression [1].
Previous research has highlighted that item 17 (i.e., "I had crying spells") of the CES-D may lead to inflated scores for women [19], [20], [25], [28], [29]. As expected, item 17 exhibited significant differential item functioning, such that even the most depressed men were most likely to choose 1 (some or a little of the time) on the Likert scale for that item, compared to the most depressed women, who were more likely to choose 2 (occasionally or a moderate amount of the time) or 3 (most or all of the time). This finding underscores the importance of removing item 17 from the CES-D and subsequently creating and utilizing new norms for the measure that do not include this item. Continued use of item 17 and the associated norms or cut-offs will lead to notable overestimates of depression in women and underestimates of depression in men. Such misrepresentations owing to sex and cultural biases, rather than true differences in depression, may have significant social and practical healthcare implications. Attempting to control for this sex difference by subtracting a value from women's scores (e.g., one point off of the total), or by otherwise adjusting norms for each sex would be inappropriate because sex differences on this item are nonlinear (i.e., women score higher compared to men when both are severely depressed).
To illustrate, removing one point from women's scores would substantially and inappropriately lower scores of women who are on the lower spectrum of depression (i.e., because item 17 is less biased on the lower end of the spectrum) and would still overestimate the severity of depression in severely depressed women when compared to men.
Results of the CFAs failed to support CES-D models previously identified by exploratory factor analyses. All models with minimally acceptable fit indices for three out of the five samples included individual items or 2-item factors that previous research suggests should not be included in the CES-D [25], [28], or involved extreme reductions in item content that impede the capacity of the CES-D to assess DSM-IV-TR depressive symptoms [1]. A modified version of the model proposed by Radloff [47] provided a 3-factor (i.e., negative affect, anhedonia, and somatic symptoms), 14-item solution that is consistent with contemporary conceptualization of depression [1] and demonstrated excellent fit within all samples as indicated by all fit indices. The solution also exhibited acceptable internal consistency for all factors within all samples, with the exception of the somatic factor having relatively poor internal consistency within the sample with a history of depression. The differing results for internal consistency suggest that negative affect and anhedonia may be the most characteristic and consistent symptoms of depression, while somatic symptoms may be more variable between individuals with a history of depression. The differences may result from somatic symptoms being endorsed for reasons other than depression, such as chronic pain.
Several theoretical and clinical implications follow from the present findings. Researchers and clinicians should either avoid item 17 of the CES-D (i.e., "I had crying spells") or use and interpret it with caution. As the current results illustrate, a woman's crying is not necessarily a viable index of her depression severity (perhaps owing to cultural norms of emotional expression), and a lack of crying in either sex is not a viable index of an absence of depression. Utilizing item 17 may lead to skewed estimations of depression and invalid cut-off scores. Nevertheless, crying is a symptom of emotional distress, and researchers should explore the possibility of creating a new item that assesses frequency of crying without a sex bias. For example, perhaps a relative measure of crying (e.g., "I cried much more frequently than I usually do" or "I felt like crying more than usual") rather than an absolute measure of crying (e.g., "I cried most of the time") may limit such sex biases. Moreover, the current model is consistent with previous findings suggesting that socially-focused items of the CES-D (i.e., items 15 and 19) should not be included in the measure [4], [11], [14], [21]. Finally, the current results further support depression as a multidimensional disorder consisting of negative affect, anhedonia, and somatic symptoms [48][49][50].
The review of prior studies on the factor structure of the CES-D highlights the divergent results of previous exploratory factor analyses, none of which were strongly supported by CFAs with the present data. Future studies of the CES-D may benefit more from conducting further theory-driven confirmatory analyses rather than exploratory analyses. The majority of factor solutions suggested by previous exploratory factor analyses exhibited poor fit in the current samples. The best fitting solution was derived from contemporary theoretical research and previously established empirical data and exhibited excellent fit in the variety of samples used. Accordingly, the version of the CES-D presented herein would likely maintain factorial validity across different settings (e.g., clinical, research). Future research on the CES-D would benefit from exploring different forms of validity (e.g., convergent validity, predictive validity) with the item set from the model suggested here. In addition, future research designs should explicitly include comments regarding the influence of sample on factor structure fit indices, a variable that the current results indicate is important.
Several limitations of the current study provide directions for future research. First, the majority of participants in the current samples were not formally evaluated (e.g., with a structured clinical interview) for clinically significant depression. Moreover, although the diagnostic criteria for depression have changed minimally since data for the NHANES were collected (roughly 37 years ago), potential changes over time with respect to social and cultural attitudes may have resulted in different response rates and patterns than if these data were collected today. Future research should assess the sensitivity and specificity of the proposed item set with participants categorized as meeting or not meeting DSM-IV criteria for Major Depressive Disorder. Second, the inability to clinically classify individuals with or without depression also precluded estimation of appropriate cut-off scores for the CES-D. Future research may benefit from re-examining cut-off scores while removing items identified in the current paper as inappropriate. Such an examination may shed light on discrepancies in recommendations for cut-off scores [51][52][53][54][55][56]. Third, including the reverse-scored items that are straightforwardly worded assessments of positive affect/anhedonia may be creating a psychometric bias as a result of incidental response errors. Such a possibility is relatively less likely than using reverse-worded items, but future research could assess for such a bias by examining the items separately and adding a measure that is not based entirely in self-report for convergent and divergent validity. Fourth, combining all five samples created a large enough sample to produce accurate estimations of differential item functioning; however, the combination of differing samples (e.g., clinical, community) may have introduced unmeasured confounds (e.g., cultural differences in the NHANES but not in the clinical sample) that may impact differential item functioning.
Future research should examine differential item functioning on the CES-D in a variety of large, culturally homogeneous samples. Fifth, the current study only provides support for a revised version of the CES-D in a primarily English-speaking sample. Future research should cross-validate this revision using a more culturally diverse sample and test its compatibility with versions of the CES-D in other languages. Sixth, the somatic factor included in the final solution demonstrated adequate fit, but relatively low internal consistency. As such, the somatic items may benefit from further revision as they may currently focus on symptoms that are also characteristic of other disorders (e.g., anxiety disorders) or fail to assess symptoms frequently associated with depression. For example, item 11 (i.e., "My sleep was restless") is too vague to be specifically related to depression and certainly excludes hypersomnia, waking early, and difficulty falling asleep, which are characteristic of depression [1]. Additional revisions to CES-D content might also consider including items describing cognitive symptoms of depression (e.g., thoughts of worthlessness or suicidal ideation) to further adhere to current diagnostic criteria. It may also be worthwhile for future researchers to consider adopting a differential weighting schema for items in the CES-D, such that items are weighted according to their analytical power. That said, given the increasing availability of alternative screening measures (e.g., PHQ-9 [57]), coupled with the longstanding psychometric difficulties of the scale, it may be time to begin the process of retiring the CES-D in favor of newer measures that are also freely available for use. The present study addressed pertinent issues associated with CES-D items and precedent factor structures.
CFAs performed with several samples (i.e., undergraduate, community, rehabilitation, clinical, and NHANES) were interpreted to suggest a novel best fitting model for the CES-D that is psychometrically and theoretically robust, comprising three factors (i.e., negative affect, anhedonia, somatic symptoms) and 14 items that are relatively more congruent with current diagnostic criteria for depression [1]. The CES-D items may benefit from additional revision; however, this alternative solution offers a valid item set, without biases related to social concerns or sex, for research and clinical applications.