Measuring Life Satisfaction in Parkinson’s Disease and Healthy Controls Using the Satisfaction With Life Scale

The 5-item Satisfaction With Life Scale (SWLS) was designed to measure general life satisfaction (LS). Here we examined the psychometric properties of the SWLS in a cohort of persons with Parkinson's disease (PwPD) and age- and gender-matched individuals without PD. The SWLS was administered to PwPD and controls from the Norwegian ParkWest study at 5 and 7 years after the time of diagnosis. Data were analysed according to classical test theory (CTT) and Rasch measurement theory. CTT scaling assumptions for computation of a SWLS total score were met (corrected item-total correlations >0.58). The SWLS was reasonably well targeted to the sample and had good reliability (ordinal alpha, 0.92). The scale exhibited good fit to the Rasch model and successfully separated between 5 statistically distinct strata of people (levels of SWLS). The seven response categories did not work as intended and the scale may benefit from reduction to five response categories. There was no clinically significant differential item functioning. Separate analyses in PwPD and controls yielded very similar results to those from the pooled analysis. This study supports the SWLS as a valid instrument for measuring LS in PD and controls. However, Rasch analyses provided new insights into the performance and validity of the SWLS and identified areas for future revisions in order to further improve the scale.


Introduction
Parkinson's disease (PD) is associated with a number of motor and non-motor symptoms that have a major influence on the lives of those affected by the disease and how satisfied they are with their lives [1]. Greater understanding of life satisfaction (LS) is necessary for improving PD management, particularly from a person-centred chronic disease management perspective, which in turn requires valid tools to quantify LS.
One of the most frequently used LS scales is the generic 5-item Satisfaction With Life Scale (SWLS). The SWLS was developed to measure people's perception and evaluation of their overall LS [2]. Although the SWLS has been extensively tested in different populations [3][4][5][6][7][8], there are still some concerns regarding its psychometric properties. For example, whereas the first three SWLS items represent the present, the last two items represent the past, and the last item in particular has been suggested to be relatively weakly associated with the other items [7], thus challenging the unidimensionality and internal integrity of the scale. Furthermore, while its generic nature should allow for comparison of scores between respondent groups such as various patient populations and control subjects, the extent to which this is supported empirically appears untested.
Recent studies have suggested that the SWLS is useful for measuring LS in persons with PD (PwPD) [5][6][7][8]. However, these studies used classical test theory (CTT) methodology based on parametric statistics, an approach that does not take the ordinal nature of the data into account. Furthermore, modern psychometric methods, in particular the Rasch measurement model, are considered superior to CTT and provide more detailed insights into the psychometric properties of rating scales, including score invariance between subgroups of people [9].
The aim of this study was to examine the psychometric properties of the SWLS in a cohort of PwPD and age-matched individuals without PD using CTT and the Rasch measurement model.

Patients and controls
This paper is based on the Norwegian ParkWest study [10]. PwPD were included when diagnosed with PD according to the UK Brain Bank criteria [11], and the control group was recruited among their relatives and friends and from social clubs for the elderly. Exclusion criteria for the control group were parkinsonism at clinical examination and/or inability to complete the study program at baseline. The ParkWest cohort has been followed prospectively bi-annually from the time of diagnosis. The SWLS has been used at the 5 (time 1; T1) and 7 (time 2; T2) year visits after diagnosis. Out of 165 PwPD and 170 controls at the 5-year visit and 147 PwPD and 155 controls at the 7-year visit, this study included 146 PwPD and 163 controls from T1 and 116 PwPD and 143 controls from T2 who had responded to the SWLS. Table 1 shows clinical and demographic data from the 5-year visit (T1). The study was approved by the Regional Committee for Medical and Health Research Ethics in Western Norway. All participants provided written informed consent.

Examinations
The full protocol and study procedures have been described in detail elsewhere [10]. Demographic data included in this study were sex, age and years of education. Disease severity was assessed using the Hoehn & Yahr staging [12] and part III (motor examination) of the Unified PD Rating Scale (UPDRS) [13]. Due to ethical considerations, the data will not be shared publicly where doing so may compromise the privacy of study participants. An ethically compliant data set will be made available to interested researchers on request to the authors.

Satisfaction with Life Scale (SWLS)
The SWLS consists of 5 items with seven response categories each that are scored from 1-7, where 1 = strongly disagree and 7 = strongly agree. Summation of complete item responses yields a total raw score that can range from 5 to 35 [2]. A total score of 20-24 is considered average LS; total scores of 25-29 are considered high LS and scores of 15-19 are slightly below average, whereas scores of 5-14 and 30-35 suggest extremely low or high LS, respectively. This study used the Norwegian version of the SWLS (http://internal.psychology.illinois.edu/~ediener/SWLS.html).
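The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `swls_total` and `swls_band` are hypothetical, the band labels follow the wording of this section, and a missing item response is assumed to be coded as `None`, in which case no total score is computed.

```python
from typing import Optional, Sequence


def swls_total(responses: Sequence[Optional[int]]) -> Optional[int]:
    """Sum the five item responses (each 1-7).

    Returns None when any item is missing, since a total score is
    only computable from complete item responses.
    """
    if len(responses) != 5:
        raise ValueError("the SWLS has exactly 5 items")
    if any(r is None for r in responses):
        return None
    if any(not 1 <= r <= 7 for r in responses):
        raise ValueError("item responses must be integers from 1 to 7")
    return sum(responses)


def swls_band(total: int) -> str:
    """Map a total score (5-35) to the interpretive band used in the text."""
    if not 5 <= total <= 35:
        raise ValueError("total score must be between 5 and 35")
    if total <= 14:
        return "extremely low"
    if total <= 19:
        return "slightly below average"
    if total <= 24:
        return "average"
    if total <= 29:
        return "high"
    return "extremely high"
```

For example, a respondent answering 4 on every item obtains a total of 20, which falls in the average band and coincides with the scale midpoint.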

Analyses
The  [14]. CTT analyses were designed to replicate the previous evaluation of the SWLS in PD [8]. In contrast to previous CTT analyses of the SWLS in PD, we also included data from healthy controls and the ordinal nature of raw item data was taken into account in the analyses. SWLS data were analysed regarding completeness (percentages of complete item responses and computable total scores) and scoring assumptions for the legitimacy of computing summed total scores (i.e., similar item score means and standard deviations (SD); corrected item-total correlations ≥0.30 and ≥0.40 suggesting sufficient contribution by each item to the total score and unidimensionality, respectively). Unidimensionality was further tested by exploratory factor analyses (EFA) using minimum rank factor analysis based on polychoric correlations and parallel analysis (500 random permutations of raw data) to determine the number of dimensions [15]. Further analyses included targeting (i.e., average total SWLS scores close to the scale midpoint of 20; floor/ceiling effects ≤15%; skewness within ±1), and reliability (i.e., coefficient alpha ≥0.80), including the standard error of measurement (SEM = SD × √(1 − reliability)) and the smallest detectable difference (SDD = SEM × 1.96 × √2). All analyses were conducted with the pooled (PwPD + controls) sample, as well as for PwPD and controls separately. To account for the ordinal nature of item level data, item-total correlations were computed based on polychoric correlations, reliability was assessed by the ordinal version of coefficient alpha [16], and SEM and SDD were calculated based on ordinal alpha. Traditional parametric item-total correlations and coefficient alpha were also computed for comparative reasons. For methodological details regarding these analyses, see [8,15,16,17].
Analyses were also extended to include the Rasch measurement model, which mathematically defines what is required from rating scale item level data to conform with linear measurement [18]. According to this model, the probability of a certain item response is a function of the difference between the level of the measured construct (e.g., LS) represented by the item and that reported by the person. The model separately locates persons and items on a common linear logit (log-odds units) metric, ranging from minus to plus infinity (with mean item location set at zero). If data accord sufficiently with the model, linear measurement and invariant comparisons are possible [17,19,20,21,22].
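As a worked illustration of the two precision indices, the following Python sketch computes the SEM as SD multiplied by the square root of (1 − reliability), and the SDD as SEM multiplied by 1.96 and by the square root of 2. The function names are hypothetical, and the standard deviation used in the example is an assumed value chosen only for illustration; it is not taken from the study tables. The reliability of 0.92 is the ordinal alpha reported in the abstract.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def smallest_detectable_difference(sem: float) -> float:
    """SDD = SEM * 1.96 * sqrt(2), i.e. a 95% limit for a difference score."""
    return sem * 1.96 * math.sqrt(2.0)


# Hypothetical example: SD of total scores assumed to be 6.0 points.
sem_value = standard_error_of_measurement(sd=6.0, reliability=0.92)
sdd_value = smallest_detectable_difference(sem_value)
print(f"SEM = {sem_value:.2f}, SDD = {sdd_value:.2f}")
```

Under these assumed inputs, a change smaller than the SDD could not be distinguished from measurement error at the 95% level.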
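To make the model statement concrete, the sketch below evaluates category probabilities under the standard polytomous ("partial credit") Rasch formulation, in which the probability of responding in category x depends on the cumulative sum of differences between the person location and the item's threshold locations. The function name and the threshold values are hypothetical and are not estimates from this study; the sketch only evaluates the model equation and does not reproduce the estimation carried out by the analysis software.

```python
import math
from typing import List, Sequence


def pcm_probabilities(theta: float, thresholds: Sequence[float]) -> List[float]:
    """Category probabilities for one item under the partial credit model.

    theta      -- person location in logits
    thresholds -- threshold locations in logits (m thresholds give m + 1 categories)
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta: float, thresholds: Sequence[float]) -> float:
    """Model-expected item score (0..m) at a given person location."""
    probs = pcm_probabilities(theta, thresholds)
    return sum(category * p for category, p in enumerate(probs))


# Hypothetical, ordered thresholds for a seven-category item (six thresholds).
example_thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
print([round(p, 3) for p in pcm_probabilities(1.0, example_thresholds)])
```

Plotting `expected_score` against a range of person locations would give the kind of model-expected item characteristic curve against which empirical class-interval responses are compared.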
Here we used Rasch analysis to address targeting, reliability, Rasch model fit, rating scale response category functioning, and uniform and non-uniform Differential Item Functioning (DIF) by time of assessment (T1 vs. T2), group (PwPD vs. controls), gender, age, and (for PwPD only) UPDRS III motor score. Subgroups for DIF analyses of age and UPDRS III were defined according to their respective median values. DIF by time of assessment was checked at the outset of these (and the CTT) analyses and absence of DIF by time was taken as support for merging data from the two time points, thereby gaining precision of estimates [23]. Following the main analysis, data were also Rasch analysed separately for PwPD and controls.
Rasch analyses were conducted using the unrestricted polytomous ("partial credit") model as implemented in RUMM2030 (Professional Edition version 5.4) [14], with the sample divided into eight class intervals (subgroups with similar levels of LS according to SWLS total scores). Analyses included both graphical and statistical methods, which are of equal primacy. To facilitate comparisons between analyses (full sample vs. PwPD vs. controls) and because the risk of type I errors increases with sample size, data were analysed with the effective sample size algebraically adjusted to n = 250 in the calculation of P-values, while leaving all other aspects of data (e.g., locations, fit residuals) unaltered [20,24,25]. Bonferroni adjustments for multiple null hypothesis testing were applied (alpha level of significance, 0.05) [25,26].
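The Bonferroni adjustment mentioned above amounts to dividing the overall alpha level by the number of tests. The following sketch illustrates this under the assumption of five tests (one per SWLS item); the number of tests actually used in each analysis is not stated in this section, and the function names are hypothetical.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni adjustment."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True when a single test's P-value falls below the adjusted threshold."""
    return p_value < bonferroni_alpha(alpha, n_tests)


# Assumed illustration: five item-level tests at an overall alpha of 0.05.
threshold = bonferroni_alpha(0.05, 5)
print(f"adjusted threshold = {threshold:.3f}")
```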

Results
We found no evidence of DIF by time (P>0.34). Therefore, the main analyses were conducted with the merged (T1+T2) data set. Table 2 reports results from the CTT analyses. We found support for all aspects assessed, including CTT scoring assumptions, targeting and reliability in the pooled as well as the separate analyses of PwPD and control subjects. The SWLS was reasonably well targeted to the sample, although the scale tended to represent lower levels of LS compared to those reported by the sample, as suggested by average raw total scores above the scale midpoint and a slight negative skew (Table 2). Fig 1 provides a more detailed account of targeting as derived from Rasch analysis. It is seen that the scale represents a quantitative continuum from lower to higher levels of LS (ranging approximately 5.8 logits, from about -1.83 to 4.01 logits; Fig 1, lower panel) that is similar to that found in the sample (ranging approximately 8.4 logits, from about -3.23 to 5.14 logits; Fig 1, upper panel). The mean person location is 1.07 logits, i.e., the sample reported LS levels on average about 1 logit above that represented by the SWLS. It is also seen that there tend to be gaps in the scale's representation of the variable at levels around 1 logit and above (Fig 1, lower panel). As a consequence, people with higher levels of LS are measured with less precision, as illustrated by relatively low information function values (i.e., the inverse of measurement error) at levels above about 1 logit (as well as below about -2 logits; Fig 1, upper panel). However, reliability was good and the scale was able to separate between 5 statistically distinct strata of people (Table 3).
Item response data displayed acceptable overall fit to the Rasch model (Table 3). Item characteristic curves (ICCs) of empirical responses among people in the eight class intervals relative to Rasch model expectations showed negligible to modest discrepancies (Fig 2). Item 5 had the poorest accordance between empirical data and model expectations, where empirical responses tended to exhibit a less steep pattern than expected, suggesting that this item may represent a somewhat different construct than the scale as a whole. Statistically, this was mirrored by a relatively large positive fit residual and a significant chi-square value (Table 4).
Given that items 4 and 5 have been suggested to represent a somewhat different construct than items 1-3, we conducted a principal component analysis (PCA) on the residuals in order to explore this issue. In agreement with the previous hypothesis, these item groups loaded in different directions. Items 1-3 displayed negative loadings (-0.493 to -0.680) on the first principal component, whereas items 4 and 5 loaded positively (0.539 and 0.810, respectively). However, the two subsets did not yield significantly different person location estimates for more than 5.6% of individuals (binomial 95% Agresti-Coull CI, 3.3-9.2%), suggesting sufficient unidimensionality across all five items (as also suggested by the polychoric based EFAs; Table 2). Furthermore, item residual correlations of the full 5-item SWLS were low (Table 3), suggesting local independence.
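The Agresti-Coull interval used for the proportion of significantly different person estimates can be computed as in the sketch below. The interval adds z²/2 pseudo-successes and z² pseudo-observations before applying the usual normal approximation. The counts in the example (14 of 250) are hypothetical and serve only to illustrate the calculation; the underlying counts are not reported in this section, and the function name is an assumption.

```python
import math
from typing import Tuple


def agresti_coull_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Agresti-Coull confidence interval for a binomial proportion.

    The default z corresponds to a two-sided 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    n_adj = n + z * z                       # adjusted number of observations
    p_adj = (successes + z * z / 2.0) / n_adj  # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clip to the valid range of a proportion.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical counts for illustration only.
lower, upper = agresti_coull_interval(successes=14, n=250)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

In this kind of unidimensionality check, the interval is compared with the 5% of significant differences that would be expected by chance alone.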
Assessment of the empirical functioning of the seven response categories showed that these did not work as expected with items 1 and 3. Specifically, the second and third response categories were problematic and while they did behave as expected with items 2, 4 and 5, the pattern was similar also among these items (Fig 3). That is, it appears difficult for people to distinguish between 7 levels of LS, particularly at the lower end of the continuum. We therefore explored reducing the number of response categories by post-hoc collapsing of the seven original response categories (scored as 0123456 in the analysis) into a five-category response scale (scored as 0011234) across all five items. Reanalysis did not reveal any problems with the revised response format while reliability was unaffected (0.86), suggesting that a five-category response scale may be advantageous. However, this needs empirical prospective confirmation and the overall fit deteriorated somewhat (overall item-trait chi-square interaction, 63.57; P = 0.013) following collapsing of response categories across items. This may be due to the collapsing of response categories that actually did work [28].
There was no DIF by time, age or gender, but item 5 exhibited uniform DIF by group. That is, except for in the lowest class intervals (those reporting lowest LS), PwPD were more likely to score higher than control subjects on item 5 regardless of their levels of LS (Fig 4). Item 5 was then adjusted for the observed DIF by splitting it into two new subgroup specific items, one for PwPD and one for controls. The clinical significance of the observed DIF was then studied by assessing if the estimated person locations (logit measures) were affected by DIF. Person locations obtained after adjustment for DIF were compared to those estimated from the original non-DIF-adjusted scale. Before doing so, items without DIF in the original scale were anchored by their item locations from the DIF-adjusted scale to assure that the two sets of person estimates shared the same unit of measurement. The two resulting sets of person locations were very similar, with mean (95% CI) values of 1.07 (0.92-1.21) and 1.10 (0.95-1.25) logits for the unadjusted and DIF-adjusted scales, respectively. The intraclass correlation between the two was 0.998. In addition and also due to the observed signs of misfit of item 5, we explored the effects of omitting item 5. This compromised overall model fit (overall item-trait chi-square interaction, 49.04, P = 0.04) and yielded similar person locations (mean (95% CI), 1.34 (1.16-1.51) logits). The intraclass correlation between scales with and without item 5 was 0.950. Based on these observations item 5 was retained.
Separate Rasch analyses among PwPD and controls yielded very similar results to those obtained in the main analysis (Tables 3 and 4), including issues with the seven response categories. There was no DIF by time, age, gender or PD severity (according to UPDRS III groups) among patients with PD, whereas data suggested uniform DIF by gender for items 2 and 5 among control subjects. Adjustment for gender DIF by splitting item 5 removed the gender DIF associated with item 2, suggesting that this DIF was artificial. Similarly to the DIF by group in the main analyses, the gender DIF of item 5 did not have an appreciable influence on person estimates.
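The post-hoc collapsing of response categories described in the Results (original analysis scores 0123456 recoded as 0011234) is a simple recoding step, sketched below. The function names are hypothetical, and the sketch covers only the recoding of item scores, not the subsequent Rasch reanalysis.

```python
from typing import Dict, List, Sequence

ORIGINAL_SCORES = "0123456"   # seven categories as scored in the analysis
COLLAPSED_SCORES = "0011234"  # five-category recoding explored post hoc

RECODE: Dict[int, int] = {
    int(old): int(new) for old, new in zip(ORIGINAL_SCORES, COLLAPSED_SCORES)
}


def to_analysis_scores(responses_1_to_7: Sequence[int]) -> List[int]:
    """Convert questionnaire responses (1-7) to analysis scores (0-6)."""
    if any(not 1 <= r <= 7 for r in responses_1_to_7):
        raise ValueError("responses must be integers from 1 to 7")
    return [r - 1 for r in responses_1_to_7]


def collapse_scores(scores_0_to_6: Sequence[int]) -> List[int]:
    """Recode 0-6 analysis scores into the collapsed 0-4 scoring."""
    if any(s not in RECODE for s in scores_0_to_6):
        raise ValueError("scores must be integers from 0 to 6")
    return [RECODE[s] for s in scores_0_to_6]


# A respondent's questionnaire answers, recoded in two steps.
print(collapse_scores(to_analysis_scores([1, 2, 3, 5, 7])))
```

Because the recoding merges the two lowest categories and the next two categories, the order of the remaining categories is preserved while the distinctions at the lower end of the response scale are removed.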
To further explore the measurement invariance of the SWLS as estimated for the pooled sample as well as separately for PwPD and control subjects, the linear logit locations associated with each possible raw total score were examined and are displayed in Table 5 together with the estimated standard error for each location. Fig 5 illustrates the relationships between these estimated logit locations from the three analyses.

Discussion
In this paper we have replicated previous CTT based psychometric results from using the SWLS in PD. We also expanded the analyses to account for the ordinal nature of item data in the CTT based analyses and to include tests according to Rasch measurement theory and comparability of the scale when used among PwPD and age-matched controls. There was a relatively high level of missing responses to the SWLS in the PwPD group resulting in 13% of noncomputable total SWLS scores. However, this is a lower rate than the 44% non-response rate reported by Rosengren et al. [8] and appears to be explained by our inclusion of people with cognitive impairments, as indicated by relatively lower scores on the MMSE in the group with missing item responses (data not reported). However, our CTT based observations are in accord with previous reports and support the legitimacy of creating simple sum scores from the five SWLS items representing a common latent variable that is measured with acceptable levels of reliability and precision, both among PwPD and control subjects. Rasch analyses provided similar implications but yielded additional and new insights into the performance of the SWLS. Particularly, we were able to reveal problems related to the distinction of aspects of LS into seven rating scale categories. However, our study shows that the SWLS is a valid instrument for measuring LS in PD and in comparing PwPD with healthy individuals. Rasch analyses also illustrated that people reporting "extremely" high and low LS (according to Diener's interpretation guide [29]) are measured with compromised precision. However, this is not considered a major problem because precision is arguably of less concern at the highest and lowest levels of LS, although it affects the ability to detect changes and differences within these levels. Furthermore, reliability was acceptable and the scale was still able to differentiate between 4-6 distinct strata of people.
The obvious way to improve targeting and precision would be to increase the number of items and/or response categories to enhance representation of the latent LS continuum. However, the brevity of the scale may be considered as one of its advantages from a practical point of view, and we found clear evidence that the number of response categories would need to be reduced, not increased. That is, respondents appear to have problems distinguishing between response categories expressing lower levels of LS. Collapsing the response categories from a seven- into a five-grade response scale seems reasonable in order to reduce this problem without compromising its reliability and (therefore) precision. However, this is to be considered an experimental procedure and we do not recommend relying on collapsed response categories since it is not known how people actually would have responded according to the collapsed categories [17,28]. Instead, this needs to be examined empirically. Furthermore, it has been shown that collapsing categories that actually do work (albeit marginally, such as found here) can undermine the Rasch model [28], as illustrated here by compromised model fit following reduction of the response categories across all five items. Fit of the SWLS to the Rasch model was generally acceptable. Some of the statistical indices of model fit such as the total item-trait interaction chi-square based P-value and fit residuals in some cases exhibited values outside generally recommended ranges. However, it should be noted that there is no single aspect of fit that is either necessary or sufficient for the evaluation of fit, but all data need to be considered relatively, interactively and in perspective of context [17,19,20]. Indeed, as evident from the graphical representations of item model fit, empirical item responses exhibited close accordance with model expectations. The possible exception was item 5, which exhibited a pattern suggestive of multidimensionality.
This is in accordance with previous reports suggesting that this item (together with item 4) may represent a somewhat different dimension than the other SWLS items due to referring to the past, as opposed to the present [29]. Considering the item wording, this is more evident for item 5 than for item 4, which is in accordance with our observations in that it was item 5 that exhibited signs of misfit. However, the misfit of item 5 was relatively modest, its deletion did not improve the scale, EFA supported unidimensionality, and assessment of the two suggested subdimensions of the SWLS did not reveal evidence of multidimensionality since person location estimates did not differ in more instances than would be expected by chance.
We also found that item 5 was associated with DIF by group in the main analysis and by gender among control subjects. While this is an additional indication that this item is not entirely coherent with the other SWLS items, the observed DIF did not appear to cause any obvious bias to the SWLS as a measure of LS. Therefore, taken together and given the theoretical underpinnings of its construction, the SWLS appears to exhibit reasonable enough fit to the Rasch model to provide measurement of LS among PwPD and age-matched controls that is useful for most circumstances. However, our data also show that in addition to the seven-grade SWLS response scale, reconsideration of content and/or wording of item 5 may be worthwhile in future attempts to improve the scale.
The SWLS item hierarchy, i.e., the ordering of items from lower to higher LS according to their logit locations was consistent across the samples with regard to items representing the lowest and highest levels of LS (items 4 and 1, respectively). Furthermore, taking the uncertainty (i.e., item location standard errors) associated with the estimated locations into account, the hierarchies of the other items did not exhibit any clear differences between the samples.
The hierarchy also appears to make general theoretical and clinical sense in that considering one's life as close to ideal (item 1) represents higher levels of LS than it does to agree that one has achieved the important things in life (item 4). This provides general support for the internal construct validity of the SWLS [17].

Conclusion
We replicated previous psychometric CTT based results and expanded the analyses by taking account of the ordinal nature of item responses and using Rasch measurement theory. Rasch analyses illuminated new aspects and more detailed information regarding the performance and validity of the SWLS, and identified areas for future implications in order to improve the scale. In particular, future studies should try to confirm whether the scale would benefit from a reduction from seven to five response categories. However, our findings support the SWLS as a reliable and valid instrument for measuring LS in PwPD, and that the scale is able to distinguish between levels of LS in and between PwPD and healthy controls. These observations are of considerable significance as life satisfaction and related constructs are central to a person-centred approach to chronic disease management.

Fig 5 (caption fragment): [...] Table 5). Inserted in each panel are the respective Pearson product-moment correlations (r). Nonparametric Spearman correlations were 1.0 in all three instances. Intraclass correlation across the three sets of estimates is 0.990. PD, Parkinson's disease; SWLS, Satisfaction with Life Scale. doi:10.1371/journal.pone.0163931.g005