Differential Item Functioning in the SF-36 Physical Functioning and Mental Health Sub-Scales: A Population-Based Investigation in the Canadian Multicentre Osteoporosis Study

Background Self-reported health status measures, like the Short Form 36-item Health Survey (SF-36), can provide rich information about the overall health of a population and its components, such as physical, mental, and social health. However, differential item functioning (DIF), which arises when population sub-groups with the same underlying (i.e., latent) level of health have different measured item response probabilities, may compromise the comparability of these measures. The purpose of this study was to test for DIF on the SF-36 physical functioning (PF) and mental health (MH) sub-scale items in a Canadian population-based sample. Methods Study data were from the prospective Canadian Multicentre Osteoporosis Study (CaMos), which collected baseline data in 1996–1997. DIF was tested using a multiple indicators multiple causes (MIMIC) method. Confirmatory factor analysis defined the latent variable measurement model for the item responses and latent variable regression with demographic and health status covariates (i.e., sex, age group, body weight, self-perceived general health) produced estimates of the magnitude of DIF effects. Results The CaMos cohort consisted of 9423 respondents; 69.4% were female and 51.7% were less than 65 years. Eight of 10 items on the PF sub-scale and four of five items on the MH sub-scale exhibited DIF. Large DIF effects were observed on PF sub-scale items about vigorous and moderate activities, lifting and carrying groceries, walking one block, and bathing or dressing. On the MH sub-scale items, all DIF effects were small or moderate in size. Conclusions SF-36 PF and MH sub-scale scores were not comparable across population sub-groups defined by demographic and health status variables due to the effects of DIF, although the magnitude of this bias was not large for most items. We recommend testing and adjusting for DIF to ensure comparability of the SF-36 in population-based investigations.


Introduction
Self-report health status measures, like the Short Form 36-item Health Survey (SF-36) [1], can provide rich information about the overall health of a population [2,3] and its components, such as physical, mental, and social health. However, in order for comparisons of health status across population sub-groups to be accurate, these self-report measures must be valid and reliable. Construct validity and test-retest reliability are frequently evaluated for a measure's summary score(s), that is, after the item responses have been summed [4]. Self-report measures are less often evaluated for the effects of differential item functioning (DIF), which can also affect construct validity [5]. DIF occurs when individuals with the same underlying (i.e., latent) level of health do not interpret a measure's items in the same way. DIF can result in an unexpected lack of scale comparability and erroneous conclusions about the presence of group differences [5].
The SF-36 has undergone comprehensive psychometric evaluations of its reliability and validity [6][7][8][9]. DIF and related topics of differential scale functioning have been investigated for the SF-12 and SF-36 [10][11][12][13][14][15][16], but most of these analyses have been conducted in clinical or disease-specific samples. DIF analyses are often conducted on demographic and ethnic characteristics even though other determinants of health, including risk factors for poor health and presence of chronic conditions, may be potential sources of DIF [11,13,17].
The physical functioning (PF) and mental health (MH) sub-scales of the SF-36 are the sub-scales most frequently investigated in psychometric evaluations, and are also commonly used to compare health status at the population level. Our study objective was to test for DIF on the SF-36 PF and MH sub-scale items in population-based data on demographic and health-related variables.

Data Source
Study data were from the Canadian Multicentre Osteoporosis Study (CaMos), a prospective cohort study initiated to provide national prevalence and incidence estimates for osteoporosis and osteoporosis-related fractures in the Canadian population [18]. Baseline data, which were the focus of the current study, were collected in 1996-1997, using both personal interview and paper-based questionnaires, from participants in nine Canadian regional urban centres. Respondents were at least 25 years of age and were recruited without regard for disease status. Details of the methodology to select the CaMos cohort and collect the study data have been described elsewhere [2,18]. To ensure the quality and integrity of the data, interviewers are trained to minimize the amount of missing data (i.e., probe for responses), and respondents are re-contacted if clarification of responses is required to resolve inconsistencies in the data.

Measures
Version 1 of the SF-36 was used in CaMos; it encompasses eight sub-scales: PF, role physical, bodily pain, general health, vitality, social functioning, role emotional, and MH. Item responses are captured using dichotomous or ordinal scales [1]. The PF sub-scale contains 10 items, each having three response options: limited a lot, limited a little, and not limited at all. The MH sub-scale consists of five items, each having six response options: all of the time, most of the time, a good bit of the time, some of the time, a little of the time, none of the time. Responses for "Have you felt calm and peaceful?" and "Have you been a happy person?" are reverse coded so that higher scores represent better MH, in keeping with the other sub-scale items. DIF analyses were conducted for the following demographic and health-related variables: sex, age group, body weight status, and self-perceived general (i.e., overall) health. Age was classified as 25-49 years, 50-64 years, 65-74 years, and ≥75 years. Body weight status was based on BMI, which was calculated from measured height and weight (kg/m²), and was categorized as under or normal weight (<25.0), overweight (25.0-29.9) and obese (≥30.0) in accordance with published guidelines [19]. General health was based on a single question in the SF-36 and was categorized as excellent/very good, good, and fair/poor.
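The covariate coding described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names, the category labels, the treatment of BMI values between 29.9 and 30.0 as overweight, and the assumption that the six MH response options are coded 1 to 6 in the order listed are all ours.

```python
def bmi_category(weight_kg: float, height_m: float) -> str:
    """Classify body weight status from measured weight and height (kg/m^2)."""
    bmi = weight_kg / height_m ** 2
    if bmi < 25.0:
        return "under/normal weight"
    if bmi < 30.0:  # assumption: values in (29.9, 30.0) count as overweight
        return "overweight"
    return "obese"


def age_group(age_years: int) -> str:
    """Map age in years to the four age groups used in the DIF analyses."""
    if age_years < 25:
        raise ValueError("respondents were at least 25 years of age")
    if age_years <= 49:
        return "25-49"
    if age_years <= 64:
        return "50-64"
    if age_years <= 74:
        return "65-74"
    return "75+"


def reverse_code(response: int, n_options: int = 6) -> int:
    """Reverse an ordinal response coded 1..n_options (assumed coding)."""
    if not 1 <= response <= n_options:
        raise ValueError("response outside the assumed 1..n_options range")
    return n_options + 1 - response


if __name__ == "__main__":
    print(bmi_category(85.0, 1.75), age_group(67), reverse_code(2))
```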

Analysis
The analyses were conducted for respondents with complete information on all items and explanatory variables. Descriptive analyses were conducted using frequencies and percentages.
Both non-parametric and parametric approaches have been proposed to test for DIF; they can accommodate multiple population characteristics (i.e., covariates) that may be associated with item responses. Parametric approaches include logistic regression analysis [20], the multiple indicators multiple causes (MIMIC) model, and item response theory (IRT) models [5,21,22], with the latter two being popular because they can be applied to binary and ordinal item responses [5], are flexible to incorporate one or more latent constructs, and can be readily implemented using existing software. Woods [23] demonstrated, via simulation, that the MIMIC model will produce more accurate results than the IRT model, in a simple two-group scenario, for small group sizes.
We adopted the MIMIC model and used the following strategy to test for DIF. First, the assumption of unidimensionality, that is, that all sub-scale items measure a single construct [5], was examined by applying factor analysis with oblique rotation to the polychoric correlations for the sub-scale items [24]. The first eigenvalue should be substantially higher than the second eigenvalue (i.e., ratio > 4:1) to support unidimensionality [25]. As well, fit of a unidimensional measurement model was assessed using the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). An RMSEA value ≤0.10 and a CFI value >0.90 indicate acceptable model fit [23,26,27]. The measurement model defines the relationship between the underlying latent construct (i.e., PF or MH) and the sub-scale items, while a structural model is used to specify the relationships of the observed covariates with the sub-scale items and the latent construct [23,28,29]. A statistically significant direct effect of the observed covariates on the sub-scale item(s) indicates uniform DIF, after controlling for differences in latent health status between comparison groups (Fig 1). In keeping with previous research [28][29][30][31], a three-step procedure was adopted: (a) empirically select an anchor item(s) for the sub-scale, (b) test for DIF on each sub-scale item, and (c) fit a final model that allows for differential functioning on the individual sub-scale items.
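The two screening rules for unidimensionality and model fit can be written as simple checks. The sketch below is illustrative only; the function names are hypothetical, and the RMSEA cut-off is read as "less than or equal to 0.10", which is an assumption about the intended inequality. The example values are the eigenvalues and fit indices reported in the Results.

```python
def supports_unidimensionality(first_eigenvalue: float,
                               second_eigenvalue: float,
                               min_ratio: float = 4.0) -> bool:
    """True when the first-to-second eigenvalue ratio exceeds min_ratio."""
    return first_eigenvalue / second_eigenvalue > min_ratio


def acceptable_fit(rmsea: float, cfi: float,
                   max_rmsea: float = 0.10, min_cfi: float = 0.90) -> bool:
    """True when both fit criteria are met (RMSEA <= 0.10 is assumed)."""
    return rmsea <= max_rmsea and cfi > min_cfi


if __name__ == "__main__":
    # Eigenvalues reported in the Results for the PF and MH sub-scales.
    print("PF ratio:", round(7.61 / 0.63, 2), supports_unidimensionality(7.61, 0.63))
    print("MH ratio:", round(3.13 / 0.69, 2), supports_unidimensionality(3.13, 0.69))
    # Fit indices reported in the Results.
    print("PF, no error covariances:", acceptable_fit(0.11, 0.98))
    print("PF, error covariances:", acceptable_fit(0.057, 0.99))
    print("MH, one factor:", acceptable_fit(0.15, 0.97))
```

These checks are a mechanical screen; in the study, the eigenvalue ratio, CFI, and RMSEA are weighed together when judging whether a single dominant factor is supported.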
At least one anchor item is needed to define the latent construct on which the groups are compared. We adopted the following method to select an anchor item [32]. First, a model was fit to the data that included the effects of the covariates on the latent variable but no direct effects between the covariates and the sub-scale items. Next, a series of models was fit to the data that added paths from the covariates to the items; this was done one sub-scale item at a time. A likelihood ratio (LR) statistic was used to test the difference between the two models for each item [32]. An item was labelled DIF-free if the LR statistic was not statistically significant. The DIF-free item with the smallest LR statistic was selected as the anchor item. If none of the items were DIF-free, then the item with the smallest LR statistic was selected as the anchor item [32].
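The anchor selection rule can be expressed compactly. The following sketch assumes the per-item LR statistics and p-values have already been obtained from the fitted models; the function name, the input structure, and the example numbers are hypothetical and are not study results.

```python
from typing import Dict, Tuple


def select_anchor(lr_results: Dict[str, Tuple[float, float]], alpha: float) -> str:
    """Pick the anchor item from {item: (LR statistic, p-value)}.

    Prefer the DIF-free item (p >= alpha) with the smallest LR statistic;
    if no item is DIF-free, fall back to the smallest LR statistic overall.
    """
    if not lr_results:
        raise ValueError("no items supplied")
    dif_free = {item: stat for item, (stat, p) in lr_results.items() if p >= alpha}
    candidates = dif_free or {item: stat for item, (stat, _) in lr_results.items()}
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    example = {"item_a": (52.3, 0.0001), "item_b": (31.9, 0.0004), "item_c": (88.0, 0.00001)}
    print(select_anchor(example, alpha=0.005))
```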
Next, an unconstrained DIF model was fit to the data that included direct effects from the covariates to each sub-scale item, except for the anchor item. Then we fit a second set of no-DIF models that did not contain direct effects. A LR statistic was used to test the difference between the constrained "no-DIF" model in which the covariate direct effects were set to zero and the unconstrained DIF model with freely-estimated covariate direct effects for each item. A statistically significant LR statistic indicates the presence of uniform DIF on the item.
The third step was to fit a model that included direct effects of the covariates on all the items for which DIF was identified in the previous step, as well as direct effects of the covariates on the latent variable. This final DIF model was used to obtain parameter estimates and predict the factor scores.
Regression coefficients from the final model were exponentiated to produce odds ratios (ORs) to estimate the size of the DIF effects. Cut-points of 0.3, 0.5, 0.7 for the log of the ORs were used to indicate small, moderate and large DIF effect sizes [33]. Accordingly, an OR outside the range of 0.5 to 2.0 (i.e., a large effect size) was used to indicate a clinically meaningful DIF effect [13,34].
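As a worked illustration of these cut-points, the sketch below converts an OR to the log scale and labels its magnitude. The handling of the boundaries (a label applies when the absolute log OR is at or above its cut-point), the label for values below 0.3, and the function names are assumptions made for this example. Note that exp(0.7) is approximately 2.01 and exp(-0.7) is approximately 0.50, which is why the 0.5 to 2.0 range on the OR scale corresponds to the large cut-point.

```python
import math

# Cut-points on the absolute log odds ratio, as given in the text.
SMALL, MODERATE, LARGE = 0.3, 0.5, 0.7


def dif_effect_size(odds_ratio: float) -> str:
    """Label the size of a DIF effect from its odds ratio (assumed boundaries)."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    magnitude = abs(math.log(odds_ratio))
    if magnitude >= LARGE:
        return "large"
    if magnitude >= MODERATE:
        return "moderate"
    if magnitude >= SMALL:
        return "small"
    return "below small cut-point"


def outside_or_range(odds_ratio: float, low: float = 0.5, high: float = 2.0) -> bool:
    """True when the OR falls outside the 0.5 to 2.0 range."""
    return odds_ratio < low or odds_ratio > high


if __name__ == "__main__":
    # 1.66 is the OR reported for bathing or dressing (PF10); 2.5 is hypothetical.
    for value in (1.66, 2.5):
        print(value, dif_effect_size(value), outside_or_range(value))
```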
The impact of DIF was also investigated by testing the associations of the demographic and health variables with the predicted factor scores in the final DIF and no-DIF models [13]. Differences in the predicted scores on the covariates for the two models were computed [13] and subsequently tested for statistical significance using a multivariable linear regression model.
Fig 1. Illustration of the multiple indicators multiple causes model to test for differential item functioning on SF-36 sub-scale items. In this model, $y_i$ is the $i$th sub-scale item ($i = 1, \ldots, I$); the dashed arrow from each covariate to the item represents the DIF (i.e., direct) effect; $\beta_{1\mathrm{Sex}}$ is the regression coefficient for the difference in thresholds on item $i$ for males and females; similar regression coefficients are defined for other model covariates; $\alpha_i$ is the regression coefficient for the latent variable and the $i$th item; $\gamma_k$ is the regression coefficient for the latent variable and the $k$th covariate ($k = 1, \ldots, K$); $\tau_{ij}$ is the threshold for the $(j-1)$th response category ($j = 1, \ldots, J$) for item $i$; $\varepsilon_i$ is the error term for the $i$th item; $\zeta$ is the residual error for the latent variable.
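For readers who prefer equations to a path diagram, the MIMIC model with uniform DIF is commonly written in the latent-response form below. This is a generic sketch rather than the exact parameterization used in the study: the latent response $y_i^{*}$, the covariate values $x_k$, the latent variable $\eta$, the general direct-effect coefficient $\beta_{ik}$, and the category index $c$ are notation introduced here, while $\alpha_i$, $\gamma_k$, $\varepsilon_i$, and $\zeta$ follow the figure description.

```latex
\begin{aligned}
y_i^{*} &= \alpha_i \eta + \sum_{k=1}^{K} \beta_{ik} x_k + \varepsilon_i, \\
\eta    &= \sum_{k=1}^{K} \gamma_k x_k + \zeta, \\
y_i     &= c \quad \text{if } \tau_{i,c} < y_i^{*} \le \tau_{i,c+1},
           \qquad c = 0, \ldots, C_i - 1,
\end{aligned}
\qquad \tau_{i,0} = -\infty, \quad \tau_{i,C_i} = +\infty .
```

Under this formulation, uniform DIF on item $i$ with respect to covariate $k$ corresponds to $\beta_{ik} \neq 0$, and an anchor item is one for which all $\beta_{ik}$ are fixed at zero.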
Parameters were estimated using the maximum likelihood method with robust standard errors (MLR) [24]. Details for computation of the differences in log likelihood statistics for nested models can be found on the Mplus website (https://www.statmodel.com/chidiff.shtml). Given the increased probability of a Type I error when conducting multiple tests of significance, we adopted the Bonferroni procedure for all inferential analyses [35], adjusting the nominal level of significance (i.e., α) by either the number of items in the PF or MH sub-scales or the number of levels of the covariates, depending on the analysis.
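The scaled difference test for nested models estimated with MLR can be sketched as follows, using the loglikelihood-based formula documented on the Mplus page cited above: cd = (p0*c0 - p1*c1)/(p0 - p1) and TRd = -2*(L0 - L1)/cd, where L, c, and p are the loglikelihood, scaling correction factor, and number of free parameters of the nested (0) and comparison (1) models. The function names and the example inputs are hypothetical, and the nominal α of 0.05 is an assumption, since the nominal level is not stated in the text. The chi-square tail probability is computed with the standard library only.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Q(m, h) = exp(-h) * sum_{k=0}^{m-1} h^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start at Q(1/2, h) = erfc(sqrt(h)) and apply
    # Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1).
    total = math.erfc(math.sqrt(half))
    a = 0.5
    for _ in range(df // 2):
        total += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return total


def scaled_lr_difference(l0, c0, p0, l1, c1, p1):
    """Return (TRd, df, p-value); model 0 must be nested in model 1."""
    if p1 <= p0:
        raise ValueError("the comparison model must have more free parameters")
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)
    if cd <= 0:
        raise ValueError("non-positive scaling correction; test is not defined")
    trd = -2.0 * (l0 - l1) / cd
    df = p1 - p0
    return trd, df, chi2_sf(trd, df)


def bonferroni_alpha(nominal_alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni procedure."""
    return nominal_alpha / n_tests


if __name__ == "__main__":
    # Hypothetical loglikelihoods, scaling factors, and parameter counts.
    trd, df, p = scaled_lr_difference(-51234.6, 1.21, 40, -51201.3, 1.18, 47)
    alpha = bonferroni_alpha(0.05, 10)  # assumed nominal alpha; 10 PF items
    print(f"TRd = {trd:.2f}, df = {df}, p = {p:.4g}, significant = {p < alpha}")
```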
The MIMIC framework was implemented using Mplus version 7.11 [24], while SAS version 9.3 was used to conduct the descriptive analyses [36]. To identify each model in the MIMIC framework, the latent factor mean was constrained to zero and the variance was constrained to one. This research was approved by the University of Manitoba Health Research Ethics Board. CaMos participants provided written informed consent at the time of study entry.

Results
Of the 9423 respondents in the CaMos cohort, 69.4% were women and slightly more than half (51.7%) were under 65 years of age. One third reported being in good health and 11.0% reported being in fair or poor health. Overweight and obese individuals accounted for 40.7% and 22.3% of respondents, respectively. A total of 96.2% (n = 9062) were retained in the PF sub-scale analysis and 96.7% (n = 9115) were retained in the MH sub-scale analysis after excluding individuals with missing observations. Table 1 shows the distribution of item responses for the PF and MH sub-scales. Close to half (43.3%) of respondents reported being limited a lot in vigorous activities, while only 16.0% were limited a lot in walking more than a mile. For the MH sub-scale, few respondents reported being very nervous, feeling so down in the dumps that nothing could cheer them up, or feeling downhearted and blue for some or all of the time. S1 Table and S2 Table show the percentage of respondents for selected categories of the PF and MH sub-scale items on each of the covariates. There was a substantial increase in the percentage of respondents who experienced a lot of limitations in their activities, including walking a single block, with age. While there were few differences in the PF item percentages on "a lot of limitations" between respondents who were normal or underweight and overweight, those who were obese had a number of limitations. Respondents with excellent or very good overall health had substantially fewer functional limitations and were also much more likely to feel calm and peaceful and be happy. There were fewer differences for the MH sub-scale items, except for general health status.
For the PF sub-scale items, the factor analysis produced eigenvalues of 7.61 for the first factor and 0.63 for a second factor; the ratio of eigenvalues of 12.1 suggests a unidimensional factor structure. Further analysis revealed that a one-factor model without error covariances had a reasonable fit to the data (RMSEA = 0.11, 95% CI = 0.10-0.11; CFI = 0.98). However, a single-factor model with error covariances between five pairs of items (PF1 with PF2, PF2 with PF3, PF4 with PF5, PF7 with PF8, PF8 with PF9) resulted in a better fit to the data (RMSEA = 0.057, 95% CI = 0.054-0.067, CFI = 0.99). Factor loadings for the sub-scale items ranged from 0.80 to 0.93.
For the MH sub-scale items, factor analysis revealed that the eigenvalues were 3.13 for the first factor and 0.69 for a second factor. The one-factor model produced CFI = 0.97 and RMSEA = 0.15 (95% CI = 0.14-0.15), and factor loadings ranged in value from 0.61 to 0.82. Both of these analyses support a single dominant factor for both the PF and MH sub-scale items.
In step 1 of the DIF analysis, which focused on identifying anchor items for each of the sub-scales, all of the LR statistics were statistically significant, indicating that no items were DIF-free. However, "walking more than a mile" and "felt so down in the dumps that nothing could cheer you up" had the smallest LR statistics and thus were selected as anchor items. Table 2 displays the LR test results for DIF. All the items except the anchor items showed statistically significant effects. The factor loadings and item thresholds are reported in S3 Table. As Table 3 reveals, eight of the 10 PF sub-scale items exhibited DIF by age group, sex, body weight status and/or self-perceived health status. Women had greater odds of reporting limitations on PF2, PF3, and PF4, and greater odds of having no limitations in bathing or dressing (PF10; OR = 1.66), even after controlling for differences in their underlying functional abilities. Older respondents with the same underlying functional abilities scored lower than younger respondents in vigorous and moderate activities, and scored higher in walking one block and bathing/dressing self. With respect to self-perceived general health status, DIF effects were observed for four PF items, with respondents in poorer health being more likely to endorse limitations in vigorous and moderate activities and also being more likely to report better physical function in walking one block, relative to those in very good or excellent health, after controlling for differences in latent physical functioning. Relative to normal/underweight respondents, overweight and obese people were more likely to report fewer limitations in moderate activities and lifting or carrying groceries, but were more likely to report limitations in bending, kneeling or stooping, after controlling for overall physical health status.
The DIF effects were small to large in size across the PF items; the smallest effects were observed for the following PF items: vigorous activities and lifting and carrying groceries. The largest effects were observed for lifting and carrying groceries and walking one block.
Four items in the MH sub-scale exhibited DIF effects. Women tended to endorse having been a very nervous person, feeling less calm and peaceful, and feeling downhearted and blue more often than men, after controlling for differences in underlying MH status. Older respondents, relative to younger respondents who had the same latent MH status, showed a greater propensity to feel calm and peaceful, and to be happy. Respondents with good and fair/poor health had lower odds of feeling calm and peaceful and of being happy, relative to respondents with better self-perceived general health status, even after controlling for differences in underlying MH status. Respondents who were obese had greater odds of being very nervous. Obese individuals were also more likely to endorse being happy people (OR = 1.26) even after controlling for differences in their latent MH status. However, the DIF effects were small to moderate in size for all MH sub-scale items.
Table 4 provides the regression parameter estimates for the effects of the covariates on the latent PF and MH scores in the DIF and no-DIF models, and their absolute differences. For the PF sub-scale, all of the covariates were associated with latent PF scores. However, adjustment for DIF resulted in a reduction in the size of the parameter estimates for all covariates, except for overweight. The magnitude of the change in the coefficients was greatest for the age groups (absolute differences ranged from 0.09 to 0.14) and sex (absolute difference 0.09). For the MH sub-scale, adjustment for DIF changed the direction of the association between body weight and MH. In the no-DIF model, MH scores were significantly higher for obese respondents relative to normal/underweight respondents. However, in the DIF model, MH scores were lower for obese respondents. The largest change in regression coefficients between no-DIF and DIF models was found for fair/poor health (absolute difference 0.12).
As a sensitivity analysis, we fit separate models for each of the demographic and health status variables to test for DIF when the model was not simultaneously adjusted for all covariates. The ORs and magnitude of change in the coefficient estimates were similar to those reported in Tables 3 and 4.

Discussion
This study revealed that the majority of the SF-36 PF and MH sub-scale items showed evidence of DIF across one or more of the investigated demographic and health status variables after controlling for differences in the latent PF and MH of respondents in this population-based sample. Some of the DIF effects observed for the PF and MH sub-scale items were consistent with those found in previous studies [6,13,14]. Specifically, older people reported more limitations in vigorous and moderate activities, and fewer limitations in bathing or dressing, even after controlling for differences in their underlying PF [13,14]. Older respondents were also more likely to endorse feeling calm and peaceful than younger respondents after controlling for differences in their underlying MH [13,14]. Women tended to identify more problems in lifting or carrying groceries and climbing several flights of stairs, whereas they reported fewer problems in bathing or dressing than men [6,13,14].
This study revealed that body weight status was associated with DIF; this effect was observed for three items on the PF sub-scale and two items on the MH sub-scale. However, the direction of the effect for body weight status was not consistent across all items on the PF sub-scale. Overweight/obese people may not perceive themselves as being limited in activities that occur less frequently, like lifting or carrying groceries, but may be limited in activities that are more likely to occur on a daily basis, such as bending, kneeling or stooping. One prior study that evaluated DIF on the SF-36 for BMI showed significant non-uniform DIF in vigorous PF activities by weight category [11]. The inconsistent effect and limited research on the potential for DIF by body weight status suggest an opportunity for further research, including an exploration of potential non-uniform DIF.
Adjustment for DIF did not change the direction of the associations between the covariates and the PF latent variables, but the strength of the associations did change such that they were almost always smaller in size after adjusting for DIF. This finding is consistent with other studies showing that significant sex and age differences in physical ability were not altered after adjusting for DIF [10,13]. The largest change in regression coefficients between the no-DIF and DIF models was observed for age on the PF sub-scale items. However, adjustment for DIF changed the association between body weight and MH. There was a significant difference in the latent MH variable between the obese and underweight/normal weight groups before DIF adjustment, whereas the difference became non-significant after controlling for DIF. Large changes in regression coefficients between the no-DIF and DIF models for MH were observed for fair/poor health and obese groups.
The results suggest that the comparison of PF and MH scores across population sub-groups defined by demographic and health status variables may be biased if DIF is ignored. Group comparisons on PF may be most affected by age and sex, while group comparisons on MH may be most affected by body weight and general health status.
The advantages of the MIMIC framework over other methods for DIF detection, such as logistic regression or IRT models, are that it is based on well-established confirmatory factor analysis modeling processes and can be used to test for DIF on multiple observed variables simultaneously. It can also be used to test for group differences in predicted factor scores [13,37]. In addition, it allows for the analysis of DIF effects on a latent construct by comparing group differences in predicted factor scores between DIF-adjusted and unadjusted models [13,37,38]. The MIMIC model for ordinal indicators is equivalent to the popular IRT graded-response model [22,38]; it is now frequently used to test for DIF and is known to perform well under a wide variety of data-analytic conditions [13,23,28,29]. Previous studies have demonstrated that the MIMIC model is effective in identifying DIF items, and is less sensitive to potential contamination of anchor items than other DIF detection methods [21,23,29]. Simulation studies have demonstrated that the MIMIC framework is more sensitive in identifying DIF items and provides better control of Type I errors than conventional DIF test methods [21,31]. The chief disadvantage is that it cannot be used to test for non-uniform DIF (i.e., differential effects on item discrimination).
The study has other strengths. The DIF analysis was applied to a national population-based sample with a large sample size, which allowed for consideration of multiple demographic and health status variables. BMI was based on measured height and weight, and is therefore less susceptible to measurement error than self-reported height and weight. Age and sex are also confirmed by study staff during the data collection process, and are therefore likely to exhibit little, if any, measurement error.
Table 4. Regression coefficient estimates for the effects of demographic and health status variables on physical functioning and mental health latent variables in models without differential item functioning (no-DIF) and with differential item functioning (DIF) in the Canadian Multicentre Osteoporosis Study.
However, this study is not without limitations. The MIMIC approach assumes that items have similar discriminative performance across comparison groups. Non-uniform DIF models, which test interactions between covariates and latent variables on the item responses, were not investigated because previous simulation studies have reported that the MIMIC approach may result in inflated Type I error rates when interaction terms are added to the model [39]. Performance of the MIMIC model for non-uniform DIF needs to be further studied. We only tested for DIF for two SF-36 sub-scales; other sub-scales have been examined for DIF using other statistical methods (e.g., logistic regression) [11,14]. We did not include other sub-scales in the current study; when there are small numbers of items (i.e., fewer than five items per sub-scale), the MIMIC framework is not an appropriate choice [13]. The CaMos sample includes a higher proportion of older adults and women than the general population; thus, the findings may under-represent younger people (i.e., <50 years of age) and men. As well, for the one-factor model for the MH items, the RMSEA did not suggest a good fit to the data, although this finding is consistent with other studies [40,41]. Additional covariates could be considered in the model, although age and sex are two of the most common demographic variables considered in DIF analyses.

Conclusions
In summary, this study revealed the presence of DIF in population-based SF-36 data. The results indicate that PF and MH sub-scale scores may not be comparable across sub-groups defined by demographic and health status variables without accounting for DIF. DIF can affect the validity of epidemiologic studies that use self-report measures of quality of life. Removing items that exhibit DIF affects the content validity of a measure, which is problematic for well-established and well-known measures such as the SF-36. Removing items can also affect the comparability of sub-scale scores across different studies [6,42]. An alternative strategy is to replace the items with equivalent items that do not exhibit DIF; item banks may be a resource for identifying DIF-free items [43]. However, often the preferred approach is to examine items for DIF prior to conducting other analyses and adjust for any identified DIF effects prior to making comparisons between sub-groups [6,11,42]. Evaluating population-based self-reported outcome measures for DIF, particularly on key demographic and health-related variables, should therefore be a routine component of all comparative analyses.

Supporting Information
S1 Table. Percentages of respondents for the category "Limited a lot" on the PF sub-scale items by demographic and health status variables in the Canadian Multicentre Osteoporosis Study (n = 9062). (DOCX)
S2 Table. Percentages of respondents for the category "All/most/good bit of the time" on the MH sub-scale items by demographic and health status variables in the Canadian Multicentre Osteoporosis Study (n = 9115). (DOCX)
S3 Table. Factor loading and item threshold estimates of the differential item functioning (DIF) model for the SF-36 physical functioning and mental health sub-scale items in the Canadian Multicentre Osteoporosis Study. (DOCX)