Perceived fairness of claimants undergoing a work disability evaluation: Development and validation of the Basel Fairness Questionnaire

Background There are currently no tools for assessing claimants' perceived fairness in work disability evaluations. We describe the development and validation of a questionnaire for this purpose. Method In cooperation with subject-matter experts in Swiss insurance medicine, we developed the 30-item Basel Fairness Questionnaire (BFQ). Claimants answered the questionnaire anonymously, immediately after their disability evaluation and still unaware of its outcome. Each item had four response options, ranging from "strongly disagree" to "strongly agree". The construct validity of the BFQ was assessed with a principal component analysis (PCA). Results In 4% of the questionnaires, the claimants' perception of the disability evaluation was negative (below the median of the scale). The PCA of the item responses, followed by an orthogonal rotation, revealed four factors, namely (1) Interviewing Skills, (2) Rapport, (3) Transparency, and (4) Case Familiarity, together explaining 63.5% of the total variance. Discussion The ratings are presumably somewhat positively biased by sample selection and response bias. The PCA factors corresponded to dimensions that subject-matter experts had identified as relevant beforehand. However, all item ratings were highly intercorrelated, which suggests that the presumed underlying dimensions are not independent. Conclusion The BFQ is the first self-administered instrument for measuring claimants' perceived fairness of work disability evaluations, allowing the assessment of informational, procedural, and interactional justice from the perspective of claimants. In cooperation with Swiss assessment centres, we plan to implement a refined version of the BFQ as a feedback instrument in work disability evaluations.


In most European countries, individuals with social security coverage who consider themselves unable to work because of poor health can file a claim for disability benefits. In Switzerland, they have to file this claim with the disability insurance (DI), via DI offices ("IV-Stellen") located in each canton. The DI offices seek to establish the claimant's degree of work incapacity, the critical variable determining the amount of work disability benefits. To this end, DI offices commission mono-, bi-, or multidisciplinary medical evaluations, depending on the complexity of the claimant's medical history. Mono- and bidisciplinary evaluations are assigned directly to medical experts, whereas a random procedure allocates multidisciplinary medical evaluations to assessment centres, which need to be licenced by the Swiss Federal Social Insurance Office ("Bundesamt für Sozialversicherung"). The assessment centres forward the claims to contracted medical experts who, based on medical records, by interviewing the patients, and by examining their health status, assess the claimants' functional capacities with regard to demands at work [1].

The legal system expects the assigned medical experts to evaluate all medical and functional aspects neutrally, objectively, and equally [2]. Moreover, the legal system expects medical experts to arrive at comparable evaluations for similar cases. Unfortunately, in practice, interrater agreement is limited even for the very same patients [3], and the variability across experts, in particular for the assessment of mental disorders, can be surprisingly large [4]. The lack of comparability cannot easily be resolved, due to the complexity and uniqueness of the cases. This is particularly true for the evaluation of limitations in performing work activities and of participation restrictions due to mental disorders.
In 2018, 47% of Swiss work disability beneficiaries received their pension in consequence of mental disorders [5].

For claimants, the medical disability evaluation consists of at most a few encounters with medical specialists. Rather than the low agreement between medical experts in the disability evaluation, the claimant is more likely to experience a lack of agreement between the evaluation of the medical expert and the view of the attending general physician (or medical specialist), who might have encouraged the patient to file a claim for disability benefits. Moreover, the claimant's self-perception might diverge from the expert's assessment [6]. This discordance may lead to dissatisfaction of claimants when the expert considers the health problems less severe than the claimant does, when the disability is not approved or approved to a lesser degree than the claimant had expected, and when the disability pension / financial compensation is less than expected. In this case, claimants might consider the outcome of the evaluation and the financial compensation as unjust [7]. However, claimants might also experience the interaction with the expert as cold, disrespectful, or even demeaning; they might feel badly informed, or might find their point of view poorly considered in the evaluation. Currently, there are few feedback loops in the process of work disability evaluations that allow an assessment of these quality aspects, which refer to interactional, informational, and procedural justice [7]. Due to the lack of standardized and systematic assessment, it is at present unknown how often, to what extent, and in what regard claimants might perceive work disability evaluations as unfair. The systematic assessment of perceived fairness in work disability evaluations is hampered not just by the lacking implementation of feedback instruments, but also by the lack of instruments designed for that very purpose.
Perceived fairness should not only be considered an important quality characteristic of work disability evaluations; research also shows that perceived unfairness tends to worsen the mental health problems of already vulnerable patients/claimants: after accidents and related musculoskeletal injuries, perceived injustice modulated clinical symptoms, with higher degrees of perceived injustice associated with higher degrees of depression, more severe pain, and poorer mental health later on [8-11]. Therefore, it appears mandatory to implement quality control measures in disability evaluations that allow the assessment of perceived fairness. Such measures would provide feedback to the medical experts about how claimants perceived their interaction during the work disability evaluation, allowing the medical experts to monitor and to adjust their interviewing practice. Moreover, such measures allow documenting that the evaluations took place in compliance with certain quality standards. The aim of our study was to develop and validate a questionnaire on perceived fairness in work disability evaluations as an initial step towards implementing this kind of quality control in work disability evaluations.

Overview of the Study Procedure

Starting from an existing Dutch questionnaire, we compiled an item pool. Following its translation to German, six subject-matter experts in work disability evaluations (including two of the authors, WdB and RK) iteratively re-worded the items for better comprehensibility, adapted the items to the setting of Swiss insurance medicine, discarded irrelevant items, and added items related to aspects not covered by the Dutch questionnaire. All six subject-matter experts had at minimum 15 years of practical work experience in the field of insurance medicine and disability evaluations. A prefinal set of 29 items was tested for clarity, comprehensibility, and coverage of all relevant issues by applying them to more than 40 patients [16], using the thinking-aloud method [17]. The pre-testing led to the deletion of five items and the addition of six new items.

The final questionnaire, as used in this study, included 30 items with four response options per item, ranging from "strongly disagree" to "strongly agree", which were numerically coded from "1" to "4", respectively. Four items referring to the expert's knowledge of the claimant's medical records had a fifth response option ("Can't tell"). Ratings of the overall perceived fairness of the evaluation and the general satisfaction with the evaluation complemented the questionnaire. These two ratings were made on a 7-point Likert scale with two anchor points: "very unfair/dissatisfied" (= 1) and "very fair/satisfied" (= 7). Moreover, the claimant could provide verbatim comments on their satisfaction with the disability evaluation, which have previously been reported [18]. Four subject-matter experts (including one author, WdB) individually grouped the 30 items into two to five thematic clusters, discussed the clusters, and iteratively reorganised the structure until they reached consensus.
The final, conjoint clustering contained four clusters or dimensions, which the subject-matter experts deemed to reflect perceived fairness in disability evaluations. These dimensions were a) the interviewing skills of the expert, b) rapport (atmosphere of trust and respect), c) case familiarity (the expert's knowledge of the patient and his/her records), and d) transparency (provision of information by the expert).

In order to establish that our questionnaire is related to other instruments measuring similar constructs and shows convergent validity, we added two scales of the CPQ [12]: a) "trust in the expert" and b) "being bullied by the expert". The CPQ was developed to quantify patient satisfaction with all major aspects of service quality in hospital care and contains many scales of no relevance for disability evaluations. The CPQ items used for the current study were re-worded to adapt them to the context of disability evaluations. One CPQ item was identical to an item of the questionnaire ("the expert let me finish speaking"). CPQ items had four response options, ranging from "strongly disagree" to "strongly agree", numerically coded from "1" to "4". In order to establish that our questionnaire discriminates between the patients' satisfaction with their disability evaluation and their satisfaction with life in general (divergent validity), we included the SWLS, a validated instrument for assessing life satisfaction [13]. The SWLS contains five items with five response options per item, ranging from "strongly disagree" (= 1) to "strongly agree" (= 5). Finally, in order to check to what degree patients and medical experts agree in their judgment of the evaluation process, experts rated on 7-point Likert scales to what extent the patient presumably had perceived the evaluation as overall fair, from "very unfair" (= 1) to "very fair" (= 7).
Moreover, the medical experts rated, from their perspective, the quality of the interaction with the patient and how well the interview with the patient went, from "very poor" (= 1) to "very good" (= 7).

Participants

All participants were required to undergo a multidisciplinary medical evaluation of their work capacity, to have sufficient command of the German language for filling out the questionnaire, and to be between 18 and 65 years old. For claimants, participation in such medical evaluations is obligatory, whereas participation in our study was voluntary. For the study, we recruited claimants at four Swiss assessment centres from April 2015 to March 2017, as well as the assigned medical experts who evaluated the work capacity of the claimants. Case managers of these assessment centres (two centres in Basel, and one each in Lucerne and Binningen) checked the eligibility of the patients for the study and sent the study information and questionnaire to potential study participants, and the expert questionnaire to one of the assigned experts. Each patient assessed the perceived fairness of the evaluation at the assessment centre. Patients had to complete the questionnaire promptly after their evaluation, while still unaware of its outcome, and returned the questionnaire in a sealed envelope to the study centre (EbIM). The questionnaire also asked for some basic demographic data (age, sex, marital status, profession), as well as for the medical discipline in which the work disability evaluation took place. The research team had no access to clinical records of participants. The questionnaires of experts and patients had a joint number code for linking the cases. Otherwise, the responses were anonymous. The expert questionnaire did not ask for any information about the expert in order to ensure the anonymization of the expert data.
The final patient sample consisted of 305 patients (187 female, 115 male, 3 undisclosed) with a mean age of 47.4 years (SD 10.9 years). For 293 of these patients, the assigned experts rated the evaluation process and returned the questionnaires immediately after the evaluation. Cases with expert ratings but no claimant ratings were not considered in the data analysis. The medical experts were informed about the research project and were aware that the claimants would rate their performance, but they did not know the individual BFQ items.

Statistics

For the current study purpose, we analysed the ratings across participants (i.e. we did not differentiate between the disciplines of the experts who performed the evaluations). Item difficulty and item discrimination were extracted as item characteristics: item difficulty is defined as the ratio between the mean score of all participants and the maximally achievable score [19]. Thus, the higher the value, the more respondents agreed with the item. Item discrimination is defined as the correlation between the item's scores and the total test scores. The higher the value, the higher is the degree to which an item and the whole test measure the same content.
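These two item characteristics can be computed directly from the response matrix. A minimal sketch in Python (NumPy), using a small hypothetical response matrix rather than the study data:

```python
import numpy as np

def item_difficulty(responses, max_score=4):
    """Mean item score divided by the maximally achievable score;
    higher values indicate stronger agreement with the item."""
    return responses.mean(axis=0) / max_score

def item_discrimination(responses):
    """Correlation between each item's scores and the total test scores;
    higher values indicate that item and test measure the same content."""
    totals = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, i], totals)[0, 1]
                     for i in range(responses.shape[1])])

# Hypothetical data: 6 claimants x 3 items, coded 1 ("strongly disagree")
# to 4 ("strongly agree"), as in the BFQ
r = np.array([[4, 4, 3],
              [3, 4, 2],
              [4, 3, 4],
              [2, 2, 1],
              [4, 4, 4],
              [3, 3, 2]])
print(item_difficulty(r))       # difficulties of the three items
print(item_discrimination(r))   # item-total correlations
```

Note that some texts use the corrected item-total correlation (each item correlated with the total of the remaining items); the sketch follows the definition given above.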

Of 538 questionnaires sent to claimants in disability evaluations, 305 questionnaires (56.7%) were returned. The evaluations took place in various disciplines (psychiatry: n = 78; neurology: n = 28; rheumatology: n = 32; internal medicine: n = 40; other disciplines: n = 41; missing or ambiguous responses: n = 86). For the current study purpose, we did not differentiate between disciplines.

Across all participants, the mean scoring of the 30 items was 3.41 (SD 0.51), meaning that the feedback provided by the questionnaire was on average quite positive. The distribution was strongly left-skewed (Fig 1). Only 12 patients provided mean ratings below the median of the response scale (i.e. < 2.5). The on average highly positive feedback was also reflected in the high item difficulty (= high agreement with the item's statement, Table 1), as well as in the high ratings for the perceived fairness of the evaluation and for the general satisfaction with the evaluation, as obtained by the two separate ratings on a 7-point Likert scale. The latter ratings were 5.74 (SD 1.24) and 5.66 (SD 1.29), respectively, with a mode of "6" for both ratings. Thus, participants did not choose the most positive category ("7") when the rating scales were more finely graded. However, also for these ratings, only 16 and 15 participants, respectively, provided ratings below the median of the response scale for perceived fairness and general satisfaction (i.e. ratings < 4.0).

Missing responses were scarce for most items, except for the four items with the "can't tell" response option (items 2, 14, 24, and 27; Table 1).
The mean ratings of participants who answered all items with four response options and of participants who produced any missing data in the latter items differed systematically (complete questionnaire group: n = 221, mean rating = 3.52, SD 0.46; missing data group: n = 84, mean rating = 3.19, SD 0.56; F(1, 304) = 27.001, p < 0.001, d = 0.668; the items with the "can't tell" response option were excluded when calculating these mean ratings). In other words, claimants who answered all items were more positive about the evaluation than claimants who skipped items.
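The reported effect size can be reproduced from the summary statistics alone. A sketch of Cohen's d with pooled standard deviation (small deviations from the reported d = 0.668 stem from rounding of the reported means and SDs):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Summary statistics as reported in the text
d = cohens_d(3.52, 0.46, 221, 3.19, 0.56, 84)
print(round(d, 3))  # close to the reported d = 0.668
```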

Principal component analysis
Cronbach's alpha was > 0.9 (α = 0.969 for 30 items, n = 144; α = 0.961 for 26 items, n = 239). The Kaiser-Meyer-Olkin statistic was 0.957 and Bartlett's test was highly significant (p < 0.001), indicating that a PCA was appropriate for the data. Four factors of the unrotated PCA had eigenvalues above 1, together explaining 63.5% of the total variance (Table 1). The varimax factor rotation resulted in a so-called simple structure (high loadings on one factor and low loadings on the others), with all but two items showing loadings > 0.5 on one of the four factors (Table 2). Taking the content of the items into consideration, the rotated factors widely corresponded to the dimensions identified beforehand by the subject-matter experts: factor 1 corresponded to Rapport, factor 2 to Interviewing Skills, factor 3 to Transparency, and factor 4 to Case Familiarity (Table 2).

In an exploratory analysis, the PCA was restricted to the items with high loadings on factors 1 and 2, as the high loadings on one factor were mostly accompanied by moderate loadings on the other factor. For this exploratory PCA, the number of extracted factors was restricted to two. Visual inspection of the factor loadings showed that the two underlying factors were likely non-orthogonal (the factor axes did not run through chunks of items). When such a PCA was followed by an oblique (oblimin) rotation, the two extracted factors were considerably correlated (r = 0.775). Factor 1 again corresponded to Rapport, whereas factor 2 had particularly high loadings on items 21, 22, 25, and 28 (S1 Table). Thus, factor 2 might reflect the claimant's impression of being part of the conversation and not just a (valued) respondent to questions. Therefore, this factor might be better understood as a factor reflecting the feeling of participation (as created by the expert's interviewing skills).
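Two of the statistics reported here, Cronbach's alpha and the eigenvalue-based factor retention, can be sketched in a few lines of NumPy. The simulated data below are purely illustrative, not the study data:

```python
import numpy as np

def cronbach_alpha(responses):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def pca_eigenvalues(responses):
    """Eigenvalues of the item correlation matrix, in descending order;
    components with eigenvalues > 1 are retained (Kaiser criterion)."""
    corr = np.corrcoef(responses, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Simulate 300 respondents answering 8 items that share one latent dimension
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
items = latent + 0.8 * rng.normal(size=(300, 8))

print(cronbach_alpha(items))      # high alpha for internally consistent items
print(pca_eigenvalues(items)[:2]) # one dominant eigenvalue > 1
```

For rotations (varimax, oblimin) and sampling-adequacy tests (KMO, Bartlett), dedicated packages such as `factor_analyzer` provide ready-made implementations.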
Aside from items 12 and 16 (which did not load > 0.5 on any factor), we identified some items that could be removed from the questionnaire without noteworthy loss of information: some item pairs showed high correlations (> 0.7), namely items 4 and 11, 6 and 9, 9 and 10, 13 and 15, as well as 14 and 17. Taking the semantic similarity of the items, as well as the item characteristics, into account, we identified items 9, 11, and 15 as removable. These three items were excluded when calculating the mean ratings of each factor, as described in the following.

Rating profile
The mean rating of each factor was obtained by averaging the scores of the five items with the highest factor loadings on that factor. These were items 1, 3, 4, 6, and 13 for factor 1 (Rapport), items 21, 22, 23, 25, and 28 for factor 2 (Interviewing Skills), items 18, 19, 20, 26, and 29 for factor 3 (Transparency), and items 2, 8, 14, 17, and 27 for factor 4 (Case Familiarity; Tables 1 and 2). The mean factor ratings (not to be confused with the factor loadings) were compared between the four factors by a repeated-measures ANOVA. This ANOVA revealed that the mean ratings varied between the factors (F(3, 906) = 155.926, p < 0.001, ηp² = 0.341). The mean ratings were higher for factors 1 and 2 than for factors 3 and 4, with no difference between the first two factors and between the latter two factors (factor 1: 3.59, SD 0.52; factor 2: 3.58, SD 0.51; factor 3: 3.12, SD 0.71; factor 4: 3.16).

The mean fairness questionnaire score was associated with the CPQ Trust score (Fig 2), as well as with the Being bullied score (r = 0.556, p < 0.001, Fig 3). The factor scores for Rapport and Interviewing Skills were considerably associated with the two CPQ scales, whereas the factor scores for Transparency and Case Familiarity showed only a weak or no association (Table 3). The mean questionnaire score was also associated with the SWLS score (p < 0.001, Fig 4). Claimants with low life satisfaction tended to perceive the evaluation as less fair. The individual factor scores and the SWLS scores showed only weak correlations (Table 3).

Association with other questionnaire ratings and expert rating

The claimants separately rated the perceived fairness of the disability evaluation on a 7-point Likert scale (5.74, SD 1.24). These ratings showed a strong association with the mean questionnaire score (r = 0.767, p < 0.001) and small to medium associations with the factor scores, between 0.256 and 0.474 (Table 3).
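The factor ratings described above are simple means over each factor's five highest-loading items. A sketch of this computation (item groupings as listed in the text; the response matrix is hypothetical):

```python
import numpy as np

# 1-based item numbers of the five highest-loading items per factor (see text)
FACTOR_ITEMS = {
    "Rapport": [1, 3, 4, 6, 13],
    "Interviewing Skills": [21, 22, 23, 25, 28],
    "Transparency": [18, 19, 20, 26, 29],
    "Case Familiarity": [2, 8, 14, 17, 27],
}

def factor_ratings(responses):
    """Mean rating per factor and claimant; `responses` is a
    claimants x 30 matrix coded 1-4, with NaN marking missing answers."""
    return {name: np.nanmean(responses[:, [i - 1 for i in items]], axis=1)
            for name, items in FACTOR_ITEMS.items()}

# Hypothetical data: 4 claimants x 30 items
rng = np.random.default_rng(1)
resp = rng.integers(1, 5, size=(4, 30)).astype(float)
profile = factor_ratings(resp)
for name, scores in profile.items():
    print(name, scores.round(2))
```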
The corresponding expert ratings ("Do you think the claimant considered the disability evaluation as fair?") were on average quite high (6.09, SD 0.78) and significantly more positive than the claimant ratings (F(1, 270) = 18.437, p < 0.001, ηp² = 0.064). The expert ratings showed only a weak association with the mean questionnaire score (r = 0.186, Table 3).

Discussion

In the current study, we aimed to develop and validate a questionnaire that measures to what degree claimants for disability benefits perceive their evaluation by the medical expert as fair. In the following, we discuss the claimants' mean BFQ ratings of the disability evaluation, the validity of the BFQ, the association between the claimant and expert ratings, and the practical implications of the study for quality control in work disability evaluations.

Mean ratings
The study showed that the claimants' satisfaction with the disability evaluation was on average high, in particular their satisfaction with the atmosphere of trust and respect and with their participation in the interview. Only about 4% of the claimants rated the disability evaluation below the median of the scale, and about 2% reported feelings of being bullied by the expert. These data provide quite positive feedback about the perceived fairness of disability evaluations in Switzerland. Given the lack of instruments, there are currently no defined thresholds for patient satisfaction in disability evaluations. However, experts, assessment centres, and insurance providers would likely consider 2 to 4% of unsatisfied claimants a best-case scenario, because claimants often have mental health issues (in the form of personality disorders, depression, or substance abuse [21]), which might occasionally affect their ratings negatively. Indeed, considering the SWLS data, we found that participants with low life satisfaction were prone to ratings below the median of the scale, which were virtually absent for participants with good life satisfaction (Fig 4). This might suggest that factors like the claimants' life situation, life satisfaction, or personality could have some negative impact on their satisfaction with the disability evaluation. However, the experts need to cope with the claimant's poor life satisfaction, might face increased difficulty in creating and maintaining an atmosphere of mutual trust and respect with these more vulnerable patients, and might ultimately sometimes fail to provide it. Finally, correlation does not imply causation, and it is possible that claimants who experienced the disability evaluation as very negative tended to rate their life satisfaction more negatively.
The very positive ratings across claimants should be interpreted with considerable caution due to potential selection and response bias: First, the four assessment centres that participated in the study represent only a small and regionally restricted sample of Swiss assessment centres. Given this, the four assessment centres might not be representative of (German-speaking) Switzerland.

Second, the ethics committee approval covered only the collection of the questionnaire data, which means that no demographic or medical data of participants and non-participants were available, except the demographic data of participants obtained by the questionnaire. This limits the characterization of the study sample, but also prohibits the characterization of non-participants (eligible claimants who were either not invited to participate or who were invited but did not return the questionnaire). Given this, we cannot rule out that the study sample differed from non-responders and was, thus, not representative of claimants with good German command in general. Future studies with the aim of providing quality benchmarks for perceived fairness across Switzerland (or other countries) will require collecting demographic or medical data to ensure representative samples.

Third, the exclusion of claimants with insufficient command of German, albeit necessitated by the design of the study, may have introduced additional selection bias. Compared to disability evaluations with native speakers, evaluations requiring an interpreter are more difficult to conduct. Information loss is likely to occur when communication switches from a direct to an indirect format [22]. This information loss might have adverse effects on perceived fairness as well.
Such quality deficiencies might be discovered with translated versions of the BFQ, which we plan to use in future studies in order to test whether perceived fairness varies between work disability evaluations with and without interpreters.

Selection bias might also have occurred because study participation was voluntary. While the study achieved a favourable return rate, more than 40% of claimants still did not return their questionnaire. Nguyen and co-workers [23] have argued that satisfied patients are more likely to return questionnaires than dissatisfied patients. Comparing claimants who completed their questionnaires with those who skipped some responses might suggest that claimants who skipped all questions (i.e., did not return the questionnaire) might have been more negative in their ratings. Even if this assumption were true and non-responders behaved like participants who skipped items, this would have only minimally decreased the mean rating of the fairness questionnaire, from 3.41 to 3.33.
The high levels of satisfaction might partly reflect a positive skew (the preference of responses towards the favourable end [24]). Such response bias may in part be related to the asymmetric dyadic interaction in work disability evaluations, with one party evaluating and the other being evaluated. With these asymmetric social roles in mind, claimants might find it difficult or even risky to provide negative feedback to medical experts, even though anonymity is granted. It is a ubiquitous finding that service recipients report high levels of satisfaction [23]. Nguyen and co-workers therefore argued that the level of satisfaction in absolute terms is often meaningless and that satisfaction in relative terms should be preferred instead (i.e. the comparison of levels of satisfaction across institutions or across time, when assessed with the same instrument).

Finally, the medical experts were aware that they were being rated. Although they had no detailed knowledge of the BFQ items, they might have deduced from the expert questionnaire that the quality of the interaction was of importance. Thus, it is possible that medical experts adapted their interviewing behaviour. Such behavioural changes have been conceptualized as the Hawthorne or observer effect [25]. For ethical reasons, we deliberately discarded the option of keeping the medical experts uninformed about the study and accepted the risk of possible observer effects when designing the study.

To sum up, the high mean ratings may overestimate the perceived fairness of disability evaluations due to selection bias, response bias, and observer effects. We recommend that in future nationwide surveys, commissioning insurers should distribute the questionnaires and collect the socio-demographic data of responders and non-responders to assure representative samples.

Validity of the BFQ

Only two items failed to show a factor loading > 0.5 on any of the four factors.
These were items 12 and 16 (Table 1; "The expert asked me how I feel." and "I could ask questions."). The phrasing of both items might be too unspecific to provide valid feedback. Both items will therefore be removed from future versions of the questionnaire.

Although the PCA confirms that most items are associated with the dimensions they were designed for, it would be incorrect to consider these dimensions as independent of each other, even though the orthogonal rotation is based on the assumption that the underlying factors (as mathematical reflections of these dimensions) are uncorrelated. First, all items show considerable intercorrelations, and all items show an item discrimination > 0.6 (Table 1). In addition, the mean rating (across 30 items) and the global rating of perceived fairness were considerably correlated. This implies that the questionnaire measures one quality. This quality ("perceived fairness") can relatively well be quantified by a single rating, as provided on a 7-point Likert scale ("How fair would you rate the evaluation that has just taken place?"). However, we would argue that it is important to ask for details regarding the perceived fairness, because otherwise it would not be possible to infer from a negative feedback ("the disability evaluation was very unfair") to what aspect this negative feedback refers. This, in turn, would not allow implementing measures against the quality defect. In consequence, the overall BFQ score as well as the four subscores should be reported when reporting information on perceived fairness in disability evaluations.

The perception of the disability evaluation as a whole might affect the evaluation of single dimensions and vice versa ('halo effect' [26]). Let us assume a claimant leaves the evaluation with a bad feeling, without being able to verbalize what made him feel that way.
Being questioned about the various dimensions of the evaluation (like rapport, participation, transparency, etc.) might then result in negative ratings due to a spillover from his overall negative impression (rather than due to an analytic assessment of the individual dimensions). Conversely, a single annoying aspect (e.g., the claimant felt poorly informed by the expert) might affect the perceived fairness in general and worsen the ratings of other aspects. Individual rating profiles (the assessment in each dimension) need to be interpreted with caution because of such spillover effects.

Nevertheless, rating profiles might become a valuable source of information when averaged across multiple evaluations and compared between experts and assessment centres. Repeated ratings of an expert's interaction by different claimants will minimize the error variance (= the variance of unsystematic effects). For example, an expert's mean rating for Transparency might be considerably worse than his ratings for the other factors and considerably worse than the mean Transparency ratings of other experts. This would indicate that the expert repeatedly lacked transparency, from the perspective of several claimants. In the context of quality improvement, the expert would be encouraged to re-think his way of providing information to the claimant.

Convergent and divergent validity: The BFQ mean rating was strongly associated with the scores of the two CPQ scales. This indicates that the two instruments partly measure the same underlying dimension and provides evidence for the convergent validity of the BFQ. The two scales of the CPQ do not measure independent dimensions; their scores showed a considerable correlation (r = 0.543, p < 0.001). Patients who felt bullied by the expert did not rate the atmosphere of trust and respect positively.
High correlations between the BFQ and the CPQ were found for the BFQ dimensions Rapport and Interviewing Skills, but much less so for Transparency and Case Familiarity (Table 3). Thus, the latter two dimensions are not covered by the CPQ, as one might also infer from the semantic content of the CPQ items. The BFQ ratings showed only a weak association with life satisfaction, even though claimants with poor life satisfaction had some tendency towards poorer fairness ratings. Many claimants with poor life satisfaction (< 3) gave rather positive fairness ratings (> 3, Fig 4). The weak association between satisfaction with life and the fairness ratings is also underlined by the different distributions of the data (with fairness ratings being left-skewed and life satisfaction being right-skewed). This finding provides evidence for the divergent validity of the BFQ.

Expert ratings
The experts poorly predicted how fair the claimants perceived the disability evaluation (S1 Fig). However, the expert ratings showed only little variance. The experts hardly ever scored < 5, which might be due to social desirability (i.e. they did not want to present themselves as unfair). This small response variance limited the chances of finding significant correlations between the patient and expert ratings. Moreover, the patient questionnaire drew the patients' attention to several aspects of the interview (by asking them to evaluate these aspects), whereas the expert questionnaire lacked such detail, as it contained only one question referring to perceived fairness. Thus, the weak association between patient and expert ratings might reflect the poor overlap in questionnaire structure and question type. For future studies, the expert questionnaire will be revised, putting a focus on self-assessment in the four BFQ factor dimensions.

Conclusion
We developed the first questionnaire that measures to what degree patients perceive their disability evaluation as fair. The BFQ has good internal consistency, construct validity, and convergent and divergent validity. The questionnaire has great potential for quality evaluation and improvement purposes in disability evaluations. However, selection bias, response bias, and observer effects may have influenced the currently observed ratings and thereby painted an overly favourable picture of the quality of disability evaluations in Switzerland. By using strategies that counter selection bias, mitigate response bias, and tackle observer effects, future studies should be able to describe the quality of disability evaluations and to discover potential systematic quality deficiencies more accurately.

Funding

The study was funded by the non-commercial, public sector Swiss National Accident Insurance