Validity and measurement invariance across sex, age, and education level of the French short versions of the European Health Literacy Survey Questionnaire

Background Short versions of the European Health Literacy Survey (HLS-EU) questionnaire are increasingly used to measure and compare health literacy (HL) in populations worldwide. As no validated versions of these questionnaires have thus far appeared in French, this study aimed to assess the psychometric properties of the French translation of the 16- and 6-item short versions (HLS-EU-Q16 and HLS-EU-Q6), including their measurement invariance across sex, age, and education level. Methods A consensual French version of the HLS-EU-Q16 and HLS-EU-Q6 was developed by following the current recommendations for transcultural questionnaire adaptation. It was then completed by 317 patients recruited in waiting rooms of general practitioners in the Paris area (France). Structural validity was studied with the Rasch model for the HLS-EU-Q16 and confirmatory factorial analysis (CFA) for the HLS-EU-Q6. Concurrent and convergent validity, respectively, were assessed by scores on the Functional Communicative Critical Health Literacy (FCCHL) questionnaire and the physicians’ evaluations of their patients’ HL. Results The 16 items of the HLS-EU-Q16 were Rasch homogenous but meaningful differential item functioning (DIF) was found across sex, age, and/or education level for eight items. The CFA model fit for the HLS-EU-Q6 was poor. The overall scores for both HLS-EU short versions correlated poorly with the FCCHL scores. Similarly, HL levels defined using either short-version score did not agree with physicians’ HL assessments. Conclusion The French version of the HLS-EU-Q16 has acceptable psychometric properties, despite meaningful DIF for age, sex and education level and a poor discriminative power among subjects with average to high HL level. We recommend its use to measure HL in populations with sufficient reading skills to discriminate between subjects with low to average HL. Also, sensitivity analyses should be performed to evaluate the potential measurement bias due to DIF.
Our results did not demonstrate the validity of the HLS-EU-Q6.


Introduction
Health literacy (HL) is defined as "the cognitive and social skills which determine the motivation and ability of individuals to gain access to, understand and use information in ways which promote and maintain good health" [1]. Three dimensions are distinguished: functional literacy, which involves basic skills (reading, writing, etc.) to access health information; interactive literacy, which refers to more advanced cognitive skills to understand this information; and critical literacy, which involves in-depth cognitive and social skills that ultimately lead to better control of life events [2]. Low HL has been shown to be associated with poor health, limited survival, and a higher cost of care [3][4][5][6][7]. Furthermore, the World Health Organization emphasizes the central role of HL in addressing health inequalities worldwide [1]. Nevertheless, HL has been studied only sparsely in France, likely due to the lack of adequate measurement instruments validated in French [8][9][10].
Some screening tools for low HL, such as the Rapid Estimate of Adult Literacy in Medicine (REALM) or the Test of Functional Health Literacy in Adults (TOFHLA), have been translated into French, but they assess only functional literacy, through timed tests evaluating the recognition of medical terms or the understanding of medical texts [11,12]. More recently, however, French adaptations of broader tools have been developed. For instance, the Functional Communicative Critical Health Literacy (FCCHL) scale, based on Nutbeam's definition, has recently been validated in French, but its use in epidemiological studies is limited because the implication of disease diagnosis in its wording ("If you are diagnosed. . .") reduces its relevance in healthy populations [13]. A transcultural adaptation in French of the Health Literacy Questionnaire (HLQ), measuring nine dimensions related to individual traits and abilities as well as contextual and health system resources, has recently been published [14,15]. Another interesting tool is the European Health Literacy Survey Questionnaire (HLS-EU-Q), which is built on a conceptual model of HL developed by a European consortium (not including France) based on a review of 170 publications [16]. This model integrates four health information processing skills (accessing, understanding, appraising, and applying health information) applied in three health contexts (healthcare, disease prevention, and health promotion). These skills go well beyond functional literacy, which focuses mainly on understanding health information; they also consider its communicative (e.g., accessing and discussing this information) and critical dimensions (appraising and applying it) [17,18]. A Delphi method was used to generate and select 47 items covering the 12 domains (three health contexts × four skills) [17].
One of the main obstacles to the use of these questionnaires in epidemiological studies, however, is their length. The addition of more than 40 questions to measure HL is rarely possible in studies that already involve several other questionnaires. Although no short version of the HLQ currently exists, to our knowledge, short versions of the HLS-EU-Q, containing 16 (HLS-EU-Q16) and 6 (HLS-EU-Q6) items [18], have been developed. The 16 items of the HLS-EU-Q16 were selected from the 47 HLS-EU-Q items based on their psychometric properties, evaluated by Rasch analysis, while ensuring good face and content validity through representation of the 12 HLS-EU domains [19] (S1 Table). This questionnaire provides an overall score of HL that has been shown to be highly correlated (r = 0.82) with the overall score on the full 47-item version of the HLS-EU-Q. The 6 items of the HLS-EU-Q6 were selected from the HLS-EU-Q16 using confirmatory factorial analysis (CFA) to establish its factorial structure, and correlation coefficients with the scores on the longer versions to determine its convergent validity [20]. The HLS-EU-Q6 has been used in a limited number of clinical studies in Europe [21,22], while the HLS-EU-Q16 is increasingly used for population studies in numerous European countries. It has been translated into Dutch [7][23][24][25], Swedish [26,27], German [28][29][30], Norwegian [31], Spanish and Catalan [32], Italian [33], Greek [34], Czech [35], Hebrew [36], and Arabic [37]. A French version of this questionnaire has been used in two studies in Belgium [7,38], but the published information regarding its psychometric properties is limited.
French is the fourth most widely spoken first language in the European Union, after German, Italian, and English. It is an official language of four European countries (France, Belgium, Switzerland, and Luxembourg) and of 25 independent nations outside Europe. As the short versions of the HLS-EU-Q are increasingly used to measure and compare HL in populations within Europe and worldwide, it is imperative to ascertain the validity of the French version of these questionnaires. As for any measurement device, measurement invariance is a required property to guarantee accurate group comparisons and is thus essential for questionnaire validation. According to Mokkink et al. and Millsap, "[a] measuring device should function in the same way across varied conditions, so long as those varied conditions are irrelevant to the attribute being measured" [39,40]. The objective of this study was thus to translate the HLS-EU-Q16 and HLS-EU-Q6 into French and to evaluate their psychometric properties, including measurement invariance across sex, age, and education level.

Translation
In accordance with the steps described in the current recommendations on transcultural adaptation of questionnaires [41,42], six experts from various disciplines (epidemiology, biostatistics, psychometrics, general medicine, public health, and psychiatry), including one bilingual English-French expert and five French experts with very high levels of English language proficiency, independently translated the English version of the HLS-EU-Q16 into French [43]. A consensus meeting was then held to arrive at a consensual French version of the questionnaire, based on the six independent translations and on the French version of the questionnaire previously used in Belgium (S1 Text). No back-translation was performed, as this has recently been shown to be unnecessary [44]. Ten subjects (4 males, mean age = 30 years) tested this version (completion time: 5 to 12 min). No formal cognitive debriefing interviews were performed; instead, short individual discussions were held to assess the acceptability and comprehensibility of each item. No modification of the translated HLS-EU-Q16 version was needed after this pilot test. The readability level of the translation, assessed using the Flesch Readability Score adapted to texts written in French, was 48, which corresponds to an undergraduate (bordering end of high-school) level [45].

Psychometric properties
Sample. Subjects were recruited from May 15 to June 30, 2016, in the waiting rooms of 17 general practitioners involved in the general practice network of Université Paris-Sud (France). General practitioners were selected to ensure representation of the various social backgrounds that exist in the Paris area, but not statistical representativeness for either Paris or France as a whole. Explanations about the study were provided to all French-speaking patients aged 18 years or older arriving in the waiting room. They were then asked to complete the "patient questionnaire". At the end of the day, the physician completed a "physician questionnaire" for each patient who had participated in the study that day. All patients provided signed informed consent to participate. The institutional ethics committee (Comité d'Ethique du Collège National des Généralistes Enseignants, IRB no. IRB00010804) approved the study, that is, determined that it met the requirements of legal codes that govern health research in France.
Data collection. Patients provided socio-demographic information including sex, age, and education level, as well as perceived health status ("Would you say that overall, your health is: excellent/very good/good/medium/poor?") and perceived financial situation ("Currently, with regard to your household financial situation, would you say that: you are very comfortable/relatively comfortable/just about managing/not really managing <often struggle to make ends meet> or not managing <often have to do without essentials or go deeper into debt>"). Patients completed the French version of the HLS-EU-Q16 by indicating their response to each item on a 4-point Likert-like scale ("very easy", "easy", "difficult", "very difficult"). To study the concurrent validity, we measured functional, communicative, and critical HL with the French version of the FCCHL. In addition, the physician answered one question about each patient: "In your opinion, this patient's level of HL is: inadequate/medium/satisfactory?". Apart from the World Health Organization's definition of HL [1], no specific criteria were provided to practitioners to answer this question.
Statistical analyses. Answers for each of the 16 items of the HLS-EU-Q16 were re-scored in reverse so that higher scores reflected higher levels of HL. Ceiling and floor effects were identified for each item; they were defined a priori as more than 95% of respondents selecting the highest or the lowest category, respectively.
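The re-scoring and the floor/ceiling rule described above can be sketched as follows. This is only an illustration: the raw coding (1 = "very easy" to 4 = "very difficult") and the function names are assumptions, not taken from the study materials.

```python
from collections import Counter

# Assumed raw coding (not stated in the source):
# 1 = "very easy", 2 = "easy", 3 = "difficult", 4 = "very difficult".

def reverse_score(raw, n_categories=4):
    """Reverse a 1..n_categories response so that higher means higher HL."""
    return n_categories + 1 - raw

def floor_ceiling(responses, lowest=1, highest=4, threshold=0.95):
    """Return (floor, ceiling) flags for one item's re-scored responses.

    An effect is flagged when more than `threshold` of respondents
    chose the lowest (floor) or the highest (ceiling) category.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    n = len(responses)
    return counts[lowest] / n > threshold, counts[highest] / n > threshold

# Hypothetical item: 97 of 100 respondents answered "very easy".
raw = [1] * 97 + [3] * 3
print(floor_ceiling([reverse_score(r) for r in raw]))
```

For dichotomized items, the same function can be called with `lowest=0, highest=1`.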
The structural validity of the 16- and 6-item versions of the HLS-EU-Q was evaluated with the same statistical strategy as in the initial validation studies [18,20]. Specifically, a Rasch analysis was used for the HLS-EU-Q16, with dichotomized items (the "very easy" category was merged with the "easy" category, and the "difficult" category with the "very difficult" category). A Mokken monotone homogeneity model was fitted to verify the three fundamental assumptions (unidimensionality, local independence, monotonicity) on which the Rasch model relies. Its fit was considered acceptable if the Loevinger H coefficients were >0.3 for the overall H coefficient of scalability and for the H_j coefficients associated with each item j (j = 1, ..., 16), and were >0 for the H_jk coefficients associated with each pair of items j and k [46]. The global fit of the Rasch model was evaluated with a Chi² test, and individual item fit with standardized residuals (expected to lie within ±2.5) and Chi² tests. The dimensional structure of the HLS-EU-Q6 was studied by using CFA on the reversed 4-point Likert-like items, with the weighted least squares mean- and variance-adjusted (WLSMV) estimator, a robust estimator for categorical data. Two models were fitted: a one-factor model and a second-order model with three factors corresponding to the health contexts and a higher-order factor for global HL. Fit indices used were the comparative fit index and the Tucker-Lewis index (CFI and TLI; good fit if >0.95, poor fit if <0.90, acceptable fit otherwise) and the root mean square error of approximation (RMSEA; good fit if <0.06, poor fit if >0.1, acceptable fit otherwise) [47].
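For readers less familiar with the Rasch model used for the dichotomized HLS-EU-Q16 items, its standard form is recalled here. The notation is introduced for illustration only and is not taken from the study: X_ij is the dichotomized response of subject i to item j (1 for "easy" or "very easy"), θ_i is the HL level of subject i, and β_j is the difficulty of item j, both expressed in logits.

```latex
P(X_{ij} = 1 \mid \theta_i, \beta_j) = \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)}
```

Under this model, a subject whose HL level equals the difficulty of an item has a probability of 0.5 of rating that item as easy, and the simple sum score over items is a sufficient statistic for θ_i.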
The Rasch analyses allowed us to assess measurement invariance, which holds if two subjects who are identical on the measured construct but belong to different groups (males and females, for example) have the same probability of giving any particular answer to any item of the scale [40,48]. If measurement invariance does not hold, one or several items of the scale "function" differently in the groups to be compared (which, in the Rasch model, results in a different item parameter, termed difficulty, in each group), and group comparisons of the total scale score may be inaccurate; this phenomenon is termed differential item functioning (DIF) [48,49]. Two kinds of DIF can be distinguished: DIF is uniform if the relation between the group and the response to the item is identical at every level of the latent trait (HL), and non-uniform otherwise [48]. Both kinds of DIF were investigated in the HLS-EU-Q16 across sex, age (categorized based on tertiles), and education level (primary or none, secondary, post-secondary) [50]. When statistically significant DIF was observed, the item was split into pseudo-items to estimate its difficulty in each group. DIF was considered meaningful if the difference in item difficulties across groups was greater than 0.25 logit or if more than 25% of the items of the scale were affected by DIF in the same direction. When DIF affected several items in opposite directions, the expected difference in the scale score across groups due to DIF was evaluated [51]. Internal consistency was assessed with the Cronbach alpha coefficient (acceptable if higher than 0.7) [52].
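The rule used to decide whether DIF is meaningful can be written as a small check. This is a sketch under stated assumptions, not the procedure run in RUMM2030: the function name, the input format (one difficulty difference per item with statistically significant DIF), and the example values are hypothetical.

```python
def meaningful_dif(diffs, n_items, logit_threshold=0.25, share_threshold=0.25):
    """Apply the two criteria for meaningful DIF.

    diffs: dict mapping item number -> difference in item difficulty
           between two groups (in logits), for items with significant DIF.
    n_items: total number of items in the scale.
    Returns (items whose absolute difference exceeds logit_threshold,
             True if more than share_threshold of all items show DIF
             in the same direction).
    """
    large = sorted(item for item, d in diffs.items() if abs(d) > logit_threshold)
    n_positive = sum(1 for d in diffs.values() if d > 0)
    n_negative = sum(1 for d in diffs.values() if d < 0)
    same_direction = max(n_positive, n_negative) / n_items > share_threshold
    return large, same_direction

# Hypothetical differences for three items of a 16-item scale.
example = {1: 0.40, 6: 0.10, 11: -0.30}
print(meaningful_dif(example, n_items=16))
```

In this hypothetical case, items 1 and 11 exceed the 0.25 logit criterion, while the share of items with DIF in the same direction (2 of 16) stays below 25%.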
To assess concurrent validity, the overall HLS-EU-Q16 score was computed as the simple sum score of the 16 binary items, while the overall HLS-EU-Q6 score was computed by averaging the responses to the six items on the reversed four-point Likert scale, both as recommended for other language versions [20]. The three levels of HL were the same as those in the other language versions: inadequate (HLS-EU-Q16 score ≤8, HLS-EU-Q6 score ≤2), problematic (HLS-EU-Q16 score >8 and ≤12, HLS-EU-Q6 score >2 and ≤3), and adequate (HLS-EU-Q16 score >12, HLS-EU-Q6 score >3) [18,20,53]. The association between the overall HLS-EU-Q16 and HLS-EU-Q6 scores was estimated with the Spearman correlation coefficient, and the kappa coefficient was used to evaluate agreement (acceptable if kappa>0.6, excellent if >0.8) between HL levels determined by the HLS-EU-Q16 and HLS-EU-Q6 [54]. Spearman coefficients were also used to evaluate the association of both HLS-EU overall scores with the FCCHL functional, communicative and critical HL scores, and the kappa coefficient was used to evaluate the agreement between the HL levels obtained for each patient by the HLS-EU-Q16 and HLS-EU-Q6 with the level evaluated by the physician.
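The scoring and categorization rules above can be illustrated with a short sketch. It assumes that responses have already been reversed so that 1 = "very difficult" and 4 = "very easy" (our coding, not stated in the text), and it uses the cut-offs as we read them from the paragraph above (at most 8 and at most 12 for the HLS-EU-Q16; at most 2 and at most 3 for the HLS-EU-Q6). Function names are hypothetical.

```python
# Assumed coding after reversal: 1 = "very difficult" ... 4 = "very easy".

def q16_score(reversed_items):
    """Sum of dichotomized items: "easy"/"very easy" (3 or 4) count as 1."""
    return sum(1 for r in reversed_items if r >= 3)

def q6_score(reversed_items):
    """Mean of the responses on the reversed four-point scale."""
    return sum(reversed_items) / len(reversed_items)

def q16_level(score):
    """Three HL levels from the HLS-EU-Q16 sum score (0 to 16)."""
    if score <= 8:
        return "inadequate"
    if score <= 12:
        return "problematic"
    return "adequate"

def q6_level(score):
    """Three HL levels from the HLS-EU-Q6 mean score (1 to 4)."""
    if score <= 2:
        return "inadequate"
    if score <= 3:
        return "problematic"
    return "adequate"

# Hypothetical respondent.
answers_16 = [4, 3, 2, 1] * 4
print(q16_score(answers_16), q16_level(q16_score(answers_16)))
```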
To determine the questionnaires' convergent and discriminant validity, comparisons were made between patients depending on their education level, perceived health status, and financial situation, and HL as evaluated by their physician. Lower HL was expected for less educated patients and for those with poorer perceived health status, poorer perceived financial situation, or low physician-assessed HL level [55][56][57][58]. These a priori hypotheses were tested with Mann-Whitney tests. The kappa coefficient was also used to evaluate the agreement between the HL level determined with the HLS-EU-Q16 and HLS-EU-Q6, and the physician-assessed HL level. Analyses were performed with Stata v.14 (data management and basic statistics), RUMM2030 (Rasch analyses), and Mplus v7.4 (CFA) [59][60][61].
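As an illustration of the agreement measure, the following sketch computes an unweighted Cohen's kappa between two categorical ratings of the same patients. Whether the study used the unweighted or a weighted kappa is not stated, so the unweighted form is an assumption here, and the function name and example data are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from the marginal distributions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1:
        return float("nan")  # kappa is undefined when only one category is used
    return (observed - expected) / (1 - expected)

# Hypothetical HL levels from a questionnaire and from the physician.
questionnaire = ["low", "low", "high", "high"]
physician = ["low", "high", "high", "high"]
print(cohen_kappa(questionnaire, physician))
```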

Subjects
Of the 372 patients who were approached for the study, 343 agreed to participate (response rate: 92%); 26 (8%) were subsequently excluded due to a missing answer on one or more of the HLS-EU-Q16 items. Table 1 summarizes the socio-demographic characteristics of the remaining 317 patients; 207 (65%) were women, their mean age was 53 (±18) years and 188 (59%) had a post-secondary education level. In all, 216 (68%) assessed their financial situation as "very comfortable" or "relatively comfortable" and 208 (66%) rated their health as "good", "very good" or "excellent".

Psychometric properties of the French version of the HLS-EU-Q16 and HLS-EU-Q6
Descriptive analyses and floor and ceiling effects. None of the HLS-EU-Q16 items had floor or ceiling effects when the 4-point Likert scale was used (S2 Table). After they were dichotomized, however, ceiling effects were observed for four items: item 3 (understanding your doctor), 4 (understanding your doctor's or pharmacist's instructions), 7 (following your doctor's or pharmacist's instructions), and 10 (understanding why you need health screening tests). The distributions of the scores on the HLS-EU-Q16 and HLS-EU-Q6 are reported in Table 2. A ceiling effect was observed for the HLS-EU-Q16 score, with 80 (25%) patients scoring 16. When the scores were categorised, the HL level was defined as inadequate for 26 (8%) and 16 (5%) subjects, problematic for 106 (33%) and 218 (69%), and adequate for 185 (58%) and 83 (26%) with the HLS-EU-Q16 and HLS-EU-Q6, respectively.
Rasch analyses and study of the measurement invariance of the HLS-EU-Q16. Loevinger's H coefficients confirmed the unidimensionality, local independence and monotonicity hypotheses, except for the H coefficient (0.28) associated with item 1 (finding information on treatments). The overall Chi² test P-value for the Rasch model fit was 0.08. Standardized fit residuals and Chi² tests indicated that the Rasch model had a good fit at the item level, as summarized in Table 3. Item difficulty varied from -2.42 to 2.18, and the latent trait (HL) level of more than 40% of the sample was higher than the highest item difficulty, as shown on the person-item map (S1 Fig). Table 4 presents the results from the DIF analyses. Item 1 (finding information on treatments) showed meaningful DIF across sex, which can be interpreted as follows: if men and women have the same HL level, men respond more often than women that it is difficult to find information on treatments of illnesses that concern them. Three items (item 3: understanding what your doctor says; item 5: judging when you may need to get a second opinion; and item 14: understanding advice on health from family members or friends) showed meaningful DIF across age. At the same HL level, older people respond more often than younger people that it is difficult to understand what the doctor says and to judge when they need to get a second opinion. Persons between 41 and 60 respond more often that it is easy to understand advice on health from relatives and friends compared to younger or older people. Education level was associated with a meaningful DIF: two items were easier for more educated patients (item 1: finding information on treatments, and item 6: using information the doctor gives you to make decisions), and two more difficult (item 11: judging if the information on health risks in the media is reliable, and item 12: deciding how you can protect yourself from illness based on information in the media).
At the same HL level, more educated subjects answer more often that it is difficult to find information on treatments and to use information given by the doctor to make decisions. Conversely, they answer less often that it is difficult to judge the reliability of the information in the media and to decide how to protect themselves based on information in the media. As DIF was not in the same direction for these five items, a higher HLS-EU-Q16 score was expected for more educated than for less educated subjects when the latent trait (HL level) was low, while a lower score was expected when the HL level was high.
Concurrent validity. The Spearman correlation between the HLS-EU-Q16 and HLS-EU-Q6 scores was 0.88. In contrast, the agreement between HL levels defined by both of these versions was poor, with a Kappa coefficient equal to 0.36. In addition, the Spearman correlation coefficients of the scores on both these versions with the functional, communicative and critical HL scales of the FCCHL were statistically significant (P-value<0.05) but all below 0.3 (range: 0.11-0.29).
Convergent and discriminant validity. Results from the tests of the a priori hypotheses are shown in Table 2. No significant differences were observed between patients according to their education level or perceived health status. A trend was observed by which the HLS-EU-Q16 score decreased with perceived health status, but not for the HLS-EU-Q6. HL measures were not associated with education level, even when the score was computed without the five items affected by DIF concerning education. On the other hand, a significant association was found between HL and perceived financial situation. Physicians evaluated the HL as insufficient for 26 (9%) patients, medium for 81 (28%) and satisfactory for 179 (63%). On average, the overall HLS-EU-Q16 and HLS-EU-Q6 scores were higher with higher physician-assessed HL (P-values = 0.002 and 0.033, respectively) (Table 4). Nonetheless, the agreement between the HL categories as evaluated by physicians and using the questionnaires was poor, with Kappa coefficients equal to 0.10 for the HLS-EU-Q16 and 0.06 for the HLS-EU-Q6.

Discussion
Transcultural adaptation and validation of the short forms of the HLS-EU questionnaire are necessary steps before they can be used to measure HL at the population level among French-speaking subjects and then to compare HL levels across populations, which was the primary aim of the HLS-EU study [17].
Our results indicate that the French version of the HLS-EU-Q16 is Rasch homogenous, which is a highly recommended property for composite measurement scales. Its internal consistency is also satisfactory. On the other hand, our analyses also revealed certain limitations that suggest the need for some caution when using this form of the questionnaire. First, four items showed ceiling effects when dichotomized, which suggests that these items are not sufficiently discriminant in this population. The Rasch analysis and the ceiling effect observed for the total score are consistent with this observation and indicate that the scale based on dichotomization of the items is not sufficiently discriminatory for use among subjects with high levels of HL. As such, the French version of the questionnaire appears more appropriate for the study of people and groups with relatively low levels of HL, to discriminate between them. This finding also raises the question of whether the HLS-EU-Q16 items should be treated as binary items at all, since they did not show any ceiling effects when scored on a four-point Likert scale.
Second, measurement invariance did not hold, as DIF was observed for sex, age, and education level. Several explanations may account for this phenomenon. Women are more concerned about their health than men. They have more medical encounters and are more likely to seek medical treatment. Moreover, physicians spend more time with female patients and give them more explanations [62]. This specific relationship to the health care system may explain why women more often than men declare that finding information on treatments is easy.
Two items were more difficult for older people (understanding what the doctor says and judging when they need to get a second opinion). This may be due, for example, to age-related hearing or cognitive impairments that make understanding and judgement more complicated without modifying the HL level at all. The finding that more educated subjects answer more often that it is difficult to find information on treatments and to use information given by the doctor to make decisions may be related to the known relationship between education and information seeking or preference for decision making [63,64]. While some degree of DIF is often found in questionnaire validation studies, ignoring the presence of this phenomenon could lead to biased results [51]. The DIF is particularly noticeable for education level, where it affects five items in opposite directions. Because the score difference across education levels due to DIF amounts to 0.5 units on the HLS-EU-Q16 total score, it can lead to underestimating the educational gradient in HL. The overall good fit of the Rasch model means that the same latent trait (HL) is measured across groups, but the presence of DIF signifies that it is not measured in the same way across groups. To counteract this bias, it is recommended that studies using the questionnaire in populations that vary in terms of sex, age, or (especially) education level perform sensitivity analyses by calculating the HLS-EU-Q16 scores with and without the items showing DIF. To our knowledge, this study is the first to investigate DIF for the HLS-EU questionnaire. Evidence of the amplitude of the biases this phenomenon may produce is necessary to allow a well-informed use of the questionnaire. We therefore recommend that it be investigated in other language versions.
The results for convergent and divergent validity were consistent with our a priori hypotheses, except for that concerning education level, which was not related to the total score on the HLS-EU-Q16. This may be due to the readability level of the French version of this questionnaire, which corresponded to that of a university undergraduate. It might have induced a selection bias by discouraging respondents with lower education levels from completing the questionnaire.
Finally, correlations between the overall HLS-EU-Q16 score and the FCCHL scores were very low, as was the agreement between the HL categories determined by the HLS-EU-Q16 and by the physician. This suggests that what is measured by the HLS-EU-Q16 is different from HL as measured by the FCCHL or as conceptualized by French physicians. With regard to the latter, nonetheless, we note that many of the doctors who participated in the study were unfamiliar with the concept of HL; this lack of knowledge probably explains the low level of agreement. Correlation with the FCCHL was also low, perhaps because the latter questionnaire focuses mainly on HL in the patient-clinician interaction, and less on the health information processing skills of accessing, understanding, appraising, and applying health information in disease prevention and health promotion, which are important objects of the HLS-EU-Q. Further research exploring the links and differences between the constructs measured by the various existing health literacy measurement instruments would be helpful in choosing the most suitable questionnaire for each study.
The validity of the French version of the HLS-EU-Q6 could not be established, because the fit of the one-factor CFA model was poor and the fit of the second-order CFA model could not be estimated reliably owing to computation issues. The Spearman correlation between the overall HLS-EU-Q16 and HLS-EU-Q6 scores was nevertheless high, suggesting that both tools measure the same construct. On the other hand, the agreement between HL levels measured by the HLS-EU-Q16 and HLS-EU-Q6 was poor, which suggests that different thresholds may be needed to categorize the overall HLS-EU-Q6 score. Moreover, the results from the analyses regarding convergent and discriminant validity were not convincing. To our knowledge, no studies have examined the psychometric properties of this short version, in any language, although it has already been used in some epidemiological studies [21,22]. Further studies should be planned to evaluate the validity of the HLS-EU-Q6 in other languages.
This study nonetheless has limitations. The sample size could be perceived as a weakness, although a sample size of 200 persons has been recommended for Rasch analysis [65]. In addition, although the entire French population has access to primary care without social differences, the method for participant recruitment via the waiting room probably resulted in overrepresenting women and elderly people in the sample. Accordingly, inadequate specification of the psychometric properties of the French version of the HLS-EU short versions due to this selection bias cannot be ruled out. Moreover, we used a sample of participants from the Paris area, so our results must be replicated on samples of subjects from other French-speaking areas to evaluate their robustness. Finally, the use of self-administered questionnaires is limited to people who can read French well, and ad hoc studies should be planned to develop self-administered measurement tools that can be used in populations with poor reading skills.

Conclusion
Despite these limitations, we conclude that the psychometric properties of the French version of the HLS-EU-Q16 enable its use in surveys of health literacy, provided that the population surveyed has sufficient reading skills (preferably not lower than high-school level), and that it is mainly suitable to discriminate between subjects with low to average HL level. Sensitivity analyses should also be performed to evaluate the role of potential measurement bias due to DIF related to sex, age, and education level in the HLS-EU-Q16. Furthermore, as measurement invariance has rarely been studied in the field of HL assessment [13], we suggest that further studies should assess this property in every language version of the HLS-EU questionnaires, as well as for other HL measurement instruments commonly used, such as the HLQ, REALM and TOFHLA. Finally, the validity of the HLS-EU-Q6 could not be established in this study.
Supporting information S1 Table. Items from the European Health Literacy Survey Questionnaire short forms, 16 items (HLS-EU-Q16) and 6 items (HLS-EU-Q6, in bold), and the health contexts and health-information-processing skills to which they apply.