Measurement agreement of the self-administered questionnaire of the Belgian Health Interview Survey: Paper-and-pencil versus web-based mode

Before organizing mixed-mode data collection for the self-administered questionnaire of the Belgian Health Interview Survey, measurement effects between the paper-and-pencil and the web-based questionnaire were evaluated. A two-period cross-over study was organized with a sample of 149 employees of two Belgian research institutes (age range 22–62 years, 72% female). Measurement agreement was assessed for a diverse range of health indicators related to general health, mental and psychosocial health, health behaviors and prevention, using kappa coefficients and intraclass correlation coefficients (ICC). The quality of the data collected by both modes was evaluated by quantifying the missing, ‘don’t know’ and inconsistent values and the data-entry mistakes. Good to very good agreement was found for all categorical indicators, with kappa coefficients above 0.60, except for two mental and psychosocial health indicators, namely the presence of a sleeping disorder and of a depressive disorder (kappa ≥ 0.50). For the continuous indicators acceptable to high agreement was observed, with ICC above 0.70. Inconsistent answers and data-entry mistakes occurred only in the paper-and-pencil mode. There were no fewer missing values in the web-based mode than in the paper-and-pencil mode. The study supports the idea that web-based modes provide, in general, responses equivalent to those of paper-and-pencil modes. However, health indicators based upon factual and objective items tend to have higher measurement agreement than indicators requiring an assessment of personal subjective feelings. A web-based mode greatly facilitates the data-entry process and guides the completion of a questionnaire. However, item non-response was not positively affected.


Introduction
Population surveys have traditionally used paper-and-pencil self-administered questionnaires to collect information on sensitive questions. However, with the growth of internet use, web-based questionnaires have become an important alternative to paper-and-pencil questionnaires due to their many advantages [1;2]. For instance, the process of manual data-entry with its accompanying data-entry mistakes becomes unnecessary [3;4]. In addition, web-based questionnaires can produce higher data quality, since automatic skipping and branching logic and warning messages in case of missing and implausible answers can be built in [3;4].
Web-based questionnaires cannot, however, be the sole mode of data collection for population surveys, as even in countries with high internet penetration, internet access and skills vary among demographic groups [5;6]. To overcome this limitation, mixed-mode data collection including a web-based and a paper-and-pencil mode can be used. Mixing different modes in one survey can lead to mode effects by simultaneously creating selection and measurement effects [7]. Selection effects can occur when respondents with different characteristics choose a different mode to complete the questionnaire. Measurement effects can occur if the mode influences how respondents understand the question, retrieve relevant information, make a judgment about the adequate response and finally choose the answer [8;9]. For instance, a web-based mode offers a greater opportunity to multitask, since respondents are more likely to be engaged in several other activities while completing the questionnaire [10;11]. This might lead to "satisficing" behavior: respondents simply provide a satisfactory answer (e.g. answering don't know or skipping the question) because an optimal response requires a substantial amount of cognitive effort [12;13]. A web-based mode may also limit the ability of the respondents to re-read the questions at their own pace and in their preferred order, and to synchronize their answers [14;15]. Furthermore, a web-based mode can generate more honest responses, since respondents can be transported into another virtual world wherein they forget their immediate surroundings [16]. In this way, it can create an illusion of privacy.
Mode effects have implications for the comparability of the data collected by different modes [8]. Recent meta-analyses and review studies of the comparability of electronic and paper-and-pencil modes generally found evidence of equivalence across the modes [14;17;18]. However, other studies found differences in the reporting of general health [15], mental health [19;20] and sensitive health behaviors [21;22]. In a mixed-mode design, it is not possible to disentangle selection effects from measurement effects [7]. Therefore, in the context of future mixed-mode data collection for the self-administered questionnaire of the Belgian Health Interview Survey (BHIS), a study with a repeated measures design was organized to test for measurement effects. More specifically, the aim of this study was to assess the measurement agreement between the newly developed web-based mode and the paper-and-pencil mode for several health indicators and to ascertain the extent to which the quality of the collected data varied between these modes.

Research design and study population
A two-period cross-over design was used, in which respondents completed the questionnaire in both modes with a certain time interval in between. Respondents were recruited on a voluntary basis from a pool of 730 employees of two Belgian research institutes. The research protocol was submitted to the directors of the participating institutes for approval. No ethics committee was involved as this was an internal pilot study. The employees were informed about the objectives of the study in an e-mail before giving their written consent for participation. No benefits or risks were derived from participating in this study. The answers of the participants were kept anonymous, as each participant had a unique ID code and the link between the name and the ID code was not accessible to the researchers. This link was deleted after the end of the data collection. In total 195 employees volunteered to participate. Half of the respondents were first assigned to the paper-and-pencil mode (paper first group) and the other half to the web-based mode (web first group). After two weeks the groups were switched: the paper first group received the web-based mode and vice versa. Only respondents who completed the questionnaire in both modes were included in the final sample of 149 respondents. In the end, a response rate of 20.4% (149/730) was achieved. The median number of days between completing the questionnaire in the two modes was 14 days (minimum 2 and maximum 40 days) (Fig 1).
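The recruitment figures above can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names and the second ratio (completers among volunteers) are illustrative additions, not quantities reported by the study.

```python
# Counts taken from the text above.
pool = 730        # employees in the recruitment pool
volunteers = 195  # employees who volunteered
completers = 149  # respondents who completed both modes

# Response rate as reported: completers relative to the whole pool.
response_rate = completers / pool
# Illustrative extra ratio (not reported in the text): completers among volunteers.
completion_among_volunteers = completers / volunteers

print(f"Response rate: {response_rate:.1%}")  # 20.4%
print(f"Completed both modes among volunteers: {completion_among_volunteers:.1%}")  # 76.4%
```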

Instrument
The questionnaire was based on the self-administered paper-and-pencil questionnaire of the BHIS 2013 [23] and could be completed in French or Dutch. The web-based questionnaire was developed to be as comparable as possible to the paper-and-pencil mode. Therefore, the questions were identical (similar wording and almost similar instructions) and the design was comparable (similar colors and lay-out). Still, the web-based mode made use of the embedded features of this mode, such as automatic skipping and branching. Furthermore, soft warnings were given in case of missing values for the first question of every module and for filter questions, and in case respondents gave inconsistent or implausible answers. In addition, the web-based mode had a multipage design displaying only a few questions on every screen, which differs from the paper-and-pencil questionnaire, which allows a comprehensive view of the whole questionnaire. Web respondents were, however, able to go back in the questionnaire to change answers given to previous questions. After completing the last questionnaire, respondents were asked if they had experienced a health change during the washout period.
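To make the two web-specific features concrete, the following is a minimal sketch of automatic skipping and of a soft warning. It is not the Blaise routing used in the study: the question identifiers (`filter_smoker`, `cigarettes_per_day`, `alcohol_past_12_months`) and both functions are hypothetical and serve only as an illustration.

```python
# Hypothetical question identifiers; the real BHIS items and routing rules
# are not reproduced here.

def next_question(answers):
    """Automatic skipping: the follow-up item is shown only after a 'yes' on the filter."""
    if answers.get("filter_smoker") == "yes":
        return "cigarettes_per_day"
    return "alcohol_past_12_months"


def soft_warnings(answers, checked_items):
    """Return a warning for each checked item left unanswered.

    The warning is 'soft': the respondent is informed but may still continue.
    """
    return [
        f"No answer was given for '{item}'."
        for item in checked_items
        if answers.get(item) is None
    ]


answers = {"filter_smoker": None}
print(soft_warnings(answers, ["filter_smoker"]))
print(next_question({"filter_smoker": "no"}))
```

On paper, nothing prevents a respondent from answering the follow-up item after a 'no' on the filter, which is the kind of inconsistent value counted in the data quality analysis.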
The web-based questionnaire, developed using BlaiseIS 4.8 software, could be completed using a computer but not using a tablet or smartphone. Data from the web-based questionnaire were automatically saved in a database. Data collected with the paper-and-pencil questionnaire were entered manually using a program also developed with Blaise® software. A double data-entry was done in order to correct for data-entry mistakes. Table 1 provides an overview of the indicators selected to assess the measurement agreement. These indicators are organized in 4 topics: general health, mental and psychosocial health, health behaviors, and prevention.

Statistical analyses
Statistical analyses were performed using the statistical package SAS® 9.3. The significance level for all the analyses was set at 5%, with corresponding 95% confidence intervals (CI).
Measurement agreement. For categorical indicators, kappa coefficients were estimated [26]. Simple kappa coefficients were calculated for binary and nominal indicators, whereas linear weighted kappa coefficients were calculated for ordinal indicators. Weighted kappa coefficients take into account that disagreement is greater between response categories that are further apart on an ordinal scale than between those that are closer together [27;28]. Linear weights were defined as $w_i = 1 - i/(c-1)$, where $i$ is the absolute difference between the response categories in the web-based mode and the paper-and-pencil mode and $c$ is the total number of categories of the indicator. For the interpretation, we followed the cutoffs proposed by Landis & Koch [26]: <0.00 = poor, 0.00-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = good, 0.81-1.00 = very good agreement. In addition, percentages of exact agreement (for binary, nominal and ordinal categorical indicators) and of global agreement (only for ordinal categorical indicators) were calculated [29]. Exact agreement was estimated as the percentage of respondents who had the same category in both modes. Global agreement was calculated as the percentage of responses that fell within one category in the positive or negative direction. The percentages of agreement depend on the number of categories; they are expected to be higher for indicators with only a few categories.
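The kappa and agreement measures described above can be sketched in a few lines of Python. This is an illustration only: the study used SAS, the function names are hypothetical, the toy answers are invented, and confidence intervals are not computed. The linear weights follow the standard definition, 1 - |i - j| / (c - 1).

```python
def agreement_table(web, paper, categories):
    """Cross-tabulate paired answers: rows = web-based, columns = paper-and-pencil."""
    index = {cat: k for k, cat in enumerate(categories)}
    c = len(categories)
    table = [[0] * c for _ in range(c)]
    for a, b in zip(web, paper):
        table[index[a]][index[b]] += 1
    return table


def kappa(table, weighted=False):
    """Cohen's kappa; with weighted=True, linear weights 1 - |i - j| / (c - 1) are used.

    Undefined (division by zero) when all answers fall in a single category.
    """
    c = len(table)
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(c)) for j in range(c)]

    def weight(i, j):
        if weighted:
            return 1 - abs(i - j) / (c - 1)
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(c) for j in range(c)]
    observed = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    expected = sum(weight(i, j) * row[i] * col[j] for i, j in cells) / n**2
    return (observed - expected) / (1 - expected)


def percent_agreement(table, tolerance=0):
    """tolerance=0 gives exact agreement; tolerance=1 gives global agreement."""
    c = len(table)
    n = sum(map(sum, table))
    hits = sum(table[i][j] for i in range(c) for j in range(c) if abs(i - j) <= tolerance)
    return 100 * hits / n


def interpret(k):
    """Labels following the cutoffs quoted in the text."""
    if k < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "good")]:
        if k <= upper:
            return label
    return "very good"


# Invented answers of eight respondents on a three-category ordinal item.
categories = ["low", "mid", "high"]
web = ["low", "low", "low", "mid", "mid", "mid", "high", "high"]
paper = ["low", "low", "mid", "mid", "mid", "high", "high", "high"]

table = agreement_table(web, paper, categories)
k_w = kappa(table, weighted=True)
print(round(k_w, 3), interpret(k_w))        # weighted kappa and its label
print(percent_agreement(table))             # exact agreement (%)
print(percent_agreement(table, tolerance=1))  # global agreement (%)
```

In this toy example both disagreements are only one category apart, so the weighted kappa is higher than the simple kappa computed on the same table, and global agreement reaches 100% while exact agreement does not.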
Measurement agreement for continuous indicators was assessed using the intraclass correlation coefficient (ICC) [30]. The ICC measures the correlation between a single rating on a continuous measure using the web-based mode and a continuous measure using the paper-and-pencil mode [4]. A score above 0.80 is usually sought in mode comparison, with 0.70 considered as an acceptable value [31]. The ICC is based on mean-centered versions of the indicators and is insensitive to a respondent's tendency to provide consistently higher responses in one mode compared to the other [4;31]. For this reason, Wilcoxon signed-rank tests were performed to detect the presence of differences between both modes.
Kappa and ICC coefficients were calculated overall and by order group (web first or paper first group). In this paper the overall kappa and ICC coefficients are presented. However, in case of a difference between the order groups, the coefficients by order group are mentioned. Further, the kappa and ICC coefficients were calculated with and without respondents who said they had experienced a health change (n = 11), but since this had almost no effect, it was decided to use the sample including all respondents.
Data quality. The quality of the data was assessed by evaluating the missing, 'don't know' and inconsistent values. An inconsistent value was defined as an answer that should not have been given according to the skipping and branching logic or as an answer that was inconsistent with other answers. 'Don't know' is a non-substantive answer since it can be seen as a way of refusing to answer a question [32]. The values were quantified by counting the total number of these values separately for both modes of data collection. Furthermore, the mean number of missing, 'don't know' and inconsistent values per questionnaire was calculated for both modes, and the differences between the modes were evaluated by performing a Wilcoxon signed-rank test.
Additionally, paper-and-pencil surveys require manual data-entry, which may generate mistakes and hence have a negative impact on the data quality. For this reason, a double data-entry was performed. In case inconsistencies were found, they were resolved by checking the paper-and-pencil questionnaire. The number of data-entry mistakes was assessed by counting the total number of data-entry mistakes per data encoder.
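The text does not name the ICC variant. Because it describes a single-rating coefficient that is insensitive to a constant shift between modes, the sketch below assumes the two-way, single-measure consistency form (often written ICC(3,1)). The function name and the toy data are illustrative; the study's own SAS computation may differ.

```python
def icc_consistency(web, paper):
    """Single-measure, two-way consistency ICC for paired scores in two modes.

    Assumed variant (ICC(3,1)); mode main effects are removed before the
    error term is computed, so a constant shift does not lower the ICC.
    """
    n, k = len(web), 2
    pairs = list(zip(web, paper))
    grand = sum(a + b for a, b in pairs) / (n * k)
    subject_means = [(a + b) / k for a, b in pairs]
    mode_means = [sum(web) / n, sum(paper) / n]

    ss_total = sum((x - grand) ** 2 for pair in pairs for x in pair)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_modes = n * sum((m - grand) ** 2 for m in mode_means)
    ss_error = ss_total - ss_subjects - ss_modes

    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)


# Invented scores: every web answer is exactly 2 points higher than on paper.
paper_scores = [1, 2, 3, 4, 5]
web_scores = [3, 4, 5, 6, 7]
print(icc_consistency(web_scores, paper_scores))  # perfect consistency despite the shift
```

The shifted example shows why a separate test for systematic differences is needed: the consistency ICC stays at its maximum even though every respondent scores higher in one mode.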
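As an illustration of this procedure, the sketch below counts missing and 'don't know' values per questionnaire and compares the paired counts with a Wilcoxon signed-rank test. The value codes (`None`, `"DK"`), the function names and the toy data are assumptions. The test uses the plain normal approximation without continuity or tie correction, which may differ from the SAS procedure used in the study and is unreliable for samples as small as this toy example.

```python
import math

MISSING = None     # assumed code for an unanswered item
DONT_KNOW = "DK"   # assumed code for a 'don't know' answer


def count_per_questionnaire(questionnaires, value):
    """Number of items equal to `value` in each questionnaire."""
    return [sum(1 for answer in q if answer == value) for q in questionnaires]


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, z, two-sided p) for paired samples x and y.

    Zero differences are dropped and tied absolute differences get average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return w_plus, w_minus, z, p


# Invented data: three respondents, three items each, in both modes.
paper = [[1, None, "DK"], [2, 3, 1], [None, None, 2]]
web = [[1, 2, "DK"], [None, 3, 1], [None, 1, 2]]

missing_paper = count_per_questionnaire(paper, MISSING)
missing_web = count_per_questionnaire(web, MISSING)
print(sum(missing_paper), sum(missing_web))  # totals per mode
print(wilcoxon_signed_rank(missing_web, missing_paper))
```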
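The double data-entry check can be sketched as follows. The field names, the dictionary layout and the function are hypothetical; the sketch only mirrors the logic described above, in which fields where the two entries disagree are resolved against the paper questionnaire and the resulting mistakes are counted per encoder.

```python
def entry_mistakes(entry_1, entry_2, paper_form):
    """Count data-entry mistakes per encoder for one questionnaire.

    Only fields where the two entries disagree are checked against the paper
    form, so an identical mistake made by both encoders goes undetected.
    """
    mistakes = {"encoder_1": 0, "encoder_2": 0}
    for field, true_value in paper_form.items():
        a, b = entry_1.get(field), entry_2.get(field)
        if a == b:
            continue  # entries agree: no discrepancy is flagged
        if a != true_value:
            mistakes["encoder_1"] += 1
        if b != true_value:
            mistakes["encoder_2"] += 1
    return mistakes


# Invented example with hypothetical field names.
paper_form = {"q1": 2, "q2": 1, "q3": 4}
entry_1 = {"q1": 2, "q2": 7, "q3": 4}   # typo in q2
entry_2 = {"q1": 2, "q2": 1, "q3": 3}   # typo in q3
print(entry_mistakes(entry_1, entry_2, paper_form))
```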

Characteristics of the respondents
About 72% of the respondents were female and 57% were younger than 40 years. The age range was 22 to 62 years. No gender or age differences between the order groups were detected (Table 2).

Measurement agreement
General health. For two indicators a very good agreement was found, with a kappa coefficient of 0.92 (95% CI: 0.85-1.00) for chronic health problems and of 0.84 (95% CI: 0.69-0.99) for activity limitations (Table 3). For self-rated health there was somewhat lower but still good agreement (kappa = 0.74 (95% CI: 0.53-0.96)). About 97% of the respondents had the same response category in both modes for these indicators. The kappa coefficients calculated within each order group showed lower agreement in the web first group compared to the paper first group for self-rated health and for activity limitations. However, there was at least moderate agreement between both modes (kappa ≥ 0.55).
Mental and psychosocial health. For lifetime suicidal ideation a very good agreement was found (kappa = 0.86 (95% CI: 0.76-0.95)) (Table 3). Four other indicators showed good agreement, with kappa coefficients varying between 0.61 (95% CI: 0.48-0.74) for mental distress and 0.78 (95% CI: 0.61-0.95) for the presence of an eating disorder. The presence of a depressive disorder (kappa = 0.52 (95% CI: 0.32-0.71)) and of a sleeping disorder (kappa = 0.50 (95% CI: 0.35-0.64)) exhibited only moderate agreement. Between 77.4% and 95.9% of the respondents had the same response category in both modes. For the ordinal categorical indicator quality of social support, all respondents reported the same response category or stayed within one response category in the positive or negative direction. The kappa coefficients calculated in each order group showed somewhat lower agreement for the presence of an eating disorder in the web first group compared to the paper first group, and for the presence of a sleeping disorder and lifetime problematic alcohol consumption in the paper first group compared to the web first group. However, the agreement was still at least moderate between both modes (kappa ≥ 0.57), except for the presence of a sleeping disorder (kappa = 0.36 (95% CI: 0.14-0.58)).
The continuous indicator vitality index had an ICC value of 0.79 (95% CI: 0.72-0.84), which indicates that the agreement was acceptable (Table 4). No significant difference between the two modes was observed. The ICC coefficients were similar when doing the analyses in each order group.
Health behaviors. For all six categorical health behavior indicators very good agreement was found (Table 3). The kappa coefficients ranged between 0.84 (95% CI: 0.76-0.91) for risky single occasion alcohol drinking and 1.00 (95% CI: 1.00-1.00) for lifetime cannabis use. The percentages of exact agreement indicate that 83.9% to 100% of the respondents had the same response category in the web-based mode as in the paper-and-pencil mode. Concerning the two ordinal indicators alcohol drinking in the past 12 months and risky single occasion alcohol drinking, 100% and 98.6% of the respondents, respectively, gave the same response category or remained within one response category in the web-based and paper-and-pencil mode. When considering kappa coefficients calculated in each order group, equal results were obtained.
The ICC coefficients for the continuous indicators showed high agreement for the number of alcoholic drinks over the whole week (0.89 (95% CI: 0.83-0.93)) and for the age at starting drinking alcohol (0.91 (95% CI: 0.88-0.94)) (Table 4). No significant differences between the two modes were identified. The ICC coefficients were similar when we did the analyses for every order group.
Prevention. For mammography in the past 2 years and ever being tested for HIV a very good agreement was found, with kappa coefficients of, respectively, 0.95 (95% CI: 0.88-1.00) and 0.93 (95% CI: 0.87-0.99) (Table 3). For cervix smear test in the past 3 years somewhat lower but still good agreement was found (kappa = 0.80 (95% CI: 0.65-0.95)). Between 94.2% and 98.1% of the respondents had the same response category in both modes for the prevention indicators. The kappa coefficients indicated lower agreement for cervix smear test in the past 3 years in the web first group compared to the paper first group. However, the level of agreement was still good (kappa = 0.65 (95% CI: 0.37-0.93)).
Table 3 notes. For each indicator, statistics were calculated among respondents who gave an answer in both modes. a Percentage of respondents who have the same response category in the web-based and paper-and-pencil mode. b Percentage of respondents who have the same or within one response category in the positive or negative direction in the web-based and paper-and-pencil mode. c Weighted kappa coefficients were calculated instead of simple kappa coefficients for ordinal categorical indicators. In addition to percentages of exact agreement, we also calculated percentage of global agreement for these indicators. https://doi.org/10.1371/journal.pone.0197434.t003
Table 4. Intraclass correlation between the paper-and-pencil and web-based mode for continuous indicators.

Data quality
Although the total number of missing values was low in both modes, it was higher in the web-based mode (228 (1.3%)) compared to the paper-and-pencil mode (104 (0.6%)) (Table 5). No significant differences were found in the mean number of missing values between the questionnaires in both modes. The total number of 'don't know' values was somewhat higher in the paper-and-pencil mode (93 (3.2%)) compared to the web-based mode (82 (2.8%)) but no significant differences in the mean numbers were found. In the paper-and-pencil mode, there were 12 (1.3%) inconsistent values, while no such values were detected in the web-based mode because of the integrated controls and automatic skipping and branching logic. The two data encoders made 132 data-entry mistakes in total. Data encoder 1 made more mistakes (117 (0.7%)) than data encoder 2 (15 (0.1%)).

Discussion
This study showed generally strong agreement between the web-based and the paper-and-pencil mode. For general health indicators good to very good agreement was observed. This is consistent with the findings of Hoebel et al. [8], who found no differences in the prevalence rates for general health indicators between a web-based and paper-and-pencil health interview survey, and of Ritter et al. [33], who found that respondents answered similarly in these modes for self-report general health instruments. All behavior indicators showed very good agreement. This is in agreement with the results of Vergnaud et al. [4], who found high measurement agreement for variables related to tobacco use. Hoebel et al. [8] also found no differences in the prevalence rates for tobacco use and alcohol consumption between these modes. For the three prevention indicators good to very good agreement was found. This is again in line with Hoebel et al. [8], who found no differences between the web-based and paper-and-pencil mode for participation in influenza vaccination, which can be seen as a prevention indicator. For mental and psychosocial health good to very good agreement was found for six indicators and moderate agreement was observed for two indicators, namely the presence of a sleeping disorder and of a depressive disorder. This is in line with a systematic review study that found generally high reliability between electronic and paper-and-pencil modes for psychiatric self-report instruments [17]. The moderate agreement found for depressive and sleeping disorder could be related to the recall period of only one week of the SCL-90-R instrument [34]. Since the washout period in this study was two weeks, it is possible that respondents experienced mood swings or sleeping variation between completing both questionnaires. The variation between health topics in measurement agreement could be due to the nature of the questions, as all indicators for which very good agreement was found are based upon factual and objective items, whereas the indicators for which moderate agreement was found require assessing personal subjective feelings.
Table 5. Comparison of missing values, 'don't know' values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149).
As expected, the web-based mode offered advantages regarding data quality. In the paper-and-pencil mode, respondents gave some answers that should not have been given according to the branching logic and answers that were inconsistent with other answers. Such problems did not occur in the web-based mode due to integrated controls and automatic branching and skipping logic. Furthermore, the process of manual data-entry and the accompanying mistakes were avoided. However, there were no fewer missing values in the web-based mode. On the contrary, slightly more missing values were generated, but this difference was not statistically significant. Other studies generally found fewer missing values in a web-based mode compared to a paper-and-pencil mode [3;4;18]. This difference might be explained by the fact that our respondents were allowed to skip questions. Studies that likewise did not enforce answers also found slightly more missing values in the web-based mode [35;36].
This study has some limitations. A convenience sample of the employees of two research institutes was used. These people are generally in good health, part of the working-age population, mainly highly educated and probably familiar with completing questionnaires in both modes. Consequently, it should be acknowledged that this sample excluded people who do not routinely access the internet. Due to these factors, the sample may not be representative of the general population. Nevertheless, web-based questionnaires in mixed-mode surveys are more likely to attract younger and highly educated people with internet access [37]. This study tested measurement agreement for BHIS indicators, which are aggregated indicators based upon multiple questions or items of existing health instruments and which combine multiple response categories of questions. This might have masked potential differences between modes. A two-week washout period was used to prevent respondents from recalling the answers given the first time and letting them influence the answers given the second time [38]. However, since this study was organized during the holiday period, some variability in the washout period occurred (2-40 days). Nevertheless, other studies that tested measurement agreement reported comparable variability in washout periods [3;4;39], and a study that compared test-retest reliability of health status instruments using a two-day or two-week washout period found no time interval effect [40]. Furthermore, respondents could indicate if they experienced a health change during the washout period, since this could have affected the agreement [3].
In conclusion, this study supports the idea that web-based modes provide, in general, responses equivalent to those of paper-and-pencil modes. A web-based mode greatly facilitates the data-entry process and guides the completion of a questionnaire; however, item non-response was not positively affected. Even with the limitation of having a sample with a majority of highly educated and internet-familiar people, the agreement between the two modes was substantial enough to conclude that mixed-mode data collection including a paper-and-pencil and a web-based questionnaire could be undertaken without impacting the comparability of the estimates.