Biases and Power for Groups Comparison on Subjective Health Measurements

Subjective health measurements are increasingly used in clinical research, particularly for comparing groups of patients. Two main types of analytical strategies can be applied to such data: so-called classical test theory (CTT), which relies on observed scores, and models from Item Response Theory (IRT), which rely on a response model relating the item responses to a latent parameter, often called the latent trait. Whether IRT or CTT is the most appropriate method to compare two independent groups of patients on a patient-reported outcomes measurement remains unknown and was investigated using simulations. For CTT-based analyses, groups were compared using a t-test on the scores. For IRT-based analyses, several methods were compared, according to whether the Rasch model was considered with random effects or with fixed effects, and whether the group effect was included as a covariate or not. Individual latent trait values were estimated using either a deterministic method or stochastic approaches; latent traits were then compared with a t-test. Finally, a two-step method was performed to compare the latent trait distributions, and a Wald test was performed to test the group effect in the Rasch model including a group covariate. The only unbiased IRT-based method was the Wald test of the group covariate, performed on the random effects Rasch model. This method also displayed the highest observed power, similar to that of the score t-test. These results need to be extended to the case, frequently encountered in practice, where data are missing and possibly informative.


Introduction
Subjective health measurements are increasingly used in clinical studies to assess patients' perception of their own health [1,2]. For example, they allow assessing phenomena such as quality of life, tiredness, depression or anxiety. These phenomena are called latent variables because they cannot be directly observed or measured. However, their effects are accessible through the analysis of other variables that are directly observable.
Assessing these subjective measurements is usually done by using self-assessment questionnaires called patient reported outcomes (PRO), which consist of a set of questions often called items. Two strategies have been developed to analyse such questionnaires: the Classical Test Theory (CTT) and the Item Response Theory (IRT). These theories provide different conceptual frameworks for the analysis of PRO, each being based on several hypotheses that have to be tested before analysis. CTT is based on the assumption of a linear model explaining the individual observed score by a theoretical individual score plus a stochastic error term. Such a hypothesis can be tested using Cronbach's alpha [3]. On the other hand, IRT is based on the assumption of a logit model explaining the individual item responses by a latent parameter, often called the latent trait. Such a hypothesis can be tested using the R1m global test of item fit [4].
With CTT, the item responses are combined to provide scores allowing the data to be analysed. In most cases, these scores should be considered as ordinal qualitative measurements of the latent variables studied, and thus cannot be considered as interval measurements [5,6]: a unit difference does not characterize the same amount when measured from different initial levels on the latent trait scale. Therefore, a given score variation cannot be associated with a given latent variable variation, and one should not rely on CTT to quantify an expected effect or a clinical significance threshold [7,8].
With IRT, the latent variable is quantified by measuring the latent trait. The latent trait, estimated by modelling the probability of an observed response to an item, can always be considered as a quantitative variable with interval measurement properties [9]. IRT therefore allows not only quantifying an expected effect or the clinical relevance of an observed difference, but also highlighting latent trait differences between compared groups.
A simple and widely used IRT model, adapted to the analysis of dichotomous items, is the Rasch model [9]. In this model, the probability of a specific response (e.g. positive or negative answer) is modelled as a function of person and item parameters. Person parameters pertain to the latent trait level of people who are evaluated while item parameters pertain to the difficulty of the items (in a Rasch model, the difficulty of an item is equal to the latent trait of an individual who would have an equal probability of responding positively or negatively to this item). Person parameters can then be interpreted as a propensity to respond positively to each item.
This model can be specified in different ways: all the individual latent traits can be considered as a set of fixed effects (this is known as the fixed effects Rasch model), or as realizations of a random variable assumed to be normally distributed (this is known as the random effects Rasch model). With a fixed effects Rasch model, the purpose is to assess, for each individual, the value of his/her individual latent trait. In contrast, with a random effects Rasch model, the purpose is to directly estimate the parameters of the overall distribution of the latent trait; in the case of a normal distribution, two parameters are estimated: the mean and the variance of the latent trait. Finally, if the sample consists of individuals coming from potentially distinct populations, it is possible to add a group covariate in the random effects model.
Several methodologies can be used to compare two samples of patients on PRO data coming from an IRT-based or a CTT-based validated questionnaire. These methodologies depend on the use of CTT or IRT, and on the model chosen to estimate latent traits if IRT is used. Whether one approach is more suitable than another is still under debate.
The aim of our study is to evaluate and to compare different group-comparison methods from IRT-based and CTT-based models. The statistical properties of the different methods either based on CTT or IRT were assessed and compared by simulations regarding the type I error, power, and bias in parameter estimates.

Simulation Study
One of the most relevant strategies to explore the empirical properties of comparison methodologies is to apply them in perfectly known contexts. The "true" statistical conclusion is then known, and can be compared with the observed conclusion. For example, to study the type I error of a group comparison test, the test should be performed on two samples both drawn from the same population: the proportion of rejections of the null hypothesis then corresponds to the probability of finding a difference that does not exist in reality. In contrast, the test should be performed on two samples drawn from different populations to study its power.
An appropriate strategy to know a priori the origin of the analysed samples is to generate them using Monte Carlo simulations. Unlike a real data study, data resulting from Monte Carlo simulations allow differentiating whether a statistically significant difference is linked to a real difference or to the type I error of the considered test.
In our study, we generated the data using Monte Carlo simulations with a Rasch model. Doing so allowed us to assume that the simulated questionnaires had been previously validated to be analysed either with a Rasch model or with CTT: the assumptions needed to analyse data with a CTT-based model are necessarily fulfilled by data satisfying the assumptions of a Rasch model [10].
Several parameters combinations were considered to generate the simulated data.
- For each simulation, we simulated two samples A and B of equal size n = n_A = n_B. The sample size per group ranged from 50 to 400 subjects to reflect sample sizes commonly encountered in clinical research studies.
- The latent trait distribution was defined as normal. The normal distribution was chosen to respect the hypothesis underlying the implementation of a random effects Rasch model.
- The latent trait distribution variances were equal to 1 to be within the framework of standardized data, and so to overcome the problem of the measurement scale. Thus, the differences in latent traits and in difficulties were expressed only in fractions of a standard deviation.
- The simulated differences between the means of the latent traits, Δ = μ_B − μ_A, were set at 0, 0.2σ, 0.5σ and 0.8σ. The latent trait means for groups A and B were therefore respectively equal to μ_A = −Δ/2 and μ_B = Δ/2. A difference set at 0 corresponded to a lack of effect, and allowed estimating the tests' type I error by computing the proportion of rejections of the null hypothesis. A difference set at 0.2σ, 0.5σ or 0.8σ corresponded respectively to a small, medium or large effect size [11], and allowed estimating power by computing the proportion of rejections of the null hypothesis.
- The items were defined as dichotomous, so they could be analysed by a Rasch model. Each positive response was coded as 1 and each negative response as 0. The number of items was set at 5 or 10, in accordance with the size of the subscales of the questionnaires most commonly used to measure PRO. For example, the NHP consists of 6 subscales composed of 3 to 9 dichotomous items [12]. Similarly, the SF-36 consists of 8 subscales composed of 2 to 10 items, 2 subscales being composed only of dichotomous items (Emotional Role Limitation and Physical Role Limitation), the others of polytomous items [13].
- The item difficulties were defined as the percentiles of a standard normal distribution or as the percentiles of an equiprobable mixture of two Gaussian distributions. These two possibilities allowed considering two different situations that can be encountered in practice. The normal distribution reflected the situation where the questionnaire was perfectly adapted to a population with normally distributed latent traits; evenly distributed item difficulties allowed considering the score as an interval measurement. The bimodal mixture corresponded to a more irregular and probably more realistic item difficulties distribution. The Gaussian parameters of this mixture were chosen to distinguish two groups of items within the scale: a first group of items whose difficulty values were very close, and a second whose difficulty values were farther apart. Such a distribution involved a poorer match to the latent trait distribution, and thus floor or ceiling effects, and did not allow considering the score as an interval measurement.
- The individual item responses were generated by Bernoulli trials, after calculating for each individual the probability of response to each item with a Rasch model.
- Each parameter combination of the simulations was replicated 1000 times.
The details of the chosen simulation parameters are presented in table 1.
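The generation step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' original Stata code: the function and variable names are ours, and the 5-item difficulty set is taken as percentiles of a standard normal, one of the two designs described above.

```python
import numpy as np
from scipy.stats import norm

def simulate_two_groups(n_per_group, difficulties, delta, rng):
    """Generate dichotomous Rasch responses for groups A and B with
    latent traits drawn from N(-delta/2, 1) and N(+delta/2, 1)."""
    mu = np.repeat([-delta / 2.0, delta / 2.0], n_per_group)
    thetas = rng.normal(mu, 1.0)                       # individual latent traits
    logits = thetas[:, None] - np.asarray(difficulties)[None, :]
    probs = 1.0 / (1.0 + np.exp(-logits))              # Rasch response probabilities
    return rng.binomial(1, probs)                      # Bernoulli trials

rng = np.random.default_rng(2024)
# 5 item difficulties taken as percentiles of a standard normal distribution
difficulties = norm.ppf(np.arange(1, 6) / 6.0)
responses = simulate_two_groups(100, difficulties, delta=0.5, rng=rng)
```

In the study, this generation was replicated 1000 times for each combination of Δ, sample size, number of items and difficulty distribution.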

Statistical Analysis
For each simulation of each parameter combination, the individual score s_i for person i (i = 1, …, n) was defined as the sum of the positive item responses. The latent trait analysis (IRT) was performed with fixed effects and random effects Rasch models. These analyses were conducted under three distinct cases:
- The difficulty parameters could be considered as unknown, which required estimating them during the IRT analysis.
- The difficulty parameters could be assumed to be already known (e.g. estimated during previous studies, or coming from item banks such as the quality of life item bank PROMIS [14]). In this case, they were not estimated during the analysis. Knowledge of these parameters was then envisaged in two ways:
  - The difficulty parameters were considered as well known: the fixed values of the difficulty parameters used during the analysis δ_j were equal to the simulated difficulties δ_j,Simulation.
  - The difficulty parameters were considered as imperfectly known, or known with error: the fixed difficulty parameter values used during the analysis δ_j were randomly drawn from uniform distributions U(δ_j,Simulation − σ; δ_j,Simulation + σ).

The Rasch Model
One of the most commonly used IRT models adapted to the analysis of dichotomous items is the Rasch model [9]. Let X_ij be the dichotomous variable representing the response of person i (i = 1, …, n) to an item j (j = 1, …, k). For a questionnaire containing k dichotomous items, the model can be written as follows (eq. 1):

P(X_ij = x_ij | θ_i, δ_j) = exp(x_ij(θ_i − δ_j)) / (1 + exp(θ_i − δ_j))   (1)

where x_ij = 0 for a negative response and x_ij = 1 for a positive response, δ_j is the difficulty associated with item j, and θ_i is the individual value of the latent trait for patient i. When all the individual latent traits are considered as a set of fixed effects, the Rasch model is known as a fixed effects Rasch model, while when the individual latent traits are considered as realizations of a random variable assumed to be normally distributed, the Rasch model is known as a random effects Rasch model.
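Eq. 1 amounts to a logistic function of the difference θ_i − δ_j. As a minimal illustration (the function name is ours):

```python
import math

def rasch_probability(theta, delta, x=1):
    """P(X = x | theta, delta) under the Rasch model (eq. 1)."""
    p_positive = math.exp(theta - delta) / (1.0 + math.exp(theta - delta))
    return p_positive if x == 1 else 1.0 - p_positive

# When the latent trait equals the item difficulty, both responses
# are equally likely:
p = rasch_probability(theta=0.7, delta=0.7)   # → 0.5
```

This also makes the interpretation of the difficulty parameter explicit: δ_j is the latent trait value at which a positive and a negative answer to item j are equally probable.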

The Fixed Effects Rasch Model
The estimates of the fixed effects Rasch model parameters were obtained using a two-step procedure, providing consistent estimators [15-17]. The estimates of the item difficulty parameters were obtained with conditional maximum likelihood, given the individual scores s_i (eq. 2). The estimates of the individual latent traits were then obtained with weighted maximum likelihood (WML) (eq. 3). This entire procedure is known as the CML procedure. By extension, in this study, a fixed effects Rasch model will be called a CML-model.
Let δ be the k-vector of item difficulty parameters δ_j, θ be the n-vector of individual latent traits, s be the n-vector of individual scores s_i, x_i be the k-vector of the item responses for the i-th individual and x be the (k × n)-vector of the item responses for all the n individuals.
The δ_j parameters are consistently estimated by maximizing the conditional likelihood (eq. 2):

L_C(δ | s, x) = Π_{i=1…n} exp(−Σ_{j=1…k} x_ij δ_j) / γ_{s_i}(δ)   (2)

where L_C is the conditional likelihood given the subjects' scores s, and γ_{s_i}(δ) is the elementary symmetric function of order s_i of the terms exp(−δ_j).
The θ_i parameters are then estimated with reduced bias by maximizing the weighted likelihood L_W (eq. 3):

L_W(θ_i) = √(I(θ_i)) × Π_{j=1…k} P(X_ij = x_ij | θ_i, δ_j)   (3)

where I(θ) = Σ_{j=1…k} P_j(θ)(1 − P_j(θ)) is the information function. As with any maximum likelihood estimation procedure, the parameters estimated with the CML procedure are asymptotically normally distributed, with mean equal to their maximum likelihood estimate. To assign to each individual his/her own latent trait value, a decision rule based on this estimated distribution must be defined; it is presented in the section ''Different possible estimates of the individual latent traits''.
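The WML step of eq. 3 can be sketched as follows, assuming the difficulties have already been estimated in the first step. This is a simplified stand-in for the actual optimization (we maximize the weighted log-likelihood over a fine grid rather than with a Newton-type algorithm), and the names are ours:

```python
import math

def wml_estimate(responses, difficulties, grid_step=0.001, bound=6.0):
    """Warm's weighted ML estimate of a single latent trait:
    maximizes log L(theta) + 0.5 * log I(theta) over a grid."""
    def objective(theta):
        log_lik, info = 0.0, 0.0
        for x, d in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - d)))
            log_lik += x * math.log(p) + (1 - x) * math.log(1.0 - p)
            info += p * (1.0 - p)          # Fisher information I(theta)
        return log_lik + 0.5 * math.log(info)
    grid = [i * grid_step - bound for i in range(int(2 * bound / grid_step) + 1)]
    return max(grid, key=objective)

# A response pattern symmetric around symmetric difficulties:
# the estimate lands near 0 by symmetry
theta_hat = wml_estimate([1, 1, 0, 0], [-1.5, -0.5, 0.5, 1.5])
```

Unlike plain maximum likelihood, the √I(θ) weight also yields finite estimates for perfect (all-positive or all-negative) response patterns.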

The Random Effects Rasch Model
The estimates of the random effects Rasch model parameters were obtained with marginal maximum likelihood (eq. 4), known as the MML procedure [16]. The latent trait was considered normally distributed with mean μ and variance σ². By extension, a random effects Rasch model will be called an MML-model in this study.
The δ, μ, and σ² parameters can be consistently estimated by maximizing the marginal likelihood L_M (eq. 4):

L_M(δ, μ, σ²) = Π_{i=1…n} ∫ Π_{j=1…k} P(X_ij = x_ij | θ, δ_j) dΨ(θ | μ, σ²)   (4)

where Ψ(θ | μ, σ²) is the cumulative distribution function of the latent trait θ in the studied population, assumed to follow a normal distribution with parameters (μ; σ²). The estimators of each individual latent trait, assumed to be normally distributed, could be obtained afterwards with expected a posteriori (EAP) Bayesian estimates [17,18]. EAP estimates are obtained by taking the expectation of the posterior density function of θ_i, conditional on x_i and δ̂ (eqs. 5 & 6):

θ̂_i,EAP = ∫ θ f(θ | x_i, δ̂) dθ   (5)

f(θ | x_i, δ̂) = Π_{j=1…k} P(X_ij = x_ij | θ, δ̂_j) ψ(θ | μ̂, σ̂²) / ∫ Π_{j=1…k} P(X_ij = x_ij | u, δ̂_j) ψ(u | μ̂, σ̂²) du   (6)
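The EAP expectation of eqs. 5 & 6 can be approximated with Gauss-Hermite quadrature. The sketch below assumes known difficulties and an already-estimated N(μ, σ²) prior; the function name and node count are ours:

```python
import numpy as np

def eap_estimate(responses, difficulties, mu=0.0, sigma=1.0, n_nodes=40):
    """Expected a posteriori estimate of theta (eqs. 5 & 6) by
    Gauss-Hermite quadrature, with theta = mu + sqrt(2)*sigma*node."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    thetas = mu + np.sqrt(2.0) * sigma * nodes
    # Likelihood of the response pattern at each quadrature node
    logits = thetas[:, None] - np.asarray(difficulties)[None, :]
    p = 1.0 / (1.0 + np.exp(-logits))
    x = np.asarray(responses)[None, :]
    lik = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)
    posterior = weights * lik       # normal prior absorbed by the quadrature rule
    return np.sum(thetas * posterior) / np.sum(posterior)

theta_eap = eap_estimate([1, 1, 0, 0], [-1.5, -0.5, 0.5, 1.5])
```

Because the posterior mean is pulled towards the prior mean μ, EAP estimates exhibit the shrinkage discussed later in the paper; even extreme all-positive patterns receive a finite, shrunk value.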

Including a Group Effect in a Rasch Model
The group effect can be represented by a covariate in the formulation of the Rasch model [19]. The individual latent traits θ_i are then decomposed into a part related to the group (V + g_i γ) and a part related to the individual (θ_Res,i). The model is then written as (eq. 7):

P(X_ij = x_ij | θ_Res,i, δ_j, g_i) = exp(x_ij(V + g_i γ + θ_Res,i − δ_j)) / (1 + exp(V + g_i γ + θ_Res,i − δ_j))   (7)

where g_i = 0 if the i-th individual is in the first group and g_i = 1 if the i-th individual is in the second group. The average latent trait is equal to V in the first group and to V + γ in the second group. The individual latent traits θ_i can then be computed as θ_i = V + g_i γ + θ_Res,i. We did not perform any fixed effects Rasch model with group covariates: such a model would be unidentifiable, since estimates for the fixed effects Rasch model are computed conditionally on the individuals. It was only possible to include a group covariate within a random effects Rasch model. This model has been called MML-Cov.
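A fit of this MML-Cov model, together with the Wald test of the group parameter used later in the paper, can be sketched as follows. This is a simplified illustration, not the gllamm implementation the authors used: the difficulties are assumed known, the quadrature and optimizer choices are ours, and the BFGS inverse Hessian serves only as a rough covariance estimate:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def wald_test_mml_cov(responses, group, difficulties, n_nodes=15):
    """Marginal ML fit of the Rasch model with a group covariate
    (theta_i = V + g_i*gamma + theta_Res,i, eq. 7), difficulties known,
    followed by a Wald test of H0: gamma = 0."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    d = np.asarray(difficulties, dtype=float)
    x = np.asarray(responses, dtype=float)
    g = np.asarray(group, dtype=float)

    def neg_log_lik(params):
        v, gamma, log_sigma = params
        sigma = np.exp(log_sigma)                    # keeps sigma > 0
        mu_i = v + gamma * g                         # group-specific latent mean
        thetas = mu_i[:, None] + np.sqrt(2.0) * sigma * nodes[None, :]
        p = 1.0 / (1.0 + np.exp(-(thetas[:, :, None] - d[None, None, :])))
        lik = np.prod(np.where(x[:, None, :] == 1.0, p, 1.0 - p), axis=2)
        marginal = lik @ weights / np.sqrt(np.pi)    # per-subject integral of eq. 4
        return -np.sum(np.log(marginal))

    res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], method="BFGS")
    gamma_hat = res.x[1]
    se = np.sqrt(res.hess_inv[1, 1])                 # rough standard error
    z = gamma_hat / se
    return gamma_hat, z, 2.0 * norm.sf(abs(z))

# Simulated check with a true group effect gamma = 0.8
rng = np.random.default_rng(7)
d = np.linspace(-1.8, 1.8, 10)
g = np.repeat([0.0, 1.0], 300)
theta = rng.normal(0.0, 1.0, 600) - 0.4 + 0.8 * g
resp = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta[:, None] - d[None, :]))))
gamma_hat, z, p_value = wald_test_mml_cov(resp, g, d)
```

In a production analysis the difficulties would be estimated jointly and the covariance matrix obtained from the analytic or numerically differentiated Hessian at the optimum.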

Different Possible Estimates of the Individual Latent Traits
Two different ways of estimating the individual latent traits can be proposed: a deterministic approach, in which each individual is assigned the mean of his/her estimated latent trait distribution (the WML estimate for the CML-model, the EAP estimate for the MML-models), and a stochastic approach, in which each individual is assigned one or several plausible values randomly drawn from this estimated distribution.

Different Methods to Compare Two Groups on PRO
Different methodologies have been proposed for comparing two groups of subjects A and B on PRO data.
- When using CTT, the groups are compared with a t-test on the mean scores. In our study, this method has been called the score t-test.
- When using IRT, groups can be compared using several tests.
  - Individual latent trait values can be compared with a t-test, whether these are defined as the estimated means of the individual latent trait distributions (WML-CML, EAP-MML and EAP-MML-Cov methodologies) or as plausible values drawn from these distributions (PV-CML, PV-MML and PV-MML-Cov methodologies). For example, this is how the most widely used software for Rasch analysis, the RUMM software [22], compares groups of individuals: the individual latent traits, estimated using the WML-CML methodology, are compared with a t-test.
  - Using the MML-Cov model, it is possible to perform a group comparison by testing the nullity of the parameter associated with the group covariate with a Wald test. In our study, this method has been called the Wald-test.
  - Mislevy [23] noted that estimating the variance of the latent traits within a group by calculating the variance of their individual estimates is biased, because it only corresponds to the between-individual variance estimate and ignores the within-individual variance [24,25]. With multiple imputations of plausible values (MI method), it is possible to estimate the distribution parameters of the latent traits of each group taking into account both the between-individual and the within-individual variance. One can then compare the groups with a t-test. In our study, these methods have been called MI-CML, MI-MML or MI-MML-Cov according to the model used (CML, MML or MML-Cov model).
This methodology was developed for large-scale surveys used in educational sciences (e.g. the PISA, TIMSS and NAEP studies), where the number of imputations used was between 3 and 5. Rubin recommends making between 2 and 10 imputations [24]. In our study, we performed five imputations to be comparable to studies using this methodology.
  - Finally, it has been proposed to perform group comparisons with a two-step procedure (called the 2-Steps method [26]). The first step is to estimate the difficulty parameters with the MML method; the second is to separately estimate the latent trait distribution parameters for each group, by fitting a random effects Rasch model in each group with the difficulty parameters set to the values estimated during the first step. Since this method estimates the mean and the variance of the latent traits for each group, the groups can then be compared with a t-test.
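The pooling step underlying the MI methods follows Rubin's rules: the M imputation-specific estimates of the group difference and their within-imputation variances are combined, the between-imputation variance being inflated by (1 + 1/M). A minimal sketch (the function name and example values are ours):

```python
def rubin_pool(estimates, within_variances):
    """Pool M imputation-specific estimates of a group difference:
    total variance = mean within-variance + (1 + 1/M) * between-variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    w_bar = sum(within_variances) / m
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total_var = w_bar + (1.0 + 1.0 / m) * b
    return q_bar, total_var

# Five imputed group differences and their within-imputation variances
q_bar, t = rubin_pool([0.42, 0.47, 0.40, 0.45, 0.41], [0.01] * 5)
```

The pooled t-statistic for the group comparison is then q_bar / sqrt(t), with degrees of freedom adjusted for the fraction of missing information.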
All these methodologies are summarized in figure 1. All the tests were performed with a threshold α_th = 0.05.

Comparison of Methods
To compare the methods to analyse PRO data, four criteria were studied: the type I error, the power, the position bias and the dispersion bias.
- The type I error was classically obtained by calculating the proportion of rejections of the null hypothesis among the 1000 replications of the same parameter combination when Δ was set to 0. A test of equality between the observed type I error and 0.05 was then performed with a t-test.
- The power 1 − β was obtained by calculating the proportion of rejections of the null hypothesis among the 1000 replications of the same parameter combination when Δ was different from 0. It was considered that a power variation of less than 0.05 was not relevant in practice.
- When the methodology was based on IRT:
  - We estimated the difference between the latent trait means of each group by computing the average of the differences between the means of the latent traits of groups A and B over the 1000 replicated simulations: Δ̂_obs. This average was then compared to the simulated difference Δ with a t-test. When Δ̂_obs was significantly different from Δ, we concluded that there was a statistically significant position bias. It was considered that a position bias of less than 0.02σ when Δ was equal to 0, or less than 10% of Δ when Δ was different from 0, was not relevant in practice.
  - We assumed that the variances of the two groups were equal: σ²_A = σ²_B. We estimated the latent trait variance of each group by computing the average of the latent trait variances over the 1000 replicated simulations: σ̂²_obs. This average was then compared to the simulated common variance σ² with a t-test. When σ̂²_obs was significantly different from σ², we concluded that there was a statistically significant dispersion bias. It was considered that a bias of less than 10% of σ² was not relevant in practice.
- When the methodology was based on CTT:
  - We estimated the difference between the score means of each group by computing the average of the differences between the means of the scores of groups A and B over the 1000 replicated simulations: Δ̂_S,obs. This average was then compared to the true value of the group effect Δ_S with a t-test. When Δ̂_S,obs was significantly different from Δ_S, we concluded that there was a statistically significant position bias. It was considered that a position bias of less than 10% of Δ_S was not relevant in practice. The true value of the group effect Δ_S was not known and was approximated by the difference of the expected scores in each group. The expected score in each group was computed as:

E(S | g) = Σ_{j=1…k} ∫ P(X_j = 1 | θ, δ_j) dG(θ | μ_g, σ²)

with G(θ | μ_g, σ²) the normal distribution with mean μ_g and variance σ². These integrals can be estimated using Gauss-Hermite quadrature.
  - We did not estimate the dispersion bias when the methodology was based on CTT.
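The expected score integral above can be approximated with Gauss-Hermite quadrature, as the text notes; a short sketch (function name ours):

```python
import numpy as np

def expected_score(difficulties, mu, sigma, n_nodes=40):
    """E(S | group) = sum over items of the integral of
    P(X_j = 1 | theta, delta_j) against a N(mu, sigma^2) latent trait,
    computed by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    thetas = mu + np.sqrt(2.0) * sigma * nodes          # change of variable
    p = 1.0 / (1.0 + np.exp(-(thetas[:, None]
                              - np.asarray(difficulties)[None, :])))
    return float(weights @ p.sum(axis=1) / np.sqrt(np.pi))

# Difficulties symmetric around mu = 0: each item is passed with
# probability 1/2, so the expected score is half the number of items
es = expected_score([-1.0, 0.0, 1.0], mu=0.0, sigma=1.0)   # → 1.5
```

Applying this to μ_A = −Δ/2 and μ_B = Δ/2 gives the approximated true score effect Δ_S used to assess the position bias of the CTT methods.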
Simulations and statistical analyses were performed with the Stata 11.0 software and the Gllamm package [27].

Type I Error
The type I error level was similar whether the item difficulties were considered unknown, well known or imperfectly known. We will only present the observed type I errors for unknown difficulties that had to be estimated (table 2).
The type I errors observed for the score t-test, WML-CML, PV-CML, EAP-MML, PV-MML and Wald-test methods were not significantly different from 0.05. The MI-CML and MI-MML methodologies deflated the type I error below 0.05, while the EAP-MML-Cov, PV-MML-Cov, MI-MML-Cov and 2 Steps methodologies inflated it, whatever the values of the simulation parameters.

Power
The methods for which the observed type I errors were significantly greater than 0.05 were excluded from the power analysis. We therefore excluded EAP-MML-Cov, PV-MML-Cov, MI-MML-Cov and 2 Steps methods.
The knowledge of the items difficulties (unknown, well known or imperfectly known) did not affect the comparison methodologies power. We will only present the observed powers for unknown difficulties (table 3 and figure 2).
The methods respecting the type I error could be gathered into three groups according to their power: (i) the tests with low power, i.e. the methods based on multiple imputation (MI-MML and MI-CML methods); (ii) the tests with moderate power, i.e. the methods based on single imputations of plausible values (PV-MML and PV-CML methods); and (iii) the tests with high power, i.e. the methods based on the comparison of the individual latent traits defined as the means of their estimated distributions (EAP-MML and WML-CML methods), the Wald-test method and the score t-test.
A global increase of the sample size resulted in an increase of the observed power. In 67% of the cases, this increase was relevant in practice, whatever the values of the other parameters (figure 2). Cases where the difference was not relevant corresponded to observed powers greater than 0.9, resulting in a ceiling effect.
Increasing the number of items resulted in an increase of the observed power. In 55% of the cases, the power increase resulting from the transition from 5 to 10 items was relevant in practice, whatever the values of the other parameters. Cases where this increase was not relevant corresponded either to observed powers greater than 0.9, or to Δ equal to 0.2σ. Finally, the item difficulties distribution did not affect the power of the comparison methods.

Bias
Position bias. The knowledge of the item difficulties (unknown, well known or imperfectly known) did not affect the position bias estimate (the difference between Δ̂_obs and Δ). We will only present the estimated position biases for unknown difficulties (table 4).
Score t-test, WML-CML, PV-CML, MI-CML, EAP-MML-Cov, PV-MML-Cov, MI-MML-Cov, 2 Steps and Wald test methodologies did not present any position bias relevant in practice whatever the values of the simulation parameters.
Methods based on a random effects Rasch model without covariates (EAP-MML, PV-MML and MI-MML methods) did not present a relevant position bias when the simulated difference Δ was equal to 0, but presented a position bias systematically relevant in practice when Δ was greater than 0. This bias was then greater than 30% of Δ in all the cases.

Dispersion biases. The dispersion bias estimates (the difference between σ̂²_obs and σ²) were similar when the item difficulties were considered unknown or well known. However, the dispersion bias estimates increased when the item difficulties were considered as imperfectly known: these estimated dispersion biases were greater than those estimated by considering the difficulties as unknown or perfectly known, by an average of 15% of σ², whatever the values of the other parameters. However, the knowledge of the item difficulties did not affect the effect of the other simulation parameters on the observed dispersion biases. We will only present the dispersion biases estimated for unknown difficulties (table 5).
The 2 Steps, Wald test and PV-MML-Cov methods were the only methodologies which did not present any dispersion bias.

Example
We illustrate the results of this simulation study using data coming from the surveillance program for upper-extremity musculoskeletal disorders (UE-MSDs) in the working population of the French Loire Valley region [28]. One of the objectives of this study was to compare the quality of life of workers according to their occupational category.
In this example, we focused on comparing the physical role level of blue collar workers to that of other workers. The physical role was estimated using the RP (Role Physical) sub-scale of the SF-36 questionnaire [13], including four dichotomous items. We only included individuals aged between 21 and 50 years to take into account the potential effect of age as a confounding variable. 591 blue collar workers and 828 other workers aged from 21 to 50 years completed the SF-36 questionnaire. The observed item non-response rate was very low (1.2% in blue collar workers and 1.0% in other workers).
We used all the methods which did not result in an observed type I error significantly greater than 0.05 to compare the physical role across the workers' occupational categories. The methods used were either based on CTT (the score t-test) or based on IRT (methods based on fixed effects Rasch models: WML-CML, PV-CML and MI-CML; methods based on random effects Rasch models: EAP-MML, PV-MML and MI-MML; and a method based on a random effects Rasch model including a group covariate: the Wald-test method). The score used for the t-test method was calculated as recommended by the SF-36 manual, imputing missing responses by the average of the observed responses for each individual who responded to at least half of the items [13]. The results of all these comparisons are presented in table 7.
Only four methods highlighted a significant physical role difference according to the occupational category: the score t-test, WML-CML, EAP-MML and Wald-test methods. These were the methods presenting the highest powers in our simulation study; in this example, their powers were essentially identical. Finally, the estimation of the latent trait difference varied across methodologies: EAP-MML and WML-CML provided the lowest estimates of the latent trait difference. We could extrapolate, from the simulation study, that only the score t-test and Wald-test methods were unbiased.
In a second step, we randomly generated missing data and compared once again the physical role of blue collar workers to that of other workers, to study the effect of missing data on these group comparison methods. The simulated probability of an item non-response was set to 20%. We simulated whether an individual responded to an item using Bernoulli trials. Such a method for generating missing data ensured the non-informativity of the missing data. We used the same comparison methods as previously. The results of these comparisons are presented in table 8.
The estimation of the score difference between groups using the t-test method varied by more than 20% depending on whether the data were complete or incomplete. Although the missing data were fully non-informative, the estimated score difference between groups was lower in the presence of missing data. On the other hand, the estimation of the latent trait difference between groups using non-stochastic IRT methods (WML-CML, EAP-MML and Wald-test methods) did not seem to be affected by the presence of missing data: for these methods, the estimated latent trait difference varied by less than 5%. Finally, when data were missing, only two methods highlighted a significant physical role difference according to the occupational category: the EAP-MML and Wald-test methods. The score t-test method no longer highlighted such a difference.

Choice of the Most Efficient Methods for Comparing Two Groups of Individuals on PRO Data
The preferred comparison methods are those for which the type I error is not significantly greater than 5%. Among those, the methods with the greatest power are preferred, and among these, the methods with the smallest biases are the ones to consider.
Type I error. The methods based on the analysis of individual latent traits estimated by a Rasch model with a group covariate (EAP-MML-Cov, PV-MML-Cov and MI-MML-Cov methods) and the 2 Steps method resulted in an unacceptable rate of type I error. These methods were therefore unsuitable for latent trait comparison.
Power. Among the methods controlling the type I error, the methods based on multiple imputations of plausible values (MI-CML and MI-MML methods) had the lowest power. This power loss can be associated with their dispersion biases: their estimated variances σ̂²_obs were all biased and greater than the simulated variances σ². These biases were related to the addition of the within-subject variance component to the latent trait variance [20]. Indeed, in the framework of cross-sectional studies, each individual latent trait is measured only once, which does not make it possible to assess individual latent trait variability. Therefore, if one focuses on the latent trait dispersion parameters within a population at a given time (as in cross-sectional studies), only the between-subject variance should be taken into account. Methods based on plausible values (PV-CML and PV-MML methods) presented a moderate power. For the PV-CML method, this limited power can be linked with the increase of the dispersion biases associated with the use of plausible values: methods based on conditional likelihood for estimating individual latent traits are known to result in a biased, inflated variance estimate [29], and the addition of the within-subject variance component with plausible values methodologies can only increase this bias. For the PV-MML method, this limited power can be linked with the dispersion biases due to the use of Bayesian expected a posteriori estimates for estimating individual latent traits [30]: these expected a posteriori estimates are shrunk towards their a priori value, so the Δ̂_obs are decreased compared to the simulated Δ.
The WML-CML, EAP-MML, Wald-test and score t-test methods presented the highest, almost identical powers.
Biases. As expected, the WML-CML method did not lead to any relevant position bias in practice, but it did lead to dispersion biases when estimating the latent trait distribution parameters [31]: the estimated variance ŝ²_obs was greater than the simulated variance s². The EAP-MML method led to both position and dispersion biases when estimating the latent trait distribution parameters: the estimated differences D̂_obs were decreased compared to the simulated D, and the estimated variance ŝ²_obs was less than the simulated s². These biases were related to the shrinkage phenomenon associated with the Bayesian posterior estimates of the individual latent traits [29].
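The opposite directions of these two dispersion biases can be illustrated with a normal-normal model (a sketch with arbitrary variances, not the study's settings): the posterior mean is the noisy individual estimate multiplied by a shrinkage factor λ = σ²/(σ² + ω²) < 1, so EAP estimates under-disperse while raw ML-type estimates over-disperse.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
sigma2, omega2 = 1.0, 1.0         # latent and measurement variances (illustrative)
lam = sigma2 / (sigma2 + omega2)  # posterior-mean shrinkage factor

theta = rng.normal(0.0, np.sqrt(sigma2), n)      # simulated latent traits
y = theta + rng.normal(0.0, np.sqrt(omega2), n)  # noisy ML-type individual estimates
eap = lam * y                                    # EAP estimates, shrunk towards the prior mean 0

# y.var() ~ sigma2 + omega2 (inflated), eap.var() ~ lam * sigma2 (deflated)
print(theta.var(), y.var(), eap.var())
```

A group mean shift is attenuated by the same factor λ, which is the mechanism behind the decreased D̂_obs.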
The Wald-test and score t-test methods did not lead to any position or dispersion bias when estimating the parameters of the latent trait distribution.
Influence of the simulation parameters. For all the considered methods, increasing the sample size increased the tests' power. However, no link was found between the sample size and the magnitude of the observed biases.
An increase in the number of items reduced the position and dispersion biases and increased the tests' power. This phenomenon is known [32], and some authors recommend estimating the variances and means of latent traits with a Rasch model only if the questionnaire comprises at least 10 items [17]. Since the Wald-test method provided unbiased estimates even with fewer than 10 items, this recommendation does not necessarily need to be followed when performing group comparisons with that method. The power gain obtained by increasing the number of items is due to the subjective nature of the latent traits: latent variables are not directly observable, so the accuracy of their estimates depends largely on the tool used to obtain them. Increasing the number of items of a questionnaire improves the accuracy of the latent trait estimates, and thus the power of the tests performed with this questionnaire [33]. Finally, a change in the distribution of the item difficulties affected neither the tests' power nor their position biases. However, such a change did alter the dispersion biases of the methods based on a Rasch model, and the variance of the scores for the methods based on score analysis. In addition, a ceiling effect was observed when the item difficulties were drawn from a mixture of Gaussian distributions.
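The link between the number of items and estimation accuracy can be made concrete through the Rasch test information: the asymptotic standard error of a latent trait estimate is 1/√(Σ_j p_j(1 − p_j)), so adding items can only decrease it. A short sketch (the item difficulty grids below are arbitrary):

```python
import numpy as np

def rasch_se(theta, difficulties):
    """Asymptotic standard error of the latent trait estimate,
    from the Rasch test information evaluated at theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties))))
    return 1.0 / np.sqrt(np.sum(p * (1.0 - p)))

# more (equally spread) items -> smaller standard error at theta = 0
for n_items in (5, 10, 20):
    print(n_items, rasch_se(0.0, np.linspace(-2.0, 2.0, n_items)))
```

With items of comparable difficulty spread, the standard error shrinks roughly as 1/√(number of items), which translates into the observed power gain.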

Influence of the knowledge of the item difficulties. Several scenarios were considered, with the item difficulty parameters treated as unknown, well known, or imperfectly known. The parameters chosen to simulate imperfectly known difficulties corresponded to a rather poor precision that might be rarely encountered in real situations. Nevertheless, the knowledge of the item difficulties had a negligible impact on the power estimates of the different comparison methods, as well as on the estimated position biases [33]. Only the variance estimate of the latent traits was slightly increased when the item difficulties were imperfectly known.
It is therefore possible to reuse difficulty parameters previously estimated during an IRT-based validation of a questionnaire to perform group comparisons with IRT-based methods on PRO measurements in clinical trials or epidemiological studies. Moreover, fixing these difficulty parameters makes it possible to compare patients coming from different studies that used the same questionnaire.
Influence of missing data and limitations of the study. A limitation of this study is that it does not take into account the possible presence of missing data. An illustrative real data example was used to explore this issue. This example showed substantial changes in the properties of the considered comparison methods depending on whether data were missing or not. Even when missing data are non-informative, which is the most favourable case, the CTT-based method appears strongly disturbed by such missing data. By contrast, the IRT-based methods seem less affected by missing data, in view of the example presented in this article. These differences can be explained by the fact that with IRT, an individual latent trait is estimated directly from the items the individual has answered, without taking the missing item responses into account. With Rasch family models, such estimations are consistent thanks to the specific objectivity property of these models. With CTT, on the other hand, measurements are performed by calculating scores; when data are missing, the score can only be computed after imputing the missing responses, which potentially generates biases. It seems important to continue this work by comparing these group comparison methods under different missing data scenarios, leading to informative or non-informative missing data (missing completely at random or not). Finally, even though more and more questionnaires are validated using IRT methods, the Rasch models investigated in this study may seem too restrictive to cover all situations encountered in clinical research studies (here, the items were necessarily dichotomous, and the item difficulties had to be independent of the patient groups studied).
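The way IRT sidesteps missing item responses can be sketched as follows, assuming the item difficulties are known from a previous validation (the function name and numerical values are illustrative): the likelihood is simply restricted to the answered items, so no imputation is needed.

```python
import numpy as np

def rasch_theta_ml(responses, difficulties, n_iter=50):
    """ML estimate of a latent trait using only the answered items.

    responses: 0/1 answers, with np.nan for missing items.
    difficulties: known item difficulty parameters.
    """
    x = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    obs = ~np.isnan(x)                    # missing answers are simply ignored
    x, b = x[obs], b[obs]
    theta = 0.0
    for _ in range(n_iter):               # Newton-Raphson on the log-likelihood
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        info = np.sum(p * (1.0 - p))      # Fisher information at theta
        theta += np.sum(x - p) / info     # score / information step
    return theta

# This respondent skipped the second item: the estimate relies
# only on the three items actually answered.
theta_hat = rasch_theta_ml([1, np.nan, 1, 0], [-1.0, -0.5, 0.5, 1.0])
```

A score-based analysis of the same respondent would first have to impute the missing answer, or discard the respondent altogether.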
It therefore appears necessary to pursue this work by analysing extensions of the Rasch model that allow for polytomous items (such as the Partial Credit Model or the Rating Scale Model) and for items whose difficulties depend on the patient groups studied (by integrating the differential item functioning phenomenon into the studied models).

Conclusion
If the data follow both a Rasch model and a CTT-based model, the most appropriate methods for comparing two groups of patients on PRO measurements are the comparison of scores by t-test when analysing such variables with CTT, and the covariate Wald test, performed on a random-effects Rasch model including a group covariate, when analysing them with IRT. These two methods displayed very similar powers and unbiased estimates.
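The CTT side of this recommendation can be sketched in a few lines (a minimal simulation; the sample size, number of items, difficulty grid and group effect are all illustrative choices): simulate dichotomous Rasch responses for two groups differing by a shift of the latent mean, sum the items into scores, and compare the scores with a t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
difficulties = np.linspace(-1.5, 1.5, 10)   # illustrative item difficulties

def simulate_scores(latent_mean, n):
    """Simulate dichotomous Rasch responses and return the total scores."""
    theta = rng.normal(latent_mean, 1.0, size=(n, 1))
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
    return (rng.random((n, difficulties.size)) < p).sum(axis=1)

scores_a = simulate_scores(0.0, 300)   # reference group
scores_b = simulate_scores(1.0, 300)   # hypothetical group effect of 1.0 on the latent mean
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
```

With complete data, as here, this score t-test matches the power of the covariate Wald test on the random-effects Rasch model; the differences discussed above only arise once data are missing.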

Author Contributions
Conceived and designed the experiments: JFH JBH VS. Performed the experiments: JFH. Analyzed the data: JFH. Contributed reagents/materials/analysis tools: JFH JBH GK TLN VS YR. Wrote the paper: JFH JBH VS.