Power and Sample Size Determination for the Group Comparison of Patient-Reported Outcomes with Rasch Family Models

Background: Patient-reported outcomes (PRO), which comprise all measures self-reported by the patient, are important endpoints in clinical trials and epidemiological studies. Models from Item Response Theory (IRT) are increasingly used to analyze these outcomes, which cannot be directly observed and therefore bring a latent variable into play. Preliminary developments have been proposed for sample size and power determination for the comparison of PRO in cross-sectional studies comparing two groups of patients when an IRT model is intended to be used for analysis. The objective of this work was to validate these developments in a large number of situations reflecting real-life studies.
Methodology: The method to determine the power relies on the characteristics of the latent trait and of the questionnaire (distribution of the items), the difference between the latent variable means in the two groups, and the variance of this difference estimated using the Cramer-Rao bound. Different scenarios were considered to evaluate the impact of the characteristics of the questionnaire and of the variance of the latent trait on the performance of the Cramer-Rao method. The power obtained using the Cramer-Rao method was compared to the power obtained from simulations.
Principal Findings: Powers achieved with the Cramer-Rao method were close to powers obtained from simulations when the questionnaire was suitable for the studied population. Nevertheless, the Cramer-Rao method underestimated the power when the questionnaire was less suitable for the population. In addition, the Cramer-Rao method remained valid whatever the value of the variance of the latent trait.
Conclusions: The Cramer-Rao method is adequate to determine the power of a test of group effect at the design stage for two-group comparison studies including patient-reported outcomes in health sciences. At the design stage, the questionnaire used to measure the intended PRO should be carefully chosen in relation to the studied population.


Introduction
Patient-reported outcomes (PRO) are important endpoints in clinical trials and epidemiological studies. These outcomes comprise all measures self-reported by the patient regarding the patient's health, the disease and its impact, or its treatment. They include health-related quality of life, pain, patient satisfaction, psychological well-being, symptoms, and treatment adherence or preference [1]. PRO first gained importance as secondary endpoints because they can help to evaluate the effects of treatment on the patient's life, or to study the quality of life of patients along with the disease progression in order to adapt their care. They can also be used as primary endpoints, especially in chronic diseases such as cancer [2], to compare two standard treatments with comparable survival outcomes or to help decision making.
The deleterious impact of each treatment on the patient's quality of life can also be evaluated [3].
The singularity of PRO lies in the fact that the outcome, such as quality of life or wellness, cannot be directly observed. This particular outcome is defined as a latent variable. Generally, a questionnaire is the instrument that indirectly measures the latent variable, and the responses of patients to its items are then analyzed. Models from Item Response Theory (IRT) link the probability of an answer to an item with item parameters and a latent variable. This theory has gained importance in the PRO field compared to Classical Test Theory (CTT), where models are based on a score that often sums the responses to the items. IRT has shown advantages such as the management of missing data, the possibility of obtaining an interval measure for the latent trait, the comparison of latent trait levels independently of the instrument, and the management of possible floor and ceiling effects [4].
With the development of patient-reported outcomes in clinical research, guidelines were published for the construction, validation and administration of questionnaires [5][6][7]. However, the literature presents few references on the design stage. In particular, the sample size requirements when IRT is intended to be used for the analysis of PRO seem to lack theoretical work [8,9]. When PRO are used as primary endpoints in a group comparison study, it is essential at the design stage to correctly determine the sample size needed to achieve the desired power for detecting a clinically meaningful difference in the future analysis. An inadequate sample size may lead to misleading results and incorrect conclusions. General recommendations on the sample size can be found in the field of education. It should be highlighted that these recommendations are usually made without any theoretical justification. It is accepted that the sample size has to increase with the complexity of the model [10]: 50 individuals were proposed for the simplest IRT model, the Rasch model [11], 200 respondents have been suggested for the two-parameter logistic model [12], and 500 examinees for the graded-response model [13]. Consequently, publications on health outcomes assessments generally make few comments on sample size determination, as no analytical formula for the sample size exists.
It has been recently pointed out that the widely-used formula for the comparison of two normally distributed endpoints in two groups of patients was inadequate in the IRT setting [9]. Indeed, the power achieved by the tests of group effects using IRT modeling in a simulation study was lower than the expected power using the formula for normally distributed endpoints. Subsequently, Hardouin et al [14] have proposed a methodology to determine power related to sample size for PRO cross-sectional studies comparing two groups of patients in the framework of the Rasch model. The power determination depends on the difference between the expected means in the two groups (the group effect) and its standard error. The key point of the method is to estimate this standard error using the Cramer-Rao bound. This theoretical approach was first validated by simulation studies in some cases (small variance, appropriate questionnaire for the population under study) that may not reflect what is encountered in practice. Whether the method would perform as well in a large variety of situations often met in clinical and epidemiological studies remains unknown. As a matter of fact, the population of the study can have heterogeneous levels of the latent variable. Moreover, the PRO instrument might be more or less suitable for the population under study. Indeed, the items composing the instrument can be more or less relevant for the intended population of the study. For example, items from a disease-specific questionnaire (such as the QLQ-C30 [15] evaluating the quality of life of cancer patients) can be too difficult in a newly-diagnosed population in the sense that items specific to the disease can almost never be encountered in a population where the disease was recently detected, potentially before most of the symptoms appear. The measures provided by the PRO might not be reliable for all patients and the power could therefore be impacted by the choice of the questionnaire.
The purpose of this study was to validate the Cramer-Rao method for PRO cross-sectional studies comparing two groups of patients using the Rasch model. The impact of the variation of the variance of the latent variable (inter-patient heterogeneity regarding the latent variable) and of the distribution of the item parameters (appropriateness of the questionnaire for the population) on the proposed methodology has been studied by comparing the results of the Cramer-Rao method to the results of a simulation study.

Methods
At the planning stage, the calculation of a sample size is usually based on a statistical test to detect a clinically meaningful effect at desired levels of type I and type II errors. In the case of the comparison of mean levels of PRO measures in two groups of patients, the widely-used formula for the comparison of two normally distributed endpoints may apply [16]. The formula assumes that the two groups are independent and that the variance of the endpoint, $\sigma^2$, is common across the groups. The hypotheses for the two-sided test of comparison are defined as $H_0: \mu_0 = \mu_1$ against $H_1: \mu_0 \neq \mu_1$, where $\mu_0$ and $\mu_1$ are the means of the endpoint in the first group and the second group respectively. The number of patients to be included in the first group, $N_0$, is determined by specifying an expected difference in the means of the PRO measures ($\mu_0 - \mu_1$) and the common variance ($\sigma^2$) as well as the type I error ($\alpha$) and the desired power ($1-\beta$) of the test:
$$N_0 = \frac{k+1}{k}\,\frac{\sigma^2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{(\mu_0 - \mu_1)^2} \qquad (1)$$
where $N_1 = kN_0$ is the number of patients in the second group and $z_\eta$ is the quantile of order $\eta$ of the standard normal distribution. Although this formula is adequate for manifest variables such as quality of life scores, it seems to incorrectly determine the sample size for latent variables [9], as it does not take into account the uncertainty due to the estimation of the latent variable. This formula is therefore not suited to studies intending to use IRT models for the analysis.
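As an illustration of the classical formula for two normally distributed endpoints, $N_0 = \frac{k+1}{k}\,\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / (\mu_0-\mu_1)^2$ with $N_1 = kN_0$, the following minimal Python sketch computes the size of the first group. The function name and its default arguments are illustrative assumptions and are not part of the original work.

```python
import math
from statistics import NormalDist


def n_first_group(mu0, mu1, sigma2, alpha=0.05, power=0.80, k=1.0):
    """Size N0 of the first group for a two-sided comparison of two normal means.

    The second group is assumed to contain N1 = k * N0 patients.
    (Hypothetical helper name; classical formula for manifest variables only.)
    """
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    n0 = (k + 1) / k * sigma2 * (z(1 - alpha / 2) + z(power)) ** 2 / (mu0 - mu1) ** 2
    return math.ceil(n0)  # round up to a whole number of patients


# Example: difference of 0.5, unit variance, 5% two-sided type I error, 80% power
print(n_first_group(0.0, 0.5, 1.0))
```

As stated in the text, this calculation ignores the uncertainty in the estimation of a latent variable, so it is shown only as the baseline against which the IRT-specific method is motivated.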

Sample Size and Power Determinations in IRT
The Rasch model. In IRT, the link between a latent variable, that is the non-directly observable variable that the PRO instrument intends to measure (quality of life for example), and item parameters is modeled. Amongst the large family of IRT models, the Rasch model [17,18] is widely used for dichotomous items in health sciences. It models the probability that a person $i$ gives a response $x_{ij}$ to an item $j$ by a logistic model with two parameters: (i) the value of the latent variable of the person, $\theta_i$, and (ii) the item parameter associated with item $j$, $\delta_j$. For a questionnaire composed of $J$ dichotomous items answered by $N$ patients, the Rasch mixed model can be written as follows:
$$P(X_{ij} = x_{ij} \mid \theta_i, \delta_j) = \frac{\exp\left(x_{ij}(\theta_i - \delta_j)\right)}{1 + \exp(\theta_i - \delta_j)}, \qquad i = 1,\dots,N,\; j = 1,\dots,J$$
where $x_{ij}$ is a realization of the random variable $X_{ij}$ ($x_{ij} = 0$ for the most unfavorable response, $x_{ij} = 1$ for the most favorable one). $\delta_j$ is also called the difficulty of item $j$. As the value of $\delta_j$ increases, the item becomes more difficult, which means that patients are less likely to answer the item positively. For example, an item "Does your health allow you to run for an hour?" will be more difficult than an item "Does your health allow you to dress yourself?" if the positive answer is defined as "yes". $\theta_i$ is a realization of the random variable $\Theta$, generally assumed to have a Gaussian distribution. In this case, the parameters of the Rasch model can be estimated by marginal maximum likelihood (MML) [19]. A constraint has to be adopted to ensure the identifiability of the model. The nullity of the mean of the latent variable ($\mu = 0$) is often used for this purpose.
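The response probability of the Rasch model, $\exp(x(\theta-\delta)) / (1+\exp(\theta-\delta))$ for a dichotomous response $x$, can be sketched in a few lines of Python. The function name is a hypothetical helper introduced here for illustration.

```python
import math


def rasch_prob(x, theta, delta):
    """P(X = x | theta, delta) for a dichotomous item under the Rasch model.

    theta: latent variable level of the person; delta: item difficulty.
    """
    if x not in (0, 1):
        raise ValueError("x must be 0 (unfavorable) or 1 (favorable)")
    p_positive = 1.0 / (1.0 + math.exp(-(theta - delta)))  # logistic link
    return p_positive if x == 1 else 1.0 - p_positive


# A more difficult item (larger delta) is less likely to get a positive answer
print(rasch_prob(1, theta=0.0, delta=-1.0), rasch_prob(1, theta=0.0, delta=1.0))
```

When the person's level equals the item difficulty, the probability of a positive answer is exactly one half, which is what gives the difficulty parameter its interpretation on the latent trait scale.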
Power estimation using the Cramer-Rao bound. In the design of a cross-sectional study for the comparison of two groups of patients in IRT, we are interested in the evaluation of a group effect, $\gamma = \mu_1 - \mu_0$, defined as the difference between the means of the latent variable in the two groups. Let $N_0$ and $N_1$ be the expected sample sizes in the first group and the second group respectively. To identify the model presented above, the constraint of the nullity of the mean of the latent variable, $\mu$, is adopted. The mean $\mu$ is the mean of $\mu_0$ and $\mu_1$, each weighted by the sample sizes $N_0$ and $N_1$. Consequently, $\mu_0 = -\frac{N_1\gamma}{N_0+N_1}$ and $\mu_1 = \frac{N_0\gamma}{N_0+N_1}$. Let $\Theta$ be a random variable representing the latent variable, with $\Theta \sim N(\mu_0, \sigma^2)$ and $\Theta \sim N(\mu_1, \sigma^2)$ in the first and the second group respectively.
The variance of the latent trait, $\sigma^2$, is assumed to be equal in the two groups. The mixed Rasch model including a covariate to estimate a group effect $\gamma$ can be expressed as follows:
$$P(X_{ij} = x_{ij} \mid \theta_i, \gamma, \delta_j) = \frac{\exp\left(x_{ij}(\theta_i + g_i\gamma - \delta_j)\right)}{1 + \exp(\theta_i + g_i\gamma - \delta_j)}$$
where the random effect $\theta_i$ is centered on 0, with $g_i = -\frac{N_1}{N_0+N_1}$ in the first group and $g_i = \frac{N_0}{N_0+N_1}$ in the second group in order to meet the identifiability constraint. The sample size determination often relies on the Wald test to assess whether the group effect is significant. The following hypotheses are to be tested: $H_0: \gamma = 0$ against $H_1: \gamma \neq 0$. To perform the test, an estimate $\Gamma$ of $\gamma$ and its variance are required.
The test statistic $\Gamma / \sqrt{\mathrm{var}(\Gamma)}$ follows a normal distribution $N(0,1)$ under $H_0$. At the design stage, Hardouin et al. [14] proposed to use Fisher's information and the Cramer-Rao (CR) bound to obtain an analytical formula for the standard error of $\Gamma$. This method takes into account the characteristics of the questionnaire by using the parameters of the items to estimate the variance of the group effect. It also incorporates the uncertainty related to the estimation of the latent trait in the IRT model. At the design stage, the item parameters are set to planning values, as are $N_0$, $N_1$, $\gamma$ and $\sigma^2$. In addition, as the patients' responses are not known, they have to be determined. For each possible response pattern ($2^J$ patterns for $J$ binary items), the associated probability is computed for each group using the Rasch model, conditionally on the planned values of $N_0$, $N_1$, $\gamma$ and $\sigma^2$. The expected frequency of each response pattern in each group is then determined [14]. The dataset created with the response patterns and their associated expected frequencies is analyzed using a mixed Rasch model including a group effect, in order to estimate the variance of the group effect using CR and the power of the Wald test.
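The step that builds the expected frequency of each of the $2^J$ response patterns can be sketched as follows. This is a minimal illustration, not the Raschpower implementation: the marginal probability of a pattern is obtained by integrating the Rasch probabilities over a normal latent distribution using a simple evenly spaced grid (an assumption made here for simplicity), and the group means are set to $-N_1\gamma/(N_0+N_1)$ and $N_0\gamma/(N_0+N_1)$ so that the weighted mean is zero. All function names are hypothetical.

```python
import itertools
import math


def pattern_probabilities(deltas, mean, sigma2, n_nodes=201):
    """Marginal probability of each of the 2**J response patterns in one group.

    The latent variable is N(mean, sigma2); the integral is approximated on an
    evenly spaced grid over mean +/- 6 standard deviations (illustrative choice).
    """
    sd = math.sqrt(sigma2)
    step = 12 * sd / (n_nodes - 1)
    nodes = [mean - 6 * sd + i * step for i in range(n_nodes)]
    dens = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in nodes]
    total = sum(dens)
    weights = [d / total for d in dens]  # normalised so the weights sum to 1
    probs = {}
    for pattern in itertools.product((0, 1), repeat=len(deltas)):
        p = 0.0
        for theta, w in zip(nodes, weights):
            lik = 1.0
            for x, delta in zip(pattern, deltas):
                p1 = 1.0 / (1.0 + math.exp(-(theta - delta)))
                lik *= p1 if x == 1 else 1.0 - p1
            p += w * lik
        probs[pattern] = p
    return probs


def expected_frequencies(deltas, n0, n1, gamma, sigma2):
    """Expected frequency of every response pattern in each of the two groups."""
    mu0 = -n1 * gamma / (n0 + n1)  # group means with a weighted mean of zero
    mu1 = n0 * gamma / (n0 + n1)
    f0 = {pat: n0 * p for pat, p in pattern_probabilities(deltas, mu0, sigma2).items()}
    f1 = {pat: n1 * p for pat, p in pattern_probabilities(deltas, mu1, sigma2).items()}
    return f0, f1


f0, f1 = expected_frequencies([-1.0, -0.5, 0.0, 0.5, 1.0], 100, 100, 0.5, 1.0)
print(len(f0), round(sum(f0.values()), 6))
```

The resulting table of patterns and expected frequencies is the kind of dataset that, according to the text, is then analyzed with a mixed Rasch model to obtain the variance of the group effect; that fitting step is not reproduced here.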
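The expected power of the test of the group effect based on the Cramer-Rao bound (CR), $1-\hat{\beta}_{CR}$, can be approximated by [14]:
$$1-\hat{\beta}_{CR} \approx 1 - \Phi\left(z_{1-\alpha/2} - \frac{\gamma}{\sqrt{\widehat{\mathrm{var}}(\hat{\gamma})}}\right) \qquad (2)$$
with $\gamma$ assumed to take a positive value, $\Phi$ the cumulative distribution function of the standard normal distribution, $z_{1-\alpha/2}$ the quantile of order $1-\alpha/2$ of the standard normal distribution, and $\widehat{\mathrm{var}}(\hat{\gamma})$ evaluated using the Cramer-Rao bound.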
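Once a value for the variance of the estimated group effect is available, the approximate power $1 - \Phi\left(z_{1-\alpha/2} - \gamma/\sqrt{\widehat{\mathrm{var}}(\hat{\gamma})}\right)$ is a one-line computation. The sketch below assumes the variance is supplied by the user (for instance from the Cramer-Rao bound); the function name is hypothetical.

```python
from statistics import NormalDist


def cr_power(gamma, var_gamma, alpha=0.05):
    """Approximate power of the Wald test of the group effect.

    gamma: planned group effect (assumed positive);
    var_gamma: variance of the estimated group effect, supplied by the user.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1.0 - nd.cdf(z_crit - gamma / var_gamma ** 0.5)


# Example with an arbitrary, purely illustrative variance
print(round(cr_power(0.5, 0.04), 3))
```

The approximation only accounts for the rejection region on the side of the positive effect, which is why the text requires $\gamma$ to be positive.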
The whole procedure has been implemented in the free Raschpower module accessible at http://rasch-online.univnantes.fr. This module determines the expected power of the test of the group effect based on the Cramer-Rao bound, given the expected values of the sample size in each group ($N_0$ and $N_1$), the group effect ($\gamma$), the variance of the latent variable ($\sigma^2$) and the item parameters ($\delta_j$) defined by the user.

Simulation Study
To validate the Cramer-Rao method, the power determined with this method was compared to the power obtained by a simulation study, used as a reference.
Generation of data. Responses to $J$ dichotomous items of two groups of patients were simulated using a mixed Rasch model where the latent variable has normal distributions $N(\mu_0, \sigma^2)$ and $N(\mu_1, \sigma^2)$ in the first and the second group respectively.
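A data-generation step of this kind can be sketched as follows, assuming a normal latent variable with common variance and group means $-N_1\gamma/(N_0+N_1)$ and $N_0\gamma/(N_0+N_1)$. The function name, the data layout and the use of Python's `random` module are illustrative assumptions; the original simulations were not necessarily produced this way.

```python
import math
import random


def simulate_rasch(deltas, n0, n1, gamma, sigma2, seed=None):
    """Simulate dichotomous responses of two groups under a mixed Rasch model.

    Returns a list of (group, responses) tuples, with group coded 0 or 1.
    """
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    means = (-n1 * gamma / (n0 + n1), n0 * gamma / (n0 + n1))
    data = []
    for group, n in enumerate((n0, n1)):
        for _ in range(n):
            theta = rng.gauss(means[group], sd)  # latent level of the patient
            responses = [
                int(rng.random() < 1.0 / (1.0 + math.exp(-(theta - delta))))
                for delta in deltas
            ]
            data.append((group, responses))
    return data


data = simulate_rasch([-1.0, -0.5, 0.0, 0.5, 1.0], 500, 500, 0.8, 1.0, seed=1)
print(len(data), data[0])
```

Each simulated dataset would then be analyzed with a mixed Rasch model and a Wald test of the group effect, which requires a model-fitting routine that is outside the scope of this sketch.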
To study the impact of the values of the item difficulties, the distribution of the items could vary in two ways: the regularity of the spacing of the items, and the gap between the mean of the latent variable and the mean of the item distribution. To obtain item difficulties that are quite regularly spaced, their values were set to the percentiles of a chosen probability distribution, namely a normal distribution with the same mean and variance as the latent trait distribution. The questionnaire will therefore estimate the patients' levels of quality of life with the same accuracy whatever the level of quality of life on the continuum of the latent trait, as shown in Figure 1 (subfigure A). To obtain irregularly spaced item difficulties, an equiprobable mixture of two Gaussian distributions was used. When the spacing is irregular, the estimates of the patients' levels (of quality of life, for example) will be more precise where difficulties are close to each other than where they are far apart. We can see in Figure 1 (subfigure B) that quality of life levels around -1 will be estimated more precisely than quality of life levels between -0.5 and 0.5. Irregular spacing of item difficulties is probably encountered more often in practice than regular spacing.
The distribution of the items could be centered on the same mean as the latent trait, or a gap, $\Delta$, between the mean of the latent trait and the mean of the item difficulties could be simulated. A positive gap is illustrated in Figure 1 (subfigures C and D). The latent variable distribution and the item distribution are then no longer overlaid. The most difficult items of the questionnaire (on the right of the distribution) will be too difficult for the population. Hence, a very small proportion of the patients will respond positively to these items, while most of the patients will respond positively to the easiest items (on the left of the distribution), leading to a floor effect. Due to this floor effect, the estimates of the patients' levels will be less accurate on the left of the latent trait distribution (for poor levels of quality of life, for example). In practice, a floor effect can occur when a disease-specific population answers a generic questionnaire. For example, patients with serious physical impairment are unlikely to answer positively to physical functioning items such as the ability to walk a block, to run or to climb stairs (examples of items from the physical functioning dimension of the generic questionnaire SF-36). Conversely, a negative gap will lead to a ceiling effect, as the items will be too easy for the studied population.
Parameters of the simulation study. The following values of the parameters were used in the simulation study:
- The number of individuals was equal in both groups ($N_0 = N_1 = N$) and could take the value 50, 100, 200, 300 or 500.
- The group effect ($\gamma$) was equal to 0, 0.2, 0.5 or 0.8.
- The value of the variance of the latent trait ($\sigma^2$) could be 0.25, 1, 4 or 9.
- The number of items ($J$) was 5 or 10.
- The item difficulties could come from a normal distribution $N(0+\Delta, \sigma^2)$ (quite regularly spaced) or from an equiprobable mixture of $N(-\sigma+\Delta, (0.3\sigma)^2)$ and $N(\sigma+\Delta, \sigma^2)$ (irregularly spaced).
The global mean of the latent variable was equal to 0. The distributions of items and latent traits were therefore overlaid if the mean of the item distribution was also equal to 0. The gap, $\Delta$, was defined as 0 (overlaid distributions), $1\sigma$ or $2\sigma$. As the gap becomes larger, the item distribution departs more and more from the latent trait distribution and a floor effect could occur more frequently. In the case of a normal distribution with a null $\Delta$, the questionnaire is assumed to be appropriate for the population, without a floor effect, and the items are quite regularly spaced.
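The construction of item difficulties as percentiles of a distribution can be sketched as below. The text does not state which percentile levels were used; this sketch assumes the levels $j/(J+1)$ for $j = 1,\dots,J$, which is an assumption. Under that assumption, with $J = 5$, $\sigma = 1$ and $\Delta = 2\sigma$, the easiest items come out near 1.03 (normal distribution) and 0.86 (mixture), which agrees with the values quoted for Figure 1 (subfigures C and D), although the settings of that figure are not given in this section. Function names are hypothetical.

```python
from statistics import NormalDist


def normal_difficulties(n_items, sigma, gap=0.0):
    """Quite regularly spaced difficulties: percentiles of N(gap, sigma**2)."""
    dist = NormalDist(mu=gap, sigma=sigma)
    # Assumed percentile levels j / (J + 1); not stated in the source
    return [dist.inv_cdf(j / (n_items + 1)) for j in range(1, n_items + 1)]


def mixture_difficulties(n_items, sigma, gap=0.0):
    """Irregularly spaced difficulties: percentiles of an equiprobable mixture
    of N(-sigma + gap, (0.3 * sigma)**2) and N(sigma + gap, sigma**2)."""
    low = NormalDist(mu=-sigma + gap, sigma=0.3 * sigma)
    high = NormalDist(mu=sigma + gap, sigma=sigma)

    def cdf(x):
        return 0.5 * low.cdf(x) + 0.5 * high.cdf(x)

    difficulties = []
    for j in range(1, n_items + 1):
        target = j / (n_items + 1)
        lo, hi = gap - 10 * sigma, gap + 10 * sigma
        for _ in range(100):  # bisection on the monotone mixture cdf
            mid = (lo + hi) / 2
            if cdf(mid) < target:
                lo = mid
            else:
                hi = mid
        difficulties.append((lo + hi) / 2)
    return difficulties


print([round(d, 2) for d in normal_difficulties(5, 1.0, gap=2.0)])
print([round(d, 2) for d in mixture_difficulties(5, 1.0, gap=2.0)])
```

The mixture quantiles have no closed form, hence the bisection; any root finder on the mixture distribution function would serve the same purpose.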
The combination of all parameter values leads to 960 different cases. For each case, 1000 replications were simulated.
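The count of 960 cases follows directly from crossing the parameter values listed above, as the following short enumeration shows. Variable names are illustrative.

```python
import itertools

sample_sizes = [50, 100, 200, 300, 500]      # N per group
group_effects = [0, 0.2, 0.5, 0.8]           # gamma
variances = [0.25, 1, 4, 9]                  # sigma^2
numbers_of_items = [5, 10]                   # J
item_distributions = ["normal", "mixture"]   # spacing regularity
gaps_in_sd = [0, 1, 2]                       # Delta, in units of sigma

scenarios = list(itertools.product(
    sample_sizes, group_effects, variances,
    numbers_of_items, item_distributions, gaps_in_sd,
))
null_scenarios = [s for s in scenarios if s[1] == 0]

print(len(scenarios))       # 5 * 4 * 4 * 2 * 2 * 3
print(len(null_scenarios))  # scenarios usable for the type I error
```

The second count, 240, is the number of scenarios with a null group effect, which matches the number of type I error values reported in the results.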
Evaluated criteria. Each simulated dataset was analyzed with a mixed Rasch model including a covariate to estimate the group effect. A Wald test was then performed to assess the significance of the group effect. For the simulations only, the type I error was estimated as the rate of rejection of the null hypothesis (null group effect) amongst datasets where the group effect was null ($\gamma = 0$). The confidence intervals of the type I error were computed as exact binomial proportion confidence intervals. The power of the test of group effect in the simulations, $1-\hat{\beta}_S$, was estimated as the rate of significant tests amongst the simulations where the simulated value of $\gamma$ was not null. This result was compared with the estimated power using CR, $1-\hat{\beta}_{CR}$ (eq. 2), computed with the Raschpower module of Stata [14]. As the estimation of $1-\hat{\beta}_{CR}$ is based on the estimated value of the standard error of the group effect, a good estimation of the power requires a good estimation of this standard error. Hence, the estimated value of the variance of the group effect in the simulations, $\widehat{\mathrm{Var}}_S$, was compared with the estimated variance of the group effect using CR, $\widehat{\mathrm{Var}}_{CR}$.
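The two empirical criteria, a rejection rate over replications and an exact binomial confidence interval for the type I error, can be sketched as follows. The interval below is the Clopper-Pearson interval obtained by bisection on the binomial distribution function; the source only says "exact binomial proportion confidence intervals", so the choice of this specific construction is an assumption, as are the function names.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Share of replications whose Wald test p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def exact_ci(k, n, level=0.95):
    """Clopper-Pearson interval for a proportion with k successes out of n."""
    a = (1.0 - level) / 2.0

    def bisect(too_low):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if too_low(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: P(X >= k | p) = a; upper bound: P(X <= k | p) = a
    lower = 0.0 if k == 0 else bisect(lambda p: 1.0 - binom_cdf(k - 1, n, p) < a)
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > a)
    return lower, upper


print(exact_ci(50, 1000))
```

With 1000 replications per scenario, an observed rejection count of 50 gives an interval that contains the nominal 5%, whereas counts as low as 26 or as high as 68 give intervals that exclude it.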

Results
Estimation of the Variance of the Group Effect
Table 1 and Table 2 show the estimated variance of the group effect obtained either by simulations or using CR for all the values of the parameters, for a group effect equal to 0 or 0.2 and to 0.5 or 0.8, respectively. In general, the estimates of the variance are close for both methods. As expected, the variance of the group effect decreases as $N$ and $J$ increase. Consistently, the variance of the group effect increases with the variance of the latent variable, $\sigma^2$. It slightly increases with the value of the group effect.
We note that the estimates of the variance for the CR method are larger than those from the simulations mostly when the gap is high ($\Delta = 2\sigma$) and $\gamma \neq 0$. The most overestimated values of the variance for CR are observed for low values of the sample size $N$ and of the number of items $J$, high values of the latent variable variance $\sigma^2$, and a normal distribution of the items as compared to a mixture of normal distributions.

Type I Error and Power of the Test of Group Effect
For the simulations, the type I error is well maintained at the expected value of 5% in almost all scenarios (results not shown). The type I error fluctuates between 2.6% ($J = 5$, $N = 500$, $\sigma^2 = 1$, $\Delta = 2\sigma$, for a mixture distribution of the item difficulties) and 6.8% ($J = 10$, $N = 200$, $\sigma^2 = 0.25$, $\Delta = 0$, for a mixture distribution of the item difficulties). Amongst the 240 values of the type I error, only 9 of the 95% confidence intervals of the estimated type I error do not contain the expected value of 5%. None of the parameters seems to have an impact on the value of the type I error.
Table 3 and Table 4 present the estimated values of the power obtained either by simulations or using CR for the values of all parameters, for a questionnaire composed of 5 and 10 items, respectively. For simulations, the power was estimated by the rate of rejection of the null hypothesis amongst datasets where the group effect was not null ($\gamma \neq 0$). For all values of the simulation parameters, the estimated powers are close for both methods (CR or simulations) when there is no gap. The difference between the powers obtained by simulation and using CR is around 0.003 on average and fluctuates between -0.034 ($N = 300$, $J = 10$, $\gamma = 0.5$, $\Delta = 0$, items normally distributed) and 0.059 ($N = 50$, $J = 10$, $\gamma = 0.5$, $\Delta = 0$, items normally distributed). As expected, the power increases as the sample size ($N$) and the number of items ($J$) increase, and decreases as the variance of the latent trait ($\sigma^2$) increases. It also increases with the group effect ($\gamma$).
We observe an impact of the gap between the means of the latent variable and of the item difficulties ($\Delta$), which is stronger when the gap is high ($\Delta = 2\sigma$). In these cases, the power obtained using CR is lower than the power of the simulations. The loss of power is the highest when the variance ($\sigma^2 = 4$ or $\sigma^2 = 9$) and the group effect ($\gamma = 0.5$ or $\gamma = 0.8$) are high, the number of items $J$ is low, and the distribution of the items is normal as compared to a mixture of normal distributions. The loss can exceed 20 percentage points in the worst cases. For example, when $N = 300$, $\gamma = 0.8$, $\sigma^2 = 9$, $J = 5$, $\Delta = 2\sigma$ and the distribution of items is normal, the estimated power is 83.4% for the simulations and 60.3% for CR.

Discussion
The validity of the method to estimate the standard error of the group effect and to determine the power of the test of group effect in IRT using the Cramer-Rao bound was investigated for a large number of situations that may often be encountered in practice. The estimated variance of the group effect and the power obtained using Cramer-Rao were close to the estimates from the simulations when the distributions of the latent variable and the items were overlaid ($\Delta = 0$). As expected, the variance of the group effect increased with the variance of the latent variable. This led to a decrease of the power of the test of group effect that was similar for both methods (Cramer-Rao and simulations). The Cramer-Rao method seems to remain valid for high values of the variance of the latent variable.
However, when the gap between the means of the latent variable and of the item difficulties ($\Delta$) is high, we observed an inflated estimate of the variance of the group effect and consequently a loss of power for CR compared to the simulations. The Cramer-Rao method seems to reach its limits for $\Delta = 2\sigma$ and high values of $\sigma^2$ and $\gamma$. An underestimation of the power can have large consequences on the planned sample size. To achieve a power of 80% for a gap equal to $2\sigma$ when $\gamma = 0.8$ and $\sigma^2 = 9$, the Cramer-Rao method suggests using $N = 500$ patients per group, whereas $N = 300$ patients per group is sufficient to obtain a power of 83.4% according to the simulations. Hence, in this example, 200 patients in each group would have been unnecessarily included in the study to achieve a power of 80% using the Cramer-Rao method with a gap equal to $2\sigma$. The choice of a questionnaire appropriate to the population at the design stage is therefore an important issue. For example, the use of a disease-specific questionnaire in the general population is not recommended, as the study population will probably not encounter some of the symptoms strongly related to this disease. Some items evaluating the symptoms will thus have few or no positive responses, leading to a floor effect and an incorrect determination of the power with the Cramer-Rao method.
We recommend taking time over the choice of the questionnaire before the study. To evaluate the suitability of a questionnaire, it seems important to first check that the items composing the questionnaire are relevant for the study population. An item is not considered relevant if the population will mainly choose only one of its response categories, which leads to a ceiling or floor effect. When choosing a questionnaire for a study, one has to take into account the characteristics of the population used for its earlier validation (type of disease, seriousness of the pathology, …) so that the questionnaire is suitable enough for the population to be studied.
At the planning stage, the parameters of the distribution of the latent variable and the item parameters have to be fixed. To do so, it is easier to rely on a pilot study or on previous articles for example. Hence, it may be possible to evaluate if a gap between the mean of the latent variable and the mean of the items distribution is likely to occur.
Despite all the precautions taken at the planning stage, a gap can be observed at the analysis stage. Unfortunately, the Cramer-Rao method would have underestimated the power in this case. Consequently, the number of subjects to be included in the study would have been overestimated, which raises ethical and financial problems. Given the results, it does not seem reasonable to use the Cramer-Rao method for a gap equal to or higher than $2\sigma$. In fact, a gap equal to 2 standard deviations already seems to reflect a poorly suitable questionnaire, for example a generic questionnaire assessing health-related quality of life in a seriously ill population. However, the Cramer-Rao method performs well in a large number of situations and can handle a moderate gap between the distributions of the latent variable and the items ($\Delta = 1\sigma$).
We observed a slight difference between the quite regularly spaced items (normal distribution) and the irregularly spaced items (mixture of normal distributions) in the variance and the power when the gap was high ($\Delta = 2\sigma$). The normal distribution gave higher estimates of variance and so lower power than the mixture of normal distributions. This effect increased with the gap. It could be explained by the fact that, in the way the data were simulated in our study, the items coming from the mixture of distributions cover a wider part of the latent variable distribution, as shown in Figure 1. Furthermore, when the latent variable and item distributions are not overlaid ($\Delta \neq 0$), the easiest item coming from the normal distribution ($\delta_1 = 1.03$ in Figure 1 (subfigure C) for example) is further to the right of the latent variable distribution than the easiest item coming from the mixture of distributions ($\delta_1 = 0.86$ in Figure 1 (subfigure D)). Therefore, the floor effect resulting from the gap occurs at a lower level of $\theta$ for the normal distribution than for the mixture of distributions, and so the floor effect has more impact on the variance and power obtained using item parameters coming from the normal distribution. As this effect is linked to the simulation process, it cannot be interpreted as an impact of the regularity of the items on the performance of the Raschpower method.
Table 3. Power estimated in the simulation study ($1-\hat{\beta}_S$) and using the Cramer-Rao bound ($1-\hat{\beta}_{CR}$) for different values of the sample size in each group ($N = N_0 = N_1$), the group effect ($\gamma$), the variance of the latent variable ($\sigma^2$), the spacing regularity of the items and the gap between the global mean of the latent variable and the mean of the distribution of the item difficulties ($\Delta$).

Beyond the impact of items and variance of the latent trait, the effect of the sample size, the number of items and the group effect were also studied. Their values were chosen to reflect what is frequently encountered in practice in health studies. However, some assumptions had to be made to perform the simulation study. Instead of the Rasch model, another IRT model for dichotomous items could be considered such as the 2-PLM [20] or the OPLM [21]. These models are more complex than the Rasch model in the sense that they include item discriminations in addition to item difficulties. The variance using Cramer-Rao could probably be estimated with the same efficiency by adapting the formula and fixing the item discrimination to known values as made for the item difficulties.
The estimation of the variance and the determination of the power are based on the expected planned values that are fixed. This is usual at the design stage but it can turn out to be problematic if no previous studies can provide some information on the values of the parameters. If the planning values are far from the estimated values in the study at the analysis stage, the variance could be incorrectly estimated and the power for a determined sample size could then not be achieved. It seems important to further study the impact of misspecifications in the choice of the planning values on the performance of the Cramer-Rao method. The robustness of this method when some of the assumptions on the model are violated should also be evaluated to identify settings where the method should or should not be used.
For now, the main limitation of the Cramer-Rao method is that the variance can only be estimated for patient-reported outcomes evaluated with dichotomous items in a cross-sectional setting. Two major developments seem to be necessary to make this method applicable in almost all studies in health sciences. First, the method should be able to deal with polytomous items. The estimation of the variance can be based on the partial-credit model [22] or the rating-scale model [23], which are extensions of the Rasch model for this type of item. The introduction of such models will lead to a more complex estimation procedure, as the number of parameters will increase with the number of response categories of the items. Second, the study of the evolution of a criterion is often of interest in health sciences. Changes in patients' PRO over time are often evaluated in longitudinal studies. The validity of the Cramer-Rao method in this context has to be studied, as the correlated measures of patients bring into play a more complex model than in cross-sectional studies.
Table 4. Power estimated in the simulation study ($1-\hat{\beta}_S$) and using the Cramer-Rao bound ($1-\hat{\beta}_{CR}$) for different values of the sample size in each group ($N = N_0 = N_1$), the group effect ($\gamma$), the variance of the latent variable ($\sigma^2$), the spacing regularity of the items and the gap between the global mean of the latent variable and the mean of the distribution of the item difficulties ($\Delta$).

The estimated variance of group effect and power obtained using Cramer-Rao were close to the estimations from the simulations in most cases. These results show that the variance using Cramer-Rao bound correctly estimates the variance of the group effect. Hence, the Cramer-Rao method can be used to determine the power of the test of group effect at design stage for two-group comparison studies including patient-reported outcomes for many situations in health sciences. The important recommendation is to choose the most appropriate questionnaire for the population. Otherwise, sample size might be misspecified by this methodological approach.