A new item response theory model to adjust data allowing examinee choice

In a typical questionnaire testing situation, examinees are not allowed to choose which items they answer because of a technical issue in obtaining satisfactory statistical estimates of examinee ability and item difficulty. This paper introduces a new item response theory (IRT) model that incorporates information from a novel representation of questionnaire data using network analysis. Three scenarios in which examinees select a subset of items were simulated. In the first scenario, the assumptions required to apply the standard Rasch model are met, thus establishing a reference for parameter accuracy. The second and third scenarios include five increasing levels of violating those assumptions. The results show substantial improvements over the standard model in item parameter recovery. Furthermore, the accuracy was closer to the reference in almost every evaluated scenario. To the best of our knowledge, this is the first proposal to obtain satisfactory IRT statistical estimates in the last two scenarios.


Introduction
Item response theory (IRT) comprises a set of statistical models for measuring examinee abilities through their answers to a set of items (questionnaire). One of the most important advantages of IRT is allowing comparison between examinees who answered different tests. This property, known as invariance, is obtained by introducing separate parameters for the examinee abilities and item difficulties [1]. The IRT models have optimal properties when all items within a test are mandatory for the examinees to answer. In contrast, if the examinees are allowed to choose a subset of items, instead of answering them all, the model estimates may become seriously biased. This problem has been raised by many researchers [2][3][4][5][6][7][8] but still lacks a satisfactory solution.
This problem is an important issue because several studies have provided evidence that choice has a positive impact in terms of educational development [9][10][11][12]. That is, these studies indicated that allowing students to choose which questions to answer increases motivation and engagement in the learning process. In a testing situation, allowing choice seems to reduce the concern of examinees regarding the impact of an unfavourable topic [13]. In addition, it has been claimed as a necessary step for improving educational assessment [14,15].
Furthermore, allowing choice could be used to ameliorate important challenges faced in the application of IRT models. For example, to achieve invariance, the items used in different tests must be calibrated on the same scale. This calibration is usually done by creating a question bank from which the selected items are extracted. Items in the bank were previously calibrated by being exposed to examinees similar to those for whom the tests are intended; therefore, these items were pre-tested. The pre-test process is typically extremely expensive and time consuming. For example, in 2010, it was reported that the Brazilian government spent approximately US$3.5 million (considering an average exchange rate of R$1.75 per US$1.00 in August 2010) to calibrate items for an important national exam (ENEM). Nevertheless, serious problems were reported during the pre-test, such as the employees of a private high school supposedly making illegal copies of many items and their subsequent release. In addition, the number of items currently available in the national bank was approximately 6 thousand, whereas the ideal number would be between 15 and 20 thousand. All these events were harshly criticized by the mainstream media [16][17][18][19]. Recent works have proposed optimal designs for item calibration to reduce costs and time [20,21].
Still, there is a limit on the number of items that an examinee can properly answer within a period of time. If the examinees are allowed to choose v items within a total of V items (v < V) and if satisfactory statistical estimates are provided for all V items, the costs of calibration per item are reduced.
This paper presents a new IRT model to adjust data generated in a scenario in which examinee choice is allowed. The proposed model incorporates network analysis information into the IRT model using Bayesian modelling. This paper evaluates the one-parameter logistic model, known as the Rasch model [22], which is the simplest and most widely used IRT model [1]. Future studies will aim to extend the proposed approach to more complex IRT models, such as the two- and three-parameter logistic models.
The results show substantial improvements in the accuracy of the estimated parameters compared to the standard Rasch model, mainly in situations in which examinee choice is different from a random selection of items. To the best of our knowledge, this study is the only proposal to date that achieves a satisfactory parameter estimation in critical scenarios reported in the literature [5].

The standard IRT model
The item characteristic curve of the Rasch model is given by Eq 1 [1]:

P(Y_iα = 1 | θ_α, b_i) = exp(θ_α − b_i) / [1 + exp(θ_α − b_i)],   (Eq 1)

where Y_iα ∈ {0, 1} is a binary variable that indicates whether examinee α correctly answered item i; θ_α is the ability parameter of examinee α; b_i is the difficulty parameter of item i; P(Y_iα = 1 | θ_α, b_i) is the probability that a randomly chosen examinee with ability θ_α correctly answers item i; and the probability of an incorrect response is P(Y_iα = 0 | θ_α, b_i) = 1 − P(Y_iα = 1 | θ_α, b_i). In the Rasch model, b_i represents the ability required for an examinee to have a 50% (0.50) chance of correctly answering item i. Given M examinees and V items, estimates for θ_α and b_i in the IRT models are found using the likelihood shown in Eq 2 [1]:

L(θ̃, b̃ | Ỹ) = ∏_{α=1}^{M} ∏_{i=1}^{V} P(Y_iα = 1 | θ_α, b_i)^{y_iα} [1 − P(Y_iα = 1 | θ_α, b_i)]^{1 − y_iα},   (Eq 2)

where θ̃ = (θ_1, θ_2, ..., θ_M) is the vector of the examinee abilities, b̃ = (b_1, b_2, ..., b_V) is the vector of the item difficulties and Ỹ is the vector of the observed responses. To use Bayesian estimation, the prior distributions f(θ_α) and f(b_i) must be defined. Since θ̃ ⊥ b̃, θ_r ⊥ θ_s and b_p ⊥ b_q, for r ≠ s and p ≠ q, the joint posterior distribution for the parameters θ̃ and b̃ is given by Eq 3:

f(θ̃, b̃ | Ỹ) ∝ f(Ỹ | θ̃, b̃) ∏_{α=1}^{M} f(θ_α) ∏_{i=1}^{V} f(b_i).   (Eq 3)
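The item characteristic curve in Eq 1 and the likelihood in Eq 2 can be sketched in a few lines of Python (an illustrative sketch of ours, not code from the paper):

```python
import numpy as np

def rasch_prob(theta, b):
    """Eq 1: probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def log_likelihood(theta, b, y):
    """Log of the likelihood in Eq 2 for an M x V binary response matrix y."""
    p = rasch_prob(theta[:, None], b[None, :])  # M x V matrix of P(Y = 1)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# An examinee whose ability equals the item difficulty has a 0.50 chance:
print(rasch_prob(0.0, 0.0))  # 0.5
```

This makes the role of b_i concrete: it is the ability value at which the curve crosses probability 0.50.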
Furthermore, the ability and item parameters cannot be estimated simultaneously using maximum likelihood because of scale indeterminacy [1,23]. This problem can be solved by choosing arbitrary scales for either the ability or the item parameters. In classic (frequentist) statistical analysis, the mean of the abilities is usually set to zero. In Bayesian modelling, the scale constraint can be imposed by means of the prior distribution of θ̃. Thus, each latent ability θ_α is assumed to come from a standard normal distribution. Further information regarding the Bayesian estimation of IRT models can be found in [23]. A standard prior distribution for the item parameter is given by Eq 4 [23][24][25][26]:

b_i ~ N(μ_b, σ_b²),   (Eq 4)

where N(μ_b, σ_b²) denotes a normal distribution with mean μ_b and variance σ_b².
One can use Markov chain Monte Carlo (MCMC) methods to sample from the posterior distribution of μ_b and σ_b². Details about MCMC in the context of IRT are found in [25], and the BUGS code for the standard model is shown in S1 Table. When examinees choose items, the likelihood in Eq 2 is augmented with the unobserved responses Ỹ_miss and the missing-data indicator R̃ (Eq 5), and the corresponding joint posterior distribution is given by Eq 6. Bradlow and Thomas [5] showed that if examinees are allowed to choose items, then valid statistical inference for θ̃ and b̃ can be obtained using Eq 2 only if the assumptions in Eqs 7 and 8 hold. Assumption (7) is known as the missing at random (MAR) assumption and implies that examinees are not able to distinguish which items they would likely answer correctly. Assumption (8) implies that examinees of different abilities do not systematically select easier or more difficult items. Further details are found in [5]. If both assumptions (7) and (8) hold, then the posterior distribution can be rewritten as shown in Eq 9:

f(θ̃, b̃, Ỹ_miss | R̃, Ỹ_obs) ∝ f(Ỹ_obs | θ̃, b̃) f(Ỹ_miss | θ̃, b̃) ∏_{α=1}^{M} f(θ_α) ∏_{i=1}^{V} f(b_i).   (Eq 9)

In this case, the process that generates the missing data is assumed to be non-informative. Details about statistical inference in the presence of missing data are found in [28]. In Bayesian inference, Ỹ_miss can be treated as a random quantity (details in [28]); thus, it can be sampled from the posterior distribution [25]. This sampling is possible because Eq 5 is an augmented data likelihood. Hereafter, it is assumed that if the examinees randomly select items, then the unobserved values are MAR; otherwise, the process that generates the missing data is informative, and the unobserved values are not MAR.
Wang, Wainer and Thissen [4] conducted an experiment called "choose one, answer all" to test whether the MAR assumption is empirically plausible. In the experiment, 225 students indicated which item in each pair of questions they would rather answer; however, they still had to answer all items. The results showed that the MAR assumption did not hold. For example, a particular pair of items (items 11 and 12) was presented to the students, and one item (item 12) was much more difficult than the other (item 11). Only 20% of the examinees preferred to answer item 12. It was further observed that the students who had chosen item 12 performed far better on item 11, which indicated that they had made a disadvantageous choice. Moreover, the examinees who chose item 11 performed better on both items than those who chose item 12. These results were also observed elsewhere [29,30], suggesting that students with higher abilities are more likely to distinguish the difficulties of items.
Furthermore, Bradlow and Thomas [5] performed simulation studies to demonstrate the bias in the parameters estimated using the standard Rasch model when assumptions (7) and (8) are violated. The complete data set had 5,000 examinees and 200 items. In the simulation study, 50 items were mandatory, and the remaining 150 items were divided into 75 choice pairs. In the experiment in which the first assumption (7) was violated, item difficulty was consistently underestimated for the 50 mandatory items and more severely underestimated for the 75 choice pairs. In the experiment in which the second assumption (8) was violated, item difficulty was overestimated for high-difficulty items and underestimated for low-difficulty items. The authors also stated that very little is known about the nature and magnitude of realistic violations of those assumptions.
Wang et al. [8] proposed the inclusion of a random effect parameter γ_α in the standard IRT models to account for the choice effect, as shown in Eq 10, which is similar to Eq 1 but includes the choice effect parameter γ_α. In this model, the ability parameter θ_α is separated from the choice effect, and therefore, the abilities can be compared across different examinees; θ and γ follow a bivariate normal distribution. Supposedly, the inclusion of the choice parameter γ_α would make assumptions (7) and (8) valid. The proposed model produced better results than the standard IRT model in some simulated scenarios. Nevertheless, the authors state that if assumption (7) or assumption (8) is violated, as described in [5], valid statistical inferences are obtained neither with the standard IRT model nor with the proposed model with the choice effect parameter.
Liu and Wang [31] also proposed a two-dimensional IRT model that includes a choice effect parameter γ_α. The model accounts for the selection of items using a nominal response model [32]. For example, let a test be composed of V items from which examinees are free to choose v items (v < V). The number of possible choices, denoted by W, is equal to the number of different combinations of choosing v of the V items, i.e., W = C(V, v) = V! / [v!(V − v)!]. Let M_α be the index of the item choice combination for examinee α, M_α ∈ {1, ..., k, ..., W}.
According to the nominal response model, the probability of examinee α selecting the k-th choice combination is defined as in Eq 11, where q_k is the slope parameter and p_k is the intercept parameter for choice k. The choice index M_α is assumed to be conditionally independent of Ỹ_obs, Ỹ_miss, θ̃ and b̃, given γ̃ and the nominal response model parameters. The PDF f(Ỹ_obs, Ỹ_miss | θ̃, b̃) represents the standard IRT model given by Eq 1. The authors performed two simulation studies in which the proposed model yielded satisfactory parameter recovery. Nevertheless, the authors stated that the two simulation studies were rather simple and that the findings may not be generalizable to more complex situations. The model proposed by Liu and Wang [31] has a limitation: its estimates are feasible only if the number of possible choices (W) is relatively small. The authors indicate that the number of examinees should be at least 10 times the number of item parameters in order to achieve reliable estimates in the nominal response model. As will be shown, this paper evaluates a scenario in which examinees are allowed to choose 20 items from a total of 50 items. This scenario requires a nominal response model with 2 × C(50, 20) = 9.425842 × 10^13 item parameters. Consequently, the minimum number of examinees required to estimate the model is 9.425842 × 10^14, which is impractical.
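The combinatorial explosion described above is easy to verify; the short check below (ours, not from [31]) reproduces the 9.425842 × 10^13 figure:

```python
from math import comb

V, v = 50, 20
W = comb(V, v)                 # number of possible choice combinations
n_params = 2 * W               # a slope and an intercept per combination
print(W)                       # 47129212243960
print(f"{n_params:.6e}")       # 9.425842e+13
print(f"{10 * n_params:.6e}")  # minimum examinees under the 10x rule
```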

Network information
Pena and Costa [33] proposed a novel representation of examinees and their selected items using network analysis. The data set is coded as layers, vertices (or nodes) and edges. Briefly, a network (or graph) G = (Ṽ, Ẽ) consists of a set of V vertices (Ṽ) that identify the elements of a system and a set of E edges (Ẽ) that connect pairs of vertices {v_i, v_j}, indicating the presence of a relationship between the vertices [34]. In the proposed network representation, each examinee is represented as a single network in which the V items are the vertices and every pair of the V_α selected items is connected by an edge. That is, the data set is initially represented as M single-layer networks, or a multilayer network G̃ = (G_1, G_2, ..., G_M) [35]. From the multilayer network, two matrices are created. The first matrix, called the overlapping matrix, O = [o_ij], is a weighted V × V matrix in which the element o_ij indicates the number of examinees who chose both items i and j [35,36]:

o_ij = Σ_{α=1}^{M} a_ij[α],   (Eq 13)

where a_ij[α] = 1 if examinee α chose both items i and j and a_ij[α] = 0 otherwise; 0 ≤ o_ij ≤ M, ∀ i, j. The second matrix, U = [u_ij], is a binary V × V matrix in which u_ij is equal to 1 if o_ij (Eq 13) is greater than a threshold (C) and zero otherwise. The threshold is calculated to identify recurrent edges in the multilayer network and is given by Eq 14, where E_o is the total number of edges in the multilayer network, Z_ε = V(V − 1)/2 is the maximum number of edges in a single-layer network and Z_γ is a z-score statistic (Z_0.05 = 1.645). Under the null hypothesis, the statistical distribution of the number of incident edges in each pair of vertices is the same. Thus, C represents the upper bound of the observed number of edges between vertices i and j under the hypothesis that the E_o edges are randomly distributed. Further details are found in [33].
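The construction of O (Eq 13) and U can be sketched as below. Since the exact expression of the threshold C in Eq 14 is not fully recoverable from this text, the sketch uses a binomial null model (mean plus Z_γ standard deviations of E_o edges spread uniformly over the Z_ε pairs) as one plausible reading; treat the `threshold` function as our assumption, not the formula from [33]:

```python
import numpy as np

def overlap_matrix(choices, V):
    """Eq 13: o_ij = number of examinees who chose both items i and j.
    `choices` is a list (one entry per examinee) of chosen item indices."""
    A = np.zeros((len(choices), V), dtype=int)
    for alpha, items in enumerate(choices):
        A[alpha, list(items)] = 1
    O = A.T @ A                # co-selection counts
    np.fill_diagonal(O, 0)     # no self-edges
    return O

def threshold(O, z=1.645):
    """Assumed reading of Eq 14: upper bound on o_ij when the E_o edges
    fall uniformly over the Z_eps = V(V-1)/2 possible pairs."""
    V = O.shape[0]
    E_o = O[np.triu_indices(V, 1)].sum()  # total edges in the multilayer network
    p = 1.0 / (V * (V - 1) / 2)           # uniform probability per pair
    return E_o * p + z * np.sqrt(E_o * p * (1 - p))

# U keeps only the recurrent (above-threshold) edges:
choices = [{0, 1}, {0, 1}, {0, 1}, {2, 3}]
O = overlap_matrix(choices, 4)
U = (O > threshold(O)).astype(int)
print(O[0, 1], U[0, 1])  # → 3 1 : pair (0, 1) is co-chosen 3 times and kept
```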
Therefore, matrix U is a binary matrix that preserves only the statistically significant edges, as shown in Eq 15. In [33], it was shown that the density of matrix U can be used to test whether the MAR assumption holds: the larger the density of matrix U is, the more strongly the MAR assumption is violated; that is, the density of matrix U indicates the violation level of the MAR assumption. The density of a network G = (Ṽ, Ẽ) is given in Eq 16 [37]:

density(G) = 2E / [V(V − 1)].   (Eq 16)

Moreover, in [33], several network centrality measures and their correlations with item difficulty when the MAR assumption is violated were evaluated. The most frequent centrality measures found in the literature [38] were tested, and the eigenvector of matrix O was found to be the most consistent and robust network statistic for estimating item difficulty. The eigenvector centrality of vertex i (ρ_i) is the i-th element of the first eigenvector of matrix O, i.e., the solution of Eq 17:

O ρ̃ = λ ρ̃,   (Eq 17)

where λ is the leading eigenvalue and ρ̃ is the first eigenvector of matrix O. Further details about eigenvector centrality are found in [39]. In addition, the simulation study presented in [33] indicates that the larger the MAR assumption violation is, the larger the correlation between the eigenvector centrality and the item difficulties. It is worth mentioning that the eigenvector centrality assumes values within the range 0-1 and therefore provides a standardized measure of vertex centrality.
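Eq 17 can be computed with a standard power iteration (an illustrative sketch of ours; [33] does not prescribe a particular algorithm):

```python
import numpy as np

def eigenvector_centrality(O, n_iter=1000, tol=1e-12):
    """First (leading) eigenvector of the symmetric non-negative matrix O,
    normalized to unit length, so each rho_i lies in [0, 1]."""
    rho = np.ones(O.shape[0]) / np.sqrt(O.shape[0])
    for _ in range(n_iter):
        nxt = O @ rho
        nxt /= np.linalg.norm(nxt)       # keep unit length each step
        if np.linalg.norm(nxt - rho) < tol:
            break
        rho = nxt
    return np.abs(rho)

# Items that are co-selected more often receive higher centrality:
O = np.array([[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)
rho = eigenvector_centrality(O)
print(rho)  # first two items tie; the third is lower
```

By the Perron-Frobenius theorem, the leading eigenvector of a non-negative matrix can be taken entrywise non-negative, which is why the 0-1 standardization mentioned above holds.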

The proposed model
In general, the relation between item difficulty b_i and the first eigenvector of matrix O can be written as b_i ≈ g(ρ_i) (Eq 18), where g(ρ_i) is a function of the i-th element of the first eigenvector ρ̃. This paper proposes a new IRT model that takes this relation into account, which can be achieved by defining the following prior distribution (Eq 19):

b_i ~ N(μ_{b_i}, σ_b²),   with μ_{b_i} = g(ρ_i),

where σ_b² accounts for the variability of b_i that cannot be explained by g(ρ_i). It is worth mentioning that the larger the correlation between ρ̃ and b̃ is, the lower the dispersion parameter σ_b². The posterior distribution shown in Eq 6 can then be rewritten as shown in Eq 20. Eq 21 assumes that, given ρ̃, the missing-data indicator R̃ is independent of Ỹ_miss, θ̃ and b̃. Further information about conditioning on covariates so that the missingness mechanism becomes ignorable is found in [40,41].
In this paper, the following mathematical model between ρ̃ and b̃ is proposed (Eq 22), where β_0, β_1 and β_2 are coefficients to be estimated and C is the minimum value of ρ_i, i.e., C = min(ρ̃). This model represents the inverse of a logistic function, with a shape that asymptotically tends to the lowest value of ρ̃. The β_2 parameter is required to move the vector ρ̃ below 1; it is also used to shift the vector ρ̃ above its lowest value to prevent the logarithm term from diverging to +∞. This model was empirically proposed based on simulation studies, as shown in the Results section. It is worth mentioning that other functions describing the relation between ρ̃ and b̃ could be proposed. The BUGS code for the proposed model is available in S2 Table.

Simulation study

To investigate the data behaviour under violations of assumptions (7) and (8), we performed several simulations using three different scenarios. In all scenarios, 1,000 examinees (α = 1, ..., 1,000) choose 20 items from a total of 50 items (i = 1, ..., 50). The examinee abilities (θ_α) and item difficulties (b_i) were both generated from a standard normal distribution and held fixed. The complete data set can therefore be represented by a 1,000 × 50 matrix of binary responses. For each examinee α and item i, the probability of a correct answer was calculated using Eq 1, and this probability was used to generate the outcome Y_iα with a Bernoulli random number generator.
In the first scenario, hereafter named scenario 1, both assumptions (7) and (8) were valid. That is, the items were randomly selected by the examinees, and consequently, the process that causes missing data was non-informative. Therefore, this is the scenario in which valid statistical inference can be obtained using the standard Rasch model.
The second scenario, named scenario 2, is identical to the first simulation scenario presented in [33]. In this scenario, each examinee chooses items based on the values of θ_α and b_i. That is, for each examinee α, the items are divided into two groups. The first group comprises items that are easier than the examinee's ability, i.e., b_i ≤ θ_α; this is the group in which the examinee has a probability greater than 0.50 of achieving a correct answer. The second group comprises items that are more difficult than the examinee's ability, i.e., b_i > θ_α; in this group, the examinee has a probability of less than 0.50 of answering the items correctly. A weight value (w_i) is assigned to the items in each group: for items in group 2, the weight value is w_i[2] = 1; for items in group 1, the weight value varies from 1.5 to 30. In this scenario, both assumptions (7) and (8) are violated.
Finally, in the third scenario, named scenario 3, the examinee choice depends on Ỹ_miss; that is, assumption (7) is violated. Similar to scenario 2, for each examinee α, the items are divided into two groups. In scenario 3, the first group comprises the items that were correctly answered by the examinee in the complete data set (y_iα = 1), and the second group contains the items that the examinee failed (y_iα = 0). As in scenario 2, for items in group 2, the weight value is w_i[2] = 1; for items in group 1, the weight value varies from 1.5 to 30. This scenario is similar to the second simulation study described in [5]. Table 1 summarizes the proposed simulations. Each configuration was replicated 100 times.
It is worth mentioning that the selected items were generated using a multinomial probability distribution; that is, the probability of examinee α selecting item i is given by Eq 23:

P(examinee α selects item i) = w_iα / Σ_{j=1}^{V} w_jα.   (Eq 23)

In scenario 1, w_iα = 1 ∀ i.
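The weighted selection mechanism can be sketched as follows (our illustration of scenario 2, under the convention stated above that the easier items in group 1 receive the larger weight):

```python
import numpy as np

rng = np.random.default_rng(42)
M, V, v = 1000, 50, 20
theta = rng.standard_normal(M)  # fixed examinee abilities
b = rng.standard_normal(V)      # fixed item difficulties

def choose_items(alpha, w1=10.0):
    """Scenario 2 selection: items with b_i <= theta_alpha (group 1) get
    weight w1, the remaining items (group 2) get weight 1; v items are
    drawn without replacement with probabilities as in Eq 23."""
    w = np.where(b <= theta[alpha], w1, 1.0)
    return rng.choice(V, size=v, replace=False, p=w / w.sum())

chosen = choose_items(0)
print(len(set(chosen.tolist())))  # 20 distinct items
```

With w1 = 1 for every item, the same function reproduces scenario 1 (purely random choice).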

Results and discussion
In this section, we first present empirical evidence for the proposed function, given in Eq 22, to describe the relation between b̃ and ρ̃. Second, we define values for the variance parameter of the prior distribution (σ_b²); well-chosen prior values of σ_b² improve the statistical properties of the proposed model. Finally, several simulation studies are performed under different conditions to compare the accuracy of parameter recovery obtained using the standard Rasch model and using our proposed model.

Empirical validation of the proposed model
To evaluate the performance of the proposed function (Eq 22) in predicting item difficulty from the first eigenvector of matrix O, we performed 10,000 Monte Carlo simulations for scenarios 2 and 3 and calculated the residual sum of squares (RSS), as shown in Eq 24:

RSS = Σ_{i=1}^{V} (b_i − b̂_i)²,   (Eq 24)

where b̂_i is the fitted value of b_i using Eq 22. It is worth mentioning that the smaller the RSS is, the better the fit of the model and the lower the bias of the estimates. The β_0, β_1 and β_2 coefficients were estimated using the least squares method adapted to a nonlinear model in the software R [42,43]. Table 2 shows the minimum, mean, maximum and standard deviation of the RSS for the different weight values used in scenarios 2 and 3. In both scenarios, the larger the weights are, the lower the RSS. Thus, there is empirical evidence that the proposed function provides a better fit in situations in which the MAR assumption is more strongly violated. Fig 1A shows the density of matrix U versus the RSS using data generated from scenario 2, and Fig 1B shows the same plot using data generated from scenario 3. In general, the larger the density of matrix U is, the lower the RSS. In [33], it was shown that the larger the density of matrix U is, the stronger the MAR assumption violation. Therefore, the density of matrix U can be used as a predictor of the goodness of fit of the proposed function in Eq 22.
We propose using the observed density of matrix U to choose different values of σ_b² for the prior distribution (Eq 19). Lower prior values of σ_b² mean that the posterior distribution of b_i will be concentrated towards its mean, μ_{b_i} = g(ρ_i), i.e., the posterior estimate of b_i is mostly defined by the proposed eigenvector centrality function. In contrast, the larger the prior value of σ_b² is, the less the posterior estimate of b_i is affected by the eigenvector centrality function.
Based on the results shown in Fig 1 and further simulation studies, the proposed values of σ_b² are shown in Table 3. As previously mentioned, the lower the density of matrix U is, the larger the prior value of σ_b². Furthermore, if the density of matrix U is within 0.0-0.1, there is empirical evidence that the MAR assumption holds; in this case, a large variance value is chosen to make the prior distribution for b_i less informative. Future work includes exhaustive simulation studies to propose a mathematical function that relates the density of matrix U to the prior value of σ_b², thus further minimizing the bias of the proposed model.
[Figure caption] The proposed function fitted to data generated from scenarios 2 and 3 using three different weight values, w_i ∈ {2, 5, 30}. The proposed function was adjusted using the mean value of ρ_i across the simulations for each weight and is represented with black lines. In both scenarios, the proposed function achieved a good data fit.

Accuracy of the estimated parameters
To evaluate the performance of the proposed model relative to the standard Rasch model, 100 Monte Carlo simulations were performed under the three scenarios previously described. Scenario 1 is used as the reference scenario, since in this scenario valid statistical inferences can be obtained using the standard Rasch model [5]. The accuracy of the estimated parameters is evaluated using the bias, the square root of the mean square error (RMSE) and the square root of the maximum square error (RSE_MAX), given by Eqs 25, 26 and 27, respectively:

Bias_i = (1/N) Σ_{n=1}^{N} (b̂_ni − b_i),   (Eq 25)
RMSE_i = √[ (1/N) Σ_{n=1}^{N} (b̂_ni − b_i)² ],   (Eq 26)
RSE_MAX,i = √[ max_n (b̂_ni − b_i)² ],   (Eq 27)

where b_i is the true difficulty of item i, b̂_ni is the estimate of b_i in the n-th Monte Carlo simulation and N is the number of simulations. In addition, 95% highest probability density (HPD) intervals [44] are provided for each estimated difficulty, and the proportion of simulated parameters that fall within the HPD intervals is calculated. It is worth mentioning that the posterior mean, i.e., the mean of the posterior distribution, was used as the point estimate of b_i [45]. The standard Rasch model and the proposed model were estimated using the BUGS codes shown in S1 and S2 Tables. These BUGS codes can be run using one of the following free software programs: WinBugs [46], OpenBugs [47] and JAGS [48]; the WinBugs [46] software was used here. The MCMC simulations used one chain, 10,000 iterations, a burn-in period of 1,000 iterations and a thinning interval of 5 iterations. The variance of the prior distribution was set to 10 for the standard Rasch model and was selected according to Table 3 for the proposed model. The initial values of the parameters were as follows: β_0 = 0, β_1 = -1.5, β_2 = 0.01, μ_b = 0, b_i = 0 ∀ i, and θ_α = 0 ∀ α. Convergence was evaluated using the autocorrelation of the chains, which was near zero, and the trace plots, which indicated that convergence was achieved.
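The accuracy measures in Eqs 25-27 translate directly into vectorized Python (an illustrative helper of ours, with `b_hat` holding one row per Monte Carlo replication):

```python
import numpy as np

def accuracy_metrics(b_true, b_hat):
    """Eqs 25-27 per item: bias, RMSE and RSE_MAX over N replications.
    b_true: length-V vector of true difficulties.
    b_hat:  N x V matrix of estimates."""
    err = b_hat - b_true[None, :]
    bias = err.mean(axis=0)
    rmse = np.sqrt((err ** 2).mean(axis=0))
    rse_max = np.sqrt((err ** 2).max(axis=0))
    return bias, rmse, rse_max

b_true = np.array([0.0, 1.0])
b_hat = np.array([[0.1, 1.2],
                  [-0.1, 0.8]])
bias, rmse, rse_max = accuracy_metrics(b_true, b_hat)
print(bias, rmse, rse_max)  # bias ≈ [0, 0], rmse ≈ [0.1, 0.2], rse_max ≈ [0.1, 0.2]
```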
The average time to run the standard Rasch model once was 15.10 minutes, and the average time to run the proposed model was 15.40 minutes, using an Intel Core i7 processor with 2 GHz and 8 GB of RAM. Further details about the model specifications in WinBugs can be found in [49]. Fig 3 shows the differences between the estimates obtained using the standard Rasch model and the true item difficulties, plotted against the true item difficulties. Fig 3A shows the results using data generated from scenario 1; Fig 3B and 3C show the results using data generated from scenarios 2 and 3, respectively, with a weight value equal to 10. Fig 3A shows that the residuals (b̂_i − b_i) behave as a random cloud around zero. In contrast, Fig 3B shows that the dispersion of the residuals around zero is larger for more difficult items, and Fig 3C shows a severe underestimation of the item difficulties. Fig 4 shows the estimated 95% HPD intervals and the true item difficulties (red points). The HPD intervals obtained using the data sets generated from scenario 1 (Fig 4A) and scenario 2 (Fig 4B) contain most of the true item difficulties. It is worth noting that some of the HPD intervals shown in Fig 4B are wider than those in Fig 4A. In contrast, none of the HPD intervals shown in Fig 4C (scenario 3) contain the true item difficulties, which reveals a serious model fitting problem.
The empirical results showed that the fitting problem reported in scenario 3 cannot be solved simply by replacing the standard Rasch model with the proposed model: the variability of the bias becomes smaller than in scenario 2, but the estimated difficulties remain consistently shifted towards lower values. We suggest fixing this problem by fitting the proposed model in two stages. In the first stage, the examinees are free to choose v of the V items. In the second stage, a few of those V items are set as mandatory (v_c), and the examinees are required to answer these v_c items again. Thus, the v_c items are estimated twice: in the first stage, the v_c items are adjusted using our proposed method; in the second stage, the complete responses to the v_c items are adjusted separately using the standard IRT model. The average difference between the difficulties estimated in the second and first stages for the v_c items is then added to the first-stage difficulty estimates of all V items. This procedure adds a constant value to the estimates obtained in the first stage, thus correcting the bias shift in scenario 3. An alternative to the second-stage adjustment is simply to include in the questionnaire v_c pre-tested items, i.e., items for which the parameter b_i is known in advance. In that case, the average difference is calculated using the pre-calibrated values of b_i for the v_c items, and mandatory items are not required.
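The two-stage correction amounts to a constant shift estimated from the v_c anchor items; a minimal sketch (the function name and example values are ours):

```python
import numpy as np

def shift_correction(b_stage1, anchor_idx, b_anchor_ref):
    """Add to every stage-1 difficulty estimate the average difference
    between the reference difficulties of the v_c anchor items (stage-2
    estimates or pre-calibrated values) and their stage-1 estimates."""
    delta = np.mean(b_anchor_ref - b_stage1[anchor_idx])
    return b_stage1 + delta

# Stage-1 estimates shifted down by 0.5; two anchor items recover the shift:
b_stage1 = np.array([-1.5, -0.5, 0.5, 1.5])
corrected = shift_correction(b_stage1, [0, 1], np.array([-1.0, 0.0]))
print(corrected)  # [-1.  0.  1.  2.]
```

Because only a constant is added, the relative ordering and spacing of the stage-1 estimates are preserved; only the location of the scale is corrected.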
In our simulation study, two items were set as mandatory to perform this calibration, and the results show that two items were sufficient to correct the bias shift. Future work includes more exhaustive simulation studies to define the number of v_c items in the second stage. It is worth mentioning that this procedure is not necessary in scenario 2, since the residuals are already centred at zero. Nevertheless, in a real situation, it is not possible to identify whether the choice was made following scenario 2 or scenario 3; fortunately, the procedure can be applied in both scenarios.
Table 4 shows the parameter accuracy for scenario 1 using the standard Rasch model and the proposed model. As previously mentioned, the standard Rasch model achieves valid statistical inference in this scenario and is used as the accuracy reference for the estimated parameters. The reference model has minimum and maximum bias values of -0.2542 and 0.1648, respectively. The mean and maximum RMSE values are 0.1225 and 0.2814, respectively. RSE_MAX represents the largest absolute difference between the true and estimated difficulty for each item; its mean value is 0.3251, and its maximum value is 0.7342. That is, the largest difference between the true and estimated item difficulties in scenario 1 is 0.7342. The average width of the 95% HPD interval is 0.4996, and the maximum width is 0.8198. Finally, on average, 95.30% of the true item difficulties fall inside the 95% HPD interval, and the minimum proportion is 82%.
To test whether the proposed model fits the data generated from scenario 1, a Student's t-test was performed to compare its results with those of the standard Rasch model, as suggested by one referee. The p-value obtained using the bias as the dependent variable was 0.8940, that obtained using the RMSE was 0.9343 and that obtained using the RSE_MAX was 0.4239. Thus, it can be concluded that, assuming a significance level of 5%, the proposed model was statistically similar to the standard Rasch model regarding the average bias, RMSE and RSE_MAX of the item difficulties when the MAR assumption is valid. This result was obtained because the variance of the prior distribution when the density of matrix U is close to zero is sufficiently large, which makes the prior distribution less informative. It is worth noting that the maximum value of RSE_MAX is lower in the proposed model (0.5715) than in the standard Rasch model (0.7342). This occurs because, even though the data sets were generated using a random selection of items, a correlation between the eigenvector and the item difficulties can eventually arise in a few simulations; this correlation may explain the lower maximum RSE_MAX in the proposed model.
Table 5 shows the accuracies of the estimated parameters for scenario 2 using the standard Rasch model and the proposed model with five levels of MAR assumption violation. The standard Rasch model results are close to those of scenario 1 for the lower weight values, w_i = 1.5 and w_i = 2. For larger levels of MAR assumption violation (w_i = 5, w_i = 10 and w_i = 30), the RMSE and RSE_MAX values differ from those observed in scenario 1, especially in terms of the maximum values. In contrast, the results using the proposed model remain close to those observed in scenario 1, even for the larger weights.
Furthermore, our proposed model presented a narrower 95% HPD width and a larger average proportion of the true item difficulties inside the HPD intervals.
Three univariate two-way analyses of variance (ANOVA) were conducted to further investigate whether the observed differences between the two models are statistically significant in scenario 2. The bias, RMSE and RSE_MAX were used as dependent variables, and the estimation method (standard Rasch model or proposed model) and the weight magnitude (1.5, 2, 5, 10, 30) were used as factors. The results are shown in Table 6. The null hypothesis assumes that the mean value of the dependent variable does not change among different models and/or different weights. At a 5% significance level, it can be concluded that the proposed model is significantly different from the standard Rasch model regarding the RMSE and RSE_MAX in scenario 2, i.e., the p-values are smaller than 5%, thus rejecting the null hypothesis. Table 7 shows the results obtained for scenario 3. In general, the RMSE and RSE_MAX values for both models are larger than those observed in scenario 2. The results of the proposed model differ from those observed in scenario 1 only for w_i = 30, particularly regarding the range of the bias and the maximum RMSE. Nevertheless, the proposed model provides better results than the standard Rasch model, especially for the larger levels of MAR assumption violation (w_i = 5, w_i = 10 and w_i = 30).
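For a balanced design such as this one (every method-by-weight cell has the same number of simulated replications), the two-way ANOVA decomposition can be computed by hand. The sketch below uses hypothetical RMSE values, with method 1 degrading as the weight grows, purely to illustrate the sums-of-squares arithmetic; the actual simulation sizes and effects differ.

```python
import numpy as np

# Balanced two-way ANOVA: response y[method, weight, replication].
rng = np.random.default_rng(1)
n_methods, n_weights, n_rep = 2, 5, 20
effect = np.array([0.0, 0.0, 0.05, 0.10, 0.20])  # hypothetical degradation by weight
y = 0.12 + rng.normal(scale=0.02, size=(n_methods, n_weights, n_rep))
y[1] += effect[:, None]                          # method 1 worsens with weight

grand = y.mean()
m_a = y.mean(axis=(1, 2))                        # method means
m_b = y.mean(axis=(0, 2))                        # weight means
m_cell = y.mean(axis=2)                          # cell means

ss_a = n_weights * n_rep * ((m_a - grand) ** 2).sum()        # method main effect
ss_b = n_methods * n_rep * ((m_b - grand) ** 2).sum()        # weight main effect
ss_cell = n_rep * ((m_cell - grand) ** 2).sum()
ss_ab = ss_cell - ss_a - ss_b                                # interaction
ss_err = ((y - m_cell[:, :, None]) ** 2).sum()               # residual

df_a, df_b = n_methods - 1, n_weights - 1
df_ab = df_a * df_b
df_err = n_methods * n_weights * (n_rep - 1)
ms_err = ss_err / df_err
f_a = (ss_a / df_a) / ms_err
f_b = (ss_b / df_b) / ms_err
f_ab = (ss_ab / df_ab) / ms_err
```

Each F statistic is then referred to an F distribution with the corresponding degrees of freedom to obtain the p-values reported in Tables 6 and 8.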
In addition, it is worth noting that the largest difference between true and estimated item difficulties is 0.7082 in the proposed model (shown in Table 7 for a weight of 30), whereas the largest difference using the standard Rasch model is 1.3339 (shown in Table 5 for a weight of 30). This improvement is substantial, since the scale of the item difficulties is based on a standard normal distribution. As in scenario 2, three univariate two-way ANOVAs were performed to investigate whether the observed differences between the two models are statistically significant in scenario 3. The results are shown in Table 8. At a 5% significance level, it can be concluded that the weight value had a significant effect on the bias, RMSE and RSE_MAX. Furthermore, the interaction between the estimation method and the weight value was statistically significant for RMSE and RSE_MAX (p-value smaller than 5%). Fig 5 shows the RSE_MAX differences among the evaluated models for scenario 3 using different weights, with the results of the proposed model represented by the blue boxplots. Finally, the simulated results using different scenarios, weights and methods were simultaneously compared with the standard Rasch model results for scenario 1 using Dunnett's test [50]. Dunnett's test is a multiple-comparison t-test procedure that compares each simulation result with the reference (the standard Rasch model results for scenario 1). Three statistical tests were evaluated, using the bias, the RMSE and the RSE_MAX as dependent variables. Table 9 shows the Dunnett's test results.
At a 5% significance level, the standard Rasch model results for scenarios 2 and 3 were significantly different from the reference (scenario 1) in the following cases: (a) RMSE and RSE_MAX in scenario 2 using a weight of 30, (b) using a weight of 5 in scenario 3, (c) using a weight of 10 in scenario 3 and (d) using a weight of 30 in scenario 3. Using the bias as the dependent variable, the results were significantly different in scenario 3 using weights of 10 and 30. Using the RMSE as the dependent variable, the proposed model was significantly different from the reference in scenario 3 using a weight of 30. It is worth mentioning that for larger weights, such as 5, 10 and 30 in scenario 2, the larger p-values indicate that the proposed model was closer to the reference than the standard Rasch model regarding RMSE and RSE_MAX. Fig 6 shows the differences between the proposed model and the standard Rasch model using a data set generated from scenario 2 with a weight of 10; the same data set was used in Fig 3B. Fig 6A presents the residues of the standard Rasch model, and Fig 6B presents the residues of the proposed model. The results shown in Fig 6B resemble a random cloud centred at zero, as observed in Fig 3A. Furthermore, the differences between the true and estimated item difficulties are, in general, lower for the proposed model. Fig 7 shows the differences between the proposed model and the standard Rasch model using the same data presented in Fig 3C. The proposed two-stage procedure is able to centre the residues at zero; that is, the severe underestimation problem observed in Fig 3C was fixed. Fig 7A shows the results of the standard Rasch model; these are the previously estimated results shown in Fig 3C with the addition of the constant shift. Fig 7B shows the results using the proposed model.
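The many-to-one structure of these comparisons can be sketched in code. The snippet below is not Dunnett's test itself: as a self-contained stand-in, it uses Welch t statistics with a Bonferroni correction and a large-sample normal approximation instead of Dunnett's joint t distribution, and all sample values are hypothetical.

```python
import math
import numpy as np

# Each condition is compared against the reference (standard Rasch model,
# scenario 1). Samples are hypothetical per-replication RMSE values.
rng = np.random.default_rng(2)
control = rng.normal(0.12, 0.02, size=100)
conditions = {
    "proposed, w=30": rng.normal(0.122, 0.02, size=100),   # close to reference
    "standard, w=30": rng.normal(0.20, 0.03, size=100),    # far from reference
}

def welch_t(x, y):
    """Welch t statistic for unequal-variance samples."""
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

k = len(conditions)
pvals = {}
for name, sample in conditions.items():
    t = abs(welch_t(sample, control))
    # Two-sided p-value via the normal approximation (large df).
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    pvals[name] = min(1.0, k * p)                          # Bonferroni adjustment
```

Dunnett's procedure is less conservative than Bonferroni because it exploits the correlation induced by the shared control sample, but the input/output shape is the same: one adjusted p-value per condition versus the reference.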
The residues of the proposed model are closer to a random cloud centred at zero, as observed in Fig 3A, than the residues of the standard Rasch model.

It is worth mentioning that the examinee abilities can be obtained using Bayesian or classical statistical modelling by assuming the item parameter estimates to be known values [1] and then estimating the examinee abilities. That is, the estimates of θ can be calculated by considering the estimated difficulties b to be known values in Eq 2. In fact, the proposed model provides better estimates of b than the standard model. However, if examinees are free to choose an arbitrary number of items from a questionnaire, the set of selected items may not be appropriate for estimating the ability parameter θ_α. For example, suppose that an examinee's ability is θ_α = 1.5 but the selected items all have difficulty parameters b_i below 1; consequently, the amount of information available to estimate θ_α is poor [1]. This problem can be overcome by introducing mandatory items. The minimum number of mandatory items is an open research question and will be investigated in future research.
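The two points above can be made concrete with a short sketch: a maximum likelihood estimate of θ under the Rasch model with the difficulties treated as known, and the Fisher information I(θ) = Σ p_i(1 - p_i), which quantifies how informative a chosen item subset is. The item sets, difficulties and response pattern below are illustrative, not drawn from the paper's simulations.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct answer to items with difficulties b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def mle_theta(responses, b, n_iter=50):
    """Newton-Raphson MLE of one examinee's ability, b treated as known."""
    theta = 0.0
    for _ in range(n_iter):
        p = rasch_p(theta, b)
        score = np.sum(responses - p)   # gradient of the log-likelihood
        info = np.sum(p * (1.0 - p))    # Fisher information I(theta)
        theta += score / info
    return theta

theta_true = 1.5
easy_items = np.array([-1.0, -0.5, 0.0, 0.5, 0.9])   # all difficulties below 1
matched_items = np.array([0.5, 1.0, 1.5, 2.0, 2.5])  # bracketing theta_true

p = rasch_p(theta_true, easy_items)
info_easy = np.sum(p * (1 - p))        # low: items too easy for this examinee
p = rasch_p(theta_true, matched_items)
info_matched = np.sum(p * (1 - p))     # higher: difficulties near the ability

theta_hat = mle_theta(np.array([1, 1, 1, 0, 1]), matched_items)
```

Items whose difficulties lie well below the examinee's ability yield probabilities near 1, so each term p_i(1 - p_i) is small and the resulting estimate of θ_α is imprecise; this is exactly why a few mandatory items spanning the difficulty range help.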
In contrast, if the purpose of the questionnaire is to calibrate items for an item bank rather than to estimate the examinee abilities, then a few mandatory items are required to allow the second-stage adjustment of the proposed model. The simulation results showed that a small number of mandatory items, such as two or three, is sufficient to provide reliable estimates of the item parameters using the proposed model. As previously described, if some pre-tested items are included in the questionnaire, no mandatory items are required.
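One way to picture the role of the mandatory items in the second stage is as scale anchors: because their difficulties are known (pre-tested or calibrated elsewhere), the average residue on those items estimates the selection-induced shift, which can then be removed from every estimate. This is a heavily hedged sketch of that idea with hypothetical numbers; the paper's actual two-stage procedure may differ in detail.

```python
import numpy as np

# Hypothetical calibration: choice behaviour induces a common downward shift
# in the estimated difficulties (as in scenario 3).
rng = np.random.default_rng(3)
b_true = rng.normal(size=20)
shift = -0.8                                       # assumed selection-induced shift
b_est = b_true + shift + rng.normal(scale=0.05, size=20)

anchors = [0, 1]                                   # indices of two mandatory items
# Second stage: estimate the shift from the anchors' known difficulties
# and apply the correction to all items.
correction = np.mean(b_true[anchors] - b_est[anchors])
b_corrected = b_est + correction
```

With only two anchors the correction already centres the residues near zero, consistent with the simulation finding that two or three mandatory items suffice.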

Conclusion
This paper presents a new IRT model to estimate item difficulties using questionnaire data in which examinees are allowed to choose a subset of items. Using network analysis, new information is incorporated into the model. The results are presented using three simulation scenarios. In the first scenario, the assumptions required to apply the standard Rasch model are met; this scenario is used as the reference. The second and third scenarios include five increasing levels of violation of those assumptions. These scenarios are reported in the IRT literature, and to the best of our knowledge, no existing proposal has provided satisfactory results for them.
In both scenarios 2 and 3, the standard Rasch model is robust to lower levels of violation. The proposed model performs better in both scenarios, especially for larger levels of violation. Furthermore, the proposed model achieves parameter estimates that are closer to the reference values for both scenarios and all levels of violation, except for the largest violation level in scenario 3. As our main conclusion, we strongly recommend the use of the proposed model over the standard Rasch model when examinees are allowed to choose items.
To overcome further technical issues that prevent allowing examinee choice in practical situations, further investigations are required. For instance, real data set analyses are crucial for understanding how examinees actually make their choices. Experiments in which examinees are required to indicate which subset of items they would rather answer, despite all items being mandatory, could be used. Furthermore, some important issues, such as the number of items and which items should be mandatory (or pre-tested) for the two-stage adjustment, were beyond the scope of this paper and require further investigation. Finally, the proposed approach requires further evaluation and research before it can be used with more complex IRT models.

Supporting information
S1 Table. BUGS code for the standard Rasch model. This table shows the BUGS code for the standard Rasch model estimation using the MCMC method. Generally, the BUGS code comprises the likelihood function (lines 2-5) and the prior distributions (lines 6-9). It is worth mentioning that the BUGS language uses the precision parameter rather than the variance parameter for the normal distribution; thus, line 10 shows the precision parameter as a function of the standard deviation. Further details about BUGS code for fitting IRT models can be found in [26]. (DOCX)
S2 Table. BUGS code for the proposed model. This table shows the BUGS code for the proposed model estimation. Lines 2-5 comprise the likelihood function, lines 6-8 show the hyperparameter prior distributions (β_0, β_1 and β_2), and lines 9-14 show the parameter prior distributions (θ_α and b_i). It is worth noting that in line 8 the prior density of β_2 is truncated on the left at 0.0001 in order to avoid division by zero in Eq 22, that is, β_2 > 0. Furthermore, in line 12, "rho[i]" represents ρ_i, and in line 13, "C" represents the lowest value of r; both values are data driven.