Abstract
In a typical questionnaire testing situation, examinees are not allowed to choose which items they answer because of a technical issue in obtaining satisfactory statistical estimates of examinee ability and item difficulty. This paper introduces a new item response theory (IRT) model that incorporates information from a novel representation of questionnaire data using network analysis. Three scenarios in which examinees select a subset of items were simulated. In the first scenario, the assumptions required to apply the standard Rasch model are met, thus establishing a reference for parameter accuracy. The second and third scenarios include five increasing levels of violating those assumptions. The results show substantial improvements over the standard model in item parameter recovery. Furthermore, the accuracy was closer to the reference in almost every evaluated scenario. To the best of our knowledge, this is the first proposal to obtain satisfactory IRT statistical estimates in the last two scenarios.
Citation: Pena CS, Costa MA, Braga Oliveira RP (2018) A new item response theory model to adjust data allowing examinee choice. PLoS ONE 13(2): e0191600. https://doi.org/10.1371/journal.pone.0191600
Editor: Irene Sendiña-Nadal, Universidad Rey Juan Carlos, SPAIN
Received: August 22, 2017; Accepted: January 8, 2018; Published: February 1, 2018
Copyright: © 2018 Pena et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Item response theory (IRT) comprises a set of statistical models for measuring examinee abilities through their answers to a set of items (questionnaire). One of the most important advantages of IRT is allowing comparison between examinees who answered different tests. This property, known as invariance, is obtained by introducing separate parameters for the examinee abilities and item difficulties [1]. The IRT models have optimal properties when all items within a test are mandatory for the examinees to answer. In contrast, if the examinees are allowed to choose a subset of items, instead of answering them all, the model estimates may become seriously biased. This problem has been raised by many researchers [2–8] but still lacks a satisfactory solution.
This problem is an important issue because several studies have provided evidence that choice has a positive impact on educational development [9–12]. That is, these studies indicated that allowing students to choose which questions to answer increases motivation and engagement in the learning process. In a testing situation, allowing choice seems to reduce examinees' concern about the impact of an unfavourable topic [13]. In addition, it has been claimed to be a necessary step for improving educational assessment [14, 15].
Furthermore, allowing choice could ameliorate important challenges faced in the application of IRT models. For example, to achieve invariance, the items used in different tests must be calibrated on the same scale. This calibration is usually done by creating a question bank from which the selected items are extracted. Items in the bank are calibrated in advance by exposing them to examinees similar to those for whom the tests are intended; that is, the items are pre-tested. The pre-test process is typically extremely expensive and time consuming. For example, in 2010, it was reported that the Brazilian government spent approximately US$3.5 million (considering an average exchange rate of R$1.75 per US$1.00 in August 2010) to calibrate items for an important national exam (ENEM). Nevertheless, serious problems were reported during the pre-test, such as employees of a private high school allegedly making illegal copies of many items, which were subsequently released. In addition, the number of items available in the national bank was approximately 6 thousand, whereas the ideal number would be between 15 and 20 thousand. All these events were harshly criticized by the mainstream media [16–19]. Recent works have proposed optimal designs for item calibration to reduce costs and time [20, 21]. Still, there is a limit on the number of items that an examinee can properly answer within a period of time. If the examinees are allowed to choose v items from a total of V items (v < V) and if satisfactory statistical estimates are provided for all V items, the costs of calibration per item are reduced.
This paper presents a new IRT model to adjust data generated using an examinee choice allowed scenario. The proposed model incorporates network analysis information using Bayesian modelling in the IRT model. This paper evaluates the one-parameter logistic model, known as the Rasch model [22], which is the simplest and most widely used IRT model [1]. Future studies will aim to extend the proposed approach to more complex IRT models, such as the two- and three-parameter logistic models.
The results show substantial improvements in the accuracy of the estimated parameters compared to the standard Rasch model, mainly in situations in which examinee choice is different from a random selection of items. To the best of our knowledge, this study is the only proposal to date that achieves a satisfactory parameter estimation in critical scenarios reported in the literature [5].
Material and methods
The standard IRT model
The item characteristic curve of the Rasch model is given by Eq 1 [1]:

$$P(Y_{i\alpha} = 1 \mid \theta_\alpha, b_i) = \frac{e^{\theta_\alpha - b_i}}{1 + e^{\theta_\alpha - b_i}} \qquad (1)$$

where Yiα ∈ {0, 1} is a binary variable that indicates whether examinee α correctly answered item i; θα is the ability parameter of examinee α; bi is the difficulty parameter of item i; P(Yiα = 1|θα, bi) is the probability that a randomly chosen examinee with ability θα correctly answers item i; and the probability of an incorrect response is equal to P(Yiα = 0|θα, bi) = 1 − P(Yiα = 1|θα, bi).
In the Rasch model, bi represents the required ability for any examinee to have a 50% (0.50) chance of correctly answering item i. Given M examinees and V items, estimates for θα and bi in the IRT models are found using the likelihood shown in Eq 2 [1]:
$$L(\boldsymbol{\theta}, \mathbf{b} \mid \mathbf{y}) = \prod_{\alpha=1}^{M} \prod_{i=1}^{V} P(Y_{i\alpha} = 1 \mid \theta_\alpha, b_i)^{y_{i\alpha}} \left[1 - P(Y_{i\alpha} = 1 \mid \theta_\alpha, b_i)\right]^{1 - y_{i\alpha}} \qquad (2)$$

where θ = (θ1, …, θM) is the vector of the examinee abilities, b = (b1, …, bV) is the vector of the item difficulties and y = {yiα} is the vector of the observed responses. To use Bayesian estimation, the prior distributions f(θα) and f(bi) must be defined. Since f(θr, θs) = f(θr)f(θs) and f(bp, bq) = f(bp)f(bq) for r ≠ s and p ≠ q, the joint posterior distribution for parameters θ and b is given by Eq 3:

$$f(\boldsymbol{\theta}, \mathbf{b} \mid \mathbf{y}) \propto L(\boldsymbol{\theta}, \mathbf{b} \mid \mathbf{y}) \prod_{\alpha=1}^{M} f(\theta_\alpha) \prod_{i=1}^{V} f(b_i) \qquad (3)$$
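As a quick numerical check of Eqs 1 and 2, the item characteristic curve and the log-likelihood can be sketched in Python (a minimal illustration; function names are ours, not from the paper):

```python
import numpy as np

def rasch_prob(theta, b):
    """Eq 1: probability of a correct answer for ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def log_likelihood(theta, b, y):
    """Logarithm of the likelihood in Eq 2 for an M x V binary response matrix y."""
    p = rasch_prob(theta[:, None], b[None, :])  # M x V matrix of success probabilities
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# When ability equals difficulty, the success probability is exactly 0.50,
# matching the interpretation of b_i given in the text.
print(rasch_prob(0.0, 0.0))  # 0.5
```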
Furthermore, the ability and item parameters cannot be estimated simultaneously using maximum likelihood because of the scale indeterminacy [1, 23]. This problem can be solved by choosing arbitrary scales for either the ability or the item parameters. Using classic (frequentist) statistical analysis, the mean parameter of the abilities is usually set to zero. Using Bayesian modelling, the scale constraint can be imposed by means of the prior distribution of θ. Thus, each latent ability θα is assumed to come from a standard normal distribution. Further information regarding Bayesian estimation of IRT models can be found in [23]. A standard prior distribution for the item parameter is given by Eq 4 [23–26]:
$$b_i \sim N(\mu_b, \sigma_b^2) \qquad (4)$$
where N(μb, σb2) denotes a normal distribution with a mean of μb and a variance of σb2. One can use Markov chain Monte Carlo (MCMC) methods to sample from the posterior distribution of μb and σb2. Details about MCMC in the context of IRT are found in [25]. A BUGS (Bayesian inference Using Gibbs Sampling) [27] code for fitting the standard Rasch model is provided in S1 Table.
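For readers without a BUGS installation, the posterior sampling step can be illustrated with a small random-walk Metropolis sampler in Python. This is a didactic stand-in for the Gibbs sampling that BUGS performs, not the authors' code; the priors follow the text (standard normal for θα as the scale constraint, N(μb, σb2) for bi):

```python
import numpy as np

def log_posterior(theta, b, y, mu_b=0.0, var_b=10.0):
    """Unnormalized log posterior: Eq 2 likelihood, N(0,1) priors on theta
    and N(mu_b, var_b) priors on b (Eq 4)."""
    eta = theta[:, None] - b[None, :]
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - 0.5 * np.sum(theta**2) - 0.5 * np.sum((b - mu_b) ** 2) / var_b

def metropolis(y, n_iter=4000, step=0.05, seed=0):
    """Random-walk Metropolis over (theta, b); returns posterior-mean difficulties."""
    rng = np.random.default_rng(seed)
    M, V = y.shape
    theta, b = np.zeros(M), np.zeros(V)
    lp = log_posterior(theta, b, y)
    draws = []
    for t in range(n_iter):
        theta_prop = theta + step * rng.standard_normal(M)
        b_prop = b + step * rng.standard_normal(V)
        lp_prop = log_posterior(theta_prop, b_prop, y)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject step
            theta, b, lp = theta_prop, b_prop, lp_prop
        if t >= n_iter // 2:                      # discard burn-in
            draws.append(b.copy())
    return np.mean(draws, axis=0)
```

In practice one would also monitor chain autocorrelation and thin the draws, as done for the WinBugs runs described in the Results section.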
Inference under examinee choice design
When allowing examinee choice, the likelihood equation includes the missing data indicator vector r = {Riα}, where Riα = 1 if the response of examinee α for item i is observed; otherwise, Riα = 0. Given r, the set of responses y can be written as y = (yobs, ymis) [28], where yobs denotes the observed values and ymis denotes the missing values. Thus, the likelihood function given in Eq 2 is rewritten [5] as shown in Eq 5:

$$f(\mathbf{y}, \mathbf{r} \mid \boldsymbol{\theta}, \mathbf{b}, \boldsymbol{\xi}) = f(\mathbf{r} \mid \mathbf{y}, \boldsymbol{\xi}) \, f(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{b}) \qquad (5)$$

where ξ denotes the parameters of the process that generates the missing data, and the joint posterior distribution is given by Eq 6:

$$f(\boldsymbol{\theta}, \mathbf{b}, \boldsymbol{\xi}, \mathbf{y}_{mis} \mid \mathbf{y}_{obs}, \mathbf{r}) \propto f(\mathbf{r} \mid \mathbf{y}, \boldsymbol{\xi}) \, f(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{b}) \, f(\boldsymbol{\theta}) \, f(\mathbf{b}) \, f(\boldsymbol{\xi}) \qquad (6)$$
Bradlow and Thomas [5] showed that if examinees are allowed to choose items, then valid statistical inference for θ and b can be obtained using Eq 2 only if the assumptions in Eqs 7 and 8 hold:

$$f(\mathbf{r} \mid \mathbf{y}_{obs}, \mathbf{y}_{mis}, \boldsymbol{\xi}) = f(\mathbf{r} \mid \mathbf{y}_{obs}, \boldsymbol{\xi}) \qquad (7)$$

$$f(\mathbf{r} \mid \mathbf{y}, \boldsymbol{\theta}, \mathbf{b}, \boldsymbol{\xi}) = f(\mathbf{r} \mid \mathbf{y}, \boldsymbol{\xi}) \qquad (8)$$
Assumption (7) is known as the missing at random (MAR) assumption and implies that examinees are not able to distinguish which items they would likely answer correctly. Assumption (8) implies that examinees of different abilities generally do not broadly select easier or more difficult items. Further details are found in [5]. If both assumptions (7) and (8) hold, then the posterior distribution can be rewritten as shown in Eq 9:
$$f(\boldsymbol{\theta}, \mathbf{b} \mid \mathbf{y}_{obs}, \mathbf{r}) \propto f(\mathbf{y}_{obs} \mid \boldsymbol{\theta}, \mathbf{b}) \prod_{\alpha=1}^{M} f(\theta_\alpha) \prod_{i=1}^{V} f(b_i) \qquad (9)$$
In this case, it is assumed that the process that generates missing data is non-informative. Details about statistical inference in the presence of missing data are found in [28]. In Bayesian inference, ymis can be treated as a random quantity (details in [28]); thus, it can be sampled from the posterior distribution [25]. This sampling is possible because Eq 5 is an augmented data likelihood. Hereafter, it is assumed that if the examinees are randomly selecting items, then the unobserved values are MAR; otherwise, the process that generates missing data is informative, and the unobserved values are not MAR.
Wang, Wainer and Thissen [4] conducted an experiment called “choose one, answer all” to test whether the MAR assumption is empirically plausible. In the experiment, 225 students indicated which item in each pair of questions they would rather answer; however, they still had to answer all items. The results showed that the MAR assumption did not hold. For example, a particular pair of items (items 11 and 12) was presented to the students, in which one item (item 12) was much more difficult than the other (item 11). Only 20% of the examinees preferred to answer item 12. It was further observed that those students who had chosen item 12 performed far better on item 11, which indicated that they had made a disadvantageous choice. Moreover, the examinees who chose item 11 performed better on both items than those who chose item 12. Similar results were observed elsewhere [29, 30], suggesting that students with higher abilities are more likely to differentiate the difficulties of items.
Furthermore, Bradlow and Thomas [5] performed simulation studies to demonstrate the bias in the estimated parameter using the standard Rasch model in violation of assumptions (7) and (8). The complete data set had 5,000 examinees and 200 items. In the simulation study, 50 items were mandatory, and the remaining 150 items were divided into 75 choice pairs. In the experiment in which the first assumption (7) was violated, there occurred a consistent underestimation of item difficulty for the 50 mandatory items and more severe underestimation for the remaining 75 choice pairs. Furthermore, in the experiment in which the second assumption (8) was violated, there occurred overestimation of item difficulty for high-difficulty items and underestimation for low-difficulty items. The authors also stated that very little is known about the nature and magnitude of realistic violations of those assumptions.
Wang et al. [8] proposed the inclusion of a random effect parameter γα in the standard IRT models to account for the choice effect, as shown in Eq 10:
$$P(Y_{i\alpha} = 1 \mid \theta_\alpha, \gamma_\alpha, b_i) = \frac{e^{\theta_\alpha + \gamma_\alpha - b_i}}{1 + e^{\theta_\alpha + \gamma_\alpha - b_i}} \qquad (10)$$
Eq 10 is similar to Eq 1 with the inclusion of the choice effect parameter γα. In this model, the ability parameter θα is separated from the choice effect, and therefore, the abilities can be compared across different examinees. In this case, θ and γ follow a bivariate normal distribution. Supposedly, the inclusion of the choice parameter γα would make assumptions (7) and (8) valid. The proposed model produced better results than the standard IRT model in some simulated scenarios. Nevertheless, the authors state that if assumption (7) or assumption (8) is violated, as described in [5], valid statistical inferences are not obtained using the standard IRT model or the proposed model with the choice effect parameter.
Liu and Wang [31] also proposed a two-dimensional IRT model that includes a choice effect parameter γα. The model accounts for the selection of items using a nominal response model [32]. For example, let a test be composed of V items from which examinees are free to choose v items (v < V). The number of possible choices, denoted by K, is equal to the number of different combinations of choosing v of the V items, i.e., K = V!/[v!(V − v)!]. Let dα ∈ {1, …, K} be the index of the item choice combination for examinee α. According to the nominal response model, the probability of examinee α selecting the kth choice combination is defined as:

$$P(d_\alpha = k \mid \gamma_\alpha) = \frac{e^{a_k \gamma_\alpha + c_k}}{\sum_{k'=1}^{K} e^{a_{k'} \gamma_\alpha + c_{k'}}} \qquad (11)$$

where ak is the slope parameter and ck is the intercept parameter for choice k. The proposed model uses the vector of choice combinations d = (d1, …, dM) rather than the missing data indicator vector r. In this case, the joint distribution for the complete data is given by Eq 12:

$$f(\mathbf{y}, \mathbf{d} \mid \boldsymbol{\theta}, \boldsymbol{\gamma}, \mathbf{b}, \boldsymbol{\zeta}) = f(\mathbf{d} \mid \boldsymbol{\theta}, \boldsymbol{\gamma}, \boldsymbol{\zeta}) \, f(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{b}) \qquad (12)$$

where ζ = {ak, ck} is the set of parameters defined in Eq 11. Furthermore, the θ and γ vectors can be correlated. The probability density function (PDF) f(d | θ, γ, ζ) is known as the missingness distribution, and vector d is assumed to be conditionally independent of y and b, given θ and γ. The PDF f(y | θ, b) represents the standard IRT model given by Eq 1. The authors performed two simulation studies in which the proposed model yielded satisfactory parameter recovery. Nevertheless, the authors stated that the two simulation studies were rather simple and that the findings may not be generalizable to more complex situations.
The model proposed by Liu and Wang [31] has a limitation: its parameters can be estimated only if the number of possible choices (K) is relatively small. The authors indicate that the number of examinees should be at least 10 times the number of item parameters in order to achieve reliable estimates in the nominal response model equation. As will be shown, this paper evaluates a scenario in which examinees are allowed to choose 20 items from a total of 50 items. This scenario requires a nominal response model with 2 × C(50, 20) ≈ 9.43 × 10^13 item parameters (a slope and an intercept for each possible combination). Consequently, the minimum number of examinees required to estimate the model is 9.425842 × 10^14, which is impractical.
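The magnitude of this number is easy to verify (a short check; variable names are ours):

```python
from math import comb

V, v = 50, 20                    # total items and items chosen per examinee
K = comb(V, v)                   # number of possible choice combinations
n_params = 2 * K                 # a slope and an intercept per combination
min_examinees = 10 * n_params    # the 10-to-1 rule of thumb cited from [31]

print(K)               # 47129212243960, i.e., about 4.7e13 combinations
print(min_examinees)   # 942584244879200, i.e., about 9.425842e14 examinees
```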
Network information
Pena and Costa [33] proposed a novel representation of examinees and their selected items using network analysis. The data set was coded as layers, vertices (or nodes) and edges. Briefly, a network (or graph) consists of a set of V vertices (v1, …, vV) that identify elements of a system and a set of E edges that connect pairs of vertices {vi, vj}, pointing out the presence of a relationship between the vertices [34]. In the proposed network representation, each examinee is represented as a single network in which the V items are the vertices and every pair of the selected items is connected by an edge. That is, the data set is initially represented as M single-layer networks, i.e., a multilayer network with M layers [35]. From the multilayer network, two matrices are created.
The first matrix, called the overlapping matrix, O = [oij], is a weighted V × V matrix in which the elements oij indicate the number of examinees that chose both items i and j [35, 36]:
$$o_{ij} = \sum_{\alpha=1}^{M} a_{ij}^{[\alpha]} \qquad (13)$$

where a_ij[α] = 1 if examinee α chose both items i and j and a_ij[α] = 0 otherwise; 0 ≤ oij ≤ M, ∀ i, j. The second matrix, called matrix U = [uij], is a binary V × V matrix in which uij is equal to 1 if oij (Eq 13) is greater than a threshold (Ψ) and zero otherwise. The threshold is calculated to identify recurrent edges in the multilayer network and is given by Eq 14:
$$\Psi = \frac{E_T}{E_{max}} + Z_\gamma \sqrt{\frac{E_T}{E_{max}}\left(1 - \frac{1}{E_{max}}\right)} \qquad (14)$$
where ET is the total number of edges in the multilayer network, Emax = V(V − 1)/2 is the maximum number of edges in a single-layer network and Zγ is a z-score statistic (Z0.05 = 1.645). Under the null hypothesis, the statistical distribution of the number of incident edges is the same for every pair of vertices. Thus, Ψ represents the upper bound of the observed number of edges between vertices i and j under the hypothesis that the ET edges are randomly distributed. Further details are found in [33]. Therefore, matrix U is a binary matrix that preserves only the statistically significant edges, as shown in Eq 15:
$$u_{ij} = \begin{cases} 1, & \text{if } o_{ij} > \Psi \\ 0, & \text{otherwise} \end{cases} \qquad (15)$$
In [33], it was shown that the density of matrix U can be used to test whether the MAR assumption holds. The larger the density of matrix U is, the more violated the MAR assumption; that is, the density of matrix U indicates the violation level of the MAR assumption. The density of a network is given in Eq 16 [37]:
$$D = \frac{2E}{V(V - 1)} \qquad (16)$$

where E is the number of edges in matrix U and V is the number of vertices.
Moreover, in [33], several network centrality measures and their correlations with item difficulty when the MAR assumption is violated were evaluated. The most frequent centrality measures found in the literature [38] were tested. The eigenvector of matrix O was found to be the most consistent and robust network statistic to estimate item difficulty. The eigenvector centrality of a vertex i (ρi) is the i-th element of the first eigenvector of matrix O:
$$\mathbf{O} \boldsymbol{\rho} = \lambda \boldsymbol{\rho} \qquad (17)$$

where λ is the largest eigenvalue and ρ = (ρ1, …, ρV) is the first eigenvector of matrix O. Further details about eigenvector centrality are found in [39]. In addition, the simulation study presented in [33] indicates that the larger the MAR assumption violation is, the larger the correlation between the eigenvector centrality and the item difficulties. It is worth mentioning that the eigenvector centrality assumes values within the range 0–1. Therefore, it provides a standardized measure of vertex centrality.
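Under our reading of Eqs 13 and 15–17, the construction of matrices O and U, the density of U and the eigenvector centrality can be sketched as follows (the threshold Ψ is passed in as a given value; function names are ours):

```python
import numpy as np

def overlap_matrix(R):
    """Eq 13: O[i, j] = number of examinees choosing both items i and j,
    where R is the M x V binary choice matrix (R[a, i] = 1 if examinee a
    chose item i)."""
    O = R.T @ R
    np.fill_diagonal(O, 0)  # ignore self-pairs
    return O

def significant_edges(O, psi):
    """Eq 15: keep only the edges whose overlap count exceeds the threshold psi."""
    return (O > psi).astype(int)

def density(U):
    """Eq 16: realized edges of U over the V(V - 1)/2 possible pairs
    (U is symmetric, so numerator and denominator both count ordered pairs)."""
    V = U.shape[0]
    return U.sum() / (V * (V - 1))

def eigenvector_centrality(O):
    """Eq 17: leading eigenvector of the symmetric matrix O."""
    vals, vecs = np.linalg.eigh(O)
    return np.abs(vecs[:, np.argmax(vals)])  # fix the sign so entries are non-negative

# Toy example: 3 examinees choosing among 3 items.
R = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])
O = overlap_matrix(R)
U = significant_edges(O, psi=0)  # illustrative threshold
print(O[0, 1], density(U))
```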
The proposed model
In general, the relation between item difficulty bi and the first eigenvector of matrix O can be written as shown in Eq 18.
$$b_i = g(\rho_i) + \epsilon_i \qquad (18)$$

where g(ρi) is a function of the i-th element of the first eigenvector ρ and εi is a random error term. This paper proposes a new IRT model that takes into account the relation shown in Eq 18. This can be achieved by defining the following prior distribution:
$$b_i \sim N\!\left(g(\rho_i), \sigma_b^2\right) \qquad (19)$$

where the prior mean of bi is g(ρi) and σb2 accounts for the variability of bi that cannot be explained by g(ρi). It is worth mentioning that the larger the correlation between b and ρ is, the lower the dispersion parameter σb2. The posterior distribution, shown in Eq 6, can be rewritten as shown in Eq 20:
$$f(\boldsymbol{\theta}, \mathbf{b} \mid \mathbf{y}_{obs}, \mathbf{r}, \boldsymbol{\rho}) \propto f(\mathbf{y}_{obs} \mid \boldsymbol{\theta}, \mathbf{b}) \prod_{\alpha=1}^{M} f(\theta_\alpha) \prod_{i=1}^{V} f(b_i \mid \rho_i) \qquad (20)$$

where

$$f(\mathbf{r} \mid \mathbf{y}, \boldsymbol{\theta}, \mathbf{b}, \boldsymbol{\rho}) = f(\mathbf{r} \mid \boldsymbol{\rho}) \qquad (21)$$

Eq 21 assumes that, given ρ, the missing data indicator r is independent of ymis, θ and b. Further information about conditioning on covariates that makes the missingness mechanism negligible is found in [40, 41].
In this paper, the following mathematical model between b and ρ is proposed:

$$g(\rho_i) = \beta_0 + \beta_1 \ln\!\left(\frac{\rho_i - C + \beta_2}{1 - \rho_i + \beta_2}\right) \qquad (22)$$

where β0, β1 and β2 are coefficients to be estimated and C is the minimum value of ρi, i.e., C = min{ρ1, …, ρV}. This model represents the inverse equation of a logistic function with a shape that asymptotically tends to the lowest value of ρ. The β2 parameter is required to move the vector ρ below 1, keeping the denominator positive. Furthermore, the β2 parameter is also used to shift vector ρ above its lowest value, C, to prevent ln(0). This model was empirically proposed based on simulation studies, as shown in the Results section. It is worth mentioning that other functions to describe the relation between b and ρ can be proposed. The BUGS code for the proposed model is available in S2 Table.
Simulation study
To investigate the data behaviour under violations of assumptions (7) and (8), we performed several simulations using three different scenarios. In all scenarios, 1,000 examinees (α = 1, …, 1,000) choose 20 items from a total of 50 items (i = 1, …, 50). The examinee abilities (θα) and item difficulties (bi) were both generated from a standard normal distribution and held fixed. The complete data set can be represented by a 1,000 × 50 matrix of binary responses. For each examinee α and item i, the probability of a correct answer was calculated using Eq 1. This probability was used to generate the outcome Yiα using a Bernoulli random number generator.
In the first scenario, hereafter named scenario 1, both assumptions (7) and (8) were valid. That is, the items were randomly selected by the examinees, and consequently, the process that causes missing data was non-informative. Therefore, this is the scenario in which valid statistical inference can be obtained using the standard Rasch model.
The second scenario, named scenario 2, is identical to the first simulation scenario presented in [33]. In this scenario, each examinee chooses items based on current values of θα and bi. That is, for each examinee α, the items are divided into two groups: the first group comprises items that are easier than the examinee ability, i.e., bi ≤ θα. This is the group in which the examinee has a probability of more than 0.50 to achieve a correct answer. The second group comprises items that are more difficult than the examinee ability, i.e., bi > θα. In this group, the examinee has a probability of less than 0.50 to answer the items correctly. A weight value (wi) is assigned to the items in each group. For items in group 2, the weight value is wi[2] = 1; for items in group 1, the weight value varies from 1.5 to 30: wi[1] ∈ {1.5, 2, 5, 10, 30}. For example, if wi[1] = 2, then it can be said that the items in group 1 have double the chance of being selected by the examinee than the items in group 2. In this scenario, assumptions (7) and (8) are violated.
Finally, in the third scenario, named scenario 3, the examinee choice depends on ymis. That is, assumption (7) is violated. Similar to scenario 2, for each examinee α the items are divided into two groups. In scenario 3, the first group comprises items that were correctly answered by the examinee in the complete data set (yiα = 1), and the second group contains the items that the examinee failed (yiα = 0). Similar to scenario 2, for items in group 2, the weight value is wi[2] = 1; for items in group 1, the weight value varies from 1.5 to 30: wi[1] ∈ {1.5, 2, 5, 10, 30}. This scenario is similar to the second simulation study described in [5].
Table 1 summarizes the proposed simulations. Each configuration was replicated 100 times.
It is worth mentioning that the selected items were generated using a multinomial probability distribution. That is, the probability of examinee α selecting item i is:
$$P(\text{examinee } \alpha \text{ selects item } i) = \frac{w_{i\alpha}}{\sum_{j=1}^{V} w_{j\alpha}} \qquad (23)$$
In scenario 1, wiα = 1 ∀ i.
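The generation of the three scenarios can be sketched in Python (a sketch under our reading of the design; selection without replacement approximates the multinomial scheme of Eq 23, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
M, V, v = 1000, 50, 20                 # examinees, items, items chosen per examinee

theta = rng.standard_normal(M)         # fixed abilities
b = rng.standard_normal(V)             # fixed difficulties
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # Eq 1
y_full = rng.binomial(1, p)            # complete 1000 x 50 response matrix

def choose_items(alpha, scenario, w1=10.0):
    """Indices of the v items selected by examinee alpha.

    Scenario 1: equal weights. Scenario 2: weight w1 for items with
    b_i <= theta_alpha (group 1). Scenario 3: weight w1 for items the
    examinee answered correctly in the complete data set."""
    w = np.ones(V)
    if scenario == 2:
        w[b <= theta[alpha]] = w1
    elif scenario == 3:
        w[y_full[alpha] == 1] = w1
    return rng.choice(V, size=v, replace=False, p=w / w.sum())

chosen = choose_items(0, scenario=2)
print(len(chosen))  # 20
```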
Results and discussion
In this section, we first present empirical evidence of the proposed function, given in Eq 22, to describe the relation between b and ρ. Second, we define values for the variance parameter of the prior distribution (σb2). Prior values of the σb2 parameter improve the statistical properties of the proposed model. Finally, several simulation studies are performed under different conditions to compare the accuracy of parameter recovery obtained using the standard Rasch model and using our proposed model.
Empirical validation of the proposed model
To evaluate the performance of the proposed function (Eq 22) to predict item difficulty, using the first eigenvector of matrix O, we performed 10,000 Monte Carlo simulations for scenarios 2 and 3 and calculated the residual sum of squares (RSS), as shown in Eq 24.
$$RSS = \sum_{i=1}^{V} \left(b_i - \hat{b}_i\right)^2 \qquad (24)$$

where b̂i is the fitted value of bi obtained using Eq 22. It is worth mentioning that the smaller the value of the RSS is, the better the fit of the model and the lower the bias of the estimates. The β0, β1 and β2 coefficients were estimated using nonlinear least squares in the software R [42, 43]. Table 2 shows the minimum, mean, maximum and standard deviation values of the RSS for the different weight values used in scenarios 2 and 3. In both scenarios, the larger the weights are, the lower the RSS. Thus, there is empirical evidence that the proposed function provides a better fit in situations in which the MAR assumption is more strongly violated.
Fig 1A shows the density of matrix U versus the RSS, using data generated from scenario 2. Fig 1B shows the density of matrix U versus the RSS, using data generated from scenario 3. In general, the larger the density of matrix U is, the lower the RSS. In [33], it was shown that the larger the density of matrix U is, the stronger the MAR assumption violation. Therefore, the density of matrix U can be used as a predictor of the goodness-of-fit statistic of the proposed model, as shown in Eq 22.
(A) Scenario 2. (B) Scenario 3.
We propose using the observed density value of matrix U to choose different values for σb2 in the prior distribution (Eq 19). Lower prior values for σb2 mean that the posterior distribution of bi will be concentrated towards its mean, g(ρi); i.e., the posterior estimate of bi is mostly defined by the proposed eigenvector centrality function. In contrast, the larger the prior value of σb2 is, the less the posterior estimate of bi is affected by the eigenvector centrality function.
Based on the results shown in Fig 1 and further simulation studies, the proposed values of σb2 are shown in Table 3. As previously mentioned, the lower the density of matrix U is, the larger the prior value of σb2. Furthermore, if the density of matrix U is within 0.0–0.1, there is empirical evidence that the MAR assumptions holds. In this case, a large variance value is chosen in order to make the prior distribution for bi less informative. Future work includes exhaustive simulation studies to propose a mathematical function that relates the density of matrix U and the prior value of σb2, thus further minimizing the bias of the proposed model.
Fig 2 illustrates the proposed function. It shows the plot of bi versus ρi for 10 Monte Carlo simulations. The black circles represent the data generated from scenario 1 (wi = 1), in which there is no correlation between b and ρ. With the exception of the black circles, Fig 2A shows the data generated from scenario 2, and Fig 2B shows the data generated from scenario 3 using three different weight values, wi ∈ {2, 5, 30}. The proposed function was adjusted using the mean value of ρi for each simulated weight and is represented with black lines. In both scenarios, the proposed function achieved a good data fit.
(A) Scenario 2. (B) Scenario 3. The black lines represent the proposed function (Eq 22).
Accuracy of the estimated parameters
To evaluate the performance of the proposed model compared to the standard Rasch model, 100 Monte Carlo simulations were performed under the three scenarios previously described. Scenario 1 is used as the reference scenario since in this scenario, valid statistical inferences can be obtained using the standard Rasch model [5]. The accuracy of the estimated parameters is evaluated using the bias, the square root of the mean square error (RMSE) and the square root of the maximum square error (RSEMAX), which are given by Eqs 25, 26 and 27, respectively.
$$Bias_i = \frac{1}{N} \sum_{n=1}^{N} \left(\hat{b}_i^{(n)} - b_i\right) \qquad (25)$$

$$RMSE_i = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(\hat{b}_i^{(n)} - b_i\right)^2} \qquad (26)$$

$$RSEMAX_i = \sqrt{\max_{n=1,\ldots,N} \left(\hat{b}_i^{(n)} - b_i\right)^2} \qquad (27)$$

where bi is the true difficulty of item i, b̂i(n) is the estimated value of bi in the n-th Monte Carlo simulation and N is the number of simulations. In addition, 95% highest probability density (HPD) intervals [44] are provided for each estimated difficulty. Using the HPD intervals, the proportion of simulated parameters that fall within the intervals is calculated. It is worth mentioning that the posterior mean, i.e., the mean of the posterior distribution, was used as the point estimate of bi [45].
The standard Rasch model and the proposed model were estimated using the BUGS codes shown in S1 and S2 Tables. These BUGS codes can be run using any of the following free software programs: WinBugs [46], OpenBugs [47] and JAGS [48]; the WinBugs [46] software was used here. The MCMC simulations used one chain, 10,000 iterations, a burn-in period of 1,000 iterations and a thinning interval of 5 iterations. The variance of the prior distribution was set to 10 for the standard Rasch model and was selected according to Table 3 for the proposed model. The initial values of the parameters were as follows: β0 = 0, β1 = -1.5, β2 = 0.01, μb = 0, bi = 0 ∀ i, and θα = 0 ∀ α. The convergence of the model was evaluated using the autocorrelation of the chains, which was near zero, and the trace plots, which indicated that convergence was achieved. The average time for one run was 15.10 minutes for the standard Rasch model and 15.40 minutes for the proposed model using an Intel Core i7 processor with 2 GHz and 8 GB of RAM. Further details about the model specifications in WinBugs can be found in [49].
Fig 3 shows the differences between the estimates obtained using the standard Rasch model and the true item difficulties, plotted against the true item difficulties. Fig 3A shows the results using data generated from scenario 1. Fig 3B and 3C show the results using data generated from scenario 2 and scenario 3, respectively, using a weight value equal to 10. Fig 3A shows that the residues (the estimated minus the true difficulties) behave as a random cloud around zero. In contrast, Fig 3B shows that the dispersion of the residues around zero is larger for more difficult items. Fig 3C shows a severe underestimation of the item difficulties. Fig 4 shows the estimated 95% HPD intervals and the true item difficulties (red points). The HPD intervals obtained using the data sets generated from scenario 1 (Fig 4A) and scenario 2 (Fig 4B) contain most of the true item difficulties. It is worth noting that some of the HPD intervals shown in Fig 4B are larger than those in Fig 4A. In contrast, none of the HPD intervals shown in Fig 4C (scenario 3) contain the true item difficulties, which reveals a serious model fitting problem.
(A) Data set generated from scenario 1. (B) Data set generated from scenario 2 using a weight value equal to 10. (C) Data set generated from scenario 3 using a weight value equal to 10.
The red points represent the true item difficulties. (A) Data set generated from scenario 1. (B) Data set generated from scenario 2 using a weight value equal to 10. (C) Data set generated from scenario 3 using a weight value equal to 10.
The empirical results showed that the fitting problem reported in scenario 3 cannot be solved by merely replacing the standard Rasch model with the proposed model. Although the variability of the bias is smaller than in scenario 2, the estimated difficulties are consistently shifted towards lower values. We suggest fixing this problem by fitting the proposed model in two stages. In the first stage, the examinees are free to choose v of the V items. In the second stage, a few of those V items are set as mandatory (vc), and the examinees are required to answer these vc items again. Thus, the vc items are estimated twice: in the first stage, the vc items are adjusted using our proposed method; in the second stage, the complete responses to the vc items are adjusted separately using the standard IRT model. The average difference between the difficulties estimated in the second and first stages for the vc items is then added to the first-stage difficulty estimates of all V items. This procedure adds a constant value to the estimates obtained in the first stage, thus correcting the bias shift in scenario 3. An alternative to the second-stage adjustment is simply to include vc pre-tested items in the questionnaire, i.e., items for which the parameter bi is known in advance. In this case, the average difference is calculated using the pre-calibrated values of bi for the vc items, and mandatory items are not required.
In our simulation study, two items were set as mandatory to perform this calibration. The results show that two items were sufficient to correct the bias shift. Future work includes more exhaustive simulation studies to define the number of vc items in the second stage. It is worth mentioning that this procedure is not necessary in scenario 2 since the residues are already centered at zero. Nevertheless, in a real situation, it is not possible to identify whether the choice was made following scenario 2 or scenario 3. Nonetheless, the procedure can be applied in both scenarios.
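The two-stage correction amounts to adding a single constant to all first-stage estimates; it can be sketched as follows (a sketch; names are ours):

```python
import numpy as np

def shift_correction(b_stage1, b_anchor_stage1, b_anchor_ref):
    """Two-stage bias correction described in the text.

    b_stage1:        first-stage difficulty estimates for all V items.
    b_anchor_stage1: first-stage estimates for the v_c anchor items.
    b_anchor_ref:    second-stage (complete-data) estimates of the same
                     anchors, or their pre-calibrated values.
    The mean anchor difference is added, as a constant, to every item."""
    delta = np.mean(b_anchor_ref - b_anchor_stage1)
    return b_stage1 + delta

# Toy example: first-stage estimates shifted down by 0.5 relative to the anchors.
corrected = shift_correction(np.array([-1.0, 0.0, 1.0]),
                             np.array([-1.0, 0.0]),
                             np.array([-0.5, 0.5]))
print(corrected)  # [-0.5  0.5  1.5]
```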
Table 4 shows the parameter accuracy for scenario 1 using the standard Rasch model and the proposed model. As previously mentioned, the standard Rasch model achieves valid statistical inference in this scenario and is used as the accuracy reference for the estimated parameters. The reference model has minimum and maximum bias of -0.2542 and 0.1648, respectively. The mean and maximum values of the RMSE are 0.1225 and 0.2814, respectively. RSEMAX represents the largest absolute difference between the true and estimated difficulty for each item; its mean value is 0.3251, and its maximum value, i.e., the largest difference between true and estimated item difficulties in scenario 1, is 0.7342. The average width of the 95% HPD interval is 0.4996, and the maximum width is 0.8198. Finally, on average, 95.30% of the true item difficulties fall inside the 95% HPD interval, and the minimum proportion is 82%.
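The accuracy measures used throughout Tables 4–7 can be made concrete with a short sketch. The helper below follows the definitions given in the text (per-item bias, RMSE, and RSEMAX as the largest absolute deviation across replications); it is an illustrative reading of those definitions, not the authors' code, and assumes NumPy.

```python
import numpy as np

def item_recovery_metrics(b_true, b_est):
    """Per-item recovery metrics over simulation replications.

    b_true : shape (I,)   true item difficulties
    b_est  : shape (R, I) estimated difficulties over R replications
    """
    err = np.asarray(b_est) - np.asarray(b_true)   # residues per replication
    bias = err.mean(axis=0)                        # per-item average bias
    rmse = np.sqrt((err ** 2).mean(axis=0))        # per-item RMSE
    rsemax = np.abs(err).max(axis=0)               # largest absolute deviation
    return bias, rmse, rsemax
```

The summary values reported in the tables (mean and maximum) would then be taken over the per-item vectors returned here.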
To test whether the proposed model fits the data generated from scenario 1, Student’s t-tests were performed to compare its results with those of the standard Rasch model, as suggested by one referee. The p-value obtained using the bias as the dependent variable was 0.8940; using the RMSE, 0.9343; and using the RSEMAX, 0.4239. Thus, assuming a significance level of 5%, the proposed model is statistically similar to the standard Rasch model regarding the average bias, RMSE and RSEMAX of the item difficulties when the MAR assumption is valid. This occurs because, when the density of matrix U is close to zero, the variance of the prior distribution is sufficiently large, making the prior distribution less informative. It is worth noting that the maximum value of RSEMAX is lower for the proposed model (0.5715) than for the standard Rasch model (0.7342). Even though the data sets were generated using a random selection of items, a correlation between the eigenvector and the item difficulties can eventually occur in a few simulations, which may explain this lower maximum. Table 5 shows the accuracies of the estimated parameters for scenario 2 using the standard Rasch model and the proposed model with five levels of MAR assumption violation. The standard Rasch model results are close to those of scenario 1 for the lower weight values, wi = 1.5 and wi = 2. For larger levels of MAR assumption violation, wi = 5, wi = 10 and wi = 30, the RMSE and RSEMAX values differ from those observed in scenario 1, especially in terms of the maximum values. In contrast, the results of the proposed model remain close to those observed in scenario 1, even for the larger weights.
Furthermore, our proposed model presented a narrower 95% HPD width and a larger average proportion of the true item difficulties inside the HPD intervals.
Three univariate two-way analyses of variance (ANOVA) were conducted to further investigate whether the observed differences between the two models are statistically significant in scenario 2. The bias, RMSE and RSEMAX were used as dependent variables, and the estimation method (standard Rasch model or proposed model) and the weight magnitude (1.5, 2, 5, 10, 30) were used as factors. The results are shown in Table 6. The null hypothesis assumes that the mean value of the dependent variable does not change across models and/or weights. Assuming a significance level of 5%, it can be concluded that the proposed model differs significantly from the standard Rasch model regarding the RMSE and RSEMAX in scenario 2, i.e., the p-values are smaller than 5%, thus rejecting the null hypothesis.
Table 7 shows the results obtained for scenario 3. In general, the RMSE and RSEMAX values for both models are larger than those values observed in scenario 2. The results of the proposed model seem to differ from those values observed in scenario 1 only for wi = 30, particularly the range of the bias and the maximum value of RMSE. Nevertheless, the proposed model seems to provide better results than the standard Rasch model, especially for larger levels of MAR assumption violation (wi = 5, wi = 10, wi = 30).
In addition, it is worth noting that the largest difference between true and estimated item difficulties is 0.7082 in the proposed model (shown in Table 7 for weight equal to 30), whereas the largest difference using the standard Rasch model is 1.3339 (shown in Table 5 for weight equal to 30). This improvement is significant since the scale of the item difficulties is based on a standard normal distribution.
Similar to scenario 2, three univariate two-way ANOVAs were performed to investigate whether the observed differences between the two models are statistically significant in scenario 3. The results are shown in Table 8. Assuming a significance level of 5%, it can be concluded that the weight value had a significant effect on the bias, RMSE and RSEMAX. Furthermore, the interaction between the estimation method and the weight values was statistically significant for RMSE and RSEMAX (p-value smaller than 5%).
Fig 5 shows the RSEMAX differences among the evaluated models for scenario 3 using different weights. The results of the proposed model are represented by the blue boxplots, and the standard Rasch model results are represented by the red boxplots. Fig 5 also includes the RSEMAX results obtained using the standard Rasch model in scenario 1 (reference), represented by the green boxplot. For large weight values (5, 10 and 30), the proposed model achieves RSEMAX values much closer to the reference scenario than does the standard Rasch model.
The blue boxplot represents the results from the proposed model, and the red boxplot represents the results from the standard Rasch model. The green boxplot is used as a reference and was obtained from the adjustment of the standard Rasch model in the data of scenario 1. In the boxplots, the lower outer contour of the rectangle indicates the first quartile (Q1), the upper outer contour of the rectangle indicates the third quartile (Q3), and the horizontal line inside the rectangle indicates the median (Q2). The vertical lines extended from the box indicate variability outside the first and third quartiles. The upper vertical line indicates the maximum observed value within the range [Q3; Q3+1.5×(Q3-Q1)]. The lower vertical line indicates the minimum observed value within the range [Q1; Q1-1.5×(Q3-Q1)]. Observations beyond the vertical lines are represented as points and indicate outliers.
Finally, the simulated results using different scenarios, weights and methods were simultaneously compared with the standard Rasch model results for scenario 1 using Dunnett’s test [50], a multiple t-test comparison procedure that compares the simulation results with a reference (here, the standard Rasch model results for scenario 1). Three tests were evaluated, using the bias, the RMSE and the RSEMAX as dependent variables. Table 9 shows the Dunnett’s test results. Using a statistical significance level of 5%, the standard Rasch model results differed significantly from the reference regarding RMSE and RSEMAX in the following cases: (a) scenario 2 with a weight of 30, (b) scenario 3 with a weight of 5, (c) scenario 3 with a weight of 10 and (d) scenario 3 with a weight of 30. Using bias as the dependent variable, the results differed significantly in scenario 3 with weights of 10 and 30. Using RMSE as the dependent variable, our proposed model differed significantly from the reference only in scenario 3 with a weight of 30. It is worth mentioning that for larger weights, such as 5, 10 and 30 in scenario 2, the larger p-values indicate that our proposed model was closer to the reference than the standard Rasch model regarding RMSE and RSEMAX.
Fig 6 shows the differences between the proposed model and the standard Rasch model using a data set generated from scenario 2 with a weight equal to 10 (the same data set used in Fig 3B). Fig 6A presents the residues of the standard Rasch model, and Fig 6B presents the residues of the proposed model. The residues in Fig 6B resemble a random cloud centred at zero, as observed in Fig 3A. Furthermore, the differences between the true and estimated item difficulties are, in general, lower for the proposed model.
The differences between estimated and true item difficulties are plotted against the true item difficulties. (A) Estimated values using the standard Rasch model. (B) Estimated values obtained using the proposed model.
Fig 7 shows the differences between the proposed model and the standard Rasch model using the same data presented in Fig 3C. The proposed two-stage procedure is able to centre the residues at zero. That is, the severe underestimation problem observed in Fig 3C was fixed. Fig 7A shows the results of the standard Rasch model. It is worth noting that these results are the previously estimated results shown in Fig 3C, with the addition of the constant shift. Fig 7B shows the results using the proposed model. The residues of the proposed model are closer to a random cloud centred at zero, as observed in Fig 3A, than the residues of the standard Rasch model.
The differences between the estimated and true item difficulties are plotted against the true item difficulties. (A) Estimated values using the standard Rasch model. (B) Estimated values using the proposed model.
It is worth mentioning that examinee abilities can be obtained using Bayesian or classical statistical modelling by assuming the item parameter estimates to be known values [1] and then estimating the examinee abilities. That is, the estimates of θα can be calculated considering bi to be a known value in Eq 2. In fact, our proposed model provides better estimates of bi than the standard model. However, if examinees are free to choose an arbitrary number of items from a questionnaire, the set of selected items may not be appropriate to properly estimate the ability parameter θα. For example, suppose that an examinee's ability is θα = 1.5 but the selected items all have difficulty parameters (bi) below 1. Consequently, the amount of information available to estimate θα is poor [1]. This problem can be overcome by introducing mandatory items. The minimum number of mandatory items is an open research question and will be investigated in future research.
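The second step of this procedure, estimating θα with the item difficulties held fixed, can be sketched as a Newton–Raphson maximum-likelihood routine under the Rasch model. The sketch below is illustrative (names are ours, not the authors' software) and assumes a mixed response pattern; the MLE diverges when an examinee answers every item correctly or incorrectly, a case a production implementation would need to handle. The Fisher information accumulated in `info` is also the quantity that is poor when all selected items are far below the examinee's ability.

```python
import math

def rasch_ability_mle(responses, b, iters=50):
    """MLE of ability theta under the Rasch model, treating the item
    difficulties b as known (calibrated) values.

    responses : list of 0/1 answers to the selected items
    b         : matching list of item difficulties
    """
    theta = 0.0
    for _ in range(iters):
        # P(correct) for each answered item at the current theta
        p = [1.0 / (1.0 + math.exp(-(theta - bi))) for bi in b]
        grad = sum(u - pi for u, pi in zip(responses, p))  # score function
        info = sum(pi * (1.0 - pi) for pi in p)            # Fisher information
        if info < 1e-10:
            break
        theta += grad / info                               # Newton step
    return theta
```

A usage example: `rasch_ability_mle([1, 0, 1], [-1.0, 0.0, 1.0])` returns the ability at which the expected number of correct answers matches the observed two correct responses.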
In contrast, if the purpose of the questionnaire is to calibrate items for an item bank rather than estimating the examinee abilities, then the presence of a few mandatory items is required to allow the second stage adjustment in our proposed model. The simulation results showed that a small number of mandatory items, such as two or three mandatory items, is sufficient to provide reliable estimates of the item parameters using the proposed model. As previously described, if some pre-tested items are included in the questionnaire, no mandatory items are required.
Conclusion
This paper presents a new IRT model to estimate item difficulties using questionnaire data in which examinees are allowed to choose a subset of items. Using network analysis, new information is incorporated into the model. The results are presented for three simulation scenarios. In the first scenario, the assumptions required to apply the standard Rasch model are met; this scenario is used as the reference. The second and third scenarios include five increasing levels of violating those assumptions. These scenarios are reported in the IRT literature, and to the best of our knowledge, no existing proposal has provided satisfactory results for them.
In both scenarios 2 and 3, the standard Rasch model is robust for lower levels of violations. The proposed model performs better in both scenarios, especially for larger levels of violations. Furthermore, the proposed model is able to achieve estimated parameters that are closer to the reference values for both scenarios and different levels of violations, except for the largest violation level in scenario 3. As our main conclusion, we strongly recommend the use of the proposed model over the standard Rasch model when allowing examinees to choose items.
To overcome further technical issues that prevent allowing examinee choice in practical situations, further investigations are required. For instance, real data set analyses are a crucial point for understanding how examinees actually make their choices. Experiments in which examinees are required to indicate which subset of items they would rather answer, despite all items being mandatory, could be used. Furthermore, some important issues such as the number of items and which items should be mandatory (or pre-tested) for the two-stage adjustment were beyond the scope of this paper and require further investigation. Finally, the proposed approach requires further evaluation and research in order to be used in more complex IRT models.
Supporting information
S1 Table. BUGS code for the standard Rasch model.
This table shows the BUGS code for the standard Rasch model estimation using the MCMC method. In general, the BUGS code comprises the likelihood function (lines 2–5) and the prior distributions (lines 6–9). It is worth mentioning that the BUGS language parameterizes the normal distribution by its precision rather than its variance; thus, line 10 defines the precision as a function of the standard deviation. Further details about BUGS code for fitting IRT models can be found in [26].
https://doi.org/10.1371/journal.pone.0191600.s001
(DOCX)
S2 Table. BUGS code for the proposed model.
This table shows the BUGS code for the proposed model estimation. Lines 2–5 comprise the likelihood function, lines 6–8 show the hyperparameter prior distributions (β0, β1 and β2), and lines 9–14 show the parameter prior distributions (θα and bi). It is worth noting that line 8 shows that the prior density distribution of β2 was truncated on the left at 0.0001 in order to avoid division by zero in Eq 22, that is, β2 > 0. Furthermore, in line 12, “rho[i]” represents ρi, and in line 13, "C" represents the lowest value of ; both values are data driven.
https://doi.org/10.1371/journal.pone.0191600.s002
(DOCX)
References
- 1.
Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of Item Response Theory. Newbury Park, CA: Sage; 1991.
- 2. Wainer H, Wang XB, Thissen D. How well can we compare scores on test forms that are constructed by examinee choice? Journal of Educational Measurement. 1994; 31:183–199.
- 3. Fitzpatrick AR, Yen WM. The psychometric characteristics of choice items. Journal of Educational Measurement. 1995; 32:243–259.
- 4. Wang X, Wainer H, Thissen D. On the viability of some untestable assumptions in equating exams that allow examinee choice. Applied Measurement in Education. 1995; 8:211–225.
- 5. Bradlow ET, Thomas N. Item response theory models applied to data allowing examinee choice. Journal of Educational and Behavioral Statistics. 1998; 23(3): 236–243.
- 6.
Linn RL, Betebenner DW, Wheeler KS. Problem choice by test takers: Implication for comparability and construct validity (CSE Technical Report 485). Los Angeles, CA: University of California, Los Angeles. 1998.
- 7. Cankar G. Allowing Examinee Choice in Educational Testing. Metodološki zvezki. 2010; 2(7):151–166.
- 8. Wang WC, Jin KY, Qiu XL, Wang L. Item Response Models for Examinee-Selected Items. Journal of Educational Measurement. 2012; 49(4):419–445.
- 9.
Brigham TA. Some effects of choice on academic performance. In Perlmuter L.C. and Monty R.A., editors, Choice and perceived control. Hillsdale, NJ: Lawrence Erlbaum. 1979.
- 10. Baldwin TT, Magjuka RJ, Loher BT. The perils of participation: effects of choice of training on trainee motivation and learning. Personnel Psychology. 1991; 44: 51–65.
- 11. Cordova DI, Lepper MR. Intrinsic motivation and the process of learning: beneficial effects of contextualization, personalization, and choice. Journal of Educational Psychology. 1996; 88:715–30.
- 12. Siegler RS, Lemaire P. Older and younger adults’ strategy choices in multiplication: testing predictions of ASCM using the choice/no-choice method. Journal of Experimental Psychology: General. 1997; 126: 71–92.
- 13. Jennings M, Fox J, Graves B, Shohamy E. The test-takers’ choice: An investigation of the effect of topic on language-test performance. Language Testing. 1999; 16, 426–456.
- 14. Rocklin TR, O’Donnell AM, Holst PM. Effects and underlying mechanisms of self-adapted testing. Journal of Educational Psychology. 1995; 87: 103–116.
- 15. Powers DE, Bennett RE. Effects of allowing examinees to select questions on a test of divergent thinking. Applied Measurement in Education. 2000; 12: 257–279.
- 16.
Agência Brasil. Custos para pré-testar itens do Enem podem chegar a R$6 milhões. O Globo. 2010. https://oglobo.globo.com/sociedade/educacao/custos-para-pre-testar-itens-do-enem-podem-chegar-r-6-milhoes-2965286
- 17.
Moura RM. Inep amplia pré-teste do Enem e valor do contrato tem aumento de 559%. O Estado de São Paulo. 2010. http://www.estadao.com.br/noticias/geral,inep-amplia-pre-teste-do-enem-e-valor-do-contrato-tem-aumento-de-559-imp-,594468
- 18.
Borges P. Provas do pré-teste do Enem foram fotografadas no Christus. IG. 2011. http://ultimosegundo.ig.com.br/educacao/enem/provas-do-preteste-do-enem-foram-fotografadas-no-christus/n1597343813746.html
- 19.
S.P., Mandelli M. Banco de questões tem menos de 50% do necessário, diz Ministro. O Estado de São Paulo. 2011. http://www.estadao.com.br/noticias/geral,banco-de-questoes-tem-menos-de-50-do-necessario-diz-ministro-imp-,791524
- 20. Van der Linden WJ, Ren H. Optimal Bayesian Adaptive Design for Test-Item Calibration. Psychometrika. 2014; 1–26.
- 21. Lu H-Y. Application of Optimal Designs to Item Calibration. PLoS ONE. 2014; 9(9): e106747. pmid:25188318
- 22.
Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research; 1960.
- 23.
Fox JP. Bayesian Item Response Modeling: Theory and Applications. New York, NY: Springer. 2010. ISBN 978-1-4419-0741-7.
- 24. Albert JH. Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics. 1992; 17: 251–269.
- 25. Patz RJ, Junker BW. Applications and Extensions of MCMC in IRT: Multiple Item Types, Missing Data, and Rated Responses. Journal of Educational and Behavioral Statistics. 1999; 24(4):342–366.
- 26. Curtis SM. Bugs code for item response theory. Journal of Statistical Software. 2010; 36: 1–34.
- 27. Gilks WR, Thomas A, Spiegelhalter DA. A Language and Program for Complex Bayesian Modelling. The Statistician. 1994; 43:169–178.
- 28. Rubin DB. Inference and missing data. Biometrika. 1976; 63(3): 581–592.
- 29.
Chi MTH. Knowledge structures and memory development. In Siegler R. S. (Ed.). Children’s thinking: What develops? Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. 1978: 73–96.
- 30.
Chi MTH, Glaser R, Rees E. Expertise in problem solving. In Sternberg R. J. (Ed.). Advances in psychology of human intelligence. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. 1982: 7–70.
- 31. Liu CW, Wang WC. Non-ignorable missingness item response theory models for choice effects in examinee-selected items. British Journal of Mathematical and Statistical Psychology. 2017; 70(3): 499–524. pmid:28390145
- 32. Bock RD. Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika. 1972; 37(1): 29–51.
- 33.
Pena CS, Costa MA. A network approach to deal with the problem of examinee choice under item response theory. Doctoral Dissertation, Graduate Program in Production Engineering: University of Minas Gerais. 2016.
- 34.
Barrat A, Barthélemy M, Vespignani A. Dynamical processes in complex networks. Cambridge University Press; 2008. ISBN 978-0-521-87950-7.
- 35. Battiston F, Nicosia V, Latora V. Structural measures for multiplex networks. Physical Review E. 2014; 89(3):032804.
- 36. Bianconi G. Statistical mechanics of multiplex ensembles: Entropy and overlap. Phys. Rev. E. 2013; 87:062806.
- 37.
Lewis TG. Network Science: Theory and Applications. New Jersey: John Wiley and Sons; 2009.
- 38. Batool K, Niazi MA. Towards a methodology for validation of centrality measures in complex networks. PLoS ONE. 2014; 9(4):e90283.
- 39. Bonacich P. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology. 1972; 2: 113–120.
- 40.
Little RJA, Rubin DB. Statistical analysis with missing data. New York, NY: Wiley and Sons; 1987.
- 41. Smeeth L. What is the difference between missing completely at random and missing at random? International journal of epidemiology. 2014. ISSN 0300-5771 pmid:24706730
- 42.
Fox J, Weisberg S. Nonlinear Regression and Nonlinear Least Squares in R. An appendix to an R companion to applied regression, 2 ed. 2010.
- 43.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria; 2015. URL https://www.R-project.org/.
- 44. Chen MH, Shao QM. Monte Carlo estimation of Bayesian credible and HPD intervals. Journal of Computational and Graphical Statistics. 1999; 8(1): 69–92.
- 45. Kieftenbeld V, Natesan P. Recovery of Graded Response Model Parameters: A Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo Estimation. Applied Psychological Measurement. 2012; 36(5): 399–419.
- 46. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS: A Bayesian Modelling Framework: Concepts, Structure, and Extensibility. Statistics and Computing. 2000; 10: 325–337.
- 47. Thomas A, O’Hara B, Ligges U, Sturtz S. Making BUGS Open. R News, 2006; 6(1): 12–17. URL http://CRAN.R-project.org/doc/Rnews/.
- 48.
Plummer M. JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing. Austria, Vienna: In K Hornik, F Leisch, A Zeileis (eds.). 2003. ISSN 1609-395X, URL http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/.
- 49.
Spiegelhalter D, Thomas A, Best N, Lunn D. WinBUGS User Manual Version 1.4. MRC Biostatistics Unit. 2003. URL http://www.mrc-bsu.cam.ac.uk/wp-content/uploads/manual14.pdf.
- 50. Dunnett CW. New Tables of Multiple Comparisons with a Control. Biometrics. 1964; 20: 482–491.