A New Extension of the Binomial Error Model for Responses to Items of Varying Difficulty in Educational Testing and Attitude Surveys

We put forward a new item response model which is an extension of the binomial error model first introduced by Keats and Lord. Like the binomial error model, the basic latent variable can be interpreted as a probability of responding in a certain way to an arbitrarily specified item. For a set of dichotomous items, this model gives predictions that are similar to other single parameter IRT models (such as the Rasch model) but has certain advantages in more complex cases. The first is that in specifying a flexible two-parameter Beta distribution for the latent variable, it is easy to formulate models for randomized experiments in which there is no reason to believe that either the latent variable or its distribution vary over randomly composed experimental groups. Second, the elementary response function is such that extensions to more complex cases (e.g., polychotomous responses, unfolding scales) are straightforward. Third, the probability metric of the latent trait allows tractable extensions to cover a wide variety of stochastic response processes.


Introduction
In this paper we introduce a class of item response models for the analysis of response distributions derived from survey data. The simplest item response function in this class is a generalization of the binomial error model for ability testing of Keats and Lord [1]. Similar item response functions were considered briefly by Lazarsfeld [2] and Coleman [3] but not implemented as models for response data. The distinguishing characteristic of the approach taken here is that the probability of an item response is written as a function of a latent variable that can also be interpreted as a probability. The choice of a probability metric suggests the Beta density as a natural choice to model the distribution of the latent variable. Furthermore, a response function formulated in this way is easy to modify to accommodate variations in the nature and complexity of response tasks.
We first introduce the extended binomial error (EBE) model as a generalization of the binomial error model. We discuss issues of identifiability, estimation and fit, and give two examples, one for a single sample, and another for data from two independent samples, noting the similarities in fit between the extended binomial error and loglinear Rasch models. Finally, we discuss extensions for polychotomous data and for modeling responses produced by noncumulative response functions.
The resulting class of models have special utility for the investigation of substantively important questions about the nature of response (as opposed to scoring long tests), Further, in contrast to the loglinear Rasch models that have been most influential in sociology due to the work of Duncan [4] in particular, this approach allows us to make strong assertions about the social distribution of the latent trait.

The Extended Binomial Error Model
Introducing item difficulty into the binomial error model for dichotomous responses The binomial error model for test scores was introduced by Keats and Lord [1] [5] as a strong true score theory for dichotomous test items of more or less equal difficulty. This model is based on the assumption that the conditional distribution of the observed score given the true score (conceived as a theoretical "proportion correct" measure) is binomial, with the true score playing the role of the constant probability of a correct answer in M independent trials where M is the number of items. With no additional specifications, Keats and Lord were able to show that the first M moments of the true score distribution can be determined from the moments of the observed score distribution. Furthermore, they showed [6] that a linear regression of true on observed score implies and is implied by a negative hypergeometric distribution of the observed scores. They also proved that a two-parameter Beta distribution for the true scores implies a linear regression of true score on observed score and a negative hypergeometric distribution for the observed scores.
Of course, it is infrequent that we analyze a set of dichotomous items with the same levels of difficulty (see also [7]); hence Keats and Lord indirectly incorporated variations in item difficulties in their compound binomial model, which accordingly lost many of the most attractive features of the binomial error model. Some other modifications of the binomial error assumptions have been proposed [8], but these are not often implemented as IRT models.
We propose to modify the binomial error model by direct incorporation of item-difficulty parameters and by assuming a bounded proportion-like latent trait. The model assumptions allow a compact expression for the probability of any observed response pattern in terms of item-difficulty parameters and the parameters of the distribution of the latent trait, assumed to be two-parameter Beta. This expression leads in turn to an algorithm for simultaneous ML estimation of both sets of parameters. In contrast to a Rasch model, therefore, our estimation of item parameters is not separable from the estimation of person parameters; the advantage of specifying a flexible family of distributions like the Beta comes in the capacity to easily analyze factorial experiments in which either the trait, or item hardnesses, or both may be affected by treatments, as we show below.

A Generalization
Let x = [x 1 , x 2 ,..,x M ] be a response vector for a set of M dichotomous items, coded so that x i = 1 implies completion of a task related to some underlying ability or acceptance of a statement consistent with some hypothesized underlying attitude dimension and x i = 0 otherwise. (The 1/ 0 coding of responses is arbitrary but useful in writing down the probabilities associated with complete response patterns.) Consider a latent variable y, where 0<y<1, and let the difficulty of the i th item be k i (k i >0 for all i). We propose the simple response function The conditional probability of any set of M responses under conditional independence is and the unconditional probability is where ϕ(y) is the density of y.
There are a number of advantages to constructing the model using the form of Eq 1, as the latent trait may be interpreted as the probability of answering standardized (k i = 1) item in a positive direction. However, a complication arises for the interpretation of standard errors of the k parameters given that they must be strictly non-negative. As is the case for other models with necessarily non-negative parameters (such as models for variances), the standard errors cannot be interpreted symmetrically; indeed, in cases of very poor fit, standard errors may imply that the true population value might quite plausibly be negative. When the model fits well, these issues are minor, but it makes sense to carry out statistical tests on the equivalence of k parameters not by comparison of standard errors, but by the comparison of chi-squares of nested models constraining or not constraining parameters to be equal. If the use of confidence intervals is required, one can rewrite the item response function as a double-exponential whose argument is the difference between an unbounded latent variable ξ and an unbounded item difficulty parameter χ i , where ξ = -log[-log(y)] and χ i = log(k i ), ξ, -1< ξ, χ i <+1. Thus we can write We note, however, that in contrast to a Rasch model, Eq 4 does not imply equiprobability when ξ = χ i ; instead, Pr[x i = 1|ξ] = .5 when (ξ-χ i ) = -.3665. Given that Eq 1 has the more intuitive relation to a probability statement, we prefer this for the development of a family of models for stochastic response. The choice of a probability metric suggests the Beta density as a natural choice to model the distribution of the latent variable, as it, like a probability, is defined for the interval [0,1]. Thus we propose: where a, b>0, 0<y<1, and Γ is the complete gamma function. (We note that such a specification was suggested but not implemented by Lazarsfeld [2].) Note that E[y] = a/(a+b); V[y] = ab/[(a+b) 2 (a+b+1)]. For k i = k for all i, this leads to the binomial error model with a Beta density for the latent trait, sometimes called the beta-binomial model [7]. It can be shown that given this density, for any k>0, Thus, for example, the unconditional probability of the unit response vector x = [1,1,1,. . .,1] is simply More generally, any Pr[x] may be written as a function of gamma functions whose arguments are the sums of the distribution parameters a and b and/or item difficulties k i .

Representation of the Unconditional Response Probabilities
Let Pr[x] represent the unconditional probability of a response. For any x, let I(x) = {i: x i = 0} and C(x) = {i: x i = 1}. Let A be any subset of I(x), including the null set Ø and let |A| be the number of elements in A. Given these definitions, we show in Appendix A that for the EBE model. These equations are useful for constructing joint item response functions in programming routines for ML estimation.
Initial estimates for the model parameters may be obtained from bivariate cross-classifications. Given the resulting table from any such cross-classification of the i th and j th items, and fixing one of the two relevant item parameters (say k i ) = 1, the other three parameters k j , a and b can be estimated from the three degrees of freedom in the 2×2 table. Repeating this M-1 times for all cross classifications of the i th item with all other items produces a complete vector of starting values. (We give the details in Appendix B.) We then use the Polak and Ribière [9] version of the Fletcher and Reeves conjugate gradient method [10] to find maximum likelihood estimates; we also use the Nelder and Mead [11] simplex methods to check that there are no preferable solutions in the area of the start values. In no case did we find that our maximum was local only.

Scores and Score Variances from the Posterior Distribution of y given x
The posterior density of y given x is defined as Accordingly, the expectation of the trait score associated with a manifest response vector is just We shall regard E[y|x] as an estimate of the person score corresponding to the response vector x.
The variance of the posterior distribution serves as a measure of the precision of E[y|x] as a representation of the person score and can be calculated from the relation V As discussed in Appendix A, there is a form for the score of any x that is similar to Eqs 8 and 9.

Local Identifiability
Eq 8 is locally identifiable if the Jacobian matrix of the transformation from model parameters to response probabilities has full column rank for a given vector of parameter values [12]. Our numerical explorations suggest that this model, without the imposition of any additional constraints, is locally identifiable for plausible regions of the parameter space. Nevertheless, the imposition of normalizations of the form k i = 1 for some i, providing an item-specific metric of the latent variable, produces more stable full-information maximum likelihood estimates than the unconstrained model. This is the result of the following interesting circumstance: the power of a Beta-distributed random variable is a random variable that is nearly-but not quite -distributed as Beta variable. The choice of i to fix is arbitrary in the following senses: a) the fit of the model to data is virtually independent of this choice and b) ML estimates of item parameters for all normalizations are related in a simple way. This last point deserves a brief elaboration. Suppose for some set of data the model is normalized with respect to the h th item and that ML estimates are written as k i h , and a h , b h with k i h = 1 if i = h. Let k i h and a h , b h be the ML estimates for normalization with respect to item h so that k i h = 1 if i = h. To a close approximation, We also note that the special case in which the b parameter is set a priori to 1 is underidentified and requires an arbitrary constraint on one item parameter or, equivalently, on the a distribution parameter. With this constraint, there are simple closed form solutions for the ratios of item parameters to the single distributional parameter. For this case, Eq 6 simplifies to and hence Thus the ratio of any item hardness to the single distribution parameter can be recreated simply as a ratio of failures to successes at the marginals.

Issues of Fit
We go on to apply this model to sets of dichotomous items, making comparisons to results obtained with the Rasch model [13] for the simplest cases. The overall fit of the model can be assessed using the likelihood ratio chi-square, which can also be used to test nested models we shall introduce below. But for comparison of non-nested models, we use the model selection criterion of Raftery's BIC [14], which is equal to L 2 -df Ã ln(N), where L 2 is the likelihood ratio chi-square, df represents the degrees of freedom, and N the number of persons in the sample. (The saturated model has a BIC of 0, any model with a negative BIC is preferred to the saturated model, and the model with the lowest BIC is preferred.) As recently emphasized by Weakliem [15], BIC is not without its drawbacks, and more rigorous implementations of Bayesian logic are now tractable [16]. However, the chief practical drawback of BIC seems to be its overly conservative nature, as it prefers more parsimonious models, and given the temptation to over-fit data sets, we think that BIC serves well as a general criterion for model comparison in which many models are fit to the same data set.

The Single Group Case
We begin by re-analyzing the classic Army data presented by Stouffer [17] and analyzed using a Rasch model by Duncan [4] and Kelderman [18]. The data are to be found in Table 1 Most importantly, the two models agree closely as to the positions of the different items. We find a correlation > .99 between the natural log of the extended binomial error model's item parameters and those of the Rasch model. (Here we use results from a conditional maximum likelihood fit. Duncan's item parameters are quite different, which we assume to be a mistake, since the order of parameter hardnesses differs not only from our Rasch analyses, but from the marginal distribution of the items. Kelderman did not report item parameters.) Inspection of the estimated scores (Table 2) from the extended binomial error (Eq 11) model shows that  while there are clear score differences between patterns with the same Rasch raw score (the number of "positive" responses), the ranking of respondents with respect to posterior scores is roughly the same as the ranking with respect to raw scores. The differences between posterior expected value scores over response patterns are small compared to the posterior standard deviations for each pattern; this is to be expected given that the number of items is small. In this case and most others we have investigated the log linear version of the Rasch and extended binomial error models lead to similar conclusions, though latter tends to be more parsimonious and the former to fit somewhat better, as it has M-2 more free parameters. Nevertheless, the two need not agree. Numerical investigations demonstrate the existence of cases in which response distributions generated by the extended binomial error model parameters fit the Rasch model well (as judged by goodness-of-fit measures) but produced estimates of loglinear trait distribution parameters that violate the moment inequalities conditions of Cressie and Holland [19].
In sum, the EBE behaves similarly to the Rasch model; given an implementation involving greater distributional flexibility (the loglinear version) the Rasch model will naturally fit somewhat better at the cost of more parameters. There are, however, data that are fit by one model and not by the other. Most importantly, the probability-like metric of the latent trait allows for extensions that may be of great interest when we wish to examine the mechanisms of response processes (as opposed to scoring long tests). We go on to examine several such extensions, starting with models for independent groups under experimental conditions.

The Multiple Group Case
The multiple group extension of the extended binomial error model permits investigation of issues related to item bias and, under certain conditions, study of group differences in the distribution of the latent trait. Let g = 1,. . .,G index groups which may be formed by partitioning a single random sample, sampling from diverse populations, or by random assignment of different item formats. To allow for differences in item and distribution parameters over these groups, we may rewrite Eqs 4 and 5 as follows: To illustrate, we take data from a national telephone survey of Italian adults 18-69 years old that was conducted in April and May, 1994 [20]. The survey was part of a study of regional and ethnic prejudice in Italy and included a series of items dealing with stereotypical beliefs about Africans and East Europeans living in Italy. Each respondent was assigned at random to a set of questions targeted at one of three immigrant groups: a) North Africans, such as Moroccans, Tunisians, or Algerians (probability = .25); b) Africans from regions of Central Africa, such as Senegal and Somalia (probability = .25); and c) Eastern Europeans, such as Poles, Albanians, or Slavs (probability = .5). For each target group, the respondent was presented with a statement incorporating a positive or negative adjective pertaining to the group, e.g.: "do you agree that most of them are complainers? (they try to make others feel sorry for them)". The interviewer then asked respondents "Do you agree strongly, agree somewhat, disagree somewhat, or disagree strongly with this description?" For our example, we selected responses to items incorporating four negative adjectives which roughly translate into "selfish", "slackers", "violent", and "complainers", combined the North and Central African target groups into a single target group (based on similarity of the marginal distributions), and dichotomized the responses into "Agree" (coded 1) and "Disagree" (coded 0). The response distribution for N = 2001 respondents in the 1994 survey is shown in Table 3.
As an exercise, we can use these data to determine whether the responses are consistent with the hypothesis of an underlying trait of prejudice or hostility to minorities, and if so, if Africans and Eastern Europeans are perceived identically. If there are differences in the perception of Africans and Eastern Europeans, we can determine whether there is still a single trait of "prejudice" with perhaps different thresholds for attributing negative characteristics to Africans and Eastern Europeans, or whether there seem to be different latent traits involved when it comes to judging members of the two target groups. Table 4 presents fit statistics for selected models for these data. The first model corresponds to Eqs 14 and 15, only adding the constraint k 3g = 1 for g = 1,2 (the reason the third item is chosen will become clear below). This model fits quite well, generating a likelihood ratio chi-square of 15.93 with 20 degrees of freedom (p = .721). Models 2 and 3 impose substantively important constraints. Model 2 sets k ig = k i for all i; it is a test of identical item meanings (semantic invariance) that allows the choice of target group to evoke different traits (e.g., degree of hostility to Africans or degree of hostility to Eastern Europeans). This sort of model might be used to examine item bias across two non-experimental groups; if this model failed to fit one could attempt to see if there were particular items that had different hardnesses across groups.
Given that our groups are the results of experimental treatments, however, we may instead begin by assuming identity of the distribution of the latent trait. Model 3 sets a g = a and b g = b; given the normalization k 3g = 1 for g = 1,2 this is equivalent to saying that there is only one trait of overall prejudice, but the items (except for the third) can have different hardnesses depending on the target group.
Given model 1, model 2 must clearly be rejected as the difference in chi-square is significant (10.31 at 3 df, p = .016). Given model 1, however, loss of fit due to the constraints associated with model 3 is insignificant (chi-square of .62 at 2 df, p = .733). (The results here are not independent of the choice of item that is fixed; setting k 3g = 1 for g = 1, 2 resulted in the lowest chisquare for model 3 and was hence also used for models 1 and 2.) Further, model 4 demonstrates that the location of item 2 can also be equated for the two groups (the chi-square difference of 2.01 between models 3 and 4 is insignificant at 1 df with p = .156), although models 5 and 6 demonstrate that this is not true of items 1 and 4. Inspection of the item parameters demonstrates that the item locations are more spread out on the underlying continuum when the target group is Africans than when the target is Europeans. Such a result is consistent with the interpretation that there is a single trait of out-group hostility among the respondents, but that Italians have a more differentiated stereotype of Africans than they do of Eastern Europeans.

Generalization To Ordered Polychotomies
The generalization of the extended binomial error model to ordered polychotomies relies on a standard threshold parameterization for ordered categories that was developed in an IRT context by Edwards and Thurstone [21] and also for regression analysis of ordered categories (see, for example, [22][23][24]). Let j = 1,. . ., J represent the a priori order of the response categories for the i th item where j = 1 is "low" and j = J is "high". We denote the hardness of each of these response categories as k i,j , where k i,1 <k i,2 < . . ..< k i,J-1. By analogy with Eq 1 for dichotomous responses, we write the probability of that a response fall into category j or higher as For a fixed y value, the probabilities thus defined diminish as j increases. Given this definition, the probabilities of the intermediate response categories are calculated as differences between the probabilities associated with adjacent dichotomizations. This then implies that the model   for ordered categories can be written as follows: where we set the second term to zero for j = J. The model given as Eq 17 has a number of welcome features. First, the probability function for any category j is single-peaked; if we denote its maximum y Ã j then (j < k) , (y Ã j < y Ã k ); y Ã 1 = 0 and y Ã J = 1. Finally, because this is a threshold model, it satisfies the two principles Jansen and Roskam [25] called the joining assumption (the probability of an aggregated response category is the sum of the probabilities of its constituent, usually adjacent, response categories) and ξ-equivalence (the latent traits embedded in aggregated and disaggregated versions of the item response functions are identical or are related by an admissable transformation). In contrast, the most methodologically tractable generalizations to the polychotomous case in the Rasch model (the rating scale and partial credit models and their relatives [26]) do not satisfy these criteria which means that a model that fits the polychotomous data may not fit dichotomous or otherwise collapsed data [27][28][29][30]. (The graded response model [31] does satisfy the collapsing conditions but is somewhat more complex.) health" and "drinking can get you sick" and the response categories for both reasons are "not a reason at all," "not an important reason," "somewhat important" and "very important." In this illustration, the category response functions are written so that high values are associated with responses indicating the reasons stated are important for decisions to abstain or to be careful about drinking. Thus y k1,1 is the conditional probability, given y, that a respondent will regard the "drinking is bad for your health" as "an important reason" for abstaining or being careful about drinking. Table 6 shows the parameter estimates and standard errors for fitting the extended binomial error model for ordered polychotomies to these data. As judged by the likelihood ratio chisquared value (12.78, 8df), the fit of the model to these data is acceptable. The beta distribution generated by a = 1.112 and b = 0.598 is skewed toward high values of y implying that the majority of respondents consider both reasons for abstinence to be important. A comparison of the category threshold parameters indicates that "bad for your health" is considered a more compelling reason for abstinence than "getting sick" among the young adult women in the national sample.

Single-Peaked Response Functions
This threshold formulation can also be adapted to model response processes in which subjects are more likely to give a positive response to some item if that item's location is near their own on the latent trait. Following the method of Andrich and Luo [33][34][35][36], each item is turned into a trichotomy, with two failure regions (the item is too far above the subject for her to answer in a positive direction and the item is too far below her), with the intermediate leading to a positive response. However, there is an alternative approach that is sufficiently flexible to handle a variety of empirical cases.
Here we approximate the response function in question for some set of such dichotomous items as follows: which is a unimodal function achieving its maximum when y ki = .5, which is equivalent to ξ = χ i in an unbounded metric if we again define χ ɩ = log(k i ), and ξ = -log[-2log(y)]. The leading α i may be considered akin to a discriminating ability parameter for cumulative models, or it may be considered an overall normalization factor if constrained α i = α for all i; it may also be fixed in advance, such that (for example) the predicted probability of a positive response is 1 when ξ

Conclusion
In sociology the loglinear Rasch model has become the most widely known and used IRT model in sociology [37][38][39][40][41]. There are two reasons for this popularity. First, Duncan [4] argued that its indifference to the distribution of the latent trait made it a better scientific instrument than the covariance-based methods that dominate social modeling. Second, the Rasch model's parameters turned out to be estimable with a very simple loglinear approach [18,42]. This approach treats the distribution of the latent variable as a set of nuisance parameters-the total score preserves all useful information in grading respondents, and the item hardnesses can be estimated without further investigation of the distribution of the latent trait. But preserving a metric representation of the trait aids the investigation of the response process. While other IRT models, including the Rasch model, can be adapted in this way, the model presented here allows for a wide-range of substantively important extensions to be modeled rather simply, due to the combination of a) a latent trait that can be interpreted as a probability, b) a simple family of response functions that can model dichotomous, ordered polychotomous, and unfolding-type responses, and c) a flexible two-parameter Beta distribution for the latent trait. Such a family of models can be particularly useful in the analysis of experiments that involve changes in wording, response formats, and item order.

Appendix A
A compact representation of the predicted response probabilities for any value of the parameters can be derived from the exclusion-inclusion theorem (see, e.g., [43], 72f). Let the vector h = (a,b,k 1 ,. . .,k M ) represent the model parameters and Pr[x|h] the unconditional probability of a response for a given h. For any x, let I(x) = {i: x i = 0} and C(x) = {i: x i = 1). Let A be any subset of I(x), including the null set Ø and let |A| be the number of elements in A. Now since the responses to each item are assumed to be independent conditional on the value of the latent trait, we can write Pr[x|h] in product form, where the expectation is taken with respect to the distribution of the latent trait: This representation is valid for all latent trait models for dichotomous responses that assume conditional independence. For the extended binomial error model, we make the following and 5, these have following structure: These equations cannot be solved uniquely to obtain a,b, ki, and kj. However, if we set ki = 1 (in effect, normalizing the model by setting y = Pr[xi = 1|y]) and use Γ(s+1) = sΓ(s), we get It follows that and with manipulation we get for all i and j. Thus, for each item other than the i th , its cross-classification with x i gives us its ratio to a + b.
To construct an initial estimate of a + b, which we will denote Θ for brevity, we take advantage of the fact that from Eq B.3, When the model describes the data perfectly, the M-1 equations corresponding to each j (j 6 ¼ i) should yield the same root. For actual data, we average our derived values of Θ. We then substitute this value in Eq B.7 to get initial values of kj, j 6 ¼ i. Our starting estimate of the parameter a can be recovered from the product of (a + b) and Pr[xi = 1] (given that, by construction, ki = 1), and that of b from b = Θ-a.