A Bayesian hierarchical latent trait model for estimating rater bias and reliability in large-scale performance assessment

We propose a novel approach to modelling rater effects in scoring-based assessment, based on a Bayesian hierarchical model and simulations from the posterior distribution. We apply it to large-scale essay assessment data spanning a period of 5 years. Empirical results suggest that the model provides a good fit for both the total scores and the individual rubric scores. We estimate the median impact of rater effects on the final grade to be ±2 points on a 50-point scale, while 10% of essays would receive a score differing by at least 5 points from their actual quality. Most of the impact is due to rater unreliability, not rater bias.


Introduction
Performance assessment plays a fundamental role in society, especially in education, where it is common practice to test students and base their assessment on test scores. Decisions that follow from performance assessment often have profound consequences for those involved, such as allocation of funding, promotion, and, as is the case with the essay scoring data used in this study, enrolment into higher-educational programs.
Some skills can be assessed with relatively structured tasks, but writing skills and language proficiency are typically assessed with essays, unstructured essay-type questions, or other types of free-response tasks. Scoring these types of performance assessments relies heavily on human judgement. That makes the scores subjective and vulnerable to rater effects. That is, scores can be affected by factors not related to the ability being assessed, such as bias (strictness, leniency) and (un)reliability (non-systematic error) of the rater.
Methods for dealing with rater effects include adding more raters, rater training, score negotiation, the use of scoring rubrics, and rater bias correction. Making valid inferences from essay scores and managing rater effects therefore first requires a good understanding of rater effects, which is what we focus on in this study. That is, we propose a statistical model for estimating rater effects and apply it to rater scoring data. However, the model can be applied in any setting where a set of performances is rated by a set of raters and where at least some performances are assessed by two or more different raters.

Contributions of this study
In most related work, the available number of essays or raters is small (a couple of dozen), which limits model complexity. There are a few studies of large-scale assessments (10,000+ scores) [9,15,18], but even these span only a very short period of time (a single essay) and analyse only short-term rater drift. Our research is based on large-scale assessment data consisting of almost 80,000 scores and a stable set of over 200 raters over a period of 5 years and 5 different essays (see Table 1 for details). Each score is broken down into sub-scores according to a standard rubric. These data, described in more detail in the Essay scoring data section, allow us to study several substantive research questions in more detail than related work: first, an accurate estimate of rater bias and reliability; second, an investigation of long-term rater drift; third, an analysis of scores by rubric, in particular the relationship between language (writing mechanics) and content; and fourth, a quantification of the impact of rater effects on final grades.
Related work relies predominantly on null-hypothesis testing for the presence of effects rather than estimating effect sizes, and, despite a typically large number of parameters (effects), no attempts are made at adjusting for multiple hypothesis tests, which may lead to spurious findings where there are no effects. We develop our statistical model in the Bayesian framework to deal with these issues, simplify interpretation, and facilitate estimation of the impact of rater effects through simulations from the posterior distributions. Our approach is related to the hierarchical rater model (HRM) [22,23], but with a discretized Tobit instead of a logit (partial-credit) model for observed scores and an extension to correlated bias and severity across scoring rubrics.

Essay scoring data
The data used in our research were obtained during the nation-wide external examination conducted by the National Examination Centre in upper secondary schools in Slovenia. This examination, also called Matura, is an obligatory exam which marks the completion of secondary education and is the main criterion for enrolment into higher-educational programs. Matura consists of three compulsory subjects (native language, usually Slovenian, mathematics, and a foreign language) and two elective subjects. The basic maximum grade a student can obtain is 28, but this can be raised up to 34 if the student elects to take 2 subjects at a higher level. The essays we analyse in this study are part of the Slovenian language subject exam and represent 50% of the exam score. The maximum grade for this exam is 8, so the essay effectively contributes up to 4 points to the basic total of 28 on the Matura. Matura is held twice per year, spring and autumn, and each term, students can opt for one of two types of essays, argumentative or interpretive. In this study, we only use the spring term argumentative essays for the years between 2010 and 2014, inclusive. The reasons behind this choice are that a vast majority of students take the spring exam (autumn exams are typically taken by students who have failed the spring exam, want to improve their score, or were not able to attend the spring exam) and most students opt for the argumentative essay. In the studied period of time, 83% of all essays were spring term argumentative essays. A summary of our data is presented in Table 1.

Table 1. Summary of the data broken down by year: number of essays (n), raters (m), mean score (ȳ), standard deviation (sd(y)), and median number of scores per rater (median(n_r(i))).
An essay can receive a score between 0 and the maximum of 50 points. This score is broken down into rubrics: 20 points for the language-related rubrics and 30 for the content-related rubrics. A rater can, at their discretion, assign 6 bonus points, 3 for language and 3 for content, but the sum of all points cannot exceed 50. Points for language are additionally divided into three scoring rubrics: syntax (up to 8 points; grammar, syntax, punctuation, word use, . . .), style (up to 6 points; loosely defined as measuring the principles of good communication), and structure (up to 6 points; title, word count, organization, . . .). Points for content are also broken down into rubrics, but the number of rubrics and points per rubric vary from year to year, depending on the essay topic. To simplify analysis, we only record and analyse the total of the content-related rubrics.
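As an illustration of the score structure described above, here is a minimal sketch of how rubric points combine into a total. The function name, argument names, and validation are ours, not part of the official scoring software; the rubric maxima and the 50-point cap follow the text.

```python
def total_score(syntax, style, structure, content,
                language_bonus=0, content_bonus=0):
    """Combine rubric points into a 0-50 essay score (illustrative sketch)."""
    # Rubric maxima as described in the text.
    assert 0 <= syntax <= 8 and 0 <= style <= 6 and 0 <= structure <= 6
    assert 0 <= content <= 30
    assert 0 <= language_bonus <= 3 and 0 <= content_bonus <= 3
    raw = syntax + style + structure + content + language_bonus + content_bonus
    return min(raw, 50)  # the sum of all points cannot exceed 50
```

For example, an essay with full marks in every rubric plus all 6 bonus points would sum to 56 but is capped at 50.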
The group of Matura raters consists of upper secondary school teachers. Raters have to attend trainings and seminars organized by the National Examination Centre and follow detailed criteria when assessing the exams. The group of raters is relatively consistent over time, since many Matura raters score the Matura essays over their entire teaching career. Each essay is scored by two raters, and the average of their scores is the essay's final score. In cases where the two scores differ by more than 10 points, a third rater is included, who inspects both scores and whose score becomes the final score.
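The scoring protocol above can be sketched as follows. This is a hedged illustration; the function name and error handling are ours, not part of the Matura procedure.

```python
def final_score(score1, score2, score3=None):
    """Final essay score under the protocol described above: the average of
    two raters' scores, unless they differ by more than 10 points, in which
    case a third rater's score becomes the final score."""
    if abs(score1 - score2) > 10:
        if score3 is None:
            raise ValueError("third rater required: scores differ by more than 10")
        return score3  # the third rater inspects both scores; theirs is final
    return (score1 + score2) / 2
```

Note that because the two scores are averaged, the final score can be a half point even though individual raters assign integers.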
The data are available for download as S1 Data.

Methods
In the ideal case we would know the quality of each essay, that is, the ground truth, making it simple to estimate each rater's bias and variability from how the rater's scores deviate from and vary around the ground truth. Note that in the context of the model we use rater variability instead of rater reliability, but the two terms are interchangeable.
In some studies, experienced raters' scores [16] or a scoring committee's average [15] are used as a proxy for the ground truth. However, this is problematic, as there is no evidence that senior raters are unbiased or have substantially higher reliability (lower variability). Therefore, unless we have a very large committee or a large number of raters per essay, failing to account for variability in such a proxy ground truth will lead to incorrect conclusions, in particular, underestimating the raters' reliability or falsely concluding that raters exhibit central tendency (avoiding the extremes). Because we cannot assume the ground truth is known with certainty, it is natural to model an essay's quality with a latent variable. Scores from 2 or more raters are necessary for at least some of the essays, however, because rater and essay quality variability can only be identified from the raters' mean and variability on the same essay, respectively.
Assessment scores are, by definition, ordered and, in most cases, numeric and discrete. We can choose to model the scores as continuous or categorical (ordinal) variables. One suitable model is the ordinal logistic regression model, which is the almost exclusively used approach in related work. We do not base our model on it, however, for several reasons. The first reason is model complexity: for k ordinal categories, k − 1 breakpoint variables are required, and, for rater-specific variables, which are necessary if we are to study rater-specific effects, this is multiplied by the number of raters. Furthermore, allowing the breakpoints to vary across raters would complicate the placement of a hierarchical prior distribution and make it more difficult to interpret the bias and variability of the raters. Instead, we model the observations as a function of a normally distributed latent variable that is discretized to the nearest integer and truncated at 0 and the maximum score. We thereby keep the number of parameters at a minimum, while allowing for zero- or maximum-inflation, which is relevant for rubrics with a small number of categories.

We add essay and rater population-level distributions and parameters. The hierarchical Bayesian approach serves two purposes. First, the second level provides additional insight into essay and rater properties and simplifies simulation of rater effects using quantities derived from the posterior distributions. And second, shrinkage of essay- and rater-specific parameters towards the population-level mean alleviates the identification of extreme effects purely by chance and the problem of multiple comparisons (see [24] for details). The model in our approach can thus be viewed as a discretization of the Tobit model [25,26], truncated at both ends, or as a simplified ordinal probit model with fixed breakpoints between categories and non-unit variance.
In the remainder of the section we provide a detailed description of the single score model and its extension to a setting where the total score is broken down into rubrics.

A model for essay scoring
We are interested in modelling essay scores in a setting where each essay is scored one or more times, each time by a rater from a group of identifiable raters and with a bounded discrete numerical score. Let N, n, and m be the total number of scores, essays, and raters, respectively. We use w(i) and r(i) to index the essay and rater of the i-th score.
We assume that the true latent quality of an essay, l_{w(i)}, lies somewhere on an underlying continuum and that the essay's perceived quality in the eyes of the rater, l*_{w(i)}, is the sum of the essay's true latent quality, the rater's bias b_{r(i)}, and random noise ε_i:

l*_{w(i)} = l_{w(i)} + b_{r(i)} + ε_i,

where the noise is normally distributed with rater-specific variability, ε_i ~ N(0, σ_{r(i)}). The actual reported and therefore observed integer score y_i is assumed to be a truncation and discretization of the perceived quality:

y_i = T(l*_{w(i)}).

The function T is defined as

T(x) = 0 if x < 0, nint(x) if 0 ≤ x ≤ z, and z if x > z,

where nint is the nearest-integer function and z is the maximum number of points for the essay or, if applied to a single rubric, the maximum number of points for that rubric. The minimum number of points is always assumed to be 0.
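The observation function T can be sketched directly. A minimal Python illustration, under the assumption that the tie-breaking rule of nearest-integer rounding is immaterial:

```python
def T(x, z):
    """Truncate a latent perceived quality at 0 and z, then round to the
    nearest integer, yielding an observed score in {0, 1, ..., z}."""
    if x < 0:
        return 0          # truncation at the minimum (zero-inflation)
    if x > z:
        return z          # truncation at the maximum (maximum-inflation)
    return round(x)       # nearest-integer discretization
```

Any latent value below 0 maps to the score 0 and any value above z maps to z, which is how the model produces the zero- or maximum-inflation mentioned above.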
We place hierarchical priors on the model's parameters. The essays' latent qualities are assumed to be draws from a normal distribution, l ~ N(μ_l, σ_l), and the rater-specific non-systematic error standard deviations are assumed to be draws from a normal distribution left-truncated at 0, σ_i ~ N(μ_σ, σ_σ).
If the biases b are allowed to vary freely, we cannot identify the latent qualities and the biases (an arbitrary constant can be added to either group of parameters and subtracted from the other). To achieve identifiability, we introduce restricted bias parameters b*_i, i = 1..(m − 1), forcing the biases to sum to 0:

b_i = b*_i for i = 1..(m − 1),   b_m = −Σ_{i=1}^{m−1} b*_i.

The assumption that raters as a group are expected to be unbiased is natural: it would not make sense to assume that the expectation (consensus) of all raters in a closed world is biased. In theory, this assumption can lead to the same issues as those discussed earlier regarding assuming ground-truth scores. However, in practice the total number of raters and the number of scores per rater are always much higher than the number of different raters per essay.
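The sum-to-zero construction can be made concrete with a short sketch (the function name is ours):

```python
def expand_biases(b_star):
    """Map the m-1 free bias parameters b*_1..b*_{m-1} to the m rater
    biases b_1..b_m by setting b_m to minus their sum, so that the biases
    sum to zero and the model is identifiable."""
    return list(b_star) + [-sum(b_star)]
```

Only m − 1 parameters are sampled; the last bias is fully determined, which removes the additive indeterminacy between latent qualities and biases.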
Finally, priors on the hyper-parameters were set so as to allow for the placement of weakly-informative priors: normal priors for the means, μ_l ~ N(ν_l, η_l) and μ_σ ~ N(ν_σ, η_σ), and uniform priors for the standard deviations, σ_l ~ U(0, u_l), σ_σ ~ U(0, u_σ), and σ_b ~ U(0, u_b).

Measuring rater effects. We estimate the impact of rater effects with simulation from the distribution of the absolute deviation of the average score across a raters from the latent score of an essay,

| (1/a) Σ_{j=1}^{a} T(l + b_j + ε_j) − l |,

and, corrected for estimated bias,

| (1/a) Σ_{j=1}^{a} (T(l + b_j + ε_j) − b_j) − l |,

where l ~ N(μ_l, σ_l), b_j ~ N(0, σ_b), σ_j ~ N(μ_σ, σ_σ) left-truncated at 0, ε_j ~ N(0, σ_j), and the parameters are drawn from the posterior distributions.

Model variants. We will refer to the above model as the rater-specific variability model. In order to investigate how certain parts of the model affect how well the model fits the data, we include two simpler models in the comparison. First, the common variability model, which is identical to the rater-specific model, with the exception of random noise variability, which is assumed to be the same for all raters. And second, the baseline model, where all the rater effects are omitted (bias and rater-specific variability). For both models, a uniform prior is placed on the common variability.
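The rater-effect simulation can be sketched as a small Monte Carlo experiment. The hyper-parameter values below are illustrative placeholders rather than posterior draws, and the helper names are ours:

```python
import random
import statistics

def T(x, z=50):
    """Truncation and discretization of the perceived quality."""
    return 0 if x < 0 else z if x > z else round(x)

def simulated_deviation(a=2, mu_l=30.0, sigma_l=8.0,
                        sigma_b=1.5, sigma_eps=2.5):
    """Absolute deviation of the average of `a` simulated rater scores
    from the essay's latent quality (toy hyper-parameter values)."""
    l = random.gauss(mu_l, sigma_l)                   # latent essay quality
    scores = [T(l + random.gauss(0.0, sigma_b)        # rater bias
                  + random.gauss(0.0, sigma_eps))     # rater noise
              for _ in range(a)]
    return abs(statistics.mean(scores) - l)

devs = [simulated_deviation() for _ in range(10_000)]
median_impact = statistics.median(devs)  # analogue of the median impact above
```

In the full analysis each replication would draw its hyper-parameters from the posterior, propagating estimation uncertainty into the impact estimate.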

Model fitting and evaluation
All statistical analyses were performed using the R programming language [27] and the Stan probabilistic programming language for (Bayesian) inference [28]. The models were implemented in Stan and fitted using MCMC sampling, specifically the No-U-Turn Sampler variant of Hamiltonian Monte Carlo [29]. Sampling was run for a number of iterations sufficient to achieve estimated sampling errors at least two orders of magnitude lower than the posterior standard deviations, so that sampling error has no practical effect on the interpretation of the results.
When choosing constants for the hyper-priors, we partially relied on prior knowledge, but only to rule out practically impossible ranges of values for the standard deviation parameters; weakly informative values were placed on the means: ν_l = 30, ν_σ = 5, η_l = η_σ = 100, and u_l = u_σ = u_b = 70. Due to the amount of data available, the posterior distributions were found to be insensitive across a range of more and less restrictive values and practically equivalent to selecting flat priors across the unrestricted ranges for all parameters. However, with restricted ranges, sampling was substantially more effective.
We estimated the models' goodness-of-fit using approximate leave-one-out cross-validation (LOO-CV) (see [30]) as implemented in the loo package [31]. Note that all reported values of the LOO-CV criterion are transformed to be on the deviance scale. We also performed several visual checks of how the model's predictions of certain aspects of interest match the observed data.

Rubrics model
The main purpose of the rubrics model is to estimate the effects of rubric-specific biases and potential correlations between them. A choice was made not to model rater- and rubric-specific variability and to model latent quality as independent across rubrics. The main reasons were computational convenience and the fact that these would not contribute to bias estimation. The rubrics model therefore extends the common variability model, but now the observations are vectors of length K, where K is the number of rubrics. Similarly, y_i, ε_i, i = 1..N, b_i, i = 1..m, and l_i, i = 1..n are now also vectors of length K, and the observation model is applied to each rubric separately. For the latent essay qualities and rater variability, we use a similar hierarchical prior structure, independently for each rubric, and similar values for the hyper-prior constants. For the latent qualities, l_i ~ N_K(μ_l, σ_l I), with μ_l ~ N_K(5, 100I) and σ_l ~ N_K(5, 100I) (left-truncated at 0). For the rater variability, ε_i ~ N_K(0, σI), with σ_i ~ U(0, 70) for i = 1..K. On the biases, however, we put a covariance structure, b_j ~ N_K(0, S_b), decomposing the covariance matrix into a scale vector and a correlation matrix, S_b = A Ω A, where A = diag(σ_b). Priors are placed separately on the scales, σ_b,i ~ U(0, 70) for i = 1..K, and on the correlation matrix, Ω ~ LKJ(1). LKJ is the distribution of correlation matrices from [32], and LKJ(1) is equivalent to a flat prior over all correlation matrices.
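The decomposition of the bias covariance matrix into a scale vector and a correlation matrix amounts to entrywise S_b[i][j] = σ_b,i · Ω[i][j] · σ_b,j. A small sketch with toy values (not estimates from the paper):

```python
def scale_corr_to_cov(sigma, omega):
    """Build a covariance matrix from per-rubric scales (sigma) and a
    correlation matrix (omega): S[i][j] = sigma[i] * omega[i][j] * sigma[j]."""
    k = len(sigma)
    return [[sigma[i] * omega[i][j] * sigma[j] for j in range(k)]
            for i in range(k)]

# Toy example with K = 3 rubrics (illustrative values only):
sigma_b = [1.0, 0.5, 2.0]
omega = [[1.0, 0.3, 0.0],
         [0.3, 1.0, 0.2],
         [0.0, 0.2, 1.0]]
S_b = scale_corr_to_cov(sigma_b, omega)
```

Separating scales from correlations lets each part get its own prior: uniform priors on the per-rubric scales and an LKJ prior on the correlation matrix.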

Model comparison and validation
We start with model comparison and visual validation. In Table 2 we compare the three models described in the Methods section. The rater-specific variability model provides the best fit, but does not result in substantially different estimates of bias compared to the common variability model. Both models that consider bias (common and rater-specific) fit the data substantially better than the baseline model, which does not allow for bias. Table 2 also shows how the estimated bias correlates with the rater's deviation from the overall mean score in the current and following year. As we would expect, the correlation is high for the current year, and, although the two are not exactly the same, deviation from the mean would be a decent proxy for estimated bias. However, the standard deviations of scores in Table 1 imply that, even in our case, with several dozen scores per rater, the standard error of such estimates is still relatively high (between 0.5 and 1.0 points).
Correlations of the estimated biases with the next-year's biases (only for raters who scored both years) suggest that rater bias is partially preserved over time, despite the fact that each year there is a different essay topic. We explore this further in the Rater effects section that follows.
Finally, Fig 1 visually compares the models in terms of the distributions of some of the quantities of interest replicated from the models and how they match the observed empirical distributions. In all cases, with the exception of the baseline model's rater means, the distributions match the observed ones, and there are no notable differences between the common and rater-specific variability models. The baseline model matches the distribution of scores, but underestimates the standard deviation of the distribution of rater means around the average score across all essays, which is what we would expect in the presence of rater bias.

Rater effects
Based on the results from the Model comparison section, we used the best model (the rater-specific model) to perform all further analyses. Table 3 shows the estimated impact of rater effects on essay scores. In our case, where there are two raters per essay, we estimate that one half of all scores differ by 2 or more points and 10% of all essay scores differ by 5 or more points from the actual score. By accounting for the bias, the impact is estimated to decrease by 0.5 and 1 point, respectively, but variability/unreliability of rater scores has a substantially stronger impact on grades than bias. Another option that could be considered is adding another rater, which is estimated to have an impact similar to accounting for bias. Table 4 shows the correlations between estimated posterior bias and reliability between the studied years. We can observe that the correlations are quite high, implying that a rater's effects partially, but not entirely, persist through the years. This was already indicated in Fig 2 and partially in Table 2, but here we can observe it in more detail, and it applies to both bias and variability. Out of the total of 221 raters, 95 have mean estimated posterior biases of the same sign across all years in which they participated as raters (44 lenient and 51 severe).
The between-subject and between-year variability in mean posterior biases account for 60% and less than 1% of the total variability, while the rest is residual variability. For reliability, these numbers are 67% and 1%. These calculations were made assuming fixed bias estimates (taking the posterior means and discarding the uncertainty). If we also take into account the uncertainty in our estimates (the variability of the posterior distributions), it would, in the absolute sense, amount to approximately 25% of the total variability for bias and 30% for reliability. Therefore, it is possible that as little as 15% and 5% of the variability in bias and reliability, respectively, is residual (within-subject variability over time). This uncertainty in our estimates of within-subject variability over time is a consequence of having only 5 points (years) per rater.
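The variance decomposition on fixed posterior means can be illustrated on a small rater-by-year table of biases. The toy numbers and helper code below are ours and only show the mechanics of the calculation, not the paper's actual figures:

```python
import statistics

# Toy posterior-mean biases: rater -> {year: bias}
bias = {"r1": {2010: 1.2, 2011: 1.0, 2012: 1.4},
        "r2": {2010: -0.8, 2011: -1.1, 2012: -0.6},
        "r3": {2010: 0.1, 2011: -0.2, 2012: 0.3}}

values = [v for yrs in bias.values() for v in yrs.values()]
total_var = statistics.pvariance(values)

rater_means = [statistics.mean(yrs.values()) for yrs in bias.values()]
between_rater = statistics.pvariance(rater_means)    # stable rater differences

years = sorted({y for yrs in bias.values() for y in yrs})
year_means = [statistics.mean(bias[r][y] for r in bias) for y in years]
between_year = statistics.pvariance(year_means)      # common year-to-year shifts

residual = total_var - between_rater - between_year  # within-rater variability
rater_share = between_rater / total_var              # analogue of the 60% figure
```

As the text notes, this treats the posterior means as fixed; propagating posterior uncertainty would shrink the share attributed to residual within-rater variation.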
We observed in Table 2, Fig 2, and Table 4 that bias and reliability partially persist over time. Another aspect of interest is potential changes of rater effects with rater experience. While rater experience data were not available, we can indirectly and partially infer rater experience from the number of years a rater participated in scoring. The results are shown in Fig 4 and reveal no discernible drifts in bias or reliability. Fig 5 compares the score distributions replicated from the rubrics model with the actual observed distributions. Overall, the proposed model is flexible enough to fit the different distributions, with relatively small maximum-score inflation in the content rubric.

Rubrics
Similar to the single total score, we can use the rubrics model to estimate the impact of rater effects for individual rubrics. Table 5 shows that the overall impact of a rubric corresponds to the maximum number of points for that rubric, with the content rubric having by far the largest impact. However, after normalization, the style and structure rubrics have the highest impact, followed by content and syntax, while the two bonus rubrics have the least impact. Finally, we investigate whether a rater's rubric-specific biases are correlated, that is, whether raters biased in one rubric tend to be biased in another rubric as well. Fig 6 compares correlation coefficients between estimated biases for every pair of rubrics. The estimated correlations are relatively stable across years, and the two bonus-point rubrics are by far the most correlated. Positive correlation can also be observed between syntax and style, content and content bonus, and structure and content. Language-related rubric pairs, except structure and bonus, have non-negative correlations, and the content rubric bias is negatively correlated or uncorrelated with the language-related rubrics (except for structure).

Discussion
The proposed model provides a good fit for both the total score and, when extended, for individual rubrics, despite substantial semantic differences between rubrics and subsequently in their scales and distributions. As a tool, the model can be used to estimate rater effects and their impact on final scores, and to identify the most biased and unreliable raters. The model could also be used to adjust final grades, correcting for bias. However, this is not likely to occur, due to prohibitive legislative issues. While the focus was on essay scoring, the model is general and applicable to any scoring setting with more than one score per subject/object and recurring raters. Our approach has several advantages over those typically used in related work. First, it provides a unified model of population-level parameters and individual raters/essays. The hierarchical structure also mitigates the problem of multiple comparisons when identifying extremely biased or unreliable raters, making the model robust to a smaller number of scores and/or raters. Finally, compared to more flexible ordinal models, we estimate the impact of rater effects on final scores directly, which substantially simplifies the interpretation of rater effects and makes computation feasible. Note that the proposed model can be interpreted as an ordinal probit model with fixed, equally spaced unit cut-offs.
The substantive findings based on our results agree with the prevailing empirical evidence that in essay scoring (and other similar assessments) raters are biased and unreliable to the extent of having a non-negligible impact on final scores. While most related work reports bias, only a few studies attempt to measure the impact of bias (and unreliability) on final scores/grades. Even in those cases, the different methodologies used make it impossible to accurately compare and consolidate the findings, so the following is approximate at best. The numbers reported in related work and reviewed in the Introduction vary, but in most cases the expected impact lies between 10% and 20% of the rating scale. In our case, the median effect of a single rater is 3 points (or 6%), which would suggest that the raters have a smaller impact. Taking into account that these are well-trained and mostly also very experienced raters, this is not surprising.
With the current two raters per essay, we estimate that about 10% of essays would be scored 5 points too high or 5 points too low due to rater effects alone. This implies that 10-point gaps between essays of equal quality are not uncommon, which translates into 10% of the Slovenian language exam score and almost 1 point on the entire Matura. If we were to correct for bias, we could mitigate some of the impact of rater effects; however, rater unreliability is a much larger source of variability and represents the vast majority of the impact on final scores. Such non-systematic error cannot easily be corrected and must be managed in some other way. Excluding the most unreliable and biased raters is one option, but the gain would be small. Another option would be an additional rater. Combined with correcting for bias, however, this would only reduce the impact by approximately one third.
This study is, to the best of our knowledge, the first systematic investigation of rater effects over a period of several years. Results confirm that a rater's bias and reliability are characteristics that persist through time, potentially with very little variability. Furthermore, we found no evidence that bias and reliability of raters as a population meaningfully change over time or with the number of essays a rater scores (a proxy for rater experience).
At the rubrics level, the overall impact of rater effects is greater in the style and structure rubrics, relative to the content and syntax rubrics. This makes sense, as it can be argued that content and syntax are better defined and therefore easier to score. The strongest correlation is between the biases of the two bonus rubrics. These two rubrics clearly measure two distinct aspects of an essay, so this correlation can be due either to differences in overall strictness/leniency or to the raters being unable (or unwilling, for example, giving a content bonus to additionally reward an exceptional performance in language) to distinguish between language and content when giving out bonus points. We are not able to decide between these explanations, but both imply misuse of the bonus-points rubrics.
There are correlations in other rubric pairs as well, but these are more difficult to interpret. If we assume that there is no substantial overlap between rubrics (in the sense that they measure the same aspect of an essay's quality), then any correlation in biases between rubric pairs would imply that the raters are not able to properly distinguish between rubrics. However, it seems more likely that the observed correlations between biases are, at least partially, due to rubrics actually overlapping. The correlated rubrics do make sense: all the language rubrics are positively correlated, especially syntax and style, and not correlated with the content rubric, with the exception of structure and content, which are positively correlated.

Directions for future work
The proposed approach could also be extended to explore several potentially interesting aspects we did not or could not investigate. One important path for future work is modelling how rater characteristics affect rater bias and reliability. For example, age, gender, and, in particular, rater experience, which would be relatively straightforward to incorporate into the model. In our study we were unable to pursue this due to lack of relevant data.
In terms of bias management, we dedicated most of our attention to the rater's overall bias, the effect that we were primarily interested in and wanted to remove, and less to rubric-specific biases. However, a more detailed analysis of the latter, especially of whether distinct types of raters exist, would be useful in identifying particular areas of bias and unreliability and training focused on a particular area.
In this study, potential changes over time were not incorporated directly into the model. Instead, we analysed them post-hoc using the posterior distributions. Modelling all years simultaneously with a temporal component made the computation infeasible and would require a different computational approach. We did not pursue this, because we had only 5 time-points (years) of data and no other time-related information, which would allow for more certain estimates of changes over time. If short-term timestamps or data over a longer period of time were available, trends could be modelled directly.
Supporting information

S1 Data. The data used in our study. The data are stored as a serialized R programming language data.frame object (see saveRDS()). Each row represents one essay, the rater IDs and scores, and a detailed breakdown of the scores by rubric. (ZIP)