Disparities in ratings of internal and external applicants: A case for model-based inter-rater reliability

Ratings are present in many areas of assessment including peer review of research proposals and journal articles, teacher observations, university admissions and selection of new hires. One feature present in any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. Then, we develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error, the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings for job applications compared to internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude the paper by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market.


Introduction
Ratings have long been part of the assessment landscape in many areas. Peer review of grant proposals and journal articles, often considered the gold standard of science, relies on ratings [1]; ratings are also integral parts of educational and psychological assessments [2] and are present in student admission processes [3] and in the selection of new hires. The legitimacy of rating procedures depends crucially on the reliability, validity, and fairness of rating systems and processes [4].
There are numerous covariates that may affect ratings, such as an applicant's or reviewer's gender, ethnicity, and major or research area [1]. These factors may be a potential source of bias and unfairness in ratings, but may also influence inter-rater reliability (IRR) [5]. One factor that may cause bias is the institutional proximity of the applicant. Such "affiliation bias" has, for instance, been shown in peer review of grant proposals [6][7][8].
In labor economics, both theoretical and empirical studies confirm the commonsense notion that the human resource management processes used to make hiring decisions can have profound effects on the workforce labor productivity [9][10][11]. The productivity of new hires is dependent both on their individual attributes and the fit between employees and organizations [12][13]; social competency, compatibility, and capital may be highly valuable and support positive work environments, productivity, and success of organizations as a whole.
In many contexts, the selection of an employee for a position can come down to a choice between an external applicant and an insider, i.e. an applicant that is internal to a firm or organization. Yet there is relatively little evidence on how hiring processes treat external and insider applicants.
Studies that focus on internal (promotions or lateral transfers) and external hiring find that external candidates face an uphill battle to be hired over internal candidates, in that they tend to need better observable indicators of quality than their internal peers [13][14]. This finding may be related to hiring managers having relatively more knowledge about internal candidates, the importance of firm-specific human capital, or the desire by firms to create promotion-related incentives for other employees [15].
One important issue that has received little attention is whether the applicant selection tools and ratings often used in assessing job applicants function differently for internal and external applicants. In particular, internal applicants may have advantages over external applicants because they know the attributes that employers are looking for, because they are more likely to receive recommendations from individuals who understand those attributes, or because they are directly known to hiring officials.
In this paper, we examine how ratings on applicant selection tools compare for internal and external applicants to teaching positions in Spokane Public Schools (SPS), a relatively large school district in eastern Washington State. We use mixed-effect models [16] with rater and applicant covariates to test for bias. We analyze differences in ratings between external and internal applicants, with a particular focus on variance and IRR for these groups. We also derive a test of between-group differences in IRR, relying on mixed-effect models that allow group-specific variance terms for the random effects.

SPS teacher applicant selection tool
For hiring decisions, SPS utilizes a four-stage hiring process [17]. In the first step, an online application management system is used for uptake and initial check of applications. Next, prescreening of potential applicants is made by central office human resources officials. In the third stage, screening of applicants meeting initial screening standards is done by school-level hiring officials. Finally, applicants with the highest school-level screening scores are invited for in-person interviews: job offers are made based upon judgments after this final stage.
In this work, we analyze data from school-level screening, the third stage of the SPS hiring process. Important for our purposes, a large number of applicants who are screened at this stage have multiple ratings. Applicants at this stage (for the majority of the study period) were rated on a 6-point scale in nine criteria, each a subcomponent of the rating instrument. The screening rubric (which is on a 54-point scale) and criteria are outlined in Table 1. Ratings were based on written materials that were included in the application and in supporting documentation (e.g., resume, cover letter, and at least three letters of recommendation). A summative score was used to select which candidates receive in-person interviews. During the study period, about 40% of applicants screened on the school level were not advanced to an interview. In previous studies, both the district-level and school-level selection tools have been shown to be predictive of later teacher and student outcomes [17].

Research questions
We analyze rating disparities between internal and external applicants. Specifically, we address the following research questions:
1. Do external applicants receive lower ratings on subcomponents and in total than internal applicants?
2. Are any differences in ratings between the two groups explainable by other measures of applicant qualifications and quality, available before the hiring decision (e.g., years of experience, licensure test scores) or measured in the years following the hiring decision (e.g., estimated teacher value added to the subsequent achievement of their students)?
3. Does the magnitude of variance components differ for internal versus external applicants?
4. Is the IRR equal for internal and external applicants, or is it higher for insiders?

Table 1 (excerpt). Selected screening criteria and rater guidance.

Certificate and Education
Note completion of course of study, certificate held (current or pending), and education.

Training
Look for quality, depth, and level of the candidate's additional training related to the position.

Experience
Note the degree to which experience supports the prediction of success, not just the number of years. A beginning candidate could be rated highly.

Classroom Management
Look for specific references to successful strategies. This may mean planned and directed rather than quiet and orderly. Effectively handles large/small or ethnically/sociologically diverse groups; develops routines and procedures to increase learning; establishes clear parameters; and responds appropriately.

Flexibility
Note multiple endorsements; activity or coaching interests; and student, building, district, or community support. Willing to learn new concepts and procedures; successfully teaches a variety of assignments; effectively uses various teaching styles.

Instructional Skills
Look for specific references in support of skill in this area: plans; implements; evaluates; relates to students; creative; employs multiple approaches; monitors and adjusts; uses culturally responsive strategies appropriate to age, background, and intended learning of students.

Teacher application dataset
Our dataset contains ratings of applicants (assessees) for teaching positions in SPS during the school years 2008-09 through 2012-13. This includes a total of 3,474 individual ratings with known applicant and rater ID and job location, representing 1,090 individual applicants rated by 137 raters for classroom-teaching job postings at 54 job locations (schools). These units are partially crossed both with applicants (many applicants applied to multiple schools) and with raters (some raters rated for multiple schools). Applicants were rated on a 6-point scale in nine subcomponents (Table 1), and the summative score (on a cumulative 54-point scale) was also provided. Multiple ratings of the same applicant may occur within the same school during one time period (e.g. some schools employ more raters and use average total score to rank the applicants), based on multiple applications to the school at one time (to multiple job openings) or over time, and/or across different schools in the district.
We also consider three other proxies of applicant quality and qualifications: teaching experience (in years), state licensure scores (WEST-B average, math, reading and writing, all standardized statewide) and, for applicants hired in Washington State, estimates of teacher value added to achievement of their students in mathematics and reading. Teacher value added, in simple terms, is the estimated contribution of teachers toward student achievement gains on standardized tests, generally adjusted for student background characteristics, such as free or reduced-price lunch status. The specific linear model used to generate the value added we used in this paper is described elsewhere [17].
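Since teacher value added is used repeatedly below, a stylized illustration may help build intuition; the actual specification is given in [17]. The Python sketch below (all variable names hypothetical, and not the authors' model) regresses post-test scores on prior scores and a background covariate, then averages the residuals by teacher.

```python
import numpy as np

def value_added(post, pre, background, teacher_ids):
    """Stylized value-added estimate: the teacher-mean residual from a
    regression of post-test scores on prior scores and a background covariate.
    Illustration only, not the model used in the paper [17]."""
    X = np.column_stack([np.ones_like(pre), pre, background])
    beta, *_ = np.linalg.lstsq(X, post, rcond=None)
    resid = post - X @ beta  # achievement not explained by prior score/background
    return {t: float(resid[teacher_ids == t].mean()) for t in np.unique(teacher_ids)}
```

A teacher whose students systematically outperform the regression prediction receives a positive estimate; richer specifications add further student covariates and shrinkage.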
We consider an applicant to be internal when he or she either was previously employed as a teacher in the district (e.g., at a different school, in a different position, or in a different time period) or had completed his or her student teaching (part of teacher training) in the district. Otherwise, the applicant is considered external to the district at the time he or she is rated. Of all ratings, 2,322 were for internal applicants and 1,152 were for external applicants. Fifty-one applicants were, by our criteria, marked as external for some ratings and as internal for others. We keep these individuals in the analysis. For comparisons of the two samples, they are included in both pools depending on their status when the measure was taken. In the model-based analyses, applicant status is included directly in the model.

Data analysis
Statistical environment R version 3.4.3 [18] and its libraries lme4 [19][20] and lmerTest [21] are used for analyses as specified in subsections below. Library data.table [22] is used to reshape the data, and library ggplot2 [23] is used to prepare graphics. Commented sample R code is provided in supplemental materials.
Absolute differences in summative ratings of external and internal applicants. Descriptive statistics for all measures are calculated for internal and external applicants. Two-sample t tests are used to test the significance of the differences, and we utilize the Benjamini-Hochberg correction of p values to account for multiple comparisons [24]. Besides p values, Cohen's d, defined as the absolute difference between the means of the two groups divided by a standard deviation of the data [25], is used to evaluate the effect sizes of the differences.
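A minimal sketch of these two steps (effect size and multiplicity correction) is given below. The paper's own code is in R; these Python implementations are generic, and the pooled-SD variant of Cohen's d is an assumption, since the text does not specify which standard deviation is used.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: absolute mean difference divided by the pooled SD."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled_var = ((x.size - 1) * x.var(ddof=1) + (y.size - 1) * y.var(ddof=1)) \
                 / (x.size + y.size - 2)
    return abs(x.mean() - y.mean()) / np.sqrt(pooled_var)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    adjusted = np.empty_like(p)
    running_min = 1.0
    for rank in range(p.size, 0, -1):  # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, p[i] * p.size / rank)
        adjusted[i] = running_min
    return adjusted
```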
We begin testing for bias in total ratings with respect to applicant internal/external status in Model (1):

Y_{ijl} = \mu + \beta_0 \omega_i + A_i + B_j + S_l + AS_{il} + e_{ijl}   (1)

In this model, \mu is the mean for external applicants and \beta_0 is the estimated effect of being an internal applicant (identified by \omega_i = 1). We also assume random effects for applicant A_i, rater B_j, and school S_l to account for the hierarchical structure of the data, and we include applicant-school interactions AS_{il} to account for the possibility of applicant-school matching effects. The residual e_{ijl} reflects the departure of the observed rating of applicant i by rater j for school l from what would be expected given the grand mean, the individual's true score, and the effects of the rater, the school, and the applicant-school interaction. The residual also absorbs possible applicant-rater and rater-school interactions, which are not included in the model since the data contain few repeated ratings of the same applicant by the same rater and few ratings by the same rater for different schools. We assume jointly normal, uncorrelated, mean-zero distributions for applicants, raters, and residuals. In additional models, we further add fixed effects \beta describing the i-th teacher's qualities x_i, namely the number of years of experience and licensure scores (WEST-B), as well as, in the subpopulation of teachers hired in Washington State, estimated teacher value added to the subsequent achievement of their students in mathematics and reading:

Y_{ijl} = \mu + \beta_0 \omega_i + \beta^T x_i + A_i + B_j + S_l + AS_{il} + e_{ijl}

In all models, we test for the significance of applicant internal status \beta_0 using likelihood ratio tests [26].

Variance decomposition and testing for differential IRR for internal and external applicants.
Starting with the model defined by Eq 1, we estimate the contributions of variance from the various sources: the applicant effect, the rater effect, the school effect, applicant-school matching effects, and the residual:

\sigma^2_{total} = \sigma^2_A + \sigma^2_B + \sigma^2_S + \sigma^2_{AS} + \sigma^2_e

Assuming single raters, the inter-rater reliability of applicant ratings within schools is defined as the ratio of true-score variance to total variance:

IRR = (\sigma^2_A + \sigma^2_S + \sigma^2_{AS}) / (\sigma^2_A + \sigma^2_B + \sigma^2_S + \sigma^2_{AS} + \sigma^2_e)   (2)

It is clear from Eq 2 that IRR is higher when applicants, schools, and applicant-school interactions account for a substantial proportion of rating variation and raters and other sources of variation do not. When analyzing between-group differences in reliability, IRR is usually calculated separately for groups using stratified data [27,28]. We take a more flexible approach to test for differential IRR by group. Specifically, in the following model we allow the variance terms of the main random effects to differ by group, i.e., for internal and external applicants:

Y_{ijl} = \mu + \beta_0 \omega_i + A_i + B_j + S_l + AS_{il} + e_{ijl}, \quad A_i \sim N(0, \sigma^2_{A,g}), \; B_j \sim N(0, \sigma^2_{B,g}), \; e_{ijl} \sim N(0, \sigma^2_{e,g})   (3)

where g indexes the applicant's group (internal or external). In the model defined by Eq 3 (also referred to as Model (3)), group-specific IRR estimates are obtained by evaluating Eq 2 with the corresponding group-specific variance components. We use bootstrap procedures to calculate confidence intervals for the IRR estimates and for the difference between the IRRs of internal and external applicants. All calculations are performed for the summative overall score as well as for individual subcomponents.
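As a concrete illustration of the IRR definition in Eq 2, the single-rater IRR can be computed directly from estimated variance components. The sketch below is in Python (the paper's own analyses use R/lme4), and the numeric example uses made-up variance components, not the paper's estimates.

```python
def model_irr(var_applicant, var_rater, var_school, var_appl_school, var_residual):
    """Single-rater IRR (Eq 2): true-score variance (applicant, school,
    applicant-school interaction) divided by total rating variance."""
    true_score = var_applicant + var_school + var_appl_school
    total = true_score + var_rater + var_residual
    return true_score / total

# Hypothetical variance components (not the paper's estimates):
# applicant 20, rater 8, school 2, applicant-school 3, residual 17
print(model_irr(20.0, 8.0, 2.0, 3.0, 17.0))  # 25 / 50 = 0.5
```

In Model (3), the same ratio is evaluated twice, once with each group's variance components, which yields the group-specific IRR estimates compared below.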
Effect of higher number of raters. We use the prophecy formula [29][30] and generalizability theory [31] to provide estimates of IRR under various potential scoring designs, i.e., assuming differing numbers of raters. IRR is estimated as the ratio of the "true score" variance of an applicant for a given school to the total variance of the average score from multiple ratings (the true score plus the error variance of the average). For J raters, the average rating is

\bar{Y}_{i \cdot l} = \mu + \beta_0 \omega_i + A_i + S_l + AS_{il} + \frac{1}{J} \sum_{j=1}^{J} (B_j + e_{ijl}),

and its variance decomposes to

\sigma^2_A + \sigma^2_S + \sigma^2_{AS} + (\sigma^2_B + \sigma^2_e)/J.

A higher number of raters J thus implies a lower error variance and a higher IRR:

IRR_J = (\sigma^2_A + \sigma^2_S + \sigma^2_{AS}) / (\sigma^2_A + \sigma^2_S + \sigma^2_{AS} + (\sigma^2_B + \sigma^2_e)/J).

We provide estimates of IRR for internal and external applicants using Model (3) for the cases of one, two, and three raters. We also use the standard error of measurement (SEM), the square root of (\sigma^2_B + \sigma^2_e)/J, to evaluate the precision of estimates of the score level. To analyze whether the reliability of ratings influences their predictive validity, we examine correlations of ratings with estimates of teacher value added. Correlations between teacher value added and ratings are calculated from the full sample without accounting for applicant status. Correlations between teacher value added and the average of two or three raters are estimated using IRR estimates under Model (3) with respect to applicant status by employing the attenuation formula [32][33]:

r_J = r_1 \sqrt{IRR_J / IRR_1}.

Results

Table 2 provides applicant pre-hiring characteristics, summative and subcomponent ratings received by each applicant during the hiring process, as well as applicants' subsequent quality measures (estimated teacher value added). We observe a significantly higher male-to-female ratio and greater experience among external applicants. While licensure scores (WEST-B) are more often missing for external applicants, and later value added estimates are less often available due to the lower hiring percentage among external applicants, the available mean licensure scores and mean value added estimates of internal and external applicants are comparable.
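The prophecy formula and SEM used for these scoring-design projections can be sketched as follows. Plugging in the summative-score IRR estimates reported in the results (0.51 for internal, 0.42 for external applicants) reproduces the finding that averaging three raters reaches the 0.7 rule of thumb only for internal applicants; the SEM example uses hypothetical variance components.

```python
import math

def irr_with_j_raters(irr_single, j):
    """Spearman-Brown prophecy: reliability of the mean of j parallel ratings."""
    return j * irr_single / (1 + (j - 1) * irr_single)

def sem(var_rater, var_residual, j):
    """Standard error of measurement of the mean of j ratings:
    sqrt((sigma^2_B + sigma^2_e) / j)."""
    return math.sqrt((var_rater + var_residual) / j)

print(round(irr_with_j_raters(0.51, 3), 2))  # 0.76: above the 0.7 rule of thumb
print(round(irr_with_j_raters(0.42, 3), 2))  # 0.68: still below 0.7
```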
The difference is more apparent (around 4 points) when accounting for subsequent teacher quality, estimated as teacher value added, in the subsample of applicants hired in Washington State (Table 3). These differences are consistent across all subsamples (S1 Table).

Differences in variance decomposition and inter-rater reliability
Besides differences in the level of ratings of external and internal applicants, we now turn to differences in the precision of the ratings between the two groups (for the summative score, see Fig 3).
To assess differences in IRR between internal and external applicants, we provide a decomposition of the variance terms in the joint Model (3) by applicant type, internal and external (Fig 4, S2 Table). We also provide a comparison with a stratified approach, as used e.g. in [27,28] (S3 Table). We observe that for the summative score, as well as for most of the subcomponents, the rater variance is higher for external applicants, i.e., ratings are less homogeneous when rating external applicants. In addition to higher rater variance, we also observe lower applicant variance for external applicants, i.e., external applicants (their qualities) are more homogeneous. These differences in variance components result in a lower IRR for external applicants (0.42, CI 0.36-0.49) than for internal applicants (0.51, CI 0.45-0.57), with the difference between internal and external IRR being significantly nonzero for summative scores (0.09, CI 0.03-0.14); see Fig 5 and S2 Table. The differences between internal and external IRR are confirmed as statistically significant by likelihood ratio tests: Model (3), allowing for different variance terms in ratings of internal and external applicants, fits significantly better than Model (1) for the summative score as well as for the subcomponents.
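The percentile-bootstrap machinery behind these confidence intervals can be sketched generically. In the actual analysis each bootstrap replicate refits Model (3); the minimal Python sketch below instead assumes an arbitrary statistic and independent resampling within two groups, purely to show the mechanics.

```python
import numpy as np

def bootstrap_diff_ci(stat, sample_a, sample_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(sample_a) - stat(sample_b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(sample_a, float), np.asarray(sample_b, float)
    diffs = np.empty(n_boot)
    for k in range(n_boot):
        # Resample each group with replacement and recompute the statistic.
        diffs[k] = (stat(rng.choice(a, size=a.size)) -
                    stat(rng.choice(b, size=b.size)))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

A nonzero group difference is supported when the interval excludes zero, as with the internal-external IRR difference of 0.09 (CI 0.03-0.14) reported above.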
Note that if Model (1) is fitted to internal and external applicants separately (S3 Table), we also obtain higher rater variance and lower IRR for external applicants. However, this stratified approach does not allow testing the significance of the difference, does not allow some variance components to be treated as common to the two groups, and cannot simultaneously use information from applicants who were external in some applications but internal in others. Model (3) is more flexible in allowing the researcher to decide which variance components are treated as common for the two groups.

Table 4 provides IRR estimates for the three scoring designs (using one, two, and three raters per school). While the rule-of-thumb lower limit of 0.7 for reliability [34] can be reached for the summative score when the average of three raters is used for internal applicants, this 0.7 standard is not reached for external applicants. We also find that for both the summative and subcomponent scores, the standard errors are quite large if only a single rater is employed to rate application materials (Table 4). For the summative score, the standard error of measurement (SEM) is over 5.0, which implies that scores could easily move 10 points up or down, a very large gap relative to the 54-point scale. Across most subcomponents, SEM is higher for external applicants. Increasing the number of raters reduces the SEMs, but the differences in SEM between internal and external applicants remain large.

To summarize, using a higher number of raters remarkably improves predictive validity (Table 4). In our case, the predictive validity of the summative score for predicting subsequent teacher value added in math is estimated to increase from 0.17 to about 0.20 (an increase of 18%) for internal applicants when three raters are employed compared to a single rater. This increase is slightly higher for external applicants. Additionally, some subcomponents whose correlations with value added are insignificant under single ratings (namely Training, Experience, Cultural Competency, and Preferred Qualifications) are found to have significant correlations with value added under a higher number of raters (see Table 4).
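This validity gain follows from combining the Spearman-Brown prophecy with the attenuation formula. The sketch below uses the paper's summative-score figures for internal applicants (single-rater IRR 0.51, single-rater validity 0.17 for math value added) and recovers a three-rater validity close to the reported value of about 0.20.

```python
import math

def validity_with_j_raters(validity_single, irr_single, j):
    """Attenuation formula: observed validity scales with the square root of
    reliability, so r_J = r_1 * sqrt(IRR_J / IRR_1), with IRR_J given by the
    Spearman-Brown prophecy."""
    irr_j = j * irr_single / (1 + (j - 1) * irr_single)
    return validity_single * math.sqrt(irr_j / irr_single)

print(round(validity_with_j_raters(0.17, 0.51, 3), 3))  # 0.207
```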

Discussion and conclusions
This study compared ratings for external and internal applicants to teaching positions. We find that in all subcomponents, insider applicants are rated higher than applicants without previous teaching experience or training in the district to which they are applying. Notably, the difference in ratings remains significant even when accounting for various measures of applicant qualifications and quality. We also found that the reliability of ratings is significantly higher for internal applicants. There are several possible explanations for the lower and less precise ratings of external applicants. Many of the recommendations upon which the ratings of internal applicants are based are likely to come from employees of SPS who are familiar with the context and the type of teachers the district seeks to hire. Thus, internal applicants are likely to have recommendation writers who have good information about what the district is looking for, meaning some criteria may not be addressed in letters supporting external applicants, causing lower and less homogeneous ratings for external applicants. More information on rating criteria, and better prompts regarding the kinds of information the district is trying to elicit about teacher applicants, may help reliably identify high-quality applicants from outside the district.
Additionally, raters may score an applicant whom they have themselves observed, or whose letter of recommendation comes from a writer the rater knows personally, higher and more consistently. Conversely, lower and more conservative ratings may be given to external candidates whose letters of recommendation come from writers the raters do not know.
Enabling external candidates to volunteer or work for the district, so that they can obtain a letter of recommendation from district employees, may thus help in this respect. As we have shown, a higher number of raters may also help to increase reliability, decrease error variance, and improve the predictive power of applicant ratings. A higher number of raters might therefore be considered when rating external applicants, to reach IRR levels comparable to those for internal applicants. Nevertheless, while a higher number of raters has the potential to increase the reliability of ratings, it is unlikely to solve the issue of lower, more conservative ratings of applicants from outside the district.
It is also worth pointing out methodological innovations used in this study that may be useful in other contexts. Specifically, to test group differences in inconsistencies in ratings, we employed model-based estimates of IRR. We implemented mixed-effect models to allow for analysis of IRR with an unbalanced hierarchical data structure, and we allowed different variance terms by applicant status, a covariate which may moderate IRR. This approach was shown to be more flexible than stratifying the data by applicant status and estimating IRR separately for the two groups. Our model-based approach was able to describe the data more precisely, to use information from the whole dataset jointly, and to detect differences in IRR between the two groups in cases where a stratified analysis could not. Although we focus on applicant status (internal vs. external) as a moderator of IRR in the context of teacher hiring, this is just one example of a possible application of model-based IRR. IRR has been analyzed and compared across groups with respect to assessee or rater characteristics in journal peer review [35], grant peer review [5,28,36], classroom observations of teachers [37][38], ratings of university candidates [3], student ratings, etc. In these and other areas, assessee covariates such as gender and ethnicity, rater characteristics such as position, experience, or training [39][40], or covariates of units, e.g., school type or job type, may moderate IRR and the precision of ratings. In these cases, our model-based IRR may be able to detect differences in reliability between groups even when stratified IRR calculated separately for groups cannot.

Limitations
This paper investigates differences in ratings between internal and external applicants at only one stage of the SPS selection process. However, other stages of the hiring process, e.g., the district-level rating or the interview stage, may also introduce bias.
To explain the bias in ratings, we examine only three measures of teacher qualifications and quality. While these are important predictors of teacher quality and student achievement [41][42][43], they are somewhat limited in how well they describe teacher quality. In particular, as described above, one possible explanation for what appears to be bias in the ratings is that there is a better social fit for internal applicants, i.e., that there is an unobserved factor influencing the internal-external differences. To investigate more thoroughly whether SPS might be losing high-quality external applicants due to rating biases, or to find evidence explaining why ratings of external applicants are lower, we would need other measures of teacher quality that may capture dimensions unaccounted for here, such as teacher observation scores or student/family survey ratings.
Finally, there are additional complexities that might be addressed in future work. For example, our analysis treated the ratings as if they were all completed at the same time; however, some repeated ratings occurred over a span of five years, and applicant characteristics might have changed during this period.

Conclusion
In conclusion, our study demonstrated lower and less precise ratings for external applicants to teaching positions, with the bias in ratings remaining significant even when accounting for various measures of teacher qualifications and quality. This result is of high importance for educational research as well as for other fields, suggesting that high-quality applicants who are "external" and have fewer connections to the institution and raters may be lost due to lower and less precise ratings. External applicants may thus be advised to become "insiders" before submitting an application, e.g., through volunteering, visits, or substitute or visiting positions, whenever possible. Institutions, on the other hand, might consider providing clearer guidance about what they are seeking when hiring, with a particular eye toward guidance aimed at applicants, and those recommending them, who do not know the district well.
Given the high stakes involved in ratings in many situations, e.g., ratings of job candidates, grant applications, and journal submissions, we recommend investing resources to study and improve rating systems so as to ameliorate rating biases and inconsistencies across applicant subgroups.
Supporting information S1