
Disparities in ratings of internal and external applicants: A case for model-based inter-rater reliability

  • Patrícia Martinková ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic, Institute for Research and Development of Education, Faculty of Education, Charles University, Prague, Czech Republic

  • Dan Goldhaber,

    Roles Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Center for Education Data and Research, School of Social Work, and the Center for Statistics in the Social Sciences, University of Washington, Seattle, WA, United States of America

  • Elena Erosheva

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliations Department of Statistics, School of Social Work, and the Center for Statistics in the Social Sciences, University of Washington, Seattle, WA, United States of America, Laboratoire J.A. Dieudonné, Université Côte d’Azur, CNRS, Nice, France


Ratings are present in many areas of assessment including peer review of research proposals and journal articles, teacher observations, university admissions and selection of new hires. One feature present in any rating process with multiple raters is that different raters often assign different scores to the same assessee, with the potential for bias and inconsistencies related to rater or assessee covariates. This paper analyzes disparities in ratings of internal and external applicants to teaching positions using applicant data from Spokane Public Schools. We first test for biases in rating while accounting for measures of teacher applicant qualifications and quality. Then, we develop model-based inter-rater reliability (IRR) estimates that allow us to account for various sources of measurement error, the hierarchical structure of the data, and to test whether covariates, such as applicant status, moderate IRR. We find that applicants external to the district receive lower ratings for job applications compared to internal applicants. This gap in ratings remains significant even after including measures of qualifications and quality such as experience, state licensure scores, or estimated teacher value added. With model-based IRR, we further show that consistency between raters is significantly lower when rating external applicants. We conclude the paper by discussing policy implications and possible applications of our model-based IRR estimate for hiring and selection practices in and out of the teacher labor market.


Ratings have been part of the assessment landscape in many areas for many years. They are present in the peer review of grant proposals and journal articles [1], widely considered the gold standard of science; they are integral parts of educational and psychological assessments [2]; and they appear in student admission processes [3] and the selection of new hires. The legitimacy of rating procedures depends crucially on the reliability, validity, and fairness of rating systems and processes [4].

There are numerous covariates that may affect ratings, such as an applicant’s or reviewer’s gender, ethnicity, and major or research area [1]. These factors are potential sources of bias and unfairness in ratings, but may also influence inter-rater reliability (IRR) [5]. One factor that may cause bias is the institutional proximity of the applicant. Such “affiliation bias” has, for instance, been shown in grant proposal peer review [6–8].

In labor economics, both theoretical and empirical studies confirm the commonsense notion that the human resource management processes used to make hiring decisions can have profound effects on workforce labor productivity [9–11]. The productivity of new hires depends both on their individual attributes and on the fit between employees and organizations [12, 13]; social competency, compatibility, and capital may be highly valuable and support positive work environments, productivity, and the success of organizations as a whole.

In many contexts, the selection of an employee for a position can come down to a choice between an external applicant and an insider, i.e. an applicant that is internal to a firm or organization. Yet there is relatively little evidence on how hiring processes treat external and insider applicants.

Studies that focus on internal (promotions or lateral transfers) and external hiring find that external candidates face an uphill battle to be hired over internal candidates, in that they tend to need better observable indicators of quality than their internal peers [13, 14]. This finding may be related to hiring managers having relatively more knowledge about internal candidates, the importance of firm-specific human capital, or the desire by firms to create promotion-related incentives for other employees [15].

One important issue that has received little attention is whether the applicant selection tools and ratings often used in assessing job applicants function differently for internal and external applicants. In particular, internal applicants may have advantages over external applicants due to their knowledge of the attributes that employers are looking for, because they are more likely to receive recommendations from individuals who understand the attributes that employers are looking for, or because they are directly known by hiring officials.

In this paper, we examine how the ratings on applicant selection tools compare for internal and external applicants to teaching positions in Spokane Public Schools (SPS), a relatively large school district in eastern Washington State. We use mixed-effect models [16], allowing rater and applicant covariates, to test for bias. We analyze differences in ratings between external and internal applicants, with a particular focus on the variance and IRR for these groups. We also derive a test of between-group differences in IRR, relying on mixed-effect models that allow group-specific variance terms for the random effects.

SPS teacher applicant selection tool

For hiring decisions, SPS utilizes a four-stage hiring process [17]. In the first stage, an online application management system is used for the intake and initial check of applications. Next, pre-screening of potential applicants is performed by central office human resources officials. In the third stage, applicants meeting the initial screening standards are screened by school-level hiring officials. Finally, applicants with the highest school-level screening scores are invited for in-person interviews; job offers are made based upon judgments after this final stage.

In this work, we analyze data from school-level screening, the third stage of the SPS hiring process. Importantly for our purposes, a large number of applicants screened at this stage have multiple ratings. Applicants at this stage (for the majority of the study period) were rated on a 6-point scale on nine criteria, each a subcomponent of the rating instrument. The screening rubric (which is on a 54-point scale) and criteria are outlined in Table 1. Ratings were based on written materials included in the application and in supporting documentation (e.g., resume, cover letter, and at least three letters of recommendation). A summative score was used to select which candidates receive in-person interviews. During the study period, about 40% of applicants screened at the school level were not advanced to an interview. In previous studies, both the district-level and school-level selection tools have been shown to be predictive of later teacher and student outcomes [17].

Research questions

We analyze rating disparities of internal and external applicants. Specifically, we address the following research questions:

  1. Do external applicants receive lower ratings on subcomponents and in total than internal applicants?
  2. Are any differences in ratings between the two groups explainable by other measures of applicant qualifications and quality, available before the hiring decision (e.g. years of experience, licensure test scores) or measured in the years following the hiring decision (e.g. estimated teacher value added to the subsequent achievement of their students)?
  3. Does the magnitude of variance components differ for internal versus external applicants?
  4. Is the IRR equal for internal and external applicants, or is it higher for insiders?


Teacher application dataset

Our dataset contains ratings of applicants (assessees) for teaching positions in SPS during the school years 2008–09 through 2012–13. This includes a total of 3,474 individual ratings with known applicant and rater ID and job location, representing 1,090 individual applicants rated by 137 raters for classroom-teaching job postings at 54 job locations (schools). These units are partially crossed both with applicants (many applicants applied to multiple schools) and with raters (some raters rated for multiple schools).

Applicants were rated on a 6-point scale in nine subcomponents (Table 1), and the summative score (on a cumulative 54-point scale) was also provided. Multiple ratings of the same applicant may occur within the same school during one time period (e.g. some schools employ more raters and use average total score to rank the applicants), based on multiple applications to the school at one time (to multiple job openings) or over time, and/or across different schools in the district.

We also consider three other proxies of applicant quality and qualifications: teaching experience (in years), state licensure scores (WEST-B average, math, reading and writing, all standardized statewide) and, for applicants hired in Washington State, estimates of teacher value added to the achievement of their students in mathematics and reading. Teacher value added, in simple terms, is the estimated contribution of a teacher toward student achievement gains on standardized tests, generally adjusted for student background characteristics, such as free or reduced-price lunch status. The specific linear model used to generate the value-added estimates used in this paper is described elsewhere [17].

We consider an applicant to be internal when he or she either was previously employed as a teacher in the district (e.g., at a different school, in a different position or in a different time period) or had completed his or her student teaching (part of teacher training) in the district. Otherwise, the applicant is considered to be external to the district at the time she/he is rated. Of all ratings, 2,322 were for internal applicants, and 1,152 were for external applicants. Fifty-one applicants were, by our criteria, marked as external for some ratings and as internal for others. We keep these individuals in the analysis. For comparison of the two samples, they are included in both pools depending on their status at the time each measure was taken. In the analyses, applicant status is included in the model.

Data analysis

Statistical environment R version 3.4.3 [18] and its libraries lme4 [19, 20] and lmerTest [21] are used for the analyses as specified in the subsections below. Library data.table [22] is used to reshape the data, and library ggplot2 [23] is used to prepare graphics. Commented sample R code is provided in supplemental materials.

Absolute differences in summative ratings of external and internal applicants.

Descriptive statistics for all measures are calculated for internal and external applicants. Two-sample t tests are used to test the significance of the differences, and we utilize the Benjamini–Hochberg correction of p values to account for multiple comparisons [24]. Besides p values, Cohen’s d, defined as the absolute difference between the means of the two groups divided by the pooled standard deviation of the data [25], is used to evaluate the effect sizes of the differences.
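As a brief illustration of these two ingredients (the paper's own analysis was done in R; the function names and numbers below are ours, chosen only for the sketch), Cohen's d and the Benjamini–Hochberg step-up correction can be computed as follows:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: absolute mean difference divided by the pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return abs(np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)            # indices of p values, smallest first
    adj = np.empty(m)
    running_min = 1.0
    for rank in range(m, 0, -1):     # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p[i] * m / rank)
        adj[i] = running_min
    return adj
```

In R, the equivalent of `benjamini_hochberg` is `p.adjust(p, method = "BH")`.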

We begin testing for bias in total ratings with respect to applicant internal/external status in Model (1):

Y_ijl = μ + β0·ω_i + A_i + B_j + S_l + AS_il + e_ijl. (1)

In this model, μ is the mean for external applicants and β0 is the estimated effect of being an internal applicant (identified by ω_i = 1). We also assume random effects for applicant A_i, rater B_j, and school S_l to account for the hierarchical structure of the data, and we include applicant-school interactions AS_il to account for the possibility of applicant-school matching effects. The residual e_ijl reflects the departure of the observed rating of applicant i by rater j for school l from what would be expected given the grand mean, the individual’s true score, and the effects of the rater, school, and applicant-school interaction. The residual includes possible interactions between applicant and rater and between rater and school, which are not included in the model because the data contain only limited multiple ratings of the same applicant by the same rater and limited ratings by the same rater for different schools. We assume jointly normal, uncorrelated, mean-zero distributions for the applicant, rater, school, interaction, and residual terms. In additional models, we further add fixed effects β describing the ith teacher’s qualities x_i (number of years of experience, licensure scores (WEST-B), as well as estimated teacher value added to the subsequent achievement of their students in mathematics and reading in the subpopulation of teachers hired in Washington State):

Y_ijl = μ + β0·ω_i + βᵀx_i + A_i + B_j + S_l + AS_il + e_ijl.

In all models, we test for significance of the applicant internal status effect β0 using likelihood ratio tests [26].
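To make the structure of Model (1) concrete, the following simulation sketch generates ratings from the model and recovers the internal-status effect as a raw group mean difference. This is purely illustrative: the paper's analysis used R/lme4, and every variance component and effect size below is made up, not an estimate from the data. The applicant-school match effect is also redrawn per rating here, a simplification of the model's shared AS_il term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (illustrative only, not the paper's estimates)
mu, beta0 = 36.0, 3.0                       # external mean, internal-status effect
sd_A, sd_B, sd_S, sd_AS, sd_e = 4.0, 3.0, 1.0, 2.0, 3.0

n_appl, n_raters, n_schools = 500, 50, 20
A = rng.normal(0, sd_A, n_appl)             # applicant effects
B = rng.normal(0, sd_B, n_raters)           # rater effects
S = rng.normal(0, sd_S, n_schools)          # school effects
internal = rng.integers(0, 2, n_appl)       # omega_i: internal-status indicator

rows = []
for i in range(n_appl):
    for _ in range(3):                      # three ratings per applicant
        j = rng.integers(n_raters)
        l = rng.integers(n_schools)
        AS = rng.normal(0, sd_AS)           # match effect, redrawn per rating (simplification)
        y = mu + beta0 * internal[i] + A[i] + B[j] + S[l] + AS + rng.normal(0, sd_e)
        rows.append((internal[i], y))

data = np.array(rows)
# Raw internal-minus-external mean difference should hover around beta0 = 3
diff = data[data[:, 0] == 1, 1].mean() - data[data[:, 0] == 0, 1].mean()
```

In the paper itself β0 is estimated jointly with the variance components via lme4, not by a raw mean difference; the sketch only shows how the terms of Eq 1 combine.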

Variance decomposition and testing for differential IRR for internal and external applicants.

Starting with the model defined by Eq 1, we estimate the contributions of variance from the various sources: the applicant effect, the rater effect, the school effect, applicant-school matching effects, and the residual:

σ²_total = σ²_A + σ²_B + σ²_S + σ²_AS + σ²_e.

Assuming single raters, the inter-rater reliability of applicant ratings within schools is defined as the ratio of true-score variance to total variance:

IRR = (σ²_A + σ²_S + σ²_AS) / (σ²_A + σ²_B + σ²_S + σ²_AS + σ²_e). (2)

It is clear from Eq 2 that IRR is higher when applicants, schools, and applicant-school interactions account for a substantial proportion of rating variation and raters and other sources of variation do not.
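Eq 2 is a simple ratio of variance components. A minimal sketch (our own helper, with made-up variance components purely for illustration):

```python
def irr_single_rater(var_appl, var_rater, var_school, var_match, var_resid):
    """Within-school IRR (Eq 2): true-score variance over total variance."""
    true_score = var_appl + var_school + var_match   # applicant + school + applicant-school
    total = true_score + var_rater + var_resid       # add rater and residual variance
    return true_score / total

# Illustrative components: applicant 4, rater 3, school 1, match 2, residual 3
print(irr_single_rater(4.0, 3.0, 1.0, 2.0, 3.0))     # 7/13 ≈ 0.538
```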

When analyzing between-group differences in reliability, IRR is usually calculated separately for groups using stratified data [27, 28]. We take a more flexible approach to test for differential IRR by group. Specifically, in the following model we allow the variance terms of the main random effects to differ by group g (internal, int, or external, ext):

Y_ijl = μ + β0·ω_i + A_i^(g) + B_j^(g) + S_l + AS_il + e_ijl^(g), (3)

where Var(A_i^(g)) = σ²_A,g, Var(B_j^(g)) = σ²_B,g, and Var(e_ijl^(g)) = σ²_e,g for g ∈ {int, ext}. In the model defined by Eq 3 (also referred to as Model (3) below), estimates of the variance components are obtained separately for internal and for external applicants. IRR is then estimated using Eq 2 for the two sets of variance component estimates. The total variance now decomposes into 8 terms, and the within-school IRR now varies for the two groups due to the variance components that are allowed to vary by group:

IRR_int = (σ²_A,int + σ²_S + σ²_AS) / (σ²_A,int + σ²_B,int + σ²_S + σ²_AS + σ²_e,int), (4)

IRR_ext = (σ²_A,ext + σ²_S + σ²_AS) / (σ²_A,ext + σ²_B,ext + σ²_S + σ²_AS + σ²_e,ext). (5)

We use bootstrap procedures to calculate confidence intervals for the IRR estimates and for the difference between the IRRs of internal and external applicants. All calculations are performed for the summative overall score as well as for the individual subcomponents.

Effect of higher number of raters.

We use the prophecy formula [29, 30] and generalizability theory [31] to provide estimates of IRR under various potential scoring designs, i.e., assuming differing numbers of raters. IRR is estimated as the ratio of the “true score” variance of an applicant for a given school to the total variance of the average score from multiple ratings (the true score plus the error variance of the average). For J raters, the average rating is Ȳ_il = (1/J) Σ_j Y_ijl and its variance decomposes as

σ²_A + σ²_S + σ²_AS + (σ²_B + σ²_e)/J.

A higher number of raters J, and thus lower error variance, implies higher within-school IRR:

IRR_J = (σ²_A + σ²_S + σ²_AS) / (σ²_A + σ²_S + σ²_AS + (σ²_B + σ²_e)/J). (6)

We provide estimates of IRR for internal and external applicants using Model (3) for the cases of one, two, and three raters. We also use the standard error of measurement (SEM), the square root of the error variance (σ²_B + σ²_e)/J, to evaluate the precision of estimates of the score level.
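The prophecy-formula relationship in Eq 6 and the corresponding SEM can be sketched as follows (our helper functions, with the same made-up variance components as above, not estimates from the data):

```python
def irr_j_raters(var_appl, var_rater, var_school, var_match, var_resid, J):
    """Eq 6: within-school IRR of the average of J raters."""
    true_score = var_appl + var_school + var_match
    error = (var_rater + var_resid) / J          # error variance shrinks with J
    return true_score / (true_score + error)

def sem_j_raters(var_rater, var_resid, J):
    """SEM of the J-rater average: square root of the error variance."""
    return ((var_rater + var_resid) / J) ** 0.5

# With illustrative components, reliability grows and SEM shrinks as raters are added:
for J in (1, 2, 3):
    print(J, round(irr_j_raters(4.0, 3.0, 1.0, 2.0, 3.0, J), 3),
          round(sem_j_raters(3.0, 3.0, J), 3))
```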

To analyze whether the reliability of ratings influences their predictive validity, we examine correlations of ratings with estimates of teacher value added. Correlations between teacher value added and single ratings are calculated from the full sample without accounting for applicant status. Correlations between teacher value added and the average of two or three raters are estimated from the single-rater correlation using the IRR estimates under Model (3) with respect to applicant status, employing the attenuation formula [32, 33]:

r_J = r_1 · √(IRR_J / IRR_1). (7)
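The projection in Eq 7 scales an observed single-rater correlation by the square root of the reliability ratio. A sketch with illustrative numbers (the 0.17 single-rater correlation echoes the magnitude reported below, while the three-rater IRR of 0.68 is an assumed value, not the paper's estimate):

```python
def projected_validity(r_single, irr_single, irr_multi):
    """Eq 7: expected correlation with value added when averaging raters,
    obtained by rescaling the single-rater correlation by the reliability ratio."""
    return r_single * (irr_multi / irr_single) ** 0.5

# Single-rater correlation 0.17, IRR rising from 0.51 (one rater) to an assumed 0.68:
print(round(projected_validity(0.17, 0.51, 0.68), 3))   # ≈ 0.196
```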


Characteristics of internal and external applicants

Table 2 provides applicant pre-hiring characteristics, the summative and subcomponent ratings received by each applicant during the hiring process, as well as applicants’ subsequent quality measures (estimated teacher value added). We observe a significantly higher male-to-female ratio and greater experience among external applicants. While licensure scores are more often missing for external applicants, and later value added estimates are less often available due to the lower hiring percentage of external applicants, the available mean licensure scores and mean value added estimates of internal and external applicants are comparable.

Table 2. Applicant characteristics for internal and external applicant ratings.

Differences in rating on the 54-point screening rubric

Table 2, Fig 1 and Fig 2 demonstrate differences in rating of applicants internal and external to the district. While internal applicants’ total score is on average 39 points, external applicants score on average more than 3 points lower. Ratings are significantly lower for external applicants across all subcomponents.

Fig 1. Distribution of total ratings for internal and external applicants.

Fig 2. Distribution of subcomponent ratings for internal and external applicants.

Summative ratings of internal applicants remain significantly higher, by about 3 points, even when accounting for measures of teacher qualifications: previous teaching experience or state licensure scores (WEST-B). The difference is more apparent (around 4 points) when accounting for subsequent teacher quality estimated as teacher value added in the subsample of applicants hired in Washington State (Table 3). These differences are consistent across all subsamples (S1 Table).

Differences in variance decomposition and inter-rater reliability

Besides differences in the ratings of external and internal applicants, we now turn to differences in the precision of the ratings between the two groups (for the summative score, see Fig 3).

Fig 3. Mean and range of summative ratings of applicants rated multiple times between 2009–2013.

Each vertical line connects summative ratings given to single applicant during this period. Applicants are ordered by average summative rating (solid circles).

To assess differences in IRR between internal and external applicants, we provide a decomposition of the variance terms in the joint Model (3) by applicant type, internal and external (Fig 4, S2 Table). We also provide a comparison with a stratified approach, as in [27, 28] (S3 Table). We observe that for the summative score, as well as for most of the subcomponents, rater variance is higher for external applicants, i.e., ratings are less homogeneous when rating external applicants. In addition to higher rater variance, we also observe lower applicant variance for external applicants, i.e., external applicants (their qualities) are more homogeneous.

Fig 4. Variance decomposition for internal and external applicants calculated using Model (3) jointly on all data.

These differences in variance components result in lower IRR for external applicants (0.42, CI 0.36–0.49) than for internal applicants (0.51, CI 0.45–0.57), with the difference between internal and external IRR being significantly nonzero for summative scores (0.09, CI 0.03–0.14); see Fig 5 and S2 Table. The differences between internal and external IRR are confirmed as statistically significant by likelihood ratio tests. We find that Model (3), allowing for different variance terms in ratings of internal and external applicants, fits significantly better than Model (1) for the summative score as well as for the subcomponents.

Fig 5. Within-school IRR estimates for internal applicants, external applicants and their difference, including bootstrap confidence intervals, calculated using Model (3) jointly on all data.

Note that if Model (1) is fitted for internal and external applicants separately (S3 Table), we also obtain higher rater variance and lower IRR for external applicants. However, this stratified approach does not allow for testing the significance of the difference, nor does it allow a mix of common and group-specific variance components, or the simultaneous use of information from applicants who were external in some applications but internal in others. Finally, Model (3) is more flexible in allowing the researcher to decide which variance components are treated as common to the two groups.

Effect of higher number of raters on reliability and validity of scoring

Table 4 provides IRR estimates for the three scoring designs (using one, two, and three raters per school). While the rule-of-thumb lower limit of 0.7 for reliability [34] can be reached for the summative score when the average of three raters is used for internal applicants, this 0.7 standard is not reached for external applicants.

Table 4. Effect of number of raters on reliability, standard error and predictive validity of scoring.

We also find that for both the summative and subcomponent scores, the standard errors are quite large if only a single rater is employed to rate application materials (Table 4). For the summative score, the standard error of measurement (SEM) is over 5.0, which implies that scores could easily move 10 points up or down, a very large gap relative to the 54-point scale. Across most subcomponents, SEM is higher for external applicants. Increasing the number of raters reduces the SEMs, but the differences in SEM between internal and external applicants remain large.

To summarize, using a higher number of raters markedly improves predictive validity (Table 4). In our case, the predictive validity of the summative score for predicting subsequent teacher value added in math is estimated to increase from 0.17 to about 0.20 (an increase of 18%) for internal applicants when three raters are employed compared to a single rater. This increase is slightly higher for external applicants. Additionally, some subcomponents whose single-rater correlations with value added are insignificant (namely Training, Experience, Cultural Competency, and Preferred Qualifications) are found to have significant correlations with value added with a higher number of raters (see Table 4).

Discussion and conclusions

This study compared ratings for external and internal applicants to teaching positions. We find that on all subcomponents, insider applicants are rated higher than applicants without previous teaching experience or training in the district to which they are applying. Notably, the difference in ratings remains significant even when accounting for various measures of applicant qualifications and quality. We also find that the reliability of ratings is significantly higher for internal applicants.

There are several possible explanations for the lower and less precise ratings of external applicants. Many of the recommendations upon which ratings of internal applicants are based are likely to come from employees of SPS who are familiar with the context and the type of teachers the district seeks to hire. Thus, internal applicants are likely to have recommendation writers with good information about what the district is looking for, meaning some criteria may not be addressed in letters supporting external applicants, causing lower and less homogeneous ratings of external applicants. More information on rating criteria, and better prompts in terms of the kinds of information that the district is trying to elicit about teacher applicants, may help reliably identify high-quality applicants from outside the district.

Additionally, raters may score an applicant higher and more consistently when they have observed the applicant themselves, or when the applicant’s letter of recommendation comes from a writer the rater knows personally. Conversely, lower and more conservative ratings may be given to external candidates whose letters of recommendation come from writers the raters do not know. Enabling external candidates to volunteer or work for the district, and thereby obtain a letter of recommendation from district employees, may thus help in this respect.

As we have shown, a higher number of raters may also help to increase reliability, decrease error variance, and improve the predictive power of applicant ratings. A higher number of raters might therefore be considered for rating external applicants, to reach IRR levels comparable to those for internal applicants. Nevertheless, while a higher number of raters has the potential to increase the reliability of ratings, it is unlikely to solve the issue of lower, more conservative ratings of applicants from outside the district.

It is also worth pointing out methodological innovations used in this study that may be useful in other contexts. Specifically, to test group differences in inconsistencies in ratings, we employed model-based estimates of IRR. We implemented mixed-effect models to allow for analysis of IRR with the unbalanced hierarchical structure of the data, and we allowed for different variance terms by applicant status, a covariate which may moderate IRR. This approach was shown to be more flexible than stratifying the data with respect to applicant status and estimating IRR separately for the two groups. Our model-based approach was able to describe the data more precisely, to jointly use information from the whole dataset, and to detect differences in IRR between the two groups in cases where the stratified analysis was not able to.

Although we focus on applicant status (internal vs. external) as a moderator of IRR in the context of teacher hiring, this is just one example of a possible application of model-based IRR. IRR has been analyzed and compared across groups with respect to assessee or rater characteristics in journal peer review [35], grant peer review [5, 28, 36], classroom observations of teachers [37, 38], ratings of university candidates [3], student ratings, etc. In these areas and others, potential exists for assessee covariates such as gender and ethnicity, rater characteristics such as rater position, experience or training [39, 40], or covariates of units, e.g. school type or job type, to moderate IRR and the precision of ratings. In these cases, our model-based IRR may be able to detect differences in reliability between groups even when stratified IRR calculated separately for groups cannot.


This paper investigates differences in ratings between internal and external applicants at only one stage of the SPS selection process. However, other stages of the hiring process, e.g. the district-level rating or the interview stage, may also introduce bias.

To explain the bias in ratings, we examine only three measures of teacher qualifications and quality. While these are important predictors of teacher quality and student achievement [41–43], they are somewhat limited in how well they describe teacher quality. In particular, as described above, one possible explanation for what appears to be bias in the ratings is that there is better social fit for internal applicants, i.e. that there is an unobserved factor influencing the internal-external differences. To investigate more thoroughly whether SPS might be losing high-quality external applicants due to rating biases, or to find evidence explaining why ratings of external applicants are lower, we would need other measures that may capture dimensions of teacher quality unaccounted for here, such as teacher observation scores or student/family survey ratings.

Finally, there are additional complexities that might be addressed in future work. For example, our analysis treated the ratings as if they were all completed at the same time; however, some repeated ratings occurred over a timespan of 5 years, and applicant characteristics might have changed during this period.


In conclusion, our study demonstrated lower and less precise ratings for external applicants to teaching positions, with the bias in ratings remaining significant even when accounting for various measures of teacher qualifications and quality. This result is of high importance for educational research as well as for other fields, suggesting that high-quality applicants who are “external” and have fewer connections to the institution and raters may be lost due to lower and less precise ratings. As a result, external applicants may be advised to become “insiders” before submitting an application, e.g. through volunteering, visits, or substitute-teaching or visiting positions, whenever possible. Institutions, on the other hand, might consider providing clearer guidance about what they are seeking when hiring, with a particular eye toward guidance aimed at applicants, and those recommending them, who do not know the district well.

Given the high stakes involved in ratings in many situations (e.g., ratings of job candidates, grant applications, journal submissions), we recommend investing resources to study and improve rating systems in order to ameliorate rating biases and inconsistencies across applicant subgroups.

Supporting information

S1 Table. Model 1A from Table 3 for restricted samples.


S2 Table. Decomposition of variance terms using Model (3) jointly for data of internal and external applicants.


S3 Table. Decomposition of variance terms when using Model (1) separately for internal and external applicants.



Disclaimer: The ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors’ institutions or the funding agencies is not intended and should not be inferred. The authors take responsibility for any errors.

This work was supported by the Czech Science Foundation Grant #JG15-15856Y, the Institute of Education Sciences, U.S. Department of Education Grant #R305C130030, the National Science Foundation Grant #1759825, and the COST Action TD1306 “New frontiers of peer review”. The research was partly conducted while P. Martinková was visiting the University of Washington as a Fulbright-Masaryk fellow. The work has benefited from helpful research assistance by graduate students Malcolm Wolf (University of Washington) and Adéla Drabinová (Charles University and Institute of Computer Science of the Czech Academy of Sciences). The authors would also like to thank Roddy Theobald and Marek Brabec for their helpful comments on prior versions of this manuscript.


  1. Mutz R, Bornmann L, Daniel H-D. Heterogeneity of inter-rater reliabilities of grant peer reviews and its determinants: A general estimating equations approach. PLOS ONE, 2012; 7(10): e48509. pmid:23119041
  2. Casabianca JM, Junker BW, Patz R. The hierarchical rater model. In: van der Linden WA & Hambleton RK, editors. Handbook of modern item response theory. Boca Raton, FL: Chapman & Hall/CRC; 2017. pp. 449–465.
  3. Ziv A, Rubin O, Moshinsky A, Gafni N, Kotler M, Dagan Y, Lichtenberg D, Mekori YA, Mittelman M. MOR: a simulation-based assessment centre for evaluating the personal and interpersonal qualities of medical school candidates. Medical Education, 2008; 42: 991–998. pmid:18823518
  4. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014.
  5. Marsh HW, Jayasinghe UW, Bond NW. Improving the peer-review process for grant applications: reliability, validity, bias, and generalizability. American Psychologist, 2008; 63(3): 160–168. pmid:18377106
  6. Van den Besselaar P. Grant committee membership: service or self-service? Journal of Informetrics, 2012; 6: 580–585.
  7. Sandström U, Hallsten M. Persistent nepotism in peer-review. Scientometrics, 2008; 74(2): 175–189.
  8. Wennerås C, Wold A. Nepotism and sexism in peer-review. Nature, 1997; 387: 341–343. pmid:9163412
  9. Becker B, Gerhart B. The impact of human resource management on organizational performance: Progress and prospects. Academy of Management Journal, 1996; 39(4): 779–801.
  10. Huselid MA. The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal, 1995; 38(3): 635–672.
  11. Koch M, McGrath R. Improving labor productivity: Human resource management policies do matter. Strategic Management Journal, 1996; 17(5): 335–354.
  12. Lazear EP. Firm-specific human capital: A skill-weights approach. Journal of Political Economy, 2009; 117(5): 914–940.
  13. DeVaro J, Morita H. Internal promotion and external recruitment: a theoretical and empirical analysis. Journal of Labor Economics, 2013; 31(2): 227–269.
  14. Chan W. External recruitment versus internal promotions. Journal of Labor Economics, 1996; 14(4): 555–570.
  15. DeVaro J, Kauhanen A, Valmari N. Internal and external hiring: the role of prior job assignments. Paper presented at the Fourth SOLE-EALE World Meeting, Montreal. 2015. Retrieved from
  16. Goldstein H. Multilevel Statistical Models, Fourth Edition. Chichester, UK: Wiley; 2011.
  17. 17. Goldhaber D, Grout C, Huntington-Klein N. Screen twice, cut once: Assessing the predictive validity of teacher selection tools. Education Finance and Policy, 2017; 12 (2): 197–223.
  18. 18. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, Vienna, Austria; 2018. Retrieved from
  19. 19. Bates D, Maechler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 2015; 67(1): 1–48.
  20. 20. Pinheiro J, Bates D. Mixed-Effects Models in S and S-PLUS. Springer, New York, NY; 2000.
  21. 21. Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software, 2017; 82(13): 1–26.
  22. 22. Dowle M, Srinivasan A. data.table: Extension of “data.frame”. R package version 1.10.4–3, 2017. URL
  23. 23. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer-Verlag; 2009.
  24. 24. Benjamini Y., Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the -Royal Statistical Society: Series B, Statistical Methodology, 1995; 57, 289–300.
  25. 25. Cohen J. Statistical power analysis for the behavioral sciences. USA: Lawrence Erlbaum Associates; 1988.
  26. 26. Agresti A. Categorical data analysis. Hoboken, NJ: Wiley–Interscience. 2002.
  27. 27. Casabianca JM, McCaffrey DF, Gitomer DH, Bell CA, Hamre BK, Pianta RC. Effect of Observation Mode on Measures of Secondary Mathematics Teaching. Educational Psychological Measurement, 2013; 73(5), 757–783.
  28. 28. Sattler DN, McKnight PE, Naney L, Mathis R. Grant Peer Review: Improving Inter-Rater Reliability with Training. Clifford T, ed. PLoS ONE. 2015;10(6):e0130450. pmid:26075884
  29. 29. Spearman CC. Correlation calculated from faulty data. British Journal of Psychology, 1910; 3: 271–295.
  30. 30. Brown W. Some experimental results in the correlation of mental abilities. British Journal of Psychology. 1910; 3: 296–322.
  31. 31. Brennan RL. Generalizability theory. New York, NY: Springer-Verlag; 2001.
  32. 32. Spearman C. The proof and measurement of association between two things. The American Journal of Psychology, 1904; 15(1): 72–101.
  33. 33. Schmidt FL, Hunter JE. Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1996; 1(2): 199–233.
  34. 34. Nunnally JC, Bernstein IH. Psychometric theory (3rd ed.). New York: McGraw-Hill; 1994.
  35. 35. Marsh HW, Ball S. The Peer Review Process Used to Evaluate Manuscripts Submitted to Academic Journals: Interjudgmental Reliability. Journal of Experimental Education. 1989; 57: 151–69.
  36. 36. Jayasinghe UW, Marsh HW, Bond N. A multilevel cross-classified modelling approach to peer review of grant proposals: The effects of assessor and researcher attributes on assessor ratings. Journal of the Royal Statistical Society. Series A (Statistics in Society), 2003; 166(3): 279–300.
  37. 37. Hill HC, Charalambous CY, Kraft MA. When rater reliability is not enough: Teacher observation systems and a case for the G-study. Educational Researcher. 2012; 41(2): 56–64.
  38. 38. Ho AD, Kane TJ. The reliability of classroom observations by school personnel. 2013. Retrieved from
  39. 39. Conway JM, Jako R, Goodman D. A meta-analysis of interrater and internal consistency reliability of selection interviews. Journal of Applied Psychology, 1995; 80(5): 565–79.
  40. 40. Katko NJ, Meyer GJ, Mihura JL, Bombel G. Moderator analyses for the interrater reliability of Elizur's Hostility Systems and Holt's Aggression Variables: A meta-analytical review. Journal of Personality Assessment, 2013; 91(4), S1–S3.
  41. 41. Rockoff JE. The Impact of Individual Teachers on Student Achievement: Evidence from Panel Data. The American Economic Review, 2004; 94(2), 247–252.
  42. 42. Goldhaber D, Gratz T, Theobald R. What's in a teacher test? Assessing the relationship between teacher licensure test scores and student STEM achievement and course-taking. Economics of Education Review, 2017; 61(C), 112–129.
  43. 43. Chetty R, Friedman JN, Rockoff JE. Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood. American Economic Review, 2014; 104(9): 2633–79.