A Critical Meta-Analysis of Lens Model Studies in Human Judgment and Decision-Making


Achieving accurate judgment (‘judgmental achievement’) is of utmost importance in daily life across multiple domains. The lens model and the lens model equation provide useful frameworks for modeling components of judgmental achievement and for creating tools to help decision makers (e.g., physicians, teachers) reach better judgments (e.g., a correct diagnosis, an accurate estimation of intelligence). Previous meta-analyses of judgment and decision-making studies have attempted to evaluate overall judgmental achievement and have provided the basis for evaluating the success of bootstrapping (i.e., replacing judges by linear models that guide decision making). However, previous meta-analyses have failed to appropriately correct for a number of study design artifacts (e.g., measurement error, dichotomization), which may have biased estimations (e.g., of the variability between studies) and led to erroneous interpretations (e.g., with regards to moderator variables). In the current study we therefore conduct the first psychometric meta-analysis of judgmental achievement studies that corrects for a number of study design artifacts. We identified 31 lens model studies (N = 1,151, k = 49) that met our inclusion criteria. We evaluated overall judgmental achievement as well as whether judgmental achievement depended on decision domain (e.g., medicine, education) and/or the level of expertise (expert vs. novice). We also evaluated whether using corrected estimates affected conclusions with regards to the success of bootstrapping with psychometrically corrected models. Further, we introduce a new psychometric trim-and-fill method to estimate the effect sizes of potentially missing studies and to correct psychometric meta-analyses for the effects of publication bias. Comparison of the results of the psychometric meta-analysis with the results of a traditional meta-analysis (which only corrected for sampling error) indicated that artifact correction leads to a) an increase in the values of the lens model components, b) reduced heterogeneity between studies, and c) an increase in the success of bootstrapping. We argue that psychometric meta-analysis is useful for accurately evaluating human judgment and for demonstrating the success of bootstrapping.


Improving judgment and decision making is of utmost importance across multiple domains of life, as even minor inaccuracies can sometimes have a major impact. For example, within the medical domain, if a physician is able to accurately diagnose cancer, the patient will likely receive early treatment and have a greater chance of surviving. Within other domains such as business or education, individuals (e.g., managers, teachers) must make important decisions about the use of human and financial resources based on their judgment of ambiguous situations (e.g., the payoff of a certain strategy, the intelligence of a student). Hence, it is no wonder that judgmental achievement and decision making have for many years been an important area of research, as reflected in the considerable number of studies that have evaluated the success of human judgment across multiple fields (e.g., [1]–[3]). Within judgment and decision-making approaches, the lens model ([4], see below) provides a useful framework for understanding and modeling components of judgmental achievement. Previous meta-analyses of lens model studies have indicated that estimates of judgmental achievement vary widely across studies (see [5]). Because previous meta-analyses [5], [6] have not corrected for methodological artifacts (e.g., measurement error), previous estimates of judgmental achievement are likely biased. Furthermore, there is ambiguity with regards to the extent to which heterogeneity in estimates of judgmental achievement across studies stems from methodological artifacts as opposed to ‘substantive’ differences due to underlying moderators (e.g., decision domain, judge expertise).

To address the problems with previous meta-analyses, we conduct a psychometric meta-analysis of lens model studies across a number of decision-making domains (e.g., business, medicine, education, psychology). We correct for multiple study design artifacts (e.g., sampling error, measurement error, dichotomization). We compare results of a traditional meta-analytical approach with the psychometric approach to examine how methodological artifacts bias estimates and may lead to erroneous interpretations. Furthermore, we examine the extent to which judgmental achievement varies by domain (e.g., if physicians judge more accurately than teachers), level of expertise (i.e., if experts judge more accurately than novices), and whether the effect of expertise differs by domain (i.e., if expertise leads to better accuracy in some domains but not in others).

Finally, a further goal of the current paper is to contribute to the development of better decision making tools. Researchers have used the lens model equation to build linear models to ‘bootstrap’ judges (that is, replace human judges by equations to guide decision making) to increase judgment accuracy. For example, researchers have built models that physicians can use to make important medical judgments (see for example [7]). Previous meta-analyses have suggested that bootstrapping judges generally results in a slight increase in judgmental achievement relative to human judgment, although there seems to be high heterogeneity in its success (e.g., [2], [6]). However, it is possible that failure to correct for methodological artifacts may have led to an over- or underestimation of the potential success of bootstrapping relative to human judges. We therefore examine whether psychometrically-corrected linear models for decision making can increase the success of bootstrapping.

The Lens Model Framework

The lens model [4] identifies multiple components of judgment (in)accuracy. In a typical lens model study, a ‘judge’ must make a number of decisions based on different pieces of information (‘cues’). Judgmental achievement is measured by the extent to which the judge's judgment matches (i.e., correlates with) an indicator of the actual outcome or situation (‘criterion’). Einhorn (second study, [8]) provides an example of a typical lens model study (see Figure 1). In this study, physicians evaluated the severity of Hodgkin's disease (cancer) based on patients' biopsy slides (see the right side of Figure 1, Ys). Physicians made a judgment with regards to the estimated survival time, which was compared with the actual number of months of survival (see the left side of Figure 1, Ye). A high correlation between physicians' judgments and the actual months of survival indicated high judgmental achievement.

Figure 1. The lens model applied to physicians' diagnosis of cancer (see [8]).

The lens model is the basis for the lens model equation (LME; see [9]–[11]; for more background information on the LME, see [12]). As shown in Equation 1, the LME mathematically describes judgmental achievement (ra, i.e., the correlation between a person's judgments and a particular criterion) in terms of four components. Namely, judgmental achievement is equal to a linear knowledge term (G) multiplied by a task predictability term (Re) multiplied by a consistency term (Rs) plus a non-linear knowledge term (C). The linear knowledge component (G) refers to the correlation between the predicted human judgment and the predicted criterion (e.g., the predicted physician's judgment about survival time, the predicted actual months of survival). Task predictability (Re) refers to the multiple correlation of the cues with the criterion (e.g., the extent to which characteristics of the biopsy slide correlate with the months of survival), or in other words, the extent to which a decision can be made based on the information available. Consistency (Rs) refers to the reliability of judgments, that is, the extent to which a judge reliably reaches the same decision based on the same pieces of information (e.g., the extent to which a physician reaches the same diagnosis based on biopsy slides with similar characteristics), or in other words, the multiple correlation of the cues with the person's estimates. The non-linear knowledge component (C) represents the correlation between the variance not captured by the environmental predictability component or the consistency component (i.e., the correlation between the residuals from the above predictions). Previous research has revealed that the non-linear knowledge component is generally quite small (average C = .08, [13], p. 129); hence we exclude it from our analysis.

$$r_a = G\,R_e\,R_s + C\,\sqrt{1 - R_e^2}\,\sqrt{1 - R_s^2} \qquad (1)$$

The definitions of the single components in detail are:

ra = the achievement index (i.e., the correlation between a person's judgments and the criterion),

Re = the task predictability index (i.e., the multiple correlation of the cues with the criterion),

Rs = consistency (i.e., the multiple correlation of the cues with a judge's estimate),

G = a knowledge index that reflects achievement (i.e., the correlation between the predicted levels of the criterion and the predicted judgments), and

C = an unmodeled knowledge component that signifies the correlation between the variance not captured by the environmental predictability component or the consistency component (i.e., the correlation between the residuals from the above predictions).
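As an illustration of how the components defined above relate to one another, the following minimal sketch computes them for a single simulated judge. It assumes the standard form of the LME, in which the C term is weighted by the unexplained variance on both sides of the lens, and it uses ordinary least squares for both linear models. The data, cue weights, and function name are hypothetical and are not taken from any of the included studies.

```python
import numpy as np

def lens_model_components(cues, criterion, judgment):
    """Return (ra, G, Re, Rs, C) for one judge (hypothetical helper)."""
    X = np.column_stack([np.ones(len(cues)), cues])  # cues plus intercept
    # Linear model of the environment (criterion) and of the judge
    pred_e = X @ np.linalg.lstsq(X, criterion, rcond=None)[0]
    pred_s = X @ np.linalg.lstsq(X, judgment, rcond=None)[0]

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    ra = corr(judgment, criterion)                   # achievement
    Re = corr(criterion, pred_e)                     # task predictability
    Rs = corr(judgment, pred_s)                      # consistency
    G = corr(pred_e, pred_s)                         # linear knowledge
    C = corr(criterion - pred_e, judgment - pred_s)  # residual correlation
    return ra, G, Re, Rs, C

# Simulated data: 200 cases, 3 cues (all values are made up for illustration)
rng = np.random.default_rng(0)
cues = rng.normal(size=(200, 3))
criterion = cues @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=0.5, size=200)
judgment = cues @ np.array([0.5, 0.4, 0.0]) + rng.normal(scale=0.7, size=200)

ra, G, Re, Rs, C = lens_model_components(cues, criterion, judgment)
reconstructed = G * Re * Rs + C * np.sqrt(1 - Re**2) * np.sqrt(1 - Rs**2)
print(round(ra, 3), round(reconstructed, 3))  # the two values coincide
```

When both linear models are fitted on the same cases, the decomposition holds exactly, so the observed achievement and the value rebuilt from the components agree up to floating-point error.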

The success of bootstrapping judges with a linear model

The lens model can be used to create linear judgment models (i.e., equations) that can be used to support judgment and decision making, essentially by ‘correcting’ for the inconsistency with which human judges use cues to reach a judgment. The process (and success) of replacing a human judge with a judgment model is referred to as ‘bootstrapping’ (see [6], [14]) and is also discussed under the topic of ‘man versus model of man’ (see [8]). The idea of creating such judgment models can be traced back to Meehl's [15] evaluation of whether clinical psychologists reach more accurate judgments about a patient relative to an equation.

Linear judgment models are defined with the same linear knowledge (G) and task predictability (Re) terms as in the lens model (see Equation 1), but with the assumption that there is perfect consistency in how a judge uses a particular piece of information (Rs = 1), which is of course never the case with a human judge. As displayed in Equation 2, the success of a linear judgment model relative to a human judge can be estimated by the difference between the linear judgment model on the one hand and human judgmental achievement ra on the other hand (for details, see [2], p. 413):

$$G\,R_e - r_a = G\,R_e\,(1 - R_s) - C\,\sqrt{1 - R_e^2}\,\sqrt{1 - R_s^2} \qquad (2)$$
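The comparison between a linear judgment model and a human judge can be sketched numerically. The snippet below assumes that the model's validity is G multiplied by Re (the lens model terms with Rs set to 1) and that human achievement follows the standard LME. The function name and the input values are hypothetical and chosen only for illustration.

```python
from math import sqrt

def bootstrapping_gain(G, Re, Rs, C=0.0):
    """Difference between model validity (G * Re) and human achievement ra."""
    ra = G * Re * Rs + C * sqrt(1 - Re**2) * sqrt(1 - Rs**2)
    model_validity = G * Re  # perfectly consistent use of the cues (Rs = 1)
    return model_validity - ra

# Hypothetical component values, not estimates from the meta-analysis
gain = bootstrapping_gain(G=0.8, Re=0.7, Rs=0.8)
print(round(gain, 3))
```

With C set to zero, the gain grows as consistency (Rs) falls, which is why bootstrapping is expected to help most for inconsistent judges.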

Previous Meta-Analyses of Judgmental Achievement

Previous meta-analyses of lens model studies have revealed a large heterogeneity of judgmental achievement estimates across studies [5], [6] and that bootstrapping judges with a linear judgment model generally results in only a slight increase in judgmental achievement (e.g., [2], [6]). However, to the best of our knowledge, no previous meta-analysis has followed a psychometric approach that appropriately corrects for multiple methodological artifacts. When left uncorrected, methodological differences between the studies included in the meta-analyses such as varying sample sizes (sampling error), varying reliability of the measurements used in different studies (measurement error), and dichotomization of a continuous variable can lead to biased estimations. Two previous meta-analyses of lens model studies [5], [6] applied ‘bare-bones meta-analysis’ (i.e., they corrected only for sampling error; [16], p. 132), but they did not control for other methodological artifacts. In the current study, we build on the results of previous bare-bones meta-analyses and follow the psychometric Hunter-Schmidt approach (see below) to correct for multiple study design artifacts and thus, we argue, arrive at less biased estimates of the LME components. We also check the robustness of our results by estimating the potential effect of publication bias, that is, the tendency for studies with significant results to be published more often than studies with non-significant results. In our case, it could be that studies with zero correlations are reported less frequently than studies with at least moderate correlations. Publication bias may thus threaten the representativeness of the studies included in the meta-analysis. We describe a new method for estimating potential publication bias (see below).

In the current study, we also extend previous research and investigate whether judgmental achievement varies according to judge expertise and decision domain. Karelaia and Hogarth [6] found that expertise is negatively related to judgmental achievement; however the authors did not control for decision domain. The authors concluded that expertise in some domains may be particularly difficult to develop and hence only weakly related to judgmental achievement (see also [17], [18]). Kaufmann and Athanasou [5] considered different decision domains, but they neglected to simultaneously consider judges' expertise. In the current psychometric meta-analysis, we therefore simultaneously investigate both expertise and decision domain as well as expertise within domains as potential moderators of judgmental achievement. Does expertise matter more in some domains relative to others? Finally, we also compare the success of bootstrapping (see Equation 2) with linear judgment models based on estimates of the LME components generated from bare-bones meta-analysis with the success of bootstrapping with linear judgment models based on estimates generated from psychometric meta-analysis.


Description of the Database

The flowchart in Figure 2 depicts the five literature search strategies used in the current study (see Figure 2, point A). To find studies, we searched relevant databases (e.g., PsycINFO, Psyndex, Web of Science) using different keywords (e.g., ‘lens model’, ‘lens model equation’, ‘judgmental achievement’), searched key articles and books in the area of research, and activated a Google alert to notify us of any new relevant publications. We then cross-checked the database with sources found in other reviews (e.g., [19], see point B in the flowchart).

Figure 2. The process of identifying relevant studies for the meta-analysis.

Point C lists the exclusion criteria. To prevent any aggregation bias, we only considered studies on judgment that had aggregated results across individuals, thus excluding those with aggregated results across cues (e.g., [20]). We included data derived from lens model studies of individual judges and of aggregated data across judges. We observe that the idiographic approach is often neglected in lens model studies [21]. Hence, mostly aggregated judgments made by multiple judges as opposed to judgments of single judges are reported in lens model studies.

In the current study we were interested in evaluating judgmental achievement without any feedback opportunities, as would be the case in naturalistic, everyday settings. Business managers, for example, receive little feedback on the accuracy of their judgments. Moreover, they often have no way of knowing whether the feedback they do receive is accurate (see [22]). Likewise, physicians frequently do not get any feedback about the accuracy of their judgments, as patients fail to return or are referred elsewhere, or diagnoses remain uncertain [23]. We therefore excluded studies in which judges received ongoing feedback on the accuracy of their decisions and/or had the opportunity to learn during the tasks. We argue that studies that included feedback and/or learning opportunities do not adequately represent the daily life of participants and could thus have biased our results.

Further details on the construction of our database, such as our search protocol, are available in Kaufmann [13].

A total of 31 studies met our inclusion criteria [8], [14], [23]–[51]. The studies were coded based on certain characteristics (e.g., year of publication, sample size) or possible moderator variables (judges' level of expertise, decision domain). Tables 1 and 2 summarize the characteristics of the included studies. Decision domain was coded as medicine, business, psychology, education, or as miscellaneous. With the exception of the medical domain, all other domains included both experts and non-experts (i.e., students) as judges. The database included 49 judgment tasks with 1,151 judgments made by 1,055 participants. Of the 1,055 participants, 68 participated in more than one task. Compared to the database by Kaufmann and Athanasou [5] our database is slightly different due to improved analysis tools and additional studies (e.g., [51]).

Table 1. Study characteristics ordered according to decision domain and expertise.

Table 2. Characteristics of studies in the ‘miscellaneous’ domain ordered by expertise.

The Psychometric Meta-Analytical Approach

Several studies contributed to the eventual development of various meta-analytical approaches in the 1970s (e.g., [15], [52], [53]). For example, Eysenck [52] concluded from a narrative review that psychotherapy was ineffective, prompting a response from the experienced therapist Glass, who statistically compared the outcomes of psychotherapy and refuted Eysenck's conclusion ([54], see also [55]). Since then, researchers have used meta-analysis to systematically summarize the outcomes of multiple studies to increase the generalizability of results (e.g., regarding the effectiveness of psychological, pedagogical and behavioral interventions [56]; regarding predictors of student achievement [57]).

The meta-analytical approach has undergone continuous development, resulting in a number of approaches such as the Hedges-Olkin [58], the Rosenthal-Rubin [59] and the Hunter-Schmidt [16] approach (for an overview, see [60], [61]; for a critical discussion, [62]). Field [63], [64] evaluated different traditional meta-analytical approaches and favored the random-effects model of the Hunter-Schmidt approach. The random-effects model takes into account that the studies included in a meta-analysis are drawn from a greater ‘population’ of studies. Hence, differences in effect sizes across studies arise from sources within as well as between studies. The traditional, ‘bare-bones’ Hunter-Schmidt approach (as evaluated by Field) corrects for sampling error: Since meta-analysis is generally based on many studies with different sample sizes, sampling error is inherent in the data (larger for smaller sample sizes). The Hunter-Schmidt approach has since been additionally modified to correct for up to 11 other methodological artifacts (‘psychometric Hunter-Schmidt approach’; [16], p. 35). Since multiple methodological artifacts threaten the estimations of the LME parameters, we argue that the psychometric Hunter-Schmidt approach is the most appropriate for the current study, since it is the only meta-analytical approach that corrects for multiple differences in study design.
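The bare-bones step described above can be sketched as follows. The sketch uses the commonly cited Hunter-Schmidt formulas for the sample-size-weighted mean correlation, the weighted observed variance, and the expected sampling-error variance; whether the original analysis used exactly these variants is an assumption. The function name and the three example studies are hypothetical.

```python
import numpy as np

def bare_bones(r, n):
    """Sample-size-weighted mean r and variance decomposition (sketch)."""
    r = np.asarray(r, dtype=float)
    n = np.asarray(n, dtype=float)
    r_bar = np.sum(n * r) / np.sum(n)                   # weighted mean correlation
    var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)  # weighted observed variance
    var_err = (1 - r_bar ** 2) ** 2 / (n.mean() - 1)    # expected sampling-error variance
    var_res = max(var_obs - var_err, 0.0)               # variance left after correction
    return r_bar, var_obs, var_err, var_res

# Hypothetical correlations and numbers of judges for three judgment tasks
r_bar, var_obs, var_err, var_res = bare_bones([0.20, 0.45, 0.60], [20, 50, 30])
print(round(r_bar, 3), round(var_obs, 4), round(var_err, 4), round(var_res, 4))
```

Truncating the residual variance at zero is a common convention when sampling error alone exceeds the observed variance, as happens with small hypothetical samples like these.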

With regards to potential bias due to measurement artifacts, the knowledge component (G) is attenuated by the unreliability of the estimate of the judge, the unreliability of the criterion and the restriction of range in both. Therefore, the bias inherent in estimates of the knowledge component (G) can be corrected when S (restriction or enhancement of range), the reliability of the judge (see rttRs) and the reliability of the criterion (see rttRe) are known. The knowledge component can thus be described as in Equation 3:

$$G = G_{true}\,\sqrt{r_{tt R_s}}\,\sqrt{r_{tt R_e}}\,S \qquad (3)$$

Neglecting the nonlinear knowledge term (C) in Equation 1 and considering it as an error term e, substituting Equation 3 into Equation 1 results in Equation 4:

$$r_a = G_{true}\,\sqrt{r_{tt R_s}}\,\sqrt{r_{tt R_e}}\,S\,R_s R_e + e \qquad (4)$$

Therefore the unbiased estimate of the knowledge component (G) corrected for attenuation and restriction of range would be Equation 5:

$$G_{true} = \frac{r_a - e}{\sqrt{r_{tt R_s}}\,\sqrt{r_{tt R_e}}\,S\,R_s R_e} \qquad (5)$$

In Equation 5, the psychometric Hunter-Schmidt approach incorporates the estimation of the population parameter according to Wittmann [65], [66]. This equation serves as an illustration of how to psychometrically meta-analyze the LME in our study. The psychometrically corrected component (e.g., G) is called “true” and is an approximation of the value without any study design artifact. The “true” value is, for example, the actual judgmental achievement or the knowledge component without any artifacts introduced by the study design. Put simply, Equation 5 can be divided into three parts.

First, in the numerator of the fraction, the term e represents sampling error. Meta-analysis carried out for the purpose of population estimation is often based on different studies including different numbers of participants, which results in sampling errors. Such a sampling error is larger for smaller sample sizes and can be positive or negative. It should be noted that traditional bare-bones meta-analysis corrects only for sampling error, although several additional study design artifacts (as introduced) are known. Due to the bias related to sampling error, there is a risk of over- or underestimating the particular component.

Second, the first part of the denominator describes the psychometric concepts of reliability associated with judges and tasks. Failure to correct for the reliability of tasks or judges introduces two dangers that may result in an underestimation of the component. In addition, failure to correct for selection problems, known as either restriction or enhancement of range, might lead to under- or overestimation of, for example, judgmental achievement, as may occur with an extremely easy or difficult task.

Third, the second part of the denominator, the term RsRe, can be traced back to Brunswik's research and the LME (see Equation 2) and represents construct reliability. Wittmann [66], [67] further extended Hunter-Schmidt's psychometric approach by adding the symmetry concept. Judgmental achievement increases if both the judgment and the criterion are measured at the same level of aggregation (i.e., they are ‘symmetrical’). For example, if a physician is asked to judge whether cancer is present and the criterion is whether a cancer tumor was detected, then the judgment is not symmetrical, as cancer can exist without a detectable tumor. In contrast, if a physician is asked to judge whether there is cancer only when a cancer tumor has been detected, then the judgment and the criterion are said to be symmetrical. We did not control for symmetry in the current analysis. Neglecting symmetry may lead to two additional risks of potentially underestimating the components.

To summarize, due to the potential for different methodological artifacts, there is a tendency to over- or underestimate the “true value” of each component as illustrated by Equation 5. Based on Equation 5, the odds of underestimating the component with a bare-bones meta-analysis are 6 (sampling error, reliability of tasks or judges, selection effects, symmetry of tasks, judges) to 2 (sampling error, selection effects) as compared with estimates generated from a psychometric meta-analysis.

In our psychometric Hunter-Schmidt meta-analysis, we weighted each judgment task by the number of judges to correct for sampling error. To correct for measurement error with regards to both the criterion and human judgment, we used an artifact distribution compatible with the Hunter-Schmidt approach ([16], p. 137). To correct for measurement error on the judgment side within medicine and business, we used the studies' reliability values (e.g., [36]) or, otherwise, the retest reliabilities provided by Ashton [68], who reported retest reliability values across and within different domains. For example, when a study within the medical domain did not report measurement reliability, we used the mean reported test-retest reliability of .73 to correct for measurement error. No area-specific retest-reliability values were available for correcting the measurement error of judges in the areas of education, psychology or miscellaneous professions. We therefore used the Reliability Generalization approach [69] to correct the measurement error of judges in these areas. In line with Reliability Generalization theory, we assumed a retest-reliability value of .90 for our measurement error corrections, as an upper bound of the reliability distributions, given that the averaged retest-reliability of professional judgments across domains is .78 (see [68]). Hence, our assumed measurement error may have led to an underestimation of all components, as we assumed a smaller measurement error relative to the average reported by Ashton [68]. With regards to the measurement reliability values on the ecological side of the lens model (i.e., the criterion against which human judgment is compared), we distinguished between three types of criteria. First, for subjective judgments (e.g., a physician's judgment; see [25]), we used the same approach as with the judgment side of the model as previously described. Second, for test criteria (e.g., MMPI), we used the test-specific retest-reliability value as available in the literature. Third, we did not correct objective criteria (e.g., an angiography; see [24]), as we assumed that there is only minimal measurement error with objective criteria. Finally, we considered further artifacts, such as the dichotomization of a continuous variable (see [38]).
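To make the effect of a reliability correction concrete, the sketch below applies the classical correction for attenuation to a single observed correlation. This is a simplification: the analysis described above works with artifact distributions rather than correcting each correlation individually, and the function name and the observed correlation of .40 are hypothetical. The reliability of .73 is the medical-domain retest value mentioned in the text, and a criterion reliability of 1.0 mirrors the assumption of no measurement error for objective criteria.

```python
from math import sqrt

def correct_for_attenuation(r_obs, rel_judgment, rel_criterion=1.0):
    """Classical disattenuation: divide r by the square root of the reliabilities."""
    if not (0 < rel_judgment <= 1 and 0 < rel_criterion <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_obs / sqrt(rel_judgment * rel_criterion)

# Hypothetical observed correlation, corrected with the retest reliability of .73
r_corrected = correct_for_attenuation(0.40, rel_judgment=0.73)
print(round(r_corrected, 3))
```

Because the divisor is at most 1, the corrected value is never smaller than the observed one, which is consistent with the observation that assuming a high reliability such as .90 yields a conservative correction.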

Forest plots (see Figure 3) provide an overview of the results of the included studies and psychometrically corrected confidence intervals (see [16], p. 207). We also report credibility intervals as an indication of the existence of moderators of judgmental achievement. In contrast to confidence intervals, credibility intervals are calculated with standard deviations after removing artifacts. If the credibility interval includes zero or is sufficiently large, then there is a higher potential for moderator variables relative to when the credibility interval is small and excludes zero. Hunter and Schmidt [16] also recommend a simple 75% rule to detect moderator variables, which is typically more accurate than significance tests used to assess homogeneity. According to this rule, moderator variables are suspected if artifacts account for less than 75% of the observed (uncorrected) variance, that is, if more than 25% of the observed variance remains after correcting for artifacts. It should be noted that the variance remaining after artifact correction represents the upper boundary of any potential moderator effects, as it is impossible to correct for all potential artifacts. We emphasize that we do not apply Fisher-Z transformations, in line with the recommendations of Hunter and Schmidt [16].
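The two moderator diagnostics can be sketched as follows. The sketch assumes the usual reading of the Hunter-Schmidt 75% rule (moderators are suspected when artifacts explain less than 75% of the observed variance) and an 80% credibility interval (z = 1.28); the interval width actually used in the analysis is not stated here, so that value is an assumption. The function name and the input values are hypothetical.

```python
from math import sqrt

def moderator_check(var_obs, var_artifact, rho_bar, z=1.28):
    """Return (% variance due to artifacts, credibility interval, moderators suspected)."""
    if var_obs <= 0:
        raise ValueError("observed variance must be positive")
    pct_artifact = 100 * var_artifact / var_obs
    sd_rho = sqrt(max(var_obs - var_artifact, 0.0))  # SD after removing artifacts
    interval = (rho_bar - z * sd_rho, rho_bar + z * sd_rho)
    suspected = pct_artifact < 75                    # the 75% rule
    return pct_artifact, interval, suspected

# Hypothetical variances and mean corrected correlation
pct, interval, suspected = moderator_check(var_obs=0.04, var_artifact=0.01, rho_bar=0.45)
print(round(pct, 1), tuple(round(x, 3) for x in interval), suspected)
```

In this made-up case artifacts explain only a quarter of the observed variance, so the rule flags possible moderators even though the credibility interval excludes zero.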

Figure 3. Forest plots of judgmental achievement and the underlying components.

Finally, we apply the trim-and-fill method introduced by Duval and Tweedie [70] to estimate a possible publication bias in order to check the robustness of our estimations. By applying the trim-and-fill method, we estimated the effect sizes of potentially missing studies and included them in a further psychometric meta-analysis corrected for publication bias. In the following, we refer to this approach, which to our knowledge is introduced to the literature here for the first time, as the psychometric trim-and-fill method. We use the retest-reliability values to correct for judgment reliability, as in the case of education and psychology, and we assume no measurement error on the criterion side.


Tables 3 to 6 and Figure 3 display the results of the meta-analyses. The results of the bare-bones meta-analysis for each research area are displayed first, followed by the results of the psychometric meta-analysis. Whenever the psychometric trim-and-fill method did not match the psychometric results with regards to the indication of moderators, the suggested values are reported as publication bias in the tables.

Table 3. Comparison of estimations of judgmental achievement (ra) with different meta-analytical approaches ordered by domain and experience level.

Table 4. Comparison of estimations of the linear knowledge component (G) with different meta-analytical approaches ordered by domain and experience level.

Table 5. Comparison of estimations of the consistency component (Rs) with different meta-analytical approaches ordered by domain and experience level.

Table 6. Comparison of estimations of the task-predictability component (Re) with different meta-analytical approaches ordered by domain and experience level.

Judgmental Achievement

Table 3 and Figure 3 show the meta-analytic results of judgmental achievement. Correcting only for sampling error (bare-bones approach) resulted in an estimated judgmental achievement of .39. Correcting for additional artifacts with the psychometric approach resulted in an increased estimate of .45. That is, across all included lens model studies, human judgment correlated .45 with the given criterion.

Domain and Expertise as Moderators

The relatively small reduction in variability resulting from the psychometric approach relative to the bare bones approach suggested the existence of moderator variables under the assumption of no measurement error on the criterion side for objective criteria. We therefore re-ran the analyses within each domain (medicine, business, education, psychology, miscellaneous), for experts versus novices, and for expertise within domain (e.g., expert teachers versus novice teachers). These subsequent analyses revealed that judgmental achievement depended on decision domain. Specifically, judgmental achievement was lowest in psychology (ra = .22) and higher in education (ra = .39), medicine (ra = .40), miscellaneous professional domains (ra = .44), and highest in business (ra = .50). The results from the psychometric meta-analysis confirmed this pattern of results.

Against our expectation, results indicated that students reached a slightly higher judgmental achievement than experts. The 75% rule and the credibility intervals indicated the existence of moderator variables among students' judgmental achievement. We therefore reran our analysis, separating expertise within domains. This analysis revealed that the potential for moderator variables among experts (once again as indicated by the 75% rule as well as by the credibility intervals) did not extend across all domains. Among students, in contrast, the analysis indicated the existence of moderator variables for business science students only.

Inspection of the scatter plots of students' judgmental achievement within the business domain indicated that Wright's study [32] had low values of judgmental achievement and might have influenced our results. Excluding this study from the sample increased estimated judgmental achievement (ra = .97, varcorr = .00), but still indicated the presence of moderator variables according to the 75% rule (30.51%).

Finally, the application of the psychometric trim-and-fill method generally confirmed our results. However, estimates of judgmental achievement among business experts dropped to a low value (no publication bias was indicated in studies using business students). Likewise, experts' judgments in other research domains decreased from .68 to .31. The application of the psychometric trim-and-fill method to judgmental achievement in the field of education indicated the existence of moderator variables. The potential for moderator variables according to the credibility intervals and the 75% rule decreased after we separated the analysis by experience level in the education domain. We therefore assume that experience level is a moderator variable within education. The judgmental achievement values for students in other domains remained stable after correcting for potential publication bias.

Components of Judgmental Achievement

Tables 4 to 6 and Figure 3 present the estimates of the LME parameters. As seen in Table 4, our results indicated high values of the knowledge component (G) in nearly every domain/experience-level except among experts in psychology. In addition, the results from the psychometric trim-and-fill method suggested a lower value for students' knowledge components. Hence, it seems that our analysis overestimated the knowledge component (G) among students, although the knowledge component for students was lower relative to experts.

Table 5 displays estimates of the consistency component (Rs). The results from the bare-bones and psychometric meta-analyses both suggest high values and generally indicate no moderator variables for all analyses across domains and expertise levels. All of the estimated consistency components (Rs) remain high when using the psychometric trim-and-fill method. In addition, the results from the psychometric trim-and-fill method indicated the existence of moderators within education science, among experts in the miscellaneous domain, and aggregated across all domains.

Finally, Table 6 presents estimates for the task predictability component (Re). All values were above .68 in each and every analysis across domains and experience-level. The 75% rule indicated moderator variables across all domains, mainly based on students' task predictabilities in business science and the miscellaneous domain. In addition, the psychometric trim-and-fill method suggested that task predictabilities were overestimated amongst psychology students, as the 75% rule suggested the existence of moderators.

The Success of Bootstrapping Judges with a Linear Model

Table 7 compares the success of bootstrapping judges with a linear judgment model (see Equation 2) based on corrected versus uncorrected estimates of LME parameters. Failure to correct the component estimates for various artifacts clearly led to underestimations of bootstrapping success. Indeed, the current results with corrected parameters indicate that the linear judgment models are actually more successful than previous studies have suggested (see [2], [6]). Hence, using corrected estimations of the LME components (e.g., G, Re) has practical consequences for the success of bootstrapping with linear judgment models. We therefore argue that corrected parameter estimates should be used to evaluate the success of bootstrapping.
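The comparison between a judge and the linear model of that judge can be made concrete with a short sketch. It assumes the standard form of the lens model equation, ra = G·Re·Rs + C·sqrt(1 − Re²)·sqrt(1 − Rs²) [11], and takes the validity of the bootstrapping model to be the product G·Re, as in the Table 7 caption. The function names and the component values in the usage example are hypothetical; they are not estimates from Tables 4 to 7.

```python
import math


def lme_achievement(G, Re, Rs, C=0.0):
    """Judgmental achievement ra according to the lens model equation."""
    linear_part = G * Re * Rs
    # Contribution of the unmodeled (residual) parts of judge and environment.
    residual_part = C * math.sqrt(1 - Re ** 2) * math.sqrt(1 - Rs ** 2)
    return linear_part + residual_part


def bootstrapping_gain(G, Re, Rs, C=0.0):
    """Return (model validity G*Re, human achievement ra, difference)."""
    model_validity = G * Re
    human = lme_achievement(G, Re, Rs, C)
    return model_validity, human, model_validity - human


# Hypothetical component values, with no residual knowledge (C = 0).
model, human, gain = bootstrapping_gain(G=0.8, Re=0.7, Rs=0.8)
print(f"model: {model:.3f}, judge: {human:.3f}, gain: {gain:.3f}")
```

With C = 0 the model outperforms the judge whenever the judge is not perfectly consistent (Rs < 1). Underestimated values of G or Re therefore translate directly into an underestimated model validity.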

Table 7. Comparison of the success of bootstrapping judges with a linear judgment model (GRe) based on different meta-analytical approaches (bare-bones vs. psychometric approach).


The major finding of our study is that bare-bones meta-analysis (e.g., [5], [6], see one-trial category) clearly underestimates true judgmental achievement values relative to psychometric meta-analysis, which more appropriately corrects for study design artifacts. Consequently, we argue that a psychometric meta-analysis is needed to evaluate judgment accuracy more accurately and that it can help researchers to detect moderators more efficiently. So far, previous meta-analyses of lens model studies have neglected the need to correct for multiple artifacts, although even minor increases in judgmental achievement may have a high practical impact at the individual level, for example, in life or death decisions in the medical domain. Our results indicate that failure to correct for artifacts (as with a bare-bones meta-analysis) leads to underestimations of all LME parameters across and within expertise domains, and that the potential for moderator variables is generally overestimated. Parameter estimates from psychometric meta-analysis can be used to improve linear judgment models and hence bootstrapping, especially in areas where the price of false decision-making is high.

With regards to specific moderators of judgmental achievement, the present study confirms the pattern previously found for comparisons between different domains [5], namely, that judgmental achievement varies greatly across the medical, educational, psychological, business and other professional domains. In line with the meta-analysis of Aegisdottir et al. (p. 368) [1], we found low judgmental achievement in psychological science, for example, in the prediction of violence. Our analysis revealed that such low judgmental achievement within psychology may be explained by a moderate knowledge component. Hence, the question arises whether judgmental achievement in psychology can be improved by increasing the knowledge component, meaning that psychologists would need to expand their relevant knowledge for linear information integration. The success of psychometrically-corrected linear judgment models was higher than the low human judgmental achievement in psychology. Therefore, it might be particularly worthwhile to bootstrap judges within this domain (for further information, see [71]).

Against our expectation, the results of the meta-analyses suggest that experts do not make much better judgments than non-experts at the aggregated level. However, the effect of expertise appears to depend on domain. Specifically, within the business and psychology domains, students had higher judgmental achievement than experts. This surprising result may imply situations of learning and feedback (see also [22]). That is, higher judgmental achievement among experts relative to students may indicate higher feedback and learning in the respective domain. It seems possible to improve judgmental achievement through feedback and learning. There is only one study [47], however, that directly compares experts and students in four different tasks. Our results and conclusions regarding this point should therefore be taken with caution.

An innovative aspect of the current study was that we estimated publication bias using a psychometric trim-and-fill method, potentially leading to better estimates. To the best of our knowledge, calculation of publication bias has previously only been applied within bare-bones meta-analyses (see [72]), and we are not aware of any previous psychometric meta-analysis that has corrected for publication bias in this way. We recommend that researchers check the robustness of the results of future psychometric meta-analyses by using the psychometric trim-and-fill method described in this paper. We caution, however, that the psychometric trim-and-fill method used in the current study may need improvement and replication, because the underlying data were heterogeneous, which can potentially be problematic. Indeed, Rothstein [73] asserted that disentangling the effects of publication bias from other sources of heterogeneity can be difficult.

As is common in meta-analytical research, the studies included in the analyses did not always report all of the data needed to calculate “true” judgmental achievement values (e.g., measurement reliability). Indeed, researchers interested in conducting psychometric meta-analyses often face the problem of missing data. As a possible solution based on reliability generalization [69], we suggest assuming a reliability of rr = .9 where measurement error is not reported in order to check the robustness of the results. We also emphatically recommend that future researchers thoroughly and consistently report all relevant information on study methods and results (e.g., reliability values, dichotomizations) in order to enhance the accuracy, and hence the usefulness, of further meta-analyses. We would also like to encourage researchers to report more idiographic data in lens model studies (see [21]). For instance, multi-level analysis (see [74]) could be applied to gain further knowledge about judges' strategies within and between tasks.
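A robustness check of this kind can be sketched as follows. The sketch assumes the classical correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities, and it assumes an error-free criterion (reliability of 1.0) as for objective criteria. The function names, the set of alternative reliabilities, and the observed value in the usage example are hypothetical.

```python
import math


def disattenuate(r_obs, rel_judgment=0.9, rel_criterion=1.0):
    """Correct an observed correlation for unreliability in both variables."""
    if not (0 < rel_judgment <= 1 and 0 < rel_criterion <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_obs / math.sqrt(rel_judgment * rel_criterion)


def robustness_check(r_obs, reliabilities=(1.0, 0.9, 0.8)):
    """Corrected correlation under several assumed judgment reliabilities."""
    return {rel: disattenuate(r_obs, rel_judgment=rel) for rel in reliabilities}


# Hypothetical observed achievement for a study that reports no reliability.
for rel, r_corr in robustness_check(0.45).items():
    print(f"assumed reliability {rel:.1f}: corrected r = {r_corr:.3f}")
```

If the substantive conclusions hold across the range of assumed reliabilities, the missing reliability information is unlikely to drive the result.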

In the current study, we corrected for a number of methodological artifacts (sampling error, measurement error, and dichotomization). Importantly, there may well be additional artifacts for which we did not correct. On this note, we heartily agree with Hunter and Schmidt [16] that “all quantitative estimates are approximations. Even if these estimates are quite accurate, it is always desirable to make them more accurate, if possible” (p. 168). For instance, Wittmann [66], [67] further extended Hunter and Schmidt's psychometric approach by adding the symmetry concept. We did not control for symmetry in the current analysis. Hence, we may have underestimated overall judgmental achievement, although our analyses rarely indicated any moderator variables, suggesting that there is not much variance left for further artifact correction.

In the current study, we focused on the evaluation of the success of bootstrapping with only linear judgment models. However, we did not consider experience within domains in detail. Further analyses are needed to shed light on this topic (see [75]).

As linear judgment models are often criticized for lack of user friendliness, we also see our analysis as an inspiration for the development of new judgment models (see [76]). The true power of psychometrically corrected linear judgment models should urgently be evaluated against new kinds of judgment models.

In sum, our study demonstrates that psychometric meta-analysis is useful for evaluating judgmental achievement and for constructing better linear judgment models for bootstrapping. This first psychometric meta-analysis of lens model studies confirms and extends previous results from bare-bones meta-analysis: Judgmental achievement clearly varies across domains. Our analysis also extended previous research on the potential moderating role of expertise within and between decision domains. The current analysis revealed that failure to correct for methodological artifacts can lead to underestimations of judgmental achievement and overestimations of heterogeneity between studies. Consequently, the success of bootstrapping with linear judgment models is also underestimated if LME parameters are not corrected for methodological artifacts. We therefore recommend that future researchers follow a psychometric approach in order to arrive at less biased estimations and more successful linear judgment models. If the relevant data for psychometric analyses (e.g., data on measurement error) are not immediately available, researchers can conduct robustness analysis with estimated values.

Supporting Information

Checklist S1.

PRISMA Checklist for systematic review and meta-analysis.



We are grateful to Lars Sjödahl, James A. Athanasou, Franz Eberle and Stephan Schumann as well as the Graduate School of Economics & Social Sciences at the University of Mannheim (Germany) and the Brunswik Society. We also thank two anonymous reviewers for their comments, which helped to improve this paper. Finally, we gratefully acknowledge support for publication fees from the University of Zurich.

Author Contributions

Analyzed the data: EK UR WWW. Contributed reagents/materials/analysis tools: EK UR WWW. Wrote the paper: EK UR WWW.


  1. Aegisdottir S, White MJ, Spengler PM, Maugherman AS, Anderson LA, et al. (2006) The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. Couns Psychol 34: 341–382.
  2. Camerer C (1981) General conditions for the success of bootstrapping models. Organ Behav Hum Perform 27: 411–422.
  3. Grove WM, Zald DH, Lebow BS, Snitz BE, Nelson C (2000) Clinical versus mechanical prediction: A meta-analysis. Psychol Assess 12: 19–30.
  4. Brunswik E (1952) The conceptual framework of psychology. International encyclopedia of unified science. Chicago, IL: University of Chicago Press.
  5. Kaufmann E, Athanasou JA (2009) A meta-analysis of judgment achievement defined by the lens model equation. Swiss J Psychol 68: 99–112.
  6. Karelaia N, Hogarth R (2008) Determinants of linear judgment: A meta-analysis of lens studies. Psychol Bull 134: 404–426.
  7. Jenny MA, Pachur T, Williams SL, Becker E, Margraf J (2013) Simple rules for detecting depression. J Appl Res Mem Cogn 2: 149–157.
  8. Einhorn HJ (1974) Cue definition and residual judgment. Organ Behav Hum Perform 12: 30–49.
  9. Hammond KR, Hursch CJ, Todd FJ (1964) Analyzing the components of clinical inference. Psychol Rev 71: 438–456.
  10. Hursch CJ, Hammond KR, Hursch JL (1964) Some methodological considerations in multiple-cue probability learning studies. Psychol Rev 71: 42–60.
  11. Tucker LR (1964) A suggested alternative formulation in the developments by Hursch, Hammond and Hursch and by Hammond, Hursch and Todd. Psychol Rev 71: 528–530.
  12. Hammond KR, Stewart TR (2001) The essential Brunswik: Beginnings, explications, applications. Oxford, UK: University Press.
  13. Kaufmann E (2010) Flesh on the bones: A critical meta-analytical perspective of achievement lens studies. (Doctoral dissertation, MADOC: University of Mannheim). Available: Accessed 22 November 2013.
  14. Goldberg LR (1976) Man versus model of man: Just how conflicting is that evidence? Organ Behav Hum Perform 16: 13–22.
  15. Meehl P (1954) Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis, MN: University of Minnesota Press.
  16. Hunter JE, Schmidt FL (2004) Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
  17. Dawes RM, Faust D, Meehl PE (1989) Clinical versus actuarial judgment. Science 243: 1668–1674.
  18. Shanteau J (2002) Domain differences in expertise. Working paper. Kansas State University, KS: Manhattan.
  19. Armstrong JS (2001) Judgmental bootstrapping: Inferring experts' rules for forecasting. In: Armstrong JS, editor. Principles of forecasting. Philadelphia, Pennsylvania, USA: Springer. p. 171.
  20. Cooksey RW, Freebody P, Wyatt-Smith C (2007) Assessment as judgment-in-context: Analysing how teachers evaluate students' writing. Educ Res Eval 13: 401–434.
  21. Kaufmann E, Sjödahl L, Mutz R (2007) The idiographic approach in social judgment theory: A review of components of the lens model equation components. International Journal of Idiographic Science 2.
  22. Hogarth RM (2006) Is confidence in decisions related to feedback? Evidence from random samples of real-world behavior. In: Fiedler K, Juslin P, editors. Information sampling and adaptive cognition. Cambridge, UK: Cambridge University Press. pp. 456–484.
  23. Lichtenstein S, Fischhoff B, Phillips DP (1981) Calibration of probabilities: The state of the art to 1980. Technical Report PTR-1092-81-6. Available: Accessed 22 November 2013.
  24. Nystedt L, Magnusson D (1975) Integration of information in a clinical judgment task, an empirical comparison of six models. Percept Mot Skills 40: 343–356.
  25. LaDuca A, Engel JD, Chovan JD (1988) An exploratory study of physicians' clinical judgment: An application of social judgment theory. Eval Health Prof 11: 178–200.
  26. Smith L, Gilhooly K, Walker A (2003) Factors influencing prescribing decisions in the treatment of depression: A social judgment theory approach. Appl Cogn Psychol 17: 51–63.
  27. Speroff T, Connors AF, Dawson NV (1989) Lens model analysis of hemodynamic status in the critically ill. Med Decis Making 9: 243–261.
  28. Ashton AH (1982) An empirical study of budget-related predictions of corporate executives. Journal of Accounting Research 20: 440–449.
  29. Roose JE, Doherty ME (1976) Judgment theory applied to the selection of life insurance salesmen. Organ Behav Hum Perform 16: 231–249.
  30. Kim CN, Chung HM, Paradice DB (1997) Inductive modeling of expert decision making in loan evaluation: A decision strategy perspective. Decis Support Syst 21: 83–98.
  31. Mear R, Firth M (1987) Assessing the accuracy of financial analyst security return predictions. Accounting Organizations and Society 12: 331–340.
  32. Wright WF (1979) Properties of judgment models in a financial setting. Organ Behav Hum Perform 23: 73–85.
  33. Harvey N, Harries C (2004) Effects of judges' forecasting on their later combination for forecasts for the same outcomes. Int J Forecast 20: 391–409.
  34. Singh H (1990) Relative evaluation of subjective and objective measures of expectations formation. Q Rev Econ Bus 30: 64–74.
  35. Cooksey RW, Freebody P, Davidson GR (1986) Teachers' predictions of children's early reading achievement: An application of social judgment theory. Am Educ Res J 23: 41–64.
  36. Wiggins N, Kohen ES (1971) Man versus model of man revisited: The forecasting of graduate school success. J Pers Soc Psychol 19: 100–106.
  37. Athanasou JA, Cooksey RW (2001) Judgment of factors influencing interest: An Australian study. Journal of Vocational Education Research 26: 1–13.
  38. Szucko JJ, Kleinmuntz B (1981) Statistical versus clinical lie detection. Am Psychol 36: 488–496.
  39. Cooper RP, Werner PD (1990) Predicting violence in newly admitted inmates: A lens model analysis of staff decision making. Crim Justice Behav 17: 431–447.
  40. Werner PD, Rose TL, Murdach AD, Yesavage JA (1989) Social workers' decision making about the violent client. Soc Work Res Abstr 25: 17–20.
  41. Werner PD, Rose TL, Yesavage JA (1983) Reliability, accuracy, and decision-making strategy in clinical predictions of imminent dangerousness. J Consult Clin Psychol 51: 815–825.
  42. Gorman CD, Clover WH, Doherty ME (1978) Can we learn anything about interviewing real people from “interviews” of paper people? Two studies of the external validity of a paradigm. Organ Behav Hum Perform 22: 165–192.
  43. Reynolds DAJ, Gifford R (2001) The sounds and sights of intelligence: A lens model channel analysis. Pers Soc Psychol Bull 27: 187–200.
  44. Bernieri FJ, Gillis JS, Davis JM, Grahe JE (1996) Dyad rapport and the accuracy of its judgment across situations: A lens model analysis. J Pers Soc Psychol 71: 110–129.
  45. Lehman HA (1992) The prediction of violence by lay persons: Decision making by former psychiatric inpatients. Unpublished doctoral dissertation, The California School of Professional Psychology Berkeley/Alameda.
  46. Stewart TR (1990) Notes and correspondence: A decomposition of the correlation coefficient and its use in analyzing forecasting skill. Weather and Forecasting 5: 661–666.
  47. Stewart TR, Roebber PJ, Bosart LF (1997) The importance of the task in analyzing expert judgment. Organ Behav Hum Decis Process 69: 205–219.
  48. Steinmann DO, Doherty ME (1972) A lens model analysis of a bookbag and poker chip experiment: A methodological note. Organ Behav Hum Perform 8: 450–455.
  49. MacGregor D, Slovic P (1986) Graphic representation of judgmental information. Int J Hum Comput Interact 2: 179–200.
  50. McClellan PG, Bernstein ICH, Garbin CP (1984) What makes the Mueller a liar: A multiple-cue approach. Percept Psychophys 36: 234–244.
  51. Trailer JW, Morgan JF (2004) Making “good” decisions: What intuitive physics reveals about the failure of intuition. The Journal of American Academy of Business 3: 42–48.
  52. Eysenck HJ (1952) The effects of psychotherapy: An evaluation. J Consult Psychol 16: 319–324.
  53. Pearson K (1904) Report on certain enteric fever inoculation statistics. Br Med J 3: 1243–1246.
  54. Smith ML, Glass GV (1977) Meta-analysis of psychotherapy outcome studies. Am Psychol 32: 752–760.
  55. Wittmann WW, Matt GE (1986) Meta-Analyse als Integration von Forschungsergebnissen am Beispiel deutschsprachiger Arbeiten zur Effektivität von Psychotherapie [Meta-analysis as an integration of research exemplified for German studies on the effect of psychotherapy]. Psychol Rundsch 27: 20–40.
  56. Lipsey MW, Wilson DB (1993) The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. Am Psychol 48: 1181–1209.
  57. Hattie J (2009) Visible learning: A synthesis of over 800 meta-analyses relating to achievement. London, New York: Routledge.
  58. Hedges LV, Olkin I (1985) Statistical methods for meta-analysis. Orlando, FL: Academic Press.
  59. Rosenthal R (1991) Meta-analytic procedures for social research. Newbury Park, CA: Sage.
  60. Bangert-Drowns RL (1986) Review of developments in meta-analytic method. Psychol Bull 99: 388–399.
  61. Rosenthal R, DiMatteo MR (2001) Meta-analysis: Recent developments in quantitative methods for literature reviews. Annu Rev Psychol 52: 59–82.
  62. Ioannidis JPA (2010) Meta-research: The art of getting it wrong. Res Synth Methods 1: 169–184.
  63. Field AP (2001) Meta-analysis of correlation coefficients: A Monte Carlo comparison of fixed- and random-effects methods. Psychol Methods 6: 161–180.
  64. Field AP (2005) Is the meta-analysis of correlations accurate when population correlations vary? Psychol Methods 10: 444–467.
  65. Wittmann WW (1988) Multivariate reliability theory. Principles of symmetry and successful validation strategies. In: Nesselroade JR, Cattell RB, editors. Handbook of multivariate experimental psychology. New York: Plenum Press. pp. 505–560.
  66. Wittmann WW (2009) Evaluationsmodelle. In: Holling H, editor. Enzyklopädie der Psychologie. Themenbereich B Methodologie und Methoden. Serie IV Evaluation - Band 1. Grundlagen und statistische Methoden der Evaluationsforschung. Göttingen: Hogrefe. pp. 59–98.
  67. Wittmann WW (1985) Evaluationsforschung: Aufgaben, Probleme und Anwendungen. [Evaluation research: Tasks, problems and applications]. Berlin, Germany: Springer-Verlag.
  68. Ashton RH (2000) A review and analysis of research on the test-retest reliability of professional judgment. J Behav Decis Mak 13: 277–294.
  69. Vacha-Haase T (1998) Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educ Psychol Meas 58: 6–20.
  70. Duval S, Tweedie R (2000) Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56: 455–463.
  71. Kaufmann E, Wittmann WW (2009) Do we underestimate the validity of linear expert models? Poster presented to the Society for Judgment and Decision Making (SJDM), Boston (MA), November, 22.
  72. Renkewitz F, Fuchs HM, Fiedler S (2011) Is there an evidence of publication biases in JDM? Judgm Decis Mak 6: 870–881.
  73. Rothstein HR (2008) Publication bias as a threat to the validity of meta-analytic results. J Exp Criminol 4: 61–81.
  74. Mutz R, Seeling U (2010) A nomothetic version of the Brunswikian lens model - A variable- and person-oriented approach. Z Psychol 218: 175–184.
  75. Kaufmann E, Wittmann WW (2013) The success of bootstrapping models under the lens. Working paper. University of Zurich, University of Mannheim.
  76. Herzog SM, Hertwig R (2009) The wisdom of many in one mind: Improving individual judgments with dialectical bootstrapping. Psychol Sci 20: 231–237.