Abstract
Background
It has been proposed that the school origin of items for cross-institutional Progress Tests (PTs) may introduce a bias in favour of students from the same school, posing a potential threat to the validity and reliability of PT results and cross-institutional comparisons. The aim of this study was to examine whether origin bias is present in a Brazilian cross-institutional PT examination.
Methods
This study conducted a cross-sectional analysis of seven schools affiliated with the oldest PT consortium in Brazil, using a pooled analysis of differences in students’ performance on self and non-self items. A proportional meta-analysis of the items’ rate differences and confidence intervals with random effects was performed, yielding an odds ratio (OR) for self and for non-self items. Differences between the two groups of items were assessed by examining whether the ORs’ 95% confidence intervals overlapped.
Results
The findings indicated no discernible differences in psychometric indices based on the school responsible for item creation. Three schools consistently demonstrated superior performance on items authored by their own faculty; however, they also excelled on non-self items. Furthermore, an overlap in the 95% confidence intervals for self and non-self items was observed across all seven schools.
Citation: Hamamoto Filho PT, Hafner MdLMB, Ribeiro ZMT, Lima ARdA, Diehl LA, Costa NT, et al. (2025) Absence of item origin bias on a Brazilian interinstitutional Progress Test examination: A pooled analysis of items approach. PLoS One 20(6): e0325734. https://doi.org/10.1371/journal.pone.0325734
Editor: Mohammad Mofatteh, Queen's University Belfast, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: January 23, 2025; Accepted: May 16, 2025; Published: June 9, 2025
Copyright: © 2025 Hamamoto Filho et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: Pedro Tadao Hamamoto Filho and Angélica Maria Bicudo have received an award from the National Board of Medical Examiners (PA, PA, USA). GRANT_NUMBER: Proposal LAG5-2020. Pedro Tadao Hamamoto Filho is supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). GRANT_NUMBER: 313047/2023-5. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The Progress Test (PT) stands as a widely adopted assessment tool used by medical schools worldwide to evaluate students’ knowledge accumulation throughout their undergraduate program years [1–3]. Its integration into assessment programs yields several advantages, particularly in terms of valuable feedback for students, faculty members, and institutions [4–6]. The validity and reliability of the PT hinge upon the implementation of sound practices in test construction, administration, thorough analysis and review of results, and input from all stakeholders [7].
PT examinations are designed at the final-year level, theoretically allowing for curriculum and cross-institutional comparisons, provided that these curricula and institutions share similar educational goals despite potential differences in methodological approaches [8,9]. However, the effectiveness of PT can be compromised by various factors, including flaws in item writing, inclusion of irrelevant constructs, use of imprecise terms, and heterogeneity in test difficulty [7,10–12].
Another potential source of bias affecting PT results is the endogeneity effect, commonly referred to as “origin bias”. Muijtjens et al. (2007) have explicitly defined origin bias as the phenomenon in which “the origin of items introduced bias in favour of students from the same school as the item producers” [13]. The underlying argument posits that students from a specific school are more likely to achieve better performance on items written by faculty members from their own school.
Inter-institutional PT offers the advantage of cost-sharing and expertise collaboration between schools. However, if origin bias is present, the fairness of assessment results would be compromised, hindering meaningful cross-comparisons. Since the original study by Muijtjens et al., there has been a notable absence of investigation into the presence of item origin bias in PT examinations. This study aims to fill the gap by examining whether origin bias is present and determining any significant differences between schools in a Brazilian cross-institutional PT examination.
Methods
Settings
This study took place within the longest-standing Brazilian consortium for PT, comprising nine public schools: Faculdade de Medicina de Marília (FAMEMA), Faculdade de Medicina de São José do Rio Preto (FAMERP), Universidade Federal de São Carlos (UFSCAR), Universidade Federal de São Paulo (UNIFESP), Universidade Estadual de Campinas (UNICAMP), Universidade de São Paulo (USP—Bauru and Ribeirão Preto campi), and Universidade Estadual Paulista (UNESP) in São Paulo State, and Universidade Estadual de Londrina (UEL) in Paraná State [14].
This consortium conducts biannual examinations for all students (approximately 4500) spanning the first to sixth undergraduate years. The exam comprises 120 multiple-choice questions structured around a fixed blueprint encompassing various areas, disciplines, and themes. Initially, the blueprint was distributed evenly across six equal areas (basic sciences, internal medicine, paediatrics, surgery, obstetrics and gynaecology, and public health) in adherence to national legislation for medical residency selection. However, this division led to an imbalance between areas and low reliability, prompting a blueprint modification. The current structure allocates percentages as follows: internal medicine (28.3% of items), paediatrics (19.7%), surgery (19.7%), obstetrics and gynaecology (16.7%), and public health (16.7%) [15].
During the test construction, the Coordination Committee assigns specific requests for each blueprint content. For example, if the blueprint focuses on acute coronary syndrome, the Coordination Committee can establish different requests for each examination, such as the diagnosis of unstable angina, initial treatment of acute myocardial infarction, or electrocardiogram interpretation. These requests are disseminated to faculty members from the nine participating schools. Consequently, for each blueprint content, up to nine items may be generated, with the area subcommittee selecting the best from the nine to be included in the final test. In subsequent tests, a different request from the same blueprint is provided.
Study design and statistical analysis
For this study, data were extracted from the 6th (final) year undergraduate students across seven schools within the consortium. This sampling choice was made considering that the PT is designed at the final-year level of knowledge. Two schools were excluded because of the limited number of items incorporated into the test. The details of students and written items for each school are shown in Table 1.
First, we evaluated whether the psychometric properties of the items differed according to their school of origin. The difficulty index, representing the percentage of students with incorrect answers, and the discrimination index, denoting the difference in the percentage of correct answers between the top 27% and the bottom 27% of performers on the test, were computed. Differences between mean difficulty and discrimination indices were checked using one-way ANOVA with statistical significance set at p < 0.05, and analyses were executed using SPSS for Mac (Statistical Package for Social Sciences, v. 24.0, IBM Corp, Armonk, NY, United States).
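As an illustration, the two classical indices described above can be computed directly from item-level response data. The following is a minimal sketch (not the SPSS procedure used in the study); the 27% cut-off and the paper's convention of difficulty as the percentage of incorrect answers are taken from the text, while the function and variable names are ours:

```python
import numpy as np

def item_indices(item_correct, total_scores, frac=0.27):
    """Classical difficulty and discrimination indices for a single item.

    item_correct: 0/1 array, one entry per student, for this item.
    total_scores: each student's total test score, used to rank performers.
    """
    n = len(item_correct)
    k = max(1, int(round(frac * n)))
    order = np.argsort(total_scores)        # students sorted by total score, ascending
    low = item_correct[order[:k]]           # bottom 27% of performers
    high = item_correct[order[-k:]]         # top 27% of performers
    # Paper's convention: difficulty = percentage of INCORRECT answers
    difficulty = 1.0 - item_correct.mean()
    # Discrimination = % correct among top 27% minus % correct among bottom 27%
    discrimination = high.mean() - low.mean()
    return difficulty, discrimination
```

For example, an item answered correctly only by the stronger half of a cohort would show a high discrimination index, since the top and bottom 27% groups differ maximally in their correct-answer rates.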
For each item, we compared the rate of correct answers between the school that authored the item and the other schools. A pooled analysis of the items, as previously described [16], involved calculating the rate of correct answers for each item along with a corresponding 95% confidence interval (CI). A proportional meta-analysis of the items’ rate differences and confidence intervals with random effects was performed. The intervention group comprised students from the same school as the item author, whereas the control group consisted of the other students. Consequently, each school had a final odds ratio (OR) for self-items, and the same procedure was followed for non-self items. Heterogeneity between the rates of each item was determined using I2 statistics [17]. Ultimately, a statistical difference between self and non-self items was established if their combined 95% confidence intervals did not overlap [18]. These analyses were carried out using MedCalc for Windows, version 19.4 (MedCalc Software, Ostend, Belgium), with statistical significance set at p < 0.05.
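The pooling step can be sketched as follows. This is a hypothetical Python implementation using a DerSimonian–Laird random-effects model on log odds ratios, with Cochran's Q and the I2 statistic for heterogeneity; MedCalc's proportional meta-analysis differs in implementation detail, so this is illustrative only:

```python
import numpy as np

def pooled_or(tables):
    """Random-effects pooled odds ratio across items (DerSimonian-Laird).

    tables: list of 2x2 counts per item:
        (self_correct, self_wrong, other_correct, other_wrong)
    Returns (pooled OR, 95% CI, I2 heterogeneity in %).
    """
    logs, variances = [], []
    for a, b, c, d in tables:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5   # continuity correction
        logs.append(np.log((a * d) / (b * c)))            # log odds ratio
        variances.append(1/a + 1/b + 1/c + 1/d)           # Woolf variance
    y, v = np.array(logs), np.array(variances)
    w = 1 / v                                             # fixed-effect weights
    q = np.sum(w * (y - np.sum(w * y) / w.sum())**2)      # Cochran's Q
    df = len(y) - 1
    tau2 = max(0.0, (q - df) / (w.sum() - np.sum(w**2) / w.sum()))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # I2, in %
    w_star = 1 / (v + tau2)                               # random-effects weights
    mu = np.sum(w_star * y) / w_star.sum()
    se = np.sqrt(1 / w_star.sum())
    ci = (np.exp(mu - 1.96 * se), np.exp(mu + 1.96 * se))
    return np.exp(mu), ci, i2
```

Running this once on a school's self-items and once on its non-self items yields the two pooled ORs whose 95% CIs are then checked for overlap, mirroring the comparison described above.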
Ethical considerations
As our study involved secondary data, with no individual students’ identification (i.e., analyses were conducted at the item level, not individual level), approval from the institutional review board was deemed unnecessary, in accordance with national legislation governing research ethics involving human subjects.
Moreover, no school was disclosed in the presentation of results in accordance with the consortium’s code of conduct. To prevent rankings between participating schools, each school was granted access only to the overall results and their specific outcomes for any analysis, whether for research purposes or otherwise. This approach aligns with ethical considerations to maintain confidentiality and equitable treatment among the participating institutions.
Results
To verify that the psychometric quality of items did not differ across the schools, we analysed the mean values of item difficulty and discrimination according to school. We observed no differences in the psychometric indices based on the school of origin of the items. The difficulty of the items averaged 0.39 ± 0.20, ranging from 0.33 (school E) to 0.44 (school F), with no statistically significant differences (F = 0.68, p = 0.67). Similarly, the discrimination indices showed no significant distinctions, averaging 0.38 ± 0.13 and ranging from 0.33 (school A) to 0.42 (school C) (F = 0.77, p = 0.59) (Fig 1).
No statistically significant differences were observed for either index, highlighting the consistency of psychometric properties across the schools involved in the study.
In examining self-items, three schools exhibited an OR greater than 1.0, indicating a significantly higher rate of correct answers on items crafted by their own faculty. In three other schools, the ORs also exceeded 1.0; however, their confidence intervals included 1.0, and no statistically significant differences were detected. Conversely, only one school had an OR < 1.0, although the difference was not statistically significant (Fig 2).
The squares represent the odds ratio, and the confidence intervals are depicted by the horizontal lines associated with each OR. Squares to the right of the “1” axis indicate superior performance in correct answers for students from the school that authored the item, whereas squares to the left signify better performance by students from other schools. The diamond represents the final result of the school’s OR. Panel A shows a school with superior performance on their items (final OR does not touch “1”). Panel B indicates a school with no discernible differences in performance on their items. Panel C highlights a school whose performance trends are lower, though not significantly (final OR touches “1”).
The superior performance of these three schools on self-items could not be solely attributed to origin bias; it might be associated with their overall superior performance. For further exploration, we calculated the OR for non-self items. The same three schools demonstrated superior performance on non-self items (OR > 1.00, p < 0.05), whereas the remaining four schools displayed significantly inferior performance on non-self items (OR < 1.00, p < 0.01) (Table 2). The heterogeneity of rates varied from moderate to high for all but two analyses, where schools B and F exhibited low heterogeneity for self-items, with similar rates across items. The Q-test results for the heterogeneity analysis are presented in the S1 Data.
Despite the differences in performance for non-self items, there was an overlap in the combined confidence intervals for all seven schools (Fig 3), which means that the differences in students’ performance between self and non-self items were not statistically significant.
The 95% confidence intervals (CI) of the upper diamonds are displayed below them, whereas the CIs of the lower diamonds are not represented (out of scale). Across all schools, there is a noticeable overlap of CIs, indicating that the differences are not statistically significant. The schools were categorised based on the diamonds’ positions. School A had an OR < 1.0 for both self and non-self items. Schools C, D, and E had an OR > 1.0 (though with p > 0.05) for self-items and an OR < 1.0 (with p < 0.01) for non-self items. Schools B, F, and G had an OR > 1.0 for both self and non-self items (p < 0.05).
Discussion
Differences in students’ backgrounds have been identified as a source of variability in various examinations. For example, instructor changes (staff turnover) in a Dutch school of economics caused significant variations in students’ grades, as well as in pass and fail rates [19]. Other studies have shown that socioeconomic variables have a significant effect on students’ mean scores [20–22]. However, few studies have addressed the origin of test items as a source of variation in student performance. In the context of PT, this study represents, to the best of our knowledge, the second attempt to explore the role of origin bias in influencing students’ performance.
In contrast to the findings of Muijtjens et al. [9], our study did not reveal origin bias influencing the PT results across different institutions. Muijtjens et al. [9] demonstrated a consistent superiority in the performance of students from Maastricht University compared to those from Nijmegen, particularly on items written by Maastricht staff. This trend was especially pronounced from the 2nd to the 5th undergraduate years. They employed the Dscore, which gauges the difference in item difficulty between student groups, and incorporated a model with independent variables such as undergraduate year and university [13].
Criticism of their model revolves around potential biases arising from curriculum exposure discrepancies and random fluctuations. Additionally, cross-institutional comparisons might foster undesirable competitive dynamics among students and institutions [23].
In our investigation, we adopted a distinct method, focusing on the odds ratio of the percentage of correct answers for each item, rather than relying on the difference in percentages between comparison groups. Unlike Muijtjens et al. [9], we deliberately avoided performing pairwise comparisons between schools and refrained from identifying them. In our methodology, the control group for each school encompassed all other participating schools. Notably, we did not identify any origin bias that would account for differential performance among students from various institutions. Our method offers the advantage of handling sparse data while focusing on aggregate item-level effects, without requiring access to individual student performance. This feature is particularly relevant for preserving data privacy, especially in benchmarking exercises involving multiple schools. Moreover, the high heterogeneity rates on I2 analyses suggest a non-uniform pattern of students’ responses across the exam items, indicating high variability between schools and supporting the absence of origin bias.
The three schools (B, F, and G) whose students demonstrated superior performance on items written by their faculty also achieved the best overall performance. Although concerns may arise regarding schools C, D, and E, where students exhibited better performance on self items than on non-self items, these schools more likely underperformed consistently in the entire test, rather than specifically in the subset of non-self items, compared to B, F, and G.
This absence of item origin bias in our study can be attributed to the conscientious efforts of the Coordination Committee in adhering to the best practices for item writing. A relevant experiment conducted by Bertoni et al. in Australia involved the identification and correction of flaws in multiple-choice questions used in finance exams. Following corrections and exam re-administration, they observed enhanced clarity for students, accompanied by an increase in correct answers [24]. This underscores the substantial impact that frequent errors in item writing can have on students’ overall performance [7].
Our PT consortium is actively involved in enhancing item-writing practices by conducting workshops for school faculties, adhering to international guidelines for effective assessment [25,26]. Despite these efforts, the Coordination Committee still receives numerous flawed items for PT composition, some of which may carry unconscious cues to their school of origin [27]. Our belief is that a meticulous review of items and the selection of the best item for each blueprint request contribute to the creation of uniform tests with standardised items. We recommend that schools utilizing PT not only adopt and implement a blueprint but also explicitly define the desired expectation for item writing in each blueprint component.
However, our study is not without limitations. First, we present data from a single test with a limited number of items, and this cross-sectional dataset may have failed to capture dynamic aspects of item origin bias. To draw more robust and definitive conclusions, regular monitoring of item origin and of school-based differential performance for each item is crucial. Second, our conclusions are specific to 6th-year students, and generalizing them to other undergraduate years should be approached with caution. Third, the moderate-to-high heterogeneity (I2) in most analyses suggests the potential influence of random fluctuations across items, which cannot be entirely ruled out.
Despite these limitations, our study offers valuable insights into origin bias, a potential source of variation in student performance. This often overlooked concern poses a threat to the validity and reliability of common assessment tools in medical education. Furthermore, the proportional meta-analysis of pooled items adds a complementary tool to the existing set, enhancing benchmark assessments. Although we cannot rule out the possibility that methodological limitations—such as the limited power of a single cross-sectional dataset—may have contributed to the observed absence of origin bias, we believe that the adoption of best practices in PT construction more likely reflects a genuine improvement in fairness by mitigating such bias.
Conclusions
This study found no evidence of origin bias in a progress test examination administered by a Brazilian cross-institutional consortium of medical schools. The adoption of best practices in blueprinting, item writing, and test editing may have contributed to minimizing such bias. As the use of progress testing continues to expand globally, monitoring origin bias is important to enhance the validity and comparability of test results.
Supporting information
S1 Data. The dataset includes the number of correct answers for each item, organized by the school that developed the item, along with a comparison to the performance of students from other schools.
https://doi.org/10.1371/journal.pone.0325734.s001
(XLSX)
References
- 1. Freeman A, Van Der Vleuten C, Nouns Z, Ricketts C. Progress testing internationally. Med Teach. 2010;32(6):451–5. pmid:20515370
- 2. Schuwirth L, Bosman G, Henning RH, Rinkel R, Wenink ACG. Collaboration on progress testing in medical schools in the Netherlands. Med Teach. 2010;32(6):476–9. pmid:20515376
- 3. Blake JM, Norman GR, Keane DR, Mueller CB, Cunnington J, Didyk N. Introducing progress testing in McMaster University’s problem-based medical curriculum: psychometric properties and effect on learning. Acad Med. 1996;71(9):1002–7. pmid:9125989
- 4. Coombes L, Ricketts C, Freeman A, Stratford J. Beyond assessment: feedback for individuals and institutions based on the progress test. Med Teach. 2010;32(6):486–90. pmid:20515378
- 5. Tio RA, Schutte B, Meiboom AA, Greidanus J, Dubois EA, Bremers AJA, et al. The progress test of medicine: the Dutch experience. Perspect Med Educ. 2016;5(1):51–5. pmid:26754310
- 6. Cecilio-Fernandes D, Bicudo AM, Hamamoto Filho PT. Progress testing as a pattern of excellence for the assessment of medical students’ knowledge: concepts, history, and perspective. Medicina (Ribeirão Preto). 2021;54(1):e173770.
- 7. Wrigley W, van der Vleuten CPM, Freeman A, Muijtjens A. A systemic framework for the progress test: strengths, constraints and issues: AMEE Guide No. 71. Med Teach. 2012;34(9):683–97. pmid:22905655
- 8. Schuwirth LWT, van der Vleuten CPM. The use of progress testing. Perspect Med Educ. 2012;1(1):24–30. pmid:23316456
- 9. Muijtjens AMM, Schuwirth LWT, Cohen-Schotanus J, van der Vleuten CPM. Differences in knowledge development exposed by multi-curricular progress test data. Adv Health Sci Educ Theory Pract. 2008;13(5):593–605. pmid:17479352
- 10. Holsgrove G, Elzubeir M. Imprecise terms in UK medical multiple-choice questions: what examiners think they mean. Med Educ. 1998;32(4):343–50. pmid:9743793
- 11. Downing SM. Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ Theory Pract. 2002;7(3):235–41. pmid:12510145
- 12. Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ. 2008;42(2):198–206. pmid:18230093
- 13. Muijtjens AMM, Schuwirth LWT, Cohen-Schotanus J, van der Vleuten CPM. Origin bias of test items compromises the validity and fairness of curriculum comparisons. Med Educ. 2007;41(12):1217–23. pmid:18004993
- 14. Bicudo AM, Hamamoto Filho PT, Abbade JF, Hafner MLMB, Maffei CML. Consortia of Cross-Institutional Progress Testing for All Medical Schools in Brazil. Rev Bras Educ Med. 2019;43:151–6.
- 15. Hamamoto Filho PT, Bicudo AM, Pereira-Júnior GA. Assessment of medical students’ Surgery knowledge based on Progress Test. Rev Col Bras Cir. 2023;50:e20233636.
- 16. Hamamoto Filho PT, Lourenção PLT de A, Abbade JF, Cecílio-Fernandes D, Caramori JT, Bicudo AM. Exploring pooled analysis of pretested items to monitor the performance of medical students exposed to different curriculum designs. PLoS One. 2021;16(9):e0257293. pmid:34506599
- 17. Lin L. Comparison of four heterogeneity measures for meta-analysis. J Eval Clin Pract. 2020;26(1):376–84. pmid:31234230
- 18. El Dib R, Nascimento Junior P, Kapoor A. An alternative approach to deal with the absence of clinical trials: a proportional meta-analysis of case series studies. Acta Cir Bras. 2013;28(12):870–6. pmid:24316861
- 19. Arnold IJM. Changing the guard: staff turnover as a source of variation in test results. Stud Educ Eval. 2015;47:12–8.
- 20. Ballou D, Sanders W, Wright P. Controlling for student background in value-added assessment of teachers. J Educ Behav Stat. 2004;29:37–65.
- 21. Suna HE, Tanberkan H, Gür B, Perc M, Özer M. Socioeconomic status and school type as predictors of academic achievement. J Econ Cult Soc. 2020;61:41–64.
- 22. Shahriar AA, Puram VV, Miller JM, Sagi V, Castañón-Gonzalez LA, Prasad S. Socioeconomic diversity of the matriculating US medical student body by race, ethnicity, and sex, 2017-2019. JAMA Netw Open. 2022;5:e222621.
- 23. Albanese M. Benchmarking progress tests for cross-institutional comparisons: every road makes a difference and all of them have bumps. Med Educ. 2008;42(1):4–7. pmid:18181841
- 24. Bertoni F, Smales LA, Trent B, Van de Venter G. Do item writing best practices improve multiple choice questions for university students? SSRN Journal. 2019;45:39.
- 25. Norcini J, Anderson B, Bollela V, Burch V, Costa MJ, Duvivier R, et al. Criteria for good assessment: consensus statement and recommendations from the Ottawa 2010 Conference. Med Teach. 2011;33(3):206–14. pmid:21345060
- 26. Schuwirth L, Pearce J, Australian Medical Assessment Collaboration. Determining the quality of assessment items in collaborations: aspects to discuss to reach agreement; 2014. https://research.acer.edu.au/cgi/viewcontent.cgi?article=1044&context=higher_education
- 27. Hamamoto Filho PT, Bicudo AM. Improvement of faculty’s skills on the creation of items for progress testing through feedback to item writers: a successful experience. Rev Bras Educ Med. 2020;44:e018.