
The importance of test order in external and standardized test results: The case of PISA 2018

  • R. van Grieken,

    Roles Conceptualization, Project administration, Resources

    Affiliation Chemical and Environmental Engineering Group, School of Experimental Sciences and Technology, Rey Juan Carlos University, Madrid, Spain

  • J. D. Tena,

    Roles Methodology, Software, Supervision, Validation, Writing – original draft

    Affiliations Management School, University of Liverpool, Liverpool, United Kingdom, Department of Economics, University of Sassari and CRENoS, Sassari, Italy

  • Luis Pires,

    Roles Data curation, Investigation, Methodology, Project administration, Software, Supervision, Visualization, Writing – original draft

    Affiliation Department of Applied Economics I, Faculty of Economics and Business, Rey Juan Carlos University, Madrid, Spain

  • Ismael Sanz,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision

    Affiliations Department of Applied Economics I, Faculty of Economics and Business, Rey Juan Carlos University, Madrid, Spain, Social Policy Department, London School of Economics, London, United Kingdom

  • Lilliana L. Avendaño-Miranda

    Roles Supervision, Validation, Writing – review & editing

    lilliana.avendano@urjc.es

    Affiliation Department of Applied Economics I, Faculty of Economics and Business, Rey Juan Carlos University, Madrid, Spain

Abstract

Standardized tests are intended to reduce information asymmetry by providing a common and objective measure of students’ academic performance. The basic assumption underlying standardized testing is that differences in student performance on standardized tests should be attributed primarily to differences in the quality of education received by students. However, there is evidence that environmental factors can affect standardized test scores, which may result in anomalous observations or outliers that distort the picture of student performance. In this regard, the exclusion of Spain from PISA 2018 is particularly interesting: the Spanish data met the PISA 2018 Technical Standards but showed implausible student-response behavior. The aim of this paper is to complement the OECD’s analysis of Spain’s exclusion from PISA 2018 by exploring the potential reasons behind the outlier results, focusing on the Madrid region.

1. Introduction

Assessment is an essential practice in education. Educational assessment is of great interest to parents, teachers, educational institutions, government, and other decision-makers. School tests are designed for a variety of purposes. Standardized tests intend to reduce information asymmetry by providing a common and objective measure of students’ academic performance [1, 2]. Of relevance is the OECD’s PISA program, which assesses the reading, mathematics, and science literacy of secondary school students [3].

The basic presumption behind standardized testing is that “differences in the achievement of students on standardized tests should be primarily attributable to differences in the quality of education received by students” [4]. However, there is evidence that environmental factors may affect standardized test results. For example, Ebenstein, Lavy and Roth [5] found that students’ cognitive performance during high-school matriculation exams in Israel was associated with the air pollution level on the day of the test. Wen and Burke [6] obtained similar results linked to wildfire smoke exposure in the western United States, while Park, Goodman, Hurwitz, and Smith [7] presented evidence of lower test results related to heat exposure in classrooms during the American PSAT exams. In other words, standardized testing results may be affected by environmental factors, which can produce anomalous observations or outliers that distort the picture of students’ performance. Outliers may be due to random variation that leads to misleading results, but they may also indicate a scientifically interesting event [8].

In 2018, the PISA results for Spain were so atypical that they were excluded from the results report [3]. The exclusion is particularly interesting because the Spanish data met the PISA 2018 Technical Standards but showed implausible student-response behavior. The OECD released a report explaining the problems that caused the exclusion of Spain [9]. The aim of this paper is to complement the OECD’s analysis by exploring the potential reasons behind the outlier results, focusing on the Madrid region.

We chose the Madrid region for three main reasons: 1) it is the region with the most abnormal deviation in Reading; 2) it accounts for roughly a fourth of the Spanish education system; and 3) Madrid conducted high-stakes exams for tenth-grade students that overlapped with the PISA testing window. The coincidence of the testing periods was one of the reasons the OECD pointed out as a possible cause of the atypical results in Spain [9].

2. PISA: Characteristics, trends, and atypical results

The PISA survey releases comparative data on 15-year-old students in Reading, Mathematics, and Science every three years. In the latest edition, PISA 2018, 79 jurisdictions participated (37 OECD members and 42 partner countries and economies). Each edition designates one of the three assessed competencies as the primary domain, which implies that there are more questions on that competence and that the OECD international report focuses on it more closely. Reading was the primary domain in PISA 2018, as in PISA 2000 and 2009. The PISA survey provides rich information on students’ abilities in each competence as well as on the characteristics of schools and students.

PISA scores are determined from the variation in results observed across all test participants. PISA uses item-response-theory models to describe the relationship between student proficiency, item difficulty, and item discrimination. PISA defines a score of 500 as the average proficiency of students across OECD countries, with a standard deviation of 100 score points. Therefore, a one-point difference on the PISA scale corresponds to an effect size of 0.01. To interpret students’ scores in substantive terms, recall that the PISA scales are divided into proficiency levels, each defining the knowledge and skills needed to complete tasks successfully and spanning a range of about 80 score points. Based on estimates of the average score-point difference across adjacent grades, the OECD states that, on average across countries, the difference between adjacent grades is about 40 score points. However, the OECD reports refrain from expressing PISA score differences in terms of an exact "years-of-schooling" equivalent ([3], pp. 43–44).
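As a quick worked illustration of these scale conventions (our own arithmetic, not an OECD formula), a hypothetical difference of 20 score points can be expressed as an effect size relative to the 100-point standard deviation, or relative to the roughly 40-point gap between adjacent grades:

\text{effect size} = \frac{\Delta\text{score}}{SD_{\text{PISA}}} = \frac{20}{100} = 0.2, \qquad \frac{\Delta\text{score}}{40} = \frac{20}{40} \approx 0.5 \ \text{of a typical between-grade gap.}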

The results of the PISA assessments are estimates because they are obtained from samples of students using a limited set of assessment tasks. PISA therefore publishes confidence intervals to determine whether a difference is statistically significant. In doing so, it accounts for two sources of uncertainty: the sampling error (around 2 to 3 PISA score points for most countries) and the measurement or imputation error (around 0.5 of a score point in Reading, and 0.8 of a point in Mathematics and Science). An additional source of uncertainty arises when comparing results across different PISA assessments; it stems from differences in the test instruments, items, and calibration samples and is known as the link error. For example, for comparisons between reading results in PISA 2018 and past PISA assessments, the link error corresponds to at least 3.5 score points ([3], p. 45).
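A minimal sketch of how these components could be combined when judging whether a score change between two editions is statistically significant, assuming the error components are independent and add in quadrature (the figures below are placeholders in line with the magnitudes quoted above, not values taken from the report):

import math

def trend_standard_error(se_t1: float, se_t2: float, link_error: float) -> float:
    # Illustrative assumption: independent error components combined in quadrature
    return math.sqrt(se_t1**2 + se_t2**2 + link_error**2)

# Placeholder standard errors for a hypothetical country in two editions
se = trend_standard_error(se_t1=2.5, se_t2=2.5, link_error=3.5)
score_change = -10.0                 # hypothetical change between the two editions
z = score_change / se                # rough z-statistic for the change
print(f"combined SE ~ {se:.1f} points, z ~ {z:.2f}")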

One of the primary uses of the PISA reports is the comparative analysis of different education systems over time. Each PISA edition shows the relative position of each country or region, and policymakers aim to maintain or improve that position in subsequent years. The OECD Secretary-General states this clearly: "PISA is not only the world’s most comprehensive and reliable indicator of students’ capabilities, it is also a powerful tool that countries and economies can use to fine-tune their education policies" ([3], p. 4). Moreover, in its PISA reports, the OECD analyses the performance evolution of participating countries and regions and highlights those with better results, such as the four participating Chinese regions in the last PISA edition (Beijing, Shanghai, Jiangsu and Zhejiang), which outperformed all other participating countries except Singapore.

The 2007 evaluation of the impact of PISA reveals that “Countries that rank relatively high in PISA use the PISA results as a mechanism for evaluating their education system, but do not seem to have introduced any policy initiatives directly in light of PISA. On the contrary, in countries that perform relatively low, we identify a direct policy impact after the publication of the PISA results” ([10], p.8).

PISA highlights those countries with a positive trend. This is the case of Estonia, which has become the European country with the best academic performance. After improvements of 22 and 9 score points in Reading and Mathematics respectively, Estonia reached the first position among European countries in Reading (523 score points) and the third position among OECD countries in Mathematics (523 score points), after Japan and South Korea. The positive evolution of Portugal has also been remarkable: despite being one of the countries most affected by the 2008 financial crisis, Portugal has increased its performance in Reading (22 score points), Mathematics (38 score points) and Science (33 score points). Other countries with even more significant improvements in PISA scores are Qatar, Poland, and Peru. From 2006 to 2018, Qatar increased its PISA results in Reading (95 score points), Mathematics (96 score points) and Science (70 score points), while Poland and Peru have experienced aggregate increases of more than 20 score points on average.

Of course, there are also countries with a negative trend. For example, since 2000, the evolution in Reading scores has been moderately negative in Australia, Finland, Iceland, and New Zealand, and sharply negative in South Korea, the Netherlands and Thailand (OECD, PISA 2018 Database, Table I.B1.10; Figure I.9.1). However, PISA performance remains relatively stationary in many other countries.

Nevertheless, observing a sharp improvement or deterioration of a given country in the PISA ranking is rare. Fig 1 shows all the score changes in Reading between two consecutive PISA editions from 2000 to 2018. We observe that 87.3% and 60% of the 308 observed changes (for countries or regions) lie within the ±20 and ±10 intervals, respectively. Furthermore, only 4.5% of the sample falls outside the ±30 interval. Recall that 40 score points correspond roughly to the average difference between adjacent grades. Thus, it would be unexpected for students in a given country to gain or lose a whole grade’s worth of competencies within three years. The most remarkable case is Turkey, which in 2015 decreased by 47 score points in Reading (28 in Mathematics and 38 in Science), yet bounced back in 2018 with increases of 38, 34 and 43 score points in Reading, Mathematics and Science, respectively. The OECD remarked on this anomalous situation, stating: "When considering results from all years, it is clear that PISA 2015 results–which were considerably lower–were anomalous, and neither the decline between 2012 and 2015, nor the recovery between 2015 and 2018, reflect the long-term trajectory" ([3], p. 340). Therefore, a change of 40 score points between two consecutive editions can be considered atypical, and the circumstances surrounding the implementation of the test could explain this anomaly, as we discuss below.

Fig 1. Change in score points between two consecutive PISA editions in Reading (2000–2018).

https://doi.org/10.1371/journal.pone.0309980.g001

To understand these results, it is worth noting that the PISA test is very complex, as its implementation requires many different people, companies, and education authorities, and a mistake in any of these steps could make the results unreliable. For that reason, PISA has developed technical standards that set specific procedures for test implementation and data collection [11–15]. The process includes the following steps: elaboration of the questions, translation of the questions into the local language, printing of the test (if on paper) or loading it onto tablets or computers (if computer-based), selection and acceptance of the school and student sample, allocation of days for the test, and correct administration of the test by agents external to the school. A final fundamental element for a reliable interpretation of PISA results is that students remain focused on the questions. This is not a minor issue, since about 7 out of 10 students reported putting less effort into the PISA test than they would have done if, for example, their performance on the test had counted towards their grades [3]. If a substantial number of students do not put in enough effort, this will be reflected in the position of the region or country in the PISA ranking, as in the case of Australia, whose mean performance declined over the period 2000–2018 [3].

Since 2000, PISA results have occasionally not been published because they did not meet the standards. In other cases, PISA results were issued with a warning not to use the information in international comparisons or trend analyses. The OECD usually spots these problems after the test results are analyzed, a complex process that lasts more than one year. The OECD mainly uses the following criteria to ensure the information is suitable: student- and school-level exclusions, minimum sample sizes, response rates, and inconsistencies and deviations from expected patterns. Despite the complexity of the PISA evaluation, the OECD has only had to exclude 11 out of 421 participant countries or regions in previous editions. There is also a larger number of cases with relatively minor problems in which the OECD decided the results were still statistically sound and comparable. Table 1 shows the 11 countries or regions excluded from PISA in previous editions. There are two different cases. First, some are suspected of inflating their scores. This may be a credible strategy, as the PISA test can be deemed the most prestigious evaluation of a country’s educational system. This is the case of Argentina, Kazakhstan, Malaysia, and Viet Nam, which showed a massive improvement of their scores in the exclusion years and a return to the mean in subsequent years; the OECD excluded them from the PISA results due to these irregularities [3, 14], as we describe below. Second, this group includes countries that experienced technical problems that did not drive scores in a particular direction, such as the Netherlands, the United Kingdom, the United States, or Spain.

Thus, for example, the Netherlands in PISA 2000 ([11], p. 186) and the United Kingdom in PISA 2003 ([12], p. 281) received this warning because school participation rates were exceptionally low. Similarly, Reading literacy results for the United States were excluded from the database and international reports because of an error in printing the test booklets in PISA 2006 ([12], p. 281). Albania’s data on parental occupation and school enrolment were deleted from the PISA 2012 international dataset owing to evidence of systematic errors and violations of the PISA Technical Standards in the survey instruments, the procedures for test implementation, and the coding of student responses at the national level ([13], p. 283). For the same reasons, the PISA 2015 international database does not include all the information collected through the student questionnaires for Albania; the OECD published this information in an additional dataset while warning that no attempt should be made to link the student data in the international PISA database with the additional dataset for Albania ([14], p. 269). The international dataset did not include Argentina in PISA 2015, although this information is available as a separate dataset. Still, data for Ciudad Autónoma de Buenos Aires (CABA) were fully included in the international dataset even though the nationally defined target population deviated significantly from the desired target population ([14], pp. 269–270). When scoring the PISA test, all multiple-choice questions and certain short-answer questions are scored automatically within the system, whereas open-ended questions are evaluated and scored by experts. It was discovered that scores for human-scored items submitted by Kazakhstan were inconsistent with the success rates in preceding PISA cycles and virtually unrelated to scores on multiple-choice items. Thus, data for Kazakhstan in PISA 2015 were removed from the international dataset (but are available as a separate dataset) because of leniency among national experts, which forced the elimination of all human-scored items ([14], p. 271). Data for Malaysia in PISA 2015 are included in a separate database because of a low response rate among the initially sampled schools ([14], p. 271). In PISA 2018, the international dataset did not include the financial literacy sample data for the Netherlands because of a low school participation rate ([15], pp. 9–10). Data for Viet Nam in PISA 2018 were removed from the international dataset (available as a separate dataset) because of several minor violations of the implementation standards ([15], p. 12). Finally, the PISA 2018 reading results for Spain were not published in the results report [3] and are not included in the OECD average results, but they are available as a separate dataset [9].

3. PISA in Spain and Madrid region

Spain has participated in PISA since its first edition, PISA 2000. Although the central government sets the general education laws in Spain, the autonomous regions oversee their implementation and development. In Spain there are 17 autonomous regions and 2 autonomous cities (Ceuta and Melilla). For this reason, since PISA 2003 the autonomous regions have been gradually incorporated into PISA, making their results comparable to those of other participant countries. Table 2 reports the PISA scores of the different autonomous regions and cities. Of particular note are the anomalous score reduction in PISA 2006 (mainly due to Reading) and the subsequent recovery in PISA 2009. A similar pattern can be observed for the autonomous region of Murcia, with a decrease of 18 score points in PISA 2012 and a subsequent increase of 24 score points in PISA 2015. In the majority of the other regions we observe a sustained improvement.

However, PISA 2018 is especially atypical. This situation led to the exclusion of Spain and all its autonomous regions. All of them lost score points, and in some cases the magnitude of the deterioration is implausible. The most affected autonomous regions were Madrid (-46), Navarre (-42), Comunidad Valenciana (-26), Castile and Leon (-25) and La Rioja (-24). The case of Madrid is especially remarkable because it was among the highest-ranked regions, and its situation changed drastically in PISA 2018.

Interestingly, with minor exceptions for some specific schools, Spain had not encountered any significant problem applying the PISA test before PISA 2018. Thus, the exclusion of Spain from PISA 2018 is particularly interesting because the Spanish data met the PISA 2018 Technical Standards but showed implausible student-response behavior. According to the OECD, around 68% of students across OECD countries put less effort into the PISA test than they would have done if the test had counted towards their grades. The main problem was therefore the large number of students who acknowledged “having spent very little effort on the PISA test they just completed” and hence did not achieve their expected scores on the reading test ([9], p. 8). The accuracy or reliability of answers is a common problem in many surveys, but this was the first time it caused the exclusion of a country from the PISA ranking. Although some members of the education community (teachers, principals, parents, students) oppose these tests, this is not a significant problem for their implementation, since such schools are substituted or removed from the sample. Until PISA 2018, however, there had been no case in which a general lack of interest among participating students caused the exclusion of a country. As a result, the OECD released an eight-page report, in the form of an appendix, explaining the results of a study on the application of PISA 2018 in Spain that caused its exclusion [9]. In the other 10 cases of exclusion, there was only a short explanation consisting of a few paragraphs in the Technical Report.

To complement the OECD’s analysis of the exclusion of Spain from PISA 2018, in the following sections we present a statistical analysis of the possible causes of the abnormal results in the region of Madrid. Three main reasons explain the choice of Madrid. Firstly, it is the region with the most abnormal deviation in Reading. Secondly, it is highly representative of the country: it is a large region that accounts for approximately a fourth of the Spanish education system. Thirdly, the OECD pointed to the timing of regional exams as a possible reason for the lack of motivation of some Spanish students: "in 2018, some regions in Spain conducted their high-stakes exams for tenth-grade students earlier in the year than in the past, which resulted in the testing period for these exams coinciding with the end of the PISA testing window" [9]. As shown in Table 3, Madrid was one of the regions where schools differed in when they scheduled their own exams, which allows us to identify the importance of this event for PISA scores. Moreover, the PISA test overlapped in some cases with another external test at the regional level for tenth-grade students. For these reasons, the Madrid region is, in principle, an interesting example of how crowded exam schedules affect students’ performance.

Table 3. Calendar of school exams and external tests of tenth-grade students in Madrid 2018.

https://doi.org/10.1371/journal.pone.0309980.t003

4. Data and methodology

The dataset is built from all the plausible values for Spain in the PISA 2018 report, which were published in December 2019, as explained in Annex A9 [9].

As discussed previously, the 2018 PISA results could have been affected by two events in the Madrid region: 1) participation in the regional external and standardized test that all students in their final year of compulsory school must take (LOMCE test); and 2) the 2017/18 academic calendar change for the third evaluation in the region (Table 3). In both cases, their impact is estimated using a difference-in-differences approach. For the first event, the regional test could have affected student performance in PISA because the PISA test took place from April 15th to May 30th, while the regional test took place on April 26th–27th. To identify the effect of this event, we take advantage of the fact that our control group, grade-repeater students, take only the PISA test and not the regional assessment, whereas non-repeaters (the treatment group) take both exams. Thus, we estimate how taking the PISA test after the regional test, or during the exam week in May, affects students’ scores compared to a control group of students for whom PISA overlapped with neither the regional test nor the exam period. The specification includes individual, family, and school characteristics as controls. The implicit assumption in our approach is that, once we control for these characteristics, the remaining performance differences are entirely explained by the treatment.

We aim to account for two different explanations of the extreme results in the Madrid region, namely the clash with the regional test and the clash with the school evaluation period. For this purpose, we employ two difference-in-differences specifications. The first compares non-repeater students who take the PISA test after the regional test (first difference) with a control group of repeater students, who do not take the regional test in any case (second difference). The second specification compares students in non-fee-paying schools when PISA clashes and when it does not clash with the exam period (first difference) with a control group of students in fee-paying schools, who are not affected by the overlap of the two tests (second difference). Note that while a difference-in-differences approach allows the treatment and control groups to differ, the identifying assumption is that, absent treatment, the difference between them would remain constant. For example, repeater and non-repeater students obtain different scores on the PISA test under the null hypothesis, and the approach tests whether the gap between the two groups widens or narrows when non-repeater students take both the regional and the PISA tests. Descriptive statistics of the variables employed in the difference-in-differences models are shown in S1 Appendix.

Based on the discussion in the previous paragraph, the model in the first specification can be defined as follows:

Y_i = \beta_0 + \beta_1 D_i + \beta_2 NR_i + \beta_3 (D_i \times NR_i) + \sum_{k=1}^{K} \gamma_k X_{ki} + \varepsilon_i,   (1)

where Y_i is the dependent variable representing the relevant 2018 PISA score of student i; D_i is a dummy variable that takes value 1 if the PISA test took place after the regional test, and 0 otherwise; NR_i takes value 1 for non-repeater students and 0 otherwise; X_{ki} is the kth observed characteristic of student i; β0 to β3 and γk for k = 1 to K are parameters to be estimated; and ε_i is an error term. In particular, β0 is the intercept, while β1 and β2 represent the impact on the 2018 PISA score of taking the PISA test after the regional test and of being a non-repeater student, respectively. Our focus parameter β3 reflects the joint impact of being a non-repeater student and taking the PISA test after the regional test; this is the fundamental term, as only non-repeater students take both the PISA and the regional tests. The K parameters γk indicate the influence of the individual covariates on the response variable. Model (1) and all the subsequent specifications are estimated by OLS.

For the second case, we follow a similar strategy to estimate the impact of the 2017/18 academic calendar change in the region on the PISA outcome. Under the new calendar, the 2018 PISA test clashed with the exam period in non-fee-paying schools of the Madrid region at the end of May, while fee-paying schools serve as the control group. Accordingly, we estimate the following model:

Y_i = \beta'_0 + \beta'_1 ME_i + \beta'_2 NF_i + \beta'_3 (ME_i \times NF_i) + \sum_{k=1}^{K} \gamma'_k X_{ki} + \varepsilon'_i,   (2)

where the variables are defined as in expression (1), except that ME_i takes value 1 if the student’s 2018 PISA exam took place in May and 0 otherwise, and NF_i is a dummy variable taking value 1 for non-fee-paying schools and 0 otherwise.
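As an illustration of how the two specifications can be estimated, the sketch below fits both interaction models by OLS. It is a minimal sketch under stated assumptions: the column names and file name are hypothetical, only a handful of illustrative controls are included, and the proper handling of PISA plausible values and survey/replicate weights is deliberately omitted.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical extract of the Madrid sample with illustrative column names
df = pd.read_csv("pisa2018_madrid.csv")

controls = "+ female + age + immigrant + escs + I(escs**2)"  # illustrative covariates

# Specification (1): regional-test overlap, repeater students as the control group
m1 = smf.ols("score ~ after_regional * non_repeater " + controls, data=df).fit(cov_type="HC1")

# Specification (2): May exam week, fee-paying schools as the control group
m2 = smf.ols("score ~ may_exam * non_fee_paying " + controls, data=df).fit(cov_type="HC1")

# The focus parameters are the interaction terms (beta_3 and beta_3' in the text)
print(m1.params["after_regional:non_repeater"], m1.pvalues["after_regional:non_repeater"])
print(m2.params["may_exam:non_fee_paying"], m2.pvalues["may_exam:non_fee_paying"])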

The estimation of the previous two models provides a clear picture of the effect of two different scheduling events on students’ academic performance. More specifically, our focus parameter in specification (1) is β3, which indicates the expected score difference, once observed student characteristics are controlled for, associated with taking the PISA test after the regional test for non-repeater students relative to repeater students. In specification (2), our focus parameter is β3′, which indicates the expected score change associated with taking the PISA exam while preparing for the school evaluation, again controlling for observable student characteristics.

5. Empirical results and discussion

We start the empirical analysis by estimating the importance of different individual and institutional determinants of plausible PISA scores using regression analysis in the region of Madrid. Table 4 shows the estimation results for specification (2) in the previous section. The estimation results are generally consistent with initial expectations. The only exception is the number of minutes devoted to the different subjects, which appears to negatively impact performance. A plausible explanation is a reverse-causality problem that biases this estimate: students who have fallen behind in these subjects need more study time to catch up. This observation is consistent with Kuehn and Landeras [16], who suggest an endogeneity bias in a similar estimation. To tackle this problem, they propose a two-stage least squares (2SLS henceforth) estimation with homework time and time spent in private math lessons as instruments for science study time. However, these instruments are also likely to be affected by students’ performance, which would invalidate the exclusion restriction. Therefore, we propose using the average number of teaching hours in that subject and the total number of students in the school as instruments. Although no instrumental variable is free of criticism, decisions about teaching hours at the school level are likely to affect individuals’ study habits while not being directly guided by individual mark expectations. The effect of minutes devoted to the different subjects is no longer negative in the 2SLS regression, while the impact of all the other variables is qualitatively similar.
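A minimal sketch of this instrumental-variables step, assuming hypothetical column names and using the linearmodels package; the instruments follow the idea described above (school-level teaching hours and school size) but the code is not the authors' exact implementation.

import pandas as pd
from linearmodels.iv import IV2SLS

df = pd.read_csv("pisa2018_madrid.csv")  # hypothetical extract, as above

# study_minutes is treated as endogenous and instrumented with school-level variables
formula = (
    "score ~ 1 + female + age + non_repeater + escs "
    "+ [study_minutes ~ school_teaching_hours + school_size]"
)
iv_res = IV2SLS.from_formula(formula, data=df).fit(cov_type="robust")
print(iv_res.summary)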

Regardless of the estimation method, girls obtain higher scores than boys in Reading and in global competencies, but lower scores in Mathematics and Science. Students who repeated a grade have expected scores at least 70 points lower than non-repeaters. Younger students who have not repeated a grade obtain better results. The expected scores of immigrant students are generally lower. However, it is remarkable that, after controlling for the migrant status of the student, having an immigrant mother or a mother who does not speak Spanish does not affect the student’s score. The ESCS index positively affects PISA performance, especially in Mathematics, but the squared term of this variable is not significant.

Turning to the type of school, private schools, which constitute the reference category, obtain the best results, followed (in this order) by bilingual private schools, bilingual charter schools, non-bilingual charter schools, and non-bilingual public secondary schools (IES).

Looking at the different geographical areas, students in DAT west schools obtain significantly lower expected scores than the reference category for all competencies. This is striking because the west area is the richest and the one where the best results were expected. The negative effect is most prominent in Reading (about 30 points lower). DAT south only negatively affects Reading (about 32 points). It is therefore highly remarkable that DAT west shows a significant negative impact on PISA scores even after controlling for many different student and school characteristics in the regression analysis. This could indicate that some unobserved factors associated with the administration of the test affected students in this area, for example, a lack of engagement or the inexperience of the test administrators.

Taking the PISA test after the regional test in Madrid negatively impacted performance, but the effect is not significant at conventional significance levels. Moreover, the correlation of PISA scores with the exam week in May is almost negligible. Overall, these results do not suggest that other exams can explain the deficient performance of Madrid students in PISA. Not all students and schools were similarly affected by these two events: only non-repeater students would be affected by the overlap of the PISA and Madrid tests. Thus, adding the interaction term between non-repeater status and taking PISA after the Madrid test to the regression analysis, as reported in Table 5, allows us to estimate how this overlap affects the student’s expected score (corresponding to specification (1) in the previous section). In this estimation, repeater students who took the PISA test after the Madrid test act as the control group.

Table 5. Difference in difference estimation for the impact of the PISA and Madrid test overlap on non-repeater students.

https://doi.org/10.1371/journal.pone.0309980.t005

Similarly, another event that could explain a lower score in the PISA test is its overlap with the exam week in May. However, only students in public and charter schools take these exams, while those in private schools take their exams on different dates. Thus, adding the interaction term between the May exam week and attendance at a non-fee-paying school to the baseline regression (2) identifies the impact of this event. The resulting regression model with this interaction effect is a difference-in-differences estimation in which the treatment and control groups are students who took the PISA test in May and belonged to a non-fee-paying (public or charter) school and to a fee-paying (private) school, respectively.

Tables 5 and 6 show the estimates of the strategies explained in the previous paragraphs. For the sake of simplicity, only the focus parameters are reported. Consistent with intuition, private school students, who are not affected by the May exams, perform better than their counterparts in non-fee-paying schools, but the effect is not significant at conventional significance levels. Likewise, the overlap of the PISA and Madrid tests does not explain the performance differences between repeater and non-repeater students. Overall, these results suggest that a crowded schedule due to overlapping with other tests or exams did not significantly impact the students affected by these events.

Table 6. Difference in difference estimation for the impact of the PISA and week exam in May overlap on non-repeater students.

https://doi.org/10.1371/journal.pone.0309980.t006

Since PISA asks school principals to complete a questionnaire covering the school system and school characteristics [14], and the timing of the PISA test is not randomly allocated but depends on schools’ requirements, it is possible that principals and managers of schools that take the PISA test during the first weeks are better able to anticipate the potential risks associated with a crowded schedule. This could affect the validity of our estimation results if students’ performance in PISA is also correlated with the ability of school principals. To address this concern, we repeat the previous regression analyses with an estimation sample that only considers students who took the PISA test within two weeks of the Madrid tests. Table 7 shows the results of this estimation. As our analysis does not show any significant results, we believe that the reason may be a combination of different factors that cannot be estimated, for example, the quality and experience of the test administrators, students negatively disposed towards the PISA test, or students’ well-being. Furthermore, while biased significant results are a common concern in empirical studies, the results of our analysis are not significant.

Table 7. Difference in difference estimation for the impact of the PISA test within two weeks of the Madrid tests.

https://doi.org/10.1371/journal.pone.0309980.t007

6. Concluding remarks

This paper examines the Madrid region’s PISA 2018 results to identify possible causes of the anomalous results in that edition. Our hypothesis is that the results were affected by two events in the Madrid region: 1) participation in the regional external and standardized test that all students in their final year of compulsory school must take; and 2) the 2017/18 academic calendar change in the region. In both cases, their impact is estimated using a difference-in-differences approach. We suspect that the regional test affected student performance in PISA because the PISA test took place from April 15th to May 30th, while the regional test took place on April 26th–27th. To identify the effect of this event, we leverage the fact that our control group, grade-repeater students, take only the PISA test and not the regional assessment, whereas non-repeaters (the treatment group) take both exams. The analysis includes a set of observable individual, family, and school characteristics that could explain individual differences for reasons other than the treatment. The implicit assumption in our approach is that, once we control for these characteristics, the remaining performance differences should be entirely explained by the treatment.

We find that taking the PISA test after the regional test in Madrid negatively impacted performance, but the effect is not significant at conventional significance levels. Additionally, the correlation of PISA scores with the exam week in May is minor. Therefore, these results do not suggest that other exams can explain the deficient performance of Madrid students in PISA.

Another event that could explain a lower score in the PISA test is its overlap with the exam week in May. However, only students in public and charter schools take these exams, while those in private schools take their exams on a different date. We conducted a difference-in-differences estimation in which the treatment and control groups are students who took the PISA test in May and belonged to a non-fee-paying (public or charter) school and to a fee-paying (private) school, respectively. As expected, private school students, who are not affected by the May exams, perform better than their counterparts in non-fee-paying schools, but the effect is not significant at conventional significance levels. Similarly, the overlap of the PISA and Madrid exams does not explain the performance differences between repeater and non-repeater students. These results indicate that a crowded schedule resulting from overlaps with other exams had no significant impact on the students affected by these events.

It is possible that the atypical results in Spain stem from a combination of factors beyond the scope of our measurements. These results may merit a follow-up study, as in Brevik and Hellekjær [17], to obtain more conclusive evidence.

Future research could focus on identifying strategies or interventions that educational institutions can implement to reduce the negative effects of crowded exam schedules. For example, staggered exam schedules or additional support during busy testing periods could be explored, as well as the long-term effects of crowded exam schedules on students’ academic performance, motivation, and well-being. Longitudinal studies could also provide insights into how testing schedules impact students’ educational trajectories. For instance, poor performance caused by a crowded schedule may prevent students from obtaining a scholarship and accessing higher education, and low grades may discourage students from continuing their studies.

References

  1. Díaz DA, Jiménez AM, Larroulet C. An agent-based model of school choice with information asymmetries. J of Simulation. 2021;15(1–2): 130–147. https://doi.org/10.1080/17477778.2019.1679674
  2. Hastings JS, Weinstein JM. Information, school choice, and academic achievement: Evidence from two experiments. Q J Econ. 2008;123(4): 1373–1414. https://doi.org/10.1162/qjec.2008.123.4.1373
  3. PISA 2018 Results (Volume I): What students know and can do. OECD, 2019.
  4. Wiliam D. Standardized testing and school accountability. Educ Psychol. 2010;45(2): 107–122. https://doi.org/10.1080/00461521003703060
  5. Ebenstein A, Lavy V, Roth S. The long-run economic consequences of high-stakes examinations: Evidence from transitory variation in pollution. Am Econ J Appl Econ. 2016;8(4): 36–65. https://doi.org/10.1257/app.20150213
  6. Wen J, Burke M. Lower test scores from wildfire smoke exposure. Nat Sustain. 2022;5(11): 947–955.
  7. Park RJ, Goodman J, Hurwitz M, Smith J. Heat and learning. Am Econ J Econ Policy. 2020;12(2): 306–339.
  8. Langford IH, Lewis T. Outliers in multilevel data. J R Stat Soc Ser A. 1998;161(2): 121–160. https://doi.org/10.1111/1467-985X.00094
  9. PISA Annex A9. A note about Spain in PISA 2018: Further analysis of Spain’s data by testing date (updated on 23 July 2020). OECD, 2020.
  10. Hopkins D, Pennock D, Ritzen J, Ahtaridou E, Zimmer K. External evaluation of the policy impact of PISA. OECD, 2008.
  11. PISA 2000 Technical Report. OECD, 2002. https://doi.org/10.1787/19963777
  12. PISA 2006 Technical Report. OECD, 2009. https://doi.org/10.1787/19963777
  13. PISA 2012 Technical Report. OECD, 2014. https://doi.org/10.1787/19963777
  14. PISA 2015 Technical Report. OECD, 2017. https://doi.org/10.1787/19963777
  15. PISA 2018 Technical Report. OECD, 2020. https://doi.org/10.1787/19963777
  16. Kuehn Z, Landeras P. Study time and scholarly achievement in PISA. MPRA Paper No. 49033; 2012.
  17. Brevik LM, Hellekjær GO. Outliers: Upper secondary school students who read better in the L2 than in L1. Int J Educ Res. 2018;89: 80–91. https://doi.org/10.1016/j.ijer.2017.10.001