The long-run effects of secondary school track assignment

This study analyzes the long-run effects of secondary school track assignment for students at the achievement margin. Theoretically, track assignment maximizes individual outcomes when thresholds between tracks are set at the level of the indifferent student, and any other thresholds would imply that students at or around the margin are better off by switching tracks. We exploit non-linearities in the probability of track assignment across achievement to empirically identify the effect of track assignment on educational attainment and wages of students in the Netherlands, who can be assigned to four different tracks. We find that attending higher tracks leads to increases in years of schooling by around 1.5 years for students at the lowest and the highest choice margin, and wage gains of around 15% and 5%, respectively. For the margin between the two middle tracks, attending the higher of the two tracks has no effect on educational attainment and decreases wages by around 12%. The negative returns for the medium margin and the relatively low returns for the higher margin (compared to the required educational investments) are partly mediated by motivation and study choice.


Introduction
The grouping of students according to educational achievement is common across educational systems worldwide. The motivation for such practices is that different students may require different environments and different levels of instruction to optimally develop their skills. Anglo-Saxon countries generally address this by selecting students into ability groups for different school subjects, while many Continental European countries sort secondary school students into different tracks that each have their own distinct curriculum. The use of tracking is continuously debated in both the public and the academic domain. Early empirical studies in economics have typically focused on the efficiency and equity implications of tracking systems, or on the effect of the exact age at which tracking takes place; see, e.g., [1,2,3,4,5]. Less attention has been paid to the allocation process that sorts students into different tracks. Tracking involves, either explicitly or implicitly, the use of ability thresholds in order to sort students. Students above a particular threshold are deemed fit to attend the higher track, while those below are projected to be better off in a lower track. A crucial question is whether the exact location of such a threshold in the ability distribution is optimal for students that fall to either side of that threshold.
The aim of this paper is to estimate the long-run effects of track assignment for students who are at the achievement margin. The empirical analysis of this study is focused on the Dutch educational system, where track assignment is partly determined by scores on an achievement test taken at the end of primary school (grade 6). This test contains certain threshold scores, which indicate the required ability level for a specific track. In reality, the adherence to these threshold scores by schools is lenient, but they still induce non-linearities in the probability of treatment across achievement that are not proportional to differences in the ability and potential of these students.
This can be exploited to identify the effects of track assignment on future outcomes. We match survey data on Dutch secondary education students with administrative data on educational attainment and job market information in later life. Job market information is available up to an age of 42 years old. The Dutch educational system contains four different tracks in the period under analysis, which implies that there are three choice margins for which the effect of track assignment can be estimated. We find that attending the higher track increases educational attainment by around 1.5 years for the lowest and the highest choice margin. The subsequent labor market returns from attending the higher track are around 15% in the former and around 3-7% in the latter case. In contrast, we find that attending the higher track at the medium margin has no effect on years of schooling, and lowers wages by around 12% in the long run. These effects are partly mediated by study choice and motivation.
The choice of which track to attend typically involves schools, students and parents. Parental aspirations generally lean towards higher/more academic tracks that involve better peer quality and provide more direct paths towards high levels of post-secondary education [6]. Social status considerations and overconfidence could lead students to attend tracks that are too demanding and thereby hinder learning; see, e.g., [7] for evidence on education as a positional good and [8] for evidence on overconfidence in self-assessment. [9] find that a shift from parental to teacher influence for track assignment in Germany has led to a reduction in grade retention in secondary school, providing suggestive evidence that parents are indeed prone to push their children into too demanding tracks. Conformity to the educational level of the parents could make especially those with highly educated parents prone to attend too high tracks (and those with low educated parents to attend too low tracks). Schools face other considerations when setting thresholds for track attendance. Lower entrance requirements attract more students, while higher entrance requirements improve average peer quality within each track, and decrease the risk of costly grade retention. Hence, there are several reasons why the achievement threshold is not set at the indifferent student.
Research on tracking has traditionally focused on estimating effects for school achievement and educational attainment. To assess the effectiveness of tracking or track assignment, it is especially important to look at how tracking affects labor market outcomes, as that is ultimately what students are prepared for in these tracks. In fact, an increase in the number of completed years of schooling represents a cost from an economic perspective, and can only be beneficial when such an investment produces positive returns in meaningful later-life outcomes. While average wage returns to extra schooling are high, they are also strongly heterogeneous [10]. Increasing the size of the higher track can lead to increases in educational attainment simply because more students are eligible for higher levels of post-secondary education. Additionally, shifting from an academic to a vocational focus might lead to decreases in academic achievement, but could be to the benefit of other skills. Recent studies that estimate long-run effects underline this. [11] show that a tracking reform in Romania led to an increase in the number of students that completed an academic track, but not to increases in labor market outcomes. Similarly, [12] analyzes a policy change in Sweden that gave the vocational track a more academic curriculum and identifies an increase in educational attainment in secondary school, but no effect on earnings. Additionally, [13] show that the life-cycle dynamics of students following vocational tracks and students following academic tracks are different, underlining the importance of measuring wage effects at multiple ages across the life cycle.
These studies analyze policy changes that changed the content of tracks. In contrast, we analyze assignment of students to tracks for a given set and content of tracks. As such, this paper relates most closely to a recent study by [14], who estimate the long-run effect of track attendance by exploiting the fact that relatively younger students are less likely to attend higher tracks because of the month of birth effect. Our study estimates a similar effect, but at a different margin. [14] identify a local average treatment effect (LATE) for those who would attend a higher track if they had been born earlier in the year. We identify a different local effect, namely for those who would attend a higher track if their achievement had been marginally higher. This particular LATE reflects a local effect that is critical for decision-making with respect to tracking. It answers whether ability thresholds are indeed set at the indifferent student or whether students around the margin would be better off by switching to another track. Additionally, our study adds to the literature by estimating the effects of track attendance at multiple margins, and for multiple cohorts in time.
The organization of this paper is as follows. Section 2 specifies the theoretical framework of this study. Section 3 gives a brief overview of relevant characteristics of the Dutch educational system. Section 4 discusses the data. Section 4.2 presents the methodology, while main results are discussed in Section 5. Robustness analyses are presented in Section 6. Section 7 concludes.

Theory
Countries differ in how they assign students to tracks, but track assignment generally relies largely on measures of student achievement. This can rely on measures of general ability or on how students perform by type of subject (e.g. in quantitative subjects versus languages). Additionally, in countries in which tracks have strongly differentiated curricula, student preferences are a leading determinant. In our empirical setting, track assignment is based on a measure of overall ability, which is why the theoretical framework is constructed from this perspective as well.
We define an overall ability indicator $\theta_i$, on which track allocation decisions are based. Future outcomes $Y_i$ for students (e.g. future wages) depend on $\theta_i$ and the track $T$ the student attends. We specify a simple linear relation:

$$Y_i = \alpha_T + \beta_T \theta_i$$

In the remainder of this section we assume, without loss of generality, that there are three tracks: low (L), medium (M) and high (H). Each track has its own curriculum (including not only the set of courses, but also level and pace of instruction). The level of each curriculum is geared towards the average ability of students in that track. As such, lower tracks produce more favorable outcomes for low-ability students and higher tracks for high-ability students. In our linear framework, this implies that $\alpha_L > \alpha_M > \alpha_H$ and $\beta_L < \beta_M < \beta_H$. This situation is depicted on the left side of Fig 1. In the figure, it is assumed that allocation of students to tracks is efficient: no student can switch and make themselves better off. In other words, outcomes equal:

$$Y_i = \max_{T \in \{L, M, H\}} \left( \alpha_T + \beta_T \theta_i \right)$$

Fig 1. Allocation of students to tracks: theoretical framework. Note: The figure shows a theoretical depiction of the relation between outcome Y and attendance of low (L), medium (M) and high (H) tracks, across the ability distribution. On the left side, track assignment is efficient and the effective outcome line traces out the maximum outcomes. On the right side, the lower threshold is more lenient and the higher threshold is more strict, leading to discontinuities in outcomes.
In this scenario, the threshold ability levels lie before the first student in the distribution for which the higher track is the optimal choice and after the last student in the distribution for which the lower track is the optimal choice. When thresholds are located at different points, some students can be better off by switching. The effective outcome line will not follow the highest points and discontinuities in the outcome variable across the distribution will appear. The right side of Fig 1 describes the cases where the lower threshold is more lenient and the higher threshold is more strict. In the former case, the lowest-ranked students in the medium track would be better off in the low track and hence there is a downward jump between the highest low-track students and the lowest medium-track students. In the latter case, the highest-ranked medium-track students would be better off in the higher track and there is an upward jump in the effective outcome line.
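The logic of efficient and misplaced thresholds can be illustrated with a minimal numerical sketch. The intercepts and slopes below are hypothetical values chosen only to satisfy the ordering assumed in the framework; they are not estimates from this study. Under the linear relation, the indifferent student between a lower and a higher track sits at the ability level where the two outcome lines cross, i.e. at (alpha_lower - alpha_higher) / (beta_higher - beta_lower).

```python
# Hypothetical (alpha, beta) pairs: intercepts fall and slopes rise with
# track level, as the framework assumes. These numbers are illustrative only.
TRACKS = {"L": (10.0, 1.0), "M": (8.0, 2.0), "H": (4.0, 3.0)}


def outcome(track, theta):
    """Linear outcome for a student with ability theta in a given track."""
    alpha, beta = TRACKS[track]
    return alpha + beta * theta


def indifference_threshold(lower, higher):
    """Ability level at which the two tracks yield the same outcome."""
    a_lo, b_lo = TRACKS[lower]
    a_hi, b_hi = TRACKS[higher]
    return (a_lo - a_hi) / (b_hi - b_lo)


def assign(theta, cut_lm, cut_mh):
    """Assign a track from two ability thresholds."""
    if theta < cut_lm:
        return "L"
    if theta < cut_mh:
        return "M"
    return "H"


def jump_at(cut, lower, higher):
    """Outcome change when crossing a threshold from the lower to the higher track."""
    return outcome(higher, cut) - outcome(lower, cut)


cut_lm = indifference_threshold("L", "M")  # efficient L/M threshold
cut_mh = indifference_threshold("M", "H")  # efficient M/H threshold

# Efficient thresholds: the effective outcome line is continuous.
efficient_jumps = (jump_at(cut_lm, "L", "M"), jump_at(cut_mh, "M", "H"))

# Lower threshold too lenient, higher threshold too strict.
lenient_lower_jump = jump_at(cut_lm - 0.5, "L", "M")   # downward jump
strict_higher_jump = jump_at(cut_mh + 1.0, "M", "H")   # upward jump
```

With these illustrative numbers, a lenient lower threshold produces a negative jump at the L/M margin and a strict higher threshold produces a positive jump at the M/H margin, which is the sign pattern the empirical analysis looks for.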
In reality, $\theta_i$ is not directly observed and is typically proxied with (noisy) measures of school achievement. This also applies to the empirical setting of this paper. Using a noisy ability signal for student sorting automatically implies that allocation is not fully efficient and that some students would be better off in a track other than the one to which they were assigned. Given that noisy ability signal, outcomes are still maximized when the threshold achievement level is set at the indifferent student (assuming noise is symmetrically distributed). In that case, average outcome lines would still be equal to the situation in Fig 1, and the average effective outcome line is smooth across the distribution. Put differently, efficiency gains can be made by using less noisy achievement measures or by placing thresholds at a more optimal position. Our focus in this paper is on the latter.
We do not model peer effects explicitly; they are incorporated within $\alpha_T$ and $\beta_T$. If one assumes a constant positive effect of better peers, $\alpha_L$ would decrease (i.e. Track L is shifted downward) and $\alpha_H$ would increase (i.e. Track H is shifted upward). Hence, peer effects reduce the part of the distribution for which the lower track is optimal and increase the part of the distribution for which the higher track is optimal. Explicitly modeling peer effects in the framework, however, provides no added value in the context of this study.
Additionally, we explicitly take an individual point of view. 'Efficient' assignment as in Fig 1 means that individuals cannot switch and be better off. The optimal allocation of students to tracks from a social welfare perspective represents a separate policy question. Changing effective thresholds also changes peer quality, pace of instruction, etc. This would lead to changes in the parameters $\alpha_T$ and $\beta_T$. If one assumes that all students benefit from higher average peer ability, then thresholds that maximize individual outcomes given the assignment of the rest of the distribution (as in Fig 1) are too lenient from a social welfare perspective, ceteris paribus (more so if we also consider general equilibrium effects). Stricter thresholds would be preferable, since they increase peer quality in the tracks on both sides of the threshold. Nonetheless, the framework as depicted is valuable as it identifies gains and losses for individuals who are at the choice margin for two specific tracks, which represents crucial information for students and their parents who face this choice. When we refer to 'optimal' track allocation in the remainder of this paper, this pertains to this individual perspective and not to a social optimum.
As argued in the introduction, the impact of track assignment can potentially differ between different outcome measures. In the context of the theoretical framework, this means that $\alpha_T$ and $\beta_T$ depend on the defined outcome variable. For example, positive impacts on educational attainment could arise even in the absence of better student learning, because higher tracks make students eligible for higher levels of post-secondary education. Although the literature clearly shows that the average return to an extra year of schooling is positive and substantial, this return can be different for students who are induced to prolong their educational career because of more lenient requirements. This underlines the value of also measuring the labor market effects of such treatments.

Dutch educational system
The Dutch educational system is characterized by relatively early tracking and a high number of tracks. Fig 2 provides a schematic overview from primary to tertiary education. Primary education lasts six years (preceded by two years of kindergarten). The focus of this study is on tracking in secondary education, and on how this influences post-secondary trajectories.

Secondary education
After finishing primary school, students in the Netherlands are selected into four main tracks: lbo (vocational), mavo (lower general), havo (higher general) and vwo (pre-university). Students can be relegated to lower tracks after track assignment in case of low achievement, but moving to higher tracks is generally only possible when the current track has been completed. The lbo and mavo tracks were merged in 1999 into the joint vmbo track. Within vmbo, a practical and a theoretical subtrack still exist. For the vast majority of students in our sample, the system with four tracks applies. For the remainder of this paper, we refer to the available tracks as T1 for lbo, T2 for mavo, T3 for havo and T4 for vwo.
A leading determinant of track assignment is the 6th grade exit test that primary schools are obliged to administer. While several alternative tests exist, a large majority of 85% of Dutch primary schools administer the standardized 'Cito test' [15]. The obtained Cito score is connected to a 'teacher recommendation' for any of the four tracks, which is given by the 6th grade teacher and sent to the prospective secondary school. Recommendations can be mixed, when students' achievement level is around the margin of the required level for a certain track. Students and their parents can decide to deviate from the track recommendation, if secondary schools allow for this.
The latter are obliged by law to consider at least one of the two sorting mechanisms (test score and/or teacher recommendation) when admitting or sorting students. However, secondary schools are free to decide their exact assignment rules, and to deviate from the score thresholds that the manufacturers of the Cito test report.
There is an additional opportunity for students to be selected into a track that does not correspond to their Cito score, because final selection can be postponed until the second or third year of secondary education. This occurs through the existence of temporary comprehensive grades (so-called 'bridge-grades'), where students of two or more tracks are still kept together. This is most common for T3 and T4 students, while students in the lower two tracks are generally selected early. In our sample, roughly 90% of T3 and T4 students are in a comprehensive grade for at least one year, and 35% for two years. About 75% of students in T1 and T2 are already in a specific track in the first year of secondary education. When track assignment is postponed to later years, it is also based on student achievement, most commonly through school-specific requirements with respect to grade point averages.
Hence, while achievement is the key driver of track sorting in the Netherlands, these institutional features imply that assignment is far from deterministic and that there is considerable leeway in getting into tracks that do not match up with 'eligibility' status of students if we would follow only the achievement test. We elaborate on the patterns of student sorting by achievement and their implications for the empirical approach in Section 4.2.

Post-secondary education
The Dutch educational system has three levels of post-secondary education. The lowest level is mbo, which has a vocational orientation. Higher education can be divided into two categories. Hbo provides higher professional education (also known as vocational university), while wo consists of university education. Students with a high school diploma from the T3 track or higher can enter hbo, while wo is only available for students who complete T4. Completing the highest level of mbo makes one eligible for entering hbo as well, while completing the first year of hbo gives direct access to university education. Hence, it is still possible for students from lower tracks to complete higher education, although the route is less direct and involves additional time.
Nonetheless, around 15% of students from the T2 track in our data sample still complete higher education. For the most recent cohorts, this even equals 22% (see Appendix Figure A1).

Data
For the empirical analysis, we link several data sources. Secondary school data are collected from the Secondary Education Cohort Studies. These are large representative longitudinal surveys of Dutch pupils whose educational career was followed from the first year of secondary education (around age 12) until they leave full-time education. They are carried out by Statistics Netherlands and the Groningen Institute for Educational Research; see, e.g., [16,17]. These data include measures of student achievement, student background, and the attended level of education for each year. We use cohort studies starting in 1977, 1983, 1989, 1993 and 1999 when we assume that all these students finish the study they are currently attending. The share of students still in education is negligible for all other cohorts.
The 1989, 1993 and 1999 cohorts also contain data on 9th grade achievement. The test scores in the 1989 and 1993 cohorts suffer from a high number of missing observations. We find that the probability of taking the test is positively related to previous achievement of the student (i.e. the better students in class are more likely to take the test). This can severely confound the estimates of our analysis, since the empirical approach effectively compares students at the margin. For these two cohorts, we use two imputation approaches that assess sensitivity to the missing value issue.
For the 1999 cohort, the number of missing observations is negligible. Grade 9 test scores are available for language, mathematics and general problem-solving. Summary statistics are provided in Table 1. The share of students attending T1 slightly increases over the cohorts, mainly at the expense of T2. As in other developed countries, we observe a steady increase in obtained years of schooling over time. Parental education levels are increasing as well across cohorts, reflecting that this upward trend was already present in earlier cohorts. There is some variation in the mean levels of other background variables, but no clear trend nor marked differences in the background of students across these cohorts.

Non-linearity design
The aim of this study is to estimate the effect of track assignment on educational attainment and wages. As specified in the theoretical framework, one can find the effect of track assignment by identifying discontinuities in outcome variables around the achievement margin between two specific tracks. As such, the setup would be well-suited for a regression discontinuity design (RDD), which exploits discontinuities in treatment at a specific threshold value of a forcing variable. This approach is, however, empirically not feasible here. For one, we only observe the true forcing variable in one of the cohorts (1999). More importantly, the variability in the thresholds that schools adhere to and the use of postponed tracking lead to a very high fuzziness of track allocation around the threshold. Figs 3-5 show the relation between treatment probability and the Entrance Test score, for all three margins (Fig 4 excludes T4 students, as we need the relation between achievement and track assignment to be monotonic; as shown later, including only the lower three tracks is the preferred specification in the empirical analysis). The figures consistently show that a true discontinuity in treatment is absent. Moreover, Appendix Figure A2 shows that in the cohort where we do observe the official Cito score, the pattern of treatment probability across scores is similar to that for the Entrance Test. Hence, the lack of a strong discontinuity in treatment at or around the achievement threshold is not driven by the use of the proxy test, but by the specific dynamics of student sorting in the Dutch system.

Fig 4. Share of students assigned to T3, across test scores and cohorts. Note: The figure shows the share of students that are assigned to the T3 track, for every score on the Entrance Test, separately for all cohorts. Students that attend T4 are excluded in the calculation.

Fig 5. Share of students assigned to T4, across test scores and cohorts. Note: The figure shows the share of students that are assigned to the T4 track, for every score on the Entrance Test, separately for all cohorts.
Although Figs 3-5 fail to show a strong discontinuity in treatment probability, they do exhibit a non-linear pattern that can potentially be exploited. Because track assignment is based on achievement, the probability of being assigned to a certain track is highly responsive to changes in achievement around a certain (implicit) threshold score. The treatment probability increases strongly around this score, while it remains flat in segments before and after. In other words, the increase in the probability of treatment is disproportionately low for increases in score that are far from the threshold and disproportionately high for increases in score around the threshold. This 'disproportionality' can be exploited to estimate the effect of track assignment, through a two-stage model in which the fraction of students that is treated for a given test score acts as an instrument for the treatment indicator. Such a conditional mean approach does not rely on any defined threshold score, as it incorporates all changes in probability within the defined bandwidth.
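To make the conditional mean approach concrete, the following sketch applies it to simulated data. Everything in it is an assumption made for illustration: the score range, the implicit threshold at 40, the unobserved ability term `u`, and the true effect of 1.5 are hypothetical and unrelated to the data used in this study. The sketch computes the share of treated students at each score, uses that share as the instrument, and controls linearly for the score in both stages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (all values hypothetical). S is a discrete test score and
# u is unobserved ability that affects both track attendance and the outcome.
n = 50_000
S = rng.integers(20, 61, size=n).astype(float)
u = rng.normal(size=n)

# Treatment probability is S-shaped in the score: flat far from the implicit
# threshold (score 40) and steep around it.
latent = (S - 40.0) / 4.0 + u + rng.logistic(size=n)
T = (latent > 0).astype(float)

true_effect = 1.5
Y = 2.0 + 0.1 * S + true_effect * T + u + rng.normal(size=n)

# Instrument: share of treated students at each score value.
_, inverse = np.unique(S, return_inverse=True)
share_by_score = np.bincount(inverse, weights=T) / np.bincount(inverse)
Z = share_by_score[inverse]


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# First stage: treatment on the instrument and a linear control in S.
X1 = np.column_stack([const, Z, S])
T_hat = X1 @ ols(X1, T)

# Second stage: outcome on fitted treatment and the same linear control.
X2 = np.column_stack([const, T_hat, S])
iv_effect = ols(X2, Y)[1]

# Naive OLS for comparison; biased here because T is correlated with u.
ols_effect = ols(np.column_stack([const, T, S]), Y)[1]
```

In this simulation the unobserved term is unrelated to the score, so only the non-linear part of the treatment share identifies the effect. The manual two-step procedure returns the 2SLS point estimate, but its second-stage standard errors would not be valid without correction.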
A first condition for applying this approach is that the relationship between the outcome variable and the forcing variable has a different functional form (net of treatment) than the relationship between the treatment variable and the forcing variable. Fig 6 displays the relation between educational attainment or wages and the Entrance Test score, for the sample as a whole. Appendix Figure A3 shows the same relation split across tracks. These figures show a linear pattern for educational attainment in all cohorts and for wages in the three oldest cohorts. The pattern for the 1993 cohort is strongly non-linear as many high-achieving students have not entered the labor market yet, leading to two opposing forces in the relation between achievement and wages. Consequently, the optimal bandwidth is too narrow for a feasible first stage and wage effects are therefore not estimated for the 1993 cohort. A second condition is that there can be no similar non-linearities in other determinants of $Y_i$, as these would be attributed to the treatment effect. There is some degree of non-linearity at the very high end of the distribution, but the number of observations for these high scores is low. We assess sensitivity towards the specified bandwidth in Section 6, which will provide an indication to what extent this can bias results. The pattern for the 1983 cohort is also more erratic, as the test taken in that year is somewhat less predictive of future outcomes.

The model is estimated by Two Stage Least Squares (2SLS), with the share of treated students at each test score as the instrument for the treatment indicator and a linear control function in the test score $S_i$. The approach we use bears similarities to that of [18], although in a different setting. They show that non-linearities in hedonic markets can be exploited in an IV approach through the calculation of conditional mean functions. The approach assumes that all unobserved determinants of the outcome are linearly related to $S_i$ across the bandwidth, and are therefore captured by the control function. In other words, the model assumes that the error term $\varepsilon_i$ is mean independent of $S_i$, i.e. $E(\varepsilon_i \mid S_i) = 0$. This assumption is stronger than for an RD design, which requires that other (unobserved) determinants of the outcome are smoothly continuous at the threshold. On the other hand, our model relies on weaker assumptions than an OLS model, which requires that such determinants are completely unrelated to the treatment.
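One way to write out a two-stage model of this kind is sketched below. The notation is assumed for illustration and need not match the original specification: $T_i$ denotes the indicator for attending the higher track, $\bar{T}(S_i)$ the share of treated students among those with score $S_i$, and $\pi$, $\gamma$ the first- and second-stage coefficients.

```latex
\begin{align}
  \bar{T}(s) &= E\left[ T_i \mid S_i = s \right], \\
  T_i &= \pi_0 + \pi_1 \, \bar{T}(S_i) + \pi_2 S_i + \nu_i, \\
  Y_i &= \gamma_0 + \gamma_1 \, \hat{T}_i + \gamma_2 S_i + \varepsilon_i .
\end{align}
```

In this sketch, $\gamma_1$ is the effect of attending the higher track, and the terms $\pi_2 S_i$ and $\gamma_2 S_i$ play the role of the linear control function. Identification comes only from the part of $\bar{T}(S_i)$ that is not linear in $S_i$, which is why the first stage loses power when the treatment share is close to linear within the bandwidth.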
As mentioned before, part of the low adherence to threshold achievement levels occurs because different schools may employ different thresholds or different levels of strictness in adhering to thresholds. This may result in students that are just below the margin for one school to actively 'shop' for schools until they find a school that allows entry into the higher track. We emphasize that this behavior is not a threat to our identification, as long as our assumption of a linear relation between the outcome and unobservable determinants holds.

Bandwidth selection
As our objective is to estimate track assignment effects at the achievement margin, we mainly want to select observations close to the (implicit) achievement margin. As with the RDD, we therefore need to establish bandwidths for our model and assess whether the first stage of the model is still valid for the resulting subsample. We follow the typical approach used for RDD models by executing a cross-validation (CV) procedure to identify the bandwidth that minimizes the mean squared error of the control function [19]. The CV procedure weighs the additional power of including more observations against the loss in precision from moving further away from the threshold. As there is no explicit threshold score, we set the implicit threshold at the score where the treatment probability surpasses the 50% mark. We then extend the bandwidth step-wise from this point on each side of the threshold, and assess which particular score range minimizes the mean squared error.
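The two steps described above can be sketched as follows. This is a simplified stand-in and not the CV procedure of [19]: it locates the implicit threshold as the lowest score at which the treated share exceeds 50%, and then compares symmetric candidate windows by the leave-one-out squared error of a regression of the outcome on the score and the treatment indicator. The function names, the candidate half-widths and the toy data are all hypothetical.

```python
import numpy as np


def implicit_threshold(score, treated):
    """Lowest score at which the treated share exceeds 50%."""
    values = np.unique(score)
    shares = np.array([treated[score == v].mean() for v in values])
    above = values[shares > 0.5]
    if above.size == 0:
        raise ValueError("treated share never exceeds 50%")
    return float(above.min())


def loo_mse(X, y):
    """Leave-one-out mean squared error of a least-squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    # Hat-matrix shortcut: LOO residual = residual / (1 - leverage).
    leverage = np.sum((X @ np.linalg.pinv(X.T @ X)) * X, axis=1)
    return float(np.mean((resid / (1.0 - leverage)) ** 2))


def select_bandwidth(score, treated, outcome, threshold, candidates):
    """Return the half-width with the lowest LOO error, plus all errors."""
    errors = {}
    for h in candidates:
        window = np.abs(score - threshold) <= h
        X = np.column_stack(
            [np.ones(window.sum()), score[window], treated[window]]
        )
        errors[h] = loo_mse(X, outcome[window])
    return min(errors, key=errors.get), errors


# Toy data (hypothetical): four students per score, treated share rising
# from 0 to 1 between scores 18 and 22.
rng = np.random.default_rng(1)
score = np.repeat(np.arange(10.0, 33.0), 4)
slot = np.tile(np.arange(4), 23)
treated = (slot < np.clip(score - 18, 0, 4)).astype(float)

threshold = implicit_threshold(score, treated)

# Outcome is linear in the score near the threshold and curved further away.
curvature = 0.05 * np.maximum(np.abs(score - threshold) - 5, 0) ** 2
outcome = score + 0.5 * treated + curvature + rng.normal(scale=0.1, size=score.size)

best, errors = select_bandwidth(score, treated, outcome, threshold, [2, 5, 9])
```

In the toy data the widest window reaches into the curved part of the outcome relation, so its prediction error is larger and it is not selected. This mirrors the trade-off between more observations and a poorer fit of the linear control function.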
The bandwidths for all treatment effects and cohorts are reported in Table 3. Appendix Table A1 shows the first stage results, applying each of these bandwidths. All estimates of first stage power are well above conventional thresholds (with the exception of the estimation with respect to wages in the 1993 cohort, as argued before).
The first stage is sufficiently strong in these cases because these bandwidths are rather wide, thereby ensuring that the relation between $S_i$ and $\theta_i$ is still non-linear within the estimation window. These bandwidths are nonetheless optimal according to the CV exercise, which confirms that the (net) relation between $S_i$ and the outcomes is strongly linear, and that the linear control function still provides a very good fit when moving further away from the implicit achievement threshold.
One may still question whether this conditional mean approach estimates treatment effects at the margin, or whether the estimates are driven by the potentially differential treatment effects of those who are relatively further from the margin. We assess this concern by conducting a heterogeneity analysis (Section 5.4) and by inducing variation in sample composition (Section 6.3). Moreover, Section 6.3 will assess whether results still hold for narrower bandwidths (with the caveat that very narrow bandwidths are not feasible in the model because the relation between $S_i$ and $\theta_i$ would become linear).

Results
We now present the estimated effects of track assignment at all three margins. The coefficients always represent the effect of attending the higher of the two tracks at the margin. Our main outcome variables are wages and educational attainment. We also estimate the effect of track assignment on school achievement in grade 9 (Section 5.3).

[...] This is, however, solely due to the control for gender (i.e. women are more likely to attend T2 vs. T1 for a given score, while they earn lower wages). As expected, those attending the higher track at the margin have higher educated parents from higher social classes, and controlling for these variables lowers treatment effects in the OLS model in all cases.

Educational attainment
The estimates of the long-run effects of tracking for the main IV model are presented in Table 3.

Table 2 notes: * Significant at 10% level; ** significant at 5% level; *** significant at 1% level. The table shows the effect of track attendance on years of schooling (YoS) and the log of the average monthly wage, estimated with an OLS model that regresses the outcome on track attendance and the Entrance Test score, for the three choice margins. Results are presented separately for a model without and a model with controls (for a list of control variables, see Fig 7). We apply bandwidths and control functions as suggested by the cross-validation procedure (the same as in the IV model; see Table 3 for exact ranges). 'Wage' takes the average wage over the period 2001 to 2007. Standard errors are in parentheses and are robust and corrected for clustering at the school level.

[...] significance in 1983 and 1989, but effect sizes are rather consistent. For T4 vs. T3 treatment, effects gradually increase by cohort, which could be related to the fact that higher education attendance has increased markedly over the same period. All estimates for the effect of T3 vs. T2 are statistically insignificant. Standard errors are large, especially for the 1983 and 1989 cohorts, but the point estimates are consistently low. Hence, it appears that attending a higher track for students at this specific margin does not translate into higher educational attainment. Additional analysis (not shown) indicates that students who attend T3 over T2 at the margin still predominantly complete T3 and also obtain slightly more hbo diplomas, but are simultaneously less likely to complete any post-secondary education.

Wages
Columns 2, 4 and 6 of Table 3 report the estimated effects on wages. Estimates of the effect of track attendance on wages are statistically significant and negative for the T3 vs. T2 margin. This contrasts with the other two margins, where attending the higher track improves wages, and also with the lack of an effect on years of schooling. This suggests that marginal students are often pushed into a T3 track when they would be better off in a T2 track, from a lifetime earnings perspective. It is not surprising that this is not yet reflected in the years of schooling results as the T3 students attend a higher track and are eligible for higher levels in post-secondary education. One could therefore say that the lack of a positive effect for educational attainment already signals a high potential for sorting too many students into T3 at the T2/T3 margin.
The question remains what concretely drives these negative wage effects. As mentioned before, the near-zero effects on average years of schooling hide a shift away from mbo diplomas towards slightly more hbo diplomas, combined with fewer post-secondary diplomas overall. It could be that the return to mbo diplomas is comparatively strong at this margin. Data on study choice might provide an additional explanation. We find that attendance of T3 over T2 at the margin leads to increases in study choice towards health and, especially, humanities, and a decrease in exact sciences and, especially, economics (see Appendix Table A2). Average wages are considerably higher in the latter compared to the former. Hence, study choice can explain part of these negative wage results. We can only speculate on the reason why these study choice patterns emerge, but it could be that being in a more demanding track and being ranked lower compared to classroom peers leads students to move away from studies that are perceived as having 'more challenging' curricula.
Comparing the IV estimates (Table 3) to the OLS estimates (Table 2), the results for the two higher choice margins are more favorable for attending the higher track in the OLS model. This likely reflects the positive bias in the ATE estimates due to students self-selecting into tracks. For T2 vs. T1 treatment, the positive wage estimates are higher for the IV model. As a strong negative bias in the ATE appears unlikely, this is likely driven by the local nature of IV estimates. The IV model estimates the effect of T2 vs. T1 for students who are induced to switch to a higher track when they have higher achievement levels. These could be especially ambitious students, who might also attend relatively better schools (they are more likely to attend schools with stronger entrance requirements, which is likely to imply better peer quality). This could explain the relatively high local estimates in this particular case.

School achievement
Table 4 shows the effect of track assignment on school achievement in grade 9. While the focus of this study is on the long-run effects of track attendance, achievement could provide a potential mechanism towards such long-run effects. As explained in Section 4, the 9th grade test results in the 1989 and 1993 cohorts contain a large share of (selective) missing values. To deal with this, we provide two imputation approaches. The first imputes missing tests from the relevant domain of the Entrance Test. Taking into account that those that did not take the test might have developed especially poorly between grades 7 and 9, we subtract half a standard deviation from the imputed values in the second imputation approach. Comparing these alternative approaches provides an indication of the robustness of the results to the issue of missing values. Panel B of Table 4 shows results for the 1999 cohort, in which the share of missing tests is negligible. The data for this cohort also contains a problem-solving test. There are only two margins to estimate here, as tracks T1 and T2 were merged in 1999.
Panel A of Table 4 shows that results are indeed sensitive to imputation, but the positive treatment effect for T4 vs. T3 appears robust, especially for language. Positive effects at this margin are also identified for the 1999 cohort. In Panel A, there is no (robust) evidence for an effect on achievement at the other margins. In contrast, panel B provides positive effects at the lower margin for both language and math. The impact of track assignment on the problem-solving test is low and statistically insignificant for both margins. This could reflect that such skills are less driven by instruction and peer effects or, more generally, difficult to influence at later ages.
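The two imputation variants compared in Panel A can be illustrated with a minimal sketch. It assumes that both the grade 9 score and the Entrance Test score are standardized, so that half a standard deviation equals 0.5. The function name `impute_grade9`, the argument `penalty_sd` and the toy numbers are hypothetical and are not taken from the study's data or code.

```python
import numpy as np

def impute_grade9(grade9, entrance, penalty_sd=0.0):
    """Fill missing grade 9 scores (NaN) with the Entrance Test score of the same domain.

    Assumes both scores are standardized (mean 0, SD 1), so `penalty_sd`
    is expressed in standard deviations.
    """
    grade9 = np.asarray(grade9, dtype=float)
    entrance = np.asarray(entrance, dtype=float)
    missing = np.isnan(grade9)
    # Observed scores are kept; missing ones get the entrance score minus the penalty.
    return np.where(missing, entrance - penalty_sd, grade9)

# Toy data: two students lack a grade 9 math score.
grade9_math = np.array([0.3, np.nan, -0.8, np.nan])
entrance_math = np.array([0.1, 0.5, -1.0, -0.2])

impute1 = impute_grade9(grade9_math, entrance_math)                  # 'Impute 1'
impute2 = impute_grade9(grade9_math, entrance_math, penalty_sd=0.5)  # 'Impute 2'
print(impute1)
print(impute2)
```

Estimating the same model on both imputed series and comparing the coefficients then gives a rough bound on how much selective non-response could matter.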
The positive results for language and math for the lower margin appear to contradict panel A, but these results are difficult to compare because of the merger of the two lowest tracks. This has effectively created a new margin. Moreover, the 1999 cohort constitutes the first cohort after the policy change and we could therefore also pick up on transition effects. We conclude that there appear to be positive achievement effects for T4 vs. T3 at the margin, while we do not find any achievement effects at the lower margins under the old system. It should be emphasized that these tests elicit academic achievement, while attending vocational tracks can potentially benefit students' skills in non-academic disciplines. The data to test the impact of track attendance on vocational skills are not available.
Notes: *Significant at 10% level, **Significant at 5% level, ***Significant at 1% level. The table shows the IV estimates of the effect of track assignment on school achievement in grade 9, using Model (3). 'Impute 1' imputes missing scores from the same domain on the Entrance Test. 'Impute 2' additionally subtracts 0.5 standard deviation from the imputed values. Panel B shows results for the 1999 cohort, in which T1 and T2 now form one merged track, separately for when using the Entrance Test (low stakes) or the Cito test (high stakes) as forcing variable. The new merged track is labeled as 'T1/T2'. 'PS' refers to a problem-solving test. Standard errors are between parentheses and are robust and corrected for clustering at the school level. Bandwidths are based on the cross-validation procedure and are between brackets. The full range for the Cito test score is [501-550].
The identified effect for T4 vs. T3 appears predominantly driven by peer effects. Adding average peer quality in class as a control leads to low and statistically insignificant estimates. The literature on peer effects suggests that an increase in peer quality of 1 standard deviation leads to an increase in individual achievement by around 0.40 of a standard deviation [20]. The difference in peer achievement between T3 and T4 is around 0.75 of a standard deviation, indicating the effect sizes are in line with what the literature predicts on the basis of peer effects.
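The comparison with the peer-effects literature rests on a simple multiplication, which can be written out as a back-of-envelope check. The sketch assumes that the peer effect scales linearly with the size of the gap in peer quality. That linearity is an assumption of this illustration, and the variable names are hypothetical.

```python
# Back-of-envelope check using the two figures quoted in the text.
effect_per_sd_peer_quality = 0.40  # individual gain (in SD) per 1 SD of peer quality [20]
peer_gap_t4_vs_t3 = 0.75           # difference in peer achievement between T4 and T3, in SD

# Assumes the peer effect is linear in the size of the peer-quality gap.
implied_effect = effect_per_sd_peer_quality * peer_gap_t4_vs_t3
print(f"Implied achievement effect: {implied_effect:.2f} SD")
```

Under these assumptions, peer composition alone would account for an achievement effect of about 0.30 of a standard deviation.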
The 1999 cohort also contains data for the high-stakes Cito test, which allows a comparison of using either test as forcing variable. Panel B of Table 4 shows that using either the Entrance Test or the Cito test leads to similar estimates. This result is in line with the observation from Figure A2 that the Cito score is not necessarily a stronger predictor of track assignment. Hence, having only a proxy for the true selection test does not appear to impact the estimates in a strong way.
These results indicate an improvement in cognitive skills from attending the higher track, at least at the higher margin. Track attendance can potentially also affect non-cognitive skills, which are also highly important for future earnings [21]. The data are relatively limited when it comes to such outcomes; for example, measures of the Big Five personality traits are missing. Cohorts 1977, 1989 and 1993 do contain 9th grade measures of school enjoyment and 'need for achievement'. The latter can be seen as a proxy for motivation, and has been shown to be highly predictive of educational outcomes [22]. Effects for school enjoyment are low, and only statistically significant for T4 vs. T3 treatment in 1989 (with positive sign). Effects for 'need for achievement' are positive for T2 vs. T1 treatment and strongly negative for T3 vs. T2 treatment in all cohorts. Hence, the long-run treatment effects at these margins could be partially driven by intermediate effects on non-cognitive skills.

Heterogeneity across background characteristics
We further analyze whether the effect of track assignment differs across observable characteristics (these results are not shown but available on request). For this purpose, we include an interaction between the attended track and specific background characteristics and instrument this variable with an interaction between the instrument and the specific characteristic. In general, the precision of the interaction estimates is low, especially for the T3 vs. T2 margin, but they reveal some interesting patterns. We first of all look at interactions with background characteristics. For gender, point estimates suggest weaker positive effects on years of schooling for women at the lower margin, but these are not statistically significant. We identify a statistically significantly stronger effect for women with respect to years of schooling at the higher margin in 1977, but not for any of the other cohorts. Point estimates with respect to wages are consistently low. We further estimate interactions with parental education. Conformity suggests that those with highly educated parents could be especially prone to be assigned to too demanding tracks. We do not find evidence for this. We identify slightly weaker positive effects for those with higher educated parents with respect to years of schooling at the higher margin, and also somewhat for wages (only for the 1989 cohort). All other interaction estimates are low and statistically insignificant. A potential explanation is that higher educated parents also have more means to support their child if it is struggling in a more demanding track.
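The interacted specification, in which the track-by-characteristic term is instrumented with the instrument-by-characteristic term, can be sketched on simulated data. This is not the authors' code. The data-generating process, the helper `two_sls`, and all variable names (`score`, `female`, `p_assign`) are hypothetical, and `p_assign` merely stands in for an assignment probability that is non-linear in achievement.

```python
import numpy as np

def two_sls(y, endog, exog, instruments):
    """Two-stage least squares. Coefficients are ordered [endog..., exog...]."""
    X = np.column_stack([endog, exog])
    Z = np.column_stack([instruments, exog])
    # First stage: project every regressor on the instruments plus exogenous controls.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress the outcome on the projected regressors.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 100_000
score = rng.uniform(-1, 1, n)                 # achievement, centred at the margin
female = rng.integers(0, 2, n).astype(float)  # background characteristic
u = rng.normal(size=n)                        # unobserved trait driving selection

# Instrument: an assignment probability that is non-linear in the score.
p_assign = 1 / (1 + np.exp(-6 * score))
# The attended track also depends on u, which makes it endogenous.
track = (rng.uniform(size=n) < 1 / (1 + np.exp(-(6 * score + u)))).astype(float)

# Simulated outcome: track effect of 1.0 for men and 1.0 + 0.5 for women.
y = (1.0 * track + 0.5 * track * female + 0.7 * score + 0.3 * female
     + 0.5 * u + 0.5 * rng.normal(size=n))

exog = np.column_stack([np.ones(n), score, female])           # linear control function
endog = np.column_stack([track, track * female])              # track and its interaction
instruments = np.column_stack([p_assign, p_assign * female])  # instrument and its interaction

coefs = two_sls(y, endog, exog, instruments)
main_effect, interaction_effect = coefs[0], coefs[1]
print(round(main_effect, 2), round(interaction_effect, 2))
```

In this simulation both coefficients should land close to the values built into the outcome equation. With the much smaller samples of a real cohort study, the interaction term is estimated far less precisely, which matches the low precision reported above.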
Additionally, we estimate interactions between treatment and achievement indicators, namely the Entrance Test score and the teacher recommendation given at the end of primary school. These estimations provide insights into how effects differ by perceived ability, and into the extent to which the main estimates truly reflect the treatment effect for those at the margin (namely those with a mixed teacher recommendation). With respect to the Entrance Test score, we identify positive point estimates as one would expect given the depiction in the theoretical model, but these are generally low and only statistically significant with respect to years of schooling at the highest margin. Hence, we do not identify strong heterogeneity by achievement, but this is likely the result of a lack of statistical power and the fact that students from a specific track can be concentrated within a relatively small range. In other words, a large part of the range of the functions portrayed in Fig 1 is not observed in reality. Interactions with the teacher recommendation show similar results as for the Entrance Test. More importantly, the (summed) point estimates for those with a mixed recommendation are very close to the main estimates, across cohorts, outcomes and margins. In other words, the identified effect in the main model appears highly representative of the group with a teacher recommendation right at the margin.

Discussion
For students at the lowest and at the highest choice margin, being assigned to the higher track provides a return in the form of higher wages, but at the cost of time and resources spent on education. The wage gains can be a complete result of the increase in educational attainment, but other mechanisms can be in effect as well. As such, we cannot state that the wage gain represents the 'return' on the extra investment in schooling, but if one wants to assess whether individuals are (economically) better off when attending the higher track, years of education and wages represent the relevant costs and benefits of that choice. From that perspective, one extra year of education has a 'return' of around 10% on monthly wages for T2 vs. T1 treatment. For T4 vs. T3 treatment, the payoff is around 7% for the 1977 cohort, and negligible if we look at average wages for the 1983 and 1989 cohorts. If we take the more recent wage data for the 1989 cohort, the return for an extra year of education is around 3% at age 30.
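The per-year 'return' for the lowest margin follows from dividing the wage effect by the effect on years of schooling. The sketch below uses the rounded figures reported in this paper (a wage gain of around 15% and around 1.5 extra years of schooling). It is a simple ratio that ignores the fact that wages are estimated in logs, and the variable names are hypothetical.

```python
# Rounded estimates for the T2 vs. T1 margin, as reported in the paper.
wage_gain_t2_vs_t1 = 0.15    # approximate wage gain from attending T2 over T1
extra_years_schooling = 1.5  # approximate increase in years of schooling

# Simple ratio: wage gain per additional year of schooling.
return_per_year = wage_gain_t2_vs_t1 / extra_years_schooling
print(f"Wage gain per extra year of schooling: {return_per_year:.0%}")
```

As stressed in the text, this ratio is a cost-benefit summary and not a causal return to schooling, since the wage gain may run through channels other than educational attainment.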
The return to schooling is generally found to be around 8 or 9% in the literature; see, e.g., [23,24,25]. However, these averages hide a considerable amount of heterogeneity, and might not be representative of returns at the achievement margin. Returns to schooling at the margin of dropout are often found to be especially high [25]. This could also explain why our wage estimates are larger at the lowest margin. [10] estimate an average return to an extra year of college of 14%, but a marginal return that ranges from 1.5% to 8.5% (depending on the specific margin the policy change affects). The 'return' we identify for T4 treatment is in line with those effect sizes.
The results for the highest margin may be seen as especially low given the positive effects on achievement. However, earlier cited studies have shown that attendance of higher tracks often leads to short-term increases in achievement but no or weak labor market returns. Study choice may also provide an explanation for the low returns. For the T4 vs. T3 margin, we also find a relative increase in attendance of post-secondary studies with lower average wages, mainly towards humanities (see Appendix Table A2). This also likely explains the slight negative effect on the FTE of the job at the T4 vs. T3 margin, as the hours worked for those with majors in these areas tend to be lower. From an individual perspective, it remains inconclusive whether students around the margin are better off in the higher track. Many individuals do not pursue higher education even when expected returns are high, due to, e.g., income risk and psychic costs of studying [26].
Whether attending the higher track represents a net gain or a net loss for the marginal student ultimately depends on his or her utility function.
Such ambiguity does not apply to the T3 vs. T2 margin, as there is a strong negative wage effect, for the same average years of schooling. Hence, thresholds for this margin appear to be too lenient. This indicates that either parents and students are more likely to push for the higher track when close to this achievement margin, and/or that schools set more lenient entry requirements. The former could occur because completing T3 provides direct access to higher education and could therefore be seen as an especially crucial threshold. For schools, the incentives to set lower thresholds indeed appear comparatively high at this threshold. Tracks 3 and 4 typically fall within the same school while T2 and T3 do not. As such, schools are incentivized to set higher thresholds between T3 and T4 to increase average student quality within each track, while they are incentivized to set lower thresholds between T2 and T3 to attract more students. We lack the data to empirically assess these potential explanations.
It should be emphasized that, in this setting, 'too lenient' thresholds do not strictly imply that the threshold scores that schools set are, on average, too low. As there is leeway for parents with high aspirations for their children to still get into higher tracks also when they have relatively low scores, or to switch to another school with more lenient requirements, it can also be the case that such behavior pushes higher formal (though lenient) thresholds into lower effective thresholds.
Hence, negative wage returns for attending T3 over T2 could potentially also be avoided by stricter adherence to thresholds, rather than increasing the thresholds as such. In any case, our results indicate to parents that pushing students into higher tracks around the margin is not necessarily beneficial for later-life outcomes.
Our main findings appear to contrast with those of [14], who find no long-run effect for either educational attainment or wages from attending the higher track. However, as stated before, they estimate treatment effects at a very different margin. The lack of any effect in their study does not imply that ability thresholds are placed 'optimally'. For example, it can also reflect that underambitious allocation of younger students is canceled out by overambitious allocation of older students. Additionally, [14] suggest that their zero effects can be a consequence of the relatively high upward flexibility in the German system. The Dutch tracking system is comparatively more rigid with respect to upward mobility between tracks, which can also explain the difference in findings.

Robustness
We now assess the sensitivity of our results to different specifications and robustness tests. Many of the tests we conduct are similar to those for the traditional RD design, as the identification threats in our non-linearity approach are similar (although the RD design is based on more lenient assumptions). As stated before, where the RDD assumes no discontinuity in other determinants, our design assumes linearity in other determinants across the specified range of the running variable. We critically assess this assumption by analyzing sensitivity to the inclusion of observable characteristics, to bandwidth choice and to excluding tracks that are not part of the relevant margin. The latter two tests also address the extent to which our design still estimates treatment effects at the achievement margin. Finally, we assess sensitivity of our results to how the non-linearity in treatment probability is exactly exploited.

Observable characteristics
The estimation approach assumes that any other determinants of the outcome variable are linearly related to achievement and thereby captured by the control function for the Entrance Test score. When the inclusion of control variables strongly changes the estimates, this is a strong indication that this assumption is invalid. The main results presented before include the set of controls X. Table 5 compares those to a model without controls. The changes in the estimates are all small. This result confirms descriptive statistics shown before that indicated that the relation between achievement and control variables is linear. Fig 7 has shown that some non-linearity is present at the higher end for wages in the 1983 cohort. The results from Table 5 show that this has no major impact on the estimates, as sensitivity in the 1983 cohort is not larger than in other cohorts. The underlying reason is that any such non-linearity is also reflected in the CV procedure for optimal bandwidths, which consequently suggests narrower bandwidths. Table 5 also shows estimates when we additionally include a control for the teacher recommendation that students receive at the end of primary school. Track recommendations correlate strongly with the Cito exit test, but can differ when teachers feel that the test does not accurately reflect student ability or potential. As such, the variable provides a valuable control for aspects that are important for future success but not fully captured by test scores, such as non-cognitive skills and motivation. If our estimates responded strongly to controlling for the teacher recommendation, it would suggest that they are partly driven by non-linearity in such unobserved determinants across the control function. Addition of the teacher recommendation reduces first stage power somewhat, partly because the variable is not available for around 5% of the sample. Table 5 shows that the sensitivity to this additional control is minimal.
As shown by [27], coefficient stability when adding controls should always be judged in relation to the explanatory power of those control variables. Table 5 also reports the R² of the different specifications. The results show that control variables have substantial explanatory power, especially towards wages and especially for higher tracks, and that the teacher recommendation further adds to this. The fact that the latter explains outcomes also conditional on student achievement and background suggests that it indeed captures other types of skills than a test score. The sensitivity of the estimates is not higher in cases where the R² increases more strongly. We cannot rule out that non-linearities in unobserved determinants of long-run outcomes bias our results. Nonetheless, the low sensitivity of the estimates to the inclusion of observed indicators lends further validity to our results. The high coefficient stability also in cases where the R² increases substantially implicitly indicates that any potential selection on unobservables (conditional on S_i) would have to be very strong to fully drive these estimates.
Notes: ... (3)), and with an additional control for track recommendation (TR). For a list of control variables, see Fig 7. For an explanation on the estimation approach and an overview of all bandwidths, see Table 3. YoS = Years of Schooling. Standard errors are between parentheses and are robust and corrected for clustering at the school level. R²'s are reported between brackets.
A more formal exercise to assess possible non-linearity in important observable characteristics is to conduct placebo tests that use constructed control vectors as outcome variables in Model (3). The results are shown in Table A3 in the appendix. Only one of the placebo tests is statistically significant (at the 10% level). Given that we test this for 21 different hypotheses, these results support the assumption that important observable determinants are linearly related to our outcomes, and therefore are captured by the control function approach. We similarly obtain statistically insignificant estimates when the teacher recommendation is used as outcome.
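Why a single rejection out of 21 placebo tests is unremarkable can be made explicit with a short calculation. The sketch assumes that all 21 null hypotheses are true and that the tests are independent. Independence is an assumption of this illustration only, since the placebo outcomes in the study may well be correlated.

```python
from math import comb

n_tests = 21   # number of placebo hypotheses mentioned in the text
alpha = 0.10   # significance level of the single rejection

# Expected number of rejections if every null hypothesis is true.
expected_false_rejections = n_tests * alpha

def prob_at_most(k, n, p):
    """Binomial probability of at most k rejections in n independent tests."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

p_at_most_one = prob_at_most(1, n_tests, alpha)
print(f"Expected false rejections: {expected_false_rejections:.1f}")
print(f"P(at most one rejection):  {p_at_most_one:.3f}")
```

Under these assumptions about two rejections would be expected by chance, and seeing one or none happens in roughly 36% of cases, so the observed pattern gives no indication of systematic non-linearity in the observables.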

Sensitivity to bandwidth
Our non-linearity model has been estimated after establishing optimal bandwidths, following the same approaches as in RD designs.
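A bandwidth check of this kind can be sketched as a loop that re-estimates the model on progressively narrower score windows. The example is a simulation and not the authors' procedure: the score range, the margin at 40, the windows and the helper names are hypothetical, and the instrument is taken to be the share of students in the higher track at each score value, which is one possible reading of a conditional mean approach.

```python
import numpy as np

def two_sls(y, endog, exog, instruments):
    """Two-stage least squares. Coefficients are ordered [endog..., exog...]."""
    X = np.column_stack([endog, exog])
    Z = np.column_stack([instruments, exog])
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first stage
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]    # second stage

def cell_mean(values, groups):
    """Mean of `values` within each distinct value of `groups`, returned per observation."""
    _, idx = np.unique(groups, return_inverse=True)
    return (np.bincount(idx, weights=values) / np.bincount(idx))[idx]

rng = np.random.default_rng(1)
n = 200_000
score = rng.integers(20, 61, n).astype(float)  # hypothetical test score, margin at 40
u = rng.normal(size=n)                         # unobserved trait driving selection
track = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.8 * (score - 40) + u)))).astype(float)
# Simulated outcome: true track effect 0.8, linear in the score otherwise.
y = 0.8 * track + 0.02 * score + 0.3 * u + 0.3 * rng.normal(size=n)

estimates = {}
for lower, upper in [(20, 60), (25, 55), (30, 50), (35, 45)]:  # limits move in steps of 5
    keep = (score >= lower) & (score <= upper)
    s, t, out = score[keep], track[keep], y[keep]
    instrument = cell_mean(t, s)                  # share in the higher track at each score
    exog = np.column_stack([np.ones(s.size), s])  # linear control function for the score
    estimates[(lower, upper)] = two_sls(out, t, exog, instrument)[0]

for window, est in estimates.items():
    print(window, round(est, 3))
```

Stable estimates across windows are what one would hope to see. Estimates that drift as the window widens would point to non-linearity in other determinants of the outcome.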

Sample composition
We have argued before that our approach still elicits treatment effects close to the margin, because the LATE puts more weight on those observations for which the first stage slopes are steeper. The consistency of results for more narrow bandwidths provides evidence in favour of this. Moreover, we can assess how sensitive results are to the exclusion of tracks that are not part of the relevant margin.
In the main approach, we have included at least one other track in addition to the two marginal tracks in all but one case, as these sample compositions are favoured by the CV exercise. Appendix Table A5

Smoothing out the conditional mean function
The main estimation approach exploits any changes that occur in treatment probability across S_i. A potential problem is that this also incorporates variation in treatment probability that is the result of noise. When a set of students with a specific test score happens to be a 'good draw' by chance, this leads to both a higher probability of assignment to the higher track and likely better long-run outcomes as well. Coefficients would then be biased in favour of attending the higher track. We estimate an alternative approach in which we 'smooth out' any such volatility by predicting treatment by the Entrance Test, assuming a logistic relation. These results are shown in Appendix Table A6. Estimates are naturally less precise in this approach, partly because we exclude noise but also because we might exclude any other variation that is not fitted by the assumed logistic functional form. This is mainly an issue for estimation of educational attainment effects for the medium margin. Hence, the result for this particular treatment effect remains somewhat inconclusive, and we cannot fully exclude that there is some effect of track assignment on years of schooling for this margin. Overall, point estimates in Table A6 are very similar to the main results. The small differences that do occur indeed point to slightly lower estimates in the logistic approach, as we would expect, but these are very minor and all the main conclusions are still upheld. Hence, volatility in treatment probability that is caused by noise does not drive our estimation results.
Notes: *Significant at 10% level, **Significant at 5% level, ***Significant at 1% level. The table shows the sensitivity of the estimates from Table 3 to changes in the bandwidth. The first entry shows the result for the baseline bandwidth (BL). Other entries show estimates for changes in the upper limit (UL) and lower limit (LL) of the bandwidth. Bandwidths change with intervals of 5. For example, for the first row the bandwidths are: , , , , , , , . Open entries imply that the end of the bandwidth is already reached. See Appendix Table A4 for a full overview of all bandwidths in this exercise. For an explanation on the estimation approach, see Table 3. YoS = Years of Schooling. Standard errors are between parentheses and are robust and corrected for clustering at the school level.
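The logistic smoothing of the treatment probability discussed in this section can be sketched as follows. The example is a simulation under stated assumptions and not the authors' implementation: the score range, the margin at 40 and all names are hypothetical, the logistic fit uses a plain Newton-Raphson routine, and the unsmoothed comparison takes the raw share in the higher track at each score as the instrument.

```python
import numpy as np

def two_sls(y, endog, exog, instruments):
    """Two-stage least squares. Coefficients are ordered [endog..., exog...]."""
    X = np.column_stack([endog, exog])
    Z = np.column_stack([instruments, exog])
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first stage
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]    # second stage

def fit_logistic(x, t, iterations=25):
    """Logistic regression of t on a constant and x, fitted by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iterations):
        p = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (t - p)
        hessian = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def cell_mean(values, groups):
    """Mean of `values` within each distinct value of `groups`, returned per observation."""
    _, idx = np.unique(groups, return_inverse=True)
    return (np.bincount(idx, weights=values) / np.bincount(idx))[idx]

rng = np.random.default_rng(2)
n = 100_000
score = rng.integers(30, 51, n).astype(float)  # hypothetical test score, margin at 40
x = (score - 40) / 10                          # rescaled score used as running variable
u = rng.normal(size=n)                         # unobserved trait driving selection
track = (rng.uniform(size=n) < 1 / (1 + np.exp(-(5 * x + u)))).astype(float)
# Simulated outcome: true track effect 0.8, linear in the score otherwise.
y = 0.8 * track + 0.2 * x + 0.3 * u + 0.3 * rng.normal(size=n)

exog = np.column_stack([np.ones(n), x])  # linear control function

# Unsmoothed instrument: raw share in the higher track at each score value.
effect_raw = two_sls(y, track, exog, cell_mean(track, score))[0]

# Smoothed instrument: fitted probability from a logistic model in the score.
a, b = fit_logistic(x, track)
smoothed = 1 / (1 + np.exp(-(a + b * x)))
effect_smoothed = two_sls(y, track, exog, smoothed)[0]

print(round(effect_raw, 3), round(effect_smoothed, 3))
```

With large cells, as in this simulation, both instruments give similar estimates. The smoothed version matters more when few students share a score value, because chance variation in the raw shares then feeds into the first stage.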

Conclusion
This study has assessed the long-run effect of secondary school track assignment for students who are at the margin of the required achievement level for a specific track. The empirical approach relies on a design that exploits non-linearity in the relation between treatment probability and achievement, while assuming that achievement is linearly related to other determinants of the outcome variable. The fact that track assignment is based on (implicit) achievement thresholds leads to a situation in which increases in the probability of attending a higher track are not proportional to increases in ability or potential. Descriptive statistics and various robustness analyses lend support to the assumption that the non-linearity in the relationship with achievement is exclusive to the probability of track assignment. Our results indicate that students in the Netherlands obtain higher educational attainment and higher wages when attending the higher track, for the choice margin between Tracks 1 and 2 and the choice margin between Tracks 3 and 4. The returns in the labor market are around 15% for the lower margin and between 3% and 7% for the higher margin, at the expense of around 1.5 additional years of schooling. Estimates for the choice margin between the two middle tracks are low and statistically insignificant with respect to educational attainment (although imprecisely estimated), and negative with respect to wages, indicating a wage loss of around 12% from attending the higher track.
Several potential mechanisms can drive these treatment effects. Attending higher tracks directly provides access to higher levels of post-secondary education, which subsequently is linked to higher expected earnings in later life. Additionally, different tracks imply different curricula that teach different types of skills, as well as differences in peer and school quality. We find robust evidence of positive effects of higher track attendance on school achievement for the higher margin, which is suggestive of true learning effects. In light of the positive effects for achievement and educational attainment, the labor market returns at the higher margin appear to be low. Additionally, labor market returns are negative for the middle margin, where there is no effect on educational attainment. These patterns appear to be explained at least partly by study choice. Attending a higher track leads to more frequent sorting into study majors with lower future earnings. This is suggestive evidence that the more challenging track in secondary school leads students to select 'less challenging' post-secondary educational paths. It also relates to recent findings by [28] that a lower rank in class (which attending a higher track directly induces) leads to lower investment in human capital, conditional on ability. We also identify negative treatment effects on motivation at the middle margin, which could contribute to this pattern of study choice. Further disentangling the different potential mechanisms that are behind the effects of track assignment is an interesting avenue for future research.
The results from this study highlight that, while higher tracks are associated with higher wages, the students who are at the achievement margin of a track do not necessarily obtain such strong wage gains. In our study, wage returns depend strongly on the margin we are looking at: high at the lower margin, positive but low compared to educational investments at the higher margin, and negative at the middle margin. We can only speculate on the underlying reasons for these differences. As discussed before, different considerations of students and their parents and of schools can influence the location of the thresholds. Parental aspirations for the educational attainment of their children are known to be high and this can lead parents to push children into too demanding tracks. Our pattern of results could reflect that such parental overconfidence would be less prominent among the low-achieving children. Additionally, the negative results for the middle margin could partly accrue due to the structure of Dutch schools. As the majority of T3 schools also offer T4 but not T2, there is a comparatively stronger incentive for schools to lower the threshold for T3 to attract more students. This variation in treatment effects across margins also makes it difficult to project the sign and size of track assignment effects in other countries. Replication of our analysis for those countries (possibly with a similar empirical approach) would be needed to assess the external validity of our findings.
For the lowest and highest choice margin, wage gains come at the expense of extra investment in education. Whether this is perceived as a gain or a loss for the individual depends on his or her utility function. The particular approach developed in this paper does not answer what is the optimal choice from the perspective of society. A lower threshold can negatively affect the untreated, because peer quality is reduced on both sides of the threshold. On the other hand, higher educational attainment could induce externalities for society in terms of reduced crime rates or productivity spillovers. The social welfare implications of different achievement thresholds for track assignment can be an interesting avenue for future research, for example by looking at exogenous variation in assignment over time induced by policy changes. At the same time, such policy changes cannot be used to assess the implications of choosing the higher over the lower track for the individual student at the margin, because they reassign a whole segment of the distribution and thereby also change the nature of each track. Hence, one has to rely on another source of exogenous variation.
Ideally, the effects of track assignment would be estimated by exploiting a strong discontinuity in explicit threshold scores for track eligibility. Despite the comparatively strong reliance on achievement in track assignment in the Dutch system, the traditional RD design was not feasible, even in the case where the official running variable is available. The identification approach therefore relies on the assumption that unobservable determinants of long-run outcomes are linearly related to achievement, which we cannot formally prove. While relying on stronger assumptions, the conditional mean approach presented in this study can provide an alternative approach for empirical studies with similar data designs.