What is the effect of changing eligibility criteria for disability benefits on employment? A systematic review and meta-analysis of evidence from OECD countries

Background: Restrictions in the eligibility requirements for disability benefits have been introduced in many countries, on the assumption that this will increase work incentives for people with chronic illness and disabilities. Evidence to support this assumption is unclear, but there is a danger that removal of social protection without increased employment would increase the risk of poverty among disabled people. This paper presents a systematic review of the evidence on the employment effects of changes to eligibility criteria across OECD countries.

Methods: Systematic review of all empirical studies from OECD countries from 1990 to June 2018 investigating the effect of changes in eligibility requirements and income replacement level of disability benefits on the employment of disabled people. Studies were narratively synthesised, and meta-analysis was performed using meta-regression on all separate results. The systematic review protocol was registered with the Prospective Register for Systematic Reviews (Registration code: PROSPERO 2018 CRD42018103930).

Results: Seventeen studies from seven countries met the inclusion criteria. Eight investigated an expansion of eligibility criteria and nine a restriction. There were 36 separate results included from the 17 studies. Fourteen examined an expansion of eligibility; six found significantly reduced employment, eight no significant effect and one increased employment. Twenty-two results examined a restriction in eligibility for benefits; three found significantly increased employment, 18 no significant effect and one reduced employment. Meta-regression of all studies produced a relative risk of employment of 1.006 (95% CI 0.999 to 1.014; I² 77%).

Conclusions: There was no firm evidence that changes in eligibility affected employment of disabled people. Restricting eligibility therefore has the potential to leave a growing number of people with health problems out of employment and ineligible for adequate social protection, increasing their risk of poverty. Policymakers and researchers need to address the lack of robust evidence for assessing the employment impact of these types of welfare reforms, as well as the potential wider poverty impacts.

Publishability: My primary concerns with the paper regard the actual implementation of the meta-analysis. As I detail in my comments below, the narrative of the paper contradicts itself in some places, is less clear than it could (or should) be in others, and overall would benefit greatly from additional polish and attention to detail. I have done my best below to explain these thoughts in relation to where they arise within the current version of the manuscript.

Recommendations by Location
1. (Lines 56-57): The logic here is not necessarily true as it depends on the rate at which average age and retirement age are increasing. That's not to say it can't be true, but it's a claim that needs to be backed up especially if it is forming the core of the research's motivation.
2. (Lines 67-69): These few lines encapsulate a running concern I have with this manuscript that I'll return to throughout this report. At a fundamental level, a meta-analysis should be about combining multiple estimates of some underlying parameter of interest to increase the precision beyond what any single estimate could provide on its own. The obvious concern being pointed out here is that, for something like disability benefits, other institutional factors matter, including access to health care and other social safety net policies. As a result, it can be challenging to defend the combination of estimates across countries (here across the U.S. and Europe). It is perfectly reasonable to limit a meta-analysis to a specific group of countries, either because the research question being asked focuses on that group or because the best evidence comes from there. However, throughout this manuscript I see comparisons (primarily U.S. vs. other) of estimates that seem to imply to me that the authors themselves see these as two different "groups" of estimates that should not be considered as independent draws from the same underlying treatment effect distribution; these lines are the first instance of this. As a result, it is unclear to me why the authors decide to combine all of the results during their meta-analysis, ignoring these differences and (at least implicitly) assuming that such results are directly comparable.
3. (Lines 74-75): This is related to my point above. It is a good point that, although similar in absolute terms, the effects of expanded eligibility may not simply be -1 times the effects of restricted eligibility. However, in your meta-analysis you effectively treat these as the same in most cases. Your funnel plot combines them and, although you admittedly separate them in Figure 2, I do not see any discussion of why we should believe these effects mirror one another. The empirical evidence I see here seems to imply that they don't look statistically different from one another, implying that the conclusions of the previous literature are unaffected by this beyond having greater imprecision. However, given the work here is still rather imprecise, that does not yield a clear improvement.

4. (Line 81): The analysis here only includes eight countries (out of a possible 37), so this seems to be, at best, a marginal improvement. Furthermore, of those eight, three are represented by a single study. Perhaps the studies included are of a higher quality, but I do not see a compelling argument that quantity of studies alone is a marked improvement.
5. (Lines 84-85): Why is regional variation within a country (your example is across Canadian provinces) a limitation? I would see a group of studies using cross-region, within-country variation as preferable to a series of cross-country studies.
6. (Lines 96-98): Why do we believe that these estimates across countries are comparable? You've already argued earlier (Lines 67-69) why U.S. studies should be thought of differently from other OECD countries, and it appears that Barr et al. (2010) stated this explicitly in their exclusion of such studies. Why the change in stance? I think it's reasonable to be skeptical of the assumption that U.S. and non-U.S. studies are drawing from the same underlying treatment-parameter distribution, even conditional on assuming all such estimates are unbiased and causal, so this decision requires motivation and justification.
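To make point 6 concrete, a simple first step would be a between-subgroup heterogeneity test on separately pooled U.S. and non-U.S. estimates. A minimal sketch, with all study-level numbers invented purely for illustration (they are not taken from the manuscript):

```python
import numpy as np
from scipy import stats

def pooled(log_rr, se):
    """Fixed-effect inverse-variance pooling of log relative risks."""
    w = 1.0 / se**2
    est = np.sum(w * log_rr) / np.sum(w)
    return est, 1.0 / np.sqrt(np.sum(w))

# Hypothetical study-level log-RR estimates and standard errors
us_log_rr,  us_se  = np.array([0.02, -0.01, 0.04]), np.array([0.02, 0.03, 0.02])
oth_log_rr, oth_se = np.array([0.00, 0.01]),        np.array([0.02, 0.04])

est_us,  se_us  = pooled(us_log_rr, us_se)
est_oth, se_oth = pooled(oth_log_rr, oth_se)

# Between-subgroup Cochran's Q: chi-square with (groups - 1) df under
# the null that both subgroups share one underlying effect
q_between = (est_us - est_oth) ** 2 / (se_us**2 + se_oth**2)
p = stats.chi2.sf(q_between, df=1)
```

A non-significant p here would at least give the pooling decision some empirical support, though with so few studies the test will be low-powered.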
7. (Lines 123-126): How do you handle the wide variety of age ranges included in these studies? There are a large number of factors here that may influence the probability of being (or becoming) disabled. You say that the study must incorporate older workers, and while many of these studies focus on workers at or near your age range of 50-65, those that do not will produce estimates biased away from the effect on older workers and towards the average effect for all workers. If you believe those effects are the same, you should defend that claim. If you do not, you should discuss how these studies will potentially bias your results, as my guess would be that the inclusion of younger workers pushes you towards a null effect. Given that's your ultimate conclusion, this could be a real concern.
8. (Lines 144-150): With only 17 studies in your final data set, why not apply this double-checking process to all of them?
9. (Lines 176-179): You criticized the informativeness of papers studying increased access (Lines 74-76) because that's not the direction of most current policy changes. This criticism, of course, has merit. However, here you assume that the effects are reflexive, and I am unsure why you have decided not only to back off that criticism but also to make the additional assumption that the effects are 1-for-1 in the opposite direction.
10. (Line 238): Each paper only contributed approximately two regression estimates? Or have you pared down the estimates in some other manner that you haven't mentioned? It's relatively common to restrict each study to its main or primary conclusions, so I don't have any concerns regarding that, but an explicit acknowledgment should be included.
… studies, why lump them together in your meta-analysis and ignore the clear cross-country heterogeneity? I understand that statistical power is a classic limitation of this type of work, but these studies don't appear to be fundamentally comparable in this manner and, if they are, it would require discussion and justification.
14. (Lines 329-330): There's a fundamental issue here that I believe the authors need to discuss: namely, the "bite" of these disability policies. It seems most studies find a treatment effect estimate that is in line with the hypothesis that restricted eligibility increases employment and relaxed eligibility decreases employment. However, most studies also find a small effect in absolute terms, and this effect is most often statistically insignificant. Yet the authors do not discuss the prevalence of disabilities within the broader population. To demonstrate, imagine the "true" effect of relaxing eligibility was a decrease in employment of 1 percentage point for the sample being studied. Most people, though, will not be affected by disabilities. So, to calculate the effect of this policy on disabled workers, the estimate needs to be scaled by their prevalence in the sample studied. If 10 percent of the population were affected by this policy change, then the effect on them would be 10 times as large. This is especially important here since, as the authors point out, some of these studies use the entire population of workers in a given country. As a result, to truly understand the effect of the policy change on affected workers, the estimates need to be scaled. This comes, of course, with the implicit assumption that there are no spillover effects of the policy on non-disabled workers. Spillovers would also introduce a potential bias when using the population of workers, but my gut feeling is that this bias is secondary in nature. At minimum, this concern needs to be discussed, and descriptive statistics need to be provided giving the reader the potential magnitude and scope of this concern.
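The scaling in point 14 can be written out as a toy calculation; both numbers below are hypothetical, not taken from any study in the review:

```python
# Toy illustration of the "bite" scaling argument.
population_effect_pp = -1.0  # measured employment effect, in percentage points, across all workers
share_affected = 0.10        # hypothetical share of the sample actually exposed to the reform

# Under the (strong) assumption of no spillovers onto unaffected workers,
# the effect on affected workers is the population effect divided by their share.
effect_on_affected_pp = population_effect_pp / share_affected
print(effect_on_affected_pp)  # -10.0
```

A "small" population-level estimate can therefore conceal a large effect on the group that actually faces the eligibility change, which is why the prevalence statistics matter.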
15. (Figure 3): It is hard to determine the median value from this figure, but it appears to be to the right of 1.0. As with my comment above, even a small effect in absolute terms could be large when looking more closely at the group of workers actually affected.
17. (Line 394): What was the result of your FAT? One way to really enhance the arguments put forth in this manuscript would be to provide direct, quantifiable estimates wherever possible. At present, the manuscript does this inconsistently.
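On point 17, reporting the FAT result directly would cost little: in its Egger-style form the funnel-asymmetry test is just a regression of each study's t-value on its precision. A sketch on invented study values (none taken from the manuscript):

```python
import numpy as np

# Egger-style funnel-asymmetry test (FAT): regress each study's t-value
# (effect / SE) on its precision (1 / SE). An intercept far from zero
# relative to its standard error suggests small-study asymmetry.
effects = np.array([0.05, 0.02, 0.08, 0.01, 0.03])   # hypothetical effect sizes
ses     = np.array([0.020, 0.010, 0.050, 0.012, 0.025])  # hypothetical SEs

t_vals = effects / ses
precision = 1.0 / ses
X = np.column_stack([np.ones_like(precision), precision])
(intercept, slope), *_ = np.linalg.lstsq(X, t_vals, rcond=None)
# intercept: the FAT statistic's point estimate; slope: precision-adjusted effect
```

Quoting the estimated intercept and its confidence interval in the text would make the publication-bias discussion quantifiable rather than impressionistic.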
18. (Lines 397-405): I must admit, I am a bit unsure what this paragraph is trying to accomplish. As I read it, it seems to me to argue against the quality of the studies included in the meta-analysis. That's not to say the points raised here are not valid, but it seems to undermine the idea of performing a meta-analysis with such studies in the first place.
19. (Lines 409-419): It could also be the case that there is an effect but its size is small when examined at the population level. As stated earlier, if this type of claim is going to be made (i.e. that the effects are close to zero), then at the very least the upper end of the CI needs to be reported as the largest effect size the authors are willing to rule out. This also needs to account for the bite of the policy: perhaps the bite is quite large, so the effects do not need to be scaled a great deal, but it would still need to be addressed.
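As a concrete version of point 19, the largest effect the pooled estimate can rule out follows directly from the reported CI; the 10% bite below is hypothetical, and the division is only a rough approximation valid for effects near zero with no spillovers:

```python
# The pooled RR upper bound of 1.014 caps the population-level relative
# employment increase at about 1.4%; scaling by a hypothetical 10% bite
# implies the data cannot rule out a roughly 14% increase for affected workers.
rr_upper = 1.014
population_bound = rr_upper - 1.0          # ~0.014 relative increase
bite = 0.10                                # hypothetical share affected
affected_bound = population_bound / bite   # ~0.14 relative increase
```

Reporting a bound of this kind would make the "no firm evidence" conclusion far more interpretable than a bare statement of statistical insignificance.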

Minor recommendations
1. (Throughout): The writing is quite rough throughout the manuscript, and it needs a thorough proofreading.
2. (Figure 3): Figure 3 should have the standard error on the y-axis changed to precision (1/SE). That way the graph itself does not need to look any different, but the y-axis can increase from bottom to top rather than decrease. The flipped axis initially confused me when I first looked at it.
3. (Lines 424-425): I'm not completely sold on making generalizations to OECD countries given that only 20% of its countries are represented here (eight out of a possible 37) and that there seems to be a great deal of heterogeneity across the U.S./non-U.S. studies.