Bias Due to Sample Selection in Propensity Score Matching for a Supportive Housing Program Evaluation in New York City

Objectives: Little is known about the influence of sample selection on estimation in propensity score matching. The purpose of this study was to assess potential selection bias from one-to-one greedy matching versus optimal full matching as part of an evaluation of supportive housing in New York City (NYC).
Study Design and Setting: Data came from administrative records for 2 groups of applicants who were eligible for an NYC supportive housing program in 2007–09: chronically homeless adults with a substance use disorder and young adults aging out of foster care. We evaluated the 2 matching methods on their ability to balance covariates and to represent the original population, and on how they affected estimated outcomes related to Medicaid expenditures.
Results: In the population with a substance use disorder, only optimal full matching performed well in balancing covariates, whereas both methods created representative populations. In the young adult population, both methods balanced covariates effectively, but only optimal full matching created representative populations. In the young adult population, the estimated impact of the program on Medicaid expenditures was attenuated when one-to-one greedy matching was used, compared with optimal full matching.
Conclusion: Given covariate balance with both methods, the attenuated program impacts in the young adult population indicated that one-to-one greedy matching introduced selection bias.


Introduction
Propensity score matching has been widely used to reduce bias due to confounding in observational studies [1][2][3]. It allows researchers to examine distributions of and differences in observed covariates between treatment (or exposed) and control groups using statistical and graphical tools, an advantage for unbiased estimation over conventional regression adjustment, which lacks such tools [4]. When addressing covariate imbalance via propensity score matching, optimal full matching has been shown to be more efficient than one-to-one greedy matching [5]. This is because optimal full matching minimizes the total distance between treatment and control groups, whereas one-to-one greedy matching performs localized matching in which a person in the treatment (or exposed) group is sequentially matched with a person in the control group [5,6]. In addition, optimal full matching employs flexible matching ratios (e.g., one-to-many or many-to-one), which balances covariates more efficiently than matching restricted to one-to-one pairs [5,7]. Along with improved internal validity via covariate balancing, optimal full matching can retain almost all subjects, unlike one-to-one greedy matching, which retains only pairs of treatment and control subjects [5]. When evaluating public health interventions targeting certain populations, it is important to ensure comparability between the propensity score-matched population and the original population of interest. If they are systematically different due to the exclusion of unmatched subjects, then external validity (or generalizability) may be reduced [8,9].
Although propensity score matching has been shown to improve internal validity by balancing covariates, little is known about how the sample selection that results from a given propensity score matching approach influences estimation [5]. Different propensity score matching procedures tend to produce a subsample that may differ from the original population, but the generalizability of results from the matched data to the original sample has rarely been examined. Although external validity is critical in contextualizing evidence for public health interventions and practices, studies using current causal inference methods, including propensity score matching, often emphasize internal validity at the expense of external validity. The purpose of this methods evaluation was to assess potential bias due to sample selection in one-to-one greedy matching as opposed to optimal full matching, which was one of the major analytic considerations in an evaluation of whether placement in a supportive housing program in New York City (NYC) reduced costs from various government services.

Population
In an effort to address homelessness, NYC and New York State created a program to establish 9,000 units of supportive housing for people who are homeless or at risk of homelessness in NYC. Housing placement began in 2007 and will continue until at least 2016. To evaluate the effectiveness of the program on the utilization and expenditures of government services and benefits, we conducted data linkage across multiple administrative records including other types of government housing, jails, homeless shelters, New York State psychiatric facilities, Medicaid, cash assistance, and food stamps. Data were provided by the NYC Department of Homeless Services, the NYC Department of Correction, the NYC Department of Health and Mental Hygiene, the NYC Human Resources Administration and within it Customized Assistance Services and the HIV/AIDS Services Administration, and the New York State Office of Mental Health. For the purpose of this analysis, we focused on applicants who were eligible from 2007 through 2009 for 2 of the 9 populations housed by the program. More details on the program and population definitions can be found in a recent report [10]. Readers interested in accessing the data should contact epidatarequest@health.nyc.gov to determine how data may be shared in a way that protects confidentiality.
One population was adults with chronic homelessness and an active substance use disorder ("SUD population"; placed: 456, unplaced: 335). The other population was young adults aging out of foster care ("young adult population"; placed: 122, unplaced: 299). The placed group included individuals who were continuously placed in the supportive housing program during their 1st year of follow-up time. The unplaced group included individuals who were eligible for the program but were not placed in the program or in any other government-subsidized housing program tracked by the evaluation for more than 7 days [10]. The NYC Department of Health and Mental Hygiene Institutional Review Board (IRB) determined that the program evaluation is not human subject research and therefore does not fall under the purview of the IRB.

Variables
The exposure variable in this evaluation was living in the program for 1 year, which we refer to as being "placed," as opposed to the comparison group that was eligible for the program but "unplaced" in it. Baseline was defined as the earliest housing placement date for the placed group and the earliest program eligibility date for the unplaced group. Among the placed group, the median difference between the first eligibility and first placement dates was 50 days, indicating that there was not a lengthy waiting period between becoming eligible and moving into housing. This paper focuses on total Medicaid costs and Medicaid costs due to 1) outpatient care, 2) inpatient care, 3) emergency department visits, and 4) prescription drugs. We included a large number of covariates in the propensity score matching that described baseline demographic and clinical characteristics and pre-baseline service/benefit utilization (see Table S1 for the full list of covariates). We included all variables in the propensity score models except for those with extremely wide confidence intervals, because such intervals suggested multicollinearity (data about confidence intervals not shown).

Propensity score matching
We estimated propensity scores using a logistic regression model for each population, with housing placement as the dependent variable and baseline or pre-baseline covariates as independent variables. We then performed propensity score matching using 2 different algorithms. First, using a one-to-one greedy matching algorithm (i.e., nearest neighbor matching) without replacement in the MatchIt program in R software version 2.14.2 (Vienna, Austria), we created matched pairs of placed and unplaced subjects. For the SUD population, we randomly selected one placed subject at a time and then matched that subject to an unplaced subject because the placed group was larger than the unplaced group. For the young adult population, we used the default option in the MatchIt program, in which the placed subject with the largest propensity score was selected first and matched with an unplaced subject. We also performed one-to-one greedy matching using the random option and found the same matching result, confirming that matching was independent of the order of sample selection (e.g., random, largest to smallest) when the placed group was smaller than the unplaced one. Second, we used the optmatch program in R to perform optimal full matching, which generated matched sets of at least 1 placed and 1 unplaced individual as an optimal solution minimizing the total distance in propensity scores. Unlike one-to-one greedy matching, optimal full matching creates matched sets that contain varying numbers of placed and unplaced subjects.
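The matching itself was performed with the MatchIt and optmatch programs in R (see Text S1). As an illustration of the one-to-one greedy (nearest neighbor) step described above, the following is a minimal Python sketch; it is not the evaluation's actual code, and `greedy_match` and its arguments are names chosen here for exposition:

```python
def greedy_match(ps_treated, ps_control, caliper=None):
    """One-to-one greedy (nearest neighbor) matching without replacement.

    Treated units are processed from largest to smallest propensity
    score; each is paired with the closest remaining control.  Once a
    control is used it becomes unavailable, so some units go unmatched
    when the two groups differ in size.
    """
    available = list(range(len(ps_control)))
    pairs = []
    for t in sorted(range(len(ps_treated)), key=lambda i: -ps_treated[i]):
        if not available:
            break  # control pool exhausted; remaining treated stay unmatched
        # closest remaining control by absolute propensity score distance
        best = min(available, key=lambda c: abs(ps_treated[t] - ps_control[c]))
        if caliper is None or abs(ps_treated[t] - ps_control[best]) <= caliper:
            pairs.append((t, best))
            available.remove(best)  # without replacement
    return pairs
```

Because each control can be used only once, a treated group larger than the control pool necessarily leaves some treated subjects unmatched, which is the mechanism by which one-to-one greedy matching excluded subjects in this evaluation.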

Propensity score matching evaluation
We assessed the performance of propensity score matching using 2 criteria: 1) whether the covariates were balanced (internal validity) and 2) whether those retained in the analysis were representative of the original population included in the evaluation (external validity). For the first criterion, we evaluated the extent to which each matching method balanced differences between placed and unplaced groups by means of standardized absolute differences. Specifically, for all covariates we calculated the absolute difference in the average covariate value between placed and unplaced groups and divided that estimate by the pooled standard deviation. After incorporating propensity score matching in this calculation, we examined whether propensity score matching decreased the standardized absolute difference. If the difference became less than 0.1, which is considered a negligible difference in a covariate between 2 groups on average [11], we concluded that observed covariate balance between the 2 groups was achieved and therefore that propensity score matching was effective. For evaluating external validity, we compared baseline demographic and clinical characteristics and pre-baseline service/benefit utilization between the original population and the population that remained after propensity score matching, and examined whether there were systematic differences between these 2 populations by means of chi-squared tests (categorical variables) or independent t-tests (continuous variables).
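The balance criterion above can be sketched concretely. The following illustrative Python (function names are ours) assumes the pooled standard deviation is computed from the average of the two group variances, a common convention for this diagnostic:

```python
import math

def standardized_abs_difference(x_treated, x_control):
    """Absolute standardized difference in means between two groups,
    scaled by the pooled standard deviation (average of group variances)."""
    m1 = sum(x_treated) / len(x_treated)
    m0 = sum(x_control) / len(x_control)
    v1 = sum((x - m1) ** 2 for x in x_treated) / (len(x_treated) - 1)
    v0 = sum((x - m0) ** 2 for x in x_control) / (len(x_control) - 1)
    pooled_sd = math.sqrt((v1 + v0) / 2)
    return abs(m1 - m0) / pooled_sd

def balanced(x_treated, x_control, threshold=0.1):
    """Apply the 0.1 rule of thumb: below it, the covariate difference
    is considered negligible."""
    return standardized_abs_difference(x_treated, x_control) < threshold
```

In practice this check is run on every covariate before and after matching, and matching is judged effective when all post-matching differences fall below 0.1.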

Estimation of treatment impacts
We estimated the impact of supportive housing on the difference in Medicaid costs using propensity score-matched data. After having established that covariates were balanced, which confirmed that bias due to observed confounding was unlikely and that internal validity was achieved, we compared estimates from one-to-one greedy-matched data with those from optimally full-matched data, allowing us to assess potential bias due to sample selection in the propensity score matching process (i.e., a threat to external validity). To account for skewed data and propensity score matching, we estimated median differences in outcomes by placement status by inverting the Wilcoxon signed rank test and the Hodges-Lehmann (H-L) aligned rank test, using the one-to-one greedy-matched and the optimally full-matched data, respectively [12]. Because these 2 tests share the same estimation algorithm (the H-L aligned rank sum test is the extension of the Wilcoxon signed rank test to matched sets), we expected to obtain almost identical point estimates if internal and external validity were established by the propensity score matching mechanisms. In this study, we considered a point estimate with good internal and external validity to be a true value and assessed bias in terms of the difference between the observed and true values. Using the same tests, we tested the null hypothesis that the H-L point estimate was equal to zero.
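For one-to-one matched pairs, the point estimate obtained by inverting the Wilcoxon signed rank test has a simple closed form: the median of the Walsh averages of the within-pair differences. A minimal Python illustration follows (`hodges_lehmann_paired` is our name; the aligned rank sum extension for full-matched sets is more involved and is not shown):

```python
import statistics

def hodges_lehmann_paired(diffs):
    """Hodges-Lehmann point estimate for paired data: the median of the
    Walsh averages (d_i + d_j) / 2 over all pairs i <= j of the
    within-pair outcome differences."""
    walsh = [(diffs[i] + diffs[j]) / 2
             for i in range(len(diffs))
             for j in range(i, len(diffs))]
    return statistics.median(walsh)
```

Unlike the simple median of differences, this estimator corresponds exactly to the signed rank test, so the point estimate and the p-value come from the same procedure.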
For all analyses, statistical significance was established by a 2-sided p-value < 0.05. All statistical analyses except for propensity score matching were performed using SAS 9.2 software (Cary, NC).

Figure 1 shows the distributions of the propensity scores (i.e., the likelihood of being continuously placed in the housing program, as estimated by the propensity score models) for placed and unplaced subjects. There was substantial overlap in the distributions of the 2 groups, meeting an important prerequisite for propensity score matching because placed subjects were matched with unplaced subjects on the basis of similarity in propensity score. Stratified by propensity score quintiles, SUD program participants in lower quintile groups were more likely to be non-Hispanic white, receive supplemental security income, have mental and physical illness diagnoses, and have histories of hospitalization (data not shown). A limited capacity to live independently, as measured by the number of activities of daily living requiring assistance, and less frequent substance use were also associated with lower propensity score quintiles. Likewise, in the young adult population, having mental and physical illness, receiving supplemental security income, and residing in foster care or institutions such as jail or hospitals at the time of application were associated with a lower likelihood of being placed in the program.

Results
Balance in baseline characteristics after propensity score matching (internal validity)
Optimal full matching retained all subjects, whereas one-to-one greedy matching excluded 121 subjects from the SUD population and 177 from the young adult population (Table 1). For the SUD program, all those excluded by one-to-one greedy matching were in the placed group, and for the young adult program all were in the unplaced group. In the SUD population, the performance of one-to-one greedy matching in reducing observed differences between placed and unplaced groups differed greatly across variables, while optimal full matching in general performed well in establishing covariate balance (Table 2). In the young adult population, both matching methods successfully reduced differences in demographic and service utilization characteristics between placed and unplaced groups.
Representation of the original study population after propensity score matching (external validity)
Overall, there were no clear systematic differences between retained and excluded subjects in the SUD population after one-to-one greedy matching (all p > 0.05 except for past violence-related symptoms/behaviors; Table 3). Even though 27% of placed subjects (n = 121) were excluded, the exclusion was independent of propensity scores, and therefore distributions were similar before and after one-to-one greedy matching. However, in the young adult population, subjects excluded by one-to-one greedy matching were predominantly from the lower quintiles, which were characterized by having mental and physical illness and current substance use, and by needing assistance with activities of daily living. This resulted in systematic differences in population profiles between the one-to-one greedy-matched data and the original data (Table 3). Unlike one-to-one greedy matching, optimal full matching retained all participants in both programs.

Estimated differences in outcomes associated with treatment
Given good internal and external validity, we considered the estimated program impact from optimally full-matched data to be a gold standard and compared it with estimates using one-to-one greedy-matched data to assess bias. For the SUD population, the estimated program impacts on Medicaid costs were generally greater using one-to-one greedy matching as opposed to optimal full matching (Table 4). In contrast, for the young adult population the estimated program impacts on total Medicaid costs and outpatient Medicaid costs were attenuated when one-to-one greedy matching versus optimal full matching was used. Given similar covariate distributions for one-to-one greedy and optimally full-matched data in the SUD population, which indicated external validity, the discrepancies in estimates were likely due to covariate imbalances (i.e., low internal validity) that one-to-one greedy matching failed to reduce. In the young adult population, given that both matching methods effectively established internal validity, the differences were more likely attributable to the one-to-one greedy matching process, which systematically excluded people with low propensity scores who were likely to experience a greater impact of the supportive housing program on total Medicaid costs given their baseline characteristics (i.e., low external validity).

Table 1. Number of supportive housing tenants and unplaced applicants in programs by propensity score quintiles.

Discussion
In this evaluation we demonstrated that one-to-one greedy matching led to biased estimates when selection was not independent of the program impact. In the young adult population, one-to-one greedy matching systematically excluded unplaced subjects with low propensity scores, generating a matched population that was healthier and more independent in daily living than the original one. Despite good internal validity, estimated program effectiveness was attenuated compared with that from optimally full-matched data. In contrast, for the SUD population the sample selection for one-to-one greedy matching appeared to be independent of propensity scores, indicating external validity. For this population some differences in the program impact between the two matching methods were observed, which was likely due to covariate imbalance, rather than selection bias.
Current literature offers little discussion of influences on estimation due to sample selection with propensity score matching mechanisms. This may be because this potential selection bias does not occur when a treatment impact is estimated only for the treatment group [13]. Yet, in some contexts where understanding the impact of a treatment among the entire population of interest is desired, selecting subjects for propensity score matching could introduce unintended bias into estimation. Such a case would be the evaluation of a public health intervention targeted to a particular population, e.g., what change in outcomes would have occurred if all subjects in the population had received a treatment?
Our findings support Little and Rubin's argument that if sample selection is non-ignorable, the size of bias in the estimated population-level effects depends on the degree of association between treatment effects and selection after adjusting for covariates [8,9]. We found that sample selection in one-to-one greedy matching depended on the extent to which the propensity score distributions overlapped between placed and unplaced groups and on the sample sizes of these groups. In addition, our findings confirmed current evidence that optimal full matching, which employs flexible matching ratios, is more effective in covariate balancing than one-to-one greedy matching [14]. Despite the advantage of optimal full matching over one-to-one greedy matching in establishing both internal and external validity, one-to-one greedy matching tends to be a popular propensity score matching choice because analysis of matched pairs and interpretation of the results are more conceptually and computationally straightforward than those of optimal full matching. With limited emphasis on potential selection bias, researchers often justify using one-to-one greedy matching if covariate balancing is observed. Our findings highlight the importance of examining both internal and external validity in determining a propensity score matching method.

Table 2. Absolute standardized differences in selected covariates between supportive housing tenants and unplaced applicants before and after propensity score matching.

Table 3. Baseline characteristics and pre-supportive housing service utilization between retained and excluded subjects after one-to-one greedy matching.

Footnote: These estimates were based on the Hodges-Lehmann (full matching) and Wilcoxon (one-to-one matching) signed rank tests (two-sided p-values). Some estimates were non-estimable because a majority of subjects had zero outcomes.

Footnote: Because there were more placed subjects than unplaced ones, each random selection of placed subjects prior to matching produced slightly different matched pairs, which resulted in slightly different estimates.
There are some limitations to this evaluation. First, we have not identified variables among the covariates that are a common effect of treatment and outcome (colliders) or that lie on the causal pathway from treatment to outcome (mediators). Estimates could be biased by controlling for such variables via propensity score matching [15]. To minimize this potential distortion of the true association between treatment and outcome (i.e., bias either away from or toward the null), we used only baseline and pre-baseline covariates. Second, unobserved covariates could have biased estimates. However, the study focused on differences between 2 propensity score matching methods using the same data, and differential influences of unobserved covariates by matching method were quite unlikely. Despite these limitations, a main strength of this evaluation is the well-defined comparison group consisting of applicants eligible for the housing program. Another strength is that multiple administrative data sources provided a large number of baseline and pre-baseline characteristics, which improved the estimation of propensity scores.
Propensity score matching is a useful tool to reduce bias due to confounding and estimate a treatment effect when there is sufficient overlap in the distribution of propensity scores between treatment and control groups. Yet, unintended selection bias can arise when sub-setting the original population for matching is associated with program impact. In this evaluation, we provide a practical diagnostic approach to assessing potential selection bias in propensity score matching mechanisms that we used in a program evaluation. When inference is made to the whole study population in a program evaluation, we suggest considering optimal full matching over one-to-one greedy matching to strengthen both internal and external validity and minimize potential selection bias.

Supporting Information
Table S1 Covariates included in the propensity score models. This table lists all the covariates that we included in the propensity score models. (DOCX)

Text S1 This text includes R code for performing the two types of propensity score matching (optimal full matching and one-to-one greedy matching) and the Wilcoxon signed rank test. It also includes SAS code for performing the Hodges-Lehmann aligned rank sum test and obtaining Hodges-Lehmann estimators. (DOCX)