Optimal Caliper Width for Propensity Score Matching of Three Treatment Groups: A Monte Carlo Study

Propensity score matching is a method to reduce bias in non-randomized and observational studies. Propensity score matching is mainly applied to two treatment groups rather than multiple treatment groups, because some key issues affecting its application to multiple treatment groups remain unsolved, such as the matching distance, the assessment of balance in baseline variables, and the choice of optimal caliper width. The primary objective of this study was to compare propensity score matching methods using different calipers and to choose the optimal caliper width for use with three treatment groups. The authors used caliper widths from 0.1 to 0.8 of the pooled standard deviation of the logit of the propensity score, in increments of 0.1. The balance in baseline variables was assessed by standardized difference. The matching ratio, relative bias, and mean squared error (MSE) of the estimate between groups in different propensity score-matched samples were also reported. The results of Monte Carlo simulations indicate that matching using a caliper width of 0.2 of the pooled standard deviation of the logit of the propensity score affords superior performance in the estimation of treatment effects. This study provides practical solutions for the application of propensity score matching of three treatment groups.

Introduction PSM (propensity score matching) is widely used to reduce bias in non-randomized and observational studies [1,2,3]. The propensity score(PS), introduced by Rosenbaum and Rubin in 1983 [4], is defined as a subject's probability of receiving a specific treatment conditional on a group of observed covariates. As the representation of many covariates, it is estimated at baseline to control selection bias. There are four main propensity score methods-propensity score matching, stratification on propensity score, covariate adjustment using propensity score, and propensity score weighting [5]-among which PSM is used most commonly [6].
Propensity score methods have been widely applied to two treatment groups, but few studies has reported its use for multiple treatment groups. Imbens extended Rosenbaum and Rubin's work to multiple treatment groups. The multiple propensity score, defined as the probability of receiving a particular treatment conditional on the observed covariates, can be estimated by a multinomial logistic regression, given that there is no inherent order among the different treatments [7]. Wang et al. proposed the application of stratification on the multiple propensity score to dose-response relationships in drug safety studies [8]. With a practical step-by-step approach using data from a mental health study, Spreeuwenberg et al. introduced covariate adjustment using multiple propensity score [9]. However, none of the existing studies, as far as we know, had dealt with PSM for multiple exposure groups. Many key issues have not been resolved, such as the assessment of balance in baseline variables and sensitivity analysis, which limit the application of multiple PSM. In addition, how to select the optimal caliper is another key issue in multitreatment PSM. Austin summarized eight caliper widths commonly employed in two treatment group scenario [10]. Two or three treatment groups are common in clinical practice. The primary objective of this study was to choose the optimal caliper for three treatment groups by comparing the PSM methods of different calipers based on Monte Carlo simulations.

Theory
Two different approaches of matching are available in PSM: global optimal algorithms and local optimal algorithms (also referred to as greedy algorithms) [11]. Global optimal algorithms use network flow theory, which can minimize the total distance within matched subjects [12]. Global methods may be difficult to implement when there are large numbers of potential controls from which to choose, since they require the creation of a large distance matrix. Local optimal algorithms begin with the random selection of the first subject in the treatment group, and then find its closest control match based on the absolute value of the difference between propensity scores (or the logit of the scores). We chose to use local optimal algorithms for PSM among three treatment groups, as is done commonly for two treatment groups. The simulations were all based on 1:1:1 matching within a prespecified caliper width, without replacement. In this schema, the first subject in group-1 is matched with subjects from group-2 and group-3 using the smallest distance. The caliper width defines the range within which the propensity scores (or logit of the propensity scores) must fall to be considered a valid match [13]. Wider caliper widths can result in the inclusion of more subjects, a greater sample size, and more precision, but can also decrease balance between groups and introduce more bias in estimating treatment effects. The opposite is true for narrower caliper widths. Thus, only those subjects with distances fallen within the caliper width are being included. Those subjects not matched are supposed as outliers and being excluded. ATE (Average Treatment Effect) and ATT (Average Treatment Effect for the treated) are commonly estimated based on potential outcomes, or so called counterfactual outcomes [14]. ATE is a weighted average of ATT and ATU (Average Treatment Effect for the untreated). In PSM using caliper with two treatments or more, only sufficiently overlapped regions among groups were considered, that is to say, to match only those units in one group with covariate values that are sufficiently close to the values observed in other groups. When this is done, the quantity estimated is no longer the ATE or ATT, because the average is only taken over the region of common support. However, we can call it average treatment effects for the matched samples.
The sum of the probabilities of a subject receiving treatment g is defined as: where p g_x (g = 1,2…G) denotes the subject x's probability of receiving g-th group treatment conditional on the observed covariates, and G denotes the number of the group. For two treatment groups, the variance of the propensity score of the treatment group is equal to the variance of the propensity score of the control group, and therefore considering the probability of the subject receiving one of the two treatments is sufficient. When considering more than two treatment groups, however, the variances of the propensity score of g groups are not equal. Therefore, g propensity scores should be taken into account for the purpose of matching.
For two treatment groups, the matching distance D can be defined either by D 1 or D 2 : p 11 is the probability of receiving treatment 1 conditional on the observed covariates for the patient in treatment 1, p 21 is the probability of receiving treatment 1 conditional on the observed covariates for the patient in treatment 2, and etc.
For three treatment groups, each patient has three Propensity Scores. Notation of Propensity Scores in 3 treatment groups is presented in Table 1.
For three treatment groups, formula (3) can be extended as: For multi-treatment groups, the matching distance D can be extended as: where p gk_x (g = 1,2…G-1; k = 1,2…G) denotes the g-th group subject x g 's probability of receiving the k-th group treatment conditional on the observed covariates, and p jk_x (j = g+1,g+2…G; k = 1,2…G) denotes the j-th group subject x j 's probability of receiving the k-th group treatment conditional on the observed covariates. Propensity score matching methods have not been used for three treatment groups in prior empirical research, and no commonly used caliper widths can be referenced. We allowed caliper widths to range from 0.1 to 0.8 of the pooled standard deviation of the logit of the propensity score in increments of 0.1 using Monte Carlo simulations. For multi-treatment groups, the pooled standard deviation is defined as: where S i denotes the standard deviation of the logit of the propensity score, and i is the number of the sample combined.
Another key issue in this study is the evaluation of the balance of baseline covariates before and after PSM. The balance of baseline variables should be assessed by the standardized difference, which was first introduced by Flury and Reidwyl in 1986 [15]. Its use has recently become common in propensity score studies [16,17]. Standardized difference can be used for both continuous and dichotomous variables. For continuous variables, the standardized difference is defined as: where x x t and x x c denote the mean of the variable in treatment and control subjects, respectively, and s 2 t and s 2 c denote the variances of the variable in the treatment and control groups, respectively. For dichotomous variables, the standardized difference is defined as: where p t and p c denote the proportion of treatment and control groups, respectively. It has been suggested that a standardized difference of less than 0.1 indicates negligible imbalance in a given covariate between groups [18]. Standardized difference satisfies the two properties of balance assessment methods proposed by Imai et al. [19]. First, it should be a characteristic of the sample rather than a characteristic of the population. Second, this statistic should be unaffected by sample size. Hypothesis testing does not satisfy these two properties. A statistical test of hypothesis is influenced by the sample size, and as a result, better balance may be achieved in the matched samples than in the initial overall sample due simply to a smaller sample size. As of now, standardized difference has only been applied in the analysis of two treatment groups. For multiple treatment groups in this study, pairwise comparison is conducted to calculate standardized difference, among which the greatest standardized difference is chosen as an indicator to access the overall balance in our study.

Monte Carlo Simulations
We used Monte Carlo simulations to examine the performance of the eight different calipers.
2.2.1. We generated eight imbalanced variables among three treatment groups-4 continuous variables (C 1 , C 2 , C 3 , and C 4 ) and 4 dichotomous variables (D 1 , D 2 , D 3 , and D 4 [7])-and assumed that one continuous covariate and one dichotomous covariate had maximal standardized differences of 0.2 among the three groups. Similarly, we assumed that the remaining 3 pairs of continuous and dichotomous variables had maximal standardized differences in the full sample of 0.3, 0.4, and 0.5. The 8 variables are independent.
For each of the 1000 subjects, T j (j = 1, 2, 3) denotes a multinominal grouping variable. T j was transformed to a set of dummy variables where T 3 was considered as the base category (control group) as follows: Continuous covariates were generated randomly from the following distribution: where C k denotes the k-th continuous covariate. Thus, the distribution of the continuous covariates would be N(f k ,1) for group-1 subjects and group-2 subjects, and N(0,1) for group-3 subjects. Once we had randomly generated the treatment status, 4 continuous variables and 4 dichotomous variables for each of the 1000 subjects, we randomly generated a continuous outcome for each subject using the following linear model: where e*N(0,4), where X 1 and X 2 denote the treatment status. The coefficients of the 4 continuous variables and 4 dichotomous variables denote different correlations with outcome from weak to strong. The results of normality test for outcome Y are presented in Table 2. The outcome Y obeys normal distribution in the scenarios we considered.

2.2.3.
We randomly generated 1000 datasets with the required treatment effect using the above data-generating process (each randomly generated dataset consisted of 1000 subjects with required conditions). Within each randomly generated dataset, we estimated the propensity score by regressing treatment status on the 8 variables using a multinomial logistic regression model. The matched samples were obtained by matching subjects on the logit of the propensity score using nearest neighbor matching, with calipers ranging from 0.1 to 0.8 of the pooled standard deviations of the logit of the propensity score in increments of 0.1. The matching distance was described in Section 2.
2.2.4. The standardized difference(SD) between groups, matching ratio, relative bias(RB), and Mean Square Error(MSE) were applied to assess simulation results. Standardized difference is used for comparing the balance of matched samples. The matching ratio, defined as the number of matching subjects vs. total subjects in the treatment with the least number of subjects, provides an estimate of precision. The relative bias provides a measure of the magnitude of the bias. The smaller the relative bias, the smaller the bias of the propensity score model. The relative bias is defined as:

RB~100|
DETE-TTED TTE , where ETE and TTE denote the estimated treatment effect and true treatment effect, respectively. TTE is taken from the simulation model. ETE is average treatment effects for the matched samples. Group-3 was considered a referenced category(control group), and the relative bias values between group-3 and other groups were reported. Under the data-generating process, the true treatment effect is equal to 1when evaluating group-1 vs. group-3, and also when evaluating group-2 vs. group-3. The MSE is used to evaluate the precision of the propensity score model, which is defined as:

MSE(ĥ h)~Var(ĥ h)z(Bias(ĥ h,h)) 2 ,
where Var(ĥ h) denotes the variance of the estimator, and Bias(ĥ h,h) denotes the bias of the estimator. Group-3 was considered the referenced category, and the MSEs between group-3 and other groups were reported. MSE is a quantitative measure of the tradeoff between variance and bias. Our focus on MSE allows an investigator to select a caliper width that optimizes this implicit trade-off.
All simulations were performed in SAS, version 9.1.

Results
The mean standardized difference for each of the 8 variables, matching ratio, relative bias, and MSE are presented in Tables 3, 4, 5. RB 13 denotes the relative bias between group-1 and group-3, and RB 23 denotes the relative bias between group-2 and group-3. Similarly, MSE 13 denotes the MSE between group-1 and group-3, and MSE 23 denotes the MSE between group-2 and group-3.
When the pre_matching ratio of subjects was 1:2:3, the standardized differences for the 8 variables ranged from 0.072 to 0.096.When the pre_matching ratio of subjects was 2:3:5, the corresponding range was from 0.065 to 0.088 for the 8 different caliper widths. In both scenarios a standardized difference of each covariate of less than 0.1 indicates negligible imbalance between treatment groups. When the pre_matching ratio of subjects was 1:2:7, the corresponding range was from 0.094 to 0.129, and only 31.3% of the standardized differences were less than 0.1.
Irrespective of the pre_matching ratio of subjects, using a caliper width of 0.8 of the pooled standard deviation of the logit of the propensity score resulted in the greatest matching ratio. When the pre_matching ratios of subjects were 1:2:7, 1:2:3 and 2:3:5, the matching ratios were 99.9%, 98.9%, and 97.9%, respectively. Using caliper width of 0.1 of the pooled standard deviation of the logit of the propensity score resulted in the lowest matching ratio. Thus, when the pre_matching ratios of subjects were 1:2:7, 1:2:3 and 2:3:5, the matching ratios were 82.6%, 81.5%, and 77.3%, respectively.
When the pre_matching ratio of subjects was1:2:3 or 2:3:5, using a caliper width of 0.2 of the pooled standard deviation of the logit of the propensity score resulted in the lowest MSE between group-1 and group-3. When the pre_matching ratio of subjects was 1:2:3, using a caliper width of 0.2 of the pooled standard deviation of the logit of the propensity score resulted in the lowest MSE between group-2 and group-3. When the pre_matching ratio of subjects was 2:3:5, using a caliper width of 0.2 of the pooled standard deviation of the logit of the propensity score resulted in a MSE that was negligibly higher than the lowest MSE between group-2 and group-3. When the pre_matching ratio of subjects was 1:2:7, the MSE was only slightly lower than that of the highest MSE (MSE ranging from 0.260 to 0.313). When the pre_matching ratio of subjects was1:2:3 or 2:3:5, using a caliper width of 0.8 of the pooled standard deviation of the logit of the propensity score resulted in the highest MSE.

Empirical Study
We applied the matching schemes to a perspective observational clinical study designed to evaluate the mortality and morbidity of patients with heart failure already receiving optimal medical therapy in China. There were 601 patients recruited in this study, who were divided into 3 treatment groups by heart failure durations, including less than 1 year (n = 215), 1-5 years (n = 242) and more than 5 years (n = 144).
In order to reduce the bias from confounding variables, PSM was used to adjust the baseline differences. Standardized differences for covariates before and after matching is compared (We can't calculate the RB and MSE in this example because the average treatment effect is unknown). The propensity score was estimated by using a logistic regression model. Heart failure durations were used as the dependent variable, and the other 9 confounders that were identified in Table 6 as independent variables, including sex, rhythm, LVEDD (left ventricular enddiastolic dimension), age, NYHA (New York Heart Association) class, QRS duration, LVEF (Left Ventricular Ejection Fraction), Heart rate and Medicine therapy. We used 0.2 of the pooled standard deviation of the logit of the propensity score as caliper width for PSM. The matching ratio was 73.6%. The balances of baseline variables were assessed by standardized difference. The standardized differences for the 9 confounders were all bigger than 0.1 before PSM. The standardized differences of all confounders were smaller than 0.1 except Medicine therapy, which indicates negligible imbalance among treatment groups after PSM.

Discussion
The primary objective of the current study was to compare the PSM methods of different calipers and to choose the optimal caliper for three treatment groups. We summarize our findings as follows.
Firstly, on the basis of a fixed matching ratio, the test of a good propensity score model is the degree to which it results in the baseline covariates being balanced between treatment and control subjects [3]. However, it should not be expected that perfect balance will be achieved for all baseline variables between treatment and control subjects in the matched sample. Currently there is no balance test widely recognized for PSM of more than two treatments. In this study, the standardized difference was used for pairwise comparison among three groups, and the greatest standardized difference was chosen to evaluate the overall balance. If the greatest standardized difference is less than 0.1, it represents a meaningful balance among the three groups. Balance of covariates between groups was achieved when the pre_matching ratio of subjects was 1:2:3 or 2:3:5. When the pre_matching ratio of subjects was 1:2:7, the standardized differences were not reduced to below 0.1(but were below 0.15). The reason may be  that the sample size proportion of group-1 was too small. A standardized difference below 0.1 suggests negligible bias in a given covariate between two treatments. Whether this criterion is suitable for multiple treatments is a valuable question worth further discussion. Secondly, in spite of the pre_matching ratio of subjects, the matching ratio increased as caliper width increased. When the caliper width changed from 0.1 to 0.2 of the pooled standard deviation of the logit of the propensity score, the increase in matching ratio was substantial (increased by 12.4%, 8.8%, and 9.7% when the pre_matching ratios of subjects were 1:2:7, 1:2:3, or 2:3:5, respectively). When we used caliper widths from 0.2 to 0.8 of the pooled standard deviation of the logit of the propensity score in increments of 0.1, the matching ratios were only slightly lower than that of the highest matching ratio.
Thirdly, when the pre_matching ratio of subjects was 1:2:7, the MSE was only slightly lower than that of the highest MSE when we used caliper widths of 0.2 of the pooled standard deviation of the logit of the propensity score. When the pre_matching ratio of subjects was 1:2:3 or 2:3:5, using a caliper width 0.2 of the pooled standard deviation of the logit of the propensity score resulted in the lowest MSE or close to the lowest MSE. This shows that using this caliper results in greater precision than achieved with other calipers. Regardless of the pre_matching ratio of subjects, using a caliper width of 0.1 of the pooled standard deviation of the logit of the propensity score resulted in the lowest bias, followed by the caliper width of 0.2 of the pooled standard deviation of the logit of the propensity. Depending on the pre_matching ratios of subjects, our findings suggest that a width of 0.2 of the standard deviation of the logit of the propensity score is the optimal caliper.
There are certain limitations to the current study. Whether the greatest standardized difference among ''pairwise comparison between three groups'' was a suitable indicator for overall balance is worthy of further discussion. In our simulations, only continuous outcome variables were generated in the model to determine the optimal caliper width by estimating their differences in means, but not differences in risk (for binary outcomes). In our scenarios, the first four covariates (C 1 -C 4 ) were assumed to be independent normal random variables, while the last four covariates (D 1 -D 4 ) were assumed to be independent Bernoulli random variables, we didn't consider the difference between continuous variables and dichotomous variables as covariates. Additional pre_matching ratios of subjects and further iterations were not examined, due to computational limitations. In our scenarios, we used approximately 10 days on a PC (CPU: Pentium Dual-Core E5400; memory: 4G). Much of the time was spent in forming the multiple propensity score matched samples.
Our studies have provided practical solutions for the application of PSM to three treatment groups or more, however, it seems not a good choice to apply PSM with treatment arms more than 4, due to the insufficiently overlapped regions among groups, computational complexity and burden.