University of Groningen Propensity Scoring after Multiple Imputation in a Retrospective Study on Adjuvant Radiation Therapy in Lymph-Node Positive Vulvar Cancer

Propensity scoring (PS) is an established tool to account for measured confounding in nonrandomized studies. These methods are sensitive to missing values, which are a common problem in observational data. The combination of multiple imputation of missing values and different propensity scoring techniques is addressed in this work. For a sample of lymph node-positive vulvar cancer patients, we re-analyze associations between the application of radiotherapy and disease-related and non-related survival. Inverse-probability-oftreatment-weighting (IPTW) and PS stratification are applied after multiple imputation by chained equation (MICE). Methodological issues are described in detail. Interpretation of the results and methodological limitations are discussed. Introduction One of the pertinent challenges in estimating causal treatment effects from observational data is to control for confounding bias. The lack of randomization can lead to systematic differences between treated and untreated subjects. In this case, observeddifferences in outcome cannot securely be attributed to treatment exposure. Propensity scoring (PS) is the established statistical approach to reduce bias resulting from imbalanced measured covariate distributions across treatment groups [1–5]. The propensity score (PS) e(xi) for a subject i is the probability that the subject receives the treatment Zi, given its individual vector of covariates xi, e(xi) = P(Ti = 1|xi). Zi = 1 applies if subject i receives the treatment, otherwiseZi = 0. Various PS methods exist PLOS ONE | DOI:10.1371/journal.pone.0165705 November 1, 2016 1 / 12 a11111


Introduction
One of the pertinent challenges in estimating causal treatment effects from observational data is to control for confounding bias. The lack of randomization can lead to systematic differences between treated and untreated subjects. In this case, observed differences in outcome cannot securely be attributed to treatment exposure. Propensity scoring (PS) is the established statistical approach to reduce bias resulting from imbalanced measured covariate distributions across treatment groups [1][2][3][4][5]. The propensity score (PS) e(x i ) for a subject i is the probability that the subject receives the treatment Z i , given its individual vector of covariates x i , e(x i ) = P(T i = 1|x i ). Z i = 1 applies if subject i receives the treatment, otherwise Z i = 0. Various PS methods exist including PS matching [2], PS stratification [6] PS covariate adjustment [7] and inverse-probability-of-treatment-weighting (IPTW) [8]. All PS models are very sensitive to missing values, which are regularly encountered in retrospective studies. Patients or alternatively covariates with missing data have to be excluded from the analysis. Different approaches to solve the problem of missing values in PS analyses have been studied [9][10][11][12][13][14]. The multiple-imputationby-chained-equations (MICE) has been demonstrated to be an appropriate method to deal with missing values, if they are missing at random [13][14][15][16]. With this method, missing values are replaced by repeatedly drawn values from conditional probability distributions.
The results of the primary analysis and of one propensity score approach using available data of the AGO-CaRE 1 (Arbeitsgemeinschaft Gynäkologische Onkologie-Chemo-and Radiotherapy in Epithelial Vulvar Cancer) study were reported in a medical companion paper [17]. We re-analyzed the data, containing lymph-node positive vulvar cancer patients, of which a subgroup was treated with adjuvant radio(chemo)therapy. Associations with mortality from vulvar cancer (disease-related death (DRD)) and death from other / unknown causes (DOC) were analyzed as competing risks. In the present work, the methodology of data analysis using multiple imputation and propensity scoring to estimate causal effects from observational data is shown in detail and considerations about methodological issues are disclosed. The specific focus of this work is the detailed description and discussion of the applied statistical methodology. The use of the applied techniques are opposed to other potential techniques. Advantages and disadvantages are discussed.

Patients
In the AGO-CaRE 1 study, 1618 patients with advanced vulvar cancer (FIGO stage ! IB [UICC staging 2006]) treated between 1998 and 2008 were retrospectively collected [13]. In the present analysis, a subgroup of 346 patients with lymph-node involvement, age 90 years and documented follow-up status were included. Of these patients, 182 (52.6%) were treated with adjuvant radiotherapy, whereas 164 (47.4%) did not receive adjuvant radiotherapy.

Ethical Approval and Informed Consent
The study protocol was approved by local ethics committees at each center [leading vote: Hamburg (reference number PV3658)] and registered with clinicaltrials.gov (NCT01304667). Patients provided general written informed consent to access their medical records for scientific analysis at first contact with the respective study center. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Multiple imputation
The MICE approach is an established imputation method creating multiple complete data sets in which the missing values are replaced by estimates from a specified regression model using the observed data [13;15;16]. The procedure assumes the missing data to be missing at random, which means that the probability that a value is missing only depends on the measured data.
With these multiply-imputed data sets, estimation is possible without omitting covariates or individuals with missing values. Let x 1 ,. . .,x k be the k variables to be considered, with some or all of them having missing values. In the first step, all missing values are replaced at random. Then, the first variable with missing values, e.g. x 1 , is regressed on the variables x 2 ,. . .,x k . From this estimation model on observed values of x 1 , a prediction of x 1 is generated, from which the missing values of x 1 are replaced by simulated drawing. The next variable with missing values x 2 is regressed on all other variables x 1 ,x 3 ,. . .,x k , including the imputed values. Again, missing values of x 2 are replaced by drawings from the posterior predictive distribution of x 2 . The procedure is repeated for all variables with missing values. After completion of such one cycle, the procedure is replicated for ten cycles to create one complete data set with stabilized imputations, X complete . It is recommended to generate m = 3-20 data sets X ð1Þ complete ; . . . ; X ðmÞ complete [10;13;18]. In this analysis, m = 10 complete data sets were generated. Considered variables were those listed in Table 1, except resection margin and lymph node metastasis diameter, as these variables contained too many missing values (62% and 70%, respectively). The different types of variables (continuous, dichotomous, categorical) were accounted for, and implausible values (negative count data, nonexisting categories) were avoided [10;13]. To account for possible imbalances of the covariates amongst the treated and untreated patients, MI was conducted for both treatment groups separately [10].

Estimating treatment effects on disease-related and unrelated death
The effect of adjuvant therapy on the competing causes of death was computed separately in the 10 imputed data sets and then averaged over data sets using Rubin's combination rules [19]. The cause-specific hazards model was applied to consider the competition of the investigated causes of death. Using this approach, the specific events are analysed separately, treating the competing events as censored. Tests were performed two-sided with a 5% level of significance.

Propensity Scoring
Identifying confounders and estimation of the PS. Confounding exists if a baseline variable correlates with the outcome and is furthermore imbalanced between the treatment groups [20]. Identification of relevant variables to be included in the PS model is a key factor for confounding control. Simulations showed that variables that are related to the outcome should be included in the model, even though they are not associated with the exposure [21]. The variance of the estimated exposure effect is decreased by this technique, without increasing bias [21]. In contrast, variables that are imbalanced with respect to the exposure can only produce bias, if they were related to the outcome. Including variables associated with the exposure but not with the outcome would increase this variance without decreasing bias [21,22]. However, the ultimate aim of propensity scoring is to balance covariates. Therefore, an iterative procedure was described by   [8]. In his work, he proposed to start with an initially specified propensity score model and to evaluate the resulting balance. If important systematic differences between exposure groups remain, the PS model should be modified. This procedure can be repeated until the group differences have been "reduced to an acceptable level" [8]. In the present investigation, we follow these two approaches. In an initial step, all potential confounders associated with either one of the competing endpoints were taken into account. Associations with outcome were tested using univariate cause-specific hazards models stratified across the 10 imputed data sets X ð1Þ complete ; . . . ; X ð10Þ complete . After applying the PS and evaluating the balance achievement, the selection of confounders was adjusted iteratively until acceptable balance for all covariates was achieved.
The PS as defined by Rosenbaum and Rubin [1] represents the conditional probability of receiving the treatment of interest, given the variables observed at baseline. It was estimated using multivariate logistic regression of the treatment status on the confounding baseline covariates selected in the previous step. The resulting logit of the PS was then used to predict the probability of being treated [5;14]. Application of the propensity score. The IPTW method [7;23-25] was applied in each of the imputed data sets before averaging the results. The idea behind this method is to reweight the single individuals in the data set by the inverse probability of receiving the treatment, calculated from the PS. Thus, a sample in which the treatment assignment is independent of the distribution of measured covariates has been created [8].
Stabilized weights w i for individuals i have been defined as . The variable Z i indicates the treatment status for each subject i. If subject i was treated, then Z i equals 1, and 0 otherwise. PS i defines the individual propensity score for patient i and P is the rate of patients receiving the treatment. A robust variance estimator was used.
For comparison, PS stratification was applied. The study sample was split up into five strata according to quintiles of the PS. Stratified Cox regressions for comparing treated and untreated groups were performed for each imputed data set and then averaged. A robust variance estimator was used. Rosenbaum and Rubin stated that five strata according to quintiles of the PS can remove 90 percent of the bias in the considered covariates [1]. If the PS was correctly specified, the treated and untreated subjects within each stratum would have similar distributions of baseline covariates and could be compared directly without bias [26].
Balance check. If all prognostically relevant covariates were balanced between the treatment groups the result of a univariate group comparison could be interpreted as a causal effect. The recommended way to examine if continuous variables are balanced is to compute standardized differences between treatment groups, defined as the difference between treated and untreated means of each factor, divided by the pooled standard deviation [5]. A method for factor variables is also described in Crowson et al. [5]. In this work, balance was tested in the originally measured data and in the data sets after applying the individual propensity score techniques. In the multiply imputed data, results of the balance checks were averaged across data sets using Rubin's rules [19]. Absolute values of standardized differences <0.1 indicated sufficient balance [26].
Achieving balance across treatment groups is the goal of PS. Therefore, balance was checked after PS application. Depending on the resulting balance, the set of confounding variables was adapted and a new PS was calculated and applied.  [29] and "pbalchk" [30] were applied for PS.

Results
Out of 346 included patients with lymph-node positive vulvar cancer and documented followup, 182 (53%) received adjuvant radiation therapy. Median follow-up was 16.4 months (range 0.3-163.6 months). During follow-up, 78 disease-related deaths, 17 disease-unrelated deaths and 40 deaths due to unknown reasons were observed. Median disease-free and overall survival were 15.3 months and 42.7 months, respectively. The patient characteristics as well as their association with treatment assignment are summarized in Table 1. Several differences between the treated and untreated patients were observed. Treated patients had considerably better Eastern Cooperative Oncology Group (ECOG) performance status than untreated patients, but at the same time treated patients were older and had more affected lymph-nodes with larger lymph-node metastases. Additionally, distribution of the type of groin surgery and groin dissection differed amongst the groups.

Missing values
Of the 346 patients, only 24 (7%) were completely documented regarding the 15 considered covariates (Table 1), whereas 79 (23%) had more than three missing values. Of the 15 considered variables, only four were fully documented. Lymph-nodes metastasis diameter (70% missing) and minimum resection margin (62.4% missing) could not be considered as covariates due to their high missing rates.

Selection of confounders for computing the PS
Associations with disease-related death were found for the variables tumor stage, ECOG, number of affected nodes, type of groin surgery and age in the original data set as well as in the imputed data. Tumor stage, resection status, ECOG, number of affected nodes, type of groin dissection (uni-/ bilateral) and age were related to death from other / unknown causes in both, the original and the imputed data ( Table 2). These variables also show imbalances with regards to the standardized differences (Table 3). With respect to the achieved balance the best results were obtained by considering all these potential confounders except tumor stage to compute the PS.

Inverse-probability-of-treatment-weighting (IPTW)
Weighting the data according to the inverse probability of treatment resulted in predominantly balanced confounding variables (

PS stratification
Based on the quintiles of the PS, the data set was stratified into four groups with 69 patients and one group with 70 patients. Effect estimates pooled across strata and combined from the Observed data Imputed data  Table 3. Standardized differences to identify imbalances between treatment groups before and after imputing and inverse-probability-of-treatment-weighting.

Standardized differences observed data
Standardized differences multiply imputed data  (Table 4).

Discussion
In this study, MI followed by PS was applied to estimate the causal effect of radiation therapy in lymph-node positive vulvar cancer on competing causes of death using data from the AGO--CaRE 1 study [17].
In detail, ten complete data sets were generated using MI by chained equation (MICE), stratified by treatment allocation [10;15;16]. Then, confounders to include in the PS calculation were identified by testing univariate associations between baseline covariates and outcomes, stratified across the multiple complete data sets. Thirdly, the PS was computed for each subject. In a fourth step, PS was applied using IPTW and PS stratification [6;8;19;26;31;32]. With IPTW, each patient was weighted according to her PS value. Stratification entailed splitting each data set according to quintiles of the PS and performing analyses stratified over groups. The achieved balance of baseline covariates before and after MI and IPTW was evaluated by standardized differences. The cause-specific hazards model was used to evaluate associations between treatment allocation and the competing causes of death. Results were estimated within each of the imputed data sets and then averaged. This approach is comparable to the 'Within approach' from Mitra and Reiter (2012), who applied PS matching after MI [33]. In contrast, other approaches to overcome the problem of missing values in PS estimation have been studied [9;11;12]. For example, Qu and Lipkovich (2009) proposed an adaptation including indicators of missing data patterns in the PS model. This technique may reduce bias when data are not missing at random [11].
The results from both applied PS methods after MI were very similar and also comparable to those from the naïve group comparison without MI and PS. All approaches agree in showing no associations, but slight tendencies towards improved disease-related survival in patients receiving radiation therapy ( Table 4).
The two other established PS methods, PS matching and PS covariate adjustment, were not appropriate in this example. PS matching entails assigning matched sets of treated and untreated patients, sharing a similar PS value. Various techniques are available to select one or more untreated subjects to match each treated subject [2;8;26;31;32;34-38]. However, all PS matching approaches require the group of untreated patients to be large enough (two-to threefold larger than the group of treated subjects) to provide acceptable matching partners [32]. In the AGO-CaRE 1 data, the number of treated patients exceeded the number of controls. In such situations, matching would either result in heterogeneous matched pairs or in a small number of matched pairs, omitting a significant amount of treated or untreated subjects for which no appropriate matching partner could be found. Therefore, PS matching was not implemented in this work. With PS covariate adjustment, the PS is included as adjusting covariate in a Cox proportional hazards model, where the outcome is regressed on the treatment variable. There is currently no consensus whether there is a benefit of this method, compared to performing a multivariate regression model adjusted for the confounding variables [39]. Furthermore, differences in covariate variances between treated and untreated patients can cause difficulties. In such cases, D'Agostino (1998) [32] advises against this method, which was therefore not applied in the present work. The general purpose of the PS method is to reduce imbalances in outcome-related variables. Most imbalances that were present in originally observed and imputed data were cured after IPTW. The tumor stage, the type of vulva surgery, tumor diameter and the number of dissected groin lymph-nodes were still imbalanced in the multiply imputed data. However, these variables (except tumor stage) had no association with the outcome ( Table 2) and therefore do not bias the results.
The validity of the results is limited by the assumptions inherent to the methods used. MI requires that the missing values are missing at random, which led to a similar distribution of baseline variables (Table 3) and similar univariate associations between baseline variables and outcome (Table 2) in the originally observed and the multiply imputed data. A general assumption in all PS methods is the presumption of no unmeasured confounders. Confounders that are not accounted for because they are not or imperfectly measured or not measurable can still bear a bias. In the present example, psychological factors and quality-of-life aspects were not considered and may therefore bear the risk of unmeasured confounding.
In conclusion, the points to consider in our PS application were: 1. Missing values can be a problem in propensity score analysis. Different methods like MI, as applied here, or the use of an missing values pattern indicator [9;11;12] are available. In our example, results from a complete case analysis did not differ much from PS after MI.
2. Different propensity score methods are established, like matching, stratification or IPTW, each providing even more options to choose from. The IPTW method yields an averaged treatment effect of all subjects, in contrast to most matching methods, which calculate the averaged treatment effect of the treated patients. Further, if the groups have similar size, the IPTW method performs well [9].
3. The set of confounders to include in the PS have to be chosen carefully. The main goal of all PS methods is however to obtain balance in the variables considered to be "important" in the analysis.
4. For computing the PS, a logistic regression model is the established method. However, there are also other ways including boosting or CART models [40].

Author Contributions
Conceptualization: CE AS PN AR SM LW.

Data curation: PN AR UC TF AL MH LW SM.
Formal analysis: CE AS PN AR.