Robust policy evaluation from large-scale observational studies

Under the current policy decision making paradigm we make or evaluate a policy decision by intervening different socio-economic parameters and analyzing the impact of those interventions. This process involves identifying the causal relation between interventions and outcomes. Matching method is one of the popular techniques to identify such causal relations. However, in one-to-one matching, when a treatment or control unit has multiple pair assignment options with similar match quality, different matching algorithms often assign different pairs. Since all the matching algorithms assign pairs without considering the outcomes, it is possible that with the same data and same hypothesis, different experimenters can reach different conclusions creating an uncertainty in policy decision making. This problem becomes more prominent in the case of large-scale observational studies as there are more pair assignment options. Recently, a robust approach has been proposed to tackle the uncertainty that uses an integer programming model to explore all possible assignments. Though the proposed integer programming model is very efficient in making robust causal inference, it is not scalable to big data observational studies. With the current approach, an observational study with 50,000 samples will generate hundreds of thousands binary variables. Solving such integer programming problem is computationally expensive and becomes even worse with the increase of sample size. In this work, we consider causal inference testing with binary outcomes and propose computationally efficient algorithms that are adaptable for large-scale observational studies. By leveraging the structure of the optimization model, we propose a robustness condition that further reduces the computational burden. We validate the efficiency of the proposed algorithms by testing the causal relation between the Medicare Hospital Readmission Reduction Program (HRRP) and non-index readmissions (i.e., readmission to a hospital that is different from the hospital that discharged the patient) from the State of California Patient Discharge Database from 2010 to 2014. Our result shows that HRRP has a causal relation with the increase in non-index readmissions. The proposed algorithms proved to be highly scalable in testing causal relations from large-scale observational studies.


Introduction
Effective and evidence-based public policy decisions aim to manipulate one or many socioeconomic variables and analyze their impact on the desired outcomes [1]. The impact assessment is not associational but causal [1,2] which requires an understanding of the counterfactual-the difference in outcomes with or without the presence of the policy [3]. This is also true for any post policy evaluation [1]. A policy maker may design multiple policies and calculate the causal quantities including the effect of the proposed policies on different recipient groups, effects over time, possible trade-offs between competing goals, and, finally, choose the optimal policy [4]. The gold standard approach for calculating those causal quantities is conducting a randomized experiment [5][6][7][8]. In a randomized experiment, the experimenter will assign observations to either treatment or control group randomly; this randomness can avoid bias and eliminate confounding effects of covariates and thus can achieve unbiased estimation of treatment effects. In this case, a possible association between treatment and outcome will imply causation. However, many studies in health care, social science, economics, and epidemiology cannot be designed as a randomized experiment due to legal or ethical reasons. Randomization can also be impractical, time consuming, or very expensive. Hence, in most such cases experiments are performed on data that are collected as a natural process. Such experiments are called observational studies (also referred to as natural experiments or quasi-experiments) [9] and can be implemented in a prospective (collecting sample data as natural observation over time) or retrospective (experimenting on already collected data) way.
Making causal inferences from an observational study lacks the experimental elements of randomization on all possible background covariates (the observed and unobserved characteristics of a sample unit) [10,11] and are prone to bias and systematic confounding on covariates. However, with proper understanding of the underlying process and careful control of non-randomized data, it is possible to make a reasonable estimation of the causal effect [5]. Researchers have been utilizing matching methods for identifying causality since the 1940s [10] and it is one of the most popular methods. It was used or noted in as many as 486,000 academic articles involving causal inference (see S1 File). Matching methods examine the possibility of restoring or replicating properties of randomization based on the observed covariates [10]. In fact, matching attempts to retrieve the latent randomization within the observational data [12]. Being true to its name, matching methods aim to find a control group that is identical to the treatment group in terms of joint distribution of the observed covariates. As discussed by Stuart [10], and Zubizarreta [13], matching the empirical distribution of the covariates has several significant advantages. For example, matching forces the experimenter to closely examine the data, check the common support on the covariates, and assess the quality of inference. Even though the matching process can be complex, the outcome analysis is often done with simple methods [14]. For instance, the Rubin Causal Model (also known as Potential Outcome Framework) estimates the causal effect as the difference of expected outcomes between the control group and the treatment group [15]. Due to its simple architecture and other attractive properties (see [10,13,16]), matching has been used to make policy decisions in health care [17][18][19][20], education [21,22], economics [23], law [24], and politics [25].
In this paper, we adopt a robust methodology recently proposed by Morucci et al. [26] and extend it to accommodate causal inference from big data observational studies. We show the efficiency of the proposed methods by evaluating the impact of the implementation of the Medicare Hospital Readmission Reduction Program (HRRP) [27] on non-index readmissions -readmission to a hospital that is different from the hospital that discharged the patient.

Motivation and contribution
Motivation. The objective of the current one-to-one matching paradigm under the potential outcome framework is to find pairs (t, c) between samples t from treatment group T and c from control group C . A pair (t, c) is assigned in such a way that t and c are the same or very similar on a specific, pre-determined set of covariates X: fðt; cÞ : t ' cjX; t 2 T and c 2 C g. Over the years, researchers developed a wide array of algorithms to find such pairs, for example, Propensity Score matching [14], Mahalanobis Distance matching [14], Nearest Neighbour Greedy matching [28], Coarsed Exact Matching [29], and Genetic matching [30] are among the most popular algorithms. All these algorithms (including those not listed here) disregard the outcomes ðY 1 t ; Y 0 c Þ of corresponding pairs (t, c) in the assignment process. Though the matching process reduces bias in treatment effect estimation, disregarding the outcomes in the assignment process introduces a new source of uncertainty. If a sample t 2 T has multiple possible pair assignments fc 1 ; c 2 ; � � � ; c n g 2 C and have similar covariate balance but different outcomes (i.e., , by assigning pairs without considering the outcomes, an experimenter can estimate multiple degrees of causal effect (one for each possible assignment). Similarly, a sample from control group c 2 C can have multiple possible assignment options ft 1 ; t 2 ; � � � ; t n g 2 T . A possible scenario is presented in Fig 1 where within each circle we have multiple pair assignment options with almost similar match quality but different outcomes (outcomes are presented as the size of the data points). In such cases, different experimenters using different matching algorithms can get different pairs, hence, their causal effect estimates and conclusions on the experiment can be different. It is possible that two researchers having the exact same hypothesis and using the exact same data but with different matching algorithms reach completely opposite results due to this uncertainty. This problem is exacerbated for studies involving big data as we may have more pair assignment options. Therefore, making policy decisions in health care or any other field by using the matching method that disregards uncertainty due to pair assignments can lead to erroneous conclusions.
In 2012, Congress adopted HRRP as part of the Patient Protection and Affordable Care Act (PPACA) [27] to increase quality of care and reduce hospital readmission rates. HRRP penalizes hospitals when patients with certain clinical conditions (i.e., pneumonia, acute mayocardial infraction (AMI), congestive heart failure (CHF)) who have been discharged are readmitted within 30 days. The index hospital is always penalized even if the patient is readmitted to a different hospital (non-index hospital) [27]. Though readmissions to the index hospital are following a decreasing trend over the post HRRP periods, non-index readmissions are increasing [31,32]. This increase in the non-index readmission rate-approximately one fifth of all readmissions for Medicare patients [31,33]-creates suspicion that hospitals are possibly discouraging patients from readmission to avoid penalties introduced by HRRP. Moreover, a recent study identified that non-index readmissions are associated with higher odds of inhospital mortality and longer length of stay [34]. Therefore, we aim to identify whether HRRP has a causal relation to the increase in non-index readmission. Finding such causal relation involves analyzing a large volume of health care data and matching method would be vulnerable to the uncertainty discussed above. The robust method proposed in [26] to handle such uncertainty requires solving multiple Integer Programming (IP) models (a minimization and a maximization problem) iteratively. Using state-of-the-art integer programming solvers to solve those IP models for big data observational studies will be computationally expensive.
Contribution. In this work, we extend the robust causal inference testing method proposed by Morucci et al. [26] to handle large-scale observational studies with binary outcomes. To handle big data, first, we propose a robustness condition that identifies when a robust solution is possible and combines the maximization and minimization problems into a single problem. Second, we propose an efficient algorithm to calculate the test statistics for the robust condition. In addition, we propose two algorithms-one to solve the minimization problem and one to solve the maximization problem-for any condition that will show the degree of uncertainty for a selected number of matched pair. Finally, we implement the algorithms by testing the causal effect of HRRP to non-index readmissions using the State of California Patient Discharge Data and compare the computational efficiency with canonical IP solvers.
Remark 1. Please note, by "Robust" we imply "Robust to the choice of matching method": if A represents a set of all possible matching algorithms, a researcher choosing any algorithm A i 2 A and testing a hypothesis of causal effect will get the same result if she has chosen algorithm A j6 ¼i 2 A . Also, we are considering matching as pre-processing and plan to achieve robust test result from a large-scale observational study for a given set of good matches M identified by any matching algorithm A i 2 A .

Causal inference with matching method and robust test
In the Rubin Causal Model, a sample unit i from a set of observations f1; 2; � � � ; ng 2 S can have two outcomes or responses. The response Y T i is called treatment response when the unit i receives certain treatment (T = 1) and control response when unit i does not receive treatment (T = 0). It is assumed that the treatment assignment of any unit does not interfere with the outcome of other units [35]. This assumption is commonly known as the Stable Unit Treatment Value Assumption (SUTVA). Under this assumption, the treatment effect on a sample unit i 2 S is calculated as However, it is impossible to observe the counterfactual scenario for the same sample [15]. Under a certain treatment regime T 2 {0, 1} and identical conditions, we can only observe Y T¼1 [5,15]. Therefore, we cannot directly measure the treatment effect TE at an individual level. On the other hand, the causal inference literature offers a statistical solution to this fundamental problem by taking expectation over the observation set S , formally called Average Treatment Effect (ATE).
The ATE as defined in Eq 1 provides the opportunity to divide S into the treatment group T when T = 1 and control group C when T = 0 such that ðT [ C Þ ¼ S and work with their expectations. So, we can construct the ATE as E[Y 1 |T = 1] − E[Y 0 |T = 0] but, this form of ATE implicitly assumes that the potential responses are independent of treatment assignment: Though this independence assumption holds in randomized experiments, in general, it does not hold for observational studies as the experimenter rarely has control over the treatment assignment process. This problem is solved by making an assumption known as Strong Ignorability [7]. Let X 2 X and X 2 R k be the set of pre-treatment background variables (covariates) which characterizes the observations. The strong ignorability assumption states that the potential responses are independent of treatment assignment when conditioned on the covariates: Y 1 i ; Y 0 i ? TjX and every unit i 2 S has a positive probability to receiving treatment: 0 < Pr(T = 1|X = x) < 1. Another commonly used estimate of causal effect is Average Treatment Effect on Treated (ATT) which is defined under slightly relaxed assump- Both of these estimates are prone to bias as the treatment assignment process is not random. In the matching method, an unbiased estimate of causal inference can be achieved if treatment unit t 2 T is exactly matched with a control unit c 2 C in terms of the covariate set X 2 X [7]. However, in most of the applications, it is impossible to achieve exact matching [7,13,36,37]. A wide variety of matching methods are employed to make (t, c) pairs as similar as possible [7,13,38] or to find a subset of control group samples C � C that is similar to the treatment group samples T � T in the joint distribution of the covariate set X [29,36]. In this work, we consider one-to-one matching that aims to find a pair ðt; cÞ � ðT ; C Þ that is matched (either exactly or by some user defined balance function) on a set of covariates X � X.
Before explaining the difference between the classical method of causal inference [5,14,15] and the robust causal inference testing approach [26], let us define the set of good match M and the pair assignment variables a i,j . For a given set of possible matches M , we can perform hypothesis test in the following form with the null hypothesis being no causal effect and alternative being the opposite.
Under the classical approach of matching method, we can test these hypotheses first by defining a test statistic Λ, specifying an imbalance measure along with a tolerance limit on the imbalance. Then, we apply a matching algorithm A i 2 A to find the set of good match M that satisfies the imbalance limit; otherwise we tune the allowable imbalance limit to generate M . Robust approach differs from the classical approach moving forward from here (see Fig 2).
The classical approach picks one (out of many) possible combination of pairs from M and conducts the hypothesis test wherein, the robust approach calculate the maximum and minimum value of the test statistic (Λ max , Λ min ) and corresponding p-values to explore all possible assignment combinations within M which does not increase imbalance under Definition 1. The test will be robust if both Λ max and Λ min produce same conclusion on the hypothesis. We formally define the Robust Test in Definition 3.

Definition 3. (Robust Test) Let α be the level of significance set for the hypothesis H 0 and
Calculating the test statistic Λ generates an integer programming model which is computationally expensive for large scale data (see Numerical experiment section). In the following section, we propose a robustness condition following the Robust Test definition which will allow us to calculate a Λ robust = Λ min = Λ max for absolute-robust test and we can avoid solving two integer programming problems. From definition 3, it is clear that an absolute-robust test is always robust. In this work, we are interested in testing the hypothesis stated in Eqs (3 and 4) for binary outcomes: Y 2 {0, 1} with the McNemar's test [39] as proposed in [26].

Robust McNemar's test
McNemar's test is the ideal candidate for testing hypothesis in Eqs (3 and 4) as it deals with one-to-one matched pairs. It operates on a 2 × 2 contingency table (see Table 1) and the test statistics under the null hypothesis assume that the marginal proportions are homogeneous. Among the four types of matched pairs, we are mainly interested in the discordant pairs is the pair assignment operator defined in Definition 2. Here, B counts the number of pairs where treatment units has outcomes 0: Y 1 = 0 and control units has outcomes 1: Y 0 = 1 and C counts the discordant pairs where Y 1 = 1 and Y 0 = 0. Under the assumption of having at least 1 discordant pair: B + C � 1 we will use the test statistic Λ as defined in Eq (5) to test both hypotheses. Steps before covariate balance achievement remains same for each approach. In the remaining steps, Black arrows show the classical approach, and Blue arrows show the robust approach proposed in [26].
https://doi.org/10.1371/journal.pone.0223360.g002 Morucci et al. [26] proposed the following integer programming model that explores all possible assignment options and calculate maximum and minimum possible test statistics Λ max and Λ min , respectively.
Additional user-defined covariate balance constraints to find M The total number of discordant pair constraint in Eq (8) provides an opportunity to linearize the robust McNemar's test model. We can calculate the maximum and minimum test statistic by solving the integer programming model iteratively for different values of m until either a robust solution is obtained under definition 3 or m cannot be increased further. For the latter case, we will not find a robust solution.
As it is shown in Table 1, B is the total number of untied responses when Y 1 Under this definition of B and C, we provide the following propositions on the objective function of the robust McNemar's test and its optimal values. Proposition 1. The objective function Λ(a), has the following properties: For any B � 0, Λ(a) is monotonically decreasing in C for C � 1 and strictly decreasing for C > 1 Proof. Let C > 0, then for any B 2 R þ , we have which implies Λ(a) is strictly increasing in B for a fixed C. Similarly, let B � 0, then for any C � 1 we have, this proves the claims of Proposition 1. Before further discussion on Λ(a) and the optimality conditions, we introduce the following notations and definitions of maximum untied responses, for both B and C.
Definition 5. (Maximum type two discordant pair) C max is the maximum number of possible pairs between Y 1 i 2 T and Y 0 j 2 C where the treated observation has positive ("Yes") outcome but the untreated (control) observation has negative ("No") outcome, i.e., For a fix value of m, the McNemar's test model becomes linear and the objective functions become, Using the property of Λ(a) explained in Proposition 1, we can find the optimal solution. Proposition 2. Let C � 1 and denote m as the total number of discordant pairs, then the optimal pair (C � , B � ) is given by: Proof. From Proposition 1, we know that Λ(a) is monotonically decreasing in C when C � 1. Therefore, in the minimization problem, assignment will be made to maximize C until we are about to violate constraint B + C = m. When the total number of discordant pairs is set to m > C max , C will take the value of C max and B will take the value of m − C max just to satisfy the total number of discordant pair constraints and the solution will be optimal. If m < C max , the new C = m and the minimum value will be achieved at C = m and B = 0. Similarly, from Proposition 1, we know that Λ(a) is strictly increasing in B for any B 2 R þ . So, in the maximization problem, pair assignment will be made to maximize B within the feasible region. When the total number of discordant pair is set to m > B max , at optimal solution, B will take the value of B max and C will take the value of m − B max just to stay in the feasible region. When m < B max , the B will take the value m and the maximum value will be achieved at B = m and C = 0. Proposition 3. For the linear model, an absolute-robust estimate will be achieved if and only if the total number of discordant pair m = B max + C max .
Proof. According to the proposed approach to the causal inference estimate, an absoluterobust estimate is achieved when Λ(a) max and Λ(a) min is equal. For the McNemar's test model, the model becomes infeasible when m is set to m > B max + C max as we can only have B max + C max number of total untied responses. So feasible range of m is: 0 < m � (B max + C max ).
To prove the Proposition 3, we first set m to it's maximum value B max + C max . Using Proposition 2, in this case, the optimal solution for the Λ(a) max problem is: B = B max , C = m − B max = C max and the optimal solution for Λ(a) min problem is: C = C max , B = m − C max = B max . So, for m = B max + C max case, we get Λ(a) max = Λ(a) min and the solution is absolute-robust.
Conversely, m can take any integer value in the range 0 < m < (B max + C max ) which can lead to the following six cases. For each of the cases, we will find the optimal solution using Proposition 2.
The optimal solution for the minimization problem is The optimal solution for the minimization problem is C = m, B = 0 and the maximization problem is The optimal solution for the minimization problem The optimal solution for the minimization problem is C = C max , B = m − C max and the maximization problem is C = 0, B = m.
The optimal solution for the minimization problem is C = m, B = 0 and the maximization problem is C = 0, B = m.
• 0 < m � B max � C max < (B max + C max ): The optimal solution for the minimization problem is C = m, B = 0 and the maximization problem is C = 0, B = m.
For all of the above six cases, Λ(a) max 6 ¼ Λ(a) min , hence, the solution is not absolute-robust. Therefore, the total number of discordant pairs m have to be B max + C max to get an absoluterobust estimate. As we can see from the Proposition 2, the optimization problem has become a counting problem and can be solved efficiently for big data. However, the optimal solution calculated with Proposition 2 disregards the assignment constraints Eqs (9 and 10) and additional userdefined constraints. To find the optimal solution using the result from Proposition 2 that is feasible, we take a two-step approach. At the first step, we handle the user-defined constraints to find a good set of match M as a pre-processing step. We can use any off-the-shelf matching algorithm for that purpose or define a separate pair assignment model with different covariate balance measure to find M . At the second step, we partition the set of good match M into P partitions such that within a partition p 2 f1; 2; � � � ; Pg, any treatment unit t can be matched with any control unit c. A formal definition of a partition is provided below. Construction of partitions under Definition 6 ensures that only good matches are considered for assignment. In addition, Definition 4 calculates B max by pairing negative outcomes of treatment units and positive outcomes of control units which inherently satisfies the pair assignment constraints Eqs (9 and 10). Similarly, we calculate C max by assigning a pair between samples with positive treatment outcomes and negative control outcomes. Therefore, none of the treatment or control unit is used more than once in the pair assignment process which satisfies the pair assignment constraints Eqs (9 and 10). Now, using the above mentioned results, we propose Algorithm 1 which identifies the robustness condition and corresponding absolute-robust test statistic Λ(a) robust .
Algorithm 1: Absolute-robust test statistic Λ(a) robust at robustness condition ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi B max þ C max p By the sketch of the Algorithm 1, it seems like we are only matching the discordant pairs and ignoring the other possible pair assignments in the data, which is not true. In Proposition 4, we show that we match maximum possible pairs. pÀ c À C p max Þ control samples. Now, we can assign the remaining treatment and control samples into the other two types of pairs A and D to their limit: It is trivial to show that the for partition p, An example of maximum pair assignment is provided in Fig 3 where treatment outcomes (t) are sorted in descending order and control outcomes (c) are sorted in ascending order. In the left panel, we can have minðN p t ; N p c Þ ¼ 5 pairs at maximum. After assigning B max = 2 and C max = 2 according to Algorithm 1, we can assign only one pair to D max and A max = 0. Therefore, we achieve the maximum number of pair assignments. We follow the similar procedure in the middle and right panels.
In the Algorithm 1, we calculate the absolute-robust test statistics at robustness condition which an experimenter can use to find the corresponding p-value and compare with a predefined level of significance α to make conclusion on the hypothesis of no causal relation. Decisions made in this process will be free of uncertainty and robust to the choice of matching algorithms. If different experimenters perform matching on same data using different matching algorithms but follow the above mentioned procedure, all of their conclusions will be exactly the same.
Regarding the computational complexity arises due to big data, our proposed algorithm only involves counting elements in vectors and few algebraic operations. The counting processing can be done with the summation of vectors as we are dealing with only binary outcomes: summation implies the total number of positive outcomes and we can calculate the negative outcomes by subtracting it from the size of the vector. In addition, we only need to solve the problem once-at robustness condition. Therefore, the proposed algorithm will be highly efficient for big data.
While the Algorithm 1 directly calculates test statistics at robustness condition, a researcher might be interested in exploring the degree of uncertainty in the causal inference test. She may want to see how the uncertainty changes towards the robust estimate with respect to the number of discordant pairs matched. For this purpose, we propose the following two algorithms (2, 3) following the result of Proposition 2.
Algorithm 2: Maximizing the test statistics Λ(a) Require: Vector of outcomes ðY 1 ; Y 0 Þ 1 ; � � � ; ðY 1 ; Y 0 Þ P and increment in m: IB, LðaÞ min ¼ B À C À 1 ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi B þ C p Now, to show the worst case time complexity for the proposed algorithms, we first define q = max p2P (|(Y 1 ) p |, |(Y 0 ) p |), where (Y 1 ) p and (Y 0 ) p denote the outcome vector for partition p for the treatment and control group, respectively. Then, by definition, the number of operations needed in Algorithm 1 for calculating the term min(|Y 1 = 0|, |Y 0 = 1|) is 2q. The time complexity becomes T(p) = p(4q + 2). Therefore, Algorithm 1 has a time complexity of OðpqÞ. Again, using the same definition of q, we can write B max + C max � q + q = 2q. Since the time complexity for the loop (p = 1: P) is just 2 arithmetic operations, the overall time complexity of Algorithm 2 and 3 become T(p) = (B max + C max ){p(2)} � (2q).(2p) = 4pq. Therefore, Algorithm 2 and 3 have a time complexity of OðpqÞ.

Numerical experiment
In this section, we present the efficiency of the proposed algorithms with data from the State of California Patient Discharge Database and address an interesting hypothesis on the effectiveness of the HRRP implemented in October 2012.
A hospital's readmission rate is considered an important measure of its care quality. As noted, to increase the care quality and hold hospitals accountable, US Congress introduced the HRRP under the PPACA in 2012 [27]. The most important feature of this program is that the index hospital (the hospital that discharged the patient) is penalized if patients with pneumonia, congestive heart failure (CHF), and acute mayocardial infraction (AMI) are readmitted (to the index hospital or any other hospital) within 30 days of discharge. During the post HRRP period, the overall rate of readmission has been decreasing, which the proponents of HRRP are attributing to the success of the policy. However, in this period, readmission to different hospitals (non-index readmission) has been increasing [31,33]. Non-index readmissions have been found to be associated with longer lengths of stay and higher in-hospital mortality rates [34]. Hospitals are possibly discouraging patients seeking readmission to avoid penalties introduced by the HRRP. To examine the increase in non-index readmission post HRRP, we advance the following hypothesis and test it with the proposed algorithms with the level of significance α = 0.05.

Data description and covariate balance
In this research, we primarily used patient discharge data between 2010 to 2014 from California. We obtained this nonpublic data set from the California Office of Statewide Health Planning and Development (OSHPD), which collects in-patient data from California licensed hospitals. Each patient in this data set has a unique identifier that can be used to determine if a patient is readmitted. In addition, the data set also contains patient level information such as ICD-9 codes for clinical diagnosis, comorbidities, age, gender, discharge destination, patients' Zip code, and insurance information. When a readmission was identified, we ascertained the destination hospital of that readmission. Then, a binary variable was created with 0 if the patient was readmitted to the same hospital or 1 if different hospital. To test the hypothesis, we used this variable as our outcome: Y = 1 if readmitted to a different hospital or Y = 0 if readmitted to the same hospital.
Moreover, the OSHPD data set was merged with publicly available data from the Centers for Medicare and Medicaid Services, American Association Annual Hospital Survey and the Area Resource file. From these additional data sources, we obtained important hospital-level information including teaching status (membership in the Council of Teaching Hospitals), ownership type (public, non-profit, investor owned), hospital size based on number of beds (small: below 100 beds, medium: 101 to 399 beds, and large: 400 and above beds) and hospital location (rural, metro). We also included a proxy for patient household incomes based on the median income of a patient's residence Zip code. We divide the data into two sets: before and after October 1, 2012, the implementation date of HRRP. The treatment here is the implementation of HRRP, readmissions between February 1, 2010 and September 30, 2012 is considered as control group C (treatment T = 0) and readmissions from October 1, 2012 to November 30, 2014 is considered the treatment group T (treatment T = 1). To capture any potential readmission within 30 days of an index discharge, admissions before February 1, 2010 and beyond November 30, 2014 were excluded. A descriptive view of readmitted patients' characteristics is presented in Table 2.
We matched the patients based on the following covariates: age, gender, primary diagnosis, household income, Charlson Comorbidity Index, hospital location, hospital teaching status, hospital ownership status, and hospital size. We divided the covariates into two groups: 1) discrete and 2) continuous. The discrete covariates (i.e., gender, primary diagnosis, hospital location, hospitals' teaching status, hospitals' ownership status, and hospital size) are matched exactly. The continuous covariates (i.e., age, household income, and Charlson Comorbidity Index) were first divided into categories as shown in Table 2, then, the categories were matched exactly. This matching strategy resulted in 1822 partitions of data; within a partition any treatment sample can be matched with any control sample. The number of possible matched pairs in each partition can be calculated by taking the minimum number of treated or control samples in that partition. Among the 1822 partitions, we had 35,584 possible pairs. Though this matching approach seems ad hoc in nature, it is very similar to the well known method called Coarsed Exact matching (CEM) [29] with 1822 bins. Traditionally, CEM is implemented with a much lower number of bins due to the lack of common support between treatment and control groups but a higher number of bins makes for a finer covariate balance [40,41], which is the objective of any matching method. However, to implement the proposed algorithms, an experimenter is not limited to CEM or the matching method we used. Given a good set of matches created under Definition 1, we can always create the partitions under Definition 6.

Experiment and result
To test the hypothesis H 0 , first, we performed the matching operations in R [42] to obtain matched sets. Then, using the matched sets of data, we calculated the test statistic Λ(a) max and Λ(a) min using the 1) optimization model with an Integer Programming solver and 2) with the proposed algorithms. The integer programming model was implemented in AMPL [43] and solved with the commercial solver CPLEX [44]. We implemented the Algorithm 1, 2, and 3 in R [42]. All the experiments were performed in a Dell Precision workstation with 64 GB RAM, Intel(R) Xeon(R) CPU E5-2670 v3 processor running at 2.30 GHz. Table 3 shows the comparison of solutions obtained using an optimization model with CPLEX iterating over different values of discordant pairs (m) and proposed algorithm at robustness condition. The range of p-value achievable corresponding to the test statistics Λ (a) max and Λ(a) min is presented in Fig 4. The proposed Algorithm 1 directly identifies the robustness condition which is B max = 12082 and C max = 9448 and calculates the absoluterobust test statistics Λ(a) robust . Including all four types of discordant pairs (A, B, C, D), Algorithm 1 generates 35,584 matched pairs. Regarding the efficiency of the hypothesis test, for a 5% level of significance we would need at least 942 pairs to have 90% power (calculated using result from Connor [45]) wherein we have 35,584 matched pairs. The computation time required by the proposed algorithm is very insignificant compared to the time required by the optimization model. A major implication of the robust McNemar's test is that if the same experiment is conducted with as many as 19,000 discordant pairs, we can achieve any p-value between 0 to 0.23 (see Fig 4); some experimenter might reject the hypothesis and some might fail to reject the null hypothesis. Both experimenters, in this case, are right but their conclusions differed due to the fact that they choose different pairs. Any policy decision made using the matching method without considering this uncertainty has a possibility to fail. In regards to the hypothesis we made at the beginning of this section, we can reach a conclusion by using the p-value calculated at the robustness condition or the result from Table 3 and corresponding p-values from Fig 4. We can see that both the maximum and minimum pvalue < α when we match more than 20,000 discordant pairs. Therefore, we can reject the null hypothesis of no causal effect and conclude that the HRRP is a cause for increase in the nonindex readmissions. This result also suggests that not only the readmission rate but also the non-index readmission rate should be considered as a measure of health care quality.

Conclusion
Any policy decision or evaluation requires identifying the causal relation between policy alternatives and potential outcomes. Matching methods have become very popular in identifying such causal relations. However, in one-to-one matching, when we have multiple pair assignment options, matching method is vulnerable to uncertainty as the pair construction process does not consider outcomes. In this paper, we consider the integer programming model for robust causal inference testing approach with binary outcomes proposed by Morucci et al. [26] and develop scalable algorithms that can be used for large-scale observational studies. We identify a robustness condition that combines the maximization and minimization problem proposed in [26]. Instead of solving two computationally expensive integer programming models iteratively by increasing the number of discordant pairs until a robust estimate is achieved, we convert the problems into counting problems through a series of propositions. In addition, the proposed Algorithm 1 solves one problem instead of two separate problems and it is computationally efficient. Quadratic time complexity and the numerical experiment conducted on the State of California Patient Discharge Database show that the proposed algorithms are highly scalable. The numerical experiment shows an interesting result regarding a highly visible health care policy-HRRP-adopted as part of the PPACA in 2012 to improve health care quality. We identify that the HRRP is a cause of the increase of non-index readmissions, which has been shown to be associated with higher in-hospital mortality rate and a longer length of stay. Though the numerical experiment is performed with around 100,000  Table 3. The red line represents minimum possible p-value (corresponding to Λ (a) max ) and the blue line represents the maximum possible p-value (corresponding to Λ(a) min ).
https://doi.org/10.1371/journal.pone.0223360.g004 samples, the algorithms proposed in this paper can handle observational studies with millions of samples efficiently without further modification. In the future, we plan to develop similar robust causal inference testing algorithms with continuous outcomes for large-scale observational studies.
Supporting information S1 File. Popularity of matching method. Google Scholar search results to show the popularity of matching method in causal inference. (DOCX)