Randomly and Non-Randomly Missing Renal Function Data in the Strong Heart Study: A Comparison of Imputation Methods

Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.


Introduction
Missing medical data are common in epidemiologic studies. This problem is exacerbated in longitudinal studies, where missing data increase over time, sometimes compromising results [1]. Because it can be difficult or impossible to verify whether data are missing at random (MAR) or are related to the outcome of interest [2], some studies ignore missing data by dropping missing observations from the data set. Although such listwise deletion (LD) is sometimes the simplest or only way to conduct an analysis, this method can lead to inaccurate conclusions. In such cases, investigators impute missing values to generate complete data for a key variable or to maximize power for an intent-to-treat study. The objective of this article is to aid researchers in selecting the imputation method that will provide the most valid estimates by simulating the nature of missing data in the Strong Heart Study (SHS) and testing differing remedial measures. In this article, the importance of understanding the reason missing values may arise in the data is highlighted, and the measures most appropriate for each context are discussed.
Missing data can be classified in one of three ways: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). When the probability that a value is missing is statistically independent of its own hidden value and the value of all other variables, then the data are considered MCAR. Because of this independence, the bias resulting from the missing data is mitigated, but because most software will automatically apply listwise deletion (LD) to observations with missing data, reduction in power and information loss can be substantial. When the probability that a value is missing is correlated with the values of other variables, but is independent of its own hidden value, then the data are considered MAR. Finally, if the probability that a value is missing is correlated with its own value, then the data are considered NMAR. Understanding why data are missing in a data set is an important factor in choosing the remedial measure to be used for imputing the missing data. While general rules regarding imputation method applications apply, the appropriateness of any method is largely determined in each specific case based on numerous factors, including the nature of the missing data. In this article, we will investigate this issue using data from the SHS, a longitudinal study of American Indians.
Kidney disease is widespread among populations with high prevalence of diabetes. Individuals with chronic kidney disease (CKD) are at elevated risk for all-cause and cardiovascular disease (CVD) mortality [3][4][5][6][7][8][9], and an adverse CVD risk factor profile is associated with declining kidney function [10]. Occurrence of diminishing renal function along with worsening CVD risk factors is a phenomenon challenging to longitudinal study of these diseases in populations, such as American Indians, in which both conditions are common. The morbidity and mortality associated with these diseases result in large amounts of missing data, and these data are likely NMAR.
The SHS population has high rates of CVD, diabetes, and renal disease. Data on serum creatinine (Scr), a screening test for kidney function, are missing at varying rates across the three phases of the SHS study, in some cases probably for non-random reasons. Therefore, after selecting a subset of patients with full longitudinal data, we simulated missing data using four different methods. We then applied five different remedial measures to deal with the missing values. A Cox survival regression was executed on the full data set, predicting time to a hard atherosclerotic cardiovascular disease event based on Scr. A Cox regression model was then performed on the imputed data sets, and the hazard ratios for Scr values at each exam were compared with the hazard ratios of the first. We hypothesized that the pattern-mixture (PM) method would generate hazard ratios closest to those in the complete set.

Study population
The SHS was initiated in 1988 to investigate CVD and its risk factors in American Indians from 13 tribes in Arizona, Oklahoma, and North and South Dakota. The SHS design and methods have been published [11][12]. The SHS was approved by the Oklahoma Center Indian Health Service institutional review board (IRB), the Dakota Center Indian Health Service IRB, the Arizona Center Indian Health Service IRB, and the MedStar Health Research Institute IRB. In addition, this study was approved by the American Indian communities. All data were anonymized and de-identified before the analyses. This cohort of 4549 American Indians includes men and women ages 45-74 years seen at the first (1989)(1990)(1991), second (1993)(1994)(1995), and third (1998)(1999) exams. Participants receiving dialysis or who had a kidney transplant were eliminated from the data set. Of the 4549 SHS participants at baseline, 3605 were alive at Exam 3, and 2219 (62%) were women. A subset of 2264 participants with complete renal function data at all three exams was used for the current analyses.

Renal function measures
Scr was assayed by a single core laboratory using automated alkaline picrate rate methodology [12]. Urinary creatinine was measured at all three exams [10,13].

Cardiovascular and diabetes surveillance
CVD surveillance for nonfatal and fatal clinical events occurred throughout the follow up and is complete through December 31, 2003 [14]. Criteria used to define definite fatal myocardial infarction, stroke, coronary heart disease, and nonfatal CVD have been published [15], as have methods for ascertaining incident CVD events [13,[16][17]. Incident diabetes was identified by self-report, use of hypoglycemic agents, or fasting glucose !126 mg/dl [18].

Creation of missing data models
The complete SHS data set included age, gender, history of diabetes, CVD status, and three serial measures of Scr. The outcome variable of interest was CVD. Four models with randomly (Models 1-3) and non-randomly (Model 4) missing data were created from the complete data set using the algorithms described below.
Model 1, Base data with MAR data. In Model 1, the data were missing at random. This model was created from the complete data set.
Model 2, Autoregressive MAR data. Let M be the matrix that represents the missing data, so a value of 0 indicates that the observation exists and 1 indicates that it is missing. Then consider the Y p matrix, which consists of Scr measurements on the 3605 subjects: Let M p be the matrix of missing data associated with Y p. We proposed the following autoregressive MAR generating mechanism: corresponds to the entry in the row I and column j in M. The probability that a subject has a missing value at stage j depends on whether the subject had a missing value in stage j -1. Thus, missing Scr data at Exam 3 are not independent of missing Scr data at Exam 2 and any algorithm generating MAR data should account for that, even though missing Scr data at Exam 2 are independent of missing data at Exam 1. When fitting an autoregressive model of the form P(M p [i,2] = 1) = α + β × 1{M p [i,1] = 1}, the slope coefficient has a t-value of 1.48 and thus is not significant at the 10% alpha level. Missing values at Exam 1 were selected at random with a Bernouli random generator.
Model 3, Autoregressive MAR data augmented with gender and age. For gender, let X be a binary vector denoting gender, with women having an entry of 0 and men an entry of 1. Women and men appear to have different rates of missing data. At Exams 2 and 3, women have significantly lower rates of missing data than men (all P values <0.01). At Exam 1, the significant difference between genders was not observed (p>0.05).
Thus, we proposed the following algorithm to generate values for Model is an indicator function with a value of 1 when Scr is missing in the previous exam. We did not apply this model to Exam 1 missing data, because no significant difference between genders was observed. The following generalized linear models with a binary response variable for the rate of missing data and an identity link were fitted using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [19]. The BFGS method solves an unconstrained nonlinear optimization problem using gradient descent. It is a member of a broad class of hill-climbing optimization techniques. We applied it to predict the rate of missing data at Exams 2 and 3 as follows: ; gender coefficient has a t statistic of 3.44, while the autoregressive coefficient has a t-value of 21.75.
Age is a variable that influences the rate of missing observations. The elderly may experience difficulty getting to the testing site, may move to retirement homes or hospices, or may die, thus missing exams. The relationship between the probability of missing data and participant age at Exams 1 and 2 was weak and unexpectedly U-shaped, while the probability of a relationship between participant age and missing data at Exam 3 was a more pronounced Ushape. One explanation is that younger participants may drop out more frequently because of employment or relative lack of concern for personal health. The following formulas define Model 3, using age (denoted by Y) and previous missing data as explanatory variables (all of the coefficients were significant, thus we excluded the t-values).
Exam 2: Finally, including all the explanatory factors, we propose Model 3 with gender and age as a MAR algorithm: Adding gender does not dramatically alter the age and the previous missing data coefficients from this model, yet the likelihood-ratio tests show that adding age decreases residual deviance in Exams 2 and 3 significantly. In this analysis, Model 3 with gender and age was the one used to test the performance of the different imputation methods on autoregressive missing data augmented with covariates.
Model 4. Empirical NMAR data. Similar to the empirical models for MAR data (Models 1-3), the following model estimates two potential NMAR mechanisms for our complete Scr data set. The algorithm used was developed by Troxel et al. [20]. This algorithm assumes that data are multivariate normal and Let π i denote the probability that the observation is missing at stage i. We then fit the following model for women and men separately: logit(π i ) = α+βy ij ; where y ij was the Scr value in phase i for subject j. We estimated the parameters above using the Likelihood function described by Troxel et al. Using the S-Plus library MASS and the function OPTIM [21] to maximize the likelihood, we obtained standard errors of the estimates by retaining the Hessian matrix, inverting it, and taking the square root of the diagonal.
Empirical NMAR data model for women and men. The algorithm used in this model was developed by Troxel et al. [20], which assumes that the data are multivariate normal and f The Troxel method: Let π i denote the probability that the observation is missing at stage i. We will then fit the following model for women and men separately: logit(π i ) = α+βy ij ; where y ij is the Scr value in phase i for subject j. We will estimate the parameters above by using the Likelihood function [20]. The parameter estimates and their standard errors (in parentheses) were obtained for women and men for the equation above.
For women, the parameter estimates were significant, but for men, the intercept was significant. At Exam 2, the β estimate was not significant at the 5% significance level, and at Exam 3 it was marginally significant at that level. The probability of missing data is, therefore, less dependent on Scr values for men than for women.
The fitted missing data probability functions were used to generate the last missing value mechanism. Although the fitted NMAR model was not significant for men at Exam 2, it is important to examine the effects of missing data on the outcome measures.
The four models are summarized below: , and PM) to compare the efficacy of these methods in providing Scr values for randomly and non-randomly missing data.

Imputation methods
The five imputation methods used to fill in missing Scr values are described:

Statistical Analysis
Baseline characteristics were provided for the SHS participants with complete renal data. Means with corresponding 95% confidence intervals (CIs) of Scr were generated by each imputation method and compared with the complete data set for each of the generated missing data models. Cox proportional hazard models were used to examine associations between CVD risk and imputed Scr in the four models. Hazard ratios (HRs) and 95% CIs were calculated and adjusted for age, gender, and diabetes status. The five imputation methods were compared to determine a) differences in distributions between the imputed data sets and the complete data set as measured by the mean and 95% CIs, b) discrepancies in estimates of the adjusted HRs of incident CVD and limits of the 95% CIs for each imputed data set versus the complete set at each exam using a non-time-dependent covariate Cox proportional model, and c) the significance of the models examined in each imputed data set compared with the complete data set using a time-dependent covariate Cox proportional model. The method that provided HR estimates closest to those generated by the complete data set was considered to perform the best.
In the 2264 SHS participants with complete Scr data for all three exams, mean baseline Scr was 0.88 mg/ml (standard deviation [SD] = 0.3 mg/ml), and mean age was 54.9 years (SD = 7.4 years). Sixty-four percent of participants were female, 37.3% had diabetes, and 4.6% had prevalent CVD. During a median 10 years of follow up (Exam 1 to December 31, 2003), 447 (19.7%) experienced a CVD event ( Table 1).
The distribution of Scr in the four models is presented for each of the five imputation methods (Table 2). Models 1, 2, and 3 represent MAR data, while Model 4 represents the NMAR data. All the imputation methods underestimated the mean, especially for Model 4 at Exam 3, in which the rate of missing data was approximately 40%. LD performed slightly better than the other imputation methods in Models 1, 2, and 3, but underestimated the mean and SD in Model 4 at Exams 2 and 3. Imputation using the mean also performed well in Models 1, 2, and 3 but, like the LD method, underestimated the mean and SD in Model 4 at Exams 2 and 3. Imputation using the AV did not perform as well as LD in Models 1, 2, and 3, but it was slightly better in not underestimating the mean and SD in Model 4 at Exams 2 and 3 than the LD and MI methods were. Imputation using MI overestimated the mean in Models 1, 2, and 3 at Exam 3 and underestimated the mean and SD in Model 4 at Exam 3. The PM method at the 10 th , 25 th , and 50 th percentiles of the Scr data estimated the mean in Model 4 at Exams 2 and 3 across all levels of percentiles better than the other imputation methods did. Additionally, the higher percentiles seemed to provide estimates closer to those made with the complete data set. The imputation techniques also were compared with respect to adjusted HRs and 95% CIs for CVD risk at Exams 1, 2, and 3 across the four models (Table 3). The complete case data showed significant relations between Scr and CVD risk only at Exam 2. Performance of the imputation methods varied with the different data sets. At Exam 1, all the imputation methods gave similar estimates of HRs. All the imputation methods showed significant relations between imputed Scr and CVD risk in Model 2 at Exam I. At Exam 2, all the imputation methods yielded significant results between the imputed Scr values and CVD risk in Models 2 and 3. The biggest difference in the results was found in Model 4 at Exam 2. A non-significant protective effect (i.e., HRs <1) was found in Model 4 using the LD and mean imputation methods. At Exam 3, all of the imputation methods yielded similar results in Models 1, 2, and 3, but overestimated the hazard ratio in Model 4, compared with the complete data set. However, the estimated HRs using the PM and AV methods were closer to the HRs from the complete data set.
In a time-dependent covariate Cox model (Table 4), in which we did not break down the data into the three exam periods, the adjusted HRs with 95% CIs for CVD risk were stronger for all the imputation methods in Model 4, compared with the complete data set. In Model 4, all the imputation methods yielded significant results, but PM and AV performed better than the others.

Conclusions
We developed four models of missing data, generated from a complete data set, and modeled the missing Scr data across the three SHS examinations. Results varied depending on the imputation technique. For the MAR model with 20-40% missing data, all the imputation methods performed similarly. For the NMAR model, AV performed almost as well as PM, possibly because renal dysfunction progresses over time, so using AV may generate results close to those generated with the complete data set. Using different imputation methods to estimate missing Scr values provided varied results, with some methods overestimating Scr and others underestimating it. No one method was superior to the others across all models and exams.
This finding is reasonable because we used two empirical mechanisms to generate patterns of missing data (MAR and NMAR), and because Scr is a protein that changes over time. The PM method performed better in Model 4 across all exams, providing hazard ratio estimates closest to those generated with the complete data set. The PM method outperformed the others on both the mean estimation and the hazard ratio, providing estimates that were closest to those made with the complete data set. Finally, these findings suggest that the PM method for imputing missing Scr values performed better for the data not MAR. Most imputation methods work well when data are MAR. For data not MAR, the PM method performed best for imputing renal function data in this large study of progressive CKD and CVD.
These findings reinforce the point that remedial methods chosen to manage missing values are dependent on the specific case, the nature of the missing data, the nature of the random variable, and the correlation between the missing data and the values in the data. In this case, because of the reasonable assumption that the missing data primarily arose from deterioration in kidney function and resulting mortality (i.e., not MAR), the PM method addressed the missing value problem the best. The assumption of the cause of mortality is as important as the other factors in choosing a remedial method. If mortality is a potential cause of missing values and is not correlated with the variable to be imputed, then the missing values are more appropriately treated as MAR. In that case, methods such as MI or auto-regression are more appropriate.
Further issues may arise regarding the specification of the model used for the imputation. The researcher must exercise judgment regarding whether all factors and covariates that affect the variable to be imputed are controlled for in the regression model. If all the factors are not available, then the researcher must decide whether to impute them based on partial information or use listwise deletion. This study is strengthened by its large cohort, which allowed us to model missing data, use several covariates to explain the missing data, and generate several types of missing data. This work provides a basis for handling missing data by identifying whether the data are MAR or NMAR.
When the missing data mechanism is not accounted for when performing statistical analyses, the resulting estimates can be misleading. The type and extent of missing data should be considered when choosing an imputation technique. Taking steps to determine whether data are MAR or are missing because of some mechanism can help investigators select the best imputation method for their data.