Abstract
Background
Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data.
Objectives
We take the point of view that it has already been decided that the imputation should be carried out using multiple imputation by chained equations and that the only decision left is that of a subroutine for the one-dimensional imputations. The subroutines to be compared are predictive mean matching, weighted predictive mean matching, sampling, classification or regression trees and random forests.
Methods
We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances and coefficients of linear regression models, logistic regression models and Cox regression models. As real data we use data of the survival times after the diagnosis of an obstructive coronary artery disease with systolic blood pressure, LDL, diabetes, smoking behavior and family history of premature heart diseases as variables for which values have to be imputed. While we are mainly interested in statistical properties like biases, mean squared errors or coverage probabilities of confidence intervals, we also have an eye on the computation time.
Citation: Kampf J, Dykun I, Rassaf T, Mahabadi AA (2025) A comparison of various imputation algorithms for missing data. PLoS One 20(5): e0319784. https://doi.org/10.1371/journal.pone.0319784
Editor: María Paula Fernández García, Universidad de Oviedo, SPAIN
Received: August 6, 2024; Accepted: February 8, 2025; Published: May 12, 2025
Copyright: © 2025 Kampf et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Our data contain sensitive information about the health of the patients. Thus making data publicly available was prohibited by the University Hospital of Essen. The data from the simulation part is made available as supplementary material. Anonymized data from the real data part will be provided upon reasonable request from the relevant data access committee (email: datenschutz@uk-essen.de).
Funding: The author(s) received no specific funding for this work.
Competing interests: Jürgen Kampf and Iryna Dykun declare no conflict of interest. Tienush Rassaf received honoraria, lecture fees, and grant support from Edwards Lifesciences, AstraZeneca, Bayer, Novartis, Berlin Chemie, Daiichi-Sankyo, Boehringer Ingelheim, Novo Nordisk, Cardiac Dimensions, and Pfizer, all unrelated to this work. Amir Mahabadi received honoraria, lecture fees, and/or grant support from Amgen, Daiichi-Sankyo, Edwards Lifesciences, Novartis, Sanofi, all unrelated to this work. Tienush Rassaf and Amir Mahabadi are co-founders of Mycor GmbH, a company focusing on the development of AI-based ECG-algorithms. This does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development or marketed products associated with this research to declare.
1. Introduction
Missing data is a major problem in medicine and other branches of science [1,2,3,4]. There are many competing methods to deal with missing data. Complete-case analysis is simple to implement, but has undesirable statistical properties. Maximum-likelihood estimation has good statistical properties, but requires the implementation of whole algorithms, since there are usually no analytical expressions in the presence of missing data. Even worse, these algorithms are rarely implemented in statistical software packages, so this implementation has to be done by the scientist who wants to analyze the data. Imputation methods, i.e., algorithms that generate values and impute them for the values missing in the data, are available in standard software packages and may have good statistical properties depending on their exact specification.
There are two kinds of imputation methods for data sets that have missing values in several variables: joint modeling and multiple imputation by chained equations [5]. In joint modeling one builds a single multivariate model for all variables jointly. This method is computationally expensive, unless a multivariate normal distribution is suitable as joint model. Multiple imputation by chained equations [6], also called fully conditional specification or sequential regression, is an imputation method for which only the marginal distributions of the variables given all other variables have to be specified. It works by imputing the variables one by one using a subroutine. This subroutine can be an arbitrary algorithm for imputing values in a data set in which only one variable has missing values. Multiple imputation by chained equations is available through the R-package mice [7].
There are many competing subroutines for multiple imputation by chained equations. Predictive mean matching [8], weighted predictive mean matching [9], simple sampling, classification and regression trees [10] and random forests [11] work for any kind of numerical data. For continuous variables, unconditional mean imputation and (Bayesian) linear regression [12, p. 167, 13] are useful. For binary variables one can use logistic regression [14,15]. A survey is given in [6].
We consider three different missing data mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). These mechanisms assume that there is some unobserved full data of which several values are removed before the rest is observed. Under MCAR the choice whether a value is removed is made without regarding the data. Under MAR the choice whether a value is removed may depend on other observed values, however it may not depend on the values to be decided or on further unobserved values. Under MNAR the choice whether a value is removed may depend on the whole data and, in particular, on the value itself.
There are a lot of papers comparing different methods for treating missing data, e.g., [5,14,15,16,17,18,19,20,21,22,23] or the literature cited in [24]. In this article we take the point of view that the decision to treat the missing values using multiple imputation by chained equations has already been made and that it only remains to choose the subroutine. Such comparisons have been made before in [25] for Cox regression with MCAR and MAR data and in [11] for linear regression and logistic regression with MAR data. Here we perform a large simulation study considering the estimation of means, the estimation of variances and covariances, linear regression, logistic regression and Cox regression for MCAR, MAR and MNAR data. Moreover, we apply the methods to data of patients suffering from an obstructive coronary artery disease (CAD). To our knowledge, this is the first time that a truly large simulation study concerning the choice of a subroutine for multiple imputation by chained equations has been conducted, and the first time that various multiple imputation algorithms have been compared on data of patients suffering from obstructive CAD.
Multiple imputation by chained equations has certain limitations. On the theoretical side it is not satisfactory that parameters estimated based on imputed data do not converge to the true values as both the number of iterations and the sample size tend to infinity, even under the MCAR assumption. Nevertheless, under MCAR the estimated parameters get so close to the true parameters that this is not a practical problem. Under MNAR assumptions, however, the difference may be so big that it is a problem even from a practical point of view. Finally, the results of multiple imputation by chained equations depend on the choice of the subroutine, while this dependence is largely unexplored. The present paper will help in understanding this dependence.
We consider 15 different simulation scenarios. Under each scenario we simulate 1,000 data sets and based on these data sets we calculate six statistics for each variable and each subroutine. With 15 simulation scenarios and six statistics per variable and subroutine, this is a relatively large simulation study, and 1,000 simulated data sets are enough for a reasonable statistical analysis. As for every simulation study, our results are random to a certain degree. Moreover, a simulation study can only show that an algorithm performs well or poorly without giving an explanation of why.
We only include subroutines that are implemented in the R package mice. Since we will deal both with continuous and with categorical variables in this study, we consider only subroutines that can deal with both of them. Thus five subroutines qualified: predictive mean matching, weighted predictive mean matching, sample, classification or regression trees and random forests.
The rest of this paper is organized as follows. In Section 2 we explain the algorithms compared in this paper. Section 3 provides details on the real data. In Section 4 we report the setup and the results of our simulation study. In Section 5 we discuss the results and point out directions for future research.
2. Algorithms
Multiple imputation by chained equations is an algorithm for creating imputations in a data set which has missing values in several variables. It is an iterative algorithm. In the initialization, for every missing value a random value sampled uniformly from all observed values of the same variable is imputed (while it is remembered which values are real and which values are imputed). In the iteration step one variable is chosen and all its imputed values are replaced by new imputed values which are created using a subroutine. Such a subroutine is an algorithm which imputes values in a data set which has missing values in only one variable. Several possible choices for the subroutine will be given below. In order to apply the subroutine, all values of the chosen variable that have been imputed before are removed, while for all other variables the previously imputed values are treated as if they were real. Then the subroutine creates an imputation for the missing values of the variable at hand. All variables with missing values are treated one by one like this and, when all variables have been treated, one starts again with the first variable. This procedure is repeated several times – five times in this paper; in the R package mice it can be specified by the parameter maxit – and hence one completed data set is obtained. Since it is desirable to have multiple completed data sets [1,2,3,6], we have to run the whole algorithm multiple times – in this article 20 times; in mice this is specified by the parameter m.
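The iteration scheme just described can be sketched as follows. The subroutine interface (a function receiving the other variables, the missingness mask and the current column) is an assumption made for illustration, not the interface of the mice package:

```python
import numpy as np

rng = np.random.default_rng(0)

def mice(data, impute_one, maxit=5, m=20):
    """Multiple imputation by chained equations (illustrative sketch).

    data       : 2D float array with np.nan marking missing entries
    impute_one : subroutine(others, miss_col, y) -> values for the missing
                 entries of one variable (only y[~miss_col] is real data)
    Returns a list of m completed data sets.
    """
    miss = np.isnan(data)                       # remember which values are imputed
    completed = []
    for _ in range(m):                          # m independent runs -> m completed sets
        X = data.copy()
        for j in range(X.shape[1]):             # initialization: sample from observed values
            obs = X[~miss[:, j], j]
            X[miss[:, j], j] = rng.choice(obs, size=int(miss[:, j].sum()))
        for _ in range(maxit):                  # chained-equations iterations
            for j in range(X.shape[1]):
                if not miss[:, j].any():
                    continue
                others = np.delete(X, j, axis=1)    # previously imputed values treated as real
                X[miss[:, j], j] = impute_one(others, miss[:, j], X[:, j])
        completed.append(X.copy())
    return completed
```

With `maxit=5` and `m=20` this mirrors the settings used in the article.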
A first possible subroutine is predictive mean matching (PMM). A regression model is fitted with the incomplete variable as target and all other variables as covariables, based on the cases for which the incomplete variable is observed, using Bayesian regression. Then for each case in which the incomplete variable is missing, a predicted value is sampled from its posterior distribution and one looks for the five cases in which the incomplete variable is non-missing whose predicted values are closest to this sampled value. Then the observed value of one of these five cases is chosen at random and inserted for the missing value. Notice that predictive mean matching has an extension to several variables with missing values, but this is not advised if there is only a small number of complete cases. PMM is the default subroutine of mice. Further information can be found in [8] or in [6, Section 3.4].
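A minimal sketch of the matching step, assuming an ordinary least-squares fit in place of the Bayesian regression and approximating the posterior draw by perturbing the coefficients (both simplifications for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def pmm_impute(x_obs, y_obs, x_mis, donors=5):
    """Predictive mean matching for one incomplete variable (sketch)."""
    A = np.column_stack([np.ones(len(x_obs)), x_obs])
    beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    yhat_obs = A @ beta                                    # predictions for observed cases
    beta_star = beta + rng.normal(scale=0.01, size=beta.shape)  # crude "posterior" draw
    A_mis = np.column_stack([np.ones(len(x_mis)), x_mis])
    yhat_mis = A_mis @ beta_star
    imputed = np.empty(len(x_mis))
    for i, yh in enumerate(yhat_mis):
        idx = np.argsort(np.abs(yhat_obs - yh))[:donors]   # five closest predicted values
        imputed[i] = rng.choice(y_obs[idx])                # donate one observed value
    return imputed
```

Because imputed values are always donated observed values, PMM never produces values outside the observed range.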
Weighted predictive mean matching, also called MIDAStouch, is a modified version of predictive mean matching. Instead of considering, with equal probability, only those cases whose predicted values are close enough to the predicted value of the case to be imputed, we now consider all cases, but with a probability that is smaller the larger the distance between their predicted values and the predicted value of the case to be imputed. An advantage over PMM is that the probability for an observed value to be chosen now decreases smoothly with this distance instead of dropping down to 0 at an arbitrarily chosen threshold. The price for this is an enormous increase in computation time. Further information can be found in [9].
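The weighting idea can be sketched as follows; the inverse-distance weight with exponent kappa is an illustrative assumption (MIDAStouch derives its donor weights differently, see [9]):

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_match(yhat_obs, y_obs, yhat_mis, kappa=3.0):
    """Weighted predictive mean matching step (sketch): every observed case
    may donate, with probability decreasing smoothly in the distance between
    its predicted value and that of the case to be imputed."""
    imputed = np.empty(len(yhat_mis))
    for i, yh in enumerate(yhat_mis):
        d = np.abs(yhat_obs - yh)
        w = 1.0 / (d + 1e-12) ** kappa          # smaller distance -> larger weight
        imputed[i] = rng.choice(y_obs, p=w / w.sum())
    return imputed
```

Unlike the top-five rule of PMM, every case has a positive donation probability here.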
The method called SAMPLE simply samples from all observed values of the incomplete variable without considering the information the other variables provide. Notice that one gets the same results when using its multivariable version directly as when using it as a subroutine of multiple imputation by chained equations.
Classification or regression trees (CART) are trees used for the prediction of a target variable. At each node of the tree a covariable of the data set and a threshold are given, and depending on whether this covariable has, for the data item at hand, a value which is less or greater than the threshold, one goes to the left child or to the right child of the node. One does this until one reaches a leaf. Then a good prediction of an unknown value of the target variable is possible by considering known values of the target variable of cases that end up in the same leaf. In order to use classification or regression trees in the context of imputation of missing data, one takes the variable which has missing values as target variable and all other variables as covariables. The tree is built using a training set which consists of all cases for which the incomplete variable is observed. For every node of the decision tree one selects the variable and the threshold yielding the best possible split (in the sense of the maximal decrease in Gini impurity). After the tree is built, for each data item for which the incomplete variable is missing, the leaf in which it ends up is determined and the imputed value is sampled randomly from all observed values belonging to that leaf. The CART method has been designed to be used as a subroutine of multiple imputation by chained equations and therefore it cannot deal directly with data that has missing values in more than one variable. The CART method can deal with non-linear effects, but it produces predictions that are locally constant. For further information on classification and regression trees in general see [26] or [27, Chapters 9–10] and for the use as imputation method see [10] or [6, Section 3.5].
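The leaf-donor idea can be illustrated with a tree reduced to a single split on one covariable (a full CART grows many such splits recursively; the sum-of-squares criterion below is the regression-tree analogue of the impurity criterion):

```python
import numpy as np

rng = np.random.default_rng(0)

def stump_impute(x_obs, y_obs, x_mis):
    """CART-style imputation reduced to one split (a 'stump'): choose the
    threshold that best separates the observed target values, then donate a
    random observed value from the leaf in which each incomplete case lands."""
    xs = np.sort(np.unique(x_obs))
    cands = (xs[:-1] + xs[1:]) / 2              # midpoints as candidate thresholds
    def sse(t):                                 # within-leaf sum of squared errors
        left, right = y_obs[x_obs <= t], y_obs[x_obs > t]
        return sum(((g - g.mean()) ** 2).sum() for g in (left, right) if len(g))
    best = min(cands, key=sse)                  # best split = smallest impurity
    imputed = np.empty(len(x_mis))
    for i, x in enumerate(x_mis):
        leaf = y_obs[x_obs <= best] if x <= best else y_obs[x_obs > best]
        imputed[i] = rng.choice(leaf)           # sample from observed leaf values
    return imputed
```

Sampling within the leaf, rather than returning the leaf mean, is what makes this an imputation rather than a point prediction.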
Random forests (RF) consist of multiple classification or regression trees. In building the trees there are two differences to single trees: First, the training data is not the set of cases for which the incomplete variable is observed, but a bootstrap sample of it. Second, at each node a random subset of the variables is drawn and the optimal split is only determined among these variables. Now for each data item and each tree one has one leaf in which the data item ends up. The imputed value for a data item for which the incomplete variable is missing is then sampled from the union, over all trees, of all observed values of data items ending up in the same leaf. Like CART, RF can deal with non-linear effects, but produces predictions that are locally constant. Further information on random forests in general can be found in [28] and [29] and for the use of random forests as imputation method we refer to [11] and [6, Section 3.5].
We excluded one subroutine that is suitable both for continuous and for categorical data, namely imputation at level 2 by predictive mean matching, from the comparison. Indeed, this method was designed to deal with multilevel data [6, Section 7.2 and Section 7.8], while the other methods in this comparison are not suitable for dealing with multilevel data, and hence any comparison would be questionable.
The experiments were carried out using R 4.2.0 with mice 3.14.0 and R 4.2.1 with mice 3.14.0.
3. The cardiologic data
In this section we apply the algorithms introduced in the previous section to real data.
We consider the problem of predicting mortality from seven risk factors (age, sex, systolic blood pressure, LDL, smoking behavior, diabetes, family history of premature heart diseases) after the diagnosis of an obstructive CAD. For this, we used data from the Essen coronary artery disease (ECAD) registry which contains the results of 33,978 coronary angiographies conducted between 01/01/2005 and 31/12/2019 at the Department for Cardiology and Vascular Medicine, University Hospital of Essen. In 10,627 coronary angiographies an obstructive coronary artery disease was discovered. These 10,627 coronary angiographies were conducted on 7,398 different patients and we included only the first positive coronary angiography for each patient. Finally one patient was excluded, since all variables we are interested in were missing, leaving 7,397 patients for the analysis.
We examined the follow-up duration, the information whether the patient died and the seven risk factors mentioned above. Of the 7,397 rows of the table, only 1,297 (18%) are complete. Statistics for the different columns of the data are shown in Table 1.
We see that four variables are complete, while for the other variables up to 61% of the values are missing. The mean value over all patients for which the examined variable is known and the mean value over all patients for which all variables are known are close together, except for the family history of premature heart diseases.
A natural question is whether this data set follows an MCAR, an MAR or an MNAR mechanism. Since MNAR is known not to be testable, we only test the MCAR assumption assuming that MAR holds [4, Section 1.9]. In Table 2 we test for each pair of Quantity i and Quantity j whether Quantity i influences the missingness of Quantity j. In order to correct for multiple testing we applied the Bonferroni method, i.e., we multiplied all p-values by 45. We see many significant results, implying that we do not have an MCAR mechanism.
For each pair of Quantity i (in the rows) and Quantity j (in the columns) we test whether Quantity i influences the missingness of Quantity j by comparing Quantity i between those patients for which Quantity j is missing and those patients for which Quantity j is non-missing. We apply a t-test if Quantity i is approximately normally distributed (age, systolic blood pressure, LDL), a Wilcoxon test if Quantity i is seriously skewed (follow-up duration) and a chi-squared test if Quantity i is binary (death, sex, diabetes, smoking, family history).
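One such pairwise check can be sketched as follows for a continuous Quantity i; the Welch t statistic and the hard-coded critical value (≈ 3.26, the large-sample two-sided threshold for a level of 0.05/45) are simplifications of the tests actually used:

```python
import numpy as np

def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances)."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

def influences_missingness(data, i, j):
    """Does Quantity i (column i) influence the missingness of Quantity j?
    Compare Quantity i between cases with Quantity j missing vs. observed.
    The critical value 3.26 approximates the Bonferroni-corrected two-sided
    threshold 0.05 / 45 for large samples (normal approximation)."""
    miss_j = np.isnan(data[:, j])
    a = data[miss_j, i]
    b = data[~miss_j, i]
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]   # Quantity i may itself be incomplete
    return abs(welch_t(a, b)) > 3.26
```

Multiplying all 45 raw p-values by 45 (as in Table 2) is equivalent to comparing each raw p-value against 0.05/45, which is what the critical value above encodes.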
We now compare the performance of the five subroutines mentioned in Section 2, when the analysis to be carried out is
- the calculation of the mean values of the variables,
- the calculation of the variances and covariances of the variables,
- the estimation of the coefficients in a logistic regression model or
- the estimation of the coefficients in a Cox regression model.
We do not consider linear regression here, since our data does not contain an appropriate target variable for that.
We analyze the data set described above, where the missing values are imputed using each of the five subroutines from Section 2, producing M = 20 completed data sets per subroutine. We report for each variable and each subroutine the mean value, the inner standard deviation, the between standard deviation and the total standard deviation. In order to define these quantities precisely, let Q_{j,m} be the estimated value for the j-th variable based on the m-th completed data set and let U_{j,m} be the estimator for the standard deviation of the estimated value for the j-th variable based on the m-th completed data set. E.g., when the characteristic of interest is the expected value, then Q_{j,m} is the arithmetic mean over all patients, while U_{j,m} is the empirical standard deviation divided by the square root of the number of patients. Then the mean value is defined as Q̄_j = (1/M) Σ_{m=1}^{M} Q_{j,m}, the inner standard deviation is defined as ((1/M) Σ_{m=1}^{M} U_{j,m}²)^{1/2} and the between standard deviation is ((1/(M−1)) Σ_{m=1}^{M} (Q_{j,m} − Q̄_j)²)^{1/2}. The total standard deviation is calculated from the inner and between standard deviations using Rubin’s formula (see, e.g., (5.20) in [1]).
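A minimal sketch of this pooling step, with Rubin's formula for the total variance (within variance plus (1 + 1/M) times the between variance) written out:

```python
import numpy as np

def pool(estimates, sds):
    """Pool M completed-data analyses: mean value, inner (within) SD,
    between SD, and total SD via Rubin's formula."""
    q = np.asarray(estimates, dtype=float)   # one estimate per completed data set
    u = np.asarray(sds, dtype=float)         # one SD estimate per completed data set
    m = len(q)
    mean = q.mean()
    inner = np.sqrt((u ** 2).mean())                          # within-imputation SD
    between = q.std(ddof=1)                                   # between-imputation SD
    total = np.sqrt(inner ** 2 + (1 + 1 / m) * between ** 2)  # Rubin's formula
    return mean, inner, between, total
```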
For categorical variables we additionally evaluated the number of imputations that were nonsense, meaning that a value which does not make sense for this variable was imputed.
The results for the estimation of the expected values of systolic blood pressure and of diabetes are reported in Table 3. Since the true values are unknown, all that can be said for systolic blood pressure is that the values are plausible. The same holds for LDL, smoking behavior and family history, and therefore the results for these variables were moved to the supplementary material. A special situation holds, however, for diabetes. The test for diabetes was not conducted if it was obvious that the patient did not suffer from diabetes. So the presence of diabetes reduced the probability that the diabetes variable is missing. This is missing not at random (MNAR). It is better to impute the value 0 (no diabetes) for missing values than to use any of the five subroutines tested here. This way, we get a mean value of 10.60%, which is far less than any of the five values in Table 3.
None of the five subroutines ever imputed a value that was nonsense. So the claim in the R help files that these five subroutines work for categorical data is justified.
In Table 4 we present the computation times of the different algorithms for the generation of all 20 imputations for all variables in the example just considered. The experiment was carried out on an Intel Core i7-11700 processor with 8 cores and 2.50 GHz. We see that SAMPLE is the fastest subroutine and PMM is negligibly slower. The computation times for CART and RF are already more than 10 times as high, but they are still less than 4% of the computation time for MIDAStouch. Considering that an RF consists of multiple trees, it is quite surprising that RF is faster than CART.
The results for the estimation of variances and for fitting logistic regression models and Cox regression models are deferred to the supplementary material. They lead to essentially the same conclusions as the results for the estimation of expected values.
4. Simulations
4.1. Expected values.
Next we want to evaluate the subroutines on simulated data. While the model for the simulated data is designed to share some features of the real data, we do not aim at getting as close to the real data generation process as possible, and, in particular, we do not estimate values in order to fit the model. We define the variables of the simulation model as functions of a random vector which is multivariate normally distributed with mean zero and a given covariance matrix.
We generate missing values under the three missing data mechanisms introduced in Section 1 – MCAR, MAR and MNAR – by simulating the full data and then removing several values before the rest is used. Under MCAR, two of the variables were removed with probability 0.6 and two further variables with probability 0.3, independently of each other and of the simulated random variables. Under MAR, one variable was removed whenever another, always observed variable exceeded a fixed quantile of its distribution, and the remaining incomplete variables were removed with probabilities depending on the observed values – either fixed probabilities (0.8 or 0.4, depending on the condition) or probabilities of the form Φ(·), where Φ denotes the cumulative distribution function of the standard normal distribution; all removal events for which only a (conditional) probability is specified are (conditionally) independent of the data and of each other. Under MNAR, one variable was removed whenever it exceeded a fixed quantile of its own distribution, and the remaining incomplete variables were removed with probabilities (0.2, 0.4, 0.6 or of the form Φ(·), depending on the condition) that may depend on the values to be removed themselves. Notice that these probabilities were arranged in such a way that for each variable the probability that it is missing is the same under MCAR, MAR and MNAR.
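The three mechanisms can be illustrated as follows; the concrete probabilities and the auxiliary variable z are hypothetical stand-ins, not the exact scheme of the simulation study:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_missing(x, z, mechanism):
    """Remove values of x under the three mechanisms (illustrative sketch).
    z is a second, always observed variable."""
    x = x.copy()
    if mechanism == "MCAR":                 # ignore the data entirely
        drop = rng.random(len(x)) < 0.3
    elif mechanism == "MAR":                # depend on the observed variable z
        drop = z > np.quantile(z, 0.7)
    elif mechanism == "MNAR":               # depend on the value of x itself
        drop = x > np.quantile(x, 0.7)
    else:
        raise ValueError(mechanism)
    x[drop] = np.nan
    return x
```

All three calls remove roughly 30% of the values, mimicking the arrangement that each variable has the same missingness probability under every mechanism.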
Due to the enormous computation time of MIDAStouch we did not apply this subroutine to the simulated data, but focused on the other four methods. It is an open problem to devise an algorithm with the same statistical properties as MIDAStouch and an acceptable computation time.
We simulate a data set of 10,000 patients and we repeat the experiment N = 1,000 times. For each variable and each subroutine, we report the mean, the standard deviation using Rubin’s formula, the absolute bias, the simulated standard deviation, the square root of the mean squared error and the coverage probability of the confidence interval. In order to make this precise, let Q_{j,m,r} and U_{j,m,r} be the fitted value and the estimator of its standard deviation for the j-th variable in the m-th completed data set of the r-th repetition and let θ_j denote the true value. Let Q̄_{j,r} = (1/M) Σ_{m=1}^{M} Q_{j,m,r} be the pooled estimate of the r-th repetition and T_{j,r} the corresponding total standard deviation from Rubin’s formula. Then the mean is Q̄_j = (1/N) Σ_{r=1}^{N} Q̄_{j,r}, the standard deviation using Rubin’s formula is (1/N) Σ_{r=1}^{N} T_{j,r}, the absolute bias is |Q̄_j − θ_j|, the simulated standard deviation is ((1/(N−1)) Σ_{r=1}^{N} (Q̄_{j,r} − Q̄_j)²)^{1/2}, the square root of the mean squared error is ((1/N) Σ_{r=1}^{N} (Q̄_{j,r} − θ_j)²)^{1/2} and the coverage probability is the fraction of repetitions r for which θ_j lies in [Q̄_{j,r} − z T_{j,r}, Q̄_{j,r} + z T_{j,r}], where z denotes the 0.975-quantile of the standard normal distribution. The true means (needed to determine the coverage probability of the confidence intervals) were calculated analytically.
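These across-repetition summaries can be sketched as follows (the 0.975-quantile of the standard normal distribution is hard-coded):

```python
import numpy as np

def summarize(pooled_means, pooled_sds, true_value):
    """Across-repetition summaries: absolute bias, simulated SD, root mean
    squared error, and the coverage probability of the normal confidence
    interval built from Rubin's total SD."""
    q = np.asarray(pooled_means, dtype=float)   # one pooled estimate per repetition
    s = np.asarray(pooled_sds, dtype=float)     # one Rubin total SD per repetition
    z = 1.959963984540054                       # 0.975-quantile of the standard normal
    bias = abs(q.mean() - true_value)
    sim_sd = q.std(ddof=1)
    rmse = np.sqrt(((q - true_value) ** 2).mean())
    cover = ((q - z * s <= true_value) & (true_value <= q + z * s)).mean()
    return bias, sim_sd, rmse, cover
```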
In order to save space only the results for Variable 5 are included in Table 5, while the results for the other variables are deferred to the supplementary material.
Under MCAR we see that the mean value is close to the true value for all algorithms and all variables; likewise, the standard deviation from Rubin’s formula is close to the simulated standard deviation and the coverage probability of the confidence intervals is close to the nominal level. There is one exception, namely RF yields poor results for Variable 7, though it is not obvious what the problem is with this combination. Moreover, in general the results for PMM are a bit better than the results for the other subroutines.
Under MAR the superiority of PMM becomes quite pronounced, with CART usually taking the second place. This is a bit surprising, since RF is more sophisticated than CART. Under MNAR all methods behave poorly. Usually the mean values of PMM and CART are a bit closer to the true values than the mean values of SAMPLE or RF, but the difference between the subroutines is small compared to the distance between any of these mean values and the true value.
4.2. Variances and covariances
After having analyzed the mean values in Section 4.1, the logical next step is the analysis of variances and covariances. We use the same simulation model as in Section 4.1. Due to the large amount of results, only the variances are included in the main article and the pdf supplementary material, while the covariances are only given in the supplementary txt files.
Notice that unlike in Section 4.1, the true values are not readily available. Hence we generated a sample of 1,000,000 patients without missing data and took the empirical variances and covariances of that sample as “real” variances and covariances.
The results on Variable 5 are given in Table 6, the other results are deferred to the supplementary material. For MCAR data the situation for variances is similar to the situation for expected values. The best results are produced by PMM and the only really poor results are those by RF for Variable 7. In general, the coverage probabilities for variances are a bit lower than those for mean values.
Similar things can be said about MAR data. The results for PMM and CART are better than the results for SAMPLE and RF, as was already the case for the mean values. The results for Variable 5 are quite poor, and the coverage probabilities for variances are a bit lower than those for mean values.
Under MNAR we obtain good results for Variable 6 and poor results for the other variables. An explanation of what is so special about Variable 6 is not obvious.
4.3. Linear regression model
We fitted a linear regression model whose covariables were the third to ninth variable of Section 4.1 and whose target variable was a linear combination of these covariables plus an independent normally distributed error term. The covariables were missing as in Section 4.1 and the target variable was never missing.
The results for Variable 5 are presented in Table 7, the other results are deferred to the supplementary material. For MCAR data, PMM produces good results, while the estimates of CART are slightly biased, yielding a low coverage probability. For SAMPLE the situation is even worse, resulting in a coverage probability of 0. This may be due to the fact that SAMPLE makes no attempt to reconstruct dependencies between different variables. Frequently, Rubin’s formula overestimates the standard deviation, and for RF this results in too high coverage probabilities for some of the coefficients.
All this remains true under MAR assumptions. Under MNAR assumptions all algorithms behave poorly.
4.4. Logistic regression model
Here we considered a logistic regression model with covariables as in Section 4.1 and with a binary target variable whose success probability is a logistic function of a linear combination of the covariables. The covariables were missing as in Section 4.1 and the target variable was never missing.
The results for Variable 5 are presented in Table 8, the remaining results are deferred to the supplementary material. Both under MCAR and under MAR assumptions, we get good results from PMM and poor results from SAMPLE, with the other two methods in between. For MNAR data, there are some good results among a lot of poor ones, but these good results are so irregularly scattered that it is probably pure coincidence. We remark that the coverage probabilities we obtain here are higher than the coverage probabilities we obtained for estimating the expected values or the variances or for linear regression models under MNAR assumptions, but they are still unacceptably low.
4.5. Cox regression
We simulated a Cox regression model with covariables as in Section 4.1, a survival time that was exponentially distributed with a rate depending log-linearly on the covariables, and a competing risk whose time point is exponentially distributed with a constant rate independent of the covariables.
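Such survival data can be generated as follows; the coefficient vector and the competing-risk rate are hypothetical placeholders for the (unspecified) values of the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cox(X, beta, censor_rate=0.1):
    """Simulate survival data with exponential event times whose rate depends
    log-linearly on the covariables (the proportional-hazards form), plus an
    independent, exponentially distributed competing risk. Returns the
    observed time and the event indicator."""
    rate = np.exp(X @ beta)                     # log-linear hazard rate
    event_time = rng.exponential(1.0 / rate)
    censor_time = rng.exponential(1.0 / censor_rate, size=len(X))
    time = np.minimum(event_time, censor_time)
    event = event_time <= censor_time           # True if the event was observed
    return time, event
```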
We present the results on Variable 5 in Table 9 and the remaining results in the supplementary material. Once again, both under MCAR and under MAR the best method is PMM. Under MNAR we observe the same as in logistic regression: Among a lot of poor results there are some irregularly scattered good ones.
5. Discussion
We investigated multiple imputation by chained equations, a method for treating missing data. We compared five different subroutines, but one of them, MIDAStouch, had to be excluded from a part of our analysis due to its long computation time. Among the remaining four methods there is one that clearly performed best, namely predictive mean matching (PMM).
This generalizes the finding of [25] that at least under simple, “linear” interactions between the different components of the observed random vector, PMM is superior to tree-based methods. However, it was noticed both in [11] and in [25] that, in models with more complex, “non-linear” interactions between the different components of the observed random vector, CART and RF may be superior to PMM.
Non-linear interactions can arise in different ways, e.g., a variable can depend in a non-linear way on another variable, or one variable can depend on the product of two variables. In the data set from our cardiologic example, however, such effects do not seem to play a large role. In a quadratic regression model with systolic blood pressure as target variable and age as covariable we got a p-value of 0.056 for complete case analysis, 0.113 for PMM, 0.270 for SAMPLE, 0.143 for CART and 0.133 for RF for the hypothesis that the true quadratic coefficient is zero. Hence we have no reason to assume a non-linear relation. In a logistic regression model with death as target variable and age, diabetes and the product of age and diabetes as covariables we got a p-value of 0.063 for complete case analysis, 0.390 for PMM, 0.826 for SAMPLE, 0.489 for CART and 0.126 for RF for the hypothesis that the true product coefficient is zero. Hence we have no reason to assume a non-linear relation here either. The results are similar for most other variables.
The results under MNAR are poor throughout, as could be expected given similar findings in the literature. In such cases, specific model assumptions have to be made [1, Chapter 15], e.g., a probit selection model or a normal pattern-mixture model.
On several occasions, Rubin's formula severely overestimated the standard deviation, even under MCAR, for all subroutines except PMM; this occurred in particular in linear regression models, but also in logistic and Cox regression models.
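For reference, Rubin's formula pools the m completed-data analyses by averaging the point estimates and combining the within-imputation variance W and the between-imputation variance B into the total variance T = W + (1 + 1/m)·B. A minimal sketch with made-up numbers (not results from the study):

```python
import statistics

def rubin_pool(estimates, variances):
    """Pool m completed-data estimates and variances by Rubin's rules.

    Returns the pooled estimate (mean of the m estimates) and the
    total variance T = W + (1 + 1/m) * B, where W is the mean
    within-imputation variance and B the between-imputation
    (sample) variance of the m estimates.
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # pooled point estimate
    w = statistics.mean(variances)          # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance
    t = w + (1 + 1 / m) * b                 # total variance
    return q_bar, t

# Made-up example with m = 5 imputations:
est = [1.0, 1.2, 0.9, 1.1, 1.05]
var = [0.04, 0.05, 0.04, 0.06, 0.05]
q, t = rubin_pool(est, var)
```

An overestimated total variance T translates directly into an overestimated standard deviation sqrt(T) and hence into overly wide confidence intervals.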
We had to restrict ourselves to a relatively small study in order to keep the paper to an acceptable length. Besides the type of the target variable and the missing data mechanism, the amount of missingness, the number of variables and the number of data items may also have an impact on the performance of a multiple imputation algorithm. It would have been desirable to examine this, but trying out only three values for each of these three additional factors would already have produced 27 times as many tables.
A further restriction is that we considered only one model per target variable and missing data mechanism. As discussed above, the performance of a multiple imputation algorithm may depend on the exact model, so it would have been desirable to try out several different models per target variable and missing data mechanism.
We considered only correctly specified models in our simulation. It would be interesting to consider misspecified models as well. Model misspecifications can arise easily, e.g., by fitting a (generalized) linear model when the true interactions are non-linear or by choosing an incorrect link function.
We restricted ourselves to those subroutines that can handle both continuous and binary variables. The mice package implements 20 subroutines for continuous data and 12 subroutines for binary data. While it would be desirable to include every combination of a subroutine for continuous data and a subroutine for binary data in the comparison, there are 20 × 12 = 240 such combinations, far too many for a single study.
Considering that there is a vast literature comparing missing data treatment techniques, but that each paper can only cover a relatively small study, as explained in the last four paragraphs, some meta-analysis seems desirable.
The simulation results tell us that PMM is superior to the other methods, but they do not tell us why. Hence some theoretical analysis seems to be desirable.
Supporting information
MI_comparison_supp_rev. Additional tables as PDF.
All tables referred to in this article, whether presented in the main text or not, can be found in this PDF file.
https://doi.org/10.1371/journal.pone.0319784.s001
(PDF)
S2_Tables. Additional tables as txt.
The same content as MI_comparison_supp_rev in a machine-readable form. The specification of the txt files can be found in the last chapter of MI_comparison_supp_rev.
https://doi.org/10.1371/journal.pone.0319784.s002
(ZIP)
MI_001_elementary.R. R file for the elementary computations presented in Section 3.
Some modifications have to be made before executing this file, since the data used in it is not publicly available.
https://doi.org/10.1371/journal.pone.0319784.s003
(R)
MI_002_ECAD_Mean.R. R file for the mean values of the real data presented in Section 3.
Some modifications have to be made before executing this file, since the data used in it is not publicly available.
https://doi.org/10.1371/journal.pone.0319784.s004
(R)
MI_003_generator.R. R file that generates the random vectors used by the other R files.
https://doi.org/10.1371/journal.pone.0319784.s005
(R)
MI_004_Sim_Mean_A.R. R file for the simulation of the mean values presented in Section 4.1.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s006
(R)
MI_005_Sim_Mean_B.R. R file for the simulation of the mean values presented in Section 4.1.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s007
(R)
MI_006_ECAD_Var.R. R file for the (co)variances of the real data presented in Section 3.
Some modifications have to be made before executing this file, since the data used in it is not publicly available.
https://doi.org/10.1371/journal.pone.0319784.s008
(R)
MI_007_Sim_Var_A.R. R file for the simulation of the (co)variances presented in Section 4.2.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s009
(R)
MI_008_Sim_Var_B.R. R file for the simulation of the (co)variances presented in Section 4.2.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s010
(R)
MI_009_Sim_LinReg_A.R. R file for the simulation of the linear regression presented in Section 4.3.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s011
(R)
MI_010_Sim_LinReg_B.R. R file for the simulation of the linear regression presented in Section 4.3.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s012
(R)
MI_011_ECAD_LogReg.R. R file for the logistic regression of the real data presented in Section 3.
Some modifications have to be made before executing this file, since the data used in it is not publicly available.
https://doi.org/10.1371/journal.pone.0319784.s013
(R)
MI_012_Sim_LogReg_A.R. R file for the simulation of the logistic regression presented in Section 4.4.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s014
(R)
MI_013_Sim_LogReg_B.R. R file for the simulation of the logistic regression presented in Section 4.4.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s015
(R)
MI_014_ECAD_CoxReg.R. R file for the Cox regression of the real data presented in Section 3.
Some modifications have to be made before executing this file, since the data used in it is not publicly available.
https://doi.org/10.1371/journal.pone.0319784.s016
(R)
MI_015_Sim_CoxReg_A.R. R file for the simulation of the Cox regression presented in Section 4.5.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s017
(R)
MI_016_Sim_CoxReg_B.R. R file for the simulation of the Cox regression presented in Section 4.5.
This file can be executed directly.
https://doi.org/10.1371/journal.pone.0319784.s018
(R)
Acknowledgments
The analysis of the simulated data was carried out on the computers of the Paderborn Cluster for Parallel Computing, Paderborn University, Paderborn, Germany.
Literature
- 1. Little R, Rubin D. Statistical analysis with missing data. Hoboken: Wiley; 2020.
- 2. Schafer J. Analysis of incomplete multivariate data. London: Chapman & Hall; 1997.
- 3. Molenberghs G, Kenward M. Missing data in clinical studies. New York: Wiley; 2007.
- 4. Enders C. Applied missing data analysis. New York: The Guilford Press; 2010.
- 5. Wang Z, Akande O, Poulos J, Li F. Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison. Survey Methodol. 2022;48:375–399.
- 6. Van Buuren S. Flexible imputation of missing data. Boca Raton: Chapman & Hall; 2018.
- 7. Van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.
- 8. Little R. Missing-data adjustments in large surveys. J Business Econ Stat. 1988;6:287–296.
- 9. Gaffert P, Meinfelder F, Bosch V. Towards an MI-proper predictive mean matching. 2016:15. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=5f13dd8d367c9b0ebc127c24487aa06773a2f00f.
- 10. Burgette L, Reiter J. Multiple imputation for missing data via sequential regression trees. Am J Epidemiol. 2010;172:1070–1076.
- 11. Doove L, van Buuren S, Dusseldorp E. Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat Data Anal. 2014;72:92–104.
- 12. Rubin D. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons; 1987.
- 13. Heitjan D, Little R. Multiple imputation for the fatal accident reporting system. J Royal Stat Soc Series C: Appl Stat. 1991;40:13–29.
- 14. van der Palm D, van der Ark L, Vermunt J. A comparison of incomplete-data methods for categorical data. Stat Meth Med Res. 2016;25:754–774.
- 15. Akande O, Li F, Reiter J. An empirical comparison of multiple imputation methods for categorical data. Am Statist. 2017;71:162–170.
- 16. Giorgi R, Belot A, Gaudart J, Launoy G. The performance of multiple imputation for missing covariate data within the context of regression relative analysis. Stat Med. 2008;27:6310–6331.
- 17. Ibrahim J, Chen M, Lipsitz S, Herring A. Missing-data methods for generalized linear models: a comparative review. J Am Stat Assoc. 2005;100:332–346.
- 18. Lee J, Huber C. Evaluation of multiple imputation with large proportions of missing data: how much is too much? Iranian J Public Health. 2021;50:1372–1380.
- 19. Huo Z. A comparison of multiple imputation methods for missing covariate values in recurrent event data [Master Thesis]. Uppsala University; 2015.
- 20. Nur U, Shack L, Rachet B, Carpenter J, Coleman M. Modelling relative survival in the presence of incomplete data: a tutorial. Int J Epidemiol. 2010;39:118–128.
- 21. Stavseth M, Clausen T, Røislien J. How handling missing data may impact conclusions: a comparison of six different imputation methods for categorical questionnaire data. SAGE Open Med. 2019;7:1–12.
- 22. Zhuchkova S, Rotmistrov A. How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment. Qual Quant. 2022;56:1–22.
- 23. Makaba T, Dogo E. A comparison of strategies for missing data on machine learning classification algorithms. 2019 Int Multidis Inform Technol Engineer Conf. 2019:7.
- 24. Mohammed M, Zulkafli H, Adam M, Ali N, Baba I. Comparison of five imputation methods in handling missing data in a continuous frequency table. AIP Conf Proceed. 2021:9.
- 25. Shah A, Bartlett J, Carpenter J, Nicholas O, Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol. 2014;179:764–774.
- 26. Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. New York: Wadsworth Publishing; 1984.
- 27. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. Springer; 2017.
- 28. Genuer R, Poggi J. Random forests with R. Springer; 2020.
- 29. Breiman L. Random forests. Mach Learn. 2001;45:5–32.