Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods; however, studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and to assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation.
Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models’ discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment.
The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4%) had missing data. Family history had the highest proportion of missing data (25%). Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals). Overall, deletion resulted in the lowest model performances, while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation yielded the highest C-statistic only for the Rotterdam Predictive model, and this value was matched by simpler imputation methods.
Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.
Citation: Masconi KL, Matsha TE, Erasmus RT, Kengne AP (2015) Effects of Different Missing Data Imputation Techniques on the Performance of Undiagnosed Diabetes Risk Prediction Models in a Mixed-Ancestry Population of South Africa. PLoS ONE 10(9): e0139210. https://doi.org/10.1371/journal.pone.0139210
Editor: Noel Christopher Barengo, University of Tolima, COLOMBIA
Received: April 11, 2015; Accepted: September 10, 2015; Published: September 25, 2015
Copyright: © 2015 Masconi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data are available upon request due to ethical and funding constraints. Ethical approval for the Bellville South study from the involved institution precludes further utilization of data or specimens collected from participants for purposes other than those clearly specified on the informed consent form signed by study participants. The consent of the Bellville South study participants made no reference to data sharing, deposition or use outside the control of the principal investigators. Andre Pascal Kengne (firstname.lastname@example.org) may be contacted to request the data.
Funding: KLM was supported by a scholarship from the South African National Research Foundation and the Carl & Emily Fuchs Foundation. The authors received no other specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Missing data are common in predictive research and can negatively affect the performance of risk prediction models. In an ideal setting, a subject with missing data on a predictor or outcome variable would be replaced with a randomly selected subject from the source population. However, replacement is burdensome and most often impossible. Instead, researchers can use observed data to estimate the status of participants for the characteristic with the missing value. Imputation techniques are based on the basic principle of replacement, which holds that any conclusion drawn from a study should not depend on the particular sample involved in the study. Should each subject in the chosen sample be replaced by a new subject from the same source population as the original subject, the conclusions should not be compromised.
It is widely advocated that imputation of missing data is superior to ignoring the missing data, that the indicator method often provides biased results, that conditional mean imputation is better than unconditional mean imputation, and that multiple imputation is better than single imputation [1–13]. However, studies have suggested that simple methods for filling missing data can be just as accurate as complex methods, allowing for easier implementation in prediction studies [14, 15]. The type and percentage of missing data are important determinants of the accuracy of the different imputation methods. When data are missing completely at random (MCAR), the probability that an observation is missing is unrelated to any other patient characteristic, and most simple techniques for handling missing data give unbiased results. When the missingness depends on information that is not observed, the data are considered missing not at random (MNAR). Although there is no advocated method available to handle the valuable information that has been lost through MNAR data, multiple imputation can be unbiased for MNAR data. Most often, missing data are neither MCAR nor MNAR, but rather missing at random (MAR). This type of missing data is missing at random conditional on the individual's other characteristics that are available at the time of analysis. When missing data are MAR, common and simple techniques used to handle missing data, such as complete case and available case analysis, the indicator method and overall mean imputation, are likely to introduce selection bias, as the database is no longer a random sample of the source population [5, 6, 11, 16].
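The three missingness mechanisms described above can be illustrated with a small simulation. The sketch below is not part of the study: the variable names, effect sizes and missingness probabilities are hypothetical, chosen only to show how a complete-case mean behaves when missingness is unrelated to the data (MCAR), depends on an observed covariate (MAR), or depends on the missing value itself (MNAR).

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Return (full-data mean, complete-case mean) of y under one mechanism.

    All numbers here are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)            # fully observed covariate
        y = 2.0 * x + rng.gauss(0, 1)  # variable that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                    # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1  # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1  # depends on unobserved y
        else:
            raise ValueError(mechanism)
        full.append(y)
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.fmean(full), statistics.fmean(observed)

for mechanism in ("MCAR", "MAR", "MNAR"):
    true_mean, cc_mean = simulate(mechanism)
    print(mechanism, round(true_mean, 2), round(cc_mean, 2))
```

Under these assumptions the complete-case mean stays close to the full-data mean only in the MCAR setting; in the MAR and MNAR settings the retained records over-represent low values of y, which is the selection effect the paragraph above refers to.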
This study aims to implement a number of simple and more complex imputation methods for filling missing data, and to assess their comparative effects on the performance of undiagnosed diabetes risk prediction models during external validation. For this purpose, we use data for mixed-ancestry South Africans who took part in the Bellville-South study in Cape Town.
Material and Methods
Details of the study design and recruitment for the database that served as the basis for implementing all imputation methods are described below. The Bellville South Study was a cross-sectional study conducted from mid-January 2008 to March 2009 (cohort 1), and from January 2011 to November 2011 (cohort 2). The study was approved by the Ethics Committee of the Cape Peninsula University of Technology (CPUT/HW-REC 2008/002 and CPUT/HW-REC 2010) and Stellenbosch University (N09/05/146). Recruited subjects were visited by the recruitment team the evening before participation and reminded of all the survey instructions. All participants signed written informed consent after all the procedures had been fully explained in the language of their choice.
Bellville-South is located within the Northern suburbs of Cape Town, South Africa, and is traditionally a Coloured township formed in the late 1950s. According to the 2011 population census, its population stands at approximately 29 301, with 76.0% (22 270) consisting of mixed-ancestry individuals [17, 18]. The target population for this study was subjects between the ages of 35 and 65 years, and their number was estimated to be 6 500 in the 2001 population census.
Research Design and Study Population
The data were collected from January 2008 to March 2009. Using a map of Bellville South, multistage stratified random sampling was carried out as follows. From a list of streets in each stratum, the streets were classified as short, medium or long based on the number of houses: streets with ≤ 22 houses were classified as short, those with 23–40 houses as medium, and those with > 40 houses as long. A total of 16 short streets representing approximately 190 houses, 15 medium streets representing approximately 410 houses and 12 long streets representing approximately 400 houses were randomly selected across the different strata. From the selected streets, all household members meeting the selection criteria were invited to participate in the study. Community authorities requested that participants outside the random selection area should benefit from the study.
Information regarding the project was disseminated to the local residents through the local radio station, community newspaper, brochures and fliers; the latter bearing information about the project and distributed through school children and taxis to the local residents by the recruitment team. Recruited subjects were visited by the recruitment team the evening before participation and reminded of all the survey instructions.
A detailed protocol describing data-collection procedures (questionnaires and physical examination) was developed. The questionnaire designed to retrospectively obtain information on lifestyle factors such as smoking and alcohol consumption, physical activity, diet, family history of CVD and DM, and demographics was administered by trained personnel. A detailed drug history was obtained by interrogation and by examining the clinic cards as well as the record of drugs that participants brought to the study site. Clinical measurements included height, weight, hip and waist circumferences, body fat measurements and blood pressure.
All participants, except the self-reported diabetic subjects, confirmed by either medical card record or drugs in use, had blood taken for fasting blood glucose and underwent a 75 g oral glucose tolerance test (OGTT) as prescribed by the WHO. Diabetes was diagnosed according to the WHO 2006 criteria.
Identification of undiagnosed diabetes prediction models
Existing prediction models were obtained from a systematic review by Brown et al., 2012. Models met the criteria for selection for this paper if they were developed to predict the presence of undiagnosed diabetes based on predictors measured in the Bellville South study. We focused on models developed from non-invasively measured predictors. Therefore, the models retained were the Cambridge Risk model, Kuwaiti Risk model, Omani Diabetes Risk model, Rotterdam Predictive model 1 and the simplified Finnish Diabetes Risk model. Model characteristics and formulas have been published by Masconi et al. All models included age as a predictor, while a range of other predictors were variably combined in models. These included sex, BMI, use of antihypertensive medication, family history of diabetes, waist circumference, past or current smoking and the use of corticosteroids. Table 1 shows the overview of the performance of the undiagnosed diabetes risk prediction models across the five imputation methods.
Analysis of missingness.
Data analysis used the R statistical software, version 3.1.2. Aggregation plots were created using the ‘VIM’ package to identify the pattern of missing data for each variable. The corresponding frequencies were tabulated.
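The tabulation behind such aggregation plots amounts to two counts: missing values per variable, and the frequency of each combination of missing variables. The sketch below shows that bookkeeping in plain Python rather than with the ‘VIM’ package; the records and variable names are hypothetical toy data, with `None` marking a missing value.

```python
from collections import Counter

def missingness_summary(records, variables):
    """Count missing values per variable and the frequency of each pattern.

    A pattern is the tuple of variables that are missing in a record;
    the empty tuple corresponds to a complete case.
    """
    per_variable = {v: sum(r.get(v) is None for r in records) for v in variables}
    patterns = Counter(
        tuple(v for v in variables if r.get(v) is None) for r in records
    )
    return per_variable, patterns

# Hypothetical toy records, not data from the Bellville South study.
records = [
    {"age": 52, "bmi": 31.2, "fh_mother": None},
    {"age": 47, "bmi": None, "fh_mother": None},
    {"age": 60, "bmi": 27.5, "fh_mother": 1},
    {"age": 39, "bmi": 24.1, "fh_mother": None},
]
per_variable, patterns = missingness_summary(records, ["age", "bmi", "fh_mother"])
print(per_variable)
print(patterns.most_common())
```

The per-variable counts correspond to the proportions panel of an aggregation plot, and the pattern counter to the combinations panel.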
Identification of imputation methods.
A comprehensive search was previously carried out on the imputation methods available. The aim was to compare deletion, single and multiple imputation techniques. To allow for a broad spectrum of techniques, it was decided to compare pair-wise deletion, simple imputation, conditional mean imputation [7, 30], stochastic regression [7, 32] and multiple imputation for non-monotone missing patterns. Imputation was completed on the outcome and all variables. Where applicable, the outcome and all other variables were used as predictors for the variable being imputed.
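The difference between the three single-imputation approaches named above can be sketched with one incomplete variable and one fully observed predictor. This is a simplified illustration in plain Python, not the study's R code: the function names and toy values are hypothetical, and a single-predictor least-squares line stands in for the multivariable regression models used in practice.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def impute(xs, ys, method, rng=None):
    """Fill None entries of ys; observed values are returned unchanged."""
    ox = [x for x, y in zip(xs, ys) if y is not None]
    oy = [y for y in ys if y is not None]
    if method == "mean":  # unconditional mean substitution
        fill = statistics.fmean(oy)
        return [fill if y is None else y for y in ys]
    a, b, sd = fit_line(ox, oy)
    if method == "conditional_mean":  # regression prediction
        return [a + b * x if y is None else y for x, y in zip(xs, ys)]
    if method == "stochastic":  # regression prediction plus random noise
        rng = rng or random.Random(0)
        return [a + b * x + rng.gauss(0, sd) if y is None else y
                for x, y in zip(xs, ys)]
    raise ValueError(method)

# Hypothetical toy data: two of six values of y are missing.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, 9.8, None]
by_mean = impute(xs, ys, "mean")
by_regression = impute(xs, ys, "conditional_mean")
by_stochastic = impute(xs, ys, "stochastic")
```

Mean substitution gives every missing value the same fill, conditional mean imputation places each imputed value exactly on the fitted line, and stochastic regression adds a residual draw so that the imputed values retain some of the variability of the observed data.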
Simple imputation via mean substitution was implemented with the package ‘Hmisc’ through the function ‘impute (x, fun = mean)’, where x is the predictor of interest. Conditional mean imputation was implemented through the creation of a regression model and the subsequent inbuilt ‘predict’ function. Imputation via stochastic regression used the method ‘norm.nob’ of the R package ‘mice’. Multiple imputation for non-monotone missing patterns via the Multiple Imputation by Chained Equations (MICE) method, using fully conditional specification (FCS), was implemented using the ‘mice’ package. The m imputed datasets were analysed separately, then the estimates and the associated variances from the imputed datasets were combined using the rules established by Rubin, which incorporate the within- and between-imputation variability.
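The pooling step can be written out in a few lines. The sketch below applies the standard form of Rubin's rules to a scalar estimate; the function name and the five example estimates and variances are hypothetical and are not results from this study.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m estimates and their variances using Rubin's rules.

    Returns the pooled estimate and its total variance, where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical estimates from m = 5 imputed datasets.
estimates = [0.68, 0.70, 0.69, 0.71, 0.67]
variances = [0.0004] * 5
pooled, total_variance = pool_rubin(estimates, variances)
print(round(pooled, 3), round(total_variance, 5))
```

The between-imputation term is what single imputation omits, which is why single-imputation standard errors tend to be too small.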
The original selected models were validated for the overall data and subsets using the published formulas, both prior to recalibration and following intercept adjustment to eliminate differences in diabetes prevalence between the development population of the model and this test population. The predicted probability of undiagnosed diabetes for each participant was computed using the baseline measured predictors. The performance was expressed in terms of discrimination and calibration. Discrimination describes the ability of the model to distinguish those at high risk of diabetes from those at low risk. Discrimination was assessed and compared using the concordance (C) statistic and non-parametric methods.
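Both steps can be sketched in plain Python. The C-statistic below is the proportion of case/non-case pairs in which the case receives the higher predicted risk, and the intercept adjustment is taken here to mean adding one constant to every linear predictor so that the mean predicted risk equals the observed prevalence. That reading of the adjustment, the function names and the toy values are assumptions for illustration, not the study's code.

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

def c_statistic(risks, outcomes):
    """Concordance over all case/non-case pairs; ties count as 0.5."""
    cases = [p for p, y in zip(risks, outcomes) if y == 1]
    controls = [p for p, y in zip(risks, outcomes) if y == 0]
    score = sum(1.0 if c > n else 0.5 if c == n else 0.0
                for c in cases for n in controls)
    return score / (len(cases) * len(controls))

def intercept_shift(linear_predictors, outcomes, tol=1e-10):
    """Bisection for the constant that makes mean risk equal prevalence.

    Assumes the observed prevalence is strictly between 0 and 1.
    """
    target = sum(outcomes) / len(outcomes)
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_risk = sum(logistic(lp + mid) for lp in linear_predictors) / len(outcomes)
        if mean_risk < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical linear predictors and outcomes for five participants.
lps = [-3.0, -2.0, -1.0, 0.0, -2.5]
outcomes = [0, 1, 0, 1, 0]
risks = [logistic(lp) for lp in lps]
delta = intercept_shift(lps, outcomes)
recalibrated = [logistic(lp + delta) for lp in lps]
print(c_statistic(risks, outcomes), c_statistic(recalibrated, outcomes))
```

Because the shift moves every linear predictor by the same amount, the ranking of participants is unchanged, so the C-statistic is identical before and after this kind of recalibration.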
Calibration describes the agreement between the probability of the outcome of interest as estimated by the model, and the observed outcome frequencies. It was assessed with formal statistical tests, determining the agreement between the expected (E) and observed (O) rates (E/O). The 95% confidence intervals for the expected/observed (E/O) ratio were calculated assuming a Poisson distribution. We also calculated 1) the Yates slope, which is the difference between the mean predicted probability of type 2 diabetes for participants with and without prevalent undiagnosed diabetes, with higher values indicating better performance; and 2) the Brier score, which is the mean squared difference between the predicted probability and the actual outcome across participants, with values ranging between 0 for a perfect prediction model and 1 for no match between prediction and outcome [30, 35].
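These three measures follow directly from their definitions. The sketch below computes them for a handful of hypothetical predicted probabilities; the function name and values are illustrative only, and the Poisson-based confidence interval for E/O is not reproduced because its exact formula is not given here.

```python
import statistics

def calibration_measures(risks, outcomes):
    """Return (E/O ratio, Yates slope, Brier score) for binary outcomes."""
    expected = sum(risks)     # expected number of cases
    observed = sum(outcomes)  # observed number of cases
    case_risks = [p for p, y in zip(risks, outcomes) if y == 1]
    noncase_risks = [p for p, y in zip(risks, outcomes) if y == 0]
    yates = statistics.fmean(case_risks) - statistics.fmean(noncase_risks)
    brier = statistics.fmean([(p - y) ** 2 for p, y in zip(risks, outcomes)])
    return expected / observed, yates, brier

# Hypothetical predicted probabilities and outcomes.
risks = [0.1, 0.2, 0.6, 0.7]
outcomes = [0, 0, 1, 1]
e_over_o, yates_slope, brier_score = calibration_measures(risks, outcomes)
print(round(e_over_o, 3), round(yates_slope, 3), round(brier_score, 3))
```

An E/O ratio below 1 indicates that the model underestimates risk overall, and a ratio above 1 indicates overestimation.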
The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4%) had missing data. Table 2 summarises the number of missing values for each variable included in the 5 selected risk prediction models. Additionally, Figs 1 and 2 show the proportion and combinations of missing data, respectively. Family history was the variable with the most missing data [mother (25.1%), father (24.9%), sister (25.0%), and brother (25.1%)]. The rest of the variables had a missing proportion of less than 5%, except smoking status (6.1%).
*BMI, Body Mass Index; WC, Waist Circumference; SBP, Systolic Blood Pressure; DBP, Diastolic Blood Pressure; FH, Family History; Cort, Corticosteroids; med, medication; Hpt, Hypertensive.
Table 3 shows the variable characteristics of the original database compared to the five methods of dealing with missing data. Pairwise deletion resulted in a significantly reduced sample size (754), while conditional mean imputation resulted in a different number of observations for each variable, as a missing value was imputed only when every other variable was complete for that individual. Simple imputation, stochastic regression imputation and multiple imputation all imputed all missing data, allowing for a full database of 1083 individuals.
Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals). Pairwise deletion saw a higher mean age (52.5 years) and lower systolic and diastolic blood pressure (122.0 mmHg and 74.7 mmHg, respectively) when compared to the other imputation methods. There was no substantial difference in body mass index and waist circumference between the methods. Stochastic regression imputed a higher prevalence of individuals on hypertensive medication (387 individuals), mother having diabetes (182 individuals), and sister having diabetes (143 individuals). Multiple imputation reported the highest prevalence of father (78 individuals) and brother (87 individuals) having diabetes. Variable characteristics across the five multiple imputation datasets are shown in Table 4. Mother, father and sister family history, as well as smoking status, had the most variation between the five multiple imputation datasets.
Most notably, model performance following pairwise deletion deviated from the model performance following the other imputation methods. Discrimination was lower in all five models; however, calibration was better in the Cambridge Diabetes Risk model [1.81 (1.09–2.52)]. Overall, although the differences were not large, simple imputation yielded the highest C-statistic for four of the five models: the Cambridge Diabetes Risk model [0.69 (0.65–0.73) vs. 0.67 (0.62–0.72)], Kuwaiti Risk model [0.70 (0.66–0.74) vs. 0.68 (0.63–0.73)], Omani Diabetes Risk model [0.67 (0.63–0.71) vs. 0.65 (0.61–0.69)] and Rotterdam Predictive model [0.65 (0.61–0.70) vs. 0.64 (0.59–0.69)]. Multiple imputation yielded the highest C-statistic only for the Rotterdam Predictive model [0.65 (0.61–0.70)], and this value was matched by simpler imputation methods. Table 5 details the similarity in model performance across the five datasets produced through multiple imputation.
The pattern of the overall calibration (E/O) did not vary substantially across imputation methods. Uniformly, all imputation methods resulted in the Cambridge and Omani risk models overestimating diabetes risk, while the others showed underestimation. Other performance measures across subgroups, shown in Table 1, also did not show significant differences between imputation methods. When recalibration was performed, all models across all imputation techniques had an improved agreement between predicted and observed rates (Table 1).
The choice of imputation method for handling missing data is a much-debated topic, with strong advocates for multiple imputation and others who propose that simple techniques can be just as effective. Several studies have examined the effect of different imputation methods on the predictive performance of risk models; however, their findings have been largely contradictory. Donders et al. performed a simulation study in an attempt to illustrate that single imputation yields unbiased estimates with too narrow confidence intervals, whereas multiple imputation yields unbiased estimates with correct standard errors. Both single and multiple imputation produced unbiased estimates of association, and the conclusion was that, despite single imputation appearing more precise, multiple imputation produces less bias and more precise results. Alternatively, a study by van der Heijden et al. concluded that the models fitted using the indicator method, a simple method of dealing with missing data, showed higher regression coefficients and predictive accuracy when compared to the models derived from the imputation methods. In this study, we did not observe large differences between the models obtained after single unconditional, single conditional and multiple imputation of the missing data. Deletion of individuals with missing data resulted in the expected reduction in the discriminatory ability of the models. Model calibration improved across all areas when recalibration was performed. This, however, has no influence on the imputation techniques or the discriminatory ability of the models.
It should be noted that most studies comparing imputation techniques start with a complete dataset and introduce missing data into selected variables. Our study made use of an existing database which already included missing data on a number of variables. As a result, the true underlying values of the missing data are unknown, as are the true regression coefficients and predictive accuracy of each variable. This lack of a reference criterion can be a limitation in a study. However, the use of existing databases should be encouraged, as this is more translatable to medical research outside of a controlled setting.
Despite recent advances in understanding missing data and imputation methods, most researchers still report deletion, perhaps because of a lack of adequate guidelines for handling missing data. What should be encouraged is the use of more than one method, with the results compared and a preferred approach chosen and defended. When data are missing on several variables, it is important to use a procedure that imputes them all together, rather than one variable at a time. This ensures that the imputed data are related to each other in the same way as the data that are observed.
This study aimed to compare the performance of undiagnosed diabetes risk prediction models across several imputation techniques. The results showed lower model performance when deletion is used to deal with missing data, and little difference between simple and more complex methods in their effect on risk prediction model performance. Missing data is an important aspect of predictive research and needs to be handled correctly. Imputation, specifically more complex and time-intensive imputation, is often avoided by researchers because of its perceived complexity. Simpler imputation methods that allow for similar or better predictive performance are easy to undertake and should encourage researchers of all levels to limit the deletion of individuals with missing data. The negligible difference in model performance between simple and multiple imputation allows for the recommendation of single imputation for handling missing data in undiagnosed diabetes predictive research.
- 1. Donders ART, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006;59(10):1087–91. pmid:16980149
- 2. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychological methods. 2002;7(2):147. pmid:12090408
- 3. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–92.
- 4. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995;142(12):1255–64. pmid:7503045
- 5. Little RJ. Regression with missing X's: a review. Journal of the American Statistical Association. 1992;87(420):1227–37.
- 6. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons; 2004. 320 p.
- 7. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. Balding DG, Bloomfield P, Cressie NAC, Fisher NI, Johnstone AM, Kadane JB, et al., editors. New York: John Wiley & Sons; 2002. 389 p.
- 8. Rubin DB, Schenker N. Multiple imputation in health-care databases: An overview and some applications. Statistics in medicine. 1991;10(4):585–98. pmid:2057657
- 9. Schafer JL. Multiple imputation: a primer. Statistical methods in medical research. 1999;8(1):3–15. pmid:10347857
- 10. Vach W. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Statistics in medicine. 1997;16(1):57–72.
- 11. Schafer JL. Analysis of incomplete multivariate data. 1st ed. Cox DR, Isham V, Keiding N, Reid N, Tong H, editors. Chapman & Hall/CRC; 1997.
- 12. Crawford SL, Tennstedt SL, McKinlay JB. A comparison of analytic methods for non-random missingness of outcome data. J Clin Epidemiol. 1995;48(2):209–19. pmid:7869067
- 13. Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York; London: Springer; 2001. xxii, 568 p.
- 14. Baker J, White N, Mengersen K. Missing in space: an evaluation of imputation methods for missing data in spatial analysis of risk factors for type II diabetes. International journal of health geographics. 2014;13(1):47.
- 15. van der Heijden GJ, Donders ART, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59(10):1102–9. pmid:16980151
- 16. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in medicine. 1999;18(6):681. pmid:10204197
- 17. Quintana-Murci L, Harmant C, Quach H, Balanovsky O, Zaporozhchenko V, Bormans C, et al. Strong maternal Khoisan contribution to the South African coloured population: a case of gender-biased admixture. The American Journal of Human Genetics. 2010;86(4):611–20. pmid:20346436
- 18. City of Cape Town. City of Cape Town – 2011 Census – Ward 009. 2013.
- 19. City of Cape Town. City of Cape Town – Census 2001 – Ward 9. 2006.
- 20. World Health Organization. Screening for Type 2 Diabetes. Report of a World Health Organization and International Diabetes Federation meeting. Geneva: 2003 WHO/NMH/MNC/03.1.
- 21. Brown N, Critchley J, Bogowicz P, Mayige M, Unwin N. Risk scores based on self-reported or available clinical data to detect undiagnosed type 2 diabetes: a systematic review. Diabetes research and clinical practice. 2012;98(3):369–85. pmid:23010559
- 22. Griffin S, Little P, Hales C, Kinmonth A, Wareham N. Diabetes risk score: towards earlier detection of type 2 diabetes in general practice. Diabetes Metab Res Rev. 2000;16(3):164–71. pmid:10867715
- 23. Al Khalaf MM, Eid MM, Najjar HA, Alhajry KM, Doi SA, Thalib L. Screening for diabetes in Kuwait and evaluation of risk scores. East Mediterr Health J. 2010;16(7):725–31. pmid:20799528
- 24. Al-Lawati J, Tuomilehto J. Diabetes risk score in Oman: a tool to identify prevalent type 2 diabetes among Arabs of the Middle East. Diabetes Res Clin Pract. 2007;77(3):438–44. pmid:17306410
- 25. Baan CA, Ruige JB, Stolk RP, Witteman J, Dekker JM, Heine RJ, et al. Performance of a predictive model to identify undiagnosed diabetes in a health care setting. Diabetes Care. 1999;22(2):213–9. pmid:10333936
- 26. Bergmann A, Li J, Wang L, Schulze J, Bornstein S, Schwarz P. A simplified Finnish diabetes risk score to predict type 2 diabetes risk and disease evolution in a German population. Hormone and metabolic research. 2007;39(09):677–82.
- 27. Masconi K, Matsha TE, Erasmus RT, Kengne AP. Independent external validation and comparison of prevalent diabetes risk prediction models in a mixed-ancestry population of South Africa. Diabetology & Metabolic Syndrome. 2015.
- 28. The R Foundation. The R Project for Statistical Computing 2015 [cited 2015 15.09]. Available from: http://www.R-project.org.
- 29. Masconi KL, Matsha TE, Echouffo-Tcheugui JB, Erasmus RT, Kengne AP. Reporting and handling of missing data in predictive research for prevalent undiagnosed type 2 diabetes mellitus: a systematic review. EPMA Journal. 2015;6(1):7. pmid:25829972
- 30. Steyerberg EW. Clinical Prediction Models: A practical approach to development, validation, and updating. Gail M, Krickeberg K, Samet J, Tsiati A, Wong W, editors: Springer; 2009. 500 p.
- 31. Graham JW, Hofer SM, MacKinnon DP. Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research. 1996;31(2):197–218.
- 32. Brick JM, Kalton G. Handling missing data in survey research. Stat Methods Med Res. 1996;5(3):215–38. pmid:8931194
- 33. Alzola CF, Harrell FE. An introduction to S and the Hmisc and design libraries. 2006. p. 310.
- 34. Buuren S, Groothuis-Oudshoorn K. MICE: Multivariate imputation by chained equations in R. Journal of statistical software. 2011;45(3).
- 35. Kengne AP, Masconi K, Mbanya VN, Lekoubou A, Echouffo-Tcheugui JB, Matsha TE. Risk predictive modelling for diabetes and cardiovascular disease. Crit Rev Clin Lab Sci. 2013;51(1):1–12. pmid:24304342
- 36. Moons KG, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio) marker. Heart. 2012;98(9):683–90. pmid:22397945
- 37. Dobson AJ, Kuulasmaa K, Eberle E, Scherer J. Confidence intervals for weighted sums of Poisson parameters. Statistics in medicine. 1991;10(3):457–62. pmid:2028128