Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option

doi:10.1371/journal.pone.0164464

Fig 1.

Crocs dataset, Anatomical bias.

A-C. Visualization of imputation of complete cases with simulated missing values for A. Probabilistic PCA (PPCA) imputation, B. Mean imputation, C. KNN imputation. In all three panels, blue points represent complete cases. Simulated cases with missing values are shown in red if they would be rejected and in green if they would be accepted, using an acceptance threshold of estimated error <0.03. Marker size represents the actual imputation error. D-F. Imputation errors for all complete cases (rows) under all simulated missingness patterns (columns) for D. PPCA imputation, E. Mean imputation, F. KNN imputation. G-H. Feature weights in the first and second principal components. I. Learning curve showing root mean square error (RMSE) as a function of the number of included cases, with 100 replications at each step. RMSE was calculated for both the estimated errors and the actual imputation errors.


Fig 2.

Crocs dataset, Species bias.

A-C. Visualization of imputation of complete cases with simulated missing values for A. Probabilistic PCA (PPCA) imputation, B. Mean imputation, C. KNN imputation. In all three panels, blue points represent complete cases. Simulated cases with missing values are shown in red if they would be rejected and in green if they would be accepted, using an acceptance threshold of estimated error <0.03. Marker size represents the actual imputation error. D-F. Imputation errors for all complete cases (rows) under all simulated missingness patterns (columns) for D. PPCA imputation, E. Mean imputation, F. KNN imputation. G-H. Feature weights in the first and second principal components. I. Learning curve showing root mean square error (RMSE) as a function of the number of included cases, with 100 replications at each step. RMSE was calculated for both the estimated errors and the actual imputation errors.


Fig 3.

Gridsearch.

Gridsearch to estimate γ and δ for each dataset. Optimal parameters are indicated with a red circle. A-C. Crocs dataset with anatomical bias imputed with PPCA (A), mean imputation (B), or KNN imputation (C). D-F. Crocs dataset with species bias imputed with PPCA (D), mean imputation (E), or KNN imputation (F). G-I. Echocardiogram dataset imputed with PPCA (G), mean imputation (H), or KNN imputation (I). J-L. Chronic kidney disease dataset imputed with PPCA (J), mean imputation (K), or KNN imputation (L).


Fig 4.

Echocardiogram dataset.

A-C. Visualization of imputation of complete cases with simulated missing values for A. Probabilistic PCA (PPCA) imputation, B. Mean imputation, C. KNN imputation. In all three panels, blue points represent complete cases. Simulated cases with missing values are shown in red if they would be rejected and in green if they would be accepted, using an acceptance threshold of estimated error <0.03. Marker size represents the actual imputation error. D-F. Imputation errors for all complete cases (rows) under all simulated missingness patterns (columns) for D. PPCA imputation, E. Mean imputation, F. KNN imputation. G-H. Feature weights in the first and second principal components. I. Learning curve showing root mean square error (RMSE) as a function of the number of included cases, with 100 replications at each step. RMSE was calculated for both the estimated errors and the actual imputation errors.


Fig 5.

Chronic kidney disease dataset.

A-C. Visualization of imputation of complete cases with simulated missing values for A. Probabilistic PCA (PPCA) imputation, B. Mean imputation, C. KNN imputation. In all three panels, blue points represent complete cases. Simulated cases with missing values are shown in red if they would be rejected and in green if they would be accepted, using an acceptance threshold of estimated error <0.03. Marker size represents the actual imputation error. D-F. Imputation errors for all complete cases (rows) under all simulated missingness patterns (columns) for D. PPCA imputation, E. Mean imputation, F. KNN imputation. G-H. Feature weights in the first and second principal components. I. Learning curve showing root mean square error (RMSE) as a function of the number of included cases, with 100 replications at each step. RMSE was calculated for both the estimated errors and the actual imputation errors.
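
The error matrices in panels D-F (complete cases as rows, simulated missingness patterns as columns) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes mean imputation only, single-feature missingness patterns, a leave-one-out mean, and invented toy data. The names `mean_imputation_errors`, `rmse`, and `toy_cases` are hypothetical. The estimated error used for the accept/reject decision is not reproduced here, because the captions do not define it.

```python
from math import sqrt
from statistics import mean


def mean_imputation_errors(cases):
    """Return an error matrix: rows are complete cases, columns are features.

    Entry [i][j] is the absolute error obtained when feature j of case i is
    masked and imputed with the mean of feature j over the remaining cases.
    Assumption: each missingness pattern removes exactly one feature.
    """
    n_features = len(cases[0])
    errors = []
    for i, case in enumerate(cases):
        others = cases[:i] + cases[i + 1:]  # leave the masked case out
        row = []
        for j in range(n_features):
            imputed = mean(other[j] for other in others)
            row.append(abs(imputed - case[j]))
        errors.append(row)
    return errors


def rmse(values):
    """Root mean square of a collection of errors."""
    values = list(values)
    return sqrt(sum(v * v for v in values) / len(values))


# Toy complete cases, invented for illustration (not from any dataset above).
toy_cases = [
    [0.10, 0.20],
    [0.30, 0.40],
    [0.50, 0.90],
]

error_matrix = mean_imputation_errors(toy_cases)
overall_rmse = rmse(e for row in error_matrix for e in row)
```

Plotting `error_matrix` as a heat map would give a picture analogous to panel E, and repeating the calculation on growing random subsets of cases would give the actual-error part of a learning curve like panel I.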
