
Table 1.

Basic information on the datasets used in this study.


Fig 1.

The design of the reliability-based training data cleaning method based on inductive conformal prediction (ICP), and the validation process. The left half (module 1) shows the conformal-prediction-based training data cleaning method; the right half (module 2) shows the modeling of the downstream classification tasks and the evaluation on the validation and test sets. Following the standard ICP method, the training dataset is partitioned into a proper training set and a calibration set. The proper training set represents the noisy training data, while the calibration set represents the well-curated data. Wrongly labeled data and outliers in the proper training set are detected and corrected based on p-values calibrated against the distribution of nonconformity measures on the calibration set. The cleaned training set is then used to train classifiers for the downstream classification tasks and compared against baselines.
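The cleaning step of module 1 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifier (logistic regression), the nonconformity measure (one minus the predicted probability of the assigned label), the correction rule, and the helper name `icp_clean` are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def icp_clean(X_proper, y_proper, X_calib, y_calib, threshold=0.5):
    """Flag and correct suspect labels in the proper training set using
    p-values calibrated on the calibration set's nonconformity scores."""
    clf = LogisticRegression(max_iter=1000).fit(X_calib, y_calib)
    classes = clf.classes_  # sorted class labels

    # Nonconformity score: one minus the predicted probability of the label.
    proba_calib = clf.predict_proba(X_calib)
    idx = np.searchsorted(classes, y_calib)
    alpha_calib = 1.0 - proba_calib[np.arange(len(y_calib)), idx]
    n = len(alpha_calib)

    proba = clf.predict_proba(X_proper)
    y_clean = np.asarray(y_proper).copy()
    for i in range(len(y_clean)):
        # p-value of each candidate label: fraction of calibration scores
        # at least as nonconforming as this example would be under that label.
        p = np.array([(np.sum(alpha_calib >= 1.0 - proba[i, k]) + 1) / (n + 1)
                      for k in range(len(classes))])
        assigned = np.searchsorted(classes, y_clean[i])
        if p[assigned] < threshold:       # assigned label looks wrong
            if p.max() >= threshold:      # another label is credible: correct it
                y_clean[i] = classes[p.argmax()]
            # otherwise no label is credible: treat as an outlier (kept as-is here)
    return y_clean
```

With a high threshold (e.g. 0.8) more labels are flagged as suspect; with a low threshold (e.g. 0.2) only labels that look very unlikely under the calibration distribution are touched, mirroring the detection thresholds varied in the figures below.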


Fig 2.

The model accuracy improvement with training data cleaning in the DILI literature classification task based on the S2V embeddings under different percentages of training data label permutation. The classification accuracy on the validation set (A–C) and on the test set (D–F) with a wrongly labeled data detection threshold of 0.8 (A,D), 0.5 (B,E), and 0.2 (C,F). The mean and 95% confidence intervals are shown. Statistically significant improvement in accuracy has been marked as follows: .: p < 0.1, *: p < 0.05, **: p < 0.01, ***: p < 0.001; first row: LR models, second row: LDA models.


Fig 3.

The model accuracy improvement with training data cleaning in the DILI literature classification task based on the W2V embeddings under different percentages of training data label permutation. The classification accuracy on the validation set (A–C) and on the test set (D–F) with a wrongly labeled data detection threshold of 0.8 (A,D), 0.5 (B,E), and 0.2 (C,F). The mean and 95% confidence intervals are shown. The statistically significant improvement in accuracy has been marked as follows: .: p < 0.1, *: p < 0.05, **: p < 0.01, ***: p < 0.001; first row: LR models, second row: LDA models.


Fig 4.

The model performance in AUROC and AUPRC with training data cleaning in the COVID-19 patient ICU admission prediction task under different percentages of training data label permutation. The AUROC (A) and AUPRC (B) on the validation set, and the AUROC (C) and AUPRC (D) on the test set, with a wrongly labeled data detection threshold of 0.8. The mean and 95% confidence intervals are shown. Statistically significant improvements have been marked as follows: .: p < 0.1, *: p < 0.05, **: p < 0.01, ***: p < 0.001; first row: LR models, second row: LDA models.


Fig 5.

The model performance in AUROC and AUPRC with training data cleaning in the COVID-19 patient ICU admission prediction task under different percentages of training data label permutation. The AUROC (A) and AUPRC (B) on the validation set, and the AUROC (C) and AUPRC (D) on the test set, with a wrongly labeled data detection threshold of 0.5. The mean and 95% confidence intervals are shown. Statistically significant improvements have been marked as follows: .: p < 0.1, *: p < 0.05, **: p < 0.01, ***: p < 0.001; first row: LR models, second row: LDA models.


Fig 6.

The model performance in accuracy and F1 score with training data cleaning in the breast cancer subtype prediction task under different percentages of training data label permutation. The classification accuracy (A) and macro-averaged F1 score (B) on the validation set, and the classification accuracy (C) and macro-averaged F1 score (D) on the test set, with a wrongly labeled data detection threshold of 0.5. The mean and 95% confidence intervals are shown. Statistically significant improvements have been marked as follows: .: p < 0.1, *: p < 0.05, **: p < 0.01, ***: p < 0.001; first row: LR models, second row: LDA models.


Fig 7.

The number of wrong labels and outliers detected under different percentages of training data label permutation in the DILI literature prediction task with W2V embeddings. The number of wrongly labeled data (A–C) and outliers (D–F) under different detection thresholds of wrongly labeled data: 0.8 (A,D), 0.5 (B,E), and 0.2 (C,F). The cleaning process visualization is based on W2V embeddings and fixed hyperparameters for the conformal predictor. Here, "total" means the total number of wrongly labeled data detected, regardless of label.


Fig 8.

The number of ground-truth wrong labels before and after training data cleaning under different percentages of training data label permutation in the DILI literature prediction task with W2V embeddings. The number of wrongly labeled data before/after training data cleaning and the number of corrections made under different detection thresholds of wrongly labeled data: 0.8 (A), 0.5 (B), and 0.2 (C). The cleaning process visualization is based on W2V embeddings and fixed hyperparameters for the conformal predictor.


Fig 9.

The number of wrong labels detected under different percentages of training data label permutation in the COVID-19 patient ICU admission prediction task. The number of wrongly labeled data based on LR models (A–C) and LDA models (D–F) under different detection thresholds of wrongly labeled data: 0.8 (A,D), 0.5 (B,E), and 0.2 (C,F). The cleaning process visualization is based on hyperparameters for the conformal predictor tuned on the validation dataset for each classifier and each percentage of labels permuted.


Fig 10.

The number of wrong labels detected under different percentages of training data label permutation in the TCGA breast cancer subtype prediction task. The number of wrongly labeled data based on LR models (A–C) and LDA models (D–F) under different detection thresholds of wrongly labeled data: 0.8 (A,D), 0.5 (B,E), and 0.2 (C,F). The cleaning process visualization is based on hyperparameters for the conformal predictor tuned on the validation dataset for each classifier and each percentage of labels permuted.
