Rare event detection by progressive clustering undersampling

doi:10.1371/journal.pone.0340758

Fig 1.

(a) PCU applied to the petroleum dataset.

At each clustering stage, a single resulting cluster consistently contains the majority of positive samples, except at the final stage. (b) Supervised algorithms are used to guide the test set along the same path followed by the training set.
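
A minimal sketch of the selection step described in (a), under stated assumptions: the cluster assignments are taken as given (the caption does not say which clustering algorithm produces them), and the function names `select_positive_cluster` and `undersample_stage` are hypothetical, not taken from the paper. It only illustrates keeping the single cluster that holds most of the positive samples at one stage.

```python
from collections import Counter


def select_positive_cluster(cluster_ids, labels):
    """Return (cluster id with the most positives, its share of all positives).

    cluster_ids: cluster assignment per sample (produced by any clustering step).
    labels: 1 for the minority (positive) class, 0 for the majority class.
    """
    positives = Counter(c for c, y in zip(cluster_ids, labels) if y == 1)
    if not positives:
        raise ValueError("no positive samples to follow")
    best, count = positives.most_common(1)[0]
    return best, count / sum(positives.values())


def undersample_stage(cluster_ids, labels):
    """Keep only the samples in the selected cluster (hypothetical single stage)."""
    best, share = select_positive_cluster(cluster_ids, labels)
    keep = [i for i, c in enumerate(cluster_ids) if c == best]
    return keep, share


# Toy example: six samples assigned to three clusters.
cluster_ids = [0, 0, 1, 1, 1, 2]
labels = [0, 1, 1, 1, 0, 0]
kept_indices, positive_share = undersample_stage(cluster_ids, labels)
```

In this toy input, cluster 1 holds two of the three positives, so the stage keeps only its samples and discards the rest. Repeating the step on the retained samples would give the progressive behaviour the caption describes.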


Fig 2.

The scatter plots illustrate the undersampling process.

The first column shows the original training set. The 3D plot uses scaled features, with the final cluster highlighted in yellow. The second row distinguishes the minority (orange) and majority (blue) classes, and the third row uses the first two principal components from Principal Component Analysis (PCA) for visualization. Zooming in on the relevant instances simplifies the subsequent training process.


Fig 3.

The first two principal components of the training data before and after applying balancing methods.

The PCU method achieves the best separation between classes after the second removal wave.


Fig 4.

Logging responses for the two wells in the test dataset, along with prediction outputs (grey) and target layers (red). Measured Depth (MD) is the depth within the well, in feet.


Fig 5.

Evaluation scores on the test set for classifiers without resampling and for state-of-the-art (SOTA) resampling methods combined with KNN and Random Forest classifiers.

PCU achieves the highest F1-score in the Semi-Guided setting and the highest precision in the Fully-Guided setting.


Fig 6.

Two Moons dataset with three training sets of different noise levels and a test set for evaluation.


Table 1.

Settings for unsupervised algorithms applied to various noisy versions of the Two Moons dataset, showing the percentage of minority class samples retained within the unique cluster and the number of removed samples.

The algorithm in bold indicates the selected approach for further evaluation.


Table 2.

KNN classifier average performance with standard deviation on versions of the Two Moons dataset resampled using either PCU or other techniques.

The Semi-Guided method significantly reduces the impact of noise on the subsequent KNN classifier. Symbol †: Statistically significant compared to Semi-Guided method. Symbol ‡: Statistically significant compared to Fully-Guided method.


Fig 7.

Two Moons dataset with three training sets of different feature noise levels and test sets for evaluation.
