SMOTE-CD: SMOTE for compositional data

doi:10.1371/journal.pone.0287705

Fig 1.

(a) Expert-based mapped image of Maupiti island and (b) Pleiades image of Maupiti island segmented with Felzenszwalb’s method.

Table 1.

Percentage of the number of pixels of each class on Maupiti data, based on expert mapping.

Fig 2.

Difference between the original SMOTE algorithm and SMOTE-CD.

The blue points are the points to oversample. (a) Original SMOTE: the points to oversample belong to the same class (here, class 1). (b) SMOTE-CD: the points to oversample are those that share the same majority class in their compositional label vector.
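
The selection rule in panel (b) can be illustrated with a minimal sketch. It assumes that the "majority class" of a point is the index of the largest component of its compositional label; the function names and the toy label matrix `Y` are hypothetical and are not taken from the paper's code. The sketch covers only the selection of points, not the generation of synthetic samples.

```python
import numpy as np

def majority_class(labels):
    """Return, for each row, the index of the largest label component."""
    labels = np.asarray(labels, dtype=float)
    return labels.argmax(axis=1)

def select_for_oversampling(labels, target_class):
    """Return indices of points whose majority class equals target_class."""
    return np.flatnonzero(majority_class(labels) == target_class)

# Hypothetical compositional labels: each row sums to 1 over 3 classes.
Y = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
])

# Points 0 and 2 have class 0 as their largest component.
selected = select_for_oversampling(Y, target_class=0)
```

With hard class labels, as in original SMOTE, the same selection reduces to comparing each label with the target class directly.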

Fig 3.

Simulation of 400 points using (a) B^(a) and (b) B^(b).

Fig 4.

An example of SMOTE-CD.

(a) The original imbalanced dataset; (b) the balanced output dataset, with the created points displayed as crosses.

Table 2.

Comparison of simulated raw data (4 classes) and oversampled data, repeated 100 times. Displayed results are mean (s.d.).

Fig 5.

Performance of the Dirichlet model on raw and oversampled data as a function of dataset imbalance (indicated by the % of observations in class 0), based on 16 features and 4 classes.

Fig 6.

Average R² and F1-score per class of the Dirichlet model on raw and oversampled simulated data.

Bars represent the mean score, vertical lines represent the standard deviation.

Table 3.

Results comparing raw and oversampled Maupiti data (4 classes) with 5-fold cross-validation. Displayed results are mean (s.d.).

Fig 7.

Average R² score per class of the Gradient Boosting tree on raw and oversampled Maupiti data.

The red dotted lines represent the weight of each class, and the value below the class is its weight. Bars represent the mean score, vertical lines represent the standard deviation.

Table 4.

Results comparing raw and oversampled Tecator data (3 classes) with 10-fold cross-validation, iterated 100 times. Displayed results are mean (s.d.).
