Fig 1.
Exemplary binary classification problem.
An exemplary binary classification problem with one-dimensional feature space and class labels y ∈ {−1, +1}. We compare a GPC (using a radial-basis-function kernel and expectation propagation (EP) with the implementation from [19]) and our CASIMAC (based on a GPR with Matérn kernel as the underlying regression model), both trained on the same D = 25 training datapoints. The top plot shows the expectation values
,
and the standard deviations of the latent space distributions
,
for both approaches (on two different scales), whereas the bottom plot contains the class probability predictions
and
for both approaches. The gray and white areas indicate the true regions of the classes +1 and −1, respectively. All datapoints are sampled without noise from these regions. Finally, the dotted horizontal lines represent the decision boundaries. The corresponding cone segments for CASIMAC,
and
, are shown in the top plot.
Fig 2.
Segmentation of the latent space.
Segmentation of the latent space of a ternary classification problem (n = 3) into three congruent cone segments
, each associated with one of the three classes as indicated by the respective colors. The vertices p1, p2, p3 of the simplex
(marked by the gray shading) with barycenter 0 lie on the central ray of the respective segments. The borders of the segments are defined by the vertices −p1, −p2, −p3 of the mirrored simplex
.
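The segmentation described in this caption can be sketched for n = 3 as follows. This is a minimal illustration, not the paper's implementation: it assumes a regular simplex with unit-length vertices, for which the cone segment containing a latent point z is the one whose vertex has the largest inner product with z. The function names `simplex_vertices_2d` and `cone_segment` are hypothetical.

```python
import numpy as np

def simplex_vertices_2d():
    """Vertices p1, p2, p3 of a regular triangle with barycenter 0.

    Assumption: unit-length vertices, with p1 pointing straight up.
    """
    angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def cone_segment(z, vertices):
    """Return the 1-based index k of the cone segment containing z.

    For a regular simplex centered at 0, this is the vertex p_k with the
    largest inner product <z, p_k>; ties go to the lowest index.
    """
    scores = vertices @ np.asarray(z, dtype=float)
    return int(np.argmax(scores)) + 1

P = simplex_vertices_2d()
```

Under this assumption, the border between segments 1 and 2 is the ray through −p3, since ⟨z, p1⟩ = ⟨z, p2⟩ holds exactly along the direction p1 + p2 = −p3.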
Fig 3.
Illustration of the training data transformation.
Illustration of the training data transformation from (14) for an exemplary ternary classification problem (n = 3) with a two-dimensional feature space
. The colored regions on the left illustrate the true regions of the three classes with their respective labels, whereas the points denote the (noiselessly) sampled training datapoints x1, …, xD. We sample three points from each class (i.e., D = 9) and use different symbols (∘, ◽ and ♢) to uniquely identify each point. As the distance metric d underlying f, we choose the Euclidean distance. The other hyperparameters are chosen as α ≔ 0, β ≔ 1, and kα, kβ ≔ 1. The figure shows that the farther a training datapoint xi (for i ∈ {1, …, D}) is from its nearest foreign-class neighbor, the clearer the membership of f(xi) in the respective cone segment
.
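The quantity this caption refers to, the distance of each training datapoint to its nearest foreign-class neighbor, can be computed as in the sketch below. It covers only that distance under the Euclidean metric. It does not reproduce the transformation f from (14), whose dependence on α, β, kα and kβ is not stated in the caption. The function name `nearest_foreign_distance` is hypothetical.

```python
import numpy as np

def nearest_foreign_distance(X, y):
    """Euclidean distance from each point to its nearest point of another class.

    X: array of shape (D, m) with the training features.
    y: array of shape (D,) with the class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances, shape (D, D).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Keep only pairs with different labels; same-class pairs are ignored.
    foreign = y[:, None] != y[None, :]
    return np.where(foreign, dists, np.inf).min(axis=1)
```

For example, with points at (0, 0), (1, 0), (3, 0) and labels 1, 1, 2, the distances are 3, 2 and 2.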
Fig 4.
Illustration of the regression model.
Illustration of the regression model for the exemplary ternary classification problem from Fig 3. We choose a GPR model with a Matérn kernel, which is fitted to the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)) that are obtained from the training data transformation f according to Fig 3. To illustrate the effect of
, we show how the points of the rectangular grid on the left get transformed by
. In particular, we show this transformation for the corner points marked by stars. The color of any point x on the grid and, respectively, of any point
on the transformed grid indicates the true class of x.
Fig 5.
Illustration of our proposed calibrated simplex-mapping classifier.
Illustration of our proposed calibrated simplex-mapping classifier for the exemplary ternary classification problem from Fig 3 based on the GPR model
from Fig 4. According to our definition (27), the classifier determines for each feature space point
(on the left) the cone segment containing the latent space counterpart
(on the right) and then takes the label of this cone segment as the class label prediction
for x. In other words, the cuts of the cone segment borders through the latent space
determine the class membership of each data point x according to its learned latent space position
. We mark the predicted class boundaries (on the left) as well as the corresponding cone segment boundaries (on the right) by dashed lines. As can be seen, the predicted class boundaries in
do not deviate much from the true class boundaries indicated by the different background and grid colors on the left. In other words, our classifier produces only a few misclassifications. This can also be seen from the fact that the color of most of the transformed grid points in
on the right (indicating the true class) is identical to the underlying background color (indicating the predicted class).
Fig 6.
Synthetic data visualization.
Illustration of the ground truth (40) for the synthetic data class labels.
Table 1.
Test scores for the synthetic data set based on T = 10000 uniformly sampled test datapoints. The proba-loss is defined in (44). We show the means and the corresponding standard deviations (in brackets) over all 10 classification tasks. The best mean results are highlighted in bold.
Fig 7.
Synthetic data class probability predictions.
(a) CASIMAC class probabilities. (b) GPC class probabilities. Class probability predictions of CASIMAC and of GPC for the synthetic data set. The color of each background point corresponds to the weighted average of the class colors from Fig 6 with a weight corresponding to the predicted probability of the respective class at this point. Thus, clear colors as in (a) represent high probabilities for a single class, whereas washed-out colors as in (b) represent almost uniform probabilities. We also show the training data set (consisting of D = 40 points) on which the classifiers have been trained. The color of the points corresponds to their true class. While most of the training datapoints are correctly classified (shown as ∘), our CASIMAC misclassifies three training datapoints (shown as ♢) close to the class borders.
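The background coloring described in this caption amounts to a probability-weighted average of class colors. A minimal sketch follows, assuming RGB colors in [0, 1]. The palette `CLASS_COLORS` is a hypothetical placeholder and not the palette of Fig 6.

```python
import numpy as np

# Hypothetical RGB palette with one row per class (not the colors from Fig 6).
CLASS_COLORS = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

def blend_class_colors(probabilities, class_colors=CLASS_COLORS):
    """Weighted average of the class colors with the class probabilities as weights.

    probabilities: array of shape (..., n) whose last axis sums to 1.
    Returns an array of shape (..., 3) with the blended RGB colors.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    return probabilities @ np.asarray(class_colors, dtype=float)
```

A confident prediction such as (1, 0, 0) reproduces the pure class color, whereas a uniform prediction yields an equal mix of all class colors, which is the washed-out appearance the caption mentions.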
Table 2.
Overview of the five real-world data sets and their basic characteristics: n is the number of classes, m is the number of features or, in other words, the dimension of the feature space, I is the total number of datapoints, and D is the number of training datapoints. The number of test datapoints is T ≔ I − D. We abbreviate the references as a) [25], b) [26], c) [27], d) [28], e) [29], f) [30], g) [2], and h) [31]. Also, we turned the originally multi-class data set pine into a binary problem as described in [2].
Table 3.
Test scores for the benchmarked classifiers on the five real-world data sets from Table 2. We show the means and the corresponding standard deviations (in brackets) over all 10 classification tasks. The best mean results are highlighted in bold.
Fig 8.
Calibration curves for the benchmarked classifiers on the pine data set. The closer the curves are to the diagonal reference line, the better the calibration of the respective classifier. In particular, our method exhibits the best calibration properties.
Table 4.
Calibration score measured in terms of the area-deviation (area between each curve and the diagonal reference line) for the calibration curves from Fig 8, which refer to the pine data set. The best result is highlighted in bold.
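One way to evaluate such an area-deviation from a sampled calibration curve is sketched below. It assumes the curve is given as points (mean predicted probability, observed fraction of positives) and integrates the absolute gap to the diagonal with the trapezoidal rule. The paper's exact definition may differ in its details, and the function name `area_deviation` is hypothetical.

```python
import numpy as np

def area_deviation(mean_predicted, fraction_positive):
    """Approximate area between a calibration curve and the diagonal y = x.

    Both arguments are 1-D arrays of equal length describing the curve points.
    The trapezoidal rule is applied to |y - x|, so the result is only an
    approximation where the curve crosses the diagonal between two points.
    """
    x = np.asarray(mean_predicted, dtype=float)
    y = np.asarray(fraction_positive, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    gap = np.abs(y - x)
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(x)))
```

A perfectly calibrated classifier gives 0, and a curve that lies a constant 0.1 above the diagonal over the whole interval [0, 1] gives 0.1.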
Fig 9.
Compressed latent space representation.
(a) Compressed latent space representation of the training data. (b) Compressed latent space representation of the test data. Images of (a) the training datapoints under the transformation C ∘ f and of (b) the test datapoints under the transformation . See (46) and (47), respectively. The color of each point represents its true class, while the form of the markers indicates whether a point is correctly classified (∘) or not (♢) by our CASIMAC. All training datapoints are correctly classified by our construction of f and C, see (21) and (39). The scales on the three simplex edges indicate the barycentric coordinates of the simplex points (Lemma 14).
Table 5.
Confusion matrix (error matrix): the entry in row k and column l is the number of test datapoints which belong to class k and for which our CASIMAC predicts the class label l. Correct classifications (on the diagonal) are highlighted in bold. There are no misclassifications between members of the classes 1 and 2, as can be expected from Fig 9a.
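The row and column convention stated in this caption can be made concrete with a short sketch. It assumes class labels 1, …, n, as in the caption. The function name `confusion_matrix` is chosen for illustration and is not taken from the paper.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Entry [k-1, l-1] counts the points of true class k predicted as class l.

    Labels are assumed to be integers in 1, ..., n_classes.
    """
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for k, l in zip(true_labels, predicted_labels):
        matrix[k - 1, l - 1] += 1
    return matrix
```

The diagonal sum divided by the total number of entries is the accuracy, and a zero in both off-diagonal positions (1, 2) and (2, 1) corresponds to the absence of misclassifications between classes 1 and 2.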
Table 6.
Accuracy for the fashion-mnist data set.
Top-1 to top-5 accuracy of our naive and of our informed CASIMAC on the fashion-mnist data set. In the naive approach, we use a purely Euclidean distance metric between the images, whereas the informed approach also takes the structural image similarity into account. The best scores are highlighted in bold.
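For reference, top-k accuracy counts a prediction as correct if the true class is among the k classes with the highest predicted probability. A minimal sketch follows, assuming 0-based class indices and a matrix of class probabilities. The function name `top_k_accuracy` is hypothetical.

```python
import numpy as np

def top_k_accuracy(probabilities, labels, k):
    """Fraction of samples whose true label is among the k most probable classes.

    probabilities: array of shape (T, n) with the predicted class probabilities.
    labels: array of shape (T,) with 0-based true class indices.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest probabilities per row.
    top_k = np.argsort(-probabilities, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

By construction, the top-k accuracy is non-decreasing in k and reaches 1 for k = n.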