Abstract
We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular (n − 1)-dimensional simplex, n being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
Citation: Heese R, Schmid J, Walczak M, Bortz M (2023) Calibrated simplex-mapping classification. PLoS ONE 18(1): e0279876. https://doi.org/10.1371/journal.pone.0279876
Editor: Xiyu Liu, Shandong Normal University, CHINA
Received: September 13, 2021; Accepted: December 16, 2022; Published: January 17, 2023
Copyright: © 2023 Heese et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data considered in the manuscript are publicly available online: 1. Alcohol data set: https://archive.ics.uci.edu/ml/datasets/Alcohol+QCM+Sensor+Dataset 2. Climate data set: https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes 3. HIV data set: https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage 4. Pine data set: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes 5. Wifi data set: https://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization 6. Fashion-MNIST data set: https://github.com/zalandoresearch/fashion-mnist.
Funding: This work was funded by the Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. (Fraunhofer Society, Hansastraße 27c, 80686 München, Germany) via two of its institutes, namely the Fraunhofer Institute for Industrial Mathematics (Fraunhofer ITWM, Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany) and Fraunhofer Center for Machine Learning. Additional funding was provided by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the Priority Programme “SPP 2331: Machine Learning in Chemical Engineering”. These funders provided support in the form of salaries for the authors, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist. All authors are employed by the Fraunhofer Society. This does not alter our adherence to PLOS ONE policies on sharing data and materials. No competing interests arise from the affiliation with the Fraunhofer Society or the above-mentioned funders. Apart from the statements made here, there are no relevant declarations relating to employment, consultancy, patents, products in development, or marketed products that lead to competing interests.
1 Introduction
In many classification tasks, it is not sufficient to merely predict the class label for a given feature space point x. Instead, it is often important to also have good predictions for the class label probabilities, because these probability predictions provide a measure for the confidence one can have in the individual class label predictions. Such additional confidence information is important in many applications, for instance in clinical applications [1]. Classifiers that come with such additional class probability predictions are called calibrated. Some classifiers from methods like logistic regression or Gaussian process classification (GPC) are intrinsically calibrated. Also, there are various methods to calibrate an intrinsically non-calibrated classifier or to improve the calibration quality of an ill-calibrated classifier [2–5].
1.1 Contribution
In this paper, we propose a novel supervised learning method for multi-class classification that yields classifiers with a high potential to be intrinsically well-calibrated. It can be applied to general classification problems in an arbitrary metrizable feature space 𝒳 of possibly non-numeric features and with an arbitrary number n of classes with labels l1, …, ln. Starting from a training data set
(1) 𝒟 = {(x1, y1), …, (xD, yD)}
of feature space points x1, …, xD together with associated class labels y1, …, yD, the training of our classifier proceeds in two training steps:
- In a first step, the training datapoints x1, …, xD are transformed by means of a suitable training data transformation f to a latent space ℝ^{n−1}, which we partition into n cone segments C1, …, Cn corresponding to the n classes and defined in terms of a regular (n − 1)-dimensional simplex in ℝ^{n−1}.
- In a second step, a regression model ĝ is trained based on the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)), which are obtained by means of the training data transformation f from the first step. In this manner, the latent space representation of the training data is extended to the whole feature space.
We design the training data transformation f such that the latent space counterpart f(xi) of each datapoint xi is located in the corresponding cone segment and such that the location of f(xi) in this segment reflects the distances of xi from its own-class and its foreign-class datapoint neighbors. Concerning the choice of the distance metric and the number of neighbors used in the definition of f, we are completely free, and the same is true for the choice of the regression model ĝ. In particular, these quantities can be freely customized and tuned to the particular problem at hand.
As soon as the above training steps have been performed, our classifier is readily obtained. Indeed, its class label prediction ŷ(x) for a given feature space point x ∈ 𝒳 is the label of the (first) cone segment Cy that contains ĝ(x), the regression model’s point prediction for x. If in addition to these point predictions, the regression model also provides probabilistic predictions, then our classifier yields predictions p̂(y|x) for the class label probabilities as well. Specifically, these class label probability predictions read
(2) p̂(y|x) = ∫ 1Cy(z) p(z|x, 𝒟) dz,
where p(⋅|x, 𝒟) for a given x ∈ 𝒳 is the regression model’s prediction for the probability distribution of latent space points. As a classifier coming with class label probability predictions, our classifier is calibrated. We refer to it as a calibrated simplex-mapping classifier (CASIMAC) because the underlying latent-space mapping f is defined in terms of the vertices of a simplex in ℝ^{n−1}.
We point out that the concept of leveraging Bayesian probabilistic prediction power combined with latent space mappings has been studied before. Several recent publications propose to couple a deep neural network with Gaussian processes (GPs) for an improved uncertainty estimate of model predictions [6–10]. Alternative approaches explore the use of deep neural networks not as feature extraction methods but, for instance, to suitably estimate the mean functions of GPs [11] or to predict their covariance functions and hyperparameters [12]. Yet, due to the high complexity of the deep neural network components in these models, the algorithms mentioned above are well-suited for large data sets with abundant training data available [13]. In the present paper, by contrast, we propose a simple latent space representation of the original feature space as the core component of a well-calibrated classifier that also works on less complex data sets. In particular, our method has recently been successfully used for an industrial application [14].
In summary, our contribution consists of the following parts:
- We propose a novel supervised learning method for multi-class classification with a simplex-like latent space.
- We rigorously establish the theoretical background including detailed proofs.
- We find that the computational effort of making predictions with our proposed classifier is comparatively low (in contrast to, e. g., GPC).
- We show how the latent space of our proposed classifier can be suitably visualized.
- We benchmark the prediction and calibration properties of our proposed classifier.
Additionally, we discuss potential use cases and further research directions.
1.2 Simple example
In order to concretize the aforementioned assets of our method and paint a more intuitive picture, we briefly discuss a simple case with n = 2 classes (i. e., a binary classification problem), which is shown in Fig 1. The two class labels are chosen as l1 ≔ −1 and l2 ≔ +1, respectively. In that case, our latent space is just the real line ℝ and the cone segments simply are C−1 = (−∞, 0] and C+1 = [0, ∞), the negative and the positive half-axis, respectively. Consequently, our definition of the class label predictions simplifies to
(3) ŷ(x) = −1 if ĝ(x) ≤ 0 and ŷ(x) = +1 if ĝ(x) > 0,
while our definition of the class probability predictions simplifies to
(4) p̂(+1|x) = ∫0∞ p(z|x, 𝒟) dz and p̂(−1|x) = 1 − p̂(+1|x).
As before, ĝ is a regression model that is fitted to the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)) and p(⋅|x, 𝒟) is the associated predictive a posteriori distribution for x based on the same transformed training datapoints.
An exemplary binary classification problem with one-dimensional feature space and class labels y ∈ {−1, +1}. We compare a GPC (using a radial-basis-function kernel and expectation propagation (EP) with the implementation from [19]) and our CASIMAC (based on a GPR with Matérn kernel as the underlying regression model), both trained on the same D = 25 training datapoints. The top plot shows the expectation values and the standard deviations of the latent space distributions for both approaches (on two different scales), whereas the bottom plot contains the class probability predictions p̂(+1|x) for both approaches. The gray and white areas indicate the true regions of the class +1 and −1, respectively. All datapoints are sampled without noise from these regions. Finally, the dotted horizontal lines (⋯⋯) represent the decision boundaries. The corresponding cone segments for CASIMAC, C−1 and C+1, are shown in the top plot.
A simple choice for the regression model ĝ is a Gaussian process regressor (GPR). And a possible choice for the training data transformation f on which ĝ depends is the simple distance-based map defined as follows:
(5)
where the sign in this formula is simply given by the class label yi of the datapoint xi and the distance is to be understood w.r.t. the chosen metric on the feature space 𝒳. If we train a CASIMAC with these choices for f and ĝ on an exemplary training data set with D = 25 points in the one-dimensional feature space 𝒳, we obtain the class probability predictions p̂(+1|⋅) depicted in red in Fig 1 (bottom plot). If we train a GPC on the same training data, we obtain the respective class probability predictions depicted in blue. As usual,
(6) p̂GPC(+1|x) = ∫ σ(z) pGPC(z|x, 𝒟) dz,
where σ(z) ≔ 1/(1 + e−z) represents the logistic sigmoid function and pGPC(⋅|x, 𝒟) is the predictive a posteriori distribution of the GPC for the test point x [15, 16].
We immediately see from the class probability plots in Fig 1 that our CASIMAC has a high confidence in its class label predictions in regions of densely sampled datapoints, as one would intuitively expect. In contrast, the GPC has considerably less confidence in its class label predictions at or near the datapoints despite its high prediction accuracy. In essence, this is because the training data transformation f that underlies our classifier takes into account the actual distances of the datapoints and because the latent space probability density p(⋅|x, 𝒟) is considerably more concentrated around its expectation value than is the case for its GPC analog. Another downside of the GPC—further impairing its calibration quality—is that its class probability predictions (6) can be computed only approximately. In fact, already the computation of the non-normal distribution pGPC(⋅|x, 𝒟), a D-dimensional integral, requires quite sophisticated approximations like the Laplace approximation [15], expectation propagation [15], variational inference [17], or the Markov chain Monte Carlo approximation [18], to name a few. As opposed to this, no approximations are required for the computation of the CASIMAC class probabilities because there is a closed-form expression for them, due to the normality of the distributions p(⋅|x, 𝒟) assumed in our example from Fig 1.
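To make the binary recipe concrete, here is a minimal, self-contained sketch (our own illustration, not the authors’ implementation; the sine-based synthetic labels are an assumption): the latent target of each training point is its class sign times the distance to its nearest foreign-class neighbor (the special case α = 0, β = 1, kα = kβ = 1 discussed in Section 2.1), a GPR is fitted to these targets, and the class-(+1) probability is the predictive probability mass on the positive half-axis.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3.0, 3.0, 25)).reshape(-1, 1)   # 1-d feature space
y = np.where(np.sin(X[:, 0]) > 0, 1, -1)                 # labels in {-1, +1}

# Latent targets: sign from the class label, magnitude from the distance
# to the nearest foreign-class training point.
pairwise = np.abs(X - X.T)
foreign = y[None, :] != y[:, None]
z = y * np.where(foreign, pairwise, np.inf).min(axis=1)

gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, z)

X_test = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
mu, sd = gpr.predict(X_test, return_std=True)
p_plus = 1.0 - norm.cdf(0.0, loc=mu, scale=sd)           # P(latent > 0) = p(+1 | x)
```

With a noise-free GPR, the latent mean interpolates the targets, so the predicted probabilities are close to 0 or 1 near densely sampled training points, mirroring the behavior shown in Fig 1.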
1.3 Structure of the paper
In Section 2, we formally introduce our CASIMAC method. We explain in detail the two training steps as well as how predictions of a trained classifier can be obtained. We also show how our latent space mappings can be leveraged for the convenient visualization of inter- or intra-class relationships in the data, especially in the case of a high-dimensional feature space and a moderate number of classes. In Section 3, we apply our method to various data sets and compare its performance and calibration qualities to several well-established benchmark classifiers. In particular, we demonstrate that our approach can be applied to many different application domains because our training data transformation reflects actual distances in the data set and because the respective distance metric as well as the regression model are freely customizable. Section 4 concludes the paper with a summary and an outlook on possible future research. The appendix collects the mathematical and technical background underlying our CASIMAC method. In S1 Appendix, we rigorously prove the results needed for an in-depth and mathematically sound understanding of the method. In S2 Appendix, in turn, we summarize the main features of our implementation, and S3 Appendix summarizes how the hyperparameters of the models in Section 3 were chosen.
2 Calibrated simplex-mapping classification
In this section, we formally introduce our proposed calibrated simplex-mapping classifier, briefly referred to as CASIMAC. Sections 2.1 and 2.2 explain in detail the two training steps outlined in Section 1.1. In Section 2.3, we give the precise definition of our CASIMAC and, moreover, we explain how its predictions ŷ(x) and p̂(y|x) for the class labels and the class label probabilities can be calculated in a computationally favorable manner. In Section 2.4 we finally explain how the latent space representation upon which our classification method relies can also be exploited for visualization purposes. Here and in the following, we consistently denote the training data set and the training datapoints as in (1) and write
(7) y1 = y(x1), …, yD = y(xD)
for the true class labels of the training datapoints x1, …, xD. Also, 𝒳 always denotes the feature space—which is only assumed to be metrizable and, in particular, need not be embedded in any ℝ^m—and 𝒴 denotes the set of class labels. As usual, a set is called metrizable iff there exists at least one metric on it [20]. We see below that we actually only need so-called semimetrics [21] and we exploit this even greater flexibility in our last application example. If 𝒳 is embedded in some ℝ^m, then of course infinitely many metrics exist on 𝒳, namely at least all the metrics induced by the ℓp-norms ‖⋅‖p on ℝ^m for p ∈ [1, ∞) ∪ {∞}. In order to reduce cumbersome double indices, we assume from here on that the class labels are just 1, …, n (instead of the general labels l1, …, ln). In short, we assume that
(8) 𝒴 = {1, …, n}
without loss of generality.
2.1 Training data transformation to a latent space
In the first training step of our method, we transform the feature space training datapoints x1, …, xD by means of a suitably designed training data transformation
(9) f : {x1, …, xD} → ℝ^{n−1}
to a suitable latent space. We choose this latent space to be ℝ^{n−1}, where n ≥ 2 as before is the number of classes observed in the training data 𝒟. We decompose this space into n conically shaped segments C1, …, Cn, which are defined by the vertices p1, …, pn of a regular (n − 1)-dimensional simplex S in ℝ^{n−1} having barycenter 0 and being at unit distance from their barycenter 0, that is,
(10) p1 + ⋯ + pn = 0 (10.a) and ‖p1‖ = ⋯ = ‖pn‖ = 1 (10.b).
See S1 Appendix (Proposition 2). Specifically, we define the segment Ck as the cone that is spanned by the mirrored vertices −pi with i ≠ k, that is,
(11) Ck ≔ {∑i≠k λi (−pi) : λi ≥ 0 for all i ≠ k}.
In view of (10.a), the vertex pk lies on the central ray of Ck, which is why we also refer to pk as the central vector of Ck. Also, the segments C1, …, Cn cover the whole latent space ℝ^{n−1} and any two of these segments overlap only at their boundaries. In short,
(12) C1 ∪ ⋯ ∪ Cn = ℝ^{n−1} (12.a) and Ck° ∩ Cl° = ∅ for k ≠ l (12.b),
where Ck° is the interior of Ck, that is, the set Ck without its boundary [20]. It is given by
(13) Ck° = {∑i≠k λi (−pi) : λi > 0 for all i ≠ k}.
And finally, the segments C1, …, Cn are pairwise congruent. All these statements are proven rigorously in S1 Appendix (Lemma 5 and Propositions 6 and 7). In the special case of just two or three classes, they can also be easily verified graphically. See Fig 2, for instance.
Segmentation of the latent space of a ternary classification problem (n = 3) into three congruent cone segments C1, C2, C3, each associated with one of the three classes as indicated by the respective colors. The vertices p1, p2, p3 of the simplex S (marked by the gray shading) with barycenter 0 lie on the central rays of the respective segments. The borders of the segments are defined by the vertices −p1, −p2, −p3 of the mirrored simplex −S.
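Vertices satisfying the two conditions in (10) can be constructed numerically, for instance by centering and projecting the standard basis of ℝ^n. The following sketch is our own construction (not taken from the paper’s implementation); it also verifies the defining properties: barycenter 0, unit norms, and the constant pairwise inner product −1/(n − 1) of a regular simplex.

```python
import numpy as np

def simplex_vertices(n):
    """Vertices p_1, ..., p_n of a regular (n-1)-simplex in R^{n-1}
    with barycenter 0 and unit distance from the barycenter."""
    C = np.eye(n) - 1.0 / n          # centered standard basis; rows sum to 0
    _, _, Vt = np.linalg.svd(C)      # first n-1 right singular vectors span
    P = C @ Vt[: n - 1].T            # the hyperplane containing the rows
    return P / np.linalg.norm(P, axis=1, keepdims=True)

P = simplex_vertices(4)              # n = 4 classes -> vertices in R^3
print(np.round(P.sum(axis=0), 10))   # barycenter is (numerically) zero
print(np.round(P @ P.T, 3))          # 1 on the diagonal, -1/3 off-diagonal
```

The projection preserves norms and inner products, so after normalization the pairwise inner products equal −1/(n − 1), as required for a regular simplex.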
With the help of the above segmentation of the latent space ℝ^{n−1}, we can define the training data transformation (9). We design this mapping f such that it maps each training datapoint x ∈ {x1, …, xD} to the cone segment Cy(x) corresponding to its class label y(x) in such a way that the location of f(x) in this cone segment reflects the distances of x from its own-class and from its foreign-class neighbors. We choose the location of f(x) based on the following premises:
- f(x) should be located the farther in the direction of py(x) (and thus the farther inside the cone segment Cy(x)), the closer x is to its class-y(x) datapoint neighbors.
- f(x) should be located the farther in the direction of −py (and thus the farther away from the cone segment Cy), the farther x is away from its class-y datapoint neighbors for y ≠ y(x).
Specifically, we define f as follows:
(14) f(x) ≔ a(x) py(x) + ∑y≠y(x) by(x) (−py)
for every training datapoint x ∈ {x1, …, xD}. In particular, the coefficient a(x) indicates how far f(x) is pulled into the own-class segment Cy(x), while the coefficient by(x) indicates how far f(x) is pulled away from the foreign-class segment Cy for y ≠ y(x). We therefore refer to these coefficients as the attraction and the repulsion coefficients of x and we define them, as indicated above, in terms of the mean distance of x from its kα closest datapoint neighbors of its own class y(x) or, respectively, from its kβ closest datapoint neighbors of the foreign class y ≠ y(x). That is,
(15)
where 𝒳y ≔ {xi : y(xi) = y} is the set of all training datapoints belonging to class y and
(16) d̄k(x, S) ≔ (d(x, x(1)) + ⋯ + d(x, x(k)))/k (with x(1), …, x(k) being the k nearest neighbors of x in S)
is the mean distance of x from its k nearest neighbors from the subset S ⊆ {x1, …, xD}, the distance being measured in terms of some arbitrary semimetric [21]
(17) d : 𝒳 × 𝒳 → [0, ∞)
on the feature space 𝒳, that is, d(x, x′) = d(x′, x) for all x, x′ ∈ 𝒳 (symmetry) and d(x, x′) = 0 if and only if x = x′. At least one such semimetric exists on 𝒳 because this space was assumed to be semimetrizable at the beginning of Section 2. Also, α, β ∈ [0, ∞) and kα, kβ ∈ ℕ are user-defined hyperparameters satisfying
(18)
where Dmin denotes the cardinality of the smallest training data class. Conditions (18.b) and (18.c) guarantee that all the attraction and repulsion coefficients are strictly positive finite real numbers, that is,
(19)
for all x ∈ {x1, …, xD} and all y ≠ y(x) (Proposition 10). In most of our applications, d is a proper metric, that is, a semimetric that also satisfies the triangle inequality. In our last application example, though, we make explicit use of semimetrics as well.
Since the domain of f and the attraction and repulsion coefficients occurring in the definition of f obviously depend on the training data set 𝒟, so does the training data transformation f itself, and we sometimes make this dependence explicit by writing
(20) f = f𝒟.
It is straightforward to verify that the training data transformation f from (14), with α ≔ 0, β ≔ 1, kα, kβ ≔ 1, reduces to the mapping from (5) in the special case of just n = 2 classes. It is also easy to verify from (14), using the barycenter condition (10.a) along with (13), (18.a) and (19), that
(21) f(x) ∈ Cy(x)°
for every x ∈ {x1, …, xD} (Proposition 10). In other words, f(x) for every training datapoint x lies in the interior Cy(x)° of the corresponding cone segment. Specifically, the membership of f(x) to Cy(x)° is the clearer, the closer x is to its kα nearest own-class neighbors and the farther x is away from its kβ nearest foreign-class neighbors. Fig 3 illustrates this behavior of the training data transformation f for the simple case of a ternary classification problem (n = 3).
Illustration of the training data transformation from (14) for an exemplary ternary classification problem (n = 3) with a two-dimensional feature space. The colored regions on the left illustrate the true regions of the three classes with their respective labels, whereas the points denote the (noiselessly) sampled training datapoints x1, …, xD. We sample three points from each class (i. e., D = 9) and use different symbols (∘, ◽ and ♢) to uniquely identify each point. As the distance metric d underlying f, we choose the Euclidean distance. The other hyperparameters are chosen as α ≔ 0, β ≔ 1 and kα, kβ ≔ 1, respectively. The figure shows that the farther a training datapoint xi (for i ∈ {1, …, D}) is away from its nearest foreign-class neighbor, the clearer is the membership of f(xi) to the respective cone segment.
We point out that this behavior of f is generic in the sense that it is independent of the number of classes and independent of the specific choices of α, β, kα, kβ and d. In particular, we can freely choose and tune the semimetric d as well as the hyperparameters α, β, kα, kβ within the bounds (18) to the specific classification problem at hand [22]. We make ample use of this customization flexibility for the exemplary classification problems in Section 3.
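The attraction/repulsion mechanism can be sketched in code. Since we do not spell out the exact coefficient formulas of (15) here, the sketch below assumes one plausible variant—attraction a(x) = α divided by the mean distance to the kα nearest own-class neighbors, repulsion by(x) = β times the mean distance to the kβ nearest class-y neighbors—combined along the central vectors as an attract/repel sum. Class labels are assumed to be 0, …, n − 1 so that they can index the vertex array.

```python
import numpy as np

def simplex_vertices(n):
    # Central vectors p_1, ..., p_n (regular simplex, barycenter 0, unit norm).
    C = np.eye(n) - 1.0 / n
    _, _, Vt = np.linalg.svd(C)
    P = C @ Vt[: n - 1].T
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def mean_knn_dist(x, S, k):
    # Mean Euclidean distance of x to its k nearest neighbors in the set S.
    d = np.sort(np.linalg.norm(S - x, axis=1))
    return d[:k].mean()

def transform(X, y, alpha=1.0, beta=1.0, k_a=1, k_b=1):
    # Hypothetical attract/repel variant of the training data transformation:
    # f(x) = a(x) * p_{y(x)} - sum_{l != y(x)} b_l(x) * p_l.
    n = len(np.unique(y))
    P = simplex_vertices(n)
    Z = np.empty((len(X), n - 1))
    for i, (x, yi) in enumerate(zip(X, y)):
        own = X[y == yi]
        own = own[np.linalg.norm(own - x, axis=1) > 0]  # exclude x itself
        z = (alpha / mean_knn_dist(x, own, k_a)) * P[yi]
        for l in range(n):
            if l != yi:
                z -= beta * mean_knn_dist(x, X[y == l], k_b) * P[l]
        Z[i] = z
    return Z, P

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(5, 2)) for c in ((0, 0), (2, 0), (1, 2))])
y = np.repeat([0, 1, 2], 5)
Z, P = transform(X, y)
```

With strictly positive coefficients, every transformed point lies strictly inside the cone segment of its own class, consistent with (21): the nearest central vector of Z[i] is P[y[i]].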
2.2 Training of a regression model based on the transformed data
In the second training step of our method, we train a regression model
(22) ĝ : 𝒳 → ℝ^{n−1}
from the feature space 𝒳 to the latent space ℝ^{n−1}. As the training data set for this regression model, we take the transformed training data set
(23) {(x1, f(x1)), …, (xD, f(xD))},
consisting of the feature space training datapoints x1, …, xD together with their latent space counterparts f(x1), …, f(xD), which are obtained by means of the training data transformation f from the first training step as defined in (14).
In sharp contrast to f, the regression model ĝ is defined on the whole of 𝒳 (instead of only the training datapoints x1, …, xD) and thus yields latent space predictions ĝ(x) for arbitrary feature space points x ∈ 𝒳, especially for hitherto unobserved and unclassified feature space points. Fig 4 illustrates the effect of the regression model for the special case of a ternary classification problem (n = 3).
Illustration of the regression model ĝ for the exemplary ternary classification problem from Fig 3. We choose a GPR model with a Matérn kernel, which is fitted to the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)) that are obtained from the training data transformation f according to Fig 3. In order to illustrate the effect of ĝ, we show how the points of the rectangular grid on the left get transformed by ĝ. In particular, we show this transformation for the corner points marked by stars. The color of any point x on the grid and, respectively, of any point ĝ(x) on the transformed grid indicates the true class of x.
Since both f and the transformed training data set (23) depend on the original training data set 𝒟, so does the regression model ĝ, and we sometimes make this dependence explicit by writing
(24) ĝ = ĝ𝒟.
We point out that—similarly to the choice of the hyperparameters α, β, kα, kβ and d of the training data transformation—the specific choice of the regression model is completely arbitrary.
In particular, we can choose a suitable probabilistic regression model that is able to not only predict a single latent space point ĝ(x) but also an entire probability density p(⋅|x, 𝒟) for every x ∈ 𝒳. Such a so-called predictive a posteriori distribution indicates the distribution of the latent space points for given x and a given data set 𝒟. In particular, the predictive a posteriori distribution p(⋅|x, 𝒟) indicates the level of confidence that the model has in its point prediction ĝ(x), namely it is all the more confident in the point prediction the more p(⋅|x, 𝒟) is concentrated around ĝ(x). A typical example for regression models which provide both a point prediction ĝ(x) and a probability-density prediction p(⋅|x, 𝒟) for every x ∈ 𝒳 are GPR models [15, 16]. As is well-known, for these models the point predictions ĝ(x) are completely determined by the predictive a posteriori distribution p(⋅|x, 𝒟), namely as the maximum a posteriori prediction, that is,
(25) ĝ(x) = argmaxz p(z|x, 𝒟).
We make ample use of GPR models in our benchmark examples from Section 3.
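This second training step can be sketched with a single multi-output GPR as ĝ. The latent targets Z below are synthetic stand-ins for the outputs of the training data transformation f, and sklearn’s GaussianProcessRegressor is just one possible model choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))                  # feature space points
Z = np.column_stack([np.sin(2 * X[:, 0]), X[:, 1] ** 2])  # stand-in latent targets f(x_i)

# Noise-free GPR (tiny default jitter alpha=1e-10) interpolates the targets.
gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gpr.fit(X, Z)                                             # (n-1)-dimensional output
Z_hat = gpr.predict(X)
```

Because the GPR is noise-free, its predictions at the training inputs essentially reproduce the latent targets, which is exactly the perfect-interpolation property used in Section 2.3.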
2.3 Calibrated simplex-mapping classifier
After the two training steps explained above have been performed, we can define and put to use our CASIMAC. We define it as the classifier
(26) ŷ : 𝒳 → 𝒴
that assigns to each feature space point x ∈ 𝒳 the label y of the first cone segment Cy that contains ĝ(x). That is,
(27) ŷ(x) ≔ min{y ∈ 𝒴 : ĝ(x) ∈ Cy},
where ĝ(x) is the regression model’s latent space prediction for x and where
(28) min{y ∈ 𝒴 : z ∈ Cy}
is the label of the first cone segment containing z ∈ ℝ^{n−1}. Since by (12.a) every latent space point z ∈ ℝ^{n−1} is contained in at least one of the segments C1, …, Cn, the expression (27) really yields a well-defined classifier. Since, on the other hand, the segments overlap at their respective borders, a given latent space point z is contained in several segments if it is located exactly on such a border. In this special case, we need to decide for one of the overlapping segments, to obtain a classifier that assigns a single label (instead of multiple labels) to each feature space point. In our definition (27), we decide for the first of the segments containing ĝ(x) for the sake of simplicity, hence the minimum in that formula.
In view of the dependence of the regression model ĝ𝒟 on the training data set 𝒟, our CASIMAC depends on 𝒟 as well and, to emphasize this, we sometimes write
(29) ŷ = ŷ𝒟.
According to the definition (27), in order to practically compute the class label prediction ŷ(x) for a given feature space point x ∈ 𝒳, we have to determine all cone segments that contain ĝ(x). Yet, in view of the purely geometric definition (11) of the cone segments, it is not so clear at first glance how this can be done in a computationally feasible and favorable manner—especially in the case of many classes (n ≥ 5) where the segments cannot be (directly) visualized anymore. We find, however, that the segments containing ĝ(x) can be determined solely in terms of the distances
(30) ‖ĝ(x) − p1‖, …, ‖ĝ(x) − pn‖
of ĝ(x) from the central vectors p1, …, pn of the segments C1, …, Cn. Specifically, we prove in S1 Appendix (Theorem 8) that a latent space point z ∈ ℝ^{n−1} belongs to the segment Cy if and only if
(31) ‖z − py‖ ≤ ‖z − pl‖ for all l ∈ 𝒴,
that is, if and only if its distance from the other segments’ central vectors pl, l ≠ y, is at least as large as its distance from the central vector py of Cy. Consequently, the geometrically inspired definition (27) of our CASIMAC can be recast in the form
(32) ŷ(x) = min{y ∈ 𝒴 : ‖ĝ(x) − py‖ ≤ ‖ĝ(x) − pl‖ for all l ∈ 𝒴}
(Corollary 11). Computationally, (32) is much easier to evaluate than (27) since ŷ(x) can be determined simply by calculating all distances (30) and by then choosing the label of the smallest one. In our CASIMAC implementation, we therefore use (32) to perform predictions. We illustrate the method in Fig 5.
Illustration of our proposed calibrated simplex-mapping classifier for the exemplary ternary classification problem from Fig 3 based on the GPR model ĝ from Fig 4. According to our definition (27), the classifier determines for each feature space point x (on the left) the cone segment containing the latent space counterpart ĝ(x) (on the right) and then takes the label of this cone segment as the class label prediction ŷ(x) for x. In other words, the cuts of the cone segment borders through the latent space determine the class membership of each data point x according to its learned latent space position ĝ(x). We mark the predicted class boundaries (on the left) as well as the corresponding cone segment boundaries (on the right) by dashed lines. As can be seen, the predicted class boundaries in the feature space do not deviate much from the true class boundaries indicated by the different background and grid colors on the left. In other words, our classifier produces only a few misclassifications. This can also be seen from the fact that the color of most of the transformed grid points in the latent space on the right (indicating the true class) is identical to the underlying background color (indicating the predicted class).
If the regression model perfectly fits the training data (23) in the sense that
(33) ĝ(xi) = f(xi) for all i ∈ {1, …, D},
then it is easy to see from (27), using (12.b) and (21), that
(34) ŷ(xi) = yi for all i ∈ {1, …, D}
(Corollary 11). In other words, if the regression model is perfectly interpolating in the sense of (33), our CASIMAC perfectly reconstructs the true class labels of the training datapoints. A typical example of perfectly interpolating regression models are noise-free GPR models [15, 16].
If the regression model apart from its point predictions also provides probabilistic predictions, we can estimate the class probabilities for arbitrary feature space points x. In that case, there is a latent probability density p(⋅|x, 𝒟) on ℝ^{n−1} for every x ∈ 𝒳 which on the one hand determines the point prediction ĝ(x) in some way—for instance, as the maximum a posteriori estimate (25)—and which on the other hand also determines the uncertainty of this prediction. Since the regression model ĝ is our model for observations of latent-space points, the associated probabilities p(z|x, 𝒟) can be considered as estimates for the probability of observing the latent-space point z for a given x ∈ 𝒳. Since, moreover, the class prediction for x is y precisely when ĝ(x) falls into the segment Cy by the definition (27) of our classifier, the expression
(35) p̂(y|x) ≔ ∫ 1Cy(z) p(z|x, 𝒟) dz
is an estimate for the probability of observing class y for x. We therefore take this estimate as our classifier’s prediction for the probability of class y given x. In the above equation, 1Cy denotes the indicator function
(36) 1A(z) ≔ 1 for z ∈ A and 1A(z) ≔ 0 for z ∉ A
with respect to the set A ≔ Cy. In view of (12.b) and (28), the segment borders carry no probability mass for a latent density, and thus
(37) p̂(y|x) = ∫Cy p(z|x, 𝒟) dz = ∫Cy° p(z|x, 𝒟) dz
are alternative expressions for our classifier’s class probability prediction (Corollary 12).
If the probability densities p(⋅|x, 𝒟) belong to a normal distribution and the number of classes is n = 2, then the class probability predictions (37) can be computed by means of a simple analytical formula, namely (118) in S1 Appendix (Proposition 13). If, by contrast, the probability densities p(⋅|x, 𝒟) do not belong to a normal distribution or the number of classes is n > 2, then the multi-dimensional integrals in (37) can still be computed approximately. In our CASIMAC implementation, we use Monte Carlo sampling for these approximate computations of the class probability predictions (37) for n > 2. See (119) in S1 Appendix (Proposition 13). Standard examples of regression models ĝ with normally distributed probability densities p(⋅|x, 𝒟) are provided by GPR models. In particular, the class probabilities for the binary classification problem from Fig 1 were evaluated analytically.
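The Monte Carlo approximation of (37) for n > 2 can be sketched as follows, assuming a normal predictive density with mean mu and covariance cov in the latent space; each sample’s segment is found via the nearest-central-vector criterion (31):

```python
import numpy as np

def simplex_vertices(n):
    # Central vectors p_1, ..., p_n (regular simplex, barycenter 0, unit norm).
    C = np.eye(n) - 1.0 / n
    _, _, Vt = np.linalg.svd(C)
    P = C @ Vt[: n - 1].T
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def class_probabilities(mu, cov, P, n_samples=20_000, seed=0):
    # Draw latent samples and count how often each segment "wins", i.e. how
    # often its central vector p_y is the nearest one.
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mu, cov, size=n_samples)
    nearest = np.argmin(np.linalg.norm(Z[:, None, :] - P[None, :, :], axis=2), axis=1)
    return np.bincount(nearest, minlength=len(P)) / n_samples

P = simplex_vertices(3)
probs = class_probabilities(mu=2.0 * P[1], cov=0.05 * np.eye(2), P=P)
```

For a predictive mean deep inside one cone segment and a small covariance, nearly all of the probability mass lands in that segment, so the corresponding class probability is close to one.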
2.4 Visualization
Apart from its central use in the definition of our classifier, the latent space representation given by f and ĝ can also be beneficially used for visualization purposes.
In particular, the training data transformation f can be exploited for visually detecting inter- or intra-class relationships in the data. Inspecting the latent space representation from Fig 3, for instance, we find that the transformed datapoints of classes 1 and 3 are farther apart from each other than they both are from the transformed datapoints of class 2. We could have directly seen that, of course, by inspecting the untransformed datapoints in feature space, which is only 2-dimensional here. After all, class 2 lies between classes 1 and 3, as the left panel of Fig 3 reveals. Such a direct inspection of the data in feature space, however, is possible only for feature spaces with small dimensions. In case the feature space is high-dimensional or not even embedded in an ℝ^m, though, we can still use its latent space representation to uncover relations between the datapoints. All we need for that is a latent space of moderate dimension n − 1, ideally n ≤ 4.
Since our latent space is an unbounded set by definition, the latent space counterparts f(x1), …, f(xD) of the datapoints can be very far apart from each other. It can therefore be useful to further transform the latent space—and with it, the points f(x1), …, f(xD)—to a fixed bounded reference set. A natural way of achieving this is to compress the latent space
to the interior of the reference simplex
underlying our CASIMAC. Indeed, there is a bijective compression map
(38)
which is diffeomorphic (infinitely differentiable with infinitely differentiable inverse) and which leaves the cone segments
invariant in the sense that, for every
and every
,
(39)
See S1 Appendix (Proposition 15). We use this compression map for the visualization of one of our real-world data sets below.
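The exact form of the compression map (38) is given in S1 Appendix. As a simpler analogue with the same ray-invariance property, one can consider a radial squashing map onto the open unit ball (numpy sketch; note that the actual map (38) targets the open reference simplex instead of a ball):

```python
import numpy as np

def compress(v):
    """Radial squashing map from R^d onto the open unit ball.

    Like the paper's compression map (38), it is a diffeomorphism and
    leaves every ray from the origin invariant, so membership in a
    cone-shaped segment is preserved under the compression.
    """
    v = np.asarray(v, dtype=float)
    return v / (1.0 + np.linalg.norm(v))

def decompress(w):
    """Inverse of compress, defined on the open unit ball."""
    w = np.asarray(w, dtype=float)
    return w / (1.0 - np.linalg.norm(w))
```

Since compress only rescales the radius, a point and its image always lie on the same ray from the origin, mirroring the invariance property (39).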
3 Benchmark
In this section, we apply our CASIMAC method to various data sets and compare its performance to several well-established benchmark classifiers. Specifically, we perform benchmarks using synthetic data in Section 3.1 and real-world data in Section 3.2. Furthermore, we exemplarily illustrate the latent space visualization in Section 3.3. In these first three subsections, the regression model underlying our classifier is a GPR. In Section 3.4, by contrast, we use a neural network as our regression model, thereby demonstrating how a more complex regression model allows us to solve a classification task with more training data. For all calculations, we use the implementation of our method [23], which is briefly outlined in S2 Appendix.
3.1 Synthetic data
As a first benchmark, we test our method on a synthetically generated data set. Specifically, we consider the synthetic classification problem with n = 4 classes and the 2-dimensional feature space, which is based on the explicit class membership rule
(40)
for features
. The problem is illustrated in Fig 6. In order to obtain our training data set (1), we uniformly sample D = 40 points from
and associate them with their true class labels according to (40). In particular, the synthetic data is free of noise. Analogously, we obtain the test data set
(41)
by uniformly sampling T = 10000 points from
and by associating them with their true class labels (40). Here and in the following, we always standardize the features based on the training data before feeding them to the considered classifiers.
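The standardization step can be sketched as follows (numpy; the scaling statistics are computed from the training data only and then applied unchanged to the test data):

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize features using statistics of the training set only,
    as done before feeding the data to the classifiers."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0.0] = 1.0            # guard against constant features
    return (X_train - mu) / sd, (X_test - mu) / sd
```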
Synthetic data visualization: illustration of the ground truth (40) for the synthetic data class labels.
As the distance metric d in the training data transformation (14) underlying our CASIMAC, we choose the Euclidean distance, that is,
(42)
for all
. Also, we parameterize the weights α, β for the attraction and repulsion coefficients (15) by a single parameter γ ∈ [0, 1], namely
(43)
so that α, β ∈ [0, 1]. And as the regression model (22) underlying our CASIMAC, we choose a GPR with a combination of a Matérn and a white-noise kernel [15]. We compare our classifier to a (one-versus-rest) GPC with the same type of kernel. The hyperparameters of both classifiers are tuned by cross-validating the training data over a pre-defined set of different setups as outlined in S3 Appendix. We made use of scikit-learn [24] to realize the standard models.
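Such a regression model can be set up directly in scikit-learn. A minimal sketch, in which the kernel hyperparameter values are placeholders rather than the cross-validated setups of S3 Appendix, and a single 1-dimensional target stands in for the (n − 1)-dimensional latent targets:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Matérn plus white-noise kernel, as used for the CASIMAC benchmarks;
# length_scale, nu, and noise_level here are illustrative placeholders.
kernel = Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

# Toy training data standing in for one latent-space coordinate.
X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]
gpr.fit(X, y)

# Probabilistic prediction: mean and standard deviation per point.
mean, std = gpr.predict(X, return_std=True)
```

In the actual method, one such probabilistic regression is fitted per latent dimension, and the predicted normal distribution feeds the class probability computation (37).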
In total, we perform 10 classification tasks, each with different test and training data sets. For each, we use the test data set to determine the accuracy (fraction of correctly predicted points) and the log-loss (that is, the logistic regression loss or cross-entropy loss). In addition to that, we calculate the proba-loss, which we define as the mean predicted probability error
(44)
where
is our CASIMAC’s or the GPC’s class probability prediction, respectively. Clearly, δp ∈ [0, 1] with 0 being the best possible outcome and 1 the worst. We can compute this score only because we know the true class membership rule (40) underlying the problem.
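For reference, the accuracy and log-loss scores used here can be computed as follows (plain numpy sketch; class labels are assumed to be encoded as indices 0, …, n − 1):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly predicted points."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def log_loss(y_true, proba, eps=1e-15):
    """Cross-entropy loss; proba[i, k] is the predicted probability of
    class k for sample i, y_true holds true class indices."""
    p = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    return float(-np.mean(np.log(p[np.arange(len(y_true)), y_true])))
```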
The results are shown in Table 1. Clearly, our method has a better proba-loss and log-loss than the GPC, whereas the latter has a slightly better accuracy.
Test scores for the synthetic data set based on T = 10000 uniformly sampled test datapoints. The proba-loss is defined in (44). We show the means and the corresponding standard deviations (in brackets) over all 10 classification tasks. The best mean results are highlighted in bold.
We also show an exemplary visualization of the predicted class probabilities in Fig 7 for a single training data set. The background colors represent a weighted average of the class colors from Fig 6 with a weight corresponding to the predicted probability of the respective class. So, for a perfectly calibrated classifier, the colors in Fig 7 would be the same as the colors in Fig 6. Although our CASIMAC has a slightly lower training accuracy than the GPC (0.925 as opposed to 1.000), it clearly exhibits brighter, more distinguishable colors. Consequently, the predicted probabilities are less uniform and correspond to a clearer decision for one of the four classes instead of an uncertain mixture. This observation corresponds to the lower proba-loss and log-loss results in Table 1.
(a) CASIMAC class probabilities. (b) GPC class probabilities. Class probability predictions of CASIMAC and of GPC for the synthetic data set. The color of each background point corresponds to the weighted average of the class colors from Fig 6 with a weight corresponding to the predicted probability of the respective class at this point. Thus, clear colors as in (a) represent high probabilities for a single class, whereas washed-out colors as in (b) represent almost uniform probabilities. We also show the training data set (consisting of D = 40 points) on which the classifiers have been trained. The color of the points corresponds to their true class. While most of the training datapoints are correctly classified (shown as ∘), our CASIMAC incorrectly predicts three training datapoints (shown as ♢) close to the class borders.
3.2 Real-world data
Following the benchmark on synthetic data, we continue with a benchmark on five real-world data sets from different fields of application. Specifically, we consider the data sets from Table 2. All of these data sets are publicly available online.
Overview of the five real-world data sets and their basic characteristics: n is the number of classes, m is the number of features or, in other words, the dimension of the feature space , I is the total number of datapoints, and D is the number of training datapoints. The number T of test datapoints is just T ≔ I − D. We abbreviate the references as a) [25], b) [26], c) [27], d) [28], e) [29], f) [30], g) [2], and h) [31]. Also, we turned the originally multi-class data set pine into a binary problem as described in [2].
As the distance metric d in the training data transformation (14) underlying our CASIMAC, we choose the Euclidean distance (42) for the data sets alcohol, climate, hiv, and pine, whereas for the remaining data set wifi we choose the taxicab distance
defined by
(45)
because it leads to a better performance there. Additionally, we parameterize the weights α, β of the attraction and repulsion coefficients (15) as in (43) for all our data sets. And as the regression model (22) underlying our CASIMAC, we again choose a GPR with a combination of a Matérn and a white-noise kernel for all our data sets. We compare our CASIMAC on each data set to three other classifiers, namely to a GPC as before and, in addition to that, to an artificial neural network with fully-connected layers (MLP) and to a k-nearest neighbor classifier (kNN). This choice of classifiers was dictated by their reported good calibration properties [2]. Again, we tune the hyperparameters of the classifiers by cross-validation as outlined in S3 Appendix. As in the previous section, we made use of scikit-learn [24] to realize the standard models.
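The two distance metrics (42) and (45) are straightforward to implement, for example:

```python
import numpy as np

def euclidean(x, xp):
    """Distance (42): d(x, x') = ||x - x'||_2."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(xp)))

def taxicab(x, xp):
    """Distance (45): d(x, x') = sum_i |x_i - x'_i|."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(xp))))
```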
The test-training split of the data is performed by means of a stratified random sampling with respect to the class labels. Average test scores over 10 classification tasks with different training data sets are reported in Table 3. Note that for multi-class data sets, the f1 score is calculated as the weighted arithmetic mean over harmonic means [32], where the weight is determined by the number of true instances for each class. Analogously, we calculate the precision score (ratio of true positives to the sum of true positives and false positives) and recall score (ratio of true positives to the sum of true positives and false negatives) as weighted averages over all classes.
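These weighted averages correspond to scikit-learn's `average='weighted'` option; a small sketch with hypothetical labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# 'weighted' averages the per-class scores with weights given by the
# number of true instances of each class, as described in the text.
f1 = f1_score(y_true, y_pred, average="weighted")
prec = precision_score(y_true, y_pred, average="weighted")
rec = recall_score(y_true, y_pred, average="weighted")
```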
Test scores for the benchmarked classifiers on the five real-world data sets from Table 2. We show the means and the corresponding standard deviations (in brackets) over all 10 classification tasks. The best mean results are highlighted in bold.
We find from Table 3 that our CASIMAC exhibits a comparatively good overall score on all data sets considered. In particular, its log-loss is better in all cases than that of the GPC. It is also better than the log-loss of all other classifiers except for the wifi data set on which the MLP is superior. Similarly, the accuracy of our approach is also better than that of the other candidates except for the climate data set on which GPC and kNN perform slightly better. Taking the uncertainties of the results into account, it turns out that in most cases the scores fall within the range of a single standard deviation of each other. In particular the log-loss, however, shows the most discrepancies between the classifiers.
For binary classification problems, it is also interesting to analyze the calibration curves (or reliability diagrams) in addition to the scores [2, 33]. For this purpose, we predict the probability of class 1 for all test samples and discretize the results into ten bins. For each bin, we plot the true fraction of class 1 against the arithmetic mean of the predicted probabilities. For a perfectly calibrated classifier, such a curve corresponds to a diagonal line and deviations from this line can therefore be understood as miscalibrations. As an example, we consider the pine data set and show a typical calibration curve in Fig 8.
Calibration curves for the benchmarked classifiers on the pine data set. The closer the curves are to the diagonal reference line, the better the calibration of the respective classifier. In particular, our method exhibits the best calibration properties.
Clearly, the curve of our CASIMAC is closer to the diagonal reference line than the curves of the other classifiers. In particular, the GPC curve takes on a sigmoidal shape with major deviations from the diagonal at the beginning and the end. In order to obtain a quantitative measure for the observed miscalibrations, we calculate the corresponding area-deviation, that is, the area between each curve and the diagonal reference line. In case of an optimal calibration, this area vanishes, otherwise it is positive. The results are listed in Table 4 and allow us to rank the classifiers by calibration quality: CASIMAC gives by far the best calibration result, followed by kNN and MLP, and with GPC at the very end.
Calibration score measured in terms of the area-deviation (area between each curve and the diagonal reference line) for the calibration curves from Fig 8, which refer to the pine data set. The best result is highlighted in bold.
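The binning and area computation described above can be implemented, for instance, as follows (numpy sketch; the exact bin boundaries and quadrature used for the area-deviation are not specified in the text, so this is one natural reading):

```python
import numpy as np

def calibration_curve(y_true, p1, n_bins=10):
    """Discretize predicted class-1 probabilities into equal-width bins
    and return the mean predicted probability and the true class-1
    fraction per non-empty bin (inputs as numpy arrays)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p1, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_true = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(p1[mask].mean())
            frac_true.append(np.mean(y_true[mask]))
    return np.array(mean_pred), np.array(frac_true)

def area_deviation(mean_pred, frac_true):
    """Area between the calibration curve and the diagonal reference
    line, estimated with the trapezoidal rule; 0 means perfectly
    calibrated, larger values mean stronger miscalibration."""
    order = np.argsort(mean_pred)
    x = mean_pred[order]
    d = np.abs(frac_true[order] - x)
    return float(np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(x)))
```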
In summary, our benchmark on different real-world data sets shows that our proposed CASIMAC can compete with other well-established classifiers and exhibits comparably good calibration properties.
3.3 Visualization
In order to illustrate the use of our compressed latent space representation for visualization purposes (Section 2.4), we consider a simplified version of the data set alcohol from Table 2 and we refer to this simplified version as alcohol-3. It is obtained from alcohol by merging the last three classes into one and, thus, it has only three instead of five classes (that is, n = 3 and m = 10).
We train an exemplary CASIMAC using D = 25 points from alcohol-3 as training datapoints, while using the remaining T = 100 points as test datapoints. As the distance metric d underlying the training data transformation f of our CASIMAC, we choose the Euclidean distance, and for the hyperparameters α, β, kα, kβ of f we choose
As the regression model
underlying our CASIMAC, in turn, we choose the same GPR as in our previous benchmarks, see S3 Appendix.
In Fig 9, we show the reference simplex with the segmentation induced by the segmentation (12) of
. So, the differently colored segments in Fig 9 are nothing but the sets
and by (39), in turn, these simplex segments are precisely the images of the cone segments
under the compression map:
for
.
(a) Compressed latent space representation of the training data. (b) Compressed latent space representation of the test data. Images of (a) the training datapoints under the transformation C ∘ f and of (b) the test datapoints under the transformation . See (46) and (47), respectively. The color of each point represents its true class, while the form of the markers indicates whether a point is correctly classified (∘) or not (♢) by our CASIMAC. All training datapoints are correctly classified by our construction of f and C, see (21) and (39). The scales on the three simplex edges indicate the barycentric coordinates of the simplex points (Lemma 14).
Specifically, Fig 9a displays the compressed latent space representation of the training datapoints x1, …, xD, that is, the points
(46)
where C is the compression map from (38). In view of (21) and (39), any one of these transformed datapoints w1, …, wD belongs to the simplex segment corresponding to its true class label. Also, the distances between points wi of neighboring classes can indicate inter-class relationships in the original feature space data set. We see from Fig 9a, for instance, that the points wi belonging to class 1 or 2, respectively, have a larger distance from each other than from the points of class 3. This leads us to the conclusion that the original classes 1 and 2 are also more clearly separated from each other than classes 1 and 3 or classes 2 and 3 are.
In turn, Fig 9b displays the compressed latent space representation of the test datapoints xD+1, …, xD+T, that is, the points
(47)
Since the regression model is fitted to the training datapoints (x1, f(x1)), …, (xD, f(xD)) only, a test point wi from (47) can lie in the simplex segment corresponding to its true class label but it can also lie in a simplex segment corresponding to a false class label. In the latter case, our CASIMAC leads to an incorrect class label prediction and the degree of misclassification can be gathered from the degree of misplacement of wi. As we can see from Fig 9b, there are misclassifications between members of class 1 and 3 and especially between members of class 2 and 3, but not between points of the classes 1 and 2, as one would expect from our previous observation about inter-class distances in the training data set. We list the detailed classification results of the test data set in the form of a (transposed) confusion matrix in Table 5. As expected, most misclassifications happen between the classes 2 and 3.
Confusion matrix (error matrix): the entry in row k and column l is the number of test datapoints which belong to class k and for which our CASIMAC predicts the class label l. Correct classifications (on the diagonal) are highlighted in bold. There are no misclassifications between members of the classes 1 and 2, as can be expected from Fig 9a.
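The row/column convention of such a confusion matrix can be reproduced with a few lines of numpy (illustrative sketch with hypothetical labels):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry (k, l) counts the test datapoints of true class k for
    which class l is predicted, matching the convention of Table 5
    (class labels encoded as indices 0, ..., n_classes - 1)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```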
3.4 Towards deep learning
Finally, as a proof of concept for a classification task with a larger training data set, we consider the fashion-mnist data set. It consists of 28×28 pixel images of fashion articles in 8-bit grayscale format, as described in [34]. In total, there are D = 60000 training images and T = 10000 test images, which are assigned to n = 10 classes. We perform a min-max normalization of the data before we feed it to our classifier, so that the individual features lie within the range [0, 1].
Concerning the training data transformation f underlying our CASIMAC, we use two approaches. In the first approach, we make the same ansatz (14) for the training data transformation f as in all previous examples, that is,
(48)
As the distance metric d, we choose the Euclidean distance, and we choose the hyperparameters α, β, kα, kβ to be
(49)
In the second approach, we make a mixture ansatz for the training data transformation, namely, we take f to be the average of three training data transformations of the form (14). In short,
(50)
and for all three components f1, f2, f3 we choose the hyperparameters α, β, kα, kβ as in (49). As the distance metric d1 for f1, we again choose the Euclidean distance, but the distance metrics d2 and d3 for f2, f3 we choose in a problem-specific way, namely as the similarity metrics defined by
In these definitions, sw(x, x′) ∈ [−1, 1] is the structural similarity index for two images x, x′ with sliding window size w [35, 36] and we use the implementation from [37]. In particular, d2 and d3 are valid semimetrics because sw is symmetric and because sw(x, x′) = 1 if and only if x = x′, as pointed out on page 106 of [36].
Since our first approach (48) with its purely Euclidean distance metric does not take into account the structural properties of our image data, it can be considered naive. In contrast, our second approach (50) is informed because it brings to bear the fact that the data consists of images, between which a structural relationship can be established. A general overview of such informed machine learning techniques can be found, for instance, in [38].
As the regression model underlying our CASIMAC, we take a fully connected neural network, both in the naive and in the informed approach. The network contains a first hidden layer with 100 neurons and a sigmoid activation function and a second hidden layer with 18 neurons and a linear activation function. The output of the second layer is interpreted as the mean and standard deviation of a normal distribution. We optimize the log-likelihood of this distribution with the Adam optimizer to determine the best model parameters. In total, there are 80318 trainable parameters. This model is realized with the help of TensorFlow Probability [39]. Since we predict a distribution
for each input point x, we can calculate class label probabilities according to (35).
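Up to constants, optimizing the log-likelihood of the predicted normal distribution amounts to minimizing the per-output Gaussian negative log-likelihood; a numpy sketch of this loss (the network architecture itself is omitted here):

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Mean negative log-likelihood of targets y under the normal
    distribution N(mean, std^2) predicted by the network; minimizing
    this loss (e.g. with Adam) fits both the mean and the predicted
    uncertainty jointly."""
    var = std ** 2
    return float(np.mean(0.5 * np.log(2.0 * np.pi * var)
                         + 0.5 * (y - mean) ** 2 / var))
```

A model that predicts the target exactly with unit standard deviation attains the irreducible value 0.5·log(2π); over- or under-confident standard deviations increase the loss.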
In Table 6, we summarize the classification results for our CASIMAC based on the naive training data transformation (48) and our CASIMAC based on the informed training data transformation (50). Specifically, we show the top-1 to top-5 accuracy scores, which are based on the probability prediction of the classifier. It turns out that our informed approach is slightly better or equal to the naive approach for all accuracies. A list of benchmarked accuracies for other classifiers can be found in [34], for instance.
Top-1 to top-5 accuracy of our naive and of our informed CASIMAC on the fashion-mnist data set. In the naive approach we use a purely Euclidean distance metric between the images, whereas the informed approach also takes the structural image similarity into account. The best scores are highlighted in bold.
In contrast to the approach from [7], we use the neural network to directly perform the latent space mapping instead of linking the network to a series of GPs. Additionally, our network directly predicts the estimated mean and variance of the latent space mapping. It would, however, be a promising approach to further improve our results by incorporating a more advanced form of feature extraction like the one from [7].
4 Conclusions and outlook
In this paper, we have introduced a novel classifier called CASIMAC for multi-class classification in arbitrary semimetrizable feature spaces. It is based on the idea of transforming the classification problem into a regression problem. We achieve this by mapping the training data onto a latent space with a simplex-like geometry and subsequently fitting a regression model to the transformed training data. With the help of this regression model, the predictions of our classifier for the class labels and for the class label probabilities can be obtained in a conceptually and computationally simple manner. We have described in detail how our proposed method works and have demonstrated that it can be successfully applied to real-world data sets from various application domains. In particular, we see three major benefits of our approach.
First, it is generic and flexible in the sense that the choice of the particular distance semimetric for our training data transformation and the choice of the regression model underlying our classifier are completely arbitrary. In particular, this capability allows for non-numeric features. Moreover, it enables the integration of additional expert knowledge in the chosen distance metric [38]. For instance, to classify molecules a distance measure reflecting stoichiometry and configuration variations could be applied [40]. Another possible strategy would be to infer the distance metric from the data itself, possibly based on certain informed assumptions [41]. Similarly, expert knowledge could be brought to bear in the training of the regression model, for example, in the form of shape constraints [42–44].
Second, the intuitive latent space representation with its simple geometric concept has a direct interpretation. In particular, this can be exploited to visually detect inter-class relationships and is especially useful for classification problems with a large feature space dimension and a small number of classes.
Third, as our benchmarks have shown, our method leads to classifiers with comparably good prediction and calibration qualities. To determine class probabilities, we only require a regression model with probabilistic predictions. Also, the effort of computing these class probability predictions is quite low, especially compared to the computational effort necessary for GPC. In particular, no complicated approximations are required in our approach and, for a binary classification problem and a regression model with a normally distributed probabilistic prediction, there even exists a closed-form expression for the class probability predictions.
A challenge of our method is that its training requires the calculation of nearest datapoint neighbors, which is computationally expensive for larger data sets [45]. It would be a natural starting point for further studies to investigate how this computational limitation in the training of our classifiers can be overcome. A related challenge of our method is that the tuning of hyperparameters can be costly. Instead of using a cross-validated grid search like we did in this paper, it could be advantageous to consider more elaborate strategies, for example, to infer the hyperparameters from the statistical properties of the training data. We leave this topic as an open question for future research.
Supporting information
S1 Appendix. A: Segmentation of latent space and core properties of calibrated simplex-mapping classifiers.
Mathematical background of the method including detailed proofs.
https://doi.org/10.1371/journal.pone.0279876.s001
(PDF)
S2 Appendix. B: Implementation.
Description of the Python implementation of the proposed method.
https://doi.org/10.1371/journal.pone.0279876.s002
(PDF)
S3 Appendix. C: Benchmark.
Description of the hyperparameters for the numerical benchmarks.
https://doi.org/10.1371/journal.pone.0279876.s003
(PDF)
Acknowledgments
We would like to thank Janis Keuper and Jürgen Franke for their helpful and constructive comments.
References
- 1. Challis E, Hurley P, Serra L, Bozzali M, Oliver S, Cercignani M. Gaussian process classification of Alzheimer’s disease and mild cognitive impairment from resting-state fMRI. NeuroImage. 2015;112:232–243. pmid:25731993
- 2.
Niculescu-Mizil A, Caruana R. Predicting Good Probabilities with Supervised Learning. In: International Conference on Machine Learning. ICML’05. New York, NY, USA: ACM; 2005. p. 625–632.
- 3.
Platt JC. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In: Advances in Large Margin Classifiers. MIT Press; 1999. p. 61–74.
- 4.
Zadrozny B, Elkan C. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In: International Conference on Knowledge Discovery and Data Mining. KDD’02. New York, NY, USA: ACM; 2002. p. 694–699.
- 5.
Gebel M. Multivariate calibration of classification scores into the probability space. PhD thesis, University of Dortmund; 2009.
- 6.
Calandra R, Peters J, Rasmussen CE, Deisenroth MP. Manifold Gaussian processes for regression. In: International Joint Conference on Neural Networks. IJCNN’16. IEEE; 2016. p. 3338–3345.
- 7. Wilson AG, Hu Z, Salakhutdinov RR, Xing EP. Stochastic variational deep kernel learning. In: Advances in Neural Information Processing Systems; 2016. p. 2586–2594.
- 8. Al-Shedivat M, Wilson AG, Saatchi Y, Hu Z, Xing EP. Learning scalable deep kernels with recurrent structure. The Journal of Machine Learning Research. 2017;18(1):2850–2886. pmid:30662374
- 9.
Bradshaw J, Matthews AG, Ghahramani Z. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks. preprint arXiv:170702476. 2017.
- 10.
Daskalakis C, Dellaportas P, Panos A. Scalable Gaussian Processes, with Guarantees: Kernel Approximations and Deep Feature Extraction. preprint arXiv:200401584. 2020.
- 11.
Iwata T, Ghahramani Z. Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes. preprint arXiv:170705922. 2017.
- 12.
Cremanns K, Roos D. Deep Gaussian Covariance Network. preprint arXiv:171006202. 2017.
- 13.
Liu H, Ong YS, Shen X, Cai J. When Gaussian Process Meets Big Data: A Review of Scalable GPs. preprint arXiv:180701065. 2018.
- 14. Ludl PO, Heese R, Höller J, Asprion N, Bortz M. Using machine learning models to explore the solution space of large nonlinear systems underlying flowsheet simulations with constraints. Frontiers of Chemical Science and Engineering. 2021.
- 15.
Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. Adaptative computation and machine learning series. University Press Group Limited; 2006.
- 16.
Murphy KP. Machine learning: a probabilistic perspective. Adaptive computation and machine learning series. MIT Press; 2012.
- 17.
Tran D, Ranganath R, Blei DM. The Variational Gaussian Process. preprint arXiv:151106499. 2015.
- 18.
Hensman J, Matthews AG, Filippone M, Ghahramani Z. MCMC for Variationally Sparse Gaussian Processes. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, editors. Advances in Neural Information Processing Systems 28. Curran Associates, Inc.; 2015. p. 1648–1656.
- 19.
GPy. GPy: A Gaussian process framework in Python; 2012. Available online: http://github.com/SheffieldML/GPy.
- 20.
Dugundji J. Topology. Allyn and Bacon; 1966.
- 21. Wilson WA. On semi-metric spaces. American Journal of Mathematics. 1931;53:361–373.
- 22.
Deza MM, Deza E. Encyclopedia of Distances. Springer, Berlin; 2016.
- 23.
Heese R. CASIMAC: Calibrated simplex mapping classifier in Python; 2022. Available online: https://github.com/raoulheese/casimac.
- 24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
- 25.
Dua D, Graff C. UCI Machine Learning Repository; 2017. Available online: http://archive.ics.uci.edu/ml.
- 26. Adak MF, Lieberzeit P, Jarujamrus P, Yumusak N. Classification of alcohols obtained by QCM sensors with different characteristics using ABC based neural network. Engineering Science and Technology, an International Journal. 2019.
- 27. Lucas DD, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D, et al. Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development. 2013;6(4):1157–1171.
- 28. Rögnvaldsson T, You L, Garwicz D. State of the art prediction of HIV-1 protease cleavage sites. Bioinformatics. 2014;31(8):1204–1210.
- 29.
Baumgardner MF, Biehl LL, Landgrebe DA. 220 Band AVIRIS Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3; 2015. Available online: https://purr.purdue.edu/publications/1947/1.
- 30.
Graña M, Veganzons MA, Ayerdi B. Hyperspectral Remote Sensing Scenes; 2014. Available online: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.
- 31.
Rohra JG, Perumal B, Narayanan SJ, Thakur P, Bhatt RB. User Localization in an Indoor Environment Using Fuzzy Hybrid of Particle Swarm Optimization & Gravitational Search Algorithm with Neural Networks. In: International Conference on Soft Computing for Problem Solving. SocProS’16; 2016. p. 286–295.
- 32.
Opitz J, Burst S. Macro F1 and Macro F1. preprint arXiv:191103347. 2019.
- 33. DeGroot MH, Fienberg SE. The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society Series D (The Statistician). 1983;32(1/2):12–22.
- 34.
Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. preprint arXiv:170807747. 2017.
- 35. Zhou Wang, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. 2004;13(4):600–612.
- 36. Wang Z, Bovik AC. Mean squared error: Love it or leave it? A new look at Signal Fidelity Measures. IEEE Signal Processing Magazine. 2009;26(1):98–117.
- 37. Van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, et al. scikit-image: image processing in Python. PeerJ. 2014;2:e453. pmid:25024921
- 38. Rueden Lv, Mayer S, Beckh K, Georgiev B, Giesselbach S, Heese R, et al. Informed Machine Learning—A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems. IEEE Transactions on Knowledge and Data Engineering. 2021; p. 1.
- 39.
Dillon JV, Langmore I, Tran D, Brevdo E, Vasudevan S, Moore D, et al. TensorFlow Distributions. preprint arXiv:171110604. 2017;abs/1711.10604.
- 40. Rupp M, Tkatchenko A, Müller KR, Von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Physical review letters. 2012;108(5):058301. pmid:22400967
- 41.
Bellet A, Habrard A, Sebban M. A Survey on Metric Learning for Feature Vectors and Structured Data. preprint arXiv:13066709. 2013.
- 42. Kurnatowski Mv, Schmid J, Link P, Zache R, Morand L, Kraft T, et al. Compensating data shortages in manufacturing with monotonicity knowledge. Algorithms. 2021;14(12).
- 43. Schmid J. Approximation, characterization, and continuity of multivariate monotonic regression functions. Analysis and Applications. 2021.
- 44. Link P, Poursanidis M, Schmid J, Zache R, von Kurnatowski M, Teicher U, et al. Capturing and incorporating expert knowledge into machine learning models for quality prediction in manufacturing. Journal of Intelligent Manufacturing. 2022;33(7):2129–2142.
- 45. Dhanabal S, Chandramathi DS. A Review of various k-Nearest Neighbor Query Processing Techniques. International Journal of Computer Applications. 2011;31(7):14–22.