Abstract
We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular (n − 1)-dimensional simplex, n being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
Citation: Heese R, Schmid J, Walczak M, Bortz M (2023) Calibrated simplex-mapping classification. PLoS ONE 18(1): e0279876. https://doi.org/10.1371/journal.pone.0279876
Editor: Xiyu Liu, Shandong Normal University, CHINA
Received: September 13, 2021; Accepted: December 16, 2022; Published: January 17, 2023
Copyright: © 2023 Heese et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data considered in the manuscript are publicly available online: 1. Alcohol data set: https://archive.ics.uci.edu/ml/datasets/Alcohol+QCM+Sensor+Dataset 2. Climate data set: https://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes 3. HIV data set: https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage 4. Pine data set: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes 5. Wifi data set: https://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization 6. Fashion-MNIST data set: https://github.com/zalandoresearch/fashion-mnist.
Funding: This work was funded by the Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. (Fraunhofer Society, Hansastraße 27c, 80686 München, Germany) via two of its institutes, namely the Fraunhofer Institute for Industrial Mathematics (Fraunhofer ITWM, Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany) and Fraunhofer Center for Machine Learning. Additional funding was provided by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the Priority Programme “SPP 2331: Machine Learning in Chemical Engineering”. These funders provided support in the form of salaries for the authors, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist. All authors are employed by the Fraunhofer Society. This does not alter our adherence to PLOS ONE policies on sharing data and materials. No competing interests arise from the affiliation with the Fraunhofer Society or the above-mentioned funders. Apart from the statements made here, there are no relevant declarations relating to employment, consultancy, patents, products in development, or marketed products that lead to competing interests.
1 Introduction
In many classification tasks, it is not sufficient to merely predict the class label for a given feature space point x. Instead, it is often important to also have good predictions for the class label probabilities, because these probability predictions provide a measure for the confidence one can have in the individual class label predictions. Such additional confidence information is important in many applications, for instance in clinical applications [1]. Classifiers that come with such additional class probability predictions are called calibrated. Some classifiers from methods like logistic regression or Gaussian process classification (GPC) are intrinsically calibrated. Also, there are various methods to calibrate an intrinsically non-calibrated classifier or to improve the calibration quality of an ill-calibrated classifier [2–5].
1.1 Contribution
In this paper, we propose a novel supervised learning method for multi-class classification that yields classifiers with a high potential to be intrinsically well-calibrated. It can be applied to general classification problems in an arbitrary metrizable feature space 𝒳 of possibly non-numeric features and with an arbitrary number n of classes with labels l1, …, ln. Starting from a training data set
(1) 𝒟 = {(x1, y1), …, (xD, yD)}
of feature space points x1, …, xD together with associated class labels y1, …, yD, the training of our classifier proceeds in two training steps:
- In a first step, the training datapoints x1, …, xD are transformed by means of a suitable training data transformation f to a latent space ℝ^{n−1}, which we partition into n cone segments C1, …, Cn corresponding to the n classes and defined in terms of a regular (n − 1)-dimensional simplex in ℝ^{n−1}.
- In a second step, a regression model ĝ is trained based on the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)), which are obtained by means of the training data transformation f from the first step. In this manner, the latent space representation of the training data is extended to the whole feature space.
We design the training data transformation f such that the latent space counterpart f(xi) of each datapoint xi is located in the corresponding cone segment and such that the location of f(xi) in this segment reflects the distances of xi from its own-class and its foreign-class datapoint neighbors. Concerning the choice of the distance metric and the number of neighbors used in the definition of f, we are completely free, and the same is true for the choice of the regression model ĝ. In particular, these quantities can be freely customized and tuned to the particular problem at hand.
As soon as the above training steps have been performed, our classifier is readily obtained. Indeed, its class label prediction ŷ(x) for a given feature space point x ∈ 𝒳 is the label of the (first) cone segment Cy that contains ĝ(x), the regression model’s point prediction for x. If in addition to these point predictions, the regression model also provides probabilistic predictions, then our classifier yields predictions p̂(y|x) for the class label probabilities as well. Specifically, these class label probability predictions read
(2) p̂(y|x) = ∫ 1Cy(z) p(z|x, 𝒟) dz,
where p(⋅|x, 𝒟) for a given x ∈ 𝒳 is the regression model’s prediction for the probability distribution of latent space points. As a classifier coming with class label probability predictions, our classifier is calibrated. We refer to it as a calibrated simplex-mapping classifier (CASIMAC) because the underlying latent-space mapping f is defined in terms of the vertices of a simplex in ℝ^{n−1}.
We point out that the concept of leveraging Bayesian probabilistic prediction power combined with latent space mappings has been studied before. Several recent publications propose to couple a deep neural network with Gaussian processes (GPs) for an improved uncertainty estimate of model predictions [6–10]. Alternative approaches explore the use of deep neural networks not as feature extraction methods but, for instance, to suitably estimate the mean functions of GPs [11] or to predict their covariance functions and hyperparameters [12]. Yet, due to the high complexity of the deep neural network components in these models, the algorithms mentioned above are well-suited for large data sets with abundant training data available [13]. In the present paper, by contrast, we propose a simple latent space representation of the original feature space as the core component of a well-calibrated classifier that also works on less complex data sets. In particular, our method has recently been successfully used for an industrial application [14].
In summary, our contribution consists of the following parts:
- We propose a novel supervised learning method for multi-class classification with a simplex-like latent space.
- We rigorously establish the theoretical background including detailed proofs.
- We find that the computational effort of making predictions with our proposed classifier is comparatively low (in contrast to, e. g., GPC).
- We show how the latent space of our proposed classifier can be suitably visualized.
- We benchmark the prediction and calibration properties of our proposed classifier.
Additionally, we discuss potential use cases and further research directions.
1.2 Simple example
In order to concretize the aforementioned assets of our method and paint a more intuitive picture, we briefly discuss a simple case with n = 2 classes (i. e., a binary classification problem), which is shown in Fig 1. The two class labels are chosen as l1 ≔ −1 and l2 ≔ +1, respectively. In that case, our latent space is just the real line ℝ and the cone segments simply are C−1 = (−∞, 0] and C+1 = [0, ∞), the negative and the positive half-axis, respectively. Consequently, our definition of the class label predictions simplifies to
(3) ŷ(x) = −1 if ĝ(x) ≤ 0 and ŷ(x) = +1 if ĝ(x) > 0,
while our definition of the class probability predictions simplifies to
(4) p̂(+1|x) = ∫0∞ p(z|x, 𝒟) dz and p̂(−1|x) = 1 − p̂(+1|x).
As before, ĝ is a regression model that is fitted to the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)) and p(⋅|x, 𝒟) is the associated predictive a posteriori distribution for x based on the same transformed training datapoints.
An exemplary binary classification problem with one-dimensional feature space and class labels y ∈ {−1, +1}. We compare a GPC (using a radial-basis-function kernel and expectation propagation (EP) with the implementation from [19]) and our CASIMAC (based on a GPR with Matérn kernel as the underlying regression model), both trained on the same D = 25 training datapoints. The top plot shows the expectation values and the standard deviations of the latent space distributions for both approaches (on two different scales), whereas the bottom plot contains the class probability predictions p̂(+1|x) for both approaches. The gray and white areas indicate the true regions of the class +1 and −1, respectively. All datapoints are sampled without noise from these regions. Finally, the dotted horizontal lines (⋯⋯) represent the decision boundaries. The corresponding cone segments for CASIMAC, C−1 and C+1, are shown in the top plot.
A simple choice for the regression model ĝ is a Gaussian process regressor (GPR). And a possible choice for the training data transformation f on which ĝ depends is the simple distance-based map defined as follows:
(5)
where the sign in this formula is simply given by the class label yi of the datapoint xi and the distance is to be understood w.r.t. the chosen metric on the feature space 𝒳. If we train a CASIMAC with these choices for f and ĝ on an exemplary training data set with D = 25 points in the one-dimensional feature space 𝒳, we obtain the class probability predictions p̂(+1|⋅) depicted in red in Fig 1 (bottom plot). If we train a GPC on the same training data, we obtain the respective class probability predictions depicted in blue. As usual,
(6) p̂GPC(+1|x) = ∫ σ(z) pGPC(z|x, 𝒟) dz,
where σ(z) ≔ 1/(1 + e−z) represents the logistic sigmoid function and pGPC(⋅|x, 𝒟) is the predictive a posteriori distribution of the GPC for the test point x [15, 16].
We immediately see from the class probability plots in Fig 1 that our CASIMAC has a high confidence in its class label predictions in regions of densely sampled datapoints, as one would intuitively expect. In contrast, the GPC has considerably less confidence in its class label predictions at or near the datapoints despite its high prediction accuracy. In essence, this is because the training data transformation f that underlies our classifier takes into account the actual distances of the datapoints and because the latent space probability density p(⋅|x, 𝒟) is considerably more concentrated around its expectation value than is the case for its GPC analog. Another downside of the GPC—further impairing its calibration quality—is that its class probability predictions (6) can be computed only approximately. In fact, already the computation of the non-normal distribution pGPC(⋅|x, 𝒟), a D-dimensional integral, requires quite sophisticated approximations like the Laplace approximation [15], expectation propagation [15], variational inference [17], or the Markov chain Monte Carlo approximation [18], to name a few. As opposed to this, no approximations are required for the computation of the CASIMAC class probabilities because there is a closed-form expression for them, due to the normality of the distributions p(⋅|x, 𝒟) assumed in our example from Fig 1.
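To make the binary recipe concrete, here is a minimal, self-contained sketch (our own illustration, not the authors’ implementation; the sine-based synthetic labels are an assumption): the latent target of each training point is its class sign times the distance to its nearest foreign-class neighbor (the special case α = 0, β = 1, kα = kβ = 1 discussed in Section 2.1), a GPR is fitted to these targets, and the class-(+1) probability is the predictive probability mass on the positive half-axis.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3.0, 3.0, 25)).reshape(-1, 1)   # 1-d feature space
y = np.where(np.sin(X[:, 0]) > 0, 1, -1)                 # labels in {-1, +1}

# Latent targets: sign from the class label, magnitude from the distance
# to the nearest foreign-class training point.
pairwise = np.abs(X - X.T)
foreign = y[None, :] != y[:, None]
z = y * np.where(foreign, pairwise, np.inf).min(axis=1)

gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, z)

X_test = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
mu, sd = gpr.predict(X_test, return_std=True)
p_plus = 1.0 - norm.cdf(0.0, loc=mu, scale=sd)           # P(latent > 0) = p(+1 | x)
```

With a noise-free GPR, the latent mean interpolates the targets, so the predicted probabilities are close to 0 or 1 near densely sampled training points, mirroring the behavior shown in Fig 1.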
1.3 Structure of the paper
In Section 2, we formally introduce our CASIMAC method. We explain in detail the two training steps as well as how predictions of a trained classifier can be obtained. We also show how our latent space mappings can be leveraged for the convenient visualization of inter- or intra-class relationships in the data, especially in the case of a high-dimensional feature space and a moderate number of classes. In Section 3, we apply our method to various data sets and compare its performance and calibration qualities to several well-established benchmark classifiers. In particular, we demonstrate that our approach can be applied to many different application domains because our training data transformation reflects actual distances in the data set and because the respective distance metric as well as the regression model are freely customizable. Section 4 concludes the paper with a summary and an outlook on possible future research. The appendix collects the mathematical and technical background underlying our CASIMAC method. In S1 Appendix, we rigorously prove the results needed for an in-depth and mathematically sound understanding of the method. In S2 Appendix, in turn, we summarize the main features of our implementation, and S3 Appendix summarizes how the hyperparameters of the models in Section 3 were chosen.
2 Calibrated simplex-mapping classification
In this section, we formally introduce our proposed calibrated simplex-mapping classifier, briefly referred to as CASIMAC. Sections 2.1 and 2.2 explain in detail the two training steps outlined in Section 1.1. In Section 2.3, we give the precise definition of our CASIMAC and, moreover, we explain how its predictions ŷ(x) and p̂(y|x) for the class labels and the class label probabilities can be calculated in a computationally favorable manner. In Section 2.4 we finally explain how the latent space representation upon which our classification method relies can also be exploited for visualization purposes. Here and in the following, we consistently denote the training data set and the training datapoints as in (1) and write
(7) y1 = y(x1), …, yD = y(xD)
for the true class labels of the training datapoints x1, …, xD. Also, 𝒳 always denotes the feature space—which is only assumed to be metrizable and, in particular, need not be embedded in any ℝ^m—and 𝒴 denotes the set of class labels. As usual, a set is called metrizable iff there exists at least one metric on it [20]. We see below that we actually only need so-called semimetrics [21] and we exploit this even greater flexibility in our last application example. If 𝒳 is embedded in some ℝ^m, then of course infinitely many metrics exist on 𝒳, namely at least all the metrics induced by the ℓp-norms ‖⋅‖p on ℝ^m for p ∈ [1, ∞) ∪ {∞}. In order to reduce cumbersome double indices, we assume from here on that the class labels are just 1, …, n (instead of the general labels l1, …, ln). In short, we assume that
(8) 𝒴 = {1, …, n}
without loss of generality.
2.1 Training data transformation to a latent space
In the first training step of our method, we transform the feature space training datapoints x1, …, xD by means of a suitably designed training data transformation
(9) f : {x1, …, xD} → ℝ^{n−1}
to a suitable latent space. We choose this latent space to be ℝ^{n−1}, where n ≥ 2 as before is the number of classes observed in the training data 𝒟. We decompose this space into n conically shaped segments C1, …, Cn, which are defined by the vertices p1, …, pn of a regular (n − 1)-dimensional simplex S in ℝ^{n−1} having barycenter 0 and being at unit distance from their barycenter 0, that is,
(10) p1 + ⋯ + pn = 0 (10.a) and ‖p1‖ = ⋯ = ‖pn‖ = 1 (10.b).
See S1 Appendix (Proposition 2). Specifically, we define the segment Ck as the cone that is spanned by the mirrored vertices −pi with i ≠ k, that is,
(11) Ck ≔ {∑i≠k λi (−pi) : λi ≥ 0 for all i ≠ k}.
In view of (10.a), the vertex pk lies on the central ray of Ck, which is why we also refer to pk as the central vector of Ck. Also, the segments C1, …, Cn cover the whole latent space ℝ^{n−1} and any two of these segments overlap only at their boundaries. In short,
(12) C1 ∪ ⋯ ∪ Cn = ℝ^{n−1} (12.a) and Ck° ∩ Cl° = ∅ for k ≠ l (12.b),
where Ck° is the interior of Ck, that is, the set Ck without its boundary [20]. It is given by
(13) Ck° = {∑i≠k λi (−pi) : λi > 0 for all i ≠ k}.
And finally, the segments C1, …, Cn are pairwise congruent. All these statements are proven rigorously in S1 Appendix (Lemma 5 and Propositions 6 and 7). In the special case of just two or three classes, they can also be easily verified graphically. See Fig 2, for instance.
Segmentation of the latent space of a ternary classification problem (n = 3) into three congruent cone segments C1, C2, C3, each associated with one of the three classes as indicated by the respective colors. The vertices p1, p2, p3 of the simplex S (marked by the gray shading) with barycenter 0 lie on the central rays of the respective segments. The borders of the segments are defined by the vertices −p1, −p2, −p3 of the mirrored simplex −S.
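Vertices satisfying the two conditions in (10) can be constructed numerically, for instance by centering and projecting the standard basis of ℝ^n. The following sketch is our own construction (not taken from the paper’s implementation); it also verifies the defining properties: barycenter 0, unit norms, and the constant pairwise inner product −1/(n − 1) of a regular simplex.

```python
import numpy as np

def simplex_vertices(n):
    """Vertices p_1, ..., p_n of a regular (n-1)-simplex in R^{n-1}
    with barycenter 0 and unit distance from the barycenter."""
    C = np.eye(n) - 1.0 / n          # centered standard basis; rows sum to 0
    _, _, Vt = np.linalg.svd(C)      # first n-1 right singular vectors span
    P = C @ Vt[: n - 1].T            # the hyperplane containing the rows
    return P / np.linalg.norm(P, axis=1, keepdims=True)

P = simplex_vertices(4)              # n = 4 classes -> vertices in R^3
print(np.round(P.sum(axis=0), 10))   # barycenter is (numerically) zero
print(np.round(P @ P.T, 3))          # 1 on the diagonal, -1/3 off-diagonal
```

The projection preserves norms and inner products, so after normalization the pairwise inner products equal −1/(n − 1), as required for a regular simplex.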
With the help of the above segmentation of the latent space ℝ^{n−1}, we can define the training data transformation (9). We design this mapping f such that it maps each training datapoint x ∈ {x1, …, xD} to the cone segment Cy(x) corresponding to its class label y(x) in such a way that the location of f(x) in this cone segment reflects the distances of x from its own-class and from its foreign-class neighbors. We choose the location of f(x) based on the following premises:
- f(x) should be located the farther in the direction of py(x) (and thus the farther inside the cone segment Cy(x)), the closer x is to its class-y(x) datapoint neighbors.
- f(x) should be located the farther in the direction of −py (and thus the farther away from the cone segment Cy), the farther x is away from its class-y datapoint neighbors for y ≠ y(x).
Specifically, we define f as follows:
(14) f(x) ≔ a(x) py(x) + ∑y≠y(x) by(x) (−py)
for every training datapoint x ∈ {x1, …, xD}. In particular, the coefficient a(x) indicates how far f(x) is pulled into the own-class segment Cy(x), while the coefficient by(x) indicates how far f(x) is pulled away from the foreign-class segment Cy for y ≠ y(x). We therefore refer to these coefficients as the attraction and the repulsion coefficients of x and we define them, as indicated above, in terms of the mean distance of x from its kα closest datapoint neighbors of its own class y(x) or, respectively, from its kβ closest datapoint neighbors of the foreign class y ≠ y(x). That is,
(15)
where 𝒳y ≔ {xi : y(xi) = y} is the set of all training datapoints belonging to class y and
(16) d̄k(x, S) ≔ (d(x, x(1)) + ⋯ + d(x, x(k)))/k (with x(1), …, x(k) being the k nearest neighbors of x in S)
is the mean distance of x from its k nearest neighbors from the subset S ⊆ {x1, …, xD}, the distance being measured in terms of some arbitrary semimetric [21]
(17) d : 𝒳 × 𝒳 → [0, ∞)
on the feature space 𝒳, that is, d(x, x′) = d(x′, x) for all x, x′ ∈ 𝒳 (symmetry) and d(x, x′) = 0 if and only if x = x′. At least one such semimetric exists on 𝒳 because this space was assumed to be semimetrizable at the beginning of Section 2. Also, α, β ∈ [0, ∞) and kα, kβ ∈ ℕ are user-defined hyperparameters satisfying
(18)
where Dmin denotes the cardinality of the smallest training data class. Conditions (18.b) and (18.c) guarantee that all the attraction and repulsion coefficients are strictly positive finite real numbers, that is,
(19)
for all x ∈ {x1, …, xD} and all y ≠ y(x) (Proposition 10). In most of our applications, d is a proper metric, that is, a semimetric that also satisfies the triangle inequality. In our last application example, though, we make explicit use of semimetrics as well.
Since the domain of f and the attraction and repulsion coefficients occurring in the definition of f obviously depend on the training data set 𝒟, so does the training data transformation f itself, and we sometimes make this dependence explicit by writing
(20) f = f𝒟.
It is straightforward to verify that the training data transformation f from (14), with α ≔ 0, β ≔ 1, kα, kβ ≔ 1, reduces to the mapping from (5) in the special case of just n = 2 classes. It is also easy to verify from (14), using the barycenter condition (10.a) along with (13), (18.a) and (19), that
(21) f(x) ∈ Cy(x)°
for every x ∈ {x1, …, xD} (Proposition 10). In other words, f(x) for every training datapoint x lies in the interior Cy(x)° of the corresponding cone segment. Specifically, the membership of f(x) to Cy(x)° is the clearer, the closer x is to its kα nearest own-class neighbors and the farther x is away from its kβ nearest foreign-class neighbors. Fig 3 illustrates this behavior of the training data transformation f for the simple case of a ternary classification problem (n = 3).
Illustration of the training data transformation from (14) for an exemplary ternary classification problem (n = 3) with a two-dimensional feature space. The colored regions on the left illustrate the true regions of the three classes with their respective labels, whereas the points denote the (noiselessly) sampled training datapoints x1, …, xD. We sample three points from each class (i. e., D = 9) and use different symbols (∘, ◽ and ♢) to uniquely identify each point. As the distance metric d underlying f, we choose the Euclidean distance. The other hyperparameters are chosen as α ≔ 0, β ≔ 1 and kα, kβ ≔ 1, respectively. The figure shows that the farther a training datapoint xi (for i ∈ {1, …, D}) is away from its nearest foreign-class neighbor, the clearer is the membership of f(xi) to the respective cone segment.
We point out that this behavior of f is generic in the sense that it is independent of the number of classes and independent of the specific choices of α, β, kα, kβ and d. In particular, we can freely choose and tune the semimetric d as well as the hyperparameters α, β, kα, kβ within the bounds (18) to the specific classification problem at hand [22]. We make ample use of this customization flexibility for the exemplary classification problems in Section 3.
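The attraction/repulsion mechanism can be sketched in code. Since we do not spell out the exact coefficient formulas of (15) here, the sketch below assumes one plausible variant—attraction a(x) = α divided by the mean distance to the kα nearest own-class neighbors, repulsion by(x) = β times the mean distance to the kβ nearest class-y neighbors—combined along the central vectors as an attract/repel sum. Class labels are assumed to be 0, …, n − 1 so that they can index the vertex array.

```python
import numpy as np

def simplex_vertices(n):
    # Central vectors p_1, ..., p_n (regular simplex, barycenter 0, unit norm).
    C = np.eye(n) - 1.0 / n
    _, _, Vt = np.linalg.svd(C)
    P = C @ Vt[: n - 1].T
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def mean_knn_dist(x, S, k):
    # Mean Euclidean distance of x to its k nearest neighbors in the set S.
    d = np.sort(np.linalg.norm(S - x, axis=1))
    return d[:k].mean()

def transform(X, y, alpha=1.0, beta=1.0, k_a=1, k_b=1):
    # Hypothetical attract/repel variant of the training data transformation:
    # f(x) = a(x) * p_{y(x)} - sum_{l != y(x)} b_l(x) * p_l.
    n = len(np.unique(y))
    P = simplex_vertices(n)
    Z = np.empty((len(X), n - 1))
    for i, (x, yi) in enumerate(zip(X, y)):
        own = X[y == yi]
        own = own[np.linalg.norm(own - x, axis=1) > 0]  # exclude x itself
        z = (alpha / mean_knn_dist(x, own, k_a)) * P[yi]
        for l in range(n):
            if l != yi:
                z -= beta * mean_knn_dist(x, X[y == l], k_b) * P[l]
        Z[i] = z
    return Z, P

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(5, 2)) for c in ((0, 0), (2, 0), (1, 2))])
y = np.repeat([0, 1, 2], 5)
Z, P = transform(X, y)
```

With strictly positive coefficients, every transformed point lies strictly inside the cone segment of its own class, consistent with (21): the nearest central vector of Z[i] is P[y[i]].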
2.2 Training of a regression model based on the transformed data
In the second training step of our method, we train a regression model
(22) ĝ : 𝒳 → ℝ^{n−1}
from the feature space 𝒳 to the latent space ℝ^{n−1}. As the training data set for this regression model, we take the transformed training data set
(23) {(x1, f(x1)), …, (xD, f(xD))},
consisting of the feature space training datapoints x1, …, xD together with their latent space counterparts f(x1), …, f(xD), which are obtained by means of the training data transformation f from the first training step as defined in (14).
In sharp contrast to f, the regression model ĝ is defined on the whole of 𝒳 (instead of only the training datapoints x1, …, xD) and thus yields latent space predictions ĝ(x) for arbitrary feature space points x ∈ 𝒳, especially for hitherto unobserved and unclassified feature space points. Fig 4 illustrates the effect of the regression model for the special case of a ternary classification problem (n = 3).
Illustration of the regression model ĝ for the exemplary ternary classification problem from Fig 3. We choose a GPR model with a Matérn kernel, which is fitted to the transformed training datapoints (x1, f(x1)), …, (xD, f(xD)) that are obtained from the training data transformation f according to Fig 3. In order to illustrate the effect of ĝ, we show how the points of the rectangular grid on the left get transformed by ĝ. In particular, we show this transformation for the corner points marked by stars. The color of any point x on the grid and, respectively, of any point ĝ(x) on the transformed grid indicates the true class of x.
Since both f and the transformed training data set (23) depend on the original training data set 𝒟, so does the regression model ĝ, and we sometimes make this dependence explicit by writing
(24) ĝ = ĝ𝒟.
We point out that—similarly to the choice of the hyperparameters α, β, kα, kβ and d of the training data transformation—the specific choice of the regression model is completely arbitrary.
In particular, we can choose a suitable probabilistic regression model that is able to not only predict a single latent space point ĝ(x) but also an entire probability density p(⋅|x, 𝒟) for every x ∈ 𝒳. Such a so-called predictive a posteriori distribution indicates the distribution of the latent space points for given x and a given data set 𝒟. In particular, the predictive a posteriori distribution p(⋅|x, 𝒟) indicates the level of confidence that the model has in its point prediction ĝ(x), namely it is all the more confident in the point prediction the more p(⋅|x, 𝒟) is concentrated around ĝ(x). A typical example for regression models which provide both a point prediction ĝ(x) and a probability-density prediction p(⋅|x, 𝒟) for every x ∈ 𝒳 are GPR models [15, 16]. As is well-known, for these models the point predictions ĝ(x) are completely determined by the predictive a posteriori distribution p(⋅|x, 𝒟), namely as the maximum a posteriori prediction, that is,
(25) ĝ(x) = argmaxz p(z|x, 𝒟).
We make ample use of GPR models in our benchmark examples from Section 3.
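This second training step can be sketched with a single multi-output GPR as ĝ. The latent targets Z below are synthetic stand-ins for the outputs of the training data transformation f, and sklearn’s GaussianProcessRegressor is just one possible model choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))                  # feature space points
Z = np.column_stack([np.sin(2 * X[:, 0]), X[:, 1] ** 2])  # stand-in latent targets f(x_i)

# Noise-free GPR (tiny default jitter alpha=1e-10) interpolates the targets.
gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gpr.fit(X, Z)                                             # (n-1)-dimensional output
Z_hat = gpr.predict(X)
```

Because the GPR is noise-free, its predictions at the training inputs essentially reproduce the latent targets, which is exactly the perfect-interpolation property used in Section 2.3.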
2.3 Calibrated simplex-mapping classifier
After the two training steps explained above have been performed, we can define and put to use our CASIMAC. We define it as the classifier
(26) ŷ : 𝒳 → 𝒴
that assigns to each feature space point x ∈ 𝒳 the label y of the first cone segment Cy that contains ĝ(x). That is,
(27) ŷ(x) ≔ min{y ∈ 𝒴 : ĝ(x) ∈ Cy},
where ĝ(x) is the regression model’s latent space prediction for x and where
(28) min{y ∈ 𝒴 : z ∈ Cy}
is the label of the first cone segment containing z ∈ ℝ^{n−1}. Since by (12.a) every latent space point z ∈ ℝ^{n−1} is contained in at least one of the segments C1, …, Cn, the expression (27) really yields a well-defined classifier. Since, on the other hand, the segments overlap at their respective borders, a given latent space point z is contained in several segments if it is located exactly on such a border. In this special case, we need to decide for one of the overlapping segments, to obtain a classifier that assigns a single label (instead of multiple labels) to each feature space point. In our definition (27), we decide for the first of the segments containing ĝ(x) for the sake of simplicity, hence the minimum in that formula.
In view of the dependence of the regression model ĝ𝒟 on the training data set 𝒟, our CASIMAC depends on 𝒟 as well and, to emphasize this, we sometimes write
(29) ŷ = ŷ𝒟.
According to the definition (27), in order to practically compute the class label prediction ŷ(x) for a given feature space point x ∈ 𝒳, we have to determine all cone segments that contain ĝ(x). Yet, in view of the purely geometric definition (11) of the cone segments, it is not so clear at first glance how this can be done in a computationally feasible and favorable manner—especially in the case of many classes (n ≥ 5) where the segments cannot be (directly) visualized anymore. We find, however, that the segments containing ĝ(x) can be determined solely in terms of the distances
(30) ‖ĝ(x) − p1‖, …, ‖ĝ(x) − pn‖
of ĝ(x) from the central vectors p1, …, pn of the segments C1, …, Cn. Specifically, we prove in S1 Appendix (Theorem 8) that a latent space point z ∈ ℝ^{n−1} belongs to the segment Cy if and only if
(31) ‖z − py‖ ≤ ‖z − pl‖ for all l ∈ 𝒴,
that is, if and only if its distance from the other segments’ central vectors pl, l ≠ y, is at least as large as its distance from the central vector py of Cy. Consequently, the geometrically inspired definition (27) of our CASIMAC can be recast in the form
(32) ŷ(x) = min{y ∈ 𝒴 : ‖ĝ(x) − py‖ ≤ ‖ĝ(x) − pl‖ for all l ∈ 𝒴}
(Corollary 11). Computationally, (32) is much easier to evaluate than (27) since ŷ(x) can be determined simply by calculating all distances (30) and by then choosing the label of the smallest one. In our CASIMAC implementation, we therefore use (32) to perform predictions. We illustrate the method in Fig 5.
Illustration of our proposed calibrated simplex-mapping classifier for the exemplary ternary classification problem from Fig 3 based on the GPR model ĝ from Fig 4. According to our definition (27), the classifier determines for each feature space point x (on the left) the cone segment containing the latent space counterpart ĝ(x) (on the right) and then takes the label of this cone segment as the class label prediction ŷ(x) for x. In other words, the cuts of the cone segment borders through the latent space determine the class membership of each data point x according to its learned latent space position ĝ(x). We mark the predicted class boundaries (on the left) as well as the corresponding cone segment boundaries (on the right) by dashed lines. As can be seen, the predicted class boundaries in the feature space do not deviate much from the true class boundaries indicated by the different background and grid colors on the left. In other words, our classifier produces only a few misclassifications. This can also be seen from the fact that the color of most of the transformed grid points in the latent space on the right (indicating the true class) is identical to the underlying background color (indicating the predicted class).
If the regression model perfectly fits the training data (23) in the sense that
(33) ĝ(xi) = f(xi) for all i ∈ {1, …, D},
then it is easy to see from (27), using (12.b) and (21), that
(34) ŷ(xi) = yi for all i ∈ {1, …, D}
(Corollary 11). In other words, if the regression model is perfectly interpolating in the sense of (33), our CASIMAC perfectly reconstructs the true class labels of the training datapoints. A typical example of perfectly interpolating regression models are noise-free GPR models [15, 16].
If the regression model apart from its point predictions also provides probabilistic predictions, we can estimate the class probabilities for arbitrary feature space points x. In that case, there is a latent probability density p(⋅|x, 𝒟) on ℝ^{n−1} for every x ∈ 𝒳 which on the one hand determines the point prediction ĝ(x) in some way—for instance, as the maximum a posteriori estimate (25)—and which on the other hand also determines the uncertainty of this prediction. Since the regression model ĝ is our model for observations of latent-space points, the associated probabilities p(z|x, 𝒟) can be considered as estimates for the probability of observing the latent-space point z for a given x ∈ 𝒳. Since, moreover, the class prediction for x is y precisely when ĝ(x) falls into the segment Cy by the definition (27) of our classifier, the expression
(35) p̂(y|x) ≔ ∫ 1Cy(z) p(z|x, 𝒟) dz
is an estimate for the probability of observing class y for x. We therefore take this estimate as our classifier’s prediction for the probability of class y given x. In the above equation, 1Cy denotes the indicator function
(36) 1A(z) ≔ 1 for z ∈ A and 1A(z) ≔ 0 for z ∉ A
with respect to the set A ≔ Cy. In view of (12.b) and (28), the segment borders carry no probability mass for a latent density, and thus
(37) p̂(y|x) = ∫Cy p(z|x, 𝒟) dz = ∫Cy° p(z|x, 𝒟) dz
are alternative expressions for our classifier’s class probability prediction (Corollary 12).
If the probability densities p(⋅|x, 𝒟) belong to a normal distribution and the number of classes is n = 2, then the class probability predictions (37) can be computed by means of a simple analytical formula, namely (118) in S1 Appendix (Proposition 13). If, by contrast, the probability densities p(⋅|x, 𝒟) do not belong to a normal distribution or the number of classes is n > 2, then the multi-dimensional integrals in (37) can still be computed approximately. In our CASIMAC implementation, we use Monte Carlo sampling for these approximate computations of the class probability predictions (37) for n > 2. See (119) in S1 Appendix (Proposition 13). Standard examples of regression models ĝ with normally distributed probability densities p(⋅|x, 𝒟) are provided by GPR models. In particular, the class probabilities for the binary classification problem from Fig 1 were evaluated analytically.
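The Monte Carlo approximation of (37) for n > 2 can be sketched as follows, assuming a normal predictive density with mean mu and covariance cov in the latent space; each sample’s segment is found via the nearest-central-vector criterion (31):

```python
import numpy as np

def simplex_vertices(n):
    # Central vectors p_1, ..., p_n (regular simplex, barycenter 0, unit norm).
    C = np.eye(n) - 1.0 / n
    _, _, Vt = np.linalg.svd(C)
    P = C @ Vt[: n - 1].T
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def class_probabilities(mu, cov, P, n_samples=20_000, seed=0):
    # Draw latent samples and count how often each segment "wins", i.e. how
    # often its central vector p_y is the nearest one.
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mu, cov, size=n_samples)
    nearest = np.argmin(np.linalg.norm(Z[:, None, :] - P[None, :, :], axis=2), axis=1)
    return np.bincount(nearest, minlength=len(P)) / n_samples

P = simplex_vertices(3)
probs = class_probabilities(mu=2.0 * P[1], cov=0.05 * np.eye(2), P=P)
```

For a predictive mean deep inside one cone segment and a small covariance, nearly all of the probability mass lands in that segment, so the corresponding class probability is close to one.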
2.4 Visualization
Apart from its central use in the definition of our classifier, the latent space representation given by f and ĝ can also be beneficially used for visualization purposes.
In particular, the training data transformation f can be exploited for visually detecting inter- or intra-class relationships in the data. Inspecting the latent space representation from Fig 3, for instance, we find that the transformed datapoints of classes 1 and 3 are farther apart from each other than they both are from the transformed datapoints of class 2. We could have directly seen that, of course, by inspecting the untransformed datapoints in feature space, which is only 2-dimensional here. After all, class 2 lies between classes 1 and 3, as the left panel of Fig 3 reveals. Such a direct inspection of the data in feature space, however, is possible only for feature spaces with small dimensions. In case the feature space is high-dimensional or not even embedded in an ℝ^m, though, we can still use its latent space representation to uncover relations between the datapoints. All we need for that is a latent space of moderate dimension n − 1, ideally n ≤ 4.
Since our latent space is an unbounded set by definition, the latent space counterparts f(x1), …, f(xD) of the datapoints can be very far apart from each other. It can therefore be useful to further transform the latent space—and with it, the points f(x1), …, f(xD)—to a fixed bounded reference set. A natural way of achieving this is to compress the latent space
to the interior of the reference simplex
underlying our CASIMAC. Indeed, there is a bijective compression map
(38)
which is diffeomorphic (infinitely differentiable with infinitely differentiable inverse) and which leaves the cone segments
invariant in the sense that, for every
and every
,
(39)
See S1 Appendix (Proposition 15). We use this compression map for the visualization of one of our real-world data sets below.
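The exact form of the compression map (38) is given in S1 Appendix. As a simpler analogue with the same ray-invariance property, one can consider a radial squashing map onto the open unit ball (numpy sketch; note that the actual map (38) targets the open reference simplex instead of a ball):

```python
import numpy as np

def compress(v):
    """Radial squashing map from R^d onto the open unit ball.

    Like the paper's compression map (38), it is a diffeomorphism and
    leaves every ray from the origin invariant, so membership in a
    cone-shaped segment is preserved under the compression.
    """
    v = np.asarray(v, dtype=float)
    return v / (1.0 + np.linalg.norm(v))

def decompress(w):
    """Inverse of compress, defined on the open unit ball."""
    w = np.asarray(w, dtype=float)
    return w / (1.0 - np.linalg.norm(w))
```

Since compress only rescales the radius, a point and its image always lie on the same ray from the origin, mirroring the invariance property (39).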
3 Benchmark
In this section, we apply our CASIMAC method to various data sets and compare its performance to several well-established benchmark classifiers. Specifically, we perform benchmarks using synthetic data in Section 3.1 and real-world data in Section 3.2. Furthermore, we exemplarily illustrate the latent space visualization in Section 3.3. In these first three subsections, the regression model underlying our classifier is a GPR. In Section 3.4, by contrast, we use a neural network as our regression model, thereby demonstrating how a more complex regression model allows us to solve a classification task with more training data. For all calculations, we use the implementation of our method [23], which is briefly outlined in S2 Appendix.
3.1 Synthetic data
As a first benchmark, we test our method on a synthetically generated data set. Specifically, we consider the synthetic classification problem with n = 4 classes and the 2-dimensional feature space, which is based on the explicit class membership rule
(40)
for features
. The problem is illustrated in Fig 6. In order to obtain our training data set (1), we uniformly sample D = 40 points from
and associate them with their true class labels according to (40). In particular, the synthetic data is free of noise. Analogously, we obtain the test data set
(41)
by uniformly sampling T = 10000 points from
and by associating them with their true class labels (40). Here and in the following, we always standardize the features based on the training data before feeding them to the considered classifiers.
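The standardization step can be sketched as follows (numpy; the scaling statistics are computed from the training data only and then applied unchanged to the test data):

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize features using statistics of the training set only,
    as done before feeding the data to the classifiers."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0.0] = 1.0            # guard against constant features
    return (X_train - mu) / sd, (X_test - mu) / sd
```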
Synthetic data visualization: illustration of the ground truth (40) for the synthetic data class labels.
As the distance metric d in the training data transformation (14) underlying our CASIMAC, we choose the Euclidean distance, that is,
(42)
for all
. Also, we parameterize the weights α, β for the attraction and repulsion coefficients (15) by a single parameter γ ∈ [0, 1], namely
(43)
so that α, β ∈ [0, 1]. And as the regression model (22) underlying our CASIMAC, we choose a GPR with a combination of a Matérn and a white-noise kernel [15]. We compare our classifier to a (one-versus-rest) GPC with the same type of kernel. The hyperparameters of both classifiers are tuned by cross-validating the training data over a pre-defined set of different setups as outlined in S3 Appendix. We made use of scikit-learn [24] to realize the standard models.
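Such a regression model can be set up directly in scikit-learn. A minimal sketch, in which the kernel hyperparameter values are placeholders rather than the cross-validated setups of S3 Appendix, and a single 1-dimensional target stands in for the (n − 1)-dimensional latent targets:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Matérn plus white-noise kernel, as used for the CASIMAC benchmarks;
# length_scale, nu, and noise_level here are illustrative placeholders.
kernel = Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

# Toy training data standing in for one latent-space coordinate.
X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]
gpr.fit(X, y)

# Probabilistic prediction: mean and standard deviation per point.
mean, std = gpr.predict(X, return_std=True)
```

In the actual method, one such probabilistic regression is fitted per latent dimension, and the predicted normal distribution feeds the class probability computation (37).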
In total, we perform 10 classification tasks, each with different test and training data sets. For each, we use the test data set to determine the accuracy (fraction of correctly predicted points) and the log-loss (that is, the logistic regression loss or cross-entropy loss). In addition to that, we calculate the proba-loss, which we define as the mean predicted probability error
(44)
where
is our CASIMAC’s or the GPC’s class probability prediction, respectively. Clearly, δp ∈ [0, 1] with 0 being the best possible outcome and 1 the worst. We can compute this score only because we know the true class membership rule (40) underlying the problem.
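For reference, the accuracy and log-loss scores used here can be computed as follows (plain numpy sketch; class labels are assumed to be encoded as indices 0, …, n − 1):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly predicted points."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def log_loss(y_true, proba, eps=1e-15):
    """Cross-entropy loss; proba[i, k] is the predicted probability of
    class k for sample i, y_true holds true class indices."""
    p = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    return float(-np.mean(np.log(p[np.arange(len(y_true)), y_true])))
```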
The results are shown in Table 1. Clearly, our method has a better proba-loss and log-loss than the GPC, whereas the latter has a slightly better accuracy.
Test scores for the synthetic data set based on T = 10000 uniformly sampled test datapoints. The proba-loss is defined in (44). We show the means and the corresponding standard deviations (in brackets) over all 10 classification tasks. The best mean results are highlighted in bold.
We also show an exemplary visualization of the predicted class probabilities in Fig 7 for a single training data set. The background colors represent a weighted average of the class colors from Fig 6 with a weight corresponding to the predicted probability of the respective class. So, for a perfectly calibrated classifier, the colors in Fig 7 would be the same as the colors in Fig 6. Although our CASIMAC has a slightly lower training accuracy than the GPC (0.925 as opposed to 1.000), it clearly exhibits brighter, more distinguishable colors. Consequently, the predicted probabilities are less uniform and correspond to a clearer decision for one of the four classes instead of an uncertain mixture. This observation corresponds to the lower proba-loss and log-loss results in Table 1.
(a) CASIMAC class probabilities. (b) GPC class probabilities. Class probability predictions of CASIMAC and of GPC for the synthetic data set. The color of each background point corresponds to the weighted average of the class colors from Fig 6 with a weight corresponding to the predicted probability of the respective class at this point. Thus, clear colors as in (a) represent high probabilities for a single class, whereas washed-out colors as in (b) represent almost uniform probabilities. We also show the training data set (consisting of D = 40 points) on which the classifiers have been trained. The color of the points corresponds to their true class. While most of the training datapoints are correctly classified (shown as ∘), our CASIMAC incorrectly predicts three training datapoints (shown as ♢) close to the class borders.
3.2 Real-world data
Following the benchmark on synthetic data, we continue with a benchmark on five real-world data sets from different fields of application. Specifically, we consider the data sets from Table 2. All of these data sets are publicly available online.
Overview of the five real-world data sets and their basic characteristics: n is the number of classes, m is the number of features or, in other words, the dimension of the feature space , I is the total number of datapoints, and D is the number of training datapoints. The number T of test datapoints is just T ≔ I − D. We abbreviate the references as a) [25], b) [26], c) [27], d) [28], e) [29], f) [30], g) [2], and h) [31]. Also, we turned the originally multi-class data set pine into a binary problem as described in [2].
As the distance metric d in the training data transformation (14) underlying our CASIMAC, we choose the Euclidean distance (42) for the data sets alcohol, climate, hiv, and pine, whereas for the remaining data set wifi we choose the taxicab distance
defined by
(45)
because it leads to a better performance there. Additionally, we parameterize the weights α, β of the attraction and repulsion coefficients (15) as in (43) for all our data sets. And as the regression model (22) underlying our CASIMAC, we again choose a GPR with a combination of a Matérn and a white-noise kernel for all our data sets. We compare our CASIMAC on each data set to three other classifiers, namely to a GPC as before and, in addition to that, to an artificial neural network with fully-connected layers (MLP) and to a k-nearest neighbor classifier (kNN). This choice of classifiers was dictated by their reported good calibration properties [2]. Again, we tune the hyperparameters of the classifiers by cross-validation as outlined in S3 Appendix. As in the previous section, we made use of scikit-learn [24] to realize the standard models.
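The two distance metrics (42) and (45) are straightforward to implement, for example:

```python
import numpy as np

def euclidean(x, xp):
    """Distance (42): d(x, x') = ||x - x'||_2."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(xp)))

def taxicab(x, xp):
    """Distance (45): d(x, x') = sum_i |x_i - x'_i|."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(xp))))
```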
The test-training split of the data is performed by means of a stratified random sampling with respect to the class labels. Average test scores over 10 classification tasks with different training data sets are reported in Table 3. Note that for multi-class data sets, the f1 score is calculated as the weighted arithmetic mean over harmonic means [32], where the weight is determined by the number of true instances for each class. Analogously, we calculate the precision score (ratio of true positives to the sum of true positives and false positives) and recall score (ratio of true positives to the sum of true positives and false negatives) as weighted averages over all classes.
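These weighted averages correspond to scikit-learn's `average='weighted'` option; a small sketch with hypothetical labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# 'weighted' averages the per-class scores with weights given by the
# number of true instances of each class, as described in the text.
f1 = f1_score(y_true, y_pred, average="weighted")
prec = precision_score(y_true, y_pred, average="weighted")
rec = recall_score(y_true, y_pred, average="weighted")
```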
Test scores for the benchmarked classifiers on the five real-world data sets from Table 2. We show the means and the corresponding standard deviations (in brackets) over all 10 classification tasks. The best mean results are highlighted in bold.
We find from Table 3 that our CASIMAC exhibits a comparatively good overall score on all data sets considered. In particular, its log-loss is better in all cases than that of the GPC. It is also better than the log-loss of all other classifiers except for the wifi data set on which the MLP is superior. Similarly, the accuracy of our approach is also better than that of the other candidates except for the climate data set on which GPC and kNN perform slightly better. Taking the uncertainties of the results into account, it turns out that in most cases the scores fall within the range of a single standard deviation of each other. In particular the log-loss, however, shows the most discrepancies between the classifiers.
For binary classification problems, it is also interesting to analyze the calibration curves (or reliability diagrams) in addition to the scores [2, 33]. For this purpose, we predict the probability of class 1 for all test samples and discretize the results into ten bins. For each bin, we plot the true fraction of class 1 against the arithmetic mean of the predicted probabilities. For a perfectly calibrated classifier, such a curve corresponds to a diagonal line and deviations from this line can therefore be understood as miscalibrations. As an example, we consider the pine data set and show a typical calibration curve in Fig 8.
Calibration curves for the benchmarked classifiers on the pine data set. The closer the curves are to the diagonal reference line, the better the calibration of the respective classifier. In particular, our method exhibits the best calibration properties.
Clearly, the curve of our CASIMAC is closer to the diagonal reference line than the curves of the other classifiers. In particular, the GPC curve takes on a sigmoidal shape with major deviations from the diagonal at the beginning and the end. In order to obtain a quantitative measure for the observed miscalibrations, we calculate the corresponding area-deviation, that is, the area between each curve and the diagonal reference line. In case of an optimal calibration, this area vanishes, otherwise it is positive. The results are listed in Table 4 and allow us to rank the classifiers by calibration quality: CASIMAC gives by far the best calibration result, followed by kNN and MLP, and with GPC at the very end.
Calibration score measured in terms of the area-deviation (area between each curve and the diagonal reference line) for the calibration curves from Fig 8, which refer to the pine data set. The best result is highlighted in bold.
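The binning and area computation described above can be implemented, for instance, as follows (numpy sketch; the exact bin boundaries and quadrature used for the area-deviation are not specified in the text, so this is one natural reading):

```python
import numpy as np

def calibration_curve(y_true, p1, n_bins=10):
    """Discretize predicted class-1 probabilities into equal-width bins
    and return the mean predicted probability and the true class-1
    fraction per non-empty bin (inputs as numpy arrays)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p1, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_true = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(p1[mask].mean())
            frac_true.append(np.mean(y_true[mask]))
    return np.array(mean_pred), np.array(frac_true)

def area_deviation(mean_pred, frac_true):
    """Area between the calibration curve and the diagonal reference
    line, estimated with the trapezoidal rule; 0 means perfectly
    calibrated, larger values mean stronger miscalibration."""
    order = np.argsort(mean_pred)
    x = mean_pred[order]
    d = np.abs(frac_true[order] - x)
    return float(np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(x)))
```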
In summary, our benchmark on different real-world data sets shows that our proposed CASIMAC can compete with other well-established classifiers and exhibits comparably good calibration properties.
3.3 Visualization
In order to illustrate the use of our compressed latent space representation for visualization purposes (Section 2.4), we consider a simplified version of the data set alcohol from Table 2 and we refer to this simplified version as alcohol-3. It is obtained from alcohol by merging the last three classes into one and, thus, it has only three instead of five classes (that is, n = 3 and m = 10).
We train an exemplary CASIMAC using D = 25 points from alcohol-3 as training datapoints, while using the remaining T = 100 points as test datapoints. As the distance metric d underlying the training data transformation f of our CASIMAC, we choose the Euclidean distance, and for the hyperparameters α, β, kα, kβ of f we choose
As the regression model
underlying our CASIMAC, in turn, we choose the same GPR as in our previous benchmarks, see S3 Appendix.
In Fig 9, we show the reference simplex with the segmentation induced by the segmentation (12) of
. So, the differently colored segments in Fig 9 are nothing but the sets
and by (39), in turn, these simplex segments are precisely the images of the cone segments
under the compression map:
for
.
(a) Compressed latent space representation of the training data. (b) Compressed latent space representation of the test data. Images of (a) the training datapoints under the transformation C ∘ f and of (b) the test datapoints under the transformation . See (46) and (47), respectively. The color of each point represents its true class, while the form of the markers indicates whether a point is correctly classified (∘) or not (♢) by our CASIMAC. All training datapoints are correctly classified by our construction of f and C, see (21) and (39). The scales on the three simplex edges indicate the barycentric coordinates of the simplex points (Lemma 14).
Specifically, Fig 9a displays the compressed latent space representation of the training datapoints x1, …, xD, that is, the points
(46)
where C is the compression map from (38). In view of (21) and (39), any one of these transformed datapoints w1, …, wD belongs to the simplex segment corresponding to its true class label. Also, the distances between points wi of neighboring classes can indicate inter-class relationships in the original feature space data set. We see from Fig 9a, for instance, that the points wi belonging to class 1 or 2, respectively, have a larger distance from each other than from the points of class 3. This leads us to the conclusion that the original classes 1 and 2 are also more clearly separated from each other than classes 1 and 3 or classes 2 and 3 are.
In turn, Fig 9b displays the compressed latent space representation of the test datapoints xD+1, …, xD+T, that is, the points
(47)
Since the regression model is fitted to the training datapoints (x1, f(x1)), …, (xD, f(xD)) only, a test point wi from (47) can lie in the simplex segment corresponding to its true class label but it can also lie in a simplex segment corresponding to a false class label. In the latter case, our CASIMAC leads to an incorrect class label prediction and the degree of misclassification can be gathered from the degree of misplacement of wi. As we can see from Fig 9b, there are misclassifications between members of class 1 and 3 and especially between members of class 2 and 3, but not between points of the classes 1 and 2, as one would expect from our previous observation about inter-class distances in the training data set. We list the detailed classification results of the test data set in the form of a (transposed) confusion matrix in Table 5. As expected, most misclassifications happen between the classes 2 and 3.
Confusion matrix (error matrix): the entry in row k and column l is the number of test datapoints which belong to class k and for which our CASIMAC predicts the class label l. Correct classifications (on the diagonal) are highlighted in bold. There are no misclassifications between members of the classes 1 and 2, as can be expected from Fig 9a.
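The row/column convention of such a confusion matrix can be reproduced with a few lines of numpy (illustrative sketch with hypothetical labels):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry (k, l) counts the test datapoints of true class k for
    which class l is predicted, matching the convention of Table 5
    (class labels encoded as indices 0, ..., n_classes - 1)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```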
3.4 Towards deep learning
Finally, as a proof of concept for a classification task with a larger training data set, we consider the fashion-mnist data set. It consists of 28×28 pixel images of fashion articles in 8-bit grayscale format, as described in [34]. In total, there are D = 60000 training images and T = 10000 test images, which are assigned to n = 10 classes. We perform a min-max normalization of the data before we feed it to our classifier, so that the individual features lie within the range [0, 1].
Concerning the training data transformation f underlying our CASIMAC, we use two approaches. In the first approach, we make the same ansatz (14) for the training data transformation f as in all previous examples, that is,
(48)
As the distance metric d, we choose the Euclidean distance, and we choose the hyperparameters α, β, kα, kβ to be
(49)
In the second approach, we make a mixture ansatz for the training data transformation, namely, we take f to be the average of three training data transformations of the form (14). In short,
(50)
and for all three components f1, f2, f3 we choose the hyperparameters α, β, kα, kβ as in (49). As the distance metric d1 for f1, we again choose the Euclidean distance, but the distance metrics d2 and d3 for f2, f3 we choose in a problem-specific way, namely as the similarity metrics defined by
In these definitions, sw(x, x′) ∈ [−1, 1] is the structural similarity index for two images x, x′ with sliding window size w [35, 36] and we use the implementation from [37]. In particular, d2 and d3 are valid semimetrics because sw is symmetric and because sw(x, x′) = 1 if and only if x = x′, as pointed out on page 106 of [36].
Since our first approach (48) with its purely Euclidean distance metric does not take into account the structural properties of our image data, it can be considered naive. In contrast, our second approach (50) is informed because it brings to bear the fact that the data consists of images, between which a structural relationship can be established. A general overview of such informed machine learning techniques can be found, for instance, in [38].
As the regression model underlying our CASIMAC, we take a fully connected neural network, both in the naive and in the informed approach. The network contains a first hidden layer with 100 neurons and a sigmoid activation function and a second hidden layer with 18 neurons and a linear activation function. The output of the second layer is interpreted as the mean and standard deviation of a normal distribution. We optimize the log-likelihood of this distribution with the Adam optimizer to determine the best model parameters. In total, there are 80318 trainable parameters. This model is realized with the help of TensorFlow Probability [39]. Since we predict a distribution
for each input point x, we can calculate class label probabilities according to (35).
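Up to constants, optimizing the log-likelihood of the predicted normal distribution amounts to minimizing the per-output Gaussian negative log-likelihood; a numpy sketch of this loss (the network architecture itself is omitted here):

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Mean negative log-likelihood of targets y under the normal
    distribution N(mean, std^2) predicted by the network; minimizing
    this loss (e.g. with Adam) fits both the mean and the predicted
    uncertainty jointly."""
    var = std ** 2
    return float(np.mean(0.5 * np.log(2.0 * np.pi * var)
                         + 0.5 * (y - mean) ** 2 / var))
```

A model that predicts the target exactly with unit standard deviation attains the irreducible value 0.5·log(2π); over- or under-confident standard deviations increase the loss.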
In Table 6, we summarize the classification results for our CASIMAC based on the naive training data transformation (48) and our CASIMAC based on the informed training data transformation (50). Specifically, we show the top-1 to top-5 accuracy scores, which are based on the probability prediction of the classifier. It turns out that our informed approach is slightly better or equal to the naive approach for all accuracies. A list of benchmarked accuracies for other classifiers can be found in [34], for instance.
Top-1 to top-5 accuracy of our naive and of our informed CASIMAC on the fashion-mnist data set. In the naive approach we use a purely Euclidean distance metric between the images, whereas the informed approach also takes the structural image similarity into account. The best scores are highlighted in bold.
In contrast to the approach from [7], we use the neural network to directly perform the latent space mapping instead of linking the network to a series of GPs. Additionally, our network directly predicts the estimated mean and variance of the latent space mapping. It would, however, be a promising approach to further improve our results by incorporating a more advanced form of feature extraction like the one from [7].
4 Conclusions and outlook
In this paper, we have introduced a novel classifier called CASIMAC for multi-class classification in arbitrary semimetrizable feature spaces. It is based on the idea of transforming the classification problem into a regression problem. We achieve this by mapping the training data onto a latent space with a simplex-like geometry and subsequently fitting a regression model to the transformed training data. With the help of this regression model, the predictions of our classifier for the class labels and for the class label probabilities can be obtained in a conceptually and computationally simple manner. We have described in detail how our proposed method works and have demonstrated that it can be successfully applied to real-world data sets from various application domains. In particular, we see three major benefits of our approach.
First, it is generic and flexible in the sense that the choice of the particular distance semimetric for our training data transformation and the choice of the regression model underlying our classifier are completely arbitrary. In particular, this capability allows for non-numeric features. Moreover, it enables the integration of additional expert knowledge in the chosen distance metric [38]. For instance, to classify molecules a distance measure reflecting stoichiometry and configuration variations could be applied [40]. Another possible strategy would be to infer the distance metric from the data itself, possibly based on certain informed assumptions [41]. Similarly, expert knowledge could be brought to bear in the training of the regression model, for example, in the form of shape constraints [42–44].
Second, the intuitive latent space representation with its simple geometric concept has a direct interpretation. In particular, this can be exploited to visually detect inter-class relationships and is especially useful for classification problems with a large feature space dimension and a small number of classes.
Third, as our benchmarks have shown, our method leads to classifiers with comparably good prediction and calibration qualities. To determine class probabilities, we only require a regression model with probabilistic predictions. Also, the effort of computing these class probability predictions is quite low, especially compared to the computational effort necessary for GPC. In particular, no complicated approximations are required in our approach and, for a binary classification problem and a regression model with a normally distributed probabilistic prediction, there even exists a closed-form expression for the class probability predictions.
A challenge of our method is that its training requires the calculation of nearest datapoint neighbors, which is computationally expensive for larger data sets [45]. It would be a natural starting point for further studies to investigate how this computational limitation in the training of our classifiers can be overcome. A related challenge of our method is that the tuning of hyperparameters can be costly. Instead of using a cross-validated grid search like we did in this paper, it could be advantageous to consider more elaborate strategies, for example, to infer the hyperparameters from the statistical properties of the training data. We leave this topic as an open question for future research.
Supporting information
S1 Appendix. A: Segmentation of latent space and core properties of calibrated simplex-mapping classifiers.
Mathematical background of the method including detailed proofs.
https://doi.org/10.1371/journal.pone.0279876.s001
(PDF)
S2 Appendix. B: Implementation.
Description of the Python implementation of the proposed method.
https://doi.org/10.1371/journal.pone.0279876.s002
(PDF)
S3 Appendix. C: Benchmark.
Description of the hyperparameters for the numerical benchmarks.
https://doi.org/10.1371/journal.pone.0279876.s003
(PDF)
Acknowledgments
We would like to thank Janis Keuper and Jürgen Franke for their helpful and constructive comments.
References
- 1. Challis E, Hurley P, Serra L, Bozzali M, Oliver S, Cercignani M. Gaussian process classification of Alzheimer’s disease and mild cognitive impairment from resting-state fMRI. NeuroImage. 2015;112:232–243. pmid:25731993
- 2.
Niculescu-Mizil A, Caruana R. Predicting Good Probabilities with Supervised Learning. In: International Conference on Machine Learning. ICML’05. New York, NY, USA: ACM; 2005. p. 625–632.
- 3.
Platt JC. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In: Advances in Large Margin Classifiers. MIT Press; 1999. p. 61–74.
- 4.
Zadrozny B, Elkan C. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In: International Conference on Knowledge Discovery and Data Mining. KDD’02. New York, NY, USA: ACM; 2002. p. 694–699.
- 5.
Gebel M. Multivariate calibration of classification scores into the probability space. PhD thesis, University of Dortmund; 2009.
- 6.
Calandra R, Peters J, Rasmussen CE, Deisenroth MP. Manifold Gaussian processes for regression. In: International Joint Conference on Neural Networks. IJCNN’16. IEEE; 2016. p. 3338–3345.
- 7. Wilson AG, Hu Z, Salakhutdinov RR, Xing EP. Stochastic variational deep kernel learning. In: Advances in Neural Information Processing Systems; 2016. p. 2586–2594.
- 8. Al-Shedivat M, Wilson AG, Saatchi Y, Hu Z, Xing EP. Learning scalable deep kernels with recurrent structure. The Journal of Machine Learning Research. 2017;18(1):2850–2886. pmid:30662374
- 9.
Bradshaw J, Matthews AG, Ghahramani Z. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks. preprint arXiv:170702476. 2017.
- 10.
Daskalakis C, Dellaportas P, Panos A. Scalable Gaussian Processes, with Guarantees: Kernel Approximations and Deep Feature Extraction. preprint arXiv:200401584. 2020.
- 11.
Iwata T, Ghahramani Z. Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes. preprint arXiv:170705922. 2017.
- 12.
Cremanns K, Roos D. Deep Gaussian Covariance Network. preprint arXiv:171006202. 2017.
- 13.
Liu H, Ong YS, Shen X, Cai J. When Gaussian Process Meets Big Data: A Review of Scalable GPs. preprint arXiv:180701065. 2018.
- 14. Ludl PO, Heese R, Höller J, Asprion N, Bortz M. Using machine learning models to explore the solution space of large nonlinear systems underlying flowsheet simulations with constraints. Frontiers of Chemical Science and Engineering. 2021.
- 15.
Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. Adaptative computation and machine learning series. University Press Group Limited; 2006.
- 16.
Murphy KP. Machine learning: a probabilistic perspective. Adaptive computation and machine learning series. MIT Press; 2012.
- 17.
Tran D, Ranganath R, Blei DM. The Variational Gaussian Process. preprint arXiv:151106499. 2015.
- 18.
Hensman J, Matthews AG, Filippone M, Ghahramani Z. MCMC for Variationally Sparse Gaussian Processes. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, editors. Advances in Neural Information Processing Systems 28. Curran Associates, Inc.; 2015. p. 1648–1656.
- 19.
GPy. GPy: A Gaussian process framework in Python; 2012. Available online: http://github.com/SheffieldML/GPy.
- 20.
Dugundji J. Topology. Allyn and Bacon; 1966.
- 21. Wilson WA. On semi-metric spaces. American Journal of Mathematics. 1931;53:361–373.
- 22.
Deza MM, Deza E. Encyclopedia of Distances. Springer, Berlin; 2016.
- 23.
Heese R. CASIMAC: Calibrated simplex mapping classifier in Python; 2022. Available online: https://github.com/raoulheese/casimac.
- 24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
- 25.
Dua D, Graff C. UCI Machine Learning Repository; 2017. Available online: http://archive.ics.uci.edu/ml.
- 26. Adak MF, Lieberzeit P, Jarujamrus P, Yumusak N. Classification of alcohols obtained by QCM sensors with different characteristics using ABC based neural network. Engineering Science and Technology, an International Journal. 2019.
- 27. Lucas DD, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D, et al. Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development. 2013;6(4):1157–1171.
- 28. Rögnvaldsson T, You L, Garwicz D. State of the art prediction of HIV-1 protease cleavage sites. Bioinformatics. 2014;31(8):1204–1210.
- 29.
Baumgardner MF, Biehl LL, Landgrebe DA. 220 Band AVIRIS Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3; 2015. Available online: https://purr.purdue.edu/publications/1947/1.
- 30.
Graña M, Veganzons MA, Ayerdi B. Hyperspectral Remote Sensing Scenes; 2014. Available online: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.
- 31.
Rohra JG, Perumal B, Narayanan SJ, Thakur P, Bhatt RB. User Localization in an Indoor Environment Using Fuzzy Hybrid of Particle Swarm Optimization & Gravitational Search Algorithm with Neural Networks. In: International Conference on Soft Computing for Problem Solving. SocProS’16; 2016. p. 286–295.
- 32.
Opitz J, Burst S. Macro F1 and Macro F1. preprint arXiv:191103347. 2019.
- 33. DeGroot MH, Fienberg SE. The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society Series D (The Statistician). 1983;32(1/2):12–22.
- 34.
Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. preprint arXiv:170807747. 2017.
- 35. Zhou Wang, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. 2004;13(4):600–612.
- 36. Wang Z, Bovik AC. Mean squared error: Love it or leave it? A new look at Signal Fidelity Measures. IEEE Signal Processing Magazine. 2009;26(1):98–117.
- 37. Van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, et al. scikit-image: image processing in Python. PeerJ. 2014;2:e453. pmid:25024921
- 38. Rueden Lv, Mayer S, Beckh K, Georgiev B, Giesselbach S, Heese R, et al. Informed Machine Learning—A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems. IEEE Transactions on Knowledge and Data Engineering. 2021; p. 1.
- 39.
Dillon JV, Langmore I, Tran D, Brevdo E, Vasudevan S, Moore D, et al. TensorFlow Distributions. preprint arXiv:171110604. 2017;abs/1711.10604.
- 40. Rupp M, Tkatchenko A, Müller KR, Von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Physical review letters. 2012;108(5):058301. pmid:22400967
- 41.
Bellet A, Habrard A, Sebban M. A Survey on Metric Learning for Feature Vectors and Structured Data. preprint arXiv:13066709. 2013.
- 42. Kurnatowski Mv, Schmid J, Link P, Zache R, Morand L, Kraft T, et al. Compensating data shortages in manufacturing with monotonicity knowledge. Algorithms. 2021;14(12).
- 43. Schmid J. Approximation, characterization, and continuity of multivariate monotonic regression functions. Analysis and Applications. 2021.
- 44. Link P, Poursanidis M, Schmid J, Zache R, von Kurnatowski M, Teicher U, et al. Capturing and incorporating expert knowledge into machine learning models for quality prediction in manufacturing. Journal of Intelligent Manufacturing. 2022;33(7):2129–2142.
- 45. Dhanabal S, Chandramathi DS. A Review of various k-Nearest Neighbor Query Processing Techniques. International Journal of Computer Applications. 2011;31(7):14–22.