
A3SOM, abstained explainable semi-supervised neural network based on self-organizing map

  • Constance Creux,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Univ Evry, IBISC, Université Paris-Saclay, Evry-Courcouronnes, France

  • Farida Zehraoui,

    Roles Conceptualization, Formal analysis, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    farida.zehraoui@univ-evry.fr

    Affiliation Univ Evry, IBISC, Université Paris-Saclay, Evry-Courcouronnes, France

  • Blaise Hanczar,

    Roles Supervision, Writing – review & editing

    Affiliation Univ Evry, IBISC, Université Paris-Saclay, Evry-Courcouronnes, France

  • Fariza Tahi

    Roles Funding acquisition, Project administration, Supervision, Writing – review & editing

    Affiliation Univ Evry, IBISC, Université Paris-Saclay, Evry-Courcouronnes, France

Abstract

In the sea of data generated daily, unlabeled samples greatly outnumber labeled ones, because in many application areas labels are scarce or hard to obtain. In addition, unlabeled samples might belong to new classes that are absent from the label set associated with the data. In this context, we propose A3SOM, an abstained explainable semi-supervised neural network that couples a self-organizing map with dense layers to classify samples. Abstained classification enables the detection of new classes and of class overlaps. The use of a self-organizing map in A3SOM provides integrated visualization and makes the model explainable. Along with describing our approach, this paper shows that the method is competitive with other classifiers and demonstrates the benefits of including abstention rules. A use case on breast cancer subtype classification and discovery shows the relevance of our method to real-world medical problems.

1 Introduction

Technological advances and the increased capabilities of modern computers have led to the production of massive amounts of data. This is true in many domains, such as the biomedical field, where images, genomic data, and other complex data types are created faster than humans can process them. As a response, applications of artificial intelligence, and deep learning specifically, have become commonplace in these fields [1–6]. A significant part of this data remains unlabeled, i.e., it has not been assigned a label, as extracting information from it is challenging. Unsupervised learning, especially clustering, is typically used as a first step to explore data by grouping samples that exhibit similar patterns. In this work, we are interested in classification problems, for which supervised learning methods are generally used. However, in some real-world applications, labels are available for only a few samples. In this case, adding the information carried by unlabeled samples can improve the separation between classes, leading to better classification results.

The missing-labels problem can be decomposed into two subproblems. The first, and most common, arises when some training examples have no labels but all classes are represented in the training set; semi-supervised learning mainly focuses on this setting. The second concerns the absence of labeled examples for some classes in the training set. For instance, in oncology, cancers with the same primary site can have multiple subtypes associated with different clinical outcomes; as more and more patients are studied, a new group of patients with a previously unseen cancer subtype may be discovered. A further classification problem is that some samples may be located in areas of overlap between classes.
To deal with the last two issues, we can use abstained classification, also known as reject or selective classification. It enables the model not to classify observations if prediction confidence is too low. Abstention, when combined with semi-supervised classification, can help discover new classes in the data.

In the past decade, interest in explainable artificial intelligence models has soared. This move away from so-called ‘black-box’ models encourages developers to create models that can be understood by humans. In that regard, self-organizing maps (SOM) are a particularly interesting type of neural network: they learn to place training samples on a topographic map of neurons. These neurons are prototypes that represent the data and can be used to offer a prototype-based explanation. Moreover, the map is two-dimensional and can be visualized.

In this context, we propose A3SOM, an Abstained explainable Semi-Supervised neural network based on a Self-Organizing Map. Dense layers are linked to the SOM to exploit available labels and associate meaning with the SOM neurons. The architecture of our method makes it explainable through the local arrangement of neurons, their visualization, as well as the use of two distinct abstention rules. Our model is able to perform two tasks:

  • Standard semi-supervised classification, which leverages both unlabeled and labeled data complemented with visualization and explainability.
  • Abstained semi-supervised classification, which allows the detection of classification ambiguities and the discovery of new classes.

This paper is organized as follows. We first present related works on semi-supervised classification, abstained classification, and extensions of SOM. We then detail our approach before presenting experimental results for the two tasks that A3SOM can perform, and show an application on breast cancer subtype classification. Finally, we conclude and present perspectives for future work.

2 Related works

This section presents the key points identified in the introduction: semi-supervised learning and abstained classification. As the SOM is an integral part of our algorithm, we also present extensions made to this model in recent years, including semi-supervised SOM and abstained SOM.

2.1 Semi-supervised learning

Semi-supervised learning is a middle ground between supervised and unsupervised methods: it uses labels when available, but unlabeled samples are not discarded, and are instead used in various ways to improve performance. Its use is particularly relevant when unlabeled data outnumbers labeled data [7]. Over the years, many semi-supervised methods have been developed, especially for image classification tasks [8–11].

In [12], multiple types of semi-supervised classification methods are described, with different ways of integrating unlabeled samples: wrapper methods, unsupervised preprocessing, intrinsically semi-supervised methods, and transductive methods.

Wrapper methods are based on classifiers. In the case of self-training [13, 14], a classifier is trained on labeled samples, and a class is predicted for all samples, including unlabeled ones. Predictions with high confidence are added to the set of training labels, and the process is repeated. Wrapper methods also include approaches based on boosting [15] or co-training [16].
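The self-training loop described above can be sketched in a few lines. This is a generic illustration, not the implementation of [13, 14]: the classifier, confidence threshold, and toy data are our own choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: 3 well-separated classes; hide ~90% of the labels (-1 = unlabeled).
X, y = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)
rng = np.random.default_rng(0)
labels = y.copy()
labels[rng.random(len(y)) < 0.9] = -1
for c in range(3):                      # keep at least one label per class
    labels[np.where(y == c)[0][0]] = c

def self_train(X, labels, threshold=0.95, max_rounds=10):
    labels = labels.copy()
    clf = LogisticRegression()
    for _ in range(max_rounds):
        mask = labels != -1
        clf.fit(X[mask], labels[mask])        # train on (pseudo-)labeled set
        conf = clf.predict_proba(X).max(axis=1)
        new = (labels == -1) & (conf >= threshold)
        if not new.any():                     # no confident prediction left
            break
        labels[new] = clf.predict(X[new])     # add confident pseudo-labels
    return clf

clf = self_train(X, labels)
accuracy = (clf.predict(X) == y).mean()
```

Each round grows the labeled set with high-confidence predictions, so the final classifier is trained on far more samples than were originally labeled.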

Unsupervised preprocessing is when computations are performed on unlabeled data to make training on labeled data more efficient. Feature extraction can be done on unlabeled data, by transforming the original dataset to a new one with fewer features, for example with AutoEncoders [17, 18]. Clustering can also be used to represent the unlabeled data in groups before classification [19]. Preprocessing can also refer to pre-training [20, 21], meaning that weights of the model are first learned on unsupervised data, and they only need to be adjusted (fine-tuned) with labeled data to fit the classification problem better.

Intrinsically semi-supervised methods include both labeled and unlabeled data in the error computation. For example, this can be done by adding a regularization term based on unsupervised samples in the loss function [2224], or by adapting existing architectures such as GAN (Generative Adversarial Networks) by associating one output to fake data points [25].

Transductive methods are typically graph-based methods, where label information is propagated along edges [26]. One commonly used method is label propagation [27], which uses neighboring data to predict the labels of unlabeled samples.

2.2 Abstained classification

In medical diagnosis and other fields like fraud detection or self-driving vehicles, wrongly classifying a sample can have critical consequences. Abstention, also called rejection, is a branch of classification that enables the model to abstain from returning predictions when confidence is too low. Abstained classification is primarily associated with supervised learning models; semi-supervised applications are limited to very particular data types presenting spatial relationships that can be exploited for abstention [28–32]. Different abstention criteria can be defined by applying thresholds to the predicted probabilities.

2.2.1. Abstention criteria.

It is possible to differentiate between two rules that explain why abstention is applied to a prediction: the distance rule and the ambiguity rule. In [33, 34], both types are described in the context of neural networks with a sigmoidal output. The distance (or novelty) rule identifies samples that do not seem to belong to any known class: a prediction is abstained from if the highest predicted probability for a sample is lower than a threshold. This concept is similar to open-set classification problems [35–38], which account for situations where test data might not follow the same distribution as training data. The ambiguity rule detects instances where two classes overlap; here, the threshold is applied to the difference between the two highest predicted probabilities.

2.2.2. Thresholds.

Applying thresholds to output predictions is common. In pioneering work on rejection [39], Chow defines a rule with a global rejection threshold: the model abstains from classifying an input sample if the probability that it belongs to a class is lower than a predefined threshold that depends on the abstention cost. This approach assumes complete knowledge of the a priori class distribution and the a posteriori probabilities, which is rarely the case in real-world problems. In [40], the authors propose using one threshold per class, i.e., local thresholds, instead of one global threshold for all classes. They show that local thresholds lead to better results when class probabilities are estimated.

2.3 Extensions of SOM

Self-organizing maps were first described as an unsupervised neural network by Kohonen [41] and present unique properties for clustering and interpretability. Many extensions of the SOM model have been proposed, often extending SOM to supervised learning [42, 43], either by including the label as one of the features [44, 45] or by labeling the SOM neurons a posteriori, typically with majority voting [46, 47].

2.3.1 Semi-supervised SOM.

SOM have seldom been integrated into semi-supervised contexts. Existing works can primarily be categorized as intrinsically semi-supervised, as unlabeled and labeled samples are processed in the same step to optimize the model: an error computed from both unlabeled and labeled information is used to modify SOM neurons or to add new ones [48–52]. Adding neurons to SOM can create clusters more relevant to real-life organization, but has the disadvantage of deforming the map, making its visualization less straightforward. Unsupervised preprocessing is also employed with SOM, by training a first unsupervised SOM before using it in conjunction with label information [53]. As SOM present a meaningful topology, transductive methods such as label propagation can also be used [54].

2.3.2 Abstained SOM.

To our knowledge, only four works associate abstention with SOM, all for supervised classification. ROSOM [55] trains a SOM, then labels the neurons with a majority vote; according to a rejection threshold, some neurons are instead associated with rejection. Another method uses a combination of SOM and the traveling salesman algorithm to reject the classification of outliers [56]. Both of these works detect what we call distance abstention. IRSOM [57] uses a SOM in a binary classification problem to detect ambiguities. Finally, SLSOM [58] trains an unsupervised SOM followed by a dense layer to perform classification; this method separates ambiguity and distance rejection and uses global thresholds.

2.3.2.1 Identified gaps in related works. There is a lack of methods combining semi-supervised learning and abstained classification. To our knowledge, only a few methods integrate both concepts: almost all of them can only be applied to images [28–31], except for one suited to graphs [32]. In this work, we propose an intrinsically semi-supervised method that includes the SOM in the regularization process and can be used on tabular data. When associated with local abstention options, our approach enables the discovery of new classes and the explanation of predictions.

3 Method

We present a novel semi-supervised learning approach called A3SOM. It is based on self-organizing maps and therefore includes data visualization. Fig 1 shows the inputs and outputs of A3SOM, illustrating the two tasks it can perform: standard or abstained classification.

Fig 1. Overview of A3SOM method.

Input data is tabular and may have missing labels. A3SOM can perform standard classification or abstained classification. In both cases, results can be visualized and interpreted.

https://doi.org/10.1371/journal.pone.0286137.g001

Data is first given as input to a SOM, which clusters the samples and provides data visualization. The output of the SOM is then fed to a block of dense layers, which produces the output. Unlike standard SOM, which are based on unsupervised learning, we use the SOM in a semi-supervised learning process where the positions of SOM units are influenced not only by the distances between instances (unlabeled and labeled) and neuron prototypes, but also by the labels used in the training step. There are two options for prediction. The first is a standard classification task with visualization and interpretation (Fig 1a); the second is an abstained classification task capable of detecting ambiguities and discovering new classes through the use of the SOM and the analysis of its results (Fig 1b).

In this section, we first present the full A3SOM learning algorithm. We then show how our method performs standard classification, and how we can explain the results. We finally present how our approach can be used for abstained classification where new classes can be discovered and ambiguities between existing classes can be detected.

3.1 A3SOM training phase

Let X = X_U ∪ X_L be the set of samples, where X_U are the unlabeled input samples and X_L the labeled ones. Y_L are the true labels associated with X_L. The set of predicted labels is defined as Ŷ = Ŷ_U ∪ Ŷ_L, where Ŷ_U and Ŷ_L are associated respectively with X_U and X_L. N is the total number of samples in X, and N_L is the number of labeled samples. The training of the network is carried out in two dependent phases: forward propagation and backpropagation. During the forward propagation step, data is propagated through the different layers of the network, as shown in Fig 2, from the input layer to the last layer of the dense block. Class prediction is then performed, and an error is calculated using the total loss function, which is composed of two main terms: the distortion and the categorical cross-entropy. Afterward, gradient descent is used during the backpropagation step to update the model parameters in the different layers of the network, depending on the error. In this step, parameters are modified from the last dense layer to the input layer.

Fig 2. A3SOM training phase.

Information is propagated from the input layer to the self-organizing map and to fully-connected dense layers, to obtain class predictions. A loss composed of the categorical cross-entropy and the distortion of the SOM is computed. Then, the weights of the model are updated during backpropagation.

https://doi.org/10.1371/journal.pone.0286137.g002

3.1.1 Forward propagation.

3.1.1.1 Self-organizing map. The matrix X is given to the SOM as input. The map is composed of neurons arranged on a two-dimensional grid. Each neuron unit u in the SOM is associated with two vectors: a vector r_u that represents the coordinates of u in the grid, and a weight vector w_u, which has the same dimension as the input vectors x_i ∈ X, with x_i ∈ ℝ^m, m being the number of features in the original data space. Each neuron unit corresponds to a cluster, and the weight vector associated with the unit is the prototype that represents the cluster.

The first step of the SOM algorithm is to identify the neuron closest to each sample. These closest neurons, called Best-Matching Units (BMUs) or winner neurons, are found by measuring a distance, for example the Euclidean distance, between the sample and the SOM prototypes. The BMU of a sample x_i is the neuron with the smallest distance to x_i:

bmu(x_i) = argmin_{u ∈ U} d(x_i, w_u)   (1)

where U is the set of neurons and d() the distance measure.

In the original SOM model, the prototypes w_u(t) of the BMU and its neighbors are updated towards the sample x_i:

w_u(t+1) = w_u(t) + α(t) h(bmu(x_i), u) (x_i − w_u(t))   (2)

where α(t) is the learning rate set by the optimizer and the neighborhood function is h(u, v) = exp(−d′(r_u, r_v)² / (2σ(t)²)). σ(t) is the neighborhood function’s radius, and d′() is the distance between the coordinate vectors of the units in the map, which can be different from the previously defined d().

In this method, we use a variant of the SOM weight update, in which weights are updated after seeing mini-batches of data. It consists of two steps: a fixed assignment step, after which weights are updated by minimizing a measure of the SOM error (defined in Eq 8). The assignment step consists in finding the BMU for each sample in the current batch of data. Then, the following formula is used to update the prototype weights:

w_v = ( Σ_i Σ_{u ∈ U} Ind(x_i, u) h(u, v) x_i ) / ( Σ_i Σ_{u ∈ U} Ind(x_i, u) h(u, v) )   (3)

where Ind(x_i, v) is the indicator matrix, in which all entries are equal to 0 except those where the BMU of sample x_i is the neuron v, where it is equal to 1.
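The assignment and batch update steps can be sketched in NumPy as follows. The map size, radius schedule, and random data are illustrative; this is a generic batch SOM sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((256, 4))                      # 256 samples, m = 4 features

# A 5x5 map: grid coordinates r_u and prototype vectors w_u.
rows, cols = 5, 5
coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
W = rng.random((rows * cols, 4))

def bmu(batch, W):
    # Eq 1: index of the prototype closest to each sample (Euclidean distance).
    d = np.linalg.norm(batch[:, None, :] - W[None, :, :], axis=2)
    return d.argmin(axis=1)

def batch_update(batch, W, sigma):
    b = bmu(batch, W)                         # assignment step
    # Gaussian neighborhood h(bmu(x_i), v) measured on the 2-D grid.
    grid_d2 = ((coords[b][:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-grid_d2 / (2 * sigma ** 2))   # shape (batch, n_units)
    # Batch update: each prototype moves to the h-weighted mean of the
    # samples assigned near it.
    return (h.T @ batch) / h.sum(axis=0)[:, None]

for sigma in np.linspace(2.0, 0.5, 20):       # shrink the neighborhood radius
    for start in range(0, len(X), 64):
        W = batch_update(X[start:start + 64], W, sigma)

quantization_error = np.linalg.norm(X - W[bmu(X, W)], axis=1).mean()
```

Shrinking the radius over time first orders the map globally, then fine-tunes each prototype locally.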

The activation of a neuron u ∈ U corresponds to the distance between a data sample and the prototype associated with the neuron. It is transmitted to the dense block and can be computed as follows:

a_u(x_i) = d(x_i, w_u)   (4)

3.1.1.2 Dense block. The dense block consists of B fully-connected dense layers, B being a hyperparameter of the model. The first layer receives the output of the SOM as input. The activation of neuron l in the hidden layer j (1 ≤ j ≤ B) is defined as:

a_l^(j) = f( Σ_k W_lk^(j) a_k^(j−1) + b_l^(j) )   (5)

where W_lk^(j) is the weight of the connection linking the neuron l of the layer j to the neuron k of the layer (j − 1), b_l^(j) is the bias of the neuron l in the layer j, and a_k^(j−1) the activation of the neuron k in the layer (j − 1) (a_k^(0) = a_{u_k}(x_i), the activation of the unit u_k in the SOM). f() is the activation function: specifically, we use the Rectified Linear Unit (ReLU) function, where ReLU(x) = max(0, x).

The last dense layer is composed of as many neurons as there are classes in the training set. The output of a neuron c in the last layer is obtained by:

z_ic = Σ_k W_ck^(out) a_k^(B) + b_c^(out)   (6)

where W_ck^(out) and b_c^(out) are respectively the weights and the biases of the output layer and a_k^(B) the activation of the last hidden layer, k is the index of the neuron and c the index of the class. After the last layer, the class-membership probabilities for all the samples in X can be obtained using an activation function f_out(). We will use the softmax or the sigmoid activation function depending on the task performed.
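The forward pass through the dense block amounts to a standard multi-layer perceptron applied to the SOM activations; a minimal sketch with illustrative layer sizes (a 5x5 map feeding one hidden layer of 16 units and 3 output classes):

```python
import numpy as np

relu = lambda x: np.maximum(0, x)       # ReLU(x) = max(0, x)

def dense_block(a0, weights, biases):
    """a0: SOM activations; weights/biases: one pair per dense layer."""
    a = a0
    for Wj, bj in zip(weights[:-1], biases[:-1]):
        a = relu(Wj @ a + bj)            # hidden layers use ReLU
    return weights[-1] @ a + biases[-1]  # last layer is linear; f_out applied later

rng = np.random.default_rng(0)
som_activations = rng.random(25)        # e.g. a 5x5 map -> 25 activations
weights = [rng.standard_normal((16, 25)), rng.standard_normal((3, 16))]
biases = [np.zeros(16), np.zeros(3)]
logits = dense_block(som_activations, weights, biases)   # one value per class
```

The returned logits are what the softmax or sigmoid activation is applied to, depending on the task.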

3.1.2 Backpropagation.

Gradient descent is used to update the weights of the network neurons, in both the SOM and the dense block. The semi-supervised loss is composed of two terms.

  • The first is a supervised term, the cross-entropy cost function, which is defined as:

    L_CE = −(1/N_L) Σ_{i=1}^{N_L} Σ_{c=1}^{C} y_ic log(ŷ_ic)   (7)

    where C is the number of classes, ŷ_ic is the output of the neuron c (corresponding to the class c) in the last layer for a sample x_i, and y_ic is the cth component of the vector y_i representing the true label associated to the sample x_i.
    Note that y_i, for x_i ∈ X_L, is represented as a one-hot encoded vector of dimension C, i.e. y_i = (y_i1, …, y_iC) with y_ic = 1 if y_i corresponds to the class c and y_ic = 0 otherwise.
  • The second term corresponds to the distortion. Assessing the quality of the SOM is more complex than simply using quantitative criteria as in most clustering validation problems [59]: we also want to know whether the map preserves topological relationships, i.e., whether neighboring samples in the original data space are also neighbors on the map. Distortion is the cost function of SOM that comprises both quantization and topology errors. It measures the average error made when projecting data from its original space onto the SOM, weighted by the neighborhood function. It is defined by:

    L_dist = (1/N) Σ_{i=1}^{N} Σ_{u ∈ U} h(bmu(x_i), u) d(x_i, w_u)²   (8)

    where x_i ∈ X.

In a semi-supervised context, labels may not be available for all samples, which means that some samples cannot be used in the calculation of the cross-entropy. However, these unlabeled samples are integrated into the calculation of the distortion. The loss function can be formulated as follows:

L = L_CE + γ L_dist + η ||w||²   (9)

where ||w||² is a regularization term, and γ and η are the parameters that control the importance of the distortion and the regularization term in the loss.
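The combination of a masked supervised term and an unsupervised distortion term can be sketched in NumPy as follows. The notation, the all-zero-row convention for unlabeled samples, and the default γ and η values are our own illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def semi_supervised_loss(y_true, y_pred, X, W, coords, sigma,
                         gamma=0.1, eta=1e-4):
    """Sketch of the total loss: masked cross-entropy + gamma * distortion
    + eta * ||W||^2. y_true rows are one-hot; an all-zero row marks an
    unlabeled sample (excluded from the supervised term only)."""
    labeled = y_true.sum(axis=1) > 0
    eps = 1e-12
    # Supervised term: cross-entropy over labeled samples only.
    ce = -(y_true[labeled] * np.log(y_pred[labeled] + eps)).sum(axis=1).mean()
    # Unsupervised term: distortion over ALL samples (labeled and unlabeled).
    b = np.linalg.norm(X[:, None] - W[None], axis=2).argmin(axis=1)  # BMUs
    grid_d2 = ((coords[b][:, None] - coords[None]) ** 2).sum(axis=2)
    h = np.exp(-grid_d2 / (2 * sigma ** 2))          # neighborhood weights
    sq = ((X[:, None] - W[None]) ** 2).sum(axis=2)   # squared distances
    distortion = (h * sq).sum(axis=1).mean()
    return ce + gamma * distortion + eta * (W ** 2).sum()
```

Because the distortion term sees every sample, unlabeled data still shapes the map even when it contributes nothing to the cross-entropy.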

3.2 A3SOM prediction phase

Compared to the majority of semi-supervised approaches, the originality of our approach lies in the integration of the SOM inside the network architecture, allowing analysis and interpretation of the results. Indeed, the locations of the SOM neurons are influenced by the existing labels and by the distances between all data samples and the SOM weight vectors. Below, we present the prediction phases corresponding to the standard classification task and the abstained classification task.

3.2.1 A3SOM for classification task.

After the last layer, the class-membership probabilities for all the samples in X can be obtained by using the softmax activation function. Here, it is an appropriate choice of activation function: we are in a multi-class classification problem, where we consider each class to be mutually exclusive. In this case, the output defined in Eq 6 can be re-written as:

ŷ_ic = softmax(z_ic) = exp(z_ic) / Σ_{c′=1}^{C} exp(z_ic′)   (10)

Thanks to the use of a SOM in our model, additional outputs can be obtained along with class predictions. This additional information can be visualized and makes the predictions explainable. We can return the BMU prototype of each sample and analyze the properties of the samples contained in the cluster represented by the BMU. This type of interpretation is called a prototype-based explanation [60]. In addition, we can study the BMU’s neighbors since neurons that are close on the map share similar semantic concepts. The labels associated with the neurons can also be analyzed in order to understand if the sample is close to neurons that are all associated with the same class or not. An illustration of prediction explanation is given in Fig 3, where A3SOM predicts an input sample xt from the test set. A class prediction is returned, as well as a prototype-based explanation associated with different visualizations.

Fig 3. Interpretation of prediction.

The model predicts the class of an input sample xt as purple. This sample is represented by the prototype P(1,1) which can be visualized.

https://doi.org/10.1371/journal.pone.0286137.g003

In Fig 3a, the SOM neurons are colored based on the labels of the samples they represent. We see that all of the samples represented by P(1,1) are from the same class as xt. xt is also close to P(1,2), which means it has similarities with class blue. In Fig 3b, the SOM neurons are represented by their values for each of the dataset’s features. The feature profile of P(1,1) is similar to that of the other prototypes of class purple, with high values for the second feature and lower values for the others. Among the prototypes of class purple, P(1,1) has the highest values for the last two features, which typically take high values in class blue, the neighboring class represented by P(1,2).

3.2.2 A3SOM for abstained classification task.

In order to analyze ambiguities between existing classes and discover new classes, we propose an extension of the model to perform abstained classification, illustrated in Fig 4. An abstention task is plugged into the classifier’s output, and the decision is based on the estimation of the posterior class probabilities.

Fig 4. Abstained classification task overview.

The abstention task is applied after the model has been trained. It produces a new set of labels that can be visualized.

https://doi.org/10.1371/journal.pone.0286137.g004

We use the two abstention rules described in Section 2.2. The first is the distance rule, which abstains from predicting samples that are far from the learned classes and potentially belong to new classes. The second is the ambiguity rule, which abstains from predicting samples that lie in ambiguous classification areas (where several classes overlap). We make the assumption that a sample can only belong to one class. Both rules depend on the outputs of the neurons in the output layer, ŷ_ic, which correspond to the posterior probability P(c|x_i) that an input x_i is an element of the class c.

In order to distinguish the two types of abstention, the output is defined using a sigmoid activation function. Here, the output defined in Eq 6 can be re-written as:

ŷ_ic = sigmoid(z_ic) = 1 / (1 + exp(−z_ic))   (11)

Indeed, while the softmax function is appropriate for standard classification, its use in abstained classification is unsuitable. A softmax function forces all probabilities to sum to one, which is undesirable in a context where a sample is far from all known classes, or located in the overlap between several classes. Since probabilities returned for each class by the sigmoid function are independent, it is more appropriate for abstained classification problems.
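A small numeric illustration of this point: for a sample whose logits are all low (i.e., far from every class), softmax still produces a seemingly confident winner, while the sigmoid outputs all stay small. The logit values are arbitrary.

```python
import numpy as np

z = np.array([-3.0, -2.5, -3.2])            # low logits: far from all classes

softmax = np.exp(z) / np.exp(z).sum()       # forced to sum to 1
sigmoid = 1 / (1 + np.exp(-z))              # independent per-class scores

# softmax.max() is close to 0.5 despite the sample fitting no class,
# whereas every sigmoid output stays below 0.1, enabling distance abstention.
```

This is why the sigmoid output is the natural choice for the abstained task.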

The thresholds for abstention can be defined globally (the same threshold for all classes) or locally (one threshold per class). A3SOM uses local thresholds, as they are more flexible and better suited to obtaining optimal decision and abstention regions. Thus, in the definitions below, one threshold βc is defined for each predicted class c.

3.2.2.1 Distance abstention rule. In the distance rule, we abstain from returning a prediction if the largest output ŷ_ic* is lower than a threshold β^dist_{c*}. Thus we can define the distance abstention rule rule_dist by:

rule_dist(x_i): abstain if ŷ_ic* < β^dist_{c*}   (12)

where ŷ_ic* = max_c{ŷ_ic} and c* = argmax_c{ŷ_ic}.

A low distance threshold leads to few abstentions, because the highest probability rarely falls below it; as the threshold increases, the distance rule becomes more stringent.

3.2.2.2 Ambiguity abstention rule. In the ambiguity rule, we abstain from predicting the input x_i if the difference between the two largest output probabilities ŷ_ic* and ŷ_ic** (ŷ_ic* = max_c{ŷ_ic} and ŷ_ic** = max_{c ≠ c*}{ŷ_ic}) is lower than a threshold β^amb_{c*}. We can define the ambiguity abstention rule rule_amb by:

rule_amb(x_i): abstain if ŷ_ic* − ŷ_ic** < β^amb_{c*}   (13)

where c* = argmax_c{ŷ_ic}.

A small ambiguity threshold means we will only abstain from prediction when the two highest probabilities are very close. The higher the ambiguity threshold, the more we abstain from predictions.
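Both rules can be sketched as post-hoc filters on the sigmoid outputs. The threshold values and the −1/−2 abstention codes below are illustrative conventions, not part of the paper.

```python
import numpy as np

def abstained_predict(Y, beta_dist, beta_amb):
    """Y: per-class sigmoid outputs, shape (n, C).
    beta_dist, beta_amb: one local threshold per class.
    Returns the class index, -2 for ambiguity abstention,
    or -1 for distance abstention."""
    order = np.argsort(Y, axis=1)
    c_star, c_second = order[:, -1], order[:, -2]
    best = Y[np.arange(len(Y)), c_star]
    second = Y[np.arange(len(Y)), c_second]
    pred = c_star.copy()
    pred[best - second < beta_amb[c_star]] = -2   # classes overlap
    pred[best < beta_dist[c_star]] = -1           # far from every class
    return pred

Y = np.array([[0.95, 0.10, 0.05],    # clearly class 0
              [0.60, 0.55, 0.05],    # classes 0 and 1 overlap
              [0.08, 0.06, 0.09]])   # low probability for every class
beta_dist = np.full(3, 0.30)
beta_amb = np.full(3, 0.20)
predictions = abstained_predict(Y, beta_dist, beta_amb)
```

On this toy input, the three samples yield a class-0 prediction, an ambiguity abstention, and a distance abstention, respectively.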

4 Results

A3SOM is implemented using the TensorFlow [61] Python API, Keras [62], and the implementation of a Keras layer for SOM given in [63].

We present results for each task performed by A3SOM. For standard semi-supervised classification, we illustrate A3SOM’s performance compared to other classifiers on several datasets (Section 4.1). For the abstained classification task, we show using an artificial dataset that abstaining from prediction can improve accuracy and reduce the number of classification errors (Section 4.2). Finally, we illustrate the usefulness of A3SOM in a case study relating to the diagnosis of breast cancer (Section 4.3).

4.1 Semi-supervised classification evaluation

In this section we compare A3SOM’s semi-supervised classification results to other methods on several datasets.

4.1.1 Datasets.

We use a total of six datasets in the benchmark. The first is an artificial dataset we generated to show the specificities of our method. The others are five public datasets.

The artificial dataset is composed of 6 classes: three defined by Gaussian clusters (labeled 0, 1 and 2) and three with uniform distributions (labeled 3, 4 and 5), as shown in Fig 5. The classes are generated in order to have different degrees of ambiguity. Classes 3, 4 and 5 have a high ambiguity in their overlapping areas; class 5 overlaps with class 2; classes 0 and 3 are close but still separable, and class 1 is well separated from others. Dataset construction details are given in supplementary file A, Section 1.

Fig 5. Plot representing the 2D points of the artificial dataset.

Each class in the dataset is represented by a different color. Some overlaps between classes can be noticed.

https://doi.org/10.1371/journal.pone.0286137.g005

The real-world datasets are extracted from the OpenML database [64]:

  • Cardiotocography: measurements data on cardiotocograms [65].
  • Ionosphere: radar data [66].
  • Iris: different types of flowers [67].
  • MNIST: handwritten digits [68].
  • WDBC: breast lump description [69].

All datasets are normalized between 0 and 1 to reduce measurement bias, and instances with missing values are removed. A summary of the six datasets used is given in Table 1.

Table 1. Description of the different characteristics of benchmark datasets: Size, number of features, and number of classes.

Note that some datasets are not balanced.

https://doi.org/10.1371/journal.pone.0286137.t001
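The min–max normalization applied to all datasets can be done per feature with scikit-learn; the data below is illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 300.0]])
X01 = MinMaxScaler().fit_transform(X)   # each column rescaled to [0, 1]
```

Scaling each feature independently keeps features with large numeric ranges from dominating the distance computations used by the SOM.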

4.1.2 Benchmark methods.

We compare A3SOM to other semi-supervised methods, which we divide into three categories: implemented baselines, existing semi-supervised approaches that are not SOM-based (SSL approaches), and SOM-based approaches (SOM approaches).

4.1.2.1 Implemented baselines. We implemented three combinations of self-training [13], which is a wrapper method, with popular classifiers: Support-Vector Machine (SVM) [70], Random Forest (RF) [71], and Multi-Layer Perceptron (MLP) [72]. Only the last is a deep learning algorithm. The package scikit-learn [73] is used for implementation. In self-training, labeled data is used to build a first classifier, which predicts ‘pseudo-labels’ for unlabeled samples; pseudo-labels are then used along with the true labels in subsequent rounds of training. Label propagation (LP) [27] is also implemented with scikit-learn. This is a transductive method that builds a fully-connected similarity graph over the training samples, and label information is propagated to unlabeled samples in the graph.

4.1.2.2 SSL approaches. Few semi-supervised classification approaches can be applied on tabular data. We include two recent methods for which code was made available: VIME [24] and TabNet [21]. VIME is designed to extend the use of self- and semi-supervised learning to tabular data. It uses an encoder to learn a new representation of the data focusing on important features. It is intrinsically semi-supervised, combining a supervised and an unsupervised term in the loss. TabNet [21] is an attention-based model that learns which features are important for prediction. Through unsupervised preprocessing and supervised fine-tuning, TabNet can be used as a semi-supervised method.

4.1.2.3 SOM approaches. For semi-supervised methods based on SOM, we include the only recent methods that made their code available: SS-SOM [51] and SuSi [53]. SS-SOM is based on a growing SOM that switches between supervised and unsupervised learning depending on the nature of each sample. If the sample is unlabeled, the standard unsupervised SOM algorithm is performed. Otherwise, the label will influence the decision when the closest neuron is found: the label associated with the neuron might change, or a new neuron might be inserted into the map. The method can be seen as intrinsically semi-supervised. Specifically, we use the Batch SS-SOM (BSSSOM) variant [51], as it is the only one with a documented Python package [74] that can be included in our experimental protocol. SuSi [53] first trains an unsupervised SOM on all data, then uses a second SOM to associate labels with the neurons of the first map; it is a method using unsupervised preprocessing. The accompanying package [75] is used. Note that this method’s computations on MNIST were prohibitively long, so they are not presented here.

A comparison of our method with implemented supervised classifiers can be found in supplementary file A, Section 2, as well as details on the hyperparameters used for the methods presented above.

4.1.3 Benchmark results.

A3SOM was compared to the methods presented in the previous section. We defined different levels of labeled data seen during the training phase as percentages of the total training data available; the labels of the remaining samples were set to -1. Thus, all semi-supervised methods were trained on the entire dataset, in which only some of the samples were labeled. All methods were evaluated on the same validation sets, in which all labels were present. For each method, several sets of hyperparameters were tested; we present the variant with the highest score for each percentage of labeled data. The methods were scored on their mean validation accuracy under five-fold cross-validation.
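
The masking step of this protocol can be sketched as follows (a hypothetical `mask_labels` helper, assuming -1 marks unlabeled samples as in the benchmark):

```python
import numpy as np

def mask_labels(y, pct_labeled, seed=0):
    """Keep labels for pct_labeled% of training samples, set the rest to -1."""
    rng = np.random.RandomState(seed)
    y_semi = np.full_like(y, -1)
    n_keep = int(len(y) * pct_labeled / 100)
    keep = rng.choice(len(y), size=n_keep, replace=False)
    y_semi[keep] = y[keep]
    return y_semi

y = np.array([0, 1, 2] * 10)
y_10 = mask_labels(y, 10)
print((y_10 != -1).sum())  # 3 labeled samples out of 30
```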

Fig 6 gives the results obtained by A3SOM and the other classifiers on the datasets described in Section 4. The numerical values of the results, along with the standard deviation, are given in supplementary file A, Section 3.

Fig 6. Benchmark results.

Representation of the mean validation accuracy after 5-fold cross-validation for different percentages of labeled data used during training. The x-axis represents the percentage of labeled samples included during training, and the mean accuracy of each method can be read on the y-axis.

https://doi.org/10.1371/journal.pone.0286137.g006

As we can see in the figure, the general trend is that the accuracy of the different methods improves as more labeled data is included during training, as expected.

4.1.3.1 Implemented baselines. LP, ST-RF and ST-SVM are not deep-learning methods; we expect them to perform well on smaller datasets, and less so on bigger datasets. LP behaves as anticipated. ST-RF actually performs well on most datasets except for Artificial and WDBC, for which scores are slightly lower. ST-SVM also reaches a good accuracy for the most part, with one notable exception on MNIST, where results are low; moreover, it was extremely slow to run. ST-MLP generally makes good predictions, although its performance is relatively low on Ionosphere.

4.1.3.2 SSL approaches. VIME and TabNet both obtain good accuracies for MNIST and Iris. VIME performs well on the Cardiotocography and WDBC datasets, and TabNet on the Artificial dataset. On the other datasets, however, both methods’ accuracies increase more slowly and do not always catch up, even with 100% of labeled data. This is the case with Artificial for VIME, and with WDBC for TabNet.

4.1.3.3 SOM approaches. SuSi seems to struggle on bigger datasets, and although BSSSOM obtains high scores on Iris and Cardiotocography, its performance is average on the other datasets.

4.1.3.4 A3SOM. We observe that A3SOM is among the most accurate classifiers in our benchmark, and gives the best results for all datasets except WDBC, on which it obtains slightly lower accuracies than ST-SVM or VIME for some percentages of labeled data. When accuracies are averaged over all percentages of labeled data, A3SOM gives the best results for all datasets, as shown in Table 2. Compared to the other methods, A3SOM’s competitiveness is coupled with additional advantages: visualization and explainability.

Table 2. Averaged benchmark accuracies.

The results presented in Fig 6 are averaged over all percentages of labeled data. Best performance for each dataset is in bold.

https://doi.org/10.1371/journal.pone.0286137.t002

4.2 Abstained classification evaluation

We evaluate the performance of A3SOM’s abstained classification on the artificial dataset presented in Section 4.1.1. We first show the effect of global and local abstention thresholds on the improvement of classification accuracy. We then illustrate how extending the SOM with an abstained classifier enables the identification of samples within overlapping class boundaries and the discovery of new classes.

4.2.1 Trade-off between performance improvement and abstention rate.

We divide the classification outputs into two decision regions G and E such that G represents the good predictions and E the misclassified predictions (errors). Abstention allows us to divide the output space differently by defining two regions A and R, where A represents the accepted predictions and R the rejected ones. From these four regions, we can define two performance measures for abstained classification: Accepted Accuracy (AA) and Rejected Error (RE). AA corresponds to the correct prediction rate among the accepted predictions and RE to the rate of rejected errors among the total errors. They are defined as follows: AA = |A ∩ G| / |A| and RE = |R ∩ E| / |E|. (14)
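
Given a boolean reject mask, these two measures can be computed directly from their set definitions (a minimal sketch; the helper name is ours):

```python
import numpy as np

def abstention_scores(y_true, y_pred, rejected):
    """Accepted Accuracy (|A ∩ G| / |A|) and Rejected Error (|R ∩ E| / |E|)."""
    accepted = ~rejected
    good = (y_pred == y_true)
    aa = (accepted & good).sum() / max(accepted.sum(), 1)
    re = (rejected & ~good).sum() / max((~good).sum(), 1)
    return float(aa), float(re)

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 2, 2])
rejected = np.array([False, True, False, True, False])
# Both errors are rejected and all accepted predictions are correct.
print(abstention_scores(y_true, y_pred, rejected))  # (1.0, 1.0)
```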

Fig 7 shows how the model’s performance evolves when we vary the thresholds used for abstention. When all predictions are accepted (i.e., 0% of rejected observations in the figures), AA is equal to the standard classification accuracy, and RE is equal to zero as the set of rejected errors R ∩ E is empty.

Fig 7. Evolution of performance on the artificial dataset when varying abstention thresholds.

Scores obtained using combinations of local thresholds are in blue, and those obtained using global thresholds are in pink. The areas in yellow are the regions of interest, i.e., where local thresholds perform better than global ones. In (a) and (b) we vary distance abstention thresholds between 0 and 1 (ambiguity thresholds are set to 0). In (c) and (d) we vary ambiguity abstention thresholds between 0 and 1 (distance thresholds are set to 0).

https://doi.org/10.1371/journal.pone.0286137.g007

In Fig 7 we see that, whether we apply distance or ambiguity abstention, AA and RE both rise with the number of abstentions. For a given AA or RE rate, there exist combinations of local thresholds that outperform global thresholds while abstaining from fewer predictions, as highlighted in the yellow areas. For example, in Fig 7b for distance abstention, slightly more than 60% of the samples must be rejected by global abstention to reach 90% RE, whereas abstaining from about 20% of predictions suffices to obtain the same score with local thresholds. This difference can be explained by the fact that the proximity between the different classes varies and the classes are organized differently, hence the need for class-specific abstention thresholds.
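
A sketch of how local thresholds could be applied at prediction time, following the two rules described in this section (the per-class `dist_thr` and `amb_thr` arrays and the example values are illustrative, not the paper’s):

```python
import numpy as np

def abstained_predict(probs, dist_thr, amb_thr):
    """Apply per-class distance and ambiguity rules; -1 marks abstention.
    probs: (n_samples, n_classes) predicted probabilities.
    dist_thr, amb_thr: per-class (local) threshold arrays."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # [second best, best]
    pred = probs.argmax(axis=1)
    reject_dist = top2[:, 1] < dist_thr[pred]               # far from any class
    reject_amb = (top2[:, 1] - top2[:, 0]) < amb_thr[pred]  # two classes compete
    out = pred.copy()
    out[reject_dist | reject_amb] = -1
    return out

probs = np.array([[0.90, 0.05, 0.05],   # confident -> accepted
                  [0.45, 0.45, 0.10],   # ambiguous -> rejected
                  [0.34, 0.33, 0.33]])  # low confidence -> rejected
dist_thr = np.full(3, 0.4)
amb_thr = np.full(3, 0.1)
print(abstained_predict(probs, dist_thr, amb_thr))  # abstains on the last two
```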

The effect of varying the local threshold for individual classes is analyzed and given in the supplementary file B, Section 1.

4.2.2 Interpretation of abstained classification.

One advantage of this work is its ability to identify and visualize potential new classes and ambiguous classification areas in the input data using local abstention options. To show these properties, we split the artificial dataset into a training and a test set. The training set is curated to include data from all six classes that compose the dataset, but the labels for class 1 are removed and considered unknown. In this experiment, class 1 represents the new class that our algorithm must discover with the distance abstention rule. We also randomly removed 90% of labels across all the remaining classes, so that the hidden class is not the only one with masked labels.

For the distance rule, we found the highest predicted probability for each sample and associated it with the corresponding class. Distance thresholds were then set to the mean of all the probabilities associated with each class. For the ambiguity rule, we found the two highest probabilities for each sample, and associated the difference between the two with the class with the highest probability. Ambiguity thresholds were then set to the mean of all the differences associated with each class. Specific threshold values used for this application can be found in supplementary file B, Section 2.
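
This threshold-setting scheme can be sketched as follows (a hypothetical `local_thresholds` helper implementing the class-wise means described above; the probability matrix is illustrative):

```python
import numpy as np

def local_thresholds(probs):
    """Per-class thresholds set to class-wise means, as described above:
    distance = mean of the max probability over samples predicted as class c,
    ambiguity = mean of (top1 - top2) over those same samples.
    Assumes every class is predicted at least once."""
    pred = probs.argmax(axis=1)
    top2 = np.sort(probs, axis=1)[:, -2:]  # columns: [second best, best]
    n_classes = probs.shape[1]
    dist_thr = np.array([top2[pred == c, 1].mean() for c in range(n_classes)])
    amb_thr = np.array([(top2[pred == c, 1] - top2[pred == c, 0]).mean()
                        for c in range(n_classes)])
    return dist_thr, amb_thr

probs = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.2, 0.6]])
dist_thr, amb_thr = local_thresholds(probs)
print(dist_thr, amb_thr)  # [0.7 0.7 0.6] [0.5 0.5 0.4]
```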

Fig 8 represents different sets of labels—before or after abstention—for the artificial dataset. In both cases, labels are shown on the projection of samples and on the SOM created by the model.

Fig 8. Representation of labels on the artificial dataset.

(a) 2D representation of training samples, colored by their true labels. Triangles are for labels that were not used during training, and circles for labels used. (b) Representation of SOM prototypes by the true labels of the training set. (c) 2D representation of test samples, colored by their abstained label. (d) Representation of SOM prototypes by the abstained labels of the test set.

https://doi.org/10.1371/journal.pone.0286137.g008

Fig 8a and 8b show, respectively, the 2D points and the true label distribution in the SOM, computed on the whole training set of the artificial dataset. In Fig 8a we can see the overlapping areas induced by classes 3, 4 and 5, and how far class 1 is from the other classes. In this figure, the samples whose labels were seen during training are represented by circles, while the samples whose labels were masked during training are represented by lighter triangles.

In Fig 8b we show the true label distribution of the input data on the SOM. As expected, the input samples of classes 0, 1 and 2 are located at the corners of the map. In contrast, input samples of classes 3, 4 and 5 are situated in the center of the SOM, where neurons can be associated with samples from several classes, reflecting the ambiguities identified in Fig 8a.

Fig 8c and 8d show, respectively, the 2D points and the distribution of abstained labels in the SOM, computed on the test set. In Fig 8d we can see, in the top right of the map, that the model abstained from predicting a group of samples with the distance rule. The proximity of these examples in the SOM suggests they belong to the same class; we verified that the abstained examples correspond to class 1 represented in Fig 8a. We can also see in Fig 8c that the model abstained from classifying input samples in ambiguous areas. For some of these predictions the ambiguity rule was applied, as expected; for others, the distance abstention rule was applied. However, it is still apparent on the SOM that these samples do not form a group the way the samples in the top right do. Moreover, the abstention task did not find noticeable ambiguity around classes 0 and 2, which can be explained by the fact that these two classes are separable.

4.3 Semi-supervised abstained classification: A case study

This section shows how A3SOM can be used on a real-life dataset to make essential observations or discoveries. Our example is in the field of medical diagnosis, using omics data. We demonstrate how our method can be used to classify breast cancer subtypes and discover a subtype that was not present in the training labels.

4.3.1 Biological context.

Breast cancer is the most diagnosed cancer in the world, as well as one of the most deadly. There are multiple types of breast cancer, called subtypes, which differ in genomic features and clinical outcomes. The four primary molecular subtypes are Luminal A, Luminal B, Her2-enriched and Basal-like. Luminal A and Luminal B are the subtypes with the best prognoses, and are typically treated with hormone therapies. The prognosis is worse for the Her2-enriched subtype, but it has improved with the development of treatments that target receptors of the protein Her2. Finally, the Basal-like subtype has the worst prognosis and currently has no targeted treatment.

Molecular expression varies for each subtype, as shown in [76]. Therefore, characterizing each cancer subtype by its genomic profile can help in its diagnosis and treatment.

4.3.2 Dataset construction.

We obtained publicly-available data from The Cancer Genome Atlas (TCGA, https://www.cancer.gov/tcga), using the R package TCGAbiolinks [77]. Specifically, we downloaded gene expression data of patients with breast cancer. This dataset presents the counts of 19,947 genes for 1,215 patients. Gene counts are normalized, which means that the counts were scaled to the total number of reads for each sample. We also downloaded clinical data, including subtype information, which was given for 1,083 patients.

Patients are associated with a total of five subtypes: the ones described above and the ‘Normal-like’ subtype. Due to a low number of examples in our dataset and the fact that this subtype is not always described in articles on breast cancer, we decided to exclude it from our dataset. As the Her2 subtype was also underrepresented, but important, we decided to randomly remove some samples from the other classes to make the dataset more balanced. This makes visualizations easier to interpret in the rest of this section.

Gene expression datasets consist of many genes, but relatively few of them play a role in diseases. To reduce noise in our dataset, we decided to limit the number of features to only keep genes that have been shown to be relevant in breast cancer subtyping. The 50-gene signature described in [78] is widely used to classify breast tumors.
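
Restricting the expression matrix to such a signature amounts to a simple column selection followed by normalization; a toy sketch (gene names other than the four studied later and all values are illustrative, not the actual TCGA data):

```python
import numpy as np

# Hypothetical toy data: rows are patients, columns follow `genes`.
genes = ["ESR1", "PGR", "XYZ"]
expr = np.array([[1.0, 0.5, 9.9],
                 [2.0, 0.3, 8.8]])
signature = ["ESR1", "PGR", "ERBB2", "MKI67"]  # stand-in for the 50-gene list

# Keep only the columns whose gene belongs to the signature.
keep_idx = [i for i, g in enumerate(genes) if g in signature]
kept_genes = [genes[i] for i in keep_idx]
kept = expr[:, keep_idx]

# Per-gene z-score normalization before training.
normed = (kept - kept.mean(axis=0)) / kept.std(axis=0)
print(kept_genes)  # ['ESR1', 'PGR']
```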

Finally, the dataset was normalized before training.

4.3.3 A3SOM application.

As in Section 4.2, we processed the training dataset by removing the labels of an entire class, to create a class that is represented in the data but whose label is unknown. We chose to remove the labels of the Basal-like subtype, as it is the one with the worst prognosis. We also removed 40% of labels at random across the dataset, regardless of class, so that Basal-like was not the only subtype with removed labels. This percentage was chosen based on the complexity of the genomic data and the low number of samples in certain classes (Her2-enriched in particular): removing more labels could leave some classes completely unlabeled or with very few samples. We then trained the abstained mode of A3SOM to predict the three classes Luminal A, Luminal B and Her2-enriched (denoted LumA, LumB, Her2). Abstention thresholds were set the same way as in the previous section, by looking at mean predictions for each class, and are presented in supplementary file B, Section 3.

Fig 9 represents the SOM obtained after training with different information: the true labels associated to samples, the labels predicted by the algorithm, and the labels once abstention has been applied.

Fig 9. Visualization of the trained SOM with different types of labels for the breast cancer dataset.

LumA, LumB and Her2 are the three breast cancer subtypes seen during training, and Basal is the fourth subtype present in the data. Distance and ambiguity are the two abstention criteria.

https://doi.org/10.1371/journal.pone.0286137.g009

Looking at the map in Fig 9c, we can clearly see that the model abstained, with the distance rule, from predicting an entire group in the top right corner. This group of neurons represents samples that the model identifies as different from the learned classes. Since the neurons are grouped together, the samples they represent are similar, a strong indication that these samples might form a new class that was unknown during training. Comparing the labels of these neurons with the true labels in Fig 9a confirms that these samples belong to a different class: the model is able to detect a new class in the data. In the context of breast cancer, the true labels (Fig 9a) indicate that these patients all have the aggressive Basal-like subtype. If we do not apply abstention (Fig 9b), the model predicts that these patients have one of the other subtypes. As explained in the biological context, it is important to separate the Basal-like subtype from the others, as it has the worst prognosis and does not respond to the same therapies. Mis-predicting a patient with Basal-like breast cancer as one of the other subtypes is dangerous: it would mean administering a treatment that is inappropriate for the patient, exposing them to side effects while letting the tumor grow and become even more aggressive.

4.3.4 Interpretation.

We show an example of how A3SOM’s results can be interpreted for breast cancer classification. To make the interpretation intelligible, we chose to study the representation of four genes across classes. The genes were selected based on their involvement in breast cancer subtypes [79], but they could be replaced by other genes of interest. ESR1 (Estrogen Receptor 1) and PGR (Progesterone Receptor) were chosen for their roles in the Luminal subtypes: LumA and LumB are both characterized by high expression of the ER and/or PR genes (Estrogen Receptor and Progesterone Receptor, respectively). ERBB2 (Erb-B2 Receptor Tyrosine Kinase 2), also called HER2, is the principal gene that characterizes the Her2 subtype. The Basal subtype is sometimes referred to as ‘Triple-negative’ because it has low expression of the ER, PR and HER2 markers. Finally, we used MKI67 (Marker Of Proliferation Ki-67) to differentiate between LumA and LumB. Based on this, we expect the different signatures described in Table 3.

Table 3. Gene signature expected for each subtype.

‘+’ means overexpression of the gene in the subtype, ‘-’ means underexpression, and ‘/’ means we have no particular expectation.

https://doi.org/10.1371/journal.pone.0286137.t003

Fig 10 shows the SOM with abstained label information in the background, and feature values for the four selected genes in the foreground, for each neuron.

Fig 10. Visualization of the SOM prototypes.

Each prototype is represented by its values for four genes. In the background, neurons are also colored by their abstained label, as shown in the previous figure.

https://doi.org/10.1371/journal.pone.0286137.g010

In Fig 10 we see that the expected signatures can be read on the prototypes. Prototypes associated with the Her2 subtype, in green, tend to have low values for ESR1 and PGR, and the highest values for ERBB2 compared with other prototypes. Most prototypes associated with the Luminal subtypes have higher values for ESR1 and PGR, and LumB (blue) prototypes have higher expression of MKI67 than LumA (purple) prototypes. The prototypes in the top-right corner, where labels were primarily rejected with distance abstention, all seem to share a similar signature with low values for the first three genes (ESR1, PGR and ERBB2). This shows that these samples likely belong to the same class, as they have similar signatures, and that they form a group separate from the known classes, as their signatures are quite different. Moreover, this matches our expectations for the Basal subtype.

Visualizing the different signatures (feature values) of all subtypes could be a first step to identify biomarkers.

5 Discussion

We present in this paper an original method that is, to our knowledge, the only one that can simultaneously handle semi-labeled data, perform abstained classification, provide an explanation for the prediction, and produce visualizations that help analyze the model. We showed that our method is competitive with semi-supervised black-box approaches (VIME, TabNet, …) in addition to providing explanations. We detail in the following the comparison between our approach and SOM-based methods.

SuSi [53] was primarily designed to handle hyperspectral data, while A3SOM is generic and can be applied to different applications (A3SOM showed good results on various datasets). Another difference is that although the model does propose classification, the authors have so far mainly focused on developing and optimizing their proposal for regression, which can explain why SuSi’s results were relatively low in our classification benchmark (Section 4.1). For BSSSOM (as well as the other variants) [50–52], the base algorithm distinguishes unlabeled and labeled samples during training, to use either an unsupervised or a supervised algorithm. While the map’s topology stays constant in A3SOM, BSSSOM uses a map with a time-varying structure, to which neurons can be added during training to better represent data. This is a drawback in terms of visualization: as the map is not constrained, the addition of neurons deforms its structure, the notion of neighborhood loses some of its meaning, and the map cannot be represented in the same way as A3SOM’s rectangular grid. Conceptually, these three methods diverge in their type of semi-supervision, as described in Section 2.1. SuSi performs unsupervised preprocessing: the algorithm is based on two consecutive stages, where the first (unsupervised) task is not modified by the second (supervised) task. BSSSOM also performs two separate learning tasks, either supervised or unsupervised depending on the presence of a label for a sample; as these two tasks happen simultaneously to optimize the model, BSSSOM can be seen as intrinsically semi-supervised. A3SOM is truly intrinsically semi-supervised, as labeled and unlabeled samples are both included in the objective function: both types of data simultaneously play a role in the optimization of the model’s weights.

Concerning the abstention strategies associated with SOM in the literature, ROSOM [55] and the method in [56] do not distinguish between ambiguity and distance abstention rules, which makes the distinction between class discovery and class overlaps difficult. IRSOM [57] only detects ambiguities, but in our case distance is also very important. SLSOM [58] separates the two rules but uses global abstention thresholds, which underperformed local thresholds in our experiments. We do not compare our results to other abstained classifiers, as abstention options could theoretically be applied after prediction for other methods as well. The originality of A3SOM lies in combining the two criteria with the self-organizing map, which makes abstention decisions interpretable. When we represent abstention results on the SOM, we can understand whether samples for which the distance rule was applied might be outliers or whether they are likely to belong to a new class.

Interpretability has become a necessity in critical areas (e.g., medical, financial, self-driving cars and legal). In the current state-of-the-art, there are two main approaches to interpreting neural networks [80]: post-hoc approaches which explain already trained neural networks and which are the most used, and self-explaining approaches (or explainable) that aim to create interpretable models. Due to the multitude of drawbacks of post-hoc methods, the scientific community recommends building directly interpretable models [81]. Some methods are thus based on examples or prototypes [60] and provide intuitive explanations for the predictions by selecting representative prototypes (or instances) of the input sample, while others provide the most relevant concepts (or features) that lead to the prediction [82]. Our approach follows recent recommendations as it is self-explaining and provides prediction explanation based on prototypes and visualization. In Section 4.3 we have shown how powerful this type of explanation can be. Explainability is not offered by SuSi and BSSSOM. SuSi proposes visualization of the map, but does not address prediction explanation. Due to the varying structure of the map in BSSSOM, visualization is not straightforward, and a prototype-based explanation is not addressed.

6 Conclusion

We presented a new approach that combines SOM and dense layers in a semi-supervised context as well as an abstention task. A3SOM can classify data competitively with state-of-the-art algorithms and allows the detection of new classes and class overlaps. In addition to these results, the novelty of our proposal is that using SOM as well as two distinct abstention rules enables the visualization and explanation of classification. The specificity of A3SOM lies in its ability to handle partially labeled data, for which obtaining labels requires many resources. Its use is therefore particularly appropriate in fields where much data is generated, but its study is still in the exploratory stage.

One potential improvement to our model would be to include abstention during the training phase. In the survey [83], three architectures for abstained classification are described: separated, dependent, or integrated rejection. We perform dependent rejection: abstention is applied to the output of the model, after the training step. It would be interesting to explore how integrated rejection affects the model. This would mean applying abstention during training and learning where to apply abstention.

Additional visualizations can be added to develop the model’s explainability further. Examples of visualizations could be the U-matrix, which represents the distance between the neurons of the SOM and shows which neurons form clusters and which are very different, or feature heatmaps [84] to understand the importance of the different features in the definition of the neurons’ prototypes and to identify correlations between features.
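
For reference, a U-matrix for a rectangular SOM can be computed from the prototypes alone (a sketch of the standard definition, not taken from A3SOM’s code; grid size and prototype values are illustrative):

```python
import numpy as np

def u_matrix(prototypes, rows, cols):
    """U-matrix of a rectangular SOM: for each neuron, the mean Euclidean
    distance to its grid neighbors. prototypes: (rows*cols, dim), row-major."""
    w = prototypes.reshape(rows, cols, -1)
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = [np.linalg.norm(w[i, j] - w[ni, nj])
                     for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= ni < rows and 0 <= nj < cols]
            u[i, j] = np.mean(dists)  # high values mark cluster boundaries
    return u

protos = np.random.RandomState(0).rand(25, 4)  # a 5x5 map with 4 features
u = u_matrix(protos, 5, 5)
print(u.shape)  # (5, 5)
```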

One drawback of SOM is that a fixed number of neurons must be defined before training. Several extensions of SOM, called growing SOM, have been proposed in the literature [85] to overcome this problem. Since the visualization aspect is very important in our model, a perspective is to extend our model to growing SOM with fixed structure like growing cell structure [86, 87] and growing grid [88, 89].

Future work will further develop this method to study cancer with multi-omics data. Multi-omics, or integrated omics, is the study of multiple molecular levels (genome, transcriptome, proteome…) at the same time. Combining several levels in a computational model is a task that has gained in popularity over the last five years, and has resulted in discoveries in biology. Omics data exist in abundance, but only some are labeled, making semi-supervised approaches relevant. We are interested in making discoveries: it is important that our model can abstain from returning predictions and detect potential new classes. Moreover, bioinformatics is a field rich in interactions between different actors, and providing a visualization to make results easier to understand is also beneficial.

Supporting information

S1 Text. A3SOM source code, as well as supplementary files, are available at https://forge.ibisc.univevry.fr/ccreux/A3SOM.git.

https://doi.org/10.1371/journal.pone.0286137.s001

(TXT)

References

  1. Bhandari S, Pathak S, Jain SA. A Literature Review of Early-Stage Diabetic Retinopathy Detection Using Deep Learning and Evolutionary Computing Techniques. Archives of Computational Methods in Engineering. 2023;30(2):799–810.
  2. Bourgeais V, Zehraoui F, Hanczar B. GraphGONet: a self-explaining neural network encapsulating the Gene Ontology graph for phenotype prediction on gene expression. Bioinformatics. 2022;38(9):2504–2511. pmid:35266505
  3. Nemade V, Pathak S, Dubey AK. A Systematic Literature Review of Breast Cancer Diagnosis Using Machine Intelligence Techniques. Archives of Computational Methods in Engineering. 2022;29(6):4401–4430.
  4. Elmarakeby HA, Hwang J, Arafeh R, Crowdis J, Gang S, Liu D, et al. Biologically Informed Deep Neural Network for Prostate Cancer Discovery. Nature. 2021;598(7880):348–352. pmid:34552244
  5. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature. 2021;596(7873):583–589. pmid:34265844
  6. Boukelia A, Boucheham A, Belguidoum M, Batouche M, Zehraoui F, Tahi F. A Novel Integrative Approach for Non-coding RNA Classification Based on Deep Learning. Current Bioinformatics. 2020;15(4):338–348.
  7. Chapelle O, Schölkopf B, Zien A. Semi-Supervised Learning. Bach F, editor. Adaptive Computation and Machine Learning Series. Cambridge, MA, USA: MIT Press; 2006.
  8. Chen J, Yang M, Ling J. Attention-Based Label Consistency for Semi-Supervised Deep Learning Based Image Classification. Neurocomputing. 2021;453:731–741.
  9. Mygdalis V, Iosifidis A, Tefas A, Pitas I. Semi-Supervised Subclass Support Vector Data Description for Image and Video Classification. Neurocomputing. 2018;278:51–61.
  10. Camargo G, Bugatti PH, Saito PTM. Active Semi-Supervised Learning for Biological Data Classification. PLOS ONE. 2020;15(8):1–20. pmid:32813738
  11. Han CH, Kim M, Kwak JT. Semi-Supervised Learning for an Improved Diagnosis of COVID-19 in CT Images. PLOS ONE. 2021;16(4):1–13. pmid:33793650
  12. van Engelen JE, Hoos HH. A Survey on Semi-Supervised Learning. Machine Learning. 2020;109(2):373–440.
  13. Yarowsky D. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics. ACL’95. USA: Association for Computational Linguistics; 1995. p. 189–196.
  14. Chen R, Ma Y, Liu L, Chen N, Cui Z, Wei G, et al. Semi-Supervised Anatomical Landmark Detection via Shape-Regulated Self-Training. Neurocomputing. 2022;471:335–345.
  15. Tanha J. MSSBoost: A New Multiclass Boosting to Semi-Supervised Learning. Neurocomputing. 2018;314:251–266.
  16. Ren Y, Wu Y, Ge Y. A Co-Training Algorithm for EEG Classification with Biomimetic Pattern Recognition and Sparse Representation. Neurocomputing. 2014;137:212–222.
  17. Liu JX, Wang D, Gao YL, Zheng CH, Shang JL, Liu F, et al. A Joint-L2,1-Norm-Constraint-Based Semi-Supervised Feature Extraction for RNA-Seq Data Analysis. Neurocomputing. 2017;228:263–269.
  18. Yin W, Li L, Wu FX. A Semi-Supervised Autoencoder for Autism Disease Diagnosis. Neurocomputing. 2022;483:140–147.
  19. Goldberg A, Zhu X, Singh A, Xu Z, Nowak R. Multi-Manifold Semi-Supervised Learning. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. PMLR; 2009. p. 169–176.
  20. Rifai S, Vincent P, Muller X, Glorot X, Bengio Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. In: ICML; 2011.
  21. Arik SÖ, Pfister T. TabNet: Attentive Interpretable Tabular Learning. Proceedings of the AAAI Conference on Artificial Intelligence. 2021;35(8):6679–6687.
  22. Grandvalet Y, Bengio Y. Semi-Supervised Learning by Entropy Minimization. In: Advances in Neural Information Processing Systems. vol. 17. MIT Press; 2004.
  23. Weston J, Ratle F, Mobahi H, Collobert R. Deep Learning via Semi-supervised Embedding. In: Montavon G, Orr GB, Müller KR, editors. Neural Networks: Tricks of the Trade: Second Edition. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2012. p. 639–655.
  24. Yoon J, Zhang Y, Jordon J, van der Schaar M. VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain. In: Advances in Neural Information Processing Systems. vol. 33. Curran Associates, Inc.; 2020. p. 11033–11043.
  25. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved Techniques for Training GANs. In: Advances in Neural Information Processing Systems. vol. 29. Curran Associates, Inc.; 2016.
  26. Li C, Peng X, Peng H, Li J, Wang L. TextGTL: Graph-based Transductive Learning for Semi-supervised Text Classification via Structure-Sensitive Interpolation. In: Twenty-Ninth International Joint Conference on Artificial Intelligence. vol. 3; 2021. p. 2680–2686.
  27. Zhu X, Ghahramani Z. Learning from Labeled and Unlabeled Data with Label Propagation; 2002.
  28. Saito K, Kim D, Saenko K. OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers; 2021.
  29. Liu YC, Ma CY, Dai X, Tian J, Vajda P, He Z, et al. Open-Set Semi-Supervised Object Detection; 2022.
  30. Cao K, Brbic M, Leskovec J. Open-World Semi-Supervised Learning; 2022.
  31. Yu Q, Ikami D, Irie G, Aizawa K. Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning; 2020.
  32. Huang T, Wang D, Fang Y, Chen Z. End-to-End Open-Set Semi-Supervised Node Classification with Out-of-Distribution Detection. In: Raedt LD, editor. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22. International Joint Conferences on Artificial Intelligence Organization; 2022. p. 2087–2093. Available from: https://doi.org/10.24963/ijcai.2022/290.
  33. Ishibuchi H, Nii M. Neural Networks for Soft Decision Making. Fuzzy Sets and Systems. 2000;115(1):121–140.
  34. De Stefano C, Sansone C, Vento M. To Reject or Not to Reject: That Is the Question-an Answer in Case of Neural Classifiers. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 2000;30(1):84–94.
  35. Roady R, Hayes TL, Kemker R, Gonzales A, Kanan C. Are Open Set Classification Methods Effective on Large-Scale Datasets? PLOS ONE. 2020;15(9):e0238302. pmid:32886692
  36. Bendale A, Boult T. Towards Open Set Deep Networks; 2015.
  37. Fang Z, Lu J, Liu A, Liu F, Zhang G. Learning Bounds for Open-Set Learning; 2021.
  38. Saranrittichai P, Mummadi CK, Blaiotta C, Munoz M, Fischer V. Multi-Attribute Open Set Recognition; 2022.
  39. Chow C. On Optimum Recognition Error and Reject Tradeoff. IEEE Transactions on Information Theory. 1970;16(1):41–46.
  40. Fumera G, Roli F, Giacinto G. Reject Option with Multiple Thresholds. Pattern Recognition. 2000;33(12):2099–2101.
  41. Kohonen T. Self-Organizing Maps. vol. 30 of Springer Series in Information Sciences. Springer; 1995.
  42. Platon L, Zehraoui F, Tahi F. Localized Multiple Sources Self-Organizing Map. In: 25th International Conference on Neural Information Processing (ICONIP 2018). vol. 11303 of Lecture Notes in Computer Science. Siem Reap, Cambodia; 2018. p. 648–659. Available from: https://hal.science/hal-01971022.
  43. Mendoza-Carranza M, Ejarque E, Nagelkerke LAJ. Disentangling the Complexity of Tropical Small-Scale Fisheries Dynamics Using Supervised Self-Organizing Maps. PLOS ONE. 2018;13(5):1–28. pmid:29782501
  44. Kohonen T. The ‘Neural’ Phonetic Typewriter. Computer. 1988;21(3):11–22.
  45. Kittiwachana S, Grudpan K. Supervised Self Organizing Maps for Exploratory Data Analysis of Running Waters Based on Physicochemical Parameters: A Case Study in Chiang Mai, Thailand. Asia-Pacific Journal of Science and Technology. 2015;20(1):1–11.
  30. 30. Cao K, Brbic M, Leskovec J. Open-World Semi-Supervised Learning; 2022.
  31. 31. Yu Q, Ikami D, Irie G, Aizawa K. Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning; 2020.
  32. 32. Huang T, Wang D, Fang Y, Chen Z. End-to-End Open-Set Semi-Supervised Node Classification with Out-of-Distribution Detection. In: Raedt LD, editor. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22. International Joint Conferences on Artificial Intelligence Organization; 2022. p. 2087–2093. Available from: https://doi.org/10.24963/ijcai.2022/290.
  33. Ishibuchi H, Nii M. Neural Networks for Soft Decision Making. Fuzzy Sets and Systems. 2000;115(1):121–140.
  34. De Stefano C, Sansone C, Vento M. To Reject or Not to Reject: That Is the Question—an Answer in Case of Neural Classifiers. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 2000;30(1):84–94.
  35. Roady R, Hayes TL, Kemker R, Gonzales A, Kanan C. Are Open Set Classification Methods Effective on Large-Scale Datasets? PLOS ONE. 2020;15(9):e0238302. pmid:32886692
  36. Bendale A, Boult T. Towards Open Set Deep Networks; 2015.
  37. Fang Z, Lu J, Liu A, Liu F, Zhang G. Learning Bounds for Open-Set Learning; 2021.
  38. Saranrittichai P, Mummadi CK, Blaiotta C, Munoz M, Fischer V. Multi-Attribute Open Set Recognition; 2022.
  39. Chow C. On Optimum Recognition Error and Reject Tradeoff. IEEE Transactions on Information Theory. 1970;16(1):41–46.
  40. Fumera G, Roli F, Giacinto G. Reject Option with Multiple Thresholds. Pattern Recognition. 2000;33(12):2099–2101.
  41. Kohonen T. Self-Organizing Maps. vol. 30 of Springer Series in Information Sciences. Springer; 1995.
  42. Platon L, Zehraoui F, Tahi F. Localized Multiple Sources Self-Organizing Map. In: 25th International Conference on Neural Information Processing (ICONIP 2018). vol. 11303 of Lecture Notes in Computer Science. Siem Reap, Cambodia; 2018. p. 648–659. Available from: https://hal.science/hal-01971022.
  43. Mendoza-Carranza M, Ejarque E, Nagelkerke LAJ. Disentangling the Complexity of Tropical Small-Scale Fisheries Dynamics Using Supervised Self-Organizing Maps. PLOS ONE. 2018;13(5):1–28. pmid:29782501
  44. Kohonen T. The 'Neural' Phonetic Typewriter. Computer. 1988;21(3):11–22.
  45. Kittiwachana S, Grudpan K. Supervised Self Organizing Maps for Exploratory Data Analysis of Running Waters Based on Physicochemical Parameters: A Case Study in Chiang Mai, Thailand. Asia-Pacific Journal of Science and Technology. 2015;20(1):1–11.
  46. Mattos CLC, Barreto GA. ARTIE and MUSCLE Models: Building Ensemble Classifiers from Fuzzy ART and SOM Networks. Neural Computing and Applications. 2013;22(1):49–61.
  47. Lau KW, Yin H, Hubbard S. Kernel Self-Organising Maps for Classification. Neurocomputing. 2006;69(16):2033–2040.
  48. Hsu A, Halgamuge SK. Class Structure Visualization with Semi-Supervised Growing Self-Organizing Maps. Neurocomputing. 2008;71(16):3124–3130.
  49. Allahyar A, Sadoghi Yazdi H, Harati A. Constrained Semi-Supervised Growing Self-Organizing Map. Neurocomputing. 2015;147:456–471.
  50. Braga PHM, Bassani HF. A Semi-Supervised Self-Organizing Map for Clustering and Classification. In: 2018 International Joint Conference on Neural Networks (IJCNN); 2018. p. 1–8.
  51. Braga PHM, Medeiros HR, Bassani HF. Deep Categorization with Semi-Supervised Self-Organizing Maps. In: 2020 International Joint Conference on Neural Networks (IJCNN). Glasgow, UK: IEEE; 2020. p. 1–7.
  52. Braga PHM, Bassani HF. A Semi-Supervised Self-Organizing Map with Adaptive Local Thresholds. In: 2019 International Joint Conference on Neural Networks (IJCNN); 2019. p. 1–8.
  53. Riese FM, Keller S, Hinz S. Supervised and Semi-Supervised Self-Organizing Maps for Regression and Classification Focusing on Hyperspectral Data. Remote Sensing. 2020;12(1).
  54. Herrmann L, Ultsch A. Label Propagation for Semi-Supervised Learning in Self-Organizing Maps. In: Proceedings of the International Workshop on Self-Organizing Maps (WSOM); 2007.
  55. Gamelas Sousa R, Rocha Neto AR, Cardoso JS, Barreto GA. Robust Classification with Reject Option Using the Self-Organizing Map. Neural Computing and Applications. 2015;26(7):1603–1619.
  56. Stefanovič P, Kurasova O. Outlier Detection in Self-Organizing Maps and Their Quality Estimation. Neural Network World. 2018;28:105–117.
  57. Platon L, Zehraoui F, Bendahmane A, Tahi F. IRSOM, a Reliable Identifier of ncRNAs Based on Supervised Self-Organizing Maps with Rejection. Bioinformatics. 2018;34(17):i620–i628. pmid:30423081
  58. Platon L, Zehraoui F, Tahi F. Self-Organizing Maps with Supervised Layer. In: 2017 12th International Workshop on Self-Organizing Maps and Learning Vector Quantization, Clustering and Data Visualization (WSOM); 2017. p. 1–8.
  59. Forest F, Lebbah M, Azzag H, Lacaille J. A Survey and Implementation of Performance Metrics for Self-Organized Maps; 2020.
  60. Li O, Liu H, Chen C, Rudin C. Deep Learning for Case-Based Reasoning Through Prototypes: A Neural Network That Explains Its Predictions. Proceedings of the AAAI Conference on Artificial Intelligence. 2018;32(1).
  61. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems; 2015.
  62. Chollet F, et al. Keras; 2015.
  63. Forest F, Lebbah M, Azzag H, Lacaille J. Deep Embedded Self-Organizing Maps for Joint Representation Learning and Topology-Preserving Clustering. Neural Computing and Applications. 2021;33(24):17439–17469.
  64. Vanschoren J, van Rijn JN, Bischl B, Torgo L. OpenML: Networked Science in Machine Learning. SIGKDD Explorations. 2013;15:49–60.
  65. Ayres-de Campos D, Bernardes J, Garrido A, Marques-de-Sá J, Pereira-Leite L. SisPorto 2.0: A Program for Automated Analysis of Cardiotocograms. The Journal of Maternal-Fetal Medicine. 2000;9(5):311–318. pmid:11132590
  66. Sigillito VG, Wing SP, Hutton LV, Baker KB. Classification of Radar Returns from the Ionosphere Using Neural Networks. Johns Hopkins APL Technical Digest. 1989;10(3).
  67. Fisher RA. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics. 1936;7(2):179–188.
  68. Deng L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Processing Magazine. 2012;29(6):141–142.
  69. Street WN, Wolberg WH, Mangasarian OL. Nuclear Feature Extraction for Breast Tumor Diagnosis. In: Biomedical Image Processing and Biomedical Visualization. vol. 1905. SPIE; 1993. p. 861–870.
  70. Boser BE, Guyon IM, Vapnik VN. A Training Algorithm for Optimal Margin Classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT'92. New York, NY, USA: Association for Computing Machinery; 1992. p. 144–152. Available from: https://doi.org/10.1145/130385.130401.
  71. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
  72. Rumelhart DE, Hinton GE, Williams RJ. Learning Internal Representations by Error Propagation. In: Rumelhart DE, Mcclelland JL, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. Cambridge, MA: MIT Press; 1986. p. 318–362.
  73. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  74. Braga PHM. Batch SS-SOM; 2020. Available from: https://github.com/phbraga/batch-sssom.
  75. Riese FM. SuSi: Supervised Self-Organizing Maps in Python; 2019. Available from: https://github.com/felixriese/susi.
  76. Gao F, Wang W, Tan M, Zhu L, Zhang Y, Fessler E, et al. DeepCC: A Novel Deep Learning-Based Framework for Cancer Molecular Subtype Classification. Oncogenesis. 2019;8(9):1–12. pmid:31420533
  77. Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data. Nucleic Acids Research. 2016;44(8):e71. pmid:26704973
  78. Chia SK, Bramwell VH, Tu D, Shepherd LE, Jiang S, Vickery T, et al. A 50-Gene Intrinsic Subtype Classifier for Prognosis and Prediction of Benefit from Adjuvant Tamoxifen. Clinical Cancer Research. 2012;18(16):4465–4472. pmid:22711706
  79. Guiu S, Michiels S, Andre F, Cortes J, Denkert C, Leo A, et al. Molecular Subclasses of Breast Cancer: How Do We Define Them? The IMPAKT 2012 Working Group Statement. Annals of Oncology. 2012;23:2997–3006. pmid:23166150
  80. Linardatos P, Papastefanopoulos V, Kotsiantis S. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy. 2021;23(1).
  81. Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence. 2019;1:206–215. pmid:35603010
  82. Ribeiro MT, Singh S, Guestrin C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD'16. New York, NY, USA: Association for Computing Machinery; 2016. p. 1135–1144. Available from: https://doi.org/10.1145/2939672.2939778.
  83. Hendrickx K, Perini L, Van der Plas D, Meert W, Davis J. Machine Learning with a Reject Option: A Survey; 2021.
  84. Ables J, Kirby T, Anderson W, Mittal S, Rahimi S, Banicescu I, et al. Creating an Explainable Intrusion Detection System Using Self Organizing Maps; 2022.
  85. Fritzke B. Growing Self-Organizing Networks: Why? In: ESANN. vol. 96; 1996. p. 61–72.
  86. Fritzke B. Growing Cell Structures—A Self-Organizing Network for Unsupervised and Supervised Learning. Neural Networks. 1994;7(9):1441–1460.
  87. Sung HY, Yeh HY, Lin JK, Chen SH. A Visualization Tool of Patent Topic Evolution Using a Growing Cell Structure Neural Network. Scientometrics. 2017;111(3):1267–1285.
  88. Fritzke B. Growing Grid—A Self-Organizing Network with Constant Neighborhood Range and Adaptation Strength. Neural Processing Letters. 1995;2(5):9–13.
  89. Gharaee Z. Hierarchical Growing Grid Networks for Skeleton Based Action Recognition. Cognitive Systems Research. 2020;63:11–29.