Fuzziness-based active learning framework to enhance hyperspectral image classification performance for discriminative and generative classifiers

Hyperspectral image classification with a limited number of training samples without loss of accuracy is desirable, as collecting such data is often expensive and time-consuming. However, classifiers trained with limited samples usually end up with a large generalization error. To overcome this problem, we propose a fuzziness-based active learning framework (FALF), in which we implement the idea of selecting optimal training samples to enhance generalization performance for two different kinds of classifiers, discriminative and generative (e.g. SVM and KNN). The optimal samples are selected by first estimating the boundary of each class and then calculating the fuzziness-based distance between each sample and the estimated class boundaries. Those samples that are at smaller distances from the boundaries and have higher fuzziness are chosen as target candidates for the training set. Through detailed experimentation on three publicly available datasets, we show that when trained with the proposed sample selection framework, both classifiers achieve higher classification accuracy and lower processing time with a small amount of training data than when the training samples are selected randomly. Our experiments demonstrate the effectiveness of our proposed method, which compares favorably with the state-of-the-art methods.


Introduction
Remote sensing is a mature field of science and is extensively studied to extract meaningful information from the earth's surface or objects of interest based on their radiance acquired by the given sensors at short or medium distance [1] [2]. One type of remote sensing is hyperspectral sensing, also referred to as "hyperspectral imaging". Hyperspectral imaging has been widely employed in real-life applications such as pharmaceutical and food processing for quality control and monitoring, forensics (ink mismatch detection or segmentation in forensic document analysis), and industrial, biomedical, and biometric applications such as face detection and recognition [3]. Additionally, in recent years, hyperspectral imaging has also been studied in a wide range of urban, environmental, mineral exploration, and security-related applications. Nowadays, researchers are broadly studying hyperspectral image classification techniques for the case of a limited number of training samples, both with and without reducing the dimensionality of hyperspectral data. In this regard, the recent works [4] [5] [6] [7] [8] demonstrate that the choice of classification approach is an important future research direction. Therefore, we discuss some of the main supervised and semi-supervised hyperspectral image classification techniques and their challenges.
Supervised learning techniques, which require class label information, have been widely studied for hyperspectral image classification [9]; however, these learning models face various challenges for hyperspectral image classification including, but not limited to, the high dimensionality of hyperspectral data and an insufficient number of labeled training samples for learning the model [3] [10]. Collecting a large number of labeled training samples is time-intensive, challenging, and expensive because the labels of training samples are selected through human-machine interaction [11].
To cope with the issues that are discussed above, several techniques have been developed. These include discriminant analysis algorithms with different discriminant functions (e.g. nearest neighbor, linear and nonlinear functions) [12] [13], feature-mining [14], decision trees, and subspace-nature approaches [15]. The goal of subspace-nature and feature-mining approaches is to reduce the high dimensionality of hyperspectral data to better utilize the limited accessibility of the labeled training samples. The main problem of discriminant analysis is its sensitivity to the "Hughes phenomenon" [16]. The kernel based methods, like support vector machines (SVMs), have also been used to deal with the Hughes phenomenon or curse of dimensionality [17] [18] [19].
To some extent, semi-supervised approaches have addressed the problem of a limited number of labeled training samples by generating the labels through machine-machine interaction. The primary assumption of semi-supervised classification methods is that the newly labeled samples for learning can be generated with a certain degree of confidence from a limited set of available labeled training samples without considerable cost and effort [20] [21]. Semi-supervised techniques have improved significantly in recent years. For example, in [22], Bruzzone et al. proposed "transductive SVMs"; in [23], Camps-Valls et al. proposed a "graph-based method to exploit the importance of labeled training samples"; and in [24], Velasco-Forero et al. proposed a "composite kernel in graph-based classification method". In [25], Tuia et al. proposed a "semi-supervised SVM using cluster kernels method", whereas in [26], Li et al. explained a "semi-supervised approach which uses a spatial-multi-level logistic prior method". In [27], Bruzzone et al. proposed a "context sensitive semi-supervised SVM method", and Munoz-Mari et al. presented two semi-supervised single-class SVM methods in [28]. Their first technique models the data marginal distribution with a graph Laplacian built with both labeled and unlabeled training samples, whereas the other technique modifies the SVM cost function so that it heavily penalizes the errors made when wrongly classifying samples of the target class. The algorithm proposed in [29] is based on a sample selection bias problem in contrast to [29], [30] where the authors proposed an SVM with a linear combination of two kernels (likelihood and base kernels). The works [31] and [32], by Rattle et al. and Munoz-Mari et al. respectively, exploited a similar concept using a neural network as the baseline classification algorithm. To generate the land-cover maps, they adopted a semi-automatic technique using the active queries concept.
All the techniques discussed above assume that the labeled training samples are limited in number, and these methods enlarge the initial training set by efficiently exploiting the unlabeled samples to address the "ill-posed problem". However, to achieve the desired results, several vital requirements need to be met. For example, the quantity of the generated data should not be too large such that it may increase the computational complexity, and the samples should be properly selected to avoid any confusion in correctly classifying the unseen samples. Above all, the obtained samples and their class labels must be obtained without substantial cost and time.
Active learning techniques can be used to overcome the above-mentioned issues. In general, active learning techniques are referred to as a special subcategory of semi-supervised learning techniques [33] [34]. Without loss of generality, in active learning, the learning model actively requests the user for class information. To this end, the most recent developments are "hybrid active-learning [35]" and "active learning in a single pass-context [36]", which combine the concepts of adaptive and incremental learning from the field of traditional and online machine-learning. These breakthroughs have resulted in a significant number of different active-learning methods such as reported in [11] [53].
In general, the additional labeled samples are selected randomly or by using some information criterion or source of information to query the samples and their class information. Random selection of the training samples is often subjective and tends to bring redundancy into the classifiers. Furthermore, it reduces the generalization performance of the classifiers. Moreover, the number of samples required to learn a model can be much lower than the number of samples actually used. In such scenarios, there is a risk that the learning model may be overwhelmed by the uninformative samples it queries.
To this end, in this work, an active learning framework using a single-sample-view, critical-class-oriented query is proposed for hyperspectral image classification. We call this scheme the fuzziness-based active learning framework (FALF). In FALF, the classifier comes with an integrated data acquisition module that ranks unlabeled samples based on their confidence for the future query that has the maximum learning utility. Thus, the proposed framework aims to achieve the maximum potential of the learning model using both labeled and unlabeled data, whereas the amount of training data can be kept to a minimum by focusing only on the most informative training samples. This process leads to a better utilization of information in the data, while considerably minimizing the cost of labeled data collection and improving the generalization performance of the classifiers.
The primary goal of FALF is to focus on selecting difficult samples for the hyperspectral classification task. In conjunction with "discriminative" and "generative" classifiers, hard-to-predict sample pairs are first identified by using the instability of the classification boundary. Category-level guidance on which sample should be queried next is then provided to the active querier. Samples with higher fuzziness and lower distance to the class boundaries are considered the difficult samples and are queried first. This strategy of identifying the most informative samples is based on the hypothesis that the samples far from the class boundaries have a lower risk of being misclassified than the samples that are closer. Moreover, two selection approaches are implemented and compared. The first approach randomly selects samples based on their entire fuzziness magnitude, whereas the second approach incorporates only the hard-to-predict samples from the higher fuzziness magnitude group.
These methods are developed for a single sample-based critical class query strategy. The experiments are conducted on both AVIRIS and ROSIS-03 hyperspectral data sets.
Classification performance was superior to the state-of-the-art active learning methods. It is worth mentioning that the proposed framework is a two-fold process in which learning is first done in a fully supervised fashion, and then semi-supervised learning is used to select the appropriate candidates for the training set. Furthermore, traditional active learning methodologies add new samples to the training data with their original labels, whereas in the proposed framework, the new samples are added in a semi-supervised fashion with their predicted class labels.
To summarize, the primary contributions of our work are as follows:
• Designing and implementing a new fuzziness-based active learning framework to select the optimal training samples to enhance the classifier's generalization performance for hyperspectral image classification.
• Validation of the effectiveness of the proposed framework for two different kinds of classifiers on three publicly available datasets (both AVIRIS and ROSIS-03 datasets).
• Investigating the potential of the proposed framework to reduce the classification time while maintaining a good accuracy under high dimensionality.

Materials and methods
The main idea of this work is to employ and retain the relationship between misclassification rate of boundary samples and fuzziness for each class to select samples for the training set. The important steps of our proposed algorithm are summarized below:
1. Randomly select 5% of labeled training samples from each class.
2. Train Support Vector Machine (SVM) and Fuzzy K-Nearest Neighbor (FKNN) on randomly selected samples and test them for the rest of the samples.
3. Record the fuzzy membership matrix.
4. Calculate the fuzziness from fuzzy membership matrix for each sample and estimate the distance between the sample and the boundary.
5. Based on the threshold of fuzziness magnitude, divide the samples into two subgroups as lower and higher fuzziness magnitude groups.
6. Determine the correct rate of classification and misclassification (i.e., TP and FP) for each class in both groups individually.
Note that steps 7(A) and 7(B) are two alternative ways to select the samples. In general active learning approaches, the samples are selected through step 7(B), but we propose to select the samples using step 7(A) and compare the accuracies obtained in both ways in the experimental results section.
The intuition behind selecting the hardest correctly predicted samples using step 7(A) is that such samples contain more information about the boundaries than the samples with lower fuzziness magnitude. The threshold value between lower and higher fuzziness is set by trial and error. The proposed methodology significantly boosts the performance of the classifier for hyperspectral image classification, not only improving the accuracy but also reducing the classification time.
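The grouping and selection steps can be illustrated with a minimal Python sketch. It assumes the per-sample fuzziness values and the predicted labels are already available from steps 1 to 4. The function names, the toy data, and the threshold of 0.5 are hypothetical, and the two selection functions are only one plausible reading of steps 7(A) and 7(B), which are not spelled out in the list above.

```python
import random

def split_by_fuzziness(fuzziness, threshold):
    """Step 5: split sample indices into low and high fuzziness groups."""
    low = [i for i, f in enumerate(fuzziness) if f <= threshold]
    high = [i for i, f in enumerate(fuzziness) if f > threshold]
    return low, high

def correct_rate(indices, y_true, y_pred):
    """Step 6 (simplified): fraction of correctly classified samples in a group."""
    if not indices:
        return 0.0
    return sum(y_true[i] == y_pred[i] for i in indices) / len(indices)

def select_hardest_correct(high, fuzziness, y_true, y_pred, n_select):
    """Assumed reading of step 7(A): correctly predicted samples from the
    high fuzziness group, most fuzzy first."""
    correct = [i for i in high if y_true[i] == y_pred[i]]
    correct.sort(key=lambda i: fuzziness[i], reverse=True)
    return correct[:n_select]

def select_random(candidates, n_select, seed=0):
    """Assumed reading of step 7(B): uniform random draw from the candidates."""
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_select, len(candidates)))

# Toy data (hypothetical): six test samples of a two-class problem.
fuzziness = [0.10, 0.95, 0.80, 0.20, 0.99, 0.60]
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]

low, high = split_by_fuzziness(fuzziness, threshold=0.5)
picked = select_hardest_correct(high, fuzziness, y_true, y_pred, n_select=2)
random_pick = select_random(low + high, n_select=2)
print(low, high, picked, correct_rate(high, y_true, y_pred))
```

In this toy case the misclassified sample (index 2) is excluded even though it is highly fuzzy, which matches the stated preference for the hardest correctly predicted samples.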
Here we will theoretically explain the procedure for estimating the boundary of each class, and then how to build a relation between the samples and the estimated boundaries to select the target samples.

Boundary extraction
Generally, there are two kinds of classifiers: those that use some specific formula to estimate class boundaries (discriminative), and others that use some distribution for the same task (generative). For example, [39] [40] [41] used locus approximation on some sample distributions to estimate the class boundary, whereas [42] used an analytical formula. Fuzzy K-Nearest Neighbors (FKNNs) and Support Vector Machines (SVMs) are two representatives of the aforementioned types.
Before we discuss the boundary extraction process for both classifiers, it is helpful to understand the concept of a fuzzy membership function, because we seek the output of each classifier in the form of fuzzy membership grades.

Membership function
Let us assume a set of N sample vectors $\{r_1, r_2, r_3, \ldots, r_N\}$, and a fuzzy partition of these N sample vectors that represents each sample vector's degree of membership to each of the C classes. The fuzzy C partition has certain characteristics as defined below:

$$\sum_{i=1}^{C} \mu_{ij} = 1, \qquad 0 < \sum_{j=1}^{N} \mu_{ij} < N$$

where $\mu_{ij} \in [0, 1]$, and $\mu_{ij} = \mu_i(r_j)$ is a function that represents the membership (a value in [0, 1]) of the $j$th sample $r_j$ in the $i$th partition, with $i \in \{1, 2, 3, \ldots, C\}$ and $j \in \{1, 2, 3, \ldots, N\}$.
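As an illustration, the following Python sketch checks whether a membership matrix is a valid fuzzy partition. It assumes the usual fuzzy C-partition constraints: every grade lies in [0, 1], the grades of each sample sum to one across classes, and no class is empty or holds every sample completely. The function name and the tolerance are hypothetical.

```python
def is_fuzzy_partition(M, tol=1e-9):
    """M[i][j] is the membership of sample j in class i (C rows, N columns)."""
    C, N = len(M), len(M[0])
    # Every grade must lie in the closed interval [0, 1].
    in_range = all(0.0 <= M[i][j] <= 1.0 for i in range(C) for j in range(N))
    # The grades of each sample must sum to one across the C classes.
    columns_ok = all(
        abs(sum(M[i][j] for i in range(C)) - 1.0) <= tol for j in range(N)
    )
    # No class may be empty or absorb every sample completely.
    rows_ok = all(0.0 < sum(M[i]) < N for i in range(C))
    return in_range and columns_ok and rows_ok

valid = [[0.9, 0.2, 0.5],
         [0.1, 0.8, 0.5]]
invalid = [[0.9, 0.2],
           [0.3, 0.8]]  # the first sample's grades sum to 1.2

print(is_fuzzy_partition(valid), is_fuzzy_partition(invalid))
```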

Support Vector Machine (SVM)
SVM aims to find the optimal hyperplane by maximizing the margin on the training data. In SVM, data are mapped from the input space into a high-dimensional feature space using an implicit function $\varphi$; this mapping is directly associated with a kernel function $K(r_i, r_j)$, which satisfies $K(r_i, r_j) = \langle \varphi(r_i), \varphi(r_j) \rangle$. In the kernel function, the terms $r_i$ and $r_j$ denote the $i$th and $j$th training samples, respectively. The mathematical hypothesis of SVM is given by:

$$f(r) = \operatorname{sign}\left( \sum_{i} \alpha_i c_i K(r_i, r) + b \right)$$

In the above equation, $c_i$ is the $i$th class label, and $b$ and $\alpha_i$ are unknown parameters that are determined by quadratic programming. Furthermore, $\alpha = (\alpha_i)$ is a vector of non-negative Lagrange multipliers; the solution vector is sparse, and the samples $r_i$ that correspond to nonzero $\alpha_i$ are called support vectors. Thus, the samples $r_i$ corresponding to $\alpha_i = 0$ make no contribution to the construction of the optimal hyperplane. In the literature, one can find several extensions of SVM [42] and open tools such as LIBSVM [43], which have produced acceptable performance in hyperspectral image classification. SVM has limitations when training on a large number of samples, in terms of time and computation. To cope with these difficulties, we can take advantage of fuzzy class membership to filter the samples based on fuzziness magnitude. In this work, we use the class membership as expressed in [44].
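To make the role of the support vectors concrete, the sketch below evaluates a kernel decision function of the standard form, the sign of the weighted kernel sum plus the bias, with a polynomial kernel. The support vectors, multipliers, and bias are hand-picked hypothetical values rather than the result of quadratic programming, and the kernel degree and offset are assumptions.

```python
def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def svm_decision(r, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """Weighted kernel sum over the support vectors plus the bias term."""
    return sum(
        a * c * kernel(sv, r)
        for a, c, sv in zip(alphas, labels, support_vectors)
    ) + b

def svm_predict(r, support_vectors, labels, alphas, b):
    """Class label in {+1, -1} given by the sign of the decision value."""
    return 1 if svm_decision(r, support_vectors, labels, alphas, b) >= 0 else -1

# Hypothetical, hand-picked model: only samples with nonzero alpha appear here.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
bias = 0.0

print(svm_decision((2.0, 0.0), support_vectors, labels, alphas, bias))
print(svm_predict((-2.0, 0.0), support_vectors, labels, alphas, bias))
```

Samples whose multiplier is zero never enter the sum, which is why only the support vectors need to be stored at prediction time.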

Fuzzy K-Nearest Neighbors (FKNN)
FKNN produces its output as a vector of class memberships in which each component belongs to the closed interval [0, 1]. If every component of this vector is equal to 0 or 1, then the algorithm behaves like a common KNN. The FKNN search is similar to the traditional KNN search. In traditional KNN, each sample can only belong to one class, which is the majority class in the KNN search, whereas in FKNN, a sample can belong to multiple classes with different membership degrees associated with these classes. FKNN can be summarized in the following steps:
1. First find the K nearest neighbors $r_j$, $j \in \{1, 2, 3, \ldots, K\}$, of the given sample $r$ from the set of samples using the Euclidean distance function.
2. Evaluate the membership function values for each class. FKNN obtains the membership of a sample as:

$$\mu_i(r) = \frac{\sum_{j=1}^{K} \mu_i(r_j) \, \| r - r_j \|^{-2/(m-1)}}{\sum_{j=1}^{K} \| r - r_j \|^{-2/(m-1)}}$$

In the above equation, $\| r - r_j \|$ is the Euclidean distance and $\mu_i(r_j)$ is the membership value of the point $r_j$ for the $i$th class. The parameter $m$ controls the effective magnitude of the distance of the prototype neighbors from the sample under process [40]. The value of $m$ can also be tuned through cross-validation along with the value of K, where K is the number of neighbors.
3. The class of sample $r$ is chosen by the given formula:

$$c(r) = \arg\max_{i \in \{1, \ldots, C\}} \mu_i(r)$$

where C is the total number of classes; therefore, the decision boundary is the locus expressed by (4), where $\mu_i^{*}$ is the permutation of $\mu_i$ in decreasing order.

$$\{\, r \mid \mu^{*}_{(i, \ldots, C)}(r) = \mu^{*}_{(j, \ldots, C)}(r), \; i \neq j \,\} \tag{4}$$

Based on the boundary extraction processes of the two classifiers discussed above, we can conclude that the estimated class boundary strongly depends on the criteria of the classification algorithm, even for identical training samples. The discussion of FKNN indicates that the boundary extraction process cannot be expressed explicitly in the same way as for formula-based classification methods such as SVM. Thus, it is not easy to identify whether a training sample is far from or close to the classification boundary, especially when the classification boundary cannot be expressed as a mathematical formula. Therefore, the difference between the actual and the estimated classification boundary is considered an important index for evaluating the generalization capability of any classification technique.
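The three FKNN steps can be sketched in a few lines of Python. This is a minimal illustration that assumes the inverse-distance weighting with exponent 2/(m - 1) commonly used for fuzzy K-nearest neighbors; the function name, the handling of exact matches, and the toy data are hypothetical.

```python
import math

def fknn_memberships(r, samples, memberships, k=3, m=2.0):
    """Class memberships of query r from its k nearest labeled samples.

    samples: list of feature vectors.
    memberships: memberships[j][i] is the grade of sample j in class i.
    """
    # Step 1: k nearest neighbors under the Euclidean distance.
    nearest = sorted((math.dist(r, s), j) for j, s in enumerate(samples))[:k]
    n_classes = len(memberships[0])
    # Assumption: an exact match simply inherits the memberships of that
    # neighbor, which also avoids a division by zero in the weights.
    if nearest[0][0] == 0.0:
        return list(memberships[nearest[0][1]])
    # Step 2: inverse-distance weights with exponent 2 / (m - 1).
    weights = [d ** (-2.0 / (m - 1.0)) for d, _ in nearest]
    total = sum(weights)
    return [
        sum(w * memberships[j][i] for w, (_, j) in zip(weights, nearest)) / total
        for i in range(n_classes)
    ]

# Toy data (hypothetical): four crisp-labeled samples on a line, two classes.
samples = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
memberships = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

mu = fknn_memberships((2.0, 0.0), samples, memberships, k=3, m=2.0)
# Step 3: assign the class with the largest membership.
predicted = max(range(len(mu)), key=lambda i: mu[i])
print(mu, predicted)
```

Because the weights are normalized, the returned memberships sum to one whenever the neighbors' own memberships do.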
In this work, we initialize the learning model with a specific percentage of randomly selected samples, but one can control the size of samples by adjusting neighbors when assigning fuzzy class memberships to the training samples $r_i$; therefore, the training set is mapped to a fuzzy training set $(r_{(1, \ldots, N)}, l_{(1, \ldots, N)}, \mu_{(1, \ldots, N)})$, where each membership value is assigned independently.

Fuzziness relation between the samples and boundary
In the above sections, we described the process to extract the boundary of each class; the problem at hand now is how to identify whether a given sample is close to or far from the class boundary. To cope with this problem, let us assume the output of a classifier for a specific sample is a fuzzy vector $\mu^{T}_{(1, 2, 3, \ldots, n)}$ in which each component is a number within the closed interval [0, 1]. These numbers represent the fuzzy membership grades with which the individual sample fits into the corresponding classes.
For readers' ease, let us consider that the distance between the class boundary and a sample with output $(\mu_i, \mu_j)^T$ can be estimated using Eq (5), which we will further combine with the fuzziness properties:

$$d = |\mu_i - 0.5| + |\mu_j - 0.5| \tag{5}$$

This relation is stated below in the form of a corollary; however, for better understanding, it is important to first explain the concept of fuzziness.
Axiom 3 is known as the sharpened order, where μ and σ are fuzzy subsets of a crisp set, and μ ≼ σ means μ is less sharpened than σ and hence μ has more fuzziness than σ. Since F(R) is not totally ordered, there are many pairs of fuzzy sets that are not comparable under ≼; a measure of fuzziness, on the contrary, provides a total order.
Measuring fuzziness. Consider a fuzzy set $R = \{\mu_{(1, 2, 3, \ldots, n)}\}$; then the fuzziness of R can be defined as

$$E(R) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \mu_i \log \mu_i + (1 - \mu_i) \log (1 - \mu_i) \right]$$

The above expression attains its maximum when the membership degree of each sample is equal to 0.5 and its minimum when every sample either fully belongs to the fuzzy set or does not belong to it at all. In this work, the term fuzziness refers to a kind of cognitive uncertainty.

We further extend it to a fuzzy partition of the given training samples $(r_i)_{i=1}^{P}$, $P \ll N$, that assigns the membership degree of each sample to C classes as $M = (\mu_{ij})_{C \times P}$, where $\mu_{ij} = \mu_i(r_j)$ is the membership of the $j$th sample $r_j$ in the $i$th class. The elements of the membership matrix should follow the properties defined in the above section. The membership matrix upon P training samples is therefore obtained once the training procedure completes. For the $j$th sample, the classifier produces an output vector represented as a fuzzy set $\mu_j = (\mu_{1j}, \mu_{2j}, \mu_{3j}, \ldots, \mu_{Cj})^T$, so by the above equation the fuzziness of the classifier on that sample can be written as:

$$E(\mu_j) = -\frac{1}{C} \sum_{i=1}^{C} \left[ \mu_{ij} \log \mu_{ij} + (1 - \mu_{ij}) \log (1 - \mu_{ij}) \right]$$

Finally, the fuzziness of the membership matrix upon P training samples for C classes can be defined as:

$$E(M) = -\frac{1}{CP} \sum_{i=1}^{C} \sum_{j=1}^{P} \left[ \mu_{ij} \log \mu_{ij} + (1 - \mu_{ij}) \log (1 - \mu_{ij}) \right]$$

The above expression defines the training fuzziness. In hyperspectral space, a classifier's fuzziness is computed as the averaged fuzziness over the entire hyperspectral space. However, the fuzziness for the testing phase is unknown. For any supervised and semi-supervised classification problem, there is a premise that "the training samples have a distribution identical to the distribution of samples in the entire space". Therefore, the above equation can be used to calculate a classifier's fuzziness. The following corollary gives further insight into the fuzziness relation between samples and boundary.
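The fuzziness computations can be illustrated with a short Python sketch. It assumes the logarithmic (entropy-style) fuzziness measure, which is consistent with the stated maximum at a membership of 0.5 and minimum at 0 or 1; the base-2 logarithm (which scales the maximum to 1) and the function names are assumptions.

```python
import math

def grade_fuzziness(p):
    """Contribution of one membership grade: 0 at p = 0 or 1, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def fuzziness(mu):
    """Fuzziness of one output vector: mean contribution of its C components."""
    return sum(grade_fuzziness(p) for p in mu) / len(mu)

def training_fuzziness(M):
    """Fuzziness of a C x P membership matrix: mean over all C * P entries."""
    entries = [p for row in M for p in row]
    return sum(grade_fuzziness(p) for p in entries) / len(entries)

crisp = fuzziness([1.0, 0.0, 0.0])   # sample fully assigned to one class
ambiguous = fuzziness([0.5, 0.5])    # sample on a two-class boundary
M = [[0.9, 0.5],
     [0.1, 0.5]]                     # 2 classes x 2 samples
overall = training_fuzziness(M)
print(crisp, ambiguous, overall)
```

Averaging over all entries of the matrix gives the same value as averaging the per-sample fuzziness over the P samples, since every sample contributes C entries.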
Corollary. Suppose a binary class problem with two samples $(r_i, r_j)$ and distances $(d_i, d_j)$, where $d_i$ is the distance between sample $r_i$ and the classification boundary, and $d_j$ is the distance between sample $r_j$ and the classification boundary.
Furthermore, α and β are the outputs of the classifier on samples $r_i$ and $r_j$. According to [39], if $d_i$ is less than $d_j$, then the fuzziness of α should be greater than that of β, which means that the fuzziness of $r_i$ is no less than that of $r_j$. This phenomenon is further illustrated in Fig 1. To prove the statement, let us assume that the outputs of the classifier on $r_i$ and $r_j$ are of the form $\alpha = (\alpha_1, \alpha_2)^T$ and $\beta = (\beta_1, \beta_2)^T$, respectively. Therefore, by Eqs (4) and (5), the boundary and the distance between the boundary and a sample with output $(\alpha_1, \alpha_2)^T$ can be estimated as $\{ r \mid \alpha_1(r) = \alpha_2(r) = 0.5 \}$ and $|\alpha_1 - 0.5| + |\alpha_2 - 0.5|$, respectively. By the above definition, we can find the distances for each membership value as:

$$d_i = |\alpha_1 - 0.5| + |\alpha_2 - 0.5|, \qquad d_j = |\beta_1 - 0.5| + |\beta_2 - 0.5|$$

In addition, by using the above relations, we can further find a suitable threshold value to identify a sample as either close to or far from the boundary. Let us assume that $\alpha_1 \geq \alpha_2$ and $\beta_1 \geq \beta_2$. This implies that $\alpha_1 \geq 0.5$ and $\beta_1 \geq 0.5$, which, since the two memberships of each sample sum to one, results in $d_i = 2(\alpha_1 - 0.5)$ and $d_j = 2(\beta_1 - 0.5)$. Based on our assumption, $d_i$ is less than $d_j$, therefore $\alpha_1 < \beta_1$. Thus, the sharpened order axiom $\alpha_1 \preceq \beta_1$ yields the fuzziness inequality $E(\alpha_1) > E(\beta_1)$; but by definition, we know that $E(\alpha_1) \geq E(\alpha_2)$ and $E(\beta_1) \geq E(\beta_2)$; therefore, we can conclude that $E(\alpha) > E(\beta)$.
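A small worked instance may help to check the corollary numerically. It assumes the logarithmic fuzziness measure with base-2 logarithms and two outputs whose memberships sum to one; the specific values 0.6 and 0.9 are chosen only for illustration.

```latex
\begin{align*}
\alpha &= (0.6,\ 0.4)^T, & d_i &= |0.6 - 0.5| + |0.4 - 0.5| = 0.2 = 2(0.6 - 0.5),\\
\beta  &= (0.9,\ 0.1)^T, & d_j &= |0.9 - 0.5| + |0.1 - 0.5| = 0.8 = 2(0.9 - 0.5),\\
E(\alpha) &= -\left(0.6 \log_2 0.6 + 0.4 \log_2 0.4\right) \approx 0.971,\\
E(\beta)  &= -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right) \approx 0.469.
\end{align*}
```

Here $d_i < d_j$ and $E(\alpha) > E(\beta)$: the sample closer to the boundary has the higher fuzziness, as the corollary states.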
The above mathematical evidence shows that the samples far from the classification boundary have low fuzziness as compared to the samples that are near the classification boundary. This phenomenon is relatively simple, and in a binary class problem with linearly separable samples it is easy to judge for each sample whether it is near to or far from the classification boundary with some threshold value. The problem becomes trickier in the case of complex boundaries with nonlinear mixtures. In such situations, we have three possibilities:
• The samples actually belong to the region where they are supposed to be, with high or low fuzziness,
• The samples belong to the other region where they are not supposed to be, with high or low fuzziness, and
• Homogeneous mixtures, i.e. non-distinguishable regions without any prerequisite conditions to make them distinguishable.
The first two cases belong to heterogeneous-type mixtures, and can easily be solved, but the third case is trickier. To cope with the third case, we suggest measuring the correct rate of classification and misclassification from each class while considering the fuzziness subgroups. This can also be solved by applying any filter, which will recursively pass the same distribution of samples at once based on their class.

System validation
Hyperspectral image classification with an optimal number of labeled training samples is a fundamental and challenging task. In practice, the availability of labeled training samples is often insufficient for hyperspectral image classification, and in such scenarios, the classification methods are generally either overwhelmed with uninformative samples or suffer from the undersampling problem. Thus, in this work, we investigate the performance of the above-mentioned classifiers as a function of the training set size, varying from a minimum of 5% to a maximum of 25% per class (i.e., 5%, 10%, 15%, 20%, and 25%).

Experimental setup
In all experiments, the parameters of the classifier are chosen as those that provide the best training accuracy. To avoid any bias, all the experiments are done within the same fixed settings, which maximize the training accuracies. All the initial parameters are evaluated in the first few experiments. Once the parameters stop changing, the search for optimal parameters is stopped and the resulting values are used for the further experiments. We implemented SVM with a polynomial kernel function and FKNN with 10 nearest neighbors and the Euclidean distance function.
In all experiments, the terms SVM and FKNN refer to the classifiers trained on samples selected using step 7(B), whereas the terms PSVM and PFKNN are used for the cases where the classifiers are trained using the samples selected using step 7(A) as explained in the methodology section. To this end, the first goal is to compare the performance of PSVM and PFKNN against that of SVM and FKNN, respectively. The second goal is to compare the performance of PSVM, PFKNN, FKNN, and SVM with state-of-the-art active learning frameworks.
The Kappa (κ) coefficient and overall accuracy are analyzed using a five-fold cross-validation process for different numbers of training samples on all three datasets. It is worth noting that the training accuracy is not 100% and might include some error in terms of fuzziness estimation. All the experiments are carried out using MATLAB (2014b) on an Intel® Core™ i5 CPU at 3.20 GHz with 8 GB of RAM, running a 64-bit operating system.

Experimental datasets
The performance of the proposed FALF method is validated on three widely used publicly available hyperspectral datasets using two different classifiers with two different ways to select the target samples.
The ROSIS-03 optical sensor acquired the Pavia University (PU) and Pavia Centre (PC) data over an urban area of northern Italy. The PU and PC datasets consist of 610 × 340 and 1096 × 710 samples with 115 and 102 bands, respectively. For the PU data, 12 noisy bands were removed prior to the analysis and the remaining 103 bands were used in our experiments. The ground truths differentiate 9 different classes in both datasets.
The third dataset was acquired by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor. The Indian Pines (IP) dataset consists of 145 × 145 samples and 220 spectral bands with a spatial resolution of 20 m and a spectral range of 0.4-2.5 μm. Twenty noisy bands were removed prior to the analysis, and the remaining 200 bands were used in our experimental setup. The removed bands are 104-108, 150-163, and 220. The Indian Pines dataset consists of 16 classes. All three datasets can be freely obtained from [47] [48].

Experimental results
The Kappa (κ) coefficient and overall accuracy are considered as the evaluation metrics since these are widely used in existing works. The kappa coefficient is obtained by using the expression given below [49].

$$\kappa = \frac{N \sum_k C_k - \sum_k \sigma_k \varphi_k}{N^2 - \sum_k \sigma_k \varphi_k}$$

In the above equation, N is the total number of samples, $C_k$ represents the number of correctly predicted samples in the given class, $\sum_k C_k$ is the sum of the number of correctly predicted samples, $\sigma_k$ is the actual number of samples belonging to the given class, and $\varphi_k$ is the number of samples that have been predicted into the given class [49].
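As an illustration, both metrics can be computed from a confusion matrix with the short Python sketch below. It assumes the standard confusion-matrix form of the kappa coefficient, in which the chance agreement is built from the products of actual and predicted class counts; the function name and the example matrix are hypothetical.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[t][p] is the number of samples of true class t predicted as p."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Correctly predicted samples lie on the diagonal.
    correct = sum(confusion[i][i] for i in range(k))
    # Actual (row totals) and predicted (column totals) counts per class.
    actual = [sum(confusion[i]) for i in range(k)]
    predicted = [sum(confusion[t][i] for t in range(k)) for i in range(k)]
    chance = sum(a * p for a, p in zip(actual, predicted))
    oa = correct / n
    # Undefined (division by zero) if every sample falls into a single cell.
    kappa = (n * correct - chance) / (n * n - chance)
    return oa, kappa

# Hypothetical two-class confusion matrix with 100 samples.
confusion = [[45, 5],
             [10, 40]]
oa, kappa = overall_accuracy_and_kappa(confusion)
print(oa, kappa)
```

For this matrix the overall accuracy is 0.85 while kappa is 0.7, which shows how kappa discounts the agreement expected by chance.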
The average performance comparison of the proposed algorithm with each classifier is shown in Figs 2 and 3. These figures show the average classification accuracy and kappa coefficient analysis for both classifiers trained by randomly selected samples (Step 7(B)), and the same classifiers trained by using hardest predicted samples (Step 7(A)).
As explained earlier, we set the minimum training sample size to 5% for the first experiment, and in each subsequent experiment we increase the size by 5% with newly selected samples. In the extreme case, the sample size is not more than 25% of the entire population. Based on the analysis shown in Figs 2 and 3, for the IP and PU datasets, the PSVM classifier outperforms the rest of the classifiers. Across the different training set sizes, SVM and FKNN improve only slightly, whereas PFKNN improves the accuracy markedly when the training set grows from 5% to 10% on both the IP and PU datasets.
[...] computational cost for the same datasets, which indicates that the quantity of hardest predicted samples slightly influences the computational time of PSVM as compared to SVM. On the other hand, PFKNN obtained almost the same or a slightly lower computational time as compared to FKNN for all datasets in different experiments. In both experiments, the comparison between randomly selected samples and the hardest predicted samples is shown for the IP and PU datasets. Moreover, the PC dataset is used to show the performance only for the hardest predicted samples.
As shown in Fig 4, the computational cost gradually increases as the size of data increases in the PC dataset. Therefore, it is crucial to deal with such high computational time. Certain possible solutions can be applied to solve this problem. For example, one approach is to split the dataset into small regions and then build a separate classifier for each of the sub-regions. However, for this strategy to work well, there is another problem of how to conduct the data splitting such that it does not degrade the classification performance. Figs 5 and 6 show the classification maps for the PU and IP datasets, respectively. The complete hypotheses on both datasets processed by the proposed FALF method to select the most informative samples to retrain the classifier, and the same classifier, trained on randomly [...] Tables 1, 2 and 3, which show average statistics for all experimental datasets with a different number of training samples for each classifier.
We have abbreviated the test names in Tables 1, 2 [...] The false omission rate (FOR) and the false discovery rate (FDR) are statistical measures used in multiple hypothesis testing to correct for multiple comparisons. FOR measures the proportion of predicted negatives that are actually positive; it is computed by using FN and TN, and it can also be computed by taking the complement of the negative predictive value (NPV). FDR measures the proportion of predicted positives that are actually negative and is computed by using FP and TP.
For class-based classification judgments, we have done two statistical analyses, which are presented in Figs 8 and 9, in which we show the average statistics for all classes with a different number of training samples for each classifier. Figs 8 and 9 show the sensitivity and specificity of the classification analysis on all three datasets for each class. PFKNN and PSVM behave quite similarly for different classes, as one can see from the figures.
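The per-class statistics mentioned above can be computed from the four confusion counts of a class. The sketch below uses the standard definitions, with FOR taken as the complement of the negative predictive value; the function name and the example counts are hypothetical.

```python
def class_rates(tp, fp, tn, fn):
    """Standard per-class rates from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "NPV": tn / (tn + fn),          # negative predictive value
        "FOR": fn / (fn + tn),          # false omission rate = 1 - NPV
        "FDR": fp / (fp + tp),          # false discovery rate
    }

# Hypothetical counts for one class treated as the positive class.
rates = class_rates(tp=40, fp=5, tn=45, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

In a multi-class setting, each class is treated in turn as the positive class and all remaining classes as the negative class before these rates are averaged.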

Comparison with state-of-the-art
To evaluate the performance of our proposed framework, we compare it with the following state-of-the-art methods. All competing methods are evaluated on two publicly available real hyperspectral datasets, and the average performance over 5-fold cross-validation is presented. The detailed performance comparison of the proposed algorithm with the state-of-the-art methods defined below is presented in Tables 4 and 5. These tables show that the proposed framework outperforms the state-of-the-art active learning frameworks because of our careful sample selection using a twofold learning hierarchy. In traditional active learning frameworks, the supervisor selects the samples in an iterative fashion, whereas the proposed model selects the samples systematically by machine-machine interaction, without involving any supervisor, in a computationally efficient fashion for high-dimensional hyperspectral datasets.

Discussion
Many classical active learning frameworks in the literature are similar to the proposed framework. For example, the work proposed by Lughofer in [36] focused on online learning and was specifically designed for "an on-line single-pass setting in which the data stream samples arrive continuously". Such methods do not allow the classifier to be retrained for the next round of sample selection. Furthermore, Lughofer uses the closely related concepts of conflict and ignorance. Conflict models how close a query point is to the actual decision boundary, and ignorance represents the distance between a new query point and the training samples seen so far. Our membership concept is close to these indicators, but we are able to capture both the distance from the class boundary and the in-class variance in one parameter. In addition, unlike [36], we implemented and validated our active learning approach for the hyperspectral image classification problem.
In contrast to [36], Nie et al. proposed another active learning framework in [53], in which the authors focused only on early active learning strategies, i.e., solving the early-stage experimental design problem. The Transductive Experimental Design (TED) method was proposed to select the data points, and for this the authors proposed a novel robust active learning approach that uses structured sparsity-inducing norms to relax the NP-hard objective to a convex formulation. Their framework thus focused only on selecting an optimal set of initial samples to kick-start the active learning procedure. The benefit of our framework, however, is that it shows state-of-the-art performance independent of how the initial samples are selected. Of course, the framework proposed by Nie et al. can easily be integrated with ours by executing it in place of the first step of our algorithm. In our work, we start evaluating our hypotheses from 5% of randomly selected training samples, and we demonstrate that randomly adding more samples (step 7(B)) back into the training set slightly increases accuracy but makes the classifiers more computationally complex. Therefore, we decided to separate out the samples that were most difficult to predict in our first phase of classification (the samples with a fuzziness magnitude in the range 0.7-1.0). We then fuse a specific percentage of these hard-to-predict samples back into the original training set and retrain the classifier from scratch for better generalization and classification performance on the samples that were initially misclassified.
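The selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes fuzziness is measured with the De Luca-Termini formula using base-2 logarithms (so values lie in [0, 1]), that each sample comes with a vector of class memberships produced by the first-phase classifier, and that the "specific percentage" is taken from the most fuzzy candidates first. The names `fuzziness`, `select_hard_samples`, and the example memberships are hypothetical.

```python
import math


def fuzziness(memberships):
    """De Luca-Termini fuzziness of one membership vector, scaled to [0, 1]."""
    total = 0.0
    for mu in memberships:
        # Memberships of exactly 0 or 1 contribute nothing (x*log(x) -> 0).
        if 0.0 < mu < 1.0:
            total -= mu * math.log2(mu) + (1.0 - mu) * math.log2(1.0 - mu)
    return total / len(memberships)


def select_hard_samples(membership_rows, fraction, low=0.7, high=1.0):
    """Indices of the top `fraction` of samples whose fuzziness lies in [low, high]."""
    scored = [(fuzziness(row), i) for i, row in enumerate(membership_rows)]
    candidates = [(f, i) for f, i in scored if low <= f <= high]
    # Assumption: prefer the most fuzzy candidates when taking a percentage.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    n_pick = math.ceil(fraction * len(candidates))
    return [i for _, i in candidates[:n_pick]]


# Hypothetical first-phase memberships for five samples and two classes.
rows = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4], [1.0, 0.0], [0.8, 0.2]]
selected = select_hard_samples(rows, fraction=0.5)
```

The returned indices would then be merged into the original training set before the classifier is retrained from scratch.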
It is worth noting from the experiments that adding hard-to-predict samples back into the training set improves the performance on the samples that were misclassified in the first phase. For the IP and PU datasets, we have shown experimentally that randomly adding samples back into the training set does not provide the desired accuracy, whereas adding the samples selected by the proposed FALF framework boosts the performance of the classifier. We further validate our hypotheses on the PC dataset, which also yields good accuracy in a computationally efficient way.
The second most important factor in the training and testing phases is computational time, which is significantly improved for both classifiers. To make the model efficient and quick, we fuse the most difficult and informative samples back into the training set to retrain the classifier in each experiment. The classification accuracy and Kappa (κ) test results are significantly improved, as can be seen from Figs 2 to 9. Tables 1, 2 and 3 present the average statistical test results on predicted samples, which show the model's ability to correctly classify the unseen samples from each class.
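For reference, overall accuracy and Cohen's kappa can both be computed from a confusion matrix. The sketch below uses the standard definition κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance; the function name and the example matrix are hypothetical and do not reproduce any reported result.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[k][k] for k in range(n_classes)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[k]) * sum(confusion[i][k] for i in range(n_classes))
        for k in range(n_classes)
    ) / (total * total)
    # Kappa is undefined when expected == 1; returning 0.0 there is a convention.
    kappa = (observed - expected) / (1.0 - expected) if expected < 1.0 else 0.0
    return observed, kappa


# Hypothetical 3-class confusion matrix.
accuracy, kappa = overall_accuracy_and_kappa([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])
```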

Conclusion
Hyperspectral image classification with a limited number of training samples is a challenging problem. To improve the classification performance in such cases, this paper proposed retraining the classifier using the most informative samples. These samples are identified by first estimating the boundary of each class and then calculating the fuzziness-based distance between each sample and the estimated class boundaries. The hardest correctly classified samples with smaller distances and higher fuzziness are selected as appropriate candidates for the training set used to retrain the classifier.
Through several experiments, we show that for an image classification task we can start with only 5% of the training samples and then use the proposed FALF framework to select only a small number of new samples to train the classifier from scratch, which significantly boosts the classifier's generalization performance on unseen samples.
It is worth noting that the proposed method is not classifier-sensitive, i.e., the derived relation holds if we change the classification model, for example from locus approximation to an analytical formula-based classifier.