Choosing the Most Effective Pattern Classification Model under Learning-Time Constraint

Nowadays, large datasets are common and demand faster and more effective pattern analysis techniques. However, methodologies to compare classifiers usually do not take into account the learning-time constraints required by applications. This work presents a methodology to compare classifiers with respect to their ability to learn from classification errors on a large learning set, within a given time limit. Faster techniques may acquire more training samples, but only when they are more effective will they achieve higher performance on unseen testing sets. We demonstrate this result using several techniques, multiple datasets, and typical learning-time limits required by applications.


Introduction
Advances in digital technologies make large datasets commonly available, which demands faster and more effective pattern analysis techniques. However, methodologies to find the most suitable technique for a given application do not usually take into account the learning-time constraint required by the application. One may argue that parallel processing is possible in many situations and that machines are faster, but in practice datasets grow fast and opportunities for new applications continually emerge.
Consider a large database of face images obtained from many individuals through video cameras and all possible applications involving face recognition and verification. When a cell phone user uploads a video of her face to that database, such that a classifier can be trained to identify her and unlock the cell phone, the learning time for this application should not take more than a few seconds. In computer-assisted diagnosis of parasites [1], each microscopy slide may contain hundreds of thousands of image components to be classified either as impurity or as some type of parasite. Possible variations in the preparation of the slides, due to the choice of reagent brands or human operator, demand retraining and updating the classifier from time to time. The whole process should not take longer than a few minutes. Other applications, such as interactive segmentation of medical [2] and natural images [3], require real-time response to the user's actions. Fig 1 illustrates an example based on the method described in [3]. The user draws labeled markers (a training set) inside and outside the object, and segmentation is based on optimum paths from the competing markers in an image graph (Fig 1a). These markers can easily contain tens of thousands of pixels, as modern digital cameras can take pictures with tens of millions of pixels. Image segmentation first relies on a pixel classifier, which is trained from the markers to create a fuzzy object map (Fig 1b). Second, the image is interpreted as a graph, whose pixels are the nodes and whose arcs between pixels are weighted based on intensity differences from the image and the fuzzy object map (Fig 1c). For higher effectiveness, the object should appear brighter than the background in the fuzzy object map and arc weights should be lower on the object's border than elsewhere.
The visual feedback from Fig 1a-1c guides the user to the image location where more markers must be selected, improving segmentation over a few interventions (Fig 1d-1f). Note that the user should not have to wait longer than one second for a response after each intervention.
In general, applications in which a classifier must be trained upon user request (e.g., to answer a query performed by the user) should provide interactive response time. Even without user interaction, applications that require parameter optimization, using the accuracy of a classifier as the criterion function, also require training and testing the classifier several times. As datasets grow large, this becomes a problem. A good example occurs when learning the architecture of a convolutional neural network for feature extraction and classification, a hot topic nowadays. The response time does not need to be interactive, but processing time limitations may compromise the success of the optimization procedure by reducing the search space. Feature selection from large datasets is a similar example. Face recognition in mobile devices may be considered a future application that will become reality with the advances of cloud computing and communication networks. Currently, mobile devices provide mechanisms for face recognition that are independent of the user; i.e., the design of the classifier does not consider the most informative samples to distinguish a particular user from other individuals with similar face characteristics. As a consequence, face recognition in mobile devices does not work properly. In order to have a robust face recognition system, it is desirable to learn the most informative samples from a large negative dataset, which could be stored and processed in a cloud system. During user enrollment, the important negative samples could be mined and, together with the face samples from the user, used to train a user-specific classifier, which would be transmitted back to the mobile device, with the whole process performed in interactive time.
Furthermore, for research purposes, where some techniques commonly take several minutes or hours to train a classifier, it is desirable to reduce the learning time to no longer than a few minutes. After all, we may need to repeat experiments hundreds of times in order to obtain statistically significant results, and the increasing size of the datasets may also prevent us from training the classifier with all labeled samples.
Methodologies to compare pattern classification models usually fix features, training samples, test samples, and accuracy measures for all classifiers. This approach is adequate when evaluating the effectiveness of different techniques under the same conditions, but it contemplates neither learning-time constraints from the applications nor the fact that faster classifiers may be able to achieve higher performance on the same unseen testing set, given an allowance for a larger training set. For fairness with faster techniques and from a practical viewpoint, it is important to relax the constraint of a fixed training set for all classification models, provided that: (1) the training samples come from the same large learning set, and (2) all techniques must choose their own training samples and complete training within a pre-established learning-time limit, granted by the application.
In this work, we propose a methodology consistent with the above conditions. At first, a large dataset is randomly divided into learning and testing sets. The learning set should be large enough to contain representative samples from all classes. Given the learning-time limit allowed by a given application, it is usually not feasible to train a classifier using all learning samples. Therefore, the proposition is to start with a very small training set composed of randomly selected samples from the large learning set. This initial training set is the same for all classification models. Each classifier is then trained and subsequently evaluated on the learning set. By assembling a subset of randomly chosen misclassified samples, containing no more than the number of samples in the current training set, and incorporating it into the training set, we set the stage for a new learning cycle. This three-step procedure (training, evaluation, and error selection) is then repeated until either (i) the learning-time limit from the application is reached or (ii) the number of errors becomes zero. In this way, faster techniques may acquire more error samples and complete the learning process with larger training sets. However, for their performance to be better on the unseen testing set, they must be more effective in learning from their own errors. Moreover, in order to achieve statistically significant results, this entire process, with distinct learning and test sets, has to be repeated several times.
We have evaluated our methodology on large datasets, using several techniques, and subject to the learning-time limits typically required by applications. We categorized these time limits into very interactive (less than 1s), interactive (from 1s to 5s), nearly interactive (from 5s to 1 minute), and non-interactive (above 1 minute). As this work aims at presenting a new methodology for the evaluation of classifiers and not at endorsing any particular technique, we opted for standard implementations of each of the techniques used, since choosing optimized implementations might skew the results unfairly.

Fig 1. (a) The user draws labeled markers (a training set) inside and outside the object, and segmentation is based on optimum path competition from the markers in an image graph. (b) Image segmentation first relies on a pixel classifier, which is trained from the markers to create a fuzzy object map (the object should appear brighter than the background). (c) Second, the image is interpreted as a graph, whose arc weights should be lower on the border of the object than elsewhere. (d)-(f) The visual feedback from these results guides the user to the image location where more markers must be selected, improving the fuzzy object map, the arc weights, and thus the segmentation over a few interventions.
This work is organized as follows. We start by describing the background material and related works. Next, we introduce the new methodology. Then, we discuss the experiments and results. Finally, we present our conclusions.

Background
Many works have presented pattern classification models based on discriminant analysis, nearest neighbors, neural networks, support vector machines and decision trees, among other techniques. In [4] the authors carry out a performance study of well-known classifiers, comparing the influence of the parameter configurations on the accuracy. They present a generic methodology to construct artificial datasets modeling the different characteristics of real data. Even though theoretical properties and settings may be used to justify new techniques, most of the literature compares the effectiveness of a proposed approach with respect to others for specific applications. Methodologies for that comparison usually divide a dataset into two parts, training and testing sets, where the first is used to design a classifier and the second to measure its errors [5]. This process must be repeated several times so that a sound conclusion on the statistics of its results can be reached.
Several aspects must be carefully considered for such methodologies to work. First, recall that certain characteristics of the datasets (e.g., class imbalance) may require a specific sampling strategy [6]. Also, distinct sampling strategies to create training and testing sets can produce different estimates of performance [7]. The most popular ones are known as cross-validation, hold-out, leave-one-out, and bootstrap. Many articles adopt cross-validation techniques [8][9][10][11], despite their inherent trade-off regarding the number of folds and iterations. [9] showed that ten-fold cross-validation has a lower variance than leave-one-out and bootstrap. [10] and [11] advocate running five separate iterations of two-fold cross-validation in order to reduce the correlation between the training sets.
On the other hand, one aspect rarely considered is that those methodologies are sensitive to the order of the samples in the dataset. [8] have evaluated the impact of the order of the samples in effectiveness, reproducibility, and generalization of the results. The authors showed that, due to distinct orders, a few iterations of cross-validation can severely affect the conclusions when comparing classifiers. [12] studied the consistency of statistical tests on individual datasets and recommended a corrected t-test [13] across ten iterations of ten-fold cross-validation as the least sensitive to the order of the samples. Other studies offered general guidelines for evaluation [14][15][16][17].
The methodology presented in the next section can be used with any measure of effectiveness, since it is able to maximize effectiveness under a learning-time limit, as required by applications. To the best of our knowledge, our methodology is unique in the sense that it takes into account both the efficiency of the techniques in acquiring more knowledge (training samples) from the problem instance and their effectiveness in identifying the most representative training samples (errors in a learning set) to lead to a more accurate classifier. The methodologies mentioned above also aim at maximizing effectiveness, but they attempt this without regard for learning time. Given that each classifier selects its own training samples during the learning phase and the number of learning iterations is dependent on its efficiency, our methodology is robust with respect to the order of samples in the learning set. As we will see, the fact that all classifiers start with a small training set, obtained from randomly selected samples from the large learning set, considerably reduces the correlation between the initial training sets among multiple executions of the method.

Methodology
In this section, we present the proposed evaluation methodology that considers efficacy and efficiency at the same time. In order to accomplish a fair comparison, the proposed methodology allows the classifiers to learn from their own errors, within a given time limit as required by the application.
A dataset $Z$ is first randomly divided into a learning set $Z_2$ and a test set $Z_3$. Due to the learning-time limit of the given application, training with all learning samples is usually not possible. So, an initial training set $Z_1$ is created with a very small subset of randomly selected samples from $Z_2$, such that each class is represented by at least one sample.
After training each of the models with the same training set $Z_1$, they are evaluated on $Z_2 \setminus Z_1$. Next, we randomly select from the misclassified samples of each classifier a number of samples to be incorporated into (its own) $Z_1$ (this number is limited so as to, at most, double the size of the current training set). Retraining, evaluation, and misclassified sample selection are repeated until either the number of errors goes to zero or the learning-time limit $T$ of the application is reached. Within the learning-time limit $T$, each classifier has the opportunity to learn from its own classification errors on the learning set $Z_2 \setminus Z_1$, as the training set $Z_1$ increases.
This procedure works under the reasonable assumption that the most informative samples can be obtained from the errors on $Z_2 \setminus Z_1$. So, after each learning phase, an improvement in accuracy should also be expected on the unseen testing set $Z_3$. Algorithm 1 details this learning approach.
Algorithm 1: Learning Algorithm
Input: A learning-time limit $T$, a learning set $Z_2$, and a function $\lambda(s)$ that returns the correct label of any sample $s \in Z_2$.
Output: A supervised classifier.
Auxiliaries: A training set $Z_1$ and a list $M$ of misclassified samples.
    $Z_1 \leftarrow$ small random sampling from all classes in $Z_2$;
    Create a classifier instance $I$ from $Z_1$;
    for each sample $t \in Z_2 \setminus Z_1$ do
        Compute the label $L(t)$ using $I$;
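The learning procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `learn_from_errors`, the `fit`/`predict` callables, and the toy 1-NN classifier are hypothetical, and the time limit is only checked between learning cycles, so a single slow training step may overrun it.

```python
import random
import time


def learn_from_errors(samples, labels, fit, predict, time_limit, seed=0):
    """Grow a training set from misclassified learning samples until the
    time limit is reached or no errors remain. Returns (model, train_idx)."""
    rng = random.Random(seed)
    # Initial training set Z_1: one random sample per class (assumption:
    # the text only requires "at least one sample" from each class).
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = {rng.choice(idxs) for idxs in by_class.values()}

    start = time.monotonic()
    while True:
        order = sorted(train)
        model = fit([samples[i] for i in order], [labels[i] for i in order])
        # Evaluate on Z_2 \ Z_1 and collect the misclassified samples (list M).
        errors = [i for i in range(len(samples))
                  if i not in train and predict(model, samples[i]) != labels[i]]
        if not errors or time.monotonic() - start >= time_limit:
            return model, order
        # Add randomly chosen errors, at most doubling the training set.
        rng.shuffle(errors)
        train.update(errors[:len(train)])


# Hypothetical toy classifier (1-NN on scalars), used only to run the sketch.
def fit_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    return min(model, key=lambda p: abs(p[0] - x))[1]


xs = [0.0, 0.1, 0.2, 2.0, 2.1, 1.0, 1.1, 1.2]
ys = [0, 0, 0, 0, 0, 1, 1, 1]
model, train_idx = learn_from_errors(xs, ys, fit_1nn, predict_1nn, time_limit=1.0)
print(len(train_idx), "of", len(xs), "learning samples used for training")
```

Because each classifier is passed in through `fit` and `predict`, a faster classifier simply completes more cycles of the same loop before the limit expires, which is the property the methodology relies on.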

Experiments
In this section, we describe the overall experimental methodology, including datasets, effectiveness measure, classification models and the computational environment used. The experiments were carried out 100 times with randomly generated learning set $Z_2$ and test set $Z_3$. After starting off with an identical small training set containing at least one sample from each class, each individual classifier assembled its own training set $Z_1$ from $Z_2$. The resulting training sets ended up having different sizes, after the learning-time limit, according to the efficiency of each classifier. All the experiments reported here were performed on an off-the-shelf desktop computer featuring an Intel Core i5 processor and 4 GB of RAM. We used the standard C implementations of each classifier, as well as their own specific training strategy, with or without parameter optimization depending on the case, which is repeated during each learning iteration. It is impossible to establish the same parameter optimization method for all classifiers, because each classifier has its own mechanism. Some of them, such as OPF, do not require parameter optimization. Note that, since the methodology can be used to select the most suitable classifier for any given application, we are not targeting any application in particular. The experiments essentially demonstrate the main characteristics of the methodology when comparing classifiers in different scenarios (datasets and time constraints).

Dataset Description
For the experiments, we selected commonly available datasets of modest sizes with feature spaces of various dimensions.
• Covertype Dataset: also obtained from the UCI Machine Learning Repository [45]; it contains 581,012 samples, 7 classes and 54 features.

Effectiveness Measure Description
It is important to highlight that the proposed methodology can be used with any effectiveness measure appropriate to the specific domain of application. The literature suggests several interesting effectiveness measures, as previously stated. In our experiments, we adopted the $F_1$ score, which Jardine and van Rijsbergen [48] defined as the normalized, weighted harmonic mean of precision and recall; with equal weights, it reduces to:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

Learning-Time Constraints
For each dataset, we used four different learning-time limits, which were empirically chosen to simulate potential applications with different response times, named as follows: very interactive, interactive, nearly interactive and non-interactive. Table 1 presents the specified time limits for each type of application.
For the sake of completeness, the compared classifiers are briefly described in the following subsections. For more details, see [49,50]. Support Vector Machines. Support Vector Machines (SVM) [51], a widely used classification model, is formulated as an optimization scheme that seeks to determine the hyperplane which best separates two classes (or one class from the others). Also, given non-linearly separable classes, it is possible to apply kernels that transform the data, improving separation between each class and the remaining ones. SVM's main deficiency is that, depending on the size of the training set, too much computational time is needed for convergence to a solution. This lack of efficiency for large datasets may make SVM unfeasible in applications that require multiple retraining phases with interactive response times. Moreover, the assumption of class-separability in transformed space may not hold [52].
In our experiments, we used the latest version of the LibSVM package [49] with Gaussian mapping function (denoted as KSVM), optimization of the parameters $C$ and $\gamma$ using 5-fold cross-validation within the training set and a grid search over exponentially growing sequences of $C$ and $\gamma$ ($C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$ and $\gamma = 2^{-15}, 2^{-13}, \ldots, 2^{3}$), as well as the linear version [53] of SVM (denoted as LSVM) and optimization of the parameter $C$ through 5-fold cross-validation. We also used a grid search over exponentially growing sequences of $C$.
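To make the size of this search concrete, the grid quoted above can be enumerated with a few lines of Python. The sketch only builds the parameter values; it does not call LibSVM, and the variable names are illustrative.

```python
import itertools

# Exponents advance in steps of 2, matching the sequences quoted in the text.
C_values = [2.0 ** e for e in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
gamma_values = [2.0 ** e for e in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3

# Every (C, gamma) pair that a full grid search for KSVM would visit.
grid = list(itertools.product(C_values, gamma_values))
n_folds = 5

print(len(C_values), "values of C,", len(gamma_values), "values of gamma")
print(len(grid), "grid points,", len(grid) * n_folds, "cross-validation fits")
```

If every grid point is evaluated with 5-fold cross-validation, KSVM performs several hundred model fits per learning iteration, which helps explain why it completes fewer iterations than the other classifiers within the same time limit.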
k-nearest neighbors. The k-nearest neighbor (k-NN) algorithm is among the simplest and most widely used of all classification techniques. k-NN classifies a given sample by assigning it the label most frequently present among its k nearest neighbors. For k = 1, a given sample is simply assigned to the class of its nearest neighbor, which corresponds to a first-order Voronoi tessellation of the training data. For k > 1, k-NN takes into account k neighbors, which makes the method less sensitive to noise and outliers. In this work, we estimated the value for k using a leave-one-out procedure over the training set (k = 1,3,5).
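The leave-one-out selection of k mentioned above can be illustrated with a small pure-Python sketch. It assumes squared Euclidean distance and breaks ties in favor of the smaller k; the function names and the toy data are hypothetical and are not taken from the implementation used in the experiments.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training pairs (vector, label) nearest to x."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def select_k_loo(train, candidates=(1, 3, 5)):
    """Return the candidate k with the fewest leave-one-out errors."""
    def loo_errors(k):
        # Classify each sample using all the other training samples.
        return sum(
            knn_predict(train[:i] + train[i + 1:], x, k) != y
            for i, (x, y) in enumerate(train)
        )
    # min() keeps the first minimum, so ties go to the smaller k.
    return min(candidates, key=loo_errors)


# Toy 1-D data: two groups, with one noisy label (0.2 labeled as class 1).
train = [((0.0,), 0), ((0.12,), 0), ((0.2,), 1), ((0.31,), 0), ((0.45,), 0),
         ((1.0,), 1), ((1.12,), 1), ((1.2,), 1), ((1.31,), 1), ((1.45,), 1)]
best_k = select_k_loo(train)
print("selected k:", best_k)
```

With the noisy label in the toy data, k = 1 lets the mislabeled point corrupt the predictions of its neighbors, whereas larger k outvotes it, which is the robustness effect described in the paragraph above.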
Optimum-Path Forest. The Optimum-Path Forest classifier (OPF) [50,54] is a graph-based technique which models classification problems as optimum-path searches in graphs derived from an adjacency relation between samples in a given feature space (a complete relation, in this paper). The nodes are represented by the feature vectors and the edges connect pairs of them. Class representatives (prototypes) are chosen among the training samples in all classes and used to classify the remaining samples based on lengths of paths on the graph. An advantage of this method is its very low computational training cost, given that it does not have to optimize parameters. Moreover, it can handle some overlap among classes.
In our experiments, we used LibOPF [50], which is a free library, implemented in the C language, for the design of classifiers based on optimum-path forest. The distance between feature vectors was measured using the log-Euclidean distance.

Results
In this section, we discuss the average results obtained by following the approach presented in the Methodology Section, with the experiments repeated 100 times with randomly selected learning and test sets. In the preprocessing step, we employed feature standardization to prevent attributes in larger numeric ranges from dominating those in smaller ranges. We initialized each learning instance with 0.01% of $Z$ for the initial training set, $Z_1$; 49.99% for the learning set, $Z_2 \setminus Z_1$; and 50.0% for the test set, $Z_3$. Given that each experiment was repeated 100 times with randomly selected learning and test sets, this procedure resembles a 50 × 2 Monte Carlo cross-validation [55]. However, final training sets were constructed from this learning set by each classifier. Given that sample selection is based on a fixed increment of 0.01% of $|Z_2|$ (the learning set size) at each iteration, we can easily estimate the number of iterations that each classifier required to complete the learning process (sample selection and training) within each given time limit, based on the final training set size. For instance, if the final training set contains 0.05% of the learning samples, then the classifier required 5 iterations.
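The split proportions and the iteration estimate described above can be made concrete with a short sketch. The function name `split_dataset` is hypothetical, rounding to whole samples is an assumption, and the sketch does not enforce that every class appears in the initial training set, which the methodology requires.

```python
import random


def split_dataset(n_samples, init_frac=0.0001, test_frac=0.5, seed=0):
    """Randomly split sample indices into (initial training set Z_1,
    remaining learning samples Z_2 \\ Z_1, test set Z_3)."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = round(n_samples * test_frac)
    n_init = max(1, round(n_samples * init_frac))
    test = idx[:n_test]
    init_train = idx[n_test:n_test + n_init]
    learning_rest = idx[n_test + n_init:]
    return init_train, learning_rest, test


# Sizes for a dataset with 581,012 samples (the Covertype size quoted earlier).
z1, z2_rest, z3 = split_dataset(581012)
print(len(z1), len(z2_rest), len(z3))

# Iteration estimate from the text: final size divided by the fixed increment.
final_pct, increment_pct = 0.05, 0.01
estimated_iterations = round(final_pct / increment_pct)
print(estimated_iterations, "iterations")
```

For a dataset of this size, the initial training set holds only a few dozen samples, which illustrates how little information every classifier starts from before error-driven selection begins.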
Concerning the problem of overlapped training sets in cross-validation methodologies, as mentioned in the Background Section, we studied the correlation between each pair of training sets to evaluate the effectiveness of our methodology with respect to the choice of statistically different training sets (see Fig 2). The training sets present correlation below 0.2, being mostly much lower than that. This indicates that our methodology really measures the generalization ability of the classifiers for different training sets.
Tables 2-6 present the effectiveness measure, namely the $F_1$ score, and the final training set size for each classification model, grouped by the learning-time constraint over different datasets.
From the experimental results for Cod-RNA dataset (Table 2), we see that both SVM strategies are able to obtain relatively good accuracy even with small training sets. One of the main issues with SVM is its non-scalability with respect to the number of training samples. Our methodology allowed these methods to select their most representative samples for a reduced training set.
In Table 3, we can see that faster techniques, such as OPF and k-NN, can acquire more samples within the time constraint as well as achieve higher mean accuracy. However, k-NN usually presents higher variance, being more sensitive to noise. Differently, OPF presents a more stable performance (Tables 3-6), in general, especially in multi-class problems.
Some techniques can learn faster than others, building larger training sets. However, the ability of a technique to select the most informative samples is more important than its speed. This makes an interesting point with respect to the proposed methodology. It is fair to all techniques in the sense that each one has the chance to mine the most informative samples for training. Note that Tables 2-6 show the final training set size of each technique, and the best technique is not always the one with the largest training set. Indeed, faster techniques obtained their maximal predictive performance only when they could effectively learn from their errors.
To provide a statistical analysis of the results, we performed a Friedman test [56] for each pair of dataset and learning-time constraint. Demšar [57] states that the Friedman test provides reliable conclusions when the assumptions (normal distributions and sphericity) of the traditional multiple hypotheses testing ANOVA are violated. Figs 3-7 illustrate a graphical representation of the post-hoc Nemenyi test [58], since we rejected the null hypothesis that all the classifiers are equivalent. Note that 1 represents the best technique, while 4 stands for the worst one. Groups of classifiers that are not significantly different (at p = 0.05) are connected by using a calculated critical distance (CD) equal to 0.4690.

Table 2. Cod-RNA dataset: Predictive performance over $Z_3$ and final training set size, $|Z_1|$ as a percentage of $|Z_2|$, for each classification model and learning-time constraint using the proposed selection.

It is worth noting the importance of the statistical test, since the mean and standard deviation (see Tables 2-6) in some cases are not sufficient to indicate the best classifier. The results presented by both tests (Nemenyi test and mean-standard deviation), in general, are equivalent. The statistical test in the Cod-RNA dataset (Fig 3) shows that both SVMs have significantly better performance when compared to the other classifiers. According to the mean and standard deviation (Table 2), both SVMs are equivalent. However, the statistical tests show evidence that they are not.
Such divergences can also be observed with the other datasets. They occur due to the fact that the standard deviation values are relatively high compared to the difference in performance of the classifiers. Similarly to the results expressed by the mean and standard deviation (Tables 3-6), the statistical test also reveals that k-NN and OPF, in general, present a better performance for the Connect, Covertype, IJCNN and SensIT Vehicle datasets (Figs 4-7). However, the Nemenyi test indicates statistically significant differences between them, unlike the mean-standard deviation test. It is important to clarify that in some cases, for example the SensIT Vehicle dataset with learning-time constraint 300, we cannot reach any conclusion regarding the relative performances of LSVM, k-NN and OPF.
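The critical distance of 0.4690 used in the Nemenyi comparison can be reproduced from the standard formula CD = q_α · sqrt(k(k+1) / (6N)) given by Demšar [57]. The sketch below assumes k = 4 classifiers (ranks run from 1 to 4), N = 100 repetitions, and the tabulated critical value q_0.05 = 2.569 for four classifiers; these inputs are inferred from the text rather than stated together by the authors.

```python
import math


def nemenyi_critical_distance(q_alpha, k, n):
    """Critical distance between average ranks for the Nemenyi test.

    q_alpha: critical value for k classifiers at the chosen significance level
    k: number of classifiers being compared
    n: number of repetitions (or datasets) over which ranks are averaged
    """
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


# Assumed inputs: 4 classifiers, 100 runs, q_0.05 = 2.569 for k = 4.
cd = nemenyi_critical_distance(q_alpha=2.569, k=4, n=100)
print(f"CD = {cd:.4f}")
```

Two classifiers whose average ranks differ by less than this value are not considered significantly different, which is why they appear connected in the diagrams.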
Sample selection methods do not account for time constraints. Methods based on clustering and statistical information learned from the data are usually time costly for large learning sets, which would make it very difficult to select and train a classifier within lower time limits. The simplest approach is random sample selection from each class. Even in this case, one has to estimate the maximum number of samples that a given model can use to train the classifier in a single iteration and within the given time limit. First, for some models, such as SVM, the training time also depends on the selected samples. Anyway, ignoring that, we have estimated that number for each classification model and compared to the proposed sample selection approach based on classification errors. Tables 7-11 present the corresponding results using a single learning iteration with the maximum number of randomly selected samples.
Comparing the results achieved by the proposed method (Tables 2-6) with the ones by the randomized method (Tables 7-11), one can observe that, in general, the proposed methodology is capable of selecting the most representative samples for the training set, yielding higher accuracy (see Tables 2 and 7 with time constraint equal to 1 sec for all classifiers). Even in some cases in which it was possible to train with the entire dataset (for instance, see Table 7 with time constraint equal to 300 sec for k-NN and OPF, as well as with time constraint equal to 1200 sec for k-NN, OPF and LSVM), it seems that some randomly selected training samples impaired the performance of the classifier, while our methodology is capable of avoiding them in the training set (see Table 2 with the same time constraints and classifiers). Note also that the proposed methodology can output considerably smaller training sets, which matters in some approaches, such as the OPF and k-NN classifiers, to speed up classification of large test sets. The comparison of methods using randomized sample selection is not suitable, because these samples capture the geometry of the classes. Besides, as they increase in number, all classification models become equivalent. Fig 8 shows randomly selected samples by each classification model within a given time constraint (1 and 1.5 seconds) using the Cone-Torus dataset [59]. Samples that were not selected are highlighted in gray.

Table 7. Cod-RNA dataset: Predictive performance over $Z_3$ and final training set size, $|Z_1|$ as a percentage of $|Z_2|$, for each classification model and learning-time constraint using random selection.

Table 9. Covertype dataset: Predictive performance over $Z_3$ and final training set size, $|Z_1|$ as a percentage of $|Z_2|$, for each classification model and learning-time constraint using random selection.

One can observe that OPF reached greater effectiveness and efficiency, selecting 100% of the learning samples within only 1.5 seconds and presenting an F-score equal to 1 (Table 12). It is noteworthy that faster techniques (with a larger training set) do not always achieve higher accuracy; this depends on how effectively they learn from their errors. For instance, LSVM did not achieve better performance, even though it is faster than KSVM. Each classification model defines decision boundaries (regions) in a different way in the feature space. By selecting classification errors as training samples, the learning process converges faster to the corresponding decision boundaries. The errors tend to be samples close to the decision boundaries rather than outliers, as long as outliers are a minority. If this is not the case, outlier removal should be applied before the learning process. In order to better clarify this issue, we have added Fig 9 with samples not selected from the learning set in gray and samples selected by the classifiers to the training set in color. Fig 9 shows the selected samples for a time limit of 1 second using the 2D Cone-Torus dataset.

Table 12. Cone-Torus dataset: Predictive performance over $Z_3$ and final training set size, $|Z_1|$ as a percentage of $|Z_2|$, for each classification model using random selection and learning-time constraint equal to 1 and 1.5 seconds.

In order to analyze the performance of each classifier using the entire learning set, Table 13 shows the accuracy and the time required for training on each dataset. All classifiers presented similar accuracies. However, k-NN and OPF were more efficient. KSVM and LSVM require significantly more time for training.

Conclusion
We presented a methodology to compare multiple classifiers under a learning-time constraint, which is useful to select the best classifier for a given application. In this paper, the applications were represented by different datasets with class imbalance, distinct numbers of classes and distinct feature space dimensions. The proposed methodology allows each classifier to select its most representative samples from a learning set during the training phase. The experiments allowed us to reach several conclusions.

Table 13. Predictive performance over $Z_3$ and required time for each classification model and dataset using all learning samples, $|Z_2|$.
Although it was not possible to assert which is the most effective classification model under a given time constraint, due to the variability of results on each application domain, experiments obtained using the proposed methodology allowed us to arrive at some relevant observations.
Larger training sets do not necessarily lead to higher predictive performance on unseen test sets, which indicates the effectiveness of some classifiers in learning from their own errors.
The methodology is able to produce statistically independent training sets, as observed by the low correlations between each pair of training sets obtained for a given dataset-classifier pair, following 100 executions. This demonstrates the advantage of our approach with respect to the regular cross-validation procedure, widely used in related works.
It is also very common in the literature for the presentation of experimental results to rely solely on the mean and standard deviation of accuracy values. The statistical test shows that this approach is not always reliable, due to the relative variations of the standard deviation.