Data reduction for SVM training using density-based border identification

doi:10.1371/journal.pone.0300641

Fig 1.

Separating hyperplanes in the linearly separable case.

The hyperplane is shown as a solid line, the margins as dashed lines, and the support vectors are enclosed in red circles.

Fig 2.

An SVM classifier trained only on support vectors compared with one trained on the whole dataset.

The decision boundaries are shown as solid lines, the margins as dotted lines, and the support vectors are enclosed in circles. (a) A classifier trained on the entire dataset. (b) A classifier trained only on support vectors identified earlier by an SVM classifier trained on the whole dataset.
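
The comparison in this caption can be reproduced in a minimal sketch, assuming scikit-learn is available. The synthetic moons data, the RBF kernel, and the values of C and gamma are illustrative assumptions and are not taken from the article. The kernel width is fixed explicitly so that both classifiers use the same kernel regardless of training-set size.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Illustrative data and hyperparameters (assumptions, not the article's setup).
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
params = dict(kernel="rbf", C=1.0, gamma=1.0)  # fixed gamma: same kernel for both fits

# (a) Classifier trained on the entire dataset.
full = SVC(**params).fit(X, y)

# (b) Classifier trained only on the support vectors found in (a).
sv_idx = full.support_  # indices of the support vectors in X
reduced = SVC(**params).fit(X[sv_idx], y[sv_idx])

# Compare the two classifiers on fresh points drawn from the same distribution.
X_new, _ = make_moons(n_samples=200, noise=0.2, random_state=1)
agreement = np.mean(full.predict(X_new) == reduced.predict(X_new))

print(f"training points: {len(X)}, support vectors: {len(sv_idx)}")
print(f"prediction agreement: {agreement:.3f}")
```

With identical hyperparameters, the two classifiers are expected to agree on almost all points, because the points that are not support vectors do not contribute to the decision function.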

Fig 3.

Identification of a layer of border points in our proposed method (DBI).

(a) The input classes of a synthetic 2-dimensional dataset. (b) The calculated core scores for all points of the dataset. (c) The three identified types of points: border points (red), core points (green), and outliers (black). (d) The final reduced training set.

Fig 4.

Effect of border overlap on the decision boundaries of the trained classifier on the Banana dataset.

The whole dataset is compared with the reduced sets produced by our proposed DBI and BRIX methods, both at a ratio of 0.2. The test dataset is plotted against the decision boundaries (solid lines) and margins (dashed lines). (a) Whole dataset. (b) Reduced dataset (DBI). (c) Reduced dataset (BRIX variant). (d) Test dataset plotted against decision boundaries (whole dataset). (e) Test dataset plotted against decision boundaries (DBI). (f) Test dataset plotted against decision boundaries (BRIX variant). Note the misclassified points in the central region of (e), caused by the overlapping borders of the opposite class around that region.

Fig 5.

Reduced dataset selection using the proposed SVO method for a non-overlapping dataset.

(a) A moons-shaped dataset. (b) Identified support vectors. (c) Reduced subset using SVO at k = 15.

Table 1.

Results of the proposed methods (DBI, BRI & BRIX) on the Banana dataset with different reduction ratios.

Fig 6.

Results for our proposed methods (DBI, BRI and BRIX) on the Banana dataset with different reduction ratios.

Table 2.

Results of the proposed SVO & SVOX methods on the Banana dataset with different values of k.

Fig 7.

Results for the proposed SVO and SVOX methods on the Banana dataset with different values of k.

Table 3.

Results of the proposed methods compared to other methods from the literature on the Banana dataset.

Table 4.

Pareto set of different methods on the Banana dataset.

Fig 8.

Ranking of Pareto set methods on the Banana dataset based on closeness to the optimal point.

The optimal point is the zero point, representing the ideal of minimizing all the optimized metrics. The score is calculated as the reciprocal of the Euclidean distance from the optimal point.
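
The scoring rule described in this caption can be sketched as follows. This is a minimal illustration, assuming each method is represented by a tuple of objective values to be minimized; the method names and numbers are hypothetical and are not results from the article. Whether and how the objectives are scaled before the distance is taken is not stated in the caption, so no scaling is applied here.

```python
import math

def closeness_score(objectives):
    """Reciprocal of the Euclidean distance from the zero (optimal) point."""
    distance = math.sqrt(sum(value * value for value in objectives))
    if distance == 0.0:
        return math.inf  # a point at the optimum itself gets the best possible score
    return 1.0 / distance

# Hypothetical objective vectors (all to be minimized); not values from the article.
candidates = {
    "method_a": (3.0, 4.0, 0.0),
    "method_b": (6.0, 8.0, 0.0),
    "method_c": (1.0, 2.0, 2.0),
}

# Rank methods from closest to farthest from the optimal point.
ranking = sorted(candidates, key=lambda name: closeness_score(candidates[name]), reverse=True)
for name in ranking:
    print(name, round(closeness_score(candidates[name]), 4))
```

A higher score means the method lies closer to the zero point, so sorting in descending order of score gives the ranking.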

Fig 9.

Pareto set for the Banana dataset.

After the elimination of 70 dominated solutions, the set is composed of only five elements. These elements are exclusively proposed methods, namely BRIX and SVOX. (A point dominates another if it is better or equal in all objectives and strictly better in at least one objective.)
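
The dominance rule in this caption can be turned into a simple filter. The sketch below assumes that every objective is minimized, in line with the zero optimal point used for ranking; the function names and sample points are hypothetical and are not data from the article.

```python
def dominates(a, b):
    """True if a is better or equal to b in all objectives and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective vectors (all minimized); not values from the article.
points = [
    (1, 2, 3),
    (2, 2, 3),  # dominated by (1, 2, 3)
    (0, 5, 1),
    (1, 2, 4),  # dominated by (1, 2, 3)
]

front = pareto_set(points)
print(front)
```

This pairwise check is quadratic in the number of candidates, which is adequate for candidate sets of the size reported in these captions.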

Fig 10.

Comparison of methods on the Banana dataset.

The distribution of the different methods in the solution space of the three optimization objectives, in addition to the ratio of the reduced dataset, is shown for each pair of objectives. The proposed methods, except DBI and SVO, are predominantly closest to the optimal point.

Fig 11.

Results for the proposed methods (DBI, BRI and BRIX) on the USPS dataset with different reduction ratios.

Table 5.

Results of the proposed methods (DBI, BRI & BRIX) on the USPS dataset with different reduction ratios.

Table 6.

Results of the proposed SVO & SVOX methods on the USPS dataset with different values of k.

Fig 12.

Results for the proposed SVO and SVOX methods on the USPS dataset with different values of k.

Table 7.

Results of the proposed methods compared to other methods from the literature on the USPS dataset.

Table 8.

Pareto set of different methods on the USPS dataset.

Fig 13.

Ranking of Pareto set methods on the USPS dataset based on closeness to the optimal point.

The optimal point is the zero point, representing the ideal of minimizing all the optimized metrics. The score is calculated as the reciprocal of the Euclidean distance from the optimal point.

Fig 14.

Pareto set for the USPS dataset.

The Pareto set is composed of 17 elements selected from 106 candidate solutions.

Fig 15.

Comparison of methods on the USPS dataset.

The distribution of the different methods in the solution space of the three optimization objectives, in addition to the ratio of the reduced dataset, is shown for each pair of objectives. The proposed methods, except SVO, are predominantly closest to the optimal point.

Table 9.

Results of the proposed methods (DBI, BRI & BRIX) on the Adult9a dataset with different reduction ratios.

Fig 16.

Results for our proposed methods (DBI, BRI, and BRIX) on the Adult9a dataset with different reduction ratios.

Table 10.

Results of the proposed SVO & SVOX methods on the Adult9a dataset with different values of k.

Fig 17.

Results for the proposed SVO and SVOX methods on the Adult9a dataset with different values of k.

Table 11.

Results of the proposed methods compared to other methods from the literature on the Adult9a dataset.

Table 12.

Pareto set of different methods on the Adult9a dataset.

Fig 18.

Ranking of Pareto set methods on the Adult9a dataset based on closeness to the optimal point.

Fig 19.

Pareto set for the Adult9a dataset.

The Pareto set is composed of 17 elements selected out of 68 possible candidate solutions.

Fig 20.

Comparison of methods on the Adult9a dataset.

The distribution of the different methods in the solution space of the three optimization objectives, in addition to the ratio of the reduced dataset, is shown for each pair of objectives. BRIX and Gaffari’s method are the closest to the optimal point. SVO and SVOX are more clustered in the solution space than the other methods.
