UD is supported in the form of salary by Microsoft Research. There are no patents, products in development, or marketed products to declare. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Training of one-vs.-rest SVMs can be parallelized over the number of classes in a straightforward way. Given enough computational resources, one-vs.-rest SVMs can thus be trained on data involving a large number of classes. The same cannot be said, however, for the so-called all-in-one SVMs, which require solving a quadratic program whose size grows quadratically with the number of classes. We develop distributed algorithms for two all-in-one SVM formulations (by Lee et al. and by Weston and Watkins) that parallelize the computation evenly over the number of classes. This allows us to compare these models with one-vs.-rest SVMs at an unprecedented scale. The results indicate superior accuracy on text classification data.
Modern data analysis requires computation with a large number of classes. As examples, consider the following. (1) We are continuously monitoring the internet for new webpages, which we would like to categorize. (2) We have data from an online biomedical bibliographic database that we want to index for quick access by clinicians. (3) We are collecting data from an online feed of photographs that we would like to classify into image categories. (4) We add new articles to an online encyclopedia and intend to predict the categories of the articles. (5) Given a huge collection of ads, we want to build a classifier from this data.
The problems above—taken from varying application domains ranging from the sciences to technology—involve a large number of classes, typically at least in the thousands. This motivates research on scaling up multiclass classification methods. In the present work, we address scaling up multiclass support vector machines (MCSVMs) [
One-vs.-one (OVO) and one-vs.-rest (OVR) MCSVMs decompose the problem into multiple binary subproblems that are subsequently aggregated [
Recently, Dogan et al. [
The reason is that (linear) state-of-the-art solvers require a time complexity of
In this paper, we focus on the comparison between OVR SVMs and all-in-one SVMs. We do this by developing distributed algorithms where up to
The algorithm proposed for WW draws inspiration from a major result in graph theory: the solution to the 1-factorization problem of a graph [
On the other hand, we parallelize LLW training by introducing an auxiliary variable
We provide both multicore and distributed implementations of the proposed algorithms. We report on empirical runtime comparisons of the proposed solvers with the one-vs.-rest implementation by LIBLINEAR [
The main contributions of this paper are the following:
We propose the first distributed, exact solver for WW and LLW.
We provide both multicore and truly distributed implementations of the solver.
We give the first comparison of WW, LLW, and OVR on the DMOZ data from the LSHTC ‘10–’12 corpora using the full feature resolution.
We expect the present work to start a line of research on the parallelization of exact training of various all-in-one MCSVMs, including Crammer and Singer, multiclass maximum margin regression [
The paper is structured as follows. In the next section, we discuss the problem setting and preliminaries. In Section Distributed Algorithms, we present the proposed distributed algorithms for LLW and WW, respectively. We analyze their convergence empirically in Section Experiments, followed by a discussion of related work and the conclusion.
We consider the following problem. We are given data (
To address this problem setting, a number of generalizations of the binary SVM [
Both formulations lead to very similar dual problems, as shown below. For the dualization of WW, we refer to [
Using slack variables, the primal LLW problem reads
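For reference, the LLW primal as commonly stated in the literature is sketched below (standard notation is assumed: $k$ classes, weight vectors $w_c$, slack variables $\xi_{i,c}$; a bias term is omitted, and the exact form may differ slightly from the paper's):

```latex
\min_{w_1,\dots,w_k,\;\xi}\quad
  \frac{1}{2}\sum_{c=1}^{k}\lVert w_c\rVert^{2}
  \;+\; C\sum_{i=1}^{n}\sum_{c\neq y_i}\xi_{i,c}
\qquad\text{s.t.}\quad
  \langle w_c, x_i\rangle \;\le\; -\frac{1}{k-1} + \xi_{i,c},
  \quad \xi_{i,c}\ge 0
  \quad \forall i,\; c\neq y_i,
\qquad
  \sum_{c=1}^{k} w_c = 0 .
```

The sum-to-zero constraint $\sum_c w_c = 0$ is what couples the classes and makes naive per-class parallelization nontrivial.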
In this section, we derive algorithms that solve (LLW) and (WW) in a distributed manner. We start by addressing LLW.
First note the following optimality condition in (LLW):
This condition was exploited by prevalent solvers to remove the variable
Then we perform dual block coordinate ascent (DBCA) [
[Algorithm 1 (distributed DBCA for LLW): the pseudocode listing is not recoverable; the surviving steps are optimal ← True, shuffleData(), and the per-coordinate calls to solve1DimLLW(·).]
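To illustrate the flavor of such coordinate-wise dual updates, here is a minimal sketch for the simpler case of a linear binary SVM (hinge loss, no bias term), where each one-dimensional subproblem is solved in closed form and clipped to the box [0, C]. The function name and parameters are illustrative and not part of the paper's implementation; the update rule is the standard one used by dual coordinate descent solvers such as LIBLINEAR's.

```python
import numpy as np

def dual_cd_svm(X, y, C=1.0, epochs=100, seed=0):
    """Dual coordinate ascent for a linear binary SVM (hinge loss, no bias).

    Each pass solves one-dimensional subproblems in closed form, clipping
    the dual variable to the box [0, C] and maintaining w = sum_i a_i y_i x_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    q = (X * X).sum(axis=1)              # diagonal of the Gram matrix
    for _ in range(epochs):
        for i in rng.permutation(n):     # random order helps convergence
            g = y[i] * (X[i] @ w) - 1.0  # gradient of the 1-D subproblem
            a_new = float(np.clip(alpha[i] - g / q[i], 0.0, C))
            w += (a_new - alpha[i]) * y[i] * X[i]
            alpha[i] = a_new
    return w
```

The all-in-one solvers in the paper work analogously, except that each one-dimensional update touches the weight vectors of two classes (WW) or a class and the auxiliary variable (LLW).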
As a necessary step within Algorithm 1, we need to update every single
[Algorithm 2: the pseudocode listing is not recoverable; the surviving steps set optimal ← False upon detecting a violated optimality condition.]
It is well known that the block coordinate ascent method converges under suitable regularity conditions; see, e.g., [
Note that in practice, we observed speedups by updating
Further, we drop variables
Our implementation uses OpenMPI for inter-machine (MPI) [
Recall,
In this section, we propose a distributed algorithm for WW, which draws inspiration from the 1-factorization problem of a graph. The approach is presented below.
Our approach is based on running dual coordinate ascent, e.g., Algorithm 3.1 in [
[Algorithm 3: the pseudocode listing is not recoverable; the surviving steps set optimal ← False upon detecting a violated optimality condition.]
We observe from the above that optimizing with respect to
Assume that
To better understand the problem, we consider the following analogy. We are given a football league with
Illustration of the solution of the 1-factorization problem of a graph with
We arrange one node centrally and all other nodes in a regular polygon around the center node. At round
The algorithm, shown in Algorithm 4, performs DBCA over the variables
[Algorithm 4 (distributed DBCA for WW): the pseudocode listing is not recoverable; the surviving steps are optimal ← True, shuffleData(), and the paired calls to solve1DimWW(·).]
Furthermore, note that our algorithm performs the same coordinate updates as Algorithm 3.1 in [
In practice, because of limited computational resources, we might not be able to parallelize fully up to the maximum of
Given the average number of nonzero entries per sample
As with LLW, we implemented a mixed MPI-OpenMP solver for WW. However, note that, while LLW has mild communication needs, WW needs to pair the weight vectors of the matched classes
We tackled the problem as follows. First, we use OpenMP for computations on a single machine (parallelizing efficiently across cores). Here, due to the shared memory, no weight vectors need to be moved. The more challenging task is to handle inter-machine communication efficiently. Our approach is based on two key observations.
If the data is high-dimensional yet sparse, we keep the full weight matrix in memory for fast access but communicate only the nonzero entries between computers. Despite the increased computational effort, this takes only a fraction of the time required to send the dense data.
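The idea of shipping only the nonzero entries can be sketched as a pair of pack/unpack helpers (the function names are illustrative; a real implementation would serialize into MPI buffers):

```python
import numpy as np

def pack_sparse(w):
    """Extract only the nonzero entries of a dense weight vector for
    transfer. For high-dimensional but sparse models, (indices, values)
    is far smaller than the dense vector."""
    idx = np.flatnonzero(w)
    return idx.astype(np.int64), w[idx]

def unpack_sparse(idx, vals, dim):
    """Restore the dense weight vector on the receiving machine, so local
    computations keep fast dense random access."""
    w = np.zeros(dim)
    w[idx] = vals
    return w
```

The extra pack/unpack work is linear in the number of nonzeros, which is why it pays off whenever the weight vectors are sparse relative to the feature dimension.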
Furthermore, we relax the WW matching scheme. Returning to the football analogy, suppose each country hosts a league, and within a league the teams are matched as before. Now we would like to match teams across leagues. To do so, we first match the countries with the scheme described above. For each pair of countries, call them A and B, every team from country A plays every team from country B. Translated back to classes and machines, this means we transfer bundles of classes (countries) between computers, which drastically reduces network communication.
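A sketch of this two-level scheme, under the assumption that classes are partitioned into per-machine bundles and the number of machines is even (names and the `bundles` data structure are illustrative):

```python
from itertools import product

def bundle_rounds(bundles):
    """Two-level matching: machines (the 'countries') are paired by the
    circle method; within a machine pair, every class of one bundle meets
    every class of the other, so whole bundles are shipped at once.

    `bundles` maps machine id -> list of class ids; assumes an even
    number of machines. Intra-machine class pairs are not scheduled here,
    as they are handled locally via shared memory.
    """
    m = len(bundles)
    rest = list(range(1, m))
    schedule = []
    for _ in range(m - 1):
        lineup = [0] + rest
        round_pairs = []
        for i in range(m // 2):
            a, b = lineup[i], lineup[m - 1 - i]
            # all cross-machine class pairs for this machine pair
            round_pairs.extend(product(bundles[a], bundles[b]))
        schedule.append(round_pairs)
        rest = rest[-1:] + rest[:-1]  # rotate machines except the fixed one
    return schedule
```

Each round then requires only one bundle transfer per machine pair instead of one transfer per class pair, which is the source of the communication savings described above.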
This section is structured as follows. First we empirically verify the soundness of the proposed algorithms. Then we introduce the employed datasets, on which we investigate the convergence and runtime behavior of the proposed algorithms as well as the induced classification performance.
Each training algorithm was run three times, using randomly shuffled data, and the results were averaged. Note that the training set is the same in each run, but the different order of data points can impact the runtime of the algorithms.
For our experiments we use two different types of machines. Type A has 20 physical CPU cores, 128 GB of memory, and a 10-gigabit Ethernet network. Type B has 24 physical CPU cores and 386 GB of memory. The experiments involving CS were run on type B machines due to their memory requirements.
For
We implemented our solvers using OpenMP, OpenMPI, and the Python ecosystem. In more detail, we used [
In our first experiment, we validate the correctness of the proposed solvers. We downloaded data from the
We then compare our LLW and WW solvers with the state-of-the-art implementation contained in the ML library Shark [
Dataset:  D-LLW  S-LLW  D-WW  S-WW  

Err.  Den.  Err.  Den.  Err.  Den.  Err.  Den.  
log( 
21.34  100.0  21.34  100.0  19.88  100.0  19.88  100.0 
20.95  100.0  20.95  100.0  19.51  100.0  19.51  100.0  
20.78  100.0  20.78  100.0  19.38  100.0  19.38  100.0  
log( 
66.67  100.0  66.67  100.0  38.10  100.0  38.10  100.0 
61.90  100.0  61.90  100.0  19.05  100.0  19.05  100.0  
33.33  100.0  33.33  100.0  19.05  100.0  19.05  100.0  
log( 
13.33  100.0  13.33  100.0  6.67  100.0  6.67  100.0 
26.67  100.0  26.67  100.0  13.33  100.0  13.33  100.0  
26.67  100.0  26.67  100.0  13.33  100.0  13.33  100.0  
log( 
87.04  100.0  87.04  100.0  28.25  100.0  28.26  100.0 
87.24  100.0  87.24  100.0  29.04  100.0  29.03  100.0  
61.91  100.0  87.24  100.0  28.92  100.0  28.93  100.0  
log( 
29.23  97.24  29.23  97.24  15.32  51.16  15.30  49.72 
22.97  97.24  22.97  97.24  14.80  44.74  14.80  42.70  
16.15  97.17  16.15  97.04  15.98  45.97  15.98  43.47  
log( 
47.96  78.00  47.96  78.00  11.31  26.42  11.31  23.45 
33.27  78.00  33.41  77.98  11.52  22.93  11.52  20.12  
12.03  78.00  12.03  77.98  12.03  23.05  12.03  20.06  
log( 
26.75  100.0  26.73  100.0  15.80  100.0  15.80  100.0 
26.80  100.0  26.80  100.0  15.47  100.0  15.53  100.0  
26.90  100.0  26.90  100.0  15.96  100.0  16.00  100.0  
log( 
16.29  100.0  16.37  100.0  16.16  100.0  16.16  100.0 
16.09  100.0  16.15  100.0  16.37  100.0  16.28  100.0  
16.34  100.0  16.28  100.0  16.32  100.0  16.24  100.0  
log( 
31.84  100.0  31.84  100.0  8.17  100.0  8.17  100.0 
30.09  100.0  30.04  100.0  9.37  100.0  9.37  100.0  
28.00  100.0  28.00  100.0  10.51  100.0  10.51  100.0 
Error on the test set and density in % of the Shark solver (denoted S) and the proposed solver (denoted D). The results show good agreement across solver implementations.
For randomly chosen settings, we tested whether the duality gap closes. We did this for both solvers with different
We experiment on large classification datasets, where the number of classes ranges between 451 and 27,875. The relevant statistics of the datasets are shown in
Dataset 



4,463  1,858  1,139  51,033  
128,710  34,880  12,294  381,581  
383,408  103,435  11,947  575,555  
394,754  104,263  27,875  594,158 
The used datasets from the LSHTCcorpus and their properties.
In order to measure the speedup provided by increasing the number of machines/cores, we run a fixed number of iterations over the whole LSHTC-large dataset. We use 10 runs of 10 iterations each with the parameter C fixed to 1 and without shrinking. While the multicore (MC) execution runs on one machine, the MPI execution runs on two or four machines, i.e., spreading the used cores evenly across nodes.
The results are shown in
Speedup of our solver, averaged over 10 repetitions, as a function of the number of cores. For *-MPI-2 and *-MPI-4 the cores are split evenly over 2 and 4 machines, respectively. We observe a linear speedup in the number of cores for both solvers.
Now we evaluate and compare the proposed algorithms on the LSHTC datasets for a range of C values, i.e., we perform no cross-validation. For comparison we use a solver from the well-known
For the multicore solvers, i.e., OVR and WW-MC, we use 16 cores. MPI spreads over 2 or 4 machines, using 8 and 4 cores per node, respectively,
Dataset:  

OVR  CS  WW  LLW  OVR  CS  WW  LLW  
log( 
93.00  59.74  72.82  92.74  11.11  69.73  
85.36  59.74  65.34  93.00  81.54  11.13  16.44  92.74  
74.54  59.74  57.59  93.00  46.76  11.12  6.06  92.74  
64.37  55.49  54.57  93.00  38.20  11.76  5.74  92.74  
93.00  92.74  
log( 
88.12  58.57  66.47  75.26  2.53  18.50  
85.21  58.57  60.58  95.86  45.14  2.53  4.45  100.0  
77.96  57.82  55.28  95.86  25.28  2.55  1.71  100.0  
63.11  95.86  18.33  100.0  
54.18  54.41  *  2.67  1.66  *  
log( 
83.66  49.81  58.02  72.60  1.73  16.97  
75.15  49.65  50.20  92.63  46.20  1.71  4.06  99.52  
60.38  46.14  44.94  92.63  25.87  1.76  1.52  99.52  
47.33  *  18.20  *  
45.60  46.15  *  2.09  1.47  *  
log( 
87.95  59.09  68.19  72.38  1.57  13.49  
85.85  59.09  62.14  96.18  45.97  1.57  3.16  100.0  
76.78  58.18  57.31  96.18  25.97  1.55  1.19  100.0  
63.11  *  18.24  *  
57.78  58.32  *  1.70  1.14  * 
Test set error and model density (in %) as achieved by the OVR, CS, WW, and LLW solvers on the LSHTC datasets. For each solver the result with the best error is in bold font. For LLW entries with a ‘*’ did not converge within a day of runtime.
Dataset:  

OVR  CS  WW  LLW  OVR  CS  WW  LLW  
log( 
7.00  40.26  27.18  0.61  22.08  10.73  
14.42  40.26  34.66  7.00  2.70  22.08  16.15  0.61  
25.46  40.26  42.41  7.00  8.72  22.08  24.71  0.61  
35.47  44.46  45.43  7.00  16.42  26.70  28.75  0.61  
7.00  0.61  
log( 
11.77  41.35  33.53  0.88  25.43  15.05  
14.80  41.52  39.42  4.14  1.51  25.41  20.83  0.09  
22.02  42.19  44.72  4.14  3.35  25.83  27.90  0.09  
36.86  *  14.76  30.99  *  
45.83  45.59  *  31.12  *  
log( 
16.34  50.19  41.98  0.28  20.55  8.08  
24.85  50.35  49.80  7.37  0.69  20.72  16.17  0.01  
39.62  53.86  55.06  7.37  2.64  23.76  25.94  0.01  
52.67  *  12.46  *  
54.40  53.85  *  31.84  30.95  *  
log( 
12.05  40.91  31.81  0.46  22.44  10.47  
14.15  40.91  37.86  3.82  0.62  22.46  16.48  0.05  
23.22  41.82  42.69  3.82  1.89  23.37  23.17  0.05  
36.89  *  10.60  *  
42.22  41.86  *  26.31  26.97  * 
Micro-F1 and Macro-F1 scores (in %) as achieved by the OVR, CS, WW, and LLW solvers on the LSHTC datasets. For each solver and each metric, the best result across C values is in bold font. For LLW, entries with a ‘*’ did not converge within a day of runtime.
For all datasets, the canonical multiclass formulations, i.e., CS and WW, perform significantly better than OVR. On the one hand, the error is smaller and the F1 scores are better. On the other hand, the learned models are much sparser, by up to an order of magnitude. These results justify the increased solution complexity of the canonical formulations.
Comparing CS and WW, CS classifies as well as or slightly better than WW, though WW leads to a sparser model. To the best of our knowledge, this is the first comparison of these well-known multiclass SVMs on the studied LSHTC data.
From
Training time averaged over 10 repetitions per C for the various solvers.
All WW experiments use the same number of cores, but with a varying degree of distribution. We observe that the communication imposes a modest overhead. This overhead is influenced by the model density, which is higher for smaller
Since LLW converges to the correct solution (the duality gap closes), the results indicate that the chosen C range is not suitable. For LSHTC-small we therefore conducted experiments with much larger C values. Indeed, as shown in
log( 
2  3  4 

87.73  66.74  59.31  
2.08  15.07  40.69  
12.27  33.26  24.58  
92.74  92.74  92.74  
99.88  99.87  99.90 
Error, Micro-F1, and Macro-F1 on the test set and model density in % of the LLW solver on the LSHTC-small dataset.
Most approaches to parallelization of MCSVM training are based on OVO or OVR [
There is a line of research on parallelizing stochastic gradient (SGD) based training of MCSVMs over multiple computers [
The related work closest to the present work is by [
Note that beyond SVMs there is a large body of work on distributed multiclass [
When the performance of single solvers saturates, one can consider refining them, e.g., by considering the reliability of single-class estimates [
We proposed distributed algorithms for solving the multiclass SVM formulations by Lee et al. (LLW) and Weston and Watkins (WW). The algorithm addressing LLW takes advantage of an auxiliary variable, while our approach to optimizing WW in parallel is based on the 1factorization problem from graph theory.
The experiments confirmed the correctness of the solver (in the sense of an exact solver) and showed a linear speedup as the number of cores increases. This speedup allows us to train LLW and WW on LSHTC datasets for which results were lacking in the literature.
Our analysis contributes to comparing MCSVM formulations on rather large datasets, where comparisons were still lacking. Compared to OVR, we showed that WW can achieve competitive classification results in less time while still leading to a much sparser model. Unexpectedly, LLW shows clear disadvantages compared to the other MCSVMs. Yet its favorable scaling properties make further research interesting, for instance regarding the development of an unconstrained algorithm. We ease further research by publishing the source code under
Overcoming the limitations of a single machine, i.e., distribution, is a key problem and a key enabler in large-scale learning. To the best of our knowledge, we are the
In the future, we would like to study extensions of the concepts presented in this paper to further MCSVMs, including the Crammer and Singer MCSVM [
We thank Rohit Babbar, Shinichi Nakajima, and KlausRobert Müller for helpful discussions. We thank Giancarlo Kerg for pointing us to the graph 1factorization problem. We thank Ioannis Partalas for help regarding the LSHTC datasets.