Distributed Optimization of Multi-Class SVMs

Training of one-vs.-rest SVMs can be parallelized over the number of classes in a straight forward way. Given enough computational resources, one-vs.-rest SVMs can thus be trained on data involving a large number of classes. The same cannot be stated, however, for the so-called all-in-one SVMs, which require solving a quadratic program of size quadratically in the number of classes. We develop distributed algorithms for two all-in-one SVM formulations (Lee et al. and Weston and Watkins) that parallelize the computation evenly over the number of classes. This allows us to compare these models to one-vs.-rest SVMs on unprecedented scale. The results indicate superior accuracy on text classification data.


Introduction
Modern data analysis requires computation with a large number of classes. As examples, consider the following. (1) We are continuously monitoring the internet for new webpages, which we would like to categorize. (2) We have data from an online biomedical bibliographic database that we want to index for quick access to clinicians. (3) We are collecting data from an online feed of photographs that we would like to classify into image categories. (4) We add new articles to an online encyclopedia and intend to predict the categories of the articles. (5) Given a huge collection of ads, we want to built a classifier from this data.
The problems above-taken from varying application domains ranging from the sciences to technology-involve a large number of classes, typically at least in the thousands. This motivates research on scaling up multi-class classification methods. In the present work, we address scaling up multi-class support vector machines (MC-SVMs) [1]. There are two major types of MC-SVMs: 1. One-vs.-one (OVO) and one-vs.-rest (OVR) MC-SVMs decompose the problem into multiple binary subproblems that are subsequently aggregated [1,2]. Training can be parallelized in a straight forward way.
2. All-in-one MC-SVMs extend the concept of the margin to multiple classes. Because there is no unique extension of the margin concept, multiple all-in-one MC-SVMs have been proposed, including the ones by Crammer and Singer (CS) [3], Lee, Lin, and Wahba (LLW) [4], and Weston and Watkins (WW) [1,5]. See [2,[6][7][8][9][10][11] for conceptual and empirical comparisons. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Recently, Dogan et al. [11] have compared the various all-in-one MC-SVM variants on rather moderately sized datasets and showed advantages of all-in-one MC-SVMs over OVR MC-SVM, but-so far-slow training time has prohibited comparisons on data involving a large number of classes.
The reason is that (linear) state of the art solvers require time complexity of Oð " d " n Á C 2 Þ and memory complexity at least of Oð" nC 2 Þ, where d is the feature dimensionality, " d the average number of non-zeros ( " d ¼ d for dense data), and " n the average number of samples per class. This quadratic dependence on the number of classes C can be prohibitive for large C, often leaving OVO and OVR as the only MC-SVM options in the big data setting.
In this paper, we focus on the comparison between OVR SVMs and all-in-one SVMs. We do this by developing distributed algorithms where up to OðCÞ nodes solve WW and LLW in parallel, dividing model and computation evenly. The resulting solvers are compared to a state-of-the-art OVR solution.
The algorithm proposed for WW draws inspiration from a major result in graph theory: the solution to the 1-factorization problem of a graph [12]. The idea is that the optimization of a single coordinate α i,c of the dual objective involves only the two hypotheses w y i and w c . As in the 1-factorization problem, we can thus form pairs of classes where the corresponding blocks of coordinates can be optimized in parallel.
On the other hand, we parallelize LLW training by introducing an auxiliary variable " w into the dual problem that decouples the objective into a sum over C many independent subproblems.
We provide both multi-core and distributed implementations of the proposed algorithms. We report on empirical runtime comparisons of the proposed solvers with the one-vs.-rest implementation by LIBLINEAR [13] on text classification data taken from the LSHTC corpus [14].
The main contributions of this paper are the following: • We propose the first distributed, exact solver for WW and LLW.
• We provide both multi-core and truly distributed implementations of the solver.
• We give the first comparison of WW, LLW, and OVR on the DMOZ data from the LSHTC '10-'12 corpora using the full feature resolution.
We expect that the present work starts a line of research on parallelization of exact training of various all-in-one MC-SVMs, including Crammer and Singer, multi-class maximum margin regression [15], and the reinforced multicategory SVM [16], enabling comparison of all these methods on large datasets.
The paper is structured as follows. In the next section we discuss the problem setting and preliminaries. In Section Distributed Algorithms, we present the proposed distributed algorithms for LLW and WW, respectively. We analyze their convergence empirically in Section Experiments. Followed by sections Discussion of related work and Conclusion.

Preliminaries
We consider the following problem. We are given data (x 1 , y 1 ), . . ., (x n , y n ) with x i 2 R d and y i 2 f1; :::; Cg. Each class has in average " n samples. The largest number of samples for a single class is n max . We are predicting using the model where W ¼ ðw 1 ; ::; w C Þ 2 R dÂC are unknown parameters. The aim is to efficiently find good parameters in order to predict well on new data using Eq (1).
To address this problem setting, a number of generalizations of the binary SVM [17] have been proposed. We are specifically studying the following two formulations, dropping the bias terms in both. Throughout this paper, l(x) = max{0, 1 − x} will denote the hinge-loss.
Lee, Lin, and Wahba (LLW) [4] min Weston and Watkins (WW) [5] min Both formulations lead to very similar dual problems, as shown below. For the dualization of WW, we refer to [18]. The LLW dual is given below, where we introduce an auxiliary variable " w that is exploited by our distributed algorithm, as explained in the next section.
Derivation of lagrangian dual problems for Lin, Lee, and Wahba. Using slack variables, the primal LLW problem reads Distributed optimization of multi-class SVMs We introduce Lagrangian multipliers a 2 R nÂC , b 2 R n , and " LðW; x; a; b; " wÞ ¼ Slater's condition holds; therefore, we have strong duality and can use the dual The partial derivatives are given by LðW; x; a; b; " Setting those to zero leads to And plugging in into the lagrangian, finally gives the dual max a2R nÂC ; " w2R d

Distributed algorithms
In this section, we derive algorithms that solve (LLW) and (WW) in a distributed manner. With start by addressing LLW.

Algorithm for Lee, Lin, and Wahba
First note the following optimality condition in (LLW): Xa c : Which was exploited by prevalent solvers to remove the variable " w from the optimization. In contrast, the core idea of our LLW solver is to actually keep this auxiliary variable, as it decouples the objective function into the following sum: wÞ: Then we perform dual block coordinate ascent (DBCA) [18, Algorithm 3.1] with a specially tailored block structure, considering as blocks " w as well as each single coordinate α i,c . As we observe from Eq (9), the optimization of the columns α :,c is mutually independent of each other, given fixed " w. Hence, it can be distributed evenly over C many nodes. On the cth node, we run coordinate ascend on the subobjective subobjða c ; " wÞ over α i,c , i = 1, . . ., n, as described in the next paragraph. After one epoch of α computation, the variable " w is updated via Eq (9). The final algorithm is shown in Algorithm 1. k i x T i x i 7: while not optimal do 8: optimal True 9: shuffleData() 10: for i 2 I\I c do 11: solve1DimLLW(i, c) 12: " w Reduce( P c w c =C) 13: w c w c À " w As necessary step within Algorithm 1, we need to update every single α i,c . Optimizing α i,c is solving the problem where e i;c 2 R nÂC is one at the (i, c)th coordinate and zero elsewise. Set w c : Hence, the optimal solution of Eq (9) is given by d ¼ min fC À a i;c ; max fÀ a i;c ; À The corresponding pseudo-code can be found in Algorithm 2.
optimal False 7: if g > and α i,c > 0 then 8: optimal False 10: w c w c + δx i 11: α i,c α i,c + δ Convergence. It is well known that the block coordinate ascent method converges under suitable regularity conditions e.g. [19,20]. Our objective is continuously differentiable and strictly convex. The constraints are solely box constraints, hence the feasible set decomposes as a Cartesian product over the blocks. Algorithm 1 traverses the two blocks in cyclic order. Under these conditions, the DBCA method provably converges (e.g. Prop. 2.7.1 in [20]).
Note that in practice, we observed speedups by updating " w in Algorithm 1 after each tenth of an epoch, breaking the cyclic order. The blocks of coordinates are then traversed in so-called essentially cyclic order e.g. Section 2 in [19], meaning that there exists T 2 N such that each block is traversed at least once after T iterations. Closer inspection of the proof in Prop. 2.7.1 in [20] reveals that the result holds also under this slightly more general assumption.
Further, we drop variables α i,c in the optimization (shrinking) if they are not updated in three subsequent epochs. Once the stopping condition holds, we run another epoch of optimization over all variables (including the ones that were dropped). If the stopping criterion is then met, we terminate the algorithm and continue optimization elsewise.
Implementation details. Our implementation uses OpenMPI for inter-machine (MPI) [21] and OpenMP (MC) [22] for intra-machine communication. Note that Algorithm 1 has very mild communication requirements: the only communication needed is the sum of all weight vectors " w ¼ P c w c . Hence, MPI suffers very little from communication overhead between the various machines. In practice, we may not be able to fully parallelize to the maximum of C cores; therefore our algorithm will divide the set of classes into number-of-cores many chunks and optimize each class sequentially.
Recall, " d is the average number of non-zero entries per sample and " n is the average number of samples per class. Given c cores and I iteration steps, the optimization has an asymptotic where the first part of the sum amounts for the gradient updates and the second for the model communication and update. Given an average dual sparsity as 1 À " a the algorithm needs at each node an asymptotic space complexity of to store the weight matrix, the dual coefficients and the weight vector used for averaging.

Algorithm for Weston and Watkins
In this section, we propose a distributed algorithm for WW, which draws inspiration from the 1-factorization problem of a graph. The approach is presented below.
Preliminaries. Our approach is based on running dual coordinate ascend, e.g. algorithm 3.1 in [18], over α i,c on the (WW) objective function as follows. Denote the objective in Eq (5) by D(α) and recall from [18] that optimizing α i,c is solving the problem Setting w c = ∑ i: Which is optimal at: This computation is summarized in Algorithm 3.
Algorithm 3 Solving 1-dim sub-problem of WW optimal False 7: if g > and α i,c > 0 then 8: δ max{−α i,c , −g/2k i } 9: optimal False 10: w y i w y i + δx i 11: w c w c − δx i 12: α i,c α i,c + δ Core observation. We observe from above that optimizing with regard to α i,c will require only the weight vectors w y i and w c . In other words, given four different classes c 1 , c 2 , c 3 , c 4 , the optimization of the block of variables (α i , c 1 ) i:y i =c2 -according to Eq (11)-is independent of the optimization of the block (α i , c 3 ) i:y i =c4 . Hence it can be parallelized. In the next section we describe how we exploit this structure to derive a distributed optimization algorithm.
Excursus: 1-factorization of a graph. Assume that C is even. The core idea now is to form C 2 many disjoint blocks ða i ; c 1 Þ i:y i ¼c 2 ; . . . ; ða i ; c CÀ 1 Þ i:y i ¼C of variables. Each of these blocks can be optimized in parallel. The challenge is to derive a maximally distributed optimization schedule where each block (α i , c j ) i:y i =ck for any j 6 ¼ k is optimized.
To better understand the problem, we consider the following analogy. We are given a football league with C teams. Before the season, we have to decide on a schedule such that each team plays any other team exactly once. Furthermore, all teams shall play on every matchday so that in total we need only C À 1 matchdays. This problem is the 1-factorization problem in graph theory, e.g. [12]. The solution to this problem, illustrated in Fig 1, is as follows.
We arrange one node centrally and all other nodes in a regular polygon around the center node. At round r, we connect the centered node with node r and connect all other nodes orthogonal to this line. The pseudocode to compute the partner of a given node c at a certain round r is given in Algorithm 5. Note that in case of an uneven number of classes, we introduce a dummy class C þ 1, making the number of classes even. We run the algorithm, but skip all computations involving the dummy class.
Algorithm. The algorithm, shown in Algorithm 4, performs DBCA over the variables α i,c using the schedule derived in Section and the coordinate updates derived in Section. k i x T i x i 7: while not optimal do 8: optimal True 9: shuffleData() 10: for r ¼ 1::C À 1 do 11: for c ¼ 1::C do in parallel 12:c matchClass(C; c; r) 13: ifc > c then 14: for i 2 I c do 15: solve1DimWW(i;c) 16: for i 2 Ic do 17: solve1DimWW(i,c) Algorithm 5 Solving the graph 1-factorization problem. Indices start with one. if C is even then 6: return C 7: else 8: return c 9: return modð2r À c; C À 1Þ Convergence and implementation details. Furthermore, note that our algorithm performs the same coordinate updates as Algorithm 3.1 in [18]. Hence, they share the same favorable convergence behavior. Formally, convergence is guaranteed for exactly the same reasons discussed in Section. We also employ the same speedup tricks, i.e. shrinking and updating every tenth of an epoch.
In practice, because of limitations of computational resources, we might not be able to fully parallelize to the maximum of C=2 cores. In that case, our algorithm divides the set of classes into number-of-cores many chunks and solves each bundle sequentially. For optimal speedup, it is advisable to arrange the classes into chunks of equal number of classes and data points.
Given the average number of non-zero entries per sample " d and the average number of samples per class " n, c cores and I iteration steps, the optimization has an asymptotic runtime where the first part of the sum amounts for the gradient updates and the second for the model communication. Given an average dual sparsity as 1 À " a the algorithm needs at each node an asymptotic space complexity of O C c Á ð " d þ " n Á " aÞ À Á to store the weight matrix and the dual coefficients.
As with LLW, we implemented a mixed MPI-OpenMP solver for WW. However, note that, while LLW has mild communication needs, WW needs to pair the weight vectors of the matched classes c andc in each epoch, for which C=2 weight vectors needs to communicated among computers. Therefore it is crucial to communicate efficiently.
We tackled the problem as follows. First of all, we use OpenMP for computations on a single machine (efficiently parallelizing among cores). Here, due to the shared memory, no weight vectors need to be moved. The more challenging task is to handle inter-machine communication efficiently. Our approach is based on two key observations.
If the data is high-dimensional data, yet sparse, we keep the full weight matrix in memory for fast access, yet communicating only the non-zero entries between computers. Regardless of the increased computational effort, this takes only a fraction of time compared to sending the dense data.
Furthermore, we relax the WW matching scheme. Coming back to a football, consider each country hosts a league, and inside the league, we match the teams as known. Now we would like to match teams across leagues. In order to do so, we first match the countries with the scheme from Section. For each pair of countries, call them A and B, every team from country A plays any other team from country B. Coming back to classes and machines, this means we transfer bundles of classes (countries) between computers. This drastically reduces the network communication.

Experiments
This section is structured as follows. First we empirically verify the soundness of the proposed algorithms. Then we introduce the employed datasets, on which we investigate the convergence and runtime behavior of the proposed algorithms as well as the induced classification performance.
Each training algorithm was run three times, using randomly shuffled data, and the results were averaged. Note that the training set is the same in each run, but the different order of data points can impact the runtime of the algorithms.

Setup
For our experiments we use two different types of machines. Type A has 20 physical cpu cores, 128 GB of memory and a 10 GigaBit Ethernet network. Type B has 24 physical cpu cores and 386 GB of memory. On type B we ran the experiments involving CS due to the memory requirements.
Training repetitions were run on training sets with a random order of the data (note that the training set is the same in each run; only the order of points is shuffled, which can impact the DBCA algorithm). For LIBLINEAR solvers we use the newest available version as of April 2016 with the default settings.

Validation of solver
In our first experiment, we validate the correctness of the proposed solvers. We downloaded data from the LIBLINEAR (https://www.csie.ntu.edu.tw/*cjlin/liblinear/) [13] and UCI (https://archive.ics.uci.edu/ml/datasets.html) [26] dataset repositories. Where training and test splits are unavailable, we split the data once into 90% train and 10% test sets. For each dataset, the optimal feature scaling was selected, in order to maximize the average accuracy on the test sets. Datapoints in iris and news were thus normalized to unit norm, letter and satimage were normalized to unit variance. All other data was considered unnormalized.
Then we compare our LLW and WW solvers with the state-of-the-art implementation contained in the ML library Shark [27]. We implemented the same stopping criteria as [27]. The results (averaged over 10 runs) are shown in Table 1. We observe good accordance of the results and model sparsity of the proposed solvers and the reference implementation from the Shark toolbox, thus confirming that our respective solvers are indeed exact solvers of LLW and WW.
At random we tested whether the duality-gap closes or not. We did this for both solvers with different C values and datasets. In any case the duality gap closed, i.e. decreased by an order of two magnitudes. Based on this we chose our stopping criteria equal to 0.1 for the LSHTC datasets.  Table 2. The LSHTC-Ã datasets are high-dimensional text datasets taken from the well-known LSHTC corpus [14]. The datasets belong to the released competition rounds 1 to 3, i.e. '10-'12. LSHTC-2011 and LSHTC-2012 originate from the DMOZ corpus. The features were extracted using TF/IDF representation and we use the full feature resolution for training.

Speedup
In order to measure the speedup provided by increasing the number of machines/cores, we run a fix amount of iterations over the whole LSHTC-large dataset. We use 10 runs over 10 iterations with a fixed parameter C equal 1 without shrinking. While the MC execution works on one machine, the MPI executes on two or four machines, i.e. spreading the used cores evenly on each node. The results are shown in Fig 2. Both solvers exhibit linear speedup regardless if distributed or not, due to the small communication cost. Yet the speedup of WW is bounded by a larger constant compared to LLW.

Timing and classification results
Now we evaluate and compare the proposed algorithms on the LSHTC datasets for a range of C values, i.e. we perform no cross-validation. For comparison we use a solver from the wellknown LIBLINEAR package, namely the multi-core implementation with L2L1-loss (OVR) [28]. For completeness we also include the single-core Crammer-Singer implementation (CS) [13]. Due to the lack of performance we do not compare to the LLW and WW implementation in the Shark ML library.
For the multi-core solvers, i.e. OVR and WW-MC, we use 16 cores. MPI spreads over 2 or 4 machines using 8 and 4 cores respectively at each node, thus trains the model distributed. Table 3 shows the error and the model sparsity for the compared solutions. We further provide the Micro-F1 and Macro-F1 score in Table 4.
For all datasets the canonical multi-class formulations, i.e. CS and WW, perform significantly better than OVR. On one hand the error is smaller and the F1-scores better. On the other hand the learned models are much sparser, i.e. up to a magnitude. The results justify the increased solution complexity of canonical formulations. Comparing CS and WW, CS performs as well or slightly better at classifying. Though WW leads to a sparser model. To the best of our knowledge this is the first comparison of these well-known multi-class SVMs on the studied LSHTC data.
From Fig 3, we observe that the runtime of our solver outperforms the one of OVR and CS by up to two orders of magnitude. Even when distributed our solver outperforms multi-core OVR in all except one case.
All WW experiments use the same amount of cores, but with a varying degree of distribution. We observe that the communication imposes a modest overhead. This overhead is influenced by the model density which is higher for smaller C values and effect of shrinking which is higher with larger C values. Yet in the regime with best classification results, e.g. C equal 0.1 and 1, the overhead is small.

Lin, Lee, & Wahba.
Knowing that LLW converges to the correct solution, as the dualitygap closes, the results indicate that the chosen C range is not suitable. For LSHTC-small we conducted experiments with much larger C values. And indeed, as shown in Table 5, LLW performs best in a nearly unconstrained setting. In our experiments we observed that the model learned by LLW is never sparse, neither in the weight matrix W, nor in dual factors α. Resource limitations and slow convergence properties hindered us to conduct experiments with even larger C values. It is left to future work to explore this space or even develop a unconstrained version of LLW.
There is a line of research on parallelizing stochastic gradient (SGD) based training of MC-SVMs over multiple computers [40,41]. SGD builds on iteratively approximating the loss term by one that is based on a subset of the data (mini-batch). In contrast, batch solvers (such as the ones proposed in the present paper) are based on the full sample. In this sense, our approach is completely different to SGD. While there is a long ongoing discussion whether the batch or the SGD approach is superior, the common opinion is that SGD has its advantages in the early phase of the optimization, while classical batch solvers shine in the later phase. In this sense, the two approaches are complementary and could also be combined.
The related work that is the most closest to the present work is by [42]. They build on the alternating direction method of multipliers (ADMM) [43] to break the Crammer and Singer optimization problem into smaller parts, which can be solved individually on different computers. In contrast to our approach, the optimization problem is parallelized over the samples, not the optimization variables. In our problem setting, high-dimensional sparse data, the Distributed optimization of multi-class SVMs model size is vary large. Because each node holds the whole model in memory, this algorithm hardly scales with large label spaces. E.g. consider Table 2; the model for LSHTC-2011 contains %16 Ã 10 9 parameters. Note also that it is unclear at this point whether the approach of [42] could be adapted to LLW and WW, which are the object of study in the present paper. Note that beyond SVMs there is a large body of work on distributed multi-class [44,45] and multi-label learning algorithms [46], which is outside of the scope of the present paper.
When the performance of single solvers saturates one can consider to refine them, e.g. by considering the reliability of single class estimates [47] to enhance the prediction accuracy, or to learn ensembles of SVMs to get better prediction by combining several estimates. This can be done by bagging [48] or boosting [49] them or by exploiting smart compositions of solvers, e.g. with LibD3C [50].

Conclusion
We proposed distributed algorithms for solving the multi-class SVM formulations by Lee et al.
(LLW) and Weston and Watkins (WW). The algorithm addressing LLW takes advantage of an auxiliary variable, while our approach to optimizing WW in parallel is based on the 1-factorization problem from graph theory.
The experiments confirmed the correctness of the solver (in the sense of an exact solver) and show linear speedup when the number of cores is increased. This speedup allows us to train LLW and WW on LSHTC datasets, for which results were lacking in the literature.
Our analysis contributed to comparing MC-SVM formulations on rather large data sets, where comparisons were still lacking. In comparison to OVR we showed that WW can achieve competitive classification results in less time, while still leading to a much sparser model. Unexpectedly, LLW shows clear disadvantages over the other MC-SVMs. Yet the favorable scaling properties make further research interesting, for instance regarding the development of an unconstrained algorithm. We ease further research by publishing the source code under https://github.com/albermax/xcsvm.
Overcoming the limitations of a single machine, i.e. distribution, is a key problem and a key enabler in large scale learning. To best of our knowledge, we are the first to train an exact, allin-one MC-SVMs in a distributed manner. We hope this first step inspires further research in this context.
In the future, we would like to study extensions of the concepts presented in this paper to various more MC-SVMs, including the Crammer and Singer MC-SVM [51], the l p -norm MC-SVM [52], and scatter based MC-SVMs [53]. Distributed optimization of multi-class SVMs