Applications of Spectral Gradient Algorithm for Solving Matrix ℓ2,1-Norm Minimization Problems in Machine Learning

The main purpose of this study is to propose, analyze, and test a spectral gradient algorithm for solving a convex minimization problem. The considered problem covers the matrix ℓ2,1-norm regularized least squares, which is widely used in multi-task learning to capture the features shared among tasks. To solve the problem, we first minimize a quadratic approximation of the objective function to derive a search direction at the current iterate. We show that this direction is automatically a descent direction and reduces to the original spectral gradient direction if the regularization term is removed. Second, we incorporate a nonmonotone line search along this direction to improve the algorithm's numerical performance. Furthermore, we show that the proposed algorithm converges to a critical point under some mild conditions. An attractive feature of the proposed algorithm is that it is easy to implement and requires only the gradient of the smooth function and the objective function's values at each step. Finally, we carry out experiments on synthetic data, which verify that the proposed algorithm works well and performs better than the compared solvers.


Introduction
The tasks in medical diagnosis [1], text classification [2][3][4][5], biomedical informatics [6,7] and other applications [8][9][10][11][12] are often related to each other. Hence, capturing the information shared among tasks becomes the key issue in learning [13][14][15]. Suppose we are given the training set of t tasks $A = [A_1; \ldots; A_t] \in \mathbb{R}^{m \times n}$ and $b = [b_1; \ldots; b_t]^\top \in \mathbb{R}^{m}$, where $A_j$ is the data for the j-th task and $b_j$ is the corresponding response. We let $x_j \in \mathbb{R}^{n}$ be the sparse feature for the j-th task, and let $X = [x_1, \ldots, x_t] \in \mathbb{R}^{n \times t}$ be the joint feature to be learned. In order to select features globally, one encourages several rows of X to be zero and solves the following ℓ2,1-norm regularized least squares problem [16,17]

$$\min_{X \in \mathbb{R}^{n \times t}} \; \sum_{j=1}^{t} \|A_j x_j - b_j\|_2^2 + \mu \|X\|_{2,1}, \qquad (1)$$

where μ > 0 is a weighting parameter, and $\|X\|_{2,1}$ is defined as the sum of the ℓ2-norms of the rows of a matrix. It is well known that the ℓ2,1-norm encourages the predictions from different tasks to share similar parameter sparsity patterns.
In the past few years, several algorithms have been proposed, analyzed, and tested to solve the nonsmooth convex minimization Problem (1). The algorithm in [18] transformed Eq (1) equivalently into a smooth convex optimization problem and then minimized it by Nesterov's gradient method. The method in [16] reformulated Eq (1) as a constrained optimization problem and minimized it alternately. The algorithm in [19] and its variant [20] reformulated the problem as an equivalent constrained minimization problem by introducing an auxiliary variable, and then minimized the corresponding augmented Lagrangian function alternately. Finally, for an accelerated proximal gradient version of the algorithm in [19], one can refer to [21].
Unlike the existing work, which mainly concerns Problem (1), in this paper we focus on the following generalized nonsmooth convex optimization problem

$$\min_{X \in \mathbb{R}^{n \times t}} \; F(X) + \mu \|X\|_{2,1}, \qquad (2)$$

where $F : \mathbb{R}^{n \times t} \to \mathbb{R}$ is continuously differentiable (may be non-convex) and bounded below. Clearly, Model (2) includes Eq (1) as a special case when F is a least squares function. As is well known, the spectral gradient method was introduced by Barzilai and Borwein [22] for solving smooth unconstrained minimization problems, was later developed in [23][24][25][26], and was then extended to ℓ1-regularized nonsmooth minimization [27]. However, its numerical performance on nonsmooth minimization problems involving the matrix ℓ2,1-norm has not yet been investigated. Therefore, extending the spectral gradient algorithm to solve Problem (2) may be significant both in theory and in practice.

The first contribution of this study lies in the design of the search direction at each iteration, which is derived by minimizing a quadratic approximation of the objective function while making full use of the special structure of the ℓ2,1-norm. We also show that the generated direction is automatically a descent direction provided that the spectral coefficient is positive. The second contribution of the paper is the nonmonotone line search, which is used to improve the algorithm's performance. At each iteration, the algorithm requires only the gradient of the smooth term and the value of the objective function, which makes it suitable for high-dimensional problems. Finally, we compare the performance of the proposed method with the solvers IADM_MFL and SLEP; the comparison illustrates that the proposed method is fast, efficient, and competitive.

The paper is organized as follows. In Section 2, we provide some notations and preliminaries, and construct the new algorithm together with its properties. In Section 3, we establish the global convergence of the algorithm. In Section 4, we report some numerical results and performance comparisons. Finally, we conclude our paper in Section 5.

Notations and preliminaries
In the first place, we summarize the notations used in this paper. Matrices are written as uppercase letters and vectors as lowercase letters. For a matrix X, its i-th row and j-th column are denoted by $X_{i,:}$ and $X_{:,j}$, respectively. The Frobenius norm and the ℓ2,1-norm of the matrix $X \in \mathbb{R}^{n \times t}$ are defined as, respectively,

$$\|X\|_F = \Big(\sum_{i=1}^{n} \sum_{j=1}^{t} X_{ij}^2\Big)^{1/2} \quad \text{and} \quad \|X\|_{2,1} = \sum_{i=1}^{n} \|X_{i,:}\|_2.$$

For any two matrices $X, Y \in \mathbb{R}^{n \times t}$, we define $\langle X, Y \rangle = \mathrm{tr}(X^\top Y)$ (the standard trace inner product in $\mathbb{R}^{n \times t}$), so that $\|X\|_F = \sqrt{\langle X, X \rangle}$. For $x \in \mathbb{R}^{d}$, we denote by Diag(x) the diagonal matrix with the components of the vector x on its diagonal. The superscript ⊤ denotes the transpose of a vector or a matrix. For the sake of simplicity, we let $\mathcal{F}(X) = F(X) + \mu \|X\|_{2,1}$. Additional notations will be introduced when they occur.
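As a small numerical illustration of these definitions, the following NumPy sketch computes the ℓ2,1-norm, the Frobenius norm, and the trace inner product of a toy matrix. The function names and the example matrix are illustrative assumptions and do not come from the original experiments.

```python
import numpy as np

def l21_norm(X):
    """l2,1-norm: sum of the Euclidean norms of the rows of X."""
    return float(np.sum(np.linalg.norm(X, axis=1)))

def trace_inner(X, Y):
    """Trace inner product <X, Y> = tr(X^T Y)."""
    return float(np.trace(X.T @ Y))

def frobenius_norm(X):
    """Frobenius norm expressed through the trace inner product."""
    return float(np.sqrt(trace_inner(X, X)))

# Toy matrix (hypothetical): row norms are 5, 0 and 13.
X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])

print(l21_norm(X))        # 18.0
print(frobenius_norm(X))  # square root of 194
```

Note that a zero row contributes nothing to the ℓ2,1-norm, which is why penalizing this norm favors solutions with entire rows equal to zero.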
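We now briefly review the spectral gradient method for the unconstrained smooth minimization problem $\min f(x),\ x \in \mathbb{R}^{n}$, where $f : \mathbb{R}^{n} \to \mathbb{R}$ is a continuously differentiable function. The spectral gradient method is defined by

$$x_{k+1} = x_k - \lambda_k^{-1} \nabla f(x_k),$$

where one of the choices of $\lambda_k$ (named the spectral coefficient) is given by

$$\lambda_k = \frac{s_{k-1}^\top y_{k-1}}{s_{k-1}^\top s_{k-1}}, \qquad s_{k-1} = x_k - x_{k-1}, \quad y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1}).$$

If $\lambda_k > 0$, the direction $-\lambda_k^{-1} \nabla f(x_k)$ descends automatically at the current point.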
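The iteration reviewed above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' code: it assumes the Barzilai-Borwein choice λ = sᵀy / sᵀs for the spectral coefficient, falls back to a default value whenever sᵀy is not positive, and uses a small strictly convex quadratic as a hypothetical test problem. The function name, parameter names, and test data are all assumptions.

```python
import numpy as np

def spectral_gradient(grad, x0, lam0=1.0, tol=1e-8, max_iter=500):
    """Plain spectral (Barzilai-Borwein) gradient iteration, no line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    lam = lam0  # initial spectral coefficient (assumed value)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - g / lam          # step along -lam^{-1} * gradient
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        # Assumed safeguard: reuse the default when the curvature estimate
        # is not positive, so the next direction is still a descent direction.
        lam = sy / (s @ s) if sy > 0 else lam0
        x, g = x_new, g_new
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x with A diagonal.
A = np.diag([1.0, 4.0, 10.0])
b = np.ones(3)
x_star = spectral_gradient(lambda x: A @ x - b, np.zeros(3))
print(x_star)
```

For this quadratic the exact minimizer is the solution of Ax = b, so the result can be checked against `np.linalg.solve(A, b)`.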

Algorithm
Now, we turn our attention to the original Model (2). Since the ℓ2,1-norm is nondifferentiable, we approximate the objective function by the following quadratic function $Q_k$: where $\nabla F(X_k) \in \mathbb{R}^{n \times t}$ is the gradient of F at $X_k$; $\Lambda_k$ is the so-called spectral coefficient, which is defined by where The favorable structure of Eq (5) allows the i-th row of the matrix $M_k$ to be written explicitly as where the convention 0 · 0/0 = 0 is followed. Hence, the search direction at the current point can be expressed as Obviously, Eq (6) reduces to $D_k = -\Lambda_k^{-1} \nabla F(X_k)$ in the case of μ = 0, which means that Eq (6) covers the traditional spectral gradient direction as a special case.
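To make the construction of the search direction concrete, here is a hedged NumPy sketch. It assumes that the quadratic model has the usual proximal form F(X_k) + ⟨∇F(X_k), X − X_k⟩ + (Λ_k/2)‖X − X_k‖_F² + μ‖X‖_{2,1}; under that assumption the minimizer is obtained by shrinking each row of X_k − Λ_k⁻¹∇F(X_k) by μ/Λ_k, and the direction is the difference between this minimizer and X_k. The function names and the example data are hypothetical.

```python
import numpy as np

def row_soft_threshold(W, tau):
    """Shrink the Euclidean norm of every row of W by tau (zero rows stay zero)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Dividing by 1 for zero rows implements the convention 0 * 0/0 = 0.
    scale = np.maximum(norms - tau, 0.0) / np.where(norms > 0, norms, 1.0)
    return scale * W

def search_direction(X, G, lam, mu):
    """Direction D = M - X, where M minimizes the assumed quadratic model."""
    M = row_soft_threshold(X - G / lam, mu / lam)
    return M - X

# Hypothetical data: current iterate X, gradient G of the smooth term.
X = np.zeros((2, 2))
G = -np.array([[3.0, 4.0],
               [0.3, 0.4]])
D = search_direction(X, G, lam=1.0, mu=1.0)
print(D)  # second row is shrunk to zero because its norm 0.5 is below mu/lam
```

With μ = 0 no shrinkage takes place and the function returns −G/λ, which matches the reduction to the spectral gradient direction stated in the text.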
The following lemma verifies that $D_k$ is a descent direction when the optimal solution has not been reached.

is correct. □
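To improve the algorithm's performance, we use the classical nonmonotone line search [28] to find a suitable stepsize along the direction. It is well known that this technique allows the function values to increase occasionally in some iterations while decreasing over the whole iterative process. Letting δ ∈ (0, 1), ρ ∈ (0, 1) and $\tilde{m}$ be a given positive integer, we choose the smallest nonnegative integer $j_k$ such that the stepsize $\alpha_k = \tilde{\alpha} \rho^{j_k}$ satisfies

$$\mathcal{F}(X_k + \alpha_k D_k) \le \max_{0 \le j \le m(k)} \mathcal{F}(X_{k-j}) + \delta \alpha_k \Delta_k, \qquad (9)$$

where $\mathcal{F} = F + \mu \|\cdot\|_{2,1}$ is the objective function, $0 \le m(k) \le \min\{m(k-1) + 1, \tilde{m}\}$ (m(0) = 0), and

$$\Delta_k = \langle \nabla F(X_k), D_k \rangle + \mu \|X_k + D_k\|_{2,1} - \mu \|X_k\|_{2,1}. \qquad (10)$$

In summary, the full steps of the Nonmonotone Spectral Gradient algorithm for ℓ2,1-norm minimization (abbr. NSGL21) can be described as follows: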
Step 5. Let k := k + 1. Go to Step 1.

As stated in the preceding section, the generated direction descends automatically whenever $\Lambda_k > 0$. To ensure $\Lambda_k > 0$, we choose a sufficiently small $\Lambda^{(\min)} > 0$ and a sufficiently large $\Lambda^{(\max)} > 0$, such that $\Lambda_k$ is forced as

$$\Lambda_k = \min\{\Lambda^{(\max)}, \max\{\Lambda_k, \Lambda^{(\min)}\}\}.$$

This approach ensures that the descent property is preserved at each step.
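The overall scheme (search direction, nonmonotone backtracking, safeguarded spectral coefficient) can be sketched as follows. This is an illustrative sketch under several assumptions that are not confirmed by the text: the direction comes from row-wise shrinkage of a proximal quadratic model, the spectral coefficient uses the choice ⟨S, Y⟩/⟨S, S⟩, the initial coefficient is 1, the loop stops when ‖D_k‖_F ≤ tol, and a lower bound on the stepsize guards the backtracking loop. All function and parameter names are hypothetical.

```python
import numpy as np

def l21(X):
    return np.sum(np.linalg.norm(X, axis=1))

def direction(X, G, lam, mu):
    # Assumed direction: row-wise shrinkage of X - G/lam by mu/lam, minus X.
    W = X - G / lam
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(norms - mu / lam, 0.0) / np.where(norms > 0, norms, 1.0)
    return scale * W - X

def nsgl21_sketch(F, gradF, X0, mu, delta=1e-4, rho=0.5, alpha0=1.0,
                  memory=5, lam_min=1e-20, lam_max=1e20,
                  tol=1e-6, max_iter=500):
    X = np.array(X0, dtype=float)
    G = gradF(X)
    lam = 1.0                                   # assumed initial coefficient
    history = [F(X) + mu * l21(X)]              # objective values so far
    for _ in range(max_iter):
        D = direction(X, G, lam, mu)
        if np.linalg.norm(D) <= tol:            # assumed stopping rule
            break
        Delta = np.sum(G * D) + mu * (l21(X + D) - l21(X))
        ref = max(history[-(memory + 1):])      # nonmonotone reference value
        alpha = alpha0
        while True:                             # backtracking on alpha
            X_new = X + alpha * D
            f_new = F(X_new) + mu * l21(X_new)
            if f_new <= ref + delta * alpha * Delta or alpha < 1e-12:
                break
            alpha *= rho
        G_new = gradF(X_new)
        S, Y = X_new - X, G_new - G
        ss, sy = np.sum(S * S), np.sum(S * Y)
        # Safeguarded spectral coefficient, projected onto [lam_min, lam_max].
        lam = min(lam_max, max(lam_min, sy / ss)) if ss > 0 else 1.0
        X, G = X_new, G_new
        history.append(f_new)
    return X

# Hypothetical check: F(X) = 0.5 * ||X - B||_F^2, whose regularized minimizer
# is the row-wise shrinkage of B by mu.
B = np.array([[3.0, 4.0], [0.3, 0.4]])
X_hat = nsgl21_sketch(lambda X: 0.5 * np.sum((X - B) ** 2),
                      lambda X: X - B, np.zeros_like(B), mu=1.0)
print(X_hat)
```

Only the gradient of the smooth term and objective values are evaluated inside the loop, which reflects the low per-iteration cost emphasized in the text.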
Remark 1. The steps of the proposed algorithm are novel and differ from other existing approaches. The well-known approach [18] reformulated Problem (2) as the following constrained smooth convex optimization problem $\min_{X \in \mathbb{R}^{n \times t},\, x \in \mathbb{R}^{n}} F(X) + \mu \cdots$ and then solved it via Nesterov's method. The method in [19] focused on the least squares Model (1) and used an auxiliary variable to transform the model equivalently as An alternating direction method of multipliers is then used to solve the resulting model, and closed-form solutions are derived for each subproblem. Clearly, our proposed algorithm differs from the above-mentioned approaches in the sense that we solve the original Model (2) directly without any transformation. □

Convergence analysis
This section is devoted to establishing the global convergence of algorithm NSGL21. For this purpose, we make the following assumption.

Proof. In the case of $D_k \neq 0$, Lemma 1 shows that $D_k$ is a descent direction, which implies that $X_k$ is not a stationary point of the objective $\mathcal{F}$. On the other hand, since $D_k = 0$ is the solution of Eq (5), for any $\xi D \in \mathbb{R}^{n \times t}$ with ξ > 0 we have which indicates that $X_k$ is a stationary point of $\mathcal{F}$. □

Lemma 3. Let l(k) be an integer such that $k - m(k) \le l(k) \le k$ and $\mathcal{F}(X_{l(k)}) = \max_{0 \le j \le m(k)} \mathcal{F}(X_{k-j})$. Then the sequence $\{\mathcal{F}(X_{l(k)})\}$ is nonincreasing and the search direction $D_{l(k)}$ satisfies

Proof. It is not difficult to see that $\mathcal{F}(X_{l(k+1)}) \le \mathcal{F}(X_{l(k)})$, which indicates that the maximum value of the objective function is nonincreasing at each iteration. Moreover, by Eq (9), we have that for all $k > \tilde{m}$, By Assumption 1, the sequence $\{\mathcal{F}(X_{l(k)})\}$ admits a limit as k → ∞. Hence, it follows that $\lim_{k \to \infty} \alpha_{l(k)} \Delta_{l(k)} = 0$. On the other hand, by the definition of $\Delta_k$ in Eq (10) and the inequality Eq (8), it is easy to deduce that $\Delta_{l(k)} \le -\frac{\Lambda^{(\min)}}{2} \|D_{l(k)}\|_F^2 < 0$. Combining with Eq (13)

Proof. Let $\bar{X}$ be a limit point of $\{X_k\}$, and let $\{X_k\}_{K_1}$ be a subsequence of $\{X_k\}$ converging to $\bar{X}$. Then by Eq (12) In this condition, we assume that there exists a constant > 0 such that Since $\alpha_k$ is the first value to satisfy Eq (9), it follows from Step 3 in Algorithm 1 that there exists an index $\bar{k}$ such that, for all $k \ge \bar{k}$ and k ∈ K, Since F is continuously differentiable, by the mean-value theorem on F, there exists a constant $\theta_k \in (0, 1)$ such that Combining with Eq (17), we have Since $\alpha_k \to 0$ in Eq (15), we have $\alpha_k < \rho$ as k → ∞. It is not difficult to show that Subtracting $\Delta_k$ from the left side of Eq (18) and noting the definition of $\Delta_k$, it is clear that

Noting Eq (19), Eq (18) shows that Taking the limit as k ∈ K, k → ∞ on both sides of Eq (20) and using the smoothness of F, we obtain This yields a contradiction because Eq (16) indicates that $\|D_k\|_F$ is bounded. □

Numerical experiments
In this section, we present numerical results to illustrate the feasibility and efficiency of the algorithm NSGL21. In particular, we also test against the recent solvers IADM_MFL and SLEP for performance comparison. In running SLEP (Sparse Learning with Efficient Projections), we use the code at http://www.public.asu.edu/~jye02/Software/SLEP/index.htm in its Matlab package, and choose mFlag = 1 and lFlag = 1 for using an adaptive line search. All experiments are carried out under Windows 7 and Matlab v7.8 (2009a) running on a Lenovo laptop with an Intel Pentium CPU at 2.5 GHz and 4 GB of memory.
As in [16], in the first test, $\bar{X}_{:,j}$ is generated from a 5-dimensional Gaussian distribution with zero mean and covariance diag{1, 0.64, 0.49, 0.36, 0.25}. To each $\bar{X}_{:,j}$, we then add up to 20 irrelevant dimensions which are exactly zero. The training and test data $A_j$ are Gaussian matrices and their response data $b_j$ are generated by $b_j = A_j \bar{X}_{:,j} + \omega$, where ω is zero-mean Gaussian noise with standard deviation 1e−2. We start NSGL21 from the zero point and terminate the iterative process once a stopping criterion with tolerance tol > 0 is satisfied. The quality of the solution $X^*$ is measured by the relative error to $\bar{X}$, i.e., $\mathrm{RelErr} = \|X^* - \bar{X}\|_F / \|\bar{X}\|_F$. In this test, we take $\tilde{\alpha} = 1$, μ = 1e−2, t = 200, n = 15, tol = 1e−3, $\Lambda^{(\min)} = 10^{-20}$, $\Lambda^{(\max)} = 10^{20}$, and $m_j = 100$ for all j = 1, 2, ..., t. Moreover, to compare the performance of these algorithms fairly, we run each code from the zero point, use all the default parameter values, and observe their convergence behavior in obtaining solutions of similar accuracy. To illustrate the performance of each algorithm, we plot their convergence behavior with respect to the relative error and the computing time in Figs 1 and 2. From Figs 1 and 2, we see that IADM_MFL and NSGL21 produced faithful results, whereas SLEP did not. We tried running SLEP with more iterations in preliminary experiments, but it made no further progress. Meanwhile, NSGL21 requires fewer iterations than IADM_MFL to achieve solutions of similar quality. In both plots, the green line lies at the bottom in most cases, which indicates that NSGL21 is superior to the other two solvers.
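The data generation described above can be approximated by the following NumPy sketch. It is a hypothetical reconstruction, not the script used for the reported results: it assumes that the five relevant features occupy the first five rows, that the remaining n − 5 rows are the zero-valued irrelevant dimensions, and that each response is the noisy product of the task matrix and the corresponding true column. The function names, the seed, and the storage of tasks in Python lists are assumptions.

```python
import numpy as np

def make_synthetic(t=200, n=15, m_j=100, noise_std=1e-2, seed=0):
    """Generate task matrices A_j, responses b_j and the true feature matrix."""
    rng = np.random.default_rng(seed)
    # Standard deviations are the square roots of the stated variances.
    std = np.sqrt(np.array([1.0, 0.64, 0.49, 0.36, 0.25]))
    X_bar = np.zeros((n, t))
    X_bar[:5, :] = std[:, None] * rng.standard_normal((5, t))  # relevant rows
    A = [rng.standard_normal((m_j, n)) for _ in range(t)]      # Gaussian data
    b = [A[j] @ X_bar[:, j] + noise_std * rng.standard_normal(m_j)
         for j in range(t)]
    return A, b, X_bar

def relative_error(X, X_bar):
    """Relative error in the Frobenius norm."""
    return np.linalg.norm(X - X_bar) / np.linalg.norm(X_bar)

# Small instance for a quick look at the shapes.
A, b, X_bar = make_synthetic(t=4, m_j=10)
print(X_bar.shape, A[0].shape, b[0].shape)
print(relative_error(np.zeros_like(X_bar), X_bar))  # error of the zero start
```

Starting from the zero matrix gives a relative error of exactly one, which is a convenient baseline when reading convergence plots of the relative error.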
This simple test is not enough to verify that NSGL21 is the winner. To further illustrate the benefit of NSGL21, we examine its behavior for different dimensions and different numbers of tasks. The results are listed in Table 1. From Table 1, we clearly observe that each algorithm requires more computing time as the problem dimension and the number of tasks increase. Meanwhile, the number of iterations required by NSGL21 and IADM_MFL increases slightly in the higher-dimensional cases. We also observe that, for all the tested problems, both NSGL21 and IADM_MFL are terminated abnormally in producing similar quality solutions in the sense of comparable relative errors and final function values. However, SLEP cannot generate acceptable solutions even though more iterations were permitted in preliminary experiments. Hence, we conclude that NSGL21 and IADM_MFL perform better than SLEP. Now, we turn our attention to the comparison between IADM_MFL and NSGL21. To obtain solutions of similar quality, NSGL21 is faster than IADM_MFL and saves at least 50% of the iterations. It is therefore reasonable to conclude that NSGL21 is the winner among the compared solvers.

Conclusions
In this paper, we have proposed, analyzed, and tested a nonmonotone spectral gradient algorithm for solving ℓ2,1-norm regularized minimization problems. This type of problem mainly appears in computer vision, text classification and biomedical informatics. Due to the nonsmoothness of the regularization term, minimizing the problem is challenging. To the best of our knowledge, SLEP and IADM_MFL are the only available solvers for this problem. However, both solvers transform the problem into an equivalent equality-constrained minimization problem and then minimize it alternately. It is well known that the spectral gradient algorithm is very effective for smooth minimization problems. Hence, its performance on ℓ2,1-norm regularized problems is worth investigating, which is the main motivation of our paper. At each iteration, the method proposed in this paper minimizes an approximate quadratic model of the objective function to produce a search direction. We showed that the generated direction descends automatically and that the algorithm converges globally under some mild conditions. Additionally, the numerical experiments illustrate that the proposed algorithm is competitive with, or even performs better than, SLEP and IADM_MFL; this is the numerical contribution of our paper. As noted, the ℓ2,1-norm regularized minimization problem arises in multi-task learning for capturing the joint features among tasks. However, we did not test the algorithm on real data; this is left for further investigation. Finally, we expect that the proposed method and its extensions could find further applications to problems in relevant areas of machine learning.