Applications of Spectral Gradient Algorithm for Solving Matrix ℓ2,1-Norm Minimization Problems in Machine Learning

  • Yunhai Xiao ,

    yhxiao@henu.edu.cn

    Affiliation Institute of Applied Mathematics, School of Mathematics and Statistics, Henan University, Kaifeng, Henan Province, China

  • Qiuyu Wang,

    Affiliation School of Mathematics and Statistics, Henan University, Kaifeng, Henan Province, China

  • Lihong Liu

    Affiliation School of Mathematics and Statistics, Henan University, Kaifeng, Henan Province, China


Abstract

The main purpose of this study is to propose, analyze, and test a spectral gradient algorithm for solving a convex minimization problem. The problem considered covers the matrix ℓ2,1-norm regularized least squares problem, which is widely used in multi-task learning to capture the features shared among tasks. To solve the problem, we first minimize a quadratic approximation of the objective function to derive a search direction at the current iterate. We show that this direction is automatically a descent direction and that it reduces to the original spectral gradient direction if the regularization term is removed. Second, we incorporate a nonmonotone line search along this direction to improve the algorithm’s numerical performance. Furthermore, we show that the proposed algorithm converges to a critical point under some mild conditions. An attractive feature of the proposed algorithm is that it is easy to implement and requires only the gradient of the smooth function and the objective function values at each step. Finally, we conduct experiments on synthetic data, which verify that the proposed algorithm works well and performs better than the solvers it is compared with.

1 Introduction

The tasks in medical diagnosis [1], text classification [2–5], biomedical informatics [6, 7] and other applications [8–12] are often related to each other. Hence, capturing the information shared among the tasks becomes the key issue in learning [13–15]. Suppose we are given the training set of t tasks $\{(A_j, b_j)\}_{j=1}^{t}$, where $A_j \in \mathbb{R}^{m_j \times n}$ is the data for the j-th task and $b_j \in \mathbb{R}^{m_j}$ is the corresponding response. We let $x_j \in \mathbb{R}^{n}$ be the sparse feature for the j-th task, and let $X = [x_1, \ldots, x_t] \in \mathbb{R}^{n \times t}$ be the joint feature to be learned. In order to select features globally, one encourages several rows of X to be zero and solves the following ℓ2,1-norm regularized least squares problem [16, 17]

$$\min_{X \in \mathbb{R}^{n \times t}} \ \frac{1}{2}\sum_{j=1}^{t} \|A_j x_j - b_j\|_2^2 + \mu \|X\|_{2,1}, \qquad (1)$$

where μ > 0 is a weighting parameter, and $\|X\|_{2,1}$ is defined as the sum of the 2-norms of the rows of the matrix. It is well known that the ℓ2,1-norm encourages the predictions from different tasks to share similar parameter sparsity patterns.
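To make the objective concrete, the following is a minimal NumPy sketch of the ℓ2,1-norm and of a multi-task least squares objective of the kind described above. The function names `norm21` and `objective`, the storage of task j in column j of X, and the factor 1/2 in the data-fit term are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def norm21(X):
    """l2,1-norm: the sum of the Euclidean norms of the rows of X."""
    return float(np.sum(np.linalg.norm(X, axis=1)))

def objective(X, A_list, b_list, mu):
    """Multi-task least squares plus mu * ||X||_{2,1}.

    Assumes column j of X holds the weights of task j and that the
    data-fit term carries a factor 1/2 (assumptions of this sketch).
    """
    fit = sum(0.5 * np.sum((A @ X[:, j] - b) ** 2)
              for j, (A, b) in enumerate(zip(A_list, b_list)))
    return fit + mu * norm21(X)

# A row that is entirely zero contributes nothing to the penalty,
# which is why the norm favours row-sparse solutions.
X = np.array([[3.0, 4.0],
              [0.0, 0.0]])
print(norm21(X))  # 5.0
```

Because the penalty is a sum over rows, driving a whole row of X to zero removes the corresponding feature from every task at once.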

In the past few years, several algorithms have been proposed, analyzed, and tested for the nonsmooth convex minimization Problem (1). The algorithm in [18] transformed Eq (1) equivalently into a smooth convex optimization problem and then minimized it by Nesterov’s gradient method. The method in [16] reformulated Eq (1) as a constrained optimization problem and minimized it alternately. The algorithm in [19] and its variant [20] reformulated the problem as an equivalent constrained minimization problem by introducing an auxiliary variable, and then minimized the corresponding augmented Lagrangian function alternately. Finally, for an accelerated proximal gradient version of the algorithm in [19], one can refer to [21].

Unlike previous work, which is mainly concerned with Problem (1), in this paper we focus on the following generalized nonsmooth convex optimization problem

$$\min_{X \in \mathbb{R}^{n \times t}} \ F(X) + \mu \|X\|_{2,1}, \qquad (2)$$

where $F: \mathbb{R}^{n \times t} \to \mathbb{R}$ is continuously differentiable (possibly non-convex) and bounded below. Clearly, Model (2) includes Eq (1) as a special case when F is a least squares function. The spectral gradient method was originated by Barzilai and Borwein [22] for solving smooth unconstrained minimization problems, was later developed in [23–26], and was then extended to solve ℓ1-regularized nonsmooth minimization [27]. However, its numerical performance on nonsmooth minimization problems involving the matrix ℓ2,1-norm has not yet been investigated. Therefore, extending the spectral gradient algorithm to solve Problem (2) may be significant both in theory and in practice. The first contribution of this study lies in the design of the search direction at each iteration, which is derived by minimizing a quadratic approximation of the objective function while making full use of the special structure of the ℓ2,1-norm. We also show that the generated direction is automatically a descent direction provided that the spectral coefficient is positive. The second contribution of the paper is the nonmonotone line search, which is used to improve the algorithm’s performance. At each iteration, the algorithm requires only the gradient of the smooth term and the value of the objective function, which makes it suitable for high-dimensional problems. Finally, we compare its performance with the solvers IADM_MFL and SLEP; the results illustrate that the proposed method is fast, efficient, and competitive.

The paper is organized as follows. In Section 2, we provide some notations and preliminaries, and construct the new algorithm together with its properties. In Section 3, we establish the global convergence of the algorithm. In Section 4, we report some numerical results and do some performance comparisons. Finally, we conclude our paper in Section 5.

2 Algorithm

2.1 Notations and preliminaries

We first summarize the notation used in this paper. Matrices are written as uppercase letters and vectors as lowercase letters. For a matrix X, its i-th row and j-th column are denoted by $X_{i,:}$ and $X_{:,j}$, respectively. The Frobenius norm and the ℓ2,1-norm of a matrix $X \in \mathbb{R}^{n \times t}$ are defined as, respectively,

$$\|X\|_F = \Big(\sum_{i=1}^{n}\sum_{j=1}^{t} X_{i,j}^2\Big)^{1/2} \quad \text{and} \quad \|X\|_{2,1} = \sum_{i=1}^{n} \|X_{i,:}\|_2.$$

For any two matrices $X, Y \in \mathbb{R}^{n \times t}$, we define $\langle X, Y\rangle = \mathrm{tr}(X^\top Y)$ (the standard trace inner product in $\mathbb{R}^{n \times t}$), so that $\|X\|_F = \sqrt{\langle X, X\rangle}$. If $x \in \mathbb{R}^{n}$, we denote by “Diag(x)” the diagonal matrix with the components of the vector x on its diagonal. We use “⊤” to denote the transpose of a vector or a matrix. For the sake of simplicity, we let $\Phi(X) = F(X) + \mu\|X\|_{2,1}$. Additional notation will be introduced when it occurs.

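We now quickly review the spectral gradient method for the unconstrained smooth minimization problem

$$\min_{x \in \mathbb{R}^{n}} f(x),$$

where $f: \mathbb{R}^{n} \to \mathbb{R}$ is a continuously differentiable function. The spectral gradient method is defined by

$$x_{k+1} = x_k - \lambda_k^{-1} \nabla f(x_k),$$

where one of the choices of $\lambda_k$ (called the spectral coefficient) is given by

$$\lambda_k = \frac{s_{k-1}^\top y_{k-1}}{s_{k-1}^\top s_{k-1}},$$

where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$. Obviously, if $s_{k-1}^\top y_{k-1} > 0$, i.e. $\lambda_k > 0$, the search direction is automatically a descent direction at the current point.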
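As a small illustration of the spectral (Barzilai-Borwein) idea reviewed above, the sketch below computes one common choice of the coefficient, the ratio of s⊤y to s⊤s, and takes the step x − ∇f(x)/λ. This particular convention, the helper names `bb_coefficient` and `spectral_step`, and the quadratic test function are assumptions made for illustration.

```python
import numpy as np

def bb_coefficient(x_prev, x_curr, g_prev, g_curr):
    """Spectral coefficient lambda = <s, y> / <s, s> (one common BB choice)."""
    s = x_curr - x_prev          # change in the iterate
    y = g_curr - g_prev          # change in the gradient
    return float(s @ y) / float(s @ s)

def spectral_step(x, g, lam):
    """Gradient step scaled by the inverse spectral coefficient."""
    return x - g / lam

# Example on f(x) = 0.5 * x^T H x, whose gradient is H x.
H = np.diag([1.0, 4.0])
x0 = np.array([1.0, 1.0])
x1 = np.array([0.5, 0.5])
lam = bb_coefficient(x0, x1, H @ x0, H @ x1)
print(lam)                            # 2.5, between the eigenvalues 1 and 4 of H
print(spectral_step(x1, H @ x1, lam))
```

For a convex quadratic the coefficient is a Rayleigh quotient of the Hessian, which is why it acts as a cheap scalar approximation of second-order information.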

2.2 Algorithm

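Now, we turn our attention to the original Model (2). Since the ℓ2,1-norm is nondifferentiable, we approximate the objective function by the following quadratic function $Q_k$:

$$Q_k(D) = F(X_k) + \langle \nabla F(X_k), D\rangle + \frac{\Lambda_k}{2}\|D\|_F^2 + \mu\|X_k + D\|_{2,1}, \qquad (3)$$

where $\nabla F(X_k)$ is the gradient of F at $X_k$, and $\Lambda_k$ is the so-called spectral coefficient, which is defined by

$$\Lambda_k = \frac{\langle S_{k-1}, Y_{k-1}\rangle}{\langle S_{k-1}, S_{k-1}\rangle}, \qquad (4)$$

where $S_{k-1} = X_k - X_{k-1}$ and $Y_{k-1} = \nabla F(X_k) - \nabla F(X_{k-1})$. Minimizing Eq (3) yields

$$\min_{D} \ \frac{\Lambda_k}{2}\left\|D + \Lambda_k^{-1}\nabla F(X_k)\right\|_F^2 + \mu\|X_k + D\|_{2,1}.$$

Denote $M_k = X_k + D$ and $G_k = X_k - \Lambda_k^{-1}\nabla F(X_k)$. One can get

$$\min_{M_k} \ \frac{\Lambda_k}{2}\|M_k - G_k\|_F^2 + \mu\|M_k\|_{2,1}. \qquad (5)$$

The favorable structure of Eq (5) allows the i-th row of the matrix $M_k$ to be written explicitly as

$$(M_k)_{i,:} = \max\left\{\|(G_k)_{i,:}\|_2 - \frac{\mu}{\Lambda_k},\ 0\right\}\frac{(G_k)_{i,:}}{\|(G_k)_{i,:}\|_2},$$

where the convention 0 ⋅ 0/0 = 0 is followed. Hence, the search direction at the current point can be expressed as

$$D_k = M_k - X_k. \qquad (6)$$

Obviously, Eq (6) reduces to $D_k = -\Lambda_k^{-1}\nabla F(X_k)$ in the case of μ = 0, which means that Eq (6) covers the traditional spectral gradient direction as a special case.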
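The closed-form direction can be sketched in a few lines of NumPy. The sketch assumes the quadratic model weights the squared Frobenius norm of D by Λ/2, so that minimizing it amounts to row-wise soft-thresholding of X − ∇F(X)/Λ with threshold μ/Λ. The function name `search_direction` and this scaling convention are assumptions of the sketch.

```python
import numpy as np

def search_direction(X, grad, lam, mu):
    """Direction D = M - X, where M is the row-wise shrinkage of
    G = X - grad / lam with threshold mu / lam (assumed convention)."""
    G = X - grad / lam
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    # Avoid 0/0: rows of G that are exactly zero stay zero.
    safe = np.where(row_norms > 0, row_norms, 1.0)
    scale = np.where(row_norms > 0,
                     np.maximum(1.0 - (mu / lam) / safe, 0.0),
                     0.0)
    M = scale * G
    return M - X

X = np.array([[3.0, 4.0],
              [0.1, 0.0]])
zero_grad = np.zeros_like(X)
# With mu = 1 the second row (norm 0.1) is shrunk to zero entirely.
print(search_direction(X, zero_grad, lam=1.0, mu=1.0))
```

With μ = 0 the shrinkage factor is 1 for every row, and the function returns −∇F(X)/Λ, the plain spectral gradient direction.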

The following lemma verifies that Dk is a descent direction when the optimal solution is not achieved.

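Lemma 1. Suppose that $\Lambda_k > 0$ and $D_k$ is determined by Eq (6). Then, for any θ ∈ (0, 1],

$$\Phi(X_k + \theta D_k) - \Phi(X_k) \le \theta\left[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\right] + o(\theta) \qquad (7)$$

and

$$\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1} \le -\frac{\Lambda_k}{2}\|D_k\|_F^2. \qquad (8)$$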

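Proof. By the differentiability of F and the convexity of $\|X\|_{2,1}$, we have that for any θ ∈ (0, 1],

$$\Phi(X_k + \theta D_k) - \Phi(X_k) = F(X_k + \theta D_k) - F(X_k) + \mu\|\theta(X_k + D_k) + (1-\theta)X_k\|_{2,1} - \mu\|X_k\|_{2,1} \le \theta\langle \nabla F(X_k), D_k\rangle + o(\theta) + \theta\mu\|X_k + D_k\|_{2,1} - \theta\mu\|X_k\|_{2,1},$$

which is exactly Eq (7). Noting that $D_k$ is the minimizer of Eq (3) and θ ∈ (0, 1], by Eq (3) and the convexity of $\|X\|_{2,1}$, one can get

$$Q_k(D_k) \le Q_k(\theta D_k) \le F(X_k) + \theta\langle \nabla F(X_k), D_k\rangle + \frac{\theta^2\Lambda_k}{2}\|D_k\|_F^2 + \theta\mu\|X_k + D_k\|_{2,1} + (1-\theta)\mu\|X_k\|_{2,1}.$$

Hence,

$$(1-\theta)\left[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\right] \le -(1-\theta^2)\frac{\Lambda_k}{2}\|D_k\|_F^2,$$

i.e.,

$$\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1} \le -(1+\theta)\frac{\Lambda_k}{2}\|D_k\|_F^2.$$

Recalling θ ∈ (0, 1], the above inequality indicates that Eq (8) is correct.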

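To improve the algorithm’s performance, we use the classical nonmonotone line search [28] to find a suitable stepsize along the direction. It is well known that this technique allows the function values to increase occasionally in some iterations while decreasing over the whole iterative process. Letting δ ∈ (0, 1), ρ ∈ (0, 1) and $\tilde{m}$ be a given positive integer, we choose the smallest nonnegative integer $j_k$ such that the stepsize $\alpha_k = \tilde{\alpha}\rho^{j_k}$ satisfies

$$\Phi(X_k + \alpha_k D_k) \le \max_{0 \le j \le m(k)} \Phi(X_{k-j}) + \delta\alpha_k\Delta_k, \qquad (9)$$

where $m(k) = \min\{m(k-1) + 1, \tilde{m}\}$ (m(0) = 0) and

$$\Delta_k = \langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}. \qquad (10)$$

From Eq (8), it is clear that $\Delta_k < 0$ whenever $D_k \ne 0$, which shows that Eq (9) is well-defined.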
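The nonmonotone backtracking rule can be sketched as follows. The acceptance test compares the trial value with the largest of the most recent objective values rather than with the current one. The function name `nonmonotone_step`, the argument `history` (the recent objective values), the default constants, and the cap on the number of backtracks are assumptions of this sketch rather than settings from the paper.

```python
import numpy as np

def nonmonotone_step(phi, X, D, delta_k, history, alpha0=1.0, rho=0.5,
                     delta=1e-4, max_backtracks=50):
    """Backtrack from alpha0 until
    phi(X + a * D) <= max(history) + delta * a * delta_k,
    where delta_k < 0 is the predicted decrease along D."""
    reference = max(history)      # largest recent objective value
    alpha = alpha0
    for _ in range(max_backtracks):
        if phi(X + alpha * D) <= reference + delta * alpha * delta_k:
            return alpha
        alpha *= rho              # shrink the trial stepsize
    return alpha

phi = lambda Z: float(np.sum(Z ** 2))
X = np.array([1.0])
D = np.array([-4.0])              # a descent direction that overshoots
delta_k = -8.0                    # <grad phi(X), D> = 2 * 1 * (-4)

# Monotone reference (only the current value): two backtracks are needed.
print(nonmonotone_step(phi, X, D, delta_k, history=[phi(X)]))   # 0.25
# A larger value in the recent history lets the full step through.
print(nonmonotone_step(phi, X, D, delta_k, history=[10.0, 1.0]))  # 1.0
```

The second call shows the intended effect: an earlier, larger objective value relaxes the test, so a full step that temporarily increases the objective can still be accepted.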

In summary, the full steps of the Nonmonotone Spectral Gradient algorithm for L2,1-norm minimization (abbr. NSGL21) can be described as follows:

Algorithm 1 (NSGL21)

Step 0. Choose an initial point $X_0$, constants μ > 0, $\tilde{\alpha} > 0$, ρ ∈ (0, 1), δ ∈ (0, 1) and a positive integer $\tilde{m}$. Set k := 0.

Step 1. Stop if ‖Dk‖F = 0. Otherwise, continue.

Step 2. Compute Dk via Eq (6).

Step 3. Compute αk via Eq (9).

Step 4. Let Xk+1 := Xk + αk Dk.

Step 5. Let k: = k+1. Go to Step 1.
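The steps above can be put together in a compact NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the quadratic model is assumed to weight the squared Frobenius norm of D by Λ/2, the first spectral coefficient is set to 1, and the names `nsgl21`, `direction`, `memory`, `alpha0` and all default constants are hypothetical. The caller supplies the smooth term `F` and its gradient `grad_F`.

```python
import numpy as np

def norm21(X):
    return float(np.sum(np.linalg.norm(X, axis=1)))

def direction(X, grad, lam, mu):
    # Row-wise shrinkage of G = X - grad / lam with threshold mu / lam.
    G = X - grad / lam
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    safe = np.where(norms > 0, norms, 1.0)
    scale = np.where(norms > 0, np.maximum(1.0 - (mu / lam) / safe, 0.0), 0.0)
    return scale * G - X

def nsgl21(F, grad_F, X0, mu, alpha0=1.0, rho=0.5, delta=1e-4, memory=5,
           lam_min=1e-20, lam_max=1e20, tol=1e-8, max_iter=500):
    X = X0.copy()
    g = grad_F(X)
    lam = 1.0                                   # assumed initial coefficient
    phi = lambda Z: F(Z) + mu * norm21(Z)
    history = [phi(X)]
    for _ in range(max_iter):
        D = direction(X, g, lam, mu)
        if np.linalg.norm(D) <= tol:            # numerical version of D = 0
            break
        # Predicted decrease used in the line search test.
        delta_k = float(np.sum(g * D)) + mu * (norm21(X + D) - norm21(X))
        reference = max(history[-(memory + 1):])
        alpha = alpha0
        while (phi(X + alpha * D) > reference + delta * alpha * delta_k
               and alpha > 1e-16):
            alpha *= rho
        X_new = X + alpha * D
        g_new = grad_F(X_new)
        S, Y = X_new - X, g_new - g
        ss = float(np.sum(S * S))
        lam = float(np.sum(S * Y)) / ss if ss > 0 else 1.0
        lam = max(lam_min, min(lam, lam_max))   # keep the coefficient positive
        X, g = X_new, g_new
        history.append(phi(X))
    return X

# Toy check: for F(X) = 0.5 * ||X - B||_F^2 the minimizer is the
# row-wise shrinkage of B with threshold mu.
B = np.array([[3.0, 4.0], [0.1, 0.0]])
X_star = nsgl21(lambda Z: 0.5 * np.sum((Z - B) ** 2), lambda Z: Z - B,
                np.zeros_like(B), mu=1.0)
print(X_star)
```

In this sketch the direction is computed before the stopping test, and the test uses a small tolerance instead of exact zero, which is the usual practical reading of the stopping rule.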

As stated in the preceding section, the generated direction is automatically a descent direction whenever Λk > 0. To ensure Λk > 0, we choose a sufficiently small Λ(min) > 0 and a sufficiently large Λ(max) > 0, and force Λk into the interval [Λ(min), Λ(max)] by setting

$$\Lambda_k = \max\left\{\Lambda_{(\min)}, \min\left\{\Lambda_k, \Lambda_{(\max)}\right\}\right\}.$$

This safeguard ensures that the descent property is preserved at every step.
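The safeguard amounts to clipping the spectral coefficient to a positive interval. A minimal sketch, with the hypothetical helper name `safeguard` and default bounds taken from the values reported later in the experiments:

```python
def safeguard(lam, lam_min=1e-20, lam_max=1e20):
    """Clip the spectral coefficient to [lam_min, lam_max]."""
    return max(lam_min, min(lam, lam_max))

print(safeguard(2.5))    # unchanged: 2.5
print(safeguard(-3.0))   # a negative value is replaced by lam_min: 1e-20
print(safeguard(1e30))   # an overly large value is replaced by lam_max: 1e+20
```

A negative or zero coefficient can occur when F is non-convex, which is the case this clipping is meant to handle.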

Remark 1. The proposed algorithm is novel and differs from existing approaches. The well-known approach of [18] reformulated Problem (2) as the following constrained smooth convex optimization problem and then solved it via Nesterov’s method. The method in [19] focused on the least squares Model (1) and used an auxiliary variable to transform the model equivalently as An alternating direction method of multipliers is then used to solve the resulting model, and closed-form solutions are derived for each subproblem. Clearly, our proposed algorithm differs from the above-mentioned approaches in the sense that we solve the original Model (2) directly without any transformation.

3 Convergence analysis

This section is devoted to establishing the global convergence of algorithm NSGL21. For this purpose, we make the following assumption.

Assumption 1. The level set Ω = {X: F(X) ≤ F(X0)} is bounded.

Lemma 2. Suppose that Assumption 1 holds and the sequence {Xk} is generated by Algorithm 1. Then Xk is a stationary point of Problem (2) if and only if Dk = 0.

Proof. In the case of Dk ≠ 0, Lemma 1 shows that Dk is a descent direction, which implies that Xk is not a stationary point of Problem (2). On the other hand, since Dk = 0 is the solution of Eq (5), for any $D \in \mathbb{R}^{n \times t}$ and ξ > 0 we have (11) Combining the fact F(Xk + ξD) − F(Xk) = 〈∇F(Xk), ξD〉 + o(ξ) with Eq (11), it yields which indicates that Xk is a stationary point of Problem (2).

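Lemma 3. Let l(k) be an integer such that

$$k - m(k) \le l(k) \le k \quad \text{and} \quad \Phi(X_{l(k)}) = \max_{0 \le j \le m(k)} \Phi(X_{k-j}).$$

Then the sequence {Φ(Xl(k))} is nonincreasing and the search direction Dl(k) satisfies (12)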

Proof. It is not difficult to see that Φ(Xl(k+1)) ≤ Φ(Xl(k)), which indicates that the maximum value of the objective function is nonincreasing at each iteration. Moreover, by Eq (9), we have that for all , By Assumption 1, the sequence {Φ(Xl(k))} admits a limit as k → ∞. Hence, it follows that (13) On the other hand, by the definition of Δk in Eq (10) and the inequality Eq (8), it is easy to deduce that Combining with Eq (13), one gets which indicates the desired result Eq (12).

Theorem 1. Let the sequence {Xk} and {Dk} be generated by Algorithm 1. Then, there exists a subsequence such that (14)

Proof. Let be a limit point of {Xk}, and be a subsequence of {Xk} converging to . Then by Eq (12) either , or there exists a subsequence () such that (15) In the latter case, we assume that there exists a constant ϵ > 0 such that (16) Since αk is the first value to satisfy Eq (9), it follows from Step 3 in Algorithm 1 that there exists an index such that, for all and , (17) Since F is continuously differentiable, by the mean-value theorem applied to F, there exists a constant θk ∈ (0, 1) such that Combining with Eq (17), we have (18) Since αk → 0 in Eq (15), we have αk < ρ as k → ∞. It is not difficult to show that (19) Subtracting Δk from the left side of Eq (18) and noting the definition of Δk, it is clear that Noting Eq (19), Eq (18) thus shows that (20) Taking the limit as , k → ∞ on both sides of Eq (20) and using the smoothness of F, we obtain which implies ‖Dk‖F → 0 as , k → ∞. This yields a contradiction because Eq (16) indicates that ‖Dk‖F is bounded away from zero.

4 Numerical experiments

In this section, we present numerical results to illustrate the feasibility and efficiency of the algorithm NSGL21. In particular, we also test against the recent solvers IADM_MFL and SLEP for performance comparison. In running SLEP (Sparse Learning with Efficient Projections), we use the code at http://www.public.asu.edu/~jye02/Software/SLEP/index.htm in its Matlab package, and choose mFlag = 1 and lFlag = 1 for using an adaptive line search. All experiments are carried out under Windows 7 and Matlab v7.8 (2009a) running on a Lenovo laptop with an Intel Pentium CPU at 2.5 GHz and 4 GB of memory.

As in [16], in the first test, is generated from a 5-dimensional Gaussian distribution with zero mean and covariance diag{1, 0.64, 0.49, 0.36, 0.25}. Regarding each , we keep adding up to 20 irrelevant dimensions which are exactly zero. The training and test data Aj are Gaussian matrices, and the response data bj are generated by where ω is zero-mean Gaussian noise with standard deviation 1e−2. We start NSGL21 from the zero point and terminate the iterative process when (21) where tol > 0 is a tolerance. The quality of the solution X* is measured by its relative error to , i.e., In this test, we take , μ = 1e−2, t = 200, n = 15, tol = 1e−3, Λ(min) = 10^−20, Λ(max) = 10^20, and mj = 100 for all j = 1, 2, …, t. Moreover, to compare the performance of these algorithms fairly, we run each code from the zero point, use all the default parameter values, and observe their convergence behavior in obtaining solutions of similar accuracy. To illustrate the performance of each algorithm, we plot the relative error against the number of iterations and against the computing time in Figs 1 and 2.
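A synthetic problem of the kind described above could be generated along the following lines. This is a sketch under assumptions: the first 5 rows of the true feature matrix are drawn with the stated variances and the remaining rows are zero, each Aj has standard Gaussian entries, and the response is the noiseless prediction plus Gaussian noise. The names `make_synthetic`, `relative_error`, `X_true` and the seed are hypothetical, and the relative error is assumed to be measured in the Frobenius norm.

```python
import numpy as np

def make_synthetic(t=200, n=15, m=100, noise_std=1e-2, seed=0):
    """Return (A_list, b_list, X_true) for t tasks with n features each."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(np.array([1.0, 0.64, 0.49, 0.36, 0.25]))
    X_true = np.zeros((n, t))
    # Relevant features: 5 rows with the stated variances; other rows stay zero.
    X_true[:5, :] = std[:, None] * rng.standard_normal((5, t))
    A_list = [rng.standard_normal((m, n)) for _ in range(t)]
    b_list = [A @ X_true[:, j] + noise_std * rng.standard_normal(m)
              for j, A in enumerate(A_list)]
    return A_list, b_list, X_true

def relative_error(X, X_true):
    """Frobenius-norm relative error of X with respect to X_true."""
    return float(np.linalg.norm(X - X_true) / np.linalg.norm(X_true))

A_list, b_list, X_true = make_synthetic(t=4, n=15, m=10)
print(X_true.shape, len(A_list), A_list[0].shape, b_list[0].shape)
print(relative_error(np.zeros_like(X_true), X_true))  # 1.0 for the zero start
```

Starting from the zero matrix gives a relative error of exactly 1, which is a convenient reference point when reading convergence plots.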

Fig 1. Comparison results of NSGL21, IADM_MFL, and SLEP.

The x-axis represents the number of iterations and the y-axis represents the relative error.

https://doi.org/10.1371/journal.pone.0166169.g001

Fig 2. Comparison results of NSGL21, IADM_MFL, and SLEP.

The x-axis represents the CPU time in seconds and the y-axis represents the relative error.

https://doi.org/10.1371/journal.pone.0166169.g002

From Figs 1 and 2, we see that IADM_MFL and NSGL21 produced faithful results, whereas SLEP did not. We tried running SLEP with more iterations while preparing the experiments, but it made no further progress. Meanwhile, NSGL21 requires fewer iterations than IADM_MFL to achieve solutions of similar quality. In both plots, the green line lies at the bottom in most cases, which indicates that NSGL21 is superior to the other two solvers.

A single test is not enough to establish that NSGL21 is the best solver. To further illustrate the benefit of NSGL21, we examine its behavior with different dimensions and different numbers of tasks. The results are listed in Table 1, which contains the number of iterations (Iter), the CPU time in seconds (Time), the relative errors (RelErr), and the final function values (Fun).

From Table 1, we observe that each algorithm requires more computing time as the problem dimension and the number of tasks increase. Meanwhile, the number of iterations required by NSGL21 and IADM_MFL increases slightly in the higher-dimensional cases. We also observe that, for all the tested problems, both NSGL21 and IADM_MFL terminate with solutions of similar quality, in the sense of comparable relative errors and final function values. However, SLEP cannot generate acceptable solutions even though more iterations were permitted while preparing the experiments. Hence, we conclude that NSGL21 and IADM_MFL perform better than SLEP. Now, we turn our attention to the comparison between IADM_MFL and NSGL21. To obtain solutions of similar quality, NSGL21 is faster than IADM_MFL and requires at least 50% fewer iterations. It is therefore reasonable to conclude that NSGL21 is the best of the compared solvers.

5 Conclusions

In this paper, we have proposed, analyzed, and tested a nonmonotone spectral gradient algorithm for solving ℓ2,1-norm regularized minimization problems. This type of problem mainly appears in computer vision, text classification and biomedical informatics. Due to the nonsmoothness of the regularization term, minimizing the problem is challenging. To the best of our knowledge, SLEP and IADM_MFL are the only available solvers for this problem. However, both solvers transform the problem equivalently into an equality-constrained minimization problem and then minimize it alternately. It is well known that the spectral gradient algorithm is very effective for smooth minimization problems. Hence, its performance in solving ℓ2,1-norm regularized problems is worth investigating, and this is the main motivation of our paper. At each iteration, the method proposed in this paper minimizes an approximate quadratic model of the objective function to produce a search direction. We showed that the generated direction is automatically a descent direction and that the algorithm converges globally under some mild conditions. Additionally, the numerical experiments illustrate that the proposed algorithm is competitive with, or even performs better than, SLEP and IADM_MFL. This is the numerical contribution of our paper. As noted, the ℓ2,1-norm regularized minimization problem arises in multi-task learning for capturing the joint features among tasks. However, we did not test the algorithm on real data; this is left for further investigation. Finally, we expect that the proposed method and its extensions could find further applications in relevant areas of machine learning.

Acknowledgments

The research of Y. Xiao was supported by the Major State Basic Research Development Program of China (973 Program) (Grant No. 2015CB856003), the National Natural Science Foundation of China (Grant No. 11471101), and the Program for Science and Technology Innovation Talents in Universities of Henan Province (Grant No. 13HASTIT050).

Author Contributions

  1. Conceptualization: YX.
  2. Data curation: QW.
  3. Formal analysis: YX LL.
  4. Methodology: YX LL.
  5. Project administration: YX.
  6. Software: QW.
  7. Supervision: YX.
  8. Validation: YX.
  9. Writing – original draft: LL.
  10. Writing – review & editing: YX.

References

  1. Bi J, Xiong X, Yu S, Dundar M, Rao B. An improved multi-task learning approach with applications in medical diagnosis. In European Conference on Machine Learning, 2008.
  2. Zhang J, Ghahramani Z, Yang Y. Flexible latent variable models for multi-task learning. Machine Learning, 3 (2008), 221–242.
  3. Zheng Y, Jeon B, Xu D, Wu QMJ, Zhang H. Image segmentation by generalized hierarchical fuzzy C-means algorithm. Journal of Intelligent and Fuzzy Systems, 28 (2015), 961–973.
  4. Obozinski G, Taskar B, Jordan MI. Joint covariate selection for grouped classification. Technical report, Statistics Department, UC Berkeley, 2007.
  5. Obozinski G, Taskar B, Jordan MI. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20 (2010), 231–252.
  6. Fu Z, Wu X, Guan C, Sun X, Ren K. Towards Efficient Multi-keyword Fuzzy Search over Encrypted Outsourced Data with Accuracy Improvement. IEEE Transactions on Information Forensics and Security,
  7. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics, 23 (2007), 2507–2517. pmid:17720704
  8. Fu Z, Ren K, Shu J, Sun X, Huang F. Enabling Personalized Search over Encrypted Outsourced Data with Efficiency Improvement. IEEE Transactions on Parallel and Distributed Systems, 2015.
  9. Gu B, Sheng VS. A Robust Regularization Path Algorithm for v-Support Vector Classification. IEEE Transactions on Neural Networks and Learning Systems, 2016. pmid:26929067
  10. Xia Z, Wang X, Sun X, Wang B. Steganalysis of least significant bit matching using multi-order differences. Security and Communication Networks, 7 (2014), 1283–1291.
  11. Chen B, Shu H, Coatrieux G, Chen G, Sun X, Coatrieux JL. Color image analysis by quaternion-type moments. Journal of Mathematical Imaging and Vision, 51 (2015), 124–144.
  12. Xia Z, Wang X, Zhang L, Qin Z, Sun X, Ren K. A Privacy-preserving and Copy-deterrence Content-based Image Retrieval Scheme in Cloud Computing. IEEE Transactions on Information Forensics and Security, 2016,
  13. Ando RK, Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6 (2005), 1817–1853.
  14. Bakker B, Heskes T. Task clustering and gating for Bayesian multi-task learning. Journal of Machine Learning Research, 4 (2003), 83–99.
  15. Evgeniou T, Micchelli CA, Pontil M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6 (2005), 615–637.
  16. Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning. Machine Learning, 73 (2008), 243–272.
  17. Obozinski G, Taskar B, Jordan MI. Multi-task feature selection. Technical Report, UC Berkeley, 2006.
  18. Liu J, Ji S, Ye J. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Conference on Uncertainty in Artificial Intelligence, 2009.
  19. Xiao Y, Wu SY, He BS. A proximal alternating direction method for ℓ2,1-norm least squares problem in multi-task feature learning. Journal of Industrial and Management Optimization, 8 (2012), 1057–1069.
  20. Deng W, Yin W, Zhang Y. Group sparse optimization by alternating direction method. Technical Report TR11-06, Rice University, 2011. Available at http://www.caam.rice.edu/~zhang/reports/tr1106.pdf.
  21. Hu Y, Wei Z, Yuan G. Inexact accelerated proximal gradient algorithms for matrix ℓ2,1-norm minimization problem in multi-task feature learning. Statistics, Optimization & Information Computing, 2 (2014), 352–367.
  22. Barzilai J, Borwein JM. Two point step size gradient method. IMA Journal of Numerical Analysis, 8 (1988), 141–148.
  23. Birgin EG, Martínez JM, Raydan M. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10 (2000), 1196–1211.
  24. Raydan M. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13 (1993), 321–326.
  25. Raydan M. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7 (1997), 26–33.
  26. Cheng W, Li DH. A derivative-free nonmonotone line search and its application to the spectral residual method. IMA Journal of Numerical Analysis, 29 (2009), 814–825.
  27. Xiao Y, Wu SY, Qi L. Nonmonotone Barzilai-Borwein Gradient Algorithm for ℓ1-Regularized Nonsmooth Minimization in Compressive Sensing. Journal of Scientific Computing, 61 (2014), 17–41.
  28. Grippo L, Lampariello F, Lucidi S. A nonmonotone line search technique for Newton’s method. SIAM Journal on Numerical Analysis, 23 (1986), 707–716.