
## Abstract

The main purpose of this study is to propose, analyze, and test a spectral gradient algorithm for solving a convex minimization problem. The problem considered covers the matrix *ℓ*_{2,1}-norm regularized least squares, which is widely used in multi-task learning to capture the features shared across tasks. To solve the problem, we first minimize a quadratic approximation of the objective function to derive a search direction at the current iteration. We show that this direction is automatically a descent direction and reduces to the original spectral gradient direction when the regularization term is removed. Second, we incorporate a nonmonotone line search along this direction to improve the algorithm's numerical performance. Furthermore, we show that the proposed algorithm converges to a critical point under mild conditions. An attractive feature of the proposed algorithm is that it is easy to implement and requires only the gradient of the smooth function and the value of the objective function at each step. Finally, we run experiments on synthetic data, which verify that the proposed algorithm works well and performs better than the algorithms it is compared against.

**Citation: **Xiao Y, Wang Q, Liu L (2016) Applications of Spectral Gradient Algorithm for Solving Matrix *ℓ*_{2,1}-Norm Minimization Problems in Machine Learning. PLoS ONE 11(11):
e0166169.
https://doi.org/10.1371/journal.pone.0166169

**Editor: **Xiaolei Ma,
Beihang University, CHINA

**Received: **June 19, 2016; **Accepted: **October 23, 2016; **Published: ** November 18, 2016

**Copyright: ** © 2016 Xiao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **All relevant data is contained within the manuscript.

**Funding: **The research is supported by the Major State Basic Research Development Program of China (973 Program) (Grant No. 2015CB856003), the National Natural Science Foundation of China (Grant No. 11471101), and the Program for Science and Technology Innovation Talents in Universities of Henan Province (Grant No. 13HASTIT050).

**Competing interests: ** The authors have declared that no competing interests exist.

## 1 Introduction

The tasks in medical diagnosis [1], text classification [2–5], biomedical informatics [6, 7] and other applications [8–12] are often related to each other. Hence, capturing the information shared across tasks becomes the key learning issue [13–15]. Suppose we are given training data for *t* tasks, $A_j \in \mathbb{R}^{m_j \times n}$ and $b_j \in \mathbb{R}^{m_j}$, where *A*_{j} is the data for the *j*-th task and *b*_{j} is the corresponding response. We let $x_j \in \mathbb{R}^n$ be the sparse feature vector for the *j*-th task, and let $X = [x_1, \ldots, x_t] \in \mathbb{R}^{n \times t}$ be the joint feature matrix to be learned. To select features globally, one encourages several rows of *X* to be zero and solves the following *ℓ*_{2,1}-norm regularized least squares problem [16, 17]
$$\min_{X \in \mathbb{R}^{n \times t}} \; \frac{1}{2}\sum_{j=1}^{t} \|A_j x_j - b_j\|_2^2 + \mu\|X\|_{2,1}, \qquad (1)$$
where *μ* > 0 is a weighting parameter, and ‖*X*‖_{2,1} is defined as the sum of the *ℓ*_{2}-norms of the rows of a matrix. It is well known that the *ℓ*_{2,1}-norm encourages the multiple predictions from different tasks to share similar parameter sparsity patterns.

In the past few years, several algorithms have been proposed, analyzed, and tested to solve the nonsmooth convex minimization Problem (1). The algorithm in [18] transformed Eq (1) into an equivalent smooth convex optimization problem, which was then minimized by Nesterov's gradient method. The method in [16] reformulated Eq (1) as a constrained optimization problem and minimized it alternately. The algorithm in [19] and its variant [20] introduced an auxiliary variable to reformulate the problem as an equivalent constrained minimization, and then minimized the corresponding augmented Lagrangian function alternately. Finally, for an accelerated proximal gradient version of the algorithm in [19], one can refer to [21].

Unlike these research activities, which mainly concern Problem (1), in this paper we focus on the following generalized nonsmooth convex optimization problem
$$\min_{X \in \mathbb{R}^{n \times t}} \; F(X) + \mu\|X\|_{2,1}, \qquad (2)$$
where $F: \mathbb{R}^{n \times t} \to \mathbb{R}$ is continuously differentiable (possibly non-convex) and bounded below. Clearly, Model (2) includes Eq (1) as a special case when *F* is a least-squares function. The spectral gradient method originated with Barzilai and Borwein [22] for solving smooth unconstrained minimization problems, was later developed in [23–26], and was then extended to *ℓ*_{1}-regularized nonsmooth minimization [27]. However, its numerical performance on nonsmooth minimization problems involving the matrix *ℓ*_{2,1}-norm remains unexplored. Therefore, extending the spectral gradient algorithm to solve Problem (2) is significant both in theory and in practice. The first contribution of this study lies in the design of the search direction at each iteration, which is derived by minimizing a quadratic approximation of the objective function while making full use of the special structure of the *ℓ*_{2,1}-norm. We also show that the generated direction is a descent direction provided that the spectral coefficient is positive. The second contribution of the paper is the nonmonotone line search, which improves the algorithm's performance. At each iteration, the algorithm requires only the gradient of the smooth term and the value of the objective function, so it is able to handle high-dimensional problems. Finally, we compare its performance with the solvers IADM_MFL and SLEP, which illustrates that the proposed method is fast, efficient, and competitive.

The paper is organized as follows. In Section 2, we provide some notation and preliminaries, and construct the new algorithm together with its properties. In Section 3, we establish the global convergence of the algorithm. In Section 4, we report numerical results and performance comparisons. Finally, we conclude the paper in Section 5.

## 2 Algorithm

### 2.1 Notations and preliminaries

First, we summarize the notation used in this paper. Matrices are written as uppercase letters and vectors as lowercase letters. For a matrix *X*, its *i*-th row and *j*-th column are denoted by *X*_{i,:} and *X*_{:,j}, respectively. The Frobenius norm and the *ℓ*_{2,1}-norm of a matrix $X \in \mathbb{R}^{n \times t}$ are defined, respectively, as
$$\|X\|_F = \Big(\sum_{i=1}^{n}\sum_{j=1}^{t} X_{ij}^2\Big)^{1/2}, \qquad \|X\|_{2,1} = \sum_{i=1}^{n} \|X_{i,:}\|_2.$$
For any two matrices $X, Y \in \mathbb{R}^{n \times t}$, we define 〈*X*, *Y*〉 = tr(*X*^{⊤} *Y*) (the standard trace inner product in $\mathbb{R}^{n \times t}$), so that $\|X\|_F^2 = \langle X, X \rangle$. If $x \in \mathbb{R}^n$, we denote by "Diag(*x*)" the diagonal matrix possessing the components of the vector *x* on the diagonal. We define "⊤" as the transpose of a vector or a matrix. For the sake of simplicity, we let Φ(*X*) = *F*(*X*) + *μ*‖*X*‖_{2,1}. Additional notation will be introduced as needed.
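As a quick sanity check on these definitions, the following Python snippet (an illustration only, not part of the algorithm) evaluates the *ℓ*_{2,1}-norm, the trace inner product, and the identity ‖*X*‖_{F}² = 〈*X*, *X*〉 on a small matrix:

```python
import numpy as np

def l21_norm(X):
    """||X||_{2,1}: the sum of the l2-norms of the rows of X."""
    return np.linalg.norm(X, axis=1).sum()

def inner(X, Y):
    """Trace inner product <X, Y> = tr(X^T Y)."""
    return np.trace(X.T @ Y)

# Rows of X have l2-norms 5, 0, and 13, so ||X||_{2,1} = 18.
X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(X))                                             # 18.0
print(np.isclose(inner(X, X), np.linalg.norm(X, 'fro') ** 2))  # True
```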

We now quickly review the spectral gradient method for the unconstrained smooth minimization problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function. The spectral gradient method is defined by
$$x_{k+1} = x_k - \lambda_k^{-1}\nabla f(x_k),$$
where one of the choices of *λ*_{k} (named the spectral coefficient) is given by
$$\lambda_k = \frac{s_{k-1}^{\top} y_{k-1}}{s_{k-1}^{\top} s_{k-1}},$$
where *s*_{k−1} = *x*_{k} − *x*_{k−1} and *y*_{k−1} = ∇*f*(*x*_{k}) − ∇*f*(*x*_{k−1}). Obviously, if $s_{k-1}^{\top} y_{k-1} > 0$, i.e. *λ*_{k} > 0, the search direction is a descent direction at the current point.
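As a concrete illustration, here is a minimal Python sketch of this spectral (Barzilai–Borwein) iteration applied to a small strongly convex quadratic. The function names, the initial coefficient value, and the safeguard on the sign of *s*^{⊤}*y* are our own choices, not part of the method description above:

```python
import numpy as np

def bb_gradient(f_grad, x0, iters=100, tol=1e-8):
    """Spectral (Barzilai-Borwein) gradient method, a minimal sketch.

    Iteration: x_{k+1} = x_k - grad f(x_k) / lambda_k, with
    lambda_k = <s, y> / <s, s>, s = x_k - x_{k-1}, y = g_k - g_{k-1}.
    """
    x = x0.astype(float)
    g = f_grad(x)
    lam = 1.0                      # illustrative initial spectral coefficient
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - g / lam
        g_new = f_grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 0:              # keep lambda_k positive (descent direction)
            lam = (s @ y) / (s @ s)
        x, g = x_new, g_new
    return x

# Minimize f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = bb_gradient(lambda x: A @ x - b, np.zeros(2))
```

For this quadratic the minimizer solves *Ax* = *b*, so `x_star` should agree with a direct linear solve.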

### 2.2 Algorithm

Now, we turn our attention to the original Model (2). Since the *ℓ*_{2,1}-norm is nondifferentiable, we approximate the objective function at *X*_{k} by the following quadratic model *Q*_{k}:
$$Q_k(X) = F(X_k) + \langle \nabla F(X_k), X - X_k \rangle + \frac{\Lambda_k}{2}\|X - X_k\|_F^2 + \mu\|X\|_{2,1}, \qquad (3)$$
where ∇*F*(*X*_{k}) is the gradient of *F* at *X*_{k}, and Λ_{k} is the so-called spectral coefficient, which is defined by
$$\Lambda_k = \frac{\langle S_{k-1}, Y_{k-1} \rangle}{\langle S_{k-1}, S_{k-1} \rangle}, \qquad (4)$$
where *S*_{k−1} = *X*_{k} − *X*_{k−1} and *Y*_{k−1} = ∇*F*(*X*_{k}) − ∇*F*(*X*_{k−1}). Minimizing Eq (3) yields the search direction
$$D_k = \arg\min_{D} \; \langle \nabla F(X_k), D \rangle + \frac{\Lambda_k}{2}\|D\|_F^2 + \mu\|X_k + D\|_{2,1}.$$
Denote *M*_{k} = *X*_{k} + *D* and $G_k = X_k - \Lambda_k^{-1}\nabla F(X_k)$. One can get
$$M_k = \arg\min_{M} \; \frac{\Lambda_k}{2}\|M - G_k\|_F^2 + \mu\|M\|_{2,1}. \qquad (5)$$
The favorable structure of Eq (5) allows the *i*-th row of the matrix *M*_{k} to be written explicitly as
$$(M_k)_{i,:} = \max\Big\{\|(G_k)_{i,:}\|_2 - \frac{\mu}{\Lambda_k},\, 0\Big\}\,\frac{(G_k)_{i,:}}{\|(G_k)_{i,:}\|_2},$$
where the convention 0 ⋅ 0/0 = 0 is followed. Hence, the search direction at the current point can be expressed as
$$D_k = M_k - X_k. \qquad (6)$$
Obviously, Eq (6) reduces to $D_k = -\Lambda_k^{-1}\nabla F(X_k)$ in the case *μ* = 0, which means that Eq (6) covers the traditional spectral gradient direction as a special case.
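The row-wise shrinkage in Eq (6) is straightforward to implement. The sketch below is our own illustrative code; the guard against zero row norms realizes the convention 0 ⋅ 0/0 = 0:

```python
import numpy as np

def l21_direction(X, grad, lam, mu):
    """Search direction D from row-wise shrinkage of G = X - grad/lam.

    Each row of M is the corresponding row of G scaled by
    max(1 - (mu/lam)/||row||_2, 0); zero rows are left at zero
    (the 0 * 0/0 = 0 convention). Returns D = M - X.
    """
    G = X - grad / lam
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    # The tiny floor only prevents division by zero; for a zero row the
    # scale factor is clipped to 0 anyway.
    scale = np.maximum(1.0 - (mu / lam) / np.maximum(row_norms, 1e-300), 0.0)
    return scale * G - X

# Small check from X_k = 0: rows of grad with l2-norm below mu/lam are
# zeroed out; with mu = 0 the direction is the plain spectral step -grad/lam.
X0 = np.zeros((3, 2))
g = np.array([[3.0, 4.0], [0.1, 0.0], [0.0, 0.0]])
D = l21_direction(X0, g, lam=1.0, mu=1.0)
```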

The following lemma verifies that *D*_{k} is a descent direction when the optimal solution is not achieved.

**Lemma 1** Suppose that Λ_{k} > 0 and *D*_{k} is determined by Eq (6). Then for any *θ* ∈ (0, 1],
$$\Phi(X_k + \theta D_k) - \Phi(X_k) \le \theta\big[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\big] + o(\theta), \qquad (7)$$
and
$$\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1} \le -\Lambda_k\|D_k\|_F^2. \qquad (8)$$

**Proof.** By the differentiability of *F* and the convexity of ‖⋅‖_{2,1}, we have that for any *θ* ∈ (0, 1],
$$\begin{aligned} \Phi(X_k + \theta D_k) - \Phi(X_k) &= F(X_k + \theta D_k) - F(X_k) + \mu\|X_k + \theta D_k\|_{2,1} - \mu\|X_k\|_{2,1} \\ &\le \theta\langle \nabla F(X_k), D_k\rangle + o(\theta) + \mu\theta\|X_k + D_k\|_{2,1} + \mu(1-\theta)\|X_k\|_{2,1} - \mu\|X_k\|_{2,1} \\ &= \theta\big[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\big] + o(\theta), \end{aligned}$$
which is exactly Eq (7). Noting that *D*_{k} is the minimizer of Eq (3) and *θ* ∈ (0, 1], by Eq (3) and the convexity of ‖⋅‖_{2,1}, one can get
$$\langle \nabla F(X_k), D_k\rangle + \frac{\Lambda_k}{2}\|D_k\|_F^2 + \mu\|X_k + D_k\|_{2,1} \le \theta\langle \nabla F(X_k), D_k\rangle + \frac{\Lambda_k\theta^2}{2}\|D_k\|_F^2 + \mu\theta\|X_k + D_k\|_{2,1} + \mu(1-\theta)\|X_k\|_{2,1}.$$
Hence,
$$(1-\theta)\big[\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}\big] \le -\frac{\Lambda_k}{2}(1-\theta^2)\|D_k\|_F^2,$$
i.e.,
$$\langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1} \le -\frac{\Lambda_k}{2}(1+\theta)\|D_k\|_F^2 \quad \text{for } \theta \in (0, 1).$$
Recalling that *θ* may be taken arbitrarily close to 1, the above inequality indicates that Eq (8) is correct.

To improve the algorithm's performance, we use the classical nonmonotone line search [28] to find a suitable stepsize along the direction. It is well known that this technique allows the function values to increase occasionally in some iterations while still decreasing over the whole iterative process. Letting *δ* ∈ (0, 1), *ρ* ∈ (0, 1) and $\bar{m}$ be a given positive integer, we choose the smallest nonnegative integer *j*_{k} such that the stepsize $\alpha_k = \rho^{j_k}$ satisfies
$$\Phi(X_k + \alpha_k D_k) \le \max_{0 \le j \le m(k)} \Phi(X_{k-j}) + \delta\,\alpha_k\Delta_k, \qquad (9)$$
where $m(k) = \min\{m(k-1) + 1, \bar{m}\}$ (*m*(0) = 0) and
$$\Delta_k = \langle \nabla F(X_k), D_k\rangle + \mu\|X_k + D_k\|_{2,1} - \mu\|X_k\|_{2,1}. \qquad (10)$$
From Eq (8), it is clear that Δ_{k} < 0 whenever *D*_{k} ≠ 0, which shows that Eq (9) is well defined.

In summary, the full steps of the **N**onmonotone **S**pectral **G**radient algorithm for **L**_{2,1}-norm minimization (abbr. NSGL21) can be described as follows:

**Algorithm 1 (NSGL21)**

**Step 0.** Choose an initial point *X*_{0}, constants *μ* > 0, 0 < Λ_{(min)} < Λ_{(max)}, *ρ* ∈ (0, 1), *δ* ∈ (0, 1), and a positive integer $\bar{m}$. Set *k*: = 0.

**Step 1.** Compute *D*_{k} via Eq (6).

**Step 2.** Stop if ‖*D*_{k}‖_{F} = 0. Otherwise, continue.

**Step 3.** Compute *α*_{k} via Eq (9).

**Step 4.** Let *X*_{k+1}: = *X*_{k} + *α*_{k} *D*_{k}.

**Step 5.** Let *k*: = *k*+1. Go to Step 1.

As stated in the preceding section, the generated direction descends automatically whenever Λ_{k} > 0. To ensure Λ_{k} > 0, we choose a sufficiently small Λ_{(min)} > 0 and a sufficiently large Λ_{(max)} > 0, and safeguard Λ_{k} as
$$\Lambda_k \leftarrow \min\big\{\Lambda_{(\max)},\, \max\{\Lambda_{(\min)}, \Lambda_k\}\big\}.$$
This approach ensures that the descent property is guaranteed at every step.
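Putting the pieces together, the following Python sketch mirrors Steps 0–5 with the safeguarded spectral coefficient and the nonmonotone line search. The parameter defaults, the stepsize floor in the backtracking loop, and the shared design matrix in the demo are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def nsgl21(F, gradF, X0, mu, delta=1e-4, rho=0.5, mbar=5,
           lam_min=1e-20, lam_max=1e20, max_iter=500, tol=1e-8):
    """Illustrative sketch of the NSGL21 iteration."""
    l21 = lambda Z: np.linalg.norm(Z, axis=1).sum()
    phi = lambda Z: F(Z) + mu * l21(Z)
    X, G, lam = X0.copy(), gradF(X0), 1.0
    hist = [phi(X)]                            # recent objective values
    for _ in range(max_iter):
        # Search direction: row-wise shrinkage of X - G/lam (Eq (6)).
        T = X - G / lam
        rn = np.maximum(np.linalg.norm(T, axis=1, keepdims=True), 1e-300)
        D = np.maximum(1.0 - (mu / lam) / rn, 0.0) * T - X
        if np.linalg.norm(D) < tol:            # stopping test
            break
        # Delta_k of Eq (10); negative whenever D != 0 and lam > 0.
        Delta = np.sum(G * D) + mu * (l21(X + D) - l21(X))
        # Nonmonotone line search (Eq (9)) over the last mbar values.
        alpha, ref = 1.0, max(hist[-mbar:])
        while alpha > 1e-12 and phi(X + alpha * D) > ref + delta * alpha * Delta:
            alpha *= rho
        X_new = X + alpha * D
        G_new = gradF(X_new)
        # Safeguarded spectral coefficient (Eq (4)).
        S, Y = X_new - X, G_new - G
        sy = np.sum(S * Y)
        lam = np.clip(sy / np.sum(S * S), lam_min, lam_max) if sy > 0 else 1.0
        X, G = X_new, G_new
        hist.append(phi(X))
    return X

# Demo: group-sparse least squares with a shared Gaussian design matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
X_true = np.zeros((10, 4))
X_true[:3] = rng.standard_normal((3, 4))       # only 3 relevant feature rows
B = A @ X_true
F = lambda X: 0.5 * np.linalg.norm(A @ X - B) ** 2
gradF = lambda X: A.T @ (A @ X - B)
X_hat = nsgl21(F, gradF, np.zeros((10, 4)), mu=0.1)
```

On this small noiseless problem the returned matrix should decrease the objective below its value at the zero starting point and land close to the row-sparse ground truth.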

**Remark 1.** The steps of the proposed algorithm are novel and differ from existing approaches. The well-known approach [18] reformulated Problem (2) as a constrained smooth convex optimization problem, which was then solved via Nesterov's method. The method in [19] focused on the least squares Model (1) and used an auxiliary variable to transform the model into an equivalent linearly constrained form. An alternating direction method of multipliers was then used to solve the resulting model, and closed-form solutions were derived for each subproblem. Clearly, our proposed algorithm differs from the above-mentioned approaches in the sense that it solves the original Model (2) directly, without any transformation.

## 3 Convergence analysis

This section is devoted to establishing the global convergence of algorithm NSGL21. For this purpose, we make the following assumption.

**Assumption 1.** The level set *Ω* = {*X*: *F*(*X*) ≤ *F*(*X*_{0})} is bounded.

**Lemma 2.** Suppose that Assumption 1 holds and the sequence {*X*_{k}} is generated by Algorithm 1. Then *X*_{k} is a stationary point of Problem (2) if and only if *D*_{k} = 0.

**Proof.** In the case *D*_{k} ≠ 0, Lemma 1 shows that *D*_{k} is a descent direction, which implies that *X*_{k} is not a stationary point of Problem (2). On the other hand, if *D*_{k} = 0, then *D* = 0 solves the subproblem in Eq (5), so for any $D \in \mathbb{R}^{n \times t}$ and *ξ* > 0 we have
$$\langle \nabla F(X_k), \xi D\rangle + \frac{\Lambda_k}{2}\xi^2\|D\|_F^2 + \mu\|X_k + \xi D\|_{2,1} - \mu\|X_k\|_{2,1} \ge 0. \qquad (11)$$
Combining the fact *F*(*X*_{k} + *ξD*) − *F*(*X*_{k}) = 〈∇*F*(*X*_{k}), *ξD*〉 + *o*(*ξ*) with Eq (11), it yields
$$\Phi(X_k + \xi D) - \Phi(X_k) \ge -\frac{\Lambda_k}{2}\xi^2\|D\|_F^2 + o(\xi).$$
Dividing by *ξ* and letting *ξ* → 0 shows that the directional derivative of Φ at *X*_{k} along any *D* is nonnegative, which indicates that *X*_{k} is a stationary point of Problem (2).

**Lemma 3.** Let *l*(*k*) be an integer such that *k* − *m*(*k*) ≤ *l*(*k*) ≤ *k* and
$$\Phi(X_{l(k)}) = \max_{0 \le j \le m(k)} \Phi(X_{k-j}).$$
Then the sequence {Φ(*X*_{l(k)})} is nonincreasing and the search direction satisfies
$$\lim_{k \to \infty} \alpha_{l(k)-1}\,\|D_{l(k)-1}\|_F^2 = 0. \qquad (12)$$

**Proof.** It is not difficult to see that Φ(*X*_{l(k+1)}) ≤ Φ(*X*_{l(k)}), which indicates that the maximum objective value over the memory window is nonincreasing. Moreover, by Eq (9), we have that for all $k > \bar{m}$,
$$\Phi(X_{l(k)}) \le \max_{0 \le j \le m(l(k)-1)} \Phi(X_{l(k)-1-j}) + \delta\,\alpha_{l(k)-1}\Delta_{l(k)-1} = \Phi(X_{l(l(k)-1)}) + \delta\,\alpha_{l(k)-1}\Delta_{l(k)-1}.$$
By Assumption 1, the sequence {Φ(*X*_{l(k)})} admits a limit as *k* → ∞. Hence, it follows that
$$\lim_{k \to \infty} \alpha_{l(k)-1}\Delta_{l(k)-1} = 0. \qquad (13)$$
On the other hand, by the definition of Δ_{k} in Eq (10) and the inequality Eq (8), it is easy to deduce that
$$\alpha_{l(k)-1}\Delta_{l(k)-1} \le -\Lambda_{l(k)-1}\,\alpha_{l(k)-1}\|D_{l(k)-1}\|_F^2 \le -\Lambda_{(\min)}\,\alpha_{l(k)-1}\|D_{l(k)-1}\|_F^2 \le 0.$$
Combining with Eq (13), one gets
$$\lim_{k \to \infty} \alpha_{l(k)-1}\|D_{l(k)-1}\|_F^2 = 0,$$
which indicates the desired result Eq (12).

**Theorem 1.** Let the sequences {*X*_{k}} and {*D*_{k}} be generated by Algorithm 1. Then there exists a subsequence indexed by $\mathcal{K}$ such that
$$\lim_{k \in \mathcal{K},\, k \to \infty} \|D_k\|_F = 0. \qquad (14)$$

**Proof.** Let $\bar{X}$ be a limit point of {*X*_{k}}, and let $\{X_k\}_{k \in \mathcal{K}}$ be a subsequence of {*X*_{k}} converging to $\bar{X}$. Then by Eq (12), either $\lim_{k \in \mathcal{K}} \|D_k\|_F = 0$, in which case Eq (14) holds, or there exists a subsequence (still denoted by $\mathcal{K}$) such that
$$\lim_{k \in \mathcal{K},\, k \to \infty} \alpha_k = 0. \qquad (15)$$
In this case, we suppose toward a contradiction that there exists a constant *ϵ* > 0 such that
$$\|D_k\|_F \ge \epsilon \quad \text{for all } k \in \mathcal{K}. \qquad (16)$$
Since *α*_{k} is the smallest stepsize satisfying Eq (9), it follows from Step 3 in Algorithm 1 that there exists an index $\bar{k}$ such that, for all *k* ≥ $\bar{k}$ in the subsequence,
(17)
Since *F* is continuously differentiable, by the mean-value theorem applied to *F* there exists a constant *θ*_{k} ∈ (0, 1) such that
Combining with Eq (17), we have
(18)
Since *α*_{k} → 0 along the subsequence by Eq (15), we have *α*_{k} < *ρ* for all sufficiently large *k*. It is not difficult to show that
(19)
Subtracting Δ_{k} from the left-hand side of Eq (18) and recalling the definition of Δ_{k}, it is clear that
Noting Eq (19), Eq (18) thus shows that
(20)
Taking the limit along the subsequence as *k* → ∞ on both sides of Eq (20) and using the smoothness of *F*, we obtain
which implies ‖*D*_{k}‖_{F} → 0 along the subsequence. This yields a contradiction, because Eq (16) indicates that ‖*D*_{k}‖_{F} is bounded away from zero.

## 4 Numerical experiments

In this section, we present numerical results to illustrate the feasibility and efficiency of the algorithm NSGL21. In particular, we also test against the recent solvers IADM_MFL and SLEP for performance comparison. In running SLEP (Sparse Learning with Efficient Projections), we use the code at http://www.public.asu.edu/~jye02/Software/SLEP/index.htm in its Matlab package, and choose mFlag = 1 and lFlag = 1 for using an adaptive line search. All experiments are carried out under Windows 7 and Matlab v7.8 (2009a) running on a Lenovo laptop with an Intel Pentium CPU at 2.5 GHz and 4 GB of memory.

As in [16], in the first test, each true feature vector $\bar{x}_j$ is generated from a 5-dimensional Gaussian distribution with zero mean and covariance diag{1, 0.64, 0.49, 0.36, 0.25}. For each $\bar{x}_j$, we then append up to 20 irrelevant dimensions which are exactly zero. The training and test data *A*_{j} are Gaussian matrices, and their response data *b*_{j} are generated by
$$b_j = A_j \bar{x}_j + \omega,$$
where *ω* is zero-mean Gaussian noise with standard deviation 1.*e* − 2. We start NSGL21 from the zero point and terminate the iterative process when
(21)
where *tol* > 0 is a tolerance. The quality of the solution *X** is measured by its relative error to the true joint feature matrix $\bar{X}$, i.e.,
$$RelErr = \frac{\|X^* - \bar{X}\|_F}{\|\bar{X}\|_F}.$$
In this test, we take *μ* = 1*e* − 2, *t* = 200, *n* = 15, *tol* = 1*e* − 3, Λ_{(min)} = 10^{−20}, Λ_{(max)} = 10^{20}, and *m*_{j} = 100 for all *j* = 1, 2, …, *t*. Moreover, to compare the performance of these algorithms fairly, we run each code from the zero point, use all the default parameter values, and observe their convergence behavior in obtaining solutions of similar accuracy. To illustrate the performance of each algorithm, we plot their convergence behavior with respect to the relative error and the computing time in Figs 1 and 2.
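For readers who wish to reproduce this setting, a possible generator of such synthetic data is sketched below. This is our own illustrative code; the description above leaves some details open (for instance how the appended zero dimensions relate to *n* = 15), so the shapes and the fixed 5 relevant dimensions should be treated as assumptions:

```python
import numpy as np

def make_synthetic(t=200, m=100, n=15, noise=1e-2, seed=0):
    """Synthetic multi-task data in the spirit of the experiment in [16]."""
    rng = np.random.default_rng(seed)
    # Relevant 5 dimensions: zero-mean Gaussian with the stated covariance.
    cov = np.diag([1.0, 0.64, 0.49, 0.36, 0.25])
    Xbar = np.zeros((n, t))              # zero rows = irrelevant features
    Xbar[:5, :] = rng.multivariate_normal(np.zeros(5), cov, size=t).T
    # Task-specific Gaussian designs and noisy responses b_j = A_j xbar_j + w.
    A = [rng.standard_normal((m, n)) for _ in range(t)]
    b = [A[j] @ Xbar[:, j] + noise * rng.standard_normal(m) for j in range(t)]
    return A, b, Xbar

A, b, Xbar = make_synthetic(t=8, m=20)   # small instance for inspection
```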

**Fig 1.** Relative error versus the number of iterations: the x-axis represents the number of iterations and the y-axis the relative error.

**Fig 2.** Relative error versus computing time: the x-axis represents the CPU time in seconds and the y-axis the relative error.

Observing Figs 1 and 2, we see that IADM_MFL and NSGL21 produced faithful results, whereas SLEP did not. We tried running SLEP with more iterations while preparing our experiments, but it could not make further progress. Meanwhile, NSGL21 requires fewer iterations than IADM_MFL to achieve solutions of similar quality. In both plots, the green line lies at the bottom in most cases, which indicates that NSGL21 is superior to the other two solvers.

This simple test is not enough to verify that NSGL21 is the winner. To further illustrate its benefits, we examine the behavior of NSGL21 with different dimensions and different numbers of tasks. The results are listed in Table 1, which contains the number of iterations (Iter), the CPU time in seconds (Time), the relative error (RelErr), and the final function value (Fun).

From Table 1, we clearly observe that each algorithm requires more computing time as the problem dimension and the number of tasks increase. Meanwhile, the number of iterations required by NSGL21 and IADM_MFL increases only slightly in the higher-dimensional cases. We also observe that, for all the tested problems, both NSGL21 and IADM_MFL terminate normally and produce solutions of similar quality in the sense of comparable relative errors and final function values. However, SLEP cannot generate acceptable solutions even when more iterations are permitted. Hence, we conclude that NSGL21 and IADM_MFL perform better than SLEP. Turning to the comparison between IADM_MFL and NSGL21, we note that, for solutions of similar quality, NSGL21 is faster than IADM_MFL and saves at least 50% of the iterations. It is reasonable to conclude that NSGL21 is the winner among the compared solvers.

## 5 Conclusions

In this paper, we have proposed, analyzed, and tested a nonmonotone spectral gradient algorithm for solving the *ℓ*_{2,1}-norm regularized minimization problem. Problems of this type mainly appear in computer vision, text classification, and biomedical informatics. Due to the nonsmoothness of the regularization term, minimizing the problem is challenging. To the best of our knowledge, SLEP and IADM_MFL are the only available solvers for this problem, and both transform it equivalently into an equality-constrained minimization problem that is then minimized alternately. The spectral gradient algorithm is known to be very effective for smooth minimization, so its performance on *ℓ*_{2,1}-norm regularized problems is worth investigating; this is the main motivation of our paper. At each iteration, the method proposed in this paper minimizes an approximate quadratic model of the objective function to produce a search direction. We showed that the generated direction descends automatically and that the algorithm converges globally under some mild conditions. Additionally, the numerical experiments illustrate that the proposed algorithm is competitive with, and even outperforms, SLEP and IADM_MFL; this is the numerical contribution of our paper. The *ℓ*_{2,1}-norm regularized minimization problem arises in part from multi-task learning, for capturing the features shared across tasks. However, we did not test the algorithm's performance on real data; this remains a task for further investigation. Finally, we expect that the proposed method and its extensions will find further applications in relevant areas of machine learning.

## Acknowledgments

The research of Y. Xiao was supported by the Major State Basic Research Development Program of China (973 Program) (Grant No. 2015CB856003), the National Natural Science Foundation of China (Grant No. 11471101), and the Program for Science and Technology Innovation Talents in Universities of Henan Province (Grant No. 13HASTIT050).

## Author Contributions

**Conceptualization:** YX. **Data curation:** QW. **Formal analysis:** YX LL. **Methodology:** YX LL. **Project administration:** YX. **Software:** QW. **Supervision:** YX. **Validation:** YX. **Writing – original draft:** LL. **Writing – review & editing:** YX.

## References

- 1. Bi J, Xiong X, Yu S, Dundar M, Rao B. An improved multi-task learning approach with applications in medical diagnosis. In European Conference on Machine Learning, 2008.
- 2. Zhang J, Ghahramani Z, Yang Y. Flexible latent variable models for multi-task learning. Machine Learning, 73 (2008), 221–242.
- 3. Zheng Y, Jeon B, Xu D, Wu QMJ, Zhang H. Image segmentation by generalized hierarchical fuzzy C-means algorithm. Journal of Intelligent and Fuzzy Systems, 28 (2015), 961–973.
- 4. Obozinski G, Taskar B, Jordan MI. Joint covariate selection for grouped classification. Technical report, Statistics Department, UC Berkeley, 2007.
- 5. Obozinski G, Taskar B, Jordan MI. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20 (2010), 231–252.
- 6. Fu Z, Wu X, Guan C, Sun X, Ren K. Towards Efficient Multi-keyword Fuzzy Search over Encrypted Outsourced Data with Accuracy Improvement. IEEE Transactions on Information Forensics and Security.
- 7. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics, 23 (2007), 2507–2517. pmid:17720704
- 8. Fu Z, Ren K, Shu J, Sun X, Huang F. Enabling Personalized Search over Encrypted Outsourced Data with Efficiency Improvement. IEEE Transactions on Parallel and Distributed Systems, 2015.
- 9. Gu B, Sheng VS. A Robust Regularization Path Algorithm for ν-Support Vector Classification. IEEE Transactions on Neural Networks and Learning Systems, 2016. pmid:26929067
- 10. Xia Z, Wang X, Sun X, Wang B. Steganalysis of least significant bit matching using multi-order differences. Security and Communication Networks, 7 (2014), 1283–1291.
- 11. Chen B, Shu H, Coatrieux G, Chen G, Sun X, Coatrieux JL. Color image analysis by quaternion-type moments. Journal of Mathematical Imaging and Vision, 51 (2015), 124–144.
- 12. Xia Z, Wang X, Zhang L, Qin Z, Sun X, Ren K. A Privacy-preserving and Copy-deterrence Content-based Image Retrieval Scheme in Cloud Computing. IEEE Transactions on Information Forensics and Security, 2016.
- 13. Ando RK, Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6 (2005), 1817–1853.
- 14. Bakker B, Heskes T. Task clustering and gating for Bayesian multi-task learning. Journal of Machine Learning Research, 4 (2003), 83–99.
- 15. Evgeniou T, Micchelli CA, Pontil M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6 (2005), 615–637.
- 16. Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning. Machine Learning, 73 (2008), 243–272.
- 17. Obozinski G, Taskar B, Jordan MI. Multi-task feature selection. Technical report, UC Berkeley, 2006.
- 18. Liu J, Ji S, Ye J. Multi-task feature learning via efficient *ℓ*_{2,1}-norm minimization. In Conference on Uncertainty in Artificial Intelligence, 2009.
- 19. Xiao Y, Wu SY, He BS. A proximal alternating direction method for *ℓ*_{2,1}-norm least squares problem in multi-task feature learning. Journal of Industrial and Management Optimization, 8 (2012), 1057–1069.
- 20. Deng W, Yin W, Zhang Y. Group sparse optimization by alternating direction method. Technical Report TR11-06, Rice University, 2011. Available at http://www.caam.rice.edu/~zhang/reports/tr1106.pdf.
- 21. Hu Y, Wei Z, Yuan G. Inexact accelerated proximal gradient algorithms for matrix *ℓ*_{2,1}-norm minimization problem in multi-task feature learning. Statistics, Optimization & Information Computing, 2 (2014), 352–367.
- 22. Barzilai J, Borwein JM. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8 (1988), 141–148.
- 23. Birgin EG, Martínez JM, Raydan M. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10 (2000), 1196–1211.
- 24. Raydan M. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13 (1993), 321–326.
- 25. Raydan M. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7 (1997), 26–33.
- 26. Cheng W, Li DH. A derivative-free nonmonotone line search and its application to the spectral residual method. IMA Journal of Numerical Analysis, 29 (2009), 814–825.
- 27. Xiao Y, Wu SY, Qi L. Nonmonotone Barzilai–Borwein gradient algorithm for *ℓ*_{1}-regularized nonsmooth minimization in compressive sensing. Journal of Scientific Computing, 61 (2014), 17–41.
- 28. Grippo L, Lampariello F, Lucidi S. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23 (1986), 707–716.