The Modified HZ Conjugate Gradient Algorithm for Large-Scale Nonsmooth Optimization

In this paper, the Hager and Zhang (HZ) conjugate gradient (CG) method and a modified HZ (MHZ) CG method are presented for large-scale nonsmooth convex minimization. Under some mild conditions, convergence results for the proposed methods are established. Numerical results on several test problems (with dimensions of up to 100,000 variables) show that the presented methods are efficient for large-scale nonsmooth problems.


Introduction
Consider the following optimization problem:

    min f(x)  subject to  x ∈ F.   (1)

If the constrained set satisfies F = ℝⁿ, then Eq (1) is called unconstrained optimization; if F = {x | l ≤ x ≤ u, x ∈ ℝⁿ}, where n is the number of variables and the vectors l and u represent the lower and upper bounds on the variables, then Eq (1) is called box-constrained optimization; and if F = {x | hᵢ(x) = 0, gⱼ(x) ≤ 0, x ∈ ℝⁿ, i = 1, ⋯, r, j = 1, ⋯, k}, where r and k are positive integers, then Eq (1) is called generally constrained optimization. If the objective function f: ℝⁿ → ℝ is continuously differentiable, then Eq (1) is called smooth optimization; if f is a nondifferentiable function, then Eq (1) is called nonsmooth optimization. For a given objective function f and constrained set F, Eq (1) will be called the corresponding optimization problem.

The HZ CG formula [29, 30] and a modification thereof
For convenience, we rewrite Eq (1) as the following special case:

    min f(x),  x ∈ ℝⁿ,   (2)

where f: ℝⁿ → ℝ is continuously differentiable; Eq (2) is called an unconstrained optimization problem. CG methods are a class of effective line search methods for solving Eq (2), especially when the dimension n is large. The iterative formula for CG methods is defined as

    x_{k+1} = x_k + α_k d_k,   (3)

where x_k is the current iterate, α_k > 0 is the step length, and d_k is the search direction; the latter is determined by

    d_{k+1} = −g_{k+1} + β_k d_k,  d_0 = −g_0,   (4)

where g_{k+1} = ∇f(x_{k+1}) is the gradient of f(x) at the point x_{k+1} and β_k ∈ ℝ is a scalar. The parameter β_k is chosen such that, when the method is applied to minimize a strongly convex quadratic function, the directions d_k and d_{k−1} are conjugate with respect to the Hessian of the quadratic function.
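As an illustration of the recursion in Eqs (3) and (4), the following sketch applies it to a strongly convex quadratic, for which the exact step length α_k is available in closed form; the Fletcher–Reeves choice of β_k is used here purely as a placeholder (the HZ and modified HZ choices of β_k are the subject of this section):

```python
import numpy as np

def cg_quadratic(A, b, x0):
    """Minimize f(x) = 0.5*x^T A x - b^T x (A symmetric positive definite)
    with the CG recursion of Eqs (3)-(4):
        x_{k+1} = x_k + alpha_k * d_k,
        d_{k+1} = -g_{k+1} + beta_k * d_k."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                          # gradient of the quadratic
    d = -g                                 # initial direction d_0 = -g_0
    for _ in range(len(b)):                # finite termination in at most n steps
        alpha = -(g @ d) / (d @ (A @ d))   # exact line search along d_k
        x = x + alpha * d                  # Eq (3)
        g_new = A @ x - b                  # g_{k+1}
        if np.linalg.norm(g_new) < 1e-12:
            break
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves beta (placeholder)
        d = -g_new + beta * d              # Eq (4)
        g = g_new
    return x
```

For a strongly convex quadratic with exact line search, the iterates reach the minimizer A⁻¹b in at most n steps, which is the conjugacy property mentioned above.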
The HZ CG parameter [29] is defined as

    β_k^N = (1/(d_k^T y_k)) (y_k − 2 d_k ‖y_k‖²/(d_k^T y_k))^T g_{k+1},   (5)

with y_k = g_{k+1} − g_k; we call Eq (5) the HZ formula. If d_k^T y_k ≠ 0, then the direction d_{k+1} generated by Eq (4) with β_k = β_k^N satisfies the sufficient descent condition

    g_{k+1}^T d_{k+1} ≤ −(7/8) ‖g_{k+1}‖².   (6)

This method exhibits global convergence for strongly convex functions f. To obtain a similar result for a general nonlinear function, Hager and Zhang [30] proposed the truncated formula

    β̄_k^N = max{β_k^N, η_k},  where η_k = −1/(‖d_k‖ min{η, ‖g_k‖})

and η > 0 is a constant; in their experiments, they set η = 0.01. This new parameter β̄_k^N also has the property expressed in Eq (6).
Based on the formula for β_k^N, if we let T_k = d_k^T y_k, we can obtain the more general formula

    β_k^{GN} = (1/T_k) (y_k − 2 d_k ‖y_k‖²/T_k)^T g_{k+1}.

In this paper, we set

    T_k = max{c ‖d_k‖ ‖y_k‖, d_k^T y_k},   (7)

where c ∈ (0, 1) is a scalar. It is easy to deduce that whenever d_k^T y_k ≥ c ‖d_k‖ ‖y_k‖, the formula with T_k given by Eq (7) is identical to the HZ formula. The modified formula has the following features: (i) The new formula can overcome the shortcomings of the CG parameter β_k. If y_k^T g_{k+1} ≥ 0, it is not difficult to find that

    β_k^{GN} ≥ 0.   (8)

(ii) Another property of this formula is that it can ensure that the new direction d_{k+1} in Eq (4) with β_k = β_k^{GN} belongs to a trust region without the need for any line search technique. Because T_k ≥ c ‖d_k‖ ‖y_k‖ by Eq (7), we have |β_k^{GN}| ≤ ‖g_{k+1}‖/(c ‖d_k‖) + 2 ‖g_{k+1}‖/(c² ‖d_k‖), and hence

    ‖d_{k+1}‖ ≤ ‖g_{k+1}‖ + |β_k^{GN}| ‖d_k‖ ≤ (1 + 1/c + 2/c²) ‖g_{k+1}‖.   (9)

By combining this result with Step 1 of Algorithm 2.2, we find that the generated directions always remain in a trust region. (iii) If T_k ≠ 0 for all k, then the directions generated by Eq (4) with β_k = β_k^{GN} possess the following sufficient descent property:

    g_k^T d_k ≤ −(7/8) ‖g_k‖²,   (10)

which holds for all k. Now let us analyze this result. If k = 0, we have d_0 = −g_0 and d_0^T g_0 = −‖g_0‖², satisfying Eq (10). For k ≥ 0, because T_k ≠ 0, multiplying Eq (4) by g_{k+1}^T yields

    g_{k+1}^T d_{k+1} = −‖g_{k+1}‖² + (y_k^T g_{k+1})(d_k^T g_{k+1})/T_k − 2 (d_k^T g_{k+1})² ‖y_k‖²/T_k².   (11)

Let u = (1/2) T_k g_{k+1} and v = 2 (d_k^T g_{k+1}) y_k; then, by the inequality u^T v ≤ (1/2)(‖u‖² + ‖v‖²), we have

    (y_k^T g_{k+1})(d_k^T g_{k+1})/T_k = u^T v/T_k² ≤ (1/8) ‖g_{k+1}‖² + 2 (d_k^T g_{k+1})² ‖y_k‖²/T_k²,

and substituting this bound into Eq (11) gives g_{k+1}^T d_{k+1} ≤ −(7/8) ‖g_{k+1}‖².
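A direct transcription of the modified parameter β_k^{GN}, with T_k as in Eq (7), is sketched below (NumPy, with c = 0.01 as in the later experiments; the zero fallback for a degenerate T_k is our own safeguard, not part of the formula):

```python
import numpy as np

def beta_ghz(g_next, d, y, c=0.01):
    """Modified HZ parameter beta_k^{GN} with
    T_k = max{c*||d_k||*||y_k||, d_k^T y_k}  (Eq (7))."""
    T = max(c * np.linalg.norm(d) * np.linalg.norm(y), float(d @ y))
    if T == 0.0:                       # degenerate case (d_k = 0 or y_k = 0)
        return 0.0
    w = y - 2.0 * d * (y @ y) / T
    return float(w @ g_next) / T

def next_direction(g_next, d, y, c=0.01):
    """d_{k+1} = -g_{k+1} + beta_k^{GN} * d_k  (Eq (4))."""
    return -g_next + beta_ghz(g_next, d, y, c) * d
```

Whatever d_k and y_k are, the resulting direction satisfies the sufficient descent bound of Eq (10), g_{k+1}^T d_{k+1} ≤ −(7/8)‖g_{k+1}‖², which is easy to check numerically on random data.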

Nonsmooth Convex Problems and Their Results
Consider the unconstrained convex optimization problem

    min f(x),  x ∈ ℝⁿ,   (12)

where f: ℝⁿ → ℝ is a possibly nonsmooth convex function. For the special case in which f is continuously differentiable, this optimization problem has been well studied for several decades. However, nonsmooth optimization problems of the form of Eq (12) also arise in many applications, such as image restoration [31] and optimal control [32]. The Moreau-Yosida regularization of f generates the function

    F(x) = min_{z ∈ ℝⁿ} θ(z),  θ(z) = f(z) + (1/(2λ)) ‖z − x‖²,   (13)

where λ is a positive parameter; then, Eq (12) is equivalent to the following problem:

    min F(x),  x ∈ ℝⁿ.   (14)

Let p(x) = argmin_{z ∈ ℝⁿ} θ(z). Then p(x) is well defined and unique because the function θ(z) is strongly convex. Therefore, F(x) can be expressed as follows:

    F(x) = f(p(x)) + (1/(2λ)) ‖p(x) − x‖².   (15)

F(x) possesses many known features (see, e.g., [33-35]). The generalized Jacobian of F(x) and its property of BD-regularity are demonstrated in [36, 37]. Here, we list some additional findings regarding the function F(x), as follows:

(i) a point x is an optimal solution of Eq (12) if and only if ∇F(x) = 0, i.e., if and only if x minimizes F in Eq (14);

(ii) F is finite-valued, convex, and continuously differentiable, where

    g(x) := ∇F(x) = (x − p(x))/λ   (16)

is the gradient of F;

(iii) every element of the generalized Jacobian ∂_B g(x) is a symmetric positive semidefinite matrix for each x ∈ ℝⁿ, because g is a gradient mapping of the convex function F;

(iv) there exist two constants, μ₁ > 0 and μ₂ > 0, and a neighborhood O of x̄ that satisfy

    μ₁ ‖d‖² ≤ d^T V d ≤ μ₂ ‖d‖²  for all V ∈ ∂_B g(x), x ∈ O, d ∈ ℝⁿ.
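For intuition, the following one-dimensional sketch computes F and ∇F for the nonsmooth convex function f(z) = |z|, whose proximal point p(x) is the classical soft-thresholding operator (here F coincides with the Huber function; the value λ = 0.5 used in the check is illustrative only):

```python
import numpy as np

def prox_abs(x, lam):
    """p(x) = argmin_z { |z| + (1/(2*lam))*(z - x)^2 }: soft-thresholding."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def moreau_F(x, lam):
    """Moreau-Yosida regularization of f(z) = |z|, evaluated as
    F(x) = f(p(x)) + (1/(2*lam))*(p(x) - x)^2."""
    p = prox_abs(x, lam)
    return abs(p) + (p - x) ** 2 / (2.0 * lam)

def grad_F(x, lam):
    """g(x) = (x - p(x)) / lam  (Eq (16))."""
    return (x - prox_abs(x, lam)) / lam
```

Even though f is nondifferentiable at 0, F is continuously differentiable everywhere, and ∇F from Eq (16) agrees with a finite-difference quotient of F.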

Two algorithms for nonsmooth problems
Based on the above discussion, we present two algorithms for nonsmooth problems of the form of Eq (14); afterward, we analyze how their iterates solve Eq (12). In the following, unless otherwise noted, g_k = g(x_k) is defined as in Eq (16).

Algorithm 4.1

Require: an initial point x₀ ∈ ℝⁿ, λ > 0, σ, η ∈ (0, 1), ρ ∈ (0, 1/2], and ε ∈ [0, 1). Compute g₀ by solving the subproblem in Eq (13), and set d₀ = −g₀.
while ‖g_k‖ > ε do
    Determine the step size α_k = max{ρʲ | j = 0, 1, 2, ⋯} satisfying the Armijo line search condition in Eq (17);
    Set x_{k+1} = x_k + α_k d_k, and compute g_{k+1} by solving the subproblem in Eq (13);
    if ‖g_{k+1}‖ ≤ ε then
        break;
    else
        Compute d_{k+1} as in Eq (4), and set k := k + 1;
    end if
end while

We make the following assumption.

Assumption 4.1. (ii) g is BD-regular at x̄; namely, item (iv) in Section 3 above holds.

By Assumption 4.1, it is not difficult to deduce that there exists a constant M* > 1 such that the bound in Eq (21) holds.

Proof. Suppose that α_k satisfies the Armijo line search condition in Eq (17). The proof is complete if α_k = 1 holds. Otherwise, let α′_k = α_k/ρ; then α′_k violates Eq (17), which yields Eq (22). By performing a Taylor expansion, we obtain Eq (23), where ξ_k = x_k + θ α′_k d_k with θ ∈ (0, 1), and the last inequality in Eq (23) follows from Eq (20). By combining d_k^T g_k ≤ −(7/8)‖g_k‖² with Eqs (21) and (23), we obtain the step-size bound in Eq (24).

Theorem 1. Suppose that Assumption 4.1 holds. Then

    lim_{k→∞} ‖g_k‖ = 0;   (25)

moreover, any accumulation point of {x_k} is an optimal solution of Eq (12).

Proof. Suppose that Eq (25) is not true. Then, there must exist two constants ε₀ > 0 and k* > 0 that satisfy

    ‖g_k‖ ≥ ε₀  for all k ≥ k*.   (26)

By combining d_k^T g_k ≤ −(7/8)‖g_k‖² with Eqs (17), (22) and (26), we obtain Eq (27). Because F(x) is bounded from below for all k, it follows from Eq (27) that Eq (28) holds, which contradicts Eq (26); thus Eq (25) is true. From Eq (28), we also find that ‖g(x*)‖ = ‖∇F(x*)‖ = 0 at any accumulation point x* of {x_k}. Then, from property (i) of F(x) as given in Section 3, x* is an optimal solution of Eq (12). The proof is complete.

In a manner similar to Theorem 4.1 in [38], we can establish the linear convergence rate of Algorithm 4.1 (or Algorithm 4.2). Here, we simply state this property, as follows, but omit the proof.
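The backtracking loop used in Algorithms 4.1 and 4.2 can be sketched as follows; the paper's exact condition Eq (17) is not reproduced above, so the standard Armijo sufficient-decrease test is used here as a stand-in, with σ as the decrease parameter and ρ as the backtracking ratio:

```python
import numpy as np

def armijo_step(F, x, g, d, sigma=0.8, rho=0.5, max_backtracks=30):
    """Backtracking line search alpha_k = max{rho^j | j = 0, 1, 2, ...}
    satisfying an Armijo-type condition (a stand-in for Eq (17)):
        F(x + alpha*d) <= F(x) + sigma * alpha * g^T d.
    If too many backtracks occur, the last trial step is accepted as a
    safeguard (cf. the acceptance rule used in the experiments)."""
    gd = float(g @ d)          # directional derivative; gd < 0 by Eq (10)
    Fx = F(x)
    alpha = 1.0
    for _ in range(max_backtracks):
        if F(x + alpha * d) <= Fx + sigma * alpha * gd:
            return alpha
        alpha *= rho
    return alpha
```

In the nonsmooth setting, F here is the Moreau-Yosida regularization of Eq (13), so each trial evaluation itself requires an (approximate) solve of the subproblem.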

Theorem 2 Let Assumptions 4.1 (i) and (ii) hold, and let x Ã be the unique solution of Eq (14).
Then, there exist two constants b > 0 and r ∈ (0, 1) that satisfy F(x_k) − F(x*) ≤ b rᵏ for all sufficiently large k.

Numerical results for nonsmooth problems
In this section, we present several numerical experiments using Algorithms 4.1 and 4.2 and a modified Polak-Ribière-Polyak CG algorithm (called MPRP) [18]. It is well known that the CG method is very effective for large-scale smooth problems; we will show that these two algorithms are also applicable to large-scale nonsmooth problems. The nonsmooth academic test problems, along with their initial points, are listed in Table 1 and described in [27], where "Problem" is the name of the test problem, "x₀" is the initial point, and "f_ops" is the optimal value of the test problem; the corresponding numbers of variables are also given. Problems 1-5 are convex functions, and the others are nonconvex functions. The detailed characteristics of these problems can be found in [27]. Because we wished to test the three considered methods on large-scale nonsmooth problems, problem dimensions of 5000, 10000, 50000, and 100000 were chosen. In our experiments, we found that problem 2 required a considerable amount of time to solve; therefore, we set its largest dimension to 50000. All algorithms were implemented in Fortran PowerStation 4.0 with double-precision arithmetic, and all experiments were run on a PC with a Core 2 Duo E7500 CPU @ 2.93 GHz, 2.00 GB of memory, and the Windows XP operating system. The following parameters were chosen for Algorithms 4.1 and 4.2 and MPRP: σ = 0.8, ρ = 0.5, c = 0.01, ε = 1E−15, and η = 0.01. We stopped the algorithms when the condition ‖g(x)‖ ≤ 1E−5 or |F(x_{k+1}) − F(x_k)| ≤ 1E−8 or |f(x_k) − f_ops| ≤ 1E−4 was satisfied. If the number of iterations exceeded ten thousand, the program was also terminated. Because a line search cannot always ensure the descent condition d_k^T g_k < 0, an uphill search direction may arise in real numerical experiments, which may cause the line search rule to fail.
To avoid this situation, the step size α_k was accepted whenever the number of trials in the line search exceeded five.

Table 1. Test problems, their initial points, and optimal values.

In the experiments, the subproblem in Eq (13) was solved using the PRP CG method (called the sub-algorithm), and its numbers of iterations and function evaluations were added to those of Algorithm 4.1, Algorithm 4.2 or MPRP (called the main algorithm). The sub-algorithm was terminated if ‖∂f(x_k)‖ ≤ 1E−4 or f(x_{k+1}) − f(x_k) + ‖∂f(x_{k+1})‖² − ‖∂f(x_k)‖² ≤ 1E−3 (see [39]) held, where ∂f(x_k) is the subgradient of f(x) at the point x_k, or when its iteration number exceeded ten. For the line search, the Armijo technique was used, and the step length was accepted if the number of search trials was greater than five. The columns of Tables 2 and 3 report, among other quantities, NI (the number of iterations) and NF (the number of function evaluations) when the program terminates; moreover, the value of ‖g(x)‖ obtained is smaller than that for Algorithm 4.2 in most cases. However, we also note that for problems 3, 4, 5 and 8, our algorithms can efficiently solve cases of up to 3000 dimensions (the maximum dimension); if the dimension is increased further, these algorithms fail to converge to good minima and become stuck at local solutions. To directly illustrate the performance of these methods, we used the tool developed by Dolan and Moré [40] to analyze their efficiency in terms of the number of iterations, the number of function evaluations, and the CPU time; Figs 1, 2 and 3 present the corresponding results for the data in Tables 2 and 3.
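The Dolan-Moré comparison underlying Figs 1-3 can be reproduced from raw result tables with a short routine such as the following sketch (the matrix T in the check below is hypothetical sample data, not the paper's results):

```python
import numpy as np

def performance_profile(T):
    """Dolan-More performance profile. T[p, s] is the cost (iterations,
    function evaluations, or CPU time) of solver s on problem p, with
    np.inf marking a failure. Returns the ratio matrix
    r[p, s] = T[p, s] / min over solvers of T[p, .] and a function
    rho(tau, s): the fraction of problems that solver s solves within a
    factor tau of the best solver."""
    T = np.asarray(T, dtype=float)
    best = T.min(axis=1, keepdims=True)   # best cost on each problem
    r = T / best                          # performance ratios
    def rho(tau, s):
        return float(np.mean(r[:, s] <= tau))
    return r, rho
```

Plotting rho(tau, s) against tau for each solver s yields the profiles shown in Figs 1-3: curves that rise to 1 more quickly indicate more efficient and more robust solvers.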

Conclusion
(i) In this paper, we focus on the HZ CG method and study the application of this method to solve nonsmooth optimization problems. Several results are presented that prove the efficiency of this method for application to large-scale problems of nonsmooth unconstrained optimization.
(ii) Motivated by the HZ formula, we also present a modified HZ CG formula. The modified HZ formula not only possesses the sufficient descent property of the HZ formula but also belongs to a trust region and has a non-negative scalar, β_k^{GN} ≥ 0.

(iii) We report the results of applying three methods to solve large-scale nonsmooth convex minimization problems. Global convergence is achieved, and numerical experiments verify that both of the proposed methods can be successfully used to solve large-scale nonsmooth problems.
(iv) Although the HZ and MHZ methods offer several key achievements for large-scale nonsmooth optimization, we believe that there are at least five issues that could be addressed to gain further improvements. The first is the scalar c in the modified HZ CG algorithm, which could be tuned. The second is the application of other CG methods to this type of optimization problem; perhaps there exists a more suitable CG method for this purpose. Regarding the third issue, it is well known that limited-memory quasi-Newton methods are effective techniques for solving certain classes of large-scale optimization problems because they require minimal storage; this inspires us to combine limited-memory quasi-Newton methods with the HZ CG technique to solve large-scale nonsmooth optimization problems. The fourth issue is to use the HZ CG method to investigate large-scale nonsmooth optimization with constraints, which we will pursue in the future. The last issue is the most important one: more attention should be paid to other optimality conditions and convergence conditions for nonsmooth problems. All of these issues will be addressed in our future work.