A Limited-Memory BFGS Algorithm Based on a Trust-Region Quadratic Model for Large-Scale Nonlinear Equations

In this paper, a trust-region algorithm is proposed for large-scale nonlinear equations, where the limited-memory BFGS (L-M-BFGS) update matrix is used in the trust-region subproblem to improve the effectiveness of the algorithm for large-scale problems. The global convergence of the presented method is established under suitable conditions. The numerical results on the test problems show that the method is competitive with the normal trust-region method.

Consider the following system of nonlinear equations:
$$b(x) = 0, \quad x \in \mathbb{R}^n, \qquad (1)$$
where $b:\mathbb{R}^n \to \mathbb{R}^n$ and $n$ denotes the large-scale dimension. Problems involving nonlinear equations arise in numerous engineering applications, such as nonlinear fitting, function approximation, parameter estimation and learning models [1][2][3][4][5]. Large-scale nonlinear equations are believed to be difficult to solve because the relations with $x$ are complex and because the dimension is high. Currently, many effective algorithms exist for addressing (1), such as the trust-region method [6][7][8][9][10][11], the Levenberg-Marquardt method [12][13][14], the quasi-Newton method [15][16][17][18][19], and the Gauss-Newton method [20][21][22][23][24][25]. Let $\theta$ be the norm function defined by $\theta(x) = \frac{1}{2}\|b(x)\|^2$, and suppose that $b(x)$ has a zero point. Then, (1) is equivalent to the following global optimization problem:
$$\min\ \theta(x), \quad x \in \mathbb{R}^n. \qquad (2)$$
In this paper, we focus on the solution of (2) via trust-region (TR) methods. At each iterate $x_k$, the traditional TR methods obtain the trial step $d_k$ from the following TR subproblem model:
$$\min_{d \in \mathbb{R}^n}\ q_k(d) = \frac{1}{2}\|b(x_k) + \nabla b(x_k)^T d\|^2, \quad \text{s.t. } \|d\| \le \Delta, \qquad (3)$$
where $\Delta > 0$ is the so-called TR radius and $\|\cdot\|$ denotes the Euclidean norm of vectors or its induced matrix norm. The TR method is very useful when the exact Jacobian or Hessian is inexpensive to compute, but this situation is rare in practice. Indeed, the Jacobian matrix $\nabla b(x)$ must be computed at every iteration, which obviously increases the workload and CPU time. Moreover, the global and superlinear convergence of TR methods often requires a nondegeneracy assumption on $\nabla b(x^*)$, where $x^*$ is a solution of (1) and nondegeneracy means nonsingularity. The nondegeneracy of $\nabla b(x^*)$ seems too stringent a requirement for ensuring global and superlinear convergence. To overcome the above drawbacks, Yuan et al.
[9] presented a new TR subproblem model defined by
$$\min_{d \in \mathbb{R}^n}\ q_k(d) = \frac{1}{2}\|b(x_k) + B_k d\|^2, \quad \text{s.t. } \|d\| \le \Delta_k, \qquad (4)$$
where the TR radius is $\Delta_k = c^p \|b(x_k)\|$, $c \in (0, 1)$, $p$ is a nonnegative integer, and $B_k$ is generated by the following BFGS formula:
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{s_k^T y_k},$$
where $s_k = x_{k+1} - x_k$, $y_k = b(x_{k+1}) - b(x_k)$, $x_{k+1}$ is the next iterate, and $B_0$ is an initial symmetric, positive definite matrix. They established global and superlinear convergence without nondegeneracy. The numerical results show that the given algorithm is competitive with the normal TR method up to a dimension of 600. The subproblem (4) can also be rewritten as
$$\min_{d \in \mathbb{R}^n}\ q_k(d) = \frac{1}{2}\|b(x_k)\|^2 + (B_k^T b(x_k))^T d + \frac{1}{2} d^T B_k^T B_k d, \quad \text{s.t. } \|d\| \le \Delta_k.$$
Yuan et al. [8] consider the case with a symmetric Jacobian matrix $\nabla b(x_k)$ and propose a TR subproblem model of the same form in which the update matrix $B_k^0$ is defined by a special BFGS update based on $\delta_k = b(x_k + y_k) - b(x_k)$. Regarding the above TR methods, the TR subproblem model must be repeatedly solved whenever the computed trial step $d_k$ does not satisfy the given conditions, which obviously increases the workload. To avoid this drawback, Yuan et al. [10] give a TR model in which the constraint is $\|d\| \le \Delta_k^m$ with $\Delta_k^m = \max\{\|b(x_k)\|, \Delta_k\}$. In this model, the next point is $x_{k+1} = x_k + d_k$ if the trial step $d_k$ yields a successful iteration; otherwise, the next point is $x_{k+1} = x_k + \alpha_k d_k$, where $\alpha_k > 0$ is a steplength determined by a backtracking line search with parameter $\delta \in (0, 1)$; this technique was proposed by Yuan and Lu [16]. However, this model cannot ensure that the number of searches for $\alpha_k > 0$ is small. Many TR model methods (see [7, 12], etc.) have been developed for solving nonlinear equations. However, these TR methods require matrix information (the Jacobian matrix or the BFGS update matrix), which increases the computational complexity. This observation motivates us to find another approach that generates the matrix information while requiring minimal memory. Based on problem (4), we will use the limited-memory BFGS (L-M-BFGS) method instead of the BFGS update because the former often provides a fast rate of linear convergence and requires minimal memory.
The L-M-BFGS method is an adaptation of the standard BFGS method [26]. The implementation is similar to that of the normal BFGS method; the difference between the L-M-BFGS method and the BFGS method is that the inverse Hessian approximation is not formed explicitly but is defined by a small number of BFGS updates. At every iterate $x_k$, the L-M-BFGS method stores a small number (say, $m$) of correction pairs $\{s_i, y_i\}$ $(i = k-1, \ldots, k-m)$ to obtain $H_{k+1}$ instead of storing the matrices $H_k$, where $s_k = x_{k+1} - x_k$, $y_k = b(x_{k+1}) - b(x_k)$, and $H_k$ is the inverse of $B_k$. The L-M-BFGS update formula is defined as
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, \qquad (8)$$
where $\rho_k = \frac{1}{y_k^T s_k}$ and $V_k = I - \rho_k y_k s_k^T$, and the recursion is carried back over only the $m$ most recent correction pairs. It has been proved that the L-M-BFGS method is very suitable for large-scale systems of nonlinear equations [17]. Therefore, combining the L-M-BFGS update (8) with (4), we design the L-M-BFGS TR subproblem model by replacing $B_k$ in (4) with the inverse of the L-M-BFGS matrix $H_k$:
$$\min_{d \in \mathbb{R}^n}\ q_k(d) = \frac{1}{2}\|b(x_k) + H_k^{-1} d\|^2, \quad \text{s.t. } \|d\| \le \Delta_k, \qquad (9)$$
where $\Delta_k = c^p\|b(x_k)\|$. This paper is organized as follows. In the next section, the algorithm is discussed. In Section 3, the global convergence is established. Finally, the numerical results are reported in Section 4.
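In practice, the product of the L-M-BFGS matrix with a vector is usually computed from the $m$ stored pairs by the standard two-loop recursion rather than by forming $H_k$ explicitly. The following sketch illustrates this; the function name and the choice $H_0 = \gamma I$ are illustrative assumptions, not taken from this paper:

```python
import numpy as np

def lbfgs_two_loop(v, s_list, y_list, gamma=1.0):
    """Compute H_k @ v implicitly from the m most recent (s_i, y_i) pairs,
    where H_k is the L-M-BFGS matrix of (8) and H_0 = gamma * I."""
    q = v.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = gamma * q  # apply the initial matrix H_0
    # Second loop: oldest pair to newest.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

For a single stored pair, the result coincides with applying $V_k^T H_0 V_k + \rho_k s_k s_k^T$ directly, so the recursion reproduces the update (8) without ever storing an $n \times n$ matrix.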

The L-M-BFGS Trust-Region Method
The algorithm is given as follows.
• Algorithm 1. (L-M-BFGS Trust-Region Method)
Initial: Given the constants $\rho, c \in (0, 1)$, $p = 0$, $\epsilon > 0$, $x_0 \in \mathbb{R}^n$, and a symmetric, positive definite matrix $H_0 \in \mathbb{R}^{n \times n}$. Set $k := 0$;
Step 1: If $\|b_k\| < \epsilon$, stop;
Step 2: Solve the TR subproblem (9) with $\Delta = \Delta_k$ to obtain $d_k$;
Step 3: Calculate the following ratio of the actual reduction over the predicted reduction:
$$r_k^p = \frac{Ared_k(d_k)}{Pred_k(d_k)}.$$
If $r_k^p < \rho$, let $p = p + 1$ and go to Step 2. Otherwise, go to Step 4;
Step 4: Set $x_{k+1} = x_k + d_k$ and use (8) to generate the updated matrix $H_{k+1}$ with positive definiteness. Set $p = 0$, $k := k + 1$, and go to Step 1.
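The outer iteration and the radius-shrinking inner cycle of the algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the subproblem of Step 2 is approximated by a crude Cauchy-type step, and a fixed matrix `B` stands in for the L-M-BFGS update of Step 4.

```python
import numpy as np

def tr_solve(b_func, x0, B=None, rho=1e-4, c=0.1, eps=1e-5, max_iter=200):
    """Sketch of the trust-region loop with radius Delta_k = c^p * ||b_k||."""
    x = x0.astype(float)
    n = len(x)
    B = np.eye(n) if B is None else B
    for _ in range(max_iter):
        bk = b_func(x)
        if np.linalg.norm(bk) < eps:
            break
        p = 0
        while True:
            delta = (c ** p) * np.linalg.norm(bk)  # TR radius shrinks as p grows
            g = B.T @ bk                           # gradient of q at d = 0
            d = -delta * g / np.linalg.norm(g)     # crude step clipped to the radius
            ared = 0.5 * np.linalg.norm(bk) ** 2 \
                 - 0.5 * np.linalg.norm(b_func(x + d)) ** 2
            pred = 0.5 * np.linalg.norm(bk) ** 2 \
                 - 0.5 * np.linalg.norm(bk + B @ d) ** 2
            if pred > 0 and ared / pred >= rho:    # successful step: accept d_k
                break
            p += 1                                 # otherwise shrink the radius
            if p > 20:
                break
        x = x + d
    return x
```

On the linear test system $b(x) = x$ with $B = I$ the first trial step is accepted immediately and the method lands on the root in one iteration, which illustrates the role of the ratio test in Step 3.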
(ii) It is well known that the positive definiteness of the update matrix $H_k$ is very important when analyzing the convergence of the algorithm. Byrd et al. [27] showed that the limited-memory BFGS matrix has this property if the curvature condition $s_k^T y_k > 0$ is satisfied. Similarly, Powell [28] proposed that $y_k$ be replaced by a damped combination of $y_k$ and $B_k s_k$ whenever the curvature condition fails. Thus, Step 4 of Algorithm 1 is reasonable.
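The exact form of the correction did not survive extraction here; Powell's standard damping, which guarantees $s_k^T \bar{y}_k \ge 0.2\, s_k^T B_k s_k$ and hence preserves positive definiteness, can be sketched as follows (the threshold 0.2 is Powell's usual choice, assumed rather than taken from this paper):

```python
import numpy as np

def damped_y(s, y, B, theta_min=0.2):
    """Powell's correction: replace y_k by a combination of y_k and B_k s_k
    so that the curvature s_k^T y_k stays sufficiently positive."""
    sBs = s @ B @ s
    sy = s @ y
    if sy >= theta_min * sBs:
        return y  # curvature condition already holds; keep y_k unchanged
    theta = (1.0 - theta_min) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)
```

A short check: when the damping branch fires, the corrected vector satisfies $s^T \bar{y} = \theta\, s^T y + (1 - \theta)\, s^T B s = 0.2\, s^T B s$ exactly.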
To compare with the normal TR method, another algorithm, called Algorithm 2, is given.

• Algorithm 2. (BFGS Trust-Region Method)
Step 2: Solve the TR subproblem (4) with $\Delta = \Delta_k$ to obtain $d_k$;
Step 4: Set $x_{k+1} = x_k + d_k$ and generate the updated matrix $B_{k+1}$ by the BFGS formula.
Remark 2. It is easy to see that the difficult step of these two algorithms is Step 2. The following dogleg method [29] is used to solve the TR subproblem models (4) and (9) to obtain $d_k$.
• At the kth iteration:
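The itemized dogleg steps did not survive extraction; the classical dogleg computation [29] for a quadratic model $g^T d + \frac{1}{2} d^T B d$ with symmetric positive definite $B$ is sketched below. For the models (4) and (9) one would take $g = B_k^T b_k$ and Hessian $B_k^T B_k$; variable names here are illustrative:

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Classical dogleg step for min g^T d + 0.5 d^T B d, s.t. ||d|| <= delta."""
    pB = -np.linalg.solve(B, g)                  # full (quasi-)Newton step
    if np.linalg.norm(pB) <= delta:
        return pB                                # Newton step fits in the region
    pU = -(g @ g) / (g @ B @ g) * g              # unconstrained Cauchy point
    if np.linalg.norm(pU) >= delta:
        return -delta * g / np.linalg.norm(g)    # scaled steepest-descent step
    # Otherwise follow the dogleg path: find tau with ||pU + tau*(pB - pU)|| = delta.
    d = pB - pU
    a, b, c = d @ d, 2 * (pU @ d), pU @ pU - delta ** 2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return pU + tau * d
```

In the intermediate case the returned step lies exactly on the trust-region boundary, which is the defining property of the dogleg path.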

Convergence Analysis
This section will provide some convergence results under the following assumptions.
Assumption A. (ii) The stated relation holds. Assumption A(i) implies that there exists $M_1 > 0$ such that the corresponding bound holds.
Similar to Moré [30], Yuan et al. [9], and Yuan and Sun [11], we can obtain the following lemma.
Lemma 0.1. Let the predicted reduction be $Pred_k(d) = q_k(0) - q_k(d)$, and let $d_k$ be the solution of (9). Then, the sufficient-reduction inequality holds.
Proof. For any $\alpha \in [0, 1]$, by the property of the solution $d_k$ of (9), we obtain the desired estimate. Thus, the inequality holds. This completes the proof.
Lemma 0.2. Let $d_k$ be the solution of (9), and let the actual reduction be $Ared_k(d_k) = \theta(x_k) - \theta(x_k + d_k)$. Suppose that Assumption A holds and that the sequence $\{x_k, d_k\}$ is generated by Algorithm 1. Then, the inner cycle of Algorithm 1 stops in a finite number of steps.
Proof. We establish this lemma by contradiction. Suppose that the inner cycle of Algorithm 1 circles infinitely at $x_k$; namely, $p \to \infty$, $r_k^p < \rho$, and $c^p \to 0$ hold. It is easy to see that $\|b_k\| \ge \epsilon$, for otherwise the algorithm stops. Thus, we can conclude that $\|d_k\| \le \Delta_k = c^p \|b_k\| \to 0$.
By the definitions of $d_k$, $Ared_k(d_k)$, $Pred_k(d_k)$, and Assumption A, together with Taylor's formula, the difference between the actual and predicted reductions is of higher order in $\|d_k\|$. It then follows from Lemma 0.1 and (7) that, if $p$ is sufficiently large, we have $r_k^p \ge \rho$. This relation contradicts $r_k^p < \rho$. The proof is complete.
The above lemma shows that the given algorithm is well defined.
Theorem 0.1. Let Assumption A hold, and let $\{x_k\}$ be generated by Algorithm 1. Then, $\{x_k\} \subset \Omega$, and $\{\theta(x_k)\}$ converges. In particular, Algorithm 1 either stops in a finite number of steps or generates an infinite sequence $\{x_k\}$ such that
$$\lim_{k \to \infty} \|b(x_k)\| = 0.$$
Proof. By the definition of Algorithm 1, we obtain $r_k^p \ge \rho > 0$. Combining this with Lemma 0.1 shows that $\{\theta(x_k)\}$ is decreasing; thus, $\{x_k\} \subset \Omega$ holds. Considering $\theta(x_k) \ge 0$, we then conclude that $\{\theta(x_k)\}$ converges. If Algorithm 1 does not stop after a finite number of steps, then again by (9) and $\theta(x_k) \ge 0$, we easily conclude that $\theta(x_k) \to 0$ as $k \to \infty$, which shows that $b(x_k) \to 0$ as $k \to \infty$. The proof is complete.
Theorem 0.1 proves that the sequence $\{x_k\}$ of Algorithm 1 converges in the sense that $\|b(x_k)\| \to 0$ without the assumption that $\nabla b(x^*)$ is nondegenerate, where $x^*$ is a cluster point of $\{x_k\}$.
Numerical Results
All codes for the experiments were written in MATLAB R2009a and run on a PC with an E5507 @ 2.27 GHz CPU and 2.99 GB of memory under the Windows XP operating system. The parameters were set as $\rho = 0.0001$, $\epsilon = 10^{-5}$, $c = 0.1$, and $\gamma = 0.7$, where $B_0$ and $H_0$ are unit matrices and $m = 6$. In the inner iterations of Algorithm 1 and Algorithm 2, the trial step is accepted when $p > 5$. We also stop the program if the number of iterations is larger than one thousand or when $\theta(x) < 10^{-5}$. The columns of the tables have the following meanings: Dim: the dimension of the problem; NG: the number of norm function evaluations; NI: the total number of iterations; GN: the value of $\theta(x)$ when the program stops; NaN: fails to find the final value of $\theta(x)$.
From Table 1, it is clear that the new algorithm performs quite well in terms of solving these problems and that the dimension does not have an obvious influence on the performance of the presented method. Algorithm 2 can also successfully solve some of these problems. However, Algorithm 2 fails to solve problems 3 and 6.
To clearly demonstrate these algorithms' efficiencies, a tool developed by Dolan and Moré [35] is used to analyze them. In this tool, the so-called (cumulative) distribution function $\rho_s$ is defined by
$$\rho_s(\kappa) = \frac{1}{n_p}\,\mathrm{size}\{p \in P : \pi_{p,s} \le \kappa\},$$
where $p$ is a problem, $s$ is a solver, $n_p$ is the number of problems, $n_s$ is the number of solvers, $P$ is the test set, $S$ is the set of solvers, $\kappa$ denotes the computing time or the iteration number, and $\pi_{p,s}$ is the performance ratio given by
$$\pi_{p,s} = \frac{\kappa_{p,s}}{\min\{\kappa_{p,s} : s \in S\}}.$$
According to the above formulas and the rules given in [35], the performance profile plots for Table 1 are given in the following two figures. Figs 1 and 2 show the performance of these methods as a function of NI and NG, respectively, in Table 1. It is clear that the given method performs better because it has a higher probability of being the optimal solver. Fig 1 shows that Algorithm 1 is superior to Algorithm 2 for $\kappa \le 14$, approximately. From Fig 2, it is clear that the given method can successfully solve all of the test problems.

Supporting Information: Legends of Figs 1 and 2
Dolan and Moré [35] developed a new tool to analyze the efficiency of algorithms. They introduced the notion of a performance profile for evaluating and comparing the performance of a set of solvers $S$ on a test set $P$. Assuming that $n_s$ solvers and $n_p$ problems exist, for each problem $p$ and solver $s$, they defined $\kappa_{p,s}$ as the computing time (the number of function evaluations or some other metric) required to solve problem $p$ using solver $s$. Requiring a baseline for comparison, they compared the performance on problem $p$ by solver $s$ with the best performance by any solver on this problem, giving the performance ratio
$$\pi_{p,s} = \frac{\kappa_{p,s}}{\min\{\kappa_{p,s} : s \in S\}}.$$
Suppose that a parameter $\pi_M \ge \pi_{p,s}$ for all $p, s$ is chosen; then $\pi_{p,s} = \pi_M$ if and only if solver $s$ does not solve problem $p$.
The performance of solver $s$ on any given problem might be of interest. More importantly, one would like to obtain an overall assessment of the performance of the solver. With this motivation, these authors defined
$$\rho_s(\kappa) = \frac{1}{n_p}\,\mathrm{size}\{p \in P : \pi_{p,s} \le \kappa\}.$$
Specifically, $\rho_s(\kappa)$ is the probability for solver $s \in S$ that its performance ratio $\pi_{p,s}$ is within a factor $\kappa \in \mathbb{R}$ of the best possible ratio. The function $\rho_s$ is therefore the (cumulative) distribution function of the performance ratio. The performance profile $\rho_s : \mathbb{R} \to [0, 1]$ for a solver is a nondecreasing, piecewise constant function that is continuous from the right at each breakpoint. The value of $\rho_s(1)$ is the probability that the solver will outperform the remaining solvers. According to the above rules, we know that a solver whose performance profile plot stays on top and to the right outperforms the remaining solvers. Similar legends can be found in [10, 36, 37, 38].
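Under these definitions, profiles such as those in Figs 1 and 2 can be reproduced from a table of measurements. The sketch below is an illustration, not the authors' code; failures (the NaN entries in the tables) are encoded as infinity so that they never attain the best ratio:

```python
import numpy as np

def performance_profiles(T, kappas):
    """T[p, s]: measure (e.g. NI or NG) for problem p with solver s;
    np.inf marks a failure. Returns rho_s(kappa) for each kappa and solver."""
    n_p, n_s = T.shape
    best = T.min(axis=1)                 # best result per problem over all solvers
    ratios = T / best[:, None]           # performance ratios pi_{p,s}
    profiles = np.empty((len(kappas), n_s))
    for i, kappa in enumerate(kappas):
        # fraction of problems solved within a factor kappa of the best solver
        profiles[i] = (ratios <= kappa).mean(axis=0)
    return profiles
```

Each row of the returned array is one point of the step function $\rho_s$; plotting the rows against the chosen $\kappa$ grid yields the familiar staircase curves.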

Conclusion
1. In this paper, a TR model combined with the L-M-BFGS technique is presented for nonlinear equations, and its global convergence is established. In fact, this work is an extension of the paper [9], and the difference between the two papers is the update matrix: this paper chooses the L-M-BFGS formula, whereas the paper [9] chooses the normal BFGS formula.
2. Since the L-M-BFGS technique requires minimal storage and is suitable for certain classes of large-scale optimization problems, we extend it to large-scale nonlinear equations. The numerical results show that the given algorithm is competitive with similar methods for large-scale nonlinear equations.
3. It is well known that conjugate gradient methods are effective techniques for large-scale optimization problems because they do not need matrix information, which motivates us to apply the conjugate gradient technique to large-scale nonlinear equations. We believe its numerical performance will also be very interesting.