
Accelerated Stochastic Conjugate Gradient for a class of convex optimization

Abstract

The conjugate gradient method is widely recognized as a foundational technique for large-scale unconstrained optimization. In this work, we introduce an Accelerated Stochastic Conjugate Gradient (ASCG) algorithm designed for a class of convex empirical risk minimization problems. The proposed ASCG method integrates a variance-reduced gradient estimator, inspired by modern stochastic variance reduction techniques, to control noise and improve stability in the optimization process. Moreover, the ASCG algorithm incorporates a novel acceleration mechanism via a deflation factor on the step size, which is shown to achieve faster practical convergence than the baseline stochastic FR method. We provide a rigorous theoretical analysis demonstrating that ASCG achieves an expected linear convergence rate under strong convexity assumptions and attains a superior reduction in function values compared to non-accelerated stochastic counterparts. Extensive numerical experiments on four widely used benchmark datasets confirm that ASCG consistently outperforms state-of-the-art stochastic optimization methods.

1 Introduction

This paper primarily addresses the following empirical risk minimization (ERM) problem:

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where n is the number of training samples, finite yet extremely large, and d represents the dimensionality of the feature space. In this paper, we assume that the set of optimal solutions of problem (1) is nonempty. It is evident that (1) is a finite-sum problem, which commonly arises in the fields of statistics and machine learning. In such problems, a set of training data $\{(a_i, b_i)\}_{i=1}^n$ is given for data fitting. Popular loss functions include the $l_2$-regularized least squares $f_i(x) = (a_i^\top x - b_i)^2 + \frac{\lambda}{2}\|x\|^2$ for regression analysis, as well as the $l_2$-regularized logistic regression $f_i(x) = \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2$ for binary classification.
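For concreteness, both example objectives above can be written as finite sums in a few lines of code. The following Python/NumPy sketch is illustrative only; the arrays `A` (samples, row-wise) and `b` (targets/labels) and the helper names are hypothetical, not part of the paper.

```python
import numpy as np

def ridge_component(x, a_i, b_i, lam):
    # f_i(x) = (a_i^T x - b_i)^2 + (lam/2) * ||x||^2
    r = a_i @ x - b_i
    return r * r + 0.5 * lam * (x @ x)

def logistic_component(x, a_i, b_i, lam):
    # f_i(x) = log(1 + exp(-b_i a_i^T x)) + (lam/2) * ||x||^2, with b_i in {-1, +1}
    return np.log1p(np.exp(-b_i * (a_i @ x))) + 0.5 * lam * (x @ x)

def erm_objective(x, A, b, lam, component):
    # f(x) = (1/n) * sum_i f_i(x), i.e., problem (1)
    n = A.shape[0]
    return np.mean([component(x, A[i], b[i], lam) for i in range(n)])
```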

Numerous efficient gradient-based methods have been employed to solve problem (1), including gradient descent and its variants [1,2]. Among these, the conjugate gradient method (CGM) stands out as a particularly notable class of gradient-based optimization techniques, renowned for its superior convergence properties in specific problem types. Originally introduced by Hestenes and Stiefel [3] for solving linear systems, CGM was later extended by Fletcher and Reeves [4] to nonlinear optimization problems, leading to the development of nonlinear CGM. Prominent nonlinear variants include the FR-CGM [4], PR-CGM [5], and DY-CGM [6]. CGM offers two key advantages over other methods: it converges faster than gradient descent and avoids the computational burden of Hessian matrix evaluation required in Newton’s method. As a result, CGM has been widely applied in machine learning domains such as compressed sensing [7], image restoration [8], and signal processing [9]. However, since CGM requires the computation of the full gradient at each iteration, it becomes impractical for problems involving extremely large datasets.

To enhance the applicability of CGM to large-scale unconstrained ERM problems, the randomized CGM was developed. Schraudolph and Graepel [10] were the first to incorporate ideas from CGM into a stochastic setting, experimentally demonstrating convergence rates orders of magnitude faster than stochastic gradient descent. Huang et al. [11] combined a stochastic recursive gradient approach with the Barzilai-Borwein technique to address nonconvex optimization, providing a theoretical convergence analysis and empirical validation in machine learning tasks. Randomized CGM has since been extensively applied across areas including digital predistortion [12], pattern classification [13], seismic inversion [14], neural language modeling [15], and deep learning [16]. Nevertheless, the high variance inherent in stochastic gradient estimates considerably impedes the convergence speed of randomized CGM.

Recent years have witnessed significant progress in variance reduction (VR) techniques, which have effectively addressed the high variance inherent in stochastic gradient estimates and substantially enhanced the convergence guarantees of stochastic optimization algorithms. By employing control variates to correct stochastic gradients, VR methods construct gradient estimators with asymptotically vanishing variance, thereby accelerating convergence both in theory and practice. Pioneering VR-based approaches such as SVRG [17] and SAGA [18] laid the foundation for this line of research. While both target convex empirical risk minimization problems, they diverge in mechanism: SVRG adopts a nested-loop structure that periodically refreshes the full gradient, whereas SAGA maintains a historical gradient matrix to enable incremental updates. Building on these ideas, more recent innovations such as SARAH [19], SPIDER [20], and PAGE [21] have further refined the trade-offs between computational cost, storage, and convergence rate, achieving even faster convergence under various settings. These advances set the stage for the integration of variance reduction into more complex optimization frameworks, including stochastic conjugate gradient methods.
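As a concrete illustration of the control-variate idea, a minimal SVRG-style estimator might look as follows. This is a sketch assuming a hypothetical component-gradient oracle `grad_i(x, i)`; it is not code from the paper.

```python
import numpy as np

def svrg_estimator(x, snapshot, full_grad_snapshot, grad_i, batch):
    """Variance-reduced gradient: g_B(x) - g_B(snapshot) + full_grad(snapshot).

    The correction term has zero mean, so the estimator remains unbiased,
    while its variance vanishes as x approaches the snapshot point.
    """
    g = np.zeros_like(x)
    for i in batch:
        g += grad_i(x, i) - grad_i(snapshot, i)
    return g / len(batch) + full_grad_snapshot
```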

The integration of VR techniques has opened new avenues for enhancing randomized CGM, leading to significant improvements in convergence efficiency and stability. For instance, Jin et al. [22] effectively combined SVRG with CGM to develop the CGVR algorithm, which achieves linear convergence under strong convexity and smoothness assumptions. In a different direction, Kobayashi and Iiduka [23] integrated Adam-type adaptive updating with nonlinear CGM, substantially boosting performance in deep neural network training. Further extending this line of work, Kou and Yang [24] proposed the SCGA method, drawing inspiration from SAGA and FR-CGM, and established its linear convergence for smooth and strongly convex objectives. More recently, Ouyang et al. [25] introduced a variance-aware three-term conjugate gradient method equipped with advanced line search techniques tailored for nonconvex optimization. While these VR-enhanced methods consistently outperform conventional CGM on large-scale empirical risk minimization problems, challenges related to online step-size selection persist. In response, Yang [26,27] proposed innovative solutions using local quadratic approximation and hyper-gradient descent, providing more robust and adaptive strategies for step-size control.

In parallel to advancements in stochastic variants, significant research efforts have been devoted to acceleration techniques aimed at further improving the convergence properties of CGM. Early foundational work by Lenard [28] introduced a family of accelerated CGMs drawing analogies to the Broyden family of quasi-Newton methods, establishing a theoretical bridge between these two classes of algorithms. Building on this, Andrei [29-31] developed accelerated CGM variants that achieve substantial reductions in function values through sophisticated step-size selection and direction updating strategies. More recently, Sun et al. [32,33] derived an acceleration parameter using quadratic interpolation models and introduced specific acceptance criteria to enhance robustness and efficiency in practical implementations. Jian et al. [34] reformulated the conjugate parameter via accelerated subspace quadratic optimization, effectively exploiting structural advantages of linear CGM in large-scale settings. Further extending its applicability, Hu et al. [35] incorporated adaptive momentum into nonlinear CGM, demonstrating notable improvements in convergence for sparse recovery problems. Most recently, Karimi and Vavasis [36] proposed a hybrid C+AG algorithm that integrates Nesterov's accelerated gradient framework with conventional CGM, achieving optimal performance for quadratic functions while maintaining competitive complexity bounds for general smooth convex functions. Collectively, these contributions not only underscore the enduring theoretical value of acceleration mechanisms in CGM but also greatly expand their practical utility in modern computational environments.

Despite these advances, the development of accelerated stochastic conjugate gradient methods remains relatively underexplored, particularly within the context of variance-reduced optimization. To address this gap, we propose a novel Accelerated Stochastic Conjugate Gradient (ASCG) algorithm that synergistically integrates acceleration mechanisms with stochastic variance reduction. Our approach embeds a conjugate gradient update within a stabilized SVRG-like framework, effectively mitigating gradient noise while preserving directional fidelity. Furthermore, we introduce an adaptive step-size scaling strategy that incorporates an acceleration factor to enhance convergence dynamics without compromising stability. We provide a rigorous theoretical analysis establishing that ASCG achieves an expected linear convergence rate for strongly convex empirical risk minimization problems, alongside a provably superior reduction in function values relative to existing non-accelerated stochastic CGMs. To the best of our knowledge, this work constitutes the first systematic effort to incorporate acceleration techniques into stochastic conjugate gradient frameworks, offering both foundational theoretical insights and tangible practical benefits.

The main contributions of this paper are as follows:

  • We propose the Accelerated Stochastic Conjugate Gradient (ASCG) algorithm, a novel framework that integrates SVRG-based variance reduction to mitigate stochastic gradient noise with a step-size acceleration factor to address the online step-size inefficiencies of randomized CGM.
  • We prove that ASCG achieves an expected linear convergence rate for strongly convex ERM problems, with rigorously derived acceleration guarantees that outperform non-accelerated stochastic CGM in function value reduction.
  • Our work fills a critical research gap by being the first to combine VR and acceleration techniques for stochastic CGM, a direction previously unaddressed in existing literature.
  • We unify acceleration techniques with variance reduction for CGM, demonstrating through theoretical analysis that the approach is effective for both regression (l2-regularized least squares) and classification (logistic regression) tasks.

The remainder of this article is organized as follows. We briefly review some definitions and related properties in Sect 2. Sect 3 describes the ASCG algorithm in detail. The convergence analysis is given in Sect 4. In Sect 5, we carry out practical experiments to illustrate the superiority of the ASCG algorithm. Finally, the conclusion is given in Sect 6.

2 Notations and preliminaries

In this paper, we denote the d-dimensional Euclidean space by $\mathbb{R}^d$. We use $\|\cdot\|$ to denote the standard $l_2$ norm, and let $\langle \cdot, \cdot \rangle$ be the inner product that induces it. We use $\mathrm{dom}\, f$ to denote the domain of an extended-real-valued function $f$. We let $\nabla f(x)$ be the gradient of a continuously differentiable function f at the point $x$. For a real number a, we denote by $\lfloor a \rfloor$ the largest integer that does not exceed it. For a stochastic algorithm $\mathcal{A}$, we use $\mathbb{E}$ to denote the total expectation with respect to the whole iteration process of $\mathcal{A}$. For a random variable i, we denote its expectation by $\mathbb{E}_i$.

Below we provide several basic definitions associated with their properties.

Definition 1. (L-smoothness) [37, Definition 5.1] A continuously differentiable function $f$ is L-smooth if and only if

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

In this case we say that f has a Lipschitz continuous gradient.

Below is an important property in terms of Lipschitz continuous gradient, which will be used frequently in this paper.

Lemma 1. [38, Theorem 2.1.5] If a continuously differentiable function $f$ has an L-Lipschitz continuous gradient, then the following holds:

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

If we further assume that f is convex, then

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Definition 2. [38, Definition 2.1.3] A function $f$ is continuously differentiable and $\mu$-strongly convex if

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Lemma 2. [38, Theorem 2.1.5] Suppose that f is continuously differentiable and strongly convex with parameter $\mu$. Let x* be the unique minimizer of f. Then

$$f(x) - f(x^*) \le \frac{1}{2\mu}\|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^d$$

holds.

3 The description of ASCG algorithm

This section provides the main ideas of the ASCG algorithm in detail. We begin by introducing the following update rule of CGM:

$$x_{k+1} = x_k + \alpha_k d_k, \qquad d_k = \begin{cases} -g_k, & k = 0, \\ -g_k + \beta_k d_{k-1}, & k \ge 1, \end{cases}$$

where $g_k$ denotes the gradient of the objective function at the current point $x_k$ and $\beta_k$ is the conjugate parameter. The stepsize $\alpha_k$ can be determined by line search, such as the Armijo condition or the standard Wolfe condition. A commonly used, more stringent condition is the strong Wolfe condition:

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k g_k^\top d_k, \qquad |g(x_k + \alpha_k d_k)^\top d_k| \le c_2 |g_k^\top d_k|,$$

where the parameters satisfy $0 < c_1 < c_2 < 1$. The first condition is called the sufficient decrease condition, and the second is called the curvature condition. To solve problem (1) with a stochastic method, we introduce the random strong Wolfe condition. Assume that we sample randomly at the current point $x_k$; then we impose

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k g_k^\top d_k, \qquad |\tilde{g}_k^\top d_k| \le c_2 |g_k^\top d_k|, \qquad (2)$$

where $\tilde{g}_k$ and $g_k$ are stochastic gradients evaluated at $x_k + \alpha_k d_k$ and $x_k$, respectively.
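In code, checking condition (2) for a trial stepsize is straightforward. The sketch below (Python/NumPy) assumes hypothetical minibatch oracles `f_batch` and `grad_batch`; a practical strong Wolfe line search would additionally bracket and zoom over alpha.

```python
import numpy as np

def satisfies_stochastic_wolfe(f_batch, grad_batch, x, d, alpha, c1, c2):
    """Check the sampled strong Wolfe conditions (2) for a trial stepsize alpha."""
    fx, gx = f_batch(x), grad_batch(x)
    x_new = x + alpha * d
    sufficient_decrease = f_batch(x_new) <= fx + c1 * alpha * (gx @ d)
    curvature = abs(grad_batch(x_new) @ d) <= c2 * abs(gx @ d)
    return sufficient_decrease and curvature
```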

To realize the acceleration, we introduce the auxiliary point $z_k = x_k + \alpha_k d_k$ (see Algorithm 1 below). It follows from the first Wolfe condition in (2) that

$$f(z_k) \le f(x_k) + c_1 \alpha_k v_k^\top d_k, \qquad (3)$$

where $v_k$ is the variance-reduced stochastic gradient (see step 14 in Algorithm 1 below). We now introduce a new iterate $x_{k+1} = x_k + \theta_k \alpha_k d_k$, where $\theta_k > 0$ is a parameter to be determined in such a way as to improve the behavior of the algorithm. By the mean value theorem, we have

$$f(z_k) = f(x_k) + \alpha_k \nabla f(x_k)^\top d_k + \frac{1}{2}\alpha_k^2\, d_k^\top \nabla^2 f(\bar{x}_k)\, d_k \qquad (4)$$

and

$$f(x_k + \theta \alpha_k d_k) = f(x_k) + \theta \alpha_k \nabla f(x_k)^\top d_k + \frac{\theta^2}{2}\alpha_k^2\, d_k^\top \nabla^2 f(\tilde{x}_k)\, d_k, \qquad (5)$$

where $\bar{x}_k$ and $\tilde{x}_k$ lie on the segments $[x_k, z_k]$ and $[x_k, x_k + \theta\alpha_k d_k]$, respectively. By combining Eqs (4) and (5) and replacing the exact gradients with their variance-reduced estimates, we obtain

$$f(x_k + \theta \alpha_k d_k) \approx f(x_k) + \theta a_k - \frac{\theta^2}{2} b_k, \qquad (6)$$

where $a_k = \alpha_k v_k^\top d_k$ and $b_k = -\alpha_k (v_{z_k} - v_k)^\top d_k$, with $v_{z_k}$ the variance-reduced gradient at $z_k$. Taking expectation conditioned on $S_k$ on both sides of Eq (6), we have

$$\mathbb{E}_{S_k}\big[f(x_k + \theta \alpha_k d_k)\big] \approx f(x_k) + \theta a_k - \frac{\theta^2}{2} b_k. \qquad (7)$$

Let us denote

$$\varphi_k(\theta) = \theta a_k - \frac{\theta^2}{2} b_k,$$

where $b_k \le 0$ (see Lemma 4 in Sect 4) and $d_k^\top \nabla^2 f(\cdot)\, d_k \ge 0$ for the convex function f. Obviously, $\varphi_k$ is a quadratic function with respect to $\theta$. A simple calculation shows that $\varphi_k$ reaches its minimum when

$$\theta_k = \frac{a_k}{b_k}.$$

In practice, $\theta_k$ remains well-defined, as the line search ensures $v_k^\top d_k < 0$ and $b_k \neq 0$ for valid descent directions. Taking $\theta = \theta_k$ in Eq (7), and since $\varphi_k(\theta_k) = a_k^2/(2b_k)$, we see that

$$\mathbb{E}_{S_k}\big[f(x_{k+1})\big] \le f(x_k) + \frac{a_k^2}{2 b_k}. \qquad (8)$$

This leads to a potential improvement in the expected values of f whenever the condition $b_k < 0$ is satisfied. By combining inequalities (3) and (8), we obtain

$$\mathbb{E}_{S_k}\big[f(x_{k+1})\big] \le f(x_k) + \frac{\theta_k}{2}\,\alpha_k v_k^\top d_k. \qquad (9)$$

Since $v_k^\top d_k < 0$ ($d_k$ is a descent direction), the expected function value decreases.
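A few lines suffice to compute the acceleration factor from quantities the algorithm already has at hand. The sketch below assumes `v`, `v_z`, and `d` are NumPy vectors standing in for $v_k$, $v_{z_k}$, and $d_k$; the fallback value is an illustrative safeguard, not part of the paper's derivation.

```python
def acceleration_factor(alpha, v, v_z, d):
    # a_k = alpha * v_k^T d_k (negative for a descent direction)
    a_k = alpha * (v @ d)
    # b_k = -alpha * (v_{z_k} - v_k)^T d_k (nonpositive for convex f)
    b_k = -alpha * ((v_z - v) @ d)
    # theta_k = a_k / b_k minimizes the quadratic model; guard against b_k = 0
    return a_k / b_k if b_k < 0 else 1.0
```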

Algorithm 1 Accelerated stochastic conjugate gradient: ASCG.

Input: The initial point $\tilde{x}_0$, the Wolfe condition parameters $0 < c_1 < c_2 < 1$, the minibatch size b, and the epoch length m.

1: for $s = 0, 1, 2, \ldots$ do

2:   $x_0 = \tilde{x}_s$

3:   $\tilde{\mu} = \nabla f(\tilde{x}_s)$

4:   $v_0 = \tilde{\mu}$

5:   $d_0 = -v_0$

6:   Draw samples $S_0$ uniformly at random from $\{1, \ldots, n\}$

7:   for $k = 0, 1, \ldots, m-1$ do

8:     Use the Wolfe line search rule (2) to obtain the stepsize $\alpha_k$

9:     Draw samples $B_k$ uniformly at random from $\{1, \ldots, n\}$

10:     Compute: $z_k = x_k + \alpha_k d_k$, $v_{z_k} = \nabla f_{B_k}(z_k) - \nabla f_{B_k}(\tilde{x}_s) + \tilde{\mu}$

11:     Compute: $a_k = \alpha_k v_k^\top d_k$, $b_k = -\alpha_k (v_{z_k} - v_k)^\top d_k$, and $\theta_k = a_k / b_k$

12:     Compute: $x_{k+1} = x_k + \theta_k \alpha_k d_k$

13:     Draw samples $S_{k+1}$ uniformly at random from $\{1, \ldots, n\}$

14:     Compute: $v_{k+1} = \nabla f_{S_{k+1}}(x_{k+1}) - \nabla f_{S_{k+1}}(\tilde{x}_s) + \tilde{\mu}$

15:     Compute conjugate parameter: $\beta_{k+1} = \|v_{k+1}\|^2 / \|v_k\|^2$

16:     Compute: $d_{k+1} = -v_{k+1} + \beta_{k+1} d_k$

17:    end for

18:   Option I: $\tilde{x}_{s+1} = x_m$

19:   Option II: $\tilde{x}_{s+1}$ chosen uniformly at random from $\{x_k\}_{k=0}^{m-1}$

20: end for

21: $\bar{x}$ chosen uniformly at random from the computed snapshots $\{\tilde{x}_s\}$.

Output: $\bar{x}$.
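For readers who prefer code, the following Python/NumPy sketch mirrors Algorithm 1 under simplifying assumptions: a plain backtracking search enforces only the sufficient-decrease part of (2) instead of a full strong Wolfe search, and `f_full`, `grad_full`, `grad_batch` are hypothetical oracles. It is an illustrative sketch, not a faithful reproduction of the paper's implementation.

```python
import numpy as np

def ascg(x0, f_full, grad_full, grad_batch, n, m=50, b=16,
         c1=1e-4, epochs=10, rng=None):
    """Sketch of ASCG (Algorithm 1): SVRG-style variance reduction,
    Fletcher-Reeves directions, and the acceleration factor theta_k = a_k/b_k."""
    rng = np.random.default_rng(rng)
    x_tilde = x0.copy()
    for s in range(epochs):
        mu = grad_full(x_tilde)                 # full gradient at the snapshot
        x, v, d = x_tilde.copy(), mu.copy(), -mu.copy()
        for k in range(m):
            # Backtracking on the sufficient-decrease condition,
            # a stand-in for the strong Wolfe rule (2).
            alpha = 1.0
            while f_full(x + alpha * d) > f_full(x) + c1 * alpha * (v @ d):
                alpha *= 0.5
            Bk = rng.choice(n, size=b, replace=False)
            z = x + alpha * d
            v_z = grad_batch(z, Bk) - grad_batch(x_tilde, Bk) + mu
            a_k = alpha * (v @ d)
            b_k = -alpha * ((v_z - v) @ d)
            theta = a_k / b_k if b_k < 0 else 1.0   # fall back to a plain step
            x = x + theta * alpha * d
            Sk = rng.choice(n, size=b, replace=False)
            v_new = grad_batch(x, Sk) - grad_batch(x_tilde, Sk) + mu
            beta = (v_new @ v_new) / (v @ v)        # Fletcher-Reeves parameter
            d = -v_new + beta * d
            v = v_new
        x_tilde = x                                  # Option I
    return x_tilde
```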

4 Convergence analysis

In this section, we mainly study the convergence rate of the proposed Algorithm 1. We start by making the following assumptions on the objective function in problem (1).

Assumption 1. For the objective function in problem (1):

(1) (smoothness) Each function $f_i$, $i = 1, \ldots, n$, is continuously differentiable and $L_i$-smooth.

(2) (convexity) Each function $f_i$, $i = 1, \ldots, n$, is convex.

(3) (strong convexity) The whole finite-sum function f is $\mu$-strongly convex.

These assumptions are very common in both stochastic and deterministic convex optimization. Assumption 1 implies that the whole function f is also L-smooth, with $L = \max_i L_i$. Below we provide a certain upper bound for the variance of the gradient estimate, of the type frequently used in stochastic optimization algorithms for convex problems.

Assumption 2. The ASCG algorithm is implemented with a step size $\alpha_k$ that lies in a bounded interval $[\alpha_{\min}, \alpha_{\max}]$ with $0 < \alpha_{\min} \le \alpha_{\max}$ and satisfies condition (2) with $0 < c_1 < c_2 < 1/2$.

Assumption 3. There exists $\lambda > 0$ such that

$$\mathbb{E}\|v_{k+1}\|^2 \le \lambda\, \mathbb{E}\|v_k\|^2,$$

where $v_k$ is the variance-reduced gradient estimator at the point $x_k$ of Algorithm 1, i.e., $v_k = \nabla f_{S_k}(x_k) - \nabla f_{S_k}(\tilde{x}_s) + \nabla f(\tilde{x}_s)$.

Assumption 4. Let $\{x_k\}$ be a sequence generated by Algorithm 1. There exists some $c > 0$ such that

$$\|v_k\| \le c\, \|\nabla f(x_k)\| \quad \text{for all } k.$$

Remark 1. These assumptions are standard in the stochastic conjugate gradient literature [24,26,27]. Unbounded step sizes often lead to divergence in stochastic settings, so Assumption 2, which restricts the step size to a bounded interval, is a standard and necessary practice to balance convergence speed and numerical stability. Assumption 3 enforces that the norm of the variance-reduced gradient does not grow across iterations; it reflects a reasonable decay (or controlled growth) of the estimation "noise" in the gradient. Assumption 4 bounds the "quality" of the variance-reduced gradient estimator relative to the true gradient $\nabla f(x_k)$. Under Lipschitz continuity of the gradient of f, one can show (via variance bounds of empirical gradients) that such estimators have a bounded second moment, and by the Cauchy-Schwarz inequality and gradient boundedness, this makes $\|v_k\|$ comparable to $\|\nabla f(x_k)\|$. The step size bounds in Assumption 2 are natural in this setting, and combined with the gradient-related bounds from variance reduction, the constant c can be verified to lie in a bounded interval.
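Assumption 4 is also easy to monitor empirically along a run. The following sketch (hypothetical `full_grad` oracle and recorded iterates/estimates) tracks the ratio that the assumption bounds:

```python
import numpy as np

def assumption4_ratios(xs, vr_grads, full_grad):
    """Track c_k = ||v_k|| / ||grad f(x_k)|| along iterates to check that the
    variance-reduced estimator stays comparable to the true gradient."""
    return [np.linalg.norm(v) / max(np.linalg.norm(full_grad(x)), 1e-12)
            for x, v in zip(xs, vr_grads)]
```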

Lemma 3. [24] Suppose that the ASCG algorithm is implemented with a stepsize $\alpha_k$ that satisfies the strong Wolfe condition (2) with $c_2 \in (0, 1/2)$. Then the FR conjugate gradient method generates descent directions $d_k$ that satisfy

$$-\frac{1}{1 - c_2} \le \frac{v_k^\top d_k}{\|v_k\|^2} \le -\frac{1 - 2c_2}{1 - c_2}.$$

Lemma 4. Let $\{x_k\}$ be a sequence generated by Algorithm 1. Then $b_k \le 0$ holds for all k.

Proof:

where the first inequality follows from the second Wolfe condition in (2), the second inequality comes from Lemma 3, the last equality uses the definition of $b_k$, and the final inequality follows from Assumption 4. □

Lemma 5. Suppose that Assumption 3 holds. Then, for any k in a fixed epoch s, we have

$$\mathbb{E}\|v_k\|^2 \le 4L\big(f(x_k) - f(x^*) + f(\tilde{x}_s) - f(x^*)\big),$$

where $x^*$ denotes the minimizer of f and $L = \max_i L_i$.

A similar result can be found in [22, Theorem 2].

In the following, we present the main result of this paper.

Theorem 1. Let $\{x_k\}$ be a sequence generated by Algorithm 1. Suppose that Assumptions 1-4 hold and that each $d_k$ satisfies the sufficient descent property of Lemma 3. Then the linear convergence

$$\mathbb{E}\big[f(\tilde{x}_{s+1}) - f(x^*)\big] \le \rho\, \mathbb{E}\big[f(\tilde{x}_s) - f(x^*)\big]$$

holds, where $\rho \in (0, 1)$ is a constant determined by $\mu$, L, c, $\lambda$, the step size bounds, and the epoch length m.

Proof: It follows from the Lipschitz continuity of $\nabla f$ that

$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^\top (x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2.$$

Taking expectation conditioned on Sk on both sides of this inequality, we have

(10)

By combining Lemma 4, Lemma 5 and (10), we have

(11)

Then, using (8) and (11), we have

(12)

On the other hand,

(13)

Substituting inequality (13) into inequality (12), we have

(14)

By applying Lemmas 1 and 2 to inequality (14), we further have

Taking the total expectation on both sides of this inequality, we obtain

(15)

Summing over $k = 0, \ldots, m-1$ within epoch s and using a telescoping sum, we have

(16)

On the other hand,

(17)

Substituting (17) into (16), we get

(18)

Rearranging (18) yields the claimed linear convergence, which completes the proof. □

5 Numerical experiments

In this section, we evaluate the practical performance of the proposed ASCG algorithm through numerical experiments on widely used machine learning benchmarks. We compare ASCG with several state-of-the-art methods using datasets sourced from LIBSVM: svmguide1, a9a, w8a, gisette, and ijcnn1. The datasets can be obtained from https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/. Table 1 provides a detailed description of these datasets. All experiments were conducted in MATLAB 2017b on a 64-bit PC equipped with an Intel(R) Core(TM) i7-6700HQ CPU (2.60 GHz) and 16 GB of RAM. The experimental model is described in Sect 5.1. Sect 5.2 introduces the parameter settings for all comparative algorithms used in the experiments. The comparative performance of ASCG against other state-of-the-art methods is presented in Sect 5.3, and an analysis of the properties of ASCG is provided in Sect 5.4.

5.1 The experimental model

In our experiments, we consider the following $l_2$-regularized logistic regression for binary classification:

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2, \qquad (19)$$

where $\{(a_i, b_i)\}_{i=1}^n \subset \mathbb{R}^d \times \{-1, +1\}$ is a given collection of data, with $a_i$ the ith training sample and $b_i$ the corresponding label, and $\lambda > 0$ is the regularization parameter. The regularized logistic regression model is widely used in machine learning to benchmark optimization algorithms. It is evident that the $l_2$-regularized logistic regression conforms to Assumption 1 with $L_i = \frac{1}{4}\|a_i\|^2 + \lambda$ and $\mu = \lambda$.
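A compact sketch of the objective (19), its gradient, and the constants noted above is given below (Python/NumPy; `A` holds the samples row-wise and `b` the ±1 labels — illustrative names, not from the paper):

```python
import numpy as np

def logreg_value_grad(x, A, b, lam):
    """Objective (19) and its gradient for l2-regularized logistic regression."""
    margins = -b * (A @ x)                       # -b_i * a_i^T x
    value = np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * (x @ x)
    sigma = 1.0 / (1.0 + np.exp(-margins))       # derivative of log(1 + e^t)
    grad = A.T @ (-b * sigma) / A.shape[0] + lam * x
    return value, grad

def smoothness_constants(A, lam):
    """Per-component L_i = ||a_i||^2 / 4 + lam; strong convexity mu = lam."""
    return 0.25 * np.sum(A * A, axis=1) + lam, lam
```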

5.2 Implementation of algorithms

We compare the proposed ASCG algorithm with several state-of-the-art conjugate gradient methods, both stochastic and deterministic, for solving the finite-sum problem (19). Specifically, the compared algorithms include: conjugate gradient with variance reduction (CGVR) [22], the stochastic conjugate gradient algorithm (SCGA) [24], and two classical deterministic conjugate gradient methods, namely the Fletcher-Reeves conjugate gradient method (FR-CGM) [4] and the accelerated conjugate gradient method (ACG) [29]. The parameter settings for each algorithm are detailed below:

CGVR: This is the first stochastic conjugate gradient method that integrates a variance reduction technique with FR-CGM. It involves four parameters: the Wolfe line search parameters $c_1$ and $c_2$, the minibatch size b, and the number of inner iterations m; the values used in our experiments are listed in Table 2.

SCGA: A recent stochastic conjugate gradient method that incorporates the SAGA framework into FR-CGM. Its minibatch size b, inner iteration count m, and Wolfe condition parameters $c_1$ and $c_2$ are set according to the recommendations in Table 2 of [24]; see Table 2.

FR-CGM: A classical deterministic conjugate gradient method, abbreviated as CG in this paper. It requires only the two Wolfe line search parameters $c_1$ and $c_2$, which we set as in Table 2 in all experiments.

ACG: A well-established accelerated conjugate gradient method developed by Andrei, which enhances convergence via adaptive step-size adjustments. Like FR-CGM, it relies solely on the Wolfe parameters $c_1$ and $c_2$, set as in Table 2.

ASCG: The proposed accelerated stochastic conjugate gradient method. Although ASCG employs acceleration mechanisms on top of the variance reduction used by CGVR and SCGA, it likewise involves four parameters: the Wolfe line search parameters $c_1$ and $c_2$ (with $0 < c_1 < c_2 < 1/2$), the minibatch size b, and the inner loop count m; our settings are given in Table 2.

For clarity, all parameter configurations are summarized in Table 2. During implementation, all feature values were normalized. Each algorithm was initialized from the same randomly generated starting point drawn from a standard normal distribution. To ensure a fair comparison, the regularization parameter $\lambda$ was set to the same commonly used value for all methods.

In the theory, we use the first Wolfe condition of (2), i.e.,

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k v_k^\top d_k,$$

in which the full objective f is evaluated. In the experiments, however, for the compared stochastic methods we use the standard stochastic version of the Wolfe condition, in which the sampled objective replaces f. We refer the reader to [22,24] for similar settings. As the later experimental results show, this choice also yields good practical results.
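In code, the distinction amounts to which objective the line search queries. A sketch of the sampled sufficient-decrease check (hypothetical `f_batch` oracle over the sampled indices):

```python
def sampled_sufficient_decrease(f_batch, v, x, d, alpha, c1, batch):
    # Experimental variant: the sampled loss f_{B_k} replaces the full f
    # in the first (sufficient decrease) Wolfe condition.
    return f_batch(x + alpha * d, batch) <= f_batch(x, batch) + c1 * alpha * (v @ d)
```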

5.3 Comparison with other related algorithms

To evaluate the performance of the proposed algorithm, we compare ASCG with several competitive conjugate gradient methods for solving problem (19). The numerical results are presented in Figs 1-7. Following common practice for stochastic methods in machine learning, we measure the objective function value in terms of the number of effective passes. In these figures, the x-axis represents the number of effective passes (epochs), while the y-axis corresponds to the function value, training accuracy, and testing accuracy, respectively. Overall, the results demonstrate that ASCG performs comparably or superiorly to the other conjugate gradient methods. Specifically, variance-reduced (VR-based) algorithms consistently outperform non-VR methods. Moreover, ASCG exceeds the performance of both CGVR and SCGA, while ACG outperforms CG, empirically confirming the advantage of accelerated methods over non-accelerated ones.

Fig 1. The change curve of objective function value against the number of effective passes on four datasets of different methods.

https://doi.org/10.1371/journal.pone.0338720.g001

Fig 2. The change curve of objective function value against the gradient calculation on four datasets of different methods.

https://doi.org/10.1371/journal.pone.0338720.g002

Fig 3. The change curve of objective function value against the gradient calculation on four datasets of different methods.

https://doi.org/10.1371/journal.pone.0338720.g003

Fig 4. The effects of different minibatch size on objective function value of ASCG algorithm.

https://doi.org/10.1371/journal.pone.0338720.g004

Fig 5. The effects of different minibatch size on training accuracy of ASCG algorithm.

https://doi.org/10.1371/journal.pone.0338720.g005

Fig 6. The effects of different minibatch size on testing accuracy of ASCG algorithm.

https://doi.org/10.1371/journal.pone.0338720.g006

Fig 7. The performance of ASCG algorithm on high-dimensional dataset.

https://doi.org/10.1371/journal.pone.0338720.g007

In Table 3, we report the total running time, maximum training accuracy, and maximum testing accuracy of three stochastic conjugate gradient methods, each using the same minibatch size, across four datasets. The results indicate that the total running time of ASCG is nearly identical to that of CGVR on all datasets, whereas SCGA requires slightly more time. In general, ASCG achieves higher training and testing accuracy than the other two methods in most cases. Although CGVR and SCGA occasionally perform better, Figs 1-3 reveal that ASCG converges earlier during the initial stages.

Table 3. The performance of different stochastic CGM for logistic regression on four datasets.

https://doi.org/10.1371/journal.pone.0338720.t003

To assess performance on high-dimensional data, we specifically evaluated the gisette dataset from LIBSVM. The results, shown in Fig 7, indicate that ASCG maintains superior performance even in this setting. Notably, ASCG converges faster in both objective function value and test accuracy compared to all baseline methods (CG, CGVR, ACG, and SCGA), supporting its scalability and effectiveness for large-scale problems.

5.4 Effect of the minibatch size in ASCG algorithm

In this subsection, we examine the effect of the minibatch size on ASCG across all datasets. For each minibatch size, the initial step size was tuned optimally to ensure convergence. Smaller minibatch sizes were used for svmguide1 and a9a, while larger sizes were employed for ijcnn1 and w8a. The results, illustrated in Figs 4-7, show that ASCG generally achieves better performance with larger minibatch sizes, whereas excessively small minibatches can lead to divergence. As shown particularly in Fig 7, the algorithm exhibits more stable behavior and improved generalization when the minibatch size is set near the upper end of the tested range.

6 Conclusion

In this paper, we propose an accelerated stochastic conjugate gradient algorithm, termed ASCG, which incorporates a variance reduction technique for solving a class of convex empirical risk minimization problems. In contrast to existing stochastic conjugate gradient methods such as CGVR and SCGA, the proposed algorithm introduces an adaptive acceleration mechanism by scaling the step size via a deflation factor, thereby simplifying the parameter selection process. We provide a rigorous theoretical analysis demonstrating that ASCG achieves an expected linear convergence rate under the Fletcher-Reeves update for strongly convex problems, along with a significantly improved expected reduction in function values. Extensive numerical experiments on four large-scale datasets, performed in the context of l2-regularized logistic regression for binary classification, confirm that the proposed algorithm outperforms several state-of-the-art methods.

Acknowledgments

The authors would like to thank the editor and the anonymous referees for their valuable suggestions and comments which have greatly improved the presentation of this paper.

References

  1. Mo Z, Ouyang C, Pham H, Yuan G. A stochastic recursive gradient algorithm with inertial extrapolation for non-convex problems and machine learning. Int J Mach Learn & Cyber. 2025;16(7-8):4545–59.
  2. Ouyang C, Jian A, Zhao X, Yuan G. Difference-enhanced adaptive momentum methods for non-convex stochastic optimization in image classification. Digital Signal Processing. 2025;161:105118.
  3. Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems. J Res Natl Bur Stan. 1952;49(6):409.
  4. Fletcher R, Reeves CM. Function minimization by conjugate gradients. The Computer Journal. 1964;7(2):149–54.
  5. Polak E, Ribière G. Note sur la convergence de méthodes de directions conjuguées. Rev Fr Inform Rech Oper. 1969;3(16):35–43.
  6. Dai YH, Yuan Y. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J Optim. 1999;10(1):177–82.
  7. Ma G, Jin J, Jian J, Yin J, Han D. A modified inertial three-term conjugate gradient projection method for constrained nonlinear equations with applications in compressed sensing. Numer Algor. 2022;92(3):1621–53.
  8. Jiang X, Liao W, Yin J, Jian J. A new family of hybrid three-term conjugate gradient methods with applications in image restoration. Numer Algor. 2022;91(1):161–91.
  9. Aminifard Z, Babaie-Kafaki S. Dai-Liao extensions of a descent hybrid nonlinear conjugate gradient method with application in signal processing. Numer Algor. 2021;89(3):1369–87.
  10. Schraudolph NN, Graepel T. Conjugate directions for stochastic gradient descent. In: International Conference on Artificial Neural Networks. 2002. p. 1351–6.
  11. Huang R, Qin Y, Liu K, Yuan G. Biased stochastic conjugate gradient algorithm with adaptive step size for nonconvex problems. Expert Systems with Applications. 2024;238:121556.
  12. Jiang H, Yu X, Wilford PA. Digital predistortion using stochastic conjugate gradient method. IEEE Trans on Broadcast. 2012;58(1):114–24.
  13. Huang W, Zhou H-W. Least-squares seismic inversion with stochastic conjugate gradient method. J Earth Sci. 2015;26(4):463–70.
  14. Sharma S, Rastogi R. Stochastic conjugate gradient descent twin support vector machine for large scale pattern classification. Advances in Artificial Intelligence. 2018;11320:590–602.
  15. Liu J, Lin L, Ren H, Gu M, Wang J, Youn G, et al. Building neural network language model with POS-based negative sampling and stochastic conjugate gradient descent. Soft Comput. 2018;22(20):6705–17.
  16. Le QV, Ngiam J, Coates A, Lahiri A, Prochnow B, Ng AY. On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning. 2011. p. 265–72.
  17. Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems. 2013. p. 1–9.
  18. Defazio A, Bach F, Lacoste-Julien S. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems. 2014;27:1–9.
  19. Nguyen LM, Liu J, Scheinberg K, Takac M. SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning. 2017. p. 1–9.
  20. Fang C, Li CJ, Lin Z, Zhang T. SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Advances in Neural Information Processing Systems. 2018. p. 1–11.
  21. Li Z, Bao H, Zhang X, Richtarik P. PAGE: a simple and optimal probabilistic gradient estimator for nonconvex optimization. In: Proceedings of the 38th International Conference on Machine Learning. 2021. p. 6286–95.
  22. Jin X-B, Zhang X-Y, Huang K, Geng G-G. Stochastic conjugate gradient algorithm with variance reduction. IEEE Trans Neural Netw Learn Syst. 2019;30(5):1360–9. pmid:30281486
  23. Kobayashi Y, Iiduka H. Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning. arXiv preprint 2020. https://arxiv.org/abs/2003.00231v2
  24. Kou C, Yang H. A mini-batch stochastic conjugate gradient algorithm with variance reduction. J Glob Optim. 2022;87(2–4):1009–25.
  25. Ouyang C, Lu C, Zhao X, Huang R, Yuan G, Jiang Y. Stochastic three-term conjugate gradient method with variance technique for non-convex learning. Stat Comput. 2024;34(3).
  26. Yang Z. Large-scale machine learning with fast and stable stochastic conjugate gradient. Computers & Industrial Engineering. 2022;173:108656.
  27. Yang Z. Adaptive stochastic conjugate gradient for machine learning. Expert Systems with Applications. 2022;206:117719.
  28. Lenard ML. Accelerated conjugate direction methods for unconstrained optimization. J Optim Theory Appl. 1978;25(1):11–31.
  29. Andrei N. Acceleration of conjugate gradient algorithms for unconstrained optimization. Applied Mathematics and Computation. 2009;213(2):361–9.
  30. Andrei N. Accelerated hybrid conjugate gradient algorithm with modified secant condition for unconstrained optimization. Numer Algor. 2009;54(1):23–46.
  31. Andrei N. An accelerated subspace minimization three-term conjugate gradient algorithm for unconstrained optimization. Numer Algor. 2013;65(4):859–74.
  32. Sun W, Liu H, Liu Z. A class of accelerated subspace minimization conjugate gradient methods. J Optim Theory Appl. 2021;190(3):811–40.
  33. Sun W, Liu H, Liu Z. Several accelerated subspace minimization conjugate gradient methods based on regularization model and convergence rate analysis for nonconvex problems. Numer Algor. 2022;91(4):1677–719.
  34. Jian J, Chen W, Jiang X, Liu P. A three-term conjugate gradient method with accelerated subspace quadratic optimization. J Appl Math Comput. 2021;68(4):2407–33.
  35. Hu M, Lou Y, Wang B, Yan M, Yang X, Ye Q. Accelerated sparse recovery via gradient descent with nonlinear conjugate gradient momentum. J Sci Comput. 2023;95(1):33.
  36. Karimi S, Vavasis S. Nonlinear conjugate gradient for smooth convex functions. arXiv preprint 2023. arXiv:2111.11613v2
  37. Beck A. First-order methods in optimization. Philadelphia: SIAM; 2017.
  38. Nesterov Y. Lectures on convex optimization. Springer; 2018. https://doi.org/10.1007/978-3-319-91578-4