
Accelerated Stochastic Conjugate Gradient for a class of convex optimization

Abstract

The conjugate gradient method is widely recognized as a foundational technique for large-scale unconstrained optimization. In this work, we introduce an Accelerated Stochastic Conjugate Gradient (ASCG) algorithm designed for a class of convex empirical risk minimization problems. The proposed ASCG method integrates a variance-reduced gradient estimator, inspired by modern stochastic variance reduction techniques, to control noise and improve stability in the optimization process. Moreover, the ASCG algorithm incorporates a novel acceleration mechanism via a deflation factor on the step size, which is shown to achieve faster practical convergence than the baseline stochastic FR method. We provide a rigorous theoretical analysis demonstrating that ASCG achieves an expected linear convergence rate under strong convexity assumptions and attains a superior reduction in function values compared to non-accelerated stochastic counterparts. Extensive numerical experiments on four widely used benchmark datasets confirm that ASCG consistently outperforms state-of-the-art stochastic optimization methods.

1 Introduction

This paper primarily addresses the following empirical risk minimization (ERM) problem:

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where n is the number of training samples, finite yet extremely large, and d represents the dimensionality of the feature space. In this paper, we assume that the set of optimal solutions of problem (1) is nonempty. It is evident that (1) is a finite-sum problem, which commonly arises in the fields of statistics and machine learning. In such problems, a set of training data $\{(a_i, b_i)\}_{i=1}^n$ is given for data fitting. Popular loss functions include the $l_2$-regularized least squares $f_i(x) = (a_i^\top x - b_i)^2 + \frac{\lambda}{2}\|x\|^2$ for regression analysis, as well as the $l_2$-regularized logistic regression $f_i(x) = \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2$ for binary classification.
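For concreteness, both example objectives above can be written as finite sums in a few lines of code. The following Python/NumPy sketch is illustrative only; the arrays `A` (samples, row-wise) and `b` (targets/labels) and the helper names are hypothetical, not part of the paper.

```python
import numpy as np

def ridge_component(x, a_i, b_i, lam):
    # f_i(x) = (a_i^T x - b_i)^2 + (lam/2) * ||x||^2
    r = a_i @ x - b_i
    return r * r + 0.5 * lam * (x @ x)

def logistic_component(x, a_i, b_i, lam):
    # f_i(x) = log(1 + exp(-b_i a_i^T x)) + (lam/2) * ||x||^2, with b_i in {-1, +1}
    return np.log1p(np.exp(-b_i * (a_i @ x))) + 0.5 * lam * (x @ x)

def erm_objective(x, A, b, lam, component):
    # f(x) = (1/n) * sum_i f_i(x), i.e., problem (1)
    n = A.shape[0]
    return np.mean([component(x, A[i], b[i], lam) for i in range(n)])
```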

Numerous efficient gradient-based methods have been employed to solve problem (1), including gradient descent and its variants [1,2]. Among these, the conjugate gradient method (CGM) stands out as a particularly notable class of gradient-based optimization techniques, renowned for its superior convergence properties in specific problem types. Originally introduced by Hestenes and Stiefel [3] for solving linear systems, CGM was later extended by Fletcher and Reeves [4] to nonlinear optimization problems, leading to the development of nonlinear CGM. Prominent nonlinear variants include the FR-CGM [4], PR-CGM [5], and DY-CGM [6]. CGM offers two key advantages over other methods: it converges faster than gradient descent and avoids the computational burden of Hessian matrix evaluation required in Newton’s method. As a result, CGM has been widely applied in machine learning domains such as compressed sensing [7], image restoration [8], and signal processing [9]. However, since CGM requires the computation of the full gradient at each iteration, it becomes impractical for problems involving extremely large datasets.

To enhance the applicability of CGM to large-scale unconstrained ERM problems, the randomized CGM was developed. Schraudolph and Graepel [10] were the first to incorporate ideas from CGM into a stochastic setting, experimentally demonstrating convergence rates orders of magnitude faster than stochastic gradient descent. Huang et al. [11] combined a stochastic recursive gradient approach with the Barzilai-Borwein technique to address nonconvex optimization, providing a theoretical convergence analysis and empirical validation in machine learning tasks. Randomized CGM has since been extensively applied across areas including digital predistortion [12], pattern classification [13], seismic inversion [14], neural language modeling [15], and deep learning [16]. Nevertheless, the high variance inherent in stochastic gradient estimates considerably impedes the convergence speed of randomized CGM.

Recent years have witnessed significant progress in variance reduction (VR) techniques, which have effectively addressed the high variance inherent in stochastic gradient estimates and substantially enhanced the convergence guarantees of stochastic optimization algorithms. By employing control variates to correct stochastic gradients, VR methods construct gradient estimators with asymptotically vanishing variance, thereby accelerating convergence both in theory and practice. Pioneering VR-based approaches such as SVRG [17] and SAGA [18] laid the foundation for this line of research. While both target convex empirical risk minimization problems, they diverge in mechanism: SVRG adopts a nested-loop structure that periodically refreshes the full gradient, whereas SAGA maintains a historical gradient matrix to enable incremental updates. Building on these ideas, more recent innovations such as SARAH [19], SPIDER [20], and PAGE [21] have further refined the trade-offs between computational cost, storage, and convergence rate, achieving even faster convergence under various settings. These advances set the stage for the integration of variance reduction into more complex optimization frameworks, including stochastic conjugate gradient methods.
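As a concrete illustration of the control-variate idea, a minimal SVRG-style estimator might look as follows. This is a sketch assuming a hypothetical component-gradient oracle `grad_i(x, i)`; it is not code from the paper.

```python
import numpy as np

def svrg_estimator(x, snapshot, full_grad_snapshot, grad_i, batch):
    """Variance-reduced gradient: g_B(x) - g_B(snapshot) + full_grad(snapshot).

    The correction term has zero mean, so the estimator remains unbiased,
    while its variance vanishes as x approaches the snapshot point.
    """
    g = np.zeros_like(x)
    for i in batch:
        g += grad_i(x, i) - grad_i(snapshot, i)
    return g / len(batch) + full_grad_snapshot
```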

The integration of VR techniques has opened new avenues for enhancing randomized CGM, leading to significant improvements in convergence efficiency and stability. For instance, Jin et al. [22] effectively combined SVRG with CGM to develop the CGVR algorithm, which achieves linear convergence under strong convexity and smoothness assumptions. In a different direction, Kobayashi and Iiduka [23] integrated Adam-type adaptive updating with nonlinear CGM, substantially boosting performance in deep neural network training. Further extending this line of work, Kou and Yang [24] proposed the SCGA method, drawing inspiration from SAGA and FR-CGM, and established its linear convergence for smooth and strongly convex objectives. More recently, Ouyang et al. [25] introduced a variance-aware three-term conjugate gradient method equipped with advanced line search techniques tailored for nonconvex optimization. While these VR-enhanced methods consistently outperform conventional CGM on large-scale empirical risk minimization problems, challenges related to online step-size selection persist. In response, Yang [26,27] proposed innovative solutions using local quadratic approximation and hyper-gradient descent, providing more robust and adaptive strategies for step-size control.

In parallel to advancements in stochastic variants, significant research efforts have been devoted to acceleration techniques aimed at further improving the convergence properties of CGM. Early foundational work by Lenard [28] introduced a family of accelerated CGMs drawing analogies to the Broyden family of quasi-Newton methods, establishing a theoretical bridge between these two classes of algorithms. Building on this, Andrei [29-31] developed accelerated CGM variants that achieve substantial reductions in function values through sophisticated step-size selection and direction updating strategies. More recently, Sun et al. [32,33] derived an acceleration parameter using quadratic interpolation models and introduced specific acceptance criteria to enhance robustness and efficiency in practical implementations. Jian et al. [34] reformulated the conjugate parameter via accelerated subspace quadratic optimization, effectively exploiting structural advantages of linear CGM in large-scale settings. Further extending its applicability, Hu et al. [35] incorporated adaptive momentum into nonlinear CGM, demonstrating notable improvements in convergence for sparse recovery problems. Most recently, Karimi and Vavasis [36] proposed a hybrid C+AG algorithm that integrates Nesterov's accelerated gradient framework with conventional CGM, achieving optimal performance for quadratic functions while maintaining competitive complexity bounds for general smooth convex functions. Collectively, these contributions not only underscore the enduring theoretical value of acceleration mechanisms in CGM but also greatly expand their practical utility in modern computational environments.

Despite these advances, the development of accelerated stochastic conjugate gradient methods remains relatively underexplored, particularly within the context of variance-reduced optimization. To address this gap, we propose a novel Accelerated Stochastic Conjugate Gradient (ASCG) algorithm that synergistically integrates acceleration mechanisms with stochastic variance reduction. Our approach embeds a conjugate gradient update within a stabilized SVRG-like framework, effectively mitigating gradient noise while preserving directional fidelity. Furthermore, we introduce an adaptive step-size scaling strategy that incorporates an acceleration factor to enhance convergence dynamics without compromising stability. We provide a rigorous theoretical analysis establishing that ASCG achieves an expected linear convergence rate for strongly convex empirical risk minimization problems, alongside a provably superior reduction in function values relative to existing non-accelerated stochastic CGMs. To the best of our knowledge, this work constitutes the first systematic effort to incorporate acceleration techniques into stochastic conjugate gradient frameworks, offering both foundational theoretical insights and tangible practical benefits.

The main contributions of this paper are as follows:

  • We propose the Accelerated Stochastic Conjugate Gradient (ASCG) algorithm, a novel framework that integrates SVRG-based variance reduction to mitigate stochastic gradient noise with a step-size acceleration factor to address the online step-size inefficiencies of randomized CGM.
  • We prove that ASCG achieves an expected linear convergence rate for strongly convex ERM problems, with rigorously derived acceleration guarantees that outperform non-accelerated stochastic CGM in function value reduction.
  • Our work fills a critical research gap by being the first to combine VR and acceleration techniques for stochastic CGM, a direction previously unaddressed in existing literature.
  • We unify acceleration techniques with variance reduction for CGM, demonstrating through theoretical analysis that the approach is effective for both regression (l2-regularized least squares) and classification (logistic regression) tasks.

The remainder of this article is organized as follows. We briefly review some definitions and related properties in Sect 2. Sect 3 describes the ASCG algorithm in detail. The convergence analysis is given in Sect 4. In Sect 5, we carry out practical experiments to illustrate the superiority of the ASCG algorithm. Finally, the conclusion is given in Sect 6.

2 Notations and preliminaries

In this paper, we denote the d-dimensional Euclidean space by $\mathbb{R}^d$. We use $\|\cdot\|$ to denote the standard $l_2$ norm, and let $\langle \cdot, \cdot \rangle$ be the inner product that induces it. We use $\mathrm{dom}\, f$ to denote the domain of an extended-real-valued function $f$. We let $\nabla f(x)$ be the gradient of a continuously differentiable function f at the point $x$. For a real number a, we denote by $\lfloor a \rfloor$ the largest integer that does not exceed it. For a stochastic algorithm $\mathcal{A}$, we use $\mathbb{E}$ to denote the total expectation with respect to the whole iteration process of $\mathcal{A}$. For a random variable i, we denote its expectation by $\mathbb{E}_i$.

Below we provide several basic definitions associated with their properties.

Definition 1. (L-smoothness) [37, Definition 5.1] A continuously differentiable function $f$ is L-smooth if and only if

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

In this case we say that f has a Lipschitz continuous gradient.

Below is an important property in terms of Lipschitz continuous gradient, which will be used frequently in this paper.

Lemma 1. [38, Theorem 2.1.5] If a continuously differentiable function $f$ has an L-Lipschitz continuous gradient, then the following holds:

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

If we further assume that f is convex, then

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Definition 2. [38, Definition 2.1.3] A function $f$ is continuously differentiable and $\mu$-strongly convex if

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

Lemma 2. [38, Theorem 2.1.5] Suppose that f is continuously differentiable and strongly convex with parameter $\mu$. Let x* be the unique minimizer of f. Then

$$f(x) - f(x^*) \le \frac{1}{2\mu}\|\nabla f(x)\|^2, \quad \forall x \in \mathbb{R}^d$$

holds.

3 The description of ASCG algorithm

This section provides the main ideas of the ASCG algorithm in detail. We begin by introducing the following update rule of CGM:

$$x_{k+1} = x_k + \alpha_k d_k, \qquad d_k = \begin{cases} -g_k, & k = 0, \\ -g_k + \beta_k d_{k-1}, & k \ge 1, \end{cases}$$

where $g_k$ denotes the gradient of the objective function at the current point $x_k$ and $\beta_k$ is the conjugate parameter. The stepsize $\alpha_k$ can be determined by line search, such as the Armijo condition or the standard Wolfe condition. A commonly used, more stringent condition is the strong Wolfe condition:

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k g_k^\top d_k, \qquad |g(x_k + \alpha_k d_k)^\top d_k| \le c_2 |g_k^\top d_k|,$$

where the parameters satisfy $0 < c_1 < c_2 < 1$. The first condition is called the sufficient decrease condition, and the second is called the curvature condition. To solve problem (1) with a stochastic method, we introduce the random strong Wolfe condition. Assume that we sample randomly at the current point $x_k$; then we impose

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k g_k^\top d_k, \qquad |\tilde{g}_k^\top d_k| \le c_2 |g_k^\top d_k|, \qquad (2)$$

where $\tilde{g}_k$ and $g_k$ are stochastic gradients evaluated at $x_k + \alpha_k d_k$ and $x_k$, respectively.
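In code, checking condition (2) for a trial stepsize is straightforward. The sketch below (Python/NumPy) assumes hypothetical minibatch oracles `f_batch` and `grad_batch`; a practical strong Wolfe line search would additionally bracket and zoom over alpha.

```python
import numpy as np

def satisfies_stochastic_wolfe(f_batch, grad_batch, x, d, alpha, c1, c2):
    """Check the sampled strong Wolfe conditions (2) for a trial stepsize alpha."""
    fx, gx = f_batch(x), grad_batch(x)
    x_new = x + alpha * d
    sufficient_decrease = f_batch(x_new) <= fx + c1 * alpha * (gx @ d)
    curvature = abs(grad_batch(x_new) @ d) <= c2 * abs(gx @ d)
    return sufficient_decrease and curvature
```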

To realize the acceleration, we introduce the auxiliary point $z_k = x_k + \alpha_k d_k$ (see Algorithm 1 below). It follows from the first Wolfe condition in (2) that

$$f(z_k) \le f(x_k) + c_1 \alpha_k v_k^\top d_k, \qquad (3)$$

where $v_k$ is the variance-reduced stochastic gradient (see step 14 in Algorithm 1 below). We now introduce a new iterate $x_{k+1} = x_k + \theta_k \alpha_k d_k$, where $\theta_k > 0$ is a parameter to be determined in such a way as to improve the behavior of the algorithm. By the mean value theorem, we have

$$f(z_k) = f(x_k) + \alpha_k \nabla f(x_k)^\top d_k + \frac{1}{2}\alpha_k^2\, d_k^\top \nabla^2 f(\bar{x}_k)\, d_k \qquad (4)$$

and

$$f(x_k + \theta \alpha_k d_k) = f(x_k) + \theta \alpha_k \nabla f(x_k)^\top d_k + \frac{\theta^2}{2}\alpha_k^2\, d_k^\top \nabla^2 f(\tilde{x}_k)\, d_k, \qquad (5)$$

where $\bar{x}_k$ and $\tilde{x}_k$ lie on the segments $[x_k, z_k]$ and $[x_k, x_k + \theta\alpha_k d_k]$, respectively. By combining Eqs (4) and (5) and replacing the exact gradients with their variance-reduced estimates, we obtain

$$f(x_k + \theta \alpha_k d_k) \approx f(x_k) + \theta a_k - \frac{\theta^2}{2} b_k, \qquad (6)$$

where $a_k = \alpha_k v_k^\top d_k$ and $b_k = -\alpha_k (v_{z_k} - v_k)^\top d_k$, with $v_{z_k}$ the variance-reduced gradient at $z_k$. Taking expectation conditioned on $S_k$ on both sides of Eq (6), we have

$$\mathbb{E}_{S_k}\big[f(x_k + \theta \alpha_k d_k)\big] \approx f(x_k) + \theta a_k - \frac{\theta^2}{2} b_k. \qquad (7)$$

Let us denote

$$\varphi_k(\theta) = \theta a_k - \frac{\theta^2}{2} b_k,$$

where $b_k \le 0$ (see Lemma 4 in Sect 4) and $d_k^\top \nabla^2 f(\cdot)\, d_k \ge 0$ for the convex function f. Obviously, $\varphi_k$ is a quadratic function with respect to $\theta$. A simple calculation shows that $\varphi_k$ reaches its minimum when

$$\theta_k = \frac{a_k}{b_k}.$$

In practice, $\theta_k$ remains well-defined, as the line search ensures $v_k^\top d_k < 0$ and $b_k \neq 0$ for valid descent directions. Taking $\theta = \theta_k$ in Eq (7), and since $\varphi_k(\theta_k) = a_k^2/(2b_k)$, we see that

$$\mathbb{E}_{S_k}\big[f(x_{k+1})\big] \le f(x_k) + \frac{a_k^2}{2 b_k}. \qquad (8)$$

This leads to a potential improvement in the expected values of f whenever the condition $b_k < 0$ is satisfied. By combining inequalities (3) and (8), we obtain

$$\mathbb{E}_{S_k}\big[f(x_{k+1})\big] \le f(x_k) + \frac{\theta_k}{2}\,\alpha_k v_k^\top d_k. \qquad (9)$$

Since $v_k^\top d_k < 0$ ($d_k$ is a descent direction), the expected function value decreases.
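A few lines suffice to compute the acceleration factor from quantities the algorithm already has at hand. The sketch below assumes `v`, `v_z`, and `d` are NumPy vectors standing in for $v_k$, $v_{z_k}$, and $d_k$; the fallback value is an illustrative safeguard, not part of the paper's derivation.

```python
def acceleration_factor(alpha, v, v_z, d):
    # a_k = alpha * v_k^T d_k (negative for a descent direction)
    a_k = alpha * (v @ d)
    # b_k = -alpha * (v_{z_k} - v_k)^T d_k (nonpositive for convex f)
    b_k = -alpha * ((v_z - v) @ d)
    # theta_k = a_k / b_k minimizes the quadratic model; guard against b_k = 0
    return a_k / b_k if b_k < 0 else 1.0
```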

Algorithm 1 Accelerated stochastic conjugate gradient: ASCG.

Input: The initial point $\tilde{x}_0$, the Wolfe condition parameters $0 < c_1 < c_2 < 1$, the minibatch size b, and the epoch length m.

1: for $s = 0, 1, 2, \ldots$ do

2:   $x_0 = \tilde{x}_s$

3:   $\tilde{\mu} = \nabla f(\tilde{x}_s)$

4:   $v_0 = \tilde{\mu}$

5:   $d_0 = -v_0$

6:   Draw samples $S_0$ uniformly at random from $\{1, \ldots, n\}$

7:   for $k = 0, 1, \ldots, m-1$ do

8:     Use the Wolfe line search rule (2) to obtain the stepsize $\alpha_k$

9:     Draw samples $B_k$ uniformly at random from $\{1, \ldots, n\}$

10:     Compute: $z_k = x_k + \alpha_k d_k$, $v_{z_k} = \nabla f_{B_k}(z_k) - \nabla f_{B_k}(\tilde{x}_s) + \tilde{\mu}$

11:     Compute: $a_k = \alpha_k v_k^\top d_k$, $b_k = -\alpha_k (v_{z_k} - v_k)^\top d_k$, and $\theta_k = a_k / b_k$

12:     Compute: $x_{k+1} = x_k + \theta_k \alpha_k d_k$

13:     Draw samples $S_{k+1}$ uniformly at random from $\{1, \ldots, n\}$

14:     Compute: $v_{k+1} = \nabla f_{S_{k+1}}(x_{k+1}) - \nabla f_{S_{k+1}}(\tilde{x}_s) + \tilde{\mu}$

15:     Compute conjugate parameter: $\beta_{k+1} = \|v_{k+1}\|^2 / \|v_k\|^2$

16:     Compute: $d_{k+1} = -v_{k+1} + \beta_{k+1} d_k$

17:    end for

18:   Option I: $\tilde{x}_{s+1} = x_m$

19:   Option II: $\tilde{x}_{s+1}$ chosen uniformly at random from $\{x_k\}_{k=0}^{m-1}$

20: end for

21: $\bar{x}$ chosen uniformly at random from the computed snapshots $\{\tilde{x}_s\}$.

Output: $\bar{x}$.
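For readers who prefer code, the following Python/NumPy sketch mirrors Algorithm 1 under simplifying assumptions: a plain backtracking search enforces only the sufficient-decrease part of (2) instead of a full strong Wolfe search, and `f_full`, `grad_full`, `grad_batch` are hypothetical oracles. It is an illustrative sketch, not a faithful reproduction of the paper's implementation.

```python
import numpy as np

def ascg(x0, f_full, grad_full, grad_batch, n, m=50, b=16,
         c1=1e-4, epochs=10, rng=None):
    """Sketch of ASCG (Algorithm 1): SVRG-style variance reduction,
    Fletcher-Reeves directions, and the acceleration factor theta_k = a_k/b_k."""
    rng = np.random.default_rng(rng)
    x_tilde = x0.copy()
    for s in range(epochs):
        mu = grad_full(x_tilde)                 # full gradient at the snapshot
        x, v, d = x_tilde.copy(), mu.copy(), -mu.copy()
        for k in range(m):
            # Backtracking on the sufficient-decrease condition,
            # a stand-in for the strong Wolfe rule (2).
            alpha = 1.0
            while f_full(x + alpha * d) > f_full(x) + c1 * alpha * (v @ d):
                alpha *= 0.5
            Bk = rng.choice(n, size=b, replace=False)
            z = x + alpha * d
            v_z = grad_batch(z, Bk) - grad_batch(x_tilde, Bk) + mu
            a_k = alpha * (v @ d)
            b_k = -alpha * ((v_z - v) @ d)
            theta = a_k / b_k if b_k < 0 else 1.0   # fall back to a plain step
            x = x + theta * alpha * d
            Sk = rng.choice(n, size=b, replace=False)
            v_new = grad_batch(x, Sk) - grad_batch(x_tilde, Sk) + mu
            beta = (v_new @ v_new) / (v @ v)        # Fletcher-Reeves parameter
            d = -v_new + beta * d
            v = v_new
        x_tilde = x                                  # Option I
    return x_tilde
```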

4 Convergence analysis

In this section, we mainly study the convergence rate of the proposed Algorithm 1. We start by making the following assumptions on the objective function in problem (1).

Assumption 1. For the objective function in problem (1):

(1) (smoothness) Each function $f_i$, $i = 1, \ldots, n$, is continuously differentiable and $L_i$-smooth.

(2) (convexity) Each function $f_i$, $i = 1, \ldots, n$, is convex.

(3) (strong convexity) The whole finite-sum function f is $\mu$-strongly convex.

These assumptions are very common in both stochastic and deterministic convex optimization. Assumption 1 implies that the whole function f is also L-smooth, with $L = \max_i L_i$. Below we provide a certain upper bound for the variance of the gradient estimate, of the type frequently used in stochastic optimization algorithms for convex problems.

Assumption 2. The ASCG algorithm is implemented with a step size $\alpha_k$ that lies in a bounded interval $[\alpha_{\min}, \alpha_{\max}]$ with $0 < \alpha_{\min} \le \alpha_{\max}$ and satisfies condition (2) with $0 < c_1 < c_2 < 1/2$.

Assumption 3. There exists $\lambda > 0$ such that

$$\mathbb{E}\|v_{k+1}\|^2 \le \lambda\, \mathbb{E}\|v_k\|^2,$$

where $v_k$ is the variance-reduced gradient estimator at the point $x_k$ of Algorithm 1, i.e., $v_k = \nabla f_{S_k}(x_k) - \nabla f_{S_k}(\tilde{x}_s) + \nabla f(\tilde{x}_s)$.

Assumption 4. Let $\{x_k\}$ be a sequence generated by Algorithm 1. There exists some $c > 0$ such that

$$\|v_k\| \le c\, \|\nabla f(x_k)\| \quad \text{for all } k.$$

Remark 1. These assumptions are standard in the stochastic conjugate gradient literature [24,26,27]. Unbounded step sizes often lead to divergence in stochastic settings, so Assumption 2, which restricts the step size to a bounded interval, is a standard and necessary practice to balance convergence speed and numerical stability. Assumption 3 enforces that the norm of the variance-reduced gradient does not grow across iterations; it reflects a reasonable decay (or controlled growth) of the estimation "noise" in the gradient. Assumption 4 bounds the "quality" of the variance-reduced gradient estimator relative to the true gradient $\nabla f(x_k)$. Under Lipschitz continuity of the gradient of f, one can show (via variance bounds of empirical gradients) that such estimators have a bounded second moment, and by the Cauchy-Schwarz inequality and gradient boundedness, this makes $\|v_k\|$ comparable to $\|\nabla f(x_k)\|$. The step size bounds in Assumption 2 are natural in this setting, and combined with the gradient-related bounds from variance reduction, the constant c can be verified to lie in a bounded interval.
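Assumption 4 is also easy to monitor empirically along a run. The following sketch (hypothetical `full_grad` oracle and recorded iterates/estimates) tracks the ratio that the assumption bounds:

```python
import numpy as np

def assumption4_ratios(xs, vr_grads, full_grad):
    """Track c_k = ||v_k|| / ||grad f(x_k)|| along iterates to check that the
    variance-reduced estimator stays comparable to the true gradient."""
    return [np.linalg.norm(v) / max(np.linalg.norm(full_grad(x)), 1e-12)
            for x, v in zip(xs, vr_grads)]
```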

Lemma 3. [24] Suppose that the ASCG algorithm is implemented with a stepsize $\alpha_k$ that satisfies the strong Wolfe condition (2) with $c_2 \in (0, 1/2)$. Then the FR conjugate gradient method generates descent directions $d_k$ that satisfy

$$-\frac{1}{1 - c_2} \le \frac{v_k^\top d_k}{\|v_k\|^2} \le -\frac{1 - 2c_2}{1 - c_2}.$$

Lemma 4. Let $\{x_k\}$ be a sequence generated by Algorithm 1. Then $b_k \le 0$ holds for all k.

Proof:

where the first inequality follows from the second Wolfe condition in (2), the second inequality comes from Lemma 3, the last equality uses the definition of $b_k$, and the final inequality follows from Assumption 4. □

Lemma 5. Suppose that Assumption 3 holds. Then, for any k in a fixed epoch s, we have

$$\mathbb{E}\|v_k\|^2 \le 4L\big(f(x_k) - f(x^*) + f(\tilde{x}_s) - f(x^*)\big),$$

where $x^*$ denotes the minimizer of f and $L = \max_i L_i$.

A similar result can be found in [22, Theorem 2].

In the following, we present the main result of this paper.

Theorem 1. Let $\{x_k\}$ be a sequence generated by Algorithm 1. Suppose that Assumptions 1-4 hold and that each $d_k$ satisfies the sufficient descent property of Lemma 3. Then the linear convergence

$$\mathbb{E}\big[f(\tilde{x}_{s+1}) - f(x^*)\big] \le \rho\, \mathbb{E}\big[f(\tilde{x}_s) - f(x^*)\big]$$

holds, where $\rho \in (0, 1)$ is a constant determined by $\mu$, L, c, $\lambda$, the step size bounds, and the epoch length m.

Proof: It follows from the Lipschitz continuity of $\nabla f$ that

$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^\top (x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2.$$

Taking expectation conditioned on Sk on both sides of this inequality, we have

(10)

By combining Lemma 4, Lemma 5 and (10), we have

(11)

Then, using (8) and (11), we have

(12)

On the other hand,

(13)

Substituting inequality (13) into inequality (12), we have

(14)

By applying Lemmas 1 and 2 to inequality (14), we further have

Taking the total expectation on both sides of this inequality, we obtain

(15)

Summing over $k = 0, \ldots, m-1$ within epoch s and using a telescoping sum, we have

(16)

On the other hand,

(17)

Substituting (17) into (16), we get

(18)

Rearranging (18) yields the claimed linear convergence, which completes the proof. □

5 Numerical experiments

In this section, we evaluate the practical performance of the proposed ASCG algorithm through numerical experiments on widely used machine learning benchmarks. We compare ASCG with several state-of-the-art methods using datasets sourced from LIBSVM: svmguide1, a9a, w8a, gisette, and ijcnn1. The datasets can be obtained from https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/. Table 1 provides a detailed description of these datasets. All experiments were conducted in MATLAB 2017b on a 64-bit PC equipped with an Intel(R) Core(TM) i7-6700HQ CPU (2.60 GHz) and 16 GB of RAM. The experimental model is described in Sect 5.1. Sect 5.2 introduces the parameter settings for all comparative algorithms used in the experiments. The comparative performance of ASCG against other state-of-the-art methods is presented in Sect 5.3, and an analysis of the properties of ASCG is provided in Sect 5.4.

5.1 The experimental model

In our experiments, we consider the following $l_2$-regularized logistic regression for binary classification:

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2, \qquad (19)$$

where $\{(a_i, b_i)\}_{i=1}^n \subset \mathbb{R}^d \times \{-1, +1\}$ is a given collection of data, with $a_i$ the ith training sample and $b_i$ the corresponding label, and $\lambda > 0$ is the regularization parameter. The regularized logistic regression model is widely used in machine learning to benchmark optimization algorithms. It is evident that the $l_2$-regularized logistic regression conforms to Assumption 1 with $L_i = \frac{1}{4}\|a_i\|^2 + \lambda$ and $\mu = \lambda$.
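A compact sketch of the objective (19), its gradient, and the constants noted above is given below (Python/NumPy; `A` holds the samples row-wise and `b` the ±1 labels — illustrative names, not from the paper):

```python
import numpy as np

def logreg_value_grad(x, A, b, lam):
    """Objective (19) and its gradient for l2-regularized logistic regression."""
    margins = -b * (A @ x)                       # -b_i * a_i^T x
    value = np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * (x @ x)
    sigma = 1.0 / (1.0 + np.exp(-margins))       # derivative of log(1 + e^t)
    grad = A.T @ (-b * sigma) / A.shape[0] + lam * x
    return value, grad

def smoothness_constants(A, lam):
    """Per-component L_i = ||a_i||^2 / 4 + lam; strong convexity mu = lam."""
    return 0.25 * np.sum(A * A, axis=1) + lam, lam
```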

5.2 Implementation of algorithms

We compare the proposed ASCG algorithm with several state-of-the-art conjugate gradient methods, both stochastic and deterministic, for solving the finite-sum problem (19). Specifically, the compared algorithms include: conjugate gradient with variance reduction (CGVR) [22], the stochastic conjugate gradient algorithm (SCGA) [24], and two classical deterministic conjugate gradient methods, namely the Fletcher-Reeves conjugate gradient method (FR-CGM) [4] and the accelerated conjugate gradient method (ACG) [29]. The parameter settings for each algorithm are detailed below:

CGVR: This is the first stochastic conjugate gradient method that integrates a variance reduction technique with FR-CGM. It involves four parameters: the Wolfe line search parameters $c_1$ and $c_2$, the minibatch size b, and the number of inner iterations m; the values used in our experiments are listed in Table 2.

SCGA: A recent stochastic conjugate gradient method that incorporates the SAGA framework into FR-CGM. Its minibatch size b, inner iteration count m, and Wolfe condition parameters $c_1$ and $c_2$ are set according to the recommendations in Table 2 of [24]; see Table 2.

FR-CGM: A classical deterministic conjugate gradient method, abbreviated as CG in this paper. It requires only the two Wolfe line search parameters $c_1$ and $c_2$, which we set as in Table 2 in all experiments.

ACG: A well-established accelerated conjugate gradient method developed by Andrei, which enhances convergence via adaptive step-size adjustments. Like FR-CGM, it relies solely on the Wolfe parameters $c_1$ and $c_2$, set as in Table 2.

ASCG: The proposed accelerated stochastic conjugate gradient method. Although ASCG employs acceleration mechanisms on top of the variance reduction used by CGVR and SCGA, it likewise involves four parameters: the Wolfe line search parameters $c_1$ and $c_2$ (with $0 < c_1 < c_2 < 1/2$), the minibatch size b, and the inner loop count m; our settings are given in Table 2.

For clarity, all parameter configurations are summarized in Table 2. During implementation, all feature values were normalized. Each algorithm was initialized from the same randomly generated starting point drawn from a standard normal distribution. To ensure a fair comparison, the regularization parameter $\lambda$ was set to the same commonly used value for all methods.

In the theory, we use the first Wolfe condition of (2), i.e.,

$$f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k v_k^\top d_k,$$

in which the full objective f is evaluated. In the experiments, however, for the compared stochastic methods we use the standard stochastic version of the Wolfe condition, in which the sampled objective replaces f. We refer the reader to [22,24] for similar settings. As the later experimental results show, this choice also yields good practical results.
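In code, the distinction amounts to which objective the line search queries. A sketch of the sampled sufficient-decrease check (hypothetical `f_batch` oracle over the sampled indices):

```python
def sampled_sufficient_decrease(f_batch, v, x, d, alpha, c1, batch):
    # Experimental variant: the sampled loss f_{B_k} replaces the full f
    # in the first (sufficient decrease) Wolfe condition.
    return f_batch(x + alpha * d, batch) <= f_batch(x, batch) + c1 * alpha * (v @ d)
```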

5.3 Comparison with other related algorithms

To evaluate the performance of the proposed algorithm, we compare ASCG with several competitive conjugate gradient methods for solving problem (19). The numerical results are presented in Figs 1-7. Following common practice for stochastic methods in machine learning, we measure the objective function value in terms of the number of effective passes. In these figures, the x-axis represents the number of effective passes (epochs), while the y-axis corresponds to the function value, training accuracy, and testing accuracy, respectively. Overall, the results demonstrate that ASCG performs comparably or superiorly to the other conjugate gradient methods. Specifically, variance-reduced (VR-based) algorithms consistently outperform non-VR methods. Moreover, ASCG exceeds the performance of both CGVR and SCGA, while ACG outperforms CG, empirically confirming the advantage of accelerated methods over non-accelerated ones.

Fig 1. The change curve of objective function value against the number of effective passes on four datasets of different methods.

https://doi.org/10.1371/journal.pone.0338720.g001

Fig 2. The change curve of objective function value against the gradient calculation on four datasets of different methods.

https://doi.org/10.1371/journal.pone.0338720.g002

Fig 3. The change curve of objective function value against the gradient calculation on four datasets of different methods.

https://doi.org/10.1371/journal.pone.0338720.g003

Fig 4. The effects of different minibatch size on objective function value of ASCG algorithm.

https://doi.org/10.1371/journal.pone.0338720.g004

Fig 5. The effects of different minibatch size on training accuracy of ASCG algorithm.

https://doi.org/10.1371/journal.pone.0338720.g005

Fig 6. The effects of different minibatch size on testing accuracy of ASCG algorithm.

https://doi.org/10.1371/journal.pone.0338720.g006

Fig 7. The performance of ASCG algorithm on high-dimensional dataset.

https://doi.org/10.1371/journal.pone.0338720.g007

In Table 3, we report the total running time, maximum training accuracy, and maximum testing accuracy of three stochastic conjugate gradient methods, each using the same minibatch size, across four datasets. The results indicate that the total running time of ASCG is nearly identical to that of CGVR on all datasets, whereas SCGA requires slightly more time. In general, ASCG achieves higher training and testing accuracy than the other two methods in most cases. Although CGVR and SCGA occasionally perform better, Figs 1-3 reveal that ASCG converges earlier during the initial stages.

Table 3. The performance of different stochastic CGM for logistic regression on four datasets.

https://doi.org/10.1371/journal.pone.0338720.t003

To assess performance on high-dimensional data, we specifically evaluated the gisette dataset from LIBSVM. The results, shown in Fig 7, indicate that ASCG maintains superior performance even in this setting. Notably, ASCG converges faster in both objective function value and test accuracy compared to all baseline methods (CG, CGVR, ACG, and SCGA), supporting its scalability and effectiveness for large-scale problems.

5.4 Effect of the minibatch size in ASCG algorithm

In this subsection, we examine the effect of the minibatch size on ASCG across all datasets. For each minibatch size, the initial step size was tuned optimally to ensure convergence. Smaller minibatch sizes were used for svmguide1 and a9a, while larger sizes were employed for ijcnn1 and w8a. The results, illustrated in Figs 4-7, show that ASCG generally achieves better performance with larger minibatch sizes, whereas excessively small minibatches can lead to divergence. As shown particularly in Fig 7, the algorithm exhibits more stable behavior and improved generalization when the minibatch size is set near the upper end of the tested range.

6 Conclusion

In this paper, we propose an accelerated stochastic conjugate gradient algorithm, termed ASCG, which incorporates a variance reduction technique for solving a class of convex empirical risk minimization problems. In contrast to existing stochastic conjugate gradient methods such as CGVR and SCGA, the proposed algorithm introduces an adaptive acceleration mechanism by scaling the step size via a deflation factor, thereby simplifying the parameter selection process. We provide a rigorous theoretical analysis demonstrating that ASCG achieves an expected linear convergence rate under the Fletcher-Reeves update for strongly convex problems, along with a significantly improved expected reduction in function values. Extensive numerical experiments on four large-scale datasets, performed in the context of l2-regularized logistic regression for binary classification, confirm that the proposed algorithm outperforms several state-of-the-art methods.

Acknowledgments

The authors would like to thank the editor and the anonymous referees for their valuable suggestions and comments which have greatly improved the presentation of this paper.

References

  1. Mo Z, Ouyang C, Pham H, Yuan G. A stochastic recursive gradient algorithm with inertial extrapolation for non-convex problems and machine learning. Int J Mach Learn & Cyber. 2025;16(7-8):4545–59.
  2. Ouyang C, Jian A, Zhao X, Yuan G. Difference-enhanced adaptive momentum methods for non-convex stochastic optimization in image classification. Digital Signal Processing. 2025;161:105118.
  3. Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems. J Res Natl Bur Stan. 1952;49(6):409.
  4. Fletcher R, Reeves CM. Function minimization by conjugate gradients. The Computer Journal. 1964;7(2):149–54.
  5. Polak E, Ribière G. Note sur la convergence de méthodes de directions conjuguées. Rev Fr Inform Rech Oper. 1969;3(16):35–43.
  6. Dai YH, Yuan Y. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J Optim. 1999;10(1):177–82.
  7. Ma G, Jin J, Jian J, Yin J, Han D. A modified inertial three-term conjugate gradient projection method for constrained nonlinear equations with applications in compressed sensing. Numer Algor. 2022;92(3):1621–53.
  8. Jiang X, Liao W, Yin J, Jian J. A new family of hybrid three-term conjugate gradient methods with applications in image restoration. Numer Algor. 2022;91(1):161–91.
  9. Aminifard Z, Babaie-Kafaki S. Dai-Liao extensions of a descent hybrid nonlinear conjugate gradient method with application in signal processing. Numer Algor. 2021;89(3):1369–87.
  10. Schraudolph NN, Graepel T. Conjugate directions for stochastic gradient descent. In: International Conference on Artificial Neural Networks. 2002. p. 1351–6.
  11. Huang R, Qin Y, Liu K, Yuan G. Biased stochastic conjugate gradient algorithm with adaptive step size for nonconvex problems. Expert Systems with Applications. 2024;238:121556.
  12. Jiang H, Yu X, Wilford PA. Digital predistortion using stochastic conjugate gradient method. IEEE Trans on Broadcast. 2012;58(1):114–24.
  13. Huang W, Zhou H-W. Least-squares seismic inversion with stochastic conjugate gradient method. J Earth Sci. 2015;26(4):463–70.
  14. Sharma S, Rastogi R. Stochastic conjugate gradient descent twin support vector machine for large scale pattern classification. Advances in Artificial Intelligence. 2018;11320:590–602.
  15. Liu J, Lin L, Ren H, Gu M, Wang J, Youn G, et al. Building neural network language model with POS-based negative sampling and stochastic conjugate gradient descent. Soft Comput. 2018;22(20):6705–17.
  16. Le QV, Ngiam J, Coates A, Lahiri A, Prochnow B, Ng AY. On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning. 2011. p. 265–72.
  17. Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems. 2013. p. 1–9.
  18. Defazio A, Bach F, Lacoste-Julien S. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems. 2014;27:1–9.
  19. Nguyen LM, Liu J, Scheinberg K, Takac M. SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning. 2017. p. 1–9.
  20. Fang C, Li CJ, Lin Z, Zhang T. SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Advances in Neural Information Processing Systems. 2018. p. 1–11.
  21. Li Z, Bao H, Zhang X, Richtarik P. PAGE: a simple and optimal probabilistic gradient estimator for nonconvex optimization. In: Proceedings of the 38th International Conference on Machine Learning. 2021. p. 6286–95.
  22. Jin X-B, Zhang X-Y, Huang K, Geng G-G. Stochastic conjugate gradient algorithm with variance reduction. IEEE Trans Neural Netw Learn Syst. 2019;30(5):1360–9. pmid:30281486
  23. Kobayashi Y, Iiduka H. Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning. arXiv preprint 2020. https://arxiv.org/abs/2003.00231v2
  24. Kou C, Yang H. A mini-batch stochastic conjugate gradient algorithm with variance reduction. J Glob Optim. 2022;87(2–4):1009–25.
  25. Ouyang C, Lu C, Zhao X, Huang R, Yuan G, Jiang Y. Stochastic three-term conjugate gradient method with variance technique for non-convex learning. Stat Comput. 2024;34(3).
  26. Yang Z. Large-scale machine learning with fast and stable stochastic conjugate gradient. Computers & Industrial Engineering. 2022;173:108656.
  27. Yang Z. Adaptive stochastic conjugate gradient for machine learning. Expert Systems with Applications. 2022;206:117719.
  28. Lenard ML. Accelerated conjugate direction methods for unconstrained optimization. J Optim Theory Appl. 1978;25(1):11–31.
  29. Andrei N. Acceleration of conjugate gradient algorithms for unconstrained optimization. Applied Mathematics and Computation. 2009;213(2):361–9.
  30. Andrei N. Accelerated hybrid conjugate gradient algorithm with modified secant condition for unconstrained optimization. Numer Algor. 2009;54(1):23–46.
  31. Andrei N. An accelerated subspace minimization three-term conjugate gradient algorithm for unconstrained optimization. Numer Algor. 2013;65(4):859–74.
  32. Sun W, Liu H, Liu Z. A class of accelerated subspace minimization conjugate gradient methods. J Optim Theory Appl. 2021;190(3):811–40.
  33. Sun W, Liu H, Liu Z. Several accelerated subspace minimization conjugate gradient methods based on regularization model and convergence rate analysis for nonconvex problems. Numer Algor. 2022;91(4):1677–719.
  34. Jian J, Chen W, Jiang X, Liu P. A three-term conjugate gradient method with accelerated subspace quadratic optimization. J Appl Math Comput. 2021;68(4):2407–33.
  35. Hu M, Lou Y, Wang B, Yan M, Yang X, Ye Q. Accelerated sparse recovery via gradient descent with nonlinear conjugate gradient momentum. J Sci Comput. 2023;95(1):33.
  36. Karimi S, Vavasis S. Nonlinear conjugate gradient for smooth convex functions. arXiv preprint 2023. arXiv:2111.11613v2
  37. Beck A. First-order methods in optimization. Philadelphia: SIAM; 2017.
  38. Nesterov Y. Lectures on convex optimization. Springer; 2018. https://doi.org/10.1007/978-3-319-91578-4