1 Introduction
We consider the general smooth, nonconvex compositional optimization problem of minimizing the composition of two expectations of stochastic functions:
(1) 
where the outer and inner functions are defined as , , and
are random variables, and each component
, are smooth but not necessarily convex. Compositional optimization can be used to formulate many important machine learning problems, e.g. reinforcement learning (Sutton and Barto, 1998), risk management (Dentcheva et al., 2017), multistage stochastic programming (Shapiro et al., 2009), deep neural net (Yang et al., 2018), etc. We list two specific application instances that can be written in the stochastic compositional form of (1):Risk management problem, which is formulated as
(2) 
where denotes the returns of assets at time , and denotes the investment quantity corresponding to assets. The goal is to maximize the return while controlling the variance of the portfolio. (2) can be written as a compositional optimization problem with two functions
(3)  
(4) 
where denotes the (column) subvector of its first coordinates and denotes its th coordinate.
Value function evaluation in reinforcement learning, where the objective function of interest is
(5) 
where are two plausible states, denotes the reward to move from to , and is the value function on state corresponding to policy .
Compared with traditional optimization problem which allows accesses to the stochastic gradients, problem (1) is more difficult to solve. Existing algorithms for solving (1) are often more computationally challenging. This is mainly due to the nonlinear structure of the composition function with respect to the random index pair . Treating the objective function as an expectation
, computing each iterate of the gradient estimation involves recalculating
, which is either timeconsuming or impractical. To tackle such weakness in practition, Wang et al. (2017a) firstly introduce a twotimescale algorithm called Stochastic Compositional Gradient Descent (SCGD) along with its (in Nesterov’s sense) accelerated variant (AccSCGD), and provide a first convergence rate analysis to that problem. Subsequently, Wang et al. (2017b) proposed accelerated stochastic compositional proximal gradient algorithm (ASCPG) which improves over the upper bound complexities in Wang et al. (2017a). Furthermore, variancereduced gradient methods designed specifically for compositional optimization on nonconvex settings arises from Liu et al. (2017) and later generalized to the nonsmooth setting (Huo et al., 2018). These approaches aim at getting variancereduced estimators of , and , respectively. Such success signals the necessity and possibility of designing a special algorithm for nonconvex objectives with better convergence rates.Algorithm  Finitesum  Online 

SCGD (Wang et al., 2017a)  unknown  
AccSCGD (Wang et al., 2017a)  unknown  
ASCPG (Wang et al., 2017b)  unknown  
SCVR / SCSCSG (Liu et al., 2017)  
VRSCPG (Huo et al., 2018)  unknown  
SARAHCompositional (this work) 
In this paper, we propose an efficient algorithm called SARAHCompositional for the stochastic compositional optimization problem (1). For notational simplicity, we let and the index pair
be uniformly distributed over the product set
, i.e.(6) 
We use the same notation for the online case, in which case either or can be infinite.
A fundamental theoretical question for stochastic compositional optimization is the Incremental Firstorder Oracle (IFO) (the number of individual gradient and function evaluations; see Definition 1 in §2 for a precise definition) complexity bounds for stochastic compositional optimization. Our new SARAHCompositional algorithm is developed by integrating the iteration of stochastic recursive gradient descent (Nguyen et al., 2017), shortened as SARAH,^{1}^{1}1This is also referred to as stochastic recursive variance reduction method, incremental variance reduction method or SPIDERBOOST in various recent literatures. We stick to name the algorithm after SARAH to respect to our best knowledge the earliest discovery of that algorithm. with the stochastic compositional optimization formulation (Wang et al., 2017a). The motivation of this approach is that SARAH with specific choice of stepsizes is known to be optimal in stochastic optimization and regarded as a cuttingedge variance reduction technique, with significantly reduced oracle access complexities than earlier variance reduction method (Fang et al., 2018). We prove that SARAHCompositional can reach an IFO computational complexity of , improving the best known result of in nonconvex compositional optimization. See Table 1 for detailed comparison.
Related Works
Classical firstorder methods such as gradient descent (GD), accelerated graident descent (AGD) and stochastic gradient descent (SGD) have received intensive attentions in both convex and nonconvex optimization
(Nesterov, 2004; Ghadimi and Lan, 2016; Li and Lin, 2015). When the objective can be written in a finitesum or online/expectation structure, variancereduced gradient (a.k.a. variance reduction) techniques including SAG (Schmidt et al., 2017), SVRG (Xiao and Zhang, 2014; AllenZhu and Hazan, 2016; Reddi et al., 2016), SDCA (ShalevShwartz and Zhang, 2013, 2014), SAGA (Defazio et al., 2014), SCSG (Lei et al., 2017), SARAH/SPIDER (Nguyen et al., 2017; Fang et al., 2018; Wang et al., 2018), etc., can be employed to improve the theoretical convergence properties of classical firstorder algorithms. Notably, Fang et al. (2018) recently proposed the SPIDERSFO algorithm which nontrivially hybrids the iteration of stochastic recursive gradient descent (SARAH) (Nguyen et al., 2017) with the normalized gradient descent (NGD). In the representative case of batchsize 1, SPIDERSFO adopts a small steplength that is proportional to the squared targeted accuracy , and (by rebooting the SPIDER tracking iteration once every iterates) the variance of the stochastic estimator can be constantly controlled by . For finding accurate solution purposes, Wang et al. (2018) rediscovered a variant of the SARAH algorithm that achieves the same complexity as SPIDERSFO (Fang et al., 2018) (their algorithm is under the name of SPIDERBOOST since it can be seen as the SPIDERSFO algorithm with relaxed steplength restrictions). The theoretical convergence property of SARAH/SPIDER methods in the smooth nonconvex case outperforms that of SVRG, and is provably optimal under a set of mild assumptions (Fang et al., 2018; Wang et al., 2018).It turns out that when solving compositional optimization problem (1), firstorder methods for optimizing a single objective function can either be nonapplicable or it brings at least queries to calculate the inner function . To remedy this issue, Wang et al. (2017a, b) considered the stochastic setting and proposed the SCGD algorithm to calculate or estimate the inner finitesum more efficiently, achieving a polynomial rate that is independent of . Later on, Lian et al. (2017); Liu et al. (2017); Huo et al. (2018) and Lin et al. (2018) merged SVRG method into the compositional optimization framework to do variance reduction on all three steps of the estimation. In stark contrast, our work adopts the SARAH/SPIDER method which is theoretically more efficient than the SVRG method in the nonconvex compositional optimization setting.
After the initial submission of the short version of this technical report, we are aware of a line of concurrent works by Zhang and Xiao (Zhang and Xiao, 2019b, a) who adapted the idea of SPIDER Fang et al. (2018) and solve the stochastic compositional problem. More relevant to this work is Zhang and Xiao (2019a) which consider a special nonsmooth setting for the compositional optimization problem where the objective function has an additive nonsmooth term that admits an easy proximal mapping.^{2}^{2}2Such a setting has also been studied in Wang et al. (2017b); Lin et al. (2018); Huo et al. (2018), among many others. Due to the lack of space, we satisfy ourselves with the smooth case and leave the analysis of the aforementioned nonsmooth case to the future. For a fair comparison, the IFO complexity upper bound obtained in Zhang and Xiao (2019a) is similar to ours (Theorems 3 and 4) yet with significant differences: (i) Zhang and Xiao (2019a) has the steplength restriction that SPIDER has in nature, and our work circumvent this issue, hence applicable to a wider range of statistical learing tasks; (ii) Zhang and Xiao (2019a) fixes the batch size, while our work theoretically optimizes the choice of batch sizes (Corollary 5 and contexts) and further halves the IFO upper bound in its asymptotics, which potentially serves as a parametertuning guidance to practitioners.
Contributions This work makes two contributions as follows. First, we propose a new algorithm for stochastic compositional optimization called SARAHCompositional, which operates SARAH/SPIDERtype recursive variance reduction to estimate relevant quantities. Second, we conduct theoretical analysis for both online and finitesum cases, which verifies the superiority of SARAHCompositional over the best known previous results. In the finitesum case, we obtain a complexity of which improves over the best known complexity achieved by Huo et al. (2018). In the online case we obtain a complexity of which improves the best known complexity obtained in Liu et al. (2017).
Notational Conventions Throughout the paper, we treat the parameters and as global constants. Let
denote the Euclidean norm of a vector or the operator norm of a matrix induced by Euclidean norm, and let
denotes the Frobenious norm. For fixed let denote the sequence . Let denote the conditional expectation . Let and denote the cardinality of a multiset of samples (a generic set that permits repeated instances). The averaged subsampled stochastic estimator is denoted as where the summation counts repeated instances. We denote if there exist some constants such that as becomes large. Other notations are explained at their first appearances.Organization The rest of our paper is organized as follows. §2 formally poses our algorithm and assumptions. §3 presents the convergence rate theorem and §4 presents numerical experiments that apply our algorithm to the task of portfolio management. We conclude our paper in §5. Proofs of convergence results for finitesum and online cases and auxiliary lemmas are deferred to §6 and §7 in the supplementary material.
2 SARAH for Stochastic Compositional Optimization
Recall our goal is to solve the compositional optimization problem (1), i.e. to minimize where
Here for each and the functions and . We can formally take the derivative of the function
and obtain (via the chain rule) the gradient descent iteration
(7) 
where the operator computes the Jacobian matrix of the smooth mapping, and the gradient operator is only taken with respect to the firstlevel variable. As discussed in §1, it can be either impossible (online case) or timeconsuming (finitesum case) to estimate the terms and in the iteration scheme (7). In this paper, we design a novel algorithm (SARAHCompositional) based on Stochastic Compositional Variance Reduced Gradient method (see Lin et al. (2018)) yet hybriding with the stochastic recursive gradient method Nguyen et al. (2017). As the readers see later, our SARAHCompositional is more efficient than all existing algorithms for nonconvex compositional optimization.
We introduce some definitions and assumptions. First, we assume the algorithm has accesses to an incremental firstorder oracle in our blackbox environment (Lin et al., 2018); also see (Agarwal and Bottou, 2015; Woodworth and Srebro, 2016) for vanilla optimization case:
Definition 1 (Ifo).
(Lin et al., 2018) The Incremental Firstorder Oracle (IFO) returns, when some and are inputted, the vectormatrix pair or when some and are inputted, the scalarvector pair .
Second, our goal in this work is to find an accurate solution, defined as
Definition 2 (accurate solution).
We call an accurate solution to problem (1), if
(8) 
It is worth remarking here that the inequality (8) can be modified to for some global constant without hurting the magnitude of IFO complexity bounds.
Let us first make some assumptions regarding to each component of the (compositional) objective function. Analogous to Assumption 1(i) of Fang et al. (2018), we make the following finite gap assumption:
Assumption 1 (Finite gap).
We assume that the algorithm is initialized at with
(9) 
where denotes the global minimum value of .
We make the following standard smoothness and boundedness assumptions, which are standard in recent compositional optimizatioin literatures (e.g. Lian et al. (2017); Huo et al. (2018); Lin et al. (2018)).
Assumption 2 (Smoothness).
There exist Lipschitz constants such that for , we have
(10)  
Here for the purpose of using stochastic recursive estimation of , we slightly strengthen the smoothness assumption by adopting the Frobenius norm in left hand of the first line of (10).
Assumption 3 (Boundedness).
There exist boundedness constants such that for , we have
(11)  
Notice that applying meanvalue theorem for vectorvalued functions to (11) gives another Lipschitz condition
(12) 
and analogously for . It turns out that under the above two assumptions, a choice of in (10) can be expressed as a polynomial of . For clarity purposes in the rest of this paper, we adopt the following typical choice of :
(13) 
whose applicability can be verified via a simple application of the chain rule. We integrate both finitesum and online cases into one algorithm SARAHCompositional and write it in Algorithm 1.
3 Convergence Rate Analysis
In this section, we aim to justify that our proposed SARAHCompositional algorithm provides IFO complexities of in the finitesum case and in the online case, which supersedes the concurrent and comparative algorithms (see more in Table 1).
Let us first analyze the convergence in the finitesum case. In this case we have , , . Involved analysis leads us to conclude
Theorem 3 (Finitesum case).
Suppose Assumptions 1, 2 and 3 in §2 hold, let , , let for any minibatch sizes ,
(14) 
and set the stepsize
(15) 
Then for the finitesum case, SARAHCompositional Algorithm 1 outputs an satisfying in
(16) 
iterates. Furthermore, let the minibatch sizes , satisfy
(17) 
then the IFO complexity to achieve an accurate solution is bounded by
(18) 
Like in Fang et al. (2018), for a wide range of minibatch sizes the IFO complexity to achieve an accurate solution is upper bounded by , as long as (17) holds.^{3}^{3}3Here and in below, the smoothness and boundedness parameters and are treated as constants Note if the batch size are chosen as , then from (18) the IFO complexity upper bound is
(19) 
Let us then analyze the convergence in the online case, where we sample minibatches of relevant quantities instead of the ground truth once every iterates. To characterize the estimation error, we put in one additional finite variance assumption:
Assumption 4 (Finite Variance).
We assume that there exists and as the upper bounds on the variance of the functions , , and , respectively, such that
(20)  
From Assumptions 2 and 3 we can easily verify, via triangle inequality and convexity of norm, that can be chosen as and can be chosen as . On the contrary, cannot be represented as a function of boundedness and smoothness constants. We conclude the following theorem for the online case:
Theorem 4 (Online case).
Suppose Assumptions 1, 2 and 3 in §2 hold, let , , let for any minibatch sizes
(21) 
let noiserelevant parameter
(22) 
let , and set the stepsize
(23) 
Then the SARAHCompositional Algorithm 1 outputs an satisfying in
(24) 
iterates. Furthermore, let the minibatch sizes , satisfy
(25) 
then the IFO complexity to achieve an accurate solution is bounded by
(26) 
We see that in the online case, the IFO complexity to achieve an accurate solution is upper bounded by , as long as (25) holds.^{4}^{4}4Here and in below, the smoothness and boundedness parameters and are treated as constants Note if the batch size are chosen as , then from (26) the IFO complexity upper bound is
(27) 
In fact, we can further improve the coefficient in the term in (18) and term in (26). A simple optimization tricks enables us to obtain an optimal choice (as in (28) and (30) below) of minibatch sizes, as
Corollary 5 (Optimal batch size, finitesum and online case).
Let (resp. ) maps a real to its closest element in (resp. ).

When the minibatch sizes in the finitesum case are chosen as
(28) the IFO complexity bound to achieve an accurate solution for SARAHCompositional is further minimized to
(29) 
When the minibatch sizes in the online case are chosen to satisfy
(30) where is defined in (22), the IFO complexity bound to achieve an accurate solution for SARAHCompositional is further minimized to
(31)
To understand the new IFO complexity upper bounds (29) and (31) with optimally chosen batch sizes, via the basic inequality the complexity in (29) when can be further upper bounded by
This indicates that compared to the singlesample case (19), the IFO complexity upper bound obtained is reduced by at least in its coefficient when is asymptotically large. To our best knowledge, the theoretical phenomenon that minibatch SARAH can reduce IFO complexity has not been quantitatively characterized in previous literatures. It is worth noting that analogous property does not hold in the classical optimization case, where the singlesample case and minibatch cases share the same IFO complexity upper bound (Fang et al., 2018). With further efforts, it can be shown that the running time can be effectively more reduced by adopting parallel computing techniques; we omit the details for clarity.
4 Experiments
In this section, we conduct numerical experiments to support our theory by applying our proposed SARAHCompositional algorithm to three practical tasks: portfolio management, reinforcement learning, and a dimension reduction technique named stochastic neighborhood embedding (SNE). In sequel, §4.1 studies performance of our algorithm to (riskadverse) portfolio management/optimization problem, §4.2 tests the performance of SARAHCompositional on evaluating value functions in reinforcement learning, while §4.3 focuses on the study of SNE which possesses a nonconvex objective function. We follow the setups in Huo et al. (2018); Liu et al. (2017) and compare with existing algorithms for compositional optimization. Readers are referred to Wang et al. (2017a, b) for more tasks we can apply our algorithm to.^{5}^{5}5We conduct experiments on synthetic data and MNIST dataset; the source code can be found at http://github.com/angeoz/SCGD.
4.1 SARAHCompositional Applied to Portfolio Management
The portfolio management minimization problem can be formulated as a meanvariance optimization problem:
(32) 
where denotes the quantities invested at every assets . We recall that in Section 1 we introduced the equivalent formulation of (32) as a compositional optimization problem (4). As it satisfies Assumptions 14, it serves as a good example to validate our theory.
For illustration purpose of the finitesum case we set and . Each row in the matrix are generated from a
dimensional Gaussian distribution with covariance matrix
. The conditional number of are predetermined as one of the experiment settings. For example we experimented on the case when and when where denotes the condition number. Inspired from the setting of Huo et al. (2018), we sample from the Gaussian distribution and take the absolute value.Furthermore, we did experiments on real world datasets as a demonstration of the online cases. Datasets include different portfolio datas formed on size and operating profitability^{6}^{6}6http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/. We choose 6 different datasets with and , which is partly the same as used in (Lin et al., 2018).
Throughout the experiment of the portfolio management, we set in the finitesum case, and set in the online case. The other parameters are set as follows: in the finitesum case and in the online case, , and . For SCGD and ASCPG algorithm, we fix the extrapolation parameter to be 0.9. Our search of learning rates is among , and we plot the learning curve of each algorithms corresponding to the best learning rate found. The results are shown in Figure 1and 2 respectively.
We demonstrate the comparison between our algorithm SARAHCompositional, SCGD (Wang et al., 2017a) for compositional optimization, ASCPG (Wang et al., 2017b) and VRSCPG (Huo et al., 2018) which serves as a baseline for variance reduced stochastic compositional optmization methods. We plot the objective function value gap and gradient norm against IFO complexity (measured by gradients calculation) for all four algorithms, two covariance settings and six real world problems. We observe that SARAHCompositional outperforms all other algorithms.
The result is an experimental justification that our proposed SARAHbased compositional optimization on portfolio management achieves stateofthe art performance. Moreover, we note that due to the small size of the batches, basic SCGD fails to reach a satisfying result, which is also shown by Huo et al. (2018); Lian et al. (2017). Smaller batch size also causes oscillation in SCGD training, which is a problem that SARAHCompositional algorithm does not encounter.
4.2 SARAHCompositional Applied to Reinforcement Learning
Next we demonstrate an experiment on reinforcement learning and test the performance of SARAHCompositional on value function evaluation. Let be the value of state under policy , then the value function can be evaluated through Bellman equation:
(33) 
for all , where represents the set of available states and . In value function evaluation tasks, we minimize the square loss
(34) 
We write as . Equation (34) is a special form of the stochastic compositional optimization problem by choosing and as follows (Wang et al., 2017b):
where is the vector with the elements in as components.
To model a reinforcement learning problem, we choose one of the commonly used setting of Dann et al. (2014)
and generate a Markov decision process (MDP) with
states andactions at each state. The transition probability is generated randomly from the uniform distribution
with added to each component to ensure the ergodicity. In addition, the rewards are sampled uniformly from .We tested our results on different settings of batch size and inner iteration numbers. In Figure 3 we plot our results on the batch size of , , respectively. The learning rate goes over the set and the inner loop update iteration number are set to be 100. We plot the objective value gap together with the gradient norm and use moving average to smooth the plot, which gives us Figure 3. From the figures we note that when the batch size is small and the iteration number is large, SARAHCompositional outperforms VRSC on convergence speed, gradient norm and stability. This supports our theoretical results and shows the advantage of SARAHCompositional over VRSC on the effect of variance reduction.
4.3 SARAHCompositional Applied to SNE
In SNE problem (Hinton and Roweis, 2003) we use ’s to denote points in high dimensional space and ’s to denote their low dimensional images. We define
where controls the sensitivity to distance. Then the SNE problem can be formulated as a nonconvex compositional optimization problem (Liu et al., 2017) as (1) and (6), where
We implement SNE method on MNIST dataset, with sample size 2000 and dimension 784. We use SCGD(Wang et al., 2017a), ASCPG(Wang et al., 2017b), and VRSC (Liu et al., 2017) as a baseline of variance reduced version of stochastic compositional optimization methods and compare its performance with SARAHCompositional. We choose the best learning rate that keeps the algorithm to converge for each case.
In our experiment, we choose a inner batch size of 5, an outer batch size of 1000, and optimal learning rate for both algorithms. In the left panel of Figure 4, we plot the change of objective function value gap during iterations, and in the right panel we plot the gradient norm with respect to each outer loop update in SCGD, ASCPG, VRSC and SARAHCompositional. The left panel in Figure 4 shows that SARAHCompositional has significantly better stability compared to VRSC. The gradient norm of SARAHCompositional gradually decreases within each inner loop, while the gradient norm of VRSC accumulates within each inner loop and decreases at each outer loop.
We note that the objective function of tSNE is nonconvex. We observe from Figure 4 that, SARAHCompositional outperforms VRSC with respect to the decrease in gradient norm against IFO complexity (gradient calculation), which is numerically consistent with our theory.
5 Conclusion
In this paper, we propose a novel algorithm called SARAHCompositional for solving stochastic compositional optimization problems using the idea of a recently proposed variancereduced gradient method. Our algorithm achieves both outstanding theoretical and experimental results. Theoretically, we show that the SARAHCompositional algorithm can achieve desirable efficiency and IFO upper bound complexities for finding an accurate solution of nonconvex compositional problems in both finitesum and online cases. Experimentally, we compare our new compositional optimization method with a few rival algorithms for the task of portfolio management. Future directions include handling the nonsmooth case and the theory of lower bounds for stochastic compositional optimization. We hope this work can provide new perspectives to both optimization and machine learning communities interested in compositional optimization.
References
 Agarwal and Bottou (2015) Agarwal, A. and Bottou, L. (2015). A lower bound for the optimization of finite sums. In International Conference on Machine Learning, pages 78–86.
 AllenZhu and Hazan (2016) AllenZhu, Z. and Hazan, E. (2016). Variance reduction for faster nonconvex optimization. In International Conference on Machine Learning, pages 699–707.
 Dann et al. (2014) Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883.
 Defazio et al. (2014) Defazio, A., Bach, F., and LacosteJulien, S. (2014). Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654.
 Dentcheva et al. (2017) Dentcheva, D., Penev, S., and Ruszczyński, A. (2017). Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760.
 Fang et al. (2018) Fang, C., Junchi Li, C., Lin, Z., and Zhang, T. (2018). Spider: Nearoptimal nonconvex optimization via stochastic path integrated differential estimator. arXiv preprint arXiv:1807.01695.
 Ghadimi and Lan (2016) Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(12):59–99.
 Hinton and Roweis (2003) Hinton, G. E. and Roweis, S. T. (2003). Stochastic neighbor embedding. In Advances in neural information processing systems, pages 857–864.

Huo et al. (2018)
Huo, Z., Gu, B., Liu, J., and Huang, H. (2018).
Accelerated method for stochastic composition optimization with
nonsmooth regularization.
In
ThirtySecond AAAI Conference on Artificial Intelligence
.  Lei et al. (2017) Lei, L., Ju, C., Chen, J., and Jordan, M. I. (2017). Nonconvex finitesum optimization via scsg methods. In Advances in Neural Information Processing Systems, pages 2345–2355.
 Li and Lin (2015) Li, H. and Lin, Z. (2015). Accelerated proximal gradient methods for nonconvex programming. In Advances in neural information processing systems, pages 379–387.
 Lian et al. (2017) Lian, X., Wang, M., and Liu, J. (2017). Finitesum composition optimization via variance reduced gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 1159–1167.
 Lin et al. (2018) Lin, T., Fan, C., Wang, M., and Jordan, M. I. (2018). Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint arXiv:1806.00458.
 Liu et al. (2017) Liu, L., Liu, J., and Tao, D. (2017). Variance reduced methods for nonconvex composition optimization. arXiv preprint arXiv:1711.04416.
 Nesterov (2004) Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, volume 87. Springer.
 Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017). Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621.
 Reddi et al. (2016) Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323.
 Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(12):83–112.
 ShalevShwartz and Zhang (2013) ShalevShwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599.
 ShalevShwartz and Zhang (2014) ShalevShwartz, S. and Zhang, T. (2014). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72.
 Shapiro et al. (2009) Shapiro, A., Dentcheva, D., and Ruszczyński, A. (2009). Lectures on stochastic programming: modeling and theory. SIAM.
 Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction.
 Wang et al. (2017a) Wang, M., Fang, E. X., and Liu, H. (2017a). Stochastic compositional gradient descent: Algorithms for minimizing compositions of expectedvalue functions. Mathematical Programming, 161(12):419–449.
 Wang et al. (2017b) Wang, M., Liu, J., and Fang, E. X. (2017b). Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18:1–23.
 Wang et al. (2018) Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. (2018). Spiderboost: A class of faster variancereduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690.
 Woodworth and Srebro (2016) Woodworth, B. E. and Srebro, N. (2016). Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647.
 Xiao and Zhang (2014) Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075.
 Yang et al. (2018) Yang, S., Wang, M. M., and Fang, E. (2018). Multilevel stochastic gradient methods for nested composition optimization. arXiv preprint arXiv:1801.03600.
 Zhang and Xiao (2019a) Zhang, J. and Xiao, L. (2019a). Multilevel composite stochastic optimization via nested variance reduction. arXiv preprint arXiv:1908.11468.
 Zhang and Xiao (2019b) Zhang, J. and Xiao, L. (2019b). A stochastic composite gradient method with incremental variance reduction. arXiv preprint arXiv:1906.10186.
6 Detailed Analysis of Convergence Theorems
In this section, we detail the analysis of our Theormems 3 and 4. Before moving on, we first provide a key lemma that serves as their common analysis, whose proof is provided in §6.3. We assume that the expected estimation error squared is bounded as the following for any and some parameters , and to be specified here:
Lemma 6.
Assume that for any initial point
Comments
There are no comments yet.