We consider the problem of minimizing a smooth objective function, when the optimization algorithm is provided with biased function measurements. This setting is motivated by practical applications, where the objective function is estimated from a batch dataset, and the estimation scheme is biased. As an example, consider the problem of estimating Conditional Value-at-Risk (CVaR), a popular risk measure in financial applications, from a batch of independent and identically distributed (i.i.d.) samples. The classic CVaR estimator [1] requires estimation of a certain quantile of the underlying distribution, and hence, the resulting estimate is biased. As another example, one could consider a simulation optimization problem [2], where the function measurements are ‘biased’ due to computational constraints. In both examples, increasing the batch size used for estimation decreases the estimation error.
Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, E-Mail: niravnb@cse.iitm.ac.in,
Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, E-Mail: prashla@cse.iitm.ac.in.
The ‘biased stochastic optimization’ setting outlined above is more general than the canonical ‘zeroth-order stochastic optimization‘ setting because the former features an estimation error that has a positive mean, while the latter usually features an estimation error that vanishes in expectation. We extend the theory of zeroth-order stochastic optimization to our setting by formalizing two oracle models that encapsulate a biased stochastic optimization problem. In each oracle model, an algorithm obtains a noisy and biased estimate of the gradient at any chosen point. Both oracles feature a batch size parameter that can be used to control an additive estimation error component in the gradient estimates. The difference between the two proposed biased gradient oracles is that the first oracle features a bias-variance tradeoff for the gradient estimates, while the second one does not have such a tradeoff.
The biased gradient oracles can be implemented using the simultaneous perturbation [3, 4] class of algorithms that can provide biased gradient information, using only noisy function measurements. Such an approach can be extended to cover the case of biased function measurements that we consider in this paper. The gradient estimate resulting from a simultaneous perturbation method usually has a bias-variance tradeoff, i.e., the estimate has an additive bias of , where is a parameter to be chosen by the optimization algorithm. The variance of the gradient estimate is , and the choice of relates to bias-variance tradeoff [5, 3, 6, 7]. Under additional assumptions, one can eschew the bias-variance tradeoff, i.e., reduce the bias without adversely affecting the variance [4, 8, 9].
The focus of this paper is to understand the rate of convergence of gradient-based methods with inputs from a biased gradient oracle. We derive non-asymptotic bounds on the iteration complexity of gradient-based methods for a non-convex as well as a convex objective. In either case, we derive bounds for gradient-based methods with inputs from the following oracle models: (i) an oracle whose gradient estimates have a parameter for trading off bias against the variance. We shall refer to this oracle as (O1); and (ii) an oracle where the gradient estimates have no bias-variance tradeoff. We shall refer to this oracle as (O2) below. Note that both oracles
Table I: Summary of the iteration and sample complexity we obtain for the RSG-BGO algorithm 1 and SGD-BGO algorithm 2 under two oracles for finding an -stationary or -optimal point (see Definition 1). Here (O1) is an oracle that returns a biased gradient estimate with a parameter that controls bias-variance tradeoff, while (O2) is an oracle variant where the bias of the gradient estimate can be reduced without adversely affecting the variance. Both oracles have a batch size parameter that controls the estimation error (See Section II for precise definitions).
have a batch size parameter for controlling the estimation error. Table I summarizes our bounds in the convex as well non-convex regimes, under two oracle models.
We now summarize our contributions in the case when the objective is non-convex. We study the non-asymptotic performance of the randomized stochastic gradient (RSG) algorithm, proposed in [8]. The case of unbiased gradient information is addressed in the aforementioned reference, and we focus on the cases when RSG is provided inputs from (O1) or (O2). From our analysis, we observe that RSG has a sample complexity bound of with (O1), and with (O2). This is not surprising, as (O1) provides a gradient estimate whose variance scales inversely with the perturbation constant , and this is unlike the estimate from (O2), where such an inverse scaling is absent. Our result, when specialized to a setting without the estimation error, matches the corresponding result in [8]. An advantage with our approach is that, unlike [8], we do not require knowledge of the function value at the optima for choosing the perturbation constant . We demonstrate the applicability of our biased gradient oracle by considering a risk-sensitive optimization problem in a reinforcement learning setting. We propose a risk-sensitive policy gradient (Risk-PG) algorithm, and show that the analysis of the RSG algorithm with (O1) applies to Risk-PG algorithm, while the results under (O2) would apply under additional assumptions.
Next, we summarize our contributions in the case when the objective is convex. Using a proof technique that is similar to the one employed in the non-convex case, we provide non-asymptotic bounds for the RSG algorithm with inputs from either (O1) or (O2). A disadvantage with the RSG algorithm is that it requires knowledge of the smoothness parameter for choosing the step-size parameter in the gradient descent update iteration. We overcome this dependence by employing a different algorithm that is based on the stochastic gradient descent scheme analyzed in [10]. We provide non-asymptotic bounds that hold in expectation for the final iterate of the stochastic gradient algorithm with inputs from either (O1) or (O2). For the case of unbiased gradient information, the authors in [10] provide a sample complexity bound of the order . We also provide a similar order bound, when the gradients are obtained from (O2). On the other hand, when gradient estimates from (O1) are employed, the bound we obtain is of the order . The latter bound is not surprising, considering a matching minimax lower bound shown in [7].
Related work. Biased stochastic optimization has been considered before in [7, 11, 12, 13, 14, 15, 16]. In [11, 12, 13], the authors consider an oracle that outputs biased gradient measurements without any noise component. In contrast, we consider noisy gradient measurements with a controllable estimation error. In [7], which is a closely related work, the authors formalize a biased noisy gradient oracle. They derive an upper bound for a mirror descent scheme, and a minimax lower bound, both for the case of a convex objective. Unlike [7], our oracle model features an additional estimation error component that is not zero mean. Our results, when specialized to the case without an estimation error matches the upper bound derived in [7]. Our bounds are for a regular stochastic gradient descent algorithm, with the added advantage that the stepsize we employ does not require knowledge of the underlying smoothness parameter. More importantly, unlike [7], we study stochastic non-convex optimization problems with the biased gradient oracles mentioned before.
In [9], the authors derive a non-asymptotic bound for a zeroth-order variant of the stochastic conditional gradient algorithm under an oracle model similar to (O2), except that the estimation error component is absent. Specializing our results to have zero-mean estimation error would make our bounds comparable to those in [9]. In [14], the authors consider a oracle model with a batch size parameter, and propose an algorithm that estimates the gradient on a mini-batch of sufficient size. They provide a sample complexity bound of for the case of a convex function, and this matches our bound for the oracle (O2). Our bounds for the convex case are for the ‘practically preferred’ last iterate, while those in [14] are for an iterate chosen uniformly at random. In addition, our step-size choice does not require the knowledge of the underlying smoothness parameter.
In [16], which is another closely related work, the authors study stochastic gradient methods in a setting where the objective is estimated using batch data, and a batch size of m leads to an estimation error of . Our biased gradient oracle framework is comparable to their setting, when , and for this case, our non-asymptotic bounds match their asymptotic rate.
The rest of the paper is organized as follows: Section II formulates the biased gradient oracles, along with motivating applications. Section III considers the stochastic non-convex optimization problem, and presents non-asymptotic bounds for a randomized gradient descent algorithm with inputs from a biased gradient oracle. Section IV considers the stochastic convex optimization problem and presents non-asymptotic bounds for a stochastic gradient descent algorithm. Section V highlights the applicability of our biased gradient oracles in a risk-sensitive reinforcement learning setting. Section VI provides the proofs of all the bounds which are presented in the paper, and finally, Section VII provides the concluding remarks.
Notation: Throughout this paper we assume , and is an matrix with each entry as one.
Consider the following optimization problem:
where the function is assumed to be smooth. Gradient-based methods are very popular for solving the optimization problem formulated above, and we consider an iterative algorithm which obtains estimates of through calls to a biased gradient oracle. We define two such oracles below, and subsequently, we provide motivating applications featuring biased function measurements.
A. Oracle definitions
(O1) Biased gradient oracle Input: , perturbation constant , and batch size m > 0.
In the oracle defined above, the parameter is used to tradeoff bias and variance in the gradient estimates, while the parameter m is motivated by practical models where exact function measurements are unavailable. Instead, one could choose larger values of m to increase the accuracy of the function measurements. Next, we present an alternative to (O1), where the bias of the gradient estimates can be reduced without adversely affecting the variance.
B. Illustrative applications
Example 1. In the regular simulation optimization setting [2], we are given function measurements with zero-mean noise, i.e., . In contrast, we consider a model where the function measurements have an error term with positive mean. In this model, the objective f is obtained as a solution to the following sub-problem over the optimization variable y that belongs to a convex and compact set Y :
Owing to computational considerations, the sub-problem defined above cannot be solved exactly. Instead, an optimization algorithm can obtain inexact measurements F(x, m) defined by
In the above, m is the batch size parameter, and is a ‘positive’ estimation error term. Choosing a larger batch size m implies the sub-problem in (2) can be solved more precisely, leading to lower estimation error .
The oracles (O1)–(O2) are also appealing in practical applications where the objective f has to be estimated from i.i.d. samples coming from a r.v. X, and the estimation scheme is biased. Estimation of risk measures such as CVaR is an example of an application where the de facto estimation scheme is biased. We shall illustrate the applicability of our oracle in the context of CVaR objective below.
Example 2. We begin by defining the VaR and CVaR , at a pre-specified level below.
where If the distribution underlying X is continuous, then .
We now describe a well-known estimate of CVaR using m i.i.d. samples . Notice that CVaR estimation requires an estimate of VaR. Let and denote the estimates of VaR and CVaR. These quantities are defined as follows (see [17]):
In the above, denotes the ith order statistic, . Notice that , since the VaR estimate in (4) is not unbiased. However, a recent CVaR concentration result in [18] shows that if the underlying r.v. X is -sub-Gaussian1, then, for any , the following inequality holds:
where constants depend on . Using (5), we have
where is an absolute constant.
In both examples illustrated above, the common element is biased function measurements. Using such measurements, one could construct gradient estimates using the simultaneous perturbation (SP) method [3]. We make this construction precise below.
Let , and . Here are the estimation errors assuming a batch size of is a perturbation constant, and is a d-dimensional standard Gaussian vector. For the two examples discussed above, it is apparent that the estimation error is in expectation, if m samples are used for estimation of f at input parameters.
A gradient estimate is formed using two function evaluations (i.e., and ) as follows:
where is a d-dimensional Gaussian vector composed of standard normal r.v.s. The estimate defined above is referred to as Gaussian smoothed functional, as well as Gaussian smoothing. This estimate was proposed in [19], and studied later in a convex optimization setting in [4]. A related estimate is random directions stochastic approximation (RDSA) with spherical perturbations, proposed in [20].
Assuming that the underlying function f is three-times continuously differentiable, we have
Hence,
where we used the fact that , since is a standard Gaussian vector. Combining the equality above with the fact that the estimation error is , we obtain
for some constants . This satisfies the requirement (a) in (O1). The requirement in (b) can be shown by squaring the estimator , assuming the square of the objective f is bounded, leading to an inverse scaling with the perturbation constant .
A similar argument works for the case of a convex and smooth objective as well. In addition, a variety of distributions can be employed for the random perturbations, cf. [5, 21, 22, 3, 7].
The oracle variant defined in (O2) can also be constructed using the estimator in (7). In particular, following arguments used in Lemma 3 in [4] together with the fact that estimation error is of order would lead to the condition (a) in (O2). As mentioned before, in a regular simulation optimization setting, one observes a sample that satisfies . If the noise has bounded variance, then one can ensure condition (b) in (O1), where the variance of the estimator scales inversely with . However, under additional assumptions, such as , one can get rid of the inverse scaling with , avoiding the bias-variance tradeoff through the parameter . The requirement amounts to an interchange of differentation and integration operators, usually by invoking the dominated convergence theorem, and is common in perturbation analysis, cf. [23, 24]. For a proof that leads to condition (b) in (O2), a straightforward variation of the proof of Lemma B.1 in [9], which incorporates biased function measurements can be worked out, and we omit the details.
C. Performance metrics
We consider stochastic gradient (SG) type algorithms for solving (1), with inputs from either (O1) or (O2). A SG algorithm runs for N iterations, and outputs a point , that could be chosen randomly from the iterates . We study SG schemes in following two contexts: (i) the case when the objective f is convex; and (ii) the case when the objective f is not assumed to be convex. In case (i), we provide bounds on the optimization error, i.e., , where is a minima of f. On the other hand, in case (ii), i.e., when the objective is non-convex, it is difficult to bound the optimization error. A popular alternative is to establish local convergence. i.e., to a point where the gradient of the objective is small (cf. [8, 25]). The following definition makes the optimization objective apparent in both cases.
The SG algorithms are judged using the iteration as well as sample complexity, which are defined below.
Definition 2. The iteration complexity of an algorithm A is the number of calls A makes to a biased gradient oracle before finding an -stationary (resp. -optimal) point for a non-convex (resp. convex) objective function.
Definition 3. Suppose an algorithm A makes N calls to a biased gradient oracle before finding an -stationary (resp. -optimal) point for a non-convex (resp. convex) objective function. Then, the sample complexity of A is , where is the batch size in iteration i.
In this section, we consider the problem in (1), with an objective f that is smooth, but not necessarily convex. We analyze the non-asymptotic performance of the RSG algorithm [8], with inputs from a biased gradient oracle (either (O1) or (O2)). The pseudocode for the algorithm is given below. This algorithm performs an incremental update as defined in (8), and outputs a random iterate, after N iterations2.
For the non-asymptotic analysis of the RSG algorithm, we make the following assumptions: (A1) Function f has Lipschitz continuous gradient with constant L > 0, i.e.,
(A2) There exists a constant B > 0 such that . The smoothness assumption in (A1) is standard in the analysis of gradient-based algorithm (cf. [8, 9]). The boundedness requirement in (A2) is made in the context of zeroth-order optimization in [9], and can be inferred from the assumptions common to the analysis of policygradient algorithms in a reinforcement learning context, cf. [26]. We provide below a non-asymptotic bound for RSGBGO algorithm with (O1).
Theorem 1. (RSG-BGO under (O1)) Assume (A1) and (A2). With the oracle (O1), suppose that the RSG-BGO algorithm is run with the stepsize and perturbation constant set as follows :
, constants and are as defined in (O1), B is as defined in (A2),
and is an optimal solution to (1).
(ii) If the batch size , for some constant and , then, for any , we have
where constants are the same as in part (i).
Proof. See Section VI-A1.
Remark 1. The overall rate, from the bound above, is , and this is not surprising because the bias of the gradient cannot be made arbitrarily small by setting to a low value, as the variance of the gradient estimates scales inversely with . The (asymptotic) convergence rate results for simultaneous perturbation stochastic approximation (SPSA) in [5], and RDSA in [21], also exhibit the same order, under an oracle that is a variant to (O1) (without the estimation error component).
Using the bound in Theorem 1, it is easy to see that the iteration complexity is , and the sample complexity is .
Remark 2. For understanding the dimension dependence in the iteration complexity, let us consider the special case where (O1) is implemented using either SPSA [5] or RDSA [21]. In this case, and , where are dimension-independent constants. Choosing and in (9), the overall iteration complexity of RSGBGO turns out to be .
We provide below a non-asymptotic bound for the RSG-BGO algorithm with (O2).
Theorem 2. (RSG-BGO under (O2)) Assume (A1) and (A2). With the oracle (O2), suppose that the RSG-BGO algorithm is run with the stepsize , perturbation constant , and batch size set as follows :
where , , constants and are as defined in (O2), B is as defined in (A2), and is as defined in (11).
Proof. See Section VI-A2.
From the bound in Theorem 2, it is easy to see that the iteration complexity is , and the sample complexity is . This bound is better than the corresponding bound with (O1), we believe this improvement is because the variance of the gradient estimate in (O2) does not increase when bias is reduced.
Remark 3. In [8], the authors derive a non-asymptotic bound for a zeroth-order variant of their RSG algorithm under an oracle that is a variant to (O2) (without the estimation error component). Our result in Theorem 2 matches their bound. Moreover, unlike [8], we derive a non-asymptotic bound for the oracle (O2), which involves an estimation error component.
An advantage with our analysis is that it allows a simpler distribution for picking the final iterate (see Proposition 2 in the Section VI-A1). In particular, our bounds hold for an iterate that is picked uniformly at random from . The net effect is that of iterate averaging, except that the averaging happens in expectation.
Remark 4. For understanding the dimension dependence in the iteration complexity, let us consider the special case where (O2) is implemented using the Gaussian smoothing approach [4, 9]. In this case,
whereis the bound on variance of the estimator of f(x). Choosing the stepsize and in (13), the overall iteration complexity of RSGBGO turns out to be .
In this section, we consider the problem in (1), under the assumption that f is a convex function. Let be a minimizer of the objective f. We first analyze the RSG-BGO algorithm in a convex setting, and subsequently present the SGD-BGO algorithm.
A. Randomized stochastic gradient algorithm with a biased gradient oracle (RSG-BGO)
We provide below a non-asymptotic bound for RSG algorithm with (O1).
Theorem 3. (RSG-BGO under (O1))
Assume (A1). With the oracle (O1), suppose that the RSG-BGO algorithm is run with the batch size , for some constant and stepsize , perturbation constant set as defined in (9). Then, for any , we have
where , constants and are as defined in (O1),
(14) and is an optimal solution to (1).
Proof. See Section VI-B1.
The iteration complexity of and the sample complexity of the RHS above matches that in Theorem 1 with the non-convex objective. However, unlike non-convex case, we bound the optimization error, i.e., .
We now provide a non-asymptotic bound for the RSG algorithm with (O2) for the convex objective.
Theorem 4. (RSG-BGO under (O2))
Assume (A1). With the oracle (O2), suppose that the RSG-BGO algorithm is run with the stepsize , perturbation constant , and batch size set as defined in (13). Then, for any , we have
where , , constants and are as defined in (O2), and D is as defined in (14).
Proof. See Section VI-B2.
From the bound in Theorem 4, it is easy to see that the iteration complexity is , and the sample complexity is .
B. Stochastic gradient descent algorithm with a biased gradient oracle (SGD-BGO)
In this section, we study a stochastic gradient descent algorithm with a biased gradient oracle. Unlike RSGBGO algorithm whose bounds were for a random iterate, the bounds that we derive for SGD-BGO are for the last iterate, which is the preferred point in practical implementations. The pseudocode for the SGD-BGO algorithm with inputs from a biased gradient oracle is given in Algorithm 2.
Following the approach from [10], we assume the knowledge of the the total number of iterations N
and split the horizon N into l phases. The choice of phase lengths, and the step-size decay in each phase is performed along the lines of [10]. However, unlike their work that assumed unbiased gradient information, we operate in a setting where biased gradient information is available through (O1) or (O2), and this induces significant deviations in the proof. Morever, our setting features a perturbation constant parameter, which has to be chosen in a phase-dependent manner as well.
We make the choice of phases precise below.
From the phase definitions above, it can be seen that is an increasing sequence. Further,
and so on. In the result below, we provide a non-asymptotic bound on the optimization error, i.e., for the SGD-BGO algorithm under (O1).
Theorem 5. (SGD-BGO under (O1)) Assume (A2). With the oracle (O1), suppose that the SGD-BGO algorithm is run with the stepsize , perturbation constant , and batch size set as follows:
for some constant , when , with as defined in (16). Then, for any , we have
are as defined in (O1), and D is as defined in (14).
Proof. See Section VI-C1.
From the bound in Theorem 5, it is easy to see that the iteration complexity of SGD-BGO is , and the sample complexity is .
Remark 5. The analysis used in arriving at the bounds in Theorem 5 cannot be extended to the non-convex case. This is because the analysis takes a dual viewpoint and approaches the minima of the objective from below, and in this process, convexity is strictly necessary. Intuitively, it may be challenging to provide bounds for the last iterate sans averaging in a non-convex optimization setting, while it is possible to provide bounds for the averaged iterate (or the random iterate of RSG-BGO, which is an average in expectation) in the non-convex case.
Theorem 6. (SGD-BGO under (O2)) Assume (A2). With the oracle (O2), suppose that the SGD-BGO algorithm is run with the stepsize , perturbation constant and batch size set as follows:
for some constant , when , with as defined in (16). Then, for any , we have
are as in (O2), and D is as defined in (14).
Proof. See Section VI-C2.
From the bound in Theorem 6, it is easy to see that the iteration complexity of SGD-BGO is , and the sample complexity is . Further, it is interesting to note that the bound with (O2) matches up to constant factors, the bound obtained in [10] for the case when unbiased gradient information is available. Unlike [8], where the authors provide a iteration complexity bound for a random iterate using the RSGBGO algorithm, we provide bound for the last iterate of SGD-BGO. Apart from a practical preference for using the last iterate, an advantage with our approach is that for setting the step size and perturbation constant (17), we do not require the knowledge of Lipschitz constant L (see (A1)) and . The latter quantity is typically unavailable in practice, as it relates to the initial error.
One could specialize the bounds in Theorems 3–6 for the case when the underlying oracles are implemented using SPSA or GS methods, and we omit the details due to space constraints.
We consider a stochastic shortest path (SSP) problem, with a special cost-free absorbing state, say 0. We restrict our attention to ‘proper’ policies, which ensure state 0 is recurrent, and the remaining states are transient in the Markov chain underlying the policy considered. We define an episode as a sample path , where , and is the first passage time to state 0. Consider a smoothly parameterized class of policies . Suppose that the policy is a continuously differentiable function of the parameter x: a standard assumption in policy gradient literature. Let denote the total cost r.v. under policy x starting in state , i.e., , where is the discount factor and is the single-stage cost incurred at time instant t in state on choosing action . Here actions are chosen according to policy , which is parameterized by x. The classic objective in RL is to find a policy that minimizes, in expectation, the total cost. We consider a risk-sensitive RL setting, where the goal is to find a policy that optimizes a certain risk measure, i.e., the following problem:
where parameterizes the policy , and is a risk measure. As examples for the risk measure, one could consider CVaR [1], utility-based shortfall risk (UBSR) [27], and spectral risk measure (SRM) [28]. Notice that the optimization problem in (18) is non-convex in nature. For solving the problem defined above using gradient-based methods, one requires (i) an estimate of the risk measure for any given policy ; and (ii) an estimate of the gradient of the risk measure w.r.t. the policy parameter x. We elaborate on these two parts below.
icy , and collect samples of the total cost . Using these samples, define the empirical distribution function (EDF) of as follows
EDF, we form the estimate of as follows:
Such as estimation scheme for an abstract risk measure has been considered earlier in [29, 30].
Next, we present a bound in expectation for the estimation error associated with (19).
Proposition 1. Suppose the risk measure satisfies the following continuity requirement for any two distributions F, G:
where is the Wasserstein distance3 between distributions F and G, and L is as defined in (A1). Suppose the r.v. satisfies , for any . Then,
for some constant c that depends on B.
Proof. See Section VI-D1.
The continuity requirement in (20) is satisfied by the three popular risk measures CVaR, UBSR and SRM, and the reader is referred to [30] for details.
To construct an estimate of the gradient of the risk measure , one could employ the simultaneous perturbation method, e.g., the estimate in (7). Using this gradient estimate and the template of RSG-BGO algorithm, we arrive at the following update iteration for the risk-sensitive policy gradient (risk-PG) algorithm:
where is the stepsize and is an estimate of the gradient of the risk measure . To elaborate on the gradient estimation aspect of the algorithm above, we first simulate trajectories of the underlying MDP with policy parameters and , respectively. Here is the perturbation constant, and is a standard Gaussian vector. Using (19), we estimate the risk measures corresponding to the aforementioned policy parameters, and then, use (7) to form .
For applying the non-asymptotic bounds derived earlier for RSG-BGO, we require (A1) to hold. We verify this assumption for the special case of CVaR below. Let Cdenote the CVaR associated with a policy that is parameterized by x. Then, using the likelihood ratio method, we arrive at the following variant of the policy gradient theorem under the CVaR objective (cf. [31]):
In the above, and term (I) on the RHS are Lipschitz functions due to the policy gradient assumption mentioned above. In addition, if we assume that the distribution, say , of is a Lipschitz function in x, then we can infer that is sum of product of Lipschitz functions, implying assumption (A1). One could generalize this argument to the case when is a coherent risk measure, and the reader is referred to [31] for details.
From the foregoing, it is apparent that the risk-PG oracle implemented using the gradient estimate (7) falls under the biased gradient oracle with an estimation error component scheme, i.e., either (O1) or (O2). For the risk-PG oracle to be of type (O2), one would require an additional assumption that allows interchange of the expectation and differentiation operators, see Proposition 2.3 in [23] for an example. Using the non-asymptotic bounds for RSG-BGO algorithm derived earlier, we can infer that the iteration complexity for the Risk-PG algorithm is: (i) under the oracle (O1) (using Theorem 1); and (ii) under the oracle (O2) (using Theorem 2).
For notational convenience, we shall use . Let , and denote the expectation w.r.t. .
A. Proofs for RSG-BGO algorithm with a non-convex objective
1) Proof of Theorem 1: In the proposition below, we state and prove a general result that holds for any choice of non-increasing stepsize sequence, perturbation constants and batch sizes. Subsequently, we specialize the result for the choice of parameters suggested in Theorem 1, to prove the same.
Proposition 2. Assume (A1) and (A2). With the oracle (O1), suppose that the RSG-BGO algorithm is run with a non-increasing stepsize sequence satisfying and with the probability mass function
then, for any , we have
where , constants are as defined in (O1), B is as defined in (A2), and is given in (11).
Proof. Let . We use the technique from [8]. However, our proof involves significant deviations owing to the fact that the gradient estimates in (O1) have variance that scales inversely with perturbation constant . Further, unlike [8], we have the batch size parameter that needs to be optimized.
First, notice that
and
where , and
Under assumption (A1), we have
Taking expectations with respect to on both sides of (25) and using (23) and (24), we obtain
+ + L2 + 2
where we have used the fact that for any vector X in arriving at the inequality (26). The last inequality follows from (A2). Re-arranging the terms, we obtain
Now, summing up the inequality above over k = 1 to N, and taking expectations with respect to the filtration generated by , we obtain
The bound in (22) follows by using the distribution of R (specified in (21)) in the RHS above.
We now specialize the result obtained in the proposition above, to derive the bounds in Theorem 1.
Let . Then, combining (21) with (22), we obtain
In the above, inequality (28) follows by using the fact that , and the inequality (29) follows by using the definition of and m. The bound in (10) follows by rearranging terms.
Proof. (Theorem 1 (ii)) Recall the stepsize , perturbation constant from (27). Let , for some constant and . Combining (21) with (22), we obtain
In the above, inequality (30) follows by using the fact that and . The inequality (31) follows by using the definition of and . The bound in (12) follows by rearranging terms.
2) Proof of Theorem 2 :
Proof. Following the proof of Proposition 2, we obtain
where . Using arguments similar to those employed in Theorem 1, we obtain
The main claim follows by plugging values of , and m, as defined in Theorem 2 in the above equation.
B. Proofs for RSG-BGO algorithm with convex objective
1) Proof of Theorem 3: In the proposition below, we state and prove a general result that holds for any choice of non-increasing stepsize sequence, perturbation constants and batch sizes. Subsequently, we specialize the result for the choice of parameters suggested in Theorem 3, to prove the same.
Proposition 3. Assume (A1) and (A2). With the oracle (O1), suppose that the ZRSG algorithm is run with a non-increasing stepsize sequence satisfying and with the probability mass function as defined in (21), then, for any , we have
where , constants and are as defined in (O1), and D as defined in 14.
Proof. Let and . Then for any k = 1, . . . , N, we have,
Taking expectations with respect to on both sides of (33), and using (23), (24), we obtain
where the last inequality follows from the fact that for any vector X. Now, using the fact that is convex, we have , further from (A1), we have . Plugging it in equation above, we obtain,
where the last inequality follows from the fact that is convex along with for any vector X. Re-arranging the terms, we obtain
Now summing up the inequality above from k = 1 to N and taking expectation on both sides of above equation, we obtain
Using the fact that and for all , we obtain
where the last inequality follows by using 14, i.e., . We conclude by combining the above result with (21).
Proof. (Theorem 3)
Recall that the stepsize , perturbation constant and batch size is as defined in (27). Combining (21) with (32), we obtain
In the above, inequality (34) follows by using the fact that , and the inequality (35) follows by using the definition of and m.
2) Proof of Theorem 4:
Proof. Following the initial passage in the proof of Theorem 3 with (O1), we obtain the following inequality for the case of (O2) considered here:
where . Now, using the choice of parameters in (13), followed by simplifications similar to those in the proof of Theorem 3, we obtain
C. Proofs for SGD-BGO algorithm
1) Proof of Theorem 5: The proof proceeds through a sequence of lemmas. We follow the technique from [10], and our proof involves significant deviations owing to the fact that unbiased gradient information is not available, leading to additional terms involving perturbation constants (arising out of gradient bias), and batch sizes (arising due to estimation errors).
Recall that is defined as follows:
Further, when , stepsize , perturbation constant , and batch size is defined as follows:
for some constant . Note that, unlike [10], parameters and are local to our setting, and due to the inverse scaling of variance in gradient estimates with , the stepsizes chosen is of and not .
We divide the proof into phases , let be the output of the SGD-BGO algorithm. We start with a variant of Lemma 1 from [10]. In comparison to their result, our claim below features additional factors involving perturbation constant and batch size owing to the zeroth-order setting we consider.
Lemma 1. Assume (A2). With the oracle (O1), suppose that the SGD-BGO algorithm is run with stepsize sequence . Then, given any , we have
, constants is as defined in (O1) and D is as defined in (14).
Proof. Let and . Then for any k = 1, . . . , N, we have
Taking expectations with respect to on both sides of (38), and using (23), (24), we obtain
where the last inequality follows from the fact that for any vector X. Now, using (A2), i.e., , we obtain
where the last inequality follows from the fact that is convex along with for any vector X. Re-arranging the terms, we obtain
We conclude by summing the above equation over to , taking expectations, and using (14), i.e., .
Lemma 2. Under conditions of Lemma 1, with , for any , we have
where is as defined in (O1), B is as defined in (A2) and D is as defined in (14).
Proof. Let , then we have . Using the definition of convexity, we obtain
where we have used the identity in arriving at the equality in (39). Using , we have
Taking expectation, we obtain
where the last inequality follows from the fact that for any vector X. We conclude by summing (40) over k, with , and using the fact that .
Proof. (Theorem 5) Recall the definition of from equation (36) and let be defined as follows:
We split the horizon N into l phases, then to show that the function value for the final iterate in the last phase () is close to optima . Using the fact that , we have
Now to bound , we first consider the case when . Using Lemma 1 with and , we obtain
The inequality in (43) follows from the fact that and are decaying in a phase-dependent manner (see (37)). Note that from the definition of 0 whenever . Thus, we have
where the second inequality follows from the assumption that , and the fact that . The last inequality follows from the Lemma 4 of [10]. Combining (44) and (45), we obtain
This completes the proof for the case when . The proof for the case when i = 0 follows in a similar manner. Plugging (46) into (42), we obtain
Note that for all , we have step size and perturbation constant . Let be the output of SGD-BGO algorithm, then using the fact that infimum is smaller than any weighted average, we have
where the inequality in (48) follows from the fact that , the inequality in (49) follows from the Lemma 2 and the final inequality follows from the fact that . We conclude by plugging (50) into (47) to obtain
2) Proof of Theorem 6: The proof proceeds through a sequence of lemmas, similar to the proof of Theorem 5 in Section VI-C1 under the oracle (O1).
Lemma 3. Assume (A2). With the oracle (O2), suppose that the SGD-BGO algorithm is run with stepsize sequence . Then, given any , we have
, constants is as defined in (O2), B is as defined in (A2), and D is given in (14).
and .
Lemma 4. Assume (A2). With the oracle (O2), suppose that the SGD-BGO algorithm is run with a constant stepsize and perturbation constant, i.e., . Then, for any , we have
where is as defined in (O2), B is as defined in (A2), and D is as defined in (14).
Proof. The proof follows in a similar manner as that of Lemma 2, with the following modification: .
Proof. (Theorem 6) Using a parallel argument to the initial passage in the proof of Theorem 5 leading upto equation (46), we obtain
Plugging (51) into (42), we get
As in the proof of Theorem 5, we obtain
where we used Lemma 4 and the fact that . We conclude by plugging (53) into (52) to obtain
D. Proof for Risk-Sensitive Reinforcement Learning
1) Proof of Proposition 1:
Proof. Let F denote the distribution of . Then, we have
where the final inequality follows by using Theorem 3.1 of [32].
Motivated by practical applications involving biased function measurements, we formulated two biased gradient oracles with an additive estimation error component. The first oracle featured a bias-variance tradeoff for the gradient estimates, while the second one did not have such a tradeoff. We studied the non-asymptotic performance of gradient-based algorithms with inputs from a biased gradient oracle in convex as well as non-convex optimization regimes. Further, we highlighted the applicability of our proposed biased gradient oracles in a risk-sensitive reinforcement learning setting.
[1] R. T. Rockafellar and S. Uryasev, “Optimization of conditional value-at-risk,” Journal of Risk, vol. 2, no. 3, pp. 21–41, 2000.
[2] M. C. Fu, Ed., Handbook of Simulation Optimization. Springer, 2015.
[3] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth, Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods (Lecture Notes in Control and Information Sciences). Springer, 2013, vol. 434.
[4] Y. Nesterov and V. Spokoiny, “Random gradient- free minimization of convex functions,” Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017.
[5] J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE Transactions on Automatic Control, vol. 37, no. 3, pp. 332–341, 1992.
[6] ——, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley & Sons, 2005, vol. 65.
[7] X. Hu, L. A. Prashanth, A. György, and C. Szepesvári, “(Bandit) Convex Optimization with Biased Noisy Gradient Oracles,” in Artificial Intelligence and Statistics, 2016, pp. 819–828.
[8] S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[9] K. Balasubramanian and S. Ghadimi, “Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates,” in Advances in Neural Information Processing Systems, 2018, pp. 3459–3468.
[10] P. Jain, D. Nagaraj, and P. Netrapalli, “Making the last iterate of sgd information theoretically optimal,” SIAM Journal on Optimization, vol. 31, no. 2, pp. 1108–1130, 2021.
[11] O. Devolder, F. Glineur, and Y. Nesterov, “First- order methods of smooth convex optimization with inexact oracle,” Mathematical Programming, vol. 146, no. 1, pp. 37–75, 2014.
[12] M. Baes, “Estimate sequence methods: Extensions and approximations,” IFOR Internal report, ETH Zurich, Switzerland, Tech. Rep., 2009.
[13] A. d’Aspremont, “Smooth optimization with ap- proximate gradient,” SIAM Journal on Optimization, vol. 19, pp. 1171–1183, 2008.
[14] L. M. Nguyen, K. Scheinberg, and M. Takáˇc, “Inex- act SARAH algorithm for stochastic optimization,” Optimization Methods and Software, vol. 36, no. 1, pp. 237–258, 2021.
[15] R. Bollapragada, R. H. Byrd, and J. Nocedal, “Ex- act and inexact subsampled Newton methods for optimization,” IMA Journal of Numerical Analysis, vol. 39, no. 2, pp. 545–578, 2019.
[16] R. Pasupathy, P. Glynn, S. Ghosh, and F. S. Hashemi, “On sampling rates in simulation-based recursions,” SIAM Journal on Optimization, vol. 28, no. 1, pp. 45–73, 2018.
[17] R. J. Serfling, Approximation theorems of mathematical statistics. John Wiley & Sons, 2009, vol. 162.
[18] S. P. Bhat and L. A. Prashanth, “Concentration of risk measures: A Wasserstein distance approach,” Advances in Neural Information Processing Systems, vol. 32, pp. 11 762–11771, 2019.
[19] V. Y. Katkovnik and Y. Kulchitsky, “Convergence of a class of random search algorithms,” Automation Remote Control, vol. 8, pp. 1321–1326, 1972.
[20] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer Verlag, 1978.
[21] L. A. Prashanth, S. Bhatnagar, M. C. Fu, and S. I. Marcus, “Adaptive system optimization using random directions stochastic approximation,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2223–2238, 2017.
[22] L. A. Prashanth, S. Bhatnagar, N. Bhavsar, M. C. Fu, and S. I. Marcus, “Random directions stochastic approximation with deterministic perturbations,” IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2450–2465, 2020.
[23] S. Asmussen and P. W. Glynn, Stochastic simulation: algorithms and analysis. Springer Science & Business Media, 2007, vol. 57.
[24] P. Glasserman and Y.-C. Ho, Gradient estimation via perturbation analysis. Springer Science & Business Media, 1991, vol. 116.
[25] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimiza- tion methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[26] Z. Shen, A. Ribeiro, H. Hassani, H. Qian, and C. Mi, “Hessian aided policy gradient,” in International Conference on Machine Learning. PMLR, 2019, pp. 5729–5738.
[27] H. Föllmer and A. Schied, “Convex measures of risk and trading constraints,” Finance and stochastics, vol. 6, no. 4, pp. 429–447, 2002.
[28] C. Acerbi, “Spectral measures of risk: A coherent representation of subjective risk aversion,” Journal of Banking & Finance, vol. 26, no. 7, pp. 1505– 1518, 2002.
[29] A. Cassel, S. Mannor, and A. Zeevi, “A general ap- proach to multi-armed bandits under risk criteria,” in Proceedings of the 31st Conference On Learning Theory, 2018, pp. 1295–1306.
[30] L. A. Prashanth and S. P. Bhat, “A Wasserstein distance approach for concentration of empirical risk estimates,” arXiv preprint arXiv:1902.10709, 2020.
[31] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Man- nor, “Policy gradient for coherent risk measures,” in Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc., 2015.
[32] J. Lei, “Convergence and concentration of empir- ical measures under Wasserstein distance in unbounded functional spaces,” Bernoulli, vol. 26, no. 1, pp. 767–798, 2020.