b

DiscoverSearch
About
My stuff
Tightly Robust Optimization via Empirical Domain Reduction
2020·arXiv
Abstract
Abstract

Data-driven decision-making is performed by solving a parameterized optimization problem, and the optimal decision is given by an optimal solution for unknown true parameters. We often need a solution that satisfies true constraints even though these are unknown. Robust optimization is employed to obtain such a solution, where the uncertainty of the parameter is represented by an ellipsoid, and the scale of robustness is controlled by a coefficient. In this study, we propose an algorithm to determine the scale such that the solution has a good objective value and satisfies the true constraints with a given confidence probability. Under some regularity conditions, the scale obtained by our algorithm is asymptotically  O(1/√n), whereas the scale obtained by a standard approach is  O(�d/n).This means that our algorithm is less affected by the dimensionality of the parameters.

1.1 Problem Setting

Data-driven decision-making must manage stochastic uncertainty due to a limited number of available data samples. Suppose that a decision is made by solving the following parameterized optimization problem [6]:

image

where  θ ∈ Rdis a parameter,  X ⊆ Rmis a decision domain,  x ∈ Xis a decision variable, f : Rm → Ris an objective function, and  g1, . . . , gK : Rm × Rd → Rare parameterized constraint functions that are linear in  θ. The ideal decision is given by an optimal solution x(θ∗)under true parameter  θ∗; however, θ∗ is unknown to us and must be estimated from i.i.d. samples  θi(i = 1, . . . , n) from a distribution  P ∗whose expectation is  θ∗.

In practice, we often need a solution that satisfies the true constraints (i.e., constraints with true parameters) even though these are unknown. In such a case, a naive approach that solves the problem with estimated parameters would be inadequate because the obtained solution might not satisfy the true constraints. Rather, robust optimization [2, 6] can be used as follows. For a positive semidefinite matrix  Σ, we define  UΣ := {u ∈ Rd | u⊤Σ−1u ≤ 1}. The robust optimization problem (with ellipsoidal uncertainty) is then given by

image

where  λis a parameter called the nonnegative scale. Let  x(λ; θ, Σ)be the solution of the above problem. Then,  x(λ; θ, Σ)satisfies the true constraints if  θ∗ − θ ∈ λUΣ. This means that if  λ islarge, it will most likely satisfy the true constraints but may have a poor best-case performance. Conversely, if  λis small, it will have a good best-case performance but also a high risk of violating the true constraints. Therefore, the choice of  λis a significantly important task [8].

In this study, we consider the problem of finding a  λsuch that the problem will have a small objective value and satisfy the true constraints with a sufficiently high probability. Formally, the problem is defined as follows. Let  f ∗be the true objective function:

image

Let ˆθnbe the empirical average of  θ1, . . . , θn, ˆΣnbe an estimate of the covariance matrix  Σ∗defined by ˆΣn := (1/n) �ni=1(θi − ˆθn)(θi − ˆθn)⊤, and ˆxn(λ)be defined by  ˆxn(λ) := x(λ; ˆθn, ˆΣn).Our goal then is to find an algorithm A that determines a scale  λfrom the observation  θnsuch that the (1 − δ)-quantile of  f ∗(ˆx(A(θn))), i.e., the  value-at-risk VaRδ[f ∗(ˆx(A(θn)))], is minimized as follows 1:

Problem 1. Given  δ > 0, find an algorithm  A : θn �→ λthat minimizes  VaRδ[f ∗n(ˆx(A(θn)))].

1.2 Standard Approach

A standard approach for Problem 1 determines a  λ such that ˆθn − θ∗ ∈ λUΣholds with a desired probability [4, 7, 10]. Let  Σ∗ := Eθ∼P[(θ − θ∗)(θ − θ∗)⊤]be the true covariance matrix and x∗(λ) := x(λ; θ∗, Σ∗). We then obtain the following:

Proposition 2. Suppose that ˆθn − θ∗ ∈ λUΣ∗with probability at least  1 − δ. It then holds that VaRδ[f ∗(x(λ; ˆθ, Σ∗))] ≤ f ∗(x∗(2λ)).

If ˆθnis (asymptotically) distributed by a normal distribution  N(θ∗, Σ∗/n)with mean  θ∗and covariance  Σ∗/n, then such  λwill be determined by  λ = χ−1d (1 − δ)/√n, where  χdis a chi-distribution with a degree of freedom d.

Observe that such a  λwill be of  O(�d/n). Thus, the speed of convergence depends on the dimension d of the parameter  θ. The speed of convergence will be especially degraded when a large number of features (d ≫ 1) are available, but only a small number of features is useful, which will cause the so-called “curse of dimensionality” in robust optimization.

1.3 Our Approach and Results

In this study, we prove the following theorem, which improves the convergence rate of Corollary 2:

Theorem 3 (Informal version of Theorem 20). There is an algorithm A, which is given in Algorithm 2, such that the output ˆλ = A(θn)satisfies

image

where

image

It is important to note that our  λnis asymptotically independent of the dimensionality of the parameter  θ, i.e., the speed of convergence is less affected by the dimensionality of the parameter  θ.

The theorem is proved in the following steps. We first derive a new inequality on a subspace. We then establish methods to apply the new inequality effectively.

1.3.1 New Inequality on Subspace

Recall that Corollary 2 is derived from the following fundamental fact in robust optimization.

Fact 4. Let  λ ≥ 0. If  θ ∈ Rdand  Σ ⪰ Osatisfy  θ − θ∗ ∈ λUΣ, then

image

for  arbitrary g : Rm × Rd → R and x ∈ Rm. Here, we observe that this is too conservative. If we consider given  gk (k = 1, . . . , K) and a subspace  Y ⊆ Xthat includes the resulting solution  x(λ; θ, Σ), we can prove the same inequality (3) under a weaker assumption. Let rk(x; Σ) := maxu∈UΣ gk(x, u). For Y ⊆ Rd, let S(λ, Y; Σ)be the set of errors that does not cause a violation of the true constraint over Y, i.e.,

image

We then obtain the following lemma:

Lemma 5. Let  λ ≥ 0. If  θ ∈ Rd, Σ ⪰ O, and  Y ⊆ Rdsatisfy

image

then (3) holds.

Lemma 5 generalizes Fact 4 because the assumption of Fact 4 is a sufficient condition in Lemma 5 for Y = X since  λUΣ ⊆ S(λ, Y; Σ)holds for any Y, and (5) is trivial with Y = X. Our idea is to use Lemma 5 rather than Fact 4 to derive better bounds than Corollary 2.

The following example shows the effect of choosing a different Y.

Example 6. Let us consider the following d = 2-dimensional parameterized problem:

image

The true parameters are  θ∗ = (2, 1)⊤, and the true covariance of samples is  Σ∗ = I2, where Id is

an identity matrix of size d.

Figure 1 illustrates  S(λ, Y) = S(λ, Y; I2) for Y = R2, R2+, and Rx1+ := {(x1, 0)⊤ ∈ R2 | x1 ≥0}, where all these Y include the true optimum solution  x∗ = (1, 0)⊤. The figure clearly shows that, with a fixed  λ, a smaller Y results in a larger S:

image

Recall that Fact 4 and Lemma 5 lead us to define  λin order to satisfy ˆθn − θ∗ ∈ S(λ, Y)with desired probability  1 − δ.

More specifically, let ˆθnbe distributed by the normal distribution  N(θ∗, I2) and let δ = 0.1.Fact 4 with  UI2(= S(λ, R2))then requires  λ = χ−12 (1 − δ) ≈ 2.1, while Lemma 5 together with S(λ, Rx1+ ) requires λ = χ−11 (1−δ) ≈ 1.6. Since a smaller  λyields a closer solution to the optimal one, this indicates the possibility of having a better solution.

For an effective use of Lemma 5, we must be able to evaluate the probability  Prob(ˆθn − θ∗ ∈S(λ, Y; Σ∗))for  Y ⊆ X, and must find a good subspace  Y ⊆ Xwithout knowing  θ∗and  P∗.

image

Figure 1: Plot of S(λ, Y) = S(λ, Y; I2) for Y = R2, R2+, and Rx1+ , in a two-dimensional toy problem.

1.3.2 Probability Evaluation

In order to evaluate probability without  θ∗ and P∗, in Section 3 we characterize  S(λ, Y; Σ∗) asan intersection of Minkovski’s sum of an ellipsoid and a dual cone. Since such an intersection is a convex set, we can apply the high-dimensional Berry-Esseen theorem for convex sets [5] in order to approximate the distribution of ˆθn − θ∗by a normal distribution while maintaining a theoretical guarantee. We then propose a sampling-based algorithm for measuring the probability with sufficient accuracy.

1.3.3 Subspace Selection

To find good subspace  Y ⊆ X, in Section 4 we extend the two-step empirical domain reduction algorithm proposed for risk minimization with a single uncertainty objective [16] to our context for robust optimization with multiple uncertain constraints. Such a reduced space must satisfy (4) and (5) in Lemma 5 with high probability. For (4), since the reduction is based on empirical estimate ˆθn, the resulting  S(λ, Y; Σ)also includes stochastic uncertainty. We must thus avoid over-fitting on a given single sample and keep consistency with other sampling scenarios. In addition, (5) requires that the solution  x(λ; θ, Σ)must be included in Y without forcing it by adding constraints. These two requirements require precise control of domain reduction, which can be done using our empirical two-step domain reduction algorithm.

1.4 Related studies

Our approach can be summarized in terms of the following concepts: inequality on subspaces, probability evaluation, and subspace selection. While these concepts are novel in the context of robust optimization, they have already been proven to be effective in the context of variance-based regularization for risk minimization in machine learning. Let us note that variance-based regularization regularizes uncertain objectives by means of scaled (square-root of) variance, and thus it is similar to the robust optimization with ellipsoidal uncertainty that introduces redundancy into uncertain constraints, which redundancy is proportional to (square-root of) variance.

In the context of risk minimization, the relationship between the size of an optimization domain and the speed of convergence has been characterized by various complexity measures, such as VC dimension [15], covering number [13], and Rademacher complexity [1]. Studies of variance-based regularization [13, 14] have determined the scale of a regularizer on the basis of these complexity measures in order for a regularized empirical risk function to bound the true objectives with a desired probabilities. A recent study of empirical hypothesis space reduction [16] has achieved, by means of subspace selection, an acceleration of convergence that is asymptotically independent of the dimensionality of uncertain parameters.

To make use of the approaches in [16], which is designed for variance-based regularization, in robust optimization, we have dealt with the issues below. Robust optimization has several uncertain constraints, while risk minimization has a single uncertain objective. In particular, while an uncertainty objective does not influence the feasible domain, uncertain constraints do influence, which increases the difficulty of subspace selection. In addition, in robust optimization, it is often assumed that uncertain functions are linear with respect to uncertain parameters, and this linearity has led us to propose a novel sampling-based evaluation of violation probability.

2.1 Technical lemmas

This section introduces a series of lemmas. The following Gershgorin circle theorem characterizes the list of eigenvalues of a matrix  A ∈ Rd×dby its elements  aij.

Lemma 7 (The Gershgorin circle theorem; see [11]). Let  A = (aij) ∈ Rd×d, and define Ri = �j̸=i |aij|. Under suitable ordering, the set of eigenvalues  λ1, . . . , λd of A satisfy aii −Ri ≤λi ≤ aii + Ri.

Hoeffding’s inequality below bounds the gap between sample average and true average.

Lemma 8 (Hoeffding’s inequality; see [9]). Let  Z, Z1, . . . , Znbe i.i.d. random variables with values in [0, 1] and let  δ > 0. Then we have

image

The following Berry-Esseen theorem enables us to evaluate the speed of convergence of the central limit theorem (for tighter coefficient  c1in a one-dimensional case, see [12]).

Lemma 9 (The high-dimensional Berry-Esseen inequality [5]). Let d ≥ 1be an integer. Let  Cdbe a set of convex set in  Rdand  cd := 400d1/4. Let  Z1, . . . , Znbe i.i.d. random variables over Rdwith  E[Zi] = 0, E[ZiZ⊤i ] = Id, and  E[∥Zi∥3] ≤ τ. Let  Z = (1/√n) �ni=1 Zi. It then holds that

image

2.2 Estimation accuracy

We can show here the accuracy of estimators ˆθn and ˆΣnon the basis of the lemmas introduced in the previous section and the following assumption. Let  P∗∆denote the distribution of ∆ = Σ∗−1/2(θ − θ∗)where  θ ∼ P∗.

Assumption 10. An upper-bound  τ of E∆∼P∗∆[∥∆∥3] and τ ′ of maxj=1,2,...,d E∆∼P∗∆[∆6j] is given.

We here assume that (an upper-bound of) the third moment  τand sixth moment  τ ′ is known,since the speed of convergence is less influenced by the higher-order moments  τ and τ ′ than bythe first and second moments (θ∗ and Σ∗). In practice,  τ and τ ′ can be calculated if the domain of the distribution  P∗is bounded, or it can also be estimated using samples  θn.

Under Assumption 10, the distribution of ˆθnconverges to the normal distribution  N(0, Σ∗/n)by the central limit theorem. Recall the definitions of  Cdand  cd := 400d1/4in Lemma 9. The speed of convergence can be characterized by  τ, τ ′, and Lemma 9 as follows:

Lemma 11. It holds that

image

For  δ′ > 0, let us define  ηd,n(δ′)by

image

The relative accuracy of estimator ˆΣncan be characterized by Lemma 7 and Lemma 9 as follows.

Lemma 12. For  δ′ > 0, the following then holds with probability at least  1 − δ′:

image

Note that the above bound can be improved by using the Berry-Esseen theorem that is specialized for a one-dimensional random variable, rather than by using Lemma 9. Since this improvement does not influence the order of convergence, we here use Lemma 9 for simplicity of presentation.

This section proposes an algorithm that realizes the concept of probability evaluation discussed in Section 1.3.

3.1 Geometric characterization of S(µ, Y; Σ)

Recall that  S(λ, Y; Σ) ⊆ Rd is the set of estimation error  θ − θ∗ that does not cause a violation of the true constraint over Y with the scale  λ. Here we prove that S is a convex set for any  λ, Y, and  Σ.

Since  gkis linear in  θ, there exists a function  vk : X → Rdsuch that

image

For  Y ⊆ X, we define  Vk(Y)by

image

We then define the polar cone  C+k (Y)and dual cone  C−k (Y)of  Vk(Y)by

image

The polar cone  C+k (Y)and the dual cone  C−k (Y)are convex cones. For a pair of set  U, S ⊆ Rd,their Minkowski’s sum is defined by  U + S := {u + s | u ∈ U, s ∈ S}. Note that Minkowski’s sum of two convex sets is also a convex set. The following theorem characterize  S(µ, Y; Σ)as an intersection of convex sets:

Lemma 13. For any  λ ≥ 0, Y ⊆ X, and  Σ ⪰ 0, S(λ, Y; Σ)is a convex set. In particular, if gk(x, θ)is linear in x for all k = 1, 2, . . . , K and Y is convex, then it holds that

image

3.2 Normal approximation via multidimensional Berry-Esseen’ theorem

On the basis of [16, Definition 5], we here introduce the minimum spatial uniform bounds µ(α, Y; P, Σ)as a minimum scale  µ ≥ 0that satisfies (4) with probability  αwhen  θ − θ∗is distributed by P:

Definition 14. Let p ∈ [0, 1], Y ⊆ X, Σ ⪰ O, and Pbe a distribution over  Rd. The minimum spatial uniform bound  µ(α, Y; P, Σ) ∈ R+is defined by:

image

Let us define  P∗nas the distribution of  ˆθn − θ∗. The ideal goal of this section, then, is to obtain  µ(α, Y; P∗n, Σ∗). µsatisfies the following component-wise monotonicity. Let us denote NΣ := N(0, Σ).

image

Recall that Lemma 13 shows the convexity of  S(µ, Y; Σ)for arbitrary  µ, Y, and Σ. We canthen apply Lemma 11 to bound the gap of probability in (8) for  P∗nand the corresponding normal  NΣ∗/n. Combined with the monotonicity shown in Lemma 15 (i), we can upper-bound µ(α, Y; P∗n, Σ∗)of unknown distribution  P∗nby its asymptotically normal counterpart:

Proposition 16. It holds that

image

3.3 Sampling algorithm for calculating spatial uniform bounds for normal distribution

This section fixes  p ∈ [0, 1], Y ⊆ X, and Σ ⪰ O, and we then denote  µ′(p) = µ(p, Y; NΣ, Σ) fornotational simplicity. Proposition 16 reduces our goal to that of calculating a minimum spatial uniform bound  µ′(p). Lemma 15 (i) shows that such a  µ′ is monotone in  p, and µ′ satisfies the following property:

image

Thus  µ′is monotone, upper and lower bounded, and its inverse  µ′−1can be calculated by a series of sampling and optimization. This characterization leads the following algorithm for calculating  µ′by sampling from  NΣand a binary search, as shown in Algorithm 1.

Given  p ∈ [0, 1], Y ⊆ X, and Σ ⪰ O, together with parameters  α, β, γ ≥ 0deciding accuracy, Algorithm 1 calculates a  λthat approximates  µ′(p). Line 1 of the algorithm first defines the number Q of samples by

image

Line 2, then, generates samples  εi for i = 1, 2, . . . , Qfrom the normal distribution  NΣ. Lines 3–18 estimate  µ′(p)by the binary search. Line 4 defines the upper-bound  λand the lower-bound  λ onthe basis of Lemma 17. Line 6 updates  λas a mean of the upper-bound  λand the lower-bound λ, and then Lines 7–12 approximately calculate  µ′−1(λ)on the basis of the samples  εifor i = 1, 2, . . . , Q and the following lemma:

Lemma 18. For  ε ∈ Rd, ε ∈ S(λ, Y; Σ)if and only if  ψ(µ, ε) ≥ 0, where

image

For each  εi, Line 8 calculates the optimization problem (10) for  ε = εi, and then Lines 9– 11 approximately calculate  µ′−1(λ)by counting the number of such  εi ∈ S(λ, Y; Σ)for i = 1, 2, . . . , Q. Note that some of the calculation of  ψ(λ, εi)can be omitted in practice since  ψ(λ, εi)is monotonically increasing in  λ. Lines 13–17 update the upper-bound  λor the lower-bound  λ onthe basis of the approximate probability  q of µ′−1(λ). Finally, Line 19 outputs the approximate upper-bound  ˙µ = λof  µ′(p).

This output  ˙µsatisfies the following probabilistic guarantee.

Proposition 19. With probability at least  1 − α, the output  ˙µof Algorithm 1 satisfy

image

The computational tractability of Algorithm 1 relies heavily on the choice of Y, and this is discussed in the next section, which utilizes Algorithm 1.

This section presents an algorithm that achieves our main theoretical result via the concept of subspace selection discussed in Section 1.3.

image

4.1 Two-step domain reduction algorithm

Let us define constants  αn, βn, γn, ηn, δn > 0and empirically reduced domain ˆY(w, µ) for w, µ ≥ 0by

image

We can then propose Algorithm 2 for the calculation of the scale ˆλof robustness. Lines 1–4 represent the first stage of this algorithm, which conducts subspace selection. Line 1 first calculates coefficient  ˙µusing Algorithm 1 with Y = X. Line 2 defines  ˆµby scaling  ˙µ, and thenLine 3 calculates optimum value  ˆw = f(ˆx(3ˆµ))by solving (2) on the basis of scale  ˆµ. Line 4 conduct subspace selection by defining ˆY = ˆY( ˆw, ˆµ/(1 − ηn)). Lines 5–6 represent the second stage of the algorithm, which calculates scale ˆλon the basis of the subspace ˆY. Line 5 calculates ˙λusing Algorithm 1 with  Y = ˆY, and Line 6 then outputs  ˆλwhich defined by scaling  ˙λ.

image

4.2 Theoretical analysis

Let  n0be the minimum integer n that satisfies

image

Note that such  n0must exist since  δn, ηn = O(1/√n). Our main theoretical result can then be given as follows:

Theorem 20. Suppose that  n ≥ n0, and  limε→+0 Y∗(f(x∗(0)) + ε, ε) = {x∗(0)}. Then the output ˆλof Algorithm 2 satisfies

image

where

image

Thus, under the uniqueness of the true optimum solution  x∗(0), the estimated scale ˆλis asymptotically independent of the dimension d.

The essence of the proof of this statement lies in finding subspaces  Yn, Yn ⊆ Xthat satisfy the following two conditions: (i) with high probability, it holds that

image

and (ii)  limn→∞ Yn = {x∗(0)}. Property (i), together with Lemma 5, implies contribution to proof of (11), and the property (ii) implies asymptotic convergence (12). Since ˆYis controlled by the estimated optimum value  ˆw, for the existence of both  Yn and Yn, we need to control the range of estimated optimum value  ˆw. We utilize the following lemma for this control:

Lemma 21. Let  µ ≥ 0and  κ ≥ 1. Suppose that  θ ∈ Rd, Σ ⪰ O, and  Y ⊆ Rdsatisfy

image

Roughly speaking, we bound the range of  ˆwin probability by applying Lemma 21 with  κ = 3and  µ = ˆµ.

Let us conclude this section with a brief discussion of the computational tractability of our algorithms.

image

Figure 2: Value-at-risk  VaR[f∗(ˆxn(λ))] for scaleλ. The vertical line shows the value-at-risk, and the horizontal line shows the number of samples. Respective yellow, green, blue lines show the result with  λ = χ−11 (1 − δ), χ−1d (1 − δ), and ˆλn.

Figure 3: Scale of robustness  λ. Vertical line shows the value of scale of robustness  λ, and thehorizontal line shows the number of samples. Respective yellow, green, blue lines indicate the value of  λ = χ−11 (1 − δ), χ−1d (1 − δ), and √nˆλn.

Remark 22. Algorithm 2 mainly consists of an optimization (3) in Line 3 and two applications of Algorithm 2 in Lines 1 and 5, where each application consists of  O(n log2 n)optimization of (10). If the original robust optimization (3) is a convex programming problem, then (10) with Y = X is a series of K convex programmings. It is thus natural to assume that Lines 1 and 3 are computationally tractable.

For the second application of Algorithm 1, in Line 5, note that the subspace  Y = ˆYwill not generally be convex because of the concavity of  −rk. In practical implementation, we can replace  rkin the definition of ˆY(w, µ)by an upper-bound  ukthat makes ˆYconvex. If such a  ukcan be bounded above as  uk(x; Σ) ≤ κrk(x; Σ)with some constant  κ ≥ 1, then the resulting output ˆλof Algorithm 2 will also satisfy the guarantee in Theorem 20.

Let us introduce a simple example of such an upper-bound  uk. Suppose that X is the positive orthant (X = Rd+) and that  gk(x, θ) = θ⊤x. Then  rk(x, Id) = ∥x∥2, where  ∥ · ∥qis the  ℓq-norm for  q ≥ 0. It then holds that  rk(x, Id) ≤√d∥x∥1. Observe that  ∥ · ∥1is linear on the positive orthant X, and thus this  ukis computationally tractable.

5.1 Experimental setting

Let us consider the following simple portfolio optimization problem, which has been examined by several existing studies in robust optimization [8, 3]. Suppose that there are d items whose costs are  θ1, . . . , θd, which are uncertain parameters. The task is to find a convex combination (portfolio) that minimizes the cost:

image

This problem has an uncertain objective function. Thus, we first translate this problem into the our setting (1). Let us introduce an additional variable  x0. We denote by x the set of variables x = (x0, x1, . . . , xd) over Rd+1. We also introduce a dummy parameter  θ0 = 1. We define X, f,and g by

image

Then the portfolio optimization problem (13) is captured by our setting (1).

For this problem, we generate synthetic data as follows. We fix  δ = 0.3, d = 20, and θ∗ = (−1, −0.9, −0.8, . . . , 0.8, 0.9)⊤. We set the covariance matrix  Σ∗as a diagonal matrix Σ∗ = diag(σ21, σ22, . . . , σ2d)where  σifollows a uniform distribution over [0, 10]. For each sample size n and true covariance matrix, we generate 20 sets of n i.i.d. samples from  N(θ∗, Σ∗), overwhich samples value-at-risk will be approximately calculated. All our experimental results are average over 30 generation of such covariance matrix  Σ∗. For a solution  ˆxthat does not satisfy the true constraint, we put  f ∗(ˆx) = 1instead of  f ∗(ˆx) = ∞for the purpose of clear visualization.

To make Algorithm 2 practical, we applied the following approximations: While we bound the gap of the probability mass of the true distribution of ˆθn − θ∗ and NΣ∗/n by cdτ/√n in ourtheoretical analysis, we simply replace the true distribution by its normal estimate  NˆΣn/nin implementation by putting  cd = τ = 0. Also, our theoretical analysis carefully replace the true covariance matrix  Σ∗ by its probabilistic upper-bound ˆΣn/(1 − ηn)or lower-bound ˆΣn/(1 + ηn),but we simply replace  Σ∗by ˆΣnin implementation by putting  τ ′ = 0.

5.2 Experimental results

We solved the robust optimization problem (2) for three different  λ:

[proposed]  λ = ˆλncalculated by Algorithm 2,

[lower-bound]  λ = λ := χ−11 (1 − δ)/√n. This is a lower-bound in terms of Lemma 17.

[upper bound]  λ = λ := χ−1d (1 − δ)/√n. This is the the scale adopted by the standard approach on the basis of Fact 4.

Figure 2 shows the value-at-risk  VaRδ[f ∗(ˆx(λ))]for each  λ. We observed that

The lower-bound  λ(yellow) showed worst performance with small sample size  n ≤ 60. This performance is due to the violation of true constraints with probability more than  δ. Althoughthis  λshowed the best performance with large sample size  n ≥ 120, this scale is inadequate for guaranteed optimization.

image

This paper has shown that the scale of robustness required for achieving a given confidence probability is dependent on the size of optimization domain, and then has proposed an algorithm for deciding the scale tightly on the basis of empirical domain reduction. The analysis has proven that the scale is asymptotically independent of the parameter dimension, and our experiments has demonstrated that the proposed method can achieve better performance while maintaining probabilistic guarantee.

[1] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust optimization. Princeton University Press, 2009.

[3] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations research letters, 25(1):1–13, 1999.

[4] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

[5] Vidmantas Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003.

[6] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM review, 53(3):464–501, 2011.

[7] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.

[8] Dimitris Bertsimas and Melvyn Sim. The price of robustness. Operations research, 52(1):35– 53, 2004.

[9] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

[10] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612, 2010.

[11] Gene H Golub and Charles F Van Loan. Matriz computations. The Johns Hopkins University Press, Baltimore, USA, 1989.

[12] V Yu Korolev and Irina G Shevtsova. On the upper bound for the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 54(4):638–658, 2010.

[13] A Maurer and M Pontil. Empirical bernstein bounds and sample variance penalization. In COLT 2009-The 22nd Conference on Learning Theory, 2009.

[14] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2975–2984, 2017.

[15] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.

[16] Akihiro Yabe and Takanori Maehara. Empirical hypothesis space reduction. arXiv preprint, https://arxiv.org/abs/1909.01576, 2019.

A.1 Proofs for the statements in Section 1

Proof of Fact 4. By  θ − θ∗ ∈ λUΣ, for any  x ∈ Xand k = 1, 2, . . . , K, it holds that

image

The left inequality implies that  f(x(λ; θ, Σ)) ≤ f(x(2λ; θ∗, Σ)), and the right inequality implies that  f ∗(x(λ; θ, Σ)) < ∞. It thus holds that  f ∗(x(λ; θ, Σ)) ≤ f ∗(x(2λ; θ∗, Σ)).

Proof of Proposition 2. Since ˆθn − θ∗ ∈ λUΣ∗with probability at least  1 − δ, it holds that

image

The last inequality follows from Fact 4.

Proof of Lemma 5. By the definition of  rk and S(λ, Y; Σ)and linearlity of  gkwith respect to  θ,for all  y ∈ Y, it holds that

image

Since  x(λ; θ, Σ), x(2λ; θ∗, Σ) ∈ Y, it holds that  f ∗(x(λ; θ, Σ)), f ∗(x(2λ; θ∗, Σ)) < ∞. By the first inequality and the optimality of  x(λ; θ, Σ)over X, then, we have  f(x(λ; θ, Σ)) ≤ f(y)that satisfies  y ∈ Yand  minu∈UΣ gk(y, θ∗ + 2λu) ≥ 0. Thus the statement holds.

A.2 Proofs for the statements in Section 2

For each  θi ∼ P∗ for i = 1, 2, . . . , n, let us denote  ∆i = Σ∗−1/2(θi − θ∗) and ∆n = �ni=1 ∆i/√n.Then each  ∆iis distributed by  P∗∆.

Proof of Lemma 11. Observe that each  ∆i satisfies E[∆i] = 0, E[∆i∆i⊤] = Id, and E[∥∆i∥3] ≤ τby Assumption 10. Lemma 9 implies that

image

Mapping both random variables as  W �→ (Σ∗/n)1/2Wand  ∆n �→ (Σ∗/n)1/2∆n = ˆθn, we have the statement.

Proof of Lemma 12. Let us define ˆIn := (1/n) �ni=1 ∆i∆i⊤. Since  Σ1/2 ˆInΣ1/2 = ˆΣnand the relationship  O ⪯ Σ1 ⪯ Σ2of positive semidefinit matrices is invariant under the mapping Σ1 �→ Σ1/2Σ1Σ1/2by another positive semidefinite matrix  Σ, it is enough to prove that

image

For each i = 1, 2, . . . , n and j, k = 1, 2, . . . , d, let us define  σijk = ∆ij∆ik. Then  E[σijj] =1, σ∗2jj = E[σi2jj] ≤ τ′2/3by Hölder’s inequality, and  E[σi3jj] ≤ τ ′. Observe that ˆIn,jj =(1/n)(�ni=1 σijj) − 1Thus, by Lemma 9, it holds that

image

Since  ProbW∼N(0,σ∗2jj )(W ∈ [−λ, λ]) = χ1(λ/σ∗jj), the following holds with probability  1 − δ′/d2,it holds that

image

Also, for  j ̸= k, it holds that  E[σijk] = 0, σ∗2jk := E[σi2jk] ≤ τ′2/3by Hölder’s inequality, and E[σi3jk] ≤ τ ′also by Hölder’s inequality. Thus, by Lemma 9, it holds that

image

The following holds with probability  1 − δ′/d2, it holds that

image

Thus the statement holds.

A.3 Proofs for the statements in Section 3

Proof of Lemma 13. Let us define  S+kand  S−kfor k = 1, 2, . . . , K by

image

Let us denote  S+k = S+k (λ, Y; Σ)and  S−k = S−k (λ, Y; Σ). It then holds that  S(λ, Y; Σ) =�Kk=1(S+k ∩ S−k ). We then show that  S+kand  S−kare convex sets for all k = 1, 2, . . . , K.Observe that,

image

For each  ∆ ∈ S+kand  y ∈ Y, then, let  u+k (∆, y)denote an element of  λUΣthat satisfy gk(y, −∆+u+k (∆, y)) ≤ 0. Suppose that  ∆1, ∆2 ∈ S+k , and we will prove that  α∆1 +(1−α)∆2 ∈S+kfor any  α ∈ [0, 1]. For any  y ∈ Y, let  u′ := αu+k (∆1, y) + (1 − α)u+k (∆2, y), which  u′is included by  λUΣby the convexity of  UΣ. Then, by the linearlity of  gkwith respect to  θ, we have

image

and thus  α∆1 + (1 − α)∆2 ∈ S+k and S+k is convex set. The same line of discussion shows that S−kis also a convex set, and thus the first half of the statement holds.

Assume in addition that  gk(x, θ)is linear with respect to x for all k = 1, 2, . . . , K and Y is a convex set. By Sion’s minimax theorem, then, it holds that

image

It then also holds that  ∆ ∈ S−k if and only if  ∆ ∈�λUΣ + C−k�. The second half of the statement thus holds, and the proof is complete.

image

Similarly, it holds that  S−k (κ1λ, Y; κ2Σ) = κ1√κ2S−k (λ, Y; Σ). Since S(λ, Y; Σ) = �Kk=1(S+k ∩S−k ),these then imply that  Sk(κ1λ, Y; κ2Σ) = κ1√κ2Sk(λ, Y; Σ). It then holds that

image

The statement thus holds.

Proof of Proposition 16. By Lemma 11, it holds that

image

Thus the second inequality holds. The first inequality follows from the same line of discussion.

Proof of Lemma 17. (i) For the left inequality, let up pick arbitrary  y ∈ Y, and then it holds that

image

For the right inequality, since  µUΣ ⊆ S(µ, Y; Σ)for any  Y ⊆ Xand  Σ. It then holds that

image

(ii) This statement directly follows from the definition of  µ(p, Y; NΣ, Σ) and S(λ, Y; Σ).

Proof of Lemma 18. This statement directly follows from the definition of  S(λ, Y; Σ).

Proof of Proposition 19. Let us define  ˙µ(p)by

image

Then the output  λsatisfies

image

By Lemma 8, then, the following inequalities holds with probaiblity at least  1 − α/2:

image

which shows that  µ′(p) ≤ ˙µ(p + β/2). Similarly, with probability at least  1 − α/2, it holds that ˙µ(p + β/2) ≤ µ′(p + β). By(17) and the union bound, then, the following holds with probability at least  1 − α:

image

The statement thus holds.

A.4 Proofs for the statements in Section 4

Proof of Theorem 20. Let  ε1 = d2c1τ ′/√n. By Lemma 12, with probability at least  1 − 2ε1, it holds that

image

Then it holds that

image

The first inequality follows from Proposition 16. The second inequality follows from (18) and Lemma 15 (iii). The first equality follows from Lemma 15 (iv). The third inequality follows from (19). The forth inequality follows from (18), Lemma 15 (iii) and (iv), and (19). The fifth inequality follows from Lemma 17 (i).

Suppose that (18) and (20) hold. By the definition of the minimum spatial uniform bound, the following holds with probability at least  1 − 3ε2:

image

Observe that  ˆx(3ˆµ/(1 − ηn)) = x(3ˆµ, ˆθn, ˆΣn/(1 − ηn)). By Lemma 21 with Y = X, (18), and (20), it holds that

image

Observe that, by (18) and (22), for any  w′and  µ′ ≥ 0, it holds that

image

It then holds that

image

Suppose that (18), (20) and (22) holds. Recall that  δ′ = 2ε1 + δ/√n + 4ε2. By (18), Proposition 16, and Lemma 17 (i), it holds that

image

By the definition of the minimum spatial uniform bound, with probability at least  1 − δ + δ′, itholds that

image

By the union bound, (18), (20), (22), and (25) hold with probability at least  1 − δ. Since ˆλ ≥ 0, it is immediate that  ˆx(ˆλ) ∈ ˆY(∞, ˆµ). Since  n ≥ n0, the following inequalities hold.

image

It then holds that

image

Therefore it holds that  ˆx(ˆλ) ∈ Yn. Also, since  2ˆλ ≤ 2µ′nimply  f(x∗(2ˆλ)) ≤ f(x∗(2µ′n))

image

and thus  ˆx(ˆλ) ∈ Yn. Therefore, by Lemma 5, it holds that

image

The last inequality follows from the union bound, and thus the proof is complete.

Proof of Lemma 21. By the definition of  rk, S(λ, Y; Σ), and linearlity of  gkwith respect to  θ, for all  y ∈ Yand k = 1, 2, . . . , K, it holds that

image

Since  x((κ − 1)λ; θ∗, Σ), x(κλ; θ, Σ), x((κ + 1)λ; θ∗, Σ) ∈ Y, the inequalities above indicates that f ∗(x((κ −1)λ; θ∗,Σ))  ≤ f ∗(x(κλ; θ,Σ))  ≤ f ∗(x((κ+ 1)λ; θ∗,Σ)).


Designed for Accessibility and to further Open Science