Tightly Robust Optimization via Empirical Domain Reduction

2020·Arxiv

Abstract

Abstract

Data-driven decision-making is performed by solving a parameterized optimization problem, and the optimal decision is given by an optimal solution for unknown true parameters. We often need a solution that satisfies true constraints even though these are unknown. Robust optimization is employed to obtain such a solution, where the uncertainty of the parameter is represented by an ellipsoid, and the scale of robustness is controlled by a coefficient. In this study, we propose an algorithm to determine the scale such that the solution has a good objective value and satisfies the true constraints with a given confidence probability. Under some regularity conditions, the scale obtained by our algorithm is asymptotically , whereas the scale obtained by a standard approach is This means that our algorithm is less affected by the dimensionality of the parameters.

1 Introduction

1.1 Problem Setting

Data-driven decision-making must manage stochastic uncertainty due to a limited number of available data samples. Suppose that a decision is made by solving the following parameterized optimization problem [6]:

where is a parameter, is a decision domain, is a decision variable, is an objective function, and are parameterized constraint functions that are linear in . The ideal decision is given by an optimal solution under true parameter is unknown to us and must be estimated from i.i.d. samples (i = 1, . . . , n) from a distribution whose expectation is .

In practice, we often need a solution that satisfies the true constraints (i.e., constraints with true parameters) even though these are unknown. In such a case, a naive approach that solves the problem with estimated parameters would be inadequate because the obtained solution might not satisfy the true constraints. Rather, robust optimization [2, 6] can be used as follows. For a positive semidefinite matrix , we define . The robust optimization problem (with ellipsoidal uncertainty) is then given by

where is a parameter called the nonnegative scale. Let be the solution of the above problem. Then, satisfies the true constraints if . This means that if large, it will most likely satisfy the true constraints but may have a poor best-case performance. Conversely, if is small, it will have a good best-case performance but also a high risk of violating the true constraints. Therefore, the choice of is a significantly important task [8].

In this study, we consider the problem of finding a such that the problem will have a small objective value and satisfy the true constraints with a sufficiently high probability. Formally, the problem is defined as follows. Let be the true objective function:

Let be the empirical average of , be an estimate of the covariance matrix defined by be defined by Our goal then is to find an algorithm A that determines a scale from the observation such that the ()-quantile of , i.e., the , is minimized as follows :

Problem 1. Given , find an algorithm that minimizes .

1.2 Standard Approach

A standard approach for Problem 1 determines a holds with a desired probability [4, 7, 10]. Let be the true covariance matrix and . We then obtain the following:

Proposition 2. Suppose that with probability at least . It then holds that .

If is (asymptotically) distributed by a normal distribution with mean and covariance , then such will be determined by , where is a chi-distribution with a degree of freedom d.

Observe that such a will be of . Thus, the speed of convergence depends on the dimension d of the parameter . The speed of convergence will be especially degraded when a large number of features () are available, but only a small number of features is useful, which will cause the so-called “curse of dimensionality” in robust optimization.

1.3 Our Approach and Results

In this study, we prove the following theorem, which improves the convergence rate of Corollary 2:

Theorem 3 (Informal version of Theorem 20). There is an algorithm A, which is given in Algorithm 2, such that the output satisfies

where

It is important to note that our is asymptotically independent of the dimensionality of the parameter , i.e., the speed of convergence is less affected by the dimensionality of the parameter .

The theorem is proved in the following steps. We first derive a new inequality on a subspace. We then establish methods to apply the new inequality effectively.

1.3.1 New Inequality on Subspace

Recall that Corollary 2 is derived from the following fundamental fact in robust optimization.

Fact 4. Let . If and satisfy , then

for . Here, we observe that this is too conservative. If we consider given ) and a subspace that includes the resulting solution , we can prove the same inequality (3) under a weaker assumption. Let be the set of errors that does not cause a violation of the true constraint over Y, i.e.,

We then obtain the following lemma:

Lemma 5. Let . If , and satisfy

then (3) holds.

Lemma 5 generalizes Fact 4 because the assumption of Fact 4 is a sufficient condition in Lemma 5 for Y = X since holds for any Y, and (5) is trivial with Y = X. Our idea is to use Lemma 5 rather than Fact 4 to derive better bounds than Corollary 2.

The following example shows the effect of choosing a different Y.

Example 6. Let us consider the following d = 2-dimensional parameterized problem:

The true parameters are , and the true covariance of samples is

an identity matrix of size d.

Figure 1 illustrates , where all these Y include the true optimum solution . The figure clearly shows that, with a fixed , a smaller Y results in a larger S:

Recall that Fact 4 and Lemma 5 lead us to define in order to satisfy with desired probability .

More specifically, let be distributed by the normal distribution Fact 4 with then requires , while Lemma 5 together with . Since a smaller yields a closer solution to the optimal one, this indicates the possibility of having a better solution.

For an effective use of Lemma 5, we must be able to evaluate the probability for , and must find a good subspace without knowing and .

, in a two-dimensional toy problem.

1.3.2 Probability Evaluation

In order to evaluate probability without , in Section 3 we characterize an intersection of Minkovski’s sum of an ellipsoid and a dual cone. Since such an intersection is a convex set, we can apply the high-dimensional Berry-Esseen theorem for convex sets [5] in order to approximate the distribution of by a normal distribution while maintaining a theoretical guarantee. We then propose a sampling-based algorithm for measuring the probability with sufficient accuracy.

1.3.3 Subspace Selection

To find good subspace , in Section 4 we extend the two-step empirical domain reduction algorithm proposed for risk minimization with a single uncertainty objective [16] to our context for robust optimization with multiple uncertain constraints. Such a reduced space must satisfy (4) and (5) in Lemma 5 with high probability. For (4), since the reduction is based on empirical estimate , the resulting also includes stochastic uncertainty. We must thus avoid over-fitting on a given single sample and keep consistency with other sampling scenarios. In addition, (5) requires that the solution must be included in Y without forcing it by adding constraints. These two requirements require precise control of domain reduction, which can be done using our empirical two-step domain reduction algorithm.

1.4 Related studies

Our approach can be summarized in terms of the following concepts: inequality on subspaces, probability evaluation, and subspace selection. While these concepts are novel in the context of robust optimization, they have already been proven to be effective in the context of variance-based regularization for risk minimization in machine learning. Let us note that variance-based regularization regularizes uncertain objectives by means of scaled (square-root of) variance, and thus it is similar to the robust optimization with ellipsoidal uncertainty that introduces redundancy into uncertain constraints, which redundancy is proportional to (square-root of) variance.

In the context of risk minimization, the relationship between the size of an optimization domain and the speed of convergence has been characterized by various complexity measures, such as VC dimension [15], covering number [13], and Rademacher complexity [1]. Studies of variance-based regularization [13, 14] have determined the scale of a regularizer on the basis of these complexity measures in order for a regularized empirical risk function to bound the true objectives with a desired probabilities. A recent study of empirical hypothesis space reduction [16] has achieved, by means of subspace selection, an acceleration of convergence that is asymptotically independent of the dimensionality of uncertain parameters.

To make use of the approaches in [16], which is designed for variance-based regularization, in robust optimization, we have dealt with the issues below. Robust optimization has several uncertain constraints, while risk minimization has a single uncertain objective. In particular, while an uncertainty objective does not influence the feasible domain, uncertain constraints do influence, which increases the difficulty of subspace selection. In addition, in robust optimization, it is often assumed that uncertain functions are linear with respect to uncertain parameters, and this linearity has led us to propose a novel sampling-based evaluation of violation probability.

2 Preliminary

2.1 Technical lemmas

This section introduces a series of lemmas. The following Gershgorin circle theorem characterizes the list of eigenvalues of a matrix by its elements .

Lemma 7 (The Gershgorin circle theorem; see [11]). Let , and define . Under suitable ordering, the set of eigenvalues .

Hoeffding’s inequality below bounds the gap between sample average and true average.

Lemma 8 (Hoeffding’s inequality; see [9]). Let be i.i.d. random variables with values in [0, 1] and let . Then we have

The following Berry-Esseen theorem enables us to evaluate the speed of convergence of the central limit theorem (for tighter coefficient in a one-dimensional case, see [12]).

Lemma 9 (The high-dimensional Berry-Esseen inequality [5])be an integer. Let be a set of convex set in and . Let be i.i.d. random variables over with , and . Let . It then holds that

2.2 Estimation accuracy

We can show here the accuracy of estimators on the basis of the lemmas introduced in the previous section and the following assumption. Let denote the distribution of where .

Assumption 10. An upper-bound

We here assume that (an upper-bound of) the third moment and sixth moment since the speed of convergence is less influenced by the higher-order moments the first and second moments (). In practice, can be calculated if the domain of the distribution is bounded, or it can also be estimated using samples .

Under Assumption 10, the distribution of converges to the normal distribution by the central limit theorem. Recall the definitions of and in Lemma 9. The speed of convergence can be characterized by , and Lemma 9 as follows:

Lemma 11. It holds that

For , let us define by

The relative accuracy of estimator can be characterized by Lemma 7 and Lemma 9 as follows.

Lemma 12. For , the following then holds with probability at least :

Note that the above bound can be improved by using the Berry-Esseen theorem that is specialized for a one-dimensional random variable, rather than by using Lemma 9. Since this improvement does not influence the order of convergence, we here use Lemma 9 for simplicity of presentation.

3 Probability Evaluation Algorithm

This section proposes an algorithm that realizes the concept of probability evaluation discussed in Section 1.3.

3.1 Geometric characterization of S(µ, Y; Σ)

Recall that is the set of estimation error that does not cause a violation of the true constraint over Y with the scale . Here we prove that S is a convex set for any , Y, and .

Since is linear in , there exists a function such that

For , we define by

We then define the polar cone and dual cone of by

The polar cone and the dual cone are convex cones. For a pair of set their Minkowski’s sum is defined by . Note that Minkowski’s sum of two convex sets is also a convex set. The following theorem characterize as an intersection of convex sets:

Lemma 13. For any , and is a convex set. In particular, if is linear in x for all k = 1, 2, . . . , K and Y is convex, then it holds that

3.2 Normal approximation via multidimensional Berry-Esseen’ theorem

On the basis of [16, Definition 5], we here introduce the minimum spatial uniform bounds as a minimum scale that satisfies (4) with probability when is distributed by P:

be a distribution over . The minimum spatial uniform bound is defined by:

Let us define as the distribution of . The ideal goal of this section, then, is to obtain satisfies the following component-wise monotonicity. Let us denote .

Recall that Lemma 13 shows the convexity of for arbitrary then apply Lemma 11 to bound the gap of probability in (8) for and the corresponding normal . Combined with the monotonicity shown in Lemma 15 (i), we can upper-bound of unknown distribution by its asymptotically normal counterpart:

Proposition 16. It holds that

3.3 Sampling algorithm for calculating spatial uniform bounds for normal distribution

This section fixes , and we then denote notational simplicity. Proposition 16 reduces our goal to that of calculating a minimum spatial uniform bound . Lemma 15 (i) shows that such a is monotone in satisfies the following property:

Thus is monotone, upper and lower bounded, and its inverse can be calculated by a series of sampling and optimization. This characterization leads the following algorithm for calculating by sampling from and a binary search, as shown in Algorithm 1.

Given , together with parameters deciding accuracy, Algorithm 1 calculates a that approximates . Line 1 of the algorithm first defines the number Q of samples by

Line 2, then, generates samples from the normal distribution . Lines 3–18 estimate by the binary search. Line 4 defines the upper-bound and the lower-bound the basis of Lemma 17. Line 6 updates as a mean of the upper-bound and the lower-bound , and then Lines 7–12 approximately calculate on the basis of the samples for i = 1, 2, . . . , Q and the following lemma:

Lemma 18. For if and only if , where

For each , Line 8 calculates the optimization problem (10) for , and then Lines 9– 11 approximately calculate by counting the number of such for i = 1, 2, . . . , Q. Note that some of the calculation of can be omitted in practice since is monotonically increasing in . Lines 13–17 update the upper-bound or the lower-bound the basis of the approximate probability . Finally, Line 19 outputs the approximate upper-bound of .

This output satisfies the following probabilistic guarantee.

Proposition 19. With probability at least , the output of Algorithm 1 satisfy

The computational tractability of Algorithm 1 relies heavily on the choice of Y, and this is discussed in the next section, which utilizes Algorithm 1.

4 Empirical Domain Reduction

This section presents an algorithm that achieves our main theoretical result via the concept of subspace selection discussed in Section 1.3.

4.1 Two-step domain reduction algorithm

Let us define constants and empirically reduced domain by

We can then propose Algorithm 2 for the calculation of the scale of robustness. Lines 1–4 represent the first stage of this algorithm, which conducts subspace selection. Line 1 first calculates coefficient using Algorithm 1 with Y = X. Line 2 defines by scaling Line 3 calculates optimum value by solving (2) on the basis of scale . Line 4 conduct subspace selection by defining . Lines 5–6 represent the second stage of the algorithm, which calculates scale on the basis of the subspace . Line 5 calculates using Algorithm 1 with , and Line 6 then outputs which defined by scaling .

4.2 Theoretical analysis

Let be the minimum integer n that satisfies

Note that such must exist since . Our main theoretical result can then be given as follows:

Theorem 20. Suppose that , and . Then the output of Algorithm 2 satisfies

where

Thus, under the uniqueness of the true optimum solution , the estimated scale is asymptotically independent of the dimension d.

The essence of the proof of this statement lies in finding subspaces that satisfy the following two conditions: (i) with high probability, it holds that

and (ii) . Property (i), together with Lemma 5, implies contribution to proof of (11), and the property (ii) implies asymptotic convergence (12). Since is controlled by the estimated optimum value , for the existence of both , we need to control the range of estimated optimum value . We utilize the following lemma for this control:

Lemma 21. Let and . Suppose that , and satisfy

Roughly speaking, we bound the range of in probability by applying Lemma 21 with and .

Let us conclude this section with a brief discussion of the computational tractability of our algorithms.

Figure 2: Value-at-risk . The vertical line shows the value-at-risk, and the horizontal line shows the number of samples. Respective yellow, green, blue lines show the result with

Figure 3: Scale of robustness . Vertical line shows the value of scale of robustness horizontal line shows the number of samples. Respective yellow, green, blue lines indicate the value of

Remark 22. Algorithm 2 mainly consists of an optimization (3) in Line 3 and two applications of Algorithm 2 in Lines 1 and 5, where each application consists of optimization of (10). If the original robust optimization (3) is a convex programming problem, then (10) with Y = X is a series of K convex programmings. It is thus natural to assume that Lines 1 and 3 are computationally tractable.

For the second application of Algorithm 1, in Line 5, note that the subspace will not generally be convex because of the concavity of . In practical implementation, we can replace in the definition of by an upper-bound that makes convex. If such a can be bounded above as with some constant , then the resulting output of Algorithm 2 will also satisfy the guarantee in Theorem 20.

Let us introduce a simple example of such an upper-bound . Suppose that X is the positive orthant () and that . Then , where is the -norm for . It then holds that . Observe that is linear on the positive orthant X, and thus this is computationally tractable.

5 Experiments

5.1 Experimental setting

Let us consider the following simple portfolio optimization problem, which has been examined by several existing studies in robust optimization [8, 3]. Suppose that there are d items whose costs are , which are uncertain parameters. The task is to find a convex combination (portfolio) that minimizes the cost:

This problem has an uncertain objective function. Thus, we first translate this problem into the our setting (1). Let us introduce an additional variable . We denote by x the set of variables . We also introduce a dummy parameter and g by

Then the portfolio optimization problem (13) is captured by our setting (1).

For this problem, we generate synthetic data as follows. We fix , and . We set the covariance matrix as a diagonal matrix where follows a uniform distribution over [0, 10]. For each sample size n and true covariance matrix, we generate 20 sets of n i.i.d. samples from which samples value-at-risk will be approximately calculated. All our experimental results are average over 30 generation of such covariance matrix . For a solution that does not satisfy the true constraint, we put instead of for the purpose of clear visualization.

To make Algorithm 2 practical, we applied the following approximations: While we bound the gap of the probability mass of the true distribution of theoretical analysis, we simply replace the true distribution by its normal estimate in implementation by putting . Also, our theoretical analysis carefully replace the true covariance matrix by its probabilistic upper-bound or lower-bound but we simply replace by in implementation by putting .

5.2 Experimental results

We solved the robust optimization problem (2) for three different :

• [proposed] calculated by Algorithm 2,

• [lower-bound] . This is a lower-bound in terms of Lemma 17.

• [upper bound] . This is the the scale adopted by the standard approach on the basis of Fact 4.

Figure 2 shows the value-at-risk for each . We observed that

• The lower-bound (yellow) showed worst performance with small sample size . This performance is due to the violation of true constraints with probability more than this showed the best performance with large sample size , this scale is inadequate for guaranteed optimization.

6 Concluding remarks

This paper has shown that the scale of robustness required for achieving a given confidence probability is dependent on the size of optimization domain, and then has proposed an algorithm for deciding the scale tightly on the basis of empirical domain reduction. The analysis has proven that the scale is asymptotically independent of the parameter dimension, and our experiments has demonstrated that the proposed method can achieve better performance while maintaining probabilistic guarantee.

References

[1] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust optimization. Princeton University Press, 2009.

[3] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations research letters, 25(1):1–13, 1999.

[4] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

[5] Vidmantas Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003.

[6] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM review, 53(3):464–501, 2011.

[7] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.

[8] Dimitris Bertsimas and Melvyn Sim. The price of robustness. Operations research, 52(1):35– 53, 2004.

[9] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

[10] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612, 2010.

[11] Gene H Golub and Charles F Van Loan. Matriz computations. The Johns Hopkins University Press, Baltimore, USA, 1989.

[12] V Yu Korolev and Irina G Shevtsova. On the upper bound for the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 54(4):638–658, 2010.

[13] A Maurer and M Pontil. Empirical bernstein bounds and sample variance penalization. In COLT 2009-The 22nd Conference on Learning Theory, 2009.

[14] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2975–2984, 2017.

[15] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.

[16] Akihiro Yabe and Takanori Maehara. Empirical hypothesis space reduction. arXiv preprint, https://arxiv.org/abs/1909.01576, 2019.

A Proofs

A.1 Proofs for the statements in Section 1

Proof of Fact 4. By , for any and k = 1, 2, . . . , K, it holds that

The left inequality implies that , and the right inequality implies that . It thus holds that .

Proof of Proposition 2. Since with probability at least , it holds that

The last inequality follows from Fact 4.

Proof of Lemma 5. By the definition of and linearlity of with respect to for all , it holds that

Since , it holds that . By the first inequality and the optimality of over X, then, we have that satisfies and . Thus the statement holds.

A.2 Proofs for the statements in Section 2

For each , let us denote Then each is distributed by .

Proof of Lemma 11. Observe that each by Assumption 10. Lemma 9 implies that

Mapping both random variables as and , we have the statement.

Proof of Lemma 12. Let us define . Since and the relationship of positive semidefinit matrices is invariant under the mapping by another positive semidefinite matrix , it is enough to prove that

For each i = 1, 2, . . . , n and j, k = 1, 2, . . . , d, let us define . Then by Hölder’s inequality, and . Observe that Thus, by Lemma 9, it holds that

Since , the following holds with probability it holds that

Also, for , it holds that by Hölder’s inequality, and also by Hölder’s inequality. Thus, by Lemma 9, it holds that

The following holds with probability , it holds that

Thus the statement holds.

A.3 Proofs for the statements in Section 3

Proof of Lemma 13. Let us define and for k = 1, 2, . . . , K by

Let us denote and . It then holds that . We then show that and are convex sets for all k = 1, 2, . . . , K.Observe that,

For each and , then, let denote an element of that satisfy . Suppose that , and we will prove that for any . For any , let , which is included by by the convexity of . Then, by the linearlity of with respect to

and thus is convex set. The same line of discussion shows that is also a convex set, and thus the first half of the statement holds.

Assume in addition that is linear with respect to x for all k = 1, 2, . . . , K and Y is a convex set. By Sion’s minimax theorem, then, it holds that

It then also holds that if and only if . The second half of the statement thus holds, and the proof is complete.

Similarly, it holds that these then imply that . It then holds that

The statement thus holds.

Proof of Proposition 16. By Lemma 11, it holds that

Thus the second inequality holds. The first inequality follows from the same line of discussion.

Proof of Lemma 17. (i) For the left inequality, let up pick arbitrary , and then it holds that

For the right inequality, since for any and . It then holds that

(ii) This statement directly follows from the definition of

Proof of Lemma 18. This statement directly follows from the definition of .

Proof of Proposition 19. Let us define by

Then the output satisfies

By Lemma 8, then, the following inequalities holds with probaiblity at least :

which shows that . Similarly, with probability at least , it holds that (17) and the union bound, then, the following holds with probability at least :

The statement thus holds.

A.4 Proofs for the statements in Section 4

Proof of Theorem 20. Let . By Lemma 12, with probability at least , it holds that

Then it holds that

The first inequality follows from Proposition 16. The second inequality follows from (18) and Lemma 15 (iii). The first equality follows from Lemma 15 (iv). The third inequality follows from (19). The forth inequality follows from (18), Lemma 15 (iii) and (iv), and (19). The fifth inequality follows from Lemma 17 (i).

Suppose that (18) and (20) hold. By the definition of the minimum spatial uniform bound, the following holds with probability at least :

Observe that . By Lemma 21 with Y = X, (18), and (20), it holds that

Observe that, by (18) and (22), for any and , it holds that

It then holds that

Suppose that (18), (20) and (22) holds. Recall that . By (18), Proposition 16, and Lemma 17 (i), it holds that

By the definition of the minimum spatial uniform bound, with probability at least holds that

By the union bound, (18), (20), (22), and (25) hold with probability at least . Since , it is immediate that . Since , the following inequalities hold.

It then holds that

Therefore it holds that . Also, since imply

and thus . Therefore, by Lemma 5, it holds that

The last inequality follows from the union bound, and thus the proof is complete.

Proof of Lemma 21. By the definition of , and linearlity of with respect to , for all and k = 1, 2, . . . , K, it holds that

Since , the inequalities above indicates that 1)Σ)) Σ)) + 1)Σ)).