1.1 Problem Setting
Data-driven decision-making must manage stochastic uncertainty due to a limited number of available data samples. Suppose that a decision is made by solving the following parameterized optimization problem [6]:
where is a parameter,
is a decision domain,
is a decision variable,
is an objective function, and
are parameterized constraint functions that are linear in
. The ideal decision is given by an optimal solution
under true parameter
is unknown to us and must be estimated from i.i.d. samples
(i = 1, . . . , n) from a distribution
whose expectation is
.
In practice, we often need a solution that satisfies the true constraints (i.e., constraints with true parameters) even though these are unknown. In such a case, a naive approach that solves the problem with estimated parameters would be inadequate because the obtained solution might not satisfy the true constraints. Rather, robust optimization [2, 6] can be used as follows. For a positive semidefinite matrix , we define
. The robust optimization problem (with ellipsoidal uncertainty) is then given by
where is a parameter called the nonnegative scale. Let
be the solution of the above problem. Then,
satisfies the true constraints if
. This means that if
large, it will most likely satisfy the true constraints but may have a poor best-case performance. Conversely, if
is small, it will have a good best-case performance but also a high risk of violating the true constraints. Therefore, the choice of
is a significantly important task [8].
In this study, we consider the problem of finding a such that the problem will have a small objective value and satisfy the true constraints with a sufficiently high probability. Formally, the problem is defined as follows. Let
be the true objective function:
Let be the empirical average of
,
be an estimate of the covariance matrix
defined by
be defined by
Our goal then is to find an algorithm A that determines a scale
from the observation
such that the (
)-quantile of
, i.e., the
, is minimized as follows
:
Problem 1. Given , find an algorithm
that minimizes
.
1.2 Standard Approach
A standard approach for Problem 1 determines a holds with a desired probability [4, 7, 10]. Let
be the true covariance matrix and
. We then obtain the following:
Proposition 2. Suppose that with probability at least
. It then holds that
.
If is (asymptotically) distributed by a normal distribution
with mean
and covariance
, then such
will be determined by
, where
is a chi-distribution with a degree of freedom d.
Observe that such a will be of
. Thus, the speed of convergence depends on the dimension d of the parameter
. The speed of convergence will be especially degraded when a large number of features (
) are available, but only a small number of features is useful, which will cause the so-called “curse of dimensionality” in robust optimization.
1.3 Our Approach and Results
In this study, we prove the following theorem, which improves the convergence rate of Corollary 2:
Theorem 3 (Informal version of Theorem 20). There is an algorithm A, which is given in Algorithm 2, such that the output satisfies
where
It is important to note that our is asymptotically independent of the dimensionality of the parameter
, i.e., the speed of convergence is less affected by the dimensionality of the parameter
.
The theorem is proved in the following steps. We first derive a new inequality on a subspace. We then establish methods to apply the new inequality effectively.
1.3.1 New Inequality on Subspace
Recall that Corollary 2 is derived from the following fundamental fact in robust optimization.
Fact 4. Let . If
and
satisfy
, then
for . Here, we observe that this is too conservative. If we consider given
) and a subspace
that includes the resulting solution
, we can prove the same inequality (3) under a weaker assumption. Let
be the set of errors that does not cause a violation of the true constraint over Y, i.e.,
We then obtain the following lemma:
Lemma 5. Let . If
, and
satisfy
then (3) holds.
Lemma 5 generalizes Fact 4 because the assumption of Fact 4 is a sufficient condition in Lemma 5 for Y = X since holds for any Y, and (5) is trivial with Y = X. Our idea is to use Lemma 5 rather than Fact 4 to derive better bounds than Corollary 2.
The following example shows the effect of choosing a different Y.
Example 6. Let us consider the following d = 2-dimensional parameterized problem:
The true parameters are , and the true covariance of samples is
an identity matrix of size d.
Figure 1 illustrates , where all these Y include the true optimum solution
. The figure clearly shows that, with a fixed
, a smaller Y results in a larger S:
Recall that Fact 4 and Lemma 5 lead us to define in order to satisfy
with desired probability
.
More specifically, let be distributed by the normal distribution
Fact 4 with
then requires
, while Lemma 5 together with
. Since a smaller
yields a closer solution to the optimal one, this indicates the possibility of having a better solution.
For an effective use of Lemma 5, we must be able to evaluate the probability for
, and must find a good subspace
without knowing
and
.
, in a two-dimensional toy problem.
1.3.2 Probability Evaluation
In order to evaluate probability without , in Section 3 we characterize
an intersection of Minkovski’s sum of an ellipsoid and a dual cone. Since such an intersection is a convex set, we can apply the high-dimensional Berry-Esseen theorem for convex sets [5] in order to approximate the distribution of
by a normal distribution while maintaining a theoretical guarantee. We then propose a sampling-based algorithm for measuring the probability with sufficient accuracy.
1.3.3 Subspace Selection
To find good subspace , in Section 4 we extend the two-step empirical domain reduction algorithm proposed for risk minimization with a single uncertainty objective [16] to our context for robust optimization with multiple uncertain constraints. Such a reduced space must satisfy (4) and (5) in Lemma 5 with high probability. For (4), since the reduction is based on empirical estimate
, the resulting
also includes stochastic uncertainty. We must thus avoid over-fitting on a given single sample and keep consistency with other sampling scenarios. In addition, (5) requires that the solution
must be included in Y without forcing it by adding constraints. These two requirements require precise control of domain reduction, which can be done using our empirical two-step domain reduction algorithm.
1.4 Related studies
Our approach can be summarized in terms of the following concepts: inequality on subspaces, probability evaluation, and subspace selection. While these concepts are novel in the context of robust optimization, they have already been proven to be effective in the context of variance-based regularization for risk minimization in machine learning. Let us note that variance-based regularization regularizes uncertain objectives by means of scaled (square-root of) variance, and thus it is similar to the robust optimization with ellipsoidal uncertainty that introduces redundancy into uncertain constraints, which redundancy is proportional to (square-root of) variance.
In the context of risk minimization, the relationship between the size of an optimization domain and the speed of convergence has been characterized by various complexity measures, such as VC dimension [15], covering number [13], and Rademacher complexity [1]. Studies of variance-based regularization [13, 14] have determined the scale of a regularizer on the basis of these complexity measures in order for a regularized empirical risk function to bound the true objectives with a desired probabilities. A recent study of empirical hypothesis space reduction [16] has achieved, by means of subspace selection, an acceleration of convergence that is asymptotically independent of the dimensionality of uncertain parameters.
To make use of the approaches in [16], which is designed for variance-based regularization, in robust optimization, we have dealt with the issues below. Robust optimization has several uncertain constraints, while risk minimization has a single uncertain objective. In particular, while an uncertainty objective does not influence the feasible domain, uncertain constraints do influence, which increases the difficulty of subspace selection. In addition, in robust optimization, it is often assumed that uncertain functions are linear with respect to uncertain parameters, and this linearity has led us to propose a novel sampling-based evaluation of violation probability.
2.1 Technical lemmas
This section introduces a series of lemmas. The following Gershgorin circle theorem characterizes the list of eigenvalues of a matrix by its elements
.
Lemma 7 (The Gershgorin circle theorem; see [11]). Let , and define
. Under suitable ordering, the set of eigenvalues
.
Hoeffding’s inequality below bounds the gap between sample average and true average.
Lemma 8 (Hoeffding’s inequality; see [9]). Let be i.i.d. random variables with values in [0, 1] and let
. Then we have
The following Berry-Esseen theorem enables us to evaluate the speed of convergence of the central limit theorem (for tighter coefficient in a one-dimensional case, see [12]).
Lemma 9 (The high-dimensional Berry-Esseen inequality [5])be an integer. Let
be a set of convex set in
and
. Let
be i.i.d. random variables over
with
, and
. Let
. It then holds that
2.2 Estimation accuracy
We can show here the accuracy of estimators on the basis of the lemmas introduced in the previous section and the following assumption. Let
denote the distribution of
where
.
Assumption 10. An upper-bound
We here assume that (an upper-bound of) the third moment and sixth moment
since the speed of convergence is less influenced by the higher-order moments
the first and second moments (
). In practice,
can be calculated if the domain of the distribution
is bounded, or it can also be estimated using samples
.
Under Assumption 10, the distribution of converges to the normal distribution
by the central limit theorem. Recall the definitions of
and
in Lemma 9. The speed of convergence can be characterized by
, and Lemma 9 as follows:
Lemma 11. It holds that
For , let us define
by
The relative accuracy of estimator can be characterized by Lemma 7 and Lemma 9 as follows.
Lemma 12. For , the following then holds with probability at least
:
Note that the above bound can be improved by using the Berry-Esseen theorem that is specialized for a one-dimensional random variable, rather than by using Lemma 9. Since this improvement does not influence the order of convergence, we here use Lemma 9 for simplicity of presentation.
This section proposes an algorithm that realizes the concept of probability evaluation discussed in Section 1.3.
3.1 Geometric characterization of S(µ, Y; Σ)
Recall that is the set of estimation error
that does not cause a violation of the true constraint over Y with the scale
. Here we prove that S is a convex set for any
, Y, and
.
Since is linear in
, there exists a function
such that
For , we define
by
We then define the polar cone and dual cone
of
by
The polar cone and the dual cone
are convex cones. For a pair of set
their Minkowski’s sum is defined by
. Note that Minkowski’s sum of two convex sets is also a convex set. The following theorem characterize
as an intersection of convex sets:
Lemma 13. For any , and
is a convex set. In particular, if
is linear in x for all k = 1, 2, . . . , K and Y is convex, then it holds that
3.2 Normal approximation via multidimensional Berry-Esseen’ theorem
On the basis of [16, Definition 5], we here introduce the minimum spatial uniform bounds as a minimum scale
that satisfies (4) with probability
when
is distributed by P:
be a distribution over
. The minimum spatial uniform bound
is defined by:
Let us define as the distribution of
. The ideal goal of this section, then, is to obtain
satisfies the following component-wise monotonicity. Let us denote
.
Recall that Lemma 13 shows the convexity of for arbitrary
then apply Lemma 11 to bound the gap of probability in (8) for
and the corresponding normal
. Combined with the monotonicity shown in Lemma 15 (i), we can upper-bound
of unknown distribution
by its asymptotically normal counterpart:
Proposition 16. It holds that
3.3 Sampling algorithm for calculating spatial uniform bounds for normal distribution
This section fixes , and we then denote
notational simplicity. Proposition 16 reduces our goal to that of calculating a minimum spatial uniform bound
. Lemma 15 (i) shows that such a
is monotone in
satisfies the following property:
Thus is monotone, upper and lower bounded, and its inverse
can be calculated by a series of sampling and optimization. This characterization leads the following algorithm for calculating
by sampling from
and a binary search, as shown in Algorithm 1.
Given , together with parameters
deciding accuracy, Algorithm 1 calculates a
that approximates
. Line 1 of the algorithm first defines the number Q of samples by
Line 2, then, generates samples from the normal distribution
. Lines 3–18 estimate
by the binary search. Line 4 defines the upper-bound
and the lower-bound
the basis of Lemma 17. Line 6 updates
as a mean of the upper-bound
and the lower-bound
, and then Lines 7–12 approximately calculate
on the basis of the samples
for i = 1, 2, . . . , Q and the following lemma:
Lemma 18. For if and only if
, where
For each , Line 8 calculates the optimization problem (10) for
, and then Lines 9– 11 approximately calculate
by counting the number of such
for i = 1, 2, . . . , Q. Note that some of the calculation of
can be omitted in practice since
is monotonically increasing in
. Lines 13–17 update the upper-bound
or the lower-bound
the basis of the approximate probability
. Finally, Line 19 outputs the approximate upper-bound
of
.
This output satisfies the following probabilistic guarantee.
Proposition 19. With probability at least , the output
of Algorithm 1 satisfy
The computational tractability of Algorithm 1 relies heavily on the choice of Y, and this is discussed in the next section, which utilizes Algorithm 1.
This section presents an algorithm that achieves our main theoretical result via the concept of subspace selection discussed in Section 1.3.
4.1 Two-step domain reduction algorithm
Let us define constants and empirically reduced domain
by
We can then propose Algorithm 2 for the calculation of the scale of robustness. Lines 1–4 represent the first stage of this algorithm, which conducts subspace selection. Line 1 first calculates coefficient
using Algorithm 1 with Y = X. Line 2 defines
by scaling
Line 3 calculates optimum value
by solving (2) on the basis of scale
. Line 4 conduct subspace selection by defining
. Lines 5–6 represent the second stage of the algorithm, which calculates scale
on the basis of the subspace
. Line 5 calculates
using Algorithm 1 with
, and Line 6 then outputs
which defined by scaling
.
4.2 Theoretical analysis
Let be the minimum integer n that satisfies
Note that such must exist since
. Our main theoretical result can then be given as follows:
Theorem 20. Suppose that , and
. Then the output
of Algorithm 2 satisfies
where
Thus, under the uniqueness of the true optimum solution , the estimated scale
is asymptotically independent of the dimension d.
The essence of the proof of this statement lies in finding subspaces that satisfy the following two conditions: (i) with high probability, it holds that
and (ii) . Property (i), together with Lemma 5, implies contribution to proof of (11), and the property (ii) implies asymptotic convergence (12). Since
is controlled by the estimated optimum value
, for the existence of both
, we need to control the range of estimated optimum value
. We utilize the following lemma for this control:
Lemma 21. Let and
. Suppose that
, and
satisfy
Roughly speaking, we bound the range of in probability by applying Lemma 21 with
and
.
Let us conclude this section with a brief discussion of the computational tractability of our algorithms.
Figure 2: Value-at-risk . The vertical line shows the value-at-risk, and the horizontal line shows the number of samples. Respective yellow, green, blue lines show the result with
Figure 3: Scale of robustness . Vertical line shows the value of scale of robustness
horizontal line shows the number of samples. Respective yellow, green, blue lines indicate the value of
Remark 22. Algorithm 2 mainly consists of an optimization (3) in Line 3 and two applications of Algorithm 2 in Lines 1 and 5, where each application consists of optimization of (10). If the original robust optimization (3) is a convex programming problem, then (10) with Y = X is a series of K convex programmings. It is thus natural to assume that Lines 1 and 3 are computationally tractable.
For the second application of Algorithm 1, in Line 5, note that the subspace will not generally be convex because of the concavity of
. In practical implementation, we can replace
in the definition of
by an upper-bound
that makes
convex. If such a
can be bounded above as
with some constant
, then the resulting output
of Algorithm 2 will also satisfy the guarantee in Theorem 20.
Let us introduce a simple example of such an upper-bound . Suppose that X is the positive orthant (
) and that
. Then
, where
is the
-norm for
. It then holds that
. Observe that
is linear on the positive orthant X, and thus this
is computationally tractable.
5.1 Experimental setting
Let us consider the following simple portfolio optimization problem, which has been examined by several existing studies in robust optimization [8, 3]. Suppose that there are d items whose costs are , which are uncertain parameters. The task is to find a convex combination (portfolio) that minimizes the cost:
This problem has an uncertain objective function. Thus, we first translate this problem into the our setting (1). Let us introduce an additional variable . We denote by x the set of variables
. We also introduce a dummy parameter
and g by
Then the portfolio optimization problem (13) is captured by our setting (1).
For this problem, we generate synthetic data as follows. We fix , and
. We set the covariance matrix
as a diagonal matrix
where
follows a uniform distribution over [0, 10]. For each sample size n and true covariance matrix, we generate 20 sets of n i.i.d. samples from
which samples value-at-risk will be approximately calculated. All our experimental results are average over 30 generation of such covariance matrix
. For a solution
that does not satisfy the true constraint, we put
instead of
for the purpose of clear visualization.
To make Algorithm 2 practical, we applied the following approximations: While we bound the gap of the probability mass of the true distribution of theoretical analysis, we simply replace the true distribution by its normal estimate
in implementation by putting
. Also, our theoretical analysis carefully replace the true covariance matrix
by its probabilistic upper-bound
or lower-bound
but we simply replace
by
in implementation by putting
.
5.2 Experimental results
We solved the robust optimization problem (2) for three different :
• [proposed] calculated by Algorithm 2,
• [lower-bound] . This is a lower-bound in terms of Lemma 17.
• [upper bound] . This is the the scale adopted by the standard approach on the basis of Fact 4.
Figure 2 shows the value-at-risk for each
. We observed that
• The lower-bound (yellow) showed worst performance with small sample size
. This performance is due to the violation of true constraints with probability more than
this
showed the best performance with large sample size
, this scale is inadequate for guaranteed optimization.
This paper has shown that the scale of robustness required for achieving a given confidence probability is dependent on the size of optimization domain, and then has proposed an algorithm for deciding the scale tightly on the basis of empirical domain reduction. The analysis has proven that the scale is asymptotically independent of the parameter dimension, and our experiments has demonstrated that the proposed method can achieve better performance while maintaining probabilistic guarantee.
[1] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust optimization. Princeton University Press, 2009.
[3] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations research letters, 25(1):1–13, 1999.
[4] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[5] Vidmantas Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003.
[6] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM review, 53(3):464–501, 2011.
[7] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.
[8] Dimitris Bertsimas and Melvyn Sim. The price of robustness. Operations research, 52(1):35– 53, 2004.
[9] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
[10] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612, 2010.
[11] Gene H Golub and Charles F Van Loan. Matriz computations. The Johns Hopkins University Press, Baltimore, USA, 1989.
[12] V Yu Korolev and Irina G Shevtsova. On the upper bound for the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 54(4):638–658, 2010.
[13] A Maurer and M Pontil. Empirical bernstein bounds and sample variance penalization. In COLT 2009-The 22nd Conference on Learning Theory, 2009.
[14] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2975–2984, 2017.
[15] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.
[16] Akihiro Yabe and Takanori Maehara. Empirical hypothesis space reduction. arXiv preprint, https://arxiv.org/abs/1909.01576, 2019.
A.1 Proofs for the statements in Section 1
Proof of Fact 4. By , for any
and k = 1, 2, . . . , K, it holds that
The left inequality implies that , and the right inequality implies that
. It thus holds that
.
Proof of Proposition 2. Since with probability at least
, it holds that
The last inequality follows from Fact 4.
Proof of Lemma 5. By the definition of and linearlity of
with respect to
for all
, it holds that
Since , it holds that
. By the first inequality and the optimality of
over X, then, we have
that satisfies
and
. Thus the statement holds.
A.2 Proofs for the statements in Section 2
For each , let us denote
Then each
is distributed by
.
Proof of Lemma 11. Observe that each by Assumption 10. Lemma 9 implies that
Mapping both random variables as and
, we have the statement.
Proof of Lemma 12. Let us define . Since
and the relationship
of positive semidefinit matrices is invariant under the mapping
by another positive semidefinite matrix
, it is enough to prove that
For each i = 1, 2, . . . , n and j, k = 1, 2, . . . , d, let us define . Then
by Hölder’s inequality, and
. Observe that
Thus, by Lemma 9, it holds that
Since , the following holds with probability
it holds that
Also, for , it holds that
by Hölder’s inequality, and
also by Hölder’s inequality. Thus, by Lemma 9, it holds that
The following holds with probability , it holds that
Thus the statement holds.
A.3 Proofs for the statements in Section 3
Proof of Lemma 13. Let us define and
for k = 1, 2, . . . , K by
Let us denote and
. It then holds that
. We then show that
and
are convex sets for all k = 1, 2, . . . , K.Observe that,
For each and
, then, let
denote an element of
that satisfy
. Suppose that
, and we will prove that
for any
. For any
, let
, which
is included by
by the convexity of
. Then, by the linearlity of
with respect to
and thus is convex set. The same line of discussion shows that
is also a convex set, and thus the first half of the statement holds.
Assume in addition that is linear with respect to x for all k = 1, 2, . . . , K and Y is a convex set. By Sion’s minimax theorem, then, it holds that
It then also holds that if and only if
. The second half of the statement thus holds, and the proof is complete.
Similarly, it holds that these then imply that
. It then holds that
The statement thus holds.
Proof of Proposition 16. By Lemma 11, it holds that
Thus the second inequality holds. The first inequality follows from the same line of discussion.
Proof of Lemma 17. (i) For the left inequality, let up pick arbitrary , and then it holds that
For the right inequality, since for any
and
. It then holds that
(ii) This statement directly follows from the definition of
Proof of Lemma 18. This statement directly follows from the definition of .
Proof of Proposition 19. Let us define by
Then the output satisfies
By Lemma 8, then, the following inequalities holds with probaiblity at least :
which shows that . Similarly, with probability at least
, it holds that
(17) and the union bound, then, the following holds with probability at least
:
The statement thus holds.
A.4 Proofs for the statements in Section 4
Proof of Theorem 20. Let . By Lemma 12, with probability at least
, it holds that
Then it holds that
The first inequality follows from Proposition 16. The second inequality follows from (18) and Lemma 15 (iii). The first equality follows from Lemma 15 (iv). The third inequality follows from (19). The forth inequality follows from (18), Lemma 15 (iii) and (iv), and (19). The fifth inequality follows from Lemma 17 (i).
Suppose that (18) and (20) hold. By the definition of the minimum spatial uniform bound, the following holds with probability at least :
Observe that . By Lemma 21 with Y = X, (18), and (20), it holds that
Observe that, by (18) and (22), for any and
, it holds that
It then holds that
Suppose that (18), (20) and (22) holds. Recall that . By (18), Proposition 16, and Lemma 17 (i), it holds that
By the definition of the minimum spatial uniform bound, with probability at least holds that
By the union bound, (18), (20), (22), and (25) hold with probability at least . Since
, it is immediate that
. Since
, the following inequalities hold.
It then holds that
Therefore it holds that . Also, since
imply
and thus . Therefore, by Lemma 5, it holds that
The last inequality follows from the union bound, and thus the proof is complete.
Proof of Lemma 21. By the definition of , and linearlity of
with respect to
, for all
and k = 1, 2, . . . , K, it holds that
Since , the inequalities above indicates that
1)
Σ))
Σ))
+ 1)
Σ)).