Stochastic online learning is a general framework of sequential decision problem. At each time, the learner selects (or plays) an action from a given finite action set, receives some random reward and observes some random feedback. One simplest, though often unrealistic, feedback model is full-information feedback where the learning agent can observe the random rewards of all actions no matter which action is selected. Another popular feedback model is bandit feedback where only the random reward of the selected action is revealed to the learner [Auer et al., 2002]. Recent studies further generalize them to graph feedback where the feedback model is characterized by a (directed) graph [Mannor and Shamir, 2011]. Each edge (i, j) means the learner will observe the random reward of action j if playing action i. This problem is motivated by advertisements where the response for a vacation advertisement could provide sideinformation for a similar vacation place and social networks where the response from a user to a promotion could infer her neighbors to similar offers.
The problem of online learning with graph feedback has been intensively studied in both adversarial [Mannor and Shamir, 2011; Alon et al., 2015a; Koc´ak et al., 2014; Cohen et al., 2016] and stochastic cases [Caron et al., 2012; Buccapatnam et al., 2014; Tossou et al., 2017; Wu et al., 2015]. However only a few of them do not assume self-loops [Alon et al., 2015a; Wu et al., 2015] where the reward of the selected action might be invisible. This general setting would fit into the partial monitoring framework [Bart´ok et al., 2014; Komiyama et al., 2015] but the literature on the latter mainly focus on finite case where the possible outcomes are finite. We also consider general feedback graphs that do not assume self-loops.
Though some studies assume feedback graphs could vary over time or even invisible to the learner before selecting actions [Koc´ak et al., 2014; Tossou et al., 2017], most works focus on deterministic graphs or an adversarial list of graphs with certain topology guarantees. To the best of our knowledge, only a few of them work on probabilistic graphs with [Koc´ak et al., 2016; Alon et al., 2017] on adversarial case and [Liu et al., 2018] on stochastic case and they only discuss about Erd¨os-R´enyi random graphs [Erd˝os and R´enyi, 1960]. Recall that an Erd¨os-R´enyi graph with parameter p is by random sampling the edge of every pair of nodes with probability p independently.
We consider general probabilistic feedback graphs with both the one-step case and the cascade case. The one-step case is the usual one where the learner observes reward of j if edge (i, j) exists in the random graph and i is selected. The cascade case means the learner observes reward of j if there is a (directed) path from i to j in the random graph and i is selected. The observations of the cascade case, in other words, follow a probabilistic cascading starting from the selected action — for each edge (i, j) with probability , if action i is either played or observed, then with an independent probability
a random reward sample of action j will be observed. As a motivating example, consider the information propagation in social networks. If selecting a user in a social network causes an information cascade in the social network, one may be able to observe further feedback from the cascade users.
This paper makes three major contributions.
1. We formalize the setting of stochastic online learning with general probabilistic graph feedback and consider
2. We derive asymptotic lower bounds for both the one-step and the cascade cases.
3. We design algorithms for both the one-step and the cascade cases and analyze their regrets. Their asymptotic regret upper bounds match the asymptotic lower bounds with high probability.
Related work The studies on online learning with graph feedback start from adversarial online learning with side observations where a decision maker can observe rewards of other actions as well as observe the reward of the selected action [Mannor and Shamir, 2011]. The observation structure can be encoded as a directed graph where there is an edge (i, j) if the reward of action j is observed when i is selected. Their setting assumes that self-loops exist on every node. Alon et al. [2015a] then generalize to arbitrary directed graphs as long as each action is observable by selecting some action. They show the structure of feedback graph controls the inherent difficulty of the learning problem and present a classification over graphs. These works assume the feedback graph is fixed over time and known to the learner. A follow-up [Alon et al., 2015b] extends to time-varying feedback graphs where the graphs are revealed either at the beginning of the round or at the end of the round but assumes good topology properties on the graphs. Koc´ak et al. [2014] also allow the feedback graph to vary over time and can be revealed to the learner at the end of the round, but assume the existence of self-loops. Cohen et al. [2016] assume the graph is not revealed in both adversarial and stochastic cases but also assume the existence of self-loops.
Besides the last one, there are also other works on the stochastic case. Caron et al. [2012] first study the stochastic case with side observations and design UCB-like algorithms with improved regret bound over the standard one without additional feedback. Buccapatnam et al. [2014] derive an asymptotic lower bound and design two algorithms that are near-optimal. Tossou et al. [2017] apply Thompson sampling and allow the feedback graph to be unknown and/or changing. They bound the Bayesian regret in terms of the size of minimum clique covering. All these works assume the feedback graph has self-loops. Wu et al. [2015] consider general feedback graphs but assume different observation variance from different choices of actions. They provide non-asymptotic problem-dependent regret lower bound and also design algorithms that achieve the problem-dependent lower bound and the minimax lower bounds.
There are several works on Erd¨os-R´enyi random feedback graphs where the feedback graph at each time is randomly generated by Erd¨os-R´enyi model. Koc´ak et al. [2016] consider adversarial case with the unknown generating probability of the feedback graphs. Liu et al. [2018] consider stochastic case and design a randomized policy with Bayesian regret guarantee. Both of them assume self-observability. An updated version [Alon et al., 2017] of [Alon et al., 2015b] extends one result to Erd¨os-R´enyi model in the adversarial case.
The setting of graph feedback can be fit into a more general setting of partial monitoring [Rustichini, 1999; Cesa-Bianchi and Lugosi, 2006] where feedback matrix and reward matrix are given for each pair of the chosen action and the environment. Bart´ok et al. [2014] make a significant progress on classifying finite adversarial partial monitoring games which is completed by [Lattimore and Szepesvari, 2018]. Komiyama et al. [2015] derive a problem-dependent regret lower bound and design an algorithm with asymptotically optimal regret upper bound in the stochastic case. Most studies on general partial monitoring framework focus on finite case where the number of actions and possible outcomes are finite. The algorithms for general partial monitoring games are not efficient in our case since the feedback matrix might be infinite or exponentially large.
The cascade observation feedback resembles the independent cascade model in the context of influence maximization studies [Kempe et al., 2003; Chen et al., 2013], but the goal is different: influence maximization aims at finding a set of k seeds that generates the largest expected cascade size, while our goal is to find the best action (arm) utilizing the cascade feedback. Influence maximization has been combined with online learning in several studies [Vaswani et al., 2015; Chen et al., 2016; Wen et al., 2017], but again their goal is to maximize influence cascade size while using online learning to gradually learn edge probabilities.
The feedback structure is characterized by a directed graph G = (V, E, p) over the set of actions V = [K], where E is the set of edges between actions and is a mapping describing live probabilities on edges. G is fixed and known to the learner. Assume for each action j, there is an edge
for some i.
At each time t, the environment privately draws a reward vector and a random realization
of G where any edge
is live in
with probability
independently of other edges. Here
is an independent random sample from the associated Gaussian reward distribution of action i which has mean
and variance 1 for each
independently of other actions. The assumption of the Gaussian distributions is mainly for the sake of simplicity and can be generalized (see Section 4.4). Then the learner selects an action
and receives reward
which might be invisible to the learner. We consider the following two kinds of feedback models. One-Step Case The learner will receive feedback
for any j such that there is an edge
in
. Cascade Case The learner will receive feedback
for any j such that there is a directed path from
to j in
.
Note that [Caron et al., 2012; Buccapatnam et al., 2014; Tossou et al., 2017; Alon et al., 2015a; Wu et al., 2015] consider deterministic feedback graphs which correspond to our one-step case with each equal to 1. The first three of them also assume self-loops exist in the graph. The work [Liu et al., 2018] corresponds to our one-step case with E containing the edge between every pair of actions and all
have the same value.
Suppose for each action i. The learner’s objective is to maximize its expected cumulative reward, or equiv-
alently, to minimize its expected cumulative regret
where the expectation is over the randomness of and
. We will omit G in the regret expression if the context is clear.
3.1 Lower Bound for the One-Step Case
For the sake of simplicity on the following index notations, assume there is a unique optimal action for . Let
be the k-th best action index for the parameter
. Then
. Denote
and
. Let
be the set of incoming neighbors of action j. Define
Recall that an algorithm is consistent if for any a > 0 and any parameter
. Then the asymptotic lower bound for any consistent algorithm is provided in the following theorem.
Theorem 1 For any consistent algorithm, the regret satisfies
Note this lower bound can easily recover the lower bound in [Wang and Chen, 2017, Theorem 3] where they only consider a special probabilistic graph G. Proof. Fix any consistent algorithm and any parameter .
For any , define
except
. Let P and
be the probability over all outcomes in T rounds when expected reward is
and
respectively. By high-dimensional Pinsker’s inequality ([Lattimore and Szepesvari, 2017, Lemma 5])
Note that
where are the Gaussian random variables with mean
(and variance 1) respectively. Then
Since the algorithm is consistent, for any a > 0 and any parameter
, or equivalently
Similar result follows
Thus the vector . Recall the regret is
. The result fol- lows.
3.2 Lower Bound for the Cascade Case
Let be the probability that there is a directed path from i to j in a random realization of G. Define
Theorem 2 For any consistent algorithm, the regret satisfies
Note that the computation of is #P-hard for general graphs [Valiant, 1979; Wang et al., 2012]. Thus the lower bound is not efficiently computable even when
is known.
The proof of Theorem 2 is similar to that of Theorem 1 by replacing (3) with the following formula
Let be the number of times the learner selects an action i and
by the end of time t. Let
be the sample-mean estimates of
by the end of time t. Let
be the number of times that action i is selected and reward for action j is observed by the end of time t. Then
. Let
be the number of observations for action j by the end of time t.
4.1 One-Step Uniform Case
The uniform case in which all ’s have the same value p is first considered in this section. When E contains edges between every pair of actions, this graph reduces to Erd¨osR´enyi random graph with parameter p.
Let be the expected numberof observations for action j at the end of time t. Then
.
The pseudocode of the algorithm is provided in Algorithm 1. It starts with the initialization of and the estimates of
(line 1). Here
is the number of exploration rounds for the learner to know more about unknown
which will be clearer later. At each time t, if for some j the real observation times of action j is less than half the expected observation times (line 3), then the learner selects a parent of j to try to observe reward of j once more (line 4) and keeps
unchanged (line 5). Note that
and
will concentrate at m as m goes to infinity. The condition
means part of the realizations of graph G is far from the expectation and 2 can be changed to other larger-than-1 constant. This is one of the key differences from deterministic graph feedback where the number of observations is well controlled by just selecting actions. While under the probabilistic graph feedback, there is a gap between the number of real observations and expected number of observations.
When for all j, then the realizations of G are good enough and the learner can rely on the quantities of selections to control the accuracy of the estimates. If the selection vector is good enough for current
under current accuracy level (line 6), then the learner will exploit the current best action (line 7) and keep
unchanged. Here
is defined as in (1) and represents the set of good selected “fractions” of actions that are able to identify the reward gaps between actions.
If the current selection vector N is not good enough, then the learner will first check if is close enough to
(line 9-11) and if yes, will explore according to current
. The number
of exploration rounds for the learner to know more about
will increase in this part (line 11&14). The condition of line 9 has an auxiliary function
to guide the exploration such that
will be close to
in the long run. This auxiliary function is also crucial in previous work [Wu et al., 2015] to control the regret bound in the asymptotic sense. The auxiliary function
can be any non-decreasing function satisfying
and the subadditivity
. If some component of
has not been explored enough (line 9), then the learner selects a parent to try to get one more observation (line 10) and increases
(line 11).
When all components of are close to
, the learner selects an action according to the current
with minimal cost on the regret instructed by the asymptotic lower bound (2). Here
denotes any optimal solution of the linear programming problem that minimizes
among all
. Since
is close enough to
under current accuracy level, the vector
is close enough to
(which is part of the proof for the following theorem). There must be at least an i such that
or else the condition of line 6 holds.
The regret bound for the algorithm is stated as follows. Theorem 3 The regret of Algorithm 1 for one-step uniform case satisfies for any ,
any . Then for any
such that
is unique,
holds with probability at least for any
.
Note that any with
meets the requirements. Due to space limit, the proofs are provided in Appendix A.
4.2 One-Step General Case
In the general case where can be different,
. The algorithm follows as in Algorithm 1 by only replacing line 4 with (4’) Play
. Let
be the set of exploration nodes that have the largest live probability among all incoming edges to some j. Let
be the minimal exploration probability for any . With a modified proof to the uniform case, the theoretical guarantee for the general case follows.
Theorem 4 The regret of the modified Algorithm 1’ for one-step general case satisfies for any ,
Assume and
for any
. Then for any
such that
is unique,
holds with probability at least for any
.
4.3 Cascade Case
For the deterministic graphs, there is no essential difference between one-step case and cascade case — the cascade case on a deterministic graph would be equivalent to constructing a new graph where an edge exists if and only if there is a path on the original graph. For a probabilistic graph, one might try a similar solution for the cascade case by constructing a new graph where the probability of an edge (i, j) is just the probability
of i connecting to j in a random re- alization of the original graph. However the computation of
is #P-hard for general graphs, and thus the accurate graph
is unattainable, though it can be approximated within any accuracy by Monte Carlo simulations. Therefore, during the running of the algorithm, a reasonable approximation of
is needed.
Define and
similarly with (6) and (7) by replacing
with
. Since the computation of
is #P-hard, we define an estimated version of
and
respectively:
for any . Then
for some
. To overcome the stated challenge, we need an auxiliary functions
to set up the tolerance of the approximation. At each time t, the path from i to j with probability
can be treated as nonexistent (with prob- ability 0) and the estimation of
has noise at most
if the real value
. Any non-increasing function with limit 0 can be chosen as
. The choice of
is to control the complexity of the graph with only focusing the path of a reasonable length.
lem
where satisfies
if
and
if
and
for
With the approximation and the estimated value for reward vector
, the linear programming problem considered in time t is
and the corresponding P in (10) is denoted as
. Then the algorithm runs with
accordingly. The complete pseudocode is presented in Algorithm 2. In particular, the examination on the realization is performed on approximated graph
with probability matrix
(line 3). The exploitation condition is on the
(line
6). Here is the feasible solution set of the linear programming problem
which is the set of all
satisfying
. The exploration when all compo- nents of estimated
are accurate enough with minimal cost instructed by linear programming solutions is also related to
(line 13). Here
is the optimal solution set of
. Also
is changed accordingly. The regret of the Algorithm 2 is upper bounded in the fol-
lowing theorem.
Theorem 5 The regret of the Algorithm 2 for cascade case
satisfies for any ,
where
Assume and
for
any . Then for any
such that
is unique,
holds with probability at least for any
.
The result depends on the robustness of the linear programming problems. The P matrix in the LP problem (10) is noisy, which is much different from one-step case and the case of deterministic graphs where the noise is only on . See discussions in the next section. The full proof is put in Appendix B.
4.4 Discussions
The assumption on the Gaussian reward distributions can be generalized to any type of sub-Gaussian distributions with a 1-dimensional mean parameter such that the KL-divergence between distributions of different parameters is finite and is continuous in the 1-dimensional mean parameter. Then the term in the definition of
(see Eq.(1)) is replaced by
accordingly where
is the corresponding reward distribution with mean parameter. The Gaussian distribution, Bernoulli distribution and common continuous random distribution on a common bounded interval like Beta distribution all satisfy the requirements.
The term in the regret bound for one-step uniform case (same for other two cases) is due to the gap between the realizations and the expectations of the probabilistic graphs. The term 1/p also appears in the regret
of [Koc´ak et al., 2016] on Erd¨os-R´enyi random graphs in adversarial setting. It is not clear whether this 1/p term represents hindsight difficulty between the probabilistic graphs and deterministic graphs. This would be an interesting future direction.
The terms in the one-step general case de- scribes the minimal exploration probabilities to observe every action. For each
for some j, that is
is the largest live probabilities among all incoming edges for some j. These terms represent the problem complexities for the underlying probabilistic graph. When all
are equal to
.
The term in the cascade case is usually larger than
since it takes the same operations on the connection probabilities of incoming paths which are larger than live probabilities of incoming edges. The term
is an estimation satisfying
for some
.
Next we discuss the difference in proof of the cascade case. If the noise of the linear programming problems is on the b vector in (10), then by the standard results in statistics [Dontchev and Rockafellar, 2009, 3C.5], the resulting optimal solution sets are Lipschitz continuous. The property of Lipschitz continuity is essential since actions are selected according to the optimal solution of a noisy LP problem (line 13) and we need to guarantee this kind of selections is safe. The noise on
vector in (10) is also easy to deal with by considering the dual problem. However, it is much different if the noise is on the P matrix. For example, consider the LP problem that minimizes x over all
and
with parameter a > 0. The optimal solution
is not Lipschitz continuous with respect to a. So the standard statistical tools could not apply here. We derive a novel property of the Lipschitz continuity when there is noise on P for our specific P matrix.
We are the first to formalize the setting of stochastic online learning with probabilistic feedback graph. We derive asymptotic lower bounds for both one-step and cascade cases. The regret bounds of our designed algorithms match the lower bounds with high probability.
This framework is new and we only provide asymptotic lower bounds and finite-time problem-dependent upper bounds. Finite-time lower bounds and minimax upper/lower bounds are all interesting future directions. Bayesian regret is also an interesting topic.
[Alon et al., 2015a] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35, 2015.
[Alon et al., 2015b] Noga Alon, Nicol`o Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. arXiv preprint arXiv:1502.07617, 2015.
[Alon et al., 2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graphstructured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
[Bart´ok et al., 2014] G´abor Bart´ok, Dean P Foster, D´avid P´al, Alexander Rakhlin, and Csaba Szepesv´ari. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967– 997, 2014.
[Buccapatnam et al., 2014] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. ACM SIGMETRICS Performance Evaluation Review, 42(1):289–300, 2014.
[Caron et al., 2012] St´ephane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. In UAI, pages 142–151. AUAI Press, 2012.
[Cesa-Bianchi and Lugosi, 2006] Nicolo Cesa-Bianchi and G´abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
[Chen et al., 2013] Wei Chen, Laks V. S. Lakshmanan, and Carlos Castillo. Information and Influence Propagation in Social Networks. Morgan & Claypool Publishers, 2013.
[Chen et al., 2016] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.
[Cohen et al., 2016] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.
[Dontchev and Rockafellar, 2009] Asen L Dontchev and R Tyrrell Rockafellar. Implicit functions and solution mappings. Springer Monogr. Math., 2009.
[Erd˝os and R´enyi, 1960] P Erd˝os and A R´enyi. On the evo- lution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61, 1960.
[Hoeffding, 1963] Wassily Hoeffding. Probability inequali- ties for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
[Kempe et al., 2003] David Kempe, Jon M. Kleinberg, and ´Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 137–146, 2003.
[Koc´ak et al., 2014] Tom´aˇs Koc´ak, Gergely Neu, Michal Valko, and R´emi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.
[Koc´ak et al., 2016] Tom´aˇs Koc´ak, Gergely Neu, and Michal Valko. Online learning with erd˝os-r´enyi side-observation graphs. In Uncertainty in Artificial Intelligence, 2016.
[Komiyama et al., 2015] Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems, pages 1792–1800, 2015.
[Lattimore and Szepesvari, 2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.
[Lattimore and Szepesvari, 2018] Tor Lattimore and Csaba Szepesvari. Cleaning up the neighborhood: A full classi-fication for adversarial partial monitoring. arXiv preprint arXiv:1805.09247, 2018.
[Liu et al., 2018] Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In AAAI, 2018.
[Mannor and Shamir, 2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of sideobservations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
[Rustichini, 1999] Aldo Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1-2):224–243, 1999.
[Tossou et al., 2017] Aristide CY Tossou, Christos Dimitrakakis, and Devdatt Dubhashi. Thompson sampling for stochastic bandits with graph feedback. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[Valiant, 1979] Leslie G. Valiant. The complexity of enu- meration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
[Vaswani et al., 2015] Sharan Vaswani, Laks Lakshmanan, Mark Schmidt, et al. Influence maximization with bandits. arXiv preprint arXiv:1503.00024, 2015.
[Wang and Chen, 2017] Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In NeurIPS, pages 1161–1171, 2017.
[Wang et al., 2012] Chi Wang, Wei Chen, and Yajun Wang. Scalable influence maximization for independent cascade model in large-scale social networks. DMKD, 2012.
[Wen et al., 2017] Zheng Wen, Branislav Kveton, Michal Valko, and Sharan Vaswani. Online influence maximization under independent cascade model with semi-bandit feedback. In NeurIPS, pages 3025–3035, 2017.
[Wu et al., 2015] Yifan Wu, Andr´as Gy¨orgy, and Csaba Szepesv´ari. Online learning with gaussian payoffs and side observations. In NeurIPS, pages 1360–1368, 2015.
10 log
for some
< M
, for some j
N
/ K, for some j
ǫ, for any i
Bound the regret under B Note
Then it remains to bound . Bound the regret under D Suppose
and
hold. Then
for any
and
for
. Or equivalently
1)
32 ∆
(ˆ
log(
1) for
(ˆ
1)
32 ∆
)log(
1) for
(ˆ
On ,
1)
16 ∆
(ˆ
log(
1) for
(ˆ
1)
16 ∆
)log(
1) for
(ˆ
Then
since
Thus it remains to bound . Bound the regret under
Similar to [Wu et al., 2015, Proposition 17] where the statement
is replaced by
,
Then
Next by [Wu et al., 2015, Lemma 19],
Thus by (16), (14), (17) and (18),
Putting (13), (14), (15), (17), (19), (20) together, the regret satisfies
Next prove the asymptotic behavior of the regret upper bound. Claim:as
for any
.
Suppose not. There exists such that
, or
for all
, stops increasing when
for some
. Then the condition on line 6 is not satisfied when
for some
¿ 0. By the condition on line 9,
also stops increasing and the condition on line 9 for any
is not satisfied any more when
for some
. Also line 14 will not be performed since
stops increasing. Therefore the condition on line 3 always holds, which is impossible.
For any , the probability that the condition on line 3 does not hold when
is at least
. There exists
such that when
for any j since
. Then with probability at least
, line 4-5 are not called any more. The events
is modified by
for some
and (13) is
replaced by . All other parts stay the same. Then the regret satisfies
For any , there exists an
such that the distance between the optimal solution set of
and
for any
such that
for all
is at most
. Here the distance is Pompeiu-Hausdorff distance of sets. This is because the Lipschitz continuity of the optimal set mapping (see [Dontchev and Rockafellar, 2009,
3C.5]) and the duality of linear programming problems. Since
is unique,
is upper bounded by
. Then divide
by log(T ) and let T go to
,
For instance, can be chosen as
The proof of Theorem 4 is similar to Theorem 3 by modifying (13) with
First we prove a useful lemma on the robustness of linear programming problem where the coefficient matrix is of a specific form.
Lemma 6 Denote the linear programming problem of the form
by LP(A, b, c) where and all entries in A, b, c are non-negative. Let the feasible set mapping, the optimal value mapping and the optimal set mapping be
respectively. Note that is always finite with the A, b, c of positive (or even non-negative) entries. Fix a pair (i, j), assume A(i, j) > 0. Let
except
. Then
for some depending on A, b, c and i, j.
Proof. By [Dontchev and Rockafellar, 2009, Theorem 3C.3], the mapping
is Lipschitz continuous. Recall that the distance on sets is Pompeiu-Hausdorff distance . First assume
. Then
If , then
. Suppose not and let
. Then
for some depending on A, b, c and i, j. Thus
for some depending on A, b, c, where
is the unit ball in
and the second inequality is due to the Lipshitz continuity of G in (25). Also
where (*) is by by the Lipshitz continuity of . The case of
follows similarly.
Proof. [of Theorem 5] The finite-time regret is similar to the previous proof. The main difference is on the bound for line 13. In particular, the results of (18) and (19) are changed to be
since can bound
. The proof of other parts follow the proof of Theorem 4 similarly.
By the non-increasing property of whose limit is
would be smaller than
when
for some
. Then
only has small noise on the nonzero entries of
. By Lemma 6,
is Lipschitz continuous in
for any
. Thus
The remaining discussion on is similar.
Lemma 7 (Hoeffding’s Inequality [Hoeffding, 1963]) Let be independent random variable with common support [0, 1]. Let
and
. Then for all
,
Lemma 8 (Bernstein’s Inequality) Let be independent zero-mean random variables. Suppose that
almost surely for all i. Then for all
,
Lemma 9 Let be independent Bernoulli random variables with mean
respectively.
Proof.
where (26) is by Bernstein’s inequality (Lemma 8) and (27) holds when