Stochastic Online Learning with Probabilistic Graph Feedback

2019·Arxiv

Abstract

Abstract

We consider a problem of stochastic online learning with general probabilistic graph feedback. Two cases are covered. (a) The one-step case where for each edge (i, j) with probability in the probabilistic feedback graph. After playing arm i the learner observes a sample reward feedback of arm j with independent probability . (b) The cascade case where after playing arm i the learner observes feedback of all arms j in a probabilistic cascade starting from i – for each (i, j) with probability , if arm i is played or observed, then a reward sample of arm j would be observed with independent probability . Previous works mainly focus on deterministic graphs which corresponds to one-step case with , an adversarial sequence of graphs with certain topology guarantees or a specific type of random graphs. We analyze the asymptotic lower bounds and design algorithms in both cases. The regret upper bounds of the algorithms match the lower bounds with high probability.

1 Introduction

Stochastic online learning is a general framework of sequential decision problem. At each time, the learner selects (or plays) an action from a given finite action set, receives some random reward and observes some random feedback. One simplest, though often unrealistic, feedback model is full-information feedback where the learning agent can observe the random rewards of all actions no matter which action is selected. Another popular feedback model is bandit feedback where only the random reward of the selected action is revealed to the learner [Auer et al., 2002]. Recent studies further generalize them to graph feedback where the feedback model is characterized by a (directed) graph [Mannor and Shamir, 2011]. Each edge (i, j) means the learner will observe the random reward of action j if playing action i. This problem is motivated by advertisements where the response for a vacation advertisement could provide sideinformation for a similar vacation place and social networks where the response from a user to a promotion could infer her neighbors to similar offers.

The problem of online learning with graph feedback has been intensively studied in both adversarial [Mannor and Shamir, 2011; Alon et al., 2015a; Koc´ak et al., 2014; Cohen et al., 2016] and stochastic cases [Caron et al., 2012; Buccapatnam et al., 2014; Tossou et al., 2017; Wu et al., 2015]. However only a few of them do not assume self-loops [Alon et al., 2015a; Wu et al., 2015] where the reward of the selected action might be invisible. This general setting would fit into the partial monitoring framework [Bart´ok et al., 2014; Komiyama et al., 2015] but the literature on the latter mainly focus on finite case where the possible outcomes are finite. We also consider general feedback graphs that do not assume self-loops.

Though some studies assume feedback graphs could vary over time or even invisible to the learner before selecting actions [Koc´ak et al., 2014; Tossou et al., 2017], most works focus on deterministic graphs or an adversarial list of graphs with certain topology guarantees. To the best of our knowledge, only a few of them work on probabilistic graphs with [Koc´ak et al., 2016; Alon et al., 2017] on adversarial case and [Liu et al., 2018] on stochastic case and they only discuss about Erd¨os-R´enyi random graphs [Erd˝os and R´enyi, 1960]. Recall that an Erd¨os-R´enyi graph with parameter p is by random sampling the edge of every pair of nodes with probability p independently.

We consider general probabilistic feedback graphs with both the one-step case and the cascade case. The one-step case is the usual one where the learner observes reward of j if edge (i, j) exists in the random graph and i is selected. The cascade case means the learner observes reward of j if there is a (directed) path from i to j in the random graph and i is selected. The observations of the cascade case, in other words, follow a probabilistic cascading starting from the selected action — for each edge (i, j) with probability , if action i is either played or observed, then with an independent probability a random reward sample of action j will be observed. As a motivating example, consider the information propagation in social networks. If selecting a user in a social network causes an information cascade in the social network, one may be able to observe further feedback from the cascade users.

This paper makes three major contributions.

1. We formalize the setting of stochastic online learning with general probabilistic graph feedback and consider

2. We derive asymptotic lower bounds for both the one-step and the cascade cases.

3. We design algorithms for both the one-step and the cascade cases and analyze their regrets. Their asymptotic regret upper bounds match the asymptotic lower bounds with high probability.

Related work The studies on online learning with graph feedback start from adversarial online learning with side observations where a decision maker can observe rewards of other actions as well as observe the reward of the selected action [Mannor and Shamir, 2011]. The observation structure can be encoded as a directed graph where there is an edge (i, j) if the reward of action j is observed when i is selected. Their setting assumes that self-loops exist on every node. Alon et al. [2015a] then generalize to arbitrary directed graphs as long as each action is observable by selecting some action. They show the structure of feedback graph controls the inherent difficulty of the learning problem and present a classification over graphs. These works assume the feedback graph is fixed over time and known to the learner. A follow-up [Alon et al., 2015b] extends to time-varying feedback graphs where the graphs are revealed either at the beginning of the round or at the end of the round but assumes good topology properties on the graphs. Koc´ak et al. [2014] also allow the feedback graph to vary over time and can be revealed to the learner at the end of the round, but assume the existence of self-loops. Cohen et al. [2016] assume the graph is not revealed in both adversarial and stochastic cases but also assume the existence of self-loops.

Besides the last one, there are also other works on the stochastic case. Caron et al. [2012] first study the stochastic case with side observations and design UCB-like algorithms with improved regret bound over the standard one without additional feedback. Buccapatnam et al. [2014] derive an asymptotic lower bound and design two algorithms that are near-optimal. Tossou et al. [2017] apply Thompson sampling and allow the feedback graph to be unknown and/or changing. They bound the Bayesian regret in terms of the size of minimum clique covering. All these works assume the feedback graph has self-loops. Wu et al. [2015] consider general feedback graphs but assume different observation variance from different choices of actions. They provide non-asymptotic problem-dependent regret lower bound and also design algorithms that achieve the problem-dependent lower bound and the minimax lower bounds.

There are several works on Erd¨os-R´enyi random feedback graphs where the feedback graph at each time is randomly generated by Erd¨os-R´enyi model. Koc´ak et al. [2016] consider adversarial case with the unknown generating probability of the feedback graphs. Liu et al. [2018] consider stochastic case and design a randomized policy with Bayesian regret guarantee. Both of them assume self-observability. An updated version [Alon et al., 2017] of [Alon et al., 2015b] extends one result to Erd¨os-R´enyi model in the adversarial case.

The setting of graph feedback can be fit into a more general setting of partial monitoring [Rustichini, 1999; Cesa-Bianchi and Lugosi, 2006] where feedback matrix and reward matrix are given for each pair of the chosen action and the environment. Bart´ok et al. [2014] make a significant progress on classifying finite adversarial partial monitoring games which is completed by [Lattimore and Szepesvari, 2018]. Komiyama et al. [2015] derive a problem-dependent regret lower bound and design an algorithm with asymptotically optimal regret upper bound in the stochastic case. Most studies on general partial monitoring framework focus on finite case where the number of actions and possible outcomes are finite. The algorithms for general partial monitoring games are not efficient in our case since the feedback matrix might be infinite or exponentially large.

The cascade observation feedback resembles the independent cascade model in the context of influence maximization studies [Kempe et al., 2003; Chen et al., 2013], but the goal is different: influence maximization aims at finding a set of k seeds that generates the largest expected cascade size, while our goal is to find the best action (arm) utilizing the cascade feedback. Influence maximization has been combined with online learning in several studies [Vaswani et al., 2015; Chen et al., 2016; Wen et al., 2017], but again their goal is to maximize influence cascade size while using online learning to gradually learn edge probabilities.

2 Settings

The feedback structure is characterized by a directed graph G = (V, E, p) over the set of actions V = [K], where E is the set of edges between actions and is a mapping describing live probabilities on edges. G is fixed and known to the learner. Assume for each action j, there is an edge for some i.

At each time t, the environment privately draws a reward vector and a random realization of G where any edge is live in with probability independently of other edges. Here is an independent random sample from the associated Gaussian reward distribution of action i which has mean and variance 1 for each independently of other actions. The assumption of the Gaussian distributions is mainly for the sake of simplicity and can be generalized (see Section 4.4). Then the learner selects an action and receives reward which might be invisible to the learner. We consider the following two kinds of feedback models. One-Step Case The learner will receive feedback for any j such that there is an edge in . Cascade Case The learner will receive feedback for any j such that there is a directed path from to j in .

Note that [Caron et al., 2012; Buccapatnam et al., 2014; Tossou et al., 2017; Alon et al., 2015a; Wu et al., 2015] consider deterministic feedback graphs which correspond to our one-step case with each equal to 1. The first three of them also assume self-loops exist in the graph. The work [Liu et al., 2018] corresponds to our one-step case with E containing the edge between every pair of actions and all have the same value.

Suppose for each action i. The learner’s objective is to maximize its expected cumulative reward, or equiv-

alently, to minimize its expected cumulative regret

where the expectation is over the randomness of and . We will omit G in the regret expression if the context is clear.

3 Asymptotic Lower Bounds

3.1 Lower Bound for the One-Step Case

For the sake of simplicity on the following index notations, assume there is a unique optimal action for . Let be the k-th best action index for the parameter . Then . Denote and . Let be the set of incoming neighbors of action j. Define

Recall that an algorithm is consistent if for any a > 0 and any parameter . Then the asymptotic lower bound for any consistent algorithm is provided in the following theorem.

Theorem 1 For any consistent algorithm, the regret satisfies

Note this lower bound can easily recover the lower bound in [Wang and Chen, 2017, Theorem 3] where they only consider a special probabilistic graph G. Proof. Fix any consistent algorithm and any parameter .

For any , define except . Let P and be the probability over all outcomes in T rounds when expected reward is and respectively. By high-dimensional Pinsker’s inequality ([Lattimore and Szepesvari, 2017, Lemma 5])

Note that

where are the Gaussian random variables with mean (and variance 1) respectively. Then

Since the algorithm is consistent, for any a > 0 and any parameter , or equivalently

Similar result follows

Thus the vector . Recall the regret is . The result fol- lows.

3.2 Lower Bound for the Cascade Case

Let be the probability that there is a directed path from i to j in a random realization of G. Define

Theorem 2 For any consistent algorithm, the regret satisfies

Note that the computation of is #P-hard for general graphs [Valiant, 1979; Wang et al., 2012]. Thus the lower bound is not efficiently computable even when is known.

The proof of Theorem 2 is similar to that of Theorem 1 by replacing (3) with the following formula

4 Algorithm and Analysis

Let be the number of times the learner selects an action i and by the end of time t. Let be the sample-mean estimates of by the end of time t. Let be the number of times that action i is selected and reward for action j is observed by the end of time t. Then . Let be the number of observations for action j by the end of time t.

4.1 One-Step Uniform Case

The uniform case in which all ’s have the same value p is first considered in this section. When E contains edges between every pair of actions, this graph reduces to Erd¨osR´enyi random graph with parameter p.

Let be the expected numberof observations for action j at the end of time t. Then .

The pseudocode of the algorithm is provided in Algorithm 1. It starts with the initialization of and the estimates of (line 1). Here is the number of exploration rounds for the learner to know more about unknown which will be clearer later. At each time t, if for some j the real observation times of action j is less than half the expected observation times (line 3), then the learner selects a parent of j to try to observe reward of j once more (line 4) and keeps unchanged (line 5). Note that and will concentrate at m as m goes to infinity. The condition means part of the realizations of graph G is far from the expectation and 2 can be changed to other larger-than-1 constant. This is one of the key differences from deterministic graph feedback where the number of observations is well controlled by just selecting actions. While under the probabilistic graph feedback, there is a gap between the number of real observations and expected number of observations.

When for all j, then the realizations of G are good enough and the learner can rely on the quantities of selections to control the accuracy of the estimates. If the selection vector is good enough for current under current accuracy level (line 6), then the learner will exploit the current best action (line 7) and keep unchanged. Here is defined as in (1) and represents the set of good selected “fractions” of actions that are able to identify the reward gaps between actions.

If the current selection vector N is not good enough, then the learner will first check if is close enough to (line 9-11) and if yes, will explore according to current . The number of exploration rounds for the learner to know more about will increase in this part (line 11&14). The condition of line 9 has an auxiliary function to guide the exploration such that will be close to in the long run. This auxiliary function is also crucial in previous work [Wu et al., 2015] to control the regret bound in the asymptotic sense. The auxiliary function can be any non-decreasing function satisfying and the subadditivity . If some component of has not been explored enough (line 9), then the learner selects a parent to try to get one more observation (line 10) and increases (line 11).

When all components of are close to , the learner selects an action according to the current with minimal cost on the regret instructed by the asymptotic lower bound (2). Here denotes any optimal solution of the linear programming problem that minimizes among all . Since is close enough to under current accuracy level, the vector is close enough to (which is part of the proof for the following theorem). There must be at least an i such that or else the condition of line 6 holds.

The regret bound for the algorithm is stated as follows. Theorem 3 The regret of Algorithm 1 for one-step uniform case satisfies for any ,

any . Then for any such that is unique,

holds with probability at least for any .

Note that any with meets the requirements. Due to space limit, the proofs are provided in Appendix A.

4.2 One-Step General Case

In the general case where can be different, . The algorithm follows as in Algorithm 1 by only replacing line 4 with (4’) Play . Let

be the set of exploration nodes that have the largest live probability among all incoming edges to some j. Let

be the minimal exploration probability for any . With a modified proof to the uniform case, the theoretical guarantee for the general case follows.

Theorem 4 The regret of the modified Algorithm 1’ for one-step general case satisfies for any ,

Assume and for any . Then for any such that is unique,

holds with probability at least for any .

4.3 Cascade Case

For the deterministic graphs, there is no essential difference between one-step case and cascade case — the cascade case on a deterministic graph would be equivalent to constructing a new graph where an edge exists if and only if there is a path on the original graph. For a probabilistic graph, one might try a similar solution for the cascade case by constructing a new graph where the probability of an edge (i, j) is just the probability of i connecting to j in a random re- alization of the original graph. However the computation of is #P-hard for general graphs, and thus the accurate graph is unattainable, though it can be approximated within any accuracy by Monte Carlo simulations. Therefore, during the running of the algorithm, a reasonable approximation of is needed.

Define and similarly with (6) and (7) by replacing with . Since the computation of is #P-hard, we define an estimated version of and respectively:

for any . Then for some . To overcome the stated challenge, we need an auxiliary functions to set up the tolerance of the approximation. At each time t, the path from i to j with probability can be treated as nonexistent (with prob- ability 0) and the estimation of has noise at most if the real value . Any non-increasing function with limit 0 can be chosen as . The choice of is to control the complexity of the graph with only focusing the path of a reasonable length.

lem

where satisfies if and if and for

With the approximation and the estimated value for reward vector , the linear programming problem considered in time t is and the corresponding P in (10) is denoted as . Then the algorithm runs with accordingly. The complete pseudocode is presented in Algorithm 2. In particular, the examination on the realization is performed on approximated graph with probability matrix (line 3). The exploitation condition is on the (line

6). Here is the feasible solution set of the linear programming problem which is the set of all satisfying . The exploration when all compo- nents of estimated are accurate enough with minimal cost instructed by linear programming solutions is also related to (line 13). Here is the optimal solution set of . Also is changed accordingly. The regret of the Algorithm 2 is upper bounded in the fol-

lowing theorem.

Theorem 5 The regret of the Algorithm 2 for cascade case

satisfies for any ,

where

Assume and for

any . Then for any such that is unique,

holds with probability at least for any .

The result depends on the robustness of the linear programming problems. The P matrix in the LP problem (10) is noisy, which is much different from one-step case and the case of deterministic graphs where the noise is only on . See discussions in the next section. The full proof is put in Appendix B.

4.4 Discussions

The assumption on the Gaussian reward distributions can be generalized to any type of sub-Gaussian distributions with a 1-dimensional mean parameter such that the KL-divergence between distributions of different parameters is finite and is continuous in the 1-dimensional mean parameter. Then the term in the definition of (see Eq.(1)) is replaced by accordingly where is the corresponding reward distribution with mean parameter. The Gaussian distribution, Bernoulli distribution and common continuous random distribution on a common bounded interval like Beta distribution all satisfy the requirements.

The term in the regret bound for one-step uniform case (same for other two cases) is due to the gap between the realizations and the expectations of the probabilistic graphs. The term 1/p also appears in the regret of [Koc´ak et al., 2016] on Erd¨os-R´enyi random graphs in adversarial setting. It is not clear whether this 1/p term represents hindsight difficulty between the probabilistic graphs and deterministic graphs. This would be an interesting future direction.

The terms in the one-step general case de- scribes the minimal exploration probabilities to observe every action. For each for some j, that is is the largest live probabilities among all incoming edges for some j. These terms represent the problem complexities for the underlying probabilistic graph. When all are equal to .

The term in the cascade case is usually larger than since it takes the same operations on the connection probabilities of incoming paths which are larger than live probabilities of incoming edges. The term is an estimation satisfying for some .

Next we discuss the difference in proof of the cascade case. If the noise of the linear programming problems is on the b vector in (10), then by the standard results in statistics [Dontchev and Rockafellar, 2009, 3C.5], the resulting optimal solution sets are Lipschitz continuous. The property of Lipschitz continuity is essential since actions are selected according to the optimal solution of a noisy LP problem (line 13) and we need to guarantee this kind of selections is safe. The noise on vector in (10) is also easy to deal with by considering the dual problem. However, it is much different if the noise is on the P matrix. For example, consider the LP problem that minimizes x over all and with parameter a > 0. The optimal solution is not Lipschitz continuous with respect to a. So the standard statistical tools could not apply here. We derive a novel property of the Lipschitz continuity when there is noise on P for our specific P matrix.

5 Conclusion and Future Work

We are the first to formalize the setting of stochastic online learning with probabilistic feedback graph. We derive asymptotic lower bounds for both one-step and cascade cases. The regret bounds of our designed algorithms match the lower bounds with high probability.

This framework is new and we only provide asymptotic lower bounds and finite-time problem-dependent upper bounds. Finite-time lower bounds and minimax upper/lower bounds are all interesting future directions. Bayesian regret is also an interesting topic.

References

[Alon et al., 2015a] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35, 2015.

[Alon et al., 2015b] Noga Alon, Nicol`o Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. arXiv preprint arXiv:1502.07617, 2015.

[Alon et al., 2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graphstructured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

[Bart´ok et al., 2014] G´abor Bart´ok, Dean P Foster, D´avid P´al, Alexander Rakhlin, and Csaba Szepesv´ari. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967– 997, 2014.

[Buccapatnam et al., 2014] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. ACM SIGMETRICS Performance Evaluation Review, 42(1):289–300, 2014.

[Caron et al., 2012] St´ephane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. In UAI, pages 142–151. AUAI Press, 2012.

[Cesa-Bianchi and Lugosi, 2006] Nicolo Cesa-Bianchi and G´abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

[Chen et al., 2013] Wei Chen, Laks V. S. Lakshmanan, and Carlos Castillo. Information and Influence Propagation in Social Networks. Morgan & Claypool Publishers, 2013.

[Chen et al., 2016] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.

[Cohen et al., 2016] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.

[Dontchev and Rockafellar, 2009] Asen L Dontchev and R Tyrrell Rockafellar. Implicit functions and solution mappings. Springer Monogr. Math., 2009.

[Erd˝os and R´enyi, 1960] P Erd˝os and A R´enyi. On the evo- lution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61, 1960.

[Hoeffding, 1963] Wassily Hoeffding. Probability inequali- ties for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.

[Kempe et al., 2003] David Kempe, Jon M. Kleinberg, and ´Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 137–146, 2003.

[Koc´ak et al., 2014] Tom´aˇs Koc´ak, Gergely Neu, Michal Valko, and R´emi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.

[Koc´ak et al., 2016] Tom´aˇs Koc´ak, Gergely Neu, and Michal Valko. Online learning with erd˝os-r´enyi side-observation graphs. In Uncertainty in Artificial Intelligence, 2016.

[Komiyama et al., 2015] Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems, pages 1792–1800, 2015.

[Lattimore and Szepesvari, 2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.

[Lattimore and Szepesvari, 2018] Tor Lattimore and Csaba Szepesvari. Cleaning up the neighborhood: A full classi-fication for adversarial partial monitoring. arXiv preprint arXiv:1805.09247, 2018.

[Liu et al., 2018] Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In AAAI, 2018.

[Mannor and Shamir, 2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of sideobservations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.

[Rustichini, 1999] Aldo Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1-2):224–243, 1999.

[Tossou et al., 2017] Aristide CY Tossou, Christos Dimitrakakis, and Devdatt Dubhashi. Thompson sampling for stochastic bandits with graph feedback. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[Valiant, 1979] Leslie G. Valiant. The complexity of enu- meration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.

[Vaswani et al., 2015] Sharan Vaswani, Laks Lakshmanan, Mark Schmidt, et al. Influence maximization with bandits. arXiv preprint arXiv:1503.00024, 2015.

[Wang and Chen, 2017] Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In NeurIPS, pages 1161–1171, 2017.

[Wang et al., 2012] Chi Wang, Wei Chen, and Yajun Wang. Scalable influence maximization for independent cascade model in large-scale social networks. DMKD, 2012.

[Wen et al., 2017] Zheng Wen, Branislav Kveton, Michal Valko, and Sharan Vaswani. Online influence maximization under independent cascade model with semi-bandit feedback. In NeurIPS, pages 3025–3035, 2017.

[Wu et al., 2015] Yifan Wu, Andr´as Gy¨orgy, and Csaba Szepesv´ari. Online learning with gaussian payoffs and side observations. In NeurIPS, pages 1360–1368, 2015.

A Proofs of the Upper Bounds in One-Step Case

10 logfor some

< M, for some j

N / K, for some j

ǫ, for any i

Bound the regret under B Note

Then it remains to bound . Bound the regret under D Suppose and hold. Then for any

and for . Or equivalently

1) 32 ∆(ˆlog(1) for (ˆ

1) 32 ∆)log(1) for (ˆ

On ,

1) 16 ∆(ˆlog(1) for (ˆ

1) 16 ∆)log(1) for (ˆ

Then

since

Thus it remains to bound . Bound the regret under Similar to [Wu et al., 2015, Proposition 17] where the statement is replaced by ,

Then

Next by [Wu et al., 2015, Lemma 19],

Thus by (16), (14), (17) and (18),

Putting (13), (14), (15), (17), (19), (20) together, the regret satisfies

Next prove the asymptotic behavior of the regret upper bound. Claim:as for any .

Suppose not. There exists such that , or for all , stops increasing when for some . Then the condition on line 6 is not satisfied when for some ¿ 0. By the condition on line 9, also stops increasing and the condition on line 9 for any is not satisfied any more when for some . Also line 14 will not be performed since stops increasing. Therefore the condition on line 3 always holds, which is impossible.

For any , the probability that the condition on line 3 does not hold when is at least . There exists such that when for any j since . Then with probability at least , line 4-5 are not called any more. The events is modified by for some and (13) is

replaced by . All other parts stay the same. Then the regret satisfies

For any , there exists an such that the distance between the optimal solution set of and for any such that for all is at most . Here the distance is Pompeiu-Hausdorff distance of sets. This is because the Lipschitz continuity of the optimal set mapping (see [Dontchev and Rockafellar, 2009, 3C.5]) and the duality of linear programming problems. Since is unique, is upper bounded by . Then divide by log(T ) and let T go to ,

For instance, can be chosen as

The proof of Theorem 4 is similar to Theorem 3 by modifying (13) with

B Proof of Theorem 5

First we prove a useful lemma on the robustness of linear programming problem where the coefficient matrix is of a specific form.

Lemma 6 Denote the linear programming problem of the form

by LP(A, b, c) where and all entries in A, b, c are non-negative. Let the feasible set mapping, the optimal value mapping and the optimal set mapping be

respectively. Note that is always finite with the A, b, c of positive (or even non-negative) entries. Fix a pair (i, j), assume A(i, j) > 0. Let except . Then

for some depending on A, b, c and i, j.

Proof. By [Dontchev and Rockafellar, 2009, Theorem 3C.3], the mapping

is Lipschitz continuous. Recall that the distance on sets is Pompeiu-Hausdorff distance . First assume . Then

If , then . Suppose not and let . Then

for some depending on A, b, c and i, j. Thus

for some depending on A, b, c, where is the unit ball in and the second inequality is due to the Lipshitz continuity of G in (25). Also

where (*) is by by the Lipshitz continuity of . The case of follows similarly.

Proof. [of Theorem 5] The finite-time regret is similar to the previous proof. The main difference is on the bound for line 13. In particular, the results of (18) and (19) are changed to be

since can bound . The proof of other parts follow the proof of Theorem 4 similarly.

By the non-increasing property of whose limit is would be smaller than when for some . Then only has small noise on the nonzero entries of . By Lemma 6, is Lipschitz continuous in for any . Thus

The remaining discussion on is similar.

C Technical Lemmas

Lemma 7 (Hoeffding’s Inequality [Hoeffding, 1963]) Let be independent random variable with common support [0, 1]. Let and . Then for all ,

Lemma 8 (Bernstein’s Inequality) Let be independent zero-mean random variables. Suppose that almost surely for all i. Then for all ,

Lemma 9 Let be independent Bernoulli random variables with mean respectively.

Proof.

where (26) is by Bernstein’s inequality (Lemma 8) and (27) holds when

designed for accessibility and to further open science