b

DiscoverSearch
About
My stuff
Stochastic Online Learning with Probabilistic Graph Feedback
2019·arXiv
Abstract
Abstract

We consider a problem of stochastic online learning with general probabilistic graph feedback. Two cases are covered. (a) The one-step case where for each edge (i, j) with probability  pijin the probabilistic feedback graph. After playing arm i the learner observes a sample reward feedback of arm j with independent probability  pij. (b) The cascade case where after playing arm i the learner observes feedback of all arms j in a probabilistic cascade starting from i – for each (i, j) with probability  pij, if arm i is played or observed, then a reward sample of arm j would be observed with independent probability  pij. Previous works mainly focus on deterministic graphs which corresponds to one-step case with  pij ∈ {0, 1}, an adversarial sequence of graphs with certain topology guarantees or a specific type of random graphs. We analyze the asymptotic lower bounds and design algorithms in both cases. The regret upper bounds of the algorithms match the lower bounds with high probability.

Stochastic online learning is a general framework of sequential decision problem. At each time, the learner selects (or plays) an action from a given finite action set, receives some random reward and observes some random feedback. One simplest, though often unrealistic, feedback model is full-information feedback where the learning agent can observe the random rewards of all actions no matter which action is selected. Another popular feedback model is bandit feedback where only the random reward of the selected action is revealed to the learner [Auer et al., 2002]. Recent studies further generalize them to graph feedback where the feedback model is characterized by a (directed) graph [Mannor and Shamir, 2011]. Each edge (i, j) means the learner will observe the random reward of action j if playing action i. This problem is motivated by advertisements where the response for a vacation advertisement could provide sideinformation for a similar vacation place and social networks where the response from a user to a promotion could infer her neighbors to similar offers.

The problem of online learning with graph feedback has been intensively studied in both adversarial [Mannor and Shamir, 2011; Alon et al., 2015a; Koc´ak et al., 2014; Cohen et al., 2016] and stochastic cases [Caron et al., 2012; Buccapatnam et al., 2014; Tossou et al., 2017; Wu et al., 2015]. However only a few of them do not assume self-loops [Alon et al., 2015a; Wu et al., 2015] where the reward of the selected action might be invisible. This general setting would fit into the partial monitoring framework [Bart´ok et al., 2014; Komiyama et al., 2015] but the literature on the latter mainly focus on finite case where the possible outcomes are finite. We also consider general feedback graphs that do not assume self-loops.

Though some studies assume feedback graphs could vary over time or even invisible to the learner before selecting actions [Koc´ak et al., 2014; Tossou et al., 2017], most works focus on deterministic graphs or an adversarial list of graphs with certain topology guarantees. To the best of our knowledge, only a few of them work on probabilistic graphs with [Koc´ak et al., 2016; Alon et al., 2017] on adversarial case and [Liu et al., 2018] on stochastic case and they only discuss about Erd¨os-R´enyi random graphs [Erd˝os and R´enyi, 1960]. Recall that an Erd¨os-R´enyi graph with parameter p is by random sampling the edge of every pair of nodes with probability p independently.

We consider general probabilistic feedback graphs with both the one-step case and the cascade case. The one-step case is the usual one where the learner observes reward of j if edge (i, j) exists in the random graph and i is selected. The cascade case means the learner observes reward of j if there is a (directed) path from i to j in the random graph and i is selected. The observations of the cascade case, in other words, follow a probabilistic cascading starting from the selected action — for each edge (i, j) with probability  pij, if action i is either played or observed, then with an independent probability  pija random reward sample of action j will be observed. As a motivating example, consider the information propagation in social networks. If selecting a user in a social network causes an information cascade in the social network, one may be able to observe further feedback from the cascade users.

This paper makes three major contributions.

1. We formalize the setting of stochastic online learning with general probabilistic graph feedback and consider

image

2. We derive asymptotic lower bounds for both the one-step and the cascade cases.

3. We design algorithms for both the one-step and the cascade cases and analyze their regrets. Their asymptotic regret upper bounds match the asymptotic lower bounds with high probability.

Related work The studies on online learning with graph feedback start from adversarial online learning with side observations where a decision maker can observe rewards of other actions as well as observe the reward of the selected action [Mannor and Shamir, 2011]. The observation structure can be encoded as a directed graph where there is an edge (i, j) if the reward of action j is observed when i is selected. Their setting assumes that self-loops exist on every node. Alon et al. [2015a] then generalize to arbitrary directed graphs as long as each action is observable by selecting some action. They show the structure of feedback graph controls the inherent difficulty of the learning problem and present a classification over graphs. These works assume the feedback graph is fixed over time and known to the learner. A follow-up [Alon et al., 2015b] extends to time-varying feedback graphs where the graphs are revealed either at the beginning of the round or at the end of the round but assumes good topology properties on the graphs. Koc´ak et al. [2014] also allow the feedback graph to vary over time and can be revealed to the learner at the end of the round, but assume the existence of self-loops. Cohen et al. [2016] assume the graph is not revealed in both adversarial and stochastic cases but also assume the existence of self-loops.

Besides the last one, there are also other works on the stochastic case. Caron et al. [2012] first study the stochastic case with side observations and design UCB-like algorithms with improved regret bound over the standard one without additional feedback. Buccapatnam et al. [2014] derive an asymptotic lower bound and design two algorithms that are near-optimal. Tossou et al. [2017] apply Thompson sampling and allow the feedback graph to be unknown and/or changing. They bound the Bayesian regret in terms of the size of minimum clique covering. All these works assume the feedback graph has self-loops. Wu et al. [2015] consider general feedback graphs but assume different observation variance from different choices of actions. They provide non-asymptotic problem-dependent regret lower bound and also design algorithms that achieve the problem-dependent lower bound and the minimax lower bounds.

There are several works on Erd¨os-R´enyi random feedback graphs where the feedback graph at each time is randomly generated by Erd¨os-R´enyi model. Koc´ak et al. [2016] consider adversarial case with the unknown generating probability of the feedback graphs. Liu et al. [2018] consider stochastic case and design a randomized policy with Bayesian regret guarantee. Both of them assume self-observability. An updated version [Alon et al., 2017] of [Alon et al., 2015b] extends one result to Erd¨os-R´enyi model in the adversarial case.

The setting of graph feedback can be fit into a more general setting of partial monitoring [Rustichini, 1999; Cesa-Bianchi and Lugosi, 2006] where feedback matrix and reward matrix are given for each pair of the chosen action and the environment. Bart´ok et al. [2014] make a significant progress on classifying finite adversarial partial monitoring games which is completed by [Lattimore and Szepesvari, 2018]. Komiyama et al. [2015] derive a problem-dependent regret lower bound and design an algorithm with asymptotically optimal regret upper bound in the stochastic case. Most studies on general partial monitoring framework focus on finite case where the number of actions and possible outcomes are finite. The algorithms for general partial monitoring games are not efficient in our case since the feedback matrix might be infinite or exponentially large.

The cascade observation feedback resembles the independent cascade model in the context of influence maximization studies [Kempe et al., 2003; Chen et al., 2013], but the goal is different: influence maximization aims at finding a set of k seeds that generates the largest expected cascade size, while our goal is to find the best action (arm) utilizing the cascade feedback. Influence maximization has been combined with online learning in several studies [Vaswani et al., 2015; Chen et al., 2016; Wen et al., 2017], but again their goal is to maximize influence cascade size while using online learning to gradually learn edge probabilities.

The feedback structure is characterized by a directed graph G = (V, E, p) over the set of actions V = [K], where E is the set of edges between actions and  p : E → [0, 1]is a mapping describing live probabilities on edges. G is fixed and known to the learner. Assume for each action j, there is an edge  (i, j) ∈ Efor some i.

At each time t, the environment privately draws a reward vector  rt = (rt(i) : i ∈ [K])and a random realization  Gtof G where any edge  (i, j) ∈ Eis live in  Gtwith probability  pijindependently of other edges. Here  rt(i)is an independent random sample from the associated Gaussian reward distribution of action i which has mean  θiand variance 1 for each i ∈ [K]independently of other actions. The assumption of the Gaussian distributions is mainly for the sake of simplicity and can be generalized (see Section 4.4). Then the learner selects an action  itand receives reward  rt(it)which might be invisible to the learner. We consider the following two kinds of feedback models. One-Step Case The learner will receive feedback  (j, rt(j))for any j such that there is an edge  (it, j)in  Gt. Cascade Case The learner will receive feedback  (j, rt(j))for any j such that there is a directed path from  itto j in  Gt.

Note that [Caron et al., 2012; Buccapatnam et al., 2014; Tossou et al., 2017; Alon et al., 2015a; Wu et al., 2015] consider deterministic feedback graphs which correspond to our one-step case with each  pijequal to 1. The first three of them also assume self-loops exist in the graph. The work [Liu et al., 2018] corresponds to our one-step case with E containing the edge between every pair of actions and all  pijhave the same value.

Suppose  θi ∈ [0, 1]for each action i. The learner’s objective is to maximize its expected cumulative reward, or equiv-

alently, to minimize its expected cumulative regret

image

where the expectation is over the randomness of  rtand  Gt. We will omit G in the regret expression if the context is clear.

3.1 Lower Bound for the One-Step Case

For the sake of simplicity on the following index notations, assume there is a unique optimal action for  θ. Let  ik(θ)be the k-th best action index for the parameter  θ. Then  θi1(θ) >θi, ∀i ̸= i1(θ). Denote  ∆i(θ) = θi1(θ) − θiand  ∆(θ) =(∆i(θ) : i ∈ [K]) ∈ [0, 1]K. Let  V in(j) = {i ∈ [K] :(i, j) ∈ E}be the set of incoming neighbors of action j. Define

image

Recall that an algorithm is consistent if  Rθ(T ) = o(T a)for any a > 0 and any parameter  θ. Then the asymptotic lower bound for any consistent algorithm is provided in the following theorem.

Theorem 1 For any consistent algorithm, the regret satisfies

image

Note this lower bound can easily recover the lower bound in [Wang and Chen, 2017, Theorem 3] where they only consider a special probabilistic graph G. Proof. Fix any consistent algorithm and any parameter  θ ∈[0, 1]K.

For any  j ̸= i1(θ), define  θ′ = θexcept  θ′j =θi1(θ) + ǫ. Let P and  P′be the probability over all outcomes in T rounds when expected reward is  θand θ′respectively. By high-dimensional Pinsker’s inequality ([Lattimore and Szepesvari, 2017, Lemma 5])

image

Note that

image

where  r(θj), r(θ′j)are the Gaussian random variables with mean  θj, θ′j(and variance 1) respectively. Then

image

Since the algorithm is consistent,  Rθ(T ) = o(T a)for any a > 0 and any parameter  θ, or equivalently

image

Similar result follows

image

Thus the vector  lim infT →∞E[N(T )]log T ∈ C(θ). Recall the regret is  Rθ(T ) = �Ki=1 E [Ni(T )] ∆i(θ). The result fol- lows.  □

3.2 Lower Bound for the Cascade Case

Let  p′ijbe the probability that there is a directed path from i to j in a random realization of G. Define

image

Theorem 2 For any consistent algorithm, the regret satisfies

image

Note that the computation of  p′ijis #P-hard for general graphs [Valiant, 1979; Wang et al., 2012]. Thus the lower bound is not efficiently computable even when  θis known.

The proof of Theorem 2 is similar to that of Theorem 1 by replacing (3) with the following formula

image

Let  Ni(t)be the number of times the learner selects an action i and  N(t) = (Ni(t) : i ∈ [K])by the end of time t. Let ˆθtbe the sample-mean estimates of  θby the end of time t. Let  nij(t)be the number of times that action i is selected and reward for action j is observed by the end of time t. Then E [nij(t) | Ni(t)] = Ni(t)pij. Let  mj(t) = �i nij(t)be the number of observations for action j by the end of time t.

4.1 One-Step Uniform Case

The uniform case in which all  pij’s have the same value p is first considered in this section. When E contains edges between every pair of actions, this graph reduces to Erd¨osR´enyi random graph with parameter p.

Let  Mj(t) = �i∈V in(j) Ni(t)pbe the expected numberof observations for action j at the end of time t. Then E [mj(t) | Mj(t)] = Mj(t).

image

The pseudocode of the algorithm is provided in Algorithm 1. It starts with the initialization of  N eand the estimates of  θ(line 1). Here  N eis the number of exploration rounds for the learner to know more about unknown  θwhich will be clearer later. At each time t, if for some j the real observation times of action j is less than half the expected observation times (line 3), then the learner selects a parent of j to try to observe reward of j once more (line 4) and keeps  N eunchanged (line 5). Note that  E [mj(t) | Mj(t) = m] = mand  mj(t)will concentrate at m as m goes to infinity. The condition  mj(t) < Mj(t)/2means part of the realizations of graph G is far from the expectation and 2 can be changed to other larger-than-1 constant. This is one of the key differences from deterministic graph feedback where the number of observations is well controlled by just selecting actions. While under the probabilistic graph feedback, there is a gap between the number of real observations and expected number of observations.

When  mj(t) ≥ Mj(t)/2for all j, then the realizations of G are good enough and the learner can rely on the quantities of selections to control the accuracy of the estimates. If the selection vector is good enough for current ˆθunder current accuracy level (line 6), then the learner will exploit the current best action (line 7) and keep  N eunchanged. Here  C(·)is defined as in (1) and represents the set of good selected “fractions” of actions that are able to identify the reward gaps between actions.

If the current selection vector N is not good enough, then the learner will first check if ˆθis close enough to  θ(line 9-11) and if yes, will explore according to current ˆθ. The number  N eof exploration rounds for the learner to know more about  θwill increase in this part (line 11&14). The condition of line 9 has an auxiliary function  β : N → [0, ∞)to guide the exploration such that ˆθwill be close to  θin the long run. This auxiliary function is also crucial in previous work [Wu et al., 2015] to control the regret bound in the asymptotic sense. The auxiliary function  βcan be any non-decreasing function satisfying  0 ≤ β(n) ≤ n/2and the subadditivity β(m + n) ≤ β(m) + β(n). If some component of ˆθhas not been explored enough (line 9), then the learner selects a parent to try to get one more observation (line 10) and increases N e(line 11).

When all components of ˆθare close to  θ, the learner selects an action according to the current ˆθwith minimal cost on the regret instructed by the asymptotic lower bound (2). Here ci(θ′)denotes any optimal solution of the linear programming problem that minimizes  ⟨c, θ′⟩among all  c ∈ C(θ′). Since ˆθis close enough to  θunder current accuracy level, the vector ci(ˆθt−1)is close enough to  ci(θ)(which is part of the proof for the following theorem). There must be at least an i such that  Ni(t − 1) < 16 ci(ˆθt−1) log(t − 1)or else the condition of line 6 holds.

The regret bound for the algorithm is stated as follows. Theorem 3 The regret of Algorithm 1 for one-step uniform case satisfies for any  ǫ > 0,

image

any  ǫ > 0. Then for any  θsuch that  c(θ)is unique,

image

holds with probability at least  1 − δfor any  δ > 0.

Note that any  β(n) = anbwith  a ∈�0, 12�, b ∈ (0, 1)meets the requirements. Due to space limit, the proofs are provided in Appendix A.

4.2 One-Step General Case

In the general case where  pijcan be different,  Mj(t) =�i∈V in(j) Ni(t)pij. The algorithm follows as in Algorithm 1 by only replacing line 4 with (4’) Play  it ∈ argmaxi∈V in(j)pij. Let

image

be the set of exploration nodes that have the largest live probability among all incoming edges to some j. Let

image

be the minimal exploration probability for any  i ∈ V e. With a modified proof to the uniform case, the theoretical guarantee for the general case follows.

Theorem 4 The regret of the modified Algorithm 1’ for one-step general case satisfies for any  ǫ > 0,

image

Assume  β(n) = o(n)and �∞s=0 exp�− β(s)ǫ22K �< ∞for any  ǫ > 0. Then for any  θsuch that  c(θ)is unique,

image

holds with probability at least  1 − δfor any  δ > 0.

4.3 Cascade Case

For the deterministic graphs, there is no essential difference between one-step case and cascade case — the cascade case on a deterministic graph would be equivalent to constructing a new graph where an edge exists if and only if there is a path on the original graph. For a probabilistic graph, one might try a similar solution for the cascade case by constructing a new graph  G′where the probability of an edge (i, j) is just the probability  p′ijof i connecting to j in a random re- alization of the original graph. However the computation of p′ijis #P-hard for general graphs, and thus the accurate graph G′is unattainable, though it can be approximated within any accuracy by Monte Carlo simulations. Therefore, during the running of the algorithm, a reasonable approximation of  G′is needed.

image

Define  V e′and  pei′similarly with (6) and (7) by replacing pijwith  p′ij. Since the computation of  p′ijis #P-hard, we define an estimated version of  V e′and  pei′respectively:

image

for any  i ∈ ˆV e. Then  ˆpei ≥ pei′′/2for some  i′. To overcome the stated challenge, we need an auxiliary functions  η : N+ → [0, 1)to set up the tolerance of the approximation. At each time t, the path from i to j with probability  p′ij ≤ η(t)can be treated as nonexistent (with prob- ability 0) and the estimation of  p′ijhas noise at most  η(t)/2if the real value  p′ij > η(t). Any non-increasing function with limit 0 can be chosen as  η. The choice of  ηis to control the complexity of the graph with only focusing the path of a reasonable length.

image

lem

image

where  P ∈ [0, 1]K×Ksatisfies  Pij = 0if  p′ij ≤ ηand ��Pij − p′ij�� ≤ η/2if  p′ij > ηand  bi(θ′) = 1∆2i (θ′)for

image

With the approximation  Gtand the estimated value for reward vector ˆθt−1, the linear programming problem considered in time t is  LPt = LP(ˆθt−1, η(t))and the corresponding P in (10) is denoted as  Pt. Then the algorithm runs with LPtaccordingly. The complete pseudocode is presented in Algorithm 2. In particular, the examination on the realization is performed on approximated graph  Gtwith probability matrix  Pt(line 3). The exploitation condition is on the  LPt(line

6). Here  Sfeas(LPt)is the feasible solution set of the linear programming problem  LPtwhich is the set of all  c ∈ RKsatisfying  P ⊤t c ≥ b, c ≥ 0. The exploration when all compo- nents of estimated ˆθare accurate enough with minimal cost instructed by linear programming solutions is also related to LPt(line 13). Here  Sopt(LPt)is the optimal solution set of LPt. Also  M ′j(t) = �i Ni(t)(Pt)ijis changed accordingly. The regret of the Algorithm 2 is upper bounded in the fol-

lowing theorem.

Theorem 5 The regret of the Algorithm 2 for cascade case

satisfies for any  ǫ > 0,

image

where

image

Assume  β(n) = o(n)and �∞s=0 exp�− β(s)ǫ22K �< ∞for

any  ǫ > 0. Then for any  θsuch that  c(θ)is unique,

image

holds with probability at least  1 − δfor any  δ > 0.

The result depends on the robustness of the linear programming problems. The P matrix in the LP problem (10) is noisy, which is much different from one-step case and the case of deterministic graphs where the noise is only on  θ′. See discussions in the next section. The full proof is put in Appendix B.

4.4 Discussions

The assumption on the Gaussian reward distributions can be generalized to any type of sub-Gaussian distributions with a 1-dimensional mean parameter such that the KL-divergence between distributions of different parameters is finite and is continuous in the 1-dimensional mean parameter. Then the term  ∆2i (θ)/2in the definition of  C(θ)(see Eq.(1)) is replaced by  KL(r(θi), r(θi1(θ)))accordingly where  r(·)is the corresponding reward distribution with mean parameter. The Gaussian distribution, Bernoulli distribution and common continuous random distribution on a common bounded interval like Beta distribution all satisfy the requirements.

The term  O�log(T ) �Ki=1∆i(θ)p �in the regret bound for one-step uniform case (same for other two cases) is due to the gap between the realizations and the expectations of the probabilistic graphs. The term 1/p also appears in the regret  O(�T/p)of [Koc´ak et al., 2016] on Erd¨os-R´enyi random graphs in adversarial setting. It is not clear whether this 1/p term represents hindsight difficulty between the probabilistic graphs and deterministic graphs. This would be an interesting future direction.

The terms  {pei : i ∈ [K]}in the one-step general case de- scribes the minimal exploration probabilities to observe every action. For each  i ∈ [K], pei = maxi′ pi′jfor some j, that is pijis the largest live probabilities among all incoming edges for some j. These terms represent the problem complexities for the underlying probabilistic graph. When all  pijare equal to  p, pei = p.

The term  pei′in the cascade case is usually larger than  peisince it takes the same operations on the connection probabilities of incoming paths which are larger than live probabilities of incoming edges. The term  ˆpeiis an estimation satisfying ˆpei ≥ pei′ ′/2for some  i′.

Next we discuss the difference in proof of the cascade case. If the noise of the linear programming problems is on the b vector in (10), then by the standard results in statistics [Dontchev and Rockafellar, 2009,  §3C.5], the resulting optimal solution sets are Lipschitz continuous. The property of Lipschitz continuity is essential since actions are selected according to the optimal solution of a noisy LP problem (line 13) and we need to guarantee this kind of selections is safe. The noise on  ∆vector in (10) is also easy to deal with by considering the dual problem. However, it is much different if the noise is on the P matrix. For example, consider the LP problem that minimizes x over all  ax ≥ 1and  x ≥ 0with parameter a > 0. The optimal solution  x∗ = 1/ais not Lipschitz continuous with respect to a. So the standard statistical tools could not apply here. We derive a novel property of the Lipschitz continuity when there is noise on P for our specific P matrix.

We are the first to formalize the setting of stochastic online learning with probabilistic feedback graph. We derive asymptotic lower bounds for both one-step and cascade cases. The regret bounds of our designed algorithms match the lower bounds with high probability.

This framework is new and we only provide asymptotic lower bounds and finite-time problem-dependent upper bounds. Finite-time lower bounds and minimax upper/lower bounds are all interesting future directions. Bayesian regret is also an interesting topic.

[Alon et al., 2015a] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35, 2015.

[Alon et al., 2015b] Noga Alon, Nicol`o Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. arXiv preprint arXiv:1502.07617, 2015.

[Alon et al., 2017] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graphstructured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

[Bart´ok et al., 2014] G´abor Bart´ok, Dean P Foster, D´avid P´al, Alexander Rakhlin, and Csaba Szepesv´ari. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967– 997, 2014.

[Buccapatnam et al., 2014] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. ACM SIGMETRICS Performance Evaluation Review, 42(1):289–300, 2014.

[Caron et al., 2012] St´ephane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. In UAI, pages 142–151. AUAI Press, 2012.

[Cesa-Bianchi and Lugosi, 2006] Nicolo Cesa-Bianchi and G´abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

[Chen et al., 2013] Wei Chen, Laks V. S. Lakshmanan, and Carlos Castillo. Information and Influence Propagation in Social Networks. Morgan & Claypool Publishers, 2013.

[Chen et al., 2016] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.

[Cohen et al., 2016] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.

[Dontchev and Rockafellar, 2009] Asen L Dontchev and R Tyrrell Rockafellar. Implicit functions and solution mappings. Springer Monogr. Math., 2009.

[Erd˝os and R´enyi, 1960] P Erd˝os and A R´enyi. On the evo- lution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61, 1960.

[Hoeffding, 1963] Wassily Hoeffding. Probability inequali- ties for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.

[Kempe et al., 2003] David Kempe, Jon M. Kleinberg, and ´Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 137–146, 2003.

[Koc´ak et al., 2014] Tom´aˇs Koc´ak, Gergely Neu, Michal Valko, and R´emi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.

[Koc´ak et al., 2016] Tom´aˇs Koc´ak, Gergely Neu, and Michal Valko. Online learning with erd˝os-r´enyi side-observation graphs. In Uncertainty in Artificial Intelligence, 2016.

[Komiyama et al., 2015] Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems, pages 1792–1800, 2015.

[Lattimore and Szepesvari, 2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.

[Lattimore and Szepesvari, 2018] Tor Lattimore and Csaba Szepesvari. Cleaning up the neighborhood: A full classi-fication for adversarial partial monitoring. arXiv preprint arXiv:1805.09247, 2018.

[Liu et al., 2018] Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In AAAI, 2018.

[Mannor and Shamir, 2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of sideobservations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.

[Rustichini, 1999] Aldo Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1-2):224–243, 1999.

[Tossou et al., 2017] Aristide CY Tossou, Christos Dimitrakakis, and Devdatt Dubhashi. Thompson sampling for stochastic bandits with graph feedback. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[Valiant, 1979] Leslie G. Valiant. The complexity of enu- meration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.

[Vaswani et al., 2015] Sharan Vaswani, Laks Lakshmanan, Mark Schmidt, et al. Influence maximization with bandits. arXiv preprint arXiv:1503.00024, 2015.

[Wang and Chen, 2017] Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In NeurIPS, pages 1161–1171, 2017.

[Wang et al., 2012] Chi Wang, Wei Chen, and Yajun Wang. Scalable influence maximization for independent cascade model in large-scale social networks. DMKD, 2012.

[Wen et al., 2017] Zheng Wen, Branislav Kveton, Michal Valko, and Sharan Vaswani. Online influence maximization under independent cascade model with semi-bandit feedback. In NeurIPS, pages 3025–3035, 2017.

[Wu et al., 2015] Yifan Wu, Andr´as Gy¨orgy, and Csaba Szepesv´ari. Online learning with gaussian payoffs and side observations. In NeurIPS, pages 1360–1368, 2015.

At =�Mj(t) <10 log�Kt2�,for some  j ∈ [K]�

Bt = {mj(t)< Mj(t)/2, for some j  ∈ [K]}

image

Et = {Mj(t) < 2 β(N e(t))/ K, for some j  ∈ [K]}

Ft =����ˆθt,i − θi��� ≤ǫ, for any i  ∈ [K]�

Bound the regret under B Note

Then it remains to bound �Tt=1 E�∆it(θ)1�Bct−1, Cct−1��. Bound the regret under D Suppose  Dt−1and  Bct−1, Cct−1hold. Then  �i∈V in(j) Ni(t − 1)p ≥ 32∆j(ˆθt−1)2 log(t − 1)for any

j ̸= i1(ˆθt−1)and �i∈V in(j) Ni(t − 1)p ≥ 32∆i2(ˆθt−1)(ˆθt−1)2 log(t − 1)for  j = i1(ˆθt−1). Or equivalently

Mj(t −1)  ≥32 ∆j(ˆθt−1)2log(t −1) for  j ̸= i1(ˆθt−1) ,

Mj(t −1)  ≥32 ∆2i2(ˆθt−1)(ˆθt−1)log(t −1) for  j = i1(ˆθt−1) .

On  Bct−1,

mj(t −1)  ≥16 ∆j(ˆθt−1)2log(t −1) for  j ̸= i1(ˆθt−1) ,

mj(t −1)  ≥16 ∆2i2(ˆθt−1)(ˆθt−1)log(t −1) for  j = i1(ˆθt−1) .

Then

image

since

image

Thus it remains to bound �Tt=1 E�∆it(θ)1�Bct−1, Cct−1, Dct−1��. Bound the regret under  Bc, Cc, DcSimilar to [Wu et al., 2015, Proposition 17] where the statement �i∈V in(j) Nij ≥β(s)/Kis replaced by  Mj(t) ≥ β(s)/K,

image

Then

image

Next by [Wu et al., 2015, Lemma 19],

image

Thus by (16), (14), (17) and (18),

Putting (13), (14), (15), (17), (19), (20) together, the regret satisfies

image

Next prove the asymptotic behavior of the regret upper bound. Claim:Mj(t) → ∞as  t → ∞for any  j ∈ [K].

Suppose not. There exists  j ∈ [K]such that  Mj(t), or  Ni(t)for all  i ∈ V in(j), stops increasing when  t ≥ T1for some T1 > 0. Then the condition on line 6 is not satisfied when  t ≥ T2 ≥ T1for some  T2¿ 0. By the condition on line 9,  N e(t)also stops increasing and the condition on line 9 for any  j′is not satisfied any more when  t ≥ T3 ≥ T2for some  T3 > 0. Also line 14 will not be performed since  N e(t)stops increasing. Therefore the condition on line 3 always holds, which is impossible.

For any  δ ∈ (0, 1), the probability that the condition on line 3 does not hold when  Mj(t) > 10 log Kδis at least  1 − δ/K. There exists  T4 > 0such that when  t ≥ T4, Mj(t) > 10 log Kδfor any j since  Mj(t) → ∞. Then with probability at least 1 − δ, line 4-5 are not called any more. The events  Atis modified by  A′t =�Mj(t) < 10 log Kδfor some  j ∈ [K]�and (13) is

replaced by �Tt=1 E�∆it(θ)1�Nit(t − 1) < 10p log Kδ��≤ �Ki=110∆i(θ)p log(K/δ). All other parts stay the same. Then the regret satisfies

image

For any  η > 0, there exists an  ǫ = ǫ(θ) > 0such that the distance between the optimal solution set of  c(θ)and  c(θ′)for any  θ′such that  |θ′i − θi| ≤ ǫfor all  i ∈ [K]is at most  η. Here the distance is Pompeiu-Hausdorff distance of sets. This is because the Lipschitz continuity of the optimal set mapping (see [Dontchev and Rockafellar, 2009,  §3C.5]) and the duality of linear programming problems. Since  c(θ)is unique,  ci(θ, ǫ)is upper bounded by  ci(θ) + η. Then divide  Rθ(T )by log(T ) and let T go to  ∞,

image

For instance,  ηcan be chosen as  ⟨c(θ), ∆(θ)⟩/ �Ki=1 ∆i(θ).□

The proof of Theorem 4 is similar to Theorem 3 by modifying (13) with

image

First we prove a useful lemma on the robustness of linear programming problem where the coefficient matrix is of a specific form.

Lemma 6 Denote the linear programming problem of the form

by LP(A, b, c) where  A ∈ Rn×n, b ∈ Rn, c ∈ Rnand all entries in A, b, c are non-negative. Let the feasible set mapping, the optimal value mapping and the optimal set mapping be

image

respectively. Note that  Sval(A, b, c)is always finite with the A, b, c of positive (or even non-negative) entries. Fix a pair (i, j), assume A(i, j) > 0. Let  A′ = Aexcept  A′(i, j) = A(i, j) + ǫ. Then

image

for some  αdepending on A, b, c and i, j.

Proof. By [Dontchev and Rockafellar, 2009, Theorem 3C.3], the mapping

image

is Lipschitz continuous. Recall that the distance on sets is Pompeiu-Hausdorff distance  dH. First assume  ǫ > 0. Then

image

If  Sopt(A′) ⊂ Sfeas(A), then  Sval(A) = Sval(A′). Suppose not and let  x′ ∈ Sopt(A′) \ Sfeas(A). Then

image

for some  α1depending on A, b, c and i, j. Thus

for some  α2, α3depending on A, b, c, where  B1is the unit ball in  Rnand the second inequality is due to the Lipshitz continuity of G in (25). Also

image

where (*) is by by the Lipshitz continuity of  G′ : t �→ {x ∈ Sfeas(A′, b) | ⟨c, x⟩ ≤ t}. The case of  ǫ < 0follows similarly.  □

Proof. [of Theorem 5] The finite-time regret is similar to the previous proof. The main difference is on the bound for line 13. In particular, the results of (18) and (19) are changed to be

image

since  ci(θ, ǫ, η(t))can bound  Sopt(LPt). The proof of other parts follow the proof of Theorem 4 similarly.

By the non-increasing property of  η(t)whose limit is  0, η(t)would be smaller than  minij:p′ij>0 p′ijwhen  t ≥ T1for some T1 > 0. Then  Ptonly has small noise on the nonzero entries of  P ′ = (p′ij)ij. By Lemma 6,  Sopt(θ′, η(t))is Lipschitz continuous in  η(t)for any  θ′. Thus

image

The remaining discussion on  ǫis similar.  □

Lemma 7 (Hoeffding’s Inequality [Hoeffding, 1963]) Let  X1, . . . , Xnbe independent random variable with common support [0, 1]. Let ¯X = 1n�ni=1 Xiand  E� ¯X�= µ. Then for all  a ≥ 0, P� ¯X − µ ≥ a�≤ exp(−2na2), P� ¯X − µ ≤ −a�≤ exp(−2na2).

Lemma 8 (Bernstein’s Inequality) Let  X1, . . . , Xnbe independent zero-mean random variables. Suppose that  |Xi| ≤ Malmost surely for all i. Then for all  a ≥ 0,

image

Lemma 9 Let  x1, x2, . . . , xtbe independent Bernoulli random variables with mean  p1, p2, . . . , pt ∈ (0, 1)respectively.

image

Proof.

where (26) is by Bernstein’s inequality (Lemma 8) and (27) holds when

image


Designed for Accessibility and to further Open Science