The stochastic multi-armed bandit problem is a well-studied problem of machine learning. Consider an agent that has to choose among several actions in each round of a game. To each action i is associated a real-valued parameter µi. Whenever the player performs the ith action, she receives a random reward with mean µi. If the player knew the means associated to the actions before starting the game, she would play an action with the highest mean during all rounds. The problem is to design a strategy for the player to maximize her reward in the setting where she does not know the means. The regret of the strategy is the difference between the accumulated rewards in the two scenarios.
This problem encapsulates the well-known exploration/exploitation tradeoff: the player never learns the means exactly, but she can estimate them. As the game proceeds, she learns that some actions probably have better means, so she can exploit these actions to obtain a better reward, but at the same time she has to explore other actions as well, since they might have higher means. Traditionally, actions are called “arms” and “pulling an arm” refers to performing an action. See Slivkins [23], Lattimore and Szepesv´ari [17] for recent monographs on stochastic multi-armed bandits.
We study a multiplayer version of this game, in which each player pulls an arm in each round, and if two or more players pull the same arm, a collision occurs and all players pulling that arm receive zero reward. The players’ goal is to maximize the collective received reward.
One application for this model is opportunistic spectrum access with multiple users in a cognitive radio network: we have a radio network with several channels (corresponding to the arms) that have been purchased by primary users. There are also secondary users (the players) that can try to use these channels during the rounds when the primary users are not transmitting. Successfully using a channel to transmit a message means a unit reward, and not transmitting means zero reward. If more than one secondary users try to use the same channel in the same round, a collision occurs and none of them can transmit. If a unique secondary user tries to use a channel, she will succeed if the primary user owning that channel happens to be idle in that round, which happens with a certain probability. Thus, the reward of the secondary user is a Bernoulli random variable whose mean depends on the activity of the corresponding primary user and whether other secondary users have tried to use the same channel. See Liu and Zhao [18, Section I.D] for other applications.
One may consider (at least) two possible feedback models. In the first model, whenever a player pulls an arm, she observes whether a collision has occurred on that arm and receives a reward. In the second model, the player just receives a reward without observing whether a collision has occurred. Of course, if the reward is positive, she can infer that no collision has occurred. But if the reward is zero, she cannot infer if a collision has occurred.
Our main contributions are as follows.
1. We offer the first theoretical guarantees for the second model, where the players do not observe collision information. We propose an algorithm with a logarithmic regret (in terms of the number of rounds), and we also give an algorithm with a sublinear regret that does not depend on the gaps between the means.
2. For the first model, in which the players observe collision information, we prove the first sublinear regret bound that does not depend on the gaps between the means.
3. One may also view this setup as a stochastic anti-coordination game. Using the algorithmic ideas introduced here, we give an algorithm for reaching an approximate Nash equilibrium quickly in such games.
1.1 Models and results.
Let K > 1 be a positive integer and let µ1,...,µK be nonnegative numbers corresponding to the arm means. Let Yi,t be the reward of arm i in round t, so the {Yi,t}∞t=1 are independent and identically distributed (i.i.d.) and EYi,t = µi. We may assume, by relabeling the arms if necessary, that µ1 ≥ ··· ≥ µK. The players are of course unaware of this labeling.
For a positive integer n, we denote [n] ≔ {1,...,n}. A set of m > 1 players play the following game for T > 0 rounds: in each round t = 1,...,T , player j chooses an arm Aj(t) ∈ [K]. Let Ci(t) ∈ {0,1} be the collision indicator for arm i in round t, that is, Ci(t) = 1 if and only if there exist distinct j,j′ with Aj(t) = Aj′(t) = i. In round t, player j receives reward
We will also consider a stronger feedback model, in which each player j also observes CAj(t)(t) in each round t; this is called “the model with collision information.”
The regret of a strategy is defined as
Note that Regret is a random variable (since the strategy can randomize hence Aj(t) can be random) and we will bound its expected value. Bounds that hold with high probability can also be derived from our proofs.
To simplify the statements and proofs of our main theorems, we make three additional assumptions, which can be relaxed at the expense of getting worse bounds, as discussed in Section 5.
Assumption 2. YAj(t),t is supported on [0,1] so the means µi and the rewards rj(t) are also in [0,1].
Note that we assume no communication between the players, and our algorithms are totally distributed. Moreover, in each particular setting, all players play the same algorithm. All of our algorithms are explicit, simple, and efficient.
We can now state our main theorems. Let ∆ ≔ µm − µm+1. All the following results correspond to the weak feedback model (i.e., no collision information), unless stated otherwise. Certainly, any regret upper bound for this model automatically carries over to the stronger feedback model as well.
Theorem 1. There is an algorithm with expected regret O(mK log(T)/∆2).
In this theorem and throughout, the notation f = O(g) means there exists an absolute constant C such that for all admissible parameters, f ≤ Cg.
A shortcoming of Theorem 1 is that it gives a vacuous bound if ∆ = 0. Moreover, one may wonder if, as in the single player case, a regret of the form√T is possible that is independent of the specific instance. The following theorem shows this is possible, under some weak assumptions. Let ∆′ ≔ min{µm−µi : µi < µm}. Observe that ∆′ ≥ ∆, and that ∆′ is positive and well-defined unless µm = µm+1 = ··· = µK (in this case we define ∆′ = 0).
Theorem 2. (a) Suppose all players know a lower bound µ for µm. Then there is an algorithm with expected regret O(K2mlog2(T)/µ + Kmmin{�T logT ,log(T)/∆′}).
(b) For the stronger feedback model, in which the players observe the collision information, there is an algorithm with expected regret
(c) Suppose each player has the option of leaving the game at any point; that is, she can choose not to pull from some round onward (if a player leaves the game, we assume that she collects reward 0 for the rest of the game). Then, there exists an algorithm with expected regret O(Km√T logT ) .
We do not know whether our regret upper bounds are tight; the only lower bound for this problem is an asymptotic lower bound of Ω((K − m)log(T)/∆′) as T → ∞, provided ∆′ > 0, proved in Anantharam et al. [3, Theorem 3.1] for both feedback models (see (3) below for the exact form). There are gaps between our upper bounds and this bound and closing them is left for future work. Further asymptotic lower bounds were claimed in Besson and Kaufmann [6, Section 3], but the authors found a mistake later, see Besson and Kaufmann [8].
Another interesting avenue for future research is the setting in which the rewards are not i.i.d. but are chosen by an adversary. This problem has been studied recently by Alatur, Levy, and Krause [1] and independently by Bubeck, Li, Peres, and Sellke [14].
A third possible research direction is to study this problem from a (competitive) game-theoretic point of view: each player wants to maximize her own reward and the players are not required to run the same algorithm. Can we redefine the notion of reward so the players are better off running the same algorithm? What happens if most players are running the same, standard algorithm but there are some outliers who are selfish and deviate from the standard algorithm? See Boursier and Perchet [12] for recent results in this direction.
The three algorithms proving Theorem 2 are quite similar. All of our algorithms have the property that, eventually, each player fixates on one arm. This can be viewed as reaching an equilibrium in a game-theoretic framework, where the actions correspond to the arms and the utility of each action is the mean of the arm if no two players choose that action and zero otherwise. Games with the property that “if two or more players choose the same action then their reward is zero” are called anti-coordination games. Using our techniques for multiplayer bandits, we also provide an algorithm for converging to an approximate Nash equilibrium quickly in such a game.
More precisely, we define a stochastic anti-coordination game as follows: for each player i ∈ [m] and action j ∈ [K], there is a parameter µij ∈ [0,1] such that, if player i performs action j while no other player performs it, she will get a random reward in [0,1] with mean µij, while if two or more players perform the same ac- tion, all get reward 0. An assignment of players to actions is called an ε-Nash equilibrium if no player can improve her expected reward by more than ε by switching to another action while other players’ actions are unchanged. Then, we would like to design an algorithm that reaches an ε-Nash equilibrium quickly. We prove the following theorem in this direction.
Theorem 3. There is a distributed algorithm that, with probability at least 1 − δ, converges to an ε-Nash equilibrium in any stochastic anti-coordination game within O(log(K/δ)(K/ε2 + K2/ε)) many rounds.
Note that this theorem is proved in the setting in which the players do not observe collisions; in particular, they do not observe the actions of other players. However, we are still making the Assumptions 1–3 (note there is no parameter T in this case). Moreover, we assume each player also has the option of choosing a dummy action with zero reward. This is a realistic assumption in most applications.
In the next section, we review some related work. Theorems 1 and 2 are proved in Sections 3 and 4, respectively. In Section 5 we discuss how to relax Assumptions 1–3 above. Finally, the proof of Theorem 3 appears in Section 6.
2.1 Model with collision information.
Multiplayer multi-armed bandits were introduced by Anantharam, Varaiya, and Walrand [3] and further studied by Komiyama, Honda, and Nakagawa [16]. They studied a centralized setting where there is a single center that observes the rewards of all players and controls the players. The distributed setting was introduced by Liu and Zhao [18], who gave an algorithm with expected regret bounded by κlogT, with κ depending on the game parameters, m, K, and the arm means. They also showed that any algorithm must have regret Ω(logT). The dependence of κ on the parameters was further improved by Anandkumar, Michael, Tang, and Swami [2], Rosenski, Shamir, and Szlak [21], Besson and Kaufmann [6].
Rosenski et al. [21] introduced a “musical chairs” subroutine to reduce the number of collisions; we have further developed and used this subroutine in our algorithms. Their final algorithm requires the knowledge of ∆ and its expected regret is bounded by O(m2 + mK2 ln(T ) + mK log(T)/∆2), which is at least as large as the bound of Theorem 1.
Let log(·) denote the natural logarithm, and define kl(x,y) ≔ xlog(x/y)+(1− x)log((1 − x)/(1 − y)). Besson and Kaufmann [6] developed an algorithm whose regret is bounded by
This bound is not comparable with the bound of Theorem 1 in general; however if µ1 = ··· = µm = 1/2 and µm+1 = ··· = µK = 1/2 − ∆, then their bound becomes O(m3K2 log(T)/∆2), which is worse than our bound by a multiplicative factor of m2K.
Since the first version of this paper appeared on arXiv in August 2018, the multiplayer bandits problem has attracted lots of attention and new results have been proved, which improve our bounds in some regimes. One of the main new ideas in some of these algorithms is to use collisions as a means of communication between players.
Boursier and Perchet [11, Theorem 1] presented the algorithm SIC-MMAB with expected regret
An asymptotic regret lower bound (as T → ∞) of
was proved in Anantharam et al. [3, Theorem 3.1]. Assuming all arm means are distinct, Wang, Proutiere, Ariu, Jedra, and Russo [25, Theorem 1] presented the algorithm DPE1 achieving this lower bound asymptotically as T approaches infin-ity.
2.2 Model without collision information.
The model was introduced by Bonnefoi et al. [9] and further studied by Besson and Kaufmann [6]. These papers introduced an algorithm and studied it empirically but gave no theoretical guarantee.
Assuming a positive lower bound µmin is known for all the arm means, Boursier and Perchet [11, Theorem 3] presented the algorithm SIC-MMAB2 whose expected regret is
Shi, Xiong, Shen, and Yang [22, Theorem 2] presented the algorithm EC-SIC with expected regret bound
where E(·) is a certain information-theoretic function called Gallager’s error exponent function for the Z-channel.
Assuming the players have access to shared randomness, Bubeck, Budzinski, and Sellke [13, Theorem 1.1] gave an algorithm with regret O(mK11/2�T logT ) with the additional property that, with probability 1 − 1/T , no collision occurs between players.
2.3 Other models.
Bande and Veeravalli [5] studied a version of the problem in which if more than one players pull an arm, the reward is shared among them.
Avner and Mannor [4], Rosenski et al. [21], Hanawal and Darak [15], Boursier and Perchet [11] studied a dynamic version of the problem, in which the players can leave the game and new players can arrive, and proved sublinear regret bounds.
In the “heterogeneous” variant of the problem, the arms’ reward distributions can differ across players; for results on this version, see, e.g., Boursier et al. [10] and the references therein.
Finally, Liu, Ruan, Mania, and Jordan [19] studied a heterogeneous and competitive variant, where the goal is to reach a stable matching as soon as possible.
In this section, we consider only the feedback model in which the collisions are not observed and give an algorithm with regret O(mK log(T)/∆2). The algorithm outline is simple: first, each player builds estimates for the arm means by random exploration until she detects the best m arms with high probability. Second, once the m players have detected the m best arms, they distribute these among themselves.
We now explain the details. Each of the players execute the same algorithm, which has four phases, described next. Note that the phases are not synchronized; that is, each phase may have different starting and stopping times for each player. Let g ≔ 128K log(3Km2T 2).
Phase 1: The player pulls arms uniformly at random and maintains an estimate for the mean of each arm—the estimate for arm i is the average reward received from arm i divided by (1 − 1/K)m−1. Note that, provided other players are also pulling arms uniformly at random, (1 − 1/K)m−1 is precisely the probability of not getting a conflict for a random pull, hence the player indeed has an unbiased estimate for µi. In other words, for any round t that
arm i is pulled and reward r(t) is received, since collisions and rewards are independent, we have (recall (1))
For each round t, the player maintains a sorted list �µi1,t ≥ ··· ≥ �µiK,t of estimated means. Let τ be the first round when �µim,τ − �µim+1,τ ≥ 3�g/τ. The first phase finishes at the end of round τ. We will prove that by this time, the player has learned the best m arms with high probability, and so she has a list G ⊆ [K] of m arms with the highest means.
Phase 3: The player runs a so-called musical chairs algorithm until it occupies an arm. In each round, she pulls a uniformly random arm i ∈ G; if she gets a positive reward (which means no other player has pulled arm i), we say the player has “occupied” arm i, and this phase is finished for the player. Note that, by construction, at most one player will occupy any given arm.
The pseudocode is shown in Algorithm 1. We next analyze the regret of this algorithm, starting with some preliminary lemmas.
We will use the following versions of Chernoff-Hoeffding concentration inequalities.
Proposition 4 ([20, Theorem 2.3]). Let the random variables X1,...,Xn be independent, with 0 ≤ Xk ≤ b for each k and some fixed b. Let �µ = �Xk/n and µ = E�µ. Then we have,
(a) for any t ≥ 0, P�|�µ − µ| > t� < 2exp(−2nt2/b2),
(b) and if b = 1, then for any ε > 0,
We start the analysis with Lemma 5, proving that the mean estimates are close enough to the true means with high probability. Then, in Lemma 6 we prove that with high probability the two first phases will not take too long and once they are finished, all players have learned the best m arms. Finally, Lemma 7 analyzes the MusicalChairs1 subroutine and shows that with high probability it does not take too long for each player to occupy a distinct good arm.
1 while true do
2 pull an arm i ∈ A uniformly at random 3 if positive reward is received then output i // arm i is occupied
4 end
Lemma 5. Consider any fixed player and let �µi,t denote her estimated mean for arm i after t rounds of Phase 1. Then we have
Proof. Fix an arm i ∈ [K]. Observe that 0 ≤ �µi,t ≤ 1/(1 − 1/K)m−1 < 1/(1 − 1/K)K ≤ 4, so we have |�µi,t − µi| ≤ 4 deterministically, so for t ≤ g/16 we have |�µi,t − µi| ≤�g/t.
Next, fix some t > g/16. Let Ti(t) denote the number of times this player has pulled arm i by round t, which is a binomial random variable with mean t/K, hence Proposition 4(b) implies P{Ti(t) < t/(2K)} < exp(−t/8K). Thus, the union bound gives
P�|�µi,t − µi| >�g/t�< exp(−t/8K) + maxt2K ≤s≤tP�|�µi,t − µi| >�g/t���� Ti(t) = s�.
Also, conditioned on any s ∈ [t], |�µi,t − µi| is the difference between an empirical average of s i.i.d. random variables bounded in [0,4] and their expected value, thus Proposition 4(a) gives
giving
P�|�µi,t − µi| >�g/t�< exp(−t/8K)+2exp(−g/16K) < exp(−g/128K)+2exp(−g/16K),
since t > g/16. A union bound over t ∈ [T ] and the K arms concludes the proof.
In the next lemma, we prove that with high probability the first two phases will not take too long, and once they are finished, all players have learned the best m arms.
Lemma 6. With probability at least 1 − 1/mT, the following are true.
for all players (where we recall that τ is the round at which phase 1 finishes).
(iii) The first two phases are finished for all players after at most 625g/∆2 many rounds.
Proof. By the choice of g = 128K log(3Km2T 2), Lemma 5 and a union bound over the m players, with probability at least 1 − 1/mT, all players’ mean estimates are off by an additive error of at most�g/t, uniformly for all arms and all t ∈ [T ]. We next explain how the three parts of the lemma follow from this.
Part (i) follows by noting that a player would stop Phase 1 when she has found a gap of size 3�g/τ between the mth the and the (m+1)th arm. By this time, she has learned the means of all arms within an additive error of�g/τ, therefore by the triangle inequality, she has correctly determined that the mth mean is larger than the (m + 1)th mean, whence she has learned the best m arms.
For part (ii), using the triangle inequality and the definition of τ, we have
whence τ ≥ g/∆2. On the other hand, by time t = 25g/∆2,
whence τ ≤ t.
Part (iii) follows from part (ii) by noting that the duration of Phase 2 is 24τ rounds.
Curious readers may wonder about the role of Phase 2 and ask, “Why cannot a player proceed to Phase 3 right after she has learned the best m arms?” The answer is that Phase 2 is designed to help other players to find the best m arms. Indeed, it is possible that a player finishes Phase 1 by round g/∆2, but the algorithm asks her to continue pulling arms at random so other players continue to have unbiased estimators for the means for at least 24g/∆2 more rounds, at which point we are guaranteed that all players have finished their Phase 1. Otherwise, if a player switches to Phase 3 too quickly, then this would skew the collision probabilities and other players will not have unbiased mean estimates.
We now proceed to analyzing Phase 3, the musical chairs subroutine. By this point, all players have learned the best m arms, hence they just want to share these m arms among themselves as quickly as possible. The next lemma shows that this will not take too long. Note that by definition of the subroutine, once this phase is finished, each player has occupied a distinct arm.
Lemma 7. With probability at least 1 − 1/mT, Phase 3 takes at most 4mlog(m2T)/∆ many rounds for all players.
Proof. Since each reward Yi,t takes value in [0,1], we have P�Yi,t > 0� ≥ EYi,t. Fix any player in her Phase 3 who has not occupied an arm, and suppose there are still a unoccupied arms available for her. (There are m players, and each occupies at most one arm, hence a ≥ 1.) Whenever she tries to occupy an unoccupied arm, her success probability is at least
Here, am ≥ 1/m is the probability that she pulls an unoccupied arm, ∆ ≤ µm is a lower bound on the probability that that arm produces a positive reward, and (1 − 1/m)m−a ≥ 1/4 is the probability that no other player pulls that arm. Hence, the probability that the player has not occupied an arm after t attempts can be bounded by (1 − ∆/4m)t ≤ exp(−t∆/4m). Letting t = 4mlog(m2T )/∆ makes this probability ≤ 1/T m2. Applying the union bound over all players completes the proof.
Proof of Theorem 1. By Lemma 6 and Lemma 7, with probability at least 1 − 2/mT , the first three phases finish for all players after at most 625g/∆2+4mlog(m2T)/∆ = O(K log(KT )/∆2) many rounds. After this time, each player has occupied one of the best m arms and different players have occupied distinct arms. During each round, the regret is at most m, hence the total regret incurred during the first three phases is bounded by O(mK log(KT )/∆2) and the regret afterwards would be 0. On the other hand, with the remaining 2/mT probability, the regret is at most mT . Therefore, the expected regret is at most O(mK log(KT )/∆2) + 2, as required. The O(log(KT )) can be replaced with O(log(T)), noting that
Recall that Theorem 2 has three parts focusing on three different settings: in part (a), we do not observe the collision information, but we know a lower bound for µm; in part (b), we observe the collision information, while in part (c), we do not observe the collision information, but we allow the players to leave the game at points of their choice. We start by proving part (a), and then we explain how the algorithm and the analysis can be modified to prove parts (b) and (c).
4.1 Algorithm for Theorem 2(a).
We describe the algorithm each player executes, first informally and then formally. Recall that µ is a lower bound for µm that all players know in advance. The algorithm has a parameter ν which we set it equal to µ for this part. We say an arm is ν-good if its mean is at least ν; otherwise, we say it is ν-bad.
The player maintains estimates �µ1,...,�µK for the means, which approach the actual means as the algorithm proceeds. She also keeps a confidence interval for each arm j, which is centered at �µj and has the property that µj lies in this interval with sufficiently high probability. If arm j has been pulled s times, this interval has length O(�log(T)/s). Once the player makes sure that some arm is not among the best m arms, she marks it as “bad” and puts it in a set B. This can happen if it is determined that the arm is ν-bad or that there are at least m arms whose confidence intervals lie strictly above this arm’s interval (we say interval [c,d] lies strictly above [a,b] if b < c). On the other hand, once the player makes sure that some arm is within the best m arms, she marks it as a “golden” arm and puts it in a set G. This would happen as soon as there are at least K − m arms that are determined to be bad or whose confidence intervals lie strictly below this arm. Other arms, whose status is yet unknown, are called “silver” arms and kept in a set S.
Initially, all arms are silver. The algorithm proceeds in epochs with increasing lengths. In each epoch, the player explores all silver arms and hopes to characterize each silver arm as golden or bad at the end of the epoch. As time proceeds, arms whose means are far away from the mth arm will be characterized as either golden or bad. Golden arms will be occupied quickly, and bad arms will not be pulled again—this will control the algorithm’s regret.
Special care is needed to ensure all players explore all silver arms without conflicts; this is done via careful executions of a suitable musical chairs subroutine, called MusicalChairs2, explained in the next paragraph. In each epoch, each player maintains a set E of explored arms, which is empty when the epoch starts. The epoch has K+m−1 iterations. In each iteration, if there exists some arm in S\E (i.e., an unexplored silver arm), the player tries to occupy such an arm; otherwise, the player has finished exploring the arms in S, and so she will try to occupy and pull an arbitrary arm from S, while other players are exploring their silver arms. Note that by the assumption that µm ≥ µ and ν = µ, any ν-bad arm is bad. The length of the MusicalChairs2 subroutines are chosen such that each ν-good arm in S that is not marked as golden by any other player will be explored in each epoch by each player. Thus, if an arm is not explored by the end of an epoch, either the arm is ν-bad or the arm is golden and is occupied by another player in the beginning of the epoch. The two cases will be distinguished by checking the empirical reward received from that arm.
We now describe the MusicalChairs2 subroutine, which is different from MusicalChairs1 from the previous section because different players may have different “target sets” now. (A target set is a subset of the arms that a player wants to explore.) For any target set A of arms and any number α of rounds, this subroutine consists of precisely α rounds as follows: in each round, the player pulls a uniformly random arm j ∈ [K]. If she gets a positive reward and j ∈ A, then she occupies arm j, pulls arm j for the remaining rounds, and outputs j. Otherwise, she tries again. If after α rounds she has not occupied any arm, she outputs the dummy index 0. The pseudocode for MusicalChairs2 appears below.
The pseudocode for Theorem 2(a) appears in Algorithm 2 below. Note that this algorithm is synchronized—for all players, the epochs and the iterations within the epochs begin and end at the same round.
To analyze Algorithm 2, we define two bad events: failure of some Musi-calChairs2 subroutine (handled by Corollary 9 below) or incorrectness of some confidence interval (handled by Lemma 10 below). After proving their unlikeliness, we will bound the regret assuming no bad events happen.
4.2 Bounding the failure probability of MusicalChairs2.
We next prove a lemma bounding the failure probability of this subroutine, but first we formally define the notion of success.
Definition (ν-successful MusicalChairs2 subroutine). Let ν ∈ [0,1] be arbitrary. Suppose that a subset of players are executing the MusicalChairs2 subroutine simultaneously for some consecutive rounds (call these the participating players), while any other player either pulls uniformly random arms or pulls a fixed arm during these rounds. The participating players may have different target sets. We say a participating player p with target set Ap is ν-successful if, by the end of the subroutine, either she occupies an arm in Ap or all ν-good arms in Ap are occupied by someone else (participating or otherwise). A player is ν-failed if she is
not ν-successful. Moreover, we say the subroutine is ν-successful if all participating players are ν-successful, and we say the subroutine has ν-failed if at least one participating player has ν-failed.
Lemma 8. Let ν ∈ [0,1] and let α be a positive integer. For MusicalChairs2 of length α, the ν-failure probability of any fixed player is upper bounded by exp(−αν/4K) if m ≤ K and by exp(−αν exp(−2m/K)/K) in general.
Proof. Fix a player with target set A. At any round during the subroutine, suppose the player has not occupied an arm and that there are still a ≥ 1 ν-good unoccupied arms in A. Whenever she tries to occupy one of her target arms, her success probability is at least
Here, aK ≥ 1/K is the probability that she pulls a ν-good unoccupied arm in her target set, ν is a lower bound on the probability that that arm produces a positive reward, and (1 − 1/K)m−1 ≥ exp(−2m/K) is the probability that no other player pulls the same arm. (Note that her success probability may indeed be larger because she may also occupy ν-bad arms in her target set.) Hence, the probability that she has not occupied an arm after α attempts can be bounded by (1 − ν exp(−2m/K)/K)α ≤ exp(−αν exp(−2m/K)/K). If m ≤ K, the argument is identical, but we use the tighter bound (1 − 1/K)m−1 > (1 − 1/K)K ≥ 1/4.
Applying a union bound over the m players gives the following corollary.
Corollary 9. Let ν ∈ [0,1] and let α be a positive integer. The ν-failure probability of a MusicalChairs2 subroutine of length α is not more than mexp(−αν/4K) if m ≤ K and mexp(−αν exp(−2m/K)/K) in general.
4.3 Proof of Theorem 2(a).
As explained in later subsections, the proofs of Theorems 2 (b, c) differ only in values of the parameters ν,α,g. For this part, we put ν = µ and define
We first define the two bad events formally. The first bad event is that some MusicalChairs2 subroutines ν-fail, and the second bad event is that some player’s confidence interval is incorrect, i.e., the actual mean does not lie in the confidence interval. We start by bounding the probability of the bad events.
Lemma 10. Let ν ∈ [0,1] be arbitrary and define α,g as in (4). The probability that some bad event happens is at most 1/mT .
Proof. The probability that some MusicalChairs2 subroutine ν-fails is bounded by mexp(−αν/4K) by Corollary 9. By a union bound over the (at most T ) epochs and the 1 + 2(K + m − 1) ≤ 3Km times MusicalChairs2 is executed in each epoch, the probability that some MusicalChairs2 subroutine ν-fails is at most 3Km × T × mexp(−αν/4K) ≤ 1/2mT , as α = 4K log(6Km2T )/ν.
Whenever a confidence interval for some arm j is updated in some epoch i (Line 21), the arm has been pulled precisely 2i times right before that (Line 20). Hence, the probability that some confidence interval is incorrect for some player, say in epoch i, is bounded via Proposition 4(a) by
By a union bound over the m players, the (at most T) epochs, and the K+m−1 ≤ Km many updates of the confidence intervals within each epoch, the probability of some incorrect confidence interval is at most m × T × Km × 2exp(−2g) ≤ 1/2mT , as g = log(4m3T 2K)/2, completing the proof.
We are now ready to prove Theorem 2(a).
Proof of Theorem 2(a). We bound the regret assuming no bad event happens, and the bound for the expected regret follows as in the proof of Theorem 1. We first prove four deterministic claims and then bound the regret. Informally, these claims are:
1. Any silver arm is explored at least 2i times by each active player during epoch i. (An active player is one that has not occupied a golden arm yet.)
2. No player makes a mistake in marking an arm as golden or bad.
3. Any arm whose mean is much smaller than µm will be marked by all players as bad quickly.
4. Any arm whose mean is much larger than µm will be marked by all players as golden quickly and occupied by one of them quickly.
We now proceed to the formal argument. Note that each epoch has two types of rounds: estimation rounds (Line 20), in which each arm is pulled by at most one player, during which she updates her estimate of its mean, and other rounds, during which some players are executing MusicalChairs2, hence we call them MusicalChairs2 rounds.
Observe that, since there are at least m many ν-good arms (here we are using the fact ν ≤ µ), we always have |G ∪ S| ≥ m, and since the MusicalChairs2 subroutines are always ν-successful, there will be no collision during the estimation rounds.
The first claim is the following: consider a player that has just executed her Line 7 in epoch i. Consider also a ν-good arm j that is silver, and suppose this arm is not occupied by another player as a golden arm in their Line 5. Then the claim is that the player will pull arm j at least 2i times during epoch i and will put it in E at the end of the K +m−1 iterations. To prove this, note that the epoch has K +m−1 iterations. In each iteration, if the player has any unexplored silver arm, in the first α rounds she attempts to occupy one of those (Line 11) while other players pull random arms. By Lemma 11 below and since the MusicalChairs2 subroutines are ν-successful, after the K + m − 1 iterations, each player has explored any such arm j. Therefore, the confidence interval of each such arm will have length 2�g/2i.
The second claim is that the algorithm never makes a mistake in characterizing the arms as golden and bad. First, the characterizations based on confi-dence intervals (Lines 31–35) are correct because all confidence intervals are correct. Now fix an epoch i and an arm j, and note that if j ∈ S \ E on Line 24, that means j is not explored, and that can be for one of two reasons: it may be a golden arm occupied by another player on her Line 5 or its mean may be smaller than ν.
Case 1: arm j is a golden arm occupied by another player. Let �µ′j be the average reward received from this arm by the other player. Suppose the arm was marked as golden by the other player in epoch i′ ≤ i − 1. Then we must have had �µ′j > ν + 3�g/2i′ (see Line 34). This implies
On the other hand, at the end of epoch i − 1, since j was silver and the confidence intervals were correct, we have �µj ≥ µj −�g/2i−1 > ν +�g/2i−1, hence in epoch i, Line 26 is executed and the algorithm correctly marks j as golden.
Case 2: the mean of arm j is smaller than ν. Because the confidence intervals were correct at the end of epoch i − 1, ν lies in the confidence interval for arm j, which has length�g/2i−1. This means �µj −�g/2i−1 ≤ ν, so in epoch i, Line 28 is executed and the player correctly marks j as bad.
The third claim is that any arm with mean < µm − 4�g/2i will be marked as bad by all players by the end of epoch i. Let j be such an arm and suppose we are at the end of epoch i. By Line 32 of the algorithm, it suffices to show that there exists at least m arms ℓ such that either ℓ ∈ G or �µℓ − �µj > 2�g/2i or both. In fact,
this holds for all ℓ ∈ [m], since for any ℓ ∈ [m], if ℓ � G, then ℓ ∈ S, which implies
The fourth claim, whose proof is similar to the third claim, is that any arm j with µj > µm + 4�g/2i will be marked as golden by all players by the end of epoch i. The only difference is the additional condition �µj > ν +3�g/2i, which is satisfied by any such arm. Indeed, we have
by correctness of confidence intervals and since ν ≤ µm.
Now, we bound the algorithm’s regret. First, the number of epochs is fewer than log2(T) < 2log(T). The number of iterations per epoch is K + m − 1 < 2K, whence the total number of MusicalChairs2 rounds can be bounded by 2log(T )(α+ 4Kα) ≤ 10Kα log(T). We next bound the regret of the estimation rounds. The regret of the first epoch can be bounded by m·(K +m−1)·21 < 4Km. Next note that any arm with mean > µm + 4�g/2i−1 has been put in G by the end of epoch i − 1 by all players by the fourth claim, and so some player occupies it in the beginning of epoch i. During epoch i, each active player pulls either a silver or a golden arm, which are at most 8�g/2i−1 away from the best available arms by the third and fourth claims. Since the probability that some bad event happens is 1/mT (Lemma 10), and in this case the total regret can be bounded by mT , the total expected regret can be bounded by
Recall that ∆′ = min{µm − µi : µi < µm}. Let j be the smallest integer that 4�g/2j < ∆′. So, after the first j epochs, any silver arm will have mean precisely µm, and the regret will be zero afterwards. Hence, the total expected regret is alternatively bounded by
Combining (5) and (6) gives that the expected regret is upper bounded by
This bound holds for α = 4K log(6Km2T )/ν and any 0 ≤ ν ≤ µm. Recalling that ν = µ gives Theorem 2(a).
The following lemma is the last piece in completing the proof of Theorem 2(a).
Lemma 11. Fix an epoch and suppose that all MusicalChairs2 subroutines of Line 11 are ν-successful. Then, during the K + m − 1 iterations of the epoch, each player will occupy each ν-good silver arm at least once.
Proof. Consider a bipartite graph with one part being the m players and the other part the K arms, with an edge between a player and an arm if the arm is silver and unexplored for that player. Say an edge is good if the corresponding arm is ν-good. Say two edges are neighbors if they share a vertex, and the degree of an edge is its number of neighbors. Initially, the degree of each edge is at most K + m − 2. Whenever the MusicalChairs2 subroutine in Line 11 is executed, the set of edges corresponding to players and their occupied arms forms an edge-matching in this graph, i.e., a set of edges such that no two of them are neighbors. Moreover, since the MusicalChairs2 subroutine is ν-successful by assumption, this matching M has the property that, for any good edge e, either e ∈ M or some neighbor of e lies in M. After the execution of this subroutine, this matching is deleted from the graph, hence the degree of each good edge decreases by 1. In particular, the maximum degree of good edges decrease by 1. Once this maximum degree becomes zero, in the next iteration all good edges will be deleted. Therefore, after at most K +m − 1 iterations, all good edges will be deleted, which means all ν-good silver arms are explored, as required.
4.4 Proof of Theorem 2(b).
Theorem 2(b) considers the stronger feedback model where we observe the collision information but no lower bound µ for µm is known. Note that in the algorithm for part (a), the parameter µ is mainly used to set the length of the MusicalChairs2 subroutines to make sure that each player will succeed in MusicalChairs2 with high probability. For this part, we observe the collision information, so we can modify MusicalChairs2 to use this information and determine its length without knowing µ.
More precisely, the algorithm is the same as in part (a), except we set ν = 0 and α = 4K log(6Km2T ) and replace MusicalChairs2 with MusicalChairs3, described next. To obtain MusicalChairs3, we modify MusicalChairs2 such that for a player to occupy an arm, she simply looks at the collision information and occupies the arm if there is no collision. Its pseudocode appears below.
The notions of success and failure are defined similarly as before but without a parameter ν (one can think ν = 0 in this case: all arms are 0-good). We have the following bound for its failure probability, whose statement and proof are identical to that for Corollary 9, except there is no parameter ν.
Corollary 12. Let α be a positive integer. In the stronger feedback model with collision information available, the failure probability of MusicalChairs3 subroutine of length α is not more than mexp(−α/4K) if m ≤ K and mexp(−α exp(−2m/K)/K) in general.
The proof of Theorem 2(b) is identical to part (a), except we use Corollary 12 instead of Corollary 9; we obtain the bound (7), which using α = 4K log(6Km2T ) proves Theorem 2(b).
4.5 Proof of Theorem 2(c).
Part (c) considers the case that we do not know µ and we do not observe the collision information, but the players have the option to leave the game. The trouble is that it is not clear how to choose the lengths of MusicalChairs2 subroutines. To solve this issue, we choose really large lengths for MusicalChairs2 subroutines, and if a player has not occupied an arm at the end of a subroutine, she will leave the game. This can happen only if any remaining unoccupied arm has a really small mean, so we have not lost much by not pulling that arm anyway. We explain the details next.
We make the following changes to the algorithm for part (a): we choose ν = K log(T )/√T and define α,g using (4) (so Lemma 10 still applies: all Musi-calChairs2 subroutines are ν-successful with high probability), and we add the following line before Line 20: “if j = 0 then leave the game.” Namely, if a player has not occupied an arm when she wants to start an estimation period, she would simply leave the game and never pull any arm again. Observe that this could happen only if there are fewer than m many ν-good arms, so players may fail to find and occupy an arm. Suppose m′ of the best m arms are ν-bad. Once m′ players have left the game, we will have m − m′ players and m − m′ many ν-good arms, so the algorithm will work as in part (a) from that point onward and the same analysis works. We would only lose a reward of at most m′νT due to the players who have left the game. The total expected regret can be thus bounded via (7) by
completing the proof of Theorem 2(c).
Recall that all the theorems presented so far made three assumptions:
Moreover, for different parts of Theorem 2 we have made additional assumptions. In this section, we discuss how the Assumptions 1–3 can be removed at the expense of getting worse regret bounds. Some assumptions can be removed independently of other assumptions, but some of them cannot be removed unconditionally; we discuss them one by one.
5.1 Unknown time horizon.
The assumption that T is known can be removed independently of any other assumption, and the regret bound would multiply by at most log2(T ).
Indeed, if T is not known, we can apply a simple doubling trick (see [7] for various variants): we execute the algorithm assuming T = 1, then we execute it assuming T = 2 × 1, and so on, until the actual time horizon is reached. If the expected regret of the algorithm for a known time horizon T is R(T), then the expected regret of the modified algorithm for an unknown time horizon would be
R′(T ) ≤ �⌊log2(T )⌋i=0 R(2i) ≤ log2(T)×R(T). For example, if the players have the option of leaving the game, we can apply Theorem 2(c) to get the regret upper bound
R′(T ) ≤
which is within a constant multiplicative factor of the upper bound for R(T).
5.2 Other reward distributions.
The assumption that the rewards always lie in [0,1] can be relaxed, independently of other assumptions, to the assumption that the rewards have subgaussian distributions with mean lying in a known interval; of course the regret bounds must be re-normalized, and we also get a multiplicative logarithmic factor in some cases.
In the proofs, we have used this assumption in three ways: first, we used that the expected regret incurred any round can be bounded by m; second, that the rewards satisfy the Chernoff-Hoeffding concentration inequality (Proposition 4(a)); and third, for bounding the failure probability of MusicalChairs2,3 subroutines we used that P{X > 0} ≥ EX for any random variable X ∈ [0,1].
A random variable X is σ-sub-Gaussian if max{P{X − EX < −t},P{X − EX > t}} < exp(−t2/2σ2); for example, a standard normal random variable is 1-sub-Gaussian. The first two facts hold, with appropriate adjustments, for σ-sub-Gaussian random variables whose means lie in a bounded interval [0,b], see, e.g., [24, Chapter 2]. The third fact also holds up to a logarithmic factor, see Lemma 13 below. Hence, after appropriate adjustments, our main theorems can be readily extended to such distributions.
Lemma 13. Let X ≥ 0 be a random variable with mean µ that satisfies P�X > µ + t� < exp(−t2/2σ2). Then we have P{X > 0} ≥ min{|µ/(σ log(σ/µ))|,1}/99.
Proof. By dividing X by σ we may assume σ = 1. Let t ≥ 0 be a parameter to be chosen later, and define Y = X · 1[X > t + µ] and Z = X · 1[X ≤ t + µ]. Note that µ = EX = EY + EZ and EZ ≤ P{X > 0}(t + µ). We next write EY as
Consequently,
which implies
Now, if µ ≤ 0.05 then setting t = log(1/µ) gives that the right-hand side is greater than µ/(5log(1/µ)) = |µ/(5log(1/µ))|. (Here, we have used the inequality
which holds for all 0 < µ ≤ 0.05.)
On the other hand, if µ > 0.05, setting t = 4 gives that the right-hand side is greater than 1/99, as required. (Here, we have used the inequality (98 − e−8)µ > 4 + 33 × e−8/8, which holds for any µ > 0.05.)
5.3 More players than arms.
We next consider the assumption that K ≥ m and explain how and when it can be removed. First, note that if K < m then the term �i∈[m] µi in the definition of regret (2) is not well defined, hence we must redefine the regret. There are two natural ways to do this.
5.3.1 Original model.
In the original model, if K < m, then the best strategy for the players, had they known the means, would be for K − 1 of them to occupy the best K − 1 arms and for the rest to occupy the worst arm; so the regret in this case can be defined as
Let ∆ ≔ µK−1 − µK.
For this model, we present an algorithm without observing the collision information and without the assumption K ≥ m with expected regret O(mK log(T )exp(4m/K)/∆2).
We assume that T is known and the rewards lie in [0,1]; we have explained in previous subsections how the regret bound will be affected if these are relaxed. The algorithm crucially assumes m is known to the players.
The algorithm is similar to Algorithm 1. Let p ≔ (1−1/K)m−1 ≥ exp(−2m/K) be the probability of no-collision when the players pull arms uniformly at random, and let g = CK log(KT )/p2 for a sufficiently large constant C. Each player pulls arms randomly until at some round τ she finds a gap of 3�g/τ between the (K − 1)th and the Kth arm, and she continues for 24τ more rounds to make sure that all others have also found this gap. An argument similar to that of Lemma 6 gives that these two phases will take O(K log(KT )/p2∆2) many rounds. Moreover, each player has learned that µK−1 ≥ ∆ ≥�g/τ and that�τ/g ≤ 5/∆ (see Lemma 6(ii)). Then the player executes MusicalChairs2 on the set of K − 1 best arms, for α = CK log(KT )�τ/g/p many rounds, for a large enough constant C. Since mexp(−αµK−1p/K) ≤ mexp(−α�g/τp/K) < 1/mT , Lemma 8 implies that, with probability at least 1 − 1/mT , all players will be�g/τ-successful, meaning that the best K − 1 arms are occupied. After MusicalChairs2 is finished, if the player has occupied an arm, she will pull it until the end of game, otherwise she pulls the worst arm for the rest of game. Thus, the regret will be zero after at most O(K log(KT )/p2∆2)+O(K log(KT )�τ/g/p) ≤ O(K log(KT )/p2∆2) many rounds, giving a total expected regret of O(mK log(KT )/p2∆2) ≤ O(mK log(KT )exp(4m/K)/∆2).
5.3.2 Model allowing players to leave.
Alternatively, if we allow the players to leave the game, the best strategy had they known the means would be for m − K players to leave the game and for the rest to occupy distinct arms. The regret in this model can be defined as
For this model, we present an algorithm without observing collision information and without the assumption K ≥ m. We assume that T is known and the rewards lie in [0,1]; we have explained in previous subsections how the regret bound will be affected if these are relaxed. The algorithm crucially assumes m is known to the players.
The algorithm is simple: each player executes the MusicalChairs2 algorithm for a certain number of rounds, and if she has not occupied an arm by that time, she leaves the game.
The number of rounds they play MusicalChairs2 is O(log(TK)K exp(2m/K)/ν) with ν =�mlog(TK)exp(2m/K)/T . With high probability, ν-good arms will be occupied, and any other arm contributes a regret of at most νT . So the total expected
regret can be bounded by O(mlog(TK)K exp(2m/K)/ν +KνT ), which by the choice of ν gives the bound O(K exp(m/K)�mT log(T K)) for the expected regret.
If we make an additional assumption that the players know a lower bound µmin for all the arm means, then instead they play MusicalChairs2 for O(log(T K)K exp(2m/K)/µmin) many rounds, and by Lemma 8, with probability at least 1 − 1/mT , all the K arms are occupied, whence the total expected regret is bounded by O(mK log(T )exp(2m/K)/µmin).
Alternatively, if instead of knowing µmin the players observe the collision information, they play MusicalChairs3 for O(log(T K)K exp(2m/K)) many rounds, and the total expected regret is upper bounded by O(mK log(T )exp(2m/K)).
5.4 Unknown number of players.
We next consider the assumption that m is known and explain how it can be removed. We assume that T is known and the rewards lie in [0,1]; we have explained in previous subsections how the regret bound will be affected if these are relaxed. Crucially, we assume m ≤ K, although if m ≤ CK for some known absolute constant C, then the analysis in this section works after appropriate adjustments and all the derived asymptotic bounds hold.
In this section, we present two subroutines to estimate m in two different settings: when the collision information is observed and when the collision information is not observed but µ1 ≥µ for some knownµ. If m is unknown, such a subroutine can be executed at the beginning of the algorithm, and after that we can execute one of the algorithms presented previously; hence the total regret bound would increase by the number of rounds of the subroutine times m.
In the first setting, when the players observe the collision information, [21, Lemma 2] presents a simple algorithm, with O(K2 log(1/δ)) many rounds, using which each player learns m with probability ≥ 1 − δ. Setting δ = 1/K2T ensures that this simultaneously holds for all players with probability ≥ 1 − 1/KT . After this estimation, the players can run the algorithm of Theorem 2(b). The additional regret due to these estimation rounds is O(K2mlog(KT )), which is dominated by the final regret upper bound of Theorem 2(b).
For the setting without the collision information, we assume that the players know that at least one arm has mean at leastµ. We present an algorithm with O(K3 log2(K/µδ)/µ2) many rounds such that if all players execute it, each will learn m with probability 1−δ. Setting δ = 1/K2T ensures that this simultaneously holds for all players with probability ≥ 1 − 1/KT , and after this estimation, the players can execute Algorithm 1 or Algorithm 2. The additional regret due to estimation is bounded by O(K3mlog2(KT /µ)/
Here is the algorithm each player executes: let ε ≔
and observe that, since K ≥ m ≥ 2,
First, the player pulls random arms for 8K log(K2/9δ)/ε2 rounds and estimates the quantities µj(1−1/K)m−1 for all j ∈ [K]. By an argument similar to that of Lemma 5, she obtains estimates {σj}j∈[K] such that
for all players, uniformly with probability 1−δ/3. Let ℓ be the arm with maximum σ value. We claim that µℓ ≥µ/2. To prove this, note that
whence µℓ ≥
Then the player tries to estimate µℓ itself and then uses the ratio µℓ/σℓ for estimating m. For this, she tries 4K log(6K/µδ) times to occupy the arm ℓ, using a musical chairs subroutine: divide the time horizon into 4K log(6K/µδ) blocks of length log(6/δ)/ε2. For each block, she chooses an arm uniformly at random and pulls it for all the rounds in the block. If this arm was arm ℓ and she receives a positive reward at least once during the block, then, by taking the average of received rewards in the block, she obtains an unbiased estimate �µ for µℓ. In any case, she repeats this procedure for the next blocks. Using an analysis similar to that of MusicalChairs2, after 4K log(6K/µδ) iterations, with probability at least 1− δ/3, all players have explored their arm ℓ. The pseudocode appears in Algorithm 3 below.
For each player, since the estimate �µ is based on log(6/δ)/ε2 pulls, with probability 1 − δ/3 she obtains an estimate �µ for µℓ such that |�µ − µℓ| ≤ ε. Therefore, we have µℓ ∈ [�µ − ε,�µ + ε] and also µℓ(1 − 1/K)m−1 ∈ [σℓ − ε,σℓ + ε] by (8). Given the two intervals, we want to recover m. Since ε < µ/4×(1−1/K)m−1 × (1−1/K)−2/5−1(1−1/K)−2/5+1, we have
hence Lemma 14 below shows that m can be recovered uniquely.
Lemma 14. Let a,b,c,d,p > 0. Consider intervals [a,b] and [c,d] with max{b/a,d/c} ≤ p2/5, and suppose there exist x ∈ [a,b] and y ∈ [c,d] such that xpz = y for some integer z. Then there exists a unique integer n such that [apn,bpn] ∩ [c,d] � ∅.
Proof. The existence of such an n follows from existence of x and y and that xpz = y for some integer z. For the uniqueness, note that we have [apn,bpn] ∩ [c,d] � ∅ if and only if [loga/ logp + n,logb/ logp + n] ∩ [logc/ logp,logd/ logp] � ∅. Now note that the interval [logc/ logp,logd/ logp] has length ≤ 2/5. Each interval In = [loga/ logp + n,logb/ logp + n] also has length ≤ 2/5, hence, for each n, In and In+1 are at least 3/5 apart from each other, so [logc/ logp,logd/ logp] can intersect at most one In.
To bound the number of rounds of the algorithm, note that
thus ε ≥µ/120K. So the number of rounds of the algorithm is 8K log(K2/9δ)/ε2 + 4K log(6K/µδ)log(6/δ)/ε2 = O(K3 log2(K/
In this section, we present a distributed algorithm that, with probability at least 1−δ, converges to an ε-Nash equilibrium in any stochastic anti-coordination game within O(log(K/δ)(K/ε2 + K2/ε)) many rounds.
Note that the players do not observe collisions, and in particular, they do not observe the actions of other players, but we assume each player has the option of choosing a dummy action, which is given index 0 and produces no reward. We are still making the Assumptions 1–3 stated on page 4 (but there is no parameter T here).
We describe the algorithm each player executes. First, the player pulls arms uniformly at random and maintains an estimate for the arm means. An argument similar to that of Lemma 5 gives that, after 512K log(6mK/δ)/ε2 rounds, with probability at least 1 − δ/2m, all estimated means are within distance ε/2 of the actual means. By a union bound over all players, this is true uniformly over all players with probability at least 1 − δ/2.
The player then sorts the �µi as �µi1 ≥ ··· ≥ �µiK. Then for j = 1,...,K, she runsMusicalChairs2 on {ij} (in this order) for 4K log(2mK/δ)/ε many rounds. If during any of these subroutines she occupies an arm, she chooses that action. Otherwise, she chooses the dummy action 0. The pseudocode is given as Algorithm 4.
By Corollary 9 and a union bound over the K iterations, all the MusicalChairs2 subroutines for all players are ε-successful with probability at least 1−δ/2. We now show that if the estimation errors are ≤ ε/2 and all the MusicalChairs2 subroutines are ε-successful (with probability 1 − δ both these good events happen), then the resulting assignment is an ε-Nash Equilibrium. Fix any player p and recall that, for each action i ∈ [K], µpi denotes the average reward player p would receive if she plays action i solely. First, suppose that she has output a non-dummy action ij. This means all actions i1,i2,...,ij−1 were either occupied by other players or had mean < ε or both. On the other hand, since the estimated means are within ε/2 of the actual means, for any s � {i1,i2,...,ij−1} we have �µs ≤ �µij so
hence the player cannot increase her outcome by more than ε by switching to action s. Finally, if player p has chosen the dummy action 0, it means that for each j ∈ [K], either action ij is occupied or µpj ≤ ε or both. Thus, there is no unoccupied action s with µps > ε, so again the player cannot increase her outcome by more than ε by switching.
The total number of rounds is
and the failure probability is at most δ, as required.
We thank the referees of Mathematics of Operations Research for detailed feedback, which resulted in significant improvements in the presentation. G´abor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant PGC2018-101643-B-I00 “Predicc´on, inferencia y computaci´on en modelos estructurados - Ayudas Fundaci´on BBVA a Equipos de Investigaci´on Cientifica 2017” and by “Google Focused Award Algorithms and Learning for AI.” Abbas Mehrabian was supported by an IVADO-Apog´ee-CFREF postdoctoral fellowship. This work started during the Mathematics of Machine Learning program sponsored by the Centre de Recherches Math´ematiques (CRM) held at Universit´e de Montr´eal in April 2018.
[1] Alatur P, Levy KY, Krause A (2020) Multi-player bandits: The adversarial case. Journal of Machine Learning Research 21(77):1–23, URL http://jmlr.org/papers/v21/19-912.html.
[2] Anandkumar A, Michael N, Tang AK, Swami A (2011) Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications 29(4):731–745, URL https://ieeexplore.ieee.org/document/5738217.
[3] Anantharam V, Varaiya P, Walrand J (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays—part I: IID rewards. IEEE Transactions on Automatic Control 32(11):968–976, URL https://ieeexplore.ieee.org/document/1104491.
[4] Avner O, Mannor S (2014) Concurrent bandits and cognitive radio networks. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 66–81 (Springer), URL https://link.springer.com/chapter/10.1007/978-3-662-44848-9_5.
[5] Bande M, Veeravalli VV (2019) Multi-user multi-armed bandits for uncoordinated spectrum access. 2019 International Conference on Computing, Networking and Communications (ICNC), 653–657, URL https://ieeexplore.ieee.org/document/8685615.
[6] Besson L, Kaufmann E (2018) Multi-player bandits revisited. Janoos F, Mohri M, Sridharan K, eds., Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, 56–92 (PMLR), URL http://proceedings.mlr.press/v83/besson18a.html.
[7] Besson L, Kaufmann E (2018) What doubling tricks can and can’t do for multi-armed bandits, URL https://hal.inria.fr/hal-01736357.
[8] Besson L, Kaufmann E (2019) Lower bound for multi-player bandits: Erratum for the paper multi-player bandits revisited, URL http://chercheurs.lille.inria.fr/ekaufman/BK19_Erratum_LB.pdf.
[9] Bonnefoi R, Besson L, Moy C, Kaufmann E, Palicot J (2018) Multi-armed bandit learning in IoT networks: Learning helps even in non-stationary settings. Marques P, Radwan A, Mumtaz S, Noguet D, Rodriguez J, Gundlach M, eds., Cognitive Radio Oriented Wireless Networks, 173–185 (Cham: Springer International Publishing), ISBN 978-3-319-76207-4, URL https://link.springer.com/chapter/10.1007/978-3-319-76207-4_15.
[10] Boursier E, Kaufmann E, Mehrabian A, Perchet V (2020) A practical algorithm for multiplayer bandits when arm means vary among players. Chiappa S, Calandra R, eds., Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, 1211–1221 (Online: PMLR), URL http://proceedings.mlr.press/v108/mehrabian20a.html.
[12] Boursier E, Perchet V (2020) Selfish robustness and equilibria in multi-player ban- dits. Abernethy J, Agarwal S, eds., Proceedings of Thirty Third Conference on Learning
Theory, volume 125 of Proceedings of Machine Learning Research, 530–581 (PMLR), URL http://proceedings.mlr.press/v125/boursier20a.html.
[13] Bubeck S, Budzinski T, Sellke M (2020) Cooperative and stochastic multi-player multi-armed bandit: Optimal regret with neither communication nor collisions. arXiv URL https://arxiv.org/abs/2011.03896.
[14] Bubeck S, Li Y, Peres Y, Sellke M (2020) Non-stochastic multi-player multi-armed bandits: Optimal rate with collision information, sublinear without. Abernethy J, Agarwal S, eds., Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, 961–987 (PMLR), URL http://proceedings.mlr.press/v125/bubeck20c.html.
[15] Hanawal MK, Darak SJ (2018) Multi-player bandits: A trekking approach. arXiv URL https://arxiv.org/abs/1809.06040.
[16] Komiyama J, Honda J, Nakagawa H (2015) Optimal regret analysis of Thompson sam- pling in stochastic multi-armed bandit problem with multiple plays. Bach F, Blei D, eds., Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1152–1161 (Lille, France: PMLR), URL http://proceedings.mlr.press/v37/komiyama15.html.
[17] Lattimore T, Szepesv´ari C (2020) Bandit Algorithms (Cambridge University Press), draft available at https://tor-lattimore.com/downloads/book/book.pdf.
[18] Liu K, Zhao Q (2010) Distributed learning in multi-armed bandit with multi- ple players. IEEE Trans. Signal Process. 58(11):5667–5681, ISSN 1053-587X, URL http://dx.doi.org/10.1109/TSP.2010.2062509.
[19] Liu LT, Ruan F, Mania H, Jordan MI (2020) Bandit learning in decentralized matching markets. arXiv URL https://arxiv.org/abs/2012.07348.
[20] McDiarmid C (1998) Concentration. Probabilistic methods for algorithmic discrete mathematics, volume 16 of Algorithms Combin., 195–248 (Springer, Berlin), URL http://dx.doi.org/10.1007/978-3-662-12788-9_6.
[21] Rosenski J, Shamir O, Szlak L (2016) Multi-player bandits: A musical chairs approach. Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, 155–163, ICML’16 (JMLR.org), URL http://proceedings.mlr.press/v48/rosenski16.html.
[22] Shi C, Xiong W, Shen C, Yang J (2020) Decentralized multi-player multi-armed ban- dits with no collision information. Chiappa S, Calandra R, eds., Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, 1519–1528 (Online: PMLR), URL http://proceedings.mlr.press/v108/shi20a.html.
[23] Slivkins A (2019) Introduction to multi-armed bandits. Foundations and Trends in Machine Learning 12(1-2):1–286, ISSN 1935-8237, URL http://dx.doi.org/10.1561/2200000068, draft available at https://arxiv.org/abs/1904.07272.
[24] Vershynin R (2018) High-dimensional probability: An introduction with applications in data science, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, Cambridge), ISBN 978-1-108-
41519-4, URL http://dx.doi.org/10.1017/9781108231596, draft available at https://www.math.uci.edu/˜rvershyn/papers/HDP-book/HDP-book.html.
[25] Wang PA, Proutiere A, Ariu K, Jedra Y, Russo A (2020) Optimal algorithms for multiplayer multi-armed bandits. Chiappa S, Calandra R, eds., Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, 4120–4129 (Online: PMLR), URL http://proceedings.mlr.press/v108/wang20m.html.