Multi-User Multi-Armed Bandits for Uncoordinated Spectrum Access

2018·Arxiv

Abstract

Abstract

A multi-user multi-armed bandit (MAB) framework is used to develop algorithms for uncoordinated spectrum access. The number of users is assumed to be unknown to each user. A stochastic setting is first considered, where the rewards on a channel are the same for each user. In contrast to prior work, it is assumed that the number of users can possibly exceed the number of channels, and that rewards can be non-zero even under collisions. The proposed algorithm consists of an estimation phase and an allocation phase. It is shown that if every user adopts the algorithm, the system wide regret is constant with time with high probability. The regret guarantees hold for any number of users and channels, in particular, even when the number of users is less than the number of channels. Next, an adversarial multi-user MAB framework is considered, where the rewards on the channels are user-dependent. It is assumed that the number of users is less than the number of channels, and that the users receive zero reward on collision. The proposed algorithm combines the Exp3.P algorithm developed in prior work for single user adversarial bandits with a collision resolution mechanism to achieve sub-linear regret. It is shown that if every user employs the proposed algorithm, the system wide regret is of the order over a horizon of time T. The algorithms in both stochastic and adversarial scenarios are extended to the dynamic case where the number of users in the system evolves over time and are shown to lead to sub-linear regret.

I. INTRODUCTION

The existing spectrum management paradigm treats frequency spectrum as a fixed commodity, which leads to spectrum under-utilization. Cognitive radio has emerged as a useful strategy to increase spectrum utilization. The existing literature on cognitive radio has largely been focused on the primary/secondary user paradigm, where secondary users need to detect vacant spectrum when available and vacate the occupied spectrum when a primary user wants to transmit.

We focus on a different type of spectrum sharing system in which there is no distinction between users, and in which there is no coordination among the users. The collective performance across all users is more important than that of individual users. This is in contrast to the typical primary/secondary user paradigm in which secondary users bear the responsibility for ensuring priority-based spectrum sharing. We model this system using an adversarial multi-user multi-armed bandit (MAB) framework [2]. Our goal is to design an efficient channel access mechanism by managing interference in the system through a decentralized policy across the users.

Multi-arm bandit formulations in stochastic multi-user cognitive radios without user coordination were considered in [3], [4], [5] and [6]. The algorithm in [3] is based on a time-division fair sharing (TDFS) of the best arms between users. Although the algorithm achieves order optimal regret asymptotically, it requires pre-agreement among users and it is assumed that the number of users is fixed and known to all users. The algorithm in [4] does not require any coordination between users and achieves optimal regret asymptotically, but assumes that the number of users is known. The algorithm in [5] combines an -greedy learning rule with a collision avoidance mechanism, and [6] considers a musical chairs algorithm. Both of these approaches achieve sub-linear regret and do not require knowledge of the number of users. However, it was assumed that the channel parameters are the same for all the users. A stochastic multi-user MAB with user dependent rewards on channel was considered in [7]. However, the algorithm considers coordination and communication between users via an auction algorithm.

In this work, we focus on two scenarios that have not been previously studied in the multi-user MAB setting for uncoordinated dynamic spectrum access. We assume that the number of users is unknown and that there is no communication between the users. However, we make the mild assumption that the users have access to a shared clock for time synchronization (see also, [6], [8], [9]).

We first study a stochastic multi-user MAB where the rewards on the channels are not user dependent. In our model, all users are treated equally and the reward obtained by each user largely depends on the actions of the other users. When multiple users access the same channel, we allow for a non-zero reward with the assumption that the reward for each user decreases as the number of users on the channel increases. Thus we include the case where there are more users than channels. This is in

Parts of this work were presented at ICNC [1] and submitted to ICASSP. M. Bande and V. V. Veeravalli are with the Coordinated Science Laboratory and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: mbande2@illinois.edu, vvv@illinois.edu). This research was supported by the US NSF WIFiUS Program under grant number CNS 14-57168 and by the US NSF SpecEES under grant number 1730882, through the University of Illinois at Urbana-Champaign.

contrast to the existing approaches, including [5] and [6], which focus on the primary/secondary user paradigm in the scenario where the reward distribution for a user is unknown but fixed. In particular, when multiple users access the same channel they receive zero reward. Hence, all these approaches fail when the number of users is greater than the number of channels.

We assume that the reward on the channel depends on the number of users on the channel and is drawn i.i.d from a distribution depending on the number of users on the channel. The degradation of the reward as a function of number of users depends on the system, e.g., the distance between the users, the protocol used for transmission (e.g., hybrid ARQ) and is captured through a reward distribution that depends on the number of users on the channel.

We propose an algorithm and show that if each user employs the algorithm, the system wide regret is O(1) in time, with high probability. The algorithm can be used for any number of users or channels. To the best of our knowledge, we are the first to provide sub-linear regret guarantees without user coordination when the number of users is greater than the number of channels.

In the second scenario, we study the adversarial multi-user MAB framework with user-dependent rewards. The adversarial bandit problem is an important variation of the MAB problem, where no stochastic assumption is made on the generation of rewards. The term “adversarial” refers to the mechanism choosing the sequence of rewards on each arm. If this mechanism is independent of the users actions, then the adversary is said to be oblivious. If the mechanism may adapt to the users’ past behaviors, then the adversary is said to be non-oblivious [2]. The existing literature on adversarial MABs is focused on the single user case, and a detailed overview of the proposed solutions for the adversarial MAB formulation can be found in [2]. The proposed algorithms in the single user adversarial setting achieve a sub-linear regret of over a time horizon T.

We consider multi-user dynamic spectrum allocation without any coordination among the users. We also assume that the rewards on each channel are user-dependent and may vary with time. Such a system is captured through a multi-user adversarial MAB model, particularly when the reward distribution for each channel and user may change over time. We propose an algorithm, and show that if each user employs the algorithm, the system wide regret is over a time horizon T. To the best of our knowledge, we are the first to consider the multi-user setting for adversarial MABs and to provide sub-linear regret guarantees.

II. SYSTEM MODEL AND NOTATION

Let K be the number of users in the system. We initially assume that the users have unlimited data for transmission. In a more realistic setting, users may become active or inactive depending on their transmission needs; our dynamic setting covers this scenario. Each user can choose one among M channels for transmission. With M channels and K users attempting to access the spectrum, we assume that each user has prior knowledge of M, but not of K. The assumption of known M is reasonable if the spectrum partition is enforced and fixed. On the other hand, it is not realistic to assume the knowledge of K in an uncoordinated network.

We model the system as a multi-user MAB system with K users and M arms (channels). In each time unit t, let denote the set of channels available to user k. User k chooses a channel based on the reward history according to a certain policy and receives a reward . We assume that , and that each user chooses a channel according to the same algorithm. The reward on each arm depends on the number of users who have chosen the arm. Let denote the number of users on each channel at time t, where . Thus, the reward received by user k at time t is a function of the channel chosen and the number of users on the channel .

A. Stochastic setting

We model the system as a stochastic multi-user MAB system with K users and M arms (channels). Each user can choose one among M channels for transmission, where we allow for the possibility that . We assume that the reward observed is inversely proportional to the number of users transmitting on the same channel. For example, the reward could be the rate achieved by the user on the channel which reduces due to interference from other users accessing the channel. Let denote the mean reward on channel m when the number of users on the channel is f(m). We assume that each user chooses a channel according to the same policy. We assume that becomes negligible for some , where depends on the system. This restricts the number of users in the system as .In order to ensure that one user does not monopolize a channel for an extended period of time, we impose the following condition. For each user, transmission on a particular channel takes place for a maximum of time units, after which the user releases the channel for at least time units before attempting to access the same channel.

We define the expected regret in the system as

where argmaxcorresponds to the optimal number of users on each channel. To estimate the means on each channel as a function of number of users, we need to impose the following separability condition.

where is the variance of the distributions and c is a constant.

B. Adversarial setting

In this case, we model the system as an adversarial multi-user MAB with K users and M channels. We further restrict attention to the setting where there are more channels than users in the system i.e., . We assume that each user chooses a channel according to the same algorithm. For user , let denote the probability vector across the arms, where is the probability of choosing arm m at time t. We assume that the adversary chooses different reward for different users for the same channel. Let denote the reward observed by user k on choosing channel at time t. We assume that if more than one user chooses the same channel, they all receive zero reward. In other words, the users observe zero reward on collision. If there is no collision on the channel, the user observes a reward that is chosen by an adversary. Thus, we set when .

We adopt the standard notion of pseudo-regret used for adversarial bandits in [2]. The expected total regret in the system until time T is defined as

III. STOCHASTIC SETTING

In this section, we focus on the stochastic multi-user MAB with user-independent rewards on each channel. We present an algorithm which leads to sub-linear regret with high probability, and extend it to the dynamic case.

A. Algorithm

The algorithm has two phases. The first is an estimation phase during which we estimate the number of users K and , the average mean reward on each channel as a function of the number of users on the channel. The second is an allocation phase where the users arrange themselves in a way that minimizes system regret.

We estimate the number of users by keeping track of the number of collisions similar to [6], with the estimate given by round

We estimate separately for each channel based on the reward x(m) observed on the corresponding channel, by clustering the samples using the k-means algorithm. We employ the algorithm Cluster (see Algorithm 2) inspired by [10]. We

are interested in finding the centroids of the clusters rather than the correct classification of all the samples. Hence, we use an -approximation algorithm with a run time to find the estimates the centroids of the cluster and show that we get good estimates with high probability. We consider the approximation algorithm in [11] with a run time .After obtaining estimates for and , the optimal number of users on each channel can be calculated. We use Alloc (see Algorithm 3) to ensure that each user settles or ‘fixes’ on a channel m , for which the number of users less than . That is, on finding a channel m with , the user keeps transmitting on it for at most time units. The system incurs regret until all users have settled on some channels, and we call this duration the fixing time. Once all the users have settled on their channels the system does not incur regret. However in our system model, a user can transmit on a channel for at most time units, after which the user must switch. We assume that is fixed for all the users but can vary with time. We use Permute (see Algorithm 4) to construct an efficient allocation for which the regret does not grow with time. In order to avoid system-wide regret every time users have to switch, we fix the ordering of each user after epochs; this can be done for any . Our goal is to have each user transmit on all the channels. This is the coupon collector problem with each user having to collect M channels with the expected number of trials logM). When , in order to have efficient allocation so that the regret does not grow with time, after the first epoch, each user switches to the next channel among the set of K best channels.

We fix the epoch size to be , where is the expected time taken for all the users to fix on a channel. After epochs, we continue with an epoch size of . We assume that 2 to ensure that after every transmitting for time units, each user has other available channels. Note that our algorithm works even when , in which case it reduces to a version of the algorithm in [6].

B. Analysis

We first investigate the case where K > M. We show that if all the users in the system use Algorithm 1, with high probability, the expected regret is O(1) in T. 1) Estimation phase: We now show that, with high probability, we have the correct estimates for . More precisely, we find estimates such that with high probability. Lemma 1: For any fixed , user k, channel m and number of users on the channel the estimate obtained after running the algorithm for

rounds, we have with probability at least ,

Proof: Let denote the event that there is at least one combination k, m, n such that and denote the event that each player has more than observations from distribution with mean for each m, n.

where the inequality follows from union bound. To show that , it suffices to show that which follows from Lemma 6 in the appendix with for and follows from Hoeffding’s inequality for n = 1.

Lemma 2: For any , if we run the estimation phase of the algorithm for

which is equivalent to showing

Hence, we choose .

2) Allocation phase: We now find bounds on the expected regret during each fixing phase, given that the estimates of and K are accurate. Lemma 3: The expected regret accumulated by the system during a fixing phase is upper bounded by

Pr(User k being fixed)

where (a) follows because we only consider one term in the each of the summations with i = 0, and (b) follows from for . Thus for any user k, the expected fixing time is given by

and thus the regret during the fixing phase is given by,

where denotes the regret incurred by user k at time t and we have by our assumption on the reward distribution.

Analysis for

For the case where , there is no need for clustering. We only need the estimates for , and all users individually choose the best K channels. This reduces to the “musical chairs” algorithm and the analysis can be found in [6]. After fixing on a channel during the first allocation time, after every time units, each user switches to the next channel among the K best channels.

3) Main Result: We now present the upper bound on the expected regret incurred by the users employing Algorithm 1.

Theorem 1: For any fixed and , with probability greater than , the expected regret for K users using Algorithm 1 with M arms for T rounds, with parameter

given by

i.e., in T. Proof: The expected regret is due to regret during the estimation phase as well as the allocation phase. Let denote the time taken for all the users to fix.

From Lemmas 1 and 2, we have the correct estimates for and K with high probability, with an estimation phase of time units. Thus, corresponds to the regret accumulated system-wide during the estimation phase. Here denotes the time used for running the -approximation algorithm for clustering. In the allocation phase, the regret in the system is accrued only during the number of fixing phases. From Lemma 3, the regret in each fixing phase is exp.

C. Dynamic case

We now extend the results to a dynamic system with a changing number of users. The key idea is to run Algorithm 1 repeatedly across epochs. However, in order to obtain a sub-linear regret bound, we need to impose some restrictions on the number of epochs, and on the way users enter or leave the system. It is easy to see that the number of epochs must be sub-linear in time to have sub-linear regret in the system. We restrict the number of users entering and leaving the system until time t, which we denote by to be where . We note that this is different from [6] where the time horizon is fixed and known, and there is also a restriction on when users can enter or leave the system. In our model, the dynamic scenario also includes the case where can go from greater than M to less than M, and vice-versa.

Let denote the number of active users at time t, where . Note that all the theorems in subsection III-B follow for the dynamic case with . We choose the starting epoch length to be greater than or equal to . We run Algorithm 1 for time , and so on. The resulting algorithm is given below.

Theorem 2: With a probability greater than , the expected system-wide regret after running the Algorithm 5 for T rounds where is

Proof: We have which gives us . The epoch length is changing with time. The total number of epochs until time T is .We consider Theorem 1 with set as in the algorithm Dynamic allocation. Thus we have . Using union bound we show that with a probability at least , we have good estimates for and over all epochs.

In epochs with fixed or static users, the accumulated regret follows from Theorem 1, and in epochs with dynamic users, the system incurs regret during the entire epoch.

D. Experiments

In this section, our goal is to validate the performance of the estimation phase in the algorithm and show that the performance in the allocation phase does not suffer due to use of the estimated values i.e., the regret does not grow with time in the allocation phase.

We consider a system with K = 10 users and M = 6 channels and the non dynamic case. We set time units and and repeat the experiment 100 times and consider the average accumulated regret. The value of is set to 3, and the reward distributions are chosen to be uniform with a variance of 0.01, and means between 0 and 1 given below,

We compare the performance of Algorithm 1 with the estimated values of and K with Algorithm 1 with the true parameter values. We also show how the estimates change with number of iterations in the estimation phase . We used the in-built MATLAB kmeans function for clustering.

Fig. 1: Accumulated regret as a function of time.

From Fig. 1, we see that the accumulated regret grows with time during the estimation phase and remains constant during the allocation phase. Also, there is no noticeable difference between Algorithm 1 with the true parameter values and the one with the estimated values. This follows because the estimates of K and the mean converge to the true values within a few iterations as shown in Fig. 2 and Fig. 3.

Fig. 2: Error in the estimation of number of users K.

Fig. 3: Error in the estimation of the mean.

IV. ADVERSARIAL SETTING

In this section, we consider the adversarial multi-user MAB model with user-dependent rewards on each channel. We present an algorithm that leads to sub-linear regret, and extend it to the dynamic case.

A. Single user MAB

We consider the Exp3.P algorithm described in [2] for a single user MAB in an adversarial setting. We modify the algorithm so that the user chooses an arm and updates the probability vector only in a few time units. This modification is useful in the multi-user case, where the users may not choose an arm in each time unit due to possible collisions. We now present a modified version of the Exp3.P algorithm, in which a new arm is chosen and the probability is updated at time units such that and . For each , we consider the reward over the time-period , with the reward being normalized to lie between 0 and 1.

Theorem 3: The expected regret of Modified Exp3.P algorithm (Algorithm 6) until time T is given by

where

the regret bound for Exp3.P given in [2].

B. Multi-user MAB: Algorithm

We now consider the multi-user adversarial bandits under a known finite horizon T, and propose an algorithm which when employed by all users independently leads to sub-linear regret.

In a multi-user adversarial system, every time t that a user k chooses an arm according to a certain probability distribution to randomize against the adversary, there is a possibility for collision with other users. Hence there is a need for a collision resolution mechanism, so that the regret does not grow linearly with time. Instead of choosing an arm every time unit, a user chooses an arm only a sub-linear number of times until T ( e.g., where y < 1). The goal is to randomize sufficient number of times so as to counteract the adversary, while making sure that the regret due to collisions does not become large.

We propose an algorithm (Algorithm 7) that combines the modified Exp3.P algorithm (Algorithm 6) with a collision resolution mechanism with y < 1. In the analysis in Section IV-C, we pick which is large enough to maintain the sub-linear regret achieved by the modified Exp3.P algorithm but small enough so that the regret due to collisions is sub-linear as well.

In every time-interval of length , we first have a collision resolution phase. Each user chooses a channel with probability . A user settles or fixes on a channel if at any time the user finds a channel without collision. Once a user settles on a channel, the user keeps transmitting on the channel until the end of the time-interval of length . The system incurs regret until all K users have settled on K channels, and we call this duration the fixing time. The remaining part of the algorithm corresponds to each of the K users employing the modified Exp3.P algorithm, where they choose a channel once every time units.

C. Multi-user MAB: Analysis

In this subsection, we first consider the regret due to the collision resolution phase, then the regret due to the modified Exp3.P part of Algorithm 7, and then combine them to find an upper bound on the system-wide regret incurred when each user independently employs Algorithm 7.

1) Regret during collision resolution: Theorem 4: The expected regret accumulated by the system during a collision resolution phase is upper bounded by

Proof: We first note from equation (4) that the probability of choosing any channel by any user is at least . Let , which implies that . Let “maximal” refer to the channel that has the highest probability of being chosen by that particular user. Thus, each user can be associated with one channel such that probability of choosing it is greater than . Since , for each user, there exists at least one channel such that it not the maximal channel for any of the remaining users. Note that even when some users fix or settle on a channel, and there are both unfixed channels and unfixed users in the system, we can still find an unfixed channel such that it is not the maximal channel for the remaining unfixed users.

Based on the above discussion, we define the event to be the event where all unfixed users except user k choose their maximal arm, and user k chooses an unfixed arm that is not the maximal arm for any other unfixed users.

Let denote the set of unfixed arms at time t. The probability of any user k being fixed at time t is given by,

Pr{User k being fixed}

The remainder of the proof follows in a similar manner as the proof of Theorem 3. 2) Regret due to Modified Exp3.P: We now bound the regret incurred by the users using Algorithm 7 during the time the users are not in the collision resolution phase. This corresponds to each of the K users independently employing the modified Exp3.P algorithm introduced in subsection IV-A. In Algorithm 7, when the users are not in the collision resolution phase, each user employs modified Exp3.P with

and . Using the result of Theorem 3 for K users, for any distinct set consisting of K arms,

, and does not depend on T. 3) Main Result: We now present the upper bound on the expected regret incurred by the users employing Algorithm 7. Theorem 5: The expected regret of K users using Algorithm 7 with M arms for T time units, is given by

where hM, K

, and does not depend on T. Thus, . Proof: The expected regret is due to collision resolution phase as well as the modified Exp3.P algorithm which is played a sub-linear number of times. Let denote the time taken for all the users to fix.

where the inequalities follow from Theorem 4 and equation (5), and

. If we choose y such that , we have which gives us

D. Unknown time horizon

In this subsection, we extend the results to the case of unknown time horizon. Each user considers some known time greater than the expected fixing time for the system and runs Algorithm 7. Once the user reaches the end of time , the user continues to use Algorithm 7 with a time-period of length . In this way when the user reaches the end of the previous time-period, the user doubles it and continues with Algorithm 7. Let T be such that , equivalently .

Theorem 6: The expected regret from using Algorithm 8 for T time units where is

where hM, K

Note that each user only needs knowledge of K in order to fix on an initial such that , where is the fixing time for all the users in the system. Furthermore, can be chosen even without the knowledge of K by simply replacing K by M, and the analysis follows because .

E. Dynamic case

In this subsection, we extend the results to a dynamic system with a changing number of users. Consider a system which starts with K users, and in which users leave the system once they are done with their transmission. It is easy to see that Algorithm 8 in this case leads to system-wide regret of the order over a time horizon T.

Let us now consider a dynamic system where users enter and leave the system over time. In order to use Algorithm 8 to obtain a sub-linear regret bound, we need to impose some restrictions on the number of users that have entered the system until time t, which we denote by . It is easy to see that the number of epochs in which users enter the system must be sub-linear in time to have sub-linear regret in the system. We restrict the number of users entering the system to be where . We note that this is similar to the dynamic case in [12] where there is a restriction on the number of users entering and leaving the system.

Let denote the number of active users at time t. Note that even in the dynamic scenario, we still retain the assumption of having in the system.

Theorem 7: The expected system-wide regret from using Algorithm 8 for T time units where with the number of users entering the system , with , is given by

where hM, M

Proof: We have . In epochs where no users enter the system, the regret can be bound by Theorem 6, and in epochs with new users, the regret accumulates through the entire epoch. The epoch length is upper bounded by , since from Theorem 5. The regret up to time T bounded as follows:

F. Experiments

In this section, we illustrate the performance of our algorithm in a simple adversarial setting. We consider a non-oblivious adversary, i.e., an adversary whose rewards do not depend on the users’ reward history.

We consider a system with known time-horizon T, fixed number of users K = 4 users and M = 7 channels. We set T = 160000, which gives us time units, and in Algorithm 7. The reward distributions for the channels are drawn i.i.d from the uniform distribution [a, 1] where a for each channel at each time unit is drawn i.i.d from the uniform distribution [0.2, 1].

Fig. 4: Accumulated regret as a function of time.

We repeat the experiment 100 times and consider the average accumulated regret with time. From Fig. 4, we see that the regret grows with time at a rate much lower than , but higher than , the expected regret in the single user case.

Remark 1: We note that Algorithm 7 can be used for a stochastic multi-user MAB with user-dependent rewards to achieve a sub-linear regret of order . While the regret is much higher than in [7], our algorithm does not rely on communication between the users and can also deal with a dynamic number of users in the system.

Remark 2: In the adversarial case, there is randomization in the selection of a channel, with being equivalent to , and hence each user does not transmit on a channel for a very long time. Thus, fairness is achieved without enforcing a strict duration for each transmission.

V. CONCLUSIONS

We modeled the dynamic spectrum allocation problem as a multi-user MAB with no communication among the users. We first considered a stochastic MAB model with rewards on the channel being the same for all users, and then an adversarial MAB model with user-dependent rewards. We showed that the proposed algorithms in both scenarios achieve sub-linear regret. We provided simulation results to show that the algorithms perform well in practice when the number of users is fixed. We also extended our algorithms to the dynamic case and showed that the algorithms continue to achieve sub-linear regret. It is of interest to develop algorithms in other variants of the multi-user MAB setting. For example, a system with user-dependent rewards, under the stochastic as well as the adversarial settings, without any user communication, when there are more users than channels in the system.

REFERENCES

[1] M. Bande and V. V. Veeravalli, “Multi-user multi-armed bandits for uncoordinated spectrum access,” in Proc. IEEE International Conference on Computing, Networking and Communications (ICNC), 2019.

[2] S. Bubeck, N. Cesa-Bianchi et al., “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

[3] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, Nov 2010.

[4] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011.

[5] O. Avner and S. Mannor, “Concurrent bandits and cognitive radio networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 66–81.

[6] J. Rosenski, O. Shamir, and L. Szlak, “Multi-player bandits–a musical chairs approach,” in International Conference on Machine Learning, 2016, pp. 155–163.

[7] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.

[8] J. Nieminen, R. Jantti, and L. Qian, “Time synchronization of cognitive radio networks,” in Global Telecommunications Conference, GLOBECOM. IEEE, 2009, pp. 1–6.

[9] O. Avner and S. Mannor, “Learning to coordinate without communication in multi-user multi-armed bandit problems.” arXiv preprint arXiv:1504.08167, 2015.

[10] C. Tang and C. Monteleoni, “On Lloyds algorithm: New theoretical insights for clustering in practice,” in Artificial Intelligence and Statistics, 2016, pp. 1280–1289.

[11] A. Kumar, Y. Sabharwal, and S. Sen, “A simple linear time (1+/spl epsiv/)-approximation algorithm for k-means clustering in any dimensions,” in Proceedings of 45th Annual IEEE Symposium on Foundations of Computer Science, 2004, pp. 454–462.

[12] M. Bande and V. V. Veeravalli, “Multi-user multi-armed bandits for uncoordinated spectrum access,” arXiv preprint arXiv:1807.00867, 2018.

VI. APPENDIX

We present a lemma that ensures a certain number of observations from each distribution during the estimation phase of

observations of each reward distribution on each arm with probability greater than .

we have that

where the first inequality follows from union bound and the second inequality follows from Chernoff bound. Note that for a particular k, m and is i.i.d across t, since all users are choosing channels uniformly at random. In order for this probability to be upper bounded by we need:

player k has of arm m with n users, .

We also need the total number of observations each player has of each arm to be at least , i.e.

So we have two constraints on , which gives us:

which can be further simplified to

A. Clustering

Let N points be drawn independently from distributions with mean where . Let number of samples drawn from distribution with mean be denoted by and the separability condition (1) is satisfied. Additional notation used is introduced in Table I.

We now present an additional separability condition which is useful in order to prove some clustering results. For any and ,

TABLE I: Notation.

We first present the following lemma which describes the relationship between the separability conditions (1) and (6). Lemma 5: If the separability condition (1) is satisfied and , then for any r, s, with high probability

If suffices to show that with high probability,

From Hoeffdings, we have

where the last inequality follows because from Lemma 4, we have .

We now present some lemmas that are useful for proving that after clustering, the centroids are closer to the means of the distributions from which they are drawn.

Lemma 6: If the separability condition (6) is satisfied, then after using Cluster algorithm, we have that for any fixed and , with probability greater than ,

From Lemma 7, after the approximation algorithm, we have and . If we want which gives . From Lemma 9, which we need to be less than this giving us c > 16. From this and Lemma 8, the conditions for Lemma 10 are satisfied and . Thus we have,

For each denotes independently drawn bounded random variables from reward distribution with mean , we use Hoeffding’s lemma.

Lemma 7: An approximation algorithm returns the set of centroids where C(x) returns the centroid of the cluster to which x belongs. We have such that and .

which is a contradiction. We now show that which proves that .

Now we show that . For any s, r,

Since this is true for all r, s, we have

where the last inequality follows from the definition of .

where the second inequality follows from and the last from the definition of .

Note that the first statement with the definition of also implies for l = r, s

which gives us

where the first and second inequalities follow from the separability condition and Lemma 8 respectively. This gives us and similarly we also have .

designed for accessibility and to further open science