b

DiscoverSearch
About
My stuff
Multi-User Multi-Armed Bandits for Uncoordinated Spectrum Access
2018·arXiv
Abstract
Abstract

A multi-user multi-armed bandit (MAB) framework is used to develop algorithms for uncoordinated spectrum access. The number of users is assumed to be unknown to each user. A stochastic setting is first considered, where the rewards on a channel are the same for each user. In contrast to prior work, it is assumed that the number of users can possibly exceed the number of channels, and that rewards can be non-zero even under collisions. The proposed algorithm consists of an estimation phase and an allocation phase. It is shown that if every user adopts the algorithm, the system wide regret is constant with time with high probability. The regret guarantees hold for any number of users and channels, in particular, even when the number of users is less than the number of channels. Next, an adversarial multi-user MAB framework is considered, where the rewards on the channels are user-dependent. It is assumed that the number of users is less than the number of channels, and that the users receive zero reward on collision. The proposed algorithm combines the Exp3.P algorithm developed in prior work for single user adversarial bandits with a collision resolution mechanism to achieve sub-linear regret. It is shown that if every user employs the proposed algorithm, the system wide regret is of the order  O(T34 )over a horizon of time T. The algorithms in both stochastic and adversarial scenarios are extended to the dynamic case where the number of users in the system evolves over time and are shown to lead to sub-linear regret.

image

The existing spectrum management paradigm treats frequency spectrum as a fixed commodity, which leads to spectrum under-utilization. Cognitive radio has emerged as a useful strategy to increase spectrum utilization. The existing literature on cognitive radio has largely been focused on the primary/secondary user paradigm, where secondary users need to detect vacant spectrum when available and vacate the occupied spectrum when a primary user wants to transmit.

We focus on a different type of spectrum sharing system in which there is no distinction between users, and in which there is no coordination among the users. The collective performance across all users is more important than that of individual users. This is in contrast to the typical primary/secondary user paradigm in which secondary users bear the responsibility for ensuring priority-based spectrum sharing. We model this system using an adversarial multi-user multi-armed bandit (MAB) framework [2]. Our goal is to design an efficient channel access mechanism by managing interference in the system through a decentralized policy across the users.

Multi-arm bandit formulations in stochastic multi-user cognitive radios without user coordination were considered in [3], [4], [5] and [6]. The algorithm in [3] is based on a time-division fair sharing (TDFS) of the best arms between users. Although the algorithm achieves order optimal regret asymptotically, it requires pre-agreement among users and it is assumed that the number of users is fixed and known to all users. The algorithm in [4] does not require any coordination between users and achieves optimal regret asymptotically, but assumes that the number of users is known. The algorithm in [5] combines an ϵ-greedy learning rule with a collision avoidance mechanism, and [6] considers a musical chairs algorithm. Both of these approaches achieve sub-linear regret and do not require knowledge of the number of users. However, it was assumed that the channel parameters are the same for all the users. A stochastic multi-user MAB with user dependent rewards on channel was considered in [7]. However, the algorithm considers coordination and communication between users via an auction algorithm.

In this work, we focus on two scenarios that have not been previously studied in the multi-user MAB setting for uncoordinated dynamic spectrum access. We assume that the number of users is unknown and that there is no communication between the users. However, we make the mild assumption that the users have access to a shared clock for time synchronization (see also, [6], [8], [9]).

We first study a stochastic multi-user MAB where the rewards on the channels are not user dependent. In our model, all users are treated equally and the reward obtained by each user largely depends on the actions of the other users. When multiple users access the same channel, we allow for a non-zero reward with the assumption that the reward for each user decreases as the number of users on the channel increases. Thus we include the case where there are more users than channels. This is in

Parts of this work were presented at ICNC [1] and submitted to ICASSP. M. Bande and V. V. Veeravalli are with the Coordinated Science Laboratory and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: mbande2@illinois.edu, vvv@illinois.edu). This research was supported by the US NSF WIFiUS Program under grant number CNS 14-57168 and by the US NSF SpecEES under grant number 1730882, through the University of Illinois at Urbana-Champaign.

contrast to the existing approaches, including [5] and [6], which focus on the primary/secondary user paradigm in the scenario where the reward distribution for a user is unknown but fixed. In particular, when multiple users access the same channel they receive zero reward. Hence, all these approaches fail when the number of users is greater than the number of channels.

We assume that the reward on the channel depends on the number of users on the channel and is drawn i.i.d from a distribution depending on the number of users on the channel. The degradation of the reward as a function of number of users depends on the system, e.g., the distance between the users, the protocol used for transmission (e.g., hybrid ARQ) and is captured through a reward distribution that depends on the number of users on the channel.

We propose an algorithm and show that if each user employs the algorithm, the system wide regret is O(1) in time, with high probability. The algorithm can be used for any number of users or channels. To the best of our knowledge, we are the first to provide sub-linear regret guarantees without user coordination when the number of users is greater than the number of channels.

In the second scenario, we study the adversarial multi-user MAB framework with user-dependent rewards. The adversarial bandit problem is an important variation of the MAB problem, where no stochastic assumption is made on the generation of rewards. The term “adversarial” refers to the mechanism choosing the sequence of rewards on each arm. If this mechanism is independent of the users actions, then the adversary is said to be oblivious. If the mechanism may adapt to the users’ past behaviors, then the adversary is said to be non-oblivious [2]. The existing literature on adversarial MABs is focused on the single user case, and a detailed overview of the proposed solutions for the adversarial MAB formulation can be found in [2]. The proposed algorithms in the single user adversarial setting achieve a sub-linear regret of  O(√T)over a time horizon T.

We consider multi-user dynamic spectrum allocation without any coordination among the users. We also assume that the rewards on each channel are user-dependent and may vary with time. Such a system is captured through a multi-user adversarial MAB model, particularly when the reward distribution for each channel and user may change over time. We propose an algorithm, and show that if each user employs the algorithm, the system wide regret is  O(T34 )over a time horizon T. To the best of our knowledge, we are the first to consider the multi-user setting for adversarial MABs and to provide sub-linear regret guarantees.

Let K be the number of users in the system. We initially assume that the users have unlimited data for transmission. In a more realistic setting, users may become active or inactive depending on their transmission needs; our dynamic setting covers this scenario. Each user can choose one among M channels for transmission. With M channels and K users attempting to access the spectrum, we assume that each user has prior knowledge of M, but not of K. The assumption of known M is reasonable if the spectrum partition is enforced and fixed. On the other hand, it is not realistic to assume the knowledge of K in an uncoordinated network.

We model the system as a multi-user MAB system with K users and M arms (channels). In each time unit t, let  Aktdenote the set of channels available to user k. User k chooses a channel  akt ∈ Aktbased on the reward history according to a certain policy and receives a reward  gkt. We assume that  gkt ∈ [0, 1], and that each user chooses a channel according to the same algorithm. The reward on each arm depends on the number of users who have chosen the arm. Let  ft = [ft(1), . . . , ft(M)]denote the number of users on each channel at time t, where �Mm=1 ft(m) = K. Thus, the reward  gkt (akt , ft(akt ))received by user k at time t is a function of the channel chosen  aktand the number of users on the channel  ft(akt ).

A. Stochastic setting

We model the system as a stochastic multi-user MAB system with K users and M arms (channels). Each user can choose one among M channels for transmission, where we allow for the possibility that  K ≥ M. We assume that the reward observed is inversely proportional to the number of users transmitting on the same channel. For example, the reward could be the rate achieved by the user on the channel which reduces due to interference from other users accessing the channel. Let  µ(m, f(m))denote the mean reward on channel m when the number of users on the channel is f(m). We assume that each user chooses a channel according to the same policy. We assume that  µ(m, f(m))becomes negligible for some  f(m) = β + 1, where  βdepends on the system. This restricts the number of users in the system as KM ≤ β.In order to ensure that one user does not monopolize a channel for an extended period of time, we impose the following condition. For each user, transmission on a particular channel takes place for a maximum of  Txtime units, after which the user releases the channel for at least  Txtime units before attempting to access the same channel.

We define the expected regret in the system as

image

where  f ∗ =argmaxf�Mi=1 f(i)µ(i, f(i))corresponds to the optimal number of users on each channel. To estimate the means on each channel as a function of number of users, we need to impose the following separability condition.

image

where  σ2is the variance of the distributions and c is a constant.

B. Adversarial setting

In this case, we model the system as an adversarial multi-user MAB with K users and M channels. We further restrict attention to the setting where there are more channels than users in the system i.e.,  K ≤ M. We assume that each user chooses a channel according to the same algorithm. For user  k ∈ [K], let  pkt = (pkt+1(1), ..., pkt+1(M))denote the probability vector across the arms, where  pkt (m)is the probability of choosing arm m at time t. We assume that the adversary chooses different reward for different users for the same channel. Let  gkt (akt , f(akt ))denote the reward observed by user k on choosing channel aktat time t. We assume that if more than one user chooses the same channel, they all receive zero reward. In other words, the users observe zero reward on collision. If there is no collision on the channel, the user observes a reward that is chosen by an adversary. Thus, we set  gkt (akt ) = 0when  f(akt ) > 1.

We adopt the standard notion of pseudo-regret used for adversarial bandits in [2]. The expected total regret in the system until time T is defined as

image

In this section, we focus on the stochastic multi-user MAB with user-independent rewards on each channel. We present an algorithm which leads to sub-linear regret with high probability, and extend it to the dynamic case.

A. Algorithm

The algorithm has two phases. The first is an estimation phase during which we estimate the number of users K and µ(m, f(m)), the average mean reward on each channel as a function of the number of users on the channel. The second is an allocation phase where the users arrange themselves in a way that minimizes system regret.

image

We estimate the number of users by keeping track of the number of collisions similar to [6], with the estimate given by ˆK = min{1 +round�ln( T0−ηcT0 )ln(1− 1M )

We estimate  µ(m, n)separately for each channel based on the reward x(m) observed on the corresponding channel, by clustering the samples using the k-means algorithm. We employ the algorithm Cluster (see Algorithm 2) inspired by [10]. We

image

are interested in finding the centroids of the clusters rather than the correct classification of all the samples. Hence, we use an  α-approximation algorithm with a run time  Tcto find the estimates the centroids of the cluster and show that we get good estimates with high probability. We consider the approximation algorithm in [11] with a run time  Tc ∼ O(T0).After obtaining estimates for  ˆµ(m, f)and ˆK, the optimal number of users on each channel  f ∗can be calculated. We use Alloc (see Algorithm 3) to ensure that each user settles or ‘fixes’ on a channel m , for which the number of users less than f ∗(m). That is, on finding a channel m with  µ(m, f(m)) ≤ µ(m, f ∗(m)), the user keeps transmitting on it for at most  Txtime units. The system incurs regret until all users have settled on some channels, and we call this duration the fixing time. Once all the users have settled on their channels the system does not incur regret. However in our system model, a user can transmit on a channel for at most  Txtime units, after which the user must switch. We assume that  Txis fixed for all the users but can vary with time. We use Permute (see Algorithm 4) to construct an efficient allocation for which the regret does not grow with time. In order to avoid system-wide regret every time users have to switch, we fix the ordering of each user after  N0epochs; this can be done for any  N0 ≥ 2. Our goal is to have each user transmit on all the channels. This is the coupon collector problem with each user having to collect M channels with the expected number of trials  N0 ∼ O(MlogM). When  K ≤ M, in order to have efficient allocation so that the regret does not grow with time, after the first epoch, each user switches to the next channel among the set of K best channels.

We fix the epoch size to be  Tx + Tf, where  Tfis the expected time taken for all the users to fix on a channel. After N0epochs, we continue with an epoch size of  Tx. We assume that 2  maxm f ∗(m) ≤ �m f ∗(m)to ensure that after every transmitting for  Txtime units, each user has other available channels. Note that our algorithm works even when  K ≤ M, in which case it reduces to a version of the algorithm in [6].

B. Analysis

We first investigate the case where K > M. We show that if all the users in the system use Algorithm 1, with high probability, the expected regret is O(1) in T. 1) Estimation phase: We now show that, with high probability, we have the correct estimates for  µ(m, f(m)). More precisely, we find estimates  ˆµk(m, n)such that  |ˆµk(m, n) − µ(m, n)| ≤ ϵwith high probability. Lemma 1: For any fixed  ϵ, δ, user k, channel m and number of users on the channel  n ≤ βthe estimate  ˆµk(m, n)obtained after running the algorithm for  T0 =

rounds, we have with probability at least  1 − δ,

image

Proof: Let  A1denote the event that there is at least one combination k, m, n such that  |ˆµk(m, n) − µ(m, n)| ≥ ϵand  A2denote the event that each player has more than 16ϵ2 ln 2MKβ(β+1)δobservations from distribution with mean  µ(m, n)for each m, n.

image

image

where the inequality follows from union bound. To show that  Pr(A1|A2) ≤ δ2, it suffices to show that  Pr(|ˆµk(m, n)−µ(m, n)| ≥ϵ|A2) ≤ δ2MK(β+1)which follows from Lemma 6 in the appendix with  δ ← δ2MK(β+1)for  n ≥ 2and follows from Hoeffding’s inequality for n = 1.

Lemma 2: For any  δ, if we run the estimation phase of the algorithm for  T0 ≥ ⌈

image

which is equivalent to showing

image

Hence, we choose  ϵ2 ≤ 0.49M exp( K−1M−1 ).

2) Allocation phase: We now find bounds on the expected regret during each fixing phase, given that the estimates of µ(m, f(m))and K are accurate. Lemma 3: The expected regret accumulated by the system during a fixing phase is upper bounded by

image

Pr(User k being fixed)

image

where (a) follows because we only consider one term in the each of the summations with i = 0, and (b) follows from (1 − 1x)x−1 ≥ 1exp(1)for  x ≥ 1. Thus for any user k, the expected fixing time is given by

image

and thus the regret during the fixing phase is given by,

image

where  Rktdenotes the regret incurred by user k at time t and we have  Rkt ≤ 1by our assumption on the reward distribution.

image

Analysis for  K ≤ M

For the case where  K ≤ M, there is no need for clustering. We only need the estimates for  µ(m, 1), and all users individually choose the best K channels. This reduces to the “musical chairs” algorithm and the analysis can be found in [6]. After fixing on a channel during the first allocation time, after every  Txtime units, each user switches to the next channel among the K best channels.

3) Main Result: We now present the upper bound on the expected regret incurred by the users employing Algorithm 1.

Theorem 1: For any fixed  ϵand  δ ∈ (0, 1), with probability greater than  1 − δ, the expected regret for K users using Algorithm 1 with M arms for T rounds, with parameter  T0 =

given by

image

i.e.,  E[R(T)] ∼ O(1)in T. Proof: The expected regret is due to regret during the estimation phase as well as the allocation phase. Let  Tfdenote the time taken for all the users to fix.

image

From Lemmas 1 and 2, we have the correct estimates for  µ(m, f(m))and K with high probability, with an estimation phase of  T0 + Tctime units. Thus,  K(T0 + Tc)corresponds to the regret accumulated system-wide during the estimation phase. Here  Tcdenotes the time used for running the  α-approximation algorithm for clustering. In the allocation phase, the regret in the system is accrued only during the  N0number of fixing phases. From Lemma 3, the regret in each fixing phase is  K2Mexp( K−1M−1).

C. Dynamic case

We now extend the results to a dynamic system with a changing number of users. The key idea is to run Algorithm 1 repeatedly across epochs. However, in order to obtain a sub-linear regret bound, we need to impose some restrictions on the number of epochs, and on the way users enter or leave the system. It is easy to see that the number of epochs  Nemust be sub-linear in time to have sub-linear regret in the system. We restrict the number of users entering and leaving the system until time t, which we denote by  ∆tto be  O(tζ)where  ζ < 12. We note that this is different from [6] where the time horizon is fixed and known, and there is also a restriction on when users can enter or leave the system. In our model, the dynamic scenario also includes the case where  Ktcan go from greater than M to less than M, and vice-versa.

Let  Ktdenote the number of active users at time t, where KtM ≤ β. Note that all the theorems in subsection III-B follow for the dynamic case with  Kt ≤ Mβ. We choose the starting epoch length  τto be greater than or equal to  T (1)0 +T (1)c +N0(Tx+Tf). We run Algorithm 1 for time  τ, 2τ, 3τ, and so on. The resulting algorithm is given below.

image

Theorem 2: With a probability greater than  1 − δ, the expected system-wide regret after running the Algorithm 5 for T rounds where  τ r(r+1)2 ≤ T ≤ τ (r+1)(r+2)2is

image

Proof: We have  τ r(r+1)2 ≤ Twhich gives us  r ≤ ( 2Tτ )12. The epoch length is changing with time. The total number of epochs  Neuntil time T is  Ne ≤ (r + 1) ∼ O(T12 ).We consider Theorem 1 with  δset as in the algorithm Dynamic allocation. Thus we have  T (r)0 ∼ O(ln r) ∼ O(ln T). Using union bound we show that with a probability at least  1 − δ, we have good estimates for  µand  Ktover all epochs.

image

In epochs with fixed or static users, the accumulated regret follows from Theorem 1, and in epochs with dynamic users, the system incurs regret during the entire epoch.

image

D. Experiments

In this section, our goal is to validate the performance of the estimation phase in the algorithm and show that the performance in the allocation phase does not suffer due to use of the estimated values i.e., the regret does not grow with time in the allocation phase.

We consider a system with K = 10 users and M = 6 channels and the non dynamic case. We set  T0 = 1000, Tx = 1000time units and  N0 = 5and repeat the experiment 100 times and consider the average accumulated regret. The value of  βis set to 3, and the reward distributions are chosen to be uniform with a variance of 0.01, and means between 0 and 1 given below,

image

We compare the performance of Algorithm 1 with the estimated values of  µand K with Algorithm 1 with the true parameter values. We also show how the estimates change with number of iterations in the estimation phase  T0. We used the in-built MATLAB kmeans function for clustering.

image

Fig. 1: Accumulated regret as a function of time.

From Fig. 1, we see that the accumulated regret grows with time during the estimation phase and remains constant during the allocation phase. Also, there is no noticeable difference between Algorithm 1 with the true parameter values and the one with the estimated values. This follows because the estimates of K and the mean converge to the true values within a few iterations as shown in Fig. 2 and Fig. 3.

image

Fig. 2: Error in the estimation of number of users K.

image

Fig. 3: Error in the estimation of the mean.

In this section, we consider the adversarial multi-user MAB model with user-dependent rewards on each channel. We present an algorithm that leads to sub-linear regret, and extend it to the dynamic case.

A. Single user MAB

We consider the Exp3.P algorithm described in [2] for a single user MAB in an adversarial setting. We modify the algorithm so that the user chooses an arm and updates the probability vector only in a few time units. This modification is useful in the multi-user case, where the users may not choose an arm in each time unit due to possible collisions. We now present a modified version of the Exp3.P algorithm, in which a new arm is chosen and the probability is updated at time units  t1, t2, . . . , tnsuch that  n ≤ Tand  α = maxj∈[n−1] tj+1 − tj. For each  j ∈ [n], we consider the reward over the time-period  tj+1 − tj, with the reward being normalized to lie between 0 and 1.

image

Theorem 3: The expected regret of Modified Exp3.P algorithm (Algorithm 6) until time T is given by E��Tt=1 gt(m) − gt(at)�

image

where  g′j(m) =

the regret bound for Exp3.P given in [2].

B. Multi-user MAB: Algorithm

We now consider the multi-user adversarial bandits under a known finite horizon T, and propose an algorithm which when employed by all users independently leads to sub-linear regret.

In a multi-user adversarial system, every time t that a user k chooses an arm according to a certain probability distribution pktto randomize against the adversary, there is a possibility for collision with other users. Hence there is a need for a collision resolution mechanism, so that the regret does not grow linearly with time. Instead of choosing an arm every time unit, a user chooses an arm only a sub-linear number of times until T ( e.g.,  T ywhere y < 1). The goal is to randomize sufficient number of times so as to counteract the adversary, while making sure that the regret due to collisions does not become large.

We propose an algorithm (Algorithm 7) that combines the modified Exp3.P algorithm (Algorithm 6) with a collision resolution mechanism with y < 1. In the analysis in Section IV-C, we pick  y = 12which is large enough to maintain the sub-linear regret achieved by the modified Exp3.P algorithm but small enough so that the regret due to collisions is sub-linear as well.

In every time-interval of length  T 1−y, we first have a collision resolution phase. Each user chooses a channel with probability pkt. A user settles or fixes on a channel if at any time the user finds a channel without collision. Once a user settles on a channel, the user keeps transmitting on the channel until the end of the time-interval of length  T 1−y. The system incurs regret until all K users have settled on K channels, and we call this duration the fixing time. The remaining part of the algorithm corresponds to each of the K users employing the modified Exp3.P algorithm, where they choose a channel once every  T ytime units.

C. Multi-user MAB: Analysis

In this subsection, we first consider the regret due to the collision resolution phase, then the regret due to the modified Exp3.P part of Algorithm 7, and then combine them to find an upper bound on the system-wide regret incurred when each user independently employs Algorithm 7.

image

1) Regret during collision resolution: Theorem 4: The expected regret accumulated by the system during a collision resolution phase is upper bounded by

image

Proof: We first note from equation (4) that the probability of choosing any channel by any user is at least γM. Let ρkt = maxm pkt (m), which implies that  ρkt ≥ 1M. Let “maximal” refer to the channel that has the highest probability of being chosen by that particular user. Thus, each user can be associated with one channel such that probability of choosing it is greater than 1M. Since  K ≤ M, for each user, there exists at least one channel such that it not the maximal channel for any of the remaining  K − 1users. Note that even when some users fix or settle on a channel, and there are both unfixed channels and unfixed users in the system, we can still find an unfixed channel such that it is not the maximal channel for the remaining unfixed users.

Based on the above discussion, we define the event  Bkto be the event where all unfixed users except user k choose their maximal arm, and user k chooses an unfixed arm that is not the maximal arm for any other unfixed users.

Let  Mtdenote the set of unfixed arms at time t. The probability of any user k being fixed at time t is given by,

Pr{User k being fixed}

image

The remainder of the proof follows in a similar manner as the proof of Theorem 3. 2) Regret due to Modified Exp3.P: We now bound the regret incurred by the users using Algorithm 7 during the time the users are not in the collision resolution phase. This corresponds to each of the K users independently employing the modified Exp3.P algorithm introduced in subsection IV-A. In Algorithm 7, when the users are not in the collision resolution phase, each user employs modified Exp3.P with  n = T y

and  α = T 1−y. Using the result of Theorem 3 for K users, for any distinct set  K ⊆ [M]consisting of K arms,

image

image

Mln M, and does not depend on T. 3) Main Result: We now present the upper bound on the expected regret incurred by the users employing Algorithm 7. Theorem 5: The expected regret of K users using Algorithm 7 with M arms for T time units, is given by

image

where h′(M, K) = K�5.15√M ln M +�

Mln M + KM K√M ln M�, and does not depend on T. Thus,  E[R(T)] ∼ O(T34 ). Proof: The expected regret is due to collision resolution phase as well as the modified Exp3.P algorithm which is played a sub-linear number of times. Let  Tfdenote the time taken for all the users to fix.

image

where the inequalities follow from Theorem 4 and equation (5), and  h(M) = 5.15√M ln M +�

Mln M. If we choose y such that 3y2 = 1 − y2, we have  y = 12which gives us

image

D. Unknown time horizon

In this subsection, we extend the results to the case of unknown time horizon. Each user considers some known time  τgreater than the expected fixing time for the system and runs Algorithm 7. Once the user reaches the end of time  τ, the user continues to use Algorithm 7 with a time-period of length  2τ. In this way when the user reaches the end of the previous time-period, the user doubles it and continues with Algorithm 7. Let T be such that  τ + 2τ + . . . + 2rτ ≤ T ≤ τ + 2τ + . . . + 2(r+1)τ, equivalently  2(r+1)τ ≤ T + τ < 2(r+2)τ.

image

Theorem 6: The expected regret from using Algorithm 8 for T time units where  (2(r+1) − 1)τ ≤ T < (2(r+2) − 1)τis

image

where h′(M, K) = K�5.15√M ln M +�

image

Note that each user only needs knowledge of K in order to fix on an initial  τsuch that  τ ≥ ETf, where  Tfis the fixing time for all the users in the system. Furthermore,  τcan be chosen even without the knowledge of K by simply replacing K by M, and the analysis follows because  K ≤ M.

E. Dynamic case

In this subsection, we extend the results to a dynamic system with a changing number of users. Consider a system which starts with K users, and in which users leave the system once they are done with their transmission. It is easy to see that Algorithm 8 in this case leads to system-wide regret of the order  O(T34 )over a time horizon T.

Let us now consider a dynamic system where users enter and leave the system over time. In order to use Algorithm 8 to obtain a sub-linear regret bound, we need to impose some restrictions on the number of users that have entered the system until time t, which we denote by  κt. It is easy to see that the number of epochs in which users enter the system must be sub-linear in time to have sub-linear regret in the system. We restrict the number of users entering the system  κtto be  O(tζ)where  ζ < 12. We note that this is similar to the dynamic case in [12] where there is a restriction on the number of users entering and leaving the system.

Let  Ktdenote the number of active users at time t. Note that even in the dynamic scenario, we still retain the assumption of having  Kt ≤ Min the system.

Theorem 7: The expected system-wide regret from using Algorithm 8 for T time units where  (2(r+1) − 1)τ ≤ T <(2(r+2) − 1)τwith the number of users entering the system  κT ∼ O(T ζ), with  ζ < 12, is given by

image

where h′(M, M) = M�5.15√M ln M +�

Proof: We have  2(r+1)τ ≤ τ + T. In epochs where no users enter the system, the regret can be bound by Theorem 6, and in epochs with new users, the regret accumulates through the entire epoch. The epoch length is upper bounded by  (2(r+1)τ)12, since  y = 12from Theorem 5. The regret up to time T bounded as follows:

image

F. Experiments

In this section, we illustrate the performance of our algorithm in a simple adversarial setting. We consider a non-oblivious adversary, i.e., an adversary whose rewards do not depend on the users’ reward history.

We consider a system with known time-horizon T, fixed number of users K = 4 users and M = 7 channels. We set T = 160000, which gives us  T12 = 400time units,  φ = 0.026, η = 0.025and  γ = 0.194in Algorithm 7. The reward distributions for the channels are drawn i.i.d from the uniform distribution [a, 1] where a for each channel at each time unit is drawn i.i.d from the uniform distribution [0.2, 1].

image

Fig. 4: Accumulated regret as a function of time.

We repeat the experiment 100 times and consider the average accumulated regret with time. From Fig. 4, we see that the regret grows with time at a rate much lower than  T34, but higher than  T12, the expected regret in the single user case.

Remark 1: We note that Algorithm 7 can be used for a stochastic multi-user MAB with user-dependent rewards to achieve a sub-linear regret of order  O(T34 ). While the regret is much higher than in [7], our algorithm does not rely on communication between the users and can also deal with a dynamic number of users in the system.

Remark 2: In the adversarial case, there is randomization in the selection of a channel, with  τ12being equivalent to  Tx, and hence each user does not transmit on a channel for a very long time. Thus, fairness is achieved without enforcing a strict duration  Txfor each transmission.

We modeled the dynamic spectrum allocation problem as a multi-user MAB with no communication among the users. We first considered a stochastic MAB model with rewards on the channel being the same for all users, and then an adversarial MAB model with user-dependent rewards. We showed that the proposed algorithms in both scenarios achieve sub-linear regret. We provided simulation results to show that the algorithms perform well in practice when the number of users is fixed. We also extended our algorithms to the dynamic case and showed that the algorithms continue to achieve sub-linear regret. It is of interest to develop algorithms in other variants of the multi-user MAB setting. For example, a system with user-dependent rewards, under the stochastic as well as the adversarial settings, without any user communication, when there are more users than channels in the system.

[1] M. Bande and V. V. Veeravalli, “Multi-user multi-armed bandits for uncoordinated spectrum access,” in Proc. IEEE International Conference on Computing, Networking and Communications (ICNC), 2019.

[2] S. Bubeck, N. Cesa-Bianchi et al., “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

[3] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, Nov 2010.

[4] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011.

[5] O. Avner and S. Mannor, “Concurrent bandits and cognitive radio networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 66–81.

[6] J. Rosenski, O. Shamir, and L. Szlak, “Multi-player bandits–a musical chairs approach,” in International Conference on Machine Learning, 2016, pp. 155–163.

[7] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.

[8] J. Nieminen, R. Jantti, and L. Qian, “Time synchronization of cognitive radio networks,” in Global Telecommunications Conference, GLOBECOM. IEEE, 2009, pp. 1–6.

[9] O. Avner and S. Mannor, “Learning to coordinate without communication in multi-user multi-armed bandit problems.” arXiv preprint arXiv:1504.08167, 2015.

[10] C. Tang and C. Monteleoni, “On Lloyds algorithm: New theoretical insights for clustering in practice,” in Artificial Intelligence and Statistics, 2016, pp. 1280–1289.

[11] A. Kumar, Y. Sabharwal, and S. Sen, “A simple linear time (1+/spl epsiv/)-approximation algorithm for k-means clustering in any dimensions,” in Proceedings of 45th Annual IEEE Symposium on Foundations of Computer Science, 2004, pp. 454–462.

[12] M. Bande and V. V. Veeravalli, “Multi-user multi-armed bandits for uncoordinated spectrum access,” arXiv preprint arXiv:1807.00867, 2018.

We present a lemma that ensures a certain number of observations from each distribution during the estimation phase of

image

observations of each reward distribution on each arm with probability greater than  1 − δ2.

image

we have that

image

where the first inequality follows from union bound and the second inequality follows from Chernoff bound. Note that for a particular k, m and  n, Ak,m,nis i.i.d across t, since all users are choosing channels uniformly at random. In order for this probability to be upper bounded by δ2we need:

image

player k has of arm m with n users, �T0t=1 Ak,m,n (t) > 12T0E [Ak,m,n(t)].

We also need the total number of observations each player has of each arm to be at least 16ϵ2 ln 2MKβ(β+1)δ, i.e.

image

So we have two constraints on  T0, which gives us:

image

which can be further simplified to

image

A. Clustering

Let N points  {xi, . . . , xN}be drawn independently from  βdistributions with mean  µrwhere  r ∈ [β]. Let number of samples drawn from distribution with mean  µrbe denoted by  nrand the separability condition (1) is satisfied. Additional notation used is introduced in Table I.

We now present an additional separability condition which is useful in order to prove some clustering results. For any m ∈ [M]and  r, s ∈ [β],

image

TABLE I: Notation.

We first present the following lemma which describes the relationship between the separability conditions (1) and (6). Lemma 5: If the separability condition (1) is satisfied and  N = T0 =32 exp( K−1M−1 )Mϵ2 ln 2MKβ(β+1)δ, then for any r, s, with high probability

image

If suffices to show that with high probability,

image

From Hoeffdings, we have

image

where the last inequality follows because from Lemma 4, we have  ns ≥ 16ϵ2 ln 2MKβ(β+1)δ.

image

We now present some lemmas that are useful for proving that after clustering, the centroids are closer to the means of the distributions from which they are drawn.

Lemma 6: If the separability condition (6) is satisfied, then after using Cluster algorithm, we have that for any fixed  ϵ, δand  ns ≥ Nϵ,δ = ⌈ 16ϵ2 ln( βδ )⌉, with probability greater than  1 − δ,

image

From Lemma 7, after the  αapproximation algorithm, we have  ∆s ≤ 2(α + 1) φ∗nsand  γ < 2(α+1)c. If we want  γ ≤ 18which gives  a < c16 − 1. From Lemma 9,  ρsin + ρsout ≤ 8cwhich we need to be less than  12this giving us c > 16. From this and Lemma 8, the conditions for Lemma 10 are satisfied and  γ < 18. Thus we have,

image

For each  r ∈ [β], Ss ∩ Trdenotes independently drawn bounded random variables from reward distribution with mean  µr, we use Hoeffding’s lemma.

image

Lemma 7: An  αapproximation algorithm returns the set of centroids  {ν1, . . . , νβ}where C(x) returns the centroid of the cluster to which x belongs. We have  ∀Ts ∃νssuch that  |νs − µs| ≤ 2(α + 1) φ∗nsand  γ < 2(α+1)c.

image

which is a contradiction. We now show that  φT ≤ 2φ∗which proves that  ∆s ≤ 2(α + 1) φ∗ns.

image

Now we show that  γ ≤ 2(α+1)c. For any s, r,

image

Since this is true for all r, s, we have

image

where the last inequality follows from the definition of  γ.

image

where the second inequality follows from  x ∈ Srand the last from the definition of  γ.

image

Note that the first statement with the definition of  γalso implies for l = r, s

image

which gives us

image

where the first and second inequalities follow from the separability condition and Lemma 8 respectively. This gives us  ρsout ≤2(1−4γ)cand similarly we also have  ρsin ≤ 2(1−4γ)c.

image


Designed for Accessibility and to further Open Science