Learning Algorithms for Minimizing Queue Length Regret

2020·arXiv

Abstract

I. INTRODUCTION

IN this work, we consider a statistical learning problem that is motivated by the following application. Consider a wirelesscommunication system consisting of a single transmitter/receiver pair and N channels over which they may communicate. Packets randomly arrive to the transmitter’s queue and wait in the queue until they are successfully delivered to the receiver. At each time slot, the transmitter can decide to transmit a frame on one of the N channels. If the queue is non-empty, the transmitted frame carries a packet with it over the channel, and if the frame is then successfully received, the packet is successfully delivered. For each channel, each frame transmission attempt is successful according to a probability that is initially unknown. The transmitter is informed whether a transmission was successful via receiver feedback that immediately follows each transmission. The objective of the controller is to minimize the queue’s backlog by using the receiver feedback to quickly identifying the best channel to transmit on. In the above application, each successful frame transmission offers one packet’s worth of service to the queue. Thus, the channels in the above application behave like servers in a general queueing system that, when selected, offer a random amount of service. Given the above motivation, we consider the problem of identifying the best of N available servers to minimize the queue’s backlog. To this end, we associate a one unit delay-cost to each time slot that each packet has to wait in the queue. To obtain good performance, the controller must schedule the servers to explore the offered service rate that each gives and also exploit its knowledge to schedule the server that appears to give the most service. We define the queue length regret as , where is the backlog under a learning policy and is the backlog under a controller that knows the best server. Our objective is to find a policy that minimizes this regret. Our problem is closely related to the stochastic multi-armed bandit problem. In this problem, a player is confronted by a set of N possible actions of which it may select only one at a given time. Each action provides an i.i.d. stochastic reward, but the statistics of the rewards are initially unknown. Over a set of T successive rounds, the player must select from the actions to both explore how much reward each gives as well as exploit its knowledge to focus on the action that appears to give the most reward. Learning policies for solving the multi-armed bandit have long been considered [2]. Historically, the performance of a policy is evaluated using regret, which is defined to be the expected difference between the reward accumulated by the learning policy and a player that knows, a priori, the action with highest mean reward. This quantifies the cost of having to learn the best action. It is well known that there exist policies such that the regret scales on the order of log T and that the order of this bound is tight [3]. In the seminal work of [4], policies for achieving an asymptotically efficient regret rate were derived, and subsequent work in [5]–[7] have provided simplified policies that are often used in practice.

This work was sponsored by NSF Grants AST-1547331 and CNS-1701964, and by Army Research Office (ARO) grant number W911NF-17-1-0508. This material is based upon work supported by the United States Air Force under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force.

One approach to solving our problem would be to simply use traditional bandit algorithms. In this approach, the reward in the bandit problem would be viewed as offered service in our problem, and the resulting methods would focus on maximizing the rate of offered service to the queue. In the context of the wireless system described above, this is equivalent to targeting a high rate of successful frame transmissions without regard to which frames are packet-bearing. Note that in an infinitely backlogged system, which always has packets available to send, this approach would maximize throughput. Since, as a rule of thumb, higher service throughputs generally correspond with lower packet delays in queueing systems, this approach has merit.

However, the drawback with this approach is that it does not exploit the fundamental queueing dynamics of the system. During time periods when the queue is empty, the offered service is unused, and the controller can, therefore, freely poll the servers without hurting its objective. This contrasts with non-empty periods, when exploring can be costly since it potentially uses a suboptimal server. However, a controller cannot take this lesson too far and restrict its explorations only to time slots when the queue is empty. This is because some servers may have a service rate that is below the packet arrival rate, and if the controller refuses to explore during non-empty periods, it may settle on destabilizing actions that cause the backlog to grow to infinity.

Instead, policies should favor exploration during periods when the queue is empty but perform enough exploration during non-empty periods to maintain stability. In this work, we show that for systems that have at least one server whose service rate is greater than the arrival rate there exist queue-length based policies such that .1 Likewise, we show that any traditional bandit learning algorithm, when applied to our setting, can have queue length regret . We additionally show that if there does not exist a server with a service rate that is greater than the arrival rate, for any policy the queue-length regret can grow as .

The problem considered herein is related to the work of [8], which also considered learning algorithms for scheduling service to a queue. The main difference is that in [8] the controller did not use the queue backlog to make decisions. As a result, the policies considered in that work were closely related to those in the bandit literature and focused on maximizing offered service. Under these policies, [8] showed that the tail of diminishes as , implying that is logarithmic. In this work, we show that by exploiting empty periods, the queue length regret is bounded.

Our problem can also be compared to the more general literature on reinforcement learning. In reinforcement learning, an actor seeks to learn the optimal policy for a sequential decision making problem with an evolving system state. The decision made by the actor at each state causes the actor to obtain a probabilistic reward and the system to randomly change its state. The objective is to learn how to obtain high reward by estimating the system’s statistics. See [9], [10] for a review of reinforcement learning. The online performance of reinforcement learning algorithms has been previously explored in [11] and [12], which used the upper confidence methods UCRL and UCRL2. The work of [13] has improved upon these bounds, and [14] extended the results to weakly-communicating problems. Thompson-sampling-inspired algorithms have also emerged as a design principle [15]. Posterior sampling reinforcement learning (PSRL), which is based on this principle, has become a popular method that can, in certain problems, outperform upper confidence methods [16]–[19].

Our work can be viewed as a reinforcement learning problem with known structure. The queue backlog is the problem’s state and a penalty is paid for inefficiently scheduling whenever the backlog is nonzero. Note that in our problem, the amount of regret that can be accrued at a given time is unbounded and the problem’s state space (the queue occupancy) is infinite. Our work has some overlap with the field of safe reinforcement learning [20]–[22]. We seek to find learning policies that do not destabilize the system and regularly return the queue backlog to the empty state. This structure separates our work from most of the previous literature that explores algorithms that are not designed to exploit queueing dynamics and are not focused on analyzing queue stability under different policies. The algorithms we design obtain their performance guarantees by using what is known about the problem, the dynamics of the queue and its impact on regret, while focusing on learning what is unknown, the server rates. Since the amount of service that a server will offer when scheduled is independent of the current queue backlog, our methods do not waste time trying to match the action-rewards to individual states, as would occur with a general reinforcement learning algorithm that knew nothing about the problem’s structure.

The use of statistical learning methods for optimizing wireless channel access in systems with uncertain channel conditions has been considered in the previous literature [23]–[35]. However, most previous works examined algorithms for maximizing transmission opportunities and did not focus on minimizing queueing delay. A major contribution of this work is to show that these two objectives are not necessarily the same. A learning algorithm that maximizes total offered service to a queue may not statistically minimize the queue backlog over time.

Scheduling algorithms that use queue state to make service decisions have a long history. In the seminal work of [36] and [37], the max-weight algorithm for assigning service to queues was shown to maximize throughput under complex scheduling constraints and probabilistic dynamics. This framework has been extended and applied to network switching [38], satellite communications [39], ad-hoc networking [40], [41], packet multicasting and broadcasting [42], packet-delivery-time reduction [43], multi-user MIMO [44], energy harvesting systems [45], and age-of-information minimization [46], [47]. In the works of

Fig. 1. Queueing system with N servers. At each time, one server may be scheduled. The rate at which each server can offer service is initially unknown.

[48] and [49], learning algorithms were used for achieving network stability under unknown arrival and channel statistics. The methods considered in those works augmented the max-weight algorithm with a statistical learning component. The resulting methods were shown to achieve stability but did not analyze queue length regret. In [50], learning algorithms for controlling networks with adversarial dynamics were considered and the max-weight algorithm and tracking algorithm were both explored as solutions for minimizing queue backlogs. However, the adversarial model considered in that work is pessimistic compared to the stochastic model considered herein.

The focus of this work is on proving the achievability of asymptotically-optimal regret growth. This requires that the controller samples all servers often enough to converge on the optimal server and achieve bounded regret. Importantly, the controller must concurrently guarantee that the queue does not blow-up to infinity and that exploration of sub-optimal servers is eventually limited to times when the queue is empty. The policies we use to prove our results do not optimize the queue length regret for finite-time analysis. As a result, although the tail of their regret is optimal, they can initially accrue large regret. We designed these policies to facilitate analysis, which is made difficult by the high-correlation between the queue backlog (which determines when to sample the servers to converge on the best server) and the observations of the servers (which determine the queue backlog and are used for identifying the best server). Our analyzed algorithms disentangle this relationship at the expense of good non-asymptotic performance. Nevertheless, we show through simulation that the insight gained from our analysis can inspire well-performing heuristics that exploit the structure found in our theoretical proofs. Analyzing and improving upon these heuristics is a potential direction for future research.

The rest of this paper is organized as follows. In Section II, we give the problem setup. We analyze the performance of traditional bandit algorithms, when applied to our problem, in Section III. In Sections IV and V, we characterize the regret of queue-length based policies when the best server has a service rate above and below the arrival rate, respectively. In Section VI, we use our theoretical analysis to define heuristic policies and test their performance through simulation. We conclude in Section VII. Note that a subset of the material in this paper appeared in [1].

II. PROBLEM SETUP

A. Problem Description

We consider a system consisting of a single queue and N servers (where integer ) operating over discrete time slots t = 0, 1, 2, . . . . See Fig. 1. Packets arrive to the queue as a Bernoulli process A(t) with rate and wait in the queue until they receive service from one of the servers. Each packet can be serviced no sooner than the next time slot after its arrival. Each server’s ability to offer service randomly fluctuates. Specifically, the number of packets that server can service at time t follows a Bernoulli process with rate . The arrival process and server processes are assumed to be independent of each other. We refer to server i as stabilizing if , which implies that the rate at which the server can provide service keeps up with the arrival process. Otherwise, we refer to it as non-stabilizing. Given the above, an instantiation of our problem is characterized by the tuple , where is the vector of service rates. Then, we let P be the set of all problems (i.e., tuples).

The main challenge in our problem is that the system can only activate one server at a given time. At each time slot, a system controller must select only one of the N servers and ask it to offer service. Then the selected server will inform the controller of the number of packets that it can offer to serve and, if the queue is non-empty, will service that number of packets. We denote the controller’s choice at time t as . For decision u(t) = i, the service offered to the queue is then denoted D(t) which is equal to . Throughout this work, the controller is not allowed to observe the offered service processes prior to making its decision u(t), and therefore it cannot know which servers will offer service prior to its choice.

Given the above, the queue backlog, Q(t), evolves as

where is used to denote the maximum of x and 0. We assume Q(0) = 0 and the controller knows this.

The system incurs a unit cost, in delay, for each time slot that each packet has to wait in the queue. Therefore, for a time horizon T, the controller wishes to minimize . Since, the controller cannot observe the values of prior to making its decision u(t), the optimal action is to select the server

to provide service to the queue. For simplicity, we assume is always unique.2 In our framework, the controller does not a priori know the values of and must therefore use observations of D(t) to identify . This implies that the service available from chosen server u(t) is revealed to the controller after time t, but the service that would have been available from all other servers remains hidden. Note that the controller can observe D(t) at all times t, even when Q(t) = 0. Finally, in this work, we assume that in addition to not knowing the values of , the system does not initially know the exact value of , either.

Given the above, the objective is to design a controller that uses its observations to decide which server to schedule at each time slot to minimize

To this end, denoting the history of all previous arrivals, service opportunities, and decisions as

we need to specify a controller policy , which at each time t uses any portion of H(t) to (possibly randomly) make a decision on u(t). Note that since H(t) contains the history of all previous arrivals and offered service to the queue, at time t, the policy implicitly knows the current backlog Q(t) as well (i.e., we could explicitly include Q(t) in H(t), but this would be redundant). We let be the set of all policies that we could design.

Define to be the queue backlog under the controller that always schedules , and let be the backlog under our policy that must learn which server is optimal. We will analyze the performance of using the following definition of queue length regret,

where the expectation is over the product measure of where P1 denotes the operation of and P2 the operation of the scheduler that always selects . We will often simply refer to this metric as regret in the rest of the paper. Observe that, if our policy minimizes (2) over the set , it must also minimize over the set.

B. Analyzing Queue Length Regret

Since at any time t, server has the highest probability of offering service to the queue, it is clear that the controller that always schedules must minimize the expected queue backlog. Therefore, for every time t and any policy ,

which implies that is monotonically increasing in T for all . Since, the queue begins empty at time 0, using (1), .

The focus in this work will be on characterizing how queue length regret can scale with time horizon T. Therefore, we will not focus on finding policies that optimize for some finite value of T. Instead, we will characterize as T goes to infinity. We will proceed to show that under different assumptions on and , the regret . Define the tail of the regret to be

Intuitively, this is the additional cost-to-go of infinitely running the policy past time T. Then the following proposition holds.

Proposition 1. If for policy and subset for each , then as for every .

Thus, we see that a policy that obtains O(1) queue length regret seeks to minimize the tail of the regret as time goes to infinity (i.e., causes the tail to trend to zero). A policy that optimizes the tail of the regret is not guaranteed to minimize the queue length regret. Indeed the policies we proceed to analyze in Section IV do not generally achieve small queue length regret, and are instead designed to facilitate proving that O(1) queue length regret is achievable. Nevertheless, we will show in Section VI that the insight gained from these policies can be used to construct heuristics with good regret performance.

III. REGRET OF TRADITIONAL BANDIT ALGORITHMS

In this section, we establish that the regret of any policy that does not use previous observations of the arrival process must have a queue length regret that grows logarithmically with T for some . Under this restriction, the policies in this section (in effect) do not observe previous arrivals to the queue, and their decision making process is solely focused on the observed offered services. Since the policies do not monitor process A(t), they cannot directly use the queue backlog Q(t) to make their decisions, either. Under this restriction, we may still borrow any strategy from the traditional stochastic multi-armed bandit literature to solve the problem. These policies focus on only maximizing offered service to the queue, using previous observations of the offered service to guide their decisions. Without loss of generality, the theorem is established for a system with two servers. The theorem’s proof makes use of a well-known lower bound on the performance of bandit learning algorithms [3, Theorem 2.2].

Let be the service rate when the service rates are sorted in increasing order.

Theorem 1. For any that does not use previous observations of the arrival process, with server rates and () such that for any fixed .

Denote the offered service from servers (1) and (2) as and respectively and the decision by policy at time t as . Then, from (3) and the queue evolution equation,

Thus, taking expectation and summing over time

Note that each defines a different decision process leading up to time t. For example, chooses server (2) with probability 1 at time , but follows the actions of policy at time (and chooses server (2) with probability 1 at time t, instead). One can then see that in the above, at each index t, we are comparing the performance of to the performance of a different decision process. Since the queue begins empty at t = 0, by (1), the expected queue backlog under any controller must be equal to 0 and at times 0 and 1, respectively. Thus, the left-hand side of (4) has been chosen as a summation from time slot 2 to .

Now, and u(t) are functions of events up to time and are therefore independent of and . Thus, we can write

Now, and {u(t) = (1)} are not necessarily independent events. However, we can use the fact that an arrival at time is a subset of the event that to lower bound the above (i.e., ). Then,

We then use the fact that policy is independent of the arrival process A(t) to obtain

Note that . Now, policy is a strategy for solving the stochastic multi-armed bandit problem (i.e., it uses the previous actions and observed offered service to decide on which server to schedule at time t). From [3, Theorem 2.2], there must exist a such that3

Without loss of generality, we can then assume the chosen has this property. Using equations (4) through (7), this implies

Then, because for all t,4

Therefore, by (8),

IV. REGRET OF QUEUE-LENGTH-BASED POLICIES

In this section, we examine the asymptotic scaling of regret for policies that make decisions using Q(t) (or equivalently, the previous observations of the arrival process). We do so by considering the problem for different subsets of P that impose different assumptions on the relationship between and . This section will build to the main result of this work, Theorem 4, which states that there exists a such that for any problem with .

We begin our analysis in Subsection IV-A under the assumption that every server is stabilizing (i.e., ). Under this assumption, the controller does not need to account for the possibility of system instability. As a result, the controller can limit itself to performing exploration only on time slots in which the queue is empty. During time slots in which the queue is backlogged, the controller will exploit its previously obtained knowledge to schedule the server it believes to be best.

In Subsection IV-B, we allow for both stabilizing and non-stabilizing servers in the system (i.e., is allowed to be less than for a non-empty subset of the servers). However, we will assume that the controller is given a randomized policy that achieves an offered service rate that is greater than the arrival rate to the queue. Note that the controller does not need to learn this policy and can use it to return the queue to the empty state. The given randomized policy is not required to have any relationship to and will not, in general, minimize queue length regret. As a result, to minimize queue length regret, the controller will not want to excessively rely upon it.

In Subsection IV-C, we further relax our assumptions on the problem and only require that . In this subsection, the controller will need to identify which servers are stabilizing, while simultaneously trying to minimize . This will require a policy that does not destabilize the system. To this end, the controller will have to explore the servers’ offered service rates during both time slots when the queue is empty and backlogged. Explorations during time slots when the queue is backlogged, in general, waste work and should therefore be performed sparingly. Intuitively, as the controller identifies which subset of servers have service rates , it can focus its explorations on time slots when the queue is empty.

The above three cases build upon one another. The insight from one will point to a policy for the next, and we will therefore analyze the above cases in sequence. Under each of the above assumptions, we will find that there exists a policy such that the regret converges for all meeting the assumption. In contrast, in Section V, we show that there does not exist a policy that can achieve convergent regret over the class of all problems for which .

Note that the main difficulty in evaluating a policy’s performance is the correlation between the time-evolving queue backlog Q(t) and the history of observed offered service, which is used to find the best server. In general, the queue will be empty more often once we begin to identify the best server, and, likewise, the best server will be more easily identified when we begin to obtain more empty periods in which we can freely explore. The policies considered in this section have been chosen to disentangle this relationship so that well-known tools from statistics can be applied in our analysis. Specifically, our results will make use of Hoeffding’s inequality, which we restate here for convenience.

At the first time slot of the period, set u(t) = i uniformly at random over [N] and update with the observed state of D(t) For other time slots in the empty period, idle

Fig. 2. Policy for achieving Theorem 2. Each variable is initialized to zero prior to its first update.

. Then, by Hoeffding’s inequality, for all x > 0,

For simplicity, we will often use Hoeffding’s inequality in our analysis even when tighter concentration inequalities may exist.

A. All Servers Are Stabilizing

In this subsection, we assume all servers are stabilizing and design policies that are able to obtain O(1) regret under this assumption. Specifically, we will assume that the environment will only choose parameters from a subset defined below.

Assumption 1. .

Under this assumption we will prove the following theorem, which states that there exists a policy such that, for every in the set , the regret converges. The proof of Theorem 2 is constructive and will give a policy for obtaining the result.

Theorem 2. Under Assumption 1, such that, for each .

To prove Theorem 2, we analyze the policy shown in Fig. 2 on an arbitrary . This policy maintains sample mean variables that estimate the servers’ service rates . Each sample mean is updated using observations of server i’s offered service. However, it is important to note that each sample mean is not updated at every time slot that we schedule its corresponding server, and we will therefore not use every observation of the offered services to construct the sample means. Instead, to facilitate analysis, we will only update the sample means at strategically chosen time slots. Throughout this section, sample mean variables are initialized to zero prior to their first update with a service observation. After the first update, the sample mean variables equal the sum of the used observations divided by the number of used observations.

Note that under policy , the queue backlog will transition through alternating time intervals, wherein the queue is continuously empty over an interval of time (i.e., ) and then continuously busy over an interval of time (i.e., ). We enumerate the periods using positive integers p = 1, 2, . . . and refer to the occurrence of an empty period (busy period) as empty period p (busy period p, respectively). To arrive at Theorem 2, we will analyze the integrated queue backlog taken over the busy periods. To this end, we define the integral of the queue backlog over busy period p to be the summation of the queue backlog taken over those times t that are in busy period p. By our problem formulation, a unit cost is incurred by the system for each time slot each packet waits in the queue, and the integral of the queue backlog over busy period p can then be interpreted as the total cost accumulated over the busy period. In Fig. 3, we illustrate our terminology on a sample path of .

We now briefly point out some aspects of policy . See Fig. 4, which provides an illustration of . Under the policy, the first time slot of each empty period is used to update one of the sample means. To do this, the policy chooses one of the servers uniformly at random and schedules it. The amount of service that is offered from the chosen server is then observed and used to update the server’s corresponding sample mean. Note that the sample means are only updated during these time slots, and all other observations of the servers’ offered services, that are made during other time slots, are not used in our

Fig. 3. Example sample path for over the first 16 time slots for some, unspecified, . Empty period 1 occurs over time slots 0 and 1. Busy period 1 begins in time slot 2 and ends with time slot 9. The duration of the busy period is 8 time slots, and the integral of the queue backlog over the busy period is 14. Empty period 2 begins at time slot 10, and busy period 2 begins at time slot 12. The integral of the queue backlog over busy period 2 is 4. Finally, empty period 3 is shown beginning in time slot 15.

Fig. 4. Example sample path of evolving over the first 30 time slots for some, unspecified, . The background colors indicate the type of action policy took during the corresponding time slots (see Fig. 2). During time slots marked in gray, randomly selected one of the servers i and updated its sample mean with the observed offered service. The sample means were only updated during these time slots. During time slots marked in orange, scheduled the server with the (so-far) largest sample mean to service the queue. During time slots with a white background, no action was taken.

estimations. Now, following an empty period, when a packet arrives to the empty queue, the system enters a busy period. At the start of each busy period, the policy chooses to schedule the server with the highest sample mean (breaking ties between sample means according to any arbitrary decision rule) and continues to schedule that server until the queue finally empties and the busy period ends. Thus, during each busy period the queue is serviced by only one server (namely, the one with highest sample mean). We again stress that the policy does not use the observations of the service obtained during the busy period to update the sample mean. Notionally, the policy performs no exploration during busy periods and instead focuses on exploiting previous observations to empty the queue quickly.

Now, suppose we are at the beginning of busy period p and policy chooses to schedule server i for this busy period. Then, the duration of the busy period (i.e., the amount of time that will be required to empty the queue) is given by a random variable , whose mean we denote . Note that conditioned on server i being scheduled during the busy period, is not a function of the busy period number p. By Assumption 1, for all we have that and therefore is finite. Now, to establish the proof of Theorem 2, we will be interested in the integral of the queue backlog taken over the busy period. Note that, as with the busy period’s duration, the integral of the queue backlog is given by a random variable, which we denote . Using to denote the arbitrary start of the busy period,

and has mean . As with the busy period duration, variable is not a function of the busy period number p.

We introduce one last piece of notation before moving to the proof of Theorem 2. At the start of each busy period, policy schedules the server with the highest sample mean to empty the queue (i.e., the server with the highest sample mean is scheduled to every time slot in the busy period until the queue empties). We will refer to this decision as scheduling the busy period to server i. In the following analysis, we will be interested in the frequency with which the policy schedules each server i to busy periods. Therefore, we will use to denote the number of busy periods that schedules to server i in the first P busy periods. We will use these variables to bound, in expectation, the number of busy periods that are scheduled to a server other than .

The proof of Theorem 2 now proceeds through four lemmas. We begin with Lemma 1. We denote the integral of the queue backlog taken over busy periods schedules to i as:

For example, if in Fig. 4 the first and fourth busy periods were scheduled to server 1 but the second and third busy periods were scheduled to server 2, then and . Lemma 1 shows that we can upper bound the queue length regret with the queue backlog summed over times that is not scheduling . The proof follows from a sample path argument.

Consider a fixed outcome for the arrival and service processes and randomization in policy . Then, is the queue backlog of policy at time t for this sample path, and is the backlog at time t for a controller that schedules for all time slots since the start of time for this sample path. Now, assume at time t policy is in a busy period where it is scheduling . Then, . The intuition behind this claim follows from the fact that the queue is empty right before the busy period begins and does not switch servers mid-busy period. Therefore, for the given sample path of arrivals and service processes, had the controller scheduled for all time slots since the start of time, the queue could not be less. Taking expectations and using the definition of queue length regret in (2) then gives the result. The complete proof of the lemma can be found in Appendix A.

Lemma 1.

Next, in Lemma 2, the expected value of is shown to be finite. Note that by our problem’s definition at most one packet can arrive to the queue at each time slot. Then, since a busy period must start with one packet in queue at its first time slot (i.e., the packet that arrives to the queue to begin the busy period), the maximum size of the queue backlog during the busy period cannot be larger than the busy period’s duration. Thus, the expected value of can be bounded by the expected value of , which is finite. The complete proof of the lemma can be found in Appendix B.

Lemma 2. .

We next proceed to bound the expected value of (9) with Lemma 3. The lemma states that the integrated queue backlog (taken over busy periods scheduled to server i) up until time T is less than the integrated queue backlog (taken over busy periods scheduled to server i) through the busy period. Observe that the right hand side of (10), is the expected number of busy periods out of T busy periods scheduled to server , multiplied by the expected integrated queue backlog, . This quantity is the expected value of the integral of the queue backlog, over busy periods scheduled to i, up to busy period T.

The intuition behind Lemma 3 is simple. Since, by definition, each busy period must be at least one time slot in duration, we cannot have more than T busy periods in T time slots. (In fact, since empty periods must also be one time slot in duration, it is obvious that we will have far fewer than T busy periods in any T time slots.) Therefore, for any outcome of the system’s performance, the integrated queue backlog up to time slot T will be less than the integrated queue backlog through the busy period. Taking expectations then gives the result. The complete proof of the lemma can be found in Appendix C.

Lemma 3.

Finally, in Lemma 4, we show that the expected number of times schedules server to busy periods is bounded by a finite constant which is independent of T. In order to schedule server to a busy period, must estimate to be no less than . Since we obtain a new observation of one of the servers during each empty period p, using Hoeffding’s inequality, we can show that the probability that there exists a server such that decays exponentially in p. The result then follows from the convergence of geometric series. The proof of the lemma can be found in Appendix D.

At the first time slot of the period, set u(t) = i uniformly at random over [N] and update with the observed state of D(t) For other time slots in the empty period, idle

Fig. 5. The policy for achieving Theorem 3. Each variable is initialized to zero prior to its first update.

Then by Lemmas 2 and 4, giving the result.

As a final note, it is important to observe that, by its statement, Theorem 2 does not imply that there exists a constant that bounds for all . It only states that for any chosen , under policy , the queue length regret converges to a finite number that is dependent on the chosen parameters.

B. Non-Stabilizing Servers

In this subsection, we relax Assumption 1 to allow for non-stabilizing servers. Thus, we will now allow for some strict subset of the servers. Importantly, we will assume that the controller is given a known convex summation over the servers’ rates that strictly dominates the arrival rate to the queue. Then, by randomizing over the servers using this convex summation, the controller can always stabilize the system. Concretely, we make the following assumption.

Assumption 2. For some given and ,

Observe that the set defined above is determined by the values of that are given to the controller and different values will lead to different sets. We emphasize that Assumption 2 does not imply that the controller knows any of the values of , a priori. Rather, it states that the controller can be confident that the convex summation it is provided will meet the condition of (11). For example, if then Assumption 2 reduces to and this fact being known by the controller, a priori. Then, if at any given moment the controller wishes to return the queue to the empty state, it can simply resort to persistently scheduling server 1 and wait for the queue to empty. However, note that server 1 may not be the optimal server , and, as a result, persistently scheduling server 1 for every time slot will not in general give good queue length regret performance for every . Generalizing the above example to arbitrary values, we see that Assumption 2 implies that, if at any point the controller decides to resort to a strategy of scheduling at each time slot server i with probability , it can be guaranteed that the queue will eventually empty. In other words, the values of define a stationary, randomized policy that has a service rate that dominates the arrival rate to the queue and can, therefore, empty the queue infinitely often. However, since by the assumption’s statement the values are not required to have any special relationship to , the randomized policy will not generally minimize regret, and its use should not be overly relied upon. Before continuing, we note that Assumption 2 is intentionally artificial. The assumption assumes a randomized policy is given to the controller and that the system will only encounter parameters for and such that the given policy can stabilize the queue. In most practical applications, this assumption is unrealistic. However, the results of this subsection will be highly instructive for the next subsection, where we relax Assumption 2 and no longer assume a known stabilizing policy is given to the controller. Importantly, in the following, we will develop a method that can sparingly use a suboptimal policy to consistently guarantee future empty periods for the queue without overly using the suboptimal policy so as to lose the bounded regret result.

We now give the main theorem of this subsection.

Fig. 6. Example sample path of for some, unspecified, . The background colors indicate the actions of policy (see Fig. 5). During time slots marked in gray, randomly updated one of the sample means, . During time slots marked in orange, scheduled the server with the (so-far) largest sample mean. During time slots marked in blue, used the randomized policy defined by , because it had crossed the corresponding busy period’s time-out threshold. Note that the threshold increases linearly with each busy period number, and, for this example, the threshold was not reached in the third busy period.

Theorem 3. Under Assumption 2, such that, for each .

A policy that achieves Theorem 3 is given in Fig. 5 and is referred to as throughout this subsection. The policy, similar to the policy of Subsection IV-A, iterates over empty and busy periods. See Fig. 6 for an illustration of the policy. As with policy , in the first time slot of each empty period, the policy schedules one server uniformly at random, observes the resulting offered service, and uses the observation to update a sample mean that estimates the server’s service rate. Again as with policy only uses the observations made during these time slots to update its sample means and does not use observations made during other time slots.

At the start of each busy period p, policy starts the busy period by scheduling the server with the highest sample mean for the first p time slots of the busy period or until the queue empties and ends the busy period (whichever occurs first). However, if the busy period lasts longer than p time slots the policy hits a time-out threshold and resorts to using the randomized policy defined by Assumption 2 to bring the queue backlog back to the empty state. The time-out threshold grows linearly in busy period number p, and therefore with each busy period, the controller becomes more reluctant to call upon the randomized policy. In the following analysis, we will see that this growing reluctance is important to establishing the proof of Theorem 3.

We now establish Theorem 3 by analyzing on an arbitrary . The proof is derived through three lemmas. We begin by introducing some new notation that will facilitate understanding.

We begin by introducing notation that allows us to reference the period number p that a time slot t corresponds to. To this end, for each time t, define p(t) to be the period number for the empty or busy period in which t resides. Note that for a given sample path, p(t) maps time slots t = 0, 1, . . . to their corresponding period number p = 1, 2, . . . . For example, in Fig. 6, for t = 0, 1, . . . , 4, p(t) = 1 since these time slots occur during the first empty and busy periods and for t = 5, 6, . . . , 12, p(t) = 2 since these time slots occur during the second empty and busy periods.

Now, in our following analysis, we will be concerned with identifying the busy periods in which the policy does not strictly schedule server over the entire busy period. Note that an ideal policy would schedule over all busy periods, and therefore we wish to identify during which busy periods policy falls short of this ideal. To facilitate this analysis, for each busy period p, we let be an indicator that takes the value 1 if: either a server not equal to is first scheduled by policy at the start of the busy period or the time-out threshold is hit. Note that by hitting the time-out threshold, we mean that the queue did not empty during the first p time slots of busy period p. Otherwise, if neither of these two conditions are met, we let the indicator C(p) = 0. Note that by the definition of , if any server is scheduled during busy period p, then C(p) must equal 1 (see Fig. 5). This is because, by the definition of policy , we can only schedule a server during a busy period if: either the policy began the busy period by scheduling or it began the busy period scheduling server but still hit the time-out threshold, leading it to schedule the stabilizing policy. Since to minimize regret, our objective is to mimic as best as we can the policy that schedules for all time slots, we can notionally view C(p) equaling 1 as policy failing to meet this ideal over busy period p.

In the following, we will combine the above two notations as C(p(t)). Then, if t is in a busy period, C(p(t)) = 1 if a server not equal to is scheduled at any time during the busy period in which t belongs. If t is in an empty period, C(p(t)) indicates whether during the busy period following this empty period, a server not equal to is ever scheduled. Then,

is the queue backlog integrated, up to time slot T, over those busy periods for which the conditions of C(p) are met. Note that if t is in an empty period, , and therefore the time slot contributes a value of 0 to the above summation regardless of the value of C (p(t)). For example, for the sample path in Fig. 6, if scheduled for the start of busy period 3, then C(1) = C(2) = C(4) = 1 since these busy periods reached their time-out thresholds, C(3) = 0, and . Likewise, if for the sample path, did not schedule for the start of busy period 3, then C(1) = C(2) = C(3) = C(4) = 1 and .Given this notation, we have the following lemma, which states that the regret is upper bounded in expectation by the sum of queue backlogs over those busy periods in which indicator C(p) = 1. The proof is similar to that of Lemma 1 and can be found in Appendix E.

Lemma 5. RQ.

Now, to upper bound the right-hand side of Lemma 5, we will want to bound the probability that indicator C(p) = 1. Recall that C(p) can equal 1 if one of the following events occurs: the policy begins the busy period by choosing a server other than to be scheduled for the first p time slots or if it crosses the time-out threshold. In the following lemma we establish that the probability of either event occurring diminishes exponentially in p. To understand the intuition behind this result, first note that at each busy period p, the probability that there exists a server with a decreases exponential with p. This follows from the fact that the policy updates one of the sample means with a new observation at each empty period and Hoeffding’s inequality. Thus, the probability that at busy period is not chosen to be scheduled over the first p time slots, is also exponentially decreasing with p. Note that this argument is very similar to the one made in the proof of Lemma 4. Now suppose that is scheduled at the start of busy period p. Then, since , we can show that the probability the queue does not empty before reaching the time-out threshold p must decrease exponentially in p as well. We now state the lemma whose proof can be found in Appendix F.

Lemma 6. There exists positive constants , and such that for all ,

To complete our upper bound on the right-hand side of Lemma 5, we will need to bound the cost of indicator C(p) equaling 1. To this end, let variable denote the integral of the queue backlog taken over busy period p. For example, for the sample path shown in Fig. 6, and . In Lemma 7, we show grows at most quadratically with p. This is because the queue backlog can grow at most linearly over the first p time slots of the busy period. Thus, if the time-out threshold is hit, the required amount of time that the randomized policy will need to empty the queue can also be at most on the order of p. Since, the maximum queue backlog at any time slot in a given busy period cannot be greater than the busy period’s duration, the expected integrated queue backlog will at most be on the order of .

Lemma 7. There exists positive constants and such that for all p,

The proof of the lemma can be found in Appendix G. Before continuing to the proof of the theorem, we observe that , since C(p) = 0 implies that the time-out threshold was not reached. Thus, we see that for all is finite (implying the expected duration of each busy period is finite under policy ). We are now ready to establish the theorem. Proof of Theorem 3: Beginning with Lemma 5, we have that the regret is bounded by the integrated queue backlog over those busy periods such that the conditions of C(p) are met. i.e.,

Now, no more than T busy periods can occur in T time slots, since each busy period, by definition, must be one time slot long. In fact, as was noted in the previous subsection, there must be much fewer than T busy periods in T time slots, since the empty periods in between the busy periods must also be at least one time slot long, as well. Thus, we can bound the right hand side of (12), which is the integral of the queue backlogs over busy periods with C(p) = 1 up to time T, with the integral of the queue backlogs over busy periods with C(p) = 1 through the busy period. Formally,

Note that in going from (12) to (13), we have changed from summing over time slots t to summing over busy period number p.

At the first time slot of the period, set u(t) = i uniformly at random over [N] and update with the observed state of D(t) For other time slots in the empty period, idle

Fig. 7. The policy for achieving Theorem 4. Each variable is initialized to zero prior to its first update. The method uses the algorithm of Fig. 8 when the time-out threshold is hit.

Now, recall that C(p) is an indicator. When . Thus,

which follows from well-known results for series. Thus, we see that , which gives the result.

C. Learning Stabilizing Policies

We now relax Assumption 2 and no longer assume that a randomized policy is given to the controller that can be relied upon to stabilize the queue. As a result, the controller will have to learn how to take actions so as to empty the queue infinitely often. Once the controller can accomplish this, it may use the empty periods to identify the best server. To this end, we only make the following assumption about the parameters and .

Assumption 3.

We then have the following theorem, which is analogous to the previous subsections. The rest of this subsection will be devoted to proving its statement.

Theorem 4. Under Assumption 3, such that, for each .

To prove the theorem, we will analyze policy on an arbitrary . Policy is shown in Fig. 7. The policy is very similar to , except now when the time-out threshold of a busy period is reached, rather than relying on the known randomized policy given by Assumption 2, uses the method of Fig. 8 to bring the queue back to the empty state. Starting from the time-out threshold’s crossing, the method is run until the queue finally empties and the busy period ends. As can be seen in Fig. 8, we use time slot counter n = 0, 1, . . . to indicate the number of time slots since the method’s invocation (i.e., n is the time slot number normalized to the time-out threshold’s crossing).

In its execution, the method in Fig. 8 dedicates certain time slots to exploration. During these dedicated exploration time slots, the policy chooses from the servers uniformly at random, observes the offered service of the chosen server, and updates the corresponding sample mean using the observation. Note that the sample means are new variables that are used by the method during its execution at busy period p and are completely separate from the sample means in Fig. 7. The

Fig. 8. Learning algorithm for bringing the queue back to the empty state. Variable n counts the number of time slots since the algorithm’s invocation. The method uses a new set of sample means to estimate the server rates. Note that is a different variable from is only updated using observations made during busy period p. Each variable is initialized to zero. The array V (n) counts the number of dedicated exploration time slots that have occurred since the start of the algorithm’s invocation. We require that the locations of the dedicated exploration time slots be chosen such that V (n) meets the condition of (15) for each time slot n.

for some positive constants , and . Note that (15) requires that the frequency of explorations diminishes with n and eventually falls below . Subject to (15), the exact choice of the dedicated exploration times is left to implementation. Observe that our previous example that chose locations meets the condition of (15).

We now turn to establishing the proof of Theorem 4. Since policy is very similar to policy , the proof of Theorem 4 will closely follow the proof of Theorem 3. The main difference will be that we use Lemma 8 below in place of Lemma 7 of the previous subsection.

We begin by noting that, using the definitions of p(t) and C(p) from the previous subsection, Lemmas 5 and 6 continue to hold for policy . Specifically, as before, if we schedule a server during any time slot of a busy period p, than we must either have started the busy period by scheduling a server or we must have hit the time-out threshold. By the arguments of the previous subsection for policy , both events again diminish exponentially in p for policy . Then, the only step that is really missing in establishing the theorem is bounding the cost of having C(p) = 1 for busy period p. We proceed to do this next.

For policy , let denote the integrated queue backlog of busy period p. Then, we have the following lemma whose proof can be found in Appendix H.

Lemma 8. There exists positive constants and such that for all p,

Given Lemma 8, we now establish the proof. Note that, due to its similarity to the proof of Theorem 3, we have abridged a few steps. Proof of Theorem 4: Using Lemma 5 and the fact that no more than T busy periods can occur in T time slots,

Then, as in the proof of Theorem 3, we may write

where the last inequality follows from Lemmas 6 and 8. Since

V. SYSTEMS WITHOUT STABILIZING SERVERS

In the previous section, we saw that the existence of a stabilizing server (i.e., ) allowed for policies that had . A natural question is whether we can obtain a similar result for the subset of problems that do not have stabilizing servers. We proceed to show that there cannot exist a policy that can achieve O(1) regret over this entire subset. The intuition behind this result is that, when , the system may only experience a finite number of empty periods, and therefore, policies cannot rely upon the empty periods to perform sufficient exploration of the servers’ rates.

Let . Then, we have the following theorem.

Theorem 5. For any policy there exists a such that .

Proof: Consider a system with . Let be any arbitrary learning policy in . Then for all t > 0 the queue is non-empty. As a result, the queue backlogs are given by

where is the service offered by server and is the service offered by the server scheduled by at time t. By the above equations, it is clear that at and at . Thus, . For T > 1, equations (17) and (18) imply

And, we can write

Otherwise, there would exist a learning policy that, given any , always schedules the optimal server with probability one at time slot 1. (This would be a policy that guesses the best server with probability one, having not even made an observation

At each time t in the empty period, use the empty-period sampling rule to choose u(t) and update with the observed state of D(t)

For the first p time slots of the busy period or until the queue empties, schedule , updating with the observation of D(t) if The queue does not empty during the first p time slots then

At each time t until the queue empties, schedule the server with the maximum UCB1 weight, using the observed state D(t) to update end if

Fig. 9. The heuristic technique used in our simulations. The method maintains sample mean variables for each server, which are updated with the observed offered service every time that server is chosen. We consider three different rules for sampling servers during empty periods. Arrivals to the queue begin after time step N.

of every server prior to its guess.) Clearly, no such policy in can exist. Therefore, because is monotonically increasing in t, (21) implies that

for all and some . Using this in (20) and then (19) gives the result.

VI. SIMULATION RESULTS

The analysis of Section IV considered policies that proved the achievablility of O(1) queue length regret. As discussed in Section II these methods cause the tail of the regret to trend to zero. In order to keep our analysis tractable, the designed policies had to make concessions, described further below, that hurt their practical performance. Nevertheless, these policies do provide intuition about the problem, and with some fairly minor and intuitive changes, we show that these policies can be transformed into heuristic techniques that achieve good performance in simulation. The heuristics considered in this section use policy as a template.

We now briefly cover the inefficiencies of the policy of Section V. The first, and easiest to remedy, is that the policy only uses the first time slot of each empty period to sample a server. Clearly, a more efficient method would sample throughout the empty period, and the only reason policy was designed not to do this is because it is not strictly necessary for the proof of Theorem 4 and simplifies the mathematics.

The next inefficiency is that policy does not record the observations made of the chosen server during busy periods prior to reaching the threshold, and, after reaching the threshold, does not use its observations after the busy period expires. The reason for this is to facilitate simplicity in the proof of Theorem 4. Observations made during the busy period are correlated with the queue’s dynamics, and therefore if these observations are used during subsequent busy periods, one has to account for how the dynamics impact the probabilities of the previous observations. This would greatly complicate analysis. However, in practice, it makes sense that a policy should use all previous observations to inform its estimates of the servers’ rates.

Finally, policy uses a simple exploration process for identifying a server to empty the queue once the threshold is reached during a busy period. This method is relatively easy to analyze with the queue’s dynamics. However, algorithms from the literature on multi-armed bandits, such as UCB1 described below, are generally better methods and should probably be used in practice. Analyzing how the mechanisms used by these algorithms to achieve good expected performance interact with the queue’s dynamics to reach the next empty period is an interesting direction for future research.

Given the above points, in Fig. 9 we provide a heuristic technique that uses the intuition provided in the analysis of policy . The technique uses one of three methods to explore the servers during empty periods. Method UCB-LE chooses to sample at each time step in the empty period the server that has been least explored so-far (i.e., observed the least number of times). Method UCB-UE randomly chooses a server to explore at each time step of the empty period, uniformly choosing among the servers. Finally, method UCB-WE randomly chooses from the servers proportional to the value of , where is the

Fig. 10. Simulations of a four server problem with service rates . Different arrival rates are considered. (a) (c)

sample mean of the observations made so-far of server i and b is an additional term (set to 0.1 in our simulations). UCB-WE weights its explorations during empty periods towards the best performing servers.

During busy periods, the heuristic of Fig. 9 begins by choosing to schedule at each time slot the server with the highest sample mean until the busy-period threshold is reached (or until the queue empties). The observations made at this time are used to update the sample means of the servers. Once the threshold is reached, the method constructs at each time a UCB1 weight for each server and schedules the server with the highest weight. The UCB1 weight, proposed in [6] for solving the multi-armed bandit problem, is given by

where is the sample mean of the service that has been offered by server i so-far and is the number of observations made of the server. The method schedules according to the UCB1 weights until the queue empties.

In the following, we will compare the three versions of our heuristic (UCB-LE, UCB-UE, and UCB-WE) against a method that always chooses servers according to the UCB1 weight (during both empty and busy periods). UCB1 is a popular multi-armed bandit algorithm [6]. In our context, it is a method that maximizes the amount of offered service to the queue without regard to the queue’s state. In our simulations, all methods will spend the first N time slots sampling each of the N servers once before launching into their main techniques. The queue is kept empty during these first N time slots, and arrivals to the queue begin afterwards. In our plots, we calculate the queue length regret by averaging, across multiple runs, the difference in the queue backlog of the learning policies and the (genie) policy that always schedules the maximum rate server.

In Fig. 10, we explore the impact of increasing arrival rates on the four methods’ performance. In this scenario there are four servers with service rates 0.1, 0.3, 0.5, and 0.7. Arrival rates of 0.4, 0.5, and 0.6 are simulated. The queue length regret

Fig. 11. Simulations of a two server problem with arrival rate . Different service rates are considered. (a)

averaged over 10,000 simulation runs is plotted versus time. In the three plots, we see that the heuristic methods generally outperform UCB1 in minimizing queue backlog. The queue length regret grows with increasing arrival rate as the system becomes more heavily loaded and there are fewer opportunities for free exploration during empty periods. Additionally, as the arrival rate becomes greater, the amount of time available for the queue to be empty decreases, and the relative difference in performance between the heuristic methods and UCB1 decreases. This effect appears to become more severe as the system becomes more heavily loaded.

Fig. 11 examines how a narrowing gap between server rates impacts performance. In this scenario there are two servers. The first server has its service rate tested at values of 0.5, 0.54, and 0.58, while the second server has a fixed rate of 0.6. The arrival rate to the queue is 0.4 and each plot shows the queue length regret averaged over 10,000 simulation runs. We see that decreasing the gap between the server rates causes the heuristic methods’ queue length regret to converge more slowly as the methods have a more difficult time settling on the optimal server. Although the methods outperform UCB1 in all three cases, their relative advantage, over the considered timescales, generally decreases as the gap between the servers’ rates decreases.

VII. CONCLUSION

In this work, we considered a queueing system that has N available servers with unknown service rates. The objective of the system controller is to minimize the queue’s backlog by using observations of the service process to quickly identify the best server. To evaluate a learning policy’s performance, we introduced queue length regret, which is defined to be the expected difference in the accumulated queue backlog of the learning policy and the accumulated backlog of a controller that always schedules the optimal server. This metric quantifies the cost, in queueing delay, of having to learn the best server.

We proved that there exist queue-length based policies that have an order optimal O(1) regret for all systems such that . These policies achieve this result by taking advantage of times when the queue is empty and estimates of the servers can be improved without incurring extra packet delay. Importantly, we showed there exist policies that can ensure the queue empties infinitely often, so that this property can be exploited. In contrast to this result, we showed that traditional bandit algorithms that focus on only maximizing offered service to the system without accounting for the queue’s state can have a queue length regret that is . Thus, to obtain optimal performance, it is insufficient to design controllers that only focus on maximizing the offered service to the queue.

In the problem considered herein, the controller’s estimates of the servers’ rates are highly correlated with the queue’s dynamics, and therefore analyzing the system’s stochastic behavior can be difficult. The theoretical policies analyzed in this work were chosen for their analytical tractability, and the simplicity of the chosen policies allowed us to use well-known tools from statistics to show the achievability of O(1) regret, which was our main objective. We then used insight from our theoretical analysis to design heuristic techniques with good performance. Finding policies with provable optimal performance over finite-time is a possible direction for future research.

Consider an outcome and fixed sample paths and . Suppose at time t, policy is scheduling(i.e., ) and . Define, s < t the last time before t that the queue was empty (i.e., the time slot preceding the start of this busy period). Then, since does not change servers mid-busy period,

where is equal to the service offered by server at time . Now, let be the queue backlog under the controller that always schedules . Since , this implies that

Summing over time we then obtain

Taking expectation with respect to arrivals, departures, and the randomization in policy ,

Then, starting with the definition of ,

Consider a busy period scheduled to server i. The length of the busy period and integrated backlog over the busy period are random variables and , respectively. Furthermore,

We proceed to bound . To this end, define random variable to be the difference between the number of arrivals and offered service from server i in a time slot. The variable has probability mass function

The mean of is denoted and note that by Assumption 1 it is negative. Let for k = 1, 2, . . . be i.i.d. random variables with the distribution of .

random walk defined by . i.e.,

Note that each entry accounts for the difference in the number of arrivals and amount of service at a time step of the busy period. Thus, the hitting time is the first time slot when the service dominates the arrivals and the queue empties. Given the above,

where the first inequality follows from Hoeffding’s inequality and the second inequality follows from . Thus,

In the above, the second inequality follows from bounding with 1 for the the first terms and (23) for all terms greater than . Likewise, the last inequality follows from . This establishes the result.

The proof follows from the fact that, with probability one, there can be no greater than T busy periods in the time interval 0 to . Thus, the cost accrued over the busy periods occurring before time T, must be less than the cost accrued over the first T busy periods. Concretely, denoting the start time and finish time of busy period p with and , we have that

Thus, using the definition of and ,

Define . Note that is the half way distance between and . Furthermore, define to be the sample mean that policy has for server i at the start of busy period p and the number of times the policy explored i during an empty period before busy period p (i.e., the number of times the policy has updated ). Recall if we have not yet sampled i. Then,

The above states that in order to schedule i during p, i must have a sample mean no less than , and this only occurs if one of the sample means deviates from its true mean by a distance .

Now, for a sample mean to deviate by distance , the policy must either have not observed i very often during exploration or it must have observed atypical observations during exploration. Thus, we further break up the above events:

Note that and are sums of Bernoulli random variables that take the value 1 when i is observed during empty period p and that . Using Hoeffding’s inequality,

with a similar bound holding for . Now, the number of times we have explored a server during the empty periods, is independent of the observations that were made. Therefore, we can use Hoeffding’s to obtain

p(t) denotes the period (empty or busy) that time slot t resides in.

that and .5 Then, over this sample path, the server is scheduled over the entire busy period in which time slot t resides. By an argument similar to Lemma 1, this implies that . Therefore, summing over t

Taking expectation,

Then using the definition of queue length regret,

Define be an indicator random variable that equals 1 if we scheduled for the start of busy period p and hit the time-out threshold (i.e., the queue did not empty in the first p time slots). Otherwise, let be 0. Then,

Define to be the sample mean that policy has for server i at the start of busy period p. Now, if we did not schedulefor the start of busy period p, then there must exist an such that . Thus, using the union bound,

All that is left is to characterize . To do so, we will analyze the probability that the time-out threshold p is reached given is chosen to be scheduled at the start of busy period p. For an arbitrary time slot, the difference in the number of arrivals and available service from is given by a random variable that has probability mass function

Then, let random variable be the difference in the number of arrivals and available service from in the time slot since the start of the busy period. Note that the random variables are i.i.d. with the distribution of above.

Given that is scheduled at the start of busy period p, we will analyze the accumulated difference between the number of arrivals and service process (since the start of the busy period) as a random walk . The busy period begins with one packet in queue. Therefore, if the random walk hits in p time slots, the queue empties and the period ends before reaching the time-out threshold. Note that by Assumption 2, the mean and the walk has negative drift. Then,

Now, for any choice of , we have for all

where the first inequality follows from Hoeffding’s inequality and the second inequality from . Plugging in (26) and (28) into (25), we see that for all ,

Then, we can clearly choose values of and , that are independent of p, to meet the lemma’s condition.

We define U(p) to be the amount of time that we schedule the randomized policy over busy period p. Then, an upper bound

on the duration of the busy period is p + U(p). Now, in an argument similar to Appendix B, we can bound the integrated

We therefore need a bound on the P(U(p) > n|C(p) = 1) for n = 1, 2, . . . . To this end, we analyze the difference between the number of arrivals and offered service (provided by the randomized policy) to the queue that would occur after threshold-time p. Let the average offered service of the randomized policy be

. Define random variable to be the difference between the number of arrivals and service offered by the randomized policy in a given time slot. This random variable has probability mass function

Let be i.i.d. random variables with the distribution of . Each is the difference in the number of arrivals and offered service in time slot k following the time-out threshold. By Assumption 2, the mean and the random walk has negative drift.

Now, given that we hit the time-out threshold, at most p + 1 packets are in the queue when we start scheduling the known randomized policy. Then, we see that the length of time that we could have possibly scheduled the randomized policy is bounded by: (i.e., the random walk has yet to hit ). Note that this is a pessimistic bound, since there may be less than p + 1 packets in the queue at the start of scheduling the randomized policy. Then, we can obtain for ,

where the first inequality follows from Hoeffding’s inequality and the second from . Importantly, note that is a function of p. Putting this together, starting at (29), we see that

Then, since and , it is easy to see that we can choose values of and such that the lemma holds.

Recall that is the total queue backlog summed over the interval of busy period p. We begin by defining U(p) to be

the amount of time that the learning method of Fig. 8 is used during busy period p. Then, in an argument similar to the proof

We seek to bound P(U(p) > n|C(p) = 1) for n = 1, 2, . . . . In the following, we will use to denote the learning algorithm’s estimate of server i at time slot n (see Fig. 8) andthe number of times the algorithm has observed server i during an exploration (i.e., the number of times has

been updated). Recall that n takes value 0 when the timeout threshold p is crossed. (i.e., Variable n counts the number of time slots since the threshold was crossed.) Likewise, we have that the number of explorations made by time slot n is bounded by

for positive constants , and . Finally, we define a constant such that , which by Assumption 3 must exist.

In order to analyze the event {U(p) > n}, we consider whether running the learning algorithm of Fig. 8 over n time slots would have caused the queue to empty at any time up to or before n. Recall that at most p + 1 packets can be in the queue when the learning method is initiated. We will then break up the event {U(p) > n} into two parts. First, the event that was not the maximum estimate for any time slot over the interval . Second, the event that was the maximum estimate over the interval (and, thus, at all non-exploration time slots was scheduled), but the arrival process plus the initial queue backlog (which can be no greater than p + 1) has dominated the service process up until time n. Concretely, for ,

We now use the following lemmas. Their proofs can be found at the end of this appendix.

Lemma 9. For positive constants and that are independent of n and p,

Lemma 10. For ,

where , and are positive constants that are independent of n and p.

We now use Lemmas 9 and 10 to construct a bound on .

where the first inequality follows from (31) and the last inequality follows from applying Lemmas 9 and 10 and the union bound to (33). Note that for any non-negative integer x, real number y > 0, and ,

Then one can see that (34) can be bounded by a quadratic function of p giving the result. Below we provide the proofs of Lemmas 9 and 10. Proof of Lemma 9:

To this end, we can apply the following bounds

We proceed via an argument similar to the proof of Lemma 4. Consider an . Define

Then, as was argued in Lemma 4,

Recall that by time m, V (m) dedicated explore time slots have occurred. We proceed to bound the two events on the right-hand side of the above. As in Lemma 4, we break up these events into two cases; the event the method under observes a server i and the event the method observes atypical observations of the server. Concretely,

The probability of under observing i is then given by6

where the first inequality follows from Hoeffding’s inequality, the second inequality follows from plugging in the lower bound of (32), and the third inequality follows from . A similar expression holds for . We next proceed to bound the probability that the server sees atypical observations.

Thus, putting this altogether starting from (35), we see that

where and are positive constants that are independent of n and p. This gives the lemma.

We begin by defining to be the set of time slots that are not dedicated exploration time slots in the interval .Let be the service available from server at time slot m. Then we can bound,

where the last inequality follows by defining a threshold and noting that if is less than the threshold and is greater than the threshold.

We now turn to analyzing how the cardinality of scales with n. We begin by noting that for . Thus, the set is empty for relatively small n. For n large enough such that , we have

since there are time slots in the interval , at most V (n) of them are dedicated exploration time slots, and we can use (15) to upper bound V (n). Using (37), we see that7

Given the above, for n large enough such that , we use simple algebra to make the following bound on (36).

Now, using (38), we see that

where the last inequality follows from the fact that . Thus, there exists an that is independent of p such that

for appropriately chosen positive constant , which is independent of p or n. Noting that for appropriately chosen positive constants and gives the lemma.

REFERENCES

[1] T. Stahlbuhk, B. Shrader and E. Modiano, “Learning Algorithms for Minimizing Queue Length Regret,” in Proc. IEEE International Symposium on Information Theory, 2018.

[2] W. Thompson, “On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,” Bulletin of the American Mathematics Society, vol. 25, no. 3/4, pp. 285–294, 1933.

[3] S. Bubeck and N. Cesa-Bianchi, “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

[4] T. L. Lai and H. Robbins, “Asymptotically Efficient Adaptive Allocation Rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.

[5] R. Agrawal, “Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem,” Advances in Applied Probability, vol. 27, no. 4, pp. 1054–1078, 1995.

[6] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time Analysis of the Multiarmed Bandit Problem,” Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002.

[7] A. Garivier and O. Capp`e, “The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond,” in Proc. Conference On Learning Theory, 2011, pp. 359–376.

[8] S. Krishnasamy et al., “Regret of Queueing Bandits,” in Proc. Neural Information Processing Systems, 2016, pp. 1669–1677.

[9] D.P. Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientific, 2019.

[10] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, Second Edition, MIT Press, 2018.

[11] P. Auer and R. Ortner, “Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning,” in Proc. Advances in Neural Information Processing Systems, 2006.

[12] T. Jaksch, P. Auer and R. Ortner, “Near-Optimal Regret Bounds for Reinforcement Learning,” Journal of Machine Learning Research, vol. 11, pp. 1563–1600, 2010.

[13] M.G. Azar, I. Osband and R. Munos, “Minimax Regret Bounds for Reinforcement Learning,” in Proc. International Conference on Machine Learning, 2017.

[14] R. Fruit, M. Pirotta and A. Lazaric, “Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes,” in Proc. Advances in Neural Information Processing Systems, 2018.

[15] D. Russo et al., “A Tutorial on Thompson Sampling,” Foundations and Trends in Machine Learning, vol. 11, no. 1, pp.1–96, 2018.

[16] M. Strens, “A Bayesian Framework for Reinforcement Learning,” in Proc. Conference on Machine Learning, 2000.

[17] I. Osband, B. Van Roy and D. Russo, “(More) Efficient Reinforcement Learning via Posterior Sampling,” in Proc. Advances in Neural Information Processing Systems, 2013.

[18] I. Osband and B. Van Roy, “Near-Optimal Reinforcement Learning in Factored MDPs,” in Proc. Advances in Neural Information Processing Systems, 2014.

[19] I. Osband and B. Van Roy, “Why is Posterior Sampling Better than Optimism for Reinforcement Learning?,” in Proc. International Conference on Machine Learning, 2017.

[20] T.M. Modolvan and P. Abbeel, “Safe Exploration in Markov Decision Processes,” in Proc. International Conference on Machine Learning, 2012.

[21] F. Berkenkamp, et al., “Safe Model-Based Reinforcement Learning with Stability Guarantees,” in Proc. NIPS, 2017.

[22] A. Wachi et al., “Safe Exploration and Optimization of Constrained MDPs using Gaussian Processes,” in Proc. AAAI, 2018.

[23] A. Anandkumar et al., “Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011.

[24] O. Avner and S. Mannor, “Multi-user lax communications: a Multi-Armed Bandit approach,” in Proc. IEEE International Conference on Computer Communications, 2016.

[25] S. Cayci and A. Eryilmaz, “Learning for Serving Deadline-Constrained Traffic in Multi-Channel Wireless Networks,” in Proc. IEEE International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks, 2017.

[26] R. Combes et al., “Optimal Rate Sampling in 802.11 Systems,” in Proc. IEEE Conference on Computer Communications, 2014, pp. 2760–2767.

[27] Y. Gai, B. Krishnamachari and R. Jain, “Learning Multiuser Channel Allocations in Cognitive Radio Networks: A Combinatorial Multi-Armed Bandit Formulation,” in Proc. IEEE New Frontiers in Dynamic Spectrum, 2010.

[28] Y. Gai, B. Krishnamachari and R. Jain, “Combinatorial Network Optimization with Unknown Variables: Multi-armed Bandits with Linear Rewards and Individual Observation,” IEEE/ACM Trans. on Networking, vol. 20, no. 5, pp. 1466–1478, 2012.

[29] D. Kalathil and N. Nayyar and R. Jain, “Decentralized Learning for Multiplayer Multiarmed Bandits,” IEEE Trans. on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.

[30] M. Lelarge, A. Proutiere and M. Sadegh Talebi, “Spectrum Bandit Optimization,” in Proc. IEEE Information Theory Workshop, 2013.

[31] K. Liu and Q. Zhao, “Distributed Learning in Multi-Armed Bandit with Multiple Players,” IEEE Trans. on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.

[32] N. Nayyar, D. Kalathil and R. Jain, “On Regret-Optimal Learning in Decentralized Multiplayer Multiarmed Bandits,” IEEE Trans. on Control of Network Systems, vol. 5, no. 1, pp. 597–606, 2016.

[33] C. Tekin and M. Liu, “Online Learning Methods for Networking,” Foundations and Trends in Networking, vol. 8, no. 4, pp. 281–409, 2015.

[34] Y. Zhang et al., “Learning Temporal-Spatial Spectrum Reuse,” IEEE Trans. on Communications, vol. 64, no. 7, pp. 3092–3103, 2016.

[35] Y. Zhou et al., “Almost Optimal Channel Access in Multi-Hop Networks with Unknown Channel Variables,” in Proc. IEEE Distributed Computing Systems, 2014, pp. 461–470.

[36] L. Tassiulas and A. Ephremides, “Stability Properties of Constrained Queueing Systems and Scheduling Policies for Maximum Throughput in Multihop Radio Networks,” IEEE Trans. on Automatic Control, vol. 37, no. 12, pp. 1936–1948, 1992.

[37] L. Tassiulas and A. Ephremides, “Dynamic server allocation to parallel queues with randomly varying connectivity,” IEEE Trans. on Information Theory, vol. 39, no. 2, pp. 466-478, 1993.

[38] N. McKeown et al., “Achieving 100% Throughput in an Input-Queued Switch,” IEEE Trans. on Communications, vol. 47, no. 8, pp. 1260-1267, 1999.

[39] M. J. Neely, E. Modiano and C. E. Rohrs, “Power Allocation and Routing in Multibeam Satellites with Time-Varying Channels,” IEEE/ACM Trans. on Networking, vol. 11, no. 1, 2003, pp. 138-152.

[40] X. Lin and N. Shroff, “The Impact of Imperfect Scheduling on Cross-Layer Rate Control in Wireless Networks,” in Proc. IEEE INFOCOM, pp.1804-1814, 2005.

[41] L. Chen et al., “Cross-Layer Congestion Control, Routing and Scheduling Design in Ad Hoc Wireless Networks,” in Proc. IEEE INFOCOM, 2006.

[42] A. Sinha and E. Modiano, “Optimal Control for Generalized Network-Flow Problems,” IEEE/ACM Trans. on Networking, vol. 26, no. 1, pp. 506-519, 2018.

[43] C. Joo, “On the Performance of Back-Pressure Scheduling Schemes with Logarithmic Weight,” IEEE Trans. on Wireless Communications, vol. 10, no. 11, pp.3632-3637, 2011.

[44] D. Bethanabhotla, G. Caire and M.J. Neely, “WiFlix: Adaptive Video Streaming in Massive MU-MIMO Wireless Networks,” IEEE Trans. on Wireless Communications, vol. 15, no. 6, pp. 4088-4103, 2016.

[45] H. Yu and M.J. Neely, “Learning Aided Optimization for Energy Harvesting Devices with Outdated State Information,” in Proc. INFOCOM, 2018.

[46] I. Kadota, A. Sinha and E. Modiano, “Optimizing Age of Information in Wireless Networks with Throughput Constraints,” in Proc. IEEE INFOCOM, 2018.

[47] I. Kadota and E. Modiano, “Minimizing the Age of Information in Wireless Networks with Stochastic Arrivals,” in Proc. ACM MobiHoc, 2019.

[48] S. Krishnasamy et al., “Augmenting Max-Weight with Explicit Learning for Wireless Scheduling with Switching Costs,” in Proc. IEEE International Conference on Computer Communications, 2017.

[49] T. Stahlbuhk, B. Shrader and E. Modiano, “Learning Algorithms for Scheduling in Wireless Networks with Unknown Channel Statistics,” Ad Hoc Networks, vol. 85, pp.131–144, 2019.

[50] Q. Liang and E. Modiano, “Minimizing Queue Length Regret Under Adversarial Network Models,” in Proc. ACM Measurement and Analysis of Computing Systems, 2018.

Designed for Accessibility and to further Open Science