b

DiscoverSearch
About
Learning Algorithms for Minimizing Queue Length Regret
2020·arXiv
Abstract
Abstract

We consider a system consisting of a single transmitter/receiver pair and N channels over which they may communicate. Packets randomly arrive to the transmitter’s queue and wait to be successfully sent to the receiver. The transmitter may attempt a frame transmission on one channel at a time, where each frame includes a packet if one is in the queue. For each channel, an attempted transmission is successful with an unknown probability. The transmitter’s objective is to quickly identify the best channel to minimize the number of packets in the queue over T time slots. To analyze system performance, we introduce queue length regret, which is the expected difference between the total queue length of a learning policy and a controller that knows the rates, a priori. One approach to designing a transmission policy would be to apply algorithms from the literature that solve the closely-related stochastic multi-armed bandit problem. These policies would focus on maximizing the number of successful frame transmissions over time. However, we show that these methods have  Ω(log T)queue length regret. On the other hand, we show that there exists a set of queue-length based policies that can obtain order optimal O(1) queue length regret. We use our theoretical analysis to devise heuristic methods that are shown to perform well in simulation.

image

IN this work, we consider a statistical learning problem that is motivated by the following application. Consider a wirelesscommunication system consisting of a single transmitter/receiver pair and N channels over which they may communicate. Packets randomly arrive to the transmitter’s queue and wait in the queue until they are successfully delivered to the receiver. At each time slot, the transmitter can decide to transmit a frame on one of the N channels. If the queue is non-empty, the transmitted frame carries a packet with it over the channel, and if the frame is then successfully received, the packet is successfully delivered. For each channel, each frame transmission attempt is successful according to a probability that is initially unknown. The transmitter is informed whether a transmission was successful via receiver feedback that immediately follows each transmission. The objective of the controller is to minimize the queue’s backlog by using the receiver feedback to quickly identifying the best channel to transmit on. In the above application, each successful frame transmission offers one packet’s worth of service to the queue. Thus, the channels in the above application behave like servers in a general queueing system that, when selected, offer a random amount of service. Given the above motivation, we consider the problem of identifying the best of N available servers to minimize the queue’s backlog. To this end, we associate a one unit delay-cost to each time slot that each packet has to wait in the queue. To obtain good performance, the controller must schedule the servers to explore the offered service rate that each gives and also exploit its knowledge to schedule the server that appears to give the most service. We define the queue length regret as Rπ(T) ≜ E��T −1t=0 Qπ(t) − �T −1t=0 Q∗(t)�, where  Qπ(t)is the backlog under a learning policy and  Q∗(t)is the backlog under a controller that knows the best server. Our objective is to find a policy that minimizes this regret. Our problem is closely related to the stochastic multi-armed bandit problem. In this problem, a player is confronted by a set of N possible actions of which it may select only one at a given time. Each action provides an i.i.d. stochastic reward, but the statistics of the rewards are initially unknown. Over a set of T successive rounds, the player must select from the actions to both explore how much reward each gives as well as exploit its knowledge to focus on the action that appears to give the most reward. Learning policies for solving the multi-armed bandit have long been considered [2]. Historically, the performance of a policy is evaluated using regret, which is defined to be the expected difference between the reward accumulated by the learning policy and a player that knows, a priori, the action with highest mean reward. This quantifies the cost of having to learn the best action. It is well known that there exist policies such that the regret scales on the order of log T and that the order of this bound is tight [3]. In the seminal work of [4], policies for achieving an asymptotically efficient regret rate were derived, and subsequent work in [5][7] have provided simplified policies that are often used in practice.

image

This work was sponsored by NSF Grants AST-1547331 and CNS-1701964, and by Army Research Office (ARO) grant number W911NF-17-1-0508. This material is based upon work supported by the United States Air Force under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force.

One approach to solving our problem would be to simply use traditional bandit algorithms. In this approach, the reward in the bandit problem would be viewed as offered service in our problem, and the resulting methods would focus on maximizing the rate of offered service to the queue. In the context of the wireless system described above, this is equivalent to targeting a high rate of successful frame transmissions without regard to which frames are packet-bearing. Note that in an infinitely backlogged system, which always has packets available to send, this approach would maximize throughput. Since, as a rule of thumb, higher service throughputs generally correspond with lower packet delays in queueing systems, this approach has merit.

However, the drawback with this approach is that it does not exploit the fundamental queueing dynamics of the system. During time periods when the queue is empty, the offered service is unused, and the controller can, therefore, freely poll the servers without hurting its objective. This contrasts with non-empty periods, when exploring can be costly since it potentially uses a suboptimal server. However, a controller cannot take this lesson too far and restrict its explorations only to time slots when the queue is empty. This is because some servers may have a service rate that is below the packet arrival rate, and if the controller refuses to explore during non-empty periods, it may settle on destabilizing actions that cause the backlog to grow to infinity.

Instead, policies should favor exploration during periods when the queue is empty but perform enough exploration during non-empty periods to maintain stability. In this work, we show that for systems that have at least one server whose service rate is greater than the arrival rate there exist queue-length based policies such that  Rπ(T) = O(1).1 Likewise, we show that any traditional bandit learning algorithm, when applied to our setting, can have queue length regret  Rπ(T) = Ω(log T). We additionally show that if there does not exist a server with a service rate that is greater than the arrival rate, for any policy the queue-length regret can grow as  Rπ(T) = Ω(T).

The problem considered herein is related to the work of [8], which also considered learning algorithms for scheduling service to a queue. The main difference is that in [8] the controller did not use the queue backlog to make decisions. As a result, the policies considered in that work were closely related to those in the bandit literature and focused on maximizing offered service. Under these policies, [8] showed that the tail of  E [Qπ(t) − Q∗(t)]diminishes as 1t, implying that  Rπ(T)is logarithmic. In this work, we show that by exploiting empty periods, the queue length regret is bounded.

Our problem can also be compared to the more general literature on reinforcement learning. In reinforcement learning, an actor seeks to learn the optimal policy for a sequential decision making problem with an evolving system state. The decision made by the actor at each state causes the actor to obtain a probabilistic reward and the system to randomly change its state. The objective is to learn how to obtain high reward by estimating the system’s statistics. See [9], [10] for a review of reinforcement learning. The online performance of reinforcement learning algorithms has been previously explored in [11] and [12], which used the upper confidence methods UCRL and UCRL2. The work of [13] has improved upon these bounds, and [14] extended the results to weakly-communicating problems. Thompson-sampling-inspired algorithms have also emerged as a design principle [15]. Posterior sampling reinforcement learning (PSRL), which is based on this principle, has become a popular method that can, in certain problems, outperform upper confidence methods [16][19].

Our work can be viewed as a reinforcement learning problem with known structure. The queue backlog is the problem’s state and a penalty is paid for inefficiently scheduling whenever the backlog is nonzero. Note that in our problem, the amount of regret that can be accrued at a given time is unbounded and the problem’s state space (the queue occupancy) is infinite. Our work has some overlap with the field of safe reinforcement learning [20][22]. We seek to find learning policies that do not destabilize the system and regularly return the queue backlog to the empty state. This structure separates our work from most of the previous literature that explores algorithms that are not designed to exploit queueing dynamics and are not focused on analyzing queue stability under different policies. The algorithms we design obtain their performance guarantees by using what is known about the problem, the dynamics of the queue and its impact on regret, while focusing on learning what is unknown, the server rates. Since the amount of service that a server will offer when scheduled is independent of the current queue backlog, our methods do not waste time trying to match the action-rewards to individual states, as would occur with a general reinforcement learning algorithm that knew nothing about the problem’s structure.

The use of statistical learning methods for optimizing wireless channel access in systems with uncertain channel conditions has been considered in the previous literature [23][35]. However, most previous works examined algorithms for maximizing transmission opportunities and did not focus on minimizing queueing delay. A major contribution of this work is to show that these two objectives are not necessarily the same. A learning algorithm that maximizes total offered service to a queue may not statistically minimize the queue backlog over time.

Scheduling algorithms that use queue state to make service decisions have a long history. In the seminal work of [36] and [37], the max-weight algorithm for assigning service to queues was shown to maximize throughput under complex scheduling constraints and probabilistic dynamics. This framework has been extended and applied to network switching [38], satellite communications [39], ad-hoc networking [40], [41], packet multicasting and broadcasting [42], packet-delivery-time reduction [43], multi-user MIMO [44], energy harvesting systems [45], and age-of-information minimization [46], [47]. In the works of

image

Fig. 1. Queueing system with N servers. At each time, one server may be scheduled. The rate at which each server can offer service is initially unknown.

[48] and [49], learning algorithms were used for achieving network stability under unknown arrival and channel statistics. The methods considered in those works augmented the max-weight algorithm with a statistical learning component. The resulting methods were shown to achieve stability but did not analyze queue length regret. In [50], learning algorithms for controlling networks with adversarial dynamics were considered and the max-weight algorithm and tracking algorithm were both explored as solutions for minimizing queue backlogs. However, the adversarial model considered in that work is pessimistic compared to the stochastic model considered herein.

The focus of this work is on proving the achievability of asymptotically-optimal regret growth. This requires that the controller samples all servers often enough to converge on the optimal server and achieve bounded regret. Importantly, the controller must concurrently guarantee that the queue does not blow-up to infinity and that exploration of sub-optimal servers is eventually limited to times when the queue is empty. The policies we use to prove our results do not optimize the queue length regret for finite-time analysis. As a result, although the tail of their regret is optimal, they can initially accrue large regret. We designed these policies to facilitate analysis, which is made difficult by the high-correlation between the queue backlog (which determines when to sample the servers to converge on the best server) and the observations of the servers (which determine the queue backlog and are used for identifying the best server). Our analyzed algorithms disentangle this relationship at the expense of good non-asymptotic performance. Nevertheless, we show through simulation that the insight gained from our analysis can inspire well-performing heuristics that exploit the structure found in our theoretical proofs. Analyzing and improving upon these heuristics is a potential direction for future research.

The rest of this paper is organized as follows. In Section II, we give the problem setup. We analyze the performance of traditional bandit algorithms, when applied to our problem, in Section III. In Sections IV and V, we characterize the regret of queue-length based policies when the best server has a service rate above and below the arrival rate, respectively. In Section VI, we use our theoretical analysis to define heuristic policies and test their performance through simulation. We conclude in Section VII. Note that a subset of the material in this paper appeared in [1].

A. Problem Description

We consider a system consisting of a single queue and N servers (where integer  N ∈ [2, ∞)) operating over discrete time slots t = 0, 1, 2, . . . . See Fig. 1. Packets arrive to the queue as a Bernoulli process A(t) with rate  λ ∈ (0, 1]and wait in the queue until they receive service from one of the servers. Each packet can be serviced no sooner than the next time slot after its arrival. Each server’s ability to offer service randomly fluctuates. Specifically, the number of packets that server  i ∈ [N]can service at time t follows a Bernoulli process  Di(t)with rate  µi. The arrival process and server processes are assumed to be independent of each other. We refer to server i as stabilizing if  µi > λ, which implies that the rate at which the server can provide service keeps up with the arrival process. Otherwise, we refer to it as non-stabilizing. Given the above, an instantiation of our problem is characterized by the tuple  (λ, ⃗µ), where  ⃗µis the vector of service rates. Then, we let P be the set of all problems (i.e., tuples).

The main challenge in our problem is that the system can only activate one server at a given time. At each time slot, a system controller must select only one of the N servers and ask it to offer service. Then the selected server will inform the controller of the number of packets that it can offer to serve and, if the queue is non-empty, will service that number of packets. We denote the controller’s choice at time t as  u(t) ∈ [N]. For decision u(t) = i, the service offered to the queue is then denoted D(t) which is equal to  Di(t). Throughout this work, the controller is not allowed to observe the offered service processes  Di(t)prior to making its decision u(t), and therefore it cannot know which servers will offer service prior to its choice.

Given the above, the queue backlog, Q(t), evolves as

image

where  (x)+is used to denote the maximum of x and 0. We assume Q(0) = 0 and the controller knows this.

The system incurs a unit cost, in delay, for each time slot that each packet has to wait in the queue. Therefore, for a time horizon T, the controller wishes to minimize �T −1t=0 Q(t). Since, the controller cannot observe the values of  Di(t)prior to making its decision u(t), the optimal action is to select the server

image

to provide service to the queue. For simplicity, we assume  i∗is always unique.2 In our framework, the controller does not a priori know the values of  µiand must therefore use observations of D(t) to identify  i∗. This implies that the service available from chosen server u(t) is revealed to the controller after time t, but the service that would have been available from all other servers remains hidden. Note that the controller can observe D(t) at all times t, even when Q(t) = 0. Finally, in this work, we assume that in addition to not knowing the values of  µi, the system does not initially know the exact value of  λ, either.

Given the above, the objective is to design a controller that uses its observations to decide which server to schedule at each time slot to minimize

image

To this end, denoting the history of all previous arrivals, service opportunities, and decisions as

image

we need to specify a controller policy  π, which at each time t uses any portion of H(t) to (possibly randomly) make a decision on u(t). Note that since H(t) contains the history of all previous arrivals and offered service to the queue, at time t, the policy implicitly knows the current backlog Q(t) as well (i.e., we could explicitly include Q(t) in H(t), but this would be redundant). We let  Πbe the set of all policies that we could design.

Define  Q∗(t)to be the queue backlog under the controller that always schedules  i∗, and let  Qπ(t)be the backlog under our policy  πthat must learn which server is optimal. We will analyze the performance of  πusing the following definition of queue length regret,

image

where the expectation is over the product measure of  P1 × P2where P1 denotes the operation of  πand P2 the operation of the scheduler that always selects  i∗. We will often simply refer to this metric as regret in the rest of the paper. Observe that, if our policy minimizes (2) over the set  Π, it must also minimize  E[�T −1t=0 Qπ(t)]over the set.

B. Analyzing Queue Length Regret

Since at any time t, server  i∗has the highest probability of offering service to the queue, it is clear that the controller that always schedules  i∗must minimize the expected queue backlog. Therefore, for every time t and any policy  π ∈ Π,

image

which implies that  Rπ(λ,⃗µ)(T)is monotonically increasing in T for all  π ∈ Π. Since, the queue begins empty at time 0, using (1),  Rπ(λ,⃗µ)(1) = Rπ(λ,⃗µ)(2) = 0.

The focus in this work will be on characterizing how queue length regret can scale with time horizon T. Therefore, we will not focus on finding policies that optimize  Rπ(λ,⃗µ)(T)for some finite value of T. Instead, we will characterize  Rπ(λ,⃗µ)(T)as T goes to infinity. We will proceed to show that under different assumptions on  λand  ⃗µ, the regret  Rπ(λ,⃗µ)(T) = O(1). Define the tail of the regret to be

image

Intuitively, this is the additional cost-to-go of infinitely running the policy past time T. Then the following proposition holds.

Proposition 1. If for policy  π ∈ Πand subset  P′ ⊆ P, Rπ(λ,⃗µ)(T) = O(1)for each  (λ, ⃗µ) ∈ P′, then  Lπ(λ,⃗µ)(T) → 0as T → ∞for every  (λ, ⃗µ) ∈ P′.

Thus, we see that a policy that obtains O(1) queue length regret seeks to minimize the tail of the regret as time goes to infinity (i.e., causes the tail to trend to zero). A policy that optimizes the tail of the regret is not guaranteed to minimize the queue length regret. Indeed the policies we proceed to analyze in Section IV do not generally achieve small queue length regret, and are instead designed to facilitate proving that O(1) queue length regret is achievable. Nevertheless, we will show in Section VI that the insight gained from these policies can be used to construct heuristics with good regret performance.

In this section, we establish that the regret of any policy  π ∈ Πthat does not use previous observations of the arrival process must have a queue length regret that grows logarithmically with T for some  ⃗µ. Under this restriction, the policies in this section (in effect) do not observe previous arrivals to the queue, and their decision making process is solely focused on the observed offered services. Since the policies do not monitor process A(t), they cannot directly use the queue backlog Q(t) to make their decisions, either. Under this restriction, we may still borrow any strategy from the traditional stochastic multi-armed bandit literature to solve the problem. These policies focus on only maximizing offered service to the queue, using previous observations of the offered service to guide their decisions. Without loss of generality, the theorem is established for a system with two servers. The theorem’s proof makes use of a well-known lower bound on the performance of bandit learning algorithms [3, Theorem 2.2].

Let  µ(j)be the  jthservice rate when the service rates are sorted in increasing order.

Theorem 1. For any  π ∈ Πthat does not use previous observations of the arrival process,  ∃⃗µwith server rates  µ(1)and  µ(2)(µ(2) > µ(1)) such that for any fixed  λ ∈�0, µ(2)�, Rπ(λ,⃗µ)(T) = Ω(log T).

image

Denote the offered service from servers (1) and (2) as  D(1)(t, ω)and  D(2)(t, ω)respectively and the decision by policy  πat time t as  u(t, ω). Then, from (3) and the queue evolution equation,

image

Thus, taking expectation and summing over time

image

Note that each  ˜π(t)defines a different decision process leading up to time t. For example,  ˜π(t − 1)chooses server (2) with probability 1 at time  t − 1, but  ˜π(t)follows the actions of policy  πat time  t − 1(and chooses server (2) with probability 1 at time t, instead). One can then see that in the above, at each index t, we are comparing the performance of  πto the performance of a different  ˜π(t)decision process. Since the queue begins empty at t = 0, by (1), the expected queue backlog under any controller must be equal to 0 and  λat times 0 and 1, respectively. Thus, the left-hand side of (4) has been chosen as a summation from time slot 2 to  T − 1.

Now,  Qπ(t)and u(t) are functions of events up to time  t − 1and are therefore independent of  D(1)(t)and  D(2)(t). Thus, we can write

image

Now,  {Qπ(t) > 0}and {u(t) = (1)} are not necessarily independent events. However, we can use the fact that an arrival at time  t − 1is a subset of the event that  Qπ(t) > 0to lower bound the above (i.e.,  {A(t − 1) = 1} ⊆ {Qπ(t) > 0}). Then,

image

We then use the fact that policy  πis independent of the arrival process A(t) to obtain

image

Note that  λ�µ(2) − µ(1)�> 0. Now, policy  πis a strategy for solving the stochastic multi-armed bandit problem (i.e., it uses the previous actions and observed offered service to decide on which server to schedule at time t). From [3, Theorem 2.2], there must exist a  ⃗µsuch that3

image

Without loss of generality, we can then assume the chosen  ⃗µhas this property. Using equations (4) through (7), this implies

image

Then, because  E [Q∗(t + 1)] ≤ E�Q˜π(t)(t + 1)�for all t,4

image

Therefore, by (8),

image

In this section, we examine the asymptotic scaling of regret  Rπ(λ,⃗µ)(T)for policies that make decisions using Q(t) (or equivalently, the previous observations of the arrival process). We do so by considering the problem for different subsets of P that impose different assumptions on the relationship between  λand  ⃗µ. This section will build to the main result of this work, Theorem 4, which states that there exists a  π ∈ Πsuch that for any problem  (λ, ⃗µ)with  µi∗ > λ, Rπ(λ,⃗µ)(T) = O(1).

We begin our analysis in Subsection IV-A under the assumption that every server is stabilizing (i.e.,  µi > λ, ∀i ∈ [N]). Under this assumption, the controller does not need to account for the possibility of system instability. As a result, the controller can limit itself to performing exploration only on time slots in which the queue is empty. During time slots in which the queue is backlogged, the controller will exploit its previously obtained knowledge to schedule the server it believes to be best.

In Subsection IV-B, we allow for both stabilizing and non-stabilizing servers in the system (i.e.,  µiis allowed to be less than  λfor a non-empty subset of the servers). However, we will assume that the controller is given a randomized policy that achieves an offered service rate that is greater than the arrival rate to the queue. Note that the controller does not need to learn this policy and can use it to return the queue to the empty state. The given randomized policy is not required to have any relationship to  i∗and will not, in general, minimize queue length regret. As a result, to minimize queue length regret, the controller will not want to excessively rely upon it.

In Subsection IV-C, we further relax our assumptions on the problem and only require that  µi∗ > λ. In this subsection, the controller will need to identify which servers are stabilizing, while simultaneously trying to minimize  Rπ(λ,⃗µ)(T). This will require a policy that does not destabilize the system. To this end, the controller will have to explore the servers’ offered service rates during both time slots when the queue is empty and backlogged. Explorations during time slots when the queue is backlogged, in general, waste work and should therefore be performed sparingly. Intuitively, as the controller identifies which subset of servers have service rates  µi > λ, it can focus its explorations on time slots when the queue is empty.

The above three cases build upon one another. The insight from one will point to a policy for the next, and we will therefore analyze the above cases in sequence. Under each of the above assumptions, we will find that there exists a policy such that the regret converges for all  (λ, ⃗µ)meeting the assumption. In contrast, in Section V, we show that there does not exist a policy that can achieve convergent regret over the class of all problems for which  µi ≤ λ, ∀i ∈ [N].

Note that the main difficulty in evaluating a policy’s performance is the correlation between the time-evolving queue backlog Q(t) and the history of observed offered service, which is used to find the best server. In general, the queue will be empty more often once we begin to identify the best server, and, likewise, the best server will be more easily identified when we begin to obtain more empty periods in which we can freely explore. The policies considered in this section have been chosen to disentangle this relationship so that well-known tools from statistics can be applied in our analysis. Specifically, our results will make use of Hoeffding’s inequality, which we restate here for convenience.

image

At the first time slot of the period, set u(t) = i uniformly at random over [N] and update  �µiwith the observed state of D(t) For other time slots in the empty period, idle

image

Fig. 2. Policy  π1for achieving Theorem 2. Each variable  �µiis initialized to zero prior to its first update.

image

x1 < x2. Then, by Hoeffding’s inequality, for all x > 0,

image

For simplicity, we will often use Hoeffding’s inequality in our analysis even when tighter concentration inequalities may exist.

A. All Servers Are Stabilizing

In this subsection, we assume all servers are stabilizing and design policies that are able to obtain O(1) regret under this assumption. Specifically, we will assume that the environment will only choose parameters  (λ, ⃗µ)from a subset  P1defined below.

Assumption 1.  (λ, ⃗µ) ∈ P1 ≜ {P : µi > λ, ∀i ∈ [N]}.

Under this assumption we will prove the following theorem, which states that there exists a policy such that, for every  (λ, ⃗µ)in the set  P1, the regret converges. The proof of Theorem 2 is constructive and will give a policy for obtaining the result.

Theorem 2. Under Assumption 1,  ∃π ∈ Πsuch that, for each  (λ, ⃗µ) ∈ P1, Rπ(λ,⃗µ)(T) = O(1).

To prove Theorem 2, we analyze the policy  π1shown in Fig. 2 on an arbitrary  (λ, ⃗µ) ∈ P1. This policy maintains sample mean variables  �µithat estimate the servers’ service rates  µi. Each sample mean  �µiis updated using observations of server i’s offered service. However, it is important to note that each sample mean is not updated at every time slot that we schedule its corresponding server, and we will therefore not use every observation of the offered services to construct the sample means. Instead, to facilitate analysis, we will only update the sample means at strategically chosen time slots. Throughout this section, sample mean variables are initialized to zero prior to their first update with a service observation. After the first update, the sample mean variables equal the sum of the used observations divided by the number of used observations.

Note that under policy  π1, the queue backlog will transition through alternating time intervals, wherein the queue is continuously empty over an interval of time (i.e.,  Qπ1(t) = 0) and then continuously busy over an interval of time (i.e., Qπ1(t) > 0). We enumerate the periods using positive integers p = 1, 2, . . . and refer to the  pthoccurrence of an empty period (busy period) as empty period p (busy period p, respectively). To arrive at Theorem 2, we will analyze the integrated queue backlog taken over the busy periods. To this end, we define the integral of the queue backlog over busy period p to be the summation of the queue backlog �t Qπ1(t)taken over those times t that are in busy period p. By our problem formulation, a unit cost is incurred by the system for each time slot each packet waits in the queue, and the integral of the queue backlog over busy period p can then be interpreted as the total cost accumulated over the busy period. In Fig. 3, we illustrate our terminology on a sample path of  Qπ1(t).

We now briefly point out some aspects of policy  π1. See Fig. 4, which provides an illustration of  π1. Under the policy, the first time slot of each empty period is used to update one of the sample means. To do this, the policy chooses one of the servers uniformly at random and schedules it. The amount of service that is offered from the chosen server is then observed and used to update the server’s corresponding sample mean. Note that the sample means are only updated during these time slots, and all other observations of the servers’ offered services, that are made during other time slots, are not used in our

image

Fig. 3. Example sample path for  Qπ1(t)over the first 16 time slots for some, unspecified,  (λ, ⃗µ). Empty period 1 occurs over time slots 0 and 1. Busy period 1 begins in time slot 2 and ends with time slot 9. The duration of the busy period is 8 time slots, and the integral of the queue backlog over the busy period is 14. Empty period 2 begins at time slot 10, and busy period 2 begins at time slot 12. The integral of the queue backlog over busy period 2 is 4. Finally, empty period 3 is shown beginning in time slot 15.

image

Fig. 4. Example sample path of  Qπ1(t)evolving over the first 30 time slots for some, unspecified,  (λ, ⃗µ). The background colors indicate the type of action policy  π1took during the corresponding time slots (see Fig. 2). During time slots marked in gray,  π1randomly selected one of the servers i and updated its sample mean  �µiwith the observed offered service. The sample means were only updated during these time slots. During time slots marked in orange,  π1scheduled the server with the (so-far) largest sample mean to service the queue. During time slots with a white background, no action was taken.

estimations. Now, following an empty period, when a packet arrives to the empty queue, the system enters a busy period. At the start of each busy period, the policy chooses to schedule the server with the highest sample mean (breaking ties between sample means according to any arbitrary decision rule) and continues to schedule that server until the queue finally empties and the busy period ends. Thus, during each busy period the queue is serviced by only one server (namely, the one with highest sample mean). We again stress that the policy does not use the observations of the service obtained during the busy period to update the sample mean. Notionally, the policy performs no exploration during busy periods and instead focuses on exploiting previous observations to empty the queue quickly.

Now, suppose we are at the beginning of busy period p and policy  π1chooses to schedule server i for this busy period. Then, the duration of the busy period (i.e., the amount of time that will be required to empty the queue) is given by a random variable  Xi, whose mean we denote  Xi. Note that conditioned on server i being scheduled during the busy period,  Xiis not a function of the busy period number p. By Assumption 1, for all  i ∈ [N]we have that  µi > λand therefore  Xiis finite. Now, to establish the proof of Theorem 2, we will be interested in the integral of the queue backlog taken over the busy period. Note that, as with the busy period’s duration, the integral of the queue backlog is given by a random variable, which we denote  Zi. Using  τto denote the arbitrary start of the busy period,

image

and has mean  Zi. As with the busy period duration, variable  Ziis not a function of the busy period number p.

We introduce one last piece of notation before moving to the proof of Theorem 2. At the start of each busy period, policy π1schedules the server with the highest sample mean to empty the queue (i.e., the server with the highest sample mean is scheduled to every time slot in the busy period until the queue empties). We will refer to this decision as  π1scheduling the busy period to server i. In the following analysis, we will be interested in the frequency with which the policy schedules each server i to busy periods. Therefore, we will use  Sπ1i (P)to denote the number of busy periods that  π1schedules to server i in the first P busy periods. We will use these variables to bound, in expectation, the number of busy periods that are scheduled to a server other than  i∗.

The proof of Theorem 2 now proceeds through four lemmas. We begin with Lemma 1. We denote the integral of the queue backlog taken over busy periods  π1schedules to i as:

image

For example, if in Fig. 4 the first and fourth busy periods were scheduled to server 1 but the second and third busy periods were scheduled to server 2, then �29t=0 Qπ1(t)1 {u(t) = 1} = 10and  �29t=0 Qπ1(t)1 {u(t) = 2} = 17. Lemma 1 shows that we can upper bound the queue length regret with the queue backlog summed over times that  π1is not scheduling  i∗. The proof follows from a sample path argument.

Consider a fixed outcome  ωfor the arrival and service processes and randomization in policy  π1. Then,  Qπ1(t, ω)is the queue backlog of policy  π1at time t for this sample path, and  Q∗(t, ω)is the backlog at time t for a controller that schedules i∗for all time slots since the start of time for this sample path. Now, assume at time t policy  π1is in a busy period where it is scheduling  i∗. Then,  Qπ1(t, ω) ≤ Q∗(t, ω). The intuition behind this claim follows from the fact that the queue is empty right before the busy period begins and  π1does not switch servers mid-busy period. Therefore, for the given sample path of arrivals and service processes, had the controller scheduled  i∗for all time slots since the start of time, the queue could not be less. Taking expectations and using the definition of queue length regret in (2) then gives the result. The complete proof of the lemma can be found in Appendix A.

Lemma 1.

image

Next, in Lemma 2, the expected value of  Ziis shown to be finite. Note that by our problem’s definition at most one packet can arrive to the queue at each time slot. Then, since a busy period must start with one packet in queue at its first time slot (i.e., the packet that arrives to the queue to begin the busy period), the maximum size of the queue backlog during the busy period cannot be larger than the busy period’s duration. Thus, the expected value of  Zican be bounded by the expected value of  X2i, which is finite. The complete proof of the lemma can be found in Appendix B.

Lemma 2.  Zi < ∞.

We next proceed to bound the expected value of (9) with Lemma 3. The lemma states that the integrated queue backlog (taken over busy periods scheduled to server i) up until time T is less than the integrated queue backlog (taken over busy periods scheduled to server i) through the  T thbusy period. Observe that the right hand side of (10), is the expected number of busy periods out of T busy periods scheduled to server  i, E [Sπ1i (T)], multiplied by the expected integrated queue backlog, Zi. This quantity is the expected value of the integral of the queue backlog, over busy periods scheduled to i, up to busy period T.

The intuition behind Lemma 3 is simple. Since, by definition, each busy period must be at least one time slot in duration, we cannot have more than T busy periods in T time slots. (In fact, since empty periods must also be one time slot in duration, it is obvious that we will have far fewer than T busy periods in any T time slots.) Therefore, for any outcome of the system’s performance, the integrated queue backlog up to time slot T will be less than the integrated queue backlog through the  T thbusy period. Taking expectations then gives the result. The complete proof of the lemma can be found in Appendix C.

Lemma 3.

image

Finally, in Lemma 4, we show that the expected number of times  π1schedules server  i ̸= i∗to busy periods is bounded by a finite constant which is independent of T. In order to schedule server  i ̸= i∗to a busy period,  π1must estimate  �µito be no less than  �µi∗. Since we obtain a new observation of one of the servers during each empty period p, using Hoeffding’s inequality, we can show that the probability that there exists a server  i ̸= i∗such that  �µi ≥ �µi∗decays exponentially in p. The result then follows from the convergence of geometric series. The proof of the lemma can be found in Appendix D.

image

image

At the first time slot of the period, set u(t) = i uniformly at random over [N] and update  �µiwith the observed state of D(t) For other time slots in the empty period, idle

image

Fig. 5. The policy  π2for achieving Theorem 3. Each variable  �µiis initialized to zero prior to its first update.

Then by Lemmas 2 and 4,  Rπ1(λ,⃗µ)(T) = O(1)giving the result.

As a final note, it is important to observe that, by its statement, Theorem 2 does not imply that there exists a constant that bounds  Rπ1(λ,⃗µ)(T)for all  (λ, ⃗µ) ∈ P1. It only states that for any chosen  (λ, ⃗µ) ∈ P1, under policy  π1, the queue length regret converges to a finite number that is dependent on the chosen parameters.

B. Non-Stabilizing Servers

In this subsection, we relax Assumption 1 to allow for non-stabilizing servers. Thus, we will now allow  µi ≤ λfor some strict subset of the servers. Importantly, we will assume that the controller is given a known convex summation over the servers’ rates that strictly dominates the arrival rate to the queue. Then, by randomizing over the servers using this convex summation, the controller can always stabilize the system. Concretely, we make the following assumption.

Assumption 2. For some given  αi ≥ 0and �i∈[N] αi = 1,

image

Observe that the set  P2(⃗α)defined above is determined by the values of  αithat are given to the controller and different values will lead to different sets. We emphasize that Assumption 2 does not imply that the controller knows any of the values of  µi, a priori. Rather, it states that the controller can be confident that the convex summation it is provided will meet the condition of (11). For example, if  α1 = 1then Assumption 2 reduces to  µ1 > λand this fact being known by the controller, a priori. Then, if at any given moment the controller wishes to return the queue to the empty state, it can simply resort to persistently scheduling server 1 and wait for the queue to empty. However, note that server 1 may not be the optimal server  i∗, and, as a result, persistently scheduling server 1 for every time slot will not in general give good queue length regret performance for every (λ, ⃗µ) ∈ P2(⃗α). Generalizing the above example to arbitrary  αivalues, we see that Assumption 2 implies that, if at any point the controller decides to resort to a strategy of scheduling at each time slot server i with probability  αi, it can be guaranteed that the queue will eventually empty. In other words, the values of  αidefine a stationary, randomized policy that has a service rate that dominates the arrival rate to the queue and can, therefore, empty the queue infinitely often. However, since by the assumption’s statement the  αivalues are not required to have any special relationship to  i∗, the randomized policy will not generally minimize regret, and its use should not be overly relied upon. Before continuing, we note that Assumption 2 is intentionally artificial. The assumption assumes a randomized policy is given to the controller and that the system will only encounter parameters for  λand  ⃗µsuch that the given policy can stabilize the queue. In most practical applications, this assumption is unrealistic. However, the results of this subsection will be highly instructive for the next subsection, where we relax Assumption 2 and no longer assume a known stabilizing policy is given to the controller. Importantly, in the following, we will develop a method that can sparingly use a suboptimal policy to consistently guarantee future empty periods for the queue without overly using the suboptimal policy so as to lose the bounded regret result.

We now give the main theorem of this subsection.

image

Fig. 6. Example sample path of  Qπ2(t)for some, unspecified,  (λ, ⃗µ). The background colors indicate the actions of policy  π2(see Fig. 5). During time slots marked in gray,  π2randomly updated one of the sample means,  �µi. During time slots marked in orange,  π2scheduled the server with the (so-far) largest sample mean. During time slots marked in blue,  π2used the randomized policy defined by  αi, because it had crossed the corresponding busy period’s time-out threshold. Note that the threshold increases linearly with each busy period number, and, for this example, the threshold was not reached in the third busy period.

Theorem 3. Under Assumption 2,  ∃π ∈ Πsuch that, for each  (λ, ⃗µ) ∈ P2(⃗α), Rπ(λ,⃗µ)(T) = O(1).

A policy that achieves Theorem 3 is given in Fig. 5 and is referred to as  π2throughout this subsection. The policy, similar to the policy  π1of Subsection IV-A, iterates over empty and busy periods. See Fig. 6 for an illustration of the policy. As with policy  π1, in the first time slot of each empty period, the policy schedules one server uniformly at random, observes the resulting offered service, and uses the observation to update a sample mean that estimates the server’s service rate. Again as with policy  π1, π2only uses the observations made during these time slots to update its sample means and does not use observations made during other time slots.

At the start of each busy period p, policy  π2starts the busy period by scheduling the server with the highest sample mean for the first p time slots of the busy period or until the queue empties and ends the busy period (whichever occurs first). However, if the busy period lasts longer than p time slots the policy hits a time-out threshold and resorts to using the randomized policy defined by Assumption 2 to bring the queue backlog back to the empty state. The time-out threshold grows linearly in busy period number p, and therefore with each busy period, the controller becomes more reluctant to call upon the randomized policy. In the following analysis, we will see that this growing reluctance is important to establishing the proof of Theorem 3.

We now establish Theorem 3 by analyzing  π2on an arbitrary  (λ, ⃗µ) ∈ P2(⃗α). The proof is derived through three lemmas. We begin by introducing some new notation that will facilitate understanding.

We begin by introducing notation that allows us to reference the period number p that a time slot t corresponds to. To this end, for each time t, define p(t) to be the period number for the empty or busy period in which t resides. Note that for a given sample path, p(t) maps time slots t = 0, 1, . . . to their corresponding period number p = 1, 2, . . . . For example, in Fig. 6, for t = 0, 1, . . . , 4, p(t) = 1 since these time slots occur during the first empty and busy periods and for t = 5, 6, . . . , 12, p(t) = 2 since these time slots occur during the second empty and busy periods.

Now, in our following analysis, we will be concerned with identifying the busy periods in which the policy does not strictly schedule server  i∗over the entire busy period. Note that an ideal policy would schedule  i∗over all busy periods, and therefore we wish to identify during which busy periods policy  π2falls short of this ideal. To facilitate this analysis, for each busy period p, we let  C(p) ∈ {0, 1}be an indicator that takes the value 1 if: either a server not equal to  i∗is first scheduled by policy π2at the start of the busy period or the time-out threshold is hit. Note that by hitting the time-out threshold, we mean that the queue did not empty during the first p time slots of busy period p. Otherwise, if neither of these two conditions are met, we let the indicator C(p) = 0. Note that by the definition of  π2, if any server  i ̸= i∗is scheduled during busy period p, then C(p) must equal 1 (see Fig. 5). This is because, by the definition of policy  π2, we can only schedule a server  i ̸= i∗during a busy period if: either the policy began the busy period by scheduling  i ̸= i∗or it began the busy period scheduling server i∗but still hit the time-out threshold, leading it to schedule the stabilizing policy. Since to minimize regret, our objective is to mimic as best as we can the policy that schedules  i∗for all time slots, we can notionally view C(p) equaling 1 as policy  π2failing to meet this ideal over busy period p.

In the following, we will combine the above two notations as C(p(t)). Then, if t is in a busy period, C(p(t)) = 1 if a server not equal to  i∗is scheduled at any time during the busy period in which t belongs. If t is in an empty period, C(p(t)) indicates whether during the busy period following this empty period, a server not equal to  i∗is ever scheduled. Then,

image

is the queue backlog integrated, up to time slot T, over those busy periods for which the conditions of C(p) are met. Note that if t is in an empty period,  Qπ2(t) = 0, and therefore the time slot contributes a value of 0 to the above summation regardless of the value of C (p(t)). For example, for the sample path in Fig. 6, if  π2scheduled  i∗for the start of busy period 3, then C(1) = C(2) = C(4) = 1 since these busy periods reached their time-out thresholds, C(3) = 0, and �29t=0 Qπ2(t)C (p(t)) = 28. Likewise, if for the sample path,  π2did not schedule  i∗for the start of busy period 3, then C(1) = C(2) = C(3) = C(4) = 1 and �29t=0 Qπ2(t)C (p(t)) = 30.Given this notation, we have the following lemma, which states that the regret is upper bounded in expectation by the sum of queue backlogs over those busy periods in which indicator C(p) = 1. The proof is similar to that of Lemma 1 and can be found in Appendix E.

Lemma 5. Rπ2(λ,⃗µ)(T) ≤ E��T −1t=0Qπ2(t)C (p(t))�.

Now, to upper bound the right-hand side of Lemma 5, we will want to bound the probability that indicator C(p) = 1. Recall that C(p) can equal 1 if one of the following events occurs: the policy begins the busy period by choosing a server other than  i∗to be scheduled for the first p time slots or if it crosses the time-out threshold. In the following lemma we establish that the probability of either event occurring diminishes exponentially in p. To understand the intuition behind this result, first note that at each busy period p, the probability that there exists a server  i ̸= i∗with a  �µi ≥ �µi∗decreases exponential with p. This follows from the fact that the policy updates one of the sample means with a new observation at each empty period and Hoeffding’s inequality. Thus, the probability that at busy period  p, i∗is not chosen to be scheduled over the first p time slots, is also exponentially decreasing with p. Note that this argument is very similar to the one made in the proof of Lemma 4. Now suppose that  i∗is scheduled at the start of busy period p. Then, since  µi∗ > λ, we can show that the probability the queue does not empty before reaching the time-out threshold p must decrease exponentially in p as well. We now state the lemma whose proof can be found in Appendix F.

Lemma 6. There exists positive constants  M0, χ, and  p0such that for all  p ≥ p0,

image

To complete our upper bound on the right-hand side of Lemma 5, we will need to bound the cost of indicator C(p) equaling 1. To this end, let variable  Zπ2(p) ∈ {1, 2, . . . }denote the integral of the queue backlog taken over busy period p. For example, for the sample path shown in Fig. 6,  Zπ2(1) = 5and  Zπ2(2) = 9. In Lemma 7, we show  E [Zπ2(p)| C(p) = 1]grows at most quadratically with p. This is because the queue backlog can grow at most linearly over the first p time slots of the busy period. Thus, if the time-out threshold is hit, the required amount of time that the randomized policy will need to empty the queue can also be at most on the order of p. Since, the maximum queue backlog at any time slot in a given busy period cannot be greater than the busy period’s duration, the expected integrated queue backlog will at most be on the order of  p2.

Lemma 7. There exists positive constants  M1and  β1such that for all p,

image

The proof of the lemma can be found in Appendix G. Before continuing to the proof of the theorem, we observe that  E [Zπ2(p)| C(p) = 0] ≤ p2, since C(p) = 0 implies that the time-out threshold was not reached. Thus, we see that for all  p, E [Zπ2(p)]is finite (implying the expected duration of each busy period is finite under policy  π2). We are now ready to establish the theorem. Proof of Theorem 3: Beginning with Lemma 5, we have that the regret is bounded by the integrated queue backlog over those busy periods such that the conditions of C(p) are met. i.e.,

image

Now, no more than T busy periods can occur in T time slots, since each busy period, by definition, must be one time slot long. In fact, as was noted in the previous subsection, there must be much fewer than T busy periods in T time slots, since the empty periods in between the busy periods must also be at least one time slot long, as well. Thus, we can bound the right hand side of (12), which is the integral of the queue backlogs over busy periods with C(p) = 1 up to time T, with the integral of the queue backlogs over busy periods with C(p) = 1 through the  T thbusy period. Formally,

image

Note that in going from (12) to (13), we have changed from summing over time slots t to summing over busy period number p.

image

At the first time slot of the period, set u(t) = i uniformly at random over [N] and update  �µiwith the observed state of D(t) For other time slots in the empty period, idle

image

Fig. 7. The policy  π3for achieving Theorem 4. Each variable  �µiis initialized to zero prior to its first update. The method uses the algorithm of Fig. 8 when the time-out threshold is hit.

Now, recall that C(p) is an indicator. When  C(p) = 0, Zπ2(p)C(p) = 0. Thus,

image

which follows from well-known results for series. Thus, we see that  Rπ2(λ,⃗µ)(T) = O(1), which gives the result.

C. Learning Stabilizing Policies

We now relax Assumption 2 and no longer assume that a randomized policy is given to the controller that can be relied upon to stabilize the queue. As a result, the controller will have to learn how to take actions so as to empty the queue infinitely often. Once the controller can accomplish this, it may use the empty periods to identify the best server. To this end, we only make the following assumption about the parameters  λand  ⃗µ.

Assumption 3.  (λ, ⃗µ) ∈ P3 ≜ {P : µi∗ > λ} .

We then have the following theorem, which is analogous to the previous subsections. The rest of this subsection will be devoted to proving its statement.

Theorem 4. Under Assumption 3,  ∃π ∈ Πsuch that, for each  (λ, ⃗µ) ∈ P3, Rπ(λ,⃗µ)(T) = O(1).

To prove the theorem, we will analyze policy  π3on an arbitrary  (λ, ⃗µ) ∈ P3. Policy  π3is shown in Fig. 7. The policy is very similar to  π2, except now when the time-out threshold of a busy period is reached, rather than relying on the known randomized policy given by Assumption 2,  π3uses the method of Fig. 8 to bring the queue back to the empty state. Starting from the time-out threshold’s crossing, the method is run until the queue finally empties and the busy period ends. As can be seen in Fig. 8, we use time slot counter n = 0, 1, . . . to indicate the number of time slots since the method’s invocation (i.e., n is the time slot number normalized to the time-out threshold’s crossing).

In its execution, the method in Fig. 8 dedicates certain time slots to exploration. During these dedicated exploration time slots, the policy chooses from the servers uniformly at random, observes the offered service of the chosen server, and updates the corresponding sample mean  �µpiusing the observation. Note that the sample means  �µpiare new variables that are used by the method during its execution at busy period p and are completely separate from the sample means  �µiin Fig. 7. The

image

Fig. 8. Learning algorithm for bringing the queue back to the empty state. Variable n counts the number of time slots since the algorithm’s invocation. The method uses a new set of sample means  �µpi to estimate the server rates. Note that  �µpi is a different variable from  �µi, and �µpi is only updated using observations made during busy period p. Each variable  �µpi is initialized to zero. The array V (n) counts the number of dedicated exploration time slots that have occurred since the start of the algorithm’s invocation. We require that the locations of the dedicated exploration time slots be chosen such that V (n) meets the condition of (15) for each time slot n.

image

for some positive constants  b1, b2, r, and  ϵ ∈ (0, 1). Note that (15) requires that the frequency of explorations diminishes with n and eventually falls below  1 − λµi∗. Subject to (15), the exact choice of the dedicated exploration times is left to implementation. Observe that our previous example that chose locations  n = j2meets the condition of (15).

We now turn to establishing the proof of Theorem 4. Since policy  π3is very similar to policy  π2, the proof of Theorem 4 will closely follow the proof of Theorem 3. The main difference will be that we use Lemma 8 below in place of Lemma 7 of the previous subsection.

We begin by noting that, using the definitions of p(t) and C(p) from the previous subsection, Lemmas 5 and 6 continue to hold for policy  π3. Specifically, as before, if we schedule a server  i ̸= i∗during any time slot of a busy period p, than we must either have started the busy period by scheduling a server  i ̸= i∗or we must have hit the time-out threshold. By the arguments of the previous subsection for policy  π2, both events again diminish exponentially in p for policy  π3. Then, the only step that is really missing in establishing the theorem is bounding the cost of having C(p) = 1 for busy period p. We proceed to do this next.

For policy  π3, let  Zπ3(p) ∈ {1, 2, . . . }denote the integrated queue backlog of busy period p. Then, we have the following lemma whose proof can be found in Appendix H.

Lemma 8. There exists positive constants  M2and  β2such that for all p,

image

Given Lemma 8, we now establish the proof. Note that, due to its similarity to the proof of Theorem 3, we have abridged a few steps. Proof of Theorem 4: Using Lemma 5 and the fact that no more than T busy periods can occur in T time slots,

image

Then, as in the proof of Theorem 3, we may write

image

where the last inequality follows from Lemmas 6 and 8. Since

image

In the previous section, we saw that the existence of a stabilizing server (i.e.,  µi∗ > λ) allowed for policies that had Rπ(λ,⃗µ)(T) = O(1). A natural question is whether we can obtain a similar result for the subset of problems  (λ, ⃗µ)that do not have stabilizing servers. We proceed to show that there cannot exist a policy that can achieve O(1) regret over this entire subset. The intuition behind this result is that, when  µi∗ ≤ λ, the system may only experience a finite number of empty periods, and therefore, policies cannot rely upon the empty periods to perform sufficient exploration of the servers’ rates.

Let  P4 ≜ {P : µi ≤ λ, ∀i ∈ [N]}. Then, we have the following theorem.

Theorem 5. For any policy  π ∈ Πthere exists a  (λ, ⃗µ) ∈ P4such that  Rπ(λ,⃗µ)(T) = Ω(T).

Proof: Consider a system with  λ = 1. Let  πbe any arbitrary learning policy in  Π. Then for all t > 0 the queue is non-empty. As a result, the queue backlogs are given by

image

where  D∗(t)is the service offered by server  i∗and  Dπ(t)is the service offered by the server scheduled by  πat time t. By the above equations, it is clear that at  t = 0, Qπ(0) = Q∗(0) = 0and at  t = 1, Qπ(1) = Q∗(1) = 1. Thus, Rπ(λ,⃗µ)(0) = Rπ(λ,⃗µ)(1) = 0. For T > 1, equations (17) and (18) imply

image

And, we can write

image

Otherwise, there would exist a learning policy that, given any  ⃗µ, always schedules the optimal server with probability one at time slot 1. (This would be a policy that guesses the best server with probability one, having not even made an observation

image

At each time t in the empty period, use the empty-period sampling rule to choose u(t) and update  �µu(t)with the observed state of D(t)

image

For the first p time slots of the busy period or until the queue empties, schedule  u(t) = arg maxi∈[N] �µi, updating �µu(t)with the observation of D(t) if The queue does not empty during the first p time slots then

At each time t until the queue empties, schedule the server with the maximum UCB1 weight, using the observed state D(t) to update  �µu(t)end if

image

Fig. 9. The heuristic technique used in our simulations. The method maintains sample mean variables  �µifor each server, which are updated with the observed offered service every time that server is chosen. We consider three different rules for sampling servers during empty periods. Arrivals to the queue begin after time step N.

of every server prior to its guess.) Clearly, no such policy in  Πcan exist. Therefore, because  E��t−1τ=1 1 {u(τ) = i}�is monotonically increasing in t, (21) implies that

image

for all  t ≥ 2and some  ⃗µ. Using this in (20) and then (19) gives the result.

The analysis of Section IV considered policies that proved the achievablility of O(1) queue length regret. As discussed in Section II these methods cause the tail of the regret to trend to zero. In order to keep our analysis tractable, the designed policies had to make concessions, described further below, that hurt their practical performance. Nevertheless, these policies do provide intuition about the problem, and with some fairly minor and intuitive changes, we show that these policies can be transformed into heuristic techniques that achieve good performance in simulation. The heuristics considered in this section use policy  π3as a template.

We now briefly cover the inefficiencies of the policy  π3of Section V. The first, and easiest to remedy, is that the policy only uses the first time slot of each empty period to sample a server. Clearly, a more efficient method would sample throughout the empty period, and the only reason policy  π3was designed not to do this is because it is not strictly necessary for the proof of Theorem 4 and simplifies the mathematics.

The next inefficiency is that policy  π3does not record the observations made of the chosen server during busy periods prior to reaching the threshold, and, after reaching the threshold, does not use its observations after the busy period expires. The reason for this is to facilitate simplicity in the proof of Theorem 4. Observations made during the busy period are correlated with the queue’s dynamics, and therefore if these observations are used during subsequent busy periods, one has to account for how the dynamics impact the probabilities of the previous observations. This would greatly complicate analysis. However, in practice, it makes sense that a policy should use all previous observations to inform its estimates of the servers’ rates.

Finally, policy  π3uses a simple exploration process for identifying a server to empty the queue once the threshold is reached during a busy period. This method is relatively easy to analyze with the queue’s dynamics. However, algorithms from the literature on multi-armed bandits, such as UCB1 described below, are generally better methods and should probably be used in practice. Analyzing how the mechanisms used by these algorithms to achieve good expected performance interact with the queue’s dynamics to reach the next empty period is an interesting direction for future research.

Given the above points, in Fig. 9 we provide a heuristic technique that uses the intuition provided in the analysis of policy π3. The technique uses one of three methods to explore the servers during empty periods. Method UCB-LE chooses to sample at each time step in the empty period the server that has been least explored so-far (i.e., observed the least number of times). Method UCB-UE randomly chooses a server to explore at each time step of the empty period, uniformly choosing among the servers. Finally, method UCB-WE randomly chooses from the servers proportional to the value of  �µi + b, where  �µiis the

image

Fig. 10. Simulations of a four server problem with service rates  ⃗µ = (0.1, 0.3, 0.5, 0.7). Different arrival rates  λare considered. (a)  λ = 0.4 (b) λ = 0.5(c)  λ = 0.6.

sample mean of the observations made so-far of server i and b is an additional term (set to 0.1 in our simulations). UCB-WE weights its explorations during empty periods towards the best performing servers.

During busy periods, the heuristic of Fig. 9 begins by choosing to schedule at each time slot the server with the highest sample mean until the busy-period threshold is reached (or until the queue empties). The observations made at this time are used to update the sample means of the servers. Once the threshold is reached, the method constructs at each time a UCB1 weight for each server and schedules the server with the highest weight. The UCB1 weight, proposed in [6] for solving the multi-armed bandit problem, is given by

image

where  �µiis the sample mean of the service that has been offered by server i so-far and  Tiis the number of observations made of the server. The method schedules according to the UCB1 weights until the queue empties.

In the following, we will compare the three versions of our heuristic (UCB-LE, UCB-UE, and UCB-WE) against a method that always chooses servers according to the UCB1 weight (during both empty and busy periods). UCB1 is a popular multi-armed bandit algorithm [6]. In our context, it is a method that maximizes the amount of offered service to the queue without regard to the queue’s state. In our simulations, all methods will spend the first N time slots sampling each of the N servers once before launching into their main techniques. The queue is kept empty during these first N time slots, and arrivals to the queue begin afterwards. In our plots, we calculate the queue length regret by averaging, across multiple runs, the difference in the queue backlog of the learning policies and the (genie) policy that always schedules the maximum rate server.

In Fig. 10, we explore the impact of increasing arrival rates on the four methods’ performance. In this scenario there are four servers with service rates 0.1, 0.3, 0.5, and 0.7. Arrival rates of 0.4, 0.5, and 0.6 are simulated. The queue length regret

image

Fig. 11. Simulations of a two server problem with arrival rate  λ = 0.4. Different service rates are considered. (a)  ⃗µ = (0.5, 0.6) (b) ⃗µ = (0.54, 0.6) (c)⃗µ = (0.58, 0.6).

averaged over 10,000 simulation runs is plotted versus time. In the three plots, we see that the heuristic methods generally outperform UCB1 in minimizing queue backlog. The queue length regret grows with increasing arrival rate as the system becomes more heavily loaded and there are fewer opportunities for free exploration during empty periods. Additionally, as the arrival rate becomes greater, the amount of time available for the queue to be empty decreases, and the relative difference in performance between the heuristic methods and UCB1 decreases. This effect appears to become more severe as the system becomes more heavily loaded.

Fig. 11 examines how a narrowing gap between server rates impacts performance. In this scenario there are two servers. The first server has its service rate tested at values of 0.5, 0.54, and 0.58, while the second server has a fixed rate of 0.6. The arrival rate to the queue is 0.4 and each plot shows the queue length regret averaged over 10,000 simulation runs. We see that decreasing the gap between the server rates causes the heuristic methods’ queue length regret to converge more slowly as the methods have a more difficult time settling on the optimal server. Although the methods outperform UCB1 in all three cases, their relative advantage, over the considered timescales, generally decreases as the gap between the servers’ rates decreases.

In this work, we considered a queueing system that has N available servers with unknown service rates. The objective of the system controller is to minimize the queue’s backlog by using observations of the service process to quickly identify the best server. To evaluate a learning policy’s performance, we introduced queue length regret, which is defined to be the expected difference in the accumulated queue backlog of the learning policy and the accumulated backlog of a controller that always schedules the optimal server. This metric quantifies the cost, in queueing delay, of having to learn the best server.

We proved that there exist queue-length based policies that have an order optimal O(1) regret for all systems such that µi∗ > λ. These policies achieve this result by taking advantage of times when the queue is empty and estimates of the servers can be improved without incurring extra packet delay. Importantly, we showed there exist policies that can ensure the queue empties infinitely often, so that this property can be exploited. In contrast to this result, we showed that traditional bandit algorithms that focus on only maximizing offered service to the system without accounting for the queue’s state can have a queue length regret that is  Ω(log T). Thus, to obtain optimal performance, it is insufficient to design controllers that only focus on maximizing the offered service to the queue.

In the problem considered herein, the controller’s estimates of the servers’ rates are highly correlated with the queue’s dynamics, and therefore analyzing the system’s stochastic behavior can be difficult. The theoretical policies analyzed in this work were chosen for their analytical tractability, and the simplicity of the chosen policies allowed us to use well-known tools from statistics to show the achievability of O(1) regret, which was our main objective. We then used insight from our theoretical analysis to design heuristic techniques with good performance. Finding policies with provable optimal performance over finite-time is a possible direction for future research.

image

Consider an outcome  ωand fixed sample paths  A(t, ω)and  Di(t, ω), ∀i ∈ [N]. Suppose at time t, policy  π1is schedulingi∗(i.e.,  u(t, ω) = i∗) and  Qπ1(t, ω) > 0. Define, s < t the last time before t that the queue was empty (i.e., the time slot preceding the start of this busy period). Then, since  π1does not change servers mid-busy period,

image

where  D∗(τ, ω)is equal to the service offered by server  i∗at time  τ. Now, let  Q∗(τ, ω)be the queue backlog under the controller that always schedules  i∗. Since  (Q∗(s, ω) − D∗(s, ω))+ ≥ 0, this implies that

image

Summing over time we then obtain

image

Taking expectation with respect to arrivals, departures, and the randomization in policy  π1,

image

Then, starting with the definition of  Rπ1(λ,⃗µ)(T),

image

Consider a busy period scheduled to server i. The length of the busy period and integrated backlog over the busy period are random variables  Xiand  Zi, respectively. Furthermore,

image

We proceed to bound  E�X2i�. To this end, define random variable  Yito be the difference between the number of arrivals and offered service from server i in a time slot. The variable has probability mass function

image

The mean of  Yiis denoted  Y iand note that by Assumption 1 it is negative. Let  Y kifor k = 1, 2, . . . be i.i.d. random variables with the distribution of  Yi.

image

random walk defined by  Yi. i.e.,

image

Note that each entry  Y kiaccounts for the difference in the number of arrivals and amount of service at a time step of the busy period. Thus, the  −1hitting time is the first time slot when the service dominates the arrivals and the queue empties. Given the above,

image

where the first inequality follows from Hoeffding’s inequality and the second inequality follows from  n > n0. Thus,

image

In the above, the second inequality follows from bounding  P(Xi > n − 1)with 1 for the the first  n0terms and (23) for all terms greater than  n0. Likewise, the last inequality follows from �∞n=0 n2e− 12 nδ2 < ∞. This establishes the result.

image

The proof follows from the fact that, with probability one, there can be no greater than T busy periods in the time interval 0 to  T − 1. Thus, the cost accrued over the busy periods occurring before time T, must be less than the cost accrued over the first T busy periods. Concretely, denoting the start time and finish time of busy period p with  spand  fp, we have that

image

Thus, using the definition of  Ziand  Sπ1i (T),

image

Define  γ ≜ 12 (µi∗ − µi). Note that  γis the half way distance between  µi∗and  µi. Furthermore, define  �µi(p)to be the sample mean that policy  π1has for server i at the start of busy period p and  Ti(p)the number of times the policy explored i during an empty period before busy period p (i.e., the number of times the policy has updated  �µi(p)). Recall  �µi(p) = 0if we have not yet sampled i. Then,

image

The above states that in order to schedule i during p, i must have a sample mean no less than  i∗, and this only occurs if one of the sample means deviates from its true mean by a distance  γ.

Now, for a sample mean to deviate by distance  γ, the policy must either have not observed i very often during exploration or it must have observed atypical observations during exploration. Thus, we further break up the above events:

image

Note that  Ti(p)and  Ti∗(p)are sums of Bernoulli random variables that take the value 1 when i is observed during empty period p and that 1pE [Ti(p)] = 1pE [Ti∗(p)] = 1N. Using Hoeffding’s inequality,

image

with a similar bound holding for  P�Ti∗(p) ≤ p2N�. Now, the number of times we have explored a server during the empty periods, is independent of the observations that were made. Therefore, we can use Hoeffding’s to obtain

image

p(t) denotes the period (empty or busy) that time slot t resides in.

image

that  C(p(t, ω), ω) = 0and  Qπ2(t, ω) > 0.5 Then, over this sample path, the server  i∗is scheduled over the entire busy period in which time slot t resides. By an argument similar to Lemma 1, this implies that  Qπ2(t, ω) ≤ Q∗(t, ω). Therefore, summing over t

image

Taking expectation,

image

Then using the definition of queue length regret,

image

image

Define  Ci∗(p)be an indicator random variable that equals 1 if we scheduled  i∗for the start of busy period p and hit the time-out threshold (i.e., the queue did not empty in the first p time slots). Otherwise, let  Ci∗(p)be 0. Then,

image

Define  �µi(p)to be the sample mean that policy  π2has for server i at the start of busy period p. Now, if we did not schedulei∗for the start of busy period p, then there must exist an  i ̸= i∗such that  �µi∗(p) ≤ �µi(p). Thus, using the union bound,

image

All that is left is to characterize  P(Ci∗(p) = 1). To do so, we will analyze the probability that the time-out threshold p is reached given  i∗is chosen to be scheduled at the start of busy period p. For an arbitrary time slot, the difference in the number of arrivals and available service from  i∗is given by a random variable  Yi∗that has probability mass function

image

Then, let random variable  Y ki∗be the difference in the number of arrivals and available service from  i∗in the  kthtime slot since the start of the busy period. Note that the random variables are i.i.d. with the distribution of  Yi∗above.

Given that  i∗is scheduled at the start of busy period p, we will analyze the accumulated difference between the number of arrivals and service process (since the start of the busy period) as a random walk  Y 1i∗ + Y 2i∗ + . . .. The busy period begins with one packet in queue. Therefore, if the random walk hits  −1in p time slots, the queue empties and the period ends before reaching the time-out threshold. Note that by Assumption 2, the mean  Y i∗ < 0and the walk has negative drift. Then,

image

Now, for any choice of  δ ∈�0, −Y i∗�, we have for all  p ≥ p0 ≜� 1−Y i∗−δ

image

where the first inequality follows from Hoeffding’s inequality and the second inequality from  p ≥ p0. Plugging in (26) and (28) into (25), we see that for all  p ≥ p0,

image

Then, we can clearly choose values of  M0and  χ, that are independent of p, to meet the lemma’s condition.

image

We define U(p) to be the amount of time that we schedule the randomized policy over busy period p. Then, an upper bound

on the duration of the busy period is p + U(p). Now, in an argument similar to Appendix B, we can bound the integrated

image

We therefore need a bound on the P(U(p) > n|C(p) = 1) for n = 1, 2, . . . . To this end, we analyze the difference between the number of arrivals and offered service (provided by the randomized policy) to the queue that would occur after threshold-time p. Let the average offered service of the randomized policy be

µα ≜ �i∈[N] αiµi. Define random variable  Yαto be the difference between the number of arrivals and service offered by the randomized policy in a given time slot. This random variable has probability mass function

image

Let  Y 1α , Y 2α , . . .be i.i.d. random variables with the distribution of  Yα. Each  Y kαis the difference in the number of arrivals and offered service in time slot k following the time-out threshold. By Assumption 2, the mean  Y α < 0and the random walk Y 1α + Y 2α + . . .has negative drift.

Now, given that we hit the time-out threshold, at most p + 1 packets are in the queue when we start scheduling the known randomized policy. Then, we see that the length of time that we could have possibly scheduled the randomized policy is bounded by:  {U(p) > n} ⊆ {Y 1α +Y 2α +. . . Y mα > −(p+1), ∀m = 1, 2, . . . , n}(i.e., the random walk has yet to hit  −(p+1)). Note that this is a pessimistic bound, since there may be less than p + 1 packets in the queue at the start of scheduling the randomized policy. Then, we can obtain for  n ≥ 1,

image

where the first inequality follows from Hoeffding’s inequality and the second from  n ≥ n0(p). Importantly, note that  n0(p)is a function of p. Putting this together, starting at (29), we see that

image

Then, since �∞n=0 ne− 12 nδ2 < ∞and �∞n=0 n2e− 12 nδ2 < ∞, it is easy to see that we can choose values of  M1and  β1such that the lemma holds.

image

Recall that  Zπ3(p)is the total queue backlog summed over the interval of busy period p. We begin by defining U(p) to be

the amount of time that the learning method of Fig. 8 is used during busy period p. Then, in an argument similar to the proof

image

We seek to bound P(U(p) > n|C(p) = 1) for n = 1, 2, . . . . In the following, we will use  �µpi (n)to denote the learning algorithm’s estimate of server i at time slot n (see Fig. 8) andT pi (n)the number of times the algorithm has observed server i during an exploration (i.e., the number of times  �µpi (n)has

been updated). Recall that n takes value 0 when the timeout threshold p is crossed. (i.e., Variable n counts the number of time slots since the threshold was crossed.) Likewise, we have that the number of explorations made by time slot n is bounded by

image

for positive constants  b1, b2, r, and  ϵ ∈ (0, 1). Finally, we define a constant  σ ∈ (0, 1)such that λµi∗ < 1 − σ, which by Assumption 3 must exist.

In order to analyze the event {U(p) > n}, we consider whether running the learning algorithm of Fig. 8 over n time slots would have caused the queue to empty at any time up to or before n. Recall that at most p + 1 packets can be in the queue when the learning method is initiated. We will then break up the event {U(p) > n} into two parts. First, the event that  �µpi∗was not the maximum estimate for any time slot over the interval  (σn, n − 1). Second, the event that  �µpi∗was the maximum estimate over the interval  (σn, n−1)(and, thus, at all non-exploration time slots  i∗was scheduled), but the arrival process plus the initial queue backlog (which can be no greater than p + 1) has dominated the service process up until time n. Concretely, for  n ≥ 1,

image

We now use the following lemmas. Their proofs can be found at the end of this appendix.

Lemma 9. For positive constants  c1and  c2that are independent of n and p,

image

Lemma 10. For  n ≥ c4 + c5p,

image

where  c3, c4, and  c5are positive constants that are independent of n and p.

We now use Lemmas 9 and 10 to construct a bound on  E�Zπ3(p)���C(p) = 1�.

image

where the first inequality follows from (31) and the last inequality follows from applying Lemmas 9 and 10 and the union bound to (33). Note that for any non-negative integer x, real number y > 0, and  z ∈ (0, 1],

image

Then one can see that (34) can be bounded by a quadratic function of p giving the result. Below we provide the proofs of Lemmas 9 and 10. Proof of Lemma 9:

image

To this end, we can apply the following bounds

image

We proceed via an argument similar to the proof of Lemma 4. Consider an  m ≥ ⌈σn⌉. Define

image

Then, as was argued in Lemma 4,

image

Recall that by time m, V (m) dedicated explore time slots have occurred. We proceed to bound the two events on the right-hand side of the above. As in Lemma 4, we break up these events into two cases; the event the method under observes a server i and the event the method observes atypical observations of the server. Concretely,

image

The probability of under observing i is then given by6

image

where the first inequality follows from Hoeffding’s inequality, the second inequality follows from plugging in the lower bound of (32), and the third inequality follows from  m ≥ ⌈σn⌉. A similar expression holds for  T pi∗(m). We next proceed to bound the probability that the server sees atypical observations.

image

Thus, putting this altogether starting from (35), we see that

image

where  c1and  c2are positive constants that are independent of n and p. This gives the lemma.

image

We begin by defining  τ(n)to be the set of time slots that are not dedicated exploration time slots in the interval  (σn, n−1).Let  D∗(m)be the service available from server  i∗at time slot m. Then we can bound,

image

where the last inequality follows by defining a threshold 12 ((1 − σ)µi∗ + λ) nand noting that  �n−1m=0 A(m) + p + 1 ≤�m∈τ(n) D∗(m)if  �n−1m=0 A(m) + p + 1is less than the threshold and  �m∈τ(n) D∗(m)is greater than the threshold.

We now turn to analyzing how the cardinality of  τ(n)scales with n. We begin by noting that for  n−1 < ⌈σn⌉, |τ(n)| = 0. Thus, the set is empty for relatively small n. For n large enough such that  n − 1 ≥ ⌈σn⌉, we have

image

since there are  (n − 1) − ⌈σn⌉ + 1time slots in the interval  (σn, n − 1), at most V (n) of them are dedicated exploration time slots, and we can use (15) to upper bound V (n). Using (37), we see that7

image

Given the above, for n large enough such that  |τ(n)| ≥ 1, we use simple algebra to make the following bound on (36).

image

(36)  ≤ P

image

Now, using (38), we see that

image

where the last inequality follows from the fact that 11−σ > 1. Thus, there exists an  n0that is independent of p such that

image

(39)  ≤ P

image

image

for appropriately chosen positive constant  c3, which is independent of p or n. Noting that for appropriately chosen positive constants  c4and  c5, max�n0, n1, p+1δ−θ�≤ c4 + c5pgives the lemma.

[1] T. Stahlbuhk, B. Shrader and E. Modiano, “Learning Algorithms for Minimizing Queue Length Regret,” in Proc. IEEE International Symposium on Information Theory, 2018.

[2] W. Thompson, “On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,” Bulletin of the American Mathematics Society, vol. 25, no. 3/4, pp. 285–294, 1933.

[3] S. Bubeck and N. Cesa-Bianchi, “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

[4] T. L. Lai and H. Robbins, “Asymptotically Efficient Adaptive Allocation Rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.

[5] R. Agrawal, “Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem,” Advances in Applied Probability, vol. 27, no. 4, pp. 1054–1078, 1995.

[6] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time Analysis of the Multiarmed Bandit Problem,” Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002.

[7] A. Garivier and O. Capp`e, “The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond,” in Proc. Conference On Learning Theory, 2011, pp. 359–376.

[8] S. Krishnasamy et al., “Regret of Queueing Bandits,” in Proc. Neural Information Processing Systems, 2016, pp. 1669–1677.

[9] D.P. Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientific, 2019.

[10] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, Second Edition, MIT Press, 2018.

[11] P. Auer and R. Ortner, “Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning,” in Proc. Advances in Neural Information Processing Systems, 2006.

[12] T. Jaksch, P. Auer and R. Ortner, “Near-Optimal Regret Bounds for Reinforcement Learning,” Journal of Machine Learning Research, vol. 11, pp. 1563–1600, 2010.

[13] M.G. Azar, I. Osband and R. Munos, “Minimax Regret Bounds for Reinforcement Learning,” in Proc. International Conference on Machine Learning, 2017.

[14] R. Fruit, M. Pirotta and A. Lazaric, “Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes,” in Proc. Advances in Neural Information Processing Systems, 2018.

[15] D. Russo et al., “A Tutorial on Thompson Sampling,” Foundations and Trends in Machine Learning, vol. 11, no. 1, pp.1–96, 2018.

[16] M. Strens, “A Bayesian Framework for Reinforcement Learning,” in Proc. Conference on Machine Learning, 2000.

[17] I. Osband, B. Van Roy and D. Russo, “(More) Efficient Reinforcement Learning via Posterior Sampling,” in Proc. Advances in Neural Information Processing Systems, 2013.

[18] I. Osband and B. Van Roy, “Near-Optimal Reinforcement Learning in Factored MDPs,” in Proc. Advances in Neural Information Processing Systems, 2014.

[19] I. Osband and B. Van Roy, “Why is Posterior Sampling Better than Optimism for Reinforcement Learning?,” in Proc. International Conference on Machine Learning, 2017.

[20] T.M. Modolvan and P. Abbeel, “Safe Exploration in Markov Decision Processes,” in Proc. International Conference on Machine Learning, 2012.

[21] F. Berkenkamp, et al., “Safe Model-Based Reinforcement Learning with Stability Guarantees,” in Proc. NIPS, 2017.

[22] A. Wachi et al., “Safe Exploration and Optimization of Constrained MDPs using Gaussian Processes,” in Proc. AAAI, 2018.

[23] A. Anandkumar et al., “Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011.

[24] O. Avner and S. Mannor, “Multi-user lax communications: a Multi-Armed Bandit approach,” in Proc. IEEE International Conference on Computer Communications, 2016.

[25] S. Cayci and A. Eryilmaz, “Learning for Serving Deadline-Constrained Traffic in Multi-Channel Wireless Networks,” in Proc. IEEE International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks, 2017.

[26] R. Combes et al., “Optimal Rate Sampling in 802.11 Systems,” in Proc. IEEE Conference on Computer Communications, 2014, pp. 2760–2767.

[27] Y. Gai, B. Krishnamachari and R. Jain, “Learning Multiuser Channel Allocations in Cognitive Radio Networks: A Combinatorial Multi-Armed Bandit Formulation,” in Proc. IEEE New Frontiers in Dynamic Spectrum, 2010.

[28] Y. Gai, B. Krishnamachari and R. Jain, “Combinatorial Network Optimization with Unknown Variables: Multi-armed Bandits with Linear Rewards and Individual Observation,” IEEE/ACM Trans. on Networking, vol. 20, no. 5, pp. 1466–1478, 2012.

[29] D. Kalathil and N. Nayyar and R. Jain, “Decentralized Learning for Multiplayer Multiarmed Bandits,” IEEE Trans. on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.

[30] M. Lelarge, A. Proutiere and M. Sadegh Talebi, “Spectrum Bandit Optimization,” in Proc. IEEE Information Theory Workshop, 2013.

[31] K. Liu and Q. Zhao, “Distributed Learning in Multi-Armed Bandit with Multiple Players,” IEEE Trans. on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.

[32] N. Nayyar, D. Kalathil and R. Jain, “On Regret-Optimal Learning in Decentralized Multiplayer Multiarmed Bandits,” IEEE Trans. on Control of Network Systems, vol. 5, no. 1, pp. 597–606, 2016.

[33] C. Tekin and M. Liu, “Online Learning Methods for Networking,” Foundations and Trends in Networking, vol. 8, no. 4, pp. 281–409, 2015.

[34] Y. Zhang et al., “Learning Temporal-Spatial Spectrum Reuse,” IEEE Trans. on Communications, vol. 64, no. 7, pp. 3092–3103, 2016.

[35] Y. Zhou et al., “Almost Optimal Channel Access in Multi-Hop Networks with Unknown Channel Variables,” in Proc. IEEE Distributed Computing Systems, 2014, pp. 461–470.

[36] L. Tassiulas and A. Ephremides, “Stability Properties of Constrained Queueing Systems and Scheduling Policies for Maximum Throughput in Multihop Radio Networks,” IEEE Trans. on Automatic Control, vol. 37, no. 12, pp. 1936–1948, 1992.

[37] L. Tassiulas and A. Ephremides, “Dynamic server allocation to parallel queues with randomly varying connectivity,” IEEE Trans. on Information Theory, vol. 39, no. 2, pp. 466-478, 1993.

[38] N. McKeown et al., “Achieving 100% Throughput in an Input-Queued Switch,” IEEE Trans. on Communications, vol. 47, no. 8, pp. 1260-1267, 1999.

[39] M. J. Neely, E. Modiano and C. E. Rohrs, “Power Allocation and Routing in Multibeam Satellites with Time-Varying Channels,” IEEE/ACM Trans. on Networking, vol. 11, no. 1, 2003, pp. 138-152.

[40] X. Lin and N. Shroff, “The Impact of Imperfect Scheduling on Cross-Layer Rate Control in Wireless Networks,” in Proc. IEEE INFOCOM, pp.1804-1814, 2005.

[41] L. Chen et al., “Cross-Layer Congestion Control, Routing and Scheduling Design in Ad Hoc Wireless Networks,” in Proc. IEEE INFOCOM, 2006.

[42] A. Sinha and E. Modiano, “Optimal Control for Generalized Network-Flow Problems,” IEEE/ACM Trans. on Networking, vol. 26, no. 1, pp. 506-519, 2018.

[43] C. Joo, “On the Performance of Back-Pressure Scheduling Schemes with Logarithmic Weight,” IEEE Trans. on Wireless Communications, vol. 10, no. 11, pp.3632-3637, 2011.

[44] D. Bethanabhotla, G. Caire and M.J. Neely, “WiFlix: Adaptive Video Streaming in Massive MU-MIMO Wireless Networks,” IEEE Trans. on Wireless Communications, vol. 15, no. 6, pp. 4088-4103, 2016.

[45] H. Yu and M.J. Neely, “Learning Aided Optimization for Energy Harvesting Devices with Outdated State Information,” in Proc. INFOCOM, 2018.

[46] I. Kadota, A. Sinha and E. Modiano, “Optimizing Age of Information in Wireless Networks with Throughput Constraints,” in Proc. IEEE INFOCOM, 2018.

[47] I. Kadota and E. Modiano, “Minimizing the Age of Information in Wireless Networks with Stochastic Arrivals,” in Proc. ACM MobiHoc, 2019.

[48] S. Krishnasamy et al., “Augmenting Max-Weight with Explicit Learning for Wireless Scheduling with Switching Costs,” in Proc. IEEE International Conference on Computer Communications, 2017.

[49] T. Stahlbuhk, B. Shrader and E. Modiano, “Learning Algorithms for Scheduling in Wireless Networks with Unknown Channel Statistics,” Ad Hoc Networks, vol. 85, pp.131–144, 2019.

[50] Q. Liang and E. Modiano, “Minimizing Queue Length Regret Under Adversarial Network Models,” in Proc. ACM Measurement and Analysis of Computing Systems, 2018.


Designed for Accessibility and to further Open Science