Resource Management in Wireless Networks via Multi-Agent Deep Reinforcement Learning

2020·arXiv

Abstract

Abstract

I. INTRODUCTION

One of the key drivers for improving throughput in future wireless networks, including fifth generation mobile networks (5G), is the densification achieved by deploying more base stations.

The authors are with Intel Corporation, Santa Clara, CA 95054. E-mails: {navid.naderializadeh, jerry.sydir, meryem.simsek, hosein.nikopour}@intel.com.

The rise of such ultra-dense network paradigms implies that the limited physical wireless resources (in time, frequency, etc.) need to support an increasing number of simultaneous transmissions. Effective radio resource management procedures are, therefore, critical to mitigate the interference among such concurrent transmissions and achieve the desired performance enhancement in these ultra-dense environments.

The radio resource management problem is in general non-convex and therefore computationally complex, especially as the network size increases. There is a rich literature of centralized and distributed algorithms for radio resource management, using various techniques in different areas such as geometric programming [1], weighted minimum mean square optimization [2], game theory [3], information theory [4], [5], and fractional programming [6].

Due to the dynamic nature of wireless networks, these radio resource management algorithms may, however, fail to guarantee a reasonable level of performance across all ranges of scenarios. Such dynamics may better be handled by algorithms that learn from interactions with the environment. Particularly, frameworks that base their decision making process on the massive amounts of data that are already available in wireless communication networks are well suited to cope with these challenges.

A specific subset of machine learning algorithms, called reinforcement learning (RL) methods, are uniquely positioned in this regard. In the simplest form, RL algorithms consider an agent which interacts with an environment over time by receiving observations, taking actions, and collecting rewards, while the environment transitions to the subsequent step, emitting a new set of observations. Collecting experiences in such a framework, the ultimate goal of these algorithms is to train the agent to take actions that maximize its reward over time. Recent years have seen the rise of deep reinforcement learning, where deep neural networks (DNNs) are used as function approximators to estimate the probability of taking each action given each observation, and/or the value of each observation-action pair. Deep RL algorithms have achieved resounding success in solving challenging sequential decision making problems, especially in various gaming environments, such as Atari 2600 and Go [7]–[10].

These promising results have motivated researchers in other domains to apply deep RL algorithms to attack challenging problems in their areas, especially when deriving optimal “groundtruth” solutions is difficult, if not impossible. Of particular interest to us are the numerous recent works that have attempted to tackle various radio resource management problems using deep RL techniques. In particular, [11], [12] use deep RL for the problem of spectrum sharing and resource allocation in cognitive radio networks. In [13], the authors propose a multi-agent deep RL approach for spectrum sharing in vehicular networks, where each vehicle-to-vehicle (V2V) link acts as an agent, learning how to reuse the resources utilized by vehicle-to-infrastructure (V2I) links in order to improve system-wide performance metrics. In [14], deep RL is leveraged to address demand-aware resource allocation in network slicing. Moreover, several works have focused on downlink power control in cellular networks using various single-agent and multi-agent deep RL architectures [15]–[19].

There are, however, several drawbacks on these prior works. First, most of these works intend to optimize a single metric or objective function, a prominent example of which is the sum-throughput of the links/users across the network. However, resource allocation solutions which optimize the sum-throughput often allocate resources unfairly among users, as they only focus on the average performance and fail to guarantee a minimum performance. Second, measurements of channels and other metrics at each node in a real-world wireless network reach the other nodes in the network with certain amounts of delay, while many of the past works assume ideal message passing among transmitter/user nodes. Moreover, many of the aforementioned RL-based solutions are not scalable, in the sense that they do not address the possible mismatch between training and deployment environments and do not typically consider the applicability and robustness of their solution to variations in the environment.

In this paper, we consider the application of deep RL techniques to the problem of distributed user scheduling and downlink power control in multi-cell wireless networks, and we propose a mechanism for scheduling transmissions using deep RL so as to be fair among all users throughout the network. We evaluate our proposed multi-agent deep RL algorithm using a system-level simulator and compare its performance against several decentralized and centralized baseline algorithms. In particular, we show that our trained agents outperform two decentralized baseline scheduling algorithms in terms of the tradeoff between sum-rate (representative of “cell-center” users, i.e., the ones with relatively good channel conditions) and rate (representative of “cell-edge” users with poor channel conditions). Moreover, our agents attain competitive performance as compared to a centralized binary power control method, called ITLinQ [4], [20].

Our proposed design for the deep RL agents is scalable, and ensures that their DNN structure does not vary with the actual size of the wireless network, i.e., number of transmitters and users. We test the robustness of our trained agents with respect to changes in the environment, and demonstrate that our agents maintain their performance gains throughout a range of network configurations. We also shed light on the interpretability aspect of the agents and analyze their decision making criteria in various network conditions.

We make the following main contributions in this paper:

• We introduce a multi-agent deep RL algorithm, which performs joint optimization of user

• We consider the agents’ observations to be undersampled and delayed, to account for real-

• We introduce a scalable design of observation and action spaces, allowing an agent with a

• We introduce a novel method for normalizing the observation variables, which are input to

• We utilize a configurable reward which allows us to achieve the right balance between

The rest of this paper is organized as follows. In Section II, we present our system model and formulate the problem. We then describe our proposed multi-agent deep RL framework, including the environment, observations, actions, and rewards in Section III. We present our simulation results in Section IV. We provide further discussion on the results and several future directions in Section V. Finally, we conclude the paper in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION

We consider the downlink direction of a wireless network, consisting of N access points (APs) user equipment devices (UEs) , as illustrated in Figure 1, where the APs intend to transmit data to the UEs across the network. We assume that each AP maintains a local pool of users associated with it based on long-term component of the channel gain, i.e., path-loss and shadowing. In particular, let us use to denote the reference signal received power by , and we define the set of UEs associated with , denoted by

Figure 1: A wireless network comprising multiple access points (APs) and user equipment devices (UEs). The solid green lines denote the signal links between APs and their associated UEs, while the dashed red lines denote (strong) interference links between APs and neighboring non-associated UEs.

We assume that each AP has at least one UE associated with it. This, in conjunction with (1),

ensures that the set of user pools for all APs is a partition of the entire set of UEs; i.e.,

We consider a synchronous time-slotted communication framework, where each the following two major decisions at each scheduling interval t:

• User scheduling: It selects one of the UEs from its local user pool to serve. We let

• Power control: It selects a transmit power level Pidenotes the

Given the above decisions, the received signal at (we removed the dependence on t for

brevity) at scheduling interval t can be written as

where hji(t) denotes the channel gain between each at scheduling interval t, xi(t) denotes the signal transmitted by at scheduling interval t and njthe additive white Gaussian noise at in scheduling interval t, with denoting the noise variance. This implies that the achievable rate of at scheduling interval t (taken as the

Shannon capacity) can be written as

We assume that the system runs for T consecutive scheduling intervals, and we define the

average rate of each over this period as

We now specify the following two metrics that we intend to optimize in this paper:

• Sum-rate: This metric is defined as the aggregate average throughput across the entire

• percentile rate: As the name suggests, this metric is defined as the average rate threshold

Note that sum-rate provides an indication of how high the throughputs of the users are on average across the network (i.e., “cell-center” users), while the percentile rate shows the performance of the worst-case users that are in poor channel conditions (i.e., “cell-edge” users). These two metrics are in natural conflict with each other, and our goal in this paper is to devise a scheduling algorithm that outputs a sequence of joint user scheduling and power control decisions across the network such that we achieve the best tradeoff between the two metrics.

A. Distributed Scheduling, Feedback and Backhaul Delays

In this paper, we particularly aim to design a distributed scheduling algorithm in the sense that each AP in the network should make its user scheduling and power control decisions on its own. In order to enable such decision making, the APs rely on information fed back to them by the UEs in a periodic manner. In particular, we assume that each status indicators and reports them back to its associated scheduling intervals. In order

Figure 2: An illustration of the timeline for reporting the status updates from the UEs back to their associated APs with a period of scheduling intervals and delay of scheduling intervals. These feedback reports are exchanged between the APs via a backhaul interface with a further delay of scheduling intervals.

to account for real-world measurement, processing and communication latencies, we assume that these feedback reports arrive at the AP with a certain delay of scheduling intervals.

In addition to the feedback each AP receives from its own associated UEs, we assume that there is some periodic message passing among the APs across the network via a delayed backhaul interface. To be specific, we consider the case that the feedback reports communicated between agents are additionally delayed for scheduling intervals. This allows each AP to have access to the feedback reports from the UEs associated with other (neighboring) APs as well, albeit with some delay.

Figure 2 visualizes how feedback reports are exchanged between the UEs and their associated and non-associated APs over time. For each access point , we denote by the most recent set of available measurements at scheduling interval t that have

been reported by each . As the figures shows, for any

III. PROPOSED DEEP REINFORCEMENT LEARNING FRAMEWORK

We model the distributed scheduler as a multi-agent deep RL system. In particular, we propose to equip each AP with a deep RL agent, as illustrated in Figure 3. As mentioned in Section II-A, each agent observes the state of the UEs in its local user pool, and it also exchanges information

Figure 3: Multi-agent deep reinforcement learning diagram, where the agents receive local observations from their associated UEs, while also receiving remote observations from their neighboring agents. Upon taking their actions, the agents receive a centralized reward, which helps them tune their policies to take actions that maximize their rewards over time.

with neighboring agents, thus observing the state of neighboring APs’ associated UEs. We utilize a centralized training procedure, collecting the experiences of all agents and using them to train a single policy, which is then used by all the agents. Even though training is centralized, the execution phase is distributed, with each agent making its own decision at each scheduling interval only based on the specific observations it receives from the environment.

A. Environment

The environment is parametrized by the physical size of the deployment area, number of APs and UEs, and parameters governing the channel models used to create AP-UE channel realizations. At the start of each episode, the environment is reset, resulting in a new set of physical locations for all the APs and UEs in the network, along with channel realizations between them, modeling both long-term and short-term fading components.

1) Observations: Observations available to each agent at each scheduling interval consist of

local observations, representing the state of the UEs associated with the corresponding AP and remote observations, representing the state of the UEs associated with neighboring APs. In the following, we will elaborate on these observations in more detail:

• Local observations: These observations are based on measurements made and reported

Figure 4: Observation exchange graph for the configuration in Figure 1 with N = 6 APs and n = 3 remote agents per AP.

• Remote observations: The user scheduling and power control decisions made by an agent

Figure 5 illustrates how the local and remote observations are concatenated together at each agent, resulting in a fixed-length observation vector. As there are n remote agents per agent, and k UEs’ observations are included per agent, the length of the observation vector for each agent equals 2(n + 1)k.

Figure 5: The fixed-length observation vector structure at each agent, composed of local and remote observations. With a slight abuse of notation, we use to denote the top-k selected UEs associated with the agent, and denote the top-k selected UEs associated with the iremote agent, all sorted based on their PF ratios.

Remark 1: Note that the dimension of the observation space, and therefore the input size to the deep RL agent’s neural network, does not depend on the number of APs and/or UEs in the network. This makes our algorithm scalable regardless of the specific environment parameters.

Remark 2: When there are fewer than k UEs associated to an AP, we set the corresponding values in the local/remote observations to default values, similar to a zero-padding operation. In particular, we use default values of 0 for weight and -60 dB for SINR.

Remark 3: The current LTE and future 5G cellular standards developed by 3GPP support periodic channel quality indicator (CQI) feedback reports from UEs to their serving base stations. Moreover, the weights can either be reported by the UEs, or the base stations may keep track of the long-term average rates of their associated UEs. The base stations can also exchange observations among each other through backhaul links, such as the X2 interface [21]. Therefore, our proposed observation structure is completely practical and may be readily implemented in current and future cellular networks.

2) Actions: As mentioned in Section II, at each scheduling interval, each AP needs to select

a target UE from its user pool to serve, and a transmit power level to transmit data to the scheduled UE. To jointly optimize the user scheduling and power control decisions, we define a joint action space, where each action represents a (transmit power level, target UE) pair.

We quantize the range of positive transmit powers (potentially non-uniform) discrete power levels. Moreover, because the number of UEs associated with an AP can be varying and/or potentially large, we take a similar approach to deal with this issue as in forming the observations: At each scheduling interval, we limit the choice of target UE to one of the top-k UEs included in the local observations. This is also reasonable because the agent has information solely on those k users in its local observation vector.

Given the above considerations, the number of possible actions for each agent at each scheduling interval is 1 + pk, where the additional action is one in which the agent remains silent for that scheduling interval and selects none of the top-k UEs to serve. In the event that the AP has fewer than k associated UEs and erroneously selects a target UE which does not exist, we map the action to the “off” action, indicating that the AP should not transmit.

Remark 4: Note that similar to the observation space, the action space dimension does not depend on the network size as well. This allows us to have a robust agent architecture that can be trained in a specific environment, and then deployed on a different environment in terms of, for instance, number of APs and/or UEs compared to the training environment. In Section IV-F, we show how well the agent performs in such mismatched scenarios.

3) Rewards: As shown in Figure 3, we utilize a centralized reward based on the actions of

all the agents at each scheduling interval. In particular, assuming that each has selected to serve at scheduling interval t, the reward emitted to each of the agents is

a weighted sum-rate reward, calculated as

where is the most recent reported weight measurement by available at the rate achieved by is a parameter which determines the tradeoff between R, the two metrics that we intend to optimize. Specifically, turns the reward to sum-rate, favoring cell-center UEs, while changes the reward to approximately the summation of the scheduled UEs’ PF ratios, hence appealing to cell-edge UEs.

There are, however, two exceptions to the reward emitted to the agents, which are as follows:

• All agents off: Due to the distributed nature of decision making by the agents, it is possible

• Invalid user selected: As mentioned in Section III-A2, it might happen that an AP has fewer

B. Normalizing the Agents’ Observations and Rewards

It is widely known that normalizing DNN inputs and outputs have a significant impact on its training behavior and predictive performance [22]. Before training or testing our model in a specific environment, we first create a number of environment realizations in an offline fashion and run one or several simple baseline scheduling algorithms in those realizations. While doing so, we collect data on the observations and rewards of all the agents throughout all environment realizations and all scheduling intervals within each realization. We then leverage the resulting dataset in the following way to pre-process the observations and rewards before using them to train the agent’s DNN:

• Observations: As mentioned in Section III-A1, we have two types of observations: weights

ple percentiles of the observed weights. In particular, we consider Q percentile values , denoted respectively by, as depicted in

• Rewards: We follow a well-known standardization procedure for normalizing the rewards,

Figure 6: Example empirical distributions of weight and SINR observations (blue) and their corresponding percentile values (orange) using Q = 10 percentile levels.

C. Training and Validation Procedure

We consider an episodic training procedure, in which each episode represents a realization of the environment in which the locations of the APs and UEs and channel realizations are randomly selected following a set of probability distributions and constraints on minimum APAP and UE-AP distances. We control the density of APs and UEs in our environment by fixing the size of the deployment area and selecting different numbers of APs and UEs for different training sessions. Because the channels between APs and UEs depend heavily on their relative locations, each new episode allows the system to experience a potentially unexplored subset of the observation space. The associations between UE and AP take place as in (1) as a new episode begins, and remain fixed for the duration of that episode. An episode consists of a fixed number of T scheduling intervals, where at each interval, the agents decide on which user scheduling and power control actions the APs should take.

We further structure the training process into epochs, each of which consists of a fixed number of consecutive training episodes. At the completion of each epoch, we pause training in order to evaluate the current policy against a fixed set of validation environments carefully selected

to be representative of all possible environments. We use a score metric R, defined as

to quantify the performance after each epoch and select the best model during training as the

one achieving the best performance in terms of R

IV. SIMULATION RESULTS

In this section, we first mention the details of the wireless system parameters. Next, we present the baseline algorithms that we use to compare the performance of our proposed method against. We then discuss our considered deep RL agents and their corresponding parameters. Finally, we proceed to present our simulation results.

A. Description of the Wireless Environment

We consider networks with UEs, dropped randomly within a 500m square area. We impose a minimum AP-AP distance of 35m and AP-UE distance of 10m. We consider a bandwidth of 10MHz, maximum AP transmit power of Ppower spectral density of dBm/Hz, and episode length of T = 2000 scheduling intervals.

The communication channel between APs and UEs consists of three different components:

• Path-loss: We consider a dual-slope path-loss model [23], [24], which states that the path-

• Shadowing: We assume that all the links experience log-normal shadowing with a standard

performance is typically three times more challenging than enhancing cell-center performance. Hence, the score metric Remphasizes the percentile rate three times more than the average rate of the UEs across the network.

• Short-term fading: We use the sum of sinusoids (SoS) model [25] for short-term flat

As for the feedback reports, each UE is assumed to sample and send its measurements to its associated AP every scheduling intervals, and these reports arrive at the associated AP after a delay of scheduling intervals. Moreover, we assume a backhaul delay of scheduling intervals for observation exchange among the APs.

B. Baseline Algorithms

We compare the performance of our proposed scheduler against several baseline algorithms.

• Full reuse: At each scheduling interval, each AP schedules the UE in its local user pool

• Time division multiplexing (TDM): The UEs are scheduled in a round-robin fashion. In

C. Deep RL Agents and their Hyperparameters

We consider the following two different types of agents:

• Double Deep-Q Network (DQN) [7], [27]: We consider a double DQN agent, which is a

• Advantage Actor-Critic (A2C) [31]: We use the OpenAI baselines [32] implementation

For validation purposes during training, we create a set of 50 validation environments, whose average and percentile rates are within a 5% relative error of those achieved over 1000 random environment realizations by both full reuse and TDM baseline algorithms. We define an epoch as a group of 10 consecutive episodes, and we run the training for 200 epochs, or equivalently, 2000 episodes, amounting to a total of 16 million and 40 million training scheduling intervals for DQN and A2C, respectively (due to different numbers of parallel environments). Once training is complete, we test the resulting models across another randomly-generated set of 1000 environment realizations.

As for the observations, we consider each agent to include weights and SINRs from k = 3 UEs having the highest PF ratios, alongside receiving remote observations from a set of n = 3 remote agents. This implies that the size of the observation vector of each agent at each scheduling interval, hence the size of the input layer of each agent’s neural network, is equal to 2(n+1)k = 24. We consider the moving average parameters for the long-term average rate and interference at the UEs to be , respectively. We also consider a set of Q = 20 percentile levels for mapping and normalizing both weight and SINR observations, calculated using an offline dataset generated by both full reuse and TDM baseline algorithms.

We consider a binary power control policy, where an AP at any given scheduling interval is either off, or serves a UE with full transmit power. This implies that the total number of actions, and therefore the size of the output layer of each agent’s neural network, equals 1 + pk = 4. Moreover, for the reward function as defined in (18), we consider , which helps strike the right tradeoff between the average and percentile rates as we will show next.

For each type of environment configuration, we train 5 models, utilizing different random number generator seeds. In the following sections, we report the mean of the results across the trained 5 models, with the shaded regions around the curves illustrating the standard deviation.

Remark 5: We analyze how many remote agents’ observations should be included in each agent’s observations, where as mentioned in Section III-A1, the remote agents are selected based on physical proximity. Figure 7 shows the variation in the mean long-term UE SINR in networks with APs and 100 UEs, where for each configuration of N APs, the interference includes the contribution from n remote agents physically-closest to the serving AP. As the figure demonstrates, by far the largest reduction in SINR occurs when the closest AP transmits. Moreover, the curves flatten out as the number of included interference terms increase, indicating that interference from farther APs is less consequential and may be safely omitted from the observation space. This justifies why including observations from n = 3 remote agents is a reasonable choice.

Figure 7: Impact of the number of strongest interferers included in the interference calculation on the average long-term SINR of UEs in networks with N

Figure 8: The sum-rates, percentile rates, and scores achieved by the DQN and A2C agents alongside baseline algorithms over the 50 validation environments, when training on environments with 4 APs and 24 UEs.

D. Validation Performance during Training

We first demonstrate how the behavior of the model evolves as training proceeds. Figure 8 illustrates the evolution of validation sum-rate, percentile rate, and the score metric Rdefined in (21), when training on environments with N = 4 APs and K = 24 UEs. As the plots in Figure 8 show, both DQN and A2C agents initially favor sum-rate performance, while suffering in terms of the percentile rate. As training proceeds, the agents learn a better balance between the two metrics, trading off sum-rate for improvements in terms of percentile rate. As the figure shows, A2C achieves a better sum-rate, while DQN achieves a better coverage and also a better score, outperforming the centralized ITLinQ approach after only 12 epochs. Moreover, DQN converges faster than A2C, due to better sample efficiency thanks to the experience buffer.

As mentioned in Section III-C, for each training run, we select the model at the epoch which

Figure 9: Test results on the sum-rate and percentile rate from models trained on environments with 4 APs and various numbers of UEs, where the model trained on each configuration was deployed on the same configuration during the test phase.

yields the highest score level. We can then use the resulting model to conduct final, large-scale,

test evaluations upon the completion of training, as we will show next.

E. Final Test Performance with Similar Train and Test Configurations

In this section, we present the final test results for models tested on the same configuration as the one in their training environment. Figure 9 demonstrates the achievable sum-rate and percentile rate for the environments with 4 APs and varying numbers of UEs. As the plots show, our proposed deep RL methods significantly outperform TDM in both sum-rate and percentile rate, and they also provide considerable percentile rate gains over full reuse. Our reward design helps the agents achieve a balance between sum-rate and percentile rate, helping the DQN agent attain percentile rate values which are on par with ITLinQ for large numbers of UEs (32-40), while outperforming it for smaller numbers of UEs (16-24). The A2C agent, on other hand, performs consistently well in terms of the sum-rate, approaching ITLinQ as the number of users increases across the network.

In Figure 10, we plot the sum-rate and percentile rate for the configurations with 40 UEs and different numbers of APs. As the figure shows, the relative trends are similar to the previous case in terms of sum-rate, with A2C outperforming ITLinQ for networks with 8 APs, but in terms of the percentile rate, both agents outperform TDM and full reuse, while having inferior performance relative to the centralized ITLinQ approach as the number of APs, and equivalently the number of agents, gets larger.

Figure 10: Test results on the sum-rate and percentile rate from models trained on environments with 40 UEs and various numbers of APs, where the model trained on each configuration was deployed on the same configuration during the test phase.

F. Final Test Performance with Discrepant Train and Test Configurations

As mentioned before, our design of the observation and action spaces is such that they have a fixed size regardless of the actual training configuration. We test the robustness of our models with respect to network density by testing policies trained on one density deployed in environments of other densities. To reduce clutter, in the following, we only plot the average results over the 5 seeds and remove the shaded regions representing the standard deviation of the results.

We first test models trained on environments with N = 4 APs and different numbers of UEs against each other, and plot the results in Figure 11. We observe that all DQN and A2C agents are robust in terms of both metrics. Interestingly, the A2C model trained on the case with 16 UEs has a much better performance (especially in terms of sum-rate) than its counterpart DQN model as the number of UEs in the test deployment increases. For models with 40 UEs, however, DQN tends to perform better, especially in terms of the percentile rate.

Next, we cross-test the models trained on environments with 40 UEs and different numbers of APs against each other. Figure 12 shows the sum-rates and percentile rates achieved by these models. All the models exhibit fairly robust behaviors with the exception of the DQN model trained on configurations with 4 APs, whose percentile rate performance deteriorates for higher numbers of APs. Note that in this case, the number of agents changes across different scenarios, and we observe that in general, training with more agents leads to more capable models, which can still perform well when deployed in sparser scenarios, while training with few agents may not scale well as the number of agents increases.

Figure 11: Test results on the sum-rate and percentile rate from models trained on environments with 4 APs and various numbers of UEs, where the model trained on each configuration was deployed on all the other configurations as well. The first and second element of each tuple in the legends represent numbers of APs and UEs in the training environment, respectively.

Figure 12: Test results on the sum-rate and percentile rate from models trained on environments with 40 UEs and various numbers of APs, where the model trained on each configuration was deployed on all the other configurations as well. The first and second element of each tuple in the legends represent numbers of APs and UEs in the training environment, respectively.

Remark 6: We have also tested our trained models with observations mapped using 20 percentile levels on test environments in which the observations were mapped using different numbers of percentile levels. For the scenario with N = 4 APs and K = 24 UEs, we observed that using 10-100 percentile levels during the test phase achieves results very similar to (within 3% of) the ones obtained using 20 percentile levels. This shows that our proposed approach is very robust to the granularity of mapping the observations fed into the agent’s neural network.

Figure 13: Weight vs. SINR scatter plot of the top UE of a DQN agent trained and tested on networks with N = 4 APs and K = 24 UEs. Red (resp., green) points represent the scenarios where the agent decided to stay silent (resp., serve one of its top-3 UEs).

G. Interpreting the Agent’s Decisions

In this section, we attempt to interpret our trained agent’s decisions during the test phase. In particular, we collect data on the inputs and outputs of a DQN agent, trained on a network with N = 4 APs and K = 24 UEs and tested on the same configuration. Using this data, we will try to visualize the agent’s actions in different situations.

Figure 13 shows a scatter plot of the SINR and weight of the agent’s “top UE,” i.e., the UE in the AP’s user pool with the highest PF ratio. The red points illustrates the cases where the agent decided to remain silent, while the green points represent the cases in which the agent served one of its top-3 UEs. As expected, higher weights and/or higher SINRs lead to a higher chance of the AP not being off. Quite interestingly, the boundary between the green and red regions can be approximately characterized as , which is effectively a linear boundary on the PF ratio; i.e., the agent decides to be active if and only if the PF ratio of its top UE is above some threshold that it has learned based on its interactions with the environment.

Given that the PF ratio is a reasonable indicator of the status of each UE, Figure 14 compares the PF ratios of the top-3 UEs included in the agent’s observation and action spaces in the cases where the agent decided to serve one of the those UEs. As the figure shows, the agent’s user scheduling decision heavily depends on the relative difference between the PF ratios of the top-3

Figure 14: Comparison of the PF ratios of the top-3 UEs observed by a DQN agent trained and tested on networks with N = 4 APs and K = 24 UEs for the cases in which the agent decided to serve one of those UEs.

Figure 15: Comparison of the PF ratios of the DQN agent’s top UE and the remote agents’ top UEs for an agent trained and tested on networks with N = 4 APs and K = 24 UEs. Red (resp., green) points represent the scenarios where the agent decided to remain silent (resp., serve its top UE).

UEs. In general, the second and third UEs have some chance of being scheduled if they have a PF ratio close to that of the top UE. However, this chance is significantly reduced for the third UE, as highlighted by the regions corresponding to different user scheduling actions.

Moreover, Figure 15 shows the impact of remote observations on the agent’s power control decisions. In particular, the figure demonstrates the cases where the agent either remains silent (red points), or decides to serve its top UE (green points). We observe that the agent learns a non-linear decision boundary between the PF ratio of its top UE and the PF ratios of the top UE of each remote agent. Notably, the green region becomes larger as we go from the left plot to the right plot. This implies that the agent “respects” the PF ratio of the top UE of its closest remote agent more as compared to the second and third closest remote agents, since the interference between them tends to be stronger, hence their actions impacting each other more significantly.

V. DISCUSSION

In this section, we discuss some of the implications of our proposed framework in more detail and provide ideas for future research on how to improve upon the current work.

A. Analysis on the Number of Observable UEs by the Agent

As described in Section III, we bound the dimension of the agent’s observation and action space by selecting a finite number of k UEs, whose observations are included in the agent’s observation vector. We select these UEs by sorting the user pool of each AP according to their PF ratios. In this section, we shed light on the tradeoffs implied by such a “user filtering” method.

We first analyze the actions taken by the DQN agent when trained and tested in a network with N = 4 APs and K = 24 UEs. Recall that the agent can either take an “off” action, or decide to serve one of its top-3 UEs. We observe in Figure 16a that the algorithm selects action 0 (no transmission) or 1 (the top UE) most of the time and rarely selects the other two UEs. In this formulation of the algorithm, including information from more than 3 UEs would most likely not improve the performance. This seems logical given that the PF ratio represents the short-term ability of the UE to achieve a high rate (represented by the SINR term) along with the long-term demand to be scheduled for the sake of fairness (represented by the weight term).

Given our goal of having a scalable agent that can be employed by any AP having an arbitrary number of associated users, it is not feasible to have an agent which can observe all its UEs at every scheduling interval in a general network configuration. However, we conducted a controlled experiment, where we restricted the environment realizations to the ones in which all APs have a constant number of associated UEs. In particular, we considered a configuration with N = 4 APs and K = 24 UEs, where each AP has exactly 6 UEs associated with it. In such a scenario, we are indeed able to design an agent, which can observe the state of all k = 6 of its associated UEs, as well as all UEs associated of all its n = 3 remote agents. Furthermore, we did not sort the UEs of each agent according to their PF ratios. This ensures that each input port to the agent’ neural network contains an observation from the same UE over time. Note that the size of the observation vector is now 2(n + 1)k = 48 and the number of actions equals 1 + pk = 7.

After training and testing a DQN agent on the above scenario, we observed that the resulting sum-rate and percentile rate were within 6% of the original model using the sorted top-3

Figure 16: Distribution of the DQN agent’s actions in a network with N = 4 APs and K = 24 UEs, where (a) the agent observes the top-3 UEs sorted based on their PF ratios, and (b) each AP is restricted to have exactly 6 associated UEs, and the agent is able to observe all 6 UEs in an unsorted manner. For the latter case, the UE indices on the x-axis represent the rank of the selected UE at the corresponding scheduling interval according to the UEs’ PF ratios, even though such a ranking was not used for sorting the UEs in the observation and action spaces.

UEs. This demonstrates that the algorithm is indeed able to learn user scheduling without the “aid” of sorting the UEs by PF ratio. Moreover, Figure 16b demonstrates the percentage of time that the model with observations from all (unsorted) UEs selects the “off” action and each of the UEs. For the purposes of this figure, we sorted the UEs by their PF ratios, so the percentage of time that the algorithm selects the top-UE represents the percentage of time that the agent selects the UE with the top PF ratio, regardless of its position in the observation/action space. We see that the algorithm learns to select the UE with the highest PF ratio most often, but interestingly, the distribution among the various UEs is slightly more even as compared to Figure 16a. This implies that by letting the agent observe all UEs in an arbitrary order, its resulting user scheduling behavior is similar to a PF-based scheduler, but not exactly the same. Further interpretation of this result is left for future study.

B. Enhanced Training Procedure

One of the unique challenges to training agents to perform scheduling tasks in a multi-node wireless environment, is that we can only simulate individual snapshots of the environment and explore a limited subset of the state space. In our our training procedure, we fixed the parameters governing our environment (size of the deployment area, number of APs and UEs, minimum distance constraints, maximum transmit powers, etc.) and selected AP and UE locations randomly at each environment reset. We utilized multiple parallel environments to ensure that training batches contained experiences from different snapshots. This approach, along with a carefully-selected learning rate schedule and a sufficiently long training period, allowed our agents to achieve the performance that we reported in Section IV.

An interesting question remains whether a more deliberate selection of training environments can accelerate training and/or result in better-performing agents. The fact that we can simulate only a limited number of environment realizations at a time provides an opportunity to guide the exploration and learning process.

One possible direction is to systematically control the density of the environments experienced during training, by systematically varying the parameters governing the environment. Possible strategies could be to move from less dense to more dense deployments or vice versa, or to more carefully select training batches to always include experiences from a range of densities.

Another possible direction is to control the complexity of the environments on which the agent is being trained. Intuitively, some environment realizations are easy and some are hard. For example, an environment in which all UEs are very close to their associated AP, i.e., cell-center scenario, is easy because the optimal policy is for the APs to transmit at every scheduling interval. An environment where all UEs are clustered around the borders between the coverage areas of adjacent APs, i.e., cell-edge scenario, is slightly more difficult because the optimal policy is for neighboring APs to coordinate not to transmit at the same time. Environments in which users are distributed throughout the coverage areas of the APs are, however, much more complex because the optimal user scheduling and power control choices become completely non-trivial. Various approaches to curriculum learning could potentially be applied [34], [35]. The main difficulty with this approach is determining a more granular measure of environment complexity and a procedure for generating environment realizations that exhibit the desired difficulty levels.

C. Capturing Temporal Dynamics

One of the main challenges faced by the scheduler is dealing with delayed observations available to the agent. We have shown that our agents are able to successfully cope with this problem, but an interesting question is whether we can include recurrent architectures in the agent’s neural network to learn, predict, and leverage network dynamics based on the sequence of observations it receives over the course of an episode.

Two approaches are possible. The first is to include recurrent neural network (RNN) elements, such as long short-term memory (LSTM) [36], or attention mechanisms such as Transformers [37], at the inputs, with the goal of predicting the actual current observations based on the past history of delayed observations. A second approach would be to place the RNN/attention elements at the output similar to [38], [39]. One thing to note here is that in order for the system to learn the temporal dynamics of the environment, it must be exposed to the observations of each individual UEs over time. This means that the approach of including observations from the top-3 UEs, sorted by their PF ratios, will likely not work and observations from all UEs must be included in an unsorted order. This is challenging due to the variable number of UEs associated to each AP, and we leave this for future work.

VI. CONCLUDING REMARKS

We introduced a distributed multi-agent deep RL framework for performing joint user selection and power control decisions in a dense multi-AP wireless network. Our framework takes into account real-world measurement, feedback, and backhaul communication delays and is scalable with respect to the size and density of the wireless network. We show, through simulation results, that our approach, despite being executed in a distributed fashion, achieves a tradeoff between sum-rate and percentile rate, which is close to that of a centralized information-theoretic scheduling algorithm. We also show that our algorithm is robust to variations in the density of the wireless network and maintains its performance gains even if the number of APs and/or UEs change in the network during deployment.

REFERENCES

[1] A. Gjendemsjø, D. Gesbert, G. E. Øien, and S. G. Kiani, “Binary power control for sum rate maximization over multiple interfering links,” IEEE Transactions on Wireless Communications, vol. 7, no. 8, pp. 3164–3173, 2008.

[2] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4331–4340, 2011.

[3] L. Song, D. Niyato, Z. Han, and E. Hossain, “Game-theoretic resource allocation methods for device-to-device communi- cation,” IEEE Wireless Communications, vol. 21, no. 3, pp. 136–144, 2014.

[4] N. Naderializadeh and A. S. Avestimehr, “ITLinQ: A new approach for spectrum sharing in device-to-device communication systems,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1139–1151, 2014.

[5] X. Yi and G. Caire, “ITLinQ+: An improved spectrum sharing mechanism for device-to-device communications,” in 2015 49th Asilomar Conference on Signals, Systems and Computers. IEEE, 2015, pp. 1310–1314.

[6] K. Shen and W. Yu, “FPLinQ: A cooperative spectrum sharing strategy for device-to-device communications,” in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 2323–2327.

[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[8] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move evaluation in Go using deep convolutional neural networks,” arXiv preprint arXiv:1412.6564, 2014.

[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.

[10] C. Berner, G. Brockman, B. Chan, V. Cheung, P. D˛ebiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.

[11] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, “Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach,” IEEE Access, vol. 6, pp. 25 463–25 473, 2018.

[12] A. Tondwalkar and A. Kwasinski, “Deep reinforcement learning for distributed uncoordinated cognitive radios resource allocation,” arXiv preprint arXiv:1911.03366, 2019.

[13] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2282–2292, 2019.

[14] Y. Hua, R. Li, Z. Zhao, X. Chen, and H. Zhang, “GAN-powered deep distributional reinforcement learning for resource management in network slicing,” IEEE Journal on Selected Areas in Communications, 2019.

[15] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE International Conference on Communications (ICC). IEEE, 2017, pp. 1–7.

[16] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks with deep Q learning approach,” in ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019.

[17] K. I. Ahmed and E. Hossain, “A deep Q-learning method for downlink power allocation in multi-cell networks,” arXiv preprint arXiv:1904.13032, 2019.

[18] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2239–2250, 2019.

[19] G. Zhao, Y. Li, C. Xu, Z. Han, Y. Xing, and S. Yu, “Joint power control and channel allocation for interference mitigation based on reinforcement learning,” IEEE Access, vol. 7, pp. 177 254–177 265, 2019.

[20] N. Naderializadeh, O. Orhan, H. Nikopour, and S. Talwar, “Ultra-dense networks in 5G: Interference management via non-orthogonal multiple access and treating interference as noise,” in 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall). IEEE, 2017, pp. 1–6.

[21] G. Nardini, A. Virdis, and G. Stea, “Modeling X2 backhauling for LTE-advanced and assessing its effect on CoMP coordinated scheduling,” in 2016 1st International Workshop on Link-and System Level Simulations (IWSLS). IEEE, 2016, pp. 1–6.

[22] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9–48.

[23] J. G. Andrews, X. Zhang, G. D. Durgin, and A. K. Gupta, “Are we approaching the fundamental limits of wireless network densification?” IEEE Communications Magazine, vol. 54, no. 10, pp. 184–190, 2016.

[24] X. Zhang and J. G. Andrews, “Downlink cellular network analysis with multi-slope path loss models,” IEEE Transactions on Communications, vol. 63, no. 5, pp. 1881–1894, 2015.

[25] Y. Li and X. Huang, “The simulation of independent Rayleigh faders,” IEEE Transactions on Communications, vol. 50, no. 9, pp. 1503–1514, 2002.

[26] C. Geng, N. Naderializadeh, A. S. Avestimehr, and S. A. Jafar, “On the optimality of treating interference as noise,” IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1753–1767, 2015.

[27] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[29] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 2681–2690.

[30] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, “Quantifying generalization in reinforcement learning,” arXiv preprint arXiv:1812.02341, 2018.

[31] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,” arXiv preprint arXiv:1611.05763, 2016.

[32] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “OpenAI baselines,” https://github.com/openai/baselines, 2017.

[33] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.

[34] G. Hacohen and D. Weinshall, “On the power of curriculum learning in training deep networks,” arXiv preprint arXiv:1904.03626, 2019.

[35] D. Seita, D. Chan, R. Rao, C. Tang, M. Zhao, and J. Canny, “ZPD teaching strategies for deep reinforcement learning from demonstrations,” arXiv preprint arXiv:1910.12154, 2019.

[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[38] A. Gruslys, W. Dabney, M. G. Azar, Piot, M. G. Bilal, Bellemare, and R. Munos, “The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning,” in Seventh International Conference on Learning Representations (ICLR), 2018.

[39] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in Seventh International Conference on Learning Representations (ICLR), 2019.

Designed for Accessibility and to further Open Science