PoPS: Policy Pruning and Shrinking for Deep Reinforcement Learning

2020·Arxiv

Abstract

Abstract

The recent success of deep neural networks (DNNs) for function approximation in reinforcement learning has triggered the development of Deep Reinforcement Learning (DRL) algorithms in various fields, such as robotics, computer games, natural language processing, computer vision, sensing systems, and wireless networking. Unfortunately, DNNs suffer from high computational cost and memory consumption, which limits the use of DRL algorithms in systems with limited hardware resources.

In recent years, pruning algorithms have demonstrated considerable success in reducing the redundancy of DNNs in classifica-tion tasks. However, existing algorithms suffer from a significant performance reduction in the DRL domain. In this paper, we develop the first effective solution to the performance reduction problem of pruning in the DRL domain, and establish a working algorithm, named Policy Pruning and Shrinking (PoPS), to train DRL models with strong performance while achieving a compact representation of the DNN. The framework is based on a novel iterative policy pruning and shrinking method that leverages the power of transfer learning when training the DRL model. We present an extensive experimental study that demonstrates the strong performance of PoPS using the popular Cartpole, Lunar Lander, Pong, and Pacman environments. Finally, we develop an open source software for the benefit of researchers and developers in related fields.

keywords—Deep reinforcement learning, deep neural network, pruning algorithms.

I. INTRODUCTION

Deep reinforcement learning (DRL) algorithms have attracted much attention in recent years due to their capability to provide a good approximation of the objective value in decision making tasks while dealing with very large state and action spaces. In contrast to classic reinforcement learning methods that perform well for small-size models but perform poorly for large-scale models, DRL combines a deep neural network (DNN) with reinforcement learning for overcoming this issue. The DNN is used to map from states to actions in large-scale models so as to yield a policy that maximizes the objective value. In DeepMind’s recently published Nature paper [1], [2], a DRL algorithm was developed to teach computers how to play Atari games directly from the on-screen

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Dor Livne and Kobi Cohen are with the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501 Israel. Email:dorliv@bgu.ac.il, yakovsec@bgu.ac.il

This work was supported in part by the U.S.-Israel Binational Science Foundation (BSF) under grant 2017723, and by the Cyber Security Research Center at Ben-Gurion University of the Negev under grant 076/16.

pixels, and strong performance was demonstrated in many tested games. In recent years, there is a growing attention of using DRL methods in various fields, such as robotics, natural language processing, computer vision, sensing systems, and wireless networking. A survey of recent studies can be found in [3].

The superior performance of DRL algorithms in decision making tasks has triggered the need to make them practically appealing when using cheap hardware devices. For example, industrialization of artificially intelligent engines in controlled manufacturing processes often requires small and cheap sensing devices to detect and respond to events. Another example is automatizing users in wireless communication and Internet of Things (IoT) systems which often consist of low power, computationally limited and battery constrained nodes.

This issue has been recognized by the industry, and there are more and more players in the market that develop chips for low-power devices that support computationally intensive deep learning algorithms with low-power consumption. Prominent examples are Qualcomm Artificial Intelligence Engine, Intel’s EyeQ family of system-on-chip (SoC) devices, Intel’s Myriad 2 family, NXP’s ADAS chip, and more. Along with these industrial developments, establishing fundamental algorithmic methods to reduce the size of DRL models is crucial for making them practically appealing for a wide range of applications that use systems with limited hardware resources. This challenge has triggered a new and fascinating research direction: How to train DRL models with strong performance while achieving compact representations of the DNNs? In this paper we address this issue.

A. Contributions

In recent years, pruning algorithms have demonstrated considerable success in reducing the redundancy of DNNs in classification tasks (see related work in Section I-B). The basic idea is to remove low-ranked neurons from the DNN which are redundant and do not contribute significantly to the output, resulting in a sparse DNN with a small number of non-zero parameters and faster operations. Pruning DNN is an iterative process, as illustrated in Fig. 1. The popular vanilla version works as follows [4]. The first step is to train the model. Then, the algorithm enters an iterative process. At each iteration, the algorithm prunes the low-ranked connections with weights below a certain threshold. Then, it fine-tunes the model to recuperate from the pruning procedure. Eventually, the pruning procedure generates a sparse DNN, which is tractable for hardware implementations [5].

Despite the success of pruning methods in classification tasks, they suffer from a significant performance reduction

Fig. 1. An illustration of the pruning procedure. The first step is to train the model, the second is to prune it, and the last is to fine-tune the pruned model. The algorithm iterates the last two steps.

in the DRL domain. There exist a number of recent studies that developed a DRL framework to guide the pruning process of a DNN [6], [7]. However, those methods did not address the problem of training a DRL model with a compact representation as we consider here. The objective in [6], [7] was to train DNNs with a compact representation used in classification tasks (and not in the DRL domain). The pruning actions were guided by a trained non-compact DRL model. In this paper, we develop the first effective solution to the performance reduction problem of pruning in the DRL domain, and establish an algorithm that achieves the state-of-the-art results in terms of training DRL models with strong required performance while minimizing the size of the DNN in the DRL domain. Specifically, our contributions are summarized below: 1) Developing a novel Policy Pruning and Shrinking (PoPS) algorithm for training DRL models: In the DRL domain, the goal of an agent is to learn a policy, which is a mapping from a state to an action. Thus, the objective value at each state-action pair is not given as in supervised learning, but need to be learned online by exploring actions. Therefore, the agent faces the well known exploration versus exploitation dilemma. On the one hand, the agent should explore all actions in order to figure out their influence on the objective function. On the other hand, it should exploit the information gathered so far to choose the best actions. Existing pruning techniques do not operate well in the DRL domain since the ground truth is not given. Therefore, they must explore the state space in order to recuperate from the pruning procedure by interacting with the environment. This leads to a significant instability in the pruning procedure and increases the performance loss. As a result, when implementing pruning in the DRL domain, our algorithm must guard against losing a significant information regarding the optimal policy. Once the algorithm has this ability, it can detect redundancy in the DRL model, and train a regenerated shrunk dense model with strong performance. Building upon this insight, we develop the first effective

solution to the performance reduction problem of pruning in the DRL domain. Specifically we establish a novel Policy Pruning and Shrinking (PoPS) algorithm to train DRL models with strong performance while achieving a compact representation of the DNN. PoPS executes an iterative procedure using three main steps to achieve this goal. In the first step, PoPS leverages the power of transfer learning to capture the full information regarding the desired policy. Specifically, PoPS trains a teacher network using a large-scale DRL model to yield a policy that maximizes the objective function without pruning. In the second step, PoPS executes a novel transfer learning-based policy pruning procedure, which is controlled by the teacher, to find an efficient pruned representation of the model. The policy pruning step avoids the direct interaction with the environment when implementing pruning in the DRL domain. Since the teacher that controls the pruning process has already explored and exploited actions successfully, the fine tuning step has the ability to remove redundancy iteratively without the need of exploring actions which are less likely to contribute significantly to the objective function. In the third step, namely the policy shrinking step, PoPS regenerates and trains a newly-constructed smaller dense model based on the redundancy measured by the policy pruning procedure. The policy pruning and policy shrinking steps are repeated until the algorithm can no longer detect any redundancy. A detailed description of PoPS algorithm is given in Section III.

2) Open source software: We developed an open source software implementation of the PoPS algorithm. PoPS was developed using Python and is available at GitHub (see link in [8]). The open source software is built in an object oriented programming fashion which makes it easy to adjust to different environments and model settings. Our experimental study demonstrates the versatility of the software in four different challenging environments, as detailed in the next paragraph. We encourage the use of the open source software by researchers and developers in related fields.

3) Achieving the state-of-the-art performance in extensive experimental study: We evaluated our framework using four different challenging environments from the OpenAI gym library [9], namely Cartpole, Lunar Lander, PONG, and Pacman environments, associated with different model architectures, namely deep Q-network (DQN) with a feed-forward fully connected network, Actor-Critic network with a feed-forward fully connected network, DQN with a Convolutional Neural Network (CNN), and dueling DQN with CNN, respectively. We compared PoPS with the commonly used magnitude-based weight pruning [4] using the gradual pruning framework [10], dubbed Magnitude-Base Gradual Pruning (MBGP), which is known to achieve strong performance in many DNN architectures, and the Knowledge DistillationBased Pruning (KDBP) algorithm which demonstrated improvements in the pruning procedure in various classification settings by using knowledge distillation when pruning DNNs [11]. In all tested environments, PoPS generated a compact representation of the DRL model with strong performance and a size of less than 1% of the initial representation size. By contrast, the MBGP and KDBP algorithms were not able to reduce the size below 28% in the Cartpole environment,

68% in the Lunar Lander environemnt, 6% in the Pong environment, and 8% in the Pacman environment, of the initial representation size without significantly degrading the agent’s performance. The results obtained by PoPS present the state-of-the-art performance in terms of training DRL models with strong required performance while minimizing the size of the DNN in the DRL domain.

B. Related Work

DRL algorithms have attracted much attention in recent years due to their capability to provide a good approximation of the objective value while dealing with very large state and action spaces. In DeepMind’s recently published Nature paper [2], a DRL algorithm was developed to teach computers how to play Atari games directly from the on-screen pixels, and reported strong performance in many games. In [12], strong performance was found for several players in MNIST games and the switch riddle. A Double Q-learning method has been proposed in [13], where two DNNs simultaneously learn the policy and value evaluation. A survey of recent studies can be found in [3]. This recent success has triggered the need to make DRL algorithms practically appealing when using cheap hardware devices. This is particularly relevant for DRL algorithms in sensing and inference systems (see [14]–[17] and references therein), IoT, and wireless communication networks [18]–[24], that often require to operate using low power, computationally limited and battery constrained devices. The proposed PoPS framework in this paper addresses this challenge.

DNNs are typically over-parameterized, i.e, there is a significant redundancy in deep learning models [25]. This redundancy leads to a waste of both computation and memory usage, which limits their use in resource-constrained devices. Therefore, in recent years, various methods have been developed to find efficient compact representations of DNNs. In [4], the authors developed a pruning method followed by vector quantization and Huffman coding [26] to compress neural networks. In [27], a pruning technique, called Network Trimming, was developed, where neurons are pruned based on the statistics of neuron activations. In [28], an iterative twostep algorithm was developed to effectively prune channels in each layer. In [10], the authors augmented Tensorflow with a pruning framework to prune the network connections during the training phase. In [29], the authors developed a layer-compensated pruning algorithm for a layer-wise compensate filter pruning. In [30], they proposed the LeGR algorithm, which is parameterized to learn layer-wise affine transformations over the filter norms. It constructs a learned global ranking to achieve an efficient resource-constrained filter pruning. In [6], [7], (non-compact) DRL models were used to explore the architecture space of the model in order to find an efficient architecture. Specifically, in [6], DRL was used to determine which layers should be dropped and the size of the remaining layers. In [7], DRL was used to evaluate an efficient sparsity ratio of each layer. However, those methods did not address the problem of training a DRL model with a compact representation as we consider here. In [31], the authors developed the incremental network quantization (INQ), an iterative network quantization method that incorporates three interdependent operations: weight partitioning, group-wise quantization, and re-training. In [32], [33], LightNN algorithm that replaces the multiplications to one shift or a constrained number of shifts and adds in binarized NNs to improve the energy efficiency was developed and analzyed. Finally, the existing pruning methods suffer from a significant performance reduction in the DRL domain as explained and demonstrated in Sections I-A, IV.

Training a student network using the outputs of a trained teacher network was first introduced in [34], with the goal of compressing a large ensemble model to a single network. This approach was extended later in [35] to transfer knowledge from a deep neural network to a shallow one. More recently, it was used for policy distillation to train smaller dense networks that perform at the expert level [36]. In this paper, we extend the method of training a student network to guard against the performance loss of pruning in the DRL domain. Transfer learning methods to fine-tune the model in the pruning procedure were used in [11], [37] in the form of Knowledge Distillation [38] to improve the stability and robustness of the pruning process in classification tasks. These methods, however, did not address the performance reduction problem of pruning in the DRL domain, which we address in this paper.

II. SYSTEM MODEL AND PROBLEM STATEMENT

A. System Model

Consider a system consisting of an agent who interacts with an environment. Let S be the state space that the system can reach, and be the system state at time t. Let A be the action space that the agent can take. At each time (say t), the agent takes action , receives reward , and the system transits to state , which is observed (or partially observed) by the agent. An illustration of the system is given in Figure 2.

Fig. 2. An illustration of the system model. At each time (say t), the agent takes action , receives reward , and the system transits to state , which is observed (or partially observed) by the agent.

Let

be the accumulated discounted reward, where is a discounted factor, and T is the time horizon of the control problem. The agent’s objective is to find a policy that maximizes the expected accumulated discounted reward:

where E denotes the expected accumulated discounted reward when the model performs policy .

Q-learning is a reinforcement learning method that aims at finding good policies to solve the dynamic programming problem in (2). It has been widely applied in various decision making problems, primarily because of its ability to evaluate the expected utility without requiring prior knowledge about the system model, and its ability to adapt when stochastic transitions occur [39]. The basic idea of Q-learning is to approximate the objective value (referred to as Q-value) for all possible state-action pairs. This is done by minimizing the Time Difference (TD) error using a stochastic approximation method that judiciously trades off between exploring all available actions to figure out their influence on the objective value, and exploiting actions which are likely to be more valuable in terms of maximizing the objective value.

While Q-learning performs well when dealing with small action and state spaces, it becomes impractical when the problem size increases, mainly for two reasons: (i) A stored lookup table of Q-values for all possible state-action pairs is required, which makes the storage complexity intolerable for large-scale problems. (ii) As the state space increases, many states are rarely visited, which significantly decreases performance.

In recent years, DRL methods that combine DNN with Qlearning, have shown great potential for overcoming these issues. Using DRL, the DNN maps from the (partially) observed state to an action, instead of storing a lookup table of Qvalues. Furthermore, large-scale models can be represented by the DNN well so that the algorithm can preserve good performance for very large-scale models. A well known DRL algorithm was presented in DeepMind’s Nature paper [2] and demonstrated superior performance for teaching computers how to play Atari games directly from the on-screen pixels. Since then, there is a growing attention on using DRL methods in various fields (see related work in Section I-B).

C. The Objective

Unfortunately, DNNs are usually very demanding in computational power and memory usage, which limits the use of DRL algorithms in systems with limited hardware resources. Therefore, the objective of this paper is to develop a fundamental method to train DRL models with strong performance in terms of solving (2) while achieving compact representations of the DNNs. In the next section we address this challenge.

III. THE POLICY PRUNING AND SHRINKING (POPS) ALGORITHM

As discussed in Section I-A.1, existing pruning algorithms do not effectively apply to the DRL domain, due to the performance loss with respect to the policy learning incurred by the pruning procedure when interacting directly with the environment. The suggested PoPS algorithm is designed to solve the issue of the performance reduction of pruning in the DRL domain. The framework is based on a novel iterative policy pruning and shrinking method that leverages the power of transfer learning when training the DRL model.

Note that running a sort of exhaustive search over layer sizes to find compact representations gradually becomes impractical in large-scale NNs due to the huge architecture space. The basic idea of PoPS is to use the power of pruning-based technique in detecting the redundancy over the DNN by dictating the compression direction in the huge architecture space. Thus, it avoids the need of running a laborious gradual search.

A. Architecture of the PoPS Algorithm

In this section we present the architecture of PoPS algorithm. PoPS aims to find an efficient, compact, and redundantfree representation of the DNN. An illustration of PoPS architecture is given in Fig. 3. We next discuss in details each component in the architecture.

1) Creating a teacher: This part of PoPS algorithm trains an agent to obtain a policy that solves (2) using a large-scale redundant DNN. The output of this module is the DRL model with the corresponding DNN. This output is used as a teacher that guides the iterative policy pruning and shrinking steps.

In order to train the teacher to solve (2), two commonly used training methods were incorporated in the framework. This makes the framework applicable to a wide range of decision making tasks. The first is a Q-learning type that uses the DNN (referred to as Deep Q-network or DQN) to map from the observed state to a Q-value for each action [2]. The action that maximizes the Q-value is selected (according to -greedy distribution). Note that standard Q-learning methods use the same parameters for estimating the target and the current Qvalue, resulting in a large correlation between the TD target and the updated parameters at every training step. This makes the training procedure highly unstable. Thus, the target DQN overcomes this issue by using a separate network with fixed parameters for estimating the TD target, where every N steps the algorithm copies the parameters from the DQN network to the target network. Denote the DQN network by Q and the target network by . Then, the updates to the Q-values satisfy the following equation:

Fig. 3. An illustration of the PoPS framework.

Fig. 4. An illustration of the IPP module used in PoPS algorithm.

Fig. 5. An illustration of the Shrinking module used in PoPS algorithm.

where the subscript t denotes the time index, and

is the learned value obtained by getting reward after taking action in state , moving to the next state , and then taking action that maximizes the future Q-value

seen at the next state. The term is the old learned value. Thus, the algorithm aims at minimizing the TD error between the learned value and the current estimate value. The learning rate is set to , which is typically set close to zero.

The second commonly used method that we incorporate in

the PoPS framework to train the teacher is the Advantage Actor-Critic method [40]. The actor critic model uses two neural networks. The first is an actor neural network that we denote by , where is the actor network parameters. The second is a critic network denoted by , where w is the critic network parameters. is used to determine the optimal action and is used to estimate the state-value function of the current state. Intuitively, the actor observes the environment reaction to it with an action, while the critic observes the actor and provides a feedback. and are trained simultaneously. The increments of the parameters , respectively, at time t, satisfy the following equations:

and

To further reduce the high variability in value-based methods, we use the advantage function instead of the value function such that:

where

This prodedure evaluates the gain in taking a specific action as compared to the average rewarded action at a given state.

2) Experience Replay: Once PoPS has trained the teacher, it periodically stores batches of randomly selected experience replay instances in the buffer (where typically the batch size is between 10, 000 and 30, 000). Note that for large-scale problems, the buffer capacity should not be too small to avoid missing important experiences and suffering from a performance reduction while replacing old experiences that reside in the buffer. In our experiments, the buffer capacity was set to 100, 000. Then, the iterative policy pruning and shrinking steps are trained by the experience replay instances and do not interact directly with the environment. It is important to note that the experience accumulation step is repeated until the student model has sufficient coverage of the state space for each training session.

3) Iterative Policy Pruning (IPP): We develop a fundamental iterative policy pruning (IPP) method to approach the performance limits when pruning the DNN, which are far to be reached by existing pruning methods in the DRL domain. IPP is an iterative process, as illustrated in Fig. 4. At each iteration, IPP prunes the DNN by accumulating experiences from the teacher. It then fine tunes the model using policy distillation method [36]. At each time step, IPP prunes each layer so that the sparsity of the layers satisfies the gradual pruning equation [10]:

where t is the global step of the training process. The term is the sparsity demand at the global step t, ginitial is the initial sparsity of the layer (typically set to 0), gfinal is the target sparsity, and is the global step from which the algorithm starts to prune (typically set to 0). The term n is the number of pruning steps to preform until reaches gfinal, and is the pruning frequency.

Every N time steps, IPP evaluates the model. If the evaluation of the model is lower then a predefined low threshold it stops the pruning, fine tunes the model using policy distillation with the KL-divergence loss function (see (10) below), and trains the model until it recuperates and solves (2). Specifically, Let be the state-value experience replay pairs generated by the teacher, and be the student values (i.e., the model that we want to train). Then, IPP aims at matching the output distributions between and using the KL-divergence measure. Let be the parameters of the student network. Then, is trained to be the minimizer of the following loss function:

Once the model recuperates, the pruning procedure continues. If the evaluation of the model solves (2), then IPP saves the model. The IPP continues accumulating experiences from the teacher and repeats this procedure until the convergence condition is met, i.e., the sparsity remains unchanged for M consecutive iterations. Then, the output model is the one IPP saved which solves (2).

In Fig. 6 we illustrate the performance after one step of the iterative pruning procedure by IPP as compared to the iterative pruning procedure by the commonly used magnitude-based weight pruning [4] using the gradual pruning framework [10]. It can be seen that the IPP procedure achieves very good performance, and overcomes the significant performance loss typically incurred by the pruning procedure in the DRL domain.

4) Policy Shrinking: This part of PoPS algorithm shrinks the model to obtain a dense small-scale DNN based on the output from IPP. The first step is to measure the redundancy of the model based on the output from IPP, and generate the dense model accordingly. The second step is to train the newly generated model using experiences from the teacher. An illustration is given in Fig 5.

4.1) Measuring redundancy and creating model: This step measures the number of non-zero parameters of the DNN output by the IPP module to determine the redundancy in that layer. This step determines the size of each layer which is required for efficient operations. Then, the algorithm shrinks the model, i.e., creates a dense DNN, where the number of

Fig. 6. The performance after one step of the iterative pruning procedure by IPP as compared to the iterative pruning procedure by the commonly used Magnitude-Based weight pruning [4] using the Gradual Pruning (MBGP) framework [10].

variables of each layer is determined by the number of non-zero parameters that was measured in the sparse model.

4.2) Train model: This component trains the newly generated shrunken model by taking experiences from the teacher. The algorithm uses the policy distillation training method with KL-divergence loss function in this step as well, as detailed in the IPP module.

Finally, after the model is trained, the algorithm compares its size with the size of the model that was constructed at the previous iteration. If the difference is smaller then a predefined threshold it concludes that the algorithm has converged. Otherwise, it repeats the procedure.

B. Pseudo Code of the PoPS algorithm

The pseudo code of PoPS algorithm is described in Algorithm 1. Let be a pre-trained model and let be a trained model at iteration i (where ). is the Experience Replay set described in Section III-A. The description of Algorithm III-B starts by pruning the model using the IPP module as detailed in Section III-A, and illustrated in Fig. 4. IPP first accumulates experiences from stored in D, as illustrated in Figure 3 via the accumulate-experience function in the pseudo code. Then, it prunes and fine-tunes the model using batches from D and the KL-divergence loss function described in (10) by the prune-and-train function. PoPS repeats this procedure until the IPP convergence condition is met, as detailed in Section III-A. Once the IPP convergence condition is met, the output of the IPP module is used as the input to the Shrinking module.

The Shrinking module measures the redundancy at each layer, denoted by M, by using the sparsity of via the measure-redundancy function in the pseudo code. denotes a randomly initialized model built by the measure M via the create-model function, which is then trained using batches from D and the KL-divergence loss function described in (10) by the train function. Eventually, a trained model, denoted by , is stored as the output of the Shrinking module. PoPS ends when the size of the model converges.

C. Open Source Software

We developed an open source software implementation of PoPS algorithm. Researchers and developers in related fields are welcome to integrate PoPS in their working environment. PoPS was developed using Python and is available at GitHub (see link in [8]). The implementation details of the open source software are described in the Appendix.

D. Complexity Discussion

The number of multiplications through a feed-forward fully connected network with G layers, in which K is the size of the input state vector and is the number of units in the layer, is given by . As a result, the computational complexity of the forward and back propagation for one sample is given by O(D). In PoPS, the complexity of accumulating N experiences by the teacher is O(DN) and the complexity of preforming back-propagation on M experiences is O(DM). Thus, each iteration of the IPP module has a complexity of order O(D(N + M)). in the Shrinking module, given that at each iteration the model is trained on M experiences, the complexity of each iteration is O(D(N + M)). Typically, the number of experiences is between 10, 000 and 30, 000. The number of iterations required to reach a compact architecture by PoPS depends on the amount of the redundancy in the initial DNN. Higher redundancy might result in running more iterations. In our experiments, PoPS converged to a compact architecture after no more then 9 iterations, which is very efficient.

IV. EXPERIMENTS

In this section we present an extensive experimental study to evaluate the performance of PoPS algorithm. To demonstrate the robustness and versatility of PoPS algorithm, we tested PoPS using four different environments from the OpenAI gym library [9], each has a different model setting. The experiments were executed on ASUS Turbo GeForce GTX 1080 TI GPU. We compared the following algorithms: (i) the commonly used magnitude-based weight pruning [4] using the gradual pruning framework [10], dubbed Magnitude-Base Gradual Pruning (MBGP), which is known to achieve strong performance in many DNN architectures; (ii) the Knowledge DistillationBased Pruning (KDBP) algorithm, which demonstrated improvements in the pruning procedure in various classification settings by using knowledge distillation when pruning DNNs [11]; and (iii) the proposed PoPS algorithm. Note that the existing MBGP and KDBP vanilla versions conduct pruning in supervised learning settings. Thus, we adapted them to the DRL setting. As explained in Section III-A, two common DRL training methods were implemented, namely DQN, and ActorCritic. Since the ground truth is not given in the DRL domain, we replaced it by the learned values. In the DQN setting, we used the TD values, given in (4) as the ground truth during

the fine-tuning procedure. In the Actor-Critic setting, we used the Advantage values, given in (8), to prune the actor, which is the DRL model used for the inference task (where the critic was trained using the TD values given in (4)). When implementing MBGP in the DRL domain, the pruned model interacts with the environment and feeds the experience replay with experiences. The fine-tuning procedure is preformed at each time-step by evaluating the TD or Advantage values using a batch of experiences from the experience replay.

When implementing KDBP in supervised learning settings, the student network is trained with the soft teacher output to improve the fine-tuning process, where the true labels are used to supervise the student output. The loss is computed using the cross-entropy between the true labels and the student output distribution, and the cross-entropy between the soft student output distribution and the soft teacher output distribution [11]. Therefore, when implementing KDBP in the DRL domain, we use the soft output teacher to improve the fine-tuning process. The pruned network interacts with the environment to obtain experiences which are then used to evaluate the distribution over TD values or Advantage values to compute the cross-entropy between those values and the student output distribution. They are used by the teacher as well to evaluate the soft teacher output distribution to compute the cross-entropy between the soft student output distribution and the soft teacher output distribution.

In both KDBP and MBGP, the interactions of the pruned agent with the environment makes the training less stable.

A. Solving the Cartpole Environment with Compact Representations of a DQN

In the Cartpole environment, a pole is attached to a cart, which is allowed to move along the x-axis, as illustrated in Fig. 7. The state space contains the velocity and position of the cart, the velocity of the tip of the pole, and the angle between the pole and the vertical line. At each time step, the agent is allowed to push the cart to the left or to the right, with a force of and +1 unit, respectively, i.e., |A| = 2. The pendulum starts upright, and the goal is to prevent it from falling over. At each time step where the pole remains upright, the agent receives a reward of +1. The episode ends when the pole is more the 15 degrees from vertical, or the cart moves more then 2.4 units from the center, or if the pole stays upright for 200 time steps. The maximum possible reward over an episode is thus 200. A policy that solves the environment is considered to be a policy that achieves more than 195 points in average over 100 episodes.

Fig. 7. An illustration of the Cartpole environment.

The DRL architecture that was used in this experiment was a DQN with 3 feed-forward fully connected layers with 256, 256, and 128 neurons, respectively, summing up to a total of roughly 100K parameters. The initial model was trained using a target DQN, as described in III-A, until it was able to solve the environment. Then, we tested the performance of PoPS, MBGP, and KDBP algorithms in terms of solving the environment while obtaining compact representations of the DQN setting.

We present the performance of PoPS algorithm in Table I, and MBGP and KDBP algorithms in Table II. It can be seen that PoPS algorithm achieves tremendous performance gain over the MBGP and KDBP algorithms after only several iterations. Specifically, PoPS generated a compact representation of the DRL model that solves the environment with a size of less than 0.3% of the initial DQN representation size. By contrast, the MBGP and KDBP algorithms were not able to generate a representation of the DRL model that solves the environment with a size of less than 40% of the initial DQN representation size.

B. Solving the Lunar Lander Environment with Compact Representations of an Actor-Critic Architecture

In the second experiment, we tested the performance of the algorithms in the Lunar Lander environment. The environment consists of a ship and a landing pad in a shifting landscape, as illustrated in Fig. 8. The observation is an 8 dimensional state vector that includes the ship’s coordinates, velocity, landerangle, angular-velocity, right-leg and left-leg grounded flag. The action space consists of 4 discrete actions: do nothing, fire main engine, fire left engine, and fire right engine. Successful landing yields 100 points, crash landing yields points, each leg that contacts the ground yields 10 points, and firing the main engine would subtract 0.3 points per time step. An episode begins in a random state and ends when the lander lands successfully, crashes, flies out of bounds, or reaches 1000 time steps. A policy that solves the environment is considered to be a policy that achieves more than 200 points in average over 100 episodes.

Fig. 8. An illustration of the Lunar lander environment.

The architecture that was used in this experiment was an Actor-Critic network with 3 feed-forward fully connected layers with 64 neurons each, summing up to a total of roughly 9K parameters. The initial model was trained with a Critic in an Actor-Critic fashion until it was able to solve the environment. Then, we tested the performance of PoPS, MBGP, and KDBP algorithms in terms of solving the environment while obtaining compact representations of the Actor-Critic setting.

We present the performance of PoPS algorithm in Table III, and MBGP and KDBP algorithms in Table IV. It can be seen again that PoPS algorithm achieves tremendous performance gain over the MBGP and KDBP algorithms after only several iterations. Specifically, PoPS generated a compact representation of the DRL model that solves the environment with a size of less than 0.8% of the initial Actor-Critic representation size. By contrast, the MBGP and KDBP algorithms were not able to generate a representation of the DRL model that solves the environment with a size of less than 68% of the initial ActorCritic representation size. This experiment demonstrates the ability of PoPS to detect redundancy in a smaller model as well.

C. Solving the Pong Environment with Compact Representations of a Convolutional Neural Network (CNN)

In the third experiment, we tested the performance of the algorithms in the Pong environment. Pong is a twodimensional game that simulates a tennis table. The player uses the paddle to hit a ball back and forth, and it aims at passing the ball beyond the opponent’s pad. An illustration is given in Fig. 9. The observation is an RGB image of the screen which is an array of shape (210, 160, 3). The state space is comprised of 4 grey scaled frames of size 84X84 stacked together. The action space consists of 3 discrete actions: up, down, and stay. A reward of +1 is received when the player passes the ball beyond the opponent’s pad, whereas a reward of is received when the ball passes the player’s pad. The episode ends when a score of 21 (winning ) or (losing) is reached. A policy that solves the environment for the player

TABLE I POPS PERFORMANCE IN THE CARTPOLE ENVIRONMENT

TABLE V POPS PERFORMANCE IN THE PONG ENVIRONMENT

is considered to be a policy that achieves more than 18 points in average over 100 episodes.

The Pong environment has a very high dimension. Thus, the

capacity should not be too small, to avoid missing important experiences and suffering from a performance reduction. In PoPS, the teacher feeds batches of experiences before each

Fig. 9. An illustration of the Pong environment.

fine-tuning or training phase by interacting with the Pong environment. Thus, the experiences in the experience replay are shifted dynamically, giving the pruned model the notion that there is a limitless amount of experiences. This method covers the large state space of Pong environment efficiently.

The architecture that was used in this experiment was a CNN with 3 convolutional layers with a filter size of 8, 4, and 3, respectively, and the numbers of channels were 32, 64, and 64, respectively. The 3 convolutional layers were followed by one fully connected layer with 512 neurons. The initial model was trained using a target DQN, as described in Section III-A. The model was trained until it was able to solve the environment. Then, we tested the performance of PoPS, MBGP, and KDBP algorithms in terms of solving the environment while obtaining compact representations of the DQN with the CNN setting.

We present the performance of PoPS algorithm in Table V, and MBGP and KDBP algorithms in Table VI. It can be seen again that PoPS algorithm achieves tremendous performance gain over the MBGP and KDBP algorithms after only several iterations. Specifically, PoPS generated a compact representation of the DRL model that solves the environment with a size of less than 1% of the initial DQN with CNN representation size. By contrast, the KDBP algorithm, although presented a significant improvement as compared to the first two experiments, was not able to generate a representation of the DRL model that solves the environment with a size of less than 6% of the initial DQN with CNN representation size. The MBGP algorithm performed poorly in this setting and was not able to generate a representation of the DRL model that solves the environment with a size of less than 60% of the initial DQN with CNN representation size.

D. Solving the Pacman Environment with Compact Representations of a Dueling DQN with CNN

In the fourth experiment, we tested the performance of the algorithms in the Pacman environment. The player (i.e., agent) controls the Pacman and tries to collect all the Pac-Dots while avoiding the ghosts. An illustration is given in Fig. 10. The observation is an RGB image of the screen which is an array of shape (210, 160, 3). The state space is comprised of 4 grey scaled frames of size 84X84 stacked together. The action space consists of 9 discrete actions: none, up, down, right, left, up-right, up-left, down-left and down-right. A reward of +10 is given for each collection of a Pac-Dot, +50 for Power Pellet and +200 for destroying vulnerable ghosts (multiplied by two if the player destroys them in succession). The episode ends once the Pacman runs out of lives or if the Pacman manages to collect all the Pac-Dots. A policy that solves the environment for the player is considered to be a policy that achieves more than 1800 points in average over 100 episodes.

Fig. 10. An illustration of the Pacman environment.

The architecture that we used in this experiment is a CNN with 3 convolutional layers with a filter size of 8, 4, and 3, respectively, and the numbers of channels were 32, 64, and 64, respectively. The 3 convolutional layers were followed by one fully connected layer with 512 neurons and then an output layer with 10 neurons to implement a dueling architecture [41]. The initial model was trained using a target dueling DQN, as described in Section III-A. The model was trained until it was able to solve the environment. Then, we tested the performance of PoPS, MBGP, and KDBP algorithms in terms of solving the environment while obtaining compact representations of the dueling DQN with the CNN setting.

We present the performance of PoPS algorithm in Table VII, and MBGP and KDBP algorithms in Table VIII. It can be seen that PoPS achieves tremendous performance gain over the MBGP and KDBP algorithms in this setting as well after only several iterations. Specifically, PoPS generated a compact representation of the DRL model that solves the environment with a size of less than 0.3% of the initial dueling DQN with CNN representation size. By contrast, the MBGP and KDBP algorithms were not able to generate a representation of the DRL model that solves the environment with a size of less than 10%, and 8%, respectively, of the initial dueling DQN with CNN representation size.

V. CONCLUSION

We have developed a policy pruning and shrinking (PoPS) framework for achieving compact representations of DNNs in the DRL domain. The suggested framework is based on a novel iterative policy pruning and shrinking method that leverages the power of transfer learning when training the DRL model. Extensive experimental results using Cartpole, Lunar Lander, Pong, and Pacman environments demonstrated tremendous performance gain of PoPS over the MBGP and KDBP algorithms in all four environments and different model settings. The results obtained by PoPS present the state-of-the-art performance in terms of training DRL models with strong required performance while minimizing the size of the DNN in the DRL domain. This research demonstrates the great potential of PoPS in making DRL models practically appealing for a wide range of applications that use systems with limited hardware resources. We developed open source software of PoPS framework which is available in [8]. Researchers and developers in related fields are invited to incorporate PoPS in their working environment. Finally, despite the success of DRL-based algorithms in many fields, not much is known about their theoretical performance guarantees. In particular, a potential future direction is to establish theoretical guarantees for compression ratios in the DRL domain.

VI. APPENDIX

A. Implementation Details of the Open Source Software

In this appendix, we summarize the implementation details of the open source software that we developed. PoPS was developed using Python and is available at GitHub (see link in [8]). Fig. 11 illustrates the top level hierarchy of all the modules in the PoPS project. Interested readers are encouraged to read the README file for further information.

Fig. 11. The PoPS project directory as seen in GitHub.

1) The configs.py file holds the configuration for each environment and model, parameters (such as the target sparsity and the pruning frequency for the pruning procedure) that affect the initial training phase, and the PoPS procedure.

2) The model.py file contains the model architecture for each environment such as class DQNPong, class CartPoleDQN, class ActorLunarLander, and CriticLunarLander. These models are used for the initial training phase, and follow the DQNAgent interface. Each model is associated with a Student version that inherits it, such as StudentPong. The Student version is adjusted for the PoPS algorithm such that the loss function is the loss presented in (10) and the architecture is a dynamic architecture which is defined by the redundancy measures.

3) The train.py file contains functions that execute the policy distillation training procedure as well as the IPP’s pruning and fine-tuning steps. The functions are well documented and are used by a variety of models and environments.

4) The file prune.py contains the IPP module orchestrating the pruning phase in the PoPS algorithm as detailed in Section III with the trainstudent function which is used for pruning and fine tuning the model.

5) The utils package contains several modules with helpful utilities.

For each environment there is a unique evaluation tool and an accumulated experience function. These two functions with the modules described above implement the interface needed for the PoPS algorithm. The implementation of PoPS algorithm is similar for each environment and is well documented at GitHub, including the experiments presented in Section IV. For each environment, the initial trained model is given in the savedmodels folder as well, with a train script for the initial training phase for reconstructing the experiments from scratch if desired. For instance, a description of the Pong experiment is given below. An illustration of the experiment directory as seen in GitHub is given in Fig. 12.

• The evaluate.py file is the unique evaluation tool for the Pong environment. It can used as a stand alone script as well to evaluate models in the Pong environment.

• The copyweights.py file is the function needed for PoPS to create the initial Student model from the teacher model . It is a compatibility function for the purpose of creating a Student version of a model.

• The traingym.py file is a script to initially train the model to solve the environment.

• The PoPS

iterativePong.py file is for the PoPS algorithm which uses the modules discussed above. It is very easy to read and implement. The first step is to import the teacher model and then create the initial student model with copyweights.py. Then, there is a convergence loop where PoPS prunes the model using IPP, evaluates the redundancy, creates a model with a size according to the measured redundancy, trains it, and repeats until convergence.

• The accumulate

experiencePong.py file is a module which fills the Experience Replay with interactions of the teacher model with the Pong environment. The experiences in the Experience Replay are used later for training with respect to the KL divergence loss function (10).

Fig. 12. The experiment directory of Pong as seen in GitHub.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.

[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[3] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint arXiv:1701.07274, 2017.

[4] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems (NIPS), pp. 1135–1143, 2015.

[5] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254, IEEE, 2016.

[6] A. Ashok, N. Rhinehart, F. Beainy, and K. M. Kitani, “N2n learning: Network to network compression via policy gradient reinforcement learning,” arXiv preprint arXiv:1709.06030, 2017.

[7] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.

[8] D. Livne and K. Cohen, “An open source software for Policy Pruning and Shrinking Framework,” available on GitHub: https://github.com/dorlivne/PoPS, 2019.

[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.

[10] M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv preprint arXiv:1710.01878, 2017.

[11] N. Yu, S. Qiu, X. Hu, and J. Li, “Accelerating convolutional neural networks by group-wise 2d-filter pruning,” in 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2502–2509, IEEE, 2017.

[12] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.

[13] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[14] C. Wang, K. Cohen, and Q. Zhao, “Active hypothesis testing on a tree: Anomaly detection under hierarchical observations,” in IEEE International Symposium on Information Theory (ISIT), pp. 993–997, 2017.

[15] A. Puzanov and K. Cohen, “Deep reinforcement one-shot learning for artificially intelligent classification systems,” arXiv preprint arXiv:1808.01527, 2018.

[16] A. Puzanov, S. Zhang, and K. Cohen, “Deep reinforcement one-shot learning for artificially intelligent classification in expert aided systems,”

submitted to ELSEVIER Engineering Applications of Artificial Intelligence.

[17] A. Puzanov and K. Cohen, “Deep reinforcement one-shot learning for change point detection,” in 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1047–1051, IEEE, 2018.

[18] T. J. O’Shea and T. C. Clancy, “Deep reinforcement learning radio control and signal detection with kerlym, a gym rl agent,” arXiv preprint arXiv:1605.09221, 2016.

[19] U. Challita, L. Dong, and W. Saad, “Proactive resource management in LTE-U systems: A deep learning perspective,” arXiv preprint arXiv:1702.07031, 2017.

[20] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access,” in International Conference on Computing, Networking and Communications (ICNC), 2017.

[21] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” to appear in the IEEE Transactions on Cognitive Communications and Networking, 2018.

[22] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks,” in IEEE Global Communications Conference (GLOBECOM), pp. 1–7, Dec 2017.

[23] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE Transactions on Wireless Communications, vol. 18, pp. 310–323, 2019.

[24] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” arXiv preprint arXiv:1803.04311, 2018.

[25] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al., “Predicting parameters in deep learning,” in Advances in neural information processing systems, pp. 2148–2156, 2013.

[26] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.

[27] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A datadriven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.

[28] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017.

[29] T.-W. Chin, C. Zhang, and D. Marculescu, “Layer-compensated pruning for resource-constrained convolutional neural networks,” arXiv preprint arXiv:1810.00518, 2018.

[30] T.-W. Chin, R. Ding, C. Zhang, and D. Marculescu, “Legr: Filter pruning via learned global ranking,” arXiv preprint arXiv:1904.12368, 2019.

[31] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” arXiv preprint arXiv:1702.03044, 2017.

[32] R. Ding, Z. Liu, R. Shi, D. Marculescu, and R. Blanton, “Lightnn: Filling the gap between conventional deep neural networks and binarized networks,” in Proceedings of the on Great Lakes Symposium on VLSI 2017, pp. 35–40, ACM, 2017.

[33] R. Ding, Z. Liu, R. Blanton, and D. Marculescu, “Quantized deep neural networks for energy efficient hardware-based inference,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 1– 8, IEEE Press, 2018.

[34] C. Bucilu, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541, ACM, 2006.

[35] J. Ba and R. Caruana, “Do deep nets really need to be deep?,” in Advances in neural information processing systems, pp. 2654–2662, 2014.

[36] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” arXiv preprint arXiv:1511.06295, 2015.

[37] C. Lemaire, A. Achkar, and P.-M. Jodoin, “Structured pruning of neural networks with budget-aware regularization,” arXiv preprint arXiv:1811.09332, 2018.

[38] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[39] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[40] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, pp. 1928–1937, 2016.

[41] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.