Control-Tutored Reinforcement Learning

2019·Arxiv

Abstract

We introduce a control-tutored reinforcement learning (CTRL) algorithm. The idea is to enhance tabular learning algorithms so as to improve the exploration of the state space and substantially reduce learning times by leveraging some limited knowledge of the plant encoded into a tutoring model-based control strategy. We illustrate the benefits and effectiveness of our approach on the problem of controlling one or more agents to herd a set of free-roving target agents in the plane and contain them within a goal region.

I. INTRODUCTION

Reinforcement learning (RL) [1] is increasingly used to learn control policies from data [2]–[4]. While the fact that it does not require a formal model of the environment/plant makes this approach particularly appealing in many applications, a key drawback is its sample inefficiency. Essentially, this is because RL finds the control policy by heuristically exploring the Markov Decision Process underlying the problem, accepting possible failures. Unfortunately, long training phases are often unacceptable, and failures while learning might lead to unsafe situations. Moreover, many applications are characterized by a continuous state space, so that using RL requires a dense discretization of the system state space, yielding a substantial growth of learning times that is sometimes incompatible with the nature and scope of the control problem of interest.

Therefore, much research effort is being devoted to design safer and more sample efficient RL algorithms. An example is model-based reinforcement learning that has been used both to empower learning processes e.g. [5], [6] and to guarantee safety in critical cases where a model is available e.g. [7], [8]. The introduction of some model also helps to bring some degree of stability to the overall learning process [9]. Other extensions include the Deep Q-Network (DQN) approach presented in [10] and the Actor-Critic paradigm [1], [11], [12].

In this paper, we present an alternative model-based approach where a feedback control strategy designed with only limited or qualitative knowledge of the system dynamics is used to enhance the RL algorithm when needed. The resulting control-tutored Q-learning (CTQL) algorithm is better suited to deal with continuous or large state spaces while retaining many of the features of a tabular method. Our algorithm is complementary to other existing model-based approaches such as [13], [14]. Indeed, in our setting, the control tutor supports the process of exploring the optimization landscape by suggesting possible actions, based on its partial knowledge of the system dynamics, that the learning agent can take whenever it is unable to find a better action by itself. In so doing, the control tutor contributes to completing the Q-table, speeding up the convergence of the learning process. A related but different idea was recently presented in [8], where RL is mirrored with a Model Predictive Controller (MPC) and a different strategy is used to orchestrate transitions between RL and MPC.

Fig. 1: Reinforcement Learning scheme

To validate our approach, we apply CTQL to solve a challenging multi-agent herding control problem. The goal is that of driving a set of agents (the herders) so that they can confine another group of autonomous roving agents (the targets) into some predefined area of the plane and keep them therein [15], [16]. We show that CTQL can be effectively used to solve this “herding” problem both in the case of one herder agent influencing one target and in the more challenging case of two herders controlling the motion of a group of ten target agents. Interestingly, we find that CTQL obtains better performance and convergence than Q-learning or feedback control on their own, solving the herding problem even when they are unable to do so.

II. REINFORCEMENT LEARNING: THE KEY INGREDIENTS

Reinforcement learning is an area of machine learning which provides a set of methods that rely on approximations producing suboptimal policies [17] for the solution of dynamic programming problems in the presence of uncertain dynamics [1]. We briefly review here its main ingredients to properly expound the novel CTQL algorithm in the right context.

An RL control loop is schematically shown in Fig. 1. Specifically, the interaction between the control agent and the environment/plant (or simply system in what follows) is described through: (i) the state space $S$ containing all possible system states; (ii) the action space $A$ of all possible actions the agent can take to influence the system state. As shown in Fig. 1, other key components of an RL control algorithm are: (i) a policy function $\pi$; (ii) a reward function $R$; (iii) a learning update rule; and (iv) an auxiliary function.

The policy selection function is used to determine what action to apply to the system starting from state $s_k$ at time $k$. The effects of such an action, say $a_k$, are evaluated via the reward function with respect to the control goal. Namely, given the action $a_k$ taken at step $k$, the current state of the system $s_k$, and the computed next state value $s_{k+1}$, the expected reward $r_{k+1}$ is computed. The learning update rule is then used to update an auxiliary function storing the expected rewards for taking a certain action when the system is in a given state. Such an auxiliary function is interrogated at each step by the learning agent to decide what action to take next and represents the “experience” accumulated by the agent during the learning process.

In the Q-learning approach [1], [18], the auxiliary function is expressed as a tabular function, $Q(s, a)$ with $s \in S$ and $a \in A$, called the Q-table. The state and action spaces are assumed to be discrete and of finite cardinality. In this way, the learning agent accesses the table through the state $s_k$ at time $k$ and selects the best action to take according to the values stored in the Q-table. The policy selection function exploits the $\varepsilon$-greedy criterion [1] and is defined as follows:

$$\pi_Q(s_k) = \begin{cases} \arg\max_{a \in A} Q_k(s_k, a) & \text{with probability } 1 - \varepsilon, \\ \mathrm{rand}(A) & \text{with probability } \varepsilon, \end{cases} \tag{1}$$

where $\mathrm{rand}(A)$ denotes an action drawn at random from $A$ and $\varepsilon$ is a positive constant in the range ]0, 1[ representing the probability of taking a random action instead of an action stored in the Q-table. Randomness in the policy promotes exploration and fulfills the hypotheses needed to prove convergence of the algorithm towards the optimal solution [18]. The learning update rule is defined as follows:

$$Q_{k+1}(s_k, a_k) = (1 - \alpha)\, Q_k(s_k, a_k) + \alpha \left[ r_{k+1} + \gamma \max_{a \in A} Q_k(s_{k+1}, a) \right], \tag{2}$$

where $r_{k+1}$ is the reward obtained by selecting action $a_k$ from state $s_k$, $s_{k+1}$ is the next state, and $Q_k(s_k, a_k)$ is the value stored in the Q-table at time $k$. The parameters $\alpha$ and $\gamma$ are both in the range [0, 1] and are known as the learning rate and the discount factor, respectively.
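The two ingredients just described, the $\varepsilon$-greedy selection and the tabular update, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the Q-table is stored as a list of lists indexed by integer state and action, and the function names `epsilon_greedy` and `q_update` are hypothetical, not taken from the paper.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Greedy choice: index of the largest entry in this state's row.
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(Q, s, a, reward, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new entry."""
    target = reward + gamma * max(Q[s_next])
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Toy table with 2 states and 3 actions (values chosen only for illustration).
Q = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
new_value = q_update(Q, s=0, a=2, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
greedy_action = epsilon_greedy(Q[1], epsilon=0.0)
```

In this toy table the updated entry is a convex combination of the old value and the bootstrapped target, weighted by the learning rate, and setting `epsilon=0.0` reduces the policy to a purely greedy one.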

As mentioned in the introduction, the main problem of this algorithm is the need to discretize the action and state spaces, which, for continuous dynamical systems, can lead to a substantial growth of the Q-table and poor learning performance [12]. Hence the need for enhanced RL algorithms such as the one we propose next.

III. CONTROL TUTORED Q-LEARNING (CTQL)

The key idea behind CTQL is schematically summarized in Fig. 2. Specifically, at each time step $k$, the learning agent selects its next action from a given system state $s_k$ by choosing either the action suggested by the control tutor via a model-based policy $\pi_C$, or the one suggested by the standard $\varepsilon$-greedy policy used for Q-learning as defined in (1). In so doing, CTQL adopts the same Q-table structure and learning update rule of Q-learning, but exploits a new policy selection function, say $\pi_{CTQL}$.

Fig. 2: Control Tutored Q-learning (CTQL) Schematic

Mathematically, the policy selection function in CTQL is a switching policy defined as:

$$\pi_{CTQL}(s_k) = \begin{cases} \pi_Q(s_k) & \text{if } \max_{a \in A} Q_k(s_k, a) > 0, \\ \pi_C(s_k) & \text{otherwise}, \end{cases} \tag{3}$$

where $\pi_Q$ is the $\varepsilon$-greedy policy defined in (1) and $\pi_C$ is the policy of the control tutor. According to this policy, at step $k$, given the state $s_k$, the learning agent checks the entries of the Q-table for all actions $a \in A$. If at least one of these entries is positive, then the action is selected by Q-learning; otherwise, the action suggested by the control tutor via the policy $\pi_C$ is chosen. This policy is defined as follows:

$$\pi_C(s_k) = \begin{cases} \arg\min_{a \in A} \| a - u_k \| & \text{with probability } 1 - \varepsilon, \\ \mathrm{rand}(A) & \text{with probability } \varepsilon, \end{cases} \tag{4}$$

where $u_k$ is the control input generated by the control tutor using a feedback controller designed on a rough model of the plant. As such an input does not necessarily belong to $A$, the policy function selects the action which is closest to $u_k$.

Once the action is selected from either $\pi_Q$ or $\pi_C$, the corresponding expected reward is computed and used to update the Q-table. The pseudocode of the CTQL algorithm is given in Algorithm 1. Note that both $\pi_Q$ and $\pi_C$ contain some degree of randomness to guarantee that, when implemented, the policy selection function of CTQL is still within the scope of the probabilistic proof of convergence available for the Q-learning algorithm and described in [18].
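The switching rule described above can be illustrated with a short Python sketch. It is written under several assumptions that are not stated in the text: actions are tuples of floats, the distance to the tutor's input is Euclidean, and the same exploration probability is applied to both branches with a single random draw. The names `nearest_action` and `ctql_policy` are hypothetical.

```python
import math
import random


def nearest_action(actions, tutor_input):
    """Index of the discrete action closest (Euclidean norm) to the tutor's input."""
    return min(range(len(actions)), key=lambda i: math.dist(actions[i], tutor_input))


def ctql_policy(q_row, actions, tutor_input, epsilon, rng=random):
    """Use the Q-table if it holds a positive entry for this state, else the tutor."""
    # Exploration step (assumed here to be shared by both branches).
    if rng.random() < epsilon:
        return rng.randrange(len(actions))
    if max(q_row) > 0.0:
        # Q-learning branch: greedy action from the table.
        return max(range(len(q_row)), key=q_row.__getitem__)
    # Tutor branch: project the continuous suggestion onto the action set.
    return nearest_action(actions, tutor_input)


# Toy action set in the plane and a tutor suggestion that is not in the set.
actions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
suggestion = (0.9, 0.1)
from_tutor = ctql_policy([0.0, 0.0, 0.0], actions, suggestion, epsilon=0.0)
from_table = ctql_policy([0.0, 0.0, 2.0], actions, suggestion, epsilon=0.0)
```

With an empty row the tutor's suggestion is projected onto the nearest discrete action, while a single positive entry is enough to hand control back to the table. The selected index would then be fed to the same update rule used by plain Q-learning.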

To illustrate the viability and effectiveness of CTQL, we apply it to solve the herding problem in robotics and discuss its performance by comparing it to a traditional (untutored) Q-learning approach.

IV. APPLICATION TO THE HERDING PROBLEM

We consider the problem of letting one or more mobile agents in the plane (the herders) drive the motion of a group of autonomous agents (the targets) so as to move them towards some goal region and maintain them therein. Under the assumption that the herders only possess limited knowledge of the dynamics of the targets, we will solve the problem of controlling the herders by using CTQL. For the sake of simplicity, we assume all agents are able to adjust their velocities almost instantaneously, as done for example in [19]. In what follows we will use the subscript ‘t’ to denote quantities pertaining to the target agents and the subscript ‘h’ for those concerning the herders.

A. Problem Formulation

Assuming, the target agents’ velocity is upper bounded by some maximum velocity , the dynamics of the target agents is assumed to be:

where is the imaginary unit, is the position of the i-th target agent (out of N) at time is the vector stacking the positions of the M herder agents at time t, and the vector field is the sum of two contributions, i.e. .

Here, the term models the action of the herders onto the target and is assumed to be the same for all the targets. It is defined as:

where we omitted the explicit dependence on time is the targets’ influence radius, is a constant gain modelling the intensity of the coupling with the herder, and U is an interaction function defined as

that ensures that the coupling between target and herder agents is active only if their relative distance is smaller than some .

The term represents the target own random dynamics defined as:

where and are scalars updated every seconds with values extracted from uniform distributions and , respectively.

The herders’ speed, as for the targets, is saturated to a maximum fixed value so that their dynamics can be written as:

where is a control input at time t to be determined in order to fulfill the control goal.

The control objective is to design the input vector u = able to drive the targets to reach and remain in the circular goal region of center and radius , that is, to guarantee that

B. Control Design

For the sake of simplicity, we start by considering the case where N = M = 1 (dropping the suffixes i and j) and the goal region is centered at the origin, i.e. . We suppose the herder knows the position of the target but possesses only a conservative estimate, of the target’s true influence radius .

We design the control input u driving the herder as follows. At time t, if , then the herder moves towards the target at its maximum speed to reduce its distance until entering the estimated influence region at some time when . Within this region the herder adopts a learning strategy to push the target towards the goal region. For the sake of comparison, we first test how Q-learning performs on the problem and then move to CTQL.

1) Q-learning implementation: We start by applying the classical Q-learning algorithm with the following definitions of state and action spaces, and of the reward function.

The action and state space are defined as follows. S := , where (i) D is the set of distances of the herder and the target from the center of the goal region; (ii) W is the set of angular positions of the herder; (iii) V is the set of possible speeds of the target. In our implementation the sets D, W, V are discrete sets and are defined in detail in the Appendix. The action space (see also the Appendix) is the set of possible discretized values of the input vector u to the herder dynamics given by (9).
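To make the effect of the discretization concrete, the following Python sketch maps a continuous quantity to a bin index on a uniform grid and counts the entries of the resulting Q-table. It assumes, purely for illustration, that the table is indexed by one bin per discretized quantity and one column per action; the ranges, step sizes and bin counts used below are hypothetical and are not the values reported in the Appendix.

```python
def bin_index(value, low, high, step):
    """Index of the uniform bin of width `step` over [low, high] containing `value`."""
    n_bins = max(1, round((high - low) / step))
    clipped = min(max(value, low), high)
    # The upper boundary is assigned to the last bin.
    return min(int((clipped - low) / step), n_bins - 1)


def table_size(bins_per_dimension, n_actions):
    """Number of Q-table entries for the given bins per state dimension."""
    size = n_actions
    for n in bins_per_dimension:
        size *= n
    return size


# Hypothetical grid: distances in [0, 10] with step 1.
distance_bin = bin_index(7.3, low=0.0, high=10.0, step=1.0)

# Hypothetical coarse grid versus the same grid with all step sizes halved.
coarse_entries = table_size([10, 8, 4], n_actions=16)
fine_entries = table_size([20, 16, 8], n_actions=16)
```

Halving every step size in a three-dimensional state grid multiplies the number of table entries by eight, which gives a sense of why a finer discretization can slow down tabular learning.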

Let be the position of a generic target (herder) agent at a discrete time instant k. Then, the reward function implemented in our experiments is:

R, s, s, s, s, s, s

where

with , and being positive constant gains, and where was chosen w.l.o.g. as a decreasing function of its argument.

We test the Q-learning algorithm by considering two discretizations of the state space, a finer and a coarser one (see the Appendix for further details). When using the finer discretization, we see that, as shown in Fig. 3 (top panel), Q-learning is unable to achieve the control goal after over 5000 training trials. Convergence is instead achieved within about 25 s, after a training phase of about the same duration, when a coarser discretization of the state space is used; see Fig. 3 (bottom panel).

Fig. 3: Performance of the Q-learning algorithm after 5000 trials and the state discretization is finer (top panel) or coarser (bottom panel). The radial distance of the herder (black line) and the target (red line) are shown together with the radius of the goal region (green line).

2) Control-Tutored Q-learning Application: We move next to adopting the CTQL approach described in Sec. III. The state and action spaces, as well as the reward function, are the same as those proposed in Sec. IV-B.1.

The design of the tutoring control law requires some model of the expected dynamics of the targets. We assume that only an estimate of the target true dynamics is available which we suppose to be given by the inaccurate model:

where is a gain modelling the intensity of the coupling between the target and the herder, , and is the step function defined in (7).

Assuming the target’s dynamics as in (14), we then select the herder control input so as to push the target position towards the origin; namely we choose

where and are two positive control gains. With this choice of u when the target and the herder interact, the target dynamics becomes:

so that any choice of and would achieve convergence to the origin were the dynamics (14) the correct ones. Without loss of generality here we choose , .

As expected, when applied to control the “true” target dynamics, we observe that, as shown in Fig. 4, the herder driven by (15) fails to achieve the desired goal as the target escapes the region where they actually interact and becomes uncontrollable. (Note that a better choice of the controller or the gains might resolve this issue for the approximate model, but here we leave the controller unchanged as we wish to explore whether our CTQL approach can instead achieve convergence even when the control tutor is designed on a set of very simplifying qualitative assumptions such as those we made.)

Fig. 4: Performance of the control tutor without any learning. The inset shows a zoom of the transient dynamics during the interval . The color codes are described in the caption of Fig. 3.

C. Numerical Validation

1) CTQL herding of a single target: As shown in Fig. 5, in the case of one herder interacting with one target, using the CTQL approach with the control tutor designed above is successful after just one or two training trials independently of the state discretization used. A summary of the performance and convergence times of CTQL compared with those where the Control Tutor (CT) or Q-learning (QL) are used on their own is shown in Table I. The numerical experiments were initiated with random initial conditions, uniformly selected in [15, 30], and such that the initial distance is uniformly distributed in . We observe that the control tutor on its own (without learning) is never successful, while the Q-learning performance strongly depends on the state discretization used. CTQL instead always achieves convergence, guaranteeing robustness to hyperparameter selection such as state discretization and a very limited number of learning trials (1 or 2 in the cases we tested, as compared to over 5000 for Q-learning with a coarse state discretization).

Fig. 5: Performance of the CTQL algorithm with a finer (top panel) or coarser (bottom panel) state space discretization after just 1 learning trial. Convergence is immediately achieved in both cases. The color codes are described in the caption of Fig. 3.

TABLE I: Performance comparison between CT, QL and CTQL with the finer (and coarser) state discretization

2) CTQL of multiple herders and targets: To further test our strategy, we considered the case of M herders controlling N targets. In this context, the herders’ behavior needs to include some cooperation rule to successfully drive and contain the targets. Here, the herders use CTQL and cooperate to fill in the same Q-table.

We assume each herder is always aware of the current positions of all the targets. Then, using this information, the herders (i) compute the center of mass (CoM) of the positions of the targets and (ii) split the plane into M circular sectors centered at the origin, starting with the line passing through the origin and the computed CoM. Each herder then assumes control of one of these sectors, taking on the task of searching for and recovering targets that are located in that area. The sectors are re-computed and re-allocated every 10 seconds. Such a division of the region of interest forces each herder to choose the targets to chase only in its own sector and, consequently, avoids interference among herders.
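A possible way to compute such a partition is sketched below in Python. The sketch makes assumptions beyond the text: positions are represented as complex numbers, the sectors have equal angular width, and they are counted counter-clockwise starting from the ray that goes from the origin through the CoM. The function names `sector_index` and `allocate` are hypothetical.

```python
import cmath
import math


def sector_index(position, com, n_sectors):
    """Sector (0..n_sectors-1) of a complex position, measured from the CoM ray."""
    # Angle of the position relative to the CoM direction, wrapped to [0, 2*pi).
    offset = (cmath.phase(position) - cmath.phase(com)) % (2.0 * math.pi)
    width = 2.0 * math.pi / n_sectors
    return min(int(offset / width), n_sectors - 1)


def allocate(targets, n_herders):
    """Group target indices by sector, one sector per herder."""
    com = sum(targets) / len(targets)
    sectors = [[] for _ in range(n_herders)]
    for i, p in enumerate(targets):
        sectors[sector_index(p, com, n_herders)].append(i)
    return sectors


# Three targets and two herders (positions chosen only for illustration).
targets = [1 + 1j, 1 - 1j, -1 + 0.5j]
assignment = allocate(targets, n_herders=2)
```

Each inner list of `assignment` holds the indices of the targets that fall in the sector of the corresponding herder. Re-running `allocate` on the current positions at a fixed interval would mimic the periodic re-allocation described above.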

As the velocities of all the agents are comparable, herders may end up continually switching between two or more targets to chase without pushing any of them towards the goal region. To avoid such a case, the following rule has been introduced for herder agents:

1) Select the furthest target from the goal region in your sector of competence;

2) while trying to contain in the goal region G, check if another target, say , becomes the new furthest target from G. If such a target exists, then

3) compute the distance between and , and

4) if switch the control law to contain in G, otherwise keep containing target .

Figure 6 shows the performance of CTQL when M = 2 herders interact with a group of N = 15 targets, confirming the effectiveness of using a control tutor that allows the learning algorithm to achieve convergence after just one learning trial.

Fig. 6: Radial coordinates of (a) the two herding agents and (b) the 15 target agents after just one learning trial when CTQL is employed to drive the herders. The radius of the goal region is shown as a green solid line.

V. CONCLUSIONS

In this paper, we introduced an extension of Q-learning where the policy selection function is enhanced by means of a control tutor that, using a feedback control law with limited knowledge of the system dynamics, is able to support the exploration of the optimization landscape guaranteeing better convergence and shorter learning times. To illustrate the effectiveness of the approach, we discussed its application to the herding problem showing that the combination of learning and feedback control can achieve ambitious control goals even in those cases where neither would work on its own. We envisage that a similar control tutored approach can be used to enhance the performance and convergence of other more sophisticated learning algorithms. We wish to emphasize that from a control viewpoint, the combined presence of RL and feedback control renders viable the use of a control strategy that would otherwise be useless without the presence of learning. Ongoing work is focussed on refining this approach with the aim of obtaining a better understanding of its advantages and limitations for future applications.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.

[2] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[3] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.

[4] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” preprint available from arXiv:1903.08792, 2019.

[5] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” Proc. of the International Conference on Machine Learning, pp. 2829–2838, 2016.

[6] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” Proc. of the International Conference on Machine Learning, pp. 465–472, 2011.

[7] U. Rosolia and F. Borrelli, “Learning model predictive control for iterative tasks. a data-driven control framework,” IEEE Transactions on Automatic Control, vol. 63, no. 7, pp. 1883–1896, 2017.

[8] P. Ferraro, M. Rathi, and G. Russo, “Driving reinforcement learning with models,” preprint available from arXiv:1911.04400v1, 2019.

[9] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” Advances in neural information processing systems, vol. 30, pp. 908–918, 2017.

[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[11] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Proc. of International conference on machine learning, 2016, pp. 1928–1937.

[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” preprint available from arXiv:1509.02971v6, 2019.

[13] F. Fathinezhad, V. Derhami, and M. Rezaeian, “Supervised fuzzy reinforcement learning for robot navigation,” Applied Soft Computing, vol. 40, pp. 33–41, 2016.

[14] M. Brunner, U. Rosolia, J. Gonzales, and F. Borrelli, “Repetitive learning model predictive control: An autonomous racing example,” Proc. of the IEEE Conference on Decision and Control, pp. 2545–2550, 2017.

[15] R. A. Licitra, Z. D. Hutcheson, E. A. Doucette, and W. E. Dixon, “Single agent herding of n-agents: A switched systems approach,” IFAC-PapersOnLine, vol. 50, pp. 14374–14379, 2017.

[16] A. Pierson and M. Schwager, “Controlling noncooperative herds with robotic herders,” IEEE Transactions on Robotics, vol. 34, pp. 517–525, 2017.

[17] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Athena Scientific, 2019.

[18] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, pp. 279–292, 1992.

[19] G. Albi, M. Bongini, E. Cristiani, and D. Kalise, “Invisible control of self-organizing agents leaving unknown environments,” SIAM Journal on Applied Mathematics, vol. 76, no. 4, pp. 1683–1710, 2016.

APPENDIX

We report here all the parameters that were used for the numerical simulations reported in the paper.

The circular goal region is centered at the origin, i.e. , with radius . The target parameters were set to . The random diffusive motion of the target uses as update time interval and as maximum speed. The herder’s maximum speed was set to . The estimated radius of the influence zone assumed for the design of the control tutor is . Each learning trial lasted in the case of the single target experiments and in the case of multiple targets; the sampling time was set to . The parameters of the learning update rule were set to and while the randomness parameter in the policies and was set to . The parameters of the reward function were set to and . To implement QL and CTQL, two alternative discretizations of the state space were tested. To reduce the computational burden in the implementation, the set of discretized relative distances was used to address and construct the Q-table. A coarser discretization was obtained by sampling the range of relative distances with stepsize , and the range of angles [0,] with . The angular position of the herder was discretized in the range [0, ] with stepsize . The target speed was discretized in the range with stepsize , and the range of angles [0,] with . A finer discretization was obtained by reducing the sampling stepsizes of the quantities above to , and . The action space consisted of possible herder velocities discretized in the range [0,] with stepsize and possible angular orientations in the range [0, ] with . In the model of the target dynamics used for the control synthesis the parameter was set to unity while the tutoring control law gains were set to .
