Lipschitz Lifelong Reinforcement Learning

2020·Arxiv

Abstract

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show that the method experiences no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.

1 Introduction

Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience to speed up the resolution of new tasks is a key question in that setting (Lazaric 2012; Taylor and Stone 2009). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task could be able to perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer, practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal Q-value function of an unknown target task, given the learned optimal Q-value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (Brafman and Tennenholtz 2002), relying on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists of computing online the distance between source and target tasks, deducing the upper bound on the optimal Q value function of the source task and using this bound to accelerate learning. Importantly, the method exhibits no negative transfer, i.e., it cannot cause performance degradation, as the computed upper bound provably does not underestimate the optimal Q-value function.

Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method applicable online and safe. In Section 4.3, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting. We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).

2 Background and Related Work

Reinforcement Learning (RL) (Sutton and Barto 2018) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (Puterman 2014) consisting of a 4-tuple $\langle \mathcal{S}, \mathcal{A}, R, T \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ an action space, $R^a_s$ is the expected reward of taking action $a$ in state $s$, and $T^a_{ss'}$ is the transition probability of reaching state $s'$ when taking action $a$ in state $s$. Without loss of generality, we assume $R^a_s \in [0, 1]$. Given a discount factor $\gamma \in [0, 1)$, the expected cumulative return $\sum_t \gamma^t R^{a_t}_{s_t}$ obtained along a trajectory starting with state $s$ and action $a$ using policy $\pi$ in MDP $M$ is denoted by $Q^\pi_M(s, a)$ and called the Q-function. The optimal Q-function $Q^*_M$ is the highest attainable expected return from $s, a$, and $V^*_M(s) = \max_{a \in \mathcal{A}} Q^*_M(s, a)$ is the optimal value function in $s$. Notice that $R^a_s \le 1$ implies $Q^*_M(s, a) \le \frac{1}{1-\gamma}$ for all $s, a \in \mathcal{S} \times \mathcal{A}$. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q-function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.
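The maximum upper bound mentioned above follows from a geometric series. The short derivation below is a sketch under the assumptions that rewards lie in $[0, 1]$ and that $\gamma \in [0, 1)$; the trajectory indexing $(s_t, a_t)$ is notation introduced here for illustration.

```latex
\begin{align*}
Q^\pi_M(s, a)
  &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R^{a_t}_{s_t} \,\Big|\, s_0 = s,\ a_0 = a\Big] \\
  &\le \sum_{t=0}^{\infty} \gamma^t \cdot 1
   = \frac{1}{1-\gamma}
   \qquad \text{for every policy } \pi,
\end{align*}
% Taking the maximum over policies gives Q^*_M(s, a) <= 1 / (1 - gamma).
```

Since the inequality holds for every policy, it holds in particular for the optimal one, which is why $\frac{1}{1-\gamma}$ is always an admissible optimistic initialization.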

Lifelong RL (Silver, Yang, and Li 2013; Brunskill and Li 2014) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs. In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (Wilson et al. 2007; Brunskill and Li 2014; Abel et al. 2018), we restrict the scope of the study to the case where sampled MDPs share the same state-action space $\mathcal{S} \times \mathcal{A}$. For brevity, we will refer interchangeably to MDPs, models or tasks, and write them $M = \langle R, T \rangle$.

Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. Song et al. (2016) define a metric based on the bi-simulation metric introduced by Ferns, Panangaden, and Precup (2004) and the Wasserstein metric (Villani 2008). Value transfer is performed between states with low bi-simulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose does allow negative transfer (see Appendix, Section 7). Carroll and Seppi (2005) also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. Mahmud et al. (2013) and Brunskill and Li (2013) propose MDP clustering methods, respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP and the norm between the MDP models. An advantage of clustering is to prune the set of possible source tasks. They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. Ammar et al. (2014) learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. Lazaric, Restelli, and Bonarini (2008) provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models to be identical. Their approach is applicable in a batch RL setting as opposed to the online setting considered in this paper. The approach developed by Sorg and Singh (2009) is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. Brunskill and Li (2013) also provide a method that can be used for Q-function bounding in multi-task RL.

Abel et al. (2018) present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (Strehl, Li, and Littman 2009). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task’s optimal Q-function. This approach results in a method for non-negative transfer with high probability once enough tasks have been sampled. The method developed by Abel et al. (2018) is similar to ours in two fundamental points: first, a theoretical upper bound on optimal Q-values across the MDP space is built; secondly, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum bound in an RMax-like algorithm, providing PAC guarantees. The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by Abel et al. (2018), and the LRMax bound is the one we present in this paper. The figure shows the essence of the LRMax bound: it stems from the fact that the optimal Q-value function is locally Lipschitz continuous in the MDP space w.r.t. a specific pseudometric. Confirming the intuition, MDPs that are close w.r.t. this metric have close optimal Q-values. Note that neither bound is uniformly better than the other, as Figure 1 suggests. Hence, combining all the bounds results in a tighter upper bound, as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.

Figure 1: The optimal Q-value function represented for a particular s, a pair across the MDP space. The RMax, MaxQInit and LRMax bounds are represented for three sampled MDPs.

3 Lipschitz Continuity of Q-Functions

The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-function over the MDP space $\mathcal{M}$. The remainder of this section formalizes this intuition, which will be used in the next section to derive a practical method for value transfer. To that end, we introduce a local pseudometric characterizing the distance between the models of two MDPs at a particular state-action pair. A reminder and a detailed discussion on the metrics used herein can be found in the Appendix, Section 8.

Definition 1. Given two tasks $M = \langle R, T \rangle$, $\bar{M} = \langle \bar{R}, \bar{T} \rangle$, and a function $f : \mathcal{S} \to \mathbb{R}^+$, we define the pseudometric between models at $(s, a) \in \mathcal{S} \times \mathcal{A}$ w.r.t. $f$ as

$$D^f_{sa}(M, \bar{M}) \triangleq |R^a_s - \bar{R}^a_s| + \sum_{s' \in \mathcal{S}} f(s') \, |T^a_{ss'} - \bar{T}^a_{ss'}|. \tag{1}$$

This pseudometric is relative to a positive function $f$. We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. For the sake of clarity in the remainder of this study, we introduce

$$D_{sa}(M \| \bar{M}) \triangleq D^{\gamma V^*_{\bar{M}}}_{sa}(M, \bar{M}),$$

corresponding to the pseudometric between models with the particular choice of $f = \gamma V^*_{\bar{M}}$. From this definition stems the following pseudo-Lipschitz continuity result.

Proposition 1 (Local pseudo-Lipschitz continuity). For two MDPs $M$, $\bar{M}$, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,

$$|Q^*_M(s, a) - Q^*_{\bar{M}}(s, a)| \le \Delta_{sa}(M, \bar{M}), \tag{2}$$

with the local MDP pseudometric $\Delta_{sa}(M, \bar{M}) \triangleq \min\{d_{sa}(M \| \bar{M}), d_{sa}(\bar{M} \| M)\}$, where the local MDP dissimilarity $d_{sa}(M \| \bar{M})$ is the unique solution to the following fixed-point equation for $d_{sa}$:

$$d_{sa} = D_{sa}(M \| \bar{M}) + \gamma \sum_{s' \in \mathcal{S}} T^a_{ss'} \max_{a' \in \mathcal{A}} d_{s'a'}, \quad \forall s, a. \tag{3}$$

This result establishes that the distance between the optimal Q-functions of two MDPs at $(s, a)$ is controlled by a local dissimilarity between the MDPs. The latter follows a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (Bellman 1957). Note that, although the local MDP dissimilarity $d_{sa}(M \| \bar{M})$ is asymmetric, $\Delta_{sa}(M, \bar{M})$ is a pseudometric, hence the name pseudo-Lipschitz continuity. Notice that the policies in Equation 2 are the optimal ones for the two MDPs and thus are different. Proposition 1 is a mathematical result stemming from Definition 1 and should be distinguished from other frameworks of the literature that assume the continuity of the reward and transition models w.r.t. $\mathcal{S} \times \mathcal{A}$ (Rachelson and Lagoudakis 2010; Pirotta, Restelli, and Bascetta 2015; Asadi, Misra, and Littman 2018). This result establishes that the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Hence, given $\bar{M}$, the function

$$s, a \mapsto Q^*_{\bar{M}}(s, a) + \Delta_{sa}(M, \bar{M}) \tag{4}$$

can be used as an upper bound on $Q^*_M$, with $M$ an unknown MDP. This is the idea on which we construct a computable and transferable upper bound in Section 4. In Figure 1, the upper bound of Equation 4 is represented by the LRMax bound. We also provide a global pseudo-Lipschitz continuity property, along with similar results for the optimal value function and the value function of a fixed policy. As these results do not directly serve the purpose of this article, we report them in the Appendix, Section 10.
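As a numerical illustration of Proposition 1, the following minimal sketch solves the fixed-point equation for the local dissimilarity by dynamic programming on two small tabular MDPs and checks that the gap between optimal Q-functions stays below the resulting pseudometric. It assumes the model pseudometric is weighted by $\gamma V^*$ of the second MDP and propagated with the transitions of the first; the function names and the toy numbers are hypothetical and not taken from the paper.

```python
def optimal_q(R, T, gamma, tol=1e-12):
    """Value iteration; returns Q* as a nested list indexed by [state][action]."""
    n_s, n_a = len(R), len(R[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    while True:
        v = [max(row) for row in q]
        new_q = [[R[s][a] + gamma * sum(T[s][a][t] * v[t] for t in range(n_s))
                  for a in range(n_a)] for s in range(n_s)]
        gap = max(abs(new_q[s][a] - q[s][a]) for s in range(n_s) for a in range(n_a))
        q = new_q
        if gap < tol:
            return q


def local_dissimilarity(R, T, R_bar, T_bar, gamma, tol=1e-12):
    """Solve d = D + gamma * sum_t T[s][a][t] * max_b d[t][b] by fixed-point iteration."""
    n_s, n_a = len(R), len(R[0])
    v_bar = [max(row) for row in optimal_q(R_bar, T_bar, gamma)]
    # Model pseudometric at each pair, weighted by f = gamma * V* of the second MDP.
    D = [[abs(R[s][a] - R_bar[s][a])
          + sum(gamma * v_bar[t] * abs(T[s][a][t] - T_bar[s][a][t]) for t in range(n_s))
          for a in range(n_a)] for s in range(n_s)]
    d = [[0.0] * n_a for _ in range(n_s)]
    while True:
        d_state = [max(row) for row in d]
        new_d = [[D[s][a] + gamma * sum(T[s][a][t] * d_state[t] for t in range(n_s))
                  for a in range(n_a)] for s in range(n_s)]
        gap = max(abs(new_d[s][a] - d[s][a]) for s in range(n_s) for a in range(n_a))
        d = new_d
        if gap < tol:
            return d


# Two toy MDPs on the same 2-state, 2-action space (hypothetical numbers).
gamma = 0.9
R = [[0.0, 1.0], [0.5, 0.2]]
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
R_bar = [[0.1, 0.9], [0.5, 0.3]]
T_bar = [[[0.9, 0.1], [0.1, 0.9]], [[0.5, 0.5], [0.8, 0.2]]]

q_m = optimal_q(R, T, gamma)
q_bar = optimal_q(R_bar, T_bar, gamma)
d_fwd = local_dissimilarity(R, T, R_bar, T_bar, gamma)
d_bwd = local_dissimilarity(R_bar, T_bar, R, T, gamma)
# Local MDP pseudometric: minimum of the two asymmetric dissimilarities.
delta = [[min(d_fwd[s][a], d_bwd[s][a]) for a in range(2)] for s in range(2)]
# Lipschitz upper bound on Q*_M built from the source task's Q*.
upper = [[q_bar[s][a] + delta[s][a] for a in range(2)] for s in range(2)]
```

In this sketch `upper` plays the role of the transferable bound: it is computed from the second task's optimal Q-function plus the pseudometric, and it dominates `q_m` at every state-action pair.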

4 Transfer Using the Lipschitz Continuity

A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions:

C1. the resulting algorithm is PAC-MDP;

C2. the transfer accelerates learning;

C3. the transfer is non-negative.

To achieve these conditions, we first present a transferable upper bound on $Q^*_M$ in Section 4.1. This upper bound stems from the Lipschitz continuity result of Proposition 1. Then, we propose a practical way to compute this upper bound in Section 4.2. Precisely, we propose a surrogate bound that can be calculated online in the Lifelong RL setting, without having explored the source and target tasks completely. Finally, we implement the method in an algorithm described in Section 4.3, and demonstrate formally that it meets conditions C1, C2 and C3. Improvements are discussed in Section 4.4.

4.1 A Transferable Upper Bound on $Q^*_M$

From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.

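Definition 2. Given two tasks $M$ and $\bar{M}$, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, the Lipschitz upper bound on $Q^*_M$ induced by $Q^*_{\bar{M}}$ is defined as $U_{\bar{M}}(s, a) \ge Q^*_M(s, a)$ with

$$U_{\bar{M}}(s, a) = Q^*_{\bar{M}}(s, a) + \Delta_{sa}(M, \bar{M}). \tag{5}$$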

The optimism in the face of uncertainty principle leads to considering that the long-term expected return from any state is the maximum return, unless proven otherwise. In particular, the RMax algorithm (Brafman and Tennenholtz 2002) explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PAC-MDP guarantees (Strehl, Li, and Littman 2009), meaning that convergence to a near-optimal policy is guaranteed in a polynomial number of missteps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound $U$ on the optimal Q-function, then acts greedily w.r.t. $U$. By default, it takes the maximum value $U(s, a) = \frac{1}{1-\gamma}$, but any tighter upper bound is admissible. Thus, shrinking $U$ with Equation 5 is expected to improve the learning speed or sample complexity for new tasks in Lifelong RL.

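In RMax, during the resolution of a task $M$, $\mathcal{S} \times \mathcal{A}$ is split into a subset of known state-action pairs $K$ and its complement $K^c$ of unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an $\epsilon$-accurate model in $L_1$-norm with probability higher than $1 - \delta$. We refer to $\epsilon$ and $\delta$ as the RMax precision parameters. This results in a threshold $n_{known}$ on the number of visits $n(s, a)$ to a pair $s, a$ that are necessary to reach this precision. Given the experience of a set of $m$ MDPs $\bar{\mathcal{M}} = \{\bar{M}_1, \ldots, \bar{M}_m\}$, we define the total bound as the minimum over all the induced Lipschitz bounds.

Proposition 2. Given a partially known task $M = \langle R, T \rangle$, the set of known state-action pairs $K$, and the set of Lipschitz bounds on $Q^*_M$ induced by previous tasks $\{U_{\bar{M}_1}, \ldots, U_{\bar{M}_m}\}$, the function $Q$ defined below is an upper bound on $Q^*_M$ for all $s, a \in \mathcal{S} \times \mathcal{A}$:

$$Q(s, a) \triangleq \begin{cases} R^a_s + \gamma \sum_{s' \in \mathcal{S}} T^a_{ss'} \max_{a' \in \mathcal{A}} Q(s', a') & \text{if } (s, a) \in K, \\ U(s, a) & \text{otherwise,} \end{cases} \tag{6}$$

with $U(s, a) = \min\left\{\frac{1}{1-\gamma}, U_{\bar{M}_1}(s, a), \ldots, U_{\bar{M}_m}(s, a)\right\}$.

Commonly in RMax, Equation 6 is solved to a precision $\epsilon_Q$ via Value Iteration. This yields a function $Q$ that is a valid heuristic (bound on $Q^*_M$) for the exploration of MDP $M$.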
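The optimistic value iteration described above can be sketched as follows: Bellman backups are applied on known pairs, while unknown pairs are pinned to the total bound, taken as the minimum of $\frac{1}{1-\gamma}$ and any transferred bounds. This is a minimal sketch, not the paper's implementation; the function names, the known-pair mask and the transferred bound values are hypothetical.

```python
def total_bound(transferred_bounds, gamma, n_states, n_actions):
    """Elementwise minimum of 1/(1-gamma) and every transferred upper bound."""
    rmax_bound = 1.0 / (1.0 - gamma)
    return [[min([rmax_bound] + [u[s][a] for u in transferred_bounds])
             for a in range(n_actions)] for s in range(n_states)]


def optimistic_q(R, T, known, U, gamma, tol=1e-10):
    """Value iteration where unknown pairs keep the optimistic value U[s][a]."""
    n_s, n_a = len(R), len(R[0])
    q = [row[:] for row in U]  # start from the optimistic bound
    while True:
        v = [max(row) for row in q]
        new_q = []
        for s in range(n_s):
            row = []
            for a in range(n_a):
                if known[s][a]:
                    backup = sum(T[s][a][t] * v[t] for t in range(n_s))
                    row.append(R[s][a] + gamma * backup)
                else:
                    row.append(U[s][a])
            new_q.append(row)
        gap = max(abs(new_q[s][a] - q[s][a]) for s in range(n_s) for a in range(n_a))
        q = new_q
        if gap < tol:
            return q


# Toy model on 2 states and 2 actions; the pair (0, 1) is still unknown.
gamma = 0.9
R = [[0.0, 1.0], [0.5, 0.2]]
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
known = [[True, False], [True, True]]
lipschitz = [[3.0, 4.0], [12.0, 2.5]]  # hypothetical bound transferred from one source task

U_transfer = total_bound([lipschitz], gamma, 2, 2)
U_rmax = total_bound([], gamma, 2, 2)  # no transfer: plain 1/(1-gamma)
q_transfer = optimistic_q(R, T, known, U_transfer, gamma)
q_rmax = optimistic_q(R, T, known, U_rmax, gamma)
```

Because the backup operator is monotone, a smaller bound on the unknown pair propagates to every pair that can reach it, so `q_transfer` is never larger than `q_rmax` in this example.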

4.2 A Computable Upper Bound on $Q^*_M$

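The key issue addressed in this section is how to actually compute $U(s, a)$, particularly when both source and target tasks are partially explored. Consider two tasks $M$ and $\bar{M}$, on which vanilla RMax has been applied, yielding the respective sets of known state-action pairs $K$ and $\bar{K}$, along with the learned models $\hat{M} = \langle \hat{R}, \hat{T} \rangle$ and $\hat{\bar{M}} = \langle \hat{\bar{R}}, \hat{\bar{T}} \rangle$, and the upper bounds $Q$ and $\bar{Q}$ respectively on $Q^*_M$ and $Q^*_{\bar{M}}$. Notice that, if $K = \emptyset$, then $Q(s, a) = \frac{1}{1-\gamma}$ for all $s, a$ pairs. Conversely, if $K^c = \emptyset$, $Q$ is an $\epsilon$-accurate estimate of $Q^*_M$ with high probability. Equation 6 allows the transfer of knowledge from $\bar{M}$ to $M$ if $U_{\bar{M}}(s, a)$ can be computed. Unfortunately, the true model and optimal value functions, necessary to compute $U_{\bar{M}}$, are only partially known (see Equation 5). Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound $\hat{D}_{sa}(M \| \bar{M})$ on $D_{sa}(M \| \bar{M})$ (Definition 1).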

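Proposition 3. Given two tasks $M$, $\bar{M}$ and respectively $K$, $\bar{K}$ the subsets of $\mathcal{S} \times \mathcal{A}$ where their models are known with accuracy $\epsilon$ in $L_1$-norm with probability at least $1 - \delta$,

$$\Pr\left(\hat{D}_{sa}(M \| \bar{M}) \ge D_{sa}(M \| \bar{M})\right) \ge 1 - \delta,$$

with $\hat{D}_{sa}(M \| \bar{M})$ the upper bound on the pseudometric between models defined below for $B = \epsilon \left(1 + \gamma \max_{s'} \bar{V}(s')\right)$:

$$\hat{D}_{sa}(M \| \bar{M}) \triangleq \begin{cases} D^{\gamma \bar{V}}_{sa}(\hat{M}, \hat{\bar{M}}) + 2B & \text{if } (s, a) \in K \cap \bar{K}, \\ \max_{\bar{\mu} \in \mathcal{M}} D^{\gamma \bar{V}}_{sa}(\hat{M}, \bar{\mu}) + B & \text{if } (s, a) \in K \cap \bar{K}^c, \\ \max_{\mu \in \mathcal{M}} D^{\gamma \bar{V}}_{sa}(\mu, \hat{\bar{M}}) + B & \text{if } (s, a) \in K^c \cap \bar{K}, \\ \max_{\mu, \bar{\mu} \in \mathcal{M}^2} D^{\gamma \bar{V}}_{sa}(\mu, \bar{\mu}) & \text{if } (s, a) \in K^c \cap \bar{K}^c. \end{cases} \tag{7}$$

Importantly, this upper bound $\hat{D}_{sa}(M \| \bar{M})$ can be calculated analytically (see Appendix, Section 13). This makes $\hat{D}_{sa}(M \| \bar{M})$ usable in the online Lifelong RL setting, where already explored tasks may be partially learned, and little knowledge has been gathered on the current task. The magnitude of the $B$ term is controlled by $\epsilon$. In the case where no information is available on the maximum value of $\bar{V}$, we have that $B = \frac{\epsilon}{1-\gamma}$. $\epsilon$ measures the accuracy with which the tasks are known: the smaller $\epsilon$, the tighter the $B$ bound. Note that $\bar{V}$ is used as an upper bound on the true $V^*_{\bar{M}}$. In many cases, $\max_{s'} V^*_{\bar{M}}(s') \le \frac{1}{1-\gamma}$; e.g., for stochastic shortest path problems, which feature rewards only upon reaching terminal states, we have that $\max_{s'} V^*_{\bar{M}}(s') = 1$ and thus $B = (1 + \gamma)\epsilon$ is a tighter bound for transfer. Combining $\hat{D}_{sa}(M \| \bar{M})$ and Equation 3, one can derive an upper bound $\hat{d}_{sa}(M \| \bar{M})$ on $d_{sa}(M \| \bar{M})$, detailed in Proposition 4.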

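Proposition 4. Given two tasks $M, \bar{M} \in \mathcal{M}$, and $K$ the set of state-action pairs where $(R, T)$ is known with accuracy $\epsilon$ in $L_1$-norm with probability at least $1 - \delta$. If $\gamma(1 + \epsilon) < 1$, the solution $\hat{d}_{sa}(M \| \bar{M})$ of the following fixed-point equation on $\hat{d}_{sa}$ (for all $(s, a) \in \mathcal{S} \times \mathcal{A}$) is an upper bound on $d_{sa}(M \| \bar{M})$ with probability at least $1 - \delta$:

$$\hat{d}_{sa} = \hat{D}_{sa}(M \| \bar{M}) + \begin{cases} \gamma \left( \sum_{s' \in \mathcal{S}} \hat{T}^a_{ss'} \max_{a' \in \mathcal{A}} \hat{d}_{s'a'} + \epsilon \max_{s', a'} \hat{d}_{s'a'} \right) & \text{if } (s, a) \in K, \\ \gamma \max_{s', a'} \hat{d}_{s'a'} & \text{otherwise.} \end{cases} \tag{8}$$

The condition $\gamma(1 + \epsilon) < 1$ illustrates the fact that for a large return horizon (large $\gamma$), a high accuracy (small $\epsilon$) is needed for the bound to be computable. Eventually, a computable upper bound on $Q^*_M$ given $\bar{M}$ with high probability is given by

$$\hat{U}_{\bar{M}}(s, a) = \bar{Q}(s, a) + \min\left\{\hat{d}_{sa}(M \| \bar{M}), \hat{d}_{sa}(\bar{M} \| M)\right\}. \tag{9}$$

The associated upper bound on $U(s, a)$ (Equation 6) given the set of previous tasks $\bar{\mathcal{M}} = \{\bar{M}_i\}_{i=1}^m$ is defined by

$$\hat{U}(s, a) = \min\left\{\frac{1}{1-\gamma}, \hat{U}_{\bar{M}_1}(s, a), \ldots, \hat{U}_{\bar{M}_m}(s, a)\right\}. \tag{10}$$

This upper bound can be used to transfer knowledge from a partially solved source task to a target task. If $\hat{U}(s, a) \le \frac{1}{1-\gamma}$ on a subset of $\mathcal{S} \times \mathcal{A}$, then the convergence rate can be improved. As complete knowledge of both tasks is not needed to compute the upper bound, it can be applied online in the Lifelong RL setting. In the next section, we present an algorithm that leverages this value-transfer method.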
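The fixed-point computation behind the dissimilarity upper bound can be sketched in a few lines. The update used here is an assumed reading of the fixed-point equation of Proposition 4: on known pairs, back up through the learned transitions and add an $\epsilon$-weighted worst-case term; on unknown pairs, back up the worst case directly. The iteration is a contraction only if $\gamma(1 + \epsilon) < 1$. All names and numbers are hypothetical.

```python
def dissimilarity_upper_bound(D_hat, T_hat, known, gamma, eps, tol=1e-10):
    """Iterate the assumed fixed-point update for d_hat until convergence."""
    if gamma * (1.0 + eps) >= 1.0:
        raise ValueError("gamma * (1 + eps) must be < 1 for the iteration to contract")
    n_s, n_a = len(D_hat), len(D_hat[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    while True:
        per_state = [max(row) for row in d]  # max over actions in each state
        overall = max(per_state)             # max over all state-action pairs
        new_d = []
        for s in range(n_s):
            row = []
            for a in range(n_a):
                if known[s][a]:
                    expected = sum(T_hat[s][a][t] * per_state[t] for t in range(n_s))
                    row.append(D_hat[s][a] + gamma * (expected + eps * overall))
                else:
                    row.append(D_hat[s][a] + gamma * overall)
            new_d.append(row)
        gap = max(abs(new_d[s][a] - d[s][a]) for s in range(n_s) for a in range(n_a))
        d = new_d
        if gap < tol:
            return d


# Hypothetical inputs: model-distance bounds, learned transitions, known-pair mask.
gamma, eps = 0.9, 0.05
D_hat = [[0.1, 0.3], [0.2, 1.9]]
T_hat = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
known = [[True, True], [True, False]]
d_hat = dissimilarity_upper_bound(D_hat, T_hat, known, gamma, eps)
```

In this toy instance a single unknown pair with a large model-distance bound dominates the worst-case term and inflates the bound at every other pair, which illustrates why tightening the model-distance estimates matters for transfer.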

4.3 Lipschitz RMax Algorithm

In Lifelong RL, MDPs are encountered sequentially. Applying RMax to task $M$ yields the set of known state-action pairs $K$, the learned models $\hat{T}$ and $\hat{R}$, and the upper bound $Q$ on $Q^*_M$. Saving this information when the task changes allows computing the upper bound of Equation 10 for the new target task, and using it to shrink the optimistic heuristic of RMax. This computation effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of $\hat{D}_{sa}(M \| \bar{M})$, $\hat{d}_{sa}(M \| \bar{M})$ and eventually $\hat{U}$, allowing for more efficient transfer where the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm 1. To avoid ambiguities with $\bar{\mathcal{M}}$, we use $\hat{\mathcal{M}}$ to store these learned features about previous MDPs. In a nutshell, the behavior of LRMax is precisely that of RMax, but with a tighter admissible heuristic $\hat{U}$ that becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1) as stated in Propositions 5 and 6 below. With $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$, the sample complexity of vanilla RMax is $\tilde{O}\left(\frac{S^2 A}{\epsilon^3 (1-\gamma)^3}\right)$, which is improved by LRMax in Proposition 5 and meets Condition C2. Finally, a provable upper bound $\hat{U}$ with high probability on $Q^*_M$ avoids negative transfer and meets Condition C3.

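Proposition 5 (Sample complexity (Strehl, Li, and Littman 2009)). With probability $1 - \delta$, the greedy policy w.r.t. $Q$ computed by LRMax achieves an $\epsilon$-optimal return in MDP $M$ after

$$\tilde{O}\left(\frac{S \, \left|\{s, a \in \mathcal{S} \times \mathcal{A} \mid \hat{U}(s, a) \ge V^*_M(s) - \epsilon\}\right|}{\epsilon^3 (1-\gamma)^3}\right)$$

samples (when logarithmic factors are ignored), with $\hat{U}$ defined in Equation 10 a non-static, decreasing quantity, upper bounded by $\frac{1}{1-\gamma}$.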

Proposition 5 shows that the sample complexity of LRMax is no worse than that of RMax. Consequently, in the worst case, LRMax performs as badly as learning from scratch, which is to say that the transfer method is not negative as it cannot degrade the performance.

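Proposition 6 (Computational complexity). The total computational complexity of LRMax (Algorithm 1) is

$$\tilde{O}\left(\tau + \frac{S^3 A^2 N}{1-\gamma} \ln\left(\frac{1}{\epsilon_Q (1-\gamma)}\right)\right)$$

with $\tau$ the number of interaction steps, $\epsilon_Q$ the precision of value iteration and $N$ the number of source tasks.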

4.4 Refining the LRMax Bounds

LRMax relies on bounds on the local MDP dissimilarity (Equation 8). The quality of the Lipschitz bound on $Q^*_M$ can be improved according to the quality of those estimates. We discuss two methods to provide finer estimates.

Refining with prior knowledge. First, from the definition of $D_{sa}(M \| \bar{M})$, it is easy to show that this pseudometric between models is always upper bounded by $\frac{1+\gamma}{1-\gamma}$. However, in practice, the tasks experienced in a Lifelong RL experiment might not cover the full span of possible MDPs $\mathcal{M}$ and may systematically be closer to each other than $\frac{1+\gamma}{1-\gamma}$. For instance, the distance between two games in the Arcade Learning Environment (ALE) (Bellemare et al. 2013) is smaller than the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section 17). Let us note $\hat{\mathcal{M}} \subset \mathcal{M}$ the set of possible MDPs for a particular Lifelong RL experiment. Let $D_{\max}(s, a) \triangleq \max_{M, \bar{M} \in \hat{\mathcal{M}}^2} D_{sa}(M \| \bar{M})$ be the maximum model pseudo-distance at a particular $s, a$ pair on the subset $\hat{\mathcal{M}}$. Prior knowledge might indicate a smaller upper bound for $D_{\max}(s, a)$ than $\frac{1+\gamma}{1-\gamma}$. We will note such an upper bound $D_{\max}$, considered valid for all $s, a$ pairs, i.e., such that $D_{\max} \ge \max_{s, a, M, \bar{M}} D_{sa}(M \| \bar{M})$. In a Lifelong RL experiment, $D_{\max}$ can be seen as a rough estimate of the maximum model discrepancy an agent may encounter. Figure 2 illustrates the relative importance of $D_{\max}$ vs. $\frac{1+\gamma}{1-\gamma}$. Solving Equation 8 boils down to accumulating $\hat{D}_{sa}(M \| \bar{M})$ values in $\hat{d}_{sa}(M \| \bar{M})$. Hence, reducing a $\hat{D}_{sa}(M \| \bar{M})$ estimate in a single $s, a$ pair actually reduces $\hat{d}_{sa}(M \| \bar{M})$ in all $s, a$ pairs. Thus, replacing $\hat{D}_{sa}(M \| \bar{M})$ in Equation 8 by $\min\{D_{\max}, \hat{D}_{sa}(M \| \bar{M})\}$ provides a smaller upper bound $\hat{d}_{sa}(M \| \bar{M})$ on $d_{sa}(M \| \bar{M})$, and thus a smaller $\hat{U}$, which allows transfer if it is less than $\frac{1}{1-\gamma}$. Consequently, the knowledge of such a bound $D_{\max}$ can make a difference between successful and unsuccessful transfer, even if its value is of little importance. Conversely, setting a value for $D_{\max}$ quantifies the distance between MDPs where transfer is efficient.

Figure 2: Illustration of the prior knowledge on the maximum pseudo-distance between models for a particular s, a pair.
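The worst-case value of the pseudometric between models, which a prior is meant to improve on, can be derived directly. The sketch below assumes rewards in $[0, 1]$, a weighting function $f = \gamma V^*_{\bar{M}}$ with $V^*_{\bar{M}} \le \frac{1}{1-\gamma}$, and uses the fact that the $L_1$ distance between two probability distributions is at most 2.

```latex
\begin{align*}
D_{sa}(M \,\|\, \bar{M})
  &= |R^a_s - \bar{R}^a_s|
     + \gamma \sum_{s' \in \mathcal{S}} V^*_{\bar{M}}(s')\, |T^a_{ss'} - \bar{T}^a_{ss'}| \\
  &\le 1 + \frac{\gamma}{1-\gamma} \sum_{s' \in \mathcal{S}} |T^a_{ss'} - \bar{T}^a_{ss'}| \\
  &\le 1 + \frac{2\gamma}{1-\gamma}
   = \frac{1+\gamma}{1-\gamma}.
\end{align*}
```

For example, with $\gamma = 0.9$ this worst case equals 19, so any prior noticeably below that value already restricts the set of tasks the agent expects to face.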

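Refining by learning the maximum distance. The value of $D_{\max}(s, a)$ can be estimated online for each $s, a$ pair, discarding the hypothesis of available prior knowledge. We propose to use an empirical estimate of the maximum model distance at $s, a$: $\hat{D}_{\max}(s, a) \triangleq \max_{M, \bar{M} \in \hat{\mathcal{M}}^2} \hat{D}_{sa}(M \| \bar{M})$, with $\hat{\mathcal{M}}$ the set of explored tasks. The pitfall of this approach is that, with few explored tasks, $\hat{D}_{\max}(s, a)$ could underestimate $D_{\max}(s, a)$. Proposition 7 provides a lower bound on the probability that $\hat{D}_{\max}(s, a) + \epsilon$ does not underestimate $D_{\max}(s, a)$, depending on the number of sampled tasks.

Proposition 7. Consider an algorithm producing $\epsilon$-accurate model estimates $\hat{D}_{sa}(M \| \bar{M})$ for a subset $K$ of $\mathcal{S} \times \mathcal{A}$ after interacting with any two MDPs $M, \bar{M} \in \mathcal{M}$. Assume $\hat{D}_{sa}(M \| \bar{M}) \ge D_{sa}(M \| \bar{M})$ for any $s, a \notin K$. For all $s, a \in \mathcal{S} \times \mathcal{A}$ and $\delta \in (0, 1]$, after sampling $m$ tasks, if $m$ is large enough to verify $2(1 - p_{\min})^m - (1 - 2p_{\min})^m \le \delta$, then

$$\Pr\left(\hat{D}_{\max}(s, a) + \epsilon \ge D_{\max}(s, a)\right) \ge 1 - \delta.$$

This result indicates when $\hat{D}_{\max}(s, a) + \epsilon$ upper bounds $D_{\max}(s, a)$ with high probability. In such a case, $\hat{D}_{sa}(M \| \bar{M})$ of Equation 8 can be replaced by $\min\{\hat{D}_{\max}(s, a) + \epsilon, \hat{D}_{sa}(M \| \bar{M})\}$ to tighten the bound on $d_{sa}(M \| \bar{M})$. Assuming a lower bound $p_{\min}$ on the sampling probability of a task implies that $\mathcal{M}$ is finite and is seen as a non-adversarial task sampling rule (Abel et al. 2018).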
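To get a feel for how many tasks must be sampled before the empirical maximum distance can be trusted, the helper below searches for the smallest number of tasks satisfying a condition of the form $2(1 - p_{\min})^m - (1 - 2p_{\min})^m \le \delta$. That form is an assumption made here: it is the probability that at least one of two fixed tasks, each sampled with probability at least $p_{\min}$, is missing after $m$ independent draws. The function name is hypothetical.

```python
def min_tasks_for_confidence(p_min, delta):
    """Smallest m such that 2 * (1 - p_min)**m - (1 - 2 * p_min)**m <= delta."""
    if not 0.0 < p_min <= 0.5:
        raise ValueError("p_min must lie in (0, 0.5] when two distinct tasks exist")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    m = 1
    # The left-hand side equals 1 at m = 1 and decreases as more tasks are drawn.
    while 2.0 * (1.0 - p_min) ** m - (1.0 - 2.0 * p_min) ** m > delta:
        m += 1
    return m


# Example: five equiprobable tasks (p_min = 0.2) and a 5% failure probability.
m_required = min_tasks_for_confidence(0.2, 0.05)
```

The required number of tasks grows as $p_{\min}$ shrinks, which matches the intuition that rarely sampled tasks take longer to show up in the set of explored tasks.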

5 Experiments

The experiments reported here illustrate how the Lipschitz bound (Equation 9) provides a tighter upper bound on $Q^*_M$, improving the sample complexity of LRMax compared to RMax, and making the transfer of inter-task knowledge effective. Graphs are displayed with 95% confidence intervals. For information in line with the Machine Learning Reproducibility Checklist (Pineau 2019) see the Appendix, Section 22.

We evaluate different variants of LRMax in a Lifelong RL experiment. The RMax algorithm will be used as a no-transfer baseline. LRMax(x) denotes Algorithm 1 with prior $D_{\max} = x$. MaxQInit denotes the MaxQInit algorithm from Abel et al. (2018), a state-of-the-art PAC-MDP algorithm. Both LRMax and MaxQInit achieve value transfer by providing a tighter upper bound on $Q^*_M$ than $\frac{1}{1-\gamma}$. Computing both upper bounds and taking the minimum combines the two approaches. We include such a combination in our study with the LRMaxQInit algorithm. Similarly, the latter algorithm benefiting from prior knowledge $D_{\max} = x$ is denoted by LRMaxQInit(x). For the sake of comparison, we only compare algorithms with the same features, namely, tabular, online, PAC-MDP methods, presenting non-negative transfer.

The environment used in all experiments is a variant of the “tight” task used by Abel et al. (2018). It is a grid-world in which the initial state is in the centre and actions are the cardinal moves (Appendix, Section 18). The reward is always zero except for the three goal cells in the upper-right corner. Each sampled task has its own reward values, drawn from [0.8, 1] for each of the three goal cells, and its own probability of slipping (performing a different action than the one selected), picked in [0, 0.1]. Hence, tasks have different reward and transition functions. Notice the distinction in applicability between MaxQInit, which requires the set of MDPs to be finite, and LRMax, which can be used with any set of MDPs. For the comparison between both to be possible, we drew tasks from a finite set of 5 MDPs. We sample 15 tasks sequentially among this set, each run for 2000 episodes of length 10. The operation is repeated 10 times to narrow the confidence intervals. The parameter settings are discussed in Appendix, Section 21. Other Lifelong RL experiments are reported in Appendix, Section 19.
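The task-sampling protocol described above can be sketched as follows. This is only an illustration of the sampling scheme, not the authors' experimental code: it assumes uniform draws inside the stated intervals and uniform sampling among the pool, and the dictionary keys and function names are hypothetical.

```python
import random


def make_task_pool(n_tasks=5, seed=0):
    """Build a finite pool of task parameters (assumed uniform draws)."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_tasks):
        pool.append({
            # One reward per goal cell, drawn from [0.8, 1].
            "goal_rewards": [rng.uniform(0.8, 1.0) for _ in range(3)],
            # Probability of performing a different action than the one selected.
            "slip_prob": rng.uniform(0.0, 0.1),
        })
    return pool


def sample_task_sequence(pool, n_draws=15, seed=1):
    """Draw the sequence of tasks faced by the lifelong agent (with replacement)."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(n_draws)]


pool = make_task_pool()
sequence = sample_task_sequence(pool)
```

Drawing with replacement from a small pool means the same MDP can reappear several times in the sequence, which is the regime where a bound built from previously solved tasks is most likely to be tight.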

The results are reported in Figure 3. Figure 3a displays the discounted return for each task, averaged across episodes. Similarly, Figure 3b displays the discounted return for each episode, averaged across tasks (same color code as Figure 3a). Figure 3c displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relative to an estimator of the optimal expected return for each task. For readability, Figures 3b and 3c display a moving average over 100 episodes. Figure 3d reports the benefits of various values of $D_{\max}$ on the algorithmic properties.

In Figure 3a, we first observe that LRMax benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. In contrast, MaxQInit must wait until task 12 before benefiting from transfer. As suggested in Section 4.4, increasing amounts of prior knowledge allow the LRMax transfer method to be more efficient: a smaller known upper bound $D_{\max}$ on $D_{sa}(M \| \bar{M})$ accelerates convergence. Combining both approaches in the LRMaxQInit algorithm outperforms all other methods. Episode-wise, we observe in Figure 3b that the LRMax transfer method allows for faster convergence, i.e., lower sample complexity. Interestingly, LRMax exhibits three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This behavior results from two facts: the Lipschitz bound (Equation 9) is larger on promising regions of $\mathcal{S} \times \mathcal{A}$ seen on previous tasks, and LRMax acts greedily w.r.t. that bound. 2) This high-performance regime is followed by the exploration of unknown regions of $\mathcal{S} \times \mathcal{A}$, in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low-payoff regions. Such regions are then identified and never revisited. 3) Eventually, LRMax stops exploring and converges to the optimal policy. Importantly, in all experiments, LRMax never experiences negative transfer, as supported by the fact that the Lipschitz upper bound provably holds with high probability. LRMax is at least as efficient as the no-transfer RMax baseline.

Figure 3c displays the collected returns of RMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from transfer as early as Task 2, where the previous 3-stage behavior is visible. MaxQInit takes until task 12 to leverage the transfer method. However, the bound it provides is tight enough that it does not have to explore.

In Figure 3d, we display the following quantities for various values of $D_{\max}$: $\rho_{Lip}$, the fraction of the time the Lipschitz bound was tighter than the RMax bound $\frac{1}{1-\gamma}$; $\rho_{Speed\text{-}up}$, the relative gain of time steps before convergence when comparing LRMax to RMax, estimated based on the last updates of the empirical model $\hat{M}$; and $\rho_{Return}$, the relative total return gain on 2000 episodes of LRMax w.r.t. RMax. First, we observe an increase of $\rho_{Lip}$ as $D_{\max}$ becomes tighter. This means that the Lipschitz bound of Equation 9 becomes effectively smaller than $\frac{1}{1-\gamma}$. This phenomenon leads to faster convergence, indicated by $\rho_{Speed\text{-}up}$. Eventually, this increased convergence rate allows for a net total return gain, as can be seen with the increase of $\rho_{Return}$.

Overall, in this analysis, we have shown that LRMax benefits from an enhanced sample complexity thanks to the value-transfer method. The knowledge of a prior $D_{\max}$ increases this benefit. The method is comparable to the MaxQInit method and has some advantages such as the early fitness for use and the applicability to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. Additionally, we show in Appendix, Section 20 that, when provided with any prior $D_{\max}$, LRMax increasingly stops using it during exploration, confirming the claim of Section 4.4 that providing $D_{\max}$ enables transfer even if its value is of little importance.

Figure 3: Experimental results. LRMax benefits from an enhanced sample complexity thanks to the value-transfer method.

6 Conclusion

We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space w.r.t. a new metric. We proved a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks and exhibits no negative transfer. Improvements of the algorithm were discussed in the form of prior knowledge on the maximum distance between models and online estimation of this distance. As a non-negative, similarity-based, PAC-MDP transfer method, the LRMax algorithm is the first method of the literature combining those three appealing features. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning while not experiencing any negative transfer. Notably, our approach can directly extend other PAC-MDP algorithms (Szita and Szepesvári 2010; Rao and Whiteson 2012; Pazis, Parr, and How 2016; Dann, Lattimore, and Brunskill 2017) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis for non-negative value transfer via MDP similarity, a study that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme achieving non-negative transfer with PAC guarantees. Further, designing scalable methods conveying the same intuition could be a promising research direction.

We would like to thank Dennis Wilson for fruitful discussions and comments on the paper. This research was supported by the Occitanie region, France; ISAE-SUPAERO (Institut Supérieur de l’Aéronautique et de l’Espace); fondation ISAE-SUPAERO; École Doctorale Systèmes; and ONERA, the French Aerospace Lab.

References

Abel, D.; Jinnai, Y.; Guo, S. Y.; Konidaris, G.; and Littman, M. L. 2018. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 20–29.

Ammar, H. B.; Eaton, E.; Taylor, M. E.; Mocanu, D. C.; Driessens, K.; Weiss, G.; and Tuyls, K. 2014. An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning. In Workshops at the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).

Asadi, K.; Misra, D.; and Littman, M. L. 2018. Lipschitz Continuity in Model-Based Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018).

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.

Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.

Brafman, R. I.; and Tennenholtz, M. 2002. R-max - a General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3(Oct): 213–231.

Brunskill, E.; and Li, L. 2013. Sample Complexity of Multi-task Reinforcement Learning. In Proceedings of the 29th conference on Uncertainty in Artificial Intelligence (UAI 2013).

Brunskill, E.; and Li, L. 2014. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 316–324.

Carroll, J. L.; and Seppi, K. 2005. Task Similarity Measures for Transfer in Reinforcement Learning Task Libraries. In Proceedings of the 5th International Joint Conference on Neural Networks (IJCNN 2005), volume 2, 803–808. IEEE.

Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5713–5723.

Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.

Lazaric, A. 2012. Transfer in Reinforcement Learning: a Framework and a Survey. In Reinforcement Learning, 143–173. Springer.

Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Transfer of Samples in Batch Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 544–551.

Mahmud, M. M.; Hawasly, M.; Rosman, B.; and Ramamoorthy, S. 2013. Clustering Markov Decision Processes for Continual Transfer. Computing Research Repository (arXiv/CoRR) URL https://arxiv.org/abs/1311.3959.

Neyman, J. 1937. X—outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences 236(767): 333– 380.

Pazis, J.; Parr, R. E.; and How, J. P. 2016. Improving PAC Exploration using the Median of Means. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 3898–3906.

Pineau, J. 2019. Machine Learning Reproducibility Checklist. https://www.cs.mcgill.ca/jpineau/ReproducibilityChecklist.pdf. Version 1.2, March 27, 2019, last accessed on August 27, 2020.

Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy gradient in Lipschitz Markov Decision Processes. Machine Learning 100(2-3): 255–283.

Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rachelson, E.; and Lagoudakis, M. G. 2010. On the Locality of Action Domination in Sequential Decision Making. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010).

Rao, K.; and Whiteson, S. 2012. V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), 375–382.

Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, 05.

Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.

Sorg, J.; and Singh, S. 2009. Transfer via Soft Homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), 741–748. International Foundation for Autonomous Agents and Multiagent Systems.

Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press, Cambridge.

Szita, I.; and Szepesvári, C. 2010. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 1031–1038.

Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Watkins, C. J. C. H.; and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4): 279–292.

Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi-Task Reinforcement Learning: A Hierarchical Bayesian Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 1015–1022.

Erwan Lecarpentier1, 2, David Abel3, Kavosh Asadi3, 4, Yuu Jinnai3, Emmanuel Rachelson1, Michael L. Littman3

7 Negative transfer

where $c$ is a positive constant and $W_1$ is the 1-Wasserstein metric. For each state of the target model, the closest counterpart state (with the smallest bi-simulation distance) of the source MDP is identified and its learned Q-values are used to initialize the Q-function of the target MDP. In their experiments, Song et al. (2016) run a standard Q-Learning algorithm (Watkins and Dayan 1992) with an $\epsilon$-greedy exploration strategy thereafter.

Figure 4: The T-shaped MDP transfer task.

Let us now consider applying this method to a task similar to the T-shaped MDP transfer task proposed by Taylor and Stone (2009). The source and target tasks are respectively described on the left and right sides of Figure 4. In each task, the states are represented by circles and the arrows between them correspond to the available actions that move the agent from one state to another. The initial state is the leftmost state, I for the source task and I′ for the target task. Between the states I and A in the source task (respectively I′ and A′ in the target task) are k states, k being a parameter increasing the distance to travel from I to A (respectively from I′ to A′). The tasks are deterministic and the reward is zero everywhere except for the state B in the source task and C′ in the target task, where a reward of +1 is received. Consequently, the optimal policy in the source task is to go to the state A and then to the state B. In the target task, the same applies except that, when the agent is in state A′, it should transition to state C′ instead of state B′.

Regardless of the parameters used in the bi-simulation metric of Equation 11, the direct state transfer method from Song et al. (2016) maps the following states together as they share the exact same model:

Hence, during learning, the Q-function of the target task is initialized with the values of the Q-function of the source task. Therefore, the behavior derived with the Q-Learning algorithm is the optimal policy of the source task, but executed in the target task. Depending on the value of the learning rate of the algorithm, the time needed to favor action DOWN over action UP in state A′ can be arbitrarily long. Also, depending on the value of $\epsilon$, the exploration of state C′ due to the $\epsilon$-greedy strategy can be arbitrarily unlikely. Finally, the time needed for one of those two events to occur increases proportionally to the value of k, which can be arbitrarily large.

This case illustrates the difficulty facing any transfer method in the general context of Lifelong RL. The method proposed by Song et al. (2016) can be highly efficient in some cases, as they show in experiments, but the lack of theoretical guarantees makes negative transfer possible. Generally, using a similarity measure such as the bi-simulation metric helps to discourage using some source tasks when the computed similarity is too low. However, as we saw in the T-shaped MDP example, this rule is not absolute and the choice of the metric is important. The approach we develop in this paper aims at avoiding negative transfer by providing conservative transferred knowledge that is simply of no use when the similarity between source and target tasks is too low. This is intuitive as we do not expect any task to provide transferable knowledge to any other task.
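The failure mode described above can be reproduced on a toy version of the T-shaped task. The sketch below is an illustration under our own assumptions, not the setup of Song et al. (2016): the state indexing, the choice that both actions move right along the corridor, the discount factor, and the helper names `t_shaped_mdp`, `optimal_q` and `greedy_return` are all hypothetical. It only shows that acting greedily on the transferred source Q-values collects no reward in the target task, whereas the target-optimal policy does.

```python
import numpy as np

UP, DOWN = 0, 1

def t_shaped_mdp(k, rewarded):
    """Deterministic T-shaped task: I -> k corridor states -> A, then UP to B or DOWN to C.

    Hypothetical encoding: state 0 is I, states 1..k are the corridor,
    k+1 is A, k+2 is B and k+3 is C. `rewarded` is "B" (source) or "C" (target).
    """
    n = k + 4
    a_state, b_state, c_state = k + 1, k + 2, k + 3
    nxt = np.zeros((n, 2), dtype=int)
    rew = np.zeros((n, 2))
    for s in range(a_state):          # I and the corridor: both actions move right
        nxt[s] = s + 1
    nxt[a_state] = [b_state, c_state]  # UP leads to B, DOWN leads to C
    nxt[b_state] = b_state             # B and C are absorbing
    nxt[c_state] = c_state
    rew[a_state, UP if rewarded == "B" else DOWN] = 1.0
    return nxt, rew

def optimal_q(nxt, rew, gamma=0.9, iters=500):
    """Value iteration on the Q-function of a deterministic MDP."""
    q = np.zeros(rew.shape)
    for _ in range(iters):
        q = rew + gamma * q.max(axis=1)[nxt]
    return q

def greedy_return(q, nxt, rew, gamma=0.9, horizon=200):
    """Discounted return obtained from state I when acting greedily w.r.t. q."""
    s, ret, disc = 0, 0.0, 1.0
    for _ in range(horizon):
        a = int(np.argmax(q[s]))
        ret += disc * rew[s, a]
        disc *= gamma
        s = nxt[s, a]
    return ret

k = 5
src_nxt, src_rew = t_shaped_mdp(k, "B")
tgt_nxt, tgt_rew = t_shaped_mdp(k, "C")
q_src = optimal_q(src_nxt, src_rew)
q_tgt = optimal_q(tgt_nxt, tgt_rew)
print("transferred policy:", greedy_return(q_src, tgt_nxt, tgt_rew))
print("target-optimal policy:", greedy_return(q_tgt, tgt_nxt, tgt_rew))
```

The sketch deliberately leaves out the Q-Learning updates and the exploration strategy; it isolates the initial behavior induced by the transferred values, which is what the learning rate and exploration then have to overcome.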

8 Discussion on metrics and related notions

A metric on a set X is a function $m : X \times X \to \mathbb{R}$ which has the following properties for any $x, y, z \in X$:

P1. $m(x, y) \geq 0$ (positivity),

P2. $m(x, y) = 0 \Leftrightarrow x = y$ (positive definiteness),

P3. $m(x, y) = m(y, x)$ (symmetry),

P4. $m(x, z) \leq m(x, y) + m(y, z)$ (triangle inequality).

If property P2 is not verified by m, but instead we have $m(x, x) = 0$ for any $x \in X$, then m is called a pseudo-metric. If m only verifies P1, P2 and P4 then m is called a quasi-metric. If m only verifies P1 and P2 and if X is a set of probability measures, then m is called a divergence.

From this, the pseudo-metric between models of Definition 1 is indeed a pseudo-metric, as it is relative to a non-negative function f that could be equal to zero and break property P2.

The local MDP dissimilarity $d_{sa}(M \| \bar{M})$ between MDPs of Proposition 1 does not respect properties P2 and P3, hence the name dissimilarity. The quantity $\Delta_{sa}(M, \bar{M}) = \min\{d_{sa}(M \| \bar{M}), d_{sa}(\bar{M} \| M)\}$, however, regains property P3 and is hence a pseudo-metric. A noticeable consequence is that Proposition 1 is “in the spirit” of a Lipschitz continuity result but cannot be called as such, hence the name pseudo-Lipschitz continuity.

The same goes for the global dissimilarity $d(M \| \bar{M})$. However, using $\Delta(M, \bar{M}) = \min\{d(M \| \bar{M}), d(\bar{M} \| M)\}$ regains property P3 and makes this quantity a pseudo-metric between MDPs again.
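To make the taxonomy of this section concrete, the following sketch checks properties P1 to P4 by brute force on a small finite set of reals. The helper `metric_properties` and the four example functions are hypothetical illustrations chosen here, not objects from the paper. The last example mimics, on a toy scale, the idea of taking the minimum of the two directions of an asymmetric dissimilarity to recover symmetry; in this particular example the triangle inequality also survives, which is not guaranteed in general.

```python
import itertools

def metric_properties(points, m, tol=1e-12):
    """Report which of P1-P4 hold for the function m on a finite list of points."""
    pairs = list(itertools.product(points, repeat=2))
    triples = list(itertools.product(points, repeat=3))
    return {
        "P1": all(m(x, y) >= -tol for x, y in pairs),
        "P2": all((abs(m(x, y)) <= tol) == (x == y) for x, y in pairs),
        "P3": all(abs(m(x, y) - m(y, x)) <= tol for x, y in pairs),
        "P4": all(m(x, z) <= m(x, y) + m(y, z) + tol for x, y, z in triples),
    }

points = [-2.0, -1.0, 0.0, 1.0, 3.0]

def absolute(x, y):
    # a metric: all four properties hold
    return abs(x - y)

def folded(x, y):
    # a pseudo-metric: folded(-1, 1) = 0 although -1 != 1, so P2 fails
    return abs(abs(x) - abs(y))

def directed(x, y):
    # a quasi-metric: moving "up" costs twice as much as moving "down", so P3 fails
    return 2.0 * max(y - x, 0.0) + max(x - y, 0.0)

def symmetrized(x, y):
    # minimum of the two directions, which is symmetric by construction
    return min(directed(x, y), directed(y, x))

for name, m in [("absolute", absolute), ("folded", folded),
                ("directed", directed), ("symmetrized", symmetrized)]:
    print(name, metric_properties(points, m))
```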

9 Proof of Proposition 1

Notation 1. Given two sets X and Y , we note F (X, Y ) the set of functions defined on the domain X with codomain Y .

Lemma 1. Given two MDPs , the following equation on is a fixed-point equation admitting a unique solution for any

We refer to this unique solution as

Proof of Lemma 1. The proof follows closely that in (Puterman 2014) that proves that the Bellman operator over value functions is a contraction mapping. Let L be the functional operator that maps any function

Then for f and g, two functions from , we have that

Since this is true for any pair , we have that

Since $\gamma < 1$, L is a contraction mapping in the metric space $(F(S \times A, \mathbb{R}), \|\cdot\|_\infty)$. This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.
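Because the solution is characterized as the fixed point of a contraction, it can be approximated by simply iterating the operator from any starting point. The sketch below illustrates this on random tabular MDPs. The exact operator is an assumption on our part: we take $(Ld)(s,a) = |R_s^a - \bar{R}_s^a| + \gamma \sum_{s'} V^*_{\bar{M}}(s') |T_{ss'}^a - \bar{T}_{ss'}^a| + \gamma \sum_{s'} T_{ss'}^a \max_{a'} d(s', a')$, with rewards in [0, 1]; the function names and the random MDP generator are hypothetical. Under that assumption, the code checks numerically that L contracts with factor $\gamma$ in the sup norm and that the minimum of the two directed fixed points upper-bounds the gap between optimal Q-functions.

```python
import numpy as np

def random_mdp(rng, n_states, n_actions):
    """Random tabular MDP with rewards in [0, 1], so optimal values are non-negative."""
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions))
    return T, R

def optimal_q(T, R, gamma, iters=1000):
    """Value iteration on the optimal Q-function."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def apply_L(d, T, R, T_bar, R_bar, V_bar, gamma):
    """One application of the ASSUMED operator L described in the lead-in."""
    model_gap = np.abs(R - R_bar) + gamma * np.abs(T - T_bar) @ V_bar
    return model_gap + gamma * T @ d.max(axis=1)

def local_dissimilarity(T, R, T_bar, R_bar, gamma, iters=1000):
    """Approximate the fixed point d = L d by repeated application of L."""
    V_bar = optimal_q(T_bar, R_bar, gamma).max(axis=1)
    d = np.zeros_like(R)
    for _ in range(iters):
        d = apply_L(d, T, R, T_bar, R_bar, V_bar, gamma)
    return d

rng = np.random.default_rng(0)
gamma = 0.9
T, R = random_mdp(rng, 5, 3)
T_bar, R_bar = random_mdp(rng, 5, 3)

d = local_dissimilarity(T, R, T_bar, R_bar, gamma)       # direction M || M_bar
d_bar = local_dissimilarity(T_bar, R_bar, T, R, gamma)   # direction M_bar || M
gap = np.abs(optimal_q(T, R, gamma) - optimal_q(T_bar, R_bar, gamma))
print("bound holds:", bool(np.all(gap <= np.minimum(d, d_bar) + 1e-9)))
```

With $\gamma = 0.9$, a thousand iterations leave a residual far below floating-point precision; a stopping rule on the sup-norm change between iterates would be the natural refinement.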

Proof of Proposition 1. The proof is by induction. The value iteration sequence of iterates of the optimal Q-function of any MDP is defined for all

Consider two MDPs . It is obvious that. Suppose the property true at rank $n \in \mathbb{N}$. Consider now the rank n + 1 and a pair

where we used Lemma 1 in the last inequality. Since and are respectively the limits of the sequences and , it results from passage to the limit that

By symmetry, we also have and we can take the minimum of the two valid upper bounds, yielding:

10 Similar results to Proposition 1

Similar results to Proposition 1 can be derived. First, an important consequence is the global pseudo-Lipschitz continuity result

From a pure transfer perspective, Equation 12 is interesting since the right hand side does not depend on s, a. Hence, the counterpart of the upper bound of Equation 4, namely,

is easier to compute. Indeed, the global quantity can be computed once and for all, contrary to its local counterpart, which needs to be evaluated for every (s, a) pair. However, we do not use this result for transfer because it is impractical to compute online. Indeed, estimating the maximum in its definition can be as hard as solving both MDPs, which, when it happens, is too late for transfer to be useful.

Proof of Proposition 8. The proof is by induction. We consider the sequence of value iteration iterates defined for any MDP

and, by symmetry, the result holds as well for . Suppose that it is true at rank . Consider rank n + 1 and

, we have that:

By symmetry, the result holds as well for which concludes the proof by induction.

The second result is for the value function and is stated below.

Proposition 9 (Local pseudo-Lipschitz continuity of the optimal value function). For any two MDPs

where the local MDP pseudo-metric has the same definition as in Proposition 1.

Proof of Proposition 9. The proof follows exactly the same steps as the proof of Proposition 1, i.e., by first constructing the value iteration sequence of iterates of the optimal value function, showing the result by induction for rank $n \in \mathbb{N}$ and then concluding with a passage to the limit.

Another result can be derived for the value of any policy $\pi$. For the sake of generality, we state the result for any stochastic policy mapping states to distributions over actions. Note that a deterministic policy is a stochastic policy mapping states to Dirac distributions over actions. First, we state the result for the value function in Proposition 10 and then for the Q-function in Proposition 11.

Proposition 10 (Local pseudo-Lipschitz continuity of the value function of any policy). For any two MDPs , for any stochastic stationary policy

where and is defined as the fixed-point of the following fixed-point equation on

Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.

Lemma 2. Given two MDPs , any stochastic stationary policy , the following equation on is a fixed-point equation admitting a unique solution for any

We refer to this unique solution as

Proof of Lemma 2. Let L be the functional operator that maps any function

Then for f and g, two functions from S to R, we have that

Hence we have that is a contraction mapping in the metric space This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 10. Consider a stochastic stationary policy $\pi$. The value iteration sequence of iterates of the value function of any MDP is defined for all

Consider two MDPs . It is obvious that for all . Suppose the property true at rank . Consider now the rank n + 1 and the state

where we used Lemma 2 in the last inequality. Since and are respectively the limits of the sequences and , it results from passage to the limit that

By symmetry, we also have and we can take the minimum of the two valid upper bounds, yielding:

which concludes the proof.

Proposition 11 (Local pseudo-Lipschitz continuity of the Q-function of any policy). For any two MDPs , for any stochastic stationary policy

where is defined as the fixed-point of the following fixed-point equation on

Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.

Lemma 3. Given two MDPs , any stochastic stationary policy , the following equation on is a fixed-point equation admitting a unique solution for any

We refer to this unique solution as

Proof of Lemma 3. Let L be the functional operator that maps any function

Then for f and g, two functions from , we have for all

Hence we have that . Since is a contraction mapping in the metric space . This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 11. Consider a stochastic stationary policy $\pi$. The value iteration sequence of iterates of the Q-function for the policy $\pi$ is defined for all

Consider two MDPs . It is obvious that for all . Suppose the property true at rank for all . Consider now the rank n + 1

where we used Lemma 3 in the last inequality. Since and are respectively the limits of the sequences and

By symmetry, we also have and we can take the minimum of the two valid upper bounds,

11 Proof of Proposition 2

Proof of Proposition 2. The result is clear for all since the Lipschitz bounds are provably greater than . For , the result is shown by induction. Let us consider the Dynamic Programming (Bellman 1957) sequences converging

Obviously, we have at rank n = 0 that for all . Suppose the property true at rank and consider rank n + 1:

Which concludes the proof by induction. The result holds by passage to the limit since the considered Dynamic Programming sequences converge to the true functions.

12 Proof of Proposition 3

Proof of Proposition 3. Consider two tasks M = (T, R) and , with K and the respective sets of state-action pairs where their learned models are known with accuracy -norm with probability at least , we have that,

Importantly, notice that the probabilistic event of Inequality 13 is the intersection of at most 4SA individual events of estimating either or with precision . Each one of those individual events is itself true with probability at least , where is a parameter. For all the individual events to be true at the same time, i.e. for Inequality 13 to be verified, one must apply Boole’s inequality and set to ensure a total probability — i.e., probability of the intersection of all the individual events — of at least
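The union-bound step can be written out explicitly. In the sketch below, the symbols are our own labels and may differ from the paper's: $E_1, \dots, E_n$ with $n \leq 4SA$ denote the individual estimation events, $\delta'$ the per-event failure probability, and $\delta$ the overall failure probability we are willing to tolerate.

```latex
\Pr\Big(\bigcap_{i=1}^{n} E_i\Big)
  = 1 - \Pr\Big(\bigcup_{i=1}^{n} E_i^{c}\Big)
  \;\geq\; 1 - \sum_{i=1}^{n} \Pr\big(E_i^{c}\big)
  \;\geq\; 1 - n\,\delta'
  \;\geq\; 1 - 4SA\,\delta'
```

Under these assumptions, choosing $\delta' = \delta / (4SA)$ is sufficient for the intersection to hold with probability at least $1 - \delta$. No independence between the events is needed, since Boole's inequality holds for arbitrary events.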

We demonstrate now the result for each one of the three cases

the case being the symmetric of case (ii).

(i) If , then we have $\epsilon$-close estimates of both models with high probability, as described by Inequality 13. By definition:

The second term of the right hand side of Equation 14 respects the following sequence of inequalities with probability at least

Replacing the Inequalities 15 and 16 in Equation 14 yields

which holds with probability at least and proves the result for case (i).

(ii) If , then we do not have an $\epsilon$-close estimate of and . Similarly to the proof of case (i), we upper bound sequentially the two terms of the right hand side of Equation 14. With probability at least , we have the following:

Similarly, with probability at least

where is the set of probability vectors of size S. Combining inequalities 17 and 18, we get the following with probability at least , by noticing on the left hand side:

which is the expected result for case (ii).

(iii) If , then we do not have $\epsilon$-close estimates of both tasks. In such a case, the result is straightforward by remarking that, as a consequence of Inequality 13, we have that with probability at least

13 Analytical calculation of ˆDsa(M∥ ¯M) in Proposition 3

Consider two tasks M = (T, R) and , with K and the respective sets of state-action pairs where their learned models and are known with accuracy in -norm with probability at least . We note , a known upper bound on the maximum achievable value. In the worst case where one does not have any information on the value of is a valid upper bound. We detail the computation of for each case: 1) 2) being the symmetric of case 2), the same calculations apply.

1) If

Since (s, a) is a known state-action pair, everything is known and computable in this last equation. Note that this quantity can be tracked along the updates of and thus its computation does not induce any additional computational complexity.

2) If

First, we have

Maximizing over the variable such that is equivalent to maximizing a convex combination of the positive vector whose terms are not independent as they must sum to one. This is easily solvable as a linear programming problem. A straightforward (simplex-like) resolution procedure consists in progressively adding mass on the terms that will maximize the convex combination as follows:

•

• l = Sort states by decreasing values of

• While – = pop first state in l

This allows calculating the maximum over transition models.

Notice that there is a simpler computation that almost always yields the same result (when it does not, it provides an upper bound) and does not require the burden of the previous procedure. Consider the subset of states for which (often these are states in ). Among those states, let us suppose there exists , unreachable from (s, a), according to , i.e., . If has not been fully explored, as is often the case in RMax, there may be many such states. Then the distribution t with all its mass on maximizes the term. Conversely, if such a state does not exist (that is, if for all such states ), then is an upper bound on the term. Therefore:

with equality in many cases.

3) If , the resolution is trivial and we have

Overall, computing the value of the provided upper bound in the three cases allows computing
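The maximization over transition models in case 2) can also be sketched with a shortcut that differs from the simplex-like procedure described above. We assume, as a reading of the text rather than a quotation of it, that the term to maximize has the form $\sum_{s'} w(s') |\hat{T}(s') - t(s')|$ over probability vectors $t$, with non-negative weights $w$ (for instance a discounted value upper bound). This objective is convex in $t$, so its maximum over the simplex is attained at a vertex, i.e., at a distribution putting all its mass on one state, and checking the S vertices is enough. The names `weighted_l1`, `max_over_simplex` and `simple_upper_bound` are hypothetical.

```python
import numpy as np

def weighted_l1(w, t_hat, t):
    """Assumed objective: sum over s' of w(s') * |t_hat(s') - t(s')|."""
    return float(np.sum(w * np.abs(t_hat - t)))

def max_over_simplex(w, t_hat):
    """Maximize the objective over probability vectors t by checking the S vertices.

    At the vertex with all mass on state j, the objective equals
    sum_s' w(s') * t_hat(s') + w(j) * (1 - 2 * t_hat(j)).
    """
    vertex_values = w @ t_hat + w * (1.0 - 2.0 * t_hat)
    j = int(np.argmax(vertex_values))
    t = np.zeros_like(t_hat)
    t[j] = 1.0
    return float(vertex_values[j]), t

def simple_upper_bound(w, t_hat):
    """Cheaper bound: exact when a state with maximal weight has zero estimated probability."""
    return float(w @ t_hat + w.max())

# The highest-weight state (index 0) is unreachable under t_hat: both computations agree.
w = np.array([1.0, 0.5, 0.2])
t_hat = np.array([0.0, 0.7, 0.3])
best, t_star = max_over_simplex(w, t_hat)
print(best, t_star, simple_upper_bound(w, t_hat))

# The highest-weight state is reachable: the simple computation is only an upper bound.
t_hat_2 = np.array([0.6, 0.1, 0.3])
best_2, _ = max_over_simplex(w, t_hat_2)
print(best_2, simple_upper_bound(w, t_hat_2))
```

The second function mirrors the simpler computation mentioned in the text: it coincides with the exact maximum when some highest-weight state is unreachable from (s, a) under the learned model, and otherwise over-estimates it.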

14 Proof of Proposition 4

Lemma 4. Given two tasks the set of state-action pairs for which (R, T) is known with accuracy with probability at least , this equation on is a fixed-point equation admitting a unique solution.

We refer to this unique solution as

Proof of Lemma 4. Let L be the functional operator that maps any function