b

DiscoverSearch
About
My stuff
Lipschitz Lifelong Reinforcement Learning
2020·arXiv
Abstract
Abstract

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show the method to experience no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.

Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience to speed up the resolution of new tasks is a key question in that setting (Lazaric 2012; Taylor and Stone 2009). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task could be able to perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer, practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal Q-value function of an unknown target task, given the learned optimal Q-value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (Brafman and Tennenholtz 2002), relying on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists of computing online the distance between source and target tasks, deducing the upper bound on the optimal Q value function of the source task and using this bound to accelerate learning. Importantly, the method exhibits no negative transfer, i.e., it cannot cause performance degradation, as the computed upper bound provably does not underestimate the optimal Q-value function.

Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method applicable online and safe. In Section 4.3, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting. We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).

Reinforcement Learning (RL) (Sutton and Barto 2018) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (Puterman 2014) consisting of a 4-tuple  ⟨S, A, R, T⟩, whereS is a state space, A an action space,  Rasis the expected reward of taking action  a in state s and T ass′is the transition probability of reaching state  s′ when taking action a in state s. Without loss of generality, we assume  Ras ∈ [0, 1]. Given a discount factor  γ ∈ [0, 1), the expected cumulative return �t γtRatst obtained along a trajectory starting with state s and action a using policy  πin MDP M is denoted by  QπM(s, a)and called the Q-function. The optimal Q-function  Q∗M is thehighest attainable expected return from s, a and  V ∗M(s) =maxa∈A Q∗M(s, a)is the optimal value function in s. Notice that  Ras ≤ 1implies  Q∗M(s, a) ≤ 11−γfor all  s, a ∈ S × A. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.

Lifelong RL (Silver, Yang, and Li 2013; Brunskill and Li 2014) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs. In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (Wilson et al. 2007; Brunskill and Li 2014; Abel et al. 2018), we restrict the scope of the study to the case where sampled MDPs share the same state-action space S × A. For brevity, we will refer indifferently to MDPs, models or tasks, and write them  M = ⟨R, T⟩.

Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. Song et al. (2016) define a metric based on the bi-simulation metric introduced by Ferns, Panangaden, and Pre- cup (2004) and the Wasserstein metric (Villani 2008). value transfer is performed between states with low bi-simulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose does allow negative transfer (see Appendix, Section 7). Carroll and Seppi (2005) also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. Mahmud et al. (2013) and Brunskill and Li (2013) propose MDP clustering methods; respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP and the  L1norm between the MDP models. An advantage of clustering is to prune the set of possible source tasks. They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. Ammar et al. (2014) learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. Lazaric, Restelli, and Bonarini (2008) provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models to be identical. Their approach is applicable in a batch RL setting as opposed to the online setting considered in this paper. The approach developed by Sorg and Singh (2009) is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. Brun- skill and Li (2013) also provide a method that can be used for Q-function bounding in multi-task RL.

Abel et al. (2018) present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (Strehl, Li, and Littman 2009). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task’s optimal Qfunction. This approach results in a method for non-negative transfer with high probability once enough tasks have been sampled. The method developed by Abel et al. (2018) is similar to ours in two fundamental points: first, a theoretical upper bounds on optimal Q-values across the MDP space is built; secondly, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum 11−γ

Q∗M(s, a)

image

Figure 1: The optimal Q-value function represented for a particular s, a pair across the MDP space. The RMax, MaxQInit and LRMax bounds are represented for three sampled MDPs.

bound in an RMax-like algorithm, providing PAC guarantees. The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by Abel et al. (2018), and the LRMax bound is the one we present in this paper. On this figure, the essence of the LRMax bound is noticeable. It stems from the fact that the optimal Q value function is locally Lipschitz continuous in the MDP space w.r.t. a specific pseudometric. Confirming the intuition, close MDPs w.r.t. this metric have close optimal Q values. It should be noticed that no bound is uniformly better than the other as intuited by Figure 1. Hence, combining all the bounds results in a tighter upper bound as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.

The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-function over the MDP space M. The remainder of this section mathematically formalizes this intuition that will be used in the next section to derive a practical method for value transfer. To that end, we introduce a local pseudometric characterizing the distance between the models of two MDPs at a particular state-action pair. A reminder and a detailed discussion on the metrics used herein can be found in the Appendix, Section 8. Definition 1. Given two tasks  M = ⟨R, T⟩, ¯M = ⟨ ¯R, ¯T⟩, and a function  f : S → R+, we define the pseudometric between models at  (s, a) ∈ S × A w.r.t. f as:

image

This pseudometric is relative to a positive function f. We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. For the sake of clarity in the remainder of this study, we introduce

image

corresponding to the pseudometric between models with the particular choice of  f = γV ∗¯M. From this definition stems the following pseudo-Lipschitz continuity result.

Proposition 1 (Local pseudo-Lipschitz continuity). For two MDPs  M, ¯M, for all (s, a) ∈ S × A,

image

with the local MDP pseudometric ∆sa(M, ¯M) ≜min�dsa(M∥ ¯M), dsa( ¯M∥M)�, and thelocal MDP dissimilarity  dsa(M∥ ¯M)is the unique solution to the following fixed-point equation for  dsa:

image

This result establishes that the distance between the optimal Q-functions of two MDPs at  (s, a) ∈ S × Ais controlled by a local dissimilarity between the MDPs. The latter follows a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (Bellman 1957). Note that, although the local MDP dissimilarity  dsa(M∥ ¯M)is asymmetric,  ∆sa(M, ¯M) isa pseudometric, hence the name pseudo-Lipschitz continuity. Notice that the policies in Equation 2 are the optimal ones for the two MDPs and thus are different. Proposition 1 is a mathematical result stemming from Definition 1 and should be distinguished from other frameworks of the literature that assume the continuity of the reward and transition models w.r.t.  S × A(Rachelson and Lagoudakis 2010; Pirotta, Restelli, and Bascetta 2015; Asadi, Misra, and Littman 2018).This result establishes that the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Hence, given Q∗¯M, the function

image

can be used as an upper bound on  Q∗M with Man unknown MDP. This is the idea on which we construct a computable and transferable upper bound in Section 4. In Figure 1, the upper bound of Equation 4 is represented by the LRMax bound. Noticeably, we provide a global pseudo-Lipschitz continuity property, along with similar results for the optimal value function  V ∗M and the value function of a fixed policy. As these results do not directly serve the purpose of this article, we report them in the Appendix, Section 10.

A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions:

C1. the resulting algorithm is PAC-MDP;

C2. the transfer accelerates learning;

C3. the transfer is non-negative.

To achieve these conditions, we first present a transferable upper bound on  Q∗M in Section 4.1. This upper bound stems from the Lipschitz continuity result of Proposition 1. Then, we propose a practical way to compute this upper bound in Section 4.2. Precisely, we propose a surrogate bound that can be calculated online in the Lifelong RL setting, without having explored the source and target tasks completely. Finally, we implement the method in an algorithm described in Section 4.3, and demonstrate formally that it meets conditions C1, C2 and C3. Improvements are discussed in Section 4.4.

4.1 A Transferable Upper Bound on QM

From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.

Definition 2. Given two tasks M and ¯M, for all  (s, a) ∈S × A, theLipschitz upper bound on  Q∗M induced by Q∗¯M isdefined as  U ¯M(s, a) ≥ Q∗M(s, a) with:

image

The optimism in the face of uncertainty principle leads to considering that the long-term expected return from any state is the 11−γmaximum return, unless proven otherwise. Particularly, the RMax algorithm (Brafman and Tennenholtz 2002), explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PACMDP guarantees (Strehl, Li, and Littman 2009), meaning that convergence to a near-optimal policy is guaranteed in a polynomial number of missteps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound U on the optimal Q-function, then acts greedily w.r.t. U. By default, it takes the maximum value U(s, a) = 11−γ, but any tighter upper bound is admissible. Thus, shrinking U with Equation 5 is expected to improve the learning speed or sample complexity for new tasks in Lifelong RL.

In RMax, during the resolution of a task  M, S × Ais split into a subset of known state-action pairs K and its complement  Kcof unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an  ϵ-accurate model in  L1-norm with probability higher than  1 − δ. We refer to  ϵand  δas the RMax precision parameters. This results in a threshold  nknownon the number of visits n(s, a) to a pair s, a that are necessary to reach this precision. Given the experience of a set of m MDPs ¯M = { ¯M1, . . . , ¯Mm}, we define the total bound as the minimum over all the induced Lipschitz bounds. Proposition 2. Given a partially known task  M = ⟨R, T⟩, the set of known state-action pairs K, and the set of Lipschitz bounds on  Q∗M induced by previous tasks�U ¯M1, . . . , U ¯Mm�,the function Q defined below is an upper bound on  Q∗Mfor all  s, a ∈ S × A.

image

with U(s, a) = min� 11−γ , U ¯M1(s, a), . . . , U  ¯Mm(s, a)�.

Commonly in RMax, Equation 6 is solved to a precision ϵQvia Value Iteration. This yields a function Q that is a valid heuristic (bound on  Q∗M) for the exploration of MDP M.

4.2 A Computable Upper Bound on QM

The key issue addressed in this section is how to actually compute U(s, a), particularly when both source and target tasks are partially explored. Consider two tasks  M and ¯M, onwhich vanilla RMax has been applied, yielding the respective sets of known state-action pairs K and ¯K, along with the learned models ˆM = ⟨ ˆT, ˆR⟩ and ˆ¯M = ⟨ ˆ¯T, ˆ¯R⟩, and the upper bounds  Q and ¯Qrespectively on  Q∗M and Q∗¯M. Notice that, if K = ∅, then  Q(s, a) = 11−γfor all s, a pairs. Conversely, if Kc = ∅, Q is an ϵ-accurate estimate of  Q∗M in L1-norm withhigh probability. Equation 6 allows the transfer of knowledge from ¯Mto M if  U ¯M(s, a)can be computed. Unfortunately, the true model and optimal value functions, necessary to compute  U ¯M, are partially known (see Equation 5). Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound ˆDsa(M∥ ¯M) on Dsa(M∥ ¯M)(Definition 1).

Proposition 3. Given two tasks M, ¯Mand respectively K, ¯Kthe subsets of  S × Awhere their models are known with accuracy  ϵ in L1-norm with probability at least  1 − δ,

image

with ˆDsa(M∥ ¯M)the upper bound on the pseudometric between models defined below for  B = ϵ�1 + γ maxs′ ¯V (s′)�.

image

Importantly, this upper bound ˆDsa(M∥ ¯M)can be calculated analytically (see Appendix, Section 13). This makes ˆDsa(M∥ ¯M)usable in the online Lifelong RL setting, where already explored tasks may be partially learned, and little knowledge has been gathered on the current task. The magnitude of the B term is controlled by  ϵ. In the case where no information is available on the maximum value of ¯V , wehave that  B = ϵ1−γ.  ϵmeasures the accuracy with which the tasks are known: the smaller  ϵ, the tighter the B bound. Note that ¯Vis used as an upper bound on the true  V ∗¯M. In many cases,  maxs′ V ∗¯M(s′) ≤ 11−γ; e.g. for stochastic short- est path problems, which feature rewards only upon reaching terminal states, we have that  maxs′ V ∗¯M(s′) = 1and thus B = (1 + γ)ϵis a tighter bound for transfer. Combining ˆDsa(M∥ ¯M)and Equation 3, one can derive an upper bound ˆdsa(M∥ ¯M) on dsa(M∥ ¯M), detailed in Proposition 4.

Proposition 4. Given two tasks  M, ¯M ∈ M, Kthe set of state-action pairs where (R, T) is known with accuracy  ϵ inL1-norm with probability at least  1 − δ. If γ(1 + ϵ) < 1, thesolution ˆdsa(M∥ ¯M)of the following fixed-point equation on ˆdsa (for all s, a ∈ S × A) is an upper bound on  dsa(M∥ ¯M)with probability at least  1 − δ:

image

image

illustrates the fact that for a large return horizon (large  γ), a high accuracy (small  ϵ) is needed for the bound to be computable. Eventually, a computable upper bound on  Q∗M given¯M with high probability is given by

image

The associated upper bound on U(s, a) (Equation 6) given the set of previous tasks ¯M = { ¯Mi}mi=1 is defined by

image

This upper bound can be used to transfer knowledge from a partially solved source task to a target task. If ˆU(s, a) ≤ 11−γon a subset of  S × A, then the convergence rate can be improved. As complete knowledge of both tasks is not needed to compute the upper bound, it can be applied online in the Lifelong RL setting. In the next section, we explicit an algorithm that leverages this value-transfer method.

4.3 Lipschitz RMax Algorithm

In Lifelong RL, MDPs are encountered sequentially. Applying RMax to task M yields the set of known state-action pairs K, the learned models ˆTand ˆR, and the upper bound Q on  Q∗M. Saving this information when the task changes allows computing the upper bound of Equation 10 for the new target task, and using it to shrink the optimistic heuristic of RMax. This computation effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of ˆDsa(M∥ ¯M), ˆdsa(M∥ ¯M)and eventually ˆU, allowing for more efficient transfer where the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm 1. To avoid ambiguities with ¯M, we use ˆMto store learned features ( ˆT, ˆR, K, Q) about previous MDPs. In a nutshell, the behavior of LRMax is precisely that of RMax, but with a tighter admissible heuristic ˆUthat becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1) as stated in Propositions 5 and 6 below. With S = |S| and A = |A|, the sample complexity of vanilla RMax is ˜O(S2A/(ϵ3(1 − γ)3)), which is improved by LRMax in Proposition 5 and meets Condition C2. Finally, ˆU isa provable upper bound with high probability on  Q∗M, whichavoids negative transfer and meets Condition C3.

Proposition 5 (Sample complexity (Strehl, Li, and Littman 2009)). With probability  1 − δ, the greedy policy w.r.t. Q computed by LRMax achieves an  ϵ-optimal return in MDP

image

samples (when logarithmic factors are ignored), with ˆUde-fined in Equation 10 a non-static, decreasing quantity, upper bounded by 11−γ .

image

Proposition 5 shows that the sample complexity of LRMax is no worse than that of RMax. Consequently, in the worst case, LRMax performs as badly as learning from scratch, which is to say that the transfer method is not negative as it cannot degrade the performance.

Proposition 6 (Computational complexity). The total computational complexity of LRMax (Algorithm 1) is

image

with  τthe number of interaction steps,  ϵQthe precision of value iteration and N the number of source tasks.

4.4 Refining the LRMax Bounds

LRMax relies on bounds on the local MDP dissimilarity (Equation 8). The quality of the Lipschitz bound on  Q∗M canbe improved according to the quality of those estimates. We discuss two methods to provide finer estimates.

Refining with prior knowledge. First, from the definition of  Dsa(M∥ ¯M), it is easy to show that this pseudometric between models is always upper bounded by 1+γ1−γ . However, in practice, the tasks experienced in a Lifelong RL experiment might not cover the full span of possible MDPs M and may systematically be closer to each other than 1+γ1−γ . For instance, the distance between two games in the Arcade Learning Environment (ALE) (Bellemare et al. 2013), is smaller than

image

Figure 2: Illustration of the prior knowledge on the maximum pseudo-distance between models for a particular s, a pair.

the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section 17). Let us note ˜M ⊂ Mthe set of possible MDPs for a particular Lifelong RL experiment. Let  Dmax(s, a) ≜ maxM, ¯M∈ ˜M2�Dsa(M∥ ¯M)�be the max-imum model pseudo-distance at a particular s, a pair on the subset ˜M. Prior knowledge might indicate a smaller upper bound for  Dmax(s, a) than 1+γ1−γ . We will note such an upper bound  Dmax, considered valid for all s, a pairs, i.e., such that Dmax ≥ maxs,a,M, ¯M∈S×A× ˜M2�Dsa(M∥ ¯M)�. In a Lifelong RL experiment,  Dmaxcan be seen as a rough estimate of the maximum model discrepancy an agent may encounter. Figure 2 illustrates the relative importance of  Dmax vs. 1+γ1−γ .Solving Equation 8 boils down to accumulating ˆDsa(M∥ ¯M)values in ˆdsa(M∥ ¯M). Hence, reducing a ˆDsa(M∥ ¯M)estimate in a single s, a pair actually reduces ˆdsa(M∥ ¯M)in all s, a pairs. Thus, replacing ˆDsa(M∥ ¯M)in Equation 8 by min{Dmax, ˆDsa(M∥ ¯M)}, provides a smaller upper bound ˆdsa(M∥ ¯M)on  dsa(M∥ ¯M), and thus a smaller ˆUwhich allows transfer if it is less than 11−γ . Consequently, the knowl- edge of such a bound  Dmaxcan make a difference between successful and unsuccessful transfer, even if its value is of little importance. Conversely, setting a value for  Dmax quan-tifies the distance between MDPs where transfer is efficient.

Refining by learning the maximum distance. The value of  Dmax(s, a)can be estimated online for each s, a pair, discarding the hypothesis of available prior knowledge. We propose to use an empirical estimate of the maximum model distance at s, a: ˆDmax(s, a) ≜ maxM, ¯M∈ ˆM2{ ˆDsa(M∥ ¯M)}, with ˆMthe set of explored tasks. The pitfall of this approach is that, with few explored tasks, ˆDmax(s, a)could underestimate  Dmax(s, a). Proposition 7 provides a lower bound on the probability that ˆDmax(s, a) + ϵdoes not underestimate Dmax(s, a), depending on the number of sampled tasks.

Proposition 7. Consider an algorithm producing  ϵ-accuratemodel estimates ˆDsa(M∥ ¯M)for a subset K of  S × Aafter interacting with any two MDPs  M, ¯M ∈ M. Assume ˆDsa(M∥ ¯M) ≥ Dsa(M∥ ¯M)for any  s, a /∈ K. For all s, a ∈ S × A, δ ∈ (0, 1], after sampling m tasks, if m is large enough to verify  2(1 − pmin)m − (1 − 2pmin)m ≤ δ,

image

This result indicates when ˆDmax(s, a) + ϵupper bounds  Dmax(s, a)with high probability. In such a case, ˆDsa(M∥ ¯M)of Equation 8 can be replaced by min{ ˆDmax(s, a) + ϵ, ˆDsa(M∥ ¯M)}to tighten the bound on dsa(M∥ ¯M). Assuming a lower bound  pminon the sampling probability of a task implies that M is finite and is seen as a non-adversarial task sampling rule (Abel et al. 2018).

The experiments reported here1 illustrate how the Lipschitz bound (Equation 9) provides a tighter upper bound on  Q∗, improving the sample complexity of LRMax compared to RMax, and making the transfer of inter-task knowledge effective. Graphs are displayed with 95% confidence intervals. For information in line with the Machine Learning Reproducibility Check-list (Pineau 2019) see the Appendix, Section 22.

We evaluate different variants of LRMax in a Lifelong RL experiment. The RMax algorithm will be used as a no-transfer baseline. LRMax(x) denotes Algorithm 1 with prior Dmax = x. MaxQInit denotes the MAXQINIT algorithm from Abel et al. (2018), consisting in a state-of-the art PACMDP algorithm. Both LRMax and MaxQInit algorithms achieve value transfer by providing a tighter upper bound on  Q∗than 11−γ. Computing both upper bounds and taking the minimum results in combining the two approaches. We include such a combination in our study with the LRMaxQInit algorithm. Similarly, the latter algorithm benefiting from prior knowledge  Dmax = xis denoted by LRMaxQInit(x). For the sake of comparison, we only compare algorithms with the same features, namely, tabular, online, PAC-MDP methods, presenting non-negative transfer.

The environment used in all experiments is a variant of the “tight” task used by Abel et al. (2018). It is an  11 × 11grid-world, the initial state is in the centre, actions are the cardinal moves (Appendix, Section 18). The reward is always zero except for the three goal cells in the upper-right corner. Each sampled task has its own reward values, drawn from [0.8, 1] for each of the three goal cells and its own probability of slipping (performing a different action than the one selected), picked in [0, 0.1]. Hence, tasks have different reward and transition functions. Notice the distinction in applicability between MaxQInit, that requires the set of MDPs to be finite, and LRMax, that can be used with any set of MDPs. For the comparison between both to be possible, we drew tasks from a finite set of 5 MDPs. We sample 15 tasks sequentially among this set, each run for 2000 episodes of length 10. The operation is repeated 10 times to narrow the confidence intervals. We set  nknown = 10, δ = 0.05, and  ϵ = 0.01(discussion in Appendix, Section 21). Other Lifelong RL experiments are reported in Appendix, Section 19.

The results are reported in Figure 3. Figure 3a displays the discounted return for each task, averaged across episodes. Similarly, Figure 3b displays the discounted return for each episode, averaged across tasks (same color code as Figure 3a). Figure 3c displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relative to an estimator of the optimal expected return for each task. For readability, Figures 3b and 3c display a moving average over 100 episodes. Figure 3d reports the benefits of various values of  Dmaxon the algorithmic properties.

In Figure 3a, we first observe that LRMax benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. In contrast, MaxQInit requires to wait for task 12 before benefiting from transfer. As suggested in Section 4.4, increasing amounts of prior knowledge allow the LRMax transfer method to be more efficient: a smaller known upper bound  Dmaxon ˆDsa(M∥ ¯M)accelerates convergence. Combining both approaches in the LRMaxQInit algorithm outperforms all other methods. Episode-wise, we observe in Figure 3b that the LRMax transfer method allows for faster convergence, i.e., lower sample complexity. Interestingly, LRMax exhibits three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This behavior is a consequence of the combined facts that the Lipschitz bound (Equation 9) is larger on promising regions of  S × Aseen on previous tasks and the fact that LRMax acts greedily w.r.t. that bound. 2) This high performance regime is followed by the exploration of unknown regions of  S × A, in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low payoff regions. Such regions are then identified and never revisited. 3) Eventually, LRMax stops exploring and converges to the optimal policy. Importantly, in all experiments, LRMax never experiences negative transfer, as supported by the provability of the Lipschitz upper bound with high probability. LRMax is at least as efficient as the no-transfer RMax baseline.

Figure 3c displays the collected returns of RMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from transfer as early as Task 2, where the previous 3-stage behavior is visible. MaxQInit takes until task 12 to leverage the transfer method. However, the bound it provides is tight enough that it does not have to explore.

In Figure 3d, we display the following quantities for various values of  Dmax: ρLip, the fraction of the time the Lipschitz bound was tighter than the RMax bound 11−γ ; ρSpeed-up,is the relative gain of time steps before convergence when comparing LRMax to RMax. This quantity is estimated based on the last updates of the empirical model ¯M; ρReturn, is the relative total return gain on 2000 episodes of LRMax w.r.t. RMax. First, we observe an increase of  ρLipas  Dmaxbecomes tighter. This means that the Lipschitz bound of Equation 9 becomes effectively smaller than 11−γ. This phe- nomenon leads to faster convergence, indicated by  ρSpeed-up.Eventually, this increased convergence rate allows for a net total return gain, as can be seen with the increase of  ρReturn.

Overall, in this analysis, we have showed that LRMax ben-efits from an enhanced sample complexity thanks to the value-transfer method. The knowledge of a prior  Dmaxincreases

image

Figure 3: Experimental results. LRMax benefits from an enhanced sample complexity thanks to the value-transfer method.

this benefit. The method is comparable to the MaxQInit method and has some advantages such as the early fitness for use and the applicability to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. Additionally, we show in Appendix, Section 20 that, when provided with any prior  Dmax,LRMax increasingly stops using it during exploration, con-firming the claim of Section 4.4 that providing  Dmax enablestransfer even if its value is of little importance.

We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space w.r.t. a new metric. We proved a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks and exhibits no negative transfer. Improvements of the algorithm were discussed in the form of prior knowledge on the maximum distance between models and online estimation of this distance. As a non-negative, similarity-based, PAC-MDP transfer method, the LRMax algorithm is the first method of the literature combining those three appealing features. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning while not experiencing any negative transfer. Notably, our approach can directly extend other PAC-MDP algorithms (Szita and Szepesv´ari 2010; Rao and Whiteson 2012; Pazis, Parr, and How 2016; Dann, Lattimore, and Brun- skill 2017) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis to non-negative value transfer via MDP similarity, a study that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme achieving non-negative transfer with PAC guarantees. Further, designing scalable methods conveying the same intuition could be a promising research direction.

image

We would like to thank Dennis Wilson for fruitful discussions and comments on the paper. This research was supported by the Occitanie region, France; ISAE-SUPAERO (Institut Sup´erieur de l’A´eronautique et de l’Espace); fondation ISAESUPAERO; ´Ecole Doctorale Syst`emes; and ONERA, the French Aerospace Lab.

Abel, D.; Jinnai, Y.; Guo, S. Y.; Konidaris, G.; and Littman, M. L. 2018. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 20–29.

Ammar, H. B.; Eaton, E.; Taylor, M. E.; Mocanu, D. C.; Driessens, K.; Weiss, G.; and Tuyls, K. 2014. An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning. In Workshops at the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).

Asadi, K.; Misra, D.; and Littman, M. L. 2018. Lipschitz Continuity in Model-Based Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning (ICML 2018) .

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.

Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.

Brafman, R. I.; and Tennenholtz, M. 2002. R-max - a Gen- eral Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3(Oct): 213–231.

Brunskill, E.; and Li, L. 2013. Sample Complexity of Multi- task Reinforcement Learning. In Proceedings of the 29th conference on Uncertainty in Artificial Intelligence (UAI 2013).

Brunskill, E.; and Li, L. 2014. PAC-inspired Option Dis- covery in Lifelong Reinforcement Learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 316–324.

Carroll, J. L.; and Seppi, K. 2005. Task Similarity Measures for Transfer in Reinforcement Learning Task Libraries. In Proceedings of the 5th International Joint Conference on Neural Networks (IJCNN 2005), volume 2, 803–808. IEEE.

Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5713–5723.

Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.

Lazaric, A. 2012. Transfer in Reinforcement Learning: a Framework and a Survey. In Reinforcement Learning, 143– 173. Springer.

Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Transfer of Samples in Batch Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 544–551.

Mahmud, M. M.; Hawasly, M.; Rosman, B.; and Ramamoor- thy, S. 2013. Clustering Markov Decision Processes for Continual Transfer. Computing Research Repository (arXiv/CoRR) URL https://arxiv.org/abs/1311.3959.

Neyman, J. 1937. X—outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences 236(767): 333– 380.

Pazis, J.; Parr, R. E.; and How, J. P. 2016. Improving PAC Exploration using the Median of Means. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 3898–3906.

Pineau, J. 2019. Machine Learning Reproducibility Checklist. https://www.cs.mcgill.ca/∼jpineau/ ReproducibilityChecklist.pdf. Version 1.2, March 27, 2019, last accessed on August 27, 2020.

Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy gra- dient in Lipschitz Markov Decision Processes. Machine Learning 100(2-3): 255–283.

Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rachelson, E.; and Lagoudakis, M. G. 2010. On the Locality of Action Domination in Sequential Decision Making. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010).

Rao, K.; and Whiteson, S. 2012. V-MAX: Tempered Opti- mism for Better PAC Reinforcement Learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), 375–382.

Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, 05.

Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.

Sorg, J.; and Singh, S. 2009. Transfer via Soft Homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), 741–748. International Foundation for Autonomous Agents and Multiagent Systems.

Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press, Cambridge.

Szita, I.; and Szepesv´ari, C. 2010. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 1031–1038.

Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Watkins, C. J. C. H.; and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4): 279–292.

Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi- Task Reinforcement Learning: A Hierarchical Bayesian Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 1015–1022.

image

Erwan Lecarpentier1, 2, David Abel3, Kavosh Asadi3, 4, Yuu Jinnai3, Emmanuel Rachelson1, Michael L. Littman3

image

image

where  c ∈ Ris a positive constant and  W1is the 1-Wasserstein metric. For each state of the target model, the closest counterpart state (with the smallest bi-simulation distance) of the source MDP is identified and its learned Q-values are used to initialize the Q-function of the target MDP. In their experiments, Song et al. (2016) run a standard Q-Learning algorithm (Watkins and Dayan 1992) with an  ϵ-greedy exploration strategy thereafter.

image

Figure 4: The T-shaped MDP transfer task.

Let us now consider applying this method to a similar task to the T-shaped MDP transfer task proposed by Taylor and Stone (2009). The source and target tasks are respectively described on the left and right sides of Figure 4. In each task, the states are represented by the circles and the arrows between them correspond to the available actions that allow to move from one state to the other. The initial state of both tasks is the left state I for the source task and I′ for the target task. Between the states I and A in the source task (respectively I′ to A′ in the target task) are k states, k being a parameter increasing the distance to travel from I to A (respectively I′ to A′). The tasks are deterministic and the reward is zero everywhere except for the state B in the source task and C′ in the target task where a reward of +1 is received. Consequently, the optimal policy in the source task is to go to the state A and then to the state B. In the target task, the same applies except that a transition to state C should be applied in place of state B′ when the agent is in state A′.

Regardless of the parameters used in the bi-simulation metric of Equation 11, the direct state transfer method from Song et al. (2016) maps the following states together as they share the exact same model:

image

Hence, during learning, the Q-function of the target task is initialized with the values of the Q-function of the source task. Therefore, the behavior derived with the Q-Learning algorithm is the optimal policy of the source task, but in the target task. Depending on the value of the learning rate of the algorithm, the time to favor action DOWN in state A′ instead of action UP can be arbitrarily long. Also, depending on the value of  ϵ, the exploration of state C′due to the  ϵ-greedy strategy can be arbitrarily unlikely. Finally, the time needed for one of those two events to occur increases proportionally to the value of k, which can be arbitrarily large.

This case illustrates the difficulty facing any transfer method in the general context of Lifelong RL. The method proposed by Song et al. (2016) can be highly efficient in some cases as they show in experiments, but the lack of theoretical guarantees makes negative transfer possible. Generally, using a similarity measure such that the bi-simulation metric helps to discourage using some source tasks when the computed similarity is too low. However, as we saw in the T-shaped MDP example, this rule is not absolute and the choice of the metric is important. The approach we develop in this paper aims at avoiding negative transfer by providing a conservative transferred knowledge that is simply of no use when the similarity between source and target tasks is too low. This is intuitive as we do not expect any task to provide transferable knowledge to any other task.

A metric on a set X is a function  m : X × X → Rwhich has the following properties for any  x, y, z ∈ X:

P1.  m(x, y) ≥ 0(positivity),

P2.  m(x, y) = 0 ⇔ x = y(positive definiteness),

P3. m(x, y) = m(y, x) (symmetry),

P4.  m(x, z) ≤ m(x, y) + m(y, z)(triangle inequality).

If property P2 is not verified by m, but instead we have for any  x ∈ X that m(x, x) = 0, then mis called a pseudo-metric. If m only verifies P1, P2 and P4 then m is called a quasi-metric. If m only verifies P1 and P2 and if X is a set of probability measures, then m is called a divergence.

From this, the pseudo-metric between models of Definition 1 is indeed a pseudo-metric as it is relative to a positive function f that could be equal to zero and break property P2.

The local MDP dissimilarity between MDPs  dsa(M∥ ¯M)of Proposition 1 does not respect properties P2 and P3, hence the name dissimilarity. The  ∆sa(M, ¯M) = min�dsa(M∥ ¯M), dsa( ¯M∥M)�quantity, however, regains property P3 and is hence a pseudo-metric. A noticeable consequence is that Proposition 1 is “in the spirit” of a Lipschitz continuity result but cannot be called as such, hence the name pseudo-Lipschitz continuity.

The same goes for the global dissimilarity  d(M∥ ¯M) = 11−γ maxs,a∈S×A�Dsa(M∥ ¯M)�. However, using  min�d ¯MM, dM¯M�allows to regain property 3 and makes this quantity a pseudo-metric again between MDPs.

Notation 1. Given two sets X and Y , we note F (X, Y ) the set of functions defined on the domain X with codomain Y .

Lemma 1. Given two MDPs  M, ¯M ∈ M, the following equation on  d ∈ F (S × A, R)is a fixed-point equation admitting a unique solution for any  (s, a) ∈ S × A:

image

We refer to this unique solution as  dsa(M∥ ¯M).

Proof of Lemma 1. The proof follows closely that in (Puterman 2014) that proves that the Bellman operator over value functions is a contraction mapping. Let L be the functional operator that maps any function  d ∈ F (S × A, R) to

image

Then for f and g, two functions from  S × A to R and (s, a) ∈ S × A, we have that

image

Since this is true for any pair  (s, a) ∈ S × A, we have that

Since  γ < 1, Lis a contraction mapping in the metric space  (F (S × A, R) , ∥·∥∞). This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 1. The proof is by induction. The value iteration sequence of iterates  (QnM)n∈N of the optimal Q-function of any MDP  M ∈ Mis defined for all  (s, a) ∈ S × A by:

image

Consider two MDPs  M, ¯M ∈ M. It is obvious that��Q0M(s, a) − Q0¯M(s, a)�� ≤ dsa(M∥ ¯M) for all (s, a) ∈ S × A. Suppose the property��QnM(s, a) − Qn¯M(s, a)�� ≤ dsa(M∥ ¯M)true at rank  n ∈ N for all (s, a) ∈ S × A. Consider now the rank n + 1 and a pair  (s, a) ∈ S × A:

image

where we used Lemma 1 in the last inequality. Since  Q∗Mand  Q∗¯Mare respectively the limits of the sequences  (QnM)n∈Nand �Qn¯M�n∈N, it results from passage to the limit that ��Q∗M(s, a) − Q∗¯M(s, a)�� ≤ dsa(M∥ ¯M) .

By symmetry, we also have��Q∗M(s, a) − Q∗¯M(s, a)�� ≤ dsa(M∥ ¯M)and we can take the minimum of the two valid upper bounds, yielding:

image

Similar results to Proposition 1 can be derived. First, an important consequence is the global pseudo-Lipschitz continuity result

image

From a pure transfer perspective, Equation 12 is interesting since the right hand side does not depend on s, a. Hence, the counterpart of the upper bound of Equation 4, namely,

image

is easier to compute. Indeed,  ∆(M, ¯M)can be computed once and for all, contrarily to  ∆sa(M, ¯M)that needs to be evaluated for all s, a pair. However, we do not use this result for transfer because it is impractical to compute online. Indeed, estimating the maximum in the definition of  d(M∥ ¯M)can be as hard as solving both MDPs, which, when it happens, is too late for transfer to be useful.

Proof of Proposition 8. The proof is by induction. We consider the sequence of value iteration iterates defined for any MDP M ∈ M for (s, a) ∈ S × A by

image

and, by symmetry, the result holds as well for  d( ¯M∥M). Suppose that it is true at rank  n ∈ N. Consider rank n + 1 and

(s, a) ∈ S × A, we have that:

By symmetry, the results holds as well for  d( ¯M∥M)which concludes the proof by induction.

The second result is for the value function and is stated below.

Proposition 9 (Local pseudo-Lipschitz continuity of the optimal value function). For any two MDPs  M, ¯M ∈ M, for all s ∈ S,��V ∗M(s) − V ∗¯M(s)�� ≤ maxa∈A ∆sa(M, ¯M)

where the local MDP pseudo-metric  ∆sa(M, ¯M)has the same definition as in Proposition 1.

Proof of Proposition 9. The proof follows exactly the same steps as the proof of Proposition 1, i.e., by first constructing the value iteration sequence of iterates of the optimal value function, showing the result by induction for rank  n ∈ Nand then concluding with a passage to the limit.

Another result can be derived for the value of any policy  π. For the sake of generality, we state the result for any stochastic policy mapping states to distributions over actions. Note that a deterministic policy is a stochastic policy mapping states to Dirac distributions over actions. First, we state the result for the value function in Proposition 10 and then for the Q function in Proposition 11.

Proposition 10 (Local pseudo-Lipschitz continuity of the value function of any policy). For any two MDPs  M, ¯M ∈ M, for any stochastic stationary policy  π, for all s ∈ S,��V πM(s) − V π¯M(s)�� ≤ ∆πs (M, ¯M)

where  ∆πs (M, ¯M) ≜ min�dπs (M∥ ¯M), dπs ( ¯M∥M)�and  dπs (M∥ ¯M)is defined as the fixed-point of the following fixed-point equation on  d ∈ F (S, R):

image

Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.

Lemma 2. Given two MDPs  M, ¯M ∈ M, any stochastic stationary policy  π, the following equation on  d ∈ F (S, R)is a fixed-point equation admitting a unique solution for any  s ∈ S:

image

We refer to this unique solution as  dπs (M∥ ¯M).

Proof of Lemma 2. Let L be the functional operator that maps any function  d ∈ F (S, R) to

image

Then for f and g, two functions from S to R, we have that

Hence we have that  ∥Lf − Lg∥∞ ≤ γ ∥f − g∥∞. Since γ < 1, Lis a contraction mapping in the metric space  (F (S, R) , ∥·∥∞).This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 10. Consider a stochastic stationary stationary policy  π. The value iteration sequence of iterates  (V π,nM )n∈Nof the value function of any MDP  M ∈ Mis defined for all  s ∈ S by:

image

Consider two MDPs  M, ¯M ∈ M. It is obvious that���V π,0M (s) − V π,0¯M (s)��� ≤ dπs (M∥ ¯M)for all  s ∈ S. Suppose the property ��V π,nM (s) − V π,n¯M (s)�� ≤ dπs (M∥ ¯M)true at rank  n ∈ N for all s ∈ S. Consider now the rank n + 1 and the state  s ∈ S:

image

where we used Lemma 2 in the last inequality. Since  V πMand  V π¯Mare respectively the limits of the sequences  (V π,nM )n∈Nand �V π,n¯M �n∈N, it results from passage to the limit that ��V πM(s) − V π¯M(s)�� ≤ dπs (M∥ ¯M) .

By symmetry, we also have��V πM(s) − V π¯M(s)�� ≤ dπs ( ¯M∥M)and we can take the minimum of the two valid upper bounds, yielding: ��V πM(s) − V π¯M(s)�� ≤ min�dπs (M∥ ¯M), dπs ( ¯M∥M)�,

which concludes the proof.

Proposition 11 (Local pseudo-Lipschitz continuity of the Q-function of any policy). For any two MDPs  M, ¯M ∈ M, for any stochastic stationary policy  π, for all (s, a) ∈ S × A,��QπM(s, a) − Qπ¯M(s, a)�� ≤ ∆πsa(M, ¯M)

where  ∆πsa(M, ¯M) ≜ min�dπsa(M∥ ¯M), dπsa( ¯M∥M)�and dπsa(M∥ ¯M)is defined as the fixed-point of the following fixed-point equation on  d ∈ F (S × A, R):

image

Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.

Lemma 3. Given two MDPs  M, ¯M ∈ M, any stochastic stationary policy  π, the following equation on  d ∈ F (S × A, R) is afixed-point equation admitting a unique solution for any  (s, a) ∈ S × A:

image

We refer to this unique solution as  dπsa(M∥ ¯M).

Proof of Lemma 3. Let L be the functional operator that maps any function  d ∈ F (S × A, R) to

image

Then for f and g, two functions from  S × A to R, we have for all  (s, a) ∈ S × A that

image

Hence we have that  ∥Lf − Lg∥∞ ≤ γ ∥f − g∥∞. Since  γ < 1, Lis a contraction mapping in the metric space (F (S × A, R) , ∥·∥∞). This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.

Proof of Proposition 11. Consider a stochastic stationary policy  π. The value iteration sequence of iterates  (Qπ,nM )n∈N of the Qfunction for the policy  π and MDP M ∈ Mis defined for all  (s, a) ∈ S × A by:

image

Consider two MDPs  M, ¯M ∈ M. It is obvious that���Qπ,0M (s, a) − Qπ,0¯M (s, a)��� ≤ dπsa(M∥ ¯M)for all  (s, a) ∈ S × A. Suppose the property��Qπ,nM (s, a) − Qπ,n¯M (s, a)�� ≤ dπsa(M∥ ¯M)true at rank  n ∈ Nfor all  (s, a) ∈ S × A. Consider now the rank n + 1

image

where we used Lemma 3 in the last inequality. Since  QπMand  Qπ¯Mare respectively the limits of the sequences  (Qπ,nM )n∈Nand

image

By symmetry, we also have��QπM(s, a) − Qπ¯M(s, a)�� ≤ dπsa( ¯M∥M)and we can take the minimum of the two valid upper bounds,

image

Proof of Proposition 2. The result is clear for all  (s, a) /∈ Ksince the Lipschitz bounds are provably greater than  Q∗M. For (s, a) ∈ K, the result is shown by induction. Let us consider the Dynamic Programming (Bellman 1957) sequences converging

image

Obviously, we have at rank n = 0 that  Q0M(s, a) ≤ U 0(s, a)for all  (s, a) ∈ S × A. Suppose the property true at rank  n ∈ Nand consider rank n + 1:

image

≤ 0 .Which concludes the proof by induction. The result holds by passage to the limit since the considered Dynamic Programming sequences converge to the true functions.

Proof of Proposition 3. Consider two tasks M = (T, R) and ¯M = ( ¯T, ¯R), with K and ¯Kthe respective sets of state-action pairs where their learned models ˆM = ( ˆT, ˆR) and ˆ¯M = ( ˆ¯T, ˆ¯R)are known with accuracy  ϵ in L1-norm with probability at least 1 − δ, i.e., we have that,

image

Importantly, notice that the probabilistic event of Inequality 13 is the intersection of at most 4SA individual events of estimating either  Ras, T ass′, ¯Rasor  ¯Tass′with precision  ϵ. Each one of those individual events is itself true with probability at least  1 − δ′, where  δ′ ∈ (0, 1]is a parameter. For all the individual events to be true at the same time, i.e. for Inequality 13 to be verified, one must apply Boole’s inequality and set  δ′ = δ/(4SA)to ensure a total probability — i.e., probability of the intersection of all the individual events — of at least  1 − δ.

We demonstrate now the result for each one of the three cases

image

the case  (s, a) ∈ Kc ∩ ¯Kbeing the symmetric of case (ii).

(i) If  (s, a) ∈ K ∩ ¯K, then we have  ϵ-close estimates of both models with high probability, as described by Inequality 13. By

definition:

The second term of the right hand side of Equation 14 respects the following sequence of inequalities with probability at least 1 − δ:

image

Replacing the Inequalities 15 and 16 in Equation 14 yields

which holds with probability at least  1 − δand proves the Theorem for case (i). (ii) If  (s, a) ∈ K ∩ ¯Kc, then we do not have an  ϵ-close estimate of ¯Tas·and  ¯Ras. Similarly to the proof of case (i), we upper bound sequentially the two terms of the right hand side of Equation 14. With probability at least  1 − δ, we have the following: ��Ras − ¯Ras�� ≤���Ras − ˆRas��� +���ˆRas − ¯Ras���

image

Similarly, with probability at least  1 − δ, we have:

image

where  VSis the set of probability vectors of size S. Combining inequalities 17 and 18, we get the following with probability at least  1 − δ, by noticing  DγV ∗¯Msa (M, ¯M)on the left hand side:

image

which is the expected result for case (ii). (iii) If  (s, a) ∈ Kc ∩ ¯Kc, then we do not have  ϵ-close estimates of both tasks. In such a case, the result

image

is straightforward by remarking that, as a consequence of Inequality 13, we have that  V ∗¯M(s) ≤ ¯V (s)with probability at least 1 − δ.

Consider two tasks M = (T, R) and ¯M = ( ¯T, ¯R), with K and ¯Kthe respective sets of state-action pairs where their learned models ˆM = ( ˆT, ˆR)and ˆ¯M = ( ˆ¯T, ˆ¯R)are known with accuracy  ϵin  L1-norm with probability at least  1 − δ. We note  Vmax, a known upper bound on the maximum achievable value. In the worst case where one does not have any information on the value of  Vmax, setting Vmax = 11−γ is a valid upper bound. We detail the computation of  ˆDsa(M∥ ¯M)for each cases: 1)  (s, a) ∈ K ∩ ¯K,2)  (s, a) ∈ K ∩ ¯Kc, and 3) (s, a) ∈ Kc ∩ ¯Kc. The case (s, a) ∈ Kc ∩ ¯Kbeing the symmetric of case 2), the same calculations apply. 1) If  (s, a) ∈ K ∩ ¯K, we have

image

Since (s, a) is a known state-action pair, everything is known and computable in this last equation. Note that  maxs′∈S ¯V (s′) canbe tracked along the updates of ¯Vand thus its computation does not induce any additional computational complexity. 2) If  (s, a) ∈ K ∩ ¯Kc, we have

image

First, we have

Maximizing over the variable  t ∈ [0, 1]Ssuch that �s′∈S ts′ = 1is equivalent to maximizing a convex combination of the positive vector ¯Vwhose terms are not independent as they must sum to one. This is easily solvable as a linear programming problem. A straightforward (simplex-like) resolution procedure consists in progressively adding mass on the terms that will maximize the convex combination as follows:

 ts′ = 0, ∀s′ ∈ S

l = Sort states by decreasing values of ¯V

• While �s∈S ts < 1 s′ = pop first state in l

image

This allows calculating the maximum over transition models.

Notice that there is a simpler computation that almost always yields the same result (when it does not, it provides an upper bound) and does not require the burden of the previous procedure. Consider the subset of states for which ¯V (s′) = maxs∈S ¯V (s)(often these are states in ¯Kc). Among those states, let us suppose there exists  s+, unreachable from (s, a), according to ˆT, i.e., ˆTass+ = 0. If ¯Mhas not been fully explored, as is often the case in RMax, there may be many such states. Then the distribution t with all its mass on  s+maximizes the  maxt∈[0,1]Sterm. Conversely, if such a state does not exist (that is, if for all such states ˆTass+ > 0), then maxs∈S ¯V (s)is an upper bound on the  maxt∈[0,1]Sterm. Therefore:

image

with equality in many cases. 3) If  (s, a) ∈ Kc ∩ ¯Kc, the resolution is trivial and we have

Overall, computing the value of the provided upper bound in the three cases allows to compute ˆDsa(M∥ ¯M) for all (s, a) ∈ S×A.

Lemma 4. Given two tasks  M, ¯M ∈ M, Kthe set of state-action pairs for which (R, T) is known with accuracy  ϵ in L1-normwith probability at least  1 − δ. If γ(1 + ϵ) < 1, this equation on ˆd ∈ F (S × A, R)is a fixed-point equation admitting a unique solution.

image

We refer to this unique solution as ˆdsa(M∥ ¯M).

Proof of Lemma 4. Let L be the functional operator that maps any function  d ∈ F (S × A, R) to

image

Let f and g be two functions from  S × A to R. If (s, a) ∈ K, we have

image

If  (s, a) /∈ K, we have

In both cases,  ∥Lf − Lg∥∞ ≤ γ(1 + ϵ) ∥f − g∥∞. If  γ(1 + ϵ) < 1, Lis a contraction mapping in the metric space (F (S × A, R) , ∥ · ∥∞). This metric space being complete and non-empty, it follows from Banach fixed-point theorem that d = Ld admits a single solution.

Proof of Proposition 4. Consider two MDPs  M, ¯M ∈ M. Before proving the result, notice that we shall put ourselves in the case of Proposition 3, for the upper bound on the pseudometric between models ˆDsa(M∥ ¯M)to be true upper bounds with probability at least  1 − δfor all  (s, a) ∈ S × A. As seen in the proof of Proposition 3, this implies learning any reward or transition function with precision  ϵ in L1-norm with probability at least  1 − δ/(4SA).

The proof is done by induction, by calculating the values of  dsa(M∥ ¯M)and ˆdsa(M∥ ¯M)following the value iteration algorithm. Those values can respectively be computed via the sequences of iterates  (dn)n∈N and ( ˆdn)n∈Ndefined as follows for all  (s, a) ∈ S × A:

image

The proof at rank n = 0 is trivial. Let us assume the proposition  dnsa(M∥ ¯M) ≤ ˆdnsa(M∥ ¯M), ∀ (s, a) ∈ S × Atrue at rank n ∈ Nand consider rank n + 1. There are two cases, depending on the fact that (s, a) is in K or not. If  (s, a) ∈ K, we have

image

Using Proposition 3, we have that ˆDsa(M∥ ¯M)is an upper bound on  Dsa(M∥ ¯M)with probability at least  1 − δ. Hence

image

This plus the fact that  dnsa(M∥ ¯M) ≤ ˆdnsa(M∥ ¯M), ∀ (s, a) ∈ S × Aby induction hypothesis, we have with probability at least 1 − δ,

image

image

Using the same reasoning than in case  (s, a) ∈ K, we have with probability higher than  1 − δ,

image

which concludes the proof in the second case.

Proof of Proposition 6. The cost of LRMax is constant on most time steps since the action is greedily chosen w.r.t. the upper bound on the optimal Q-function, which is a lookup table. Let  N ∈ Nbe the number of source tasks that have been learned by LRMax during a Lifelong RL experiment. When updating a new state-action pair, i.e., labeling it as a known pair, the algorithm performs 2N Dynamic Programming (DP) computations to update the induced Lipschitz bounds (Equation 8) plus one DP computation to update the total-bound (Equation 6). In total, we apply (2N + 1) DP computations for each state-action pair update. As at most SA state-action pairs are updated during the exploration of the current MDP, the total number of DP

computations is at most SA(2N + 1), for which we use the value iteration algorithm.

We use the value iteration as a Dynamic Programming method. Strehl, Li, and Littman (2009) report the minimum number of iterations needed by the value iteration algorithm to estimate a value function (or Q-function in our case) that is  ϵQ-close to the

optimum in maximum norm. This minimum number is given by

Each iteration has a cost  S2A. Overall, the cost of all the DP computations in a complete run of LRMax is

image

This, plus the constant cost O(1) applied on each one of the  τdecision epochs concludes the proof.

Proof of Proposition 7. Consider an algorithm producing  ϵ-accurate model estimates ˆDsa(M∥ ¯M)for a subset  K of S × A afterinteracting with any two MDPs  M, ¯M ∈ M. Assume ˆDsa(M∥ ¯M)to be an upper bound of  Dsa(M∥ ¯M)for any  (s, a) /∈ K. These assumptions are guaranteed with high probability by Proposition 3 while running Algorithm 1 in the Lifelong RL setting. Then, for any  (s, a) ∈ S × Aand any two MDPs  M, ¯M ∈ M, we have that

image

Particularly, ˆDsa(M∥ ¯M) + ϵ ≥ Dsa(M∥ ¯M)for all  (s, a) ∈ S × Aand any  M, ¯M ∈ M. By definition of  Dmax(s, a), this

implies that, for all  (s, a) ∈ S × A,

where ˜Mis the set of possible tasks in the considered Lifelong RL experiment. Consider ˆM, the set of sampled MDPs which allows to define ˆDmax(s, a) = maxM, ¯M∈ ˆM ˆDsa(M∥ ¯M)as the maximum model distance for all the experienced MDPs at

(s, a) ∈ S × A. We have that

only if two MDPs maximizing the right hand side of this equation belong to ˆM. If it is the case, then Equation 19 imply that

image

Overall, we require the two MDPs maximizing  maxM, ¯M∈ ˜M ˆDsa(M∥ ¯M)to be sampled for Equation 20 to hold. Let us now derive the probability that those two MDPs have been sampled. We note them  M1 and M2. There may exist more candidates for the maximization but, for the sake of generality, we put ourselves in the case where only two MDPs achieve the maximization. Let us consider drawing  m ∈ Ntasks. We note  p1(respectively  p2) the probability of sampling  M1(respectively  M2) each time a task is sampled. We note  X1(respectively  X2) the random variable of the first occurrence of the task  M1(respectively  M2) among the m trials. Hence, the probability of sampling  M1for the first time at trial  k ∈ {1, . . . , m}is given by the geometric law and is equal to

image

We are interested in the probability of the event that  M1and  M2have been sampled in the m first trials, i.e. Pr  (X1 ≤ m ∩ X2 ≤ m). Following the rule of addition for probabilities, we have that,

image

Given that the event of sampling either  M1or  M2during a single trial happens with probability  p1 + p2, we have by analogy with Equation 21 that Pr  (X1 ≤ m ∪ X2 ≤ m) = 1 − (1 − (p1 + p2))m. As a result, the following holds: Pr  (X1 ≤ m ∩ X2 ≤ m) = 1 − (1 − p1)m + 1 − (1 − p2)m − (1 − (1 − (p1 + p2))m)= 1 − (1 − p1)m − (1 − p2)m + (1 − (p1 + p2))m≥ 1 − 2(1 − pmin)m + (1 − 2pmin)m .

As said earlier, Equation 20 holds if  M1 and M2have been sampled during the first m trials, which imply that the probability for Equation 20 to hold is at least equal to the probability of sampling both tasks. Formally,

image

In turn, if m verifies  2(1 − pmin)m − (1 − 2pmin)m ≤ δ, then  1 − 2(1 − pmin)m + (1 − 2pmin)m ≥ 1 − δand Pr�ˆDmax(s, a) + ϵ ≥ Dmax(s, a)�≥ 1 − δ, which concludes the proof.

Section 4.4 introduced the idea of exploiting prior knowledge on the maximum distance between two MDP models. This idea begs for a more detailed discussion. Consider two MDPs  M and ¯M. By definition of the local model pseudo metric in Equation 1, the maximum possible distance is given by

image

But this assumes that any transition or reward model can define  M and ¯M. In other words, the maximization is made on the whole set of possible MDPs. To illustrate why this is too naive, consider a game within the Arcade Learning Environment (Bellemare et al. 2013). We, as humans, have a strong bias concerning similarity between environments. If the game changes, we still assume groups of pixels will move together on the screen as the result of game actions. For instance, we generally discard possible new games ¯Mthat “teleport” objects across the screen without physical considerations. We also discard new games that allow transitions from a given screen to another screen full of static. These examples illustrate why the knowledge of  Dmaxis very natural (and also why its precise value may be irrelevant). The same observation can be made for the “tight” experiment of Section 5; the set of possible MDPs is restricted by some implicit assumptions that constrain the maximum distance between tasks. For instance, in these experiments, all transitions move to a neighboring state and never “teleport” the agent to the other side of the gridworld. Without the knowledge of  Dmax, LRMax assumes such environments are possible and therefore transfer values very cautiously (with the ultimate goal not to under estimate the optimal Q-function, in order to avoid negative transfer). Overall, the experiments of Section 5 confirm this important insight: safe transfer occurs slowly if no a priori is given on the maximum distance between MDPs. On the contrary, the knowledge of  Dmaxallows a faster and more efficient transfer between environments.

The tight environment is a  11 × 11grid-world illustrated in Figure 5. The initial state of the agent is the central cell displayed with an “S”. The actions are moving 1 cell in one of the four cardinal directions. The reward is 0 everywhere, except for executing an action in one of the three teal cells in the upper-right corner. Each time a task is sampled, a slipping probability of executing another action as the one selected is drawn in [0, 1] and the reward received in each one of the teal cells is picked in [0.8, 1.0].

image

Figure 5: The tight grid-world environment.

image

Figure 6: The corridor grid-world environment.

We ran additional experiments on the corridor grid-world environment represented in Figure 6. The initial state of the agent is the central cell labeled with the letter “S”. The actions are {left, right} and the goal is to reach the cell labeled with the letter “G” on the extreme right. A reward R > 0 is received when reaching the goal and 0 otherwise. At each new task, a new value of R is sampled in [0.8, 1]. The transition function is fixed and deterministic.

The key insight in this experiment is not to lose time exploring the left part of the corridor. We ran 20 episodes of 11 time steps for each one of the 20 sampled tasks. Results are displayed in Figure 7a and 7b, respectively for the average relative discounted return over episodes and over tasks. Similarly as in Section 5, we observe in Figure 7a that LRMax benefits from the transfer method as early as the second task. The MaxQInit algorithm benefits from the transfer from task number 12. Prior knowledge Dmaxdecreases the sample complexity of LRMax as reported earlier and the combination of LRMax with MaxQInit outperforms all other methods by providing a tighter upper bound on the optimal Q-value function. This decrease of sample complexity is also observed in the episode-wise display of Figure 7b where the convergence happens more quickly on average for LRMax and even more for MaxQInit. This figure allows to see the three learning stages of LRMax reported in Section 5.

We also ran Lifelong RL experiments in the maze grid-world of Figure 8. The tasks consists in reaching the goal cell labeled with a “G” while the initial state of the agent is the central cell, labeled with an “S”. Two walls configurations are possible, yielding two different tasks with probability 12of being sampled in the Lifelong RL setting. The first task corresponds to the case where orange walls are actually walls and green cells are normal white cells where the agent can go. The second task is the converse, where green walls are walls and orange cells are normal white cells. We run 100 episodes of length 15 time steps and sample a total of 30 different tasks. Results can be found in Figure 9. In this experiment, we observe the increase of performance of LRMax as the value of  Dmaxdecreases. The three stages behavior of LRMax reported in Section 5 does not appear in this case. We tested the performance of using the online estimation of the local model distances of Proposition 7 in the algorithm referred by LRMax in Figure 9. Once enough tasks have been sampled, the estimate on the model local distance is used with high confidence on its value and refines the upper bound computed analytically in Equation 7. Importantly, this instance of LRMax achieved the best result in this particular environment, demonstrating the usefulness of this result. This method being similar to the MaxQInit estimation of maximum Q-values, we unsurprisingly observe that both algorithms feature a similar performance in the maze environment.

Consider two MDPs  M, ¯M ∈ M. Each time a state-action pair  (s, a) ∈ S × Ais updated, we compute the local distance upper bound ˆDsa(M∥ ¯M)(Equation 7) for all  (s, a) ∈ S × A. In this computation, one can leverage the knowledge of  Dmax to selectmin�ˆDsa(M∥ ¯M), Dmax�. We show that LRMax relies less and less on  Dmaxas knowledge on the current task increases. For this experiment, we used the two grid-worlds environments displayed in Figures 10a and 10b.

image

Figure 7: Results of the corridor Lifelong RL experiment with 95% confidence interval.

image

Figure 8: The maze grid-world environment. The walls correspond to the black cells and either the green ones or the orange ones.

The rewards collected with any actions performed in the teal cells of both tasks are defined as:

image

where  (sx, sy)are the coordinates of the current state,  (gx, gy)the coordinate of the goal cell labelled with a G and  σ is a spanparameter equal to 1 in the first environment and 1.5 in the second environment. The agent starts at the cell labelled with the S letter. Black cells represent unreachable cells (walls). We run LRMax twice on the two different maze grid-worlds and record for each model update the proportion of times  Dmaxis smaller than ˆDsa(M∥ ¯M)in Figure 11 via the % use of  Dmax.

With maximum value  Dmax = 19, ˆDsa(M∥ ¯M)is systematically lesser than  Dmax, resulting in 0% use. Conversely, with minimum value  Dmax = 0, the use expectedly increases to 100%. The in-between value of  Dmax = 10displays a linear decay of the use. This suggests that, at each update, ˆDsa(M∥ ¯M) ≤ Dmaxis only true for one more unique s, a pair, resulting in a constant decay of the use. With fewer prior (Dmax = 15 or 17), updating one single s, a pair allows ˆDsa(M∥ ¯M)to drop under Dmaxfor more than one pair, resulting in less use of the prior knowledge. The conclusion of this experiment if that  Dmax is onlyuseful at the beginning of the exploration, while LRMax relies more on its own bound ˆDsa(M∥ ¯M)when partial knowledge of the task has been acquired.

image

Figure 9: Averaged discounted return over tasks for the maze grid-world Lifelong RL experiment.

image

Figure 10: The two grid-worlds of the prior use experiment.

image

Figure 11: Proportion of times where  Dmax ≤ ˆDsa(M∥ ¯M), i.e., use of the prior, vs computation of the Lipschitz bound. Each curve is displayed with 95% confidence intervals.

image

Table 1: Summary of the number of experiment repetition, number of sampled tasks, number of episodes, maximum length of episodes and upper bounds on the number of collected samples.

We used  nknown = 10, δ = 0.05 and ϵ = 0.01. Theoretically,  nknownshould be a lot larger (≈ 105) in order to reach an accuracy ϵ = 0.01according to Strehl, Li, and Littman (2009). However, it is common practice to assume such small values of  nknownare sufficient to reach an acceptable model accuracy  ϵ. Interestingly, empirical validation did not confirm this assumption for any RMax-based algorithm. We keep these values nonetheless for the sake of comparability between algorithms and consistency with the literature. Despite such absence of accuracy guarantees, RMax-based algorithms still perform surprisingly well and are robust to model estimation uncertainties.

For the experiments run in Section 5, the computing infrastructure used was a laptop using a single 64-bit CPU (model: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz). The collected samples sizes and number of evaluation runs for each experiment is summarized in Table 1. The displayed confidence intervals for any curve presented in the paper is the 95% confidence interval (Neyman 1937) on the displayed mean. No data were excluded neither pre-computed. Hyper-parameters were determined to our appreciation, they may be sub-optimal but we found the results convincing enough to display interesting behaviors.


Designed for Accessibility and to further Open Science