Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience to speed up the resolution of new tasks is a key question in that setting (Lazaric 2012; Taylor and Stone 2009). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task could be able to perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer, practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal Q-value function of an unknown target task, given the learned optimal Q-value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (Brafman and Tennenholtz 2002), relying on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists of computing online the distance between source and target tasks, deducing the upper bound on the optimal Q value function of the source task and using this bound to accelerate learning. Importantly, the method exhibits no negative transfer, i.e., it cannot cause performance degradation, as the computed upper bound provably does not underestimate the optimal Q-value function.
Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method applicable online and safe. In Section 4.3, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting. We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).
Reinforcement Learning (RL) (Sutton and Barto 2018) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (Puterman 2014) consisting of a 4-tuple S is a state space, A an action space,
is the expected reward of taking action
is the transition probability of reaching state
when taking action a in state s. Without loss of generality, we assume
. Given a discount factor
, the expected cumulative return
obtained along a trajectory starting with state s and action a using policy
in MDP M is denoted by
and called the Q-function. The optimal Q-function
highest attainable expected return from s, a and
is the optimal value function in s. Notice that
implies
for all
. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.
Lifelong RL (Silver, Yang, and Li 2013; Brunskill and Li 2014) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs. In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (Wilson et al. 2007; Brunskill and Li 2014; Abel et al. 2018), we restrict the scope of the study to the case where sampled MDPs share the same state-action space . For brevity, we will refer indifferently to MDPs, models or tasks, and write them
Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. Song et al. (2016) define a metric based on the bi-simulation metric introduced by Ferns, Panangaden, and Pre- cup (2004) and the Wasserstein metric (Villani 2008). value transfer is performed between states with low bi-simulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose does allow negative transfer (see Appendix, Section 7). Carroll and Seppi (2005) also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. Mahmud et al. (2013) and Brunskill and Li (2013) propose MDP clustering methods; respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP and the norm between the MDP models. An advantage of clustering is to prune the set of possible source tasks. They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. Ammar et al. (2014) learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. Lazaric, Restelli, and Bonarini (2008) provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models to be identical. Their approach is applicable in a batch RL setting as opposed to the online setting considered in this paper. The approach developed by Sorg and Singh (2009) is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. Brun- skill and Li (2013) also provide a method that can be used for Q-function bounding in multi-task RL.
Abel et al. (2018) present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (Strehl, Li, and Littman 2009). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task’s optimal Qfunction. This approach results in a method for non-negative transfer with high probability once enough tasks have been sampled. The method developed by Abel et al. (2018) is similar to ours in two fundamental points: first, a theoretical upper bounds on optimal Q-values across the MDP space is built; secondly, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum
Qs, a)
Figure 1: The optimal Q-value function represented for a particular s, a pair across the MDP space. The RMax, MaxQInit and LRMax bounds are represented for three sampled MDPs.
bound in an RMax-like algorithm, providing PAC guarantees. The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by Abel et al. (2018), and the LRMax bound is the one we present in this paper. On this figure, the essence of the LRMax bound is noticeable. It stems from the fact that the optimal Q value function is locally Lipschitz continuous in the MDP space w.r.t. a specific pseudometric. Confirming the intuition, close MDPs w.r.t. this metric have close optimal Q values. It should be noticed that no bound is uniformly better than the other as intuited by Figure 1. Hence, combining all the bounds results in a tighter upper bound as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.
The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-function over the MDP space M. The remainder of this section mathematically formalizes this intuition that will be used in the next section to derive a practical method for value transfer. To that end, we introduce a local pseudometric characterizing the distance between the models of two MDPs at a particular state-action pair. A reminder and a detailed discussion on the metrics used herein can be found in the Appendix, Section 8. Definition 1. Given two tasks ,
, and a function
, we define the pseudometric between models at
This pseudometric is relative to a positive function f. We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. For the sake of clarity in the remainder of this study, we introduce
corresponding to the pseudometric between models with the particular choice of . From this definition stems the following pseudo-Lipschitz continuity result.
Proposition 1 (Local pseudo-Lipschitz continuity). For two MDPs
with the local MDP pseudometric local MDP dissimilarity
is the unique solution to the following fixed-point equation for
This result establishes that the distance between the optimal Q-functions of two MDPs at is controlled by a local dissimilarity between the MDPs. The latter follows a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (Bellman 1957). Note that, although the local MDP dissimilarity
is asymmetric,
a pseudometric, hence the name pseudo-Lipschitz continuity. Notice that the policies in Equation 2 are the optimal ones for the two MDPs and thus are different. Proposition 1 is a mathematical result stemming from Definition 1 and should be distinguished from other frameworks of the literature that assume the continuity of the reward and transition models w.r.t.
(Rachelson and Lagoudakis 2010; Pirotta, Restelli, and Bascetta 2015; Asadi, Misra, and Littman 2018).This result establishes that the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Hence, given
, the function
can be used as an upper bound on an unknown MDP. This is the idea on which we construct a computable and transferable upper bound in Section 4. In Figure 1, the upper bound of Equation 4 is represented by the LRMax bound. Noticeably, we provide a global pseudo-Lipschitz continuity property, along with similar results for the optimal value function
and the value function of a fixed policy. As these results do not directly serve the purpose of this article, we report them in the Appendix, Section 10.
A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions:
C1. the resulting algorithm is PAC-MDP;
C2. the transfer accelerates learning;
C3. the transfer is non-negative.
To achieve these conditions, we first present a transferable upper bound on in Section 4.1. This upper bound stems from the Lipschitz continuity result of Proposition 1. Then, we propose a practical way to compute this upper bound in Section 4.2. Precisely, we propose a surrogate bound that can be calculated online in the Lifelong RL setting, without having explored the source and target tasks completely. Finally, we implement the method in an algorithm described in Section 4.3, and demonstrate formally that it meets conditions C1, C2 and C3. Improvements are discussed in Section 4.4.
4.1 A Transferable Upper Bound on Q∗M
From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.
Definition 2. Given two tasks M and , for all
Lipschitz upper bound on
defined as
The optimism in the face of uncertainty principle leads to considering that the long-term expected return from any state is the maximum return, unless proven otherwise. Particularly, the RMax algorithm (Brafman and Tennenholtz 2002), explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PACMDP guarantees (Strehl, Li, and Littman 2009), meaning that convergence to a near-optimal policy is guaranteed in a polynomial number of missteps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound U on the optimal Q-function, then acts greedily w.r.t. U. By default, it takes the maximum value
, but any tighter upper bound is admissible. Thus, shrinking U with Equation 5 is expected to improve the learning speed or sample complexity for new tasks in Lifelong RL.
In RMax, during the resolution of a task is split into a subset of known state-action pairs K and its complement
of unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an
-accurate model in
-norm with probability higher than
. We refer to
and
as the RMax precision parameters. This results in a threshold
on the number of visits n(s, a) to a pair s, a that are necessary to reach this precision. Given the experience of a set of m MDPs
, we define the total bound as the minimum over all the induced Lipschitz bounds. Proposition 2. Given a partially known task
, the set of known state-action pairs K, and the set of Lipschitz bounds on
induced by previous tasks
the function Q defined below is an upper bound on
for all
with U, . . . , U
Commonly in RMax, Equation 6 is solved to a precision via Value Iteration. This yields a function Q that is a valid heuristic (bound on
) for the exploration of MDP M.
4.2 A Computable Upper Bound on Q∗M
The key issue addressed in this section is how to actually compute U(s, a), particularly when both source and target tasks are partially explored. Consider two tasks which vanilla RMax has been applied, yielding the respective sets of known state-action pairs K and
, along with the learned models
, and the upper bounds
respectively on
. Notice that, if
, then
for all s, a pairs. Conversely, if
-accurate estimate of
high probability. Equation 6 allows the transfer of knowledge from
to M if
can be computed. Unfortunately, the true model and optimal value functions, necessary to compute
, are partially known (see Equation 5). Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound
(Definition 1).
Proposition 3. Given two tasks M, and respectively K,
the subsets of
where their models are known with accuracy
-norm with probability at least
with the upper bound on the pseudometric between models defined below for
Importantly, this upper bound can be calculated analytically (see Appendix, Section 13). This makes
usable in the online Lifelong RL setting, where already explored tasks may be partially learned, and little knowledge has been gathered on the current task. The magnitude of the B term is controlled by
. In the case where no information is available on the maximum value of
have that
.
measures the accuracy with which the tasks are known: the smaller
, the tighter the B bound. Note that
is used as an upper bound on the true
. In many cases,
; e.g. for stochastic short- est path problems, which feature rewards only upon reaching terminal states, we have that
and thus
is a tighter bound for transfer. Combining
and Equation 3, one can derive an upper bound
, detailed in Proposition 4.
Proposition 4. Given two tasks the set of state-action pairs where (R, T) is known with accuracy
-norm with probability at least
solution
of the following fixed-point equation on
) is an upper bound on
with probability at least
illustrates the fact that for a large return horizon (large ), a high accuracy (small
) is needed for the bound to be computable. Eventually, a computable upper bound on
M with high probability is given by
The associated upper bound on U(s, a) (Equation 6) given the set of previous tasks is defined by
This upper bound can be used to transfer knowledge from a partially solved source task to a target task. If on a subset of
, then the convergence rate can be improved. As complete knowledge of both tasks is not needed to compute the upper bound, it can be applied online in the Lifelong RL setting. In the next section, we explicit an algorithm that leverages this value-transfer method.
4.3 Lipschitz RMax Algorithm
In Lifelong RL, MDPs are encountered sequentially. Applying RMax to task M yields the set of known state-action pairs K, the learned models and
, and the upper bound Q on
. Saving this information when the task changes allows computing the upper bound of Equation 10 for the new target task, and using it to shrink the optimistic heuristic of RMax. This computation effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of
,
and eventually
, allowing for more efficient transfer where the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm 1. To avoid ambiguities with
, we use
to store learned features (
,
) about previous MDPs. In a nutshell, the behavior of LRMax is precisely that of RMax, but with a tighter admissible heuristic
that becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1) as stated in Propositions 5 and 6 below. With S = |S| and A = |A|, the sample complexity of vanilla RMax is
, which is improved by LRMax in Proposition 5 and meets Condition C2. Finally,
a provable upper bound with high probability on
avoids negative transfer and meets Condition C3.
Proposition 5 (Sample complexity (Strehl, Li, and Littman 2009)). With probability , the greedy policy w.r.t. Q computed by LRMax achieves an
-optimal return in MDP
samples (when logarithmic factors are ignored), with de-fined in Equation 10 a non-static, decreasing quantity, upper bounded by
Proposition 5 shows that the sample complexity of LRMax is no worse than that of RMax. Consequently, in the worst case, LRMax performs as badly as learning from scratch, which is to say that the transfer method is not negative as it cannot degrade the performance.
Proposition 6 (Computational complexity). The total computational complexity of LRMax (Algorithm 1) is
with the number of interaction steps,
the precision of value iteration and N the number of source tasks.
4.4 Refining the LRMax Bounds
LRMax relies on bounds on the local MDP dissimilarity (Equation 8). The quality of the Lipschitz bound on be improved according to the quality of those estimates. We discuss two methods to provide finer estimates.
Refining with prior knowledge. First, from the definition of , it is easy to show that this pseudometric between models is always upper bounded by
. However, in practice, the tasks experienced in a Lifelong RL experiment might not cover the full span of possible MDPs M and may systematically be closer to each other than
. For instance, the distance between two games in the Arcade Learning Environment (ALE) (Bellemare et al. 2013), is smaller than
Figure 2: Illustration of the prior knowledge on the maximum pseudo-distance between models for a particular s, a pair.
the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section 17). Let us note the set of possible MDPs for a particular Lifelong RL experiment. Let
imum model pseudo-distance at a particular s, a pair on the subset
. Prior knowledge might indicate a smaller upper bound for
. We will note such an upper bound
, considered valid for all s, a pairs, i.e., such that
. In a Lifelong RL experiment,
can be seen as a rough estimate of the maximum model discrepancy an agent may encounter. Figure 2 illustrates the relative importance of
Solving Equation 8 boils down to accumulating
values in
. Hence, reducing a
estimate in a single s, a pair actually reduces
in all s, a pairs. Thus, replacing
in Equation 8 by
, provides a smaller upper bound
on
, and thus a smaller
which allows transfer if it is less than
. Consequently, the knowl- edge of such a bound
can make a difference between successful and unsuccessful transfer, even if its value is of little importance. Conversely, setting a value for
tifies the distance between MDPs where transfer is efficient.
Refining by learning the maximum distance. The value of can be estimated online for each s, a pair, discarding the hypothesis of available prior knowledge. We propose to use an empirical estimate of the maximum model distance at s, a:
, with
the set of explored tasks. The pitfall of this approach is that, with few explored tasks,
could underestimate
. Proposition 7 provides a lower bound on the probability that
does not underestimate
, depending on the number of sampled tasks.
Proposition 7. Consider an algorithm producing model estimates
for a subset K of
after interacting with any two MDPs
. Assume
for any
. For all
, after sampling m tasks, if m is large enough to verify
This result indicates when upper bounds
with high probability. In such a case,
of Equation 8 can be replaced by
to tighten the bound on
. Assuming a lower bound
on the sampling probability of a task implies that M is finite and is seen as a non-adversarial task sampling rule (Abel et al. 2018).
The experiments reported here1 illustrate how the Lipschitz bound (Equation 9) provides a tighter upper bound on , improving the sample complexity of LRMax compared to RMax, and making the transfer of inter-task knowledge effective. Graphs are displayed with 95% confidence intervals. For information in line with the Machine Learning Reproducibility Check-list (Pineau 2019) see the Appendix, Section 22.
We evaluate different variants of LRMax in a Lifelong RL experiment. The RMax algorithm will be used as a no-transfer baseline. LRMax(x) denotes Algorithm 1 with prior . MaxQInit denotes the MAXQINIT algorithm from Abel et al. (2018), consisting in a state-of-the art PACMDP algorithm. Both LRMax and MaxQInit algorithms achieve value transfer by providing a tighter upper bound on
than
. Computing both upper bounds and taking the minimum results in combining the two approaches. We include such a combination in our study with the LRMaxQInit algorithm. Similarly, the latter algorithm benefiting from prior knowledge
is denoted by LRMaxQInit(x). For the sake of comparison, we only compare algorithms with the same features, namely, tabular, online, PAC-MDP methods, presenting non-negative transfer.
The environment used in all experiments is a variant of the “tight” task used by Abel et al. (2018). It is an grid-world, the initial state is in the centre, actions are the cardinal moves (Appendix, Section 18). The reward is always zero except for the three goal cells in the upper-right corner. Each sampled task has its own reward values, drawn from [0.8, 1] for each of the three goal cells and its own probability of slipping (performing a different action than the one selected), picked in [0, 0.1]. Hence, tasks have different reward and transition functions. Notice the distinction in applicability between MaxQInit, that requires the set of MDPs to be finite, and LRMax, that can be used with any set of MDPs. For the comparison between both to be possible, we drew tasks from a finite set of 5 MDPs. We sample 15 tasks sequentially among this set, each run for 2000 episodes of length 10. The operation is repeated 10 times to narrow the confidence intervals. We set
, and
(discussion in Appendix, Section 21). Other Lifelong RL experiments are reported in Appendix, Section 19.
The results are reported in Figure 3. Figure 3a displays the discounted return for each task, averaged across episodes. Similarly, Figure 3b displays the discounted return for each episode, averaged across tasks (same color code as Figure 3a). Figure 3c displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relative to an estimator of the optimal expected return for each task. For readability, Figures 3b and 3c display a moving average over 100 episodes. Figure 3d reports the benefits of various values of on the algorithmic properties.
In Figure 3a, we first observe that LRMax benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. In contrast, MaxQInit requires to wait for task 12 before benefiting from transfer. As suggested in Section 4.4, increasing amounts of prior knowledge allow the LRMax transfer method to be more efficient: a smaller known upper bound on
accelerates convergence. Combining both approaches in the LRMaxQInit algorithm outperforms all other methods. Episode-wise, we observe in Figure 3b that the LRMax transfer method allows for faster convergence, i.e., lower sample complexity. Interestingly, LRMax exhibits three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This behavior is a consequence of the combined facts that the Lipschitz bound (Equation 9) is larger on promising regions of
seen on previous tasks and the fact that LRMax acts greedily w.r.t. that bound. 2) This high performance regime is followed by the exploration of unknown regions of
, in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low payoff regions. Such regions are then identified and never revisited. 3) Eventually, LRMax stops exploring and converges to the optimal policy. Importantly, in all experiments, LRMax never experiences negative transfer, as supported by the provability of the Lipschitz upper bound with high probability. LRMax is at least as efficient as the no-transfer RMax baseline.
Figure 3c displays the collected returns of RMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from transfer as early as Task 2, where the previous 3-stage behavior is visible. MaxQInit takes until task 12 to leverage the transfer method. However, the bound it provides is tight enough that it does not have to explore.
In Figure 3d, we display the following quantities for various values of , the fraction of the time the Lipschitz bound was tighter than the RMax bound
is the relative gain of time steps before convergence when comparing LRMax to RMax. This quantity is estimated based on the last updates of the empirical model
, is the relative total return gain on 2000 episodes of LRMax w.r.t. RMax. First, we observe an increase of
as
becomes tighter. This means that the Lipschitz bound of Equation 9 becomes effectively smaller than
. This phe- nomenon leads to faster convergence, indicated by
Eventually, this increased convergence rate allows for a net total return gain, as can be seen with the increase of
Overall, in this analysis, we have showed that LRMax ben-efits from an enhanced sample complexity thanks to the value-transfer method. The knowledge of a prior increases
Figure 3: Experimental results. LRMax benefits from an enhanced sample complexity thanks to the value-transfer method.
this benefit. The method is comparable to the MaxQInit method and has some advantages such as the early fitness for use and the applicability to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. Additionally, we show in Appendix, Section 20 that, when provided with any prior LRMax increasingly stops using it during exploration, con-firming the claim of Section 4.4 that providing
transfer even if its value is of little importance.
We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space w.r.t. a new metric. We proved a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks and exhibits no negative transfer. Improvements of the algorithm were discussed in the form of prior knowledge on the maximum distance between models and online estimation of this distance. As a non-negative, similarity-based, PAC-MDP transfer method, the LRMax algorithm is the first method of the literature combining those three appealing features. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning while not experiencing any negative transfer. Notably, our approach can directly extend other PAC-MDP algorithms (Szita and Szepesv´ari 2010; Rao and Whiteson 2012; Pazis, Parr, and How 2016; Dann, Lattimore, and Brun- skill 2017) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis to non-negative value transfer via MDP similarity, a study that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme achieving non-negative transfer with PAC guarantees. Further, designing scalable methods conveying the same intuition could be a promising research direction.
We would like to thank Dennis Wilson for fruitful discussions and comments on the paper. This research was supported by the Occitanie region, France; ISAE-SUPAERO (Institut Sup´erieur de l’A´eronautique et de l’Espace); fondation ISAESUPAERO; ´Ecole Doctorale Syst`emes; and ONERA, the French Aerospace Lab.
Abel, D.; Jinnai, Y.; Guo, S. Y.; Konidaris, G.; and Littman, M. L. 2018. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 20–29.
Ammar, H. B.; Eaton, E.; Taylor, M. E.; Mocanu, D. C.; Driessens, K.; Weiss, G.; and Tuyls, K. 2014. An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning. In Workshops at the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).
Asadi, K.; Misra, D.; and Littman, M. L. 2018. Lipschitz Continuity in Model-Based Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning (ICML 2018) .
Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.
Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.
Brafman, R. I.; and Tennenholtz, M. 2002. R-max - a Gen- eral Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3(Oct): 213–231.
Brunskill, E.; and Li, L. 2013. Sample Complexity of Multi- task Reinforcement Learning. In Proceedings of the 29th conference on Uncertainty in Artificial Intelligence (UAI 2013).
Brunskill, E.; and Li, L. 2014. PAC-inspired Option Dis- covery in Lifelong Reinforcement Learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 316–324.
Carroll, J. L.; and Seppi, K. 2005. Task Similarity Measures for Transfer in Reinforcement Learning Task Libraries. In Proceedings of the 5th International Joint Conference on Neural Networks (IJCNN 2005), volume 2, 803–808. IEEE.
Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5713–5723.
Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.
Lazaric, A. 2012. Transfer in Reinforcement Learning: a Framework and a Survey. In Reinforcement Learning, 143– 173. Springer.
Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Transfer of Samples in Batch Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 544–551.
Mahmud, M. M.; Hawasly, M.; Rosman, B.; and Ramamoor- thy, S. 2013. Clustering Markov Decision Processes for Continual Transfer. Computing Research Repository (arXiv/CoRR) URL https://arxiv.org/abs/1311.3959.
Neyman, J. 1937. X—outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences 236(767): 333– 380.
Pazis, J.; Parr, R. E.; and How, J. P. 2016. Improving PAC Exploration using the Median of Means. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 3898–3906.
Pineau, J. 2019. Machine Learning Reproducibility Checklist. https://www.cs.mcgill.ca/jpineau/ ReproducibilityChecklist.pdf. Version 1.2, March 27, 2019, last accessed on August 27, 2020.
Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy gra- dient in Lipschitz Markov Decision Processes. Machine Learning 100(2-3): 255–283.
Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Rachelson, E.; and Lagoudakis, M. G. 2010. On the Locality of Action Domination in Sequential Decision Making. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010).
Rao, K.; and Whiteson, S. 2012. V-MAX: Tempered Opti- mism for Better PAC Reinforcement Learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), 375–382.
Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, 05.
Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.
Sorg, J.; and Singh, S. 2009. Transfer via Soft Homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), 741–748. International Foundation for Autonomous Agents and Multiagent Systems.
Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press, Cambridge.
Szita, I.; and Szepesv´ari, C. 2010. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 1031–1038.
Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.
Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
Watkins, C. J. C. H.; and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4): 279–292.
Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi- Task Reinforcement Learning: A Hierarchical Bayesian Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 1015–1022.
Erwan Lecarpentier1, 2, David Abel3, Kavosh Asadi3, 4, Yuu Jinnai3, Emmanuel Rachelson1, Michael L. Littman3
where is a positive constant and
is the 1-Wasserstein metric. For each state of the target model, the closest counterpart state (with the smallest bi-simulation distance) of the source MDP is identified and its learned Q-values are used to initialize the Q-function of the target MDP. In their experiments, Song et al. (2016) run a standard Q-Learning algorithm (Watkins and Dayan 1992) with an
-greedy exploration strategy thereafter.
Figure 4: The T-shaped MDP transfer task.
Let us now consider applying this method to a similar task to the T-shaped MDP transfer task proposed by Taylor and Stone (2009). The source and target tasks are respectively described on the left and right sides of Figure 4. In each task, the states are represented by the circles and the arrows between them correspond to the available actions that allow to move from one state to the other. The initial state of both tasks is the left state I for the source task and Ifor the target task. Between the states I and A in the source task (respectively I
in the target task) are k states, k being a parameter increasing the distance to travel from I to A (respectively I
). The tasks are deterministic and the reward is zero everywhere except for the state B in the source task and C
in the target task where a reward of +1 is received. Consequently, the optimal policy in the source task is to go to the state A and then to the state B. In the target task, the same applies except that a transition to state C should be applied in place of state B
when the agent is in state A
Regardless of the parameters used in the bi-simulation metric of Equation 11, the direct state transfer method from Song et al. (2016) maps the following states together as they share the exact same model:
Hence, during learning, the Q-function of the target task is initialized with the values of the Q-function of the source task. Therefore, the behavior derived with the Q-Learning algorithm is the optimal policy of the source task, but in the target task. Depending on the value of the learning rate of the algorithm, the time to favor action DOWN in state Ainstead of action UP can be arbitrarily long. Also, depending on the value of
, the exploration of state C
due to the
-greedy strategy can be arbitrarily unlikely. Finally, the time needed for one of those two events to occur increases proportionally to the value of k, which can be arbitrarily large.
This case illustrates the difficulty facing any transfer method in the general context of Lifelong RL. The method proposed by Song et al. (2016) can be highly efficient in some cases as they show in experiments, but the lack of theoretical guarantees makes negative transfer possible. Generally, using a similarity measure such that the bi-simulation metric helps to discourage using some source tasks when the computed similarity is too low. However, as we saw in the T-shaped MDP example, this rule is not absolute and the choice of the metric is important. The approach we develop in this paper aims at avoiding negative transfer by providing a conservative transferred knowledge that is simply of no use when the similarity between source and target tasks is too low. This is intuitive as we do not expect any task to provide transferable knowledge to any other task.
A metric on a set X is a function which has the following properties for any
P1. (positivity),
P2. (positive definiteness),
P3. m(x, y) = m(y, x) (symmetry),
P4. (triangle inequality).
If property P2 is not verified by m, but instead we have for any is called a pseudo-metric. If m only verifies P1, P2 and P4 then m is called a quasi-metric. If m only verifies P1 and P2 and if X is a set of probability measures, then m is called a divergence.
From this, the pseudo-metric between models of Definition 1 is indeed a pseudo-metric as it is relative to a positive function f that could be equal to zero and break property P2.
The local MDP dissimilarity between MDPs of Proposition 1 does not respect properties P2 and P3, hence the name dissimilarity. The
quantity, however, regains property P3 and is hence a pseudo-metric. A noticeable consequence is that Proposition 1 is “in the spirit” of a Lipschitz continuity result but cannot be called as such, hence the name pseudo-Lipschitz continuity.
The same goes for the global dissimilarity . However, using
allows to regain property 3 and makes this quantity a pseudo-metric again between MDPs.
Notation 1. Given two sets X and Y , we note F (X, Y ) the set of functions defined on the domain X with codomain Y .
Lemma 1. Given two MDPs , the following equation on
is a fixed-point equation admitting a unique solution for any
We refer to this unique solution as
Proof of Lemma 1. The proof follows closely that in (Puterman 2014) that proves that the Bellman operator over value functions is a contraction mapping. Let L be the functional operator that maps any function
Then for f and g, two functions from , we have that
Since this is true for any pair , we have that
Since is a contraction mapping in the metric space
. This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.
Proof of Proposition 1. The proof is by induction. The value iteration sequence of iterates of the optimal Q-function of any MDP
is defined for all
Consider two MDPs . It is obvious that
. Suppose the property
true at rank
. Consider now the rank n + 1 and a pair
where we used Lemma 1 in the last inequality. Since and
are respectively the limits of the sequences
and
, it results from passage to the limit that
By symmetry, we also haveand we can take the minimum of the two valid upper bounds, yielding:
Similar results to Proposition 1 can be derived. First, an important consequence is the global pseudo-Lipschitz continuity result
From a pure transfer perspective, Equation 12 is interesting since the right hand side does not depend on s, a. Hence, the counterpart of the upper bound of Equation 4, namely,
is easier to compute. Indeed, can be computed once and for all, contrarily to
that needs to be evaluated for all s, a pair. However, we do not use this result for transfer because it is impractical to compute online. Indeed, estimating the maximum in the definition of
can be as hard as solving both MDPs, which, when it happens, is too late for transfer to be useful.
Proof of Proposition 8. The proof is by induction. We consider the sequence of value iteration iterates defined for any MDP
and, by symmetry, the result holds as well for . Suppose that it is true at rank
. Consider rank n + 1 and
, we have that:
By symmetry, the results holds as well for which concludes the proof by induction.
The second result is for the value function and is stated below.
Proposition 9 (Local pseudo-Lipschitz continuity of the optimal value function). For any two MDPs
where the local MDP pseudo-metric has the same definition as in Proposition 1.
Proof of Proposition 9. The proof follows exactly the same steps as the proof of Proposition 1, i.e., by first constructing the value iteration sequence of iterates of the optimal value function, showing the result by induction for rank and then concluding with a passage to the limit.
Another result can be derived for the value of any policy . For the sake of generality, we state the result for any stochastic policy mapping states to distributions over actions. Note that a deterministic policy is a stochastic policy mapping states to Dirac distributions over actions. First, we state the result for the value function in Proposition 10 and then for the Q function in Proposition 11.
Proposition 10 (Local pseudo-Lipschitz continuity of the value function of any policy). For any two MDPs , for any stochastic stationary policy
where and
is defined as the fixed-point of the following fixed-point equation on
Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.
Lemma 2. Given two MDPs , any stochastic stationary policy
, the following equation on
is a fixed-point equation admitting a unique solution for any
We refer to this unique solution as
Proof of Lemma 2. Let L be the functional operator that maps any function
Then for f and g, two functions from S to R, we have that
Hence we have that is a contraction mapping in the metric space
This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.
Proof of Proposition 10. Consider a stochastic stationary stationary policy . The value iteration sequence of iterates
of the value function of any MDP
is defined for all
Consider two MDPs . It is obvious that
for all
. Suppose the property
true at rank
. Consider now the rank n + 1 and the state
where we used Lemma 2 in the last inequality. Since and
are respectively the limits of the sequences
and
, it results from passage to the limit that
By symmetry, we also haveand we can take the minimum of the two valid upper bounds, yielding:
which concludes the proof.
Proposition 11 (Local pseudo-Lipschitz continuity of the Q-function of any policy). For any two MDPs , for any stochastic stationary policy
where is defined as the fixed-point of the following fixed-point equation on
Before proving the Proposition, we show that the fixed point equation admits a unique solution in the following Lemma.
Lemma 3. Given two MDPs , any stochastic stationary policy
, the following equation on
fixed-point equation admitting a unique solution for any
We refer to this unique solution as
Proof of Lemma 3. Let L be the functional operator that maps any function
Then for f and g, two functions from , we have for all
Hence we have that . Since
is a contraction mapping in the metric space
. This metric space being complete and non-empty, it follows by direct application of the Banach fixed-point theorem that the equation d = Ld admits a unique solution.
Proof of Proposition 11. Consider a stochastic stationary policy . The value iteration sequence of iterates
function for the policy
is defined for all
Consider two MDPs . It is obvious that
for all
. Suppose the property
true at rank
for all
. Consider now the rank n + 1
where we used Lemma 3 in the last inequality. Since and
are respectively the limits of the sequences
and
By symmetry, we also haveand we can take the minimum of the two valid upper bounds,
Proof of Proposition 2. The result is clear for all since the Lipschitz bounds are provably greater than
. For
, the result is shown by induction. Let us consider the Dynamic Programming (Bellman 1957) sequences converging
Obviously, we have at rank n = 0 that for all
. Suppose the property true at rank
and consider rank n + 1:
Which concludes the proof by induction. The result holds by passage to the limit since the considered Dynamic Programming sequences converge to the true functions.
Proof of Proposition 3. Consider two tasks M = (T, R) and , with K and
the respective sets of state-action pairs where their learned models
are known with accuracy
-norm with probability at least
, we have that,
Importantly, notice that the probabilistic event of Inequality 13 is the intersection of at most 4SA individual events of estimating either or
with precision
. Each one of those individual events is itself true with probability at least
, where
is a parameter. For all the individual events to be true at the same time, i.e. for Inequality 13 to be verified, one must apply Boole’s inequality and set
to ensure a total probability — i.e., probability of the intersection of all the individual events — of at least
We demonstrate now the result for each one of the three cases
the case being the symmetric of case (ii).
(i) If , then we have
-close estimates of both models with high probability, as described by Inequality 13. By
definition:
The second term of the right hand side of Equation 14 respects the following sequence of inequalities with probability at least
Replacing the Inequalities 15 and 16 in Equation 14 yields
which holds with probability at least and proves the Theorem for case (i). (ii) If
, then we do not have an
-close estimate of
and
. Similarly to the proof of case (i), we upper bound sequentially the two terms of the right hand side of Equation 14. With probability at least
, we have the following:
Similarly, with probability at least
where is the set of probability vectors of size S. Combining inequalities 17 and 18, we get the following with probability at least
, by noticing
on the left hand side:
which is the expected result for case (ii). (iii) If , then we do not have
-close estimates of both tasks. In such a case, the result
is straightforward by remarking that, as a consequence of Inequality 13, we have that with probability at least
Consider two tasks M = (T, R) and , with K and
the respective sets of state-action pairs where their learned models
and
are known with accuracy
in
-norm with probability at least
. We note
, a known upper bound on the maximum achievable value. In the worst case where one does not have any information on the value of
is a valid upper bound. We detail the computation of
for each cases: 1)
2)
being the symmetric of case 2), the same calculations apply. 1) If
Since (s, a) is a known state-action pair, everything is known and computable in this last equation. Note that be tracked along the updates of
and thus its computation does not induce any additional computational complexity. 2) If
First, we have
Maximizing over the variable such that
is equivalent to maximizing a convex combination of the positive vector
whose terms are not independent as they must sum to one. This is easily solvable as a linear programming problem. A straightforward (simplex-like) resolution procedure consists in progressively adding mass on the terms that will maximize the convex combination as follows:
•
• l = Sort states by decreasing values of
• While –
= pop first state in l
This allows calculating the maximum over transition models.
Notice that there is a simpler computation that almost always yields the same result (when it does not, it provides an upper bound) and does not require the burden of the previous procedure. Consider the subset of states for which (often these are states in
). Among those states, let us suppose there exists
, unreachable from (s, a), according to
, i.e.,
has not been fully explored, as is often the case in RMax, there may be many such states. Then the distribution t with all its mass on
maximizes the
term. Conversely, if such a state does not exist (that is, if for all such states
is an upper bound on the
term. Therefore:
with equality in many cases. 3) If , the resolution is trivial and we have
Overall, computing the value of the provided upper bound in the three cases allows to compute
Lemma 4. Given two tasks the set of state-action pairs for which (R, T) is known with accuracy
with probability at least
, this equation on
is a fixed-point equation admitting a unique solution.
We refer to this unique solution as
Proof of Lemma 4. Let L be the functional operator that maps any function
Let f and g be two functions from
If
In both cases, . If
is a contraction mapping in the metric space
. This metric space being complete and non-empty, it follows from Banach fixed-point theorem that d = Ld admits a single solution.
Proof of Proposition 4. Consider two MDPs . Before proving the result, notice that we shall put ourselves in the case of Proposition 3, for the upper bound on the pseudometric between models
to be true upper bounds with probability at least
for all
. As seen in the proof of Proposition 3, this implies learning any reward or transition function with precision
-norm with probability at least
The proof is done by induction, by calculating the values of and
following the value iteration algorithm. Those values can respectively be computed via the sequences of iterates
defined as follows for all
The proof at rank n = 0 is trivial. Let us assume the proposition true at rank
and consider rank n + 1. There are two cases, depending on the fact that (s, a) is in K or not. If
Using Proposition 3, we have that is an upper bound on
with probability at least
This plus the fact that by induction hypothesis, we have with probability at least
Using the same reasoning than in case , we have with probability higher than
which concludes the proof in the second case.
Proof of Proposition 6. The cost of LRMax is constant on most time steps since the action is greedily chosen w.r.t. the upper bound on the optimal Q-function, which is a lookup table. Let be the number of source tasks that have been learned by LRMax during a Lifelong RL experiment. When updating a new state-action pair, i.e., labeling it as a known pair, the algorithm performs 2N Dynamic Programming (DP) computations to update the induced Lipschitz bounds (Equation 8) plus one DP computation to update the total-bound (Equation 6). In total, we apply (2N + 1) DP computations for each state-action pair update. As at most SA state-action pairs are updated during the exploration of the current MDP, the total number of DP
computations is at most SA(2N + 1), for which we use the value iteration algorithm.
We use the value iteration as a Dynamic Programming method. Strehl, Li, and Littman (2009) report the minimum number of iterations needed by the value iteration algorithm to estimate a value function (or Q-function in our case) that is -close to the
optimum in maximum norm. This minimum number is given by
Each iteration has a cost . Overall, the cost of all the DP computations in a complete run of LRMax is
This, plus the constant cost O(1) applied on each one of the decision epochs concludes the proof.
Proof of Proposition 7. Consider an algorithm producing -accurate model estimates
for a subset
interacting with any two MDPs
. Assume
to be an upper bound of
for any
. These assumptions are guaranteed with high probability by Proposition 3 while running Algorithm 1 in the Lifelong RL setting. Then, for any
and any two MDPs
, we have that
Particularly, for all
and any
. By definition of
, this
implies that, for all
where is the set of possible tasks in the considered Lifelong RL experiment. Consider
, the set of sampled MDPs which allows to define
as the maximum model distance for all the experienced MDPs at
. We have that
only if two MDPs maximizing the right hand side of this equation belong to . If it is the case, then Equation 19 imply that
Overall, we require the two MDPs maximizing to be sampled for Equation 20 to hold. Let us now derive the probability that those two MDPs have been sampled. We note them
. There may exist more candidates for the maximization but, for the sake of generality, we put ourselves in the case where only two MDPs achieve the maximization. Let us consider drawing
tasks. We note
(respectively
) the probability of sampling
(respectively
) each time a task is sampled. We note
(respectively
) the random variable of the first occurrence of the task
(respectively
) among the m trials. Hence, the probability of sampling
for the first time at trial
is given by the geometric law and is equal to
We are interested in the probability of the event that and
have been sampled in the m first trials, i.e. Pr
. Following the rule of addition for probabilities, we have that,
Given that the event of sampling either or
during a single trial happens with probability
, we have by analogy with Equation 21 that Pr
. As a result, the following holds: Pr
As said earlier, Equation 20 holds if have been sampled during the first m trials, which imply that the probability for Equation 20 to hold is at least equal to the probability of sampling both tasks. Formally,
In turn, if m verifies , then
and Pr
, which concludes the proof.
Section 4.4 introduced the idea of exploiting prior knowledge on the maximum distance between two MDP models. This idea begs for a more detailed discussion. Consider two MDPs . By definition of the local model pseudo metric in Equation 1, the maximum possible distance is given by
But this assumes that any transition or reward model can define . In other words, the maximization is made on the whole set of possible MDPs. To illustrate why this is too naive, consider a game within the Arcade Learning Environment (Bellemare et al. 2013). We, as humans, have a strong bias concerning similarity between environments. If the game changes, we still assume groups of pixels will move together on the screen as the result of game actions. For instance, we generally discard possible new games
that “teleport” objects across the screen without physical considerations. We also discard new games that allow transitions from a given screen to another screen full of static. These examples illustrate why the knowledge of
is very natural (and also why its precise value may be irrelevant). The same observation can be made for the “tight” experiment of Section 5; the set of possible MDPs is restricted by some implicit assumptions that constrain the maximum distance between tasks. For instance, in these experiments, all transitions move to a neighboring state and never “teleport” the agent to the other side of the gridworld. Without the knowledge of
, LRMax assumes such environments are possible and therefore transfer values very cautiously (with the ultimate goal not to under estimate the optimal Q-function, in order to avoid negative transfer). Overall, the experiments of Section 5 confirm this important insight: safe transfer occurs slowly if no a priori is given on the maximum distance between MDPs. On the contrary, the knowledge of
allows a faster and more efficient transfer between environments.
The tight environment is a grid-world illustrated in Figure 5. The initial state of the agent is the central cell displayed with an “S”. The actions are moving 1 cell in one of the four cardinal directions. The reward is 0 everywhere, except for executing an action in one of the three teal cells in the upper-right corner. Each time a task is sampled, a slipping probability of executing another action as the one selected is drawn in [0, 1] and the reward received in each one of the teal cells is picked in [0.8, 1.0].
Figure 5: The tight grid-world environment.
Figure 6: The corridor grid-world environment.
We ran additional experiments on the corridor grid-world environment represented in Figure 6. The initial state of the agent is the central cell labeled with the letter “S”. The actions are {left, right} and the goal is to reach the cell labeled with the letter “G” on the extreme right. A reward R > 0 is received when reaching the goal and 0 otherwise. At each new task, a new value of R is sampled in [0.8, 1]. The transition function is fixed and deterministic.
The key insight in this experiment is not to lose time exploring the left part of the corridor. We ran 20 episodes of 11 time steps for each one of the 20 sampled tasks. Results are displayed in Figure 7a and 7b, respectively for the average relative discounted return over episodes and over tasks. Similarly as in Section 5, we observe in Figure 7a that LRMax benefits from the transfer method as early as the second task. The MaxQInit algorithm benefits from the transfer from task number 12. Prior knowledge decreases the sample complexity of LRMax as reported earlier and the combination of LRMax with MaxQInit outperforms all other methods by providing a tighter upper bound on the optimal Q-value function. This decrease of sample complexity is also observed in the episode-wise display of Figure 7b where the convergence happens more quickly on average for LRMax and even more for MaxQInit. This figure allows to see the three learning stages of LRMax reported in Section 5.
We also ran Lifelong RL experiments in the maze grid-world of Figure 8. The tasks consists in reaching the goal cell labeled with a “G” while the initial state of the agent is the central cell, labeled with an “S”. Two walls configurations are possible, yielding two different tasks with probability of being sampled in the Lifelong RL setting. The first task corresponds to the case where orange walls are actually walls and green cells are normal white cells where the agent can go. The second task is the converse, where green walls are walls and orange cells are normal white cells. We run 100 episodes of length 15 time steps and sample a total of 30 different tasks. Results can be found in Figure 9. In this experiment, we observe the increase of performance of LRMax as the value of
decreases. The three stages behavior of LRMax reported in Section 5 does not appear in this case. We tested the performance of using the online estimation of the local model distances of Proposition 7 in the algorithm referred by LRMax in Figure 9. Once enough tasks have been sampled, the estimate on the model local distance is used with high confidence on its value and refines the upper bound computed analytically in Equation 7. Importantly, this instance of LRMax achieved the best result in this particular environment, demonstrating the usefulness of this result. This method being similar to the MaxQInit estimation of maximum Q-values, we unsurprisingly observe that both algorithms feature a similar performance in the maze environment.
Consider two MDPs . Each time a state-action pair
is updated, we compute the local distance upper bound
(Equation 7) for all
. In this computation, one can leverage the knowledge of
. We show that LRMax relies less and less on
as knowledge on the current task increases. For this experiment, we used the two grid-worlds environments displayed in Figures 10a and 10b.
Figure 7: Results of the corridor Lifelong RL experiment with 95% confidence interval.
Figure 8: The maze grid-world environment. The walls correspond to the black cells and either the green ones or the orange ones.
The rewards collected with any actions performed in the teal cells of both tasks are defined as:
where are the coordinates of the current state,
the coordinate of the goal cell labelled with a G and
parameter equal to 1 in the first environment and 1.5 in the second environment. The agent starts at the cell labelled with the S letter. Black cells represent unreachable cells (walls). We run LRMax twice on the two different maze grid-worlds and record for each model update the proportion of times
is smaller than
in Figure 11 via the % use of
With maximum value ,
is systematically lesser than
, resulting in 0% use. Conversely, with minimum value
, the use expectedly increases to 100%. The in-between value of
displays a linear decay of the use. This suggests that, at each update,
is only true for one more unique s, a pair, resulting in a constant decay of the use. With fewer prior (
), updating one single s, a pair allows
to drop under
for more than one pair, resulting in less use of the prior knowledge. The conclusion of this experiment if that
useful at the beginning of the exploration, while LRMax relies more on its own bound
when partial knowledge of the task has been acquired.
Figure 9: Averaged discounted return over tasks for the maze grid-world Lifelong RL experiment.
Figure 10: The two grid-worlds of the prior use experiment.
Figure 11: Proportion of times where , use of the prior, vs computation of the Lipschitz bound. Each curve is displayed with 95% confidence intervals.
Table 1: Summary of the number of experiment repetition, number of sampled tasks, number of episodes, maximum length of episodes and upper bounds on the number of collected samples.
We used . Theoretically,
should be a lot larger (
) in order to reach an accuracy
according to Strehl, Li, and Littman (2009). However, it is common practice to assume such small values of
are sufficient to reach an acceptable model accuracy
. Interestingly, empirical validation did not confirm this assumption for any RMax-based algorithm. We keep these values nonetheless for the sake of comparability between algorithms and consistency with the literature. Despite such absence of accuracy guarantees, RMax-based algorithms still perform surprisingly well and are robust to model estimation uncertainties.
For the experiments run in Section 5, the computing infrastructure used was a laptop using a single 64-bit CPU (model: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz). The collected samples sizes and number of evaluation runs for each experiment is summarized in Table 1. The displayed confidence intervals for any curve presented in the paper is the 95% confidence interval (Neyman 1937) on the displayed mean. No data were excluded neither pre-computed. Hyper-parameters were determined to our appreciation, they may be sub-optimal but we found the results convincing enough to display interesting behaviors.