Importance Weighted Transfer of Samples in Reinforcement Learning

2018·Arxiv

Abstract

Abstract

We consider the transfer of experience samples (i.e., tuples ) in reinforcement learning (RL), collected from a set of source tasks to improve the learning process in a given target task. Most of the related approaches focus on selecting the most relevant source samples for solving the target task, but then all the transferred samples are used without considering anymore the discrepancies between the task models. In this paper, we propose a model-based technique that automatically estimates the relevance (importance weight) of each source sample for solving the target task. In the proposed approach, all the samples are transferred and used by a batch RL algorithm to solve the target task, but their contribution to the learning process is proportional to their importance weight. By extending the results for importance weighting provided in supervised learning literature, we develop a finite-sample analysis of the proposed batch RL algorithm. Furthermore, we empirically compare the proposed algorithm to state-of-the-art approaches, showing that it achieves better learning performance and is very robust to negative transfer, even when some source tasks are significantly different from the target task.

1. Introduction

The goal of transfer in Reinforcement Learning (RL) (Sut- ton & Barto, 1998) is to speed-up RL algorithms by reusing knowledge obtained from a set of previously learned tasks. The intuition is that the experience made by learning source tasks might be useful for solving a related, but different, target task. Transfer across multiple tasks may be achieved in different ways. The available approaches differ in the type of information transferred (e.g., samples, value functions, parameters, policies, etc.) and in the criteria used to establish whether such knowledge could be beneficial for solving the target or not.

This work focuses on the problem of transferring samples from a set of source MDPs to augment the dataset used to learn the target MDP. To motivate our approach, consider a typical learning scenario where samples are costly to obtain. This is often the case in robotics applications, where the interaction with the real environment could be extremely time-consuming, thus reducing the number of samples available. The typical remedy of adopting a simulator often leads to sub-optimal solutions due to the differences with respect to the real environment. A more effective approach is to transfer the simulated samples to speed-up learning in the target task.

The transfer of samples has been widely studied in the supervised learning community. In particular, Crammer et al. (2008) formalized the problem from a theoretical perspective and provided generalization bounds for the transfer scenario. An interesting result is a trade-off between the number of tasks from which to transfer and the total number of samples. In RL, Taylor et al. (2008) and Lazaric et al. (2008) proposed almost simultaneously methods to transfer single samples. While the former method focused on a model-based approach, the latter one proposed a selective approach to transfer samples into a batch RL algorithm (e.g., Fitted Q-Iteration (Ernst et al., 2005)). Furthermore, Lazaric et al. (2008) considered a model-free approach to compute a similarity measure between tasks, which was used to decide which samples to transfer. More recently, Lazaric & Restelli (2011) analyzed the transfer of samples in batch RL from a theoretical perspective, demonstrating again the trade-off between the total number of samples and the number of tasks from which to transfer. Finally, Laroche & Barlier (2017) proposed a way to transfer all the samples to augment the dataset used by Fitted QIteration. The limitation of this approach resides in the restrictive assumption that all the tasks are assumed to share the same transition dynamics and differ only in the reward function. For a survey on transfer in RL, we refer the reader to (Taylor & Stone, 2009; Lazaric, 2012).

One of the main drawbacks of many previous works is that, even after a detailed selection, transferred samples are used in the target task without accounting for the differences between the original (source) MDP and the target one, thus introducing a bias even in the asymptotic case. In this paper, we present a novel approach to transfer samples into a batch RL algorithm. Unlike other works, we do not assume any particular similarity between tasks besides a shared state-action space, and we develop a new model-based methodology to automatically select the relevance (importance weight) of each sample. Existing algorithms for transferring across different state-action spaces (e.g., Taylor et al., 2007) can be straightforwardly combined to our method. Our approach transfers all the samples, but their impact in solving the target task is proportional to their importance weight. To compute the importance weight of each sample, we rely on a non-parametric estimate of the MDP structure. In particular, we adopt Gaussian processes (Rasmussen & Williams, 2006) to estimate the reward and state transition models of the source and target tasks from samples. Then, we propose a robust way to compute two sets of importance weights, one for the reward model and one for the transition model. We introduce an approximate value iteration algorithm based on Fitted Q-iteration that uses such weights to account for the distribution shift introduced by the different MDPs, thus implicitly selecting which samples have higher priority based on their likelihood to be generated from the target MDP. We provide a theoretical analysis showing the asymptotic correctness of our approach and an empirical evaluation on two classical RL domains and a real-world task.

2. Preliminaries

In this section, we start by introducing our mathematical notation. Then, we recall concepts of Markov decision processes and approximate value iteration. Finally, we formalize the transfer settings considered in this work.

Notation. For a measurable space , we denote by the set of probability measures over and by the space of measurable functions over bounded by , i.e., . Given a probability measure , we define the -norm of a measurable function f as . Let be a Z-valued sequence for some space Z. For , the empirical norm of a function is . Note that when , we have that . When- ever the subscript p is dropped, we implicitly consider the -norm.

Markov Decision Process. We define a discounted Markov Decision Process (MDP) as a tuple M = , where S is a measurable state space, A is a finite set of actions, is the transition probability kernel, is the reward probability kernel, and is the discount factor. We suppose is uniformly bounded by . A Markov randomized policy maps states to distributions over actions as . As a consequence of taking an action in , the agent receives a reward and the state evolves accordingly to . We define the action-value function of a policy as and the optimal action-value function as for all (s, a). Notice that Q is bounded by . Then, the optimal policy is a pol- icy that is greedy with respect to , i.e., for all , . The optimal action-value function is also the unique fixed-point of the optimal Bellman operator , which is defined by (e.g., Puterman, 1994).

Approximate solution. Fitted Q-Iteration (FQI) (Ernst et al., 2005) is a batch RL algorithm that belongs to the family of Approximate Value Iteration (AVI). AVI is a value-based approach that represents Q-functions by a hypothesis space of limited capacity. Starting from an initial action-value function , at each iteration , AVI approximates the application of the optimal Bellman operator in H such that . Formally, let be a set of transitions such that and define the empirical optimal Bellman operator as . Then, at each iteration k, FQI com- putes

Transfer settings. We consider a set of tasks, i.e., MDPs, , where denotes the target and the sources. We suppose all tasks share the same state-action space and have potentially different dynamics and reward. Suppose that, for j = 0, . . . , m, we have access to a dataset of samples from the j-th MDP, , where state-action pairs are drawn from a common distribution .1 The goal of transfer learning is to use the samples in to speed up the learning process in the target task .

3. Importance Weights for Transfer

In this section, we introduce our approach to the transfer of samples. Recall that our goal is to exploit at best samples in

to augment the dataset used by FQI to solve the target task, thus speeding up the learning process. In the rest of the paper we exploit the fact that FQI decomposes the RL problem into a sequence of supervised learning problems. It is easy to notice that the optimization problem (1) is an instance of empirical risk minimization, where are the input data, are the targets, and is a squared loss.

As mentioned in the introduction, we aim to exploit all the available samples to solve the target task. Suppose we adopt a naive approach where we concatenate all the samples, i.e., , to solve (1). This approach suffers from sample selection bias (Cortes et al., 2008), i.e., samples are collected from different distributions or domains. In fact, although we assumed state-action pairs to be sampled from a fixed taskindependent distribution, the target variables Y are distributed according to the MDP they come from.

A standard technique used to correct the bias or discrep- ancy induced by the distribution shift is importance weighting. This technique consists in weighting the loss function to emphasize the error on some samples and decrease it on others, to correct the mismatch between distributions (Cortes et al., 2008). The definition of the impor- tance weight for the point X is w(X) = P(X)/Q(X) where P is the distribution of the target, and Q is the distribution according to which sample X is collected. In our specific case, given an arbitrary sample (X, Y ), its joint distribution under MDP is . Denote by the i-th sample drawn from MDP , then its importance weight is

By feeding FQI on the full dataset with samples weighted by (for short), we get an algorithm that automatically selects which samples to exploit, i.e., those that, based on the importance weights, are more likely to be generated from the target MDP. This approach looks appealing but presents several issues. First, the distribution is, even in the case where the MDPs are known, very hard to characterize. Second, consider a simple case where we have a source MDP with the same transition dynamics as the target, but with entirely different reward. Then, the importance weights defined above are likely to be very close to zero for any source sample, thus making transfer useless. However, we would like a method able to leverage the fact that transition dynamics do not change, thus transferring only that part of the sample.

To overcome the second limitation, we consider the following variation of the FQI algorithm. At the first iteration of FQI, we use all the samples to fit a model of the target reward function:

where is the hypothesis space we consider to represent action-value functions2 and

Problem (2) is unbiased if , though is frequently used since it provides lower vari- ance. The theoretical analysis is not affected by the choice of , while in the experiments we will use . Then, at each iteration , FQI updates the Q-function as:

where and . Intuitively, instead of considering the reward in the dataset , we use . Since the stochasticity due to the reward samples is now removed, only the transition kernel plays a role, and the importance weights are given by:

The resulting algorithm, named Importance Weighted Fitted Q-Iteration (IWFQI), is shown in Algorithm 1. In practice, we have to compute an estimate of and since and are unknown quantities. We postpone this topic to Section 5 since several approaches can be exploited. Instead, in the following section, we present a theoretical analysis that is independent of the way the importance weights are estimated.

4. Theoretical Analysis

We now study the theoretical properties of our IWFQI algorithm. We analyze the case where we have samples from one source task, but no samples from the target task are available, i.e., , and . A generalization to the case where target samples or samples from more sources are available is straightforward, and it only complicates our derivation. To ease our notation, we adopt the subscript “T” and “S” to denote the target and the source. Furthermore, we want to emphasize that the results provided in this section are independent from the way the importance weights are estimated.

Consider the sequence of action-value functions computed by IWFQI. At each iteration k, we incur in an error in approximating the optimal Bellman operator. Our goal is to bound, in terms of such errors, , i.e., the expected error under distribution between the performance of the optimal policy and that of the policy greedy w.r.t. . Here is an arbitrary evaluation distribution over that the user can freely choose. In practice, it might coincide with the sampling distribution . Since IWFQI belongs to the family of AVI algorithms, we can resort to Theorem 3.4 in (Farahmand, 2011). We report here the version with -norm for the sake of completeness.

Theorem 1. (Theorem 3.4 of (Farahmand, 2011)) Let K be a positive integer and . Then, for any se- quence and the correspond- ing sequence , where , we have:

where:

We refer the reader to Chapter 3 of (Farahmand, 2011) for the definitions of the coefficients and .

Intuitively, the bound given in Theorem 1 depends on the errors made by IWFQI in approximating the optimal Bellman operator at each iteration. Thus, our problem reduces to bounding such errors. Cortes et al. (2010) already provided a theoretical analysis of importance weighted regression. However, their results are not immediately applicable to our case since they only consider a regression problem where the target variable Y is a deterministic function of the input X. On the other hand, we have the more general regression estimation problem where Y is a random variable, and we want to learn its conditional expectation given X. Thus, we extend Theorem 4 of (Cortes et al., 2010) to provide a bound on the expected -error between the hypothesis returned by a weighted regressor (with estimated weights ) and the regression function . Following (Cortes et al., 2010), we denote by Pdim(U) the pseudo-dimension of a real-valued function class U. The proof is in the appendix.

Theorem 2. Let be a functional space. Suppose we have a dataset of N i.i.d. samples distributed according to Q(X, Y ) = , while is the target distribution. Assume almost surely. Let be any positive function, , , and , where denotes the empirical expectation on D. Furthermore, assume and . Then, for any , the following holds with probability at least :

∥∥

Notice that this result is of practical interest outside of the reinforcement learning field. Here it is used to bound the errors in order to state the following result.

Theorem 3. Let be a hypothesis space, a distribution over a se- quence of Q-functions as defined in Equation (4), and the optimal Bellman operator of the target task. Suppose to have a dataset of N i.i.d. samples D drawn from the source task according to a joint distribution . Let denote the ideal importance weights defined in (5) and (3), and denote arbitrary positive functions with bounded second moments. Define , , where . Similarly, define , , and for the transition model. Then, for any , with probability at least :

where is the concentrability coefficient of one-step transitions as defined in (Farahmand, 2011, Definition 5.2).

As expected, four primary sources of error contribute to our bound: (i) the bias due to estimated weights (first four terms), (ii) the approximation error (fifth and sixth term), (iii) the estimation error (seventh term), (iv) the propagation error (eighth term). Notice that, assuming to have a consistent estimator for the importance weights (an example is given in Section 5), the bias term vanishes as the number of samples N tends to infinity. Furthermore, the estimation error decreases with N, thus vanishing as the number of samples increases. Thus, in the asymptotic case our bound shows that the only source of error is due to the limited capacity of the functional space H under consideration, as in most AVI algorithms. Furthermore, we notice that fitting the reward function and using it instead of the available samples propagates an error term through iterations, i.e., the approximation error . If we were able to estimate the importance weights for the typical case where both reward and transition samples are used, we could get rid of such error. However, since the resulting weights somehow depend on the joint densities between P and R, we expect their variance, as measured by , to be much bigger, thus making the resulting bound even larger. Furthermore, we argue that, when the reward function is simple enough and only a limited number of samples is available, a separate fit might be beneficial even for plain FQI. In fact, the variance of the empirical optimal Bellman operator can be reduced by removing the source of stochasticity due to the reward samples at the cost of propagating a small approximation error through iterations. The bounds for AVI, (e.g., Munos & Szepesv´ari, 2008; Farah- mand, 2011; Farahmand & Precup, 2012), can be straightforwardly extended to such case by adopting a procedure similar to the one described in the proof of Theorem 3. Finally, in most practical applications the reward function is actually known and, thus, does not need to be fitted. In such cases, it is possible to get rid of the corresponding terms in Theorem 3, allowing transfer to occur without errors even when rewards are completely different between tasks.

5. Estimation of Importance Weights

In this section, we specify how to compute the importance weights. Since P and R are unknown, we only have access to an estimation of and used in (3) and (5), respectively. To obtain an approximation of the unknown densities, we consider Gaussian Processes (GPs) although any distribution matching technique and/or probabilistic model can be used.

Gaussian Processes. We use the available samples to fit two Gaussian processes (GPs) (Rasmussen & Williams, 2006) for each task : one for the transition model and one for the reward model . To motivate our choice, GPs have been successfully adopted to model stochastic dynamical systems with high-dimensional and continuous state-action spaces in many existing works (e.g., Kuss & Rasmussen, 2004; Deisenroth & Rasmussen, 2011; Doshi- Velez & Konidaris, 2016; Berkenkamp et al., 2017). For the sake of simplicity, we only show how to compute the importance weights for the reward model. Our procedure straightforwardly generalizes to the transition model.

Given a sample from the j-th task, the j-th GP returns a Gaussian distribution over the reward’s mean, i.e., , which, together with the target GP’s prediction, induces a distribution over the importance weights. In practice, the choice of a single importance weight can rely on some statistics of such distribution (e.g., its mean or mode). Perhaps not surprisingly, this is made non-trivial by the fact that explicitly characterizing such distribution is very complicated, and computing empirical statistics requires an expensive repeated sampling from the GPs’ posteriors. Interestingly, the following theorem shows that this is not necessary when the reward model follows a Gaussian law, as the expected weights under their unknown distribution can be computed in closedform.

Theorem 4 (Reward Weights in Gaussian Models). Assume each task to have Gaussian reward distribution with unknown

mean. Given the available samples in , we build an estimate of the reward distribution such that, for any MDP . Then, given a sample from the j-th MDP, its importance weight , where G is the distribution induced by the GPs’ predictions. Let C =

and suppose , then

The proof is in Appendix A. In practice, we estimate the importance weights by taking their expectation as in (6), i.e., . Intuitively, using the expected weights is more robust than merely taking the ratio of the estimated densities. Furthermore, the estimated weights converge to the true ones when the GP predictions are perfect, i.e., when and , both in the source and in the target. This is a significant advantage over the more common approach of density-ratio estimation (Sugiyama et al., 2012), where a parametric form for the weight function is typically assumed. One drawback is that the expectation diverges when , that is, when the source GP has a prediction variance that is greater than the intrinsic noise of the model. Notice, however, that this happens very rarely since the source GP is never asked to predict samples it has not seen during training. Furthermore, since in practice the model noise is unknown and has to be estimated, an overestimation is ben-eficial as it introduces a regularization effect (Mohammadi et al., 2016), thus avoiding the problem mentioned above.

6. Related Work

In (Taylor et al., 2008), the authors propose a method to transfer samples for model-based RL. Although they assume tasks might have different state-action space, inter-task mappings are used to map source samples to the target. However, the proposed method does not account for differences in the transition or reward models, which could lead to significant negative transfer when the tasks are different. Our approach, on the other hand, can selectively discard samples based on the estimated difference between the MDPs. Lazaric et al. (2008) compute a compliance measure between the target and source tasks and use it to specify from which tasks the transfer is more likely to be beneficial. Furthermore, a relevance measure is computed within each task to determine what are the best samples to transfer. These two measures are then combined to transfer samples into FQI. Once again, our approach does not require any explicit condition to decide what to transfer, nor does it require any assumption of similarity between the tasks. Furthermore, the compliance and relevance measures computed in (Lazaric et al., 2008) jointly account for both the reward and transition models, thus discarding samples when either one of the models is very different between the tasks. On the other hand, our approach can retain at least the part of the sample that is similar, at the cost of introducing a small bias. In (Laroche & Barlier, 2017), the authors propose a technique for transferring samples into FQI under the assumption that the transition dynamics do not change between the tasks. Similarly to our method, they learn the reward function at the first iteration and substitute the predicted values to the reward samples in the dataset. This allows them to safely adopt the full set of target and source samples in the remaining FQI iterations, as all tasks share the same transition model and, thus, samples are unbiased. However, we argue that this assumption of shared dynamics indeed limits the applicability of the transfer method to most real-world tasks.

In the supervised learning literature, Crammer et al. (2008) analyzed the transfer problem from a theoretical perspective and extended the classical generalization bounds to the case where samples are directly transferred from a set of source tasks. The most relevant result in their bounds is a trade-off between the total number of samples transferred and the total number of tasks from which transfer occurs. Increasing the first term decreases the variance, while it is likely to increase the bias due to the differences between the tasks. On the other hand, decreasing the first term also decreases the bias, but it is likely to increase the variance due to the limited number of samples. We observe that such trade-off does not arise in our case. Our method transfers all samples while accounting for the differences between the tasks. The only bias term is due to the errors in the estimation of the task models, which is likely to decrease as the number of samples increases.

Another work from the supervised learning literature that is related to our approach is (Garcke & Vanck, 2014). The authors proposed a method to transfer samples from a different dataset and used importance weighting to correct the distribution shift. However, they leveraged ideas from density-ratio estimation (e.g., Sugiyama et al., 2012) and supposed the weight function to have a given parametric form, thus directly estimating it from the data. Conversely, we estimate the densities involved and try to characterize the weight distribution, taking its expectation as our final estimate.

7. Experiments

We evaluate IWFQI on three different domains with increasing level of complexity. In all experiments, we compare our method to two existing algorithms for transferring samples into FQI: the relevance-based transfer (RBT) algorithm of (Lazaric et al., 2008) and the shared-dynamics transfer (SDT) algorithm of (Laroche & Barlier, 2017).

7.1. Puddle World

Our first experimental domain is a modified version of the puddle world environment presented in (Sutton, 1996). Puddle world is a discrete-action, continuous-state (stochastic) navigation problem (see Appendix C.1 for a complete description). At each time-step, the agent receives a reward of plus a penalization proportional to the distance from all puddles. Each action moves the agent by in the corresponding direction. In particular, we con-

Figure 1. Puddle world with episodes transferred from 3 source tasks in the case of shared dynamics (left) and puddle-based dynamics (right).

Figure 2. Acrobot swing-up with (100 + 50) episodes transferred from 2 source tasks. (left) learning performance. (right) relative number of samples transferred from each source task.

sider two versions of the environment: (i) shared dynamics, where is fixed, and (ii) puddle-based dynamics, where slows-down the agent proportionally to the distance from all puddles.

We consider three source tasks and one target task, where each task has different puddles in different locations (see Appendix C.1). For each source task, we generate a dataset of 20 episodes from a nearly-optimal policy. We run IWFQI with weights computed according to Equation (6), where we set the model noise to be ten times the true value. For evaluating our weight estimation procedure, we also run IWFQI with ideal importance weights (computed as the ratio of the true distributions). In each algorithm, FQI is run for 50 iterations with Extra-Trees (Ernst et al., 2005). An -greedy policy () is used to collect data in the target task.

Shared dynamics. We start by showing the results for in Figure 1(left). All results are averaged over 20 runs and are reported with 95% confidence intervals. As expected, FQI alone is not able to learn the target task in such a small number of episodes. On the other hand, IWFQI has a good jump-start and converges to an optimal policy in only 20 episodes. Interestingly, IWFQI with ideal weights has almost the same performance, thus showing the robustness of our weight estimation procedure. RBT also learns the optimal policy rather quickly. However, the limited number of target and source samples available in this experiment makes it perform significantly worse in the first episodes. Since in this version of the puddle world the dynamics do not change between tasks, SDT also achieves good performance, converging to a nearly-optimal policy.

Puddle-based dynamics. We also show the results for the more challenging version of the environment were puddles both penalize and slow-down the agent (see Figure 1(right)). Notice that, in this case, transition dynamics change between tasks, thus making the transfer more challenging. Similarly, as before, our approach quickly learns the optimal policy and is not affected by the estimated weights. Furthermore, the benefits of over-estimating the model noise can be observed from the small improvement over IWFQI-ID. RBT is also able to learn the optimal policy. However, the consequences of inaccurately computing compliance and relevance are more evident in this case, where the algorithm negatively transfers samples in the first episodes. Finally, SDT still shows an improvement over plain FQI, but it is not able to learn the optimal policy due to the bias introduced by the different dynamics.

7.2. Acrobot

Acrobot (Sutton & Barto, 1998) is a classic control problem where the goal is to swing-up a two-link pendulum by applying positive or negative torque to the joint between the two links. Due to its non-linear and complex dynamics, Acrobot represents a very challenging problem, requiring a considerable amount of samples to be solved. In this experiment, we consider a multi-task scenario where robots might have different link lengths and masses . Our target task is the classic Acrobot swing-up problem, where the robot has lengths (1.0, 1.0) and masses (1.0, 1.0). Furthermore, we consider two source tasks. The first is another swing-up task where the robot has lengths (1.1, 0.7) and masses (0.9, 0.6). The second is a constant-spin task, where the goal is to make the first joint rotate at a fixed constant speed, with lengths (0.95, 0.95) and masses (0.95, 1.0). The exact definition of the tasks’ dynamics and rewards is in Appendix C.2. Notice the intrinsic difficulty of transfer: the first source task has the same reward as the target but very different dynamics, and conversely for the second source task. Using nearly-optimal policies, we generate 100 episodes from the first source and 50 episodes from the second. We run all algorithms (except SDT since the problem violates the shared-dynamics assumption) for 200 episodes and average over 20 runs. Results are shown in Figure 2(left). We notice that both our approach and RBT achieve a good jump-start and learn faster than plain FQI. However, to better investigate how samples are transferred, we show the transfer ratio from each source task in Figure 2(right). Since RBT transfers rewards and transitions jointly, it decides to compensate the highly biased reward samples from the constant-spin task by over-sampling the first source task. However, it inevitably introduces bias from the different dynamics. Our approach, on the other hand, correctly transfers almost all reward samples from the swing-up task, while discarding those from the constant-spin task. Due to transition noise over-estimation, IWFQI achieves an interesting adaptive behaviour: during the initial episodes, when few target samples are available, and the GPs are inaccurate, more samples are transferred. This causes a reduction of the variance in the first phases of learning that is much greater than the increase of bias. However, as more target samples are available, the transfer becomes useless, and our approach correctly decides to discard most transition samples, thus minimizing both bias and variance.

7.3. Water Reservoir Control

In this experiment, we consider a real-world problem where the goal is to learn how to optimally control a water reservoir system. More specifically, the objective is to learn a per-day water release policy that meets a given demand while keeping the water level below a flooding threshold. Castelletti et al. (2010) successfully addressed such problem by adopting batch RL techniques. However, the authors proved that, due to the highly non-linear and noisy environment, an enormous amount of historical data is needed to achieve good performance. Consider now the case where a new water reservoir, for which no historical data is available, needs to be controlled. Since each sample corresponds to one day of release, learning by direct interaction with the environment is not practical and leads to poor control policies during the initial years, when only a little experience has been collected. Although we do not know the new environment, it is reasonable to assume that we have access to operational data from existing reservoirs. Then, our solution is to transfer samples to immediately achieve good performance. However, such reservoirs might be located in very different environments and weight objectives differently, thus making transfer very challenging.

We adopt a system model similar to the one proposed in (Castelletti et al., 2010). The state variables are the current water storage and day , while there are 8 discrete actions, each corresponding to a particular release decision. The system evolves according to the simple mass balance equation , where is the net inflow at day t and is modeled as periodic function, with period of one year, plus Gaussian noise. Given the demand d and the flooding threshold f, the reward function is a convex combination of the two objectives, , where . Different tasks have different inflow functions and different reward weights, which model different geographic regions and objectives, respectively.

Figure 3. Water reservoir control. Average cost per day during the first 10 years of learning. IWFQI outperforms the expert and quickly achieves near-optimal performance.

We collected 10800 samples, corresponding to 30 years of historical data, from each of 6 source water reservoirs under a hand-coded expert policy. Further details about the tasks are given in Appendix C.3. We compared our approach to FQI and RBT over the first 10 years of learning. An -greedy policy () was used to collect batches of 1 year of samples, except for the first batch, for which an expert’s policy was used. Results, averaged over 20 runs, are shown in Figure 3. We notice that IWFQI immediately outperforms the expert’s policy and quickly achieves near-optimal performance. RBT, on the other hand, has a good jump-start but then seems to worsen its performance. Once again, this is because each source task has at least few samples that can be transferred. However, selecting such samples is very complicated and leads to negative transfer in case of failure. Finally, FQI performs significantly worse than all alternatives and is, thus, not reported.

8. Conclusions

In this paper, we presented Importance Weighted Fitted QIteration, a novel AVI algorithm for transferring samples in batch RL that uses importance weighting to automatically account for the difference in the source and target distributions. IWFQI exploits Gaussian processes to learn transition and reward models that are used to compute the importance weights. The use of two different processes for reward and transition models allows maximizing the information transferred. We theoretically investigated IWFQI showing (i) its asymptotic correctness in general settings, and (ii) how to compute a robust statistical estimate of the weights for Gaussian models. Finally, we empirically proved its effectiveness in common benchmarks and on a real-world water control problem.

One of the drawbacks of our method is that it does not fully exploit possible similarities between tasks. Recent approaches (e.g., Doshi-Velez & Konidaris, 2016; Killian et al., 2017) learn models relating a family of tasks to ease the transfer of knowledge. Exploring how such relations can benefit our approach (e.g., to improve the weight estimates) is an interesting line for future developments.

References

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems 30, pp. 908–919. Curran Associates, Inc., 2017.

Bromiley, P. Products and convolutions of gaussian proba- bility density functions. 2003.

Castelletti, A., Galelli, S., Restelli, M., and Soncini-Sessa, R. Tree-based reinforcement learning for optimal water reservoir operation. Water Resources Research, 46(9), 2010.

Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pp. 38–53. Springer, 2008.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450, 2010.

Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.

Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472, 2011.

Doshi-Velez, F. and Konidaris, G. Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: proceedings of the conference, volume 2016, pp. 1432. NIH Public Access, 2016.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farahmand, A.-m. Regularization in reinforcement learning. PhD thesis, University of Alberta, 2011.

Farahmand, A. M. and Precup, D. Value pursuit iteration. In Advances in Neural Information Processing Systems, pp. 1340–1348, 2012.

Garcke, J. and Vanck, T. Importance weighted inductive transfer learning for regression. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 466–481. Springer, 2014.

Gy¨orfi, L., Kohler, M., Krzyzak, A., and Walk, H. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.

Killian, T. W., Daulton, S., Doshi-Velez, F., and Konidaris, G. Robust and efficient transfer learning with hidden parameter markov decision processes. In Advances in Neural Information Processing Systems 30, pp. 6251– 6262, 2017.

Kuss, M. and Rasmussen, C. E. Gaussian processes in rein- forcement learning. In Advances in Neural Information Processing Systems 16, pp. 751–758. MIT Press, 2004.

Laroche, R. and Barlier, M. Transfer reinforcement learn- ing with shared dynamics. In AAAI, pp. 2147–2153. AAAI Press, 2017.

Lazaric, A. Transfer in reinforcement learning: a frame- work and a survey. In Reinforcement Learning, pp. 143– 173. Springer, 2012.

Lazaric, A. and Restelli, M. Transfer from multiple mdps. In Advances in Neural Information Processing Systems, pp. 1746–1754, 2011.

Lazaric, A., Restelli, M., and Bonarini, A. Transfer of sam- ples in batch reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pp. 544–551. ACM, 2008.

Mohammadi, H., Riche, R. L., Durrande, N., Touboul, E., and Bay, X. An analytic comparison of regularization methods for gaussian processes. arXiv preprint arXiv:1602.00853, 2016.

Munos, R. and Szepesv´ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994. ISBN 0471619779.

Rasmussen, C. E. and Williams, C. K. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation in machine learning. Cambridge University Press, 2012.

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pp. 1038–1044, 1996.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

Taylor, M. E. and Stone, P. Transfer learning for reinforce- ment learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

Taylor, M. E., Stone, P., and Liu, Y. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(Sep):2125– 2167, 2007.

Taylor, M. E., Jong, N. K., and Stone, P. Transferring in- stances for model-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 488–505. Springer, 2008.

Tosatto, S., Pirotta, M., D’Eramo, C., and Restelli, M. Boosted fitted q-iteration. In International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3434–3443. PMLR, 2017.

A. Gaussian Models: remarks and proofs

In this section we consider the case where the transition and reward models are Gaussian distributions.

A task is jointly Gaussian when it has:

In this setting, by using a Gaussian process to estimate the transition and reward models, we can compute the expected importance weights, where the expectation is taken under the distribution induced by the GPs’ predictions, in closed form. Note that, since reward and transition weights are independent, we can independently consider Assumptions 7 and 8 for the computation of and , respectively.

Theorem 4 (Reward Weights in Gaussian Models). Assume each task to have Gaussian reward distribution N

with unknown mean. Given the available samples in , we build an estimate of the reward distribution such that, for any MDP . Then, given a sample from the j-th MDP, its importance weight , where G is the distribution induced by the GPs’

predictions. Let and suppose , then

Proof. In order to simplify our notation, we consider the dependence on (s, a, r) implicit. We have:

The first integral is over the product of two Gaussian densities, which is known to be (Bromiley, 2003):

where the values of the mean and variance of the first density are not important to complete this proof since such density integrates out. By adopting the same procedure as the one described in (Bromiley, 2003), we can write the ratio of the two Gaussian densities in the second integral as:

where, again, the values of and are not relevant to our proof. Finally, we can write:

We can derive a similar result for the transition model by considering Assumption 8.

Theorem 5 (Transition Weights in Gaussian Models). Assume each task to have Gaussian transition distribution with unknown mean. Furthermore, suppose that . Given the available samples in , we build an estimate of the transition distribution such that, for any MDP and state component . Then, given a sam-

ple from the j-th MDP, the importance weight for the transition model given by

is distributed according to some distribution G induced by the GPs’ predictions. Let and suppose

σs, a) < δs, a) for all d, then

Proof. Notice that the density decomposes into:

By noticing that the d-th GP for task j provides an independent estimate of the transition mean’s d-th component, , the proof of Theorem 4 can be straightforwardly applied to each component of , thus proving the theorem.

B. Proofs of Theorems

Theorem 2. Let be a functional space. Suppose we have a dataset of N i.i.d. samples D = distributed according to , while is the target distribution. Assume almost surely. Let be any positive function, , and

, where denotes the empirical expectation on D. Furthermore, assume and . Then, for any , the following holds with probability at least :

Proof. Applying H¨older’s inequality, for all :

Thus, by applying Corollary 1 of (Cortes et al., 2010) to , we can write:

Let us now expand the left-hand side of (14):

where is a constant term (independent of ) and is the regression function weighted by . Plugging this into (14) we get:

Consider now the hypothesis such that . Since is in H and was defined as the hypothesis minimizing the empirical weighted loss, we have:

Similarly to what we did for , we can bound the empirical error of . According to Corollary 1 of (Cortes et al., 2010), we have that for any , with probability at least :

By adopting (15) to expand the expected error of , we obtain:

If we now put (16) and (19) together by means of (17), we get that, with probability at least :

By taking the square root of both sides of (20) and usingfor , we obtain:

We can now bound the expected error of with respect to by:

We already provided a bound on the first term, so let us analyze the second one. We have:

where the first inequality is due to Jensen’s inequality. Thus, . By combining (21), (22), (23), and (24), we have:

which concludes the proof.

Lemma 1. Let be a functional space. Call and , where and D is a dataset of N i.i.d. samples. Assume and . Let be as defined in (2). Then, for any , with probability at least :

Proof. The result follows straightforwardly by applying Theorem 2.

Lemma 2. Let be a functional space. Call and , where and D is a dataset of N i.i.d. samples. Assume and . Let be as defined in (4) and denote . Then, for any , with probability at least :

Proof. The result follows straightforwardly by applying Theorem 2.

Theorem 3. Let be a hypothesis space, a distribution over a sequence of Q-functions as defined in Equation (4), and the optimal Bellman operator of the target task. Suppose to have a dataset of N i.i.d. samples D drawn from the source task according to a joint distribution . Let denote the ideal importance weights defined in (5) and (3), and denote arbitrary positive functions with bounded second moments. Define , where . Similarly, define , and for the transition model. Then, for any , with probability at least :

where is the concentrability coefficient of one-step transitions as defined in (Farahmand, 2011, Definition 5.2).

Proof. We can decompose the error at iteration k into:

where, for any pair (s, a), is the optimal Bellman operator of the target task using the approximated reward function defined in (2). The two terms in (27) can be bounded straightforwardly by applying Lemma 1 and Lemma 2, respectively. The application of Lemma 2 to the second term gives rise to , which can be further bounded by noticing that:

The second term in (28) is again , while the first term has already been bounded in Theorem 5.3 of (Farahmand, 2011):

Then, by combining the bounds from Lemma 1 and Lemma 2 with (28) and (29), we can write:

where we recall that gs, as, a)|s, a, Ms, as, a, and

C. Additional Details on the Experiments

C.1. Puddle World

Our first experimental domain is a modified version of the puddle world environment presented in (Sutton, 1996). Puddle world is a two-dimensional continuous grid with a goal area and some elliptical “puddles”. The goal is to drive the agent from a starting position to the goal area while avoiding the puddles. The state-space is , while the action-space is discrete and allows the agent to move in the four cardinal directions. At each time-step, the agent receives a reward of plus a penalization proportional to the distance from all puddles: , where U is the set of puddles and is the weight of puddle u for state s. In the goal the reward is zero. In our experiments, we modeled as a bivariate Gaussian. Each action moves the agent by in the corresponding direction. In particular, we consider two versions of the environment: I) shared dynamics where and II) puddle-based dynamics where puddles also slow-down the agent by: . Finally, a white Gaussian noise of and is added to the reward and the transition model, respectively. In our experiments we set and a maximum horizon of 50 time-steps.

Figure 4. From left to right: the target task and the three source tasks. The agent always starts in the bottom-left corner and must reach the goal area (shown in green). Puddles are shown in black.

We provide additional details on the puddle world experiments. The target task and the three source tasks can be seen in Figure 4. Notice that the optimal paths to solve each task have at least a small overlapping, thus allowing some knowledge transfer. However, the optimal policy for one task is likely to cross a puddle if carelessly used in another domain. This makes the transfer problem more challenging since the algorithm has to figure out which samples should be retained and which should be discarded.

In both experiments, 20 episodes were generated beforehand from each source task. For IWFQI, a Gaussian process was fitted on each of the three source datasets using the squared exponential kernel. The noises of the reward and transition models were estimated as 10 times their true value. In each algorithm, FQI was run for 50 iterations using Extra-Trees with 50 estimators and a minimum of 2 samples to split a node. Results were averaged over 20 independent runs.

C.2. Acrobot

We provide a precise description of the two tasks used in the Acrobot experiment. For both tasks, the state-space is composed of the two link angles and their velocities . The transition dynamics are the ones described in (Sutton & Barto, 1998). The agent can only apply a torque of +2 or to the joint between the two links. The initial state is , where . Performance is evaluated starting from multiple states , with evenly spaced in . The swing-up task has reward:

and terminates whenever or 100 time-steps are reached. The constant-spin task has reward:

and terminates whenever 100 time-steps are reached.

We collect 100 episodes from the first source task (corresponding to 3400 samples) and 50 episodes from the second source task (corresponding to 5000 samples). For all algorithms, FQI uses extra-trees with 50 estimators and a minimum of 20 samples to split a node. Data is collected in batches of 10 episodes using an -greedy policy (). For IWFQI, GPs use the squared exponential kernel with parameters estimated by maximum likelihood on the data.

To further demonstrate the advantages of our approach, we show what happens when only the constant-spin source task is available. Clearly, most of the reward samples should be discarded, and conversely for the transition samples. As we can see from Figure 5a and 5b, RBT now performs significantly worse than FQI. This is due to the fact that, by transferring samples jointly, it cannot avoid introducing bias. Our approach, on the other hand, is able to discard the reward samples, thus being robust to negative transfer. Furthermore, it achieves a little improvement over FQI due to the few samples transferred.

C.3. Water Reservoir Control

All tasks used in this experiment are modeled according to the dynamics described in the paper. For the sake of simplicity, each water reservoir is supposed to have capacity of , minimum storage of , flooding threshold of , and per-day demand of . Due to the different geographic locations, each task has different inflow function , where is the fixed noise variance. The different mean-inflow functions are shown in Figure 6a. Furthermore, each water reservoir weighs the flooding and demand objectives differently. This is

Figure 5. Transfer of samples from the constant-spin task to the swing-up task. (a) discounted expected reward and (b) number of steps before reaching a goal state.

modeled by changing the respective weights and . The values for all tasks are reported in Table 1. Notice that there is no source task that is globally similar to the target: either some reward structure is shared or some transition structure is, never both. This makes transfer very challenging since samples have to be accurately selected to prevent detrimental consequences.

In this experiment, we run FQI using extra-trees for 80 iterations with 100 estimators and a minimum of 10 samples to split a node. GPs use the anisotropic squared exponential kernel. For each of the 6 source reservoirs, we gather 30 years of historical data where controls are applied by a human operator’s policy. We learn the target task by collecting batches of 1 year, each using an -greedy policy () on the previously learned Q-function. Evaluation is performed by averaging 10 trajectories of 1 year, each starting from January and with an initial storage of of water.

To better demonstrate the difficulty of this control problem, we run FQI for 500 episodes (equivalent to 500 years of interaction). Furthermore, to make the problem simpler, we allow the agent to sample the state-action space arbitrarily, so as to have a better exploration. The result is shown in Figure 6b. Although we significantly simplified the problem and we allowed FQI to gather an enormous amount of data, the algorithm still needs almost 500 years to achieve optimal performance. This demonstrates that solving this task by directly interacting with the real environment is clearly impractical. Thus, transfer of previous knowledge is, in this case, mandatory to achieve good performance.

Table 1. Reward parameters for the different water reservoirs.

Figure 6. Water reservoir control. (a) Inflow profiles for all tasks. (b) Learning without transfer for 500 years.

designed for accessibility and to further open science