b

DiscoverSearch
About
My stuff
Lipschitz Continuity in Model-based Reinforcement Learning
2018·arXiv
Abstract
Abstract

We examine the impact of learning Lipschitz continuous models in the context of model-based reinforcement learning. We provide a novel bound on multi-step prediction error of Lipschitz models where we quantify the error using the Wasserstein metric. We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Lipschitz. We conclude with empirical results that show the benefits of controlling the Lipschitz constant of neural-network models.

The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of the environment to plan and make high-quality decisions (Kaelbling et al., 1996; Sutton & Barto, 1998). Although the behavior of model-based algorithms in tabular environments is well understood and can be effective (Sutton & Barto, 1998), scaling up to the approximate setting can cause instabilities. Even small model errors can be magnified by the planning process resulting in poor performance (Talvitie, 2014).

In this paper, we study model-based RL through the lens of Lipschitz continuity, intuitively related to the smoothness of a function. We show that the ability of a model to make accurate multi-step predictions is related to the model’s one-step accuracy, but also to the magnitude of the Lipschitz constant (smoothness) of the model. We further show that the dependence on the Lipschitz constant carries over to the value-prediction problem, ultimately influencing the quality of the policy found by planning.

We consider a setting with continuous state spaces and stochastic transitions where we quantify the distance between distributions using the Wasserstein metric. We introduce a novel characterization of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set of component deterministic functions. This allows us to study any stochastic dynamic using the Lipschitz continuity of its component deterministic functions. To learn a Lipschitz model class in continuous state spaces, we provide an Expectation-Maximization algorithm (Dempster et al., 1977).

One promising direction for mitigating the effects of inaccurate models is the idea of limiting the complexity of the learned models or reducing the horizon of planning (Jiang et al., 2015). Doing so can sometimes make models more useful, much as regularization in supervised learning can improve generalization performance (Tibshirani, 1996). In this work, we also examine a type of regularization that comes from controlling the Lipschitz constant of models. This regularization technique can be applied efficiently, as we will show, when we represent the transition model by neural networks.

We consider the Markov decision process (MDP) setting in which the RL problem is formulated by the tuple ⟨S, A, R, T, γ⟩. Here, by S we mean a continuous state space and by A we mean a discrete action set. The functions R : S ×A → R and T: S × A → Pr(S)denote the reward and transition dynamics. Finally,  γ ∈ [0, 1)is the discount rate. If |A| = 1, the setting is called a Markov reward process (MRP).

2.1. Lipschitz Continuity

Our analyses leverage the “smoothness” of various functions, quantified as follows.

Definition 1. Given two metric spaces  (M1, d1)and (M2, d2)consisting of a space and a distance metric, a function  f : M1 �→ M2 isLipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as

image

is finite.

image

Figure 1. An illustration of Lipschitz continuity. Pictorially, Lipschitz continuity ensures that f lies in between the two affine functions (colored in blue) with slopes  K and −K.

Equivalently, for a Lipschitz f,

image

The concept of Lipschitz continuity is visualized in Figure 1.

A Lipschitz function f is called a non-expansion when Kd1,d2(f) = 1and a contraction when  Kd1,d2(f) < 1. Lipschitz continuity, in one form or another, has been a key tool in the theory of reinforcement learning (Bertsekas, 1975; Bertsekas & Tsitsiklis, 1995; Littman & Szepesv´ari, 1996; uller, 1996; Ferns et al., 2004; Hinderer, 2005; Rachelson & Lagoudakis, 2010; Szepesv´ari, 2010; Pazis & Parr, 2013; Pirotta et al., 2015; Pires & Szepesv´ari, 2016; Berkenkamp et al., 2017; Bellemare et al., 2017) and bandits (Kleinberg et al., 2008; Bubeck et al., 2011). Below, we also define Lipschitz continuity over a subset of inputs.

Definition 2. A function  f : M1 × A �→ M2is uniformly Lipschitz continuous in A if

image

is finite.

Note that the metric  d1is defined only on  M1.

2.2. Wasserstein Metric

We quantify the distance between two distributions using the following metric:

Definition 3. Given a metric space (M, d) and the set P(M) of all probability measures on M, the Wasserstein metric (or the 1st Kantorovic metric) between two probability distributions  µ1 and µ2 in P(M)is defined as

image

where  Λdenotes the collection of all joint distributions j on M × Mwith marginals  µ1 and µ2 (Vaserstein, 1969).

Sometimes referred to as “Earth Mover’s distance”, Wasserstein is the minimum expected distance between pairs of points where the joint distribution j is constrained to match the marginals  µ1and  µ2. New applications of this metric are discovered in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017).

Wasserstein is linked to Lipschitz continuity using duality:

image

This equivalence, known as Kantorovich-Rubinstein duality (Villani, 2008), lets us compute Wasserstein by maximizing over a Lipschitz set of functions  f : S �→ R, a relatively easier problem to solve. In our theory, we utilize both definitions, namely the primal definition (3) and the dual definition (4).

We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a set of deterministic components.

Definition 4. Given a metric state space  (S, dS)and an action space A, we define  Fgas a collection of functions: Fg = {f : S �→ S}distributed according to g(f | a) where a ∈ A. We say that  Fg isa Lipschitz model class if

image

is finite.

Our definition captures a subset of stochastic transitions, namely ones that can be represented as a state-independent distribution over deterministic transitions. An example is provided in Figure 2. We further prove in the appendix (see Claim 1) that any finite MDP transition probabilities can be decomposed into a state-independent distribution g over a finite set of deterministic functions f.

Associated with a Lipschitz model class is a transition function given by:

image

Given a state distribution  µ(s), we also define a generalized notion of transition function �TG(· | µ, a) given by:

image

image

Figure 2. An example of a Lipschitz model class in a gridworld environment (Russell & Norvig, 1995). The dynamics are such that any action choice results in an attempted transition in the corresponding direction with probability 0.8 and in the neighboring directions with probabilities 0.1 and 0.1. We can define  Fg ={fup, fright, fdown, fleft} where each f outputs a deterministic next position in the grid (factoring in obstacles). For a = up, we have: g(fup | a = up) = 0.8, g(fright | a = up) = g(fleft | a = up) = 0.1, and g(fdown | a = up) = 0. Defining distances between states as their Manhattan distance in the grid, then  ∀f sups1,s2�d(f(s1), f(s2)�/d(s1, s2) = 2, and so KF =2. So, the four functions and g comprise a Lipschitz model class.

image

We are primarily interested in  KAd,d( �TG), the Lipschitz constant of �TG. However, since �TGtakes as input a probability distribution and also outputs a probability distribution, we require a notion of distance between two distributions. This notion is quantified using Wasserstein and is justified in the next section.

We consider the stochastic model-based setting and show through an example that the Wasserstein metric is a reasonable choice compared to other common options.

Consider a uniform distribution over states  µ(s)as shown in black in Figure 3 (top). Take a transition function  TGin the environment that, given an action a, uniformly randomly adds or subtracts a scalar  c1. The distribution of states after one transition is shown in red in Figure 3 (middle). Now, consider a transition model �TGthat approximates  TGby uniformly randomly adding or subtracting the scalar c2. The distribution over states after one transition using this imperfect model is shown in blue in Figure 3 (bottom). We desire a metric that captures the similarity between the output of the two transition functions. We first consider Kullback-Leibler (KL) divergence and observe that:

image

unless the two constants are exactly the same.

image

Figure 3. A state distribution  µ(s)(top), a stochastic environment that randomly adds or subtracts  c1(middle), and an approximate transition model that randomly adds or subtracts a second scalar c2 (bottom).

The next possible choice is Total Variation (TV) defined as:

image

if the two distributions have disjoint supports regardless of how far the supports are from each other.

In contrast, Wasserstein is sensitive to how far the constants are as:

image

It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely  TGapproximates �TG. This is particularly important in high-dimensional spaces where the true distribution is known to usually lie in low-dimensional manifolds. (Narayanan & Mitter, 2010)

To extract a prediction with a horizon n > 1, model-based algorithms typically apply the model for n steps by taking the state input in step t to be the state output from the step  t − 1. Previous work has shown that model error can result in poor long-horizon predictions and ineffective planning (Talvitie, 2014; 2017). Observed even beyond reinforcement learning (Lorenz, 1972; Venkatraman et al., 2015), this is referred to as the compounding error phenomenon. The goal of this section is to provide a bound on multi-step prediction error of a model. We formalize the notion of model accuracy below:

Definition 5. Given an MDP with a transition function T, we identify a Lipschitz model  Fgas  ∆-accurate if its induced �T satisfies:

image

We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error and the Lipschitz constant of the transition function �TG. We provide a bound on the Lipschitz constant of �TGusing the following lemma:

Lemma 1. A generalized transition function �TGinduced by a Lipschitz model class  Fgis Lipschitz with a constant:

image

Intuitively, Lemma 1 states that, if the two input distributions are similar, then for any action the output distributions given by �TGare also similar up to a  KF factor.We prove this lemma, as well as the subsequent lemmas, in the appendix.

Given the one-step error (Definition 5), a start state distribution  µand a fixed sequence of actions  a0, ..., an−1, we desire a bound on n-step error:

image

where �T nG (·|µ) := �TG(·| �TG(·|... �TG(·|µ, a0)..., an−2), an−1)

� �� �nrecursive calls and  T nG (· | µ)is defined similarly. We provide a useful lemma followed by the theorem.

Lemma 2. (Composition Lemma) Define three metric spaces  (M1, d1), (M2, d2), and (M3, d3). Define Lipschitz functions  f : M2 �→ M3 and g : M1 �→ M2with constants Kd2,d3(f)and  Kd1,d2(g). Then,  h : f ◦ g : M1 �→ M3is Lipschitz with constant  Kd1,d3(h) ≤ Kd2,d3(f)Kd1,d2(g).

Similar to composition, we can show that summation preserves Lipschitz continuity with a constant bounded by the sum of the Lipschitz constants of the two functions. We omitted this result due to brevity.

Theorem 1. Define a  ∆-accurate �TGwith the Lipschitz constant  KFand an MDP with a Lipschitz transition function  TGwith constant  KT. Let ¯K = min{KF , KT }. Then  ∀n ≥ 1:

image

Proof. We construct a proof by induction. Using Kantarovich-Rubinstein duality (Lipschitz property of f not shown for brevity) we first prove the base of induction:

δ(1) := W� �TG(· | µ, a0), TG(· | µ, a0)�

image

We now prove the inductive step. Assuming  δ(n − 1) :=W� �T n−1G (· | µ), T n−1G (· | µ)�≤ ∆ �n−2i=0 (KF )iwe can write:

image

+W��TG�· | T n−1G (· | µ), an−1�, TG(· | T n−1G (· | µ), an−1)�

We now use Lemma 1 and Definition 5 to upper bound the first and the second term of the last line respectively.

image

Note that in the triangle inequality, we may replace �TG�· | T n−1G (· | µ)�with  TG�· | �T n−1G (· | µ)�and follow the same basic steps to get:

image

Combining (5) and (6) allows us to write:

image

which concludes the proof.

There exist similar results in the literature relating one-step transition error to multi-step transition error and sub-optimality bounds for planning with an approximate model. The Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009) is for discrete state MDPs and relates error in the one-step model to the value obtained by using it for planning. A related result for continuous state-spaces (Kakade et al., 2003) bounds the error in estimating the probability of a trajectory using total variation. A second related result (Venkatraman et al., 2015) provides a slightly looser bound for prediction error in the deterministic case—our result can be thought of as a generalization of their result to the probabilistic case.

We next investigate the error in the state-value function induced by a Lipschitz model class. To answer this question, we consider an MRP  M1denoted by  ⟨S, A, T, R, γ⟩and a second MRP  M2that only differs from the first in its transition function  ⟨S, A, �T, R, γ⟩. Let A = {a} be the action set with a single action a. We further assume that the reward function is only dependent upon state. We first express the state-value function for a start state s with respect to the two transition functions. By  δsbelow, we mean a Dirac delta function denoting a distribution with probability 1 at state s.

image

Next we derive a bound on��VT (s) − V �T (s)�� ∀s.

Theorem 2. Assume a Lipschitz model class  Fgwith a ∆-accurate �Twith ¯K = min{KF , KT }. Further, assume a Lipschitz reward function with constant  KR = KdS,R(R).Then  ∀s ∈ S and ¯K ∈ [0, 1γ )

image

Proof. We first define the function  f(s) = R(s)KR .It can be observed that  KdS,R(f) = 1. We now write:

image

Let  F = {h : KdS,R(h) ≤ 1}. Then given  f ∈ F:

image

We can derive the same bound for  V �T (s) − VT (s)using the fact that Wasserstein distance is a metric, and therefore symmetric, thereby completing the proof.

Regarding the tightness of our bounds, we can show that when the transition model is deterministic and linear then Theorem 1 provides a tight bound. Moreover, if the reward function is linear, the bound provided by Theorem 2 is tight. (See Claim 2 in the appendix.) Notice also that our proof does not require a bounded reward function.

We next show that, given a Lipschitz transition model, solving for the fixed point of a class of Bellman equations yields a Lipschitz state-action value function. Our proof is in the context of Generalized Value Iteration (GVI) (Littman & Szepesv´ari, 1996), which defines Value Iteration (Bellman, 1957) for planning with arbitrary backup operators.

image

To prove the result, we make use of the following lemmas. Lemma 3. Given a Lipschitz function  f : S �→ Rwith constant  KdS,dR(f):

image

Lemma 4. The following operators (Asadi & Littman, 2017) are Lipschitz with constants:

1.  K∥∥∞,dR(max(x)) = K∥∥∞,dR�mean(x)� =K∥∥∞,dR(ϵ-greedy(x)) = 1

image

2.  K∥∥∞,dR(mmβ(x) := log

image

Theorem 3. For any non-expansion backup operator f outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by KAdS ,dR(R)1−γKdS ,W (T )if

γKAdS,W (T) < 1.

Proof. From Algorithm 1, in the nth round of GVI updates:

image

Now observe that:

image

Where we used Lemmas 3, 2, and 4 for the second, third, and fourth inequality respectively. Equivalently:

image

By computing the limit of both sides, we get:

image

This concludes the proof.

Two implications of this result: First, PAC exploration in continuous state spaces is shown assuming a Lipschitz value function (Pazis & Parr, 2013). However, the theorem shows that it is sufficient to have a Lipschitz model, an assumption perhaps easier to confirm. The second implication relates to value-aware model learning (VAML) objective (Farahmand et al., 2017). Using the above theorem, we can show that minimizing Wasserstein is equivalent to minimizing the VAML objective (Asadi et al., 2018).

Our first goal in this section1 is to compare TV, KL, and Wasserstein in terms of the ability to best quantify error of an imperfect model. To this end, we built finite MRPs with random transitions, |S| = 10 states, and  γ = 0.95. In the first case the reward signal is randomly sampled from [0, 10], and in the second case the reward of an state is the index of that state, so small Euclidean norm between two states is an indication of similar values. For  105 trials, we generated an MRP and a random model, and then computed model error and planning error (Figure 4). We understand a good metric as the one that computes a model error with a high correlation with value error. We show these correlations for different values of  γ in Figure 5.

image

Figure 4. Value error (x axis) and model error (y axis). When the reward is the index of the state (right), correlation between Wasserstein error and value-prediction error is high. This highlights the fact that when closeness in the state-space is an indication of similar values, Wasserstein can be a powerful metric for model-based RL. Note that Wasserstein provides no advantage given random rewards (left).

image

Figure 5. Correlation between value-prediction error and model error for the three metrics using random rewards (left) and index rewards (right). Given a useful notion of state similarities, low Wasserstein error is a better indication of planning error.

image

Table 1. Lipschitz constant for various functions used in a neural network. Here,  Wjdenotes the jth row of a weight matrix W.

It is known that controlling the Lipschitz constant of neural nets can help in terms of improving generalization error due to a lower bound on Rademacher complexity (Neyshabur et al., 2015; Bartlett & Mendelson, 2002). It then follows from Theorems 1 and 2 that controlling the Lipschitz constant of a learned transition model can achieve better error bounds for multi-step and value predictions. To enforce this constraint during learning, we bound the Lipschitz constant of various operations used in building neural network. The bound on the constant of the entire neural network then follows from Lemma 2. In Table 1, we provide Lipschitz constant for operations (see Appendix for proof) used in our experiments. We quantify these results for different  p-norms ∥·∥p.

Given these simple methods for enforcing Lipschitz continuity, we performed empirical evaluations to understand the impact of Lipschitz continuity of transition models, specifically when the transition model is used to perform multi-step state-predictions and policy improvements. We chose two standard domains: Cart Pole and Pendulum. In Cart Pole, we trained a network on a dataset of  15 ∗ 103tuples  ⟨s, a, s′⟩. During training, we ensured that the weights of the network are smaller than k. For each k, we performed 20 independent model estimation, and chose the model with median cross-validation error.

Using the learned model, along with the actual reward signal of the environment, we then performed stochastic actor-critic RL. (Barto et al., 1983; Sutton et al., 2000) This required an interaction between the policy and the learned model for relatively long trajectories. To measure the usefulness of the model, we then tested the learned policy on the actual domain. We repeated this experiment on Pendulum. To train the neural transition model for this domain we used  104samples. Notably, we used deterministic policy gradient (Silver et al., 2014) for training the policy network with the hyper parameters suggested by Lillicrap et al. (2015). We report these results in Figure 6.

Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory, controlling the Lipschitz constant in practice can combat the compounding errors and can help in the value estimation problem. This ultimately results in learning a better policy.

We next examined if the benefits carry over to stochastic

image

Figure 6. Impact of Lipschitz constant of learned models in Cart Pole (left) and Pendulum (right). An intermediate value of k (Lipschitz constant) yields the best performance.

settings. To capture stochasticity we need an algorithm to learn a Lipschitz model class (Definition 4). We used an EM algorithm to joinly learn a set of functions f, parameterized by  θ = {θf : f ∈ Fg}, and a distribution over functions g. Note that in practice our dataset only consists of a set of samples  ⟨s, a, s′⟩and does not include the function the sample is drawn from. Hence, we consider this as our latent variable z. As is standard with EM, we start with the log-likelihood objective (for simplicity of presentation we assume a single action in the derivation):

image

where we used Jensen’s inequality and concavity of log in the last line. This derivation leads to the following EM algorithm.

image

Figure 7. A stochastic problem solved by training a Lipschitz model class using EM. The top left figure shows the functions before any training (iteration 0), and the bottom right figure shows the final results (iteration 50).

In the M step, find  θtby solving for:

image

In the E step, compute posteriors:

qt(zi =f|si, si′)= p(si, si′|zi = f; θtf)g(zi = f; θt)�f p(si, si′|zi = f; θtf)g(zi = f; θt).

Note that we assume each point is drawn from a neural network f with probability:

image

and with a fixed variance  σ2 tuned as a hyper-parameter.

We used a supervised-learning domain to evaluate the EM algorithm. We generated 30 points from 5 functions (written at the end of Appendix) and trained 5 neural networks to fit these points. Iterations of a single run is shown in Figure 7 and the summary of results is presented in Figure 8. Observe that the EM algorithm is effective, and that controlling the Lipschitz constant is again useful.

We next applied EM to train a transition model for an RL setting, namely the gridworld domain from Moerland et al. (2017). Here a useful model needs to capture the stochastic behavior of the two ghosts. We modify the reward to be -1 whenever the agent is in the same cell as either one of the ghosts and 0 otherwise. We performed environmental interactions for 1000 time-steps and measured the return. We compared against standard tabular methods(Sutton & Barto, 1998), and a deterministic model that predicts expected next state (Sutton et al., 2008; Parr et al., 2008). In all cases we used value iteration for planning.

image

Figure 8. Impact of controlling the Lipschitz constant in the supervised-learning domain. Notice the U-shape of final Wasserstein loss with respect to Lipschitz constant k.

image

Figure 9. Performance of a Lipschitz model class on the gridworld domain. We show model test accuracy (left) and quality of the policy found using the model (right). Notice the poor performance of tabular and expected models.

Results in Figure 9 show that tabular models fail due to no generalization, and expected models fail since the ghosts do not move on expectation, a prediction not useful for planner. Performing value iteration with a Lipschitz model class outperforms the baselines.

We took an important step towards understanding model-based RL with function approximation. We showed that Lipschitz continuity of an estimated model plays a central role in multi-step prediction error, and in value-estimation error. We also showed the benefits of employing Wasserstein for model-based RL. An important future work is to apply these ideas to larger problems.

The authors recognize the assistance of Eli Upfal, John Langford, George Konidaris, and members of Brown’s Rlab specifically Cameron Allen, David Abel, and Evan Cater. The authors also thank anonymous ICML reviewer 1 for insights on a value-aware interpretation of Wasserstein.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252, 2017.

Asadi, K., Cater, E., Misra, D., and Littman, M. L. Equivalence between wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265, 2018.

Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, pp. 834–846, 1983.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–919, 2017.

Bertsekas, D. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: an overview. In Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1, pp. 560–564. IEEE, 1995.

Bubeck, S., Munos, R., Stoltz, G., and Szepesv´ari, C. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.

Farahmand, A.-M., Barreto, A., and Nikovski, D. Value-Aware Loss Function for Model-based Reinforcement Learning. In Proceedings of the

20th International Conference on Artificial Intelligence and Statistics, pp. 1486–1494, 2017.

Ferns, N., Panangaden, P., and Precup, D. Metrics for finite markov decision processes. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 162–169. AUAI Press, 2004.

Fox, R., Pakman, A., and Tishby, N. G-learning: Taming the noise in reinforcement learning via soft updates. Uncertainty in Artifical Intelligence, 2016.

Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

Hinderer, K. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of AAMAS, pp. 1181–1189, 2015.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.

Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3): 209–232, 2002.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690. ACM, 2008.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. and Szepesv´ari, C. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, pp. 310–318, 1996.

Lorenz, E. Predictability: does the flap of a butterfly’s wing in Brazil set off a tornado in Texas? na, 1972.

Moerland, T. M., Broekens, J., and Jonker, C. M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

M¨uller, A. Optimal selection from distributions with unknown parameters: Robustness of bayesian models. Mathematical Methods of Operations Research, 44(3): 371–386, 1996.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.

Narayanan, H. and Mitter, S. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794, 2010.

Neu, G., Jonsson, A., and G´omez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, pp. 1376–1401, 2015.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pp. 752–759. ACM, 2008.

Pazis, J. and Parr, R. Pac optimal exploration in continuous space markov decision processes. In AAAI, 2013.

Pires, B. ´A. and Szepesv´ari, C. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pp. 121–151, 2016.

Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.

Rachelson, E. and Lagoudakis, M. G. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics, ISAIM 2010, Fort Lauderdale, Florida, USA, January 6-8, 2010, 2010.

Russell, S. J. and Norvig, P. Artificial intelligence: A modern approach, 1995.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Sutton, R. S., Szepesv´ari, C., Geramifard, A., and Bowling, M. H. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, pp. 528–536, 2008.

Szepesv´ari, C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

Talvitie, E. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780–789. AUAI Press, 2014.

Talvitie, E. Self-correcting models for model-based reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.

Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., 2015.

Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Claim 1. In a finite MDP, transition probabilities can be expressed using a finite set of deterministic functions and a distribution over the functions.

Proof. Let  Pr(s, a, s′)denote the probability of a transiton from s to  s′when executing the action a. Define an ordering over states  s1, ..., snwith an additional unreachable state  s0. Now define the cumulative probability distribution:

image

Further define L as the set of distinct entries in C:

Note that, since the MDP is assumed to be finite, then |L| is finite. We sort the values of L and denote, by  ci, ith smallest value of the set. Note that  c0 = 0and  c|L| = 1. We now build determinstic set of functions  f1, ..., f|L|as follows: ∀i = 1 to |L| and ∀j = 1 to n, define fi(s) = sjif and only if:

image

We also define the probability distribution g over f as follows:

Given the functions  f1, ..., f|L|and the distribution g, we can now compute the probability of a transition to  sj from s after

executing action a:

where 1 is a binary function that outputs one if and only if its condition holds. We reconstructed the transition probabilities using distribution g and deterministic functions  f1, ..., f|L|.

Claim 2. Given a deterministic and linear transition model, and a linear reward signal, the bounds provided in Theorems 1 and 2 are both tight.

image

First observe that the bound in Theorem 2 is tight for n = 2:

image

image

and more generally and after n compositions of the models, denoted by  T n and ˆT n, the following equality holds:

image

Consider the state s = 0. Note that clearly v(0) = 0. We now compute the value predicted using ˆT, denoted by  ˆv(0):

image

and so:

Note that this exactly matches the bound derived in our Theorem 2.

Lemma 1. A generalized transition function �TGinduced by a Lipschitz model class  Fgis Lipschitz with a constant:

image

Proof.

image

W� �T(· | µ1, a), �T(· | µ2, a)� := infj

Dividing by  W(µ1, µ2)and taking  sup over a, µ1, and µ2, we conclude:

image

We can also prove this using the Kantarovich-Rubinstein duality theorem:

For every  µ1, µ2, and a ∈ A we have:

Again we conclude by dividing by  W(µ1, µ2)and taking  sup over a, µ1, and µ2.

Lemma 2. (Composition Lemma) Define three metric spaces  (M1, d1), (M2, d2), and (M3, d3). Define Lipschitz functions f : M2 �→ M3and  g : M1 �→ M2with constants  Kd2,d3(f)and  Kd1,d2(g). Then,  h : f ◦ g : M1 �→ M3is Lipschitz with constant  Kd1,d3(h) ≤ Kd2,d3(f)Kd1,d2(g).

Proof.

Lemma 3. Given a Lipschitz function  f : S �→ Rwith constant  KdS,dR(f):

image

Proof.

Lemma 4. The following operators (Asadi & Littman, 2017) are Lipschitz with constants:

1.  K∥∥∞,dR(max(x)) = K∥∥∞,dR�mean(x)�= K∥∥∞,dR(ϵ-greedy(x)) = 1

image

2.  K∥∥∞,dR(mmβ(x) := log

Proof. 1 was proven by Littman & Szepesv´ari (1996), and 2 is proven several times (Fox et al., 2016; Asadi & Littman, 2017; Nachum et al., 2017; Neu et al., 2017). We focus on proving 3. Define

image

and observe that  boltzβ(x) = x⊤ρ(x).Gao & Pavel (2017) showed that  ρis Lipschitz:

image

Using their result, we can further show:

image

dividing both sides by  ∥x1 − x2∥∞leads to 3.

image

Matrix multiplication Let  W ∈ Rn×m. We derive the Lipschitz continuity for the function  ×W(x) = Wx.

For  p = ∞ we have:

image

∥Wx1 − Wx2∥∞

image

where  Wj refers to jth row of the weight matrix W. Similarly, for p = 1 we have:

∥Wx1 − Wx2∥1

and finally for p = 2:

image

image

∥Wx1 − Wx2∥2

image

Vector addition We show that  +b : Rn → Rn has Lipschitz constant 1 for  p = 0, 1, ∞ for all b ∈ Rn.

image

Supervised-learning domain We used the following 5 functions to generate the dataset:

image

We sampled each function 30 times, where the input was chosen uniformly randomly from  [−2, 2]each time.


Designed for Accessibility and to further Open Science