Near-optimal Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms for the Non-episodic Setting

2020·Arxiv

Abstract

Abstract

We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound respectively, both of which reduce to the near-optimal bound for standard non-factored MDPs. We propose a tighter connectivity measure, factored span, for FMDPs and prove a lower bound that depends on the factored span rather than the diameter D. In order to decrease the gap between lower and upper bounds, we propose an adaptation of the REGAL.C algorithm whose regret bound depends on the factored span. Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.

1 Introduction

Designing computationally and statistically efficient algorithms is a core problem in Reinforcement Learning (RL). There is a rich line of works that achieve a strong sample efficiency guarantee with regret analysis in tabular MDPs, where state and action spaces are finite and small (Jaksch et al., 2010; Osband et al., 2013; Dann and Brunskill, 2015; Kearns and Singh, 2002). A current challenge in RL is dealing with large state and action spaces where even polynomial dependence of regret on state and action spaces size is unacceptable. One idea to meet this challenge is to consider MDPs with compact representations. For example, factored MDPs (FMDPs) (Boutilier et al., 2000) represent transition functions of MDPs using a compact Dynamic Bayesian network (DBN) (Ghahramani, 1997). FMDPs have a variety of applications in important real-world problems, e.g. multi-agent RL, and they also serve as important case studies in theoretical RL research (Guestrin et al., 2002a,c; Tavakol and Brefeld, 2014; Sun et al., 2019).

There is no FMDP planner that is both computationally efficient and accurate (Goldsmith et al., 1997; Littman, 1997). Guestrin et al. (2003) proposed approximate algorithms with prespecified basis functions and bounded approximation

errors. For the even harder online learning setting, we study oracle-efficient algorithms, which learns an unknown FMDP efficiently by assuming an efficient planning oracle. In this paper, our goal is to design efficient online algorithms that only make a polynomial number of calls to the oracle planning oracle. Side-stepping the computational intractability of the offline problem by assuming oracle access to a solver has yielded insights into simpler decision making problems. For example, oracle-based efficient algorithms have been proposed for the contextual bandit problem (Syrgkanis et al., 2016; Luo et al., 2018). Online learning in episodic FMDP has been studied by Osband and Van Roy (2014). They proposed two algorithms, PSRL (Posterior Sampling RL) and UCRL-factored with near-optimal Bayesian and frequentist regret bounds, respectively. However, their UCRL-factored algorithm relies on solving a Bounded FMDP (Givan et al., 2000), which is an even stronger assumption than the access to a planning oracle. This work studies FMPDs in the more challenging non-episodic setting. Previous studies in non-episodic FMDPs either have some high order terms in their analysis (Strehl, 2007) or depend on some strong connectivity assumptions, e.g. mixing time (Kearns and Koller, 1999). There is no near-optimal regret analysis in this setting yet. Regret analysis in the non-episodic setting relies on the connectivity assumptions. Previous available connectivity assumptions include mixing time (Lewis and Puterman, 2001), diameter (Jaksch et al., 2010) and span of bias vector (Bartlett and Tewari, 2009). Mixing time is the strongest assumption and span of bias vector gives the tightest regret bound among the three. However, we show that even upper bound using span can be loose if the factor structure is not taken into account. This paper makes three main contributions: 1. We provide two oracle-efficient algorithms, DORL (Discrete Optimism RL) and PSRL (Posterior Sampling RL), with near-optimal frequentist regret bound and Bayesian regret bound respectively. Both upper bounds depend on the diameter of the unknown FMDP. The algorithms call the FMDP planner only a polynomial number of times. The upper bound of DORL, when specialized to the standard non-factored MDP setting, matches that of UCRL2 (Jaksch et al., 2010). The same applies to the upper bound of PSRL in the non-factored setting (Ouyang et al., 2017).

2. We propose a tighter connectivity measure especially designed for FMDPs, called factored span and prove a regret lower bound that depends on the factored span of the unknown FMDP rather than its diameter.

3. Our last algorithm FSRL is not oracle efficient but its regret scales with factored span, and using it, we are able to reduce the gap between upper and lower bounds on regret in terms of both the dependence on diameter and on m, the number of factors.

2 Preliminaries

We first introduce necessary definitions and notation for non-episodic MDPs and FMDPs.

2.1 Non-episodic MDP

We consider a setting where a learning agent interacts without resets or episodes with a Markov decision process (MDP), represented by M = {S, A, P, R}, with finite state space S, finite action space A, the transition probability and reward distribution . Here denotes a distribution over the space X. Let G(X) be the space of all possible distributions over X and is the class of all the mappings from space A := |A|.

An MDP M and a learning algorithm L operating on M with an arbitrary initial state constitute a stochastic process described by the state visited at time step t, the action chosen by L at step t, the reward and the next state obtained for t = 1, . . . , T. Let be the trajectory up to time t.

Below we will define our regret measure in terms of undiscounted sum of rewards. To derive non-trivial upper bounds, we need some connectivity constraint. There are several subclasses of MDPs corresponding to different types of connectivity constraints (e.g., see the discussion in Bartlett and Tewari (2009)). We first focus on the class of communicating MDPs, i.e., the diameter of the MDP, which is defined below, is upper bounded by some

Definition 1 (Diameter). Consider the stochastic process defined by a stationary policy operating on an MDP M with initial state be the random variable for the first time step in which state is reached in this process. Then the diameter of M is defined as

A stationary policy on an MDP M is a mapping . An average reward (also called gain) of a policy on M with an initial distribution is defined as

where the expectation is over trajectories . We restrict the choice of policies within the set of all policies whose average reward is independent of the starting state . It can be shown that for a communicating MDP the optimal policies with the highest average reward are in the set and neither the optimal policy nor the optimal reward depends on the initial state. Let denote the optimal policy for MDP M and denote the optimal average reward or optimal gain. For any optimal policy , following the definition

where the expectation is taken over the trajectory generated by policy . And the bias vector of MPD M is . Let the span of a vector h be . Note that if there are multiple optimal policies, we consider the policy with the largest span for its bias vector.

We define the regret of a reinforcement learning algorithm L operating on MDP M up to time T as

and Bayesian regret w.r.t. a prior distribution on a set of MDPs as

Optimality equation for average reward criterion. We let denote the S-dimensional vector with each element representing and denote the matrix with each row as . For any communicating MDP M, the bias vector h(M) satisfies the following equation (Puterman, 2014):

2.2 Factored MDP

Factored MDP is modeled with a DBN (Dynamic Bayesian Network) (Dean and Kanazawa, 1989), where transition dynamics and rewards are factored and each factor only depends on a finite scope of state and action spaces. We use the definition in Osband and Van Roy (2014). We call factored set if it can be factored by . Note this formulation generalizes those in Strehl (2007); Kearns and Koller (1999) to allow the factorization of the action set as well.

Definition 2 (Scope operation for factored sets). For any subset of indices , let us define the scope set . Further, for any define the scope variable to be the value of the variables with indices . For singleton sets {i}, we write x[i] for x[{i}] in the natural way.

Definition 3 (Factored reward distribution). A reward distribution R is factored over X with scopes if and only if, for all , there exists distributionssuch that any can be decomposed as , with each individually observable. Throughout the paper, we also let R(x) denote reward function of the distribution R(x), which is the expectation

Definition 4 (Factored transition probability). A transition function P is factored over with scopes if and only if, for all there exist somesuch that,

For simplicity, let P(x) also denote the vector for the probability of each next state from current pair in the same way.

Assumptions on FMDP. To ensure a finite number of parameters, we assume that for for and for all for some finite L and W. Furthermore, we assume that with probability 1.

Empirical estimates. We first define number of visits for each factored set. Let be the number of visits to be the number of visits to until t and be the number of visits to until t. The empirical estimate for is for . Estimate for transition probability is for . We let and be the first step of episode k.

3 Oracle-eﬃcient Algorithms

We use PSRL (Posterior Sampling RL) and a modified version of UCRL-factored, called DORL (Discrete Optimism RL). Both PSRL and DORL use a fixed policy within an episode. For PSRL, we apply optimal policy for an MDP sampled from the posterior distribution of the true MDP. For DORL, instead of optimizing over a bounded MDP, we construct a new extended MDP, which is also factored with the number of parameters polynomial in that of the true MDP. Then the optimal policy of the extended FMDP is mapped to the policy space of the true FMDP. Instead of using dynamic episodes, we show that a simple fixed episode scheme can also give us near-optimal regret bounds. Algorithm details are shown in Algorithm 1.

3.1 Extended FMDP

Previous two constructions. Previous near-optimal algorithms on regular MDP depend on constructing an extended MDP with a high probability of being optimistic. Jaksch et al. (2010) constructs the extended MDP with a continuous action space to allow choosing any transition probability in a confidence set. This construction generates a bounded-parameter MDP. Agrawal and Jia (2017) instead sample transition probability only from the extreme points of a similar confidence set and combine them by adding extra discrete actions.

Solving the bounded-parameter MDP by the first construction, which requires storing and ordering the S-dimensional bias vector, is not feasible for FMDPs. There is no direct adaptation that mitigates this computation issue. We show that the second construction using only a discrete set of MDPs, by removing the sampling part, can be solved with a much lower complexity in the FMDP setting.

We formally describe the construction. For simplicity, we ignore the notations for k in this section. First define the error bounds as an input. For every , we have an error bound for transition probability . For every , we have an error bound for . At the start of episode k the construction takes the inputs of and the error bounds, and outputs the extended MDP

Extreme transition dynamic. We first define the extreme transition probability mentioned above in factored setting. Let be the transition probability that encourages visiting

where is the vector with all zeros except for a one on the j-th element. By this definition, is a new transition probability that puts all the uncertainty onto the direction s. An example is shown in Figure 1. Our construction assigns an action for each extreme transition dynamic.

Figure 1: An extreme transition dynamic that encourages visiting the second state out of three states.

with the factored transition probability to be

The new reward function is factored over , with reward functions to be

for any

Claim 1. The factored set of the extended MDP satisfies each

By Claim 1, any planner that efficiently solves the original MDP, can also solve the extended MDP. We find the best policy using the planner. To run a policy on original action space, we choose for every maps any new state-action pair to the pair it is extended from, i.e.

4 Upper bounds for PSRL and DORL

We achieve the near-optimal Bayesian regret bound by PSRL and frequentist regret bound by DORL, respectively. Let denote the order ignoring the logarithmic term and the universal constant.

Theorem 1 (Regret of PSRL). Let M be the factored MDP with graph structure , all and ,

and diameter upper bounded by D. Then if is the true prior distribution over the set of MDPs with diameter D, then we bound Bayesian regret of PSRL:

Theorem 2 (Regret of DORL). Let M be the factored MDP with graph structure , all and ,

and diameter upper bounded by D. Then, with high probability, regret of DORL is upper bounded by:

The two bounds match the frequentist regret bound in Jaksch et al. (2010) and Bayesian regret bound in Ouyang et al. (2017) for non-factored communicating MDP. We also give a condition of designing the speed of changing policies.

Remark. Replacing the episode length in Algorithm 1 with any that satisfies , the frequentist bound in Theorem 2 still holds. Furthermore, if is fixed the Bayesian bound in Theorem 2 also holds.

5 Lower Bound and Factored Span

Any regret bound depends on a difficulty measure determining the connectivity of the MDP. The upper bounds of DORL and PSRL use diameter. A tighter alternative is the span of bias vector (Bartlett and Tewari, 2009), defined as is the bias vector of the optimal policy. However, none of those connectivity measures address the complexity of the graph structure. Indeed, some graph structure allows a tighter regret bound. In this section, we first show a lower bound with a Cartesian product structure. We further propose a new connectivity measure that can scale with the complexity of the graph structure.

Large diameter case. We consider a simple FMDP with infinite diameter but still solvable. The FMDP is a Cartesian product of two identical MDPs, and with S = {0, 1, 2, 3}, A = {1, 2}. The transition probability is chosen such that from any state and action pair, the next state will either move forward or move backward with probability one (state 0 is connected with state 3 as a circle).

We can achieve a low regret easily by learning each MDP independently. However, since the sum of the two states always keeps the same parity, vector state (0, 1) can never be reached from (0, 0). Thus, the FMDP has an infinite diameter. The span of bias vector, on the other hand, is upper bounded by , which is tight in this case.

Lower bound with only dependency on span. Let us formally state the lower bound. Our lower bound casts some restrictions on the scope of transition probability, i.e. the scope contains itself, which we believe is a natural assumption. We provide a proof sketch for Theorem 3 here.

Theorem 3 (Lower bound). For any algorithm, any graph structure satisfying with , , there exists an FMDP with the span of bias vector , such that for any initial state , the expected regret of the algorithm after T step is

The proof is given in Appendix H. As we can see, the upper bound in Theorem 1 is larger than the lower bound by a factor of

discuss how to reduce the first three excesses.

5.1 Tighter connectivity measure

The mismatch in the dependence on m is due to not taking the factor structure into account properly in the definition of the span. A tighter bound should be able to detect the easier structure, e.g. product of independent MDPs. We now propose factored span that scales with the complexity of the graph structure.

Definition 5 (Factored span). For an FMDP M with an bias vector h of it optimal policy and a factorization of state space , we define factored span

, . . . , s, . . . , s

Proposition 1. For any bias vector . The first equality holds when the FMDP has the construction of Cartesian product of m independent MDPs. The lower bound (4) can also be written as

5.2 Tighter upper bound

We now provide another algorithm called FSRL (Factored-span RL) with a tighter regret bound of as shown in Theorem 4. The bound reduces the gap on m, l and replaces D with the sum of factored span Q. Proposition 1 guarantees that such that the upper bound is at least as good as the upper bound in Theorem 1.

FSRL (full description in Appendix I, Algorithm 2), mimics REGAL.C by solving the following optimization,

where is the confidence set defined in (2) and (3). FSRL relies on the computational oracle of optimizing average rewards over the confidence set with the sum of factored span bounded by a prespecified value. Therefore, FSRL cannot be run by just calling an FMDP planning oracle.

Theorem 4 (Regret of FSRL (Factored-span RL)). Let M be the factored MDP with graph structure , all

and , bias vector of optimal policy h and its sum of factored spans Q(h). Then, with high probability, regret of FSRL is upper bounded by:

The proof idea is on bounding the deviation of transition probabilities between the true MDP and in episode k with factored span. The details are shown in Appendix G.

6 Simulation

There are two previously available sample efficient and implementable algorithms for FMDPs: factored and factored Rmax (f-Rmax). F-Rmax was shown to have better empirical performance (Guestrin et al., 2002b). Thus, we compare PSRL, DORL and f-Rmax. For PSRL, at the start of each episode, we simply sample each factored transition probability and reward functions from a Dirichlet distribution and a Gaussian distribution, i.e. Dirichletis searched over (0.05, 0.1, 0.3, 0.75, 1, 5, 20). The total number of samplings for PSRL in each round is upper bounded by the number of parameters of the FMDP. For DORL, we replace the coefficients 18 and 12 in (2) and (3) with a hyper-parameter c searched over the set {0.05, 0.1, 0.3, 0.5, 1, 5, 20}. For f-Rmax, m, the number of visits needed to be known are chosen from 100, 300, 500, 700 and the best choice is selected for each experiment.

For the approximate planner used by our algorithm, we implement approximate linear programming (Guestrin et al., 2003) with the basis for . For regret evaluation, we use an accurate planner to find the true optimal average reward.

We compare three algorithms on computer network administrator domain with a circle and a three-leg structure (Guestrin et al., 2001; Schuurmans and Patrascu, 2002). To avoid the extreme case in our lower bound, both the MDPs

Figure 2: Regrets of PSRL, f-Rmax and DORL on circle and three-leg MDP with a size 4, 7. For PSRL, c = 0.75. For f-Rmax, m = 300, 500, 500, 500 and for DORL, c = 0.03 in Circle 4, Circle 7, Three-leg 4, Three-leg 7, respectively.

are set to have limited diameters. The details on the environment are in Appendix ??.

Figure 2 shows the regret of the two algorithms on circle and three-leg structure with a size 4, 7, respectively. Each experiment is run 20 times, with which median, 75% and 25% quantiles are computed. Our DORL and PSRL has very similar performance in all the environment except for Three-leg with a size 4. Optimal hyper-parameter for PSRL and DORL is stable in the way that c around 0.75 and 0.03 are the optimal parameter for PSRL and DORL respectively for all the experiments. Note that we use the exact, not approximate, optimal reward in regret evaluation. So we see that DORL and PSRL was always able to find a near-optimal optimal policy despite the use of an approximate planner.

7 Discussion

In this paper, we provide two oracle-efficient algorithms PSRL and DORL for non-episodic FMDPs, with a Bayesian and frequentist regret bound of , respectively. PSRL outperforms previous near-optimal algorithm f-Rmax on computer network administration domain. The regret still converges despite using an approximate planner. We prove a lower bound of for non-episodic MDP. Our large diameter example shows that diameter D can be arbitrary larger than the span . To reduce the gap, we propose factored span that scales with the difficulty of the graph structure and a new algorithm, FSRL with a regret bound of . In the lower bound construction, Q equals to the span of the FMDP.

FSRL relies on a harder computational oracle that is not efficiently solvable yet. Fruit et al. (2018) achieved a regret bound depending on span using an implementable Bias-Span-Constrained value iteration on non-factored MDP. It remains unknown whether FSRL could be approximately solved using an efficient implementation.

In non-factored MDP, Zhang and Ji (2019) achieved the lower bound. On the lower bound of non-episodic FMDP, it remains an open problem to close the remaining gap involving

Our algorithms require the full knowledge of the graph structure of the FMDP, which can be impractical. The structural learning scenario has been studied by Strehl et al. (2007); Chakraborty and Stone (2011); Hallak et al. (2015). Their algorithms either rely on an admissible structure learner or do not have a regret or sample complexity guarantee. It remains an open problem of whether an efficient algorithm with theoretical guarantees exists for FMDP with unknown graph structure.

8 Broader Impacts

As a theoretical paper, we can not foresee any direct societal consequences in the near future. Factored MDP, the main problem we study in this paper, may be used in multi-agent Reinforcement Learning scenario.

References

Agrawal, S. and Jia, R. (2017). Optimistic posterior sampling for reinforce- ment learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194.

Bartlett, P. L. and Tewari, A. (2009). Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press.

Boutilier, C., Dearden, R., and Goldszmidt, M. (2000). Stochastic dynamic programming with factored representations. Artificial intelligence, 121(1-2):49– 107.

Chakraborty, D. and Stone, P. (2011). Structure learning in ergodic factored mdps without knowledge of the transition function’s in-degree. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 737–744. Citeseer.

Dann, C. and Brunskill, E. (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826.

Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational intelligence, 5(2):142–150.

Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. (2018). Efficient bias-span- constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1578–1586.

Ghahramani, Z. (1997). Learning dynamic bayesian networks. In International School on Neural Networks, Initiated by IIASS and EMFCSC, pages 168–197. Springer.

Givan, R., Leach, S., and Dean, T. (2000). Bounded-parameter markov decision processes. Artificial Intelligence, 122(1-2):71–109.

Goldsmith, J., Littman, M. L., and Mundhenk, M. (1997). The complexity of plan existence and evaluation in robabilistic domains. In Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, pages 182–189.

Guestrin, C., Koller, D., and Parr, R. (2001). Max-norm projections for factored mdps. In IJCAI, volume 1, pages 673–682.

Guestrin, C., Koller, D., and Parr, R. (2002a). Multiagent planning with factored mdps. In Advances in neural information processing systems, pages 1523–1530.

Guestrin, C., Koller, D., Parr, R., and Venkataraman, S. (2003). Efficient solution algorithms for factored mdps. Journal of Artificial Intelligence Research, 19:399–468.

Guestrin, C., Patrascu, R., and Schuurmans, D. (2002b). Algorithm-directed exploration for model-based reinforcement learning in factored mdps. In ICML, pages 235–242. Citeseer.

Guestrin, C., Venkataraman, S., and Koller, D. (2002c). Context-specific multia- gent coordination and planning with factored mdps. In AAAI/IAAI, pages 253–259.

Hallak, A., Schnitzler, F., Mann, T., and Mannor, S. (2015). Off-policy model- based learning under unknown factored dynamics. In International Conference on Machine Learning, pages 711–719.

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563– 1600.

Kearns, M. and Koller, D. (1999). Efficient reinforcement learning in factored mdps. In IJCAI, volume 16, pages 740–747.

Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in poly- nomial time. Machine learning, 49(2-3):209–232.

Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM.

Lewis, M. E. and Puterman, M. L. (2001). A probabilistic analysis of bias optimality in unichain markov decision processes. IEEE Transactions on Automatic Control, 46(1):96–100.

Littman, M. L. (1997). Probabilistic propositional planning: representations and complexity. In Proceedings of the fourteenth national conference on artificial intelligence, pages 748–754.

Luo, H., Wei, C.-Y., Agarwal, A., and Langford, J. (2018). Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, pages 1739–1776.

Osband, I., Russo, D., and Van Roy, B. (2013). (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011.

Osband, I. and Van Roy, B. (2014). Near-optimal reinforcement learning in factored mdps. In Advances in Neural Information Processing Systems, pages 604–612.

Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. (2017). Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342.

Puterman, M. L. (2014). Markov Decision Processes.: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Schuurmans, D. and Patrascu, R. (2002). Direct value-approximation for factored mdps. In Advances in Neural Information Processing Systems, pages 1579– 1586.

Strehl, A. L. (2007). Model-based reinforcement learning in factored-state mdps. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 103–110. IEEE.

Strehl, A. L., Diuk, C., and Littman, M. L. (2007). Efficient structure learning in factored-state mdps. In AAAI, volume 7, pages 645–650.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. (2019). Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933.

Syrgkanis, V., Luo, H., Krishnamurthy, A., and Schapire, R. E. (2016). Improved regret bounds for oracle-based adversarial contextual bandits. In Advances in Neural Information Processing Systems, pages 3135–3143.

Tavakol, M. and Brefeld, U. (2014). Factored mdps for detecting topics of user sessions. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 33–40.

Zhang, Z. and Ji, X. (2019). Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, pages 2823–2832.

A Proof of Theorem 1 and 2

A standard regret analysis consists of proving the optimism, bounding the deviations and bounding the probability of failing the confidence set. Our analysis follows the standard procedure while adapting them to a FMDP setting. The novelty is on the proof of the general episode-assigning criterion and the lower bound.

Some notations. For simplicity, we let denote the optimal policy of the true MDP, . Let be the starting time of episode k and K be the total number of episodes. Since for any does not depend on s, we also let denote for any s. Let and denote the optimal average reward for

Regret decomposition. We follow the standard regret analysis framework by Jaksch et al. (2010). We first decompose the total regret into three parts in each episode:

Using Hoeffding’s inequality, the regret caused by (7) can be upper bounded by , with probability at least

Confidence set. Let be the confidence set of FMDPs at the start of episode k with the same factorization, such that for and each

and for each

In the following analysis, we all assume that the true MDP M for both PSRL and DORL are in by PSRL are in . In the end, we will bound the regret caused by the failure of confidence set.

A.1 Regret caused by difference in optimal gain

We further bounded the regret caused by (5). For PSRL, since we use fixed episodes, we show that the expectation of (5) equals to zero.

Lemma 1 (Lemma 1 in Osband et al. (2013)). If is the distribution of M, then, for any

We let . As g is a function. Since , K are fixed value for each

For DORL, we need to prove optimism, i.e, with high probability. Given , we show that there exists a policy for average reward

Lemma 2. For any policy and any vector be the policy for satisfying , where . Then, given

Corollary 1. Let be the policy that satisfies , where for any starting state

Proof of Lemma 2 and Corollary 1 are shown in Appendix B. Thereon, . The total regret of (5)

A.2 Regret caused by deviation

We further bound regret caused by (6), which can be decomposed into the deviation between our brief and the true MDP. We first show that the diameter of can be upper bounded by D.

Bounded diameter. We need diameter of extended MDP to be upper bounded to give a sublinear regret. For PSRL, since prior distribution has no mass on MDP with diameter greater than D, the diameter of MDP from posterior is upper bounded by D almost surely. For DORL, we have the following lemma, the proof of which is given in Appendix C.

Lemma 3. When M is in the confidence set , the diameter of the extended MDP

be the number of visits on s, a in episode k and be the row vector of Using optimal equation,

where ˜

Using Azuma-Hoeffding inequality and the same analysis in Jaksch et al. (2010), we bound 1with probability at least

To bound 2and 3, we analyze the deviation in transition and reward function between . For DORL, the deviation in transition probability is upper bounded by

The deviation in reward function

Decomposing the bound for each scope provided by and for PSRL , it holds for both PSRL and DORL that:

where with some abuse of notations, define for . The second inequality is from the fact that (Osband and Van Roy, 2014).

A.3 Balance episode length and episode number

We give a general criterion for bounding (8), (9) and (10).

Lemma 4. For any fixed episodes , if there exists an upper bound , such that , we have the bound

where Z is any scope with , and and are the number of visits to x in and before episode k. Furthermore, total regret of (8), (9) and (10) can be bounded by

Lemma 4 implies that bounding the deviation regret is to balance total number of episodes and the length of the longest episode. The proof, as shown in Appendix D, relies on defining the last episode , such that

Instead of using the doubling trick that was used in Jaksch et al. (2010). We use an arithmetic progression: for . As in our algorithm, , we have and for all . Thus, by Lemma 4, putting (7), (8), (10), (9) together, the total regret for is upper bounded by

with a probability at least For the failure of confidence set, we prove the following Lemma in Appendix E.

, with probability greater than

Combined with (11), with probability at least the regret bound in Theorem 2 holds.

For PSRL, has the same posterior distribution. The expectation of the regret caused by are the same. Choosing sufficiently small , Theorem 1 follows.

B Optimism (Proof of Lemma 2 and Corollary 1)

Lemma 6. For any policy and any vector be the policy for satisfying , where . Then, given

and define . Slightly abusing the notations, let . Define two S-dimensional vectors with . Define . Without loss of generality, we let

Corollary 2. Let be the policy that satisfies , where for any starting state

Proof. Let be the row vector of stationary distribution starting from some . By optimal equation,

where the last inequality is by Lemma 2 and Corollary 1 follows.

C Proof of Lemma 3

Lemma 7. Given M in the confidence set , the diameter of the extended MDP

time to reach from is at most D, without loss of generality we assume is the last state. Let E be the vector with each element to be the expected time to reach except for itself. We find such that the expected time to reach can be bounded by D. We choose the satisfies

removing -th row and column and defined in the same way for M. We immediately have be the expected time to reach from every other states except for itself under

Therefore,

and

D Deviation bound (Proof of Lemma 4)

Lemma 8. For any fixed episodes , if there exists an upper bound , such that , we have the bound

where Z is any scope, and and are the number of visits to x in and before episode k. Furthermore, the total regret of (8), (9) and (10) can be bounded by

, which gives for

Conditioning on

where the first inequality uses for , the second inequality is by the fact that and the third one is

Given any , we can bound the term with a fixed value . Thus, the random variable is upper bounded

by almost surely. Finally,

The last statement is completed by directly summing (8), (9) and (10).

E Regret caused by failing conﬁdence bound

, with probability greater than

Proof. We first deal with the probabilities, with which in each round a reward function of the true MDP M is not in the confidence set. Using Hoeffding’s

inequality, we have for any

Thus, with probability at least , the true reward function is in the confidence set for every For the transition probability, we use a different concentration inequality.

Lemma 10 (Multiplicative Chernoff Bound (Kleinberg et al., 2008) Lemma 4.9). Consider n, i.i.d random variables be their mean and let X be their average. Then with probability

Using Lemma 10, for each x, i, k, it holds that with probability

Then with a probability , it holds for all x, i, k. Therefore, with a probability , the true MDP is in the confidence set for each k.

F Span of Cartesian product of MDPs

be the Cartesian product of n independent MDPs each with a span of bias vector . The optimal policy for has a span

Proof. Let for be the optimal gain of each MDP. Optimal gain of is direct . As noted in Puterman (2014) (8.2.3), by the definition of bias vector we have

The lemma is directly by

We immediately have

G Proof of Theorem 4

Proof. Starting from 2simplicity, we remove the subscriptions of s and use and to denote the vector for s-th row of the two matrix.

where As

By applying (12) recurrently, we have

Note that is generally smaller than . In our lower bound case each , which improves our upper bound by a scale of 1/m.

The reduction of l can be achieved by bounding each factored reward to be in [1, 1/l]. The following proof remains the same.

H Proof of Lower Bound

Proof sketch. Let . As , a special case is the FMDP with graph structure , which can be decomposed into n independent MDPs as in the previous example. Among the n MDPs, the last MDPs are trivial. By simply setting the rest l MDPs to be the construction used by Jaksch et al. (2010), which we refer to as "JAO MDP", the regret for each MDP with the span sp(h), is for . The total regret is

be the Cartesian product of n independent MDPs each with a span of bias vector . The optimal policy for has a span

Using Lemma 12 (proved in Appendix F), and the total expected regret is . Normalizing the reward function to be in

[0, 1], the expected regret of the FMDP is , which completes the proof.

I FSRL algorithm

Here we provide a complete description of the FSRL algorithm that was omitted in the main paper due to space considerations.