Can Agents Learn by Analogy? An Inferable Model for PAC Reinforcement Learning

2019·Arxiv

ABSTRACT

ABSTRACT

Model-based reinforcement learning algorithms make decisions by building and utilizing a model of the environment. However, none of the existing algorithms attempts to infer the dynamics of any state-action pair from known state-action pairs before meeting it for sufficient times. We propose a new model-based method called Greedy Inference Model (GIM) that infers the unknown dynamics from known dynamics based on the internal spectral properties of the environment. In other words, GIM can “learn by analogy”. We further introduce a new exploration strategy which ensures that the agent rapidly and evenly visits unknown state-action pairs. GIM is much more computationally efficient than state-of-the-art model-based algorithms, as the number of dynamic programming operations is independent of the environment size. Lower sample complexity could also be achieved under mild conditions compared against methods without inferring. Experimental results demonstrate the effectiveness and efficiency of GIM in a variety of real-world tasks.

KEYWORDS

Model-based reinforcement learning; Spectral method; Sample complexity; Computational complexity

Yanchao Sun and Furong Huang. 2020. Can Agents Learn by Analogy? An Inferable Model for PAC Reinforcement Learning. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9–13, 2020, IFAAMAS, 17 pages.

1 INTRODUCTION

In Reinforcement Learning (RL)[37], an agent interacts with the environment by taking actions and receiving rewards or payoffs to figure out a policy that maximizes the total rewards. Recently, RL has been successfully applied in many fields such as robotics [27], games [30], recommendation systems [44], etc. However, the high sample complexity and cost of computational resources prevent RL algorithms from being successfully deployed in many real-world tasks.

We call the RL algorithms which explicitly learn a model from experiences model-based, and algorithms that directly learn from interactions without any model model-free. Although these two types of algorithms are both effective in learning, they usually differ in terms of sample complexity, computational complexity and

space complexity, which respectively measure the amount of interactions, computations, and memory an algorithm needs in RL tasks. Model-based algorithms are more sample efficient, but cost more computations and space. In contrast, model-free algorithms save computations and space, but usually need more samples/experience to learn, and easily get trapped in local optima.

In this paper, we focus on model-based algorithms due to the following two reasons. First, model-based algorithms make more efficient use of samples than model-free ones. Most existing PACMDP algorithms1 are model-based. Second, the learned model is an abstraction of the environment, and can be easily transferred to other similar tasks [8]. For instance, if we change the reward of a state, only the reward value of the state should be changed in the learned model in model-based methods. However, for model-free methods, many state and action values will be affected.

Our goal is to find a method that can reduce both the sample and computational complexity of model-based methods. We focus on the following challenges:

• High stochasticity. The transitions among states are usually stochastic. The highly stochastic transitions require a large number of trials and errors to reach the right decisions. Can we avoid visiting the highly stochastic transitions and still achieve a good policy?

• Dilemma between sample and computational complexity. Utilizing samples in an efficient manner requires more operations, while pursuing high speed may sacrifice accuracy and lead to more errors. Can we achieve both sample and computational efficiency?

• Interplay of exploration and exploitation. The trade-off between exploration and exploitation is a crucial problem in RL. Should we always iteratively alternate between exploration and exploitation as implemented by most existing algorithms? Recent efforts [20, 33, 38] improve the sample and computational efficiency of model-based RL algorithms from various perspectives. However, the internal structure of the underlying Markov Decision Process (MDP) is ignored. The internal structure of MDPs refers to the spectral properties of the transition and reward distributions. More explicitly, we observed that many MDPs have locally or globally interrelated dynamics, resulting from the existence of similar states, actions or transitions. For example, consider a simple 2 3 grid world in Table 1(left) where the agent can go up, down, left and right. The floor is slippery so when it wants to go up, it may slip to either left or right with probability 0.2. (Slipping also happens for

other actions). If there is a wall in the objective direction, it stays in the current state. Table 1(right) is the transition table from other states to state 2; the entries are the transition probabilities from state-action pairs to state 2. The rows of state 4 and 6 are omitted because state 2 is not immediately reachable from state 4 or 6. We find that the rows of state 2 and 5 are exactly the same, and the rank of this matrix is 3, smaller than the number of states or actions. This phenomenon becomes more ubiquitous in larger environments.

1 2 3

4 5 6

Table 1: A grid world example and the transition table to state 2.

Due to the existence of such similar structures, we do not have to learn every state and action thoroughly, and the exploration can be much more efficient. We propose to explore a subset of “essential” transition dynamics, then infer the remaining dynamics using spectral methods, to achieve sample and computational efficiency. In this paper, we propose a novel model-based RL algorithm called Greedy Inference Model (GIM), which utilizes the structural and spectral characteristics of MDPs and thus expedites the learning process. GIM introduces a novel exploration strategy to discover the unknowns efficiently, and a spectral method to estimate the entire model with the knowledge of a fraction of the model. The core idea of GIM can be applied to any model-based methods. We prove that GIM is PAC-MDP, and it has significantly lower computational complexity and potentially lower sample complexity than state-of-the-art model-based algorithms. Systematic empirical studies demonstrate that GIM outperforms both model-based and model-free state-of-the-art approaches on a wide variety of tasks. Our contributions are summarized as follows:

• To the best of our knowledge, we are the first to estimate the model by utilizing the internal structure of the MDPs with guaranteed accuracy. We avoid directly estimating the highly stochastic transitions, which is sampleconsuming.

• We show that GIM can significantly reduce the computational cost, as the number of dynamic programming operations is independent of the environment size. We also prove GIM could improve the sample efficiency of model-based algorithms.

• We propose a new mechanism to address the exploration and exploitation dilemma. By using a new exploration strategy (-curious walking), GIM takes fewer exploration steps in total than existing methods.

2 RELATED WORK

2.1 RL Algorithms

Model-based algorithms. Model-based algorithms like [24], RMax [6] and MBIE [35] construct a model from interactions, storing the transition probabilities and rewards of every state and action pair, and then make predictions with the model. Followup works improve the efficiency of aforementioned model-based algorithms. RTDP-RMAX and RTDP-MBIE [33] reduce the number of updates and achieve lower computational complexity, with minor performance loss on the accumulated rewards achieved. MORMAX [38] modifies RMax algorithm and reduces the sample complexity [12, 23] by maintaining an imperfect model, but the model estimation is not accurate which prevents accurate rewards predictions for some state-action pairs. [20] proposes a method with no dependence on the size of the state-action space, but it assumes that an approximate imperfect model is given.

Model-free algorithms. Model-free algorithms [18, 40, 41] decide what actions to take based on the trajectory/history. Delayed Qlearning [34] is a special model-free algorithm as it is PAC-MDP, whose sample complexity depends linearly on the state and the action number. However Delayed Q-learning has higher dependence on the discount factor and the error tolerance than RMax.

Deep RL algorithms. Recently, researchers have made significant progress by combining deep learning with both model-based or model-free RL [30, 39] and achieving impressive empirical performance. However theoretical understanding of deep learning , and thus deep RL, remains unsettled. Deep RL, usually applied for large-scale decision-making problems, requires large number of training examples, which are not practical for tasks with limited training examples.

PAC RL. A key goal of RL algorithms is to maximize the reward with as few samples as possible. The sample-efficiency of RL algorithms can be measured by the PAC performance metric (sample complexity) first formally defined in [23]. [12] derives a tighter PAC upper bound for episodic fixed-horizon RL tasks. Recently, more strict metrics like Uniform-PAC and IPOC [13, 14] are proposed to measure the performance of RL algorithms. And by computing certificates for optimistic RL algorithms [14], minimax-optimal PAC bounds up to lower-order terms are achieved under certain conditions.

2.2 Spectral Methods

Matrix completion. The spectral method we will use in this paper is mainly the well-studied matrix completion. It is proved that we can recover a matrix with only a fraction of its (noisy) entries [9, 26]. Spectral methods and RL. Spectral methods have been applied in RL in the learning of POMDP (Partially Observable Markov Decision Process) [4] and ROMDP (Rich-Observation Markov Decision Process) [3], where a multi-view model [2] is used. Researchers discover that knowledge can be transferred between tasks, domains or agents [7, 16, 28] using spectral methods. Moreover, some recent works propose new learning algorithms by constructing certain low-rank models [5, 21, 31], where spectral methods are involved. Low-rank transition model. There is a line of works learning the low-rank structure of the transition models [15, 22, 29, 42, 43], although we focus on different low-rank objects and use different models as well as assumptions.

3 NOTATIONS AND PROBLEM SETUP

3.1 Notations for RL

In this paper, we focus on episodic, discrete-time, and fixed horizon MDPs with finite state and action spaces. A Markov decision process (MDP) is defined as a tuple , where S is the state space (with cardinality S); A is the action space (with cardinality is the transition probability function with representing the probability of transiting to state from state by taking action is the reward function with recording the reward one can get by taking action in state is the initial state distribution. are called the dynamics of the MDP. We use H to denote the horizon (number of steps one can take in an episode) of an MDP.

Definition 3.1 (Dynamic Matrices). Given an MDP M denoted by tuple , we define S + 1 dynamic matrices are called transition dynamic matrices, where is called reward dynamic matrix in which

The empirical estimations of the dynamic matrices are:

where is the total number of visits to state-action pair the total number of transitions from to s by taking action and the total rewards 2 for . The empirical dynamic matrix ˆM is an approximation of the corresponding dynamic matrix M, so we have ˆM = M + Z where Z is a noise matrix. The more observations we have, the more accurate the approximation is.

Our main goal is to recover every . More explicitly, given the empirical dynamic matrix ˆM, the algorithm should return a matrix -close to the original

The value function of a policy for a given MDP M with horizon H is the expected average rewardThe optimal policy is the policy that achieves the largest possible value . In an RL task, an agent searches for the optimal policy by interacting with the MDP. The general goal of RL algorithms is to learn the optimal policy for any given MDP with as few interactions as possible. A widely-used framework to evaluate the performance of RL algorithms is sample complexity of exploration [23], or sample complexity for short.

Definition 3.2 (Sample complexity of exploration). For any 0 and 0 1, and at any episode t, if the policy generated by an RL algorithm is near-optimal at episode t. If with probability at least 1 , the total number of episodes that L is not near-optimal is upper bounded by a function is called the sample complexity of L.

Intuitively, sample complexity illustrates the number of steps in which the agent does not act near-optimally.

3.2 Notations for Spectral Methods

Incoherence [26] of a matrix is an important property that demonstrates the “sparsity” of the singular vectors of the matrix: all coordinates of each singular vector are of comparable magnitude (a.k.a., a dense singular vector) vs just a few coordinates having significantly larger magnitudes (a.k.a., a sparse singular vector).

A matrix with rank r has SVD , where U and V are orthonormal. We say M is -incoherent if (1) for all we have , and ; (2) There exist such that -th singular value of M.

The smaller are, the more spread-out the singular vectors are. As a result, matrix completion methods require a smaller number of known entries to confidently recover the entire matrix. See Appendix3 A for details about matrix completion and incoherence.

4 MOTIVATIONS

Before the formal introduction of our proposed learning algorithm, we consider two questions:

The key to RL is to evaluate the value of every state and action, with or without a model. The agent makes observations of each state-action pair, accumulates experience, and estimates the model or the values. It knows nothing about a state-action pair before meeting it. However, is it necessary to learn every new state-action pair from scratch? As humans, we can learn by analogy. For example, if one has jumped out of a window on the second floor and got injured, he will learn never to jump from another window on the third floor, because he accumulates knowledge from his previous experience and finds the internal connections between these two situations. But existing RL agents, which are not able to analyze new states, tend to make the same mistakes in similar situations.

Therefore, in this work, we extract and use the internal connections of the environment via spectral methods to reduce the unnecessary trials and errors for the agent.

The exploration-exploitation dilemma has been intensively studied for decades and remains unsolved. In RL, exploration is to try the unknowns, while exploitation maximizes rewards based on the current knowledge. Most RL algorithms interleave (or alternate between) exploration and exploitation. An example is the widely-used -greedy exploration method, which chooses actions greedily with respect to the action values with probability 1(exploit), and randomly chooses actions with probability (explore). Moreover, many model-based algorithms, such as and RMax, choose actions with the maximum value in known states, and execute the action that has been tried the fewest times in unknown states. However, is this interleaving the only manner to get the optimal results? If the agent “greedily” chooses the most rewarding action when it knows little about the whole environment, it usually misses the largest possible reward in the long run. Can the agent ignore the short-term benefits in the beginning, and keep exploring before it gains enough knowledge?

In this work, we implement a two-phase algorithm in which exploitation follows after exploration, instead of interleaving the two. And we prove that our new method requires fewer samples and computations.

5 GREEDY INFERENCE MODEL

In this section, we present a novel model-based RL algorithm called Greedy Inference Model (GIM) that considers the structural properties of MDPs and separates exploration from exploitation. More explicitly, two main ideas of GIM are (1) using matrix completion to recover/complete the dynamic matrices, and (2) greedily exploring the unknowns.

5.1 Complete Unknowns with Knowns

As in many model-based algorithms [6, 24], we distinguish all state-action pairs as “m-known” or “m-unknown” (we will say known and unknown for short) pairs: a state-action pair is known if we have visited it for over m times, so that the estimations for its transition probabilities and average reward are guaranteed to be nearly accurate with high probability. We use K to denote the set of all known state-action pairs, and ¯K for unknowns.

and A actions, the known-ness mask is a binary matrix defined as

Remark. The summation of all entries of the known-ness mask is the total number of known state-action pairs in the MDP. Row sums and column sums are the numbers of known state-action pairs related to every state or action respectively.

As discussed in Motivations, unlike previous model-based methods such as RMax, we avoid the necessity of observing and gaining knowledge on every single state-action pair as the MDPs usually have some internal structure/pattern. We use matrix completion, a widely used spectral method, to estimate the missing values in partially observed matrices. We now introduce how to complete the “unknowns” with “knowns” in an MDP.

5.1.1 Estimate Unknowns via Matrix Completion. Matrix completion is the problem of recovering unknown entries in the matrix from a small fraction of its known (noisy) entries, which are randomly sampled from the matrix. When the matrix satisfies some assumptions that we will discuss later, the recovery is guaranteed to be accurate, even under noisy known entries. Based on matrix completion theory, GIM needs only a fraction of state-action pairs to be known to recover the unknown state-action pairs. Now we formally define the matrix completion problem as the optimization problem: for every dynamic matrix M

where is the known-ness mask defined in Definition 5.1, denotes element-wise product, and r is the rank of M or the upper bound of the rank.

5.1.2 Requirements for Accurate Completion. Matrix completion makes it possible to know all the dynamics from some known state-action pairs, but the accuracy of completed dynamics is determined by the structure of the matrix, as well as the number and the locations of known entries. In general, matrix completion with noisy observations requires (1) the number of known entries be greater than some threshold, and (2) the known entries be spread out randomly. We propose the following exploration strategy that conforms to the two requirements above.

5.2 Greedily Explore the Environment

To satisfy the two requirements on the known state-action pairs and guarantee the success of matrix completion, we propose a new exploration strategy called

Let denote the fraction of known state-action pairs over all state-action pairs. Therefore . We introduce the concept of -known state below.

distinct actions such that the corresponding state-action pair (s,a) is known.

Intuitively, the idea of our proposed -curious walking is: if the current state s is not -known, choose an action a that the agent has taken the most but (s,a) is still unknown; if the current state is -known, select the action which most likely leads to a non--known state. The agent also chooses actions randomly with a small probability to avoid being trapped in local optima.

Algorithm 1 shows the procedure of , where Random() generates a random number from a uniform distribution in [0, 1]; n(s,a) is the total number of visits to state-action pair (s,a); is the total number of transitions from s to by taking action a; the indicator function -known, and 0 otherwise.

Compared with the balanced walking method used in both and RMax, -curious walking:

• Encourages the agent to choose the actions it has the most experience with, until the action is known for the current state. So the agent rapidly knows state-action pairs.

• Spreads the knowledge evenly, i.e., the agent attempts to know every state with actions instead of attempting to know all A actions for some states but nothing for other states.

5.3 GIM Algorithm

The proposed GIM algorithm is described in Algorithm 2. When there are less than -known state-action pairs, the agent keeps exploring with -curious walking (see lines 7-14). As long as pairs are m-known, the algorithm performs matrix completion for all dynamic matrices (see lines 15-20). MatComp could be any off-the-shelf matrix completion algorithm that solves the problem defined in Equation (3). Note that matrix completion algorithms implicitly estimate the rank r of the input matrix, so there is no need to specify the rank as an input to Algorithm 2. Known threshold. m is the least number of visits to one state-action pair to make the estimation and the completion accurate. The choice of m is specified in Theorem 6.3. controls the fraction of known state-action pairs, based on which the matrix completion can get convincing results for unknowns. The value of is determined by the structure of the underlying MDP. The more the states and actions in the MDP are interrelated, the smaller can be. If the underlying MDP has completely unrelated dynamics, then to be 1, and matrix completion does nothing but returning the empirical transition model itself.

We will further discuss parametersin theory and practice in the next two sections. More importantly, we will show that the advantage of our proposed algorithm over previous model-based algorithms is larger for smaller . Even under the worst scenario of -curious walking improves learning efficiency under certain conditions.

14 if

Set all state-action pairs as known

Compute the optimal policy ˜

21 else

with optimal policy ˜

5.4 GIM As a Framework

Although Algorithm 2 estimates the dynamics by directly averaging collected samples, which is similar to the classic RMax algorithm, GIM can also be regarded as a framework and can be combined with other model-based methods. The simple RMax-style structure in Algorithm 2 is an illustration of how GIM could be combined with a model-based method; the analysis we will provide in Section 6 exhibits how GIM could improve a model-based method.

The key ideas of GIM are to infer the unknown dynamics as well as to know the environment greedily and evenly, which improve the model-based method combined. For example, in algorithms driven by confidence interval estimation, such as MBIE [35], UCRL2 [19] and etc, we can also use matrix completion to recover the “uncertain dynamics” using the “dynamics with high confidence”. This extra operation will not affect what has been learned, but rather make a guaranteed estimation of the unlearned parts. Therefore, the learning process is boosted by utilizing the internal structures of the environment, and as a result, samples are saved.

Overall, it is not our goal to propose a specific algorithm with the best complexity. Instead, we attempt to improve any model-based algorithm by inferring the dynamics.

6 THEORETICAL ANALYSIS ON COMPLEXITIES

In this section, we analyze the computational complexity, sample complexity and space complexity of GIM. By comparing with existing model-based methods, we show that GIM achieves a much better computational complexity and improves the sample complexity under mild conditions.

6.1 Computational Complexity

Theorem 6.1 states the computational complexity of GIM.

Theorem 6.1 (Computational Complexity of GIM). Given an MDP M with S states and A actions, if GIM is executed for N steps, then the total computational complexity of GIM is

wherebe the number of computations for one dynamic programming operation, i.e., updating the policy by solving Bellman equations.

depends on the environment size. More specifically, if the maximum number of iterations for dynamic programming is set as

Proof. During the execution of GIM, both dynamic programming and matrix completion are implemented only once, which lead to and [17] computations. For every time step, GIM updates , which can be done in constant time. Although -curious walking in Algorithm 1 performs a loop over all the actions, in practice we are able to find the best action within constant time or logarithmic time through maintaining the known-ness table. These constant-time per step computations together lead to the

Comparison with RMax and RTDP-RMAX[33]. We first compare the cost of dynamic programming, the major computational burden for most RL algorithms. Since within each dynamic programming, the amount of computation required is the same for GIM and other model-based methods, we use the number of dynamic programming operations as the metric for computation complexity comparison.

Lemma 6.2. The number of dynamic programming operations required by GIM is O(1), whereas the number of dynamic programming operations required by RMax and RTDP-RMAX are

RMax’s computational complexity is , as RMax computes the action values every time a new state is known. RTDP-RMAX is proposed to reduce the computational complexity of RMax. It initializes all action values to be , the maximum possible value for an episode, and only updates the value of one state-action pair when the value decreases more than some threshold . So it requires at most computations for the dynamic programming, where is the error tolerance of action-value estimation. Thus in terms of dynamic programming computation complexity, GIM is much faster than RMax and RTDP-RMAX.

Besides the dynamic programming, as shown in Theorem 6.1, GIM requires some constant-time operations per step, which is inevitable for all algorithms, and an extra matrix completion computation, which is a one-time cost and is negligible in a long learning process. Experiments in Section 7.1.4 verifies this fact.

6.2 Sample Complexity

As stated in Section 5.4, what we propose is a new exploration and estimation approach, that could be combined with model-based PAC algorithms [12, 36, 38] to get lower sample complexity. In this section, we analyze the sample complexity of Algorithm 2 by adapting the analysis of RMax [23]. RMax is chosen due to its simplicity and versatility.

We now introduce a few notations that are essential in our analysis. (1) Denote the upper bounds of the condition number and the rank of every dynamic matrix by and r. (2) All dynamic matrices are at least -incoherent. (3) Let , and

We also make the following two mild assumptions.

Assumption 1. There is a known diameter D, such that any state is reachable from any state s in at most D steps on average. Assume that the diameter D is smaller than the horizon H.

The assumption about diameter is commonly used in RL [19], and it ensures the reachability of all states from any state on average. It is mild to assume D < H as the horizon is often set large.

Assumption 2. The distribution of the estimation noise, ˆand , is sub-Gaussian with 0 mean. And the estimation noises for different state-action pairs are independent.

This modeling of difference between the ground-truth probability and empirical estimation using sub-Gaussian variables is widely used.

The sample complexity of GIM is in Theorem 6.3.

Theorem 6.3 (Sample Complexity of GIM). Given an MDP M with fixed horizon H and diameter D, suppose the upper bounds of the condition number, rank and incoherence parameters of dynamic matrices are and , for any 0 1, 0 1, with

completion fraction

and the known threshold

algorithm 2 produces a policy ˆ, which satisfies but episodes, with probability at least 1

Remark. For low-rank dynamic matrices,tends to be small. In our experiments, is typically less than 2. If we regard stant, the sample complexity of GIM becomes

Proof Sketch. We first set m to be the least number of visits to a state-action pair to make it known (see Condition 1 in Appendix B.1). Then, we prove with at most episodes, we know pairs as Lemma B.1. Finally, we prove the value of m should be full proof and the lemmas are in Appendix B.

settings, is [23]. We compare the sample complexity of GIM and RMax in the following scenarios. GIM has lower sample complexity than RMax if (2) When S > A. GIM has lower sample complexity than RMax if We deactivate the matrix completion steps by simply setting , when the un- derlying MDP does not have any inner-related structure. This makes GIM follow -curious walking until all the state-actions are known. In this case, the sample complexity becomes Because and 1 is close to 1, GIM generates less non--optimal episodes than RMax does. So -curious walking strategy itself saves samples.

Note that , so the conditions in (1) and (2) are satisfied for most tasks.

Achieve Lower Sample Complexity. One may note that the sample complexity bound in Theorem 6.3 is not optimal in terms of the dependency on S and A. When A > S, GIM needs ˜to learn a near-optimal policy. However, the best known bounds of model-based algorithms are of order ˜O(SA) [13, 14, 38]. The saved factor of S results from the direct analysis of the value function (and an imperfect model approximation). Since we claim that GIM can work as a framework, can GIM also achieve linear dependency? In our analysis and the original analysis of RMax, the known threshold is at least m = O(S lnS), but as indicated by [25], with specific updating strategies, m = O(lnS) samples might be enough to maximize the rewards. So it is possible for GIM to avoid an S factor by incorporating re-estimating methods, although this is out of the scope of this paper.

6.3 Space Complexity

The memory GIM needs is mainly for the storage of dynamic matrices. Similar with other model-based RL algorithms, the space complexity of GIM is , as we have S + 2 matrices with size . In contrast, the space complexity of model-free algorithm such as Delayed Q-learning could be as low as O(SA). Although a large space complexity seems to be unavoidable for model-based methods, one can consider storing the sparse dynamics in a sparse format where only non-zero elements are stored if the dynamic matrices are sparse. Then the space complexity will be reduced to is the number of non-zero entries.

7 EXPERIMENTS

7.1 Performance on Multiple Tasks

7.1.1 Tasks. To exhibit the universal applicability of GIM, we conduct experiments on multiple tasks of varying levels of hardness: (1) Synthetic. We create various MDPs by randomly generating the dynamic matrices with varying numbers of states, actions and ranks. (2) GridWorld. A classic grid world task, with world size 4 probability 0.4 and step cost 0.2. (3) CasinoLand. A challenging task constructed by [32], which consists of six rooms and three levers. Pulling some levers may lead to a large reward with a small probability. (4) RiverSwim. Another challenging task constructed by [32], where a chain of 6 states represents a river, and an agent needs to “swim” from the first state to the last one to get a large reward. See Appendix C.1 for detailed description of CasinoLand and RiverSwim.

7.1.2 Baselines. To verify the effectiveness and efficiency of GIM, it is compared against popular model-based and model-free methods: RMax, Q-learning, Delayed Q-learning, and Double Qlearning [18] methods. We select RMax among all model-based methods, because Algorithm 2 is designed on the basis of RMax. So the effectiveness of our proposed strategies can be justified by comparing with RMax. Note that it is also possible to apply similar strategies to other existing model-based algorithms, as claimed in Section 5.4.

Moreover, we implement an “optimal” agent which knows all the dynamics and deploys the optimal policy from the beginning, as well as a “random” agent which chooses action randomly. The “optimal” agent is the best any agent could achieve, while the “random” agent is the worst any agent could perform. We use the simple_rl framework provided by Abel [1] to conduct the experiments.

7.1.3 Reward Comparison. We demonstrate the average cumulative reward over 20 runs of GIM and baselines on various tasks in Figure 1. The plots are smoothed out by plotting every few hundred epochs for better illustration. Note that a line of reward increases linearly after it finds the best policy. Agents whose sample complexities are lower converge to the best policy earlier. For hyperparameters, we set the known threshold m for both RMax and GIM to be 40, and set the completion fraction threshold for GIM to be 0.8. We will discuss the influence of different settings on RMax and GIM later. Among all these methods, GIM achieves the highest total

Figure 1: Comparison of the mean cumulative reward over 20 runs of GIM and baselines on various tasks.

reward. Although Q-learning, Delayed-Q and Double-Q converge to a good policy quickly, they are sometimes trapped in local-optima and cannot win in the long run. On the contrary, RMax figures out a policy that is near-optimal, but it often takes more episodes to converge to that policy. GIM avoids these two drawbacks; it converges quickly and the returned policy is near-optimal.

Table 2: Comparison of running times in seconds.

7.1.4 Running Time Comparison. Table 2 provides the average running times of each agent on various tasks. In general, our model-based GIM is faster than model-based RMax, and is even comparable to the model-free methods which are generally faster as there is no need to maintain a model. For RiverSwim, GIM is slightly slower than RMax as the one-time cost of computation of matrix completion slows down the GIM agent. This one-time cost is less significant for time-consuming tasks where running time is the bottleneck.

7.1.5 Scale Up to Larger Environments. Although many model-based algorithms perform well in small environments, it is usually not easy for them to work on large-scale problems. However, GIM avoids exhaustedly visiting and estimating the whole environment by inferring some dynamics. When the environment is highly internal-dependent, only the knowledge of a small fraction of the space is needed. Thus fewer samples and computations are consumed. To evaluate the scalability of GIM compared with baselines, we gradually enlarge the size of the synthetic task and show the cumulative rewards in Figure 2. GIM performs well compared with baselines, while RMax fails to converge to the optima within the

given number of episodes. We set the known thresholds for both GIM and RMax to be 100. As a result, the performance of learned policy by GIM is slightly worse than Q-learning, since m = 100 is not adequate for estimating the large environments. But one can anticipate higher rewards by GIM, once m is set higher.

The running times corresponding to Figure 2, are shown in Table 3, where one can find GIM is much faster than RMax and Double Q-learning. As the environment size gets larger, GIM takes slightly longer to run, while RMax spends much more time due to the increased computation requirements.

Figure 2: Comparison of the mean cumulative reward of GIM and baselines on synthetic tasks with various environment size.

Table 3: Comparison of running times (in seconds) on synthetic tasks with various environment size. The number of episodes are all set to be 2

7.2 Experiments on Parameters

In practice, it is important to select an appropriate known threshold m for GIM and RMax. GIM also requires a completion fraction . Although we can choose based on experience, it is related to the properties of the dynamic matrices () as proved in Theorem 6.3. So, we design and conduct a series of systematic tests to study how (for fixed ) influence the learning effectiveness and efficiency of GIM, in comparison with RMax.

The following three measurements are evaluated. (1) AvgReward: average reward per episode; (2) TotalEps: number of episodes needed to know all states and actions; and (3) PostAvgReward: average reward after knowing all states and actions (after exploration), which reflects accurateness of the learned dynamics.

7.2.1 Values of the Known Threshold m. We run experiments on synthetic and GridWorld tasks with different known threshold m. Figure 3 shows the AvgReward, PostAvgReward and TotalEps of RMax and GIM with different values of m. For the same m, GIM gains more rewards (Figure 3a), and completes exploration faster than RMax (Figure 3b), with a slightly worse returned policy (Figure 3c). When m varies, RMax requires much more episodes to explore, while GIM is more robust to the changing m as shown in Figure 3b. We present the results on higher-rank synthetic tasks and the GridWorld task in Appendix C.2.

Figure 3: Comparison of AvgReward, TotalEps, and PostAvgReward of RMax and GIM on synthetic task with different known thresholds. S=20, A=10 and rank=2.

ward of GIM and RMax under varying dynamic matrix ranks r’s, incoherence parameter ’s and condition number ’s in Figure 4. (1) Influence of r. As Figure 4a shows, the average rewards of both GIM and RMax drop slightly when rank becomes higher. Even if the dynamic matrix is full-rank, the policy returned by GIM still obtains a high reward. We see from Figure 4b and Figure 4c that GIM outperforms RMax, and does not change much with varying

Experiments of matrix properties are conducted on synthetic tasks because it is easy to control the properties of their underlying MDPs. But the observed connections between studied properties and learning results can be extended to any other tasks. Additional experimental results including PostAvgReward, TotalEps are shown in Appendix C.3, as well as the results for

Figure 4: Comparison of AvgReward of GIM and RMax with varying ranks, 20, A = 10 and m = 40. The generated ranks for (b) and (c) are 2 and 4 respectively.

8 CONCLUSION

This paper proposes a model-based RL algorithm called GIM. GIM utilizes the internal structures of MDPs to infer the unknown dynamics, and uses a novel exploration strategy to efficiently explore the environment. Theoretical analysis and empirical results show that GIM can reduce both sample complexity and computational complexity when combined with classic model-based algorithms. We envision incorporating our proposed techniques for multi-task RL where internal structures of MDPs can connect different tasks.

REFERENCES

[1] David Abel. 2019. simple_rl: Reproducible Reinforcement Learning in Python. (2019).

[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15, 1 (2014), 2773–2832.

[3] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. 2016. Reinforcement learning in rich-observation MDPs using spectral methods. arXiv preprint arXiv:1611.03907 (2016).

[4] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. 2016. Reinforcement learning of POMDPs using spectral methods. arXiv preprint arXiv:1602.07764 (2016).

[5] Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. 2011. Closing the learningplanning loop with predictive state representations. The International Journal of Robotics Research 30, 7 (2011), 954–966.

[6] Ronen I. Brafman and Moshe Tennenholtz. 2003. R-max - a General Polynomial Time Algorithm for Near-optimal Reinforcement Learning. J. Mach. Learn. Res. 3 (March 2003), 213–231. https://doi.org/10.1162/153244303765208377

[7] Stefano Bromuri. 2012. A Tensor Factorization Approach to Generalization in Multi-Agent Reinforcement Learning. In 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 2. IEEE, 274–281.

[8] Emma Brunskill and Lihong Li. 2013. Sample complexity of multi-task reinforcement learning. arXiv preprint arXiv:1309.6821 (2013).

[9] Emmanuel J Candes and Yaniv Plan. 2010. Matrix completion with noise. Proc. IEEE 98, 6 (2010), 925–936.

[10] Emmanuel J Candes and Terence Tao. 2009. The power of convex relaxation: Near-optimal matrix completion. arXiv preprint arXiv:0903.1476 (2009).

[11] Yudong Chen. 2013. Incoherence-Optimal Matrix Completion. Information Theory IEEE Transactions on 61, 5 (2013), 2909–2923.

[12] Christoph Dann and Emma Brunskill. 2015. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2818–2826. http://papers.nips.cc/paper/ 5827-sample-complexity-of-episodic-fixed-horizon-reinforcement-learning. pdf

[13] Christoph Dann, Tor Lattimore, and Emma Brunskill. 2017. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems. 5713–5723.

[14] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. 2018. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056 (2018).

[15] Yaqi Duan, Mengdi Wang, Zaiwen Wen, and Yaxiang Yuan. 2018. Adaptive Low-Nonnegative-Rank Approximation for State Aggregation of Markov Chains. arXiv preprint arXiv:1810.06032 (2018).

[16] Kimberly Ferguson and Sridhar Mahadevan. 2006. Proto-transfer learning in markov decision processes using spectral methods. Computer Science Department Faculty Publication Series (2006), 151.

[17] David Gamarnik, Quan Li, and Hongyi Zhang. 2017. Matrix Completion from O(n) Samples in Linear Time. arXiv preprint arXiv:1702.02267 (2017).

[18] Hado V. Hasselt. 2010. Double Q-learning. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). Curran Associates, Inc., 2613–2621. http://papers.nips.cc/ paper/3964-double-q-learning.pdf

[19] Thomas Jaksch, Ronald Ortner, and Peter Auer. 2010. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11, Apr (2010), 1563–1600.

[20] Nan Jiang. 2018. PAC reinforcement learning with an imperfect model. In ThirtySecond AAAI Conference on Artificial Intelligence.

[21] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. 2017. Contextual decision processes with low Bellman rank are PAClearnable. In Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org, 1704–1713.

[22] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. 2019. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388 (2019).

[23] Sham Machandranath Kakade et al. 2003. On the sample complexity of reinforcement learning. Ph.D. Dissertation. University of London London, England.

[24] Michael Kearns and Satinder Singh. 2002. Near-optimal reinforcement learning in polynomial time. Machine learning 49, 2-3 (2002), 209–232.

[25] Michael J Kearns and Satinder P Singh. 1999. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems. 996–1002.

[26] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. 2010. Matrix completion from noisy entries. Journal of Machine Learning Research 11, Jul (2010), 2057–2078.

[27] Jens Kober, J Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32, 11 (2013), 1238–1274.

[28] Alessandro Lazaric, Emma Brunskill, et al. 2013. Sequential transfer in multiarmed bandit with finite set of models. In Advances in Neural Information Processing Systems. 2220–2228.

[29] Xudong Li, Mengdi Wang, and Anru Zhang. 2018. Estimation of Markov chain via rank-constrained likelihood. arXiv preprint arXiv:1804.00795 (2018).

[30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.

[31] Hao Yi Ong. 2015. Value function approximation via low-rank models. arXiv preprint arXiv:1509.00061 (2015).

[32] A Strehl and M Littman. 2004. Exploration via model based interval estimation. In International Conference on Machine Learning. Citeseer.

[33] Alexander L Strehl, Lihong Li, and Michael L Littman. 2012. Incremental model-based learners with formal learning-time guarantees. arXiv preprint arXiv:1206.6870 (2012).

[34] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. 2006. PAC Model-free Reinforcement Learning. In Proceedings of the . ACM, New York, NY, USA, 881–888. https://doi.org/10.1145/1143844.1143955

[35] Alexander L. Strehl and Michael L. Littman. 2005. A Theoretical Analysis of ModelBased Interval Estimation. In Proceedings of the 22Nd International Conference . ACM, New York, NY, USA, 856–863. https: //doi.org/10.1145/1102351.1102459

[36] Alexander L Strehl and Michael L Littman. 2008. An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci. 74, 8 (2008), 1309–1331.

[37] Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.

[38] István Szita and Csaba Szepesvári. 2010. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 1031–1038.

[39] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence.

[40] Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.

[41] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.

[42] Lin F Yang and Mengdi Wang. 2019. Reinforcement Leaning in Feature Space: Matrix Bandit, Kernels, and Regret Bound. arXiv preprint arXiv:1905.10389 (2019).

[43] Lin F Yang and Mengdi Wang. 2019. Sample-optimal parametric q-learning with linear transition models. arXiv preprint arXiv:1902.04779 (2019).

[44] Jing Zhang, Bowen Hao, Bo Chen, Cuiping Li, Hong Chen, and Jimeng Sund. 2019. Hierarchical Reinforcement Learning for Course Recommendation in MOOCs. Psychology 5, 4.64 (2019), 5–65.

Appendix: Can Agents Learn by Analogy? An Inferable Model for PAC Reinforcement Learning

A MATRIX COMPLETION AND INCOHERENCE PROPERTY

Matrix Completion

The matrix completion problem, which considers the recovery of some missing entries from random sampling of the matrix, is widely-used in many practical fields like recommendation, information diffusion and etc.

A typical matrix completion problem can be stated as follows. We have a numerical matrix M with rows and columns. If we only observe m entries where , are we able to recover the remaining unknown entries accurately and efficiently? Apparently, we cannot find the true recovery without any restriction and additional information. The matrices which we can complete from partial entries must have some certain kinds of structure. One important property is the low-rankness, i.e., the matrix should be low-rank or approximately low-rank. An matrix M with rank r can be represented by numbers, but it only has degrees of freedom (can be derived through counting parameters of SVD). So it is unnecessary to known all the entries to determine the matrix M. When r is small, theoretically, one can uniquely determine the whole matrix with only a small fraction of the matrix. Incoherence Conditions

However, not any arbitrary low-rank matrix can be recovered from a few entries. For example, the following rank-1 matrix only has one non-zero entry (1), so it is impossible to know the matrix exactly unless all of the entries have been observed.

where -th standard basis with appropriate dimension. The smallest possible for any matrix can be 1, when the singular vectors all have equal magnitudes. And it is easy to get . The more concentrated the singular vectors are, the higher

To get guarantees for all values of the rank, [10] proposes another incoherence condition called strong incoherence condition. A matrix M satisfies the strong incoherence condition with parameter

The strong incoherence condition requires the left and right singular vectors of the matrix to be unaligned with each other. While a paper [11] has shown that this condition is not necessary for matrix completion without noise.

In our model, we perform matrix completion on the dynamic matrix, which is the transition probabilities to a specific state from all state-actions pairs. In order to get higher recovery accuracy with less known state-action pairs, it is required that (1) the matrix is low-rank, which is true when different actions/states have the same transition probabilityies to the same state. (2) the transitions to a specific state are not highly concentrated on some state-action pairs.

B PROOF DETAILS OF THEOREM 6.3 B.1 Sample Complexity with Known Threshold m

As in RMax, let m be the number of visits needed to make a state-action pair known, and we assume the following condition holds for m:

Then with the parameter m, we can express the sample complexity as

Lemma B.1 (Sample Complexity in terms of m). Let M be an MDP with fixed horizon is the policy computed by GIM, then for any starting state , with probability at least 1 for all but

Proof. (of Lemma B.1) Because of Condition 1, i.e., with probability at least 1

So, after completion, the required value can immediately be reached. We only need to bound the number of exploration episodes.

Due to the -curious walking strategy, in every step, one of the following three cases will happen: (1) with probability , a random action

is selected. (2) with probability 1 , visit an unknown state-action pair. (3) with probability 1 , choose an action which has the largest

probability to a non--known state.

Note that the exploration ends when there are known state-action pairs, and m visits are needed to make an unknown pair known. For case (2), a visit to unknown pair happens. For case (1) and (3), the agent starts a trajectory going to a non--known state, and then either

(1) or (2) happens. Thus, with probability 1 , there is a successful visit to an unknown pair within at most D steps. Further, in N steps, the expected number of visits to unknown pairs is . Then, using Hoeffding’s inequality, we can get

steps are enough to have visits to unknown pairs with probability at least 1 Therefore, the number of exploration episodes is bounded by , with probability at least 1 . Note that the order of the result will not be changed even if the agent is reset in the beginning of every episode and has to restart a trajectory, since we assume

In order to prove Theorem 6.3, we need the following two lemmas: simulation lemma and perturbation bound lemma.

Lemma B.2 (Simulation Lemma). Given two MDPs in the same state-action space with the same horizon H, and their corresponding dynamics , if for any state s and action a,

then for any policy

Proof. (of Lemma B.2) We let denote a trajectory in MDP, and the set of all trajectories of length the probability of observing trajectory M following policy the expected average reward of going through trajectory Then the difference between the values of two MDPs under policy is (For simplixity, we omit the superscript for all variables.)

The bound of the second term is given by

So by recursing, we can get the final bound over the second term, and then the bound of Lemma B.2 holds.

Lemma B.3 (Perturbation Bound of Dynamic Matrices Completion). Let be the groud truth dynamic matrices for a specific RL task, the empirical dynamic matrices for the same task, and the knownness mask. Let be the noise matrices, so . If the fraction of observed entries

and if then with probability at least 1 , for any 1, the result dynamic matrix returned by the function MatrixCompletion

where are respectively the supremum of the condition number and the rank of all dynamic matrices; is the maximum variance proxy for noisy entries.

To prove Lemma B.3, we first bound the perturbation of matrix completion. There are many existing works about matrix completion, and since the specific matrix completion algorithms are not our focus in this paper, we simply choose a theoretical bound stated by Theorem 1.2 of paper [26] and reorganize it as Lemma B.4. Note that a tighter bound can be achieved with other algorithms and theories.

set of entries . If we observe is a noise matrix, whose entries are independent random variables,

with zero mean and sub-gaussian tails with variance proxy . Then with probability 1 , the completed matrix ˜M satisfys

provided that

The proof of this lemma is in the third section of paper [26]. Now we can proceed to prove Lemma B.3.

Proof. (of Lemma B.3)

We first analyze the k-th dynamic matrix denote the rank, the condition number, and the maximum absolute entry

value of respectively.

Because of the sub-Gaussian assumption, the entries of are independent random variables with zero mean, and sub-Gaussian tails with variance proxy . Then we have

Sum over all dynamic matrices, we get

The above inequality is equivalent to

It implies that for any state s and action a, we have:

Because of the equivalence of norms, the sum of absolute values of dynamic differences (a.k.a. the norm) satisfies

Then we get

B.2 Proof of Theorem 6.3

Now we can proceed to prove the main theorem of sample complexity. Setting the known threshold m. In Lemma B.1, we assume that we know a good known threshold m such that the estimated values of the completed dynamic matrices can achieve a near-optimal value (Condition 1 holds). For similicity, we define an operator to measure the distance of two MDPs as

Based on Lemma B.2, the above equation can be satisfied when

Lemma B.3 shows that, when the known fraction is greater than some threshold, we have

with probability 1 are respectively the supremum of the condition number and the rank of all dynamic matrices; the maximum variance proxy for noisy entries.

For any state-action pair (s,a) and any transit-in state be the estimated transition probability . For simplicity, let p denote , Using Hoeffding’s bound, we get

then Condition 1 holds. Combining the result. By replacing the known threshold m in Lemma B.1 with the value of m, as well as setting the failure probability of exploration and satisfying Condition 1 to be both 2, we can derive to the theorem.

C SUPPLEMENTAL EXPERIMENT RESULTS C.1 CasinoLand and RiverSwim

C.2 Selection of Known Thresholds

We design and conduct a series of systematic tests to study how the value of hyper parameters and how the MDP properties influence the learning effectiveness and efficiency of GIM. Similarly, every experiment is executed for 20 runs. The following three measurements are evaluated: (1) AvgReward Average reward per episode; (2) TotalEps Number of episodes needed to know all states and actions; and (3) PostAvgReward Average reward after knowing all states and actions (after exploration), which reflects accurateness of the learned dynamics.

For both RMax and GIM, the known threshold m is the key to the practical performance of the algorithm. Although the theory tells what the optimal order of the known threshold should be, there is an unknown constant that we need to determine in practice. If the known threshold is too low, the estimated dynamics are likely to be far away from the true dynamics. But if the known threshold is set to be very high, the agent may spend too much time exploring the environment without “knowing” anything, which leads to low total reward in finite-episode learning.

Figure 7: Comparison of AvgReward on synthetic tasks where S=20, A=10 and rank varies from 2 to 10.

Figure 7, 8 and 9 shows the AvgReward, TotalEps and PostAvgReward of RMax and GIM with different known thresholds on the synthetic tasks. To make the comparison easier to interpret, we set the average rewards obtained by the optimal policies as 1.0, and normalize the rewards of RMax and GIM. To make the results more convincing, we generate synthetic dynamic matrices with rank (roughly) equals to 2, 4, 6, 8, 10 for MDPs with 20 states and 10 actions.

In Figure 7, we find that within the same number of episodes, higher known thresholds result in lower per-episode reward, because more episodes are spent on exploring. However, the average reward of GIM is decreasing much slower than RMax, and GIM always keeps a very high reward rate (> 95%) as known threshold increases.

The reason for their different AvgReward changes are explained by Figure 8 and 9. Figure 8 shows that GIM is able to finish exploration in much fewer episodes than RMax, and does not rise as drastically as RMax with the increasing known threshold. The fast exploration of GIM is not surprising because of -curious walking and the completion method. But the accuracy of the dynamics returned by matrix completion remains questionable. Without effictive exploitation, efficient exploration is useless. So we also compare the corresponding PostAvgReward in Figure 9, where we can see, because matrix completion introduces new noises, GIM is not as accurate as RMax with the same known threshold. But the fact is, the difference is very small (< 1%). Therefore, unless the number of episodes is extremely large or infinite so that the exploration steps can be ignored, GIM tends to gain more rewards in total.

Figure 8: Comparison of TotalEps on synthetic tasks where S=20, A=10 and rank varies from 2 to 10.

Figure 9: Comparison of PostAvgReward on synthetic tasks where S=20, A=10 and rank varies from 2 to 10.

According to the above results, GIM outperforms RMax since it can get higher per-episode reward, and can find the near-optimal policy in less episodes. Futherthemore, GIM is less sensitive to the setting of the know threshold than RMax, while RMax needs some efforts to tune the know threshold well. Although the performance of GIM drops with the increasing rank, the degree of the drop is slight.

Experiments on the grid world task also demonstrate the similar results, which are all shown in Figure 10.

Figure 10: Comparison of AvgReward, TotalEps and PostAvgReward on 4 GridWorld tasks

Based on the comprehensive comparison with RMax about known thresholds, we conclude that (1) GIM can explore the unknowns much faster than RMax, without sacrificing too much accuracy of predicting. (2) GIM is more robust to the selection of known threshold than RMax, which makes it easier to be used and tuned in practice.

C.3 Properties of Dynamic Matrices

Figure 11: Comparison of AvgReward, TotalEps and PostAvgReward of GIM and RMax on synthetic task where S = 20,A = 10 with different ranks.

C.3.1 Influence of Matrix Rank r. In the above section, we can see that the larger the rank is, the less accurate the return policy is (according to PostAvgReward). To make the relation between rank and the learning results more clear, we visualize how GIM and RMax fluctuate when the rank of the underlying dynamic matrices changes in Figure 11. As the rank increases, the TotalEps of GIM keeps stable, whereas PostAvgReward drops because matrix completion works worse for high ranks. So GIM is more preferable for tasks with low-rank structures, i.e., similar transitions exist. But for tasks which do not have such structures, GIM is still able to explore efficiently and get near-optimal rewards.

and Figure 14 respectively represent the AvgReward, TotalEps and PostAvgReward of GIM and RMax on a series of synthetic tasks with different incoherence parameters () and condition number Based on the figures, we conclude that GIM does not change a lot with , in terms of efficiency and accuracy.

Figure 12: Comparison of AvgReward, PostAvgReward and TotalEps of GIM and RMax on synthetic task where S = 20,A = 10 with different incoherence parameter

Figure 13: Comparison of AvgReward, PostAvgReward and TotalEps of GIM and RMax on synthetic task where S = 20,A = 10 with different incoherence parameter

Figure 14: Comparison of AvgReward, PostAvgReward and TotalEps of GIM and RMax on synthetic task where S = 20,A = 10 with different condition number