Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

2020·Arxiv

Abstract

Abstract

We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy using offline data collected by running a logging policy . Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon H, which motivates a splurge of recent interest in alternatives that break the “Curse of Horizon” (Liu et al., 2018a; Xie et al., 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order ) in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of H away from a Cramer-Rao lower bound of order Ω(). In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.

1 Introduction

Off-policy evaluation (OPE), which predicts the performance of a policy with data only sampled by a logging/behavior policy (Sutton & Barto, 2018), plays a key role for using reinforcement learning (RL) algorithms responsibly in many real-world decision-making problems such as marketing, finance, robotics, and healthcare. Deploying a policy without having an accurate evaluate of its performance could be costly, illegal, and can even break down the machine learning system. There is a large body of literature that studied the off-policy evaluation problem in both theoretical and application-oriented aspects. From the theoretical perspective, OPE problem is extensively studied in contextual bandits (Li et al., 2011; Dud´ık et al., 2011; Swaminathan et al., 2017; Wang et al., 2017) and reinforcement learning (RL) (Li et al., 2015; Jiang & Li, 2016; Thomas & Brunskill, 2016; Farajtabar et al., 2018; Xie et al., 2019) and the results of OPE studies have been applied to real-world applications including marketing (Theocharous et al., 2015; Thomas et al., 2017) and education (Mandel et al., 2014).

Problem setup. In the reinforcement learning (RL) problem the agent interacts with an underlying unknown dynamics which is modeled as a Markov decision process (MDP). An MDP is defined by a tuple are the state and action spaces, [0, 1] is the transition kernel with ) representing the probability of seeing state after taking action a at state is the mean reward function with ) being the average immediate goodness of (s, a) at time t. Also, is denoted as the initial state distribution and H is the time horizon. The subscript t in means the transition dynamics are non-stationary and could be different at each time t. A (non-stationary) policy assigns each state a distribution over actions at each time ) is a probability simplex with dimension |S|. For brevity we suppress the subscript t of and denote ) the p.m.f of actions given state at time t.

Given a target policy of interest , then the distribution of one H-step trajectory = () is specified by := (as follows: , for t = 1) and random reward has mean ). Then the value function under policy is defined as:

The OPE problem aims at estimating while given that n episodic data

For brevity, to denote the pair (). This can be understood as: To distinguish the data from different episodes, we use superscript to denote which episode they belong

are actually coming from a different logging policy

Existing methods. The classical way to tackle the problem of OPE relies on incorporating importance sampling weights (IS), which corrects the mismatch in the distributions under the behavior policy and target policy. Specifically, define the t-step importance ratio as := ), then it uses the cumulative importance ratio := to create IS based estimators:

where = ). There are different versions of IS estimators including weighted IS estimators and doubly robust estimators (Murphy et al., 2001; Hirano et al., 2003; Dud´ık et al., 2011; Jiang & Li, 2016).

Even though IS-based off-policy evaluation methods possess a lot of advantages (e.g. unbiasedness), the variance of the cumulative importance ratios may grow exponentially as the horizon goes long. Attempts to break the barriers of horizon have been tried using model-based approaches (Liu et al., 2018b; Gottesman et al., 2019), which builds the whole MDP using either parametric or nonparametric models for estimating the value of target policy. (Liu et al., 2018a) considers breaking the curse of horizon of time-invariant MDPs by deploying importance sampling on the average visitation distribution of state-action pairs, (Hallak & Mannor, 2017) considers leveraging the stationary ratio of state-action pairs to replace the trajectory weights in an online fashion and (Gelada & Bellemare, 2019) further applies the same idea in the deep reinforcement learning regime. Recently, (Kallus & Uehara, 2019a,b) propose double reinforcement learning (DRL), which is based on doubly robust estimator with cross-fold estimation of q-functions and marginalized density ratios. It was shown that DRL is asymptotically efficient when both components are estimated at fourth-root rates, however no finite sample error bounds are given.

Our goal. In this paper, our goal is to obtain the optimality of IS-based methods through marginalized importance sampling (MIS). As an earlier attempt, Xie et al. (2019) constructs MIS estimator by aggregating all trajectories that share the same state transition patterns to directly estimate the state distribution shifts after the change of policies from the behavioral to the target. However, as pointed out by Kallus & Uehara (2019a) and Remark 4 in Xie et al. (2019), the MSE upper bound of MIS estimator is asymptotically inefficient by a multiplicative factor of H. Xie et al. (2019) conjectures that the lower bound is not achievable in their infinite action setting. To bridge the gap and ultimately achieve the optimality, we consider the Tabular MDPs, where both the state space and action space are finite () and each state-action pair can be visited frequently as long as the logging policy does sufficient exploration (which corresponds to Assumption 2.2). Under the Tabular MDP setting, we can show the MSE upper bound of MIS estimator matches the Cramer-Rao lower bound provided by Jiang & Li (2016) by incorporating frequent action observability. To distinguish the difference, throughout the rest of paper we call the modified MIS estimator Tabular-MIS (TMIS) and the MIS estimator in Xie et al. (2019) State-MIS (SMIS).

1.1 Summary of results.

This work considers the problem of off-policy evaluation for a finite horizon, nonstationary, episodic MDP under tabular MDP setting. We propose and analyze Tabular-MIS estimator, which closes the gap between Cramer-Rao lower bound provided by Jiang & Li (2016) (on the variance of any unbiased estimator for a simplified setting of an nonstationary episodic MDP) and the MSE upper bound of State-MIS estimator (Xie et al., 2019). We also provide a high probability result by introducing a data-splitting type Tabular-MIS estimator, which retains the asymptotic efficiency while having an exponential tail. To the best of our knowledge, Split-TMIS is the first IS-based estimator in OPE that achieves asymptotic sample efficiency while having finite sample guarantees in high probability.

Moreover, the calculation of Tabular-MIS estimator and Split-TMIS does not explicitly incorporate the importance weights, which in turn implies our off-policy evaluation algorithm can be implemented without needing to know logging probabilities . Such logging-policy-free feature makes our TMIS estimator more practical in the real-world applications.

Key proof ingradients. We use a modified version of fictitious estimator of Xie et al. (2019) as the bridge to connect our estimator . Different from Xie et al. (2019) who directly analyzes transition dynamic ), we need to do a finer decomposition ) = ) and analyze ). Also, Bellman equations are leveraged for expressing the variance of TMIS recursively. For deriving the high probability bound, we design the data-splitting TMIS which not only matches perfectly with the standard concentration inequalities but also maintains the MSE of the same order as TMIS for certain appropriately chosen batch data-splitting size.

1.2 Other related work

Markov Decision Processes have a long history of associated research (Puterman, 1994; Sutton & Barto, 1998), but many theoretical problems in the basic tabular setting remain an active area of research as of today. We briefly review the other settings that this problem and connect them to our results.

Regret bound and sample complexity in the online setting. The bulk of existing work focuses on online learning, where the agent interacts with the MDP with the interests of identifying the optimal policy or minimizing the regret against the optimal policy. The optimal regret is obtained by (Azar et al., 2017) using a model-based approach which translates into a sample complexity bound of ), which matches the lower bound of Ω(Azar et al., 2013). The method is however not “uniform PAC” where the state of the art sample complexity remains ) (Dann et al., 2017). Model-free approaches that require a space constraint of O(HSA) were studied by Jin et al. (2018) which implies a sample complexity bound of

Sample complexity with a generative model. Another sequence of work assumes access to a generative model where one can sample from O(1) (Kearns & Singh, 1999). Sidford et al. (2018) is the first that establishes the optimal sample complexity of ˜Θ() under this setting (counting H generative model calls as one episode). Agarwal et al. (2019) establishes similar results by estimating the parameters of the MDP model using maximum-likelihood estimation.

Our setting is different in two ways. First, we consider a fixed pair of logging and target policy and , so our bounds can depend explicitly and instead of S, A. Second, we do not have either online access to the environment (to change policies) or a generative model. Our high-probability bound with a direct union bound argument, implies a sample complexity of ˜) for identifying the optimal policy, which is suboptimal up to a factor of S, but notably has the optimal dependence in H. We remark that achieving the optimal dependence in the planning horizon H (or the discounting factor (1 the infinite horizon case) is generally tricky (see, e.g., the COLT open problem (Jiang & Agarwal, 2018) for more details). The current paper is among the few instances where we know how to obtain the optimal parameters.

Finally, we acknowledge that tabular RL is a basic abstraction that is relatively far away from real applications, which might have unobserved states, continuous state, non-zero Bellman error in the value function approximation. We leave generalization of the techniques in this paper to these more practical settings as future work.

2 Method

2.1 Problem description

In addition to the non-stationary, finite horizon tabular MDP M = () (where S := and A := ), non-stationary logging policy and target policy in Section 1, we denote () and () the induced joint state-action distribution at time t and the state distribution counterparts ), satisfying The initial distributions are identical . Moreover, we use to represent the state transition probability from step j to step i under policy , where ) = ). The marginal state distribution vector

Historical data was obtained by logging policy and we can only use D to estimate the value of target policy . Suppose we only assume knowledge about ) for all () for any actions other than the noisy immediate reward after observing . The goal is to find an estimator to minimize the mean-square error (MSE):

Assumption 2.1 (Bounded rewards)

The bounded reward assumption can be relaxed to : such that 0 , Var[(as in Xie et al. (2019)), for achieving Cramer-Rao lower bound. However, the boundedness will become essential for applying concentrate inequalities in deriving high probability bounds.

Assumption 2.2 (Sufficient exploration)0.

Assumption 2.3 is also necessary for discrete state and actions, as otherwise the second moments of the importance weight would be unbounded and the MSE of estimators will become intractable . The bound on is natural since and it is finite by the Assumption 2.2; similarly, is also automatically satisfied if min0. Finally, as we will see in the results, explicit dependence on and only appear in the low-order terms of the error bound.

2.2 Tabular-MIS estimator

To overcome the barrier caused by cumulative importance weights in IS type estimators, marginalized importance sampling directly estimates the marginalized state visitation distribution and defines the MIS estimator:

whenever 0 and (() = 0 when = 0. Then the MIS estimator (1) becomes:

Construction of State-MIS estimator. Based on the estimated marginal state transition = , State-MIS estimatorin Xie et al. (2019) directly estimates the state transition ) and state reward

State-MIS estimator directly constructs state transitions () without explicitly modeling actions. Therefore, it is still valid when action space A is unbounded. However, importance weights must be explicitly utilized for compensating the discrepancy between and and the knowledge of ) at each state-action pair (s, a) is required.

Construction of Tabular-MIS estimator. Since tabular MDP setting assumes finite states and actions, we can go beyond importance weights and construct empirical estimates for

where we set the empirical visitation frequency to state-action () at time t. The corresponding estimation of ) are defined as:

In conclusion, by using the same estimator for share the same form of (2). However, Tabular-MIS estimator constructs a different estimation of component though (3)-(4) by leveraging the fact that each state-action pair is visited frequently under tabular setting.

The motivation of MIS-type estimators comes from the fact that we have a nonstationary MDP model and its underlying state marginal transition follows = . The MIS estimators are then obtained by using corresponding plug-in estimators for each different components (for , for ). On the other hand, IS-type estimators design the value function in a more straightforward way without needing to estimate the transition environment (Mahmood et al., 2014). Therefore in this sense MIS-type estimators are essentially model-based estimators with the model of interactive environment

3 Main Results

We now show that our Tabular-MIS estimator achieves the asymptotic Cramer-Rao lower bound for DAG-MDP (Jiang & Li, 2016) and therefore is asymptotically sample efficient. To formalize our statement, we pre-specify the following boundary conditions: 0, 0, and, as a reminder, and

Theorem 3.1. Suppose the n episodic historical data is obtained by running a logging policy is the new target policy which we want to test. If the number of episodes n satisfies

then under Assumption 2.1-2.3 our Tabular-MIS estimator has the following Mean- Square-Error upper bound:

+

where the value function under is defined as: () := {1, 2, ..., H}.

The proof of this theorem, and all the other technical results we present in this section, are deferred to the appendix due to the space constraint. We summarize the novel ingredients in the proof in Section 3.1. Before that, we make a few remarks about a few interesting aspects of this result.

Remark 3.2 (Asymptotic efficiency and local minimaxity). The error bound implies that lim

This exactly matches the CR-lower bound in Jiang & Li (2016, Proposition 3) for DAGMDP. In contrast, the State-MIS estimator in (Xie et al., 2019) achieves an asymptotic MSE of

We note that while in classical literature CR-lower bound is often used to lower bound the variance of unbiased estimators, the modern theory of estimation establishes that it is also the correct asymptotic minimax lower bound for the MSE of all estimators in every local neighborhood of the parameter space (see, e.g., Van der Vaart, 2000, Chapter 8). In other words, our results imply that Tabular-MIS estimator is asymptotically, locally, uniformly minimax optimal, namely, optimal for every problem instance separately.

It is worth pointing out that while asymptotically efficient estimators for this problem in related settings have been proposed in independent recent work (Kallus & Uehara, 2019a,b), our estimator is the first that comes with finite sample guarantees with an explicit expression on the low-order terms. Moreover, our estimator demonstrates that doubly robust estimation techniques is not essential for achieving asymptotic efficiency.

Remark 3.3 (Simplified finite sample error bound). The theory implies that there is universal constants such that for all , i.e., when we have a just visited every state-action pair for Ω(

In deriving the above remark, we used the somewhat surprising observation that

Note that we are summing H quantities that are potentially on the order of , yet no additional factors of H shows up. This observation is folklore and has been used in deriving tight results for tabular RL in (e.g., Azar et al., 2017). It can be proven using the following decomposition of the variance of the empirical mean estimator and the fact that it is bounded by

Lemma 3.4. For any policy and any MDP.

The proof, which applies the law-of-total-variance recursively, is deferred to the appendix.

Remark 3.5 (When One surprising observation is that Tabular-MIS estimator improves the efficiency even for the on-policy evaluation problem when . In other word, the natural Monte Carlo estimator of the reward in the on-policy evaluation problem is in fact asymptotically inefficient.

3.1 Building blocks of the analysis

At the high level, the techniques we used, including the idea of fictitious estimator and peeling the variance (expectation) of fictitious estimator from behind by applying total law of variances (expectations) repeatedly, are consistent with Xie et al. (2019).

In addition to the above techniques, we leverage the fact of frequent state-action visitations in our design of TMIS estimator and based on that we are able to achieve an asymptotic lower Mean Square Error (MSE) bound. The main components are the followings.

and

where is the parameter constrained by 0 1, which we will choose later in the proof.

This slight modification makes no longer implementable using the logging data D, but it does provide an unbiased estimator of in appendix) and, most importantly, it is easier to do theoretical analysis on than on . Moreover, Multiplicative Chernoff bound (Lemma A.2 in appendix) helps to find the connection between

Peeling arguments using the total law of variance (expectation). The core idea in analyzing the variance of is to peel the variance from behind (start from time H to 1) and the peeling tool we used here is through marriaging the standard Bellman equations with the total law of variance. Lemma B.2 (in appendix) shows this spirit and it is used repeatedly throughout the whole analysis. Beyond that, the peeling argument can be used to prove the dependence in for our Tabular-MIS estimator. This result explicates that is enough for TMIS to evaluate a particular policy and this is different from SMIS, which in general requires the dependence of for off-policy evaluation.

3.2 A high-probability bound with data-splitting TMIS.

Tabular-MIS estimator provides the asymptotic optimal variance bound of order and based on that it is natural to ask the related learning question: whether TMIS can further achieve a high probability bound with the same sample complexity? We figure out that the standard concentration inequalities (e.g. Hoeffding’s inequality, Bernstein inequality) cannot be directly applied because of the highly correlated structures of the Tabular-MIS estimator. To address this problem we design the following data split version of TMIS and as we will see, the original TMIS is essentially a special case of data-splitting TMIS.

Data splitting Tabular-MIS estimator. Assume the total number of episodes n can be factorized as , where M, N > 1 are two integers,and we can partition the data D into N folds with each fold different episodes, or in other words, we split the n episodes evenly. Then by the i.i.d. nature of n episodes, we have are independent collections.

For each , we can create a Tabular-MIS estimator (for notation simplicity we use to denote in the future discussions) using its own M episodes. Then are independent of each other and we can use the empirical mean to define the data splitting Tabular-MIS estimatorand the corresponding fictitious version:

The data splitting TMIS estimator explicitly characterizes the independence of n different episodes by grouping them into N chunks. Chunks are independent of each other and taking the average over all (i = 1, ..., N) will guarantee the validity of using concentration inequalities.

More importantly, the data splitting TMIS estimator holds the same information-theoretical variance lower bound as the non-data splitting TMIS estimator, which is not surprising since the non-data splitting TMIS estimator is just the special case of the data splitting TabularMIS estimator with N = 1. This idea is summarized into the following theorem:

Remark 3.7. The condition in Theorem 3.6 is achieveable. For example, choose then the condition holds when n is sufficiently large.

High probability bound. By coupling the data splitting techniques with the boundedness of Tabular-MIS estimator (, see Lemma B.3 in appendix), we can apply concentration inequalities to show the difference between and is bounded by order ), which is summarized into the following theorem.

Theorem 3.8. Suppose n i.i.d. episodic historical data comes from a near-uniform logging policy and suppose M, the number of episodes in each , satisfies: and . Then we have with probability 1 , the data splitting Tabular-MIS estimator obeys:

The proof Theorem 3.8 relies on bounding the difference between and using Multiplicative Chernoff bound and bounding the difference between and using Bernstein inequality. During the process of bounding we observe that a stronger uniform bound can be derived. In fact, this bound is 0. We formalize it into the following lemma.

Lemma 3.9. Suppose n i.i.d. episodic historical data comes from a near-uniform logging policy and suppose M, the number of episodes in each , satisfies: M > max [.Then we have with probability 1

Since , therefore let N = 1, M = n, we have: if then we have with probability 1

where consists of all the H-step nonstationary policies.

Remark 3.10. The uniform difference bound between and is obtained by observing the construction of fictitious estimator (7) and (8) are independent of the specific target policy . This result tells the supcan be arbitrarily small with high probability and therefore does not depend on H factor. This fact will help us to derive the correct dependence in H for uniform convergence problem, see Section 5.

3.3 Some interpretations.

Logging policy free algorithm. We point out the implementation of Tabular-MIS estimator does not require the knowledge of logging policy , as shown in Algorithm 1,2.This is critical in the sense that in the real-world sequential decision making problems, it is very likely the complete information about logging policy is not provided. This may happen due to mis-records or the lack of maintainance. By only using the historical data, tabular MIS off-policy evaluation is able to achieve the asymptotic efficiency. In contrast, the state MIS estimator always requires the full information about the logging policy.

Connection to approximate MDP estimation. Our TMIS is essentially an approxi-

mate MDP estimator (with the non-stationary dynamic transitions estimated by maximum likelihood estimator (MLE)) except that we marginalize out the action in both and ) and provide an importance sampling interpretation. To the best of our knowledge, existing analysis of the fully model-based approach does not provide tight bounds. We give two examples. The seminal simulation lemma in Kearns & Singh (2002) together with a naive concentration-type analysis gives only an ) bound in our setting. In a very recent compilation of improvements over this bound (Jiang, 2018), this bound can be improved to either ). Our result is the first that achieves the optimal ) rate regardless of whether it is the model-based or model-free approach.

4 Experiments

In this section, we present some empirical studies to demonstrate that our main theoretical results about Tabular-MIS estimator in Theorem 3.1 are empirically verified.

Time-varying, non-mixing Tabular MDP. We test our approach in simulated MDP environment where both the states and the actions are binary. Concretely, there are two states and and two actions and . State always has probability 1 going back to itself, regardless of the actions, ) = 1 and ) = 1. For state , at each time step there is one action (we call it a) that has probability 2/H going to and the other action (we call it ) has probability 1 going back to ) = 2/H = 1 ) and ) = 1. Moreover, which action will make state go to state with probability 2/H is decided by a random parameter [0, 1]. If and if . One can receive reward 1 at each time step if t > H/2 and is in state , and will receive reward 0 otherwise. Lastly, for state , we set ) = ); for state , we set ) = ) = 1/2 and

Figure 1(a) shows the asymptotic convergence rates of relative RMSE with respect to the number of episodes, given fixed horizon H = 100. Both SMIS and TMIS has a ) convergence rate. The saving ofof TMIS over SMIS in this log-log plot is reflected in the intercept. Figure 1(b) has fixed n = 1024 with varying horizon H. Note since ), therefore for TMIS our theoretical result implies= ) = O(1), which is consistent with the horizontal line when H is large. Moreover, for SMIS = ) = ), so after taking the log() we should have asymptotic linear trend with coefficient 1/2. The red line in Figure 1(b) empirically verifies this result. More empirical study discussions are deferred to Appendix D.

Figure 1: Relative RMSE () on Non-stationary Non-mixing MDP

5 Discussion

From off-policy evaluation to offline learning. A real offline reinforcement learning system is equipped with both offline learning algorithms and off-policy evaluation algorithms. The decision maker should first run the offline learning algorithm to find a near optimal policy and then use off-policy evaluation methods to check if the obtained policy is good enough. Under our tabular MDP setting, we point out it is possible to find a -optimal policy in near optimal time and sample complexity -value iteration (QVI) based algorithm designed by Sidford et al. (2018). Their QVI algorithm assumes a generative model which can provide independent sample of the next state given any current state-action (s, a). At a first glance, this assumption seems too strong for offline learning since we cannot force the agent to stay in any arbitrary location. In fact, the Assumption 2.2 on actually reveals that the underlying logging policy can be considered as the surrogate of the generative model. As n goes large, the visitation frequency of any () will be large enough with high probability, as guaranteed by Multiplicative Chernoff bound.

From off-policy evaluation to uniform off-policy evaluation. The high probability result achieves ) complexity. Following this discovery line, then it is natural to ask whether uniform convergence over a class of policies (e.g. all deterministic policies) can be achieved with optimal sample complexity. This problem is interesting since it will guarantee the strong performance of off-policy evaluation methods over all policies in certain policy class . By a direct application of union bound, we can obtain the following result:

contains all the deterministic H-step policies. Then under the same

condition as Theorem 3.8, the data splitting Tabular-MIS estimator satisfies:

with probability 1

The uniform convergence bound implies that the empirical best policy ˆ= argmaxis within =

bound for learning the optimal policy (Azar et al., 2013) in all parameters except a factor of S.

Finally, we note that the conjecture posed by Xie et al. (2019) remains unsolved. The result we presented in this paper leverage the fact that we can estimate the parameters of the MDP model. In the infinite A case, we can never observe any (s, a) pairs more than once, hence not able to estimate the transition dynamics or the expected reward. The minimax lower bound in (Wang et al., 2017) (for the contextual bandit setting) already establishes that the Cramer-Rao lower bound is not achievable in this setting even if H = 1 and S = 1. It remains open whether is required.

6 Conclusion

In this paper, we propose and analyze a new marginalized importance sampling estimator for the off-policy evaluation (OPE) problem under the episodic tabular Markov decision process model. We show that the estimator is has a finite sample error bound that matches the exact Cramer-Rao lower bound up to low-order factors. We also provide an extension with high probability error bound. To the best of our knowledge, these results are the first of their kinds. Future work includes resolving the open problems mentioned before and generalizing the results to more practical settings.

References

Agarwal, A., Kakade, S., & Yang, L. F. (2019). On the optimality of sparse model-based planning for markov decision processes. arXiv preprint arXiv:1906.03804.

Azar, M. G., Munos, R., & Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3), 325–349.

Azar, M. G., Osband, I., & Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, (pp. 263–272). JMLR. org.

Chernoff, H., et al. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4), 493–507.

Dann, C., Lattimore, T., & Brunskill, E. (2017). Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, (pp. 5713–5723).

Dud´ık, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.

Farajtabar, M., Chow, Y., & Ghavamzadeh, M. (2018). More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493.

Gelada, C., & Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrap- ping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, (pp. 3647–3655).

Gottesman, O., Liu, Y., Sussex, S., Brunskill, E., & Doshi-Velez, F. (2019). Combin- ing parametric and nonparametric models for off-policy evaluation. arXiv preprint arXiv:1905.05787.

Hallak, A., & Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, (pp. 1372–1383). JMLR. org.

Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189.

Jiang, N. (2018). Notes on tabular methods.

Jiang, N., & Agarwal, A. (2018). Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, (pp. 3395–3398).

Jiang, N., & Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, (pp. 652–661). JMLR. org.

Jin, C., Allen-Zhu, Z., Bubeck, S., & Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems, (pp. 4863–4873).

Kallus, N., & Uehara, M. (2019a). Double reinforcement learning for efficient off-policy evaluation in markov decision processes. arXiv preprint arXiv:1908.08526.

Kallus, N., & Uehara, M. (2019b). Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.

Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3), 209–232.

Kearns, M. J., & Singh, S. P. (1999). Finite-sample convergence rates for q-learning and indirect algorithms. In Advances in neural information processing systems, (pp. 996–1002).

Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of contextual- bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, (pp. 297–306). ACM.

Li, L., Munos, R., & Szepesv´ari, C. (2015). Toward minimax off-policy value estimation.

Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018a). Breaking the curse of horizon: Infinite- horizon off-policy estimation. In Advances in Neural Information Processing Systems, (pp. 5361–5371).

Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., & Brunskill, E. (2018b). Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, (pp. 2644–2653).

Mahmood, A. R., van Hasselt, H. P., & Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, (pp. 3014–3022).

Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., & Popovic, Z. (2014). Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, (pp. 1077–1084). International Foundation for Autonomous Agents and Multiagent Systems.

Murphy, S. A., van der Laan, M. J., Robins, J. M., & Group, C. P. P. R. (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456), 1410–1423.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic program- ming.

Sidford, A., Wang, M., Wu, X., Yang, L., & Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, (pp. 5186–5196).

Sridharan, K. (2002). A gentle introduction to concentration inequalities. Dept. Comput. Sci., Cornell Univ., Tech. Rep.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction, vol. 1. MIT press Cambridge.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D., & Zitouni, I. (2017). Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, (pp. 3632–3642).

Theocharous, G., Thomas, P. S., & Ghavamzadeh, M. (2015). Personalized ad recom- mendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Thomas, P., & Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforce- ment learning. In International Conference on Machine Learning, (pp. 2139–2148).

Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., & Brunskill, E. (2017). Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Twenty-Ninth IAAI Conference.

Van der Vaart, A. W. (2000). Asymptotic statistics, vol. 3. Cambridge university press.

Wang, Y.-X., Agarwal, A., & Dudik, M. (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, (pp. 3589–3597). JMLR. org.

Xie, T., Ma, Y., & Wang, Y.-X. (2019). Towards optimal off-policy evaluation for re- inforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, (pp. 9665–9675).

Appendix

A Concentration inequalities and other technical lemmas

Lemma A.1 (Bernsteins Inequality (Sridharan, 2002) ). Let be independent bounded random variables such that ] = 0 and with probability 1. Let =

Lemma A.2 (Multiplicative Chernoff bound (Chernoff et al., 1952) ). Let X be a Binomial random variable with parameter , we have that

A slightly weaker bound that suffices for our propose is the following:

B Proof of the main theorem

To analyze the MSE upper bound ], we create a fictitious surrogate which is an unbiased version of . A few auxiliary lemmas are first presented and Bellman equations are used for deriving variance decomposition in a recursive way. Second order moment of marginalized state distribution can then be bounded by analyzing its variance.

B.1 Fictitious tabular MIS estimator.

The fictitious estimatorfills in the gap of state-action location () of the true estimator where = 0. Specifically, it replaces every component in with a fictitious counterpart, := , with = and () =

() = In particular, let denotes the event

where 0 1 is a parameter that we will choose later.

The name ”fictitious” comes from the fact that is not implementable using the data, but it creates a bridge between and because of its unbiasedness, see Lemma B.5. Also, for simplicity of the proof, throughout the rest of the paper we denote: := Also, in the base case, we denote := and that Then we have the following preliminary auxiliary lemmas.

and are deterministic given . Moreover, given , is unbiased of is unbiased of

Proof of Lemma B.1. By construction of the estimator, and only depend on , therefore are constants. For the second argument, we have

where the third equal sign comes from the fact that conditional on , ˆ) — the empirical mean — is unbiased. The result about ˜can be derived using a similar fashion.

Using Lemma B.1, we can derive the following recursions for expectation and variance:

Lemma B.2. For h = 1, ..., H, we have

Var

Proof. The proof of Lemma B.2 can be found in Lemma B.2 and Lemma 4.1 in Xie et al. (2019) by coupling the standard Bellman equation:

with the total law of expectations and the total law of variances.

Lemma B.3 (Boundedness of Tabular MIS estimators)

The last line is inequality since ) = 0 when = 0. Following the same logic, it is easy to show ) is a non-degenerated probability distribution.

Next note ) = ) = = 1. Suppose ) is a (degenerated) probability distribution, then from , by induction we know (degenerated) probability distribution for all t.

Using Assumption 2.1, it is easy to show ˆ, then combining all results above we have . Similarly,

The boundedness of Tabular-MIS estimator cannot be inherited by the State-MIS estimator since explicitly uses importance weights and there is no reason for it to be less than . As a result, we do not need an extra projection step for our estimation to be valid (see Xie et al. (2019) Lemma B.1). Thanks to the following lemma, throughout the rest of the analysis we only need to consider

be the Tabular-MIS estimator and be the fictitious version of TMIS we described above with parameter . Then the MSE of the TMIS and fictitious TMIS satisfies

Proof of Lemma B.4. Define E := ()(1 . Similarly to Lemma B.1 in the appendix of Xie et al. (2019), we have

where the last inequality uses Lemma B.3. Then combining the multiplicative Chernoff bound (Lemma A.2 in the Appendix) and a union bound over each , we get that

which provides the stated result.

Lemma B.4 tells that MSE of two TMISs differs by a quantity 3

and this illustrates that the gap between two MSE’s can be sufficiently small as long as

B.2 Variance and Bias of Fictitious tabular MIS estimator.

Lemma B.5 (Xie et al. (2019) Lemma B.2)Lemma B.6 (Variance decomposition).

Var[

where denotes the value function under which satisfies the Bellman equation

Remark B.7. Note even though the construction of TMIS and SMIS are different, both fictitious estimators are unbiased for . Therefore the MSE of MIS estimators are dominated by the variance of the fictitious estimators. Comparing Lemma B.6 with Lemma 4.1 in Xie et al. (2019) we can see our Tabular-MIS estimator achieves a lower bound, and it is essentially asymptotic optimal, as explained by Remark 3.2.

Proof of Lemma B.6. The proof relies on applying Lemma B.2 in a recursive way. One key observation is

To begin with the following variance decomposition, which applies (11) recursively.

Var[

Now let us analyze each are conditionally independent given partitions the n episodes into S disjoint sets according to the states at time h. Similarly, ˜) and ˜) for each () are also conditionally independent given . These observations imply:

(15) The second line and the fourth line use the conditional independence for and () respectively. The fifth line uses that when )(1 ), the conditional variance is 0. The sixth line uses the fact that episodes are iid.

Plug (15) into the above variance decomposition and uses = 0, we finally get

Var[

B.3 Bounding the variance of .

Applying the definition of variance, we directly have

where we use the fact that ) is unbiased (which can be proved by induction through applying total law of expectations and the recursive relationship ). Therefore the only thing left is to bound the the variance of ). To tackle it, we consider bounding the covariance matrix of ). As we shall see in Lemma B.8, fortunately, we are able to derive an identical result of Lemma B.4 in Xie et al. (2019) for our Tabular-MIS estimator, which helps greatly in bounding the the variance of

Lemma B.8 (Covariance of

Cov(

where = — the transition matrices under policy from time

Proof of Lemma B.8. We start by applying the law of total variance to obtain the following recursive equation

The decomposition of the covariance in the third line uses that Cov(X + Y ) = Cov(X) + Cov(Y ) when X and Y are statistically independent and the columns of are independent when conditioning on

(

The second line uses the fact that conditional on , the variance of ) is zero given Data. The third line uses the basic property of empirical average, and the fourth line comes from the fact

The last line (25) uses the fact that is positive semidefi- nite, )(1 ) and the definition of variance for ). Combining (19) and (25) and by recursively apply them, we get the stated results.

Benefitting from the identical semidefinite ordering bound on Cov( ) for TMIS and SMIS, we can borrow the following results from Xie et al. (2019) for our Tabular-MIS estimator.

Lemma B.9 (Corollary 2 of Xie et al. (2019))

where

Before giving the proof of Theorem 3.1, we first prove Lemma 3.4.

Proof of Lemma 3.4. Let value function () = ] and Q-function ) = ], then by total law of variance we obtain

(let’s suppress the policy for simplicity):

=E

+Var

+Var

(26) where we use Markovian property that () equals () in distribution and E

recursively and letting h = H, we get the stated result.

Remark B.11. A straight forward implication of Lemma 3.4 is the following:

Combing Lemma B.6 and B.10, we are now ready to prove the main Theorem 3.1.

Proof of Theorem 3.1. Plug the result of Lemma B.10 into Lemma B.6 and uses the unbi-

Choose . Then by assumption we have 2, which allows us to write (1 (1 + 2) in the leading term and (1 2 in the subsequent terms. The condition of Lemma B.10 is satisfied by The second assumption on n. Then, combining (27) with Lemma B.4 we get:

+8τ

(28) now use Lemma 3.4, we can bound the last term in (28) by

Combine this term with we obtain the higher order term ), where we use that pigeonhole principle implies that

This completes the proof.

C Proofs of data splitting Tabular-MIS estimator.

We define the fictitious data splitting Tabular-MIS estimator as:

where each is the fictitious Tabular-MIS estimator of . Moreover, we set all jointly share the same fictitious parameter

Proof of Theorem 3.6. Let := : , then an argument similar to Lemma B.4 can be derived:

and

therefore ] will be sufficiently small if By near-uniformity we

Moreover, by i.i.d and unbiasedness of

by the second assumption on M and Theorem 3.1.

We now proof Lemma 3.9, since it will be used to as the intermediate step for proving Theorem 3.8.

Proof of Lemma 3.9. Note that

therefore by near-uniformity is suf-ficient to guarantee the stated result.

Now we can prove Theorem 3.8.

Proof of Theorem 3.8. First of all, we have

Now by Bernstein inequality we have

(30) Solving (30) and apply Theorem 3.1, we obtain

(31) As N goes large, the square root term in (31) will dominate and it seems we only need to consider the square root term in N and treat the second term as the higher order term. However, since arbitrary large given n. An example is: when N = n, then M = n/N = 1 does not satisfy the condition. Therefore to make the square root term dominates we need

This translates to

where absorbs all the Polylog terms.

Therefore under the condition (32), we can really absorb the second term in (31) (as higher order term) and combine it with Lemma 3.9 to get that with probability 1

Proof of Theorem 5.1. The non-uniform result of Theorem 3.8 gives:

Note that all nonstationary deterministic polices class have cardinality implies log , therefore combine Lemma 3.9 with a direct union bound and Multiplicative Chernoff bound we obtain

D More details about Empirical Results.

Restate Time-varying, non-mixing Tabular MDP in Section 4.

There are two states and two actions always has probability 1 going back to itself, regardless of the actions, ) = 1 and ) = 1. For state , at each time step there is one action (we call it a) that has probability 2/H going to and the other action (we call it ) has probability 1 going back to

and which action will make state go to state with probability 2/H is decided by a random parameter uniform sampled in [0, 1]. If and if . These are generated by a sequence of pseudo-random numbers. Moreover, one can receive reward 1 at each time step if t > H/2 and is in state , and will receive reward 0 otherwise. Lastly, for logging policy, we define it to be uniform:

For target policy , we define it as:

We run this non-stationary MDP model in the Python environment and pseudo-random numbers ’s are generated by keeping numpy.random.seed(100).

We run each methods under K = 100 macro-replications with data and use each = 1, ..., K) to construct a estimator , then the (empirical) RMSE is computed as:

Other generic IS-based estimators. There are other importance sampling based estimators including weighted importance sampling (WIS) and importance sampling with stationary state distribution (SSD-IS, Liu et al. (2018a)). The empirical comparisons including these methods are well-demonstrated in Xie et al. (2019) and it was empirically shown that they are worse than SMIS. Because of that, we only focus on comparing SMIS and TMIS in our simulation study.