38:[["$","audio",null,{"id":"tts"}],["$","$L3c",null,{"paperID":"2001.04515","publisher":"arxiv","paperJSON":{"title":"Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings","paperID":"2001.04515","avgLineHeight":19.92,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy’s value in infinite horizon settings where the number of decision points diverges to infinity. We propose to model the action-value state function (Q-function) associated with a policy based on series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a ","element":"span"},{"style":{"fontWeight":"bold"},"text":"S","element":"span"},{"text":"equenti","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":"l ","element":"span"},{"style":{"fontWeight":"bold"},"text":"V","element":"span"},{"text":"alue ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"valuation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patient’s health status. 
A ","element":"span"},{"text":"Python ","element":"span"},{"text":"implementation of the proposed procedure is available at ","element":"span"},{"href":"https://github.com/shengzhang37/SAVE","style":{"fontFamily":"monospace"},"text":"https://github.com/shengzhang37/SAVE","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Key Words: ","element":"span"},{"text":"Confidence interval; Value function; Reinforcement learning; Infinite horizons; Bidirectional Asymptotics.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Reinforcement learning (RL) is a general technique that allows an agent to learn and interact with an environment. ","element":"span"},{"text":"A policy defines the agent’s way of behaving. ","element":"span"},{"text":"It maps the states of environments to a set of actions to be chosen from. ","element":"span"},{"text":"RL algorithms have made tremendous achievements and found extensive applications in video games ","element":"span"},{"href":"#id-0","referenceIndex":41,"text":"(Silver ","element":"a"},{"href":"#id-0","referenceIndex":41,"text":"et al., ","element":"a"},{"href":"#id-0","referenceIndex":41,"text":"2016)","element":"a"},{"text":", robotics ","element":"span"},{"href":"#id-1","referenceIndex":19,"text":"(Kormushev et al., ","element":"a"},{"href":"#id-1","referenceIndex":19,"text":"2013)","element":"a"},{"text":", bidding ","element":"span"},{"href":"#id-2","referenceIndex":16,"text":"(Jin et al., ","element":"a"},{"href":"#id-2","referenceIndex":16,"text":"2018)","element":"a"},{"text":", ridesharing ","element":"span"},{"text":"(Xu et al., ","element":"span"},{"href":"#id-3","referenceIndex":51,"text":"2018)","element":"a"},{"text":", etc. In particular, a number of RL methods have been proposed in precision medicine, to derive an optimal policy as a set of sequential treatment decision rules that optimize patients’ clinical outcomes over a fixed period of time (finite horizon). 
References include ","element":"span"},{"href":"#id-4","referenceIndex":29,"text":"Murphy ","element":"a"},{"href":"#id-4","referenceIndex":29,"text":"(2003)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":52,"text":"Zhang et al. ","element":"a"},{"href":"#id-5","referenceIndex":52,"text":"(2013)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":54,"text":"Zhao et al. ","element":"a"},{"href":"#id-6","referenceIndex":54,"text":"(2015)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":36,"text":"Shi et al. ","element":"a"},{"href":"#id-7","referenceIndex":36,"text":"(2018a,","element":"a"},{"href":"#id-8","referenceIndex":39,"text":"b)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":53,"text":"Zhang ","element":"a"},{"href":"#id-9","referenceIndex":53,"text":"et al. ","element":"a"},{"href":"#id-9","referenceIndex":53,"text":"(2018)","element":"a"},{"text":", to name a few.","element":"span"}],[{"text":"Mobile health (or mHealth) technology has recently emerged due to the use of mobile devices such as mobile phones, tablet computers or wearable devices in health care. It allows health-care providers to communicate with patients and manage their illness in real time. It also collects rich longitudinal data (e.g., through mobile health apps) that can be used to estimate the optimal policy. Data from mHealth applications differ from those in finite horizon settings in that the number of treatment decision points for each patient is not necessarily fixed (infinite horizon) while the total number of patients could be limited. Take the OhioT1DM dataset ","element":"span"},{"href":"#id-10","referenceIndex":24,"text":"(Marling and Bunescu, ","element":"a"},{"href":"#id-10","referenceIndex":24,"text":"2018) ","element":"a"},{"text":"as an example. It contains data for six patients with type 1 diabetes. 
For all patients, their continuous glucose monitoring (CGM) blood glucose levels, insulin doses including bolus and basal rates, and self-reported times of meals and exercise are continually measured and recorded for eight weeks. Developing an optimal policy as a function of these time-varying covariates could potentially assist these patients in improving their health status.","element":"span"}],[{"text":"In this paper, we focus on the infinite horizon setting where the data generating process is modeled by a Markov decision process (MDP, ","element":"span"},{"href":"#id-11","referenceIndex":30,"text":"Puterman, ","element":"a"},{"href":"#id-11","referenceIndex":30,"text":"1994)","element":"a"},{"text":". Specifically, at each time point, the agent selects an action based on the observed state. The system responds by giving the decision maker a corresponding outcome and moving into a new state in the next time step. ","element":"span"},{"text":"This model is generally applicable to sequential decision making, including applications from mHealth, games, robotics, ridesharing, etc. After a policy is proposed, it is important to examine its benefit prior to recommending it for practical use. The goodness of a policy is quantified by its (state) value function, corresponding to the discounted cumulative reward that the agent receives on average, starting from some initial state. The inference of the value function helps a decision maker to evaluate the impact of implementing a policy when the environment is in a certain state. ","element":"span"},{"text":"In some applications, it is also important to evaluate the integrated value of a policy aggregated over different initial states. For example, in medical studies, one might wish to know the mean outcome of patients in the population. 
The integrated value could thus be used as a criterion for comparing different policies.","element":"span"}],[{"text":"In the statistics literature, a few methods have been proposed to estimate the optimal policy in infinite horizons. ","element":"span"},{"href":"#id-12","referenceIndex":9,"text":"Ertefaie and Strawderman ","element":"a"},{"href":"#id-12","referenceIndex":9,"text":"(2018) ","element":"a"},{"text":"proposed a variant of the gradient Q-learning method. ","element":"span"},{"href":"#id-13","referenceIndex":20,"text":"Luckett et al. ","element":"a"},{"href":"#id-13","referenceIndex":20,"text":"(2019) ","element":"a"},{"text":"proposed V-learning to directly search for the optimal policy among a restricted class of policies. Inference of the value function under a generic (data-dependent) policy has not been studied in these papers. In the computer science literature, ","element":"span"},{"href":"#id-14","referenceIndex":44,"text":"Thomas et al. ","element":"a"},{"href":"#id-14","referenceIndex":44,"text":"(2015) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-15","referenceIndex":15,"text":"Jiang and Li ","element":"a"},{"href":"#id-15","referenceIndex":15,"text":"(2016) ","element":"a"},{"text":"proposed (augmented) inverse propensity-score weighted ((A)IPW) estimators for the value function in infinite horizons and derived their associated CIs. However, these methods are not suitable for settings where only a limited number of trajectories (e.g., plays of a game or patients in medical studies) are available, since (A)IPW estimators become increasingly unstable as the number of decision points diverges to infinity. 
Recently, ","element":"span"},{"href":"#id-16","referenceIndex":17,"text":"Kallus and Uehara ","element":"a"},{"href":"#id-16","referenceIndex":17,"text":"(2019) ","element":"a"},{"text":"proposed a double reinforcement learning (DRL) method that achieves consistent estimation of the value under a fixed policy even with limited number of trajectories. Their method computes a Q-function and a marginalized density ratio. Learning the density ratio is challenging in general and it remains difficult to investigate the goodness-of-fit of the estimated density ratio in practice.","element":"span"}],[{"text":"The focus of this paper is to construct confidence intervals (CIs) for a (possibly data-dependent) policy’s value function at a given state as well as its integrated value with respect to a given reference distribution. Our proposed CI is derived by estimating the state-action value function (Q-function) under the target policy. Similar to the value, the Q-function measures the discounted cumulative reward that the agent receives on average, starting from some initial state-action pair. We use series/sieve method to approximate the Q-function based on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"basis functions, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"grows with the total number of observations. The advances of our proposed method are summarized as follows. ","element":"span"},{"text":"First, the proposed inference method is generally applicable. Specifically, it can be applied to any fixed policy (either deterministic or random) and any data-dependent policy whose value converges at a certain rate. 
The latter includes policies estimated by general Q-learning type algorithms that learn an optimal Q-function from the observed data, such as gradient Q-learning ","element":"span"},{"href":"#id-17","referenceIndex":23,"text":"(Maei et al., ","element":"a"},{"href":"#id-17","referenceIndex":23,"text":"2010; ","element":"a"},{"href":"#id-12","referenceIndex":9,"text":"Ertefaie and Strawderman, ","element":"a"},{"href":"#id-12","referenceIndex":9,"text":"2018)","element":"a"},{"text":", fitted Q-iteration (see, for example,","element":"span"}],[{"href":"#id-18","referenceIndex":8,"text":"Ernst et al., ","element":"a"},{"href":"#id-18","referenceIndex":8,"text":"2005; ","element":"a"},{"href":"#id-19","referenceIndex":32,"text":"Riedmiller, ","element":"a"},{"href":"#id-19","referenceIndex":32,"text":"2005)","element":"a"},{"text":", etc. See Section ","element":"span"},{"href":"#id-20","text":"3.2.4 ","element":"a"},{"text":"for detailed illustrations.","element":"span"}],[{"text":"Second, when applied to data-dependent policies, our method is valid in nonregular cases where the optimal policy is not uniquely defined. Inference without requiring the uniqueness of the optimal policy is extremely challenging even in the simpler finite-horizon settings (see the related discussions in ","element":"span"},{"href":"#id-21","referenceIndex":21,"text":"Luedtke and van der Laan, ","element":"a"},{"href":"#id-21","referenceIndex":21,"text":"2016)","element":"a"},{"text":". The major challenge is that the estimated policy may not stabilize as the sample size grows, making the variance of the value estimator difficult to estimate (see Section ","element":"span"},{"href":"#id-22","text":"3.2.1 ","element":"a"},{"text":"for details). 
We achieve valid inference by proposing a ","element":"span"},{"style":{"fontWeight":"bold"},"text":"S","element":"span"},{"text":"equenti","element":"span"},{"style":{"fontWeight":"bold"},"text":"A","element":"span"},{"text":"l ","element":"span"},{"style":{"fontWeight":"bold"},"text":"V","element":"span"},{"text":"alue ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E","element":"span"},{"text":"valuation (SAVE) method that splits the data into several blocks and recursively updates the estimated policy and its value estimator. It is worth mentioning that the data-splitting rule cannot be arbitrarily determined since the observations are time dependent in infinite horizon settings (see Section ","element":"span"},{"href":"#id-23","text":"3.2.2 ","element":"a"},{"text":"for details).","element":"span"}],[{"text":"Third, our CI is valid as long as either the number of trajectories ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"in the data, or the number of decision points ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"per trajectory diverges to infinity. It can thus be applied to a wide variety of real applications in infinite horizons ranging from the Framingham heart study ","element":"span"},{"href":"#id-24","referenceIndex":47,"text":"(Tsao and Vasan, ","element":"a"},{"href":"#id-24","referenceIndex":47,"text":"2015) ","element":"a"},{"text":"with over two thousand patients to the OhioT1DM dataset that contains eight weeks’ worth of data for six people. We also allow both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"to approach infinity, which is the case in applications from video games. In contrast, CIs proposed by ","element":"span"},{"href":"#id-14","referenceIndex":44,"text":"Thomas et al. 
","element":"a"},{"href":"#id-14","referenceIndex":44,"text":"(2015) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-15","referenceIndex":15,"text":"Jiang and Li ","element":"a"},{"href":"#id-15","referenceIndex":15,"text":"(2016) ","element":"a"},{"text":"require ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"to grow to infinity to achieve nominal coverage.","element":"span"}],[{"text":"Lastly, we consider both off-policy and on-policy learning methods. In off-policy settings, CIs are derived based on historical data collected by a potentially different behavior policy. Off-policy evaluation is critical in situations where running the target policy could be expensive, risky or unethical. In on-policy settings, the estimated policy is recursively updated as batches of new observations arrive. To our knowledge, this is the first work on statistical inference of a data-dependent policy in on-policy settings in sequential decision making with infinite horizons.","element":"span"}],[{"text":"To study the asymptotic properties of our proposed CI, we focus on tensor-product spline and wavelet series estimators. Our technical contributions are described as follows. First, we introduce a bidirectional-asymptotic framework that allows either ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"to approach infinity. 
Our major technical contribution is to derive a nonasymptotic error bound for the spectral norm of sums of mean-zero random matrices formed by the data transactions from the MDP as a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"(see e.g., Lemma ","element":"span"},{"href":"#id-25","text":"3)","element":"a"},{"text":". This result is important in studying the limiting distribution of series estimators under such a theoretical framework.","element":"span"}],[{"text":"Second, for policies that are estimated by Q-learning type algorithms such as the greedy gradient Q-learning, fitted Q-iteration and deep Q-network ","element":"span"},{"href":"#id-26","referenceIndex":28,"text":"(Mnih et al., ","element":"a"},{"href":"#id-26","referenceIndex":28,"text":"2015)","element":"a"},{"text":", we relate the convergence rate of their values to the prediction error of the corresponding estimated Q-functions. We show in Theorems ","element":"span"},{"href":"#id-27","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"that the values can converge at faster rates than the estimated Q-functions under certain margin type conditions on the optimal Q-function. To the best of our knowledge, these findings have not been discovered in the reinforcement learning literature. Our theorems form a basis for researchers to study the value properties of Q-learning type algorithms. 
Moreover, our theoretical results are consistent with findings in point treatment studies where there is only a single decision point (see e.g., ","element":"span"},{"href":"#id-29","referenceIndex":31,"text":"Qian and ","element":"a"},{"href":"#id-29","referenceIndex":31,"text":"Murphy, ","element":"a"},{"href":"#id-29","referenceIndex":31,"text":"2011; ","element":"a"},{"href":"#id-21","referenceIndex":21,"text":"Luedtke and van der Laan, ","element":"a"},{"href":"#id-21","referenceIndex":21,"text":"2016)","element":"a"},{"text":". However, the derivation of Theorems ","element":"span"},{"href":"#id-27","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"is more involved since the value function in our settings is an infinite series involving both immediate and future rewards.","element":"span"}],[{"text":"Third, when these basis functions are used, we mathematically characterize the approximation error of the Q-function as a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", the dimension of the state variables, and the smoothness of the Markov transition function and the conditional mean of the immediate reward as a function of the state-action pair. This offers some guidance to practitioners on the choice of the number of basis functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", when some prior knowledge of the degree of smoothness of the aforementioned functions is available.","element":"span"}],[{"text":"The rest of the paper is organized as follows. We introduce the model setup in Section ","element":"span"},{"text":"2. ","element":"span"},{"text":"In Sections ","element":"span"},{"text":"3 ","element":"span"},{"text":"and ","element":"span"},{"text":"4, ","element":"span"},{"text":"we present the proposed off-policy and on-policy evaluation methods, respectively. 
Simulation studies are conducted to evaluate the empirical performance of the proposed inference methods in Section ","element":"span"},{"text":"5. ","element":"span"},{"text":"We apply the proposed inference method to the OhioT1DM dataset in Section ","element":"span"},{"text":"6. ","element":"span"},{"text":"Finally, we conclude our paper with a discussion section.","element":"span"}]]},{"heading":"2 Optimal policy in infinite-horizon settings","paragraphs":[[{"text":"We begin by introducing the notions of the optimal policy, the Q-function and the value function in infinite-horizon settings. Let ","element":"span"},{"style":{"height":18.47},"width":174.45,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/4-0.png","element":"img","alt":" X0,t ∈ X","inline":true,"padRight":true},{"text":"be the time-varying covariates collected at time point ","element":"span"},{"style":{"height":19.27},"width":216.12,"height":48.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-0.png","element":"img","alt":" t, A0,t ∈ A","inline":true,"padRight":true},{"text":"denote the action taken at time ","element":"span"},{"style":{"height":18.87},"width":196.35,"height":47.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-1.png","element":"img","alt":" t, and Y0,t","inline":true,"padRight":true},{"text":"stand for the immediate reward observed. Here, ","element":"span"},{"text":"X ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"denote the state and action spaces, respectively. 
We assume ","element":"span"},{"text":"X ","element":"span"},{"text":"is a subspace of ","element":"span"},{"style":{"height":16.14},"width":237.81,"height":40.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-2.png","element":"img","alt":" Rd where d","inline":true,"padRight":true},{"text":"is the number of state vectors and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a discrete space ","element":"span"},{"style":{"height":19.2},"width":534.4,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-3.png","element":"img","alt":"{0, 1, · · · , m − 1} where m","inline":true,"padRight":true},{"text":"denotes the number of actions. Suppose the system satisfies the","element":"span"}],[{"text":"following Markov assumption (MA),","element":"span"}],[{"style":{"width":"90%"},"width":1667,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-4.png","element":"img"}],[{"text":"for some transition function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":". Here, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"defines the next state distribution conditional on the current state-action pair. Moreover, suppose the following conditional mean indepen-","element":"span"}],[{"text":"dence assumption (CMIA) holds","element":"span"}],[{"style":{"width":"67%"},"width":1236,"height":141,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-5.png","element":"img"}],[{"text":"for some reward function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":". 
By MA, CMIA automatically holds when ","element":"span"},{"style":{"height":18.47},"width":65.51,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-6.png","element":"img","alt":" Y0,t","inline":true,"padRight":true},{"text":"is a deterministic function of ","element":"span"},{"style":{"height":18.87},"width":400.47,"height":47.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-7.png","element":"img","alt":" X0,t, A0,t and X0,t+1","inline":true,"padRight":true},{"text":"that measures the system’s status at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 1. The latter is satisfied in our real data application (see Section ","element":"span"},{"text":"6 ","element":"span"},{"text":"for details) and is commonly assumed in the reinforcement learning literature. CMIA is thus weaker than this condition. MA and CMIA are important to guarantee the existence of an optimal policy (see ","element":"span"},{"href":"#id-30","text":"(2.1)","element":"a"},{"text":") and derive the bidirectional-asymptotic theory of the proposed CI (see the discussions below Theorem ","element":"span"},{"href":"#id-31","text":"1)","element":"a"},{"text":". We assume both assumptions hold throughout this paper.","element":"span"}],[{"text":"In the following, we focus on the class of stationary policies that map the covariate space ","element":"span"},{"text":"X ","element":"span"},{"text":"to probability mass functions on ","element":"span"},{"style":{"height":19.6},"width":254.22,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-8.png","element":"img","alt":" A. Let π(·|·","inline":true},{"text":") denote such a policy. 
It satisfies ","element":"span"},{"style":{"height":19.6},"width":179.45,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-9.png","element":"img","alt":"π(a|x) ≥","inline":true,"padRight":true},{"text":"0, for any ","element":"span"},{"style":{"height":20.78},"width":1018.65,"height":51.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-10.png","element":"img","alt":" a ∈ A, x ∈ X and �a∈A π(a|x) = 1, for any x ∈ X","inline":true},{"text":". For a deterministic ","element":"span"},{"text":"policy, we have ","element":"span"},{"style":{"height":19.6},"width":976.82,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-11.png","element":"img","alt":" π(a|x) ∈ {0, 1}, for any a ∈ A, x ∈ X. Under π","inline":true},{"text":", a decision maker will set ","element":"span"},{"style":{"height":18.87},"width":175.93,"height":47.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-12.png","element":"img","alt":"A0,t = a","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"style":{"height":19.67},"width":389.3,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-13.png","element":"img","alt":" π(a|X0,t) at time t","inline":true},{"text":". 
For such a policy and a given discount","element":"span"}],[{"text":"factor 0 ","element":"span"},{"style":{"height":19.6},"width":392.18,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-14.png","element":"img","alt":" ≤ γ < 1, let V (π; x","inline":true},{"text":") denote the value function","element":"span"}],[{"style":{"width":"37%"},"width":684,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-15.png","element":"img"}],[{"text":"where the expectation ","element":"span"},{"style":{"height":13.2},"width":51.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-16.png","element":"img","alt":" Eπ ","inline":true,"padRight":true},{"text":"is taken by assuming that the system follows the policy ","element":"span"},{"style":{"height":14},"width":144.58,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-17.png","element":"img","alt":" π. The","inline":true,"padRight":true},{"text":"rate ","element":"span"},{"style":{"height":13.2},"width":26,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-18.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"reflects a trade-off between immediate and future rewards. If ","element":"span"},{"style":{"height":18},"width":454.89,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/5-19.png","element":"img","alt":" γ = 0, the agent tends","inline":true,"padRight":true},{"text":"to choose actions that maximize the immediate reward. As ","element":"span"},{"style":{"height":13.2},"width":26,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-0.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"increases, the agent will","element":"span"}],[{"text":"weigh future rewards more heavily. 
Under CMIA, we have","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"π","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"height":27.2},"width":150.2,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-1.png","element":"img","alt":") =�","inline":true}],[{"style":{"width":"53%"},"width":988,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-2.png","element":"img"}],[{"text":"Similar to Theorem 6.2.12 of ","element":"span"},{"href":"#id-11","referenceIndex":30,"text":"Puterman ","element":"a"},{"href":"#id-11","referenceIndex":30,"text":"(1994)","element":"a"},{"text":", we can show under the given conditions","element":"span"}],[{"text":"that there exists at least one optimal policy ","element":"span"},{"style":{"height":15.34},"width":73.72,"height":38.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-3.png","element":"img","alt":" πopt ","inline":true,"padRight":true},{"text":"that satisfies","element":"span"}],[{"id":"id-30","style":{"width":"68%"},"width":1261,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-4.png","element":"img"}],[{"text":"To better understand ","element":"span"},{"style":{"height":15.34},"width":73.72,"height":38.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-5.png","element":"img","alt":" πopt","inline":true},{"text":", we introduce the state-action function (Q-function) under a","element":"span"}],[{"text":"policy ","element":"span"},{"style":{"height":9.2},"width":86.32,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-6.png","element":"img","alt":" π 
as","inline":true}],[{"style":{"width":"49%"},"width":919,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-7.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":18.54},"width":82.52,"height":46.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-8.png","element":"img","alt":" Qopt ","inline":true,"padRight":true},{"text":"denote the optimal Q-function, i.e, ","element":"span"},{"style":{"height":20.14},"width":498.3,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-9.png","element":"img","alt":" Qopt(·, ·) = supπ Q(π; ·, ·","inline":true},{"text":"). It can be shown that ","element":"span"},{"style":{"height":15.34},"width":248.93,"height":38.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-10.png","element":"img","alt":" πopt satisfies","inline":true}],[{"style":{"width":"52%"},"width":970,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-11.png","element":"img"}],[{"text":"There exist infinitely many optimal policies when arg max","element":"span"},{"style":{"height":20.14},"width":203.48,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-12.png","element":"img","alt":"a Qopt(x, a","inline":true},{"text":") is not unique for some ","element":"span"},{"style":{"height":15.74},"width":317.84,"height":39.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-13.png","element":"img","alt":"x ∈ X. Let Πopt ","inline":true,"padRight":true},{"text":"denote the set consisting of all these optimal policies. Define","element":"span"}],[{"id":"id-34","style":{"width":"78%"},"width":1446,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-14.png","element":"img"}],[{"text":"where sarg max denotes the smallest maximizer when the argmax is not unique. 
Such a","element":"span"}],[{"text":"deterministic optimal policy may be appealing in medical studies. For example, in optimal dose studies, it is preferred to assign each patient the smallest optimal dose level to avoid toxicity.","element":"span"}]]},{"heading":"3 Off-policy evaluation","paragraphs":[[{"id":"id-54","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inference of the value under a fixed policy","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"denote the number of trajectories in the dataset. For the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th trajectory, let ","element":"span"},{"style":{"height":19.67},"width":188.6,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-15.png","element":"img","alt":" {Ai,t}t≥0,","inline":true},{"style":{"height":19.67},"width":452.84,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/6-16.png","element":"img","alt":"{Xi,t}t≥0 and {Yi,t}t≥0","inline":true,"padRight":true},{"text":"denote the sequence of actions, states and rewards, respectively. It is worth mentioning that the time points are not necessarily homogeneous across different trajectories. 
Suppose the data are generated according to a fixed policy ","element":"span"},{"style":{"height":19.6},"width":77.7,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/7-0.png","element":"img","alt":" b(·|·","inline":true},{"text":"), better known","element":"span"}],[{"text":"as the behavior policy such that","element":"span"}],[{"style":{"width":"74%"},"width":1372,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/7-1.png","element":"img"}],[{"text":"are i.i.d. copies of ","element":"span"},{"style":{"height":19.67},"width":419.54,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/7-2.png","element":"img","alt":" {(X0,t, A0,t, Y0,t)}t≥0.","inline":true,"padRight":true},{"text":"The observed data can thus be summarized as ","element":"span"},{"style":{"height":19.67},"width":901.34,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/7-3.png","element":"img","alt":"{(Xi,t, Ai,t, Yi,t, Xi,t+1)}0≤t 0, let ⌊p⌋","inline":true,"padRight":true},{"text":"denote the largest integer that is strictly smaller than ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". 
Define the class of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-smooth functions as follows:","element":"span"}],[{"style":{"width":"74%"},"width":1371,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/8-13.png","element":"img"}],[{"text":"Λ(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p, c","element":"span"},{"text":") =","element":"span"}],[{"style":{"width":"74%"},"width":1371,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/8-14.png","element":"img"}],[{"text":"When 0 ","element":"span"},{"style":{"height":16},"width":131.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-0.png","element":"img","alt":" < p ≤","inline":true,"padRight":true},{"text":"1, we have ","element":"span"},{"style":{"height":19.2},"width":773.84,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-1.png","element":"img","alt":" ⌊p⌋ = 0. It is equivalent to require h","inline":true,"padRight":true},{"text":"to satisfy sup","element":"span"},{"style":{"height":21.32},"width":208.76,"height":53.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-2.png","element":"img","alt":"x,y |h(x) −","inline":true},{"style":{"height":19.87},"width":385.84,"height":49.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-3.png","element":"img","alt":"h(y)|/∥x − y∥p2 ≤ c","inline":true},{"text":". 
The notion of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-smoothness thus reduces to Hölder continuity.","element":"span"}],[{"text":"For any ","element":"span"},{"style":{"height":17.6},"width":278.16,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-4.png","element":"img","alt":" x ∈ X, a ∈ A","inline":true},{"text":", suppose the transition kernel ","element":"span"},{"style":{"height":19.6},"width":154.56,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-5.png","element":"img","alt":" P(·|x, a","inline":true},{"text":") is absolutely continuous with respect to the Lebesgue measure. Then there exists some transition density function ","element":"span"},{"style":{"height":19.6},"width":778.39,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-6.png","element":"img","alt":"q such that P(dx′|x, a) = q(x′|x, a)dx′","inline":true},{"text":". We impose the following condition.","element":"span"}],[{"text":"(A1.) 
There exist some ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p, c > ","element":"span"},{"text":"0 such that ","element":"span"},{"style":{"height":19.6},"width":960.99,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-7.png","element":"img","alt":" r(·, a), q(x′|·, a) ∈ Λ(p, c) for any a ∈ A, x′ ∈ X.","inline":true}],[{"id":"id-36","style":{"fontWeight":"bold"},"text":"Lemma 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under A1, there exists some constant ","element":"span"},{"style":{"height":19.6},"width":815.15,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-8.png","element":"img","alt":" c′ > 0 such that Q(π; ·, a) ∈ Λ(p, c′) for","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"any policy ","element":"span"},{"style":{"height":15.2},"width":271.09,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-9.png","element":"img","alt":" π and a ∈ A.","inline":true}],[{"text":"Lemma ","element":"span"},{"href":"#id-36","text":"1 ","element":"a"},{"text":"implies the Q-function has bounded derivatives up to order ","element":"span"},{"style":{"height":19.2},"width":65.76,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-10.png","element":"img","alt":" ⌊p⌋","inline":true},{"text":". This motivates us to first estimate the Q-function and then derive the corresponding value estimators based on the relation ","element":"span"},{"style":{"height":20.78},"width":647.36,"height":51.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-11.png","element":"img","alt":" V (π; x) = �a∈A π(a|x)Q(π; x, a","inline":true},{"text":"). 
By the Bellman equation ","element":"span"},{"href":"#id-37","text":"(3.6)","element":"a"},{"text":", we","element":"span"}],[{"text":"can show the Q-function satisfies","element":"span"}],[{"id":"id-38","style":{"width":"93%"},"width":1732,"height":146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-12.png","element":"img"}],[{"text":"The above equation forms a basis of our methods to learn ","element":"span"},{"style":{"height":19.6},"width":151.81,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-13.png","element":"img","alt":" Q(π; ·, ·","inline":true},{"text":") (see details in the next section). In contrast to Equation ","element":"span"},{"href":"#id-32","text":"(3.3)","element":"a"},{"text":", the sampling ratio ","element":"span"},{"style":{"height":19.6},"width":471.67,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-14.png","element":"img","alt":" π(a|x)/b(a|x) does not","inline":true,"padRight":true},{"text":"appear in ","element":"span"},{"href":"#id-38","text":"(3.7)","element":"a"},{"text":". This is because ","element":"span"},{"style":{"height":18.87},"width":68.05,"height":47.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-15.png","element":"img","alt":" Ai,t","inline":true,"padRight":true},{"text":"is the only sampling action and no further actions are involved in ","element":"span"},{"href":"#id-38","text":"(3.7)","element":"a"},{"text":". As a result, our method does not require correct specification of the behavior policy. Nor do we need to estimate it from the observed dataset. This is another advantage of modelling the Q-function over the value.","element":"span"}],[{"id":"id-107","style":{"fontWeight":"bold"},"text":"3.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Method","element":"span"}],[{"text":"We describe our procedure in this section. 
We propose to approximate ","element":"span"},{"style":{"height":19.6},"width":367.46,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-16.png","element":"img","alt":" Q(π; ·, ·) based on","inline":true}],[{"text":"linear sieves, which takes the form","element":"span"}],[{"style":{"width":"43%"},"width":810,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-17.png","element":"img"}],[{"text":"where Φ","element":"span"},{"style":{"height":19.67},"width":578.9,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/9-18.png","element":"img","alt":"L(·) = {ϕL,1(·), · · · , ϕL,L(·)}⊤ ","inline":true,"padRight":true},{"text":"is a vector consisting of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"sieve basis functions, such as splines or wavelet bases (see for example, ","element":"span"},{"href":"#id-39","referenceIndex":13,"text":"Huang, ","element":"a"},{"href":"#id-39","referenceIndex":13,"text":"1998, ","element":"a"},{"text":"for choices of basis functions). 
We allow ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"to grow with the sample size to reduce the bias of the resulting estimates.","element":"span"}],[{"text":"Under certain mild conditions, there exist some ","element":"span"},{"style":{"height":21.53},"width":192.8,"height":53.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-0.png","element":"img","alt":" {β∗π,a}a∈A ","inline":true,"padRight":true},{"text":"that satisfy","element":"span"}],[{"style":{"width":"90%"},"width":1676,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-1.png","element":"img"}],[{"text":"for any ","element":"span"},{"style":{"height":15.2},"width":133.24,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-2.png","element":"img","alt":" a′ ∈ A","inline":true},{"text":". Recall that ","element":"span"},{"style":{"height":22.87},"width":1095.65,"height":57.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-3.png","element":"img","alt":" A = {0, 1, . . . , m − 1}. Define β∗π = (β∗Tπ,1, · · · , β∗Tπ,m)⊤,","inline":true}],[{"style":{"width":"78%"},"width":1443,"height":146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-4.png","element":"img"}],[{"style":{"height":19.67},"width":715.68,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-5.png","element":"img","alt":"ξi,t = ξ(Xi,t, Ai,t), Uπ,i,t = Uπ(Xi,t","inline":true},{"text":"). 
The above equation can be rewritten as ","element":"span"},{"style":{"height":19.67},"width":218.35,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-6.png","element":"img","alt":" Eξi,t(ξi,t −","inline":true},{"style":{"height":19.67},"width":488.76,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-7.png","element":"img","alt":"γUπ,i,t+1)⊤β∗π = Eξi,tYi,t","inline":true},{"text":". Based on the observed data, we propose to estimate ","element":"span"},{"style":{"height":18.33},"width":269.07,"height":45.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-8.png","element":"img","alt":" β∗π by solving","inline":true}],[{"style":{"width":"87%"},"width":1606,"height":501,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-9.png","element":"img"}],[{"text":"A two-sided CI is given by","element":"span"}],[{"id":"id-41","style":{"width":"89%"},"width":1644,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.67},"width":43.75,"height":29.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-11.png","element":"img","alt":" zα","inline":true,"padRight":true},{"text":"denotes the upper ","element":"span"},{"style":{"height":9.2},"width":30,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-12.png","element":"img","alt":" α","inline":true},{"text":"-th quantile of a standard normal distribution, and","element":"span"}],[{"style":{"width":"42%"},"width":786,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-13.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"92%"},"width":1701,"height":141,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-14.png","element":"img"}],[{"text":"Let ","element":"span"},{"text":"G 
","element":"span"},{"text":"be a reference distribution on the covariate space ","element":"span"},{"text":"X","element":"span"},{"text":". Define the following inte-","element":"span"}],[{"text":"grated value function","element":"span"}],[{"style":{"width":"32%"},"width":606,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/10-15.png","element":"img"}],[{"text":"By setting ","element":"span"},{"style":{"height":19.6},"width":68.4,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-0.png","element":"img","alt":" G(·","inline":true},{"text":") to be a Dirac measure ","element":"span"},{"style":{"height":19.6},"width":1002.7,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-1.png","element":"img","alt":" δx(·), i.e, G(X) = I(x ∈ X), ∀X ⊆ X, V (π; G)","inline":true,"padRight":true},{"text":"is reduced to ","element":"span"},{"style":{"height":19.6},"width":358.29,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-2.png","element":"img","alt":" V (π; x). Let ν0(·","inline":true},{"text":") be the probability density function of ","element":"span"},{"style":{"height":18.47},"width":82.21,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-3.png","element":"img","alt":" X0,0","inline":true},{"text":". 
By setting","element":"span"}],[{"style":{"height":19.6},"width":343.97,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-4.png","element":"img","alt":"G(dx) = ν0(x)dx","inline":true},{"text":", we obtain","element":"span"}],[{"style":{"width":"53%"},"width":985,"height":165,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-5.png","element":"img"}],[{"text":"Based on ","element":"span"},{"style":{"height":17.2},"width":51.55,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-6.png","element":"img","alt":"�βπ","inline":true},{"text":", a two-sided CI for ","element":"span"},{"style":{"height":19.6},"width":142.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-7.png","element":"img","alt":" V (π; G","inline":true},{"text":") is given by","element":"span"}],[{"id":"id-40","style":{"width":"90%"},"width":1665,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-8.png","element":"img"}],[{"text":"where","element":"span"}],[{"id":"id-78","style":{"width":"86%"},"width":1595,"height":251,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"3.1.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Theory","element":"span"}],[{"text":"In this section, we focus on proving the validity of the proposed CIs in ","element":"span"},{"href":"#id-40","text":"(3.9)","element":"a"},{"text":". By setting ","element":"span"},{"style":{"height":19.6},"width":223.28,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-10.png","element":"img","alt":"G(·) = δx(·","inline":true},{"text":"), it implies that the CI in ","element":"span"},{"href":"#id-41","text":"(3.8) ","element":"a"},{"text":"achieves nominal coverage as well. 
To simplify the presentation, we assume ","element":"span"},{"style":{"height":16.07},"width":484.46,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-11.png","element":"img","alt":" T1 = T2 = · · · = Tn = T","inline":true},{"text":", all the covariates are continuous and ","element":"span"},{"style":{"height":20.54},"width":210.93,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-12.png","element":"img","alt":"X = [0, 1]d","inline":true},{"text":". Our theory is valid regardless of whether ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded or diverges to infinity. We remark that the boundedness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"does not mean we work on a finite-horizon setting, since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is the termination time of the study, not the final time step of each trajectory.","element":"span"}],[{"text":"In addition, we restrict our attention to two particular types of sieve basis functions, corresponding to tensor product of B-splines with degree ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"and dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"or wavelets with regularity ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"and dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":". See Section 6 of ","element":"span"},{"href":"#id-42","referenceIndex":5,"text":"Chen and Christensen ","element":"a"},{"href":"#id-42","referenceIndex":5,"text":"(2015) ","element":"a"},{"text":"for a brief review of these sieve bases. 
This together with A1 implies that there exists a set of vectors ","element":"span"},{"style":{"height":21.53},"width":192.8,"height":53.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-13.png","element":"img","alt":"{β∗π,a}a∈A ","inline":true,"padRight":true},{"text":"that satisfy sup","element":"span"},{"style":{"height":23.67},"width":875.6,"height":59.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-14.png","element":"img","alt":"x∈X,a∈A |Q(π; x, a) − Φ⊤L(x)β∗π,a| = O(L−p/d","inline":true},{"text":"). See Section 2.2 of ","element":"span"},{"href":"#id-39","referenceIndex":13,"text":"Huang ","element":"a"},{"href":"#id-39","referenceIndex":13,"text":"(1998) ","element":"a"},{"text":"for detailed discussions on the approximation power of these sieve bases.","element":"span"}],[{"text":"Following the behavior policy ","element":"span"},{"style":{"height":19.6},"width":77.71,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-15.png","element":"img","alt":" b(·|·","inline":true},{"text":"), the set of variables ","element":"span"},{"style":{"height":19.67},"width":182.84,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-16.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"forms a time-homogeneous","element":"span"}],[{"text":"Markov chain. Its transition kernel ","element":"span"},{"style":{"height":16.07},"width":62.26,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-17.png","element":"img","alt":" PX","inline":true,"padRight":true},{"text":"is given by","element":"span"}],[{"style":{"width":"73%"},"width":1354,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/11-18.png","element":"img"}],[{"text":"We impose the following assumptions.","element":"span"}],[{"text":"(A2.) 
The Markov chain ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-0.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"has a unique invariant distribution with some density function ","element":"span"},{"style":{"height":19.6},"width":59.38,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-1.png","element":"img","alt":" µ(·","inline":true},{"text":"). The density functions ","element":"span"},{"style":{"height":18},"width":174.25,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-2.png","element":"img","alt":" µ and ν0","inline":true,"padRight":true},{"text":"are uniformly bounded away from 0 and ","element":"span"},{"style":{"height":8.8},"width":60.82,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-3.png","element":"img","alt":" ∞.","inline":true,"padRight":true},{"text":"(A3.) 
Suppose (i) and (ii) hold when ","element":"span"},{"style":{"height":13.6},"width":156.35,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-4.png","element":"img","alt":" T → ∞","inline":true,"padRight":true},{"text":"and (i) holds when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded.","element":"span"}],[{"text":"(i)","element":"span"},{"style":{"height":25.23},"width":1243.14,"height":63.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-5.png","element":"img","alt":"λmin[�T−1t=0 E{ξ0,tξ⊤0,t − γ2uπ(X0,t, A0,t)u⊤π (X0,t, A0,t)}] ≥ T ¯c","inline":true,"padRight":true},{"text":"for some constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c̄ > ","element":"span"},{"text":"0, ","element":"span"},{"text":"where ","element":"span"},{"style":{"height":19.67},"width":1201.26,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-6.png","element":"img","alt":" uπ(x, a) = E{Uπ(X0,1)|X0,0 = x, A0,0 = a} and λmin(K","inline":true},{"text":") denotes the minimum eigenvalue of a matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":".","element":"span"}],[{"text":"(ii) The Markov chain ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-7.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"is geometrically ergodic.","element":"span"}],[{"text":"We make a few remarks. 
First, we do not require the limiting density function ","element":"span"},{"style":{"height":18},"width":149.23,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-8.png","element":"img","alt":" µ to be","inline":true,"padRight":true},{"text":"equal to the initial state density ","element":"span"},{"style":{"height":11.67},"width":55.06,"height":29.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-9.png","element":"img","alt":" ν0.","inline":true}],[{"text":"Second, Condition A3(i) guarantees the matrix ","element":"span"},{"style":{"height":16.07},"width":90.74,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-10.png","element":"img","alt":" E�Σπ","inline":true,"padRight":true},{"text":"is invertible. In Section ","element":"span"},{"href":"#id-43","text":"C.1 ","element":"a"},{"text":"of the supplementary article, we show A3(i) is automatically satisfied when ","element":"span"},{"style":{"height":18},"width":261.89,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-11.png","element":"img","alt":" µ = ν0, the","inline":true,"padRight":true},{"text":"target policy ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-12.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is deterministic and ","element":"span"},{"style":{"height":14},"width":180.88,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-13.png","element":"img","alt":" b is the ϵ","inline":true},{"text":"-greedy policy with respect to ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-14.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"that 
satisfies","element":"span"}],[{"style":{"height":19.74},"width":223.35,"height":49.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-15.png","element":"img","alt":"ϵ ≤ 1 − γ2.","inline":true}],[{"text":"Third, we present the detailed definition of geometric ergodicity in Appendix ","element":"span"},{"text":"A ","element":"span"},{"text":"to save space. Suppose the Markov chain ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-16.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"has a finite state space. Assume ","element":"span"},{"style":{"height":16.07},"width":62.26,"height":40.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-17.png","element":"img","alt":" PX","inline":true,"padRight":true},{"text":"is diagonalizable. Then A3(ii) holds when the second largest eigenvalue of ","element":"span"},{"style":{"height":16.07},"width":58.22,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-18.png","element":"img","alt":" PX","inline":true,"padRight":true},{"text":"is strictly smaller than 1. 
When ","element":"span"},{"style":{"height":18.47},"width":77.21,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-19.png","element":"img","alt":" X0,t","inline":true},{"text":"’s are generated by the vector autoregressive process ","element":"span"},{"style":{"height":19.67},"width":545.23,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-20.png","element":"img","alt":" E{X0,t|X0,t−1} = f(X0,t−1)","inline":true,"padRight":true},{"text":"for some function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":34,"text":"Saikkonen ","element":"a"},{"href":"#id-44","referenceIndex":34,"text":"(2001) ","element":"a"},{"text":"provided sufficient conditions that ensure the geometric ergodicity of the Markov chain.","element":"span"}],[{"text":"Finally, when ","element":"span"},{"style":{"height":19.67},"width":352.99,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-21.png","element":"img","alt":" ν0 = µ, {X0,t}t≥0","inline":true,"padRight":true},{"text":"is stationary. Under Condition A3(ii), it follows from Theorem 3.7 of ","element":"span"},{"href":"#id-45","referenceIndex":2,"text":"Bradley ","element":"a"},{"href":"#id-45","referenceIndex":2,"text":"(2005) ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-22.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"is exponentially ","element":"span"},{"style":{"height":17.2},"width":28,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-23.png","element":"img","alt":" β","inline":true},{"text":"-mixing (see the proof of Lemma ","element":"span"},{"href":"#id-25","text":"3 ","element":"a"},{"text":"for details). 
When ","element":"span"},{"style":{"height":13.6},"width":170.62,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-24.png","element":"img","alt":" T → ∞","inline":true},{"text":", A3(ii) enables us to derive matrix concentration inequalities for ","element":"span"},{"style":{"height":16.07},"width":58.86,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-25.png","element":"img","alt":"�Σπ","inline":true},{"text":". This together with A3(i) implies that ","element":"span"},{"style":{"height":16.07},"width":58.86,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-26.png","element":"img","alt":"�Σπ","inline":true,"padRight":true},{"text":"is invertible, with probability approaching 1 (wpa1). We remark that A3(ii) is not needed when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded.","element":"span"}],[{"text":"For any ","element":"span"},{"style":{"height":17.6},"width":410.22,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-27.png","element":"img","alt":" x ∈ X, a ∈ A, define","inline":true}],[{"style":{"width":"84%"},"width":1561,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-28.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"ω","element":"span"},{"style":{"height":19.6},"width":226.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-29.png","element":"img","alt":"π(x, a) = E","inline":true}],[{"style":{"width":"84%"},"width":1561,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-30.png","element":"img"}],[{"id":"id-31","style":{"fontWeight":"bold"},"text":"Theorem 1 (bidirectional asymptotics) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume A1-A3 hold. 
Suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfies ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"=","element":"span"}],[{"style":{"height":23.64},"width":1175.14,"height":59.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/12-31.png","element":"img","alt":"o{√nT/ log(nT)}, L2p/d ≫ nT{1 + ∥�x ΦL(x)G(dx)∥−22 }","inline":true},{"style":{"fontStyle":"italic"},"text":", and there exists some constant","element":"span"}],[{"style":{"height":21.59},"width":1842.86,"height":53.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-0.png","element":"img","alt":"c0 ≥ 1 such that ωπ(x, a) ≥ c−10 for any x ∈ X, a ∈ A and Pr(max0≤t≤T−1 |Y0,t| ≤ c0) = 1.","inline":true}],[{"style":{"width":"73%"},"width":1350,"height":160,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-1.png","element":"img"}],[{"text":"A sketch for the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"is given in Appendix ","element":"span"},{"href":"#id-46","text":"E.1. ","element":"a"},{"text":"Under the conditions in Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"we can show that ","element":"span"},{"style":{"height":19.6},"width":132.79,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-2.png","element":"img","alt":" �σ(π; G","inline":true},{"text":") converges almost surely to some ","element":"span"},{"style":{"height":19.6},"width":132.79,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-3.png","element":"img","alt":" σ(π; G","inline":true},{"text":"). 
The form","element":"span"}],[{"text":"of ","element":"span"},{"style":{"height":19.6},"width":132.79,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-4.png","element":"img","alt":" σ(π; G","inline":true},{"text":") is given in Section ","element":"span"},{"href":"#id-47","text":"E.5. ","element":"a"},{"text":"In addition, we have","element":"span"}],[{"id":"id-48","style":{"width":"95%"},"width":1769,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.47},"width":308.8,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-6.png","element":"img","alt":" Σπ = E�Σπ and","inline":true}],[{"style":{"width":"65%"},"width":1212,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-7.png","element":"img"}],[{"text":"By MA, CMIA and ","element":"span"},{"href":"#id-38","text":"(3.7)","element":"a"},{"text":", the leading term on the RHS of ","element":"span"},{"href":"#id-48","text":"(3.12) ","element":"a"},{"text":"forms a mean-zero martingale (details can be found in Section ","element":"span"},{"href":"#id-47","text":"E.5)","element":"a"},{"text":". 
As either ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"grows to infinity, the asymptotic normality follows from the martingale central limit theorem.","element":"span"}],[{"text":"When ","element":"span"},{"style":{"height":19.6},"width":132.79,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-8.png","element":"img","alt":" σ(π; G","inline":true},{"text":") is bounded away from zero, it can be seen from ","element":"span"},{"href":"#id-48","text":"(3.12) ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":19.6},"width":209.93,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-9.png","element":"img","alt":"�V (π; G) −","inline":true},{"style":{"height":21.81},"width":524.7,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-10.png","element":"img","alt":"V (π; G) = Op(n−1/2T −1/2","inline":true},{"text":"). That is, the proposed value estimator converges at a rate of (","element":"span"},{"style":{"height":21.74},"width":157.32,"height":54.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-11.png","element":"img","alt":"nT)−1/2","inline":true},{"text":". 
In contrast, AIPW-type estimators typically converge at a rate of ","element":"span"},{"style":{"height":16.94},"width":275.63,"height":42.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-12.png","element":"img","alt":" n−1/2 and are","inline":true,"padRight":true},{"text":"thus not suitable for settings with only a few trajectories.","element":"span"}],[{"id":"id-84","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inference of the value under an (estimated) optimal policy","element":"span"}],[{"text":"For simplicity, we assume ","element":"span"},{"style":{"height":16.07},"width":512.24,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-13.png","element":"img","alt":" T1 = T2 = · · · = Tn = T","inline":true,"padRight":true},{"text":"throughout this section. Consider an estimated policy ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-14.png","element":"img","alt":" �π","inline":true},{"text":", computed based on the data ","element":"span"},{"style":{"height":19.67},"width":812.1,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/13-15.png","element":"img","alt":" {(Xi,t, Ai,t, Yi,t, Xi,t+1)}0≤t t1 or","inline":true},{"style":{"height":15.67},"width":130.86,"height":39.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/15-11.png","element":"img","alt":"i2 > i1","inline":true},{"text":". 
For any (","element":"span"},{"style":{"height":19.6},"width":256.23,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/15-12.png","element":"img","alt":"i2, t2) ∈ Ik+1","inline":true},{"text":", we require the following:","element":"span"}],[{"id":"id-56","style":{"width":"71%"},"width":1324,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/15-13.png","element":"img"}],[{"text":"Then ","element":"span"},{"style":{"height":14.6},"width":61.23,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/15-14.png","element":"img","alt":" �π¯Ik","inline":true,"padRight":true},{"text":"depends on the ","element":"span"},{"style":{"height":15.67},"width":32.97,"height":39.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/15-15.png","element":"img","alt":" i2","inline":true},{"text":"-th patient’s trajectory only through ","element":"span"},{"style":{"height":19.67},"width":461.66,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/15-16.png","element":"img","alt":" {(Xi2,j, Ai2,j, Yi2,j)}j k1","inline":true},{"text":", we have either ","element":"span"},{"style":{"height":19.6},"width":278.08,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-10.png","element":"img","alt":" n(k2) > n(k1)","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":19.6},"width":276.24,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-11.png","element":"img","alt":" T(k2) > T(k1","inline":true},{"text":"). 
Thus, the proposed data-splitting rule guarantees that ","element":"span"},{"href":"#id-56","text":"(3.16) ","element":"a"},{"text":"holds for any","element":"span"}],[{"style":{"width":"1%"},"width":33,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-12.png","element":"img"}],[{"text":"In Theorem ","element":"span"},{"href":"#id-57","text":"2 ","element":"a"},{"text":"below, we establish the validity of our CI in ","element":"span"},{"href":"#id-58","text":"(3.17)","element":"a"},{"text":". It relies on Conditions A3* and A4. A3* is very similar to A3, and we present the detailed definition in Appendix ","element":"span"},{"text":"A ","element":"span"},{"text":"to save space.","element":"span"}],[{"text":"(A4) ","element":"span"},{"style":{"height":22.23},"width":1182.5,"height":55.58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-13.png","element":"img","alt":" E|V (�π¯Ik; G) − V (π∗; G)| = O(|¯Ik|−b0) for some b0 > 1/","inline":true},{"text":"2 such that (","element":"span"},{"style":{"height":21.74},"width":261.78,"height":54.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-14.png","element":"img","alt":"nT)b0−1/2 ≫","inline":true},{"style":{"height":22.96},"width":397.18,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-15.png","element":"img","alt":"∥�x ΦL(x)G(dx)∥−12 ","inline":true,"padRight":true},{"text":", where the big-","element":"span"},{"style":{"fontStyle":"italic"},"text":"O ","element":"span"},{"text":"term is uniform in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"104%"},"width":1917,"height":364,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-16.png","element":"img"}],[{"id":"id-57","style":{"fontWeight":"bold"},"text":"Theorem 2 (bidirectional asymptotics) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume 
A1-A2, A3* and A4 hold. Suppose","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"style":{"height":23.64},"width":1782.25,"height":59.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-17.png","element":"img","alt":" = O(1) and L satisfies L = o{√nT/ log(nT)}, L2p/d ≫ nT{1 + ∥�x ΦL(x)G(dx)∥−22 }.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"height":17.6},"width":296.41,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-18.png","element":"img","alt":" Tmin = T if T","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded. Assume there exists some constant ","element":"span"},{"style":{"height":16.47},"width":348.18,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-19.png","element":"img","alt":" c0 ≥ 1 such that","inline":true},{"style":{"height":21.59},"width":1332.61,"height":53.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-20.png","element":"img","alt":"ωπ(x, a) ≥ c−10 for any x, a, π and Pr(max0≤t≤T−1 |Y0,t| ≤ c0) = 1","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then as either ","element":"span"},{"style":{"height":10.4},"width":153.82,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-21.png","element":"img","alt":" n → ∞","inline":true}],[{"style":{"width":"81%"},"width":1511,"height":241,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/16-22.png","element":"img"}],[{"text":"We provide a sketch for the proof of Theorem ","element":"span"},{"href":"#id-57","text":"2 ","element":"a"},{"text":"in Appendix ","element":"span"},{"href":"#id-52","text":"E.2.","element":"a"}],[{"id":"id-111","style":{"fontWeight":"bold"},"text":"3.2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence of the value under an estimated optimal policy","element":"span"}],[{"text":"For any ","element":"span"},{"style":{"height":16.8},"width":383.63,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-0.png","element":"img","alt":" I ⊆ I0, we use �πI","inline":true,"padRight":true},{"text":"to denote an estimated optimal policy based on observations in ","element":"span"},{"style":{"height":19.6},"width":275.19,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-1.png","element":"img","alt":" I. 
Let �QI(·, ·","inline":true},{"text":") denote some consistent estimator for ","element":"span"},{"style":{"height":20.14},"width":326.18,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-2.png","element":"img","alt":" Qopt(·, ·) and �πI","inline":true,"padRight":true},{"text":"denote the greedy policy with respect to ","element":"span"},{"style":{"height":19.6},"width":125.38,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-3.png","element":"img","alt":"�QI(·, ·","inline":true},{"text":") (see Equation ","element":"span"},{"href":"#id-59","text":"(3.14)","element":"a"},{"text":").","element":"span"}],[{"text":"In the following, we focus on relating ","element":"span"},{"style":{"height":20.14},"width":482.69,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-4.png","element":"img","alt":" |V (πopt; G) − V (�πI; G)|","inline":true,"padRight":true},{"text":"to the prediction loss ","element":"span"},{"style":{"height":18.54},"width":201.08,"height":46.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-5.png","element":"img","alt":"�QI − Qopt","inline":true},{"text":". By definition, ","element":"span"},{"style":{"height":20.14},"width":1310.5,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-6.png","element":"img","alt":" V (πopt; x) ≥ V (�πI; x), ∀x ∈ X. Hence, V (πopt; G) ≥ V (�πI; G). It","inline":true,"padRight":true},{"text":"suffices to provide an upper bound for ","element":"span"},{"style":{"height":20.14},"width":426.27,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-7.png","element":"img","alt":" V (πopt; G) − V (�πI; G","inline":true},{"text":"). 
We introduce a margin-type condition A5 below.","element":"span"}],[{"text":"(A5) Assume there exist some constants ","element":"span"},{"style":{"height":17.2},"width":141.02,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-8.png","element":"img","alt":" α, δ0 >","inline":true,"padRight":true},{"text":"0 such that","element":"span"}],[{"id":"id-63","style":{"width":"92%"},"width":1708,"height":246,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.4},"width":27,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-10.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"denotes the Lebesgue measure, the big-","element":"span"},{"style":{"fontStyle":"italic"},"text":"O ","element":"span"},{"text":"terms are uniform in 0 ","element":"span"},{"style":{"height":17.2},"width":295.17,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-11.png","element":"img","alt":" < ε ≤ δ0, and","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":21.19},"width":750.26,"height":52.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-12.png","element":"img","alt":"a′∈A−arg maxa Qopt(x,a) Qopt(x, a′) = −∞","inline":true,"padRight":true},{"text":"if the set ","element":"span"},{"style":{"height":20.14},"width":578.6,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-13.png","element":"img","alt":" A − arg maxa Qopt(x, a) = ∅.","inline":true}],[{"text":"For each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", the quantity max","element":"span"},{"style":{"height":21.19},"width":1171.41,"height":52.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-14.png","element":"img","alt":"a Qopt(x, a) − maxa′∈A−arg maxa Qopt(x,a) 
Qopt(x, a′) measures","inline":true,"padRight":true},{"text":"the difference in value between ","element":"span"},{"style":{"height":15.34},"width":73.72,"height":38.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-15.png","element":"img","alt":" πopt ","inline":true,"padRight":true},{"text":"and the policy that assigns the best suboptimal treatment(s) at the first decision point and follows ","element":"span"},{"style":{"height":15.34},"width":73.72,"height":38.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-16.png","element":"img","alt":" πopt ","inline":true,"padRight":true},{"text":"subsequently. In point treatment studies, ","element":"span"},{"href":"#id-29","referenceIndex":31,"text":"Qian and Murphy ","element":"a"},{"href":"#id-29","referenceIndex":31,"text":"(2011) ","element":"a"},{"text":"imposed a similar condition (see Equation (3.3), ","element":"span"},{"href":"#id-29","referenceIndex":31,"text":"Qian and ","element":"a"},{"href":"#id-29","referenceIndex":31,"text":"Murphy, ","element":"a"},{"href":"#id-29","referenceIndex":31,"text":"2011) ","element":"a"},{"text":"to derive a sharp convergence rate for the value under an estimated optimal individualized treatment regime. ","element":"span"},{"text":"Here, we generalize their condition to infinite-horizon settings. A5 is also closely related to the margin condition commonly used to bound the excess misclassification error ","element":"span"},{"href":"#id-60","referenceIndex":48,"text":"(Tsybakov, ","element":"a"},{"href":"#id-60","referenceIndex":48,"text":"2004; ","element":"a"},{"href":"#id-61","referenceIndex":1,"text":"Audibert and Tsybakov, ","element":"a"},{"href":"#id-61","referenceIndex":1,"text":"2007)","element":"a"},{"text":".","element":"span"}],[{"text":"The margin-type condition is mild. 
In Appendix ","element":"span"},{"href":"#id-62","text":"A.3, ","element":"a"},{"text":"we present detailed examples and show that the condition holds in each of them. The following theorems summarize our results.","element":"span"}],[{"id":"id-27","style":{"fontWeight":"bold"},"text":"Theorem 3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume A1, ","element":"span"},{"href":"#id-63","text":"(3.18) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-63","text":"(3.19) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. Suppose the following event occurs with","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"probability at least ","element":"span"},{"text":"1 ","element":"span"},{"style":{"height":19.6},"width":225.11,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-17.png","element":"img","alt":" − O(|I|−κ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for any finite ","element":"span"},{"style":{"height":16.4},"width":127.35,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-18.png","element":"img","alt":" κ > 0,","inline":true}],[{"style":{"width":"46%"},"width":858,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/17-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":21.74},"width":1147.95,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-0.png","element":"img","alt":" b∗ > 0. Then E|V (πopt; G) − V (�πI; G)| = O(|I|−b∗(1+α)).","inline":true}],[{"text":"In Theorem ","element":"span"},{"href":"#id-27","text":"3, ","element":"a"},{"text":"we require the estimated Q-function to satisfy a certain uniform convergence rate. 
In Theorem ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"below, we relax this condition by assuming that the integrated loss converges to zero at a certain rate.","element":"span"}],[{"id":"id-28","style":{"width":"78%"},"width":1442,"height":241,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":21.74},"width":1273.2,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-2.png","element":"img","alt":" b∗ > 0. Then E|V (πopt; G) − V (�πI; G)| = O(|I|−b∗(2+2α)/(2+α)).","inline":true}],[{"text":"It can be seen from Theorems ","element":"span"},{"href":"#id-27","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"that the integrated value converges faster than the estimated Q-function. We provide a sketch of the proofs of both theorems in Appendix ","element":"span"},{"href":"#id-64","text":"E.3.","element":"a"}],[{"id":"id-20","style":{"fontWeight":"bold"},"text":"3.2.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Applications","element":"span"}],[{"style":{"width":"0%"},"width":15,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-3.png","element":"img"}],[{"text":"In this section, we provide several examples to illustrate the convergence rate of ","element":"span"},{"style":{"height":17.2},"width":177.22,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-4.png","element":"img","alt":"�QI. The","inline":true,"padRight":true},{"text":"proposed methods can be applied to evaluate the values under these estimated policies. The algorithm in Example ","element":"span"},{"href":"#id-65","text":"1 ","element":"a"},{"text":"requires imposing a linear model assumption on the optimal Q-function. 
The algorithm in Example ","element":"span"},{"href":"#id-66","text":"2 ","element":"a"},{"text":"allows more general nonlinear and nonparametric models for the optimal Q-function.","element":"span"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Example 1 (Greedy gradient Q-learning) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The optimal Q-function satisfies","element":"span"}],[{"style":{"width":"58%"},"width":1086,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", and hence","element":"span"}],[{"style":{"width":"72%"},"width":1342,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Suppose we model ","element":"span"},{"style":{"height":20.14},"width":193.11,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-7.png","element":"img","alt":" Qopt(x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by linear sieves ","element":"span"},{"text":"Φ","element":"span"},{"style":{"height":19.6},"width":131.18,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-8.png","element":"img","alt":"⊤L(x)θa","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then we can compute ","element":"span"},{"style":{"height":19.67},"width":253.44,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-9.png","element":"img","alt":" {�θa,I}a∈A by","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"minimizing the following projected Bellman error:","element":"span"}],[{"style":{"width":"86%"},"width":1590,"height":183,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/18-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":21.21},"width":1217.24,"height":53.04,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-0.png","element":"img","alt":" δi,t({θa}a∈A) = Yi,t + γ maxa′∈A Φ⊤L(Xi,t+1)θa′ − Φ⊤L(Xi,t)θAi,t","inline":true},{"style":{"fontStyle":"italic"},"text":". The above loss is non- ","element":"span"},{"style":{"fontStyle":"italic"},"text":"smooth and non-convex as a function of ","element":"span"},{"style":{"height":19.2},"width":158.19,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-1.png","element":"img","alt":" {θa}a∈A","inline":true},{"style":{"fontStyle":"italic"},"text":". 
The estimator ","element":"span"},{"style":{"height":19.67},"width":188.44,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-2.png","element":"img","alt":" {�θa,I}a∈A","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be computed based on the greedy gradient Q-learning algorithm.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Assuming the optimal Q-function is correctly specified, ","element":"span"},{"href":"#id-12","referenceIndex":9,"style":{"fontStyle":"italic"},"text":"Ertefaie and Strawderman ","element":"a"},{"href":"#id-12","referenceIndex":9,"style":{"fontStyle":"italic"},"text":"(2018) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"established the consistency and asymptotic normality of the parameter estimates under the scenario where both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"fontStyle":"italic"},"text":"are fixed. Set ","element":"span"},{"style":{"height":19.67},"width":428.89,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-3.png","element":"img","alt":"�QI(x, a) = Φ⊤(x)�θa,I","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Using arguments similar to those in the proof of Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"we can show that with a proper choice of ","element":"span"},{"style":{"height":21.32},"width":521.26,"height":53.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-4.png","element":"img","alt":" L, supx∈X,a∈A | �QI(x, a) −","inline":true},{"style":{"height":20.14},"width":206.32,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-5.png","element":"img","alt":"Qopt(x, a)|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges at a rate of ","element":"span"},{"style":{"height":21.74},"width":295.87,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-6.png","element":"img","alt":" O(|I|−p/(2p+d))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"up to some logarithmic factors, with probability at least ","element":"span"},{"text":"1","element":"span"},{"style":{"height":20.54},"width":264.62,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-7.png","element":"img","alt":"−O(n−2T −2)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
The condition in Theorem ","element":"span"},{"href":"#id-27","style":{"fontStyle":"italic"},"text":"3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"thus holds for any ","element":"span"},{"style":{"height":19.6},"width":310.18,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-8.png","element":"img","alt":" b∗ < p/(2p+d).","inline":true}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"Example 2 (Fitted ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontWeight":"bold"},"text":"-iteration) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In fitted ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontStyle":"italic"},"text":"-iteration (FQI), the optimal Q-function is","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"approximated by some nonparametric models ","element":"span"},{"style":{"height":19.6},"width":164.94,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-9.png","element":"img","alt":" Q(·, ·, θ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"indexed by ","element":"span"},{"style":{"height":14},"width":22,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-10.png","element":"img","alt":" θ","inline":true},{"style":{"fontStyle":"italic"},"text":". 
The parameter ","element":"span"},{"style":{"height":14},"width":78.08,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-11.png","element":"img","alt":" θ is","inline":true}],[{"style":{"fontStyle":"italic"},"text":"iteratively updated by","element":"span"}],[{"style":{"width":"75%"},"width":1393,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":17.2},"width":660.98,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-13.png","element":"img","alt":" k = 0, 1, 2, . . . , K − 1, where Ik","inline":true},{"style":{"fontStyle":"italic"},"text":"’s are some subsets of ","element":"span"},{"style":{"height":16.47},"width":628.13,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-14.png","element":"img","alt":" I. When I1 = · · · = IK = I","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.6},"width":164.94,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-15.png","element":"img","alt":" Q(·, ·, θ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the family of neural networks, this algorithm is the neural FQI proposed by ","element":"span"},{"href":"#id-19","referenceIndex":32,"style":{"fontStyle":"italic"},"text":"Riedmiller ","element":"a"},{"href":"#id-19","referenceIndex":32,"style":{"fontStyle":"italic"},"text":"(2005)","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"href":"#id-67","referenceIndex":10,"style":{"fontStyle":"italic"},"text":"Fan et al. 
","element":"a"},{"href":"#id-67","referenceIndex":10,"style":{"fontStyle":"italic"},"text":"(2020) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"studied a variant of neural FQI by assuming ","element":"span"},{"style":{"height":16.47},"width":80.86,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-16.png","element":"img","alt":" Ik’s","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are disjoint and the training samples in ","element":"span"},{"style":{"height":20.72},"width":139.68,"height":51.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-17.png","element":"img","alt":" ∪Kk=1Ik","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are independent. Using similar arguments ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in the proof of Theorem 4.4 in ","element":"span"},{"href":"#id-67","referenceIndex":10,"style":{"fontStyle":"italic"},"text":"Fan et al. ","element":"a"},{"href":"#id-67","referenceIndex":10,"style":{"fontStyle":"italic"},"text":"(2020)","element":"a"},{"style":{"fontStyle":"italic"},"text":", we can show ","element":"span"},{"style":{"height":22.85},"width":497.55,"height":57.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-18.png","element":"img","alt":" E�x∈X�a∈A | �QI(x, a) −","inline":true},{"style":{"height":20.54},"width":276.87,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-19.png","element":"img","alt":"Qopt(x, a)|2dx","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"coverages at a rate of ","element":"span"},{"style":{"height":21.74},"width":339.15,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-20.png","element":"img","alt":" O(|I|−(2p)/(2p+d))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"up to some logarithmic factors. 
The conditions in Theorem ","element":"span"},{"href":"#id-28","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"thus hold for any ","element":"span"},{"style":{"height":19.6},"width":328.07,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-21.png","element":"img","alt":" b∗ < p/(2p + d).","inline":true}]]},{"heading":"4 Extensions to on-policy evaluation","paragraphs":[[{"text":"We now extend our methodology in Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"to on-policy settings. ","element":"span"},{"text":"The proposed CI is similar to that presented in Section ","element":"span"},{"href":"#id-23","text":"3.2.2 ","element":"a"},{"text":"and applies to any reinforcement learning algorithms that iteratively update the estimated policy based on batches of observations. Let ","element":"span"},{"style":{"height":19.67},"width":205.98,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-22.png","element":"img","alt":" {T(k)}k≥1","inline":true,"padRight":true},{"text":"be a monotonically increasing sequence that diverges to infinity. At the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th iteration, define ","element":"span"},{"text":"¯","element":"span"},{"style":{"height":26.08},"width":958.78,"height":65.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-23.png","element":"img","alt":"Ik = {(i, t) : 1 ≤ i ≤ n, 0 ≤ t < �kj=1 T(j)}","inline":true},{"text":". The data observed ","element":"span"},{"text":"so far can be summarized as ","element":"span"},{"style":{"height":26.06},"width":824.21,"height":65.15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/19-24.png","element":"img","alt":" {(Xi,t, Ai,t, Yi,t, Xi,t+1)}1≤i≤n,0≤t<�kj=1 T(j)","inline":true},{"text":". 
We compute the","element":"span"}],[{"style":{"width":"1%"},"width":19,"height":6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-0.png","element":"img"}],[{"text":"estimated policy ","element":"span"},{"style":{"height":14.6},"width":61.22,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-1.png","element":"img","alt":" �π¯Ik","inline":true,"padRight":true},{"text":"based on these data. Then we determine the behavior policy ","element":"span"},{"style":{"height":19.8},"width":158.69,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-2.png","element":"img","alt":"�b¯Ik as a","inline":true}],[{"text":"function of ","element":"span"},{"style":{"height":14.6},"width":61.22,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-3.png","element":"img","alt":" �π¯Ik","inline":true,"padRight":true},{"text":"and generate new observations","element":"span"}],[{"id":"id-68","style":{"width":"85%"},"width":1573,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-4.png","element":"img"}],[{"text":"according to ","element":"span"},{"style":{"height":19.8},"width":57.9,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-5.png","element":"img","alt":"�b¯Ik","inline":true},{"text":". 
To balance the exploration-exploitation trade-off, a common choice of ","element":"span"},{"style":{"height":19.8},"width":57.9,"height":49.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-6.png","element":"img","alt":"�b¯Ik","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"height":8.8},"width":19,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-7.png","element":"img","alt":" ϵ","inline":true},{"text":"-greedy policy with respect to ","element":"span"},{"style":{"height":14.6},"width":78.51,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-8.png","element":"img","alt":" �π¯Ik.","inline":true}],[{"text":"Let ","element":"span"},{"style":{"height":26.08},"width":1010.25,"height":65.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-9.png","element":"img","alt":" Ik+1 = {1 ≤ i ≤ n, �kj=1 T(j) ≤ t < �k+1j=1 T(j)}","inline":true},{"text":". The new observations in ","element":"span"},{"href":"#id-68","text":"(4.20) ","element":"a"},{"text":"are conditionally independent of ","element":"span"},{"style":{"height":14.6},"width":61.23,"height":36.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-10.png","element":"img","alt":" �π¯Ik","inline":true,"padRight":true},{"text":"given those in ","element":"span"},{"text":"¯","element":"span"},{"style":{"height":16.07},"width":44.04,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-11.png","element":"img","alt":"Ik","inline":true},{"text":". So the Bellman equation in ","element":"span"},{"href":"#id-38","text":"(3.7) ","element":"a"},{"text":"is valid with ","element":"span"},{"style":{"height":22.24},"width":570.56,"height":55.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-12.png","element":"img","alt":" π = �π¯Ik for any (i, t) ∈ ¯Ik+1","inline":true},{"text":". 
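As a rough illustration of the batch structure and the exploration step described above, the following Python sketch builds the time blocks implied by a sequence T(1), T(2), ... and draws an action from an epsilon-greedy behavior policy. The function names block_ranges and epsilon_greedy, and the convention that actions are coded 0, ..., m-1, are assumptions made for this example and are not taken from the SAVE implementation.

```python
import random
from itertools import accumulate


def block_ranges(T):
    # T[k-1] plays the role of T(k). Block k covers the time points t with
    # T(1) + ... + T(k-1) <= t < T(1) + ... + T(k).
    ends = list(accumulate(T))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def epsilon_greedy(greedy_action, n_actions, epsilon, rng):
    # With probability epsilon pick an action uniformly at random (exploration);
    # otherwise follow the action recommended by the estimated policy.
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return greedy_action


rng = random.Random(0)
blocks = block_ranges([10, 20, 40])
action = epsilon_greedy(greedy_action=1, n_actions=2, epsilon=0.1, rng=rng)
```

Under this sketch the greedy action is selected with probability 1 - epsilon + epsilon / m, where m is the number of actions, so every action keeps a positive probability of being explored.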
We compute ","element":"span"},{"style":{"height":21.12},"width":745.41,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-13.png","element":"img","alt":"�VIk+1(�π¯Ik; G) and �σIk+1(�π¯Ik; G) as in","inline":true,"padRight":true},{"text":"Appendix ","element":"span"},{"href":"#id-55","text":"B.1 ","element":"a"},{"text":"of the supplementary article, where the number of basis ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"+ 1) depends on both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"+ 1). We iterate this procedure for ","element":"span"},{"style":{"height":17.2},"width":352.31,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-14.png","element":"img","alt":" k = 1, 2, . . . , K −","inline":true,"padRight":true},{"text":"1. The estimated","element":"span"}],[{"text":"value and CI for ","element":"span"},{"style":{"height":20.6},"width":179.4,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-15.png","element":"img","alt":" V (�π¯Ik; G","inline":true},{"text":") are given by","element":"span"}],[{"style":{"width":"94%"},"width":1751,"height":461,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-16.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":24.3},"width":1120.43,"height":60.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-17.png","element":"img","alt":" �σ(G) = {�Kk=2�T(k)}{�Kk=2�T(k)�σ−1k (�π¯Ik−1; G)}−1","inline":true},{"text":". 
Similar to Theorem ","element":"span"},{"href":"#id-57","text":"2, ","element":"a"},{"text":"we ","element":"span"},{"text":"can show such a CI achieves nominal coverage under certain conditions. To save space, we provide our technical results in Section ","element":"span"},{"href":"#id-69","text":"C.3 ","element":"a"},{"text":"of the supplementary article.","element":"span"}]]},{"heading":"5 Simulations","paragraphs":[[{"text":"In this section, we conduct Monte Carlo simulations to examine the finite sample performance of the proposed CI. We consider off-policy settings in Sections ","element":"span"},{"href":"#id-70","text":"5.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-71","text":"5.2, ","element":"a"},{"text":"where CIs for values under both fixed and optimal policies are reported. In Section ","element":"span"},{"href":"#id-72","text":"5.3, ","element":"a"},{"text":"we report CIs computed in on-policy settings. The state vector ","element":"span"},{"style":{"height":18.47},"width":77.2,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-18.png","element":"img","alt":" X0,t","inline":true,"padRight":true},{"text":"in our settings might not have bounded supports. For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, d","element":"span"},{"text":", we define ","element":"span"},{"style":{"height":26.46},"width":731.8,"height":66.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-19.png","element":"img","alt":" X(j)∗0,t = Φ(X(j)0,t ) where X(j)0,t stands","inline":true,"padRight":true},{"text":"for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th element of ","element":"span"},{"style":{"height":19.67},"width":262.26,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/20-20.png","element":"img","alt":" X0,t and Φ(·","inline":true},{"text":") is the cumulative distribution function of a standard normal random variable. ","element":"span"},{"text":"This gives us a transformed state vector with bounded","element":"span"}],[{"style":{"width":"55%"},"width":1021,"height":267,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-0.png","element":"img"}],[{"id":"id-76","text":"Figure 1: Illustration of Cliff Walking","element":"figcaption","subtype":"caption"}],[{"text":"support. The basis functions are constructed from the tensor product of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"one-dimensional cubic B-spline sets where knots are placed at equally spaced sample quantiles of the transformed state variables. For discrete state space ","element":"span"},{"style":{"height":19.2},"width":824.02,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-1.png","element":"img","alt":" X = {x1, · · · , xM}, we set L = M and","inline":true,"padRight":true},{"text":"Φ","element":"span"},{"style":{"height":19.6},"width":722.86,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-2.png","element":"img","alt":"M(·) = {I(· = x1), · · · , I(· = xM)}⊤","inline":true},{"text":". 
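A minimal Python sketch of the two constructions just described is given below: the coordinate-wise normal-CDF transform that maps an unbounded state into the unit cube, and the indicator basis used for a discrete state space. The helper names normal_cdf, transform_state and indicator_basis are hypothetical and chosen only for this illustration; the B-spline tensor-product basis itself is not reproduced here.

```python
import math


def normal_cdf(x):
    # Standard normal CDF written in terms of the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def transform_state(x):
    # Apply the CDF to each coordinate so the transformed state lies in [0, 1]^d.
    return [normal_cdf(v) for v in x]


def indicator_basis(x, states):
    # One indicator per element of the discrete state space, so L = M.
    return [1.0 if x == s else 0.0 for s in states]


z = transform_state([-1.5, 0.0, 2.0])
phi = indicator_basis(3, [1, 2, 3, 4])
```

With the indicator basis, the linear sieve reduces to a tabular representation of the Q-function, one coefficient per state for each action.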
We set the discount factor ","element":"span"},{"style":{"height":17.2},"width":133.66,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-3.png","element":"img","alt":" γ = 0.","inline":true},{"text":"5 in all settings, and set ","element":"span"},{"style":{"height":19.6},"width":505.53,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-4.png","element":"img","alt":" L = ⌊(nT)η⌋ with η = 3/","inline":true},{"text":"7. Here, for any ","element":"span"},{"style":{"height":19.2},"width":208.92,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-5.png","element":"img","alt":" z ∈ R, ⌊z⌋","inline":true,"padRight":true},{"text":"denotes the largest integer that is less than or equal to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":". We tried several other values of the parameter ","element":"span"},{"style":{"height":13.2},"width":24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-6.png","element":"img","alt":" η","inline":true},{"text":", and the resulting CIs are very similar and not sensitive to the choice of ","element":"span"},{"style":{"height":13.2},"width":24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-7.png","element":"img","alt":" η","inline":true},{"text":". We also tried several other values of ","element":"span"},{"style":{"height":13.2},"width":26,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-8.png","element":"img","alt":" γ","inline":true},{"text":". Overall, the proposed CI achieves nominal coverage and performs better than other baseline methods. 
More details can be found in Appendix ","element":"span"},{"href":"#id-73","text":"D.2 ","element":"a"},{"text":"of the supplementary article.","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Off-policy evaluation with a fixed target policy","element":"span"}],[{"text":"We consider three scenarios. In Scenarios (A) and (B), the system dynamics are given by","element":"span"}],[{"id":"id-87","style":{"width":"77%"},"width":1436,"height":247,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-9.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":24.91},"width":1182.87,"height":62.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-10.png","element":"img","alt":" t ≥ 0, where {zt}t≥0 iid∼ N(02, I2/4) and X0,0 ∼ N(02, I2","inline":true},{"text":"). In Scenario (A), we consider a completely randomized study and set ","element":"span"},{"style":{"height":19.67},"width":179.07,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-11.png","element":"img","alt":" {A0,t}t≥0","inline":true,"padRight":true},{"text":"to be i.i.d. Bernoulli random variables with expectation 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5. ","element":"span"},{"text":"In Scenario (B), we allow the treatment assignment mechanism to depend on the observed state. 
","element":"span"},{"text":"Specifically, we set ","element":"span"},{"style":{"height":19.67},"width":689.05,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-12.png","element":"img","alt":" m = 2 and Pr(A0,t = 1|X0,t) =","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"height":19.67},"width":1015.58,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-13.png","element":"img","alt":".5sigmoid(X0,t,1) + 0.5sigmoid(X0,t,2) where X0,t,i","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th element in ","element":"span"},{"style":{"height":18.87},"width":207.31,"height":47.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-14.png","element":"img","alt":" X0,t. The","inline":true}],[{"text":"target policy we consider is designed as follows,","element":"span"}],[{"style":{"width":"40%"},"width":743,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.67},"width":38.61,"height":29.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-16.png","element":"img","alt":" xi","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"th element of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
The reference distribution ","element":"span"},{"style":{"height":19.6},"width":425.03,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/21-17.png","element":"img","alt":" G is set to N(02, I2).","inline":true}],[{"text":"In Scenario (C), we consider a standard RL setting included in OpenAI Gym ","element":"span"},{"href":"#id-74","referenceIndex":3,"text":"(Brockman ","element":"a"},{"href":"#id-74","referenceIndex":3,"text":"et al., ","element":"a"},{"href":"#id-74","referenceIndex":3,"text":"2016)","element":"a"},{"text":": Cliff Walking. This RL example is detailed in Example 6.6 in ","element":"span"},{"href":"#id-75","referenceIndex":42,"text":"Sutton and ","element":"a"},{"href":"#id-75","referenceIndex":42,"text":"Barto ","element":"a"},{"href":"#id-75","referenceIndex":42,"text":"(2018)","element":"a"},{"text":". The objective is to identify the optimal path from the starting point S to the destination point G without falling off the cliff (see Figure ","element":"span"},{"href":"#id-76","text":"1)","element":"a"},{"text":". This scenario corresponds to an episodic task where the agent is sent instantly back to the starting point whenever it steps into the cliff or arrives at the destination. We manually add some noise to the immediate rewards simulated by the OpenAI Gym to ensure that the system dynamics are not deterministic. We remark that this task is considered in ","element":"span"},{"href":"#id-77","referenceIndex":18,"text":"Kallus and Uehara ","element":"a"},{"href":"#id-77","referenceIndex":18,"text":"(2020) ","element":"a"},{"text":"as well. 
The target policy is the optimal policy and the behavior policy is a 50-50 mixture of the optimal and uniform random policies.","element":"span"}],[{"text":"The true value function ","element":"span"},{"style":{"height":19.6},"width":142.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/22-0.png","element":"img","alt":" V (π, G","inline":true},{"text":") is computed by Monte Carlo approximations. Specifically, we simulate ","element":"span"},{"style":{"height":15.74},"width":172.76,"height":39.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/22-1.png","element":"img","alt":" N = 105 ","inline":true,"padRight":true},{"text":"independent trajectories with initial state variable distributed according to ","element":"span"},{"text":"G","element":"span"},{"text":". The action at each decision point is chosen according to ","element":"span"},{"style":{"height":14},"width":259.55,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/22-2.png","element":"img","alt":" π. Then we","inline":true,"padRight":true},{"text":"approximate ","element":"span"},{"style":{"height":26.08},"width":857.73,"height":65.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/22-3.png","element":"img","alt":" V (π, G) by �Nj=1�Ti−1t=0 γtYj,t/N where Ti","inline":true,"padRight":true},{"text":"is set to 500 in Scenarios (A), (B) ","element":"span"},{"text":"and the termination time of each episode in Scenario (C). The integrals in ","element":"span"},{"href":"#id-78","text":"(3.10) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-78","text":"(3.11) ","element":"a"},{"text":"are computed via Monte Carlo methods. 
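The Monte Carlo approximation of the true value described above can be sketched in Python as follows. The function mc_value averages truncated discounted returns over independent trajectories whose initial states are drawn from the reference distribution. The toy system (toy_init, toy_step) and the policy always_treat are hypothetical and serve only to make the example runnable; they are not the dynamics of Scenarios (A) to (C), and the sample sizes are far smaller than the N = 10^5 trajectories of length 500 used in the paper.

```python
import random


def mc_value(policy, step, init_state, gamma, horizon, n_traj, seed=0):
    # Average over n_traj independent trajectories of sum_{t < horizon} gamma^t * Y_t.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        x = init_state(rng)              # initial state drawn from G
        ret, disc = 0.0, 1.0
        for _ in range(horizon):         # truncate the infinite sum at the horizon
            a = policy(x, rng)
            x, y = step(x, a, rng)
            ret += disc * y
            disc *= gamma
        total += ret
    return total / n_traj


# Hypothetical toy system, NOT the dynamics used in the paper.
def toy_init(rng):
    return rng.gauss(0.0, 1.0)


def toy_step(x, a, rng):
    y = a + 0.1 * rng.gauss(0.0, 1.0)             # noisy immediate reward
    x_next = 0.5 * x + 0.5 * rng.gauss(0.0, 1.0)  # autoregressive state update
    return x_next, y


def always_treat(x, rng):
    return 1


estimate = mc_value(always_treat, toy_step, toy_init,
                    gamma=0.5, horizon=60, n_traj=2000)
```

In this toy system the expected reward is 1 at every step, so the estimate should be close to 1 / (1 - 0.5) = 2; the truncation error is of order 0.5^60 and is negligible next to the Monte Carlo error.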
For Scenarios (A) and (B), we further consider 9 cases by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 25","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"50","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"100 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 30","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"50","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"70. For Scenario (C), we consider 3 cases by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 500","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1000","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1500. Each trajectory has 13 time points on average under the behavior policy.","element":"span"}],[{"text":"The DRL estimator has been shown to be much more efficient than AIPW or IPW estimators ","element":"span"},{"href":"#id-14","referenceIndex":44,"text":"(Thomas et al., ","element":"a"},{"href":"#id-14","referenceIndex":44,"text":"2015; ","element":"a"},{"href":"#id-15","referenceIndex":15,"text":"Jiang and Li, ","element":"a"},{"href":"#id-15","referenceIndex":15,"text":"2016)","element":"a"},{"text":". ","element":"span"},{"text":"So we focus on comparing our approach with DRL. DRL requires the calculation of the Q-function, the marginalized density ratio and the behavior policy. 
Here, we treat the behavior policy as known and estimate the Q-function and the density ratio based on nonparametric sieve regression.","element":"span"}],[{"id":"id-80","text":"Table 1: Empirical coverage probabilities (ECP) and average lengths (AL) of CIs constructed ","element":"figcaption","subtype":"caption"},{"text":"by the proposed method and DRL as well as the mean-squared errors (MSE) of the corresponding value estimators under Scenario (C).","element":"figcaption","subtype":"caption"}],[{"style":{"width":"64%"},"width":1197,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/22-4.png","element":"img"}],[{"id":"id-79","text":"Figure 2: Empirical coverage probabilities (ECP) and average lengths (AL) of CIs constructed ","element":"figcaption","subtype":"caption"},{"text":"by the proposed method (colored in blue) and the DRL method (colored in red) as well as the mean-squared errors (MSE) of the corresponding value estimators, with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". 
Settings correspond to Scenario (A) and Scenario (B), from top plots to bottom plots.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"78%"},"width":1453,"height":875,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/23-0.png","element":"img"}],[{"text":"In Figure ","element":"span"},{"href":"#id-79","text":"2 ","element":"a"},{"text":"and Table ","element":"span"},{"href":"#id-80","text":"1, ","element":"a"},{"text":"we report the empirical coverage probabilities (ECPs) and average lengths (ALs) of CIs constructed by the proposed method, with different choices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". It can be seen that our CI achieves nominal coverage in all cases. Its length decreases as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nT ","element":"span"},{"text":"increases. This is consistent with our theoretical findings where we show the proposed value estimator converges at a rate of ","element":"span"},{"style":{"height":16.94},"width":218.27,"height":42.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/23-1.png","element":"img","alt":" n−1/2T −1/2 ","inline":true,"padRight":true},{"text":"under certain conditions (see the discussions below Theorem ","element":"span"},{"href":"#id-31","text":"1)","element":"a"},{"text":".","element":"span"}],[{"text":"Comparing our method with DRL, it is clear that our CIs are in general narrower than those constructed by DRL. In addition, MSEs of the proposed value estimates are smaller than those based on DRL. This is consistent with our theoretical analysis in Appendix ","element":"span"},{"href":"#id-81","text":"C.2 ","element":"a"},{"text":"where we show the variance of our value estimator is strictly smaller than that based on DRL under certain conditions. 
In addition, it can be seen from Table ","element":"span"},{"href":"#id-80","text":"1 ","element":"a"},{"text":"that ECPs of DRL are below 90% in Scenario (C).","element":"span"}],[{"text":"In Appendix ","element":"span"},{"href":"#id-82","text":"D.3, ","element":"a"},{"text":"we conduct some additional simulation studies under Scenario (A) by setting the reference distribution ","element":"span"},{"text":"G ","element":"span"},{"text":"to a Dirac measure. The proposed CI achieves nominal","element":"span"}],[{"text":"coverage under these settings as well.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Off-policy evaluation with an (estimated) optimal policy","element":"span"}],[{"text":"In this section, we focus on constructing the CI for the value under an optimal policy. Specifically, we use a version of fitted Q-iteration (double FQI) to compute the estimated optimal policy. The detailed algorithm can be found in Section ","element":"span"},{"href":"#id-83","text":"B.3 ","element":"a"},{"text":"of the supplementary article. To implement the proposed CI in Section ","element":"span"},{"href":"#id-84","text":"3.2, ","element":"a"},{"text":"we set ","element":"span"},{"style":{"height":19.6},"width":799.39,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-0.png","element":"img","alt":" Kn = 2, KT = 2 in Scenarios (A), (B)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.6},"width":1752.86,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-1.png","element":"img","alt":" Kn = 3, KT = 1 in Scenario (C). 
To evaluate our CI, we generate a very large dataset","inline":true,"padRight":true},{"text":"to compute an estimated optimal policy ","element":"span"},{"style":{"height":13.34},"width":45.28,"height":33.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-2.png","element":"img","alt":" �π∗ ","inline":true,"padRight":true},{"text":"based on double FQI and use the Monte Carlo methods described in Section ","element":"span"},{"href":"#id-70","text":"5.1 ","element":"a"},{"text":"to evaluate its value ","element":"span"},{"style":{"height":19.6},"width":161.1,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-3.png","element":"img","alt":" V (�π∗; G","inline":true},{"text":"). Then we treat ","element":"span"},{"style":{"height":19.6},"width":235.34,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-4.png","element":"img","alt":" V (�π∗; G) as","inline":true,"padRight":true},{"text":"the true optimal value ","element":"span"},{"style":{"height":19.6},"width":161.1,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-5.png","element":"img","alt":" V (π∗; G","inline":true},{"text":"). We consider the same three scenarios detailed in Section ","element":"span"},{"href":"#id-70","text":"5.1. 
","element":"a"},{"text":"For Scenarios (A) and (B), we fix ","element":"span"},{"style":{"height":17.2},"width":126.29,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-6.png","element":"img","alt":" γ = 0.","inline":true},{"text":"5 and consider 6 cases by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 100","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"200 and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 60","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"100","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"140. For Scenario (C), we consider 4 cases by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 3000","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"4500 and ","element":"span"},{"style":{"height":17.2},"width":207.1,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-7.png","element":"img","alt":"γ = 0.5, 0.","inline":true},{"text":"7. The ECP, AL and MSE of our CI are reported in the top two panels of Figure ","element":"span"},{"href":"#id-85","text":"3 ","element":"a"},{"text":"and in Table ","element":"span"},{"href":"#id-86","text":"2. ","element":"a"},{"text":"It can be seen that these ECPs are close to the nominal level in most cases. ALs and MSEs decay as either ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"increases.","element":"span"}],[{"text":"In addition, we design a non-regular setting, Scenario (D), where the actions have no effect on the transition dynamics or the immediate rewards. 
Specifically, for any ","element":"span"},{"style":{"height":16.4},"width":125.95,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-8.png","element":"img","alt":" t ≥ 0,","inline":true}],[{"text":"we set","element":"span"}],[{"style":{"width":"30%"},"width":556,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-9.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"19%"},"width":351,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":24.91},"width":720.98,"height":62.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-11.png","element":"img","alt":" {zt}t≥0iid∼ N(02, I2/4) and {A0,t}t≥0","inline":true,"padRight":true},{"text":"are i.i.d. Bernoulli random variables with expectation 0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5. Under this setup, every policy achieves the same value function. 
As a result,","element":"span"}],[{"id":"id-86","text":"Table 2: Empirical coverage probabilities (ECP) and average lengths (AL) of CIs constructed ","element":"figcaption","subtype":"caption"},{"text":"by the proposed method as well as the mean-squared errors (MSE) of the corresponding value estimators under Scenario (C), with different combinations of ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":174.5,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-12.png","element":"img","alt":" n and γ.","inline":true}],[{"style":{"width":"63%"},"width":1164,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/24-13.png","element":"img"}],[{"text":"Figure 3: Empirical coverage probabilities and average lengths of CIs constructed by the proposed method as well as the mean squared errors of the value estimates, with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"id":"id-85","text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". Settings correspond to Scenarios (A), (B) and (D), from top panels to bottom panels.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"91%"},"width":1683,"height":1016,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/25-0.png","element":"img"}],[{"text":"the optimal policy is not unique. We consider the same reference distribution ","element":"span"},{"text":"G","element":"span"},{"text":", and the same combinations of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"as in the regular setting. 
ECPs and ALs of the proposed CIs are plotted in the bottom panels of Figure ","element":"span"},{"href":"#id-85","text":"3. ","element":"a"},{"text":"It can be seen that our CIs achieve nominal coverage in the non-regular setting as well.","element":"span"}],[{"text":"In Appendix ","element":"span"},{"href":"#id-82","text":"D.3, ","element":"a"},{"text":"we conduct some additional simulation studies under Scenario (A) by setting the reference distribution ","element":"span"},{"text":"G ","element":"span"},{"text":"to a Dirac measure. Findings are very similar to those in cases where ","element":"span"},{"style":{"height":19.6},"width":295,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/25-1.png","element":"img","alt":" G = N(02, I2).","inline":true}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"5.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"On-policy evaluation with an (estimated) optimal policy","element":"span"}],[{"text":"We consider a setting where the transition dynamics and immediate rewards are defined by ","element":"span"},{"href":"#id-87","text":"(5.21)","element":"a"},{"text":". 
In the first block of data ","element":"span"},{"style":{"height":20.58},"width":732.25,"height":51.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/25-2.png","element":"img","alt":" {(Xi,t, Ai,t, Yi,t, Xi,t+1)}0≤t ","element":"span"},{"text":"0.","element":"span"}],[{"text":"(ii) The Markov chain ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-3.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"is geometrically ergodic.","element":"span"}],[{"text":"We remark that Condition A3*(ii) is the same as A3(ii).","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"A.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"More on the margin condition","element":"span"}],[{"text":"To better understand Condition A5, we consider a simple scenario where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". Define","element":"span"}],[{"style":{"height":20.14},"width":544.82,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-4.png","element":"img","alt":"τ(x) = Qopt(x, 1) − Qopt(x,","inline":true,"padRight":true},{"text":"0). 
It follows that","element":"span"}],[{"style":{"width":"79%"},"width":1460,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-5.png","element":"img"}],[{"text":"As a result, ","element":"span"},{"href":"#id-63","text":"(3.18) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-63","text":"(3.19) ","element":"a"},{"text":"are equivalent to the following:","element":"span"}],[{"id":"id-106","style":{"width":"71%"},"width":1324,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-6.png","element":"img"}],[{"text":"Clearly, these two conditions hold when inf","element":"span"},{"style":{"height":19.6},"width":241.96,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-7.png","element":"img","alt":"x∈X |τ(x)| >","inline":true,"padRight":true},{"text":"0. They are satisfied in many","element":"span"}],[{"text":"other cases. For example, let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 1. Consider","element":"span"}],[{"style":{"width":"29%"},"width":546,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-8.png","element":"img"}],[{"text":"for some ","element":"span"},{"style":{"height":10.8},"width":80.38,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-9.png","element":"img","alt":" α >","inline":true,"padRight":true},{"text":"0. Then, with some calculations, we can show","element":"span"}],[{"style":{"width":"58%"},"width":1072,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-10.png","element":"img"}],[{"text":"This verifies ","element":"span"},{"href":"#id-106","text":"(A.23)","element":"a"},{"text":". 
When ","element":"span"},{"text":"G ","element":"span"},{"text":"has a bounded density function on ","element":"span"},{"text":"X","element":"span"},{"text":", ","element":"span"},{"href":"#id-106","text":"(A.24) ","element":"a"},{"text":"is reduced to ","element":"span"},{"href":"#id-106","text":"(A.23)","element":"a"},{"text":". If ","element":"span"},{"style":{"height":19.6},"width":68.4,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-11.png","element":"img","alt":" G(·","inline":true},{"text":") equals the Dirac measure ","element":"span"},{"style":{"height":19.6},"width":73.01,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-12.png","element":"img","alt":" δx(·","inline":true},{"text":"), then ","element":"span"},{"href":"#id-106","text":"(A.24) ","element":"a"},{"text":"automatically holds for any","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"α > ","element":"span"},{"text":"0.","element":"span"}]]},{"heading":"B Additional details regarding the method","paragraphs":[[{"id":"id-55","style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"More on the CI in ","element":"span"},{"href":"#id-58","text":"(3.17)","element":"a"}],[{"style":{"width":"0%"},"width":15,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-13.png","element":"img"}],[{"text":"We begin by providing more details on the estimators ","element":"span"},{"style":{"height":21.12},"width":244.92,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-14.png","element":"img","alt":"�VIk+1(�π¯Ik; G","inline":true},{"text":") and its standard error ","element":"span"},{"style":{"height":21.12},"width":244.24,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-15.png","element":"img","alt":"�σIk+1(�π¯Ik; G","inline":true},{"text":"). 
In general, for a given ","element":"span"},{"style":{"height":16.07},"width":150.86,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-16.png","element":"img","alt":" I ⊆ I0","inline":true,"padRight":true},{"text":"and any policy ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-17.png","element":"img","alt":" π","inline":true},{"text":", we define ","element":"span"},{"style":{"height":19.6},"width":268.3,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/37-18.png","element":"img","alt":"�VI(π; G) and","inline":true}],[{"style":{"height":19.6},"width":230.34,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-0.png","element":"img","alt":"�σI(π; G) as","inline":true}],[{"style":{"width":"81%"},"width":1497,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-1.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"99%"},"width":1837,"height":281,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-2.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|I| ","element":"span"},{"text":"stands for the number of elements in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":".","element":"span"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Value difference between the target and behavior policy","element":"span"}],[{"text":"In this section, we outline a method to evaluate the value difference function between the target and behavior policy. We first consider the scenario where the target policy is a fixed policy. We next consider the scenario where the target policy is an estimated optimal policy. 
To simplify the presentation, we assume ","element":"span"},{"style":{"height":16.07},"width":375.11,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-3.png","element":"img","alt":" T1 = · · · = Tn = T","inline":true},{"text":". The proposed method can be similarly extended to on-policy settings.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inference of the value difference under a fixed policy","element":"span"}],[{"text":"Consider a data-independent policy ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-4.png","element":"img","alt":" π","inline":true},{"text":". We aim to evaluate the value difference function VD(","element":"span"},{"style":{"height":19.6},"width":683.96,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-5.png","element":"img","alt":"π; x) = V (π; x) − V (b; x) where b","inline":true,"padRight":true},{"text":"is the unknown behavior policy. 
We first apply our method in Section ","element":"span"},{"href":"#id-107","text":"3.1.2 ","element":"a"},{"text":"to compute an estimator value function ","element":"span"},{"style":{"height":19.6},"width":400.12,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-6.png","element":"img","alt":"�V (π; x) for V (π; x).","inline":true}],[{"text":"To estimate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"), we observe that the Q-function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") satisfies the Bellman equation, ","element":"span"},{"style":{"height":19.67},"width":1634.04,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-7.png","element":"img","alt":" E[{Yi,t + γQ(b; Xi,t+1, Ai,t+1) − Q(b; Xi,t, Ai,t)}|Xi,t, Ai,t] = 0. We approximate","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") based on linear sieves Φ","element":"span"},{"style":{"height":21.77},"width":159.71,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-8.png","element":"img","alt":"⊤L(x)β∗b,a","inline":true},{"text":". 
Similar to Section ","element":"span"},{"href":"#id-107","text":"3.1.2, ","element":"a"},{"style":{"height":21.77},"width":434.28,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-9.png","element":"img","alt":" βb = (β∗⊤b,1, · · · , β∗⊤b,m)⊤","inline":true}],[{"text":"can be estimated by","element":"span"}],[{"style":{"width":"91%"},"width":1682,"height":219,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/38-10.png","element":"img"}],[{"style":{"width":"1%"},"width":19,"height":6,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-0.png","element":"img"}],[{"text":"The resulting estimates for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") can be derived as Φ","element":"span"},{"style":{"height":19.67},"width":174.68,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-1.png","element":"img","alt":"⊤L(x)�βb,a.","inline":true,"padRight":true},{"text":"The corresponding estimator for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b, x","element":"span"},{"text":") is given by ","element":"span"},{"style":{"height":19.98},"width":873.63,"height":49.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-2.png","element":"img","alt":"�V (b, x) = �a�b(a|x)Φ⊤L(x)�βb,a where �b(a|x","inline":true},{"text":") denotes the","element":"span"}],[{"text":"sieve estimator Φ","element":"span"},{"style":{"height":19.6},"width":484.52,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-3.png","element":"img","alt":"⊤L(x)�αa for b(a|x) 
where","inline":true}],[{"style":{"width":"83%"},"width":1530,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-4.png","element":"img"}],[{"text":"This yields the estimator for the value difference ","element":"span"},{"style":{"height":19.6},"width":609.18,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-5.png","element":"img","alt":"�VD(π; x) = �V (π; x) − �V (b; x).","inline":true}],[{"text":"We next derive a confidence interval for VD(","element":"span"},{"style":{"height":19.6},"width":480.07,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-6.png","element":"img","alt":"π; x) based on �VD(π; x","inline":true},{"text":"). Similar to the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"we can show","element":"span"},{"style":{"height":22.2},"width":574.39,"height":55.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-7.png","element":"img","alt":"√nT{�VD(π; x) − VD(π; x)}","inline":true,"padRight":true},{"text":"is equivalent to","element":"span"}],[{"id":"id-108","style":{"width":"109%"},"width":2012,"height":851,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-8.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"B.2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Inference of the value difference under an estimated optimal policy","element":"span"}],[{"text":"We begin by dividing the data into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"non-overlapping subsets ","element":"span"},{"style":{"height":20.72},"width":139.68,"height":51.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-9.png","element":"img","alt":" ∪Kk=1Ik","inline":true},{"text":". 
Similar to Section","element":"span"}],[{"href":"#id-23","text":"3.2.2, ","element":"a"},{"text":"we construct the value difference estimator by","element":"span"}],[{"style":{"width":"99%"},"width":1842,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-10.png","element":"img"}],[{"text":"corresponding confidence interval is given by","element":"span"}],[{"style":{"width":"89%"},"width":1649,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/39-11.png","element":"img"}],[{"style":{"width":"99%"},"width":1842,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-0.png","element":"img"}],[{"text":"case where the behavior policy is equal to a deterministic optimal policy. To elaborate,","element":"span"}],[{"text":"notice that when the behavior policy is deterministic, the second line of ","element":"span"},{"href":"#id-108","text":"(B.25) ","element":"a"},{"text":"equals zero. In addition, when ","element":"span"},{"style":{"height":15.34},"width":263.12,"height":38.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-1.png","element":"img","alt":" π = b = πopt ","inline":true,"padRight":true},{"text":"for some optimal policy ","element":"span"},{"style":{"height":15.34},"width":73.72,"height":38.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-2.png","element":"img","alt":" πopt","inline":true},{"text":", the first line equals zero as well. In that case, ","element":"span"},{"style":{"height":19.6},"width":165.32,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-3.png","element":"img","alt":"�VD(π; x","inline":true},{"text":") would have a degenerate distribution. 
Suppose the estimated optimal policy is consistent for ","element":"span"},{"style":{"height":20.14},"width":373.47,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-4.png","element":"img","alt":" πopt. Then �VD(x","inline":true},{"text":") might not have a tractable limiting distribution, leading to an invalid confidence interval.","element":"span"}],[{"text":"To address this concern, we could redefine the inverse weights ","element":"span"},{"style":{"height":23.08},"width":668.15,"height":57.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-5.png","element":"img","alt":" �σ∗Ik+1(�π¯Ik; x) by �σ∗Ik+1(�π¯Ik; x, δ) =","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":23.08},"width":626.84,"height":57.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-6.png","element":"img","alt":"{�σ∗Ik+1(�π¯Ik; x), δ} for some δ >","inline":true,"padRight":true},{"text":"0, as in ","element":"span"},{"href":"#id-109","referenceIndex":22,"text":"Luedtke and van der Laan ","element":"a"},{"href":"#id-109","referenceIndex":22,"text":"(2017)","element":"a"},{"text":". This ","element":"span"},{"text":"guarantees that these inverse weights are strictly greater than zero. ","element":"span"},{"text":"A similar approach is employed by ","element":"span"},{"href":"#id-110","referenceIndex":38,"text":"Shi et al. ","element":"a"},{"href":"#id-110","referenceIndex":38,"text":"(2020b) ","element":"a"},{"text":"for testing the overall qualitative treatment effects in single-stage decision making. 
In addition, one could allow ","element":"span"},{"style":{"height":14},"width":22,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-7.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"to depend on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". The resulting confidence interval would be valid as long as ","element":"span"},{"style":{"height":21.74},"width":296.14,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-8.png","element":"img","alt":" δ ≫ (NT)−1/6","inline":true},{"text":"(see e.g., Theorem 3.1 of ","element":"span"},{"href":"#id-110","referenceIndex":38,"text":"Shi ","element":"a"},{"href":"#id-110","referenceIndex":38,"text":"et al., ","element":"a"},{"href":"#id-110","referenceIndex":38,"text":"2020b)","element":"a"},{"text":". However, a potential limitation is that it would yield a conservative confidence interval when the truncation is active, as discussed in ","element":"span"},{"href":"#id-109","referenceIndex":22,"text":"Luedtke and van der Laan ","element":"a"},{"href":"#id-109","referenceIndex":22,"text":"(2017)","element":"a"},{"text":".","element":"span"}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"B.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Double fitted ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontWeight":"bold"},"text":"-iteration","element":"span"}],[{"text":"In this section, we introduce our algorithm for computing the estimated optimal policy in our numerical studies. The proposed algorithm is based on FQI that recursively updates the estimated optimal Q-function by some supervised learning method (see Example 2 in Section ","element":"span"},{"href":"#id-111","text":"3.2.3)","element":"a"},{"text":". 
In FQI, at each iteration, the maximum of the estimated Q-function is used as an estimate of the maximum of the true Q-function. This can lead to a significant positive bias ","element":"span"},{"href":"#id-75","referenceIndex":42,"text":"(Sutton and Barto, ","element":"a"},{"href":"#id-75","referenceIndex":42,"text":"2018)","element":"a"},{"text":". ","element":"span"},{"href":"#id-112","referenceIndex":11,"text":"Hasselt ","element":"a"},{"href":"#id-112","referenceIndex":11,"text":"(2010) ","element":"a"},{"text":"proposed a double Q-learning method to reduce the maximization bias. Here, we apply similar ideas to FQI to compute the estimated optimal policy. We summarize our algorithm in pseudocode below.","element":"span"}],[{"text":"In Algorithm ","element":"span"},{"href":"#id-113","text":"1, ","element":"a"},{"text":"we can apply any nonparametric model ","element":"span"},{"style":{"height":19.6},"width":136.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-9.png","element":"img","alt":" Q(·, ·; θ","inline":true},{"text":") indexed by ","element":"span"},{"style":{"height":14},"width":212.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-10.png","element":"img","alt":" θ to model","inline":true,"padRight":true},{"text":"the optimal Q-function. 
In our implementation, we set ","element":"span"},{"style":{"height":19.6},"width":136.82,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/40-11.png","element":"img","alt":" Q(·, ·; ·","inline":true},{"text":") to be a linear combination of tensor product B-spline basis functions.","element":"span"}],[{"id":"id-113","style":{"width":"100%"},"width":1844,"height":559,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-0.png","element":"img"}]]},{"heading":"C Additional technical details","paragraphs":[[{"id":"id-43","style":{"fontWeight":"bold"},"text":"C.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional details regarding Condition A3","element":"span"}],[{"text":"When ","element":"span"},{"style":{"height":13.2},"width":145.59,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-1.png","element":"img","alt":" ν0 = µ","inline":true},{"text":", the density function of ","element":"span"},{"style":{"height":18.87},"width":277.14,"height":47.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-2.png","element":"img","alt":" X0,1 equals µ","inline":true,"padRight":true},{"text":"as well. By Jensen’s inequality, we","element":"span"}],[{"text":"have for any ","element":"span"},{"style":{"height":16.54},"width":278.12,"height":41.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-3.png","element":"img","alt":" v ∈ RmL that","inline":true}],[{"style":{"width":"80%"},"width":1479,"height":611,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-4.png","element":"img"}],[{"text":"is positive semidefinite. 
It follows that","element":"span"}],[{"style":{"width":"74%"},"width":1367,"height":306,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-5.png","element":"img"}],[{"text":"When ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is a deterministic policy, ","element":"span"},{"style":{"height":21.72},"width":878.47,"height":54.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-7.png","element":"img","alt":"�a∈A ξ(x, a)ξ⊤(x, a)b(a|x)−γ2Uπ(x)U ⊤π (x","inline":true},{"text":") is a block","element":"span"}],[{"text":"diagonal matrix. To show A3(i) holds, it suffices to show","element":"span"}],[{"style":{"width":"66%"},"width":1229,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/41-8.png","element":"img"}],[{"text":"Suppose ","element":"span"},{"style":{"height":14},"width":185.48,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-0.png","element":"img","alt":" b is the ϵ","inline":true},{"text":"-greedy policy with respect to ","element":"span"},{"style":{"height":20.54},"width":840.32,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-1.png","element":"img","alt":" π, i.e, b(a|x) = ϵm−1 + (1 − ϵ)π(a|x), for","inline":true}],[{"text":"any ","element":"span"},{"style":{"height":20.14},"width":1014.17,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-2.png","element":"img","alt":" a ∈ {1, . . . , m} and ϵ satisfies ϵ ≤ 1 − γ2, we have","inline":true}],[{"style":{"width":"98%"},"width":1822,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-3.png","element":"img"}],[{"text":"Suppose A2 holds. 
It suffices to require","element":"span"}],[{"id":"id-114","style":{"width":"69%"},"width":1287,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-4.png","element":"img"}],[{"text":"The condition in ","element":"span"},{"href":"#id-114","text":"(C.26) ","element":"a"},{"text":"is automatically satisfied (see, e.g., ","element":"span"},{"href":"#id-115","referenceIndex":4,"text":"Burman and Chen, ","element":"a"},{"href":"#id-115","referenceIndex":4,"text":"1989; ","element":"a"},{"href":"#id-42","referenceIndex":5,"text":"Chen ","element":"a"},{"href":"#id-42","referenceIndex":5,"text":"and Christensen, ","element":"a"},{"href":"#id-42","referenceIndex":5,"text":"2015)","element":"a"},{"text":".","element":"span"}],[{"id":"id-81","style":{"fontWeight":"bold"},"text":"C.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional details on the variance comparison","element":"span"}],[{"text":"We consider a randomized study where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") is a constant function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
In addition, we assume the target policy is nondynamic, i.e., ","element":"span"},{"style":{"height":19.6},"width":911.37,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-5.png","element":"img","alt":" π(a∗|x) = 1 for some 1 ≤ a∗ ≤ m and any x.","inline":true,"padRight":true},{"text":"We impose the following conditions.","element":"span"}],[{"text":"(C1) The process ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-6.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"is stationary.","element":"span"}],[{"text":"(C2) The temporal difference error ","element":"span"},{"style":{"height":14.07},"width":60.26,"height":35.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-7.png","element":"img","alt":" ε0,t","inline":true,"padRight":true},{"text":"is independent of (","element":"span"},{"style":{"height":19.67},"width":207.3,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-8.png","element":"img","alt":"X0,t, A0,t).","inline":true}],[{"style":{"width":"73%"},"width":1363,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-9.png","element":"img"}],[{"text":"We make some remarks. First, Condition (C1) is imposed to simplify the presentation. The same results hold as long as ","element":"span"},{"style":{"height":19.67},"width":139.25,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-10.png","element":"img","alt":" {X0,t}t","inline":true,"padRight":true},{"text":"converges to its stationary distribution. Second, the variances of our estimator and DRL are very difficult to analyse in general. Conditions (C2)-(C3) are imposed to simplify the calculation. 
Even when these conditions are violated, we expect the variance of the proposed estimator will be smaller in general, as reflected in our numerical study.","element":"span"}],[{"id":"id-117","style":{"fontWeight":"bold"},"text":"Theorem 5 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume (C1)-(C3) hold. Then the asymptotic variance of the DRL estimator is at least ","element":"span"},{"style":{"height":20.61},"width":614.07,"height":51.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-11.png","element":"img","alt":" h−1(ΦL){1 + Var(w(π, X0,t))}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"times larger than the proposed estimator where ","element":"span"},{"style":{"height":22.05},"width":1431.6,"height":55.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-12.png","element":"img","alt":"h(ΦL) = �x ΦL(x)G(dx){EΦL(X0,t)Φ⊤L(X0,t)}−1 �x ΦL(x)G(dx) and w","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the marginalized ","element":"span"},{"style":{"fontStyle":"italic"},"text":"density ratio ","element":"span"},{"href":"#id-16","referenceIndex":17,"style":{"fontStyle":"italic"},"text":"(Kallus and Uehara, ","element":"a"},{"href":"#id-16","referenceIndex":17,"style":{"fontStyle":"italic"},"text":"2019)","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We next investigate the value of the factor ","element":"span"},{"style":{"height":19.6},"width":348.2,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-13.png","element":"img","alt":" h(ΦL). 
We set G","inline":true,"padRight":true},{"text":"and the stationary distribution of ","element":"span"},{"style":{"height":18.47},"width":229.97,"height":46.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/42-14.png","element":"img","alt":" X0,t, i.e., µ","inline":true,"padRight":true},{"text":"to a uniform distribution on [0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]. We consider a polynomial basis","element":"span"}],[{"id":"id-116","style":{"width":"54%"},"width":996,"height":671,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-0.png","element":"img"}],[{"text":"Figure 6: ","element":"figcaption","subtype":"caption"},{"style":{"height":19.67},"width":128.32,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-1.png","element":"img","alt":" h(Φ0,L","inline":true},{"text":") with different choices of Φ","element":"figcaption","subtype":"caption"},{"style":{"height":11.2},"width":64.37,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-2.png","element":"img","alt":"0,L.","inline":true}],[{"text":"function and a B-spline basis function. Figure ","element":"span"},{"href":"#id-116","text":"6 ","element":"a"},{"text":"depicts the value of this factor with different choices of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":". We also tried several other combinations of ","element":"span"},{"style":{"height":18},"width":170.78,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-3.png","element":"img","alt":" G and µ","inline":true},{"text":", and found that this factor is in general smaller than or very close to 1. 
Since Var(","element":"span"},{"style":{"height":19.67},"width":268.35,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-4.png","element":"img","alt":"w(π, X0,t)) >","inline":true,"padRight":true},{"text":"0, Theorem ","element":"span"},{"href":"#id-117","text":"5 ","element":"a"},{"text":"implies that the proposed estimator achieves smaller variance.","element":"span"}],[{"text":"We next sketch a few lines to prove Theorem ","element":"span"},{"href":"#id-117","text":"5. ","element":"a"},{"text":"Based on Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"the asymptotic","element":"span"}],[{"text":"variance of our estimator is given by","element":"span"}],[{"style":{"width":"87%"},"width":1619,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.6},"width":39,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-6.png","element":"img","alt":" Ω","inline":true,"padRight":true},{"text":"is defined in Step 2 of the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1. 
","element":"a"},{"text":"Under (C1) and (C2), we have ","element":"span"},{"style":{"height":22.47},"width":794.05,"height":56.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-7.png","element":"img","alt":"Ω = Eξ0,tξ⊤0,tε20,t = σ2∗Eξ0,tξ⊤0,t where σ2∗ ","inline":true,"padRight":true},{"text":"is the variance of ","element":"span"},{"style":{"height":14.07},"width":75.48,"height":35.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-8.png","element":"img","alt":" ε0,t.","inline":true}],[{"text":"Using similar arguments in the proof of Theorem 16 of ","element":"span"},{"href":"#id-16","referenceIndex":17,"text":"Kallus and Uehara ","element":"a"},{"href":"#id-16","referenceIndex":17,"text":"(2019)","element":"a"},{"text":", we","element":"span"}],[{"text":"can show that the asymptotic variance of the DRL estimator equals","element":"span"}],[{"style":{"width":"42%"},"width":776,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-9.png","element":"img"}],[{"text":"Under the given conditions, the above variance is equal to ","element":"span"},{"style":{"height":20.54},"width":541.26,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-10.png","element":"img","alt":" {nT(1 − γ)2pa∗}−1σ2∗{1 +","inline":true}],[{"text":"Var(","element":"span"},{"style":{"height":19.67},"width":716.68,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-11.png","element":"img","alt":"ω(X0,0))} where pa∗ = Pr(A0,t = a∗","inline":true},{"text":"). 
Consequently, it suffices to show","element":"span"}],[{"id":"id-118","style":{"width":"94%"},"width":1748,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/43-12.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-0.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is a nondynamic policy,","element":"span"},{"style":{"height":22.85},"width":327.85,"height":57.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-1.png","element":"img","alt":"∫x∈X Uπ(x)G(dx","inline":true},{"text":") is a sparse vector that takes the following ","element":"span"},{"text":"form:","element":"span"}],[{"style":{"width":"53%"},"width":990,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-2.png","element":"img"}],[{"text":"By the definition of ","element":"span"},{"style":{"height":18.87},"width":230.75,"height":47.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-3.png","element":"img","alt":" Σπ and ξ0,t","inline":true},{"text":", the left-hand side of ","element":"span"},{"href":"#id-118","text":"(C.28) ","element":"a"},{"text":"is equal to","element":"span"}],[{"style":{"width":"99%"},"width":1843,"height":776,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-4.png","element":"img"}],[{"text":"under (C3). 
Consequently, we have","element":"span"}],[{"style":{"width":"81%"},"width":1503,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-5.png","element":"img"}],[{"text":"As such, the left-hand side of ","element":"span"},{"href":"#id-118","text":"(C.28) ","element":"a"},{"text":"is upper bounded by","element":"span"}],[{"style":{"width":"89%"},"width":1642,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-6.png","element":"img"}],[{"text":"(1 ","element":"span"},{"style":{"height":21.5},"width":136.26,"height":53.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-7.png","element":"img","alt":" − γ)−2","inline":true}],[{"style":{"width":"89%"},"width":1648,"height":143,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-8.png","element":"img"}],[{"text":"This completes the proof.","element":"span"}],[{"id":"id-69","style":{"fontWeight":"bold"},"text":"C.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional details on on-policy evaluation","element":"span"}],[{"text":"In this section, we show that our proposed CI in Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"achieves nominal coverage. ","element":"span"},{"text":"To simplify the analysis, we focus on the setting where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"is finite, ","element":"span"},{"style":{"height":19.6},"width":507.75,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-9.png","element":"img","alt":" T(1) = · · · = T(K) = T","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.6},"width":716.55,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-10.png","element":"img","alt":" L(1) = · · · = L(K) = L. 
When K","inline":true,"padRight":true},{"text":"diverges, the sequences ","element":"span"},{"style":{"height":19.67},"width":521.88,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/44-11.png","element":"img","alt":" {T(k)}k≥1 and {L(k)}k≥1","inline":true,"padRight":true},{"text":"should be properly chosen to reduce the bias of the value estimates. We leave this for future research.","element":"span"}],[{"text":"As in Appendix ","element":"span"},{"href":"#id-119","text":"A.2, ","element":"a"},{"text":"we assume that the estimated policy ","element":"span"},{"style":{"height":13.27},"width":104.89,"height":33.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-0.png","element":"img","alt":" π̂I ∈","inline":true,"padRight":true},{"text":"Π with probability 1, for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":". In on-policy settings, the behavior policy ","element":"span"},{"style":{"height":16.47},"width":39.91,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-1.png","element":"img","alt":" bπ","inline":true,"padRight":true},{"text":"is a function of the estimated policy ","element":"span"},{"style":{"height":11.2},"width":80.42,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-2.png","element":"img","alt":" π ∈","inline":true,"padRight":true},{"text":"Π. 
For instance, when an ","element":"span"},{"style":{"height":8.8},"width":19,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-3.png","element":"img","alt":" ϵ","inline":true},{"text":"-greedy policy is used to determine the behavior policy, then we have ","element":"span"},{"style":{"height":19.6},"width":602.11,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-4.png","element":"img","alt":" bπ = (1 − ϵ)π + ϵπ∗ where π∗ ","inline":true,"padRight":true},{"text":"denotes a uniform random policy. Let ","element":"span"},{"style":{"height":19.2},"width":360.56,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-5.png","element":"img","alt":"B = {bπ : π ∈ Π}.","inline":true}],[{"text":"For any behavior policy ","element":"span"},{"style":{"height":14.4},"width":110.38,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-6.png","element":"img","alt":" b ∈ B","inline":true},{"text":", consider a Markov chain ","element":"span"},{"style":{"height":19.67},"width":363.7,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-7.png","element":"img","alt":" {(X0,t,b, A0,t,b)}t≥0","inline":true,"padRight":true},{"text":"generated by this behavior policy. Let ","element":"span"},{"style":{"height":18.47},"width":90.15,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-8.png","element":"img","alt":" Y0,t,b","inline":true,"padRight":true},{"text":"be the realization of the immediate reward at time ","element":"span"},{"style":{"height":17.6},"width":176.26,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-9.png","element":"img","alt":" t. 
Let µb","inline":true,"padRight":true},{"text":"denote the limiting distribution of the Markov chain ","element":"span"},{"style":{"height":23.14},"width":754.01,"height":57.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-10.png","element":"img","alt":" {X0,t,b}t≥0, and Pt,bX (·|x) be its t-step","inline":true}],[{"text":"transition kernel. For any ","element":"span"},{"style":{"height":19.67},"width":667.92,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-11.png","element":"img","alt":" x ∈ X, a ∈ A, define ωπ,b(x, a) as","inline":true}],[{"style":{"width":"98%"},"width":1816,"height":173,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-12.png","element":"img"}],[{"text":"We introduce the following conditions.","element":"span"}],[{"text":"(A2’.) Assume ","element":"span"},{"style":{"height":17.2},"width":170.72,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-13.png","element":"img","alt":" ν0 and q","inline":true,"padRight":true},{"text":"are uniformly bounded away from 0 and ","element":"span"},{"style":{"height":8.8},"width":48,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-14.png","element":"img","alt":" ∞","inline":true,"padRight":true},{"text":"on their supports.","element":"span"}],[{"text":"(A3’.) 
Assume (i) and (ii) hold if ","element":"span"},{"style":{"height":13.6},"width":156.34,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-15.png","element":"img","alt":" T → ∞","inline":true,"padRight":true},{"text":"and (iii) holds if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded.","element":"span"}],[{"text":"(i) inf","element":"span"},{"style":{"height":22.85},"width":1717.28,"height":57.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-16.png","element":"img","alt":"π∈Π,b∈B λmin[∫x∈X ∑a∈A{ξ(x, a)ξ⊤(x, a) − γ2uπ(x, a)u⊤π (x, a)}b(a|x)µb(x)dx] ≥ ¯c for","inline":true,"padRight":true},{"text":"some constant ¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"c > ","element":"span"},{"text":"0.","element":"span"}],[{"text":"(ii) There exists some function ","element":"span"},{"style":{"height":19.6},"width":215.64,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-17.png","element":"img","alt":" M(·) on X","inline":true,"padRight":true},{"text":"and some constant ","element":"span"},{"style":{"height":14.4},"width":74.44,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-18.png","element":"img","alt":" ρ <","inline":true,"padRight":true},{"text":"1 such that","element":"span"}],[{"style":{"width":"34%"},"width":627,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-19.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"53%"},"width":977,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-20.png","element":"img"}],[{"text":"(iii) There exists some constant ¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"c > ","element":"span"},{"text":"0 such 
that","element":"span"}],[{"style":{"width":"3%"},"width":65,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-21.png","element":"img"}],[{"text":"inf","element":"span"},{"style":{"fontStyle":"italic"},"text":"π","element":"span"},{"style":{"height":28.84},"width":224.18,"height":72.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-22.png","element":"img","alt":"∈Π,b∈B λmin[","inline":true}],[{"text":"(A4’.) For any 1 ","element":"span"},{"style":{"height":22.24},"width":1113.5,"height":55.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-23.png","element":"img","alt":" ≤ k ≤ K, we have E|V (π̂¯Ik; G)−V (π∗; G)| = O(|¯Ik|−b0","inline":true},{"text":"), for some ","element":"span"},{"style":{"height":19.2},"width":171.65,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-24.png","element":"img","alt":" b0 > 1/2","inline":true,"padRight":true},{"text":"such that (","element":"span"},{"style":{"height":23.18},"width":661.91,"height":57.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/45-25.png","element":"img","alt":"nT)b0−1/2 ≫ ∥∫x ΦL(x)G(dx)∥−12 ","inline":true,"padRight":true},{"text":", where the big-","element":"span"},{"style":{"fontStyle":"italic"},"text":"O ","element":"span"},{"text":"term is uniform in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":".","element":"span"}],[{"id":"id-120","style":{"fontWeight":"bold"},"text":"Theorem 6 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume A1, A2’-A4’ hold. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"height":22.2},"width":776.8,"height":55.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-0.png","element":"img","alt":" L = o{√nT/ log(nT)} and L2p/d ≫","inline":true}],[{"style":{"height":22.96},"width":568.74,"height":57.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-1.png","element":"img","alt":"nT{1+∥∫x ΦL(x)G(dx)∥−22 }","inline":true},{"style":{"fontStyle":"italic"},"text":". Assume there exists some constant ","element":"span"},{"style":{"height":19.67},"width":565.2,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-2.png","element":"img","alt":" c0 ≥ 1 such that ωπ,b(x, a) ≥","inline":true},{"style":{"height":21.59},"width":1114.78,"height":53.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-3.png","element":"img","alt":"c−10 for any x, a, π, b and Pr(max0≤t≤T−1 |Y0,t| ≤ c0) = 1","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then as either ","element":"span"},{"style":{"height":16.8},"width":389.68,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-4.png","element":"img","alt":" n → ∞ or T → ∞,","inline":true}],[{"style":{"width":"56%"},"width":1041,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-5.png","element":"img"}],[{"text":"Proof of Theorem ","element":"span"},{"href":"#id-120","text":"6 ","element":"a"},{"text":"is omitted for brevity.","element":"span"}]]},{"heading":"D Additional numerical results","paragraphs":[[{"id":"id-95","style":{"fontWeight":"bold"},"text":"D.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Data-adaptive selection of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"}],[{"text":"We apply the proposed method detailed in Section ","element":"span"},{"href":"#id-121","text":"7.2 ","element":"a"},{"text":"to Scenario (B) where the treatment assignment mechanism depends on the observed state, to investigate the finite sample performance of the resulting CI. Specifically, we apply the random forest algorithm to learn the conditional mean function ","element":"span"},{"style":{"height":13.2},"width":28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-6.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"and the behavior policy ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":". 
We assume Σ(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, x","element":"span"},{"text":") is a","element":"span"}],[{"text":"constant function of (","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, x","element":"span"},{"text":") and estimate it by","element":"span"}],[{"style":{"width":"64%"},"width":1188,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-7.png","element":"img"}],[{"text":"We use the tensor product B-spline basis for Φ","element":"span"},{"style":{"height":8.8},"width":23,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-8.png","element":"img","alt":"L","inline":true},{"text":", as in Section ","element":"span"},{"text":"5. ","element":"span"},{"text":"Because the state is a two-dimensional vector, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is selected from the set ","element":"span"},{"style":{"height":20.14},"width":406.84,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/46-9.png","element":"img","alt":" {4², 5², 6², 7², 8², 9²}","inline":true},{"text":". Specifically, we choose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"such that the resulting CI is the shortest among all CIs whose coverage probabilities are above 93%. If no such CI exists, we select the CI with the highest coverage probability.","element":"span"}],[{"text":"We report the ECP and AL of the resulting CI in the left and middle panels of Figure ","element":"span"},{"href":"#id-122","text":"7. ","element":"a"},{"text":"It can be seen that the ECP is close to the nominal level in all cases and the AL decays as either ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"increases. 
In the right panel of Figure ","element":"span"},{"href":"#id-122","text":"7, ","element":"a"},{"text":"we report the number of basis functions that is selected most often by the proposed method as a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"(denoted by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, T","element":"span"},{"text":")). It is clear from Figure ","element":"span"},{"href":"#id-122","text":"7 ","element":"a"},{"text":"that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, T","element":"span"},{"text":") increases with the total number of observations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nT","element":"span"},{"text":". 
This is consistent with the following intuition: as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nT ","element":"span"},{"text":"increases, more basis functions are needed to reduce the approximation error and guarantee the nominal coverage of the resulting CI.","element":"span"}],[{"style":{"width":"91%"},"width":1683,"height":493,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-0.png","element":"img"}],[{"id":"id-122","text":"Figure 7: Empirical coverage probabilities (ECP) and average lengths (AL) of CIs constructed ","element":"figcaption","subtype":"caption"},{"text":"by the proposed method detailed in Section ","element":"figcaption","subtype":"caption"},{"href":"#id-121","text":"7.2 ","element":"a","subtype":"caption"},{"text":"as well as the number of basis functions selected most often, with different combinations of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Sensitivity analysis","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Sensitivity test for ","element":"span"},{"style":{"height":13.2},"width":24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-1.png","element":"img","alt":" η","inline":true}],[{"text":"In this section, we conduct a sensitivity test for the parameter ","element":"span"},{"style":{"height":13.2},"width":24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-2.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"in the number of 
basis functions ","element":"span"},{"style":{"height":19.6},"width":262.94,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-3.png","element":"img","alt":"L = ⌊(nT)^η⌋","inline":true},{"text":". We consider the off-policy evaluation simulation with a fixed target policy in Section 5.1. For Scenarios (A), (B) and (C), we set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 100, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 100 and the different ","element":"span"},{"style":{"height":13.2},"width":24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-4.png","element":"img","alt":" η","inline":true},{"text":"’s are chosen from (0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"25","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"30","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"35","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"40","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"7","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"45). The ECPs are plotted in Figure ","element":"span"},{"href":"#id-123","text":"8, ","element":"a"},{"text":"where all the ECPs are close to the nominal coverage rate 0.95. 
It shows that the results of the coverage are not sensitive to the different choices of ","element":"span"},{"style":{"height":13.2},"width":37.82,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-5.png","element":"img","alt":" η.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"D.2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Sensitivity test for ","element":"span"},{"style":{"height":13.2},"width":26,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-6.png","element":"img","alt":" γ","inline":true}],[{"text":"In Figure ","element":"span"},{"href":"#id-124","text":"9, ","element":"a"},{"text":"we report the ECP and AL of the proposed CI and the MSE of our value estimate under Scenario B where the target policy is fixed, with ","element":"span"},{"style":{"height":18},"width":289.21,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-7.png","element":"img","alt":" γ = 0.3 and 0.","inline":true},{"text":"7. In Figure ","element":"span"},{"href":"#id-125","text":"10, ","element":"a"},{"text":"we report the ECP and AL of the proposed CI and the MSE of our value estimate under Scenario B where the target policy is an estimated optimal policy, with ","element":"span"},{"style":{"height":18},"width":240.14,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-8.png","element":"img","alt":" γ = 0.3 and","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"7. 
It can be seen that findings are very similar to those with ","element":"span"},{"style":{"height":17.2},"width":162.73,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-9.png","element":"img","alt":" γ = 0.5.","inline":true}],[{"text":"In Tables ","element":"span"},{"href":"#id-126","text":"4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-127","text":"5, ","element":"a"},{"text":"we report the ECP, AL and MSE of the proposed method and DRL under Scenario (C) where the target policy is fixed, with ","element":"span"},{"style":{"height":18},"width":300.12,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-10.png","element":"img","alt":" γ = 0.3 and 0.","inline":true},{"text":"7. It can be seen that the proposed CI achieves nominal coverage in all cases. When ","element":"span"},{"style":{"height":17.2},"width":134,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-11.png","element":"img","alt":" γ = 0.","inline":true},{"text":"3, ECP of the DRL method is well below the nominal level in all cases. 
When ","element":"span"},{"style":{"height":17.2},"width":126.29,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/47-12.png","element":"img","alt":" γ = 0.","inline":true},{"text":"7, the AL and MSE of the proposed CI are much smaller than those based on DRL.","element":"span"}],[{"id":"id-123","style":{"width":"72%"},"width":1333,"height":874,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/48-0.png","element":"img"}],[{"id":"id-124","text":"Figure 9: Empirical coverage probabilities and average lengths of CIs constructed by the ","element":"figcaption","subtype":"caption"},{"text":"proposed method as well as the mean squared errors of the value estimates, under Scenario (B), with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". The target policy is fixed. ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":453.89,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/48-1.png","element":"img","alt":" γ = 0.3 and 0.7, from","inline":true,"padRight":true},{"text":"top to bottom panels.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"75%"},"width":1397,"height":841,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/48-2.png","element":"img"}],[{"id":"id-96","style":{"fontWeight":"bold"},"text":"D.2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Sensitivity test for the ordering of trajectories","element":"span"}],[{"text":"We focus on Scenario (B), detailed in Section ","element":"span"},{"href":"#id-70","text":"5.1, ","element":"a"},{"text":"to examine the sensitivity of the proposed CI to the ordering of trajectories. 
Specifically, we first randomly permute all trajectories","element":"span"}],[{"text":"Figure 10: Empirical coverage probabilities and average lengths of CIs constructed by the proposed method as well as the mean squared errors of the value estimates, under Scenario (B), with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". The target policy is an estimated optimal policy. ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":292.77,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/49-0.png","element":"img","alt":"γ = 0.3 and 0.","inline":true},{"id":"id-125","text":"7, from top to bottom panels.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"76%"},"width":1402,"height":841,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/49-1.png","element":"img"}],[{"id":"id-126","text":"Table 4: Empirical coverage probabilities (ECP) and average lengths (AL) of CIs constructed ","element":"figcaption","subtype":"caption"},{"text":"by the proposed method and DRL as well as the mean-squared errors (MSE) of the corresponding value estimators under Scenario (C), with ","element":"figcaption","subtype":"caption"},{"style":{"height":17.2},"width":162.72,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/49-2.png","element":"img","alt":" γ = 0.3.","inline":true}],[{"style":{"width":"64%"},"width":1197,"height":296,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/49-3.png","element":"img"}],[{"text":"with some fixed random seed. We next apply our SAVE procedure to construct the CI. 
We use three random seeds to generate different random permutations and depict the corresponding results in Figure ","element":"span"},{"href":"#id-128","text":"11. ","element":"a"},{"text":"It can be seen that our method is not overly sensitive to the ordering of trajectories.","element":"span"}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"D.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional settings","element":"span"}],[{"text":"In this section, we conduct additional simulation studies to investigate the finite sample performance of the proposed method under settings where the reference distribution ","element":"span"},{"text":"G ","element":"span"},{"text":"is a Dirac delta function. Specifically, we consider the settings in Scenario (A) and set ","element":"span"},{"text":"G ","element":"span"},{"text":"to","element":"span"}],[{"id":"id-127","text":"Table 5: Empirical coverage probabilities (ECP) and average lengths (AL) of CIs constructed ","element":"figcaption","subtype":"caption"},{"text":"by the proposed method and DRL as well as the mean-squared errors (MSE) of the corresponding value estimators under Scenario (C), with ","element":"figcaption","subtype":"caption"},{"style":{"height":17.6},"width":162.72,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/50-0.png","element":"img","alt":" γ = 0.7.","inline":true}],[{"style":{"width":"64%"},"width":1197,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/50-1.png","element":"img"}],[{"id":"id-128","text":"Figure 11: Empirical coverage probabilities and average lengths of CIs constructed by the ","element":"figcaption","subtype":"caption"},{"text":"proposed method under Scenario (B), with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and 
","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". From top plots to bottom plots, we use three different random seeds to generate the random permutation applied to all trajectories.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"73%"},"width":1346,"height":1233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/50-2.png","element":"img"}],[{"text":"(0","element":"span"},{"style":{"height":19.6},"width":538.64,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/50-3.png","element":"img","alt":".5, 0.5)⊤ and (−0.5, −0.5)⊤","inline":true},{"text":". It can be seen from Figures ","element":"span"},{"href":"#id-129","text":"12 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-130","text":"13 ","element":"a"},{"text":"that our CIs achieve nominal coverage and their lengths decrease as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"nT ","element":"span"},{"text":"increases.","element":"span"}],[{"id":"id-129","text":"Figure 12: Empirical coverage probabilities and average lengths of CIs constructed by the ","element":"figcaption","subtype":"caption"},{"text":"proposed method under Scenario (A), with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". The target policy is fixed. 
The reference distribution ","element":"figcaption","subtype":"caption"},{"text":"G ","element":"figcaption","subtype":"caption"},{"text":"corresponds to (0","element":"figcaption","subtype":"caption"},{"style":{"height":19.6},"width":682.3,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-0.png","element":"img","alt":".5, 0.5)⊤ and (−0.5, −0.5)⊤, from","inline":true,"padRight":true},{"text":"top plots to bottom plots.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"53%"},"width":984,"height":894,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-1.png","element":"img"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"D.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional real data results","element":"span"}],[{"text":"We use our real data example to discuss the issue of over-fitting in this section. Specifically, we apply the proposed method in Section ","element":"span"},{"href":"#id-23","text":"3.2.2 ","element":"a"},{"text":"to evaluate the optimal value ","element":"span"},{"style":{"height":20.21},"width":249.55,"height":50.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-2.png","element":"img","alt":" V (πopt; Xi,0)","inline":true,"padRight":true},{"text":"starting from the initial state variable ","element":"span"},{"style":{"height":18.87},"width":422.04,"height":47.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-3.png","element":"img","alt":" Xi,0, for i = 1, 2, · · · ,","inline":true,"padRight":true},{"text":"6. When the initial starting time is 8:00 am in Day 1, CIs for Patient 5 and Patient 6 are [","element":"span"},{"style":{"height":19.2},"width":718,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-4.png","element":"img","alt":"−6.288, 3.287] and [−6.313, 10.644],","inline":true,"padRight":true},{"text":"respectively. Both upper bounds are positive. 
However, according to our definition, the immediate reward is nonpositive. As such, the value and Q-function should be nonpositive as well. This reflects one of the drawbacks of the proposed method. The resulting Q-estimator might suffer from over-fitting, leading to an unbounded outcome.","element":"span"}],[{"text":"Specifically, this is because the matrix ","element":"span"},{"style":{"height":16.07},"width":58.86,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-5.png","element":"img","alt":"Σ̂π","inline":true,"padRight":true},{"text":"is close to singular. Note that the regression","element":"span"}],[{"text":"coefficients ","element":"span"},{"style":{"height":17.2},"width":51.55,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-6.png","element":"img","alt":"β̂π","inline":true,"padRight":true},{"text":"are computed by solving the linear equation","element":"span"}],[{"style":{"width":"50%"},"width":940,"height":198,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-7.png","element":"img"}],[{"text":"In our data example, the number of basis functions equals 12. As such, ","element":"span"},{"style":{"height":16.07},"width":58.86,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/51-8.png","element":"img","alt":"Σ̂π","inline":true,"padRight":true},{"text":"is a 12 by 12","element":"span"}],[{"text":"Figure 13: Empirical coverage probabilities and average lengths of CIs constructed by the proposed method under Scenarios (A), with different choices of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T","element":"figcaption","subtype":"caption"},{"text":". The target policy is an estimated optimal policy. 
The reference distribution ","element":"figcaption","subtype":"caption"},{"text":"G ","element":"figcaption","subtype":"caption"},{"text":"corresponds to (0","element":"figcaption","subtype":"caption"},{"style":{"height":19.6},"width":253.29,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-0.png","element":"img","alt":".5, 0.5)⊤ and","inline":true,"padRight":true},{"text":"(","element":"figcaption","subtype":"caption"},{"style":{"height":19.6},"width":249.23,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-1.png","element":"img","alt":"−0.5, −0.5)⊤","inline":true},{"id":"id-130","text":", from top plots to bottom plots.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"53%"},"width":984,"height":893,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-2.png","element":"img"}],[{"text":"matrix. When it is close to singular, the resulting Q-estimator might be unbounded.","element":"span"}],[{"text":"To avoid over-fitting, we note that in theory, ","element":"span"},{"style":{"height":16.07},"width":58.85,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-3.png","element":"img","alt":"Σ̂π","inline":true,"padRight":true},{"text":"is a positive definite matrix under","element":"span"}],[{"text":"Condition (A3)(i). This motivates us to compute ","element":"span"},{"style":{"height":17.2},"width":51.55,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-4.png","element":"img","alt":"β̂π","inline":true,"padRight":true},{"text":"by solving","element":"span"}],[{"style":{"width":"40%"},"width":742,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"denotes the identity matrix. 
As long as ","element":"span"},{"style":{"height":20.54},"width":528.17,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-6.png","element":"img","alt":" λ satisfies λ = O(N −1T −1","inline":true},{"text":"), the proposed CI remains valid. In our real data example, we set ","element":"span"},{"style":{"height":16.14},"width":275.84,"height":40.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-7.png","element":"img","alt":" λ = 1 × 10−9","inline":true},{"text":". The resulting CIs for Patient 5 and Patient 6 are [","element":"span"},{"style":{"height":19.2},"width":676.51,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-8.png","element":"img","alt":"−8.034, −5.667] and [−9.866, −7.","inline":true},{"text":"104]. Both upper bounds are strictly negative.","element":"span"}]]},{"heading":"E Technical proofs","paragraphs":[[{"text":"For any two positive sequences ","element":"span"},{"style":{"height":19.67},"width":763.16,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-9.png","element":"img","alt":" {at}t≥1 and {bt}t≥1, we write at ⪯ bt","inline":true,"padRight":true},{"text":"if there exists some constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0 such that ","element":"span"},{"style":{"height":17.6},"width":375.09,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-10.png","element":"img","alt":" at ≤ Cbt for any t","inline":true},{"text":". The notation ","element":"span"},{"style":{"height":19.6},"width":599.02,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-11.png","element":"img","alt":" at ⪯ 1 means at = O(1). 
We","inline":true,"padRight":true},{"text":"will use ","element":"span"},{"style":{"height":19.64},"width":145.72,"height":49.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/52-12.png","element":"img","alt":" C, ¯C >","inline":true,"padRight":true},{"text":"0 to denote some universal constants whose values are allowed to change from place to place. Let ","element":"span"},{"style":{"height":19.6},"width":124.88,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-0.png","element":"img","alt":" qX(·|x","inline":true},{"text":") denote the density function of ","element":"span"},{"style":{"height":20.94},"width":525.12,"height":52.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-1.png","element":"img","alt":" PX(·|x). Define SmL−1 as","inline":true,"padRight":true},{"text":"the unit sphere ","element":"span"},{"style":{"height":20.54},"width":449.22,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-2.png","element":"img","alt":" {v ∈ RmL : ∥v∥2 = 1}","inline":true},{"text":". When splines are used to estimate the Q-function, we assume the internal knots are equally spaced.","element":"span"}],[{"text":"The rest of the section is organized as follows. We first present the proof sketches for Theorems ","element":"span"},{"href":"#id-31","text":"1-","element":"a"},{"href":"#id-28","text":"4. ","element":"a"},{"text":"We next present the detailed technical proofs.","element":"span"}],[{"id":"id-46","style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A sketch for the proof of Theorem ","element":"span"},{"href":"#id-31","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"text":"We provide an outline for the proof in this section. The detailed proof can be found in Section ","element":"span"},{"href":"#id-47","text":"E.5 ","element":"a"},{"text":"of the supplementary article. 
We break the proof into three steps. In the first","element":"span"}],[{"text":"step, we show the estimator ","element":"span"},{"style":{"height":17.2},"width":227.14,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-3.png","element":"img","alt":"β̂π satisfies","inline":true}],[{"id":"id-131","style":{"width":"92%"},"width":1708,"height":199,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.07},"width":223.52,"height":40.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-5.png","element":"img","alt":" Σπ = EΣ̂π","inline":true},{"text":". The proof of ","element":"span"},{"href":"#id-131","text":"(E.29) ","element":"a"},{"text":"relies on some random matrix inequalities established in Lemma ","element":"span"},{"href":"#id-25","text":"3 ","element":"a"},{"text":"of the supplementary article.","element":"span"}],[{"text":"In the second step, we show the linear representation in ","element":"span"},{"href":"#id-48","text":"(3.12) ","element":"a"},{"text":"holds. The proof of ","element":"span"},{"href":"#id-48","text":"(3.12) ","element":"a"},{"text":"relies on the convergence rate of ","element":"span"},{"style":{"height":17.2},"width":32,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-6.png","element":"img","alt":"β̂","inline":true,"padRight":true},{"text":"established in the first step and some additional random matrix inequalities in Lemma ","element":"span"},{"href":"#id-132","text":"4.","element":"a"}],[{"text":"In the last step, we show the leading term on the RHS of ","element":"span"},{"href":"#id-48","text":"(3.12) ","element":"a"},{"text":"is asymptotically normal, based on the martingale central limit theorem. 
This completes the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1.","element":"a"}],[{"id":"id-52","style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A sketch for the proof of Theorem ","element":"span"},{"href":"#id-57","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"text":"Similar to ","element":"span"},{"href":"#id-50","text":"(3.15)","element":"a"},{"text":", for each 1 ","element":"span"},{"style":{"height":17.2},"width":373.56,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-7.png","element":"img","alt":" ≤ k ≤ K, we have","inline":true}],[{"id":"id-51","style":{"width":"87%"},"width":1613,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":24.71},"width":79.12,"height":61.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-9.png","element":"img","alt":" R(1)k","inline":true,"padRight":true},{"text":"denotes the remainder term and","element":"span"}],[{"style":{"width":"77%"},"width":1426,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/53-10.png","element":"img"}],[{"text":"Since ","element":"span"},{"href":"#id-56","text":"(3.16) ","element":"a"},{"text":"is satisfied, we have ","element":"span"},{"style":{"height":20.6},"width":1155.16,"height":51.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-0.png","element":"img","alt":" E{εi,t(π̂¯Ik)|Ok} = 0. Conditional on the data in Ok, π̂¯Ik","inline":true,"padRight":true},{"text":"is a deterministic rule. 
The RHS of ","element":"span"},{"href":"#id-51","text":"(E.30) ","element":"a"},{"text":"is thus equivalent to a mean-zero martingale.","element":"span"}],[{"text":"When (A5) is satisfied, we have","element":"span"}],[{"id":"id-133","style":{"width":"85%"},"width":1583,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-1.png","element":"img"}],[{"text":"for some remainder term ","element":"span"},{"style":{"height":24.7},"width":593.51,"height":61.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-2.png","element":"img","alt":" R(2)k . Suppose R(1)k and R(2)k","inline":true,"padRight":true},{"text":"satisfy certain convergence rates.","element":"span"}],[{"text":"Combining ","element":"span"},{"href":"#id-133","text":"(E.31) ","element":"a"},{"text":"together with ","element":"span"},{"href":"#id-51","text":"(E.30) ","element":"a"},{"text":"yields,","element":"span"}],[{"id":"id-53","style":{"width":"97%"},"width":1795,"height":294,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-3.png","element":"img"}],[{"text":"due to our use of the inverse weighting trick. Theorem ","element":"span"},{"href":"#id-57","text":"2 ","element":"a"},{"text":"thus follows from the martingale central limit theorem.","element":"span"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"E.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A sketch for the proofs of Theorems ","element":"span"},{"href":"#id-27","style":{"fontWeight":"bold"},"text":"3 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"and ","element":"span"},{"href":"#id-28","style":{"fontWeight":"bold"},"text":"4","element":"a"}],[{"text":"Proofs of Theorems ","element":"span"},{"href":"#id-27","text":"3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"4 ","element":"a"},{"text":"are divided into two steps. 
In the first step, we decompose the value difference ","element":"span"},{"style":{"height":20.14},"width":434.95,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-4.png","element":"img","alt":" V (πopt; G) − V (�πI; G","inline":true},{"text":") into the sum of an infinite series and provide upper bounds for all the terms in the series. In the second step, we use the margin-type condition A5 to further characterize these upper bounds. We only present the first step in this section.","element":"span"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . ","element":"span"},{"text":", define a time-dependent policy ","element":"span"},{"style":{"height":24.56},"width":69.98,"height":61.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-5.png","element":"img","alt":" �π(j)I","inline":true,"padRight":true},{"text":"that executes ","element":"span"},{"style":{"height":11.27},"width":49.56,"height":28.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-6.png","element":"img","alt":" �πI","inline":true,"padRight":true},{"text":"at the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"time points and then follows ","element":"span"},{"style":{"height":15.34},"width":73.72,"height":38.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-7.png","element":"img","alt":" πopt","inline":true},{"text":". 
By definition, we have ","element":"span"},{"style":{"height":24.56},"width":682.49,"height":61.41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-8.png","element":"img","alt":" πopt = �π(0)I and �πI = �π(∞)I . Notice","inline":true}],[{"text":"that","element":"span"}],[{"id":"id-134","style":{"width":"90%"},"width":1671,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-9.png","element":"img"}],[{"text":"Moreover, for any ","element":"span"},{"style":{"height":16.8},"width":122.25,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-10.png","element":"img","alt":" j ≥ 1,","inline":true}],[{"style":{"height":43.31},"width":619.17,"height":108.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-11.png","element":"img","alt":"V (�π(j)I ; G) − V (�π(j+1)I ; G) =�x","inline":true}],[{"style":{"width":"68%"},"width":1267,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/54-12.png","element":"img"}],[{"style":{"width":"99%"},"width":1841,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-0.png","element":"img"}],[{"text":"policy ","element":"span"},{"style":{"height":11.27},"width":49.56,"height":28.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-1.png","element":"img","alt":" �πI","inline":true,"padRight":true},{"text":"at the first ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"time points, we have","element":"span"}],[{"style":{"width":"59%"},"width":1104,"height":254,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-2.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"style":{"height":45.7},"width":756.42,"height":114.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-3.png","element":"img","alt":"V (�π(j)I ; G) − V (�π(j+1)I ; G) = 
γj�x,x′∈X","inline":true}],[{"text":"By A1, we have sup","element":"span"},{"style":{"height":21.32},"width":362.54,"height":53.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-4.png","element":"img","alt":"x,x′,a q(x′|x, a) ≤ c","inline":true},{"text":". Under the Markov assumption,","element":"span"}],[{"style":{"width":"99%"},"width":1838,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-5.png","element":"img"}],[{"text":"Therefore, we obtain","element":"span"}],[{"id":"id-135","style":{"width":"91%"},"width":1690,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-6.png","element":"img"}],[{"text":"By A1, the reward function ","element":"span"},{"style":{"height":19.6},"width":87.87,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-7.png","element":"img","alt":" r(·, ·","inline":true},{"text":") is uniformly bounded. This further implies that ","element":"span"},{"style":{"height":20.14},"width":168.48,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-8.png","element":"img","alt":" Qopt(·, ·)","inline":true,"padRight":true},{"text":"is uniformly bounded. Therefore, ","element":"span"},{"style":{"height":27.36},"width":929.92,"height":68.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-9.png","element":"img","alt":"�j≥t V (�π(j)I ; G) − V (�π(j+1)I ; G) → 0 as t → ∞","inline":true},{"text":". 
It follows","element":"span"}],[{"text":"from ","element":"span"},{"href":"#id-134","text":"(E.33) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-135","text":"(E.34) ","element":"a"},{"text":"that","element":"span"}],[{"id":"id-138","style":{"width":"81%"},"width":1503,"height":429,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-10.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":14.4},"width":27,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-11.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"denote the Lebesgue measure on ","element":"span"},{"style":{"height":15.74},"width":52.54,"height":39.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/55-12.png","element":"img","alt":" Rd","inline":true},{"text":". In Sections ","element":"span"},{"href":"#id-136","text":"E.11 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-137","text":"E.12, ","element":"a"},{"text":"we use A5 to further bound ","element":"span"},{"href":"#id-138","text":"(E.35) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-138","text":"(E.36)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-36","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"text":"Since ","element":"span"},{"text":"X ","element":"span"},{"text":"is compact, Condition (A1) implies that sup","element":"span"},{"style":{"height":21.32},"width":409.02,"height":53.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-0.png","element":"img","alt":"x∈X,a∈A |r(x, a)| ≤ R","inline":true,"padRight":true},{"text":"for some 0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< R 
<","element":"span"}],[{"text":"+","element":"span"},{"style":{"height":8.8},"width":48,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-1.png","element":"img","alt":"∞","inline":true},{"text":". Under CMIA, we have","element":"span"}],[{"style":{"width":"98%"},"width":1820,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-2.png","element":"img"}],[{"text":"As a result, we obtain","element":"span"}],[{"id":"id-139","style":{"width":"65%"},"width":1208,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-3.png","element":"img"}],[{"text":"By the Bellman equation, we obtain","element":"span"}],[{"style":{"width":"68%"},"width":1255,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-4.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":19.6},"width":206.35,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-5.png","element":"img","alt":" r(·, a) is p","inline":true},{"text":"-smooth for any ","element":"span"},{"style":{"height":15.2},"width":122.05,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-6.png","element":"img","alt":" a ∈ A","inline":true},{"text":", it suffices to show","element":"span"}],[{"style":{"width":"55%"},"width":1029,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-7.png","element":"img"}],[{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-smooth for any ","element":"span"},{"style":{"height":15.2},"width":122.05,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-8.png","element":"img","alt":" a ∈ A","inline":true,"padRight":true},{"text":"and any policy 
","element":"span"},{"style":{"height":8.8},"width":41.28,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-9.png","element":"img","alt":" π.","inline":true}],[{"text":"For any function ","element":"span"},{"style":{"height":19.6},"width":58.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-10.png","element":"img","alt":" h(·","inline":true},{"text":") defined on ","element":"span"},{"style":{"height":19.67},"width":244.9,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-11.png","element":"img","alt":" X, let ∂jh(x","inline":true},{"text":") denote the partial derivative ","element":"span"},{"style":{"height":19.67},"width":225.17,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-12.png","element":"img","alt":" ∂h(x)/∂xj.","inline":true,"padRight":true},{"text":"Without loss of generality, suppose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p > ","element":"span"},{"text":"1 such that ","element":"span"},{"style":{"height":19.67},"width":207.61,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-13.png","element":"img","alt":" ∂jp(x′|x, a","inline":true},{"text":") exists for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". In the","element":"span"}],[{"text":"following, we show ","element":"span"},{"style":{"height":19.67},"width":216.24,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-14.png","element":"img","alt":" ∂jT(π; x, a","inline":true},{"text":") exists for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". 
Let","element":"span"}],[{"style":{"width":"46%"},"width":861,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-15.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":14.4},"width":115.98,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-16.png","element":"img","alt":" δ ∈ R","inline":true},{"text":", consider the limit","element":"span"}],[{"style":{"width":"90%"},"width":1660,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-17.png","element":"img"}],[{"text":"as ","element":"span"},{"style":{"height":16.8},"width":144.46,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-18.png","element":"img","alt":" j → ∞","inline":true},{"text":". By the mean value theorem, we have","element":"span"}],[{"style":{"width":"93%"},"width":1728,"height":275,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/56-19.png","element":"img"}],[{"text":"where 0 ","element":"span"},{"style":{"height":17.2},"width":707.37,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-0.png","element":"img","alt":" ≤ θx ≤ 1 for all x. When 1 < p ≤","inline":true,"padRight":true},{"text":"2, we have ","element":"span"},{"style":{"height":19.2},"width":713.78,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-1.png","element":"img","alt":" ⌊p⌋ = 1. 
It follows from Condition","inline":true}],[{"text":"A1 that","element":"span"}],[{"style":{"width":"46%"},"width":866,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-2.png","element":"img"}],[{"text":"When ","element":"span"},{"style":{"height":19.67},"width":430.54,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-3.png","element":"img","alt":" p > 2, |∂j∂jq(x′|x, a)|","inline":true,"padRight":true},{"text":"exists and is bounded by ","element":"span"},{"style":{"height":17.6},"width":414.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-4.png","element":"img","alt":" c for any x′, x and a","inline":true},{"text":". It follows from","element":"span"}],[{"text":"the mean value theorem that","element":"span"}],[{"style":{"width":"43%"},"width":801,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-5.png","element":"img"}],[{"text":"In either case, we have that","element":"span"}],[{"style":{"width":"62%"},"width":1145,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-6.png","element":"img"}],[{"text":"By ","element":"span"},{"href":"#id-139","text":"(E.37) ","element":"a"},{"text":"and that ","element":"span"},{"text":"X ","element":"span"},{"text":"is compact, we obtain Re(","element":"span"},{"style":{"height":19.6},"width":363.42,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-7.png","element":"img","alt":"j, δ) → 0 as δ →","inline":true,"padRight":true},{"text":"0. 
This implies that","element":"span"}],[{"style":{"width":"69%"},"width":1287,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-8.png","element":"img"}],[{"text":"In addition, it follows from A1 and ","element":"span"},{"href":"#id-139","text":"(E.37) ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"44%"},"width":819,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.6},"width":58.52,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-10.png","element":"img","alt":" λ(·","inline":true},{"text":") denotes the Lebesgue measure. Using the same arguments, we can show for any ","element":"span"},{"style":{"height":19.6},"width":507.89,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-11.png","element":"img","alt":"d-tuple α = (α1, . . . , αd)⊤ ","inline":true,"padRight":true},{"text":"of nonnegative integers that satisfies ","element":"span"},{"style":{"height":19.2},"width":239.6,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-12.png","element":"img","alt":" ∥α∥1 ≤ ⌊p⌋,","inline":true}],[{"id":"id-140","style":{"width":"82%"},"width":1523,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-13.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"73%"},"width":1358,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-14.png","element":"img"}],[{"text":"Moreover, by A1, ","element":"span"},{"href":"#id-139","text":"(E.37) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-140","text":"(E.38)","element":"a"},{"text":", we have for any 
","element":"span"},{"style":{"height":19.2},"width":639.05,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-15.png","element":"img","alt":" d-tuple α with ∥α∥1 = ⌊p⌋ that","inline":true}],[{"style":{"width":"97%"},"width":1799,"height":187,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-16.png","element":"img"}],[{"text":"This together with ","element":"span"},{"href":"#id-140","text":"(E.38) ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":20.54},"width":1041.8,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/57-17.png","element":"img","alt":" T(π; ·, a) ∈ Λ(p, Rcλ(X)(1 − γ)−1) for any π and a.","inline":true,"padRight":true},{"text":"The proof is thus completed.","element":"span"}],[{"id":"id-47","style":{"fontWeight":"bold"},"text":"E.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-31","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"text":"We introduce the following lemmas before proving Theorem ","element":"span"},{"href":"#id-31","text":"1. 
","element":"a"},{"text":"In the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-141","text":"2-","element":"a"},{"href":"#id-142","text":"5, ","element":"a"},{"text":"we will omit the subscript ","element":"span"},{"style":{"height":19.6},"width":896.16,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-0.png","element":"img","alt":" π in Uπ(·), uπ, Σπ, �Σπ, �βπ, β∗π, ωπ, etc, for","inline":true,"padRight":true},{"text":"brevity.","element":"span"}],[{"id":"id-141","style":{"fontWeight":"bold"},"text":"Lemma 2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists some constant ","element":"span"},{"style":{"height":16.4},"width":330.75,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-1.png","element":"img","alt":" c∗ ≥ 1 such that","inline":true}],[{"id":"id-156","style":{"width":"92%"},"width":1703,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-2.png","element":"img"}],[{"style":{"height":22.31},"width":613.17,"height":55.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-3.png","element":"img","alt":"and supx∈X ∥ΦL(x)∥2 ≤ c∗√L.","inline":true}],[{"id":"id-25","style":{"fontWeight":"bold"},"text":"Lemma 3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose the conditions in Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"We have as either ","element":"span"},{"style":{"height":10.4},"width":245.32,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-4.png","element":"img","alt":" n → ∞ or","inline":true},{"style":{"height":21.81},"width":1842.86,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-5.png","element":"img","alt":"T → ∞ that ∥Σ−1∥2 ≤ 3¯c−1, ∥Σ∥2 = O(1), ∥�Σ − Σ∥2 = Op{L1/2(nT)−1/2 log(nT)},","inline":true},{"style":{"height":21.81},"width":1411.29,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-6.png","element":"img","alt":"∥�Σ−1 − Σ−1∥2 = Op{L1/2(nT)−1/2 log(nT)} and ∥�Σ−1∥ ≤ 6¯c−1 wpa1.","inline":true}],[{"id":"id-132","style":{"fontWeight":"bold"},"text":"Lemma 4 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose the conditions in Theorem ","element":"span"},{"href":"#id-31","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"hold. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"We have as either ","element":"span"},{"style":{"height":10.4},"width":245.32,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-7.png","element":"img","alt":" n → ∞ or","inline":true}],[{"style":{"width":"99%"},"width":1838,"height":265,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-8.png","element":"img"}],[{"id":"id-142","style":{"height":23.18},"width":1118.63,"height":57.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-9.png","element":"img","alt":"Step 1. 
Since L2p/d ≫ nT{1 + ∥∫x ΦL(x)G(dx)∥−22 }","inline":true},{"text":", it follows from Lemma ","element":"span"},{"href":"#id-142","text":"5 ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":23.18},"width":759.85,"height":57.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-10.png","element":"img","alt":"L2p/d ≫ nT{1 + ∥∫x U(x)G(dx)∥−22 }","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-36","text":"1, ","element":"a"},{"text":"there exists a set of vectors ","element":"span"},{"style":{"height":19.2},"width":198.87,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-11.png","element":"img","alt":" {β∗a} that","inline":true}],[{"text":"satisfy (see Section 2.2 of ","element":"span"},{"href":"#id-39","referenceIndex":13,"text":"Huang, ","element":"a"},{"href":"#id-39","referenceIndex":13,"text":"1998, ","element":"a"},{"text":"for details)","element":"span"}],[{"id":"id-143","style":{"width":"74%"},"width":1371,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-12.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"style":{"height":20.94},"width":774.69,"height":52.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-13.png","element":"img","alt":" C > 0. Let β∗ = (β∗T1 , . . . , β∗Tm )⊤, and","inline":true}],[{"style":{"width":"98%"},"width":1819,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-14.png","element":"img"}],[{"text":"The condition Pr(max","element":"span"},{"style":{"height":19.67},"width":1063.66,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-15.png","element":"img","alt":"0≤t≤T−1 |Yi,t| ≤ c0) = 1 implies that |Yi,t| ≤ c0, ∀i, t","inline":true},{"text":", almost surely. 
By Lemma ","element":"span"},{"href":"#id-36","text":"1 ","element":"a"},{"text":"and the definition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-smooth functions, we obtain that ","element":"span"},{"style":{"height":19.6},"width":389.73,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-16.png","element":"img","alt":" |Q(π; x, a)| ≤ c′ for","inline":true}],[{"text":"any ","element":"span"},{"style":{"height":12.4},"width":121.79,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-17.png","element":"img","alt":" π, x, a","inline":true},{"text":". It follows that","element":"span"}],[{"id":"id-144","style":{"width":"76%"},"width":1415,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/58-18.png","element":"img"}],[{"text":"almost surely. In addition, it follows from ","element":"span"},{"href":"#id-143","text":"(E.41) ","element":"a"},{"text":"that","element":"span"}],[{"id":"id-145","style":{"width":"87%"},"width":1606,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-0.png","element":"img"}],[{"text":"By definition, we have","element":"span"}],[{"style":{"width":"87%"},"width":1621,"height":771,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-1.png","element":"img"}],[{"text":"In the following, we show ","element":"span"},{"style":{"height":21.81},"width":935.47,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-2.png","element":"img","alt":" ζ2 = Op{L(nT)−1 log(nT)} and ζ3 = Op(L−p/d","inline":true},{"text":") as either ","element":"span"},{"style":{"height":13.6},"width":163.16,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-3.png","element":"img","alt":" n → ∞,","inline":true}],[{"text":"or 
","element":"span"},{"style":{"height":13.6},"width":169.16,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-4.png","element":"img","alt":" T → ∞.","inline":true}],[{"style":{"height":19.67},"width":630.54,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-5.png","element":"img","alt":"Error bound for ∥ζ2∥2: Let Fi,t","inline":true,"padRight":true},{"text":"denote the sub-dataset ","element":"span"},{"style":{"height":19.67},"width":717.16,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/59-6.png","element":"img","alt":" {Xi,t, Ai,t} ∪ {(Yi,j, Ai,j, Xi,j)}1≤j ","element":"span"},{"text":"0 and hence","element":"span"}],[{"id":"id-149","style":{"width":"74%"},"width":1368,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/61-8.png","element":"img"}],[{"text":"by ","element":"span"},{"href":"#id-148","text":"(E.49)","element":"a"},{"text":". Combining ","element":"span"},{"href":"#id-149","text":"(E.50) ","element":"a"},{"text":"together with ","element":"span"},{"href":"#id-150","text":"(E.48) ","element":"a"},{"text":"yields that","element":"span"}],[{"style":{"width":"84%"},"width":1550,"height":294,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/61-9.png","element":"img"}],[{"style":{"width":"99%"},"width":1837,"height":213,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-0.png","element":"img"}],[{"text":"This completes the second step of the proof.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Step 3: ","element":"span"},{"text":"In the following, we show","element":"span"},{"style":{"height":26.28},"width":1042.81,"height":65.71,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-1.png","element":"img","alt":"√nTσ−1(π; G){�x U(x)G(dx)}⊤ζ1 d→ N(0, 1). 
For","inline":true,"padRight":true},{"text":"any integer 1 ","element":"span"},{"style":{"height":19.6},"width":544.78,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-2.png","element":"img","alt":" ≤ g ≤ nT, let i(g) and t(g","inline":true},{"text":") be the quotient and the remainder of ","element":"span"},{"style":{"height":17.2},"width":199.22,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-3.png","element":"img","alt":" g + T − 1","inline":true}],[{"text":"divided by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"that satisfy","element":"span"}],[{"style":{"width":"51%"},"width":951,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-4.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":21.81},"width":380.78,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-5.png","element":"img","alt":" F(0) = {X1,0, A1,0}","inline":true},{"text":". Then we iteratively define ","element":"span"},{"style":{"height":21.81},"width":264,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-6.png","element":"img","alt":" {F(g)}1≤g≤nT","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"73%"},"width":1351,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-7.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":22.72},"width":660.32,"height":56.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-8.png","element":"img","alt":" ξ(g) = ξi(g),t(g) and ε(g) = εi(g),t(g)","inline":true},{"text":". 
It follows that","element":"span"}],[{"id":"id-151","style":{"width":"85%"},"width":1581,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-9.png","element":"img"}],[{"text":"By MA, CMIA and the Bellman equation in ","element":"span"},{"href":"#id-38","text":"(3.7)","element":"a"},{"text":", we obtain that","element":"span"}],[{"style":{"width":"51%"},"width":947,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-10.png","element":"img"}],[{"text":"Hence, the RHS of ","element":"span"},{"href":"#id-151","text":"(E.52) ","element":"a"},{"text":"forms a martingale with respect to the filtration ","element":"span"},{"style":{"height":21.81},"width":272.82,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-11.png","element":"img","alt":" {σ(F(g))}g≥0,","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":21.74},"width":129.1,"height":54.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-12.png","element":"img","alt":" σ(F(g)","inline":true},{"text":") stands for the ","element":"span"},{"style":{"height":8.8},"width":27,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-13.png","element":"img","alt":" σ","inline":true},{"text":"-algebra generated by ","element":"span"},{"style":{"height":17.34},"width":82.56,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-14.png","element":"img","alt":" F(g)","inline":true},{"text":". To show the asymptotic normality, we use a martingale central limit theorem for triangular arrays (Corollary 2.8 of ","element":"span"},{"href":"#id-152","referenceIndex":25,"text":"McLeish, ","element":"a"},{"href":"#id-152","referenceIndex":25,"text":"1974)","element":"a"},{"text":". 
This requires verifying the following two conditions:","element":"span"}],[{"style":{"width":"99%"},"width":1839,"height":709,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/62-15.png","element":"img"}],[{"text":"where the first inequality follows from the Cauchy-Schwarz inequality, the second inequality is due to ","element":"span"},{"href":"#id-144","text":"(E.42)","element":"a"},{"text":", the third inequality is due to Lemma ","element":"span"},{"href":"#id-141","text":"2 ","element":"a"},{"text":"and the fact that ","element":"span"},{"style":{"height":21.34},"width":200.32,"height":53.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-0.png","element":"img","alt":" ∥ξ(g)∥2 ≤","inline":true,"padRight":true},{"text":"sup","element":"span"},{"style":{"height":19.6},"width":215.72,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-1.png","element":"img","alt":"x ∥ΦL(x)∥2","inline":true},{"text":", and the last inequality follows from ","element":"span"},{"href":"#id-148","text":"(E.49)","element":"a"},{"text":". 
Since ","element":"span"},{"style":{"height":22.19},"width":490.47,"height":55.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-2.png","element":"img","alt":" L ≪√nT/ log(nT), (a)","inline":true}],[{"style":{"width":"98%"},"width":1816,"height":579,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-3.png","element":"img"}],[{"text":"In view of ","element":"span"},{"href":"#id-148","text":"(E.49)","element":"a"},{"text":", it suffices to show","element":"span"}],[{"id":"id-153","style":{"width":"74%"},"width":1364,"height":198,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-4.png","element":"img"}],[{"text":"This can be proven using similar arguments in bounding ","element":"span"},{"style":{"height":19.2},"width":198.28,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-5.png","element":"img","alt":" ∥�Σ − Σ∥2","inline":true,"padRight":true},{"text":"in the proof of Lemma","element":"span"}],[{"style":{"width":"74%"},"width":1377,"height":213,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-6.png","element":"img"}],[{"text":"To complete the proof, it remains to show ","element":"span"},{"style":{"height":22.42},"width":388.21,"height":56.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-7.png","element":"img","alt":" �σ(π; G)/σ(π; G) p→","inline":true,"padRight":true},{"text":"1. Using similar arguments in verifying (b), it suffices to show ","element":"span"},{"style":{"height":20.61},"width":789.21,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-8.png","element":"img","alt":" ∥�Σ−1 �Ω(�Σ⊤)−1 − Σ−1Ω(Σ⊤)−1∥2 = op","inline":true},{"text":"(1). 
By ","element":"span"},{"href":"#id-144","text":"(E.42)","element":"a"}],[{"text":"and Lemma ","element":"span"},{"href":"#id-132","text":"4, ","element":"a"},{"text":"we have","element":"span"}],[{"style":{"width":"58%"},"width":1072,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-9.png","element":"img"}],[{"text":"and hence ","element":"span"},{"style":{"height":19.2},"width":204.59,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-10.png","element":"img","alt":" ∥Ω∥2 = O","inline":true},{"text":"(1). This together with Lemma ","element":"span"},{"href":"#id-25","text":"3 ","element":"a"},{"text":"and the condition ","element":"span"},{"style":{"height":22.19},"width":397.96,"height":55.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-11.png","element":"img","alt":" L ≪√nT/ log(nT)","inline":true}],[{"text":"yields that","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"∥","element":"span"},{"style":{"height":21.5},"width":1838.62,"height":53.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-12.png","element":"img","alt":"�Σ−1Ω(�Σ⊤)−1 − Σ−1Ω(Σ⊤)−1∥2 ≤ ∥�Σ−1 − Σ∥2∥Ω∥2∥(�Σ⊤)−1∥2 + ∥Σ−1∥2∥Ω∥2∥�Σ−1 − Σ∥2","inline":true}],[{"style":{"width":"40%"},"width":753,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-13.png","element":"img"}],[{"text":"Thus, it remains to show ","element":"span"},{"style":{"height":20.61},"width":1311.58,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-14.png","element":"img","alt":" ∥�Σ−1 �Ω(�Σ⊤)−1 − �Σ−1Ω(�Σ⊤)−1∥2 = op(1), or ∥�Ω − Ω∥2 = op(1),","inline":true,"padRight":true},{"text":"by Lemma ","element":"span"},{"href":"#id-25","text":"3. 
","element":"a"},{"text":"In view of ","element":"span"},{"href":"#id-153","text":"(E.53)","element":"a"},{"text":", it suffices to show ","element":"span"},{"style":{"height":26.08},"width":805.22,"height":65.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/63-15.png","element":"img","alt":" ∥(nT)−1 �nTg=1(ε(g))2ξ(g)(ξ(g))⊤ − �Ω∥2 =","inline":true}],[{"style":{"height":14.07},"width":40.51,"height":35.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-0.png","element":"img","alt":"op","inline":true},{"text":"(1), or equivalently,","element":"span"}],[{"style":{"width":"63%"},"width":1162,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-1.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"99%"},"width":1840,"height":215,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-2.png","element":"img"}],[{"text":"show max","element":"span"},{"style":{"height":21.81},"width":578.59,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-3.png","element":"img","alt":"1≤g≤nT |(ε(g))2 − (�ε(g))2| = op","inline":true},{"text":"(1). Suppose we have shown that max","element":"span"},{"style":{"height":21.81},"width":270.62,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-4.png","element":"img","alt":"1≤g≤nT |ε(g) −","inline":true},{"style":{"height":21.81},"width":185.92,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-5.png","element":"img","alt":"�ε(g)| = op","inline":true},{"text":"(1). 
By ","element":"span"},{"href":"#id-144","text":"(E.42)","element":"a"},{"text":", ","element":"span"},{"style":{"height":16.94},"width":65.35,"height":42.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-6.png","element":"img","alt":" ε(g)","inline":true},{"text":"s are uniformly bounded with probability 1 and thus we have max","element":"span"},{"style":{"height":21.81},"width":477.87,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-7.png","element":"img","alt":"1≤g≤nT |ε(g) + ε̂(g)| = Op","inline":true},{"text":"(1). It follows that","element":"span"}],[{"style":{"width":"77%"},"width":1432,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-8.png","element":"img"}],[{"text":"Therefore, it remains to show max","element":"span"},{"style":{"height":21.81},"width":465.37,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-9.png","element":"img","alt":"1≤g≤nT |ε(g) − ε̂(g)| = op","inline":true},{"text":"(1), or equivalently,","element":"span"}],[{"id":"id-154","style":{"width":"85%"},"width":1581,"height":308,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-10.png","element":"img"}],[{"text":"The LHS of ","element":"span"},{"href":"#id-154","text":"(E.54) ","element":"a"},{"text":"is upper bounded by","element":"span"}],[{"style":{"width":"41%"},"width":759,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-11.png","element":"img"}],[{"text":"By ","element":"span"},{"href":"#id-143","text":"(E.41)","element":"a"},{"text":", ","element":"span"},{"href":"#id-155","text":"(E.46) ","element":"a"},{"text":"and Lemma 2, we have","element":"span"}],[{"style":{"height":19.6},"width":429.53,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-12.png","element":"img","alt":"|Φ⊤L(x)β̂a − 
Φ⊤L(x)β∗a|","inline":true},{"style":{"height":21.1},"width":327.18,"height":52.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-13.png","element":"img","alt":"≤ CL−p/d + sup","inline":true}],[{"style":{"width":"72%"},"width":1336,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-14.png","element":"img"}],[{"text":"Under the given conditions, we have ","element":"span"},{"style":{"height":22.2},"width":763.98,"height":55.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-15.png","element":"img","alt":" Lp/d ≫√nT and L ≪√nT/ log(nT","inline":true},{"text":"). This implies","element":"span"}],[{"style":{"height":21.81},"width":984.5,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-16.png","element":"img","alt":"Op(L1/2−p/d) = op(1), and Op(Ln−1/2T −1/2) = op","inline":true},{"text":"(1). Therefore, we have","element":"span"}],[{"style":{"width":"41%"},"width":774,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/64-17.png","element":"img"}],[{"text":"The proof is hence completed.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-141","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"text":"For B-spline basis, the assertion in ","element":"span"},{"href":"#id-156","text":"(E.40) ","element":"a"},{"text":"follows from the arguments used in the proof of Theorem 3.3 of ","element":"span"},{"href":"#id-115","referenceIndex":4,"text":"Burman and Chen ","element":"a"},{"href":"#id-115","referenceIndex":4,"text":"(1989)","element":"a"},{"text":". 
For wavelet basis, the assertion in ","element":"span"},{"href":"#id-156","text":"(E.40) ","element":"a"},{"text":"follows from the arguments used in the proof of Theorem 5.1 of ","element":"span"},{"href":"#id-42","referenceIndex":5,"text":"Chen and Christensen ","element":"a"},{"href":"#id-42","referenceIndex":5,"text":"(2015)","element":"a"},{"text":".","element":"span"}],[{"text":"For either B-spline or wavelet sieve and any ","element":"span"},{"style":{"height":16.8},"width":307.58,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-0.png","element":"img","alt":" L ≥ 1, x ∈ X","inline":true},{"text":", the number of nonzero elements in the vector Φ","element":"span"},{"style":{"height":19.6},"width":70.24,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-1.png","element":"img","alt":"L(x","inline":true},{"text":") is bounded by some constant. Moreover, each of the basis functions is uniformly bounded by ","element":"span"},{"style":{"height":22.2},"width":127.14,"height":55.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-2.png","element":"img","alt":" O(√L","inline":true},{"text":"). This proves the second assertion.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-25","style":{"fontWeight":"bold"},"text":"3","element":"a"}],[{"text":"We consider two scenarios: (i) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"grows to infinity; (ii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded. The proof is divided into four parts. 
In the first part, we show ","element":"span"},{"style":{"height":20.54},"width":1001.96,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-3.png","element":"img","alt":" a⊤Σa ≥ ¯c∥a∥22/2 for any a ∈ RmL and ∥Σ−1∥2 ≤","inline":true,"padRight":true},{"text":"2","element":"span"},{"style":{"height":19.2},"width":47.38,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-4.png","element":"img","alt":"/¯c","inline":true},{"text":", as either ","element":"span"},{"style":{"height":16.8},"width":404.06,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-5.png","element":"img","alt":" n → ∞, or T → ∞","inline":true},{"text":". In the second part, we bound ","element":"span"},{"style":{"height":19.2},"width":202.74,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-6.png","element":"img","alt":" ∥�Σ − Σ∥2","inline":true},{"text":". In the third part, we bound ","element":"span"},{"style":{"height":20.14},"width":291.53,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-7.png","element":"img","alt":" ∥�Σ−1 − Σ−1∥2","inline":true},{"text":". 
Finally, we show ","element":"span"},{"style":{"height":19.6},"width":278.55,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-8.png","element":"img","alt":" ∥Σ∥2 = O(1).","inline":true}],[{"style":{"width":"97%"},"width":1798,"height":508,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-9.png","element":"img"}],[{"text":"Therefore,","element":"span"}],[{"style":{"width":"73%"},"width":1351,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-10.png","element":"img"}],[{"text":"Under A3(i), we obtain ","element":"span"},{"style":{"height":20.94},"width":769.96,"height":52.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-11.png","element":"img","alt":" η2(a) − γ2η1(a) ≥ ¯c∥a∥22 for a ∈ RmL","inline":true},{"text":". It follows that ","element":"span"},{"style":{"height":23.2},"width":234.37,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-12.png","element":"img","alt":" γ�η1(a) ≤","inline":true}],[{"id":"id-157","style":{"width":"99%"},"width":1834,"height":242,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-13.png","element":"img"}],[{"text":"We now show","element":"span"}],[{"id":"id-158","style":{"width":"68%"},"width":1268,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/65-14.png","element":"img"}],[{"text":"Otherwise, there exists some ","element":"span"},{"style":{"height":20.54},"width":1213.5,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-0.png","element":"img","alt":" a0 ∈ RmL such that ∥Σa0∥2 < 2−1¯c∥a0∥2. 
By Cauchy-","inline":true,"padRight":true},{"text":"Schwarz inequality, we obtain ","element":"span"},{"style":{"height":20.14},"width":746.99,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-1.png","element":"img","alt":" a⊤0 Σa0 ≤ ∥a0∥2∥Σa0∥2 < 2−1¯c∥a0∥22","inline":true},{"text":". However, this violates ","element":"span"},{"text":"the assertion in ","element":"span"},{"href":"#id-157","text":"(E.55)","element":"a"},{"text":". ","element":"span"},{"href":"#id-158","text":"(E.56) ","element":"a"},{"text":"is thus proven.","element":"span"}],[{"text":"According to the singular value decomposition, we have ","element":"span"},{"style":{"height":18.33},"width":261.86,"height":45.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-2.png","element":"img","alt":" Σ = V ⊤1 ΛV2 ","inline":true,"padRight":true},{"text":"for some orthog- ","element":"span"},{"text":"onal matrices ","element":"span"},{"style":{"height":16.8},"width":130.57,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-3.png","element":"img","alt":" V1, V2","inline":true,"padRight":true},{"text":"and some diagonal matrix ","element":"span"},{"style":{"height":13.6},"width":38,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-4.png","element":"img","alt":" Λ","inline":true},{"text":". By orthogonality, we obtain ","element":"span"},{"style":{"height":19.2},"width":187.78,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-5.png","element":"img","alt":" ∥Σa∥2 =","inline":true},{"style":{"height":19.2},"width":606.27,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-6.png","element":"img","alt":"∥ΛV2a∥2 and ∥a∥2 = ∥V2a∥2","inline":true},{"text":". 
In view of ","element":"span"},{"href":"#id-158","text":"(E.56)","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":20.54},"width":632.11,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-7.png","element":"img","alt":" ∥Λa∥2 ≥ 2−1¯c∥a∥2, ∀a ∈ RmL.","inline":true,"padRight":true},{"text":"This implies that the absolute value of each diagonal element in ","element":"span"},{"style":{"height":13.6},"width":38,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-8.png","element":"img","alt":" Λ","inline":true,"padRight":true},{"text":"is at least ¯","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":"/2. Thus, we obtain ","element":"span"},{"style":{"height":20.14},"width":855.46,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-9.png","element":"img","alt":" ∥Λ−1∥2 ≤ 2¯c−1 and hence ∥Σ−1∥2 ≤ 2¯c−1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Part 2: ","element":"span"},{"text":"We first consider Scenario (ii). Define the random matrix","element":"span"}],[{"style":{"width":"57%"},"width":1069,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-10.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-141","text":"2, ","element":"a"},{"text":"we have max","element":"span"},{"style":{"height":22.26},"width":1244.65,"height":55.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-11.png","element":"img","alt":"1≤i≤n,0≤t≤T−1 ∥ξ(Xi,t, Ai,t)∥2 ≤ supx ∥ΦL(x)∥2 ≤ c∗√L and","inline":true}],[{"text":"max","element":"span"},{"style":{"height":22.26},"width":1018.18,"height":55.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-12.png","element":"img","alt":"1≤i≤n,1≤t≤T ∥U(Xi,t+1)∥2 ≤ supx ∥ΦL(x)∥2 ≤ c∗√L","inline":true},{"text":". 
It follows that","element":"span"}],[{"id":"id-159","style":{"width":"85%"},"width":1579,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-13.png","element":"img"}],[{"text":"Let","element":"span"}],[{"style":{"width":"93%"},"width":1726,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-14.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":19.34},"width":366.94,"height":48.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-15.png","element":"img","alt":" v ∈ RmL, we have","inline":true}],[{"style":{"width":"92%"},"width":1705,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-16.png","element":"img"}],[{"text":"Moreover, using similar arguments in proving ","element":"span"},{"href":"#id-159","text":"(E.57)","element":"a"},{"text":", we can show","element":"span"}],[{"style":{"width":"78%"},"width":1454,"height":327,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/66-17.png","element":"img"}],[{"text":"By Cauchy-Schwarz inequality, we obtain","element":"span"}],[{"style":{"width":"61%"},"width":1142,"height":301,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-0.png","element":"img"}],[{"text":"Similarly, we can show","element":"span"}],[{"style":{"width":"73%"},"width":1357,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-1.png","element":"img"}],[{"text":"and hence","element":"span"}],[{"id":"id-161","style":{"width":"81%"},"width":1494,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-2.png","element":"img"}],[{"text":"Consider ","element":"span"},{"style":{"height":19.67},"width":695.8,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-3.png","element":"img","alt":" λmax{Eξ(X0,0, A0,0)ξ(X0,0, A0,0)⊤}","inline":true,"padRight":true},{"text":"first. 
Notice that ","element":"span"},{"style":{"height":19.67},"width":543.2,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-4.png","element":"img","alt":" Eξ(X0,0, A0,0)ξ(X0,0, A0,0)⊤","inline":true,"padRight":true},{"text":"is a block diagonal matrix. For any ","element":"span"},{"style":{"height":20.94},"width":663.25,"height":52.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-5.png","element":"img","alt":" v ∈ RmL, let v = (a⊤1 , . . . , a⊤m)⊤","inline":true,"padRight":true},{"text":"where all the sub-","element":"span"}],[{"text":"vectors ","element":"span"},{"style":{"height":14.07},"width":39.58,"height":35.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-6.png","element":"img","alt":" aj","inline":true},{"text":"s have the same length. With some calculations, we have","element":"span"}],[{"style":{"width":"84%"},"width":1556,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-7.png","element":"img"}],[{"text":"By Condition A2 and Lemma ","element":"span"},{"href":"#id-141","text":"2, ","element":"a"},{"text":"we obtain","element":"span"}],[{"style":{"width":"86%"},"width":1589,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-8.png","element":"img"}],[{"text":"This yields","element":"span"}],[{"id":"id-160","style":{"width":"73%"},"width":1353,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-9.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > ","element":"span"},{"text":"0, the marginal density function of ","element":"span"},{"style":{"height":18.47},"width":77.2,"height":46.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-10.png","element":"img","alt":" X0,t","inline":true,"padRight":true},{"text":"is given 
by","element":"span"}],[{"style":{"width":"95%"},"width":1756,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-11.png","element":"img"}],[{"text":"Thus, we have ","element":"span"},{"style":{"height":21.32},"width":1011.86,"height":53.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-12.png","element":"img","alt":" µt(x) ≤ supx′,x′′ qX(x′|x′′) for any t ≥ 1 and x ∈ X","inline":true},{"text":". Under Condition A1, we can show the density function ","element":"span"},{"style":{"height":19.6},"width":153.81,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-13.png","element":"img","alt":" qX(x|x′","inline":true},{"text":") is uniformly bounded for any ","element":"span"},{"style":{"height":14},"width":179.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/67-14.png","element":"img","alt":" x and x′","inline":true},{"text":". It follows that ","element":"span"},{"style":{"height":19.6},"width":87.61,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-0.png","element":"img","alt":" µt(x","inline":true},{"text":") is uniformly bounded for any ","element":"span"},{"style":{"height":16.4},"width":242.37,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-1.png","element":"img","alt":" t ≥ 1 and x","inline":true},{"text":". 
Using similar arguments in proving","element":"span"}],[{"href":"#id-160","text":"(E.59)","element":"a"},{"text":", we can show","element":"span"}],[{"style":{"width":"81%"},"width":1499,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-2.png","element":"img"}],[{"text":"This together with ","element":"span"},{"href":"#id-161","text":"(E.58) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-160","text":"(E.59) ","element":"a"},{"text":"yields","element":"span"}],[{"id":"id-163","style":{"width":"57%"},"width":1068,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-3.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0. Combining this together with ","element":"span"},{"href":"#id-159","text":"(E.57)","element":"a"},{"text":", an application of the matrix","element":"span"}],[{"text":"concentration inequality (see Theorem 1.6 in ","element":"span"},{"href":"#id-162","referenceIndex":46,"text":"Tropp, ","element":"a"},{"href":"#id-162","referenceIndex":46,"text":"2012) ","element":"a"},{"text":"yields that","element":"span"}],[{"style":{"width":"86%"},"width":1587,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-4.png","element":"img"}],[{"text":"Set ","element":"span"},{"style":{"height":19.6},"width":579.64,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-5.png","element":"img","alt":" τ = 3√CnL log n. Since T","inline":true,"padRight":true},{"text":"is bounded, under the given conditions, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"will grow to","element":"span"}],[{"text":"infinity. 
For sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", we have 8","element":"span"},{"style":{"height":20.54},"width":539.08,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-6.png","element":"img","alt":"L(c∗)2τ/3 ≪ τ 2 and hence","inline":true}],[{"style":{"width":"55%"},"width":1031,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-7.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":15.2},"width":279.36,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-8.png","element":"img","alt":" L ≪ n and T","inline":true,"padRight":true},{"text":"is bounded, we obtain 2","element":"span"},{"style":{"height":20.54},"width":383.2,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-9.png","element":"img","alt":"mL/n4 ≪ 1/(n2T 2","inline":true},{"text":"). Thus, we can show that","element":"span"}],[{"text":"the following event occurs with probability at least 1 ","element":"span"},{"style":{"height":20.54},"width":286.78,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-10.png","element":"img","alt":" − O(n−2T −2),","inline":true}],[{"id":"id-165","style":{"width":"83%"},"width":1535,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-11.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded.","element":"span"}],[{"text":"Now let’s consider Scenario (i). 
Let","element":"span"}],[{"style":{"width":"61%"},"width":1127,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-12.png","element":"img"}],[{"text":"We aim to apply the matrix concentration inequality to the sum of independent random","element":"span"}],[{"text":"matrices (regardless of whether ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"is bounded or not),","element":"span"}],[{"style":{"width":"29%"},"width":544,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-13.png","element":"img"}],[{"text":"We begin by providing an upper error bound for max","element":"span"},{"style":{"height":23.68},"width":714.06,"height":59.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-14.png","element":"img","alt":"1≤i≤n ∥T −1 ∑T−1t=0 (Ri,t − Σ)∥2. Let","inline":true},{"style":{"height":19.67},"width":1183.86,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-15.png","element":"img","alt":"Ft−1 = {(X0,j, A0,j)}0≤j≤t, for all t ≥ 0, and σ(Ft) be the σ","inline":true},{"text":"-algebra generated by ","element":"span"},{"style":{"height":16.47},"width":210.86,"height":41.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-16.png","element":"img","alt":" Ft. Define","inline":true}],[{"style":{"width":"56%"},"width":1035,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/68-17.png","element":"img"}],[{"style":{"width":"99%"},"width":1841,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-0.png","element":"img"}],[{"text":"filtration ","element":"span"},{"style":{"height":19.6},"width":342.1,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-1.png","element":"img","alt":" {σ(Ft) : t ≥ −1}","inline":true},{"text":". 
Similar to ","element":"span"},{"href":"#id-159","text":"(E.57) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-163","text":"(E.62)","element":"a"},{"text":", we can show","element":"span"}],[{"style":{"width":"57%"},"width":1060,"height":410,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-2.png","element":"img"}],[{"text":"By the matrix martingale concentration inequality (Corollary 1.3, ","element":"span"},{"href":"#id-164","referenceIndex":45,"text":"Tropp, ","element":"a"},{"href":"#id-164","referenceIndex":45,"text":"2011)","element":"a"},{"text":", we obtain the following occurs with probability at least 1 ","element":"span"},{"style":{"height":20.54},"width":286.78,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-3.png","element":"img","alt":" − O(n−3T −2),","inline":true}],[{"id":"id-166","style":{"width":"72%"},"width":1330,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-4.png","element":"img"}],[{"text":"Define","element":"span"}],[{"style":{"width":"99%"},"width":1837,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-5.png","element":"img"}],[{"text":"Using similar arguments in proving ","element":"span"},{"href":"#id-165","text":"(E.63)","element":"a"},{"text":", we can show that","element":"span"}],[{"style":{"width":"77%"},"width":1426,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-6.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0, where the big-","element":"span"},{"style":{"fontStyle":"italic"},"text":"O ","element":"span"},{"text":"term is independent of ","element":"span"},{"style":{"height":21.53},"width":422.96,"height":53.82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-7.png","element":"img","alt":" {X∗0,t}t≥0. 
Thus, we","inline":true}],[{"text":"obtain","element":"span"}],[{"style":{"width":"65%"},"width":1212,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-8.png","element":"img"}],[{"text":"This together with ","element":"span"},{"href":"#id-166","text":"(E.64) ","element":"a"},{"text":"implies that the following event occurs with probability at least 1 ","element":"span"},{"style":{"height":20.54},"width":286.78,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-9.png","element":"img","alt":" − O(n−3T −2),","inline":true}],[{"id":"id-171","style":{"width":"99%"},"width":1839,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-10.png","element":"img"}],[{"text":"define the ","element":"span"},{"style":{"height":17.2},"width":28,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-11.png","element":"img","alt":" β","inline":true},{"text":"-mixing coefficient of the stationary Markov chain ","element":"span"},{"style":{"height":19.67},"width":242.8,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-12.png","element":"img","alt":" {X0,t}t≥0 as","inline":true}],[{"style":{"width":"65%"},"width":1212,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/69-13.png","element":"img"}],[{"text":"Under the geometric ergodicity assumption in A3(ii) and , it follows from Lemma 1 of ","element":"span"},{"href":"#id-167","referenceIndex":26,"text":"Meitz ","element":"a"},{"href":"#id-167","referenceIndex":26,"text":"and Saikkonen ","element":"a"},{"href":"#id-167","referenceIndex":26,"text":"(2019) ","element":"a"},{"text":"that ","element":"span"},{"style":{"height":19.67},"width":182.83,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-0.png","element":"img","alt":" {X0,t}t≥0","inline":true,"padRight":true},{"text":"is exponentially 
","element":"span"},{"style":{"height":17.2},"width":28,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-1.png","element":"img","alt":" β","inline":true},{"text":"-mixing. That is, ","element":"span"},{"style":{"height":20.14},"width":334.65,"height":50.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-2.png","element":"img","alt":" β(t) = O(ρt) for","inline":true}],[{"text":"some ","element":"span"},{"style":{"height":18},"width":372.84,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-3.png","element":"img","alt":" ρ < 1 and any t ≥","inline":true,"padRight":true},{"text":"0. Using similar arguments in proving ","element":"span"},{"href":"#id-159","text":"(E.57)","element":"a"},{"text":", we can show","element":"span"}],[{"id":"id-168","style":{"width":"69%"},"width":1272,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-4.png","element":"img"}],[{"text":"Moreover, for any 0 ","element":"span"},{"style":{"height":19.74},"width":821.43,"height":49.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-5.png","element":"img","alt":" ≤ t1 ≤ t2 ≤ T − 1 and any v1, v2 ∈ RmL","inline":true},{"text":", we have by Cauchy-Schwarz","element":"span"}],[{"text":"inequality that","element":"span"}],[{"style":{"width":"94%"},"width":1735,"height":192,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-6.png","element":"img"}],[{"text":"Using similar arguments in proving ","element":"span"},{"href":"#id-163","text":"(E.62)","element":"a"},{"text":", we can show","element":"span"}],[{"style":{"width":"48%"},"width":894,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-7.png","element":"img"}],[{"text":"This 
implies","element":"span"}],[{"style":{"width":"62%"},"width":1158,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-8.png","element":"img"}],[{"text":"and hence,","element":"span"}],[{"style":{"width":"55%"},"width":1026,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-9.png","element":"img"}],[{"text":"or equivalently,","element":"span"}],[{"style":{"width":"49%"},"width":908,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-10.png","element":"img"}],[{"text":"Similarly, we can show","element":"span"}],[{"id":"id-173","style":{"width":"76%"},"width":1413,"height":79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-11.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":18.47},"width":228,"height":46.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-12.png","element":"img","alt":" Σt = ER0,t","inline":true},{"text":". Notice that ","element":"span"},{"style":{"height":23.68},"width":368.03,"height":59.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-13.png","element":"img","alt":" Σ = T −1 �T−1t=0 Σt","inline":true,"padRight":true},{"text":"and we have ","element":"span"},{"style":{"height":23.68},"width":635.59,"height":59.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-14.png","element":"img","alt":" �T−1t=0 (R0,t−Σ) = �T−1t=0 (R0,t−","inline":true},{"style":{"height":16.07},"width":50.86,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-15.png","element":"img","alt":"Σt","inline":true},{"text":"). 
Similar to Theorem 4.2 of ","element":"span"},{"href":"#id-42","referenceIndex":5,"text":"Chen and Christensen ","element":"a"},{"href":"#id-42","referenceIndex":5,"text":"(2015)","element":"a"},{"text":", we can show there exists some","element":"span"}],[{"text":"constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0 such that for any ","element":"span"},{"style":{"height":15.2},"width":75.96,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-16.png","element":"img","alt":" τ ≥","inline":true,"padRight":true},{"text":"0 and integer 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< q < T","element":"span"},{"text":",","element":"span"}],[{"id":"id-169","style":{"width":"91%"},"width":1693,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/70-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20.54},"width":1705.53,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-0.png","element":"img","alt":" Ir = {q⌊(T + 1)/q⌋, q⌊(T + 1)/q⌋ + 1, · · · , T − 1}. Suppose τ ≥ 5qL(c∗)2. Notice","inline":true}],[{"text":"that ","element":"span"},{"style":{"height":19.2},"width":156.59,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-1.png","element":"img","alt":" |Ir| ≤ q","inline":true},{"text":". 
It follows from ","element":"span"},{"href":"#id-168","text":"(E.67) ","element":"a"},{"text":"that","element":"span"}],[{"id":"id-170","style":{"width":"70%"},"width":1303,"height":152,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-2.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":19.6},"width":858.88,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-3.png","element":"img","alt":" β(q) = O(ρq), set q = −3 log(nT)/ log ρ","inline":true},{"text":", we obtain ","element":"span"},{"style":{"height":20.54},"width":593.28,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-4.png","element":"img","alt":" Tβ(q)/q = O(n−3T −2). Set","inline":true}],[{"style":{"height":23.2},"width":970.26,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-5.png","element":"img","alt":"τ = max{4�CTqL log(Tn), 11qL(c∗)2 log(nT)}","inline":true},{"text":", we obtain that","element":"span"}],[{"style":{"width":"83%"},"width":1532,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-6.png","element":"img"}],[{"text":"as either ","element":"span"},{"style":{"height":13.6},"width":375.97,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-7.png","element":"img","alt":" n → ∞ or T → ∞","inline":true},{"text":". 
It follows from ","element":"span"},{"href":"#id-169","text":"(E.69)","element":"a"},{"text":", ","element":"span"},{"href":"#id-170","text":"(E.70) ","element":"a"},{"text":"and the condition ","element":"span"},{"style":{"height":15.2},"width":267.73,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-8.png","element":"img","alt":" L ≪ nT that","inline":true}],[{"style":{"width":"76%"},"width":1418,"height":240,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-9.png","element":"img"}],[{"text":"Combining this together with ","element":"span"},{"href":"#id-171","text":"(E.66) ","element":"a"},{"text":"yields that the following event occurs with probability at least 1 ","element":"span"},{"style":{"height":20.54},"width":286.78,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-10.png","element":"img","alt":" − O(n−3T −2),","inline":true}],[{"id":"id-178","style":{"width":"80%"},"width":1491,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-11.png","element":"img"}],[{"text":"By Bonferroni’s inequality, we obtain with probability at least 1 ","element":"span"},{"style":{"height":20.54},"width":375.04,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-12.png","element":"img","alt":" − O(n−2T −2) that","inline":true}],[{"id":"id-172","style":{"width":"86%"},"width":1597,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-13.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"text":"¯","element":"span"},{"style":{"height":17.6},"width":667.06,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-14.png","element":"img","alt":"C > 0. For i = 0, 1, . . . 
, n, let Ai","inline":true,"padRight":true},{"text":"denote the event","element":"span"}],[{"style":{"width":"71%"},"width":1314,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-15.png","element":"img"}],[{"text":"It follows from ","element":"span"},{"href":"#id-172","text":"(E.72) ","element":"a"},{"text":"that the following event occurs with probability at least 1","element":"span"},{"style":{"height":20.54},"width":276.16,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-16.png","element":"img","alt":"−O(n−2T −2),","inline":true}],[{"style":{"width":"99%"},"width":1838,"height":442,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/71-17.png","element":"img"}],[{"text":"Note that","element":"span"}],[{"id":"id-176","style":{"width":"79%"},"width":1464,"height":862,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-0.png","element":"img"}],[{"text":"By ","element":"span"},{"href":"#id-173","text":"(E.68)","element":"a"},{"text":", we obtain that","element":"span"}],[{"id":"id-177","style":{"width":"65%"},"width":1203,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-1.png","element":"img"}],[{"text":"For any 0 ","element":"span"},{"style":{"height":16.47},"width":661.21,"height":41.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-2.png","element":"img","alt":" ≤ t1 < t2 ≤ T − 1 with t2 − t1 ≥","inline":true,"padRight":true},{"text":"4, it follows from MA that (","element":"span"},{"style":{"height":19.67},"width":387.58,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-3.png","element":"img","alt":"X0,t2, A0,t2, X0,t2+1)","inline":true}],[{"text":"is independent of (","element":"span"},{"style":{"height":19.67},"width":663.05,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-4.png","element":"img","alt":"X0,t1, A0,t1, 
X0,t1+1) given X0,t2−1","inline":true},{"text":". Thus, we have","element":"span"}],[{"style":{"width":"80%"},"width":1478,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-5.png","element":"img"}],[{"text":"Similarly, conditional on ","element":"span"},{"style":{"height":19.67},"width":821.62,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-6.png","element":"img","alt":" X0,t1+2, (X0,t1, A0,t1, X0,t1+1) and X0,t2−1","inline":true,"padRight":true},{"text":"are independent. It ","element":"span"}],[{"text":"follows that","element":"span"}],[{"style":{"width":"64%"},"width":1188,"height":149,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-7.png","element":"img"}],[{"text":"and hence","element":"span"}],[{"id":"id-174","style":{"width":"99%"},"width":1842,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-8.png","element":"img"}],[{"text":"Define ","element":"span"},{"style":{"height":21.01},"width":1688.66,"height":52.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/72-9.png","element":"img","alt":" Θ1,t1(x) = E(R0,t1|X0,t1+2 = x) and Θ2,t2(x) = E(R0,t2|X0,t2−1 = x). Let EX0,t","inline":true}],[{"text":"denote the conditional expectation given ","element":"span"},{"style":{"height":18.47},"width":77.2,"height":46.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-0.png","element":"img","alt":" X0,t","inline":true},{"text":". 
With some calculations, we can show","element":"span"}],[{"style":{"width":"98%"},"width":1807,"height":365,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-1.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"72%"},"width":1335,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-2.png","element":"img"}],[{"style":{"height":19.67},"width":310.16,"height":49.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-3.png","element":"img","alt":"Θ2,t2(x) = E","inline":true}],[{"style":{"width":"79%"},"width":1462,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-4.png","element":"img"}],[{"text":"It follows from ","element":"span"},{"href":"#id-174","text":"(E.76) ","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"84%"},"width":1558,"height":317,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.67},"width":459.17,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-6.png","element":"img","alt":" Θl,t,j,·(x) and Θl,t,·,j(x","inline":true},{"text":") denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th row and ","element":"span"},{"style":{"height":19.67},"width":382.76,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-7.png","element":"img","alt":" j-column of Θl,t(x","inline":true},{"text":"), respectively.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":19.67},"width":301.92,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-8.png","element":"img","alt":" ξj(·, ·) be the j","inline":true},{"text":"-th element of 
","element":"span"},{"style":{"height":19.6},"width":91.22,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-9.png","element":"img","alt":" ξ(·, ·","inline":true},{"text":"). By Lemma ","element":"span"},{"href":"#id-141","text":"2 ","element":"a"},{"text":"and the definitions of ","element":"span"},{"style":{"height":17.6},"width":189.14,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-10.png","element":"img","alt":" ξ and U,","inline":true}],[{"style":{"width":"71%"},"width":1319,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-11.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"style":{"width":"75%"},"width":1396,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-12.png","element":"img"}],[{"text":"Similarly to Lemma ","element":"span"},{"href":"#id-141","text":"2, ","element":"a"},{"text":"we can show max","element":"span"},{"style":{"height":25.58},"width":651.94,"height":63.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-13.png","element":"img","alt":"j∈{1,...,L}�y∈X |ϕL,j(y)|dy ⪯ L−1/2","inline":true},{"text":", and hence","element":"span"}],[{"id":"id-175","style":{"width":"73%"},"width":1358,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/73-14.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"style":{"width":"99%"},"width":1830,"height":427,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-0.png","element":"img"}],[{"text":"Similar to ","element":"span"},{"href":"#id-175","text":"(E.77)","element":"a"},{"text":", we can show max","element":"span"},{"style":{"height":22.41},"width":881.07,"height":56.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-1.png","element":"img","alt":"t maxj∈{1,...,mL} supv∈SmL−1,x∈X |Θ2,j,·(x)v| ⪯","inline":true,"padRight":true},{"text":"1. 
It follows","element":"span"}],[{"text":"from the definition of ","element":"span"},{"style":{"height":17.2},"width":28,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-2.png","element":"img","alt":" β","inline":true},{"text":"-mixing coefficients and the geometric ergodicity that","element":"span"}],[{"style":{"width":"83%"},"width":1531,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-3.png","element":"img"}],[{"text":"where the above bound is uniform for any pair (","element":"span"},{"style":{"height":15.6},"width":90.72,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-4.png","element":"img","alt":"t1, t2","inline":true},{"text":") that satisfies 0 ","element":"span"},{"style":{"height":16.07},"width":324.17,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-5.png","element":"img","alt":" ≤ t1 ≤ t2 − 4.","inline":true}],[{"style":{"width":"93%"},"width":1732,"height":541,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-6.png","element":"img"}],[{"text":"and hence ","element":"span"},{"style":{"height":17.6},"width":171.64,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-7.png","element":"img","alt":" η3 ⪯ LT","inline":true},{"text":". 
This together with ","element":"span"},{"href":"#id-176","text":"(E.74) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-177","text":"(E.75) ","element":"a"},{"text":"yields that","element":"span"}],[{"id":"id-179","style":{"width":"84%"},"width":1551,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-8.png","element":"img"}],[{"text":"By Cauchy-Schwarz inequality, we obtain","element":"span"}],[{"style":{"width":"82%"},"width":1522,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-9.png","element":"img"}],[{"text":"Combining this with ","element":"span"},{"href":"#id-172","text":"(E.72)","element":"a"},{"text":", an application of the matrix Bernstein inequality (Theorem","element":"span"}],[{"style":{"width":"93%"},"width":1728,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/74-10.png","element":"img"}],[{"text":"under the assumption that ","element":"span"},{"style":{"height":21.84},"width":434.97,"height":54.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-0.png","element":"img","alt":" L = o{nT/ log2(nT)}","inline":true},{"text":". 
This together with ","element":"span"},{"href":"#id-172","text":"(E.72) ","element":"a"},{"text":"yields that ","element":"span"},{"id":"id-180","style":{"height":58.71},"width":95.57,"height":146.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-1.png","element":"img","alt":"�� n�","inline":true}],[{"text":"By Cauchy-Schwarz inequality, we have for any ","element":"span"},{"style":{"height":20.54},"width":858.97,"height":51.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-2.png","element":"img","alt":" v1, v2 ∈ RmL with ∥v1∥2 = ∥v2∥2 = 1 that","inline":true}],[{"style":{"width":"84%"},"width":1553,"height":363,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-3.png","element":"img"}],[{"text":"by ","element":"span"},{"href":"#id-178","text":"(E.71)","element":"a"},{"text":", ","element":"span"},{"href":"#id-179","text":"(E.78) ","element":"a"},{"text":"and the condition that ","element":"span"},{"style":{"height":21.84},"width":371.56,"height":54.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-4.png","element":"img","alt":" L ≪ Tn/ log2(Tn","inline":true},{"text":"). 
This together with ","element":"span"},{"href":"#id-180","text":"(E.79)","element":"a"}],[{"style":{"width":"99%"},"width":1843,"height":657,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-5.png","element":"img"}],[{"text":"event occurs with probability approaching 1,","element":"span"}],[{"style":{"width":"35%"},"width":647,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-6.png","element":"img"}],[{"text":"Using similar arguments in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Part 1","element":"span"},{"text":", this implies ","element":"span"},{"style":{"height":13.2},"width":39,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-7.png","element":"img","alt":"�Σ","inline":true,"padRight":true},{"text":"is invertible and satisfies ","element":"span"},{"style":{"height":20.14},"width":316.48,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-8.png","element":"img","alt":" ∥�Σ−1∥2 ≤ 3¯c−1,","inline":true}],[{"text":"with probability tending to 1. Therefore","element":"span"}],[{"style":{"width":"91%"},"width":1692,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-9.png","element":"img"}],[{"text":"with probability tending to 1. Since ","element":"span"},{"style":{"height":21.81},"width":818.59,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-10.png","element":"img","alt":" ∥�Σ − Σ∥2 = Op{L1/2(nT)−1/2 log(nT)}","inline":true},{"text":", we obtain ","element":"span"},{"style":{"height":21.81},"width":888.08,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-11.png","element":"img","alt":"∥�Σ−1 − Σ−1∥2 = Op{L1/2(nT)−1/2 log(nT)}","inline":true},{"text":". 
The proof is hence completed.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Part 4: ","element":"span"},{"text":"It suffices to show","element":"span"}],[{"style":{"width":"43%"},"width":798,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/75-12.png","element":"img"}],[{"text":"With some calculations, we have","element":"span"}],[{"style":{"width":"84%"},"width":1553,"height":246,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-0.png","element":"img"}],[{"text":"by Cauchy-Schwarz inequality, where","element":"span"}],[{"style":{"width":"52%"},"width":972,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-1.png","element":"img"}],[{"text":"By definition, we have","element":"span"}],[{"style":{"width":"60%"},"width":1108,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-2.png","element":"img"}],[{"text":"Since the matrix ","element":"span"},{"style":{"height":20.78},"width":683.94,"height":51.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-3.png","element":"img","alt":"�a∈A ξ(X0,t, a)ξ⊤(X0,t, a)b(a|X0,t","inline":true},{"text":") is block diagonal with the main-diagonal ","element":"span"},{"text":"blocks ","element":"span"},{"style":{"height":19.67},"width":732.5,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-4.png","element":"img","alt":" {ΦL(X0,t)ΦL(X0,t)⊤b(j|X0,t)}j=1,...,m","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-141","text":"2 ","element":"a"},{"text":"and Condition A2, we can show","element":"span"}],[{"style":{"height":24.07},"width":591.98,"height":60.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-5.png","element":"img","alt":"η(1)4 ⪯ 1. 
As for η(2)4 , we have","inline":true}],[{"style":{"width":"87%"},"width":1616,"height":337,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-6.png","element":"img"}],[{"text":"where the first inequality follows from Jensen’s inequality. By Lemma ","element":"span"},{"href":"#id-141","text":"2, ","element":"a"},{"text":"we can similarly show that ","element":"span"},{"style":{"height":24.07},"width":120.38,"height":60.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-7.png","element":"img","alt":" η(2)4 ⪯","inline":true,"padRight":true},{"text":"1. Thus, we obtain ","element":"span"},{"style":{"height":19.2},"width":260.9,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-8.png","element":"img","alt":" ∥Σ∥2 = η4 ⪯","inline":true,"padRight":true},{"text":"1. The proof is hence completed.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.8 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-132","style":{"fontWeight":"bold"},"text":"4","element":"a"}],[{"text":"The proof of Lemma ","element":"span"},{"href":"#id-132","text":"4 ","element":"a"},{"text":"is very similar to that of Lemma ","element":"span"},{"href":"#id-25","text":"3. ","element":"a"},{"text":"By Condition A3(i), we obtain","element":"span"}],[{"style":{"width":"86%"},"width":1588,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/76-9.png","element":"img"}],[{"text":"as ","element":"span"},{"style":{"height":13.6},"width":159.91,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-0.png","element":"img","alt":" T → ∞","inline":true},{"text":". 
Using similar arguments as in the first part of the proof of Lemma ","element":"span"},{"href":"#id-25","text":"3, ","element":"a"},{"text":"we can","element":"span"}],[{"text":"show that","element":"span"}],[{"id":"id-181","style":{"width":"67%"},"width":1240,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-1.png","element":"img"}],[{"text":"as either ","element":"span"},{"style":{"height":16.8},"width":417.24,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-2.png","element":"img","alt":" n → ∞, or T → ∞","inline":true},{"text":". Using similar arguments in the third part of the proof of","element":"span"}],[{"id":"id-182","style":{"width":"99%"},"width":1838,"height":218,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-3.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":22.2},"width":470.64,"height":55.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-4.png","element":"img","alt":" L = o{√nT/ log(nT)}","inline":true},{"text":", it follows from ","element":"span"},{"href":"#id-181","text":"(E.80) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-182","text":"(E.81) ","element":"a"},{"text":"that the following event","element":"span"}],[{"text":"occurs with probability tending to 1,","element":"span"}],[{"style":{"width":"60%"},"width":1108,"height":309,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-5.png","element":"img"}],[{"text":"It remains to show ","element":"span"},{"style":{"height":25.23},"width":922.81,"height":63.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-6.png","element":"img","alt":" λmax{(nT)−1 �ni=1�T−1t=0 ξi,tξ⊤i,t} = Op(1) and","inline":true}],[{"id":"id-183","style":{"width":"69%"},"width":1275,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-7.png","element":"img"}],[{"text":"Suppose 
","element":"span"},{"href":"#id-183","text":"(E.82) ","element":"a"},{"text":"holds. By ","element":"span"},{"href":"#id-182","text":"(E.81) ","element":"a"},{"text":"and the condition that ","element":"span"},{"style":{"height":22.2},"width":657.92,"height":55.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-8.png","element":"img","alt":" L = o{√nT/ log(nT)}, we have","inline":true},{"style":{"height":25.23},"width":783.8,"height":63.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-9.png","element":"img","alt":"λmax{(nT)−1 �ni=1�T−1t=0 ξi,tξ⊤i,t} = Op","inline":true},{"text":"(1). Thus, it suffices to show ","element":"span"},{"href":"#id-183","text":"(E.82)","element":"a"},{"text":". This can be ","element":"span"},{"text":"proven using similar arguments in Part 2 of the proof of Lemma ","element":"span"},{"href":"#id-25","text":"3. ","element":"a"},{"text":"We omit the details for brevity. The proof is hence completed.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.9 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-142","style":{"fontWeight":"bold"},"text":"5","element":"a"}],[{"style":{"width":"96%"},"width":1772,"height":518,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/77-10.png","element":"img"}],[{"text":"The proof is hence completed.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.10 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-57","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"text":"Without loss of generality, we assume ","element":"span"},{"style":{"height":19.2},"width":1020.57,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-0.png","element":"img","alt":" n = Knnmin and T = KTTmin such that |Ik| 
=","inline":true},{"style":{"height":17.6},"width":380.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-1.png","element":"img","alt":"nminTmin for any k","inline":true},{"text":". Under the given conditions, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"is bounded. Similar to Lemma ","element":"span"},{"href":"#id-25","text":"3, ","element":"a"},{"text":"we","element":"span"}],[{"text":"can show under A4* that","element":"span"}],[{"style":{"width":"65%"},"width":1203,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-2.png","element":"img"}],[{"text":"We next bound the difference between ","element":"span"},{"style":{"height":23.38},"width":429.24,"height":58.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-3.png","element":"img","alt":" Σ�π¯Ik−1 and �ΣIk,�π¯Ik−1","inline":true},{"text":". Consider the scenario where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded first. Since ","element":"span"},{"style":{"height":16.07},"width":182.88,"height":40.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-4.png","element":"img","alt":" Tmin = T","inline":true},{"text":", the data are divided according to the trajectories they belong to. Thus, for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, K","element":"span"},{"text":", variables ","element":"span"},{"style":{"height":20.58},"width":674.97,"height":51.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-5.png","element":"img","alt":" {(Xi,t, Ai,t, Yi,t, Xi,t+1)}(i,t)∈Ik are","inline":true}],[{"text":"independent of ","element":"span"},{"style":{"height":21.55},"width":731.67,"height":53.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-6.png","element":"img","alt":" {(Xi,t, Ai,t, Yi,t, Xi,t+1)}(i,t)∈¯Ik−1. Let","inline":true}],[{"style":{"width":"59%"},"width":1088,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-7.png","element":"img"}],[{"text":"Using similar arguments in Part 2 of the proof of Lemma ","element":"span"},{"href":"#id-25","text":"3, ","element":"a"},{"text":"we can show","element":"span"}],[{"style":{"width":"79%"},"width":1474,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-8.png","element":"img"}],[{"text":"with probability at least 1 ","element":"span"},{"style":{"height":20.54},"width":513.99,"height":51.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-9.png","element":"img","alt":" − O(n−2T −2) = 1 − o(1).","inline":true}],[{"text":"Now let us consider the scenario where ","element":"span"},{"style":{"height":19.6},"width":957.08,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-10.png","element":"img","alt":" T → ∞. For k = 1, . . . 
, K, define (i0(k), t0(k))","inline":true}],[{"text":"to be the tuple in ","element":"span"},{"style":{"height":19.6},"width":626.9,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-11.png","element":"img","alt":" Ik such that i ≥ i0(k), t ≥ t0(k","inline":true},{"text":") for any (","element":"span"},{"style":{"height":19.6},"width":174.55,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-12.png","element":"img","alt":"i, t) ∈ Ik","inline":true},{"text":". Then, we have","element":"span"}],[{"style":{"width":"69%"},"width":1284,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-13.png","element":"img"}],[{"text":"Consider any ","element":"span"},{"style":{"height":20.58},"width":1220.8,"height":51.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-14.png","element":"img","alt":" k ∈ {2, . . . , K} with t0(k) = 0, {(Xi,t, Ai,t, Yi,t, Xi,t+1)}(i,t)∈Ik","inline":true,"padRight":true},{"text":"are independent of ","element":"span"},{"style":{"height":21.55},"width":626.2,"height":53.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-15.png","element":"img","alt":" {(Xi,t, Ai,t, Yi,t, Xi,t+1)}(i,t)∈¯Ik−1","inline":true},{"text":". Using similar arguments in Part 2 of the proof of Lemma","element":"span"}],[{"href":"#id-25","text":"3, ","element":"a"},{"text":"we can show wpa1 that,","element":"span"}],[{"id":"id-186","style":{"width":"78%"},"width":1447,"height":155,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/78-16.png","element":"img"}],[{"style":{"width":"0%"},"width":17,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-0.png","element":"img"}],[{"text":"Consider ","element":"span"},{"style":{"height":19.6},"width":574.55,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-1.png","element":"img","alt":" k ∈ {2, . . . 
, K} with t0(k) >","inline":true,"padRight":true},{"text":"0. We decompose ","element":"span"},{"style":{"height":22.98},"width":239.59,"height":57.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-2.png","element":"img","alt":"�ΣIk,�π¯Ik−1 as","inline":true}],[{"style":{"width":"59%"},"width":1088,"height":563,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-3.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"style":{"width":"99%"},"width":1838,"height":449,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-4.png","element":"img"}],[{"text":"dent. Using the matrix concentration inequality, we can show wpa1 that","element":"span"}],[{"id":"id-184","style":{"width":"99%"},"width":1840,"height":413,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-5.png","element":"img"}],[{"text":"are conditionally independent. Moreover, for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"such that (","element":"span"},{"style":{"height":19.6},"width":266.55,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-6.png","element":"img","alt":"i, t0(k)) ∈ Ik","inline":true},{"text":", the density function of ","element":"span"},{"style":{"height":19.38},"width":176.8,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-7.png","element":"img","alt":" Xi,t0(k)+1","inline":true,"padRight":true},{"text":"conditional on ","element":"span"},{"style":{"height":21.55},"width":636.81,"height":53.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-8.png","element":"img","alt":" {(Xj,t, Aj,t, Yj,t, Xj,t+1)}(j,t)∈¯Ik−1","inline":true,"padRight":true},{"text":"is uniformly bounded under A3. 
Using similar arguments in bounding ","element":"span"},{"style":{"height":19.2},"width":193.06,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-9.png","element":"img","alt":" ∥�Σ−Σ∥2","inline":true,"padRight":true},{"text":"in Part 2 of the proof of Lemma","element":"span"}],[{"href":"#id-25","text":"3, ","element":"a"},{"text":"we can show wpa1 that","element":"span"}],[{"id":"id-185","style":{"width":"78%"},"width":1447,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-10.png","element":"img"}],[{"text":"Combining ","element":"span"},{"href":"#id-184","text":"(E.86) ","element":"a"},{"text":"with ","element":"span"},{"href":"#id-185","text":"(E.87) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-186","text":"(E.85)","element":"a"},{"text":", we obtain wpa1 that","element":"span"}],[{"style":{"width":"78%"},"width":1447,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/79-11.png","element":"img"}],[{"text":"To summarize, we have shown that","element":"span"}],[{"style":{"width":"89%"},"width":1657,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-0.png","element":"img"}],[{"text":"wpa1, regardless of whether ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is bounded or not. Under the given conditions, we have ","element":"span"},{"style":{"height":23.2},"width":454.16,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-1.png","element":"img","alt":"�L/(nT) log(nT) = o","inline":true},{"text":"(1). 
Using similar arguments in the proof of Lemma ","element":"span"},{"href":"#id-25","text":"3, ","element":"a"},{"text":"we can show","element":"span"}],[{"text":"wpa1 that","element":"span"}],[{"id":"id-190","style":{"width":"93%"},"width":1719,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-2.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"height":15.55},"width":215.36,"height":38.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-3.png","element":"img","alt":" �π = �π¯I(K)","inline":true},{"text":". By Lemma ","element":"span"},{"href":"#id-36","text":"1, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":21.55},"width":744.48,"height":53.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-4.png","element":"img","alt":" Q(�π¯I(k); ·, a) ∈ Λ(p, c′) for any k ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , K","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". 
Using arguments similar to those in the proof of Theorem 12.8 of ","element":"span"},{"href":"#id-187","referenceIndex":35,"text":"Schumaker ","element":"a"},{"href":"#id-187","referenceIndex":35,"text":"(1981) ","element":"a"},{"text":"and the proof of Proposition 5 of ","element":"span"},{"href":"#id-188","referenceIndex":27,"text":"Meyer ","element":"a"},{"href":"#id-188","referenceIndex":27,"text":"(1992)","element":"a"},{"text":", there exist some vectors ","element":"span"},{"style":{"height":25.69},"width":377.98,"height":64.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-5.png","element":"img","alt":" {β∗�π¯I(k),a}a∈A,1≤k≤K","inline":true}],[{"text":"that satisfy","element":"span"}],[{"id":"id-189","style":{"width":"81%"},"width":1509,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-6.png","element":"img"}],[{"text":"for some constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C > ","element":"span"},{"text":"0. Similar to ","element":"span"},{"href":"#id-145","text":"(E.43)","element":"a"},{"text":", we have by ","element":"span"},{"href":"#id-189","text":"(E.90) ","element":"a"},{"text":"that","element":"span"}],[{"id":"id-191","style":{"width":"73%"},"width":1353,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-7.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"75%"},"width":1399,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-8.png","element":"img"}],[{"text":"Similar to the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"we have by ","element":"span"},{"href":"#id-190","text":"(E.89) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-191","text":"(E.91) 
","element":"a"},{"text":"that","element":"span"}],[{"style":{"width":"93%"},"width":1715,"height":243,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-9.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"77%"},"width":1437,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-10.png","element":"img"}],[{"text":"for any (","element":"span"},{"style":{"height":19.6},"width":190.04,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/80-11.png","element":"img","alt":"i, t) ∈ Ik.","inline":true}],[{"text":"To prove the asymptotic normality of ","element":"span"},{"style":{"height":23.2},"width":943.1,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/81-0.png","element":"img","alt":"�nT(K − 1)/K�σ−1(G){�V (G) − V (�π; G)}, it","inline":true}],[{"id":"id-193","style":{"width":"99%"},"width":1838,"height":657,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/81-1.png","element":"img"}],[{"text":"Notice that ","element":"span"},{"style":{"height":14.49},"width":161.49,"height":36.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/81-2.png","element":"img","alt":" �π = �π¯IK","inline":true},{"text":". Under A5, we have","element":"span"}],[{"style":{"width":"72%"},"width":1335,"height":469,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/81-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) denotes some positive constant. 
Since ","element":"span"},{"style":{"height":25.8},"width":787.63,"height":64.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/81-4.png","element":"img","alt":"∑_{k=1}^{K−1} k^{−b_0} ≤ 1 + ∫_1^K x^{−b_0} dx ⪯ K^{1−b_0},","inline":true}],[{"text":"we obtain that","element":"span"}],[{"style":{"width":"91%"},"width":1688,"height":931,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/81-5.png","element":"img"}],[{"id":"id-194","style":{"width":"99%"},"width":1838,"height":226,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-0.png","element":"img"}],[{"text":"Using similar arguments in the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"we can show","element":"span"}],[{"style":{"width":"82%"},"width":1524,"height":387,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-1.png","element":"img"}],[{"text":"we obtain","element":"span"}],[{"id":"id-192","style":{"width":"71%"},"width":1327,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-2.png","element":"img"}],[{"text":"Similar to ","element":"span"},{"href":"#id-149","text":"(E.50)","element":"a"},{"text":", we can show there exists some constant ","element":"span"},{"style":{"height":12.87},"width":91.09,"height":32.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-3.png","element":"img","alt":" c6 >","inline":true,"padRight":true},{"text":"0, such that the following","element":"span"}],[{"text":"occurs wpa1,","element":"span"}],[{"style":{"width":"64%"},"width":1186,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-4.png","element":"img"}],[{"text":"This together with ","element":"span"},{"href":"#id-192","text":"(E.95) ","element":"a"},{"text":"yields 
that","element":"span"}],[{"style":{"width":"80%"},"width":1492,"height":410,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-5.png","element":"img"}],[{"text":"Combining this together with ","element":"span"},{"href":"#id-193","text":"(E.93)","element":"a"},{"text":", we obtain that","element":"span"}],[{"style":{"width":"96%"},"width":1781,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-6.png","element":"img"}],[{"text":"To prove ","element":"span"},{"href":"#id-193","text":"(E.92)","element":"a"},{"text":", it suffices to show","element":"span"}],[{"id":"id-195","style":{"width":"79%"},"width":1467,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/82-7.png","element":"img"}],[{"style":{"width":"86%"},"width":1600,"height":535,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-0.png","element":"img"}],[{"text":"In the following, we show ","element":"span"},{"style":{"height":24.84},"width":213.53,"height":62.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-1.png","element":"img","alt":" η7 d→ N(0,","inline":true,"padRight":true},{"text":"1). Based on ","element":"span"},{"href":"#id-194","text":"(E.94)","element":"a"},{"text":", one can show ","element":"span"},{"style":{"height":24.04},"width":103.32,"height":60.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-2.png","element":"img","alt":" η8 P→","inline":true,"padRight":true},{"text":"0. 
Assertion ","element":"span"},{"href":"#id-195","text":"(E.96) ","element":"a"},{"text":"thus follows from Slutsky’s theorem.","element":"span"}],[{"text":"Notice that ","element":"span"},{"style":{"height":18},"width":184.62,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-3.png","element":"img","alt":" η7 equals","inline":true}],[{"style":{"width":"71%"},"width":1318,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-4.png","element":"img"}],[{"text":"For any 1 ","element":"span"},{"style":{"height":17.2},"width":200.35,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-5.png","element":"img","alt":" ≤ g ≤ nT","inline":true},{"text":", there exists some integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":") that satisfies ","element":"span"},{"style":{"height":19.6},"width":514.46,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-6.png","element":"img","alt":" {k(g) − 1}nminTmin + 1 ≤","inline":true}],[{"style":{"height":19.6},"width":705.71,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-7.png","element":"img","alt":"g ≤ k(g)nminTmin. Let t(g) and i(g","inline":true},{"text":") be the integers that satisfy","element":"span"}],[{"style":{"width":"76%"},"width":1403,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-8.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":21.81},"width":380.78,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-9.png","element":"img","alt":" F(0) = {X1,0, A1,0}","inline":true},{"text":". 
Then we iteratively define ","element":"span"},{"style":{"height":21.81},"width":264,"height":54.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-10.png","element":"img","alt":" {F(g)}1≤g≤nT","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"style":{"width":"94%"},"width":1749,"height":335,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-11.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":22.72},"width":660.32,"height":56.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-12.png","element":"img","alt":" ξ(g) = ξi(g),t(g) and ε(g) = εi(g),t(g)","inline":true},{"text":". We rewrite ","element":"span"},{"style":{"height":13.2},"width":100.07,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-13.png","element":"img","alt":" η7 as","inline":true}],[{"style":{"width":"78%"},"width":1444,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-14.png","element":"img"}],[{"text":"One can show that ","element":"span"},{"style":{"height":13.2},"width":40.1,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-15.png","element":"img","alt":" η7","inline":true,"padRight":true},{"text":"forms a mean-zero martingale with respect to the filtration ","element":"span"},{"style":{"height":22.72},"width":347.68,"height":56.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-16.png","element":"img","alt":" {σ(F(g))}g≥nT/K.","inline":true}],[{"text":"Using similar arguments in the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"we can show 
that","element":"span"}],[{"id":"id-196","style":{"width":"87%"},"width":1604,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/83-17.png","element":"img"}],[{"style":{"width":"91%"},"width":1690,"height":736,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-0.png","element":"img"}],[{"text":"Similar to the proof of Theorem ","element":"span"},{"href":"#id-31","text":"1, ","element":"a"},{"text":"we can show max","element":"span"},{"style":{"height":26.64},"width":676.34,"height":66.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-1.png","element":"img","alt":"k∈{2,...,K} ∥�Ω∗Ik,�π¯Ik−1 − Ω�π¯Ik−1∥2 =","inline":true},{"style":{"height":14.07},"width":40.51,"height":35.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-2.png","element":"img","alt":"op","inline":true},{"text":"(1). Similar to ","element":"span"},{"href":"#id-148","text":"(E.49)","element":"a"},{"text":", we can show there exists some constant ","element":"span"},{"style":{"height":12.87},"width":89.36,"height":32.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-3.png","element":"img","alt":" c7 >","inline":true,"padRight":true},{"text":"0 such that","element":"span"}],[{"id":"id-197","style":{"width":"99%"},"width":1838,"height":430,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-4.png","element":"img"}],[{"text":"Using a martingale central limit theorem for triangular arrays (Corollary 2.8 of ","element":"span"},{"href":"#id-152","referenceIndex":25,"text":"McLeish, ","element":"a"},{"href":"#id-152","referenceIndex":25,"text":"1974)","element":"a"},{"text":", we have by ","element":"span"},{"href":"#id-196","text":"(E.97) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-197","text":"(E.99) ","element":"a"},{"text":"that 
","element":"span"},{"style":{"height":24.84},"width":213.53,"height":62.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-5.png","element":"img","alt":" η7 d→ N(0,","inline":true,"padRight":true},{"text":"1). The proof is hence completed.","element":"span"}],[{"id":"id-136","style":{"fontWeight":"bold"},"text":"E.11 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-27","style":{"fontWeight":"bold"},"text":"3","element":"a"}],[{"text":"Based on the discussions in Section ","element":"span"},{"href":"#id-111","text":"3.2.3, ","element":"a"},{"text":"it suffices to show","element":"span"}],[{"style":{"width":"69%"},"width":1278,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-6.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-198","style":{"width":"84%"},"width":1561,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-7.png","element":"img"}],[{"text":"We only prove ","element":"span"},{"href":"#id-198","text":"(E.100) ","element":"a"},{"text":"for brevity. 
","element":"span"},{"text":"Under the given conditions, we have Pr(","element":"span"},{"style":{"height":19.6},"width":136.05,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-8.png","element":"img","alt":"A0) ≥","inline":true}],[{"text":"1 ","element":"span"},{"style":{"height":19.6},"width":373.77,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-9.png","element":"img","alt":" − O(|I|−κ), where","inline":true}],[{"style":{"width":"55%"},"width":1022,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/84-10.png","element":"img"}],[{"text":"Notice that","element":"span"}],[{"id":"id-199","style":{"width":"77%"},"width":1421,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-0.png","element":"img"}],[{"text":"Under A1, ","element":"span"},{"style":{"height":18.54},"width":82.52,"height":46.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-1.png","element":"img","alt":" Qopt ","inline":true,"padRight":true},{"text":"is uniformly bounded. Therefore, the first term on the RHS of ","element":"span"},{"href":"#id-199","text":"(E.101) ","element":"a"},{"text":"is upper bounded by ","element":"span"},{"style":{"height":19.6},"width":675.21,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-2.png","element":"img","alt":" O{Pr(Ac0)} = O(|I|−κ). 
Since κ","inline":true,"padRight":true},{"text":"can be chosen arbitrarily large, it","element":"span"}],[{"text":"suffices to show","element":"span"}],[{"id":"id-202","style":{"width":"88%"},"width":1623,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-3.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":16.8},"width":308.92,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-4.png","element":"img","alt":" x ∈ X, suppose","inline":true}],[{"id":"id-200","style":{"width":"84%"},"width":1556,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-5.png","element":"img"}],[{"text":"Under the event defined in ","element":"span"},{"style":{"height":17.6},"width":247.2,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-6.png","element":"img","alt":" A0, we have","inline":true}],[{"style":{"width":"53%"},"width":991,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-7.png","element":"img"}],[{"text":"and hence","element":"span"}],[{"style":{"width":"46%"},"width":856,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-8.png","element":"img"}],[{"text":"Thus, we have","element":"span"}],[{"style":{"width":"43%"},"width":799,"height":105,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-9.png","element":"img"}],[{"text":"when ","element":"span"},{"href":"#id-200","text":"(E.103) ","element":"a"},{"text":"holds. 
Let ","element":"span"},{"style":{"height":16.07},"width":51.54,"height":40.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-10.png","element":"img","alt":" X∗","inline":true,"padRight":true},{"text":"denote the set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"that satisfy ","element":"span"},{"href":"#id-200","text":"(E.103)","element":"a"},{"text":". It follows that","element":"span"}],[{"id":"id-201","style":{"width":"82%"},"width":1518,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/85-11.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":19.6},"width":532.34,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-0.png","element":"img","alt":" âI(x) = sarg maxa Q̂I(a, x","inline":true},{"text":"). Similarly, we can show that the event ","element":"span"},{"style":{"height":20.26},"width":601.4,"height":50.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-1.png","element":"img","alt":" âI(x) ∉ arg maxa∈A Qopt(x, a)","inline":true,"padRight":true},{"text":"occurs only when","element":"span"}],[{"id":"id-203","style":{"width":"75%"},"width":1396,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-2.png","element":"img"}],[{"text":"Since max","element":"span"},{"style":{"height":20.52},"width":1327.08,"height":51.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-3.png","element":"img","alt":"a Qopt(x, a) − Qopt(x, âI(x)) = ∑a Qopt(x, a){πopt(a|x) − π̂I(a|x)}","inline":true},{"text":", we obtain","element":"span"}],[{"style":{"width":"73%"},"width":1363,"height":253,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-4.png","element":"img"}],[{"text":"where the first equality follows from A5. 
","element":"span"},{"text":"Combining this together with ","element":"span"},{"href":"#id-201","text":"(E.104) ","element":"a"},{"text":"yields ","element":"span"},{"href":"#id-202","text":"(E.102)","element":"a"},{"text":". The proof is hence completed.","element":"span"}],[{"id":"id-137","style":{"fontWeight":"bold"},"text":"E.12 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-28","style":{"fontWeight":"bold"},"text":"4","element":"a"}],[{"text":"For a given ","element":"span"},{"style":{"height":20.14},"width":1104.28,"height":50.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-5.png","element":"img","alt":" ε > 0, let A∗ = {maxa Qopt(x, a) − Qopt(x,�aI(x)) ≤ ε}","inline":true},{"text":". Notice that","element":"span"}],[{"id":"id-204","style":{"width":"79%"},"width":1464,"height":412,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-6.png","element":"img"}],[{"text":"Using similar arguments in the proof of Theorem ","element":"span"},{"href":"#id-27","text":"3, ","element":"a"},{"text":"we can show","element":"span"}],[{"id":"id-205","style":{"width":"92%"},"width":1712,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-7.png","element":"img"}],[{"text":"Moreover, similar to ","element":"span"},{"href":"#id-203","text":"(E.105)","element":"a"},{"text":", we can show the event ","element":"span"},{"style":{"height":20.26},"width":624.54,"height":50.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-8.png","element":"img","alt":" �aI(x) /∈ arg maxa∈A Qopt(x, a)","inline":true}],[{"text":"occurs only when","element":"span"}],[{"style":{"width":"68%"},"width":1271,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/86-9.png","element":"img"}],[{"text":"It follows 
that","element":"span"}],[{"style":{"width":"86%"},"width":1600,"height":486,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/87-0.png","element":"img"}],[{"text":"Combining this with ","element":"span"},{"href":"#id-204","text":"(E.106) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-205","text":"(E.107) ","element":"a"},{"text":"yields","element":"span"}],[{"style":{"width":"76%"},"width":1416,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/87-1.png","element":"img"}],[{"text":"The proof is completed by setting ","element":"span"},{"style":{"height":21.34},"width":339.39,"height":53.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2001.04515/images/87-2.png","element":"img","alt":" ε = |I|−2b∗/(2+α).","inline":true}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]