b

DiscoverSearch
About
My stuff
Off-Policy Estimation of Long-Term Average Outcomes with Applications to Mobile Health
2019·arXiv
Abstract
Abstract

Due to the recent advancements in wearables and sensing technology, health scientists are increasingly developing mobile health (mHealth) interventions. In mHealth interventions, mobile devices are used to deliver treatment to individuals as they go about their daily lives. These treatments are generally designed to impact a near time, proximal outcome such as stress or physical activity. The mHealth intervention policies, often called just-in-time adaptive interventions, are decision rules that map a individual’s current state (e.g., individual’s past behaviors as well as current observations of time, location, social activity, stress and urges to smoke) to a particular treatment at each of many time points. The vast majority of current mHealth interventions deploy expert-derived policies. In this paper, we provide an approach for conducting inference about the performance of one or more such policies using historical data collected under a possibly different policy. Our measure of performance is the average of proximal outcomes over a long time period should the particular mHealth policy be followed. We provide an estimator as well as confidence intervals. This work is motivated by HeartSteps, an mHealth physical activity intervention.

Keywords: sequential decision making, policy evaluation, markov decision process, reinforcement learning

Due to the recent advancement in mobile device and sensing technology, health scientists are more and more interested in developing mobile health (mHealth) interventions. In mHealth, mobile devices (e.g., wearables and smartphones) are used to deliver interventions to individuals as they go about their daily lives. In general, there are two types of mHealth treatments. Most are pull treatments that reside on the individual’s mobile device and allow the individual to access treatment content as needed. This work focuses on the second type, the “push” treatment, typically in the form of a notification or a text message that appears on a mobile device. There is a wide variety of possible treatment messages (e.g., behavioral, cognitive, and motivational message and reminders). These treatments are generally intended to impact a near time, proximal outcome, such as stress or behaviors such as physical activity over some subsequent minutes/hours. The mHealth intervention policies, often called just-in-time adaptive interventions in the mHealth literature (Nahum-Shani et al. 2018), are decision rules that map the individuals current state (e.g., past behaviors as well as current observations of location, social activity, stress and urges to smoke) to a particular treatment at each of many time points. Many mHealth interventions are designed for long-term use in chronic disease management (Lee et al. 2018). The vast majority of current mHealth interventions deploy expert-derived policies with limited use of data evidence (for an example see Kizakevich et al. (2014)), however the long-term efficacy of these policies on the health behavior is not well understood. An important first step toward developing data-based, effective mHealth interventions is to properly measure the long-term performance of these policies. In this work, we provide an approach for conducting inference about the optimality of one or more mHealth policies of interest. Our optimality criterion is the long-term average of the proximal outcomes should a particular mHealth policy be followed. We develop a flexible method to estimate the performance of an mHealth policy using a historical dataset in which the treatments are decided by a possibly different policy.

This work is motivated by HeartSteps (Klasnja et al. 2015), an mHealth physical activity intervention. To design this intervention, we are conducting a series of studies. The first, already completed, study was for 42 days. The last study will be for one year. Here we focus on the intervention component involving activity suggestions. These suggestions may be delivered at each of the five individual-specified times per day. While in the first study there were 42×5 = 210 time points per individual, in the year-long study there will be about2,000 time points per individual. The proximal outcome is the step count in the 30 minutes following each of the five times per day. Our goal is to use the data collected from the first 42-day study to predict and estimate the long-term average of proximal outcomes for a variety of policies that could be used to decide whether or not to send the activity suggestion at each time point in the year-long study. The 42-day study was a Micro-Randomized Trial (MRT) (Klasnja et al. 2015, Liao et al. 2016). In an MRT, a known stochastic policy, also called a behavior policy, is used to decide when and which type of treatment to provide at each time point. A partial list of MRTs in the field or completed can be found at the website1. Froman experimental point of view, the stochastic behavior policy is used to conduct sequential randomizations within each individual. Here the adjective, “stochastic”, means that at each time point each individual is randomized between the possible treatments. In this work we focus on settings in which the randomization probabilities are known functions of the individual’s past data; this is the case with MRTs by design.

The rest of the article is organized as follows. Section 2 provides a review of Markov Decision Processes. In Section 3, we review related work. Section 4 develops an estimator for the long-term average proximal outcome; then Section 5 provides the asymptotic distribution of this estimator. As the estimation requires tuning parameters, in Section 6 we provide a procedure to select the tuning parameters. Simulations are used to assess the coverage probability of the proposed confidence intervals in various settings. A case study using data from the 42-day MRT of HeartSteps is presented in Section 7. We end with a discussion of future work in Section 8.

The data for each individual is of the form

image

where t indexes time points,  St ∈ Sis the individual’s state and  At ∈ Ais the treatment (usually called the action) assigned at time, t. The action space, A, is discrete and finite. In mHealth, the state,  St, contains the time-varying information (e.g., current location) as well as summaries of historical data up to and including time, t (e.g., summaries of previous physical activity). The actions are different types of treatments that are delivered to the individual via a smartdevice; these treatments can be reminders, motivational messages, messages prompting self-reflection and so on. For simplicity, we assume that the duration over which data is collected, T, is non-random and same for all individuals. The proximal outcome (also called the reward), denoted by  Rt+1 ∈ R, is assumed to be a known function of (St, At, St+1). In mHealth, the reward is often chosen to measure the near-term impact of the current action (e.g., the number of steps in a pre-specified time window after each time point). In this work, we focus on the case of continuous rewards (see Section 8 for a discussion about other types of rewards). In HeartSteps, the binary action is whether an activity suggestion is delivered and the reward is the 30-min step count following each decision time.

We assume that the distribution of the states satisfies the Markovian property, that is, for  t ≥ 1, St+1 ⊥ {S1, A1, . . . , St−1, At−1} | {St, At}. Furthermore, we assume that the conditional distribution (also called the transition kernel) of  St+1 | {St, At}is time-homogeneous. Denote the transition kernel by P; thus given a measurable set, B, in the state space, S, P(B | s, a) = Pr(St+1 ∈ B | St = s, At = a). Note that P does not depend on t due to the above time-homogeneity assumption. Denote by  p(s′ | s, a) the transition density with respect to some reference measure on S (e.g., counting measure when S is discrete). Let r(s, a) denote the conditional expectation of the reward given state and action, i.e., r(s, a) = E(Rt+1 | St = s, At = a). The tuple, (S, A, P), is called a Markov Decision Process (MDP) (Howard 1960, Puterman 1994, Sutton & Barto 2018). In mHealth, non-stationarity in P is likely to occur if there are unobserved aspects of the current state (e.g., individual’s engagement and/or burden). Therefore, practically, it is critical to strive to collect sufficient information (via self-report or wearable sensors) to represent individual’s state.

Note that the MDP does not specify the distribution of the actions. And indeed the distribution of the actions may not satisfy the Markovian property. In an MRT, the actions, {At}Tt=1, are randomized with probabilities that can depend on the entire history prior to time point  t, Ht = {S1, A1, . . . , St}. Denote the distribution of  At | Ht by πbt(· | Ht). We callπb = {πb1, . . . , πbT}, a stochastic behavior policy. Throughout we assume that  πb is known(as is the case in an MRT) and that the probabilities are strictly positive, i.e.,  πbt(a | Ht) ≥pmin > 0 for all a ∈ A, Ht and t ≤ T.

Suppose that a pre-specified time-invariant, Markovian policy,  π, is being considered for use in future. Our goal is to conduct inference for the resulting average of the rewards over a large number of time points. In mHealth, the policy might be an expert-constructed policy. Considering a long time period makes most sense for individuals who are struggling with chronic problems or disorders for which, at this time, there is no general cure. Many health-behavior problems fall into this area including obesity, hypertension, adherence to medications for AIDs, mental illness and addictions. Let  π(a | s) be the probability of choosing the action, a, at the state, s. Given a dataset that consists of n independent, identically distributed (i.i.d.) observations of D, we aim to estimate the average reward of the policy,

defined as

image

where the expectation,  Eπ, is taken over the trajectory  {S1, A1, S2, . . . , St∗, At∗, St∗+1} inwhich the actions are selected according to the policy,  π, that is, the likelihood in the expectation is given by  1{S1=s}�t∗t=1 π(At | St)p(St+1 | St, At). The policy,  π, induces a Markov chain on the state with the transition kernel,  P π(· | s) = �a∈A π(a | s)P(· | s, a).

Suppose for now the state space, S, is finite. It is known that the limit in (1) exists, i.e., ηπ(s) = limt∗→∞ Eπ( 1t∗�t∗t=1 Rt+1 | S1 = s) (Theorem A.6 on p. 595 in Puterman (1994)). Furthermore, when the induced Markov chain,  P π, is irreducible, the average reward is

independent of initial state and is given by

image

where  dπ(·) is the stationary distribution. The existence of stationary distribution is guaranteed by the irreducibility assumption on  P π (Puterman (1994), p. 592). The above results can be extended to general state spaces (e.g.,  S ⊂ Rd) under more involved conditions on the transition kernel,  P π, analogous to the finite state case (see, for example, Hern´andez-Lerma & Lasserre (1999), chap. 7). Practically the irreducibility assumption implies that time-invariant information cannot be included in the state. Motivated by mHealth applications, in Supplement A we present a generalization that allows the average reward to depend on time-invariant variables. In the case of mHealth, time-invariant variables might be gender, baseline severity, genetics and so on.

We propose to conduct inference about the long-term performance of each policy,  π, viaits average reward,  ηπ. This is because the average reward,  ηπ, is an asymptotic surrogate

of the average of finite rewards over a long period of time. In fact, it can be shown that

image

where the leading constant depends on the mixing time of  P π (see Theorem 7.5.10 in Hern´andez-Lerma & Lasserre (1999)). In the case of HeartSteps, the goal is to use the data from the 42-day MRT study to estimate the average reward,  ηπ, for a variety of policies  π.The average reward,  ηπ, provides a proxy for the average of the 30-min step counts when the policy,  π, is used to determine whether to send the activity suggestions over a long time period (e.g., a year: 5  ×365 time points).

Note that the data, D, on each individual includes observations over T time points and the actions are selected according to a behavior policy. However, as will be seen, the above assumptions including the Markovian and irreducibility assumptions will allow us to estimate the average reward over a long time period and under a different policy.

Lastly as mentioned above the focus here is to conduct inference for the long run average of equally weighted rewards under the target policy using data collected under a possibly different policy. An alternate, and more common, inference target is based on an expected discounted sum of rewards,  Eπ(�∞t=1 γt−1Rt+1 | S1 = s), with the discount rate,  γ ∈ [0, 1).When the discount rate,  γ, is small (e.g.,  γ = 0.5), the discounted sum of rewards focuses only on finitely many near-term rewards. Note that even with a large discount rate of  γ = 0.99,the reward at t = 100 has a weight of 0.37 and the reward t = 200 has a weight of 0.13. Recall our motivating mHealth intervention is being designed to optimize the overall physical activity over one year. From a scientific point of view, the rewards in the distant future are as important as the near-term ones, especially when considering the effect of habituation and burden. With this in mind, we opt for the long-term average reward, which can be viewed as a proxy for the (undiscounted) average of rewards over a long period of time. In fact, the conditional expectation of the sum of discounted rewards is related to the average reward; as  γ →1, the above conditional expectation of the sum of the discounted rewards normalized by the constant, 1/(1 − γ), converges to the average reward,  ηπ(Mahadevan 1996). In the online setting many researchers focus on a discounted sum of rewards. This is because the Bellman operator, for the expected discounted sum of rewards, is a contraction (Sutton & Barto 2018); the contraction provides greater computational stability and simpler convergence arguments. However, as we shall see below, consideration of the average reward is not problematic in the batch (i.e., off-line) setting.

The evaluation of a given target policy using data collected from a different policy (i.e., the behavior policy) is called off-policy evaluation. This has been widely studied in both the statistical and reinforcement learning (RL) literature. Many authors have evaluated and contrasted policies in terms of the expected sum of rewards over a finite number of time points (Murphy et al. 2001, Chakraborty & Moodie 2013, Jiang & Li 2015). However, because these methods often use products of weights with probabilities from the behavior policy in the denominator, the extension to problems with a large number of time points often suffers from a large variance (Thomas & Brunskill 2016, Jiang & Li 2015).

The most common off-policy evaluation methods for infinite-horizon problems (i.e., a large number of time points) focus on a discounted sum of rewards and are thus based in some way on the value function (in the discounted reward setting  Eπ(�∞t=1 γt−1Rt+1 | S1 = s)considered as a function of s is the value function). Farahmand et al. (2016) proposed a regularized version of Least Square Temporal Difference (Bradtke & Barto 1996) and statistical properties were studied. They used a non-parametric model to estimate the value function and derived the convergence rate when training data consists of i.i.d. transition samples in the form of state, action, reward and next state. From a technical point of view, our estimation method is similar to Farahmand et al. (2016), albeit focused on the average reward; most importantly our method relaxes the assumption that Bellman operator can be modeled correctly for each candidate relative value function and only assumes the data consists of i.i.d. samples of trajectories. Luckett et al. (2020) also focused on the discounted reward setting. They evaluated policy,  π, based on an average of  Eπ(�∞t=1 γt−1Rt+1 | S1 = s) withrespect to a pre-selected reference distribution of the state. While the reference distribution can be naturally chosen as the distribution of the initial state (Farajtabar et al. 2018, Liu et al. 2018, Luckett et al. 2020, Thomas & Brunskill 2016), choosing a “right” discount rate,  γ, can be non-trivial, at least in mHealth. They assumed a parametric model for the value function and developed a regularized estimating equation. In computer science literature, there also exists many off-policy evaluation methods for the discounted reward setting. We refer the interested reader to the recent works by Farajtabar et al. (2018) and Kallus & Uehara (2019) and references therein.

Closest to the setting of this work is the recent work by Murphy et al. (2016) and Liu et al. (2018). Murphy et al. (2016) considered the average reward setting. They assumed a linear model for the value function and constructed the estimating equations to estimate the average reward. However the linearity assumption of the value function is unlikely to hold in practice and difficult to validate (e.g., the value function involves the infinite sum of the rewards). Our method allows the use of a non-parametric model for the value function to increase robustness. Liu et al. (2018) also considered the average reward and proposed an estimator for the average reward based on estimating the ratio of the stationary distribution under the target policy divided by the stationary distribution under the behavior policy. However they did not provide confidence intervals or other inferential methods besides an estimator for the average reward. In addition, they restricted the behavior policy to be Markovian and time-stationary. In mHealth, the behavior policy can be determined by an algorithm based on the accruing data and thus violates this assumption (Liao et al. 2018, Dempsey et al. 2020).

We assume that the dataset,  Dn, consists of n trajectories:

image

Each trajectory,  Di, is an i.i.d. copy of D described in Section 2. Recall that  {At}Tt=1, theactions in D, are selected by the behavior policy,  πb. In the following, the expectation, E, without the subscript is with respect to the distribution of the trajectory, D, under the behavior policy.

Below we introduce the estimator for  ηπ. We follow the so-called “model-free” approach (i.e., does not require modeling the transition kernel, P) to estimate the average reward. Our estimator is based on the Bellman equation, also known as the Poisson equation (Puterman 1994); as will be discussed below this equation characterizes the average reward.

First consider the setting where the state space, S, is finite and the induced Markov chain,  P π, is irreducible. Recall that in this setting the average reward,  ηπ, is a constant

given in (2). Define the relative value function by

image

this limit is well-defined (Puterman (1994), p. 338). If the induced Markov chain is aperiodic, then the relative value function (3) can be expressed as  Qπ(s, a) = Eπ{�∞t=1(Rt+1−ηπ) | S1 =s, A1 = a}.The relative value function,  Qπ(s, a), measures the difference between the expected cumulative rewards under policy  πand the average reward when the initial state is s and the first action is a. It is easy to verify from the definition that (ηπ, Qπ) is a solution of the Bellman equation:

image

Furthermore, when the induced Markov chain is irreducible, the Bellman equation (4) uniquely identifies the average reward,  ηπ, and identifies the relative value function,  Qπ,up to a constant (see Puterman (1994), p. 343 for details). That is, the set of the solutions of the Bellman equation (4) is given by  {(ηπ, Q) : Q = Qπ +c1, c ∈ R, 1(s, a) = 1}. These results can be generalized to general state spaces (see chap. 7 in Hern´andez-Lerma & Lasserre (1999)). The key assumption for the method proposed here is as follows.

Assumption 1. The average reward of the target policy,  π, is independent of state and satisfies (2). (ηπ, Qπ)is the unique solution of the Bellman equation (4) up to a constant for  Qπ. The stationary distribution of the induced transition kernel,  P π, exists.

As the focus of this work is to estimate the average reward, it will be sufficient to estimate a specific version of  Qπ. Define the shifted relative value function by ˜Qπ(s, a) = Qπ(s, a) −Qπ(s∗, a∗) for a specific state-action pair, (s∗, a∗). Obviously ˜Qπ(s∗, a∗) = 0 and ˜Qπ(s1, a1)−˜Qπ(s2, a2) = Qπ(s1, a1) − Qπ(s2, a2), that is, the difference in the relative value remains the same. By restricting the relative value function to satisfy  Q(s∗, a∗) = 0, the solution ofBellman equation (4) is unique and given by (ηπ, ˜Qπ).

In the following, we assume that ˜Qπ ∈ Q, where Qdenotes a vector space of functions on the state-action space  S × A such that Q(s∗, a∗) = 0 for all Q ∈ Q. The Bellman operator,

Tπ, with respect to the target policy  πis given by

image

Note that the above conditional expectation does not depend on the behavior policy due to the conditioning on current state and action. The Bellman error at (s, a) with respect to (η, Q) and πis defined as  Tπ(s, a; Q) − η − Q(s, a). From the Bellman equation, this error is zero for all (s, a) when η = ηπ and Q = ˜Qπ.

Note that the Bellman operator (5) involves the (unknown) transition kernel, P. Suppose for now that P is known and thus the Bellman operator is known. Since the Bellman error is zero at  η = ηπ and Q = ˜Qπ, a natural way to estimate (ηπ, ˜Qπ) is to minimize the empirical

squared Bellman error, i.e.,

image

where  Pnf(D) = (1/n) �ni=1 f(Di) is the empirical mean over the training data,  Dn, fora function of the trajectory, f(D). Obviously, this is not a feasible estimator as we don’t know the transition kernel and thus  Tπ(St, At; Q) is unknown. A natural idea is to replace the Bellman operator by its sample counterpart, i.e., replace  Tπ(St, At; Q) by Rt+1 +�a′ π(a′ | St+1)Q(St+1, a′) in the objective function of (6). Unlike the regression problem in which the dependent variable is fully observed, the dependent variable here is  Rt+1 +�a′ π(a′ | St+1)Q(St+1, a′), which involves the unknown relative value function, Q. As a re- sult, this natural plug-in estimator is biased (see Antos et al. (2008) for a similar discussion in the discounted reward setting).

The above argument motivates a coupled estimator in which we use the estimated Bellman error to form an objective function. In particular, for each (η, Q), we replace the Bellman error,  Tπ(St, At; Q) − η − Q(St, At), in (6) by an estimate of the “projection” of the Bellman error into a second function class, G:

image

Throughout we assume the solution of the above optimization exists and is in G and we call g∗π(·, ·; η, Q) a projection for simplicity. Recall that members of  Q satisfy Q(s∗, a∗) = 0. Asimilar constraint needs not be placed on the members of G. It is worth noting that we do not require the assumption that the Bellman error is modeled correctly by G, that is, Tπ(·, ·; Q)−η −Q(·, ·) may not be in G. A natural choice of  G is R⊕Q = {c+Q : c ∈ R, Q ∈Q}, however this is not mandatory. Farahmand et al. (2016) assumed that the Bellman error (in discounted setting) is in fact in Q in order to develop a non-parametric estimator for the value function. As we will see, the assumption that the Bellman error belongs to G is in fact not necessary and can be relaxed (our proof will use the weaker assumption 4 in Section 5). The key reason why the projected Bellman error (7) allows us to identify (ηπ, ˜Qπ) is because g∗π(·, ·; ηπ, ˜Qπ) = 0 (see also (iii)in Assumption 4 ).

We now formally introduce the estimator for (ηπ, ˜Qπ). This estimator is designed to minimize the projected Bellman error (7). Specifically, the estimator, (ˆηπn, ˆQπn), of (ηπ, ˜Qπ),is found by solving a coupled (or nested) optimization problem:

image

where for each (η, Q), ˆgn,π(·, ·; η, Q) is an estimator for the projection of the Bellman error

given by

image

where  J1 : Q → R+ and J2 : G → R+ are two regularizers and  λn and µnare tuning parameters.

We can see that for every (η, Q), ˆgn,π(·, ·; η, Q) is a penalized estimator for the projected Bellman error  g∗π(·, ·; η, Q) in (7). On the other hand, the objective function in (8) is a plug-in version of the objective function in (6) where we replace the Bellman error by ˆgn,π(·, ·; η, Q).Compared to the classic empirical risk minimization, (ˆηπn, ˆQπn) solves a nested optimization problem in the sense that the objective function (8) depends on ˆgn,π(·, ·; η, Q) which itself is the solution of another, lower-level optimization (9).

The penalty term,  λnJ21(Q), is used to balance between the model fitting (i.e., the squared estimated Bellman error) and the complexity of the relative value function measured by J1(Q). Similarly, µnJ22(g) is used to control the overfitting in estimating the projected Bellman error when the function class, G, is complex. In the case where the function space is k-th order Sobolev space, the regularizer is typically defined by the k-th order derivative to capture the smoothness of function. In the case where the function space is Reproducing Kernel Hilbert Space (RKHS), the regularizer is the endowed norm. In Supplement D, we provide a closed-form solution of the estimator when both Q and G are RKHSs.

So far we have focused on evaluating a single target policy. In practice, one might want to compare the target policy to some reference policy or contrast multiple target policies of interest. Suppose we are interested in K different target policies,  {πj}Kj=1. The above procedure (8) can be applied to estimate  {ηπj}Kj=1. In the next section, we will provide the result of the joint asymptotic distribution of�ˆηπjn�Kj=1 (see Corollary 1). This can be used, for example, to construct the confidence interval of the difference of the average rewards between two policies.

In this section, we first derive the global rate of convergence for (ˆηπn, ˆQπn) in (8) and derive the asymptotic distribution of ˆηπn for a single policy. We then extend the results to the case of multiple policies. For any state-action function,  f(s, a, s′), and distribution,  ν, on S × A,denote the  L2(ν) norm by ∥f∥2ν =�f 2(s, a)dν(s, a). If the norm does not have a subscript, then the expectation is with respect to the average state-action distribution in the trajectory, D, that is, ∥f∥2 = E�(1/T) �Tt=1 f 2(St, At)�.

We first state two standard assumptions used in the non-parametric regression literature (Gy¨orfi et al. 2006). Recall that the shifted relative value function is defined as ˜Qπ =

image

Assumption 2. The reward is uniformly bounded:  |Rt+1| ≤ Rmax < ∞ for all t ≥ 1. Theshifted relative value function is bounded:  | ˜Qπ(s, a)| ≤ Qmax for all s ∈ S and a ∈ A.

Assumption 3. The function class, Q, satisfies (i)  Q(s∗, a∗) = 0 and ∥Q∥∞ ≤ Qmax for allQ ∈ Q and (ii) ˜Qπ ∈ Q.

The assumption of a bounded reward is mainly to simplify the proof and can be relaxed to the sub-Gaussian case, that is, the error  Rt+1−r(St, At) is sub-Gaussian for all  t ≤ T. Theboundedness assumption on the shifted relative value function can be ensured by assuming certain smoothness assumptions on the transition distribution (Ortner & Ryabko 2012) or assuming geometric convergence to the stationary distribution (Hern´andez-Lerma & Lasserre 1999). The boundedness assumption, (i), for members of the function class, Q, is used to simplify the proof; a truncation argument can be used to avoid this assumption.

Recall that  g∗π(·, ·; η, Q) is a projected Bellman error in (7) into a function class, G. We make the following assumptions about G.

image

Given  Rmax and Qmax, Gmaxcan be chosen as 2(Rmax+Qmax). Similar to Q, the boundedness assumption of G is used to simply the proof and can be relaxed. The value of  κ measureshow well the function class, G, approximates the Bellman error for all (η, Q) in which η ∈ Rand  Q ∈ Q. The condition of a strictly positive  κensures the estimator (8) based on minimizing the projected Bellman error onto the space, G, is able to identify the true values, (ηπ, ˜Qπ). This is similar to the eigenvalue condition (Assumption 5) in Luckett et al. (2020), but they are essentially using the same function class for Q and G. Recall that, unlike in Farahmand et al. (2016), here we do not assume  Tπ(·, ·; Q) − η − Q(·, ·) ∈ Gfor every (η, Q).If this were the case, then we would have  g∗π(·, ·; η, Q) = Tπ(·, ·; Q) − η − Q(·, ·) and thusκ = 1.

Below we make assumptions on the complexity of the function classes, Q and G. These assumptions are satisfied for common function classes, for example RKHS and Sobolev spaces (Van de Geer 2000, Zhao et al. 2016, Steinwart & Christmann 2008, Gy¨orfi et al. 2006). We denote by  N (ǫ, F, ∥ · ∥) the ǫ-covering number of a set of functions, F, with respect to the norm,  ∥ · ∥.

Assumption 5. (i) The regularization functional  J1 and J2are pseudo norms and induced by the inner products  J1(·, ·) and J2(·, ·), respectively. There exist constants  C1, C2 such that

image

The upper bound on  J2(g∗π(·, ·; η, Q)) in (i)is realistic when the transition kernel is sufficiently smooth (see Farahmand et al. (2016) for an example of MDP satisfying this condition). We use a common  α ∈ (0,1) for both Q and G in (ii) to simply the proof.

Now we are ready to state the theorem about the convergence rate for (ˆηπn, ˆQπn) in termsof the Bellman error.

Theorem 1 (Global Convergence Rate). Let (ˆηπn, ˆQπn)be the estimator defined in (8). Sup- pose Assumptions 1-5 hold and the tuning parameters, (λn, µn), satisfy τ −1n− 11+α ≤ µn ≤ τλn

image

In Lemma ?? in Supplement B, we show that up to a constant,  |ˆηπn −ηπ| ≲ ∥Tπ(·, ·; ˆQπn)−ˆηπn− ˆQπn(·, ·)∥2and thus ˆηπn is a consistent estimator for  ηπ when λn = oP(1). When the tuning parameters are chosen such that  λn ≍ µn and λn ≍ n−1/(1+α), the Bellman error at (ˆηπn, ˆQπn)has the optimal rate of convergence, i.e.,  ∥Tπ(·, ·; ˆηπn, ˆQπn) − ˆηπn − ˆQπn(·, ·)∥2 = OP(n−1/(1+α)).The proof of Theorem 1 is provided in Supplement B.

In the following, we provide the asymptotic distribution of the estimated average reward. This requires additional notation as follows. Define  dπ(s, a) = π(a | s)dπ(s); dπ is the density of the stationary distribution of the state-action under the target policy,  π. For each t ≥ 1,denote by  dt(s, a) the density of the state-action pair in the trajectory, D, under the behavior policy. Let ¯dT(s, a) be the average density over T decision times. Motivated by the least favorable direction in partial linear regression problems (Van de Geer 2000, Zhao et al. 2016),

we define the direction function,  eπ(s, a), by

image

The direction,  eπ, is used to control the bias (ˆηπn − ηπ) caused by the penalization on the non-parametric component (i.e., relative value function) in the estimator (8). This is akin to partially linear regression problem,  Y = f(Z) + X⊤β + ǫ, in which the analog of  eπ(s, a) isthe residual  x − E(X | Z = z) (see Donald et al. (1994), Van de Geer (2000) for the analysis in the regression problem). In our setting, the direction,  eπ(s, a), satisfies the following

orthogonality: for any state-action function, Q,

image

The numerator in (10) is a ratio between the stationary distribution of state-action pair under

target policy,  π, and the average distribution of state-action pair in the trajectory, D, under the behavior policy. The denominator is the expectation of the ratio under the stationary distribution. As a result of the denominator, we have�eπ(s, a)dπ(s, a)dsda = 1.

Next define  qπ(s, a) = limt∗→∞ 1t∗�t∗t=1 Eπ��tk=1 {1 − eπ(Sk, Ak)} | S1 = s, A1 = a�. Noteqπ has a similar structure to that of the relative value function (3) in which the “reward” at time,  t, is {1 − eπ(St, At)}and the “average reward” is zero (i.e.,�{1−eπ(s, a)}dπ(s, a)dsda =0). Similar to the relative value function (3),  qπ(·, ·) satisfies a Bellman-like equation:

image

We make the following smoothness assumption about  eπ and qπ, akin to the assumptions used in partially linear regression literature (Van de Geer 2000, Zhao et al. 2016).

image

Recall that in Assumption 3 we restrict  Q(s∗, a∗) = 0 for all Q ∈ Q. Thus we consider the shifted function ˜qπ in the assumption above. The analog of ˜qπ ∈ Qin partially linear regression problem,  Y = f(Z) + X⊤β + ǫ, is the standard assumption that  E[X|Z = ·] ∈ F,where F is the function class to model the nonparametric component, f(z) (Donald et al. 1994, Van de Geer 2000). The condition, ˜qπ ∈ Q, will be used to prove the √n rate ofconvergence and asymptotic normality of ˆηπn. On the other hand, unlike in the regression setting, we assume that the direction function,  eπ, is sufficiently smooth (i.e.,  eπ ∈ G).This assumption will be used to show that the bias of the coupled estimator, ˆηπn, decreases sufficiently fast to zero.

The last assumption is a contraction-type property. This assumption will be used to control the variance of a remainder term caused by the estimation of  Qπ.

Assumption 7. Let (Pπf)(·, ·) = Eπ {f(St+1, At+1) | St = ·, At = ·}be the function of the conditional expectation and  µπ(f) =�f(s, a)dπ(s, a)dsdabe the expectation under stationary distribution induced by  πfor a state-action function, f. There exist constants,  C4 > 0 and

image

The parameter,  β, in Assumption 7 is akin to the discount factor,  γ, in the discounted reward setting. Intuitively, this is related to the “mixing rate” of the Markov chain induced by the target policy  π. A similar assumption was imposed in Van Roy (1998) (Assumption 7.2 on p. 99). Now we are ready to present our main result, the asymptotic normality of the estimated average reward, ˆηπn.

Theorem 2 (Asymptotic Distribution). Suppose the conditions in Theorem 1 hold. In

addition, suppose Assumption 6 and 7 hold and  λn = ann−1/2 with an → 0. The estimator, ˆηπn, in (9) is  √n-consistent and asymptotically normal: √n(ˆηπn − ηπ) ⇒ N(0, σ2), where

image

σ2 = Var

image

From Theorem 2, the variance in estimating the average reward parameter,  ηπ, dependson the length of trajectory and the ratio between the stationary distribution of the state-action pair induced by the target policy (i.e.,  dπ) and the average state-action distribution in the training data (i.e., ¯dT). To gain intuition of how these impact the asymptotic variance of ˆηπn, consider a simplified setting where the conditional variance of  Rt+1 + �a′ Qπ(St+1, a′) −ηπ − Qπ(St, At) given (St, At) is a constant, denoted by  σ20.It can be shown that the asymptotic variance becomes  σ2 = σ20T (1 + ∥(dπ/ ¯dT) − 1∥2). Thus the smaller  ∥(dπ/ ¯dT) − 1∥2(i.e., the ratio,  dπ/ ¯dT, close to one), the smaller the asymptotic variance of the estimated average reward. Although here we focus only on the asymptotic properties of ˆηπn for largen (recall n is the number of i.i.d. trajectories), one can see that increasing length of the trajectory, T, reduces the asymptotic variance.

Now we present the result for evaluating a class of policies, Π = {π1, . . . , πK}. Denoteby ˆηπjn the estimated average reward of the policy,  πj, using (8).

Corollary 1 (Multiple Policies). Suppose the conditions in Theorem 1 and 2 hold for each

image

tation on both sides of (10), the ratio can be written in terms of  eπ: dπ(s, a)/ ¯dT(s, a) =

eπ(s, a)/E{(1/T) �Tt=1 eπ(St, At)}.It is enough to construct an estimator for  eπ, which wedenote by ˆeπn, and then estimate the ratio by ˆeπn(s, a)/Pn{(1/T) �Tt=1 ˆeπn(St, At)}. Moti-vated by the orthogonality (11) and the expression (12), we construct the estimator for eπ(·, ·) by ˆeπn(·, ·) = ˜gn,π(·, ·; ˆqπn), where ˆqπn(·, ·) = argminq∈Q Pn{(1/T) �Tt=1 ˜g2n,π(St, At; q)} +˜λnJ21(q) and ˜gn,π(·, ·; q) = argming∈G Pn[(1/T) �Tt=1{1−q(St, At)+�a′ π(a′ | St+1)q(St+1, a′)−g(St, At)}2] + ˜µnJ22(g) for each q ∈ Q. Here (˜λn, ˜µn) are some tuning parameters. Follow- ing a similar argument as in the proof of Theorem 1, ˜qπn can be shown to be a consistent estimator for ˜qπ. Under the assumption that  eπ ∈ G, eπ can be consistently estimated by ˆeπn(·, ·) = ˜gn,π(·, ·; ˆqπn) based on (12). See Supplement C for additional details about the es- timator ˆeπn. In Supplement D, we provide a closed-form solution for the estimator for the asymptotic variance when Q and G are RKHSs.

In this section, we conduct a simulation study to evaluate the performance of the proposed method. The generative model is given as follows. We follow the state generative model in Luckett et al. (2020). Specifically, the state,  St = (St,1, St,2), is a two-dimensional vector and the action,  At ∈ {0, 1}, is binary. Given the current state,  St, and action,  At, the next state, St+1 = (St+1,1, St+1,2), is generated by  St+1,1 = (3/4)(2At −1)St,1 + (1/4)St,1St,2 + N(0, 0.52)and  St+1,2 = (3/4)(1 − 2At)St,2 + (1/4)St,1St,2 + N(0, 0.52). Note that receiving a treatment (At = 1) increases the value of St,1while decreases  St,2. The reward is generated by  Rt+1 =St+1,1 + (1/2)St+1,2 + (1/4)(2At −1). For each trajectory in the training data, the state variables are generated as independent standard normal random variables and the behavior policy is to choose  At = 1 with a fixed probability 0.5. We evaluate and compare two naturalpolicies: the “always treat” policy,  π1(a | s) = 1, and “no treatment” policy, π2(a | s) = 0.

image

image

error,  R + �

image

pendent variables are the current state and action). Finally, we choose (λ, µ) that minimizes the squared estimated Bellman error over the validation set, i.e., �(S,A)∈Dval ˆf 2(S, A; λ, µ).The final estimator for  ηπ is then calculated with the optimal tuning parameters using the entire dataset. In the simulation, we use (1/2) of the trajectories for the training set and (1/2) for the validation set and we use Gaussian Process regression to estimate the Bellman error in the validation step.

We consider different scenarios of the number of the trajectories,  n ∈ {25, 40}, and thelength of each trajectory,  T ∈ {25, 50, 75}. In each scenario, we generate 500 simulated dataset and for each dataset we construct the 95% confidence intervals of  ηπ1, ηπ2 and ηπ1 −ηπ2. The coverage probability of each confidence interval is calculated over 500 repetitions. The simulation result is reported in Table 1. When the number of trajectories is small (i.e., n = 25), the simulated coverage probability is slightly smaller than the claimed value, 0.95, especially when the length of the trajectory, T, is small. It can be seen that the coverage probability slightly improves when T increases. When n = 40, the coverage probability becomes closer to 0.95 as desired. Overall, the simulation result demonstrates the validity of the inference and the selection procedure for the tuning parameters. It suggests that it is necessary to perform a small-sample correction when both n and T are small. This is left for future work.

Table 1: Coverage probability of the 95% confidence interval and MAD (mean absolute deviation) over 500 repetitions. Case 1: policy evaluation of  π1. Case 2: policy evaluation of  π2. Case 3: policy comparison between  π1 and π2.

image

We apply the method to the data collected in the first study in HeartSteps (Klasnja et al. 2015, Liao et al. 2016, Klasnja et al. 2019). Below we refer to this study by HS1 for simplicity. HS1 was a 42-day MRT with 44 healthy sedentary adults. We focus on the activity suggestion intervention component. There were five individual-specified times in a day which were roughly separated by 2.5 hours and corresponded to the individuals morning commute, mid-day, mid-afternoon, evening commute, and post-dinner times. At each decision time, an activity suggestion was sent with a fixed probability 0.6 only if the participants were considered to be available for treatment. For example, the participants were considered unavailable when they were currently physically active (e.g., walking or running) or driving a vehicle. The activity suggestions were intended to motivate near-time walking. Each participant wore a Jawbone wrist tracker and the minute-level step count data was recorded.

We construct the state based on the participant’s step count data (e.g., the 30-min

step count prior to the decision time and the total step count from yesterday), location, temperature and number of the notifications received over the last seven days. We also include in the state the time slot index in the day (1 to 5) and the indicator measuring how the step count varies at the current time slot over the last seven days. The reward is formed by the log transformation of the total step count collected in 30-min window after the decision time. The log transformation is performed as the step count data is positively skewed (Klasnja et al. 2019). The step count data might be missing because the Jawbone tracker recorded data only when there were steps occurred. We use the same imputation procedure as in Klasnja et al. (2019). The state related to the step count are constructed based on the imputed step counts. The variables in the state are chosen to be predictive of the reward. In particular, each variable is selected, at the significance level of 0.05, based on a marginal Generalized Estimating Equation (GEE) analysis. In the analysis, we exclude seven participants’ data as in the primary analysis in Klasnja et al. (2019) (three due to technical issues and four due to early dropout). In addition, from the 37 participants’ data we exclude the decision times when participants were traveling abroad or experiencing technical issues or when the reward (i.e., post 30-min step count) is considered as missing (see Klasnja et al. (2019) for details).

We consider three target policies. The first policy,  πnothing, is “do nothing”. The second policy,  πalways, is the “always treat” policy. Recall that in HeartSteps the activity suggestion can be sent only when the participant is available. So here the “always treat” policy refers to always send the suggestion whenever the participant is available. The third policy,  πlocation, isbased on the location. Specifically, we consider the policy that sends the activity suggestion when the participant is at either home or work location and available. This policy is of interest because people at home or work are in a more structured environment and thus might be able to better respond to an activity suggestion as compared with at other locations. In HS1, about 44% of the available decision times were at times that the participants were at their home or work location. Thus the policy,  πlocation, is different from the “always treat” policy,  πalways.

In the implementation, we use the RKHS with the radial basis function kernel to form the function classes, Q and G. The tuning parameters are selected based on the procedure described in Section 6. The estimated average reward of the location-based policy,  πlocation,is 3.155 with the 95% confidence interval , [2.893, 3.417], which is slightly better than the “do nothing” policy. Specifically, the estimated average reward of  πnothing is 2.962 and the 95% confidence interval of the difference,  ηπlocation − ηπnothing, is [−0.016, 0.402]. Translating back to the raw step count as in Klasnja et al. (2019), the location-based policy is able to increase the average 30-min step count roughly by 22% (i.e., exp(3.16 − 2.96) − 1 = 1.22),corresponding to 55 steps (the mean post-decision time step count is 248 across all decision times in the dataset). However if we compare the “always treat” policy (ˆηπalways = 3.127,95% confidence interval is [2.840, 3.413]) with the location-based policy,  πlocation, we see no indication that providing treatment only at home or work is better than always providing treatment (the 95% confidence interval of  ηπlocation − ηπalways is [−0.161, 0.217]). Recall that the sample size for this study is n = 37 thus this non-significant finding may be due to the small sample.

In this work we developed a flexible method to conduct inference about the the long-term average outcomes for given target policies using data collected from a possibly different behavior policy. We believe that this is an important first step towards developing data-based just-in-time adaptive interventions. Below we discuss some directions for future research.

In many MRT studies, a natural choice of the proximal outcome to assess the effectiveness of the intervention is binary. For example, in the Substance Abuse Research Assistance study (Rabbi et al. 2018), the proximal outcome was whether the individual completed a daily survey. An interesting open question is how to extend the method to the binary reward setting, which would require carefully choosing the model to represent the relative value function and/or the loss functions used in estimating the Bellman error and solving the Bellman equation.

Non-stationarity occurs mainly because of the unobserved aspects of the current state (e.g., the engagement and/or burden) in many mHealth applications. It will be interesting to generalize the average reward framework to incorporate the non-stationarity detected in the observed trajectory. Alternatively, one can consider evaluating the treatment policy in the indefinite horizon setting where there is an absorbing state (akin to the individual disengaging from the mobile app) and thus we aim to conduct inference about the expected total rewards until the absorbing state is reached.

We focused on evaluating and contrasting multiple pre-specified treatment policies. An important next step is to extend the method to learn the optimal policy that would lead to the largest long-term average reward and to develop the inferential methods to assess the usefulness of certain variables in the policy.

This work was supported by National Institute on Alcohol Abuse and Alcoholism (NIAAA) of the National Institutes of Health under award number R01AA23187, National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award numbers P50DA039838 and R01DA039901, National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under award number U54EB020404, National Cancer Institute (NCI) of the National Institutes of Health under award number U01CA229437, and National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health under award number R01HL125440. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Antos, A., Szepesv´ari, C. & Munos, R. (2008), ‘Learning near-optimal policies with bellman- residual minimization based fitted policy iteration and a single sample path’, Machine Learning 71(1), 89–129.

Bradtke, S. J. & Barto, A. G. (1996), ‘Linear least-squares algorithms for temporal difference learning’, Machine learning 22(1-3), 33–57.

Chakraborty, B. & Moodie, E. (2013), Statistical methods for dynamic treatment regimes, Springer.

Dempsey, W., Liao, P., Kumar, S., Murphy, S. A. et al. (2020), ‘The stratified micro- randomized trial design: sample size considerations for testing nested causal effects of time-varying treatments’, Annals of Applied Statistics 14(2), 661–684.

Donald, S. G., Newey, W. K. et al. (1994), ‘Series estimation of semilinear models’, Journal of Multivariate Analysis 50(1), 30–40.

Farahmand, A.-m., Ghavamzadeh, M., Szepesv´ari, C. & Mannor, S. (2016), ‘Regularized policy iteration with nonparametric function spaces’, The Journal of Machine Learning Research 17(1), 4809–4874.

Farahmand, A.-m. & Szepesv´ari, C. (2011), ‘Model selection in reinforcement learning’, Machine learning 85(3), 299–332.

Farajtabar, M., Chow, Y. & Ghavamzadeh, M. (2018), More robust doubly robust off-policy evaluation, in ‘International Conference on Machine Learning’, pp. 1447–1456.

Gy¨orfi, L., Kohler, M., Krzyzak, A. & Walk, H. (2006), A distribution-free theory of non-parametric regression, Springer Science & Business Media.

Hern´andez-Lerma, O. & Lasserre, J. B. (1999), Further topics on discrete-time Markov control processes, Vol. 42, Springer.

Howard, R. A. (1960), ‘Dynamic programming and markov processes.’.

Jiang, N. & Li, L. (2015), ‘Doubly robust off-policy value evaluation for reinforcement learn- ing’, arXiv preprint arXiv:1511.03722 .

Kallus, N. & Uehara, M. (2019), Intrinsically efficient, stable, and bounded off-policy evalu- ation for reinforcement learning, in ‘Advances in Neural Information Processing Systems’, pp. 3320–3329.

Kizakevich, P. N., Eckhoff, R., Weger, S., Weeks, A., Brown, J., Bryant, S., Bakalov, V., Zhang, Y., Lyden, J. & Spira, J. (2014), ‘A personal health information toolkit for health intervention research’, Stud Health Technol Inform 199, 35–39.

Klasnja, P., Hekler, E., Shiffman, S., Boruvka, A., Almirall, D., Tewari, A. & Murphy, S. (2015), ‘Micro-randomized trials: An experimental design for developing just-in-time adaptive interventions.’, Health Psychology 34(S), 1220.

Klasnja, P., Smith, S., Seewald, N. J., Lee, A., Hall, K., Luers, B., Hekler, E. B. & Murphy, S. A. (2019), ‘Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of heartsteps’, Annals of Behavioral Medicine 53(6), 573– 582.

Lee, J.-A., Choi, M., Lee, S. A. & Jiang, N. (2018), ‘Effective behavioral intervention strate- gies using mobile health applications for chronic disease management: a systematic review’, BMC medical informatics and decision making 18(1), 12.

Liao, P., Dempsey, W., Sarker, H., Hossain, S. M., al’Absi, M., Klasnja, P. & Murphy, S. (2018), ‘Just-in-time but not too much: Determining treatment timing in mobile health’, Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 2(4), 179.

Liao, P., Klasjna, P., Tewari, A. & Murphy, S. (2016), ‘Micro-randomized trials in mhealth’, Statistics in Medicine 35(12), 1944–71.

Liu, Q., Li, L., Tang, Z. & Zhou, D. (2018), Breaking the curse of horizon: Infinite-horizon off-policy estimation, in ‘Advances in Neural Information Processing Systems’, pp. 5356– 5366.

Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E. & Kosorok, M. R. (2020), ‘Estimating dynamic treatment regimes in mobile health using v-learning’, Journal of the American Statistical Association 115(530), 692–706.

Mahadevan, S. (1996), ‘Average reward reinforcement learning: Foundations, algorithms, and empirical results’, Machine learning 22(1-3), 159–195.

Murphy, S. A., Deng, Y., Laber, E. B., Maei, H. R., Sutton, R. S. & Witkiewitz, K. (2016), ‘A batch, off-policy, actor-critic algorithm for optimizing the average reward’, arXiv preprint arXiv:1607.05047 .

Murphy, S. A., van der Laan, M. J., Robins, J. M. & Group, C. P. P. R. (2001), ‘Marginal mean models for dynamic regimes’, Journal of the American Statistical Association 96(456), 1410–1423.

Nahum-Shani, I., Smith, S. N., Spring, B. J., Collins, L. M., Witkiewitz, K., Tewari, A. & Murphy, S. A. (2018), ‘Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support’, Annals of Behavioral Medicine 52(6), 446–462.

Ortner, R. & Ryabko, D. (2012), Online regret bounds for undiscounted continuous reinforce- ment learning, in ‘Advances in Neural Information Processing Systems’, pp. 1763–1771.

Puterman, M. L. (1994), ‘Markov decision processes: Discrete stochastic dynamic program- ming’.

Rabbi, M., Kotov, M. P., Cunningham, R., Bonar, E. E., Nahum-Shani, I., Klasnja, P., Walton, M. & Murphy, S. (2018), ‘Toward increasing engagement in substance use data collection: development of the substance abuse research assistant app and protocol for a microrandomized trial using adolescents and emerging adults’, JMIR research protocols 7(7), e166.

Steinwart, I. & Christmann, A. (2008), Support vector machines, Springer Science & Business Media.

Sutton, R. S. & Barto, A. G. (2018), Reinforcement learning: An introduction, MIT press.

Thomas, P. & Brunskill, E. (2016), Data-efficient off-policy policy evaluation for reinforce- ment learning, in ‘International Conference on Machine Learning’, pp. 2139–2148.

Van de Geer, S. (2000), Empirical Processes in M-estimation, Vol. 6, Cambridge university press.

Van Roy, B. (1998), Learning and value function approximation in complex decision pro- cesses, PhD thesis, Massachusetts Institute of Technology.

Zhao, T., Cheng, G. & Liu, H. (2016), ‘A partially linear framework for massive heteroge- neous data’, Annals of statistics 44(4), 1400.


Designed for Accessibility and to further Open Science