b

DiscoverSearch
About
My stuff
Learning with Abandonment
2018·arXiv
Abstract
Abstract

Consider a platform that wants to learn a personalized policy for each user, but the platform faces the risk of a user abandoning the platform if she is dissatisfied with the actions of the platform. For example, a platform is interested in personalizing the number of newsletters it sends, but faces the risk that the user unsubscribes forever. We propose a general thresholded learning model for scenarios like this, and discuss the structure of optimal policies. We describe salient features of optimal personalization algorithms and how feedback the platform receives impacts the results. Furthermore, we investigate how the platform can efficiently learn the heterogeneity across users by interacting with a population and provide performance guarantees.

Machine learning algorithms are increasingly intermediating interactions between platforms and their users. As a result, users’ interaction with the algorithms will impact optimal learning strategies; we investigate this consequence in our work. In the setting we consider, a platform wants to personalize service to each user. The distinctive feature in this work is that the platform faces the risk of a user abandoning the platform if she is dissatisfied with the actions of the platform. Algorithms designed by the platform thus need to be careful to avoid losing users.

There are many examples of such settings. In the near future, smart energy meters will be able to throttle consumers’ energy consumption to increase effi-ciency of the power grid during peak demand, e.g., by raising or lowering the level of air conditioning. This can lead to cost savings for both utility companies and consumers. However, if the utility company is too aggressive in its throttling of energy, a user might abandon the program. Due to heterogeneity in housing, appliances and preferences of customers, it is important that utility companies learn personalized strategies for each consumer.

Content creators (e.g., news sites, blogs, etc.) face a similar problem with e-mail dissemination. There is value in sending more e-mails, but each e-mail also risks the recipient unsubscribing, taking away any opportunity of the creator to interact with the user in the future. Yet another example is that of mobile app notifications. These can be used to improve user engagement and experience. However if the platform sends too many notifications, an upset user might turn off notifications from the application.

In all of the above scenarios, we face a decision problem where “more is better;” however, there is a threshold beyond which the user abandons and no further rewards are gained. This work focuses on developing insight into the structure of optimal learning strategies in such settings. We are particularly interested in understanding when such strategies take on a “simple” structure, as we elaborate below.

In Section 2, we introduce a benchmark model of learning with abandonment. In the initial model we consider, a platform interacts with a single user over time. The user has a  threshold θdrawn from a distribution F, and at each time t = 0, 1, 2, . . . the platform chooses an action  xt. If  xtever exceeds  θ, the user abandons; otherwise, the user stays, and the platform earns some reward dependent on  xt.

We first consider the case where the distribution F and the reward function are known known (say, from prior estimation), and the challenge is finding an optimal strategy for a given new user. We consider the problem of maximizing expected discounted reward. Intuitively, we might expect that the optimal policy is increasing and depends on the discount factor: in particular, we might try to serve the user at increasing levels of  xtas long as we see they did not abandon. Surprisingly, our main result shows this is not the case: that in fact, the static policy of maximizing one-step reward is optimal for this problem. Essentially, because the user abandons if the threshold is ever crossed, there is no value to trying to actively learn the threshold.

In Section 3, we consider how to adapt our results when F and/or the reward function are unknown. In this case, the platform can learn over multiple user arrivals. We relate the problem to one of learning an unknown demand curve, and suggest an approach to efficiently learning the threshold distribution F and the reward function.

Finally in Section 4, we consider a more general model with “soft” abandonment: after a negative experience, users may not abandon entirely, but continue with the platform with some probability. We characterize the structure of an optimal policy to maximize expected discounted reward on a per-user basis; in particular, we find that the policy adaptively experiments until it has sufficient confidence, and then commits to a static action. We empirically investigate the structure of the optimal policy as well.

Related work The abandonment setting is quite unique, and we are aware of only one other work that addresses the same setting. Independently from this work, Lu et al. [2017] model the abandonment problem using only two actions; the safe action and the risky action. This naturally leads to rather different results. There are some similarities with the mechanism design literature, though there the focus is on strategic behavior by agents [Rothschild, 1974, Myerson, 1981, Farias and Van Roy, 2010, Pavan et al., 2014, Lobel and Paes Leme, 2017]. As in this work, the revenue management literature considers agents with heuristic behaviour, but the main focus is on dealing with a finite inventory [Gallego and Van Ryzin, 1994].

It may seem that our problem is closely related to many problems in reinforcement learning (RL) [Sutton and Barto, 1998] due to the dynamic structure of our problem. However, there are important differences. Our focus is on personalization; viewed through the RL lens, this corresponds to having only a single episode to learn, which is independent of other episodes (users). On the other hand, in RL the focus is on learning an optimal policy using multiple episodes where information carries over between episodes. These differences present novel challenges in the abandonment setting, and necessitate use of the structure present in this setting.

Also related is work on safe reinforcement learning, where catastrophic states need to be avoided [Moldovan and Abbeel, 2012, Berkenkamp et al., 2017]. In such a setting, the learner usually has access to additional information, for example a safe region is given. Finally, we note that in our work, unlike in safe RL, avoiding abandonment is not a hard constraint.

In this section, we formalize the problem of finding a personalized policy for a single user without further feedback.

2.1 Formal setup and notation

We consider a setting where heterogeneous users interact with a platform at discrete time steps indexed by t, and focus on the problem of finding a personalized policy for a single user. The user is characterized by sequence of hidden thresholds  {θt}∞t=0jointly drawn from a known distribution that models the heterogeneity across users. At every time t, the platform selects an action xt ∈ X ⊂ R+from a given closed set X. Based on the chosen action  xt, the platform obtains the random reward  Rt(xt) ≥0. The expected reward of action x is given by r(x) =  E(Rt(x)) < ∞, which we assume to be stationary and known to the platform.1While not required for our results, we expect r to be increasing. When the action exceeds the threshold at time t, the process stops. More formally, let T be the stopping time that denotes the first time the  xtexceeds the threshold  θt:

image

The goal is to find a sequence of actions  {xt}∞t=0that maximizes:

image

where  γ ∈(0, 1) denotes the discount factor. We note that this expectation is well defined even if  T = ∞, since  γ <1. We focus here on the discounted expected reward criterion. An alternative approach is to consider maximizing average reward on a finite horizon; considering this problem remains an interesting direction for future work.

2.2 Optimal policies

Without imposing further restrictions on the structure of the stochastic threshold process, the solution is intractable. Thus, we first consider two extreme cases: (1) the threshold is sampled at the start and then remains fixed across time; and (2) the thresholds are independent across time. Thereafter, we look at the robustness of the results when we deviate from these extreme scenarios.

Fixed threshold We first consider a case where the threshold is sampled at the beginning of the horizon, but then remains fixed. In other words, for all t, θt = θ ∼ F. Intuitively, we might expect that the platform might try to gradually learn this threshold, by starting with  xtlow and increasing it as long as the user does not abandon. In fact, we find something quite different: our main result is that the optimal policy is a constant policy.

Proposition 1. Suppose the function and the function  x → r(x)(1  − F(x)) has a unique optimum  x∗ ∈ X. Then, the optimal policy is  xt = x∗for all t.

All proofs can be found in the supplemental material.

We sketch an argument why there exists a constant policy that is optimal. Consider a policy that is increasing and suppose it is optimal.2Then there exists a time t such that  xt = y < xt+1 = z. Compare these two actions with the policy that would use action z at both time periods. First suppose  θ < y; then the user abandons under either alternative and so the outcome is identical. Now consider  θ ≥ y; then by the optimality of the first policy, given knowledge that  θ ≥ y, it is optimal to play z. But that means the constant policy is at least as good as the optimal policy.

In the appendix, we provide another proof of the result using value iteration. This proof also characterizes the optimal policy and optimal value exactly (as in the proposition). Remarkably, the optimal policy is independent of the discount factor  γ.

Independent thresholds For completeness, we also note here the other extreme case: suppose the thresholds  θtare drawn independently from the same distribution F at each t. Then since there is no correlation between time steps, it follows immediately that the optimal policy is a constant policy, with a simple form.

Proposition 2. Then the optimal policy under the independent threshold assumption is  xt = x∗for all t if

image

is the unique optimum.

Robustness So far, we have considered two extreme threshold models and have shown that constant policies, albeit different ones, are optimal. In this section we look at the robustness of those results by understanding what happens when we interpolate between the two sides by considering an additive noise threshold model. Here, the threshold at time t consists of a fixed element and independent noise:  θt = θ + εt, where  θ ∼ Fis drawn once, and the noise terms are drawn independently. In general, the optimal policy in this model is increasing and intractable because the posterior over  θnow depends on all previous actions. However, there exists constant policies that are close to optimal in case the noise terms are either small or large, reflecting our preceding results in the extreme cases.

First consider the case where the noise terms are small. In particular, suppose the error distribution has an arbitrary distribution over a small interval [−y, y].

Proposition 3. Suppose  εt ∈ [−y, y] and the reward function r is L-Lipschitz. Then there exists a constant policy with value  Vcsuch that

image

where  V ∗is the value of the optimal policy for the noise model, and  x∗is the optimal constant policy for the noiseless case.

This result follows from comparing the most beneficial and detrimental scenarios;  εt = yand  εt = −yfor all t, respectively, and nothing that in both cases the optimal policies are constant policies, because thresholds are simply shifted. We can then show that the optimal policy for the worst scenario achieves the gap above compared to the optimal policy in the best case. The details can be found in the appendix.

Similarly, when the noise level is sufficiently large with respect to the threshold distribution F there also exists a constant policy that is close to optimal. The intuition behind this is as follows. First, if the noise level is large, the platform receives only little information at each step, and thus cannot efficiently update the posterior on  θ. Furthermore, the high variance in the thresholds also reduces the expected lifetime of any policy. Combined, these two factors make learning ineffective.

We formalize this by comparing a constant policy to an oracle policy that knows  θbut not the noise terms  εt. Let G be the CDF of the noise distribution εtwith ¯G denoting its complement: ¯G(y) = 1  − G(y). Then we note that for a given threshold  θ, the probability of survival is ¯G(x − θ), and thus the expected value for the constant policy  xt = xfor all t is

image

Define the optimal constant policy given knowledge of the fixed part of the threshold,  θby  x(θ):

image

We note that v is non-decreasing in  θ′. We assume that v is  Lv-Lipschitz:

image

for all  θand s. Note that noise distributions G that have high variance lead to a smaller Lipschitz constant.

To state our result in this case, we define an  η-cover, which is a simple notion of the spread of a distribution.

Definition 1. An interval (l, u) provides an  ηcover for distribution F if  F(u)−F(l) > η.

In other words, with probability as least 1  − η, a random variable drawn from distribution F lies in the interval (l, u).

Proposition 4. Assume r is bounded, and X is a continuous and connected space. Suppose v defined above is  Lv-Lipschitz, and there exists an  η-cover for threshold distribution  Fθwith width  w = u − l. Then the constant policy xt= l+u2with expected value  Vθsatisfies

image

The shape of v, and in particular its Lipschitz constant  Lvdepend on the threshold distribution F and reward function r. As the noise distribution G “widens”,  Lvdecreases. As a result, the bound above is most relevant when the variance of G is substantial relative to spread of F.

To summarize, our results show that in the extreme cases where the thresholds are drawn independently, or drawn once, there exists a constant policy that is optimal. Further, the class of constant policies is robust when the joint distribution over the thresholds is close to either of these scenarios.

Thus far, we have assumed that the heterogeneity across the population and the mean reward function are known to the platform, and we have focused on personalization for a single user. It is natural to ask what the platform should do when it lacks such knowledge, and in this section we show how the platform can learn an optimal policy efficiently across the population. We study this problem within the context of the fixed threshold model described above, as it naturally lends itself to development of algorithms that learn about populationlevel heterogeneity. In particular, we give theoretical performance guarantees on a UCB type [Auer et al., 2002] algorithm, and show that a variant based on MOSS [Audibert and Bubeck, 2009] performs better in practice. We also empirically show that an explore-exploit strategy performs well.

Learning setting We focus our attention on the fixed threshold model, and consider a setting where n users arrive sequentially, each with a fixed threshold θu (u= 1, . . . , n) drawn from unknown distribution F with support on [0, 1]. To emphasize the role of learning from users over time, we consider a stylized setting where the platform interacts with one user at a time, deciding on all the actions and observing the outcomes for this user, before the next user arrives. Inspired by our preceding analysis, we consider a proposed algorithm that uses a constant policy for each user. Furthermore, we assume that the rewards Rt(x) are bounded between 0 and 1, but otherwise drawn from an arbitrary distribution that depends on x.

Regret with respect to oracle We measure the performance of learning algorithms against the oracle that has full knowledge about the threshold distribution F and the reward function r, but no access to realizations of random variables. As discussed in Section 2, the optimal policy for the oracle is thus to play constant policy  x∗= maxx∈[0,1] r(x)(1  − F(x)). We define regret as

image

which we note is normalized on a per-user basis with respect to the discount factor  γ.

3.1 UCB strategy

We propose a UCB algorithm [Auer et al., 2002] on a suitably discretized space, and prove an upper bound on its regret in terms of the number of users. This approach is based on earlier work by [Kleinberg and Leighton, 2003][Section 3] for learning demand curves. Before presenting the details, we introduce the UCB algorithm for the standard multi-armed bandit problem.

In the standard setting, there are K arms, each with its own mean  µi. At each time t, UCB(α) selects the arm with largest index  Bi,t

image

where  ni(t) is the number of pulls of arm i at time t. We assume  Bi,t = ∞if ni(t) = 0. The following lemma bounds the regret of the UCB index policy.

Lemma 5 (Theorem 2.1 [Bubeck et al., 2012]). Suppose rewards for each arm i are independent across multiple pulls,  σ-sub-Gaussian and have mean  µi. Define i= maxj µj − µi. Then, UCB(α) attains regret bound

image

Kleinberg and Leighton [2003] adapt the above result to the problem of demand curve learning. We follow their approach: Discretize the action space and then use the standard UCB approach to find an approximately optimal action. For each user, the algorithm selects a constant action  xuand either receives reward  Ru= 0 if  xu > θuor  Ru= �∞t=0 γtRt(xu).

We need to impose the following assumptions:  θ ∈[0, 1], 0  ≤ R(x) ≤ Mfor some M > 0, and the function f(x) = r(x)D(x) = r(x)(1  − F(x)) is strongly convex and thus has a unique maximum at  x∗.

Assumption 1 (Lemma 3.11 in Leighton and Kleinberg). There exists con-

image

for all  x ∈[0, 1].

Using these assumptions, we can prove the main learning result.

Theorem 6. Suppose that f satisfies the concavity condition above. Then UCB(α) on the discretized space with  K = O�(n/log  n)1/4�arms satisfies

image

for all  α >2.

The proof consists of two parts, first we use Lemma 1 to bound the difference between the best action and the best arm in the discretized action space. Then we use Theorem 5 to show that the learning strategy has small regret compared to the best arm. Combined, these prove the result.

It is important to note that the algorithm requires prior knowledge of the number of users, n. In practice it is reasonable to assume that a platform is able to estimate this accurately, but otherwise the well-known doubling trick can be employed at a slight cost.

3.2 Lower bound

We now briefly discuss lower bounds on learning algorithms. If we restrict ourselves to algorithms that play a constant policy for each user, the lower bound in Kleinberg and Leighton [2003] applies immediately.

Proposition 7 (Theorem 3.9 in Kleinberg and Leighton [2003]). Any learning algorithm A that plays a constant policy for each user, has regret at least

image

for some threshold distribution.

Thus, the discretized UCB strategy is near-optimal in the class of constant policies.

However, algorithms with dynamic policies for users can obtain more information on the user’s threshold and therefore more easily estimate the empirical distribution function. Whether the  O(√n) lower bound carries over to dynamic policies is an open problem.

3.3 Simulations

In this section, we empirically compare the performance of the discretized UCB against other policies. For our simulations, we also include the MOSS algorithm [Audibert and Bubeck, 2009], and an explore-exploit strategy.

MOSS Audibert and Bubeck [2009] give a upper confidence bound algorithm that has a tighter regret bound in the standard multi-armed bandit problem. The MOSS algorithm is an index policy where the index for arm i is given by

image

While the policy is quite similar to the UCB algorithm, it does not suffer from an extra √log n term in the regret bound. However, we cannot adapt the bound to the abandonment setting, due to worse dependence on the number of arms. In practice, we expect this algorithm to perform better than the UCB algorithm, as it is a superior multi-armed bandit algorithm.

Explore-exploit strategy Next, we consider an explore-exploit strategy that first estimates an empirical distribution function, and then uses that to optimize a constant policy. For this algorithm, we assume that for zero reward, the learner can observe  θufor a particular user, which mimics a strategy where the learner increases its action by  εat each time period to learn the threshold  θuof a particular user with arbitrary precision. Because it directly estimates the empirical distribution function and does not require discretization, it is better able to capture the structure of our model.

The explore-exploit strategy consists of two stages.

First, obtain m samples of  θuto find an empirical estimate of F, which we denote by ˆFm

For the remaining users, play constant policy  xu= arg max r(x)(1  −ˆFm(x))

Note that compared to the previous algorithm, we assume this learner has access to the reward function, and only the threshold distribution F is unknown. If the signal-to-noise ratio in the stochastic rewards is large, this is not unrealistic: the platform, while exploring, is able to observe a large number of rewards and should therefore be able to estimate the reward function reasonably well.

Setup For simplicity, our simulations focus on a stylized setting; we observed similar results under different scenarios.3We assume that the rewards are deterministic and follow the identity function r(x) = x, and the threshold distribution (unknown to the learning algorithm) is uniform on [0, 1]. For each algorithm, we run 50 repetitions for n = 2000 time steps, and plot all cumulative regret paths. For the discretized policies, we set  K ≈ 2.5� nlog n�1/4= 12. The explore-exploit strategy first observes 20+2√n= 110 samples to estimate F, before committing to a fixed strategy.

Results The cumulative regret paths are shown in Figure 1. We observe that MOSS, while having higher variance, indeed performs better than the standard UCB algorithm, despite the lack of a theoretical bound.

However, the explore-exploit strategy obtains the lowest regret. First, since it is aware of the reward function, it has less uncertainty. More importantly, the algorithm leverages the structure of the problem because it does not discretize the action space and then treat actions independently. Finally, we note that when rewards are stochastic, the UCB and MOSS are even worse compared to explore-exploit, as they have to estimate the mean reward function, while the explore-exploit strategy assumes it is given.

In this section, we consider a “softer” version of abandonment, where the platform receives some feedback before the user abandons. As example, consider optimizing the number of push notifications. When a user receives a notification, she may decide to open the app, or decide to turn off notifications. However, her most likely action is to ignore the notification. The platform can interpret this as a signal of dissatisfaction, and work to improve the policy.

In this section, we augment our model to capture such effects. While the solution to this updated model is intractable, we discuss interesting structure

image

Figure 1: Cumulative regret plots for r(x) = x and F = U[0, 1].

that the optimal policy exhibits: partial learning, and the aggressiveness of the optimal policy.

Feedback model To incorporate user feedback, we expand the model as follows. Suppose that whenever the current action  xtexceeds the threshold (i.e., xt > θt), then with probability p we receive no reward but the user remains, and with probability 1  − pthe user abandons. Further, we assume that the platform at time t both observes the reward  R(xt), if rewarded, and an indicator  Zt = Ixt>θt. This is equivalent to assuming that a user has geometrically distributed patience; the number of times she allows the platform to cross her threshold.

As before the goal is to maximize expected discounted reward. Note that because the platform does not receive a reward when the threshold is crossed, the problem is nontrivial even when p = 1. We restrict our attention to the single threshold model, where  θis drawn once and then fixed for all time periods.

Figure 2 shows the numerically computed optimal policy when the threshold distribution is uniform on [0, 1], the reward function is r(x) = x, the probability of abandonment p = 0.5 and  γ= 0.9. Depending on whether or not a feedback signal is received, the optimal policy follows the green or the red line as we step through time from left to right.

We note that one can think of the optimal policy as a form of bisection, though it does not explore the entire domain of F. In particular it is conservative regarding users with large  θ. For example, consider a user with threshold 0.9. While the policy is initially increasing and thus partially personalizes to her threshold,  xtdoes not converge to 0.9, and in fact never comes close. We call this partial learning; in the next section, we demonstrate that this is a key feature of the optimal policy in general.

Partial learning Partial learning refers to the fact that the optimal policy does not fully reduce its uncertainty (the posterior) on  θ. Initially, the policy learns about the threshold using a bisection-type search. However, at some point (dependent on the user’s threshold), further learning is too risky and the

image

Figure 2: Visualization of optimal policy when discount factor  γ= 0.9 in the p = 0.5 model. Follow the tree from left to right, where if  Zt= 0 (reward obtained) the next action follows from following the green line, and if  Zt= 1, the optimal action is given by the point following the red line if the user has not abandoned.

optimal policy switches to a constant policy. We note that this happens even when there is no risk of abandonment at all (p = 1), because at some point even the risk of losing a reward is not offset by potential gains in getting a more accurate posterior on  θ. Partial learning occurs under some regularity conditions on the threshold distribution that ensures the posterior does not collapse, and is Lipschitz as defined in the following paragraph.

Write  F ulfor the posterior distribution over  θgiven lower bound l and upper bound u based on previous actions

image

We say the that the posterior distribution is non-degenerate if the following condition holds:

Definition 2 (Non-degenerate posterior distribution ). For all  λ >0, there exists a  νsuch that for all l, u where  u − l < ν, F ul(ε) < 1 − λεfor 0  < ε < ν.

Thus, for sufficiently small intervals, the conditional probability decreases rapidly as we move away from the lower bound of the interval. Suppose F is such that the posterior is non-degenerate and is Lipschitz in the following sense.

Assumption 2 (Lipschitz continuity of conditional distribution). There exists an  L′ >0 such that for all intervals [l, u] and all 0  < y < u − l, we have

image

We can use this assumption to show that the value function corresponding to the dynamic program that models the feedback model is Lipschitz.

Lemma 8 (Lipschitz continuity of value function). Consider a bounded action space X. If p is Lipschitz with Lipschitz constant  Lp, and the reward function r is bounded by B, there exists constant  LVsuch that for all l < u

image

Using these assumptions, we can then prove that the optimal policy exhibits partial learning, as stated in the following proposition.

Proposition 9. Suppose r is increasing,  Lr-Lipschitz, non-zero on the interior of X and bounded by B. Furthermore, assume p is non-degenerate and Lipschitz as defined above. For all  u ∈ Int(X) there exists an  ε(u) >0 such that for all l where  u − l < ε(u), the optimal action in state (l, u) is l, that is

image

Furthermore,  ε(u) is non-decreasing in u.

We prove this result by analyzing the value function of the corresponding dynamic program. The result shows that at some point, the potential gains from a better posterior for the threshold are not worth the risk of abandonment. This is especially true when  θis quite likely under the posterior. If, to the contrary, we belief the threshold is small, there is little to lose in experimentation. Note however that the result also holds for p = 1, where there are only signals and no abandonment. In this case the risk of a signal (and no reward for the current timestep), outweights (all) possible future gains. Naturally, if the probability of override is small (i.e. p is small), the condition on  λalso weakens, leading to larger intervals of constant policies.

Aggressive and conservative policies Another salient feature of the structure of optimal policies in the feedback model is the aggressiveness of the policy. In particular, we say a policy is aggressive if the first action  x0is larger than the optimal constant policy  x∗in the absence of feedback (corresponding to p = 0), and conservative if it is smaller. As noted before, when there is no feedback, there is no benefit to adapting to user thresholds. However, there is value in personalization when users give feedback.

Empirically, we find that when there is low risk of abandonment, i.e.,  p ≈1, then the optimal policy is aggressive. In this case, the optimal policy can aggressively target high-value users because other users are unlikely to abandon immediately. Thus the policy can personalize to high-value users in later periods.

However, when the risk of abandonment is large (p ≈0) and the discount factor is sufficiently close to one, the optimal policy is more conservative than the optimal constant policy when p = 0. In this case, the high risk of abandonment

image

Figure 3: The relation between the override probability p and the (approximate) optimal initial action  x0when the discount factor  γ= 0.9. The artifacts in the plot are due to the discretization error from numerical computations.

forces the policy to be careful: over a longer horizon the algorithm can extract value even from a low value user, but it has to be careful not to lose her in the first few periods. This long term value of a user with low threshold makes up for the loss in immediate reward gained from aggressively targeting users with a high threshold. Figure 3 illustrates this effect. Here, we use deterministic rewards r(x) = x and the threshold distribution is uniform F = U[0, 1], but a similar effect is observed for other distributions and reward functions as well.

When machine learning algorithms are deployed in settings where they interact with people, it is important to understand how user behavior affects these algorithm. In this work, we propose a novel model for personalization that takes into account the risk that a dissatisfied user abandons the platform.

This leads to some unexpected results. We show that constant policies are optimal under fixed threshold and independent threshold models. We have shown that under small perturbations of these models, constant policies are “robust” (i.e., perform well in the perturbed model), though in general finding an optimal policy becomes intractable.

In a setting where a platform faces many users, but does not know the reward function nor population distribution over threshold, under suitable assumptions we have shown that UCB-type algorithms perform well, both theoretically by providing regret bounds and running simulations. We also consider an explore-exploit strategy that is more efficient in practice, but it requires knowledge of

the reward function.

Feedback from users leads to more sophisticated optimal learning strategies that exhibit partial learning; the optimal learning algorithm personalizes to a certain degree to each user. Also, we have found that the optimal policy is more conservative when the probability of abandonment is high, and aggressive when that probability is low.

5.1 Further directions

There are several interesting directions of further research that are outside the scope of this work.

Abandonment models First, more sophisticated behaviour on user abandonment should be considered. This could take many forms, such as a total patience budget that gets depleted as the threshold is crossed. Another model is that of a user playing a learning strategy herself, comparing this platform to one or multiple outside options. In this scenario, the user and platform are simultaneously learning about each other.

User information Second, we have not considered additional user information in terms of covariates. In the notification example, user activity seems like an important signal of her preferences. Models that are able to incorporate such information and are able to infer the parameters from data are beyond the scope of this work but an important direction of further research.

Empirical analysis This work focuses on theoretical understanding of the abandonment model, and thus ignores important aspects of a real world system. We believe there is a lot of potential to gain additional insight from an empirical perspective using real-world systems with abandonment risk.

The authors would like to thank Andrzej Skrzypacz, Emma Brunskill, Ben van Roy, Andreas Krause, and Carlos Riquelme for their suggestions and feedback. This work was supported by the Stanford TomKat Center, and by the National Science Foundation under Grant No. CNS-1544548. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Jean-Yves Audibert and S´ebastien Bubeck. Minimax Policies for Adversarial and Stochastic Bandits. In COLT, pages 217–226, 2009.

Peter Auer, Nicol`o Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine learning, 47(2):235–256, 2002.

Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. Safe Model-based Reinforcement Learning with Stability Guarantees. In NIPS, 2017.

S´ebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems.  Foundations and Trends R⃝ inMachine Learning, 5(1):1–122, 2012.

Vivek F Farias and Benjamin Van Roy. Dynamic Pricing with a Prior on Market Response. Operations Research, 58(1):16–29, 2010.

Guillermo Gallego and Garrett Van Ryzin. Optimal Dynamic Pricing of Inven- tories with Stochastic Demand over Finite Horizons. Management science, 40 (8):999–1020, 1994.

Robert D. Kleinberg and Frank Thomson Leighton. The Value of Knowing a Demand Curve: Bounds on Regret for Online Posted-Price Auctions. In FOCS, 2003.

Ilan Lobel and Renato Paes Leme. Dynamic Mechanism Design under Positive Commitment. 2017.

Jiaqi Lu, Yash Kanoria, and Ilan Lobel. Dynamic Decision Making under Cus- tomer Abandonment Risk. MSOM, 2017.

Teodor Mihai Moldovan and Pieter Abbeel. Safe Exploration in Markov Decision Processes. CoRR, abs/1205.4810, 2012.

Roger B Myerson. Optimal Auction Design. Mathematics of operations research, 6(1):58–73, 1981.

Alessandro Pavan, Ilya Segal, and Juuso Toikka. Dynamic Mechanism Design: A Myersonian Approach. Econometrica, 82(2):601–653, 2014.

Michael Rothschild. A Two-Armed Bandit Theory of Market Pricing. Journal of Economic Theory, 9(2):185–202, 1974.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT press Cambridge, 1998.

A.1 Threshold models

Proof of Proposition 1. The proof follows from defining an appropriate dynamic program and solving it using value iteration. We will denote the state by x, denoting the best lower bound on c. In practice, if the process survives up to time t (T > t) the state is x = maxs≤t xs. Furthermore, it is convenient to use the survival function S(x) = 1  − F(x).

It is easy to see that the optimal policy is non-decreasing, so we can restrict our focus to non-decreasing policies.

The Bellman equation for the value function at state x is given by

image

For convenience we define the following transformation J(x) = S(x)V (x) and note that we can equivalently use J to find the optimal policy. We now explicitly compute the limit of value iteration to find J(x). Start with  J0(x) = 0 for all x and note that the iteration takes the form

image

We prove the following two properties by induction for all k > 0:

1.  Jk(x) =  p(x∗) �k−1i=0 γifor all  x ≤ x∗.

2.  Jk(x) < Jk(x∗) for all  x > x∗.

image

The result follows from taking the limit as  k → ∞and noting that for any state x ≤ x∗, it is optimal to jump to state  x∗(and stay there). We also immediately see that the value of the optimal policy thus is  p(x)/γ, as required.

Proof of Proposition 2. It is immediate that the optimal policy must be constant; if the process survives  xt = x, then at time t + 1 we face the same problem as at time t. So whatever action is optimal at time t, is also optimal at time t + 1. Let V (x) denote the value of playing  xt = xfor all t. Then the following relation holds

image

which leads to

image

The result now follows immediately.

A.2 Robustness

Proof of Proposition 3. First we consider the constant policy  xt = x∗ −yfor all t in the noiseless case. We note that

image

where  V (x∗) is the value of the optimal constant policy for the noise-free model.

Now let us consider the best possible noise model, then  εt = yfor all t. But this is equivalent to the noise-free model with the threshold shifted by y. Hence, we know that a constant policy is optimal. We can bound the value of this model by

image

Hence, this implies that the constant policy  xt = x∗ − yis at most 2yL1−γworse than the optimal policy for the most optimistic noise model.

Proof of Proposition 4. Let ¯θbe the midpoint of the  ηcover, c = l+u2. Now we bound the expected value of an oracle policy, i.e. a policy that knows the true threshold  θ∗as follows

image

which completes the proof.

A.3 Learning

Proof of Proposition 6. Due to the discretization, the proof consists of two parts. First, we show that the policy that plays the best arm  i∗suffers small regret with respect to the optimal policy. Then we use the UCB regret bound to show that the learning strategy has low regret with respect to the playing arm  i∗. Thus we can decompose regret into

image

where the first term corresponds to the discretization error and the second from the learning policy. Due to the time horizon and discounting, we write

Let  x∗be the optimal strategy, i.e. it maxizimes r(x)D(x). Then the discretization error from playing  i∗/K, by Assumption 1 is

image

Thus, the error due to the discretization is small.

Now let us bound the UCB regret with respect to action  i∗/K. As Kleinberg and Leighton (2003) note, the assumption that the pulls of different arms are independent is not used in the proof. Thus we can apply Lemma 5. First, we show that the arms are sub-Gaussian. Since the rewards are bounded by 1 and independent across time, straightforward calculation shows that

image

Then using the law of total variance, conditioning on the event  x < θu, the variance of the total obtained reward for user  u, Ru, can be bounded by

image

Thus we find that the reward for users is sub-Gaussian with parameter  σ= M 22. Recall the UCB regret bound

image

We now focus on the �Ki=1:∆i>0 1∆iterm. Let ∆(1) ≤ ∆(2) ≤ . . . ≤ ∆(K−1)denote the ordered gaps with respect to the optimal arm. Note that for  j ≥2, we know ∆(j) > c1( j2K)2due to Assumption 1. However, for the smallest gap, we only know 0  ≤ ∆(1) ≤ c2K2, depending how close  i∗/Kis to  x∗. We thus obtain

image

Thus regret is bounded by

image

However, the regret from due to playing the second best action is trivially bounded by  n∆(1). Thus, we can bound the worst case when ∆(1)= 4�log n/n. This leads to a bound of

image

since there are K = (n/ log  n)1/4arms, we get

image

Combining this with the bound on regretDcompletes the proof.

A.4 Feedback

Proof of Lemma 8. The Bellman equation of the dynamic program for the feedback model can be written as:

image

where l and u are the lower bounds and upper bounds on c based on the history.

Note that V is finite and therefore value iteration converges pointwise to V . We use induction on the value iterates to find the Lipschitz constant for V . Let  V0, V1, . . .indicate the value iterates. Since  V0(l, u) = 0 for all states (l, u), the Lipschitz constant for  V0, denoted by  L0= 0. We further claim that Ln+1 = Lp B1−γ+  βγLn. Suppose this is true for n = 1, . . . , i −1, then for n = i + 1 we consider state (l + ε, u) and write  x∗for the optimal action in that state, and  y∗=  x∗ − l. Then

image

Also,  V (l, x∗) ≤ V (l, u). Then we find

image

Using the Lipschitz continuity of p we can bound

image

Then note that

image

and for the final two terms we note

image

where we use the inductive assumption. Because l, u and  εare arbitrary, we see that

image

which implies V is Lipschitz.

Proof of Proposition 9. First we note that by Lemma 8, V is Lipschitz, and we write  Lvfor its Lipschitz constant. Fix u, and consider a state (u − ν, u) for some  ν >0. For notational convenience, for action x we write  y = x − (u − ν) for the difference from the lower bound. We also use the shorthand  l = u − νand p(y) = p(y | l, u). We can upperbound the value function by

image

where we write  L = Lr + γLvand use the non-degeneracy of p. The derivative for the above expression with respect to y is

image

Since r(l) > 0 for all  l ∈ Int X, for  νsufficiently small this derivative is negative for all  y ≥0. To complete the proof, we need this upperbound to be tight at y = 0, which follows immediately

image

Since r is increasing, it follows immediately that  ε(u) is non-decreasing in u.


Designed for Accessibility and to further Open Science