Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy

2019·arXiv

Abstract

Acknowledgments

The research reported here was supported by NSF CAREER award, ONR Young Investigator award and Microsoft faculty fellowship.

Acerbi, C., and Tasche, D. 2002. On the coherence of expected shortfall. Journal of Banking & Finance 26(7):1487– 1503.

Artzner, P.; Delbaen, F.; Eber, J.-M.; and Heath, D. 1999. Coherent measures of risk. Mathematical finance 9(3):203– 228.

Bastani, M. 2014. Model-free intelligent diabetes management using machine learning.

Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 1471–1479.

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 449–458. JMLR. org.

Berkenkamp, F.; Turchetta, M.; Schoellig, A.; and Krause, A. 2017. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, 908–918.

Brafman, R. I., and Tennenholtz, M. 2002. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3(Oct):213–231.

Castro, P. S.; Moitra, S.; Gelada, C.; Kumar, S.; and Bellemare, M. G. 2018. Dopamine: A Research Framework for Deep Reinforcement Learning.

Chow, Y., and Ghavamzadeh, M. 2014. Algorithms for cvar optimization in mdps. In Advances in neural information processing systems, 3509–3517.

Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, 1522–1530.

Clarke, W., and Kovatchev, B. 2009. Statistical tools to analyze continuous glucose monitor data. Diabetes technology & therapeutics 11(S1):S–45.

Dabney, W.; Ostrovski, G.; Silver, D.; and Munos, R. 2018a. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923.

Dabney, W.; Rowland, M.; Bellemare, M. G.; and Munos, R. 2018b. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence.

Dann, C.; Li, L.; Wei, W.; and Brunskill, E. 2018. Policy cer-tificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056.

Delage, E., and Mannor, S. 2010. Percentile optimization for markov decision processes with parameter uncertainty. Operations research 58(1):203–213.

Dilokthanakul, N., and Shanahan, M. 2018. Deep reinforcement learning with risk-seeking exploration. In International Conference on Simulation of Adaptive Behavior, 201–211. Springer.

Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Ernst, D.; Stan, G.-B.; Goncalves, J.; and Wehenkel, L. 2006. Clinical data based optimal sti strategies for hiv: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, 667–672. IEEE. Geramifard, A.; Dann, C.; Klein, R. H.; Dabney, W.; and How, J. P. 2015. Rlpy: a value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research 16(46):1573–1578.

Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1311–1320. JMLR. org.

Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal regret bounds for reinforcement learning.

Koller, T.; Berkenkamp, F.; Turchetta, M.; and Krause, A. 2018. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), 6059–6066. IEEE.

Man, C. D.; Micheletto, F.; Lv, D.; Breton, M.; Kovatchev, B.; and Cobelli, C. 2014. The uva/padova type 1 diabetes simulator: new features. Journal of diabetes science and technology 8(1):26–34.

Mavrin, B.; Yao, H.; Kong, L.; Wu, K.; and Yu, Y. 2019. Distributional reinforcement learning for efficient exploration.

In International Conference on Machine Learning, 4424– 4434.

Moerland, T. M.; Broekens, J.; and Jonker, C. M. 2018. The potential of the return distribution for exploration in rl. arXiv preprint arXiv:1806.04242.

Ostrovski, G.; Bellemare, M. G.; van den Oord, A.; and Munos, R. 2017. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2721–2730. JMLR. org.

Rockafellar, R. T.; Uryasev, S.; et al. 2000. Optimization of conditional value-at-risk. Journal of risk 2:21–42.

Rowland, M.; Bellemare, M. G.; Dabney, W.; Munos, R.; and Teh, Y. W. 2018. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163.

Sato, M.; Kimura, H.; and Kobayashi, S. 2001. Td algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence 16(3):353–362.

Shapiro, A.; Dentcheva, D.; and Ruszczy´nski, A. 2009. Lectures on stochastic programming: modeling and theory. SIAM.

Strehl, A. L., and Littman, M. L. 2008. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences 74(8):1309–1331.

Tamar, A., and Mannor, S. 2013. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697.

Tamar, A.; Chow, Y.; Ghavamzadeh, M.; and Mannor, S. 2015. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 1468– 1476.

Tamar, A.; Glassner, Y.; and Mannor, S. 2015. Optimizing the cvar via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Thomas, P., and Learned-Miller, E. 2019. Concentration inequalities for conditional value at risk. In International Conference on Machine Learning, 6225–6233.

Weissman, T.; Ordentlich, E.; Seroussi, G.; Verdu, S.; and Weinberger, M. J. 2003. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. Xie, J. 2018. A type-1 diabetes simulator, simglucose v0.2.1.

Zhu, S., and Fukushima, M. 2009. Worst-case conditional value-at-risk with application to robust portfolio management. Operations research 57(5):1155–1168.

Experimental Details

Machine Replacement Machine replacement environment consist of n states (we use n = 25 in the experiment), where action replace transitions to the terminal state with cost at state t, where and . Action don’t replace has cost and transitions to state t + 1. In our experiment we used . However, for the last state n action don’t replace has cost , where we used , and transitions to the terminal state. For C51 algorithm we use , learning rate 0.01 and 51 atoms.

Tuning: We use -greedy with schedule (start, end, n) = (0.9, 0.1, 5000) that starts with start and decays linearly to end in n time steps, staying constant afterwards. This schedule achieved the best performance in our experiments when compared to other linear schedules {(0.9, 0.3, 5000), (0.9, 0.1, 10000), (0.9, 0.1, 15000), (0.9, 0.05, 5000)}, and exponential decays with schedule in the form of ((0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)} where . We have also tried our algorithm with optimism values of c = [0.25, 0.5, 1, 2].

For the actor critic method we use the CVaR limit as -10, radial basis function as kernel and other set of hyper-parameters are as described in the appendix of (Chow and Ghavamzadeh 2014).

Additional Experiments: Additional to the risk level , we observe the same gain in the performance for other risk levels. As shown in figure 7, optimism based exploration shows a significant gain over greedy exploration for risk levels and .

Type 1 Diabetes Simulator An open source implementation of type 1 diabetes simulator (Xie 2018) simulates 30 different virtual patients, 10 child, 10 adolescent and 10 adult. For our experiments in this paper we have used adult#003, adult#004 and adult#005. Additionally we have used "Dexcom" sensor for CGM (to measure blood glucose level) and "Insulet" as a choice of insulin pump. All simulations are 10 hours for each patient and after 10 hour, patient resets to the initial state. Each step of simulation is 3 minutes.

State space is a continous vector of size 2 (glucose level, meal size) where glucose level is the amount of glucose measured by "Dexcom" sensor and meal size is the amount of Carbohydrate in each meal.

Action space is defined as (bolus, basal=0) where amount of bolus injection discretized by 6 bins between 30 (max bolus, a property of the "Insulet" insulin pump) and 0 (no injection). Additionally we inject two source of stochasticity to the taken action, assume action at time t is the agent’s decision, then we take the action at time where:

Where is drawn from the power law distribution and c = 5. Note that this means delay the action at most 5 step where the probability of taking the action at time t is higher than time following the power law. Since each step of simulation is 3 minutes, patient might take the insulin up to 15 minutes after the prescribed time by the agent.

Reward structure is defined similar to the prior work (Bastani 2014) as following:

Where which is the estimate of bg (blood glucose) in mmol/L rather than mg/dL. Additionally if the amount of glucose is less than 39 mg/dL agent incurs a penalty of .

0 2000 4000 6000 8000 10000 12000 Episodes

0 2000 4000 6000 8000 10000 12000 Episodes

Figure 7: Machine Replacement with different risk levels. Left: risk level , Right: risk level

We generated a meal plan scenario for all the patients that is meal of size 60, 20, 60, 20 CHO with the schedule 1 hour, 3 hours, 5 hours and 7 hours after starting the simulation. Notice that this will make the simulation horizon 200 steps and 5 actionable steps (initial state, and after each meal).

Categorical Distributional RL: The C51 model consist of 2 hidden layers each of size 32 and ReLU activation function, followed by of |A| each with 51 neurons followed by a softmax activation, for representing the distribution of each action.

We used Adam optimizer with learning rate and . We set , 51 probability atoms, and used batch size of 32. For computing the CVaR we use 50 samples of the return.

Density Model: For log likelihood density model we used realNVP (Dinh, Sohl-Dickstein, and Bengio 2016) with 3 layers each of size 64. The input of the model is a concatenated vector of (s, a). We used same hyper parameters for optimizer as in C51 model. We have used constant for computing the pseudo-count.

Tuning: We have tuned our method and -greedy on patient adult#001 and used the same parameters for the other patients. We tried 5 different linear schedule of -greedy, {(0.9, 0.1, 2), (0.9, 0.05, 4), (0.9, 0.05, 6), (0.9, 0.3, 4), (0.9, 0.3, 4), (0.9, 0.05, 10)} where first element is the initial , second element is the final and the third element is episode ratio (i.e. epsilon starts from initial and reaches to the final value in episode ratio fraction of total number of episodes, linearly). Additionally we have tried 5 different exponential decay schedule for -greedy in the form of ((0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)} where . The first of the exponential decay set preformed the best. We have also tested our algorithm with constant optimism values of [0.2, 0.4, 0.5, 0.8, 1, 2, 5] where we picked the best value 0.5.

HIV Simulator The environment is an implementation of the physical model described in (Ernst et al. 2006). The state space is of dimension 6 with and action space is of size 4, indicating the efficacy of being on the treatment. described in (Ernst et al. 2006) takes values as and . We have also added the stochasticity to the action by a random gaussian noise. So the efficacy of a drug is computed as .

The reward structure is defined similar to the prior work (Ernst et al. 2006). And we simulate for 1000 time steps, where agent can take action in 50 steps (each 20 simulation step) and actions remains constant in each interval. While trianing we normalize the reward by dividing them by 1e6.

Categorical Distributional RL: The C51 model consist of 4 hidden layers each of size 128 and ReLU activation function, followed by of |A| each with 151 neurons followed by a softmax activation, for representing the distribution of each action.

We used Adam optimizer with learning rate decay schedule from to in half a number of episodes, , and . We set , 151 probability atoms, and used batch size of 32. For computing the CVaR we use 50 samples of the return.

Implicit Quantile Network: IQN model consists of 4 hidden layers with size 128 and ReLU activation. Then an embedding of size 64 computed by ReLU. Then we take the element wise multiplication of the embedding and the output of 4 hidden layers, followed by a fully connected layer with size 128 and ReLU activation, and a softmax layer. We used 8 samples for N and and 32 quantiles.

Density Model: For log likelihood density model we used realNVP (Dinh, Sohl-Dickstein, and Bengio 2016) with 3 layers each of size 64. The input of the model is a concatenated vector of (s, a). We used same hyper parameters for optimizer as in C51 model. We have used constant for computing the pseudo-count.

Tuning: We have tuned our method, -greedy and IQN. For -greedy we tried 5 different linear schedule of -greedy, {(0.9, 0.05, 10), (0.9, 0.05, 8), (0.9, 0.05, 5), (0.9, 0.05, 4), (0.9, 0.05, 2)} where first element is the initial , second element is the final and the third element is episode ratio (i.e. epsilon starts from initial and reaches to the final value in episode ratio fraction of total number of episodes, linearly). Additionally we have tried 5 different exponential decay schedule for -greedy and IQN in the form of ((1.0, 0.9, 10), (1.0, 0.9, 100), (1.0, 0.9, 500), (1.0, 0.99, 10), (1, 0.99, 100)} where . The first of the linear decay set preformed the best. We have also tested our algorithm with constant optimism values of (0.2, 0.4, 0.5, 0.8, 1, 2, 5) where we picked the best value 0.8.

Theoretical Analysis

Proposition 3 (Restatement of Proposition 1). For any c, the operator is a non-expansion in the Cram´er distance . This implies that optimistic distributional Bellman backups and the projected version are -contractions in and iterates of these operators converge in to a unique fixed-point.

Proof. Consider , any state and action with CDFs and and consider the application of

the optimism operator :

Generally, for any we have

and applying this case-by-case bound to the quantity in the integral above, we get

By taking the square root on both sides as well as a max over states and actions, we get that is a non-expansion in . The rest of the statement follows from the fact that is a -contraction and a non-expansion (Rowland et al. 2018) and the Banach fixed-point theorem.

Theorem 4 (Restatement of Theorem 2). Let the shift parameter in the optimistic operator be sufficiently large which is . Then with probability at least , the iterates converges for any risk level and initial to an optimistic estimate of the policy’s conditional value at risk. That is, with probability at least ,

Proof. By Lemma 8 and Lemma 3 by (Bellemare, Dabney, and Munos 2017), we know that converges to a unique fixed-point , independent of the initial . Hence, without loss of generality, we can choose .

We proceed by first showing how our result will follow under a particular definition of c, and then show what that definition is. Assume that we have obtained a value for c that satisfies the assumption of Lemma 5 (see other parts of this appendix), and let and . Then Lemma 5 implies that if for all (s, a), then also for all (s, a). Thus, for all (s, a). Finally, we can use Lemma 8 to obtain the desired result of our proof statement, for all (s, a).

Going back, we use concentration inequalities to determine the value of c that ensures the required condition in Lemma 5 (expressed in Eq. (19)). The DKW-inequality which give us that for any (s, a) with probability at least

Further, the inequality by (Weissman et al. 2003) gives that

Combining both with a union bound over all state-action pairs, we get that it is sufficient to choose c = to ensure that allowing us to apply Lemma 5.

However, we can improve this result by removing the polynomial dependency of c on the number of states |S| as follows. Consider a fixed (s, a) and denote where we set . Our goal is to derive

a concentration bound on that is tighter than the bound derived from

. Note that v is not a random quantity and hence is a normalized sum of independent random variables for any x. To deal with the continuous variable x which prevents us from applying a union bound over x directly, we use a covering argument. Let be arbitrary and consider the discretization set

Define as the discretization of v at the discretization points in . This construction ensures that the discretization error is uniformly bounded by 1/K, that is, holds for all and . Hence, we can bound for all

where in 1 , we applied Hoeffding’s inequality to the first term in combination with a union bound over as can only take values in . The second term was bounded with H¨older’s inequality and in 2 the concentration inequality from Eq. (12) was used. Combining this bound with Eq. (11) by applying a union bound over all states and actions, we get that picking

is sufficient to apply Lemma 5. Since is non-decreasing, the size of the discretization set is at most and by picking , we see that is sufficient.

Lemma 5. Let be such that for all (s, a). Assume finitely many states and actions . Let and be the empirical reward distributions and transition probabilities. Assume further that

where is the shift in the optimism operator . Then for all (s, a). Note that it is sufficient to pick to ensure the Assumption in Eq. (19).

Proof. We start with some basic identities which follow directly from the definition of CDFs

where the integrals are Lebesgue-Stieltjes integrals and understood to be taken over . We omit the limits in the following to unclutter notation. These identities allow us to derive expressions for and :

Here we exchanged the finite sum with the integral by linearity of integrals. Using these identities, we will show that for all x. Consider any fixed x and first the case where the max in is attained by 0. In this case because CDFs take values in [0, 1]. For the second case, we combine (26) and (29) to write

In the following, we consider each of the terms (30)–(32) separately. Let us start with (30) and bound

where the identity follows from the basic identities in Eq. (20) and the inequality from the assumption that for all (s, a). Hence, the term in Eq. (30) is always non-positive. Moving on to the term in Eq. (31) which we bound with similar tools as

This yields that Eq. (31) is bounded by this as well

Finally, the last term from Eq. (32) can be bounded as follows

which is non-positive as long as which completes the proof.

Technical lemmas Lemma 6. Let F and G be the CDFs of two non-negative random variables and let be a maximizing value of in the definition of and respectively. Then:

Proof. Assume w.l.o.g. that . Denote by any maximizing value of in the definition of . By Lemma 4.2 and Equation (4.9) in (Acerbi and Tasche 2002), a possible value of is . Then we can write the differences in CVaR as

Using Lemma 7 in Equation (48) gives

We can in full analogy upper-bound and arrive at the desired statement.

Lemma 7. Let F be a CDF of a bounded non-negative random variable and be arbitrary. Then . Hence, one can write the conditional value at risk of a variable for any CDF F with F(0) = 0 as

Proof. We rewrite as follows

1=νF(ν) − EF

2=νF(ν) −� ∞

where 1 follows from which holds for any and 2 uses Tonelli’s theorem to exchange the two integrals. Plugging this identity into

and taking the sup over gives the desired result.

Lemma 8. Let G and F be CDFs of non-negative random variables so that . Then for any , we have .

Proof. Consider now the following difference

By Lemma 7, we have that

Let denote a value of that achieves the supremum in (which exists by Lemma 4.2 and Equation (4.9) in (Acerbi and Tasche 2002)). Then

Lemma 9. Consider a sequence of CDFs on with that converges in distance to as . Then as .

Proof. Consider the sequence of Wasserstein distance between the CDFs:

Where the second line follows by H¨older’s inequality. The right hand side goes to 0 as , which implies convergence in Wasserstein distance (p=1). Finally, using Lemma 6, convergence of CVaR follows.

Designed for Accessibility and to further Open Science