Identifying Distinct, Effective Treatments for Acute Hypotension with SODA-RL: Safely Optimized Diverse Accurate Reinforcement Learning

2020·Arxiv

Abstract

Hypotension in critical care settings is a life-threatening emergency that must be recognized and treated early. While fluid bolus therapy and vasopressors are common treatments, it is often unclear which interventions to give, in what amounts, and for how long. Observational data in the form of electronic health records can provide a source for helping inform these choices from past events, but often it is not possible to identify a single best strategy from observational data alone. In such situations, we argue it is important to expose the collection of plausible options to a provider. To this end, we develop SODA-RL: Safely Optimized, Diverse, and Accurate Reinforcement Learning, to identify distinct treatment options that are supported in the data. We demonstrate SODA-RL on a cohort of 10,142 ICU stays where hypotension presented. Our learned policies perform comparably to the observed physician behaviors, while providing different, plausible alternatives for treatment decisions.

Introduction

Patients in the intensive care unit (ICU) are among the sickest in the hospital, and require many different types of interventions to control and respond to their unstable physiological conditions. For instance, antibiotics are given to control infections [1], and anticoagulants are given to dialysis patients to prevent thrombosis [2]. Patients with the highest acuity may be given more aggressive and invasive interventions such as mechanical ventilation [3] as well.

In this paper, we focus on decisions to give fluid bolus therapy [4] and vasopressors [5] when treating hypotension and shock. Hypotension is associated with higher overall morbidity and mortality across several populations, including populations with sepsis [6] and populations in the emergency department [7]. However, despite the importance of addressing this problem, decision making for hypotension management is not standardized, and treating these patients effectively is challenging. Although the problem has been studied extensively [8], the choice of bolus size and timing, as well as which vasopressor to use and in what dosing regimen, is not well understood.

Reinforcement learning (RL), a branch of machine learning focused on learning how to make a sequence of decisions toward some desired outcome [9], has the potential to help us use past data to assist with these decisions. Recent applications of RL to healthcare include managing sepsis [10], schizophrenia [11], mechanical ventilation [12], and heparin dosing [13]. However, as noted in [14], quantifying the quality of a proposed treatment policy is challenging. Observational data create hard limitations on the kinds of policies that can be credibly evaluated: one cannot evaluate policies that recommend treatments that were never or rarely performed, and even when the recommended treatments have support in the observed data, the value of different choices may be impossible to statistically differentiate.

Thus, instead of attempting to identify a single optimal treatment policy from observational data (which is often impossible), in this work we focus on identifying a collection of distinct, plausible policies. Having such a collection of options can provide insights into multiple versions of treatments that may be of similar efficacy, and it also provides a step toward personalized recommendations by creating a space of reasonable treatment options. One way to think about this approach is to note that the variation that we see in clinician actions is likely to be safe: patients are typically treated conservatively to avoid iatrogenic harm. Amid this variation, our goal is to identify a collection of treatment policies that are both distinct (that is, different from each other, so as to provide a choice of options) and likely (that is, not too far from current practice). To this end, we develop SODA-RL: Safely Optimized, Diverse, and Accurate Reinforcement Learning, a technique to identify a collection of plausible high-efficacy policies. By drawing potential treatment policies from the variation in current practice (that is, actions currently taken by clinicians), we ensure that our options are likely to be safe, or at least as safe as current practice.

Our results on a cohort of hypotensive ICU patients demonstrate that all three components of SODA-RL (Safety, Diversity, and Quality/Accuracy) are necessary. The distinct policies learned by SODA-RL achieve roughly the same estimated value as the observed clinician policy, and our qualitative results suggest that the different policies do indeed pick up on real underlying options for treatments.

Background

We model the problem of hypotension management as a Markov Decision Process (MDP), a standard formalism in reinforcement learning [9]. An MDP is defined by a state space S that describes the current setting of the environment (e.g. clinical variables describing a patient’s current physiological state), and an action space A of possible actions that can be taken (e.g. treatments to administer such as IV fluids or vasopressors). The Markov in MDP refers to the assumption of Markovianity in the state transition distribution. That is, we assume that at time t, the next state $s_{t+1}$ is determined solely from the current state $s_t$ and action $a_t$, i.e. $p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_{1:t}, a_{1:t})$, where $s_{1:t}$ and $a_{1:t}$ refer to the complete history of previous states and actions. To complete the specification of the MDP, we define a discount factor $\gamma \in [0, 1]$ that balances the value of current vs. future rewards, along with a reward function r(s, a) that assesses how good the actions being taken are. For instance, the reward function might take positive values for physiologically stable states that lead to improved patient outcomes, and take negative values for states leading to physiological instability and decompensation.

We refer to a decision making strategy as a policy, and let $\pi(a \mid s)$ indicate the probability that action a is taken when in state s. In this work we focus on stochastic policies, although it is also possible to learn deterministic policies where the same action is always taken from a given state. A trajectory is a sequence of states, actions and rewards received in an interaction with the environment: $\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$. We define the value of a policy as its expected sum of future discounted rewards:

$$V^{\pi} = \mathbb{E}_{\tau \sim p_{\pi}(\tau)} \left[ \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t) \right] \tag{1}$$

where $p_{\pi}(\tau)$ denotes the distribution over trajectories generated by following the policy $\pi$ and transitioning between states according to the distribution $p(s_{t+1} \mid s_t, a_t)$.

An optimal policy is one that achieves the highest possible value (eq. 1). The field of reinforcement learning (RL) provides a suite of tools for learning an optimal policy via interactions with the environment. That is, we typically do not have direct access to the transition distribution and must instead learn by trying actions and seeing their results (e.g. giving a treatment to a patient and observing the outcome). However, such experimentation is obviously both unethical and impractical in clinical domains, as unsafe actions may be recommended. The subfield of batch RL attempts to learn policies based on previously collected trajectories (e.g. from information in the electronic health record describing the clinical states and treatments given to patients).

A key question in batch RL is off-policy evaluation, that is, how to estimate the value of a proposed policy $\pi$ given only a collection of trajectories collected according to some other (potentially suboptimal) policy $\pi_b$. One class of methods for accomplishing this relies on importance sampling, a general technique for estimating properties of a distribution of interest (e.g. the distribution of rewards if we follow our policy $\pi$), given only samples generated from a different distribution (e.g. the distribution of rewards if we follow the clinician behavior policy, $\pi_b$). In this work, we will use a state-of-the-art estimator, Consistent Weighted Per-Decision Importance Sampling (CWPDIS) [15], to estimate the value of the policies we learn using a retrospective set of N clinician behavior trajectories $\{\tau_n\}_{n=1}^{N}$:

$$\hat{V}^{\mathrm{CWPDIS}}(\pi) = \sum_{t=1}^{T} \gamma^{t-1} \, \frac{\sum_{n=1}^{N} \rho_{nt} \, r_{nt}}{\sum_{n=1}^{N} \rho_{nt}}, \qquad \rho_{nt} = \prod_{t'=1}^{t} \frac{\pi(a_{nt'} \mid s_{nt'})}{\pi_b(a_{nt'} \mid s_{nt'})} \tag{2}$$

where $s_{nt}$, $a_{nt}$, and $r_{nt}$ are the state, action, and reward of trajectory n at time t, and $\rho_{nt}$ is the corresponding importance weight.

The quality of the estimate in eq. 2 will depend on how many trajectories are effectively retained after reweighting by the importance weights $\rho_{nt}$, a quantity known as the effective sample size (ESS) [16]. Informally, the ESS gauges how many samples from the true distribution of interest would provide an estimator with similar quality. Even though the number of trajectories may be large, high variance in the distribution of the importance weights may cause the resulting estimate to be very unreliable, and only provide non-negligible weight on a few trajectories. For instance, if N = 1000 but the ESS is only 5, then our estimate using 1000 trajectories from $\pi_b$ to estimate the value of our policy $\pi$ will perform about as well as using only 5 trajectories actually collected according to $\pi$.

We focus on the ESS of the CWPDIS estimator at time T, the end of the trajectory:

$$\mathrm{ESS} = \frac{\left( \sum_{n=1}^{N} \rho_{nT} \right)^{2}}{\sum_{n=1}^{N} \rho_{nT}^{2}} \tag{3}$$

If all the importance weights are equal, it is easy to see that the ESS is simply N. In this work, we use the ESS as an indicator of the reliability of the estimate of a proposed policy’s value. If the ESS is low, then even if the value estimate is high, the proposed policy is not trustworthy and may actually not be high-quality, because that high value estimate was effectively measured from only a few trajectories.
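To make the estimator and its reliability check concrete, the following is a minimal NumPy sketch of a CWPDIS value estimate together with the end-of-trajectory ESS. It is an illustration rather than the implementation used in this work: the function name `cwpdis_and_ess`, the equal-length `(N, T)` array layout, and the convention of discounting the first step by a factor of one are all assumptions made for the example.

```python
import numpy as np


def cwpdis_and_ess(pi_probs, pib_probs, rewards, gamma=1.0):
    """CWPDIS value estimate and end-of-trajectory ESS (illustrative sketch).

    All inputs have shape (N, T): entry [n, t] holds pi(a_nt | s_nt),
    pi_b(a_nt | s_nt), and r_nt for trajectory n at step t.
    Assumes equal-length trajectories and pi_b > 0 on every observed action.
    """
    pi_probs = np.asarray(pi_probs, dtype=float)
    pib_probs = np.asarray(pib_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # rho[n, t]: cumulative product of per-step probability ratios up to step t.
    rho = np.cumprod(pi_probs / pib_probs, axis=1)

    # Self-normalized (weighted) average reward at each time step.
    per_step = (rho * rewards).sum(axis=0) / rho.sum(axis=0)
    discounts = gamma ** np.arange(rewards.shape[1])
    value = float((discounts * per_step).sum())

    # ESS computed from the final-step importance weights.
    w = rho[:, -1]
    ess = float(w.sum() ** 2 / (w ** 2).sum())
    return value, ess


# When the evaluated policy equals the behavior policy, all weights are equal:
# the estimate is the average return and the ESS is N.
same = np.full((4, 3), 0.5)
toy_rewards = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1]])
value, ess = cwpdis_and_ess(same, same, toy_rewards)
print(value, ess)
```

In this toy check the weights are all one, so the ESS equals the number of trajectories; skewing `pi_probs` toward one trajectory's actions drives the ESS toward one.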

Related Work

The batch or off-policy RL literature generally focuses on safe and efficient learning using off-policy evaluation techniques [14, 17, 18]. In this work, the notion of safety we use rests on the assumption that clinicians generally perform well and very rarely take unsafe actions. This is somewhat distinct from other concepts of safety in the area of safe RL, such as those comparing the bounds on different off-policy evaluation metrics (e.g. [19]). Moreover, there has been limited exploration of learning collections of distinct agents within the off-policy RL community.

Within RL more broadly, most prior work involves notions of diversity that are not aligned with the kind of efficient exploration-amongst-safe-options setting we are interested in. [20, 21] use notions of diversity that don’t directly compare action probabilities, but rather compare features such as neural network parameter differences or the entropy in a single agent’s action probabilities. More related is [22], who learn a policy over options and can train multiple options (in an off-policy manner) using a rollout from a single option. Although the distinct options can give rise to agents with distinct behaviors, there is no explicit diversity component in the objective, and it is unclear how to summarize the kinds of distinct trajectories that are possible and what combination of options leads to the most interesting policies.

Our motivation for seeking a collection of distinct policies in the reinforcement learning setting is aligned most closely with the end goal of [23]: presenting a broad set of representative solutions as a tool for hypothesis generation and to discover specific directions of interest for further inquiry. Their primary application focus is on malware detection, and they first learn a set of good policies followed by a post-hoc clustering step to identify diverse candidates, whereas we learn diverse policies via a joint optimization. [24] learn collections of distinct policies using a divergence metric between the distributions of trajectories induced by the policies. However, their work focuses largely on on-policy settings where a simulator of the environment is available and the collection of policies is learned sequentially rather than jointly. In our case, we jointly optimize to find a collection of distinct, plausible alternatives from already-collected observational data, which can inform clinicians of multiple hypotheses for treatment strategies.

Finally, there exist several papers using data to inform decisions in the ICU. [10] and [25] also use RL to learn fluid and vasopressor treatment strategies, but specifically in septic patients, and their focus is on optimality and not safety and diversity. [8] focuses on predicting response to fluid bolus therapy, as the treatment does not always work. There are also many papers that attempt to predict onset of various kinds of interventions (e.g. [26]) and onset of hypotension events (e.g. [27, 28]). All of these works try to identify one policy, rather than providing reasonable alternatives.

Cohort and Data Processing

We draw our trajectories from the publicly available MIMIC-III database [29]. The full database contains static and dynamic information for nearly 60,000 hospital admissions of patients treated in the critical care units of Beth Israel Deaconess Medical Center in Boston between 2001 and 2012. We use version 1.4 of MIMIC-III, released in September 2016.

From the database, we considered adults (at least age 18) with MetaVision data (only patients for whom we could reliably and easily extract both start and end times for interventions). We then removed patients with very short ICU stays of less than 12 hours. For all other ICU stays, we only consider the first 72 hours of the ICU admission, as patients who are in the ICU for extended periods of time often receive different care than the initial treatments in the crucial first few days after admission. We required at least three distinct measurements of mean arterial pressure (MAP) below 65 mmHg, indicating probable hypotension, and used only the first ICU admission if a single patient had multiple admissions. This filtering process resulted in 10,142 ICU stays. We split the dataset into N = 7,000 ICU stays (of which we use 1,000 as a validation set for hyperparameter selection and 6,000 for training), and the remaining N = 3,142 as a held-out test set for final evaluation. See Table 1 for baseline characteristics and demographics of the selected cohort.

Table 1: Baseline characteristics of the total cohort of N = 10,142 ICU stays we use in this work.

In addition to these 7 baseline variables, we also include features derived from 10 different vital signs (e.g. heart rate, MAP) and 20 laboratory measurements (e.g. lactate, creatinine). Vitals are typically recorded about once an hour from (continuous) bedside monitors, while labs are typically only measured a few times a day from blood samples drawn from patients. We also include indicator variables that record whether or not a variable was recently measured, as the decision to measure certain variables may itself be very informative [30].

Lastly, we extracted information on the interventions of interest: fluid bolus therapy and vasopressor administrations. We combine different types of fluids and blood products together when forming our fluid action variable (we only include common NaCl 0.9% solution, lactated Ringer's, packed red blood cells, fresh frozen plasma, and platelets). We include five different types of vasopressors for the vasopressor action: dopamine, epinephrine, norepinephrine, vasopressin, and phenylephrine. We map these five drugs into a common dosage amount based on norepinephrine equivalents, following the preprocessing in [10], where the infusion rates are in mcg/kg, normalized by body weight.

To apply RL to a problem, we must formalize the state and action spaces and define a reward and a time scale. We describe each of these pieces below.

State Space, Time Discretization, and Imputation We discretize time into hourly windows, and derive an 89-dimensional state vector, consisting of the baseline variables in Table 1 and values of the physiological and indicator variables as shown in detail in Table 3 in the appendix. We impute any unobserved variable with the population median. Once a variable is observed in a given hospital admission, we then use the last observed measurement until a new value is measured. If more than one value is measured in a given hour window, we take the most recent value, except for the three blood pressure variables, where we use the minimum value, as clinicians typically treat patients based on their most recent worst blood pressure value.
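The imputation rule described above (population median until a first measurement, then carry the last observed value forward) can be sketched as follows. This is a hedged illustration only: the function name `impute_stay`, the use of `np.nan` to mark hours without a measurement, and the `(T, D)` layout are assumptions, and the within-hour aggregation (most recent value, or minimum for blood pressures) is taken to have happened upstream.

```python
import numpy as np


def impute_stay(hourly, population_median):
    """Fill a (T, D) hourly matrix in which np.nan marks 'not measured this hour'.

    Before a variable is first observed it takes the population median;
    afterwards the last observed value is carried forward.
    """
    filled = np.array(hourly, dtype=float, copy=True)
    last = np.array(population_median, dtype=float, copy=True)
    for t in range(filled.shape[0]):
        observed = ~np.isnan(filled[t])
        last[observed] = filled[t, observed]  # update with fresh measurements
        filled[t] = last                      # everything else is carried forward
    return filled


# Two variables over three hours; medians of 75 and 90 are made-up numbers.
toy_stay = [[np.nan, 80.0], [70.0, np.nan], [np.nan, np.nan]]
print(impute_stay(toy_stay, [75.0, 90.0]))
```

A companion "recently measured" indicator could be derived from the same `np.isnan` mask before imputation.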

Action Space We discretize the two types of interventions, fluid boluses and vasopressors, into 4 and 5 different discrete doses, respectively, so that in total there are 20 unique actions (see Figures 3-6 in the appendix for details). To compute the dose of a vasopressor, we aggregate the total amount of vasopressors given in each hour window, normalized by weight. For fluids, we only include fluid boluses of at least 200mL administered in an hour or less.
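As a small illustration of how two discretized dose levels can be combined into one of 20 joint actions, consider the sketch below. The index layout (vasopressor level times four plus fluid level), the helper names `encode_action` and `decode_action`, and the reading of "4 and 5" as four fluid levels and five vasopressor levels are assumptions for the example; the dose bin edges themselves are given in the appendix figures and are not reproduced here.

```python
N_FLUID_LEVELS = 4  # assumed: fluid bolus dose levels f0..f3
N_VASO_LEVELS = 5   # assumed: vasopressor dose levels v0..v4


def encode_action(vaso_level, fluid_level):
    """Map a (vasopressor level, fluid level) pair to a single index in 0..19."""
    if not (0 <= vaso_level < N_VASO_LEVELS and 0 <= fluid_level < N_FLUID_LEVELS):
        raise ValueError("dose level out of range")
    return vaso_level * N_FLUID_LEVELS + fluid_level


def decode_action(index):
    """Inverse of encode_action: returns (vaso_level, fluid_level)."""
    if not 0 <= index < N_VASO_LEVELS * N_FLUID_LEVELS:
        raise ValueError("action index out of range")
    return divmod(index, N_FLUID_LEVELS)


# A low-vasopressor, no-fluid action written as (v1, f0).
print(encode_action(1, 0), decode_action(4))
```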

Reward We use the common target of a mean arterial blood pressure (MAP) of 65 mmHg. We consider MAP values above 65 mmHg as acceptable (reward 1), and decrease the reward using a piecewise linear function, with inflection points at 60 mmHg and 55 mmHg, down to a minimum at 28 mmHg (the lowest observed MAP in our data, which we assign a reward of 0). When urine output is sufficient, the penalty for moderately low MAP values (55 mmHg or higher) is waived, since clinically a slightly lower MAP is less concerning if the patient's fluids are well balanced. See Figure 7 in the appendix for a visual depiction of the chosen reward function. We leave a more thorough investigation of potential reward functions to future work. However, it is important to note that when we present SODA-RL in the next section, rewards are not included in the optimization, so the algorithm is agnostic to the choice of reward, which only affects the post-hoc value estimates.
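The shape of such a reward can be sketched with linear interpolation between knots. Only the endpoints (reward 0 at 28 mmHg, reward 1 at 65 mmHg and above) and the knot locations (55 and 60 mmHg) come from the text; the reward values at the two interior knots below are placeholders chosen purely for illustration, and the boolean `urine_output_sufficient` flag stands in for a urine-output criterion that is not specified here.

```python
import numpy as np

MAP_KNOTS = [28.0, 55.0, 60.0, 65.0]
# Endpoints follow the text; the two middle values are PLACEHOLDERS, not the paper's.
REWARD_AT_KNOTS = [0.0, 0.5, 0.85, 1.0]


def map_reward(map_mmhg, urine_output_sufficient=False):
    """Piecewise-linear reward in [0, 1] as a function of MAP (illustrative)."""
    # Moderately low MAP (55 mmHg or higher) is not penalized if urine output is adequate.
    if urine_output_sufficient and map_mmhg >= 55.0:
        return 1.0
    # np.interp clamps to the end values outside [28, 65].
    return float(np.interp(map_mmhg, MAP_KNOTS, REWARD_AT_KNOTS))


for m in (70, 62, 57, 40):
    print(m, round(map_reward(m), 3), map_reward(m, urine_output_sufficient=True))
```

Because SODA-RL does not use rewards during optimization, swapping in different knot values would change only the post-hoc value estimates.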

Methods

When treating hypotension, there may legitimately exist different treatment strategies that are equally effective for a particular patient (e.g. one that focuses on vasopressor use and one that focuses on fluid use). There may also exist treatment strategies whose quality cannot be distinguished from the observational data.

Below, we introduce an algorithm, SODA-RL: Safely Optimized, Diverse, and Accurate Reinforcement Learning, for learning a collection of distinct, reliably high-quality policies from a batch of data. Doing so requires three parts. First, we want to make sure that any policy ($\pi_k$) that we recommend never takes potentially dangerous actions, i.e. that it is safe. Second, we want the policy to be high-performing. Finally, we want the collection of policies ($\pi_1, \ldots, \pi_K$) to be distinct (that is, not repeating the same recommendations). The following objective function incorporates all of these desiderata:

$$\min_{\pi_1, \ldots, \pi_K} \; \mathcal{L}_{\mathrm{qual}}\big(\mathrm{safe}(\pi_1), \ldots, \mathrm{safe}(\pi_K); \pi_b\big) - \lambda \, \mathrm{Div}\big(\mathrm{safe}(\pi_1), \ldots, \mathrm{safe}(\pi_K)\big) \tag{4}$$

where $\mathcal{L}_{\mathrm{qual}}$ is a loss function that measures discrepancy between our collection of policies and the behavior policy $\pi_b$, and $\mathrm{Div}$ is a term (with associated regularization strength $\lambda$) that measures diversity within the collection of policies; it enters with a negative sign so that minimizing the objective rewards diversity. Note that before SODA-RL can be run, we first need to estimate the clinician behavior policy, $\pi_b$. Following [31], we do this using a k-nearest-neighbors approach to count the proportion of each action observed in the 100 nearest states. To quantify distance between states, we use a manually constructed distance function that weights each of the 89 state variables differently depending on their relative importance to this clinical application.
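A nearest-neighbor estimate of the behavior policy of this kind can be sketched as below. The weighted Euclidean distance, the function name `knn_behavior_policy`, and the argument names are assumptions for illustration; the actual distance function is manually constructed and its per-variable weights are not given here.

```python
import numpy as np


def knn_behavior_policy(query_state, states, actions, feature_weights,
                        n_actions=20, k=100):
    """Estimate pi_b(. | query_state) as action proportions among the k nearest states.

    states: (M, D) array of observed states; actions: (M,) integer action indices.
    feature_weights: (D,) per-variable weights (assumed weighted Euclidean distance).
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=int)
    diffs = (states - np.asarray(query_state, dtype=float)) * np.asarray(feature_weights)
    dists = np.sqrt((diffs ** 2).sum(axis=1))

    k = min(k, len(states))
    nearest = np.argpartition(dists, k - 1)[:k]  # indices of the k smallest distances
    counts = np.bincount(actions[nearest], minlength=n_actions)
    return counts / k


# Tiny one-dimensional example with k=3 neighbors and 3 possible actions.
toy_states = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
toy_actions = [0, 0, 1, 2, 2, 2]
print(knn_behavior_policy([0.0], toy_states, toy_actions, [1.0], n_actions=3, k=3))
```

With k = 100, every estimated probability is a multiple of 0.01, which is why thresholds such as 0.01, 0.03, and 0.05 correspond to neighbor counts.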

Safety (safe): The goal of the safety constraint is to ensure that a policy does not take a dangerous action. For our purpose, we define dangerous as unknown or rarely performed: assuming that the clinicians are choosing amongst reasonable decisions most of the time, there likely exists good reason for treatments that are not chosen. And even if not, there is no way to tell, given the current data, the potential consequences of a never-tried treatment.

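The safe operator, safe, uses an indicator function ($\mathbb{1}$) to only allow state-action pairs where the behavior action probability is at least some threshold $\epsilon$. For a given state, if some actions are allowed and others are not, the action probabilities are renormalized over only the allowable actions:

$$\mathrm{safe}(\pi)(a \mid s) = \frac{\pi(a \mid s) \, \mathbb{1}[\pi_b(a \mid s) \geq \epsilon]}{\sum_{a' \in A} \pi(a' \mid s) \, \mathbb{1}[\pi_b(a' \mid s) \geq \epsilon]} \tag{5}$$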
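The masking-and-renormalizing step can be sketched in a few lines of NumPy. The function name `safe` mirrors the operator, but the array layout and the choice to treat the threshold as inclusive are assumptions for this example, which also presumes the policy keeps some probability mass on at least one allowed action in every state.

```python
import numpy as np


def safe(pi_probs, pib_probs, eps):
    """Zero out actions whose behavior probability is below eps, then renormalize.

    pi_probs, pib_probs: arrays of shape (..., A) of action probabilities.
    Assumes each state has at least one allowed action with nonzero pi mass.
    """
    pi_probs = np.asarray(pi_probs, dtype=float)
    allowed = np.asarray(pib_probs, dtype=float) >= eps  # indicator of permitted actions
    masked = np.where(allowed, pi_probs, 0.0)
    return masked / masked.sum(axis=-1, keepdims=True)


toy_pi = [0.25, 0.25, 0.5]
toy_pib = [0.5, 0.0, 0.5]  # the middle action was never observed nearby
print(safe(toy_pi, toy_pib, eps=0.01))
```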

Distinct, Likely Collections: The safety operation simply ensures that we do not take actions that are completely non-evaluable. However, it does not ensure that the policies will be of high quality. One option is to directly optimize policies with respect to the CWPDIS estimator in equation 2. However, [32] note that gradient-based optimization of importance sampling estimates is difficult with complex policies and long rollouts, and we experienced difficulty attempting to optimize this directly.

Thus, we will instead follow a different strategy: our goal will be to identify a collection of likely, distinct strategies. This objective is based on the intuition that the current clinician behaviors are generally reasonable. Our goal is to essentially disentangle the distinct treatment strategies that clinicians are currently using in practice and then each one can be evaluated and filtered using a value estimate from equation 2.

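We shall measure how likely a proposed policy $\pi_k$ is given current clinician behavior at a particular state s as the difference $l(\pi_k(\cdot \mid s), \pi_b(\cdot \mid s))$, where l is some loss function. We will consider the average difference over all policies in the collection ($\pi_1, \ldots, \pi_K$) and over all states in the batch (B) as the overall similarity (or quality) of the collection of proposed policies and clinician behavior:

$$\mathcal{L}_{\mathrm{qual}}(\pi_1, \ldots, \pi_K; \pi_b) = \frac{1}{K |B|} \sum_{k=1}^{K} \sum_{s \in B} l\big(\pi_k(\cdot \mid s), \pi_b(\cdot \mid s)\big) \tag{6}$$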

Of course, the optimal solution to equation 6 is to make all policies in the collection identical to the clinician policy. To separate out the strategies that clinicians may be using, we add a diversity term, weighted by hyperparameter $\lambda$, that will encourage us to discover a distinct collection of policies. We define the diversity between two policies $\pi_i$ and $\pi_j$ as an average of the symmetric KL between their action probabilities, over all observed states in the batch B:

$$\mathrm{Div}(\pi_i, \pi_j) = \frac{1}{|B|} \sum_{s \in B} \mathrm{symKL}\big(\pi_i(\cdot \mid s), \pi_j(\cdot \mid s)\big) \tag{7}$$

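For a collection of policies $\pi_1, \ldots, \pi_K$, we define the diversity measure as the average of the pairwise diversity measure $\mathrm{Div}(\pi_i, \pi_j)$ over pairs that are distinct:

$$\mathrm{Div}(\pi_1, \ldots, \pi_K) = \frac{1}{K(K-1)} \sum_{i \neq j} \mathrm{Div}(\pi_i, \pi_j) \tag{8}$$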

Together, equations 6 and 8 represent the tension between finding policies that are likely (that is, have high support in the observed data) and yet distinct. By identifying this collection, we provide a space of potential policies that may be useful in any situation, and the opportunity for clinicians to optimize over the range of actions they are already performing.
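The trade-off between the two terms can be illustrated numerically. The sketch below evaluates a symKL-based quality term, a cross-entropy variant, and the average pairwise diversity for a collection of policies given as probability arrays. It is an assumption-laden illustration, not the training code: symmetric KL is taken here as the sum of the two KL directions (some conventions halve it), probabilities are clipped at a small constant for numerical safety, and all function names are hypothetical.

```python
import numpy as np


def sym_kl(p, q, tiny=1e-12):
    """Symmetric KL, taken here as KL(p||q) + KL(q||p), over the last axis."""
    p = np.clip(p, tiny, 1.0)
    q = np.clip(q, tiny, 1.0)
    return ((p - q) * (np.log(p) - np.log(q))).sum(axis=-1)


def quality_symkl(policies, pib):
    """Mean symKL between each policy (K, B, A) and the behavior policy (B, A)."""
    return float(sym_kl(policies, pib[None]).mean())


def quality_ce(policies, taken_actions, tiny=1e-12):
    """Mean cross-entropy of each policy against the actions actually taken (B,)."""
    n_states = policies.shape[1]
    picked = policies[:, np.arange(n_states), taken_actions]  # shape (K, B)
    return float(-np.log(np.clip(picked, tiny, 1.0)).mean())


def diversity(policies):
    """Average pairwise symKL over ordered pairs of distinct policies."""
    k = policies.shape[0]
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                total += sym_kl(policies[i], policies[j]).mean()
    return float(total / (k * (k - 1)))


def objective(policies, pib, lam):
    """Quality loss minus lam times diversity (lower is better)."""
    return quality_symkl(policies, pib) - lam * diversity(policies)


toy_pib = np.array([[0.5, 0.5], [0.9, 0.1]])
clones = np.stack([toy_pib, toy_pib])            # two copies of the behavior policy
varied = np.stack([toy_pib, toy_pib[:, ::-1]])   # second policy swaps the two actions
print(objective(clones, toy_pib, lam=0.1), objective(varied, toy_pib, lam=0.1))
```

Copies of the behavior policy score zero on both terms, while the varied collection pays a quality penalty in exchange for diversity; the value of `lam` decides which collection is preferred.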

Experimental Setup

In this section, we provide details for the setup of our experiments on the particular task of hypotension management in the ICU. We try out two different variants for the loss function l defined in equation 6. The first is the standard cross-entropy (CE) loss function, that will encourage a policy’s action probabilities at each state in the batch to be close to the action that was actually taken. The second is the symmetric KL distance (symKL; also used for the diversity term), where here the distance is between the action probabilities for the behavior policy and the policy to be learned.

In practice, we try a range of $\lambda$ values (1, 0.4, 0.1, 0.01, 0.001), and try several values for $\epsilon$ in equation 5 (0.01, 0.03, 0.05; corresponding to only considering actions seen in at least 1, 3, and 5 of the 100 nearest neighbors of a given state, respectively). To actually learn a policy that maps states to action probabilities, we use a simple three-layer feedforward neural network (multilayer perceptron), with 128 units per layer. Thus, the parameters to learn are three sets of weight matrices and bias vectors for each policy $\pi_k$. In our experiments we jointly learn 4 policies at once. We train our methods using the Adam optimizer with a learning rate of 0.001 and a batch size of 100 trajectories at a time, and use a modest weight on a regularization term on all policy parameters.
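For concreteness, a forward pass through a policy network of roughly this shape might look as follows in plain NumPy. Several details are assumptions not stated in the text: the layer sizes are read as 89 inputs, two hidden layers of 128 units, and 20 outputs (three weight matrices in total), the hidden activation is taken to be ReLU, and the initialization scheme is arbitrary. Training with Adam is omitted.

```python
import numpy as np


def init_policy(rng, sizes=(89, 128, 128, 20)):
    """Three (weight, bias) pairs for one policy; sizes are an assumed reading."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def policy_probs(params, states):
    """Map a batch of states (B, 89) to action probabilities (B, 20)."""
    h = np.asarray(states, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU hidden layers (assumed activation)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    expd = np.exp(logits)
    return expd / expd.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
collection = [init_policy(rng) for _ in range(4)]  # 4 policies learned jointly
toy_batch = rng.normal(size=(7, 89))
probs = np.stack([policy_probs(p, toy_batch) for p in collection])
print(probs.shape)
```

Stacking the outputs into a single `(K, B, A)` array makes it straightforward to apply a safety mask and to compute quality and diversity terms over the whole collection.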

Evaluation Metrics While our optimization metric aimed to identify distinct, likely treatment policies from the data, our original objective was to identify distinct, effective policies that can serve as options for clinicians. We evaluate the effectiveness of a policy via the CWPDIS estimator in equation 2, with . We also provide the effective sample size of a policy using equation 3. Together, these metrics provide an estimate of the effectiveness of a policy; CWPDIS value is an estimate of the policy’s value, while the ESS is a measure of confidence in that estimate. We also present the CE and symKL loss functions that are optimized in the quality term, as additional metrics to measure how likely a given policy is with respect to the behavior policy and behavior actions taken. We measure the distinctness of the collection using the average symmetric KL between each pair of policies, i.e. equation 8. Lastly, to measure safety, we count the number of times a policy places a non-negligible probability (i.e. above 0.01) on an action disallowed by the safety term in equation 5.
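The safety metric described above can be sketched as a simple count. The sketch assumes the count is taken over state-action pairs (rather than over states), treats the threshold as inclusive, and uses a hypothetical function name; these choices are illustrative only.

```python
import numpy as np


def count_safety_violations(pi_probs, pib_probs, eps, negligible=0.01):
    """Count state-action pairs where the policy puts more than `negligible`
    probability on an action whose behavior probability is below eps."""
    pi_probs = np.asarray(pi_probs, dtype=float)
    disallowed = np.asarray(pib_probs, dtype=float) < eps
    return int(((pi_probs > negligible) & disallowed).sum())


toy_policy = [[0.5, 0.5, 0.0], [0.98, 0.015, 0.005]]
toy_behavior = [[0.5, 0.0, 0.5], [1.0, 0.0, 0.0]]
print(count_safety_violations(toy_policy, toy_behavior, eps=0.01))
```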

Baselines We consider ablations of our approach to determine which aspects are most important to identifying a collection of effective policies. In particular, we explore variants where we turn off various combinations of the diversity and quality terms and safety constraint. We ran experiments with all three (the full method) using both the CE and symKL losses to measure quality, and also ran versions with: only a diversity term with safety constraint, and no quality term; a diversity and quality term but no safety constraint; a quality term and safety constraint, but no diversity term; and diversity term and quality terms alone, with no safety constraints.

Results

Table 2 presents our quantitative results. To constrain our results to policies whose value we can reliably estimate, we prune out learned policies that have an individual ESS of less than 50 on the test set of N = 3,142 trajectories, regardless of their value estimate. In general, most policies that we learn have value estimates that are quite close to the average return on the test set of 37.90, which is an unbiased and reliable estimate of the value of the clinician behavior policy.

Table 2: Quantitative results (means and standard deviations) for each collection of learned policies. For comparison, note that there were 3,142 trajectories in the test set, so this is the highest achievable ESS. Furthermore, the empirical average of the returns in the test set was 37.90, so this is a reasonable estimate of the value of the behavior policy. We only show results for agents that learned a policy with an ESS of at least 50. We show results for .

A major takeaway is that without the safety constraint, the optimization is very likely to end up learning a policy with an unacceptably low ESS. Even when the ESS is reasonable, there will be a large number of transitions where the agent recommends unknown, never-before-seen actions for patients similar to the current state. Without the diversity term but with the safety constraint, it is possible to achieve better CE and symKL loss values, which bring the policies closer to the behavior policy, but at the cost of very low to no diversity. Without a quality term of some sort, the combination of diversity and safety learns a very diverse set of policies that still has good value and ESS, but is substantially further away from the behavior policy: it often confidently recommends actions that were unlikely, but still possible, under the behavior policy. Lastly, using only a quality term also typically fails to learn a policy with a reasonable ESS. In contrast, the full SODA-RL method using all three terms strikes a middle ground, still learning a fairly diverse set of policies while sticking much closer to the behavior policy.

Lastly, we present qualitative results from the policies presented in the second row of Table 2, i.e. a high-diversity setting with the safety constraint and the symKL loss in the quality term. Figure 1 illustrates the local diversity learned by this collection of 3 policies at a particular state. The blue bars in the figure show the estimated behavior policy action probabilities, while orange, green, and red show the SODA-RL probabilities. Agent 1 (correctly) places high confidence in the low-vasopressor, no-fluid action (v1,f0), while agent 2 places high confidence on the medium-vasopressor, no-fluid action (v2,f0) and agent 3 assigns moderate probability to several other actions.

Figure 2 presents a more global picture of the type of diversity that the policies learn. Agent 1 primarily places high probability on high doses of vasopressors with fluids, low doses of vasopressor with no fluids, and medium doses of fluids with no vasopressors. Agent 2 mostly focuses on lower doses of vasopressors, regardless of fluid amount. Lastly, agent 3 largely recommends various amounts of fluids across a range of low to moderate vasopressor doses.

For additional qualitative results similar to these two, see Figures 8-18 in the appendix. Figures 8-14 illustrate additional states where the agents exhibit high local diversity, and Figures 15-18 show the distribution of action probabilities across subsets of states where different types of actions were taken and where patients were in states with high physiological instability (i.e. low MAP and high lactate).

Discussion

In this paper we introduced SODA-RL, a reinforcement learning approach for identifying a collection of effective treatment policies from observational data. When applied to the task of hypotension management in the ICU, we found that it is crucial that all three components in Equation 4 are utilized so that the learned policies are diverse, safe, and not that far from current clinical practice. Additionally, our qualitative results on a learned collection of policies suggest that they are each picking up on diverse sets of practices in the treatment of hypotension.

Figure 1: One representative example of a particular state (with variable values presented on the bottom) where the three retained policies learned by SODA-RL exhibit high diversity. For this example, there were 14 actions that the policies were able to exploit, from the safety constraint. Agents 1 and 2 place high probability on a small number of actions, while agent 3 spreads out amongst several reasonable alternatives.

However, one of the major assumptions that we make is that the current set of features that comprise our definition of state are actually sufficient for a clinician to act on (i.e., that our defined state actually satisfies the Markov assumption). This is likely an unrealistic assumption, but future work could explore other ways of learning state-statistics, and our methods can be seamlessly combined with any state representation.

Another interesting line of future work would be to explore how and why different types of vasopressors are given, especially in settings where more than one is given (e.g. vasopressin, which is often combined with another drug like norepinephrine). Finally, blood pressure targets themselves are an area of active research [33]. We focused on achieving certain targets in our rewards as that ensures that the actions were closely linked to the outcomes. More general forms of patient outcomes (e.g. mortality) may be more interesting, but have their own challenges, as these outcomes depend on many factors outside of how a patient’s hypotension is managed.

Overall, we believe SODA-RL represents an important and under-explored direction in reinforcement learning for healthcare: it is often statistically impossible to identify optimal treatment strategies from observational data. However, it is possible to identify a collection of plausible alternatives, drawn from current practice variation. This collection can provide a starting point for clinical experts to perform a targeted review, starting with chart review and perhaps ending in a trial about different treatment options; once vetted, it could be used to help patients and providers think about options in the context of the patient’s specific presentation and the provider’s experience and expertise. Our proposed SODA-RL algorithm ensures that those alternatives are distinct and have sufficient support in the data, enabling what we believe will be a more practical and impactful way for clinicians to draw treatment policy insights from observational sources.

Figure 2: Overall action probabilities. Each column corresponds to one of the 20 actions in our action space. The top row shows the physician behavior probabilities aggregated across all patients in the test set. The bottom three rows show probabilities from the three different agents, from the same run of the algorithm presented in Table 2.

Acknowledgements

FDV and JF acknowledge support from NSF Project 1750358. MAM and FDV acknowledge support from AFOSR FA 9550-17-1-0155. JF additionally acknowledges Oracle Labs, a Harvard CRCS fellowship, and a Harvard Embedded EthiCS fellowship.

References

1. Emad H Ibrahim, Glenda Sherman, Suzanne Ward, Victoria J Fraser, and Marin H Kollef. The influence of inadequate antimicrobial treatment of bloodstream infections on patient outcomes in the ICU setting. Chest, 118(1):146–155, 2000.

2. AN Berbece and RMA Richardson. Sustained low-efficiency dialysis in the ICU: cost, anticoagulation, and solute removal. Kidney international, 70(5):963–968, 2006.

3. Andrés Esteban, Antonio Anzueto, Fernando Frutos, Inmaculada Alía, Laurent Brochard, Thomas E Stewart, Salvador Benito, Scott K Epstein, Carlos Apezteguía, Peter Nightingale, et al. Characteristics and outcomes in adult patients receiving mechanical ventilation: a 28-day international study. JAMA, 287(3):345–355, 2002.

4. Neil J Glassford, Glenn M Eastwood, and Rinaldo Bellomo. Physiological changes after fluid bolus therapy in sepsis: a systematic review of contemporary data. Critical care, 18(6):696, 2014.

5. Christof Havel, Jasmin Arrich, Heidrun Losert, Gunnar Gamper, Marcus Müllner, and Harald Herkner. Vasopressors for hypotensive shock. Cochrane Database of Systematic Reviews, (5), 2011.

6. Kamal Maheshwari, Brian H Nathanson, Sibyl H Munson, Victor Khangulov, Mitali Stevens, Hussain Badani, Ashish K Khanna, and Daniel I Sessler. The relationship between ICU hypotension and in-hospital mortality and morbidity in septic patients. Intensive care medicine, 44(6):857–867, 2018.

7. Alan E Jones, Vasilios Yiannibas, Charles Johnson, and Jeffrey A Kline. Emergency department hypotension predicts sudden unexpected in-hospital mortality: a prospective cohort study. Chest, 130(4):941–946, 2006.

8. Uma M Girkar, Ryo Uchimido, Li-wei H Lehman, Peter Szolovits, Leo Celi, and Wei-Hung Weng. Predicting blood pressure response to fluid bolus therapy using attention-based neural networks for clinical interpretability. arXiv preprint arXiv:1812.00699, 2018.

9. Richard S Sutton. Introduction to reinforcement learning, volume 2. 1998.

10. Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.

11. Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau, and Susan A Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine learning, 84(1-2):109–136, 2011.

12. Niranjani Prasad, Li-Fang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300, 2017.

13. Shamim Nemati, Mohammad M Ghassemi, and Gari D Clifford. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2978–2981. IEEE, 2016.

14. Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature medicine, 25(1):16–18, 2019.

15. Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.

16. Jun S Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and computing, 6(2):113–119, 1996.

17. Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.

18. Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

19. Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.

20. Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

21. Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

22. Matthew Smith, Herke van Hoof, and Joelle Pineau. An inference-based policy gradient method for learning options. In International Conference on Machine Learning, pages 4710–4719, 2018.

23. Shirin Sohrabi, Anton V Riabov, Octavian Udrea, and Oktie Hassanzadeh. Finding diverse high-quality plans for hypothesis generation. In ECAI, pages 1581–1582, 2016.

24. Muhammad Masood and Finale Doshi-Velez. Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In IJCAI, 2019.

25. Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment-a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017.

26. Marzyeh Ghassemi, Mike Wu, Michael C Hughes, Peter Szolovits, and Finale Doshi-Velez. Predicting intervention onset in the ICU with switching state space models. AMIA Summits on Translational Science Proceedings, 2017:82, 2017.

27. Feras Hatib, Zhongping Jian, Sai Buddi, Christine Lee, Jos Settels, Karen Sibert, Joseph Rinehart, and Maxime Cannesson. Machine-learning algorithm to predict hypotension based on high-fidelity arterial pressure waveform analysis. Anesthesiology: The Journal of the American Society of Anesthesiologists, 129(4):663–674, 2018.

28. Shameek Ghosh, Mengling Feng, Hung Nguyen, and Jinyan Li. Risk prediction for acute hypotensive patients by using gap constrained sequential contrast patterns. In AMIA Annual Symposium Proceedings, volume 2014, page 1748. American Medical Informatics Association, 2014.

29. Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035, 2016.

30. Denis Agniel, Isaac S Kohane, and Griffin M Weber. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ, 361:k1479, 2018.

31. Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.

32. Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.

33. Pierre Asfar, Ferhat Meziani, Jean-François Hamel, Fabien Grelon, Bruno Megarbane, Nadia Anguel, Jean-Paul Mira, Pierre-François Dequin, Soizic Gergaud, Nicolas Weiss, et al. High versus low blood-pressure target in patients with septic shock. New England Journal of Medicine, 370(17):1583–1593, 2014.

Appendix

Our state space contains 89 clinical and demographic features, which we now briefly describe. There are 7 baseline variables (demographics and other characteristics available on ICU admission) in Table 1. We also include in the state formulation a continuous variable denoting how far into the first 72 hours of the ICU stay the current time point is. The remaining 81 clinical variables are in Table 3, which summarizes the measured values of the time series variables as well as the indicator variables. Lastly, there are variables for the most recent type of treatment administered, and for the total amount of each treatment administered thus far and in the last 8 hours.

Table 3: Summary statistics for clinical variables included in our state formulation. Continuous-valued time series are summarized by their mean, and 25/50/75% quantiles, among measured values (i.e. excluding imputed values). Indicators are summarized by the percentage of hours in the full dataset where they were active.

Figure 3: Histogram showing the raw observed values of fluid boluses administered in the dataset. We define a fluid bolus as at least 200 mL of IV fluids administered within an hour. From this plot, we decided to discretize fluids into the following 4 ranges (in mL within an hour): {[0, 200), [200, 500), [500, 1000), [1000, 2000)}.

We now present histograms showing the distribution of actual values of treatments given, to show how we eventually discretized them to achieve our final action space of 20 possible actions.

Figure 4: Histogram showing the raw observed amounts of vasopressor administered within each discrete hour in the dataset. From this plot, together with the next plot that zooms in on values less than 40, we decided to discretize vasopressors to the following 5 discrete ranges: {0, (0, 5), [5, 15), [15, 40), [40, 150)}.

Figure 5: Histogram showing the lower range of the raw observed amounts of vasopressor administered within each discrete hour in the dataset. This is a zoomed-in version of Figure 4.

Figure 6: Histogram of overall counts of the 20 different actions in the entire dataset. Actions 0,1,2,3 are no vasopressor and none/low/medium/high fluid bolus; 4,5,6,7 are low dose of vasopressor along with no/low/medium/high fluid bolus, etc.
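The discretization described in Figures 3, 4, and 6 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the vasopressor dose is treated as unitless because no unit is stated here, and placing amounts above the top range (2000 mL of fluid, or a dose of 150) into the highest bin is an assumption. The index layout, 4 times the vasopressor bin plus the fluid bin, follows the ordering given in the Figure 6 caption.

```python
from bisect import bisect_right

# Interior bin edges taken from the ranges in Figures 3 and 4.
FLUID_EDGES = [200, 500, 1000]  # mL/hour: [0,200), [200,500), [500,1000), [1000,2000)
VASO_EDGES = [5, 15, 40]        # nonzero doses: (0,5), [5,15), [15,40), [40,150)


def fluid_bin(volume_ml):
    """Return 0 (no bolus) to 3 (high). Volumes above 2000 mL fall in bin 3 (assumption)."""
    return bisect_right(FLUID_EDGES, volume_ml)


def vaso_bin(dose):
    """Return 0 (none) to 4 (highest). Doses above 150 fall in bin 4 (assumption)."""
    if dose <= 0:
        return 0
    return 1 + bisect_right(VASO_EDGES, dose)


def action_index(volume_ml, dose):
    """Combine both bins into one of 20 actions, ordered as in the Figure 6 caption."""
    return 4 * vaso_bin(dose) + fluid_bin(volume_ml)


# No treatment maps to action 0; a low vasopressor dose with a low bolus maps to action 5.
print(action_index(0, 0), action_index(250, 3))
```

Using half-open bins with `bisect_right` places a value that sits exactly on an edge (for example, 200 mL) in the upper bin, matching the bracket notation in the captions.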

Figure 7: Reward function used in our analysis. The reward depends primarily on MAP, with inflection points at 55, 60, and 65. Values 65 and above are considered optimal. Note that the main objective function for SODA-RL does not depend on the specific reward function chosen, so this will only affect our final estimate of the overall value of the learned policies.

Figure 8: An example state where there were 13 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

We now show additional figures exploring specific states observed in the test set, where the three retained policies learned by SODA-RL exhibit high degrees of diversity.

Figure 9: An example state where there were 13 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

Figure 10: An example state where there were 4 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

Figure 11: An example state where there were 4 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

Figure 12: An example state where there were 4 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

Figure 13: An example state where there were 7 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

Figure 14: An example state where there were 8 different actions allowed by the safety term in SODA-RL, and where the resulting policies exhibit high diversity.

Figure 15: Action probabilities for states where fluids are subsequently administered. Each column corresponds to one of the 20 actions in our action space. The top row shows the physician behavior probabilities aggregated across all patients in the test set. The bottom three rows show probabilities from the three different agents, from the same run of the algorithm presented in Table 2.

Finally, we show additional histograms of action probabilities for the 3 learned policies, along with the behavior policy, for several different subsets of states. We show how the behavior and learned policies focus on different actions in states where a fluid action was subsequently taken (Figure 15), states where a vasopressor action was subsequently taken (Figure 16), and states where the patient is in a stage of especially high acuity, as measured by elevated lactate (Figure 17) and severely low MAP (Figure 18).
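One way such subset-level summaries could be computed is sketched below. It assumes each state comes with a per-action probability vector from a policy and that aggregation means a simple average over the selected states; the text does not specify the exact aggregation, and the function name, dictionary keys, and toy numbers (two actions rather than 20) are all hypothetical.

```python
def mean_action_probs(policy_probs, mask):
    """Average per-state action distributions over the states where mask is True."""
    selected = [p for p, keep in zip(policy_probs, mask) if keep]
    if not selected:
        raise ValueError("no states selected by the mask")
    n_actions = len(selected[0])
    return [sum(p[a] for p in selected) / len(selected) for a in range(n_actions)]


# Toy data: three states, two actions each (illustrative values only).
states = [
    {"map": 52.0, "lactate": 3.1},
    {"map": 50.0, "lactate": 1.2},
    {"map": 70.0, "lactate": 2.5},
]
probs = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# Subsets analogous to Figures 17 and 18: elevated lactate and severely low MAP.
high_lactate = [s["lactate"] > 2.0 for s in states]
low_map = [s["map"] < 55.0 for s in states]

print(mean_action_probs(probs, high_lactate))
print(mean_action_probs(probs, low_map))
```

The same function applies to the behavior policy and to each learned policy, so the rows of a figure like these would come from calling it once per policy with a shared mask.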

Figure 16: Action probabilities for states where a vasopressor is subsequently administered. Each column corresponds to one of the 20 actions in our action space. The top row shows the physician behavior probabilities aggregated across all patients in the test set. The bottom three rows show probabilities from the three different agents, from the same run of the algorithm presented in Table 2.

Figure 17: Action probabilities for states with an elevated lactate of greater than 2 mmol/L. Each column corresponds to one of the 20 actions in our action space. The top row shows the physician behavior probabilities aggregated across all patients in the test set. The bottom three rows show probabilities from the three different agents, from the same run of the algorithm presented in Table 2.

Figure 18: Action probabilities for states with a low MAP of less than 55 mmHg. Each column corresponds to one of the 20 actions in our action space. The top row shows the physician behavior probabilities aggregated across all patients in the test set. The bottom three rows show probabilities from the three different agents, from the same run of the algorithm presented in Table 2.