A Hidden Absorbing Semi-Markov Model for Informatively Censored Temporal Data: Learning and Inference

2016·Arxiv

Abstract

Abstract

Modeling continuous-time physiological processes that manifest a patient’s evolving clinical states is a key step in approaching many problems in healthcare. In this paper, we develop the Hidden Absorbing Semi-Markov Model (HASMM): a versatile probabilistic model that is capable of capturing the modern electronic health record (EHR) data. Unlike existing models, the HASMM accommodates irregularly sampled, temporally correlated, and informatively censored physiological data, and can describe non-stationary clinical state transitions. Learning the HASMM parameters from the EHR data is achieved via a novel forward-filtering backward-sampling Monte-Carlo EM algorithm that exploits the knowledge of the end-point clinical outcomes (informative censoring) in the EHR data, and implements the E-step by sequentially sampling the patients’ clinical states in the reverse-time direction while conditioning on the future states. Real-time inferences are drawn via a forward-filtering algorithm that operates on a virtually constructed discrete-time embedded Markov chain that mirrors the patient’s continuous-time state trajectory. We demonstrate the prognostic utility of the HASMM in a critical care prognosis setting using a real-world dataset for patients admitted to the Ronald Reagan UCLA Medical Center. In particular, we show that using HASMMs, a patient’s clinical deterioration can be predicted 8-9 hours prior to intensive care unit admission, with a 22% AUC gain compared to the Rothman index, which is the state-of-the-art critical care risk scoring technology.

1. Introduction

Modeling the clinical conditions of a patient using evidential physiological data is a ubiquitous problem that arises in many healthcare settings, including disease progression modeling (Schulam and Saria (2015); Mould (2012); Wang et al. (2014); Jackson et al. (2003); Sweeting et al. (2010); Liu et al. (2015)) and critical care prognosis (Moreno et al. (2005); Matos et al. (2006); Yoon et al. (2016); Hoiles and van der Schaar (2016); Alaa et al. (2016)). Accurate physiological modeling in these settings confers an instrumental value that mani-

Figure 1: An episode of the diastolic blood pressure measurements (as recorded in the EHR) for a patient hospitalized in a regular ward for 50 days and then admitted to the ICU after the ward staff realized she is clinically deteriorating. Measurements are censored in accordance with the ICU admission time.

0 50 100 150 100

Figure 2: An episode of the systolic blood pressure measurements for a patient hospitalized in a regular ward for 6 days and then discharged home by the ward staff. Measurements are missing in a 24-hour period during the patient’s stay in the ward.

fests in the ability to provide early diagnosis, individualized treatments and timely interventions (e.g. early warning systems in critical care hospital wards (Yoon et al. (2016)), early diagnosis for Scleroderma patients (Varga et al. (2012); Alaa and van der Schaar (2016)), early detection of a progressing breast cancer (Bartkova et al. (2005)), etc). Physiological modeling also confers an epistemic value that manifests in the knowledge extracted from data about the progression and severity phases of a disease (Stelfox et al. (2012))), or the short-term dynamics of the physiological behavior of critically ill patients (Li-wei et al. (2013)).

The recent availability of data in the electronic health records (EHR)creates a promising horizon for establishing rich and complex physiological models (Gunter and Terry (2005)). Modern EHRs comprise episodic data records for individual (anonymized) patients; every patient’s episode is a temporal sequence of clinical findings (e.g. visual field index for Glau-

coma patients (Liu et al. (2015)), CD4 cell counts for HIV-infected patients (Guihenneuc-Jouyaux et al. (2000)), etc), lab test results (e.g. white cell blood count for post-operative patients under immunosuppressive drugs (Cholette et al. (2012)), etc), or vital signs (e.g. blood pressure and saturation (Yoon et al. (2016))). The time span of these episodes may be as short as few days in short-term hospitalization episodes (e.g. patients with solid tumors, hematological malignancies or neutropenia who are hospitalized in regular wards before or after a surgery (Kause et al. (2004); Hogan et al. (2012); Kirkland et al. (2013))), or as long as few years in longitudinal episodes (e.g. chronic obstructive pulmonary disease may evolve from a mild Stage I to a very severe Stage IV over a time span of 10 years (Pedersen et al. (2011); Wang et al. (2014))). In this paper, we develop a versatile physiological model that fits a wide spectrum of healthcare settings, providing means for data-driven clinical prognosis. In the next Subsection, we expose our modeling rationale and list the modeling challenges posed by the structure of modern EHR data. We conclude this Section by summarizing our contributions in Subsection 1.2.

1.1 Modeling Rationale and Challenges

1.1.1 Rationale

Previous physiological models have branched into two different modeling paths with respect to the way a patient’s clinical states are defined. One strand of literature adopts fully observable models; these models assume that clinical states are quantifiable via observable clinical markers or disease severity measures (e.g. PFVC in Scleroderma (Schulam and Saria (2015)), GFR in kidney disease (Eddy and Neilson (2006)), etc). Another strand of literature adopts latent variable models, which assume that clinical states are latent and manifest only through proximal, noisy physiological measurements. Table 1 lists some notable previous works that fall under each modeling category.

Table 1: Modeling methodologies in previous works.

Our modeling choice is to go with a latent variable model. The rationale behind this choice is explicated as follows.

• In a wide range of problems, a concrete clinical marker that can be directly used as a surrogate for the patient’s true clinical condition is available. This is especially true in critical care settings where no solid definition or measure of a “clinical state” exists (Li-wei et al. (2013)). Previous works that adopted a clinical risk score as a surrogate for the clinical state in critical care settings have found that other physiological features, when augmented with the clinical risk score, still hold a significant predictive power with respect to end-point clinical outcome (Ghassemi et al. (2015)). This implies that a clinical risk score or a severity of illness measure (such as APACHE II, SAPS and SOFA (Knaus et al. (1991); Subbe et al. (2001))) is not a sufficient measure of a patient’s true clinical condition, and hence cannot be reliably modeled as an observable clinical state.

• The same line of argument extends to disease progression models: (Jackson et al. (2003)) has shown that significant modeling gain can be attained by treating clinical markers and diagnostic assessments as noisy, potentially erroneous manifest variables for the patient’s true clinical state rather than defining a clinical state in terms of those markers.

• For various chronic disease, such as HIV, Scleroderma, and kidney disease, progression stages are well defined in terms of observable clinical markers (CD4 cell count, PFVC and GFR). However, a latent variable model can help validate and assess the current domain knowledge-based clinical practice guidelines by learning alternative, data-driven guidelines. Other diseases, such as COPD, have their progression stages manifesting only through symptoms (e.g. chronic bronchitis, emphysema and chronic airway obstruction (Wang et al. (2014))), which may or may not accurately reflect the disease’s true state, and hence a latent variable model is necessary.

• Conclusive clinical markers that reveal a patient’s true state may be available only occasionally in a patient’s longitudinal episodes. For instance, in a breast cancer progression setting, most of the data points associated with a patient’s longitudinal episode would be imaging test results (e.g. BI-RADS scores of a mammogram or an MRI (Gail and Mai (2010); Taghipour et al. (2013))), which are noisy markers for the existence of a tumor, whereas a conclusive biopsy result that truly reveals whether the patient is in a preclinical or clinical breast cancer state may not be available because the patient did not undergo a biopsy test. A latent variable model better suits such settings.

• A fully observable model does not provide diagnostic utility since it assumes that an already observable clinical marker provides an immediate, domain-knowledge-based diagnosis for the patient. Contrarily, a latent variable model leaves room for diagnoses to be learned from the evidential data by learning the association between physiological evidence and clinical states, which may help inform and improve clinical practice.

1.1.2 Challenges

Hidden Markov Models (HMMs) and their variants have been widely deployed as temporal latent variable models for dynamical systems (Smyth (1994); Zhang et al. (2001); Giampieri et al. (2005); Genon-Catalot et al. (2000); Ghahramani and Jordan (1997)). Such models have achieved considerable success in various applications, such as topic modeling (Gruber et al. (2007)), speaker diarization (Fox et al. (2011b)), and speech recognition (Rabiner (1989)). However, the nature of the clinical setting, together with the format of the modern EHR data pose the following set of serious challenges that confound classical HMM models:

(A) Non-stationarity: Recently developed disease progression models, such those in (Wang et al. (2014)) and (Liu et al. (2015)), use conventional stationary Markov chain models. In particular, they assume that state transition probabilities are independent of time. However, this assumption is seriously at odds with even casual observational studies which show that the probability of transiting from the current state to another state depends on the time spent in the current state (Lagakos et al. (1978); Huzurbazar (2004); Gillaizeau et al. (2015)). This effect, which violates the memorylessness assumptions adopted by continuous-time Markovian models, was verified in patients who underwent renal transplantation (Foucher et al. (2007, 2008)), patients who are HIV infected (Joly and Commenges (1999); Dessie (2014); Foucher et al. (2005)), and patients with chronic obstructive pulmonary disease (Bakal et al. (2014); Wang et al. (2014)).

(B) Irregularly spaced observations: The times at which the clinical findings of a patient (vital signs or lab tests) are observed is controlled either by clinicians (in the case of hospitalized inpatients), or by the patient’s visit times (in the case of a chronic disease follow up). The time interval between every two measurements may vary from one patient to another, and may also vary for the same patient within her episode. This is reflected in the structure of the episodes in the EHR records, as shown in Figure 1 and 2. Figure 1 depicts an actual diastolic blood pressure episode for a patient hospitalized in a regular ward for 1200 hours (50 days). The patient’s stay in the ward was concluded with an admission to the ICU after the ward staff realized she was clinically deteriorating. As we can see, the blood pressure measurements in the first 20 hours were initially taken with a rate of 1 sample per hour, and then later the rate changed to 1 sample every 5 hours. Thus, a direct application of a regular, discrete-time HMM (e.g. the models in (Murphy (2002); Fox et al. (2011b,a); Rabiner (1989); Yu (2010); Matos et al. (2006); Guihenneuc-Jouyaux et al. (2000))) will not suffice for jointly describing the latent states and observations, and hence ensuring accurate inferences.

(C) Discrete observations of a continuous-time phenomena: A patient’s physiological signals and latent states evolve in continuous time; however, the observed physiological measurements are gathered at discrete time steps. The intervals between observed measurements can vary quite significantly; as we can see in Figure 2, the systolic blood pressure for a patient who stayed in a ward for 140 hours exhibits an entire day without measurements This means that the patient may encounter multiple hidden state transitions without any associated observed data. These effects make learning and inference problems more complicated since the inference algorithms need to consider potential unobserved trajectories of state evolution between every two timestamps. This challenge has been recently addressed in (Nodelman et al. (2012); Wang et al. (2014); Liu et al. (2015)), but only on the basis of memoryless Markov chain models for the hidden states, for which tractable inferences that rely on the solutions to Chapman-Kolmogorov equations can be executed. However, incorporating non-stationarity in state transitions (i.e. addressing challenge (1) in this list) would make the problem of reasoning about a continuous-time process through discrete observations much more complicated.

(D) Lack of supervision: The episodes in the EHR may be labeled with the aid of domain knowledge (e.g. the stages and symptoms of some chronic diseases, such as chronic kidney disease (Eddy and Neilson (2006)), are known to clinicians and may be provided in the EHR). However, in many cases, including the case of (post or pre-operative) critical care, we do not have access to any labels for the patients’ states. Hence, unsupervised learning approaches need to be used for learning model parameters from EHR episodes. While unsupervised learning of discrete-time HMMs has been extensively studied and is well understood (e.g. the Baum-Welch EM algorithm is predominant in such settings (Zhang et al. (2001); Yu (2010); Rabiner (1989))), the problem of unsupervised learning of continuous-time models for which both the patient’s states and state transition times are hidden is far less understood, and indeed far more complicated.

(E) Censored observations: Episodes in the EHR are usually terminated by an informative intervention or event, such as death, ICU admission, discharge, etc. This is known as informative censoring (Scharfstein and Robins (2002); Huang and Wolfe (2002); Link (1989)). Unlike classical HMM settings where training sets comprise fixed length, or arbitrarily-censored, HMM sequence instances, a typical EHR dataset would comprise a set of episodes with different durations, and the duration of each episodes is itself informative of the entire state evolution trajectory. Learning in such settings requires novel algorithms that can efficiently compute the likelihood of observing a set of episodes conditioned on their durations and terminating states.

1.2 Summary of Contributions

In order to address the challenges above, we develop a new model –which we call the Hidden Absorbing Semi-Markov Model (HASMM)– as a versatile generative model for a patient’s (physiological) episode as recorded in the EHR. The HASMM captures non-stationary transitions for a patient’s clinical state via a continuous-time semi-Markov model with explicitly specified state sojourn time distributions. Informative censoring is captured via absorbing states that designate clinical endpoint outcomes (e.g. cardiac arrest, mortality, recovery, etc); entering an absorbing state of an HASMM stimulates censoring events (e.g. clinical deterioration leads to an ICU admission which terminates the physiological observations for a monitored patient in a ward, etc). Observable variables are modeled via a multi-task Gaussian process (Bonilla et al. (2007)), for which the observation times (i.e. follow up visits, vital sign gathering, lab tests, etc) are modeled as a point process. Using multi-task Gaussian process with state-dependent hyper-parameters, an HASMM accounts for both correlations among different physiological variables, in addition to the temporal correlations among the observation variables that are generated by the same hidden state during its sojourn period. In that sense, an HASMM is a segment model (Ostendorf et al. (1996)) and also a state-switching model (Fox et al. (2011a))).

To allow for real-time inference of a patient’s state, we develop a forward-filtering HASMM inference algorithm that can estimate a patient’s latent state using her history of irregularly sampled physiological measurements. The inference algorithm operates by constructing a virtual, discrete-time embedded Markov chain that fully describes the patient’s state transitions at observation times. The embedded Markov chain is constructed in an offline stage by solving a system of Volterra integral equations of the second kind using the successive approximation method; the solution to this system of equations, which parallels the Chapman-Kolmogorov equations in ordinary Markov chains, describe the HASMM’s semi-Markovian state transitions as observed at arbitrarily selected discrete timestamps.

Offline learning of the HASMM model parameters from patients’ episodes in an EHR is a daunting task. Since the HASMM is a continuous-time model, we cannot directly use the classical Baum-Welch EM algorithms for learning its parameters (Rabiner (1989)). Moreover, the semi-Markovianity of an HASMM yields an intractable integral in the E-step of the Expectation-Maximization (EM) formulation. Since the HASMM’s state transitions are not captured by the conventional continuous-time Markov chain transition rate matrices, we cannot make use of the Expm and Unif methods that were used in (Hobolth and Jensen (2011)), and more recently in (Liu et al. (2015)) for evaluating the integrals involved in the E-step of learning continuous-time HMMs. To address this challenge, we develop a novel forward-filtering backward-sampling Monte Carlo EM (FFBS-MCEM) algorithm that approximates the integral involved in the E-step by efficiently sampling the latent clinical trajectories conditioned on observations in the EHR by exploiting the informative censoring of the patients’ episodes. The FFBS-MCEM algorithm samples the latent clinical states of every (offline) patient episode in the EHR as follows: it starts from the known clinical endpoints, and sequentially samples the patient’s states by traversing in the reverse-time direction while conditioning on the future states, and then uses the sampled state trajectories to evaluate a Monte Carlo approximation for the E-step.

The rest of the paper is organized as follows. In Section 2, we present the HASMM model. The HASMM inference algorithm is developed in Section 3, and the learning algorithm is developed in Section 4. In Section 5, we demonstrate the utility of the HASMM in the problem of critical care prognosis using a real-world dataset for patients admitted to Ronald Reagan UCLA Medical Center. Conclusions are drawn in Section 6.

2. The Hidden Absorbing Semi-Markov Model (HASMM)

In this section, we introduce the basic abstract structure of the continuous-time HASMM (Subsection 2.1), and then we propose the distributional specifications for the model’s variables (Subsection 2.2).

2.1 Abstract Model

We start by describing the HASMM’s hidden state evolution process, and then we describe the structure of its observable variables.

2.1.1 Hidden States

We consider a filtered probability space (Ω), over which a continuous-time stochastic process X(t) is defined on . The process X(t) corresponds to a temporal trajectory of the patient’s hidden clinical states, which take on values from a finite statespace X = {1, 2, . . ., N}. Because the process X(t) takes on only finitely many values, it can be decomposed in the form

where (is a c`adl`ag path (i.e. right-continuous with left limits), and the interval [) is the time interval accommodating the hidden state, which takes on a value Every path (on the stochastic basis (ΩMarkov path (Janssen and De Dominicis (1984); Durrett (2010)), where the sojourn time of state n, which we denote as , is drawn from a state-specific distribution being a state-specific duration parameter associated with state . Unlike ordinary time-homogeneous semi-Markov transitions, in which the transition probabilities among states are assumed to be constant conditioned on there being a transition from the current state (Gillaizeau et al. (2015); Murphy (2002); Johnson and Willsky (2013); Yu (2010); Dewar et al. (2012); Gu´edon (2007)), our model accounts for duration-dependent semi-Markov transitions. In other words, the transition probability from one state to another depends on the time elapsed in the current state, i.e.

where is well defined, and

Now consider the bivariate (renewal) process (, which comprises the sequence of states and sojourn times. The semi-Markovian nature of X(t) implies that

(satisfies the following condition on its transition probabilities

where ) is the cumulative distribution function of state i’s sojourn time, and ¯is the probability mass function that reflects the probability that a patient’s next state being j given that she was at state i and her sojourn time in i is less than (or equal to) s. Based on (3), we define the semi-Markov transition kernel as a matrix-valued function that describes the dynamics of X(t) in continuous time, with entries that are given by

Since the observable episode for a patient can begin in an arbitrary clinical state (because we only observe the physiological measurements starting from the time when the patients are hospitalized or start taking clinical tests), then it follows that the initial state random. The initial state distribution is given by

The hidden states reflect different levels of clinical risk or severity (e.g. progression stage indexes of a chronic disease or phases of clinical deterioration (Sweeting et al. (2010);Chen and Zhou (2011))). In that sense, state 1 is regarded as the “least risky state”, and state N is regarded as the “most risky state”. We define and interpret states 1 and N as follows:

• State 1 is denoted as the safe state, and represents the state at which the patient is at minimum (or no) risk (e.g. clinically stable post-operative patient, etc).

• State N is denoted as the catastrophic state, and represents the state at which the patient is at severe risk or encounters an adverse event (e.g. a very severe stage of a chronic disease (Bakal et al. (2014)), a cardiac or respiratory arrest (Subbe et al. (2001)), mortality (Knaus et al. (1991)), etc).

We assume that whenever the system enters either state 1 or state N, it remains there forever. Therefore, we model states {1, N} as absorbing states, whereas we model the remaining states in X \ {1, N} as transient states that represent intermediate levels of risk. Following the assumptions in (Murphy (2002); Johnson and Willsky (2013)), we eliminate the self-transitions for all transient states by setting X \ {1, N}, whereas we restrict the transitions from states 1 and N to self-transitions only, i.e. . Figure 3 depict the Markov chain for the sequence

Figure 3: The Markov chain model for a 5-state HASMM.

We define as the event that the path (is absorbed in the safe state 1, i.e. as the event that (is absorbed in the catastrophic state is an absorbing semi-Markov chain, we know that and since the events mutually exclusive, it follows that ). The quantity ) describes a patient’s prior risk of ending in the catastrophic state, whereas ) describes the patient’s posterior risk of ending in the catastrophic state having observed its evolution history up to time -stopping time representing the absorption time of the path (in either state 1 or state N, i.e.

Finally, we define K as the (random) number of state realizations in the sequence up to the stopping time , which has to be concluded by either state 1 or N, e.g. when X = 4, the sequences {1}, {4}, {2, 3, 2, 3, 4}, and {3, 2, 1} are valid, random-length realizations of , and each represents a certain state evolution trajectory for the patient.

2.1.2 Observations and Censoring

The path (is unobservable; what is observable is a corresponding process (on (Ω), the values of which are drawn from an observation-space Y, and whose distributional properties are dependent on the latent states’ path (observable process (can be put in the form

where (is a c`adl`ag path, comprising a sequence of function-valued variables Even though the path (is accessible, only a sequence of irregularly spaced samples of it is observed over time, and is denoted by is the set of observed measurements, and M is the total number of such measurements. We say that the process is censored if episodes in an EHR are censored: observations stop at some point of time due to a release from care, an ICU admission, mortality, etc.

The sampling times in T represent the times at which a patient with a chronic disease took clinical tests (i.e. time intervals in T spans years), or the times at which clinicians have gathered vital signs for a monitored critically ill patient in a hospital ward (i.e. time intervals in T span days or hours). We assume that the sampling times in T are drawn from a which is defined on (Ωand with being the Dirac measure. The point process Φ() is parametrized by an intensity parameter , but is assumed to be independent of the latent states pathfine as the set of samples that are gathered during the intervalcould possibly be empty (), some states can have no corresponding observations (i.e. an inpatient may exhibit a transition to a deteriorating state during the night, even though her blood pressure were not measured during the night. Recall the illustration in Figure 2).

The paths are assumed to be conditionally independent given the hidden sequence of states , and hence we have that

The observed samples generated under every state and sampled at the times in are drawn from Y according to a distribution emission parameter that controls the distributional properties of the observations generated under state j.

The number of observation samples is finite: the observed sequence is censored at some point of time, which we call the censoring time , after which no more observation samples are available. Censoring reflects an external intervention/event that terminated the observation sequence, i.e. death, intensive care unit (ICU) admission, etc. Censoring is informative (Scharfstein and Robins (2002);Huang and Wolfe (2002);Link (1989)), because the censoring time is correlated with the absorption time strictly precedes (in an almost sure sense). That is, -stopping time that is given by , i.e. once the patient enters state 1 or state N, the observations stop after the patient’s sojourn time in that state (i.e. observations stop after a time from the entrance in the absorbing state). Therefore, the duration distributions ) of states 1 and N are used to determine the censoring times conditioned on the chain being absorbed at time

Every sample from the HASMM is an episode comprising a random-length sequence of hidden states , and a random-length sequence of observations with the associated observation times. We only observe ; the path of latent states X(t), the number of realized states K, the association between observations and states (i.e. the sets ) are all unobserved, which makes the inference problem very challenging, but captures the realistic EHR data format and the associated inferential hurdles. In the next subsection, we specify the model’s generative process and present an algorithm to generate episodic samples from an HASMM.

2.2 Model Specification and Generative Process

As have been discussed in Subsection 2.1, the hidden and observables variables of an HASMM can be listed as follows:

• Hidden variables: The hidden states sequence and the states’ sojourn times (or equivalently, the transition times

• Observable variables: The observed episode and the associated sam- pling times

The HASMM model parameters that generate both the hidden and observable variables are encompassed in the parameter set Γ, i.e.

Γ =

Since the point process Φ() does not reveal any information about the latent states, and hence plays no role in inference, we will drop it from the parameter set Γ in the rest of the paper. In the following, we specify the distributional properties for both the hidden and observable variables.

2.2.1 Distributional specifications for the hidden variables

We model the state sojourn time of each state via a Gamma distribution. The selection of a Gamma distribution ensures that the generative process encompasses ordinary continuous-time Markov models for the path (, since the exponential distributionis a special case of the Gamma distribution (Durrett (2010)). Thus, if the underlying physiology of the patient is naturally characterized by memoryless state transitions, this will be automatically learned from the data via the parameters of the Gamma distribution. The sojourn time distribution for state i is given by

Figure 4: Exemplary transition functions (for a 4-state HASMM.

Figure 5: Depiction for the correlation structure of the observable variables for an underlying state sequence

where 0 are the shape and rate parameters of the Gamma distribution respectively.

Now we specify the structure of the transition kernel call from (4) that the each element in the transition kernel matrix can be written as ). Having specified the distribution ) as a Gamma distribution, it remains to specify the function ) in order to construct the elements of Q(s). The transition functions (are given by multinomial logistic functions as follows

where . The parameters (determine the baseline values for the transi- tion probability mass from state (0), whereas the parameters the dependence of the transition probability mass on the sojourn timewe have that i.e. the transition probability out of state i remains constant irrespective of the sojourn time in that state. In the limit when s

Figure 6: A basic graphical model for the HASMM.

goes to infinity, the parameter dominates the functional form in (6). Figure 4 depicts exemplary transition functions (for a 4-state HASMM.

2.2.2 Distributional specifications for the observable variables

As explained in Subsection 2.1, the observable process Y (t) can be decomposed as Y (t) = , where the paths (are conditionally independent given the state sequence . Since observations are drawn from Y (t) at arbitrarily, and irregularly spaced time instances T , we have to model the distributional properties of Y (t) in continuous time. We model every path ) defined over [) as a segment drawn from a multi-task Gaussian Process (GP), with a hyper-parameter set Θthat depends on the corresponding latent state (Rasmussen (2006); Bonilla et al. (2007)). The input to the multi-task GP is the time variable and the output is the set of physiological variables at a certain point of time. The GP associated with every state parametrized by a constant mean function covariance ker-

nel , and a “free-form” covariance matrix Σbetween the different physiological measurements (Bonilla et al. (2007)). Thus, for a Q-dimensional physiological stream )), the observations for state i are generated as follows

where . The GP hyper-parameters associated with state i are given by Θ

We note that the HASMM model is a segment model (Ostendorf et al. (1996); Murphy (2002); Yu (2010); Gu´edon (2007)), i.e. observation samples that are defined within the sojourn time of the same state are correlated, but observation samples in different states are independent. The model can also be viewed as a state-switching model, but for which

0 20 40 60 80 100 120 140 160 20

Figure 7: An episode generated by = 5. The realized hidden state sequence (upper) is {2, 3, 5}, and is absorbed in state 5. The physiological stream (2-dimensional and stream ) is sampled more intensely than

the transition dynamics do not need to be linear as in (Georgatzis et al. (2016); Fox et al. (2011a)), but rather depend on the covariance kernel ). Figure 5 depicts the correlation structure of the observable variables in terms of the covariance matrix of a discrete version of Y (t) generated under a specific hidden state sequence. We can see that conditioned on the hidden state sequence, the covariance matrix is a block diagonal matrix, where the sizes of the blocks are random and are determined by the hidden states’ sojourn times.

The sampling times in T are generated by the point process Φ(), which for the sake of completeness of the model description, we specify as a Poisson process with an intensity parameter . Note though that since we assume the sampling times are uninformative of the latent states path X(t), the distributional specification of Φ() is ancillary to the inference and learning algorithms developed in Sections 3 and 4.

2.2.3 Sampling episodes from an HASMM

We conclude this Section by presenting an algorithm for sampling episodes from an HASMM with a hyper-parameter set Γ. Algorithm 1 (samples a patient’s episodes by first sampling an initial state from X, and then sequentially samples sojourn times s from the Gamma distribution, and new states using the semi-Makrov kernel Q(s), until an absorbing state is drawn. Figure 7 depicts an episode sampled via Algorithm 1.

3. Inference in Hidden Absorbing Semi-Markov Models

In this Section, we develop an online algorithm that carries out diagnostic and prognostic inferences for a monitored patient’s episode in real-time. Given an ongoing realization of an episode (before the censoring time

HASMM model parameter Γ that has generated this realization (i.e. is sampled via the algorithm ), we aim at carrying out the following inference tasks:

• Diagnosis: Infer the patient’s current clinical state, i.e. compute

• Prognostic Risk Scoring: Compute the patient’s risk of absorption in the catastrophic state, i.e.

In the rest of this Section, we drop the conditioning on Γ for notational brevity. The first inference task corresponds to disease severity estimation for patients with chronic disease, or clinical acuity assessment for critical care patients. The second task corresponds to risk scoring for future adverse events for patients who have been monitored for some period of time, i.e. the risk of developing a future preclinical or clinical breast cancer state (Gail and Mai (2010)), the risk of clinical deterioration for post-operative patients in wards (Rothman et al. (2013)), the risk of mortality for ICU patients (Knaus et al. (1985)), etc.

3.1 Challenges facing the HASMM Inference Tasks

The inference tasks discussed in the previous Subsection are confronted with 3 main challenges –listed hereunder– that hinder the direct deployment of classical forward-backward message-passing routines.

Figure 8: An exemplary HASMM episode with 6 hidden state realizations and 9 observed samples.

1. In addition to the clinical states being unobserved, the transition times among the states, , are also unobserved (i.e. we do not know the time at which the patient’s state changed). Thus, unlike the discrete-time models in (Murphy (2002); Johnson and Willsky (2013); Yu (2010); Dewar et al. (2012); Gu´edon (2007)), in which we know that the underlying states switch sequentially in a (known) one-to-one correspondence with the observations, in an HASMM the association between states and observations is unknown. Figure 8 depicts an exemplary HASMM episode with 6 realized states and 9 observations samples; in this realization, the association between the observations is hidden. The importance of reasoning about the hidden transition times is magnified by the duration-dependence of the transition probabilities that govern the sequence

2. Since observations are made at random and arbitrary time instances, some transitions may not be associated with any evidential data. That is, as it is the case for state Figure 8, there is no guarantee that for every state , an observation is drawn during its occupancy, i.e. [). In a practical setting, the inference algorithm should be able to reason about the state trajectories even in silence periods that come with no observations (recall the example in Figure 2 where observations of a critical care patient’s systolic blood pressure stop for an entire day). Hence, one cannot directly discretize the time variable and use the discrete-time HMM inference algorithms (e.g. the algorithms in (Rabiner (1989))) since in that case we would exhibit time steps that come with no associated observations, and with potential state transitions.

3. The HASMM model assumes that observations that belong to the same state are correlated (e.g. in Figure 8, each of the subset of observations are not drawn independently conditioned on the latent state since they are sampled from a GP), thus we cannot use the variable-duration and explicit-duration HSMM inference algorithms in (Murphy (2002); Johnson and Willsky (2013); Yu (2010); Gu´edon (2007)), as those assume that all observations are conditionally independent given the latent states. Our model is closer to a segment-HSMM model (Yu (2010); Gu´edon (2007)), but with irregularly spaced observations and an underlying duration-dependent state evolution process, which requires a different construction of the forward messages.

In the following Subsection, we develop a forward filtering algorithm that deal with episodes generated from an HASMM and address the above challenges.

3.2 The HASMM Forward Filtering Algorithm

Given a realization of an episode , the posterior probability of the patient’s current clinical state ) is given by

The above application of Bayes’ rule implies that, given the observation times T , computing the joint probability density ) suffices for computing the posterior probability of the patient’s clinical states. As it is the case for the conventional HMM setting, we denote these joint probabilities as the forward messages

Since the HASMM is a segment model, the conventional notion of the forward messages ) does not suffice for constructing the forward filtering algorithm since we need to account for the latent correlation structures between the (conditionally-dependent) observations (Murphy (2002)). To that end, we define ) as the forward message for the state at the observation time (i.e. as follows

for some That is, the forward message ply the joint probability that the current state is j, that the associated observations are (and that the current state has lasted for the last w measurements. For notational brevity, denote the event) can be written as

which can be decomposed using the conditional independence properties of the states, observable variables and sojourn times as follows

d

where we have dropped the conditioning on T for notational brevity. The first term, )), is the probability density of the observable vari- ables in conditioned on the hidden state being and that the time instances reside in the sojourn time of . The second term, the probability that the hidden state sequence transits to state j after a period of , given that its sojourn time in state is at least , and at most . The third term is the probability that the sojourn time in state between , whereas the fourth term, ), is the (forward message with a lag of . Thus, we can write the forward message with a lag w as follows

As we can see in (10), one can express ) using a recursive formula that makes use of the older forward messages ) = 0, which allows for an effi-cient dynamic programming algorithm to infer the patient’s clinical state in real-time; this is important in critical care settings where prompt risk assessments are crucial for timely clinical intervention.

The construction of the forward messages in (10) parallels the structure of forward message-passing in segment-HSMM (See Section 1.2 in (Murphy (2002)) and Section 4.2.2 in (Yu (2010))), but with the following differences. In (10), the time interval between every two observation samples is irregular, which reflects in the correlation between the observations in (depends on the covariance kernel of the GP, and the probability of the current latent state’s sojourn time being encompassing the most recent w samples, i.e. ()). However, the most challenging ingredient of the forward message is the interval transition probability is because unlike the discrete-time HSMM models in (Murphy (2002); Yu (2010)), which exhibit transitions only at discrete time steps that are always accompanied with evidential observations, i.e. no hidden transitions can occur between observation samples, and the transitions among hidden states are duration-independent, in an HASMM, transitions can occur at arbitrary time instances, multiple transitions can occur between two observation samples, and transitions are duration-dependent.

In order to evaluate the term )), we construct a virtual (discrete-time) trivariate , the transition probabilities of which are equal to the interval transition probabilities. In the recent work in (Liu et al. (2015)), a similar embedded Markov chain analysis was conducted for a CTHMM, but for which the underlying state evolution process was assumed to be a duration-independent ordinary Markov chain for which the expressions for the interval transition probabilities are readily available by virtue of the exponential distributions of the memoryless state sojourn times.

Recall from Subsection 2.1.1 that the semi-Markov kernel of the hidden state sequence is defined as i.e. the probability that the sequence transits from state i to state j given that the sojourn time in i is less than or equal to . Theorem 1 establishes the methodology for computing the interval transition probabilities )) using the parameters of an HASMM. In Theorem 1, we define ) as a matrix-valued function the entries of which are given by

be the solution to the fol- lowing integral equation

(11) for the three independent variables (. Then, the interval transition probability is given by

Theorem 1 follows from a first-step analysis that is akin to the derivation of the conventional Chapman-Kolmogorov equations in ordinary Markov chains (Kulkarni (1996)). The integral equation in (11) is a (matrix-valued) non-homogeneous Volterra integral equation of the second kind (Polyanin and Manzhirov (2008)).

It can be easily demonstrated that a closed-form solution that hinges on conventional kernel methods cannot be obtained. Hence, we resort to a numerical method in order to solve (11) for . Before presenting the numerical method, we reformulate (11) as follows

where is an element-wise convolution operator. (12) follows from (11) by the fact that the integral in (11) is a convolution integral; (12) can be expressed as follows

where the (functional) operator ) is given by

where F is the Fourier transform operator, and the transforms in (14) are all taken with respect to

(1967)) as follows. We initialize the function ) with the truncated semi-Markov kernel), and then iteratively apply the operator B(.) to obtain a new value for ) until convergence. That is, the successive approximation procedure goes as follows

The following Theorem establishes the validity of the procedure in (15) as a solver for (13). Before presenting the statement of Theorem 2, we define the function space P as follows

where is the Kronecker delta function.

Theorem 2 (Convergence of successive approximations) The functional has a unique fixed-point and the successive approximation procedure in (15) always converges to the fixed point, i.e. , starting from any initial value

It is important to note that we do not need to solve for ) during real-time inference. Instead, we create a look-up table comprising a discretized version of [˜, and then we query this table when performing real-time inference for monitored patients. Hence, efficient and fast inferences can be provided for critical care patients for whom prompt diagnoses are necessary for the efficacy of clinical interventions. Algorithm 2 shows a pseudocode for constructing a look-up table of interval transition probabilities, ), which takes as an input the parameter set Γ and a precision level (to control the termination of the successive approximation iterations), and outputs the interval transitions look-up table ). In Algorithm 2, FFT and IFFT refer to the fast Fourier transform operation and its inverse, respectively, and “diff(.)” refers to a numerical differentiation operation.

Now that we have constructed the algorithm TransitionLookUp to compute the interval transition probabilities in the look-up table ), we can implement a forward-filtering inference algorithm using dynamic programming (by virtue of the recursive formula in (10)). In particular, the posterior probability of the patient’s current clinical state in terms of the forward messages can be written as

Algorithm 3, ForwardFilter, implements real-time inference of a patient’s clinical state given a sequence of measurements . In Algorithm 3, we invoke TransitionLookUp initially to construct the look-up table of transition probabilities, but in practice, the look-up table can be constructed in an offline stage once the HASMM parameter set Γ is known. The number of computations can be reduced by limiting the lags w for every forward message ) to the samples in T that reside in a period derived from the Gamma distribution of the sojourn time (e.g. can be selected such that ) = 90%). The complexity of ForwardFilter is similar to the conventional forward algorithms in (Rabiner (1989)), i.e.

3.3 Prognostic risk scoring using an HASMM

Diagnostic inference, e.g. estimating the patient’s current state after a screening test, can be conducted by a direct application of the forward filtering algorithm presented in the previous Subsection. In this Subsection, we now explain how prognostic inferences can also be conducted using the algorithms presented in the previous Subsection. Prognostic risk scoring plays an important role in designing screening guidelines (Gail and Mai (2010)), acute care interventions (Knaus et al. (1985)) and surgical decisions (Foucher et al. (2007)).

A risk score is a measure for the patient’s risk of encountering an adverse event (abstracted as state N in our model) at any future time step starting from time That is, the patient’s risk score at time can be formulated as

which can be computed using the outputs of TransitionLookUp and ForwardFilter as follows

Therefore, the procedures TransitionLookUp and ForwardFilter suffice for executing both the diagnostic and prognostic inference tasks.

4. Learning Hidden Absorbing Semi-Markov Models

In Section 3.1, we developed an inference algorithm that can handle diagnostic and prognostic tasks for patients in real-time assuming that the true HASMM parameter set Γ is known. In practice, the parameter set Γ is not known, and has to be learned from an offline EHR dataset D that comprises D episodes for previously hospitalized or monitored patients, i.e.

where are the observable variables and their respective sampling times for the is the episode’s censoring time, and is a label for the realized absorbing state.

We note that unlike the conventional HMM learning setting (Rabiner (1989); Zhang et al. (2001); Nodelman et al. (2012)), the episodes are not of equal-length as the observations for every episode stop at a random, but informative, censoring time. Thus, the patient’s state trajectory does not manifest only in the observable time series, i.e. also in the episode’s censoring variables . In this Section, we develop an efficient algorithm, which we call the forward-filtering backward-sampling Monte Carlo EM (FFBSMCEM) algorithm, that computes the Maximum Likelihood (ML) estimate of Γ given an informatively censored dataset is the likelihood of the dataset D given the parameter set Γ. We start by presenting the learning setup in Section 4.1, and then we present the FFBS-MCEM algorithm Section 4.3.

4.1 The Learning Setup

We focus on the challenging scenario when no domain knowledge or diagnostic assessments for the patients’ latent states are provided in the dataset (with the exception of the absorbing state which is declared by the variable ), i.e. the learning setup is an unsupervised one. For such a scenario, the main challenge in constructing the ML estimator Γresides in the hiddenness of the patients’ state trajectories in the training dataset D; the dataset D contains only the sequence of observable variables, their respective observation times, the episodes censoring time and the state in which the trajectory was absorbed. If the patients’ latent state trajectories (were observed in D, the ML estimation problem ΓΓ) would have been straightforward; the hiddenness of (entails the need for marginalizing over the space of all possible latent trajectories conditioned on the observed variables, which is a hard task even for conventional continuous-time HMM models (Liu et al. (2015); Nodelman et al. (2012); Leiva-Murillo et al. (2011); Metzner et al. (2007)).

In order to construct the ML estimator for Γ, we start by writing the complete likelihood, i.e. the likelihood of an HASMM with a parameter set Γ to generate both the hidden states trajectory and the observable variables episode in the dataset D as follows

where is the number of states that realized in episode d from t = 0 until absorption. The factorization in (19) follows from the conditional independence properties of the HASMM variables (see Figure 6). Since we cannot observe the latent states trajectory , the ML estimator deals with the expected likelihood Λ(evaluated by marginalizing the complete likelihood over the latent states trajectories, i.e.

where we have assumed that the episodes in D are independent and identically distributed. The integral in (20) can be further decomposed as follows

Λ(

4.2 Challenges Facing the HASMM Learning Task

The problem of learning the HASMM parameters by maximizing the likelihood function in (21) is obstructed by various obstacles that hinder the deployment of off-the-shelf learning algorithms; we list these challenges hereunder.

1. Finding the ML estimate Γby direct maximization of Λ(D | Γ) is not viable due to the intractability of the integral in (21), i.e. Λ(D | Γ) has no analytic maximizer. The difficulty of evaluating the expected likelihood Λ(D | Γ) follows from the need to average the complete likelihood over a complicated posterior density function for the latent state trajectory.

2. Direct adoption of the conventional Baum-Welch implementation of the EM algorithm as a solution to the intractable problem of maximizing the expected likelihood in (21) –as has been applied in HMMs (Rabiner (1989)), HSMMs (Murphy (2002)), EDHMMs and VDHMMs (Yu (2010)– is not possible for the HASMM setting. This is due to the intractability of the integral involved in the E-step; a problem that is also faced by other continuous-time models (Liu et al. (2015); Nodelman et al. (2012)). However, these models assumed Markovian state trajectories, in which case the implementation of the E-step boils down to computing the expected state durations and transition counts as sufficient statistics for estimating the latent trajectories(e.g. see Equations (12) and (13) in (Liu et al. (2015))). This simplification, which follows from the plausible properties of the Markov chain’s transition rate matrix, does not materialize for semi-Markovian transitions. Further complications are introduced by the duration-dependence of the state-transitions and the segmental nature of the observables.

3. Learning an informatively censored dataset would naturally benefit from the information conveyed in the censoring variables . However, the availability of censor- ing information leads to more complicated posterior density expressions for the latent