Modeling the clinical conditions of a patient using evidential physiological data is a ubiquitous problem that arises in many healthcare settings, including disease progression modeling (Schulam and Saria (2015); Mould (2012); Wang et al. (2014); Jackson et al. (2003); Sweeting et al. (2010); Liu et al. (2015)) and critical care prognosis (Moreno et al. (2005); Matos et al. (2006); Yoon et al. (2016); Hoiles and van der Schaar (2016); Alaa et al. (2016)). Accurate physiological modeling in these settings confers an instrumental value that mani-
Figure 1: An episode of the diastolic blood pressure measurements (as recorded in the EHR) for a patient hospitalized in a regular ward for 50 days and then admitted to the ICU after the ward staff realized she is clinically deteriorating. Measurements are censored in accordance with the ICU admission time.
0 50 100 150 100
Figure 2: An episode of the systolic blood pressure measurements for a patient hospitalized in a regular ward for 6 days and then discharged home by the ward staff. Measurements are missing in a 24-hour period during the patient’s stay in the ward.
fests in the ability to provide early diagnosis, individualized treatments and timely interventions (e.g. early warning systems in critical care hospital wards (Yoon et al. (2016)), early diagnosis for Scleroderma patients (Varga et al. (2012); Alaa and van der Schaar (2016)), early detection of a progressing breast cancer (Bartkova et al. (2005)), etc). Physiological modeling also confers an epistemic value that manifests in the knowledge extracted from data about the progression and severity phases of a disease (Stelfox et al. (2012))), or the short-term dynamics of the physiological behavior of critically ill patients (Li-wei et al. (2013)).
The recent availability of data in the electronic health records (EHR)creates a promising horizon for establishing rich and complex physiological models (Gunter and Terry (2005)). Modern EHRs comprise episodic data records for individual (anonymized) patients; every patient’s episode is a temporal sequence of clinical findings (e.g. visual field index for Glau-
coma patients (Liu et al. (2015)), CD4 cell counts for HIV-infected patients (Guihenneuc-Jouyaux et al. (2000)), etc), lab test results (e.g. white cell blood count for post-operative patients under immunosuppressive drugs (Cholette et al. (2012)), etc), or vital signs (e.g. blood pressure and saturation (Yoon et al. (2016))). The time span of these episodes may be as short as few days in short-term hospitalization episodes (e.g. patients with solid tumors, hematological malignancies or neutropenia who are hospitalized in regular wards before or after a surgery (Kause et al. (2004); Hogan et al. (2012); Kirkland et al. (2013))), or as long as few years in longitudinal episodes (e.g. chronic obstructive pulmonary disease may evolve from a mild Stage I to a very severe Stage IV over a time span of 10 years (Pedersen et al. (2011); Wang et al. (2014))). In this paper, we develop a versatile physiological model that fits a wide spectrum of healthcare settings, providing means for data-driven clinical prognosis. In the next Subsection, we expose our modeling rationale and list the modeling challenges posed by the structure of modern EHR data. We conclude this Section by summarizing our contributions in Subsection 1.2.
1.1 Modeling Rationale and Challenges
1.1.1 Rationale
Previous physiological models have branched into two different modeling paths with respect to the way a patient’s clinical states are defined. One strand of literature adopts fully observable models; these models assume that clinical states are quantifiable via observable clinical markers or disease severity measures (e.g. PFVC in Scleroderma (Schulam and Saria (2015)), GFR in kidney disease (Eddy and Neilson (2006)), etc). Another strand of literature adopts latent variable models, which assume that clinical states are latent and manifest only through proximal, noisy physiological measurements. Table 1 lists some notable previous works that fall under each modeling category.
Table 1: Modeling methodologies in previous works.
Our modeling choice is to go with a latent variable model. The rationale behind this choice is explicated as follows.
• In a wide range of problems, a concrete clinical marker that can be directly used as a surrogate for the patient’s true clinical condition is available. This is especially true in critical care settings where no solid definition or measure of a “clinical state” exists (Li-wei et al. (2013)). Previous works that adopted a clinical risk score as a surrogate for the clinical state in critical care settings have found that other physiological features, when augmented with the clinical risk score, still hold a significant predictive power with respect to end-point clinical outcome (Ghassemi et al. (2015)). This implies that a clinical risk score or a severity of illness measure (such as APACHE II, SAPS and SOFA (Knaus et al. (1991); Subbe et al. (2001))) is not a sufficient measure of a patient’s true clinical condition, and hence cannot be reliably modeled as an observable clinical state.
• The same line of argument extends to disease progression models: (Jackson et al. (2003)) has shown that significant modeling gain can be attained by treating clinical markers and diagnostic assessments as noisy, potentially erroneous manifest variables for the patient’s true clinical state rather than defining a clinical state in terms of those markers.
• For various chronic disease, such as HIV, Scleroderma, and kidney disease, progression stages are well defined in terms of observable clinical markers (CD4 cell count, PFVC and GFR). However, a latent variable model can help validate and assess the current domain knowledge-based clinical practice guidelines by learning alternative, data-driven guidelines. Other diseases, such as COPD, have their progression stages manifesting only through symptoms (e.g. chronic bronchitis, emphysema and chronic airway obstruction (Wang et al. (2014))), which may or may not accurately reflect the disease’s true state, and hence a latent variable model is necessary.
• Conclusive clinical markers that reveal a patient’s true state may be available only occasionally in a patient’s longitudinal episodes. For instance, in a breast cancer progression setting, most of the data points associated with a patient’s longitudinal episode would be imaging test results (e.g. BI-RADS scores of a mammogram or an MRI (Gail and Mai (2010); Taghipour et al. (2013))), which are noisy markers for the existence of a tumor, whereas a conclusive biopsy result that truly reveals whether the patient is in a preclinical or clinical breast cancer state may not be available because the patient did not undergo a biopsy test. A latent variable model better suits such settings.
• A fully observable model does not provide diagnostic utility since it assumes that an already observable clinical marker provides an immediate, domain-knowledge-based diagnosis for the patient. Contrarily, a latent variable model leaves room for diagnoses to be learned from the evidential data by learning the association between physiological evidence and clinical states, which may help inform and improve clinical practice.
1.1.2 Challenges
Hidden Markov Models (HMMs) and their variants have been widely deployed as temporal latent variable models for dynamical systems (Smyth (1994); Zhang et al. (2001); Giampieri et al. (2005); Genon-Catalot et al. (2000); Ghahramani and Jordan (1997)). Such models have achieved considerable success in various applications, such as topic modeling (Gruber et al. (2007)), speaker diarization (Fox et al. (2011b)), and speech recognition (Rabiner (1989)). However, the nature of the clinical setting, together with the format of the modern EHR data pose the following set of serious challenges that confound classical HMM models:
(A) Non-stationarity: Recently developed disease progression models, such those in (Wang et al. (2014)) and (Liu et al. (2015)), use conventional stationary Markov chain models. In particular, they assume that state transition probabilities are independent of time. However, this assumption is seriously at odds with even casual observational studies which show that the probability of transiting from the current state to another state depends on the time spent in the current state (Lagakos et al. (1978); Huzurbazar (2004); Gillaizeau et al. (2015)). This effect, which violates the memorylessness assumptions adopted by continuous-time Markovian models, was verified in patients who underwent renal transplantation (Foucher et al. (2007, 2008)), patients who are HIV infected (Joly and Commenges (1999); Dessie (2014); Foucher et al. (2005)), and patients with chronic obstructive pulmonary disease (Bakal et al. (2014); Wang et al. (2014)).
(B) Irregularly spaced observations: The times at which the clinical findings of a patient (vital signs or lab tests) are observed is controlled either by clinicians (in the case of hospitalized inpatients), or by the patient’s visit times (in the case of a chronic disease follow up). The time interval between every two measurements may vary from one patient to another, and may also vary for the same patient within her episode. This is reflected in the structure of the episodes in the EHR records, as shown in Figure 1 and 2. Figure 1 depicts an actual diastolic blood pressure episode for a patient hospitalized in a regular ward for 1200 hours (50 days). The patient’s stay in the ward was concluded with an admission to the ICU after the ward staff realized she was clinically deteriorating. As we can see, the blood pressure measurements in the first 20 hours were initially taken with a rate of 1 sample per hour, and then later the rate changed to 1 sample every 5 hours
. Thus, a direct application of a regular, discrete-time HMM (e.g. the models in (Murphy (2002); Fox et al. (2011b,a); Rabiner (1989); Yu (2010); Matos et al. (2006); Guihenneuc-Jouyaux et al. (2000))) will not suffice for jointly describing the latent states and observations, and hence ensuring accurate inferences.
(C) Discrete observations of a continuous-time phenomena: A patient’s physiological signals and latent states evolve in continuous time; however, the observed physiological measurements are gathered at discrete time steps. The intervals between observed measurements can vary quite significantly; as we can see in Figure 2, the systolic blood pressure for a patient who stayed in a ward for 140 hours exhibits an entire day without measurements This means that the patient may encounter multiple hidden state transitions without any associated observed data. These effects make learning and inference problems more complicated since the inference algorithms need to consider potential unobserved trajectories of state evolution between every two timestamps. This challenge has been recently addressed in (Nodelman et al. (2012); Wang et al. (2014); Liu et al. (2015)), but only on the basis of memoryless Markov chain models for the hidden states, for which tractable inferences that rely on the solutions to Chapman-Kolmogorov equations can be executed. However, incorporating non-stationarity in state transitions (i.e. addressing challenge (1) in this list) would make the problem of reasoning about a continuous-time process through discrete observations much more complicated.
(D) Lack of supervision: The episodes in the EHR may be labeled with the aid of domain knowledge (e.g. the stages and symptoms of some chronic diseases, such as chronic kidney disease (Eddy and Neilson (2006)), are known to clinicians and may be provided in the EHR). However, in many cases, including the case of (post or pre-operative) critical care, we do not have access to any labels for the patients’ states. Hence, unsupervised learning approaches need to be used for learning model parameters from EHR episodes. While unsupervised learning of discrete-time HMMs has been extensively studied and is well understood (e.g. the Baum-Welch EM algorithm is predominant in such settings (Zhang et al. (2001); Yu (2010); Rabiner (1989))), the problem of unsupervised learning of continuous-time models for which both the patient’s states and state transition times are hidden is far less understood, and indeed far more complicated.
(E) Censored observations: Episodes in the EHR are usually terminated by an informative intervention or event, such as death, ICU admission, discharge, etc. This is known as informative censoring (Scharfstein and Robins (2002); Huang and Wolfe (2002); Link (1989)). Unlike classical HMM settings where training sets comprise fixed length, or arbitrarily-censored, HMM sequence instances, a typical EHR dataset would comprise a set of episodes with different durations, and the duration of each episodes is itself informative of the entire state evolution trajectory. Learning in such settings requires novel algorithms that can efficiently compute the likelihood of observing a set of episodes conditioned on their durations and terminating states.
1.2 Summary of Contributions
In order to address the challenges above, we develop a new model –which we call the Hidden Absorbing Semi-Markov Model (HASMM)– as a versatile generative model for a patient’s (physiological) episode as recorded in the EHR. The HASMM captures non-stationary transitions for a patient’s clinical state via a continuous-time semi-Markov model with explicitly specified state sojourn time distributions. Informative censoring is captured via absorbing states that designate clinical endpoint outcomes (e.g. cardiac arrest, mortality, recovery, etc); entering an absorbing state of an HASMM stimulates censoring events (e.g. clinical deterioration leads to an ICU admission which terminates the physiological observations for a monitored patient in a ward, etc). Observable variables are modeled via a multi-task Gaussian process (Bonilla et al. (2007)), for which the observation times (i.e. follow up visits, vital sign gathering, lab tests, etc) are modeled as a point process. Using multi-task Gaussian process with state-dependent hyper-parameters, an HASMM accounts for both correlations among different physiological variables, in addition to the temporal correlations among the observation variables that are generated by the same hidden state during its sojourn period. In that sense, an HASMM is a segment model (Ostendorf et al. (1996)) and also a state-switching model (Fox et al. (2011a))).
To allow for real-time inference of a patient’s state, we develop a forward-filtering HASMM inference algorithm that can estimate a patient’s latent state using her history of irregularly sampled physiological measurements. The inference algorithm operates by constructing a virtual, discrete-time embedded Markov chain that fully describes the patient’s state transitions at observation times. The embedded Markov chain is constructed in an offline stage by solving a system of Volterra integral equations of the second kind using the successive approximation method; the solution to this system of equations, which parallels the Chapman-Kolmogorov equations in ordinary Markov chains, describe the HASMM’s semi-Markovian state transitions as observed at arbitrarily selected discrete timestamps.
Offline learning of the HASMM model parameters from patients’ episodes in an EHR is a daunting task. Since the HASMM is a continuous-time model, we cannot directly use the classical Baum-Welch EM algorithms for learning its parameters (Rabiner (1989)). Moreover, the semi-Markovianity of an HASMM yields an intractable integral in the E-step of the Expectation-Maximization (EM) formulation. Since the HASMM’s state transitions are not captured by the conventional continuous-time Markov chain transition rate matrices, we cannot make use of the Expm and Unif methods that were used in (Hobolth and Jensen (2011)), and more recently in (Liu et al. (2015)) for evaluating the integrals involved in the E-step of learning continuous-time HMMs. To address this challenge, we develop a novel forward-filtering backward-sampling Monte Carlo EM (FFBS-MCEM) algorithm that approximates the integral involved in the E-step by efficiently sampling the latent clinical trajectories conditioned on observations in the EHR by exploiting the informative censoring of the patients’ episodes. The FFBS-MCEM algorithm samples the latent clinical states of every (offline) patient episode in the EHR as follows: it starts from the known clinical endpoints, and sequentially samples the patient’s states by traversing in the reverse-time direction while conditioning on the future states, and then uses the sampled state trajectories to evaluate a Monte Carlo approximation for the E-step.
The rest of the paper is organized as follows. In Section 2, we present the HASMM model. The HASMM inference algorithm is developed in Section 3, and the learning algorithm is developed in Section 4. In Section 5, we demonstrate the utility of the HASMM in the problem of critical care prognosis using a real-world dataset for patients admitted to Ronald Reagan UCLA Medical Center. Conclusions are drawn in Section 6.
In this section, we introduce the basic abstract structure of the continuous-time HASMM (Subsection 2.1), and then we propose the distributional specifications for the model’s variables (Subsection 2.2).
2.1 Abstract Model
We start by describing the HASMM’s hidden state evolution process, and then we describe the structure of its observable variables.
2.1.1 Hidden States
We consider a filtered probability space (Ω), over which a continuous-time stochastic process X(t) is defined on
. The process X(t) corresponds to a temporal trajectory of the patient’s hidden clinical states, which take on values from a finite statespace X = {1, 2, . . ., N}. Because the process X(t) takes on only finitely many values, it can be decomposed in the form
where (is a c`adl`ag path (i.e. right-continuous with left limits), and the interval [
) is the time interval accommodating the
hidden state, which takes on a value
Every path (
on the stochastic basis (Ω
Markov path (Janssen and De Dominicis (1984); Durrett (2010)), where the sojourn time of state n, which we denote as
, is drawn from a state-specific distribution
being a state-specific duration parameter associated with state
. Unlike ordinary time-homogeneous semi-Markov transitions, in which the transition probabilities among states are assumed to be constant conditioned on there being a transition from the current state (Gillaizeau et al. (2015); Murphy (2002); Johnson and Willsky (2013); Yu (2010); Dewar et al. (2012); Gu´edon (2007)), our model accounts for duration-dependent semi-Markov transitions. In other words, the transition probability from one state to another depends on the time elapsed in the current state, i.e.
where is well defined, and
Now consider the bivariate (renewal) process (, which comprises the sequence of states and sojourn times. The semi-Markovian nature of X(t) implies that
(satisfies the following condition on its transition probabilities
where ) is the cumulative distribution function of state i’s sojourn time, and ¯
is the probability mass function that reflects the probability that a patient’s next state being j given that she was at state i and her sojourn time in i is less than (or equal to) s. Based on (3), we define the semi-Markov transition kernel as a matrix-valued function
that describes the dynamics of X(t) in continuous time, with entries
that are given by
Since the observable episode for a patient can begin in an arbitrary clinical state (because we only observe the physiological measurements starting from the time when the patients are hospitalized or start taking clinical tests), then it follows that the initial state random
. The initial state distribution is given by
The hidden states reflect different levels of clinical risk or severity (e.g. progression stage indexes of a chronic disease or phases of clinical deterioration (Sweeting et al. (2010);Chen and Zhou (2011))). In that sense, state 1 is regarded as the “least risky state”, and state N is regarded as the “most risky state”. We define and interpret states 1 and N as follows:
• State 1 is denoted as the safe state, and represents the state at which the patient is at minimum (or no) risk (e.g. clinically stable post-operative patient, etc).
• State N is denoted as the catastrophic state, and represents the state at which the patient is at severe risk or encounters an adverse event (e.g. a very severe stage of a chronic disease (Bakal et al. (2014)), a cardiac or respiratory arrest (Subbe et al. (2001)), mortality (Knaus et al. (1991)), etc).
We assume that whenever the system enters either state 1 or state N, it remains there forever. Therefore, we model states {1, N} as absorbing states, whereas we model the remaining states in X \ {1, N} as transient states that represent intermediate levels of risk. Following the assumptions in (Murphy (2002); Johnson and Willsky (2013)), we eliminate the self-transitions for all transient states by setting
X \ {1, N}, whereas we restrict the transitions from states 1 and N to self-transitions only, i.e.
. Figure 3 depict the Markov chain for the sequence
Figure 3: The Markov chain model for a 5-state HASMM.
We define as the event that the path (
is absorbed in the safe state 1, i.e.
as the event that (
is absorbed in the catastrophic state
is an absorbing semi-Markov chain
, we know that
and since the events
mutually exclusive, it follows that
). The quantity
) describes a patient’s prior risk of ending in the catastrophic state, whereas
) describes the patient’s posterior risk of ending in the catastrophic state having observed its evolution history up to time
-stopping time representing the absorption time of the path (
in either state 1 or state N, i.e.
Finally, we define K as the (random) number of state realizations in the sequence up to the stopping time
, which has to be concluded by either state 1 or N, e.g. when X = 4, the sequences {1}, {4}, {2, 3, 2, 3, 4}, and {3, 2, 1} are valid, random-length realizations of
, and each represents a certain state evolution trajectory for the patient.
2.1.2 Observations and Censoring
The path (is unobservable; what is observable is a corresponding process (
on (Ω
), the values of which are drawn from an observation-space Y, and whose distributional properties are dependent on the latent states’ path (
observable process (
can be put in the form
where (is a c`adl`ag path, comprising a sequence of function-valued variables
Even though the path (
is accessible, only a sequence of irregularly spaced samples of it is observed over time, and is denoted by
is the set of observed measurements, and M is the total number of such measurements. We say that the process is censored if
episodes in an EHR are censored: observations stop at some point of time due to a release from care, an ICU admission, mortality, etc.
The sampling times in T represent the times at which a patient with a chronic disease took clinical tests (i.e. time intervals in T spans years), or the times at which clinicians have gathered vital signs for a monitored critically ill patient in a hospital ward (i.e. time intervals in T span days or hours). We assume that the sampling times in T are drawn from a which is defined on (Ω
and with
being the Dirac measure. The point process Φ(
) is parametrized by an intensity parameter
, but is assumed to be independent of the latent states path
fine
as the set of
samples that are gathered during the interval
could possibly be empty (
), some states can have no corresponding observations (i.e. an inpatient may exhibit a transition to a deteriorating state during the night, even though her blood pressure were not measured during the night. Recall the illustration in Figure 2).
The paths are assumed to be conditionally independent given the hidden sequence of states
, and hence we have that
The observed samples generated under every state and sampled at the times in
are drawn from Y according to a distribution
emission parameter that controls the distributional properties of the observations generated under state j.
The number of observation samples is finite: the observed sequence is censored at some point of time, which we call the censoring time , after which no more observation samples are available. Censoring reflects an external intervention/event that terminated the observation sequence, i.e. death, intensive care unit (ICU) admission, etc. Censoring is informative (Scharfstein and Robins (2002);Huang and Wolfe (2002);Link (1989)), because the censoring time is correlated with the absorption time
strictly precedes
(in an almost sure sense). That is,
-stopping time that is given by
, i.e. once the patient enters state 1 or state N, the observations stop after the patient’s sojourn time in that state (i.e. observations stop after a time
from the entrance in the absorbing state). Therefore, the duration distributions
) of states 1 and N are used to determine the censoring times conditioned on the chain
being absorbed at time
Every sample from the HASMM is an episode comprising a random-length sequence of hidden states , and a random-length sequence of observations
with the associated observation times. We only observe
; the path of latent states X(t), the number of realized states K, the association between observations and states (i.e. the sets
) are all unobserved, which makes the inference problem very challenging, but captures the realistic EHR data format and the associated inferential hurdles. In the next subsection, we specify the model’s generative process and present an algorithm to generate episodic samples from an HASMM.
2.2 Model Specification and Generative Process
As have been discussed in Subsection 2.1, the hidden and observables variables of an HASMM can be listed as follows:
• Hidden variables: The hidden states sequence and the states’ sojourn times
(or equivalently, the transition times
• Observable variables: The observed episode and the associated sam- pling times
The HASMM model parameters that generate both the hidden and observable variables are encompassed in the parameter set Γ, i.e.
Γ =
Since the point process Φ() does not reveal any information about the latent states, and hence plays no role in inference, we will drop it from the parameter set Γ in the rest of the paper. In the following, we specify the distributional properties for both the hidden and observable variables.
2.2.1 Distributional specifications for the hidden variables
We model the state sojourn time of each state via a Gamma distribution. The selection of a Gamma distribution ensures that the generative process encompasses ordinary continuous-time Markov models for the path (
, since the exponential distribution
is a special case of the Gamma distribution (Durrett (2010)). Thus, if the underlying physiology of the patient is naturally characterized by memoryless state transitions, this will be automatically learned from the data via the parameters of the Gamma distribution. The sojourn time distribution for state i is given by
Figure 4: Exemplary transition functions (for a 4-state HASMM.
Figure 5: Depiction for the correlation structure of the observable variables for an underlying state sequence
where 0 are the shape and rate parameters of the Gamma distribution respectively.
Now we specify the structure of the transition kernel call from (4) that the each element in the transition kernel matrix can be written as
). Having specified the distribution
) as a Gamma distribution, it remains to specify the function
) in order to construct the elements of Q(s). The transition functions (
are given by multinomial logistic functions as follows
where . The parameters (
determine the baseline values for the transi- tion probability mass from state
(0), whereas the parameters
the dependence of the transition probability mass on the sojourn time
we have that
i.e. the transition probability out of state i remains constant irrespective of the sojourn time in that state. In the limit when s
Figure 6: A basic graphical model for the HASMM.
goes to infinity, the parameter dominates the functional form in (6). Figure 4 depicts exemplary transition functions (
for a 4-state HASMM.
2.2.2 Distributional specifications for the observable variables
As explained in Subsection 2.1, the observable process Y (t) can be decomposed as Y (t) = , where the paths (
are conditionally independent given the state sequence
. Since observations are drawn from Y (t) at arbitrarily, and irregularly spaced time instances T , we have to model the distributional properties of Y (t) in continuous time. We model every path
) defined over [
) as a segment drawn from a multi-task Gaussian Process (GP), with a hyper-parameter set Θ
that depends on the corresponding latent state
(Rasmussen (2006); Bonilla et al. (2007)). The input to the multi-task GP is the time variable and the output is the set of physiological variables at a certain point of time. The GP associated with every state
parametrized by a constant mean function
covariance ker-
nel , and a “free-form” covariance matrix Σ
between the different physiological measurements (Bonilla et al. (2007)). Thus, for a Q-dimensional physiological stream
)), the observations for state i are generated as follows
where . The GP hyper-parameters associated with state i are given by Θ
We note that the HASMM model is a segment model (Ostendorf et al. (1996); Murphy (2002); Yu (2010); Gu´edon (2007)), i.e. observation samples that are defined within the sojourn time of the same state are correlated, but observation samples in different states are independent. The model can also be viewed as a state-switching model, but for which
0 20 40 60 80 100 120 140 160 20
Figure 7: An episode generated by = 5. The realized hidden state sequence (upper) is {2, 3, 5}, and is absorbed in state 5. The physiological stream (
2-dimensional and stream
) is sampled more intensely than
the transition dynamics do not need to be linear as in (Georgatzis et al. (2016); Fox et al. (2011a)), but rather depend on the covariance kernel ). Figure 5 depicts the correlation structure of the observable variables in terms of the covariance matrix of a discrete version of Y (t) generated under a specific hidden state sequence. We can see that conditioned on the hidden state sequence, the covariance matrix is a block diagonal matrix, where the sizes of the blocks are random and are determined by the hidden states’ sojourn times.
The sampling times in T are generated by the point process Φ(), which for the sake of completeness of the model description, we specify as a Poisson process with an intensity parameter
. Note though that since we assume the sampling times are uninformative of the latent states path X(t), the distributional specification of Φ(
) is ancillary to the inference and learning algorithms developed in Sections 3 and 4.
2.2.3 Sampling episodes from an HASMM
We conclude this Section by presenting an algorithm for sampling episodes from an HASMM with a hyper-parameter set Γ. Algorithm 1 (samples a patient’s episodes by first sampling an initial state from X, and then sequentially samples sojourn times s from the Gamma distribution, and new states using the semi-Makrov kernel Q(s), until an absorbing state is drawn. Figure 7 depicts an episode sampled via Algorithm 1.
In this Section, we develop an online algorithm that carries out diagnostic and prognostic inferences for a monitored patient’s episode in real-time. Given an ongoing realization of an episode (before the censoring time
HASMM model parameter Γ that has generated this realization (i.e. is sampled via the algorithm
), we aim at carrying out the following inference tasks:
• Diagnosis: Infer the patient’s current clinical state, i.e. compute
• Prognostic Risk Scoring: Compute the patient’s risk of absorption in the catastrophic state, i.e.
In the rest of this Section, we drop the conditioning on Γ for notational brevity. The first inference task corresponds to disease severity estimation for patients with chronic disease, or clinical acuity assessment for critical care patients. The second task corresponds to risk scoring for future adverse events for patients who have been monitored for some period of time, i.e. the risk of developing a future preclinical or clinical breast cancer state (Gail and Mai (2010)), the risk of clinical deterioration for post-operative patients in wards (Rothman et al. (2013)), the risk of mortality for ICU patients (Knaus et al. (1985)), etc.
3.1 Challenges facing the HASMM Inference Tasks
The inference tasks discussed in the previous Subsection are confronted with 3 main challenges –listed hereunder– that hinder the direct deployment of classical forward-backward message-passing routines.
Figure 8: An exemplary HASMM episode with 6 hidden state realizations and 9 observed samples.
1. In addition to the clinical states being unobserved, the transition times among the states,
, are also unobserved (i.e. we do not know the time at which the patient’s state changed). Thus, unlike the discrete-time models in (Murphy (2002); Johnson and Willsky (2013); Yu (2010); Dewar et al. (2012); Gu´edon (2007)), in which we know that the underlying states switch sequentially in a (known) one-to-one correspondence with the observations, in an HASMM the association between states and observations is unknown. Figure 8 depicts an exemplary HASMM episode with 6 realized states and 9 observations samples; in this realization, the association between the observations
is hidden. The importance of reasoning about the hidden transition times is magnified by the duration-dependence of the transition probabilities that govern the sequence
2. Since observations are made at random and arbitrary time instances, some transitions may not be associated with any evidential data. That is, as it is the case for state Figure 8, there is no guarantee that for every state
, an observation is drawn during its occupancy, i.e. [
). In a practical setting, the inference algorithm should be able to reason about the state trajectories even in silence periods that come with no observations (recall the example in Figure 2 where observations of a critical care patient’s systolic blood pressure stop for an entire day). Hence, one cannot directly discretize the time variable and use the discrete-time HMM inference algorithms (e.g. the algorithms in (Rabiner (1989))) since in that case we would exhibit time steps that come with no associated observations, and with potential state transitions.
3. The HASMM model assumes that observations that belong to the same state are correlated (e.g. in Figure 8, each of the subset of observations are not drawn independently conditioned on the latent state since they are sampled from a GP), thus we cannot use the variable-duration and explicit-duration HSMM inference algorithms in (Murphy (2002); Johnson and Willsky (2013); Yu (2010); Gu´edon (2007)), as those assume that all observations are conditionally independent given the latent states. Our model is closer to a segment-HSMM model (Yu (2010); Gu´edon (2007)), but with irregularly spaced observations and an underlying duration-dependent state evolution process, which requires a different construction of the forward messages.
In the following Subsection, we develop a forward filtering algorithm that deal with episodes generated from an HASMM and address the above challenges.
3.2 The HASMM Forward Filtering Algorithm
Given a realization of an episode , the posterior probability of the patient’s current clinical state
) is given by
The above application of Bayes’ rule implies that, given the observation times T , computing the joint probability density ) suffices for computing the posterior probability of the patient’s clinical states. As it is the case for the conventional HMM setting, we denote these joint probabilities as the forward messages
Since the HASMM is a segment model, the conventional notion of the forward messages ) does not suffice for constructing the forward filtering algorithm since we need to account for the latent correlation structures between the (conditionally-dependent) observations (Murphy (2002)). To that end, we define
) as the forward message for the
state at the
observation time (i.e.
as follows
for some That is, the forward message
ply the joint probability that the current state is j, that the associated observations are (
and that the current state has lasted for the last w measurements. For notational brevity, denote the event
) can be written as
which can be decomposed using the conditional independence properties of the states, observable variables and sojourn times as follows
d
where we have dropped the conditioning on T for notational brevity. The first term, )), is the probability density of the observable vari- ables in
conditioned on the hidden state being
and that the time instances
reside in the sojourn time of
. The second term,
the probability that the hidden state sequence transits to state j after a period of
, given that its sojourn time in state
is at least
, and at most
. The third term is the probability that the sojourn time in state
between
, whereas the fourth term,
), is the (
forward message with a lag of
. Thus, we can write the
forward message with a lag w as follows
As we can see in (10), one can express ) using a recursive formula that makes use of the older forward messages
) = 0, which allows for an effi-cient dynamic programming algorithm to infer the patient’s clinical state in real-time; this is important in critical care settings where prompt risk assessments are crucial for timely clinical intervention.
The construction of the forward messages in (10) parallels the structure of forward message-passing in segment-HSMM (See Section 1.2 in (Murphy (2002)) and Section 4.2.2 in (Yu (2010))), but with the following differences. In (10), the time interval between every two observation samples is irregular, which reflects in the correlation between the observations in (depends on the covariance kernel of the GP, and the probability of the current latent state’s sojourn time being encompassing the most recent w samples, i.e. (
)). However, the most challenging ingredient of the forward message is the interval transition probability
is because unlike the discrete-time HSMM models in (Murphy (2002); Yu (2010)), which exhibit transitions only at discrete time steps that are always accompanied with evidential observations, i.e. no hidden transitions can occur between observation samples, and the transitions among hidden states are duration-independent, in an HASMM, transitions can occur at arbitrary time instances, multiple transitions can occur between two observation samples, and transitions are duration-dependent.
In order to evaluate the term )), we construct a virtual (discrete-time) trivariate
, the transition probabilities of which are equal to the interval transition probabilities. In the recent work in (Liu et al. (2015)), a similar embedded Markov chain analysis was conducted for a CTHMM, but for which the underlying state evolution process was assumed to be a duration-independent ordinary Markov chain for which the expressions for the interval transition probabilities are readily available by virtue of the exponential distributions of the memoryless state sojourn times.
Recall from Subsection 2.1.1 that the semi-Markov kernel of the hidden state sequence is defined as
i.e. the probability that the sequence transits from state i to state j given that the sojourn time in i is less than or equal to
. Theorem 1 establishes the methodology for computing the interval transition probabilities
)) using the parameters of an HASMM. In Theorem 1, we define
) as a matrix-valued function
the entries of which are given by
be the solution to the fol- lowing integral equation
(11) for the three independent variables (. Then, the interval transition probability
is given by
Theorem 1 follows from a first-step analysis that is akin to the derivation of the conventional Chapman-Kolmogorov equations in ordinary Markov chains (Kulkarni (1996)). The integral equation in (11) is a (matrix-valued) non-homogeneous Volterra integral equation of the second kind (Polyanin and Manzhirov (2008)).
It can be easily demonstrated that a closed-form solution that hinges on conventional kernel methods cannot be obtained. Hence, we resort to a numerical method in order to solve (11) for . Before presenting the numerical method, we reformulate (11) as follows
where is an element-wise convolution operator. (12) follows from (11) by the fact that the integral in (11) is a convolution integral; (12) can be expressed as follows
where the (functional) operator ) is given by
where F is the Fourier transform operator, and the transforms in (14) are all taken with respect to
(1967)) as follows. We initialize the function ) with the truncated semi-Markov kernel
), and then iteratively apply the operator B(.) to obtain a new value for
) until convergence. That is, the successive approximation procedure goes as follows
The following Theorem establishes the validity of the procedure in (15) as a solver for (13). Before presenting the statement of Theorem 2, we define the function space P as follows
where is the Kronecker delta function.
Theorem 2 (Convergence of successive approximations) The functional has a unique fixed-point
and the successive approximation procedure in (15) always converges to the fixed point, i.e.
, starting from any initial value
It is important to note that we do not need to solve for ) during real-time inference. Instead, we create a look-up table comprising a discretized version of
[˜
, and then we query this table when performing real-time inference for monitored patients. Hence, efficient and fast inferences can be provided for critical care patients for whom prompt diagnoses are necessary for the efficacy of clinical interventions. Algorithm 2 shows a pseudocode for constructing a look-up table of interval transition probabilities,
), which takes as an input the parameter set Γ and a precision level
(to control the termination of the successive approximation iterations), and outputs the interval transitions look-up table
). In Algorithm 2, FFT and IFFT refer to the fast Fourier transform operation and its inverse, respectively, and “diff(.)” refers to a numerical differentiation operation.
Now that we have constructed the algorithm TransitionLookUp to compute the interval transition probabilities in the look-up table ), we can implement a forward-filtering inference algorithm using dynamic programming (by virtue of the recursive formula in (10)). In particular, the posterior probability of the patient’s current clinical state in terms of the forward messages can be written as
Algorithm 3, ForwardFilter, implements real-time inference of a patient’s clinical state given a sequence of measurements . In Algorithm 3, we invoke TransitionLookUp initially to construct the look-up table of transition probabilities, but in practice, the look-up table can be constructed in an offline stage once the HASMM parameter set Γ is known. The number of computations can be reduced by limiting the lags w for every forward message
) to the samples in T that reside in a period
derived from the Gamma distribution of the sojourn time (e.g.
can be selected such that
) = 90%). The complexity of ForwardFilter is similar to the conventional forward algorithms in (Rabiner (1989)), i.e.
3.3 Prognostic risk scoring using an HASMM
Diagnostic inference, e.g. estimating the patient’s current state after a screening test, can be conducted by a direct application of the forward filtering algorithm presented in the previous Subsection. In this Subsection, we now explain how prognostic inferences can also be conducted using the algorithms presented in the previous Subsection. Prognostic risk scoring plays an important role in designing screening guidelines (Gail and Mai (2010)), acute care interventions (Knaus et al. (1985)) and surgical decisions (Foucher et al. (2007)).
A risk score is a measure for the patient’s risk of encountering an adverse event (abstracted as state N in our model) at any future time step starting from time That is, the patient’s risk score at time
can be formulated as
which can be computed using the outputs of TransitionLookUp and ForwardFilter as follows
Therefore, the procedures TransitionLookUp and ForwardFilter suffice for executing both the diagnostic and prognostic inference tasks.
In Section 3.1, we developed an inference algorithm that can handle diagnostic and prognostic tasks for patients in real-time assuming that the true HASMM parameter set Γ is known. In practice, the parameter set Γ is not known, and has to be learned from an offline EHR dataset D that comprises D episodes for previously hospitalized or monitored patients, i.e.
where are the observable variables and their respective sampling times for the
is the episode’s censoring time, and
is a label for the realized absorbing state.
We note that unlike the conventional HMM learning setting (Rabiner (1989); Zhang et al. (2001); Nodelman et al. (2012)), the episodes are not of equal-length as the observations for every episode stop at a random, but informative, censoring time. Thus, the patient’s state trajectory does not manifest only in the observable time series, i.e. also in the episode’s censoring variables
. In this Section, we develop an efficient algorithm, which we call the forward-filtering backward-sampling Monte Carlo EM (FFBSMCEM) algorithm, that computes the Maximum Likelihood (ML) estimate of Γ given an informatively censored dataset
is the likelihood of the dataset D given the parameter set Γ. We start by presenting the learning setup in Section 4.1, and then we present the FFBS-MCEM algorithm Section 4.3.
4.1 The Learning Setup
We focus on the challenging scenario when no domain knowledge or diagnostic assessments for the patients’ latent states are provided in the dataset (with the exception of the absorbing state which is declared by the variable
), i.e. the learning setup is an unsupervised one. For such a scenario, the main challenge in constructing the ML estimator Γ
resides in the hiddenness of the patients’ state trajectories in the training dataset D; the dataset D contains only the sequence of observable variables, their respective observation times, the episodes censoring time and the state in which the trajectory was absorbed. If the patients’ latent state trajectories (
were observed in D, the ML estimation problem Γ
Γ) would have been straightforward; the hiddenness of (
entails the need for marginalizing over the space of all possible latent trajectories conditioned on the observed variables, which is a hard task even for conventional continuous-time HMM models (Liu et al. (2015); Nodelman et al. (2012); Leiva-Murillo et al. (2011); Metzner et al. (2007)).
In order to construct the ML estimator for Γ, we start by writing the complete likelihood, i.e. the likelihood of an HASMM with a parameter set Γ to generate both the hidden states trajectory and the observable variables
episode in the dataset D as follows
where is the number of states that realized in episode d from t = 0 until absorption. The factorization in (19) follows from the conditional independence properties of the HASMM variables (see Figure 6). Since we cannot observe the latent states trajectory
, the ML estimator deals with the expected likelihood Λ(
evaluated by marginalizing the complete likelihood over the latent states trajectories, i.e.
where we have assumed that the episodes in D are independent and identically distributed. The integral in (20) can be further decomposed as follows
Λ(
4.2 Challenges Facing the HASMM Learning Task
The problem of learning the HASMM parameters by maximizing the likelihood function in (21) is obstructed by various obstacles that hinder the deployment of off-the-shelf learning algorithms; we list these challenges hereunder.
1. Finding the ML estimate Γby direct maximization of Λ(D | Γ) is not viable due to the intractability of the integral in (21), i.e. Λ(D | Γ) has no analytic maximizer. The difficulty of evaluating the expected likelihood Λ(D | Γ) follows from the need to average the complete likelihood over a complicated posterior density function for the latent state trajectory.
2. Direct adoption of the conventional Baum-Welch implementation of the EM algorithm as a solution to the intractable problem of maximizing the expected likelihood in (21) –as has been applied in HMMs (Rabiner (1989)), HSMMs (Murphy (2002)), EDHMMs and VDHMMs (Yu (2010)– is not possible for the HASMM setting. This is due to the intractability of the integral involved in the E-step; a problem that is also faced by other continuous-time models (Liu et al. (2015); Nodelman et al. (2012)). However, these models assumed Markovian state trajectories, in which case the implementation of the E-step boils down to computing the expected state durations and transition counts as sufficient statistics for estimating the latent trajectories(e.g. see Equations (12) and (13) in (Liu et al. (2015))). This simplification, which follows from the plausible properties of the Markov chain’s transition rate matrix, does not materialize for semi-Markovian transitions. Further complications are introduced by the duration-dependence of the state-transitions and the segmental nature of the observables.
3. Learning an informatively censored dataset would naturally benefit from the information conveyed in the censoring variables . However, the availability of censor- ing information leads to more complicated posterior density expressions for the latent
state trajectories, which complicates the job of any analytic, variational or Monte Carlo based inference method one would use to infer the latent state trajectories
In the following Section, we present a learning algorithm that addresses the above challenges, and provides insights into general settings in which informatively censored time series data are to be dealt with.
4.3 The Forward-filtering Backward-sampling Monte Carlo EM Algorithm
4.3.1 Expectation-Maximization
As in the case of classical discrete and continuous-time HMMs, we address the first challenge stated in Section 4.2 by using the EM algorithm (Liu et al. (2015); Nodelman et al. (2012)). The iterative EM algorithm starts with an initial guess ˆΓfor the parameter set, and maximizes a proxy for the log-likelihood in the
iteration as follows:
•
•
The E-step computes the ), which entails evaluating the following integral
where is, the proximal expected log-likelihood
) is computed by marginalizing the likelihood of the observed samples of the
over all potential latent paths (
that are censored at time
and absorbed in state
. Figure 9 depicts a set of observables (
) for one episode, and a potential latent path
that could have generated such observables. Computing
) requires av- eraging over the posterior density of the latent paths conditional on an observable episode.
4.3.2 “The Only Good Monte Carlo is a Dead Monte Carlo”
Since computing ) does not admit a closed-form solution, as mentioned earlier in the second challenge stated in Section 4.2, we resort to a Monte Carlo approach for approximating the integral involved in the E-step (Caffo et al. (2005)). That is, in the
of the EM algorithm, we draw G random trajectories (
for every episode d, and use those trajectories to construct a Monte Carlo approximation ˆ
Figure 9: An episode that comprised 8 observable samples, censored at time , and absorbed in state N (catastrophic state). The dashed state trajectory is a trajectory that could have generated the observables with a positive probability. Computing the proximal log-likelihood requires averaging over infinitely many paths that could have generated the observables with a positive probability.
proximal log-likelihood function ). Sample trajectories are drawn from the posterior density of the latent states’ trajectory conditional on the the observable variables (including the censoring information). That is to say, the
sample trajectory
is drawn as follows
for . Hence, the proximal log-likelihood
) can be approximated via a Monte Carlo estimate ˆ
) as follows
It follows from the Glivenko-Cantelli Theorem (Durrett (2010)) that
and hence the Monte Carlo implementation of the E-step becomes more accurate as the sample size G increases. Sampling trajectories from the posterior distribution specified in (22) in order to obtain a Monte Carlo estimate for ) is not a straight forward task; the sampler needs to jointly sample the states and their sojourn times taking into account the duration-dependent transitions among states, and that the number of variables sampled (number of states)
in each trajectory is itself random.
Since there is no straightforward method that can generate samples for the random state trajectory from the joint posterior density in (22), the normative solution for such a problem is to resort to a Markov Chain Monte Carlo (MCMC) method such as Metropolis-Hastings or Gibbs sampling (Carter and Kohn (1994)). Since the number of state and sojourn time variables,
, is itself random, one can even resort to a reversible jump MCMC method (Green and Hastie (2009)) in order to generate the samples for
At this point of our analysis, we invoke the classical aphorism with which we titled this Subsection:
(Trotter and Tukey (1956)). By this quote, Trotter meant to advocate the view that sophisticated Monte Carlo methods should be avoided whenever possible; whenever an integral is analytically tractable, or whenever some analytic insights can be exploited to built simpler samplers, doing so should be preferred to an expensive Monte Carlo method. MCMCs are indeed expensive: they mix very slowly and they generate correlated samples. Adopting an MCMC to generate random state trajectories in every iteration of the EM algorithm and for every episode in D is beyond affordable. Fortunately, in the rest of this Section we show that an efficient sampler that generates independent samples of
and for which the run-time is geometrically distributed can be constructed by capitalizing on the censoring information and utilizing some insights from the literature on sequential Monte Carlo smoothing (Godsill et al. (2012)).
4.3.3 The Forward-filtering Backward-sampling Recipe
The availability of the censoring information (censoring time and absorbing state
) for every episode d in D, together with the inherent non-linearity of the semi-Markovian transition dynamics encourage the development of a forward-filtering backward-sampling (FFBS) Monte Carlo algorithm
that goes in the reverse-time direction of every episode by starting from the censoring instance, and sequentially sampling the latent states conditioned on the (sampled) future trajectory (Godsill et al. (2012)). That is, unlike the generative process (described by the routine GenerateHASMM(Γ)) which uses the knowledge of Γ to generate sample trajectories by drawing an initial state and then sequentially going forward in time and sampling future states until absorption, the inferential process naturally goes the other way around: it exploits informative censoring by starting from the (known) final absorbing state (and censoring time), and sequentially samples a trajectory by traversing backwards in time and conditioning on the future.
We start constructing our forward-filter backward-sampler by first formulating the posterior density of the latent trajectory (from which we sample the G trajectories in the
iteration of the FFBS-MCEM algorithm as shown in (22)) as follows
where the conditioning on the (guess of the parameter set, ˆΓ
, is suppressed for notational convenience. The formulation in (24) decomposes the posterior density of the latent trajectory
into factors in which the likelihood of every state n is conditioned on the future trajectory starting from n (i.e. the states
up to the absorbing states, together with their corresponding sojourn times). The posterior density in (24) can
be further decomposed as follows
which, using the conditional independence properties of the HASMM (see Figure 6), can be simplified as follows
From (26), we can see that for the last state in every episode d, i.e. state , we already know that
, and hence the randomness is only in the last state’s sojourn time
Contrarily, for the first state, we know that conditioned on the sojourn times of the “future states” (
), the sojourn time of state
is equal to
almost surely, and hence the randomness is only in the initial state realization
. Generally, (26) says that a sufficient statistic for the
state and sojourn time is the future trajectory (starting from state n + 1) summarized by: the next state, i.e.
, the observable variables up to state n, and the time elapsed in the episode up to state n, i.e. the duration of state n cannot exceed the difference between the censoring time
and the sojourn time of the future trajectory that stems from state n + 1. This is captured by the last factor in (26), which explicitly specifies the likelihood of a joint realization for a state and its sojourn time conditioned on the future trajectory. Using Bayes’ rule, we can further represent the last factor in (26) in terms of familiar quantities that are directly derived from the HASMM model parameters as follows
Thus, a sampler for the latent states trajectories can be constructed using the forward messages, the HASMM’s transition functions (, and the sojourn time distributions.
A compact representation for the factors in (27) is given by
Given the representations above, we can write the last factor in (26) in the iteration of the EM algorithm as follows
From the factor decomposition in (27), we can see that informative censoring allows us to construct a sampler for the latent state trajectories that operates sequentially in the reverse time direction by sampling from the posterior probability of every state n given the future trajectory of states that starts from state n + 1. From (28), we note that the posterior density of the latent states conditioned on the future trajectory, from which sequential sampling is conducted, can be explicitly decomposed in terms of the HASMM parameters. A complete recipe for the forward-filtering backward-sampling procedure for sampling trajectories from the posterior density ) using the decomposition in (27) and the posterior density in (28) is provided as follows:
For every episode d in D, compute the forward messages for all time instances
using the current estimate for the parameter set ˆΓ
, i.e. invoke the routine
1. Set a dummy = 1 and set
2. Sample a Bernoulli random variable Bernoulli(
3. If
1, sample a bivariate random variable (
) as follows
where and
= 1, then sample
5. If = 0, then increment the placeholder index
and go to step 2 and repeat the consequent steps.
6. If = 1, then set
and terminate the sampling process for episode d. Set the sampled trajectory by swapping the bivariate sequence (
follows: (
The forward-filtering backward-sampling procedure constitutes of a forward pass in which we compute the forward messages for all the data points in D using the dynamic programming algorithms presented in Section 3, and a backward pass in which these forward
Algorithm 6 A sampler for latent state trajectories
messages are used to sample latent state trajectories. The backward sampling procedure for every episode goes as follows. We start from the censoring time at which we know what state has actually materialized, i.e. the absorbing state. Since we do not know the number of states in the state trajectory, we initialize a placeholder index = 1 as an index for the absorbing state, and increment it whenever a new state is sampled. We start the sampling procedure as follows. Given the the censoring variables and the observable time series, we sample the sojourn time of the last state (the absorbing state): this is sampled from a truncated sojourn time distribution, with a truncation threshold at
, and a point mass at
with an assigned measure that is equal to the posterior probability of the absorbing state being the initial state as depicted in Figure 10. This is implemented by first sampling a Bernoulli random variable
with a success probability equal to the posterior probability of the absorbing state being the initial state, and then sampling the truncated sojourn time if
= 0 using the simple rejection sample executed by the routine TRSampler which is provided in Algorithm 4. Having sampled the last state’s sojourn time, we sample the penultimate state and its sojourn time jointly using the routine BARSampler (Algorithm 5) as depicted in Figure 11. The routine BARSampler uses a sampling algorithm, that we
call the bivariate adaptive rejection sampler, which jointly samples the current state and its sojourn time given the next state as follows. First, a state is sampled from a Multinomial distribution with probability masses equal to the forward messages. Next, given the sampled state, we sample a sojourn time from the truncated sojourn time distribution. Finally, given the sampled state and the sampled sojourn time, we sample a dummy state from a Multinomial whose masses are equal to the transition functions, and we accept the sample only if the sampled dummy state is equal to the next state. It can be easily proven that BARSampler generates samples that are equal in distribution to the true state trajectory.
The backward-sampling procedure operates sequentially by invoking the BARSampler to generate new state and sojourn times samples conditional on the previously sampled (future) states. The process terminates whenever a state is sampled as an “initial state”. The routine BackwardSampling (Algorithm 6) implements the overall backward-sampling procedure for every episode in D.
Note that, unlike the slowly mixing MCMC methods, the backward-sampling algorithm can generate the latent state trajectory in an efficient manner, i.e. the run-time of the backward-sampling algorithm is stochastically dominated by a geometrically-distributed random variable with a success probability that, other than in a pathological HASMM parameter settings, would not be close to zero. Moreover, since BackwardSampling generates independent samples, no wasteful burn-in sampling iterations are involved in the FFBS-
Figure 10: Depiction of the backward sampling pass for the last state of an episode d.
Figure 11: Depiction of the backward sampling pass for the penultimate state after having sampled the last state as depicted in the Figure above.
MCEM operation. We provide a pseudocode for the overall operation of the FFBS-MCEM algorithm in Algorithm 7. We omit the standard EM operations, such as the implementation of the M-step, for the sake of brevity.
In Algorithm 7, we avoid the need for running the routine BackwardSampling in every iteration of the EM algorithm by re-using the sampled trajectories based on the initial parameter guess ˆΓthrough the usage of importance weights in the E-step. That is, in the
iteration of the EM algorithm, we implement the E-step as follows (Booth and Hobert (1999))
This implementation for the E-step offers a tremendous advantage in the computational cost of FFBS-MCEM. By using importance weights, we need to compute the forward messages and sample the latent state trajectories only once, and then reuse the sampled trajectories in all the subsequent EM iterations. In the following Theorem, we prove that FFBS-MCEM is as accurate as an EM algorithm with an exact implementation for the E-step when the number of Monte Carlo samples G grows asymptotically large.
Theorem 3 (Convergence properties of FFBS-MCEM) The sequence of parameter set estimates computed by the FFBS-MCEM algorithm converges in probability, i.e. ˆΓ
. Furthermore, if Γ
is a local maximizer of Λ(
, then there exists a neighborhood of Γ
such that for any initial guess ˆΓ
in that neighborhood and for any
we have that
In the next Section, we highlight the merits of our model and the associated algorithms through experiments conducted on a real-world dataset of informatively censored clinical time series data.
We investigate the utility of the HASMM in the setting of ICU prognostication; we use the HASMM as a model for the physiology of critically ill patients in regular hospital wards who are monitored for various vital signs and lab tests. Through the HASMM, we construct a risk score (based on the analysis in Section 3.3) that assesses the risk of clinical deterioration for the monitored patients, which allows for timely ICU admission whenever clinical decompensation is detected. Risk scoring in hospital wards and ICU admission management is a pressing problem with a huge social and clinical impact: qualitative medical studies have suggested that up to 50% of cardiac arrests on general wards could be prevented by earlier transfer to the ICU (Hershey and Fisher (1982)). Since over 200,000 in-hospital cardiac arrests occur in the U.S. each year with a mortality rate of 75% (Merchant et al. (2011)), improved patient monitoring and vigilant care in wards enabled by the HASMM would translate to a large number of lives saved yearly.
5.1 Data
5.1.1 The Patients’ Cohort
Experiments were conducted on a heterogeneous cohort of 6,094 episodes for patients who were hospitalized in Ronald Reagan UCLA medical center during the period between March 3, 2013 to March 29
, 2016. The patients’ population is heterogeneous: we considered admissions to all the floors and units in the medical center, those include the acute care pediatrics unit, cardiac observation unit, cardiothoracic unit, hematology and stem cell transplant unit and the liver transplant service. Patients admitted to those floors (or wards) are post-operative or pre-operative critically ill patients who are vulnerable to adverse clinical outcomes that may require an impending ICU transfer. The cohort comprised patients
0 1000 2000 3000 4000 5000 6000 0
Figure 12: Visualization for the episodes’ censoring information.
with a wide variety of ICD-9 codes and medical conditions, including leukemia, hypertension, septicemia, sepsis, abdomen and pelvis, pneumonia, and renal failure. Table 2 shows the distribution of the most common ICD-9 codes in the patient cohort together with the corresponding medical conditions. The notable heterogeneity of the cohort suggests that the results presented in this Section are generalizable to different cohorts extracted from different hospitals. Every patient in the cohort is associated with a set of 21 (temporal) physiological streams comprising a set of vital signs and lab tests that are listed in Table 2. The physiological measurements are gathered over time during the patient’s stay in the ward, and they manifest -in a subtle fashion- the patient’s clinical state. The physiological measurements are collected over irregularly spaced time intervals (usually ranging from 1 to 4 hours); for each physiological time series, we have access to the times at which each value was gathered.
In all the experiments hereafter, we split the patient cohort into a training set and a testing set. The training set comprises 4,939 patients admitted to the medical center in the period between March 3, 2013 to November 1
, 2015; the testing set comprises 1,155 patients admitted in the period between November 1
, 2015 to March 29
This split of the data allows us to assess the performance under the realistic scenario when a certain algorithm learns from the data available up to a certain date, and then is used to assess the risk for patients admitted in future dates. In Table 2, we show statistics for the patients’ baseline static features (e.g. gender, age, etc) in both the training and testing sets; as we can see, the characteristics of the patients admitted in the period (March 2013 - November 2015) has not significantly changed from those admitted in the period (November 2015 -March 2016). We have verified this fact using a two-sample t-test through which we compared the expected values of the baseline co-variates in both the training and testing sets. This means that the hospital’s management policy with respect to the patients’ acceptance and triaging has not significantly changed across the two time periods, and hence whatever is learned from the training data can be sensibly applied to the testing data.
5.1.2 Informative Censoring
All the patient episodes in the cohort were informatively censored. That is, for every patient in the cohort, we know the following information:
• the length of stay of each patient in the ward is recorded in the dataset, and hence we have access to the HASMM’s censoring time variable
average hospitalization time (or censoring time) in the cohort is 157 hours and 34 minutes (6.5 days). The patient episodes’ censoring times ranged from 4 hours to 2672 hours.
• The absorbing clinical state (l): with the help of experts from the division of pulmonary and critical care medicine at Ronald Reagan UCLA medical center, we set the value of the variable l (absorbing state) for every patient’s episode based on the clinicians’ interventions as reported in the dataset. That is, as advised by our medical collaborators, we assigned the label l = 1 to every patient who was admitted to the ICU and underwent an intervention in the ICU (e.g. ventilator, drug, etc), or was reported to exhibit a cardiac or respiratory arrest (before or after the ICU transfer). According to the medical experts, those patients have experienced “clinical deterioration” as their absorbing state, and would have benefited from an earlier admission to the ICU. We have excluded all patients who underwent a preplanned ICU admission from the dataset since those patients did not actually experience clinical deterioration, but were transferred routinely to the ICU after a surgery. We assigned the label l = 0 to all patients who were discharged home after the clinician’s in charge realized they were clinically stable. Since the readmission rate at the UCLA medical center is quite low, our medical collaborators believe that the labels l = 1 and l = 0 represent an accurate representation for the patients’ true absorbing clinical states upon censoring.
Patient episodes with the absorbing state l = 0 had an average censoring time of 155 hours on average, whereas those with l = 1 had an average censoring time of 204 hours. The percentage of episodes with an absorbing state l = 1 was 4.98% 0.64% in the training period (March 2013 - November 2015), and was 5.19% 1.44% in the testing period (November 2015 - March 2016). A two-sample t-test reveals that the censoring information (distributions of ) has not significantly changed from the training to testing periods, which suggests that the HASMM learned from the training data can be sensibly applied to the testing data. Figure 12 visualizes the informative censoring information over the time period between March 2013 and March 2016. Every patient episode, starting at a certain admission date, is represented by its censoring time (hospitalization time); light colored episodes are ones that were absorbed in the clinical stability state (l = 0), whereas dark colored ones were absorbed in the clinical deterioration state (l = 1).
5.2 Baseline Algorithms
We compare the HASMM, as a model for the patients’ episodes based on which an early warning system is constructed, with other baseline early warning methods. The comparisons involve both state-of-the-art clinical risk scores that are currently used in various healthcare facilities around the world, in addition to benchmark machine learning algorithms. The details of the baselines are provided in the following subsections.
5.2.1 State-of-the-art Clinical Risk Scores
We have conducted comparisons with the most prominent clinical risk scores currently deployed in major healthcare facilities. We list the clinical risk scores involved in our comparisons as follows.
(i) Modified Early Warning System (MEWS): a risk scoring scheme used currently by many healthcare facilities and rapid response teams to quickly assess the severity of illness of a hospitalized patient (Morgan et al. (1997)). The score ranges from 0 to 3 and is based on the following cardinal vital signs: systolic blood pressure, respiratory rate, , temperature, and heart rate.
(ii) Sequential Organ Failure Assessment (SOFA): a risk score (ranging from 1 to 4) that is used to determine the extent of a hospitalized patient’s respiratory, cardiovascular, hepatic, coagulation, renal and neurological organ function in the ICU (Vincent et al. (1996)).
(iii) Acute Physiology and Chronic Health Evaluation (APACHE II): a risk scoring system (an integer score from 0 to 71) for predicting mortality of patients in the ICU (Knaus et al. (1991)). The score is based on 12 physiological measurements, including creatinine, white blood cell count, and glasgow coma scale.
(iv) Rothman Index: a regression-based data-driven risk score that utilizes physiological data to predict mortality, 30-days readmission, and ICU admissions for patients in regular wards (Rothman et al. (2013)). The Rothman index is the state-of-the-art risk score for regular ward patients and is currently used in more than 70 hospitals in the US, including the Houston Methodist hospital in Texas and the Yale-New Haven hospital in Connecticut (Landro (2015)). At the time of conducting these experiments, the Rothman index was also deployed in the Ronald Reagan UCLA medical center.
We implemented the MEWS, SOFA, APACHE II and Rothman scores according to the specifications in (Vincent et al. (1996); Knaus et al. (1991); Rothman et al. (2013)). Note that while the SOFA and APACHE II scores are usually deployed for patients in the ICU, both scores have been recently shown to provide a prognostic utility for predicting clinical deterioration for patients in regular wards (Yu et al. (2014)), and hence we consider both scores in our comparisons.
5.2.2 Machine Learning Algorithms
In order to demonstrate the modeling gain of HASMMs, we make comparisons with other competing machine learning algorithms that adopt different modeling approaches for the clinical time series data. The details of these competing models are provided in the following.
We consider the following set of discriminative predictors that directly predict clinical deterioration without explicitly modeling the clinical time series data:
In order to ensure that the censoring information is utilized by all the discriminative predictors, we train every predictor by constructing a training dataset that comprises the physiological data gathered within a temporal window before the censoring event (ICU admission or patient discharge), and using the censoring information (i.e. the variable l) as the labels. The size of this window is a hyper-parameter that is tuned separately for every predictor. For the testing data, the predictors are applied sequentially to a sliding window of every patient’s episode, and the predictor’s output is considered as the patient’s real-time risk score. We used the built-in MATLAB functions for training the logistic regression, LASSO and random forest predictors.
Although RNNs are not clinically interpretable, they have been frequently applied to the problem of clinical time series prediction, and the recent work in (Che et al. (2016)) have considered RNNs to predict mortality in the ICU using the MIMIC dataset (Saeed et al. (2002)). We have trained an RNN with 5 hidden layers, and 10 neurons with each layer, using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, where gradients are computed using the Backpropagation Through Time algorithm (Werbos (1990)). All the training time series were temporally aligned via the endpoint censoring information, and training was accomplished via 1000 iterations of the BFGS algorithm. A top layer with a squashing sigmoid function was used to map the RNN hidden states to a risk score between 0 and 1 at each point in time.
We used the correlated feature selection algorithm to select the physiological stream for every predictor (Yu and Liu (2003)). To ensure a fair comparison, we did not include the static co-variates in any predictor, including the HASMM, since they are not used by the clinical risk scores.
In addition to the discriminative models, we also considered the following probabilistic models for the patients’ episodes:
We used the Baum-Welch algorithm for learning the HMM (Murphy et al. (2001)); the informative censoring information was incorporated by including two absorbing states for clinical stability (l = 0) and deterioration (l = 1), and informing the forward-backward algorithm with the labeled states at the end of every episode. We tried many initializations for the HMM parameters and picked the initialization that led to the maximum likelihood for the training dataset. The complete data log likelihood after 100 EM iterations was -1.25. In real-time, a patient’s risk score at every point of time is computed by first applying forward filtering to obtain the posterior probability of the patient’s states, and then averaging over the distribution of the absorbing states. Using the Bayesian Information Criterion, we selected an HMM model with 4 latent states.
For the multi-task Gaussian process, we used the free-form parametrization in (Bonilla et al. (2007)), and used the gradient method to learn the parameters of two Gaussian process models: one for patients with l = 0, and one for patients with l = 1. The risk score for a patient’s risk score is computed as the test statistic of a sequential hypothesis test that is based on the two learned Gaussian process models.
5.3 Results
5.3.1 Performance metrics
In order to assess the performance of every algorithm, we compute each algorithm’s risk score R(t) at every point of time in every patient’s episode. We only use the patient episodes in the testing set for performance evaluation. The risk score that is based on an HASMM is evaluated as discussed in Section 3.3. We emulate the ICU admission decisions by setting a threshold on the risk score R(t) above which a patient is identified as “clinically deteriorating”. The accuracy of such decisions are assessed via the following performance metrics: true positive rate (TPR), positive predictive value (PPV) and timeliness. These performance metrics are formally defined as follows:
TPR = # patients with l = 1 and R(t) exceeding threshold for some t < Tc # patients with l = 1 ,
PPV = # patients with l = 1 and R(t) exceeding threshold for some t < Tc # patients with R(t) exceeding threshold for some t < Tc ,
and
Timeliness = E [Time at which R(t) exceeds threshold ) exceeds threshold, l = 1] .
The three performance metrics described above evaluate the different risk scoring algorithms in terms of their detection power, false alarm rate, and timeliness in detecting clinical deterioration. We sweep the threshold value of every risk scoring algorithm and report the AUC of the TPR vs. PPV ROC curve. All results reported hereafter are statistically significant (p-value < 0.001).
5.3.2 Learning the HASMM
We applied the FFBS-MCEM algorithm to the training episodes in order to estimate the parameter set Γ. Based on the Bayesian information criterion, we have selected a model with 4 clinical states, i.e. X = {1, 2, 3, 4}. State 1 is the clinical stability state for which a patient can be discharged, whereas state 4 is the clinical deterioration state at which an ICU admission is necessary. We ran 100 MCEM iterations and used ˆΓas the estimate for Γ.
We discretized the time domain into steps of 1 hour while computing the elements of the look-up table holding the values of the tensor . With a granular 1-hour discretization of the time horizon, the Gaussian covariance matrix was found to be ill-conditioned for many patient episodes. To ensure the numerical stability of the computations involving the Gaussian process likelihood functions, we used the Moore-Penrose pseudo-inverse for the covariance matrix instead of direct matrix inversion. The function TransitionLookUp was invoked once before running the MCEM iterations, and its run time was 2 minutes and 15 seconds on a dual-core 3 GHz machine. The function ForwardFilter was invoked 150,852 times (all data points in all patients’ episodes in the testing set), and its overall run time was 3 hours and 50 minutes (on a dual-core 3 GHz machine). The run time for every risk score update for a single patient is less than 1 second, which implies that the algorithm can efficiently prompt quick risk assessments if implemented on a machine with a reasonable computational power.
From the learned HASMM, we were able to extract the following “medical concept” out of the training data. The patients’ clinical state space X = {1, 2, 3, 4} comprises the following 4 states:
As implied by the model, states 1 and 4 are absorbing states: once the patient is believed to be in state 1, the clinicians should release her from care, whereas exhibiting clinical state 4 should be treated with an admission to the ICU. States 2 and 3 are critical states that require the patient to stay under vigilant care in the ward. The two states are different ways to manifest “criticality”. We characterize the properties of the four clinical states in the rest of this subsection.
Figures 13-16 depict the different characteristics of the four clinical states. In Figure 13, we plot a bipartite correlation graph that shows the correlations among the relevant physiological streams in the different clinical states. These graphs were constructed by computing the using the entries of the multi-task Gaussian process covariance matrix
. An edge is connected between every two features for whom the Pearson correlation coefficient exceeds 0.1, i.e.
see, different physiological variables become less or more correlated in the different clinical states. For instances, only the clinically stable patients experience significant correlations
between their urea Nitrogen and the diastolic blood pressure; the Pearson coefficient between those variables becomes insignificant in the other states. Clinicians can use this piece of information, extracted solely from the data, to construct simple tests for clinical stability by computing the correlations between blood pressure and urea Nitrogen for a hospitalized patient before deciding to discharge her. Generally speaking, we observe that the critical, transient states display more correlations among the physiological streams than the clinical stability and deterioration states. In particular, the type-2 critical state has most of the physiological streams being strongly correlated. We speculate that the reason behind these strong correlations is that some kinds of interventions (e.g. drugs, mechanical pumps, ventilators, etc) applied to hospitalized patients affect all the physiological streams simultaneously; and hence we believe that type-1 and type-2 critical state patients are hospitalized patients with and without clinical interventions. We will examine this claim when we retrieve information about interventions and the time they were applied from the Ronald Reagan medical center; such information was not available at the time of conducting these experiments.
Figure 14 shows the sojourn time distributions for the four states. Recall that the “sojourn time” of an absorbing state (state 1 or 4) is defined as the time between entering the state and the censoring time; such a time interval corresponds to the clinicians’ policy with respect to patient discharge and ICU admission. That is, the sojourn time of an absorbing state is not a natural physiological quantity, but it rather reflects the speed with which patients are released from care or receive leveraged level of care. As we can see, the sojourn time distributions significantly deviate from an exponential distribution of an ordinary, memoryless Markov model, which supports our assumption of semi-Markovianity.
Figure 15 displays the covariance function ) for the 4 clinical states; the state-specific covariance function quantifies the physiological streams’ temporal correlations in a particular clinical conditions. Knowing such correlation patterns are useful for deciding the frequency with which nurses and clinicians should collect physiological measurements over time for different patients in different clinical conditions (Alaa and van der Schaar (2016)). We observed that, as one would expect, the temporal correlations increase when the patient becomes more stable; the temporal correlation is greatest in state 1 and smallest in state 4. This means that one would expect deteriorating patients to experience more physiological fluctuations over time. We also note that physiological stream for which the constant mean function differed significant among the clinical state was the urea Nitrogen. The level of urea Nitrogen increases significantly when the patient is in a more risky state; the average blood urea nitrogen is 11.7 milligrams per deciliter (mg/dL) in state 1, 23.8 mg/dL in state 2, 41.1 mg/dL in state 3 and 64.9 mg/dL in state 4.
Figure 16 depicts the transition functions out of the transient states 2 and 3 as a function of the sojourn time in those states. We note that the transition probabilities are almost a constant function of sojourn time for patients in state 2 (
0), whereas the duration-dependence is more significant (
0); as the sojourn time in state 3 increases, the transition probabilities become more biased towards state 1. This reinforces our hypothesis that state 3 corresponds to patients for whom interventions were applied. That is, as time passes for a patient in state 3 after receiving an intervention, her chances for recovery (transiting to state 1) increases.
0 5 10 15 20 25 0
0 2 4 6 8 10 12 14 16 18 20 22 0
Figure 17: Depiction for the episode of a clinically deteriorating patient.
score over time by focusing on an episode of a particular patient who was hospitalized for 1 day and then admitted to the ICU. As shown in Figure 17 (top), the inference algorithm computes the forward messages whenever new physiological measurements become available. Using the forward messages, the algorithm can display the maximum a posteriori (MAP) state estimates to the clinicians over time. As we can see in Figure 17 (middle), the patient under consideration was in clinical state 2 (type-1 critical state) at the time of admission to the ward. After 6 hours, the patient switched to state 3 (type-2 critical state), probably due to a clinical intervention. After around 9 hours, the patient switched back to the type-1 critical state for a brief 2-hour period, before switching to the type-2 critical state (probably due to a second intervention). Our algorithm was able to detect
Table 3: Performance comparisons for various algorithms.
clinical deterioration (state 4) conclusively (through both the MAP state estimate and the risk score) more than 6 hours before the clinicians actually sent the patient to the ICU. Had the clinicians used the algorithm for monitoring that patient, they would have been able to send the patient to the ICU 6 hours early, allowing for a potentially much more efficient therapeutic intervention in intensive care. In Figure 17 (bottom), we plot the patient’s physiological stream and tag the different time intervals with the corresponding clinical state estimates. The clinicians can rely on these clinically interpretable tags to describe the patient’s states at each point of time rather than using a high-dimensional, and potentially inexpressive set of physiological measurements.
5.3.3 Performance comparisons
In order to handle class imbalance, we focused on the TPR vs. PPV performance rather than the TPR vs. FPR analysis commonly adopted in the medical literature, which overlooks the class imbalance problemWe report the AUC values for all the algorithms under consideration in Table 3. As we can see, all the machine learning algorithms signifi-cantly outperform the state-of-the-art clinical risk scores (Rothman, MEWS, APACHE and SOFA). The reason behind the significant performance gain of the HASMM as compared to the clinical risk scores is that it incorporates the patients’ history when updating the forward messages (as shown in Figure 17), and reasons about the future trajectory when computing the risk score (as discussed in Section 3.3). Clinical risk scores are instantaneous in that they map the current physiological measurements to a risk score without considering the previously measured physiological variables, and hence they are vulnerable to high false alarm rates (low PPV). Moreover, the clinical risk scores do not reason about the future trajectory given the current physiological measurements, and hence they display a sluggish risk signal that fail to quickly cope with subtle clinical deterioration.
We also note that HASMMs outperforms conventional HMMs; this is a consequence of incorporating temporal correlations and semi-Markovian state transitions, which more accurately describe the patient’s physiology. This manifests in the sojourn time distributions in Figure 14, which largely deviate from the exponential distribution adopted by an HMM, and also manifests in the temporal correlation patterns in Figure 15, which largely deviate from the Dirac-delta function. Moreover, not only that the HASMM outperforms discriminative classifiers that operate on a sliding window of the clinical time series, but unlike these classifiers, it provides a clinically interpretable model as well (see Figure 17). Such an interpretable model cannot be provided by discriminative approaches such as logistic regression or RNNs. In addition to the accuracy gains demonstrated in Table 3, we also note that ICU alarms issued by the HASMM precedes actual ICU admission decisions issued by the clinicians with 8-9 hours on average for a TPR of 50% and PPV of 35%.
We developed a versatile model, which we call the Hidden Absorbing Semi-Markov Model (HASMM), for clinical time series data which accurately represents physiological data in modern EHRs. The HASMM can deal with irregularly sampled, temporally correlated, and informatively censored physiological data with non-stationary clinical state transitions. We also proposed an efficient Monte Carlo EM learning algorithms that is based on particle filtering, and developed an inference algorithm that can effectively carry out real-time inferences. We have shown, using a real-world dataset for patients admitted to the Ronald Reagan UCLA Medical Center, that HASMMs provide a significant gain in critical care prognosis when utilized for constructing an early warning and risk scoring system.
We would like to thank Dr. Scott Hu (Division of Pulmonary and Critical Care Medicine, Department of Medicine, David Geffen School of Medicine, UCLA) for providing us with the clinical data and the appropriate medical background and insights used in Section 5. We also thank Mr. Jinsung Yoon for his valuable help with the simulations in Section 5. This research was funded by grants from the Office of Naval Research (ONR) and NSF ECCS 1462245.
We start by rewriting (11) as follows:
Starting with the left hand side, we can use a first-step analysis to write every term ˜as follows
) is the time elapsed in state
is the sojourn time of state i. The integral equation in (30) can be written in a matrix form as in the right hand side of (29), and hence the Theorem follows.
Recall that the operation
can be written as
Thus, for every 1, there exists
a contraction mapping. Therefore, the operation
unique fixed point that can be reached via
successive approximations.
The Theorem can be proven using the proof of Theorem 5 in (Neath et al. (2013)).
Ahmed M. Alaa and Mihaela van der Schaar. Balancing suspense and surprise: Timely deci- sion making with endogenous information acquisition. In Advances in Neural Information Processing Systems, pages 2910–2918, 2016.
Ahmed M Alaa, Jinsung Yoon, Scott Hu, and Mihaela van der Schaar. Personalized risk scoring for critical care prognosis using mixtures of gaussian processes. arXiv preprint arXiv:1610.08853, 2016.
Jeffrey A Bakal, Finlay A McAlister, Wei Liu, and Justin A Ezekowitz. Heart failure re-admission: measuring the ever shortening gap between repeat heart failure hospitalizations. PloS one, 9(9):e106494, 2014.
Jirina Bartkova, Zuzana Hoˇrejˇs´ı, Karen Koed, Alwin Kr¨amer, Frederic Tort, Karsten Zieger, Per Guldberg, Maxwell Sehested, Jahn M Nesland, Claudia Lukas, et al. Dna damage response as a candidate anti-cancer barrier in early human tumorigenesis. Nature, 434 (7035):864–870, 2005.
Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task gaussian process prediction. In Advances in neural information processing systems, pages 153–160, 2007.
James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods with an automated monte carlo em algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1):265–285, 1999.
Brian S Caffo, Wolfgang Jank, and Galin L Jones. Ascent-based monte carlo expectation– maximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):235–251, 2005.
Chris K Carter and Robert Kohn. On gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.
Dustin Charles, Meghan Gabriel, and JaWanna Henry. Electronic capabilities for patient engagement among us non-federal acute care hospitals: 2012-2014. The Office of the National Coordinator for Health Information Technology, 2015.
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Re- current neural networks for multivariate time series with missing values. arXiv preprint arXiv:1606.01865, 2016.
Baojiang Chen and Xiao-Hua Zhou. Non-homogeneous markov process models with infor- mative observations with an application to alzheimer’s disease. Biometrical Journal, 53 (3):444–463, 2011.
Jill M Cholette, Kelly F Henrichs, George M Alfieris, Karen S Powers, Richard Phipps, Sherry L Spinelli, Michael Swartz, Francisco Gensini, L Eugene Daugherty, Emily Nazarian, et al. Washing red blood cells and platelets transfused in cardiac surgery reduces post-operative inflammation and number of transfusions: Results of a prospective, randomized, controlled clinical trial. Pediatric critical care medicine: a journal of the Society of Critical Care Medicine and the World Federation of Pediatric Intensive and Critical Care Societies, 13(3), 2012.
Zelalem Getahun Dessie. Multi-state models of hiv/aids by homogeneous semi-markov process. American Journal of Biostatistics, 4(2):21, 2014.
Michael Dewar, Chris Wiggins, and Frank Wood. Inference in hidden markov models with explicit state duration distributions. IEEE Signal Processing Letters, 19(4):235–238, 2012.
Rick Durrett. Probability: theory and examples. Cambridge university press, 2010.
Allison A Eddy and Eric G Neilson. Chronic kidney disease progression. Journal of the American Society of Nephrology, 17(11):2964–2966, 2006.
Yohann Foucher, Eve Mathieu, Philippe Saint-Pierre, J Durand, and J Daures. A semi- markov model based on generalized weibull distribution with an illustration for hiv disease. Biometrical journal, 47(6):825, 2005.
Yohann Foucher, Magali Giral, Jean-Paul Soulillou, and Jean-Pierre Daures. A semi-markov model for multistate and interval-censored data with multiple terminal events. application in renal transplantation. Statistics in medicine, 26(30):5381–5393, 2007.
Yohann Foucher, M Giral, JP Soulillou, and JP Daures. A flexible semi-markov model for interval-censored data and goodness-of-fit testing. Statistical methods in medical research, 2008.
Emily Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011a.
Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. A sticky hdphmm with application to speaker diarization. The Annals of Applied Statistics, pages 1020–1056, 2011b.
Mitchell H Gail and Phuong L Mai. Comparing breast cancer risk assessment models. Journal of the National Cancer Institute, 102(10):665–668, 2010.
Valentine Genon-Catalot, Thierry Jeantheau, Catherine Lar´edo, et al. Stochastic volatility models as hidden markov models and statistical applications. Bernoulli, 6(6):1051–1079, 2000.
Konstantinos Georgatzis, Christopher KI Williams, and Christopher Hawthorne. Input- output non-linear dynamical systems applied to physiological condition monitoring. Journal of Machine Learning Research, 2016.
Zoubin Ghahramani and Michael I Jordan. Factorial hidden markov models. Machine learning, 29(2-3):245–273, 1997.
Marzyeh Ghassemi, Marco AF Pimentel, Tristan Naumann, Thomas Brennan, David A Clifton, Peter Szolovits, and Mengling Feng. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in icu with sparse, heterogeneous clinical data. In Proceedings of the... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, volume 2015, page 446. NIH Public Access, 2015.
Giacomo Giampieri, Mark Davis, and Martin Crowder. Analysis of default data using hidden markov models. Quantitative Finance, 5(1):27–34, 2005.
Florence Gillaizeau, Etienne Dantan, Magali Giral, and Yohann Foucher. A multistate additive relative survival semi-markov model. Statistical methods in medical research, page 0962280215586456, 2015.
Simon J Godsill, Arnaud Doucet, and Mike West. Monte carlo smoothing for nonlinear time series. Journal of the american statistical association, 2012.
Peter J Green and David I Hastie. Reversible jump mcmc. Genetics, 155(3):1391–1403, 2009.
Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. Hidden topic markov models. In AISTATS, volume 7, pages 163–170, 2007.
Yann Gu´edon. Exploring the state sequence space for hidden markov and semi-markov chains. Computational Statistics & Data Analysis, 51(5):2379–2409, 2007.
Chantal Guihenneuc-Jouyaux, Sylvia Richardson, and Ira M Longini. Modeling markers of disease progression by a hidden markov process: application to characterizing cd4 cell decline. Biometrics, 56(3):733–741, 2000.
Tracy D Gunter and Nicolas P Terry. The emergence of national electronic health record architectures in the united states and australia: models, costs, and questions. Journal of medical Internet research, 7(1):e3, 2005.
Alan G Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, pages 493–503, 1974.
CharlesO Hershey and Linda Fisher. Why outcome of cardiopulmonary resuscitation in general wards is poor. The Lancet, 319(8262):31–34, 1982.
Asger Hobolth and Jens Ledet Jensen. Summary statistics for endpoint-conditioned continuous-time markov chains. Journal of Applied Probability, pages 911–924, 2011.
Helen Hogan, Frances Healey, Graham Neale, Richard Thomson, Charles Vincent, and Nick Black. Preventable deaths due to problems in care in english acute hospitals: a retrospective case record review study. BMJ quality & safety, pages bmjqs–2012, 2012.
William Hoiles and Mihaela van der Schaar. A non-parametric learning method for confi- dently estimating patient’s clinical state and dynamics. In Advances in Neural Information Processing Systems, pages 2020–2028, 2016.
Xuelin Huang and Robert A Wolfe. A frailty model for informative censoring. Biometrics, 58(3):510–520, 2002.
Aparna V Huzurbazar. Multistate models, flowgraph models, and semi-markov processes. 2004.
Christopher H Jackson, Linda D Sharples, Simon G Thompson, Stephen W Duffy, and Elisabeth Couto. Multistate markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician), 52(2):193– 209, 2003.
Jacques Janssen and R De Dominicis. Finite non-homogeneous semi-markov processes: Theoretical and computational aspects. Insurance: Mathematics and Economics, 3(3): 157–165, 1984.
Matthew J Johnson and Alan S Willsky. Bayesian nonparametric hidden semi-markov models. Journal of Machine Learning Research, 14(Feb):673–701, 2013.
Pierre Joly and Daniel Commenges. A penalized likelihood approach for a progressive three- state model with censored and truncated data: Application to aids. Biometrics, 55(3): 887–890, 1999.
Juliane Kause, Gary Smith, David Prytherch, Michael Parr, Arthas Flabouris, Ken Hillman, et al. A comparison of antecedents to cardiac arrests, deaths and emergency intensive care admissions in australia and new zealand, and the united kingdomthe academia study. Resuscitation, 62(3):275–282, 2004.
Lisa L Kirkland, Michael Malinchoc, Megan OByrne, Joanne T Benson, Deanne T Kashi- wagi, M Caroline Burton, Prathibha Varkey, and Timothy I Morgenthaler. A clinical deterioration prediction tool for internal medicine patients. American Journal of Medical Quality, 28(2):135–142, 2013.
William A Knaus, Elizabeth A Draper, Douglas P Wagner, and Jack E Zimmerman. Apache ii: a severity of disease classification system. Critical care medicine, 13(10):818–829, 1985.
William A Knaus, Douglas P Wagner, Elizabeth A Draper, Jack E Zimmerman, Marilyn Bergner, Paulo G Bastos, Carl A Sirio, Donald J Murphy, Ted Lotring, and Anne Damiano. The apache iii prognostic system. risk prediction of hospital mortality for critically ill hospitalized adults. Chest Journal, 100(6):1619–1636, 1991.
Vidyadhar G Kulkarni. Modeling and analysis of stochastic systems. CRC Press, 1996.
Stephan W Lagakos, Charles J Sommer, and Marvin Zelen. Semi-markov models for par- tially censored data. Biometrika, 65(2):311–317, 1978.
David Lando. On cox processes and credit risky securities. Review of Derivatives research, 2(2-3):99–120, 1998.
Laura Landro. Hospitals find new ways to monitor patients 24/7. The Wall Street Journal, 2015.
Jose Leiva-Murillo, AA Rodrguez, and E Baca-Garca. Visualization and prediction of disease interactions with continuous-time hidden markov models. In NIPS 2011 Workshop on Personalized Medicine, 2011.
H Lehman Li-wei, Shamim Nemati, Ryan P Adams, George Moody, Atul Malhotra, and Roger G Mark. Tracking progression of patient state of health in critical care using inferred shared dynamics in physiological time series. In 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 7072–7075. IEEE, 2013.
William A Link. A model for informative censoring. Journal of the American Statistical Association, 84(407):749–752, 1989.
Yu-Ying Liu, Shuang Li, Fuxin Li, Le Song, and James M Rehg. Efficient learning of continuous-time hidden markov models for disease progression. In Advances in neural information processing systems, pages 3600–3608, 2015.
Sergio Matos, Surinder S Birring, Ian D Pavord, and H Evans. Detection of cough signals in continuous audio recordings using hidden markov models. IEEE Transactions on Biomedical Engineering, 53(6):1078–1083, 2006.
Raina M Merchant, Lin Yang, Lance B Becker, Robert A Berg, Vinay Nadkarni, Graham Nichol, Brendan G Carr, Nandita Mitra, Steven M Bradley, Benjamin S Abella, et al. Incidence of treated cardiac arrest in hospitalized patients in the united states. Critical care medicine, 39(11):2401, 2011.
Philipp Metzner, Illia Horenko, and Christof Sch¨utte. Generator estimation of markov jump processes based on incomplete observations nonequidistant in time. Physical Review E, 76(6):066702, 2007.
Rui P Moreno, Philipp GH Metnitz, Eduardo Almeida, Barbara Jordan, Peter Bauer, Ri- cardo Abizanda Campos, Gaetano Iapichino, David Edbrooke, Maurizia Capuzzo, JeanRoger Le Gall, et al. Saps 3from evaluation of the patient to evaluation of the intensive care unit. part 2: Development of a prognostic model for hospital mortality at icu admission. Intensive care medicine, 31(10):1345–1355, 2005.
RJM Morgan, F Williams, and MM Wright. An early warning scoring system for detecting developing critical illness. Clin Intensive Care, 8(2):100, 1997.
DR Mould. Models for disease progression: new approaches and uses. Clinical Pharmacology & Therapeutics, 92(1):125–131, 2012.
Kevin Murphy et al. The bayes net toolbox for matlab. Computing science and statistics, 33(2):1024–1034, 2001.
Kevin P Murphy. Hidden semi-markov models (hsmms). unpublished notes, 2, 2002.
Ronald C Neath et al. On convergence properties of the monte carlo em algorithm. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 43–62. Institute of Mathematical Statistics, 2013.
Uri Nodelman, Christian R Shelton, and Daphne Koller. Expectation maximization and complex duration distributions for continuous time bayesian networks. arXiv preprint arXiv:1207.1402, 2012.
Zdzis�law Opial. Weak convergence of the sequence of successive approximations for nonex- pansive mappings. Bulletin of the American Mathematical Society, 73(4):591–597, 1967.
Mari Ostendorf, Vassilios V Digalakis, and Owen A Kimball. From hmm’s to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on speech and audio processing, 4(5):360–378, 1996.
Soren Erik Pedersen, Suzanne S Hurd, Robert F Lemanske, Allan Becker, Heather J Zar, Peter D Sly, Manuel Soto-Quiroz, Gary Wong, and Eric D Bateman. Global strategy for the diagnosis and management of asthma in children 5 years and younger. Pediatric pulmonology, 46(1):1–17, 2011.
Andrei D Polyanin and Alexander V Manzhirov. Handbook of integral equations. CRC press, 2008.
Zhen Qin and Christian R Shelton. Auxiliary gibbs sampling for inference in piecewise- constant conditional intensity models. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015.
Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.
Michael J Rothman, Steven I Rothman, and Joseph Beals. Development and validation of a continuous measure of patient condition using the electronic medical record. Journal of biomedical informatics, 46(5):837–848, 2013.
Mohammed Saeed, Christine Lieu, Greg Raber, and Roger G Mark. Mimic ii: a massive temporal icu patient database to support research in intelligent patient monitoring. In Computers in Cardiology, 2002, pages 641–644. IEEE, 2002.
Daniel O Scharfstein and James M Robins. Estimation of the failure time distribution in the presence of informative censoring. Biometrika, 89(3):617–634, 2002.
Peter Schulam and Suchi Saria. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems, pages 748–756, 2015.
Padhraic Smyth. Hidden markov models for fault detection in dynamic systems. Pattern recognition, 27(1):149–164, 1994.
Henry T Stelfox, Brenda R Hemmelgarn, Sean M Bagshaw, Song Gao, Christopher J Doig, Cheri Nijssen-Jordan, and Braden Manns. Intensive care unit bed availability and outcomes for hospitalized patients with sudden clinical deterioration. Archives of internal medicine, 172(6):467–474, 2012.
CP Subbe, M Kruger, P Rutherford, and L Gemmel. Validation of a modified early warning score in medical admissions. Qjm, 94(10):521–526, 2001.
MJ Sweeting, VT Farewell, and D De Angelis. Multi-state markov models for disease progression in the presence of informative examination times: An application to hepatitis c. Statistics in medicine, 29(11):1161–1174, 2010.
S Taghipour, D Banjevic, AB Miller, N Montgomery, AKS Jardine, and BJ Harvey. Pa- rameter estimates for invasive breast cancer progression in the canadian national breast screening study. British journal of cancer, 108(3):542–548, 2013.
Hale F Trotter and John W Tukey. Conditional monte carlo for normal samples. In Symposium on Monte Carlo Methods, pages 64–79. Wiley, 1956.
John Varga, Christopher P Denton, and Fredrick M Wigley. Scleroderma: From pathogenesis to comprehensive management. Springer Science & Business Media, 2012.
J-L Vincent, Rui Moreno, Jukka Takala, Sheila Willatts, Arnaldo De Mendon¸ca, Hajo Bruining, CK Reinhart, PeterM Suter, and LG Thijs. The sofa (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intensive care medicine, 22(7):707–710, 1996.
Xiang Wang, David Sontag, and Fei Wang. Unsupervised learning of disease progression models. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 85–94. ACM, 2014.
Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
J Yoon, A Alaa, S Hu, and M van der Schaar. Forecasticu: A prognostic decision support system for timely prediction of intensive care unit admission. pages 1680–1689, 2016.
Lei Yu and Huan Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In ICML, volume 3, pages 856–863, 2003.
Shun Yu, Sharon Leung, Moonseong Heo, Graciela J Soto, Ronak T Shah, Sampath Gunda, and Michelle Ng Gong. Comparison of risk prediction scoring systems for ward patients: a retrospective nested case-control study. Critical Care, 18(3):1, 2014.
Shun-Zheng Yu. Hidden semi-markov models. Artificial Intelligence, 174(2):215–243, 2010.
Yongyue Zhang, Michael Brady, and Stephen Smith. Segmentation of brain mr images through a hidden markov random field model and the expectation-maximization algorithm. IEEE transactions on medical imaging, 20(1):45–57, 2001.