b

DiscoverSearch
About
My stuff
Event Outlier Detection in Continuous Time
2019·arXiv
Abstract
Abstract

Continuous-time event sequences represent discrete events occurring in continuous time. Such sequences arise frequently in real-life. Usually we expect the sequences to follow some regular pattern over time. However, sometimes these patterns may be interrupted by unexpected absence or occurrences of events. Identification of these unexpected cases can be very important as they may point to abnormal situations that need human attention. In this work, we study and develop methods for detecting outliers in continuous-time event sequences, including unexpected absence and unexpected occurrences of events. Since the patterns that event sequences tend to follow may change in different contexts, we develop outlier detection methods based on point processes that can take context information into account. Our methods are based on Bayesian decision theory and hypothesis testing with theoretical guarantees. To test the performance of the methods, we conduct experiments on both synthetic data and real-world clinical data and show the effectiveness of the proposed methods.

Continuous-time event sequences are defined by occurrences of various types of events in time. Event sequences may represent many real-world processes and observations including, e.g., arrival of packets to servers in network systems, or administration of drugs to patients. Continuoustime event sequences are typically modeled as point processes (Daley & Vere-Jones, 2003). While many point-process models have been developed in the literature in recent years, most works build and test models on event prediction tasks. The focus of this work is on outlier detection in event sequences that aims to identify unusual occurrences

1Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA 2Borealis AI, Vancouver, BC, Canada. Correspondence to: Siqi Liu <siqiliu@cs.pitt.edu>, Milos Hauskrecht <milos@pitt.edu>.

Proceedings of the  38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

or absence of events in event sequences in real-time. Outlier detection is the basis of many critical real-world applications, such as fraud detection (Fawcett & Provost, 1997), network intrusion surveillance (Garcia-Teodoro et al., 2009), disease outbreak detection (Wong et al., 2003), and medical error detection (Hauskrecht et al., 2013; 2016).

Two types of outliers may arise in continuous-time event sequences. First, given the history and the recent absence of the events, the event may be overdue. We refer to these as omission outliers. Second, given the history, the event that has just arrived is unexpected: it either arrived too early or was not expected to occur at all given the context from the history. We refer to these as commission outliers. Figure 1 shows some examples of these two types of outliers.

image

Figure 1. Examples of omission (top) and commission (bottom) outliers. Each stem and rectangle / circle represent one event on the horizontal time line. Different colors and shapes represent different types of events. Red dashed boxes mark the outliers. Top: A blue-circle event is expected but has not been observed. Bottom: The second to last blue-circle event happened too early. The last blue-circle event was not expected given the occurrence pattern.

Both types of outliers are often related to problems of practical importance. Take for example, a person suffering from a disease and taking specific medications on a regular schedule to treat the disease. Given the history and the current time, we may infer that the person has not taken the medication yet, and the medication is overdue (omission). The detection of the overdue medication can be then used to generate a reminder alert. For commission, consider a patient who takes a medication too early compared to the normal schedule. The detection of this event or its prevention is extremely important and may prevent adverse situations like high concentration of the drug and its possible toxic effects. Similar situations may happen when one receives a medication that is unrelated to his/her condition (i.e., context). The occurrence of this event may indicate a medical error, and again its timely detection is extremely important.

Despite the importance of these types of outlier detection problems, to the best of our knowledge, neither of them have been studied in the literature. The key contributions of this paper are the formulation of these outlier detection problems and the development of their algorithmic solutions. Specifically, we develop semi-supervised outlier detection methods (Chandola et al., 2009), where data generated by the normal point process are available, based on Bayesian decision theory and hypothesis testing that can leverage a variety of point-process models. We show the effectiveness of the methods on both synthetic and real-word clinical event sequences. Since event occurrences often depend on other related events, we adapt an existing model, the continuous-time LSTM (Mei & Eisner, 2017), to account for these events and their history as context. However, we stress that the proposed outlier detection methods can be combined with any type of point-process model.

Point Processes Point processes are probabilistic models for discrete points in continuous domains (Daley & Vere-Jones, 2003). They have been widely used to model continuous-time event sequences. For a temporal point process, the conditional intensity function (CIF) characterizes the process. Some researchers have proposed to use Gaussian processes to model the CIF (Adams et al., 2009; Rao & Teh, 2011; Lloyd et al., 2015; 2016; Ding et al., 2018; Liu & Hauskrecht, 2019), while the others have developed flexible models based on Hawkes processes (Hawkes, 1971; Zhou et al., 2013; Wang et al., 2016; Xu et al., 2016; Lee et al., 2016; Apostolopoulou et al., 2019), where the CIF depends on the history explicitly. Recently, neural-network-based point-process models (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017; Li et al., 2018; Jia & Benson, 2020) have shown promising results in different settings. Despite the large amount of works on point processes for event sequences, none of them have studied the problem of defining and detecting outliers in continuous time.

Point Processes for Noisy Data Recently, researchers have started developing methods to deal with noisy data, such as incomplete or missing data (Xu et al., 2017; Shel- ton et al., 2018; Mei et al., 2019) and desynchronized data (Trouleau et al., 2019). They assume the data has been corrupted by some source (e.g., censoring or noise), and the goal is to recover the original data or to learn a model nonetheless. Although the omission outliers we consider is related to missing data, our goal is to detect these outliers as accurately as possible, i.e., to distinguish them from normal data, instead of data recovery. Moreover, our method is assumed to be executed in an online manner, so we must decide whether there is an outlier based only on the history. We do not have access to the data in the future as in the missing data setting. Finally, detecting commission outliers is not related to any of these works.

Outlier Detection In general, outlier (or anomaly) detection (Chandola et al., 2009; Aggarwal, 2013) aims to identify data instances that are unusual when compared to other instances in data. When outlier detection is applied on a subset of dimensions given the rest as context, it is referred to as contextual (or conditional) outlier detection (Hauskrecht et al., 2007; Song et al., 2007; Chandola et al., 2009). Our goal in this work is to detect unusual absence and occurrences of events in event sequences that depend on other types of events defining the context. Unlike outlier detection in traditional time series (Gupta et al., 2013), where time is treated as merely the index of the observed values, for continuous-time event sequences, the time of each event is essentially a random variable defined by the generative process, and the goal is to detect unusual behaviors in time given the context.

Semi-Supervised Detection Semi-supervised outlier (or anomaly) detection is a term coined in (Chandola et al., 2009). It means that normal data (without any outliers or other corruptions) are available for training the model to detect outliers on test data later. It is a widely-adopted approach in both outlier detection (Fiore et al., 2013; Akcay et al., 2019) and other related tasks, such as intrusion detection (Xu & Shelton, 2010) and missing data imputation (Mei et al., 2019), although in some cases, the assumption is not clearly stated. Our work also adopts this approach, but we note that event outlier detection in continuous time has not been studied in any of the previous works.

3.1. Problem Formulation

First, we formally define the problem of contextual outlier detection in continuous-time event sequences. An event sequence can be formulated as  Sx = {(ti : ti ∈ T }Nxi=1, i.e.,a sequence of timestamps  tiof the events, where  tiis the time of the i-th event,  Nxis the total number of the events in the sequence, and  T ⊆ Ris the domain of time. We call  Sx thetarget sequence and the events the target events, because they are the targets in which we aim to detect outliers. Meanwhile, we may observe contextual information along with  Sx. We assume the contextual information can be either represented as or converted to discrete events. We denote these events as  SC = {(ti, ui) : ti ∈ T , ui ∈ C}NCi=1,where  tiand  uiare the time and type (mark) of the i-th event,  NCis the total number of the events, and  C ⊆ Zis the finite set of distinct marks for different types of events. We call  SC thecontext sequence and the events the context events.

We stress that  Sxand  SCshare the same time domain T and, therefore, we can combine them into a single sequence SM = {(ti, ui) : (ti, ui) ∈ SC or ti ∈ Sx, ui = x}, where a new type  x /∈ Cis assigned to all the events in  Sx. For detecting outliers, we only rely on information in the past from both  Sxand  SC. We denote the combined history of Sxand  SCup to time t as  HMt = {(ti, ui) : (ti, ui) ∈SM, ti < t}. Meanwhile,  Hxt = {ti : ti ∈ Sx, ti < t}is the history of the target sequence  Sxonly without any contextual information. Now, we are ready to define two types of outlier detection problems we wish to solve.

Commission Outlier Detection Given an observed target event at time  tnand the combined history  HMtn up to time tn,the goal is to assign a label  yc(tn) ∈ {0, 1} to tnindicating whether it is a commission outlier. Notice that  yc(t) is onlydefined if t is the time of a target event. In this work, instead of hard labels, we calculate a commission outlier score sc(tn) for tnto indicate how likely it is to be a commission outlier.

Omission Outlier Detection Define a blank interval  B ⊆T as an interval in which there is no event of type x (target event). Given a blank interval  B = (tb, te)and the combined history  HMtbup to time  tb, the goal is to assign a label  yo(B) ∈ {0, 1}to B indicating whether there are any omission outliers in B. Notice that  yo(B)is only de-fined when B is an interval with no events from  Sx. In thiswork, instead of hard labels, we calculate an omission outlier score  so(B) for Bto indicate how likely it is to contain any omission outliers.

3.2. Probabilistic Models

We develop algorithms for detecting outliers in continuous-time event sequences based on probabilistic models, specifi-cally (temporal) point processes. Point processes are probabilistic models for discrete points in continuous domains. For an event sequence, the points are the events, and the domain is the time T . In this case, the models are also called temporal point processes. A temporal point process can be defined as a counting process  N(·) on T , where N(τ) is thenumber of points in the interval  τ ⊆ T .

For a temporal point process, the conditional intensity function (CIF),  λ(t), characterizes the probability of observing an event in an infinitesimal time interval  [t, t + dt) ⊆ T

given the history up to time t, i.e.  λ(t)dt = p(N([t, t +dt)) = 1|Ht).For our problem, we only model the target events using a point process, while the history  Ht = HMtcontains both the target events and the context events. Because we always condition on  Ht, we omit Htin notations for the rest of the paper.

For a sequence of target events  Sx = {ti : ti ∈ T }Nxi=1 gen-erated from the point process with CIF  λ(t), the probability density is

image

In this work, we focus on semi-supervised outlier detection (Chandola et al., 2009), meaning that we can obtain data generated from the normal point process (without outliers), from which we may train a model for detecting outliers on unseen data later. It has been widely adopted in previous works (see Section 2).

Assumption 3.1. Data generated by the normal point process are available.

We note that by definition, the majority of the data are normal rather than outliers, so obtaining normal data is usually much easier than finding and labeling outliers. Meanwhile, robust learning of the model in presence of outliers is an important but orthogonal direction for future work, since we can combine robust learning with our proposed outlier detection methods to eliminate Assumption 3.1.

In general, the type of model to use should be able to represent the dependencies between the target events and the context events. For a point-process model, it means that the CIF, instead of only depending on the target events, λ(t) = f(Hxt ), should also depend on the context events, λ(t) = f(HMt ), where  f(·)denotes the mapping repre- sented by the model.

In this work, we use a model adapted from the continuous-time LSTM (Mei & Eisner, 2017) due to its good performance in previous studies (Mei & Eisner, 2017; Jia & Ben- son, 2020). See Appendix A for details of the model. However, we stress that if there is a better model, we can simply plug it in to get better performance.

3.3. Detecting Commission Outliers

Without knowing the true process that generates commission outliers, we make the following assumption when developing our method and later generalize it by relaxing the assumption of constant rate.

Assumption 3.2 (Independent and Constant Commission). Commission outliers are generated by a process with a constant CIF independently from the normal point process.

Suppose we are given a target event  tnand the history up to time  tn. Define a random variable  Zn, such that  Zn = 1if tnis a commission outlier, and  Zn = 0otherwise. We are interested in calculating  p(Zn = 1|tn).

By Assumption 3.2 (Independent), the generative processes for the normal points and outliers can be viewed together as a marked point process. For each event  tn, there is a hidden mark  Znindicating whether it is an outlier. The overall CIF is  λg(t) = λ1(t) + λ0(t), where  λ0(t)is the CIF for the normal point process, and  λ1(t)for the outlier process. Suppose we are at time  tc. The density of having the next event at  tnmarginally and jointly with mark  Zn = 1are respectively

image

Then we can derive the conditional distribution of the hidden mark

image

Therefore, the Bayes decision rule is

image

where I [x] = 1 if x is true, and 0 otherwise. However, this rule cannot be directly applied, because  λ1(tn)is unknown. From Assumption 3.2 (Constant),  λ1is a constant, so the decision rule becomes  Z∗n = I [λ0(tn) < θc], where θc is athreshold. This justifies ranking by

image

across all  n = 1, . . . , Nx, so we use  −λ0(tn)as the commission outlier score: the higher the score, the more likely tnis to be a commission outlier.

Nonconstant Rate We generalize the idea by relaxing the assumption of  λ1(t)being a constant. The key is to treat λ1(tn)as a random variable and then marginalize it out as

image

Then we can prove that  p(Zn = 1|tn)is still an increasing function of  sc(tn), which justifies ranking by  sc(see Appendix B for details).

3.4. Detecting Omission Outliers

Similar to detecting commission outliers, we make the following assumption for omission outliers.

Assumption 3.3 (Independent and Constant Omission). Omission outliers are generated by independently removing the normal points with a constant probability.

To derive the method, we first define some notations. Let λ0(t)denote the CIF of the normal point process,  p1the probability of removing each normal point, and B a blank interval without any target points which we wish to decide whether contains omission outliers. For any interval  τ ⊆ T ,let  N(τ)be the number of points observed, and  N0(τ)be the number of points generated by the normal point process with CIF  λ0(t), so  N(·)is the result of combining  N0(·)with random removal, and we can observe  N(·)but not N0(·). Furthermore, we define an auxiliary random variable KBthat counts the number of points removed in B.

For a blank interval B, we observe  N(B) = 0, but KB = kcan take different values k = 0, 1, . . .. The joint probability for each k is

image

where  Fk(B)denotes the probability that k points are generated by the normal point process  N0(·) in B for k = 0, 1, . . ..It depends on  λ0(t). Then we can calculate the posterior probability of  KB = 0

image

Define a random variable  ZBto indicate whether there are any omission outliers in the blank interval  B. ZB = 0is equivalent to  KB = 0; ZB = 1is equivalent to  KB > 0.

image

Then the Bayes decision rule is

image

Without further assumptions,  p(KB = 0|N(B) = 0)(Eq. 9) cannot be evaluated in closed form, but we can get a lower bound

image

image

Then the posterior probability of B containing any omission outliers can also be bounded

image

Therefore, we propose to use

image

as the omission outlier score. When we rank the blank intervals by  so(B), we essentially rank them by an upper bound of  p(ZB = 1|N(B) = 0).

Poisson Process There is a notable special case where we can get a closed-form  p(KB = 0|N(B) = 0). When the normal point process  N0(·)is an inhomogeneous Poisson process, we have

image

for k = 0, 1, . . .. The posterior becomes

image

Therefore, the posterior probability of B containing any omission outliers is

image

This justifies scoring the interval  B by so(B) =�B λ0(s)ds,because if we rank the intervals by their scores, the result will be the same as ranking by their posterior probabilities of containing omission outliers,  p(ZB = 1|N(B) = 0).

Nonconstant Rate Similar to commission outliers, we can generalize the idea by relaxing the assumption that  p1is a constant and treating it as a random variable, in which case  p(ZB = 1|N(B) = 0)is still an increasing function of  so(B)(see Appendix C for details).

Inter-Event Time Without making any assumptions on the normal point process, we can alternatively justify so(B) =�B λ0(s)dswith hypothesis testing, if B is an inter-event time interval, i.e., a time interval between consecutive events  tn−1 and tn. Let Tnbe the random variable for the inter-event time corresponding to B. The null and alternative hypotheses are

image

Assuming the null hypothesis is true, i.e., B is generated by the normal point process with CIF  λ0(t), the probability that the inter-event time is at least as long as |B| is

image

which is the p-value. A lower p-value means that the observation is more extreme, given that the null hypothesis is true, which means it is more likely to contain omission outliers. This justifies scoring by  so(B) = �B λ0(s)ds, where a higher score means that B is more likely to contain omission outliers.

3.5. Bounds on FDR and FPR

In this section, we prove some bounds on the performance of the proposed outlier scoring methods. We recall the definitions of false discovery rate (FDR) and false positive rate (FPR). Let y denote the true label (1=outlier, 0=normal) of an object (a target event or a blank interval) and  ˆy denotethe predicted label. Then FDR and FPR are defined as

image

Given the above definitions, we can prove the following theorems (see Appendix D for proofs).

Theorem 3.1. If we use the commission outlier score sc(tn) = −λ0(tn), where  tnis the time of a target event, with a threshold  θc ≤ 0, such that the decision rule is ˆyc(tn) = I [sc(tn) > θc], and let  λ1denote the CIF of the independent process generating commission outliers, then the FDR is bounded above by −θcλ1−θc .

Theorem 3.2. If we use the omission outlier score so(B) =�B λ0(s)dsfor an inter-event time interval B, with a threshold  θo ≥ 0, such that the decision rule is ˆyo(B) = I [so(B) > θo], then the FPR is bounded above by  exp (−θo).

Theorem 3.3. If we use the omission outlier score  so(B) =�B λ0(s)dsfor a blank interval B, with a threshold  θo ≥ 0,such that the decision rule is  ˆyo(B) = I [so(B) > θo], andassume that the normal point process is an inhomogeneous Poisson process and the probability of omission is  p1, thenthe FDR is bounded above by  exp (−p1θo).

To test the proposed methods, we perform experiments on both synthetic and real-world event sequences. We compare the following methods in the experiments. RND: A baseline that generates outlier scores by sampling from a uniform distribution. LEN: A baseline that detects outliers based on the empirical distribution of the inter-event time lengths (see Section 4.1). PPOD (Point-Process Outlier Detection): Our method based on a point-process model but only using the history of the target events as the context. CPPOD (Contextual PPOD): Our method based on a point-process model using the history of both the target events and the context events as the context.

A model adapted from the continuous-time LSTM (Mei & Eisner, 2017) is used for PPOD and CPPOD (see Appendix A). We choose the number of hidden units in the model from {64, 128, 256, 512, 1024} by maximizing the likelihood on the internal validation set that consists of 20 percent of the training set. We stress that for training and validation we do not use any labeled outlier data. Our implementation of all the methods and the experiments is publicly available.1

4.1. Baseline Method

We briefly describe the baseline method LEN. For training, the lengths of all the inter-event time intervals of the target events,  L = {li : li = ti+1 − ti, (ti, x), (ti+1, x) ∈ Sx} arecollected. Then, an empirical distribution of the inter-event time can be formulated as ˆF(l) = 1|L|�|L|i=1 I [li ≤ l]. Here,for simplicity, we describe the method as if we only had one sequence in the training data, but it is easy to see how it works for multiple sequences, which is the case in our experiments.

For testing, LEN outputs a commission outlier score for a target event at time  tn ∈ Sx as

image

where  tn − tn−1is the inter-event time between the current and previous target events. Intuitively, if the inter-event time is too small ( ˆF(·)is small) or too big (1 − ˆF(·)is small), it is likely that  tnhas occurred at an abnormal time (too early or too late) and therefore is a commission outlier. The negation makes sure that a higher score indicates that it is more likely to be an outlier. For a blank interval B, LEN outputs an omission outlier score as the length of B,

image

Intuitively, the longer the blank interval, the more likely it is to contain omission outliers.

4.2. Experiments on Synthetic Event Sequences

We generate synthetic event sequences using two different types of point processes. One is the inhomogeneous Poisson process. The other is the Gamma process. For each type of processes, there is a set of parameters that determine the distribution of the points. We allow the parameters to vary according to a context state x, which can take two distinct values  x ∈ {0, 1}.

For the inhomogeneous Poisson process, the CIF is a piecewise constant function with the value  λ = f(x), where x is the context state. In the experiments, we set f(0) = 0.1 and f(1) = 1. For the Gamma process, the inter-event time follows a Gamma distribution  Gam (ax, bx) (axshape,  bxrate), where x is the context state. In the experiments, we set  (a0, b0) = (10, 10) and (a1, b1) = (100, 10).

The changes of the context state x are driven by a continuous-time Markov chain (Nodelman et al., 2002) with a transition matrix  Q =�−0.05 0.050.05 −0.05�such that

image

where dt is infinitesimal time. Each change of the state generates a context event.

For each point process type, we simulate 40 sequences. All sequences are simulated in the same time span T = [0, 1000]. We use 50 percent of the sequences for training and the others for testing.

Outlier Simulation Commission outliers are simulated by adding points generated from a separate point process. Omission outliers are simulated by random removals of existing points. There is a parameter  α(t)that defines the rate of outliers relative to normal points and can change over time (see Appendix E for details). To evaluate the performance of the proposed methods under different conditions, we conduct three types of simulations.

Constant rate. In this case, the rate of omission and commission is a constant,  α(t) = α0, relative to the normal points, where  α0 = 0.1. We also changed  α0, but it did not affect the relative performance of each method in almost all cases (see Appendix I).

Periodic rate (denoted as [sin]). In this case, the rate of omission and commission is a periodic function defined by  α(t) = α0(1 + sin(2πt/p))/2relative to the normal points, where  α0controls the overall scale, t is the time, and p is the period. We set  α0 = 0.2 andp = 100.

Piecewise-constant rate (denoted as [pc]). In this case, we randomly generate a piecewise-constant function  g(t) : T → [0, 1]with a step size s = 10 for each sequence. The value of the function at each step is randomly generated from a uniform distribution. Then the rate of omission and commission is defined as

α(t) = α0g(t)relative to the normal points, where α0 = 0.2controls the overall scale.

We note that the first type satisfies the constant rate assumptions we made when developing the methods, but the second and third types do not. By using different simulations, we try to verify the robustness of our methods in different cases when the constant rate assumptions may not be satisfied.

Outlier Detection We apply the methods to detect outliers in an online manner, i.e., a commission (omission) outlier score is generated whenever a new target event is observed. For omission outliers, we also try to detect outliers at additional checkpoints within long blank intervals. See Appendix F for details of the approach and Appendix G for the outlier ratios of the datasets.

Results Table 1 shows the results in Area Under the Receiver Operating Characteristic curve (AUROC). CPPOD achieves the best performance for both commission and omission outliers, showing the effectiveness of our outlier scoring methods. PPOD being worse than CPPOD shows the importance of the context events in these cases. LEN performs much better than RND but worse than our proposed methods, which shows (1) it is an effective method for detecting outliers; (2) although intuitive, it is less effective than our proposed methods. Meanwhile, it does not have rigorous theoretical justifications.

As we can see, changing the outlier simulation mechanism does not affect the relative performance of each method. Even when the rates are not constants, i.e.,  α = [sin]or [pc], CPPOD and PPOD are still performing similarly as when  αis a constant and much better than the baselines. We also checked the performance of our outlier scoring methods combined with the ground-truth model (see Appendix I), and the performance is similar to CPPOD.

Empirical Verification of the Bounds on FDR and FPR To empirically verify the bounds on FDR and FPR presented in Section 3.5, we randomly repeat the experiments using our outlier scoring methods with the ground-truth model (replacing the learned LSTM model with the true model in CPPOD) on the synthetic data 10 times with different test data. Each time we calculate the FDR and FPR for different thresholds on the scores. For verifying FPR, we only test the inter-event time intervals for omission outliers. Their means and standard deviations over all repetitions are shown with the theoretical bounds in Figure 2 for Gamma process. For FPR, the bounds overlap the empirical rates. See Appendix H for more details and the results for Poisson process.

image

Figure 2. FDR (commission outlier) and FPR (omission outlier) on synthetic data (Gamma process).

4.3. Experiments on Real-World Clinical Data

In this part, we use real-world clinical data extracted from the MIMIC III dataset (Johnson et al., 2016). The dataset consists of de-identified electronic health records of ICU patients. We use four types of events corresponding to commonly used medications and lab tests as our targets and form four datasets by collecting the target events and corresponding context events. The target events and their context events are listed in Table 2. The medical category (medication, lab, or vital sign) of each type of events is in brackets following the type. For example, Potassium Chloride is a type of medication, and Potassium (Blood) is a type of lab test. The latter is used as the context for the former, as the administration of the medication can be triggered by observing an abnormally low value in the lab test.

We record in the data every type of events in the table. However, for Potassium (Blood) and Total Calcium (Blood), we further split the context events into three subtypes depending on whether the value in the lab test is low, normal, or high. For INR(PT) (previous state), we create context events of two subtypes based on whether the value of the previous event is normal or abnormal (with a 1-second delay). For Arterial Blood Pressure systolic (ABPs) and Non-invasive Blood Pressure systolic (NBPs), we split the events into two subtypes depending on whether the value is normal or low. These subtypes help to define better the contexts influencing the target events, since depending on their values, the target events can be more/less likely to occur.

All target and context events for one patient admission form one event sequence. We randomly select 2000 sequences for the first three datasets and 500 sequences for the last one, as each sequence in the last one contains much more events than the first three. For each dataset, we use 50 percent of the sequences for training and the others for testing.

Simulation Evaluation We generate and detect commission and omission outliers on top of the existing data using the same procedures for synthetic data (Section 4.2) except that we set  p = 24 × 7 and s = 12. This allows us to obtain

Table 1. AUROC on synthetic data. Dataset: name abbreviation (C=commission, O=omission) [α].

image

Table 2. Names of target and context events from MIMIC. INR=international normalized ratio; PT=prothrombin time.

image

ground-truth labels for analyses. Table 3 shows the AUROC results. The results have more variations across different datasets in this case, which can be seen by examining the performance of LEN. Omission outliers appear to be more challenging than commission outliers except for INR(PT). CPPOD and PPOD outperform RND and LEN on all the datasets for both types of outliers.

In all cases, CPPOD is the best or very close to it. In the latter cases, the best is always PPOD, and the differences are very small. These are the cases where the additional context events are not as influential as the history of the target events themselves for the occurrences of the target events, so PPOD is as good as but simpler than CPPOD. However, for Potassium Chloride and Calcium Gluconate, we can see a clear advantage of CPPOD over PPOD by using additional context events.

Manual Evaluation of Clinical Relevance Finally, we apply CPPOD to the original data (without any simulated outliers). For each dataset, we select three commission outliers and three omission outliers with the highest outlier scores. We manually examine the electronic health record

image

of the patient for each outlier to determine whether the event commissions and omissions identified by CPPOD are clinically relevant and assign the outliers to three classes: N (outlier is not relevant), Y (outlier is relevant), and P (outlier is probably relevant given the available data but cannot be determined due to insufficient data recorded in MIMIC).

Specifically, we retrieve medical notes and related data for each case across time. We examine the notes and data to find evidence to support the decisions made by the physicians (giving a medication / lab test or not) that are detected as outliers. If there is evidence supporting the decision (e.g., given the condition of the patient, the medication should be given, and it was given), then the outlier is not clinically relevant (N). If the evidence is against the decision, it is clinically relevant (Y). If we do not have enough data, but the data we have do not support the decision, it is probably relevant (P).

Table 4 shows the results of the manual evaluation. Overall, CPPOD is performing well, as many detected outliers are clinically relevant. For both false outliers in Cal (C), there was an apparent change in the schedule of the medication (previous pattern being interrupted), so for the model, they do appear to be outliers. For Pot (C), one patient had rare conditions, for which we do not have the corresponding context events. The other had a drop in the value of the lab test (context), but it did not reach the abnormally low level, so the model does not have the exact context information that supported the human decision of giving the medication.

We note that for this evaluation, we do not have any guarantees on any of the assumptions we made when developing the methods in theory. Nonetheless, CPPOD is still effective. Moreover, as we have discovered, the original data may already contain outliers, so even in the previous evaluation, the models are unlikely to be trained on clean data, but Table 3 still shows the benefits of PPOD and CPPOD. This demonstrates that the assumptions do not have to be rigorously satisfied in practice for the methods to work reasonably well, although we may lose the theoretical guarantees on the performance, which depend on the assumptions.

Table 3. AUROC on MIMIC data. Dataset: name abbreviation (C=commission, O=omission) [α].

image

Table 4. Manual evaluation on MIMIC data.

image

In this work, we have studied two new outlier detection problems: detection of commission and omission outliers in continuous-time event sequences. We proposed outlier scoring methods based on Bayesian decision theory and hypothesis testing with theoretical guarantees. The proposed methods depend on a point-process model built from the data. In this work, we experimented with a model considering context, adapted from the continuous-time LSTM. We conducted experiments on both synthetic and real-world event sequences. The results show the flexibility of the adapted model and, more importantly, the effectiveness of the proposed outlier scoring methods.

As verified by the experiments, the assumptions we made when developing the methods in theory do not have to be sat-isfied for them to work reasonably well in practice, although when they are satisfied, we get the theoretical justifications and guarantees on the performance. For future work, there are two interesting directions. One is to develop algorithms to learn the models robustly in presence of outliers in the

image

training data. The other is to allow the model to adapt in a non-stationary environment. Both are orthogonal to this work, as they can be combined with the outlier scoring methods developed in this work to improve the performance by providing a better model.

This work was supported by NIH grant R01-GM088224. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Siqi Liu also acknowledges the support by CS50 Merit Pre-doctoral Fellowship from the Department of Computer Science, University of Pittsburgh. Finally, the authors would like to thank all the anonymous reviewers who provided helpful comments on the earlier versions of the paper.

Adams, R. P., Murray, I., and MacKay, D. J. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 9–16. ACM, 2009.

Aggarwal, C. C. Outlier Analysis. Springer New York, 2013.

Akcay, S., Atapour-Abarghouei, A., and Breckon, T. P. GANomaly: Semi-supervised anomaly detection via ad-

versarial training. In Jawahar, C. V., Li, H., Mori, G., and Schindler, K. (eds.), Computer Vision – ACCV 2018, Lecture Notes in Computer Science, pp. 622–637, Cham, 2019. Springer International Publishing.

Apostolopoulou, I., Linderman, S., Miller, K., and Dubrawski, A. Mutually regressive point processes. In Advances in Neural Information Processing Systems, pp. 5116–5127, 2019.

Chandola, V., Banerjee, A., and Kumar, V. Anomaly detec- tion: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.

Daley, D. J. and Vere-Jones, D. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer, New York, 2003.

Ding, H., Khan, M., Sato, I., and Sugiyama, M. Bayesian nonparametric Poisson-process allocation for time-sequence modeling. In International Conference on Artificial Intelligence and Statistics, pp. 1108–1116, 2018.

Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez- Rodriguez, M., and Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564. ACM, 2016.

Fawcett, T. and Provost, F. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997.

Fiore, U., Palmieri, F., Castiglione, A., and De Santis, A. Network anomaly detection with the restricted Boltzmann machine. Neurocomputing, 122:13–23, December 2013.

Garcia-Teodoro, P., Diaz-Verdejo, J., Maci´a-Fern´andez, G., and V´azquez, E. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1):18–28, 2009.

Gupta, M., Gao, J., Aggarwal, C. C., and Han, J. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering, 26(9):2250–2267, 2013.

Hauskrecht, M., Valko, M., Kveton, B., Visweswaram, S., and Cooper, G. Evidence-based anomaly detection. In Annual American Medical Informatics Association Symposium, pp. 319–324, November 2007.

Hauskrecht, M., Batal, I., Valko, M., Visweswaran, S., Cooper, G. F., and Clermont, G. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47–55, February 2013.

Hauskrecht, M., Batal, I., Hong, C., Nguyen, Q., Cooper, G. F., Visweswaran, S., and Clermont, G. Outlier-based detection of unusual patient-management actions: an ICU study. Journal of Biomedical Informatics, 64:211–221, 2016.

Hawkes, A. G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

Jia, J. and Benson, A. R. Neural jump stochastic differential equations. arXiv:1905.10403 [cs, stat], January 2020.

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.- w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3: 160035, May 2016.

Lee, Y., Lim, K. W., and Ong, C. S. Hawkes processes with stochastic excitations. In International Conference on Machine Learning, pp. 79–88, 2016.

Li, S., Xiao, S., Zhu, S., Du, N., Xie, Y., and Song, L. Learn- ing temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems, pp. 10781–10791, 2018.

Liu, S. and Hauskrecht, M. Nonparametric regressive point processes based on conditional Gaussian processes. In Advances in Neural Information Processing Systems, pp. 1062–1072, 2019.

Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. Vari- ational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pp. 1814–1822, 2015.

Lloyd, C., Gunter, T., Osborne, M., Roberts, S., and Nick- son, T. Latent point process allocation. In Artificial Intelligence and Statistics, pp. 389–397, May 2016.

Mei, H. and Eisner, J. M. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pp. 6757–6767, 2017.

Mei, H., Qin, G., and Eisner, J. Imputing missing events in continuous-time event streams. In International Conference on Machine Learning, pp. 4475–4485. PMLR, May 2019.

Nodelman, U., Shelton, C. R., and Koller, D. Continuous time Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI’02, pp. 378–387, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.

Rao, V. and Teh, Y. W. Gaussian process modulated renewal processes. In Advances in Neural Information Processing Systems, pp. 2474–2482, 2011.

Shelton, C. R., Qin, Z., and Shetty, C. Hawkes process inference with missing data. In Thirty-Second AAAI Conference on Artificial Intelligence, April 2018.

Song, X., Wu, M., Jermaine, C., and Ranka, S. Conditional anomaly detection. Knowledge and Data Engineering, IEEE Transactions on, 19(5):631–645, 2007.

Trouleau, W., Etesami, J., Grossglauser, M., Kiyavash, N., and Thiran, P. Learning Hawkes processes under synchronization noise. In International Conference on Machine Learning, pp. 6325–6334, May 2019.

Wang, Y., Xie, B., Du, N., and Song, L. Isotonic Hawkes processes. In International Conference on Machine Learning, pp. 2226–2234, 2016.

Wong, W. K., Moore, A., Cooper, G., and Wagner, M. Bayesian network anomaly pattern detection for disease outbreaks. In International Conference on Machine Learning, pp. 808–815, August 2003.

Xiao, S., Farajtabar, M., Ye, X., Yan, J., Song, L., and Zha, H. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems, pp. 3247–3257, 2017.

Xu, H., Farajtabar, M., and Zha, H. Learning Granger causal- ity for Hawkes processes. In International Conference on Machine Learning, pp. 1717–1726, 2016.

Xu, H., Luo, D., and Zha, H. Learning Hawkes processes from short doubly-censored event sequences. In International Conference on Machine Learning, pp. 3831–3840, July 2017.

Xu, J. and Shelton, C. R. Intrusion detection using con- tinuous time Bayesian networks. Journal of Artificial Intelligence Research, 39:745–774, December 2010.

Zhou, K., Zha, H., and Song, L. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pp. 1301–1309, 2013.

The input to the continuous-time LSTM consists of the marked events in the combined sequence,  (ti, ui) ∈ SM. That is, we not only use the target events but also the context events as input, although we only model the CIF of the target events,  λ(t). The output consists of the hidden states  h(ti)corresponding to the input. It is a nonlinear mapping from the content in the memory cell  c(ti)of the LSTM at time ti. As in a traditional LSTM, each continuous-time LSTM unit also has an input gate i, an output gate o, and a forget gate f. The relations between the memory cells, the hidden states, the input, and these gates are summarized as follows.

Let  uibe a vector representation of the mark  ui, which is a learnable embedding. For  t ∈ (ti−1, ti], c(t)is a continuous function changing over time from  ci to ¯ci, and for ci and ¯cithere are separate input gates and forget gates:

image

ci+1 = fi+1 ⊙ c(ti) + ii+1 ⊙ zi+1 (A.6)¯ci+1 = ¯fi+1 ⊙ ¯ci + ¯ii+1 ⊙ zi+1 (A.7)

image

where [a; b] denotes the concatenation of the vectors a and  b, ⊙is the elementwise product,  σ(·)is the logistic function, and g(x, s) = s log(1 + exp (x/s)) is the scaled softplus function with parameter s. All the W , U and d with/without different subscripts and bars are learnable parameters of the continuous-time LSTM.

Finally, to convert the output of the continuous-time LSTM to the CIF of the target events,  λ(t), we have  λ(t) =g(wTλ h(t), s)where  wλand s are learnable parameters. The model is learned by maximizing the likelihood (Eq. 1) for all sequences in the training data. Monte-Carlo integration is used to evaluate�λ(s)ds.

We generalize the method we have developed in Section 3.3 to cases when the rate of commission is not a constant. Treating  λ1(tn)as a random variable, based on what we

have already developed, we have

image

To avoid cluttering, we omit  tn in λ0, λ1, and sc from now.By marginalizing out  λ1, we get

image

where we defined

image

We assume

image

which is easy to satisfy, since it is sufficient that either λ1 ≥ ϵfor some  ϵ > 0or the distribution of  λ1is one of the common distributions including any finite discrete distribution, Gamma distribution, etc.

It is not hard to see that f is an increasing function of  sc forany given  λ1, as sc ∈ (−∞, 0), λ1 ∈ (0, ∞), and

image

Therefore, we can show that

image

using the Dominated Convergence Theorem. This implies that  p(Zn = 1|tn)is an increasing function of  sc.

Similar to commission outliers, we generalize the method we have developed in Section 3.4 to cases when the probability of omission is not a constant and the normal point process is an inhomogeneous Poisson process. Treating p1as a random variable, based on what we have already developed, we have

image

To avoid cluttering, we omit  B in sofrom now. By marginalizing out  p1, we get

image

where we defined

image

Apparently g is an decreasing function of  scfor any given

image

Therefore, we can show that

image

using the Dominated Convergence Theorem. This implies that  p(ZB = 1|N(B) = 0)is an increasing function of  so.

D.1. Theorem 3.1

Proof. From Eq. 4 and implicitly conditioned on the event tnand the history

image

Given that  ˆyc(tn) = 1, i.e., −λ0(tn) > θc, we get

image

D.2. Theorem 3.2

Proof. Let  Tnbe the random variable for the inter-event time corresponding to the observed inter-event interval B, assuming it is generated from the normal point process. From Eq. 18

image

The last equality is because  p(Tn > |B|) = 1 − p(Tn ≤|B|), and  p(Tn ≤ |B|)is the cumulative distribution function of  Tn, implying it follows a uniform distribution.

D.3. Theorem 3.3

Proof. From Eq. 17 and implicitly conditioned on N(B) = 0 and the history

image

Given that  ˆyo(B) = 1, i.e.,�B λ0(s)ds > θo, we get

image

To define outliers, we simulate commission and omission outliers on top of the existing data. In this way, we can obtain ground-truth labels for testing.

To define commission outliers, we simulate a new sequence of target events independently from the existing data, and then merge the new events with the existing events. We use an (inhomogeneous) Poisson process with an intensity λc(t)to generate the outliers.  λccontrols the rate of such outliers. In the experiments, for each dataset, we set  λc(t) =α(t)ˆλtest, where α(t)is either a constant or a function over time depending on the settings, and ˆλtestis the empirical rate of the target events calculated from the original test data.

To define omission outliers, we randomly remove target events in the original sequences according to independent Bernoulli trials. That is, each event is removed with probability  p1and kept with probability  1 − p1. We always keep the event if it marks the start time of the sequence. In the experiments, we set  p1 = α(t), where, similar to commission, α(t)is either a constant or a function over time depending on the settings.

We detect the presence of commission and omission outliers differently. To test for commission outliers, each method outputs an outlier score at the time of each target event. That is, whenever there is a new target event, we ask the question: is this event a commission outlier or not?

Testing for omission outliers is trickier, because we need to decide the checkpoints more carefully, i.e., when to ask for outlier scores. The simplest thing to do is to only check at the target event times. That is, whenever there is a new target event, we ask the question: is there any omission outlier since the previous target event till now?

However, this may become unsatisfactory in real-world applications, because there could be cases when the target events just stop occurring for a long period of time or even forever (potentially due to malfunctions of the underlying system). These are interesting and important cases we are supposed to detect, but the above testing method will not work. Therefore, we use a combined approach. We still have a checkpoint at each target event time, but on top of that, we also randomly generate checkpoints in long blank intervals.

Specifically, we have a parameter  w set to 2/ˆλtrain, whereˆλtrainis the empirical rate of the target events estimated from the training data for each dataset, so within w, on average, we should see two events normally. Then, whenever the blank interval from the previous checkpoint till now is longer than w, we generate a new checkpoint within the interval by uniform sampling, and set the previous checkpoint to the generated checkpoint. We keep generating checkpoints until we reach the next target event or the end of the sequence.

The outlier ratio, i.e., the number of outliers divided by the total number of test points, for each dataset is summarized in Table G.1.

We show the results of empirically verifying the bounds proved in Section 3.5, continuing the results in Section 4.2. We use GT (Ground Truth): our outlier scoring methods combined with the ground-truth point-process model, which is only available on synthetic data. Figure H.1 shows the empirical FDR (commission outlier), FDR (omission outlier), and FPR (omission outlier) with means and standard deviations on data simulated from inhomogeneous Poisson

Table G.1. Outlier ratios of the datasets. Dataset: name abbreviation (C=commission, O=omission) [α].

image

processes along with the theoretical bounds. As we can see, the empirical FDRs have high variance when the threshold is high, because there are smaller number of samples above a higher threshold. Nonetheless, the empirical FDRs conform with the theoretical bounds, and so does the empirical FPR.

Using Ground-Truth Model We also compared with GT (Ground Truth): our outlier scoring methods combined with the ground-truth point-process model (only available on synthetic data). Figure I.1 and I.2 show the receiver operating characteristic (ROC) curves of the outlier detection methods on the synthetic data generated from inhomogeneous Poisson processes and Gamma processes with  α = 0.1. Wenote that the curves of GT and CPPOD are almost identical. The fact that CPPOD almost has the same performance as GT is an evidence that the model based on the continuous-time LSTM is flexible enough to represent these context-dependent point processes.

Varying Outlier Rate We also experimented with changing  α0 = 0.05for the constant rate to see its effect. Table I.1

image

Figure H.1. From left to right: FDR (commission outlier), FDR (omission outlier), and FPR (omission outlier) on synthetic data (Poisson process).

image

Figure I.1. ROC curves on synthetic data (Poisson process). Left: commission. Right: omission.

and Table I.2 show the full AUROC results for synthetic and MIMIC data respectively. As we can see, the relative performance for each method does not change in almost all cases.

image

Table I.1. AUROC on synthetic data. Dataset: name abbreviation (C=commission, O=omission) [α].

image

image


Designed for Accessibility and to further Open Science