Continuous-time event sequences are defined by occurrences of various types of events in time. Event sequences may represent many real-world processes and observations including, e.g., arrival of packets to servers in network systems, or administration of drugs to patients. Continuoustime event sequences are typically modeled as point processes (Daley & Vere-Jones, 2003). While many point-process models have been developed in the literature in recent years, most works build and test models on event prediction tasks. The focus of this work is on outlier detection in event sequences that aims to identify unusual occurrences
1Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA 2Borealis AI, Vancouver, BC, Canada. Correspondence to: Siqi Liu <siqiliu@cs.pitt.edu>, Milos Hauskrecht <milos@pitt.edu>.
Proceedings of the International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
or absence of events in event sequences in real-time. Outlier detection is the basis of many critical real-world applications, such as fraud detection (Fawcett & Provost, 1997), network intrusion surveillance (Garcia-Teodoro et al., 2009), disease outbreak detection (Wong et al., 2003), and medical error detection (Hauskrecht et al., 2013; 2016).
Two types of outliers may arise in continuous-time event sequences. First, given the history and the recent absence of the events, the event may be overdue. We refer to these as omission outliers. Second, given the history, the event that has just arrived is unexpected: it either arrived too early or was not expected to occur at all given the context from the history. We refer to these as commission outliers. Figure 1 shows some examples of these two types of outliers.
Figure 1. Examples of omission (top) and commission (bottom) outliers. Each stem and rectangle / circle represent one event on the horizontal time line. Different colors and shapes represent different types of events. Red dashed boxes mark the outliers. Top: A blue-circle event is expected but has not been observed. Bottom: The second to last blue-circle event happened too early. The last blue-circle event was not expected given the occurrence pattern.
Both types of outliers are often related to problems of practical importance. Take for example, a person suffering from a disease and taking specific medications on a regular schedule to treat the disease. Given the history and the current time, we may infer that the person has not taken the medication yet, and the medication is overdue (omission). The detection of the overdue medication can be then used to generate a reminder alert. For commission, consider a patient who takes a medication too early compared to the normal schedule. The detection of this event or its prevention is extremely important and may prevent adverse situations like high concentration of the drug and its possible toxic effects. Similar situations may happen when one receives a medication that is unrelated to his/her condition (i.e., context). The occurrence of this event may indicate a medical error, and again its timely detection is extremely important.
Despite the importance of these types of outlier detection problems, to the best of our knowledge, neither of them have been studied in the literature. The key contributions of this paper are the formulation of these outlier detection problems and the development of their algorithmic solutions. Specifically, we develop semi-supervised outlier detection methods (Chandola et al., 2009), where data generated by the normal point process are available, based on Bayesian decision theory and hypothesis testing that can leverage a variety of point-process models. We show the effectiveness of the methods on both synthetic and real-word clinical event sequences. Since event occurrences often depend on other related events, we adapt an existing model, the continuous-time LSTM (Mei & Eisner, 2017), to account for these events and their history as context. However, we stress that the proposed outlier detection methods can be combined with any type of point-process model.
Point Processes Point processes are probabilistic models for discrete points in continuous domains (Daley & Vere-Jones, 2003). They have been widely used to model continuous-time event sequences. For a temporal point process, the conditional intensity function (CIF) characterizes the process. Some researchers have proposed to use Gaussian processes to model the CIF (Adams et al., 2009; Rao & Teh, 2011; Lloyd et al., 2015; 2016; Ding et al., 2018; Liu & Hauskrecht, 2019), while the others have developed flexible models based on Hawkes processes (Hawkes, 1971; Zhou et al., 2013; Wang et al., 2016; Xu et al., 2016; Lee et al., 2016; Apostolopoulou et al., 2019), where the CIF depends on the history explicitly. Recently, neural-network-based point-process models (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017; Li et al., 2018; Jia & Benson, 2020) have shown promising results in different settings. Despite the large amount of works on point processes for event sequences, none of them have studied the problem of defining and detecting outliers in continuous time.
Point Processes for Noisy Data Recently, researchers have started developing methods to deal with noisy data, such as incomplete or missing data (Xu et al., 2017; Shel- ton et al., 2018; Mei et al., 2019) and desynchronized data (Trouleau et al., 2019). They assume the data has been corrupted by some source (e.g., censoring or noise), and the goal is to recover the original data or to learn a model nonetheless. Although the omission outliers we consider is related to missing data, our goal is to detect these outliers as accurately as possible, i.e., to distinguish them from normal data, instead of data recovery. Moreover, our method is assumed to be executed in an online manner, so we must decide whether there is an outlier based only on the history. We do not have access to the data in the future as in the missing data setting. Finally, detecting commission outliers is not related to any of these works.
Outlier Detection In general, outlier (or anomaly) detection (Chandola et al., 2009; Aggarwal, 2013) aims to identify data instances that are unusual when compared to other instances in data. When outlier detection is applied on a subset of dimensions given the rest as context, it is referred to as contextual (or conditional) outlier detection (Hauskrecht et al., 2007; Song et al., 2007; Chandola et al., 2009). Our goal in this work is to detect unusual absence and occurrences of events in event sequences that depend on other types of events defining the context. Unlike outlier detection in traditional time series (Gupta et al., 2013), where time is treated as merely the index of the observed values, for continuous-time event sequences, the time of each event is essentially a random variable defined by the generative process, and the goal is to detect unusual behaviors in time given the context.
Semi-Supervised Detection Semi-supervised outlier (or anomaly) detection is a term coined in (Chandola et al., 2009). It means that normal data (without any outliers or other corruptions) are available for training the model to detect outliers on test data later. It is a widely-adopted approach in both outlier detection (Fiore et al., 2013; Akcay et al., 2019) and other related tasks, such as intrusion detection (Xu & Shelton, 2010) and missing data imputation (Mei et al., 2019), although in some cases, the assumption is not clearly stated. Our work also adopts this approach, but we note that event outlier detection in continuous time has not been studied in any of the previous works.
3.1. Problem Formulation
First, we formally define the problem of contextual outlier detection in continuous-time event sequences. An event sequence can be formulated as a sequence of timestamps
of the events, where
is the time of the i-th event,
is the total number of the events in the sequence, and
is the domain of time. We call
target sequence and the events the target events, because they are the targets in which we aim to detect outliers. Meanwhile, we may observe contextual information along with
. We assume the contextual information can be either represented as or converted to discrete events. We denote these events as
where
and
are the time and type (mark) of the i-th event,
is the total number of the events, and
is the finite set of distinct marks for different types of events. We call
context sequence and the events the context events.
We stress that and
share the same time domain T and, therefore, we can combine them into a single sequence
, where a new type
is assigned to all the events in
. For detecting outliers, we only rely on information in the past from both
and
. We denote the combined history of
and
up to time t as
. Meanwhile,
is the history of the target sequence
only without any contextual information. Now, we are ready to define two types of outlier detection problems we wish to solve.
Commission Outlier Detection Given an observed target event at time and the combined history
the goal is to assign a label
indicating whether it is a commission outlier. Notice that
defined if t is the time of a target event. In this work, instead of hard labels, we calculate a commission outlier score
to indicate how likely it is to be a commission outlier.
Omission Outlier Detection Define a blank interval T as an interval in which there is no event of type x (target event). Given a blank interval
and the combined history
up to time
, the goal is to assign a label
to B indicating whether there are any omission outliers in B. Notice that
is only de-fined when B is an interval with no events from
work, instead of hard labels, we calculate an omission outlier score
to indicate how likely it is to contain any omission outliers.
3.2. Probabilistic Models
We develop algorithms for detecting outliers in continuous-time event sequences based on probabilistic models, specifi-cally (temporal) point processes. Point processes are probabilistic models for discrete points in continuous domains. For an event sequence, the points are the events, and the domain is the time T . In this case, the models are also called temporal point processes. A temporal point process can be defined as a counting process number of points in the interval
For a temporal point process, the conditional intensity function (CIF), , characterizes the probability of observing an event in an infinitesimal time interval
given the history up to time t, i.e. For our problem, we only model the target events using a point process, while the history
contains both the target events and the context events. Because we always condition on
in notations for the rest of the paper.
For a sequence of target events erated from the point process with CIF
, the probability density is
In this work, we focus on semi-supervised outlier detection (Chandola et al., 2009), meaning that we can obtain data generated from the normal point process (without outliers), from which we may train a model for detecting outliers on unseen data later. It has been widely adopted in previous works (see Section 2).
Assumption 3.1. Data generated by the normal point process are available.
We note that by definition, the majority of the data are normal rather than outliers, so obtaining normal data is usually much easier than finding and labeling outliers. Meanwhile, robust learning of the model in presence of outliers is an important but orthogonal direction for future work, since we can combine robust learning with our proposed outlier detection methods to eliminate Assumption 3.1.
In general, the type of model to use should be able to represent the dependencies between the target events and the context events. For a point-process model, it means that the CIF, instead of only depending on the target events, , should also depend on the context events,
, where
denotes the mapping repre- sented by the model.
In this work, we use a model adapted from the continuous-time LSTM (Mei & Eisner, 2017) due to its good performance in previous studies (Mei & Eisner, 2017; Jia & Ben- son, 2020). See Appendix A for details of the model. However, we stress that if there is a better model, we can simply plug it in to get better performance.
3.3. Detecting Commission Outliers
Without knowing the true process that generates commission outliers, we make the following assumption when developing our method and later generalize it by relaxing the assumption of constant rate.
Assumption 3.2 (Independent and Constant Commission). Commission outliers are generated by a process with a constant CIF independently from the normal point process.
Suppose we are given a target event and the history up to time
. Define a random variable
, such that
if
is a commission outlier, and
otherwise. We are interested in calculating
By Assumption 3.2 (Independent), the generative processes for the normal points and outliers can be viewed together as a marked point process. For each event , there is a hidden mark
indicating whether it is an outlier. The overall CIF is
, where
is the CIF for the normal point process, and
for the outlier process. Suppose we are at time
. The density of having the next event at
marginally and jointly with mark
are respectively
Then we can derive the conditional distribution of the hidden mark
Therefore, the Bayes decision rule is
where I [x] = 1 if x is true, and 0 otherwise. However, this rule cannot be directly applied, because is unknown. From Assumption 3.2 (Constant),
is a constant, so the decision rule becomes
threshold. This justifies ranking by
across all , so we use
as the commission outlier score: the higher the score, the more likely
is to be a commission outlier.
Nonconstant Rate We generalize the idea by relaxing the assumption of being a constant. The key is to treat
as a random variable and then marginalize it out as
Then we can prove that is still an increasing function of
, which justifies ranking by
(see Appendix B for details).
3.4. Detecting Omission Outliers
Similar to detecting commission outliers, we make the following assumption for omission outliers.
Assumption 3.3 (Independent and Constant Omission). Omission outliers are generated by independently removing the normal points with a constant probability.
To derive the method, we first define some notations. Let denote the CIF of the normal point process,
the probability of removing each normal point, and B a blank interval without any target points which we wish to decide whether contains omission outliers. For any interval
let
be the number of points observed, and
be the number of points generated by the normal point process with CIF
, so
is the result of combining
with random removal, and we can observe
but not
. Furthermore, we define an auxiliary random variable
that counts the number of points removed in B.
For a blank interval B, we observe can take different values k = 0, 1, . . .. The joint probability for each k is
where denotes the probability that k points are generated by the normal point process
It depends on
. Then we can calculate the posterior probability of
Define a random variable to indicate whether there are any omission outliers in the blank interval
is equivalent to
is equivalent to
Then the Bayes decision rule is
Without further assumptions, (Eq. 9) cannot be evaluated in closed form, but we can get a lower bound
Then the posterior probability of B containing any omission outliers can also be bounded
Therefore, we propose to use
as the omission outlier score. When we rank the blank intervals by , we essentially rank them by an upper bound of
Poisson Process There is a notable special case where we can get a closed-form . When the normal point process
is an inhomogeneous Poisson process, we have
for k = 0, 1, . . .. The posterior becomes
Therefore, the posterior probability of B containing any omission outliers is
This justifies scoring the interval because if we rank the intervals by their scores, the result will be the same as ranking by their posterior probabilities of containing omission outliers,
Nonconstant Rate Similar to commission outliers, we can generalize the idea by relaxing the assumption that is a constant and treating it as a random variable, in which case
is still an increasing function of
(see Appendix C for details).
Inter-Event Time Without making any assumptions on the normal point process, we can alternatively justify with hypothesis testing, if B is an inter-event time interval, i.e., a time interval between consecutive events
be the random variable for the inter-event time corresponding to B. The null and alternative hypotheses are
Assuming the null hypothesis is true, i.e., B is generated by the normal point process with CIF , the probability that the inter-event time is at least as long as |B| is
which is the p-value. A lower p-value means that the observation is more extreme, given that the null hypothesis is true, which means it is more likely to contain omission outliers. This justifies scoring by , where a higher score means that B is more likely to contain omission outliers.
3.5. Bounds on FDR and FPR
In this section, we prove some bounds on the performance of the proposed outlier scoring methods. We recall the definitions of false discovery rate (FDR) and false positive rate (FPR). Let y denote the true label (1=outlier, 0=normal) of an object (a target event or a blank interval) and the predicted label. Then FDR and FPR are defined as
Given the above definitions, we can prove the following theorems (see Appendix D for proofs).
Theorem 3.1. If we use the commission outlier score , where
is the time of a target event, with a threshold
, such that the decision rule is
, and let
denote the CIF of the independent process generating commission outliers, then the FDR is bounded above by
Theorem 3.2. If we use the omission outlier score for an inter-event time interval B, with a threshold
, such that the decision rule is
, then the FPR is bounded above by
Theorem 3.3. If we use the omission outlier score for a blank interval B, with a threshold
such that the decision rule is
assume that the normal point process is an inhomogeneous Poisson process and the probability of omission is
the FDR is bounded above by
To test the proposed methods, we perform experiments on both synthetic and real-world event sequences. We compare the following methods in the experiments. RND: A baseline that generates outlier scores by sampling from a uniform distribution. LEN: A baseline that detects outliers based on the empirical distribution of the inter-event time lengths (see Section 4.1). PPOD (Point-Process Outlier Detection): Our method based on a point-process model but only using the history of the target events as the context. CPPOD (Contextual PPOD): Our method based on a point-process model using the history of both the target events and the context events as the context.
A model adapted from the continuous-time LSTM (Mei & Eisner, 2017) is used for PPOD and CPPOD (see Appendix A). We choose the number of hidden units in the model from {64, 128, 256, 512, 1024} by maximizing the likelihood on the internal validation set that consists of 20 percent of the training set. We stress that for training and validation we do not use any labeled outlier data. Our implementation of all the methods and the experiments is publicly available.1
4.1. Baseline Method
We briefly describe the baseline method LEN. For training, the lengths of all the inter-event time intervals of the target events, collected. Then, an empirical distribution of the inter-event time can be formulated as
for simplicity, we describe the method as if we only had one sequence in the training data, but it is easy to see how it works for multiple sequences, which is the case in our experiments.
For testing, LEN outputs a commission outlier score for a target event at time
where is the inter-event time between the current and previous target events. Intuitively, if the inter-event time is too small (
is small) or too big (
is small), it is likely that
has occurred at an abnormal time (too early or too late) and therefore is a commission outlier. The negation makes sure that a higher score indicates that it is more likely to be an outlier. For a blank interval B, LEN outputs an omission outlier score as the length of B,
Intuitively, the longer the blank interval, the more likely it is to contain omission outliers.
4.2. Experiments on Synthetic Event Sequences
We generate synthetic event sequences using two different types of point processes. One is the inhomogeneous Poisson process. The other is the Gamma process. For each type of processes, there is a set of parameters that determine the distribution of the points. We allow the parameters to vary according to a context state x, which can take two distinct values
For the inhomogeneous Poisson process, the CIF is a piecewise constant function with the value , where x is the context state. In the experiments, we set f(0) = 0.1 and f(1) = 1. For the Gamma process, the inter-event time follows a Gamma distribution
shape,
rate), where x is the context state. In the experiments, we set
The changes of the context state x are driven by a continuous-time Markov chain (Nodelman et al., 2002) with a transition matrix
where dt is infinitesimal time. Each change of the state generates a context event.
For each point process type, we simulate 40 sequences. All sequences are simulated in the same time span T = [0, 1000]. We use 50 percent of the sequences for training and the others for testing.
Outlier Simulation Commission outliers are simulated by adding points generated from a separate point process. Omission outliers are simulated by random removals of existing points. There is a parameter that defines the rate of outliers relative to normal points and can change over time (see Appendix E for details). To evaluate the performance of the proposed methods under different conditions, we conduct three types of simulations.
• Constant rate. In this case, the rate of omission and commission is a constant, , relative to the normal points, where
. We also changed
, but it did not affect the relative performance of each method in almost all cases (see Appendix I).
• Periodic rate (denoted as [sin]). In this case, the rate of omission and commission is a periodic function defined by relative to the normal points, where
controls the overall scale, t is the time, and p is the period. We set
p = 100.
• Piecewise-constant rate (denoted as [pc]). In this case, we randomly generate a piecewise-constant function with a step size s = 10 for each sequence. The value of the function at each step is randomly generated from a uniform distribution. Then the rate of omission and commission is defined as
relative to the normal points, where
controls the overall scale.
We note that the first type satisfies the constant rate assumptions we made when developing the methods, but the second and third types do not. By using different simulations, we try to verify the robustness of our methods in different cases when the constant rate assumptions may not be satisfied.
Outlier Detection We apply the methods to detect outliers in an online manner, i.e., a commission (omission) outlier score is generated whenever a new target event is observed. For omission outliers, we also try to detect outliers at additional checkpoints within long blank intervals. See Appendix F for details of the approach and Appendix G for the outlier ratios of the datasets.
Results Table 1 shows the results in Area Under the Receiver Operating Characteristic curve (AUROC). CPPOD achieves the best performance for both commission and omission outliers, showing the effectiveness of our outlier scoring methods. PPOD being worse than CPPOD shows the importance of the context events in these cases. LEN performs much better than RND but worse than our proposed methods, which shows (1) it is an effective method for detecting outliers; (2) although intuitive, it is less effective than our proposed methods. Meanwhile, it does not have rigorous theoretical justifications.
As we can see, changing the outlier simulation mechanism does not affect the relative performance of each method. Even when the rates are not constants, i.e., or [pc], CPPOD and PPOD are still performing similarly as when
is a constant and much better than the baselines. We also checked the performance of our outlier scoring methods combined with the ground-truth model (see Appendix I), and the performance is similar to CPPOD.
Empirical Verification of the Bounds on FDR and FPR To empirically verify the bounds on FDR and FPR presented in Section 3.5, we randomly repeat the experiments using our outlier scoring methods with the ground-truth model (replacing the learned LSTM model with the true model in CPPOD) on the synthetic data 10 times with different test data. Each time we calculate the FDR and FPR for different thresholds on the scores. For verifying FPR, we only test the inter-event time intervals for omission outliers. Their means and standard deviations over all repetitions are shown with the theoretical bounds in Figure 2 for Gamma process. For FPR, the bounds overlap the empirical rates. See Appendix H for more details and the results for Poisson process.
Figure 2. FDR (commission outlier) and FPR (omission outlier) on synthetic data (Gamma process).
4.3. Experiments on Real-World Clinical Data
In this part, we use real-world clinical data extracted from the MIMIC III dataset (Johnson et al., 2016). The dataset consists of de-identified electronic health records of ICU patients. We use four types of events corresponding to commonly used medications and lab tests as our targets and form four datasets by collecting the target events and corresponding context events. The target events and their context events are listed in Table 2. The medical category (medication, lab, or vital sign) of each type of events is in brackets following the type. For example, Potassium Chloride is a type of medication, and Potassium (Blood) is a type of lab test. The latter is used as the context for the former, as the administration of the medication can be triggered by observing an abnormally low value in the lab test.
We record in the data every type of events in the table. However, for Potassium (Blood) and Total Calcium (Blood), we further split the context events into three subtypes depending on whether the value in the lab test is low, normal, or high. For INR(PT) (previous state), we create context events of two subtypes based on whether the value of the previous event is normal or abnormal (with a 1-second delay). For Arterial Blood Pressure systolic (ABPs) and Non-invasive Blood Pressure systolic (NBPs), we split the events into two subtypes depending on whether the value is normal or low. These subtypes help to define better the contexts influencing the target events, since depending on their values, the target events can be more/less likely to occur.
All target and context events for one patient admission form one event sequence. We randomly select 2000 sequences for the first three datasets and 500 sequences for the last one, as each sequence in the last one contains much more events than the first three. For each dataset, we use 50 percent of the sequences for training and the others for testing.
Simulation Evaluation We generate and detect commission and omission outliers on top of the existing data using the same procedures for synthetic data (Section 4.2) except that we set . This allows us to obtain
Table 1. AUROC on synthetic data. Dataset: name abbreviation (C=commission, O=omission) [
Table 2. Names of target and context events from MIMIC. INR=international normalized ratio; PT=prothrombin time.
ground-truth labels for analyses. Table 3 shows the AUROC results. The results have more variations across different datasets in this case, which can be seen by examining the performance of LEN. Omission outliers appear to be more challenging than commission outliers except for INR(PT). CPPOD and PPOD outperform RND and LEN on all the datasets for both types of outliers.
In all cases, CPPOD is the best or very close to it. In the latter cases, the best is always PPOD, and the differences are very small. These are the cases where the additional context events are not as influential as the history of the target events themselves for the occurrences of the target events, so PPOD is as good as but simpler than CPPOD. However, for Potassium Chloride and Calcium Gluconate, we can see a clear advantage of CPPOD over PPOD by using additional context events.
Manual Evaluation of Clinical Relevance Finally, we apply CPPOD to the original data (without any simulated outliers). For each dataset, we select three commission outliers and three omission outliers with the highest outlier scores. We manually examine the electronic health record
of the patient for each outlier to determine whether the event commissions and omissions identified by CPPOD are clinically relevant and assign the outliers to three classes: N (outlier is not relevant), Y (outlier is relevant), and P (outlier is probably relevant given the available data but cannot be determined due to insufficient data recorded in MIMIC).
Specifically, we retrieve medical notes and related data for each case across time. We examine the notes and data to find evidence to support the decisions made by the physicians (giving a medication / lab test or not) that are detected as outliers. If there is evidence supporting the decision (e.g., given the condition of the patient, the medication should be given, and it was given), then the outlier is not clinically relevant (N). If the evidence is against the decision, it is clinically relevant (Y). If we do not have enough data, but the data we have do not support the decision, it is probably relevant (P).
Table 4 shows the results of the manual evaluation. Overall, CPPOD is performing well, as many detected outliers are clinically relevant. For both false outliers in Cal (C), there was an apparent change in the schedule of the medication (previous pattern being interrupted), so for the model, they do appear to be outliers. For Pot (C), one patient had rare conditions, for which we do not have the corresponding context events. The other had a drop in the value of the lab test (context), but it did not reach the abnormally low level, so the model does not have the exact context information that supported the human decision of giving the medication.
We note that for this evaluation, we do not have any guarantees on any of the assumptions we made when developing the methods in theory. Nonetheless, CPPOD is still effective. Moreover, as we have discovered, the original data may already contain outliers, so even in the previous evaluation, the models are unlikely to be trained on clean data, but Table 3 still shows the benefits of PPOD and CPPOD. This demonstrates that the assumptions do not have to be rigorously satisfied in practice for the methods to work reasonably well, although we may lose the theoretical guarantees on the performance, which depend on the assumptions.
Table 3. AUROC on MIMIC data. Dataset: name abbreviation (C=commission, O=omission) [
Table 4. Manual evaluation on MIMIC data.
In this work, we have studied two new outlier detection problems: detection of commission and omission outliers in continuous-time event sequences. We proposed outlier scoring methods based on Bayesian decision theory and hypothesis testing with theoretical guarantees. The proposed methods depend on a point-process model built from the data. In this work, we experimented with a model considering context, adapted from the continuous-time LSTM. We conducted experiments on both synthetic and real-world event sequences. The results show the flexibility of the adapted model and, more importantly, the effectiveness of the proposed outlier scoring methods.
As verified by the experiments, the assumptions we made when developing the methods in theory do not have to be sat-isfied for them to work reasonably well in practice, although when they are satisfied, we get the theoretical justifications and guarantees on the performance. For future work, there are two interesting directions. One is to develop algorithms to learn the models robustly in presence of outliers in the
training data. The other is to allow the model to adapt in a non-stationary environment. Both are orthogonal to this work, as they can be combined with the outlier scoring methods developed in this work to improve the performance by providing a better model.
This work was supported by NIH grant R01-GM088224. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Siqi Liu also acknowledges the support by CS50 Merit Pre-doctoral Fellowship from the Department of Computer Science, University of Pittsburgh. Finally, the authors would like to thank all the anonymous reviewers who provided helpful comments on the earlier versions of the paper.
Adams, R. P., Murray, I., and MacKay, D. J. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 9–16. ACM, 2009.
Aggarwal, C. C. Outlier Analysis. Springer New York, 2013.
Akcay, S., Atapour-Abarghouei, A., and Breckon, T. P. GANomaly: Semi-supervised anomaly detection via ad-
versarial training. In Jawahar, C. V., Li, H., Mori, G., and Schindler, K. (eds.), Computer Vision – ACCV 2018, Lecture Notes in Computer Science, pp. 622–637, Cham, 2019. Springer International Publishing.
Apostolopoulou, I., Linderman, S., Miller, K., and Dubrawski, A. Mutually regressive point processes. In Advances in Neural Information Processing Systems, pp. 5116–5127, 2019.
Chandola, V., Banerjee, A., and Kumar, V. Anomaly detec- tion: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
Daley, D. J. and Vere-Jones, D. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer, New York, 2003.
Ding, H., Khan, M., Sato, I., and Sugiyama, M. Bayesian nonparametric Poisson-process allocation for time-sequence modeling. In International Conference on Artificial Intelligence and Statistics, pp. 1108–1116, 2018.
Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez- Rodriguez, M., and Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564. ACM, 2016.
Fawcett, T. and Provost, F. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997.
Fiore, U., Palmieri, F., Castiglione, A., and De Santis, A. Network anomaly detection with the restricted Boltzmann machine. Neurocomputing, 122:13–23, December 2013.
Garcia-Teodoro, P., Diaz-Verdejo, J., Maci´a-Fern´andez, G., and V´azquez, E. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1):18–28, 2009.
Gupta, M., Gao, J., Aggarwal, C. C., and Han, J. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering, 26(9):2250–2267, 2013.
Hauskrecht, M., Valko, M., Kveton, B., Visweswaram, S., and Cooper, G. Evidence-based anomaly detection. In Annual American Medical Informatics Association Symposium, pp. 319–324, November 2007.
Hauskrecht, M., Batal, I., Valko, M., Visweswaran, S., Cooper, G. F., and Clermont, G. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47–55, February 2013.
Hauskrecht, M., Batal, I., Hong, C., Nguyen, Q., Cooper, G. F., Visweswaran, S., and Clermont, G. Outlier-based detection of unusual patient-management actions: an ICU study. Journal of Biomedical Informatics, 64:211–221, 2016.
Hawkes, A. G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
Jia, J. and Benson, A. R. Neural jump stochastic differential equations. arXiv:1905.10403 [cs, stat], January 2020.
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.- w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3: 160035, May 2016.
Lee, Y., Lim, K. W., and Ong, C. S. Hawkes processes with stochastic excitations. In International Conference on Machine Learning, pp. 79–88, 2016.
Li, S., Xiao, S., Zhu, S., Du, N., Xie, Y., and Song, L. Learn- ing temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems, pp. 10781–10791, 2018.
Liu, S. and Hauskrecht, M. Nonparametric regressive point processes based on conditional Gaussian processes. In Advances in Neural Information Processing Systems, pp. 1062–1072, 2019.
Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. Vari- ational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pp. 1814–1822, 2015.
Lloyd, C., Gunter, T., Osborne, M., Roberts, S., and Nick- son, T. Latent point process allocation. In Artificial Intelligence and Statistics, pp. 389–397, May 2016.
Mei, H. and Eisner, J. M. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pp. 6757–6767, 2017.
Mei, H., Qin, G., and Eisner, J. Imputing missing events in continuous-time event streams. In International Conference on Machine Learning, pp. 4475–4485. PMLR, May 2019.
Nodelman, U., Shelton, C. R., and Koller, D. Continuous time Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI’02, pp. 378–387, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
Rao, V. and Teh, Y. W. Gaussian process modulated renewal processes. In Advances in Neural Information Processing Systems, pp. 2474–2482, 2011.
Shelton, C. R., Qin, Z., and Shetty, C. Hawkes process inference with missing data. In Thirty-Second AAAI Conference on Artificial Intelligence, April 2018.
Song, X., Wu, M., Jermaine, C., and Ranka, S. Conditional anomaly detection. Knowledge and Data Engineering, IEEE Transactions on, 19(5):631–645, 2007.
Trouleau, W., Etesami, J., Grossglauser, M., Kiyavash, N., and Thiran, P. Learning Hawkes processes under synchronization noise. In International Conference on Machine Learning, pp. 6325–6334, May 2019.
Wang, Y., Xie, B., Du, N., and Song, L. Isotonic Hawkes processes. In International Conference on Machine Learning, pp. 2226–2234, 2016.
Wong, W. K., Moore, A., Cooper, G., and Wagner, M. Bayesian network anomaly pattern detection for disease outbreaks. In International Conference on Machine Learning, pp. 808–815, August 2003.
Xiao, S., Farajtabar, M., Ye, X., Yan, J., Song, L., and Zha, H. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems, pp. 3247–3257, 2017.
Xu, H., Farajtabar, M., and Zha, H. Learning Granger causal- ity for Hawkes processes. In International Conference on Machine Learning, pp. 1717–1726, 2016.
Xu, H., Luo, D., and Zha, H. Learning Hawkes processes from short doubly-censored event sequences. In International Conference on Machine Learning, pp. 3831–3840, July 2017.
Xu, J. and Shelton, C. R. Intrusion detection using con- tinuous time Bayesian networks. Journal of Artificial Intelligence Research, 39:745–774, December 2010.
Zhou, K., Zha, H., and Song, L. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pp. 1301–1309, 2013.
The input to the continuous-time LSTM consists of the marked events in the combined sequence, . That is, we not only use the target events but also the context events as input, although we only model the CIF of the target events,
. The output consists of the hidden states
corresponding to the input. It is a nonlinear mapping from the content in the memory cell
of the LSTM at time
. As in a traditional LSTM, each continuous-time LSTM unit also has an input gate i, an output gate o, and a forget gate f. The relations between the memory cells, the hidden states, the input, and these gates are summarized as follows.
Let be a vector representation of the mark
, which is a learnable embedding. For
is a continuous function changing over time from
there are separate input gates and forget gates:
¯
where [a; b] denotes the concatenation of the vectors a and is the elementwise product,
is the logistic function, and g(x, s) = s log(1 + exp (x/s)) is the scaled softplus function with parameter s. All the W , U and d with/without different subscripts and bars are learnable parameters of the continuous-time LSTM.
Finally, to convert the output of the continuous-time LSTM to the CIF of the target events, , we have
where
and s are learnable parameters. The model is learned by maximizing the likelihood (Eq. 1) for all sequences in the training data. Monte-Carlo integration is used to evaluate
We generalize the method we have developed in Section 3.3 to cases when the rate of commission is not a constant. Treating as a random variable, based on what we
have already developed, we have
To avoid cluttering, we omit By marginalizing out
where we defined
We assume
which is easy to satisfy, since it is sufficient that either for some
or the distribution of
is one of the common distributions including any finite discrete distribution, Gamma distribution, etc.
It is not hard to see that f is an increasing function of any given
Therefore, we can show that
using the Dominated Convergence Theorem. This implies that is an increasing function of
Similar to commission outliers, we generalize the method we have developed in Section 3.4 to cases when the probability of omission is not a constant and the normal point process is an inhomogeneous Poisson process. Treating as a random variable, based on what we have already developed, we have
To avoid cluttering, we omit from now. By marginalizing out
where we defined
Apparently g is an decreasing function of for any given
Therefore, we can show that
using the Dominated Convergence Theorem. This implies that is an increasing function of
D.1. Theorem 3.1
Proof. From Eq. 4 and implicitly conditioned on the event and the history
Given that
D.2. Theorem 3.2
Proof. Let be the random variable for the inter-event time corresponding to the observed inter-event interval B, assuming it is generated from the normal point process. From Eq. 18
The last equality is because |B|), and
is the cumulative distribution function of
, implying it follows a uniform distribution.
D.3. Theorem 3.3
Proof. From Eq. 17 and implicitly conditioned on N(B) = 0 and the history
Given that
To define outliers, we simulate commission and omission outliers on top of the existing data. In this way, we can obtain ground-truth labels for testing.
To define commission outliers, we simulate a new sequence of target events independently from the existing data, and then merge the new events with the existing events. We use an (inhomogeneous) Poisson process with an intensity to generate the outliers.
controls the rate of such outliers. In the experiments, for each dataset, we set
is either a constant or a function over time depending on the settings, and
is the empirical rate of the target events calculated from the original test data.
To define omission outliers, we randomly remove target events in the original sequences according to independent Bernoulli trials. That is, each event is removed with probability and kept with probability
. We always keep the event if it marks the start time of the sequence. In the experiments, we set
, where, similar to commission,
is either a constant or a function over time depending on the settings.
We detect the presence of commission and omission outliers differently. To test for commission outliers, each method outputs an outlier score at the time of each target event. That is, whenever there is a new target event, we ask the question: is this event a commission outlier or not?
Testing for omission outliers is trickier, because we need to decide the checkpoints more carefully, i.e., when to ask for outlier scores. The simplest thing to do is to only check at the target event times. That is, whenever there is a new target event, we ask the question: is there any omission outlier since the previous target event till now?
However, this may become unsatisfactory in real-world applications, because there could be cases when the target events just stop occurring for a long period of time or even forever (potentially due to malfunctions of the underlying system). These are interesting and important cases we are supposed to detect, but the above testing method will not work. Therefore, we use a combined approach. We still have a checkpoint at each target event time, but on top of that, we also randomly generate checkpoints in long blank intervals.
Specifically, we have a parameter is the empirical rate of the target events estimated from the training data for each dataset, so within w, on average, we should see two events normally. Then, whenever the blank interval from the previous checkpoint till now is longer than w, we generate a new checkpoint within the interval by uniform sampling, and set the previous checkpoint to the generated checkpoint. We keep generating checkpoints until we reach the next target event or the end of the sequence.
The outlier ratio, i.e., the number of outliers divided by the total number of test points, for each dataset is summarized in Table G.1.
We show the results of empirically verifying the bounds proved in Section 3.5, continuing the results in Section 4.2. We use GT (Ground Truth): our outlier scoring methods combined with the ground-truth point-process model, which is only available on synthetic data. Figure H.1 shows the empirical FDR (commission outlier), FDR (omission outlier), and FPR (omission outlier) with means and standard deviations on data simulated from inhomogeneous Poisson
Table G.1. Outlier ratios of the datasets. Dataset: name abbreviation (C=commission, O=omission) [
processes along with the theoretical bounds. As we can see, the empirical FDRs have high variance when the threshold is high, because there are smaller number of samples above a higher threshold. Nonetheless, the empirical FDRs conform with the theoretical bounds, and so does the empirical FPR.
Using Ground-Truth Model We also compared with GT (Ground Truth): our outlier scoring methods combined with the ground-truth point-process model (only available on synthetic data). Figure I.1 and I.2 show the receiver operating characteristic (ROC) curves of the outlier detection methods on the synthetic data generated from inhomogeneous Poisson processes and Gamma processes with note that the curves of GT and CPPOD are almost identical. The fact that CPPOD almost has the same performance as GT is an evidence that the model based on the continuous-time LSTM is flexible enough to represent these context-dependent point processes.
Varying Outlier Rate We also experimented with changing for the constant rate to see its effect. Table I.1
Figure H.1. From left to right: FDR (commission outlier), FDR (omission outlier), and FPR (omission outlier) on synthetic data (Poisson process).
Figure I.1. ROC curves on synthetic data (Poisson process). Left: commission. Right: omission.
and Table I.2 show the full AUROC results for synthetic and MIMIC data respectively. As we can see, the relative performance for each method does not change in almost all cases.
Table I.1. AUROC on synthetic data. Dataset: name abbreviation (C=commission, O=omission) [