The online testing problem is concerned with the investigation of a possibly infinite stream of null hypotheses in an ongoing manner based on sequentially collected data
At each time point, the investigator must make a real-time decision after
arrives, without knowing future data
The control of multiplicity in sequential testing typically involves imposing serial constraints on error rates over time, which requires that, for example, the family wise error rate (FWER) or false discovery rate (FDR; Benjamini and Hochberg, 1995) must fall below a pre–specified level
decision points.
The online testing problem may arise from a range of applications. For example, the quality preserving database (QPD) framework (Aharoni et al., 2010) has been widely employed by many research teams from diverse backgrounds. Some notable databases include Stanford’s HIVdb that serves the community of anti-HIV treatment groups, WTCCC’s large-scale database that is distributed to assist various whole-genome association studies, and the National Health Institute (NIH) influenza virus resource (IVR) that has been intensively queried by numerous researchers for designing new vaccines and treatments. The proper and efficient management of these large databases calls for new analytical tools for handling thousands of hypothesis tests with real–time decisions made in a sequential fashion. For instance, the NIH IVR has been used to investigate thousands of biomedical hypotheses and, per the record in PubMed, has lead to more than 1,000 scientific publications as of January 2020. It has become increasingly important to develop a powerful and effective monitoring system to control the false positive findings over time. Another important application scenario, which is frequently encountered in finance, social media and mobile computing, is the real–time detection of anomalies based on high–frequency and large–scale time series data. For example, large travel service providers closely monitor the number of changes or cancellation requests of existing itineraries. An abnormal spike usually signifies an unexpected event. It is important for the company to detect such events early and make necessary adjustments. The development of online detection system plays a key role for providing novel and timely marketing insights and avoiding adverse financial losses.
Large-scale testing under the online setup poses several new issues that are not present in conventional “offline” setup. First, a real–time decision must be made before the next data point arrives. This makes conventional step–wise testing methods no longer applicable. For instance, the well–known Holm’s procedure (Holm, 1979) for FWER control and Benjamini– Hochberg’s procedure for FDR control both involve first ordering all observed p-values and then choosing a threshold along the ranking. However, the ranking step becomes impossible due to the absence of future data. Second, in contrast with conventional FWER and FDR criteria that only require an overall assessment of the multiplicity in simultaneous testing, the online methods must proceed with more stringent error constraints that are imposed sequentially at every decision point. This not only leads to decreased power in detecting signals but also calls for more carefully designed online testing rules. Third, the data stream often encodes useful local structures, including signal magnitudes, sparsity levels and grouping patterns, that may vary over time. It is crucial to develop flexible and adaptive online rules to exploit the underlying domain knowledge and informative structures. Fourth, the online decision-making process, which proceeds sequentially without the knowledge of future, may come to a halt when the total error budget, or alpha–wealth, is exhausted. As a result, the investigator may miss all potential discoveries in the future. This concern must be carefully addressed because in many applications the hypothesis tests are conducted in an ongoing manner with unpredictable patterns – even the total number of hypotheses to be investigated can be unknown. Finally, how to wisely allocate and invest the alpha–wealth to ensure the validity in error control while maintaining high statistical power of online testing rules in the long run has remained as a key issue that requires much research.
The online FDR control problem has received much recent attention and great progresses have been made. The alpha-investing (AI) idea (Foster and Stine, 2008) and its various generalizations (Aharoni and Rosset, 2014; Ramdas et al., 2017; Javanmard et al., 2018) have served as the basic framework and proved to be effective. Carefully designed AI rules are capable of handling an infinite stream of hypotheses and incorporating informative domain knowledge into the dynamic decision-making process. Beginning with a pre–specified alpha– wealth, the key idea in AI algorithms is that each rejection gains extra alpha–wealth, which may be subsequently used to make more discoveries at later time points. The generalized AI (GAI) algorithms (Aharoni and Rosset, 2014; Robertson and Wason, 2018; Lynch et al., 2017) are developed for a wider class of pay-out functions, enabling the construction of new online rules with increased power. The GAI++ framework (Ramdas et al., 2017) improves the power of GAI methods uniformly and is capable of dealing with more general settings. The new class of weighted GAI++ methods are flexibly designed to allow “indecisions” and are capable of integrating prior domain knowledge. To alleviate the “piggybacking” and “alpha–death” issues of AI rules, Ramdas et al. (2017) discussed the concept of decaying memory FDR. To effectively incorporate structural information into online inference, the SAFFRON procedure (Ramdas et al., 2018) derived a sequence of thresholds that are adaptive to estimated sparsity levels and showed that the power can be much improved.
This article develops a new class of structure–adaptive sequential testing (SAST) rules for online FDR control with several new features. First, in contrast with existing AI and GAI rules whose building blocks are p-values, the class of SAST rules are built upon the conditional local false discovery rate (Clfdr), which optimally adapts to important local structures in the data stream. Second, the sequential rejection rule based on Clfdr leads to a novel alpha– investing framework that is fundamentally different from that in Foster and Stine (2008). The new framework precisely characterizes the tradeoffs between different actions in online decision making, which provides key insights for designing more powerful online FDR rules. The new AI framework also reveals that SAST automatically avoids the “alpha–death” issue in the sense that its operation always reserves budget to reject new hypotheses, and can proceed in an ongoing manner to any time point in the future. Finally, by adaptively learning from past experiences and dynamically allocating the alphawealth, SAST can effectively avoid the “piggybacking” issue and improve its performance as more data are acquired. Our theoretical and numerical results demonstrate that SAST is effective for online FDR control, and achieves substantial power gain over existing methods in many settings.
The article is organized as follows. Section 2 first introduces the model and problem formulation, and then develops the oracle SAST procedure for online FDR control by assuming that model parameters are known. Section 3 discusses computational algorithms, proposes the data-driven SAST rule and establishes its theoretical properties. Simulation is conducted in Section 4 to investigate the finite sample performance of SAST and compare it with existing methods. SAST is illustrated in Section 5 through applications for identifying differentially expressed genes and detecting anomalies in time series data. The proofs are provided in the online supplementary material.
We first describe the model and problem formulation in Section 2.1, then discuss three key elements in the proposed SAST rule in turn: a new test statistic to capture the structural information in the data stream (Sections 2.2 and 2.3); a new alpha–investing framework to characterize the gains and losses in sequential decision making (Section 2.4); and a new adaptive learning algorithm to optimize the alpha–wealth allocation (Sections 2.5).
2.1 Model and Problem Formulation
Denote T a continuous temporal domain and a time point. Let
be a discrete, ordered and evenly spaced index set for time labels
. Suppose we are interested in testing a sequence of null hypotheses
based on data stream
). To describe the true states of nature, define Bernoulli variables
is true/false. Let
denote the local sparsity levels that may vary over time. The observations can be described using a hierarchical model:
where are the null and non-null distributions, respectively. Denote
corresponding density functions. We assume that
is known and identical for all
contrast,
can vary smoothly in
Remark 1. The inhomogeneity assumption reflects that signals may either vary in strengths or arrive at different rates over time. This structural information can be highly informative. The smoothness assumption makes it possible for pooling information from the observations in the neighborhood of t. We do not impose further assumptions on , both of which will be estimated non-parametrically.
Let ) be the collection of summary statistics (e.g. p–values or z–values) up to time t. Consider a class of online decision rules
where
) represents a real-time decision in the sense that
only depends on information available at time
= 1 indicating that
is rejected and
= 0 otherwise. Denote
the collection of decisions up to t. The online FDR problem is
concerned with the performance of a stream of real–time decisions. For decisions up to t, let
where the superscript “t” denotes that the FDR is evaluated at a specific time point. The goal is to construct a real–time decision rule that controls the FDR
level
. To compare the power of different testing rules, define the average power
(AP) and missed discovery rate (MDR) as
To simplify the discussion, throughout this section we assume that the distributional information such as the non-null proportion and density function
in Model 2.1 are known. Section 3 considers the case where model parameters are unknown and discusses in detail related estimation and implementation issues.
2.2 The oracle rule for simultaneous testing
The goal of this section is to justify the fundamental role of Clfdr as the building block of the proposed online FDR rule.
The online decision-making process is complicated due to the serial constraints on FDR and absence of future data. To focus on the essential issue, we first consider an ideal setup where a hypothetical oracle observes all data in a local neighborhood at once and makes a batch of simultaneous decisions. Let d denote the size of a neighborhood. Consider the collection of hypotheses in a neighborhood prior to Denote the neighborhood
and the simultaneous decisions
is allowed to depend on the entire
Unlike (2.2), we only require that the FDR is controlled for the d simultaneous decisions:
where the superscript “s” indicates a simultaneous–type FDR concept.
The simultaneous testing of multiple hypotheses can be conceptualized as a two-stage inferential process: firstly ranking all hypotheses according to a significance index and secondly choosing a cutoff along the ordered sequence. This process can be described by a thresholding
rule of the form
where ) is an indicator function, Λ
is the significance index of
is the cutoff of Λ
. For example, the BH procedure uses the p-value as the significance index to order the hypotheses, and implements a step-up algorithm to determine a data-driven cutoff c.
However, the p-value is inefficient for online FDR analysis as it fails to capture the important structural information in the data stream. We propose to use the conditional local false discovery rate (Clfdr) as the significance index to order the hypotheses:
Denote Clfdrthe ordered Clfdr values in
corresponding hypotheses. To determine the cutoff for simultaneous testing, we apply a step-
wise algorithm
Then the threshold is and we reject
. The Clfdr rule (2.6) may be viewed as an oracle rule that sees all data in a local neighborhood at once and then makes simultaneous decisions. In Appendix C, we establish the optimality property of the Clfdr rule for simultaneous testing under the “offline” setup. An infinite data stream can be approximately by sequential data points arrived in batches. Intuitively, the Clfdr statistic provides a good building block for developing new online sequential testing rules as it is optimal for simultaneous inference in each batch of data points.
Remark 2. In the “offline” setup for simultaneous testing with a covariate sequence, which includes the Clfdr rule (2.6) as a special case, Cai et al. (2019) develops asymptotic optimality theory. We can similarly show that (2.6) is asymptotically optimal in the sense that it achieves the benchmark of a hypothetical oracle. However, the optimality issue in the online setup, which depends on many other factors such as the optimal allocation of alpha–wealth and prediction of future patterns over time, is still an open issue and requires much research.
2.3 Adapting to local structures by Clfdr: an illustration
The incorporation of structural information and domain knowledge promises to improve the power of existing FDR procedures (Genovese et al., 2006; Cai and Sun, 2009; Hu et al., 2010; Lei and Fithian, 2018; Cai et al., 2019). For example, the works by Hu et al. (2010), Li and Barber (2019) and Xia et al. (2020) showed that the weighted p-values can be constructed to capture the varying sparsity levels of ordered or grouped hypotheses. In contrast with the p-value, the Clfdr takes into account important structural information such as makes Clfdr an ideal building block for multiple testing with inhomogeneous data streams. We present an example to illustrate the advantage of the Clfdr rule.
Consider the following situation where the data stream obeys a ran-
dom mixture model with varying sparsity levels:
Model (2.7) is a special case of Model (2.1): the null and alternative densities are fixed and the dynamic part is fully captured by the varying proportion . The key idea of Clfdr and weighted p-value (in the form of
is the weight for
) is to upweight the hypotheses in a local neighborhood where signals appear more frequently (e.g. in clusters).
To compare the effectiveness of different weighting methods, we simulate a data stream for testing m = 5000 hypotheses. The top row in Figure 1 sets 5 in blocks [1001 : 1150], [2001 : 2150], [3001 : 3100] and [4001 : 4150], and
01 elsewhere. We vary
4. The bottom row sets
from 0.2 to 0.9 in the above blocks. The block structure is highly informative and can be exploited by Clfdr and weighted p-values to improve the power. We apply the following methods at FDR level
05 by assuming that the model parameters in (2.7) are known: BH (Benjamini and Hochberg, 1995), the structure–adaptive BH algorithm (SABHA; Li and Barber, 2019) using weighted p-values with
the GAP method (Xia et al., 2020) using weighted p-values with
Clfdr rule (2.6). We can see that all methods control the FDR at the nominal level. In terms of the power, BH can be improved by SABHA and GAP, both of which are dominated by the Clfdr rule. Clfdr captures the varying structure in the data stream more effectively: in addition to varied
, it also adapts to
, leading to further power improvement.
2.4 A new alpha–investing framework
Existing FDR methods such as the BH and Clfdr procedures are simultaneous inference procedures that involve first ordering the significance indices (p-value or Clfdr) of all hypotheses and then applying a step-wise algorithm to the ordered sequence to determine the threshold. However, the ranking and thresholding strategy cannot be applied to the online setting where the investigator must make real–time decisions without seeing future observations. This section discusses how to avoid the overflow of the FDR at any given time t and how to efficiently
Figure 1: Structure–adaptiveness: Clfdr vs weighted p-values.
allocate the alpha–wealth to increase the power.
We start with a novel interpretation of the alpha–investing idea by recasting the Clfdr algorithm (2.6) as a varying–capacity knapsack process. Denote collection of rejected hypotheses at time t. The decision process (2.6) can be conceptualized as a sequence of comparisons of two quantities: the nominal FDR level
and the average of
the rejected Clfdr values. Specifically, (2.6) motivates us to consider the constraint
where Ave(A) denotes the average of the elements in set A. The simultaneous testing setup is only concerned with one constraint at the last time point when all data have been observed. By contrast, the online setup poses a series of constraints, e.g. (2.8) must be fulfilled for every t to avoid the overflow of FDR
We view (2.8) as a dynamic decision process resembling a knapsack problem, where
only be rejected when the following constraint is satisfied:
where (of the knapsack) at time t with the default choice
capacity may either expand or shrink over time, depending on the sequential decisions along the data stream. This dynamic process can be described as follows. The initial capacity is
= 0. Starting from t = 1, we reject
if (2.9) is fulfilled. If
with Clfdr
is rejected, then the capacity
increases by
(gain); hence we earn bonus room. By contrast, if
with Clfdr
is rejected, then
decreases by
The decision process (2.9) provides a new alpha–investing framework that precisely characterizes the gains and losses in sequential testing. In contrast with the alpha–investing framework in Foster and Stine (2008), which views each rejection as a gain of extra alpha–wealth, the new characterization (2.9) reveals that not all rejections are created equal: rejections with small Clfdr will lead to increased alpha–wealth whereas rejections with large Clfdr will lead to decreased alpha–wealth. This view provides key insights for designing more powerful online FDR rules. Moreover, the new AI framework reveals that utilizing Clfdr rules can automatically avoid the “alpha–death” issue. Specifically, the process (2.9) can always reject new hypotheses with Clfdr regardless of the current budget, and can proceed in an ongoing manner to any time point in the future.
2.5 Oracle–assisted adaptive learning and the SAST algorithm
To efficiently allocate the alpha–wealth, we need to further refine the online algorithm (2.9) to avoid making imprudent rejections that can potentially eat up all the budget. The specific issue is referred to as “piggybacking” (Ramdas et al., 2017), which, in a vivid way, describes the phenomenon that a string of bad decisions were made due to previously acquired budget.
To see the necessity of taking careful actions, suppose that we have accumulated some bonus room over time before observing a very large Clfdrsatisfying (2.9). Although rejecting
is an action that obeys the FDR constraint, the action can be unwise since it is possible that we can invest the extra “cost”, Clfdr
, to make more discoveries at later time points.
A practical strategy is to incorporate a “barrier” and modify (2.9) as
The barrier can effectively prevent “piggybacking” by filtering out large Clfdrsaving budget for future.
The choice of depends on the pattern of future hypotheses. However, all online methods must proceed without seeing the future. To resolve the issue, consider the oracle Clfdr rule (2.6) that sees all data in a local neighborhood at once. If we assume that the hypothesis stream is “locally stable” in its patterns, then
may be informed by the oracle rule (2.6) simultaneously conducted on a local neighborhood
. The rationale is to use recent past data to get some ideas about the patterns of hypotheses to arrive in the near future. Concretely, we first order
according to their Clfdr values, then run the “offline” algorithm (2.6) to set the barrier
. The online algorithm, by acting as if it sees the future, can effectively filter out large Clfdr values and hence avoid inefficient investments. The operation of algorithm (2.6) also implies that the barrier
may be either raised or lowered according to the varied
in the dynamic model, which is desirable in practice for dealing with inhomogeneous data streams. In Section 4.3, we illustrate that the incorporating of the barrier can greatly reduce the MDR (2.3).
Finally, we present the proposed structure–adaptive sequential testing (SAST) rule (oracle version with known parameters) in Algorithm 1. The SAST algorithm essentially utilizes the sequential constraints (2.10) with barriers set by the offline algorithm (2.6).
We can see that Algorithm 1 runs two parallel procedures: an online procedure for making real–time decisions and an “offline” procedure for determining the barrier. Thus the information of every data point has been used twice: first is used for real–time decision–making at time
is stored as past data so that we can “learn from experiences” via the offline oracle. The following theorem shows that Algorithm 1 is valid for online FDR control.
Theorem 1. Consider the online FDR procedure is determined by
Algorithm 1. Denote . Assume that the Clfdr values are known. Then we have FDR
We first develop estimation methodologies and computational algorithms to implement the SAST rule in Section 3.1, then establish the theoretical properties of the data-driven procedure in Section 3.2.
3.1 Data-driven procedure and computational algorithms
We assume that the null distribution of is known, which is a standard practice in the literature
. The key quantities remained to be estimated are
). In our motivating applications such as queries of QPDs and anomaly detection in high–frequency time series, the databases or servers have already collected large amounts of data at the beginning of the online FDR analysis. Let
denote the available data and suppose we start online testing at t = 1 with a data stream
The conditional density can be estimated using standard (one–sided) bivariate kernel
methods (Silverman, 1986):
where is the length of the moving window that includes a pre-specified number of observations, K(t) is a kernel function,
are the bandwidths, with
Remark 3. In analysis of large-scale high-frequency time series data such as the NYC taxi
data (Section 5), we can pre-specify d, say, to be 1000 to speed up the computation. This virtually has no impact on the estimator ˆ(compared to using all previous data). Otherwise we can always set d = t. Note that our estimator has followed the standard practice in density estimation, which does not include
when estimating
Next we propose a weighted screening approach to estimate the unknown proportion . The key idea is to use a kernel, which weights observations by their distance to t, to pool information from nearby time points. Let
be the bandwidth
a kernel function satisfying
. Consider a screening procedure
is a pre-specified threshold. We propose the following estimator based on Cai et al. (2019):
Now we provide some intuitions of the estimator (3.2). First, at time (0). We can view
) as the “total” number of observations at time t. Suppose we are interested in counting how many null p-values are greater than
among the
“observations” at t. The empirical count is given by
), whereas the expected count is given by
Equation (3.2) can be derived by first setting equal the expected and empirical counts and then solving for
. In Section
which always underestimates and guarantees (conservative) FDR control (Propositions 1).
Remark 4. There is a bias-variance tradeoff in the choice of for the proposed estimator ˆ
We shall see that when increases, the “purity” of the screening subset
) increases, which decreases the approximation bias of
(desirable). At the same time, when
increases, the sample size for estimating
will decrease, thereby increasing the variance of the estimator ˆ
(undesirable). The common choice of
is 0.5. In Section 4.1, we discuss a data–driven algorithm that chooses
adaptively.
Combining (3.1) and (3.2), we propose to estimate the Clfdr as
Our proposed data-driven rule implements Algorithm 1 by substituting in place of Clfdr
. The data-driven algorithm is summarized in Algorithm 2.
3.2 Theoretical properties of data–driven SAST
This section aims to show that the data–driven SAST procedure is asymptotically valid for online FDR control. Our theoretical analysis is divided into three steps. The first step (Propo-
sition 1) shows that a hypothetical rule, which substitutes
in place of Clfdrin Algorithm 1, is conservative for online FDR control.
, then we have
Hence the hypothetical rule using (3.5) is valid (and conservative) for online FDR control.
The second step (Proposition 2) shows that is a consistent estimator of Clfdr
We prove the result by appealing to the infill–asymptotics framework (Stein, 2012), which converts the set of time points
on a growing domain to a set of points that lie on a fixed-domain regular grid:
. The discussions in Stein (2012) indicate that the in-fill model is equivalent to the growing domain model under mild conditions: When
, the asymptotic arguments, which respectively correspond to letting the grid become denser and denser in the fixed interval (0, 1] and letting the domain
to grow to infinity, can be essentially established in the same manner. We state the fixed domain theory as it naturally connects to the familiar density estimation theory, where the notations and regularity conditions are standard and easy to understand. The growing domain version of the theory is briefly discussed in Appendix B.3.
We can similarly define the bivariate density estimator and the following conditional proportion estimator:
The two estimators (3.2) and (3.6) are essentially identical (with rescaled bandwidths).
We state the following regularity conditions. Condition (A1) requires that ) is smooth in t. Conditions (A2) to (A4) are standard in density estimation theory; see, for example, (Wand and Jones, 1994).
such that if
f
Proposition 2. Suppose (A1)–(A4) hold, then
In the third step of our theoretical analysis (Theorem 2), we establish the asymptotic validity of the data-driven SAST procedure for online FDR control.
Theorem 2. Assume the conditions in Proposition 2 hold. Then for any given time t, the data-driven SAST rule (Algorithm 2) controls the FDRasymptotically.
3.3 Theory for data streams with fixed distributions
SAST learns from past decisions and improves its performance over time through the assistance from an offline oracle. The barrier would become more informative as more tests are conducted. Specifically, the initial barrier is set to be
= 1, which is very conservative. In the special case when the mixture model has fixed
over time, we can show that the barrier
would converge to
is the optimal threshold of the “offline” oracle procedure in Section 2.2. Hence, provided that the capacity allows, the operation of (2.10) implies that SAST behaves like an oracle that sees all data points (including future ones). Our numerical results show that the FDR levels of SAST are conservative at the beginning but the FDR becomes closer to
as we sequentially update the barrier with information from more time points.
Theorem 3. Assume conditions from Theorem 2 holds. Then under the model with and
, the data-driven barrier ˆ
is the optimal threshold of the oracle FDR procedure for simultaneous testing defined in Section 2.2.
In this section, we first provide some details in implementation. Simulation studies are conducted in Section 4.2 to compare the oracle and data-driven SAST procedures with other existing online FDR rules. Section 4.3 presents an example to illustrate the merit of including a barrier in online sequential testing.
4.1 Implementation Details
In our simulation, the conditional density function ˆ) is estimated using R function density, where the bandwidths
are chosen based on Silverman (1986). A key step in the SAST algorithm is to estimate ˆ
. We propose to choose a data-driven
by running BH at
5. Roughly speaking, in the subset ˜
of the cases come from the null (e.g. the expected proportion of false positives made by BH). It is anticipated that in the remaining set
is used to construct our estimator, majority of the cases should come from the null. This data-driven scheme ensures a small bias in approximation, while maintaining a larger sample size compared to the standard choice of
4.2 Comparisons of online FDRs and MDRs
We compare the proposed SAST procedure with its competitors for online FDR control. The following methods are included in the comparison:
• SAST with known (SAST.OR, Algorithm 1)
• SAST with estimated model parameters (SAST.DD, Algorithm 2)
• LOND: the method proposed by Javanmard, A. and Montanari, A. (2016).
• LORD++: the GAI++ rule proposed by Ramdas et al. (2017).
For the general simulation setup, we choose m = 5000 and the pre-specified FDR level
05. The data are simulated from the following model:
For the data–driven method, we need an initial burn–in period. In simulation we generate 500 data points prior to t = 1 to form an initial density estimate. The varying density and proportion estimates are updated every 200 time points. The following simulation settings are considered:
2. from 2 to 4.2 with step size 0.2.
3. linearly from 0 to 0.5. Vary
from 2 to 4.2 with step size 0.5.
4. ranges between 0 to 0.5, vary
from 2 to 4.2
We apply different methods at 05. The empirical FDR and MDR levels are evaluated using the average of the false discovery proportions and missed discovery proportions from 1000 replications. To investigate the performance of different methods in the online setting, we display the empirical FDR
levels at various time points, where the intermediate evaluation points ranges from 1500 to 5000 with step size 500. The results for block and constant patterns are summarized in Figure 2, and the results for the linear and sine patterns are summarized in Figure 3.
The following observations can be made from the simulation results.
Figure 2: Simulation results for Settings 1 and 2: signal proportions are varied in a block fashion and kept constant respectively. Various signal strengths are investigated as well. Our data-driven and oracle procedures provide significantly more power while controlling FDR under the nominal level in comparison with others.
Figure 3: Simulation results for Settings 3 and 4: signal proportions are varied in linear and sine patterns, respectively. Our data-driven and oracle procedures provide significantly more power while controlling FDR under the nominal level in comparison with others.
4.3 Effects of the barrier
This section presents a toy example to illustrate that the barrier, which aims to prevent the “piggybacking” issue (Ramdas et al., 2017), can greatly reduce the MDR by allocating existing alpha–wealth in a more cost–effective way. Consider the previous block structured setting (Setting 1 in Section 4.2). Figure 4 shows the FDR and MDR comparisons for the following methods at FDR level 05: (i) oracle SAST rule (OR); (ii) oracle SAST rule with no barrier (OR nob); (iii) data-driven SAST rule with estimated parameters (DD, Section 3); (iv) data-driven SAST rule with no barrier (DD nob).
We can see from the comparison that although the FDR levels between the two oracle methods are roughly the same, the MDR levels are greatly reduced by incorporating the barrier (hence the alpha–wealth is invested more efficiently). The same patterns can be observed for the two data-driven procedures.
Figure 4: The incorporation of the barrier greatly reduces the MDR levels.
Online FDR rules are useful for a wide range of scenarios. We discuss two applications, respectively for anomaly detection in large–scale time series data and genotype discovery under the QPD framework.
5.1 Time series anomaly detection
The NYC taxi dataset can be downloaded from the Numenta Anomaly Benchmark (NAB) repository (Ahmad et al., 2017), which contains useful tools and datasets for evaluating algorithms for anomaly detection in streaming, real–time applications. The dataset records the counts of NYC taxi passengers every 30 minutes from July 1, 2014 to January 31, 2015, during which period five known anomalies had occurred (the NYC marathon, Thanksgiving, Christmas, New Years day and a snow storm). In Figure 5, we plot the time series, with the known anomalous intervals displayed in red rectangles.
We formulate the anomaly detection problem as an online sequential multiple testing problem. The basic setup can be described as follows. The null hypothesis corresponds to no anomaly at time t. We claim that an anomaly occurs at
is rejected. A rejection within the red intervals is considered to be a true discovery.
The application of online FDR rules requires summarizing the stream of counts data as a sequence of p-values or CLfdr statistics. However, directly calculating the p-values based
Figure 5: NYC Taxi passenger count time series from July 1st 2014 to Jan 31st 2015. Blue lines are Loess smoothed time series indicating the overall trend change.
on this dataset would be problematic as the data demonstrate strong trend and seasonality patterns. We first use the R package stlplus to carry out an STL decomposition (Seasonal Trend decomposition using Loess smoothing; Cleveland et al., 1990) to remove the seasonal and trend components. The residuals, displayed in the top 3 rows of Figure 6, are standardized and modeled using a two-component mixture (2.1). However, as can be seen from the histogram at the bottom of Figure 6, the null distribution is approximately normal but deviates from a standard normal. Following the method in Jin and Cai (2007), we estimate the empirical null distribution as N(0.028, 0.618). We apply the BH (pretending all observations are seen at once), LOND, LORD++ and SAST.DD at FDR level 0.0001. For the SAST.DD method, the neighborhood size d and initial burn-in period are both chosen to be 500. In calculating the Clfdr, ) is taken as the density of the estimated empirical null ˆ
. Moreover, the p-values are obtained by the formula
-scores are computed based on the residuals. Figure 7 summarizes the anomaly points detected by different methods.
We can see that for the several anomaly time periods labeled, SAST can detect more points than other methods. Table 1 summarizes the total number of rejections within the labeled time windows. It may appear counter-intuitive that SAST, being an online procedure, rejects
Table 1: Number of discoveries made by various online and offline FDR procedures for the NYC taxi dataset, nominal FDR level at 0.0001.
more null hypotheses than the offline BH procedure. The reason is that the anomalies tend to appear in clusters. This structural information is captured by the Clfdr statistic, which forms the building block of SAST and leads to improved power in detecting structured signals (Section 2.3).
5.2 IMPC dataset Genotype Discovery
In this section, we demonstrate the SAST procedure on a real dataset from the International Mouse Phenotyping Consortium (IMPC). This dataset, which has been analyzed in Karp et al. (2017), involves a large study to functionally annotate every protein coding gene by exploring the impact of gene knockouts. This dataset and resulting family of hypotheses are constantly growing as new results come in. Karp et al. (2017) tested both the roles of genotype and sex as modifiers of genotype effects, resulting in two sets of p-values: one set for testing genotype effects, and the other for sexual dimorphism. This dataset has been widely used for comparing online FDR algorithms. Currently it is available as one of the application datasets in the R-package OnlineFDR that implements methods such as LORD, LOND and LORD++. In order to implement our proposed SAST procedure, we need the original z-scores instead of p-values. However, the directions of effects cannot be determined based on p-value alone. Hence, we transform the p-values into z-scores by introducing a Bernoulli random variable to ensure asymptotic symmetry around 0:
Table 2 summarizes the total number of discoveries made by each method. We can see that SAST makes more discoveries than other alpha–investing methods. Similar to the analysis
Figure 6: Top three rows: Time series of remainder component from STL decomposition with the known anomaly regions marked in red rectangles. Bottom row: Histogram of the remainder term from STL decomposition, the red curve indicates the estimated empirical null distribution N(0.028, 0.618).
Figure 7: Anomaly points detected by various algorithms, our data-driven SAST procedure detects the most anomaly points within the labeled window marked by red rectangles. Nominal significance level chosen as 0.0001.
in Section 5, SAST rejects more hypotheses than the offline BH procedure. One possible explanation is that Clfdr is more powerful than p-values since it captures useful structural information in the data stream.
Table 2: Number of discoveries made by various online and offline FDR procedures for the IMPC dataset on Genotypes, nominal FDR level at 0.05.
Aharoni, E., Neuvirth, H., and Rosset, S. (2010). The quality preserving database: A com- putational framework for encouraging collaboration, enhancing power and controlling false discovery. IEEE/ACM transactions on computational biology and bioinformatics, 8(5):1431– 1437.
Aharoni, E. and Rosset, S. (2014). Generalized -investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):771–794.
Ahmad, S., Lavin, A., Purdy, S., and Agha, Z. (2017). Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262:134–147.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B, 57:289–300.
Cai, T. T. and Sun, W. (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc., 104:1467–1481.
Cai, T. T., Sun, W., and Wang, W. (2019). CARS: Covariate assisted ranking and screening for large-scale two-sample inference (with discussion). J. Roy. Statist. Soc. B, 81(2):187–234.
Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). Stl: a seasonal- trend decomposition. Journal of official statistics, 6(1):3–73.
Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96–104.
Foster, D. P. and Stine, R. A. (2008). -investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429–444.
Genovese, C. R., Roeder, K., and Wasserman, L. (2006). False discovery control with p-value weighting. Biometrika, 93(3):509–524.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pages 65–70.
Hu, J. X., Zhao, H., and Zhou, H. H. (2010). False discovery rate control with groups. Journal of the American Statistical Association, 105(491):1215–1227.
Javanmard, A., Montanari, A., et al. (2018). Online rules for control of false discovery rate and false discovery exceedance. The Annals of statistics, 46(2):526–554.
Jin, J. and Cai, T. T. (2007). Estimating the null and the proportional of nonnull effects in large-scale multiple comparisons. J. Amer. Statist. Assoc., 102:495–506.
Karp, N. A., Mason, J., Beaudet, A. L., Benjamini, Y., Bower, L., Braun, R. E., Brown, S. D., Chesler, E. J., Dickinson, M. E., Flenniken, A. M., et al. (2017). Prevalence of sexual dimorphism in mammalian phenotypic traits. Nature communications, 8:15475.
Lei, L. and Fithian, W. (2018). Adapt: an interactive procedure for multiple testing with side information. J. Roy. Statist. Soc. B, 80(4):649–679.
Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive benjamini– hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):45–74.
Lynch, G., Guo, W., Sarkar, S. K., Finner, H., et al. (2017). The control of the false discovery rate in fixed sequence multiple testing. Electronic Journal of Statistics, 11(2):4649–4673.
Ramdas, A., Yang, F., Wainwright, M. J., and Jordan, M. I. (2017). Online control of the false discovery rate with decaying memory. In Advances In Neural Information Processing Systems, pages 5650–5659.
Ramdas, A., Zrnic, T., Wainwright, M., and Jordan, M. (2018). Saffron: an adaptive algo- rithm for online control of the false discovery rate. In International Conference on Machine Learning, pages 4286–4294.
Robertson, D. S. and Wason, J. (2018). Online control of the false discovery rate in biomedical research. arXiv preprint arXiv:1809.07292.
Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26. CRC press.
Stein, M. L. (2012). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.
Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102:901–912.
Wand, M. P. and Jones, M. C. (1994). Kernel smoothing. Chapman and Hall/CRC.
Xia, Y., Cai, T. T., and Sun, W. (2020+). Gap: A general framework for information pooling in two-sample sparse inference. J. Am. Statist. Assoc., to appear.
This supplement contains the proofs of main theorems (Section A), other theoretical results (Section B), and optimality theory on simultaneous testing (Section C).
A.1 Proof of Theorem 1
Note that the Clfdr is defined as Clfdr). Then by the definition of FDR and double expectation theorem, we have:
By construction of the decision rule, (for all realization of XXX. It follows that FDR
A.2 Proof of Theorem 2
We need the following lemma:
is bounded for all
The proof of lemma 1 is elementary thus omitted. By definition of our algorithm, if
Note that Clfdris a random variable from random mixture model (2.1) with a non-vanishing
proportion of nonzero signals, we have
for every would imply
infinitely many times. By Proposition 2,
0. It follows that
. By Proposition 2
Finally, the operation of Algorithm 2 implies that
FDR(
A.3 Proof of theorem 3
Note when both are fixed over time, the Clfdr statistic reduces to Lfdr
The optimal threshold in the offline simultaneous testing setup would be independent of time t and the chosen neighborhood. The oracle offline rule coincides with the oracle procedure described in Section 3.2 of Sun and Cai (2007).
We now introduce some notations:
• ˆ
•
•
• is the “ideal” threshold.
Note that ˆis discrete. To facilitate the theoretical analysis, we define, for
where ˆ). It is easy to verify that ˆ
is continuous and monotone. Hence its inverse ˆ
is well defined, continuous and monotone.
) by the WLLN, so that we only need to establish that ˆ
). We need to following lemma:
Proof of Lemma 2. Using the definitions of ˆ, we can show that
Let us refer to the three sums on the right hand as I, II, and III respectively. By step 2 in
the proof of Theorem 2, I = o(1). Then let 0, and consider that
The first term on the right hand is vanishingly small as is a continuous random variable. The second term converges to 0 by Proposition 2. Noting that 0
we conclude II = o(1). In a similar fashion, we can show that III = o(1), thus proving the lemma.
It follows that
By Proposition 2, 0, applying Chebyshev’s inequality, we obtain
establishing (i).
is continuous, for any
0, we can find
0 such that
. It follows that
Proposition 2 and the WLLN imply that ˆ) = 0, then,
Hence, we have
completing the proof of (ii).
B.1 Proof of Proposition 1
Let ) is the p-value of x. Then
Hence By definition of Clfdr
Clfdr
Let R be the index set of hypotheses rejected by . The FDR of
The last inequality is due to the definition of which guarantees that
B.2 Proof of Proposition 2
Under the in-fill model, we write
We first state 3 lemmas that will be proved in turn.
Lemma 3. Under the assumption of Proposition 2,
Lemma 4. Under the assumptions of Proposition 2,
be estimates such that
0,
By Lemma 3 and Lemma 4, together with the fact that is known, it follows from Lemma 5 that
Since convergence in second order mean implies convergence
in probability, we have
B.3 Growing domain version of Proposition 2
In the growing domain framework, Proposition 2 takes the following form:
Proposition 3. Suppose:
T
The proof follows the same line as the proof of proposition 2, thus omitted.
B.4 Proof of Lemma 3
We first compute ). Note that
Taylor expansion, we have
It follows that
Let
To see why the last expression goes to 0, note that for any by Assumption (A1), we can
take such that for all
Note that 0, we conclude that
since
Thus
It follows from the boundedness of
Next we compute
Some additional calculations give
Therefore, by assumption (A3) and (A4),
Since (B.2) and (B.3) together imply that
B.5 Proof of lemma 4
Define ˆ
We first rewrite the term
use the definition of ˆ
To show the lemma, it is sufficient to show
To see why (B.6) implies (B.4), note that (B.6) implies
Next note that
By (A4), we havefor some constant
By Chebyshev’s inequality,
Combining (B.7) , (B.5), (A1) and (A2),
Therefore (B.4) follows.
We now show (B.6). Let
Use the normal tail bound,
Define be the density function for
. Note that
Hence (B.6) is proved. The lemma follows.
B.6 Proof of lemma 5
Note that ) is continuous and positive on the real line, then there exists
such that
we claim that are bounded below by a positive number for large t except for an event that has a low probability. Similar arguments can be applied to the upper bound of ˆ
as well as the cases for
. Therefore, we conclude that
are all bounded in the interval [
except for an event
that has probability tends to 0. Hence 0
Next note that
we conclude that
According to the assumptions, we further have that for a given 0, there exists
such that we can find
), and at the same time
Consequently, we have
, and the desired result follows.
The optimality of the Clfdr rule in simultaneous testing is summarized in the following proposition. The idea in the proof essentially follows that in Cai et al. (2019). We provide it here for completeness.
Proposition 4. Consider a class of decision rules simultaneous testing of hypotheses
in the neighborhood of
the marginal FDR of
, then the oracle threshold
exists and is unique. Define the oracle rule
is optimal for simultaneous testing in the sense that
Proof. The proof has two parts. In (a), we establish two properties of the testing rule that thresholds the Clfdr at an arbitrary . We show that it produces mFDR
and that its mFDR is monotonic in t. In (b) we show that when the threshold is
, the testing rule,
, exactly attains the mFDR level and is optimal amongst all valid testing procedures controls mFDR at level
Part(a). For the testing rule . We first show
that . Since Clfdr
where notation E is the expected value taken over (), notation
is the expectation taken over the distribution of (
is the expectation taken over
, holding (X)
fixed. We use (C.8) in the definition of
The equality above implies that . To see this, consider that all potentially non–zero terms arise when Clfdr
, and when this is the case, either (i)
Clfdr
, or (iii) Clfdr
. Notice (i) produces zero or positive terms on the LHS of (C.9), (ii) produces zero or negative terms, and (iii) produces negative terms. If
then only (iii) is possible, which contradicts the RHS. Thus, the testing rule is valid.
Next, we show that ) is nondecreasing in
. That is, letting
then
. We argue by contradiction. Suppose that
. First, it cannot be that
) = 0 for all i, because that implies
(both equal 0). Next,
since
and rewrite (Clfdr
If
It follows that
To see this, consider the expectation of the sum over m tests for the three RHS terms of (C.10), which we reference as (i), (ii), and (iii) respectively. First, (i) is zero because of (C.9). Then for each Clfdr, either (ii) is positive because
, or (iii) is positive because
α
However, (C.9) establishes that = 0, leading to a contradiction. Hence,
Part(b). The oracle threshold is defined as . First, let ¯
(1), which represents the largest mFDR level that the oracle testing procedure can be. By part (a),
) is non–decreasing. Via the squeeze theorem, for all
implies that
Next, consider the power of compared to that of an arbitrary decision rule
) such that
. Using the previous result
from part(a), it follows that
Take the difference of the two expressions to obtain
Next apply a transformation . Note that because
(1
) is monotonically increasing. Then order is preserved: if Clfdr
then
) and likewise for Clfdr
. This means we can rewrite
). It will be useful to note that, from part (a), we have
1, which implies that
Then,
. For both cases,
Summing over all m terms and taking the expectation yields (C.12).
Combine (C.11) and (C.12) to obtain
Finally, since 0, it follows that
0. After distributing the (
) term and separating the expectations for the sums of the two decision rules, we apply the definition of
to conclude that
ETP(