PAPRIKA: Private Online False Discovery Rate Control

2020·Arxiv

Abstract

Abstract

In hypothesis testing, a false discovery occurs when a hypothesis is incorrectly rejected due to noise in the sample. When adaptively testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. Thus the problem of False Discovery Rate (FDR) control is to find a procedure for testing multiple hypotheses that accounts for this effect in determining the set of hypotheses to reject. The goal is to minimize the number (or fraction) of false discoveries, while maintaining a high true positive rate (i.e., correct discoveries).

In this work, we study False Discovery Rate (FDR) control in multiple hypothesis testing under the constraint of differential privacy for the sample. Unlike previous work in this direction, we focus on the online setting, meaning that a decision about each hypothesis must be made immediately after the test is performed, rather than waiting for the output of all tests as in the offline setting. We provide new private algorithms based on state-of-the-art results in non-private online FDR control. Our algorithms have strong provable guarantees for privacy and statistical performance as measured by FDR and power. We also provide experimental results to demonstrate the efficacy of our algorithms in a variety of data environments.

1 Introduction

In the modern era of big data, data analyses play an important role in decision-making in healthcare, information technology, and government agencies. The growing availability of large-scale datasets and ease of data analysis, while beneficial to society, has created a severe crisis of reproducibility in science. In 2011, Bayer HealthCare reviewed 67 in-house projects and found that they could replicate fewer than 25 percent, and found that over two-thirds of the projects had major inconsistencies [ENAoSM]. One major reason is that random noise in the data can often be mistaken for interesting signals, which does not lead to valid and reproducible results. This problem is particularly relevant when testing multiple hypotheses, when there is an increased chance of false discoveries based on noise in the data. For example, an analyst may conduct 250 hypothesis tests and find that 11 are significant at the 5% level. This may be exciting to the researcher who publishes a paper based on these findings, but elementary statistics suggests that (in expectation) 12.5 of those tests should be significant at that level purely by chance, even if the null hypotheses were all true. To avoid such problems, statisticians have developed tools for controlling overall error rates when performing multiple hypothesis tests.

In hypothesis testing, the null hypothesis of no interesting scientific discovery (e.g., a drug has no effect), is tested against the alternative hypothesis of a particular scientific theory being true (e.g., a drug has a particular effect). The significance of each test is measured by a p-value, which is the probability of the observed data occurring under the null hypothesis, and a hypothesis is rejected if the corresponding p-value is below some (fixed) significance level. Each rejection is called a discovery, and a rejected hypothesis is a false discovery if the null hypothesis is actually true. When testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. The problem of false discovery rate (FDR) control is to find a procedure for testing multiple hypotheses that takes in the p-values of each test, and outputs a set of hypotheses to reject. The goal is to minimize the number of false discoveries, while maintaining high true positive rate (i.e., true discoveries).

In many applications, the dataset may contain sensitive personal information, and the analysis must be conducted in a privacy-preserving way. For example, in genome-wide association studies (GWAS), a large number of single-nucleotide polymorphisms (SNPs) are tested for an association with a disease simultaneously or adaptively. Prior work has shown that the statistical analysis of these datasets can lead to privacy concerns, and it is possible to identify an individual’s genotype when only minor allele frequencies are revealed [HSRThe field of differential privacy [DMNS06] offers data analysis tools that provide powerful worst-case privacy guarantees, and has become a de facto gold standard in private data analysis. Informally, an algorithm that is -differentially private ensures that any particular output of the algorithm is at most more likely when a single data point is changed. This parameterization allows for a smooth tradeoff between accurate analysis and privacy to the individuals who have contributed data. In the past decade, researchers have developed a wide variety of differentially private algorithms for many statistical tasks; these tools have been implemented in practice at major organizations including Google [EPK14], Apple [Dif17], Microsoft [DKY17], and the U.S. Census Bureau [DLS

Related Work. The only prior work on differentially private FDR control [DSZ18] considers the classic offline multiple testing problem, where an analyst has all the hypotheses and corresponding p-values upfront. Their private method repeatedly applies ReportNoisyMin [DR14] to the celebrated Benjamini-Hochberg (BH) procedure [BH95] in offline multiple testing to privately pre-screen the p-values, and then applies the BH procedure again to select the significant p-values. The (non-private) BH procedure first sorts all p-values, and then sequentially compares them to an increasing threshold, where all p-values below their (ranked and sequential) threshold are rejected. The ReportNoisyMin mechanism privatizes this procedure by repeatedly (and privately) finding the hypothesis with the lowest p-value.

Although the work of [DSZ18] showed that it was possible to integrate differential privacy with FDR control in multiple hypothesis testing, the assumption of having all hypotheses and p-values upfront is not reasonable in many practical settings. For example, a hospital may conduct multi-phase clinical trials where more patients join over time, or a marketing company may perform A/B testings sequentially. In this work, we focus on the more practical online hypothesis testing problem, where a stream of hypotheses arrive sequentially, and decisions to reject hypotheses must be made based on current and previous results before the next hypothesis arrives. This sequence of the hypotheses could be independent or adaptively chosen. Due to the fundamental difference between the offline and online FDR procedures, the method of [DSZ18] based on ReportNoisyMin cannot be applied to the online setting. Instead, we use SparseVector, described in Section ??, as a starting point. Discussion of non-private online multiple hypothesis testing appears in Section 2.2.

Our Results. We develop a differentially private online FDR control procedure for multiple hypothesis testing, which takes a stream of p-values and a target FDR level and privacy parameter , and outputs discoveries that can control the FDR at a certain level at any time point. Such a procedure provides unconditional differential privacy guarantees (to ensure that privacy will be protected even in the worst case) and satisfy the theoretical guarantees dictated by the FDR control problem.

Our algorithm, Private Alpha-investing P-value Rejecting Iterative sparse veKtor Algorithm (PAPRIKA, Algorithm 3), is presented in Section 3. Its privacy and accuracy guarantees are stated in Theorem 4 and 5, respectively. While the full proofs appear in the appendix, we describe the main ideas behind the algorithms and proofs in the surrounding prose. In Section 4, we provide a thorough empirical investigation

of PAPRIKA, with additional empirical results in Appendix A.

2 Preliminaries

2.1 Background on Differential Privacy

Differential Privacy bounds the maximal amount that one data entry can change the output of the computation. Databases belong to the space and contain n entries–one for each individual–where each entry belongs to data universe D. We say that if they differ in at most one data entry.

Definition 1 (Differential Privacy [DMNS06]). An algorithm is -differentially private if for every pair of neighboring databases , and for every subset of possible outputs , , we say that -differentially private.

The additive sensitivity of a real-valued query is denoted , and is defined to be the maximum change in the function’s value that can be caused by changing a single entry. That is,

If f is a vector-valued query, the expression above can be modified with the appropriate norm in place of the absolute value. Differential privacy guarantees are often achieved by adding Laplace noise at various places in the computation, where the noise scales with . A Laplace random variable with parameter b is denoted Lap(b), and has probability density function,

We may sometimes abuse notation and also use Lap(b) to denote the realization of a random variable with this distribution.

The SparseVector algorithm, first introduced by [DNPR10] and refined to its current form by [DR14], privately reports the outcomes of a potentially very large number of computations, provided that only a few are “significant.” It takes in a stream of queries, and releases a bit vector indicating whether or not each noisy query answer is above the fixed noisy threshold. We use this algorithm as a framework for our online private false discovery rate control algorithm as new hypotheses arrive online, and we only care about those “significant” hypotheses when the p-value is below a certain threshold. We note that the standard presentation below checks for queries with values above a threshold, but by simply changing signs this framework can be used to check for values below a threshold, as we will do with the p-values.

Theorem 1 ([DNPR10])-differentially private.

Theorem 2 ([DNPR10]). For any sequence of with sensitivity outputs with probability at least a stream of such that as

Unlike the conventional use of additive sensitivity, [DSZ18] defined the notion of multiplicative sensitivity specifically for p-values. It is motivated by the observation that, although the additive sensitivity of a p-value may be large, the relative change of the p-value on two neighboring datasets is stable unless the p-value is very small. Using this alternative sensitivity notion means that preserving privacy for these p-values only requires a small amount of noise.

Definition 2 (Multiplicative Sensitivity [DSZ18]). A p-value function p is said to be -multiplicative sensitive if for all neighboring databases , either both

Specifically, when is sufficiently small, then we can treat the logarithm of the p-values as having additive sensitivity , and we only need to add noise that scales with , which may be much smaller than the noise required under the standard additive sensitivity notion.

2.2 Background on Online False Discovery Rate Control

In the online false discovery rate (FDR) control problem, a data analyst receives a stream of hypotheses on the database D, or equivalently, a stream of p-values . The analyst must pick a threshold at each time t to reject the hypothesis when ; this threshold can depend on previous hypotheses and discoveries, and rejection must be decided before the next hypothesis arrives.

The error metric is the false discovery rate, formally defined as:

where is the (unknown to the analyst) set of hypotheses where the null hypothesis is true, and R is the set of rejected hypotheses. We will also write these terms as a function of time t to indicate their values after the first t hypotheses: FDR. The goal of FDR control is to guarantee that for any time t, the FDR up to time t is less than a pre-determined quantity

Such a problem was first investigated by [FS08], who proposed a framework known as online alpha-investing that models the hypothesis testing problem as an investment problem. The analyst is endowed with an initial budget, can test hypotheses at a unit cost, and receives an additional reward for each discovery. The alpha-investing procedure ensures that the analysts always maintains an -fraction of their wealth, and can therefore continue testing future hypotheses indefinitely. Unfortunately, this approach only controls a

slightly relaxed version of FDR, known as mFDR, which is given by mFDR. This approach was later extended to a class of generalized alpha-investing (GAI) rules [AR14]. One subclass of GAI rules, the Level based On Recent Discovery (LORD), was shown to have consistently good performance in practice [JM15, JM18]. The SAFFRON procedure, proposed by [RZWJ18], further improves the LORD procedures by adaptively estimating the proportion of true nulls. The SAFFRON procedure is the current state-of-the-art in online FDR control for multiple hypothesis testing. To understand the main differences between the SAFFRON and the LORD procedures, we first introduce an oracle estimate of the FDP as FDP

number of false discoveries, so FDPoverestimates the FDP. The oracle estimator FDPcannot be calculated since is unknown. LORD’s naive estimator is a natural overestimate of FDPThe SAFFRON’s threshold sequence is based on a novel estimate of FDP as

where is a sequence of user-chosen parameters in the interval (0, 1), which can be a constant or a deterministic function of the information up to time . This is a much better estimator than LORD’s naive estimator . The SAFFRON estimator is a fairly tight estimate of FDP, since intuitively has unit expectation under null hypotheses and is stochastically smaller than uniform under non-null hypotheses.

The SAFFRON algorithm is given formally in Algorithm 2. SAFFRON starts off with an error budget , which will be allocated to different tests over time. It never loses wealth when testing candidate p-values with , and it earns back wealth of on every rejection except for the first. By construction, the SAFFRON algorithm controls to be less than at any time t. The function for defining the sequence can be any coordinatewise non-decreasing function. For example, can be a deterministic sequence of constants, or , as in the case of alpha-investing. These values serve as a weak overestimate of . The algorithm first checks if a p-value is below so, adds it to the candidate set of hypotheses that may be rejected. It then computes the threshold based on current wealth, current size of the candidate set, and the number of rejections so far, and decides to reject the hypothesis if . It also takes in a non-increasing sequence of decay factors which sum to one. These decay factors serve to depreciate past wealth and ensure that the sum of the wealth budget is always below the desired level

The SAFFRON algorithm requires that the input sequence of p-values are not too correlated under the null hypothesis. This condition is formalized through a filtration on the sequence of candidacy and rejection decisions. Intuitively, this means that the sequence of hypotheses cannot be too adaptively chosen, otherwise the p-values may become overly correlated and violate this condition. Denote by the indicator for rejection, and let be the indicator for candidacy. Define the filtration formed by the sequences of -fields , and let , where is an arbitrary function of the first indicators for rejections and candidacy. We say that the null p-values are conditionally super-uniformly distributed with respect to the filtration F if:

We note that independent p-values is a special case of the conditional super-uniformity condition of (2). When p-values are independent, they satisfy the following condition:

SAFFRON provides the following accuracy guarantees, where the first two conditions apply if p-values are conditionally super-uniformly distributed, and the last two conditions apply if the p-values are additionally independent under the null.

Theorem 3 ([RZWJ18]). If the null p-values are conditionally super-uniformly distributed, then we have: (a)

(d) The condition implies that

3 Private online false discovery rate control

In this section, we provide our algorithm for private online false discovery rate control, PAPRIKA, given formally in Algorithm 3. It is a differentially private version of SAFFRON, where we use SparseVector to ensure privacy of our rejection set. However, the combination of these tools is far from immediate, and several algorithmic innovations are required, including: dynamic thresholds in SparseVector to match the alpha-investing rule of SAFFRON, adding noise that scales with the multiplicative sensitivity of p-values to reduce the noise required for privacy, shifting the SparseVector threshold to accommodate FDR as a novel accuracy metric, and the candidacy indicator step which cannot be done privately and requires new analysis for both privacy and accuracy. Complete proofs of our privacy and accuracy results appear in the appendix; we elaborate here on the algorithmic details and why these modifications are needed to ensure privacy and FDR control.

The SAFFRON algorithm decides to reject hypothesis t if the corresponding is less than the rejection threshold ; that is, if . We instantiate the SparseVector framework in this setting, where plays the role of the query answer , and plays the role of the threshold. Note that SparseVector uses a single fixed threshold for all queries, while our algorithm PAPRIKA allows for a dynamic threshold that depends on the previous output. Our privacy analysis of the algorithm accounts for this change and shows that dynamic thresholds do not affect the privacy guarantees of SparseVector. However, the algorithm would not be private if the dynamic thresholds also depend on the data. Note that SAFFRON never loses wealth when testing candidate p-values with , and the threshold on the data since it is based on current wealth. We remove such dependence in PAPRIKA by losing wealth at every step regardless of whether we test a candidate p-values, similar to LORD. This will result in stricter FDR control (and potentially weaker power) because our wealth decays faster.

Similar to prior work on private offline FDR control [DSZ18], we use multiplicative sensitivity as described in Definition 2, as p-values may have high sensitivity and require unacceptably large noise to be added to preserve privacy. We assume that our input stream of each has multiplicative sensitivity . As long as is small enough (i.e., less than the rejection threshold), we can treat the logarithm of the p-values as the queries with additive sensitivity . Because of this change, we must make rejection decisions based on the logarithm of the p-values, so our reject condition is for Laplace noise terms drawn from the appropriate distributions.

The accuracy guarantees of SparseVector ensure that if a value is reported to be below threshold, then with high probability it will not be more than above the threshold. However, to ensure that our algorithm satisfies the desired bound , we require that reports of “below threshold” truly do correspond to p-values that are below the desired threshold . To accommodate this, we shift our rejection threshold down by a parameter A. A is chosen such that the algorithm satisfies -differential privacy, but the choice can be seen as inspired by the -accuracy term of SparseVector as given in Theorem 2. Therefore our final reject condition is . This ensures that “below threshold” reports are below with high probability. Empirically, we see that the bound of A in Theorem 4 may be overly conservative and lead to no hypotheses being rejected, so we allow an additional scaling parameter s that will scale the magnitude of shift by a factor of s. The conservative bounds of Theorem 4 correspond to s = 4, but in many scenarios a smaller value of s = 1 or 2 will lead to better performance while still satisfying the privacy guarantee. Further guidance choosing this shift parameter is given in Section 4.3.

Even with these modifications, a naive combination of SparseVector and SAFFRON would still not satisfy differential privacy. This is due to the candidacy indicator step of the algorithm. In the SAFFRON algorithm, a pre-processing candidacy step occurs before any rejection decisions. This step checks whether each p-value is smaller than a loose upper bound on the eventual reject threshold . The algorithm chooses using an -investing rule that depends on the number of candidate hypotheses seen so far, and ensures that , so only hypotheses in this candidate set can be rejected. These values are used to control , which serves as a conservative overestimate of FDP(t). (For a discussion of how to choose , see Lemma 1 or our experimental results in Section 4. Reasonable choices would be small constant such as 0.2.)

Without adding noise to the candidacy condition, there may be neighboring databases with p-values for some hypothesis such that , and hence the hypothesis would have positive probability of being rejected under the first database and zero probability of rejection under the neighbor. This would violate the -differential privacy guarantee intended under SparseVector. If we were to privatize the condition for candidacy using, for example, a parallel instantiation of SparseVector, then we would have to reuse the same realizations of the noise when computing the rejection threshold to still control FDP, but this would no longer be private.

Since we cannot add noise to the candidacy condition, we weaken it in Then if a hypothesis has different candidacy results under neighboring databases and the multiplicative sensitivity is small, then the hypothesis is still extremely unlikely to be rejected even under the database for which it was candidate. To see this, consider a pair of neighboring databases that induce p-values where . Due to the multiplicative sensitivity constraint, we know that . Plugging this into the rejection condition , we see that we would need the difference of the noise terms to satisfy , which by analysis of the Laplace distribution, will happen with exponentially small probability in n when Our PAPRIKA algorithm is thus -differentially private, and we account for this failure probability in our (exponentially small) parameter, as stated in Theorem 4.

Our PAPRIKA algorithm allows analysts to specify a maximum number of hypotheses tested k and rejections c. We require a bound on the maximum number of hypotheses tested because the accuracy guarantees of SparseVector only allows exponentially (in the size of the database) many queries to be answered accurately. Once the total number of rejections reaches c, the algorithm will fail to reject all future hypotheses. We do not halt the algorithm as in SparseVector and therefore, PAPRIKA does not have a stopping criterion, and we can safely talk about the FDR control at any fixed time, just like SAFFRON.

Our algorithm also controls at each time t,

equivalent to by scaling down by a factor of 2. By analyzing and bounding this expression, we achieve FDR bounds for our PAPRIKA algorithm, as stated in Theorem 5.

Theorem 4. For any stream of p-values -differentially private.

As a starting point, our privacy comes from SparseVector, but as discussed above, many crucial modifications are required. To briefly summarize the key considerations, we must handle different thresholds at different times, multiplicative rather than additive sensitivity, a modified notion of the candidate set, and introducing a small delta parameter to account for the new candidate set definition and the shift. The proof of Theorem 4 appears in Appendix B.

Next we describe the theoretical guarantees of FDR control for our private algorithm PAPRIKA which is an analog of Theorem 3. We modify the notation of the conditional super-uniformity assumption of SAFFRON to incorporate the added Laplace noise. The conditions are otherwise identical. (See (2) in Appendix ?? for comparison.) We note that independent p-values is a special case of conditional super-uniformity, but this requirement more generally allows for a broader class of dependencies among p-values. Let be the rejection decisions, and let be the indicators for candidacy. We let , where is an arbitrary function of the first indicators for rejections and candidacy. Define the filtration formed by the sequences of -values are conditionally super-uniformly distributed with respect to the filtration if when the null hypothesis is true, then We emphasize that this condition is only needed for FDR control, and that our privacy guarantee of Theorem 4 holds for arbitrary streams of p-values, even those which do not satisfy conditional super-uniformity.

Our FDR control guarantees for PAPRIKA mirror those of SAFFRON (Theorem 3). The first two statements apply if p-values are conditionally super-uniform, and the last two statements apply if the p-values are additionally independent under the null. The proof of Theorem 5 appears in Appendix C.

Theorem 5. If the null p-values are conditionally super-uniformly distributed, then we have: (a)

(d) The condition implies that

Relative to the non-private guarantees of Theorem 3, the FDR bounds provided by PAPRIKA are weaker by an additive of . In most differential privacy applications, is typically required to be cryptographically small (i.e., at most negligible in the size of the database) [DR14], so this additional term should have a minuscule effect on the FDR.We note that plays a role in the analysis of Theorem 5, although it does not appear in FDR bounds. See (22) in the appendix, where the term with dependence on can be upper bounded by

The following lemma is a key tool in the proof of Theorem 5. Though it is qualitatively similar to Lemma 2 in [RZWJ18], it is crucially modified to show an analogous statement holds under the addition of Laplace noise. Its proof appears in Appendix D.

There are no known theoretical bounds on the statistical power of SAFFRON even in the non-private setting. Instead, we validate power empirically through the experimental results in Section 4.

4 Experiments

We experimentally compare the FDR and the statistical power of variations of the PAPRIKA and SAFFRON procedures, under different sequences of . Following the convention of [RZWJ18], we define PAPRIKA-Alpha-Investing, or PAPRIKA AI, to be the instantiation of Algorithm 3 with the sequence the rejection threshold matches the -investing rule, and we use PAPRIKA to denote Algorithm 3 instantiated with a sequence of constant of , which in our experiments is . We use in We generally observe that, even under moderately stringent privacy restrictions, PAPRIKA and its AI variant perform comparably to the non-private alternatives, with PAPRIKA AI typically outperforming PAPRIKA. This suggests that even though setting as a fixed constant may be easier for implementation, parameter optimization can lead to meaningful performance improvements. We chose the sequence to be a constant 1/k up to time k. Note that the sequence can be decreasing such as of the form in [RZWJ18], which controls the wealth to be more concentrated around small values of j. See [RZWJ18] for more discussion on the choice of . In our experiments, we set the target FDR level , and thus our privacy parameter is set to be bounded by . The maximum number of rejections c = 40. All the results are averaged over 100 runs. We investigate two settings: in Section 4.1, the observations come Bernoulli distributions, and in Section 4.2, the observations are generated from truncated exponential distributions. In Section 4.3, we discuss our choice of the shift parameter A and give guidance on how to choose this parameter in practice. Code for PAPRIKA and our experiments is available at https://github.com/wanrongz/PAPRIKA.

4.1 Testing with Bernoulli Observations

We assume that the database D contains n individuals with k independent features. The ith feature is associated with n i.i.d. Bernoulli variables , each of which takes the value 1 with probability , and takes the value 0 otherwise. Let be the sum of the ith features. A p-value for testing null hypothesis is given by

[DSZ18] showed that -multiplicatively sensitive for

and c is any small positive constant.

for varying values of , which represents the expected fraction of non-null hypotheses. We consider relatively small values of as most practical applications of FDR control (such as GWAS studies) will have only a small fraction of true “discoveries” in the data.

In the following experiments, we sequentially test size of the database D, and k = 800 as the number of features as well as the number of hypotheses. Our experiments are run under several different shifts A, but due to space constraints, we only report results in the main body with (i.e., when s = 1), which still satisfies our privacy guarantee. Further discussion on the choice of A and additional results under other shift parameters s are deferred to Appendix 4.3. The results are summarized in Figure 1, which plots the FDR and statistical power against the expected fraction of non-nulls, . In Figure 1(a) and (b), we compare our algorithms with privacy parameter to the non-private baseline methods of LORD [JM15, JM18], Alpha-investing [AR14], and SAFFRON and SAFFRON AI from [RZWJ18]. In Figure 1(c,d) and (e,f), we compare the performance of PAPRIKA AI and PAPRIKA, respectively, with varying privacy parameters . We also list these values in Table 1 in Appendix A.

Figure 1: FDR and statistical power versus fraction of non-null hypotheses (with ), and non-private algorithms when the database consists of Bernoulli observations.

As expected, the performance of PAPRIKA generally diminishes as decreases. A notable exception is that FDR also decreases in Figure 1(c). This phenomenon is because we set , resulting in a smaller candidacy set and leading to insufficient rejections. Surprisingly, PAPRIKA AI also yields a lower FDR than many of the non-private algorithms (Figure 1(a)), since it tends to make fewer rejections. We also see that PAPRIKA AI performs dramatically better than PAPRIKA, suggesting that the choice of should be preferred to constant to ensure good performance in practice.

As PAPRIKA is the first algorithm for private online FDR control, there is no private baseline for comparison. In Appendix A, we show that naïve Laplace privatization plus SAFFRON is ineffective.

4.2 Testing with Truncated Exponential Observations

We again assume that the database D contains n individuals with k independent features. The ith feature is associated with n i.i.d. truncated exponential distributed variables , each of which is sampled according to density

for positive parameters be the realized sum of the ith features, and let denote the random variable of the sum of the n truncated exponential distributed variables in the ith entry. A p-value for testing the null hypothesis against the alternative hypothesis is given by,

[DSZ18] showed that -multiplicatively sensitive for

and c is any small positive constant. In the following experiments, we generate our database using the exponential distribution model truncated at as follows:

where we vary the parameter , corresponding to the expected fraction of non-nulls.

We sequentially test versus for i = 1, . . . , k. We use n = 1000 as the size of the database D, and k = 800 as the number of features as well as the number of hypotheses. While there is no closed form to compute the p-values, the sum of n = 1000 i.i.d. samples is approximately normally distributed by the Central Limit Theorem. The expectation and the variance of

respectively. Therefore, is approximately distributed as , and we compute the p-values accordingly. We run the experiments with shift (shift magnitude s = 1). The results are shown in Figure 2, which plots the FDR and statistical power against the expected fraction of non-nulls,

Figure 2: FDR and statistical power versus fraction of non-nulls ), and non-private algorithms when the database consists of truncated exponential observations.

As in the case with binomial data, we see that the performance of PAPRIKA generally diminishes as decreases, and that PAPRIKA AI outperforms PAPRIKA, again reinforcing the need for tuning the parameters based on the alpha-investing rule. All methods perform well in this setting, and the FDR of PAPRIKA AI is visually indistinguishable from 0 at all levels of tested. Numerical values are listed

in Table 2 in Appendix A for ease of comparison.

We provide a further illustration of our experiments on truncated exponentials in Figure 3. In particular, we plot the rejection threshold and wealth versus the hypothesis index. Each “jump” of the wealth corresponds to a rejection. We observe that the rejections of our private algorithms are consistent with the rejections of the non-private algorithms, another perspective which empirically confirms their accuracy.

One hypothesis for the good performance observed in Figure 2 is that the signal between the null and alternative hypotheses as parameterized by is very strong, meaning the algorithms can easily discriminate between the true null and true non-null hypotheses based on the observed p-values. To measure this, we also varied the value of in the alternative hypotheses. These results are shown in Figure 4, which plots FDR and power of PAPRIKA and PAPRIKA AI with when the alternative hypotheses have parameter . As expected, the performance gets better as we increase the signal, and we observe that when the signal is too weak (), performance begins to decline.

For baseline of comparison, we include results for LapSAFFRON with , which is a naïve privatization of SAFFRON based on the Laplace Mechanism. For this baseline mechanism, LapSAFFRON first computes the p-values of each hypothesis, applies the Laplace Mechanism [DMNS06] to the p-values, and then uses these noisy p-values as input to SAFFRON. Overall privacy of the mechanism comes from advanced composition across multiple calls to the Laplace Mechanism, and post-processing guarantees of differential privacy, where the SAFFRON algorithm is post-processing on the privatized p-values. We see that this baseline mechanism performs extremely poorly relative to PAPRIKA and PAPRIKA AI, motivating the need for our better algorithm design.

Figure 3: Wealth and rejection threshold versus hypothesis index with privacy parameter when the database consists of truncated exponential observations. PAPRIKA AI and

Figure 4: FDR and statistical power versus expected fraction of non-null hypotheses various choices of signal for alternative hypothesis parameters. The privacy parameter is , and the database consists of truncated exponential observations. The first row shows performance of , and the second row shows performance of

4.3 Choice of shift A

We now discuss how to choose the shift parameter A. Theorem 4 gives a theoretical lower bound for A in terms of the privacy parameter , but this bound may be overly conservative. Since the shift A is closely related to the performance of FDR and statistical power, we wish to pick a value of A that yields good performance in practice. In Theorem 5, we show that FDR(t) is less than our desired bound plus the privacy parameter , which naturally requires that the privacy loss parameter be small. For a more detailed explanation, we bound Inequality (22) in the proof of Theorem 5 using Inequality (14) from the proof of Theorem 4, and therefore, the empirical is naturally tied to the empirical FDR. As long as we can guarantee the empirical FDR to be bounded by the target FDR level, our privacy loss is bounded by the nominal

We use the Bernoulli example in Section 4.1 to investigate the performance under different choices of the shift A with privacy parameter . The results are summarized in Figure 5, which plots the FDR and power versus the expected fraction of non-nulls when we vary the shift size with s = 0.5, 1, 1.5, 2.

Larger shifts (corresponding to larger values of s) will lower the rejection threshold, which causes fewer hypotheses to be rejected. This improves FDR of the algorithm, but harms Power, as the threshold may be too low to reject true nulls. Figure 5 shows that the shift size parameter s should be chosen by the analyst to balance the tradeoff between FDR and Power, as demanded by the application.

Figure 5: FDR and statistical power versus expected fraction of non-null hypotheses various choices of shift magnitude s. The privacy parameter is , and the database consists of Bernoulli observations. The first row shows performance of and the second row shows performance of

References

[AR14] Ehud Aharoni and Saharon Rosset. Generalized -investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):771–794, 2014.

[BH95] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.

[Dif17] Differential Privacy Team, Apple. Learning with privacy at scale. https: //machinelearning.apple.com/docs/learning-with-privacy-at-scale/ appledifferentialprivacysystem.pdf, December 2017.

[DKY17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, NIPS ’17, pages 3571–3580. Curran Associates, Inc., 2017.

[DLSAref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M.

Abowd. The modernization of statistical disclosure limitation at the U.S. census bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC ’06, pages 265–284, 2006.

[DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC ’10, pages 715–724, 2010.

[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[DSZ18] Cynthia Dwork, Weijie J Su, and Li Zhang. Differentially private false discovery rate control. arXiv preprint arXiv:1807.04209, 2018.

[ENAoSMMedicine Engineering, Engineering National Academies of Sciences, Medicine, et al. Reproducibility and replicability in science. 2019.

[EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS ’14, pages 1054–1067, New York, NY, USA, 2014. ACM.

[FS08] Dean P Foster and Robert A Stine. -investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429–444, 2008.

[HSRNils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, and David W Craig. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS genetics, 4(8):e1000167, 2008.

[JM15] Adel Javanmard and Andrea Montanari. On online control of false discovery rate. arXiv preprint arXiv:1502.06197, 2015.

[JM18] Adel Javanmard and Andrea Montanari. Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526–554, 2018.

[RZWJ18] Aaditya Ramdas, Tijana Zrnic, Martin Wainwright, and Michael Jordan. SAFFRON: an adaptive algorithm for online control of the false discovery rate. arXiv preprint arXiv:1802.09098, 2018.

[TR19] Jinjin Tian and Aaditya Ramdas. ADDIS: An adaptive discarding algorithm for online FDR control with conservative nulls. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, pages 9383–9391. Curran Associates, Inc., 2019.

A Additional Tables

Tables 1 and 2 report the numerical values for our experiments on Bernoulli and truncated exponential data, respectively. This information is also presented visually in Figures 1 and 2.

Table 1: Numerical values of FDR and power for Bernoulli observations experiments. LapSAFFRON corresponds to running SAFFRON on the naïve Laplace privatization of the p-values.

Table 2: Numerical values of FDR and power for truncated exponential observations experiments. LapSAF- FRON corresponds to running SAFFRON on the naïve Laplace privatization of the p-values.

B Proof of Theorem 4

Before proving Theorem 4, we will state and prove the following lemma, which will be useful in the proofs of Theorem 4 and Theorem 5.

Lemma 2. If and C > 0 is a constant, we have

Proof.

Theorem 4. For any stream of p-values -differentially private.

Proof. Fix any two neighboring databases D and . Let R denote the random variable representing the output of denote the random variable representing the output of ). Let k denote the total number of hypotheses. When and for all . When , privacy follows from the privacy of SparseVector with dynamic thresholds. Since the threshold at each time t only depends on the threshold at time and and private rejection , by post-processing, the threshold is private. Then by post-processing and the privacy of SparseVector , the rejection R(t) is also private. We give the formal probability argument as follows. For any neighboring and any sequence of hypotheses, we first consider the output up to the first rejection, which is AboveThresh . Consider any output . Let , with and

where is a fixed sequence of thresholds determined by the r. We have

Equation (3) is from change of integration variable . Inequality (4) is because and . Inequality (5) is because

When we restart AboveThresh after the first rejection, the inital threshold is the post-processing of the previous ouputs, which is also private. Then by simple composition, the overall privacy loss is For other cases, the worst case is that for all . In this setting, we have

To satisfy -differential privacy, we need to bound the probability of outputting r for database D. We first consider r = {0, 0 . . . , 0}. We wish to bound and . The latter is trivial since , which is greater than 1. It remains to satisfy

where Inequality (7) is because the worst case happens when is below the candidacy threshold , Equation (8) applies Lemma 2, and Inequality (9) follows from the facts that for all t and that the third term in (8) is positive. Setting (9) to be larger than

Next, we consider all other possible outputs r. Define the set S := {r | there exists a We wish to bound . The latter is trivial since . It remains to bound

where Inequality (11) is because the worst case occurs when applies Lemma 2, and Inequality (13) follows from the facts that and that the second term in (12) is negative. Setting (13) to be less than

Combining Equations (14) and (10), we have the condition that

which is how the shift term A is set in PAPRIKA.

C Proof of Theorem 5

Theorem 5. If the null p-values are conditionally super-uniformly distributed, then we have: (a)

(d) The condition implies that

Proof. For any time t > 0, before the total number of rejections reaches c we bound the number of false rejections as follows:

where Inequality (15) follows from the rejection rule before the total number of rejections reaches c, and the number of false rejections is always 0 afterwards. Inequality (16) follows from the conditional super-uniformity property. We bound each term in (16) separately. Using the law of iterated expectations by conditioning on , we can bound the first term of (16) as follows:

where Equation (17) applies the conditional super-uniformity. Since

Next, we bound the second term in (16) as follows:

Combining this inequality with (17), we bound mFDR as

If the null p-values are independent of each other and the non-nullls, and is a coordinate-wise non-decreasing function of the vector , then we have

where Inequality (18) applies the law of iterated expectations by conditioning on and Lemma 1. Inequal-

ity (19) follows by a case analysis: if reduces to . On the other hand, if , then , allowing us to upper bound the expectation by the probability of this event.

Combining (21) and (22), we reach the conclusion that FDR

D Proof of Lemma 1

Proof. The proof is similar to the proof of Lemma 2 in [RZWJ18] with the addition of i.i.d. Laplace noise.

In a high level, we hallucinate a vector of p-values that are same as the original vector of p-values, except for the t-th index. This allows us to apply the conditional uniformity property, since now is independent of the hallucinated rejections. We then connect the original rejections and the hallucinated rejections by the monotonicity of the rejections.

We perform our analysis using a hallucinated process: let be a copy of that is identical everywhere except for the t-th p-value which is set to be 1. That is,

Also let the hallucinated Laplace noises be an identical copy of , and let be an identical copy of . The t-th value of can be arbitrary since we have ensure the event , so it will fail to become a candidate and the values of will not be relevant. We denote and as the candidates and rejections made using

because , so both will fail to become candidates, and hence we have and the following equation holds:

We note that when , the above equation still holds since both sides will be zero. Since is independent of

where Inequality (23) is obtained by taking the expectation only with respect to by invoking the conditional super-uniformity property and independence of and , and Inequality (24) follows from the facts that and that the function h is non-decreasing.

For the second inequality in the lemma statement, we hallucinate a vector of p-values that equals everywhere except for the t-th p-value which is set to be 0. That is,

Also let the hallucinated Laplace noises be an identical copy of be an identical copy of . We denote and as the candidates and rejections made using and . By construction, we have and for all i < t. On the event that , since and we inject the same Laplace noise, we have and , and hence also . Then the following equation holds:

We note that when , the above equation still holds since both sides will be zero. Since are independent of , we can take conditional expectations to obtain

where Inequality (25) follows by taking expectation only with respect to by invoking the conditional uniformity property and the fact that the support of p-values is [0, 1], and Inequality (26) follows from the facts that and that the function h is non-decreasing.