In the modern era of big data, data analyses play an important role in decision-making in healthcare, information technology, and government agencies. The growing availability of large-scale datasets and ease of data analysis, while beneficial to society, has created a severe crisis of reproducibility in science. In 2011, Bayer HealthCare reviewed 67 in-house projects and found that they could replicate fewer than 25 percent, and found that over two-thirds of the projects had major inconsistencies [ENAoSM]. One major reason is that random noise in the data can often be mistaken for interesting signals, which does not lead to valid and reproducible results. This problem is particularly relevant when testing multiple hypotheses, when there is an increased chance of false discoveries based on noise in the data. For example, an analyst may conduct 250 hypothesis tests and find that 11 are significant at the 5% level. This may be exciting to the researcher who publishes a paper based on these findings, but elementary statistics suggests that (in expectation) 12.5 of those tests should be significant at that level purely by chance, even if the null hypotheses were all true. To avoid such problems, statisticians have developed tools for controlling overall error rates when performing multiple hypothesis tests.
In hypothesis testing, the null hypothesis of no interesting scientific discovery (e.g., a drug has no effect), is tested against the alternative hypothesis of a particular scientific theory being true (e.g., a drug has a particular effect). The significance of each test is measured by a p-value, which is the probability of the observed data occurring under the null hypothesis, and a hypothesis is rejected if the corresponding p-value is below some (fixed) significance level. Each rejection is called a discovery, and a rejected hypothesis is a false discovery if the null hypothesis is actually true. When testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. The problem of false discovery rate (FDR) control is to find a procedure for testing multiple hypotheses that takes in the p-values of each test, and outputs a set of hypotheses to reject. The goal is to minimize the number of false discoveries, while maintaining high true positive rate (i.e., true discoveries).
In many applications, the dataset may contain sensitive personal information, and the analysis must be conducted in a privacy-preserving way. For example, in genome-wide association studies (GWAS), a large number of single-nucleotide polymorphisms (SNPs) are tested for an association with a disease simultaneously or adaptively. Prior work has shown that the statistical analysis of these datasets can lead to privacy concerns, and it is possible to identify an individual’s genotype when only minor allele frequencies are revealed [HSRThe field of differential privacy [DMNS06] offers data analysis tools that provide powerful worst-case privacy guarantees, and has become a de facto gold standard in private data analysis. Informally, an algorithm that is
-differentially private ensures that any particular output of the algorithm is at most
more likely when a single data point is changed. This parameterization allows for a smooth tradeoff between accurate analysis and privacy to the individuals who have contributed data. In the past decade, researchers have developed a wide variety of differentially private algorithms for many statistical tasks; these tools have been implemented in practice at major organizations including Google [EPK14], Apple [Dif17], Microsoft [DKY17], and the U.S. Census Bureau [DLS
Related Work. The only prior work on differentially private FDR control [DSZ18] considers the classic offline multiple testing problem, where an analyst has all the hypotheses and corresponding p-values upfront. Their private method repeatedly applies ReportNoisyMin [DR14] to the celebrated Benjamini-Hochberg (BH) procedure [BH95] in offline multiple testing to privately pre-screen the p-values, and then applies the BH procedure again to select the significant p-values. The (non-private) BH procedure first sorts all p-values, and then sequentially compares them to an increasing threshold, where all p-values below their (ranked and sequential) threshold are rejected. The ReportNoisyMin mechanism privatizes this procedure by repeatedly (and privately) finding the hypothesis with the lowest p-value.
Although the work of [DSZ18] showed that it was possible to integrate differential privacy with FDR control in multiple hypothesis testing, the assumption of having all hypotheses and p-values upfront is not reasonable in many practical settings. For example, a hospital may conduct multi-phase clinical trials where more patients join over time, or a marketing company may perform A/B testings sequentially. In this work, we focus on the more practical online hypothesis testing problem, where a stream of hypotheses arrive sequentially, and decisions to reject hypotheses must be made based on current and previous results before the next hypothesis arrives. This sequence of the hypotheses could be independent or adaptively chosen. Due to the fundamental difference between the offline and online FDR procedures, the method of [DSZ18] based on ReportNoisyMin cannot be applied to the online setting. Instead, we use SparseVector, described in Section ??, as a starting point. Discussion of non-private online multiple hypothesis testing appears in Section 2.2.
Our Results. We develop a differentially private online FDR control procedure for multiple hypothesis testing, which takes a stream of p-values and a target FDR level and privacy parameter , and outputs discoveries that can control the FDR at a certain level at any time point. Such a procedure provides unconditional differential privacy guarantees (to ensure that privacy will be protected even in the worst case) and satisfy the theoretical guarantees dictated by the FDR control problem.
Our algorithm, Private Alpha-investing P-value Rejecting Iterative sparse veKtor Algorithm (PAPRIKA, Algorithm 3), is presented in Section 3. Its privacy and accuracy guarantees are stated in Theorem 4 and 5, respectively. While the full proofs appear in the appendix, we describe the main ideas behind the algorithms and proofs in the surrounding prose. In Section 4, we provide a thorough empirical investigation
of PAPRIKA, with additional empirical results in Appendix A.
2.1 Background on Differential Privacy
Differential Privacy bounds the maximal amount that one data entry can change the output of the computation. Databases belong to the space and contain n entries–one for each individual–where each entry belongs to data universe D. We say that
if they differ in at most one data entry.
Definition 1 (Differential Privacy [DMNS06]). An algorithm is
-differentially private if for every pair of neighboring databases
, and for every subset of possible outputs
,
, we say that
-differentially private.
The additive sensitivity of a real-valued query is denoted
, and is defined to be the maximum change in the function’s value that can be caused by changing a single entry. That is,
If f is a vector-valued query, the expression above can be modified with the appropriate norm in place of the absolute value. Differential privacy guarantees are often achieved by adding Laplace noise at various places in the computation, where the noise scales with . A Laplace random variable with parameter b is denoted Lap(b), and has probability density function,
We may sometimes abuse notation and also use Lap(b) to denote the realization of a random variable with this distribution.
The SparseVector algorithm, first introduced by [DNPR10] and refined to its current form by [DR14], privately reports the outcomes of a potentially very large number of computations, provided that only a few are “significant.” It takes in a stream of queries, and releases a bit vector indicating whether or not each noisy query answer is above the fixed noisy threshold. We use this algorithm as a framework for our online private false discovery rate control algorithm as new hypotheses arrive online, and we only care about those “significant” hypotheses when the p-value is below a certain threshold. We note that the standard presentation below checks for queries with values above a threshold, but by simply changing signs this framework can be used to check for values below a threshold, as we will do with the p-values.
Theorem 1 ([DNPR10])-differentially private.
Theorem 2 ([DNPR10]). For any sequence of with sensitivity
outputs with probability at least
a stream of
such that
as
Unlike the conventional use of additive sensitivity, [DSZ18] defined the notion of multiplicative sensitivity specifically for p-values. It is motivated by the observation that, although the additive sensitivity of a p-value may be large, the relative change of the p-value on two neighboring datasets is stable unless the p-value is very small. Using this alternative sensitivity notion means that preserving privacy for these p-values only requires a small amount of noise.
Definition 2 (Multiplicative Sensitivity [DSZ18]). A p-value function p is said to be -multiplicative sensitive if for all neighboring databases
, either both
Specifically, when is sufficiently small, then we can treat the logarithm of the p-values as having additive sensitivity
, and we only need to add noise that scales with
, which may be much smaller than the noise required under the standard additive sensitivity notion.
2.2 Background on Online False Discovery Rate Control
In the online false discovery rate (FDR) control problem, a data analyst receives a stream of hypotheses on the database D, or equivalently, a stream of p-values . The analyst must pick a threshold
at each time t to reject the hypothesis when
; this threshold can depend on previous hypotheses and discoveries, and rejection must be decided before the next hypothesis arrives.
The error metric is the false discovery rate, formally defined as:
where is the (unknown to the analyst) set of hypotheses where the null hypothesis is true, and R is the set of rejected hypotheses. We will also write these terms as a function of time t to indicate their values after the first t hypotheses: FDR
. The goal of FDR control is to guarantee that for any time t, the FDR up to time t is less than a pre-determined quantity
Such a problem was first investigated by [FS08], who proposed a framework known as online alpha-investing that models the hypothesis testing problem as an investment problem. The analyst is endowed with an initial budget, can test hypotheses at a unit cost, and receives an additional reward for each discovery. The alpha-investing procedure ensures that the analysts always maintains an -fraction of their wealth, and can therefore continue testing future hypotheses indefinitely. Unfortunately, this approach only controls a
slightly relaxed version of FDR, known as mFDR, which is given by mFDR. This approach was later extended to a class of generalized alpha-investing (GAI) rules [AR14]. One subclass of GAI rules, the Level based On Recent Discovery (LORD), was shown to have consistently good performance in practice [JM15, JM18]. The SAFFRON procedure, proposed by [RZWJ18], further improves the LORD procedures by adaptively estimating the proportion of true nulls. The SAFFRON procedure is the current state-of-the-art in online FDR control for multiple hypothesis testing. To understand the main differences between the SAFFRON and the LORD procedures, we first introduce an oracle estimate of the FDP as FDP
number of false discoveries, so FDPoverestimates the FDP. The oracle estimator FDP
cannot be calculated since
is unknown. LORD’s naive estimator
is a natural overestimate of FDP
The SAFFRON’s threshold sequence is based on a novel estimate of FDP as
where is a sequence of user-chosen parameters in the interval (0, 1), which can be a constant or a deterministic function of the information up to time
. This is a much better estimator than LORD’s naive estimator
. The SAFFRON estimator is a fairly tight estimate of FDP
, since intuitively
has unit expectation under null hypotheses and is stochastically smaller than uniform under non-null hypotheses.
The SAFFRON algorithm is given formally in Algorithm 2. SAFFRON starts off with an error budget , which will be allocated to different tests over time. It never loses wealth when testing candidate p-values with
, and it earns back wealth of
on every rejection except for the first. By construction, the SAFFRON algorithm controls
to be less than
at any time t. The function
for defining the sequence
can be any coordinatewise non-decreasing function. For example,
can be a deterministic sequence of constants, or
, as in the case of alpha-investing. These
values serve as a weak overestimate of
. The algorithm first checks if a p-value is below
so, adds it to the candidate set of hypotheses that may be rejected. It then computes the
threshold based on current wealth, current size of the candidate set, and the number of rejections so far, and decides to reject the hypothesis if
. It also takes in a non-increasing sequence of decay factors
which sum to one. These decay factors serve to depreciate past wealth and ensure that the sum of the wealth budget is always below the desired level
The SAFFRON algorithm requires that the input sequence of p-values are not too correlated under the null hypothesis. This condition is formalized through a filtration on the sequence of candidacy and rejection decisions. Intuitively, this means that the sequence of hypotheses cannot be too adaptively chosen, otherwise the p-values may become overly correlated and violate this condition. Denote by the indicator for rejection, and let
be the indicator for candidacy. Define the filtration formed by the sequences of
-fields
, and let
, where
is an arbitrary function of the first
indicators for rejections and candidacy. We say that the null p-values are conditionally super-uniformly distributed with respect to the filtration F if:
We note that independent p-values is a special case of the conditional super-uniformity condition of (2). When p-values are independent, they satisfy the following condition:
SAFFRON provides the following accuracy guarantees, where the first two conditions apply if p-values are conditionally super-uniformly distributed, and the last two conditions apply if the p-values are additionally independent under the null.
Theorem 3 ([RZWJ18]). If the null p-values are conditionally super-uniformly distributed, then we have: (a)
(d) The condition implies that
In this section, we provide our algorithm for private online false discovery rate control, PAPRIKA, given formally in Algorithm 3. It is a differentially private version of SAFFRON, where we use SparseVector to ensure privacy of our rejection set. However, the combination of these tools is far from immediate, and several algorithmic innovations are required, including: dynamic thresholds in SparseVector to match the alpha-investing rule of SAFFRON, adding noise that scales with the multiplicative sensitivity of p-values to reduce the noise required for privacy, shifting the SparseVector threshold to accommodate FDR as a novel accuracy metric, and the candidacy indicator step which cannot be done privately and requires new analysis for both privacy and accuracy. Complete proofs of our privacy and accuracy results appear in the appendix; we elaborate here on the algorithmic details and why these modifications are needed to ensure privacy and FDR control.
The SAFFRON algorithm decides to reject hypothesis t if the corresponding is less than the rejection threshold
; that is, if
. We instantiate the SparseVector framework in this setting, where
plays the role of the
query answer
, and
plays the role of the threshold. Note that SparseVector uses a single fixed threshold for all queries, while our algorithm PAPRIKA allows for a dynamic threshold that depends on the previous output. Our privacy analysis of the algorithm accounts for this change and shows that dynamic thresholds do not affect the privacy guarantees of SparseVector. However, the algorithm would not be private if the dynamic thresholds also depend on the data. Note that SAFFRON never loses wealth when testing candidate p-values with
, and the threshold
on the data since it is based on current wealth. We remove such dependence in PAPRIKA by losing wealth at every step regardless of whether we test a candidate p-values, similar to LORD. This will result in stricter FDR control (and potentially weaker power) because our wealth decays faster.
Similar to prior work on private offline FDR control [DSZ18], we use multiplicative sensitivity as described in Definition 2, as p-values may have high sensitivity and require unacceptably large noise to be added to preserve privacy. We assume that our input stream of each has multiplicative sensitivity
. As long as
is small enough (i.e., less than the rejection threshold), we can treat the logarithm of the p-values as the queries with additive sensitivity
. Because of this change, we must make rejection decisions based on the logarithm of the p-values, so our reject condition is
for Laplace noise terms
drawn from the appropriate distributions.
The accuracy guarantees of SparseVector ensure that if a value is reported to be below threshold, then with high probability it will not be more than above the threshold. However, to ensure that our algorithm satisfies the desired bound
, we require that reports of “below threshold” truly do correspond to p-values that are below the desired threshold
. To accommodate this, we shift our rejection threshold
down by a parameter A. A is chosen such that the algorithm satisfies
-differential privacy, but the choice can be seen as inspired by the
-accuracy term of SparseVector as given in Theorem 2. Therefore our final reject condition is
. This ensures that “below threshold” reports are below
with high probability. Empirically, we see that the bound of A in Theorem 4 may be overly conservative and lead to no hypotheses being rejected, so we allow an additional scaling parameter s that will scale the magnitude of shift by a factor of s. The conservative bounds of Theorem 4 correspond to s = 4, but in many scenarios a smaller value of s = 1 or 2 will lead to better performance while still satisfying the privacy guarantee. Further guidance choosing this shift parameter is given in Section 4.3.
Even with these modifications, a naive combination of SparseVector and SAFFRON would still not satisfy differential privacy. This is due to the candidacy indicator step of the algorithm. In the SAFFRON algorithm, a pre-processing candidacy step occurs before any rejection decisions. This step checks whether each p-value is smaller than a loose upper bound
on the eventual reject threshold
. The algorithm chooses
using an
-investing rule that depends on the number of candidate hypotheses seen so far, and ensures that
, so only hypotheses in this candidate set can be rejected. These
values are used to control
, which serves as a conservative overestimate of FDP(t). (For a discussion of how to choose
, see Lemma 1 or our experimental results in Section 4. Reasonable choices would be
small constant such as 0.2.)
Without adding noise to the candidacy condition, there may be neighboring databases with p-values for some hypothesis such that
, and hence the hypothesis would have positive probability of being rejected under the first database and zero probability of rejection under the neighbor. This would violate the
-differential privacy guarantee intended under SparseVector. If we were to privatize the condition for candidacy using, for example, a parallel instantiation of SparseVector, then we would have to reuse the same realizations of the noise when computing the rejection threshold
to still control FDP, but this would no longer be private.
Since we cannot add noise to the candidacy condition, we weaken it in Then if a hypothesis has different candidacy results under neighboring databases and the multiplicative sensitivity
is small, then the hypothesis is still extremely unlikely to be rejected even under the database for which it was candidate. To see this, consider a pair of neighboring databases that induce p-values where
. Due to the multiplicative sensitivity constraint, we know that
. Plugging this into the rejection condition
, we see that we would need the difference of the noise terms to satisfy
, which by analysis of the Laplace distribution, will happen with exponentially small probability in n when
Our PAPRIKA algorithm is thus
-differentially private, and we account for this failure probability in our (exponentially small)
parameter, as stated in Theorem 4.
Our PAPRIKA algorithm allows analysts to specify a maximum number of hypotheses tested k and rejections c. We require a bound on the maximum number of hypotheses tested because the accuracy guarantees of SparseVector only allows exponentially (in the size of the database) many queries to be answered accurately. Once the total number of rejections reaches c, the algorithm will fail to reject all future hypotheses. We do not halt the algorithm as in SparseVector and therefore, PAPRIKA does not have a stopping criterion, and we can safely talk about the FDR control at any fixed time, just like SAFFRON.
Our algorithm also controls at each time t,
equivalent to by scaling down
by a factor of 2. By analyzing and bounding this expression, we achieve FDR bounds for our PAPRIKA algorithm, as stated in Theorem 5.
Theorem 4. For any stream of p-values -differentially private.
As a starting point, our privacy comes from SparseVector, but as discussed above, many crucial modifications are required. To briefly summarize the key considerations, we must handle different thresholds at different times, multiplicative rather than additive sensitivity, a modified notion of the candidate set, and introducing a small delta parameter to account for the new candidate set definition and the shift. The proof of Theorem 4 appears in Appendix B.
Next we describe the theoretical guarantees of FDR control for our private algorithm PAPRIKA which is an analog of Theorem 3. We modify the notation of the conditional super-uniformity assumption of SAFFRON to incorporate the added Laplace noise. The conditions are otherwise identical. (See (2) in Appendix ?? for comparison.) We note that independent p-values is a special case of conditional super-uniformity, but this requirement more generally allows for a broader class of dependencies among p-values. Let be the rejection decisions, and let
be the indicators for candidacy. We let
, where
is an arbitrary function of the first
indicators for rejections and candidacy. Define the filtration formed by the sequences of
-values are conditionally super-uniformly distributed with respect to the filtration
if when the null hypothesis
is true, then
We emphasize that this condition is only needed for FDR control, and that our privacy guarantee of Theorem 4 holds for arbitrary streams of p-values, even those which do not satisfy conditional super-uniformity.
Our FDR control guarantees for PAPRIKA mirror those of SAFFRON (Theorem 3). The first two statements apply if p-values are conditionally super-uniform, and the last two statements apply if the p-values are additionally independent under the null. The proof of Theorem 5 appears in Appendix C.
Theorem 5. If the null p-values are conditionally super-uniformly distributed, then we have: (a)
(d) The condition implies that
Relative to the non-private guarantees of Theorem 3, the FDR bounds provided by PAPRIKA are weaker by an additive of . In most differential privacy applications,
is typically required to be cryptographically small (i.e., at most negligible in the size of the database) [DR14], so this additional term should have a minuscule effect on the FDR.
We note that
plays a role in the analysis of Theorem 5, although it does not appear in FDR bounds. See (22) in the appendix, where the term with dependence on
can be upper bounded by
The following lemma is a key tool in the proof of Theorem 5. Though it is qualitatively similar to Lemma 2 in [RZWJ18], it is crucially modified to show an analogous statement holds under the addition of Laplace noise. Its proof appears in Appendix D.
There are no known theoretical bounds on the statistical power of SAFFRON even in the non-private setting. Instead, we validate power empirically through the experimental results in Section 4.
We experimentally compare the FDR and the statistical power of variations of the PAPRIKA and SAFFRON procedures, under different sequences of . Following the convention of [RZWJ18], we define PAPRIKA-Alpha-Investing, or PAPRIKA AI, to be the instantiation of Algorithm 3 with the sequence
the rejection threshold matches the
-investing rule, and we use PAPRIKA to denote Algorithm 3 instantiated with a sequence of constant of
, which in our experiments is
. We use
in
We generally observe that, even under moderately stringent privacy restrictions, PAPRIKA and its AI variant perform comparably to the non-private alternatives, with PAPRIKA AI typically outperforming PAPRIKA. This suggests that even though setting
as a fixed constant may be easier for implementation, parameter optimization can lead to meaningful performance improvements. We chose the sequence
to be a constant 1/k up to time k. Note that the sequence can be decreasing such as of the form
in [RZWJ18], which controls the wealth to be more concentrated around small values of j. See [RZWJ18] for more discussion on the choice of
. In our experiments, we set the target FDR level
, and thus our privacy parameter
is set to be bounded by
. The maximum number of rejections c = 40. All the results are averaged over 100 runs. We investigate two settings: in Section 4.1, the observations come Bernoulli distributions, and in Section 4.2, the observations are generated from truncated exponential distributions. In Section 4.3, we discuss our choice of the shift parameter A and give guidance on how to choose this parameter in practice. Code for PAPRIKA and our experiments is available at https://github.com/wanrongz/PAPRIKA.
4.1 Testing with Bernoulli Observations
We assume that the database D contains n individuals with k independent features. The ith feature is associated with n i.i.d. Bernoulli variables , each of which takes the value 1 with probability
, and takes the value 0 otherwise. Let
be the sum of the ith features. A p-value for testing null hypothesis
is given by
[DSZ18] showed that -multiplicatively sensitive for
and c is any small positive constant.
for varying values of , which represents the expected fraction of non-null hypotheses. We consider relatively small values of
as most practical applications of FDR control (such as GWAS studies) will have only a small fraction of true “discoveries” in the data.
In the following experiments, we sequentially test size of the database D, and k = 800 as the number of features as well as the number of hypotheses. Our experiments are run under several different shifts A, but due to space constraints, we only report results in the main body with
(i.e., when s = 1), which still satisfies our privacy guarantee. Further discussion on the choice of A and additional results under other shift parameters s are deferred to Appendix 4.3. The results are summarized in Figure 1, which plots the FDR and statistical power against the expected fraction of non-nulls,
. In Figure 1(a) and (b), we compare our algorithms with privacy parameter
to the non-private baseline methods of LORD [JM15, JM18], Alpha-investing [AR14], and SAFFRON and SAFFRON AI from [RZWJ18]. In Figure 1(c,d) and (e,f), we compare the performance of PAPRIKA AI and PAPRIKA, respectively, with varying privacy parameters
. We also list these values in Table 1 in Appendix A.
Figure 1: FDR and statistical power versus fraction of non-null hypotheses (with
), and non-private algorithms when the database consists of Bernoulli observations.
As expected, the performance of PAPRIKA generally diminishes as decreases. A notable exception is that FDR also decreases in Figure 1(c). This phenomenon is because we set
, resulting in a smaller candidacy set and leading to insufficient rejections. Surprisingly, PAPRIKA AI also yields a lower FDR than many of the non-private algorithms (Figure 1(a)), since it tends to make fewer rejections. We also see that PAPRIKA AI performs dramatically better than PAPRIKA, suggesting that the choice of
should be preferred to constant
to ensure good performance in practice.
As PAPRIKA is the first algorithm for private online FDR control, there is no private baseline for comparison. In Appendix A, we show that naïve Laplace privatization plus SAFFRON is ineffective.
4.2 Testing with Truncated Exponential Observations
We again assume that the database D contains n individuals with k independent features. The ith feature is associated with n i.i.d. truncated exponential distributed variables , each of which is sampled according to density
for positive parameters be the realized sum of the ith features, and let
denote the random variable of the sum of the n truncated exponential distributed variables in the ith entry. A p-value for testing the null hypothesis
against the alternative hypothesis
is given by,
[DSZ18] showed that -multiplicatively sensitive for
and c is any small positive constant. In the following experiments, we generate our database using the exponential distribution model truncated at as follows:
where we vary the parameter , corresponding to the expected fraction of non-nulls.
We sequentially test versus
for i = 1, . . . , k. We use n = 1000 as the size of the database D, and k = 800 as the number of features as well as the number of hypotheses. While there is no closed form to compute the p-values, the sum of n = 1000 i.i.d. samples is approximately normally distributed by the Central Limit Theorem. The expectation and the variance of
respectively. Therefore, is approximately distributed as
, and we compute the p-values accordingly. We run the experiments with shift
(shift magnitude s = 1). The results are shown in Figure 2, which plots the FDR and statistical power against the expected fraction of non-nulls,
Figure 2: FDR and statistical power versus fraction of non-nulls ), and non-private algorithms when the database consists of truncated exponential observations.
As in the case with binomial data, we see that the performance of PAPRIKA generally diminishes as decreases, and that PAPRIKA AI outperforms PAPRIKA, again reinforcing the need for tuning the parameters
based on the alpha-investing rule. All methods perform well in this setting, and the FDR of PAPRIKA AI is visually indistinguishable from 0 at all levels of
tested. Numerical values are listed
in Table 2 in Appendix A for ease of comparison.
We provide a further illustration of our experiments on truncated exponentials in Figure 3. In particular, we plot the rejection threshold and wealth versus the hypothesis index. Each “jump” of the wealth corresponds to a rejection. We observe that the rejections of our private algorithms are consistent with the rejections of the non-private algorithms, another perspective which empirically confirms their accuracy.
One hypothesis for the good performance observed in Figure 2 is that the signal between the null and alternative hypotheses as parameterized by is very strong, meaning the algorithms can easily discriminate between the true null and true non-null hypotheses based on the observed p-values. To measure this, we also varied the value of
in the alternative hypotheses. These results are shown in Figure 4, which plots FDR and power of PAPRIKA and PAPRIKA AI with when the alternative hypotheses have parameter
. As expected, the performance gets better as we increase the signal, and we observe that when the signal is too weak (
), performance begins to decline.
For baseline of comparison, we include results for LapSAFFRON with , which is a naïve privatization of SAFFRON based on the Laplace Mechanism. For this baseline mechanism, LapSAFFRON first computes the p-values of each hypothesis, applies the Laplace Mechanism [DMNS06] to the p-values, and then uses these noisy p-values as input to SAFFRON. Overall privacy of the mechanism comes from advanced composition across multiple calls to the Laplace Mechanism, and post-processing guarantees of differential privacy, where the SAFFRON algorithm is post-processing on the privatized p-values. We see that this baseline mechanism performs extremely poorly relative to PAPRIKA and PAPRIKA AI, motivating the need for our better algorithm design.
Figure 3: Wealth and rejection threshold versus hypothesis index with privacy parameter
when the database consists of truncated exponential observations. PAPRIKA AI and
Figure 4: FDR and statistical power versus expected fraction of non-null hypotheses various choices of signal
for alternative hypothesis parameters. The privacy parameter is
, and the database consists of truncated exponential observations. The first row shows performance of
, and the second row shows performance of
4.3 Choice of shift A
We now discuss how to choose the shift parameter A. Theorem 4 gives a theoretical lower bound for A in terms of the privacy parameter , but this bound may be overly conservative. Since the shift A is closely related to the performance of FDR and statistical power, we wish to pick a value of A that yields good performance in practice. In Theorem 5, we show that FDR(t) is less than our desired bound
plus the privacy parameter
, which naturally requires that the privacy loss parameter
be small. For a more detailed explanation, we bound Inequality (22) in the proof of Theorem 5 using Inequality (14) from the proof of Theorem 4, and therefore, the empirical
is naturally tied to the empirical FDR. As long as we can guarantee the empirical FDR to be bounded by the target FDR level, our privacy loss is bounded by the nominal
We use the Bernoulli example in Section 4.1 to investigate the performance under different choices of the shift A with privacy parameter . The results are summarized in Figure 5, which plots the FDR and power versus the expected fraction of non-nulls when we vary the shift size with s = 0.5, 1, 1.5, 2.
Larger shifts (corresponding to larger values of s) will lower the rejection threshold, which causes fewer hypotheses to be rejected. This improves FDR of the algorithm, but harms Power, as the threshold may be too low to reject true nulls. Figure 5 shows that the shift size parameter s should be chosen by the analyst to balance the tradeoff between FDR and Power, as demanded by the application.
Figure 5: FDR and statistical power versus expected fraction of non-null hypotheses various choices of shift magnitude s. The privacy parameter is
, and the database consists of Bernoulli observations. The first row shows performance of
and the second row shows performance of
[AR14] Ehud Aharoni and Saharon Rosset. Generalized -investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):771–794, 2014.
[BH95] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
[Dif17] Differential Privacy Team, Apple. Learning with privacy at scale. https: //machinelearning.apple.com/docs/learning-with-privacy-at-scale/ appledifferentialprivacysystem.pdf, December 2017.
[DKY17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, NIPS ’17, pages 3571–3580. Curran Associates, Inc., 2017.
[DLSAref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M.
Abowd. The modernization of statistical disclosure limitation at the U.S. census bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC ’06, pages 265–284, 2006.
[DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC ’10, pages 715–724, 2010.
[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[DSZ18] Cynthia Dwork, Weijie J Su, and Li Zhang. Differentially private false discovery rate control. arXiv preprint arXiv:1807.04209, 2018.
[ENAoSMMedicine Engineering, Engineering National Academies of Sciences, Medicine, et al. Reproducibility and replicability in science. 2019.
[EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS ’14, pages 1054–1067, New York, NY, USA, 2014. ACM.
[FS08] Dean P Foster and Robert A Stine. -investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429–444, 2008.
[HSRNils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, and David W Craig. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS genetics, 4(8):e1000167, 2008.
[JM15] Adel Javanmard and Andrea Montanari. On online control of false discovery rate. arXiv preprint arXiv:1502.06197, 2015.
[JM18] Adel Javanmard and Andrea Montanari. Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526–554, 2018.
[RZWJ18] Aaditya Ramdas, Tijana Zrnic, Martin Wainwright, and Michael Jordan. SAFFRON: an adaptive algorithm for online control of the false discovery rate. arXiv preprint arXiv:1802.09098, 2018.
[TR19] Jinjin Tian and Aaditya Ramdas. ADDIS: An adaptive discarding algorithm for online FDR control with conservative nulls. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, pages 9383–9391. Curran Associates, Inc., 2019.
Tables 1 and 2 report the numerical values for our experiments on Bernoulli and truncated exponential data, respectively. This information is also presented visually in Figures 1 and 2.
Table 1: Numerical values of FDR and power for Bernoulli observations experiments. LapSAFFRON corresponds to running SAFFRON on the naïve Laplace privatization of the p-values.
Table 2: Numerical values of FDR and power for truncated exponential observations experiments. LapSAF- FRON corresponds to running SAFFRON on the naïve Laplace privatization of the p-values.
Before proving Theorem 4, we will state and prove the following lemma, which will be useful in the proofs of Theorem 4 and Theorem 5.
Lemma 2. If and C > 0 is a constant, we have
Proof.
Theorem 4. For any stream of p-values -differentially private.
Proof. Fix any two neighboring databases D and . Let R denote the random variable representing the output of
denote the random variable representing the output of
). Let k denote the total number of hypotheses. When
and
for all
. When
, privacy follows from the privacy of SparseVector with dynamic thresholds. Since the threshold at each time t only depends on the threshold at time
and and private rejection
, by post-processing, the threshold
is private. Then by post-processing and the privacy of SparseVector , the rejection R(t) is also private. We give the formal probability argument as follows. For any neighboring
and any sequence of hypotheses, we first consider the output up to the first rejection, which is AboveThresh . Consider any output
. Let
, with
and
where is a fixed sequence of thresholds determined by the r. We have
Equation (3) is from change of integration variable . Inequality (4) is because
and
. Inequality (5) is because
When we restart AboveThresh after the first rejection, the inital threshold is the post-processing of the previous ouputs, which is also private. Then by simple composition, the overall privacy loss is For other cases, the worst case is that for all
. In this setting, we have
To satisfy -differential privacy, we need to bound the probability of outputting r for database D. We first consider r = {0, 0 . . . , 0}. We wish to bound
and
. The latter is trivial since
, which is greater than 1. It remains to satisfy
where Inequality (7) is because the worst case happens when is
below the candidacy threshold
, Equation (8) applies Lemma 2, and Inequality (9) follows from the facts that
for all t and that the third term in (8) is positive. Setting (9) to be larger than
Next, we consider all other possible outputs r. Define the set S := {r | there exists a We wish to bound
. The latter is trivial since
. It remains to bound
where Inequality (11) is because the worst case occurs when applies Lemma 2, and Inequality (13) follows from the facts that
and that the second term in (12) is negative. Setting (13) to be less than
Combining Equations (14) and (10), we have the condition that
which is how the shift term A is set in PAPRIKA.
Theorem 5. If the null p-values are conditionally super-uniformly distributed, then we have: (a)
(d) The condition implies that
Proof. For any time t > 0, before the total number of rejections reaches c we bound the number of false rejections as follows:
where Inequality (15) follows from the rejection rule before the total number of rejections reaches c, and the number of false rejections is always 0 afterwards. Inequality (16) follows from the conditional super-uniformity property. We bound each term in (16) separately. Using the law of iterated expectations by conditioning on , we can bound the first term of (16) as follows:
where Equation (17) applies the conditional super-uniformity. Since
Next, we bound the second term in (16) as follows:
Combining this inequality with (17), we bound mFDR as
If the null p-values are independent of each other and the non-nullls, and is a coordinate-wise non-decreasing function of the vector
, then we have
where Inequality (18) applies the law of iterated expectations by conditioning on and Lemma 1. Inequal-
ity (19) follows by a case analysis: if reduces to
. On the other hand, if
, then
, allowing us to upper bound the expectation by the probability of this event.
Combining (21) and (22), we reach the conclusion that FDR
Proof. The proof is similar to the proof of Lemma 2 in [RZWJ18] with the addition of i.i.d. Laplace noise.
In a high level, we hallucinate a vector of p-values that are same as the original vector of p-values, except for the t-th index. This allows us to apply the conditional uniformity property, since now is independent of the hallucinated rejections. We then connect the original rejections and the hallucinated rejections by the monotonicity of the rejections.
We perform our analysis using a hallucinated process: let be a copy of
that is identical everywhere except for the t-th p-value which is set to be 1. That is,
Also let the hallucinated Laplace noises be an identical copy of
, and let
be an identical copy of
. The t-th value of
can be arbitrary since we have ensure the event
, so it will fail to become a candidate and the values of
will not be relevant. We denote
and
as the candidates and rejections made using
because , so both will fail to become candidates, and hence we have
and the following equation holds:
We note that when , the above equation still holds since both sides will be zero. Since
is independent of
where Inequality (23) is obtained by taking the expectation only with respect to by invoking the conditional super-uniformity property and independence of
and
, and Inequality (24) follows from the facts that
and that the function h is non-decreasing.
For the second inequality in the lemma statement, we hallucinate a vector of p-values that equals
everywhere except for the t-th p-value which is set to be 0. That is,
Also let the hallucinated Laplace noises be an identical copy of
be an identical copy of
. We denote
and
as the candidates and rejections made using
and
. By construction, we have
and
for all i < t. On the event that
, since
and we inject the same Laplace noise, we have
and
, and hence also
. Then the following equation holds:
We note that when , the above equation still holds since both sides will be zero. Since
are independent of
, we can take conditional expectations to obtain
where Inequality (25) follows by taking expectation only with respect to by invoking the conditional uniformity property and the fact that the support of p-values is [0, 1], and Inequality (26) follows from the facts that
and that the function h is non-decreasing.