b

DiscoverSearch
About
My stuff
Cross-conformal e-prediction
2020·arXiv
Abstract
Abstract

This note discusses a simple modification of cross-conformal prediction inspired by recent work on e-values. The precursor of conformal prediction developed in the 1990s by Gammerman, Vapnik, and Vovk was also based on e-values and is called conformal e-prediction in this note. Replacing e-values by p-values led to conformal prediction, which has important advantages over conformal e-prediction without obvious disadvantages. The situation with cross-conformal prediction is, however, different: whereas for cross-conformal prediction validity is only an empirical fact (and can be broken with excessive randomization), this note draws the reader’s attention to the obvious fact that cross-conformal e-prediction enjoys a guaranteed property of validity.

The version of this paper at http://alrw.net (Working Paper 26) is updated most often.

Conformal prediction is based on the notion of a p-value. At this time p-values are widely discussed (see, e.g., [1]), and several alternatives to p-values have been proposed. Perhaps the most popular alternative is Bayes factors, which, when stripped of their Bayesian context, are referred to as e-values in, e.g., [18, 10, 6]. In fact, e-values were used (under the name of i-values) when discussing precursors of conformal prediction in the 1990s. One early description is [4], and [2] is a recent rediscovery. In this note, it will be convenient to distinguish between conformal e-prediction (using e-values) and conformal p-prediction (standard conformal prediction using p-values).

Conformal e-prediction was superseded by conformal (p-)prediction mainly for two reasons:

1. Conformal predictions can be packaged as prediction sets [16, Sect. 2.2], in which case the property of validity of conformal predictors is very easy to state: we just say that the probability of error is at most  ǫfor a pre- specified significance level  ǫ[16, Proposition 2.3].

2. In the on-line mode of prediction, smoothed conformal predictors make errors independently, and so small probabilities of errors manifest themselves, with high probability, as a low frequency of errors [16, Corollary 2.5].

These are important advantages of conformal p-prediction, which do not appear to be counterbalanced by any clear disadvantages.

Whereas the notion of a conformal e-predictor does not appear particularly useful, using e-values in place of p-values in cross-conformal prediction [15] has a clear advantage. Cross-conformal predictors are not provably valid [15, Appendix], and this sometimes even shows in experimental results [9]. The limits of violations of validity are given by R¨uschendorf’s result (see, e.g., [17]): when merging p-values coming from different folds by taking arithmetic mean (this is essentially what cross-conformal predictors do), the resulting arithmetic mean has to be multiplied by 2 in order to guarantee validity. In the recent method of jackknife+, introduced in [3] and closely related to cross-conformal prediction, there is a similar factor of 2 [3, Theorem 1], which cannot be removed in general [3, Theorem 2].

The situation with e-values is different: the arithmetic mean of e-values is always an e-value. This is an obvious fact, but it is shown in [18] that arithmetic mean is the only useful merging rule. Therefore, the version of cross-conformal prediction based on e-values, which we call cross-conformal e-prediction in this note, is always valid.

Suppose we are given a training set  z1, . . . , znconsisting of labelled objects zi = (xi, yi) and our goal is to predict the label of a new object x. In this note we consider predictors of the following type: for each potential label y for x we would like to have a number  f(z1, . . . , zn, x, y) reflecting the plausibility of y being the true label of x. An example is conformal transducers [16, Sect. 2.7], which, in the terminology of this note, may be called conformal p-predictors. The output

image

of a conformal p-predictor is the full conformal prediction for the label of x; e.g., it determines the prediction set at each significance level. We will sometimes write  f(z1, . . . , zn, z), where z := (x, y), instead of  f(z1, . . . , zn, x, y).

We will use the notation X for the object space and Y for the label space (both assumed non-empty). These are measurable spaces from which the objects and labels, respectively, are drawn. Full observations z = (x, y) are drawn from Z := X × Y. For any non-empty set  X, X+will be the set  ∪∞n=1Xnof all non-empty finite sequences of elements of X.

A nonconformity e-measure is a function  A : Z+ →[0, ∞)+that maps any finite sequence (z1, . . . , zm), m ∈ {1, 2, . . .}, to a finite sequence (α1, . . . , αm) of

the same length consisting of nonnegative numbers with average at most 1,

image

that satisfies the following property of equivariance: for any  m ∈ {2, 3, . . .}, any permutation  πof {1, . . . , m}, any (z1, . . . , zm) ∈ Zm, and any (α1, . . . , αm) ∈[0, ∞)m,

image

The conformal e-predictor f corresponding to such A is defined by

image

so that  f : Z+ →[0, ∞). A conformal e-predictor is a function that can be obtained from a nonconformity e-measure in this way.

When given a training set  z1, . . . , znand a test object x, the full prediction for x according to a conformal e-predictor f is the family of e-values

image

The full prediction can be summarized as, e.g., the point prediction

image

(assuming the min is attained at a single label), the e-confidence

image

and the  e-credibility f(z1, . . . , zn, x, ˆy). We can make a confident point prediction when the e-confidence is large while the e-credibility is not.

A  nonnegative nonconformity measure A : Z+ →[0, ∞)+is defined as a nonconformity e-measure except that the condition (1) is omitted. Given a nonnegative nonconformity measure A, we can always define the corresponding nonconformity e-measure  A′by normalizing A:

image

where (α1, . . . , αm) := A(z1, . . . , zm) (setting 0/0 to, e.g., 0). We will say that the corresponding conformal e-predictor is based on A.

The conformal e-predictor proposed in [5] for binary classification problems is

image

where SV is the set of indices of support vectors:  i ∈SV if and only if  zi, i = 1, . . . , n + 1, is a support vector for the SVM constructed from  z1, . . . , zn+1as training set. It is based on the indicator function of being a support vector. When given a training set  z1, . . . , znand a new object x, this conformal e-predictor goes through all potential labels y for x and for each constructs an SVM and outputs  f(z1, . . . , zn, x, y). It makes it computationally inefficient.

The following obvious proposition asserts the validity of conformal e-predictors.

image

Let us fix a measurable space Σ (a summary space). A Σ-valued split nonconformity measure is a measurable function  A : Z+ →Σ. Intuitively,  A(z1, . . . , zm, z) encodes how well z conforms to  z1, . . . , zm. A normalizing transformation N : Σ+ →[0, ∞)+is an equivariant measurable function that maps every non-empty finite sequence (σ1, . . . , σm) of elements of Σ to a finite sequence (α1, . . . , αm) of the same length of nonnegative numbers whose average is at most 1 (i.e., satisfying (1)).

To apply split conformal e-prediction to a training set  z1, . . . , zn, we split it into two parts, the  training set proper z1, . . . , zn−cand the calibration set zn−c+1, . . . , zn. For a new object x and a potential label y for it, we set

image

where  αyis defined using the following steps:

image

For many choices of A and N, the split conformal e-predictor (3) will be computationally efficient; this is the case when:

1. Processing the training set proper only once, we can find an easily computable rule transforming z into  A(z1, . . . , zn−c, z).

2. The normalizing transformation N is easily computable.

An example of an easily computable normalizing transformation is

image

(cf. (2)), where the summary space is supposed to be Σ := [0, ∞). Proposition 1, our statement of validity, continues to hold for split conformal e-predictors.

A Σ-valued split nonconformity measure A is a Σ-valued cross-nonconformity measure if  A(z1, . . . , zm, z) does not depend on the order of its first m arguments. Given such an A and a normalizing transformation N, the corresponding cross-conformal e-predictor (CCEP) is defined as follows. The training sequence  z1, . . . , znis randomly split into K non-empty multisets (folds) zSk, k = 1, . . . , K, of equal (or as equal as possible) sizes  |Sk|, where  K ∈ {2, 3, . . .}is a parameter of the algorithm, (S1, . . . , SK) is a partition of the index set {1, . . ., n}, and  zSkconsists of all  zi, i ∈ Sk. For each  k ∈ {1, . . . , K}and each potential label  y ∈ Yof the new object x, find the output  αkof the split conformal e-predictor on the new object x and its postulated label y with  zS−kas training set proper and  zSkas calibration set, where  S−k := ∪j̸=kSj = {1, . . ., n}\Skis the complement to the fold  Sk. The corresponding CCEP is defined by

image

(A slight modification, still provably valid, of this definition is where the arithmetic mean is replaced by the weighted mean with the weights proportional to the sizes  |Sk|of the folds.)

Proposition 1 still holds for cross-conformal e-predictors; this trivially follows from the arithmetic mean of e-values being an e-value.

Remark 2. To compare the outputs of cross-conformal p-predictors (CCPP) and CCEP, we can use the rough transformation discussed in [18]: a p-value of p roughly corresponds to an e-value of 1/p. Under this transformation, the arithmetic average of e-values corresponds to the harmonic average of p-values, and the harmonic average is always less than or equal to the arithmetic average [7, Theorem 16]. This suggests that CCEP produce better results than CCPP do. In the opposite direction, the arithmetic average of p-values corresponds to the harmonic average of e-values, which again suggests that CCEP produce better results than CCPP do.

Advantage 1 of conformal p-prediction given on p. 1 is that, when discussing validity, we can talk about probabilities instead of p-values. It mostly disappears when we move to conformal e-prediction: validity has to be defined in terms of e-values (however, it has been argued [11] that e-values are more intuitive than p-values). In this section we discuss advantage 2, which requires the online mode of prediction. We will see that it still holds, albeit in a weakened form.

The notion of validity asserted in Proposition 1 (applied to CCEP) is stated in the “space domain”: the output of the CCEP on the true label is an e-value, i.e., its mean value at a fixed time over the probability space does not exceed 1. Now we will complement the validity in the space domain by the validity in the time domain assuming that the CCEP is bounded.

In the online prediction protocol, we observe an object  x1, apply the CCEP to compute the values  f(x1, y) for all possible labels  y ∈ Y, observe the true label  y1, observe another object  x2, apply the CCEP to compute the values f(x1, y1, x2, y) for all possible labels  y ∈ Y, observe the true label  y2, etc.

Remark 3. Intuitively, using CCEP in the online mode defeats the purpose of cross-conformal prediction: once we process  x1, y1, . . . , xn, yn, we would like to apply the rules that we have found (see item 1 on p. 4) to a large number of new objects rather than just one. But more realistic settings (e.g., along the lines of [16, Sect. 3.3]) are also more complicated.

Let the e-values for the true labels be  en := f(x1, y1, . . . , xn, yn). The following proposition asserts the asymptotic online validity of the e-values  en.

Proposition 4. Suppose the observations (xn, yn), n = 1, 2, . . ., are IID. Then, in the online prediction protocol,

image

Proposition 4 shows that the long-term time average of the e-values for the true labels is bounded above by 1 almost surely. In this sense they are time-wise e-values.

Proof of Proposition 4. We follow the proof [14, Lemma 14] (given as Lemma 3.15 in the first edition of [16]) and use the terminology of [13, Chap. 7].

Let  Fnbe the  σ-algebra generated by the multiset  �z1, . . . , zn−1�and the labelled objects  zn, zn+1, . . .. For each time horizon  N ∈ {2, 3, . . .}, the stochastic sequence (en − 1, Fn), n = N, . . . ,1, is a bounded supermartingale difference. By Hoeffding’s inequality (see, e.g., [16, Sect. A.6.3]), for any  N ∈ {2, 3, . . .},

image

where C is an upper bound on the conformal e-predictor. By the Borel–Cantelli lemma [12, Sect. 2.10] (part (a)), the internal inequality in (5) holds only for finitely many N for a fixed  ǫ >0. This implies (4).

The equation (5) in the proof of Proposition 4 can also be interpreted directly, and its advantage is that it is non-asymptotic. It implies that, for any time horizon  N ≥2,

image

which is another expression for the  enbeing time-wise e-values: their average is approximately bounded above by 1 with high probability (for small  ǫand large N). Even if we do not assume that the conformal e-predictor is bounded, we can still claim, e.g., that

image

The last inequality says that  en ∧ N 1/3are approximate time-wise e-values (assuming that  ǫis small and  N ≫ ǫ−6), and so  enare approximate time-wise e-values if we regard an e-value of  N 1/3as large enough for the difference between  N 1/3and larger e-values to be considered unimportant (such as an e-value of 100 on Jeffreys’s scale [8, Appendix B]).

Many thanks to Aaditya Ramdas for pointing out serious deficiencies in presentation in an earlier version. This research has been partially supported by Astra Zeneca, Stena Line, and Mitie.

[1] Special issue on p-values. American Statistician, 73(Supplement 1), 2019.

[2] Alexander A. Balinsky and Alexander D. Balinsky. Enhancing conformal prediction using e-test statistics. Technical Report arXiv:2403.19082 [cs.LG], arXiv.org e-Print archive, March 2024. To appear in COPA 2024.

[3] Rina Foygel Barber, Emmanuel J. Cand`es, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. Annals of Statistics, 49:486–507, 2021.

[4] Alex Gammerman, Vladimir Vapnik, and Vladimir Vovk. Transduction in pattern recognition, 1997. Manuscript submitted to the Fifteenth International Joint Conference on Artificial Intelligence in January 1997. Extended version published as [5].

[5] Alex Gammerman, Vladimir Vovk, and Vladimir Vapnik. Learning by transduction. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 148–155, San Francisco, CA, 1998. Morgan Kaufmann.

[6] Peter Gr¨unwald, Rianne de Heide, and Wouter M. Koolen. Safe testing. Technical Report arXiv:1906.07801 [math.ST], arXiv.org e-Print archive, March 2023. Journal version is to appear in the Journal of the Royal Statistical Society B (with discussion).

[7] G. H. Hardy, John E. Littlewood, and George P´olya. Inequalities. Cambridge University Press, Cambridge, second edition, 1952.

[8] Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, third edition, 1961.

[9] Henrik Linusson, Ulf Norinder, Henrik Bostr¨om, Ulf Johansson, and Tuve L¨ofstr¨om. On the calibration of aggregated conformal predictors. Proceedings of Machine Learning Research, 60:154–173, 2017.

[10] Aaditya Ramdas, Peter Gr¨unwald, Vladimir Vovk, and Glenn Shafer. Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38:576–601, 2023.

[11] Glenn Shafer. The language of betting as a strategy for statistical and scientific communication (with discussion). Journal of the Royal Statistical Society A, 184:407–478, 2021.

[12] Albert N. Shiryaev. Probability-1. Springer, New York, third edition, 2016.

[13] Albert N. Shiryaev. Probability-2. Springer, New York, third edition, 2019.

[14] Vladimir Vovk. A universal well-calibrated algorithm for on-line classifica-tion. Journal of Machine Learning Research, 5:575–604, 2004.

[15] Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74:9–28, 2015.

[16] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, Cham, second edition, 2022.

[17] Vladimir Vovk and Ruodu Wang. Combining p-values via averaging. Biometrika, 107:791–808, 2020.

[18] Vladimir Vovk and Ruodu Wang. E-values: Calibration, combination, and applications. Annals of Statistics, 49:1736–1754, 2021.


Designed for Accessibility and to further Open Science