GAP Safe screening rules for sparse multi-task and multi-class models

2015·Arxiv

Abstract

Abstract

High dimensional regression benefits from sparsity promoting regularizations. Screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up solvers. When the procedure is proven not to discard features wrongly the rules are said to be safe. In this paper we derive new safe rules for generalized linear models regularized with The rules are based on duality gap computations and spherical safe regions whose diameters converge to zero. This allows to discard safely more variables, in particular for low regularization parameters. The GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed ups on all tested datasets with respect to previous safe rules.

1 Introduction

The computational burden of solving high dimensional regularized regression problem has lead to a vast literature in the last couple of decades to accelerate the algorithmic solvers. With the increasing popularity of -type regularization ranging from the Lasso [18] or group-Lasso [24] to regularized logistic regression and multi-task learning, many algorithmic methods have emerged to solve the associated optimization problems. Although for the simple regularized least square a specific algorithm (e.g., the LARS [8]) can be considered, for more general formulations, penalties, and possibly larger dimension, coordinate descent has proved to be a surprisingly efficient strategy [12].

Our main objective in this work is to propose a technique that can speed-up any solver for such learning problems, and that is particularly well suited for coordinate descent method, thanks to active set strategies.

The safe rules introduced by [9] for generalized regularized problems, is a set of rules that allows to eliminate features whose associated coefficients are proved to be zero at the optimum. Relaxing the safe rule, one can obtain some more speed-up at the price of possible mistakes. Such heuristic strategies, called strong rules [19] reduce the computational cost using an active set strategy, but require difficult post-precessing to check for features possibly wrongly discarded. Another road to speed-up screening method has been the introduction of sequential safe rules [21, 23, 22]. The idea is to improve the screening thanks to the computations done for a previous regularization parameter. This scenario is particularly relevant in machine learning, where one computes solutions over a grid of regularization parameters, so as to select the best one (e.g., to perform cross-validation). Nevertheless, such strategies suffer from the same problem as strong rules, since relevant features can be wrongly disregarded: sequential rules usually rely on theoretical quantities that are not known by the solver, but only approximated. Especially, for such rules to work one needs the exact dual optimal solution from the previous regularization parameter.

Recently, the introduction of safe dynamic rules [6, 5] has opened a promising venue by letting the screening to be done not only at the beginning of the algorithm, but all along the iterations. Following a method introduced for the Lasso [11], we generalize this dynamical safe rule, called GAP Safe rules (because it relies on duality gap computation) to a large class of learning problems with the following benefits:

We introduce our general GAP Safe framework in Section 2. We then specialize it to important machine learning use cases in Section 3. In Section 4 we apply our GAP Safe rules to a multi-task Lasso problem, relevant for brain imaging with magnetoencephalography data, as well as to multinomial logistic regression regularized with norm for joint feature selection.

2 GAP Safe rules

2.1 Model and notations

We denote by rds the set t1, . . . , du for any integer the transpose of a matrix Q. Our observation matrix is represents the number of samples, and q the number of tasks or classes. The design matrix explanatory variables (or features) column-wise, and n observations row-wise. The standard norm is written the unit ball is denoted by (or simply B) and we write center c and radius r. For a matrix B , we denote by the Frobenius norm, and by the associated inner product.

We consider the general optimization problem of minimizing a separable function with a group-Lasso regularization. The parameter to recover is a matrix B , and for any B, while for any -th column. We would like to find

where is a convex function with 1-Lipschitz gradient. So convex with Lipschitz gradient. The function Ω : promoting a few lines of B to be non-zero at a time. The parameter is a non-negative constant controlling the trade-off between data fitting and regularization.

Some elements of convex analysis used in the following are introduced here. For a convex function the Fenchel-Legendre transform, is the function . The sub-differential of a function f at a point x is denoted by Bfpxq. The dual norm of Ω is the norm and reads Ω

Remark 1. For the ease of reading, all groups are weighted with equal strength, but extension of our results to non-equal weights as proposed in the original group-Lasso [24] paper would be straightforward.

2.2 Basic properties

First we recall the associated Fermat’s condition and a dual formulation of the optimization problem:

(see [3, Proposition 26.1] for a more general result) For any convex function

Theorem 2 ([9]). A dual formulation of (1) is given by

where ∆. The primal and dual solutions are linked by

Furthermore, Fermat’s condition reads:

Remark 2. Contrarily to the primal, the dual problem has a unique solution under our assumption on Indeed, the dual function is strongly concave, hence strictly concave.

let us introduce primal/dual link can be written

2.3 Critical parameter: λmax

For large enough the solution of the primal problem is simply 0. Thanks to the Fermat’s rule (2), 0 is optimal if and only if . Thanks to the property of the dual norm Ω, this is equivalent to Ωis the dual norm of Ω. Since , 0 is a primal solution of and only if

This development shows that for , Problem (1) is trivial. So from now on, we will only focus on the case where

2.4 Screening rules description

Safe screening rules rely on a simple consequence of the Fermat’s condition:

Stated in such a way, this relation is useless because is unknown (unless ). However, it is often possible to construct a set , containing it. Then, note that

The so called safe screening rules consist in removing the variable j from the problem whenever the previous test is satisfied, since is then guaranteed to be zero. This property leads to considerable speed-up in practice especially with active sets strategies, see for instance [11] for the Lasso case. A natural goal is to find safe regions as narrow as possible: smaller safe regions can only increase the number of screened out variables. However, complex regions could lead to a computational burden limiting the benefit of screening. Hence, we focus on constructing R satisfying the trade-off:

2.5 Spheres as safe regions

Various shapes have been considered in practice for the set R such as balls (referred to as spheres) [9], domes [11] or more refined sets (see [23] for a survey). Here we consider the so-called “sphere regions” choosing a ball as a safe region. One can easily obtain a control on maxby extending the computation of the support function of a ball [11, Eq. (9)] to the matrix case: max

2.6 GAP Safe rule description

In this section we derive a GAP Safe screening rule extending the one introduced in [11]. For this, we rely on the strong convexity of the dual objective function and on weak duality.

Finding a radius: Remember that is differentiable with a 1-Lipschitz gradient. As a consequence, -strongly convex [14, Theorem 4.2.2, p. 83] and so -strongly concave:

By weak duality and we deduce the following theorem:

Theorem 3.

Provided one knows a dual feasible point Θ , it is possible to construct a safe sphere with radius ˆcentered on Θ. We now only need to build a (relevant) dual point to center such a ball. Results from Section 2.3, ensure that , but it leads to a static rule, a introduced in [9]. We need a dynamic center to improve the screening as the solver proceeds.

Finding a center: Remember that . Now assume that one has a converging algorithm for the primal problem, . Hence, a natural choice for creating a dual feasible point Θis to choose it proportional to , for instance by setting:

A refined method consists in solving the one dimensional problem: arg maxLasso and group-Lasso case [5, 6, 11] such a step is simply a projection on the intersection of a line and the (polytope) dual set and can be computed efficiently. However for logistic regression the computation is more involved, so we have opted for the simpler solution in Equation (10). This still provides converging safe rules (see Proposition 1).

Dynamic GAP Safe rule summarized

We can now state our dynamical GAP Safe rule at the k-th step of an iterative solver:

Dynamic safe screening rules are more efficient than existing methods in practice because they can increase the ability of screening as the algorithm proceeds. Since one has sharper and sharper dual regions available along the iterations, support identification is improved. Provided one relies on a primal converging algorithm, one can show that the dual sequence we propose is converging too.

The convergence of the primal is unaltered by our GAP Safe rule: screening out unnecessary coefficients of Bcan only decrease its distance with its original limits. Moreover, a practical consequence is that one can observe surprising situations where lowering the tolerance of the solver can reduce the computation time. This can happen for sequential setups.

be the current estimate of defined in Eq. (10) be the current estimate of

Note that if the primal sequence is converging to the optimal, our dual sequence is also converging. But we know that the radius of our safe sphere is . By strong duality, this radius converges to 0, hence we have certified that our GAP Safe regions sequence is a converging safe rules (in the sense introduced in [11, Definition 1]).

Remark 4. The active set obtained by our GAP Safe rule (i.e., the indexes of non screened-out variables) converges to the equicorrelation set [20] , allowing us to early identify relevant features (see Proposition 2 in the supplementary material for more details).

3 Special cases of interest

We now specialize our results to relevant supervised learning problems, see also Table 1.

3.1 Lasso

In the Lasso case 1, the parameter is a vector: B meaning that

3.2 ℓ1{ℓ2 multi-task regression

In the multi-task Lasso, which is a special case of group-Lasso, we assume that the observation is . In signal processing, this model is also referred to as Multiple Measurement Vector (MMV) problem. It allows to jointly select the same features for multiple regression tasks [1, 2].

Remark 5. Our framework could encompass easily the case of non-overlapping groups with various size and weights presented in [6]. Since our aim is mostly for multi-task and multinomial applications, we have rather presented a matrix formulation.

3.3 ℓ1 regularized logistic regression

Here, we consider the formulation given in [7, Chapter 3] for the two classes logistic regression. In such a context, one observes for each i P rns a class label . This information can be recast as and it is then customary to minimize (1) where

with B and the penalty is simply the Ω. Let us introduce Nh, the (binary) negative entropy function defined by

Then, one can easily check that

3.4 ℓ1{ℓ2 multinomial logistic regression

We adapt the formulation given in [7, Chapter 3] for the multinomial regression. In such a context, one observes for each i P rns a class label . This information can be recast into a matrix filled by 0’s and 1’s: . In the same spirit as the multi-task Lasso, a matrix B formed by q vectors encoding the hyperplanes for the linear classification. The multinomial regularized regression reads:

with to recover the formulation as in (1). Let us introduce NH, the negative entropy function defined by (still with the convention 0 log

Again, one can easily check that

Remark 6. For multinomial logistic regression, implicitly encodes the additional constraint Θ dimensional simplex, see (14). As 0 and belong to this set, any convex combination of them, such as Θdefined in (10), satisfies this additional constraint.

Remark 7. The intercept has been neglected in our models for simplicity. Our GAP Safe framework can also handle such a feature at the cost of more technical details (by adapting the results from [15] for instance). However, in practice, the intercept can be handled in the present formulation by adding a constant column to the design matrix X. The intercept is then regularized. However, if the constant is set high enough, regularization is small and experiments show that it has little to no impact for high-dimensional problems. This is the strategy used by the Liblinear package [10].

Table 1: Useful ingredients for computing GAP Safe rules. We have used lower case to indicate when the parameters are vectorial (1). The function RowNorm consists in normalizing a (non-negative) matrix row-wise, such that each row sums to one.

4 Experiments

In this section we present results obtained with the GAP Safe rule. Results are on high dimensional data, both dense and sparse. Implementation have been done in Python and Cython for low critical parts. They are based on the multi-task Lasso implementation of Scikit-Learn [17] and coordinate descent logistic regression solver in the Lightning software [4]. In all experiments, the coordinate descent algorithm used follows the pseudo code from [11] with a screening step every 10 iterations.

Note that we have not performed comparison with the sequential screening rule commonly acknowledge as the state-of-the-art “safe” screening rule (such as th EDDP+ [21]), since we can show that this kind of rule is not safe. Indeed, the stopping criterion is based on dual gap accuracy, and comparisons would be unfair since such methods sometimes do not converge to the prescribed accuracy. This is backed-up by a counter example given in the supplementary material. Nevertheless, modifications of such rules, inspired by our GAP Safe rules, can make them safe. However the obtained sequential rules are still outperformed by our dynamic strategies (see Figure 2 for an illustration).

4.1 ℓ1{ℓ2 multi-task regression

To demonstrate the benefit of the GAP Safe screening rule for a multi-task Lasso problem we used neuroimaging data. Electroencephalography (EEG) and magnetoencephalography (MEG) are brain imaging modalities that allow to identify active brain regions. The problem to solve is a multi-task regression problem with squared loss where every task corresponds to a time instant. Using a multi-task Lasso one can constrain the recovered sources to be identical during a short time interval [13]. This corresponds to a temporal stationary assumption. In this experiment we used a joint MEG/EEG data with 301 MEG and 59 EEG sensors leading to 360. The number of possible sources is 494 and the number of time instants 20. With a 1 kHz sampling rate it is equivalent to say that the sources stay the same for 20 ms.

Results are presented in Figure 1. The GAP Safe rule is compared with the dynamic safe rule from [6]. The experimental setup consists in estimating the solutions of the multi-task Lasso problem for 100 values of on a logarithmic grid from . For the experiments on the left a fixed number of iterations from 2 to 2is allowed for each . The fraction of active variables is reported. Figure 1 illustrates that the GAP Safe rule screens out much more variables than the compared method, as well as the converging nature of our safe regions. Indeed, the more iterations performed the more the rule allows to screen variables. On

Figure 1: Experiments on MEG/EEG brain imaging dataset (dense data with 20). On the left: fraction of active variables as a function of and the number of iterations K. The GAP Safe strategy has a much longer range of with (red) small active sets. On the right: Computation time to reach convergence using different screening strategies.

the right, computation time confirms the effective speed-up. Our rule significantly improves the computation time for all duality gap tolerance from 10, especially when accurate estimates are required, e.g., for feature selection.

4.2 ℓ1 binary logistic regression

Results on the Leukemia dataset are reported in Figure 2. We compare the dynamic strategy of GAP Safe to a sequential and non dynamic rule such as Slores [22]. We do not compare to the actual Slores rule as it requires the previous dual optimal solution, which is not available. Slores is indeed not a safe method (see Section B in the supplementary materials). Nevertheless one can observe that dynamic strategies outperform pure sequential one, see Section C in the supplementary material).

Figure 2: regularized binary logistic regression on the Leukemia dataset (n = 72 ; m = 7,129 ; q = 1). Simple sequential and full dynamic screening GAP Safe rules are compared. On the left: fraction of the variables that are active. Each line corresponds to a fixed number of iterations for which the algorithm is run. On the right: computation times needed to solve the logistic regression path to desired accuracy with 100 values of

4.3 ℓ1{ℓ2 multinomial logistic regression

We also applied GAP Safe to an multinomial logistic regression problem on a sparse dataset. Data are bag of words features extracted from the News20 dataset (TF-IDF removing English stop words and words occurring only once or more than 95% of the time). One can observe on Figure 3 the dynamic screening and its benefit as more iterations are performed. GAP Safe leads to a significant speedup: to get a duality gap smaller than 10on the 100 values of , we needed 1,353 s without screening and only 485 s when GAP Safe was activated.

Figure 3: Fraction of the variables that are active for regularized multinomial logistic regression on 3 classes of the News20 dataset (sparse data with n = 2,757 ; m = 13,010 ; q = 3). Computation was run on the best 10% of the features using feature selection [16]. Each line corresponds to a fixed number of iterations for which the algorithm is run.

5 Conclusion

This contribution detailed new safe rules for accelerating algorithms solving generalized linear models regularized with norms. The rules proposed are safe, easy to implement, dynamic and converging, allowing to discard significantly more variables than alternative safe rules. The positive impact in terms of computation time was observed on all tested datasets and demonstrated here on a high dimensional regression task using brain imaging data as well as binary and multiclass classification problems on dense and sparse data. Extensions to other generalized linear model, e.g., Poisson regression, are expected to reach the same conclusion. Future work could investigate optimal screening frequency, determining when the screening has correctly detected the support.

Acknowledgment

We acknowledge the support from Chair Machine Learning for Big Data at T´el´ecom ParisTech and from the Orange/T´el´ecom ParisTech think tank phi-TAB. This work benefited from the support of the ”FMJH Program Gaspard Monge in optimization and operation research”, and from the support to this program from EDF.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, pages 41–48, 2006.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, 2011.

[4] M. Blondel, K. Seki, and K. Uehara. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning, 93(1):31–52, 2013.

[5] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. A dynamic screening principle for the lasso. In EUSIPCO, 2014.

[6] A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. Dynamic Screening: Accelerating First-Order Algorithms for the Lasso and Group-Lasso. IEEE Trans. Signal Process., 63(19):20, 2015.

[7] P. B¨uhlmann and S. van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.

[8] B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407– 499, 2004. With discussion, and a rejoinder by the authors.

[9] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, 2008.

[11] O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In ICML, pages 333–342, 2015.

[12] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.

[13] A. Gramfort, M. Kowalski, and M. H¨am¨al¨ainen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7):1937–1961, 2012.

[14] J.-B. Hiriart-Urruty and C. Lemar´echal. Convex analysis and minimization algorithms. II, volume 306. Springer-Verlag, Berlin, 1993.

[15] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. J. Mach. Learn. Res., 8(8):1519–1555, 2007.

[16] C. D. Manning and H. Sch¨utze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.

[18] R. Tibshirani. Regression shrinkage and selection via the lasso. JRSSB, 58(1):267–288, 1996.

[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. JRSSB, 74(2):245–266, 2012.

[20] R. J. Tibshirani. The lasso problem and uniqueness. Electron. J. Stat., 7:1456–1490, 2013.

[21] J. Wang, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. arXiv preprint arXiv:1211.3966, 2012.

[22] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A safe screening rule for sparse logistic regression. In NIPS, pages 1053–1061, 2014.

[23] Z. J. Xiang, Y. Wang, and P. J. Ramadge. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.

[24] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. JRSSB, 68(1):49–67, 2006.

Supplementary Material A Proofs

A.1 Proof of variable identification

Proposition 2. There exists such that for all is screened out by the GAP Safe rule if and only if

Proof. For simplicity we use the notation for the safe region at step k. Define max0 such that is converging to and lim0, there exists . Hence, for any . Using the triangle inequality, one gets

provided that . Hence, for all implies that j is screened out by the GAP Safe rule thanks to the last inequality. For the reverse inclusion take by construction of our GAP Safe screening rule 1u. This means that the variable j can not be eliminated by our safe rule, and we have shown that in the limit we have exactly identified the equicorrelation set.

A.2 Proof that the GAP Safe rule is converging (Proposition 1)

The second term converges to zero whenever Bsince G is continuous (it is -Lipschitz). For the first term, note that Ω(thanks to the primal/dual link, and that is dual feasible). Then, as G is a Lipschitz function and all norms are equivalent in a finite dimension space, the right hand side converges to zero in the previous inequality, and the results stated follows.

In the second case Θand the proof proceeds as inthe first case.

B EDPP is not safe

In the two last sections, we present a study on the EDDP method [21], a screening rule that relies on the dual optimal point obtained for the previous in the path. Note that the same conclusion would hold true for generalization of the sequential approach given in [22], as well as for any other screening rule that needs exact dual solution at one step. To simplify the reading we use the vectorial (with no capital letters) notation used earlier. In the remainder we consider and a non-increasing sequence of parameters . In practice, we choose the common grid [7][2.12.1]): Wang et al. [21] proposed a sequential screening rule based on properties of the projection onto a convex set. Their rule is based on the exact knowledge of the true optimal solution for the previous parameter. Such a rule can be used to compute ˆis known. However for only known approximately and the rules introduced in [21] are not safe anymore: some active groups may be wrongly disregarded if one does not use the exact value of ˆ

We first first recall the property they proved. Then, we give a counter-example that shows that the rule is indeed not safe. In Section C, we propose to modify their rule in order to make it safe in all cases.

Recall that in this case 1, the parameters are vectors: B

Proposition 3 ([21, Theorem 19]). Assume that , then the dual optimal solution of the groupLasso with parameter

where

and

Note that the rule proposed by [21] (as pointed out in [6]) relies on the exact knowledge of a dual optimal solution for a previously solved Lasso problem. This is impossible to obtain in practice and even if it is possible to find accurate solutions, the search for high accuracy may hinder the benefits of the screening when it was not actually needed. Using inaccurate solutions may lead to discarding variables that should have been active and so the screened optimization algorithm will not converge to a solution of the original problem.

We illustrate this issue on Figure 4. Knowing an approximation to the optimal primal point, returned by the optimization algorithm at the previous regularization parameter , we need to choose an approximation to the optimal dual point to run EDPP.

• If we choose to approximate the dual optimal point by (blue curve with diamonds), then the result is catastrophic. Indeed, at 0 is a valid -solution for screening rule tries to perform a division by 0 when computing

• If we choose to approximate the dual optimal point by , we have a better behavior (purple curve with triangles) but we may still have an algorithm which does not converge to an -solution. Here, for the 13Lasso problem a variable is erroneously removed and the problem can only be solved to accuracy 003162. This may look like a small issue but when the stopping criterion is based on the duality gap, this causes the algorithm to continue until the maximum number of iterations is reached.

Figure 4: EDPP is not safe. We run GAP SAFE and two interpretations of EDPP (described in the main text) to solve the Lasso path on the dataset defined by X and y above with target accuracy 10Lasso problem, we plot the final duality gap returned by the optimization solver.

C Making EDDP screening rule safe

C.1 The simpler screening rule

In the present paper, we give computable guarantees on the distance between the current dual feasible point and the solution of the problem. We show here how we can combine our result with Wang et al. ’s in order to make their screening rule work even with approximate solutions to the previous Lasso problem.

For simplicity, we first consider the initial version of Wang et al. ’s sphere test:

proved in [21, Theorem 7]. As we do not know ˆ, we cannot readily use this ball. However, we can modify it to make it a safe screening rules as follows:

Proposition 4. Assume that is a dual feasible point and is a radius satisfying ˆ

where

and for any

Proof. Start first by noting that (17) implies

Let us denote

then ˆ. We now need to upper bound H. A simple choice is to take defined in Eq. (19) The motivation for such a choice is because it is optimal when 0. This provides the following bound on H:

Hence, after some simplifications:

Remark 8. In the case that 1 then with the definition of and the CauchySchwartz inequality one has that 1 . This means that the multiplicative ratio in front of . In [11, Proposition 3], the bound obtained would only lead to the smaller ratio:

C.2 The complete screening rule (EDDP+)

Let us now consider the EDDP+ screening rule [21] relying on the property (15): ˆ

. Using the same technique as for Proposition 4, we can strengthen our previous proposition with the following result.

Proposition 5. Assume that is a dual feasible point and is a radius satisfying ˆ

Proof. As before, we do not know exactly ˆbut we know that denoting

with

We continue as

Now, reminding that is a 1-Lipschitz function,

where the second inequality comes from the triangle inequality and the Cauchy-Schwartz Inequality, and the

third is obtained by factorizing the difference of squares. Plugging this in the former, we get:

One could check that there exists satisfying ˆ

designed for accessibility and to further open science