Safe Screening With Variational Inequalities and Its Application to LASSO

2013·arXiv

Abstract

1 Introduction

Sparse learning [2, 12] is an effective technique for analyzing high dimensional data. It has been applied successfully in various areas, such as machine learning, signal processing, image processing, medical imaging, and so on. In general, the -regularized sparse learning can be formulated as:

where contains the model coefficients, lossis a loss function defined on the design matrix and the response , and is a positive regularization parameter that balances the tradeoff between the loss function and the regularization. Let denote the i-th sample that corresponds to the transpose of the i-th row of X, and let denote the j-th feature that corresponds to the j-th column of X. We use lossin Lasso [12] and lossin sparse logistic regression [6].Since the optimal is usually unknown in practical applications, we need to solve formulation (1) corresponding to a series of regularization parameter , obtain the solutions , and then select the solution that is optimal in terms of a pre-specified criterion, e.g., Schwarz Bayesian information criterion [11] and cross-validation. The well-known LARS approach [3] can be modified to obtain the full piecewise linear Lasso solution path. Other approaches such as interior point [6], coordinate descent [4] and accelerated gradient descent [8] usually solve formulation (1) corresponding to a series of pre-defined parameters.

The solutions are sparse in that many of their coefficients are zero. Taking advantage of the nature of sparsity, the screening techniques have been proposed for accelerating the computation. Specifically, given a solution at the regularization parameter , if we can identify the features that are guaranteed to have zero coefficients in at the regularization parameter , then the cost for computing can be saved by excluding those inactive features. There are two categories of screening techniques: 1) the safe screening techniques [5, 14, 10, 15] with which our obtained solution is exactly the same as the one obtained by directly solving (1), and 2) the heuristic rule such as the strong rules [13] which can eliminate more features but might mistakenly discard active features.

In this paper, we propose an approach called “Sasvi” (Safe screening with variational inequalities) and take Lasso as an example in the analysis. Sasvi makes use of the variational inequality which provides the sufficient and necessary optimality condition for the dual problem. Several existing approaches such as SAFE [5] and DPP [14] can be casted as relaxed versions of the proposed Sasvi, thus Sasvi provides a stronger screening rule. The monotone properties of Sasvi for Lasso are studied based on which a sure removal regularization parameter can be identified for each feature. Empirical results on both synthetic and real data sets demonstrate the effectiveness of the proposed Sasvi for Lasso screening. Extension of the proposed Sasvi to the generalized sparse linear models such as logistic regression is briefly discussed. Notations Throughout this paper, scalars are denoted by italic letters, and vectors by bold face letters. Let denote the norm, the Euclidean norm, and the infinity norm, respectively. Let denote the inner product between x and y.

2 The Proposed Sasvi

Our proposed approach builds upon an analysis on the following simple problem:

Figure 1: The work flow of the proposed Sasvi. The purpose is to discard the features that can be safely eliminated in computing with the information obtained at .

The dual problem usually can provide a good insight about the problem to be solved. Let denote the dual variable of Eq. (1). In light of Eq. (2), we can show that , the j-th component of the optimal solution to Eq. (1), optimizes

where denotes the j-th feature and denotes the optimal dual variable of Eq. (1). From the results to Eq. (2), we need to ensure that Eq. (3) does not equal to , and we have

where denotes the value of above which the solution to Eq. (1) is zero. Let and be the optimal primal variables corresponding to and , respectively. Let and be the optimal dual variables corresponding to and , respectively. Figure 1 illustrates the work flow of the proposed Sasvi. We firstly derive the dual problem of Eq. (1). Suppose that we have obtained the primal and dual solutions and for a given regularization parameter , and we are interested in solving Eq. (1) with by using Eq. (4) to screen the features to save computational cost. However, the difficulty lies in that, we do not have the dual optimal . To deal with this, we construct a feasible set for , estimate an upper-bound of , and safely remove if this upper-bound is smaller than 1.

The construction of a tight feasible set for is key to the success of the screening technique. If the constructed feasible set is too loose, the estimated upper-bound of is over 1, and thus only a few features can be discarded. In this paper, we propose to construct the feasible set by using the variational inequalities that provide the sufficient and necessary optimality conditions for the dual problems with and . Then, we estimate the upper-bound of in the constructed feasbile set, and discard the j-th feature if the upper-bound is smaller than 1. For discussion convenience, we focus on Lasso in this paper, but the underlying methodology can be extended to the general problem in Eq. (1). Next, we elaborate the three building blocks that are illustrated in the bottom row of Figure 1.

2.1 The Dual Problem of Lasso

We follow the discussion in Section 6 of [9] in deriving the dual problem of Lasso as follows:

A dual variable is introduced in the first equality, and the equivalence can be ver-ified by setting the derivative with regard to to zero, which leads to the following relationship between the optimal primal variable () and the optimal dual variable ():

In obtaining the last equality of Eq. (6), we make use of the results to Eq. (2). The dual problem of Eq. (1) can be formulated as:

For Lasso, the in Eq. (5) can be analytically computed as . In applying Sasvi, we might start with , since the primal and dual optimals can be computed analytically as: and .

2.2 Feasible Set Construction

Given and , we aim at estimating the upper-bound of without the actual computation of . To this end, we construct a feasible set for , and then estimate the upper-bound in the constructed feasible set. To construct the feasible set, we make use of the variational inequality that provides the sufficient and necessary condition of a constrained convex optimization problem.

Lemma 1 [8] For the constrained convex optimization problem:

with G being convex and closed and being convex and differentiable, is an optimal solution of Eq. (9) if and only if

Eq. (10) is the so-called variation inequality for the problem in Eq. (9). Applying Lemma 1 to the Lasso dual problem in Eq. (8), we can represent the optimality conditions for and using the following two variational inequalities:

With Eq. (13) and Eq. (14), we can construct the following feasible set for as:

For an illustration of the feasible set, please refer to Figure 2. Generally speaking, the closer is to , the tighter the feasible set for is. In fact, when approaches to concentrates to a singleton set that only contains . Note that one may use additional ’s in Eq. (12) for improving the estimation of the feasible set of . Next, we discuss how to make use of the feasible set defined in Eq. (15) for estimating an upper-bound for .

2.3 Upper-bound Estimation

Next, we show how to solve Eq. (16). For discussion convenience, we introduce the following three variables:

where a denotes the prediction based on scaled by , and b is the summation of a and the change of the inputs to the dual problem in Eq. (8) from to .

Figure 2 illustrates a and b by lines EB and EC, respectively. For the triangle EBC, the following theorem shows that the angle between a and b is acute.

Theorem 1 Let , and . We have

and if and only if . In addition, if , then .

The proof of Theorem 1 is given in Supplement A. With the notations in Eq. (17), Eq. (16) can be rewritten as

The objective function of Eq. (19) can be represented by half of the following form:

which indicates that Eq. (19) can be computed by maximizing and over the feasible set in the same equation. Maximizing and can be computed by minimizing and , which can be solved by the following minimization problem:

We assume that x is a non-zero vector. Let

which are the orthogonal projections of , and y onto the null space of a, respectively. Our next theorem says that Eq. (20) admits a closed form solution.

Theorem 2 Let and . Eq. (20) equals to , if , and otherwise.

The proof of Theorem 2 is given in Supplement B. With Theorem 2, we can obtain the upper-bound of in the following theorem.

Figure 2: Illustration of the feasible set used in Sasvi and Theorem 3. The points in the figure are explained as follows. E: , B: , C: , D:. The left hand side of the dash line represents the half space , and the ball centered at D with radius ED represents . For Theorem 3, EX, EX, EXand EXdenote in two subcases: 1) the angle between EB and EX(EX) is larger than the angle between EB and EC, and 2) the angle between EB and EX(EX) is smaller than the angle between EB and EC. R(R) is the maximizer to Eq. (16) with EX(EX) denoting . With EX(EX) denoting , the maximizer to Eq. (16) is on the intersection between the dashed line and the ball centered at D with radius ED.

Theorem 3 Let , and . Denote

The proof of Theorem 3 is given in Supplement C. An illustration of Theorem 3 for different cases can be found in Figure 2. It follows from Eq. (4) that, if and , then the j-th feature can be safely eliminated for the computation of . We provide the following analysis to the established upper-bound. Firstly, we have

which attributes to the fact that . Secondly, in the extreme case that is orthogonal to the scaled prediction which is nonzero, Theorem 3 leads to and . Thus, the j-th feature can be safely removed for any positive that is smaller than so long as 1. Thirdly, in the case that has low correlation with the prediction , Theorem 3 indicates that the j-th feature is very likely to be safely removed for a wide range of if . The monotone properties of the upper-bound established in Theorem 3 is given Section 4.

3 Comparison with Existing Approaches

Our proposed Sasvi differs from the existing screening techniques [5, 13, 14, 15] in the construction of the feasible set for .

3.1 Comparison with the Strong Rule

The strong rule [13] works on and makes use of the assumption

from which we can obtain an estimated upper-bound for as:

A comparison between Eq. (31) and the upper-bound established in Theorem 3 shows that, 1) both are dependent on , the inner product between the j-th feature and the dual variable obtained at , but note that , 2) in comparison with the data independent term used in the strong rule, Sasvi utilizes a data dependent term as shown in Eqs. (26)-(29). We note that, 1) when a feature has low correlation with the prediction , the upper-bound for estimated by Sasvi might be lower than the one by the strong rule 2, and 2) as pointed out in [13], Eq. (30) might not always hold, and the same applies to Eq. (31).

Next, we compare Sasvi with the SAFE approach [5] and the DPP approach [14], and the differences in terms of the feasible sets are shown in Figure 3.

3.2 Comparison with the SAFE approach

Denote . The SAFE approach makes use of the so-called “dual” scaling, and compute the upper-bound of the for as

Note that, compared to the SAFE paper, the dual variable has been scaled in the formulation in Eq. (32), but this scaling does not influence of the following result for the SAFE approach. Denote as the optimal solution. Solving Eq. (32), we have when . The SAFE approach computes the upper-bound for as follows:

Figure 3: Comparison of Sasvi with existing safe screening approaches. The points in the figure are as follows. A: , B: , C: , D: the middle point of C and E, E: , F: , and G: . The feasible set for used by the proposed Sasvi approach is the intersection between the ball centered at D with radius being half EC and the closed half space passing through E and containing the constraint of the dual of Lasso. The feasible set for used by the SAFE [5] approach is the ball centered at C with radius being the smallest distance from C to the points in the line segment EG. The feasible set for used by the DPP [14] approach is the ball centered at E with radius BC.

Next, we show that the feasible set for used in Eq. (33) can be derived from the variational inequality in Eq. (12) followed by relaxations.

which is the feasible set used in Eq. (33). Note that, the ball defined by Eq. (37) has higher volume than the one defined by Eq. (34) due to the relaxation used in Eq. (36), and it can be shown that the ball defined by Eq. (34) lies within the ball defined by Eq. (37).

3.3 Comparison with the DPP approach

where Eq. (39) is a result of adding Eq. (13) and Eq. (14). Therefore, although the authors in [14] motivates the DPP approach from the viewpoint of Euclidean projection, the DPP approach can indeed be treated as generating the feasible set for using the variational inequality in Eq. (11) and Eq. (12) followed by relaxation in Eq. (40). Note that, the ball specified by Eq. (38) has higher volume than the one specified by Eq. (39) due to the relaxation used in Eq. (40), and it can be shown that the ball defined by Eq. (39) lies within the ball defined by Eq. (38).

4 Feature Sure Removal Parameter

In this subsection, we study the monotone properties of the upper-bound established in Theorem 3 with regard to the regularization parameter . With such study, we can identify the feature sure removal parameter—the smallest value of above which a feature is guaranteed to have zero coefficient and thus can be safely removed.

Without loss of generality, we assume and the results can be easily extended to the case . In addition, we assume that if then . This is a valid assumption for real data.

Let , and . We introduce the following two auxiliary functions:

We show in Supplement D that is strictly increasing with regard to in and is strictly decreasing with regard to in . Such monotone properties, which are illustrated geometrically in the first plot of Figure 4, guarantee that

and have unique roots with regard to when some conditions are satisfied.

Our main results are summarized in the following theorem:

Theorem 4 Let and . Let . Assume that if then .

Define as follows: If , then let ; otherwise, let be the unique value in that satisfies .

Define as follows: If a = 0 or if and , then let ; otherwise, let be the unique value in that satisfies .

1. is monotonically decreasing with regard to in .

2. If , then is monotonically decreasing with regard to in .

3. If , then is monotonically decreasing with regard to in and , but monotonically increasing with regard to in .

The proof of Theorem 4 is given in Supplement D. Note that, and are dependent on the index j, which is omitted for discussion convenience. Figure 4 illustrates results presented in Theorem 4. The first two cases of Theorem 4 indicate that,

Figure 4: Illustration of the monotone properties of Sasvi for Lasso with the assumption . The first plot geometrically shows the monotone properties of and , respectively. The last three plots correspond to the three cases in Theorem 4. For illustration convenience, the x-axis denotes rather than .

if the j-th feature can be safely removed for a regularization parameter , then it can also be safely discarded for any regularization parameter larger than . However, the third case in Theorem 4 says that this is not always true. This somehow coincides with the characteristic of Lasso that, a feature that is inactive for a regularization parameter might become active for a larger regularization parameter . In other words, when following the Lasso solution path with a decreasing regularization parameter, a feature that enters into the model might get removed.

By using Theorem 4, we can easily identify for each feature a sure removable parameter that satisfies and . Note that Theorem 4 assumes , but it can be easily extended to the case by replacing with .

5 Experiment

In this section, we conduct experiments to evaluate the performance of the proposed Sasvi in comparison with the sequential SAFE rule [5], the sequential strong rule [13], and the sequential DPP [14]. Note that, SAFE, Sasvi and DPP methods are “safe” in

Table 1: Running time (in seconds) for solving the Lasso problems along a sequence of 100 tuning parameter values equally spaced on the scale of from 0.05 to 1 by the solver [7] without screening, and the solver combined with different screening methods.

the sense that the discarded features are guaranteed to have 0 coefficients in the true solution, and the strong rule—which is a heuristic rule—might make error and such error was corrected by a KKT condition check as suggested in [13].

Synthetic Data Set We follow [1, 16, 12] in simulating the data as follows:

where X has entries. Similar to [1, 16, 12], we set the pairwise correlation between the i-th feature and the j-th feature to and draw X from a Gaussian distribution. In constructing the ground truth , we set the number of non-zero components to and randomly assign the values from a uniform distribution. We set and generate the response vector using Eq. (43). For the value of , we try 100, 1000, and 5000.

PIE Face Image Data Set The PIE face image data set used in this experiment 4 contains 11554 gray face images of 68 people, taken under different poses, illumination conditions and expressions. Each of the images has pixels. To use the regression model in Eq. (43), we first randomly pick up an image as the response , and then set the remaining images as the data matrix .

MNIST Handwritten Digit Data Set This data set contains grey images of scanned handwritten digits, including 60, 000 for training and 10, 000 for testing. The dimension of each image is . To use the regression model in Eq. (43), we first randomly select 5000 images for each digit from the training set (and in total we have 50000 images) and get a data matrix , and then we randomly select an image from the testing set and treat it as the response vector .

Experimental Settings For the Lasso solver, we make use of the SLEP package [7]. For a given generated data set (X and y), we run the solver with or without screening rules to solve the Lasso problems along a sequence of 100 parameter values equally spaced on the scale from 0.05 to 1.0. The reported results are averaged over 100 trials of randomly drawn X and y.

Results Table 1 reports the running time by different screening rules, and Figure 5 presents the corresponding rejection ratios—the ratios of the number features screened out by the screening approaches. It can be observed that the propose Sasvi significantly outperforms the safe screening rules such as SAFE and DPP. The reason is that, Sasvi

Figure 5: The rejectioin ratios—the ratios of the number features screened out by SAFE, DPP, the strong rule and Sasvi on synthetic and real data sets.

is able to discard more inactive features as discussed in Section 3. In addition, the rejection ratios of the strong rule and Sasvi are comparable, and both of them are more effective in discarding inactive features than SAFE and DPP. In terms of the speedup, Sasvi provides better performance than the strong rule. The reason is that the strong rule is a heuristic screening method, i.e., it may mistakenly discard active features which have nonzero components in the solution, and thus the strong rule needs to check the KKT conditions to make correction if necessary to ensure the correctness of the result. In contrast, Sasvi does not need to check the KKT conditions or make correction since the discarded features are guaranteed to be absent from the resulting sparse representation.

6 Conclusion

The safe screening is a technique for improving the computational efficiency by eliminating the inactive features in sparse learning algorithms. In this paper, we propose a novel approach called Sasvi (Safe screening with variational inequalities). The proposed Sasvi has three modules: dual problem derivation, feasible set construction, and upper-bound estimation. The key contribution of the proposed Sasvi is the usage of the variational inequality which provides the sufficient and necessary optimality conditions for the dual problem. Several existing approaches can be casted as relaxed versions of the proposed Sasvi, and thus Sasvi provides a stronger screening rule. The monotone properties of the established upper-bound are studied based on a sure removal regularization parameter which can be identified for each feature.

The proposed Sasvi can be extended to solve the generalized sparse linear models, by filling in Figure 1 with the three key modules. For example, the sparse logistic regression can be written as

We can derive its dual problem as

According to Lemma 1, for the dual optimal , the optimality condition via the varia- tional inequality is

Then, we can construct the feasible set for at the regularization parameter in a similar way to the in Eq. (15). Finally, we can estimate the upper-bound of by Eq. (16), and discard the j-th feature if such upper-bound is smaller than 1. Note that, compared to the Lasso case, Eq. (16) is much more challenging

for the logistic loss case. We plan to replace the feasible set by its quadratic approximation so that Eq. (16) has an easy solution. We also plan to apply the proposed Sasvi to solving the Lasso solution path using LARS.

References

[1] H. Bondell and B. Reich. Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics, 64:115–123, 2008.

[2] E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25:21–30, 2008.

[3] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

[4] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[5] L. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse super- vised learning. Pacific Journal of Optimization, 8:667–698, 2012.

[6] K. Koh, S. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519– 1555, 2007.

[7] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.

[8] Y. Nesterov. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., 2004.

[9] Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140:125–161, 2013.

[10] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning, 2013.

[11] G. Schwarz. estimating the dimension of a model. Annals of Statistics, 6:461– 464, 1978.

[12] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

[13] R. Tibshirani, J. Bien, J. H. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B, 74:245–266, 2012.

[14] J. Wang, B. Lin, P. Gong, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, 2013.

[15] J. X. Zhen, X. Hao, and J. R. Peter. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Systems, 2011.

[16] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67:301–320, 2005.

Supplementary Material

A Proof of Theorem 1

We begin with three technical lemmas.

Lemma 2 Let and . We have

Proof Since the Euclidean projection of onto is , it follows from Lemma 1 that

As , we have Eq. (45).

Lemma 3 Let and . If parallels to y in that it can be written as for some , then .

Lemma 4 Let . If , we have

where the equality holds if and only if .

Proof We have

where the equality holds if and only if . Incorporating Eq. (45) in Lemma 2 and Eq. (49), we have Eq. (48). The equality in Eq. (49) holds if and only if . According to Lemma 3, if , then , which leads to . This ends the proof.

It follows from Lemma 4 that 1) and the equality holds if and only if, and 2) , which leads to . According to Lemma 3, if parallels to y, then . Therefore, if , then and

B Proof of Theorem 2

If , the primal and dual optimals can be analytically computed as: and . Thus, we have a = 0. It is easy to get that minimizes Eq. (20) with the minimum function value being

where are introduced for the two inequalities, respectively. It is clear that the minimal value of Eq. (20) is lower bounded (the minimum is no less than by only considering the constraint ). Therefore, the optimal dual variable is always positive; otherwise, minimizing Eq. (53) with regard to r achieves .

Setting the derivative with regard to r to zero, we have

Plugging Eq. (54) into Eq. (53), we obtain the dual problem of Eq. (20) as:

For a given , we have

We consider two cases. In the first case, we assume that . We have

By using the complementary slackness condition (note that the optimal does not equal to zero), we have

Thus, we have

Incorporating Eq. (57) and Eq. (59), we have

so that the angle between a and b is equal to or larger than the angle between x and a. Note that according to Theorem 1. In Figure 2, EXand EXillustrate the case that x satisfies Eq. (60), while EXand EXshow the opposite cases. In addition, we have

In the second case, Eq. (60) does not hold. We have

Plugging Eq. (62) into Eq. (54), we have

where we have used Eq. (21) to get the second equality. In addition, we have

In summary, Eq. (20) equals to , if , and

C Proof of Theorem 3

We prove the four cases one by one as follows.

Case 1 If and , i.e., Eq. (60) does not hold with . We have

The second equality plugs in the notations in Eq. (17). The fifth equality utilizes Eq. (65) which is the result for the case by setting . To get the last equality, we utlize the following two equalities

and

which can be derived from Eq. (17). It follows from Eq. (22) and Eq. (23) that

Incorporating Eq. (66), and Eqs. (70)-(71), we have Eq. (26). Following a similar

derivation, we have

The fifth equality utilizes Eq. (65) which is the result for the case

by setting . The last equality can be obtained using the similar derivation getting the last equality of Eq. (66). Incorporating Eqs. (70)-(72), we have Eq. (27).

Case 2 If and , we have since according to Theorem 1. Thus, Eq. (60) does not hold with , and we can get Eq. (66), or equivalently Eq. (26). In addition, Eq. (60) holds with , and we have

To get the fifth equality, we utilize Eq. (61) with . Therefore, we have Eq. (28). Case 3 If and , Eq. (60) holds with , and we have

where the fifth equality utilizes Eq. (61) with . Therefore, we have Eq. (29). In addition, we have since according to Theorem 1 and . Thus, Eq. (60) does not hold with , and we can get Eq. (72), or equivalently Eq. (27).

Case 4 If a = 0, then we have according to Theorem 1. Therefore,

To get the last equality, we utilize Eq. (52) with . Therefore, we have Eq. (74).

Similarly,

To get the last equality, we utilize Eq. (52) with . Therefore, we have Eq. (73). This ends the proof of this theorem.

D Proof of Theorem 4

We begin with a technical lemma. For a geometrical illustration of this lemma, please refer to the first plot of Figure 4.

Lemma 5 Let , and . Suppose that . For the two auxiliary functions defined in Eq. (41) and Eq. (42), is strictly increasing with regard to in is strictly decreasing with regard to in .

The derivative of with regard to can be computed as

For any if and only if a parallels to y. It follows the definition of a in Eq. (17) that, if a parallels to y, then parallels y. According to Lemma 3, we have , which contradicts to the assumption . Therefore, is strictly decreasing , and is strictly increasing with regard to in . Following a similar proof, we can show that is strictly decreasing with regard to in

Now, are ready to prove Theorem 4. Firstly, we summarize the and in unified equations.

Since satisfies Eq. (26) if , and Eq. (29) otherwise. Thus, we have

Case 1 When a = 0, we have , b =

Eq. (79) can be simplified as:

The first equality plugs in the definition of and y in Eq. (22) and Eq. (23). The second equality plugs in , makes use of Eq. (17), and utilizes Eq. (67). The last equality further makes use of . The established equality says that is continuous at the that satisfies .

Denote , and write . Thus, can be rewritten as

The first and second derivatives of with regard to can be computed as: we have

Therefore, we have

• If , i.e., when the angle between y and a is no larger than the angle between y and , then , and is monotonically decreasing with regard to in . In this case, the and satisfies .

• If , let . Then, 1) , 2) , and . Therefore, is monotonically decreas- ing with regard to in , and monotonically increasing with regard to in . In this case, the and satisfies .

This ends the proof of this theorem.

Designed for Accessibility and to further Open Science