First Order Methods take Exponential Time to Converge to Global Minimizers of Non-Convex Functions

2020·Arxiv

Abstract

Abstract

Machine learning algorithms typically perform optimization over a class of non-convex functions. In this work, we provide bounds on the fundamental hardness of identifying the global minimizer of a non convex function. Specifically, we design a family of parametrized non-convex functions and employ statistical lower bounds for parameter estimation. We show that the parameter estimation problem is equivalent to the problem of function identification in the given family. We then claim that non convex optimization is at least as hard as function identifica-tion. Jointly, we prove that any first order method can take exponential time to converge to a global minimizer.

1 Introduction

Deep learning algorithms generally employ first order optimization techniques that have convergence rates established only for convex functions. However, the function classes that they model happen to be non convex in nature. Non convex functions have epigraphs that are not convex sets. In other words, they do not exhibit the characteristic of having a single local minimum that also serves as its global minimum. Instead, they may have multiple local and multiple global minima. Although identifying global minimizers is not the goal of learning algorithms, it has motivated recent work on better understanding non convex optimization. Specifically, there has been emphasis on saddle points and local minimizers as understanding the geometry around such points may enable better algorithms to scan the search space. While this line of work attempts to provide guarantees under specific conditions, it is difficult in most machine learning scenarios to apriori understand which of these conditions are satisfied. In this work, we take an orthogonal approach to understanding the fundamental hardness of spanning the search space using first order methods, which we refer to as identifying the global minimizer.

The difficulty in optimizing non convex functions with first order methods can be intuitively viewed as the difference between topology and geometry. While the gradients and the function value at a point provide information on the geometry, information on the topology is much harder to estimate. However, finding a suitable hypothesis space during learning involves spanning the topology. It was first shown that gradient descent converges to minimizers if the strict saddle property is satisfied [6]. Subsequent work has shown that gradient descent with a constant step size can take exponential time to escape saddle points [2]. On the other hand, perturbed gradient descent is capable of escaping these saddle points under some technical assumptions including smoothness, Lipschitz Hessian, strict saddle points, among others [4].

In contrast, we take an algorithm agnostic approach by considering stochastic first order oracles. Such an approach provides an overarching framework that incorporates any possible first order algorithm. The algorithm could be as simple as gradient descent that decides its next point just based on the gradient at its previous point or as complex as using the gradient information from all the points that the algorithm has traversed. Previous work on evaluating lower bounds for stochastic convex optimization has employed a similar framework [1].

In this work, we formulate a parametrized sub-

class of non convex functions. The parametrization defines the position and the depth of the minimizers of each function. We employ classical statistical minimax bounds to the parameter estimation. Next, we claim that non convex optimization, also viewed as a set reconstruction problem, over the subclass is at least as hard as the parameter estimation. These naturally boil down to prove that it can take exponential time for first order methods to converge to any global minimizer of a non convex function.

2 Preliminaries

In this section, we describe the mapping of non convex optimization to an oracle framework. We define the oracle, its properties and the error for any optimization method M on a function f at any time step or query to the oracle. We also provide a brief discussion on the general minimax risk framework before concluding with an intuitive picture of the function class of interest. For the convenience of the reader, we present an overview of our notations.

2.1 Notation

We use S to denote a set of d-dimensional points, each denoted by to denote the ]. For any two distributions P and Q, we represent the Kullback-Leibler (KL) divergence between the distributions by KL(P||Q). We use I to denote the Iverson function, i.e., I(A) refers to a random variable that takes a 0-1 value conditioned on set A.

2.2 Stochastic first order oracles

A (stochastic) first order oracle, when queried at a point in its domain, returns the (noisy) function value and the (noisy) gradient at that point. We define general first order oracles represented as below.

Definition 2.1. Given a function class F on a domain D, the class of first-order stochastic oracles consists of random mappings of the form ()) such that

Note that stochastic first order oracles that satisfy the conditions given by Equation 1 return unbiased estimates of both the function value and the gradient while Equation 2 controls the variance of the oracle answers. In this work, the algorithm uses the information obtained from oracle answers to the queries to internally reconstruct a set S of minimizers of a function within the subclass.

In the context of non-convex optimization, the algorithm queries the oracle at every time step at a single point in the domain D. The algorithm is allowed the flexibility to use all of the previous information that it has obtained from the oracle, i.e., to decide its next point use Y to represent all the information that the algorithm has queried from the oracle. We note, however, that the indexing with time merely provides an intuitive optimization understanding. An alternate and a more general view of this setting is the maximum information that can be sourced from the oracle in a budget of T queries. The points at which these T queries are made can be completely arbitrary and not necessarily sequential.

2.3 Non convex optimization in the oracle model

Non convex optimization is the task of retrieving the global minimizer of a non convex function g over its domain D. In general, any non convex function g has multiple local minimizers. In other words, non convex optimization retrieves assuming such a global minimum exists. This is typically done with the help of an optimization method M that involves iterative sampling from the domain based on information from the previous samples. Common first order optimization algorithms such as stochastic gradient descent use only the information from the previous sample and the gradient at the previous sample to determine the next sample. However, formulating the non-convex optimization problem in an oracle framework aids in the assumption of an algorithm agnostic approach that represents a much more general scenario. It captures any first order algorithm that can be provided with the flex-ibility to choose the next sample using information from all or some of the previous (noisy) samples and (noisy) gradients in an arbitrary manner as desired.

We cast non convex optimization as a set reconstruction problem in the oracle model. Intuitively, this is because identifying the global minimizer requires the identification of all possible minimizers of the function. We explicitly discuss the reason in Section 3.1. The algorithm queries the oracle for the function values and the gradients at various points and internally reconstructs using the information . The elements of the set viewed as the best guess of the possible minimizers of a function in the function class by the algorithm. At any time T or after T queries, we define the error of the optimization algorithm

Thus, the error term helps evaluate the quality of the algorithm’s estimate of the possible minimizers. For oracles that are stochastic, the error in itself is a random variable. In this case, the expectation of the error term over the oracle’s randomness given by )] is considered.

Next, we discuss the minimax framework which is commonly used for studying the optimality of algorithms in statistics and machine learning [9, 12]. Consider a family of distributions P over a sample space A and a function ) be the parameter of the distribution In the minimax framework, the goal is to estimate the parameter ) from a set of m observations drawn from the (unknown) distribution pre-metric to evaluate the quality of an estimator ˆWe note that defining in a suitable way allows for studying different problems in the minimax framework. The minimax risk is defined as

where we take the supremum (worst case) over the distributions [10] and infimum (best) over all estimators ˆ

Finally, we map the minimax risk to our setting wherein, given a family of functions F defined over a domain D is optimized using a class of optimization methods M based on a budget of T queries, we define the following minimax risk

where the supremum is over the family of functions and the infimum over the optimization method. In essence, we are interested in bounding limits on the performance of the best possible algorithm on the hardest possible function in the class. We describe the general function class of interest below followed by the construction of a difficult subclass of functions in the next section.

2.4 Function class of interest

We paint an intuitive picture of the general function class before formally defining a subclass in the following section. Let the space of optimization be d-dimensional. Consider that there exists a function class consisting of 2functions which have minimizers present in the dimensions dictated by an element (a set) that belongs to thee power set of d elements. In other words, the identification of the minimizers of a specific function requires the identification of its corresponding element in the power set. If we are able to define such a function class, we are able to embed an exponential number of functions unique in their set of minimizers in a d-dimensional space. In the formal definition, we extend this to embed a super exponential number of minimizers. We represent the broad class of non-convex functions by F. In order to avoid issues with subsets, we formulate an equivalent problem of estimating unique minimizers. Further, we modulate the depth of the function at its minimizers using a random vector the presence of this random vector that modulates the depth of the function at any minimizer, it is not possible for the algorithm to zero in on a global minimizer until all the minimizers of the function have been identified.

3 Problem Formulation

In this section, we formally introduce the problem by means of a paramtrized subclass of functions. Our information-theoretic result relies on the construction of a restricted class of functions. The use of restricted ensembles is customary for information-theoretic lower bounds [8, 11, 5]. We first show that if a certain set S closely reconstructs the minimizers of a function in the subclass, there is no other function in the subclass that the same set can minimize. We then use Fano’s inequality to obtain a lower bound on the parameter estimation. Parameter estimation is equivalent to function identification. We further define a hypothesis test to show that non convex optimization is at least as hard as the function identification problem.

3.1 Constructing a difficult subclass of functions

To construct the desired subclass of functions with exponential number of minimizers, consider Z = The number of unique sets generally referred to as the cardinality of Z. The subclass of functions G are designed have their minima at , the exact permutation decided by any

where ∆denotes the Hamming metric. As the minimizers are defined by elements of set V, it parametrizes the subclass of functions. Thus, for every , there exists a function in the subclass. From a classical fact, we have cardinality

In addition, we assume that a set of random vectors Θ is sampled. The cardinality of Θ is 2each element Θ is a vector of 2random variables that is associated with a function in the subclass. Each element of the vector is sampled from the distribution (1] represents the uniform distribution in the domain [0, 1]. Once Θ has been sampled, each function is conditioned on a 2dimensional vector to characterize the depth of the function at a particular minimizer. As is a random vector, the algorithm can never be sure that it has identified the global minimizer until it has identified all minimizers. Importantly, we note that the function subclass is conditioned on the set Θ. The function subclass is given by

where,

The function class can be visualized as one with inverted pyramids but separated by a constant c such that on summing, the peaks still remain maintaining the non convex characteristic.

3.2 Optimizing well is equivalent to function identification

We claim that finding the global optimizer is equivalent to the identification of all minimizers. This can be followed from the fact that is a random vector and the algorithm can never be sure there is not a lower minima until all of them have been identified. We note that for a given that is parametrized by is completely defined and is equivalent to the identification of . Thus, we show that retrieving the global optima is at least as hard as reconstruction of the set . If the method M is able to optimize over the function class a desired tolerance, then it is capable of identifying the function . In order to show this, we define a discrepancy measure according to which the closeness of two functions in the subclass G is measured. Consider two functions and for any set S, we define

The discrepancy measure negative and symmetric in its arguments. This pre-metric could be intuitively viewed as the tional norm over the set of interest which seems natural in the context of non convex optimization. Using ), we define the minimum discrepancy measure between any two functions in the subclass G by

In the context of a predefined subset of functions G, we shall denote Ψ(). We now explicitly show that optimization up to a certain tolerance implies the identification of a specific function in the subclass.

Lemma 3.1. For any set S such that the for any , there can be at most one function

Thus, if we have a set S that minimizes a function up to a certain tolerance (Ψ(cannot approximately minimize any other function in the subclass G.

Proof. From Equations 7 and 8 and given that there is an we have for any

Rearranging, we have,

Consider an optimization method makes a set of queries to the oracle. The method is now allowed to make use of this information in any arbitrary manner before it makes the next query. In addition, after T queries, the algorithm reconstructs a set from the information that it has obtained from queries to the oracle. Assuming that an optimization algorithm is able to reconstruct a set that minimizes expected error up to a certain tolerance (9), we can then define a hypothesis to identify correctly at least 2/3 of the time.

Lemma 3.2. Suppose that an algorithm M, with access to obtains an expected error satisfying

then one is able to construct a hypothesis test 1

Proof. We build an estimator of the true as . If there exists an

then we assign . In any other case, we pick at random. From the above hypothesis, we have that

where the upper bound is by the Markov’s inequality under the expected error assumption. As this is true for any , maximizing over completes the proof.

Thus, if the optimization method is able to obtain a small enough error, we are able to identify the correct function most of the time.

3.3 Oracle answers and coin tossing

We now show that for the defined function subclass, the stochastic first order oracle answers ( can be viewed as coin tosses. Specifically, we associate the scaling factor for each z to the bias of a coin. Thus, the information obtained from oracle answers can be interpreted as information obtained from flips of 2coins, with each coin having a bias from the set

To gain a better picture, we represent the oracle decision process in terms of coin tosses with the following algorithm

1. Pick indices between without replacement. Lets represent the set of sponding to these indices by

2. Draw for each of these indices according to a Bernoulli distribution with parameter 1

3. For a given input x, return the value and a sub-gradient function based on the outcomes of the coin tosses

We observe easily that is satisfied as required by Definition 2.1. As the expectation is over the randomness of the oracle, this holds for the sub-gradients as well.

3.4 Lower bounds on coin tossing

Given that the oracle answers can be viewed as coin tosses, estimating the lower bounds on correctly predicting from the information obtained from the oracle boils down to probability of success in estimating a binary vector from coin tosses.

We note that the subclass of function are parametrized by the finite set V. However, note that the conditioning of the function subclass on the random vector the depth at the minimizers of a function. In order to link the conditioning with Fano’s, we first state a result that treats this conditioned random variable as a latent variable –one that is not observed – and provides a bound on information obtained in this conditioned scenario over any arbitrary distribution of Θ [3]. The result stems from extending Fano’s inequality to a slightly more general scenario as shown in Figure 1. Adopting it to our setting involves estimating from answers returned by the oracle

Figure 1: The graphical model depicting our the esti- mation problem, represented as a Bayesian network

be random variables and let be any estimator of obtained from samples Y . If the random variables dependent, then

Thus, it is sufficient for us to show the RHS, that is, Fano’s inequality in the conditioned case with the supremum applied over while estimating the mutual information which we derive below. We note that the algorithm uses Y to internally reconstruct the set . Thus, by evaluating the mutual information between and upper bounding the same, this upper bounds the information from Y that the algorithm could use to estimate the its set reconstruction . In other words, this bounds the best possible by any algorithm. We note that and are equivalent.

Lemma 3.4. An estimate of a Bernoulli parameter vector chosen uniformly at random from the packing set V that is obtained from the outcome of coins chosen uniformly at random at each iteration

where the probability is taken over coin toss randomness of the oracle and the uniform randomness over the choice of

Proof. We introduce some notation to represent the coins chosen and the set of oracle answers obtained. For to represent the subset of coins chosen at each iteration, to represent the outcome of the coin at time instant to be a vector of dimension equal to that of

The core of the proof deals with the estimation of the mutual information between the information obtained for the queried sequence from the oracle and the true parameter of interest ). The rest of the proof follows directly from Fano’s inequality. It is important to note that the queried sequence is conditioned on We note that although the sequence could be indexed using time as in an optimization scenario, we derive bounds with independence across the elements of the sequence, which is the worst case. Thus, these points could be obtained in any arbitrary fashion based on the algorithm with no sequential requirement to be satisfied. Thus, we have

Thus, it is sufficient to show that to bound the estimate of the mutual information between the sequence and . By the chain rule for mutual information, we have

Due to independence between the sampling process of , the second term can rewrite the first term from the definition of conditional mutual information as

As is uniformly distributed over V and from the convexity of KL Divergence, we have [13]

For any pair , the summation of the KL Divergence can be bounded by the KL divergence between independent pairs of Bernoulli variates with parameters being 1we denote the KL Divergence between a single pair of Bernoulli variates with parameters 1and 1

From Lemma 3.3, we are required to take a sup over the conditioning or the latent variable, here Thus, we have

Thus, as long as 4, from the bound , we have

Following the proof backwards leads to the desired upper bound completing the proof.

4 Main theorem and Proof

In this section, we provide our main theorem stating that first order methods take exponential time to span the search space to identify a global minimizer and its proof.

Theorem 4.1. There exists an universal constant such that any first order method provided with information from oracle answers to optimize over the function class satisfies the following lower bound

Proof. We consider an oracle that reveals information based on coin tosses with respect to the point with which it has been queried. From Lemma 3.4, we have the lower bound

The application of the upper bound from Lemma 3.2 requires the expected error to satisfy the conditions under which the upper bound holds, that is, the expected error is required to be within a certain tolerance. In order to evaluate the tolerance, we derive Ψ() as follows.

First, we compute inf ) which is achieved at the global minima given by,

(2

At any such point that minimizes in the summation apart from the one at which the minimum occurs contribute (1while from the z where the minimum occurs, we have a contribution of (1. Thus, we have

inf

We note that and Ψ are defined over sets S, which contain the algorithm’s estimates of the minima of a function in the function class G. The error is computed as a sum over all the elements in the set S. In other words, any point in the set that is incorrectly identified as a minimizer adds to the error and correspondingly to the discrepancy measure.

Let us consider two functions set S. From Equation 7 and considering for a specific

Note from Equation 5 that g has a summation over

Lets consider two functions for a spe-cific . Then we have,

+ (1

In order to better parse the summation, we denote

2

(1 (1 (1 + 2(1 + 22

Putting these together lead to multiple cases which we will explore in detail. Note that we are interested in lower bounding Ψ(). Since we do not have control over the selection of develop a lower bound for the worst case. We refer to “common minimizer” and “unique minimizer” loosely for intuitive understanding. These could be viewed as a minimizer present at that particular x for both the functions but the function value may be different due to its dependence on

1) two possible cases with are satisfied. Considering infcontribution of ) from this z and from all other The resulting sum would cancel out infleading to

2) two possible cases with are satisfied. Considering infcontribution of ) from this z and from all other from all other z’s exactly cancels with the first term in infWe are now left with

3) : We have a contribution of 21 of these terms cancel with the first term of infleaving behind

Now summing over all , which is the algorithm’s guess of the minimizers or the internal reconstruction, the discrepancy measure has at least 2contribution from the ∆we have

Using the above insights, we now evaluate Ψ(follows

Finally, we set 18 satisfying the requirement of 9 required to apply Lemma 2. We choose c = 1/8, considering the minimum possible separation between two different minimizers, which results in 144 and in this regime, the following holds

We replace and rearranging leads to

Thus, we have proved that the reconstruction of a set that has a low error on the function subclass requires an exponential number of queries. As the set reconstruction is equivalent to the identification of all minimizers, this boils down to our claim that identification of the global minimizer can take exponential time.

5 Discussion/ Conclusion

We provide lower bounds for non convex optimization for up to first order. Importantly, we note that there is no restriction on the queries being sequential or related in any manner and the function is not required to be differentiable. Thus, we obtain algorithm agnostic lower bounds. Our future work includes exploring better rates with access to second order information or imposing the function class to be differentiable. On the other hand, situations with constraints on queries to be sequential in nature or using distributed optimization provide interesting avenues to understand if they lead to worse rates and by what factors.

References

[1] Alekh Agarwal, Martin J Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems 22, pages 1–9. 2009.

[2] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems 30, pages 1067– 1077. 2017.

[3] Asish Ghoshal and Jean Honorio. Information- theoretic limits of Bayesian network structure learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine

Learning Research, pages 767–775, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

[4] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1724–1732, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[5] Chuyang Ke and Jean Honorio. Informationtheoretic limits for community detection in network models. In Advances in Neural Information Processing Systems 31, pages 8324–8333. Curran Associates, Inc., 2018.

[6] Jason D. Lee, Max Simchowitz, Michael I. Jor- dan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257. PMLR, 23–26 Jun 2016.

[7] Jiri Matousek. Lectures on Discrete Geometry. Springer-Verlag, Berlin, Heidelberg, 2002.

[8] N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, July 2012.

[9] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[10] Abraham Wald. Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics, 10(4):299–326, 1939.

[11] W. Wang, M. J. Wainwright, and K. Ramchan- dran. Information-theoretic bounds on model selection for gaussian markov random fields. In 2010 IEEE International Symposium on Information Theory, pages 1373–1377, June 2010.

[12] Larry Wasserman. All of Nonparametric Statistics (Springer Texts in Statistics). SpringerVerlag, Berlin, Heidelberg, 2006.

[13] B. Yu. Assouad, Fano, and Le Cam. SpringerVerlag, 1997.