b

DiscoverSearch
About
My stuff
First Order Methods take Exponential Time to Converge to Global Minimizers of Non-Convex Functions
2020·arXiv
Abstract
Abstract

Machine learning algorithms typically perform optimization over a class of non-convex functions. In this work, we provide bounds on the fundamental hardness of identifying the global minimizer of a non convex function. Specifically, we design a family of parametrized non-convex functions and employ statistical lower bounds for parameter estimation. We show that the parameter estimation problem is equivalent to the problem of function identification in the given family. We then claim that non convex optimization is at least as hard as function identifica-tion. Jointly, we prove that any first order method can take exponential time to converge to a global minimizer.

Deep learning algorithms generally employ first order optimization techniques that have convergence rates established only for convex functions. However, the function classes that they model happen to be non convex in nature. Non convex functions have epigraphs that are not convex sets. In other words, they do not exhibit the characteristic of having a single local minimum that also serves as its global minimum. Instead, they may have multiple local and multiple global minima. Although identifying global minimizers is not the goal of learning algorithms, it has motivated recent work on better understanding non convex optimization. Specifically, there has been emphasis on saddle points and local minimizers as understanding the geometry around such points may enable better algorithms to scan the search space. While this line of work attempts to provide guarantees under specific conditions, it is difficult in most machine learning scenarios to apriori understand which of these conditions are satisfied. In this work, we take an orthogonal approach to understanding the fundamental hardness of spanning the search space using first order methods, which we refer to as identifying the global minimizer.

The difficulty in optimizing non convex functions with first order methods can be intuitively viewed as the difference between topology and geometry. While the gradients and the function value at a point provide information on the geometry, information on the topology is much harder to estimate. However, finding a suitable hypothesis space during learning involves spanning the topology. It was first shown that gradient descent converges to minimizers if the strict saddle property is satisfied [6]. Subsequent work has shown that gradient descent with a constant step size can take exponential time to escape saddle points [2]. On the other hand, perturbed gradient descent is capable of escaping these saddle points under some technical assumptions including smoothness, Lipschitz Hessian, strict saddle points, among others [4].

In contrast, we take an algorithm agnostic approach by considering stochastic first order oracles. Such an approach provides an overarching framework that incorporates any possible first order algorithm. The algorithm could be as simple as gradient descent that decides its next point just based on the gradient at its previous point or as complex as using the gradient information from all the points that the algorithm has traversed. Previous work on evaluating lower bounds for stochastic convex optimization has employed a similar framework [1].

In this work, we formulate a parametrized sub-

class of non convex functions. The parametrization defines the position and the depth of the minimizers of each function. We employ classical statistical minimax bounds to the parameter estimation. Next, we claim that non convex optimization, also viewed as a set reconstruction problem, over the subclass is at least as hard as the parameter estimation. These naturally boil down to prove that it can take exponential time for first order methods to converge to any global minimizer of a non convex function.

In this section, we describe the mapping of non convex optimization to an oracle framework. We define the oracle, its properties and the error for any optimization method M on a function f at any time step or query to the oracle. We also provide a brief discussion on the general minimax risk framework before concluding with an intuitive picture of the function class of interest. For the convenience of the reader, we present an overview of our notations.

2.1 Notation

We use S to denote a set of d-dimensional points, each denoted by  x ∈ S. We use ||x||pto denote the ℓp norm for p ∈ [1, ∞]. For any two distributions P and Q, we represent the Kullback-Leibler (KL) divergence between the distributions by KL(P||Q). We use I to denote the Iverson function, i.e., I(A) refers to a random variable that takes a 0-1 value conditioned on set A.

2.2 Stochastic first order oracles

A (stochastic) first order oracle, when queried at a point in its domain, returns the (noisy) function value and the (noisy) gradient at that point. We define general first order oracles represented as  Op,σbelow.

Definition 2.1. Given a function class F on a domain D, the class of first-order stochastic oracles consists of random mappings  φ : D×F −→ Iof the form (�g(x), �v(x)) such that

image

Note that stochastic first order oracles that satisfy the conditions given by Equation 1 return unbiased estimates of both the function value and the gradient while Equation 2 controls the variance of the oracle answers. In this work, the algorithm uses the information obtained from oracle answers to the queries to internally reconstruct a set S of minimizers of a function within the subclass.

In the context of non-convex optimization, the algorithm queries the oracle at every time step at a single point in the domain D. The algorithm is allowed the flexibility to use all of the previous information that it has obtained from the oracle, i.e., Y = {Ix1:T −1}to decide its next point  xT . Weuse Y to represent all the information that the algorithm has queried from the oracle. We note, however, that the indexing with time merely provides an intuitive optimization understanding. An alternate and a more general view of this setting is the maximum information that can be sourced from the oracle in a budget of T queries. The points at which these T queries are made can be completely arbitrary and not necessarily sequential.

2.3 Non convex optimization in the oracle model

Non convex optimization is the task of retrieving the global minimizer of a non convex function g over its domain D. In general, any non convex function g has multiple local minimizers. In other words, non convex optimization retrieves  x∗ = arg minx∈D g(x)assuming such a global minimum exists. This is typically done with the help of an optimization method M that involves iterative sampling from the domain based on information from the previous samples. Common first order optimization algorithms such as stochastic gradient descent use only the information from the previous sample and the gradient at the previous sample to determine the next sample. However, formulating the non-convex optimization problem in an oracle framework aids in the assumption of an algorithm agnostic approach that represents a much more general scenario. It captures any first order algorithm that can be provided with the flex-ibility to choose the next sample using information from all or some of the previous (noisy) samples and (noisy) gradients in an arbitrary manner as desired.

We cast non convex optimization as a set reconstruction problem in the oracle model. Intuitively, this is because identifying the global minimizer requires the identification of all possible minimizers of the function. We explicitly discuss the reason in Section 3.1. The algorithm queries the oracle for the function values and the gradients at various points and internally reconstructs  STusing the information Y = {Ix1:T }. The elements of the set  ST could beviewed as the best guess of the possible minimizers of a function in the function class by the algorithm. At any time T or after T queries, we define the error of the optimization algorithm  M based on ST as

image

Thus, the error term helps evaluate the quality of the algorithm’s estimate of the possible minimizers. For oracles that are stochastic, the error in itself is a random variable. In this case, the expectation of the error term over the oracle’s randomness given by Eφ[ǫT (M, g, D, φ)] is considered.

Next, we discuss the minimax framework which is commonly used for studying the optimality of algorithms in statistics and machine learning [9, 12]. Consider a family of distributions P over a sample space A and a function  ω : P −→ Ω. Letω(P) be the parameter of the distribution  P ∈ P.In the minimax framework, the goal is to estimate the parameter  ω(P) from a set of m observations A = (a1, . . . , am) ∈ Am drawn from the (unknown) distribution  P. We let ρ : Ω × Ω −→ R+ denote apre-metric to evaluate the quality of an estimator ˆω : Am → Ω.We note that defining  P, Ω and ρin a suitable way allows for studying different problems in the minimax framework. The minimax risk is defined as

image

where we take the supremum (worst case) over the distributions  P ∈ P[10] and infimum (best) over all estimators ˆω : Am → Ω..

Finally, we map the minimax risk to our setting wherein, given a family of functions F defined over a domain D is optimized using a class of optimization methods M based on a budget of T queries, we define the following minimax risk

image

where the supremum is over the family of functions and the infimum over the optimization method. In essence, we are interested in bounding limits on the performance of the best possible algorithm on the hardest possible function in the class. We describe the general function class of interest below followed by the construction of a difficult subclass of functions in the next section.

2.4 Function class of interest

We paint an intuitive picture of the general function class before formally defining a subclass in the following section. Let the space of optimization be d-dimensional. Consider that there exists a function class consisting of 2d functions which have minimizers present in the dimensions dictated by an element (a set) that belongs to thee power set of d elements. In other words, the identification of the minimizers of a specific function requires the identification of its corresponding element in the power set. If we are able to define such a function class, we are able to embed an exponential number of functions unique in their set of minimizers in a d-dimensional space. In the formal definition, we extend this to embed a super exponential number of minimizers. We represent the broad class of non-convex functions by F. In order to avoid issues with subsets, we formulate an equivalent problem of estimating unique minimizers. Further, we modulate the depth of the function at its minimizers using a random vector  θ. Due tothe presence of this random vector  θthat modulates the depth of the function at any minimizer, it is not possible for the algorithm to zero in on a global minimizer until all the minimizers of the function have been identified.

In this section, we formally introduce the problem by means of a paramtrized subclass of functions. Our information-theoretic result relies on the construction of a restricted class of functions. The use of restricted ensembles is customary for information-theoretic lower bounds [8, 11, 5]. We first show that if a certain set S closely reconstructs the minimizers of a function in the subclass, there is no other function in the subclass that the same set can minimize. We then use Fano’s inequality to obtain a lower bound on the parameter estimation. Parameter estimation is equivalent to function identification. We further define a hypothesis test to show that non convex optimization is at least as hard as the function identification problem.

3.1 Constructing a difficult subclass of functions

To construct the desired subclass of functions with exponential number of minimizers, consider Z = {0, 1}d.The number of unique sets  z ∈ Z is 2d,generally referred to as the cardinality of Z. The subclass of functions G are designed have their minima at  {2z − 1/2, 2z − 1/4}d, the exact permutation decided by  α ∈ V, where V ⊂ {−1, 1}2d such thatany  α, β ∈ V satisfy

image

where ∆Hdenotes the Hamming metric. As the minimizers are defined by elements of set V, it parametrizes the subclass of functions. Thus, for every  α ∈ V, there exists a function in the subclass. From a classical fact, we have cardinality |V| ≥ (2/√e)2d/2 [7].

In addition, we assume that a set of random vectors Θ is sampled. The cardinality of Θ is 2d andeach element  θ ∈Θ is a vector of 2d random variables that is associated with a function in the subclass. Each element of the vector  θzis sampled from the distribution (14 − δ2) U[0, 1], where U[0,1] represents the uniform distribution in the domain [0, 1]. Once Θ has been sampled, each function is conditioned on a 2d dimensional vector  θ. We use θto characterize the depth of the function at a particular minimizer. As  θis a random vector, the algorithm can never be sure that it has identified the global minimizer until it has identified all minimizers. Importantly, we note that the function subclass is conditioned on the set Θ. The function subclass is given by

gα(x | θα) = 12d�z∈Z(12 + αzδ + θα,z)f1(x)

image

where,

image

The function class can be visualized as one with inverted pyramids but separated by a constant c such that on summing, the peaks still remain maintaining the non convex characteristic.

3.2 Optimizing well is equivalent to function identification

We claim that finding the global optimizer is equivalent to the identification of all minimizers. This can be followed from the fact that  θis a random vector and the algorithm can never be sure there is not a lower minima until all of them have been identified. We note that for a given  gαthat is parametrized by α, the set Sαis completely defined and is equivalent to the identification of  α. Thus, we show that retrieving the global optima is at least as hard as reconstruction of the set  Sα. If the method M is able to optimize over the function class  G(δ) uptoa desired tolerance, then it is capable of identifying the function  g ∈ G. In order to show this, we define a discrepancy measure according to which the closeness of two functions in the subclass G is measured. Consider two functions  gα, gβ ∈ Gand for any set S, we define

image

The discrepancy measure  ρ(gα, gβ, S|θα, θβ) is nonnegative and symmetric in its arguments. This pre-metric could be intuitively viewed as the  ℓ1 func-tional norm over the set of interest which seems natural in the context of non convex optimization. Using  ρ(gα, gβ, S|θα, θβ), we define the minimum discrepancy measure between any two functions in the subclass G by

image

In the context of a predefined subset of functions G, we shall denote Ψ(G(δ)) as Ψ(δ). We now explicitly show that optimization up to a certain tolerance implies the identification of a specific function in the subclass.

Lemma 3.1. For any set S such that the for any x1, x2 ∈ S satisfy ||x1 − x2||1 > c and any θα ∈{(0, 14 − δ2)2d}, there can be at most one function

image

Thus, if we have a set S that minimizes a function gα ∈ Gup to a certain tolerance (Ψ(δ)/3), then Scannot approximately minimize any other function in the subclass G.

Proof. From Equations 7 and 8 and given that there is an  S and α such that �S gα − infx gα ≤ Ψ(δ)/3,we have for any  β ̸= α,

image

Rearranging, we have,

image

Consider an optimization method  M ∈ M whichmakes a set of queries to the oracle. The method is now allowed to make use of this information in any arbitrary manner before it makes the next query. In addition, after T queries, the algorithm reconstructs a set  STfrom the information that it has obtained from queries to the oracle. Assuming that an optimization algorithm is able to reconstruct a set  STthat minimizes expected error up to a certain tolerance (ψ(δ)/9), we can then define a hypothesis to identify  α∗ correctly at least 2/3 of the time.

Lemma 3.2. Suppose that an algorithm M, with access to  Y = {Ix1:T } = φ(x1 : xT ; g∗α | θ)obtains an expected error satisfying

image

then one is able to construct a hypothesis test  �α :φ(x1 : xT ; g∗α | θ) −→ V such that maxα∗∈V P[�α ̸= α∗ | θ] ≤1/3 for all θ ∈ {(0, 14 − δ2)2d}

Proof. We build an estimator of the true  α∗, denotedas  �α. If there exists an  α ∈ V such that

image

then we assign  �α(ST ) to α. In any other case, we pick  �αat random. From the above hypothesis, we have that

image

where the upper bound is by the Markov’s inequality under the expected error assumption. As this is true for any  θ, maximizing over  αcompletes the proof.

image

Thus, if the optimization method is able to obtain a small enough error, we are able to identify the correct function most of the time.

3.3 Oracle answers and coin tossing

We now show that for the defined function subclass, the stochastic first order oracle answers ( �f(x), �v(x))can be viewed as coin tosses. Specifically, we associate the scaling factor for each z to the bias of a coin. Thus, the information obtained from oracle answers can be interpreted as information obtained from flips of 2d coins, with each coin having a bias from the set

image

To gain a better picture, we represent the oracle decision process in terms of coin tosses with the following algorithm

1. Pick  ℓindices between  {1, ...., 2d}without replacement. Lets represent the set of  z’s corre-sponding to these indices by  Zℓ

2. Draw  bz ∈ {0, 1}for each of these  ℓindices according to a Bernoulli distribution with parameter 1/2 + αzδ + θα,z

3. For a given input x, return the value  �gα(x | θα)and a sub-gradient  �vα(x) ∈ ∂�gα(x | θα) of thefunction based on the outcomes of the coin tosses  bz

image

We observe easily that  Eφ[�gα(x | θα)] = gα(x | θα)is satisfied as required by Definition 2.1. As the expectation is over the randomness of the oracle, this holds for the sub-gradients as well.

3.4 Lower bounds on coin tossing

Given that the oracle answers can be viewed as coin tosses, estimating the lower bounds on correctly predicting  α∗ ∈ Vfrom the information obtained from the oracle boils down to probability of success in estimating a binary vector from coin tosses.

We note that the subclass of function are parametrized by the finite set V. However, note that the conditioning of the function subclass on the random vector  θ ∈ {(0, 14 − δ2)|V|} modulatesthe depth at the minimizers of a function. In order to link the conditioning with Fano’s, we first state a result that treats this conditioned random variable as a latent variable –one that is not observed – and provides a bound on information obtained in this conditioned scenario over any arbitrary distribution of Θ [3]. The result stems from extending Fano’s inequality to a slightly more general scenario as shown in Figure 1. Adopting it to our setting involves estimating  α∗ from answers returned by the oracle  Y = {Ix1:T } = φ(x1 : xT ; g∗α | θ)

image

Figure 1: The graphical model depicting our the esti- mation problem, represented as a Bayesian network

Lemma 3.3. [3] Let α ∈ V and θbe random variables and let  �αbe any estimator of  αobtained from samples Y . If the random variables  α and θ are in-dependent, then

image

Thus, it is sufficient for us to show the RHS, that is, Fano’s inequality in the conditioned case with the supremum applied over  θ0while estimating the mutual information which we derive below. We note that the algorithm uses Y to internally reconstruct the set  ST. Thus, by evaluating the mutual information between  α∗ and Yand upper bounding the same, this upper bounds the information from Y that the algorithm could use to estimate the its set reconstruction  ST. In other words, this bounds the best  STpossible by any algorithm. We note that  α∗and  Sαare equivalent.

Lemma 3.4. An estimate of a Bernoulli parameter vector  α∗ chosen uniformly at random from the packing set V that is obtained from the outcome of  ℓ ≤ 2dcoins chosen uniformly at random at each iteration

image

where the probability is taken over coin toss randomness of the oracle and the uniform randomness over the choice of  α∗

Proof. We introduce some notation to represent the ℓcoins chosen and the set of oracle answers obtained. For  t = 1, 2, ..., T, consider Utto represent the subset of  ℓcoins chosen at each iteration,  Xt,ito represent the outcome of the  zthicoin at time instant  t and Yt,ito be a vector of dimension equal to that of  α∗

image

The core of the proof deals with the estimation of the mutual information between the information obtained for the queried sequence from the oracle and the true parameter of interest  α∗,I({Ut, Yt}Tt=1, α∗ | θ). The rest of the proof follows directly from Fano’s inequality. It is important to note that the queried sequence is conditioned on  θ.We note that although the sequence could be indexed using time as in an optimization scenario, we derive bounds with independence across the elements of the sequence, which is the worst case. Thus, these points could be obtained in any arbitrary fashion based on the algorithm with no sequential requirement to be satisfied. Thus, we have

image

Thus, it is sufficient to show that  I((U1, Y1), α∗ | θ) ≤ℓ(1+2δ)2 to bound the estimate of the mutual information between the sequence and  α∗. By the chain rule for mutual information, we have

image

Due to independence between the sampling process of  U1 and α∗, the second term  I(α∗; U1 | θ) = 0. Wecan rewrite the first term from the definition of conditional mutual information as

image

As  αis uniformly distributed over V and from the convexity of KL Divergence, we have [13]

image

For any pair  α, α∗ ∈ V, the summation of the KL Divergence can be bounded by the KL divergence between  ℓindependent pairs of Bernoulli variates with parameters being 1/2+δ+θ0 and 1/2−δ−θ0. Thus,we denote the KL Divergence between a single pair of Bernoulli variates with parameters 1/2 + δ + θ0and 1/2 − δ − θ0 by KL(δ, θ0) given by

image

From Lemma 3.3, we are required to take a sup over the conditioning or the latent variable, here  θ0.Thus, we have

image

Thus, as long as  δ ≤ 1/4, from the bound , we have

image

Following the proof backwards leads to the desired upper bound  I((U, Y ); α∗ | θ) ≤ Tℓ(1+ 2δ)2, therebycompleting the proof.

In this section, we provide our main theorem stating that first order methods take exponential time to span the search space to identify a global minimizer and its proof.

Theorem 4.1. There exists an universal constant c0 > 0such that any first order method provided with information from  ℓ ≤ 2d oracle answers to optimize over the function class  Fncv(D)satisfies the following lower bound

image

Proof. We consider an oracle that reveals information based on  ℓ of the 2d coin tosses with respect to the point with which it has been queried. From Lemma 3.4, we have the lower bound

image

The application of the upper bound from Lemma 3.2 requires the expected error to satisfy the conditions under which the upper bound holds, that is, the expected error is required to be within a certain tolerance. In order to evaluate the tolerance, we derive Ψ(δ) as follows.

First, we compute inf  gα(x) which is achieved at the global minima given by,

x∗ =� (2z − 1)/2 if αz = 1, arg sup˜z θα,˜z = z(2z − 1)/4 if αz = −1, arg sup˜z θα,˜z = z

At any such point  x∗ that minimizes  gα, all z ∈ Zin the summation apart from the one at which the minimum occurs contribute (1/2+δ)c+(1/2−δ)c = cwhile from the z where the minimum occurs, we have a contribution of (1/2 − δ)c. Thus, we have

infx gα(x) = (|Z| − 1)c + (12 − δ − supz θα,z)c= (|Z| − 1)c + (14 − δ2)c

We note that  ρand Ψ are defined over sets S, which contain the algorithm’s estimates of the minima of a function in the function class G. The error is computed as a sum over all the elements in the set S. In other words, any point in the set that is incorrectly identified as a minimizer adds to the error and correspondingly to the discrepancy measure.

Let us consider two functions  gα, gβ ∈ G and theset S. From Equation 7 and considering for a specific

image

Note from Equation 5 that g has a summation over z ∈ Z.

Lets consider two functions  gα, gβ ∈ Gfor a spe-cific  x ∈ S. Then we have,

image

= 12d�z∈Z[(1 + θα,z + θβ,z)f1(x, z)

image

+ (1  − 2αzδ − θα,z − θβ,z)f2(x, z)]I(αz = βz)

In order to better parse the summation, we denote

image

c(1 + θα,z + θβ,z) if x = (2z − 1)/2c(1 − θα,z − θβ,z) if x = (2z − 1)/42c if x ̸= (2z − 1)/2, (2z − 1)/4

image

(1  − 2δ − θα,z − θβ,z)c if αz = βz = 1, x = (2z − 1)/2(1  − 2δ − θα,z − θβ,z)c if αz = βz = −1, x = (2z − 1)/4(1 + 2δ + θα,z + θβ,z)c if αz = βz = 1, x = (2z − 1)/4(1 + 2δ + θα,z + θβ,z)c if αz = βz = −1, x = (2z − 1)/22c if αz = βz, x ̸= {(2z − 1)/2, (2zi − 1)/4}

Putting these together lead to multiple cases which we will explore in detail. Note that we are interested in lower bounding Ψ(δ). Since we do not have control over the selection of  θα and θβ, wedevelop a lower bound for the worst case. We refer to “common minimizer” and “unique minimizer” loosely for intuitive understanding. These could be viewed as a minimizer present at that particular x for both the functions but the function value may be different due to its dependence on  θ.

1)  If x is a “common minimizer”: The firsttwo possible cases with  αz = βzare satisfied. Considering infΘ, we have acontribution of {gα(x) + gβ(x)}z = c(1/2 − δ) from this z and |Z − 1|2cfrom all other  z’s.The resulting sum would cancel out infx gα + infx gβleading to  ρ = 0.

2)  If x is a “unique minimizer”: The firsttwo possible cases with  αz ̸= βzare satisfied. Considering infΘ, we have acontribution of {gα(x) + gβ(x)}z = c(1/2 + δ) from this z and |Z − 1|2cfrom all other  z’s. The |Z − 1|2cfrom all other z’s exactly cancels with the first term in infx gα + infx gβ.We are now left with ρ = c(1/2 + δ) − c(1/2 − δ) = 2δc.

3)  If x is not a “minimizer”: We have a contribution of 2c from all z’s. |Z| −1 of these terms cancel with the first term of infx gα +infx gβleaving behind

image

Now summing over all  x ∈ S, which is the algorithm’s guess of the minimizers or the internal reconstruction, the discrepancy measure has at least 2cδcontribution from the ∆H(α, β) terms. Thus,we have

image

Using the above insights, we now evaluate Ψ(δ) asfollows

image

Finally, we set  ǫ = cδ/18 satisfying the requirement of  ǫ ≤ Ψ(δ)/9 required to apply Lemma 2. We choose c = 1/8, considering the minimum possible separation between two different minimizers, which results in  ǫ = δ/144 and in this regime, the following holds

image

We replace  δ = 144ǫand rearranging leads to

image

Thus, we have proved that the reconstruction of a set that has a low error on the function subclass requires an exponential number of queries. As the set reconstruction is equivalent to the identification of all minimizers, this boils down to our claim that identification of the global minimizer can take exponential time.

We provide lower bounds for non convex optimization for up to first order. Importantly, we note that there is no restriction on the queries being sequential or related in any manner and the function is not required to be differentiable. Thus, we obtain algorithm agnostic lower bounds. Our future work includes exploring better rates with access to second order information or imposing the function class to be differentiable. On the other hand, situations with constraints on queries to be sequential in nature or using distributed optimization provide interesting avenues to understand if they lead to worse rates and by what factors.

[1] Alekh Agarwal, Martin J Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems 22, pages 1–9. 2009.

[2] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems 30, pages 1067– 1077. 2017.

[3] Asish Ghoshal and Jean Honorio. Information- theoretic limits of Bayesian network structure learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine

Learning Research, pages 767–775, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

[4] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1724–1732, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[5] Chuyang Ke and Jean Honorio. Informationtheoretic limits for community detection in network models. In Advances in Neural Information Processing Systems 31, pages 8324–8333. Curran Associates, Inc., 2018.

[6] Jason D. Lee, Max Simchowitz, Michael I. Jor- dan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257. PMLR, 23–26 Jun 2016.

[7] Jiri Matousek. Lectures on Discrete Geometry. Springer-Verlag, Berlin, Heidelberg, 2002.

[8] N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, July 2012.

[9] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[10] Abraham Wald. Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics, 10(4):299–326, 1939.

[11] W. Wang, M. J. Wainwright, and K. Ramchan- dran. Information-theoretic bounds on model selection for gaussian markov random fields. In 2010 IEEE International Symposium on Information Theory, pages 1373–1377, June 2010.

[12] Larry Wasserman. All of Nonparametric Statistics (Springer Texts in Statistics). SpringerVerlag, Berlin, Heidelberg, 2006.

[13] B. Yu. Assouad, Fano, and Le Cam. SpringerVerlag, 1997.


Designed for Accessibility and to further Open Science