Scale-free network optimization: foundations and algorithms

2016·Arxiv

Abstract

Abstract

We investigate the fundamental principles that drive the development of scalable algorithms for network optimization. Despite the significant amount of work on parallel and decentralized algorithms in the optimization community, the methods that have been proposed typically rely on strict separability assumptions for objective function and constraints. Beside sparsity, these methods typically do not exploit the strength of the interaction between variables in the system. We propose a notion of correlation in constrained optimization that is based on the sensitivity of the optimal solution upon perturbations of the constraints. We develop a general theory of sensitivity of optimizers the extends beyond the infinitesimal setting. We present instances in network optimization where the correlation decays exponentially fast with respect to the natural distance in the network, and we design algorithms that can exploit this decay to yield dimension-free optimization. Our results are the first of their kind, and open new possibilities in the theory of local algorithms.

Keywords: sensitivity of optimal points, decay of correlation, scalable algorithms, network flow, Laplacian, Green’s function

1. Introduction

Many problems in machine learning, networking, control, and statistics can be posed in the framework of optimization. Despite the significant amount of work on decomposition methods and decentralized algorithms in the optimization community, typically the methodologies being considered rely on strict separability assumptions on the objective function and constraints, so that the problem can exactly decouple across components and each component can be handled by its own processing unit (Bertsekas and Tsitsiklis, 1997; Boyd et al., 2011). These methods are insensitive to the strength of interaction among variables in the system, and beside sparsity they typically do not exploit more refined structures. On the other hand, probability theory has taught us that random variables need not to be independent for distributed methods to be engineered, and that notions of correlation decay can be exploited to develop scale-free algorithms (Gamarnik, 2013). This paper represents a first attempt to characterize the correlation among variables in network optimization, and to investigate how decay of correlations with respect to the natural distance of the network can be exploited to develop scalable computationally-efficient algorithms. The paper presents three main contributions.

1) Sensitivity of optimal points: notion of correlation in optimization. In Section 2 we develop a general theory on the sensitivity of optimal points in constrained convex optimization. We consider the problem of minimizing a convex function subject to Ax = b, for a certain matrix A and vector denotes the image of A. If the function f is strongly convex, we show that the optimal point is continuously differentiable along Im(A). We explicitly characterize the effect that perturbations have on the optimal solution as a function of the objective function f and the constraint matrix A: given a differentiable function , we establish an expression for in terms of the Moore-Penrose pseudoinverse of the matrix is the transpose of A, and where the inverse of the Hessian of f evaluated at . We provide an interpretation of the derivatives of optimal points as a measure of the correlation between variables in the optimization procedure. Textbook results on the sensitivity analysis for optimization procedures are typically stated only with respect to the optimal objective function, i.e., which in general is a much more well-behaved object than the point where the optimum is attained, i.e., . On the other hand, the literature on the sensitivity of optimal points (see Castillo et al. (2007) and reference therein) is only concerned with establishing infinitesimal perturbations locally, on a neighborhood of a certain , while the theory that we develop extends to finite perturbations as well via the fundamental theorem of calculus. The workhorse behind our results is Hadamard’s global inverse function theorem. The details of the proofs involving Hadamard’s theorem are presented in Appendix A.

2) Foundation of scale-free network optimization: decay of correlation. As a paradigm for network optimization, in Section 3 we consider the widely-studied min-cost network flow problem, which has been fundamental in the development of the theory of polynomial-times algorithms for optimizations (see Gamarnik et al. (2012) and references therein, or Ahuja et al. (1993) for book reference). Here a directed graph G = (V, E) is given, with its structure encoded in the vertex-to-edge incidence matrix . To each edge is associated a flow with a cost and to each vertex is associated an external flow . The min-cost network flow problem consists in finding the flow that minimizes the total cost that satisfies the conservation law Ax = b. In this setting, the general sensitivity theory that we developed allows to characterize the optimal flow in terms of graph Laplacians; in fact, in this case the matrix corresponds to the Laplacian of an undirected weighted graph naturally associated to G. To estimate the strength of the correlation, we develop a general connection between the Moore-Penrose pseudoinverse of graph Laplacians and the Green’s function of random walks on weighed graphs. To the best of our knowledge, this connection — which we present as standalone in Appendix B — has not been previously investigated in the literature. This result allows us to get an upper bound for the correlation term that decays exponentially as a function of the graph distance between the edges that are considered and the set of vertices where the perturbation is localized. The rate of the decay is controlled by the second largest eigenvalue in magnitude of the corresponding random walk. This phenomenon can be interpreted as a first manifestation of the decay of correlation principle in constrained optimization, resembling the decay of correlation property in statistical mechanics and probability theory first investigated in the seminal work of Dobrushin (Dobruˇsin, 1970) (for book references see Simon (1993) and Georgii (2011)).

3) Scale-free algorithms. Finally, in Section 4 we investigate applications of our theory to the field of local algorithms. To illustrate the main principle behind scale-free algorithms, we consider the case when the solution is given and we want to compute the solution perturbed flow b+p, where p is supported on a small subset . In this setting, we show that the decay of correlation structurally exhibited by the min-cost network flow problem can be exploited to design algorithms that yield scale-free optimization, in the sense that the computational complexity required to meet a certain precision level does not depend on the dimension of the network G. We consider a localized version of the projected gradient descent algorithm, which only updates the edges in a subgraph of G whose vertex set contains Z. The correlation decay property encodes the

fact that when the external flow is locally perturbed it suffices to recompute the solution only for the part of the network that is “mostly affected” by this perturbation, i.e., the set of nodes that have a distance at most r from the perturbation set Z, where the radius r is tuned to meet the desired level of error tolerance, given the size of the perturbation. Hence the savings in the computational complexity compared to global algorithms that update the solution at every edge in G. The theory that we develop in the context of the min-cost network flow problem hints to a general framework to study the trade-off between statistical accuracy and computational complexity for local algorithms in optimization. Our results are the first of their kind, and represent a building block to develop more sophisticated algorithms to exploit decay of correlation in more general instances of network optimization. The proof of the results in Section 4 are given in Appendix C.

Remark 1 (Connection with previous work) Some of the results presented in this paper will appear in a weaker form and without full proofs in Rebeschini and Tatikonda (2016). There, the sensitivity analysis is developed for matrices A’s that are full row rank, so that the matrix is invertible under the assumption that f is strongly convex. In the current work we relax this assumption and we provide results in terms of the pseudoinverse of . Moreover, the current paper presents the full details of the proof which involve Hadamard’s global inverse function theorem (Appendix A). Also the min-cost network flow problem was previously investigated in Rebeschini and Tatikonda (2016), albeit in a more restrictive fashion through the connection with killed random walks. The current paper develops a more general theory of correlation for optimization in terms of graph Laplacians and Green’s functions of ordinary (i.e., not killed) random walks on graphs (Appendix B). The difference is crucial as far as the results on the decay of correlation property are concerned, as the second largest eigenvalue in magnitude of random walks on graphs is typically much more well-behaved than the largest eigenvalue of killed random walks, as far as the dependence with the dimension is concerned. The algorithmic part of this paper (Section 4 and Appendix C) is completely new.

Remark 2 (Notation) Throughout, for a given real matrix M, we denote by its transpose, by its inverse, and by its Moore-Penrose pseudoinverse. We denote by Ker(M) := {x : Mx = 0} and Im(M) := {y : y = Mx for some x} the kernel and the image of M, respectively. Given an index set I and subsets submatrix corresponding to the rows of M indexed by K and the columns of M indexed by L. We use the notation I to indicate the identity matrix, 1 to indicate the all-one vector (or matrix), and 0 to indicate the all-zero vector (or matrix), whose sizes will be implied by the context. Given a vector , we denote by -th component, and we let denote its Given a subset we define the localized . We use the notation |K| to denote the cardinality of K. If G = (V, E) denotes a directed graph with vertex set V and edge set E, we let G = (V, E) represent the undirected graph naturally associated to if and only if either

2. Sensitivity of optimal points: notion of correlation in optimization

Let V be a finite set — to be referred to as the “variable set” — and let be a strictly convex function, twice continuously differentiable. Let F be a finite set — to be referred to as the

“factor set” — and let . Consider the following optimization problem over

for , so that the feasible region is not empty. Throughout this paper we think of the function f and the matrix A as fixed, and we consider the solution of the optimization problem above as a function of the vector . By strict convexity, this problem clearly has a unique optimal solution, which we denote by

Theorem 3 below provides a characterization of the way a perturbation of the constraint vector b along the subspace Im(A) affects the optimal solution , in the case when the function f is strongly convex. In textbooks, results on the sensitivity analysis for optimization procedures are typically stated only with respect to the optimal objective function, i.e., , not with respect to the point where the optimum is attained, i.e., Boyd and Vandenberghe (2004), for instance. The reason is that the optimal value typically behaves much more nicely with respect to perturbations than the optimizer itself. In case of linear programming when f is linear, for instance, it is known that the optimal solution is differentiable upon perturbations, while the optimal point might jump as it is restricted to be on the extreme points of the feasible polyhedron (Bertsimas and Tsitsiklis, 1997). On the other hand, the literature on the sensitivity of optimal points is only concerned with infinitesimal perturbations (see Castillo et al. (2007) and reference therein). The theory that we develop, instead, extends to finite perturbations as well, as we show that if f is strongly convex then the optimal point continuously differentiable along the entire subspace Im(A), which allows the use of the fundamental theorem of calculus to get finite-difference statements (the results in Section 4 rely heavily on this fact). The workhorse that allows us to establish global results is Hadamard’s global inverse function theorem. We now present the main result on the sensitivity of optimal points, together with the main outline of its proof. The technical details involving Hadamard’s theorem are given in Appendix A.

Theorem 3 (Sensitivity of the optimal point) Let be a strongly convex function, twice continuously differentiable. Let . Define the function

For each

Then, is continuously differentiable along the subspace Im(A), and given a differentiable function

Proof The Lagrangian of the optimization problem is the function

where -th row of the matrix is the vector formed by the Lagrangian multipliers. Let us define the function

For any fixed , as the constraints are linear, the Lagrange multiplier theorem says that for the unique minimizer there exists

As , the set of Lagrangian multipliers that satisfies (1) is a translation of the null space of . We denote the unique translation vector by . By Hadamard’s global inverse function theorem, as shown in Lemma 11 in Appendix A, the restriction of the function diffeomorphism, namely, it is continuously differentiable, bijective, and its inverse is also continuously differentiable. In particular, this means that the functions are continuously differentiable along the subspace Im(A). Differentiating both sides of (1) with respect to , we get, by the chain rule,

where . As the function f is strongly convex, the Hessian is positive definite for every , hence it is invertible for every . Solving the linear system for first, from the first equation . Substituting this expression in the second equation where . The set of solutions to can be expressed in terms of the pseudoinverse of L as follows (see Barata and Hussein (2012)[Theorem 6.1], for instance):

We show that . We show that , as the opposite direction trivially holds. In fact, let if the positive definite matrix that satisfies. The condition is equivalent to At the same time, clearly, is orthogonal to must be which implies is positive definite. By Barata and Hussein (2012)[Prop. 3.3] it follows that is the unique solution to that belongs to Im(A). Substituting this expression into , we finally get The proof follows as

Theorem 3 characterizes the behavior of the optimal point upon perturbations of the constraint vector b along the subspace . If the matrix A is full row rank, i.e., then the optimal point is everywhere continuously differentiable, and we can compute its gradient. In this case the statement of Theorem 3 simplifies, as following corollary makes this precise.

Corollary 4 (Sensitivity of the optimal point, full rank case) Consider the setting of Theorem 3, with the matrix having full row rank, i.e., . Then, the function is continuously differentiable and

Proof The proof follows immediately from Theorem 3, once we notice that the matrix L(b) := is positive definite for every , hence invertible, and . To see this, let has full column rank, we have positive definite by the assumption of strong convexity, also its inverse is positive definite and we have

If the matrix A is full row rank, then the quantity represents a natural notion of the correlation between variable and factor in the optimization procedure, and the quantity in Corollary 4 characterizes this correlation as a function of the constraint matrix A, the objective function f, and the optimal solution allows us to extend the notion of correlation between variables and factors to the more general case when the matrix A is not full rank. As an example, let , and assume that p is supported on a subset if and only if . Then, the quantity how much a perturbation of the constraints in F affects the optimal solution at can be interpreted as a measure of the correlation between variable i and the factors in F, which is characterized by the quantity in Theorem 3.

Remark 5 (Previous literature on notions of correlation in optimization) There is only one paper that we are aware of where notions of correlation among variables in optimization procedures have been considered, which is Moallemi and Van Roy (2010). In this paper the authors use a notion of correlation similar to the one that we are proposing to prove the convergence of the min-sum message passing algorithm to solve the class of separable unconstrained convex optimization problems. Yet, in that work correlations are simply regarded as a tool to prove convergence guarantees for the specific algorithm at hand, and no general theory is built around them. On the other hand, the need to address diverse large-scale applications in the optimization and machine learning domains prompts to investigate the foundations of notions of correlation in optimization, and to develop a general theory that can inspire a principled use of these concepts for local algorithms. This is one of the main goal of our paper.

In the next section we investigate the notion of correlation just introduced in the context of network optimization, when the constraints naturally reflect a graph structure, and we investigate the behavior of the correlations as a function of the natural distance in the graph.

3. Foundation of scale-free network optimization: decay of correlation

As a paradigm for network optimization, we consider the minimum-cost network flow problem, a cornerstone in the development of the theory of polynomial-times algorithms for optimizations. We refer to Gamarnik et al. (2012) for an account of the importance that this problem has had in the field of optimization, and to Ahuja et al. (1993) for book reference.

Consider a directed graph G := (V, E), with vertex set V and edge set E, with no self-edges and no multiple edges. Let G = (V, E) be the undirected graph naturally associated with G, that is, if and only if either . Without loss of generality, assume that G is connected, otherwise we can treat each connected component on its own. For each denote the flow on edge if the flow is in the direction of the edge, if the flow is in the direction opposite the edge. For each be a given external flow on the vertex v: represents a source where the flow enters the vertex, whereas represents a sink where the flow enters the vertex. Assume that the total of the source flows equals the total of the sink flows, that is, is the flow vector. We assume that the flow satisfies a conservation equation so that at each vertex the total flow is zero. This conservation law can be expressed as vertex-to-edge incidence matrix defined as

For each edge be its associated cost function, assumed to be strongly convex and twice continuously differentiable. The min-cost network flow problem reads

It can be shown that since G is connected Im(A) consists of all vectors orthogonal to the vector 1, i.e., ), for instance. Henceforth, for each such that denote the unique optimal point of the network flow problem.

We first apply the sensitivity theory developed in Section 2 to characterize the correlation between vertices (i.e., factors) and edges (i.e., variables) in the network flow problem. Then, we investigate the behavior of these correlations in terms of the natural distance on the graph G.

3.1. Correlation in terms of graph Laplacians

In the setting of the min-cost network flow problem, Theorem 3 immediately allows us to characterize the derivatives of the optimal point along the subspace Im(A) as a function of graph Laplacians, as we now discuss. For which is a diagonal matrix with entries given by, for each

Each term is strictly positive as is strongly convex by assumption. Let the symmetric matrix defined as follows, for each

and let be the diagonal matrix with entries given by, for each

Let be the graph Laplacian of the undirected weighted graph (V, E, W(b)), where to each edge is associated the weight . A direct application of Theorem 3 shows that the derivatives of the optimal point along the subspace Im(A) can be expressed in terms of the Moore-Penrose pseudoinverse of L(b).

Lemma 6 (Sensitivity for min-cost network flow problem) For

Then, is continuously differentiable along the subspace Im(A), and given a differentiable function

Proof The proof follows immediately from Theorem 3, upon choosing variable set V := E and factor set F := V , and noticing that

Let , and assume that p is supported on a subset namely, if and only if . Then, as discussed in Section 2, the quantity can be interpreted as a measure of the correlation between edge vertices in Z in the network flow problem. How does this notion of correlation behave with respect to the graph distance between e and Z? We now address this type of questions, and we present upper bounds that decay exponentially fast with rate controlled by the second largest eigenvalue in magnitude of the diffusion random walk naturally defined on (V, E, W(b)).

3.2. Decay of correlation

Lemma 6 expresses the correlation quantity for the min-cost network flow problem in terms of the Moore-Penrose pseudoinverse of the Laplacian for the undirected weighted graph (V, E, W(b)). To investigate the behavior of this quantity as a function of the natural distance in the unweighted graph G = (V, E), we develop a general connection between the pseudoinverse of the Laplacian and the Green’s function of the random walk with transition matrix . To the best of our knowledge, this connection — which we present as standalone in Appendix B — has not been previously investigated in the literature. Presently, we only state the main result on the decay of correlation for the min-cost network flow problem.

Let n := |V | be the cardinality of V , and for each be the real eigenvalues of and be the set of node neighbors of v in the graph G. Let d be the graph-theoretical distance between vertices in the graph G, namely, d(u, v) is the length of the shortest path between vertices . Recall the definition of the localized -norm from Remark 2. The following result shows that the solution of the min-cost network flow problem satisfies a decay of correlation bound in the localized with exponential rate given by . The proof is given at the end of Appendix B.

Theorem 7 (Decay of correlation in the be a differentiable function such that for any if and only if . Then, for any (U, F ) subgraph of G = (V, E), we have

Recall that the bound in Theorem 7 controls the effect that localized perturbations that are supported on a subset of vertices have on a subset of edges , as a function of the distance between F and Z, i.e., d(U, Z) (we only defined the distance among vertices, not edges). A key property — which is essential for the results in Section 4 — is that this bound does not depend on the cardinality of F .

In the next section we investigate the consequences of the decay of correlation property established by Theorem 7 in the theory of local algorithms. We show that this is a fundamental property that can be used to develop scale-free algorithms for large network optimization problems.

4. Scale-free algorithms

Let us consider the min-cost network flow problem defined in the previous section, for a certain external flow , and choose such that p is supported on if and only if . Assume that we perturb the external flow b by adding p. We want to address the following question: given knowledge of the solution for the unperturbed problem, what is a computationally efficient algorithm to compute the solution of the perturbed problem? The basic idea that we aim to exploit is that the decay of correlation property established in Theorem 7 implies that a localized perturbation of the external flow affects more the components of that are close to the perturbed sites. As a result, only a subset of the components of the solution around the perturbed region Z needs to be updated to meet a prescribed level of error precision, yielding savings on the computational complexity.

To formalize this idea, henceforth let be a subgraph of G = (V, E) such that be the undirected graph associated to G (see Remark 2), and assume that is connected. Define . We now introduce a local algorithm to approximately compute . This algorithm only updates the components of — which is assumed to be known — on the subset

4.1. Localized projected gradient descent algorithm

As a general-purpose algorithm for constrained convex optimization, we consider the canonical projected gradient descent algorithm. The same argument about localization that we are about to present can analogously be developed for other optimization procedures (we refer to Bubeck (2015) for a recent review of algorithmic procedures in large-scale optimization, and to Bertsekas and Tsitsiklis (1997) for a book reference). Recall that a single iteration of the projected gradient descent algorithm to compute is the map

where is a given step size. Let each function -strongly convex and -smooth, i.e., . A classical result yields that the projected gradient descent with step size converges to the optimal solution of the problem, namely, for any starting point -th iteration of the algorithm. In the the convergence rate is given by (see Bubeck (2015)[Theorem 3.6], for instance)

where is the so-called condition number. We naturally define the localized projected gradient descent on as follows (recall from Remark 2 the notation for submatrices).

Definition 8 (Localized projected gradient descent) Given localized projected gradient descent on with step size is defined as

Only the components of x supported on are updated by , while the components on fixed, playing the role of boundary conditions: for . For this reason, the map is defined only for the points whose coordinates outside are consistent with the constraint equations. The algorithm that we propose to compute given knowledge of is easily described: it amounts to running for t times the localized projected gradient descent on with “frozen” boundary conditions (and step size Clearly, satisfies the flow conservation constraints on , by definition.

4.2. Error analysis: bias-variance decomposition

We now provide estimates for the error committed by the localized projected gradient descent as a function of the subgraph and the running time t. The key ingredient behind our estimates is the decay of correlation property for the min-cost network flow problem established in Theorem 7.

Let us define the error committed by the localized projected gradient descent algorithm after iterations as the vector in

The analysis that we give is based on the following decomposition, that resembles the bias-variance decomposition in statistical analysis:

The bias term is algorithm-independent — any algorithm that converges to the optimal solution yields the same bias — and it characterizes the error that we commit by localizing the optimization procedure per se, as a function of the subgraph . On the other hand, the variance term depends on the specific choice of the algorithm that we run inside

Let define the inner boundary of

Let vertex-to-vertex adjacency matrix of the undirected graph G = (V, E), which is the symmetric matrix defined as otherwise. Being real and symmetric, the matrix B has n := |V | real eigenvalues which we denote by be the second largest eigenvalue in magnitude of B.

The next theorem yields bounds for the bias and variance error terms in the -norm. The bound for the bias decays exponentially with respect to the graph-theoretical distance (i.e., the distance in the unweighted graph G) between the inner boundary of , and the region where the perturbation p is supported, i.e., . The rate is governed by the eigenvalue , the condition number Q, and the maximum/minimum degree of the graph. The bound for the variance decays exponentially with respect to the running time, with rate proportional to 1/Q. The proof of this theorem is given in Appendix C, and the key ingredient is the decay of correlation property for the min-cost network flow problem established in Theorem 7.

Theorem 9 (Error localized algorithm) Let be, respectively, the minimum and maximum degree of

with . The bound for total error committed by the algorithm follows by the triangle inequality for the -norm, namely,

Note that the constants appearing in the bounds in Theorem 9 do not depend on the choice of the subgraph , but depend only on (a more refined analysis can yield better constants that do depend on the choice of , but we do not need them for our purposes). In particular, the same constants apply for the analysis of the global algorithm, i.e., the projected gradient descent applied to the entire graph G. In this case, the bias term clearly equals 0, so that the error is equivalent to the variance (hence the indicator function in Theorem 9).

Analogously to what happens in the statistical setting, in the next section we show that the bias introduced by the localization procedure can be exploited to lower the computational complexity that is associated to the variance term. This is the key idea that allows us to prove dimension-free computational complexity for the localized projected gradient descent algorithm.

4.3. Dimension-free computational complexity

The error estimates established in Theorem 9 allow to prove that the localized projected gradient descent is scale-free, in the sense that it is guaranteed to meet a prescribed level of error accuracy with a computational complexity that does not depend on the dimension of the network G.

To illustrate this fact, let G = (V, E) be a k-regular graph such that the second largest eigenvalue in magnitude of its vertex-to-vertex adjacency matrix is bounded away from k as a function of the dimension of is a universal constant that does not depend on the size |V |, nor on the size |E|. This is the same as saying that G comes from a family of k-regular expander graphs (Hoory et al., 2006). Define G = (V, E) by assigning an arbitrary orientation to the edges of G. Assume that the following holds: , where recall that is the condition number. For the sake of simplicity, we introduce a collection of subgraphs of G that are centered on a given vertex and are parametrized by their radii. Fix a vertex denote the ball of radius r > 0 around vertex let be the subgraph of G that has vertex set , and induced edge set . Consider a perturbation vector that is supported on , for a fixed z > 0. If we run the localized algorithm on time steps, then Theorem 9 yields the following estimate (here

with

Let be the computational complexity required to run the localized projected gradient descent algorithm on time steps. A rough estimate for the asymptotic behavior of is easily derived as follows (more refined estimates can be made, but we do not need them to make our point). If denotes the vertex-to-edge adjacency matrix associated to is the restriction of the cost function f to the edges in , it is easy to check that a single iteration of the localized projected gradient descent algorithm on

for any . The exact computation of the matrix has an asymptotic complexity that scales like as a function of matrix multiplication constant.2 As each matrix-vector multiplication has a cost of scales like . For the sake of simplicity, consider . To estimate the complexity of the local algorithm, we need to bound the growth of as a function of r. In a k-regular graph, we clearly have (which is realistic for expander graphs, as they are locally tree-like) so that grows at most as

We are now in the position to appreciate the computational savings that the localized algorithm offers over the global algorithm (i.e., the projected gradient descent on G). Assume that G is an infinite network with . In this case, the computational complexity of the global algorithm is clearly infinity, as the global algorithm updates the components of the solution at every edge of the entire network. On the other hand, the complexity of the localized projected gradient descent algorithm is finite. This can be seen if we seek, for example, for the minimal radius r and time Clearly, these constraints guarantee that , and it is easy to see that both the minimal t and the minimal r that satisfy the above inequalities scale like , so that the complexity of the localized algorithm scales like where the constants involved do not dependent of the dimension of the graph G, but depend only on

The decay of correlation property exhibited by the min-cost network flow problem allowed us to show that the bias introduced by localizing the optimization problem to a subgraph from the computational complexity associated to the variance term, which corresponds to running the gradient descent algorithm on time steps. In fact, a finer analysis shows that one can exploit the bias-variance trade-off to optimally tune the algorithm, i.e., to find a radius time that minimize the computational complexity which is required to reach the prescribed level of error accuracy . These ideas suggest a general framework to study the trade-off between statistical accuracy and computational complexity for local algorithms in optimization.

Acknowledgments

We would like to thank Rasmus Kyng and Sushant Sachdeva for useful discussions.

References

Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.

David Aldous and James Allen Fill. Reversible markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/˜aldous/RWG/book.html.

Jo˜ao Carlos Alves Barata and Mahir Saleh Hussein. The Moore–Penrose pseudoinverse: A tutorial review of the theory. Brazilian Journal of Physics, 42(1-2):146–165, 2012.

Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

D. Bertsimas and J.N. Tsitsiklis. Introduction to linear optimization. Athena Scientific, 1997.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimiza- tion and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.

S´ebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends Rin Machine Learning, 8(3-4):231–357, 2015.

E. Castillo, A.J. Conejo, C. Castillo, and R. M´ınguez. Closed formulas in local sensitivity analysis for some classes of linear and non-linear problems. TOP, 15(2):355–371, 2007.

Fan Chung and S.-T. Yau. Discrete Green’s functions. Journal of Combinatorial Theory, Series A, 91(1–2):191 – 214, 2000.

R. L. Dobruˇsin. Definition of a system of random variables by means of conditional distributions. Teor. Verojatnost. i Primenen., 15:469–497, 1970. ISSN 0040-361x.

F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. Knowledge and Data Engineering, IEEE Transactions on, 19(3):355–369, 2007.

Tobias Friedrich and Thomas Sauerwald. The cover time of deterministic random walks. In Computing and Combinatorics, volume 6196, pages 130–139. Springer Berlin Heidelberg, 2010.

David Gamarnik. Correlation Decay Method for Decision, Optimization, and Inference in LargeScale Networks, chapter 7, pages 108–121. 2013.

David Gamarnik, Devavrat Shah, and Yehua Wei. Belief propagation for min-cost network flow: Convergence and correctness. Operations Research, 60(2):410–428, 2012.

Hans-Otto Georgii. Gibbs measures and phase transitions, volume 9 of de Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin, second edition, 2011.

Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. BULL. AMER. MATH. SOC., 43(4):439–561, 2006.

Roger A. Horn and Charles R. Johnson, editors. Matrix Analysis. Cambridge University Press, New York, NY, USA, 1986.

Ioannis Koutis, Gary L. Miller, and Richard Peng. A nearly-m log n time solver for sdd linear systems. In FOCS, pages 590–598. IEEE, 2011.

L. Lov´asz. Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty, 2(1):1–46, 1993.

Ciamac C. Moallemi and Benjamin Van Roy. Convergence of min-sum message-passing for convex optimization. Information Theory, IEEE Transactions on, 56(4):2041–2050, 2010.

Patrick Rebeschini and Sekhar C. Tatikonda. Decay of correlation in network flow problems. In 50th Annual Conference on Information Sciences and Systems, CISS (to appear in the proceedings), 2016.

Barry Simon. The statistical mechanics of lattice gases. Vol. I. Princeton Series in Physics. Princeton University Press, Princeton, NJ, 1993. ISBN 0-691-08779-2.

Nisheeth K. Vishnoi. Lx = b Laplacian Solvers and Their Algorithmic Applications. 2013.

F. Wu and C. Desoer. Global inverse function theorem. IEEE Transactions on Circuit Theory, 19: 199–201, 1972.

Appendix A. Hadamard’s global inverse function theorem

Recall that a function from is said to be if it has continuous derivatives up to order k. A function is said to be a diffeomorphism if it is , bijective, and its inverse is also following important result characterizes when a function is a diffeomorphism.

Theorem 10 (Hadamard’s global inverse function theorem) Let function from diffeomorphism if and only if the following two conditions hold:

1. The determinant of the differential of is different from zero at any point, namely,

2. The function norm coercive, namely, for any sequence of points (for any choice of the norm , as norms are equivalent in finite dimension).

Proof See Wu and Desoer (1972)[Corollary of Lemma 2], for instance.

The following result, which is the backbone behind the proof of Theorem 3, comes as a corollary to the previous theorem.

Lemma 11 (Diffeomorphism for Lagrangian multipliers map) Let be a strongly convex function, twice continuously differentiable. Let be a given matrix. Define the function

for any . Then, the restriction of the function phism.

Proof Let us interpret as the representation of a transformation T in the standard basis of Recall the orthogonal decomposition . Let the vectors form an orthogonal basis for Im(A), where r is the rank of A, and let the vectors form an orthogonal basis for . Define the orthogonal matrix which represents a change of basis in . As we have

then the transformation T is represented in the standard basis for and in the basis the following map

where

where is the identity matrix, and is the all-zero matrix. As

where . Therefore, the restriction of the transformation T to the invariant subspace is represented in the standard basis for and in the basis for Im(A) by the following map

As the function f is twice continuously differentiable, clearly the function is continuously differentiable, i.e., . We now check that the two conditions of Theorem 10 are satisfied. The differential of evaluated at is given by the Jacobian matrix

As f is strongly convex, is positive definite so invertible. Then, the determinant of the Jacobian can be expressed as has full row rank by definition, is positive definite and we clearly have

To prove that the function is norm coercive, let us choose to be the Euclidean norm and consider a sequence . As for any , clearly for the sequence to go to infinity one of the following two cases must happen:

(b)

Before we consider these two cases separately, let us note that, for any

Let be the strong convexity parameter, and recall the following definition of strong convexity, for any

(a) Assume be the projection operator on and let be the projection operator on Ker(B), the orthogonal complement of . As for any we have the decomposition clearly . So, the condition holds only if one of the two cases happens:

Consider the case (i) first. Let ) we have, for any

where is the minimum eigenvalue of among those corresponding to the eigenvectors spanning the subspace . Clearly, if by definition) then the above yields that . To prove this, assume by contradiction that . Then, there exists , such that has full column rank by assumption, the latter is equivalent to By = O so that , which contradicts the hypothesis that

Consider now the case (ii). Decomposing the gradient on and its orthogonal subspace, from (2) we have, for any

so that

and applying Cauchy-Schwarz we get, for any

By assumption f is twice continuously differentiable, so is continuous and it stays bounded on a bounded domain. Hence, we can conclude that

(b) Assume bounded. Notice that for any

where is the minimum eigenvalue of , which is strictly positive as is positive definite by the assumption that B has full row rank. From (2) we have

that, by continuity of , shows that is bounded.

Appendix B. Laplacians and random walks

Let G = (V, E, W) be a simple (i.e., no self-loops, and no multiple edges), connected, undirected, weighted graph, where to each edge is associated a non-negative weight be a diagonal matrix with entries for each . For each vertex be the set of node neighbors of v. In this section we establish several connections between the graph Laplacian and the random walk with transition matrix . Henceforth, for each be the law of a time homogeneous Markov chain transition matrix P and initial condition . Analogously, denote by the expectation with respect to this law. The hitting time to the site is defined as be the unique stationary distribution of the random walk, namely, . By substitution it is easy to check that . We adopt the notation to denote the vector whose only non-zero component equals 1 and corresponds to the entry associated to

B.1. Restricted Laplacians and killed random walks

The connection between Laplacians and random walks that we present in Section B.2 below is established by investigating restricted Laplacians and killed random walks. Throughout this section, let be fixed, and define as the matrix obtained by removing the -th row and column form W and D, respectively. Let be the restricted Laplacian that we obtain by removing the -th row and column form L. On the other hand, let be the transition matrix of the transient part of the killed random walk that is obtained from X by adding a cemetery at site . Creating a cemetery at means modifying the walk becomes a recurrent state, i.e., once the walk is in state it will go back to with probably 1. This is clearly done by replacing the row with zeros everywhere but in the -th coordinate, where the entry is equal to 1. The relation between the transition matrix of the killed random walk and the law of the random walk X itself is made explicit in the next proposition.

Proposition 12 For any

Proof We prove the statement by induction. Clearly, for any which proves the statement for is the indicator function). Assume that the statement holds for any time . By the properties of conditional expectation, noticing that

for any . By the Markov property, on the event

so that by the induction hypothesis we have

which proves the statement for t + 1.

The following proposition relates the inverse of the reduced Laplacian with the Green function of the killed random walk, namely, the function , with the hitting times of the original random walk X.

Proposition 13 For each

Proof Let us first assume that is connected. The matrix is sub-stochastic as, clearly, if is irreducible (in the sense of Markov chains, i.e., for each there exists ) and the spectral radius of is strictly less than 1 (see Corollary 6.2.28 in Horn and Johnson (1986), for instance), so that the Neumann series converges. The Neumann series expansion for

or, entry-wise, by Proposition 12, by the Monotone convergence theorem we can take the summation inside the expectation and get

where in the last step we used that . Recall that if S is a stopping time for the Markov chain , then by the strong Markov property we know that, conditionally on , the chain has the same law as a time-homogeneous Markov chain with transition matrix P and initial condition is independent of . The hitting times are two stopping times for X, and so is their minimum . As either

where we used that, conditionally on tionally on and the strong Markov property yields (note that the event has probability one from any starting point, as the graph G is connected by assumption so that the Markov chain will almost surely eventually hit either

Putting everything together we have The argument just presented extends easily to the case when is not connected. In fact, in this case the matrix a block structure, where each block corresponds to a connected component and to a sub-stochastic submatrix, so that the argument above can be applied to each block separately.

The following result relates the inverse of the reduced Laplacian with the pseudoinverse of the Laplacian L, which we denote by . It is proved in Fouss et al. (2007)[Appendix B].

Proposition 14 For any

Proposition 13 and Proposition 14 allow us to relate the quantity difference of the Green’s function of the random walk, as we discuss next.

B.2. Pseudoinverse of graph Laplacians and Green’s function of random walks

We now relate the Moore-Penrose pseudoinverse of the Laplacian with the Green’s function of the random walk, which represents the expected number of times the Markov chain X visits site v when it starts from site u. Notice that as the graph G is finite and connected, then the Markov chain X is recurrent and the Green’s function itself equals infinity for any . In fact, the following result involves differences of the Green’s function, not the Green’s function itself. To the best of our knowledge, this connection — which represents the key result that will allow us to bound functions of by spectral properties of P — has not been previously investigated in the literature.3

Lemma 15 For any

and the same formulas hold if we swap the role of

Proof Using first Proposition 14 and then Proposition 13 we obtain, for any in Section B.1),

From (3.27) in the proof of Proposition 3.10 in Chapter 3 in Aldous and Fill (2002), upon identifying , we immediately have the following relation between the difference of potentials and hitting times of the random walk X:

where -th component of the stationary distribution of the random walk X, and From Corollary 8 in Chapter 2 in Aldous and Fill (2002), we have

and we recall the connection between commute times and effective resistance (see, for example, Corollary 3.11 in Aldous and Fill (2002)):

and the statement of the lemma follows by combining everything together.

The connection between the Moore-Penrose pseudoinverse of Laplacians and the Green’s functions of random walks in Lemma 15 is the key result that allows us to derive spectral bounds in terms of the second largest eigenvalue in magnitude of the transition matrix P. We now present three lemmas that, albeit generic, are instrumental to the proof of Theorem 7 in Section 3. Henceforth, let d denote the graph-theoretical distance on G: that is, d(u, v) denotes the length of the shortest path between vertex u and vertex v. Note that , as we assumed that to each edge is associated a non-negative weight be the eigenvalues of

Lemma 16 For any

Proof From Lemma 15, by summing the quantity , recalling that

The identity in the statement of the Lemma follows easily as by assumption.

Proof Consider the matrix . This matrix is clearly similar to P and symmetric. Let denote by the orthonormal eigenvectors of corresponding, respectively, to the eigenvalues