b

DiscoverSearch
About
My stuff
Polynomial Optimization for Bounding Lipschitz Constants of Deep Networks
2020·arXiv
Abstract
Abstract

The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Network. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constant of a multiple layer deep neural network. The novelty is to combine a polynomial lifting for ReLU functions derivatives with a weak generalization of Putinar’s positivity certificate. This idea could also apply to other, nearly sparse, polynomial optimization problems in machine learning. We empirically demonstrate that our method provides a trade-off with respect to state of the art linear programming approach, and in some cases we obtain better bounds in less time.

We focus on the multiple layer networks with ReLU activations. We propose a computationally efficient method to give a valid upper bound on the Lipschitz constant of such networks. Recall that a function f, defined on a convex set  X ⊆ Rn, is L-Lipschitz with respect to the norm  || · ||if for all x, y ∈ X, we have  |f(x) − f(y)| ≤ L||x − y||. The Lipschitz constant of f with respect to norm || · ||, denoted by  L||·||f, is the infimum of all those valid Ls:

image

For deep networks, they play an important role in many applications related to robustness certification which has emerged as an active topic. See recent works [9, 31] based on semidefinite programming (SDP), [8, 40] based on linear programming (LP), [34] based on mixed integer programming (MIP), and [6, 38, 39, 41] based on outer polytope approximation, [7] based on averaged activation operators. We follow a different route and compute upper bounds on the Lipschitz constant of neural networks [35].

Another important application is the Wasserstein Generative Adversarial Network (WGAN) [3]. Wasserstein distance is estimated by using the space of functions encoded by 1-Lipschitz neural networks. This requires a precise estimation of the Lipschitz constants , see recent contributions [2, 11, 25].

Recently there has been a growing interest in polynomial optimization for such problems. In [31], robustness certification is modeled as a quadratically constrained quadratic problem (QCQP) for ReLU networks. Similarly, in [21], an upper bound on the Lipschitz constant of an ELU network is obtained from a polynomial optimization problem (POP). In contrast to optimization problems with more general functions, powerful (global) positivity certificates are available for POP. Such certificates are needed to approximate global optima as closely as desired [20].

Such positivity certificates have been already applied with success in various areas of science and engineering. The first attempt to compute lower bounds of a QCQP by solving an SDP can be traced back to Shor [32], recently applied to certify robustness of neural networks in [31]. Converging LP based hierarchies are based on Krivine-Stengle’s certificates [15, 19, 33]. In [21], sparse versions are used to bound the Lipschitz constant of neural networks. On the other hand, Putinar’s certificate [17, 29] is implemented via an SDP-based hierarchy (a.k.a., “Lasserre’s hierarchy”) and provides converging approximate solutions of a POP. Various applications are described in [16], see also its application to large scale optimal power flow problems in [26, 27] and for roundoff error certification in [24]. The LP hierarchy is cheaper than the SDP hierarchy, but less efficient for combinatorial optimization [22], and cannot converge in finitely many steps for continuous POPs. Finally, weaker positivity certificates can be used, for example DSOS/SDSOS [1] based on second-order cone programming, or hybrid BSOS hierarchy [19].

1.1 Related Works

Upper bound on Lipschitz constants of deep networks can be obtained by a product of the layer-wise Lipschitz constants [13]. This is however extremely loose and has many limitations [13]. Note that [36] propose an improvement via a finer product.

Departing from this approach, [21] propose a QCQP formulation to estimate the Lipschitz constant of neural networks. Shor’s relaxation allows to obtain a valid upper bound. Alternatively, using the LP hierarchy, [21] obtain tighter upper bounds. By another SDP-based method, [10] provide an upper bound of the Lipschitz constant. However this method is restricted to the  L2-norm whereas most robustness certification problems in deep learning are rather concerned with the  L∞-norm.

1.2 Preliminaries and Notations

Denote by F the multiple layer neural network, m the number of hidden layers,  p0, p1, . . . , pmthe number of nodes in the input layer and each hidden layer. For simplicity,  (p0, p1, . . . , pm)will denote the layer structure of network  F. Let x0be the initial input, and  x1, . . . , xmbe the activation vectors in each hidden layer. Each  xi, i = 1, . . . , m, is obtained by a weight  Ai, a bias  bi, and an activation function  σ, i.e.,  xi = σ(Aixi−1 + bi). We only consider coordinatewise application of the ReLU activation function, defined as ReLU(x) = max{0, x} for  x ∈ R. The ReLU function is non-smooth, we define its generalized derivative as the set-valued function G(x) such that G(x) = 1 for x > 0, G(x) = 0 for x < 0 and G(x) = {0, 1} for x = 0.

We assume that the last layer in our neural network is a softmax layer with K entries, that is, the network is a classifier for K labels. For each label  k ∈ {1, . . . , K}, the score of label k is obtained by an affine product with the last activation vector, i.e.,  cTk xm for some ck ∈ Rpm. The final output is the label with the highest score, i.e.,  y = arg maxk cTk xm. The product xy of two vectors x and y is considered as coordinate-wise product.

1.3 Contribution

We first express both graphs of ReLU and its generalized derivative via basic closed semialgebraic sets, i.e., sets defined with finite conjunctions of polynomial (in)equalities. Indeed, if y = ReLU(x) and  v ∈ G(x), then equivalently:  y(y − x) = 0, y ≥ x, y ≥ 0and  v(v − 1) = 0, (v − 1/2)x ≥ 0. Note that the explicit semialgebraic expression of ReLU has already been used in related works, see for example [31]. The semialgebraic expression for G is our contribution. Being exact, this semi-algebraic reformulation is a noticeable improvement compared to the model proposed in [21] where the generalized derivative of ReLU is simply replaced with a decision variable lying between 0 and 1 (and so is only approximated).

Table 1: Global Lipschitz constant and solver running time of networks of size (80, 80) and (40, 40, 10) obtained by HR-1, HR-2, LipOpt-3 and LipOpt-4 on various sparsities s. The abbreviation “HR-2” (resp. “HR-1”) stands for the second-order (resp. first-order) SDP-based method we propose (see Appendix E), and “LipOpt-3/4” stands for the LP-based method from [21] (which uses Krivine-Stengle’s positivity certificate of degree 3 or 4). LBS is the lower bound computed by random sampling. For networks of more than 2 hidden layers, we use the technique introduced in Appendix E in order to deal with the cubic terms in the objective. OfM means out of memory while building the model. The results are for a single random network, complete results are shown in Appendix F and G.

image

Second, we provide a heuristic approach based on an SDP-hierarchy for nearly sparse polynomial optimization problems. In such problems one assumes that only a few affine constraints destroy a sparsity pattern satisfied by the other constraints. This new approach is mainly based on the sparse version of the Lasserre’s hierarchy, popularized in [18, 37]. It provides upper bounds yielding a trade-off between third and fourth degree LP relaxations in [21], and sometimes a strict improvement.

1.4 Main Results

In recent work [21] a certain sparsity structure arising from a neural network is exploited. Consider a neural network F with one single hidden layer, and 4 nodes in each layer. The network F is said to have a sparsity of 4 if its weight matrix A is symmetric with diagonal blocks of size at most  2 × 2:

image

Larger sparsity values refer to symmetric matrices with band structure of a given size. This sparsity structure (2) of the networks greatly influences the number of variables involved in the LP program to solve in [21]. This is in deep contrast with our method which does not require the weight matrix to be as in (2). Hence when the network is fully-connected, our method is more efficient and provides tighter upper bounds.

Table 1 gives a brief comparison outlook of the results obtained by our method and the method in [21]. For (80, 80) networks, apart from s = 20, which is not significative, HR-2 obtains much better bounds and is also much more efficient than LipOpt-3. LipOpt-4 provides tighter bounds than HR-2 but suffers more computational time, and run out of memory when the sparsity increases. For (40, 40, 10) networks, HR-1 is a trade-off between LipOpt-3 and LipOpt-4, it provides tighter (resp. looser) bounds than LipOpt-3 (resp. LipOpt-4), but takes more (resp. less) computational time.

In this section, we recall basic facts about optimization and build the polynomial optimization model for estimating Lipschitz constant of neural networks.

2.1 Polynomial Optimization

In a polynomial optimization problem (POP), one computes the global minimum (or maximum) of a multivariate polynomial function on a basic closed semialgebraic set. If the semialgebraic set is the whole space, the problem is unconstrained, and constrained otherwise. Given a positive integer n ∈ N, let x = (x1, . . . , xn)T be a vector of decision variables, and denote by [n] the set {1, . . . , n}.

A POP has the canonical form:

image

where  f, fi, gjare all polynomials in n variables. With  I ⊂ {1, . . . , n}, let  xI := (xi)i∈Iand let R[x] be the space of real polynomials in the variables  x while R[xI]is the space of real polynomials in the variables  xI.

In particular, if the objective f and constraints  fi, gj in (POP)are all of degree at most 2, we say that the problem is a quadratically constrainted quadratic problem (QCQP). The Shor’s relaxation of a QCQP is a semidefinite program which can be solved efficiently numerically. If all polynomials involved in (POP) are affine then the problem is a linear program (LP).

2.2 Lipschitz Constant Estimation Problem (LCEP)

Suppose we train a neural network F for K-classifications and denote by  Ai, bi, ckits parameters already defined in section 1.2. Thus for an input  x0 ∈ Rp0, the targeted score of label k can be expressed as  Fk(x0) = cTk xm, where xi = ReLU(Aixi−1+bi), for i ∈ [m]. Let zi = Aixi−1+bifor  i ∈ [m]. By applying the chain rule on the non-smooth function  Fk, we obtain a set valued map for  Fkat point any  x0 as GFk(x0) = (�mi=1 ATi diag(G(zi)))ck.

We fix a targeted label (label 1 for example) and omit the symbol k for simplicity. We define  L||·||F ofF with respect to norm  || · ||, is the supremum of the gradient’s dual norm, i.e.:

image

where  Ωis the convex input space, and  || · ||∗is the dual norm of  || · ||, which is defined by ||x||∗ := sup||t||≤1 |⟨t, x⟩| for all x ∈ Rn. In general, the chain rule cannot be applied to composition of non-smooth functions [5, 14]. Hence the formulation of  GFk and (3)may lead to incorrect gradients and bounds on the Lipschitz constant of the networks. The following ensures that this is not the case and that the approach is sound, its proof is postponed to Appendix A.

Lemma 1 If  Ωis convex, then  L||·||Fis a Lipschitz constant for  Fk on Ω.

When  Ω = Rn, L||·||Fis the global Lipschitz constant of F with respect to norm  || · ||. In many cases we are also interested in the local Lipschitz constant of a neural network constrained in a small neighborhood of a fixed input  ¯x0. In this situation the input space  Ωis often the ball around ¯x0 ∈ Rwith radius  ε: Ω = {x : ||x − ¯x0|| ≤ ε}. In particular, with the  L∞-norm (and using l ≤ x ≤ u ⇔ (x − l)(x − u) ≤ 0), the input space  Ωis the basic semialgebraic set:

image

Combining Lemma 1 and (3), the Lipschitz constant estimation problem (LCEP) for neural networks with respect to the norm  || · ||, is the following POP:

image

xi−1(xi−1 − Ai−1xi−2 − bi−1) = 0, xi−1 ≥ 0, xi−1 ≥ Ai−1xi−2 + bi−1, 2 ≤ i ≤ m ;t2 ≤ 1, (x0 − ¯x0 + ε)(x0 − ¯x0 − ε) ≤ 0 .} (LCEP)

In [21] the authors only use the constraint  0 ≤ ui ≤ 1on the variables  ui, only capturing the Lipschitz character of the considered activation function. We could use the same constraints, this would allow to use activations which do not have semi-algebraic representations such as the Exponential Linear Unit (ELU). However, such a relaxation, despite very general, is a lot coarser than the one we propose. Indeed, (LCEP) treats an exact formulation of the generalized derivative of the ReLU function by exploiting its semialgebraic character.

In this section we briefly introduce the Lasserre’s hierarchy [17] which has already many successful applications in and outside optimization [20].

3.1 Convergent SDP Relaxations without Exploiting Sparsity [17]

In the Lasserre’s hierarchy for optimization one approximates the global optimum of the POP

image

(where  f, giare all polynomials in R[x]), by solving a hierarchy of SDPs 1 of increasing size. Equality constraints can also be taken into account easily. Each SDP is a semidefinite relaxation of (Opt) in the form:

image

where  ωi = ⌈deg(gj)/2⌉, y = (yα)α∈Nn2d, Ly : R[x] → Ris the so-called Riesz linear functional: f = �α fα xα �→ Ly(f) := �α fαyαwith  f ∈ R[x], and  Md(y), Md−ωi(giy)are moment matrix and localizing matrix respectively; see [20] for precise definitions and more details. The semidefinite program (MomOpt-d) is the d-th order moment relaxation of problem (Opt). As a result, when  K := {x : gi(x) ≥ 0, i ∈ [p]}is compact, one obtains a monotone sequence of lower bounds  (ρd)d∈Nwith the property  ρd ↑ f ∗as  d → ∞under a certain technical Archimedean condition; the latter is easily satisfied by including a quadratic redundant constraint  M − ∥x∥2 ≥ 0in the definition of K (redundant as K is compact and M is large enough). At last but not least and interestingly, generically the latter convergence is finite [28]. Ideally, one expects an optimal solution y∗of (MomOpt-d) to be the vector of moments up to order 2d of the Dirac measure  δx∗at a global minimizer  x∗of (Opt). See Appendix B.1 for an example to illustrate the core idea of Lasserre’s hierarchy.

3.2 Convergent SDP Relaxations Exploiting Sparsity [18, 37]

The hierarchy (MomOpt-d) is often referred to as dense Lasserre’s hierarchy since we do not exploit any possible sparsity pattern of the POP. Therefore, if one solves (MomOpt-d) with interior point methods (as all current SDP solvers do), then the dense hierarchy is limited to POPs of modest size. Indeed the d-th order dense moment relaxation (MomOpt-d) involves�n+2d2d �variables and a moment matrix  Md(y) of size�n+dd �= O(nd) at fixed d. Fortunately, large-scale POPs often exhibit some structured sparsity patterns which can be exploited to yield a sparse version of (MomOpt-d), as initially demonstrated in [37]. As a result, wider applications of Lasserre’s hierarchy have been possible.

Assume that the set of variables in (Opt) can be divided into several subsets indexed by  Ik, for k ∈ [l],i.e.,  [n] = ∪lk=1Ik, and suppose that the following assumptions hold:

A1: The function f is a sum of polynomials, each involving variables of only one subset, i.e.,

image

A2: Each constraint also involves variables of only one subset, i.e.,  gi ∈ R[xIk(i)] for some k(i) ∈ [l];A3: The subsets  Iksatisfy the Running Intersection Property (RIP): for every  k ∈ [l − 1], Ik+1 ∩�kj=1 Ij ⊆ Is, for some s ≤ k.

A4: Add redundant constraints  Mk − ||xIk||2 ≥ 0 where Mkare constants determined beforehand.

A POP with such a sparsity pattern is of the form:

image

and its associated sparse Lasserre’s hierarchy reads:

image

where  d, ωi, y, Lyare defined as in (MomOpt-d) but with a crucial difference. The matrix  Md(y, Ik)(resp.  Md−ωi(gi y, Ik)) is a submatrix of the moment matrix  Md(y)(resp. localizing matrix Md−ωi(giy)) with respect to the subset  Ik, and hence of much smaller size�τk+dτk �if |Ik| =: τk ≪ n.See Appendix B.2 for an example to illustrate the core idea of sparse Lasserre’s hierarchy.

If the maximum size  τof the subsets is such that  τ ≪ n, then solving (MomSpOpt-d) rather than (MomOpt-d) results in drastic computational savings. In fact, even with not so large n, (MomOpt-d) the second relaxation with d = 2 is out of reach for currently available SDP solvers. Finally,  θd ≤ f ∗for all d and moreover, if the subsets  Iksatisfy RIP, then we still obtain the convergence  θd ↑ f ∗as d → ∞, like for the dense relaxation (MomOpt-d).

There is a primal-dual relation between the moment problem and the sum-of-square (SOS) problem, as shown in Appendix C. The specific MATLAB toolboxes Gloptipoly [12] and YALMIP [23] can solve the hierarchy (MomOpt-d) and its sparse variant (MomSpOpt-d), and also their dual SOS problem.

For illustration purpose, consider 1-hidden layer networks. Then in (LCEP) we can define natural subsets  Ii = {u(i)1 , x0}, i ∈ [p1](w.r.t. constraints  u1(u1 − 1) = 0, (u1 − 1/2)(A1x0 + b1) ≥ 0, and  (x0 − ¯x0 + ε)(x0 − ¯x0 − ε) ≤ 0); and Jj = {t(j)}, j ∈ [p0](w.r.t. constraints  t2 ≤ 1). Clearly, Ii, Jjsatisfy the RIP condition and are subsets with smallest possible size. Recall that  x0 ∈ Rp0. Hence  |Ii| = 1 + p0and the maximum size of the PSD matrices is�1+p0+dd �. Therefore, as in real deep neural networks  p0can be as large as 1000, the second-order sparse Lasserre’s hierarchy (MomSpOpt-d) cannot be implemented in practice.

In fact (LCEP) can be considered as a “nearly sparse” POP, i.e., a sparse POP with some additional “bad" constraints that violate the sparsity assumptions. More precisely, suppose that  f, giand subsets Iksatisfy assumptions A1, A2, A3 and A4. Let g be a polynomial that violates A2. Then we call the POP

image

a nearly sparse POP because only one constraint, namely  g ≥ 0, does not satisfy the sparsity pattern A2. This single “bad" constraint  g ≥ 0precludes us from applying the sparse Lasserre hierarchy (MomSpOpt-d).

In this situation, we propose a heuristic method which can be applied to problems with arbitrary many constraints that possibly destroy the sparsity. The key idea of our algorithm is: (i) Keep the “nice" sparsity pattern defined without the bad constraints; (ii) Associate only low-order localizing matrix constraints to the “bad” constraints. In brief, the d-th order heuristic hierarchy (HR-d) reads:

image

where  y, Ly, Md(y, Ik), Md−ωi(giy, Ik(i))have been defined in section 3.2. For more illustration of this heuristic relaxation and how it is applied to estimate the Lipschitz constant of neural networks, see Appendix D.

For simplicity, assume that the neural networks have only one single hidden layer, i.e., m = 1. Denote by A, b the weight and bias respectively. As in (4), we use the fact that  l ≤ x ≤ uis equivalent to (x − l)(x − u) ≤ 0. Then the local Lipschitz constant estimation problem with respect to  L∞-normcan be written as:

image

Define the subsets of (LCEP-MLP1) to be Ii = {xi, ti}, Jj = {uj, zj} for i ∈ [p0], j ∈ [p1], wherep0, p1are the number of nodes in the input layer and hidden layer respectively. Then the second-order

(d = 2) heuristic relaxation of (LCEP-MLP1) is the following SDP:

image

M2(y, Ii) ⪰ 0, M1(−(x(i) − ¯x(i)0 + ε)(x(i) − ¯x(i)0 − ε)y, Ii) ⪰ 0, M1((1 − t2i )y, Ii) ⪰ 0, i ∈ [p0] ;M2(y, Jj) ⪰ 0, M1(uj(uj − 1)y, Jj) = 0, M1((uj − 1/2)zjy, Jj) ⪰ 0, j ∈ [p1] .} .(MomLCEP-2)

The d-th order heuristic relaxation (MomNlySpOpt-d) also applies to multiple layer neural networks. However, if the neural network has m hidden layers, then the criterion in (LCEP) is of degree m + 1. If  m ≥ 2, then the first-order moment matrix  M1(y)is no longer sufficient, as moments of degree > 2 are not encoded in  M1(y)and some may not be encoded in the moment matrices  M2(y, Ii), if they include variables of different subsets. See Appendix E for more information to deal with higher-degree polynomial objective.

In this section, we provide results for the global and local Lipschitz constants of random networks of fixed size (80, 80) and with various sparsities. We also compute bounds of a real trained 1-hidden layer network. The complete results for global/local Lipschitz constants of both 1-hidden layer and 2-hidden layer networks can be found in Appendix F and G. For all experiments we focus on the L∞-norm, the most interesting case for robustness certification. Moreover, we use the Lipschitz constants computed by various methods to certify robustness of a trained network, and compare the ratio of certified inputs with different methods. Let us first provide an overview of the methods with which we compare our results.

SHOR: Shor’s relaxation applied to (LCEP). Note that this is different from Shor’s relaxation

image

LipOpt-3: LP-based method by [21] with degree 3.

LBS: lower bound obtained by sampling 50000 random points and evaluating the dual norm of the gradient.

The reason why we list LBS here is because LBS is a valid lower bound on the Lipschitz constant. Therefore all methods should provide a result not lower than LBS, a basic necessary condition of consistency.

As discussed in section 2.2, if we want to estimate the global Lipschitz constant, we need the input space  Ωto be the whole space. In consideration of numerical issues, we set  Ωto be the ball of radius 10 around the origin. For the local Lipschitz constant, we set by default the radius of the input ball as  ε = 0.1. In both cases, we compute the Lipschitz constant with respect to the first label. All experiments are run on a personal laptop with a 4-core i5-6300HQ 2.3GHz CPU and 8GB of RAM. We use the (Python) code provided by [21]2 to execute the experiments for LipOpt with Gurobi solver. For HR-2 and SHOR, we use the YALMIP toolbox (MATLAB) [23] with MOSEK as a backend to calculate the Lipschitz constants for random networks. For trained network, we implement our algorithm on Julia [4] with MOSEK optimizer to accelerate the computation.

Remark: The crossover option 3 in Gurobi solver is activated by default, and it is used to transform the interior solution produced by barrier into a basic solution. We deactivate this option in our experiments since this computation is unnecessary and takes a lot of time. Throughout this paper, running time is referred to the time taken by the LP/SDP solver (Gurobi/Mosek) and OfM means running out of memory during building up the LP/SDP model.

5.1 Lipschitz Constant Estimation

Random Networks. We first compare the upper bounds for (80, 80) networks, whose weights and biases are randomly generated. We use the codes provided by [21] to generate networks with various sparsities. For each fixed sparsity, we generate 10 different random networks, and apply all the

Table 2: Comparison of upper bounds of global Lipschitz constant and solver running time on trained network SDP-NN obtained by HR-2, SHOR, LipOpt-3 and LBS. The network is a fully connected neural network with one hidden layer, with 784 nodes in the input layer and 500 nodes in the hidden layer. The network is for 10-classification, we calculate the upper bound with respect to label 2.

image

methods to them repeatedly. Then we compute the average upper bound and average running time of those 10 experiments.

image

Figure 1: Lipschitz constant upper bounds and solver running time with respect to  L∞norm obtained by HR-2, SHOR, LipOpt-3, LipOpt-4 and LBS. We generate random networks of size (80, 80) with sparsity 10, 20, 30, 40, 50, 60, 70, 80. In the meantime, we display median and quartiles over 10 random networks draws.

Figure 1 displays a comparison of average upper bounds of global and local Lipschitz constants. For global bounds, we can see from Figure 1a that when the sparsity of the network is small (10, 20, etc.), the LP-based method LipOpt-3 is slightly better than the SDP-based method HR-2. As the sparsity increases, HR-2 provides tighter bounds. Figure 1b shows that LipOpt-3 is more efficient than HR-2 only for sparsity 10. When the networks are dense or nearly dense, our method not only takes much less time, but also gives tighter upper bounds. For global Lipschitz constant estimation, SHOR and HR-2 give nearly the same upper bounds. However, in the local case, HR-2 provides strictly tighter bounds than SHOR. In both global and local cases, SHOR has smaller computational time than HR-2. In Appendix F and G, we present more results of global and local Lipschitz constant bounds for networks of various sizes and sparsities.

Trained Network We use the MNIST classifier (SDP-NN) described in [30]4. The network is of size (784, 500). In Table 2, we see that the LipOpt-3 algorithm runs out of memory when applied to the real network SDP-NN to compute the global Lipschitz bound. In contrast, SHOR and HR-2 still work and moreover, HR-2 provides tighter upper bounds than SHOR in both global and local cases. As a trade-off, the running time of HR-2 is around 5 times longer than that of SHOR.

5.2 Robustness Certification

Binary Classifier We generate 20000 random points in dimension 80, where half of them are concentrated around the sphere with radius 1 (labeled as 1) and the rest are concentrated around the sphere with radius 2 (labeled as 2). We train a (80, 80) network for this binary classification task, and use sigmoid layer as the output layer. Therefore, the input is labeled as 1 (resp. 2) if the output is negative (resp. positive). For an input  x0and a perturbation  ϵ, we compute an upper bound of the Lipschitz constant with respect to  x0, ϵand  L∞-norm, denoted by  Lx0,ϵ. Then, if the output  y0is negative (resp. positive) and  y0 + ϵLx0,ϵ < 0(resp.  y0 − ϵLx0,ϵ > 0), the input  x0is ϵ-robust. With this criteria, if we have n inputs to certify, then we need to execute n experiments.

Table 3: Comparison of ratios of certified examples for (80, 80) network by HR-2 and LipOpt-3.

image

Table 4: Ratios of certified test examples for SDP-NN network by HR-2.

image

However, for large networks such as MNIST networks, this is impractical, even if we certify the network directly without using Lipschitz constants (see [30]). Therefore, we compute the Lipschitz constant with respect to  x0 = 0and  ϵ = 3, denoted by L, and generate  106random points in the box  B = {x : ||x||∞ ≤ 2.9}. For any  ϵ ≤ 0.1and  x0 ∈ B, we have  Lx0,ϵ ≤ L, then the point  x0is  ϵ-robust if  y0(y0 − sign(y0)ϵL) > 0. Instead of running  106times the algorithm, we are able to certify robustness of large number of inputs by only doing one experiment. Table 3 shows the ratio of certified points in the box B by HR-2 and LipOpt-3. We see that HR-2 can always certify more examples than LipOpt-3.

Multi-Classifier As described in Section 5.1, the SDP-NN network is a well-trained (784, 500) network to classify the digit images from 0 to 9. Denote the parameters of this network by  A ∈R500×784, b1 ∈ R500, C ∈ R10×500, b2 ∈ R10. The score of an input x is denoted by  yx, i.e., yx = C · ReLU(Ax0 + b1) + b2. The label of x, denoted by  rx, is the index with the largest score, i.e.,  rx = arg max yx. Suppose an input  x0 has label r. For ϵ and x such that ||x − x0||∞ ≤ ϵ, if forall  i ̸= r, yxi −yxr < 0, then x0 is ϵ-robust. Alternatively, denote by  Lx0,ϵi,rthe local Lipschitz constant of function  fi,r(x) = yxi − yxr with respect to  L∞-norm in the ball  {x : ||x − x0||∞ ≤ ϵ}. Then thepoint  x0is  ϵ-robust if for all  i ̸= r, fi,r(x0) + ϵLx0,ϵi,r < 0. Since the  28 × 28MNIST images are flattened and normalized into vectors taking value in [0, 1], we compute the local Lipschitz constant (by HR-2) with respect to  x0 = 0 and ϵ = 2, the complete value is referred to matrix L in Appendix H. We take different values of  ϵfrom 0.01 to 0.1, and compute the ratio of certified examples among the 10000 MNIST test data by the Lipschitz constants we obtain, as shown in Table 4. Note that for ϵ = 0.1, we improve a little bit by 67% compared to Grad-cert (65%) described in [30], as we use an exact formulation of the derivative of ReLU function.

Optimization Aspect: In this work, we propose a new heuristic moment relaxation based on the dense and sparse Lasserre’s hierarchy. In terms of performance, our method provides bounds no worse than Shor’s relaxation and better bounds in many cases. In terms of computational efficiency, our algorithm also applies to nearly sparse polynomial optimization problems without running into computational issue.

Machine Learning Aspect: The ReLU function and its generalized derivative G(x) are semialgebraic. This semialgebraic character is easy to handle exactly in polynomial optimization (via some lifting) so that one is able to apply moment relaxation techniques to the resulting POP. Moreover, our heuristic moment relaxation provides tighter bounds than Shor’s relaxation and the state-of-the-art LP-based algorithm in [21].

Future research: The heuristic relaxation is designed for QCQP (e.g. problem (LCEP) for 1-hidden layer networks). As the number of hidden layer increases, the degree of the objective function also increases and the approach must be combined with lifting or sub-moment techniques described in Appendix E, in order to deal with higher-degree objective polynomials. Efficient derivation of approximate sparse certificates for high degree polynomials should allow to enlarge the spectrum of applicability of such techniques to larger size networks and broader classes of activation functions. This is an exciting topic of future research.

Developing optimization methods resulting in numerical certificates is a necessary step toward the verification of systems involving AI trained components. Such systems are expected to be more and more common in the transport industry and constitute a major challenge in terms of certification. We believe that polynomial optimization is one promising tool to address this challenge.

This work has benefited from the AI Interdisciplinary Institute ANITI funding, through the French “Investing for the Future – PIA3” program under the Grant agreement n°ANR-19-PI3A-0004. Edouard Pauwels acknowledges the support of Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026 and FA9550-18-1-0226, and ANR MasDol. Victor Magron was supported by the FMJH Program PGMO (EPICS project) and EDF, Thales, Orange et Criteo, the Tremplin ERC Stg Grant ANR-18-ERC2-0004-01 (T-COPS project) as well as the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions, grant agreement 813211 (POEMA). A large part of this work was carried out as Tong Chen was a master intern at IRT-Saint-Exupéry, Toulouse, France.

[1] Amir Ali Ahmadi and Anirudha Majumdar. Dsos and sdsos optimization: Lp and socp-based alternatives to sum of squares optimization. In 2014 48th annual conference on information sciences and systems (CISS), pages 1–5. IEEE, 2014.

[2] Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 291–301, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223, 2017.

[4] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65–98, 2017.

[5] Jérôme Bolte and Edouard Pauwels. Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning. arXiv preprint arXiv:1909.10300, 2019.

[6] Akhilan Boopathy, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Cnn-cert: An efficient framework for certifying robustness of convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3240–3247, 2019.

[7] Patrick L Combettes and Jean-Christophe Pesquet. Lipschitz certificates for neural network structures driven by averaged activation operators. arXiv preprint arXiv:1903.01014, 2019.

[8] Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy A Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. In UAI, pages 550–559, 2018.

[9] Mahyar Fazlyab, Manfred Morari, and George J Pappas. Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. arXiv preprint arXiv:1903.01287, 2019.

[10] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George J. Pappas. Efficient and accurate estimation of lipschitz constants for deep neural networks, 2019.

[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.

[12] Didier Henrion, Jean-Bernard Lasserre, and Johan Löfberg. Gloptipoly 3: moments, optimization and semidefinite programming. Optimization Methods & Software, 24(4-5):761–779, 2009.

[13] Todd Huster, Cho-Yu Jason Chiang, and Ritu Chadha. Limitations of the lipschitz constant as a defense against adversarial examples. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 16–29. Springer, 2018.

[14] Sham M Kakade and Jason D Lee. Provably correct automatic sub-differentiation for qualified programs. In Advances in neural information processing systems, pages 7125–7135, 2018.

[15] Jean-Louis Krivine. Anneaux préordonnés. Journal d’Analyse Mathématique, 12(1):307–326, 1964.

[16] J. B. Lasserre. The Moment-SOS Hierarchy. In B. Sirakov, P. Ney de Souza, and M. Viana, editors, Proceedings of the International Congress of Mathematicians (ICM 2018), volume 4, pages 3773–3794, Rio de Janeiro, 2019. World Scientific, Singapore.

[17] Jean B Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on optimization, 11(3):796–817, 2001.

[18] Jean B Lasserre. Convergent sdp-relaxations in polynomial optimization with sparsity. SIAM Journal on Optimization, 17(3):822–843, 2006.

[19] Jean B Lasserre, Kim-Chuan Toh, and Shouguang Yang. A bounded degree sos hierarchy for polynomial optimization. EURO Journal on Computational Optimization, 5(1-2):87–117, 2017.

[20] Jean Bernard Lasserre. An introduction to polynomial and semi-algebraic optimization, volume 52. Cambridge University Press, 2015.

[21] Fabian Latorre, Paul Rolland, and Volkan Cevher. Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, 2020.

[22] Monique Laurent. A comparison of the sherali-adams, lovász-schrijver, and lasserre relaxations for 0–1 programming. Mathematics of Operations Research, 28(3):470–496, 2003.

[23] Johan Löfberg. Yalmip: A toolbox for modeling and optimization in matlab. In Proceedings of the CACSD Conference, volume 3. Taipei, Taiwan, 2004.

[24] V. Magron, G. Constantinides, and A. Donaldson. Certified roundoff error bounds using semidefinite programming. ACM Trans. Math. Softw., 43(4):1–34, 2017.

[25] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[26] D. K. Molzahn and I. A. Hiskens. A survey of relaxations and approximations of the power flow equations. Foundations and Trends® in Electric Energy Systems, 4(1-2):1–221, 2019.

[27] D. K. Molzahn, I. A. Hiskens, C. Josz, and P. Panciatici. Computational analysis of sparsityexploiting moment relaxations of the opf problem. In Proceedings of the PSCC Conference. Genoa, Italy, IEEE, 2016.

[28] Jiawang Nie. Optimality conditions and finite convergence of lasserre’s hierarchy. Mathematical programming, 146(1-2):97–121, 2014.

[29] Mihai Putinar. Positive polynomials on compact semi-algebraic sets. Indiana University Mathematics Journal, 42(3):969–984, 1993.

[30] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.

[31] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pages 10877–10887, 2018.

[32] Naum Z Shor. Quadratic optimization problems. Soviet Journal of Computer and Systems Sciences, 25:1–11, 1987.

[33] Gilbert Stengle. A nullstellensatz and a positivstellensatz in semialgebraic geometry. Mathematische Annalen, 207(2):87–97, 1974.

[34] Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356, 2017.

[35] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, pages 6541–6550, 2018.

[36] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems, pages 3835–3844, 2018.

[37] Hayato Waki, Sunyoung Kim, Masakazu Kojima, and Masakazu Muramatsu. Sums of squares and semidefinite program relaxations for polynomial optimization problems with structured sparsity. SIAM Journal on Optimization, 17(1):218–242, 2006.

[38] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S Dhillon, and Luca Daniel. Towards fast computation of certified robustness for relu networks. arXiv preprint arXiv:1804.09699, 2018.

[39] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. arXiv preprint arXiv:1801.10578, 2018.

[40] Eric Wong and J Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

[41] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in neural information processing systems, pages 4939–4948, 2018.

Denote by D(x) the sub-differential of ReLU function, i.e. D(x) = 1 for x > 0, D(x) = 0 for x < 0 and D(x) = [0, 1] for x = 0.

According to [5], G has a closed graph and compact values. Furthermore, it holds that  G(t) ⊆ D(t)for all  t ∈ R. Adopting the terminology from [5], D is conservative for the ReLU function, which implies that G is conservative for the ReLU function as well [5, Remark 3(e)]. The formulation GFk(x0) = (�mi=1 ATi diag(G(zi)))ckis an application of the chain rule of differentiation, where along each chain the conservative set-valued field G is used in place of derivative for the ReLU function. By [5, Lemma 2], chain rule preserves conservativity, hence  GFkis a conservative mapping for function  Fk. By conservativity [5], using convexity of  Ωwe have for all  x, y ∈ Ω, integrating along the segment.

image

Remark (Being a gradient almost everywhere is not sufficient in general): As suggested by anonymous AC5, Lemma 1 may admit a simpler proof. Indeed the Clarke subdifferential is the convex hull of limits of sequences of gradients. For Lipschitz constant, we want the maximum norm element, which necessarily happens at a corner of the convex hull, therefore for our purposes it suffices to consider sequences. Since the ReLU network will be almost-everywhere differentiable, we can consider a shrinking sequence of balls around any point, and we will have gradients which are arbitrarily close to any corner of the differential at our given point. Therefore, the norms will converge to that norm, and thus it suffices to optimize over differentiable points, and what we choose at the nondifferentiability does not matter.

The above argument is essentially valid because ReLU only contains univariate nondifferentiability which is very specific. However the argument is implicitely based on the idea that the composition of almost everywhere differentiable functions complies with calculus rules, which is not correct in general due to lack of injectivity. For more general networks, what we choose at the nondifferentiability does matter. Indeed, consider the following functions

image

The composition  G ◦ Fis the identity on R and both F and G are differentiable almost every where. Consider the mappings

image

JFis the Jacobian of  F and JGis the gradient of G almost everywhere. Now the product

image

for all  x ∈ R. Hence computing the product of these gradient almost everywhere operators gives value 0, suggesting that the function is constant, while it is not. The reason for the failure here is that JG, despite being gradient almost everywhere, is not conservative for G. For this reason, the product does not provide any notion of Lipschicity and the choice at nondifferentiability point does matter for G.

Overall, we believe that the proof suggested by AC is correct for ReLU networks because they are built only using univariate nondifferentiabilities. However the argument that, the choice made at nondifferentiability point does not have an impact, is not correct for more general networks. This discussion is kept explicit in this appendix since many other types of semialgebraic activation functions are used in deep learning, such as max pooling, for which one may use similar approaches as what we proposed to evaluate Lipschitz constants. For these more complex nondifferentiability, concervativity will be essential to ensure that one obtains proper Lipschitz constants, beyond the argument that beging differentiable almost everywhere implies that the choice made at nondifferentiability point does not matter.

B.1 Dense Case 3.1 For illustration purpose and without going into details, consider the following simple example where we want to minimize  x1x2over the unit disk on  R2. That is:

image

For d  = 1, y = {y00, y01, y10, y20, y11, y02} ∈ R6, Ly(f) = y11, and

image

As  ω = ⌈deg(g)/2⌉ = 1, M0(gy) ⪰ 0simply translates to the linear constraint  Ly(g) = 1 − y20 −y02 ≥ 0. Therefore (MomOpt-d) with d = 1 reads:

image

with optimal value  ρ1 = −1/2 = f ∗. It turns out that (6) is exactly Shor’s relaxation applied to (5). In fact, for QCQP the first-order moment relaxation (i.e., (MomOpt-d) with d = 1) is exactly Shor’s relaxation.

image

Define the subsets  I1 = {1, 2}, I2 = {2, 3}. It is easy to check that assumptions A1, A2, A3 and A4 hold. Define  y = {y000, y100, y010, y001, y200, y110, y101, y020, y011, y002} ∈ R10. For d = 1, thefirst-order dense moment matrix reads:

image

whereas the sparse moment matrix  M1(y, I1)(resp.  M1(y, I2)) is the submatrix of  M1(y)taking red and pink (resp. blue and pink) entries. That is,  M1(y, I1)and  M1(y, I2)are submatrices of M1(y), obtained by restricting to rows and columns concerned with subsets  I1 and I2 only.

The primal and dual of Lasserre’s hierarchy (MomOpt-d) nicely illustrate the duality between moments and positive polynomials. Indeed for each fixed d, the dual of (MomOpt-d) reads:

image

where  σ0is a sum-of-squares (SOS) polynomial of degree at most 2d, and  σjare SOS polynomials of degree at most  2(d − ωi), ωi = ⌈deg(gj)/2⌉. The right-hand-side of the identity in (SOS-d) is nothing less than Putinar’s positivity certificate [29] for the polynomial  x �→ f(x) − ton the compact semialgebraic set  {x : gi(x) ≥ 0, i ∈ [p]}.

Similarly, the dual problem of (MomSpOpt-d) reads:

image

where  σ0,kare SOS in  R[xIk]of degree at most 2d, and  σj,kare SOS in  R[xIk]of degree at most 2(d − ωi), ωi = ⌈deg(gj)/2⌉. Then (SpSOS-d)implements the sparse Putinar’s positivity certificate [18, 37].

Consider problem (NlySpOpt). We already have a sparsity pattern with subsets  Ikand an additional “bad” constraint  g ≥ 0(assumed to be quadratic). Then we consider the sparse moment relaxations (MomSpOpt-d) applied to (NlySpOpt) without the bad constraint  g ≥ 0and simply add two constraints: (i) the moment constraint  M1(y) ⪰ 0(with full dense first-order moment matrix  M1(y)), and (ii) the linear moment inequality constraint  Ly(g) ≥ 0(which is the lowest-order localizing matrix constraint  M0(g y) ⪰ 0).

To see why the full moment constraint  M1(y) ⪰ 0is needed, consider the toy problem (7). Recall that the subsets we defined are  I1 = {1, 2}, I2 = {2, 3}. Now suppose that we need to consider an additional “bad” constraint  (1 − x1 − x2 − x3)2 = 0. After developing  Ly(g), one needs to consider the moment variable  y103corresponding to the monomial  x1x3in the expansion of g = (1 − x1 − x2 − x3)2, and  y103does not appear in the moment matrices  Md(y, I1)and Md(y, I2) because x1 and x3are not in the same subset. However  y103appears in  M1(y) (which isa  n × n matrix).

Now let us see how this works for problem (LCEP). First introduce new variables  ziwith associated constraints  zi − Aixi−1 − bi = 0, so that all “bad” constraints are affine. Equivalently, we may and will consider the single “bad" constraint  g ≥ 0 with g(z1, . . . , x0, x1, . . .) = − �i ||zi − Axi−1 −bi||2and solve (MomNlySpOpt-d). We briefly sketch the rationale behind this reformulation. Let (yd)d∈Nbe a sequence of optimal solutions of (MomNlySpOpt-d). If d → ∞, then yd → y (possiblyfor a subsequence  (dk)k∈N), and ycorresponds to the moment sequence of a measure  µ, supported on {(x, z) : gi(x, z) ≥ 0, i ∈ [p];�g dµ ≥ 0}. But as −gis a square,�gdµ ≥ 0 implies g = 0, µ-a.e.,and therefore  zi = Axi−1 + bi, µ-a.e. This is why we do not need to consider the higher-order constraints  Md(g y) ⪰ 0 for d > 0; only M0(g y) ⪰ 0 (⇔ Ly(g) ≥ 0) suffices. In fact, we impose the stronger linear constraints  Ly(g) = 0 and Ly(zi − Axi−1 − bi) = 0 for all i ∈ [p].

As discussed at the end of Section 4, for 2-hidden layer networks, one needs to reduce the objective function to degree 2 so that the HR-2 algorithm can be adapted to problem (LCEP). Precisely, problem (LCEP) for 2-hidden layer networks is the following POP:

max xi,ui,ttT AT1 diag(u1)AT2 diag(u2)c (LCEP-MLP2)

image

E.1 Lifting Technique

Define new decision variable  s := u1uT2, so that the degree of objective is reduced to 2. Problem (LCEP-MLP2) can now be reformulated as:

image

For (ReducedLCEP-MLP2), we have  p1p2more variables (s) and constraints (s = u1uT2), where u1 ∈ Rp1 and u2 ∈ Rp2. Even when  p1 = p2 = 100, we add 10000 variables and constraints, which will cause a memory issue (no SDP solver is able to handle matrices of size  O(104)). This is why we use the following approximation technique as a remedy.

E.2 Heuristic Relaxation for Cubic Terms

In this section, we introduce an alternative technique to handle the cubic terms  tiuj1uk2appearing in the objective function of problem (LCEP-MLP2). Recall that the main obstacle that prevents us from applying the HR-2 method is that we don’t have the moments for cubic terms  tiuj1uk2in the first-order moment matrix  M1(y, {ti, uj1, uk2})). Precisely, we only have the moments of quadratic terms in  M1(y, {ti, uj1, uk2})):

image

The moments of cubic terms  tiuj1uk2lie in the second-order moment matrix  M2(y, {ti, uj1, uk2}), which is of size�3+22 �= 10. However, since we only need the moments of the cubic terms, a submatrix of  M2(y) suffices:

image

Thus, in order to obtain the moments of cubic terms, one only needs to put  M1(y)and Msub2 (y, {ti, uj1, uk2})for each cubic term  tiuj1uk2together. Recall that for problem (LCEP-MLP2), t ∈ Rp0, u1 ∈ Rp1, u2 ∈ Rp2. Define the subsets for (LCEP-MLP2)as  Ii = {xi0, ti}for  i ∈ [p0]; Jj1 = {xj1, zj1}, Jj2 = {uj1, zj1}for  j ∈ [p1]; Kk = {uk2, zk2}for  k ∈ [p2]. Then the second-order

heuristic relaxation (HR-2) for problem (LCEP-MLP2) reads as:

image

M1(xj1y, Jj1) ⪰ 0, M1((xj1 − zj1)y, Jj1) ⪰ 0, j ∈ [p1] ;M2(y, Ii1) ⪰ 0, M1((1 − (ti)2)y, Ii) ⪰ 0 ,

image

In this way, we add  p0p1p2moment matrices  Msub2 (y, {ti, uj1, uk2})of size 3, and  p0p1p2 + p1p2moment variables  Ly(tiuj1uk2), Ly((uj1)2(uk2)2). A variant of this technique is to enlarge the size of the moment matrices but in the meantime reduce the number of moment matrices. For instance, consider the following submatrix of the second-order moment matrix  M2(y, {t, y1, uk2}):

image

We have all the moments of the cubic terms  tiuj1uk2 from those Msub2 (y, {t, u1, uk2}). However, in this case, we only add  p2moment matrices  Msub2 (y, {t, u1, uk2}) of size 1 + p0 + p1, and p0p1p2 + p21p2new variables  Ly(u1tT uk2), Ly(u1uT1 (uk2)2). Note that we can also use the first-order heuristic relaxation (HR-1), which is formulated as:

image

We use the experimental settings described in Section 5.

F.1 1-Hidden Layer Networks

Figure 2 displays the average upper bounds of global Lipschitz constants and the time of different algorithms for 1-hidden layer random networks of different sizes and sparsities. We can see from Figure 2a that when the size of the network is small (10, 20, etc.), the LP-based method LipOpt-3 is slightly better than the SDP-based method HR-2. However, when the size and sparsity of the network increase, HR-2 provides tighter bounds. From Figure 2b, we can see that LipOpt-3 is more efficient than HR-2 only when the size or the sparsity of the network is small (for (10, 10) networks, or for (40, 40) networks of sparsity 5, etc.). When the networks are dense or nearly dense, our method not only takes much less time, but also gives much tighter upper bounds. For global Lipschitz constant estimation, SHOR and HR-2 give nearly the same upper bounds. This is because the sizes of the toy networks are quite small. For big real network, as shown in Table 2, HR-2 provides strictly tighter bound than SHOR. Finally, SHOR is more efficient than HR-2 and LipOpt-3 in terms of computational complexity.

image

Figure 2: Global Lipschitz constant upper bounds (left) and solver running time (right) for 1-hidden layer networks with respect to  L∞-norm obtained by SHOR, HR-2, LipOpt-3, LipOpt-4 and LBS. We generate random networks of size 10, 20, 40, 80. For size 10, we consider sparsity 4, 8, 12, 16, 20; for size 20, we consider sparsity 8, 16, 24, 32, 40; for size 40 and 80, we consider sparsity 10, 20, 30, 40, 50, 60, 70, 80. In the meantime, we display median and quartiles over 10 random networks draws.

F.2 2-Hidden Layer Networks

For 2-hidden layer networks, we use the technique introduced in Appendix E in order to deal with the cubic terms in the objective. Figure 3 displays the average upper bounds of global Lipschitz constants and the running time of different algorithms for 2-hidden layer random networks of different sizes and sparsities. We can see from Figure 3a that the SDP-based method HR-2 performs worse than the LP-based method LipOpt-3 for networks of size (10, 10, 10). However, as the size and the sparsity of the network increase, the difference between HR-2 and LipOpt-3 becomes smaller (and HR-2 performs even better). For networks of size (20, 20, 10), (30, 30, 10) and (40, 40, 10), with sparsity greater than 10, HR-2 provides strictly tighter bounds than LipOpt-3. This fact has already been shown in Table 1 (right), HR-1 and HR-2 give consistently tighter upper bounds than LipOpt-3, with the price of higher computational time.

We use the experimental settings described in Section 5.

G.1 1-Hidden Layer Networks

Figure 4 displays the average upper bounds of local Lipschitz constants and the running time of different algorithms for 1-hidden layer random networks of different sizes and sparsities. By contrast with the global case, we can see from Figure 4a that HR-2 gives strictly tighter upper bounds than SHOR when the sparsity is larger than 40. As a trade-off, HR-2 takes more computational time than SHOR. According to Figure 4b, the running time of HR-2 is around 5 times longer than SHOR. Similar to the global case, LipOpt-3 performs well when the network is sparse. However, when the sparsity increases, HR-2 and SHOR are more efficient and provide better bounds than LipOpt-3.

G.2 2-Hidden Layer Networks

For 2-hidden layer networks, we use the approximation technique described in Appendix E in order to reduce the objective to degree 2. Figure 5a and 5b displays the average upper bounds of local Lipschitz constants and the running time of different algorithms for 2-hidden layer random networks of different sizes and sparsities. By contrast with the global case, we can see from Figure 5a that HR-2 gives strictly tighter upper bounds than HR-1 when the sparsity is larger than 40. As a trade-off,

image

Figure 3: Global Lipschitz constant upper bounds (left) and solver running time (right) for 2-hidden layer networks with respect to  L∞-norm obtained by HR-2, HR-1, LipOpt-3, LipOpt-4 and LBS. We generate random networks of size 10, 20, 30, 40. For size (10, 10, 10), we consider sparsity 4, 8, 12, 16, 20; for size (20, 20, 10), we consider sparsity 4, 8, 12, 16, 20, 24, 28, 32, 36, 40; for size (30, 30, 10), we consider sparsity 10, 20, 30, 40, 50, 60; for size (40, 40, 10), we consider sparsity 10, 20, 30, 40, 50, 60, 70, 80. In the meantime, we display median and quartiles over 10 random networks draws.

image

Figure 4: Local Lipschitz constant upper bounds (left) and solver running time (right) for 1-hidden layer networks with respect to  L∞-norm obtained by HR-2, SHOR, LipOpt-3 and LBS. By default, ε = 0.1. We generate random networks of size 10, 20, 40, 80. For size 10, we consider sparsity 4, 8, 12, 16, 20; for size 20, we consider sparsity 8, 16, 24, 32, 40; for size 40 and 80, we consider sparsity 10, 20, 30, 40, 50, 60, 70, 80. In the meantime, we display median and quartiles over 10 random networks draws.

HR-2 takes more computational time than HR-1. According to Figure 5b, the running time of HR-2 is just around 3 times longer than HR-1. Similar to the global case, LipOpt-3 performs well when the network is sparse. However, when the sparsity increases, HR-2 and HR-1 provide better bounds than LipOpt-3, with the price of higher computational time..

image

Figure 5: Local Lipschitz constant upper bounds (left) and solver running time (right) for 2-hidden layer networks with respect to  L∞-norm obtained by HR-2, HR-1, LipOpt-3 and LBS. By default, ε = 0.1. We generate random networks of size 20, 30, 40, 50. For size (10, 10, 10), we consider sparsity 4, 8, 12, 16, 20; for size (20, 20, 10), we consider sparsity 4, 8, 12, 16, 20, 24, 28, 32, 36, 40; for size (30, 30, 10), we consider sparsity 10, 20, 30, 40, 50, 60; for size (40, 40, 10), we consider sparsity 10, 20, 30, 40, 50, 60, 70, 80. In the meantime, we display median and quartiles over 10 random networks draws.

The matrix of Lipschitz constants of function  fi,j(defined in the second paragraph of Section 5.2) with respect to input  x0 = 0, ϵ = 2 and norm || · ||∞:

image

where  L = (Lij)i̸=j. Note that if we replace the vector  c in (LCEP) by −c, the problem is equivalent to the original one. Therefore, the matrix L is symmetric, and we only need to compute 45 Lipschitz constants (the upper triangle of L).

Figure 6 shows several certified and non-certified examples taken from the MNIST test dataset.

image

Figure 6: Examples of certified points (above) and non-certified points (bellow).


Designed for Accessibility and to further Open Science