Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-Layer Networks

2020·Arxiv

Abstract

Abstract

We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs) in terms of a single convex program with number of variables polynomial in the number of training samples and the number of hidden neurons. Our theory utilizes semi-infinite duality and minimum norm regularization. We show that ReLU networks trained with standard weight decay are equivalent to block penalized convex models. Moreover, we show that certain standard convolutional linear networks are equivalent semi-definite programs which can be simplified to regularized linear models in a polynomial sized discrete Fourier feature space.

1. Introduction

In this paper, we introduce a finite dimensional, polynomialsize convex program that globally solves the training problem for two-layer neural networks with rectified linear unit (ReLU) activation functions. The key to our analysis is a generic convex duality method we introduce, and is of independent interest for other non-convex problems. We further prove that strong duality holds in a variety of architectures.

1.1. Related work and overview

Convex neural network training was considered in the literature (Bengio et al., 2006; Bach, 2017). However, convexity arguments in the existing work are restricted to infinite width networks, where infinite dimensional optimization problems need to be solved. In fact, adding even a single neuron to the model requires the solution of a non-convex problem where no efficient algorithm is known (Bach, 2017). In this work, we develop a novel duality theory and introduce polynomial-time finite dimensional convex programs, which are exact and computationally tractable.

1Department of Electrical Engineering, Stanford University, CA, USA. Correspondence to: Mert Pilanci <pilanci@stanford.edu>, Tolga Ergen <ergen@stanford.edu>.

Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Several recent studies considered over-parameterized neural networks, where the width approaches infinity by leveraging connections to kernel methods, and showed that randomly initialized gradient descent can fit all the training samples (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019). However, in this kernel regime, the analysis shows that almost no hidden neurons move from their initial values to actively learn useful features (Chizat & Bach, 2018). Experiments also confirm that the kernel approximation as the width tends to infinity is unable to fully explain the success of non-convex neural network models (Arora et al., 2019). On the contrary, our work precisely characterizes the mechanism behind extraordinary modeling capabilities of neural networks for any finite number of hidden neurons. We prove that networks with ReLU are identical to convex regularization methods in a finite higher dimensional space.

Consider a two-layer network

where are the weights for hidden and output layers, respectively, and is the ReLU activation. We extend the definition of scalar functions to vectors/matrices entry-wise. We use to denote the unit . We denote the set of integers from 1 to n as [n]. We also use to denote singular values.

In order to keep the notation simple and clearly convey the main idea, we will restrict our attention to two-layer ReLU networks with scalar output trained with squared loss. All of our results immediately extend to vector outputs, tensor inputs, arbitrary convex classification and regression loss functions, and other network architectures (see Appendix).

Given a data matrix , a label vector a regularization parameter , consider minimizing the squared loss objective and squared -norm of all parameters

Figure 1: Sets involved in the construction of the Neural Gauge. Ellipsoidal set, rectified ellipsoid and the polar of

The above objective is highly non-convex due to non-linear ReLU activations and product between hidden and outer layer weights. The best known algorithm for globally minimizing the above objective is a brute-force search over all possible piece-wise linear regions of ReLU activations of m neurons and output layer sign patterns (Arora et al., 2018). This algorithm has complexity (see Theorem 4.1 in (Arora et al., 2018)). In fact, known algorithms for approximately learning m hidden neuron ReLU networks have complexity (see Theorem 5 of (Goel et al., 2017)) due to similar combinatorial explosion with m.

2. Convex Duality for Two-layer Networks

Now we introduce our main technical tool for deriving convex representations of the non-convex objective function (2). We start with the penalized representation, which is equivalent to (2) (see Appendix A.3),

Replacing the inner minimization problem with its convex dual, we obtain (see Appendix A.4)

Interchanging the order of min and max, we obtain the lower-bound via weak duality

The above problem is a convex semi-infinite optimization problem with n variables and infinitely many constraints. We will show that strong duality holds, i.e., , as long as the number of hidden neurons for some , which will be defined in the sequel. As it will be shown, can be smaller than n. The dual of the dual program (4) can be derived using standard semi-infinite programming theory (Goberna & L´opez-Cerd´a, 1998), and corresponds to the bi-dual of the non-convex problem (2).

Now we briefly introduce basic properties of signed measures that are necessary to state the dual of (4) and refer the reader to (Rosset et al., 2007; Bach, 2017) for further details. Consider an arbitrary measurable input space X with a set of continuous basis functions parameterized by . We then consider real-valued Radon measures equipped with the uniform norm (Rudin, 1964). For a signed Radon measure , we can define an infinite width neural network output for the input as . The total variation norm of the signed measure is defined as the supremum of over all continuous functions q(u) that satisfy . Consider the ReLU basis functions . We may express networks with finitely many neurons as in (1) by

which corresponds to Dirac delta measure. And the total variation norm of reduces to the

We state the dual of (4) (see Section 2 of (Shapiro, 2009) and Section 8.6 of (Goberna & L´opez-Cerd´a, 1998)) as follows

Furthermore, an application of Caratheodory’s theorem shows that the infinite dimensional bi-dual (5) always has a solution that is supported with Dirac deltas, where (Rosset et al., 2007). Therefore we have

as long as . We show that strong duality holds, i.e., in Appendix A.8 and A.11. In the sequel, we illustrate how can be determined via a finite-dimensional parameterization of (4) and its dual.

2.1. A geometric insight: Neural Gauge Function

An interesting geometric insight can be provided in the weakly regularized case where . In this case, minimizers of (3) and hence (2) approach minimum norm interpolation

We show that is the gauge function of the convex hull of where (see Appendix A.10), i.e.,

which we call Neural Gauge due to the connection to the minimum norm interpolation problem. Using classical polar gauge duality (see e.g. (Rockafellar, 1970), it holds that

where is the polar of the set . Therefore, evaluating the support function of this polar set is equivalent to solving the neural gauge problem, i.e., minimum norm interpolation . These sets are illustrated in Figure 1. Note that the polar set is always convex (see Figure 1c), which also appears in the dual problem (4) as a constraint. In particular, equal to the support function. Our finite dimensional convex program leverages the convexity and an efficient description of this set as we discuss next.

3. An Exact Finite Dimensional Convex Program

Consider diagonal matrices Diagwhere is arbitrary and is an indicator vector with Boolean elements 0]]. Let us enumerate all such distinct diagonal matrices that can be obtained for all possible , and denote them as is the number of regions in a partition of by hyperplanes passing through the origin, and are perpendicular to the rows of X. It is well known that

for (Ojha, 2000; Stanley et al., 2004; Winder, 1966; Cover, 1965) (see Appendix A.2).

Consider the finite dimensional convex problem

Theorem 1. The convex program (8) and the non-convex problem (2) where have identical optimal values1. Moreover, an optimal solution to (2) with neurons can be constructed from an optimal solution to (8) as follows

where are the optimal solutions to (8). Thus, we have

Remark 3.1. Note that optimal solutions of (8) may not be unique. As an example, merging positively colinear neurons does not change the objective value. Particularly, if there exist positively colinear solutions such as and , where and , then merging these solutions as yields the same objective. Remark 3.2. Theorem 1 shows that two-layer ReLU networks with m hidden neurons can be globally optimized via the second order cone program (8) with 2dP variables and 2nP linear inequalities where

r = rank(X). The computational complexity is at most using standard interior-point solvers. For fixed rank r (or dimension d), the complexity is polynomial in n and m, which is an exponential improvement over the state of the art (Arora et al., 2018; Bienstock et al., 2018). Note that d is a small number that corresponds to the filter size in CNNs as we illustrate in the next section. However, for fixed n and rank(X) = d, the complexity is exponential in d, which can not be improved unless P = NP even for m = 2 (Boob et al., 2018). It is interesting to note that the proposed convex program trains ReLU neural networks optimally, unlike local search heuristics such as backpropagation, which may converge to suboptimal solutions (see Section 6 for numerical evidence). To the best of our knowledge, our results provide the first polynomial-time algorithm to train non-trivial neural networks with global optimality guarantees. We also remark that further theoretical insight as well as faster numerical solvers can be developed due to the similarity to group Lasso (Yuan & Lin, 2006) and related structured regularization methods. Theorem 1 implies that ReLU networks are equivalent to sparse mixtures of linear models, where sparsity is enforced by the group convex regularizer. More specifically, the non-convex neural network approach implicitly maps the data to the higher dimensional feature matrix , and consequently seeks a group sparse model.

Remark 3.3. We note that the convex program (8) can be approximated by sampling a set of diagonal matrices . For example, one can generate , or from any distribution times, and let Diagand solve the reduced convex problem, where remaining variables are set to zero. This is essentially a type of coordinate descent applied to (8). In Section 6, we show that this approximation in fact works extremely well, often better than backpropagation. In fact, backpropagation can be viewed as a heuristic method to solve the convex objective (8). The global optima of this convex program (8) are among the fixed points of backpropagation, i.e., stationary points of (2). Moreover, we can bound the suboptimality of any feasible solution, e.g., from backpropagation, in the non-convex cost (2) using the dual of (8).

The proof of Theorem 1 can be found in Section 5.

4. Convolutional Neural Networks

Here, we introduce extensions of our approach to convolutional neural networks (CNNs). Two-layer convolutional networks with m hidden neurons (filters) of dimension d and fully connected output layer weights can be described by patch matrices 1, ..., K. This formulation also includes image, or tensor inputs. For flattened activations, we have as the net- work output. We first present a simpler case for vector regression, which is separable over the patch index k.

4.1. Separable ReLU convolutional networks

Consider the training problem

where ’s are labels. We first note that this problem is separable over the patch indices k, therefore, do not exactly correspond to classical convolutional network architectures which are not separable. The above problem can be reduced to the fully connected case (2) by defining and . Therefore, the convex program (8) solves the above problem exactly in complexity, where r is the number of variables in a single filter. Note that typical CNNs use m filters of size (r=9) in the first layer (He et al., 2016).

4.2. Linear convolutional network training as a Semi-definite Program (SDP)

We now start with the simple case of linear activations , where the training problem becomes

The corresponding dual problem is given by

Similar arguments to those used in the proof of Theorem 1, strong duality holds. Further, the maximizers of the inner problem are the maximal eigenvectors of , which are optimal neurons (filters). We can express (10) as the SDP

The dual of the above SDP is a nuclear norm penalized convex optimization problem (see Appendix A.5)

whereof singular values, i.e., nuclear norm (Recht et al., 2010).

4.3. Linear circular convolutional networks

Now, if we assume that the patches are padded with enough zeros and extracted with stride one, then the circular version of (9) can be written as

where is a circulant matrix generated by a circular shift modulo d using the elements . Then, the SDP (11) reduces to (see Appendix A.6)

where , and is the Discrete Fourier Transform (DFT) matrix.

5. Proof of the Main Result (Theorem 1)

We now prove the main result two-layer ReLU networks with squared loss2. We start with the dual representation

We now focus on a single-sided dual constraint

by considering hyperplane arrangements and a convex duality argument over each partition. We first partition into the following subsets

Let be the set of all hyperplane arrangement patterns for the matrix X, defined as the following set

It is obvious that the set H is bounded, i.e., such that . We next define an alternative representation of the sign patterns in , which is the collection of sets that correspond to positive signs for each element in H. More precisely, let

We now express the maximization in the dual constraint in (16) over all possible hyperplane arrangement patterns as

Let us define the diagonal matrix function of the subset

Note that is the complement of the set S. With this notation, we can represent

and the maximization in the dual constraint as

Enumerating all hyperplane arrangements , or equivalently , we index them in an arbitrary order via We denote . Hence, is the list of all M elements of . Next we use the strong duality result from Lemma 4 for the inner maximization problem. The dual constraint (16) can be represented as

Therefore, recalling the two-sided constraint in (15), we can represent the dual optimization problem in (15) as a finite dimensional convex optimization problem with variables , and 2M second

order cone constraints as follows

The above problem can be represented as a standard finite dimensional second order cone program. Note that the particular choice of parameters , , are strictly feasible in the above constraints as long as . Therefore Slater’s condition and consequently strong duality holds (Boyd & Vandenberghe, 2004a). The dual problem (15) can be written as

Next we introduce variables and represent the dual problem (15) as

We note that the objective is concave in and convex in . Moreover the constraint sets are convex and compact. In- voking Sion’s minimax theorem (Sion, 1958) for the inner max min problem, we may express the strong dual of the problem (15) as

Computing the maximum with respect to , , analytically we obtain the strong dual to (15) as

Now we apply a change of variables and define and . Note that we can take when without changing the optimal value. We obtain

The variables can be eliminated since and are feasible and optimal. Plugging in these expressions, we get

which is identical to (8), and proves that the objective values are equal. Constructing as stated in the theorem, and plugging in (2), we obtain the value

Figure 2: Training cost of a two-layer ReLU network trained with SGD (10 initialization trials) on a one dimensional dataset (d = 1), where Optimal denotes proposed convex programming approach in (8). SGD can be stuck at local minima for small m, while the proposed approach is optimal as guaranteed by Theorem 1.

Figure 3: Training cost of a two-layer ReLU network trained with SGD (10 initialization trials) on a two-dimensional dataset, where Optimal and Approximate denote the objective value obtained by the proposed convex program in (8) and its approximation by sampling variables, respectively. Learned decision boundaries are also depicted.

which is identical to the objective value of the convex program (8). Since the value of the convex program is equal to the value of it’s dual in (15), and , we conclude that , which is equal to the value of the convex program (8) achieved by the prescribed parameters.

6. Numerical Experiments

In this section, we present small scale numerical experiments to verify our results in the previous sections3. We first consider a one-dimensional dataset with n = 5, i.e., and , where we include the bias term by simply concatenating a column of ones to the data matrix X. We then fit these data points using a two-layer ReLU network trained with SGD and the proposed convex program, where we use squared loss as a performance metric. In Figure 2, we plot the value of the regularized objective function with respect to the iteration index. Here, we plot 10 independent realizations for SGD and denote the convex program in (8) as “Optimal”. Additionally, we repeat the same experiment for different number of neurons, particularly, m = 8, 15, and 50. As demonstrated in the figure, when the number of neurons is small, SGD is stuck at local minima. As we increase m, the number of trials that achieve the optimal performance gradually increases as well, which is also consistent with the interpretations in (Ergen & Pilanci, 2019). We also note that Optimal achieves the smallest objective value as claimed in the previous sections. We then compare the performances

0 500 1000 1500 2000 2500 3000 Iteration

0 500 1000 1500 2000 2500 3000 Iteration

Figure 4: Training cost of a two-layer ReLU network trained with SGD (10 initialization trials) on a subset of CIFAR-10 and the convex program (8) denoted as Alg1. Alg2 and Alg3, which are approximations of the convex program.

Figure 5: Training accuracy of a two-layer linear CNN trained with SGD (5 initialization trials) on a subset of CIFAR-10, where L1-Convex denotes the proposed convex program in (14). Filters found via SGD converge to the solution of (14).

on two-dimensional datasets with n = 50, m = 50 and , where we use SGD with the batch size 25 and hinge loss as a performance metric. In these experiments, we also consider an approximate convex program, i.e., denoted as “Approximate” for which we use only a random subset of the diagonal matrices of size m. As illustrated in Figure 3, most of the SGD realizations converge to a slightly higher objective than Optimal. Interestingly, we also observe that even Approximate can outperform SGD in this case. In the same figure, we also provide the decision boundaries obtained by each method.

We also evaluate the performance of the algorithms on a small subset of CIFAR-10 for binary classification (Krizhevsky et al., 2014). Particularly, in each experiment, we first select two classes and then randomly under-sample to create a subset of the original dataset. For these experiments, we use hinge loss and SGD. In the first experiment, we train a two-layer ReLU network on the subset of CIFAR-10, where we include three different versions denoted as “Alg1”, “Alg2”, and “Alg3”, respectively. For Alg1, we use a random subset of the diagonal matrices which match the sign patterns of the optimized (by SGD) network along with a randomly selected subset of possible sign patterns. Similarly, for Alg2, we use the sign patterns that match the initialized network. For Alg3, we perform a heuristic adaptive sampling for the diagonal matrices: we first examine the values of Xu for each neuron using the initial weights and flip the sign pattern corresponding to small values and use it along with the original sign pattern. In Figure 4, we plot both the objective value and the corresponding test accuracy for 10 independent realizations with n = 106, d = 100, m = 12, and batch size 25. We observe that Alg1 achieves the lowest objective value and highest test accuracy. Finally, we train a two-layer linear CNN architecture on a subset of CIFAR-10, where we denote the proposed convex program in (14) as “L1-Convex”. In Figure 5, we plot both the objective value and the Euclidean distance between the filters found by GD and L1-Convex for 5 independent realizations with n = 387, m = 30, h = 10, and batch size 60. In this experiment, all the realizations converge to the objective value obtained by L1-Convex and find almost the same filters.

7. Concluding Remarks

We introduced a convex duality theory for non-convex neural network objectives and developed an exact representation via a convex program with polynomial many variables and constraints. Our results provide an equivalent characterization of neural network models in terms of convex regularization in a higher dimensional space where the data matrix is partitioned over all possible hyperplane arrangements. ReLU neural networks can be precisely represented as convex regularizers, where piecewise linear models are fitted via an group norm regularizer. It is well known that two-layer networks have a quite rich representation power thanks to their universal approximation property. However, our results clearly show that the fitted model is parsimonious due to the group regularization, which facilitates better generalization. Thus, we believe that our characterization sheds light into the extraordinary success of ReLU networks. There are a multitude of open research directions. One can obtain a better understanding of neural networks and their generalization properties by leveraging convexity, and high dimensional regularization theory (Wainwright, 2019). In the light of our results, one can view backpropagation as a heuristic method to solve the convex program (8) and analyze the loss landscape, since the global minima are necessarily stationary points of the non-convex objective (2), i.e., fixed points of the update rule. Interesting consequences in this direction are reported in (Lacotte & Pilanci, 2020) after our work. Furthermore, one can extend our convex approach to various architectures, e.g., modern CNNs, recurrent networks, and autoencoders. Based on our methods, recently (Ergen & Pilanci, 2020a) considered convex programs for CNNs with various pooling strategies, and determined several other convex regularizers implied by the network architecture. Finally, to the best of our knowledge, our results provide the first algorithm to train non-trivial neural networks optimally. On the other hand, the popular backpropagation method is a local search heuristic, which may not find the optimal neural network and may be dramatically inefficient as shown in the experiments. Efficient optimization algorithms that exactly or approximately solve the convex program can be developed for larger scale exper- iments, including proximal and stochastic gradient methods.

Acknowledgements

This work was supported in part by the National Science Foundation under grant IIS-1838179 and Stanford SystemX Alliance.

References

Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242–252, Long Beach, California, USA, 09– 15 Jun 2019. PMLR. URL http://proceedings. mlr.press/v97/allen-zhu19a.html.

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Under- standing deep neural networks with rectified linear units. In 6th International Conference on Learning Representations, ICLR 2018, 2018.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8139–8148, 2019.

Bach, F. Breaking the curse of dimensionality with con- vex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.

Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., and Mar- cotte, P. Convex neural networks. In Advances in neural information processing systems, pp. 123–130, 2006.

Bienstock, D., Mu˜noz, G., and Pokutta, S. Principled deep neural network training through linear programming. arXiv preprint arXiv:1810.03218, 2018.

Boob, D., Dey, S. S., and Lan, G. Complexity of training relu neural network. arXiv preprint arXiv:1809.10787, 2018.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, Cambridge, UK, 2004a.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004b.

Chizat, L. and Bach, F. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

Cover, T. M. Geometrical and statistical properties of sys- tems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, (3):326–334, 1965.

Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview. net/forum?id=S1eK3i09YQ.

Edelsbrunner, H., ORourke, J., and Seidel, R. Constructing arrangements of lines and hyperplanes with applications. SIAM Journal on Computing, 15(2):341–363, 1986.

Ergen, T. and Pilanci, M. Convex optimization for shallow neural networks. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 79–83, 2019.

Ergen, T. and Pilanci, M. Training convolutional relu neural networks in polynomial time: Exact convex optimization formulations. arXiv preprint arXiv:2006.14798, 2020a.

Ergen, T. and Pilanci, M. Convex geometry of two-layer relu networks: Implicit autoencoding and interpretable models. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4024–4033, Online, 26– 28 Aug 2020b. PMLR. URL http://proceedings. mlr.press/v108/ergen20a.html.

Ergen, T. and Pilanci, M. Convex geometry and duality of over-parameterized neural networks. arXiv preprint arXiv:2002.11219, 2020c.

Ergen, T. and Pilanci, M. Convex duality of deep neural networks. arXiv preprint arXiv:2002.09773, 2020d.

Goberna, M. A. and L´opez-Cerd´a, M. Linear semi-infinite optimization. 01 1998. doi: 10.1007/978-1-4899-8044-1 3.

Goel, S., Kanade, V., Klivans, A., and Thaler, J. Reliably learning the relu in polynomial time. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 1004–1042, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.

Grant, M. and Boyd, S. CVX: Matlab software for dis- ciplined convex programming, version 2.1. http:// cvxr.com/cvx, March 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- ing for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.

Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 dataset. http://www.cs.toronto.edu/kriz/ cifar.html, 2014.

Lacotte, J. and Pilanci, M. All local minima are global for two-layer relu neural networks: The hidden convex optimization landscape. arXiv preprint arXiv:2006.05900, 2020.

Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

Ojha, P. C. Enumeration of linear threshold functions from the lattice of hyperplane intersections. IEEE Transactions on Neural Networks, 11(4):839–850, 2000.

Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.

Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, 1970.

Rosset, S., Swirszcz, G., Srebro, N., and Zhu, J. L1 regu- larization in infinite dimensional feature spaces. In International Conference on Computational Learning Theory, pp. 544–558. Springer, 2007.

Rudin, W. Principles of Mathematical Analysis. McGrawHill, New York, 1964.

Savarese, P., Evron, I., Soudry, D., and Srebro, N. How do infinite width bounded norm networks look in function space? CoRR, abs/1902.05040, 2019. URL http:// arxiv.org/abs/1902.05040.

Shapiro, A. Semi-infinite programming, duality, discretiza- tion and optimality conditions. Optimization, 58(2):133– 161, 2009.

Sion, M. On general minimax theorems. Pa-cific J. Math., 8(1):171–176, 1958. URL https://projecteuclid.org:443/euclid. pjm/1103040253.

Stanley, R. P. et al. An introduction to hyperplane arrange- ments. Geometric combinatorics, 13:389–496, 2004.

Torgo, L. Regression data sets. http://www.dcc. fc.up.pt/˜ltorgo/Regression/DataSets. html.

T¨ut¨unc¨u, R., Toh, K., and Todd, M. Sdpt3a matlab software package for semidefinite-quadratic-linear programming, version 3.0. Web page http://www. math. nus. edu. sg/mattohkc/sdpt3. html, 2001.

Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Winder, R. Partitions of n-space by hyperplanes. SIAM Journal on Applied Mathematics, 14(4):811–818, 1966.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68 (1):49–67, 2006.

Appendix Table of Contents

A Appendix 12

A.1 Additional numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

A.2 Constructing hyperplane arrangements in polynomial time . . . . . . . . . . . . . . . . . . . . . . . . . 13

A.3 Equivalence of the penalized neural network training cost . . . . . . . . . . . . . . . . . . . . . . . . 15

A.4 Dual problem for (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.5 Dual problem for (11) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.6 Dual problem for (13) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

A.7 Dual problem for vector output two-layer linear convolutional networks . . . . . . . . . . . . . . . . . . . 17

A.8 Semi-infinite strong duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

A.9 Semi-infinite strong gauge duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

A.10 Neural Gauge function and equivalence to minimum norm networks . . . . . . . . . . . . . . . . . . . . 18

A.11 Alternative proof of the semi-infinite strong duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

A.12 Finite dimensional strong duality results for Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

A.13 General loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

A. Appendix

A.1. Additional numerical results

Figure 6: Training of a two-layer ReLU network with SGD (10 initialization trials) and proposed convex programs on a two-dimensional dataset. Optimal and Approximate denote the objective value obtained by the proposed convex program (8) and its approximation, respectively. Learned decision boundaries are also depicted.

We now present another numerical experiment on a two-dimensional dataset4, where we place a negative sample (near the positive samples (y = +1) to have a more challenging loss landscape. In Figure 6, we observe that all the SGD realizations are stuck at local minima, therefore, achieve a significantly higher objective value compared to both Optimal and Approximate, which are based on convex optimization.

In addition to the classification datasets, we evaluate the performance of the algorithms on three regression datasets, i.e., the Boston Housing, Kinematics, and Bank datasets (Torgo). In Figure 7, we plot the objective value and the corresponding test error of 5 independent initialization trials with respect to time in seconds, where we use squared loss and choose n = 400, d = 13, m = 12, and batch size(bs) 25. Similarly, we plot the objective values and test errors for the Kinematics and Bank datasets in Figure 8 and 9, where (n, d, m, bs) = (4000, 8, 12, 25) and (n, d, m, bs) = (4000, 32, 12, 25), respectively. We observe that Alg1 achieves the lowest objective value and test error in both cases.

We also consider the training of a two-layer CNN architecture. In Figure 10, we provide the binary classification performance of the algorithms on a subset of CIFAR-10, where we use hinge loss and choose (n, d, m, bs) = (195, 3072, 50, 20), filter size , and stride 4. This experiment also illustrates that Alg1 achieves lower objective value and higher test accuracy compared with the other methods including GD. We also emphasize that in this experiment, we use sign patterns of a clustered subset of patches, specifically 50 clusters, as well as the GD patterns for Alg1. As depicted in Figure 11, the neurons that correspond to the sign patterns of GD matches with the neurons found by GD. Thus, the performance difference stems from the additional sign patterns found by clustering the patches.

In order to evaluate the computational complexity of the introduced approaches, in Table 1, we provide the training time of each algorithm in the main paper. This data clearly shows that the introduced convex programs outperform GD while requiring significantly less training time.

0 50 100 150 200 250 300 350 Time(s)

0 50 100 150 200 250 300 350 Time(s)

Figure 7: Training and test errors of the algorithms on the Boston Housing dataset (n = 400 and d = 13) where we run SGD independently in 5 initialization trials. For the convex program (8) approximations (Alg1, Alg2 and Alg3), crossed markers correspond to the total computation time of the convex optimization solver.

Table 1: Training time(in seconds), final objective value and test accuracy(%) of each algorithm in the main paper,where we use the CVX SDPT3 solver to optimize the convex programs.

A.2. Constructing hyperplane arrangements in polynomial time

We now consider the number of all distinct sign patterns sign(Xz) for all possible choices . Note that this number is the number of regions in a partition of by hyperplanes passing through the origin, and are perpendicular to the rows

0 200 400 600 800 1000 1200 Time(s)

0 200 400 600 800 1000 1200 Time(s)

Figure 8: Performance comparison of the algorithms on the Kinematics dataset (n = 4000 and d = 8) where we run SGD independently in 5 initialization trials. For the convex program (8) approximations (Alg1, Alg2 and Alg3), crossed markers correspond to the total computation time of the convex optimization solver.

0 100 200 300 400 500 600 Time(s)

0 100 200 300 400 500 600 Time(s)

Figure 9: Performance comparison of the algorithms on the Bank dataset (n = 4000 and d = 32) where we run SGD independently in 5 initialization trials. For the convex program (8) approximations (Alg1, Alg2 and Alg3), crossed markers correspond to the total computation time of the convex optimization solver.

of X. We now show that the dimension d can be replaced with rank(X) without loss of generality. Suppose that the data matrix X has rank r. We may express using its Singular Value Decomposition in compact form, where . For any vector we have for some . Therefore, the number of distinct sign patterns sign(Xz) for all possible is equal to the number of distinct sign patterns signfor all possible

Consider an arrangement of n hyperplanes . Let us denote the number of regions in this arrangement by . In (Ojha, 2000; Cover, 1965) it’s shown that this number satisfies

For hyperplanes in general position, the above inequality is in fact an equality. In (Edelsbrunner et al., 1986), the authors present an algorithm that enumerates all possible hyperplane arrangements time, which can be used to construct the

0 200 400 600 800 1000 1200 1400 1600 Time(s)

0 200 400 600 800 1000 1200 1400 1600 Time(s)

Figure 10: Performance of the algorithms for two-layer CNN training on a subset of CIFAR-10 (n = 195 and filter size ) where we run SGD independently in 5 initialization trials. For the convex program (8) approximations (Alg1, Alg2 and Alg3), crossed markers correspond to the total computation time of the convex optimization solver.

Figure 11: Visualization of the distance (using the Euclidean norm of the difference) between the neurons found by GD and our convex program in Figure 10. The entries of the distance plots are

respectively.

data for the convex program (8).

A.3. Equivalence of the penalized neural network training cost

In this section, we prove the equivalence between (2) and (3).

Lemma 2 ((Neyshabur et al., 2014; Savarese et al., 2019; Ergen & Pilanci, 2020b;c;d)). The following two problems are equivalent:

[Proof of Lemma 2] We can rescale the parameters as and , for any . Then, the output

becomes

which proves that the scaling does not change the network output. In addition to this, we have the following basic inequality

where the equality is achieved with the scaling choice is used. Since the scaling operation does not change the right-hand side of the inequality, we can set . Therefore, the right-hand side becomes

Now, let us consider a modified version of the problem, where the unit norm equality constraint is relaxed as us also assume that for a certain index j, we obtain as an optimal solution. This shows that the unit norm inequality constraint is not active for , and hence removing the constraint for will not change the optimal solution. However, when we remove the constraint, reduces the objective value since it yields . Therefore, we have a contradiction, which proves that all the constraints that correspond to a nonzero must be active for an optimal solution. This also shows that replacing does not change the solution to the problem.

The following lemma proves the dual form of (3).

Lemma 3. The dual form of the following primal problem

is given by the following

[Proof of Lemma 3] Let us first reparametrize the primal problem as follows

which has the following Lagrangian

Then, minimizing over yields the proposed dual form.

Let us first reparameterize the primal problem as follows

Then the Lagrangian is as follows

where . Then maximizing over M and v yields the following dual form

Let us denote the eigenvalue decomposition of as , where is the Discrete Fourier Transform matrix and is a diagonal matrix. Then, applying the scaling in Lemma 2 and then taking the dual as in Lemma 3 yields

which can be equivalently written as

Then, taking the dual of this problem gives the following

A.7. Dual problem for vector output two-layer linear convolutional networks

Vector version of the two-layer linear convolutional network training problem has the following dual

Similarly, extreme points are the maximal eigenvectors of When V = Y , and one-hot encoding is used, these are the right singular vectors of the matrix whose rows contain all the patch vectors for class c.

A.8. Semi-infinite strong duality

Note that the semi-infinite problem (4) is convex. We first show that the optimal value is finite. For , it is clear that v = 0 is strictly feasible, and achieves 0 objective value. Note that the optimal value since this value is achieved when all the parameters are zero. Consequently, Theorem 2.2 of (Shapiro, 2009) implies that strong duality holds, i.e., , if the solution set of the semi-infinite problem in (4) is nonempty and bounded. Next, we note that the solution set of (4) is the Euclidean projection of y onto the polar set which is a convex, closed and bounded set sincecan be expressed as the union of finitely many convex closed and bounded sets.

A.9. Semi-infinite strong gauge duality

Now we prove strong duality for (7). We invoke the semi-infinite optimality conditions for the dual (7), in particular we apply Theorem 7.2 of (Goberna & L´opez-Cerd´a, 1998) and use the standard notation therein. We first define the set

Note that K is the union of finitely many convex closed sets, sincecan be expressed as the union of finitely many convex closed sets. Therefore the set K is closed. By Theorem 5.3 (Goberna & L´opez-Cerd´a, 1998), this implies that the set of constraints in (15) forms a Farkas-Minkowski system. By Theorem 8.4 of (Goberna & L´opez-Cerd´a, 1998), primal and dual values are equal, given that the system is consistent. Moreover, the system is discretizable, i.e., there exists a sequence of problems with finitely many constraints whose optimal values approach to the optimal value of (15).

A.10. Neural Gauge function and equivalence to minimum norm networks

Consider the gauge function

and its dual representation in terms of the support function of the polar of conv

Since the set is a closed convex set that contains the origin, we have (Rockafellar, 1970) and . The result in Section A.8 implies that the above value is equal to the semi-infinite dual value, i.e.,

By Caratheodory’s theorem, there exists optimal solutions the above problem consisting of Dirac deltas (Rockafellar, 1970; Rosset et al., 2007), and therefore

where we define as the number of Dirac delta’s in the optimal solution to . If the optimizer is non-unique, we define as the minimum cardinality solution among the set of optimal solutions. Now consider the non-convex problem

Using the standard parameterization for norm we get

Introducing a slack variable , an equivalent representation can be written as

Note that r > 0 as long as . Rescaling variables by letting in the above program, we obtain

Suppose that . It holds that

We conclude that the optimal value of (17) is identical to the gauge function

A.11. Alternative proof of the semi-infinite strong duality

It holds that by weak duality in (4). Theorem 1 proves that the objective value of (15) is identical to the value of (2) as long as . Therefore we have

A.12. Finite dimensional strong duality results for Theorem 1

Lemma 4. Suppose are fixed diagonal matrices as described earlier, and X is a fixed matrices. The dual of the convex optimization problem

is given by

and strong duality holds.

Note that the linear inequality constraints specify valid hyperplane arrangements. Then there exists strictly feasible points in the constraints of the maximization problem. Standard finite second order cone programming duality implies that strong duality holds (Boyd & Vandenberghe, 2004b) and the dual is as specified.

A.13. General loss functions

In this section, we extend our derivations to arbitrary convex loss functions.

Consider minimizing the sum of the squared loss objective and squared -norm of all parameters

where is a convex loss function. Then, consider the following finite dimensional convex optimization problem

Let us define are optimal in (19). Theorem 5. The convex program (19) and the non-convex problem (18) where has identical optimal values. Moreover, an optimal solution to (18) can be constructed from an optimal solution to (19) as follows (8) as follows

[Proof of Theorem 5] The proof parallels the proof of the main result section and Theorem 6. We note that dual constraint set remains the same, and analogous strong duality results apply as we show next.

We also show that our dual characterization holds for arbitrary convex loss functions.

where is a convex loss function.

Theorem 6. The dual of (20) is given by

where is the Fenchel conjugate function defined as

[Proof of Theorem 6] The proof follows from classical Fenchel duality (Boyd & Vandenberghe, 2004b). We first describe (20) in an equivalent form as follows

Then the dual function is

Therefore, using the classical Fenchel duality (Boyd & Vandenberghe, 2004b) yields the claimed dual form.