Safeguarded Learned Convex Optimization

2020·Arxiv

Abstract

Abstract

Applications abound in which optimization problems must be repeatedly solved, each time with new (but similar) data. Analytic optimization algorithms can be hand-designed to provably solve these problems in an iterative fashion. On one hand, data-driven algorithms can “learn to optimize” (L2O) with much fewer iterations and similar cost per iteration as generalpurpose optimization algorithms. On the other hand, unfortunately, many L2O algorithms lack converge guarantees. To fuse the advantages of these approaches, we present a Safe-L2O framework. Safe-L2O updates incorporate a safeguard to guarantee convergence for convex problems with proximal and/or gradient oracles. The safeguard is simple and computationally cheap to implement, and it is activated only when the data-driven L2O updates would perform poorly or appear to diverge. This yields the numerical benefits of employing machine learning to create rapid L2O algorithms while still guaranteeing convergence. Our numerical examples show convergence of Safe-L2O algorithms, even when the provided data is not from the distribution of training data.

Solving scientific computing problems often requires application of efficient and scalable optimization algorithms. Data-driven algorithms can execute in much fewer iterations and with similar cost per iteration as state-of-the-art general purpose algorithms. Inspired by one such algorithm called ISTA, (Gregor and LeCun 2010) proposed treating the entries in fixed matrices/vectors of the algorithm as learnable parameters that can vary by iteration. These entries were fine-tuned to obtain optimal performance on a data set for a fixed number of iterations. Empirically, this approach converged and showed roughly a 20-fold reduction in computational cost compared to the original analytic algorithm. Several related works followed, also demonstrating numerical success (discussed below). These efforts open the door to a new class of algorithms and analyses. Analytic optimization results often provide worst-case convergence rates, and limited theory exists pertaining to instances drawn from a common distribution (e.g. data supported on a low-dimensional manifold). Most L2O methods have little or no convergence guarantees, especially on data distinct from what is seen in training. This work addresses the inquiry:

Can a safeguard be added to L2O algorithms to give convergence without significantly hindering performance? A safeguard is anything that identifies when a “bad” L2O update would occur and what to do in place of that “bad” update. Informally, what a safeguard does can be summarized by generating a sequence with updates of the form

We provide an affirmative answer to the question for convex problems with gradient and/or proximal oracles by providing such a safeguard and replacing “bad” L2O updates with updates from analytic methods. Our framework is called Safe-L2O. Since a trade-off is formed between per iteration costs and ensuring convergence, we clarify properties constituting a “practical” L2O safeguard as follows.

1. The safeguard should ensure certain forms of worst-case convergence similar to analytic algorithms.

2. The safeguard must only use known quantities related to convex problems (e.g. objective values, gradient norms).

3. Both L2O and Safe-L2O schemes should perform identically on “good” data with comparable per-iteration costs.

4. The safeguard should kick in only when “bad” L2O updates would otherwise occur. The core challenge is to create a simple safeguard that

kicks in only when needed. Unlike classic optimization al-

gorithms, exceptional L2O algorithms do not necessarily exhibit the behavior that each successive iterate is “better” than the current iterate (i.e. are not monotonically improving). Loosely speaking, this means there are cases where an L2O scheme that gets “worse” for a couple iterates yields a better final output than an L2O scheme that is required to get “better” at each iterate. The intuition behind why this can be acceptable is we are interested in the final output of the L2O algorithm and L2O schemes may learn “shortcuts.” From this insight, we deduce the safeguard should exhibit a form of trailing behavior, i.e. it should measure progress of previous iterates and only require that updates are “good” on average. If the safeguard triggers too often, then the Safe-L2O scheme’s flexibility and performance are limited. If it triggers too rarely, then the Safe-L2O scheme may exhibit highly oscillatory behavior and converge slowly.

In addition to L2O updates, our method uses a safeguard condition with the update formula from a conventional algorithm. When the “good” condition holds, the L2O update is used; when it fails, the formula from the conventional algorithm is used. In the ideal case, L2O updates are applied often and the conventional algorithm formula provides a “fallback” for exceptional cases. This fallback is designed together with the safeguard condition to ensure convergence. This also implies, even when an L2O algorithm has a fixed number of iterations with tunable parameters, the algorithm may be extended to an arbitrary number of iterations by applying the fallback to compute latter updates (see Figure 1).

Review of L2O Methods. A seminal L2O work in the context of sparse coding was by (Gregor and LeCun 2010). Numerous follow-up papers also demonstrated empirical success at constructing rapid regressors approximating iterative sparse solvers for compression, nonnegative matrix factorization, compressive sensing and other applications (Sprechmann, Bronstein, and Sapiro 2015; Wang, Ling, and Huang 2016; Wang et al. 2016; Hershey, Roux, and Weninger 2014; Yang et al. 2016). A summary of unfolded optimization procedures for sparse recovery is given by (Ablin et al. 2019) in Table A.1. The majority of L2O works pertain to sparse coding and provide limited theoretical results. Some works have interpreted LISTA in various ways to provide proofs of different convergence properties (Giryes et al. 2018; Moreau and Bruna 2017). Others have investigated structures related to LISTA (Xin et al. 2016; Blumensath and Davies 2009; Borgerding, Schniter, and Rangan 2017; Met- zler, Mousavi, and Baraniuk 2017), providing results varying by assumptions. (Chen et al. 2018) introduced necessary conditions for the LISTA weight structure to asymptotically achieve a linear convergence rate. This was followed by (Liu et al. 2019), which further simplified the weight conditions and provided a result stating that, with high probability, the convergence rate of LISTA is at most linear. The mentioned results are useful, yet can require intricate assumptions and proofs specific to the sparse coding problems. We refer readers to the recent survey (Chen et al. 2022) for a more comprehensive overview of L2O methods.

Our safeguarding scheme is related to existing works in Krasnosel’ski˘ı-Mann (KM) methods. The SuperMann scheme (Themelis and Patrinos 2019) presents a KM method that safeguards in a more hierarchical manner than ours and solely refers to the current iterate residuals (plus a summable sequence). Additionally, a similar safeguarding setup has been used for Anderson accelerated KM methods (Zhang, O’Donoghue, and Boyd 2018). These methods are not designed with L2O in mind and differ from our approach both in the assumptions used and particular update formulae.

Our Contribution. We provide a simple framework, Safe-L2O, for wrapping data-driven algorithms with convergence guarantees. This framework can be used with all L2O algorithms that solve convex problems for which proximal and/or gradient oracles are available. We incorporate multiple safeguarding procedures in a general setting and present a simple procedure for utilizing machine learning methods to instill knowledge from available data. These results form a single, general framework for use by practitioners.

Fixed Point Methods

Fixed-point iteration abstracts most of the convex optimization methods that are based on gradient and/or proximal oracles, and those methods are targets of recent L2O accelerations. Our method is based on examining the fixed-point residual, and so it applies to a wide variety of L2O methods, even complicated first-order methods (e.g. ADMM and primal-dual methods) where the underlying problems can include constraints. To provide a brief background, here we overview fixed point methods. Denote the set of fixed points of each operator by For an operator T with a nonempty fixed point set (i.e. ), we consider the fixed point problem

See Table 1 for examples of the operator T in convex minimization. We focus on fixed point iteration to give a general approach for creating sequences that converge to solutions of (2) and, thus, of the corresponding optimization problem.

The following definitions are used throughout. An operator is nonexpansive if it is 1-Lipschitz,1 i.e.

An operator T is averaged if there exists and a nonexpansive operator such that T = , with I the identity. A classic theorem states sequences generated by successively applying an averaged operator converge to a fixed point. In all of this work, we assume each operator T is averaged. This method comes from (Krasnosel’ski˘ı 1955) and (Mann 1953), which yielded adoption of the name Krasnosel’ski˘ı-Mann (KM) method. This result is stated below. Theorem 1. If an averaged operator has a nonempty fixed point set and a sequence with arbitrary initial iterate satisfies the update relation

then there is a solution to (2) such that the sequence converges to .

For reference, we briefly discuss the use of resolvents in conventional algorithms. Consider a convex function with subgradient . For , the resolvent of is defined by , i.e.

and the reflected resolvent of is

If f is closed, convex, and proper, then the resolvent is precisely the proximal operator, i.e.

Proximal operators for several well-known functions can be expressed by explicit formulas (e.g. see page 177 in (Beck 2017)). It can be shown that is nonexpansive and is averaged (Bauschke and Combettes 2017). Table 1 provides several examples of these operators in well-known optimization algorithms.

Table 1: Averaged operators for well-known algorithms. We assume and, when is multiplied by a gradient, we also assume , with L the Lipschitz constant for the gradient. The dual of a function is denoted by a superscript , and . Operators J and R are defined in equations (5) and (6), respectively. The block matrix M is . In each case, L is the Lagrangian associated with the presented problem.

Safeguarded L2O Method

This section presents the Safe-L2O framework. The safeguard acts as a wrapper around a data-driven algorithm, which is formulated in practice as a neural network. Each L2O operator, denoted throughout by , is parameterized by layerwise weights . Input data d is used to define an optimization problem (e.g. the measurement vector in a least squares problem). To make this dependence clear, we often write . Often can be viewed as forming one or multiple layers of a feed forward network. Thus, we interpret in Algorithm 1 as a feed forward network. In addition to an L2O operator , our Safe-L2O method uses a fallback operator T (unrelated to ) and a scalar sequence . Here T defines an averaged operator from the update formula of a conventional optimization algorithm. Each defines a reference value to determine whether a tentative L2O update is “good.” Each reference value in our safeguarding schemes is related to a combination of and among previous iterates i = 1, . . . , k, where .

We propose the Safe-L2O scheme in Algorithm 2. As shown in Line 1, a safeguarded L2O operator consists of an L2O network , a fallback operator T, and a parameter . Here forms layerwise weights that define the L2O update at the k-th iteration. These parameters are trained beforehand, fol-

lowing standard training methods for machine learning models (as outlined in the Appendices). Table 1 shows numerous choices of the fallback operator T for different combinations of optimization problems and algorithms In Line 2, the initial iterate is chosen to be an arbitrary (but fixed) vector . The initial iterate of the safeguard sequence is initialized using the initial iterate , an L2O update , and the fallback operator T in Line 3. From Line 4 to Line 11, a repeated loop occurs to compute each update . In Line 5 the L2O operator is applied to the current iterate get a tentative update . This is “good” if the the inequality in Line 6 holds. In such a case, the L2O update is assigned to in Line 7. Otherwise, the fallback operator T is triggered to obtain the update in Line 9. Note the initial iterate is defined such that , i.e. the first L2O update is always accepted. Lastly, the safeguard parameter is updated in Line 10. Refer to Table 2 for choices of schemes to update the safeguard parameters.

and there exists such that

Our proposed methods for choosing the sequence satisfy Assumption 2 (see Table 2). These methods are adaptive in the sense that each depends upon the weights , iterate and (possibly) previous weights and iterates. Each safeguard parameter also remains constant in k except for when the sum of residual norms and decreases to less than a geometric factor of . This allows each to trail the value of the residual norm and the residual norm to increase in k from time to time. This trailing behavior provides flexibility to the L2O updates. Our main result is below and is followed by a corollary justifying use of the schemes in Table 2 (both proven in the appendix). Theorem 2. If is a sequence generated by the inner loop in Safe-L2O and Assumptions 1 and 2 hold, then converges to a limit , i.e. . Corollary 1. If is generated by the inner loop in Safe-L2O and Assumption 1 holds, and is generated using a scheme outlined in Table 2 with and , then .

Remark 1. The appropriate frequency for the safeguard to trigger can be estimated by tuning L2O parameters for optimal performance on a training set without safeguarding and then using a validation set to test various safeguards with the L2O scheme. To avoid possible confusions, note we are not trying to prove the convergence of any standalone L2O algo- rithm. We instead 1) alarm on an L2O update when it may break convergence, 2) replace it with a fallback update, and 3) show the resulting “hybrid optimization” converges to a solution of the provided optimization problem.

Training and Averaged Operator Selection

Safe-L2O may be executed via inferences of a feed forward neural network. The input into the network is the data d, often in vector form. Input d is usually the observation we have, based on which we optimize over the variable of interest. For example, the LASSO problem in (12) is used for sparse coding, where the goal is to recover a unknown sparse vector from its noisy measurements . Assuming A is known beforehand, the input to the Safe-L2O model is the observation d. In other cases, the dictionary A can change and also be part of the input to the model. We include case-by-case discussions about what the inputs for each numerical example.

Each layer of the Safe-L2O model is designed so that its input is , to which it applies either an L2O or fallback update (following the Safe-L2O method), and outputs to the next layer. The set over which is minimized, may be

Figure 2: Plot of error versus iteration for LISTA denoising. Here ISTA is the classic algorithm, LISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of LISTA in Algorithm 2. Trained with . Inferences used and EMA(0.25). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. This plot shows the safeguard is used intermittently for k > 2.

Table 2: Rules for updating . Here , and is the statement .

chosen with great flexibility. For each application of the algorithm, the fallback operator depends upon the data d.

The “optimal” choice of parameters depends upon the application. Suppose each d is drawn from a common distribution D. Then a choice of “optimal” parameters may be identified as those for which the expected value of is minimized among , where is an appropriate cost function and K is a fixed positive integer. Mathematically, this means solves the problem

may be used (e.g. the min-max problem used by adversarial networks (Goodfellow et al. 2014)).

Numerical Examples

This section presents examples using Safe-L2O.2 We numerically investigate i) the convergence rate of Safe-L2O relative to corresponding conventional algorithms, ii) the efficacy of safeguarding procedures when inferences are performed on data for which L2O fails intermittently, and iii) the convergence of Safe-L2O schemes even when the application of L2O operators is not theoretically justified. We first use L2O from ALISTA (Liu et al. 2019) on a synthetic LASSO problem. We then use LISTA on a LASSO problem for image processing, differentiable linearized ADMM (Xie et al. 2019) on a sparse coding problem. In all three types of problems, the input data d to the L2O models is generated by

Figure 3: Plot of error versus iteration for LiADMM. Here LiADMM is the classic algorithm, L2O LiADMM is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of L2O LiADMM in Algorithm 2. Inferences used and EMA(0.75). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. The safeguard is used about 10% and 30% of the time when k = 4 and k = 5, respectively.

where is white Gaussian noise and is the hidden variable that we want to recover through a dictionary A, which is generated and beforehand and fixed. In these experiments, the ground-truth vector are sampled from a distribution, which characterizes the distribution of data of interest along with the dictionary. We denote the distribution of the input data as D. Besides these “linear” examples, we also validate Safe-L2O on a distribution of LASSO problems where the dictionary A also changes and is part of the input to the L2O models. In this case, the distributions of the dictionaries and ground-truth vectors together characterize the input distribution D.

In each example, denotes the optimal value of the ob- jective f(x; d) among all possible x. Performance is measured using a modified relative objective error:

where the expectations are estimated numerically (see the appendices for details). We use (11) rather than the expectation of relative error to avoid high sensitivity to outliers.

Our numerical results are shown in several plots. When each iterate is computed using data d drawn from the same distribution that was used to train the L2O algorithm, we say the performance is on the “seen” distribution . These plots form the primary illustrations of the speedup of L2O algorithms as compared to conventional optimization algorithms. When each d is drawn from a distribution that is different than , we refer to as the unseen distribution. These plots show the ability of the safeguard to ensure convergence. A dotted plot with square markers is also added to show the frequency of safeguard activations among test samples, with the reference axis on the right hand side of the plots. We extend the Safe-L2O methods beyond their training iterations by applying the fallback operator T; we demarcate where this extension begins by changing the Safe-L2O plots from solid to dashed lines. As Safe-L2O convergence holds whenever , we can set to be arbitrarily small (e.g. below machine precision); for simplicity, we use in the experiments (as, even in this case, it can be shown that iterates approach the solution set). Implementation details for each experiment are in the appendices.

ALISTA for LASSO

Here we consider the LASSO problem for sparse coding. Let be a sparse vector and be a dictionary. We assume access is given to noisy linear measurements , where is additive Gaussian white noise and Even for underdetermined systems, when is sufficiently sparse and is an appropriately chosen regularization parameter, can often be reasonably estimated by solving the LASSO problem

where and are the and norms, respectively. A classic method for solving (12) is the iterative shrinkage thresholding algorithm (ISTA) (e.g. see (Daubechies, De- frise, and Mol 2004)).3 Liu et al. (2019) present the L2O scheme ALISTA that we implement here. This L2O model L2O is parameterized by .

Linearized ADMM

Let and be as in the LASSO problem. Here we apply the L2O scheme differentiable linearized ADMM (LiADMM) of Xie et al. (2019) to the closely related sparse coding problem

Figure 4: Plot of error versus iteration for AdaLISTA example. Here ISTA is the classic algorithm, AdaLISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of AdaLISTA in Algorithm 2.

The L2O network L2O and fallback linearized ADMM (LiADMM) operator T are provided in the appendices along with implementation details. Plots are provided in Figure 3.

LISTA for Natural Image Denoising

To evaluate our safeguarding mechanism in a more realistic setting, we apply safeguarded LISTA to a natural image denoising problem. In this subsection, we learn a LISTA-CP model (Chen et al. 2018) to perform natural image denoising. During training, L2O LISTA-CP model is trained to recover clean images from their Gaussian noisy counterparts by solving (12). In (12), d is the noisy input to the model, and the clean image is recovered with , where is the optimal solution. The dictionary is learned on the BSD500 dataset (Martin et al. 2001) by solving a dictionary learning problem (Xu and Yin 2014). During testing, however, the learned L2O LISTA-CP is applied to unseen pepper-and-salt noisy images. Comparison plots are provided in Figure 2.

AdaLISTA: Dictionary as Part of Inputs

Here we consider the same LASSO problem (12) as in Subsection but make the dictionary A part of the inputs to the L2O model (i.e. able to change across samples). Aberdam, Golts, and Elad (2021) present a new L2O scheme AdaLISTA that is trained to quickly solve a distribution of LASSO problems with varying dictionaries. AdaLISTA has a different parameterization scheme from the original LISTA (Gre- gor and LeCun 2010) to enable the adaptivity to the dictionaries. The L2O model L2O in AdaLISTA is parameterized by and two weight matrices where the dictionary A has shape . The two matrices are shared by operators in all iterations.

We mainly follow the settings in (Aberdam, Golts, and Elad 2021). Specifically, we let be sparse vectors with random supports of cardinality s = 6 and a single fixed dictionary . We assume access is given

Figure 5: Plot of error versus iteration for AdaLISTA example. Here ISTA is the classic algorithm, AdaLISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of AdaLISTA in Algorithm 2. Inferences used and EMA(0.25). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. This plot shows the safeguard is used intermittently for k > 7.

to noiseless linear measurements , where A is uniformly sampled from all column-permuted variants of . Figure 4 and 5 show summary plots, from which we have similar observations as in the ALISTA experiments with a single fixed dictionary.

Conclusions

Numerous insights may be drawn from our examples. The first observation is, roughly speaking, each L2O scheme in our numerical examples reduces computational costs by at least one order of magnitude when applied to data from the same distribution as the training data (as compared to analytic optimization algorithms). This is consistent with results of previous works. More importantly, plots in Figures 1b and 2b, and 5 show the safeguard steers updates to convergence when they would otherwise diverge or converge slower than the conventional algorithm. That is, Safe-L2O converges with data distinct from training while the nonsafeguarded L2O schemes diverge.

This work proposes a framework for ensuring convergence of L2O algorithms. Sequences generated by our Safe-L2O method provably converge to solutions of the optimization problems. Our Safe-L2O algorithm is also easy to implement as a wrapper around trained neural networks. Numerical experiments demonstrate rapid convergence by Safe-L2O methods and effective safeguarding when the L2O schemes appear to otherwise diverge. Future work will provide a better data-driven fallback method and investigate stochastic extensions.

References

Aberdam, A.; Golts, A.; and Elad, M. 2021. Ada-lista: Learned solvers adaptive to varying models. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Ablin, P.; Moreau, T.; Massias, M.; and Gramfort, A. 2019. Learning step sizes for unfolded sparse coding. arXiv:1905.11071.

Bauschke, H.; and Combettes, P. 2017. Convex Analysis and Monotone Operator Theory in Hilbert Spaace. Springer, 2nd. edition.

Beck, A. 2017. First-order methods in optimization, volume 25. SIAM.

Blumensath, T.; and Davies, M. E. 2009. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3): 265–274.

Borgerding, M.; Schniter, P.; and Rangan, S. 2017. AMP- Inspired Deep Networks for Sparse Linear Inverse Problems. IEEE Transactions on Signal Processing, 65(16): 4293–4308.

Chen, T.; Chen, X.; Chen, W.; Heaton, H.; Liu, J.; Wang, Z.; and Yin, W. 2022. Learning to Optimize: A Primer and A Benchmark. Journal of Machine Learning Research, 23(189): 1–59.

Chen, X.; Liu, J.; Wang, Z.; and Yin, W. 2018. Theoretical Linear Convergence of Unfolded ISTA and Its Practical Weights and Thresholds. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 9061–9071. Curran Associates, Inc.

Daubechies, I.; Defrise, M.; and Mol, C. D. 2004. An itera- tive thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): 1413–1457.

Giryes, R.; Eldar, Y. C.; Bronstein, A. M.; and Sapiro, G. 2018. Tradeoffs Between Convergence Speed and Reconstruction Accuracy in Inverse Problems. IEEE Transactions on Signal Processing, 66(7): 1676–1690.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde- Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.

Gregor, K.; and LeCun, Y. 2010. Learning Fast Approxima- tions of Sparse Coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, 399–406. USA: Omnipress.

Hershey, J. R.; Roux, J. L.; and Weninger, F. 2014. Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures. arXiv:1409.2574.

Krasnosel’ski˘ı, M. 1955. Two remarks about the method of successive approximations. Uspekhi Mat. Nauk, 10: 123– 127.

Liu, J.; Chen, X.; Wang, Z.; and Yin, W. 2019. ALISTA: An- alytic Weights Are As Good As Learned Weights in LISTA. In International Conference on Learning Representations.

Liu, Q.; Shen, X.; and Gu, Y. 2019. Linearized ADMM for Nonconvex Nonsmooth Optimization With Convergence Analysis. IEEE Access, 7: 76131–76144.

Mann, R. 1953. Mean Value Methods in Iteration. 4(3): 506– 510.

Martin, D.; Fowlkes, C.; Tal, D.; and Malik, J. 2001. A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, 416–423.

Metzler, C.; Mousavi, A.; and Baraniuk, R. 2017. Learned D- AMP: Principled Neural Network based Compressive Image Recovery. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 1772– 1783. Curran Associates, Inc.

Moreau, T.; and Bruna, J. 2017. Understanding Trainable Sparse Coding with Matrix Factorization.

Ryu, E.; and Yin, W. 2022. Large-Scale Convex Optimization: Algorithm Designs via Monotone Operators. Cambridge, England: Cambridge University Press.

Sprechmann, P.; Bronstein, A. M.; and Sapiro, G. 2015. Learning Efficient Sparse and Low Rank Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1821–1833.

Themelis, A.; and Patrinos, P. 2019. SuperMann: a superlin- early convergent algorithm for finding fixed points of nonexpansive operators.

Wang, Z.; Ling, Q.; and Huang, T. S. 2016. Learning Deep Encoders. In Thirtieth AAAI Conference on Artificial Intelligence.

Wang, Z.; Liu, D.; Chang, S.; Ling, Q.; Yang, Y.; and Huang, T. S. 2016. D3: Deep Dual-Domain Based Fast Restoration of JPEG-Compressed Images. 2764–2772.

Xie, X.; Wu, J.; Zhong, Z.; Liu, G.; and Lin, Z. 2019. Differ- entiable Linearized ADMM. arXiv:1905.06179.

Xin, B.; Wang, Y.; Gao, W.; Wipf, D.; and Wang, B. 2016. Maximal Sparsity with Deep Networks? In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 4340–4348. Curran Associates, Inc.

Xu, Y.; and Yin, W. 2014. A fast patch-dictionary method for whole image recovery. arXiv preprint arXiv:1408.3740.

Yang, Y.; Sun, J.; Li, H.; and Xu, Z. 2016. Deep ADMM-Net for Compressive Sensing MRI. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 10–18. Curran Associates, Inc.

Zhang, J.; O’Donoghue, B.; and Boyd, S. 2018. Globally convergent type-I Anderson acceleration for non-smooth fixed-point iterations. arXiv:1808.03971.

Numerical Example Supplement Materials

We begin with a general note on training and then consider individual experiments. The training procedure for each experiment was conducted layerwise. By this, we mean that first the network weights were tuned using one layer. Then the network was trained with two layers, using the learned weights from the first layer as an initialization/warm start for that layer’s parameters. This was then repeated until the final number K of layers was reached. This approach is built upon the intuition that, because the network layers model an optimization algorithm that progressively improves, each successive layer’s weights likely depend upon the previous layers weights.

Supplement for ALISTA. In similar manner to (Chen et al. 2018) and (Liu et al. 2019), we use the following setup. We take m = 250, n = 500, and . Each entry of the dictionary A is sampled i.i.d from the standard Gaussian distribution, i.e., . Having these entries, we then normalize each column of A, with respect to the Euclidean norm. Each d in the distribution of data used to train the neural network is constructed using with noise and each entry of as the composition of Bernoulli and Gaussian distributions, i.e., Berfor all . Each d in the un- seen distribution is computed using the same distribution of noise as before and using Ber. Our data set consists of 10,000 training samples and 1,000 test samples.

Given , the ISTA method iteratively computes

where and is the soft-thresholding function defined by component-wise operations:

We applied Safe-L2O to the LASSO problem above by using T defined in (14) and the L2O update operation from ALISTA (Liu et al. 2019). The L2O operator L2O is parameterized by for positive scalars and and defined by

where

and is the Frobenius norm and the Matlab notation is used to denote the th column of the matrix M. The parameter consists of 2K scalars.

Supplement for LiADMM. LiADMM is used to solve problems of the form

for which LiADMM generates sequences and defined by the updates

with given scalars . The problem (13) may be written in the form of (18) by taking , and Id. In this case, the proximal operators in (19) reduce to soft-thresholding operators. Although not given in Table 1, at each iteration of LiADMM there is an associated iterate for which the update is generated by applying an averaged operator T to the current iterate . As shown in Lemma 2 of Subsection A.2, for our setup, assuming and , the norm of the associated fixed point residual at the iterate is given by

with P defined below in (48). For notational clarity, the term in the SKM and LSKM schemes is replaced in this subsection by the tuple . This is of practical importance too since it is the sequence that converges to a solution of (13).

We now modify the iteration (19) for the problem (13) to create the LiADMM L2O scheme. We generalize soft- thresholding to vectorized soft-thresholding for by

We assume represents the scalar soft-thresholding in (15) when and the vector generalization (21) when . Combining ideas from ALISTA (Liu et al. 2019) and LiADMM (Liu, Shen, and Gu 2019), given , and , set

with element-wise products denoted by . For the parameter , then define

Fixing the number of iterations K, the learnable parameters from (22) used in the LSKM Algorithm may be encoded by consist- ing of scalars. To stabilize the training process, we share the across all layers in practice. We also fix Id and only learn the step sizes before it. Moreover, different from other experiments, we add an additional end-to-end training stage after we finish the normal layer-wise training described at the beginning of the Appendix, which is found helpful to improve the recovery performance at the final output layer.

For data generation, we use the same settings for the dictionary A and sparse vectors as in the experiments of Subsection . But we make a small modification to to generation of noise due to the objective, where we sample from the same (seen and unseen) distribution as , i.e. the noises are also sparse. We choose .4

Supplement for LISTA. We choose LISTA with coupled weight (i.e. LISTA-CP) in (Chen et al. 2018) for natural image denoising. This is done for two reasons: 1) LISTA-CP has a larger capacity to work well in complex real-world settings; 2) the dictionary A learned from natural images is much more ill-conditioned than the Gaussian matrix in Subsection . The dictionary is learned by solving a dictionary learning problem (Xu and Yin 2014) on image patches extracted from BSD500 dataset (Martin et al. 2001) 5.

The training set that we use includes 50,000 images patches of size randomly extracted from 200 images in the BSD500 training set, 250 patches each. White Gaussian noises with standard deviation (pixel values range from 0 to 255) are added to the patches. For testing, we use the “Peppers” image as the ground truth, which is 1,024 non-overlapping patches. The noisy testing patches in the seen distribution is generated in the same way as the training set. The testing patches in the unseen distribution is polluted by pepper-and-salt noises with density r = 70%.

The update operation from LISTA-CP (Chen et al. 2018) is similar to (25) but has one more matrix weight to learn in each layer:

where parameterizes the update operator with non-negative scalar and a matrix weight . The tunable parameters include K scalars and K matrices. We take K = 20, i.e. we train a 20-layer L2O LISTA-CP model. To train the L2O LISTA-CP model, we use (12) as loss function with A mentioned above and .

Supplement to AdaLISTA. In similar manner to (Aberdam, Golts, and Elad 2021), we use the following setup. We take m = 50, n = 70, and . Each entry of the dictionary is sampled i.i.d from the standard Gaussian distribution, i.e., . Having these entries, we then normal- ize each column of , with respect to the Euclidean norm.

Instead using a fixed dictionary in Subsection , for each sample, we generate a variant dictionary A from by permutating its columns. Then the linear measurement d is constructed using . The measurement-dictionary pair (d, A) is the input to the Safe-L2O scheme. For each d in the seen distribution has supports of cardinality s = 6 that are uniformly sampled over all coordinates and the non-zero entries are sampled from the standard Gaussian distribution. For each d in the unseen distribution has supports of larger cardinality s = 10 and the non-zero entries are sampled from Gaussian distribution N(0, 2). Our data set consists of 20,000 training samples and 1,000 test samples.

We applied Safe-L2O to LASSO by using T defined in (14) and the L2O update operation from AdaLISTA (Aber- dam, Golts, and Elad 2021). The L2O operator L2O is parameterized by for positive scalars and and matrices and defined by

where and are shared by operators in all iterations. The parameter consists of 2K scalars and two matrices.

Proofs

This section contains proofs for a lemma, the main theorem and its associated corollary. A final lemma is given for the safeguarding condition used for Linearized ADMM.

Lemma 1. If is a sequence generated by Safe-L2O and Assumptions 1 and 2 hold, then is bounded and there is a summable sequence such that, for all and ,

Proof. Fix any . Set to be the set of all indices such that the update relation holds, and set so that Then

which implies, by induction and the safeguard inequality,

where the final inequality holds by the fact the series is geometric and the term B is finite by Assumption 2.

A classic result (e.g. see Corollary 2.15 in (Bauschke and Combettes 2017)) states, for all and ,

Additionally, because is averaged, there exists (0, 1) such that for some 1-Lipschitz

operator Q. Together these two facts imply, for all ,

) + (30c) + (30d)

Because the terms for are summable by (28) and for , the sequence defined by for and 0 otherwise is summable, which establishes (26) in the second claim of the lemma (upon application of the triangle inequality for ). Moreover, applying this inequality inductively reveals, for all ,

which proves boundedness of .

We restate the main theorem below and provide a proof.

Theorem 2. If is a sequence generated by the inner loop in Safe-L2O and Assumptions 1 and 2 hold, then converges to a limit , i.e. .

Proof. Let be the set of all indices such that the update relation holds, and set so that . If the L2O update is applied finitely many times (i.e. ), then there is a finite index after which the iteration coincides with the classic setting of Theorem 1, by which convergence follows. In what remains, assume . We first show contains a limit point in the fixed point set of (Step 1). This is used to show the entire sequence converges to this limit point (Step 2).

Step 1. By Lemma 1, the sequence is bounded, and so also is the subsequence , with an enumeration of . By the boundedness of , there exists a convergent subsequence with limit . Ob-

serve

where the first and third inequalities are applications of the triangle inequality and the second is holds since T is 1-Lipschitz. Since each update in is “good,” the first term in the k-th summand is bounded about by and the second term is bounded by . Whence

of the proof for Lemma 1. Since the sequence of partial sums in the series on the left of (35) is monotonicallly increasing (due to nonnegative summands) and is bounded by (35c), the series converges. Thus, together with the continuity of norms and the Lipschitz continuity of T, we deduce since

Step 2. All that remains is to show the entire sequence converges to . To this end, let be given. It suffices to show there exists such that

Again utilizing Lemma 1, there exists a summable sequence such that

for all and . Since is summable, there exists such that

Since , there exists such that

Combining (39) and (40), repeated application of the triangle inequality reveals

This verifies (37), taking , completing the proof.

We restate Corollary 1 below and then provide a proof.

Corollary 1. If is generated by the inner loop in Safe-L2O and Assumption 1 holds, and is generated using a scheme outlined in Table 2 with and , then .

Proof. The proof is parsed into three parts, one for each choice of sequence in Table 2.

Geometric Sequence. Define the sequence using, for each , the Geometric Sequence update formula in Table 2. Then whenever an L2O update is used and whenever a fallback update is used for which the descent condition holds. This verifies (9) and that is monotonically decreasing.

We verify (8) by induction. The base case holds by defini-tion of . Now suppose the inequality in (8) holds for a particular choice of k. If holds, then the inequality in Line 6 of Safe-L2O ensures . If does not hold, then a fallback update is used, which with the nonexpansivity of T yields

In either case, the inequality (8) holds for the (k + 1)-st case, thereby closing the induction. By the principle of induction, (8) holds.

Exponential Moving Average. Given , define the sequence using the EMA() formula in Table 2. For each k when changes value, observe

and so (9) holds by the fact

This also shows is monotonically decreasing. We verify (8) by induction. The base case holds by definition of . Now suppose the inequality in (8) holds for a

particular choice of k. If holds, then the inequality in Line 6 of Safe-L2O ensures

If does not hold, then a fallback update is used, which yields the inequality in (42). In either case, the inequality in (8) holds, closing the induction. By induction, (8) holds.

Recent Max. Define the sequence using, for each N, the Recent Max update formula in Table 2. Then

whenever an L2O update is used and whenever a fallback update is used for which the descent condition holds. This verifies (9) and that is monotonically decreasing.

We verify (8) by induction. As before, the base case holds. Now suppose the inequality in (8) holds for a particular choice of k. If holds, then

If does not hold, then a fallback update is used, which yields the inequality in (42). In either case, the inequality in (8) holds, closing the induction.

Below is a lemma used to identify the safeguarding procedure for the LiADMM method.

Lemma 2. Let be a sequence generated by LiADMM as in (19). If and , then for each index k there is an associated iterate such that the update is generated by applying an averaged operator T to with respect to the Euclidean norm, i.e. . In addition, for

the fixed point residual is given by

Proof. We outline the proof as follows. We first derive a trio of updates that forms the application of an averaged operator for the ADMM problem (18) (Step 1). Next we rewrite the updates in a more meaningful manner using minimizations with f and g (Step 2). This formulation is then applied to a proximal ADMM problem (a special case of ADMM) that introduces auxiliary variables. This yields an explicit formula for (49) (Step 3). The remaining step uses substitution to transform the proximal ADMM formulation into the linearized ADMM updates in (19) (Step 4). This

derivation draws heavily from (Ryu and Yin 2022).

Step 1: The classic ADMM method applied to the problem (18) is equivalent to applying Douglas Rachford Splitting (DRS) to the problem

where is the infimal postcomposition (e.g. see (Bauschke and Combettes 2017))

This yields the iteration

which may be rewritten in parts by

This formulation reveals

Below we transform this expression into something meaningful.

Step 2: It can be shown that if the range of intersected with the domain of the dual is nonempty (i.e., ri dom), then

if and only if

Also, for a function we have

These equivalences imply (53) may be rewritten as

All instances of and may be removed upon substitution, i.e.

Step 3: First note that our hypothesis implies the inverses of P and Q are defined and the square root can be taken since the matrices are symmetric. In addition, P and Q are positive definite. The ADMM problem (18) is equivalent to

subject to the constraint

Using the same update formula as in Step 2 yields

Simplifying once more gives

Thus,

Step 4: We now derive the form of the linearized ADMM updates. Let . Then

and

In similar fashion, we deduce

Upon reordering the updates in (66), (67) and (68) to obtain the appropriate dependencies, we obtain (19), as desired.