b

DiscoverSearch
About
My stuff
Safeguarded Learned Convex Optimization
2020·arXiv
Abstract
Abstract

Applications abound in which optimization problems must be repeatedly solved, each time with new (but similar) data. Analytic optimization algorithms can be hand-designed to provably solve these problems in an iterative fashion. On one hand, data-driven algorithms can “learn to optimize” (L2O) with much fewer iterations and similar cost per iteration as generalpurpose optimization algorithms. On the other hand, unfortunately, many L2O algorithms lack converge guarantees. To fuse the advantages of these approaches, we present a Safe-L2O framework. Safe-L2O updates incorporate a safeguard to guarantee convergence for convex problems with proximal and/or gradient oracles. The safeguard is simple and computationally cheap to implement, and it is activated only when the data-driven L2O updates would perform poorly or appear to diverge. This yields the numerical benefits of employing machine learning to create rapid L2O algorithms while still guaranteeing convergence. Our numerical examples show convergence of Safe-L2O algorithms, even when the provided data is not from the distribution of training data.

Solving scientific computing problems often requires application of efficient and scalable optimization algorithms. Data-driven algorithms can execute in much fewer iterations and with similar cost per iteration as state-of-the-art general purpose algorithms. Inspired by one such algorithm called ISTA, (Gregor and LeCun 2010) proposed treating the entries in fixed matrices/vectors of the algorithm as learnable parameters that can vary by iteration. These entries were fine-tuned to obtain optimal performance on a data set for a fixed number of iterations. Empirically, this approach converged and showed roughly a 20-fold reduction in computational cost compared to the original analytic algorithm. Several related works followed, also demonstrating numerical success (discussed below). These efforts open the door to a new class of algorithms and analyses. Analytic optimization results often provide worst-case convergence rates, and limited theory exists pertaining to instances drawn from a common distribution (e.g. data supported on a low-dimensional manifold). Most L2O methods have little or no convergence guarantees, especially on data distinct from what is seen in training. This work addresses the inquiry:

Can a safeguard be added to L2O algorithms to give convergence without significantly hindering performance? A safeguard is anything that identifies when a “bad” L2O update would occur and what to do in place of that “bad” update. Informally, what a safeguard does can be summarized by generating a sequence  {xk}with updates of the form

image

We provide an affirmative answer to the question for convex problems with gradient and/or proximal oracles by providing such a safeguard and replacing “bad” L2O updates with updates from analytic methods. Our framework is called Safe-L2O. Since a trade-off is formed between per iteration costs and ensuring convergence, we clarify properties constituting a “practical” L2O safeguard as follows.

1. The safeguard should ensure certain forms of worst-case convergence similar to analytic algorithms.

2. The safeguard must only use known quantities related to convex problems (e.g. objective values, gradient norms).

3. Both L2O and Safe-L2O schemes should perform identically on “good” data with comparable per-iteration costs.

4. The safeguard should kick in only when “bad” L2O updates would otherwise occur. The core challenge is to create a simple safeguard that

kicks in only when needed. Unlike classic optimization al-

gorithms, exceptional L2O algorithms do not necessarily exhibit the behavior that each successive iterate is “better” than the current iterate (i.e. are not monotonically improving). Loosely speaking, this means there are cases where an L2O scheme that gets “worse” for a couple iterates yields a better final output than an L2O scheme that is required to get “better” at each iterate. The intuition behind why this can be acceptable is we are interested in the final output of the L2O algorithm and L2O schemes may learn “shortcuts.” From this insight, we deduce the safeguard should exhibit a form of trailing behavior, i.e. it should measure progress of previous iterates and only require that updates are “good” on average. If the safeguard triggers too often, then the Safe-L2O scheme’s flexibility and performance are limited. If it triggers too rarely, then the Safe-L2O scheme may exhibit highly oscillatory behavior and converge slowly.

In addition to L2O updates, our method uses a safeguard condition with the update formula from a conventional algorithm. When the “good” condition holds, the L2O update is used; when it fails, the formula from the conventional algorithm is used. In the ideal case, L2O updates are applied often and the conventional algorithm formula provides a “fallback” for exceptional cases. This fallback is designed together with the safeguard condition to ensure convergence. This also implies, even when an L2O algorithm has a fixed number of iterations with tunable parameters, the algorithm may be extended to an arbitrary number of iterations by applying the fallback to compute latter updates (see Figure 1).

Review of L2O Methods. A seminal L2O work in the context of sparse coding was by (Gregor and LeCun 2010). Numerous follow-up papers also demonstrated empirical success at constructing rapid regressors approximating iterative sparse solvers for compression, nonnegative matrix factorization, compressive sensing and other applications (Sprechmann, Bronstein, and Sapiro 2015; Wang, Ling, and Huang 2016; Wang et al. 2016; Hershey, Roux, and Weninger 2014; Yang et al. 2016). A summary of unfolded optimization procedures for sparse recovery is given by (Ablin et al. 2019) in Table A.1. The majority of L2O works pertain to sparse coding and provide limited theoretical results. Some works have interpreted LISTA in various ways to provide proofs of different convergence properties (Giryes et al. 2018; Moreau and Bruna 2017). Others have investigated structures related to LISTA (Xin et al. 2016; Blumensath and Davies 2009; Borgerding, Schniter, and Rangan 2017; Met- zler, Mousavi, and Baraniuk 2017), providing results varying by assumptions. (Chen et al. 2018) introduced necessary conditions for the LISTA weight structure to asymptotically achieve a linear convergence rate. This was followed by (Liu et al. 2019), which further simplified the weight conditions and provided a result stating that, with high probability, the convergence rate of LISTA is at most linear. The mentioned results are useful, yet can require intricate assumptions and proofs specific to the sparse coding problems. We refer readers to the recent survey (Chen et al. 2022) for a more comprehensive overview of L2O methods.

Our safeguarding scheme is related to existing works in Krasnosel’ski˘ı-Mann (KM) methods. The SuperMann scheme (Themelis and Patrinos 2019) presents a KM method that safeguards in a more hierarchical manner than ours and solely refers to the current iterate residuals (plus a summable sequence). Additionally, a similar safeguarding setup has been used for Anderson accelerated KM methods (Zhang, O’Donoghue, and Boyd 2018). These methods are not designed with L2O in mind and differ from our approach both in the assumptions used and particular update formulae.

Our Contribution. We provide a simple framework, Safe-L2O, for wrapping data-driven algorithms with convergence guarantees. This framework can be used with all L2O algorithms that solve convex problems for which proximal and/or gradient oracles are available. We incorporate multiple safeguarding procedures in a general setting and present a simple procedure for utilizing machine learning methods to instill knowledge from available data. These results form a single, general framework for use by practitioners.

Fixed-point iteration abstracts most of the convex optimization methods that are based on gradient and/or proximal oracles, and those methods are targets of recent L2O accelerations. Our method is based on examining the fixed-point residual, and so it applies to a wide variety of L2O methods, even complicated first-order methods (e.g. ADMM and primal-dual methods) where the underlying problems can include constraints. To provide a brief background, here we overview fixed point methods. Denote the set of fixed points of each operator  T : Rn → Rnby  fix(T) ≜ {x : Tx = x}.For an operator T with a nonempty fixed point set (i.e. fix(T) ̸= ∅), we consider the fixed point problem

image

See Table 1 for examples of the operator T in convex minimization. We focus on fixed point iteration to give a general approach for creating sequences that converge to solutions of (2) and, thus, of the corresponding optimization problem.

The following definitions are used throughout. An operator  T : Rn → Rnis nonexpansive if it is 1-Lipschitz,1 i.e.

image

An operator T is averaged if there exists  ϑ ∈ (0, 1)and a nonexpansive operator  Q : Rn → Rnsuch that T = (1 − ϑ)I + ϑQ, with I the identity. A classic theorem states sequences generated by successively applying an averaged operator converge to a fixed point. In all of this work, we assume each operator T is averaged. This method comes from (Krasnosel’ski˘ı 1955) and (Mann 1953), which yielded adoption of the name Krasnosel’ski˘ı-Mann (KM) method. This result is stated below. Theorem 1. If an averaged operator  T : Rn → Rnhas a nonempty fixed point set and a sequence  {xk}with arbitrary initial iterate  x1 ∈ Rnsatisfies the update relation

image

then there is a solution  x⋆ ∈ fix(T)to (2) such that the sequence  {xk}converges to  x⋆.

For reference, we briefly discuss the use of resolvents in conventional algorithms. Consider a convex function f : Rn → Rwith subgradient  ∂f. For  α > 0, the resolvent Jα∂fof  α∂fis defined by  Jα∂f(x) ≜ (I + α∂f)−1(x), i.e.

image

and the reflected resolvent of  ∂fis

image

If f is closed, convex, and proper, then the resolvent is precisely the proximal operator, i.e.

image

Proximal operators for several well-known functions can be expressed by explicit formulas (e.g. see page 177 in (Beck 2017)). It can be shown that  Rα∂fis nonexpansive and  Jα∂fis averaged (Bauschke and Combettes 2017). Table 1 provides several examples of these operators in well-known optimization algorithms.

Table 1: Averaged operators for well-known algorithms. We assume  α > 0and, when  αis multiplied by a gradient, we also assume  α < 2/L, with L the Lipschitz constant for the gradient. The dual of a function is denoted by a superscript  ∗, and Ω = {(x, z) : Ax + Bz = b}. Operators J and R are defined in equations (5) and (6), respectively. The block matrix M is M = [α−1Id, AT ; −A, β−1Id]. In each case, L is the Lagrangian associated with the presented problem.

image

image

This section presents the Safe-L2O framework. The safeguard acts as a wrapper around a data-driven algorithm, which is formulated in practice as a neural network. Each L2O operator, denoted throughout by  TΘk, is parameterized by layerwise weights  Θ = (Θ1, ..., ΘK). Input data d is used to define an optimization problem (e.g. the measurement vector in a least squares problem). To make this dependence clear, we often write  T(·; d). Often  TΘk(·; d)can be viewed as forming one or multiple layers of a feed forward network. Thus, we interpret  NΘ(d)in Algorithm 1 as a feed forward network. In addition to an L2O operator  TΘk, our Safe-L2O method uses a fallback operator T (unrelated to  TΘk) and a scalar sequence  {µk}. Here T defines an averaged operator from the update formula of a conventional optimization algorithm. Each  µkdefines a reference value to determine whether a tentative L2O update is “good.” Each reference value  µkin our safeguarding schemes is related to a combination of  ∥yi − T(yi; d)∥and  ∥yi − xi∥among previous iterates i = 1, . . . , k, where  yi = TΘi(xi; d).

We propose the Safe-L2O scheme in Algorithm 2. As shown in Line 1, a safeguarded L2O operator consists of an L2O network  L2O(d; Θ), a fallback operator T, and a parameter  α ∈ (0, 1). Here  Θ = (Θ1, . . . , ΘK)forms layerwise weights  Θkthat define the L2O update  TΘkat the k-th iteration. These parameters are trained beforehand, fol-

image

lowing standard training methods for machine learning models (as outlined in the Appendices). Table 1 shows numerous choices of the fallback operator T for different combinations of optimization problems and algorithms In Line 2, the initial iterate  x1is chosen to be an arbitrary (but fixed) vector ˜x. The initial iterate  µ1of the safeguard sequence  {µk}is initialized using the initial iterate  x1, an L2O update  y1, and the fallback operator T in Line 3. From Line 4 to Line 11, a repeated loop occurs to compute each update  xk+1. In Line 5 the L2O operator is applied to the current iterate  xkget a tentative update  yk. This  ykis “good” if the the inequality in Line 6 holds. In such a case, the L2O update is assigned to  xk+1in Line 7. Otherwise, the fallback operator T is triggered to obtain the update  xk+1in Line 9. Note the initial iterate  µ1is defined such that  x2 = y1, i.e. the first L2O update is always accepted. Lastly, the safeguard parameter is updated in Line 10. Refer to Table 2 for choices of schemes to update the safeguard parameters.

image

and there exists  ζ ∈ (0, 1)such that

image

Our proposed methods for choosing the sequence  {µk}satisfy Assumption 2 (see Table 2). These methods are adaptive in the sense that each  µkdepends upon the weights  Θk, iterate  xkand (possibly) previous weights and iterates. Each safeguard parameter  µkalso remains constant in k except for when the sum of residual norms  ∥xk+1 − T(xk+1)∥and ∥yk − xk∥decreases to less than a geometric factor of  µk. This allows each  µkto trail the value of the residual norm ∥xk − T(xk)∥and the residual norm to increase in k from time to time. This trailing behavior provides flexibility to the L2O updates. Our main result is below and is followed by a corollary justifying use of the schemes in Table 2 (both proven in the appendix). Theorem 2. If  {xk}is a sequence generated by the inner loop in Safe-L2O and Assumptions 1 and 2 hold, then  {xk}converges to a limit  x⋆d ∈ fix(T(·; d)), i.e.  xk → x⋆d. Corollary 1. If  {xk}is generated by the inner loop in Safe-L2O and Assumption 1 holds, and  {µk}is generated using a scheme outlined in Table 2 with  α ∈ [0, 1)and  β ∈ (0, ∞), then  xk → x⋆d ∈ fix(T(·; d)).

image

Remark 1. The appropriate frequency for the safeguard to trigger can be estimated by tuning L2O parameters for optimal performance on a training set without safeguarding and then using a validation set to test various safeguards with the L2O scheme. To avoid possible confusions, note we are not trying to prove the convergence of any standalone L2O algo- rithm. We instead 1) alarm on an L2O update when it may break convergence, 2) replace it with a fallback update, and 3) show the resulting “hybrid optimization” converges to a solution of the provided optimization problem.

Safe-L2O may be executed via inferences of a feed forward neural network. The input into the network is the data d, often in vector form. Input d is usually the observation we have, based on which we optimize over the variable of interest. For example, the LASSO problem in (12) is used for sparse coding, where the goal is to recover a unknown sparse vector  x⋆from its noisy measurements  d = Ax⋆ + ε. Assuming A is known beforehand, the input to the Safe-L2O model is the observation d. In other cases, the dictionary A can change and also be part of the input to the model. We include case-by-case discussions about what the inputs for each numerical example.

Each layer of the Safe-L2O model is designed so that its input is  xk, to which it applies either an L2O or fallback update (following the Safe-L2O method), and outputs  xk+1to the next layer. The set over which  Θis minimized, may be

image

Figure 2: Plot of error versus iteration for LISTA denoising. Here ISTA is the classic algorithm, LISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of LISTA in Algorithm 2. Trained with  φd = fd. Inferences used α = 0.99and EMA(0.25). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. This plot shows the safeguard is used intermittently for k > 2.

Table 2: Rules for updating  µk. Here  α, θ ∈ (0, 1), β ∈ (0, ∞), and  Ckis the statement  ∥yk −T(yk; d)∥+β∥yk −xk∥ ≤ αµk.

image

chosen with great flexibility. For each application of the algorithm, the fallback operator depends upon the data d.

The “optimal” choice of parameters  Θdepends upon the application. Suppose each d is drawn from a common distribution D. Then a choice of “optimal” parameters  Θ⋆may be identified as those for which the expected value of  φ(xK; d)is minimized among  d ∼ D, where  φ(· ; d) : Rn → Ris an appropriate cost function and K is a fixed positive integer. Mathematically, this means  Θ⋆solves the problem

image

may be used (e.g. the min-max problem used by adversarial networks (Goodfellow et al. 2014)).

This section presents examples using Safe-L2O.2 We numerically investigate i) the convergence rate of Safe-L2O relative to corresponding conventional algorithms, ii) the efficacy of safeguarding procedures when inferences are performed on data for which L2O fails intermittently, and iii) the convergence of Safe-L2O schemes even when the application of L2O operators is not theoretically justified. We first use L2O from ALISTA (Liu et al. 2019) on a synthetic LASSO problem. We then use LISTA on a LASSO problem for image processing, differentiable linearized ADMM (Xie et al. 2019) on a sparse coding problem. In all three types of problems, the input data d to the L2O models is generated by  d = Ax⋆ + ε

image

Figure 3: Plot of error versus iteration for LiADMM. Here LiADMM is the classic algorithm, L2O LiADMM is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of L2O LiADMM in Algorithm 2. Inferences used  α = 0.99and EMA(0.75). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. The safeguard is used about 10% and 30% of the time when k = 4 and k = 5, respectively.

where  εis white Gaussian noise and  x⋆is the hidden variable that we want to recover through a dictionary A, which is generated and beforehand and fixed. In these experiments, the ground-truth vector  x⋆are sampled from a distribution, which characterizes the distribution of data of interest along with the dictionary. We denote the distribution of the input data as D. Besides these “linear” examples, we also validate Safe-L2O on a distribution of LASSO problems where the dictionary A also changes and is part of the input to the L2O models. In this case, the distributions of the dictionaries and ground-truth vectors together characterize the input distribution D.

In each example,  f ⋆ddenotes the optimal value of the ob- jective f(x; d) among all possible x. Performance is measured using a modified relative objective error:

image

where the expectations are estimated numerically (see the appendices for details). We use (11) rather than the expectation of relative error to avoid high sensitivity to outliers.

Our numerical results are shown in several plots. When each iterate  xkis computed using data d drawn from the same distribution  Dsthat was used to train the L2O algorithm, we say the performance is on the “seen” distribution  Ds. These plots form the primary illustrations of the speedup of L2O algorithms as compared to conventional optimization algorithms. When each d is drawn from a distribution  Duthat is different than  Ds, we refer to  Duas the unseen distribution. These plots show the ability of the safeguard to ensure convergence. A dotted plot with square markers is also added to show the frequency of safeguard activations among test samples, with the reference axis on the right hand side of the plots. We extend the Safe-L2O methods beyond their training iterations by applying the fallback operator T; we demarcate where this extension begins by changing the Safe-L2O plots from solid to dashed lines. As Safe-L2O convergence holds whenever  β > 0, we can set  βto be arbitrarily small (e.g. below machine precision); for simplicity, we use  β = 0in the experiments (as, even in this case, it can be shown that iterates approach the solution set). Implementation details for each experiment are in the appendices.

ALISTA for LASSO

Here we consider the LASSO problem for sparse coding. Let x⋆ ∈ R500be a sparse vector and  A ∈ R250×500be a dictionary. We assume access is given to noisy linear measurements  d ∈ R250, where  ε ∈ R250is additive Gaussian white noise and  d = Ax⋆ + ε.Even for underdetermined systems, when  x⋆is sufficiently sparse and  τ ∈ (0, ∞)is an appropriately chosen regularization parameter,  x⋆can often be reasonably estimated by solving the LASSO problem

image

where  ∥ · ∥2and  ∥ · ∥1are the  ℓ2and  ℓ1norms, respectively. A classic method for solving (12) is the iterative shrinkage thresholding algorithm (ISTA) (e.g. see (Daubechies, De- frise, and Mol 2004)).3 Liu et al. (2019) present the L2O scheme ALISTA that we implement here. This L2O model L2O is parameterized by  Θk = (θk, γk) ∈ R2.

Linearized ADMM

Let  A ∈ R250×500and  d ∈ R250be as in the LASSO problem. Here we apply the L2O scheme differentiable linearized ADMM (LiADMM) of Xie et al. (2019) to the closely related sparse coding problem

image

image

Figure 4: Plot of error versus iteration for AdaLISTA example. Here ISTA is the classic algorithm, AdaLISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of AdaLISTA in Algorithm 2.

The L2O network L2O and fallback linearized ADMM (LiADMM) operator T are provided in the appendices along with implementation details. Plots are provided in Figure 3.

LISTA for Natural Image Denoising

To evaluate our safeguarding mechanism in a more realistic setting, we apply safeguarded LISTA to a natural image denoising problem. In this subsection, we learn a LISTA-CP model (Chen et al. 2018) to perform natural image denoising. During training, L2O LISTA-CP model is trained to recover clean images from their Gaussian noisy counterparts by solving (12). In (12), d is the noisy input to the model, and the clean image is recovered with ˆd = Ax⋆, where  x⋆is the optimal solution. The dictionary  A ∈ R256×512is learned on the BSD500 dataset (Martin et al. 2001) by solving a dictionary learning problem (Xu and Yin 2014). During testing, however, the learned L2O LISTA-CP is applied to unseen pepper-and-salt noisy images. Comparison plots are provided in Figure 2.

AdaLISTA: Dictionary as Part of Inputs

Here we consider the same LASSO problem (12) as in Subsection but make the dictionary A part of the inputs to the L2O model (i.e. able to change across samples). Aberdam, Golts, and Elad (2021) present a new L2O scheme AdaLISTA that is trained to quickly solve a distribution of LASSO problems with varying dictionaries. AdaLISTA has a different parameterization scheme from the original LISTA (Gre- gor and LeCun 2010) to enable the adaptivity to the dictionaries. The L2O model L2O in AdaLISTA is parameterized by ζ = (θ, γ) ∈ R2and two weight matrices  W1, W2 ∈ Rm×mwhere the dictionary A has shape  m × n. The two matrices are shared by operators in all iterations.

We mainly follow the settings in (Aberdam, Golts, and Elad 2021). Specifically, we let  x⋆ ∈ R70be sparse vectors with random supports of cardinality s = 6 and a single fixed dictionary  A′ ∈ R50×70. We assume access is given

image

Figure 5: Plot of error versus iteration for AdaLISTA example. Here ISTA is the classic algorithm, AdaLISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of AdaLISTA in Algorithm 2. Inferences used  α = 0.99and EMA(0.25). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. This plot shows the safeguard is used intermittently for k > 7.

to noiseless linear measurements  d ∈ R50 = Ax⋆, where A is uniformly sampled from all column-permuted variants of A′. Figure 4 and 5 show summary plots, from which we have similar observations as in the ALISTA experiments with a single fixed dictionary.

Numerous insights may be drawn from our examples. The first observation is, roughly speaking, each L2O scheme in our numerical examples reduces computational costs by at least one order of magnitude when applied to data from the same distribution as the training data (as compared to analytic optimization algorithms). This is consistent with results of previous works. More importantly, plots in Figures 1b and 2b, and 5 show the safeguard steers updates to convergence when they would otherwise diverge or converge slower than the conventional algorithm. That is, Safe-L2O converges with data distinct from training while the nonsafeguarded L2O schemes diverge.

This work proposes a framework for ensuring convergence of L2O algorithms. Sequences generated by our Safe-L2O method provably converge to solutions of the optimization problems. Our Safe-L2O algorithm is also easy to implement as a wrapper around trained neural networks. Numerical experiments demonstrate rapid convergence by Safe-L2O methods and effective safeguarding when the L2O schemes appear to otherwise diverge. Future work will provide a better data-driven fallback method and investigate stochastic extensions.

Aberdam, A.; Golts, A.; and Elad, M. 2021. Ada-lista: Learned solvers adaptive to varying models. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Ablin, P.; Moreau, T.; Massias, M.; and Gramfort, A. 2019. Learning step sizes for unfolded sparse coding. arXiv:1905.11071.

Bauschke, H.; and Combettes, P. 2017. Convex Analysis and Monotone Operator Theory in Hilbert Spaace. Springer, 2nd. edition.

Beck, A. 2017. First-order methods in optimization, volume 25. SIAM.

Blumensath, T.; and Davies, M. E. 2009. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3): 265–274.

Borgerding, M.; Schniter, P.; and Rangan, S. 2017. AMP- Inspired Deep Networks for Sparse Linear Inverse Problems. IEEE Transactions on Signal Processing, 65(16): 4293–4308.

Chen, T.; Chen, X.; Chen, W.; Heaton, H.; Liu, J.; Wang, Z.; and Yin, W. 2022. Learning to Optimize: A Primer and A Benchmark. Journal of Machine Learning Research, 23(189): 1–59.

Chen, X.; Liu, J.; Wang, Z.; and Yin, W. 2018. Theoretical Linear Convergence of Unfolded ISTA and Its Practical Weights and Thresholds. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 9061–9071. Curran Associates, Inc.

Daubechies, I.; Defrise, M.; and Mol, C. D. 2004. An itera- tive thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): 1413–1457.

Giryes, R.; Eldar, Y. C.; Bronstein, A. M.; and Sapiro, G. 2018. Tradeoffs Between Convergence Speed and Reconstruction Accuracy in Inverse Problems. IEEE Transactions on Signal Processing, 66(7): 1676–1690.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde- Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.

Gregor, K.; and LeCun, Y. 2010. Learning Fast Approxima- tions of Sparse Coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, 399–406. USA: Omnipress.

Hershey, J. R.; Roux, J. L.; and Weninger, F. 2014. Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures. arXiv:1409.2574.

Krasnosel’ski˘ı, M. 1955. Two remarks about the method of successive approximations. Uspekhi Mat. Nauk, 10: 123– 127.

Liu, J.; Chen, X.; Wang, Z.; and Yin, W. 2019. ALISTA: An- alytic Weights Are As Good As Learned Weights in LISTA. In International Conference on Learning Representations.

Liu, Q.; Shen, X.; and Gu, Y. 2019. Linearized ADMM for Nonconvex Nonsmooth Optimization With Convergence Analysis. IEEE Access, 7: 76131–76144.

Mann, R. 1953. Mean Value Methods in Iteration. 4(3): 506– 510.

Martin, D.; Fowlkes, C.; Tal, D.; and Malik, J. 2001. A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, 416–423.

Metzler, C.; Mousavi, A.; and Baraniuk, R. 2017. Learned D- AMP: Principled Neural Network based Compressive Image Recovery. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 1772– 1783. Curran Associates, Inc.

Moreau, T.; and Bruna, J. 2017. Understanding Trainable Sparse Coding with Matrix Factorization.

Ryu, E.; and Yin, W. 2022. Large-Scale Convex Optimization: Algorithm Designs via Monotone Operators. Cambridge, England: Cambridge University Press.

Sprechmann, P.; Bronstein, A. M.; and Sapiro, G. 2015. Learning Efficient Sparse and Low Rank Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1821–1833.

Themelis, A.; and Patrinos, P. 2019. SuperMann: a superlin- early convergent algorithm for finding fixed points of nonexpansive operators.

Wang, Z.; Ling, Q.; and Huang, T. S. 2016. Learning Deep ℓ0Encoders. In Thirtieth AAAI Conference on Artificial Intelligence.

Wang, Z.; Liu, D.; Chang, S.; Ling, Q.; Yang, Y.; and Huang, T. S. 2016. D3: Deep Dual-Domain Based Fast Restoration of JPEG-Compressed Images. 2764–2772.

Xie, X.; Wu, J.; Zhong, Z.; Liu, G.; and Lin, Z. 2019. Differ- entiable Linearized ADMM. arXiv:1905.06179.

Xin, B.; Wang, Y.; Gao, W.; Wipf, D.; and Wang, B. 2016. Maximal Sparsity with Deep Networks? In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 4340–4348. Curran Associates, Inc.

Xu, Y.; and Yin, W. 2014. A fast patch-dictionary method for whole image recovery. arXiv preprint arXiv:1408.3740.

Yang, Y.; Sun, J.; Li, H.; and Xu, Z. 2016. Deep ADMM-Net for Compressive Sensing MRI. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 10–18. Curran Associates, Inc.

Zhang, J.; O’Donoghue, B.; and Boyd, S. 2018. Globally convergent type-I Anderson acceleration for non-smooth fixed-point iterations. arXiv:1808.03971.

We begin with a general note on training and then consider individual experiments. The training procedure for each experiment was conducted layerwise. By this, we mean that first the network weights were tuned using one layer. Then the network was trained with two layers, using the learned weights from the first layer as an initialization/warm start for that layer’s parameters. This was then repeated until the final number K of layers was reached. This approach is built upon the intuition that, because the network layers model an optimization algorithm that progressively improves, each successive layer’s weights likely depend upon the previous layers weights.

Supplement for ALISTA. In similar manner to (Chen et al. 2018) and (Liu et al. 2019), we use the following setup. We take m = 250, n = 500, and  τ = 0.001. Each entry of the dictionary A is sampled i.i.d from the standard Gaussian distribution, i.e.,  aij ∼ N(0, 1/m). Having these entries, we then normalize each column of A, with respect to the Euclidean norm. Each d in the distribution  Dsof data used to train the neural network is constructed using  d = Ax⋆ + εwith noise  ε ∼ 0.1 · N(0, 1/m)and each entry of  x⋆as the composition of Bernoulli and Gaussian distributions, i.e., x⋆j ∼Ber(0.1) ◦ N(0, 1)for all  j ∈ [n]. Each d in the un- seen distribution  Duis computed using the same distribution of noise  εas before and using  x⋆j ∼Ber(0.2) ◦ N(0, 2). Our data set consists of 10,000 training samples and 1,000 test samples.

Given  x1 ∈ Rn, the ISTA method iteratively computes

image

where  L = ∥AtA∥2and  ηθis the soft-thresholding function defined by component-wise operations:

image

We applied Safe-L2O to the LASSO problem above by using T defined in (14) and the L2O update operation from ALISTA (Liu et al. 2019). The L2O operator L2O is parameterized by  ζ = (θ, γ)for positive scalars  θand  γand defined by

image

where

image

and  ∥ · ∥Fis the Frobenius norm and the Matlab notation M:,ℓis used to denote the  ℓth column of the matrix M. The parameter  Θ = (θk, γk)Kk=1consists of 2K scalars.

Supplement for LiADMM. LiADMM is used to solve problems of the form

image

for which LiADMM generates sequences  {xk}, {zk}and {uk}defined by the updates

image

with given scalars  α, β, γ ∈ (0, ∞). The problem (13) may be written in the form of (18) by taking  f = τ∥·∥1, g = ∥·∥1, and  B = −Id. In this case, the proximal operators in (19) reduce to soft-thresholding operators. Although not given in Table 1, at each iteration of LiADMM there is an associated iterate  νkfor which the update  νk+1is generated by applying an averaged operator T to the current iterate  νk. As shown in Lemma 2 of Subsection A.2, for our setup, assuming  γ =1/αand  αβ∥AT A∥2 < 1, the norm of the associated fixed point residual at the iterate  νkis given by

image

with P defined below in (48). For notational clarity, the term xkin the SKM and LSKM schemes is replaced in this subsection by the tuple  (xk, zk, uk). This is of practical importance too since it is the sequence  {xk}that converges to a solution of (13).

We now modify the iteration (19) for the problem (13) to create the LiADMM L2O scheme. We generalize soft- thresholding to vectorized soft-thresholding for  β ∈ Rnby

image

We assume  ηβrepresents the scalar soft-thresholding in (15) when  β ∈ Rand the vector generalization (21) when  β ∈Rn. Combining ideas from ALISTA (Liu et al. 2019) and LiADMM (Liu, Shen, and Gu 2019), given  (xk, zk, νk) ∈Rn × Rm × Rm, αk, γk, ξk ∈ Rm, βk, σk ∈ Rn, W1 ∈Rn×m, and  W2 ∈ Rm×m, set

˜xk+1 ≜ ηβk�xk − σk ◦ (W k1 )T �νk + αk ◦�Axk − zk − d���,

˜zk+1 ≜ ηγk�zk − ξk ◦ (W k2 )T �νk + αk ◦�A˜xk+1 − zk − d���,

image

with element-wise products denoted by  ◦. For the parameter ζk ≜ (αk, βk, γk, σk, ξk, W k1 , W k2 ), then define

image

Fixing the number of iterations K, the learnable parameters from (22) used in the LSKM Algorithm may be encoded by Θ = (ζk)Kk=1 =�αk, βk, γk, σk, ξk, W k1 , W k2�Kk=1 ,consist- ing of  (2n + 3m + mn + m2)Kscalars. To stabilize the training process, we share the  W1across all layers in practice. We also fix  W2 = −Id and only learn the step sizes ξkbefore it. Moreover, different from other experiments, we add an additional end-to-end training stage after we finish the normal layer-wise training described at the beginning of the Appendix, which is found helpful to improve the recovery performance at the final output layer.

For data generation, we use the same settings for the dictionary A and sparse vectors  x⋆as in the experiments of Subsection . But we make a small modification to to generation of noise  εdue to the  ℓ1-ℓ1objective, where we sample  εfrom the same (seen and unseen) distribution as  x∗, i.e. the noises are also sparse. We choose  τ = 1.0.4

Supplement for LISTA. We choose LISTA with coupled weight (i.e. LISTA-CP) in (Chen et al. 2018) for natural image denoising. This is done for two reasons: 1) LISTA-CP has a larger capacity to work well in complex real-world settings; 2) the dictionary A learned from natural images is much more ill-conditioned than the Gaussian matrix in Subsection . The dictionary  A ∈ R256×512is learned by solving a dictionary learning problem (Xu and Yin 2014) on  16 × 16image patches extracted from BSD500 dataset (Martin et al. 2001) 5.

The training set that we use includes 50,000 images patches of size  16 × 16randomly extracted from 200 images in the BSD500 training set, 250 patches each. White Gaussian noises with standard deviation  σ = 30(pixel values range from 0 to 255) are added to the patches. For testing, we use the “Peppers” image as the ground truth, which is 1,024 non-overlapping patches. The noisy testing patches in the seen distribution is generated in the same way as the training set. The testing patches in the unseen distribution is polluted by pepper-and-salt noises with density r = 70%.

The update operation from LISTA-CP (Chen et al. 2018) is similar to (25) but has one more matrix weight to learn in each layer:

image

where  ζ = {θ, ˜W}parameterizes the update operator with non-negative scalar  θand a matrix weight  W ∈ R256×512. The tunable parameters  Θ = (θk, ˆW k)Kk=1include K scalars and K matrices. We take K = 20, i.e. we train a 20-layer L2O LISTA-CP model. To train the L2O LISTA-CP model, we use (12) as loss function with A mentioned above and τ = 0.01.

Supplement to AdaLISTA. In similar manner to (Aberdam, Golts, and Elad 2021), we use the following setup. We take m = 50, n = 70, and  τ = 0.1. Each entry of the dictionary A′is sampled i.i.d from the standard Gaussian distribution, i.e.,  a′ij ∼ N(0, 1/m). Having these entries, we then normal- ize each column of  A′, with respect to the Euclidean norm.

Instead using a fixed dictionary in Subsection , for each sample, we generate a variant dictionary A from  A′by permutating its columns. Then the linear measurement d is constructed using  d = Ax⋆. The measurement-dictionary pair (d, A) is the input to the Safe-L2O scheme. For each d in the seen distribution  Ds, x⋆has supports of cardinality s = 6 that are uniformly sampled over all coordinates and the non-zero entries are sampled from the standard Gaussian distribution. For each d in the unseen distribution  Du, x⋆has supports of larger cardinality s = 10 and the non-zero entries are sampled from Gaussian distribution N(0, 2). Our data set consists of 20,000 training samples and 1,000 test samples.

We applied Safe-L2O to LASSO by using T defined in (14) and the L2O update operation from AdaLISTA (Aber- dam, Golts, and Elad 2021). The L2O operator L2O is parameterized by  ζ = (θ, γ, W1, W2)for positive scalars  θand γand matrices  W1, W2 ∈ Rm×mand defined by

image

where  W1and  W2are shared by operators in all iterations. The parameter  Θ = {(θk, γk)Kk=1, W1, W2}consists of 2K scalars and two  m × mmatrices.

This section contains proofs for a lemma, the main theorem and its associated corollary. A final lemma is given for the safeguarding condition used for Linearized ADMM.

Lemma 1. If  {xk}is a sequence generated by Safe-L2O and Assumptions 1 and 2 hold, then  {xk}is bounded and there is a summable sequence  {δk} ⊂ [0, ∞)such that, for all  k ∈ Nand  x⋆ ∈ fix(T(·; d)),

image

Proof. Fix any  x⋆ ∈ fix(T(·; d)). Set  I1 ⊆ Nto be the set of all indices such that the update relation  xk+1 = TΘk(xk; d)holds, and set  I2 ≜ N − I1so that  N = I1 ∪ I2.Then

image

which implies, by induction and the safeguard inequality,

image

where the final inequality holds by the fact the series is geometric and the term B is finite by Assumption 2.

A classic result (e.g. see Corollary 2.15 in (Bauschke and Combettes 2017)) states, for all  x, y ∈ Rnand  θ ∈ R,

image

Additionally, because  T(·; d)is averaged, there exists  α ∈(0, 1) such that  T(·; d) = (1−α)I+αQfor some 1-Lipschitz

operator Q. Together these two facts imply, for all  k ∈ I2,

image

= ∥(1 − α)(xk − x⋆) +  α(Q(xk) − x⋆)∥2(30c) = (1 − α)∥xk − x⋆∥2+  α∥Q(xk) − Q(x⋆)∥2(30d)

image

Because the terms  ∥xk+1 − xk∥for  k ∈ I1are summable by (28) and  ∥xk+1 − x⋆∥ ≤ ∥xk − x⋆∥for  k ∈ I2, the sequence  {δk}defined by  δk = ∥xk+1 − xk∥for  k ∈ I1and 0 otherwise is summable, which establishes (26) in the second claim of the lemma (upon application of the triangle inequality for  k ∈ I1). Moreover, applying this inequality inductively reveals, for all  k ∈ N,

image

which proves boundedness of  {xk}.

We restate the main theorem below and provide a proof.

Theorem 2. If  {xk}is a sequence generated by the inner loop in Safe-L2O and Assumptions 1 and 2 hold, then  {xk}converges to a limit  x⋆d ∈ fix(T(·; d)), i.e.  xk → x⋆d.

Proof. Let  I1 ⊆ Nbe the set of all indices such that the update relation  xk+1 = TΘk(xk; d)holds, and set I2 ≜ N − I1so that  N = I1 ∪ I2. If the L2O update is applied finitely many times (i.e.  |I1| < ∞), then there is a finite index after which the iteration coincides with the classic setting of Theorem 1, by which convergence follows. In what remains, assume  |I1| = ∞. We first show  {xk}contains a limit point in the fixed point set of  T(·; d)(Step 1). This is used to show the entire sequence  {xk}converges to this limit point (Step 2).

Step 1. By Lemma 1, the sequence  {xk}is bounded, and so also is the subsequence  {xnk}, with  {nk}an enumeration of I1. By the boundedness of  {xnk}, there exists a convergent subsequence  {xmk} ⊆ {xnk} ⊆ {xk}with limit  x∞. Ob-

serve

image

where the first and third inequalities are applications of the triangle inequality and the second is holds since T is 1-Lipschitz. Since each update in  {xmk}is “good,” the first term in the k-th summand is bounded about by  αµkand the second term is bounded by  2αµk/β. Whence

image

of the proof for Lemma 1. Since the sequence of partial sums in the series on the left of (35) is monotonicallly increasing (due to nonnegative summands) and is bounded by (35c), the series converges. Thus, together with the continuity of norms and the Lipschitz continuity of T, we deduce x∞ ∈ fix(T(· ; d))since

image

Step 2. All that remains is to show the entire sequence  {xk}converges to  x∞. To this end, let  ε > 0be given. It suffices to show there exists  N ∈ Nsuch that

image

Again utilizing Lemma 1, there exists a summable sequence {δk} ⊂ [0, ∞)such that

image

for all  k ∈ Nand  x⋆ ∈ fix(T(·; d)). Since  {δk}is summable, there exists  N1 ∈ Nsuch that

image

Since  xmk → x∞, there exists  N2 ≥ N1such that

image

Combining (39) and (40), repeated application of the triangle inequality reveals

image

This verifies (37), taking  N = N2, completing the proof.

We restate Corollary 1 below and then provide a proof.

Corollary 1. If  {xk}is generated by the inner loop in Safe-L2O and Assumption 1 holds, and  {µk}is generated using a scheme outlined in Table 2 with  α ∈ [0, 1)and β ∈ (0, ∞), then  xk → x⋆d ∈ fix(T(·; d)).

Proof. The proof is parsed into three parts, one for each choice of sequence  {µk}in Table 2.

Geometric Sequence. Define the sequence  {µk}using, for each  k ∈ N, the Geometric Sequence update formula in Table 2. Then  µk+1 = αµkwhenever an L2O update is used and whenever a fallback update is used for which the descent condition  Ckholds. This verifies (9) and that  {µk}is monotonically decreasing.

We verify (8) by induction. The base case holds by defini-tion of  µ1. Now suppose the inequality in (8) holds for a particular choice of k. If  Ckholds, then the inequality in Line 6 of Safe-L2O ensures  ∥T(xk+1; d)−xk+1∥+β∥xk+1−xk∥ ≤αµk = µk+1. If  Ckdoes not hold, then a fallback update is used, which with the nonexpansivity of T yields

image

In either case, the inequality (8) holds for the (k + 1)-st case, thereby closing the induction. By the principle of induction, (8) holds.

Exponential Moving Average. Given  θ ∈ (0, 1), define the sequence  {µk}using the EMA(θ) formula in Table 2. For each k when  µkchanges value, observe

image

and so (9) holds by the fact

image

This also shows  {µk}is monotonically decreasing. We verify (8) by induction. The base case holds by definition of  µ1. Now suppose the inequality in (8) holds for a

particular choice of k. If  Ckholds, then the inequality in Line 6 of Safe-L2O ensures

image

If  Ckdoes not hold, then a fallback update is used, which yields the inequality in (42). In either case, the inequality in (8) holds, closing the induction. By induction, (8) holds.

Recent Max. Define the sequence  {µk}using, for each  k ∈N, the Recent Max update formula in Table 2. Then

image

whenever an L2O update is used and whenever a fallback update is used for which the descent condition  Ckholds. This verifies (9) and that  {µk}is monotonically decreasing.

We verify (8) by induction. As before, the base case holds. Now suppose the inequality in (8) holds for a particular choice of k. If  Ckholds, then

image

If  Ckdoes not hold, then a fallback update is used, which yields the inequality in (42). In either case, the inequality in (8) holds, closing the induction.

Below is a lemma used to identify the safeguarding procedure for the LiADMM method.

Lemma 2. Let  {(xk, zk, νk)}be a sequence generated by LiADMM as in (19). If  αγ∥BtB∥2 < 1and  αβ∥AtA∥2 < 1, then for each index k there is an associated iterate  νksuch that the update  νk+1is generated by applying an averaged operator T to  νkwith respect to the Euclidean norm, i.e. νk+1 = T(νk). In addition, for

image

the fixed point residual is given by

image

Proof. We outline the proof as follows. We first derive a trio of updates that forms the application of an averaged operator for the ADMM problem (18) (Step 1). Next we rewrite the updates in a more meaningful manner using minimizations with f and g (Step 2). This formulation is then applied to a proximal ADMM problem (a special case of ADMM) that introduces auxiliary variables. This yields an explicit formula for (49) (Step 3). The remaining step uses substitution to transform the proximal ADMM formulation into the linearized ADMM updates in (19) (Step 4). This

derivation draws heavily from (Ryu and Yin 2022).

Step 1: The classic ADMM method applied to the problem (18) is equivalent to applying Douglas Rachford Splitting (DRS) to the problem

image

where  (A ▷ f)is the infimal postcomposition (e.g. see (Bauschke and Combettes 2017))

image

This yields the iteration

image

which may be rewritten in parts by

image

This formulation reveals

image

Below we transform this expression into something meaningful.

Step 2: It can be shown that if the range of  ATintersected with the domain of the dual  f ∗is nonempty (i.e.,  R(At) ∩ri dom(f ⋆) ̸= ∅), then

image

if and only if

image

Also, for a function  B(x) = A(t − x)we have

image

These equivalences imply (53) may be rewritten as

image

All instances of  ζkand  ζk+1may be removed upon substitution, i.e.

image

Step 3: First note that our hypothesis implies the inverses of P and Q are defined and the square root can be taken since the matrices are symmetric. In addition, P and Q are positive definite. The ADMM problem (18) is equivalent to

image

subject to the constraint

image

Using the same update formula as in Step 2 yields

image

Simplifying once more gives

image

Thus,

image

Step 4: We now derive the form of the linearized ADMM updates. Let  uk = α(νk1 − Axk+1). Then

image

and

image

In similar fashion, we deduce

image

Upon reordering the updates in (66), (67) and (68) to obtain the appropriate dependencies, we obtain (19), as desired.


Designed for Accessibility and to further Open Science