Fixed-point iteration abstracts most of the convex optimization methods that are based on gradient and/or proximal oracles, and those methods are targets of recent L2O accelerations. Our method is based on examining the fixed-point residual, and so it applies to a wide variety of L2O methods, even complicated first-order methods (e.g. ADMM and primal-dual methods) where the underlying problems can include constraints. To provide a brief background, here we overview fixed point methods. Denote the set of fixed points of each operator by
For an operator T with a nonempty fixed point set (i.e.
), we consider the fixed point problem
See Table 1 for examples of the operator T in convex minimization. We focus on fixed point iteration to give a general approach for creating sequences that converge to solutions of (2) and, thus, of the corresponding optimization problem.
The following definitions are used throughout. An operator is nonexpansive if it is 1-Lipschitz,1 i.e.
An operator T is averaged if there exists and a nonexpansive operator
such that T =
, with I the identity. A classic theorem states sequences generated by successively applying an averaged operator converge to a fixed point. In all of this work, we assume each operator T is averaged. This method comes from (Krasnosel’ski˘ı 1955) and (Mann 1953), which yielded adoption of the name Krasnosel’ski˘ı-Mann (KM) method. This result is stated below. Theorem 1. If an averaged operator
has a nonempty fixed point set and a sequence
with arbitrary initial iterate
satisfies the update relation
then there is a solution to (2) such that the sequence
converges to
.
For reference, we briefly discuss the use of resolvents in conventional algorithms. Consider a convex function with subgradient
. For
, the resolvent
of
is defined by
, i.e.
and the reflected resolvent of is
If f is closed, convex, and proper, then the resolvent is precisely the proximal operator, i.e.
Proximal operators for several well-known functions can be expressed by explicit formulas (e.g. see page 177 in (Beck 2017)). It can be shown that is nonexpansive and
is averaged (Bauschke and Combettes 2017). Table 1 provides several examples of these operators in well-known optimization algorithms.
Table 1: Averaged operators for well-known algorithms. We assume and, when
is multiplied by a gradient, we also assume
, with L the Lipschitz constant for the gradient. The dual of a function is denoted by a superscript
, and
. Operators J and R are defined in equations (5) and (6), respectively. The block matrix M is
. In each case, L is the Lagrangian associated with the presented problem.
This section presents the Safe-L2O framework. The safeguard acts as a wrapper around a data-driven algorithm, which is formulated in practice as a neural network. Each L2O operator, denoted throughout by , is parameterized by layerwise weights
. Input data d is used to define an optimization problem (e.g. the measurement vector in a least squares problem). To make this dependence clear, we often write
. Often
can be viewed as forming one or multiple layers of a feed forward network. Thus, we interpret
in Algorithm 1 as a feed forward network. In addition to an L2O operator
, our Safe-L2O method uses a fallback operator T (unrelated to
) and a scalar sequence
. Here T defines an averaged operator from the update formula of a conventional optimization algorithm. Each
defines a reference value to determine whether a tentative L2O update is “good.” Each reference value
in our safeguarding schemes is related to a combination of
and
among previous iterates i = 1, . . . , k, where
.
We propose the Safe-L2O scheme in Algorithm 2. As shown in Line 1, a safeguarded L2O operator consists of an L2O network , a fallback operator T, and a parameter
. Here
forms layerwise weights
that define the L2O update
at the k-th iteration. These parameters are trained beforehand, fol-
lowing standard training methods for machine learning models (as outlined in the Appendices). Table 1 shows numerous choices of the fallback operator T for different combinations of optimization problems and algorithms In Line 2, the initial iterate is chosen to be an arbitrary (but fixed) vector
. The initial iterate
of the safeguard sequence
is initialized using the initial iterate
, an L2O update
, and the fallback operator T in Line 3. From Line 4 to Line 11, a repeated loop occurs to compute each update
. In Line 5 the L2O operator is applied to the current iterate
get a tentative update
. This
is “good” if the the inequality in Line 6 holds. In such a case, the L2O update is assigned to
in Line 7. Otherwise, the fallback operator T is triggered to obtain the update
in Line 9. Note the initial iterate
is defined such that
, i.e. the first L2O update is always accepted. Lastly, the safeguard parameter is updated in Line 10. Refer to Table 2 for choices of schemes to update the safeguard parameters.
and there exists such that
Our proposed methods for choosing the sequence satisfy Assumption 2 (see Table 2). These methods are adaptive in the sense that each
depends upon the weights
, iterate
and (possibly) previous weights and iterates. Each safeguard parameter
also remains constant in k except for when the sum of residual norms
and
decreases to less than a geometric factor of
. This allows each
to trail the value of the residual norm
and the residual norm to increase in k from time to time. This trailing behavior provides flexibility to the L2O updates. Our main result is below and is followed by a corollary justifying use of the schemes in Table 2 (both proven in the appendix). Theorem 2. If
is a sequence generated by the inner loop in Safe-L2O and Assumptions 1 and 2 hold, then
converges to a limit
, i.e.
. Corollary 1. If
is generated by the inner loop in Safe-L2O and Assumption 1 holds, and
is generated using a scheme outlined in Table 2 with
and
, then
.
Remark 1. The appropriate frequency for the safeguard to trigger can be estimated by tuning L2O parameters for optimal performance on a training set without safeguarding and then using a validation set to test various safeguards with the L2O scheme. To avoid possible confusions, note we are not trying to prove the convergence of any standalone L2O algo- rithm. We instead 1) alarm on an L2O update when it may break convergence, 2) replace it with a fallback update, and 3) show the resulting “hybrid optimization” converges to a solution of the provided optimization problem.
Safe-L2O may be executed via inferences of a feed forward neural network. The input into the network is the data d, often in vector form. Input d is usually the observation we have, based on which we optimize over the variable of interest. For example, the LASSO problem in (12) is used for sparse coding, where the goal is to recover a unknown sparse vector from its noisy measurements
. Assuming A is known beforehand, the input to the Safe-L2O model is the observation d. In other cases, the dictionary A can change and also be part of the input to the model. We include case-by-case discussions about what the inputs for each numerical example.
Each layer of the Safe-L2O model is designed so that its input is , to which it applies either an L2O or fallback update (following the Safe-L2O method), and outputs
to the next layer. The set over which
is minimized, may be
Figure 2: Plot of error versus iteration for LISTA denoising. Here ISTA is the classic algorithm, LISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of LISTA in Algorithm 2. Trained with . Inferences used
and EMA(0.25). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. This plot shows the safeguard is used intermittently for k > 2.
Table 2: Rules for updating . Here
, and
is the statement
.
chosen with great flexibility. For each application of the algorithm, the fallback operator depends upon the data d.
The “optimal” choice of parameters depends upon the application. Suppose each d is drawn from a common distribution D. Then a choice of “optimal” parameters
may be identified as those for which the expected value of
is minimized among
, where
is an appropriate cost function and K is a fixed positive integer. Mathematically, this means
solves the problem
may be used (e.g. the min-max problem used by adversarial networks (Goodfellow et al. 2014)).
This section presents examples using Safe-L2O.2 We numerically investigate i) the convergence rate of Safe-L2O relative to corresponding conventional algorithms, ii) the efficacy of safeguarding procedures when inferences are performed on data for which L2O fails intermittently, and iii) the convergence of Safe-L2O schemes even when the application of L2O operators is not theoretically justified. We first use L2O from ALISTA (Liu et al. 2019) on a synthetic LASSO problem. We then use LISTA on a LASSO problem for image processing, differentiable linearized ADMM (Xie et al. 2019) on a sparse coding problem. In all three types of problems, the input data d to the L2O models is generated by
Figure 3: Plot of error versus iteration for LiADMM. Here LiADMM is the classic algorithm, L2O LiADMM is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of L2O LiADMM in Algorithm 2. Inferences used and EMA(0.75). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. The safeguard is used about 10% and 30% of the time when k = 4 and k = 5, respectively.
where is white Gaussian noise and
is the hidden variable that we want to recover through a dictionary A, which is generated and beforehand and fixed. In these experiments, the ground-truth vector
are sampled from a distribution, which characterizes the distribution of data of interest along with the dictionary. We denote the distribution of the input data as D. Besides these “linear” examples, we also validate Safe-L2O on a distribution of LASSO problems where the dictionary A also changes and is part of the input to the L2O models. In this case, the distributions of the dictionaries and ground-truth vectors together characterize the input distribution D.
In each example, denotes the optimal value of the ob- jective f(x; d) among all possible x. Performance is measured using a modified relative objective error:
where the expectations are estimated numerically (see the appendices for details). We use (11) rather than the expectation of relative error to avoid high sensitivity to outliers.
Our numerical results are shown in several plots. When each iterate is computed using data d drawn from the same distribution
that was used to train the L2O algorithm, we say the performance is on the “seen” distribution
. These plots form the primary illustrations of the speedup of L2O algorithms as compared to conventional optimization algorithms. When each d is drawn from a distribution
that is different than
, we refer to
as the unseen distribution. These plots show the ability of the safeguard to ensure convergence. A dotted plot with square markers is also added to show the frequency of safeguard activations among test samples, with the reference axis on the right hand side of the plots. We extend the Safe-L2O methods beyond their training iterations by applying the fallback operator T; we demarcate where this extension begins by changing the Safe-L2O plots from solid to dashed lines. As Safe-L2O convergence holds whenever
, we can set
to be arbitrarily small (e.g. below machine precision); for simplicity, we use
in the experiments (as, even in this case, it can be shown that iterates approach the solution set). Implementation details for each experiment are in the appendices.
ALISTA for LASSO
Here we consider the LASSO problem for sparse coding. Let be a sparse vector and
be a dictionary. We assume access is given to noisy linear measurements
, where
is additive Gaussian white noise and
Even for underdetermined systems, when
is sufficiently sparse and
is an appropriately chosen regularization parameter,
can often be reasonably estimated by solving the LASSO problem
where and
are the
and
norms, respectively. A classic method for solving (12) is the iterative shrinkage thresholding algorithm (ISTA) (e.g. see (Daubechies, De- frise, and Mol 2004)).3 Liu et al. (2019) present the L2O scheme ALISTA that we implement here. This L2O model L2O is parameterized by
.
Linearized ADMM
Let and
be as in the LASSO problem. Here we apply the L2O scheme differentiable linearized ADMM (LiADMM) of Xie et al. (2019) to the closely related sparse coding problem
Figure 4: Plot of error versus iteration for AdaLISTA example. Here ISTA is the classic algorithm, AdaLISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of AdaLISTA in Algorithm 2.
The L2O network L2O and fallback linearized ADMM (LiADMM) operator T are provided in the appendices along with implementation details. Plots are provided in Figure 3.
LISTA for Natural Image Denoising
To evaluate our safeguarding mechanism in a more realistic setting, we apply safeguarded LISTA to a natural image denoising problem. In this subsection, we learn a LISTA-CP model (Chen et al. 2018) to perform natural image denoising. During training, L2O LISTA-CP model is trained to recover clean images from their Gaussian noisy counterparts by solving (12). In (12), d is the noisy input to the model, and the clean image is recovered with , where
is the optimal solution. The dictionary
is learned on the BSD500 dataset (Martin et al. 2001) by solving a dictionary learning problem (Xu and Yin 2014). During testing, however, the learned L2O LISTA-CP is applied to unseen pepper-and-salt noisy images. Comparison plots are provided in Figure 2.
AdaLISTA: Dictionary as Part of Inputs
Here we consider the same LASSO problem (12) as in Subsection but make the dictionary A part of the inputs to the L2O model (i.e. able to change across samples). Aberdam, Golts, and Elad (2021) present a new L2O scheme AdaLISTA that is trained to quickly solve a distribution of LASSO problems with varying dictionaries. AdaLISTA has a different parameterization scheme from the original LISTA (Gre- gor and LeCun 2010) to enable the adaptivity to the dictionaries. The L2O model L2O in AdaLISTA is parameterized by and two weight matrices
where the dictionary A has shape
. The two matrices are shared by operators in all iterations.
We mainly follow the settings in (Aberdam, Golts, and Elad 2021). Specifically, we let be sparse vectors with random supports of cardinality s = 6 and a single fixed dictionary
. We assume access is given
Figure 5: Plot of error versus iteration for AdaLISTA example. Here ISTA is the classic algorithm, AdaLISTA is the L2O operator in Algorithm 1 and Safe-L2O is the safeguarded version of AdaLISTA in Algorithm 2. Inferences used and EMA(0.25). In (b), how often the L2O update is “bad” and the safeguard activates for Safe-L2O is indicated in reference to the right vertical axis. This plot shows the safeguard is used intermittently for k > 7.
to noiseless linear measurements , where A is uniformly sampled from all column-permuted variants of
. Figure 4 and 5 show summary plots, from which we have similar observations as in the ALISTA experiments with a single fixed dictionary.
Numerous insights may be drawn from our examples. The first observation is, roughly speaking, each L2O scheme in our numerical examples reduces computational costs by at least one order of magnitude when applied to data from the same distribution as the training data (as compared to analytic optimization algorithms). This is consistent with results of previous works. More importantly, plots in Figures 1b and 2b, and 5 show the safeguard steers updates to convergence when they would otherwise diverge or converge slower than the conventional algorithm. That is, Safe-L2O converges with data distinct from training while the nonsafeguarded L2O schemes diverge.
This work proposes a framework for ensuring convergence of L2O algorithms. Sequences generated by our Safe-L2O method provably converge to solutions of the optimization problems. Our Safe-L2O algorithm is also easy to implement as a wrapper around trained neural networks. Numerical experiments demonstrate rapid convergence by Safe-L2O methods and effective safeguarding when the L2O schemes appear to otherwise diverge. Future work will provide a better data-driven fallback method and investigate stochastic extensions.
Aberdam, A.; Golts, A.; and Elad, M. 2021. Ada-lista: Learned solvers adaptive to varying models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Ablin, P.; Moreau, T.; Massias, M.; and Gramfort, A. 2019. Learning step sizes for unfolded sparse coding. arXiv:1905.11071.
Bauschke, H.; and Combettes, P. 2017. Convex Analysis and Monotone Operator Theory in Hilbert Spaace. Springer, 2nd. edition.
Beck, A. 2017. First-order methods in optimization, volume 25. SIAM.
Blumensath, T.; and Davies, M. E. 2009. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3): 265–274.
Borgerding, M.; Schniter, P.; and Rangan, S. 2017. AMP- Inspired Deep Networks for Sparse Linear Inverse Problems. IEEE Transactions on Signal Processing, 65(16): 4293–4308.
Chen, T.; Chen, X.; Chen, W.; Heaton, H.; Liu, J.; Wang, Z.; and Yin, W. 2022. Learning to Optimize: A Primer and A Benchmark. Journal of Machine Learning Research, 23(189): 1–59.
Chen, X.; Liu, J.; Wang, Z.; and Yin, W. 2018. Theoretical Linear Convergence of Unfolded ISTA and Its Practical Weights and Thresholds. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 9061–9071. Curran Associates, Inc.
Daubechies, I.; Defrise, M.; and Mol, C. D. 2004. An itera- tive thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): 1413–1457.
Giryes, R.; Eldar, Y. C.; Bronstein, A. M.; and Sapiro, G. 2018. Tradeoffs Between Convergence Speed and Reconstruction Accuracy in Inverse Problems. IEEE Transactions on Signal Processing, 66(7): 1676–1690.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde- Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
Gregor, K.; and LeCun, Y. 2010. Learning Fast Approxima- tions of Sparse Coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, 399–406. USA: Omnipress.
Hershey, J. R.; Roux, J. L.; and Weninger, F. 2014. Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures. arXiv:1409.2574.
Krasnosel’ski˘ı, M. 1955. Two remarks about the method of successive approximations. Uspekhi Mat. Nauk, 10: 123– 127.
Liu, J.; Chen, X.; Wang, Z.; and Yin, W. 2019. ALISTA: An- alytic Weights Are As Good As Learned Weights in LISTA. In International Conference on Learning Representations.
Liu, Q.; Shen, X.; and Gu, Y. 2019. Linearized ADMM for Nonconvex Nonsmooth Optimization With Convergence Analysis. IEEE Access, 7: 76131–76144.
Mann, R. 1953. Mean Value Methods in Iteration. 4(3): 506– 510.
Martin, D.; Fowlkes, C.; Tal, D.; and Malik, J. 2001. A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, 416–423.
Metzler, C.; Mousavi, A.; and Baraniuk, R. 2017. Learned D- AMP: Principled Neural Network based Compressive Image Recovery. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 1772– 1783. Curran Associates, Inc.
Moreau, T.; and Bruna, J. 2017. Understanding Trainable Sparse Coding with Matrix Factorization.
Ryu, E.; and Yin, W. 2022. Large-Scale Convex Optimization: Algorithm Designs via Monotone Operators. Cambridge, England: Cambridge University Press.
Sprechmann, P.; Bronstein, A. M.; and Sapiro, G. 2015. Learning Efficient Sparse and Low Rank Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1821–1833.
Themelis, A.; and Patrinos, P. 2019. SuperMann: a superlin- early convergent algorithm for finding fixed points of nonexpansive operators.
Wang, Z.; Ling, Q.; and Huang, T. S. 2016. Learning Deep Encoders. In Thirtieth AAAI Conference on Artificial Intelligence.
Wang, Z.; Liu, D.; Chang, S.; Ling, Q.; Yang, Y.; and Huang, T. S. 2016. D3: Deep Dual-Domain Based Fast Restoration of JPEG-Compressed Images. 2764–2772.
Xie, X.; Wu, J.; Zhong, Z.; Liu, G.; and Lin, Z. 2019. Differ- entiable Linearized ADMM. arXiv:1905.06179.
Xin, B.; Wang, Y.; Gao, W.; Wipf, D.; and Wang, B. 2016. Maximal Sparsity with Deep Networks? In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 4340–4348. Curran Associates, Inc.
Xu, Y.; and Yin, W. 2014. A fast patch-dictionary method for whole image recovery. arXiv preprint arXiv:1408.3740.
Yang, Y.; Sun, J.; Li, H.; and Xu, Z. 2016. Deep ADMM-Net for Compressive Sensing MRI. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 10–18. Curran Associates, Inc.
Zhang, J.; O’Donoghue, B.; and Boyd, S. 2018. Globally convergent type-I Anderson acceleration for non-smooth fixed-point iterations. arXiv:1808.03971.
We begin with a general note on training and then consider individual experiments. The training procedure for each experiment was conducted layerwise. By this, we mean that first the network weights were tuned using one layer. Then the network was trained with two layers, using the learned weights from the first layer as an initialization/warm start for that layer’s parameters. This was then repeated until the final number K of layers was reached. This approach is built upon the intuition that, because the network layers model an optimization algorithm that progressively improves, each successive layer’s weights likely depend upon the previous layers weights.
Supplement for ALISTA. In similar manner to (Chen et al. 2018) and (Liu et al. 2019), we use the following setup. We take m = 250, n = 500, and . Each entry of the dictionary A is sampled i.i.d from the standard Gaussian distribution, i.e.,
. Having these entries, we then normalize each column of A, with respect to the Euclidean norm. Each d in the distribution
of data used to train the neural network is constructed using
with noise
and each entry of
as the composition of Bernoulli and Gaussian distributions, i.e.,
Ber
for all
. Each d in the un- seen distribution
is computed using the same distribution of noise
as before and using
Ber
. Our data set consists of 10,000 training samples and 1,000 test samples.
Given , the ISTA method iteratively computes
where and
is the soft-thresholding function defined by component-wise operations:
We applied Safe-L2O to the LASSO problem above by using T defined in (14) and the L2O update operation from ALISTA (Liu et al. 2019). The L2O operator L2O is parameterized by for positive scalars
and
and defined by
where
and is the Frobenius norm and the Matlab notation
is used to denote the
th column of the matrix M. The parameter
consists of 2K scalars.
Supplement for LiADMM. LiADMM is used to solve problems of the form
for which LiADMM generates sequences and
defined by the updates
with given scalars . The problem (13) may be written in the form of (18) by taking
, and
Id. In this case, the proximal operators in (19) reduce to soft-thresholding operators. Although not given in Table 1, at each iteration of LiADMM there is an associated iterate
for which the update
is generated by applying an averaged operator T to the current iterate
. As shown in Lemma 2 of Subsection A.2, for our setup, assuming
and
, the norm of the associated fixed point residual at the iterate
is given by
with P defined below in (48). For notational clarity, the term in the SKM and LSKM schemes is replaced in this subsection by the tuple
. This is of practical importance too since it is the sequence
that converges to a solution of (13).
We now modify the iteration (19) for the problem (13) to create the LiADMM L2O scheme. We generalize soft- thresholding to vectorized soft-thresholding for by
We assume represents the scalar soft-thresholding in (15) when
and the vector generalization (21) when
. Combining ideas from ALISTA (Liu et al. 2019) and LiADMM (Liu, Shen, and Gu 2019), given
, and
, set
with element-wise products denoted by . For the parameter
, then define
Fixing the number of iterations K, the learnable parameters from (22) used in the LSKM Algorithm may be encoded by consist- ing of
scalars. To stabilize the training process, we share the
across all layers in practice. We also fix
Id and only learn the step sizes
before it. Moreover, different from other experiments, we add an additional end-to-end training stage after we finish the normal layer-wise training described at the beginning of the Appendix, which is found helpful to improve the recovery performance at the final output layer.
For data generation, we use the same settings for the dictionary A and sparse vectors as in the experiments of Subsection . But we make a small modification to to generation of noise
due to the
objective, where we sample
from the same (seen and unseen) distribution as
, i.e. the noises are also sparse. We choose
.4
Supplement for LISTA. We choose LISTA with coupled weight (i.e. LISTA-CP) in (Chen et al. 2018) for natural image denoising. This is done for two reasons: 1) LISTA-CP has a larger capacity to work well in complex real-world settings; 2) the dictionary A learned from natural images is much more ill-conditioned than the Gaussian matrix in Subsection . The dictionary is learned by solving a dictionary learning problem (Xu and Yin 2014) on
image patches extracted from BSD500 dataset (Martin et al. 2001) 5.
The training set that we use includes 50,000 images patches of size randomly extracted from 200 images in the BSD500 training set, 250 patches each. White Gaussian noises with standard deviation
(pixel values range from 0 to 255) are added to the patches. For testing, we use the “Peppers” image as the ground truth, which is 1,024 non-overlapping patches. The noisy testing patches in the seen distribution is generated in the same way as the training set. The testing patches in the unseen distribution is polluted by pepper-and-salt noises with density r = 70%.
The update operation from LISTA-CP (Chen et al. 2018) is similar to (25) but has one more matrix weight to learn in each layer:
where parameterizes the update operator with non-negative scalar
and a matrix weight
. The tunable parameters
include K scalars and K matrices. We take K = 20, i.e. we train a 20-layer L2O LISTA-CP model. To train the L2O LISTA-CP model, we use (12) as loss function with A mentioned above and
.
Supplement to AdaLISTA. In similar manner to (Aberdam, Golts, and Elad 2021), we use the following setup. We take m = 50, n = 70, and . Each entry of the dictionary
is sampled i.i.d from the standard Gaussian distribution, i.e.,
. Having these entries, we then normal- ize each column of
, with respect to the Euclidean norm.
Instead using a fixed dictionary in Subsection , for each sample, we generate a variant dictionary A from by permutating its columns. Then the linear measurement d is constructed using
. The measurement-dictionary pair (d, A) is the input to the Safe-L2O scheme. For each d in the seen distribution
has supports of cardinality s = 6 that are uniformly sampled over all coordinates and the non-zero entries are sampled from the standard Gaussian distribution. For each d in the unseen distribution
has supports of larger cardinality s = 10 and the non-zero entries are sampled from Gaussian distribution N(0, 2). Our data set consists of 20,000 training samples and 1,000 test samples.
We applied Safe-L2O to LASSO by using T defined in (14) and the L2O update operation from AdaLISTA (Aber- dam, Golts, and Elad 2021). The L2O operator L2O is parameterized by for positive scalars
and
and matrices
and defined by
where and
are shared by operators in all iterations. The parameter
consists of 2K scalars and two
matrices.
This section contains proofs for a lemma, the main theorem and its associated corollary. A final lemma is given for the safeguarding condition used for Linearized ADMM.
Lemma 1. If is a sequence generated by Safe-L2O and Assumptions 1 and 2 hold, then
is bounded and there is a summable sequence
such that, for all
and
,
Proof. Fix any . Set
to be the set of all indices such that the update relation
holds, and set
so that
Then
which implies, by induction and the safeguard inequality,
where the final inequality holds by the fact the series is geometric and the term B is finite by Assumption 2.
A classic result (e.g. see Corollary 2.15 in (Bauschke and Combettes 2017)) states, for all and
,
Additionally, because is averaged, there exists
(0, 1) such that
for some 1-Lipschitz
operator Q. Together these two facts imply, for all ,
) +
(30c)
+
(30d)
Because the terms for
are summable by (28) and
for
, the sequence
defined by
for
and 0 otherwise is summable, which establishes (26) in the second claim of the lemma (upon application of the triangle inequality for
). Moreover, applying this inequality inductively reveals, for all
,
which proves boundedness of .
We restate the main theorem below and provide a proof.
Theorem 2. If is a sequence generated by the inner loop in Safe-L2O and Assumptions 1 and 2 hold, then
converges to a limit
, i.e.
.
Proof. Let be the set of all indices such that the update relation
holds, and set
so that
. If the L2O update is applied finitely many times (i.e.
), then there is a finite index after which the iteration coincides with the classic setting of Theorem 1, by which convergence follows. In what remains, assume
. We first show
contains a limit point in the fixed point set of
(Step 1). This is used to show the entire sequence
converges to this limit point (Step 2).
Step 1. By Lemma 1, the sequence is bounded, and so also is the subsequence
, with
an enumeration of
. By the boundedness of
, there exists a convergent subsequence
with limit
. Ob-
serve
where the first and third inequalities are applications of the triangle inequality and the second is holds since T is 1-Lipschitz. Since each update in is “good,” the first term in the k-th summand is bounded about by
and the second term is bounded by
. Whence
of the proof for Lemma 1. Since the sequence of partial sums in the series on the left of (35) is monotonicallly increasing (due to nonnegative summands) and is bounded by (35c), the series converges. Thus, together with the continuity of norms and the Lipschitz continuity of T, we deduce since
Step 2. All that remains is to show the entire sequence converges to
. To this end, let
be given. It suffices to show there exists
such that
Again utilizing Lemma 1, there exists a summable sequence such that
for all and
. Since
is summable, there exists
such that
Since , there exists
such that
Combining (39) and (40), repeated application of the triangle inequality reveals
This verifies (37), taking , completing the proof.
We restate Corollary 1 below and then provide a proof.
Corollary 1. If is generated by the inner loop in Safe-L2O and Assumption 1 holds, and
is generated using a scheme outlined in Table 2 with
and
, then
.
Proof. The proof is parsed into three parts, one for each choice of sequence in Table 2.
Geometric Sequence. Define the sequence using, for each
, the Geometric Sequence update formula in Table 2. Then
whenever an L2O update is used and whenever a fallback update is used for which the descent condition
holds. This verifies (9) and that
is monotonically decreasing.
We verify (8) by induction. The base case holds by defini-tion of . Now suppose the inequality in (8) holds for a particular choice of k. If
holds, then the inequality in Line 6 of Safe-L2O ensures
. If
does not hold, then a fallback update is used, which with the nonexpansivity of T yields
In either case, the inequality (8) holds for the (k + 1)-st case, thereby closing the induction. By the principle of induction, (8) holds.
Exponential Moving Average. Given , define the sequence
using the EMA(
) formula in Table 2. For each k when
changes value, observe
and so (9) holds by the fact
This also shows is monotonically decreasing. We verify (8) by induction. The base case holds by definition of
. Now suppose the inequality in (8) holds for a
particular choice of k. If holds, then the inequality in Line 6 of Safe-L2O ensures
If does not hold, then a fallback update is used, which yields the inequality in (42). In either case, the inequality in (8) holds, closing the induction. By induction, (8) holds.
Recent Max. Define the sequence using, for each
N, the Recent Max update formula in Table 2. Then
whenever an L2O update is used and whenever a fallback update is used for which the descent condition holds. This verifies (9) and that
is monotonically decreasing.
We verify (8) by induction. As before, the base case holds. Now suppose the inequality in (8) holds for a particular choice of k. If holds, then
If does not hold, then a fallback update is used, which yields the inequality in (42). In either case, the inequality in (8) holds, closing the induction.
Below is a lemma used to identify the safeguarding procedure for the LiADMM method.
Lemma 2. Let be a sequence generated by LiADMM as in (19). If
and
, then for each index k there is an associated iterate
such that the update
is generated by applying an averaged operator T to
with respect to the Euclidean norm, i.e.
. In addition, for
the fixed point residual is given by
Proof. We outline the proof as follows. We first derive a trio of updates that forms the application of an averaged operator for the ADMM problem (18) (Step 1). Next we rewrite the updates in a more meaningful manner using minimizations with f and g (Step 2). This formulation is then applied to a proximal ADMM problem (a special case of ADMM) that introduces auxiliary variables. This yields an explicit formula for (49) (Step 3). The remaining step uses substitution to transform the proximal ADMM formulation into the linearized ADMM updates in (19) (Step 4). This
derivation draws heavily from (Ryu and Yin 2022).
Step 1: The classic ADMM method applied to the problem (18) is equivalent to applying Douglas Rachford Splitting (DRS) to the problem
where is the infimal postcomposition (e.g. see (Bauschke and Combettes 2017))
This yields the iteration
which may be rewritten in parts by
This formulation reveals
Below we transform this expression into something meaningful.
Step 2: It can be shown that if the range of intersected with the domain of the dual
is nonempty (i.e.,
ri dom
), then
if and only if
Also, for a function we have
These equivalences imply (53) may be rewritten as
All instances of and
may be removed upon substitution, i.e.
Step 3: First note that our hypothesis implies the inverses of P and Q are defined and the square root can be taken since the matrices are symmetric. In addition, P and Q are positive definite. The ADMM problem (18) is equivalent to
subject to the constraint
Using the same update formula as in Step 2 yields
Simplifying once more gives
Thus,
Step 4: We now derive the form of the linearized ADMM updates. Let . Then
and
In similar fashion, we deduce
Upon reordering the updates in (66), (67) and (68) to obtain the appropriate dependencies, we obtain (19), as desired.