Non-stationary Douglas-Rachford and alternating direction method of multipliers: adaptive stepsizes and convergence

2018·arXiv

Abstract

1 Introduction

In this paper we consider the Douglas-Rachford (DR) method to solve the problem of finding a zero of the sum of two maximal monotone operators, i.e., solving:

where t > 0 is a given stepsize. This iteration scheme also makes sense for general maximal monotone operators as soon as B is single-valued. It has been observed in [29] that the iteration can be rewritten for arbitrary maximally monotone operators by substituting u := and using the identity

Comparing (2) and (4), we see that (4) does not require to evaluate Bu, which avoids assuming that B is single-valued as in (2). Otherwise, is not uniquely defined. While in (2) converges to a solution of (1), in (4) is just an intermediate sequence converging to such that = (

is a solution of (1). Therefore, (2) gives us a convenient form to study the DR method in the framework of fixed-point theory. Note that the iterations (2) and (4) are equivalent in the stationary case, but they are not equivalent in the non-stationary case, i.e., when the stepsize t varies along the iterations; we will shed more light on this later in Section 2.2.

From a practical point of view, the performance of a DR scheme mainly depends on the following two aspects:

• Good stepsize t: It seems to be generally acknowledged that the choice of the stepsize is crucial for the algorithmic performance of the method but a general rule to choose the stepsize seems to be missing [2, 16]. So far, convergence theory of DR methods provides some theoretical guidance to select the parameter t in a given range in order to guarantee convergence of the method. Such a choice is often globally fixed for all iterations, and does not take into account local structures of the underlying operators A and B. Moreover, the global convergence rate of the DR method is known to be

O(1/n) under only monotonicity assumption, but often using an averaging sequence [13, 14, 24], where n is the iteration counter. Several experiments have shown that DR methods have better practical rate than its theoretical bound [33] by using the last iterate (i.e. not an averaging sequence). In the special case of convex optimization problems, the Douglas-Rachford method is equivalent to the alternating direction methods of multipliers (ADMM) (see, e.g. the recent [21] for a short historical account) and there a several proposals for dynamic stepsizes for ADMM [25, 28, 37, 41] but we are not aware of a method that applies to DR in the case of monotone operators. The recent work [30] provides explicit choices for constant stepsizes in cases where the monotone operator posses further properties.

• Proper metric: Since the DR method is not invariant as the Newton method, the choice of metric and preconditioning seems to be crucial to accelerate its performance. Some researchers have been studying this aspect from different views, see, e.g., [10, 8, 19, 20, 23, 34]. Clearly, the choice of a metric and a preconditioner also affects the choice of the stepsize.

Note that a metric choice often depends on the variant of methods, while the choice of stepsize depends on the problem structures such as the strongly monotonicity parameters and the Lipschitz constants [30]. In general cases, where A and B are only monotone, we only have a general rule to select the parameter t to obtain its sublinear convergence rate [14, 16, 24]. This stepsize depends on the mentioned global parameters only and does not adequately capture the local structure of A and B to adaptively update t. For instance, a linesearch procedure to evaluate a local Lipschitz constant for computing stepsize in first-order methods can beat the optimal stepsize using global Lipschitz constant [5], or a Barzilai-Borwein stepsize in gradient descent methods essentially exploits local curvature of the objective function to obtain a good performance.

Our contribution: We prove the convergence of a new version of the non-stationary Douglas-Rachford method for the case where both operators are merely maximally monotone. Moreover, we propose a very simple adaptive step-size rule and demonstrate that this rule does improve convergence in practical situations. We also transfer our results to the case of ADMM and obtain a new adaptive rule that outperforms previously known adaptive ADMM methods and also does have a convergence guarantee. Our step-size rule is relatively simple and does not incur significantly computational effort rather than the norm of two vectors. Our stepsize rule has a theoretical convergence guarantee.

Paper organization: We begin with an analysis of the case of linear monotone operators in section 2, analyze the convergence of the non-stationary form of the iteration (2), i.e. the form where varies with n, in section 3, and then propose adaptive stepsize rules in section 4. Section 5 extends the analysis to non-stationary ADMM. Finally, section 6 provides several numerical experiments for the DR scheme and ADMM using our new stepsize rule.

1.1 State of the art

There are several results on the convergence of the iteration (4). The seminal paper [29] showed that, for any positive stepsize t, the iteration map in (4) is firmly nonexpansive, that the sequence weakly converges to a fixed point of the iteration map [29, Prop. 2] and that, = weakly converges to a solution of the inclusion (1) as soon as A, B and A + B are maximal monotone [29, Theorem 1]. In the case where B is coercive and Lipschitz continuous, linear convergence was also shown in [29, Proposition 4]. These results have been extended in various ways. Let us attempt to summarize some key contributions on the DR method. Eckstein and Bertsekas in [16] showed that the DR scheme can be cast into a special case of the proximal point method [35]. This allows the authors to exploit inexact computation from the proximal point method [35]. They also presented a connection between the DR method and the alternating direction method of multipliers (ADMM). In [39] Svaiter proved a weak convergence of the DR method in Hilbert space without the assumption that A + B is maximal monotone and the prove have been simplified in [38]. Combettes [11] cast the DR method as special case of the averaging operator from a fixed-point framework. Applications of the DR method have been studied in [12]. The convergence rate of the DR method was first studied in [29] for the strongly monotone case, while the sublinear rate was then proved in [24]. A more intensive research on convergence rates of the DR methods can be found in [13, 14, 30, 31]. The DR method has been extended to accelerated variant in [33] but specifying for a special setting. In [27] the authors analyzed a non-stationary DR method derived from (4) in the framework of perturbations of non-expansive iterations and showed convergence for convergent stepsize sequences with summable errors.

The DR method together with its dual variant, ADMM, become extremely popular in recent years due to a wide range of applications in image processing, and machine learning [6, 26], which are unnecessary to recall them all here.

In terms of stepsize selection for DR schemes as well as for ADMM methods, it seems that there is very little work available from the literature. Some general rules for fixed stepsizes based on further properties of the operators such as strong monotonicity, Lipschitz continuity, and coercivity are given in [18, 30], and it is shown that the resulting linear rates are tight. Heuristic rules for fixed stepsizes motivated by quadratic problems are derived in [20]. A self-adaptive stepsize for ADMM proposed in [25] seems to be one of the first work in this direction. The recent works [41, 42] also proposed an adaptive update rule for stepsize in ADMM based on a spectral estimation. Some other papers rely on theoretical analysis to choose optimal stepsize such as [17], but it only works in the quadratic case. In [28], the authors proposed a nonincreasing adaptive rule for the penalty parameter in ADMM. Another update rule for ADMM can be found in [37]. While ADMM is a dual variant of the DR scheme, we unfortunately have not seen any work that converts such an adaptive step-size from ADMM to the DR scheme where the more general case of monotone operators can be handled. In addition, the adaptive step-size for the DR scheme

by itself seems to not exist in the literature.

1.2 A motivating linear example

While the Douglas-Rachford iteration (weakly) converges for all positive stepsizes t > 0, it seems to be folk wisdom, that there is a “sweet spot” for the stepsize which leads to fast convergence. We illustrate this effect with a simple linear example. We consider a linear equation

where are two matrices of the size with m = 200. We choose symmetric positive definite matrices with rank(A) = + 10 and rank such that A + B has full rank, and thus the equation 0 = Ax + Bx has zero as its unique solution.Since B is single-valued, we directly use the iteration (2).

Remark 1.1. Note that the shift ˜would allow to treat inhomogeneous equation (A + B)x = y. If is a solution of this equation, then one sees that iteration (2) applied to A + ˜B is equivalent to applying the iteration to A + B but for the residual .

We ran the DR scheme (2) for a given range of different values of t > 0, and show the residuals in semi-log-scale on the left of Figure 1. One observes the following typical behavior for this example:

• A not too small stepsize (t = 0.5 in this case) leads to good progress in the beginning, but slows down considerably in the end.

• Large stepsizes (larger than 2 in this case) are slower in the beginning and tend to produce non-monotone decrease of the error.

• Somewhere in the middle, there is a stepsize which performs much better than the small and large stepsizes.

In this particular example the stepsize t = 1.5 greatly outperforms the other stepsizes. On the right of Figure 1 we show the norm of the residual after a fixed number of iterations for varying stepsizes. One can see that there is indeed a sweet spot for the stepsizes around t = 1.5. Note that the value of t = 1.5 is by no means universal and this sweet spot of 1.5 varies with the problem size, with the ranks of A and B, and even for each particular instance of this linear example.

2 Analysis of the linear monotone inclusion

In order to develop an adaptive stepsize for our non-stationary DR method, we first consider the linear problem instance of (1). We consider the original DR

Figure 1: The residual of the Douglas-Rachford scheme (2) for the linear example. Left: The dependence of the residual on the iterations with different stepsizes. Right: The dependence of the residual on the the stepsize with different numbers of iterations.

scheme (2) instead of (4) since (2) generates the sequence which converges to a solution of (1), while the sequence computed by (4) does not converge to a solution and its limit does depend on the stepsize in general.

2.1 The construction of adaptive stepsize for single-valued operator B

When both A and B are linear, the DR scheme (2) can be expressed as a fixed-point iteration scheme of the following mapping:

Recall that, by Remark 1.1, all of this section also applies not only to problem (A + B)x = 0 but also problem (A + B)x = y. The notion of a monotone operator has a natural equivalence for matrices, which is, however, not widely used. Hence, we recall that a matrix is called monotone, if, for all , it holds that 0. Note that any symmetric positive semidefinite (spd) matrix is monotone, but a monotone matrix is not necessarily spd. Examples of a monotone matrices that are not spd are

The first matrix is skew symmetric, i.e., = and any such matrix is monotone. Note that even if A and B are spd (as in our example in Section 1.2), the iteration map in (6) is not even symmetric. Consequently, the asymptotic convergence rate of the iteration scheme (2) is not governed by the norm of but by its ), which is the largest magnitude of an eigenvalue of (cf. [22, Theorem 11.2.1]). Moreover, the eigenvalues and eigenvectors of are complex in general.

First, it is clear from the derivation of that the eigenspace of for the eigenvalue = 1 exactly consists of the solutions of (A + B)x = 0.

In the following, for any (the set of complex numbers) and r > 0, we denote by ) the ball of radius r centered at z. We estimate the eigenvalues of that are different from 1.

Lemma 2.1. Let be monotone, and be defined by (6). Let be an eigenvalue of with corresponding eigenvector . Assume that = 1 and define c by

i.e. .

Proof. Note that for a real, linear, and monotone map M, and a complex vector a = b + ic, it holds that and thus, Re(0. This shows that 0.

fulfills

Now, if we denote u := Bz, then this expression becomes

which, by rearranging, leads to

Hence, by monotonicity of tA, we can derive from the above relation that

This leads to

Denoting := , the last expression reads as

Recalling the definition of c in (7), we get

This is equivalent to

which, in turn, is equivalent to . Adding to both sides, it leads to (, which shows the desired estimate.

In general, the eigenvalues of depend on t in a complicated way. For t = 0, we have and hence, all eigenvalues are equal to one. For growing t > 0, some eigenvalues move into the interior of the circle 2) and for , it seems that all eigenvalues tend to converge to the boundary of such a circle, see Figure 2 for an illustration of eigenvalue distribution.

Figure 2: Eigenvalues of for different values of t for a linear example similar to the example (5) in Section 1.2 (but with m = 50).

Remark 2.1. It appears that Lemma 2.1 is related to Proposition 4.10 of [3] and also to the fact that the iteration mapping is (in the general nonlinear case) known to be not only non-expansive, but firmly non-expansive (cf. [16, Lemma 1] and [16, Figure 1]). In general, firmly non-expansiveness allows overrelaxation of the method, and indeed, one can also easily see this in the linear case as well: If is an eigenvalue of , then it lies in 2) (when it is not equal to one) and the corresponding eigenvalue of the relaxed iteration map

is = 1and lies in ). Therefore, for 0 2 all eigenvalues different from one of the relaxed iteration

lie in a circle of radius 2 centered at 1 2, and hence, the iteration is still non-expansive. It is know that relaxation can speed up convergence, but we will not investigate this in this paper.

Lemma 2.1 tells us a little more than that all eigenvalues of the iteration map lie in a circle centered at of radius . Especially, all eigenvalues except for = 1 have magnitude strictly smaller than one if Re(0 for all corresponding eigenvectors z. This implies that the iteration map is indeed asymptotically contracting outside the set of solutions = 0} of (1). This proves that the stationary iteration = converges to a zero point of the map A + B at a linear rate. Note that this does not imply the convergence in the non-stationary case.

To optimize the convergence speed, we aim at minimizing the spectral radius of , which is the magnitude of the largest eigenvalue of and there seems to be little hope to explicitly minimize this quantity.

Here is a heuristic argument based on Lemma 2.1, which we will use to derive an adaptive stepsize rule: Note that is increasing and hence, to minimize the upper bound on (more precisely: the distance of to ) we want to make c from (7) as large as possible. This is achieved by minimizing the denominator of c over t which happens for

This gives c = Re() and note that 0 2 (which implies 0 ). This motivates an adaptive choice for the stepsize as

in the Douglas-Rachford iteration scheme (2).

Remark 2.2. One can use the above derivation to deduce that t = 1is a good constant step-size. In fact, this is also the stepsize that gives the best linear rate derived in [29, Proposition 4], which is minimized when t = 1/M where M is the Lipschitz constant of B. However, this choice does not perform well in practice in our experiments.

Since little is known about the non-stationary Douglas-Rachford iteration in general (besides the result from [27] on convergent stepsizes with summable errors), we turn to an investigation of this method in Section 3. Before we do so, we generalize the heuristic stepsize to the case of multivalued B.

2.2 The construction of adaptive stepsize for non-single-valued B

In the case of multi-valued B, one needs to apply the iteration (4) instead of (2). To motivate an adaptive choice for the stepsize in this case, we again consider situation of linear operators.

In the linear case, the iteration (4) is given by the iteration matrix

Comparing this with the iteration map from (6) (corresponding to (2)) one notes that = (

i.e., the matrices and are similar and hence, have the same eigenvalues. Moreover, if z is an eigenvector of with the eigenvalue , then (I + tB)z is an eigenvector of for the same eigenvalue .

However, in the case of the iteration (4) we do not assume that B is single-valued, and thus, the adaptive stepsize using the quotient cannot be used. However, again due to (3), we can rewrite this quotient without applying B and get, with , that

Note that the two iteration schemes (2) and (4) are not equivalent in the non-stationary and non-linear case. Indeed, let us consider such that := . By induction, we have = . Substituting into (2), we obtain

From (3) we have

Substituting = and into (10), we obtain

Updating by (9) would then give

This scheme essentially has the same per-iteration complexity as in the standard DR method since the computation of does not significantly increase the cost.

Note that the non-stationary scheme (11) is notably different from the non-stationary scheme derived directly from (4) (which has been analyzed in [27]). To the best of our knowledge, the scheme (11) is new.

3 Convergence of the non-stationary DR method

In this section, we prove weak convergence of the new non-stationary scheme (11). We follow the approach by [38, 39] and restate the DR iteration as follows: Given () such that ) and a sequence , at each iteration 0,

we iterate

Note that, in the case of single-valued B, this iteration reduces to

and this scheme can, as shown in Section 2.2, be transformed into the non-stationary iteration scheme (11). Below are some consequences which we will need in our analysis:

Before proving our convergence result, we state the following lemma.

Lemma 3.1. Let , and be three nonnegative sequences, and be a bounded sequence such that for 0:

If , then and are bounded.

Proof. If , then

If , then 1 and

By the assumption that and, without loss of generality, we assume that the latter term is positive (which is fulfilled for n large enough, because 0). Thus, in both cases, we can show that

Recursively, we get

Under the assumption , we have

M > 0 and all 1. Then, we have (). This shows that is bounded. Since , and are all nonnegative, it implies that and are bounded.

Theorem 3.1 (Convergence of non-stationary DR). Let A and B be maximally monotone and be a positive sequence such that

where 0 ¯are given. Then, the sequence generated by the iteration scheme (12) weakly converges to some () in the extended solution set S(A, B) = of (1), so in particular, 0 ).

Proof. The proof of this theorem follows the proof of [39, Theorem 1]. First, we

observe that, for any (), we have

Moreover, by (13) and (14) it holds that

and thus

We see from (16) that

The first line gives

Summing this inequality from n = 1 to n = N, we get

Now, since is bounded and it holds that

by our assumption, we can conclude that

i.e., by (15), we have

This expression shows that and are also bounded. Due to the boundedness of , we conclude the existence of weak convergence subsequences and such that

and by the above limits, we also have

From [1, Corollary 3] it follows that (). This shows that has a weak cluster point and that all such points are in S(A, B). Now we deduce from (17) that

Since is bounded and , this shows that the sequence is quasi-Fejer convergent to the extended solution set S(A, B) with respect to the distance d((u, b), (z, w)) = + (. Thus, similar to the proof of [39, Theorem 1], we conclude that the whole sequence weakly converges to an element of S(A, B).

4 An adaptive step-size for DR methods

The step-size suggested by (8) or by (9) is derived from our analysis of a linear case and it does not guarantee the convergence in general. In this section, we suggest modifying this step-size so that we can prove the convergence of the DR scheme. We build our adaptive step-size based on two insights:

• The estimates of the eigenvalues of the DR-iteration in the linear case from Section 2.1 motivated the adaptive stepsize

• Theorem 3.1 ensures the convergence of the non-stationary DR-iteration as soon as the stepsize sequence is convergent with summable increments.

However, the sequences (18) and (19) are not guaranteed to converge (and numerical experiments indicate that, indeed, divergence may occur). Here is a way to adapt the sequence (18) to produce a suitable stepsize sequence in the single-valued case:

1. Choose safeguards 0 , a summable “conservation sequence” (0, 1] with = 1 and start with = 0.

2. Let proj) be the projection onto a box []. We construct as

The following lemma ensures that this will lead to a convergent sequence .

Lemma 4.1. Let be a bounded sequence, i.e., , and (0, 1] such that and = 1. Then, the sequence defined by = 0 and

is in [] and converges to some and it holds that .

Proof. Obviously, and since is a convex combination of and , one can easily see that obeys the same bounds as , i.e. . Moreover, it holds that

thus ) from which the assertion follows, since is summable.

Clearly, if we apply Lemma 4.1 to the sequence defined by (20), then it converges to some .

We use a similar trick to construct an adaptive stepsize based on the choice (19) in the case of multi-valued operators. More precisely, we construct as follows:

1. Choose safeguards 0 , a summable “conservation sequence” (0, 1], and = 1.

2. We construct as

In this case we get that = and since and is bounded, the summability of implies summability of . This implies that converges to some positive value and and thus, 0, too.

The stepsize sequence constructed by either (20) or (21) fulfills the conditions of Theorem 3.1. Hence, the convergence of the nonstationary DR scheme using this adaptive stepsize follows as a direct consequence. We will provide guidelines on how to choose the safeguards and the conservation sequence in practice in Section 6.1.

5 Application to ADMM

It is well-known that the alternating direction method of multipliers (ADMM) for convex optimization with linear constraint can be interpreted as the DR method on its dual problem, see, e.g. [16]. In this section, we apply our adaptive stepsize to ADMM to obtain a new variant for solving the following constrained problem:

where are two proper, closed, and convex functions, and are two given bounded linear operators, and .

The dual problem associated with (22) becomes

where and are the Fenchel conjugate of and , respectively. The optimality condition of (23) becomes

which is of the form (1).

In the stationary case, ADMM is equivalent to the DR method applying to the dual problem (24), see, e.g., [16]. However, for the non-stationary DR method, we can derive a different parameter update rule for ADMM. Let us summarize this result into the following theorem for the non-stationary scheme (11). The proof of this theorem is given in Appendix A.

Theorem 5.1. Given 0 , the ADMM scheme for solving

(22) derived from the non-stationary DR method (11) applying to (24) becomes:

Consequently, the sequence generated by (25) weakly converges to a solution of the dual problem (23).

The ADMM variant (25) is essentially the same as the standard ADMM, but its parameter is adaptively updated. This rule is different from [25, 41].

6 Numerical experiments

In this section we provide several numerical experiments to illustrate the influ-ence of the stepsize and the adaptive choice in practical applications. Although we motivate the adaptive stepsize only for linear problems, we will apply it to problems that do not fulfill this assumption since the convergence of the method is ensured by Theorem 3.1 in all cases. We also note that the steps of the non-stationary method may be more costly than the one with constant stepsize, if the evaluation of the resolvents is costly and the constant stepsize can be leveraged to precompute something. This is the case when A and/or B is linear and the resolvents involve the solution of a linear system for which a matrix factorization can be precomputed. However, there are tricks to overcome this issue, see [7, pages 28-29], but we will not go in more detail here.

For the Douglas-Rachford method we just provide illustrative examples since we are not aware of any adaptive rule that applies to the Douglas-Rachford method in the general case of monotone operators. For the ADMM there are several other adaptive rules available and we do a comparison in Section 6.2.

6.1 Experiments for non-stationary Douglas-Rachford

We provide four numerical examples to illustrate the new adaptive DR scheme (11) on some well-studied problems in the literature. The stepsizes (20) (in the case of single valued B) and (21) (in the case of multivalued B) come with new parameters: the safeguards and and a “conservation” term . Since B is single valued is all experiments we always used (20) and we also fixed = 10= 10and = 2for all experiments.

6.1.1 The linear toy example

We start with the linear toy example from Section 1.2. The residual sequence along the iterations is shown on the left of Figure 3. Additionally, we deter-

Figure 3: Results for the linear problem from Section 1.2 using fixed and the adaptive stepsizes. Left: Residual sequences. Right: Auxiliary sequence = and the stepsize .

mined the stepsize that leads to the smallest asymptotic convergence rate, i.e. to the smallest spectral radius of the iteration map (in this case = 1.367) and also plot the corresponding residual sequence with this optimal constant stepsize in the same figure. The adaptive stepsize does indeed improve the convergence considerably both by using small steps in the beginning and automatically tuning to a stepsize t that is close to the optimal one (cf. Figure 1, right). It also outperforms the optimal constant stepsize .

6.1.2 LASSO problems

The LASSO problem is the minimization problem

and is also known as basis pursuit denoising [40]. We will treat this with the Douglas-Rachford method as follows: We set F = f + g with

In this particular example we take with orthonormal rows, and hence, by the matrix inversion lemma, we get

The resolvent of A is the so-called soft-thresholding operator:

Note that B is single-valued and A is a subgradient and hence, the adaptive step-size computed by (20) does apply. Figure 4 shows the result of the DouglasRachford iteration with constant and adaptive stepsizes, and also a comparison with the FISTA [4] method. (Note that if K would not have orthonormal rows, one would have to solve a linear system at each Douglas-Rachford step which would make the comparison with FISTA by iteration count unfair.) As shown in this plot, the adaptive stepsize again automatically tunes to a stepsize close to 10 which, experimentally, seems to be the best constant stepsize for this particular instance.

6.1.3 Convex-concave saddle-point problems

Let X and Y be two finite dimensional Hilbert spaces, be a bounded linear operator and and be two proper, convex and lower-semicontinuous functionals. The saddle point problem then

Saddle points () are characterized by the inclusion

Figure 4: The convergence behavior of the Douglas-Rachford iteration and the FISTA method on a LASSO problem using fixed and the adaptive stepsizes.

To apply the Douglas-Rachford method we split the optimality system as follows. We denote z = (x, y) and set

(cf. [32, 9]). The operator A is maximally monotone as a subgradient and B is linear and skew-symmetric, hence maximally monotone and even continuous. One standard problem in this class in the so-called Rudin-Osher-Fatemi model for image denoising [36], also known as total variation denoising. For a given noisy image one seeks a denoising image u as the minimizer of

where denotes the discrete gradient of u and denotes the components-wise magnitude of this gradient. The penalty term is the discretized total variation, and 0 is a regularization parameter. The saddle point form of this minimization problem is

We test our DR scheme (11) using the adaptive stepsize and compare with two constant stepsizes t = 1 and t = 13. The constant stepsize t = 13 seems to be the best among many trial stepsizes after tuning. The convergence behavior of these cases is plotted in Figure 5 for one particular image called auge of the size 256 256. As we can see from this figure that the adaptive stepsize has a good performance and is comparable with the best constant stepsize in this example (t = 13).

Figure 5: The decrease of the objective values of three DR variants in the total variation denoising problem.

6.2 Experiments for ADMM with an adaptive stepsize

In this subsection we verify the performance of the our adaptive ADMM variant (25). We follow the comparison from [42] where several adaptive variants of ADMM are compared. However, we only compare the methods of ADMM that do not involve relaxation, since we did not consider relaxation in this paper.

In our comparison we compare the ADMM with constant stepsize which is fixed ad-hoc, the adaptive rule of He [25] which is based on residual balancing (RB), the adaptive ADMM (AADMM) from [41] and our approach from Theorem 5.1. We used five different test problems from the comparison in [42]: Elastic net regression, LASSO regression, quadratic programming, consensus -regularized logistic regression, and SVM for classification (see [42, Section 6] for details). We also use the code released online from [41].

Table 1 summarizes the results for average number of iterations for 50 runs on random instances of the same size. Note that both the RB ADMM and the AADMM do guarantee convergence only if the the adaptivity is switched of at a certain point while our rule comes with a convergence guarantee.Table 1 shows that our adaptive method consistently performs good.

Figure 6 shows example runs for four of the five problem (the fifth being the SVM classification and is omitted due to space reasons). One observes that residual balancing often fails to make progress towards a favorable stepsize and that AADMM sometimes shows large oscillations in the stepsizes. Our method leads to a stepsize sequence that stabilizes quickly and leads to good reduction of the residual.

Figure 6: Example runs of the different adaptive methods. Upper plots show the relative residual (cf. [7, Section 3.3.1] or [41, Section 4.3]), lower plots show the stepsizes, respectively.

Table 1: Results for the comparison of different apadtive stepsizes for ADMM. We compare the number of iterations needed for the methods to reach a given tolerance as in [41]. We report mean(standard deviation) for 50 runs on random instances.

7 Conclusion

We have attempted to address one fundamental practical issue in the well-known DR method: step-size selection. This issue has been standing for a long time and has not adequately been well-understood. In this paper, we have proposed an adaptive step-size that is derived from an observation of the linear case. Our non-stationary DR method is new; it is derived from the iteration for single-valued B and differs from the standard non-stationary iteration considered previously, e.g. in [27]. Our stepsize remains heuristic in the general case, but we can guarantee a global convergence of the DR method. As a byproduct, we have also derived a new ADMM variant that uses a simple adaptive stepsize and has a convergence guarantee. This is practically significant since ADMM has been widely used in many areas in the last two decades. Our finding also opens some future research ideas: Although we gained some insight, the linear case is still not properly understood. Since our heuristic applies to general A and B, there is the possibility to investigate, which operators should be used as “B” to compute the adaptive stepsize. As shown in [32], one can rescale convex-concave saddle point problems to use two different stepsizes for the Douglas-Rachford method, and one may extend our heuristic to this case. Moreover, the convergence speed of the non-stationary method under additional assumptions such as Lipschitz continuity or coercivity could be analyzed. Finally, an adaptive rule for the relaxed DR method would be of interest.

Acknowledgement

We thank Zheng Xu and Tom Goldstein for sharing the ADMM code.

A The proof of Theorem 5.1

Let us assume that we apply (11) to solve the optimality condition (24) of the dual problem (23). From (11), i.e.,

we define := and := ((1 + ) to obtain

Shifting up this scheme by one index and changing the order, we obtain

Let (1 + = + . This gives = ) and hence, + ) = and = ) = + ). Substituting these into the above expression of the DR

Let ), which implies ). Hence, we have +) = 0, therefore = (+). This condition leads to

This is the optimality condition of

Similarly, from = ), if we define ), then we can also derive that

From the line ) = 0 above, we can write = . Substituting this expression into the above step, we obtain

This is the second line of (25).

Next, from + + = 0, we have = . This implies = ). From the last line of (27), we have = + ). Combine these two lines, we get = due to the update rule (9): . Substituting = into the u-subproblem, we obtain

This is the first line of (25).

Now, since = , and = , combining these expressions, we obtain = +). This is the last line of (25).

Finally, we derive the update rule for . Indeed, note that = , and + ) = 0. These relations show that = ). Moreover, we also have = ) = ). In this case, we have = + ) = + ) + ) = . Hence, we can compute as

Using the fact that := , we show that := , which is the last line of (25) after projecting and weighting as in Section 4. Since is equivalent to the sequence in the DR scheme (2) (or equivalently, (11)) applying to the dual optimality condition (24) of the dual problem (23), the last conclusion is a direct consequence of Theorem 3.1.

References

[1] Heinz H. Bauschke. A note on the paper by Eckstein and Svaiter on “Gen- eral projective splitting methods for sums of maximal monotone operators”. SIAM Journal on Control and Optimization, 48(4):2513–2515, 2009.

[2] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, second edition edition, 2017.

[3] Heinz H. Bauschke, Sarah M. Moffat, and Xianfu Wang. Firmly nonexpan- sive mappings and maximally monotone operators: Correspondence and duality. Set-Valued and Variational Analysis, 20(1):131–153, 2012.

[4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding al- gorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[5] S. Becker, E. J. Cand`es, and M. Grant. Templates for convex cone prob- lems with applications to sparse signal recovery. Math. Program. Compt., 3(3):165–218, 2011.

[6] Stephen Becker and Patrick L. Combettes. An algorithm for splitting par- allel sums of linearly composed monotone operators, with applications to signal recovery, 2013. Arxiv preprint:1305.5828.

[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eck- stein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Learning, 3(1):1–122, 2011.

[8] Kristian Bredies and Hong Peng Sun. Preconditioned Douglas–Rachford al- gorithms for TV-and TGV-regularized variational imaging problems. Journal of Mathematical Imaging and Vision, 52(3):317–344, 2015.

[9] Kristian Bredies and Hongpeng Sun. Preconditioned Douglas-Rachford splitting methods for convex-concave saddle-point problems. SIAM Journal on Numerical Analysis, 53(1):421–444, 2015.

[10] Kristian Bredies and Hongpeng Sun. Accelerated Douglas-Rachford meth- ods for the solution of convex-concave saddle-point problems. arXiv preprint arXiv:1604.06282, 2016.

[11] Patrick L. Combettes. Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization, 53(5-6):475–504, 2004.

[12] Patrick L. Combettes and Jean-Christophe Pesquet. A Douglas–Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4):564–574, 2007.

[13] D. Davis. Convergence rate analysis of the forward-Douglas-Rachford split- ting scheme. SIAM Journal on Optimization, 25(3):1760–1786, 2015.

[14] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. Springer International Publishing, 2016.

[15] Jim Jr. Douglas and Henry H. Jr. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American mathematical Society, 82(2):421–439, 1956.

[16] Jonathan Eckstein and Dimitri P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992.

[17] Euhanna Ghadimi, Andr´e Teixeira, Iman Shames, and Mikael Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): quadratic problems. IEEE Transactions on Automatic Control, 60(3):644–658, 2015.

[18] Pontus Giselsson. Tight global linear convergence rate bounds for Douglas– Rachford splitting. Journal of Fixed Point Theory and Applications, 19(4):2241–2270, Dec 2017.

[19] Pontus Giselsson and Stephen Boyd. Diagonal scaling in Douglas-Rachford splitting and ADMM. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 5033–5039. IEEE, 2014.

[20] Pontus Giselsson and Stephen Boyd. Linear convergence and metric se- lection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, 2017.

[21] Roland Glowinski. On alternating direction methods of multipliers: a his- torical perspective. In Modeling, simulation and optimization for science and technology, pages 59–82. Springer, 2014.

[22] Gene H. Golub and Charles F. Van Loan. Matrix computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.

[23] Bingsheng He and Xiaoming Yuan. Convergence analysis of primal-dual al- gorithms for a saddle-point problem: From contraction perspective. SIAM Journal on Imaging Sciences, 5(1):119–149, 2012.

[24] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

[25] BS He, Hai Yang, and SL Wang. Alternating direction method with self- adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000.

[26] Xinxin Li and Xiaoming Yuan. A proximal strictly contractive Peaceman– Rachford splitting method for convex programming with applications to imaging. SIAM Journal on Imaging Sciences, 8(2):1332–1365, 2015.

[27] Jingwei Liang, Jalal Fadili, and Gabriel Peyr´e. Local convergence properties of Douglas-Rachford and alternating direction method of multipliers. J. Optim. Theory Appl., 172(3):874–913, 2017.

[28] Zhouchen Lin, Risheng Liu, and Zhixun Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In Advances in neural information processing systems, pages 612–620, 2011.

[29] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.

[30] Walaa M. Moursi and Lieven Vandenberghe. Douglas-Rachford splitting for a Lipschitz continuous and a strongly monotone operator, 2018. arXiv preprint arXiv:1805.09396.

[31] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. Jordan. A general analysis of the convergence of ADMM. arXiv preprint arXiv:1502.02009, 2015.

[32] Daniel O’Connor and Lieven Vandenberghe. Primal-dual decomposition by operator splitting and applications to image deblurring. SIAM Journal on Imaging Sciences, 7(3):1724–1754, 2014.

[33] Panagiotis Patrinos, Lorenzo Stella, and Alberto Bemporad. DouglasRachford splitting: Complexity estimates and accelerated variants. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 4234–4239. IEEE, 2014.

[34] Thomas Pock and Antonin Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1762–1769. IEEE, 2011.

[35] R. Tyrrell Rockafellar. Monotone operators and the proximal point algo- rithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

[36] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total vari- ation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

[37] Changkyu Song, Sejong Yoon, and Vladimir Pavlovic. Fast ADMM algo- rithm for distributed optimization with adaptive penalty. In AAAI, pages 753–759, 2016.

[38] Benar F Svaiter. A simplified proof of weak convergence in DouglasRachford method to a solution of the unnderlying inclusion problem. arXiv preprint arXiv:1809.00967, 2018.

[39] Benar Fux Svaiter. On weak convergence of the Douglas–Rachford method. SIAM Journal on Control and Optimization, 49(1):280–287, 2011.

[40] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[41] Zheng Xu, M´ario AT Figueiredo, and Tom Goldstein. Adaptive ADMM with spectral penalty parameter selection. arXiv preprint arXiv:1605.07246, 2016.

[42] Zheng Xu, M´ario AT Figueiredo, Xiaoming Yuan, Christoph Studer, and Tom Goldstein. Adaptive relaxed ADMM: Convergence theory and practical implementation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7234–7243. IEEE, 2017.

Designed for Accessibility and to further Open Science