where y is the data and x is the unknown to be recovered. The solution of equation (1) balances the goals of fitting the data while also regularizing this fit according to the prior.
In more general settings, for example with multiple data terms from multi-modal data collection, a cost function can be decomposed as a sum of auxiliary (usually convex) functions:
with variable and
. In consensus optimization, the minimization of the original cost function is reformulated as the minimization of the sum of the auxiliary functions, each a function of a separate variable, with the constraint that the separate variables must share a common value:
(2) minimize
with variables . This reformulation allows for the application of the Alternating Direction Method of Multipliers (ADMM) or other efficient minimization methods and applies to the original problem in (1) as well as many other problems. An account of this approach with many variations and examples can be found in [4].
While regularized inversion and optimization problems more generally benefit from extensive theoretical results and powerful algorithms, they are also expressively limited. For example, many of the best denoising algorithms cannot be put into the form of a simple optimization [6, 10]. Likewise, the behavior of denoising neural networks cannot generally be captured via optimization. These successful approaches to inverse problems lie outside the realm of optimization problems and give rise to the motivating question for this paper:
Question: How can we generalize the consensus optimization framework in (2) to encompass models and operators that are not associated with an optimization problem, and how can we find solutions efficiently?
There is a vast and quickly-growing literature on methods and results for convex and consensus optimization. Seminal work in this area includes the work of Lions and Mercier [19], as well as the PhD thesis of Eckstein [11] and the work of Eckstein and Bertsekas [12]. We do not provide a complete survey of this literature since our focus is on a framework beyond optimization, but some starting points for this area are [2, 4, 5].
As for approaches to fuse a data fidelity model with a denoiser that is not based on an optimization problem, the first attempt to our knowledge is [28]. The goal of this approach, called the Plug-and-Play prior method, is to replace the prior model in the Bayesian formulation with a denoising operator. This is done by taking the ADMM algorithm, which is often used to find solutions for consensus optimization problems, and replacing one of the optimization steps (proximal maps) of this algorithm with the output of a denoiser. Recently, a number of authors have built on the Plug-and-Play method as a way to construct implicit prior models through the use of denoising operators [24, 27, 26, 29]. In [27], conditions are given on the denoising operator that will ensure it is a proximal mapping, so that the MAP estimate exists and the ADMM algorithm converges. However, these conditions impose relatively strong symmetry conditions on the denoising operator that may not occur in practice. For applications where fixed point convergence is sufficient, it is possible to relax the conditions on the denoising operator by iteratively controlling the step size in the proximal map for the forward model and the noise level for the denoiser [7].
The paper [23] provides a different approach to building on the idea of Plug-and-Play. That paper uses the classical forward model plus prior model in the framework of optimization, but constructs a prior term directly from the denoising engine; this is called Regularization by Denoising (RED). For a denoiser , the prior term is given by
. This approach is formulated as an optimization problem associated with any denoiser, but in the case that the denoiser itself is obtained from a prior, the RED prior is different from the denoiser prior. Other approaches that build on Plug-and-Play include [21], which uses primal-dual splitting in place of an ADMM approach, and [17], which uses FISTA in a Plug-and-Play framework to address a nonlinear inverse scattering problem.
In this paper, we introduce Consensus Equilibrium (CE) as an optimization-free generalization of regularized inversion and consensus optimization that can be used to fuse multiple sources of information implemented as maps such as denoisers, deblurring maps, data fidelity maps, proximal maps, etc. We show that CE generalizes consensus optimization problems in the sense that if the defining maps are all proximal maps associated with convex functions, then any CE solution is also a solution to the corresponding consensus optimization problem. However, the consensus equilibrium can still exist in the more general case when the defining maps are not proximal maps; in this case, there is no underlying optimization. In the case of a single data fidelity term and a single denoiser, the solution has the interpretation of achieving the best denoised inverse of the data. That is, the proximal map associated with the forward model pulls the current point towards a more accurate fit to data, while the denoising operator pulls the current point towards a “less noisy” image. We illustrate this in a toy example in two dimensions: the consensus equilibrium is given by a balance between two competing forces.
In addition to introducing the CE equations, we discuss ways to solve them and give several examples. We describe a version of the Douglas-Rachford (DR)/ADMM algorithm with a novel form of anisotropic preconditioning. We also apply Newton’s method, both in standard form and in a Jacobian-free form using Krylov subspaces.
In the experimental results section, we give several examples to illustrate the idea of consensus equilibrium and the convergence properties of these algorithms. We first demonstrate the proposed algorithms on some toy problems in order to illustrate properties of the method. We next use the consensus equilibrium framework to solve an image denosing problem using an array of convolutional neural network (CNN) denoisers, none of which is tuned to match the noise level in a noisy image. Our results demonstrate that that the consensus equilibrium result is better than any of the individually applied CNN denoisers.
2. Consensus Equilibrium: Optimization and Beyond. In this section we formulate the consensus equilibrium equations, show that they encompass a form of consensus optimization in the case of proximal maps, and describe the ways that CE goes beyond the optimization framework.
2.1. Consensus Equilibrium for Proximal Maps. We begin with a slight generalization of (2):
(3) minimize
with variables , and weights
, that sum to 1 (an arbitrary normalization, but one that supports the idea of weighted average that we use later). From the point of view of optimization, each weight
could be absorbed into
. However, in Consensus Equilibrium we move beyond this optimization framework to the case in which the
may be defined only implicitly or the case in which there is no optimization, but only mappings that play a role similar to the proximal maps that arise in the ADMM approach to solving (3). The formulation in (3) serves as motivation and the foundation on which we build.
To extend the optimization framework of (3) to consensus equilibrium, we start with N vectorvalued maps, . The Consensus Equilibrium for these maps is defined as any solution
that solves the equations
Here u is a vector in obtained by stacking the vectors
, and
is the weighted average
.In order to relate consensus equilibrium to consensus optimization, first consider the special case in which each
in (3) is a proper, closed, convex function and each
is a corresponding proximal map, i.e., a map of the form
Methods such as ADMM, Douglas-Rachford, and other variants of the proximal point algorithm apply these maps in sequence or in parallel with well-chosen arguments, together with some map to promote for all i, in order to solve (3); see e.g., [3, 4, 9, 11, 12, 19, 25]. In the setting of Bayesian regularized inversion, each
represents a data fidelity term or regularizing function. We allow for the possibility that
enforces some hard constraints by taking on the value
.
Our first theorem states that when the maps are all proximal maps as described above, then the solutions to the consensus equilibrium problem are exactly the solutions to the consensus optimization problem of equation (3). In this sense, Consensus Equilibrium encompasses the optimization framework of (3).
THEOREM 1. For i = 1, . . . , N, let be a proper, lower-semicontinuous, convex function on
, and let
with
. Define
, and assume f is finite on some open set in
. Let
be the proximal maps as in (6). Then the set of solutions to the consensus equilibrium equations of (4) and (5) is exactly the set of solutions to the minimization problem (3).
The proof is contained in the appendix.
2.2. Consensus Equilibrium Beyond Optimization. Theorem 1 tells us that consensus equilibrium extends consensus optimization, but as noted above, the novelty of consensus equilibrium is not as a recharacterization of (3) in the case of proximal maps but rather as a framework that applies even when some of the are not proximal mappings and there is no underlying optimization problem to be solved. The Plug-and-Play reconstruction method of [28], which yields high quality solutions for important applications in tomography [27] and denoising [24], is to our knowledge, the first method to use denoisers that do not arise from an optimization for regularized inversion. As we show below, the CE framework also encompasses the Plug-and-Play framework in that if Plug-and-Play converges, then the result is also a CE solution. However, Plug-and-Play grew out of ADMM, and the operators that yield convergence in ADMM are more limited than we would like. Hence, for both consensus optimization and Plug-and-Play priors, CE encompasses the original method but also allows for a wider variety of operators and solution algorithms.
An important point about moving beyond the optimization framework is that a given set of maps may lead to multiple possible CE solutions. This may also happen in the optimization framework when the
are not strictly convex since there may be multiple local minima. In the optimization case, the objective function can sometimes be used to select among local minima. The analogous approach for consensus equilibrium is to choose a solution that minimizes the size of
, e.g. the
or
norm of
. This corresponds in some sense to minimizing the tension among the competing forces balanced to find equilibrium.
3. Solving the Equilibrium Equations. In this section, we rewrite the CE equations as an unconstrained system of equations and then use this to express the solution in terms of a fixed point problem. We also discuss particular methods of solution, including novel preconditioning methods and methods to solve for a wide range of possible F. We first introduce some additional notation. For , with
and each
, define
by
where has the important interpretation of redistributing the weighted average of the vector components given by
across each of the output components.
Also, for , let
denote the vector obtained by stacking N copies of x. With this notation, the CE equations are given by
This notation allows us to reformulate the CE equations as the solution to a system of equations.
THEOREM 2. The point is a solution of the CE equations (4) and (5) if and only if the point
satisfies
and
We use this to reformulate consensus equilibrium as a fixed point problem.
COROLLARY 3. (Consensus equilibrium as fixed point.) The point is a solution of the CE equations (4) and (5) if and only if the point
satisfies
and
When F is a proximal map for a function f, then is known as the reflected resolvent of f. Discussion and results concerning this operator can be found in [2, 12, 15] among many other places. This fixed point formulation is closely related to the fixed point formulation for minimizing the sum of two functions using Douglas-Rachford splitting; this is seen clearly in section 4 of [13] among other places. The form given here is somewhat different in that the reflected resolvents are computed in parallel and then averaged, as opposed to the standard sequential form. Beyond that, the novelty here is in the equivalence of this formulation with the CE formulation.
Proof of Corollary 3. By Theorem 2, is a solution of (4) and (5) if and only if
satisfies
and (9). From (9) we have
. A calculation shows that
, so
by linearity of
. Hence applying
to both sides gives (10). Reversing these steps returns from (10) to (9).
3.1. Anisotropic Preconditioned Mann Iteration for Nonexpansive Maps. Define T = . When T is nonexpansive and has a fixed point, we can use Mann iteration to find a fixed point of T as required by (10). For a fixed parameter
, this takes the form
with iterates guaranteed to converge to a fixed point of T. In the context of minimization problems in which F and G are both proximal maps, and depending on the choice of , iterations of this form are essentially variants of the proximal point algorithm and give rise to the (generalized) DouglasRachford algorithm, the Peaceman-Rachford algorithm, and the ADMM algorithm, including over-relaxed and under-relaxed variants of ADMM. In the case of N = 2 and
, the form in (11) is equivalent up to a change of variables to the standard ADMM algorithm; other values of
give over-relaxed and under-relaxed variants. Early work in this direction appears in [11] and [12]. A concise discussion is found in [15], which together with [14] provides a preconditioned version of this algorithm in the case of N = 2. This preconditioning is obtained by replacing the original minimization of f(x) + g(x) by minimization of f(Dq) + g(Dq), which gives rise to iterations involving the conjugate proximal maps
, where
is the proximal map for f as in (6) using the norm
in place of the usual Euclidean metric. [15] includes some results about the rate of convergence as a function of D. In some cases, a larger value of
leads to faster convergence relative to
. There are also results on convergence in the case that fixed
is replaced by a sequence of
such that
[2]. Further discussion and early work on this approach is found in [11, 12]. With some abuse of nomenclature, we use ADMM below to refer to Mann iteration with
.
Here we describe an alternative preconditioning approach for the Mann iteration in which we use an invertible linear map H in place of the scalar in (11). In this approach, T can be any nonexpansive map and H can be any symmetric matrix with H and
both positive definite.
THEOREM 4. Let H be a positive definite, symmetric matrix and let T be nonexpansive on with at least one fixed point. Suppose that the largest eigenvalue of H is strictly less than 1. For any
in
, define
(12) vHT
for each . Then the sequence
converges to a fixed point of T.
The idea of the proof is similar to the proof of convergence for Mann iteration given in [25], but using a norm that weights differently the orthogonal components arising from the spectral decomposition of H. The proof is contained in the appendix.
We note that in the case that each is a proper, closed, convex function on
and
is the proximal map as in (6), then the map
is nonexpansive, so this preconditioning method can be used to find a solution to the problem in (3). The asymptotic rate of convergence with this method is not significantly different from that obtained with the isotropic scaling obtained with a scalar
. However, we have found this approach to be useful for accelerating convergence in certain tomography problems in which various frequency components converge at different rates, leading sometimes to visible oscillations in the reconstructions as a function of iteration number. An appropriate choice of the preconditioner H can dampen these oscillations and provide faster convergence in the initial few iterations. We will explore this example and related algorithmic considerations further in a future paper.
3.2. Beyond nonexpansive maps. The iterative algorithms obtained from (11) and (12) give guaranteed global convergence when T is nonexpansive and (or H) satisfy appropriate conditions. However, the iterates of (11) may still be convergent for more general maps T. We illustrate this behavior in Case 1 of Section 4.2.
In fact, when T is differentiable at a fixed point, the rate of convergence is closely related to the spectrum of the linearization of T near this fixed point. The parameter in (11) maintains a fixed point at
but changes the linear part of the iterated map to have eigenvalues
, where
are the eigenvalues of the linear part of T. The iterates of (11) converge locally exactly when all of these
are strictly inside the unit disk in the complex plane. This can be achieved for sufficiently small
precisely when the real part of each
is less than 1. Since there is no constraint on the complex part of the eigenvalues, the map T may be quite expansive in some directions. In this case, the optimal rate of convergence is obtained when
is chosen so that the eigenvalues
all lie within a minimum radius disk about the origin.
The use of to affect convergence rate and/or to promote convergence is closely related to the ideas of overrelaxation and underrelaxation as applied in a variety of contexts. See e.g. [16] for further discussion in the context of linear systems. In the current setting, the use of
is a form of underrelaxation that is related to methods for iteratively solving ill-posed linear systems. In the following theorem, the main idea is to make use of underrelaxation in order to shrink the eigenvalues of the resulting operator to the unit disk and thus guarantee convergence.
THEOREM 5. (Local convergence of Mann iterates) Let and
be maps such that
has a fixed point
. Suppose that T is differentiable at
and that the Jacobian of T at
has eigenvalues
with the real part of
strictly less than 1 for all j. Then there is
and an open set U containing
such that for any initial point
in U, the iterates defined by (11) converge to
.
The proof of this theorem is given in Appendix A.
3.3. Newton’s Method. By formulating the CE as a solution to , we can apply a variety of root-finding methods to find solutions. Likewise, rewriting (10) as
gives the same set of options.
Let H be a smooth map from to
. The basic form of Newton’s method for solving H(x) = 0 is to start with a vector
and look for a vector dx to solve
. A first-order approximation gives
, where
is the Jacobian of H at
. If this Jacobian is invertible, this equation can be solved for dx to give
and the method iterated. There are a wide variety of results concerning the convergence of this method with and without preconditioning, with various inexact steps, etc. An overview and further references are available in [20].
For large scale problems, calculating the Jacobian can be prohibitively expensive. The JacobianFree Newton-Krylov (JFNK) method is one approach to avoid the need for a full calculation of the Jacobian. Let . The key idea in Newton-Krylov methods is that instead of trying to solve
exactly, we instead minimize
over the vectors dx in a Krylov subspace,
. This subspace is defined by first calculating the residual
and then taking
The basis in this form is typically highly ill-conditioned, so the Generalized Minimal RESidual method (GMRES) is often used to produce an orthonormal basis and solve the minimization problem over this subspace. This form requires only multiplication of a vector by the Jacobian, which can be approximated as
Applying this to produce requires j applications of the map H together with the creation of the Arnoldi basis elements, which can then be used to find the minimizing dx by reducing to a standard least squares problem of dimension j. Various stopping conditions can be used to determine an appropriate j. These calculations take the place of the solution of
. In cases for which
there are many contracting directions and only a few expanding directions for Newton’s method near the solution point, the JFNK method can be quite efficient. A more complete description with a discussion of the benefits and pitfalls, approaches to preconditioning, and many further references is contained in [18].
We note in connection with the previous section that if H is chosen to be in (12), then the choice of
in Theorem 4 is an exact Newton step applied to
. That is, the formula for the step in Newton’s method in this case is
which is the same as the formula in Theorem 4.
In the examples below, we use standard Newton’s method applied to both and
in the first example and JFNK applied to
in the second. Because of the connection with Mann iteration just given, we use the term Newton Mann to describe Newton’s method applied to
.
3.4. Other approaches. An alternative approach is to convert the CE equations back into an optimization framework by considering the residual error norm given by
and minimizing over v. Assuming that a solution of the CE equations exists, then that solution is also a minimum of this objective function. In the case that F is twice continuously differentiable, a calculation using the facts that
and that
is linear shows that the Hessian of
is
, where
. Hence
is locally convex near
as long as A has no eigenvalue equal to 0. Since
is a projection, its only eigenvalues are 0 and 1, hence this is equivalent to saying that
does not have 1 as an eigenvalue. If A does have an eigenvalue 0, then a perturbation of the form
produces a unique solution, which can be followed in a homotopy scheme as
decreases to 0.
One possible algorithm for this approach is the Gauss-Newton method, which can be used to minimize a sum of squared function values and which does not require second derivatives.
We note that the residual error of equation (13) is also useful as a general measure of convergence when computing the CE solution; we use this in plots below.
Other candidate solution algorithms include the forward-backward algorithm and related algorithms as presented in [9]. We leave further investigation of efficient algorithms for solving the CE equations to future research.
4. Experimental Results. Here we provide some computational examples of varying complexity. For each of these examples, at least one of the component maps is not a proximal mapping, so the traditional optimization formulation of equations (1) or (3) is not applicable.
We start with a toy model in 2 dimensions to illustrate the ideas, then follow with some more complex examples.
4.1. Toy model. In this example we have ,
, both in
, and maps
and
defined by
In this case, is a proximal map as in (6) corresponding to
and
is a weakly expanding map designed to illustrate the properties of consensus equilibrium. We use
FIG. 1. Left: Map fields and trajectories for 2-dimensional toy example using Newton’s method applied to solve . Blue segments show the map
, red segments show
, with black dots showing the common endpoints of these maps. Blue and red open squares show the points
, respectively. Filled red and blue circles show the CE solution. Middle: Zoom in near the fixed point of the plot on the left. Right: Error
as a function of iteration for Newton’s method applied to
, Newton’s method applied to
(labeled as Newton Mann), and standard Mann iteration with
(labeled as ADMM).
and
We take and so write G for
. We apply Newton iterations to
and to the fixed point formulation
. In both cases, the Jacobian of
is evaluated only at the initial point.
Figure 1 shows the vectors obtained from each of the maps and
. Blue line segments are vectors from a point
to
, and red line segments are vectors from a point
to
. The starting points of each pair of red and blue vectors are chosen so that they have a common ending point, signified by black dot. Open squares show the trajectories of
in blue and
in red. The trajectories converge to points for which the corresponding red and blue vectors have a common end point and are equal in magnitude and opposite in direction; this is the consensus equilibrium solution. The plots shown are for Newton’s method applied to
; the plots for Newton’s method applied to
are similar (not shown). In the right panel of this figure, we use the true fixed point to plot error versus iterate for this example using all 3 methods. The expansion in
prevents ADMM from converging in this example.
4.2. Stochastic matrix. The next example uses the proximal map form for as in the pre- vious example, although now with dimension n = 100. A and y were chosen using the random number generator rand in Matlab, approximating the uniform distribution on [0, 1] in each component. The map
has the form
; here W is constructed by first choosing entries at random in the interval [0, 1] as for A, then replacing the diagonal entry by the maximum entry in that row (in which case the maximum entry may appear twice in one row), and then normalizing so that each row sums to 1. This mimics some of the features in a weight matrix appearing in denoisers such as non-local means [6] but is designed to allow us to compute an exact analytic solution of the CE equations. In particular, since W is not symmetric,
cannot be a proximal map, as shown in [27].
In order to illustrate possible convergence behaviors, we first fix the matrices A and W and the vector y as above and then use a one-parameter family of maps . When
, this map averages W and I/2. The map I/2 is a proximal map as in (6) with
and
; i.e., the proximal map associated with a quadratic regularization term. In the framework of Corollary 3, the map
satisfies
. Hence the scaling of r controls the expansiveness of one of the component maps in
, and hence the expansiveness
FIG. 2. Residual norm and root mean square error
versus iteration. Left 2 panels: (Case 1) T has Lipschitz constant larger than 1 but all eigenvalues have real part strictly less than 1. Both methods converge. Right 2 panels: (Case 2) T has an eigenvalue with real part larger than 1. JFNK converges, while ADMM diverges.
of the operator in (10) through the averaging operator . For the examples here, we choose r to be 1.02 and 1.06. As described below, with appropriate choices of parameters, the Jacobian-free Newton Krylov method converges for both examples, while ADMM converges for the first one only.
Recall that if the Lipschitz constant, L(T), is strictly less than 1, then the operator T is a contraction, and if we say it is nonexpansive. Moreover, for linear operators, L(T) =
where
is the maximum singular value of T; and
where
is the eigenvalue with greatest magnitude.
Case 1, r = 1.02: In this case, T has Lipschitz constant L(T) > 1, and the conditions of Theorem 4 or similar theorems on the convergence of Mann iteration for the convergence of nonexpansive maps do not hold. However, in this case, T is affine (linear map plus constant) and all eigenvalues of the linear part of (T + I)/2 lie strictly inside the unit circle. From (11) with and basic linear algebra, this means that Mann iteration converges. This is confirmed in Figure 2. In this example, convergence for Mann iteration can be improved by taking
to be 0.8, in which case the convergence is marginally better than that for JFNK. For this example, we used a Krylov subspace of dimension 10, so that each Newton step requires 10 function evaluations. This is indicated by closed circles in the plots.
Case 2, r = 1.06: In this case, T has Lipschitz constant L(T) > 1, and there is an eigenvalue with real part approximately 1.0039, so averaging T with the identity as in Mann iteration will maintain an eigenvalue larger than one. In particular, Mann iteration with (labelled as ADMM) does not converge, but the JFNK algorithm does. For this example, we used a Krylov subspace of dimension 75, so that each Newton step requires 75 function evaluations. This is indicated by closed circles in the plots.
4.3. Image Denoising with Multiple Neural Networks. The third example we give is an image denoising problem using multiple deep neural networks. This problem is more complex in that we use several neural networks, none of which is tuned to match the noise in the image to be denoised. Nevertheless, we show that CE is often able to outperform each individual network. The images and code for this section are available at [1].
The forward model of image denoising is described by the following linear equation:
where is latent unknown image,
is i.i.d. Gaussian noise, and
is the corrupted observation. Our motivation is to find an estimate
by solving the consensus
equilibrium equation analogous to the classical maximum-a-posteriori approach:
where p(x) is the prior of x. However, instead of a prior function, which would induce a proximal map, we will use a set of convolutional neural networks, which will play the role of regularization in the way that a prior term does, but which are almost certainly not themselves proximal maps for any function.
To define the CE operators , we consider a set of K image denoisers. Specifically, we use the denoising convolutional neural network (DnCNN) proposed by Zhang et al. [31]. In the code provided by the authors 1, there are five DnCNNs trained at five different noise levels:
,
and
. In other words, the user has to choose the appropriate DnCNN to match the actual noise level
. In the CE framework, we see that
is the operator
(15) DnCNN
with denoising strength
The CE operator
is the proximal map of the likelihood function:
where is an internal parameter controlling the strength of the regularization
. In this example, we set
for simplicity.
FIG. 3. Image denoising experiment for Man512 when . Notice that the consensus equilibrium (CE) result has the highest SNR when compared to individual convolutional neural network denoisers train on varying noise levels.
To make the algorithm more adaptive to the data, we use weights , where
In this pair of equations, measures the deviation between the actual noise level
and the denoising strength of the neural networks
. The parameter h = 5/255 controls the cut off. Therefore, among the five neural networks,
weights more heavily the relevant networks. The
weight
is the weight of the map to fit to data. Its value is chosen to be the sum of the weights of the denoisers to provide appropriate balance between the likelihood and the denoisers.
FIG. 4. Image denoising experiment for Peppers256 when . Notice that the consensus equilibrium (CE) result has the highest SNR when compared to individual convolutional neural network denoisers train on varying noise levels.
Figures 3 and 4 show some results using noise levels of and 40/255, respectively. Notice that none of these noise levels is covered by the trained DnCNNs. Table 1 shows resulting SNR values for the full set of experiments using 8 test images and 3 noise levels of
. The results in the center of the table indicate the result of applying an individual CNN to the noisy image. Because of the form of
in (16) and
, the result of this single application of the CNN is the same as the CE solution obtained by using only that single CNN together with
.
Notice that in almost all cases the consensus equilibrium of the full group has the highest PSNR when compared to the individual application of the DnCNNs. Also, the improvement in terms of the PSNR is quite substantial for noise levels and
. For
, CE still offers PSNR improvement except for House256, which is an image with many smooth regions. In addition, visual inspection of the images shows that the CE result yields the best visual detail while also removing the noise most effectively. While DnCNN denoisers can be very effective, they must be trained in advance using the correct noise level. This demonstrates that the consensus equilibrium can be used to generate a better result by blending together multiple pre-trained DnCNNs.
In order to illustrate that the CE solution outperforms a well-chosen linear combination of the outputs from each denoiser, we report a baseline combination result in Table 1. The baseline results
TABLE 1 Image denoising results for actual noise level
are generated by
where are the initial estimates provided by the denoisers, and
is defined through 17 without
. That is, we use the same weights as those for CE, excluding the weight for the likelihood term and rescaled to sum to 1 after this exclusion. Therefore,
can be considered as a linear combination of the initial estimates, with weights defined by the distance between the current noise level and the trained noise levels. The results in Table 1 show that while
very occasionally outperforms the best of the individual denoisers, it is usually worse than the best individual denoiser and is uniformly worse than CE. In the last column of Table 1 we show the result of DnCNN trained at a noise level matched with the actual noise level. It is interesting to note that CE compares favorably with the matched DnCNN in many cases, except for large sigma where the matched DnCNN is uniformly better.
We note that [30] uses a linear transformation depending on the noise level of a noisy image in order to match the noise level of a trained neural network, and then applies the inverse linear transformation to the output. This provides another approach to the example above but doesn’t include
the ability of CE to combine multiple sources of influence without a predetermined conversion from one to the other. We should also point out the recent work of Choi et al. [8] which demonstrates an optimal mechanism of combining image denoisers.
5. Conclusion. We presented a new framework for image reconstruction, which we term Consensus Equilibrium (CE). The distinguishing feature of the CE solution is that it is defined by a balance among operators rather than the minimum of a cost function. The CE solution is given by the consensus vector that arises simultaneously from the balance of multiple operators, which may include various kinds of image processing operations. In the case of conventional regularized inversion, for which the optimization framework holds, the CE solution agrees with the usual MAP estimate, but CE also applies to a wide array of problems for which there is no corresponding optimization formulation.
We discussed several algorithms for solving the CE equations, including a novel anisotropic preconditioned Mann iteration and a Jacobian-free Newton Krylov method. We also introduced a novel precondition method for accelerating the Mann iterations used to solve the CE equations. There is a great deal of room to explore other methods for finding CE solutions as well as for formulating other equilibrium conditions.
Our experimental results, on a variety of problems with varying complexity, demonstrate that the Consensus Equilibrium approach can solve problems for which there is no corresponding regularized optimization and can in some cases achieve consensus results that are better than any of the individual operators. In particular, we showed how the Consensus Equilibrium can be used to integrate a number of CNN denoisers, thereby achieving a better result than any individual denoiser.
Acknowledgments. We thank the referees for many helpful remarks that have improved this paper substantially.
Proof of Theorem 1. In order to use as in (6), we multiply the objective function in (3) by
, which does not change the solution. Define the Lagrangian associated with this scaled problem as
where the are the Lagrange multipliers for the equality constraints
. Since the
are all convex and lower-semicontinuous, the first order KKT conditions are necessary and sufficient for optimality [22, Theorem 28.3]. At a solution point
, these conditions are given by
where is the subdifferential with respect to
. These convert to
Define , in which case (18) is the same as (5). Next, use
from (20) in (19) and cancel
to get
for all i. Adding
to both sides gives
or
Since the are convex and
, we can invert to get
. From [2, Proposition 16.34], this is equivalent to (4) in the case that
is the proximal map of (6).
Proof of Theorem 4. Since H is symmetric and positive definite, there is an orthogonal matrix Q and a diagonal matrix with
for all j and
. Let
be the jth column of Q, and let
be orthogonal projection onto the span of
. Define the associated norm
. Also, let
be the product
, and let
(i.e., the product of all
through
except
). Define the weighted norm
Here and below, we use freely as needed. As in [25], we use the equality
, which holds for
between 0 and 1 and can be verified by expanding both sides as a function of
. In our case, we have
from the assumptions on H. After conversion back to the norm
, this yields
Since T is nonexpansive, the right hand side is bounded above by replacing with
in the second sum. This new sum then exactly cancels the term arising from
in the first sum. Let c be the minimum over j of
, and note that c > 0 since
for each j by assumption. Putting these together and re-expressing in the
norm gives
The remainder of the proof is nearly identical to that in [25]; we include it for completeness. Iterating in the first term on the right hand side, we obtain
(21) ∥−
≤ ∥
In particular, and hence
tend to 0 as j tends to
. This also implies that
Finally, note that (21) implies that the sequence is bounded, hence has a limit point, say
. Since
is continuous and
converges to 0, we have
. Using
in place of
in (21), we see that
decreases monotonically to 0, hence
converges to
.
Proof of Theorem 5. Let denote the map
. Let
and note that
has eigenvalues
. Since the real part of
is less than 1, the line segment defined by
for
in the interval [0, 1] has a nonempty intersection with the open unit disk in the complex plane. For each j, there is some
so that this intersection contains the set of points
for
in
. Taking
to be the minimum of the
and taking
in the interval
, there exists r < 1 for which
for all j.
For this choice of , let A be the Jacobian of
at the fixed point,
, which we may assume is the origin. The Schur triangulation gives a unitary matrix Q and an upper triangular matrix U with
. Write
with
diagonal and
zero on the diagonal. Let
be the maximum of
over all entries in
. For
, define D to be the diagonal
matrix with
. A computation shows that
has the same diagonal entries as U but that each off-diagonal has the form
with j > i, hence is bounded by
in norm. This plus the differentiability of
implies that for x in a neighborhood of 0,
where decreases to 0 as
tends to 0. Choosing
and
sufficiently small, we have
for some
. In this case we can iterate to obtain
In other words, for in a neighborhood N of the origin, the iterates
converge geometrically to the origin. Multiplying by QD and labeling
, we have
converges geometrically to 0 for all
in the neighborhood QDN of the origin.
[1] Matlab implementation for consensus equilibrium. https://engineering.purdue.edu/ChanGroup/code.html. Accessed: 2018-04-06.
[2] H. H. BAUSCHKE AND P. L. COMBETTES, Convex analysis and monotone operator theory in Hilbert spaces, CMS Books in mathematics, Springer, New York, Dordrecht, Heidelberg, 2011.
[3] D. P. BERTSEKAS, Constrained Optimization and Lagrange Multiplier Methods (Optimization and Neural Computation Series), Athena Scientific, 1996.
[4] S. BOYD, N. PARIKH, E. CHU, B. PELEATO, AND J. ECKSTEIN, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3 (2011), pp. 1–122.
[5] S. BOYD AND L. VANDENBERGHE, Convex Optimization, Cambridge University Press, Cambridge UK, 2009.
[6] A. BUADES, B. COLL, AND J.-M. MOREL, A review of image denoising algorithms, with a new one, Multiscale Modeling & Simulation, 4 (2005), pp. 490–530.
[7] S. H. CHAN, X. WANG, AND O. A. ELGENDY, Plug-and-Play ADMM for image restoration: Fixedpoint convergence and applications, IEEE Trans. Computational Imaging, 3 (2017), pp. 84–98, doi:10.1109/TCI.2016.2629286.
[8] J. H. CHOI, O. A. ELGENDY, AND S. H. CHAN, Optimal combination of image denoisers. Available online: https://arxiv.org/abs/1711.06712, 2018.
[9] P. L. COMBETTES AND J.-C. PESQUET, Proximal splitting methods in signal processing, in Fixed-point algorithms for inverse problems in science and engineering, Springer New York, 2011, pp. 185–212.
[10] K. DABOV, A. FOI, V. KATKOVNIK, AND K. EGIAZARIAN, Image denoising by sparse 3-D transformdomain collaborative filtering, IEEE Transactions on Image Processing, 16 (2007), pp. 2080–2095, doi:10.1109/TIP.2007.901238.
[11] J. ECKSTEIN, Splitting methods for monotone operators with applications to parallel optimization, PhD thesis, Massachusetts Institute of Technology, 1989.
[12] J. ECKSTEIN AND D. P. BERTSEKAS, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming, 55 (1992), pp. 293–318.
[13] P. GISELSSON, Tight global linear convergence rate bounds for douglas–rachford splitting, Journal of Fixed Point Theory and Applications, 19 (2017), pp. 2241–2270, doi:10.1007/s11784-017-0417-1.
[14] P. GISELSSON AND S. BOYD, Preconditioning in fast dual gradient methods, in 53rd IEEE Conference on Decision and Control, Dec 2014, pp. 5040–5045, doi:10.1109/CDC.2014.7040176.
[15] P. GISELSSON AND S. BOYD, Metric selection in fast dual forward-backward splitting, Automatica, 62 (2015), pp. 1 – 10, doi:https://doi.org/10.1016/j.automatica.2015.09.010.
[16] W. HACKBUSCH, Iterative solution of large sparse systems of equations, vol. 95 of Applied Mathematical Sciences, Springer-Verlag, New York, 1994. Translated and revised from the 1991 German original.
[17] U. S. KAMILOV, H. MANSOUR, AND B. WOHLBERG, A plug-and-play priors approach for solving nonlinear imaging inverse problems, IEEE Signal Processing Letters, 24 (2017), pp. 1872–1876, doi:10.1109/LSP.2017.2763583.
[18] D. KNOLL AND D. KEYES, Jacobian-free Newton-Krylov methods: a survey of approaches and applications, Journal of Computational Physics, 193 (2004), pp. 357 – 397, doi:https://doi.org/10.1016/j.jcp.2003.08.010.
[19] P. L. LIONS AND B. MERCIER, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, 16 (1979), pp. 964–979, doi:10.1137/0716071, https://doi.org/10.1137/0716071.
[20] J. NOCEDAL AND S. J. WRIGHT, Numerical Optimization, Springer, New York, 2nd ed., 2006.
[21] S. ONO, Primal-dual plug-and-play image restoration, IEEE Signal Processing Letters, 24 (2017), pp. 1108–1112, doi:10.1109/LSP.2017.2710233.
[22] R. ROCKAFELLAR, Convex Analysis, Princeton, N.J. : Princeton University Press, 1997, c1970.
[23] Y. ROMANO, M. ELAD, AND P. MILANFAR, The little engine that could: Regularization by denoising (RED), SIAM J. Imaging Sciences, 10 (2017), pp. 1804–1844.
[24] A. ROND, R. GIRYES, AND M. ELAD, Poisson inverse problems by the plug-and-play scheme, Journal of Visual Communication and Image Representation, 41 (2016), pp. 96 – 108, doi:https://doi.org/10.1016/j.jvcir.2016.09.009.
[25] E. RYU AND S. BOYD, A primer on monotone operator methods survey, Applied and computational mathematics, 15 (2016), pp. 3–43.
[26] S. SREEHARI, S. V. VENKATAKRISHNAN, K. L. BOUMAN, J. P. SIMMONS, L. F. DRUMMY, AND C. A. BOUMAN, Multi-resolution data fusion for super-resolution electron microscopy, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1084–1092, doi:10.1109/CVPRW.2017.146.
[27] S. SREEHARI, S. V. VENKATAKRISHNAN, B. WOHLBERG, G. T. BUZZARD, L. F. DRUMMY, J. P. SIMMONS, AND C. A. BOUMAN, Plug-and-play priors for bright field electron tomography and sparse interpolation, IEEE Transactions on Computational Imaging, 2 (2016), pp. 408–423, doi:10.1109/TCI.2016.2599778.
[28] S. V. VENKATAKRISHNAN, C. A. BOUMAN, AND B. WOHLBERG, Plug-and-play priors for model based reconstruction, in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013, IEEE, 2013, pp. 945–948.
[29] S. V. VENKATAKRISHNAN, L. F. DRUMMY, M. DE GRAEF, J. P. SIMMONS, AND C. A. BOUMAN, Model based iterative reconstruction for bright field electron tomography, in IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2013, pp. 86570A–86570A.
[30] Y. Q. WANG AND J. M. MOREL, Can a single image denoising neural network handle all levels of gaussian noise?, IEEE Signal Processing Letters, 21 (2014), pp. 1150–1153, doi:10.1109/LSP.2014.2314613.
[31] K. ZHANG, W. ZUO, Y. CHEN, D. MENG, AND L. ZHANG, Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising, IEEE Transactions on Image Processing, 26 (2017), pp. 3142–3155, doi:10.1109/TIP.2017.2662206.