Learning Deep Analysis Dictionaries -- Part II: Convolutional Dictionaries

2020·arXiv

Abstract

I. INTRODUCTION

CONVOLUTIONAL dictionary learning has attracted increasing interest in the signal and image processing communities as it leads to a more elegant framework for high-dimensional signal analysis. An advantage of convolutional dictionaries [1]–[13] is that they can take the high-dimensional signal as input for sparse representation and processing, whereas traditional approaches [14]–[20] have to divide the high-dimensional signal into overlapping low-dimensional patches and perform sparse representation on each patch independently. A convolutional dictionary models the convolution between a set of filters and a signal. It is a structured dictionary and can be represented as a concatenation of Toeplitz matrices, where each Toeplitz matrix is constructed using the taps of a filter and the usual assumption is that the filters have compact support. A convolutional dictionary is therefore effective for processing high-dimensional signals while also restraining the number of free parameters. To achieve efficient convolutional dictionary learning, the convolutional dictionary is usually modelled as a concatenation of circulant matrices [1]–[6] by assuming a periodic boundary condition on the signals. As all circulant matrices share the same set of eigenvectors, which form the Discrete Fourier Transform (DFT) matrix, a circular convolution can therefore be represented as a multiplication in the Fourier domain and can be efficiently implemented using the Fast Fourier Transform (FFT). However, using a circulant matrix to approximate a general Toeplitz matrix may lead to boundary artifacts [3], [21], [22], especially when the boundary region is large.

A multi-layer convolutional dictionary model is able to represent multiple levels of abstraction of the input signal. Due to the associativity property of convolution, multiplying two convolutional dictionaries results in a convolutional dictionary whose corresponding filters are the convolution of the two sets of filters and have a support size that is larger than that of the original filters. In the Multi-Layer Convolutional Sparse Coding (ML-CSC) model in [11], [12], [23], there are multiple layers of convolutional synthesis dictionaries. By increasing the number of layers, the global dictionary, which is the product of the convolutional dictionaries, is able to represent more complex structures. The ML-CSC model [11] provides theoretical insights on the conditions for the success of the layered sparse pursuit and interprets the forward pass of Deep Neural Networks (DNNs) as a layered sparse pursuit.

Convolutional Neural Networks (CNNs) [24]–[26] have been widely used for processing images and achieve state-of-the-art performances in many applications. A CNN consists of a cascade of convolution operations and element-wise non-linear operations. The Rectified Linear Unit (ReLU) activation function is one of the most popular non-linear operators used in DNNs. With the multi-layer convolution structure, a deeper layer in a CNN receives information from a corresponding wider region of the input signal. The ReLU operator provides a non-linear transformation and leads to a sparse feature representation. The parameters of a CNN are usually optimized using the backpropagation algorithm [27] with stochastic gradient descent.

In this paper, we propose a Deep Convolutional Analysis Dictionary Model (DeepCAM) which consists of multiple layers of convolutional analysis dictionaries and soft-thresholding operations and a layer of convolutional synthesis dictionary. The motivation is to use DeepCAM as a tool to interpret the workings of CNNs from the sparse representation perspective. Single image super-resolution [16], [17], [28]–[37] is used as a sample application to validate the proposed model design. The input low-resolution image is used as the input to DeepCAM and is not partitioned into patches. At each layer of DeepCAM, the input signal is multiplied with a convolutional analysis dictionary and then passed through soft-thresholding operators. At the last layer, the convolutional synthesis dictionary is used to predict the high-resolution image.

The contribution of this paper is two-fold:

• We propose a Deep Convolutional Analysis Dictionary Model (DeepCAM) for single image super-resolution, where at each layer, the convolutional analysis dictionary and the soft-thresholding operation are designed to simultaneously achieve information preservation and discriminative representation.

• We propose a convolutional analysis dictionary learning method by explicitly modelling the convolutional dictionary with a Toeplitz structure. By exploiting the properties of Toeplitz matrices, the convolutional analysis dictionary can be efficiently learned from a set of training samples.

Simulation results on single image super-resolution are used to validate our proposed DeepCAM and convolutional dictionary learning method.

The rest of the paper is organized as follows. In Section II we show how we build a convolutional analysis dictionary from an unstructured dictionary. In Section III, we propose an efficient convolutional analysis dictionary learning algorithm by exploiting the properties of Toeplitz matrices. Section IV presents the proposed Deep Convolutional Analysis Dictionary Model (DeepCAM) and the complete learning algorithm. Section V presents simulation results on the single image super-resolution task and Section VI draws conclusions.

II. CONVOLUTIONAL ANALYSIS DICTIONARY

An analysis dictionary $\Omega \in \mathbb{R}^{m \times n}$ contains m row atoms and is usually assumed to be over-complete with m > n. Given a signal of interest $x \in \mathbb{R}^{n}$, the analysis dictionary $\Omega$ should be able to sparsify x while preserving its essential information. That is, the analysis coefficients $\Omega x$ are sparse but still contain sufficient information for further processing.

The focus of this paper is on learning convolutional analysis dictionaries which model the convolution between a signal and a set of filters. The filters’ taps depend on the rows of the analysis dictionary. In what follows, we first show how we build our convolutional analysis dictionary from an unstructured analysis dictionary. We then study strategies to learn a convolutional analysis dictionary from a set of training samples.

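The convolution can be represented as a multiplication with a convolutional analysis dictionary. Let us assume that the input signal is 1-dimensional for simplicity. The convolution between an atom $\omega^T \in \mathbb{R}^{n}$ and an input signal $x \in \mathbb{R}^{l}$ with l > n can be expressed as:

$$\omega^T * x = T(\omega, l)\, x \tag{1}$$

where $*$ denotes the convolution operator, and $T(\omega, l) \in \mathbb{R}^{p \times l}$ with p = l − n + 1 is a Toeplitz matrix with l columns which is constructed using $\omega$ as follows:

$$T(\omega, l) = \sum_{j=1}^{n} \omega(j)\, I_j \tag{2}$$

where $\omega(j)$ is the j-th coefficient of $\omega$, and $I_j \in \mathbb{R}^{p \times l}$ with $1 \le j \le n$ is an indicator matrix with 1s on the j-th upper diagonal and 0s on other locations.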
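As a minimal numerical sketch of the Toeplitz construction described above, the following NumPy snippet builds the matrix from shifted indicator matrices and compares the product with a sliding inner product computed without padding. The function name `toeplitz_from_filter` is hypothetical, diagonals are indexed from 0 in the code, and the filter is applied without flipping (the CNN-style convention), which is an assumption about the intended definition of convolution.

```python
import numpy as np

def toeplitz_from_filter(w, l):
    """Return the (l - n + 1) x l Toeplitz matrix whose rows are shifted copies of w."""
    w = np.asarray(w, dtype=float)
    n = w.size
    p = l - n + 1                          # number of valid filter positions
    T = np.zeros((p, l))
    for j in range(n):
        # Indicator matrix with ones on the j-th upper diagonal (j = 0 is the main diagonal).
        T += w[j] * np.eye(p, l, k=j)
    return T

rng = np.random.default_rng(0)
w = rng.standard_normal(5)                 # filter taps, n = 5
x = rng.standard_normal(12)                # input signal, l = 12

T = toeplitz_from_filter(w, x.size)
y_matrix = T @ x                               # matrix form
y_direct = np.correlate(x, w, mode="valid")    # sliding inner product, no padding
```

Both vectors have l − n + 1 = 8 entries and agree up to rounding, which is the sense in which the matrix represents the filtering operation restricted to the interior of the signal.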

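Given an unstructured analysis dictionary $\Omega \in \mathbb{R}^{m \times n}$ and an input signal $x \in \mathbb{R}^{l}$, the convolution between x and each row of $\Omega$ can be expressed as a matrix multiplication $\mathcal{H}(\Omega, l)\, x$, where $\Omega$ is converted to a convolutional analysis dictionary $\mathcal{H}(\Omega, l)$ which can be represented as a concatenation of m Toeplitz matrices along the column direction:

$$\mathcal{H}(\Omega, l) = \begin{bmatrix} T(\omega_1, l) \\ \vdots \\ T(\omega_m, l) \end{bmatrix} \tag{3}$$

where $T(\omega_i, l)$ denotes the Toeplitz matrix with l columns built from the i-th row atom $\omega_i^T$ of $\Omega$.

Fig. 1. A convolutional analysis dictionary concatenation of m = 4 Toeplitz matrices

Fig. 2. An example of convolution between a 2-dimensional convolutional filter with 2-dimensional data and the convolution can be represented by a doubly block Toeplitz matrix.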

Instead of assuming a circulant structure, we model the convolutional operation with a Toeplitz matrix. The represented convolution operation will be performed only within the input signal. That is, convolutional operation is performed without padding at the boundaries.

Fig. 1 shows an example of how we build a convolutional analysis dictionary from the unstructured analysis dictionary with m = 4, n = 5 and l = 12. Note that the analysis dictionary in Fig. 1(a) is not over-complete, while the convolutional analysis dictionary has more rows than columns.
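The sizes quoted for this example can be checked with a short sketch. The helper names are hypothetical, the dictionary entries are random placeholders rather than learned atoms, and the no-flip filtering convention is again an assumption.

```python
import numpy as np

def toeplitz_from_filter(w, l):
    """(l - n + 1) x l Toeplitz matrix with the taps of w on its first n upper diagonals."""
    n = len(w)
    p = l - n + 1
    return sum(w[j] * np.eye(p, l, k=j) for j in range(n))

def conv_analysis_dictionary(Omega, l):
    """Stack one Toeplitz block per row atom of Omega into a (p * m) x l matrix."""
    return np.vstack([toeplitz_from_filter(atom, l) for atom in Omega])

m, n, l = 4, 5, 12
Omega = np.random.default_rng(1).standard_normal((m, n))
H = conv_analysis_dictionary(Omega, l)

print(Omega.shape)   # (4, 5): fewer rows than columns
print(H.shape)       # (32, 12): more rows than columns
```

The stacked matrix has 384 entries but only m · n = 20 free parameters, which is what makes the structured formulation attractive.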

Fig. 3. Convolution can be represented as multiplying with a Toeplitz matrix or multiplying with a right dual matrix. The right dual matrix is not sparse.

The above description applies to 1-dimensional input signals, but it can be extended to multi-dimensional signals such as images. A convolutional analysis dictionary will then be in the form of a concatenation of doubly block Toeplitz matrices (i.e. a matrix with block Toeplitz structure where each block is a Toeplitz matrix). Similar to Eqn. (2), the doubly block Toeplitz structure can be represented with corresponding indicator matrices. Fig. 2(a) shows an example of 2-dimensional convolution between an image region and a filter and Fig. 2(b) shows the corresponding convolutional analysis dictionary.
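One way to illustrate the doubly block Toeplitz structure is to assemble it from Kronecker products of the 1-dimensional indicator matrices. The sketch below assumes a square filter, a square image region, row-major vectorization and no filter flipping; these choices and the function name are illustrative assumptions.

```python
import numpy as np

def doubly_block_toeplitz(W, l):
    """Matrix D such that D @ X.ravel() equals the valid 2-D filtering of an l x l image X by W."""
    n = W.shape[0]
    p = l - n + 1
    D = np.zeros((p * p, l * l))
    for a in range(n):          # vertical shift
        for b in range(n):      # horizontal shift
            D += W[a, b] * np.kron(np.eye(p, l, k=a), np.eye(p, l, k=b))
    return D

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3))     # 3 x 3 filter
X = rng.standard_normal((6, 6))     # 6 x 6 image region

D = doubly_block_toeplitz(W, 6)
Y_matrix = (D @ X.ravel()).reshape(4, 4)
# Reference: slide the filter over every valid position.
Y_direct = np.array([[np.sum(W * X[i:i + 3, k:k + 3]) for k in range(4)]
                     for i in range(4)])
```

Each Kronecker term is a block matrix whose block pattern is Toeplitz and whose blocks are themselves Toeplitz, so the sum inherits the doubly block Toeplitz form.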

III. LEARNING CONVOLUTIONAL ANALYSIS DICTIONARIES

In this section, we propose an efficient convolutional analysis dictionary learning algorithm by exploiting the properties of the Toeplitz structure within the dictionary.

For simplicity, let us assume that the input data is made of 1-dimensional vectors. Therefore, the convolutional analysis dictionary is a concatenation of Toeplitz matrices as we discussed in Section II. The proposed learning algorithm can be easily extended to the multi-dimensional case where the convolutional dictionary is a concatenation of doubly block Toeplitz matrices.

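Let us assume that convolution is performed between an analysis dictionary $\Omega \in \mathbb{R}^{m \times n}$ and an input signal $x \in \mathbb{R}^{l}$ with l > n. Therefore, the convolutional analysis dictionary $\mathcal{H}(\Omega, l)$ will be of size $q \times l$ with q = pm, where p = l − n + 1 is the number of rows of each Toeplitz block. Given that $\mathcal{H}(\Omega, l)$ is built using $\Omega$, it has the same number of free parameters as $\Omega$ despite being a much bigger matrix. This means that if we were to optimize $\mathcal{H}(\Omega, l)$ directly we would end up with a computationally inefficient approach.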

We mitigate this issue by first observing that when we assume that the convolutional filters have compact support, the convolutional analysis dictionary has many zero entries. It is therefore inefficient to evaluate the convolution by directly multiplying the dictionary with x. As illustrated in Fig. 3, with the commutativity property of convolution (i.e. $\omega^T * x = x * \omega^T$), the matrix multiplication between a Toeplitz matrix $T(\omega, l)$, built from the filter $\omega$, and an input vector x can be efficiently implemented as:

$$T(\omega, l)\, x = R(x, n)\, \omega \tag{4}$$

where $R(x, n)$ is the right dual matrix of $T(\omega, l)$ and is defined as:

$$R(x, n) = \sum_{j=1}^{l} x(j)\, \tilde{I}_j \tag{5}$$

where x(j) is the j-th coefficient of x, and $\tilde{I}_j$ is an indicator matrix with 1s on the j-th skew-diagonal and 0s on other locations. Note that the right dual matrix R(x, n) is a matrix without zero entries and has n columns.
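The right dual matrix can be sketched as a small dense matrix whose entries are constant along skew-diagonals. The snippet below is an illustration under the same no-flip convention assumed for the Toeplitz form; `right_dual` is a hypothetical helper name.

```python
import numpy as np

def right_dual(x, n):
    """p x n matrix with entry (i, j) = x[i + j]; constant along skew-diagonals."""
    x = np.asarray(x, dtype=float)
    p = x.size - n + 1
    idx = np.arange(p)[:, None] + np.arange(n)[None, :]
    return x[idx]

rng = np.random.default_rng(3)
w = rng.standard_normal(5)                    # filter taps, n = 5
x = rng.standard_normal(12)                   # input signal, l = 12

R = right_dual(x, w.size)                     # 8 x 5, no structural zeros
y_dual = R @ w                                # uses p * n multiplications
y_direct = np.correlate(x, w, mode="valid")   # reference sliding inner product
```

The product with the p × l Toeplitz matrix would touch p · l entries, most of them zero, whereas the dual form touches only p · n entries and exposes the filter taps as the unknown vector.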

Secondly, whenever possible, we will pose the optimization problem using whilst imposing the constraints associated with the structured matrix as the actual analysis dictionary learning problem.

We want the convolutional analysis dictionary to satisfy four properties: (i) its row atoms span the input data space; (ii) it is able to sparsify the input data; (iii) the row atoms are of unit norm; (iv) there are no pairs of row atoms in that are linearly dependent.

Different from the unstructured analysis dictionary learning case, we propose to use two sets of input training data with different sizes. Let us denote the super-patch training data as and denote the small patch training data as . Please remember that n is the filters’ size and l is the dimension of input signals and we assume n < l. Both the super-patch and small patch training datasets are extracted from an external training dataset. The super-patch training data will be used to impose property (i) which is a global property of the convolutional dictionary. The small patch training data will be used to impose property (ii).

The first learning objective is that the convolutional analysis dictionary should be able to span the input data space in order to preserve the information within the input super-patch training data. The super-patch training dataset defines the subspace covered by the input data. Let us denote with W the orthogonal basis covering the signal subspace of the input super-patch data, where we assume that this subspace has dimension K. We also denote with U the orthogonal basis of the orthogonal complement to the signal subspace of . These two bases will be used to impose that the row space of the learned convolutional analysis dictionary spans the input data space while being orthogonal to the null-space of . The information preservation constraint can be interpreted as a rank constraint on the convolutional analysis dictionary which is usually achieved by imposing a logarithm determinant constraint:

The size of the convolutional analysis dictionary can be huge, especially when the input data is multi-dimensional. Therefore it would be computationally inefficient to evaluate Eqn. (6) and its first order derivative directly. By exploiting the properties of the convolutional analysis dictionary, we propose an efficient reformulation of Eqn. (6) which is based on the analysis dictionary .
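The role of the two bases and of a log-determinant score can be illustrated numerically. The sketch below obtains both bases from a singular value decomposition of synthetic super-patch data and evaluates a plain log-determinant of the Gram matrix of the dictionary restricted to the signal subspace. This score is an illustrative surrogate and not the exact form of Eqn. (6); all names and sizes are assumptions.

```python
import numpy as np

def signal_and_complement_bases(X_sp, rel_tol=1e-10):
    """Orthonormal bases W (signal subspace) and U (orthogonal complement) of the columns of X_sp."""
    left, s, _ = np.linalg.svd(X_sp, full_matrices=True)
    K = int(np.sum(s > rel_tol * s[0]))       # numerical dimension of the signal subspace
    return left[:, :K], left[:, K:]

def logdet_rank_score(H, W):
    """log det of the Gram matrix of H W; very negative (or -inf) when H W loses rank."""
    B = H @ W
    sign, logdet = np.linalg.slogdet(B.T @ B)
    return logdet if sign > 0 else -np.inf

rng = np.random.default_rng(4)
l, K, q = 12, 7, 32
X_sp = rng.standard_normal((l, K)) @ rng.standard_normal((K, 200))   # data in a K-dim subspace
W, U = signal_and_complement_bases(X_sp)

H_full = rng.standard_normal((q, l))                      # rows cover the whole signal subspace
H_deficient = rng.standard_normal((q, 3)) @ W[:, :3].T    # rows cover only 3 of the K directions

score_full = logdet_rank_score(H_full, W)
score_deficient = logdet_rank_score(H_deficient, W)
```

Maximising such a score pushes the dictionary away from rank-deficient configurations, which is the intuition behind using a log-determinant term to preserve information.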

TABLE I A LIST OF SYMBOLS AND THEIR DIMENSIONS. FOR SIMPLICITY, IN THE TABLE WE DENOTE

With the definition of the right dual matrix, the multiplication between and the j-th orthogonal basis element of W can be expressed as:

where vec(·) denotes the vectorization operation vec(A) : , and the vectorization operation for a 2-dimensional signal can be expressed as:

where is the i-th canonical basis vector of , that is, (with 1 on the i-th location), and $\otimes$ denotes the Kronecker product.

The information preservation constraint in Eqn. (6) can therefore be reformulated and expressed in terms of the analysis dictionary as:

where with vec. The gradient of can be expressed as:

where and .

With the information preservation constraint in Eqn. (9), the learned is constrained to span the signal subspace defined by W. However, we still need to exclude the null-space components of the training data from . Specifically, the Toeplitz matrix should not be within the subspace spanned by U to avoid a zero response when multiplying with .

Therefore we define the feasible set of the convolutional analysis dictionary as with being the unit sphere in being the product of q unit spheres and being the orthogonal complement of U. The unit sphere constraint ensures that the unit norm condition is satisfied. The feasible set is defined in , while we wish to have a feasible set for which is defined in with and can be more efficiently implemented.

The operation of orthogonal projection onto the complementary subspace of U can be represented by the projection matrix given by:

$$P_{U^{\perp}} = I - U U^{\dagger} \tag{11}$$

where I is the identity matrix and $U^{\dagger}$ is the pseudoinverse of U. The orthogonal projection operation is achieved by multiplying the convolutional analysis dictionary with the projection matrix. The projection is applied on the rows of the convolutional analysis dictionary.
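A minimal sketch of this projection, assuming a randomly generated orthonormal U and a random placeholder dictionary H (both illustrative, with sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)
l = 12
# Orthonormal basis of the subspace to be removed (4 dimensions, illustrative).
U, _ = np.linalg.qr(rng.standard_normal((l, 4)))

P = np.eye(l) - U @ np.linalg.pinv(U)   # projector onto the orthogonal complement of span(U)

H = rng.standard_normal((32, l))        # placeholder for a convolutional analysis dictionary
H_proj = H @ P                          # projection applied to every row of H
```

After the projection every row of `H_proj` is orthogonal to the columns of U, and applying the projector a second time changes nothing.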

With the definition of the right dual matrix, the orthogonal projection operation can be expressed in terms of the analysis dictionary atom as:

We note that, after the projection, the Toeplitz structure within may not be preserved and needs to be imposed again. The Toeplitz matrix closest to is obtained by averaging over the diagonal elements [38]. The orthogonal projection operation and the averaging operation can be jointly represented and applied to the atoms of the analysis dictionary. Let us define a vector whose inner product with equals the average value of the j-th diagonal elements of , it can be expressed as:

The matrix therefore represents simultaneously the orthogonal projection operation and the averaging operation. Let us denote the feasible set of the analysis dictionary as where represents the product of m unit spheres and represents the orthogonal complementary subspace of the null-space of X. The operation of the orthogonal projection onto the tangent space can then be represented by the projection matrix :

where is the identity matrix, and .
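The diagonal-averaging step mentioned above can be sketched as follows. For a matrix that is no longer Toeplitz, averaging each of the first n upper diagonals gives the filter taps of the closest matrix with the banded Toeplitz structure in the Frobenius norm. Helper names, sizes and the perturbation are illustrative assumptions.

```python
import numpy as np

def toeplitz_from_filter(w, l):
    """(l - n + 1) x l Toeplitz matrix with the taps of w on its first n upper diagonals."""
    n = len(w)
    p = l - n + 1
    return sum(w[j] * np.eye(p, l, k=j) for j in range(n))

def average_diagonals(M, n):
    """Filter taps of the banded Toeplitz matrix closest to M in Frobenius norm."""
    return np.array([np.diagonal(M, offset=j).mean() for j in range(n)])

rng = np.random.default_rng(6)
n, l = 5, 12
w = rng.standard_normal(n)
T = toeplitz_from_filter(w, l)

w_exact = average_diagonals(T, n)              # an exact Toeplitz matrix is left unchanged
M = T + 0.1 * rng.standard_normal(T.shape)     # structure broken, e.g. after a projection
w_avg = average_diagonals(M, n)
T_avg = toeplitz_from_filter(w_avg, l)         # structure restored
```

Because the averaging is linear in the matrix entries, it can be composed with a linear projection and applied directly to the n filter taps, which is the idea exploited in the text.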

The sparsifying property of the convolutional analysis dictionary over the super-patch training data can be achieved by imposing the sparsifying property of the analysis dictionary over the small patch training data. The rationale is that the row atoms of the convolutional analysis dictionary only operate on local regions of the input signal as illustrated in Eqn. (4) and Fig. 3. Similar to [20], [39], [40], the sparsifying constraint is imposed by using a log-square function which promotes analysis dictionary atoms that sparsify the small patch training data:

where is a tunable parameter which controls the sparsifying ability of the learned dictionary.
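To see why a log-square function favours sparse responses, the sketch below compares two coefficient vectors with the same energy. The normalisation by the number of coefficients and the parameter name `nu` are assumptions made for illustration and may differ from Eqn. (15).

```python
import numpy as np

def log_square_penalty(Omega, X, nu):
    """Mean of log(1 + nu * a^2) over all analysis coefficients a = Omega @ X."""
    A = Omega @ X
    return np.mean(np.log1p(nu * A**2))

Omega = np.eye(4)                                     # trivial dictionary, for illustration only
x_sparse = np.array([[2.0], [0.0], [0.0], [0.0]])     # energy 4 in one coefficient
x_dense = np.array([[1.0], [1.0], [1.0], [1.0]])      # energy 4 spread over four coefficients

penalty_sparse = log_square_penalty(Omega, x_sparse, nu=10.0)   # log(41) / 4
penalty_dense = log_square_penalty(Omega, x_dense, nu=10.0)     # log(11)
```

The concentrated response is penalised less than the spread-out one, so minimising the function over the atoms encourages responses with few large coefficients.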

The linearly dependent penalty and the unit norm constraint can also be imposed directly on the analysis dictionary .

Linearly dependent row atoms (e.g. ) are penalized by using a logarithm barrier term :

We observe that, by exploiting the Toeplitz structure, we have been able to impose the desired properties of a convolutional analysis dictionary by imposing constraints on the lower-dimensional analysis dictionary . This will reduce computational costs and memory requirements.
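The logarithm barrier on linearly dependent atoms can be sketched on the small unstructured dictionary as follows. The averaging over atom pairs is an assumed normalisation, and the function name is hypothetical; rows are assumed to have unit norm.

```python
import numpy as np

def coherence_barrier(Omega):
    """Mean of -log(1 - <w_i, w_j>^2) over distinct pairs of unit-norm rows."""
    G = Omega @ Omega.T
    i, j = np.triu_indices(Omega.shape[0], k=1)
    return -np.mean(np.log(1.0 - G[i, j] ** 2))

theta = 0.01                                   # small angle between the first two atoms
Omega_orth = np.eye(3)
Omega_close = np.array([[1.0, 0.0, 0.0],
                        [np.cos(theta), np.sin(theta), 0.0],
                        [0.0, 0.0, 1.0]])

barrier_orth = coherence_barrier(Omega_orth)     # zero for mutually orthogonal atoms
barrier_close = coherence_barrier(Omega_close)   # grows without bound as theta -> 0
```

Only the m × n dictionary enters this computation, so the cost does not depend on the signal length l.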

Combining the information preservation constraint in Eqn. (9), feasible set constraint in Eqn. (14), sparsifying constraint in Eqn. (15), and linearly dependent penalty term in Eqn. (16), the objective function for the convolutional analysis dictionary problem can be expressed as:

where with and being the regularization parameters.

The proposed convolutional analysis dictionary learning algorithm ConvGOAL+ is summarized in Algorithm 1. The objective function in Eqn. (17) is optimized using a geometric conjugate gradient descent method (see also [20], [40], [41]).
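The geometric flavour of the optimisation can be illustrated with a much simpler stand-in: plain Riemannian gradient descent with backtracking on a product of unit spheres, applied to the sparsifying term alone. This is not the conjugate gradient scheme of Algorithm 1 and omits the rank and barrier terms; it only shows the tangent-space projection and the retraction that keep each atom at unit norm. All names are hypothetical.

```python
import numpy as np

def objective(Omega, X, nu):
    return np.mean(np.log1p(nu * (Omega @ X) ** 2))

def euclidean_gradient(Omega, X, nu):
    A = Omega @ X
    return (2.0 * nu * A / (1.0 + nu * A**2)) @ X.T / A.size

def sphere_descent_step(Omega, X, nu, step=1.0):
    """One backtracking gradient step that keeps every row of Omega on the unit sphere."""
    G = euclidean_gradient(Omega, X, nu)
    # Remove, row by row, the gradient component along the atom itself (tangent-space projection).
    G_tan = G - np.sum(G * Omega, axis=1, keepdims=True) * Omega
    f0 = objective(Omega, X, nu)
    while step > 1e-12:
        candidate = Omega - step * G_tan
        candidate /= np.linalg.norm(candidate, axis=1, keepdims=True)   # retraction onto the sphere
        if objective(candidate, X, nu) <= f0:
            return candidate
        step *= 0.5                                                     # backtrack
    return Omega

rng = np.random.default_rng(7)
X = rng.standard_normal((5, 300))              # stand-in small patch training data
Omega = rng.standard_normal((4, 5))
Omega /= np.linalg.norm(Omega, axis=1, keepdims=True)

f_start = objective(Omega, X, nu=10.0)
for _ in range(30):
    Omega = sphere_descent_step(Omega, X, nu=10.0)
f_end = objective(Omega, X, nu=10.0)
```

Without the remaining terms of the objective the atoms are free to collapse onto the same direction, which is one reason the full formulation combines several constraints.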

IV. DEEP CONVOLUTIONAL ANALYSIS DICTIONARY MODEL

In this section, we introduce our Deep Convolutional Analysis Dictionary Model (DeepCAM). DeepCAM is a convolutional extension of the Deep Analysis dictionary Model (DeepAM) [40]. Different from DeepAM which is patch-based, DeepCAM performs convolution operation and element-wise soft-thresholding at image level on all layers without dividing the input image into patches.

When it comes to Single Image Super-Resolution (SISR), convolutional neural networks are designed using two main strategies: the early-upsampling approaches [34], [35] and the late-upsampling approaches [36], [37]. The early-upsampling approaches [34], [35] first upsample the low-resolution (LR) image to the same resolution as the desired high-resolution (HR) one through bicubic interpolation and then perform convolution on the upsampled image. The drawback is that this leads to a large number of model parameters and a high computational complexity during testing, as the feature maps are of the same size as the HR image. The late-upsampling approaches [36], [37] perform convolution on the input LR image and apply a deconvolution layer [36] or a sub-pixel convolution layer [37] at the last layer to predict the HR image. The late-upsampling approaches have a smaller number of parameters and a lower computational cost than the early-upsampling ones.

SISR is used as a sample application to validate our proposed design. We utilize a similar strategy as the late-upsampling approach. The LR image is used as input to DeepCAM without bicubic interpolation. At each layer, the convolution and soft-thresholding operations are applied to the corresponding input signal. For SISR with up-sampling factor s, the synthesis dictionary consists of $s^2$ atoms. The convolution between the synthesis dictionary and its input signal yields $s^2$ output channels which correspond to subsampled versions of the HR image. The final predicted HR image can then be obtained by reshaping and combining the output channels.
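The final reshaping step can be sketched as interleaving the output channels on the HR grid. The channel ordering used below (channel a·s + b fills the pixels with row offset a and column offset b) is an assumption; the actual sampling pattern is whatever the training data was built with.

```python
import numpy as np

def combine_subimages(channels, s):
    """Interleave s*s sub-images of size h x w into one (h*s) x (w*s) image."""
    c, h, w = channels.shape
    if c != s * s:
        raise ValueError("expected s*s channels")
    hr = np.empty((h * s, w * s), dtype=channels.dtype)
    for a in range(s):
        for b in range(s):
            hr[a::s, b::s] = channels[a * s + b]
    return hr

def split_subimages(hr, s):
    """Inverse operation: subsample an HR image into its s*s polyphase sub-images."""
    return np.stack([hr[a::s, b::s] for a in range(s) for b in range(s)])

s = 2
hr_true = np.arange(36.0).reshape(6, 6)
subimages = split_subimages(hr_true, s)       # 4 sub-images of size 3 x 3
hr_rebuilt = combine_subimages(subimages, s)
```

With s = 2 there are four sub-images, each at the resolution of the LR input, so all convolutions run on the small grid and only this rearrangement touches the HR grid.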

The parameters of an L-layer DeepCAM include L layers of analysis dictionary and soft-thresholds pairs and a single synthesis layer modelled with dictionary D. The atoms of the dictionaries represent filters. Let us denote with the number of filters at the i-th layer and with the spatial support size of the convolutional filters. Since there were filters at the previous layer, there are free parameters at layer i with . Therefore the complete set of free parameters is given by the analysis dictionaries , the soft-thresholds and the synthesis dictionary where .

Fig. 4 shows an example of a 2-layer DeepCAM for image super-resolution. The input LR image denoted with passes through multiple layers of convolution with the analysis dictionary and soft-thresholding. There are 4 synthesised HR sub-images which are obtained by convolving the last layer analysis feature maps with the synthesis dictionary D and will be rearranged to generate the final predicted HR image according to the sampling pattern.

Let us denote with the input signal at the i-th layer, and denote with the j-th atom of . The convolution and soft-thresholding operations corresponding to the j-th atom and threshold pair can be expressed as:

where mat(·) represents the operation which reshapes a vector of length to a tensor with size , is the convolution result, denotes the element-wise soft-thresholding operation with threshold , and is the sparse representation after thresholding.

Fig. 4. A 2-layer Deep Convolutional Analysis dictionary Model for single image super-resolution. There are 2 layers of analysis dictionaries with element-wise soft-thresholding operators and a layer of synthesis dictionary D. The input image is a gray image. The estimated HR images are obtained through a cascade of convolution and soft-thresholding operations with input LR image . The final predicted HR image is obtained by reshaping according to the sampling pattern.

Fig. 5. The convolution and soft-thresholding operation corresponding to the atom and threshold pair . The input signal is of size represents a convolutional filter with spatial support size channels. The convolution of results in a matrix . An element-wise soft-thresholding operation is applied to every element of results in

Fig. 5 illustrates the convolution and the soft-thresholding operation described in Eqn. (18). The convolution linearly transforms the input signal to a 2-D representation . An element-wise soft-thresholding operation is then applied to every element of and generates a sparse 2-D representation .
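A minimal sketch of one atom and threshold pair acting on a multi-channel input is given below. It uses explicit loops for clarity, applies the filter without flipping and without padding, and uses arbitrary sizes; these are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding: shrink towards zero by lam and zero out small values."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def conv_valid(Z, w):
    """Filter an (H, W, C) input with an (n, n, C) atom at every valid position."""
    n = w.shape[0]
    H, Wd, _ = Z.shape
    out = np.empty((H - n + 1, Wd - n + 1))
    for i in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[i, k] = np.sum(Z[i:i + n, k:k + n, :] * w)
    return out

rng = np.random.default_rng(8)
Z = rng.standard_normal((8, 8, 3))     # input signal with 3 channels
w = rng.standard_normal((3, 3, 3))     # one atom: 3 x 3 spatial support across 3 channels
lam = 1.5                              # soft-threshold paired with this atom

A = conv_valid(Z, w)                   # 6 x 6 convolution result
S = soft_threshold(A, lam)             # sparse 6 x 6 feature map
```

Stacking the maps produced by all atom and threshold pairs of a layer along a third axis gives the input of the next layer.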

By stacking the sparse 2-D representation , the i-th layer output signal can be represented as . For simplicity, let us denote the i-th layer convolution and soft-thresholding operation as:

When the convolution of and is represented by a convolutional analysis dictionary with l = , the convolution and soft-thresholding operations can be expressed as follows:

where 1 is an all-ones vector of size , and $\otimes$ is the Kronecker product.

where denotes the estimated HR images.

V. LEARNING A DEEP CONVOLUTIONAL ANALYSIS DICTIONARY MODEL

In this section, we will introduce the proposed algorithm for learning both the convolutional analysis dictionary and the soft-thresholds in DeepCAM. We adopt a joint Information Preserving and Clustering strategy as proposed in DeepAM [40]. At each layer, the analysis dictionary is divided into two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD) . The IPAD and soft-threshold pair will generate feature maps that can preserve the information from the input image. The CAD and soft-threshold pair will generate feature maps with strong discriminative power that can facilitate the prediction of the HR image. To achieve this goal, there should be a sufficient number of IPAD and CAD atoms. Guidelines on how to determine the size of each dictionary are also provided in this section.

A. Learning IPAD and Threshold Pair

The Information Preserving Analysis Dictionary (IPAD) will be learned using the proposed convolutional analysis dictionary learning method of Section III. The thresholds will be set according to the method used in DeepAM [40].

A multi-layer convolutional analysis dictionary naturally possesses a multi-scale property. The product of two convolutional dictionaries leads to a convolutional dictionary whose equivalent filters are given by the convolution of the filters in the two dictionaries due to the associativity property of convolution (i.e. ).

Let us denote with the i-th layer convolutional analysis dictionary constructed using convolutional filters with patch size . The effective convolutional analysis dictionary has filters with spatial patch size:

Fig. 6. With a two-layer convolutional analysis dictionary, the effective convolutional analysis dictionary is still a convolutional analysis dictionary and with support size

Fig. 7. Super-patches at different layers in a 2-layer DeepCAM. A synthesised pixel value corresponds to a super-patch on the i-th layer with patch size In this example, the convolutional filters are with size . The super-patch size at layer 1, 2, and 3 is 7, 6, and 4, respectively.

An example of a two-layer convolutional analysis dictionary is shown in Fig. 6. The effective dictionary has an effective patch size that increases with the number of layers and can be large even when each convolutional analysis dictionary uses filters with small patch size.
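The growth of the effective support can be verified on a single-channel toy example: the product of two Toeplitz blocks is again a Toeplitz block whose filter is the convolution of the two filters. The helper name and the no-flip convention for the Toeplitz form are assumptions.

```python
import numpy as np

def toeplitz_from_filter(w, l):
    """(l - n + 1) x l Toeplitz matrix with the taps of w on its first n upper diagonals."""
    n = len(w)
    p = l - n + 1
    return sum(w[j] * np.eye(p, l, k=j) for j in range(n))

rng = np.random.default_rng(9)
w1 = rng.standard_normal(3)                    # layer-1 filter, support 3
w2 = rng.standard_normal(3)                    # layer-2 filter, support 3
l = 12

T1 = toeplitz_from_filter(w1, l)               # 10 x 12
T2 = toeplitz_from_filter(w2, T1.shape[0])     # 8 x 10, acts on the layer-1 output
w_eff = np.convolve(w1, w2)                    # effective filter, support 3 + 3 - 1 = 5
T_eff = toeplitz_from_filter(w_eff, l)         # 8 x 12
```

The product `T2 @ T1` coincides with `T_eff`, so two layers of 3-tap filters behave like a single layer of 5-tap filters when no non-linearity is placed in between.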

When the support size of a convolutional analysis dictionary is small, its row atoms can only receive local information from the whole input signal. With an increased effective patch size, the row atoms of the convolutional analysis dictionary at a deeper layer will receive information from a larger segment of the input signal. For each HR pixel in the synthesised HR images , there is a corresponding super-patch region on each layer which contributes all the information for predicting that pixel. Let us denote with the super-patch size at the i-th layer. It can be expressed in terms of the patch size of the convolutional filters from the final layer L to layer i:

Fig. 7 shows the super-patches at different layers for a 2-layer DeepCAM. Note that the super-patch size at a shallower layer is larger than that in a deeper layer.

In the proposed convolutional analysis dictionary learning method ConvGOAL+, there are two sets of training data: the super-patch training data and the small patch training data X. The super-patch data is used to impose the rank constraint. The small patch data X has the same support as the filters and is used to impose the sparsifying and linear independence constraints.

The patch size of the super-patch training data for convolutional analysis dictionary learning should be no smaller than . Otherwise, we cannot ensure that the learned convolutional analysis dictionary will be able to utilize all information within the super-patch for predicting the corresponding HR pixel values.

At the i-th layer, let us define the support size of the super-patch training data as . The super-patch training data, the small patch training dataset and the ground-truth training dataset are denoted as , and , respectively.

Let us denote as the number of atoms in . With , we will have . From the degree of freedom perspective, there should be at least rows in to ensure that information from will be preserved. This leads to:

Eqn. (24) indicates that there should be more atoms for information preservation in a deeper layer of a DeepCAM. For example, when , and in a 2-layer DeepCAM, there should be at least 2 atoms in the 1-st layer, and 4 atoms in the 2-nd layer for information preservation.

Given atoms, the super-patch training data and the small patch training data , the IPAD is learned using ConvGOAL+ algorithm. The convolutional analysis dictionary will then be able to preserve essential information from the input LR image.

The soft-thresholds should be set properly. As in [40], the inner product between an analysis atom and the small patch training samples can be well modelled by a Laplacian distribution with variance . Therefore, as in [40], the soft-thresholds associated with the IPAD are set to be inversely proportional to the variances:

where is a scaling parameter, and the variance of the j-th coefficient can be estimated using the obtained IPAD and the small patch training data .

The free parameter is determined by solving a 1-dimensional search problem. The optimization problem for is therefore formulated as:

(26)

where , 1 is an all-ones vector of size , $\otimes$ is the Kronecker product, G = with vec, and D is a discrete set of values.
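The 1-dimensional search over a discrete candidate set can be sketched as follows. The data are synthetic stand-ins, the per-atom spread is estimated with the sample standard deviation, and the linear map is refit by least squares for every candidate; each of these choices is an assumption made for illustration and may differ from Eqn. (26).

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def search_scale(A, Y, spread, candidates):
    """Return (rho, error) minimising the least-squares prediction error over the candidates."""
    best_rho, best_err = None, np.inf
    for rho in candidates:
        Z = soft_threshold(A, (rho / spread)[:, None])     # thresholds inversely proportional to spread
        G = np.linalg.lstsq(Z.T, Y.T, rcond=None)[0].T     # linear map refit for each candidate
        err = np.mean((Y - G @ Z) ** 2)
        if err < best_err:
            best_rho, best_err = rho, err
    return best_rho, best_err

rng = np.random.default_rng(10)
scales = np.array([0.5, 1.0, 2.0])
A = rng.laplace(scale=scales[:, None], size=(3, 2000))                        # stand-in analysis coefficients
Y = rng.standard_normal((2, 3)) @ A + 0.01 * rng.standard_normal((2, 2000))   # stand-in targets

spread = A.std(axis=1)
candidates = [0.0, 0.05, 0.1, 0.2, 0.4]
rho_best, err_best = search_scale(A, Y, spread, candidates)
```

Because only one scalar is searched, the cost is one thresholding and one least-squares fit per candidate value.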

B. Learning CAD and Threshold Pair

The objective of a Clustering Analysis Dictionary (CAD) is to perform a linear transformation to its input signal such that the responses are highly correlated with the most significant components of the residual error. Soft-thresholding, which is used as the non-linearity, sets to zero the data with relatively small responses. The components with large residual error will then be identified.

The number of atoms in is essential to the performance of DeepCAM. Similar to the discussions in Section V-A, with atoms in , the size of the convolutional analysis dictionary will be . For each super-patch region on the LR image, the number of coefficients for discriminative feature representation should not decrease over layers. That is, we would like to have more atoms in than in . Therefore, the number of CAD atoms should meet the condition:

Different from the unstructured deep dictionary model, it is not straightforward to set the dictionary sizes. Eqn. (24) and Eqn. (27) provide a guideline on how to set the number of atoms in order to generate representations that are both information preserving and discriminative.

Let us denote with the corresponding HR patch training data of . A synthesis dictionary can be learned to map to by solving:

It has a closed-form solution:

Given , we define the middle resolution (MR) and the residual data as and , respectively. The MR data is a linear transformation of the input small patch training data. The residual data contains the information about the residual energy.
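The closed-form least-squares map and the resulting middle resolution and residual data can be illustrated with synthetic stand-ins for the training patches. The variable names and sizes below are assumptions; samples are stored as columns.

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.standard_normal((9, 500))                        # stand-in LR small patches
Y = rng.standard_normal((16, 9)) @ X + 0.1 * rng.standard_normal((16, 500))   # stand-in HR patches

# Least-squares map D = Y X^T (X X^T)^{-1}, computed with a linear solve instead of an explicit inverse.
D = np.linalg.solve(X @ X.T, X @ Y.T).T

Y_mr = D @ X      # middle resolution data: linear prediction from the input patches
E = Y - Y_mr      # residual data: the part the linear map cannot explain
```

By the normal equations the residual is uncorrelated with the input patches, so a further linear map of the same input cannot reduce it; this is where a non-linear, discriminative representation becomes useful.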

We propose to learn an analysis dictionary in the ground-truth data domain. If is able to simultaneously sparsify the middle resolution data and the residual data , the atoms within the learned will then be able to identify the data in with large residual energy and the i-th layer CAD is then re-parameterized as:

Therefore an additional constraint as proposed in [40] is applied to impose the simultaneous sparsifying property. Each analysis atom is enforced to be able to jointly sparsify and :

where is a tunable parameter, and and are the j-th column of and , respectively. The objective function for learning the analysis dictionary can then be formulated as:

where with and being the regularization parameters. The functions , and are those defined in Eqn. (15), Eqn. (9) and Eqn. (16), respectively. To have zero mean responses for each learned CAD atom, the feasible set of the analysis dictionary is set to with .

The objective function in Eqn. (32) is optimized using ConvGOAL+ algorithm. With the learned analysis dictionary , the i-th layer CAD is then obtained as in Eqn. (30).

As proposed in DeepAM [40], it is both effective and efficient to set the CAD soft-thresholds proportional to the variance of the analysis coefficients. The CAD soft-thresholds are therefore defined as follows:

where is a scaling parameter, and is the variance of the Laplacian distribution for the j-th atom.

The free parameter can be learned using a similar approach to the one used to solve Eqn. (26). As the analysis coefficients can be well modelled by Laplacian distributions, the proportion of data that is set to zero for each pair of atom and threshold will be the same. The optimization problem for is formulated as:

(34)

where is the estimation residual using IPAD, , 1 is an all-ones vector of size , $\otimes$ is the Kronecker product, with vec, and D is a discrete set of values.
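The property that every atom and threshold pair zeroes the same proportion of data can be checked by simulation. The sketch below draws Laplacian coefficients with different scale parameters and sets each threshold proportional to the scale. Treating the Laplacian scale parameter as the spread measure is an assumption about what the variance denotes here; with a different spread measure the common fraction changes but remains identical across atoms.

```python
import numpy as np

rng = np.random.default_rng(12)
scales = np.array([0.5, 1.0, 4.0])                       # one Laplacian scale per atom
A = rng.laplace(scale=scales[:, None], size=(3, 200000)) # stand-in analysis coefficients

rho = 0.7                                                # common scaling parameter
lam = rho * scales                                       # thresholds proportional to the spread
zero_fraction = np.mean(np.abs(A) <= lam[:, None], axis=1)

expected = 1.0 - np.exp(-rho)                            # P(|a| <= rho * b) for a Laplacian with scale b
```

All three empirical fractions are close to 1 − exp(−ρ), which is why a single scalar is enough to control the sparsity level of every CAD feature map.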

C. Synthesis Dictionary Learning

At the last layer, the synthesis dictionary will transform the L-th layer deep convolutional representation to the ground-truth training data . The synthesis dictionary can be learned using least squares:

Convolving the learned synthesis dictionary with the L-th layer feature maps results in estimated HR images which can be reshaped and combined to form the final estimated HR image.

The overall learning algorithm for DeepCAM is summarized in Algorithm 2.

VI. SIMULATION RESULTS

In this section, we report the implementation details and numerical results of our proposed DeepCAM method and compare it with other existing single image super-resolution algorithms.

TABLE II PARAMETERS SETTING OF GOAL+ ALGORITHM FOR LEARNING THE LAYER IPAD

A. Implementation Details

Most of the implementation settings are the same as in [40]. The standard 91 training images [16] are used as the training dataset and the Set5 [16] and the Set14 [17] are used as the testing datasets. The color images have been converted from the RGB color space to the YCbCr color space. Image super-resolution is only performed on the luminance channel.
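As an illustration of the luminance extraction step, the sketch below computes the Y channel of an RGB image. The paper does not state which conversion is used; the ITU-R BT.601 coefficients with the 16 to 235 luma range are assumed here because they are common in super-resolution benchmarks. The function name is hypothetical.

```python
import numpy as np

def rgb_to_luminance(rgb):
    """Return the BT.601 luma (Y) channel of an RGB image with values in [0, 1].

    rgb: array of shape (height, width, 3).
    The output lies in [16, 235] (assumed "studio range" convention).
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    weights = np.array([65.481, 128.553, 24.966])
    return 16.0 + rgb @ weights

# Sanity check on pure white and pure black images.
white = np.ones((2, 2, 3))
black = np.zeros((2, 2, 3))
y_white = rgb_to_luminance(white)
y_black = rgb_to_luminance(black)
```

Under this setting only the Y channel would be passed to the super-resolution model, and the quality metric would be computed on that channel.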

Table II shows the parameter settings of the ConvGOAL+ algorithm for learning the i-th layer IPAD and CAD. Both the IPAD and the CAD are initialized with i.i.d. Gaussian random entries. The spatial size of the super-patches used for training is set to the minimum value indicated by Eqn. (23) in order to minimize the computational complexity of training. The number of IPAD atoms is set to the minimum integer satisfying Eqn. (24). The number of CAD atoms is then set to with a pre-defined total number of channels . We apply batch training for the ConvGOAL+ algorithm. The training data has been equally divided into batches. During training, the ConvGOAL+ algorithm is applied sequentially to each batch until the learned dictionary converges or all batches have been used for training. For each batch, iterations of conjugate gradient descent are performed to update the dictionary. The discrete set D used for searching the scaling parameter of the thresholds is set to be .

B. Analysis of the Learned DeepCAM

In this section, we analyze the learned DeepCAM in terms of the number of layers, learned soft-thresholds and extracted feature maps.

TABLE III PSNR (DB) OF DEEPCAM WITH DIFFERENT NUMBER OF LAYERS. FOR ALL DEEPCAM, THE SPATIAL FILTER SIZE AT ALL LAYERS IS MAXIMUM NUMBER OF FILTERS AT THE LAST LAYER IS SET TO 64. DENOTES THAT THERE ARE FILTERS AT THE FIRST LAYER, AND FILTER AT THE SECOND LAYER.

Table III shows the PSNR (dB) of the learned DeepCAM with different numbers of layers evaluated on Set5 [17]. For DeepCAM with different numbers of layers, the spatial filter size is set to for all layers and the maximum number of filters at the last layer is set to 64. The effective filter size for the DeepCAM with 1, 2, and 3 layers is therefore , , and , respectively. We can see that DeepCAM with more layers achieves higher average PSNR. From 1 layer to 2 layers, there is an improvement of about 0.2 dB. In particular, the PSNR of the testing image bird has been improved by around 0.5 dB. With 3 layers, further improvements can be observed on all testing images. The improved performance of DeepCAM with more layers may be due to two reasons. First, a deeper model has more non-linear layers and therefore has stronger expressive power. Second, unlike the unstructured deep dictionary model, a deeper convolutional dictionary model has an increased effective filter size, which helps include more information for prediction and therefore improves prediction performance.
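The growth of the effective filter size with depth follows from the fact that convolving two filters of lengths k1 and k2 gives a filter of length k1 + k2 - 1. The sketch below verifies this along one spatial dimension; the same count applies to each dimension of a 2-D filter. The kernel size 5 is purely illustrative and is not taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def effective_filter_size(kernel_sizes):
    """Support of a cascade of filters along one spatial dimension."""
    return 1 + sum(k - 1 for k in kernel_sizes)

def cascade_support(kernel_sizes):
    """Measure the same support by actually convolving all-ones 1-D filters."""
    combined = np.ones(1)
    for k in kernel_sizes:
        combined = np.convolve(combined, np.ones(k))
    return combined.size

# Illustrative cascades of 1, 2 and 3 layers with kernel size 5.
sizes = [effective_filter_size([5] * depth) for depth in (1, 2, 3)]
measured = [cascade_support([5] * depth) for depth in (1, 2, 3)]
```

With this illustrative kernel size, each extra layer enlarges the effective support by 4 samples per dimension, so a deeper model sees a larger neighbourhood of the input for each output pixel.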

Fig. 8 shows the soft-thresholds of a 3-layer DeepCAM with 9, 25 and 64 filters at layers 1, 2 and 3, respectively. We can observe that the soft-thresholds exhibit bimodal behaviour. That is, the thresholds corresponding to IPAD are relatively small, while the thresholds corresponding to CAD are relatively large. Another observation is that the amplitude of the soft-thresholds decreases over layers. This leads to denser representations at deeper layers, which can represent more complex signals.

Due to their different learning objectives, the resultant feature maps of IPAD and CAD contain different information. Fig. 9 shows the feature maps of the first layer in a 3-layer DeepCAM. The first 2 feature maps correspond to IPAD. These two feature maps, especially the one in Fig. 9(b), represent detailed structural information of the input LR image. The feature maps in Fig. 9(c) and 9(d) are due to CAD and have zero responses in most regions because of the relatively large soft-thresholds. These maps contain different directional edges corresponding to regions that require non-linear estimations. A combination of these features from both IPAD and CAD forms an informative and discriminative feature representation for predicting the HR image.

C. Comparison with Single Image Super-Resolution Methods

In this section, we compare our proposed DeepCAM method with DeepAM [40] and several existing single image super-resolution methods, including bicubic interpolation, the sparse coding (SC)-based method [17], Anchored Neighborhood Regression (ANR) [28], Adjusted Anchored Neighborhood Regression (A+) [29], and the Super-Resolution Convolutional Neural Network (SRCNN) [42].

Fig. 8. The soft-thresholds in layer 1, 2 and 3 of DeepCAM. There is a bimodal behaviour on the thresholds. The thresholds corresponding to IPAD are relatively small, while the thresholds corresponding to CAD are relatively large.

Fig. 9. The first layer feature maps of DeepCAM. The feature maps in 9(a) - 9(b) are due to IPAD and contain detailed structural information about the input LR image. The feature maps in 9(c) - 9(d) are due to CAD. They contain directional edges.

TABLE IV NUMBER OF FREE PARAMETERS IN DIFFERENT SINGLE IMAGE SUPER-RESOLUTION METHODS.

TABLE V PSNR (DB) OF DIFFERENT METHODS EVALUATED ON Set5 [16].

The SC-based method [17], ANR [28] and A+ [29] are patch-based. The SC-based method [17] is based on synthesis sparse representation and has an LR synthesis dictionary with 1024 atoms and a corresponding HR synthesis dictionary. The input feature is the compressed first- and second-order derivatives of the image patch and is obtained using Principal Component Analysis (PCA). ANR and A+ [28], [29] are based on clustering and assign each cluster a linear regression model. They use the same feature as in [17] but require a large number of free parameters. The SRCNN method [42] is based on a convolutional neural network with 2 convolutional layers of 64 and 32 filters. The spatial filter size is and , respectively.

The DeepCAM used for comparison is a 3-layer DeepCAM. For the convolutional analysis dictionary, the spatial filter size is at all layers and the number of filters is 9, 25 and 100 for layers 1, 2 and 3, respectively. The convolutional synthesis dictionary has spatial filter size . Therefore, the effective filter size of DeepCAM is .

Table IV shows the number of free parameters in different single image super-resolution methods. The SC-based method [17] requires a relatively small number of parameters, which mainly come from the two synthesis dictionaries. The ANR method [28] and the A+ method [29] have around 1 million free parameters because there are 1024 regressors with size . The SRCNN method [42] has the fewest parameters. DeepAM has around 160,000 parameters. This is because the dictionaries are not structured and there are 3 layers of analysis dictionaries. The proposed DeepCAM has only approximately 35,000 free parameters, since each structured convolutional dictionary is determined by a small number of free parameters even though the dictionary itself is very large.

We denote with CNN-BP the deep neural network with the same structure as DeepCAM and trained using the backpropagation algorithm [27] with learning rate decay step DS and total epochs for training. The implementation of DNNs is based on PyTorch with the Adam optimizer [43], batch size 1, initial learning rate , and decay rate 0.1. The training data has been arranged into and patch pairs.

TABLE VI PSNR (DB) OF DIFFERENT METHODS EVALUATED ON Set14 [17].

Fig. 10. Examples of reconstructed HR images by different methods. DeepCAM achieves better results than the backpropagation trained CNN. A region with characters on the reconstructed image has been marked with a red rectangle and zoomed in.

Table V shows the evaluation results of different methods on Set5. DeepCAM outperforms the SC-based method and ANR by around 0.5 dB and performs similarly to SRCNN and DeepAM, while its PSNR is around 0.2 dB lower than that of A+. The parameter settings of CNN-BP have been tuned to achieve the best performance on Set5. CNN-BP and CNN-BP have been used for comparison and achieve around 0.1 dB and 0.2 dB higher PSNR than DeepCAM, respectively. DeepCAMbp represents the backpropagation fine-tuned version of DeepCAM with the Adam optimizer [43], batch size 1, initial learning rate , and a total of 20 epochs. DeepCAMbp achieves better performance than DeepCAM. Its average PSNR is comparable to that of A+ and CNN-BP.

Table VI shows the evaluation results of different single image super-resolution methods on Set14, and results similar to those in Table V can be observed. The average PSNR of DeepCAM is around 0.3 dB higher than that of the SC-based method and ANR, while it is around 0.15 dB lower than that of A+. DeepCAM achieves performance similar to SRCNN, DeepAM and CNN-BP. DeepCAMbp improves on the performance of DeepCAM.
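For reference, the PSNR figures in such tables follow the standard definition 10 log10(peak^2 / MSE). The sketch below is a generic implementation of that definition, not the authors' evaluation script: the peak value of 255 and the absence of any border cropping are assumptions, and the function name is hypothetical.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Example: a constant error of one grey level on every pixel.
reference = np.full((8, 8), 100.0)
estimate = reference + 1.0
value = psnr(reference, estimate)
```

On this scale, the differences of 0.1 dB to 0.5 dB discussed in the text correspond to changes of roughly 2% to 11% in mean squared error.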

It is interesting to note that DeepCAM significantly outperforms CNN-BP and CNN-BP on ppt3 and zebra, which contain sharp edges at small scales. In particular, on ppt3, DeepCAM outperforms CNN-BP and CNN-BP by 0.4 dB and 1.6 dB, respectively. Figure 10 shows the reconstructed HR images of the testing image ppt3 using CNN-BP, CNN-BP and DeepCAM. We can see that CNN-BP and CNN-BP cannot reconstruct the characters well. A possible reason is that CNN-BP has a weaker generalization ability on unseen testing data.

Li et al. [44] proposed a method to visualize the loss surface landscape around the minimizer of a deep model. The sharpness of the loss surface landscape of a minimizer is well correlated with the generalization ability. That is, a wider and flatter minimizer has better generalization ability. From Figure 11, we can see that CNN-BP has the sharpest and narrowest 2D loss surface, while CNN-BP has a wider and flatter one, although it is still not as wide as that of DeepCAM. We can also see that performing backpropagation fine-tuning on DeepCAM does not significantly change the 2D loss surface. The visualization in Figure 11 correlates well with the simulation results in Table VI. The stronger generalization ability of DeepCAM is probably due to our information preserving and clustering design.

Fig. 11. Visualization of the 2D loss surface around the minima obtained with different methods. The sharpness of minimizers correlates well with generalization error. A wider and flatter minimizer usually has better generalization ability. The minimizers of DeepCAM and DeepCAMbp are flatter and wider than those of CNN-BP and CNN-BP.

VII. CONCLUSIONS

In this paper, we proposed a convolutional analysis dictionary learning algorithm by exploiting the properties of the Toeplitz structure within the convolution matrices. The proposed algorithm can impose the global rank property on learned convolutional analysis dictionaries while performing learning on the low-dimensional signals. We then proposed a Deep Convolutional Analysis Dictionary Model (DeepCAM) framework which consists of multiple layers of convolutional analysis dictionary and soft-threshold pairs and a single layer of convolutional synthesis dictionary. As in DeepAM, each convolutional analysis dictionary is composed of an information preserving analysis dictionary (IPAD) and a clustering analysis dictionary (CAD). The IPAD preserves the information from the input image, while the CAD generates discriminative feature maps for image super-resolution. Simulation results show that our proposed DeepCAM achieves performance comparable to that of other existing single image super-resolution methods while also having good generalization capability.

REFERENCES

[1] H. Bristow, A. Eriksson, and S. Lucey, “Fast convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 391–398.

[2] B. Wohlberg, “Efficient convolutional sparse coding,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7173–7177.

[3] F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5135–5143.

[4] L. Pfister and Y. Bresler, “Learning filter bank sparsifying transforms,” IEEE Transactions on Signal Processing, vol. 67, no. 2, pp. 504–519, 2018.

[5] I. Y. Chun and J. A. Fessler, “Convolutional dictionary learning: Acceleration and convergence,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1697–1712, 2017.

[6] ——, “Convolutional analysis operator learning: Acceleration, convergence, application, and neural networks,” arXiv preprint arXiv:1802.05584, 2018.

[7] J.-F. Cai, H. Ji, Z. Shen, and G.-B. Ye, “Data-driven tight frame construction and image denoising,” Applied and Computational Harmonic Analysis, vol. 37, no. 1, pp. 89–105, 2014.

[8] V. Papyan, J. Sulam, and M. Elad, “Working locally thinking globally: Theoretical guarantees for convolutional sparse coding,” IEEE Transactions on Signal Processing, vol. 65, no. 21, pp. 5687–5701, 2017.

[9] B. Wohlberg, “Efficient algorithms for convolutional sparse representations,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 301–315, 2015.

[10] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.

[11] V. Papyan and M. Elad, “Multi-scale patch-based image restoration,” IEEE Transactions on image processing, vol. 25, no. 1, pp. 249–261, 2016.

[12] J. Sulam, V. Papyan, Y. Romano, and M. Elad, “Multi-layer convolutional sparse modeling: Pursuit and dictionary learning,” arXiv preprint arXiv:1708.08705, 2017.

[13] A. Aberdam, J. Sulam, and M. Elad, “Multi-layer sparse coding: The holistic way,” SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 46–77, 2019.

[14] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on signal processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[15] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image processing, vol. 15, no. 12, pp. 3736–3745, 2006.

[16] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.

[17] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces. Springer, 2010, pp. 711–730.

[18] S. Ravishankar and Y. Bresler, “Learning sparsifying transforms,” IEEE Transactions on Signal Processing, vol. 61, no. 5, pp. 1072–1086, 2012.

[19] R. Rubinstein, T. Peleg, and M. Elad, “Analysis K-SVD: A dictionary-learning algorithm for the analysis sparse model,” IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 661–677, 2013.

[20] S. Hawe, M. Kleinsteuber, and K. Diepold, “Analysis operator learning and its application to image reconstruction,” IEEE Transactions on Image Processing, vol. 22, no. 6, pp. 2138–2150, 2013.

[21] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. L. Cun, “Learning convolutional feature hierarchies for visual recognition,” in Advances in neural information processing systems, 2010, pp. 1090–1098.

[22] I. Y. Chun and J. A. Fessler, “Convolutional analysis operator learning: Acceleration, convergence, application, and neural networks,” arXiv preprint arXiv:1802.05584, 2018.

[23] V. Papyan, Y. Romano, J. Sulam, and M. Elad, “Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and its connection to convolutional neural networks,” IEEE Signal Processing Magazine, vol. 35, no. 4, pp. 72–89, 2018.

[24] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a backpropagation network,” in Advances in neural information processing systems, 1990, pp. 396–404.

[25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.

[28] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.

[29] ——, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Asian Conference on Computer Vision. Springer, 2014, pp. 111–126.

[30] J.-J. Huang and W.-C. Siu, “Practical application of random forests for super-resolution imaging,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015, pp. 2161–2164.

[31] ——, “Learning hierarchical decision trees for single image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.

[32] J.-J. Huang, W.-C. Siu, and T.-R. Liu, “Fast image interpolation via random forests,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3232–3245, 2015.

[33] J.-J. Huang, T. Liu, P. L. Dragotti, and T. Stathaki, “SRHRF+: Self-example enhanced single image super-resolution using hierarchical random forests,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on New Trends in Image Restoration and Enhancement, July 2017.

[34] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2015.

[35] ——, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2016.

[36] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in European conference on computer vision. Springer, 2016, pp. 391–407.

[37] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.

[38] J. A. Cadzow, “Signal enhancement-a composite property mapping algorithm,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 1, pp. 49–62, 1988.

[39] M. Kiechle, T. Habigt, S. Hawe, and M. Kleinsteuber, “A bimodal co-sparse analysis model for image processing,” International Journal of Computer Vision, vol. 114, no. 2-3, pp. 233–247, 2015.

[40] J.-J. Huang and P. L. Dragotti, “Learning deep analysis dictionaries-part I: Unstructured dictionary,” submitted to IEEE Transactions on Signal Processing, 2020.

[41] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

[42] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision. Springer, 2014, pp. 184–199.

[43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[44] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems, 2018, pp. 6389–6399.
