Single image deblurring aims to recover a clear image from a single blurred input. Conventional methods formulate the blur process (assuming spatially invariant blur) as the convolution operation between a latent clear image and a blur kernel, and solve this problem based on the maximum a posteriori (MAP) framework. As the problem is ill-posed, the state-of-the-art algorithms typically rely on natural image priors (e.g., gradient (Xu et al., 2013) and dark channel prior (Pan et al., 2016b)) to constrain the solution space.
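The spatially invariant blur model can be illustrated with a small sketch. The example below is not part of any cited method; it only shows a latent image being convolved with a blur kernel. The function name `blur_image` and the zero-padding boundary handling are assumptions made for illustration.

```python
def blur_image(latent, kernel):
    """Convolve a 2D latent image with a 2D blur kernel (same-size output, zero padding)."""
    h, w = len(latent), len(latent[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2  # kernel centre (odd kernel sizes assumed)
    blurred = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # Convolution flips the kernel: read the image at (i - du, j - dv).
                    ii, jj = i - (u - ch), j - (v - cw)
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[u][v] * latent[ii][jj]
            blurred[i][j] = acc
    return blurred


# A single bright pixel blurred by a horizontal 1x3 box kernel.
latent = [[0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 3.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0, 0.0]]
kernel = [[1 / 3, 1 / 3, 1 / 3]]
blurred = blur_image(latent, kernel)
```

The bright pixel is spread over its two horizontal neighbours while the total intensity is preserved, which is why recovering both the latent image and the kernel from the blurred observation alone is ill-posed.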
While existing image priors are effective for deblurring natural images, the underlying assumption may not hold well for images from specific categories, e.g., text, face and low-light conditions. Numerous approaches exploit domain-specific visual information, such as designing intensity (Pan et al., 2017b) priors for text images or detecting light streaks (Hu et al., 2014a) for extremely low-light images. As face images contain fewer textures and edges for estimating blur kernels, Pan et al. (2014) search for similar face exemplars from an external dataset and extract the contour as reference edges. However, reference images may not always exist for a specific input due to the diversity of real-world face images. Furthermore, those methods based on the MAP framework typically entail heavy computational cost due to the iterative optimization process to determine latent images and blur kernels. The long execution time limits the applications on resource-sensitive platforms, e.g., mobile devices.
In this work, we propose an efficient and effective solution to deblur face images via deep convolutional neural networks (CNNs). Since face images are highly structured and composed of similar components, semantic information can provide a strong prior for restoration. We propose to leverage the face semantic labels as global priors and local constraints to train a deep CNN. The proposed model consists of three sub-networks: a coarse deblurring network, a face parsing network, and a fine deblurring network. The coarse deblurring network first predicts a deblurred image from the given input blurred image. The face parsing network then estimates the semantic labels from the coarse deblurred image. Finally, the fine deblurring network takes the blurred image, coarse deblurred image, and semantic labels to restore a clear face image. To encourage the network to restore fine details, we propose an adaptive local structural loss on important face components (e.g., eyes, noses, and mouths). In addition, we impose a perceptual loss (Johnson et al., 2016) and an adversarial loss (Goodfellow et al., 2014) to generate photo-realistic deblurred results. As our method is end-to-end without any blur kernel estimation or post-processing, the execution time is significantly shorter than that of conventional MAP-based approaches.
To handle blurred images caused by unknown blur kernels, we construct a large face blurred image dataset for training and testing. We first synthesize random blur kernels by modeling the camera trajectories (Chakrabarti, 2016; Hradiš et al., 2015). Next, we generate blurred face images using the synthesized blur kernels and face images from the Helen (Le et al., 2012), CMU PIE (Sim et al., 2002), and CelebA (Liu et al., 2015) datasets. We show that the proposed model trained on synthetic images generalizes well to images generated by unseen blur kernels as well as real blurred images. The proposed method reconstructs better facial details and achieves higher accuracy on face detection and recognition than the state-of-the-art face deblurring approaches (Shen et al., 2018; Pan et al., 2014) (see Fig. 1).
In this work, we make the following contributions:
– We propose a deep multi-scale CNN that exploits global semantic priors and local structural constraints for face image deblurring. The proposed local structural loss adaptively adjusts the weights based on the size of each facial component and greatly improves the quantitative and qualitative results.
– We develop a large-scale blurred face image dataset. The training set consists of 130 million blurred images (synthesized from 6,464 face images and 20,000 blur kernels) and the test set has 16,000 blurred images (synthesized from 200 face images and 80 blur kernels). Our dataset can serve as a common benchmark for training and evaluating face image deblurring.
– We demonstrate that the proposed method performs favorably against the state-of-the-art deblurring approaches in terms of restoration quality, face detection, recognition and execution speed.
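The dataset sizes quoted above follow from pairing clear images with blur kernels. The short check below assumes that every clear image is combined with every blur kernel of the same split; the helper name `num_blurred_images` is hypothetical.

```python
def num_blurred_images(num_clear_images, num_blur_kernels):
    """Number of blurred images when every clear image is paired with every kernel."""
    return num_clear_images * num_blur_kernels


train_count = num_blurred_images(6_464, 20_000)
test_count = num_blurred_images(200, 80)
print(train_count)  # 129280000, i.e. roughly 130 million
print(test_count)   # 16000
```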
Fig. 1 Visual comparison on face image deblurring. We exploit the semantic information of face images within an end-to-end deep CNN for deblurring. (a) Ground truth images (b) Blurred images (c) Pan et al. (2014) (d) Shen et al. (2018) (e) Ours.
Our work belongs to the single-image blind image deblurring problem, where the blur kernel is unknown. In this section, we focus our discussion on generic, domain specific, and recent CNN-based image deblurring approaches.
2.1 Generic image deblurring methods
The recent advances in single image blind deblurring can be attributed to the development of effective natural image priors, including sparse gradient prior (Fergus et al., 2006; Levin et al., 2009), normalized sparsity measure (Krishnan et al., 2011), patch prior (Sun et al., 2013a), gradient (Xu et al., 2013), color-line prior (Lai et al., 2015), low-rank prior (Ren et al., 2016a), self-similarity (Michaeli and Irani, 2014), and extreme channel priors (Pan et al., 2016b; Yan et al., 2017). Recently, a number of approaches learn data fitting functions (Pan et al., 2017a) or image priors with Markov random fields (MRFs) (Liu et al., 2018) to recover latent images. By optimizing the image priors within the MAP framework, those approaches implicitly restore strong edges, and therefore, estimate blur kernels and latent sharp images. However, solving complex non-linear priors involves several optimization steps and thus entails high computational loads. Edge-selection based methods (Cho and Lee, 2009; Xu and Jia, 2010) use simple priors (e.g., gradients) with image filters (e.g., shock filter) to explicitly restore or select strong edges. In addition, a number of approaches use reference images as guidance for non-blind (Sun et al., 2014) and blind deblurring (Hacohen et al., 2013). However, the performance of such methods hinges on the similarity of reference images and quality of dense correspondence.
While generic image deblurring methods demonstrate the state-of-the-art performance, face images have different statistical properties than natural scenes. Fewer edges or structure on face images can be extracted for blur kernel estimation. The above-mentioned approaches typically cannot deblur face images well and may generate undesired visual artifacts.
Another line of work proposes various motion blur models to handle non-uniform blur (Hirsch et al., 2011; Whyte et al., 2012) and depth variation (Hu et al., 2014b). In this work, we focus on face images degraded by uniform motion blur. Our method can also be extended to handle non-uniform blur by synthesizing training data with the non-uniform blur model.
2.2 Domain specific image deblurring methods
Several domain-specific image deblurring approaches have been developed to handle images from different categories. As text images usually contain nearly uniform intensity, Pan et al. (2017b) introduce $L_0$-regularized priors on both intensity and image gradients for deblurring text images. To handle extreme cases such as low-light images, Hu et al. (2014a) detect the light streaks in images for estimating blur kernels. Anwar et al. (2015) propose a frequency-domain class-specific prior to restore the band-pass frequency components. Several recent approaches propose outlier detection methods (Pan et al., 2016a) or robust loss functions (Dong et al., 2017) to handle images with non-Gaussian noise.
As face images contain fewer textures and edges, existing algorithms based on implicit or explicit edge restoration are less effective. Pan et al. (2014) search for similar images from a face dataset and extract reference exemplar contours for blur kernel estimation. However, this approach requires manual annotations of the facial contours and involves computationally expensive optimization within the MAP framework. In contrast, we train an end-to-end deep CNN to bypass the blur kernel estimation step, without requiring any reference images or manual annotations for face deblurring.
2.3 CNN-based image deblurring methods
Deep CNNs have been adopted for several image restoration tasks, such as denoising (Mao et al., 2016), JPEG deblocking (Dong et al., 2015), dehazing (Ren et al., 2016b) and super-resolution (Kim et al., 2016; Lai et al., 2017). Several methods apply deep CNNs for image deblurring in different aspects, including non-blind deconvolution (Schuler et al., 2013; Xu et al., 2014; Zhang et al., 2017), blur kernel estimation (Sun et al., 2015; Schuler et al., 2016; Chakrabarti, 2016), and dynamic scene deblurring (Nah et al., 2017; Tao et al., 2018). However, these methods do not always perform as well as the state-of-the-art MAP-based approaches, especially in the presence of large motion. Several approaches embed deep CNNs into the conventional MAP-based framework by learning discriminative image priors (Li et al., 2018) or predicting sharp edges (Xu et al., 2018) to achieve the state-of-the-art performance. More recently, Nimisha et al. (2017) and Kupyn et al. (2018) train generative adversarial networks for blind motion deblurring.
Fig. 2 Overview of the proposed model. The state-of-the-art method (Shen et al., 2018) extracts the semantic labels from a blurred image, while we obtain the semantic labels from a coarse deblurred image. The coarse deblurring network reduces the motion blur from the input image and leads to more accurate face parsing results.
A number of methods train end-to-end networks to handle class-specific images, e.g., texts (Hradiš et al., 2015) and faces (Jin et al., 2018; Chrysos et al., 2019). Xu et al. (2017) train generative adversarial networks to jointly deblur and super-resolve low-resolution blurred face and text images, which are typically degraded by Gaussian-like blur kernels. A few face deblurring methods (Jin et al., 2018; Chrysos et al., 2019) based on generic CNNs have recently been developed. Although there are some implementation differences in network architectures and loss functions (e.g., the model of Jin et al. (2018) is lightweight), these methods do not explore face-related prior information to help the deblurring process. In this work, we focus on deblurring face images affected by complex motion blur. We exploit global and local semantic cues as well as the perceptual (Johnson et al., 2016) and adversarial (Goodfellow et al., 2014) losses to restore photo-realistic face images with fine details.
In this section, we first give an overview of the proposed face deblurring method. We then describe the design methodology of the network architecture, loss functions, and implementation details.
3.1 Overview
We aim to utilize the face semantic cues to deblur face images. In our preliminary work (Shen et al., 2018), we first apply a face parsing network to extract semantic labels from the input blurred image and then adopt a deblurring network for restoration. We also propose a local structural loss to enforce additional weights on important facial components to recover fine details. However, the labels extracted from the blurred images may be erroneous due to severe motion blur. In this work, we make the following improvements:
– We first construct a coarse deblurring network to reduce the blur in the input image. The face parsing network then extracts semantic labels from the coarse deblurred image. Finally, the fine deblurring network restores a clear face image from the given blurred input image, coarse deblurred image, and corresponding semantic label maps.
– Instead of using a fixed weight for all key components, we propose an adaptive local structural loss which adjusts the weight based on the size of each facial component and restores more fine details.
Fig. 2 shows the differences between the method of Shen et al. (2018) and the proposed model.
3.2 Network architecture
Given a blurred face image $x \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the height and width of the image, our goal is to recover a clear and sharp face image $y$ that is as similar as possible to the ground truth image $\hat{y}$. To this end, we train an end-to-end deep CNN to deblur the face images efficiently. The proposed face deblurring model consists of three sub-networks: a coarse deblurring network $G_c$, a face parsing network $P$, and a fine deblurring network $G_f$.
Coarse deblurring network. To reduce the influence of motion blur on face parsing, we first use the network $G_c$ to obtain a coarse deblurred image $y_c$:
$$y_c = G_c(x). \tag{1}$$
We use a multi-scale network similar to the model of Nah et al. (2017), but with several differences. First, as face images typically have smaller spatial resolutions (e.g., 128 or less), we use only 2 scales instead of the 3 scales used for natural images by Nah et al. (2017). Second, we use fewer ResBlocks (reduced from 19 to 6) and a larger filter size at the first convolutional layer to increase the receptive field. The first scale takes as input the downsampled blurred image (3 channels) and generates a deblurred image at the same reduced resolution. The input to the second scale contains the blurred image $x$ (3 channels) and the upsampled deblurred image from the first scale (3 channels), obtained with a bicubic upsampling operator. The output image from the second scale is the coarse deblurred result $y_c$.
Face parsing network. We use an encoder-decoder architecture with skip connections as our face parsing network. The face parsing network $P$ takes the coarse deblurred image $y_c$ as input and generates the probability maps of face semantic labels $p$:
$$p = P(y_c), \quad p \in \mathbb{R}^{H \times W \times K}, \tag{2}$$
where $K$ is the number of semantic classes. The semantic probabilities encode the essential appearance information and approximate locations of the facial components (e.g., eyes, noses and mouths) and serve as a strong global prior for reconstructing the deblurred face image. We extract $K = 11$ semantic labels (see Table 3) for each input image.
Fine deblurring network. The fine deblurring network $G_f$ has a similar architecture to the coarse deblurring network. In addition, it takes as input the blurred image $x$, the coarse deblurred image $y_c$, and the semantic probability maps $p$ to recover a clear face image $y$:
$$y = G_f(x, y_c, p). \tag{3}$$
Our fine deblurring network also has a similar two-scale structure to the coarse deblurring network. The input to the first scale includes the downsampled blurred image (3 channels), the downsampled coarse deblurred image (3 channels), and the downsampled semantic probability maps (11 channels), resulting in a 17-channel input feature. The input to the second scale includes the blurred image $x$ (3 channels), the coarse deblurred image $y_c$ (3 channels), the upsampled deblurred image from the first scale (3 channels), and the semantic probability maps $p$ (11 channels), resulting in a 20-channel input feature. The output image from the second scale of the fine deblurring network is the final deblurred result $y$. Fig. 3 shows an overview of our face parsing and deblurring networks.
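The channel counts of the two scales can be verified with a short bookkeeping sketch. The function name and constants below are illustrative only; the sketch assumes the inputs are simply concatenated along the channel dimension, as the description suggests.

```python
RGB_CHANNELS = 3
LABEL_CHANNELS = 11  # semantic probability maps, as stated in the text


def fine_network_input_channels(scale):
    """Return the number of input channels at a given scale of the fine deblurring network."""
    # Both scales receive the blurred image, the coarse deblurred image and the label maps.
    channels = RGB_CHANNELS + RGB_CHANNELS + LABEL_CHANNELS
    if scale == 2:
        # The second scale additionally receives the upsampled output of the first scale.
        channels += RGB_CHANNELS
    elif scale != 1:
        raise ValueError("this sketch only models scales 1 and 2")
    return channels


print(fine_network_input_channels(1))  # 17
print(fine_network_input_channels(2))  # 20
```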
3.3 Loss functions
We train the parsing network using a cross-entropy loss and optimize the deblurring networks with a pixel-wise content loss and the proposed adaptive local structural loss. As pixel-wise $\ell_1$ or $\ell_2$ loss functions typically lead to overly-smooth results, we further introduce a perceptual loss (Johnson et al., 2016) and an adversarial loss (Goodfellow et al., 2014) to optimize our deblurring network and generate photo-realistic deblurred results.
Fig. 3 Architecture of the proposed model. The face parsing network is an encoder-decoder architecture with skip connections from the encoder to the decoder. The fine deblurring network has two scales. The first scale generates a deblurred image at a reduced spatial resolution, and the second scale generates a full-resolution deblurred image. Each scale of the deblurring network receives the supervision from the pixel-wise content loss and local structural loss. In addition, we impose the perceptual and adversarial losses at the output of the second scale. The coarse deblurring network has a similar architecture to the fine deblurring network but does not take the semantic labels as input and only receives supervision from the content loss.
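Parsing loss. We adopt a multi-class cross-entropy loss function to optimize the face parsing network:
$$\mathcal{L}_p = -\sum_{k=1}^{K} \hat{p}_k \log(p_k), \tag{4}$$
where $p_k$ is the predicted probability and $\hat{p}_k$ is the ground truth semantic label for the $k$-th class.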
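As a minimal sketch of a multi-class cross-entropy over a label map, the function below assumes one-hot ground truth given as class indices and averages over pixels. The averaging, the `eps` clamp, and the name `parsing_loss` are assumptions for illustration rather than details taken from the text.

```python
import math


def parsing_loss(pred_probs, gt_labels, eps=1e-12):
    """Mean multi-class cross-entropy over pixels.

    pred_probs: list of per-pixel probability lists (each sums to 1 over classes).
    gt_labels:  list of per-pixel ground-truth class indices.
    """
    total = 0.0
    for probs, label in zip(pred_probs, gt_labels):
        # With a one-hot target, only the probability of the true class contributes.
        total += -math.log(max(probs[label], eps))
    return total / len(gt_labels)


# Two pixels, two classes, maximally uncertain predictions.
uncertain = parsing_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```

A perfectly confident and correct prediction gives a loss of zero, while a uniform prediction over two classes gives log 2 per pixel.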
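Content loss. We adopt the pixel-wise robust $\ell_1$ function as the content loss of the coarse and fine deblurring networks:
$$\mathcal{L}_c = \left\| y - \hat{y} \right\|_1, \tag{5}$$
where $y$ is the output of the deblurring network and $\hat{y}$ is the ground truth clear image.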
Adaptive local structural loss. While the content loss (5) enforces a holistic supervision from the ground truth clear image, the key components (e.g., eyes, lips, and mouths) on faces may be easily ignored as they are typically thin and small. Solely minimizing the content loss on the whole face image cannot guarantee the restoration of fine details. Thus, we propose to impose a local structural loss on facial key components:
$$\mathcal{L}_s = \sum_{k} w_k \left\| M_k(p) \odot y - M_k(p) \odot \hat{y} \right\|_1, \tag{6}$$
where $y$ is the deblurred output, $\hat{y}$ is the ground truth clear image, $w_k$ is the weight of each component, and $M_k(p)$ denotes the structural mask of the $k$-th component (extracted from the semantic label $p$). We apply the local structural losses on eight important components, including left eye, right eye, left eyebrow, right eyebrow, nose, upper lip, lower lip, and teeth, to enhance the local details. We do not apply the local structural loss on textureless regions, such as hair and skin. The local structural losses encourage the deblurring network to restore more details with fewer artifacts on the face images.
In our preliminary work (Shen et al., 2018), we adopt an equal weight for all the selected components. However, tiny components (e.g., eyes) may not be well reconstructed when optimizing the network. In this work, we propose an adaptive weighting mechanism in which the weight of each component is determined by a constant $c$ and the size of that component. The adaptive local structural loss enforces larger weights on small components and thus helps recover facial details.
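The effect of size-dependent weights can be sketched as follows. The exact weighting formula is not reproduced in this text, so the sketch assumes a weight inversely proportional to the component size, $w_k = c / S_k$, and an absolute-difference penalty inside each mask; both choices, and all function names, are assumptions for illustration.

```python
def adaptive_weights(masks, c=1.0):
    """Assumed weighting w_k = c / S_k, where S_k is the pixel count of mask k."""
    weights = []
    for mask in masks:
        size = sum(mask)
        weights.append(c / size if size > 0 else 0.0)
    return weights


def local_structural_loss(pred, target, masks, weights):
    """Weighted sum of masked absolute differences over the key components."""
    loss = 0.0
    for mask, w in zip(masks, weights):
        masked_l1 = sum(m * abs(p - t) for m, p, t in zip(mask, pred, target))
        loss += w * masked_l1
    return loss


# Flattened 6-pixel toy image with one small and one large component.
pred = [0.0] * 6
target = [1.0] * 6
masks = [[1, 0, 0, 0, 0, 0],   # small component (1 pixel)
         [0, 1, 1, 1, 1, 0]]   # large component (4 pixels)

equal = local_structural_loss(pred, target, masks, [1.0, 1.0])
adaptive = local_structural_loss(pred, target, masks, adaptive_weights(masks))
```

With equal weights the large component contributes four times as much as the small one; under the assumed inverse-size weighting both components contribute equally, so the small one is no longer dominated.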
Perceptual loss. The perceptual loss has been adopted in style transfer (Gatys et al., 2015; Johnson et al., 2016), image super-resolution (Ledig et al., 2017) and image synthesis (Chen and Koltun, 2017; Wang et al., 2018b). The perceptual loss aims to measure the similarity in the high dimensional feature space of a pre-trained classification network (e.g., VGG16 (Simonyan and Zisserman, 2015)). Given the input image $x$, we denote $\phi_l(x)$ as the activation at the $l$-th layer of the loss network $\phi$. The perceptual loss is then defined as:
$$\mathcal{L}_{per} = \sum_{l} \left\| \phi_l(y) - \phi_l(\hat{y}) \right\|_2^2, \tag{8}$$
where $y$ is the deblurred output and $\hat{y}$ is the ground truth clear image.
We compute the perceptual loss on the pool2 and pool5 layers of the pre-trained VGG-Face (Parkhi et al., 2015).
Adversarial loss. The adversarial training framework has been effectively applied to synthesize realistic images (Goodfellow et al., 2014; Ledig et al., 2017; Nah et al., 2017). We treat our fine deblurring network as the generator and construct a discriminator based on the DCGAN (Radford et al., 2016) model. The goal of the discriminator $D$ is to distinguish the real image from the output of the generator. The generator $G$ aims to generate images as real as possible to fool the discriminator. The adversarial training is formulated as solving the following min-max problem:
$$\min_G \max_D \; \mathbb{E}\left[\log D(\hat{y})\right] + \mathbb{E}\left[\log\left(1 - D(G(x))\right)\right], \tag{9}$$
where $\hat{y}$ is a real clear image and $x$ is a blurred input.
When updating the generator, the adversarial loss is:
$$\mathcal{L}_{adv} = -\log D(G(x)). \tag{10}$$
Our discriminator takes an image as input and has 6 strided convolutional layers followed by the ReLU activation function. In the last layer, we use the sigmoid function to output a single scalar as the probability of being a real image. Similar to existing image super-resolution (Ledig et al., 2017) and motion deblurring (Kupyn et al., 2018; Nah et al., 2017) methods, the generator of the proposed model does not take a noise vector as input.
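The two sides of the adversarial game can be sketched on scalar discriminator outputs. The example assumes the standard binary cross-entropy objective for the discriminator and the commonly used non-saturating generator loss; the function names and the `eps` clamp are illustrative and not taken from the text.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: D should output 1 for real and 0 for generated images."""
    return -(math.log(max(d_real, eps)) + math.log(max(1.0 - d_fake, eps)))


def generator_adversarial_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: small when D believes the generated image is real."""
    return -math.log(max(d_fake, eps))


# d_real and d_fake are sigmoid outputs of the discriminator in (0, 1).
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)
fooled = generator_adversarial_loss(d_fake=0.9)
caught = generator_adversarial_loss(d_fake=0.1)
```

Since the generator receives no noise vector, `d_fake` depends only on the blurred input and the current generator weights.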
Overall loss function. The overall loss function for training our face deblurring model is:
$$\mathcal{L} = \mathcal{L}_c + \lambda_s \mathcal{L}_s + \lambda_p \mathcal{L}_p + \lambda_{per} \mathcal{L}_{per} + \lambda_{adv} \mathcal{L}_{adv}, \tag{11}$$
where $\lambda_s$, $\lambda_p$, $\lambda_{per}$, and $\lambda_{adv}$ are the weights to balance the local structural loss, parsing loss, perceptual loss, and adversarial loss, respectively. In this work, we set these weights empirically and use the constant $c = 1$ in (7). We adopt the content and local structural losses at all scales of the deblurring network, while applying the perceptual and adversarial losses only to the final output image, i.e., the output of the second scale of the fine deblurring network.
3.4 Training strategy
As our model consists of three sub-networks, it is difficult to jointly optimize the whole model simultaneously. We adopt the following progressive training strategy:
1. We first train the coarse deblurring network using the content loss (5) on the coarse deblurred image for 200,000 iterations.
2. We then fix the coarse deblurring network and train the face parsing network P using the parsing loss (4) for 60,000 iterations.
3. Next, we fix both the coarse deblurring network and P and train the fine deblurring network using the content loss (5), local structural loss (6), perceptual loss (8) and adversarial loss (9) for 200,000 iterations.
4. Finally, we jointly optimize all three sub-networks by minimizing the overall loss (11) for 100,000 iterations.
We demonstrate in Section 4 that such a progressive training strategy achieves better performance than jointly training the whole model from scratch.
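The four training stages above can be written down as a small schedule table. The data structure and the names `Stage`, `SCHEDULE`, and `not_updated` are hypothetical; only the stage order, the losses, and the iteration counts come from the text.

```python
from dataclasses import dataclass

SUB_NETWORKS = ("coarse", "parsing", "fine")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple   # sub-networks whose weights are updated in this stage
    losses: tuple      # losses minimised in this stage
    iterations: int


SCHEDULE = (
    Stage("coarse", ("coarse",), ("content",), 200_000),
    Stage("parsing", ("parsing",), ("parsing",), 60_000),
    Stage("fine", ("fine",),
          ("content", "local_structural", "perceptual", "adversarial"), 200_000),
    Stage("joint", SUB_NETWORKS, ("overall",), 100_000),
)


def not_updated(stage):
    """Sub-networks whose weights are left unchanged during a stage."""
    return tuple(n for n in SUB_NETWORKS if n not in stage.trainable)


total_iterations = sum(stage.iterations for stage in SCHEDULE)
print(total_iterations)  # 560000
```

Only the last stage updates all three sub-networks together, which is the joint fine-tuning step.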
3.5 Implementation details
Both the coarse and fine deblurring networks have two scales, where each scale has 6 ResBlocks (He et al., 2016), each including two convolutional layers and one activation layer, and 18 convolutional layers. The first convolutional layer at each scale uses a larger kernel than all other convolutional layers, which share a smaller kernel size and have 64 channels. The upsampling layer uses a transposed convolutional layer to upsample the image. We use the ReLU as the activation function and do not use any normalization layer (e.g., batch normalization).
We implement our network using the MatConvNet toolbox (Vedaldi and Lenc, 2015). We use a batch size of 16 and separate learning rates for the parsing network and for the coarse and fine deblurring networks. During the training process, we apply the following data augmentation: (1) random scaling, (2) random horizontal and vertical shifting within 12 pixels, and (3) random rotation. The whole training process takes about 5 days on an NVIDIA Titan X GPU card.
3.6 Face deblurring datasets
Table 1 Summary of our face deblurring dataset. We collect clear face images from the Helen (Le et al., 2012), CMU PIE (Sim et al., 2002), and CelebA (Liu et al., 2015) datasets and synthesize blur kernels for generating blurred face images.
We collect clear face images from the Helen (Le et al., 2012), CMU PIE (Sim et al., 2002), and CelebA (Liu et al., 2015) datasets. We align all the face images by first detecting the facial landmarks using the method of Sun et al. (2013b) and then warping the images based on the aligned landmarks (Kae et al., 2013). The motion blur kernels are synthesized by modeling random 3D camera trajectories (Boracchi and Foi, 2012). We generate blur kernels with 8 different sizes. By convolving the clear images with blur kernels and adding Gaussian noise, we obtain 130 million blurred images for training and 16,000 blurred images for testing. Table 1 summarizes the number of clear face images, motion blur kernels, and synthesized blurred images in the training and testing sets. We note that the 20,000 blur kernels used to generate training images are different from the 80 blur kernels used in the test set. Both the clear face images and the blur kernels are disjoint in the training and testing sets.
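The kernel synthesis and noise steps can be sketched in a strongly simplified form. The example below rasterises a planar random walk into a normalised kernel, which is a stand-in for, and not an implementation of, the 3D camera trajectory model of Boracchi and Foi (2012). The kernel size of 13, the number of steps, and all function names are hypothetical.

```python
import random


def random_walk_kernel(size, steps, rng):
    """Rasterise a 2D random walk into a normalised size x size blur kernel."""
    kernel = [[0.0] * size for _ in range(size)]
    x = y = size // 2  # start at the kernel centre
    for _ in range(steps):
        kernel[y][x] += 1.0
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        # Clamp so the trajectory stays inside the kernel support.
        x = min(max(x + dx, 0), size - 1)
        y = min(max(y + dy, 0), size - 1)
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
kernel = random_walk_kernel(size=13, steps=40, rng=rng)
```

Normalising the kernel to sum to one keeps the overall image brightness unchanged after convolution; the noise is added to the blurred image afterwards.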
In this section, we first demonstrate the effectiveness of using semantic parsing labels for face image deblurring. We then conduct ablation studies to analyze the contribution of each sub-network and loss function.
4.1 Effect of semantic parsing
Our key idea is to utilize the face semantic labels as prior information to facilitate the face deblurring. We first validate the idea by using the ground truth semantic labels as an additional input to our deblurring network. Since only the Helen dataset contains ground truth face labels, we first train a face parsing network using the clear images and ground truth from the Helen dataset. We then use this face parsing network to generate labels for the clear images in the CMU PIE and CelebA datasets, which are treated as the pseudo ground-truth labels to train the proposed face parsing network for deblurring.
Table 2 Effect of semantic labels on face image deblurring. We evaluate the average labeling accuracy (i.e., F-score), PSNR and SSIM of the deblurred images on the Helen dataset.
Fig. 4 Deblurred results using different semantic labels. (a) Blurred images (b) Baseline (w/o semantic labels) (c) Using ground truth semantic labels (d) Using labels from blurred images (e) Using labels from coarse deblurred images
We train a baseline model G using the coarse deblurring network, which does not take any semantic information as input and does not adopt the local structural loss. Then, we concatenate the ground truth semantic labels with the blurred image as input to the baseline model. We evaluate the PSNR and SSIM on the Helen test set and present the results in Table 2. The model with prior knowledge from the ground truth labels significantly outperforms the baseline model, which demonstrates the effect of semantic labels on deblurring face images.
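For reference, the PSNR used in these comparisons can be computed as in the sketch below. It assumes 8-bit intensities (peak value 255) and images stored as nested lists; the function name `psnr` and these conventions are illustrative, and the evaluation code used for the reported numbers may differ.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_errors = [(r - d) ** 2
                      for ref_row, res_row in zip(reference, restored)
                      for r, d in zip(ref_row, res_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by one intensity level, so the mean squared error is 1.
value = psnr([[10, 20], [30, 40]], [[11, 21], [29, 41]])
```

Higher values indicate a restored image closer to the ground truth; a mean squared error of 1 on 8-bit images corresponds to about 48.13 dB.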
In Shen et al. (2018), the semantic labels are extracted from the blurred images. While the parsing network P is fine-tuned on blurred images for performance gain, the semantic labels of some small components (e.g., eyebrows, lips, and teeth) may not be accurate enough when the input image suffers from large motion blur. In the proposed method, we first apply a coarse deblurring network to reduce the motion blur and recover a rough structure of the input face image. We then fine-tune the parsing network P on the coarse deblurred images and train the fine deblurring network using the labels extracted from the coarse deblurred images. Table 2 shows the performance difference between the method of Shen et al. (2018) and the proposed model. The proposed method achieves higher label accuracy and obtains better deblurring results. We note that we only use the content loss (5) to train the models in Table 2. We also fix the coarse deblurring network and parsing network when training the fine deblurring network to rule out the influence of model parameters.
Fig. 4 shows the deblurred images by the models listed in Table 2. Table 3 shows the parsing accuracy (in terms of the F-score) of each component, and Fig. 5 visualizes the parsing results. It is clear that more accurate semantic labels provide stronger priors to achieve better deblurring results.
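The per-component F-score can be sketched as the harmonic mean of pixel-wise precision and recall for one label class. The flattened label-map representation and the function name `f_score` are assumptions for illustration.

```python
def f_score(pred_labels, gt_labels, component):
    """F-score of one label class over flattened predicted and ground-truth label maps."""
    pairs = list(zip(pred_labels, gt_labels))
    tp = sum(1 for p, g in pairs if p == component and g == component)
    fp = sum(1 for p, g in pairs if p == component and g != component)
    fn = sum(1 for p, g in pairs if p != component and g == component)
    if tp == 0:
        return 0.0  # no correctly labelled pixel for this component
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Label 1 marks a small component; one of its two pixels is found, one pixel is a false alarm.
pred = [1, 1, 0, 0]
gt = [1, 0, 1, 0]
score = f_score(pred, gt, component=1)
```

Small components such as lips or teeth cover few pixels, so a handful of mislabelled pixels under heavy blur lowers their F-score considerably.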
Table 3 Performance of face parsing network. We measure the F-score for each facial component on the Helen dataset.
Fig. 5 Labeling results of face parsing network. (a) Ground truth clear images (b) Input blurred images (c) Ground truth semantic labels (d) Semantic labels from blurred images (e) Semantic labels from coarse deblurred images
4.2 Ablation study
In this section, we analyze the contribution of loss functions, training strategy, and several design choices of the proposed model, including the kernel size, multi-stage deblurring, and effective range of hyper-parameters.
Local structural loss. Shen et al. (2018) adopt an equal weight in the local structural loss for all the key components, while we apply adaptive weights based on the size of each component. Here we train our fine deblurring network (freezing the coarse deblurring network and parsing network) using the content loss as well as local structural loss, and present the results in Table 4. We note that the model trained solely on the content loss considers all the pixels, including hair, skin, and background, equally. The equal-weight local structural loss significantly improves the performance by encouraging the network to enhance details on eight key components, including left eye, right eye, left eyebrow, right eyebrow, nose, upper lip, lower lip, and teeth. The proposed adaptive local structural loss further adjusts the weights by considering the size of key components to prevent the model from sacrificing some tiny components, e.g., lips and teeth. Fig. 6 shows the deblurred results by the models listed in Table 4.
Table 4 Analysis on loss functions. We fix the parsing network and coarse deblurring network and train the fine deblurring network using the content loss and local structural loss
Fig. 6 Effects of loss functions. (a) Ground truth images (b) Blurred images (c) Content loss only (d) + equal-weight local structural loss (e) + adaptive local structural loss
Table 5 Analysis on training strategy. We progressively train the coarse deblurring network, the face parsing network P, and the fine deblurring network. Finally, we jointly fine-tune all three sub-networks. The proposed training strategy achieves better performance than training the whole model from scratch.
Training strategy. Since the proposed model consists of three sub-networks, the cascade of all sub-networks becomes a very deep model. As such, it is not easy to train such a deep model from scratch. The last row of Table 5 shows that the model trained from scratch does not perform well. Thus, we train our model stage-by-stage using the training strategy described in Section 3.4. We show the evaluation results of each stage in Table 5. With the proposed training strategy, our model gradually achieves better performance. Fig. 7 shows the deblurred results of the models listed in Table 5. The model using the progressive training strategy recovers better content and more facial details than the model trained from scratch.
Perceptual and adversarial losses. We compare the deblurring results with and without using the perceptual and adversarial losses in Table 6 and Fig. 8. The perceptual loss encourages the images to match the high-level activations of the VGG-Face network and makes the output look more photo-realistic. The adversarial loss further introduces more details on hairs and beards, which cannot be reconstructed well using the pixel-wise $\ell_1$ or $\ell_2$ loss. As shown in Table 6, both the perceptual and adversarial losses improve the average PSNR and SSIM on both test sets as more faithful details are recovered.
Fig. 7 Effects of training strategy. (a) Ground truth images (b) Blurred images (c) (fine-tuned) (e) (from scratch)
Table 6 Analysis on perceptual and adversarial losses. Both perceptual and adversarial losses further improve the performance by restoring more faithful details.
Fig. 8 Effects of perceptual and adversarial functions. (a) Ground truth images (b) Blurred images (c) Ours w/o w/ (e) Ours w/
Kernel size. We use a larger kernel size at the first convolutional layer of our coarse and fine deblurring networks. Here we evaluate the performance of the proposed model with different kernel sizes in Table 7. Consistent performance gains are achieved as the filter size at this layer increases, up to the size we adopt. Therefore, we use this larger filter at the first convolutional layer to give the whole model a larger receptive field. In addition, as the first convolutional layer only contains 64 feature channels, using a larger filter size does not significantly increase the number of model parameters.
Table 7 Analysis on kernel size. We evaluate the model performance by changing the kernel size at the first convolutional layer.
Hyper-parameters. We analyze the effective range of the weights of the local structural, parsing, perceptual, and adversarial losses by changing one of the hyper-parameters and fixing the others. In Fig. 9(a), we show that the local structural loss effectively improves the PSNR, but the gain saturates as its weight increases. As shown in Fig. 9(b), without the parsing loss (i.e., with its weight set to zero), the face parsing network cannot learn meaningful semantic labels as the facial priors. However, a larger parsing weight does not further improve the face restoration performance, as the gradient of the parsing loss is back-propagated to the coarse deblurring network, which may introduce additional artifacts. Therefore, a moderate parsing weight achieves a good balance for the whole model. In Fig. 9(c), we show that the proposed model obtains plausible results when the perceptual weight is kept small. Using a larger weight for the perceptual loss introduces more checkerboard artifacts and harms the restoration performance. Finally, in Fig. 9(d), we show that using a small weight for the adversarial loss does not affect the PSNR much, while the model generates more facial details that improve visual quality. When the adversarial weight increases, the model generates noise-like artifacts, resulting in a performance drop. We therefore choose a small adversarial weight.
Multi-stage deblurring. Due to our architecture design, we are able to extend the proposed model by cascading multiple fine deblurring networks. Here we construct our model with one coarse deblurring network, one face parsing network, and N fine deblurring networks. Each fine deblurring network takes as input the blurred image, the deblurred image from the preceding stage, and the semantic labels from the face parsing network. We compare the performance, model parameters, and execution time in Table 8. The performance of our model saturates at N = 2. When using three fine deblurring networks, the model only slightly improves the performance but uses 160% more parameters and runs slower than the model with N = 1.
Fig. 9 Effective range of hyper-parameters. We plot the average PSNR on both the CelebA and Helen datasets.
Table 8 Multi-stage deblurring. We apply the fine deblurring network multiple times and compare the restoration performance, model parameters, and execution time.
In the last two rows of Table 8, we show the performance of the proposed model when the weights of the fine deblurring network are shared across stages. The experimental results show that the models with shared weights do not perform well. As the fine deblurring network is already a deep sub-network, sharing the weights of a large sub-module is not guaranteed to improve the performance. Instead, sharing the weights of a single convolutional layer or a small block (e.g., a residual block) might be a more reasonable way to design a recurrent structure. As our goal is to utilize the semantic labels for face deblurring instead of exploring a better network architecture, we leave this issue as future work. Overall, the proposed model with a single fine deblurring network already achieves state-of-the-art performance.
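The data flow of the multi-stage variant can be sketched as follows. This is only an illustration of the cascade described above: the three "networks" are toy stand-in functions with hypothetical names, not the trained sub-networks, and the function signature is an assumption.

```python
import numpy as np


def cascade_deblur(blurred, coarse_net, parsing_net, fine_nets):
    """Run one coarse stage, one parsing stage, and N fine stages in sequence."""
    coarse = coarse_net(blurred)       # coarse deblurred image
    labels = parsing_net(coarse)       # semantic labels from the coarse result
    current = coarse
    for fine_net in fine_nets:         # each stage refines the previous output
        current = fine_net(blurred, current, labels)
    return current


# Toy stand-ins (hypothetical) so that the sketch runs end to end.
def toy_coarse(image):
    return image


def toy_parsing(image):
    return np.zeros(image.shape[:2], dtype=np.int64)


def toy_fine(blurred, previous, labels):
    return 0.5 * (blurred + previous)


blurred = np.ones((4, 4, 3))
separate = cascade_deblur(blurred, toy_coarse, toy_parsing, [toy_fine, toy_fine])
shared = cascade_deblur(blurred, toy_coarse, toy_parsing, [toy_fine] * 3)
```

Passing the same callable several times corresponds to the shared-weight setting, while a list of distinct networks corresponds to independent stages whose parameter count grows with N.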
In this section, we present evaluations against the state-of-the-art deblurring approaches in terms of the restoration quality, face detection, face recognition, and execution time. We also provide visual comparisons on synthetic datasets and real blurred images. Finally, we discuss the limitation and failure cases of the proposed method.
5.1 Restoration quality
We compare the proposed method with the state-of-the-art deblurring algorithms, including MAP-based methods (Cho and Lee, 2009; Krishnan et al., 2011; Shan et al., 2008; Xu et al., 2013; Zhong et al., 2013; Pan et al., 2014, 2017a; Li et al., 2018) and CNN-based methods (Nah et al., 2017; Tao et al., 2018; Kupyn et al., 2018; Jin et al., 2018; Shen et al., 2018). We evaluate all the algorithms on both the Helen and CelebA test sets. Table 9 presents the average PSNR and SSIM for different sizes of blur kernels, and Table 10 shows the average and the worst PSNR/SSIM on the entire datasets for each method. We note that the optimization-based methods (Shan et al., 2008; Cho and Lee, 2009; Krishnan et al., 2011; Xu et al., 2013; Zhong et al., 2013; Pan et al., 2014, 2017a) may generate severe visual artifacts when the blur kernel is not estimated well, and therefore achieve significantly lower PSNR/SSIM values. The proposed method performs favorably against existing deblurring approaches and our preliminary method (Shen et al., 2018) on both datasets.
We show the results on the Helen dataset in Fig. 10 and on the CelebA dataset in Fig. 11. Conventional MAP-based approaches (Cho and Lee, 2009; Krishnan et al., 2011; Shan et al., 2008; Xu et al., 2013; Zhong et al., 2013; Pan et al., 2017a) do not estimate blur kernels well and therefore generate more ringing artifacts. The face deblurring approach (Pan et al., 2014) is not robust to noise and the performance depends heavily on the similarity of the reference image. There are several ringing artifacts in the deblurred images by Pan et al. (2014). The method of Li et al. (2018) generates sharp deblurred images, but the faces do not look realistic. The CNN-based methods (Nah et al., 2017; Kupyn et al., 2018; Tao et al., 2018; Jin et al., 2018) do not consider the face semantic information and thus cannot effectively reduce the motion blur.
Both the method by Shen et al. (2018) and the proposed model obtain visually pleasing results. However, the method by Shen et al. (2018) is not robust to errors in the semantic labels (which are predicted from blurred images) and is less effective in restoring facial details (e.g., the mouth in the first and second rows of Fig. 11). In contrast, the proposed method extracts more accurate semantic priors and restores better facial structures and details (e.g., the eyes in the second and third rows of Fig. 10).
In Fig. 12, we show the deblurring results from images with specific attributes, such as occlusion, mustaches, saturation, and people with different skin colors. As our test set does not contain images with significant saturation, we adjust the intensity of the blurred images (rows 5 and 6 of Fig. 12) by multiplying the Y-channel by . The proposed method can still recover more facial details than existing approaches from such an input. Overall, our method performs well in real-world scenarios.
Table 9 Quantitative comparison with the state-of-the-art methods. We compute the average PSNR and SSIM on the Helen and CelebA test sets. Each dataset has 8,000 blurred images synthesized from 100 clear face images and 80 blur kernels (10 blur kernels for each size). The red and blue texts indicate the best and second best performance.
Table 10 Quantitative comparison with the state-of-the-art methods. We compute the average PSNR and SSIM on the Helen and CelebA test sets. The red and blue texts indicate the best and second best performance.
5.2 Face recognition
We also demonstrate the performance of the proposed method by evaluating the face identity distance, face detection, and recognition accuracy.
Identity distance. We use FaceNet (Schroff et al., 2015) to extract face features and compute the identity distance with the L2 loss and cosine loss (Wang et al., 2018a) between the ground truth image and the deblurred image. Fig. 13 shows that the deblurred images from the proposed method have the lowest identity distance on both measurements, which demonstrates that the proposed method preserves the face identity well.
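As an illustration of the two measurements, the sketch below computes an L2 distance and a cosine distance between two feature vectors. It assumes the features have already been extracted by a face recognition network, and it uses the plain form 1 - cos(a, b) as a simplified stand-in for the cosine loss of Wang et al. (2018a). The function names are hypothetical.

```python
import numpy as np


def l2_distance(feat_a, feat_b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(feat_a - feat_b))


def cosine_distance(feat_a, feat_b):
    """One minus the cosine similarity of two feature vectors."""
    cos = np.dot(feat_a, feat_b) / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return float(1.0 - cos)


# Toy 2-D "features"; real embeddings would come from the feature extractor.
ground_truth = np.array([1.0, 0.0])
deblurred = np.array([0.0, 1.0])
print(l2_distance(ground_truth, deblurred), cosine_distance(ground_truth, deblurred))
```

Both values are zero for identical features and grow as the deblurred face drifts away from the ground-truth identity.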
Face detection. We use the OpenFace toolbox (Amos et al., 2016) to detect the face in each image of the CelebA test set. Table 11 shows the face detection success rate for the blurred images and the state-of-the-art deblurring approaches. The clear images have a success rate of 96%, while the success rate on blurred images drops to 77.4% due to motion blur. The deblurred images from some of the evaluated methods have a lower success rate as the images contain severe ringing artifacts. In contrast, the proposed method achieves a 95.3% success rate, which is close to the upper bound set by the clear images.
Fig. 10 Visual comparison on Helen dataset. The results from the proposed method contain fewer visual artifacts and more details on key face components (e.g., eyes and mouths).
Fig. 11 Visual comparison on CelebA dataset. The results from the proposed method contain fewer visual artifacts and more details on key face components (e.g., eyes and mouths).
Face recognition. As the CelebA dataset contains identity labels, we conduct another experiment on identity recognition. We consider our CelebA test images as a probe set, which has 100 different identities. For each identity, we collect 9 additional clear face images as a gallery set. For each image in the probe set, our goal is to find the most similar face image from the gallery set and identify whether they belong to the same identity.
Given a blurred or deblurred image from the probe set, we compute the identity distance with all images in the gallery set and select the top-K nearest matches. Table 11 shows the top-1, top-3 and top-5 accuracy. The proposed method generates fewer artifacts and thus achieves the highest recognition accuracy among the evaluated approaches.
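The top-K protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: features are precomputed, the identity distance is taken to be the L2 distance, and a probe counts as correct when any of its K nearest gallery images shares its identity. The function and variable names are hypothetical.

```python
import numpy as np


def top_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k):
    """Fraction of probes whose K nearest gallery images include the true identity."""
    # Pairwise L2 distances, shape (num_probes, num_gallery).
    diffs = probe_feats[:, None, :] - gallery_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Indices of the K closest gallery images for every probe.
    nearest = np.argsort(dists, axis=1)[:, :k]
    hits = (gallery_ids[nearest] == probe_ids[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data: two probes, three gallery images, two identities.
probe_feats = np.array([[0.0, 0.0], [10.0, 10.0]])
probe_ids = np.array([0, 1])
gallery_feats = np.array([[0.0, 0.5], [9.0, 9.0], [1.0, 0.0]])
gallery_ids = np.array([1, 1, 0])

top1 = top_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1)
top2 = top_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=2)
```

In this toy example the first probe is closest to a gallery image of the wrong identity, so it is missed at K = 1 and recovered at K = 2.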
5.3 Real-world blurred images
We evaluate the proposed method on face images collected from the real blurred dataset of Lai et al. (2016). As real images usually contain outliers that cannot be modeled well by Gaussian distributions, conventional methods fail to estimate the blur kernel and generate serious ringing artifacts. The CNN-based generic deblurring method (Nah et al., 2017) generates overly smooth results. In contrast, both the method of Shen et al. (2018) and the proposed model restore sharp and visually pleasing face images.
5.4 Execution time
We evaluate the execution time of the state-of-the-art approaches and the proposed model on a machine with a 3.4 GHz Intel i7 CPU (64G RAM) and an NVIDIA Titan X GPU card (12G memory). Table 12 shows the average execution time based on 10 images with a size of .
Fig. 12 Visual comparison on images with different attributes. We show that the proposed method generates sharp images and is robust to several scenarios, e.g., occlusion with sunglasses or hands (rows 1 to 3), faces with mustaches (rows 1 and 4), over-exposed images (rows 5 and 6), and people with different skin colors.
Table 11 Face detection and recognition on the CelebA dataset. We show the success rate of face detection and top-1, top-3 and top-5 accuracy of face recognition.
Fig. 13 Quantitative evaluation on face identity. We compute the L2 and cosine losses on the features extracted from the FaceNet (Schroff et al., 2015). The proposed method has the lowest values on the CelebA test sets.
Fig. 14 Visual comparison on real blurred images. The proposed method generates visually pleasing deblurred results with fewer artifacts.
Table 12 Comparison of execution time and model size. We report the average execution time on 10 images with the size of
Most conventional approaches require solving several iterative optimization problems and therefore are computationally expensive. Since we use only two scales and fewer residual blocks, our model is more efficient than the model of Nah et al. (2017). The proposed model is slightly slower than the model of Shen et al. (2018) as there is an additional coarse deblurring network.
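An average execution time of this kind can be measured with a simple wall-clock loop. The sketch below is a generic illustration, not the timing script used for Table 12. The `model` callable and the list of images are placeholders, and GPU frameworks that run asynchronously would additionally need a device synchronization before each clock read.

```python
import time


def average_execution_time(model, images):
    """Average wall-clock seconds per image for a callable model."""
    start = time.perf_counter()
    for image in images:
        model(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images)


# Placeholder model and inputs so that the sketch runs as written.
seconds_per_image = average_execution_time(lambda image: image, list(range(10)))
```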
5.5 Limitations and discussions
Our method is likely to fail in two situations. First, when the input image contains severe non-uniform blur or non-Gaussian noise, our model may not be able to reduce the blur effectively, as shown in Fig. 15. A potential solution is to synthesize more training data with complex motion models or realistic noise (Foi et al., 2008). Second, when the face cannot be well aligned (e.g., profile faces in Fig. 15 bottom), the face parsing network may not estimate accurate semantic labels to guide the deblurring network. To further analyze the performance of the proposed model on profile faces, we evaluate the face images from the FEI face database (Thomaz and Giraldi, 2010), where each face is captured under different rotation angles. As shown in Fig. 16, our model performs well on frontal faces and profile faces which are rotated by about 60 degrees (i.e., to columns of Fig. 16). For extreme cases (e.g., rotated by about 90 degrees as shown in the and columns of Fig. 16), our deblurred results contain some visual artifacts around the nose and mouth. The eyes are not restored well due to the inaccurate semantic labels.
Fig. 15 Failure cases. Our method fails when the input image suffers from extremely large motion blur and the semantic labels cannot be estimated well.
Fig. 16 Deblurring profile faces. We evaluate our model on the FEI face database (Thomaz and Giraldi, 2010). The proposed model becomes less effective when a face is rotated by 90 degrees.
In this work, we propose a multi-scale deep convolutional neural network for face image deblurring. We exploit the face semantic information as global priors and local structural constraints to better restore the shape and detail of face images. Compared with the preliminary work (Shen et al., 2018) which obtains the semantic labels from the input blurred image, we show that the semantic information extracted from a coarse deblurred image is more accurate and leads to better performance on deblurring images. Furthermore, we propose an adaptive local structural loss to balance the weights of facial key components and restore better content and details. Experimental results on image deblurring, execution time and face recognition demonstrate that the proposed method performs favorably against our preliminary method (Shen et al., 2018) and the state-of-the-art deblurring algorithms.
Acknowledgements This work was supported by the Major Science Instrument Program of the National Natural Science Foundation of China under Grant 61527802, the General Program of the National Natural Science Foundation of China under Grants 61371132 and 61471043, NSF CAREER (No. 1149783) and gifts from Adobe and Nvidia.
Amos B, Ludwiczuk B, Satyanarayanan M (2016) Openface: A general-purpose face recognition library with mobile applications. Tech. rep., CMU-CS-16-118
Anwar S, Phuoc Huynh C, Porikli F (2015) Class-specific image deblurring. In: IEEE International Conference on Computer Vision
Boracchi G, Foi A (2012) Modeling the performance of image restoration from motion blur. IEEE Transactions on Image Processing 21(8):3502–3517
Chakrabarti A (2016) A neural approach to blind motion deblurring. In: European Conference on Computer Vision
Chen Q, Koltun V (2017) Photographic image synthesis with cascaded refinement networks. In: IEEE International Conference on Computer Vision
Cho S, Lee S (2009) Fast motion deblurring. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 28(5):145:1–145:8
Chrysos GG, Favaro P, Zafeiriou S (2019) Motion deblurring of faces. International Journal of Computer Vision 127(6-7):801–823
Dong C, Deng Y, Change Loy C, Tang X (2015) Compression artifacts reduction by a deep convolutional network. In: IEEE International Conference on Computer Vision
Dong J, Pan J, Su Z, Yang MH (2017) Blind image deblurring with outlier handling. In: IEEE International Conference on Computer Vision
Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT (2006) Removing camera shake from a single photograph. ACM Transactions on Graphics (Proceedings of SIGGRAPH) pp 787–794
Foi A, Trimeche M, Katkovnik V, Egiazarian K (2008) Practical poissonian-gaussian noise modeling and fitting for single-image raw-data. IEEE Transactions on Image Processing
Gatys LA, Ecker AS, Bethge M (2015) Texture synthesis using convolutional neural networks. In: Neural Information Processing Systems
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Neural Information Processing Systems
Hacohen Y, Shechtman E, Lischinski D (2013) Deblurring by example using dense correspondence. In: IEEE International Conference on Computer Vision
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition
Hirsch M, Schuler CJ, Harmeling S, Schölkopf B (2011) Fast removal of non-uniform camera shake. In: IEEE International Conference on Computer Vision
Hradiš M, Kotera J, Zemčík P, Šroubek F (2015) Convolutional neural networks for direct text deblurring. In: British Machine Vision Conference
Hu Z, Cho S, Wang J, Yang MH (2014a) Deblurring low-light images with light streaks. In: IEEE Conference on Computer Vision and Pattern Recognition
Hu Z, Xu L, Yang MH (2014b) Joint depth estimation and camera shake removal from single blurry image. In: IEEE Conference on Computer Vision and Pattern Recognition
Jin M, Hirsch M, Favaro P (2018) Learning face deblurring fast and wide. In: CVPR Workshops, pp 745–753
Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision
Kae A, Sohn K, Lee H, Learned-Miller EG (2013) Augmenting crfs with boltzmann machine shape priors for image labeling. In: IEEE Conference on Computer Vision and Pattern Recognition
Kim J, Lee JK, Lee KM (2016) Accurate image super-resolution using very deep convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition
Krishnan D, Tay T, Fergus R (2011) Blind deconvolution using a normalized sparsity measure. In: IEEE Conference on Computer Vision and Pattern Recognition
Kupyn O, Budzan V, Mykhailych M, Mishkin D, Matas J (2018) Deblurgan: Blind motion deblurring using conditional adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 8183–8192
Lai WS, Ding JJ, Lin YY, Chuang YY (2015) Blur kernel estimation using normalized color-line prior. In: IEEE Conference on Computer Vision and Pattern Recognition
Lai WS, Huang JB, Hu Z, Ahuja N, Yang MH (2016) A comparative study for single image blind deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Lai WS, Huang JB, Ahuja N, Yang MH (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In: IEEE Conference on Computer Vision and Pattern Recognition
Le V, Brandt J, Lin Z, Bourdev L, Huang TS (2012) Interactive facial feature localization. In: European Conference on Computer Vision
Ledig C, Theis L, Huszar F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, Shi W (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: IEEE Conference on Computer Vision and Pattern Recognition
Levin A, Weiss Y, Durand F, Freeman WT (2009) Understanding and evaluating blind deconvolution algorithms. In: IEEE Conference on Computer Vision and Pattern Recognition
Li L, Pan J, Lai WS, Gao C, Sang N, Yang MH (2018) Learning a discriminative prior for blind image deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Liu Y, Dong W, Gong D, Zhang L, Shi Q (2018) Deblurring natural image using super-gaussian fields. In: European Conference on Computer Vision, pp 467–484
Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. In: IEEE International Conference on Computer Vision
Mao X, Shen C, Yang YB (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Neural Information Processing Systems
Michaeli T, Irani M (2014) Blind deblurring using internal patch recurrence. In: European Conference on Computer Vision
Nah S, Hyun Kim T, Mu Lee K (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Nimisha TM, Singh AK, Rajagopalan AN (2017) Blur-invariant deep learning for blind-deblurring. In: IEEE International Conference on Computer Vision, pp 4762–4770
Pan J, Hu Z, Su Z, Yang M (2014) Deblurring face images with exemplars. In: European Conference on Computer Vision
Pan J, Lin Z, Su Z, Yang MH (2016a) Robust kernel estimation with outliers handling for image deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Pan J, Sun D, Pfister H, Yang M (2016b) Blind image deblurring using dark channel prior. In: IEEE Conference on Computer Vision and Pattern Recognition
Pan J, Dong J, Tai Y, Su Z, Yang M (2017a) Learning discriminative data fitting functions for blind image deblurring. In: IEEE International Conference on Computer Vision, pp 1077–1085
Pan J, Hu Z, Su Z, Yang M (2017b) L0-regularized intensity and gradient prior for deblurring text images and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(2):342–355
Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. In: British Machine Vision Conference
Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: International Conference on Learning Representations
Ren W, Cao X, Pan J, Guo X, Zuo W, Yang MH (2016a) Image deblurring via enhanced low-rank prior. IEEE Transactions on Image Processing 25(7):3426–3437
Ren W, Liu S, Zhang H, Pan J, Cao X, Yang MH (2016b) Single image dehazing via multi-scale convolutional neural networks. In: European Conference on Computer Vision
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition
Schuler CJ, Christopher Burger H, Harmeling S, Schölkopf B (2013) A machine learning approach for non-blind image deconvolution. In: IEEE Conference on Computer Vision and Pattern Recognition
Schuler CJ, Hirsch M, Harmeling S, Schölkopf B (2016) Learning to deblur. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(7):1439–1451
Shan Q, Jia J, Agarwala A (2008) High-quality motion deblurring from a single image. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 27(3):73:1–73:10
Shen Z, Lai W, Xu T, Kautz J, Yang M (2018) Deep semantic face deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Sim T, Baker S, Bsat M (2002) The cmu pose, illumination, and expression (pie) database. In: IEEE International Conference on Automatic Face and Gesture Recognition
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations
Sun J, Cao W, Xu Z, Ponce J (2015) Learning a convolutional neural network for non-uniform motion blur removal. In: IEEE Conference on Computer Vision and Pattern Recognition
Sun L, Cho S, Wang J, Hays J (2013a) Edge-based blur kernel estimation using patch priors. In: IEEE International Conference on Computational Photography
Sun L, Cho S, Wang J, Hays J (2014) Good image priors for non-blind deconvolution. In: European Conference on Computer Vision
Sun Y, Wang X, Tang X (2013b) Deep convolutional network cascade for facial point detection. In: IEEE Conference on Computer Vision and Pattern Recognition
Tao X, Gao H, Shen X, Wang J, Jia J (2018) Scale-recurrent network for deep image deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Thomaz CE, Giraldi GA (2010) A new ranking method for principal components analysis and its application to face image analysis. Image and Vision Computing 28(6):902–913
Vedaldi A, Lenc K (2015) MatConvNet: Convolutional neural networks for matlab. In: ACM International conference on Multimedia
Wang H, Wang Y, Zhou Z, Ji X, Gong D, Zhou J, Li Z, Liu W (2018a) Cosface: Large margin cosine loss for deep face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 5265–5274
Wang TC, Liu MY, Zhu JY, Tao A, Kautz J, Catanzaro B (2018b) High-resolution image synthesis and semantic manipulation with conditional gans. In: IEEE Conference on Computer Vision and Pattern Recognition
Whyte O, Sivic J, Zisserman A, Ponce J (2012) Non-uniform deblurring for shaken images. International Journal of Computer Vision 98(2):168–186
Xu L, Jia J (2010) Two-phase kernel estimation for robust motion deblurring. In: European Conference on Computer Vision
Xu L, Zheng S, Jia J (2013) Unnatural L0 sparse representation for natural image deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition
Xu L, Ren JSJ, Liu C, Jia J (2014) Deep convolutional neural network for image deconvolution. In: Neural Information Processing Systems
Xu X, Sun D, Pan J, Zhang Y, Pfister H, Yang MH (2017) Learning to super-resolve blurry face and text images. In: IEEE International Conference on Computer Vision
Xu X, Pan J, Zhang Y, Yang M (2018) Motion blur kernel estimation via deep learning. IEEE Transactions on Image Processing 27(1):194–205
Yan Y, Ren W, Guo Y, Wang R, Cao X (2017) Image deblurring via extreme channels prior. In: IEEE Conference on Computer Vision and Pattern Recognition
Zhang J, Pan J, Lai WS, Lau RWH, Yang MH (2017) Learning fully convolutional networks for iterative non-blind deconvolution. In: IEEE Conference on Computer Vision and Pattern Recognition
Zhong L, Cho S, Metaxas DN, Paris S, Wang J (2013) Handling noise in single image deblurring using directional filters. In: IEEE Conference on Computer Vision and Pattern Recognition