Historically, inpainting is an ancient technique performed by professional artists to restore damaged paintings in museums. Defects such as scratches, cracks, dust and spots were inpainted by hand to restore and maintain the image quality. The evolution of computers in the last century and their frequent daily use have encouraged inpainting to take a digital format (Efros and Leung, 1999; Bertalmio et al., 2000; Criminisi et al., 2004; Pathak et al., 2016; Liu et al., 2018a; Yang et al., 2017; Yan et al., 2018) as an image restoration technique. Image inpainting aims to fill in missing pixels caused by a defect based on pixel similarity information (Bertalmio et al., 2000).
State-of-the-art approaches fall into two categories: conventional and deep learning-based methods. Conventional methods (Efros and Leung, 1999; Criminisi et al., 2004; Barnes et al., 2009; Sun et al., 2005) use image statistics of best-fitting pixels to fill in missing regions (defects). However, these approaches often fail to produce images with plausible visual semantics. With the evolution in research, deep learning-based methods (Pathak et al., 2016; Liu et al., 2018a; Iizuka et al., 2017; Yu et al., 2018; Yan et al., 2018; Yang et al., 2017) encode the semantic context of an image into feature space and fill in missing pixels by hallucination (Yang et al., 2017) through the use of generative neural networks. Although deep learning approaches achieve excellent performance in facial inpainting, the state of the art has some limitations, as illustrated in Figure 1: Figure 1(a) shows poor performance on holes with arbitrary sizes; Figure 1(b) illustrates the lack of edge preservation in existing techniques; Figure 1(c) depicts blurry artefacts; and Figure 1(d) demonstrates poor performance on high-resolution images and on image completion with a mask at the border region.
Figure 1: Images showing some issues of the state of the art: (a) Poor performance on holes with arbitrary sizes; (b) Lack of edge-preserving technique; (c) Blurry artefacts; and (d) Poor performance on high-resolution images and image completion with mask at the border region.
To correctly predict missing parts of a face and preserve its realism, we propose S-WGAN with the following contributions:
• We propose a new framework with Wasserstein Generative Adversarial Network (WGAN) that uses symmetric skip connection to preserve image details.
• We define a new combined loss function based on RGB and feature space.
• We demonstrate that our loss, combined with our S-WGAN, can achieve better results than the state-of-the-art algorithms.
Pathak et al. (Pathak et al., 2016) proposed to use GANs (Goodfellow et al., 2014) with a context encoder similar to (Vincent et al., 2010; Le et al., 2011) and AlexNet (Krizhevsky et al., 2012) for image inpainting, despite poor hallucinations. Results show more artefacts and blur with randomised hole-to-image mask regions. Iizuka et al. (Iizuka et al., 2017) used local and global discriminators to assess the coherency and consistency of predicted pixels, and replaced the fully-connected layer of the generator with dilated convolutions (Yu and Koltun, 2015). Their method failed to capture long-range texture information and relied on Poisson blending (Pérez et al., 2003) to post-process the output image. Yang et al. (Yang et al., 2017) proposed a multi-scale neural patch synthesis based on style transfer (Johnson et al., 2016; Ulyanov et al., 2016; ?), but it failed to guarantee the content and texture of high-resolution images and had difficulty with the irregular mask inpainting task. Yeh et al. (Yeh et al., 2017) introduced a spatial attention mechanism in a deep convolutional GAN (?) combined with context loss, but this algorithm suffers from misalignment of the closest encoding in latent space and performs poorly on high-resolution and complex scene images. Li et al. (Li et al., 2017) used a face parsing network combined with a generator (encoder-decoder) and two discriminators, optimised by a semantic parsing loss, to ensure local-global consistency and pixel fidelity. However, despite excellent performance, neighbouring pixels fail to establish spatial connections, leading to colour inconsistencies. Li et al. (Li et al., 2018) introduced reflection symmetry into face completion and used two networks to establish a correspondence between missing pixels on two half-faces, optimised using a symmetry loss defined on VGGFace (Parkhi et al., 2015). However, this network fails to preserve structural information and is computationally costly.
Liu et al. (Liu et al., 2018a) used partial convolution to replace typical convolutions (Ulyanov et al., 2018) with an automatic mask-updating step. This technique masks and renormalises convolutions to target only valid pixels. However, it performs poorly on sparsely structured images and binary masks with huge holes, and no quantitative evaluation is reported on facial images. Yan et al. (Yan et al., 2018) performed deep feature rearrangement by adding a shift-connection layer to the U-Net architecture (Ronneberger et al., 2015), but the method lacks efficiency, with no guarantee of computational speed. Yu et al. (Yu et al., 2018) proposed a dual-stage convolutional network combined with a contextual attention layer that learns the location of feature information from background patches to generate missing content. However, this network lacks pixel-wise consistency on high-resolution images. Liu and Jung (Liu and Jung, 2019) proposed multi-scale feature extraction powered by a multi-level generative network optimised by content and texture losses based on Mean Square Error (MSE) and Structure Similarity Index (MS-SSIM), to capture features at various levels. This model struggles with larger masks and fails to preserve structure in unaligned facial images. Li et al. (Li et al., 2019) proposed a nested GAN for facial inpainting that uses a residual connection structure to transport information and interpolate feature maps in deeper and shallower layers. Wang et al. (Wang et al., 2019) introduced a Laplacian approach based on residual learning (He et al., 2016) to propagate high-frequency details and predict missing information. Despite the significant contributions of the methods above to the field of inpainting, preserving realism on facial images from a compact latent feature is still challenging due to larger and irregular masks.
Our proposed model uses skip connections with dilated convolution across the network to perform image inpainting. We discuss the architecture and loss function of S-WGAN in the following sections.
3.1 Architecture
Figure 2: S-WGAN framework. The dilated convolution and deconvolution with the element-wise sum of feature maps (skip connection) combined with a Wasserstein network. The skip connections in the diagram ensure that local pixel-level accuracy of the feature details is retained.
Figure 3: Illustration of the dilated convolution process. Convolving a 3 × 3 kernel over a 7 × 7 input with a dilation factor of 2 (i.e., i = 7, k = 3, d = 2, s = 1 and p = 0) (Dumoulin and Visin, 2016). The accretion of receptive field is in linearity with the parameters (Yu and Koltun, 2015). A 5 × 5 kernel will have the same receptive field view as over a 7 × 7 input at dilation rate = 2 whilst only using 9 parameters over a 512
Figure 2 shows the overall framework of our proposed S-WGAN. The network is designed to have a generator (G) and a discriminator (D). We define G as an encoder-decoder framework with dilated convolutions and symmetric skip connections. Figure 3 shows the process of dilated convolution. Dilated convolutions (Yu and Koltun, 2015), combined with skip connections, are critical to the design of our model as:
• It broadens the receptive fields to capture more contextual information without parameter accretion and computational complexity, which are preserved and transferred by skip connections to corresponding deconvolution layers.
• It detects fine details and maintains high-resolution feature maps, and achieves end-to-end feature learning with a better local minimum (high restoration performance).
• It has shown considerable improvement of accuracy in segmentation task (Yu and Koltun, 2015; Chen et al., 2017a; Chen et al., 2017b).
Generator (G) The effectiveness of feature encoding is improved by having an encoder of ten convolutional layers, with a kernel size of 5 and dilation rate of 2, designed to match the size of the output image. This technique enables our model to learn larger spatial filters and helps reduce volume (Rosebrock, 2019). Each convolution block, with the exception of the final layer, has Leaky ReLU activation and a max-pooling operation of pool size 2 × 2. We apply dropout regularisation with a probability of 0.25 in the 4th and final layers of the encoder. The dropout layer randomly disconnects nodes and adjusts the weights to propagate information to the decoder without overfitting.
Decoder The decoder consists of five blocks of deconvolutional layers, with learnable upsampling layers that recover image details using the same kernel size and dilation rate as the encoder. The corresponding feature maps in the decoder are asymmetrically linked by element-wise skip connections to reach an optimum size. The final layer in the decoder uses Tanh activation.
Dilated Convolutions: We express the dilated convolution based on the network input in Equation 1:
where I is the output feature map of the dilated convolution from the input MI M and the filter is given by . The dilated convolution reduces to a standard convolution when the dilation rate dr = 1.
It is advantageous to use dilated convolution compared to using typical convolutional layers combined with pooling. The reason is that a small kernel of size k × k is enlarged to an effective size of k + (k − 1)(dr − 1) by the dilation rate dr, thus allowing a flexible receptive field of fine-detail contextual information while maintaining high-quality resolution.
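The effect of the dilation rate on the receptive field can be checked with a small sketch. It assumes the standard form of a dilated convolution, y(m, n) = sum over i, j of x(m + dr·i, n + dr·j) · w(i, j), written as a cross-correlation with stride 1 and no padding; the function names and the symbols x, w and y are illustrative and are not taken from our implementation.

```python
def effective_kernel_size(k, dr):
    """Span covered by a k-tap kernel whose taps are dr pixels apart."""
    return k + (k - 1) * (dr - 1)


def dilated_conv2d(image, kernel, dr=1):
    """'Valid' 2-D dilated cross-correlation on nested lists (stride 1, no padding)."""
    k = len(kernel)
    span = effective_kernel_size(k, dr)
    out_h = len(image) - span + 1
    out_w = len(image[0]) - span + 1
    out = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            acc = 0.0
            # Kernel tap (i, j) reads the input pixel dr*i rows and dr*j columns away.
            for i in range(k):
                for j in range(k):
                    acc += image[m + dr * i][n + dr * j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# 7 x 7 input and 3 x 3 kernel of ones, as in the Figure 3 setting (dilation 2).
image = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
kernel = [[1.0] * 3 for _ in range(3)]

out = dilated_conv2d(image, kernel, dr=2)
print(effective_kernel_size(3, 2))  # 5: nine weights cover a 5 x 5 window
print(len(out), len(out[0]))        # 3 3
```

With dr = 1 the same function performs an ordinary convolution, so the dilation rate is the only quantity that changes the window while the number of weights stays at k × k.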
The inpainting solver G may produce predictions of the missing region that are reasonable or ill-posed. As part of our network we include a discriminator D, adopted from (Arjovsky et al., 2017), to provide improved stability and enhanced discrimination for photo-realistic images. With ongoing adversarial training, the discriminator becomes unable to distinguish real data from fake data. Equation 2 shows the reconstruction of the image during training from G:
where IR is the reconstructed image, I is the ground-truth, is the predictions, is the element-wise multiplication and M is the binary mask, represented in 0 and 1. In our case 0 is the context of the entire image and 1 is the missing regions.
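Given these definitions, the reconstruction amounts to keeping the ground-truth pixels where M = 0 and taking the generator's prediction where M = 1. The sketch below illustrates that compositing step on a single-channel toy image; the function name `composite` and the variable names are assumptions made for illustration, not names from our code.

```python
def composite(ground_truth, prediction, mask):
    """Element-wise I_R = I * (1 - M) + prediction * M for 2-D nested lists."""
    return [
        [g * (1 - m) + p * m for g, p, m in zip(g_row, p_row, m_row)]
        for g_row, p_row, m_row in zip(ground_truth, prediction, mask)
    ]


I = [[0.1, 0.2], [0.3, 0.4]]      # ground truth
pred = [[0.9, 0.8], [0.7, 0.6]]   # generator output
M = [[0, 1], [0, 0]]              # 1 marks the missing pixel

I_R = composite(I, pred, M)
print(I_R)  # only the masked pixel is taken from the prediction
```

Because the context pixels are copied unchanged, any difference between IR and the ground truth is confined to the masked region.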
Equation 3, adopted from (Arjovsky et al., 2017), refers to the Wasserstein discriminator:
$$\max_D V_{WGAN} = \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{z \sim P_z}[D(G(I_R))] \quad (3)$$
where D is the discriminator and Pr is the real data distribution. G is the generator of our network and Pz is the distribution from which z is drawn.
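As a rough illustration of the Wasserstein objective of Arjovsky et al. (2017), the sketch below estimates it from a batch of discriminator scores. It assumes the scores are already computed; the function names and the sample values are hypothetical, and the Lipschitz constraint on D is outside the scope of this sketch.

```python
def critic_objective(d_real, d_fake):
    """Batch estimate of E[D(real)] - E[D(fake)], which D tries to maximise."""
    return sum(d_real) / len(d_real) - sum(d_fake) / len(d_fake)


def generator_loss(d_fake):
    """G minimises -E[D(fake)], i.e. it pushes fake scores upwards."""
    return -sum(d_fake) / len(d_fake)


d_real = [1.0, 3.0]    # hypothetical scores for real images
d_fake = [-1.0, 0.5]   # hypothetical scores for reconstructed images

print(critic_objective(d_real, d_fake))  # 2.25
print(generator_loss(d_fake))            # 0.25
```

Unlike the original GAN loss, the scores are unbounded real numbers rather than probabilities, so the objective is a plain difference of means.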
3.2 Loss function
Perceptual loss Instead of using the typical ℓ2-loss function used in (Pathak et al., 2016), we use a new combination of loss functions: luminance loss (Ll) and feature loss (Lf). Pixel-wise reconstruction and feature space losses are not new to inpainting (Yeh et al., 2017; Yu et al., 2018; ?). We define a luminance-guided Ll that uses the ℓ1-loss as a base to compute the loss using a range of constant pixel values in the RGB space. This preserves colour and luminance and does not over-penalise large errors (Zhao et al., 2016). We use Ll to adjust our perceptual loss, thus minimising any error >1. Ll also allows better evaluation of the predictions against the ground truth. More specifically, we express the luminance loss (Ll) based on the ℓ1-loss as:
where i is the pixel index, with xi and ẑi as the pixel values of the ground truth and the predictions, constrained by a constant K.
Our feature loss Lf is a feature-based ℓ2-loss: rather than computing it directly on the image, we compute the loss in a feature space. To achieve this, we adopt a pre-trained VGG-16 model trained on ImageNet (Krizhevsky et al., 2012), and use it as a feature extractor in our loss function. More specifically, we use the output of block3-convolution3 of this model to generate image features. We use the ℓ2 as the base to compute our loss function, which is the same as the perceptual loss proposed by Johnson et al. (Johnson et al., 2016). The advantage of using feature space is that a particular filter determines the extraction of feature maps, from low-level to high-level, sophisticated features. To reconstruct quality images, we compute our loss function with feature maps determined by block3-conv3, resized to the same size as the masks and generated images. The reason is that using another output, for example block4-conv4 or block5-conv5, will result in poor quality, as the network starts to expand the view at these layers due to the larger number of filters used. Our feature loss is expressed as follows:
where MI is the input image, IR is the reconstructed image and N is the number of dimensions obtained from feature maps with high-level representational abstractions extracted from the third block convolution layer. By combining Ll and Lf we obtain:
By using Lp the model learns to produce finer details in the predicted features and output without any blurry artefacts. We add the Wasserstein loss (Lw), which improves convergence in GANs and is the mean difference between two images. Finally, the entire model trains end-to-end with back-propagation and uses the global Wasserstein-perceptual loss function (Lwp) defined in Equation 7 to optimise G and D to learn reasonable predictions. Our goal is to reconstruct an image IR from MI by training the generator G to learn and preserve image details.
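The behaviour of the two loss ingredients can be illustrated with a toy sketch, assuming the pixel-wise base is an ℓ1 distance and the feature-space base is an ℓ2 distance. The extractor `toy_features` is a hypothetical stand-in for the pre-trained VGG-16 activations, and the constant K and the exact way Ll and Lf are combined are deliberately left out, so this is not the loss used in our experiments.

```python
def l1_loss(x, z):
    """Mean absolute difference between ground truth x and prediction z."""
    return sum(abs(a - b) for a, b in zip(x, z)) / len(x)


def l2_loss(x, z):
    """Mean squared difference between ground truth x and prediction z."""
    return sum((a - b) ** 2 for a, b in zip(x, z)) / len(x)


def toy_features(v):
    """Hypothetical feature extractor: differences between neighbouring values."""
    return [b - a for a, b in zip(v, v[1:])]


def feature_loss(x, z, extract=toy_features):
    """l2 distance measured in feature space instead of pixel space."""
    return l2_loss(extract(x), extract(z))


truth = [0.0, 0.0, 0.0, 0.0]
pred = [0.0, 0.0, 0.0, 4.0]   # a single large error

print(l1_loss(truth, pred))       # 1.0
print(l2_loss(truth, pred))       # 4.0, the large error is squared
print(feature_loss(truth, pred))  # distance between extracted features
```

The single outlier contributes four times more to the ℓ2 value than to the ℓ1 value, which is the sense in which an ℓ1 base does not over-penalise large errors.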
This section describes the dataset, binary masks and the implementation.
4.1 Dataset and irregular binary mask
Our experiment focuses on high-resolution face images and irregular binary masks. The benchmark dataset for high-resolution face images is the CelebA-HQ dataset (Karras et al., 2017), which was curated from the CelebA dataset (Liu et al., 2018b) and contains 30,000 images. Figure 5 shows a few samples from the CelebA-HQ dataset.
To create irregular holes on images, we use the Quickdraw irregular mask dataset (Iskakov, 2018), which is available for public use and is divided into 50,000 train and 10,000 test masks. The images are of size 512 × 512 pixels.
Figure 4: Qualitative comparison of our proposed S-WGAN with the state-of-the-art methods on CelebA-HQ: (a) Input masked-image; (b) CE (Pathak et al., 2016); (c) PConv (Liu et al., 2018a); (d) WGAN; (e) S-WGAN (proposed method); and (f) Ground-truth image.
Figure 5: Sample images from CelebA-HQ Dataset (Karras et al., 2017).
Figure 6: Process of input generation: a) CelebA-HQ image; b) Binary mask image (Iskakov, 2018); and c) Corresponding masked image (input image).
4.2 Implementation
We used the Keras library with TensorFlow backend to implement and design our network. With our choice of the dataset, we followed the experiment settings of state of the art (Liu et al., 2018a) and split our data into 27,000 images for training and 3,000 images for testing.
We convert the image to a normalised floating-point representation, setting the pixel intensity values in the range [-1, 1], and apply the mask on the image to obtain our input, as shown in Figure 6. We initialise pre-trained weights from VGG-16 to compute our loss function. We set the learning rates of G and D and optimise the training process using the Adam optimiser (Kingma and Ba, 2014). We use a Quadro P6000 GPU machine to train these models. According to our hardware conditions, we use a batch size of 5 in each epoch for input images with shape 512 × 512 × 3. It takes 0.193 seconds to predict missing pixels of any size created by a binary mask on an image, and ten days to train 100 epochs.
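The normalisation and its inverse can be sketched as follows for 8-bit images. This is a minimal illustration rather than our preprocessing code: the function names are assumptions, and the value written into the hole (`fill`) is a hypothetical choice.

```python
def normalise(pixels):
    """Map 8-bit intensities in [0, 255] to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def denormalise(values):
    """Inverse mapping back to 8-bit integers, clipped to [0, 255]."""
    return [min(255, max(0, int(round((v + 1.0) * 127.5)))) for v in values]


def apply_mask(values, mask, fill=1.0):
    """Replace pixels where mask == 1 with `fill` (hypothetical hole value)."""
    return [fill if m == 1 else v for v, m in zip(values, mask)]


row = [0, 64, 128, 255]   # one row of 8-bit intensities
mask = [0, 1, 0, 0]       # 1 marks a missing pixel

x = normalise(row)              # network input range
masked = apply_mask(x, mask)    # masked input row
restored = denormalise(x)       # inverse normalisation applied to an output

print(x[0], x[-1])  # -1.0 1.0
print(restored)     # [0, 64, 128, 255]
```

The range [-1, 1] matches the Tanh activation at the end of the decoder, so the inverse mapping can be applied directly to the generator output.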
We assess the performance of the inpainting methods qualitatively and quantitatively in this section.
5.1 Qualitative Comparisons
Considering the importance of visual and semantic coherence, we conducted a qualitative comparison on our test dataset. First, we implemented a WGAN approach with Lf and Lw. We observed an induced pattern and poor colour on the images, as shown in Figure 4(d). We introduced dilated convolution and skip connections, combined with end-to-end training using Lwp, to handle the induced pattern and match the luminance of the original images. We compare our model with three popular methods:
• CE: Context-Encoder method by Pathak et al. (Pathak et al., 2016).
• PConv: Image Inpainting for irregular holes using partial convolutions by Liu et al. (Liu et al., 2018a).
• WGAN: Wasserstein GAN method with perceptual loss.
We test our S-WGAN against the state of the art on the CelebA-HQ 512 × 512 test dataset and show the results in Figure 4. Based on visual inspection, Figure 4(b) illustrates the blurry results generated by the CE method of Pathak et al. (Pathak et al., 2016). On the other hand, PConv (Liu et al., 2018a) generates clear images but with residues of the binary mask left on the images, as shown in Figure 4(c). WGAN produced an induced pattern and low-contrast images, shown in Figure 4(d). Overall, our proposed S-WGAN, as shown in Figure 4(e), produced the best visual results when compared to the ground-truth in Figure 4(f).
Figure 7: Qualitative comparison of results using different architectures (Johnson et al., 2016) on CelebA-HQ (Karras et al., 2017). (a) Input masked image (b) Inpainted image by WGAN (c) Improved WGAN with skip connections (WGAN-S) (d) Improved WGAN with skip connection and dilated convolution (WGAN-SD) (e) Complete network with Lp (f) Ground-Truth image. The yellow box indicates the region where other models failed to inpaint completely. This region in (e) shows the effectiveness of Lp on the inpainted image.
5.2 Quantitative Comparisons
We select some popular image quality metrics, including mean absolute error (MAE), mean squared error (MSE), Peak Signal-to-Noise Ratio (PSNR) and SSIM, to evaluate the performance quantitatively. Table 1 shows the results from our experiment compared to the state of the art (Pathak et al., 2016; ?) for image inpainting, with our S-WGAN in bold.
Table 1: Quantitative comparison of various performance assessment metrics on 3,000 test images from the CelebA-HQ dataset. † Lower is better. Higher is better.
For MAE and MSE, the lower the value, the better the image quality. MSE measures the average squared intensity difference of pixels, while MAE measures the magnitude of error between the ground-truth image and the reconstructed image. Conversely, for PSNR and SSIM, the higher the value, the closer the image quality to the ground-truth. Based on Table 1, S-WGAN achieves lower error values, higher PSNR and higher SSIM values in comparison with CE (Pathak et al., 2016) and PConv (Liu et al., 2018a), which suggests that S-WGAN provides more accurate predictions than the state-of-the-art inpainting algorithms.
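For reference, the pixel-wise metrics can be sketched in a few lines, assuming the two lower-is-better measures are the mean squared error and the mean absolute error and that images are 8-bit (peak value 255). The function names and the toy data are illustrative; SSIM is omitted because it requires local window statistics.

```python
import math


def mse(x, y):
    """Average squared intensity difference."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mae(x, y):
    """Average magnitude of the error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(x, y)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


truth = [0, 0, 0, 0]
pred = [0, 0, 0, 10]

print(mse(truth, pred))   # 25.0
print(mae(truth, pred))   # 2.5
print(psnr(truth, pred))  # grows as the MSE shrinks
```

PSNR is a logarithmic function of the MSE, so a method that ranks better on MSE also ranks better on PSNR; SSIM adds information about local structure that these pixel-wise measures do not capture.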
To justify the S-WGAN framework and validate the effectiveness of Lp, we conduct experiments and show intermediate results using different alterations of the S-WGAN on CelebA-HQ dataset.
Firstly, we conduct investigations on the WGAN and the WGAN with skip connections (WGAN-S) using Lf, and observe a slight improvement in the texture and structure of the reconstructed masked regions of the images. Figure 7 (b) and (c) show the changes influenced by skip connections. We observed that, visually and quantitatively, the WGAN-S performs better than the WGAN model, but not satisfactorily, as shown in the first part of Table 2.
Secondly, we improve the WGAN-S model by adding dilated convolutions to each block, together with additional convolution layers, to obtain our WGAN-SD model. We train the WGAN-SD with Lf and train the S-WGAN model with our new combined loss function. We noticed that training with Lf improved our results slightly, but not satisfactorily. To verify the differences between these models, we conduct a qualitative and quantitative evaluation. Visually, within the yellow rectangle in Figure 7, comparing columns (d) and (e), the S-WGAN result in column (e) shows significantly enhanced local detail when compared with column (d) and the original in column (f). Also, in the quantitative evaluation shown in Table 2, we observe that S-WGAN trained end-to-end with Lwp predicts reasonable outputs with finer details. We also show more qualitative results in Figure 8 to demonstrate that S-WGAN produces images with preserved realism. To validate that the representational ability of our S-WGAN generalises to other masks, e.g. the Nvidia mask (Karras et al., 2017), we use the various architectures of our model to conduct experiments during the ablation studies. We apply the Nvidia mask as the masking method and show our results in Figure 9.
Figure 8: Qualitative comparison of results using different architectures with the perceptual loss (Johnson et al., 2016) on CelebA-HQ (Karras et al., 2017). (a) Input masked image; (b) Inpainted image by WGAN; (c) Improved WGAN with skip connection; (d) Improved WGAN with skip connection and dilated convolution; (e) Complete network with Lp; (f) The ground-truth image.
Table 2: Quantitative difference of results based on different architectures: WGAN, WGAN-S and WGAN-SD with Lf, and S-WGAN trained with Lp. † Lower is better. Higher is better.
Our proposed S-WGAN with dilated convolution and skip connections, trained end-to-end with the Wasserstein-perceptual loss function, outperforms the state of the art. Our model can learn the end-to-end mapping of input images from a large-scale dataset to predict missing pixels of the binary mask regions on the image. Our S-WGAN automatically learns and identifies missing pixels from the input and encodes them as feature representations, to be reconstructed in the decoder. Skip connections help to transfer image details forward and to find a local minimum by backward propagation.
Our experiments show the benefit of skip connections combined with the Wasserstein-perceptual loss for image inpainting. We have visually compared our proposed method with the state of the art (Pathak et al., 2016; Liu et al., 2018a) in Figure 4. To verify the effectiveness of our network, we carried out experiments with regular convolutions and used Lf. We noticed that the images produced had checkerboard artefacts and poor visual similarity compared to the original image, as shown in Figure 4(d). We introduced skip connections with dilated convolution and our new loss function and obtained improved results that were semantically reasonable, with preserved realism in all aspects.
Compared to existing methods, the generator of our S-WGAN learns specific structures in natural images by minimising Lp, with an enhanced hallucinating ability powered by symmetric skip connections. Based on Figure 4, our S-WGAN can handle irregularly shaped binary masks without any blurry artefacts and has shown edge preservation and mask completion at border regions on the output images. Additionally, using the Wasserstein discriminator enables the overall network to perform better. This boosts the experimental performance of our network to achieve state-of-the-art results in the inpainting task on high-resolution images.
Figure 9: Qualitative evaluation of different architectures with perceptual loss (Johnson et al., 2016) on CelebA-HQ (Karras et al., 2017) and Nvidia Mask. (a) Input masked image; (b) Inpainted image by WGAN; (c) Improved WGAN with skip connection; (d) Improved WGAN with skip connection and dilated convolution (WGAN-SD); (e) Complete network with Lp; (f) The ground-truth image.
One limitation is a consistent practice of other inpainting methods in the preprocessing step. Most preprocessing ignores the fact that the image has to be converted into a normalised floating-point representation, with an inverse normalisation applied to the output image; this contributes to colour discrepancies on the output image and leads to expensive postprocessing. We have been able to solve this using S-WGAN with a new combination of loss functions that preserves colour and image detail.
In this paper, we propose S-WGAN. Our network can generate images, which are semantically and visually plausible with preserved realism of facial features. We achieved this with a network structure that can widen the receptive field in each block to capture more information and forward to the corresponding deconvolutional blocks. Additionally, we introduced a new combined loss function based on luminance and feature space combined with Wasserstein loss. Our network was able to generate high-resolution images from input covered with arbitrary binary mask shape and achieve a better performance compared to the state-of-the-art methods. The proposed network has shown the effectiveness of skip connections with dilated convolutions as a capture and refining mechanism of contextual information combined with WGAN. For future work, we aim to extend our model to inpaint coarse and fine wrinkles extracted from wrinkle detectors (Yap et al., 2018) with preserved realism.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 used for this research.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein gan. arXiv preprint arXiv:1701.07875.
Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. (2009). Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28(3):24.
Bertalmio, M., Sapiro, G., Caselles, V., and Ballester, C. (2000). Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017a). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848.
Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017b). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
Criminisi, A., Pérez, P., and Toyama, K. (2004). Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing, 13(9):1200–1212.
Dumoulin, V. and Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
Efros, A. A. and Leung, T. K. (1999). Texture synthesis by non-parametric sampling. In iccv, page 1033. IEEE.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Iizuka, S., Simo-Serra, E., and Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107.
Iskakov, K. (2018). Semi-parametric image inpainting. arXiv preprint arXiv:1807.02855.
Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. (2011). On optimization methods for deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 265–272. Omnipress.
Li, X., Liu, M., Zhu, J., Zuo, W., Wang, M., Hu, G., and Zhang, L. (2018). Learning symmetry consistent deep cnns for face completion. arXiv preprint arXiv:1812.07741.
Li, Y., Liu, S., Yang, J., and Yang, M.-H. (2017). Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3911–3919.
Li, Z., Zhu, H., Cao, L., Jiao, L., Zhong, Y., and Ma, A. (2019). Face inpainting via nested generative adversarial networks. IEEE Access, 7:155462–155471.
Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and Catanzaro, B. (2018a). Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723.
Liu, J. and Jung, C. (2019). Facial image inpainting using multi-level generative network. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1168–1173. IEEE.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2018b). Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15:2018.
Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep face recognition.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544.
Pérez, P., Gangnet, M., and Blake, A. (2003). Poisson image editing. In ACM SIGGRAPH 2003 Papers, pages 313–318.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer.
Rosebrock, A. (2019). Deep Learning for Computer Vision with Python. PyImageSearch.com, 2.1.0 edition.
Sun, J., Yuan, L., Jia, J., and Shum, H.-Y. (2005). Image completion with structure propagation. In ACM Transactions on Graphics (ToG), volume 24, pages 861–868. ACM.
Ulyanov, D., Lebedev, V., Vedaldi, A., and Lempitsky, V. S. (2016). Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349–1357.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408.
Wang, Q., Fan, H., Sun, G., Cong, Y., and Tang, Y. (2019). Laplacian pyramid adversarial network for face completion. Pattern Recognition, 88:493–505.
Yan, Z., Li, X., Li, M., Zuo, W., and Shan, S. (2018). Shift-net: Image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1–17.
Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., and Li, H. (2017). High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3.
Yap, M. H., Alarifi, J., Ng, C.-C., Batool, N., and Walker, K. (2018). Automated facial wrinkles annotator. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0.
Yeh, R. A., Chen, C., Lim, T.-Y., Schwing, A. G., Hasegawa-Johnson, M., and Do, M. N. (2017). Semantic image inpainting with deep generative models. In CVPR, volume 2, page 4.
Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S. (2018). Generative image inpainting with contextual attention. arXiv preprint.
Zhao, H., Gallo, O., Frosio, I., and Kautz, J. (2016). Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57.