Learning visual representations from large unlabeled datasets has been an active area of research in computer vision. In this context, the goal is to learn a representation that describes the salient semantic features of an image. A method that can learn such a representation may be adopted by a variety of downstream learning tasks such as visualization, regression, and classification. Generative models, and more specifically Generative Adversarial Networks (GANs), are among the most powerful techniques for unsupervised representation learning. The underlying belief of generative frameworks is that the ability to synthesize observed data entails some form of understanding of it.
In practice, however, GANs are not able to learn a meaningful representation of the training dataset without additional constraints. To sidestep this problem, a long line of work has proposed frameworks that learn interpretable and meaningful latent representations either in an unsupervised setting, such as InfoGAN [3], BiGAN [5], and ALI [6], or in a supervised setting [1, 30].
Despite all the effort in this area, these approaches ignore one of the most fundamental principles of image generation, which is the disentanglement of the scene's content and style. A scene's content, here, represents its underlying geometry, while the style encodes the texture and illumination. Wang et al. [35] propose to decompose the GAN latent code into two separate structure and style codes. However, their method requires depth information and does not generalize to RGB datasets. In contrast, in this paper we propose a universal GAN framework, which does not need depth information, to learn disentangled content and style codes. Factoring the style and content enhances the interpretability of the learned representations compared to their counterparts, improves the performance of learning tasks that rely solely on the content or the style representation, and enables the user to generate a specific scene in a variety of styles, or several images of a particular style.
Learning disentangled style and content codes is a challenging task. Several algorithms have been developed for solving this class of problems; however, they train the GAN in a supervised fashion or in its conditional setting. In addition, these frameworks address the problem of image-to-image translation: an image of a source domain, e.g., edges, represents the content, and a style code is learned for the target domain. The field therefore still lacks a coherent framework for unsupervised learning of disentangled style and content representations. Motivated by this, we propose an end-to-end framework to learn a pair of disentangled content and style codes for a given dataset. Note that our framework assumes that each image of the training dataset can be decomposed into content and style codes.
Our framework consists of a single generator, in contrast to [35], whose input is the content code. The generator has several upsampling and convolutional layers and a set of residual blocks. Inspired by recent work showing that the parameters of the affine transformation in normalization layers represent styles, we equip each residual block with an Adaptive Instance Normalization (AdaIN) layer. The parameters of the AdaIN layers are then generated by a multilayer perceptron (MLP) whose input is the desired style code. We develop a training strategy to train the MLP and the generator jointly, and discuss the considerations required for its success. Note that, with minimal effort, the proposed framework can be adopted by any GAN structure. Finally, by learning to invert the generator, similar to BiGAN [5], we can extend the applications of our framework to transferring style or content among images of the training domain.
The main contributions of our work are fourfold:
• We present the Style and Content Disentangled GAN (SC-GAN) that learns to disentangle style and content representations of the data in an unsupervised fashion.
• We propose a set of losses and a training scheme based on them to train the SC-GAN.
• We extend the BiGAN model using the proposed SC-GAN framework, which enables us to transfer style and content between data samples.
• We present qualitative and quantitative experiments showing that the proposed framework can effectively disentangle the style and content and improve the performance of supervised tasks.
2.1. Generative Adversarial Network (GAN)
Building generative models that are able to model high-dimensional data distributions is a fundamental problem within many computer vision applications, such as face generation [17], image-to-image translation [14], image editing [4], image in-painting [37], and speech synthesis [10]. Currently, the most prominent approaches are Generative Adversarial Networks (GANs) [9], Variational Autoencoders (VAEs) [19], and Auto-Regressive Generative Models [32]. These models capture the joint distribution between the data and a set of hidden variables, called latent codes, representing different variations of the data. The trained models then generate new samples in the training domain, given random latent codes sampled from their prior distributions. Prior works conditioned these models on additional information to direct the data generation process. The conditioning could be on another image for image-to-image translation, on part of an image for in-painting, on some desired data attributes [36], or even on class labels [28]. Although these methods produce impressive photorealistic images, they fail to learn an interpretable representation of the data.
InfoGAN [3], an information-theoretic extension of GAN, allows learning a partially interpretable representation. The resulting code consists of a meaningful part corresponding to specific semantic attributes of the data, and a random part which injects diversity among the generated samples. In contrast, two concurrent independent works [5, 6] proposed a full inference of the random code and demonstrated that these codes can learn the semantic attributes of the data. Several other papers have investigated supervised representation learning by conditioning the discriminator on specific attributes [1, 30]. Transferring attributes among images has also been studied in the literature [4, 13].
Despite their success, these methods overlook a key principle of image generation, which is the disentanglement of style and content. Learning distinct style and content codes is studied in [41, 21] for image-to-image translation. However, these methods do not learn the underlying content code and use an image of another domain as the reference content. Wang et al. [35] study this problem in the unconditional setting to learn a disentangled representation of the data. However, their method utilizes depth information to learn the content, which they refer to as the structure representation.
2.2. Neural Style Transfer
Recently, several methods have been developed to transfer style between images. Style transfer is a technique that renders the texture of a reference image while preserving the semantic content of the target image. The early work by Gatys et al. [8] showed that a deep neural network (DNN) can encode both the style and content information, and proposed an iterative method to transfer the style of an artistic image to an arbitrary photograph. To accelerate neural style transfer, several techniques perform stylization using feed-forward neural networks in a single forward pass [16, 33, 22]. However, these techniques are restricted to a single style and cannot adapt to an unseen arbitrary style. To sidestep this issue, Dumoulin et al. [7] propose conditional instance normalization to learn the normalization parameters of distinct styles. Li et al. [23] also guide the network to synthesize the desired style using a texture selector network. However, all these methods are either limited to transferring a small set of styles, or too slow for real-time applications.
Instance Normalization (IN) [34] has been found to carry the style information of an image [25, 8]. Inspired by the success of IN, Huang et al. [12] proposed Adaptive Instance Normalization (AdaIN) to align the mean and variance of the content features with those of the style features. Given feature activations of the content and style images, Chen et al. [2] replaced content features with the closest-matching style features patch-by-patch. A universal stylization is also presented in [24] based on the whitening and coloring transform, which stylizes images via feature projection.
In this section, we provide some rudiments of GANs and style transfer that are necessary to understand the proposed SC-GAN framework.
3.1. Generative Adversarial Networks (GANs)
GANs [9] are a family of generative models that learn the statistical distribution of the training data, allowing us to synthesize data samples by mapping a random noise vector $z$ to an output image $y = G(z)$, where $G$ is the generator network. The GAN in its conditional setting (cGAN), proposed in [15], learns a mapping from an input $x$ and a random noise $z$ to the output image $y$, using an autoencoder network as the generator. The generator, $G(x, z)$, is trained to generate an image that a discriminator network, $D$, cannot distinguish from "real" samples. Simultaneously, the discriminator learns, adversarially, to discriminate between the "fake" images produced by the generator and the real samples from the training dataset. Consequently, the GAN objective function is defined as
$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right],$$
where $G$ attempts to minimize it and $D$ tries to maximize it.
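As a small numerical illustration of the adversarial game, the sketch below estimates the standard GAN value $\mathbb{E}[\log D(\text{real})] + \mathbb{E}[\log(1 - D(\text{fake}))]$ from discriminator outputs. It assumes that the discriminator returns probabilities in (0, 1); the function name `gan_value` and the sample values are hypothetical and only meant to show why $D$ pushes the value up while $G$ pushes it down.

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E[log D(real)] + E[log(1 - D(fake))].

    d_real: discriminator probabilities assigned to real samples.
    d_fake: discriminator probabilities assigned to generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident discriminator drives the value towards its maximum of 0 ...
confident = gan_value([0.99, 0.98], [0.01, 0.02])
# ... while a fooled discriminator (0.5 everywhere) yields 2 * log(0.5).
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The generator lowers this value by producing samples on which the discriminator outputs higher probabilities.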
3.2. Style Transfer
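Replacing Batch Normalization (BN) layers with IN layers can significantly improve style transfer networks [34]. Given $x \in \mathbb{R}^{N \times C \times H \times W}$, an input tensor containing a batch of $N$ images of size $H \times W$ with $C$ channels, IN is given by:
$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta,$$
where $\gamma, \beta \in \mathbb{R}^{C}$ are affine parameters learned from data, and $\mu(x)$ and $\sigma(x)$ are computed across spatial dimensions independently for each channel and each sample:
$$\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}, \qquad \sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{nchw} - \mu_{nc}(x) \right)^{2} + \epsilon},$$
with $\epsilon$ a small constant for numerical stability.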
Note that, in contrast to BN, which usually replaces mini-batch statistics with population statistics at test time, IN behaves identically at training and test time. Instance normalization performs a form of style normalization by normalizing feature statistics, namely the mean and variance. In other words, it normalizes the input to a single style specified by its affine parameters. To adapt it to arbitrarily given styles, Huang et al. [12] proposed Adaptive IN (AdaIN), which employs adaptive affine transformations:
$$\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y),$$
where $x$ is a content input and $y$ is a style input. Unlike IN, AdaIN has no affine parameters to learn. It simply scales and shifts the normalized content input by $\sigma(y)$ and $\mu(y)$, respectively.
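The two normalization layers described above can be sketched in a few lines. The snippet is a minimal illustration for a single sample, assuming that each channel is stored as a flat list of its $H \times W$ activations and that a small constant `EPS` is added to the variance; the helper names are hypothetical and this is not the implementation used in the paper.

```python
import math

EPS = 1e-5  # assumed small constant for numerical stability

def channel_stats(values):
    """Mean and standard deviation over the spatial positions of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + EPS)

def instance_norm(x, gamma, beta):
    """IN for one sample: x is a list of channels, each a flat list of values."""
    out = []
    for channel, g, b in zip(x, gamma, beta):
        mean, std = channel_stats(channel)
        out.append([g * (v - mean) / std + b for v in channel])
    return out

def adain(x, y):
    """AdaIN: normalize content x, then scale and shift with the statistics of style y."""
    out = []
    for cx, cy in zip(x, y):
        mean_x, std_x = channel_stats(cx)
        mean_y, std_y = channel_stats(cy)
        out.append([std_y * (v - mean_x) / std_x + mean_y for v in cx])
    return out

content = [[1.0, 2.0, 3.0, 4.0]]    # one channel, 2x2 map flattened
style = [[10.0, 10.0, 30.0, 30.0]]  # one channel, 2x2 map flattened
stylized = adain(content, style)
```

After the call, the stylized channel keeps the ordering of the content activations but carries the mean (20) and, up to the `EPS` term, the standard deviation (10) of the style channel.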
4.1. Network Architecture
Our generator, G, consists of four different blocks: decoder #1, the residual (stylization) blocks, decoder #2, and an MLP (see Figure 1). The first decoder processes the content code through several upsampling and convolutional layers. All the convolutional layers of this decoder are followed by IN. The stylization block comprises a set of residual blocks which are responsible for imposing the requested style on the output image. Inspired by recent works that use affine transformation parameters in normalization layers to represent styles [12], we equip the residual blocks with AdaIN layers whose parameters are dynamically generated by the MLP block from a style code. Finally, decoder #2 constructs an image through several upsampling and convolutional layers. Since IN removes the original feature mean and variance that represent essential style information, the second decoder is equipped with batch normalization.
To adapt any GAN framework to SC-GAN, only the generator needs to be modified; the discriminator can be left intact. Consequently, we do not change the original discriminators of the GAN frameworks used in this paper.
4.2. Training the SC-GAN
The proposed SC-GAN takes a random code $z = [z_c, z_s]$, composed of a content code $z_c$ and a style code $z_s$, as input and synthesizes an output image, $G(z)$. However, the network needs a mechanism to learn to relate the content of the generated image to the content code and its style to the style code. In particular, the content of the generated image is supposed to remain intact as long as $z_c$ remains unaltered, regardless of the value of $z_s$, and vice versa. In order to train the generator to learn these relations, a pair of content codes, $(z_c^1, z_c^2)$, and a pair of style codes, $(z_s^1, z_s^2)$, are drawn randomly. We then generate four images using the following four random codes:
$$z_{11} = [z_c^1, z_s^1], \quad z_{12} = [z_c^1, z_s^2], \quad z_{21} = [z_c^2, z_s^1], \quad z_{22} = [z_c^2, z_s^2].$$
Since the content code is shared between $z_{11}$ and $z_{12}$, the network needs to synthesize two images with the same content for these codes. Nevertheless, since the style codes are different, there should be a force on the network to render different styles on the same content. The same discussion is also valid for $z_{21}$ and $z_{22}$. Similarly, for $z_{11}$ and $z_{21}$ (or $z_{12}$ and $z_{22}$), the rendered style on the generated images should be consistent while their contents differ significantly from each other. To this end, motivated by [33], we define our content and style consistency losses.
Figure 1: The proposed SC-GAN framework consists of a generator G, a discriminator D, and a pre-trained VGG-19 that extracts features for the content and style consistency losses. The MLP dynamically changes the parameters of the AdaIN layers in the stylization block to render different styles on the synthesized image.
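The sampling step described above can be sketched as follows. The snippet assumes a standard normal prior for both codes and concatenation as the way content and style parts are combined; the function names, the code dimensions, and the zero-based indices are hypothetical choices made for illustration.

```python
import random

def sample_code(dim, rng):
    """Draw a code from a standard normal prior (the prior is an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def make_code_quadruple(content_dim, style_dim, rng):
    """Build the four codes obtained from two content codes and two style codes."""
    zc = [sample_code(content_dim, rng) for _ in range(2)]
    zs = [sample_code(style_dim, rng) for _ in range(2)]
    # codes[(i, j)] concatenates content code i with style code j.
    return {(i, j): zc[i] + zs[j] for i in range(2) for j in range(2)}

rng = random.Random(0)
codes = make_code_quadruple(content_dim=4, style_dim=3, rng=rng)
```

Here `codes[(0, 0)]` and `codes[(0, 1)]` share their content part and differ in style, whereas `codes[(0, 0)]` and `codes[(1, 0)]` share their style part and differ in content, which gives the pairs needed by the consistency losses.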
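Content Consistency Loss: To learn complex cross-domain relationships, we impose the Euclidean distance on the high-level feature space as the cycle consistency, which is known as the perceptual loss [16]. Similar to [16], we make use of a pretrained VGG-19 network, $\phi$, as a fixed loss network. This loss network defines a feature reconstruction loss that measures differences in high-level content between images. Let $\phi_j(x)$ denote the feature maps of the $j$-th layer of the loss network for the input image $x$. Then the $j$-th layer content loss between images $x$ and $y$ is defined as:
$$\mathcal{L}_{content}^{j}(x, y) = \frac{1}{N_j} \left\| \phi_j(x) - \phi_j(y) \right\|_2^2,$$
where $N_j$ is the number of perceptrons in the $j$-th layer. Consequently, to train the generator to keep the content consistent between $G(z_{11})$ and $G(z_{12})$, the two images generated from codes that share the content code $z_c^1$ (with $z_{ik} = [z_c^i, z_s^k]$), we define the content-consistency loss as:
$$\mathcal{L}_{CC} = \mathbb{E}_{z} \left[ \mathcal{L}_{content}^{j} \left( G(z_{11}), G(z_{12}) \right) \right].$$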
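A per-layer content loss of this kind can be sketched as a mean squared difference between two feature maps. The snippet assumes that the activations of one loss-network layer have already been extracted and flattened into lists of equal length; `content_loss` and the toy activations are hypothetical.

```python
def content_loss(feat_x, feat_y):
    """Squared Euclidean distance between two flattened feature maps,
    divided by the number of units in the layer."""
    if len(feat_x) != len(feat_y):
        raise ValueError("feature maps must have the same number of units")
    n_units = len(feat_x)
    return sum((a - b) ** 2 for a, b in zip(feat_x, feat_y)) / n_units

# Hypothetical flattened activations of one layer for two generated images.
phi_a = [0.0, 1.0, 2.0, 3.0]
phi_b = [0.0, 1.0, 2.0, 5.0]
loss = content_loss(phi_a, phi_b)  # (5 - 3)^2 / 4 = 1.0
```

For two images generated from the same content code, the generator is trained to drive this quantity towards zero.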
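Style Consistency Loss: We also wish to penalize differences in style: colors, textures, common patterns, etc. To achieve this effect, Gatys et al. [8] proposed a style reconstruction loss. As above, let $\phi_j(x)$ be the activations at the $j$-th layer of the loss network for the input $x$, which is a feature map of size $C_j \times H_j \times W_j$. By reshaping $\phi_j(x)$ into a matrix $\psi$ of size $C_j \times H_j W_j$, the Gram matrix $\mathcal{G}_j(x)$, which is of size $C_j \times C_j$, can be computed efficiently as $\mathcal{G}_j(x) = \psi \psi^{T} / (C_j H_j W_j)$. The style reconstruction loss is then the squared Frobenius norm of the difference between the Gram matrices of the output and target images:
$$\mathcal{L}_{style}^{j}(x, y) = \left\| \mathcal{G}_j(x) - \mathcal{G}_j(y) \right\|_F^2.$$
Finally, to train the generator to share the same style between $G(z_{11})$ and $G(z_{21})$, the two images generated from codes that share the style code $z_s^1$, we define the style-consistency loss as:
$$\mathcal{L}_{SC} = \mathbb{E}_{z} \left[ \sum_{j \in L} \mathcal{L}_{style}^{j} \left( G(z_{11}), G(z_{21}) \right) \right],$$
where $L$ is the set of layers over which the style-consistency loss is computed.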
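The Gram-matrix computation and the resulting style loss can be sketched as below. The snippet assumes that the activations of one layer are given as $C$ rows of $H \cdot W$ values and that the Gram matrix is normalized by $C \cdot H \cdot W$, as in [16]; the function names and toy features are hypothetical.

```python
def gram_matrix(features):
    """features: C rows, each a flat list of H*W activations. Returns a C x C matrix."""
    c = len(features)
    hw = len(features[0])
    norm = c * hw
    return [[sum(a * b for a, b in zip(features[i], features[k])) / norm
             for k in range(c)] for i in range(c)]

def style_loss(feat_x, feat_y):
    """Squared Frobenius norm of the difference between the two Gram matrices."""
    gx, gy = gram_matrix(feat_x), gram_matrix(feat_y)
    return sum((a - b) ** 2
               for row_x, row_y in zip(gx, gy)
               for a, b in zip(row_x, row_y))

# Applying the same spatial permutation to every channel changes the layout
# of the activations but leaves the Gram matrix unchanged.
feat = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
shuffled = [[4.0, 3.0, 2.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
same_style = style_loss(feat, shuffled)
```

The example shows why Gram matrices are a reasonable style descriptor: they discard where features occur and keep only how channels co-activate.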
Diversity Loss: Early experiments showed that employing only the style and content consistency losses leads the network to learn a limited number of styles. To sidestep this issue, we propose to enforce a minimum distance in the style or content metric if two images do not share the same style or content code, respectively. In other words, since the style codes differ between $z_{11}$ and $z_{12}$, the network needs to synthesize two images with at least a minimum style difference. To this end, we define the following loss function:
$$\mathcal{L}_{SD} = \mathbb{E}_{z} \left[ \max \left( 0, \; \alpha - \sum_{j \in L} \mathcal{L}_{style}^{j} \left( G(z_{11}), G(z_{12}) \right) \right) \right],$$
where $\alpha$ is a margin which is greater than 0 and indicates that dissimilar pairs that are beyond the margin will not contribute to the loss. A similar loss function can be defined for codes with dissimilar content parts and a shared style, such as $z_{11}$ and $z_{21}$:
$$\mathcal{L}_{CD} = \mathbb{E}_{z} \left[ \max \left( 0, \; \alpha - \mathcal{L}_{content}^{j} \left( G(z_{11}), G(z_{21}) \right) \right) \right].$$
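The margin behaviour described above corresponds to a hinge on the style (or content) distance between two images. A minimal sketch, assuming the distance has already been computed and using hypothetical names and values:

```python
def diversity_loss(distance, margin):
    """Hinge penalty: positive only while the pair is closer than the margin."""
    if margin <= 0:
        raise ValueError("margin must be greater than 0")
    return max(0.0, margin - distance)

# A pair that is too similar is penalized ...
too_close = diversity_loss(distance=0.25, margin=1.0)    # 0.75
# ... while a pair beyond the margin does not contribute to the loss.
far_enough = diversity_loss(distance=1.5, margin=1.0)    # 0.0
```

Because the penalty vanishes beyond the margin, the term only pushes apart pairs that are rendered too similarly and does not reward arbitrarily large differences.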
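Total Loss: We solve the problem of style and content disentanglement by training the generator G to minimize the following objective function, which combines the GAN loss with the content-consistency ($\mathcal{L}_{CC}$), style-consistency ($\mathcal{L}_{SC}$), and diversity ($\mathcal{L}_{SD}$, $\mathcal{L}_{CD}$) losses:
$$\min_{G} \; \mathcal{L}_{GAN} + \lambda_1 \mathcal{L}_{CC} + \lambda_2 \mathcal{L}_{SC} + \lambda_3 \mathcal{L}_{SD} + \lambda_4 \mathcal{L}_{CD},$$
where the hyper-parameters $\lambda_i$ control the impact of each term in the objective function. This objective function forces the generator to disentangle its input code into content and style codes. Note that all the blocks of the generator are trained jointly using the proposed objective function. The discriminator is then trained to maximize the GAN term $\mathcal{L}_{GAN}$.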
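The generator objective is a weighted sum of the adversarial term and the auxiliary terms. The sketch below only illustrates that combination; it assumes one weight per auxiliary term, and the dictionary keys, weights, and loss values are hypothetical placeholders rather than settings reported in the paper.

```python
def generator_objective(gan_loss, consistency, diversity, weights):
    """Weighted sum of the adversarial term and the auxiliary loss terms.

    consistency, diversity: dicts with keys "content" and "style".
    weights: dict with hypothetical keys "cc", "sc", "sd", "cd".
    """
    return (gan_loss
            + weights["cc"] * consistency["content"]
            + weights["sc"] * consistency["style"]
            + weights["sd"] * diversity["style"]
            + weights["cd"] * diversity["content"])

# Illustrative values only.
total = generator_objective(
    gan_loss=1.0,
    consistency={"content": 0.5, "style": 0.25},
    diversity={"style": 0.0, "content": 2.0},
    weights={"cc": 1.0, "sc": 2.0, "sd": 1.0, "cd": 0.5},
)
```

In a training loop, the generator would take a gradient step on this total, while the discriminator is updated on the adversarial term alone.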
In this paper, we employ a Least Squares Generative Adversarial Network (LSGAN) [27] as the baseline. The LSGAN generator includes six strided convolutional layers. All the layers double the spatial size of their inputs except the first layer, which quadruples it. We modify the LSGAN to create SC-GAN by adding a stylization block before its last strided convolutional layer. The stylization block consists of four residual blocks, each equipped with an AdaIN layer. We also replace the normalization layers of the LSGAN as described in Section 4.1. The style code MLP comprises five fully-connected layers with ReLU activation functions.
5.1. Experimental setup
The main goal of our experiments is to investigate whether the proposed method can learn disentangled style and content representations of the data. The framework should be able to learn a content code which controls the geometrical information of the generated image and has no effect on the rendered styles. Similarly, a style representation should be learned to control only the style of the synthesized images. We evaluate SC-GAN on the edges→handbags [40], edges→shoes [38], CelebA [26], LSUN bedroom and kitchen [31], and our own face datasets. Our face dataset has 3,000 high-quality face images. We conducted a comprehensive set of experiments to evaluate the proposed SC-GAN qualitatively and quantitatively.
Figure 2: The proposed SC-GAN framework for style transfer incorporating BiGAN.
Disentangled style and content image generation employing LSGAN [27]: We employ LSGAN in its unconditional setting to conduct the first series of our experiments, as it can efficiently generate realistic and diverse images in different domains. Our proposed framework is nevertheless applicable to any other GAN model. In this work, we only modify the generator of LSGAN by adding a stylization block, consisting of four residual blocks, before its last convolutional layer. As mentioned before, all the layers of this generator, up to the stylization block, are equipped with IN layers.
Style transfer using BiGAN [5] and LSGAN: Bidirectional Generative Adversarial Network (BiGAN) is an unsupervised feature learning framework. In addition to the generator G, it makes use of an encoder E, which maps the data x to its latent representation z. Its discriminator also differs from the original GAN framework in that it does not discriminate in data space (x versus G(z)). Instead, it discriminates jointly in the data and latent spaces (tuples (x, E(x)) versus (G(z), z)). In other words, the BiGAN encoder E learns to invert the generator G. Our early experiments show that this framework, at least for the task of style code retrieval, can benefit from a code cycle-consistency loss, which penalizes the discrepancy between a sampled code z and its reconstruction E(G(z)).
In order to generate high-quality images, we incorporate the BiGAN framework into the generator of LSGAN. The learned encoder can then be used to retrieve the style code of a given image. The extracted style code may be used, together with a content code, to generate an image with an arbitrary content and the style of a reference image. We use two separate encoders, $E_c$ and $E_s$, for content and style retrieval, respectively (see Figure 2). The content encoder downsamples the input image using several strided convolutional layers, each followed by an IN layer. The style encoder follows the same structure but is not equipped with IN, as IN removes the style information.
Figure 3: The results of our framework on different datasets. The content code is fixed in each row while the style code varies. Similarly, the style code is fixed in each column while the content code varies.
We train our proposed networks using the Adam optimizer [18], with a learning rate of 0.0002 and a mini-batch size of 20. The algorithm is implemented in PyTorch [29]. The SC-GAN is quite robust to the values of its hyper-parameters: a single setting works for all the datasets.
5.2. Qualitative Analysis
Figure 3 shows how the generated images change when the style or content code varies. The content code is fixed in each row while the style code differs between columns. Similarly, the style code is fixed in each column. Our framework is able to train a network to generate multiple realistic images with fixed content and different styles without requiring any supervision. Figure 3 clearly illustrates the disentanglement of the content and style representations for different datasets.
Furthermore, using the proposed style transfer scheme, we can transfer style information from a reference image to an image generated from a random content code. To this end, we jointly train an encoder, E, with the generator to retrieve the style code of an image from the training domain. Then, instead of sampling from the distribution of style codes, we feed the reference image to the learned encoder and use the style code it extracts. The extracted code can then be utilized for image generation guided by the style of the reference image. Figure 4 shows the results of using style codes extracted from multiple reference images to generate realistic photos in different domains.
5.3. Quantitative Evaluation
To evaluate the proposed method quantitatively, we use two different metrics, namely the Fréchet Inception Distance (FID) [11] and the Learned Perceptual Image Patch Similarity (LPIPS) [39]. We trained the original LSGAN and our SC-GAN extension of it on multiple datasets and then compared their performances based on the FID and LPIPS metrics.
Figure 4: Training SC-GAN with an encoder to retrieve the style code enables us to transfer style from a given reference image to the generated samples.
Table 1: Comparison of LSGAN and SC-GAN, with and without Diversity Loss, on different datasets using the FID and LPIPS scores as quantitative metrics. Lower FID score means higher quality, and higher LPIPS shows more diversity among generated samples.
The FID is a recently proposed metric to evaluate the quality of generative models [11]. It directly measures the distance between the synthetic data distribution $p(\cdot)$ and the real data distribution $p_r(\cdot)$. To calculate the FID, images are encoded with visual features from a pre-trained Inception model:
$$\mathrm{FID} = \left\| m - m_r \right\|_2^2 + \mathrm{Tr} \left( C + C_r - 2 \left( C C_r \right)^{1/2} \right),$$
where $(m, C)$ and $(m_r, C_r)$ denote the mean and covariance of the feature embeddings for synthetic and real data, respectively. Note that a lower FID value indicates a smaller distance between the synthetic and real data distributions. We calculate the FID over 10k randomly generated samples. Table 1 lists the FID scores of LSGAN and our proposed SC-GAN method. To investigate the effect of the diversity loss, we train our SC-GAN model with and without it. Compared to LSGAN, our SC-GAN achieves a slightly better FID. The SC-GAN without diversity loss, however, shows almost the same FID score as LSGAN. This suggests that the diversity loss improves the FID score by increasing the diversity of the generated sample styles. These results indicate that the proposed disentanglement of style and content representations comes at no cost in terms of the quality of the synthesized images.
The observations regarding the diversity of generated samples are confirmed by the LPIPS distance. Table 1 presents the LPIPS distance for LSGAN as well as SC-GAN with and without diversity loss. The LPIPS distance is calculated as the average distance between 2000 pairs of randomly generated output images, in the deep feature space of a pre-trained AlexNet [20]. Without the diversity loss, our model suffers from partial mode collapse, in which many style codes render the same texture on the output images. Employing the diversity loss, however, leads the SC-GAN to generate more diverse images. A visual comparison of SC-GAN with and without diversity loss in Figure 5 confirms the LPIPS scores. Note that the proposed SC-GAN has a higher LPIPS score than the LSGAN while it is still able to generate images of almost the same quality.
Table 2: FID score for different generator architectures. The structures differ only in how early we use the stylization block. Clearly, using the stylization block in the early layers reduces the performance of the GAN.
We have also investigated different generator architectures to find the best place for the stylization block. The studied architectures differ only in how early we use the stylization block. We tried three different architectures: c5-r4-c1, c4-r4-c2, and c3-r4-c3. Here, ci-rj-ck denotes a generator with decoder #1, residual blocks, and decoder #2 having i, j, and k layers, respectively. Table 2 shows the FID for these architectures on three different datasets. Clearly, using the stylization block in the first layers reduces the performance of the SC-GAN. We achieved the best performance by placing the stylization block before the last convolutional layer (c5-r4-c1). Note that all the results in Table 1 are reported for this architecture.
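The ci-rj-ck naming can be made concrete with a small parser. This is only a sketch for reading the architecture names used in the ablation; the `GeneratorSpec` type and `parse_spec` function are hypothetical helpers and are not part of the paper's code.

```python
import re
from typing import NamedTuple

class GeneratorSpec(NamedTuple):
    decoder1_layers: int   # convolutional layers before the stylization block
    residual_blocks: int   # AdaIN-equipped residual blocks
    decoder2_layers: int   # convolutional layers after the stylization block

def parse_spec(name):
    """Parse an architecture name of the form 'ci-rj-ck'."""
    match = re.fullmatch(r"c(\d+)-r(\d+)-c(\d+)", name)
    if match is None:
        raise ValueError(f"not a ci-rj-ck architecture name: {name!r}")
    return GeneratorSpec(*(int(group) for group in match.groups()))

specs = [parse_spec(name) for name in ("c5-r4-c1", "c4-r4-c2", "c3-r4-c3")]
```

All three variants keep six convolutional layers and four residual blocks in total, so the comparison isolates the position of the stylization block.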
In this paper, we introduced SC-GAN, a framework for style and content disentangled representation learning using generative adversarial networks in an unsupervised setting. In contrast to previous works, our approach learns distinct content and style codes for a given dataset, which enables us to generate multiple images of a scene with different styles and textures. We also proposed to retrieve the style code of a given reference image, which may later be used to transfer its style to a generated image. The proposed SC-GAN can easily be adopted by any GAN framework. Extensive quantitative and qualitative results demonstrate that our proposed method can learn to disentangle representations of style and content while slightly improving the quality of the generated images.
[1] G. Antipov, M. Baccouche, and J. L. Dugelay. Face aging with conditional generative adversarial networks. In 2017 IEEE International Conference on Image Processing (ICIP), pages 2089–2093, Sept 2017.
[2] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
Figure 5: Results of a SC-GAN trained on LSUN-bedroom dataset with and without diversity loss. Employing the loss improves the style diversity.
[4] H. Ding, K. Sricharan, and R. Chellappa. Exprgan: Facial expression editing with controllable expression intensity. AAAI, 2018.
[5] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[6] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[7] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. Proc. of ICLR, 2017.
[8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[10] W.-L. Hao, Z. Zhang, and H. Guan. Cmcgan: A uniform framework for cross-modal visual-audio mutual generation. AAAI, 2018.
[11] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[12] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1510–1519, 2017.
[13] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. arXiv preprint arXiv:1804.04732, 2018.
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[17] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ICLR, 2018.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] D. P. Kingma and M. Welling. Auto-encoding variational bayes. NIPS, 2014.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[21] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1808.00948, 2018.
[22] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[23] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In Proc. CVPR, 2017.
[24] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.
[25] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
[26] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[27] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.
[28] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
[30] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
[31] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[32] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[33] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349–1357, 2016.
[34] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis.
[35] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.
[36] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
[37] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. 2017.
[38] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, 2014.
[39] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.
[40] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[41] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.