In image-to-image translation tasks, mappings between two visual domains are learnt. Various computer vision and graphics problems are addressed and formulated using the image-to-image translation framework, including super-resolution [30, 28], colorization [29, 49], inpainting [38, 21], style transfer [23, 34] and photorealistic image synthesis [22, 6, 46]. In the photorealistic image synthesis problem, images are generated from abstract semantic label maps such as pixel-wise segmentation maps or sparse landmarks. In this paper, we study the problem of example-guided image synthesis. Given an input semantic label map x and a guidance image I, the goal is to synthesize a photorealistic image y which is semantically consistent with the label map x, while being style-consistent with the exemplar I. Style consistency is automatically determined: in portraits, style consistency refers to the fact that we want our synthetic output to be plausibly of the same genetic type as an input exemplar; in full body images, style consistency means the same clothing; and in street scenes it includes such things as the same weather and time of day. Representative applications are shown in Figure 1.
Example-based image synthesis cannot be solved with a straightforward combination of photorealistic image synthesis based on pix2pixHD [22, 46] and style transfer [34]: the style of the input exemplar is not well kept in the synthetic result (see Figure 14). Recently, example-guided image-to-image translation frameworks [20, 31, 2] have been proposed that use a disentangled model to represent content and style, or identity and attributes; however, they fail to synthesize photorealistic results from abstract semantic label maps. The challenges are multi-fold: first, the ground truth photorealistic result for each label map given an arbitrary exemplar is not available for training; second, the synthetic results should be photorealistic while semantically consistent with the source label maps; last but not least, the synthetic result should be stylistically consistent with the corresponding image exemplar.
We present a method for this example-guided image synthesis problem with conditional generative adversarial networks. We build on the recent pix2pixHD [46] for image synthesis to ensure photorealism, with the crucial contributions of:
• a novel style consistency discriminator to enforce style consistency of a pair of images (see Section 3.2.2);
• an adaptive semantic consistency loss to ensure quality (see Section 3.2.3);
• a data sampling strategy that ensures we need only a weakly supervised approach for training (see Section 3.3).
Generative Adversarial Networks. In recent years, generative adversarial networks (GANs) [11, 1] for image generation have progressed rapidly [22, 46]. Driven by adversarial losses, generators and discriminators compete with each other: discriminators aim to distinguish the generated fake images from the target domain; generators try to fool discriminators. Technologies to improve GANs include: progressive GANs [19, 48, 24], training objective and process designs [42, 1, 37, 43], etc. In this paper, we use GANs for example-guided image generation with style consistency awareness.
Image-to-Image Translation and Photorealistic Image Synthesis. The goal of image-to-image translation is to translate images from a source domain to a target domain. Isola et al. [22] proposed the conditional GAN framework for various image-to-image translation tasks with paired images for supervision. Wang et al. [46] extended this work for high-resolution image synthesis and interactive manipulation. Recently, researchers proposed to solve the unsupervised image-to-image translation problem with cycle consistency to overcome the lack of paired training data [51, 25, 33, 52, 20, 31, 5]. Photorealistic image synthesis [6, 39, 46] is a specific application of image-to-image translation, where images are synthesized semantically from abstract label maps. Chen et al. [6] proposed a cascade framework to synthesize high-resolution images from pixel-wise labeling maps. Wang et al. [46] proposed a framework for instance-level image synthesis with conditional GANs.
Very recently, a few works [16, 20, 31, 35] have been proposed to transfer the style or attributes of an exemplar to the source image, where the images belong to photorealistic domains (aka domain adaptation). Our goal differs from these works in that we aim to synthesize photos from an abstract semantic label domain rather than a photorealistic image domain. Zheng et al. [50] proposed a clothes changing system to change the clothing of a person in an image. Chan et al. [4] presented a network to synthesize a dance video from a target dance video and a source exemplar video. Different from our model, it was trained for every input exemplar video. Ma et al. [36] proposed to synthesize person images from pose keypoints. We show in Section 4 that our method outperforms the state-of-the-art methods.
Style Transfer. Style transfer is a long-standing problem in computer vision and graphics, which aims to transfer the style of a source image to a target image or target domain. Some approaches [14, 10, 23, 34, 18, 32, 12, 5, 17] transfer style based on a single exemplar, whereas others learn a general style of a target domain in a holistic sense [51, 20, 31, 7]. Similar to our model, the PairedCycleGAN model [5] uses a style discriminator to distinguish whether a pair of facial images wear the same make-up in a make-up application. However, in their discriminator, the input image pair must be accurately aligned via warping, and a generator is learned for each facial component. Our style consistency discriminator, in contrast, provides a general solution for image synthesis from both sparse labels (e.g. sketch and pose) and pixel-wise dense labels (e.g. scene parsing).
In this section, we first review the baseline model pix2pixHD [46], then describe our method, a conditional generative adversarial network for synthesizing photorealistic images from semantic label maps given specific exemplars. Finally we show how to appropriately prepare training data for our framework.
3.1. The pix2pixHD Baseline
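The pix2pixHD [46] is a powerful image synthesis and interactive manipulation framework based on the pioneering conditional image-to-image translation method pix2pix [22]. Let x be a label map from a semantic label domain X; the goal of pix2pixHD is to synthesize an image y from x. It consists of a hierarchically integrated generator G and multi-scale discriminators D_1, ..., D_K to handle high-resolution synthesis tasks. The goal of the generator G is to translate semantic label maps to photorealistic images, and the objective of the discriminators is to distinguish generated fake images from real ones at different resolutions. The training dataset consists of pairs (x_j, y_j) of a label map x_j and the corresponding real image y_j.
The pix2pixHD optimizes a multi-task problem with a standard GAN loss L_GAN and a feature matching loss L_FM:
$$\min_G \Big( \Big( \max_{D_1, \dots, D_K} \sum_{k=1}^{K} L_{GAN}(G, D_k) \Big) + \lambda \sum_{k=1}^{K} L_{FM}(G, D_k) \Big),$$
where \lambda balances the two terms and L_GAN is the standard GAN loss given by:
$$L_{GAN}(G, D_k) = \mathbb{E}_{(x, y)}\big[\log D_k(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_k(x, G(x))\big)\big],$$
and L_FM is the feature matching loss given by:
$$L_{FM}(G, D_k) = \mathbb{E}_{(x, y)} \sum_{i=1}^{T} \frac{1}{N_i} \big\| D_k^{(i)}(x, y) - D_k^{(i)}(x, G(x)) \big\|_1,$$
where D_k^{(i)} is the i-th layer feature extractor of D_k, T is the number of layers and N_i is the number of elements in the corresponding discriminator layer. An optional perceptual loss is introduced as the L_1 loss between pre-trained VGG network [44] features.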
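To make the feature matching term concrete, the following is a minimal sketch in plain Python, assuming that each discriminator layer's activations have already been flattened into a list of floats. The function names `l1_mean` and `feature_matching_loss` are illustrative and do not come from the pix2pixHD code base.

```python
def l1_mean(real_layer, fake_layer):
    """L1 distance between two flattened feature maps, divided by their size."""
    assert len(real_layer) == len(fake_layer)
    return sum(abs(r - f) for r, f in zip(real_layer, fake_layer)) / len(real_layer)


def feature_matching_loss(real_feats, fake_feats):
    """Sum the size-normalized L1 distances over all discriminator layers."""
    return sum(l1_mean(r, f) for r, f in zip(real_feats, fake_feats))


# Toy activations for a two-layer discriminator (made-up numbers).
real = [[1.0, 2.0], [0.0, 0.0, 3.0]]
fake = [[1.0, 0.0], [0.0, 3.0, 3.0]]
loss = feature_matching_loss(real, fake)
```

In a real training loop the fake features would come from running the discriminator on the generator output, and the expectation would be approximated by averaging over a batch.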
One appealing feature of pix2pixHD is the instance-level image manipulation with a feature embedding technique. Given an instance-level segmentation map, pix2pixHD is able to synthesize an image with a specific appearance from an instance exemplar in the same object category. We will show that without the input instance-level pixel-wise segmentation map as a constraint, our model is still able to synthesize images with styles automatically transferred from exemplar images.
3.2. Our Model
Let I be a guidance image from a natural image domain Y. Our goal is to synthesize an image y from a semantic label map x and the guidance image I. The role of I is to provide a style constraint to image synthesis: the output image y must be style-consistent with the exemplar I. Our problem is more difficult than that solved by pix2pixHD. One particular challenge we face is that, given an input label map x, the ground truth images {y} for arbitrary guided style exemplars {I} are missing. To solve this weakly-supervised problem, we learn style consistency between pairs of images, which can be either style-consistent image pairs or style-inconsistent image pairs (see Section 3.3).
An overview of our method is illustrated in Figure 2. It builds upon a single-scale version of pix2pixHD, and contains: (i) a generator G, which takes a semantic map x, a style example I and its corresponding labels F(I) as input and outputs a synthetic image; (ii) a standard discriminator to distinguish real images from fake ones given conditional inputs; and (iii) a style consistency discriminator, which operates on image pairs from domain Y, to detect whether the synthetic image and the guidance image I are style-compatible. Here, F is an operator which, given an image, produces a set of semantic labels that represent the image (choices of F are given in Section 4.2); for convenience F(I) can be visualized as an image, provided the viewer recalls that the image contains semantic labels. Our objective function contains three losses: a standard adversarial loss; a novel adversarial style consistency loss; and a novel adaptive semantic consistency loss.
3.2.1 Standard Adversarial Loss
In the standard adversarial loss, the generator G tries to synthesize images that look similar to real images from image domain Y regardless of specific styles, while the standard discriminator, given an image conditioned on the corresponding label map, aims to determine whether the image is real or fake.
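For reference, a conditional adversarial loss of this kind is commonly written as below. This is a sketch under assumptions rather than the exact equation of this paper: the names L_Adv and D are placeholders for the standard adversarial loss and the standard discriminator, and the log form is shown even though Section 4.1 states that LSGANs [37] are used in practice.

```latex
\mathcal{L}_{\mathrm{Adv}}(G, D) =
  \mathbb{E}_{(x, y)}\big[\log D(x, y)\big]
  + \mathbb{E}_{(x, I)}\big[\log\big(1 - D(x, G(x, I, F(I)))\big)\big]
```

The discriminator maximizes this quantity and the generator minimizes it; the first term scores real label-image pairs and the second scores synthetic images paired with the label map they were generated from.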
3.2.2 Adversarial Style Consistency Loss
With the standard adversarial loss, the generator G is able to synthesize images matching the data distribution of domain Y; however, the synthetic results are not guaranteed to be style-consistent with the corresponding guidance I. We introduce the style consistency loss using a style consistency discriminator associated with a pair of images, which are either both real, or one real and one synthetic. The real pairs are either a pair of sampled real images from domain Y with the same style, or a pair of sampled real images from domain Y with different styles. We introduce the data sampling strategy in Section 3.3.

Figure 2: Overview of our framework consisting of a generator G and two discriminators. (a) Given an input label map, a guided example and its labels generated by a known function F, the generator G tries to synthesize an image semantically consistent with the labels, while being style-consistent with the exemplar. (b) The standard discriminator learns to distinguish between real and synthetic images on conditional input. (c) The style consistency discriminator aims to distinguish between style-consistent image pairs and style-inconsistent image pairs.

With the proposed adversarial style consistency loss, the style consistency discriminator tries to learn awareness of style consistency between a pair of images, while the generator G tries to fool it by generating an image with the same style as the exemplar I.
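The pairwise adversarial game described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: it uses the log form of the GAN objective, treats the (exemplar, synthetic) pair as a negative for the discriminator, and the function and argument names are hypothetical. The three arguments stand for discriminator outputs in (0, 1).

```python
import math


def style_consistency_value(d_consistent, d_inconsistent, d_synthetic):
    """Value the style consistency discriminator tries to maximize.

    d_consistent:   score for a real pair with the same style
    d_inconsistent: score for a real pair with different styles
    d_synthetic:    score for the pair (exemplar, synthetic image)
    """
    return (math.log(d_consistent)
            + math.log(1.0 - d_inconsistent)
            + math.log(1.0 - d_synthetic))


# A discriminator that separates the pair types scores higher than an undecided one.
sharp = style_consistency_value(0.9, 0.1, 0.1)
undecided = style_consistency_value(0.5, 0.5, 0.5)

# The generator lowers the value by making its output look style-consistent.
fooled = style_consistency_value(0.9, 0.1, 0.9)
```

Only the last term depends on the generator, so the generator's gradient comes entirely from how style-consistent the discriminator judges the synthetic image to be with the exemplar.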
3.2.3 Adaptive Semantic Consistency Loss
The semantic consistency loss is introduced to reconstruct an image from a label map in the semantic sense of, e.g., a sketch. It may appear that we could use the error between the input labels x and the labels predicted from the synthetic image G(x, I, F(I)), for example a distance between x and F(G(x, I, F(I))), or some variant thereof. However, different applications give distinct meanings to the semantic label maps, with the consequence that the gradient of the loss will, in general, vary between applications. This would mean selecting hyper-parameters to combine losses on a per-application basis.
We avoid this problem by always computing semantic consistency losses between images: the synthetic image G(x, I, F(I)) and, specifically, an image z which is a priori known to be consistent with a given semantic map x. Typically the image z is drawn from the training dataset and we have x = F(z). A particular issue with our adopted scheme is that such losses will drive the network output towards the image z, which by choice is photorealistic and semantically consistent with x. Such behavior would work perfectly when z and I are sampled from images with the same style, but could force the output away from the desired style when z and I are “style-wise” different, i.e. when (z, I) is a style-inconsistent pair.
Our solution is to use a novel adaptive VGG loss computed via a pre-trained model [23] between the synthetic image G(x, I, F(I)) and the real image z of label map x. An adaptive weighting scheme is proposed for per-layer VGG loss computation, to ensure the semantic consistency of the synthetic image to x. The loss sums, over the layers of the VGG network, the distance between the i-th layer features of the two images, scaled by an adaptive weight for the i-th layer and normalized by the number of elements in the i-th feature layer. The weights are set to increase the impact of details from shallow layers when z and I are from a style-consistent sampled pair, and to suppress the impact of detail matching for a style-inconsistent pair. The adaptive weighting scheme is illustrated in Figure 3.

Figure 3: Adaptive weight for semantic consistency loss.
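The adaptive per-layer weighting can be illustrated with a small sketch. The weight values below are hypothetical placeholders chosen only to show the mechanism (full weight on shallow layers for style-consistent pairs, reduced weight for style-inconsistent pairs); they are not the values used in the paper, and the L1 distance is likewise an assumption. Features are given as flattened lists, ordered from shallow to deep.

```python
# Hypothetical per-layer weights, shallow to deep; NOT the paper's values.
W_CONSISTENT = [1.0, 1.0, 1.0]
W_INCONSISTENT = [0.0, 0.5, 1.0]


def adaptive_semantic_loss(feats_synth, feats_real, style_consistent):
    """Weighted sum of size-normalized L1 feature distances over layers."""
    weights = W_CONSISTENT if style_consistent else W_INCONSISTENT
    total = 0.0
    for w, fs, fr in zip(weights, feats_synth, feats_real):
        total += w * sum(abs(a - b) for a, b in zip(fs, fr)) / len(fs)
    return total


# Toy features that differ only in the shallowest (detail) layer.
synth = [[1.0, 1.0], [2.0], [3.0]]
real = [[0.0, 0.0], [2.0], [3.0]]
loss_same_style = adaptive_semantic_loss(synth, real, style_consistent=True)
loss_diff_style = adaptive_semantic_loss(synth, real, style_consistent=False)
```

With these placeholder weights, a mismatch in fine detail is penalized only when z and I share a style, which is the behavior the weighting scheme is meant to achieve.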
Full Objective. The final loss combines the standard adversarial loss, the adversarial style consistency loss and the adaptive semantic consistency loss, where two weights control the relative importance of the terms. The generator is trained to minimize this objective against the two discriminators.

Figure 4: Representative sampled data for training networks using FaceForensics [41], YouTube Dances and BDD100K [47] datasets. Each row shows pairs of sampled images from the above three datasets.
3.3. Sampling Strategy for Style-consistent and Style-inconsistent Image Pairs
So far, we have introduced the core techniques of our network. However, one prerequisite of our method is to obtain style-consistent image pairs and style-inconsistent image pairs. The datasets used in prior image-to-image translation works [22, 46, 51, 31, 20] are therefore not suitable for our training.
A key idea for training data acquisition is to collect image pairs from videos. In face and dance synthesis tasks, we observed that: (i) within a short temporal period of a video, the style of the frame contents is ensured to be the same, and (ii) frames from different videos probably have different styles (e.g. different gender, hairstyles, skin colors and make-up in the face image synthesis application). We thus randomly sample pairs of frames within T = 10 frames from a video and regard them as style-consistent ones. For style-inconsistent pairs, we first randomly sample pairs of frames from different videos, then manually label whether the images from each sampled pair are style-consistent or not.
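The video-based sampling can be sketched as below. This is a minimal illustration that assumes "within T = 10 frames" means the two frame indices differ by at most 10; the function names are hypothetical, and pairs drawn from different videos are only candidates that still need the manual check described above.

```python
import random


def sample_consistent_pair(num_frames, max_gap=10, rng=random):
    """Pick two distinct frame indices of one video at most max_gap apart."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    first = rng.randrange(num_frames)
    lo = max(0, first - max_gap)
    hi = min(num_frames - 1, first + max_gap)
    second = rng.choice([i for i in range(lo, hi + 1) if i != first])
    return first, second


def sample_candidate_inconsistent_pair(video_lengths, rng=random):
    """Pick one frame from each of two different videos.

    The returned pair is a candidate only: it must still be labeled manually
    as style-consistent or style-inconsistent.
    """
    video_a, video_b = rng.sample(range(len(video_lengths)), 2)
    return ((video_a, rng.randrange(video_lengths[video_a])),
            (video_b, rng.randrange(video_lengths[video_b])))


rng = random.Random(0)
pair = sample_consistent_pair(300, max_gap=10, rng=rng)
candidate = sample_candidate_inconsistent_pair([100, 50, 80], rng=rng)
```

Passing an explicit `random.Random` instance keeps the sampled training pairs reproducible across runs.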
In the street view synthesis task, as large scale street view videos with different styles are not easy to collect, we use images from the BDD100K dataset [47]. In BDD100K, street view images and the weather, time of day attributes are provided. We coarsely categorize the images into 13 style groups based on the attributes, then sample style-consistent image pairs inside each group and sample style-inconsistent image pairs between groups. Figure 4 shows representative sampled pairs of images.
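The attribute-based grouping for street view images can be sketched in the same spirit. The text does not specify how the 13 style groups are formed, so grouping by the exact (weather, time of day) tuple is an assumption made for illustration, and the file names are made up.

```python
import random
from collections import defaultdict

# Toy records: (file name, weather, time of day); file names are hypothetical.
IMAGES = [
    ("a.jpg", "clear", "daytime"),
    ("b.jpg", "clear", "daytime"),
    ("c.jpg", "rainy", "night"),
    ("d.jpg", "rainy", "night"),
    ("e.jpg", "snowy", "daytime"),
]


def group_by_style(images):
    """Group image names by their (weather, time of day) attributes."""
    groups = defaultdict(list)
    for name, weather, time_of_day in images:
        groups[(weather, time_of_day)].append(name)
    return dict(groups)


def sample_pairs(groups, rng=random):
    """Return one style-consistent pair and one style-inconsistent pair."""
    eligible = [key for key, names in groups.items() if len(names) >= 2]
    consistent = tuple(rng.sample(groups[rng.choice(eligible)], 2))
    key_a, key_b = rng.sample(list(groups), 2)
    inconsistent = (rng.choice(groups[key_a]), rng.choice(groups[key_b]))
    return consistent, inconsistent


groups = group_by_style(IMAGES)
consistent, inconsistent = sample_pairs(groups, random.Random(1))
```

Unlike the video case, no manual labeling step is needed here, because the group membership already decides whether a pair counts as style-consistent.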
4.1. Implementation Details
We implement our model based on the single-scale pix2pixHD framework and experiment with images of a fixed size per task (street view synthesis uses its own size). The generator G contains several Convolution-InstanceNorm-ReLU-Stride-2 layers to encode deep features, then 9 residual blocks [13] and finally some Convolution-InstanceNorm-ReLU-Stride-0.5 layers to synthesize images. For both discriminators, we use PatchGANs [22] with several Convolution-InstanceNorm-LeakyReLU-Stride-2 layers, with the exception that InstanceNorm is not applied in the first layer. The slope for LeakyReLU is set as 0.2. The loss weights in Equation 7 are kept the same for all the experiments. All the networks are trained from scratch on an NVIDIA GTX 1080 Ti GPU using the Adam solver [27] with a batch size of 1. The learning rate is fixed at 0.0002 for the first 500K iterations and linearly decayed to zero over the next 500K iterations. We use LSGANs [37] for stable training. For more details, please refer to the supplementary material.
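The learning rate schedule stated above (constant for 500K iterations, then linear decay to zero over the next 500K) can be written as a small function. The function and parameter names are illustrative only; the numeric defaults are the ones given in the text.

```python
def learning_rate(iteration, base_lr=0.0002,
                  constant_iters=500_000, decay_iters=500_000):
    """Constant learning rate, followed by a linear decay to zero."""
    if iteration < constant_iters:
        return base_lr
    remaining = constant_iters + decay_iters - iteration
    return base_lr * max(0.0, remaining / decay_iters)


lr_start = learning_rate(0)
lr_halfway_decay = learning_rate(750_000)
lr_end = learning_rate(1_000_000)
```

The `max(0.0, ...)` guard keeps the rate at zero if training happens to run past the planned one million iterations.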
4.2. Datasets
We evaluate our method on face, dance and street view image synthesis tasks, using the following datasets:
SketchFace. We use the real videos in the FaceForensics dataset [41], which contains 854 videos of reporters broadcasting news. We use the image sampling strategy described in Section 3.3 to acquire training image pairs from video, then apply a face alignment algorithm [26] to localize facial landmarks, crop facial regions and resize them to a fixed size. The detected facial landmarks are connected to create face sketches, and this procedure serves as the function F.
PoseDance. We download 150 solo dance videos from YouTube, crop out the central body regions and resize them to a fixed size. As the number of videos is small, we evenly split each video into a first part and a second part along the timeline, then sample training data only from the first parts and testing data only from the second parts of all the videos. The function F is implemented using concatenated pre-trained DensePose [40] and OpenPose [3] pose detection results to provide pose labels.
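The per-video train/test split described for the dance videos can be sketched as follows. This is a minimal illustration assuming the split point is the middle frame; the function name is hypothetical.

```python
def split_video(num_frames):
    """Split frame indices into a first half (train) and second half (test)."""
    midpoint = num_frames // 2
    return range(0, midpoint), range(midpoint, num_frames)


train_frames, test_frames = split_video(101)
```

Because every training frame precedes every test frame of the same video, the sampling window used for style-consistent pairs never straddles the two subsets.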
Scene parsing→Street view. We use the BDD100K dataset [47] to synthesize street view images from pixel-wise semantic labels (i.e. scene parsing maps). We use the state-of-the-art scene parsing network DANet [9] as the function F. Please find more details in our supplementary material.
4.3. Baselines
We compare our method with the following algorithms:
Figure 5: Example-based face image synthesis on the FaceForensics dataset. The first column shows the input labels, the second column shows the input style example, next columns show the results from our method and our ablation studies, pix2pixHD, pix2pixHD with DPST, MUNIT and PairedMUNIT.
pix2pixHD [46] and pix2pixHD with DPST [34]. pix2pixHD is the image-to-image translation baseline. A default image is first synthesized using pix2pixHD, and its style is then transferred to that of the guided example using the Deep Photo Style Transfer (DPST) method.
MUNIT [20] and PairedMUNIT. MUNIT is the state-of-the-art unsupervised image-to-image translation method with disentangled content and style representations that are able to translate images to given exemplars. We modify MUNIT by integrating pairwise style information to the original model and adaptively computing losses with style (denoted as PairedMUNIT).
Ours without the semantic consistency loss, the style consistency adversarial loss, or the adaptive weights, for ablation studies. All of the methods are trained on the datasets introduced in Section 4.2.
4.4. Evaluation Metrics
Photorealism and Semantic Consistency. We use the Fréchet Inception Distance (FID) [15] to evaluate the realism and faithfulness of the synthetic results. This metric is widely used for implicit generative models, because it correlates with the visual quality of generated samples. Results with a smaller FID are often favored by human subjects. We further evaluate semantic consistency by translating the synthetic images back to the label domain and comparing the result with the input labels. For the SketchFace and PoseDance tasks, we use the labeling endpoint error (LEPE) between the input label map x and the labels generated by F from the synthetic image to compute the label accuracy. For the Scene parsing→Street view task, we use the scene parsing score (SPS) [9] on synthetic street view images to measure the segmentation accuracy.
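As an illustration of an endpoint-error style metric for sparse labels, the sketch below computes the mean Euclidean distance between corresponding label points. The exact definition and normalization of LEPE are not given in this section, so normalizing by the image diagonal, as well as the function name, are assumptions.

```python
import math


def label_endpoint_error(points_a, points_b, width, height):
    """Mean distance between matched label points, divided by the image diagonal."""
    assert points_a and len(points_a) == len(points_b)
    diagonal = math.hypot(width, height)
    total = sum(math.hypot(ax - bx, ay - by)
                for (ax, ay), (bx, by) in zip(points_a, points_b))
    return total / (len(points_a) * diagonal)


# Toy landmarks on a 3 x 4 image: one point matches, one is off by the full diagonal.
input_points = [(0.0, 0.0), (3.0, 4.0)]
predicted_points = [(0.0, 0.0), (0.0, 0.0)]
error = label_endpoint_error(input_points, predicted_points, width=3, height=4)
```

In the evaluation setting described above, `input_points` would come from the input label map x and `predicted_points` from applying F to the synthetic image.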
Table 1: Photorealism comparison measured by Fréchet Inception Distance (FID) [15].
Table 2: Semantic consistency measured by normalized label endpoint error for different methods in face and dance image synthesis tasks.
Style Consistency. We perform a human perceptual study to compare style consistency from a human point of view. We show pairs consisting of our result and the result from a baseline method to invited subjects and ask which one they see as being closer to the style of the guidance image.
4.5. Results
Main Results. In Figure 14, we show our results (column 3) and the results from baseline methods in the SketchFace synthesis application on the test set. While pix2pixHD is able to generate photorealistic images consistent with the input semantic labels, it is not able to keep the style (e.g. gender, hair, skin color) of the input exemplars in the synthetic results, even when enhanced by deep photo style transfer (columns 7 and 8). The unsupervised method MUNIT and its improvement PairedMUNIT fail to generate photorealistic results from semantic maps in this application (columns 9 and 10). A possible reason for their failure is that they assume that the input and output domains share the same content space, which is not true in image synthesis applications from semantic label maps.

Table 3: Style consistency evaluation by human opinion study on SketchFace synthesis. Each cell lists the percentage where our result is preferred over the other method.
Table 1 gives the quantitative evaluation of photorealism measured by FID in various image synthesis tasks, where our method performs the best. The semantic consistency of synthetic results to the input labels is given by LEPE in Table 2. pix2pixHD obtains the best semantic consistency to the input labels because it ignores style consistency entirely and therefore does not trade semantic accuracy for style. Our method outperforms MUNIT and PairedMUNIT.
For style consistency evaluation, we conduct a human perception study of the kind commonly used in image-to-image translation works [22, 51, 6, 46, 8]. The input exemplars and pairwise synthetic results sampled from our method and a baseline method are shown to the subjects with unlimited viewing time. The subjects are then asked “Which image is closer to the exemplar in terms of style?” Images for the user study were randomly sampled from the test set; each pair was shown in random order and guaranteed to be examined by at least 30 subjects. The ratios of votes our method received over baseline methods are given in Table 3. Our method won more user preferences in pairwise comparison. The quantitative results show that our results are more photorealistic and more style-consistent with the exemplars.
We conducted ablation studies to verify our model. As can be seen in Figure 14, without the adaptive weight scheme in the semantic consistency loss, the quality of results is slightly reduced; without the semantic consistency loss, the semantic consistency is lost; without the style consistency adversarial loss, the target style is not maintained. Quantitative photorealism statistics reported in Table 4 validate the above observation. We further extract eye patches from synthetic images and exemplars and compute the VGG feature distance between them. Table 5 indicates that the weight adaptation makes a quantitative improvement in style consistency.
Figure 15 shows the in-the-wild synthesis results from our model using Internet images. The results indicate that the model generalizes well for “unseen” cases. We provide more results in the supplementary material.
PoseDance Synthesis. Figure 7 shows a visual comparison of our method and baselines in the PoseDance synthesis application. The semantic consistency of synthetic results to the input labels, measured using LEPE, is given in Table 2. Although the facial regions of our results are blurry because facial landmarks are not included in the input pose labels, our model still produces images that are style-consistent with the guidance images while consistent with the semantic labels. Figure 8 shows the visual comparison with Ma et al. [36] on the dancing dataset. The generated poses and clothes in our results are visually better.

Table 4: Ablation study: Fréchet Inception Distance (FID) of our results and alternatives on the SketchFace synthesis task.
Table 5: VGG feature distance of eye patches between synthetic image and exemplar.
Figure 6: In-the-wild SketchFace synthesis.
Figure 7: Dance synthesis from pose maps.
Figure 8: PoseDance comparison with Ma et al. [36].
Figure 9: More results of example-based image synthesis on face, dance and street view synthesis tasks.
Figure 10: Street view synthesis from scene parsing maps and corresponding exemplars.
Scene parsing→Street view Synthesis. A comparison of our method and baselines in the Scene parsing→Street view task is given in Figure 10. The semantic consistency of synthetic results to the input labels, measured using SPS, is given in Table 6. Although the scenes in the guidance images are not quite the same as the semantics of the input label maps, our model is able to produce images that are semantically consistent with the segmentation map and style-consistent with the guidance image.
Figure 9 shows more results. Our network can faithfully synthesize images from various semantic labels and exemplars. Please find more results in the supplementary file.
In this paper, we present a novel method for example-guided image synthesis with style consistency from general-form semantic labels. During network training, we propose to sample style-consistent and style-inconsistent image pairs from video to provide style awareness to the model. Beyond that, we introduce the style consistency adversarial loss and the style consistency discriminator, as well as the semantic consistency loss with adaptive weights, to produce plausible results. Qualitative and quantitative results in different applications show that the proposed model produces more realistic and style-consistent images than prior art.

Table 6: Semantic consistency measured by scene parsing score [9] for different methods on the street view image synthesis task.
Limitations and Future Work. Our network is mainly trained on cropped video data whose resolution is limited, so we did not use the multi-scale architecture that pix2pixHD uses for high-resolution image synthesis. Moreover, the synthetic background in face and dance image synthesis tasks may be blurry, because the semantic labels do not specify any background scenes. Lastly, although we have demonstrated the effectiveness of our method in several synthesis applications, the results in other applications could be affected by the performance of the state-of-the-art semantic labeling function F. In the future, we plan to extend this framework to the video domain [45] and synthesize videos that are style-consistent with given exemplars.
Acknowledgements. We thank the anonymous reviewers for the valuable discussions. This work was supported by the Natural Science Foundation of China (Project Number: 61521002, 61561146393). Shi-Min Hu is the corresponding author.
[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017. 2
[2] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Towards open-set identity preserving face synthesis. In CVPR, 2018. 2
[3] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017. 5, 11
[4] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018. 2
[5] Huiwen Chang, Jingwan Lu, Fisher Yu, and Adam Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018. 2
[6] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017. 2, 7
[7] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. Cartoongan: Generative adversarial networks for photo cartoonization. In CVPR, pages 9465–9474, 2018. 2
[8] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018. 7
[9] Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983, 2018. 5, 6, 8, 10
[10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016. 2
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 2
[12] Shuyang Gu, Congliang Chen, Jing Liao, and Lu Yuan. Arbitrary style transfer with deep feature reshuffle. In CVPR, 2018. 2
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 5
[14] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In SIGGRAPH, pages 327–340, 2001. 2
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017. 6
[16] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018. 2
[17] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. Real-time neural style transfer for videos. In CVPR, July 2017. 2
[18] Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017. 2
[19] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative adversarial networks. In CVPR, 2017. 2
[20] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018. 2, 5, 6
[21] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and Locally Consistent Image Completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017. 1
[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arxiv, 2016. 2, 3, 5, 7, 11
[23] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. 2, 4, 11
[24] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017. 2
[25] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017. 2
[26] Davis E King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009. 5, 10
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[28] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017. 1
[29] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016. 1
[30] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. 1
[31] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018. 2, 5
[32] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. ACM Trans. Graph., 36(4), 2017. 2
[33] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, pages 700–708, 2017. 2
[34] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. arXiv preprint arXiv:1703.07511, 2017. 2, 6
[35] Liqian Ma, Xu Jia, Stamatios Georgoulis, Tinne Tuytelaars, and Luc Van Gool. Exemplar guided unsupervised image-to-image translation. arXiv preprint arXiv:1805.11145, 2018. 2
[36] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NIPS, pages 405–415, 2017. 2, 7, 8
[37] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017. 2, 5
[38] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016. 1
[39] Xiaojuan Qi, Qifeng Chen, Jiaya Jia, and Vladlen Koltun. Semi-parametric image synthesis. In CVPR, 2018. 2
[40] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. arXiv, 2018. 5, 11
[41] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv, 2018. 5, 10
[42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016. 2
[43] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017. 2
[44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3
[45] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In NIPS, 2018. 8
[46] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018. 2, 3, 5, 6, 7, 11
[47] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018. 5, 10
[48] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint, 2017. 2
[49] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016. 1
[50] Zhao-Heng Zheng, Hao-Tian Zhang, Fang-Lue Zhang, and Tai-Jiang Mu. Image-based clothes changing system. Computational Visual Media, 3(4):337–347, 2017. 2
[51] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017. 2, 5, 7, 11
[52] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017. 2
Table 7: Style groups we used to categorize BDD100K street view images.
As described in the main manuscript, we evaluate our model on face, dance and street view image synthesis tasks, using the following datasets and semantic functions:
SketchFace. We use the real videos in the FaceForensics dataset [41], which contains 854 videos of reporters broadcasting news. We use the image sampling strategy described in Section 3.3 of the main manuscript to acquire training image pairs from the videos, then apply a face alignment algorithm [26] to localize facial landmarks, crop the facial regions and resize them to size
. We sample 20,000 images from the videos for training and 500 images from distinct videos for testing. The detected facial landmarks are connected to create face sketches; this is the semantic function in both the training set and the test set. For each sketch extracted from a training image, we randomly sample 30 guidance images from other videos for training, and for each testing sketch, we randomly sample 5 guidance images from other videos for testing.
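The guidance sampling described above can be illustrated with a minimal sketch. It assumes frames are identified by hypothetical `(video_id, frame_id)` tuples and that guidance images are drawn without replacement; neither detail is specified in the text, and the toy sizes are for illustration only.

```python
import random

def sample_guidance(frames, n_guidance, rng):
    """Pair every frame with n_guidance frames drawn from *other* videos.

    frames: list of (video_id, frame_id) tuples (hypothetical identifiers).
    Sampling is without replacement, which is an assumption of this sketch.
    """
    pairs = {}
    for video_id, frame_id in frames:
        # Candidates are all frames that do not come from the same video.
        others = [f for f in frames if f[0] != video_id]
        pairs[(video_id, frame_id)] = rng.sample(others, n_guidance)
    return pairs

rng = random.Random(0)
# Toy data: 4 videos with 12 sampled frames each.
frames = [(v, i) for v in range(4) for i in range(12)]
train_pairs = sample_guidance(frames, 30, rng)  # 30 guidance images per training sketch
test_pairs = sample_guidance(frames, 5, rng)    # 5 guidance images per testing sketch
```

Excluding frames of the same video keeps the exemplar from trivially matching the content of the sketch it guides.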
SceneParsingStreetView. We use the BDD100K dataset [47] to synthesize street view images from pixel-wise semantic label (i.e. scene parsing) maps. For each street view image in the dataset, the corresponding scene parsing map and the WEATHER and TIMEOFDAY attributes are provided. Based on these attributes, we divide the images into 13 style groups as listed in Table 7, then sample style-consistent image pairs inside each group and style-inconsistent image pairs between groups. The training set contains 2,000 images and the test set contains 400 images, both resized to width 256. We use the scene parsing network DANet [9] as the semantic function for each street view image during testing. For each scene parsing map, we randomly select an image inside each style group as the guidance, in both the training and testing phases.
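As a rough illustration of the pair sampling above, the following sketch groups images by their attributes and draws style-consistent pairs within a group and style-inconsistent pairs across groups. Using the raw `(weather, timeofday)` pair as the group key is an assumption of this sketch; the actual 13 groups are those of Table 7, and all image names and attribute values below are hypothetical.

```python
import random
from collections import defaultdict

def group_by_style(images):
    """images: list of (name, weather, timeofday). Returns {style_key: [names]}."""
    groups = defaultdict(list)
    for name, weather, timeofday in images:
        groups[(weather, timeofday)].append(name)
    return dict(groups)

def sample_pair(groups, consistent, rng):
    """Draw one image pair, either inside one style group or across two groups."""
    if consistent:
        # Only groups with at least two images can yield a consistent pair.
        key = rng.choice(sorted(k for k, v in groups.items() if len(v) >= 2))
        return tuple(rng.sample(groups[key], 2))
    key_a, key_b = rng.sample(sorted(groups), 2)
    return rng.choice(groups[key_a]), rng.choice(groups[key_b])

rng = random.Random(0)
images = [(f"img{i}", w, t)
          for i, (w, t) in enumerate([("clear", "daytime"), ("clear", "night"),
                                      ("rainy", "daytime"), ("snowy", "night")] * 3)]
style_of = {name: (w, t) for name, w, t in images}
groups = group_by_style(images)
same = sample_pair(groups, True, rng)
diff = sample_pair(groups, False, rng)
```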
PoseDance. We downloaded 150 solo dance videos from YouTube, cropped out the central body regions and resized them to size
. As the number of videos is small, we evenly split each video into a first part and a second part along the timeline, then sample training data only from the first parts and testing data only from the second parts of all the videos. The semantic function is implemented by concatenating the outputs of the pre-trained DensePose [40] and OpenPose [3] pose detectors to provide pose labels. As a result, we have 35,000 images for training and 500 images for testing. For each pose extracted from a training image, we randomly sample 30 guidance images from other dancing videos, and for each testing pose, we randomly sample 5 guidance images from other dancing videos.
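The timeline split can be sketched as follows, assuming each video is available as an ordered list of frame identifiers (the video names and frame counts below are hypothetical). Training frames come only from the first half of each video and testing frames only from the second half.

```python
def split_video(frames):
    """Split an ordered list of frames into (first half, second half)."""
    mid = len(frames) // 2
    return frames[:mid], frames[mid:]

# Toy data: two videos with ordered frame indices.
videos = {"dance_a": list(range(10)), "dance_b": list(range(7))}

train_pool, test_pool = {}, {}
for name, frames in videos.items():
    train_pool[name], test_pool[name] = split_video(frames)
```

Because the split is along the timeline rather than random, no test frame is temporally interleaved with training frames of the same video.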
B.1. Generator
We follow the naming convention used in Johnson et al. [23], CycleGAN [51] and pix2pixHD [46]. Let c7s1-k denote a Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding is used to reduce boundary artifacts. Rk^t denotes residual blocks, each containing two convolutional layers with k filters, repeated t times. uk denotes a fractional-strided-Convolution-InstanceNorm-ReLU layer with k filters and stride 0.5.
The architecture of the generator is represented as:
c7s1-64, d128, d256, d512, d1024, R1024^9, u512, u256, u128, u64, c7s1-3
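To make the notation concrete, the sketch below parses the generator specification and traces the number of output channels and the spatial size after each stage. It writes the residual stage as `R1024^9` (nine residual blocks with 1024 filters) and assumes a hypothetical 256-pixel square input; kernel sizes and padding are not modelled, only the stride-induced changes in resolution.

```python
import re

SPEC = "c7s1-64, d128, d256, d512, d1024, R1024^9, u512, u256, u128, u64, c7s1-3"

def trace(spec, size=256):
    """Return a list of (token, out_channels, spatial_size) for each stage."""
    rows = []
    channels = None
    for tok in (t.strip() for t in spec.split(",")):
        if m := re.fullmatch(r"c7s1-(\d+)", tok):
            channels = int(m.group(1))                    # stride 1: size unchanged
        elif m := re.fullmatch(r"d(\d+)", tok):
            channels, size = int(m.group(1)), size // 2   # stride 2: size halved
        elif m := re.fullmatch(r"u(\d+)", tok):
            channels, size = int(m.group(1)), size * 2    # stride 0.5: size doubled
        elif m := re.fullmatch(r"R(\d+)\^(\d+)", tok):
            if int(m.group(1)) != channels:               # residual blocks keep the shape
                raise ValueError(f"channel mismatch at {tok}")
        else:
            raise ValueError(f"unknown token: {tok}")
        rows.append((tok, channels, size))
    return rows

rows = trace(SPEC)
for tok, channels, size in rows:
    print(f"{tok:>8}: {channels:4d} channels, {size}x{size}")
```

Under these assumptions, the four stride-2 stages reduce the resolution by a factor of 16 before the residual blocks, and the four fractional-strided stages restore it.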
B.2. Discriminators
We use PatchGAN [22] for both discriminators. Let Ck denote a Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. The output of the last layer is sent to an extra convolution layer to produce a 1-dimensional output. InstanceNorm is not used for the first C64 layer. The Leaky ReLU slope is set to 0.2.
The architectures of the two discriminators are identical:
All the networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. In the first 250K iterations, the learning rate was fixed at 0.0002 with the adversarial style-consistency loss turned off. In the next 250K iterations, we turned on the adversarial style-consistency loss. In the final 500K iterations, the learning rate decayed linearly to zero with all the losses turned on.
Figure 11: Example-based dance image synthesis on the YouTube Dance dataset. The first column shows the input pose labels, the second column shows the input style examples, and the remaining columns show the results from our method, pix2pixHD and PairedMUNIT.
The models were trained on an NVIDIA TITAN 1080 Ti GPU with 11GB memory. The inference time is about 8-10 milliseconds per image.
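The three-phase training schedule can be summarised in a short sketch. The function and constant names are hypothetical, and the sketch assumes the learning rate stays at 0.0002 throughout the second phase, which the text implies but does not state explicitly.

```python
BASE_LR = 2e-4
PHASE1_END = 250_000    # end of phase 1: adversarial style-consistency loss is off
PHASE2_END = 500_000    # end of phase 2: loss is on, learning rate still fixed
TOTAL_ITERS = 1_000_000 # end of phase 3: learning rate has decayed to zero

def learning_rate(iteration):
    """Fixed rate for the first 500K iterations, then linear decay to zero."""
    if iteration < PHASE2_END:
        return BASE_LR
    remaining = max(0, TOTAL_ITERS - iteration)
    return BASE_LR * remaining / (TOTAL_ITERS - PHASE2_END)

def style_adv_loss_enabled(iteration):
    """The adversarial style-consistency loss is used after the first 250K iterations."""
    return iteration >= PHASE1_END
```

In a training loop, `learning_rate(it)` would be written into the optimizer at each step and `style_adv_loss_enabled(it)` would gate the corresponding loss term.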
In Figure 11 and on the following pages, we show further experimental results from our method and the baselines.
Figure 12: More results of dance synthesis. The first column shows input pose maps. The first row shows input dance exemplars. Other images are the synthetic dance results.
Figure 13: Example-based face image synthesis on the FaceForensics dataset. The first column shows the input labels, the second column shows the input style example, and the remaining columns show the results from our method, our ablation studies (ours w/o SC loss, ours w/o SCAdv loss, ours w/o adaptive weights), pix2pixHD, pix2pixHD+DPST and PairedMUNIT.
Figure 14: More results of face synthesis. The first column shows input sketch maps. The first row shows input face exemplars. Other images are the synthetic face results.
Figure 15: More in-the-wild SketchFace results. The model is trained on our training dataset and tested on Internet images.
Figure 16: More results of street view synthesis. The first column shows input segmentation maps. The first row shows input exemplars. Other images are the synthetic street view results.