Generative adversarial networks (GANs) [11] have shown remarkable success across a broad range of applications in computer vision, graphics, and machine learning, e.g., image generation [5, 20, 21], image-to-image translation [28, 33, 8], and video-to-video synthesis [43, 3, 6]. Current state-of-the-art GANs, however, often require a large amount of training data and heavy computational resources, which limits their applicability in practical scenarios. Numerous techniques have been proposed to overcome this limitation, e.g., transferring knowledge of a well-trained source model [45, 32, 44], learning meta-knowledge for quick adaptation to a target domain [24, 47, 42], using an auxiliary task to facilitate training [7, 26, 48, 49], improving the inference procedure of suboptimal models [2, 39, 29, 38], using an expressive prior distribution [13], actively choosing samples to give supervision for conditional generation [29], or actively sampling mini-batches for training [37].
Among these approaches, transfer learning [46] is arguably the most promising way to train models under limited data and resources. Indeed, most of the recent success in deep learning is built upon strong backbones pre-trained on large datasets in supervised [9] or self-supervised [10, 14] ways. Following the success of transferring classifiers in recognition tasks, one can also consider utilizing well-trained GAN backbones for downstream generation tasks. While several methods propose such transfer-learning approaches to training GANs [45, 32, 44], they are often prone to overfitting with limited training data [45] or not robust in learning a significant distribution shift [32, 44].
Figure 1: Trends of FID [15] scores of fine-tuning and our proposed baseline, FreezeD, on ‘Dog’ class in the Animal Face [36] dataset. While fine-tuning suffers from overfitting, FreezeD shows consistent stability in training GANs.
In this paper, we propose a simple yet effective baseline for transfer learning of GANs. In particular, we show that simple fine-tuning of GANs (both generator and discriminator) with frozen lower layers of the discriminator performs surprisingly well (see Figure 1). Intuitively, the lower layers of the discriminator learn generic features of images while the upper layers learn to classify whether the image is real or fake based on the extracted features. We remark that this dichotomous view of a feature extractor and a classifier (and freezing the feature extractor for fine-tuning) is not new; it has been widely used for training classifiers [46]. We confirm that this view is also useful for GANs, and establish it as a proper baseline for transfer learning of GANs.
We demonstrate the effectiveness of the simple baseline, dubbed FreezeD, using various architectures and datasets. For unconditional GANs, we fine-tune the StyleGAN [20] architecture, pre-trained on FFHQ [20], on the Animal Face [36] and Anime Face [30] datasets. For conditional GANs, we fine-tune the SNGAN-projection [27] architecture, pre-trained on ImageNet [9], on the Oxford Flower [31], CUB-200-2011 [40], and Caltech-256 [12] datasets. FreezeD outperforms previous techniques in all experimental settings, e.g., improving the FID [15] score from 64.28 for fine-tuning to 61.46 (-4.4%) on the ‘Dog’ class of the Animal Face dataset.
The goal of GANs [11] is to learn a generator (and a corresponding discriminator) that matches a target data distribution. In transfer learning, we assume one can utilize a pre-trained source generator (and a corresponding discriminator) trained on the source data distribution to improve the target generator. See [25, 22] for surveys of GANs.
We first briefly review previous methods for transfer learning of GANs.
• Fine-tuning [45]: The most intuitive and effective way to transfer knowledge is fine-tuning, i.e., initializing the parameters of the target models with the pre-trained weights of the source models. The authors report that fine-tuning both the generator and the discriminator indeed shows the best performance. However, fine-tuning often suffers from overfitting; hence one needs a proper regularization.
• Scale/shift [32]: Since naïve fine-tuning is prone to overfitting, scale/shift suggests updating the normalization layers only (e.g., batch normalization (BN) [17]) while fixing all other weights. However, it often shows inferior results due to this restriction, especially when there is a significant shift between the source and the target distribution.
• Generative latent optimization (GLO) [32, 4]: Since the GAN loss is given by the discriminator, which can be unreliable for limited data, GLO suggests fine-tuning the generator with supervised learning, where the loss is given by the sum of the L1 loss and the perceptual loss [19]. Here, GLO jointly optimizes the generator and the latent codes to avoid overfitting: one latent code (and its corresponding generated sample) matches one real sample, so the generator can generalize to new samples by interpolation. While GLO improves the stability, it tends to produce blurry images due to the lack of an adversarial loss (and of the prior knowledge of the source discriminator).
• MineGAN [44]: To avoid overfitting of the generator, MineGAN suggests fixing the generator and modifying the latent codes. To this end, MineGAN trains a miner network that transforms a latent code into another latent code. While this importance-sampling-like approach can be effective when the source and target distributions share support, it may not generalize when their supports are disjoint.
We now introduce a simple baseline, FreezeD, which outperforms the previous methods despite its simplicity, and suggest two other methods for possible future directions, which may give further improvement. We remark that our goal is not to advocate the state-of-the-art but to set a simple and effective baseline. By doing so, we hope to encourage new techniques that outperform the proposed baseline.
• FreezeD (our proposed baseline): We find that simply freezing the lower layers of the discriminator and fine-tuning only the upper layers performs surprisingly well. We call this simple yet effective baseline FreezeD, and demonstrate its consistent gain over the previous methods in the experimental section.
• L2-SP [23]: In addition to the prior methods, we test L2-SP, which is known to be effective for classifiers. Built upon fine-tuning, L2-SP regularizes the target models so that they do not move far from the source models. In particular, it penalizes the L2-norm of the difference between the parameters of the source and target models. In our experiments, we applied L2-SP to the generator, the discriminator, and both, but the results were not satisfactory. However, since freezing layers can be viewed as giving an infinite L2-SP weight to the chosen layers and a weight of 0 to the other layers, using proper weights for each layer may perform better.
• Feature distillation [16, 35]: We also test feature distillation, one of the most popular approaches to transfer learning of classifiers. Among the variants, we simply distill the activations of the source models into the target models (initialized to the source models). We find that feature distillation shows comparable results to FreezeD while requiring twice the computation. Investigating more advanced techniques (e.g., [1, 18, 34]) would be an interesting and promising future direction.
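The remark that freezing corresponds to an infinite L2-SP weight can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the implementation used in our experiments: the helper names, the per-layer weight dictionary, and the one-dimensional quadratic task loss are all assumptions made for illustration.

```python
# Toy illustration only: parameters are plain Python lists, one list per layer.
# All names here are hypothetical and not taken from any released code.

def l2_sp_penalty(target, source, weights):
    """Weighted squared L2 distance between target and source parameters.

    target, source: dict mapping layer name -> list of floats
    weights:        dict mapping layer name -> non-negative float
    """
    total = 0.0
    for name, src_params in source.items():
        sq_dist = sum((t - s) ** 2 for t, s in zip(target[name], src_params))
        total += weights[name] * sq_dist
    return total


def l2_sp_minimizer(task_opt, source, weight):
    """Minimizer over t of 0.5*(t - task_opt)**2 + weight*(t - source)**2.

    Setting the derivative (t - task_opt) + 2*weight*(t - source) to zero
    gives t = (task_opt + 2*weight*source) / (1 + 2*weight).
    """
    return (task_opt + 2.0 * weight * source) / (1.0 + 2.0 * weight)


if __name__ == "__main__":
    source_value, task_optimum = 1.0, 3.0
    for weight in (0.0, 0.5, 1e9):
        # weight = 0 recovers plain fine-tuning; a very large weight
        # keeps the parameter at its source value, i.e., freezes it.
        print(weight, l2_sp_minimizer(task_optimum, source_value, weight))
```

In this toy setting, a weight of 0 lets the parameter move freely to the task optimum, whereas a very large weight pins it to the source value, which is the sense in which freezing is a limiting case of L2-SP.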
In this section, we demonstrate the effectiveness of the proposed baseline, FreezeD. We conduct extensive experiments for both unconditional GANs and conditional GANs in Section 3.1 and Section 3.2, respectively.
Figure 2: Samples generated by StyleGAN with (a) the original weights, and with weights trained by FreezeD on the (b) ‘Cat’ and (c) ‘Dog’ classes in the Animal Face dataset. Each entry indicates the same latent code. The same latent code shares the same semantics even after fine-tuning, e.g., the background color and hair color are preserved. See Appendix D for more qualitative results.
Table 1: FID scores under Animal Face dataset. Left and right values indicate the best and final FID scores.
3.1. Unconditional GAN
We first demonstrate results for unconditional GANs. We use the StyleGAN [20] architecture pre-trained on the FFHQ [20] dataset, and fine-tune it on the Animal Face [36] and Anime Face [30] datasets. We use all 20 classes of the Animal Face dataset, and the first 10 classes among the total 1,000 classes of the Anime Face dataset. Each class contains around 100 samples. We use the public pre-trained model of resolution 256×256 and fine-tune the models following the original training scheme for 50,000 iterations. We remark that, by starting from the source models, training succeeded without progressive training.
Figure 2 visualizes the generated samples using the original weights and the fine-tuned weights on ‘Cat’ and ‘Dog’ classes in the Animal Face dataset. Notably, the same latent code shares the same semantics even after fine-tuning. See Appendix D for more qualitative results. We also evaluate the FID [15] scores of vanilla fine-tuning and FreezeD on the Animal Face and Anime Face datasets in Table 1 and Table 2, respectively. We freeze the first 4 layers of the discriminator. See Appendix A for the ablation study on different layers. FreezeD improves both the best performance and the stability, as shown by the best and final FID scores.
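Freezing the lower layers of the discriminator, as done above, can be sketched as follows. This is a framework-free toy in plain Python rather than our training code: the `Layer` container, the six-layer discriminator, and the gradient values are hypothetical, and only the choice of freezing the first 4 layers comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one discriminator layer."""
    name: str
    weights: list
    trainable: bool = True


def freeze_lower_layers(layers, num_frozen):
    """Freeze the first `num_frozen` layers; keep the remaining ones trainable."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= num_frozen
    return layers


def sgd_step(layers, grads, lr):
    """Plain SGD update that skips frozen layers."""
    for layer in layers:
        if not layer.trainable:
            continue  # frozen feature-extractor layer: keep source weights
        layer.weights = [
            w - lr * g for w, g in zip(layer.weights, grads[layer.name])
        ]


if __name__ == "__main__":
    # Assumed toy discriminator with 6 layers, initialized from a source model.
    discriminator = [Layer(f"d{i}", [1.0, 1.0]) for i in range(6)]
    freeze_lower_layers(discriminator, num_frozen=4)
    grads = {layer.name: [0.5, 0.5] for layer in discriminator}
    sgd_step(discriminator, grads, lr=0.1)
    for layer in discriminator:
        print(layer.name, layer.trainable, layer.weights)
```

In an automatic-differentiation framework the same effect is usually obtained by excluding the frozen parameters from the optimizer or disabling their gradients (for instance, setting `requires_grad = False` in PyTorch); the generator is fine-tuned as usual.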
We finally compare FreezeD with several previous methods, including scale/shift, GLO, MineGAN, L2-SP, and feature distillation (FD). We choose the weights of L2-SP and FD from {0.1, 1, 10} and simply use 1 for all experiments. We follow the hyperparameters of [32] for GLO, and use a 2-layer MLP with ReLU activation for the miner network.
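For concreteness, a 2-layer MLP miner that maps a latent code to another latent code of the same size might look like the sketch below. Only the depth and the ReLU activation come from the text; the hidden width, the initialization scheme, and all function names are assumptions for illustration.

```python
import random


def make_miner(latent_dim, hidden_dim, seed=0):
    """Create parameters of a 2-layer MLP (assumed init: uniform, fan-in scaled)."""
    rng = random.Random(seed)

    def init(rows, cols):
        scale = 1.0 / (cols ** 0.5)
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w1": init(hidden_dim, latent_dim), "b1": [0.0] * hidden_dim,
        "w2": init(latent_dim, hidden_dim), "b2": [0.0] * latent_dim,
    }


def linear(weight, bias, x):
    """Affine map: weight @ x + bias, with weight given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def miner_forward(params, z):
    """Transform latent code z into a new latent code of the same dimension."""
    hidden = [max(0.0, h) for h in linear(params["w1"], params["b1"], z)]  # ReLU
    return linear(params["w2"], params["b2"], hidden)


if __name__ == "__main__":
    miner = make_miner(latent_dim=8, hidden_dim=16)
    z = [0.1 * i for i in range(8)]
    print(len(miner_forward(miner, z)))
```

The transformed code would then be fed to the fixed source generator, so that only the miner parameters are updated during training.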
Figure 3: Samples generated by SNGAN-projection trained by (a) fine-tuning and (b) FreezeD under the Oxford Flower dataset. Each row indicates the same class. FreezeD generates more class-consistent samples than fine-tuning, e.g., fine-tuning generates some abnormal samples for row 2 and 8. See Appendix E for more qualitative results.
Table 3 presents the FID scores of each method. Feature distillation and qualitative results are in Appendix B and C, respectively. Scale/shift and L2-SP are too restrictive and thus harm diversity. GLO produces blurry images, while MineGAN fails to learn the distribution shift.
3.2. Conditional GAN
We also demonstrate the results for conditional GANs. We use the SNGAN-projection [27] architecture pre-trained on the ImageNet [9] dataset, and fine-tune it on the Oxford Flower [31], CUB-200-2011 [40], and Caltech-256 [12] datasets. The datasets contain 102, 200, and 256 classes, respectively, where each class has 50-100 samples. We use the public pre-trained model of resolution 128×128 and fine-tune the networks following the original training scheme for 20,000 iterations. SNGAN-projection has a larger variance than StyleGAN, but the trend is still similar.
Figure 3 visualizes the samples generated by the models trained with fine-tuning and FreezeD. FreezeD generates more class-consistent samples than fine-tuning, as shown in the 2nd and 8th rows. See Appendix E for more qualitative results. We also evaluate the FID [15] scores of vanilla fine-tuning and FreezeD in Table 4. We freeze the first {3, 2, 1} layers of the discriminator for the {Oxford Flower, CUB-200-2011, Caltech-256} datasets, respectively, freezing fewer layers as the distribution shift grows larger. See Appendix A for details. FreezeD improves both the performance and stability in most cases, but harms the stability for Oxford Flower. We find that feature distillation shows more stable results in our experiments. We leave this investigation for future work.
Table 4: FID scores under SNGAN-projection architecture. Left and right values indicate the best and final FID scores.
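The best and final FID scores reported in the tables, and relative changes such as the -4.4% quoted in the introduction, can be computed from a sequence of FID evaluations as in the following sketch. The helper names and the example curve are hypothetical; only the pair 64.28 and 61.46 comes from the text.

```python
def best_and_final(fid_curve):
    """Return (best, final) FID from a list of FID values over training.

    Lower FID is better, so the best score is the minimum over checkpoints,
    and the final score is the value at the last checkpoint.
    """
    if not fid_curve:
        raise ValueError("fid_curve must contain at least one value")
    return min(fid_curve), fid_curve[-1]


def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical curve: improves early, then degrades as overfitting sets in.
    curve = [80.0, 62.5, 70.0]
    print(best_and_final(curve))
    # Fine-tuning (64.28) versus FreezeD (61.46) on the 'Dog' class.
    print(round(relative_change(64.28, 61.46), 1))
```

A large gap between the best and final values indicates unstable training, which is why both numbers are reported.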
We have introduced a simple yet effective baseline, FreezeD, for transfer learning of GANs. FreezeD splits the discriminator into a feature extractor and a classifier, and then fine-tunes only the classifier. We demonstrate that this simple baseline clearly outperforms most of the previous methods using various architectures and datasets. Our observation raises two questions. First, the transferability of the feature extractor of the discriminator could be applied to a universal detector of generated images [41]. Second, one can design a more sophisticated method that outperforms our proposed baseline. We hypothesize that an advanced version of feature distillation [16, 35] could be a promising direction.
[1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai. Variational information distillation for knowledge transfer. In CVPR, 2019.
[2] S. Azadi, C. Olsson, T. Darrell, I. Goodfellow, and A. Odena. Discriminator rejection sampling. In ICLR, 2018.
[3] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh. Recycle-gan: Unsupervised video retargeting. In ECCV, 2018.
[4] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam. Optimizing the latent space of generative networks. In ICML, 2018.
[5] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[6] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. In ICCV, 2019.
[7] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby. Self-supervised gans via auxiliary rotation loss. In CVPR, 2019.
[8] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha. Stargan v2: Diverse image synthesis for multiple domains. arXiv preprint arXiv:1912.01865, 2019.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[13] S. Gurumurthy, R. Kiran Sarvadevabhatla, and R. Venkatesh Babu. Deligan: Generative adversarial networks for diverse and limited data. In CVPR, 2017.
[14] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
[16] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NeurIPS Workshop, 2014.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[18] Y. Jang, H. Lee, S. J. Hwang, and J. Shin. Learning what and where to transfer. In ICML, 2019.
[19] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[20] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[21] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958, 2019.
[22] K. Kurach, M. Lučić, X. Zhai, M. Michalski, and S. Gelly. A large-scale study on regularization and normalization in gans. In ICML, 2019.
[23] X. Li, Y. Grandvalet, and F. Davoine. Explicit inductive bias for transfer learning with convolutional networks. In ICML, 2018.
[24] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. In ICCV, 2019.
[25] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. In NeurIPS, 2018.
[26] M. Lucic, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly. High-fidelity image generation with fewer labels. In ICML, 2019.
[27] T. Miyato and M. Koyama. cgans with projection discriminator. In ICLR, 2018.
[28] S. Mo, M. Cho, and J. Shin. Instagan: Instance-aware image-to-image translation. In ICLR, 2019.
[29] S. Mo, C. Kim, S. Kim, M. Cho, and J. Shin. Mining gold samples for conditional gans. In NeurIPS, 2019.
[30] C. Nagadomi. Animeface character dataset, 2018.
[31] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
[32] A. Noguchi and T. Harada. Image generation from small datasets via batch statistics adaptation. In ICCV, 2019.
[33] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[34] W. Park, D. Kim, Y. Lu, and M. Cho. Relational knowledge distillation. In CVPR, 2019.
[35] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
[36] Z. Si and S.-C. Zhu. Learning hybrid image templates (hit) by information projection. IEEE Transactions on pattern analysis and machine intelligence, 2011.
[37] S. Sinha, H. Zhang, A. Goyal, Y. Bengio, H. Larochelle, and A. Odena. Small-gan: Speeding up gan training using coresets. arXiv preprint arXiv:1910.13540, 2019.
[38] A. Tanaka. Discriminator optimal transport. In NeurIPS, 2019.
[39] R. Turner, J. Hung, E. Frank, Y. Saatci, and J. Yosinski. Metropolis-hastings generative adversarial networks. In ICML, 2018.
[40] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[41] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros. Cnn-generated images are surprisingly easy to spot... for now. arXiv preprint arXiv:1912.11035, 2019.
[42] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, B. Catanzaro, and J. Kautz. Few-shot video-to-video synthesis. In NeurIPS, pages 5014–5025, 2019.
[43] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In NeurIPS, 2018.
[44] Y. Wang, A. Gonzalez-Garcia, D. Berga, L. Herranz, F. S. Khan, and J. van de Weijer. Minegan: effective knowledge transfer from gans to target domains with few images. arXiv preprint arXiv:1912.05270, 2019.
[45] Y. Wang, C. Wu, L. Herranz, J. van de Weijer, A. Gonzalez-Garcia, and B. Raducanu. Transferring gans: generating images from limited data. In ECCV, 2018.
[46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014.
[47] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky. Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233, 2019.
[48] H. Zhang, Z. Zhang, A. Odena, and H. Lee. Consistency regularization for generative adversarial networks. In ICLR, 2020.
[49] Z. Zhao, S. Singh, H. Lee, Z. Zhang, A. Odena, and H. Zhang. Improved consistency regularization for gans. arXiv preprint arXiv:2002.04724, 2020.
We study the effect of freezing layers of the discriminator for StyleGAN and SNGAN-projection in Table 5 and Table 6, respectively. In StyleGAN, layer 4 consistently shows the best performance. However, in SNGAN-projection, layers {3, 2, 1} were the best for the Oxford Flower, CUB-200-2011, and Caltech-256 datasets, respectively. This is because Caltech-256 is harder to learn than Oxford Flower (i.e., the distribution shift is larger). Intuitively, one should restrict the model less so that it can adapt to a large distribution shift. One can also see that FreezeD is less stable than fine-tuning for the Oxford Flower dataset. In our early experiments, we observe that feature distillation shows better stability while achieving a similar best performance. Investigating a more sophisticated method would be an interesting research direction.
Table 5: Ablation study on freezing layers of D on StyleGAN architecture under ‘Cat’ and ‘Dog’ classes in the Animal Face dataset. Layer i indicates that the first i layers of the discriminator are frozen. Layer 4 performs the best.
Table 6: Ablation study on freezing layers of D on SNGAN-projection architecture under Oxford Flower, CUB-200-2011, Caltech-256 datasets. Layer i indicates that the first i layers of the discriminator are frozen.
We compare FreezeD with feature distillation. We linearize the activations of the i-th layer of the discriminator, and match the activations of the source and target discriminators. Since the activation has a different size for each layer, we use the L2-norm normalized by the feature dimension. We simply use 1 for the weight of the regularizer regardless of the layer. Table 7 presents the comparison results. Feature distillation and FreezeD show comparable results, while feature distillation is twice as slow. Hence, we choose FreezeD as the baseline for this paper.
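The distillation regularizer described above can be sketched as follows. This is a plain-Python illustration under stated assumptions: activations are represented as nested lists, the per-layer term is taken to be the squared L2 distance divided by the feature dimension (one possible reading of "L2-norm normalized by the feature dimension"), and the function names are hypothetical.

```python
def flatten(activation):
    """Linearize a nested list (e.g., channels x height x width) into a flat list."""
    if isinstance(activation, (int, float)):
        return [float(activation)]
    flat = []
    for item in activation:
        flat.extend(flatten(item))
    return flat


def feature_distillation_loss(source_acts, target_acts, weight=1.0):
    """Sum over layers of the dimension-normalized squared L2 distance.

    source_acts, target_acts: lists with one (nested) activation per matched
    layer, from the source and target discriminators respectively.
    weight: regularizer weight, shared by all layers (1 in the text).
    """
    total = 0.0
    for src, tgt in zip(source_acts, target_acts):
        s, t = flatten(src), flatten(tgt)
        if len(s) != len(t):
            raise ValueError("matched activations must have the same size")
        sq_dist = sum((a - b) ** 2 for a, b in zip(s, t))
        total += weight * sq_dist / len(s)  # normalize by feature dimension
    return total


if __name__ == "__main__":
    # Two matched layers with different sizes (toy values).
    source = [[[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]]
    target = [[[1.0, 2.0], [3.0, 6.0]], [1.0, 1.0]]
    print(feature_distillation_loss(source, target))
```

Computing this term requires a forward pass through both the source and the target discriminator, which is consistent with the roughly doubled cost noted above.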
Table 7: Comparison of FreezeD and feature distillation (FD) on StyleGAN architecture under ‘Bear’, ‘Cat’, and ‘Dog’ classes in the Animal Face dataset. FM (layer i) indicates the activations after layer i are matched. Feature distillation shows comparable results to FreezeD while being twice as slow.
We visualize the samples generated by the prior methods in Figure 4. Scale/shift and L2-SP generate reasonable samples, but with less diversity, as measured by FID scores. GLO generates blurry images due to the lack of an adversarial loss and of the knowledge of the source discriminator. In our experiments, MineGAN completely fails to adapt to the target distribution. Note that MineGAN assumes that the source distribution covers (or is at least close to) the target distribution (e.g., adult faces to child faces as in the original paper [44]), and it cannot be applied if the distributions have disjoint supports (e.g., human faces to dog faces).
Figure 4: Samples generated by prior methods under ‘Dog’ class in the Animal Face dataset.