Figure 1: Pedestrian images generated by our model. The lower left indicates the input.
Pedestrian detection [9],[6],[7],[21],[28],[41] is a fundamental task in applications such as robotics automation, autonomous driving, and video surveillance. Training detectors for such tasks requires a large number of high-quality pedestrian images, but traditional data augmentation methods (e.g., flipping and random cropping) are not enough, and constructing large-scale manually labeled datasets (e.g., [5],[8],[15],[42]) is time-consuming and laborious. An efficient way to augment pedestrian data automatically is therefore needed.
We propose a method based on generative adversarial networks (GANs) to generate high-quality pedestrian images; Figure 1 shows several generated results. GANs [17] have recently shown great advantages in synthesizing images [45],[46],[38]. In principle, a GAN is a minimax game in which the generator, G, is trained to produce images that are indistinguishable from real ones, and the discriminator, D, is trained to distinguish fake images from real ones. The competition between G and D pushes the network to simulate the distribution of real images as closely as possible.
The possibility of using GANs to augment data has been studied in several research areas (e.g., [47],[13], and [30]), but it is still an open problem in pedestrian detection. Producing visually appealing pedestrians is challenging because of their diverse appearances and sizes. PS-GAN [32] is the first such attempt; it augments pedestrian detection images with a U-net structured generator [33]. The network generates a pedestrian within a noise-masked rectangular area of the input background image and adopts Spatial Pyramid Pooling [18] to handle multi-scale pedestrians. However, it has several pitfalls: rectangular masks leave artificial edges when blending with the background; under occlusion, local context information is lost; the details of the output pedestrians are rather rough; and the model learns a one-to-one mapping, which is not efficient enough for data augmentation.
To produce realistic and diverse high-quality pedestrian images, we make the following improvements: (1) We use instance-level pedestrian masks instead of rectangular masks. Instance-level masks indicate not only where to generate pedestrians but also their shape, which helps generate realistic postures and clear body edges, reduces artificial blending edges, and facilitates the synthesis of occluded pedestrians. (2) We inject a masked latent code into every intermediate block of the encoder part of G to encourage both realism and diversity in the synthesized pedestrians. (3) We upgrade the U-net structured G into a residual U-net, with multi-scale residual blocks in the encoder part and attention residual blocks in the decoder part. Residual blocks [19] deepen the network and enlarge its capacity. The multi-scale residual blocks further enable modeling multi-scale information. The attention residual blocks help select the most important features for image synthesis by adjusting the importance weights of features. (4) We organize our model into a three-stage cascaded architecture to deal with pedestrians of multiple sizes. The generators operate at resolutions of $64 \times 64$, $128 \times 128$, and $256 \times 256$, respectively, with each higher stage taking the output of its previous stage as input.
We have experimented on the Cityscapes dataset and shown that our model, PMC-GANs, generates higher-quality pedestrians than the baselines. The model is then used to augment data for a pedestrian detection task, and the augmented dataset improves detection performance.
Image-to-Image Translation Image-to-image translation learns to map from a source image domain to a target image domain. One of the most famous models is Pix2Pix [22], which uses a conditional GAN [31] with a U-net generator. CycleGAN [45] applies cycle consistency losses to improve the reconstruction loss, and the idea of cycle consistency has since been widely adopted by other studies. These two works learn one-to-one mappings, while our model is expected to produce several plausible results from one input. In this direction, [1] proposes Augmented CycleGAN, which learns many-to-many mappings by cycling over the original domains augmented with auxiliary latent spaces; [4] designs a star topology that enables the model to learn multi-domain mappings; [27] produces multi-modal results by treating the target domain as a transformation condition. We follow an approach similar to [46], which learns one-to-many mappings by encouraging a bijection between the output and latent space.
Figure 2: The cascaded architecture of our model. We use the instance-level mask, semantic label map, and edge map (the far-left image set) to assist training. Each stage is a hybrid of cVAE-GAN and cLR-GAN: cVAE-GAN encodes the ground truth with $E_i$, and $G_i$ maps the input along with a sampled z back to the ground truth; cLR-GAN uses a randomly sampled z to map the input into the output domain and then rebuilds z from the output. $D_{img}$ judges whether the whole image is real or fake, and $D_{ped}$ focuses on the target pedestrian instance.
Generate Pedestrians by Using GANs [14] designs a double-discriminator network to separately distill identity-related and pose-unrelated person features to generate person images. [48] transfers the pose of a given person to a target pose by applying Pose-Attentional Transfer Blocks within GANs, and [36] tackles the problem by using an adversarial loss together with pose, content, style, and face losses. These works are enlightening; however, their generated pedestrians must maintain the original physical features and clothing of the input image, which works against the diversity needed in our task. [37] proposes a compositing-based image synthesis method that pastes a foreground pedestrian instance into a background image. Compared to this method, our work is able to generate unseen pedestrian patterns, thus enriching the appearance of augmented pedestrians.
Figure 3: The structure of our generator. Cov is a convolutional layer, CovT is a convolutional transpose layer, CAT is a concatenation function, DS is a downsampling operation, MSRB is a multi-scale residual block, and CARB is a channel attention residual block.
Our goal is to learn a multi-modal mapping from a source domain, $\mathcal{B}_M$, to a target domain, $\mathcal{B}$, where $\mathcal{B}_M$ is a masked image domain and $\mathcal{B}$ is a pedestrian image domain. The pedestrian mask, $M$, is acquired from the instance-level semantic label map, $L_m$, of the Cityscapes dataset: we set the pixels inside the target pedestrian instance to 1 and all others to 0. Every $B_M$ is computed from a $B$ and an $M$ by masking out the target pedestrian instance and keeping the background of $B$. We also use an instance-level edge map [38], $E_m$, to assist training. During training, we are given a set of paired instances from these domains and the corresponding maps, $A = \{B_M, L_m, E_m\}$, to represent a joint distribution $p(A, B)$. It is important to note that although there could be multiple plausible $B$ that fit an input instance $A$, the training set contains only one such pair. Figure 2 illustrates the PMC-GANs architecture. During testing, the model is expected to generate a varied set of $\hat{B}$, given a new instance $B_M$. Specifically, the test instance $B_M$ can be computed either by masking an existing pedestrian or by adding a mask where no pedestrian originally exists.
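The mask and masked-image construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: images are single-channel nested lists, `instance_id` and the helper names are hypothetical, and filling the masked pixels with a constant `fill` value is an assumption (the text does not state what value replaces the pedestrian pixels).

```python
from typing import List

Grid = List[List[int]]


def pedestrian_mask(instance_map: Grid, instance_id: int) -> Grid:
    """Binary mask M: 1 inside the target pedestrian instance, 0 elsewhere."""
    return [[1 if value == instance_id else 0 for value in row]
            for row in instance_map]


def masked_image(image: Grid, mask: Grid, fill: int = 0) -> Grid:
    """Masked input B_M: blank out the pedestrian pixels, keep the background.

    The constant `fill` is an assumption made for this sketch.
    """
    return [[fill if m == 1 else pixel for pixel, m in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


# Toy example: instance 7 is the pedestrian to be masked out.
instance_map = [[0, 7, 7],
                [0, 7, 0]]
image = [[10, 20, 30],
         [40, 50, 60]]
M = pedestrian_mask(instance_map, instance_id=7)
B_M = masked_image(image, M)
```

At test time the same `masked_image` step would apply whether the mask comes from an existing pedestrian or is placed on an empty background region.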
3.1 Network Construction
Multi-scale Attention Residual U-net Generator Generating high-quality pedestrians is challenging, not only because of the complex body structures but also because of the rich and diverse details. Intuitively, we can divide the generation of pedestrians into two steps: encoding, to extract as much information as possible from the training instances; and decoding, to select the most important features for synthesizing pedestrians. Based on this idea, we propose a multi-scale attention residual U-net (U-MAR) structured generator (Figure 3). We introduce multi-scale residual blocks (MSRBs) [26] into the encoder part of the generator G, enabling the generator to obtain multi-scale information and a robust representation of the input. We also adopt channel attention residual blocks (CARBs) [44] in the decoder part of G to adjust the importance weights of features. We find that Leaky ReLU activation performs better in our work than ReLU, which is the original setting of these residual blocks. Due to space limits, the inner architectures of MSRBs and CARBs are shown in the supplementary file.
Cascaded Architecture Generating visually appealing high-resolution pedestrians is much more difficult than generating low-resolution ones because of the finer details. Since pedestrians of different sizes share similar features, such as body structures and textures, it is reasonable to regard the low-resolution result as the starting point for training a high-resolution one. We hence organize the network into a cascaded structure with three stages, operating at resolutions of $64 \times 64$, $128 \times 128$, and $256 \times 256$, respectively. Each stage consists of a GAN that uses the proposed U-MAR structured generator. The generator in a higher stage takes in the integration of the previous stage's knowledge, $\hat{B}_{previous}$, and the current stage's input, $A_{current}$. Hence, training a higher stage is also a process of fine-tuning its previous stage. The integration of neighboring stages is shown in Figure 4.
Figure 4: The integration of two stages.
3.2 Loss Functions
We train the model based on the settings of BicycleGAN [46], whose loss function consists of a conditional LR-GAN loss [10],[12],[3] and a conditional VAE-GAN loss [24],[25]. The cLR-GAN takes in a random latent code $z$, uses it to map the input $A$ into $\hat{B}$, and attempts to rebuild $z$ from $\hat{B}$. The cVAE-GAN encodes a ground truth image $B$ into a latent space with an encoder $E$, and the generator tries to map $A$ together with a sampled $z$ back to $B$. Please refer to [46] for more details of BicycleGAN. In our model, the cLR-GAN loss and the cVAE-GAN loss are given by equations (1) and (2), respectively.
In the cLR-GAN loss of equation (1), $\hat{B}_{ped}$ is the product of the generated pedestrian image $\hat{B}$ and the pedestrian mask $M$; $A$ is the concatenation of an input masked image $B_M$, its corresponding label map $L$, mask $M$, and edge map $E$; and $z$ is a randomly drawn latent code.
The cVAE-GAN loss of equation (2) modifies equation (1) by sampling $z \sim E(B)$ with the re-parameterization trick, allowing for direct back-propagation.
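As a hedged sketch only: assuming the adversarial terms keep the conditional-GAN form used by BicycleGAN [46] and add a second discriminator on the masked pedestrian region, the two losses could be written as below. The symbols $\mathcal{L}_{GAN}^{cLR}$, $\mathcal{L}_{GAN}^{cVAE}$, $B_{ped}$, and the use of $\odot$ for the pixel-wise product are notation assumed here, and the exact form of equations (1) and (2) may differ (for instance, least-squares adversarial losses [29] replace the logarithms in the implementation).

```latex
\[
\begin{aligned}
\hat{B} &= G(A, z), \qquad \hat{B}_{ped} = \hat{B} \odot M, \qquad B_{ped} = B \odot M, \\
\mathcal{L}_{GAN}^{cLR} &= \mathbb{E}_{A,B}\left[\log D_{img}(A, B)\right]
  + \mathbb{E}_{A,\,z \sim \mathcal{N}(0, I)}\left[\log\left(1 - D_{img}(A, \hat{B})\right)\right] \\
&\quad + \mathbb{E}_{A,B}\left[\log D_{ped}(A, B_{ped})\right]
  + \mathbb{E}_{A,\,z \sim \mathcal{N}(0, I)}\left[\log\left(1 - D_{ped}(A, \hat{B}_{ped})\right)\right], \\
\mathcal{L}_{GAN}^{cVAE} &= \text{the same expression with } z \sim E(B) \text{ in place of } z \sim \mathcal{N}(0, I).
\end{aligned}
\]
```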
We make two improvements to the loss function: (1) we adopt two discriminators, $D_{img}$ and $D_{ped}$, to compute the loss on the whole generated image and on the synthesized pedestrian, respectively, whereas the original BicycleGAN only uses $D_{img}$; (2) we use a perceptual loss [35] based on VGG-19 to encourage the synthesized images to have content similar to the training instance, i.e., to make the output look more like a pedestrian. The full objective loss function is formulated as:
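where the first two terms are the adversarial losses of cVAE-GAN and cLR-GAN, respectively. The L1 term is the L1 loss between the image $B$ and the generated image $\hat{B}$, driving G to match $B$. Of the two remaining regularization terms, one encourages E to produce a latent code that is close to a Gaussian distribution, and the other is the KL distance in cLR-GAN. The hyper-parameters weighting the individual terms control the relative importance of each term.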
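For orientation, the objective of BicycleGAN [46], on which this model is based, has the form sketched below. The perceptual term and its weight, written here as $\mathcal{L}_{VGG}$ and $\lambda_{VGG}$, are notation assumed for this sketch, as is the maximization over both discriminators; the assignment of weights to terms follows BicycleGAN and may not match the original equation exactly.

```latex
\[
\begin{aligned}
G^*, E^* &= \arg\min_{G,E}\;\max_{D_{img},\,D_{ped}}\;
  \mathcal{L}_{GAN}^{cVAE} + \lambda\,\mathcal{L}_{1}^{VAE}
  + \mathcal{L}_{GAN}^{cLR} + \lambda_{latent}\,\mathcal{L}_{1}^{latent}
  + \lambda_{KL}\,\mathcal{L}_{KL} + \lambda_{VGG}\,\mathcal{L}_{VGG}, \\
\mathcal{L}_{1}^{VAE} &= \mathbb{E}_{A,B,\,z \sim E(B)}\,\lVert B - G(A, z) \rVert_1, \\
\mathcal{L}_{1}^{latent} &= \mathbb{E}_{A,\,z \sim \mathcal{N}(0, I)}\,\lVert z - E(G(A, z)) \rVert_1, \\
\mathcal{L}_{KL} &= \mathbb{E}_{B}\left[ D_{KL}\left( E(B) \,\Vert\, \mathcal{N}(0, I) \right) \right].
\end{aligned}
\]
```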
Implementation Details At each stage, G uses the proposed U-MAR structure, $D_{img}$ and $D_{ped}$ use PatchGAN [22], and E uses ResNet [19]. LSGAN [29] is adopted to compute the adversarial losses. A 16-dimensional latent code z is injected into the network by spatial replication, multiplied with the pedestrian mask, down-sampled by nearest-neighbor interpolation, and then concatenated into every intermediate layer of the encoder part of the generator. The four loss-weighting hyper-parameters are set to 10, 0.5, 0.01, and 1, respectively. We set the batch size to 1 and the number of epochs to 200. The perceptual loss is not used in the first stage because it makes training unstable. When training the cascaded model, the first stage is trained in a setting similar to [46] to learn $G_1$ and $E_1$. The second stage trains $G_2$ and $E_2$ for the first 100 epochs with $G_1$ and $E_1$ fixed, and then trains all of $G_1$, $G_2$, $E_1$, and $E_2$ for another 100 epochs. The third stage uses the same strategy as the second one and trains on $G_2$, $G_3$, $E_2$, and $E_3$. We use the Adam [23] optimizer with a learning rate of $w^{h-i} \cdot lr$, where $lr$ is the basic learning rate, $h$ is the total number of cascaded stages, $i$ is the ordinal of the current training stage, and $w$ is a weight factor. We set $lr$ to 0.0002 and $w$ to 0.01. By fixing G and E of a previous stage and by using a small factor $w$, we force a higher stage to better follow the knowledge already learned by its previous ones.
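Two of the steps above lend themselves to a short sketch: the masked latent-code injection and the per-stage learning rate. The code below is illustrative only. It uses nested Python lists in place of tensors, all function names are hypothetical, the learning-rate rule is assumed to be $w^{h-i} \cdot lr$, and nearest-neighbor down-sampling is approximated by keeping every `factor`-th pixel.

```python
from typing import List

Grid = List[List[float]]


def stage_learning_rate(base_lr: float, w: float, total_stages: int, stage: int) -> float:
    """Learning rate w**(h - i) * lr for stage i of h (stages are 1-indexed)."""
    if not 1 <= stage <= total_stages:
        raise ValueError("stage must lie in [1, total_stages]")
    return (w ** (total_stages - stage)) * base_lr


def replicate_and_mask(z: List[float], mask: Grid) -> List[Grid]:
    """Spatially replicate each latent value and multiply it by the mask.

    Returns one H x W channel per latent dimension, non-zero only where
    the pedestrian mask is 1.
    """
    return [[[z_k * m for m in row] for row in mask] for z_k in z]


def downsample_nearest(channel: Grid, factor: int) -> Grid:
    """Down-sample by keeping every `factor`-th row and column."""
    return [row[::factor] for row in channel[::factor]]


# Toy example: a 2-dimensional latent code and a 4 x 4 pedestrian mask.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
latent_maps = replicate_and_mask([0.5, 2.0], mask)
half_resolution = [downsample_nearest(c, 2) for c in latent_maps]
```

In a real network each down-sampled latent map would be concatenated along the channel axis with the encoder feature map of matching spatial size; with $w = 0.01$, modules of earlier stages are updated far more slowly than the stage being added.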
Dataset The model is trained on the Cityscapes training set and evaluated on its validation set. The size $s_i$ of the input images in the $i$th stage is set to $s_1 = 64$, $s_2 = 128$, and $s_3 = 256$. We crop the pedestrian images from the Cityscapes dataset so that every image shares the same center with the corresponding pedestrian bounding box. Let $H$ denote the height of the bounding box of a pedestrian. At the $i$th stage, if $H$ is smaller than $s_i$, the corresponding pedestrian image is cropped at a resolution of $s_i \times s_i$; if $H$ is larger than $s_i$, the image is cropped at a resolution of $H \times H$ and then resized to $s_i \times s_i$. Considering that resizing an image too much can lead to information loss, we limit $H$ to the range 64 to 256 for first-stage pedestrians, 100 to 1024 for the second stage, and 150 to 1024 for the third stage. Because high-resolution pedestrians are fewer than low-resolution ones, the last stage contains far fewer images than the other two. Therefore, we expand the training set of the third stage as follows: for every pedestrian image, we randomly pick a value $H'$ from the interval $(H, 1.22H]$, crop at a resolution of $H' \times H'$, and resize to $s_i \times s_i$. The training sets of the three stages contain 6,000, 4,700, and 5,600 images, respectively, and the validation sets contain 1,000 images each.
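The cropping rule can be summarized as "crop a square of side max(H, s_i) around the bounding-box center, then resize to s_i". A minimal sketch follows; the function names are hypothetical, uniform sampling of the enlarged side is an assumption (the text only says "randomly pick"), and the resize step itself is left to an image library.

```python
import random
from typing import Tuple


def crop_side(bbox_height: float, stage_size: int) -> float:
    """Side of the square crop: s_i if H < s_i, otherwise H."""
    return float(max(bbox_height, stage_size))


def crop_window(center_x: float, center_y: float, side: float) -> Tuple[float, float, float, float]:
    """Square window (left, top, right, bottom) sharing the bounding-box center."""
    half = side / 2.0
    return (center_x - half, center_y - half, center_x + half, center_y + half)


def expanded_side(bbox_height: float, rng: random.Random, max_ratio: float = 1.22) -> float:
    """Sample an enlarged crop side from (H, max_ratio * H] for third-stage expansion."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
    return bbox_height * (1.0 + (max_ratio - 1.0) * u)


# A 50-pixel-tall pedestrian at stage 1 is cropped at 64 x 64 without resizing;
# a 200-pixel-tall one is cropped at 200 x 200 and later resized to 64 x 64.
small = crop_side(50, 64)
large = crop_side(200, 64)
window = crop_window(100, 100, small)
samples = [expanded_side(200.0, random.Random(seed)) for seed in range(100)]
```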
Evaluation Metrics We use the Fréchet Inception Distance (FID) [20], which measures the distance between generated images and ground truth images in the feature space of the Inception-v3 network [40]. The FID score is consistent with human judgment [20]; it rewards realistic synthesized images and penalizes a lack of diversity [2]. The formula follows [11],[20]; the lower the FID score, the better. For every input, we generate a set of samples using 5 random latent codes and compute the FID score over all of them as a whole.
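For reference, the standard closed form of the Fréchet distance between two Gaussians, as used for FID, is given below. Here $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of the Inception features of real and generated images; these symbol names are chosen for this note.

```latex
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
\]
```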
Baselines We compare against two previous works: (1) [38], a pix2pix-based model that uses generators with residual blocks and takes instance-level masks and edge maps as auxiliary input; we only use its global generator in the experiments. (2) [32], a pedestrian synthesis model based on the pix2pix network with a U-net generator. We also compare four ablation versions of our work: (1) Ours-1 uses a basic U-net generator. (2) Ours-2 uses basic residual blocks in place of the multi-scale and attention ones in the proposed generator. (3) Ours-3 uses MSRBs in the encoder part of the generator and basic residual blocks in the decoder part. (4) Ours-4 uses the proposed generator structure. For fairness, all the baseline models are trained for 200 epochs and optimized by Adam, with the learning rate reduced from the 100th epoch. The baseline methods are trained in a one-stage fashion without the cascaded architecture. Each method is trained on three input resolutions, based on our dataset. [38] and [32] use their original loss functions, and the other baselines use the same loss function as our PMC-GANs.
4.1 Results
Qualitative Comparison Figure 5 shows the output images of the baseline methods and our work at resolutions of $64 \times 64$, $128 \times 128$, and $256 \times 256$, from top to bottom. Our model, Ours (PMC-GANs), generates a more delicate pedestrian at every resolution: the details of the pedestrians are richer, and the boundaries of body parts are clearer. Our model learns a multimodal mapping, which improves the efficiency of data augmentation and helps avoid mode collapse. Figure 6 shows some multimodal results.
Figure 5: Visual comparison of PMC-GANs with baselines. The first two rows are generated at a resolution of $64 \times 64$, the third and fourth rows at $128 \times 128$, and the last two rows at $256 \times 256$.
Figure 6: The multimodal products of PMC-GANs.
We perform an interpolation experiment on PMC-GANs by manipulating the random code z that is injected into the generator. Figure 7 shows an interpolation instance, where the first and the last generated images are produced by randomly sampling $z_{first}$ and $z_{last}$ from a Gaussian distribution, and the others are produced by injecting interpolated values between $z_{first}$ and $z_{last}$ into the generator. The model produces different images for neighboring interpolation samples, illustrating that it does not over-fit to the training data.
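The interpolation between two latent codes can be sketched as follows. Linear interpolation is an assumption here (the text does not state which scheme is used), and `interpolate_latents` is a hypothetical helper name.

```python
from typing import List


def interpolate_latents(z_first: List[float], z_last: List[float], steps: int) -> List[List[float]]:
    """Return `steps` latent codes evenly spaced on the line from z_first to z_last."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both end points")
    if len(z_first) != len(z_last):
        raise ValueError("latent codes must have the same dimension")
    codes = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0.0 (z_first) to 1.0 (z_last)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_first, z_last)])
    return codes


# Three codes between two 2-dimensional end points.
codes = interpolate_latents([0.0, 2.0], [1.0, 0.0], steps=3)
```

Each interpolated code would then be injected into the generator in place of a randomly sampled z, with the input instance held fixed.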
Quantitative Evaluation Since we upgrade a U-net generator into a U-MAR generator, the model Ours-1 serves as the basic baseline of our study. We perform an ablation study based on Ours-1 to verify the effectiveness of $L_m$, $E_m$, $D_{ped}$, and the perceptual loss in synthesizing pedestrians. The experiment is conducted at a resolution of $256 \times 256$, and all models are trained in a one-stage fashion using the same settings as Ours-1. $L_m$ and $E_m$ are always used together, so we use $LE_m$ to denote the pair. Table 1 shows that removing any of these components leads to a worse FID score. Therefore, it is reasonable to use $L_m$, $E_m$, $D_{ped}$, and the perceptual loss in our work.
Table 1: The results of the ablation study. The best performance is in bold.
Table 2 shows the FID scores of the baseline methods and our model. Comparing [32] and [38] with the ablation versions of our model, we can see that our work achieves better results overall, which is consistent with the qualitative comparison. Ours-4, with the proposed generator, improves the FID score over Ours-1 by 7.6%, 27.2%, and 17.7% at resolutions of $64 \times 64$, $128 \times 128$, and $256 \times 256$, respectively, indicating the superiority of the U-MAR generator over the basic U-net generator in producing pedestrians. The comparison between Ours-1 and Ours-2 shows the benefit of adding residual blocks to the basic U-net structure in our task, the comparison between Ours-2 and Ours-3 validates the usefulness of the MSRBs in the encoder part of the generator, and the comparison between Ours-3 and Ours-4 confirms the effectiveness of the CARBs in the decoder part of the generator.
The models Ours and Ours-4 use the same structure for G, D, and E; the difference between the two is that Ours uses the cascaded architecture to produce high-resolution images, while Ours-4 uses only one stage. At a resolution of $64 \times 64$, Ours reduces to Ours-4, and the higher the resolution, the greater the advantage of the cascaded structure, with an improvement of 0.7% at $128 \times 128$ and 9.3% at $256 \times 256$. The results show that a coarse-to-fine cascaded architecture is useful in synthesizing high-resolution pedestrian images.
Figure 7: An interpolation analysis.
Table 2: Comparison of FID score. The best results are in bold.
4.2 Data Augmentation Experiments
Figure 8: Data augmentation samples, where the left ones are synthesized images.
Experiments are performed on the CityPersons dataset [42]. We decide the position $P_p$ and the size $P_s$ of a synthesized pedestrian as follows: first, we use the semantic label map to restrict its position to sidewalks and roads; then, we compute its size according to both the sizes of existing cars and pedestrians in the image and the distribution of pedestrian size conditioned on position in the dataset. We crop a $P_s \times P_s$ background image, $I_{bg}$, centered at $P_p$, from a CityPersons image, $I_{HD}$. We then randomly select a pedestrian mask $M$ and apply it to $I_{bg}$ to obtain the masked image that is input to our trained PMC-GANs. The generated pedestrian image $I_{ped}$ is then blended into $I_{HD}$ using a pixel-wise replacement strategy. We select 3,000 blended images, matching the size of the CityPersons training set, as the augmented data (Figure 8 shows two augmented samples). The tested pedestrian detector uses ResNet-50 [19] as the backbone network, pre-trained on the ImageNet dataset [34]. Pedestrians with a height of less than 50 pixels or a visible ratio of less than 0.65 are ignored. When training both the RPN proposals and the Fast R-CNN [16] stage, we avoid sampling the ignore regions. Random crop and horizontal flip strategies are also applied to augment data. We use $\times 1$ images to train the $\times 1$ detector and $\times 3$ upsampled images to train the $\times 3$ detector. To fit in the 12GB memory of a TITAN X GPU and avoid memory overflow, the $\times 5$ detector is trained on $\times 3$ scaled images and tested on the $\times 5$ dataset. Please see the supplementary file for more information. Table 3 compares our pedestrian detection method with the baselines on the CityPersons validation set and shows that augmenting data with PMC-GANs is effective at every testing scale.
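The blending and filtering steps of this pipeline can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: images are single-channel nested lists, the helper names are hypothetical, and "pixel-wise replacement" is interpreted as overwriting the whole cropped window of the scene with the generated patch.

```python
from typing import List

Image = List[List[int]]


def paste_patch(scene: Image, patch: Image, center_row: int, center_col: int) -> Image:
    """Return a copy of `scene` with `patch` written over the window centered at the given pixel."""
    rows, cols = len(patch), len(patch[0])
    top = center_row - rows // 2
    left = center_col - cols // 2
    if top < 0 or left < 0 or top + rows > len(scene) or left + cols > len(scene[0]):
        raise ValueError("patch does not fit inside the scene at this position")
    blended = [row[:] for row in scene]  # copy so the input scene stays untouched
    for r in range(rows):
        blended[top + r][left:left + cols] = patch[r]
    return blended


def is_ignored(height_px: float, visible_ratio: float) -> bool:
    """Pedestrians shorter than 50 pixels or with visible ratio below 0.65 are ignored."""
    return height_px < 50 or visible_ratio < 0.65


# Toy example: paste a 2 x 2 generated patch into a 4 x 4 scene.
scene = [[0, 0, 0, 0] for _ in range(4)]
patch = [[1, 2],
         [3, 4]]
blended = paste_patch(scene, patch, center_row=2, center_col=2)
```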
Table 3: Pedestrian detection results. MR is used to compare the performance of the detectors (the lower, the better). "Reasonable", "Heavy", "Partial", and "Bare" are different subsets defined by the CityPersons validation dataset according to occlusion ratio. The best performances are in bold.
We propose multi-modal cascaded generative adversarial networks (PMC-GANs) to synthesize pedestrian images. The model uses multi-scale residual blocks in the encoder part of the generator to obtain a multi-scale representation of pedestrian images and uses channel attention residual blocks in the decoder part of the generator to help select the most important features. Our model outperforms the baselines in generating both realistic and diverse pedestrian images, especially at high resolution. The experiment of using PMC-GANs to augment pedestrian detection data further supports its effectiveness and applicability. However, the direction of the light that falls on the generated pedestrian sometimes does not match the lighting condition of the background image, which looks artificial. We plan to address this problem in future work by taking lighting variables into account.
[1] Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
[4] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint arXiv:1711.09020, 2017.
[5] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] Piotr Dollár, Zhuowen Tu, Pietro Perona, and Serge Belongie. Integral channel features. 2009.
[7] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304–311. IEEE, 2009.
[8] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012.
[9] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (8):1532–1545, 2014.
[10] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[11] DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455, 1982.
[12] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[13] Maayan Frid-Adar, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. Synthetic data augmentation using gan for improved liver lesion classification. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 289–293. IEEE, 2018.
[14] Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, and Hongsheng Li. Fd-gan: Pose-guided feature distilling gan for robust person re-identification. arXiv preprint arXiv:1810.02936, 2018.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[16] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision, pages 346–361. Springer, 2014.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[21] Jan Hosang, Mohamed Omran, Rodrigo Benenson, and Bernt Schiele. Taking a deeper look at pedestrians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4073–4082, 2015.
[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976. IEEE, 2017.
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. stat, 1050:10, 2014.
[25] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pages 1558–1566, 2016.
[26] Juncheng Li, Faming Fang, Kangfu Mei, and Guixu Zhang. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 517–532, 2018.
[27] Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Conditional image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(July 2018), 2018.
[28] Jiayuan Mao, Tete Xiao, Yuning Jiang, and Zhimin Cao. What can help pedestrian detection? In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6034–6043. IEEE, 2017.
[29] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.
[30] Giovanni Mariani, Florian Scheidegger, Roxana Istrate, Costas Bekas, and Cristiano Malossi. Bagan: Data augmentation with balancing gan. arXiv preprint arXiv:1803.09655, 2018.
[31] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[32] Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou. Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. arXiv preprint arXiv:1804.02047, 2018.
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115 (3):211–252, 2015.
[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[36] Sijie Song, Wei Zhang, Jiaying Liu, and Tao Mei. Unsupervised person image generation with semantic parsing transformation. arXiv preprint arXiv:1904.03379, 2019.
[37] Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M Rehg, and Visesh Chari. Learning to generate synthetic data via compositing. arXiv preprint arXiv:1904.05475, 2019.
[38] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
[39] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd.
[40] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[41] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. How far are we from solving pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1259–1267, 2016.
[42] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Citypersons: A diverse dataset for pedestrian detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
[43] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Occlusion-aware r-cnn: Detecting pedestrians in a crowd. In European Conference on Computer Vision, pages 657–674. Springer, 2018.
[44] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. arXiv preprint arXiv:1807.02758, 2018.
[45] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2242–2251. IEEE, 2017.
[46] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
[47] Yezi Zhu, Marc Aoun, Marcel Krijn, and Joaquin Vanschoren. Data augmentation using conditional generative adversarial networks for leaf counting in arabidopsis plants. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, page 324, 2018. URL http://bmvc2018.org/contents/workshops/cvppp2018/0014.pdf.
[48] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. arXiv preprint arXiv:1904.03349, 2019.