Occlusion of faces is a major hindrance to accurate Face Recognition (FR), a problem that remains far from solved. With the advent of generative adversarial models [1] in the field of deep learning (DL), there has been a surge of techniques to predict the missing values or pixels in an image. Revealing the missing parts of an image is a common image editing operation, which aims to fill the missing or masked regions in images with appropriate content that appears visually realistic. The generated content can either be as accurate as the original, or simply fit well within the context such that the restored image looks perceptually plausible and complete. Recent image completion techniques [2, 3] rely on low- and mid-level cues for the generation of the missing patches in the image.
Contrary to the recent techniques, our proposed method reconstructs a full face even when certain salient and unique features of the face are occluded. Processes that generate missing patches on faces assume that similar patterns do not exist everywhere. In line with this assumption, Generative Adversarial Networks (GANs) are expected to perform well in generating the facial parts behind the mask, owing to their capability of generating the unseen. Wright et al. [4] used a method for sparse recovery of signals for image completion, which was further used in face completion. Recently, Ren et al. [5] used convolutional neural networks (CNNs) for inpainting of images. Li et al. [6] used a generative model to restore face parts occluded by patches on the CelebA dataset, but they did not provide any results on real-world occluded face datasets, such as the AR face database [7]. They also relied on post-processing of the images to produce semantically correct images. The Generative Face Completion (GFC) [6] process requires a significantly large amount of training time to reach the equilibrium point.
With the aim of designing an end-to-end framework for generating face images from masked ones, the primary contribution of this paper lies in the design of a novel bimodal training algorithm for GAN. Mode-I of the training process produces faces with ambient illumination, while Mode-II denoises the output generated by Mode-I. A unique training algorithm is proposed with faster convergence. An adversarial "structural" loss is also proposed in this paper in order to maintain the holistic quality of the face images. This "structural" loss consists of two components: "Structural Similarity (SSIM) loss" and "Patch-wise Mean Squared Error (PMSE)". The SSIM [8] takes care of the holistic features of the face, while PMSE takes care of the pixel-wise differences in the faces. Further, our model converges to an equilibrium in Mode-II faster than other generative models [9], since the generator is based on a denoising auto-encoder [10] model. The generated faces boost the performance of FR on occluded faces, when compared with recently published works in the literature.
Sections 2 and 3 give brief overviews of GAN and the denoising auto-encoder, respectively, while section 4 discusses the loss functions used in this paper. Section 5 gives the details of the proposed architecture of SD-GAN, followed by the description of the proposed training algorithm in section 6. In section 7, the quantitative and qualitative results of our experiments, showing the effectiveness of our proposed method, are reported, along with the benchmark datasets used for experimentation. Finally, the paper concludes in section 8.
Generative Adversarial Network (GAN) [1] consists of two models: the generator (G) and the discriminator (D). The CNN-based deep network in G captures the true data distribution, $p_{data}$, and generates images from inputs sampled from a distribution $p_z$, the distribution of the training data provided as input to G. D, as a counterpart of G (also CNN-based), discriminates between the original images, sampled from $p_{data}$, and the images generated by G. Typically, G learns to map from a latent space ($p_z$) to a particular data distribution ($p_g$) of interest, while D discriminates between instances from $p_{data}$ and candidates produced by the generator. The objective of training G is to increase the error rate of D by producing novel synthesized instances that appear to have come from $p_{data}$. The adversarial training adopted for GAN is derived from that in Schmidhuber [11]. In other words, an alternating training procedure is performed on GAN, where D and G play a two-player minimax zero-sum game with the value function V(G, D). The overall objective function optimized by GANs [1] is given as:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

To learn $p_g$, a mapping to data space is represented as $G(z; \theta_g)$, where G is a differentiable function representing a CNN with parameters $\theta_g$. The CNN-based deep network represented by $D(x; \theta_d)$ outputs a single scalar [0/1]. D(x) represents the probability that x came from the true data rather than $p_g$.
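To make the minimax value function of [1] concrete, the following minimal sketch estimates V(G, D) from discriminator outputs on a batch of real and generated samples. The function name `gan_value`, the `eps` guard and the sample probabilities are hypothetical choices for illustration only; this is not the training code used in this paper.

```python
import math

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte-Carlo estimate of V = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities D(x) on real samples.
    d_fake: discriminator probabilities D(G(z)) on generated samples.
    eps guards against log(0); it is an implementation assumption.
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = 2 * log(0.5) = -log(4).
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes V towards its maximum of 0.
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

D is trained to raise this value, while G is trained to lower it, which is why a generator that fools D drives the estimate back towards -log(4).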
Two major drawbacks of an adversarial system are:
1. GANs generate all the pixels in one shot, rather than predicting the value of one pixel given another pixel. This is the main reason for the noise in the output images whenever missing pixels are generated.
2. Reaching the Nash equilibrium [12] of the game requires a large number of iterations/epochs due to the instability inherent in GANs [1].
Overcoming the above two drawbacks forms the basic motivation of the work presented in this paper. To deal with noise, a denoising auto-encoder based generator model has been introduced in conjunction with the standard GAN framework. Further, Mode-II reaches the Nash equilibrium faster than Mode-I. A trade-off is made at Mode-I between the structural loss and training time, where the generator loss is thresholded to select the generated images passed to Mode-II for denoising.
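The general deep auto-encoder, as proposed by Bengio et al. [13], maps an input vector $x \in [0,1]^d$ to a latent representation $y \in [0,1]^{d'}$ through a deterministic mapping $y = f_\theta(x) = s(Wx + b)$, and then maps back to the reconstructed vector, $z \in [0,1]^d$, in the input space with $z = g_{\theta'}(y) = s(W'y + b')$, where $s$ denotes the activation function. The optimization of the parameters is based on the mean reconstruction error [13]:

$$\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(x^{(i)}, g_{\theta'}(f_\theta(x^{(i)}))\right) \tag{2}$$

where $x^{(i)}$ represents the $i^{th}$ training sample and L is the squared error $L(x, z) = \|x - z\|^2$.

Vincent et al. [10] designed a denoising autoencoder by modifying the formulation in equation 2. The authors assumed $\tilde{x}$ to be a noisy approximation of $x$, characterized by a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$. The joint distribution is given as $q^0(X, \tilde{X}, Y) = q^0(X)\, q_D(\tilde{X} \mid X)\, \delta_{f_\theta(\tilde{X})}(Y)$ and parameterized by $\theta$, where $Y$ becomes the deterministic function of $\tilde{X}$. The objective function in equation 2 thus transforms into:

$$\arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X})}\left[ L\left(X, g_{\theta'}(f_\theta(\tilde{X}))\right) \right] \tag{3}$$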
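The denoising objective of [10] can be illustrated with a tiny fully connected auto-encoder: the input is corrupted, encoded and decoded, and the squared error is measured against the clean input. The sketch below assumes masking noise as the corruption process and sigmoid activations; the helper names (`corrupt`, `affine_sigmoid`, `denoising_loss`) and the toy weights are hypothetical, and the CNN-based auto-encoder used in this paper is not reproduced here.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine_sigmoid(weights, bias, vec):
    """Compute s(W v + b) for a row-major weight matrix."""
    return [sigmoid(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def corrupt(x, drop_prob, rng):
    """Masking noise (an assumed corruption): zero each component
    independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def denoising_loss(x, W, b, W_dec, b_dec, drop_prob, rng):
    """Squared error between the clean x and the reconstruction
    obtained from the corrupted version of x."""
    x_noisy = corrupt(x, drop_prob, rng)
    y = affine_sigmoid(W, b, x_noisy)       # latent code from the noisy input
    z = affine_sigmoid(W_dec, b_dec, y)     # reconstruction in input space
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))

rng = random.Random(0)
x = [0.2, 0.9, 0.4]
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]         # encoder: 3 -> 2
b = [0.0, 0.0]
W_dec = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.0]]    # decoder: 2 -> 3
b_dec = [0.0, 0.0, 0.0]
loss = denoising_loss(x, W, b, W_dec, b_dec, drop_prob=0.3, rng=rng)
```

The key point is that the loss compares the reconstruction with the clean input, not with the corrupted one, so minimizing it forces the network to undo the corruption.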
Patch-wise minimization of the mean-squared error (discussed later in section 4.3) further helps in image denoising [14]. Thus, the patch-wise mean squared error loss is used in this paper as a component of the loss function in both generators of our SD-GAN framework.
The process of training the SD-GAN consists of two modes, and optimizes four adversarial loss functions described (later) in equations 9-12. The corresponding criteria are described in the following sub-sections.
4.1. Binary Cross Entropy Loss
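Binary cross-entropy is a loss function used effectively in deep learning for binary classification problems with sigmoid output units. The binary class labels used at the discriminators are 0 & 1, representing the real and fake (generated) images. The loss function is given as:

$$L_{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(p_{ij}) \tag{4}$$

where $i$ indexes the $n$ samples/observations and $j$ indexes the $m$ classes, $y$ is the sample label (binary on the LHS, one-hot vector on the RHS), and $p_{ij} \in (0, 1)$, with $\sum_j p_{ij} = 1$, is the prediction for a sample.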
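As a quick numerical illustration of the binary cross-entropy criterion, the sketch below evaluates it on hand-picked labels and predictions. The function name `binary_cross_entropy`, the clipping constant `eps` and the sample values are assumptions made for the example.

```python
import math

def binary_cross_entropy(labels, preds, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Predictions that agree with the labels give a small loss ...
loss_good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
# ... while predictions that contradict them are penalized heavily.
loss_bad = binary_cross_entropy([1, 0, 1], [0.1, 0.9, 0.2])
```

An undecided prediction of 0.5 costs exactly log(2) per sample, which is a useful reference value when reading discriminator loss curves.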
4.2. SSIM Loss
SSIM [8] gives the structural similarity index between two images. We first define the SSIM index [8], estimated using multiple patches (windows) of an image. This measure between two windows $p$ and $q$ of common size is:

$$\mathrm{SSIM}(p, q) = \frac{(2\mu_p \mu_q + c_1)(2\sigma_{pq} + c_2)}{(\mu_p^2 + \mu_q^2 + c_1)(\sigma_p^2 + \sigma_q^2 + c_2)} \tag{5}$$

where $\mu_p, \mu_q$ are the pixel-wise averages of image patches $p$ and $q$ respectively, $\sigma_p^2, \sigma_q^2$ their respective variances, $\sigma_{pq}$ the covariance of $p$ and $q$, $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ two variables used to stabilize the division with a weak denominator, $L$ the dynamic range of the pixel values (typically $2^{\#bits\ per\ pixel} - 1$), and $k_1 = 0.01$, $k_2 = 0.03$ set by default. The SSIM index, estimated between two single-channel (gray-scale) images, produces a maximum value of 1 for two identical images and decreases as the similarity between the images decreases. Hence, the SSIM loss ($L_{SSIM}$) is calculated as:

$$L_{SSIM} = 1 - \mathrm{SSIM}(p, q) \tag{6}$$
where, SSIM is given in equation 5. Minimization of this loss provides a better estimate of the
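The SSIM index of [8] and a loss derived from it can be sketched in plain Python as below. This is a simplified illustration: windows are flat lists of gray-scale values, statistics are unweighted population estimates (no Gaussian weighting), and the loss is assumed to be one minus the mean SSIM over corresponding windows. The names `ssim_window` and `ssim_loss` are hypothetical.

```python
def ssim_window(p, q, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM index between two equally sized windows (flat lists)."""
    n = len(p)
    mu_p = sum(p) / n
    mu_q = sum(q) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    c1 = (k1 * dynamic_range) ** 2   # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2   # stabilizes the contrast/structure term
    numerator = (2 * mu_p * mu_q + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2)
    return numerator / denominator

def ssim_loss(windows_a, windows_b):
    """Assumed loss form: 1 - mean SSIM over corresponding windows."""
    scores = [ssim_window(p, q) for p, q in zip(windows_a, windows_b)]
    return 1.0 - sum(scores) / len(scores)

ramp = [10.0, 20.0, 30.0, 40.0]
loss_same = ssim_loss([ramp], [ramp])              # identical windows
loss_flipped = ssim_loss([ramp], [ramp[::-1]])     # reversed structure
```

Identical windows give a loss of zero, while reversing the intensity ramp makes the covariance negative and therefore yields a much larger loss, even though both windows share the same mean and variance.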
4.3. Patch-wise MSE Loss
Patch-wise MSE (PMSE) loss is derived from the mean-squared error between two images. Let $p_k$ and $q_k$ be the $k^{th}$ pair of spatially corresponding patches extracted from the two images, respectively. The PMSE between the two images is calculated as:

$$\mathrm{PMSE} = \frac{1}{|h|} \sum_{k=1}^{|h|} \sum_{c=1}^{|C|} w_c \, \mathrm{MSE}(p_k^c, q_k^c) \tag{7}$$

where $|C|$ and $|h|$ are the number of channels and patches in an image, $p_k^c$ is channel $c$ of a patch, and the $w_c$'s are the channel-wise weights of the image (as given in [15]). A weighted linear combination (using the $w_c$'s) of the channel-wise MSEs is used to estimate the MSE of each patch. PMSE is the average MSE over all pairs of spatially corresponding patches in the images.
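The patch-wise computation described above can be sketched as follows. Since the channel weights of [15] are not reproduced in the text, the sketch assumes the common luminance weights (0.299, 0.587, 0.114) for R, G and B, and it assumes non-overlapping square patches; the function name `patchwise_mse` and the patch size are likewise hypothetical.

```python
def patchwise_mse(img_a, img_b, patch=2, weights=(0.299, 0.587, 0.114)):
    """Average, over non-overlapping patches, of the channel-weighted MSE.

    img_a, img_b: nested lists indexed as [row][col][channel] with equal
    shapes. The weights are an assumed stand-in for those given in [15].
    """
    height, width = len(img_a), len(img_a[0])
    patch_errors = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            err = 0.0
            for c, w in enumerate(weights):
                squared = [(img_a[r][col][c] - img_b[r][col][c]) ** 2
                           for r in range(top, top + patch)
                           for col in range(left, left + patch)]
                err += w * sum(squared) / len(squared)   # weighted channel MSE
            patch_errors.append(err)
    return sum(patch_errors) / len(patch_errors)

# Two 4x4 RGB images that differ by exactly 1.0 in every channel.
black = [[[0.0, 0.0, 0.0] for _ in range(4)] for _ in range(4)]
grey = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
pmse_value = patchwise_mse(black, grey)
```

Because the assumed weights sum to one, a uniform unit difference in every channel gives a PMSE of one, and identical images give zero.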
4.4. Structural Loss
The structural loss is an auxiliary loss used along with the binary cross-entropy loss as in DCGAN [1]. The primary aim of proposing this novel loss is to constrain the structure of the generated image. The SSIM loss (see section 4.2) accounts for the facial structure, while a mean-squared error (MSE) based loss applied patch-wise (refer section 4.3) helps to replicate the illumination variation in the Mode-I generator and to denoise in the auto-encoder based Mode-II generator. The structural loss is given as:
The proposed Structural and Denoising Generative Adversarial Network (SD-GAN) works in two modes. Figures 1 & 2 show the proposed architecture with structural details of SD-GAN, and each mode of operation is described in the following sub-sections.
Figure 1: The proposed SD-GAN architecture (best viewed in color), exhibiting two modes of operations (training): (a) Mode - I and (b) Mode - II.
Figure 2: Architectural Details of the CNN-based Generator and Discriminator used in SD-GAN (best viewed in color).
5.1. Mode-I
Mode-I of SD-GAN is derived from DC-GAN [16], with a few variations in the input as well as in the training procedure (see section 6 for further details). The Mode-I generator is a deep network (see figure 2(a)) which takes the occluded faces as input, instead of the noise vector (as in DC-GAN), and generates (synthetic) faces to be fed to the discriminator. The Mode-I discriminator, similar to the discriminator network in DC-GAN (see figure 2(b)), takes both the full real-world facial images and the generated images as inputs and attempts to discriminate between the real and generated (fake) images.
A "nice generation" module acts as an interface for selective data transfer between the two modes of training. It takes fake images as input with a mini-batch of size 20, and computes a loss function (see line 5 of algorithm 1) to filter and create "nice" images when the loss is significantly low (< 0.01). The corresponding full face images are also filtered and given to Mode-II of training. This is done under the assumption that the Mode-I generator has successfully fooled the Mode-I discriminator for the batch of images when the loss is low. Since the "nice" images are often corrupted by noise, a denoising operation is necessary, as done by Mode-II.
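The selection step performed by the "nice generation" module can be sketched as a simple threshold on a per-batch loss. The loss itself is the one computed in algorithm 1 and is not reproduced here, so the sketch takes it as a precomputed number; the function name `select_nice_batches` and the tuple layout are assumptions made for the example.

```python
def select_nice_batches(batches, threshold=0.01):
    """Keep only mini-batches whose loss is below the threshold.

    batches: iterable of (loss, fake_images, real_images) tuples, where
    loss is the (precomputed) generator-side loss for that mini-batch.
    Returns the selected fake images and their corresponding real images.
    """
    nice_fake, nice_real = [], []
    for loss, fake, real in batches:
        if loss < threshold:          # generator is assumed to have fooled D
            nice_fake.extend(fake)
            nice_real.extend(real)
    return nice_fake, nice_real

# Toy run with string placeholders standing in for images.
batches = [
    (0.005, ["fake_a", "fake_b"], ["real_a", "real_b"]),
    (0.200, ["fake_c"], ["real_c"]),
    (0.009, ["fake_d"], ["real_d"]),
]
nice_fake, nice_real = select_nice_batches(batches)
```

Only the first and third batches pass the 0.01 threshold, so three image pairs are forwarded to Mode-II and the pairing between generated and full-face images is preserved.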
5.2. Mode-II
Mode-II is the denoising unit of our proposed architecture, in contrast to Mode-I, which preserves the structural identity of the face. To perform the task of denoising, a denoising auto-encoder (see section 3) is used as the generator in this mode of operation. For the CNN-based denoising auto-encoder (refer figure 2(c)) proposed in this paper, the generated "nice" images from Mode-I are taken as inputs. The Mode-II discriminator, identical in architecture to the Mode-I discriminator, takes the real and generated images as input and performs adversarial training independently and exclusively. Though the input to Mode-II is the output of Mode-I, the training and weight update of the model at Mode-II is independent of the training of Mode-I, i.e. the gradients do not backpropagate into the model of Mode-I.
The bimodal SD-GAN model is trained using the proposed algorithm 1. The procedure involves end-to-end training of both modes simultaneously. Each mode is trained using a procedure adopted from DC-GAN [16], with a structural loss induced for each mode, exclusively. The model is trained in Keras with a TensorFlow backend [17]. A uniform mini-batch size of 20 samples has been used throughout the training process, with gradient-based optimization for weight update in the network. The following sub-sections detail the mode-wise training procedure, with the loss functions involved for weight update in the network (for all notations used hereafter, refer algorithm 1).
6.1. Training for Mode-I
The training process used for Mode-I is outlined in algorithm 1. The occluded images are given as inputs to the Mode-I generator, to generate fake images matching the underlying true distribution of the full-facial images. The semi-supervised training procedure of SD-GAN involves a discriminator to distinguish between the real-world and generated images. The full faces corresponding to each of the occluded faces in a batch, B, are fed to the discriminator as real images. The training of the discriminator is based on the minimization of the binary cross-entropy loss (see section 4.1 for details), using the ADAM [18] optimizer. The inputs in a particular batch are the set of full real-world face images and the set of occluded faces; the discriminator function takes an input x and a target label, and the generating function takes the input x. The adversarial loss corresponding to the Mode-I discriminator can be written as:
Training the generator is essentially an optimization process executed using Stochastic Gradient Descent (SGD) [19], while freezing the weight update of the discriminator. The proposed structural loss (auxiliary) is induced at this stage of training. The adversarial loss for the Mode-I generator includes the structural loss defined in equation 8.
Minimization of the two criteria given by equations (9) and (10) makes the generator outsmart (by cheating) the discriminator upon reaching Nash equilibrium [20], where the discriminator believes that the images generated by the generator are sampled from the true distribution.
6.2. Training for Mode-II
The output images obtained from Mode-I are used in training for Mode-II in SD-GAN. Hence, the batches of "nice" images generated by the Mode-I generator are provided as inputs to Mode-II along with their corresponding (subject-wise) full-face images. Though these images have their structural content partly preserved, they suffer from some degradation due to noise. To denoise these images, a denoising auto-encoder based generator model is proposed in this paper. Algorithm 1 also outlines Mode-II of training. The Mode-II discriminator uses an adversarial loss similar to that of the Mode-I discriminator, given as:
The denoising auto-encoder training of the Mode-II generator is incremental, in the sense that the number of training samples increases as the Mode-I generator becomes stronger. The instability issues [1] prevalent in training are taken care of by over-training the weaker of the two networks to reach the equilibrium point. The adversarial loss incurred at this phase mainly deals with closing the gap between the distributions of the real and the generated (fake) samples. The adversarial loss at the Mode-II generator is given by:
Minimization of this loss, which involves a difference operator, reduces the gap in structure and pixel values between the generated (fake) and true samples, which also reduces the noise in the generated samples.
The use of Mode-II of training along with Mode-I (done independently) reduces the overall training time (in terms of the number of epochs) compared to a recent state-of-the-art technique [6] used for the task at hand.
This section first describes the datasets used, then gives the quantitative measures used to show the effectiveness of our proposed model for face completion and FR, compared with a few state-of-the-art techniques.
7.1. Datasets
Experiments are carried out on three datasets: (a) the AR dataset [7], (b) the Celeb-A dataset [21], and (c) Multi-PIE [22]; each is briefly described below.
7.1.1. AR Database
The AR database [7] consists of face images which contain real-world occlusions. The database consists of 136 subjects with varying illumination conditions and expressions. For our study, we consider those images which are near-frontal and have minimal expression variations (see figure 3 for samples). Two variations of occlusions are available in the database, viz. sunglasses and a scarf on the face, which prevent the faces from being reconstructed using symmetric transformations from the other half of the face. For our experiments, the dataset has been divided into 2 subsets: AR1, the images with sunglasses, and AR2, those with scarves. A data partition in the ratio 60 : 20 : 20 is maintained uniformly for training, validation, and testing throughout the set of experiments. The subjects used for training and validation are never used for testing.
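A partition of this kind could be drawn at the subject level roughly as sketched below. This assumes that the 60 : 20 : 20 ratio applies to subject identities and makes all three subsets disjoint, which is stricter than what the text states; the function name `split_subjects`, the seed and the toy identifiers are hypothetical.

```python
import random

def split_subjects(subject_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle unique subject IDs and cut them into train/val/test lists."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]              # remainder goes to testing
    return train, val, test

# Ten toy subjects, each appearing several times (one entry per image).
image_subjects = [s for s in range(10) for _ in range(3)]
train_ids, val_ids, test_ids = split_subjects(image_subjects)
```

Images are then assigned to a subset according to the subject they depict, so that no test identity is ever seen during training or validation.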
Figure 3: Two face image examples from the AR database (one in each row) with different levels of occlusions and illumination variations. Images in {(a), (b), (e)} belong to AR1; those in {(c), (f), (g)} belong to AR2; and {(d), (h)} are the full face images (best viewed in color).
7.1.2. Celeb-A Database
The CelebA [21] dataset consists of 202,599 face images. Each face image is cropped, roughly aligned by the position of two eyes, and rescaled to pixels. The standard benchmark split with 162,770 images for training, 19,867 for validation and 19,962 for testing, has been followed for experimentation. A mask of size
pixels covers the face (see figure 4 for samples) at random locations, as described in [6].
Figure 4: Examples from CelebA database, showing two subjects with synthetic occlusions (best viewed in color).
7.1.3. Multi-PIE dataset
The CMU Multi-PIE database [22] consists of 755,370 images shot in 4 different sessions from 337 subjects. The images in the dataset are split up into training, validation and test set. The training set is composed of all individuals in non-frontal pose (except those used for validation and testing) at the generator, while the size of the validation (64 identities at a pose of ) and test sets (65 identities at a pose of
) are almost identical. We consider the images taken in session 1, with the probe images taken at
7.2. Evaluation metrics
Along with the visual results shown in section 7.3, we perform quantitative evaluation of the proposed model for the two datasets under test. Firstly, we use the peak signal-to-noise ratio (PSNR) value, which captures the difference in the pixel values of the two images. PSNR (higher is better) is defined as:

$$\mathrm{PSNR}(x, y) = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)}\right)$$

where $x$ is the output (generated) image, $y$ is the reference (ground-truth, GT) image, MAX is the maximum possible pixel value, and MSE is the mean-squared error between the two images.
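For reference, PSNR can be computed as in the sketch below, which uses the standard definition based on the mean-squared error and the maximum pixel value. Images are represented as flat lists of pixel values, and the function name `psnr` and the default `max_value` of 255 (8-bit images) are assumptions for illustration.

```python
import math

def psnr(generated, reference, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized images."""
    mse = sum((g - r) ** 2 for g, r in zip(generated, reference)) / len(reference)
    if mse == 0:
        return math.inf                   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 10 grey levels gives an MSE of 100.
psnr_small_error = psnr([100.0, 100.0], [110.0, 110.0])
# The largest possible error on 8-bit data gives 0 dB.
psnr_worst = psnr([0.0, 0.0], [255.0, 255.0])
```

Because the measure is logarithmic in the MSE, halving the pixel-wise error raises the PSNR by about 6 dB.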
Secondly, SSIM index (refer equation 5) is used for quantifying the generated results, which estimates the holistic similarity between two images. Finally, we also use the identity distances measured by the OpenFace toolbox [23] to determine the high-level semantic similarity of two faces.
7.3. Performance Analysis for generation of full facial images
A few examples of the generation of full facial images from occluded faces are shown in figure 5 under two different scenarios of the proposed method. Column (b) depicts the output of DC-GAN [16], while the results progressively become better as we move towards the right, showing the effectiveness of the auxiliary losses proposed in this paper. The significant improvement in the image quality measures shown by our model in (e) as compared to (d) (see table 1 for quantitative measures showing similar trends) strengthens our claim for the introduction of Mode-II for denoising the output of Mode-I.
Figure 5: Results for image generation from two different sets of occlusions, viz., AR2 and AR1 (arranged row-wise) present in AR, by SD-GAN: (a) the input occluded image, (b) output of using
, (c) output of
, (d) output of
at Phase-I, (e) output of
Phase-II, (f) Ground-truth (GT). The values below each image from (b)-(e) give the (PSNR/SSIM) values of the images compared to the expected output (GT).
Both the quantitative and the qualitative measures are compared with a recent state-of-the-art technique. GFC [6] uses face parsing as well as Poisson Blending [24] as post-processing techniques to generate facial parts under occlusion. Graph Laplacian (GL) based methods [25] also attempt to solve the problem. The quantitative results evaluating the quality of the images are given in table 1. Our proposed SD-GAN (referred to as 'SDG' in tables) outperforms all other techniques based on PSNR values, whereas in the case of the holistic measure (SSIM), the nearest competing method, GFC, also a GAN-based deep model with post-processing techniques, matches our performance in a few cases and marginally outperforms our proposed technique in only one case. Qualitative experiments also reveal that, without the post-processing technique, GFC fails to match the performance of our proposed technique on both datasets, as shown in figures 6 & 7. The values at the bottom of the images in columns (b) & (c) in figures 6 & 7 show higher PSNR/SSIM values for our proposed SD-GAN on the four exemplar images.

Figure 6: Results for image generation from two different methods: (a) occluded images (one each from AR2 (top row) and AR1 (bottom row)), (b) images generated by GFC [6] without post-processing, (c) images generated by SD-GAN, (d) expected output. The values below each image give the (PSNR/SSIM) values of the images compared to the expected output.

Figure 7: Results for image generation from two different methods: (a) occluded images (from Celeb-A dataset [21]), (b) images generated by GFC [6] without post-processing, (c) images generated by SD-GAN, (d) expected output. The values below each image give the (PSNR/SSIM) values of the images compared to the expected output.
Table 1: Quantitative values (averaged over the whole dataset) for different face images generated following the protocol, as in columns (b)-(d) of figure 5 for the AR dataset, compared with state-of-the-art techniques.
7.4. Performance boost in Face Recognition
Face Recognition (FR) systems underperform when the faces are occluded. Our proposed SD-GAN reconstructs a full face when presented with an occluded face, which facilitates efficient performance for FR. Performances of several recent shallow learning techniques, viz. LSM [26], RPCA [27] and GL [25], have been compared with our proposed method and GFC [6] for generation of the faces, evaluated using state-of-the-art benchmark FR systems, like PCA [28], Gabor [29], LPP [30], Sparse Representation (SR) [4] and VGG [31]. The results in table 2 show the rank-1 accuracies for the AR1 and AR2 datasets, where our proposed model (SD-GAN) outperforms all other methods, suggesting that it generates discriminative parts of the face better than the other competing methods. The values in table 2 are FR performances for images generated by the methods listed at the top of each column, while the FR methods appear at the left of each row. Observe the large jump in performance from the statistical methods to the GAN-based methods, indicating the power of GAN-based techniques for handling occluded faces, specifically in FR applications.
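The rank-1 recognition rate used in these comparisons can be illustrated with a nearest-neighbour matcher over feature vectors. The sketch assumes squared Euclidean distance and a single gallery sample per comparison; the FR systems evaluated in this paper use their own features and matchers, and the name `rank1_accuracy` and the toy features are hypothetical.

```python
def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Percentage of probes whose nearest gallery sample has the same identity."""
    correct = 0
    for feat, true_id in zip(probe_feats, probe_ids):
        # Squared Euclidean distance to every gallery feature (assumed metric).
        dists = [sum((a - b) ** 2 for a, b in zip(feat, g)) for g in gallery_feats]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        if gallery_ids[nearest] == true_id:
            correct += 1
    return 100.0 * correct / len(probe_ids)

gallery_feats = [[0.0, 0.0], [10.0, 10.0]]
gallery_ids = ["a", "b"]
probe_feats = [[1.0, 0.0], [9.0, 9.0], [6.0, 6.0]]
probe_ids = ["a", "b", "a"]   # the third probe lies closer to subject "b"
accuracy = rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids)
```

In this toy case two of the three probes are matched to the correct identity. In an occlusion setting, the probe features would be extracted from the reconstructed faces rather than from the occluded inputs.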
Table 2: Rank-1 recognition rates (in %) exhibiting a higher performance for Face Recognition by SD-GAN (SDG), compared with several state-of-the-art shallow and deep learning techniques on the AR dataset. The results in bold mark the best performance (row-wise).
An extension of Linear Discriminant Analysis (LDA) [32] to the two color channels, I-chrominance and the Red channel (LDA-IR), is described in [32]. Inter-Session Variability (ISV) [32] modeling is a technique that has been successfully employed for face verification, which does not have occluded images during training. The rank-1 recognition rates of VGG+SD-GAN (VGG is used as a classifier with SDG as the generator), when compared with these two state-of-the-art techniques, LDA-IR and ISV, are much higher for the AR database, as reported in table 3.
Table 3: Rank-1 Recognition rates for end-to-end system for occluded face recognition. Higher values are better.
7.5. Analysis of training time of SD-GAN, compared to GFC [6]
All experiments are performed on a machine with dual Nvidia TITAN X GPUs, 64 GB RAM and an Intel Core i7 4790K processor. The training for both models is performed using Keras with a TensorFlow backend. The training times are tabulated in table 4, which shows that SD-GAN is faster than GFC, since it converges near a Nash equilibrium (see arrow on graph in figure 8 for details) in fewer epochs than GFC.
Table 4: Comparison of training times of SD-GAN and GFC. Lower value is better.
7.6. Results on the Multi-PIE dataset
In order to evaluate our proposed algorithm on pose variations of the face images producing self-occlusions, we performed experiments on the Multi-PIE dataset, which has 750,000+ images at different poses. Self-occlusion of faces occurs due to off-frontal and out-of-plane rotation variations in pose. For evaluating performance using rank-1 recognition rates, we follow the protocol from [33], and only images from session one are used. Results are given in table 5. All images used for testing and validation have pose. A few results shown in figure 9 display the superiority of our method over TP-GAN (TPG) [33], both qualitatively and quantitatively in terms of the SSIM/PSNR values. Observe the sample in the last row of figure 9, which shows a non-frontal (not side profile view) query face. In this case, our result in (c) has produced the same illumination variation as that in GT (d), whereas the process of [33] in (b) produces exactly the opposite (mirror-like image) while producing a sharper contrast (unnecessarily, in general) than that in GT. Also, observe the presence of ear-rings (which appear non-identical) in the output of [33], not present in GT and our output in (c). The proposed system intrinsically exploits the symmetric nature of the face, helping to generate images with appropriate illumination variations at high resolution with the desired quality as in GT.

Figure 8: Graphs showing the discriminator and generator loss functions during training.
Table 5: Comparison of Rank-1 recognition rate for Multi-PIE dataset, with faces at values are in bold).
Figure 9: Results for image generation from two different methods: (a) Images at different poses (obtained from Multi-Pie dataset [22], with left- (top-row) & right-looking (Middle-row) profiles at ; and a face image at
)) used for testing, (b) image generated by TPG [33], (c) image generated by SDG, (d) expected output (ground-truth). The values below each image give the (PSNR/SSIM) values of the image compared to the expected (target) output.
The proposed SD-GAN model uses end-to-end training for reconstruction of occluded parts of the face. The proposed technique does not rely on any post-processing technique for semantic correction of the faces. Thus, this module may be used as pre-processing for any FR system, in cases where faces are occluded. A faster training time is ensured in this model, based on the Nash Equilibrium. The qualitative and the quantitative results discussed above confirm the superiority of our proposed model. Misalignment of faces may lead to distortions as happens in all reconstruction techniques. In order to generate better quality photo-realistic images for AR and LFW datasets, the dual pathway technique proposed in [33] can be used as a post-processing stage following our SD-GAN.
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[2] C. Barnes, E. Shechtman, A. Finkelstein, D. B. Goldman, Patchmatch: A randomized correspondence algorithm for structural image editing, ACM Trans. Graph. 28 (3) (2009) 24–1.
[3] J.-B. Huang, S. B. Kang, N. Ahuja, J. Kopf, Image completion using planar structure guidance, ACM Transactions on Graphics (TOG) 33 (4) (2014) 129.
[4] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227.
[5] J. S. Ren, L. Xu, Q. Yan, W. Sun, Shepard convolutional neural networks, in: Advances in Neural Information Processing Systems, 2015, pp. 901–909.
[6] Y. Li, S. Liu, J. Yang, M.-H. Yang, Generative face completion, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] A. Martinez, R. Benavente, The AR face database, CVC Tech. Report (1998) 24.
[8] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing (TIP) 13 (4) (2004) 600–612.
[9] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[10] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Int’l Conference on Machine learning (ICML), 2008.
[11] J. Schmidhuber, Learning factorial codes by predictability minimization, Neural Computation 4 (6) (1992) 863–879.
[12] J. F. Nash, et al., Equilibrium points in n-person games, Proceedings of the National Academy of Sciences 36 (1) (1950) 48–49.
[13] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems (NIPS), 2007, pp. 153–160.
[14] C. Lee, C. Lee, C.-S. Kim, An MMSE approach to nonlocal image denoising: Theory and practical implementation, Journal of Visual Communication and Image Representation 23 (3) (2012) 476–490.
[15] S. Johnson, Stephen Johnson on digital photography, O’Reilly Media, Inc., 2006.
[16] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations, 2015.
[17] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning., in: OSDI, Vol. 16, 2016, pp. 265–283.
[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
[19] S.-i. Amari, Backpropagation and stochastic gradient descent method, Neurocomputing 5 (4) (1993) 185–196.
[20] R. Gibbons, A primer in game theory, Harvester Wheatsheaf, 1992.
[21] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of International Conference on Computer Vision (ICCV), 2015.
[22] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-pie, Image and Vision Computing (IVC) 28 (5) (2010) 807–813.
[23] B. Amos, B. Ludwiczuk, M. Satyanarayanan, Openface: A general-purpose face recognition library with mobile applications, Tech. rep., CMU-CS-16-118, CMU School of Computer Science (2016).
[24] P. Pérez, M. Gangnet, A. Blake, Poisson image editing, in: ACM Transactions on graphics (TOG), Vol. 22, ACM, 2003, pp. 313–318.
[25] Y. Deng, Q. Dai, Z. Zhang, Graph laplace for occluded face completion and recognition, IEEE Transactions on Image Processing (TIP) 20 (8) (2011) 2329–2338.
[26] B.-W. Hwang, S.-W. Lee, Reconstruction of partially damaged face images based on a morphable face model, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (3) (2003) 365–372.
[27] J. Wright, A. Ganesh, S. Rao, Y. Peng, Y. Ma, Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization, in: Advances in Neural Information Processing Systems, 2009, pp. 2080–2088.
[28] M. A. Turk, A. P. Pentland, Face recognition using eigenfaces, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 1991, pp. 586–591.
[29] Z. Lei, S. Liao, R. He, M. Pietikainen, S. Z. Li, Gabor volume based local binary pattern for face representation and recognition, in: 8th IEEE International Conference on Automatic Face & Gesture Recognition (FG), IEEE, 2008, pp. 1–6.
[30] X. He, P. Niyogi, Locality preserving projections, in: Advances in Neural Information Processing Systems (NIPS), 2004, pp. 153–160.
[31] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference (BMVC), Vol. 1, 2015, p. 6.
[32] M. Günther, L. El Shafey, S. Marcel, Face recognition in challenging environments: An experimental and reproducible research survey, in: Face Recognition Across the Imaging Spectrum, Springer, 2016, pp. 247–280.
[33] R. Huang, S. Zhang, T. Li, R. He, Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis, ICCV, 2017.