As one of the most natural, powerful, and universal means of human communication, facial expression has been studied intensively in various active research fields. Most recently, deep CNNs have attracted increased attention for facial expression recognition. Regardless of the CNN structure employed, most of these approaches were trained to learn expression-related facial features from all subjects, as illustrated in Fig. 1 (a), while identity-related attributes, e.g., age, race, and gender, are not explicitly considered.
However, it is widely believed that facial appearance and 3D geometry are determined by person-specific attributes and only temporarily affected by facial expressions. For instance, it is hard to differentiate the transient wrinkles caused by facial expressions from the permanent ones of older adults. In addition, the presence of facial hair such as beards may introduce various occlusions for male versus female subjects.
Furthermore, studies in psychology have shown that emotional expressions differ considerably across age [10], race [41], and gender [41, 11, 39] in terms of expression intensity. For example, Asian people show consistently lower-intensity expressions than the other ethnic groups [27], and women were found to express more anger [11] and sadness [39] than men.
Due to high inter-subject variations caused by attributes, it remains challenging to learn expression-related features with CNNs, especially from static images. This motivates us to alleviate identity-related variations by explicitly modeling person-specific attributes in CNNs.
An intuitive solution is to train multiple attribute-specific CNNs from subsets of the dataset. As shown in Fig. 1 (b), a set of CNNs can be trained from different combinations of attributes, respectively. However, unlike large-scale datasets for object detection or categorization, expression-labeled datasets are much smaller and most of them lack attribute annotations. Moreover, classifying attributes in the real world is still a challenging problem. Therefore, the recognition performance of attribute-specific CNNs is likely to degrade due to misclassified attributes and insufficient training data in the subsets, as demonstrated in our experiments.
In this work, we propose a novel PAT-CNN, where features are learned through a hierarchical tree structure organized according to attributes. As shown in Fig. 1 (c), a novel PAT module is embedded right after the last pooling layer. The PAT module has a hierarchical tree structure, where each node contains a fully connected (FC) layer connected to the FC layer in its parent node, if any.
Given a set of training samples, clustering is conducted at each node except the leaf nodes according to one type of attribute, using features from the FC layer in that node. Hence, the number of child nodes is determined by the number of clusters, i.e., the number of states of the attribute. For example, as shown in Fig. 1 (c), clustering is performed in terms of “Gender” at the root node and results in two clusters corresponding to two child nodes for male and female, respectively. Since the goal is not attribute recognition, each data sample is probabilistically assigned to all nodes. As depicted in Fig. 1 (c), a female Caucasian has a high probability of being assigned to the “Female” node at the second level and the “Female Caucasian” node at the third level. Expression-related features, i.e., the output of the PAT module, are extracted from the FC layers at the leaf nodes, from which a set of expression classifiers are trained. The final expression classification decision is obtained as a weighted sum of all expression classifiers.

Figure 1: Deep CNN structures for facial expression recognition: (a) a traditional CNN trained from all data, (b) attribute-specific CNNs trained from subsets of the dataset, and (c) the proposed PAT-CNN, where each node contains an FC layer connected to the FC layer in its parent node, if any. Dots represent cluster centers, each of which corresponds to a state of the specific attribute. The green dots denote the cluster centers corresponding to the ground truth attribute states of the current sample, e.g., “Female” for gender in the root node. Best viewed in color.
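The weighted-sum decision over the leaf classifiers can be illustrated with a minimal sketch. It assumes that each leaf classifier outputs a score vector over expressions and that the weights are the probabilities of the sample belonging to each leaf node; the function name `fuse_leaf_predictions` and the toy numbers are hypothetical and are not taken from the paper.

```python
def fuse_leaf_predictions(leaf_probs, leaf_scores):
    """Weighted sum of per-leaf expression scores.

    leaf_probs:  probability of the sample belonging to each leaf node
    leaf_scores: one expression score vector per leaf classifier
    (Both the inputs and the weighting scheme are illustrative assumptions.)
    """
    if len(leaf_probs) != len(leaf_scores):
        raise ValueError("one score vector is needed per leaf node")
    num_classes = len(leaf_scores[0])
    fused = [0.0] * num_classes
    for weight, scores in zip(leaf_probs, leaf_scores):
        for c in range(num_classes):
            fused[c] += weight * scores[c]
    return fused


# Toy example: two leaf nodes, three expression classes (hypothetical numbers).
leaf_probs = [0.75, 0.25]
leaf_scores = [[0.5, 0.25, 0.25], [0.0, 1.0, 0.0]]
fused = fuse_leaf_predictions(leaf_probs, leaf_scores)
# Index of the expression with the largest fused score.
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the second leaf has a low weight but a very confident classifier, so it still shifts the fused decision; a hard assignment to the most probable leaf would ignore it entirely.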
Furthermore, a semi-supervised learning strategy is developed to learn the PAT-CNN from limited attribute-annotated data. In addition to the loss for expression classification, a novel PAT loss function is developed to iteratively update cluster centers, shown as blue and green dots in Fig. 1 (c), during training, which ensures clustering results are semantically meaningful. Specifically, the data samples with attribute labels are used to minimize the PAT loss; whereas all samples with expression labels are employed to minimize the expression loss. Note that the attribute labels are only used in training, but not in testing.
In summary, our major contributions are:
- Developing a PAT-CNN to alleviate variations introduced by person-specific attributes for facial expression recognition.
- Developing a novel PAT module with an associated PAT loss to learn expression-related features in a hierarchical manner, where the output features of the PAT module are less affected by attributes.
- Developing a semi-supervised learning strategy to train the PAT-CNN from limited attribute-annotated data, making the best use of available facial expression datasets.

Extensive experiments on five expression datasets show that the proposed PAT-CNN yields considerable improvement over the baseline CNNs learned from all training data (Fig. 1 a) as well as the attribute-specific CNNs learned from subsets of the dataset (Fig. 1 b). We also show that the proposed soft clustering with probability outperforms hard clustering using the same network structure. More impressively, the PAT-CNN using a single model achieves the best performance for faces in the wild on the SFEW dataset, compared with state-of-the-art methods using an ensemble of hundreds of CNNs.
alignment, pose estimation, gender recognition, age estimation, smile detection, and face recognition using a single deep CNN. To deal with the incomplete annotations and thus insufficient and unbalanced training data for various tasks, the all-in-one framework was split into subnetworks, which were trained individually. Our approach differs significantly from the MTL approach [33] in that we jointly minimize the loss of the major task, i.e., expression recognition errors, and those of the auxiliary tasks, i.e., the PAT loss, calculated in a hierarchical tree structure. In addition, semi-supervised learning is employed in our approach to make the best use of all available data.
Recently, clustering has been utilized to group deep features. A recurrent framework [47] updates deep features and image clusters alternately until the number of clusters reaches the predefined value. DeepCluster [4] alternately groups the features by k-means and uses the subsequent assignments as supervision to learn the network. Deep Density Clustering (DDC) [21] groups unconstrained face images based on local compact representations and a density-based similarity measure. In contrast to these unsupervised clustering methods, the proposed PAT-CNN takes advantage of available attribute annotations and is thus capable of learning semantically meaningful clusters that are related to facial expression recognition. Moreover, data samples are probabilistically assigned to clusters at different levels of the hierarchy to alleviate the misclassifications due to clustering errors.
In this section, we will first introduce the proposed PAT module, then present the PAT loss and the corresponding forward and backward propagation processes. Finally, we will show the overall loss function of the PAT-CNN.
3.1 The Overview of the PAT Module
The architecture of the proposed PAT-CNN is illustrated in Fig. 2, where a general l-Level PAT module is embedded between the last pooling layer and the decision layer of the CNN. Level-1 of the PAT contains the root node, which has one FC layer connected to the last pooling layer. Starting from Level-2, each level consists of a number of nodes, each of which contains an FC layer connected to the FC layer located at its parent node at the previous level.
Except for the leaf nodes, clustering is performed at each PAT node according to a type of attribute, e.g., age, race, and gender; nodes in the same level consider the same type of attribute. As shown in Fig. 2, features extracted from the associated FC layer are grouped into a number of clusters with the centers denoted as blue or green dots. The number of clusters is determined by the number of states of the specific attribute, e.g., 2 for gender. Furthermore, these clusters also correspond to the node's child nodes in the next level. As shown in Fig. 1 (c), the “Female” cluster in the root node corresponds to one of its child nodes, i.e., the “Female” node.
During training, the samples with attribute labels are used to update the cluster centers and learn the parameters of the FC layers of the PAT module by minimizing the proposed PAT loss. As shown in Fig. 2, each sample contributes to all nodes differently according to its probabilities of belonging to the nodes, illustrated by the color intensities of the nodes. Specifically, a sample contributes more to those nodes containing the green dots, which denote the cluster centers corresponding to its ground truth attribute states, but also contributes to the other nodes with lower probabilities. As shown in Fig. 1 (c), a sample of a female Caucasian contributes more to learning the parameters associated with the “Female” node at the second level and the “Female Caucasian” node at the third level, but less to the other nodes.

Figure 2: An illustration of the PAT-CNN. Each node contains an FC layer, which connects to another FC layer located at its parent node, if any. At each node except the leaf nodes, features are extracted at the associated FC layer and clustered according to a specified attribute, e.g., age, gender, and race, with centers marked by dots. The green dots denote the cluster centers corresponding to the ground truth attribute states of the current sample. Intensity of red nodes represents the probability of the current sample belonging to the node. Best viewed in color.
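How a sample's probability might propagate through the tree can be sketched for the gender/race example of Fig. 1 (c). The sketch assumes that the probability of reaching a node is the product of the assignment probabilities along the path from the root; this product rule, the function name `leaf_probabilities`, and all numbers are illustrative assumptions rather than the paper's exact formulation.

```python
def leaf_probabilities(gender_probs, race_probs_per_gender):
    """Multiply assignment probabilities along each root-to-leaf path.

    gender_probs:          cluster probabilities at the root node
    race_probs_per_gender: cluster probabilities at each second-level node
    (The product rule is an assumption made for illustration.)
    """
    leaves = {}
    for gender, p_gender in gender_probs.items():
        for race, p_race in race_probs_per_gender[gender].items():
            leaves[(gender, race)] = p_gender * p_race
    return leaves


# Hypothetical probabilities for a sample of a female Caucasian.
gender_probs = {"female": 0.75, "male": 0.25}
race_probs = {
    "female": {"african_american": 0.125, "asian": 0.125, "caucasian": 0.75},
    "male": {"african_american": 0.25, "asian": 0.25, "caucasian": 0.5},
}
leaves = leaf_probabilities(gender_probs, race_probs)
most_likely_leaf = max(leaves, key=leaves.get)
```

Because the probabilities at every node sum to one, the leaf probabilities also sum to one, so every leaf receives a share of the sample instead of a single hard assignment.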
3.2 The PAT Loss: Forward Propagation
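The PAT loss, denoted as $\mathcal{L}^{PAT}$, measures how far the data samples are from their corresponding cluster centers and is calculated from samples with attribute labels. It is defined as the summation of the attribute losses of all levels except the leaf one:

$$\mathcal{L}^{PAT} = \sum_{j=1}^{l-1} \mathcal{L}^{PAT}_{j} \tag{1}$$

where $l$ is the number of PAT levels and $\mathcal{L}^{PAT}_{j}$ is the attribute loss of the $j$-th level, which is defined as:

$$\mathcal{L}^{PAT}_{j} = \sum_{k=1}^{K_j} \mathcal{L}^{PAT}_{jk} \tag{2}$$

where $K_j$ is the number of tree nodes at the $j$-th level and $\mathcal{L}^{PAT}_{jk}$ is the attribute loss of the $k$-th node at the $j$-th level. From now on, the subscripts $j$ and $k$ denote the variable at the $j$-th level and the $k$-th node, respectively.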
Let $\{c^{m}_{jk}\}$ denote a set of cluster centers and $x^{i}_{jk}$ denote the feature vector of the $i$-th sample extracted from the FC layer in the $k$-th node at the $j$-th level. Given the data samples with attribute labels, the attribute loss of the $k$-th node at the $j$-th level is calculated as:

[Equation (3): not recoverable from the source]

Eq. 3 is defined in terms of the ground truth attribute label of the $i$-th sample at the $j$-th level, e.g., gender = “Female” in Fig. 1 (c); the $m$-th cluster center $c^{m}_{jk}$ in the $k$-th node, i.e., a dot in the $k$-th node; and the cluster center corresponding to the ground truth attribute label, i.e., the only green dot at the $j$-th level in Fig. 1 (c) and Fig. 2. $D(a, b)$ is a distance function, defined in this work as the cosine distance between two vectors.
The cluster center $c^{m}_{jk}$ falls into one of three cases: (1) the $m$-th cluster is denoted by a green dot; (2) the $k$-th node contains the green dot, but the $m$-th cluster is denoted by a blue dot; and (3) the $k$-th node contains all blue dots.
Thus, each attribute loss is calculated by Eq. 3 in two cases. Using Fig. 1 (c) as an example, both the “Female” and “Male” nodes contain three clusters according to “Race”. In the first case, the attribute loss of the “Female” node is calculated such that the current sample, i.e., a female Caucasian, is pushed toward the “Caucasian” center and pulled away from the other centers by minimizing the corresponding loss terms. In the second case, the attribute loss of the “Male” node is calculated such that the sample is pulled away from all the centers by minimizing the corresponding loss term.
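The probability of $x^{i}_{jk}$ belonging to the $m$-th cluster, which is also the probability of the $i$-th sample being assigned to the $m$-th child node at the $(j+1)$-th level, is calculated as:

[Equation (4): not recoverable from the source]

Eq. 4 involves the probability of the $i$-th sample belonging to the $k$-th node at the $j$-th level, which is calculated at its parent node as described above.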
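The cosine distance $D(a, b)$ used in the PAT loss can be sketched as follows, assuming the common definition of one minus the cosine similarity; the paper does not spell out the exact variant, so this form and the function name `cosine_distance` are assumptions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b), an assumed form of the distance D(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Parallel, orthogonal, and opposite vectors span the full range [0, 2].
parallel = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Under this definition the distance depends only on the angle between a feature and a cluster center, not on their magnitudes.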
3.3 The PAT Loss: Backward Propagation
The partial derivative of the PAT loss with respect to the input sample can be calculated at each node as:
3.4 A Marginal Softmax Loss
Therefore, given the data samples with attribute labels and the data samples with expression labels, the overall loss function for PAT-CNN training is given below:

[Overall loss equation: not recoverable from the source]

where a hyperparameter balances the two losses. Note that the two sets of samples are not necessarily the same, and the PAT loss can be calculated from a small subset of attribute-labeled data. This enables semi-supervised learning of the PAT-CNN and makes it feasible to improve expression recognition on existing datasets without attribute labels with the help of additional attribute-labeled data.
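The semi-supervised combination of the two losses can be sketched as below. The additive form `loss_exp + lam * loss_pat`, the name `lam` for the balancing hyperparameter, and the per-sample loss values are assumptions made for illustration; they are not the paper's stated formula.

```python
def overall_loss(expression_losses, attribute_losses, lam):
    """Combine the two objectives as L_exp + lam * L_PAT (assumed form).

    expression_losses: per-sample losses for all expression-labeled samples
    attribute_losses:  per-sample PAT losses for the attribute-labeled subset
    lam:               hypothetical name for the balancing hyperparameter
    """
    loss_exp = sum(expression_losses)  # uses every expression-labeled sample
    loss_pat = sum(attribute_losses)   # uses only attribute-labeled samples
    return loss_exp + lam * loss_pat


# Four expression-labeled samples, only two of which carry attribute labels.
expression_losses = [0.5, 0.25, 1.0, 0.25]
attribute_losses = [0.5, 0.5]
total = overall_loss(expression_losses, attribute_losses, lam=0.5)
```

The two lists have different lengths on purpose: the attribute term is computed from whatever subset has attribute annotations, while the expression term always covers the full set.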
The forward and backward training process in the PAT-CNN is summarized in Algorithm 1.
To evaluate the proposed PAT-CNN, experiments have been conducted on five benchmark datasets including three posed facial expression datasets, i.e., the BU-3DFE dataset [50], the CK+ dataset [14, 25], and the MMI dataset [31], and more importantly, two spontaneous ones, i.e., the Static Facial Expression in the Wild (SFEW) dataset [7] and the RAF-DB dataset [19].
4.1 Preprocessing
Face alignment was performed on each image based on the centers of the two eyes and the nose, extracted by Discriminative Response Map Fitting (DRMF) [1]. The aligned facial images were then resized to a fixed size. In addition, histogram equalization was utilized to improve the contrast in facial images. For data augmentation, patches were randomly cropped from the resized images and then rotated by a random angle between -5° and 5°. Finally, the rotated images were randomly horizontally flipped and used as the input of all CNNs in comparison.
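Two of the augmentation steps, random cropping and random horizontal flipping, can be sketched in plain Python on an image stored as a list of rows. The image and patch sizes below are hypothetical placeholders, and the random rotation step is omitted to keep the sketch dependency-free.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a uniformly random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_horizontal_flip(image, rng, prob=0.5):
    """Mirror the image left-to-right with probability prob."""
    if rng.random() < prob:
        return [row[::-1] for row in image]
    return image


def augment(image, crop_h, crop_w, rng):
    """Random crop followed by a random horizontal flip."""
    return random_horizontal_flip(random_crop(image, crop_h, crop_w, rng), rng)


# Hypothetical 8x8 "image" and 6x6 patch size, chosen only for illustration.
rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patch = augment(image, 6, 6, rng)
```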
4.2 Experimental Datasets
RAF-DB dataset provides attribute labels, e.g., age, race, and gender, for each image and thus was employed as the major dataset for experimental validation in this work. Specifically, the single-label subset of the RAF-DB (12,271 training images and 3,068 testing images) was employed, where each image was labeled as one of seven expressions, i.e., neutral and six basic expressions. The subjects were divided into five age groups: (1) 0-3, (2) 4-19, (3) 20-39, (4) 40-69, (5) 70+, with the gender attribute labeled as one of three categories, i.e., male, female, and unknown, and the race attribute labeled as one of three categories, i.e., African-American, Asian, and Caucasian.
SFEW dataset is the most widely used benchmark dataset for facial expression recognition in the wild. It contains 1,766 facial images and has been divided into three sets, i.e., Train (958), Val (436), and Test (372). Each image has one of seven expression labels, i.e., neutral and six basic expressions. The expression labels of the “Test” set are not publicly available. Thus, the performance on the “Test” set was evaluated and provided by the challenge organizer.
BU-3DFE dataset consists of 2,500 pairs of static 3D face models and 2D texture images from 100 subjects with a variety of ages and races. Each subject displays six basic expressions with four levels of intensity and a neutral expression. Following [45, 46], we employed only the 2D texture images of the six basic expressions with high intensity (i.e., the last two levels) in our experiment. Thus, an experimental dataset including 1,200 images was built for the BU-3DFE dataset.
CK+ dataset consists of 327 videos collected from 118 subjects, each of which was labeled with one of seven expressions, i.e., contempt and six basic expressions. Each video starts with a neutral face and reaches the peak in the last frame. To collect more data, the last three frames were collected as peak frames associated with the provided label. Hence, an experimental dataset including 981 images was built for seven expressions and 927 images for six basic expressions.
MMI dataset contains 236 sequences from 32 subjects, of which the 208 frontal-view sequences of 31 subjects displaying six basic expressions are normally used in experimental validation. Sequences in MMI start from a neutral expression, pass through a peak phase near the middle, and return to a neutral face at the end. Since the actual location of the peak frame is not provided, three frames in the middle of each sequence were collected with the labeled expression. Thus, a total of 624 images were used in our experiments.
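The frame selection used for the two video datasets can be sketched as follows: the last three frames for CK+ and three frames around the middle for MMI. The helper names are hypothetical, and centering the window with integer division is one reasonable reading of "in the middle", not a detail given in the text.

```python
def last_frames(sequence, count=3):
    """Peak frames for sequences that end at the expression peak (CK+)."""
    if len(sequence) < count:
        raise ValueError("sequence is shorter than the requested frame count")
    return sequence[-count:]


def middle_frames(sequence, count=3):
    """Frames around the middle for sequences that peak mid-way (MMI)."""
    if len(sequence) < count:
        raise ValueError("sequence is shorter than the requested frame count")
    start = (len(sequence) - count) // 2
    return sequence[start:start + count]


frames = list(range(10))          # stand-in for a 10-frame video
ck_peak = last_frames(frames)     # frames 7, 8, 9
mmi_peak = middle_frames(frames)  # frames 3, 4, 5

# Image counts implied by three frames per sequence.
ck_images = 327 * 3
mmi_images = 208 * 3
```

Three frames per sequence reproduces the reported dataset sizes: 327 CK+ videos give 981 images and 208 MMI sequences give 624 images.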
4.3 Training/testing strategy
Because the three posed facial expression datasets do not provide specified training, validation, and test sets, a person-independent 10-fold cross-validation strategy was employed, where each dataset was split into 10 subsets such that the subjects in any two subsets are mutually exclusive. For each run, data from 8 subsets were used for training, and the remaining two subsets were used for validation and testing, respectively. The results were reported as the average of the 10 runs on the testing sets. For the experiments on the SFEW and RAF-DB datasets, we used their training sets for training and their validation and/or testing sets for evaluation, respectively.
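A person-independent split of this kind can be sketched by assigning whole subjects, rather than individual images, to folds. The round-robin assignment and the choice of the fold following the test fold as the validation fold are assumptions; the text only states that 8 subsets train and the remaining two validate and test.

```python
def subject_folds(subject_ids, num_folds=10):
    """Map each subject to one fold so no subject spans two folds."""
    unique = sorted(set(subject_ids))
    return {subject: i % num_folds for i, subject in enumerate(unique)}


def split_for_run(samples, fold_of, run, num_folds=10):
    """Split (subject_id, item) pairs into train/val/test for one run."""
    test_fold = run % num_folds
    val_fold = (run + 1) % num_folds  # assumed choice of validation fold
    split = {"train": [], "val": [], "test": []}
    for subject, item in samples:
        fold = fold_of[subject]
        if fold == test_fold:
            split["test"].append((subject, item))
        elif fold == val_fold:
            split["val"].append((subject, item))
        else:
            split["train"].append((subject, item))
    return split


# Hypothetical data: 20 subjects with 3 images each.
samples = [(s, f"img_{s}_{k}") for s in range(20) for k in range(3)]
fold_of = subject_folds([s for s, _ in samples])
split = split_for_run(samples, fold_of, run=0)
```

Iterating `run` from 0 to 9 lets each fold serve as the test set exactly once, matching the averaging over 10 runs.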
4.4 CNN Implementation Details
Figure 3: Variants of PAT-CNNs evaluated in the experiments. Best viewed in color.
4.5 Experimental Results
4.5.1 Result analysis for the RAF-DB
Table 1: Performance comparison on the RAF-DB. Some papers report performance as an average of diagonal values of confusion matrix. We convert them to regular accuracy for fair comparison.
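The difference between the two metrics mentioned in the caption of Table 1 can be sketched as below. The sketch assumes a confusion matrix of raw counts with rows as true classes; the conversion function is one way to recover regular accuracy from per-class recalls when class sizes are known, and is not claimed to be the procedure used in the paper. The confusion matrix is a made-up toy example.

```python
def mean_diagonal(confusion):
    """Average of the row-normalized diagonal (mean per-class recall)."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(recalls) / len(recalls)


def overall_accuracy(confusion):
    """Regular accuracy: correctly classified samples over all samples."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total


def recalls_to_accuracy(recalls, class_counts):
    """Weight each per-class recall by its class size."""
    weighted = sum(r * n for r, n in zip(recalls, class_counts))
    return weighted / sum(class_counts)


# Toy imbalanced two-class confusion matrix (rows: true, columns: predicted).
confusion = [[90, 10], [5, 5]]
diag_avg = mean_diagonal(confusion)
accuracy = overall_accuracy(confusion)
converted = recalls_to_accuracy([0.9, 0.5], [100, 10])
```

With imbalanced classes the two numbers diverge noticeably, which is why results reported under different metrics need to be converted before they are compared.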
the soft-clustering with probability outperforms hard-clustering, i.e., AT-VGG-(gender, race), using the same network structure.
Figure 4: Performance analysis by varying the percentage of attribute-labeled images in the RAF-DB dataset.
Moreover, we evaluated the semi-supervised learning strategy for PAT-CNN training, which is critical for real-world applications, where attribute information may be missing or incomplete. Specifically, we varied the percentage of attribute-labeled images from 100% to 10% by randomly removing attribute labels for RAF-DB images. As shown in Fig. 4, PAT-VGG-(gender, race) outperforms the VGG baseline when more than 20% of RAF-DB images have attribute labels.
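Randomly removing attribute labels to simulate partially annotated data can be sketched as follows. The function name `keep_attribute_labels`, the use of `None` for a removed label, and the fixed seed are assumptions for illustration.

```python
import random


def keep_attribute_labels(labels, keep_fraction, seed=0):
    """Return a copy in which only keep_fraction of the labels survive.

    Removed labels are replaced with None, so such samples would contribute
    to the expression loss only (hypothetical convention).
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    num_keep = round(len(labels) * keep_fraction)
    keep = set(rng.sample(range(len(labels)), num_keep))
    return [label if i in keep else None for i, label in enumerate(labels)]


# Ten samples with (gender, race) labels; keep 20% of them.
labels = [("female", "asian")] * 10
masked = keep_attribute_labels(labels, 0.2)
num_labeled = sum(label is not None for label in masked)
```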
Table 2: Performance comparison on the BU-3DFE.
Table 3: Performance comparison on the CK+.
4.6 Result analysis for the three posed datasets
As shown in Tables 2, 3, and 4, the proposed PAT-CNNs outperform the baseline CNNs for both backbone structures and also achieve results comparable to the state-of-the-art methods evaluated on the three posed datasets. Note that most of the state-of-the-art methods utilized dynamic features extracted from image sequences, while the proposed PAT-CNN is trained on static images, which is more favorable for online applications or snapshots. Yang et al. [48] achieved the highest performance on the BU-3DFE dataset by employing geometric features of the 3D shape model. Although the cGAN-based methods (DeRL [45] and IA-gen [46]) achieved high performance on the BU-3DFE and CK+ datasets, they are not end-to-end systems and also require higher computational cost. We are aware that PPDN [54] achieved the best performance on the CK+ dataset owing to the use of neutral images as reference. Island Loss [3] achieved the best performance on the MMI dataset by utilizing an average fusion of the three images from the same sequence.

Table 4: Performance comparison on the MMI.
4.7 Result analysis for the SFEW dataset
More importantly, the proposed PAT-CNN was also evaluated on the SFEW dataset, which contains unconstrained and thus more natural facial expressions and has been used as a benchmark to evaluate facial expression recognition systems in the wild. Note that the top three methods reported on the SFEW testing set [16, 51, 3] utilized an ensemble of CNNs. As shown in Table 5, the proposed PAT-CNNs beat the baseline CNNs on both the validation set and the testing set by a large margin. More impressively, the proposed PAT-CNNs with both backbone structures, each using a single model, achieve the best performance on the testing set among all compared methods.
In this work, we proposed a novel PAT-CNN along with a forward-backward propagation algorithm to learn expression-related features in a hierarchical structure by explicitly modeling identity-related attributes in the CNN. Our work differs from the other unsupervised clustering methods in that the proposed PAT-CNN is capable of building semantically-meaningful clusters from which expression-related features are learned to alleviate the inter-
Table 5: Performance comparison on the SFEW.
[2] S. Berretti, A. Del Bimbo, P. Pala, B. Amor, and M. Daoudi. A set of selected sift features for 3d facial expression recognition. In ICPR, pages 4125–4128. IEEE, 2010.
[3] J. Cai, Z. Meng, A. Khan, Z. Li, J. O’Reilly, and Y. Tong. Island loss for learning discriminative features in facial expression recognition. In FG, pages 302–309. IEEE, 2018.
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. arXiv preprint, 2018.
[5] J. Chen, R. Xu, and L. Liu. Deep peak-neutral difference feature for facial expression recognition. Multimedia Tools and Applications, pages 1–17, 2018.
[6] W. Chu, F. De la Torre, and J. Cohn. Selective transfer machine for personalized facial expression analysis. IEEE TPAMI, 2016.
[7] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon. Video and image based emotion recognition challenges in the wild: Emotiw 2015. In ICMI, pages 423–426. ACM, 2015.
[8] H. Ding, S. Zhou, and R. Chellappa. Facenet2expnet: Regularizing a deep face recognition net for expression recognition. In FG, pages 118–126. IEEE, 2017.
[9] Y. Fan, J. Lam, and V. Li. Multi-region ensemble convolutional neural network for facial expression recognition. In IACNN, pages 84–94. Springer, 2018.
[10] J. Gross, L. Carstensen, M. Pasupathi, J. Tsai, C. Götestam Skorpen, and A. Hsu. Emotion and aging: Experience, expression, and control. Psychology and aging, 12(4):590, 1997.
[11] U. Hess, R. Adams Jr, and R. Kleck. Facial appearance, gender, and emotion expression. Emotion, 4(4):378, 2004.
[12] S. Jain, C. Hu, and J. K. Aggarwal. Facial expression recognition with temporal modeling of shapes. In ICCV Workshops, pages 1642–1649, 2011.
[13] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In ICCV, pages 2983–2991, 2015.
[14] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In FG, pages 46–53, 2000.
[15] H. Kaya, F. Gürpinar, S. Afshar, and A. A. Salah. Contrasting and combining least squares based learners for emotion recognition in the wild. In ICMI, pages 459–466, 2015.
[16] B. Kim, H. Lee, J. Roh, and S. Lee. Hierarchical committee of deep cnns with exponentially-weighted decision fusion for static facial expression recognition. In ICMI, pages 427–434. ACM, 2015.
[17] C. Kuo, S. Lai, and M. Sarkis. A compact deep learning model for robust facial expression recognition. In CVPR Workshops, pages 2121–2129, 2018.
[18] Y. Lai and S. Lai. Emotion-preserving representation learning via generative adversarial network for multi-view facial expression recognition. In FG, pages 263–270. IEEE, 2018.
[19] S. Li, W. Deng, and J. Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, July 2017.
[20] Y. Li, J. Zeng, S. Shan, and X. Chen. Patch-gated cnn for occlusion-aware facial expression recognition.
[21] W. Lin, J. Chen, C. Castillo, and R. Chellappa. Deep density clustering of unconstrained faces. In CVPR, pages 8128–8137, 2018.
[22] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen. Deeply learning deformable facial action parts model for dynamic expression analysis. In ACCV, pages 143–157. Springer, 2014.
[23] M. Liu, S. Shan, R. Wang, and X. Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In CVPR, pages 1749–1756, 2014.
[24] A. T. Lopes, E. de Aguiar, A. De Souza, and T. Oliveira-Santos. Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recognition, 61:610–628, 2017.
[25] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended cohn-kanade dataset (ck+): A complete expression dataset for action unit and emotion-specified expression. In CVPR Workshops, pages 94–101, 2010.
[26] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic. Automatic analysis of facial actions: A survey. IEEE Trans. on Affective Computing, 13(9):1–22, 2017.
[27] D. Matsumoto. Ethnic differences in affect intensity, emotion judgments, display rule attitudes, and self-reported emotional expression in an american sample. Motivation and emotion, 17(2):107–123, 1993.
[28] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong. Identity-aware convolutional neural network for facial expression recognition. In FG, pages 558–565. IEEE, 2017.
[29] A. Mollahosseini, D. Chan, and M. H. Mahoor. Going deeper in facial expression recognition using deep neural networks. In WACV, pages 1–10. IEEE, 2016.
[30] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In ICMI, pages 443–449, 2015.
[31] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In ICME, pages 5–pp. IEEE, 2005.
[32] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[33] R. Ranjan, S. Sankaranarayanan, C. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In FG, pages 17–24. IEEE, 2017.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[35] A. Sanin, C. Sanderson, M. Harandi, and B. Lovell. Spatio-temporal covariance descriptors for action and gesture recognition. In WACV, pages 103–110, 2013.
[36] E. Sariyanidi, H. Gunes, and A. Cavallaro. Automatic analysis of facial affect: A survey of registration, representation and recognition. IEEE T-PAMI, 37(6):1113–1133, 2015.
[37] E. Sariyanidi, H. Gunes, and A. Cavallaro. Learning bases of activity for facial expression recognition. IEEE T-IP, 26(4):1965–1978, 2017.
[38] K. Sikka, G. Sharma, and M. Bartlett. Lomo: Latent ordinal model for facial analysis in videos. In CVPR, pages 5580–5589, 2016.
[39] R. Simon and L. Nath. Gender and emotion in the united states: Do men and women differ in self-reports of feelings and expressive behavior. American journal of sociology, 109(5):1137–1176, 2004.
[40] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li, and Q. Wei. Combining multimodal features within a fusion network for emotion recognition in the wild. In ICMI, pages 497–502, 2015.
[41] S. Vrana and D. Rollock. The role of ethnicity, gender, emotional content, and contextual differences in physiological, expressive, and self-reported emotional responses to imagery. Cognition & Emotion, 16(1):165–192, 2002.
[42] J. Wang, L. Yin, X. Wei, and Y. Sun. 3d facial expression recognition based on primitive surface feature distribution. In CVPR, volume 2, pages 1399–1406. IEEE, 2006.
[43] Z. Wang, S. Wang, and Q. Ji. Capturing complex spatio-temporal relations among facial muscles for facial expression recognition. In CVPR, pages 3422–3429, 2013.
[44] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515. Springer, 2016.
[45] H. Yang, U. Ciftci, and L. Yin. Facial expression recognition by de-expression residue learning. In CVPR, pages 2168–2177, 2018.
[46] H. Yang, Z. Zhang, and L. Yin. Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks. In FG, pages 294–301. IEEE, 2018.
[47] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, pages 5147–5156, 2016.
[48] X. Yang, D. Huang, Y. Wang, and L. Chen. Automatic 3d facial expression recognition using geometric scattering representation. In FG Workshops, volume 1, pages 1–6. IEEE, 2015.
[49] A. Yao, J. Shao, N. Ma, and Y. Chen. Capturing AU-aware facial features and their latent relations for emotion recognition in the wild. In ICMI, pages 451–458, 2015.
[50] L. Yin, X. Wei, Y. Sun, J. Wang, and M. Rosato. A 3d facial expression database for facial behavior research. In FG, pages 211–216. IEEE, 2006.
[51] Z. Yu and C. Zhang. Image based static facial expression recognition with multiple deep network learning. In ICMI, pages 435–442, 2015.
[52] F. Zhang, T. Zhang, Q. Mao, and C. Xu. Joint pose and expression modeling for facial expression recognition. In CVPR, pages 3359–3368, 2018.
[53] S. Zhao, H. Cai, H. Liu, J. Zhang, and S. Chen. Feature selection mechanism in cnns for facial expression recognition. 2018.
[54] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan. Peak-piloted deep network for facial expression recognition. In ECCV, pages 425–442. Springer, 2016.
[55] Y. Zong, W. Zheng, X. Huang, K. Yan, J. Yan, and T. Zhang. Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis. J. on Multimodal User Interfaces, 10(2):163–172, 2016.