Group cohesiveness plays an important role in the study of small group behavior, social psychology, group dynamics, sport psychology, and organizational behavior [6, 14]. Cohesiveness has been found to be one of the critical factors influencing group performance. Several studies have shown that strong group performance is associated with a high level of group cohesion among the members [1, 5]. Moreover, recent research [8] shows that group cohesion is highly correlated with group-level emotion.
The rapid growth of web images, driven by photo hosting and sharing services such as Flickr, Facebook, and Google Photos, has gradually and significantly changed our lifestyle [17]. Many of these images are taken when people are attending meaningful social events, such as graduations, birthday parties, and family gatherings. Such images not only capture these precious moments, but also carry useful information that can be used to analyze group-level social attributes such as group cohesion. The availability of these images motivates the design of automatic systems capable of understanding human perception of cohesion at the group level.
Measuring and annotating group cohesion at different levels is often difficult for a human annotator, because cohesion has team and individual components [19]. The problem of group cohesiveness prediction becomes even more challenging in static images. Complications include face occlusions, illumination variations, head pose variations, varied indoor and outdoor settings, faces at different distances from the camera, and low-resolution face images. In this paper, we propose a robust ensemble model that separately processes various high-level information of faces, skeletons, and scenes. Then, regression values are calculated and fused for the final cohesive intensity. In the 7th Emotion Recognition in the Wild (EmotiW 2019) Sub-Challenge [3], the proposed hybrid model achieves a competitive result.
Many researchers have applied rapidly developing computer vision and machine learning techniques to the machine understanding of images and videos. One specific task is to study groups of people from images.
Photos of groups of people during social gatherings, such as birthday parties, graduations, and family reunions, are widely available. Gallagher and Chen [7] introduce contextual features that capture the structure of a group of people and the position of individuals within the group. This social context helps to accomplish a variety of tasks, such as identifying the demographics of people in the group, estimating camera and scene parameters, and classifying group events.
Recently, the EmotiW 2019 Challenge organizers presented the first study of group cohesion prediction in static images [8]. The organizers extend the Group Affect Database [4] with group cohesion labels and propose the new GAF Cohesion database. Two deep cohesion models, separately trained on holistic and face-level features, achieve results on the Cohesion database that approximate human-level performance. Motivated by considering cohesiveness as an attribute of group emotion, the paper jointly trains an Inception V3 model on both group emotion and group cohesion. In the experimental results, joint training on both emotion and cohesion achieves higher performance than individual training. This strongly suggests that group emotion and cohesion are correlated.
The system pipeline is shown in Figure 1. The basic idea of the proposed approach is to train a Support Vector Regression (SVR) model [22] on high-level features of the input images from different representations. The predicted regression values are then fused using a grid search to obtain the final prediction.
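The fusion step can be illustrated with a minimal sketch. The text does not specify the fusion rule, so this example assumes a weighted average of the per-stream predictions, with weights chosen on a coarse grid to minimize validation MSE. The function names (`mse`, `fuse`, `grid_search_weights`), the grid resolution `steps`, and the toy numbers are hypothetical and are not taken from the paper.

```python
from itertools import product


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fuse(streams, weights):
    """Weighted sum of per-stream predictions (one list per stream)."""
    n_samples = len(streams[0])
    return [sum(w * s[i] for w, s in zip(weights, streams))
            for i in range(n_samples)]


def grid_search_weights(streams, target, steps=10):
    """Try every weight vector on a grid whose entries sum to 1.

    Assumption: fusion is a convex combination of the stream outputs.
    Returns the best weights and the corresponding MSE.
    """
    best_w, best_err = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(streams)):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        weights = [c / steps for c in combo]
        err = mse(fuse(streams, weights), target)
        if err < best_err:
            best_w, best_err = weights, err
    return best_w, best_err


# Toy validation predictions from three streams (made-up values).
scene = [0.2, 0.9, 0.4, 0.7]
face = [0.5, 0.6, 0.5, 0.9]
skeleton = [0.3, 0.7, 0.6, 0.6]
target = [0.33, 0.67, 0.33, 1.0]

streams = [scene, face, skeleton]
best_w, best_err = grid_search_weights(streams, target)
```

Because the grid includes the single-stream weight vectors, the fused error on the validation data can never be worse than that of the best individual stream.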
Scene Features
Holistic (scene-level) information has been shown to be an important component in group-level classification [10, 12, 24]. When analyzing the cohesiveness of a group of people, it is essential to understand the environment around them, e.g., students in a lecture tend to have a low cohesion level, while a group of people standing and protesting at a plaza probably has high cohesiveness. To extract high-level interpretations of the holistic information, a state-of-the-art deep model, the Densely Connected Convolutional Network (DenseNet) [15], is applied.
DenseNets have several important advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. DenseNets achieve significant improvements over the state of the art on four highly competitive object recognition benchmarks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Before extracting holistic features with DenseNet, we fine-tune the network on the Emotic dataset [16], because the group cohesion level is related to group-level emotion and valence. The Emotic dataset consists of 18,316 images labeled in two ways: with 26 discrete emotion categories and with continuous valence dimensions scaled from 1 to 10. A DenseNet-161 model pre-trained on ImageNet is fine-tuned on the Emotic dataset using the continuous-dimension labels. With the last layer removed, the network yields a feature vector of size 2208 for each original image.
Considering the high correlation between group emotion and group cohesion, the overall facial emotion state of a group of people can contribute to group cohesiveness detection. The sample images in Figure 2 demonstrate that the average facial expression among all faces is a substantial indicator of group cohesiveness in the image. For instance, if most of the faces are classified as neutral expressions, the group cohesion level tends to be lower. Accordingly, faces are extracted using the Multi-task Cascaded Convolutional Network (MTCNN), which detects and aligns faces in real time and achieves superior accuracy on the challenging FDDB and WIDER FACE benchmarks for face detection and the AFLW benchmark for face alignment [25].
VGG Face is a deep network, containing 22 layers and 37 deep units, trained on a very large-scale dataset [18]. This dataset contains 2.6M images of over 2.6K people and was assembled by a combination of automation and manual operations. The fine-tuned VGG Face model is often used as a feature extractor that outputs the activation vector of a fully connected layer in the CNN architecture. This has proven more efficient than a model trained from scratch [11, 13]. To exploit the high-level abstractions of the extracted faces, the VGG Face model is trained on the facial expression dataset FER 2013 [9]. Then VGG Face, used as a feature extractor with the last fully connected layer removed, computes a feature vector of size 4096 for each face. As a result, each face has its own representation.
Figure 1: Overall Proposed Hybrid Network structure.
To train our SVR model, a single representation of each image is required. However, simply concatenating all feature vectors is not feasible, because each image can contain a different number of faces. Therefore, the face feature vectors are averaged to obtain a single facial feature vector that is fed into the SVR predictor.
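The averaging step can be sketched as follows. This is a minimal illustration, assuming each face is already represented as a fixed-length list of floats; the function name `average_face_features` is hypothetical, and the handling of images with no detected face (an error here) is an assumption, since the text does not describe that case.

```python
def average_face_features(face_vectors):
    """Average a variable number of per-face vectors into one image-level vector.

    face_vectors: list of equal-length lists (e.g., 4096 values per face).
    """
    if not face_vectors:
        # Assumption: the caller must handle images without detected faces.
        raise ValueError("at least one face feature vector is required")
    n_faces = len(face_vectors)
    dim = len(face_vectors[0])
    return [sum(v[k] for v in face_vectors) / n_faces for k in range(dim)]


# Two images with different face counts map to vectors of the same length.
image_a = average_face_features([[1.0, 2.0], [3.0, 4.0]])
image_b = average_face_features([[0.5, 0.5], [1.5, 2.5], [1.0, 0.0]])
```

The output dimension depends only on the per-face feature size, so the regressor always receives a fixed-length input regardless of group size.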
Skeleton Feature
As shown in Figure 3, skeleton features reveal salient patterns of the different categories through facial expressions, poses, gestures, and the structure of groups of people. In this work, the skeleton of each image is extracted using OpenPose [2, 21, 23], which can jointly detect human body, hand, and facial keypoints (135 keypoints in total) in each image. Furthermore, the OpenPose library provides multiple functions, such as 2D real-time multi-person keypoint detection, 3D real-time single-person keypoint detection, a calibration toolbox, and single-person tracking.
Figure 2: Samples of faces. Top: High-level Group Cohesiveness. Bottom: Low-level Group Cohesiveness.
A new model, EfficientNet, achieves state-of-the-art accuracy on ImageNet, CIFAR-100, and Flowers, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet [20]. EfficientNet is powered by a novel scaling method and advanced automated machine learning (AutoML). The heuristic model scaling method uses a simple yet highly effective compound coefficient to scale up CNNs in a more structured manner: it uniformly scales each dimension with fixed scaling coefficients. This differs from traditional approaches, which scale dimensions arbitrarily and usually require tedious manual tuning, e.g., ResNet is scaled up from ResNet-18 to ResNet-50, ResNet-101, and ResNet-152. A pre-trained (on ImageNet) EfficientNet model, with the last layer removed, extracts a feature vector of size 1536 for each original image.
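The compound scaling rule mentioned above can be written out in a few lines. This sketch uses the base coefficients reported in the EfficientNet paper [20] (alpha = 1.2 for depth, beta = 1.1 for width, gamma = 1.15 for resolution); the function names are hypothetical, and released model variants may round the resulting multipliers, so the values are illustrative only.

```python
# Base coefficients from the EfficientNet paper [20]; they are chosen so that
# alpha * beta**2 * gamma**2 is approximately 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi):
    """Approximate growth in FLOPS: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


# A single coefficient phi controls all three dimensions at once.
for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Each unit increase in phi roughly doubles the computational cost, which is what makes the scaling uniform rather than hand-tuned per dimension.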
Dataset
The group cohesiveness prediction dataset in EmotiW 2019 contains a total of 14,175 images. It is split into three parts: 9,815 images for training, 4,349 images for validation, and 3,011 images for testing. The database consists of all images in the GAF 3.0 database [4] plus a new set of images collected via web crawlers using various keywords related to social activities, e.g., wedding, birthday party, riot, and protest. The dataset is labeled with four cohesion levels: 0, 1, 2, and 3.
Figure 3: Samples of skeleton feature representations. Left: High-level Group Cohesiveness Right: Low-level Group Cohesiveness.
To better understand the perception of group cohesion and improve the labeling of the dataset, the EmotiW 2019 Challenge conducted a survey via a Google form with 102 participants (59 male and 43 female) whose ages range from 22 to 54. The survey contained 24 images of groups of people in different contexts and offered 4 different Group Cohesion Score (GCS) values. The participants selected one of the GCS values for each image and described the reasons behind their choice using provided keywords related to the AGC score.
With the assistance of the survey results, we employed 5 annotators (3 females and 2 males) to label each image for its cohesiveness in the range [0, 3].
Experiment setting
The deep networks (DenseNet, EfficientNet, and VGG Face) are implemented in PyTorch and run on an NVIDIA GeForce GTX 1080 GPU. The original images are resized to 224x224 to fit the CNN inputs, and the provided labels are normalized from [0, 3] to [0, 1]. After reviewing the training dataset, we notice that it is severely imbalanced. The distribution of the training dataset is as follows: 1,141 images belong to level 0, 1,561 to level 1, 4,601 to level 2, and 1,997 to level 3. To balance the data, 30% of the images from the category of level 2 are down-sampled.
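The label normalization and class balancing described above can be sketched as follows. The text does not say exactly how the level-2 images are down-sampled, so this example assumes that a random 30% of them are dropped; the function names, the `(image_id, level)` sample format, and the seed are hypothetical.

```python
import random


def normalize_label(level, max_level=3):
    """Map a cohesion level in [0, max_level] to [0, 1]."""
    return level / max_level


def downsample_level(samples, level=2, drop_fraction=0.3, seed=0):
    """Randomly drop a fraction of the samples that carry the given level.

    samples: list of (image_id, level) pairs.
    Assumption: "down-sampling 30%" means removing 30% of that class.
    """
    rng = random.Random(seed)
    target = [s for s in samples if s[1] == level]
    others = [s for s in samples if s[1] != level]
    n_keep = len(target) - round(len(target) * drop_fraction)
    return others + rng.sample(target, n_keep)


# Rebuild the class distribution reported in the text with dummy image ids.
counts = {0: 1141, 1: 1561, 2: 4601, 3: 1997}
samples = [(f"img_{lvl}_{i}", lvl) for lvl, n in counts.items() for i in range(n)]

balanced = downsample_level(samples)
balanced_counts = {lvl: sum(1 for _, l in balanced if l == lvl) for lvl in counts}
```

Under this assumption, level 2 remains the largest class after balancing, but its share of the training data is reduced.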
Table 1: Performance on the validation set.
Table 2: Submission Results
We conduct experiments on both the original training set and the balanced training set; Table 1 shows the validation results. As shown in Table 1, our fusion model significantly decreases the MSE. Due to the bias in the training data, data augmentation is important in this challenge, and we achieve the lowest MSE of 0.662 on the validation set using our proposed approach with balanced training data. For the test phase, we use the fusion model that achieves the best result on validation. Table 2 summarizes our 5 submission results. Table 3 presents the submission MSE for each individual cohesion level. To make use of all available data, we combine the training data and validation data to train our model. However, performance decreases, as submissions 2 and 5 show. A possible reason is that the combined data, without modification, are severely biased, which causes the model to over-fit. Finally, in submission 4, our model achieves the best MSE of 0.444 on the combined data with data augmentation.
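A per-level breakdown of the MSE, as reported per cohesion level, can be computed with a short helper. This is a minimal sketch, assuming predictions and ground-truth labels are on the same scale (the original [0, 3] scale here); the function name `per_level_mse` and the toy values are hypothetical.

```python
from collections import defaultdict


def per_level_mse(preds, labels):
    """Group squared errors by ground-truth level and average within each group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pred, label in zip(preds, labels):
        sums[label] += (pred - label) ** 2
        counts[label] += 1
    return {level: sums[level] / counts[level] for level in sorted(counts)}


# Toy example: two images at level 3, one of them under-predicted by 1.
preds = [0.0, 1.0, 3.0, 2.0]
labels = [0, 1, 3, 3]
breakdown = per_level_mse(preds, labels)
```

Reporting the error per level makes it visible when a model fits the dominant class well but performs poorly on under-represented levels.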
In summary, group cohesiveness is a major component in analyzing group behavior, group performance, and group emotion. A large number of images, taken at social gatherings and social activities, are shared on online photo services such as Flickr and Facebook.
In addition, measuring and annotating group cohesion at different levels is usually time-consuming and inefficient for a human annotator. In this paper, we construct a robust ensemble hybrid regression model to automatically and effectively detect group cohesiveness. The model is separately trained on faces, skeletons, and scenes. The regression values are fused for the final cohesive intensity. Our experiments deliver a mean squared error of 0.662 and 0.444 on the validation and testing sets, respectively. This MSE outperforms the baseline MSE of 0.5. The result demonstrates that the proposed hybrid model is effective and makes promising improvements.
[1] Adeleke Banwo, Jian-guo Du, and Uchechi Onokala. 2015. The Impact of Group Cohesiveness on Organizational Performance: The Nigerian Case. International Journal of Business and Management 10 (05 2015). https://doi.org/10.5539/ijbm.v10n6p146
[2] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. 2016. Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050 (2016).
[3] Abhinav Dhall, Roland Goecke, Shreya Ghosh, and Tom Gedeon. 2019. EmotiW 2019: Automatic Emotion, Engagement and Cohesion Prediction Tasks (ACM International Conference on Multimodal Interaction 2019). ACM.
[4] Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. 2017. From Individual to Group-level Emotion Recognition: EmotiW 5.0. In Proceedings of the 19th ACM International . ACM, New York, NY, USA, 524–528. https://doi.org/10.1145/3136755.3143004
[5] Lata Dyaram and T. J. Kamalanabhan. 2005. Unearthed: The Other Side of Group Cohesiveness. Journal of Social Sciences 10, 3 (2005), 185–190. https://doi.org/10.1080/09718923.2005.11892479
[6] Nancy J. Evans and Paul A. Jarvis. 1980. Group Cohesion: A Review and Reevaluation. Small Group Behavior 11, 4 (1980), 359–370. https://doi.org/10.1177/104649648001100401
[7] A. C. Gallagher and T. Chen. 2009. Understanding images of groups of people. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 256–263. https://doi.org/10.1109/CVPR.2009.5206828
[8] Shreya Ghosh, Abhinav Dhall, Nicu Sebe, and Tom Gedeon. 2019. Predicting Group Cohesiveness in Images. In International Joint Conference on Neural Networks (IJCNN).
[9] I.J. Goodfellow et al. 2013. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing. Springer, 117–124.
[10] Xin Guo, Luisa F. Polanía, and Kenneth E. Barner. 2017. Group-level emotion recognition using deep models on image scene, faces, and skeletons. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 603–608.
[11] Xin Guo, Luisa F. Polanía, and Kenneth E. Barner. 2018. Smile Detection in the Wild Based on Transfer Learning. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018). 679–686. https://doi.org/10.1109/FG.2018.00107
[12] Xin Guo, Bin Zhu, Luisa F. Polanía, Charles Boncelet, and Kenneth E. Barner. 2018. Group-Level Emotion Recognition Using Hybrid Deep Models Based on Faces, Scenes, Skeletons and Visual Attentions. In Proceedings of the 20th ACM International Conference on Multimodal . ACM, New York, NY, USA, 635–639. https://doi.org/10.1145/3242969.3264990
[13] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 2016. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. Lecture Notes in Computer Science (2016), 87–102. https://doi.org/10.1007/978-3-319-46487-9_6
[14] Michael A. Hogg. 1993. Group Cohesiveness: A Critical Review and Some New Directions. European Review of Social Psychology 4, 1 (1993), 85–111. https://doi.org/10.1080/14792779343000031
[15] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993 (2016). arXiv:1608.06993 http://arxiv.org/abs/1608.06993
[16] Ronak Kosti, Jose M Alvarez, Adria Recasens, and Agata Lapedriza. 2017. Emotion recognition in context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Andrew Miller and W Keith Edwards. 2007. Give and take: A study of consumer photo-sharing culture and practice. Conference on Human Factors in Computing Systems - Proceedings, 347–356. https://doi.org/10.1145/1240624.1240682
[18] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition. In BMVC.
[19] Eduardo Salas, Rebecca Grossman, Ashley Hughes, and Chris Coultas. 2015. Measuring Team Cohesion. Human factors 57 (05 2015), 365–74. https://doi.org/10.1177/0018720815578267
[20] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, Long Beach, California, USA, 6105–6114. http://proceedings.mlr.press/v97/tan19a.html
[21] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. 2017. Hand Keypoint Detection in Single Images using Multiview Bootstrapping. In CVPR.
[22] V Vapnik and A Lerner. 1963. Pattern recognition using generalized portrait method. Automation and Remote Control 24 (01 1963).
[23] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. 2016. Convolutional pose machines. In CVPR.
[24] Yuanjun Xiong, Kai Zhu, Dahua Lin, and X. Tang. 2015. Recognize complex events from static images by fusing deep channels. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1600–1609. https://doi.org/10.1109/CVPR.2015.7298768
[25] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct 2016), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342