Hybrid Graph Neural Networks for Crowd Counting

2020·Arxiv

Abstract

Crowd counting is an important yet challenging task due to large scale and density variations. Recent investigations have shown that distilling rich relations among multi-scale features and exploiting useful information from the auxiliary task, i.e., localization, are vital for this task. Nevertheless, how to comprehensively leverage these relations within a unified network architecture is still a challenging problem. In this paper, we present a novel network structure called Hybrid Graph Neural Network (HyGnn), which aims to relieve the problem by interweaving the multi-scale features for crowd density and its auxiliary task (localization) and performing joint reasoning over a graph. Specifically, HyGnn integrates a hybrid graph to jointly represent the task-specific feature maps of different scales as nodes, and two types of relations as edges: (i) multi-scale relations for capturing the feature dependencies across scales and (ii) mutual beneficial relations building bridges for the cooperation between counting and localization. Thus, through message passing, HyGnn can distill rich relations between the nodes to obtain more powerful representations, leading to robust and accurate results. Our HyGnn performs well on four challenging datasets: ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF, outperforming the state-of-the-art approaches by a large margin.

Introduction

Crowd counting, with the purpose of analyzing large crowds quickly, is a crucial yet challenging computer vision and AI task. It has drawn a lot of attention due to its potential applications in public security and planning, traffic control, crowd management, public space design, etc.

Figure 1: Illustration of the proposed HyGnn model. (a) Input image, in which crowds have heavy overlaps and occlusions. (b) Backbone, which is a truncated VGG-16 model. (c) Domain-specific branches: one for crowd counting and the other for localization. (d) HyGnn, which represents the features from different scales and domains as nodes, and the relations between them as edges. After several message passing iterations, multiple types of useful relations are built. (e) Crowd density map (for counting) and localization map (as the auxiliary task).

As in many other computer vision tasks, the performance of crowd counting has been substantially improved by Convolutional Neural Networks (CNNs). Recently, the state-of-the-art crowd counting methods (Liu, Weng, and Mu 2019; Liu, Salzmann, and Fua 2019; Wan et al. 2019; Jiang et al. 2019) mostly follow the density-based paradigm. Given an image or video frame, CNN-based regressors are trained to estimate the crowd density map, whose values are summed to give the entire crowd count.
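The density-based paradigm can be illustrated with a small sketch. It assumes a common construction that the text does not spell out: each annotated head is replaced by a Gaussian normalized to unit mass, so that summing the map recovers the count. The function name `density_map`, the fixed `sigma`, and the toy annotations are hypothetical.

```python
import numpy as np

def density_map(points, height, width, sigma=4.0):
    """Place one unit-mass Gaussian per annotated head (illustrative only)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dmap = np.zeros((height, width), dtype=np.float64)
    for (py, px) in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()  # normalize so each person contributes exactly 1
    return dmap

# Hypothetical (row, col) head annotations for a 64x64 image.
heads = [(10, 12), (30, 40), (31, 42)]
dmap = density_map(heads, 64, 64)
count = dmap.sum()  # summing the density map gives the crowd count
```

A network trained in this paradigm regresses `dmap` from the image, and the predicted count is the sum of its output.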

Recent studies (Shen et al. 2018; Cao et al. 2018; Li et al. 2017; 2018) have shown that multi-scale information, or relations across multiple scales, helps to capture contextual knowledge, which benefits crowd counting. Moreover, crowd counting and its auxiliary task (localization), despite analyzing the crowd scene from different perspectives, can provide beneficial clues for each other (Liu, Weng, and Mu 2019; Lian et al. 2019). The crowd density map can offer guidance information and self-adaptive perception for precise crowd localization; on the other hand, crowd localization can help to alleviate the local inconsistency issue in the density map. This mutual cooperation, which we call the mutual beneficial relation, is the key factor in estimating a high-quality density map. However, most methods only consider the crowd counting problem from one aspect and ignore the other. Consequently, they fail to fully utilize multiple types of useful relations or structural dependencies in the learning and inference processes, resulting in sub-optimal results.

One primary reason is the lack of a unified and effective framework capable of modeling the different types of relations (i.e., multi-scale relations and mutual beneficial relations) over a single model. To address this issue, we introduce a novel Hybrid Graph Neural Network (HyGnn), which formulates the crowd counting and localization as a graph-based, joint reasoning procedure. As shown in Fig. 1, we build a hybrid graph which consists of two types of nodes, i.e., counting nodes storing density-related features and localization nodes storing location-related features. Besides, there are two different pairwise relationships (edge types) between them. By interweaving the multi-scale and multi-task features together and progressively propagating information over the hybrid graph, HyGnn can fully leverage the different types of useful information, and is capable of distilling the valuable, high-order relations among them for much more comprehensive crowd analysis.

HyGnn is easy to implement and end-to-end learnable. Importantly, it has two major benefits in comparison to existing models for crowd counting (Liu, Weng, and Mu 2019; Liu, Salzmann, and Fua 2019; Wan et al. 2019; Jiang et al. 2019). (i) HyGnn interweaves crowd counting and localization with joint, multi-scale and graph-based processing rather than the simple combination used in most existing solutions. Thus, HyGnn significantly strengthens the information flow between tasks and across scales, thereby enabling the augmented representation to incorporate more useful priors learned from the auxiliary task and different scales. (ii) HyGnn explicitly models and reasons over all relations (multi-scale relations and mutual beneficial relations) simultaneously within a hybrid graph, while most existing methods are not capable of dealing with such complicated relations. Therefore, our HyGnn can effectively capture their dependencies to overcome inherent ambiguities in crowd scenes. Consequently, our predicted crowd density map is potentially more accurate and more consistent with the true crowd localization.

In our experiments, we show that HyGnn performs remarkably well on four widely used benchmarks and surpasses prior methods by a large margin. Our contributions are summarized in three aspects:

• We present a novel end-to-end learnable model, namely Hybrid Graph Neural Network (HyGnn), for joint crowd counting and localization. To the best of our knowledge, HyGnn is the first deep model capable of explicitly modeling and mining high-level relations between counting and its auxiliary task (localization) across different scales through a hybrid graph model.

• HyGnn is equipped with a unique multi-tasking property, where different types of nodes, connections (or edges), and message passing functions are parameterized by different neural designs. With such property, HyGnn can more precisely leverage cooperative information between crowd counting and localization to boost the counting performance.

• We conduct extensive experiments on four well-known benchmarks including ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF, on which we set new records.

Related Works

Crowd Counting and Localization. Early works (Viola, Jones, and Snow 2005) in crowd counting use detection-based methods, and employ handcrafted features like Haar (Viola, Jones, and others 2001) and HOG (Dalal and Triggs 2005) to train the detector. The overall performance of these algorithms is rather limited due to various occlusions. The regression-based methods, which avoid solving the hard detection problem, have become mainstream and achieved great performance breakthroughs. Traditionally, the regression models (Chen et al. 2013; Lempitsky and Zisserman 2010; Pham et al. 2015) learn the mapping between low-level image features and object count or density, using Gaussian process or random forest regressors. Recently, various CNN-based counting methods have been proposed (Zhang et al. 2015; 2016; Liu, Weng, and Mu 2019; Liu, Salzmann, and Fua 2019; Wan et al. 2019; Jiang et al. 2019) to better deal with different challenges, by predicting a density map whose values are summed to give the count. In particular, the scale variation issue has attracted the most attention from recent CNN-based methods (Dai et al. 2019; Varior et al. 2019). On the other hand, as observed in some recent research (Idrees et al. 2018a; Liu, Weng, and Mu 2019; Lian et al. 2019), although the current state-of-the-art methods can report an accurate crowd count, they may produce a density map that is inconsistent with the true density. One major reason is the lack of crowd localization information. Some recent studies (Zhao et al. 2019; Liu, Weng, and Mu 2019) have tried to exploit the useful information from localization in a unified framework. They, however, simply share the underlying representations or interweave two modules for different tasks to obtain more robust representations. In contrast, our HyGnn considers a better way to utilize the mutual guidance information: explicitly modeling and iteratively distilling the mutual beneficial relations across scales within a hybrid graph. For a more comprehensive survey, we refer interested readers to (Kang, Ma, and Chan 2018).

Graph Neural Networks. The essential idea of the Graph Neural Network (GNN) is to enhance the node representations by propagating information between nodes. Scarselli et al. (Scarselli et al. 2008) first introduced the concept of GNN, which extended recursive neural networks to process graph-structured data. Li et al. (Li et al. 2016) proposed to improve the representation capacity of GNN by using Gated Recurrent Units (GRUs). Gilmer et al. (Gilmer et al. 2017) used a message passing neural network to generalize the GNN. Recently, GNN has been successfully applied in attribute recognition (Meng et al. 2018), human-object interactions (Qi et al. 2018a), action recognition (Si et al. 2018), etc. Our HyGnn shares with the above methods the idea of fully exploiting the underlying relationships between multiple latent representations through a GNN. However, most existing GNN-based models are designed to deal with only one relation type, which may limit the power of GNN. To overcome this limitation, our HyGnn is equipped with a multi-tasking property, i.e., it parameterizes different types of connections (or edges) and the message passing functions with different neural designs, which significantly distinguishes HyGnn from existing GNNs.

Figure 2: Overview of our HyGnn model. Our model is built on the truncated VGG-16, and includes a Domain-specific Feature Learning Module to extract features from different domains. A novel HyGnn is used to distill multi-scale and cross-domain information, so as to learn better representations. Finally, the multi-scale features are fused to produce the density map for counting as well as to generate the auxiliary task prediction (localization map).

Methodology

Preliminaries

Problem Formulation. Let the crowd counting model be represented by the function $M$, which takes an image $I$ as input and generates the corresponding crowd density map $D$ (for counting) as well as the auxiliary task prediction, i.e., the localization map $L$. Let $D^{GT}$ and $L^{GT}$ be the groundtruth of the density map and the localization map, respectively. Our goal is to learn powerful domain-specific representations, denoted as $f_d$ and $f_l$, to minimize the errors between the estimated $D$ and the groundtruth $D^{GT}$, as well as between $L$ and $L^{GT}$. Notably, the two tasks share a common meta-objective, and $D^{GT}$ and $L^{GT}$ are obtained from the same point annotations without additional target labels.

Notations. To achieve this goal, we need to distill the underlying dependencies between multi-task and multi-scale features. Given the multi-scale density feature maps $\{f^d_{s_i}\}_{i=1}^{N}$ and the multi-scale localization feature maps $\{f^l_{s_i}\}_{i=1}^{N}$, we represent them with a directed graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. The nodes in our HyGnn are further grouped into two types: $V = V_1 \cup V_2$, where $V_1 = \{v^1_i\}_{i=1}^{N}$ is the set of counting (density) nodes and $V_2 = \{v^2_i\}_{i=1}^{N}$ denotes the set of localization nodes. In our model, we have the same number of nodes in the two latent domains, and therefore $|V_1| = |V_2| = N$. Accordingly, there are two types of edges between them: (i) cross-scale edges $e^m_{i,j} = (v^m_i, v^m_j)$ stand for the multi-scale relations between nodes from the $i$th scale to the $j$th scale within the same domain $m$, where $i, j \in \{1, \dots, N\}$, $i \neq j$ and $m \in \{1, 2\}$; (ii) cross-domain edges $\breve{e}^{m,n}_i = (v^m_i, v^n_i)$ reflect mutual beneficial relations between nodes from domain $m$ to domain $n$ at the same scale $i$, where $m, n \in \{1, 2\}$ and $m \neq n$. For each node $v^m_i$ ($i \in \{1, \dots, N\}$ and $m \in \{1, 2\}$), we learn its updated representation, namely $h^m_i$, by aggregating the representations of its neighbors. Finally, the updated multi-scale features $\{h^1_i\}_{i=1}^{N}$ and $\{h^2_i\}_{i=1}^{N}$ are fused to produce the final representations $f_d$ and $f_l$, which are used to generate the outputs $D$ and $L$. Here, we only consider multi-scale relations between nodes in the same domain, and mutual beneficial (cross-domain) relations between nodes with the same scale in our graph model. Considering that our graph model is designed to simultaneously deal with two different node and relation types, we term it Hybrid Graph Neural Network (HyGnn), which is detailed in the following section.

Hybrid Graph Neural Network (HyGnn)

Overview. The key idea of our HyGnn is to perform $K$ message propagation iterations over $G$ to jointly distill and reason over all relations between crowd counting and the auxiliary task (localization) across scales. Generally, as shown in Fig. 2, HyGnn maps the given image $I$ to the final predictions $D$ and $L$ through three phases. First, in the domain-specific feature extracting phase, HyGnn generates the multi-scale density features and localization features for $I$ through a Domain-specific Feature Learning Module (DFL), and represents these features with a graph $G = (V, E)$. Second, a parametric message passing phase runs $K$ times to propagate messages between nodes and to update the node representations according to the received messages within the graph $G$. Third, a readout phase fuses the updated multi-scale features of the two domains to generate the final representations (i.e., $f_d$ and $f_l$), and maps them to the outputs $D$ and $L$. Note that, as crowd counting is our main task, we emphasize the accuracy of $D$ during the learning process.

Figure 3: The architecture of the learnable adapter. The adapter takes the node representation of one (source) domain $h^{m(k)}_i$ as input and outputs the adaptive convolution parameters. The adaptive representation $\hat{h}^{m(k)}_i$ is generated conditioned on $h^{n(k)}_i$.

Domain-specific Feature Learning Module (DFL). DFL is one of the major modules of our model. It extracts the multi-scale, domain-specific features $f^d_{s_i}$ and $f^l_{s_i}$ from the input $I$. DFL is composed of three parts: one front-end and two domain-specific back-ends.

The front-end $F_r(\cdot)$ is based on the well-known VGG-16, and maps the RGB image $I$ to the shared underlying representations: $f_{share} = F_r(I)$. More specifically, the first 10 layers of VGG-16 are deployed as the front-end, which is shared by the two tasks. Meanwhile, two series of convolution layers with different dilation rates are appended as the back-ends, denoted as $B_d(\cdot)$ and $B_l(\cdot)$. With their large receptive fields, the stacked convolutions are tailored for learning domain-specific features: $f_d = B_d(f_{share})$ and $f_l = B_l(f_{share})$. In addition, the Pyramid Pooling Module (PPM) (Zhao et al. 2017) is applied in each domain-specific back-end to extract multi-scale features, followed by an interpolation layer $R(\cdot)$ that ensures the multi-scale feature maps have the same size $H \times W$.

Node Embedding. In our HyGnn, each node $v^1_i$ or $v^2_i$, where $i$ takes a unique value from $\{1, \dots, N\}$, is associated with an initial node embedding (or node state), namely $h^{1(0)}_i$ or $h^{2(0)}_i$. We use the domain-specific feature maps produced by DFL as the initial node representations. Taking an arbitrary counting node $v^1_i$ as an example, its initial representation $h^{1(0)}_i$ can be calculated by:

$$h^{1(0)}_i = f^d_{s_i} = R(P(f_d, s_i)), \quad (1)$$

where $h^{1(0)}_i$ is a 3D tensor feature (batch size is omitted). $R(\cdot)$ and $P(\cdot)$ denote the interpolation operation and the pyramid pooling operation, respectively. The initial representation for the localization node $v^2_i$ is defined similarly as follows:

$$h^{2(0)}_i = f^l_{s_i} = R(P(f_l, s_i)), \quad (2)$$

where $h^{2(0)}_i$ denotes the initial representation for the localization node $v^2_i$.
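The initialization above (pyramid pooling followed by interpolation back to a common size) can be sketched in NumPy. This is a minimal illustration under stated assumptions: the pooling grid sizes `(1, 2, 4)` are hypothetical, nearest-neighbour resizing stands in for the interpolation layer R(), and the helper names are not from the paper.

```python
import numpy as np

def pyramid_pool(feat, bins):
    """P(): average-pool a (C, H, W) map into a (C, bins, bins) grid."""
    c, h, w = feat.shape
    ys = np.linspace(0, h, bins + 1).astype(int)  # bin edges along height
    xs = np.linspace(0, w, bins + 1).astype(int)  # bin edges along width
    out = np.empty((c, bins, bins))
    for a in range(bins):
        for b in range(bins):
            out[:, a, b] = feat[:, ys[a]:ys[a + 1], xs[b]:xs[b + 1]].mean(axis=(1, 2))
    return out

def resize_nearest(feat, height, width):
    """R(): nearest-neighbour resize of a (C, h, w) map to (C, height, width)."""
    _, h, w = feat.shape
    rows = (np.arange(height) * h) // height
    cols = (np.arange(width) * w) // width
    return feat[:, rows][:, :, cols]

def initial_node_states(feat, scales, height, width):
    """One node state per scale, all brought back to the same spatial size."""
    return [resize_nearest(pyramid_pool(feat, s), height, width) for s in scales]

rng = np.random.default_rng(0)
f_d = rng.standard_normal((4, 16, 16))  # stand-in for the density back-end output
nodes = initial_node_states(f_d, scales=(1, 2, 4), height=16, width=16)
```

Each entry of `nodes` plays the role of one initial counting-node state; the localization nodes would be built the same way from the localization back-end output.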

Figure 4: Detailed illustration of the cross-domain edge embedding and message aggregation. Please see text for details.

Cross-scale Edge Embedding. A cross-scale edge $e^m_{i,j}$ connects two nodes $v^m_i$ and $v^m_j$ which are from the same domain $m$ but different scales $i \neq j$. The cross-scale edge embedding, denoted as $e^{m(k)}_{i,j}$, is used to distill the multi-scale relation from $v^m_i$ to $v^m_j$ as the edge representation. To this end, we employ a relation function $f_{rel}(\cdot)$ to capture the relations by:

$$e^{m(k)}_{i,j} = f_{rel}(h^{m(k)}_i, h^{m(k)}_j) = \mathrm{Conv}(g(h^{m(k)}_i, h^{m(k)}_j)), \quad (3)$$

where $g(\cdot)$ is a function that combines the features $h^{m(k)}_i$ and $h^{m(k)}_j$. Following (Wang et al. 2019b), we model $g(h_i, h_j) = h_i - h_j$, making the relations based on the difference between node embeddings to alleviate the symmetric impact in feature combination. $\mathrm{Conv}(\cdot)$ denotes the convolution operation that is used to learn the edge embedding in a data-driven way. Each element in $e^{m(k)}_{i,j}$ reflects the pixel-level relations between the nodes of different scales from $i$ to $j$. As a result, $e^{m(k)}_{i,j}$ can be considered as the features that depict the multi-scale relationships between nodes.
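A minimal sketch of the cross-scale edge embedding described above: the two node embeddings are combined by subtraction and passed through a convolution. The 1x1 kernel without bias is an assumption made to keep the example short; the paper does not state the kernel size, and the function names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: weight (C_out, C_in) applied at every pixel of x (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def cross_scale_edge(h_i, h_j, weight):
    """Edge embedding from node i to node j: Conv(h_i - h_j)."""
    return conv1x1(h_i - h_j, weight)

rng = np.random.default_rng(0)
h_i, h_j = rng.standard_normal((2, 8, 5, 5))  # two node states of shape (C, H, W)
w = rng.standard_normal((8, 8))               # hypothetical learned kernel
e_ij = cross_scale_edge(h_i, h_j, w)
e_ji = cross_scale_edge(h_j, h_i, w)
```

Because the combination is a difference rather than a sum or product, the edge from i to j is not the same as the edge from j to i. In this bias-free linear sketch the two are exact negatives of each other.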

Cross-domain Edge Embedding. Since our HyGnn is designed to fully exploit the complementary knowledge contained in the nodes of different domains ($m, n \in \{1, 2\}$ and $m \neq n$), one major challenge is to overcome the "domain gap" between them. Rather than directly combining features as in the cross-scale edge embedding, we first adapt the node representation of one (source) domain $h^{m(k)}_i$ conditioned on the node representation of the other (target) domain $h^{n(k)}_i$ to overcome the domain difference. Here, inspired by (Bertinetto et al. 2016), we integrate a learnable adapter into our HyGnn to transform the original node representation $h^{m(k)}_i$ into the adaptive representation $\hat{h}^{m(k)}_i$. The adapter is a convolution operation with dynamic convolutional kernels, and $E(\cdot)$ is a one-shot learner that predicts these dynamic parameters from a single exemplar. Following (Nie et al. 2018), as shown in Fig. 3, we implement it by a small CNN with learnable parameters.

After obtaining the adaptive representation $\hat{h}^{m(k)}_i$, we compute the cross-domain edge embedding $\breve{e}^{m,n(k)}_i$ for the edge $\breve{e}^{m,n}_i$, which is a 3D tensor that contains the hidden representation of the cross-domain relation. The detailed architecture can be found in Fig. 4.
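The adapter idea (a convolution whose kernels are predicted by a one-shot learner) can be sketched as follows. This is one possible reading, not the paper's architecture: here the kernels are predicted from the target-domain features by global average pooling followed by a single linear layer, and they are applied to the source-domain features as a 1x1 convolution. The names `one_shot_learner`, `adapt`, and the weight `w_e` are hypothetical.

```python
import numpy as np

def one_shot_learner(h_cond, w_e):
    """Hypothetical E(): pool the conditioning map, then a linear layer
    that emits a (C, C) dynamic 1x1 kernel."""
    c = h_cond.shape[0]
    pooled = h_cond.mean(axis=(1, 2))      # (C,) global average pooling
    return (w_e @ pooled).reshape(c, c)    # w_e has shape (C*C, C)

def adapt(h_src, h_tgt, w_e):
    """Apply kernels predicted from h_tgt to h_src (dynamic convolution)."""
    theta = one_shot_learner(h_tgt, w_e)
    return np.einsum("oc,chw->ohw", theta, h_src)

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
h_src = rng.standard_normal((C, H, W))     # source-domain node state
h_tgt_a = rng.standard_normal((C, H, W))   # two different target-domain states
h_tgt_b = rng.standard_normal((C, H, W))
w_e = rng.standard_normal((C * C, C))      # learnable parameters of the learner
adapted_a = adapt(h_src, h_tgt_a, w_e)
adapted_b = adapt(h_src, h_tgt_b, w_e)
```

The point of the design is that the same source features are transformed differently depending on the conditioning features, which is what distinguishes a dynamic kernel from a fixed learned one.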

Cross-scale Message Aggregation. In our HyGnn, we employ different aggregation schemes for each node to aggregate feature messages from its neighbors. The message $m^{m(k)}_{i,j}$ passed from node $v^m_i$ to $v^m_j$ within the same domain across different scales is produced by the cross-scale message passing function (aggregator) $M(\cdot)$, in which $\mathrm{Sig}(\cdot)$ maps the edge's embedding into the link weight. Note that since our HyGnn is devised to handle a pixel-level task, the link weight between nodes takes the form of a 2D map. Thus, $m^{m(k)}_{i,j}$ assigns the pixel-wise weighted features from node $v^m_i$ to $v^m_j$ to aggregate information.
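The pixel-wise weighting described above can be sketched as a sigmoid gate applied to the sending node's features. How the edge embedding is reduced to a single 2D link-weight map is not specified in the text; the sketch assumes a learned per-channel projection `w_link` (hypothetical) before the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_scale_message(h_i, e_ij, w_link):
    """Message from node i to node j: a 2D link-weight map times h_i."""
    link = sigmoid(np.einsum("c,chw->hw", w_link, e_ij))  # (H, W), values in (0, 1)
    return link[None] * h_i, link                          # broadcast over channels

rng = np.random.default_rng(0)
h_i = rng.standard_normal((8, 5, 5))    # sending node state (C, H, W)
e_ij = rng.standard_normal((8, 5, 5))   # edge embedding from i to j
w_link = rng.standard_normal(8)         # hypothetical channel projection
msg, link = cross_scale_message(h_i, e_ij, w_link)
```

Since every link weight lies strictly between 0 and 1, the message is an attenuated copy of the sender's features, with the attenuation varying from pixel to pixel.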

Cross-domain Message Aggregation. As the cross-domain discrepancy is significant in the high-dimensional feature space and distribution, directly passing the learned representations of one node to its neighboring nodes for aggregation is a sub-optimal solution. Therefore, we formulate the message passing from node $v^m_i$ to $v^n_i$ as an adaptive representation learning process conditioned on $h^n_i$. Here, we follow the same idea as in the cross-domain edge embedding process, i.e., we use a one-shot adapter to predict the message that should be passed. We denote by $\breve{M}(\cdot)$ the message passing function between nodes from two different domains. Its adapter is conditioned on the node embedding of the target domain, and $\breve{E}(\cdot)$ is a small CNN with learnable parameters that serves as a one-shot learner to predict the dynamic parameters. The produced dynamic convolutional kernels include the guidance information that should be propagated from node $v^m_i$ to $v^n_i$.

Two-stage Node State Update. In the $k$th step, our HyGnn first aggregates the information from the cross-domain nodes within the same scale $i$ using the cross-domain message aggregation (Eq. 7). Therefore, $v^n_i$ ($i \in \{1, \dots, N\}$ and $n \in \{1, 2\}$) gets an intermediate state $\tilde{h}^{n(k)}_i$ by taking into account its received cross-domain message $\breve{m}^{m,n(k)}_i$ and its prior state $h^{n(k-1)}_i$. Here, following (Qi et al. 2018b), we apply a Gated Recurrent Unit (GRU) (Ballas et al. 2015) as the update function.

Then, HyGnn performs message passing across scales within the same domain $n$ using the cross-scale edge embedding (Eq. 3), and aggregates messages using the cross-scale message aggregation (Eq. 6). After that, $v^n_i$ gets the new state $h^{n(k)}_i$ after the $k$th iteration by considering the cross-scale message $m^{n(k)}_{j,i}$ and its intermediate state $\tilde{h}^{n(k)}_i$, again with a GRU update.
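The two-stage update can be sketched with a per-pixel GRU whose gates are 1x1 convolutions. Several details here are assumptions: the gate convention (new state = (1 - z) * old + z * candidate), the use of separate parameters for the two stages, and all names. The messages are random placeholders rather than outputs of the aggregation functions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(weight, x):
    return np.einsum("oc,chw->ohw", weight, x)

def gru_update(h_prev, msg, p):
    """Convolutional GRU step: the message is the input, h_prev the hidden state."""
    z = sigmoid(conv1x1(p["wz"], msg) + conv1x1(p["uz"], h_prev))   # update gate
    r = sigmoid(conv1x1(p["wr"], msg) + conv1x1(p["ur"], h_prev))   # reset gate
    cand = np.tanh(conv1x1(p["wh"], msg) + conv1x1(p["uh"], r * h_prev))
    return (1.0 - z) * h_prev + z * cand

def two_stage_update(h_prev, cross_domain_msg, cross_scale_msg, p_domain, p_scale):
    h_mid = gru_update(h_prev, cross_domain_msg, p_domain)  # stage 1: cross-domain
    return gru_update(h_mid, cross_scale_msg, p_scale)      # stage 2: cross-scale

def init_params(channels, rng):
    keys = ("wz", "uz", "wr", "ur", "wh", "uh")
    return {k: 0.1 * rng.standard_normal((channels, channels)) for k in keys}

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
h0 = rng.standard_normal((C, H, W))
m_cd = rng.standard_normal((C, H, W))   # placeholder cross-domain message
m_cs = rng.standard_normal((C, H, W))   # placeholder cross-scale message
h1 = two_stage_update(h0, m_cd, m_cs, init_params(C, rng), init_params(C, rng))
```

Running `two_stage_update` K times, recomputing the messages from the current states before each call, would correspond to the K message passing iterations.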

Readout Function. After $K$ message passing iterations, the updated multi-scale features of the two domains, $\{h^{1(K)}_i\}_{i=1}^{N}$ and $\{h^{2(K)}_i\}_{i=1}^{N}$, are merged to form their final representations $f_d$ and $f_l$. The merge function of each domain is implemented by concatenation. Then, $f_d$ and $f_l$ are fed into a convolution layer to get the final per-pixel predictions.
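The readout for one domain can be sketched as channel-wise concatenation of the node states followed by a convolution to a single-channel map. The 1x1 kernel and the names `readout` and `w_out` are assumptions for illustration.

```python
import numpy as np

def readout(node_states, w_out):
    """Concatenate N node states of shape (C, H, W), then map to one channel."""
    fused = np.concatenate(node_states, axis=0)          # (N*C, H, W)
    pred = np.einsum("c,chw->hw", w_out, fused)          # 1x1 conv to (H, W)
    return fused, pred

rng = np.random.default_rng(0)
states = [rng.standard_normal((4, 8, 8)) for _ in range(3)]  # N = 3 scales
w_out = rng.standard_normal(12)                               # hypothetical kernel
f_d, density = readout(states, w_out)
```

The same function applied to the localization nodes with its own kernel would give the localization map.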

Loss. Our HyGnn is implemented to be fully differentiable and end-to-end trainable. The loss for each task can be computed after the readout functions, and the error can be propagated back according to the chain rule. Here, we simply employ the Mean Square Error (MSE) loss to optimize the network parameters for the two tasks:

$$\mathcal{L} = \mathcal{L}_1 + \lambda \mathcal{L}_2,$$

where $\mathcal{L}_1$ and $\mathcal{L}_2$ are MSE losses, and $\lambda$ is the combination weight. As our main task is crowd counting, we set $\lambda = 0.001$ to emphasize the accuracy of the counting results.
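A minimal sketch of the combined objective, assuming the weight 0.001 multiplies the localization term (the reading consistent with emphasizing counting). The function names and toy maps are hypothetical.

```python
import numpy as np

def mse(pred, target):
    """Pixel-wise mean squared error between two maps."""
    return float(np.mean((pred - target) ** 2))

def hygnn_loss(d_pred, d_gt, l_pred, l_gt, lam=0.001):
    """Counting loss plus a down-weighted localization loss."""
    return mse(d_pred, d_gt) + lam * mse(l_pred, l_gt)

# Toy maps: density error of 1 per pixel, localization error of 2 per pixel.
d_gt, l_gt = np.zeros((4, 4)), np.zeros((4, 4))
d_pred, l_pred = np.ones((4, 4)), 2.0 * np.ones((4, 4))
loss = hygnn_loss(d_pred, d_gt, l_pred, l_gt)
```

With this weighting, even a localization error four times larger than the density error contributes well under one percent of the total loss.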

Experiments

In this section, we empirically validate our HyGnn on four public counting benchmarks (i.e., ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF). First, we conduct an ablation experiment to demonstrate the effectiveness of our hybrid graph model and the multi-task learning. Then, our proposed HyGnn is evaluated on all of these public benchmarks, and its performance is compared with the state-of-the-art approaches.

Datasets. We use ShanghaiTech (Zhang et al. 2016), UCF_CC_50 (Idrees et al. 2013) and UCF-QNRF (Idrees et al. 2018b) for benchmarking our HyGnn. ShanghaiTech provides 1,198 annotated images containing more than 330K people with head center annotations. It includes two subsets: ShanghaiTech Part A and ShanghaiTech Part B. UCF_CC_50 provides 50 images with 63,974 head annotations in total. The small dataset volume and large count variance make it a very challenging dataset. UCF-QNRF is the largest dataset to date, which contains 1,535 images that are divided into training and testing sets of 1,201 and 334 images, respectively. All of these benchmarks have been widely used for performance evaluation by state-of-the-art approaches.

Table 1: Comparison with other state-of-the-art crowd counting methods on four benchmark crowd counting datasets using the MAE and MSE metrics.

Table 2: Analysis of the proposed method. Our results are obtained on ShanghaiTech Part A.

Implementation Details and Evaluation Protocol. To make a fair comparison with existing works, we use a truncated VGG as the backbone network. Specifically, the first 10 convolutional layers from VGG-16 are used as the front-end and shared by the two tasks. Following (Li, Zhang, and Chen 2018), our counting and localization back-ends are composed of 8 dilated convolutions with kernel size $3 \times 3$.

We use the Adam optimizer with an initial learning rate 10. We set the momentum to 0.9, the weight decay to 10 and the batch size to 8. For data augmentation, the training images and corresponding groundtruths are randomly flipped and cropped from different locations with a size of $400 \times 400$. In the testing phase, we simply feed the whole image into the model to predict the counting and localization results.
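The augmentation step can be sketched as a joint random crop and flip, where the same transform is applied to the image and to its groundtruth map so that they stay aligned. The horizontal flip direction, the 0.5 flip probability, and the function name are assumptions; the sketch also assumes inputs of at least 400 pixels per side.

```python
import numpy as np

def random_flip_crop(image, target, crop=400, rng=None):
    """Crop the same window from image (H, W, 3) and target (H, W), then
    flip both horizontally with probability 0.5."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = target.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    image = image[top:top + crop, left:left + crop]
    target = target[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        image, target = image[:, ::-1], target[:, ::-1]
    return image, target

rng = np.random.default_rng(0)
img = rng.random((480, 640, 3))
gt = img[..., 0].copy()  # toy target tied to the image so alignment can be checked
img_aug, gt_aug = random_flip_crop(img, gt, rng=rng)
```

For a density map target, the count of the cropped sample is simply the sum of the cropped map, so no relabeling is needed after augmentation.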

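We adopt Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate the performance. The definitions are as follows:

$$\mathrm{MAE} = \frac{1}{T} \sum_{i=1}^{T} \left| C_i - C^{GT}_i \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left| C_i - C^{GT}_i \right|^2},$$

where $T$ is the number of testing images, and $C_i$ and $C^{GT}_i$ are the estimated count and the ground truth of the $i$th testing image, respectively.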
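The two metrics can be computed as below. The sketch assumes the convention common in crowd counting, where the reported "MSE" is the square root of the mean squared count error; this matches the reported MSE values being on the same scale as MAE. The function names and toy counts are hypothetical.

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute count error over the test images."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

def mse(pred_counts, gt_counts):
    """Root of the mean squared count error (the crowd counting 'MSE')."""
    sq = sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts))
    return math.sqrt(sq / len(gt_counts))

# Toy example with two test images: count errors of 3 and 4.
pred = [10.0, 20.0]
gt = [13.0, 16.0]
mae_value = mae(pred, gt)
mse_value = mse(pred, gt)
```

MSE penalizes large per-image errors more heavily than MAE, which is why it is usually read as an indicator of robustness.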

Ablation Study. Extensive ablation experiments are performed on ShanghaiTech Part A to verify the impact of each component of our HyGnn. Results are summarized in Tab. 2.

Effectiveness of HyGnn. To show the importance of our HyGnn, we offer a baseline model without HyGnn, which gives the results of our backbone model, the truncated VGG with dilated back-ends. As shown in Tab. 2, our HyGnn significantly outperforms the baseline by 8.0 in MAE (68.2 → 60.2) and 20.5 in MSE (115.0 → 94.5). This is because our HyGnn can simultaneously model the multi-scale and cross-domain relationships, which are important for achieving accurate crowd counting results.

Multi-task GNN vs. Single-task GNN. To evaluate the advantage of multi-task cooperation, we provide a single-task model which only formulates the cross-scale relationship. According to our experiments, HyGnn outperforms the single-task graph neural network by 2.3 in MAE (62.5 → 60.2) and 8.9 in MSE (103.4 → 94.5). This is because our HyGnn is able to distill mutual benefits between density and localization, while the single-task graph neural network ignores this important information.

Effectiveness of the Cross-domain Edge Embedding. Our HyGnn carefully deals with the cross-domain information through a learnable adapter. To evaluate its effectiveness, we provide a multi-task GNN without the learnable adapter. Instead, we directly fuse features from different domains through the aggregation operation. As shown in Tab. 2, our cross-domain edge embedding method achieves better performance in both MAE (60.2 vs. 62.4) and MSE (94.5 vs. 101.8), which indicates that our design of the cross-domain edge embedding is helpful for better leveraging the information from the other domain.

Node Numbers N in HyGnn. In our model, we have $N$ nodes in each domain, i.e., $N$ counting nodes and $N$ localization nodes. To investigate the impact of the number of nodes, we report the performance of our HyGnn with different $N$. We find that with more scales in the model (from 2 to 3), the performance improves significantly (i.e., from about 62 to 60.2 in MAE and from about 100 to 94.5 in MSE). However, when further considering more scales (from 3 to 5), it only achieves slight performance improvements, reaching 60.2 in MAE and 94.1 in MSE. This may be due to the redundant information within the additional features. Considering the tradeoff between efficiency and performance, we set $N = 3$ in the following experiments.

Figure 5: Density and localization maps generated by our HyGnn. We also show the counting map estimated by CSRNet for comparison. Clearly, our HyGnn produces more accurate results.

Message Passing Iterations K. To evaluate the impact of the number of message passing iterations $K$, we report the performance of our model with different numbers of iterations. Each message passing iteration in HyGnn includes two cascaded steps: i) the cross-scale message passing and ii) the cross-domain message passing. We find that with more iterations (from 1 to 3), the performance of our model improves to some extent. When further considering more iterations (from 3 to 5), it brings only a slight improvement. Therefore, we set $K = 3$, at which our HyGnn can converge to an optimal result.

GNN vs. Other Multi-feature Aggregation Methods. Here, we conduct an ablation to evaluate the superiority of GNN. To exclude other confounding factors, we use a single-task GNN to fully distill the underlying relationships between multi-scale features, and compare our method with two well-known multi-scale feature aggregation methods (PSP (Zhao et al. 2017) and Bidirectional Fusion (Yang et al. 2018)). As can be seen, our GNN-based method outperforms the other methods by a large margin.

Comparison with State-of-the-art. We compare our HyGnn with the state-of-the-art for the performance of counting.

Quantitative Results. As can be seen in Tab. 1, our HyGnn consistently achieves better results than other methods on four widely used benchmarks. Specifically, our method outperforms the previous best result by 3.0 in MAE and 4.4 in MSE on ShanghaiTech Part A. Although previous methods have made remarkable progress on ShanghaiTech Part B, our HyGnn also achieves the best performance. Compared with existing top approaches, our HyGnn achieves gains of 0.1 in MAE and 1.2 in MSE over ADCrowdNet (Liu et al. 2019), and 0.1 in MAE and 0.3 in MSE over SFCN (Wang et al. 2019a). On the most challenging UCF_CC_50, our HyGnn achieves a considerable performance gain by decreasing the MAE from the previous best 214.2 to 184.4 and the MSE from 318.2 to 270.1. On the UCF-QNRF dataset, our HyGnn also outperforms other methods by a large margin. As shown in Tab. 1, our HyGnn achieves a significant improvement of 10.8% in MAE over the existing best result produced by TEDNet (Jiang et al. 2019). Compared with other top-ranked methods, our HyGnn produces more accurate results. This is because HyGnn is able to leverage free-of-cost localization information and jointly reason over all relations among them.

Qualitative Results. Fig. 5 provides some visual comparisons of the predicted density maps and counts with CSRNet (Li, Zhang, and Chen 2018). In addition, we also show the localization results. We observe that our HyGnn achieves much more accurate count estimations and preserves more consistency with the real crowd distributions. This is because our HyGnn can distill significant beneficial information from the auxiliary task through a graph.

Conclusions

In this paper, we propose a novel method for crowd counting with a hybrid graph model. To the best of our knowledge, it is the first deep neural network model that can distill both multi-scale and mutual beneficial relations within a unified graph for crowd counting. The whole HyGnn is end-to-end differentiable, and is able to handle different relations effectively. Meanwhile, the domain gap between different tasks is also carefully considered in our HyGnn. According to our experiments, HyGnn achieves significant improvements compared to recent state-of-the-art methods on four benchmarks. We believe that our HyGnn can also incorporate other knowledge, e.g., foreground information, for further improvements.

Acknowledgement. This work was supported in part by the National Key R&D Program of China (No.2017YFB1302300) and the NSFC (No.U1613223).

References

Ballas, N.; Yao, L.; Pal, C.; and Courville, A. 2015. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432.

Bertinetto, L.; Henriques, J. F.; Valmadre, J.; Torr, P.; and Vedaldi, A. 2016. Learning feed-forward one-shot learners. In NeurIPS.

Cao, X.; Wang, Z.; Zhao, Y.; and Su, F. 2018. Scale aggregation network for accurate and efficient crowd counting. In ECCV.

Chen, K.; Gong, S.; Xiang, T.; and Change Loy, C. 2013. Cumulative attribute space for age and crowd density estimation. In CVPR.

Dai, F.; Liu, H.; Ma, Y.; Cao, J.; Zhao, Q.; and Zhang, Y. 2019. Dense scale network for crowd counting. CoRR abs/1906.09707.

Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In CVPR.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. CoRR abs/1704.01212.

Idrees, H.; Saleemi, I.; Seibert, C.; and Shah, M. 2013. Multisource multi-scale counting in extremely dense crowd images. In CVPR.

Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; and Shah, M. 2018a. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV.

Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; and Shah, M. 2018b. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV.

Jiang, X.; Xiao, Z.; Zhang, B.; Zhen, X.; Cao, X.; Doermann, D.; and Shao, L. 2019. Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR.

Kang, D.; Ma, Z.; and Chan, A. B. 2018. Beyond counting: Comparisons of density maps for crowd analysis tasks - counting, detection, and tracking. TCSVT 29(5):1408–1422.

Lempitsky, V., and Zisserman, A. 2010. Learning to count objects in images. In NeurIPS.

Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. In ICLR.

Li, X.; Yang, F.; Cheng, H.; Chen, J.; Guo, Y.; and Chen, L. 2017. Multi-scale cascade network for salient object detection. In ACM MM.

Li, X.; Yang, F.; Cheng, H.; Liu, W.; and Shen, D. 2018. Contour knowledge transfer for salient object detection. In ECCV.

Li, Y.; Zhang, X.; and Chen, D. 2018. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR.

Lian, D.; Li, J.; Zheng, J.; Luo, W.; and Gao, S. 2019. Density map regression guided detection network for rgb-d crowd counting and localization. In CVPR.

Liu, N.; Long, Y.; Zou, C.; Niu, Q.; Pan, L.; and Wu, H. 2019. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. In CVPR.

Liu, W.; Salzmann, M.; and Fua, P. 2019. Context-aware crowd counting. In CVPR.

Liu, X.; van de Weijer, J.; and Bagdanov, A. D. 2018. Leveraging unlabeled data for crowd counting by learning to rank. In CVPR.

Liu, C.; Weng, X.; and Mu, Y. 2019. Recurrent attentive zooming for joint crowd counting and precise localization. In CVPR.

Meng, Z.; Adluru, N.; Kim, H. J.; Fung, G.; and Singh, V. 2018. Efficient relative attribute learning using graph neural networks. In ECCV.

Nie, X.; Feng, J.; Zuo, Y.; and Yan, S. 2018. Human pose estimation with parsing induced learner. In CVPR.

Pham, V.-Q.; Kozakaya, T.; Yamaguchi, O.; and Okada, R. 2015. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV.

Qi, S.; Wang, W.; Jia, B.; Shen, J.; and Zhu, S.-C. 2018a. Learning human-object interactions by graph parsing neural networks. In ECCV.

Qi, S.; Wang, W.; Jia, B.; Shen, J.; and Zhu, S.-C. 2018b. Learning human-object interactions by graph parsing neural networks. In ECCV.

Sam, D. B.; Surya, S.; and Babu, R. V. 2017. Switching convolutional neural network for crowd counting. In CVPR.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. TNNLS 20(1):61–80.

Shen, Z.; Xu, Y.; Ni, B.; Wang, M.; Hu, J.; and Yang, X. 2018. Crowd counting via adversarial cross-scale consistency pursuit. In CVPR.

Shi, Z.; Zhang, L.; Liu, Y.; Cao, X.; Ye, Y.; Cheng, M.-M.; and Zheng, G. 2018. Crowd counting with deep negative correlation learning. In CVPR.

Shi, M.; Yang, Z.; Xu, C.; and Chen, Q. 2019. Revisiting perspective information for efficient crowd counting. In CVPR.

Si, C.; Jing, Y.; Wang, W.; Wang, L.; and Tan, T. 2018. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV.

Sindagi, V. A., and Patel, V. M. 2017. Generating high-quality crowd density maps using contextual pyramid cnns. In CVPR.

Varior, R. R.; Shuai, B.; Tighe, J.; and Modolo, D. 2019. Scale-aware attention network for crowd counting. CoRR abs/1901.06026.

Viola, P.; Jones, M.; et al. 2001. Rapid object detection using a boosted cascade of simple features.

Viola, P.; Jones, M. J.; and Snow, D. 2005. Detecting pedestrians using patterns of motion and appearance. IJCV 63(2):153–161.

Wan, J.; Luo, W.; Wu, B.; Chan, A. B.; and Liu, W. 2019. Residual regression with semantic prior for crowd counting. In CVPR.

Wang, Q.; Gao, J.; Lin, W.; and Yuan, Y. 2019a. Learning from synthetic data for crowd counting in the wild. In CVPR.

Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2019b. Dynamic graph cnn for learning on point clouds. TOG.

Yang, F.; Li, X.; Cheng, H.; Guo, Y.; Chen, L.; and Li, J. 2018. Multi-scale bidirectional fcn for object skeleton extraction. In AAAI.

Zhang, C.; Li, H.; Wang, X.; and Yang, X. 2015. Cross-scene crowd counting via deep convolutional neural networks. In CVPR.

Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; and Ma, Y. 2016. Single-image crowd counting via multi-column convolutional neural network. In CVPR.

Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.

Zhao, M.; Zhang, J.; Zhang, C.; and Zhang, W. 2019. Leveraging heterogeneous auxiliary tasks to assist crowd counting. In CVPR.