We study the problem of textual phrase grounding for images in the context of computer vision and natural language processing. Recent works such as image-text query [10] are designed to retrieve the set of images that are most similar to a query in terms of both visual content and textual description. Going beyond instance-level querying, language (textual) grounding emphasizes the capability of localizing a word or a phrase in a given image. Most prior works in this domain have benefited significantly from performance improvements in visual recognition, especially in object detection, object classification and semantic segmentation. More specifically, previous works such as [14, 2] divide the grounding problem into two steps: proposal generation and attribute matching. During the testing phase, the proposal/region with the highest matching probability with the textual description is selected as the target region. Although using bounding boxes localizes objects much more precisely, this strategy depends heavily on object annotations and incurs heavy computational redundancy in region feature extraction. Moreover, the supervised learning paradigm is quite domain specific, and it fails to achieve satisfactory results when there is not enough labeled data for a new task to train a reliable grounding model [13]. Thus, to avoid the need for bounding box annotations in such scenarios, we need a more general method.
Figure 1. Illustrative framework of textual phrase grounding: we propose to localize phrases in images by matching visual features with phrases. For example, given the phrase “man that is talking”, we can force the attentive map to focus on the region that contains the target object by classifying the highlighted feature region as “person” and “talking”.
We put forward a weakly supervised textual grounding mechanism based on an attention generating method [3]. Our hypothesis is that deep convolutional features remain spatially consistent with the input image [6], and that representations from a well-trained deep model can be scale [11] and viewpoint invariant for the same objects [5]. Similar to object co-segmentation [7], which typically refers to the task of jointly segmenting “something similar” in a given set of images, we encourage the attentive map to seek out the regions that contain similar features across a given set of images (see Fig. 1). At the same time, similar to the work in [12, 15], a feature learning loss is applied to the architecture to obtain deep features with more inter-class dispersion and intra-class compactness.
Figure 2. The overall architecture of our textual grounding pipeline consists of two parts: a backbone module for global visual feature extraction, and a bilinear pooling based attentive module for attentive map generation. After word embedding, the expanded textual vector and the visual features are sent to the bilinear pooling layer, from whose output the attentive map is generated. Our networks are trained end-to-end with a joint supervision signal: a cross-entropy loss for classification and a pair regularizing loss for feature supervision. In addition, an $\ell_2$ regularization loss is applied to the attentive map for sparsity.
We evaluate our approach on a synthesized multi-MNIST dataset, with testing samples such as those visualized in Fig. 3. In this extended abstract, we demonstrate that our proposed pipeline is applicable to supervised/unsupervised settings and provides more interpretability and flexibility when transferring to domains that lack fine-grained annotations. The results suggest a promising avenue for conducting a more comprehensive set of quantitative experiments.
We formulate the weakly supervised grounding task as selecting the best region $r$ in an image $I$ given a set of objects. More specifically, the task selects the best-matched feature region from the feature maps. Given a pretrained backbone model $F$, we first extract global visual feature maps $f = F(I)$ and an attentive map/region $r = A(f, t)$, where $t$ is the top-down textual signal and $A$ refers to a bilinear pooling based attentive module. A classifier $C$ predicts the class label $\hat{y}$ based on the global visual features after the attentive map is applied to them: $\hat{y} = C(f \odot r)$ (see Fig. 2). Besides, for different images $I_i$ and $I_j$ that contain a common object $O$, their region features are enforced to be similar, $f_i \odot r_i \approx f_j \odot r_j$, and this is achieved by the feature learning loss.
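As a rough illustration of how an attentive map can gate the visual features before classification, the following minimal sketch assumes that applying the map means element-wise multiplication at every spatial location, followed by global average pooling. Both choices, and the helper names `apply_attention` and `global_avg_pool`, are assumptions made for illustration and are not necessarily the exact operations in our networks.

```python
def apply_attention(features, att_map):
    """Scale every channel of `features` (C x H x W nested lists) by the
    single-channel attentive map `att_map` (H x W), element-wise."""
    return [
        [[v * a for v, a in zip(f_row, a_row)]
         for f_row, a_row in zip(channel, att_map)]
        for channel in features
    ]


def global_avg_pool(features):
    """Average each channel over all spatial positions (vector of length C)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in features]


# Toy feature map with 2 channels of size 2x2, and an attentive map
# that keeps only the top-left location.
f = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
r = [[1.0, 0.0],
     [0.0, 0.0]]

attended = apply_attention(f, r)      # features outside the region become 0
pooled = global_avg_pool(attended)    # vector that a classifier C would consume
```

With an all-zero attentive map the pooled vector is zero as well, so the classifier receives no evidence; this is why the classification loss alone already pushes the map toward regions that contain the queried object.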
2.1. Visual Feature Extraction
To extract global and discriminative features from the image, we adopt a deep pretrained classification model as the backbone, e.g., a Residual Network pretrained on ImageNet, which consists of about 15 million high-resolution labeled images in roughly 22,000 categories. The representations encapsulated in such models have been shown to work remarkably well for a large set of diverse image classification tasks and often outperform other classification approaches. In our two experiments, we use a ResNet-18 well trained on MNIST as the backbone feature extractor for the multi-number grounding task. To maintain the resolution of the attentive map, we remove all the max pooling layers in the original architecture and take the output of the last convolutional layer as the visual features.
2.2. Concatenation of Multi-Modal Feature
Since our system aims to use the given top-down signals (either in the form of a word embedding or a one-hot vector) to guide the attentive module, we need an appropriate feature fusion/concatenation method. Traditional methods such as the element-wise product, the sum, or the concatenation of the visual and textual representations are not as expressive as an outer product of the visual and textual vectors. Therefore, we turn to bilinear pooling, which fuses multi-modal features by an outer product and has been shown to be much more expressive than the above-mentioned methods in a series of tasks [8]. However, directly applying the outer product is typically infeasible due to the curse of dimensionality. Multimodal Compact Bilinear pooling (MCB) [4] works nicely for our schema, since it maintains the outer product feature in a lower dimension. Concretely, the Count Sketch projection function $\Psi$ [1] is applied to the outer product of the visual feature $f$ and the expanded top-down signal $t$ for dimensionality reduction: $\Psi(f \otimes t) = \Psi(f) * \Psi(t)$, where $*$ denotes convolution. Converted to the frequency domain, the fused outer product can be written as $\Phi = \mathrm{FFT}^{-1}(\mathrm{FFT}(\Psi(f)) \odot \mathrm{FFT}(\Psi(t)))$, where $\odot$ denotes the element-wise product. Based on $\Phi$, which consists of multiplicative interactions between all elements of both vectors, the attentive map $r$ is further computed by several convolutional layers.
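The identity behind compact bilinear pooling can be checked numerically. The sketch below is a toy, standard-library version written for illustration only: the hash and sign tables, the output dimension `d = 8`, and all function names are assumptions, and a naive DFT stands in for an FFT. It computes the fused feature in the frequency domain and compares it with a Count Sketch taken directly on the full outer product.

```python
import cmath
import random


def count_sketch(v, h, s, d):
    """Count Sketch projection: out[h[i]] += s[i] * v[i], with s[i] in {-1, +1}."""
    out = [0.0] * d
    for i, x in enumerate(v):
        out[h[i]] += s[i] * x
    return out


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT on toy sizes)."""
    d = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * n / d) for k in range(d))
           for n in range(d)]
    return [v / d for v in out] if inverse else out


def mcb(f, t, hf, sf, ht, st, d):
    """Fuse f and t: element-wise product of the transformed sketches, then inverse."""
    spec_f = dft(count_sketch(f, hf, sf, d))
    spec_t = dft(count_sketch(t, ht, st, d))
    fused = dft([a * b for a, b in zip(spec_f, spec_t)], inverse=True)
    return [z.real for z in fused]


def sketch_of_outer_product(f, t, hf, sf, ht, st, d):
    """Reference: Count Sketch of the explicit outer product, using the combined
    hash (hf[i] + ht[j]) mod d and the combined sign sf[i] * st[j]."""
    out = [0.0] * d
    for i, fi in enumerate(f):
        for j, tj in enumerate(t):
            out[(hf[i] + ht[j]) % d] += sf[i] * st[j] * fi * tj
    return out


rng = random.Random(0)
d = 8
f = [rng.uniform(-1, 1) for _ in range(5)]   # toy visual feature
t = [rng.uniform(-1, 1) for _ in range(4)]   # toy top-down signal
hf = [rng.randrange(d) for _ in f]
sf = [rng.choice((-1, 1)) for _ in f]
ht = [rng.randrange(d) for _ in t]
st = [rng.choice((-1, 1)) for _ in t]

fused = mcb(f, t, hf, sf, ht, st, d)
reference = sketch_of_outer_product(f, t, hf, sf, ht, st, d)
max_gap = max(abs(a - b) for a, b in zip(fused, reference))
```

The two results agree up to floating-point error, while the frequency-domain route never materializes the outer product of the two vectors.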
2.3. Weakly Supervised Training
To shift the attentive map to the targeted region, our networks are trained end-to-end with a joint supervision signal: a cross-entropy loss $\mathcal{L}_{ce}$ for classification and a pair regularizing loss $\mathcal{L}_p$ for feature supervision. In addition to the joint signal, an $\ell_2$ regularization $R$ is applied to the attentive map to ensure sparsity. Thus, we formulate the overall loss $\mathcal{L}$ as:
$$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_p + \beta R,$$
where $\mathcal{L}_{ce}$ refers to the cross-entropy loss, and $\mathcal{L}_p$ is computed over pairs of attentively highlighted features $\hat{f}_i$, with $y_i$ the class labels that are most relevant/matched to the input phrases in the images $I_i$. The regularization term is
$$R = \frac{1}{n} \sum_{x, y} r(x, y)^2,$$
where $n$ denotes the size of the attentive map $r$, and $x$, $y$ denote the coordinates of each pixel. With an ablation study, we show that the regularization term effectively reduces and eliminates invalid attention noise, making the attentive map more compact and sparse. In practice, our networks supervised by the joint loss are trainable and can be optimized by standard stochastic gradient descent (SGD). The scalars $\lambda$ and $\beta$ are two hyper-parameters controlling the balance among the three loss functions, and the margin $m$ in $\mathcal{L}_p$ is the minimum margin distance among the intra-class features. The conventional cross-entropy loss can be considered a special case of this joint supervision, obtained when $\lambda$ and $\beta$ are set to 0.
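To make the structure of the joint supervision concrete, here is a minimal sketch for a single pair of samples. Several parts are assumptions and should not be read as our exact formulation: the pair term is written in a generic contrastive form with a margin, the regularizer is taken as the mean squared value of the attentive map, and the weight names `lam` and `beta` as well as all function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def pair_loss(feat_a, feat_b, same_class, margin):
    """ASSUMED contrastive form: pull same-class features together and push
    different-class features at least `margin` apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same_class:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def l2_regularizer(att_map):
    """ASSUMED form: mean squared activation over the n cells of the attentive map."""
    n = sum(len(row) for row in att_map)
    return sum(v * v for row in att_map for v in row) / n


def joint_loss(logits, label, feat_a, feat_b, same_class, att_map,
               lam=0.1, beta=0.1, margin=1.0):
    """Classification term plus weighted pair term plus weighted sparsity term."""
    return (cross_entropy(logits, label)
            + lam * pair_loss(feat_a, feat_b, same_class, margin)
            + beta * l2_regularizer(att_map))


logits = [2.0, 0.5, -1.0]
att = [[1.0, 0.0], [0.0, 0.0]]
ce_only = joint_loss(logits, 0, [1.0, 0.0], [0.0, 0.0], True, att, lam=0.0, beta=0.0)
total = joint_loss(logits, 0, [1.0, 0.0], [0.0, 0.0], True, att)
```

Setting both weights to zero recovers the plain cross-entropy loss, which matches the special case noted in the text.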
In this section, we present a preliminary experiment on a synthesized multi-MNIST dataset to validate the feasibility of our attentive mechanism. The multi-MNIST dataset is introduced to better simulate a real textual grounding scenario. Instead of grounding natural expressions/phrases, we feed in one-hot class labels as the top-down signal, and our goal is to localize all the regions in the image that contain the target numbers without using any bounding box annotations.
Data Preparing: We generate a multi-MNIST dataset in which each image contains a random number (1 to 4) of handwritten digits from 0 to 9 (see Fig. 3 for an example). To make the synthetic data more challenging, we randomly increase the overlapping ratio among the numbers and add Gaussian white noise at the same time. In total, 500k images are generated together with their corresponding labels and annotations.
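One way such samples could be synthesized is sketched below. It is only an illustration under stated assumptions: the 64x64 canvas, the 28x28 patch size, the noise level, the max-compositing of overlapping digits, and the function name `make_multi_digit_sample` are all hypothetical, and constant-valued squares stand in for real MNIST digits so that the example stays self-contained. Overlap here arises simply from unconstrained random placement.

```python
import random


def make_multi_digit_sample(digit_patches, rng, canvas=64, patch=28,
                            max_digits=4, noise_std=0.1):
    """Paste 1..max_digits digit patches at random positions on a blank canvas.

    digit_patches maps a digit (0-9) to a patch x patch nested list in [0, 1].
    Returns the noisy image, the digit labels, and boxes (x0, y0, x1, y1).
    """
    img = [[0.0] * canvas for _ in range(canvas)]
    labels, boxes = [], []
    for _ in range(rng.randint(1, max_digits)):
        digit = rng.randrange(10)
        top = rng.randrange(canvas - patch + 1)
        left = rng.randrange(canvas - patch + 1)
        for y in range(patch):
            for x in range(patch):
                # Max-compositing keeps overlapping digits visible.
                img[top + y][left + x] = max(img[top + y][left + x],
                                             digit_patches[digit][y][x])
        labels.append(digit)
        boxes.append((left, top, left + patch, top + patch))
    # Additive Gaussian white noise, clipped back to [0, 1].
    img = [[min(1.0, max(0.0, v + rng.gauss(0.0, noise_std))) for v in row]
           for row in img]
    return img, labels, boxes


# Placeholder patches: constant squares instead of real handwritten digits.
rng = random.Random(0)
patches = {d: [[(d + 1) / 10.0] * 28 for _ in range(28)] for d in range(10)}
image, labels, boxes = make_multi_digit_sample(patches, rng)
```

The boxes are kept only for evaluation; the weakly supervised training described here uses the digit labels alone.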
Network Structure: ResNet-18 serves as our backbone architecture, and we take the output of the final convolutional layer of the 4th stage as the visual features $f$. We set the stride of all convolutional layers to 1 so that the feature resolution stays the same as that of the input image. As for the attentive module, since we adopt one-hot encodings in this setting, we replace the GRU/LSTM unit with a densely connected layer of 128 units. We then expand the textual input to the same size as the visual features, and pass the two tensors through MCB. After MCB pooling, two convolutional layers are applied to generate a one-channel attentive map.
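The expansion of the top-down signal can be pictured as follows. This is a minimal sketch with assumptions: the dense layer uses random placeholder weights rather than trained ones, the 4x4 spatial size is chosen only to keep the example small, and the helper names `dense` and `tile_spatially` are hypothetical.

```python
import random


def dense(x, weights, bias):
    """Fully connected layer without activation: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def tile_spatially(vec, height, width):
    """Repeat a vector at every spatial location, giving a C x H x W tensor
    whose channel c is constant and equal to vec[c]."""
    return [[[v] * width for _ in range(height)] for v in vec]


rng = random.Random(0)
num_classes, units = 10, 128

# Placeholder parameters of the 128-unit dense layer.
W = [[rng.gauss(0.0, 0.1) for _ in range(num_classes)] for _ in range(units)]
b = [0.0] * units

one_hot = [0.0] * num_classes
one_hot[8] = 1.0                         # query: the digit '8'

t = dense(one_hot, W, b)                 # 128-dimensional textual vector
t_expanded = tile_spatially(t, 4, 4)     # same spatial size as the visual features
```

After tiling, the textual tensor and the visual feature map share the same spatial grid, so the fusion can be applied location by location.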
Training Process: We pretrain our backbone architecture on the original MNIST dataset for classification, reaching an accuracy of 99.40%. After that, in the second stage of end-to-end training, we set the initial learning rate of the whole model to 0.1 and keep the learning rate of the pretrained backbone at 1/10 of that value. We also test joint training with the backbone architecture trained from scratch and, as expected, the former training schema converges much faster. We believe this is because a well pretrained model provides comparatively precise local features, which enables the feature matching to converge more quickly. During training, we set the two loss-weighting hyper-parameters $\lambda$ and $\beta$ to 0.1 each.
Result: We show the effectiveness of our approach quantitatively in Tables 1 and 2. We adopt the standard evaluation metric for both the detection and the localization tasks (i.e., the mean average precision): if the IoU with the ground truth box is larger than a certain threshold, the localization is considered correct. Table 1 shows that, among different settings of the regularization weight $\beta$, our model achieves up to 67.04% IoU for the localization task. It is worth noting that the attentive maps are evaluated directly without any further post-processing steps. We further compare the average IoU achieved under the two training schemas: training from scratch vs. fine-tuning. Though we observe that using a well-trained backbone model as the feature extractor accelerates the training phase to a great extent, the final performances achieved are comparable, indicating that end-to-end training from scratch is also applicable to our approach. In Table 2, we further report the mean average precision under three IoU thresholds. In practice, we consider a prediction correct if its IoU with the ground truth is higher than 0.5, in which case our model achieves 71.53% mAP.
Figure 3. Example of localization in multi-MNIST (best viewed in color). From left to right, we present the input images, the grounded regions (colored in red), and the attentive maps. The input textual signals here are ‘8’, ‘5’, and ‘0’, respectively. In the right part of the gallery, we show the grounding result with multiple input signals (‘6’ and ‘4’ in this sample) in the upper row, the attentive map when the input signal does not exist in the image (‘2’ in this sample) in the middle row, and a failure case in the lower left.
Table 1. The intersection over union (IoU) values of the different training schemas on the multi-MNIST dataset with our proposed architecture. The fine-tuned backbone model refers to an end-to-end architecture with a pretrained single handwritten number recognition model. Performances using various scales for the regularization are also reported.
Table 2. The mean average precision (mAP) achieved on the testing datasets. Here 2×2 denotes that each testing image consists of 4 numbers (Fig. 3), and 3×3 denotes the samples with 9 numbers. The mAP is obtained over the different thresholds applied.
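The correctness criterion can be illustrated with a small sketch. How a box is read off an attentive map is an assumption here: we take the tightest box around cells whose activation exceeds a cutoff of 0.5, and the function names `box_from_map` and `iou` are hypothetical. Only the IoU computation itself and the 0.5 decision threshold follow the description in the text.

```python
def box_from_map(att_map, cutoff=0.5):
    """Tightest box (x0, y0, x1, y1), with exclusive x1/y1, around all cells
    whose activation exceeds `cutoff`; returns None for a blank map."""
    ys = [y for y, row in enumerate(att_map) if any(v > cutoff for v in row)]
    xs = [x for row in att_map for x, v in enumerate(row) if v > cutoff]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


att = [[0.0, 0.0, 0.0, 0.0],
       [0.0, 0.9, 0.8, 0.0],
       [0.0, 0.7, 0.9, 0.0],
       [0.0, 0.0, 0.0, 0.0]]

pred = box_from_map(att)     # box around the four active cells
gt = (1, 1, 4, 3)            # hypothetical ground-truth box
score = iou(pred, gt)
correct = score > 0.5        # counted as a correct localization
```

A blank attentive map yields no box at all, which corresponds to the case where the queried digit is absent from the image.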
To further illustrate our model’s outputs, we show some of the grounding results in Fig. 3. Our approach is able to capture the correct object described by the input query (the left part of Fig. 3). When the input query does not exist in the image, an almost blank attentive map is generated, as expected (the second row of the right column in Fig. 3). We also show a failure case (the lower-left one).
In this extended abstract, we present a novel approach for weakly supervised textual phrase grounding based on regularized bilinear pooling and feature learning. We report a preliminary validation of our approach on a synthesized multi-label MNIST dataset. The results of our preliminary experiments suggest that the weakly supervised training mechanism is a promising avenue toward extensible and feasible textual phrase grounding. We are currently extending the architecture to conduct further model validation on a large natural image benchmark such as MSCOCO [9].
[1] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer, 2002.
[2] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. arXiv preprint arXiv:1708.01676, 2017.
[3] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[4] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, 2016.
[5] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, pages 392–407. Springer, 2014.
[6] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 1133–1141. IEEE, 2017.
[7] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, pages 1943–1950. IEEE, 2010.
[8] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In CVPR, pages 7025–7034. IEEE, 2017.
[9] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[10] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communication and Image Representation, 10(1):39–62, 1999.
[11] J. Wang, Z. Fang, N. Lang, H. Yuan, M.-Y. Su, and P. Baldi. A multi-resolution approach for spinal metastasis detection using deep siamese neural networks. Computers in Biology and Medicine, 84:137–146, 2017.
[12] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[13] R. A. Yeh, M. N. Do, and A. G. Schwing. Unsupervised textual grounding: Linking words to image concepts. arXiv preprint arXiv:1803.11185, 2018.
[14] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. MAttNet: Modular attention network for referring expression comprehension. arXiv preprint arXiv:1801.08186, 2018.
[15] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. arXiv preprint arXiv:1611.08976, 2016.