MM ’16, October 15-19, 2016, Amsterdam, Netherlands
Automatically describing an image with sentence-level captions has received much attention in recent years [11, 10, 13, 17, 16, 23, 34, 39]. It is a challenging task that integrates visual and language understanding. It requires not only recognizing the visual objects in an image and the semantic interactions between them, but also capturing visual-language interactions and learning how to “translate” the visual understanding into sensible sentence descriptions. The most important part of this visual-language modeling is to capture the semantic correlations across image and sentence by learning a multimodal joint model. While some previous models [20, 15, 26, 17, 16] have been proposed to address the problem of image captioning, they either rely on sentence templates or treat captioning as a retrieval task, ranking the best matching sentence in a database as the caption. Those approaches usually have difficulty generating variable-length and novel sentences. Recent work [11, 10, 13, 23, 34, 39] has indicated that embedding vision and language into a common semantic space with a relatively shallow recurrent neural network (RNN) can yield promising results.
In this work, we propose novel architectures for the problem of image captioning. Unlike previous models, we learn a visual-language space in which sentence embeddings are encoded using a bidirectional Long-Short Term Memory (Bi-LSTM) and visual embeddings are encoded with a CNN. The Bi-LSTM is able to summarize long-range visual-language interactions from the forward and backward directions. Inspired by the architectural depth of the human brain, we also explore deep bidirectional LSTM architectures to learn higher-level visual-language embeddings. All proposed models can be trained end-to-end by optimizing a joint loss.
Why bidirectional LSTMs? In unidirectional sentence generation, one general way of predicting the next word $w_t$ with visual context $I$ and history textual context $w_{1:t-1}$ is to maximize $\log P(w_t \mid I, w_{1:t-1})$. While a unidirectional model includes past context, it cannot retain the future context $w_{t+1:T}$, which can be used for reasoning about the previous word $w_t$ by maximizing $\log P(w_t \mid I, w_{t+1:T})$. A bidirectional model tries to overcome the shortcomings that each unidirectional (forward or backward) model suffers from on its own, and exploits both past and future dependences to make a prediction. As shown in Figure 1, two example images with bidirectionally generated sentences intuitively support our assumption that bidirectional captions are complementary and that combining them can generate more sensible captions.
Why deeper LSTMs? The recent success of deep CNNs in image classification and object detection [14, 33] demonstrates that deep, hierarchical models can be more efficient at learning representations than shallower ones. This motivated our work to explore deeper LSTM architectures in the context of learning bidirectional visual-language embeddings. As claimed in [29], if we consider an LSTM as a composition of multiple hidden layers unfolded in time, an LSTM is already a deep network. But this is a way of increasing “horizontal depth”, in which the network weights W are reused at each time step; it is more limited in learning representative features than increasing the “vertical depth” of the network. To design a deep LSTM, one straightforward way is to stack multiple LSTM layers as hidden-to-hidden transitions. Alternatively, instead of stacking multiple LSTM layers, we propose to add a multilayer perceptron (MLP) as an intermediate transition between LSTM layers. This not only increases the depth of the LSTM network, but also prevents the parameter size from growing dramatically.
Figure 1: Illustration of generated captions. Two example images from the Flickr8K dataset and their best matching captions generated in forward order (blue) and backward order (red). Bidirectional models capture different levels of visual-language interactions (for more evidence see Sec.4.4). The final caption is the sentence with the higher probability (histogram under each sentence). In both examples, the backward caption is selected as the final caption for the corresponding image.
The core contributions of this work are threefold:
• We propose an end-to-end trainable multimodal bidirectional LSTM (see Sec.3.2) and its deeper variant models (see Sec.3.3) that embed image and sentence into a high level semantic space by exploiting both long term history and future context.
• We visualize the evolution of the hidden states of bidirectional LSTM units to qualitatively analyze and understand how the model generates a sentence conditioned on visual context information over time (see Sec.4.4).
• We demonstrate the effectiveness of the proposed models on three benchmark datasets: Flickr8K, Flickr30K and MSCOCO. Our experimental results show that bidirectional LSTM models achieve performance highly competitive with the state-of-the-art on caption generation (see Sec.4.5) and perform significantly better than recent methods on the retrieval task (see Sec.4.6).
Multimodal representation learning [27, 35] has significant value in multimedia understanding and retrieval. The shared concept across modalities plays an important role in bridging the “semantic gap” of multimodal data. Image captioning falls into this general category of learning multimodal representations.
Recently, several approaches have been proposed for image captioning. We can roughly classify those methods into three categories. The first category is template-based approaches, which generate caption templates based on detecting objects and discovering attributes within the image. For example, the work in [20] parses a whole sentence into several phrases and learns the relationships between phrases and objects within an image. In [15], a conditional random field (CRF) was used to correspond objects, attributes and prepositions of the image content and predict the best label. Other similar methods were presented in [26, 17, 16]. These methods are typically hand-designed and rely on fixed templates, which mostly leads to poor performance in generating variable-length sentences. The second category is retrieval-based approaches, which treat image captioning as a retrieval task: they leverage a distance metric to retrieve similar captioned images, then modify and combine the retrieved captions to generate a caption [17]. But these approaches generally need additional procedures, such as modification and generalization, to fit the image query.
Inspired by the successful use of CNNs [14, 45] and recurrent neural networks [24, 25, 1], a third category has emerged: neural network based methods [39, 42, 13, 10, 11]. Our work also belongs to this category. Among these works, Kiros et al. [12] can be seen as pioneering the use of neural networks for image captioning with a multimodal neural language model. In their follow-up work [13], Kiros et al. introduced an encoder-decoder pipeline in which the sentence is encoded by an LSTM and decoded with a structure-content neural language model (SC-NLM). Socher et al. [34] presented a DT-RNN (Dependency Tree-Recursive Neural Network) to embed sentences into a vector space in order to retrieve images. Later on, Mao et al. [23] proposed m-RNN, which replaces the feed-forward neural language model in [13]. Similar architectures were introduced in NIC [39] and LRCN [4]; both approaches use an LSTM to learn text context. But NIC only feeds visual information at the first time step, while the models of Mao et al. [23] and LRCN [4] consider image context at each time step. Another group of neural network based approaches was introduced in [10, 11], where image captions are generated by integrating object detection with R-CNN (region-CNN) and inferring the alignment between image regions and descriptions.
Most recently, Fang et al. [5] used multi-instance learning and a traditional maximum-entropy language model for description generation. Chen et al. [2] proposed to learn a visual representation with an RNN for generating image captions. In [42], Xu et al. introduced the attention mechanism of the human visual system into the encoder-decoder framework. It is shown that the attention model can visualize what the model “sees” and yields significant improvements on image caption generation. Unlike those models, our deep LSTM model directly assumes that the mapping relationship between vision and language is antisymmetric, and dynamically learns long-term bidirectional and hierarchical visual-language interactions. This proves to be very effective in generation and retrieval tasks, as we demonstrate in Sec.4.5 and Sec.4.6.
In this section, we describe our multimodal bidirectional LSTM model (Bi-LSTM for short) and explore its deeper variants. We first briefly introduce the LSTM, which is at the center of the model. The LSTM we use is described in [44].
Figure 2: Long Short Term Memory (LSTM) cell. It consists of an input gate i, a forget gate f, a memory cell c and an output gate o. The input gate decides whether to let the incoming signal through to the memory cell or block it. The output gate can allow new output or prevent it. The forget gate decides whether to remember or forget the cell’s previous state. The cell state is updated by feeding the previous cell output back to the cell through recurrent connections between two consecutive time steps.
3.1 Long Short Term Memory
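Our model builds on the LSTM cell, a particular form of the traditional recurrent neural network (RNN). It has been successfully applied to machine translation [3], speech recognition [8] and sequence learning [36]. As shown in Figure 2, reading from and writing to the memory cell c is controlled by a group of sigmoid gates. At a given time step t, the LSTM receives inputs from different sources: the current input $x_t$, the previous hidden state of all LSTM units $h_{t-1}$, and the previous memory cell state $c_{t-1}$. The gates are updated at time step t for the given inputs as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \phi(c_t)
\end{aligned}
$$

where W are the weight matrices learned by the network and b are bias vectors. $\sigma$ is the sigmoid activation function, $\phi$ denotes the hyperbolic tangent, and $\odot$ denotes the element-wise product with a gate value. The LSTM hidden output $h_t$ is used to predict the next word through a Softmax function with parameters $W_s$ and $b_s$:

$$
p_t = F(h_t; W_s, b_s), \qquad p_{t,i} = \frac{\exp\big((W_s h_t + b_s)_i\big)}{\sum_{j} \exp\big((W_s h_t + b_s)_j\big)}
$$

where $p_t$ is the probability distribution for the predicted word.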
Our key motivation for choosing the LSTM is that it can learn long-term temporal activities and avoids the rapidly exploding and vanishing gradient problems that the traditional RNN suffers from during back-propagation optimization.
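To make the gating mechanism concrete, the following is a minimal sketch of one LSTM time step followed by a Softmax prediction, written in plain Python. It assumes the standard LSTM formulation without peephole connections; the helper names (`lstm_step`, `init_params`, `affine`), the toy sizes and the random output projection are illustrative assumptions, not the released implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W_x, x, W_h, h, b):
    """Return W_x @ x + W_h @ h + b for list-of-lists matrices."""
    return [
        sum(wx * xi for wx, xi in zip(row_x, x))
        + sum(wh * hi for wh, hi in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step (standard form, no peepholes): returns (h_t, c_t)."""
    pre = {}
    for gate in ("i", "f", "o", "g"):
        W_x, W_h, b = params[gate]
        pre[gate] = affine(W_x, x, W_h, h_prev, b)
    i = [sigmoid(v) for v in pre["i"]]      # input gate
    f = [sigmoid(v) for v in pre["f"]]      # forget gate
    o = [sigmoid(v) for v in pre["o"]]      # output gate
    g = [math.tanh(v) for v in pre["g"]]    # candidate cell input
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def softmax(scores):
    m = max(scores)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(input_size, hidden_size, rng, scale=0.08):
    """Uniform initialization in [-scale, scale]; biases start at zero."""
    def matrix(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]
    return {
        gate: (matrix(hidden_size, input_size),
               matrix(hidden_size, hidden_size),
               [0.0] * hidden_size)
        for gate in ("i", "f", "o", "g")
    }


rng = random.Random(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
h, c = [0.0] * 4, [0.0] * 4
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):   # two toy one-hot inputs
    h, c = lstm_step(x, h, c, params)

# Hypothetical output projection to a toy vocabulary of 5 words.
W_s = [[rng.uniform(-0.08, 0.08) for _ in range(4)] for _ in range(5)]
b_s = [0.0] * 5
scores = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W_s, b_s)]
probs = softmax(scores)
```

The cell state `c` is the only quantity that is carried forward additively, which is what lets the gradient flow across many time steps.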
3.2 Bidirectional LSTM
In order to make use of both the past and future context of a sentence when predicting a word, we propose a bidirectional model that feeds a sentence to the LSTM in forward and backward order. Figure 3 presents an overview of our model, which comprises three modules: a CNN for encoding image inputs, a Text-LSTM (T-LSTM) for encoding sentence inputs, and a Multimodal LSTM (M-LSTM) for embedding visual and textual vectors into a common semantic space and decoding them into a sentence. The bidirectional LSTM is implemented with two separate LSTM layers that compute the forward hidden sequences $\overrightarrow{h}$ and the backward hidden sequences $\overleftarrow{h}$. The forward LSTM starts at time t = 1 and the backward LSTM starts at time t = T. Formally, our model works as follows: for a raw image input $\tilde{I}$, a forward-order sentence $\overrightarrow{S}$ and a backward-order sentence $\overleftarrow{S}$, the encoding is performed as

$$
I = C(\tilde{I}; \Theta_v), \qquad \overrightarrow{h}^1_t = T(\overrightarrow{E}\,\overrightarrow{S}; \overrightarrow{\Theta}_l), \qquad \overleftarrow{h}^1_t = T(\overleftarrow{E}\,\overleftarrow{S}; \overleftarrow{\Theta}_l)
$$

where C and T represent the CNN and the T-LSTM respectively, and $\Theta_v$, $\Theta_l$ are their corresponding weights. $\overrightarrow{E}$ and $\overleftarrow{E}$ are bidirectional embedding matrices learned by the network. The encoded visual and textual representations are then embedded by the multimodal LSTM:

$$
\overrightarrow{h}^2_t = M(\overrightarrow{h}^1_t, I; \overrightarrow{\Theta}_m), \qquad \overleftarrow{h}^2_t = M(\overleftarrow{h}^1_t, I; \overleftarrow{\Theta}_m)
$$

where M represents the M-LSTM and $\Theta_m$ its weights. M aims to capture the correlation between the visual context and words at different time steps. We feed the visual vector I to the model at each time step to capture strong visual-word correlation. On top of the M-LSTM are Softmax layers, which compute the probability distribution of the next predicted word by

$$
\overrightarrow{p}_{t+1} = F(\overrightarrow{h}^2_t; W_s, b_s), \qquad \overleftarrow{p}_{t+1} = F(\overleftarrow{h}^2_t; W_s, b_s)
$$

where $p \in \mathbb{R}^K$ and K is the vocabulary size.
Figure 3: Multimodal Bidirectional LSTM. L1: sentence embedding layer. L2: T-LSTM layer. L3: M-LSTM layer. L4: Softmax layer. We feed the sentence in both forward (blue arrows) and backward (red arrows) order, which allows our model to summarize context information from both the left and right side when generating a sentence word by word over time. Our model is end-to-end trainable by minimizing a joint loss.
3.3 Deeper LSTM architecture
To design deeper LSTM architectures, in addition to directly stacking multiple LSTMs on top of each other, which we name Bi-S-LSTM (Figure 4(c)), we propose to use a fully connected layer as an intermediate transition layer. Our motivation comes from the finding of [29], in which DT(S)-RNN (deep transition RNN with shortcut) is designed by adding a hidden-to-hidden multilayer perceptron (MLP) transition. It is arguably easier to train such a network. Inspired by this, we extend Bi-LSTM (Figure 4(b)) with a fully connected layer and call the result Bi-F-LSTM (Figure 4(d)); a shortcut connection between input and hidden states is introduced to make the model easier to train. The aim of the extended models is to learn an extra hidden transition function $F_h$. In Bi-S-LSTM

$$
h^{l+1}_t = F_h(h^{l-1}_t, h^{l}_{t-1}) = U h^{l-1}_t + V h^{l}_{t-1}
$$

where $h^l_t$ represents the hidden states of the l-th layer at time t, and U and V are the matrices connected to the transition layer (also see Figure 5(L)). For readability, we consider training in one direction only and suppress bias terms. Similarly, in Bi-F-LSTM, we learn a hidden transition function $F_h$ by

$$
h^{l+1}_t = F_h(h^{l-1}_t) = \phi_r\big(W h^{l-1}_t \oplus V(U h^{l-1}_t)\big)
$$

where $\oplus$ is the operator that concatenates $h^{l-1}_t$ and its abstractions into a long hidden state (also see Figure 5(R)). $\phi_r$ represents the rectified linear unit (ReLU) activation function of the transition layer, which computes $\phi_r(x) = \max(0, x)$.
Figure 4: Illustrations of the proposed deep architectures for image captioning. The network in (a) is commonly used in previous work, e.g. [4, 23]. (b) Our proposed Bidirectional LSTM (Bi-LSTM). (c) Our proposed Bidirectional Stacked LSTM (Bi-S-LSTM). (d) Our proposed Bidirectional LSTM with fully connected (FC) transition layer (Bi-F-LSTM).
Figure 5: Transition for Bi-S-LSTM (L) and Bi-F-LSTM (R)
3.4 Data Augmentation
One of the most challenging aspects of training deep bidirectional LSTM models is preventing overfitting. Since our largest dataset has only 80K images [21], which can easily cause overfitting, we adopted several techniques commonly used in the literature, such as fine-tuning a pre-trained visual model, weight decay, dropout and early stopping. Additionally, it has been shown that data augmentation such as random cropping and horizontal mirroring [32, 22], or adding noise, blur and rotation [40], can effectively alleviate overfitting. Inspired by this, we designed new data augmentation techniques to increase the number of image-sentence pairs. Our implementation operates on the visual model, as follows:
• Multi-Crop: Instead of randomly cropping the input image, we crop at the four corners and the center region, because we found that random cropping tends to select the center region and easily causes overfitting. By cropping the four corners and the center, the variation of the network input is increased, which alleviates overfitting.
• Multi-Scale: To further increase the number of image-sentence pairs, we rescale the input image to multiple scales. Each input image is resized to 256 × 256; we then randomly select a region of size $s \cdot 256 \times s \cdot 256$, where $s \in [\ldots, 0.85]$ is the scale ratio. s = 1 means that no multi-scale operation is applied to the given image. Finally, we resize the region to the AlexNet input size 227 × 227 or the VggNet input size 224 × 224.
• Vertical Mirror: Motivated by the effectiveness of the widely used horizontal mirror, it is natural to also consider the vertical mirror of the image for the same purpose.
These augmentation techniques are applied in real time. Each input image is randomly transformed by one of the augmentations before being fed to the network for training. In principle, our data augmentation can increase the number of image-sentence training pairs by roughly 40 times (5 crops combined with the multi-scale and mirror variants).
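The multi-crop and vertical-mirror operations can be sketched as follows. This is a minimal illustration on an image stored as a nested list of pixel values; the function names are hypothetical, the vertical mirror is assumed to mean a top-to-bottom flip, and rescaling is left out because it needs an image library.

```python
def five_crop_boxes(width, height, crop):
    """Return (left, top, right, bottom) boxes for four corners and center."""
    if crop > width or crop > height:
        raise ValueError("crop size exceeds image size")
    cx, cy = (width - crop) // 2, (height - crop) // 2
    offsets = [
        (0, 0),                             # top-left corner
        (width - crop, 0),                  # top-right corner
        (0, height - crop),                 # bottom-left corner
        (width - crop, height - crop),      # bottom-right corner
        (cx, cy),                           # center
    ]
    return [(l, t, l + crop, t + crop) for l, t in offsets]


def crop_image(image, box):
    """Cut a box out of an image given as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def vertical_mirror(image):
    """Flip the image top-to-bottom (assumed meaning of 'vertical mirror')."""
    return image[::-1]


# Toy 4x4 "image" whose pixel value encodes its position (row * 4 + col).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
boxes = five_crop_boxes(4, 4, 2)
center = crop_image(image, boxes[4])
flipped = vertical_mirror(center)

# Crop boxes for a 256x256 image at the AlexNet input size.
alexnet_boxes = five_crop_boxes(256, 256, 227)
```

During training one would pick one of these variants at random per image, in line with the real-time scheme described above.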
3.5 Training and Inference
Our model is end-to-end trainable using Stochastic Gradient Descent (SGD). The joint loss function $L$ is computed by accumulating the Softmax losses of the forward and backward directions. Our objective is to minimize $L$, which is equivalent to maximizing the probabilities of correctly generated sentences. We compute the gradient with the Back-Propagation Through Time (BPTT) algorithm.
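The trained model is used to predict a word $w_t$ given the image context $I$ and the previous word context $w_{1:t-1}$ by $P(w_t \mid I, w_{1:t-1})$ in forward order, or given the following word context $w_{t+1:T}$ by $P(w_t \mid I, w_{t+1:T})$ in backward order. We set $w_1 = w_T = 0$ at the start point for the forward and backward directions, respectively. Ultimately, with the sentences generated from the two directions, we decide the final sentence for a given image, $p(w_{1:T} \mid I)$, according to the summation of word probabilities within each sentence:

$$
p(w_{1:T} \mid I) = \max\left(\sum_{t=1}^{T} \overrightarrow{p}(w_t \mid I),\ \sum_{t=1}^{T} \overleftarrow{p}(w_t \mid I)\right)
$$

where $\overrightarrow{p}(w_t \mid I) = P(w_t \mid I, w_{1:t-1})$ and $\overleftarrow{p}(w_t \mid I) = P(w_t \mid I, w_{t+1:T})$.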
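The selection between the two generated sentences can be sketched in a few lines. The snippet assumes that each direction returns its words in generation order together with one probability per word; the function name, the tie-breaking rule (forward wins) and the toy numbers are illustrative assumptions.

```python
def select_final_caption(forward, backward):
    """Pick the caption whose summed word probabilities are higher.

    Each argument is a (words, word_probabilities) pair in generation
    order. The backward caption is reversed so that it reads left to
    right. Ties go to the forward caption (an arbitrary choice here).
    """
    fwd_words, fwd_probs = forward
    bwd_words, bwd_probs = backward
    if sum(bwd_probs) > sum(fwd_probs):
        return " ".join(reversed(bwd_words))
    return " ".join(fwd_words)


# Toy values, not taken from the experiments.
forward = (["a", "dog", "runs"], [0.9, 0.5, 0.4])
backward = (["grass", "on", "dog", "a"], [0.6, 0.7, 0.5, 0.3])
final_caption = select_final_caption(forward, backward)
```

Because raw probabilities are summed rather than multiplied, a longer sentence accumulates a larger score than a shorter one with similar per-word confidence.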
Following previous work, we adopted beam search, which keeps the best k candidate sentences at time t to infer the sentence at the next time step. In our work, we fix k = 1 in all experiments, although results that are on average 2 BLEU [28] points better can be achieved with k = 20 compared to k = 1, as reported in [39].
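A minimal beam search sketch illustrates the difference between k = 1 (greedy decoding) and a wider beam. The toy next-word table stands in for the Softmax output of a trained model; `beam_search`, `toy_model` and the `<eos>` token are hypothetical names, and candidates are scored here by summed log-probability, which is the usual choice for beam search.

```python
import math


def beam_search(next_word_probs, k, max_len, end_token="<eos>"):
    """Return (best_sentence, log_probability) using beam width k.

    next_word_probs(prefix) maps a tuple of words to {word: probability}.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in next_word_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (word,), score + math.log(p)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:    # keep the k best
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                      # sentences cut at max_len
    if not finished:
        return (), 0.0
    return max(finished, key=lambda item: item[1])


# Toy language model: the locally best first word is not globally best.
TOY = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "cat": 0.45},
    ("the",): {"dog": 0.9, "cat": 0.1},
}


def toy_model(prefix):
    return TOY.get(prefix, {"<eos>": 1.0})


greedy = beam_search(toy_model, k=1, max_len=5)
wide = beam_search(toy_model, k=2, max_len=5)
```

With k = 1 the search commits to "a" and ends with total probability 0.33, while k = 2 keeps "the" alive and finds "the dog" with probability 0.36.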
In this section, we design several groups of experiments to accomplish the following objectives:
• Qualitatively analyze and understand how the bidirectional multimodal LSTM learns to generate a sentence conditioned on visual context information over time.
• Measure the benefits and performance of the proposed bidirectional model and its deeper variants, whose nonlinear depth we increase in different ways.
• Compare our approach with state-of-the-art methods on sentence generation and image-sentence retrieval tasks on popular benchmark datasets.
4.1 Datasets
To validate the effectiveness, generality and robustness of our models, we conduct experiments on three benchmark datasets: Flickr8K [31], Flickr30K [43] and MSCOCO [21].
Flickr8K. It consists of 8,000 images, each with 5 sentence-level captions. We follow the standard dataset divisions provided by the authors: 6,000/1,000/1,000 images for training/validation/testing respectively.
Flickr30K. An extended version of Flickr8K. It has 31,783 images, each with 5 captions. We follow the publicly accessible dataset divisions of Karpathy et al. [11]. In these splits, 29,000/1,000/1,000 images are used for training/validation/testing respectively.
MSCOCO. This is a recently released dataset that covers 82,783 images for training and 40,504 images for validation. Each image has 5 sentence annotations. Since standard splits are lacking, we also follow the splits provided by Karpathy et al. [11], namely 80,000 training images and 5,000 images each for validation and testing.
4.2 Implementation Details
Visual feature. We use two visual models for encoding images: the Caffe [9] reference model, which is pre-trained with AlexNet [14], and the 16-layer VggNet model [33]. We extract features from the last fully connected layer and feed them to the LSTM to train the visual-language model. Previous work [39, 23] has demonstrated that more powerful image models such as GoogleNet [37] and VggNet [33] can yield promising improvements. To make a fair comparison with recent work, we select these two widely used models for our experiments.
Textual feature. We first represent each word w within a sentence as a one-hot vector $w \in \mathbb{R}^K$, where K is the size of the vocabulary built on the training sentences and differs between datasets. After basic tokenization and removal of the words that occur fewer than 5 times in the training set, we have 2028, 7400 and 8801 words in the Flickr8K, Flickr30K and MSCOCO vocabularies respectively.
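The vocabulary construction described above can be sketched as follows. The tokenizer (lowercasing and stripping punctuation) and the `<unk>` entry for out-of-vocabulary words are assumptions made for this illustration; the text only specifies basic tokenization and the frequency threshold of 5.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase, replace punctuation by spaces, split on whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in sentence.lower()
    )
    return cleaned.split()


def build_vocab(sentences, min_count=5, unk="<unk>"):
    """Map each word seen at least min_count times to an integer index."""
    counts = Counter(word for s in sentences for word in tokenize(s))
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate([unk] + kept)}


def one_hot(word, vocab, unk="<unk>"):
    """Return a one-hot list of length K = len(vocab)."""
    vec = [0] * len(vocab)
    vec[vocab.get(word, vocab[unk])] = 1
    return vec


# Toy corpus: "cat" and "sleeps" occur only twice and are dropped.
sentences = ["A dog runs."] * 5 + ["A cat sleeps."] * 2
vocab = build_vocab(sentences)
dog_vec = one_hot("dog", vocab)
cat_vec = one_hot("cat", vocab)
```

Here K is 4 (three frequent words plus the assumed `<unk>` slot), and the rare word "cat" falls back to the `<unk>` index.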
Our work uses the LSTM implementation of [4] on the Caffe framework. All of our experiments were conducted on Ubuntu 14.04 with 16G RAM and a single Titan X GPU with 12G memory. Our LSTMs use 1000 hidden units, with weights initialized uniformly from [-0.08, 0.08]. The batch sizes are 150, 100, 100 and 32 for the Bi-LSTM, Bi-S-LSTM, Bi-F-LSTM and Bi-LSTM (VGG) models respectively. Models were trained with learning rate … (… for Bi-LSTM (VGG)), weight decay 0.0005 and momentum 0.9. Each model was trained for 18 … with early stopping. The code for this work can be found at https://github.com/deepsemantic/image_captioning.
4.3 Evaluation Metrics
We evaluate our models on two tasks: caption generation and image-sentence retrieval. In caption generation, we follow previous work and use BLEU-N (N=1,2,3,4) scores [28]:

$$
B_N = \min\left(1, e^{1 - \frac{r}{c}}\right) \cdot e^{\frac{1}{N} \sum_{n=1}^{N} \log p_n}
$$

where r and c are the lengths of the reference sentence and the generated sentence, and $p_n$ is the modified n-gram precision. We also report METEOR [18] and CIDEr [38] scores for further comparison. In image-sentence retrieval (an image querying sentences and vice versa), we adopt R@K (K=1,5,10) and Med r as evaluation metrics. R@K is the recall rate R at the top K candidates and Med r is the median rank of the first retrieved ground-truth image or sentence. All mentioned metric scores are computed by the MSCOCO caption evaluation server, which is commonly used for the image captioning challenge.
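As a worked illustration of these metrics, the sketch below computes a sentence-level BLEU-N against a single reference, together with R@K and Med r from a list of ranks. It is a simplified stand-in and not the evaluation server's implementation, which handles multiple references and corpus-level statistics; the function names and toy inputs are assumptions.

```python
import math
import statistics
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """B_N = min(1, exp(1 - r/c)) * exp(mean of log p_n), n = 1..N."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    r, c = len(reference), len(candidate)
    brevity = min(1.0, math.exp(1.0 - r / c))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


def recall_at_k(ranks, k):
    """Percentage of queries whose first ground-truth hit is in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    return statistics.median(ranks)


reference = "a man in a black jacket is walking down the street".split()
candidate = "a man is walking down the street".split()
b1 = bleu(candidate, reference, 1)
b2 = bleu(candidate, reference, 2)

# Toy 1-based ranks of the first ground-truth item for five queries.
ranks = [1, 3, 12, 2, 7]
r_at_5 = recall_at_k(ranks, 5)
med_r = median_rank(ranks)
```

In this example every candidate unigram appears in the reference, so BLEU-1 is determined entirely by the brevity penalty exp(1 - 11/7).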
4.4 Visualization and Qualitative Analysis
The aim of this set of experiments is to visualize the properties of the proposed bidirectional LSTM model and explain how it generates a sentence word by word over time.
First, we examine the temporal evolution of the internal gate states to understand how bidirectional LSTM units retain valuable context information and attenuate unimportant information. Figure 6 shows the input and output data, the patterns of the three sigmoid gates (input, forget and output), and the cell states. We can clearly see that dynamic states are periodically distilled to the units from time step t = 0 to t = 11. At t = 0, the input data are sigmoid-modulated into the input gate i(t), whose values lie within [0,1]. At this step, the values of the forget gates f(t) of the different LSTM units are zero. As the time step increases, the forget gate starts to decide which unimportant information should be forgotten while retaining useful information. The memory cell states c(t) and output gate o(t) then gradually absorb the valuable context information over time and build a rich representation h(t) of the output data.
Next, we examine how visual and textual features are embedded into a common semantic space and used to predict words over time. Figure 7 shows the evolution of hidden units at different layers. In the T-LSTM layer, units are conditioned on textual context from the past and the future; this layer acts as the encoder of the forward and backward sentences. In the M-LSTM layer, LSTM units are conditioned on both visual and textual context; it learns the correlations between the input word sequence and the visual information encoded by the CNN. At a given time step, by removing unimportant information that contributes little to correlating the input word and the visual context, the units tend to show a sparse pattern and learn more discriminative representations from the inputs. At the higher layer, the embedded multimodal representations are used to compute the probability distribution of the next predicted word with Softmax. It should be noted that, for a given image, the numbers of words in the sentences generated in the forward and backward directions can differ.
Figure 6: Visualization of the LSTM cell. The horizontal axis corresponds to time steps. The vertical axis is the cell index. Here we visualize the gates and cell states of the first 32 Bi-LSTM units of the T-LSTM in the forward direction over 11 time steps.
(g) Generated words and corresponding word index in vocabulary. Forward: A (2) man (7) in (3) a (2) black (23) jacket (76) is (8) walking (41) down (38) the (4) street (36). Backward: Street (36) the (4) on (5) walking (41) is (8) suit (193) a (2) in (3) man (7) a (2).
Figure 7: Pattern of the first 96 hidden units chosen at each layer of the Bi-LSTM in both forward and backward directions. The vertical axis represents time steps. The horizontal axis corresponds to different LSTM units. In this example, we visualize the T-LSTM layer for text only, the M-LSTM layer for both text and image, and the Softmax layer for computing the word probability distribution. The model was trained on the Flickr30K dataset to generate a sentence word by word at each time step. In (g), we provide the predicted words at different time steps and their corresponding indices in the vocabulary, which can also be read from (e) and (f) (the highlighted point in each row). The word with the highest probability is selected as the predicted word.
Figure 8 presents some example images with generated captions. From these we found some interesting patterns in bidirectional captions: (1) They cover different semantics; for example, in (b) the forward sentence captures “…” while the backward one describes “…”. (2) They describe the static scenario and infer dynamics; in (a) and (d), one caption describes the static scene, and the other presents the potential action or motion that may happen at the next time step. (3) They form novel sentences; among the generated captions, we found that a significant proportion (88%, measured on 1000 randomly selected images from the MSCOCO validation set) are novel, i.e. they do not appear in the training set. Yet the generated sentences are highly similar to the ground-truth captions; for example in (d), the forward caption is similar to one of the ground-truth captions and the backward caption is similar to another. This illustrates that our model has a strong capability for learning visual-language correlation and generating novel sentences.
Figure 8: Examples of generated captions for given query image on MSCOCO validation set. Blue-colored captions are generated in forward direction and red-colored captions are generated in backward direction. The final caption is selected according to equation (13) which selects the sentence with the higher probability. The final captions are marked in bold.
Table 1: Performance comparison on BLEU-N (a high score is good). The superscript “A” means the visual model is AlexNet (or a similar network), “V” is VggNet, “G” is GoogleNet, “-” indicates an unknown value, and “” means different data splits. The best results are marked in bold and the second best results are underlined. The superscripts are also applicable to Table 2.
4.5 Results on Caption Generation
We now compare with state-of-the-art methods. Table 1 presents the comparison in terms of BLEU-N. Our approach achieves very competitive performance on the evaluated datasets, even with the less powerful AlexNet visual model. We can see that increasing the depth of the LSTM is beneficial for the generation task. The deeper variants mostly obtain better performance than Bi-LSTM, but they are inferior to it in B-3 and B-4 on Flickr8K. We conjecture that this is because Flickr8K is a relatively small dataset, which makes it difficult to train deep models with limited data. One interesting finding is that stacking multiple LSTM layers is generally superior to an LSTM with a fully connected transition layer, although Bi-S-LSTM needs more training time. Replacing AlexNet with VggNet brings significant improvements on all BLEU evaluation metrics. We should be aware that recent work [42] achieves the best results by integrating an attention mechanism [19, 42] for this task. Although we believe incorporating such a powerful mechanism into our framework can bring further improvements, note that our current Bi-LSTM model achieves the best or second best results on most metrics, while a small performance gap remains between our model and Hard-Attention [42].
A further comparison of METEOR and CIDEr scores is plotted in Figure 9. Without integrating object detection or a more powerful vision model, our model (Bi-LSTM) outperforms DeepVS [11] by a certain margin. It achieves 19.4/49.6 on Flickr8K (compared to 16.7/31.8 for DeepVS) and 16.2/28.2 on Flickr30K (15.3/24.7 for DeepVS). On MSCOCO, our Bi-S-LSTM obtains 20.8/66.6 for METEOR/CIDEr, which exceeds the 19.5/66.0 of DeepVS.
Figure 9: METEOR/CIDEr scores on different datasets.
4.6 Results on Image-Sentence Retrieval
For retrieval evaluation, we focus on image-to-sentence retrieval and vice versa. This is an instance of cross-modal retrieval [6, 30, 41], which has been a hot research subject in the multimedia field. Table 2 shows our results on different datasets. Our models exceed the compared methods on most metrics or match existing results. On a few metrics, our model does not outperform Mind’s Eye [2], which combines image and text features in ranking (making the task more like multimodal retrieval), or NIC [39], which employs a more powerful vision model, a large beam size and a model ensemble. While adopting the more powerful VggNet visual model results in significant improvements across all metrics, our results with the less powerful AlexNet model are still competitive on some metrics, e.g. R@1 and R@5 on Flickr8K and Flickr30K. We also note that on the relatively small Flickr8K dataset, the shallow model performs slightly better than the deeper ones on the retrieval task, in contrast with the results on the other two datasets. As explained before, we think deeper LSTM architectures are better suited to the ranking task on large datasets, which provide enough training data for training more complicated models; otherwise, overfitting occurs. Increasing data variation with our data augmentation techniques alleviates this to a certain degree. But we foresee further significant gains as the number of training examples grows and fresh data reduces the reliance on augmentation. Figure 10 presents some examples from the retrieval experiments. For each caption (image) query, sensible images and descriptive captions are retrieved. This shows that our models capture the visual-textual correlation needed for image and sentence ranking.
Table 2: Comparison with state-of-the-art methods on R@K (high is good) and Med r (low is good). All scores are computed by averaging the forward and backward results. “+O” means the approach uses additional object detection.
4.7 Discussion
Efficiency. In addition to showing superior performance, our models also possess high computational efficiency. Table 3 presents the computational costs of the proposed models. We randomly select 10 images from the Flickr8K validation set and perform the caption generation and image-to-sentence retrieval tests 5 times each. The table shows the time costs averaged across the 5 test runs. The time cost of network initialization is excluded. The cost of caption generation includes computing the image feature, sampling the bidirectional captions and computing the final caption. The time cost of retrieval covers computing the image-sentence pair scores (10 × 50 pairs in total) and ranking the sentences for each image query. As can be seen from Tables 1, 2 and 3, the deep models have only slightly higher time consumption but yield significant improvements, and our proposed Bi-F-LSTM strikes a balance between performance and efficiency.
Table 3: Time costs for testing 10 images on Flickr8K
Challenges in exact comparison. It is challenging to make a direct, exact comparison with related methods due to the differences in dataset division on MSCOCO. In principle, testing on a smaller validation set can lead to better results, particularly in the retrieval task. Since we strictly follow the dataset splits of [11], we compare to it in most cases. Another challenge is the visual model used for encoding image inputs. Different models are employed in different works; to make a fair and comprehensive comparison, we select the commonly used AlexNet and VggNet in our work.
Figure 10: Examples of image retrieval (top) and caption retrieval (bottom) with Bi-S-LSTM on the Flickr30K validation set. Queries are marked in red and the top-4 retrieved results are marked in green. Example captions shown in the figure: “A soccer player tackles a player from the other team”; “A black dog jumps in a body of water with a stick in his mouth”.
We proposed a bidirectional LSTM model that generates descriptive sentences for an image by taking both history and future context into account. We further designed deep bidirectional LSTM architectures to embed images and sentences in a high-level semantic space for learning visual-language models. We also qualitatively visualized the internal states of the proposed model to understand how the multimodal LSTM generates words at consecutive time steps. The effectiveness, generality and robustness of the proposed models were evaluated on numerous datasets. Our models achieve highly competitive or state-of-the-art results on both generation and retrieval tasks. Our future work will focus on exploring more sophisticated language representations (e.g. word2vec) and incorporating multitask learning and attention mechanisms into our model. We also plan to apply our model to other sequence learning tasks such as text recognition and video captioning.
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
[2] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.
[3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014.
[4] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
[5] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, and J. Platt. From captions to visual concepts and back. In CVPR, pages 1473–1482, 2015.
[6] F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. In ACMMM, pages 7–16. ACM, 2014.
[7] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
[8] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACMMM, pages 675–678. ACM, 2014.
[10] A. Karpathy, A. Joulin, and F-F. Li. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, pages 1889–1897, 2014.
[11] A. Karpathy and F-F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.
[12] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In ICML, pages 595–603, 2014.
[13] R. Kiros, R. Salakhutdinov, and R. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[15] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. on Pattern Analysis and Machine Intelligence(PAMI), 35(12):2891–2903, 2013.
[16] P. Kuznetsova, V. Ordonez, A. C. Berg, T. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, volume 1, pages 359–368. ACL, 2012.
[17] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. Trans. of the Association for Computational Linguistics(TACL), 2(10):351–362, 2014.
[18] M. Denkowski and A. Lavie. Meteor universal: language specific translation evaluation for any target language. ACL, page 376, 2014.
[19] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[20] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, pages 220–228. ACL, 2011.
[21] T-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[22] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang. Rapid: Rating pictorial aesthetics using deep learning. In ACMMM, pages 457–466. ACM, 2014.
[23] J. H. Mao, W. Xu, Y. Yang, J. Wang, Z. H. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.
[24] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, volume 2, page 3, 2010.
[25] T. Mikolov, S. Kombrink, L. Burget, J. H. Černocký, and S. Khudanpur. Extensions of recurrent neural network language model. In ICASSP, pages 5528–5531. IEEE, 2011.
[26] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In ACL, pages 747–756. ACL, 2012.
[27] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
[28] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318. ACL, 2002.
[29] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.
[30] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. Lanckriet, R. Levy, and N. Vasconcelos. On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence(PAMI), 36(3):521–535, 2014.
[31] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using amazon’s mechanical turk. In NAACL HLT Workshop, pages 139–147. Association for Computational Linguistics, 2010.
[32] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. Trans. of the Association for Computational Linguistics(TACL), 2:207–218, 2014.
[35] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, pages 2222–2230, 2012.
[36] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[38] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575, 2015.
[39] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
[40] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang. Deepfont: Identify your font from an image. In ACMMM, pages 451–459. ACM, 2015.
[41] X. Jiang, F. Wu, X. Li, Z. Zhao, W. Lu, S. Tang, and Y. Zhuang. Deep compositional cross-modal learning to rank via local-global alignment. In ACMMM, pages 69–78. ACM, 2015.
[42] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
[43] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. of the Association for Computational Linguistics(TACL), 2:67–78, 2014.
[44] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
[45] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.