Automatically generating a caption that describes an image, known as image captioning, is a challenging problem where computer vision meets natural language processing. Compared to image classification and object recognition tasks, image captioning requires a higher level of scene understanding as well as language modelling. A well-performing model not only has to identify the objects in the image, but also has to capture the semantic relationships between them, the general context and the activities that they are involved in. Furthermore, the model has to map the visual representation into a fully-formed English sentence.
Given the many similarities shared between image captioning and neural translation, many recent approaches in the image captioning domain have been inspired by the advances in neural translation [1]–[3]. A common framework is to use a word embedding matrix to produce a word embedding vector to serve as the input, and a separate output projection matrix to produce a probability distribution over all the words.
Manuscript received July 05, 2018; revised December 22, 2018; accepted on February 22, 2019. This research is supported by the UM Frontier Research Grant FG002-17AFR, from University of Malaya. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. El Saddik, Abdulmotaleb. (Corresponding author: Chee Seng Chan)
J.H. Tan and C.S. Chan are with the Center of Image and Signal Processing, Department of Artificial Intelligence, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, 50603 MALAYSIA. e-mail: {tanjiahuei@siswa.um.edu.my; cs.chan@um.edu.my}
J.H. Chuah is with the Department of Electrical Engineering, Faculty of Engineering, University of Malaya, Kuala Lumpur, 50603 MALAYSIA. e-mail: {jhchuah@um.edu.my}
Fig. 1. Our proposed method (COMIC-256) is able to achieve comparable BLEU-1 score on MS-COCO (0.706) against DeepVS [6] (0.625); Soft-Att. [4] (0.707); Hard-Att. [4] (0.718) and Baseline (0.701) despite having an embedding vocabulary size that is 39× smaller.
However, we found that as the datasets used to train the models grow larger, so does the vocabulary size. The resulting huge embedding matrices inflate the models and adversely affect their compactness, which in turn makes them difficult to deploy on embedded systems with limited hardware resources. For example, the Recurrent Neural Network (RNN) decoder in the Show, Attend and Tell framework [4] has a vocabulary size of 9,962. The resulting model has 12.2M parameters, of which 7.7M belong to the word embedding and output projection matrices. Even with embedding weight sharing [5], the model still has 7.3M parameters, of which 2.6M belong to the embedding matrices. Character-based models, on the other hand, are compact thanks to their small vocabulary size but suffer from poor performance. This is because character-based text sequences are usually much longer, which exacerbates the difficulties with long-range dependencies.
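The embedding figures quoted above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a word size of m = 256 and an LSTM hidden size of n = 512 (the settings reported later in Section V-A, assumed here to apply to this example), ignores bias terms, and uses a hypothetical helper name.

```python
def embedding_params(vocab_size, word_size=256, hidden_size=512, tied=False):
    """Approximate weight count of the input embedding and output projection.

    Assumptions (not stated in the paragraph above): word_size m = 256,
    hidden_size n = 512, and bias terms are ignored.
    """
    input_embedding = vocab_size * word_size
    if tied:
        # With embedding weight sharing, a single matrix serves both roles.
        return input_embedding
    output_projection = hidden_size * vocab_size
    return input_embedding + output_projection


word_untied = embedding_params(9962)             # 7,650,816, about 7.7M
word_tied = embedding_params(9962, tied=True)    # 2,550,272, about 2.6M
radix_untied = embedding_params(128)             # 98,304 for a 128-symbol vocabulary
```

Under these assumptions the two matrices scale linearly with the vocabulary size, which is why shrinking the vocabulary has such a large effect on the decoder size.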
In this paper, our goals are to i) reduce the complexity of image captioning models without compromising performance; and at the same time ii) improve model performance with an attention module without incurring additional computational cost, paving the way for possible real-time deployment on resource-constrained devices such as embedded and mobile devices. To achieve these goals, we present a simple yet effective framework named COMIC that reduces the model complexity in a manner that preserves the original accuracy, and at the same time increases model accuracy with an attention module in a manner that preserves the model complexity.
Firstly, Radix Encoding is employed as a pre-processing step that allows us to encode the full word vocabulary using only v symbols. The encoding scheme is designed in such a way that it can be deployed without requiring any changes to existing image captioning models. Secondly, the attention module has become the de facto standard for image captioning, largely due to the success of [4], [7]–[10]. However, the attention module usually operates on high-level Convolutional Neural Network (CNN) feature maps that come with a large channel dimension, which increases the model complexity in terms of the RNN input size and its weight matrices. To combat this, we refine the feature map projection with weight tying as a down-projection so that the projected feature map has fewer channels, and thus provides a more compact representation via attention. Finally, we adopt multi-head additive attention to improve the effective resolution of the attention module without affecting the original model complexity. With this, COMIC is able to jointly attend to information from different representation subspaces at different positions. Technically, this is achieved by separating the feature map channels into groups, so that each attention head can attend to different parts of the image separately depending on the channel group. To prevent an increase in the computational cost due to the multi-head module, the dimensionality of each attention head is reduced by a factor of g, where g is the number of attention heads.
In summary, the core contributions of this work are twofold. Firstly, we propose COMIC, a COMpact Image Captioning model with a vastly reduced vocabulary size (up to 99× smaller) and a multi-head attention module (see Section IV). This is the first attempt of its kind in the image captioning domain, and it opens up a new research angle in this domain. Secondly, we demonstrate the effectiveness of COMIC on two benchmark datasets: MS-COCO [11] and InstaPIC-1.1M [12] (see Section VI-A). We show that COMIC achieves comparable results, with only a small loss, on BLEU [13], METEOR [14], ROUGE-L [15], CIDEr [16] and SPICE [17] against state-of-the-art (SOTA) methods despite having an embedding vocabulary size that is 39× to 99× smaller. We discuss the technical differences compared to some related works in the next section.
Our work is mostly related to the current research on image captioning and compact models. This section reviews the most relevant works on these two topics.
Image captioning. [18] proposed a multimodal log-bilinear model to generate image captions, while [6] used a Bidirectional RNN (BRNN) and Region CNN (R-CNN) to learn a multimodal embedding which is then used by an RNN to generate sentences. [19] proposed to map image features from a CNN to a common word embedding space, and to generate sentences using a Long Short-Term Memory (LSTM) network. Their work was extended by Xu et al. [4], who incorporated an attention mechanism, allowing the network to focus on salient objects. Following this, [20] further extended this framework by adding a reviewer stage between the encoder and decoder. Tan and Chan [21], [22] proposed a phrase LSTM model, which has two levels of LSTMs, one to model the sentence composed of phrases, and another to generate words in a phrase. [23] used a multi-instance learning framework to learn 1000 visual detectors as the conditional inputs to a language model, and You et al. [7] enhanced the performance by learning semantic attention on visual attributes. More examples of attribute models include [24], [25]. Park et al. [12] proposed to use context memory to personalise the captions for Instagram images. Wang et al. [26] proposed a deep bidirectional LSTM model to harness history and future context information, which was extended by [27] with the integration of multi-task learning. Dai and Lin [28] proposed Contrastive Learning to encourage distinctiveness of the generated captions. Although most of the aforementioned approaches achieved very promising results, none of these models scale naturally to large vocabulary sizes. Most if not all recent image captioning works focused on raw performance by building exotic encoder-decoder style networks with attention, and placed little emphasis on reducing the computational costs of their models. In this paper, we introduce to the community a new research direction: a compact model with attention named COMIC.
Compact model. Building a compact model is an ongoing effort in the domain of deep learning [29], [30]. In this paper, we focus on efforts in the field of neural natural language processing, as it is closer to image captioning. There are many existing works involving the use of encoding as a pre-processing step. For instance, Nakagawa [31] proposed a hybrid method for Chinese and Japanese word segmentation, using word-level information for known words, and character-level information for unknown words. [32] studied numerous encoding methods for text classification in Chinese, English, Japanese and Korean. [33] encoded rare words using Huffman encoding into subword symbols. Similarly, [34] proposed using Byte-Pair Encoding (BPE) to segment rare words into subword units. However, it can only be used on English or languages with Latin characters. Gillick et al. [35] treated the text as a sequence of variable-length UTF-8 bytes for text sequence annotation. While it does not involve natural language sequence generation, the work allows for a more compact representation of the word sequence. [36] proposed using a CNN as an encoder to produce radical-embeddings for Chinese and Japanese, resulting in a reduced embedding vocabulary size. Li et al. [37] proposed building a word embedding table to factorise each word prediction into a 2-step process. The word embedding table is optimised separately using the minimum cost maximum flow (MCMF) algorithm. In our work, we show that it is possible to achieve good performance (i.e. generate a decent image caption) despite having an extremely small embedding vocabulary size.
Summary. Compared to regular image captioning models, COMIC has vastly fewer learnable parameters, leading to reduced requirements on GPU memory and storage. A closely related work to ours is LightRNN [37], with a few differences: i) COMIC requires only a single word embedding matrix (as opposed to two in LightRNN); ii) COMIC does not necessitate any changes in the model architecture (LightRNN requires a word embedding table); and iii) LightRNN is applied to language modelling only. On the other hand, our proposed method is orthogonal to compression and pruning based methods such as [38], [39]. Compression methods encode the trained weights of a full CNN into a smaller representation, while pruning methods are applied only after the full dense model has started the training process. In contrast, our method directly reduces the number of learnable parameters in the first place, thus producing a compact model. Moreover, [38], [39] are applied to image classification instead of image captioning. We believe that the aforementioned methods can be applied on top of COMIC to achieve even higher savings in terms of storage and parameters.
Following recent works, we formulate the image captioning task as a translation problem, where a probabilistic model is used to “translate” an image with fixed-size representation into a fully-formed English sentence. As such, we adopt a modified version of Show, Attend and Tell [4] framework as our model architecture, since it provides good performance on the image captioning task. This model will also serve as the baseline for our experiments. For clarification, we will refer to output projection and output embedding; embedding dimension and word size interchangeably. All the model size calculations in this paper include only the decoder and attention module (the encoder, i.e. CNN is excluded).
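Suppose $S = \{S_0, \dots, S_{T-1}\}$ is a sequence of words. Our model directly maximises the probability of the correct description given an image I using the following formulation:
$$\log p(S \mid I) = \sum_{t=0}^{T-1} \log p(S_t \mid I, S_{0:t-1}, c_t)$$
where $p(S_t \mid I, S_{0:t-1}, c_t)$ is the probability of generating a word given an image I, previous words $S_{0:t-1}$, and context vector $c_t$.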
Although in principle any RNN can be used, the LSTM cell [40] (with forget bias) is chosen as it has shown SOTA performance on sequential tasks such as translation [41] and image captioning. For an LSTM network with n units, we initialise the hidden state of the LSTM with the image embedding vector through a pre-activation weight layer with layer normalisation (LN) [42], that is, through a weight matrix combined with the LN function.
The attention function used in this paper is the “soft attention” introduced by [1] and used in [4], where a multilayer perceptron (MLP) with a single hidden layer is employed to calculate the attention weights on a particular feature map. The context vector is then concatenated with the previously predicted word embedding to serve as the input to the LSTM. Finally, a probability distribution over the vocabulary is produced from the LSTM hidden state. The formulation involves the input and output embedding matrices; a set of weight matrices; the concatenation operator [ , ]; the probability distribution over the vocabulary; the LSTM memory state; the current input to the LSTM; the context vector; the feature map and the vector extracted from each of its locations; the attention weight at time step t and location j; the one-hot vector of the previous word; and the softmax temperature.
Fig. 2. Example of caption, original and encoded token indices when using radix encoding with base-128
This paper introduces COMIC, a simple yet effective framework that consists of radix encoding, feature map projection weight tying and multi-head additive attention, which work together to build a compact image captioning model with attention without affecting the original accuracy.
A. Radix Encoding
The idea of radix encoding is to transform the word indices to a higher base, splitting every word token into d tokens. Although it is possible to reduce the vocabulary size using BPE [34], it can only be used on English and other languages using Latin characters. Radix encoding, on the other hand, can in theory be used on all languages, including Chinese, Japanese and Korean. For example, in Fig. 2, with a base of v = 128, the word token “a” with an index of i = 0 is encoded using the two tokens 0 and 0, while the word token “asphalt” with an index of i = 2118 is encoded using the two tokens 16 and 70 (since 2118 = 16 × 128 + 70). We also define two special tokens, one marking the start of a sequence and EOS marking the end of the sequence. For easy decoding, each special token is represented using only one token, with its own index and corresponding one-hot vector. This enables radix encoding to be used without any modification to existing model architectures. To generate a sequence, one simply runs inference using beam search as usual and applies post-processing to the output tokens. The post-processing can be done either by converting the encoded indices back to the base-10 index i, or by constructing a decoding tree dictionary.
With this encoding scheme, we re-represent the original corpus vocabulary using an encoded embedding vocabulary of size v. This leads to a huge reduction in model complexity. For example, the popular MS-COCO dataset often yields a vocabulary of 8,000 to 10,000 words, while the InstaPIC-1.1M dataset yields around 22,000 to 40,000 words. With the proposed radix encoding, the embedding vocabulary can be set to a much lower size such as v = 128 or v = 256, a drastic reduction for both the MS-COCO and InstaPIC-1.1M datasets. Results are given in Section V-C1, Table I.
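The encoding and its inverse amount to a change of number base. The following is a minimal sketch rather than the authors' implementation: the function names are hypothetical, digits are assumed to be emitted most-significant first, and the special start and end tokens are left out.

```python
def radix_encode(index, base, num_digits):
    """Split a word index into `num_digits` base-`base` digits (most significant first)."""
    if not 0 <= index < base ** num_digits:
        raise ValueError("index does not fit in the requested number of digits")
    digits = []
    for _ in range(num_digits):
        index, digit = divmod(index, base)
        digits.append(digit)
    return digits[::-1]


def radix_decode(digits, base):
    """Convert a list of base-`base` digits back to the original word index."""
    index = 0
    for digit in digits:
        index = index * base + digit
    return index


encoded = radix_encode(2118, base=128, num_digits=2)   # [16, 70]
decoded = radix_decode(encoded, base=128)              # 2118
```

Because every word maps to exactly d digits, the decoder output can be regrouped into chunks of d tokens and converted back to word indices after beam search.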
B. Feature Map Projection Weight Tying
For most image captioning models with visual attention [4], [7]–[10], the attention function operates on a higher-level feature map of the CNN so that the context vector captures a higher-level representation. Such feature maps usually come with a large channel dimension, such as r = 512 for VGG-16 and r = 832 for GoogLeNet. This in turn increases the RNN’s input size and the size of its weight matrices. To combat this issue, we propose a down-projection of the feature map such that the projected map has fewer channels: the feature map is multiplied by a weight matrix, where q is the number of channels of the projected feature map and q is smaller than r. As shown in Table II (untied), a small projection size can reduce the model complexity, and at the same time it provides extra representation power to the language model.
However, the extra projection layer naturally incurs additional computational cost. To alleviate this, we introduce weight sharing between the feature map projection and the attention MLP weights. With this, the projected feature map can be calculated in advance and shared with the attention MLP, so the visual attention module can be introduced in COMIC without incurring extra computational cost compared to conventional approaches. Table II shows that the attention module is obtained at a lower computational cost without compromising accuracy.
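One way to picture the weight tying is with a small NumPy sketch. This is an illustration under assumptions, not the paper's code: all variable names are hypothetical, the weights are random, the feature map is assumed to have 196 locations, and tying is assumed to mean that the attention MLP reuses the projected feature map as its feature branch (so its hidden size equals q).

```python
import numpy as np

rng = np.random.default_rng(0)
r, q, n = 832, 512, 512      # feature channels, projected channels, LSTM units
num_locations = 196          # hypothetical 14 x 14 feature map

feature_map = rng.standard_normal((num_locations, r))
W_proj = rng.standard_normal((r, q)) * 0.01   # shared by projection and attention MLP
W_h = rng.standard_normal((n, q)) * 0.01      # hidden-state branch of the attention MLP
w_a = rng.standard_normal(q) * 0.01           # output layer of the attention MLP

# Computed once per image, before decoding starts.
projected = feature_map @ W_proj              # (num_locations, q)


def attend(h_prev):
    """Single-head additive attention that reuses the pre-computed projection."""
    scores = np.tanh(projected + h_prev @ W_h) @ w_a   # (num_locations,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ projected                      # (q,)
    return context, weights


context, weights = attend(rng.standard_normal(n))
```

In this sketch the only per-step work involving the feature map is an addition, a `tanh` and two small products; the r-to-q projection itself is never repeated during decoding.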
C. Multi-Head Additive Attention
Multi-head attention [43] separates the feature map channels into groups, where each attention head can attend to different areas of the image separately depending on the channel group. In other words, each location of each channel group is assigned an attention weight by an MLP. This is in contrast to regular single-head attention, which applies the same attention weights across all of the feature map channels, leading to averaging of contextual information from multiple regions.
Technically, for single-head attention, we use a single MLP with hidden size k and obtain a q-dimensional context vector. For multi-head attention with g heads, we use g separately learned MLPs with hidden size k/g, with each head producing a q/g-dimensional output vector. The output vectors are then concatenated to produce the final q-dimensional context vector. Due to the dimension reduction of each head, the total computational cost is the same as that of single-head attention with full dimensionality.
To take advantage of this, we combine the multi-head formulation of dot-product attention [43] with additive attention (MLP) for image captioning, increasing the effective resolution of the attention module via an ensemble of attention modules. In practice, we combine the separate MLPs into a single MLP to maximise parallelism. Experiments on the MS-COCO dataset (Table II) show that multi-head attention can improve the original accuracy with the same model complexity. This is the first time multi-head additive attention is used in image captioning1.
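The grouping of channels into heads can be sketched as follows. This is a simplified illustration with hypothetical names, not the authors' implementation: it assumes the feature map has already been projected to q channels, that the hidden state has already been projected to the same size, and that the g per-head MLPs are fused so that hidden units and channels are simply split into g contiguous groups.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_additive_attention(projected, h_proj, w_a, g):
    """projected: (L, q) feature map; h_proj: (q,) projected hidden state;
    w_a: (g, q // g) per-head output weights.
    Returns the context vector (q,) and per-head attention weights (g, L)."""
    num_loc, q = projected.shape
    assert q % g == 0, "q must be divisible by the number of heads"
    hidden = np.tanh(projected + h_proj).reshape(num_loc, g, q // g)
    scores = np.einsum("lgd,gd->gl", hidden, w_a)          # one score map per head
    weights = softmax(scores, axis=1)                      # normalise over locations
    values = projected.reshape(num_loc, g, q // g)         # channel groups
    context = np.einsum("gl,lgd->gd", weights, values).reshape(q)  # concatenate heads
    return context, weights


rng = np.random.default_rng(0)
num_loc, q, g = 196, 512, 8
projected = rng.standard_normal((num_loc, q))
h_proj = rng.standard_normal(q)
w_a = rng.standard_normal((g, q // g))
context, weights = multi_head_additive_attention(projected, h_proj, w_a, g)
```

With g = 1 the function reduces to ordinary single-head additive attention, and the number of weights in `w_a` is q regardless of g, which mirrors the claim that the extra heads do not add parameters.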
A. Model Details
The LSTM model is trained in an unrolled form to predict each word of the sentence after it has seen the image, the current context vector and all the preceding words. As usual, each word is represented as a one-hot vector of dimension equal to the size of the dictionary. The training is performed by minimising the loss w.r.t. all the parameters except those of the image model. To tackle overfitting, we employ dropout at the input and output of the LSTM. Our loss function is the sum of three terms: the negative log likelihood of the correct word at each time step, the doubly stochastic attention regularisation employed in [4], and an L2 weight loss.
TABLE I COMPARISON OF MODELS WITH DIFFERENT TOKENISATION AND ENCODING SCHEMES ON MS-COCO
Unless stated otherwise, all the models used in our experiments have the following basic configuration. All models are implemented using TensorFlow. The image model used in our work is GoogLeNet (InceptionV1) with batch normalisation [45], [46] pre-trained on ImageNet. The input images are resized, then randomly flipped and cropped before being fed to the CNN. The image embedding size is z = 1024. The attention function operates on the “Mixed-4f” map, with an MLP size of k = 512. The projected feature map for the untied models in Table II has q = 512 channels. The LSTM network consists of a single layer with a hidden state size of n = 512. The word size is set to m = 256 dimensions. The optimiser used for training is Adam [47], with a batch size of 32. The learning rate is halved every 4 epochs, starting from its initial value, until it reaches a minimum value. All models are trained for 20 epochs. The input and output dropout rates for the LSTM are both set to 0.35, and weight decay is applied. All trainable parameters are initialised randomly using Xavier initialisation [48]. For inference, we use beam search in order to better approximate the most probable caption. We use a beam size of b = 3 with no length normalisation for all experiments unless noted otherwise. All hyperparameters are chosen based on educated guesses due to limited computational resources.
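The step-wise learning rate schedule described above can be written in a few lines. This is only a sketch: the function name is hypothetical, and the initial and minimum rates used in the example are placeholder values, since the actual values are not recoverable from the text here.

```python
def learning_rate(epoch, initial_lr, min_lr, halve_every=4):
    """Halve the rate every `halve_every` epochs, never going below `min_lr`."""
    return max(initial_lr * 0.5 ** (epoch // halve_every), min_lr)


# Placeholder values for illustration only (not taken from the paper).
schedule = [learning_rate(epoch, initial_lr=1e-2, min_lr=1e-5) for epoch in range(20)]
```

With a 20-epoch run and halving every 4 epochs, the rate is reduced four times in total, so the floor only matters when the initial rate is within a factor of 16 of the minimum.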
B. Experiment Setup
We conducted our experiments on two public English captioning datasets, namely MS-COCO [11] and InstaPIC-1.1M [12]. The MS-COCO dataset contains 123,287 images, and each image is given at least 5 captions by different AMT workers. We use the publicly available split2 from the work of [6], which uses 5,000 images for validation and another 5,000 for testing. The InstaPIC dataset contains 648,761 images for training and 5,000 images for testing. Each Instagram image is paired with one user caption. This dataset is challenging, as its captions are natural posts with varying formats. Following [28], we randomly reserved 2,000 images from the training set for validation.
All the scores are obtained using the publicly available MS-COCO evaluation toolkit3, which computes BLEU [13], METEOR [14], ROUGE-L [15], CIDEr [16] and SPICE [17]. For the sake of brevity, we label BLEU-1 to BLEU-4 as B-1 to B-4, and METEOR, ROUGE-L, CIDEr, SPICE as M, R, C, S respectively. For MS-COCO, we use the publicly available tokenised captions2 [6], filtering out words that occur fewer than 5 times and truncating sentences longer than 20 words. For InstaPIC, we use the publicly available tokenisation script4, and select the 25,595 most frequently used words as our vocabulary. We also truncate captions longer than 18 words.
C. Ablation Study
1) Tokenisation and encoding: In this section, we examine the effect of introducing the Radix Encoding scheme. From Table I, it can be seen that the regular word-based model performed the best. This is followed by the Radix models using base-128 and base-64, and finally the character-based model. The result can be attributed to the much shorter sentence length when using word tokens, which alleviates long-term dependency learning issues. This performance degradation is an expected trade-off of parameter reduction, and we believe the result is still comparable. For instance, the performance gap between the word and radix encoding models is moderate (2.3% on average), while the number of parameters is reduced drastically (by 62%). The remaining parameter count is roughly one-third of the original amount, which is comparable to the character model, yet the radix model obtains better performance than the character model. This shows that radix encoding is able to reduce the complexity of image captioning models without much effect on the original accuracy.
2) Attention module: In this section, we investigate the effect of different attention configurations by varying the number of attention heads, with and without the projection weight tying. The models used are as described in Section V-A, but with the word size set to m = 64. Table II shows that it is possible to introduce a visual attention module in a more compact way without compromising the original accuracy. For instance, when the feature map projection is employed (i.e. tied) in Radix, base-128, we found that even with fewer parameters, having the extra projection layer contributes a slight improvement in the overall performance. This is more obvious when multi-head attention is applied. This phenomenon is also observed in the regular word-based model. Without the projection, multi-head attention often provides little to no benefit compared to regular single-head attention. This can be attributed to the extra projection giving the model the ability to group channels that are relevant to each attention head together, forming contiguous groups.
From our further investigation on the type of projection, we notice two opposite trends. When using single-head attention, the tied models generally performed slightly worse than their untied counterparts. This shows that the benefit obtained by the extra projection is counteracted by the reduction in the parameter count. On the other hand, the tied models generally performed better than their untied counterparts when using multi-head attention, despite having fewer parameters. This can be understood as the tied projection layer receiving extra gradient information during training, via weight sharing, from both the multi-head attention module and the RNN.
In terms of the inclusion of multi-head additive attention, we notice that, compared to a single head, models using 8 heads yield improvements of up to +4% on the CIDEr score. Furthermore, as shown in Table II, the other metric scores also improve across the board as the number of attention heads increases when the tied projection is used. This is consistent with the findings of [43], where performance on the WMT 2014 English-to-German translation task improves as the number of heads increases (up to 16 heads). Note that, as mentioned in Section IV-C, this overall performance improvement is essentially free, as each individual head operates on a reduced dimension compared to the single head.
TABLE II COMPARISON OF MODELS WITH DIFFERENT ATTENTION CONFIGURATIONS ON MS-COCO
A. State-of-the-art comparison
In this section, our COMIC model is a radix encoding model with 8 attention heads and tied feature map projection, the rest being identical to the baseline. To provide a fair comparison, we trained two sets of baseline models. The first set consists of the standard baseline named “Baseline” and “Baseline-8”, which has 8 attention heads without feature map projection; the second set consists of a pair of slim baseline models named “Baseline-S”, whose parameter counts are reduced to match the COMIC models. “Baseline-SC” models have n = k = 160 and m = 128, and “Baseline-SI” models have n = k = 80 and m = 64. We trained all the baselines and COMIC-v models for 30 epochs, where v denotes the choice of base number. The base number of COMIC is chosen so that the number of tokens needed to encode a word token is d = 2. As the MS-COCO word model has a vocabulary of fewer than 128² = 16,384 words, a base number of v = 128 or v = 256 is sufficient to encode the entire vocabulary while minimising the increase in sequence lengths. On the other hand, the InstaPIC word model has a larger vocabulary (25,595 words, see Section V-B), hence the larger base numbers v = 160 and v = 256 are used. We would like to note that our metric scores are obtained using a single model instead of an ensemble.
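The choice of base numbers follows from requiring that two digits cover the whole vocabulary. The sketch below uses a hypothetical helper; the InstaPIC vocabulary size (25,595) comes from Section V-B, while the MS-COCO figure of 9,962 is borrowed from the introduction as an illustrative assumption. Special tokens are ignored.

```python
def smallest_base(vocab_size, num_digits=2):
    """Smallest base v such that v ** num_digits >= vocab_size."""
    v = 2
    while v ** num_digits < vocab_size:
        v += 1
    return v


instapic_base = smallest_base(25595)   # 160, since 160 ** 2 = 25,600
mscoco_base = smallest_base(9962)      # 100, since 100 ** 2 = 10,000
```

Under these assumptions v = 160 is the tightest two-digit base for InstaPIC, while v = 128 and v = 256 both leave headroom on MS-COCO.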
Tables III and IV show the metric scores achieved by our baselines, COMIC and SOTA methods. On both datasets, our COMIC models manage to perform on par with the baselines despite having a much lower parameter count and vocabulary size. For example, on the MS-COCO dataset, the loss in performance of COMIC-256 is merely 0.45% on CIDEr when compared to the baselines, despite having only 33% of the parameters and a vocabulary size of 256 (a 39× reduction). On the InstaPIC dataset, the complexity reduction is even more drastic. Despite having far fewer parameters (16.7% of the baseline) and a vocabulary size of only 256 (a 99× reduction), COMIC-256 still manages to perform on par with the baseline models and even outperforms them on certain metrics. When compared to the slim baseline models with comparable parameter counts, our COMIC models again have better performance. This shows the effectiveness of the proposed methods in reducing the model complexity while minimising the impact on overall performance across five different evaluation metrics. Our COMIC models also compare favourably to SOTA approaches, losing moderately to attribute-based approaches on the MS-COCO dataset, and only to the latest CSMN [12] and AACL [28] approaches on the InstaPIC dataset, despite operating on a much more condensed vocabulary.
In summary, although there is a slight performance drop in some of the metric scores when comparing COMIC against the baselines in Tables III and IV, this performance degradation is an expected trade-off of parameter reduction and we believe the results are still comparable. When compared to the SOTA methods (with the exception of ACVT [24], which implemented an attribute dictionary), the overall performance of our proposed model is very competitive on both datasets, in particular against those with architectures similar to ours (i.e. Soft and Hard Attention [4], and Review Net [20]).
B. Uniqueness and length of captions:
TABLE III COMPARISON WITH BASELINE AND SOTA METHODS ON MS-COCO. V2 MODEL IS TRAINED ON A DIFFERENT SETTING WHERE F AND S CRITICAL SEQUENCE TRAINING (SCST), PLEASE REFER TO OUR GITHUB PAGE FOR MORE DETAILS.
TABLE IV COMPARISON WITH BASELINE AND SOTA METHODS ON INSTAPIC-1.1M. METHODS WITH [ARE EXTRACTED FROM [12]
TABLE V CAPTION STATISTICS: UNIQUENESS AND LENGTH
It has been pointed out that multimodal RNN-based approaches tend to reconstruct previously seen captions [51]. Hence, we compare our model with the baselines in terms of the uniqueness and length of the generated captions in Table V. A caption is considered to be unique if it is not seen in the pre-processed training corpus. From the results, we can see that although COMIC uses an encoded vocabulary, it still manages to generate considerably more unseen (unique) captions than the baselines. The average length of the captions generated by COMIC is also longer than that of the baselines.
This trend is due to the decoding noise introduced by the radix encoding. In other words, in addition to the long-term dependencies between words, successful generation of captions relies strongly on accurately modelling the short-term dependencies between tokens. This added difficulty, together with exposure bias, accounts for the increased uniqueness and the increased length of the captions generated by our models.
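The two caption statistics discussed above are straightforward to compute. The sketch below is a minimal illustration with a hypothetical function name; it assumes captions are whitespace-tokenised strings and that uniqueness means an exact string match is absent from the pre-processed training corpus.

```python
def caption_statistics(generated, training_corpus):
    """Return (fraction of unseen captions, average caption length in words)."""
    seen = set(training_corpus)
    num_unique = sum(1 for caption in generated if caption not in seen)
    avg_length = sum(len(caption.split()) for caption in generated) / len(generated)
    return num_unique / len(generated), avg_length


generated = ["a man standing next to a zebra in a field", "a cat sitting on a car"]
training_corpus = ["a cat sitting on a car"]
uniqueness, avg_length = caption_statistics(generated, training_corpus)
```

In this toy example one of the two generated captions is absent from the training corpus, so the uniqueness is 0.5 and the average length is 8 words.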
In this section, we provide some examples of the captions generated by our model in Fig. 3-4 for both the MS-COCO and InstaPIC datasets5. We can see that the captions generated by COMIC-256 are grammatically correct and are not affected by the vocabulary encoding scheme. In many cases, COMIC-256 even manages to provide finer details when describing the images compared to the baseline. For instance, in the first image of Fig. 3, our model properly describes the image content as “a man standing next to a zebra in a field”, while the baseline model is only able to generate “a man is standing next to a zebra”.
To better understand our model, Fig. 5 visualises the multi-head attention maps for different words in the generated caption. Going through each of the attention maps, we can see that our proposed model effectively delegates each attention head to different locations. In other words, each head learns to focus on subjects, objects or background separately. For example, in the first image, the 3rd head generally attends to the car. Meanwhile, the 4th head is focused on the cat at the beginning, and fades out when the model moves to the other words. The 5th head attends to the space around the roof of the car, aiding in predicting “on top of”. Finally, the 7th head attends to the boundary between the cat and the car while the model is predicting the word “sitting”. The second image shows similar task assignments. In the third image, which has multiple subjects, we can see that each head can separately attend to the background, the elephant and the people sitting on top.
This paper studied the image captioning problem from a new perspective by presenting COMIC, a compact image captioning model with an attention module. Experiments were conducted on the MS-COCO and InstaPIC-1.1M datasets, and the results showed that the overall performance of COMIC was not affected despite a reduction of 39× to 99× in the vocabulary size. In future work, we would like to investigate the impact of different encoders (i.e. CNN models) such as MobileNets [30] on the overall performance, and to train the radix encoding models in a greedy decoding setting using reinforcement learning methods such as Policy Gradient [52] to avoid the “exposure bias” problem.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
[1] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[2] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3104–3112.
[4] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning (ICML), 2015, pp. 2048–2057.
[5] O. Press and L. Wolf, “Using the output embedding to improve language models,” arXiv preprint arXiv:1608.05859, 2016.
[6] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3128–3137.
[7] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4651–4659.
[8] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, “Aligning Where to See and What to Tell: Image Captioning with Region-based Attention and Scene-specific Contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.
[9] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based LSTM and semantic consistency,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045–2055, Sept 2017.
[10] K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, Nov 2015.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
[12] C. Chunseong Park, B. Kim, and G. Kim, “Attend to you: Personalized image captioning with context sequence memory networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 895–903.
[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics, 2002, pp. 311–318.
[14] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, vol. 29, 2005, pp. 65–72.
[15] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text summarization branches out: Proceedings of the ACL-04 workshop, vol. 8, 2004.
[16] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.
[17] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in European Conference on Computer Vision (ECCV), 2016, pp. 382–398.
[18] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 595–603.
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156–3164.
[20] Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. R. Salakhutdinov, “Review networks for caption generation,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2361–2369.
[21] Y. H. Tan and C. S. Chan, “phi-lstm: a phrase-based hierarchical lstm model for image captioning,” in Asian Conference on Computer Vision. Springer, 2016, pp. 101–117.
[22] ——, “Phrase-based image caption generator with hierarchical LSTM network,” Neurocomputing, vol. 333, pp. 86–100, 2019.
[23] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to visual concepts and back,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1473–1482.
[24] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, “What value do explicit high level concepts have in vision to language problems?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 203–212.
[25] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” arXiv preprint arXiv:1611.01646, 2016.
[26] C. Wang, H. Yang, C. Bartz, and C. Meinel, “Image captioning with deep bidirectional LSTMs,” in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 988–997.
[27] C. Wang, H. Yang, and C. Meinel, “Image captioning with deep bidirectional LSTMs and multi-task learning,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 2s, p. 40, 2018.
[28] B. Dai and D. Lin, “Contrastive learning for image captioning,” in 31st Conference on Neural Information Processing Systems (NIPS), 2017.
[29] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 mb model size,” arXiv preprint arXiv:1602.07360, 2016.
[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[31] T. Nakagawa, “Chinese and japanese word segmentation using word-level and character-level information,” in Proceedings of the 20th International Conference on Computational Linguistics, 2004, p. 466.
[32] X. Zhang and Y. LeCun, “Which encoding is the best for text classification in chinese, english, japanese and korean?” arXiv preprint arXiv:1708.02657, 2017.
[33] R. Chitnis and J. DeNero, “Variable-length word encodings for neural translation models.” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 2088–2093.
[34] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
[35] D. Gillick, C. Brunk, O. Vinyals, and A. Subramanya, “Multilingual language processing from bytes,” in Proceedings of NAACL-HLT, 2016, pp. 1296–1306.
[36] Y. Ke and M. Hagiwara, “Radical-level ideograph encoder for RNN-based sentiment analysis of chinese and japanese,” arXiv preprint arXiv:1708.03312, 2017.
[37] X. Li, T. Qin, J. Yang, and T.-Y. Liu, “LightRNN: Memory and computation-efficient recurrent neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4385–4393.
[38] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[39] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2016, pp. 525–542.
[40] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[41] D. Britz, A. Goldie, T. Luong, and Q. Le, “Massive exploration of neural machine translation architectures,” arXiv preprint arXiv:1703.03906, 2017.
[42] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
[44] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[46] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
[48] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[49] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[50] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 290–298.
[51] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, “Language models for image captioning: The quirks and what works,” arXiv preprint arXiv:1505.01809, 2015.
[52] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of SPIDEr,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
Fig. 3. Captions generated by COMIC-256 and baseline on MS-COCO dataset: We can see that COMIC-256 model (solid green line) outperforms baseline method (dashed blue line) in most cases. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour.
Fig. 4. Captions generated by COMIC-256 and baseline on InstaPIC-1.1M dataset: We can see that COMIC-256 model (solid green line) outperforms baseline method (dashed blue line) in most cases. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour.
Fig. 5. Sample of the generated captions with the attention maps of different heads. It shows that COMIC has effectively delegated each attention head to different tasks.
In this supplementary material we provide additional visualisations of the attention heads in our Compact Image Captioning (COMIC) model on MS-COCO (in Sec. IX-A) and InstaPIC-1.1M (in Sec. IX-B) datasets. Furthermore, we also show some randomly sampled images with qualitative results in Sec. X.
In Figs. 6 to 10, the attention maps of the different heads are indexed by a, where a ∈ [0, 7]. Attention maps with the most activity are selected for better visualisation. Going through each of the attention maps, we can see that the models have effectively learned how to delegate each attention head to different tasks. In other words, each head has learned to focus on subjects, objects or background separately.
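The text does not state how "activity" is measured when choosing which maps to display. Purely as a hypothetical illustration, the sketch below ranks heads by how far their maps deviate from a uniform distribution, using the variance of the attention weights averaged over the words of a caption. The scoring rule, the function name and the toy data are all assumptions for this example and not the selection procedure used for the figures.

```python
from statistics import pvariance


def rank_heads_by_activity(maps_per_head):
    """Rank head indices from most to least 'active'.

    maps_per_head[a][t] is the attention map of head a at word t,
    given as a flat list of weights over image locations.
    Assumed score: variance of the weights, averaged over words
    (a perfectly uniform map scores zero).
    """
    scored = []
    for a, maps in enumerate(maps_per_head):
        score = sum(pvariance(m) for m in maps) / len(maps)
        scored.append((score, a))
    return [a for _, a in sorted(scored, reverse=True)]


# Toy example: 3 heads, 1 word, 4 image locations.
toy_maps = [
    [[0.25, 0.25, 0.25, 0.25]],  # head 0: uniform, no clear focus
    [[0.70, 0.10, 0.10, 0.10]],  # head 1: sharply focused
    [[0.40, 0.20, 0.20, 0.20]],  # head 2: mildly focused
]
print(rank_heads_by_activity(toy_maps))  # [1, 2, 0]
```

Other criteria, such as low entropy or a large peak weight, would be equally plausible choices under the same caveat.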
A. MS-COCO Dataset
Fig. 6: We can see that two heads attend to the surroundings of the zebras. While two heads both attend to the group of zebras, they provide different context to the language model as they switch on alternately. One likely provides context for the noun “zebra” while the other provides context for the verb “standing”.
Fig. 7 (top): Here, we can see that two heads attend to the surfboard, but one of them also attends to the surrounding ocean, which might provide the general context. In contrast, one head is strongly focused on the surfboard. Another head attends to the waves. Lastly, two heads attend to the subject, with one focusing on the lower body and the other on the head and torso.
Fig. 7 (bottom): Although the model misidentifies the player as male, the attention is focused on the correct regions. One head attends to the player and the court, which provide the general context. Another head attends to the cap, racquet, shoes and clothing. This provides the cue on the type of sport, which our model predicted correctly. One head again attends to the background, in particular the court lines. Similar to the surfing example above, two heads attend to the subject, with one focusing on the lower body and the other on the head and torso.
Fig. 6. Generated captions using greedy decoding by COMIC-256 on the MS-COCO dataset and the attention maps of different heads
Fig. 7. More examples on MS-COCO dataset: Generated captions and the attention maps of different heads
B. InstaPIC-1.1M Dataset
Fig. 8: We can see that one head attends mainly to the sky region, especially when the model is predicting “top” and “world”. Another head attends to essentially the entire image, which provides the general context. Lastly, one head attends to both the foreground and faraway regions, which provides cues that the image is a bird’s-eye view of the bay region.
Fig. 9 (Top): It can be seen that one head attends to the background. Another head attends to essentially the entire image, which provides the general context. Two heads attend to the dog, with one being more focused than the other.
Fig. 9 (Bottom): We can see that one head attends to the hair and face of the subject, while two heads attend strongly to the facial regions. Another head attends to the entire image.
Fig. 10 (Top): We can clearly observe that one head attends to the sky regions, while another attends to the tree, road and sun. Two heads attend to the sun, with one being more focused than the other.
Fig. 10 (Bottom): Here, we can see that one head mainly attends to the plate, while another attends to the food as a whole. Two heads attend to different regions or food items on the plate.
Fig. 8. Generated captions using greedy decoding by COMIC-256 on the InstaPIC-1.1M dataset and the attention maps of different heads
Fig. 9. More examples on InstaPIC-1.1M dataset: Generated captions and the attention maps of different heads
Fig. 10. More examples on InstaPIC-1.1M dataset: Generated captions and the attention maps of different heads
For the generated captions, we provide results from both our COMIC-256 model and the baseline word model in Figures 11 to 15. Captions inside the solid green box are generated by the COMIC-256 model, and captions inside the dashed blue box are generated by the baseline method. We can see that for most images, our proposed method matches or in some cases outperforms the baseline method. For instance, the captions generated by the COMIC-256 model are grammatically correct, which shows that it is not affected by the vocabulary encoding scheme. In some cases, the COMIC-256 model manages to provide finer details when describing the images compared to the baseline. Finally, we demonstrate the ability of our proposed method to generate variable-length captions.
We also explicitly chose some failure examples in which the COMIC-256 model performs no better than the baseline method, shown in Figure 12 for the MS-COCO dataset and Figure 15 for the InstaPIC-1.1M dataset. We can see that incorrect recognition of objects, or missing the main objects in the image, is still the dominant cause of error.
A. MS-COCO Dataset
Fig. 11. Captions generated by COMIC-256 (solid green line) and baseline (dashed blue line) on MS-COCO dataset. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour
Fig. 12. Captions generated by COMIC-256 (solid green line) and baseline (dashed blue line) on MS-COCO dataset. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour
B. InstaPIC-1.1M Dataset
Fig. 13. Captions generated by COMIC-256 (solid green line) and baseline (dashed blue line) on InstaPIC-1.1M dataset. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour
Fig. 14. Captions generated by COMIC-256 (solid green line) and baseline (dashed blue line) on InstaPIC-1.1M dataset. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour
Fig. 15. Captions generated by COMIC-256 (solid green line) and baseline (dashed blue line) on InstaPIC-1.1M dataset. Accurate descriptions are indicated by blue with bold and italic text, inaccurate descriptions are indicated by red with bold text. Best viewed in colour