Visual Question Answering (VQA) (Antol et al. 2015) is the problem of answering questions about an input image. In contrast to traditional visual recognition tasks, which concentrate on a predefined problem such as object detection, scene classification, or activity recognition, VQA involves various recognition tasks at the same time and provides a unified approach to solving the problems defined by questions. For these reasons, VQA requires a substantial amount of learning to capture information from images and understand questions.
Recently, deep learning based approaches using multiple reasoning steps with attention (Yang et al. 2016; Xiong, Merity, and Socher 2016) have been proposed to improve the performance of VQA systems. After extracting features from an image and a question using a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), respectively, these methods iteratively generate answers by combining the question feature with the image feature of a local area selected by attention. This approach is natural, as many questions in VQA conceptually require multiple steps of reasoning. For example, to answer a question like “what is in front of the giraffe?” the VQA model should be able to solve the subtasks in sequence: “finding a giraffe,” “identifying the region in front of the giraffe,” and “classifying the object in the specified region.”
One of the critical limitations of the existing methods is that the number of reasoning steps is fixed to a predefined value. Even though the exact number of steps required to answer a question is difficult to determine, it is reasonable to assume that different questions need different numbers of steps to reach solutions. For example, the question “what color is the apple on the table?” requires more steps than the question “what color is the apple?” If the number of inference steps is not adaptive to questions, we may need to solve multiple subtasks in a single step or a single subtask in multiple steps, which is likely to degrade system performance by causing overfitting or underfitting of trained models.
Instead of presetting the number of steps, we make the model learn the number of inference steps for each question implicitly, by simply minimizing the joint loss from multiple steps of our recurrent deep neural network without extra supervision. We believe that learning to predict with a proper number of steps not only enhances the overall accuracy but also helps each answering unit generalize better. However, identifying the optimal number of steps is an extremely difficult task, so we propose an indirect but simple solution to learn a unified model for inference. By analyzing the answers from each step, we find that predictions with more steps tend to overfit to easier questions, even if later steps could solve more complex questions better than the earlier ones. Based on this observation, we progressively stop backpropagating the loss from the overfitting units. Training is terminated when the accuracy of the first answering unit is saturated. With this training strategy, we empirically show that the single-step prediction based on the first unit, whose model is trained with the joint loss from multiple steps, outperforms other training-testing variations, and that the accuracy is further improved by integrating the early-stopping training strategy. The main contributions of the proposed algorithm are summarized below:
• We propose a novel architecture based on a deep recurrent neural network for VQA, which is composed of multiple answering units with shared parameters and optimized by minimizing joint loss from multiple units.
• We show that the VQA model involving multiple reasoning steps tends to overfit to easier questions quickly, and develop a unique training strategy to early-stop overfitting units progressively until the accuracy of the first unit is saturated.
• The proposed algorithm outperforms other multi-step attention based approaches on the VQA dataset while using only a single-step prediction.
The rest of the paper is organized as follows. We first review related work and then discuss our main idea with its motivation. We next describe the architecture that implements our idea, together with the training and testing methods of the proposed algorithm. Finally, we present experimental results of our algorithm on the standard public benchmark dataset.
The VQA problem was first addressed by (Malinowski and Fritz 2014), which proposes a Bayesian framework to combine symbolic reasoning and uncertain visual information. (Zhou et al. 2015) proposed a shallow neural network for VQA, which accepts the CNN features of images and the Bag-of-Words (BoW) representations of questions as its inputs. Recently, VQA algorithms are often formulated with deep neural networks, since techniques based on deep learning have a straightforward end-to-end training procedure by backpropagation and present competitive accuracy. These approaches typically pose VQA problems as simple classification tasks based on the joint features from images and questions, where CNNs and LSTMs (Hochreiter and Schmidhuber 1997) are employed to encode images and questions, respectively (Ren, Kiros, and Zemel 2015; Malinowski, Rohrbach, and Fritz 2015), while only CNNs are used for both encoding and classification in (Ma, Lu, and Li 2016).
Several VQA systems (Yang et al. 2016; Xiong, Merity, and Socher 2016; Shih, Singh, and Hoiem 2016) attempt to identify relevant regions in the input image to answer questions, which is often performed by the soft attention mechanism (Xu et al. 2015). Specifically, (Shih, Singh, and Hoiem 2016) utilizes single-step attention to an object proposal for VQA, while (Xiong, Merity, and Socher 2016) employs multi-step attention based on the dynamic memory network (Kumar et al. 2016). The stacked attention network (Yang et al. 2016) is motivated by memory networks (Sukhbaatar et al. 2015; Kumar et al. 2016), and constructs two successive attention modules for multi-step inference.
Several approaches employ network architectures that adapt to input questions. The dynamic parameter prediction network (Noh, Seo, and Han 2016) determines the parameters of a fully connected layer in a CNN given a question, and constructs a unified model to handle various tasks related to input images. With similar motivation, neural module networks combine multiple module networks to construct a single deep network (Andreas et al. 2016a; Andreas et al. 2016b), where the module combinations are given by syntactic parsing of questions.
Deep neural networks are sometimes trained with multiple supervisions to facilitate the training procedure, where losses are backpropagated from multiple branches. GoogLeNet (Szegedy et al. 2015) attaches auxiliary classifiers to a few intermediate layers of the network to provide additional supervision for training. Deeply supervised nets (Lee et al. 2015) have a companion objective function to avoid the vanishing gradient problem. The deeply recursive convolutional network (Kim, Lee, and Lee 2016) provides supervision to every recurrent convolutional layer and makes inference based on an ensemble of individual predictions. Our approach has something in common with (Kim, Lee, and Lee 2016) in that supervision is given to each recurrent unit, but it differs in that we analyze the role of each supervision and propose a novel training strategy based on early stopping.
This section describes the general formulation of VQA and discusses our approach based on recurrent deep neural network with its motivation.
Problem Formulation
We formulate the VQA problem as a classification task. For a given image $I$ and a question $q$, a VQA model predicts the best answer $\hat{a}$, which is formally given by
$$\hat{a} = \operatorname*{argmax}_{a \in \Omega} \; p(a \mid I, q; \theta), \tag{1}$$
where $\Omega$ denotes a set of all possible answers and $\theta$ is a set of model parameters in the network. There exist many different methods to implement this probabilistic framework, but we focus on the models based on deep neural networks, which are frequently employed for VQA problems in recent years.
The models based on deep neural networks are typically composed of three main components: image encoder, question encoder, and answering module. Image and question encoders extract an image feature $f_I$ from an input image $I$ and a question feature $f_q$ from a question sentence $q$, respectively. The answering module takes the extracted image and question features, and generates an answer by combining information from the image and the question. Therefore, the module should be able to perform various tasks defined by questions. In this work, we propose a deep neural network architecture and investigate a training strategy for the answering module.
Main Idea
The motivation behind this work is our observation that solving a task defined by a question often requires the capability to solve a sequence of atomic subtasks. However, the kinds and numbers of the subtasks in the sequence vary across individual questions, and, in addition, the same subtask may appear at any place within the sequence depending on the question. For these reasons, it is extremely difficult to develop a unified algorithm to handle all the variations. Instead, to overcome the challenges indirectly, we design a novel neural network architecture for VQA and propose a training strategy with early-stopping based on a simple joint loss minimization.
The overall architecture of the proposed algorithm is illustrated in Figure 1. The main component of the proposed network is an answering unit. Each answering unit is capable of solving the full task; based on the features extracted from the image and the question, it predicts an answer and updates its hidden state. By concatenating multiple answering units sequentially, we infer a series of answers, each predicted by a model that progressively integrates subtasks and solves more and more composite tasks. This procedure is implemented as a recurrent neural network whose recurrent unit corresponds to an answering unit.
We need to train the proposed network so that it can answer a given question by solving a series of subtasks one by one using the answering units. The main challenge to this objective is that it is difficult to know the optimal number of steps to solve each problem, since there are various question types with different complexity. To circumvent this problem, we always make the first unit in the network solve problems, but allow it to learn knowledge from the rest of the units by backpropagation unless doing so degrades the model. For this purpose, we simply provide the same supervision to every step in the unfolded recurrent neural network and optimize all the answering units with shared parameters jointly using the standard backpropagation technique for RNNs. We also propose a unique training technique with an early-stopping strategy, which is useful to avoid overfitting in complex models among multiple answering units.
This section discusses the proposed answering module and the procedure of training and testing. We provide more detailed description about the answering modules in the supplementary document.
Answering Module
The answering module is a recurrent neural network, which is composed of multiple answering units with shared model parameters as illustrated in Figure 1. Using a given image feature map $f_I \in \mathbb{R}^{P \times L}$ (with $P$ channels and $L$ locations), a question feature $f_q \in \mathbb{R}^{Q}$, and an initial memory state $m_0 \in \mathbb{R}^{M}$, the answering module provides a sequence of answer probability vectors $(a_1, \dots, a_K)$, where $K$ is the number of answering units. Formally, the $k$-th unit predicts an answer probability as
$$(a_k, m_k) = \mathrm{answer}_k(f_I, f_q, m_{k-1}), \tag{2}$$
where $\mathrm{answer}_k(\cdot)$ is the answering operation in the $k$-th answering unit.
The function $\mathrm{answer}_k(\cdot)$ requires several internal operations to predict the answer probability, as illustrated in Figure 2. The first operation is to compute a subtask feature $s_k$, which is extracted based on the question and the history of performed subtasks in the previous steps as follows:
$$s_k = \mathrm{subtask}(f_q, m_{k-1}; \theta_s), \tag{3}$$
where $f_q$ is the question feature, $m_{k-1}$ is the memory state of the previous step, and $\theta_s$ is the parameter for the subtask module.
Figure 2: Illustration of the answering unit, which comprises subtask embedding, attention, and predict operations.
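This subtask feature is used to perform an attention operation, which finds a relevant location in the image feature map $f_I$, together with its corresponding feature vector, based on the key $s_k$. We employ the soft attention mechanism (Xu et al. 2015), and the attention operation is formally given by
$$(\alpha_k, c_k) = \mathrm{attention}(f_I, s_k, m_{k-1}; \theta_a), \tag{4}$$
where $m_{k-1}$ is the memory state of the previous step, $\theta_a$ is the set of parameters for soft attention, and $\alpha_k$ and $c_k$ denote the attention probability map and the attended feature, respectively.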
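As a rough illustration of the soft attention step, the sketch below scores each location of a feature map against a key, normalizes the scores with a softmax, and returns the probability-weighted sum of the location features. The dot-product scoring, the location-major layout of `features`, and the names `soft_attention`, `features`, and `key` are assumptions made for illustration only; the scoring function used in this paper is described in the supplementary document.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, key):
    """Attend over L location vectors (each of length P) with a key of length P.

    Returns the attention probabilities over locations and the attended
    feature. Dot-product scoring is an assumption of this sketch.
    """
    scores = [sum(f_d * k_d for f_d, k_d in zip(f, key)) for f in features]
    alpha = softmax(scores)
    dim = len(features[0])
    attended = [sum(a * f[d] for a, f in zip(alpha, features)) for d in range(dim)]
    return alpha, attended


# Toy example: 3 locations, 2 channels.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key = [0.0, 5.0]
alpha, attended = soft_attention(features, key)
```

In this toy case the key matches the second channel, so almost all of the attention mass falls on the second and third locations.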
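The last operation is the predict function. It receives the subtask feature, the attention information, and the memory state of the previous step to produce the final answer and update the hidden state. This operation involves an LSTM to update the hidden state in memory, and the answer probability is given by applying a softmax function. The prediction operation is formally defined as
$$(a_k, m_k) = \mathrm{predict}(s_k, c_k, m_{k-1}; \theta_l, \theta_c), \tag{5}$$
and the two outputs are specifically given by
$$m_k = \mathrm{LSTM}(s_k, c_k, m_{k-1}; \theta_l), \tag{6}$$
$$a_k = \mathrm{softmax}(m_k; \theta_c), \tag{7}$$
where $s_k$ is the subtask feature, $c_k$ is the attended feature, $\theta_l$ is the set of parameters for the LSTM, and $\theta_c$ is the set of parameters for the softmax classifier.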
To implement the answering module, we need to learn the parameters of all the internal operations, which is done by the standard backpropagation technique. The detailed training procedure is described next.
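The recurrence described in this section can be sketched as a loop that applies one answering unit repeatedly, so that every step shares the same parameters and passes its memory state to the next step. This is only an illustrative sketch: `run_answering_module` and `toy_unit` are hypothetical names, and `toy_unit` is a stand-in that does not implement the subtask, attention, or LSTM operations of the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def run_answering_module(answer_unit, image_feat, question_feat, memory, num_units):
    """Apply the same answering unit num_units times, threading the memory state.

    Returns one answer probability vector per step and the final memory state.
    """
    answers = []
    for _ in range(num_units):
        probs, memory = answer_unit(image_feat, question_feat, memory)
        answers.append(probs)
    return answers, memory


def toy_unit(image_feat, question_feat, memory):
    """Stand-in unit (an assumption, not the paper's operations)."""
    new_memory = [
        0.5 * m + 0.5 * (i + q)
        for m, i, q in zip(memory, image_feat, question_feat)
    ]
    return softmax(new_memory), new_memory


answers, final_memory = run_answering_module(
    toy_unit, image_feat=[1.0, 0.0], question_feat=[1.0, 0.0],
    memory=[0.0, 0.0], num_units=3,
)
```

Because the same callable is reused at every step, any parameters it holds are shared across steps, which mirrors the weight sharing among answering units.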
Training
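We minimize the joint loss from multiple steps by providing the ground-truth answer to every step $k$ as supervision. The loss from a single step is computed by the cross entropy between the answer probability $a_k$ and the ground-truth answer $y$. The losses from all steps are simply summed to compute the overall loss. Given image $I^i$, question $q^i$, and ground-truth label $y^i$ of the $i$-th training example, the joint loss function is formally given by
$$\mathcal{L}(\theta_I, \theta_q, \theta_A) = -\sum_{i} \sum_{k=1}^{K} \log a_k^i[y^i], \tag{8}$$
where $a_k^i$ is the answer probability predicted by the $k$-th answering unit from the features that the image and question encoders extract from $I^i$ and $q^i$, $a_k^i[y^i]$ is the probability it assigns to the ground-truth answer, and $K$ is the number of answering units. Note that $\theta_I$ and $\theta_q$ denote the parameters for the image and question encoders, respectively. The model parameter for the answering module is given by $\theta_A = \{\theta_s, \theta_a, \theta_l, \theta_c\}$, which collects the parameters of the subtask module, the soft attention, the LSTM, and the softmax classifier.
Figure 1: Overall architecture of the proposed network. The proposed network is a recurrent deep neural network, where each recurrent unit corresponds to a complete module for visual question answering. For training, we unfold the network to predict an answer and give supervision at every step. For testing, we use a single answering unit to answer a question about an image.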
We train the model end-to-end by backpropagation, minimizing the cross-entropy objective in Eq. (8). Given an image $I$ and a question $q$, the image encoder and the question encoder extract features $f_I$ and $f_q$, respectively. The extracted features are given to the individual unfolded answering units. As the answering module itself is a recurrent neural network, we compute the gradients of all its parameters by backpropagation through time (BPTT). Note that the loss backpropagated to the inputs of the answering module, $f_I$ and $f_q$, is used to learn the parameters of the image and question encoders.
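The joint loss can be illustrated with a small sketch that sums the cross-entropy of the ground-truth answer over every unfolded step. The function names and the list-based representation of probability vectors are assumptions for illustration; gradient computation by BPTT is not shown.

```python
import math


def joint_loss(step_probs, label):
    """Sum of cross-entropy losses over all answering steps for one example.

    step_probs: one answer probability vector per step.
    label: index of the ground-truth answer.
    """
    return sum(-math.log(probs[label]) for probs in step_probs)


def dataset_joint_loss(examples):
    """Aggregate the joint loss over (step_probs, label) pairs."""
    return sum(joint_loss(step_probs, label) for step_probs, label in examples)


# Two steps, two candidate answers, ground truth is answer 1.
example = ([[0.5, 0.5], [0.25, 0.75]], 1)
loss = dataset_joint_loss([example])
```

Here the loss equals $-\log 0.5 - \log 0.75 = \log(8/3)$, since the same label supervises both steps.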
When the network is trained with the joint loss from multiple steps as in Eq. (8), we observe that models with multiple steps generally overfit to the training data easily and show lower validation/testing accuracy than the model in the first step, as illustrated in Figure 3. This observation is consistent with (Yang et al. 2016), which states that using three or more attention layers does not improve the performance any further. We believe that the answering units in later steps have more model capacity and can fit the training data better, at the cost of generality. Such overfitting may ruin the models in the earlier steps since the losses in the later steps are propagated backwards all the way to the first answering unit. To circumvent this issue, we progressively stop backpropagating losses from overfitted answering units as soon as we identify their overfitting. In practice, we validate the model in every epoch and early-stop training from an answering unit if its validation accuracy drops more than a predefined threshold below its maximum value. According to our observation, the answering units in the later steps are typically terminated first, and the first answering unit always manages to stay alive. If a validation dataset is not available, we early-stop training based on an empirically determined formula (Eq. (10)), for convenience. The formula is defined in terms of the number of epochs before the first early stopping, the total number of epochs for training, and the number of epochs before training of the $k$-th answering unit is terminated. The early stopping is scheduled by controlling a configuration parameter of this formula.
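The validation-based variant of this strategy can be sketched as follows: after each epoch, any answering unit whose validation accuracy has fallen more than a threshold below its best value is deactivated, and only the losses of active units are summed. The function names, the threshold value, and the accuracy numbers are hypothetical and serve only to illustrate the bookkeeping.

```python
def update_active_units(val_acc_history, active, threshold):
    """Deactivate units whose latest validation accuracy dropped too far.

    val_acc_history[k] lists the validation accuracy of unit k per epoch.
    A unit that has been deactivated is never reactivated.
    """
    updated = list(active)
    for k, history in enumerate(val_acc_history):
        if updated[k] and max(history) - history[-1] > threshold:
            updated[k] = False
    return updated


def masked_joint_loss(step_losses, active):
    """Sum the per-step losses of the units that are still being trained."""
    return sum(loss for loss, is_on in zip(step_losses, active) if is_on)


# Hypothetical accuracies for three units over three epochs.
history = [
    [0.50, 0.55, 0.56],  # unit 1: still improving
    [0.52, 0.56, 0.54],  # unit 2: small drop, within the threshold
    [0.53, 0.57, 0.50],  # unit 3: large drop, treated as overfitting
]
active = update_active_units(history, [True, True, True], threshold=0.03)
loss = masked_joint_loss([1.0, 2.0, 3.0], active)
```

Dropping a unit's loss from the sum means it no longer contributes gradients, which corresponds to stopping backpropagation from that unit.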
Testing
At test time, we predict answers only from the first answering unit in our model since it has better generalization accuracy in our experiments. Although we believe that some complex tasks can be solved better in the later steps, it is very difficult to know which step is optimal for finding the correct solution. Hence, selecting one of the answers from multiple answering units is not feasible in practice. Also, the ensemble of the answers from multiple units is not particularly better than the solution from the first unit. Note that, since the losses from the later steps are always backpropagated to the first answering unit, it gradually learns how to solve more complex problems, and the early-stopping strategy keeps it from overfitting.
Comparison to Memory Network based Approaches
At first glance, the proposed method resembles VQA methods such as (Yang et al. 2016; Xiong, Merity, and Socher 2016), which are based on memory networks (Sukhbaatar et al. 2015). However, we introduce a novel training strategy with early-stopping and a simple testing method with single-step inference. This decision is based on our observations in VQA problems, but it contradicts (Yang et al. 2016; Xiong, Merity, and Socher 2016). These previous works state that multi-step training and testing without parameter sharing is advantageous, but our results suggest that multi-step training with parameter sharing and single-step testing may be better in practice. This is partly supported by Figure 3, although the experimental setting is not exactly identical. Also, our experiment supports the claim that the benefit from multi-step training with weight sharing is larger than that from multi-step training and testing without sharing weights.
Figure 3: Training and validation accuracy curves for varying k. The prediction from the later step shows higher training accuracy but lower validation accuracy.
Dataset and Evaluation Metric
We train and test the proposed network on the VQA dataset (Antol et al. 2015), which borrows images from the MSCOCO dataset (Lin et al. 2014) and collects questions and answers via Amazon Mechanical Turk. The dataset consists of 248,349 questions for training, 121,512 questions for validation, and 244,302 questions for testing. For each image, 3 questions are asked and 10 independent answers are given to each question. There are two test splits; we typically use the test-dev split for the control experiments and test-standard for comparison with external algorithms.
Two tasks are defined on the VQA dataset: an open-ended task and a multiple-choice task. The model has to predict an answer to an open-ended question without knowing predefined candidate answers, but selects one of 18 candidate answers in the multiple-choice task. In both cases, the answers given by a model are evaluated by the following metric reflecting human consensus:
$$\mathrm{Acc}_{\mathrm{VQA}} = \min\left(\frac{\#\,\text{humans that provided that answer}}{3},\; 1\right),$$
where the score for a question is proportional to the number of matches with ground-truth answers, and the tested model receives full credit for a question if at least 3 people agree with the predicted answer.
Figure 4: Training and validation accuracy of Ours SS and Ours Full.
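The consensus metric can be made concrete with a short sketch that counts how many of the human answers match the prediction and caps the score at full credit for three or more matches. The function name and the exact-string comparison are assumptions of this sketch; no answer normalization is applied.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the human answers for a question.

    The score grows linearly with the number of matching answers and is
    capped at 1.0 once at least three people gave the predicted answer.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Ten hypothetical human answers for one question.
humans = ["red"] * 2 + ["dark red"] * 4 + ["maroon"] * 4
partial = vqa_accuracy("red", humans)
full = vqa_accuracy("maroon", humans)
none = vqa_accuracy("blue", humans)
```

With these answers, "red" earns partial credit of two thirds, "maroon" earns full credit, and "blue" earns nothing.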
Implementation Details
We use VGG-16 net (Simonyan and Zisserman 2015) and ResNet-101 (He et al. 2016) as image encoders. After rescaling input images to $448 \times 448$, we extract the feature maps from the last pooling layer in VGG or the layer below the global average pooling layer in ResNet. The question features are given by a 2-layer LSTM; the final hidden and cell states in both layers are concatenated to be used as a question feature. The dimensionality of the attended image feature and the subtask feature are both 512. For better generalization, we apply dropout with a rate of 0.5 to the features for input images and questions in each answering unit independently. The vocabulary size of our model is 14,772 and the rest of the words are converted to UNK tokens. The set of possible answers in Eq. (1) contains the 1,000 answers with the highest frequencies among all answers in the training dataset.
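The construction of the answer set and the handling of out-of-vocabulary words can be sketched with the standard library. The function names, the whitespace tokenization, and the lowercasing are assumptions made for illustration and are not taken from the released code.

```python
from collections import Counter


def build_answer_set(train_answers, size=1000):
    """Keep the `size` most frequent answers in the training data."""
    return [answer for answer, _ in Counter(train_answers).most_common(size)]


def to_tokens(question, vocabulary):
    """Map words outside the vocabulary to the UNK token.

    Whitespace tokenization and lowercasing are assumptions of this sketch.
    """
    return [w if w in vocabulary else "UNK" for w in question.lower().split()]


answers = build_answer_set(["yes", "no", "yes", "2", "yes", "no"], size=2)
tokens = to_tokens("What color is the zeppelin", {"what", "color", "is", "the"})
```

Questions whose ground-truth answer falls outside the selected set cannot be answered correctly by the classifier, which is the usual trade-off of posing VQA as classification.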
We set the number of steps $K$ to 8 for multi-step training, and determine the parameters in Eq. (10) using the validation set. We use Adam (Kingma and Ba 2015) for optimization. Separate learning rates are set for the question encoder and the answering module, and the image encoder is not fine-tuned. Both learning rates are decayed in every epoch by a factor of 0.9. Additionally, we inject random noise sampled from a Gaussian distribution into the gradient as suggested in (Neelakantan et al. 2016), where the noise distribution depends on the number of training iterations. To alleviate the exploding gradient problem, we limit the magnitude of the gradient vector to 0.1 by normalization. To facilitate reproduction, we make our code publicly available.
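The three optimization details above (per-epoch learning rate decay, annealed Gaussian gradient noise, and gradient norm clipping) can be sketched in plain Python. The noise variance $\eta / (1 + t)^{\gamma}$ follows the schedule proposed by Neelakantan et al. (2016); the values of `eta`, `gamma`, and the base learning rate, as well as the order in which noise and clipping are applied, are assumptions of this sketch rather than settings reported here.

```python
import math
import random


def decayed_lr(base_lr, epoch, factor=0.9):
    """Learning rate after `epoch` epochs of multiplicative decay."""
    return base_lr * factor ** epoch


def add_gradient_noise(grad, step, eta, gamma, rng):
    """Add zero-mean Gaussian noise with variance eta / (1 + step) ** gamma."""
    sigma = math.sqrt(eta / (1 + step) ** gamma)
    return [g + rng.gauss(0.0, sigma) for g in grad]


def clip_by_norm(grad, max_norm=0.1):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


rng = random.Random(0)
grad = [3.0, 4.0]
noisy = add_gradient_noise(grad, step=10, eta=0.01, gamma=0.55, rng=rng)
clipped = clip_by_norm(noisy, max_norm=0.1)
lr = decayed_lr(base_lr=1e-3, epoch=2)
```

Clipping rescales the whole vector rather than individual entries, so the direction of the gradient is preserved.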
Table 1: Single model performance on the VQA test-dev dataset of all compared algorithms, including the variations of our algorithm. Asterisk (*) denotes the concurrent submission with this paper. Columns: Open-Ended (All, Y/N, Num, Others) and Multiple-Choice (All, Y/N, Num, Others).
Figure 5: Qualitative comparisons of attention between Ours Full and Ours SS.
Results and Analysis
To verify the effectiveness of the multi-step training and early-stopping strategies, we perform several control experiments. The baseline is the proposed model without both components, which is referred to as Ours SS. We also test the model trained with multi-step training but without the early-stopping strategy, Ours MS, and our full model, denoted by Ours FULL, which employs both strategies for training. We use the image features extracted from VGG-16 net (Simonyan and Zisserman 2015) for all the control experiments.
Table 2: Comparisons of single model performance on the VQA test-standard dataset. Asterisk (*) denotes the concurrent submission with this paper. Columns: Open-Ended (All, Y/N, Num, Others) and Multiple-Choice (All, Y/N, Num, Others).
Table 1 presents the single model performance of various approaches trained without data augmentation. The proposed algorithm, Ours FULL, outperforms the models based on multi-step attention (Kumar et al. 2016; Yang et al. 2016; Lu et al. 2016) on the VQA dataset using the same image encoder. The best result is achieved by the work based on multimodal compact bilinear pooling (Fukui et al. 2016). We believe that multimodal compact bilinear pooling can be integrated within the proposed model, but we leave it as future work.
Note that the performance gain from Ours SS to Ours FULL is about 2.3, which is significant in the VQA context. Both training schemes turn out to be useful for performance improvement. These results are interesting because multi-step training is effective even for a single-step prediction, and the early-stopping strategy improves accuracy as long as all active steps are free from overfitting. Since we schedule early-stopping based on a formula derived from a simple empirical study, the accuracy may be further improved with more sophisticated validation.
Figure 4 illustrates the training and validation accuracy of Ours FULL and Ours SS. The proposed training strategy, denoted by Ours FULL, outperforms Ours SS consistently even though they have exactly the same architecture for a single-step prediction during testing. Figure 6 presents predicted answers with attended regions. Visualization of attention shows that Ours FULL tends to focus on critical regions while Ours SS is often distracted by irrelevant objects. We believe that our training strategy helps the model learn to answer questions in appropriate steps during training, while it enables the answering unit to implicitly solve a series of subtasks given a question during testing.
Comparison with Stacked Attention Network
The architecture of our single answering unit is similar to the stacked attention network (SAN) (Yang et al. 2016). The difference lies in how to handle questions that require multiple steps to answer. As the number of required steps varies across questions and it is difficult to figure out the desirable number, SAN uses the same fixed number of steps for training and testing, which might result in the overfitting observed in Figure 3. Instead, we provide multiple supervision in our network, but expect a single answering unit (the first one, in practice) to learn the best model for evaluation in testing. The comparison between SAN and the proposed method in Table 1 clearly shows the effectiveness of our method.
Experimental Results with ResNet
We can improve the accuracy of a VQA system by using a better image encoder. Hence, we train the model with image features extracted from ResNet-101 (He et al. 2016), and this model is denoted by Ours ResNet. The model is trained with the same hyper-parameters as Ours FULL in Table 1, and evaluated on both VQA test-dev and test-standard. The results from Ours ResNet are presented in Tables 1 and 2, where we observe that ResNet improves performance substantially.
We proposed a VQA algorithm based on a recurrent deep neural network, which is trained by minimizing the joint loss from all answering units. The answering units share model parameters, while the outputs from the units in the later steps depend on the results from the ones in the earlier steps. To maximize performance, we introduced an early-stopping training strategy, where individual answering units are disregarded in training as soon as they start to overfit to the training set. Only the first answering unit is employed for inference since it is the model learned from all the answering units without overfitting. The proposed architecture achieves strong performance on the standard VQA dataset without data augmentation. We believe that our algorithm has great potential to be used as a general framework for VQA problems by replacing our answering unit with other networks.
[Andreas et al. 2016a] Andreas, J.; Rohrbach, M.; Darrell, T.; and Klein, D. 2016a. Deep compositional question answering with neural module networks. In CVPR.
[Andreas et al. 2016b] Andreas, J.; Rohrbach, M.; Darrell, T.; and Klein, D. 2016b. Learning to compose neural networks for question answering. In NAACL.
[Antol et al. 2015] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: visual question answering. In ICCV.
[Fukui et al. 2016] Fukui, A.; Park, D. H.; Yang, D.; Rohrbach, A.; Darrell, T.; and Rohrbach, M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
[Kim et al. 2016] Kim, J.-H.; Lee, S.-W.; Kwak, D.-H.; Heo, M.-O.; Kim, J.; Ha, J.-W.; and Zhang, B.-T. 2016. Multimodal residual learning for visual qa. arXiv preprint arXiv:1606.01455.
[Kim, Lee, and Lee 2016] Kim, J.; Lee, J. K.; and Lee, K. M. 2016. Deeply-recursive convolutional network for image super-resolution. In CVPR.
[Kingma and Ba 2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
[Kumar et al. 2016] Kumar, A.; Irsoy, O.; Su, J.; Bradbury, J.; English, R.; Pierce, B.; Ondruska, P.; Gulrajani, I.; and Socher, R. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.
[Lee et al. 2015] Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In AISTATS.
[Lin et al. 2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: common objects in context. In ECCV.
[Lu et al. 2016] Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2016. Hierarchical question-image co-attention for visual question answering. arXiv preprint arXiv:1606.00061.
[Ma, Lu, and Li 2016] Ma, L.; Lu, Z.; and Li, H. 2016. Learning to answer questions from image using convolutional neural network. In AAAI.
[Malinowski and Fritz 2014] Malinowski, M., and Fritz, M. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS.
[Malinowski, Rohrbach, and Fritz 2015] Malinowski, M.; Rohrbach, M.; and Fritz, M. 2015. Ask your neurons: A neural-based approach to answering questions about images. In ICCV.
[Neelakantan et al. 2016] Neelakantan, A.; Vilnis, L.; Le, Q. V.; Sutskever, I.; Kaiser, L.; Kurach, K.; and Martens, J. 2016. Adding gradient noise improves learning for very deep networks. In ICLR Workshop.
[Noh, Seo, and Han 2016] Noh, H.; Seo, P. H.; and Han, B. 2016. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR.
[Ren, Kiros, and Zemel 2015] Ren, M.; Kiros, R.; and Zemel, R. S. 2015. Exploring models and data for image question answering. In NIPS.
[Shih, Singh, and Hoiem 2016] Shih, K. J.; Singh, S.; and Hoiem, D. 2016. Where to look: Focus regions for visual question answering. In CVPR.
[Simonyan and Zisserman 2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
[Sukhbaatar et al. 2015] Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In NIPS.
[Szegedy et al. 2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR.
[Wu et al. 2016] Wu, Q.; Wang, P.; Shen, C.; Hengel, A. v. d.; and Dick, A. 2016. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR.
[Xiong, Merity, and Socher 2016] Xiong, C.; Merity, S.; and Socher, R. 2016. Dynamic memory networks for visual and textual question answering. In ICML.
[Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
[Yang et al. 2016] Yang, Z.; He, X.; Gao, J.; Deng, L.; and Smola, A. 2016. Stacked attention networks for image question answering. In CVPR.
[Zhou et al. 2015] Zhou, B.; Tian, Y.; Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2015. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167.
This document provides our implementation details and additional results that could not be accommodated in the main paper due to space limitation.
In this section, we provide implementation details for each component of the answering units described in Eq. (2) to (7) of the main paper. Without loss of generality, we describe the architecture of the $k$-th answering unit.
Subtask Module
Subtask module in Eq. (3) of the main paper generates the subtask feature from a question feature and the previous memory state. This operation is given by
where and
are weight parameters.
Attention Module
Attention module in Eq. (4) of the main paper generates the attention probability map and attended feature
based on a soft attention mechanism. The computation of
and
requires embedding of input image feature map
(with P channels and L locations), and the embedded feature map
is given by
where are weight parameters. The procedure to compute attention probability map
is composed of the following three steps. First, we compute pre-attention score
based on the embedded image feature map
and the subtask feature
by the following equation:
where and
are weight parameters, and
is one vector. Next, we compute attention score
by adding another pre-attention score extracted from the previous memory
, which is given by
where and
are weight parameters. Last, attention probability map
is given by applying softmax function to the attention score
as described below:
Once is obtained, attended feature
is computed using embedded image feature
and attention probability map
Prediction Module
Prediction module in Eq. (5) of the main paper has two submodules: LSTM and softmax classification. As specified in the main paper, we use a publicly available LSTM implementation. A single-layer LSTM is constructed, and the dimensionality of the LSTM hidden state is equal to that of the memory state. Input to the LSTM, denoted by
, is computed by
where and
are weight parameters. Answer probability
is obtained by applying a softmax function, which is given by
where are weight parameters.
Network Parameters
Network parameters used for the experiments are as follows: S = 512, Q = 1024, M = 512, P = 512, L = 196, A = 256, and C = 1000.
Figure 6 presents qualitative comparison between Ours Full and Ours SS on VQA validation dataset.
Figure 6: Qualitative comparisons of attention between Ours Full and Ours SS.