A Multimodal Target-Source Classifier with Attention Branches to Understand Ambiguous Instructions for Fetching Daily Objects
2019·arXiv
Abstract

In this study, we focus on multimodal language understanding for fetching instructions in the domestic service robots context. This task consists of predicting a target object, as instructed by the user, given an image and an unstructured sentence, such as “Bring me the yellow box (from the wooden cabinet).” This is challenging because of the ambiguity of natural language, i.e., the relevant information may be missing or there might be several candidates. To solve such a task, we propose the multimodal target-source classifier model with attention branches (MTCM-AB), which is an extension of the MTCM [1]. Our methodology uses the attention branch network (ABN) [2] to develop a multimodal attention mechanism based on linguistic and visual inputs. Experimental validation using a standard dataset showed that the MTCM-AB outperformed both state-of-the-art methods and the MTCM. In particular, the MTCM-AB accuracy was 90.1% on average while human performance was 90.3% on the PFN-PIC dataset.

The current rise in life expectancy has increasingly emphasized the need for daily care and support. Robots that can physically assist people with disabilities [3] offer a way to alleviate the shortage of home care workers. This context has boosted the need for standardized domestic service robots (DSRs) that can provide necessary support functions, as shown by [4]–[6].

However, one of the main limitations of DSRs is their inability to naturally interact through language. Specifically, most DSRs do not allow users to instruct them with various expressions relating to an object for fetching tasks. By overcoming this limitation, a user-friendly way to interact with DSRs could be provided to non-expert users.

Thus, our work focuses on multimodal language understanding for fetching instructions (MLU-FI). This task consists of predicting a target instructed in natural language, such as “Bring me the yellow box from the wooden cabinet.” MLU-FI is challenging because of the ambiguity of natural instructions, that is, relevant information may be missing, implicit or simply paraphrased. Using free-form language naturally induces ambiguity because of the many-to-many mapping between the linguistic and physical world which makes it difficult to accurately infer the user’s intention.

In this paper, we propose the multimodal target-source classifier model with attention branch (MTCM-AB), which is an extension of the MTCM proposed in [1]. Our methodology uses the attention branch network (ABN) proposed in [2]. The ABN is an image classifier, inspired by class activation mapping (CAM) [7] structures, that generates attention maps. The ABN is composed of an attention branch that infers an attention map and a perception branch that classifies images. We extended this architecture with the MTCM-AB, where several visual and linguistic attention branches are proposed to infer visual and linguistic attention maps. These multimodal attention maps enhance the model accuracy, and the attended areas of the visual input can be visualized. A video is available at this URL1.

image

image

Fig. 1: MTCM-AB overview: the attention branch network is used to improve fetching instruction comprehension

The main contributions of this paper are summarized as follows:

We propose the MTCM-AB which extends the MTCM with the ABN.

We introduce a Multimodal ABN architecture that combines both linguistic and visual attention mechanisms for MLU-FI.

We propose a visual explanation from the MTCM-AB given the input sentence and visual scene.

There have been many attempts in the field of robotics focused on grounded communication with robots (e.g., [8], [9]). Grounding a user’s intention requires linguistic inputs but also proprioceptive senses (e.g., vision) and contextual knowledge.

Similarly to many studies, we are interested in understanding fetching instructions in everyday environments. Recent studies have addressed multimodal language understanding (MLU) by using visual semantic embedding for visual grounding [1], [10]–[13], visual question answering [14] or caption generation [15]. This approach embeds the visual and linguistic features into a common latent space. In [10] the authors proposed a long short-term memory (LSTM) network to learn the probability of a referring expression, while a unified framework for referring expression generation and comprehension was introduced in [11]. Inter and intra self-attention mechanisms are explored in [16] for referring expression comprehension. In robotics, the authors of [12] developed a target prediction method from natural language in a pick-and-place task environment, with additional dialogue. Similarly [13] tackled the same kind of problem using a two-stage model to predict the likely target from the language expression and the pairwise relationships between different target candidates. In our previous work [1], we proposed to use both the target and source candidates to predict the likely target in a supervised manner.

The MTCM-AB is inspired by the ABN [2]. The ABN is based on class activation mapping (CAM) networks [7], [17]. This line of research focuses on the production of image masks that, overlaid onto an image, highlight the most salient portions with respect to some given query or task. In essence, the CAM purpose is to identify salient regions of a given label in an image classifier for visual explanation. The ABN builds visual attention maps from this approach.

Attention mechanisms have also been used differently in image and natural language processing (NLP). In the context of image captioning, the authors of [18] generated image captions with hard and soft visual attention. This approach learns the alignment between the salient area of an image and the generated sequence of words. Multiple visual attention networks were also proposed in [19] for solving visual question answering. However, most of these approaches use only a single modality for attention: visual attention. By contrast, recent studies in multimodal language understanding have shown that both linguistic and visual attention are beneficial for question-answering tasks [20], [21] or visual grounding [22], [23]. Similarly, in [24], an attention method that performs a weighted average of linguistic and image inputs is introduced. In this context, a multimodal attention branch network has been proposed in [25] for sentence generation. Unlike [25], the current study focuses on MLU-FI and adopts a different structure that is detailed in the following sections.

image

Fig. 2: Samples of the WRS-PV (left) and PFN-PIC datasets (right) where the source and target are given.

The purpose of this study is understanding fetching instructions using referring expressions. Because the MTCM-AB is an extension of the MTCM [1], the multimodal language understanding for fetching instructions (MLU-FI) task context is similar to the one defined in [1]. We summarize this task context below.

A. Task Description

Our aim is to predict a target referred to by an initial instruction among a set of candidate targets in a visual scene. Instructions are not constrained, which is more natural but increases the complexity of the comprehension task because users may use referring expressions to characterize a target. Fetching instructions based on referring expressions can take forms such as ‘Take the Kleenex box and put it in the bottom right box’ or ‘Go to the kitchen and take the tea bottle on the upper shelf’. To address the MLU-FI task, inputs and outputs similar to those in [1] are considered:

Input: A fetching instruction as a sentence in addition to an image of the scene.

Output: The most likely target-source pair.

The terms target and source are defined as follows.

Target: A daily object (e.g. bottle or snacks) that a user intends the robot to fetch

Source: The origin of the target (e.g. desk or cabinet)

The evaluation metric is the prediction accuracy based on the top-1 target prediction. This metric is commonly used in the visual semantic embedding methods for the MLU-FI [1], [12], [13] and allows comparisons on standard datasets. Additionally, unlike [12] and [13], we do not rely on dialogue systems to disambiguate the target from the candidate targets. Finally, this study does not focus on object detection. We suppose that the bounding boxes of the target and source are given in advance.

B. Task Background

The MTCM-AB is not specifically designed for a given scene or context. Although this approach could be used for various scenarios, our method is validated on two types of dataset, in real and simulated environments described below.

image

Fig. 3: Proposed method framework: the MTCM-AB utilizes three attention branches for the target (TAB), the visual context (nCAB) and the linguistic inputs. A Hinge loss embedding is used in these branches. In the main network the target and source are predicted from the attended features.

1) Home environment: In this configuration, we use a simulation-based dataset from the Partner Robot Challenge Virtual Space (WRS-PV) competition that uses SIGVerse [26]. A simulated environment allows repeated and varied tasks at a small cost compared with a real environment, which makes this choice reasonable. WRS-PV depicts home environments as represented in Fig. 2. The three-dimensional environments (Unity-based) are augmented to make them more realistic. In this environment, the target DSR, namely the Human Support Robot (HSR), is able to freely navigate and manipulate objects. In this context, the MTCM-AB predicts the most likely target among several candidates for instructions such as “Go to the bedroom to get me the pink toy on the wooden wagon”.

2) Pick-and-place scene: Additionally, the standard dataset PFN-PIC [12] for multimodal language understanding is used. This dataset was designed for pick-and-place tasks performed by a robot arm with a top-view camera. The scene consists of four boxes, in which several candidate targets (up to 40) are randomly placed. Each target is annotated with pick-and-place instructions such as “Move the red bottle near the coke can to the top left box”. Note that only the picking instruction comprehension is solved in this study.

A. Attention Branch Network

Our method consists of target prediction with respect to an instruction in natural language. We extend the MTCM [1] with attention mechanisms that are used to improve the prediction from the linguistic and visual inputs. These attention mechanisms are inspired by the ABN, which was initially proposed in [2]. In the ABN approach, the CAM is extended to produce an attention mask for improving image classification. However, instead of directly inserting the attention network into the classifier, the ABN is decomposed into parallel branches to avoid deteriorating the classifier accuracy. The branches refer to the following sub-components of the ABN:

1) an attention branch that produces attention maps.

2) a perception branch that predicts the likelihood of some label.

Both the attention and perception branches of the ABN are classifiers. The attention maps are derived from the predicted label in the attention branch. Hence, this type of attention is built in a supervised manner. As an extension of CAM networks, the ABN also allows visual explanation when extracting the attention maps. Such a feature is particularly desirable for qualitatively validating this approach. In this study, we extend the MTCM to the MTCM-AB by introducing the ABN to multimodal language understanding as detailed below. Similar to the MTCM, the target and source are predicted from the full sentence. The training procedure is outlined in Algorithm 1.

B. Novelty

We propose the MTCM-AB (see Fig. 3) which can focus on the relevant expressions in a sentence and their visual representation by using attention branch mechanisms. The MTCM-AB has the following characteristics.

The MTCM-AB extends the MTCM, with attention branches to focus on the salient part of linguistic and visual inputs for referring expression comprehension.

We introduce several attention branches for linguistic and visual features. A neighboring context branch allows the network to predict a given target candidate based on the neighbor landmarks.

A visual explanation is produced from the attention maps generated by the MTCM-AB.

C. MTCM-AB

1) Inputs: The MTCM-AB takes as input visual and linguistic features in a similar manner to most visual semantic embedding methods. We assume that for each target candidate $i \in \{1, \dots, N\}$ and source $i' \in \{1, \dots, M\}$, their respective cropped images and positions are made available. Given a target candidate, the set of inputs $x(i)$ is defined as:

$$x(i) = \{x_l(i), x_t(i), x_c(i), x_r(i)\},$$

where $x_l(i)$, $x_t(i)$, $x_c(i)$ and $x_r(i)$ denote linguistic, target, context and relation features. We purposefully omit index $i$ in the following, that is, $x(i)$ is then written as $x$, when further clarity is not necessary.

Visual input $x_t$ is defined as the cropped image of the target, and $x_c$ is a cropped image that characterizes a target and its neighborhood. The latter input $x_c$ is more thoroughly defined in the next section. Linguistic input $x_l$ consists of sub-word vector embedding, whereas input $x_r$ is a vector characterizing the position of the target candidate in the environment (e.g., other target candidates, location in the scene, location with respect to the source).

2) Linguistic Attention Branch: The purpose of the linguistic attention branch (LAB) is to emphasize the most salient part of the linguistic features for instruction comprehension. Similar to our previous approach [1], BERT [27] is used in the MTCM-AB for sub-word embedding. The BERT model uses Transformers [28] and a sub-word masking system for language embedding. In [1], BERT is reported to be better than simple embedding vectors. From the sub-word embedding, a multi-layer Bi-LSTM network is used to obtain a latent space representation of the linguistic features. The last hidden states of each layer are concatenated to form linguistic feature maps $f_l$, from which a linguistic attention mask is extracted using the same principle as the ABN. Feature maps $f_l$ are processed through one-dimensional convolutional layers followed by a single fully connected (FC) layer. Linguistic attention map $a_l$ is obtained from the output of the second convolutional layer, which is convolved with an additional layer and normalized by a sigmoid activation function. This attention map selectively focuses on an area of the LSTM final state that also encodes all the previous states. The output linguistic feature maps are then obtained using a masking process given by:

image
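As a rough illustration of the masking step, the sketch below normalizes raw attention scores with a sigmoid and multiplies them element-wise with the feature values. The masking equation itself is not reproduced in this text, so the element-wise product is an assumption, as are the names `sigmoid` and `apply_attention` and the flat-list representation of a feature map.

```python
import math


def sigmoid(value: float) -> float:
    """Logistic function used to squash attention scores into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def apply_attention(features: list[float], scores: list[float]) -> list[float]:
    """Mask features with a sigmoid-normalized attention map.

    Assumption: the mask is applied by element-wise multiplication.
    `features` stands in for a feature map such as f_l, and `scores`
    for the raw (pre-sigmoid) output of the attention branch.
    """
    if len(features) != len(scores):
        raise ValueError("features and scores must have the same length")
    attention = [sigmoid(s) for s in scores]
    return [a * f for a, f in zip(attention, features)]


if __name__ == "__main__":
    # A score of 0 gives an attention weight of 0.5.
    print(apply_attention([2.0, 4.0], [0.0, 0.0]))
```

A strongly negative score drives the corresponding feature toward zero, while a strongly positive score leaves it nearly unchanged.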

The TAB also optimizes a loss function $J_t$ to predict the attention mask.

image

The loss function of the nCAB is denoted $J_c$.

5) Perception Branch: The perception branch follows a classic visual semantic embedding structure. Indeed, the visual, linguistic and relation features are encoded to share a common latent space. A visual multi-layer perceptron (MLP) encodes the concatenation of $o_t$, $o_v$ and $x_r$. In parallel, a linguistic MLP encodes linguistic features $o_l$. Both MLP outputs are used to compute the embedding loss defined in the following section. The source is predicted as $y_{src}$ from a third MLP that combines the two previous MLP outputs. To predict the correct source, a cross-entropy loss $J_{src}$ is used:

image

where $y^*_{nm}$ denotes the label given to the m-th dimension of the n-th sample, and $y_{nm}$ denotes its prediction.
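For readers who want a concrete reference point, the following sketch computes a cross-entropy over samples $n$ and dimensions $m$ using the label and prediction notation above. It assumes the standard unnormalized form $-\sum_n \sum_m y^*_{nm} \log y_{nm}$; whether the original loss is averaged over samples is not stated here. The function name `source_cross_entropy` and the small `eps` term are illustrative additions.

```python
import math


def source_cross_entropy(
    labels: list[list[float]],
    predictions: list[list[float]],
    eps: float = 1e-12,
) -> float:
    """Sum of -y*_nm * log(y_nm) over samples n and dimensions m.

    Assumption: standard cross-entropy without averaging.
    `eps` (hypothetical) avoids log(0) for zero-probability predictions.
    """
    total = 0.0
    for y_true, y_pred in zip(labels, predictions):
        for target, prob in zip(y_true, y_pred):
            total -= target * math.log(prob + eps)
    return total


if __name__ == "__main__":
    # One sample, two candidate sources, uniform softmax output.
    print(source_cross_entropy([[1.0, 0.0]], [[0.5, 0.5]]))
```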

6) Loss functions: The MTCM-AB is trained by minimizing several embedding loss functions related to the different branches. All of them are based on a Hinge loss model. This loss function consists of increasing the similarity between appropriate pairs of linguistic and non-linguistic features while decreasing the similarity between inappropriate ones. The network then minimizes the global loss function $J_{total} = \lambda_c J_c + \lambda_t J_t + \lambda_l J_l + \lambda_p J_p + \lambda_{src} J_{src}$. The parameters $\lambda_i$ are loss weights that are defined in the experimental section. In the perception branch, loss function $J_p$ is expressed as a triplet Hinge loss

image

where $\lambda_M$ is the margin, and $f(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity). Functions $g_1(\cdot)$ and $g_2(\cdot)$ are the networks related to the linguistic and non-linguistic features, respectively. The incorrect linguistic and non-linguistic features are extracted from two random candidate targets $j$ and $k$ in the same image as the current target $i$. Formally, $j$ and $k$ are sampled from $\{1, \dots, T\}$, where $j \neq i$ and $k \neq i$.
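The exact expression of $J_p$ is not reproduced in this text. As a hedged illustration, the sketch below implements one common triplet hinge formulation with cosine similarity: the matching linguistic/non-linguistic pair should be more similar, by at least the margin, than pairs formed with a wrong non-linguistic or a wrong linguistic embedding. The inputs are assumed to be the outputs of $g_1$ and $g_2$; the function names and the default margin of 0.1 are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity f(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_hinge_loss(
    ling: list[float],
    vis: list[float],
    neg_ling: list[float],
    neg_vis: list[float],
    margin: float = 0.1,  # hypothetical value for the margin lambda_M
) -> float:
    """One common triplet hinge form (an assumption, not the paper's equation).

    ling / vis: embeddings of the correct pair (outputs of g1 and g2).
    neg_ling / neg_vis: embeddings taken from two other candidate targets.
    """
    positive = cosine_similarity(ling, vis)
    wrong_visual = max(0.0, margin + cosine_similarity(ling, neg_vis) - positive)
    wrong_linguistic = max(0.0, margin + cosine_similarity(neg_ling, vis) - positive)
    return wrong_visual + wrong_linguistic


if __name__ == "__main__":
    # Well-separated embeddings incur no loss.
    print(triplet_hinge_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]))
```

The loss is zero once both negative pairs are less similar than the positive pair by at least the margin, so training effort goes to the triplets that are still confused.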

With $J$ as a generic notation for $J_l$, $J_t$ and $J_c$, their respective Hinge loss function is characterized by:

image
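The global objective $J_{total}$ defined above is a weighted sum of the branch losses. The sketch below shows that combination, using the weights reported in the experimental setup (all set to one, except 0.1 for the source loss). The dictionary keys and the function name `total_loss` are illustrative choices, and the loss values in the example are made up.

```python
# Loss weights as reported in the experimental setup:
# all set to one, except the source loss weight set to 0.1.
LOSS_WEIGHTS = {"c": 1.0, "t": 1.0, "l": 1.0, "p": 1.0, "src": 0.1}


def total_loss(losses: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum J_total = sum_i lambda_i * J_i over all branch losses."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight defined for losses: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


if __name__ == "__main__":
    # Placeholder loss values, for illustration only.
    example = {"c": 1.0, "t": 1.0, "l": 1.0, "p": 1.0, "src": 1.0}
    print(total_loss(example, LOSS_WEIGHTS))
```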

A. Dataset

1) The PFN-PIC dataset: In this experiment we evaluated our approach on the PFN-PIC dataset [12], which allowed us to compare the MTCM-AB to other proposed methods (e.g., [1]). PFN-PIC contains 89,891 sentences in the training set and 898 sentences in the validation set to instruct 25,861 targets in the training set and 532 targets in the validation set (see Fig. 2). The sentences, on average 14.3 words long, were given by three different annotators.

2) The WRS-PV dataset: In the second phase, we evaluated the MTCM-AB on a simulation-based dataset: WRS-PV (see Fig. 2). We used the same dataset collected in [1] with additional instructions. The dataset is composed of 308 images, for which 2015 instructions in the training set and 74 instructions in the validation set are provided. The dataset was labelled by two annotators. This dataset has an average of 3.4 targets per image, and 10.7 words for each instruction.

TABLE I: Parameter settings and structures of MTCM-AB

image

B. Experimental Setup

The experimental setup is summarized in Table I. The instructions were pre-processed and tokenized using the 24-layer pre-trained BERT model. Each sub-word, obtained by a word-piece model, was embedded into a 1024-dimensional vector. The feature extractor and CNN shown in Fig. 3 were both based on ResNet [29]. The input images were downscaled to $299 \times 299$ before being processed in ResNet-50. Feature maps $f_c$ were extracted inside the 4th block of ResNet, while features $f_t$ were obtained from the last convolutional layer of the network. For each MLP, we applied batch normalization and a ReLU activation function for three layers, except for the source prediction, for which a softmax function was used on the last layer. In the LAB, we used a three-layer Bi-LSTM with 1024-size cells. Feature maps $f_l$ were processed in three convolutional layers. In the TAB, four fully-connected layers were used, whereas the nCAB was based on four convolutional layers. The different loss weights were all set to one, except the source loss weight, which was set to 0.1. We also introduced a parameter $\delta_c$ that represents the variation in size of the context input $x_c$. Specifically, the size of $x_c$ corresponds to the size of the cropped target image, to which $\delta_c$ is added in width and height. Considering the size of the input image, we selected $\delta_c \in \{0, 25, 50, 75, 100, 125, 150\}$ pixels during the experiments.
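To make the role of $\delta_c$ concrete, the sketch below enlarges a target bounding box by $\delta_c$ pixels in width and height to obtain the context crop region. The text does not specify how the extension is distributed, so the sketch assumes it is split evenly around the target and clipped to the image borders. The function name `context_box`, the `(x_min, y_min, x_max, y_max)` box convention and the image size used in the example are all hypothetical.

```python
Box = tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), in pixels


def context_box(target_box: Box, delta_c: int, image_size: tuple[int, int]) -> Box:
    """Enlarge the target box by delta_c pixels in width and height.

    Assumptions: the extension is split evenly on both sides of the
    target and the result is clipped to the image boundaries.
    """
    x_min, y_min, x_max, y_max = target_box
    width, height = image_size
    before = delta_c // 2        # pixels added on the left / top
    after = delta_c - before     # pixels added on the right / bottom
    return (
        max(0, x_min - before),
        max(0, y_min - before),
        min(width, x_max + after),
        min(height, y_max + after),
    )


if __name__ == "__main__":
    # Hypothetical 640x480 image and a 50x60 target box.
    for delta in (0, 50, 150):
        print(delta, context_box((100, 100, 150, 160), delta, (640, 480)))
```

With $\delta_c = 0$ the context crop coincides with the target crop, which corresponds to the smallest setting explored in the experiments.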

The MTCM-AB had 27.5 M parameters and was trained on an RTX 2080Ti with 12 GB of GPU memory, 64 GB RAM and an Intel Core i9 3.6 GHz processor. The results were reported after 100 epochs, when the training loss was reduced by approx. 95%. With this setup, it took approx. five hours on average to train the MTCM-AB on the PFN-PIC dataset. This time was decreased to approx. three and a half hours by using a bigger batch size with two Tesla V100 GPUs.

The same parameters as in Table I were used to evaluate the WRS-PV dataset, except that the learning rate was decreased to $5 \times 10^{-5}$ and the batch size was set to 64.

C. Quantitative Results

The quantitative results are presented in Table II. The top-1 accuracy is the evaluation metric as defined in [11]. This evaluation metric corresponds to the accuracy obtained for the most likely target predicted (e.g., the one with the highest cosine similarity), given an instruction. Quantitative results of the baseline methods [12] and [1] are also reported for comparison. Except for the method of [12], the average accuracy and standard deviation are provided for five trials. In the case of the MTCM-AB, we report additional results in Table III with varying sizes of $x_c$, obtained by varying $\delta_c$ from 0 to 150 pixels. On PFN-PIC, the MTCM-AB outperformed the MTCM and the baseline method [12] by 1.3% and 2.1%, respectively. The results on WRS-PV corroborated the trend observed on the PFN-PIC dataset. The MTCM-AB attention mechanism improved the target prediction accuracy by 2.5% on average.
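The metric described above can be sketched as follows: for each instruction, the candidate with the highest similarity score is taken as the prediction and compared with the ground-truth index, and the accuracies of several trials are then summarized by their mean and standard deviation. The function names and the toy scores are illustrative, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def top1_accuracy(similarities: list[list[float]], ground_truth: list[int]) -> float:
    """Fraction of samples whose highest-scoring candidate is the ground truth.

    similarities[n][i] is the similarity between instruction n and
    candidate target i; ground_truth[n] is the correct candidate index.
    """
    correct = 0
    for scores, target in zip(similarities, ground_truth):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == target)
    return correct / len(ground_truth)


def summarize_trials(accuracies: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over independent trials (assumed estimator)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


if __name__ == "__main__":
    # Toy scores for three instructions with two candidates each.
    scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
    print(top1_accuracy(scores, [1, 0, 0]))
```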

Furthermore, we conducted an evaluation with five test subjects. This evaluation was performed by randomly selecting 200 samples of the validation set of PFN-PIC and the full validation set of WRS-PV. For each sample, each test subject was given the instruction and had to select the most likely target among a set of candidate targets in the same image. The accuracy of these experiments is reported in Table II in the human performance row. From these evaluations, it appears that the human-level accuracy on the PFN-PIC dataset, which can be considered an upper bound, is 90.3% on average with a standard deviation of 2.01%. Thus, on the PFN-PIC dataset there is no statistically significant difference between human performance and the MTCM-AB. In the case of the WRS-PV dataset, the human performance was 94.3% on average with a standard deviation of 3.98%. Although test subjects performed with higher accuracy, the difference in performance with the MTCM-AB remains reasonable considering the dataset size.

image

Fig. 4: Qualitative results of the MTCM-AB. In the first row the prediction is given in blue while the ground truth is in green. The attended region of each context feature $x_c$ is given in the second row. The first two columns refer to correct predictions. The third column refers to an erroneous prediction (i.e. wrongly attended target), while the last column refers to an erroneous prediction due to an incorrect ground truth (“brown pack” is instructed but “can” is given the label).

(a): “Take the blue sandal move it to lower left box”. (b): “Take the green item next to the pair of white gloves and move it to top left box”. (c): “Move the grey colored bottle at top left to box below”. (d): “Pick the brown pack and put it in lower left box”.

TABLE II: Top-1 accuracy on PFN-PIC and WRS-PV datasets

image

TABLE III: MTCM-AB accuracy with respect to $\delta_c$

image

As reported in Table III, the analysis of $\delta_c$ showed that increasing the size of the neighboring context from 50 to 150 pixels did not improve the accuracy for the PFN-PIC dataset. Although increasing $\delta_c$ might be expected to capture more information about the context, we believe that it also increases the probability of attending to an erroneous region. Indeed, the PFN-PIC dataset is highly cluttered with relatively small objects, which require accurate attention maps. This is exemplified by $\delta_c \in \{125, 150\}$, which yields a lower accuracy than the configuration with $\delta_c = 0$. Regarding the WRS-PV dataset, the optimal value was found at 75 pixels. As the dataset is less cluttered, setting $\delta_c > 0$ always improves the accuracy.

To characterize the contribution of each attention branch, we also report the results of an ablation study in Table IV for the PFN-PIC dataset. These results clearly support the contribution of the attention branches and the novelty of our approach. Both linguistic and visual attention branches improved the prediction accuracy compared to the MTCM.

TABLE IV: Ablation study of the MTCM-AB for PFN-PIC dataset

image

image

Fig. 5: Qualitative results on WRS-PV dataset. The predicted target is reported in blue while the ground truth is in green. The attention maps are reported on the second row.

(a) Bring me the yellow bottle from the bed with orange sheets. (b) Take a coffee cup on the right-hand table. (c) Give me the pink cup on the lower row of the shelf (d) Take an apple on the same shelf that a coffee cup is placed.

D. Qualitative Results

1) PFN-PIC qualitative results: The qualitative results shown in Fig. 4 illustrate predictions for given fetching instructions. Each column of the first row represents the MTCM-AB results for different scenarios. The second row illustrates the attention map of each sample given the context feature. Subfigures (a) and (b) show correct predictions. The attended regions show that the MTCM-AB focuses on the instructed target. The third column illustrates an erroneous case. Here the attention focused on the incorrect candidate target, leading to a wrong prediction. The last column is also an erroneous prediction. However, in this case, the ground truth label is incorrect. The target is not a “brown pack” as specified in the instruction, but a “can”. Therefore, it is reasonable that the MTCM-AB did not attend to the “can”.

2) WRS-PV qualitative results: Qualitative results on the WRS-PV dataset (see Fig. 5) can be analyzed in the same way. The first three samples illustrate correct predictions with consistent attended regions. The last sample, with the instruction “Take an apple on the same shelf that a coffee cup is placed”, is erroneous: our model predicts the cup instead of the apple. The attention map shows that, unlike the cup, the apple is not attended. This kind of mistake may be caused by the linguistic understanding, since both the apple and cup are mentioned in the instruction.

E. Error Analysis

A thorough analysis of the MTCM-AB results allows us to characterize the different failure cases of our approach for the PFN-PIC dataset. We categorize the erroneous predictions based on main cause as follows:

ES (erroneous sentence): the ground truth does not correspond to the target specified in the instruction (see Fig. 4-d)

NE (negation): the ground truth is specified with a negation, which is considered difficult to handle in the NLP community; e.g., “grab the glue not yellow and move...” and the predicted target is the yellow glue. In such a situation, the linguistic encoding should first be able to extract the negation features. Then the network should interpret “not yellow”, which covers a wide range of different color characteristics (green, blue,...).

REL (relation to landmark): the ground truth is defined with respect to landmark objects and the predicted target position is incorrect with respect to this landmark; e.g., “move the rectangular card next to the red ketchup bottle...” and the predicted target is far from the ketchup.

RES (relation to source): the ground truth target is defined with respect to a source and the predicted target position is incorrect with respect to this source; e.g., “grab the black object in the middle right of the bottom left box...” and the predicted target is in the upper left corner of the bottom left box.

SE (source error): the instruction specifies a given source and the predicted target position is in a different source; e.g., “grab small white plastic container on lower right box...” and the predicted target is in the upper right box.

SU (sentence understanding): the predicted target characteristics are different from the specified ground truth characteristics; e.g., “grab thin orange and black box...” and the predicted target is a large orange and black box.

TR (target recognition): the predicted target is visually similar to the ground truth target but incorrect; e.g., in Fig. 4-b, a sample predicted the red tube instead of the coke can that looks very similar.

O (others): this category includes a case where the instruction contains words that rarely appear in the training samples; e.g., the word “barrel” is in the instruction while it appears only once in the training set. A meaningful encoding of this word is complex for the network.

TABLE V: Categorization of the erroneous predictions of the MTCM-AB on the PFN-PIC dataset

image

The results are reported in Table V. Although the MTCM-AB is superior to the MTCM, its major errors are related to referring expressions based on landmarks, sentence understanding and wrongly recognized targets. In particular, the parameter $\delta_c$ may affect the error rate of REL and needs careful tuning. Some improvement in the linguistic encoding may also be envisioned to increase the accuracy. Drawing inspiration from recent methods for natural language processing (NLP), transformer-based models for language understanding such as XLNet [30] represent a path of improvement over our current Bi-LSTM model. Likewise, the TR error rate may be decreased with a more sophisticated image classifier than ResNet-50.

In a context of high demand for responsive domestic service robots, we proposed the MTCM-AB which predicts the likelihood of targets to fetch given an instruction and a visual input. The following contributions of this study can be emphasized:

The MTCM-AB extends the MTCM with attention branches for linguistic and visual features. Our method achieves higher accuracy than the MTCM. In fact, our method is close to human-level accuracy on a standard dataset.

We showed that multimodal attention achieves higher accuracy than monomodal attention on linguistic or visual inputs.

We qualitatively validated the MTCM-AB performance by showing that the instructed target was attended in the visual scene.

In future work, we plan to address the multimodal language understanding with only dense features (i.e., no heuristic inputs such as relation features) by improving the attention mechanism. Similarly, we intend to develop a word-level attention mechanism to visualize the effect of linguistic attention on ambiguous instructions.

This work was partially supported by JST CREST, SCOPE and NEDO.

[1] A. Magassouba, K. Sugiura, A. Trinh Quoc, and H. Kawai, “Understanding Natural Language Instructions for Fetching Daily Objects Using GAN-Based Multimodal Target-Source Classification,” IEEE RA-L, vol. 4, no. 4, pp. 3884–3891, 2019.

[2] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Attention Branch Network: Learning of Attention Mechanism for Visual Explanation,” in CVPR, 2019, pp. 10705–10714.

[3] S. W. Brose, D. J. Weber, et al., “The role of assistive robotics in the lives of persons with disability,” American Journal of Physical Medicine & Rehabilitation, vol. 89, no. 6, pp. 509–521, 2010.

[4] L. Piyathilaka and S. Kodagoda, “Human Activity Recognition for Domestic Robots,” in Field and Service Robotics, 2015, pp. 395–408.

[5] C.-A. Smarr, Mitzner, et al., “Domestic Robots for Older Adults: Attitudes, Preferences, and Potential,” International Journal of Social Robotics, vol. 6, no. 2, pp. 229–247, 2014.

[6] L. Iocchi, D. Holz, J. Ruiz-del Solar, K. Sugiura, and T. Van Der Zant, “RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots,” Artificial Intelligence, vol. 229, pp. 258–281, 2015.

[7] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in CVPR, 2016.

[8] A. Magassouba, K. Sugiura, and H. Kawai, “A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks From Ambiguous Language Instructions,” IEEE RA-L, vol. 3, no. 4, pp. 3113–3120, 2018.

[9] V. Cohen, B. Burchfiel, T. Nguyen, N. Gopalan, S. Tellex, and G. Konidaris, “Grounding language attributes to objects using bayesian eigenobjects,” in IEEE IROS, 2019.

[10] V. K. Nagaraja, V. I. Morariu, and L. S. Davis, “Modeling Context between Objects for Referring Expression Understanding,” in ECCV, 2016, pp. 792–807.

[11] L. Yu, H. Tan, M. Bansal, and T. L. Berg, “A Joint Speaker-Listener-Reinforcer Model for Referring Expressions,” in CVPR, vol. 2, 2017.

[12] J. Hatori et al., “Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions,” in IEEE ICRA, 2018, pp. 3774–3781.

[13] M. Shridhar and D. Hsu, “Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction,” in RSS, 2018.

[14] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in ICCV, 2015, pp. 2425–2433.

[15] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015, pp. 3156–3164.

[16] Z. Yu, Y. Cui, J. Yu, D. Tao, and Q. Tian, “Multimodal Unified Attention Networks for Vision-and-Language Interactions,” arXiv preprint arXiv:1908.04107, 2019.

[17] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization,” in ICCV, 2017.

[18] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 2048–2057.

[19] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked Attention Networks for Image Question Answering,” in CVPR, 2016, pp. 21–29.

[20] D.-K. Nguyen and T. Okatani, “Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering,” in CVPR, 2018, pp. 6087–6096.

[21] J. Lei, L. Yu, T. Berg, and M. Bansal, “TVQA+: Spatio-Temporal Grounding for Video Question Answering,” arXiv preprint arXiv:1904.11574, 2019.

[22] H. Akbari, S. Karaman, S. Bhargava, B. Chen, C. Vondrick, and S.-F. Chang, “Multi-level multimodal common semantic space for image-phrase grounding,” in CVPR, 2019, pp. 12476–12486.

[23] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg, “Mattnet: Modular attention network for referring expression comprehension,” in CVPR, 2018, pp. 1307–1315.

[24] C. Hori, T. Hori, T. K. Marks, and J. R. Hershey, “Early and late integration of audio features for automatic video description,” in IEEE ASRU, 2017, pp. 430–436.

[25] A. Magassouba, K. Sugiura, and H. Kawai, “Multimodal Attention Branch Network for Perspective-Free Sentence Generation,” Conference on Robot Learning (CoRL), 2019.

[26] T. Inamura, J. T. C. Tan, K. Sugiura, T. Nagai, and H. Okada, “Development of RoboCup@Home Simulation Towards Long-term Large Scale HRI,” in Robot Soccer World Cup. Springer, 2013, pp. 672–680.

[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019, pp. 4171–4186.

[28] A. Vaswani, Shazeer, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE CVPR, 2016, pp. 770–778.

[30] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” arXiv preprint arXiv:1906.08237, 2019.

