3b:[["$","audio",null,{"id":"tts"}],["$","$L40",null,{"paperID":"1908.08530","publisher":"arxiv","paperJSON":{"title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations","paperID":"1908.08530","avgLineHeight":10.96,"imgScale":4,"sections":[{"heading":"ABSTRACT","paragraphs":[[{"text":"$41","element":"span"},{"href":"https://github.com/jackroos/VL-BERT","text":"https://github.com/jackroos/VL-BERT","element":"a"},{"text":".","element":"span"}]]},{"heading":"1 INTRODUCTION","paragraphs":[[{"text":"Pre-training of generic feature representations applicable to a variety of tasks in a domain is a hallmark of the success of deep networks. Firstly in computer vision, backbone networks designed for and pre-trained on ImageNet ","element":"span"},{"href":"#id-0","referenceIndex":5,"text":"(Deng et al., ","element":"a"},{"href":"#id-0","referenceIndex":5,"text":"2009) ","element":"a"},{"text":"classification are found to be effective for improving numerous image recognition tasks. 
Recently in natural language processing (NLP), Transformer networks ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"(Vaswani et al., ","element":"a"},{"href":"#id-1","referenceIndex":38,"text":"2017) ","element":"a"},{"text":"pre-trained with “masked language model” (MLM) objective ","element":"span"},{"href":"#id-2","referenceIndex":6,"text":"(De- ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"vlin et al., ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"2018) ","element":"a"},{"text":"on large language corpus excel at a variety of NLP tasks.","element":"span"}],[{"text":"Meanwhile, for tasks at the intersection of vision and language, such as image captioning ","element":"span"},{"href":"#id-3","referenceIndex":43,"text":"(Young ","element":"a"},{"href":"#id-3","referenceIndex":43,"text":"et al., ","element":"a"},{"href":"#id-3","referenceIndex":43,"text":"2014; ","element":"a"},{"href":"#id-4","referenceIndex":4,"text":"Chen et al., ","element":"a"},{"href":"#id-4","referenceIndex":4,"text":"2015; ","element":"a"},{"href":"#id-5","referenceIndex":34,"text":"Sharma et al., ","element":"a"},{"href":"#id-5","referenceIndex":34,"text":"2018)","element":"a"},{"text":", visual question answering (VQA) ","element":"span"},{"href":"#id-6","referenceIndex":3,"text":"(Antol et al., ","element":"a"},{"href":"#id-6","referenceIndex":3,"text":"2015; ","element":"a"},{"href":"#id-7","referenceIndex":16,"text":"Johnson et al., ","element":"a"},{"href":"#id-7","referenceIndex":16,"text":"2017; ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"Goyal et al., ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"2017; ","element":"a"},{"href":"#id-9","referenceIndex":15,"text":"Hudson & Manning, ","element":"a"},{"href":"#id-9","referenceIndex":15,"text":"2019)","element":"a"},{"text":", visual commonsense reasoning (VCR) ","element":"span"},{"href":"#id-10","referenceIndex":45,"text":"(Zellers et al., 
","element":"a"},{"href":"#id-10","referenceIndex":45,"text":"2019; ","element":"a"},{"href":"#id-11","referenceIndex":8,"text":"Gao et al., ","element":"a"},{"href":"#id-11","referenceIndex":8,"text":"2019)","element":"a"},{"text":", there lacks such pre-trained generic feature representations. The previous practice is to combine base networks pre-trained for image recognition and NLP respectively in a task-specific way. The task-specific model is directly finetuned for the specific target task, without any generic visual-linguistic pre-training. The task-specific model may well suffer from overfitting when the data for the target task is scarce. Also, due to the task-specific model design, it is difficult to benefit from pre-training, where the pre-training task may well be different from the target. There lacks a common ground for studying the feature design and pre-training of visual-linguistic tasks in general.","element":"span"}],[{"text":"In the various network architectures designed for different visual-linguistic tasks, a key goal is to effectively aggregate the multi-modal information in both the visual and linguistic domains. For example, to pick the right answer in the VQA task, the network should empower integrating linguistic information from the question and the answers, and aggregating visual information from the input image, together with aligning the linguistic meanings with the visual clues. 
Thus, we seek to derive generic representations that can effectively aggregate and align visual and linguistic information.","element":"span"}],[{"text":"In the meantime, we see the successful application of Transformer attention ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"(Vaswani et al., ","element":"a"},{"href":"#id-1","referenceIndex":38,"text":"2017) ","element":"a"},{"text":"in NLP, together with its MLM-based pre-training technique in BERT ","element":"span"},{"href":"#id-2","referenceIndex":6,"text":"(Devlin et al., ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"2018)","element":"a"},{"text":". The attention module is powerful and flexible in aggregating and aligning word embedded features in sentences, while the pre-training in BERT further enhances the capability.","element":"span"}],[{"text":"Inspired by that, we developed VL-BERT, a pre-trainable generic representation for visual-linguistic tasks, as shown in Figure ","element":"span"},{"href":"#id-12","text":"1. ","element":"a"},{"text":"The backbone of VL-BERT is a (multi-modal) Transformer attention module taking both visual and linguistic embedded features as input. In it, each element is either a word from the input sentence or a region-of-interest (RoI) from the input image, together with certain special elements to disambiguate different input formats. Each element can adaptively aggregate information from all the other elements according to the compatibility defined on their contents, positions, categories, etc. 
The content features of a word / an RoI are domain specific (WordPiece embeddings ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"(Wu et al., ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2016) ","element":"a"},{"text":"as word features, Fast R-CNN ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"(Girshick, ","element":"a"},{"href":"#id-14","referenceIndex":9,"text":"2015) ","element":"a"},{"text":"features for RoIs). By stacking multiple layers of multi-modal Transformer attention modules, the derived representation gains a rich capability for aggregating and aligning visual-linguistic clues. Task-specific branches can be added on top for specific visual-linguistic tasks.","element":"span"}],[{"text":"To better exploit the generic representation, we pre-train VL-BERT on both a large visual-linguistic corpus and text-only datasets","element":"span"},{"text":"1","element":"span"},{"text":". The pre-training loss on the visual-linguistic corpus is incurred via predicting randomly masked words or RoIs. Such pre-training sharpens the capability of VL-BERT in aggregating and aligning visual-linguistic clues. The loss on the text-only corpus is the standard MLM loss in BERT, which improves generalization to long and complex sentences.","element":"span"}],[{"text":"Comprehensive empirical evidence demonstrates that the proposed VL-BERT achieves state-of-the-art performance on various downstream visual-linguistic tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. 
In particular, we achieved first place among single models on the leaderboard of visual commonsense reasoning.","element":"span"}]]},{"heading":"2 RELATED WORK","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Pre-training for Computer Vision ","element":"span"},{"text":"Prior to the era of deep networks, sharing features among different tasks and improving them via pre-training was far from mature. The models for various computer vision tasks were too diverse in design to yield a generic representation. With the success of AlexNet ","element":"span"},{"href":"#id-15","referenceIndex":21,"text":"(Krizhevsky et al., ","element":"a"},{"href":"#id-15","referenceIndex":21,"text":"2012) ","element":"a"},{"text":"in ImageNet ","element":"span"},{"href":"#id-0","referenceIndex":5,"text":"(Deng et al., ","element":"a"},{"href":"#id-0","referenceIndex":5,"text":"2009) ","element":"a"},{"text":"classification, we saw a renaissance of convolutional neural networks (CNNs) in the vision community. Soon after that, researchers found that ImageNet pre-trained CNNs can serve well as a generic feature representation for various downstream tasks ","element":"span"},{"href":"#id-16","referenceIndex":7,"text":"(Donahue et al., ","element":"a"},{"href":"#id-16","referenceIndex":7,"text":"2014)","element":"a"},{"text":", such as object detection ","element":"span"},{"href":"#id-17","referenceIndex":10,"text":"(Girshick et al., ","element":"a"},{"href":"#id-17","referenceIndex":10,"text":"2014)","element":"a"},{"text":", semantic segmentation ","element":"span"},{"href":"#id-18","referenceIndex":27,"text":"(Long et al., ","element":"a"},{"href":"#id-18","referenceIndex":27,"text":"2015)","element":"a"},{"text":", and instance segmentation ","element":"span"},{"href":"#id-19","referenceIndex":12,"text":"(Hariharan et al., ","element":"a"},{"href":"#id-19","referenceIndex":12,"text":"2014)","element":"a"},{"text":". 
Improvements in backbone networks for ImageNet classification further improve the downstream tasks. Recently, there has been research on directly training CNNs from scratch on massive-scale target datasets, without ImageNet pre-training ","element":"span"},{"href":"#id-20","referenceIndex":13,"text":"(He et al., ","element":"a"},{"href":"#id-20","referenceIndex":13,"text":"2018)","element":"a"},{"text":". They achieved performance on par with models that use ImageNet pre-training. However, they also note that pre-training on a proper massive dataset is vital for improving performance on target tasks with scarce data.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pre-training for Natural Language Processing (NLP) ","element":"span"},{"text":"It is interesting to note that the development of pre-training techniques in NLP lags well behind that in computer vision. Earlier research works improved word embeddings ","element":"span"},{"href":"#id-21","referenceIndex":29,"text":"(Mikolov et al., ","element":"a"},{"href":"#id-21","referenceIndex":29,"text":"2013; ","element":"a"},{"href":"#id-22","referenceIndex":30,"text":"Pennington et al., ","element":"a"},{"href":"#id-22","referenceIndex":30,"text":"2014; ","element":"a"},{"href":"#id-23","referenceIndex":19,"text":"Kiros et al., ","element":"a"},{"href":"#id-23","referenceIndex":19,"text":"2015)","element":"a"},{"text":", which are a low-level linguistic feature representation. On top of that, numerous diverse architectures are designed for various NLP tasks. In the milestone work of Transformers ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"(Vaswani ","element":"a"},{"href":"#id-1","referenceIndex":38,"text":"et al., ","element":"a"},{"href":"#id-1","referenceIndex":38,"text":"2017)","element":"a"},{"text":", the Transformer attention module is proposed as a generic building block for various NLP tasks. 
After that, a series of approaches has been proposed for pre-training the generic representation, mainly based on Transformers, such as GPT ","element":"span"},{"href":"#id-24","referenceIndex":31,"text":"(Radford et al., ","element":"a"},{"href":"#id-24","referenceIndex":31,"text":"2018)","element":"a"},{"text":", BERT ","element":"span"},{"href":"#id-2","referenceIndex":6,"text":"(Devlin et al., ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"2018)","element":"a"},{"text":", GPT-2 ","element":"span"},{"href":"#id-25","referenceIndex":32,"text":"(Radford et al., ","element":"a"},{"href":"#id-25","referenceIndex":32,"text":"2019)","element":"a"},{"text":", XLNet ","element":"span"},{"href":"#id-26","referenceIndex":42,"text":"(Yang et al., ","element":"a"},{"href":"#id-26","referenceIndex":42,"text":"2019)","element":"a"},{"text":", XLM ","element":"span"},{"href":"#id-27","referenceIndex":22,"text":"(Lample & Conneau, ","element":"a"},{"href":"#id-27","referenceIndex":22,"text":"2019)","element":"a"},{"text":", and RoBERTa ","element":"span"},{"href":"#id-28","referenceIndex":26,"text":"(Liu et al., ","element":"a"},{"href":"#id-28","referenceIndex":26,"text":"2019)","element":"a"},{"text":". Among them, BERT is perhaps the most popular one due to its simplicity and superior performance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pre-training for Visual-Linguistic Tasks. ","element":"span"},{"text":"The development course of models for visual-linguistic tasks is also quite similar to those in the computer vision and NLP communities. Previously, task-specific models are designed, wherein the features derived from off-the-shelf computer vision and NLP models are combined in an ad-hoc way for specific tasks. 
Model training is performed on the dataset for the specific task only.","element":"span"}],[{"text":"VideoBERT ","element":"span"},{"href":"#id-29","referenceIndex":36,"text":"(Sun et al., ","element":"a"},{"href":"#id-29","referenceIndex":36,"text":"2019b) ","element":"a"},{"text":"is the first work seeking to conduct pre-training for visual-linguistic tasks. In it, video clips are processed by off-the-shelf networks for action recognition, and are assigned to different clusters (visual words) based on the derived features. The pre-training loss is incurred via predicting the cluster ids of masked video clips. Due to the abrupt clustering of the video clips, it loses considerable visual content information and hinders updating visual network parameters. In the following work of CBT ","element":"span"},{"href":"#id-30","referenceIndex":35,"text":"(Sun et al., ","element":"a"},{"href":"#id-30","referenceIndex":35,"text":"2019a)","element":"a"},{"text":", this clustering mechanism is removed. Both works are applied to videos, which have a linear structure in the time dimension, the same as sentences. It is highly desirable to also study the well-established image-based visual-linguistic tasks.","element":"span"}],[{"text":"Concurrent to our work, multiple works released on arXiv very recently also seek to derive a pre-trainable generic representation for visual-linguistic tasks. Table ","element":"span"},{"href":"#id-31","text":"5 ","element":"a"},{"text":"in the Appendix compares them. 
We briefly discuss some of these works here.","element":"span"}],[{"text":"In ViLBERT ","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"(Lu et al., ","element":"a"},{"href":"#id-32","referenceIndex":28,"text":"2019) ","element":"a"},{"text":"and LXMERT ","element":"span"},{"href":"#id-33","referenceIndex":37,"text":"(Tan & Bansal, ","element":"a"},{"href":"#id-33","referenceIndex":37,"text":"2019)","element":"a"},{"text":", which are under review or just got accepted, the network architecture consists of two single-modal networks applied to the input sentences and images, respectively, followed by a cross-modal Transformer combining information from the two sources. The attention pattern in the cross-modal Transformer is restricted, which the authors believe improves performance. The authors of ViLBERT claim that such a two-stream design is superior to a single-stream unified model. In contrast, the proposed VL-BERT is a unified architecture based on Transformers without any restriction on the attention patterns. The visual and linguistic contents are fed as input to VL-BERT, wherein they interact early and freely. We found that our unified model of VL-BERT outperforms such two-stream designs.","element":"span"}],[{"text":"VisualBert ","element":"span"},{"href":"#id-34","referenceIndex":24,"text":"(Li et al., ","element":"a"},{"href":"#id-34","referenceIndex":24,"text":"2019b)","element":"a"},{"text":", B2T2 ","element":"span"},{"href":"#id-35","referenceIndex":1,"text":"(Alberti et al., ","element":"a"},{"href":"#id-35","referenceIndex":1,"text":"2019)","element":"a"},{"text":", and Unicoder-VL ","element":"span"},{"href":"#id-36","referenceIndex":23,"text":"(Li et al., ","element":"a"},{"href":"#id-36","referenceIndex":23,"text":"2019a)","element":"a"},{"text":", which are works in progress or under review, also use a unified single-stream architecture. 
The differences among these works are compared in Table ","element":"span"},{"href":"#id-31","text":"5. ","element":"a"},{"text":"The concurrent emergence of these research works indicates the importance of deriving a generic pre-trainable representation for visual-linguistic tasks.","element":"span"}],[{"text":"In addition, there are three notable differences between VL-BERT and other concurrent works in pre-training. Their effects are validated in Section ","element":"span"},{"href":"#id-37","text":"4.3. ","element":"a"},{"text":"(1) We found the task of Sentence-Image Relationship Prediction used in all of the other concurrent works (e.g., ViLBERT ","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"(Lu et al., ","element":"a"},{"href":"#id-32","referenceIndex":28,"text":"2019) ","element":"a"},{"text":"and LXMERT ","element":"span"},{"href":"#id-33","referenceIndex":37,"text":"(Tan & Bansal, ","element":"a"},{"href":"#id-33","referenceIndex":37,"text":"2019)","element":"a"},{"text":") is of no help in pre-training visual-linguistic representations. Thus, this task is not incorporated in VL-BERT. (2) We pre-train VL-BERT on both visual-linguistic and text-only datasets. We found such joint pre-training improves the generalization over long and complex sentences. (3) Improved tuning of the visual representation. In VL-BERT, the parameters of Fast R-CNN, which derives the visual features, are also updated. 
To avoid visual clue leakage in the pre-training task of Masked RoI Classification with Linguistic Clues, the masking operation is conducted on the raw input pixels, rather than on the feature maps produced by the convolutional layers.","element":"span"}],[{"text":"3 ","element":"span"},{"text":"VL-BERT","element":"span"}],[{"text":"3.1 ","element":"span"},{"text":"R","element":"span"},{"text":"EVISIT ","element":"span"},{"text":"BERT M","element":"span"},{"text":"ODEL","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":290.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/2-0.png","element":"img","alt":" x = {x1, ..., xN}","inline":true,"padRight":true},{"text":"be the input elements in BERT ","element":"span"},{"href":"#id-2","referenceIndex":6,"text":"(Devlin et al., ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"2018)","element":"a"},{"text":", which are embedded features encoding sentence words. They are processed by a multi-layer bidirectional Transformer ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"(Vaswani et al., ","element":"a"},{"href":"#id-1","referenceIndex":38,"text":"2017)","element":"a"},{"text":", where the embedding features of each element are transformed layer by layer by aggregating features from the other elements with adaptive attention weights. 
Let ","element":"span"},{"style":{"height":17.78},"width":295.11,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/2-1.png","element":"img","alt":" xl = {xl1, ..., xlN}","inline":true,"padRight":true},{"text":"be the features of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"-th layer (","element":"span"},{"style":{"height":13.39},"width":38.78,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/2-2.png","element":"img","alt":"x0","inline":true,"padRight":true},{"text":"is set as the input ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"). The ","element":"span"},{"text":"features of the ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"+ 1)","element":"span"},{"text":"-th layer, ","element":"span"},{"style":{"height":13.38},"width":73.48,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/2-3.png","element":"img","alt":" xl+1","inline":true},{"text":", is computed by","element":"span"}],[{"id":"id-38","style":{"width":"90%"},"width":1435,"height":322,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/2-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"in Eq. 
","element":"span"},{"href":"#id-38","text":"1 ","element":"a"},{"text":"indexes over the attention heads, and ","element":"span"},{"style":{"height":19.93},"width":562.75,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-0.png","element":"img","alt":" Ami,j ∝ exp[(Ql+1m xli)T (Kl+1m xlj)]","inline":true,"padRight":true},{"text":"de- ","element":"span"},{"text":"notes the attention weights between elements ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"-th head, which is normalized by ","element":"span"},{"style":{"height":22.8},"width":408.92,"height":56.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-1.png","element":"img","alt":"�Nj=1 Ami,j = 1. W l+1m","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":17.32},"width":82.2,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-2.png","element":"img","alt":" Ql+1m","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":17.32},"width":87.4,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-3.png","element":"img","alt":" Kl+1m","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.32},"width":82.81,"height":43.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-4.png","element":"img","alt":" V l+1m","inline":true,"padRight":true},{"text":"are learnable weights for ","element":"span"},{"style":{"height":13.78},"width":56.74,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-5.png","element":"img","alt":" mth","inline":true,"padRight":true},{"text":"attention head, 
","element":"span"},{"style":{"height":18.67},"width":207.34,"height":46.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-6.png","element":"img","alt":"W l+11 , W l+12","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.67},"width":155.2,"height":46.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-7.png","element":"img","alt":" bl+11 , bl+12","inline":true,"padRight":true},{"text":"in Eq. ","element":"span"},{"href":"#id-38","text":"3 ","element":"a"},{"text":"are learnable weights and biases, respectively. Note that, the operations in Eq. ","element":"span"},{"href":"#id-38","text":"1 ","element":"a"},{"style":{"height":6},"width":31,"height":15,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-8.png","element":"img","alt":" ∼","inline":true,"padRight":true},{"href":"#id-38","text":"4 ","element":"a"},{"text":"is irrelevant to the order of input sequence, i.e. the final BERT representation of permuted input is same as the final BERT representation of the original input after the same permutation. The position of an element in BERT is encoded in its own embedding features by sequence positional embedding. Thanks to such decoupled representation, the BERT model is flexible enough to be pre-trained and finetuned for a variety of NLP tasks.","element":"span"}],[{"text":"In BERT pre-training, the masked language modeling (MLM) task is introduced. The embedded features of a certain input word would be randomly masked out (the token embedding channels capturing the word content is replaced by a special [MASK] token). The BERT model is trained to predict the masked word from linguistic clues of all the other unmasked elements. 
As explained in ","element":"span"},{"href":"#id-39","referenceIndex":40,"text":"Wang & Cho ","element":"a"},{"href":"#id-39","referenceIndex":40,"text":"(2019)","element":"a"},{"text":", the overall MLM-based training of BERT is equivalent to optimizing the following joint probability distribution","element":"span"}],[{"style":{"width":"67%"},"width":1076,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":122.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-10.png","element":"img","alt":" φi(x|θ)","inline":true,"padRight":true},{"text":"is the potential function for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th input element, with parameters ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-11.png","element":"img","alt":" θ","inline":true},{"text":", and ","element":"span"},{"style":{"height":16},"width":81.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-12.png","element":"img","alt":" Z(θ)","inline":true,"padRight":true},{"text":"is the partition function. 
Each log-potential term ","element":"span"},{"style":{"height":16},"width":149.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-13.png","element":"img","alt":" log φi(x)","inline":true,"padRight":true},{"text":"is defined as","element":"span"}],[{"style":{"width":"63%"},"width":1012,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.68},"width":147.53,"height":44.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-15.png","element":"img","alt":" fi(x\\i|θ)","inline":true,"padRight":true},{"text":"denotes the final output feature of BERT corresponding to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th element for input ","element":"span"},{"style":{"height":12.48},"width":50.1,"height":31.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-16.png","element":"img","alt":"x\\i","inline":true},{"text":", where ","element":"span"},{"style":{"height":12.48},"width":50.1,"height":31.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-17.png","element":"img","alt":" x\\i","inline":true,"padRight":true},{"text":"is defined as ","element":"span"},{"style":{"height":17.68},"width":322.55,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-18.png","element":"img","alt":" x\\i = {x1, ..., xi−1,","inline":true,"padRight":true},{"text":"[MASK]","element":"span"},{"style":{"height":16},"width":235.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-19.png","element":"img","alt":", xi+1, ..., xN}","inline":true},{"text":". 
The resulting MLM-based loss is","element":"span"}],[{"style":{"width":"70%"},"width":1123,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-20.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is a randomly sampled sentence from the training set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is a randomly sampled location for masking words.","element":"span"}],[{"text":"The second pre-training task, Next Sentence Prediction, focuses on modeling the relationship between two sentences. Two sentences are sampled from the input document, and the model should predict whether the second sentence is the direct successor of the first. In BERT, the two sampled sentences are concatenated into one input sequence, with special elements [CLS] and [SEP] inserted prior to the first and the second sentences, respectively. A sigmoid classifier is appended to the final output feature corresponding to the [CLS] element to make the prediction. Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"be the input sequence, and let ","element":"span"},{"style":{"height":16},"width":177.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-21.png","element":"img","alt":" t ∈ {0, 1}","inline":true,"padRight":true},{"text":"indicate the relationship between the two sentences. 
The loss function is defined as","element":"span"}],[{"style":{"width":"82%"},"width":1302,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-22.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.74},"width":44.78,"height":44.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-23.png","element":"img","alt":" xL0","inline":true,"padRight":true},{"text":"is the final output feature of the [CLS] element (at the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-th layer), and ","element":"span"},{"style":{"height":17.78},"width":98.57,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-24.png","element":"img","alt":" g(xL0 )","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"text":"classifier output.","element":"span"}],[{"style":{"width":"31%"},"width":505,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/3-25.png","element":"img"}],[{"text":"Figure ","element":"span"},{"href":"#id-12","text":"1 ","element":"a"},{"text":"illustrates the architecture of VL-BERT. Basically, it modifies the original BERT ","element":"span"},{"href":"#id-2","referenceIndex":6,"text":"(Devlin ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"et al., ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"2018) ","element":"a"},{"text":"model by adding new elements to accommodate the visual contents, and a new type of visual feature embedding to the input feature embeddings. Similar to BERT, the backbone is of multi-layer bidirectional Transformer encoder ","element":"span"},{"href":"#id-1","referenceIndex":38,"text":"(Vaswani et al., ","element":"a"},{"href":"#id-1","referenceIndex":38,"text":"2017)","element":"a"},{"text":", enabling dependency modeling among all the input elements. 
Unlike BERT, which processes sentence words only, VL-BERT takes both visual and linguistic elements as input, which are features defined on regions-of-interest (RoIs) in images and sub-words from input sentences, respectively. The RoIs can be either bounding boxes produced by object detectors or annotated ones in certain tasks.","element":"span"}],[{"text":"It is worth noting that the input formats vary for different visual-linguistic tasks (e.g., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Caption, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"for image captioning, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"for VQA ","element":"span"},{"href":"#id-6","referenceIndex":3,"text":"(Antol et al., ","element":"a"},{"href":"#id-6","referenceIndex":3,"text":"2015; ","element":"a"},{"href":"#id-7","referenceIndex":16,"text":"Johnson ","element":"a"},{"href":"#id-7","referenceIndex":16,"text":"et al., ","element":"a"},{"href":"#id-7","referenceIndex":16,"text":"2017; ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"Goyal et al., ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"2017; ","element":"a"},{"href":"#id-9","referenceIndex":15,"text":"Hudson & Manning, ","element":"a"},{"href":"#id-9","referenceIndex":15,"text":"2019) ","element":"a"},{"text":"and VCR ","element":"span"},{"href":"#id-10","referenceIndex":45,"text":"(Zellers et al., ","element":"a"},{"href":"#id-10","referenceIndex":45,"text":"2019; ","element":"a"},{"href":"#id-11","referenceIndex":8,"text":"Gao ","element":"a"},{"href":"#id-11","referenceIndex":8,"text":"et al., ","element":"a"},{"href":"#id-11","referenceIndex":8,"text":"2019)","element":"a"},{"text":"). 
But thanks to the unordered representation nature of Transformer attention (e.g., the","element":"span"}],[{"style":{"width":"99%"},"width":1579,"height":663,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/4-0.png","element":"img"}],[{"id":"id-12","text":"Figure 1: Architecture for pre-training VL-BERT. All the parameters in this architecture, including ","element":"figcaption","subtype":"caption"},{"text":"VL-BERT and Fast R-CNN, are jointly trained in both pre-training and fine-tuning phases.","element":"figcaption","subtype":"caption"}],[{"text":"$42","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Token Embedding ","element":"span"},{"text":"Following the practice in BERT, the linguistic words are embedded with WordPiece embeddings ","element":"span"},{"href":"#id-13","referenceIndex":41,"text":"(Wu et al., ","element":"a"},{"href":"#id-13","referenceIndex":41,"text":"2016) ","element":"a"},{"text":"with a vocabulary of 30,000 tokens. A special token is assigned to each special element. 
For the visual elements, a special [IMG] token is assigned to each of them.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Visual Feature Embedding ","element":"span"},{"text":"We first describe the visual appearance feature and the visual geometry embedding separately, and then how they are combined to form the visual feature embedding.","element":"span"}],[{"text":"For the visual element corresponding to an RoI, the visual appearance feature is extracted by applying a Fast R-CNN ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"(Girshick, ","element":"a"},{"href":"#id-14","referenceIndex":9,"text":"2015) ","element":"a"},{"text":"detector (i.e., the detection branch in Faster R-CNN ","element":"span"},{"href":"#id-40","referenceIndex":33,"text":"(Ren et al., ","element":"a"},{"href":"#id-40","referenceIndex":33,"text":"2015)","element":"a"},{"text":"), where the feature vector prior to the output layer of each RoI is utilized as the visual feature embedding (2048-d in this paper). For the non-visual elements, the corresponding visual appearance features are extracted on the whole input image. They are obtained by applying Faster R-CNN on an RoI covering the whole input image.","element":"span"}],[{"text":"The visual geometry embedding is designed to inform VL-BERT of the geometric location of each input visual element in the image. 
Each RoI is characterized by a 4-d vector, as ","element":"span"},{"style":{"height":19.63},"width":300,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/4-1.png","element":"img","alt":" (x_{LT}/W, y_{LT}/H, x_{RB}/W, y_{RB}/H)","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":158.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/4-2.png","element":"img","alt":"(x_{LT}, y_{LT})","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":169.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/4-3.png","element":"img","alt":" (x_{RB}, y_{RB})","inline":true,"padRight":true},{"text":"denote the coordinates of the top-left and bottom-right corners respectively, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W, H ","element":"span"},{"text":"are the width and height of the input image. Following the practice in Relation Networks ","element":"span"},{"href":"#id-41","referenceIndex":14,"text":"(Hu et al., ","element":"a"},{"href":"#id-41","referenceIndex":14,"text":"2018)","element":"a"},{"text":", the 4-d vector is embedded into a high-dimensional representation (2048-d in this paper) by computing sine and cosine functions of different wavelengths.","element":"span"}],[{"text":"The visual feature embedding, attached to each of the input elements, is the output of a fully connected layer that takes the concatenation of the visual appearance feature and the visual geometry embedding as input.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Segment Embedding ","element":"span"},{"text":"Three types of segment, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A, B, C","element":"span"},{"text":", are defined to separate input elements from different sources, namely, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"and 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"for the words from the first and second input sentence respectively, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"for the RoIs from the input image. For example, for input format of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"denotes Question, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"denotes Answer, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"denotes Image. For input format of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Caption, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"denotes Caption, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"denotes Image. A learned segment embedding is added to every input element for indicating which segment it belongs to.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Sequence Position Embedding ","element":"span"},{"text":"A learnable sequence position embedding is added to every input element indicating its order in the input sequence, same as BERT. Because there is no natural order among input visual elements, any permutation of them in the input sequence should achieve the same result. 
Thus the sequence position embeddings for all visual elements are the same.","element":"span"}],[{"id":"id-45","text":"3.3 ","element":"span"},{"text":"P","element":"span"},{"text":"RE","element":"span"},{"text":"-","element":"span"},{"text":"TRAINING ","element":"span"},{"text":"VL-BERT","element":"span"}],[{"text":"The generic feature representation of VL-BERT enables us to pre-train it on massive-scale datasets, with properly designed pre-training tasks. We pre-train VL-BERT on both visual-linguistic and text-only datasets. Here we utilize the Conceptual Captions dataset ","element":"span"},{"href":"#id-5","referenceIndex":34,"text":"(Sharma et al., ","element":"a"},{"href":"#id-5","referenceIndex":34,"text":"2018) ","element":"a"},{"text":"as the visual-linguistic corpus. It contains around 3.3 million images annotated with captions, which are harvested from web data and processed through an automatic pipeline. The issue with the Conceptual Captions dataset is that the captions are mainly simple clauses, which are too short and simple for many downstream tasks. To avoid overfitting to such short and simple text, we also pre-train VL-BERT on a text-only corpus with long and complex sentences. We utilize the BooksCorpus ","element":"span"},{"href":"#id-42","referenceIndex":47,"text":"(Zhu et al., ","element":"a"},{"href":"#id-42","referenceIndex":47,"text":"2015) ","element":"a"},{"text":"and the English Wikipedia datasets, which are also utilized in pre-training BERT.","element":"span"}],[{"text":"In SGD training, in each mini-batch, samples are randomly drawn from both Conceptual Captions and BooksCorpus & English Wikipedia (at a ratio of 1:1). 
For a sample drawn from Conceptual Captions, the input format to VL-BERT is of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Caption, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", where the RoIs in the image are localized and categorized by a pre-trained Faster R-CNN object detector. Two pre-training tasks are exploited to incur loss, which are as follows.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Task #1","element":"span"},{"text":": ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Masked Language Modeling with Visual Clues ","element":"span"},{"text":"This task is very similar to the Masked Language Modeling (MLM) task utilized in BERT. The key difference is that visual clues are incorporated in VL-BERT for capturing the dependencies among visual and linguistic contents. During pre-training, each word in the input sentence(s) is randomly masked (at a probability of 15%). For the masked word, its token is replaced with a special token of [MASK]. The model is trained to predict the masked words, based on the unmasked words and the visual features. The task drives the network to not only model the dependencies in sentence words, but also to align the visual and linguistic contents. For example, in Figure ","element":"span"},{"href":"#id-12","text":"1 ","element":"a"},{"text":"“kitten drinking from [MASK]”, without the input image, the masked word could be any containers, such as “bowl”, “spoon” and “bottle”. The representation should capture the correspondence of the word “bottle” and the corresponding RoIs in the image to make the right guess. 
During pre-training, the final output feature corresponding to the masked word is fed into a classifier over the whole vocabulary, driven by Softmax cross-entropy loss.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Task #2","element":"span"},{"text":": ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Masked RoI Classification with Linguistic Clues ","element":"span"},{"text":"This is a dual task of Task #1. Each RoI in the image is randomly masked out (with 15% probability), and the pre-training task is to predict the category label of the masked RoI from the other clues. To avoid any visual clue leakage from the visual feature embedding of other elements, the pixels lying in the masked RoI are set to zero before applying Fast R-CNN. During pre-training, the final output feature corresponding to the masked RoI is fed into a classifier with Softmax cross-entropy loss for object category classification. The category label predicted by the pre-trained Faster R-CNN is set as the ground-truth. An example is shown in Figure ","element":"span"},{"href":"#id-12","text":"1. ","element":"a"},{"text":"The RoI corresponding to the cat in the image is masked out, and the corresponding category cannot be predicted from any visual clues. But with the input caption of “kitten drinking from bottle”, the model can infer the category by exploiting the linguistic clues.","element":"span"}],[{"text":"For a sample drawn from the BooksCorpus & English Wikipedia datasets, the input format to VL-BERT degenerates to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Text, ","element":"span"},{"style":{"height":10.4},"width":74.46,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/5-0.png","element":"img","alt":" ∅ >","inline":true},{"text":", where no visual information is involved. 
The “visual feature embedding” term in Figure ","element":"span"},{"href":"#id-12","text":"1 ","element":"a"},{"text":"is a learnable embedding shared for all words. The training loss is from the standard task of Masked Language Modeling (MLM) as in BERT.","element":"span"}],[{"text":"In summary, the pre-training on the visual-linguistic corpus improves the detailed alignment between visual and linguistic contents. Such detailed alignment is vital for many downstream tasks (for example, in Visual Grounding ","element":"span"},{"href":"#id-43","referenceIndex":17,"text":"(Kazemzadeh et al., ","element":"a"},{"href":"#id-43","referenceIndex":17,"text":"2014)","element":"a"},{"text":", the model locates the most relevant object or region in an image based on a natural language query). Pre-training on the text-only corpus, in turn, facilitates downstream tasks that involve understanding long and complex sentences.","element":"span"}],[{"text":"3.4 ","element":"span"},{"text":"F","element":"span"},{"text":"INE","element":"span"},{"text":"-","element":"span"},{"text":"TUNING ","element":"span"},{"text":"VL-BERT","element":"span"}],[{"text":"VL-BERT is designed to be a generic feature representation for various visual-linguistic tasks. It is relatively simple to fine-tune VL-BERT for various downstream tasks. We simply need to feed VL-BERT with properly formatted input and output, and fine-tune all the network parameters end-to-end. For the input, the typical formats of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Caption, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"cover the majority of visual-linguistic tasks. 
VL-BERT also supports more sentences and more images as long as appropriate segment embeddings are introduced to identify different input sources. At the output, typically, the final output feature of the [CLS] element is used for sentence-image-relation level prediction. The final output features of words or RoIs are for word-level or RoI-level prediction. In addition to the input and output format, task-specific loss functions and training strategies also need to be tuned. See Section ","element":"span"},{"href":"#id-44","text":"4.2 ","element":"a"},{"text":"for the detailed design choices and settings.","element":"span"}]]},{"heading":"4 EXPERIMENT","paragraphs":[[{"style":{"width":"21%"},"width":348,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/6-0.png","element":"img"}],[{"text":"As described in Section ","element":"span"},{"href":"#id-45","text":"3.3, ","element":"a"},{"text":"we pre-train VL-BERT jointly on Conceptual Captions ","element":"span"},{"href":"#id-5","referenceIndex":34,"text":"(Sharma et al., ","element":"a"},{"href":"#id-5","referenceIndex":34,"text":"2018) ","element":"a"},{"text":"as visual-linguistic corpus, and BooksCorpus ","element":"span"},{"href":"#id-42","referenceIndex":47,"text":"(Zhu et al., ","element":"a"},{"href":"#id-42","referenceIndex":47,"text":"2015) ","element":"a"},{"text":"& English Wikipedia as text-only corpus. As VL-BERT is developed via adding new inputs capturing visual information to the original BERT model, we initialize the parameters to be the same as the original BERT described in ","element":"span"},{"href":"#id-2","referenceIndex":6,"text":"(Devlin et al., ","element":"a"},{"href":"#id-2","referenceIndex":6,"text":"2018)","element":"a"},{"text":". 
VL-BERT","element":"span"},{"text":"BASE ","element":"span"},{"text":"and VL-BERT","element":"span"},{"text":"LARGE ","element":"span"},{"text":"denote models developed from the original BERT","element":"span"},{"text":"BASE ","element":"span"},{"text":"and BERT","element":"span"},{"text":"LARGE ","element":"span"},{"text":"models, respectively. The newly added parameters in VL-BERT are randomly initialized from a Gaussian distribution with mean of 0 and standard deviation of 0.02. Visual content embedding is produced by Faster R-CNN + ResNet-101, initialized from parameters pre-trained on Visual Genome ","element":"span"},{"href":"#id-46","referenceIndex":20,"text":"(Krishna et al., ","element":"a"},{"href":"#id-46","referenceIndex":20,"text":"2017) ","element":"a"},{"text":"for object detection (see BUTD ","element":"span"},{"href":"#id-47","referenceIndex":2,"text":"(Anderson ","element":"a"},{"href":"#id-47","referenceIndex":2,"text":"et al., ","element":"a"},{"href":"#id-47","referenceIndex":2,"text":"2018)","element":"a"},{"text":").","element":"span"}],[{"text":"Prior to pre-training on Conceptual Captions, the pre-trained Faster R-CNN is applied to extract RoIs. Specifically, at most 100 RoIs with detection scores higher than 0.5 are selected for each image. At minimum, 10 RoIs are selected from one image, regardless of the detection score threshold. 
The detailed parameter settings are in Appendix.","element":"span"}],[{"id":"id-44","style":{"width":"48%"},"width":770,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/6-1.png","element":"img"}],[{"text":"The pre-trained VL-BERT model can be fine-tuned for various downstream visual-linguistic tasks, with simple modifications on the input format, output prediction, loss function and training strategy.","element":"span"}],[{"style":{"width":"86%"},"width":1374,"height":474,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/6-2.png","element":"img"}],[{"text":"Table 1: Comparison to the state-of-the-art methods with single model on the VCR dataset. ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":18,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/6-3.png","element":"img","alt":"†","inline":true,"padRight":true},{"id":"id-49","text":"indicates concurrent works.","element":"figcaption","subtype":"caption"}],[{"text":"Visual Commonsense Reasoning (VCR) focuses on higher-order cognitive and commonsense understanding of the given image. In the dataset of ","element":"span"},{"href":"#id-10","referenceIndex":45,"text":"Zellers et al. ","element":"a"},{"href":"#id-10","referenceIndex":45,"text":"(2019)","element":"a"},{"text":", given an image and a list of categorized RoIs, a question at cognition level is raised. The model should pick the right answer to the question and provide the rationale explanation. For each question, there are 4 candidate answers and 4 candidate rationales. 
This holistic task (Q ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/6-4.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"AR) is decomposed into two sub-tasks wherein researchers can train specific individual models: question answering (Q ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/6-5.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"A) and answer","element":"span"}],[{"style":{"width":"79%"},"width":1257,"height":1089,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/7-0.png","element":"img"}],[{"id":"id-48","text":"Figure 2: Input and output formats for fine-tuning different visual-linguistic downstream tasks.","element":"figcaption","subtype":"caption"}],[{"text":"justification (QA ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/7-1.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"R). The released VCR dataset consists of 265k pairs of questions, answers, and rationales, over 100k unique movie scenes (100k images). They are split into training, validation, and test sets consisting of 213k questions and 80k images, 27k questions and 10k images, and 25k questions and 10k images, respectively.","element":"span"}],[{"text":"Our experimental protocol for VCR follows that in R2C ","element":"span"},{"href":"#id-10","referenceIndex":45,"text":"(Zellers et al., ","element":"a"},{"href":"#id-10","referenceIndex":45,"text":"2019)","element":"a"},{"text":". The model is trained on the train split, and is evaluated at the val and test sets. In the original work R2C, task-specific “Grounding”, “Contextualization” and “Reasoning” modules are designed. Here we simply adopt the generic representation of VL-BERT for the task. 
","element":"span"},{"text":"Figure ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"(a) illustrates the input format, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":". For the sub-task of Q ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/7-2.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"A, ‘Q’ and ‘A’ are filled to the Question section and Answer section respectively. For the sub-task of QA ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/7-3.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"R , the concatenation of ‘Q’ and ‘A’ is filled to the Question section, and ‘R’ is filled to the Answer section. The input RoIs to VL-BERT are the ground-truth annotations in the dataset. The final output feature of [CLS] element is fed to a Softmax classifier for predicting whether the given Answer is the correct choice. During fine-tuning, we adopt two losses, the classification over the correctness of the answers and the RoI classification with linguistic clues. The detailed parameter settings are in Appendix.","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-49","text":"1 ","element":"a"},{"text":"presents the experiment results. Pre-training VL-BERT improves the performance by 1.0% in the final Q ","element":"span"},{"style":{"height":8.8},"width":40,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/7-4.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"AR task, which validates the effectiveness of pre-training. Compared with R2C, we do not use ad-hoc task-specific modules. 
Instead, we simply adopt the generic representation of VL-BERT and jointly train the whole model end-to-end. Despite the same input, output and experimental protocol as R2C, VL-BERT outperforms R2C by large margins, indicating the power of our simple cross-modal architecture. Compared with other concurrent works, i.e., ViLBERT, VisualBERT and B2T2, our VL-BERT achieves the state-of-the-art performance.","element":"span"}],[{"style":{"width":"50%"},"width":807,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/7-5.png","element":"img"}],[{"text":"In the VQA task, given a natural image, a question at the perceptual level is asked, and the algorithm should generate / choose the correct answer. Here we conduct experiments on the widely-used VQA v2.0 dataset ","element":"span"},{"href":"#id-8","referenceIndex":11,"text":"(Goyal et al., ","element":"a"},{"href":"#id-8","referenceIndex":11,"text":"2017)","element":"a"},{"text":", which is built based on the COCO ","element":"span"},{"href":"#id-50","referenceIndex":25,"text":"(Lin et al., ","element":"a"},{"href":"#id-50","referenceIndex":25,"text":"2014) ","element":"a"},{"text":"images. The VQA v2.0 dataset is split into train (83k images and 444k questions), validation (41k images and","element":"span"}],[{"style":{"width":"53%"},"width":842,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/8-0.png","element":"img"}],[{"text":"Table 2: Comparison to the state-of-the-art methods with single model on the VQA dataset. ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":18,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/8-1.png","element":"img","alt":"†","inline":true,"padRight":true},{"id":"id-51","text":"indicates concurrent works.","element":"figcaption","subtype":"caption"}],[{"text":"214k questions), and test (81k images and 448k questions) sets. 
Following the experimental protocol in BUTD ","element":"span"},{"href":"#id-47","referenceIndex":2,"text":"(Anderson et al., ","element":"a"},{"href":"#id-47","referenceIndex":2,"text":"2018)","element":"a"},{"text":", for each question, the algorithm should pick the corresponding answer from a shared set consisting of 3,129 answers.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"(b) illustrates the input format for the VQA task, which is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":". As the possible answers are from a shared pool independent of the question, we fill the Answer section with only a [MASK] element. As in BUTD ","element":"span"},{"href":"#id-47","referenceIndex":2,"text":"(Anderson et al., ","element":"a"},{"href":"#id-47","referenceIndex":2,"text":"2018)","element":"a"},{"text":", the input RoIs in VL-BERT are generated by a Faster R-CNN detector pre-trained on Visual Genome ","element":"span"},{"href":"#id-46","referenceIndex":20,"text":"(Krishna et al., ","element":"a"},{"href":"#id-46","referenceIndex":20,"text":"2017)","element":"a"},{"text":". The answer prediction is made by a multi-class classifier based upon the output feature of the [MASK] element. During fine-tuning, the network training is driven by the multi-class cross-entropy loss over the possible answers. The detailed parameter settings are in the Appendix.","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-51","text":"2 ","element":"a"},{"text":"presents our experimental results. Pre-training VL-BERT improves the performance by 1.6%, which validates the importance of pre-training. VL-BERT shares the same input (i.e., question, image, and RoIs), output and experimental protocol with BUTD, a prevalent model specifically designed for the task. 
Still, VL-BERT surpasses BUTD by over 5% in accuracy. Except for LXMERT, our VL-BERT achieves better performance than the other concurrent works. LXMERT performs better because it is pre-trained on massive visual question answering data (aggregating almost all the VQA datasets based on COCO and Visual Genome), whereas our model is pre-trained only on captioning and text-only datasets, which still leaves a gap with the VQA task.","element":"span"}],[{"style":{"width":"90%"},"width":1429,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/8-2.png","element":"img"}],[{"text":"Table 3: Comparison to the state-of-the-art methods with single model on the RefCOCO+ dataset. ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":18,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/8-3.png","element":"img","alt":"†","inline":true,"padRight":true},{"id":"id-53","text":"indicates concurrent work.","element":"figcaption","subtype":"caption"}],[{"text":"A referring expression is a natural language phrase that refers to an object in an image. The referring expression comprehension task is to localize the object in an image with the given referring expression. We adopt the RefCOCO+ ","element":"span"},{"href":"#id-43","referenceIndex":17,"text":"(Kazemzadeh et al., ","element":"a"},{"href":"#id-43","referenceIndex":17,"text":"2014) ","element":"a"},{"text":"dataset for evaluation, consisting of 141k expressions for 50k referred objects in 20k images in the COCO dataset ","element":"span"},{"href":"#id-50","referenceIndex":25,"text":"(Lin et al., ","element":"a"},{"href":"#id-50","referenceIndex":25,"text":"2014)","element":"a"},{"text":". The referring expressions in RefCOCO+ are forbidden from using absolute location words, e.g., “left dog”. Therefore the referring expressions focus on purely appearance-based descriptions. 
RefCOCO+ is split into four sets: a training set (train), a validation set (val), and two testing sets (testA and testB). Images containing multiple people are in the testA set, while images containing multiple objects of other categories are in the testB set. There is no overlap between the training, validation and testing images.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-48","text":"2 ","element":"a"},{"text":"(c) illustrates the input format for referring expression comprehension, which is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Query, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":". Model training and evaluation are conducted either on the ground-truth RoIs or on the detected boxes in MAttNet ","element":"span"},{"href":"#id-52","referenceIndex":44,"text":"(Yu et al., ","element":"a"},{"href":"#id-52","referenceIndex":44,"text":"2018)","element":"a"},{"text":". The results are reported accordingly in the track of ground-truth regions or that of detected regions. During training, we compute the classification scores for all the input RoIs. For each RoI, a binary classification loss is applied. During inference, we directly choose the RoI with the highest classification score as the referred object of the input referring expression. The detailed parameter settings are in the Appendix.","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-53","text":"3 ","element":"a"},{"text":"presents our experimental results. Pre-trained VL-BERT significantly improves the performance. Compared with MAttNet, VL-BERT is much simpler without task-specific architecture designs, yet much better. 
VL-BERT achieves comparable performance to the concurrent work of ViLBERT.","element":"span"}],[{"id":"id-37","style":{"width":"99%"},"width":1584,"height":395,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/9-0.png","element":"img"}],[{"id":"id-54","text":"Table 4: Ablation study for VL-BERT","element":"figcaption","subtype":"caption"},{"text":"BASE ","element":"figcaption","subtype":"caption"},{"text":"with 0.5","element":"figcaption","subtype":"caption"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/9-1.png","element":"img","alt":"×","inline":true,"padRight":true},{"text":"fine-tuning epochs.","element":"figcaption","subtype":"caption"}],[{"text":"Table ","element":"span"},{"href":"#id-54","text":"4 ","element":"a"},{"text":"ablates key design choices in pre-training VL-BERT. For experimental efficiency, the number of fine-tuning epochs is 0.5","element":"span"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/9-2.png","element":"img","alt":"×","inline":true,"padRight":true},{"text":"of that in Section ","element":"span"},{"href":"#id-44","text":"4.2, ","element":"a"},{"text":"and only the VL-BERT","element":"span"},{"text":"BASE ","element":"span"},{"text":"model is used.","element":"span"}],[{"text":"Overall, the pre-training of VL-BERT improves the performance on all three downstream tasks (compare the “w/o pre-training” setting with VL-BERT","element":"span"},{"text":"BASE","element":"span"},{"text":"). The size of the improvement varies across tasks. By comparing setting (a) to that of “w/o pre-training”, we see the benefits of Task #1, Masked Language Modeling with Visual Clues. By further incorporating Task #2, Masked RoI Classification with Linguistic Clues, the accuracy further improves on RefCOCO+, but plateaus on VCR and VQA. 
This might be because only RefCOCO+ utilizes the final output feature corresponding to [IMG] tokens for prediction. Thus the pre-training of such features is beneficial. Setting (c) incorporates the task of Sentence-Image Relationship Prediction as in ViLBERT ","element":"span"},{"href":"#id-32","referenceIndex":28,"text":"(Lu ","element":"a"},{"href":"#id-32","referenceIndex":28,"text":"et al., ","element":"a"},{"href":"#id-32","referenceIndex":28,"text":"2019) ","element":"a"},{"text":"and LXMERT ","element":"span"},{"href":"#id-33","referenceIndex":37,"text":"(Tan & Bansal, ","element":"a"},{"href":"#id-33","referenceIndex":37,"text":"2019)","element":"a"},{"text":". It hurts accuracy on all three downstream tasks. We conjecture that this is because the task of Sentence-Image Relationship Prediction introduces unmatched image and caption pairs as negative examples. Such unmatched samples would hamper the training of other tasks. Setting (d) adds the text-only corpus during pre-training. Compared with setting (b), it improves the performance on all three downstream tasks, and the improvement is most significant on VCR. This is because the task of VCR involves more complex and longer sentences than those in VQA and RefCOCO+","element":"span"},{"text":"2","element":"span"},{"text":". By further fine-tuning the network parameters of Fast R-CNN, which generates the visual features, we get the final setting of VL-BERT","element":"span"},{"text":"BASE","element":"span"},{"text":". Such end-to-end training of the entire network is helpful for all the downstream tasks.","element":"span"}]]},{"heading":"5 CONCLUSION","paragraphs":[[{"text":"In this paper, we developed VL-BERT, a new pre-trainable generic representation for visual-linguistic tasks. Instead of using ad-hoc task-specific modules, VL-BERT adopts the simple yet powerful Transformer model as the backbone. It is pre-trained on the massive-scale Conceptual Captions dataset, together with a text-only corpus. 
Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues, and thus benefit the downstream tasks. In the future, we would like to seek better pre-training tasks, which could benefit more downstream tasks (e.g., Image Caption Generation).","element":"span"}]]},{"heading":"ACKNOWLEDGMENTS","paragraphs":[[{"text":"The work is partially supported by the National Natural Science Foundation of China under grant No. U19B2044 and No. 61836011.","element":"span"}]]},{"heading":"REFERENCES","paragraphs":[[{"id":"id-35","text":"Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text ","element":"span"},{"text":"for visual question answering. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.05054","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-47","text":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and ","element":"span"},{"text":"Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 6077–6086, 2018.","element":"span"}],[{"id":"id-6","text":"Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zit","element":"span"},{"text":"nick, and Devi Parikh. Vqa: Visual question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE international conference on computer vision","element":"span"},{"text":", pp. 2425–2433, 2015.","element":"span"}],[{"id":"id-4","text":"Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and ","element":"span"},{"text":"C Lawrence Zitnick. 
","element":"span"},{"text":"Microsoft coco captions: Data collection and evaluation server. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1504.00325","element":"span"},{"text":", 2015.","element":"span"}],[{"id":"id-0","text":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi- ","element":"span"},{"text":"erarchical image database. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2009 IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pp. 248–255. Ieee, 2009.","element":"span"}],[{"id":"id-2","text":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep ","element":"span"},{"text":"bidirectional transformers for language understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1810.04805","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-16","text":"Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor ","element":"span"},{"text":"Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pp. 647–655, 2014.","element":"span"}],[{"id":"id-11","text":"Difei Gao, Ruiping Wang, Shiguang Shan, and Xilin Chen. ","element":"span"},{"text":"From two graphs to n questions: A vqa dataset for compositional reasoning on vision and commonsense. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.02962","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-14","text":"Ross Girshick. Fast r-cnn. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE international conference on computer vision","element":"span"},{"text":", pp. 
1440–1448, 2015.","element":"span"}],[{"id":"id-17","text":"Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for ac","element":"span"},{"text":"curate object detection and semantic segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pp. 580–587, 2014.","element":"span"}],[{"id":"id-8","text":"Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa ","element":"span"},{"text":"matter: Elevating the role of image understanding in visual question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 6904–6913, 2017.","element":"span"}],[{"id":"id-19","text":"Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and ","element":"span"},{"text":"segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Conference on Computer Vision","element":"span"},{"text":", pp. 297–312. Springer, 2014.","element":"span"}],[{"id":"id-20","text":"Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1811.08883","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-41","text":"Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object ","element":"span"},{"text":"detection. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 3588–3597, 2018.","element":"span"}],[{"id":"id-9","text":"Drew A Hudson and Christopher D Manning. 
Gqa: A new dataset for real-world visual reasoning ","element":"span"},{"text":"and compositional question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 6700–6709, 2019.","element":"span"}],[{"id":"id-7","text":"Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and ","element":"span"},{"text":"Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 2901–2910, 2017.","element":"span"}],[{"id":"id-43","text":"Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to ","element":"span"},{"text":"objects in photographs of natural scenes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)","element":"span"},{"text":", pp. 787–798, 2014.","element":"span"}],[{"id":"id-56","text":"Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1412.6980","element":"span"},{"text":", 2014.","element":"span"}],[{"id":"id-23","text":"Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Tor- ","element":"span"},{"text":"ralba, and Sanja Fidler. Skip-thought vectors. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pp. 
3294–3302, 2015.","element":"span"}],[{"id":"id-46","text":"Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie ","element":"span"},{"text":"Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Journal of Computer Vision","element":"span"},{"text":", 123(1):32–73, 2017.","element":"span"}],[{"id":"id-15","text":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo","element":"span"},{"text":"lutional neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pp. 1097–1105, 2012.","element":"span"}],[{"id":"id-27","text":"Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1901.07291","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-36","text":"Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder ","element":"span"},{"text":"for vision and language by cross-modal pre-training, 2019a.","element":"span"}],[{"id":"id-34","text":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple ","element":"span"},{"text":"and performant baseline for vision and language. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.03557","element":"span"},{"text":", 2019b.","element":"span"}],[{"id":"id-50","text":"Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr ","element":"span"},{"text":"Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European conference on computer vision","element":"span"},{"text":", pp. 740–755. Springer, 2014.","element":"span"}],[{"id":"id-28","text":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike ","element":"span"},{"text":"Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1907.11692","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-18","text":"Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic ","element":"span"},{"text":"segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pp. 3431–3440, 2015.","element":"span"}],[{"id":"id-32","text":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolin- ","element":"span"},{"text":"guistic representations for vision-and-language tasks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1908.02265","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-21","text":"Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word represen- ","element":"span"},{"text":"tations in vector space. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1301.3781","element":"span"},{"text":", 2013.","element":"span"}],[{"id":"id-22","text":"Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word ","element":"span"},{"text":"representation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)","element":"span"},{"text":", pp. 
1532–1543, 2014.","element":"span"}],[{"id":"id-24","text":"Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under- ","element":"span"},{"text":"standing by generative pre-training. 2018.","element":"span"}],[{"id":"id-25","text":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language ","element":"span"},{"text":"models are unsupervised multitask learners. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"OpenAI Blog","element":"span"},{"text":", 1(8), 2019.","element":"span"}],[{"id":"id-40","text":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object ","element":"span"},{"text":"detection with region proposal networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pp. 91–99, 2015.","element":"span"}],[{"id":"id-5","text":"Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, ","element":"span"},{"text":"hypernymed, image alt-text dataset for automatic image captioning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","element":"span"},{"text":", pp. 2556–2565, 2018.","element":"span"}],[{"id":"id-30","text":"Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional trans- ","element":"span"},{"text":"former for temporal representation learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.05743","element":"span"},{"text":", 2019a.","element":"span"}],[{"id":"id-29","text":"Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint ","element":"span"},{"text":"model for video and language representation learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1904.01766","element":"span"},{"text":", 2019b.","element":"span"}],[{"id":"id-33","text":"Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from trans- ","element":"span"},{"text":"formers. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-1","text":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ","element":"span"},{"text":"Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pp. 5998–6008, 2017.","element":"span"}],[{"id":"id-57","text":"Jesse Vig. ","element":"span"},{"text":"A multiscale visualization of attention in the transformer model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.05714","element":"span"},{"text":", 2019. URL ","element":"span"},{"href":"https://arxiv.org/abs/1906.05714","text":"https://arxiv.org/abs/1906.05714","element":"a"},{"text":".","element":"span"}],[{"id":"id-39","text":"Alex Wang and Kyunghyun Cho. Bert has a mouth, and it must speak: Bert as a markov random ","element":"span"},{"text":"field language model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1902.04094","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-13","text":"Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, ","element":"span"},{"text":"Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1609.08144","element":"span"},{"text":", 2016.","element":"span"}],[{"id":"id-26","text":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V ","element":"span"},{"text":"Le. Xlnet: Generalized autoregressive pretraining for language understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.08237","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-3","text":"Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual ","element":"span"},{"text":"denotations: New similarity metrics for semantic inference over event descriptions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions of the Association for Computational Linguistics","element":"span"},{"text":", 2:67–78, 2014.","element":"span"}],[{"id":"id-52","text":"Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mat- ","element":"span"},{"text":"tnet: Modular attention network for referring expression comprehension. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 1307–1315, 2018.","element":"span"}],[{"id":"id-10","text":"Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual ","element":"span"},{"text":"commonsense reasoning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 6720–6731, 2019.","element":"span"}],[{"id":"id-55","text":"Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answer- ","element":"span"},{"text":"ing in images. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pp. 4995–5004, 2016.","element":"span"}],[{"id":"id-42","text":"Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and ","element":"span"},{"text":"Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE international conference on computer vision","element":"span"},{"text":", pp. 19–27, 2015.","element":"span"}]]},{"heading":"A APPENDIX","paragraphs":[[{"style":{"width":"63%"},"width":1006,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-0.png","element":"img"}],[{"text":"Table ","element":"span"},{"href":"#id-31","text":"5 ","element":"a"},{"text":"compares VL-BERT with other concurrent works on pre-training generic visual-linguistic representations.","element":"span"}],[{"style":{"width":"99%"},"width":1574,"height":773,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-1.png","element":"img"}],[{"style":{"height":7.2},"width":9,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-2.png","element":"img","alt":"‡","inline":true,"padRight":true},{"text":"LXMERT is pre-trained on COCO Caption ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"(Chen et al., ","element":"a"},{"href":"#id-4","referenceIndex":4,"text":"2015)","element":"a"},{"text":", VG Caption ","element":"span"},{"href":"#id-46","referenceIndex":20,"text":"(Krishna et al., ","element":"a"},{"href":"#id-46","referenceIndex":20,"text":"2017)","element":"a"},{"text":", VG QA ","element":"span"},{"href":"#id-55","referenceIndex":46,"text":"(Zhu et al., ","element":"a"},{"href":"#id-55","referenceIndex":46,"text":"2016)","element":"a"},{"text":", 
VQA ","element":"span"},{"href":"#id-6","referenceIndex":3,"text":"(Antol et al., ","element":"a"},{"href":"#id-6","referenceIndex":3,"text":"2015) ","element":"a"},{"text":"and GQA ","element":"span"},{"href":"#id-9","referenceIndex":15,"text":"(Hudson & Manning, ","element":"a"},{"href":"#id-9","referenceIndex":15,"text":"2019)","element":"a"},{"text":".","element":"span"}],[{"id":"id-31","text":"Table 5: Comparison between our VL-BERT and other works seeking to derive pre-trainable generic ","element":"figcaption","subtype":"caption"},{"text":"representations for visual-linguistic tasks.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"43%"},"width":685,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-3.png","element":"img"}],[{"text":"Pre-training is conducted on 16 Tesla V100 GPUs for 250k iterations by SGD. In each mini-batch, 256 samples are drawn. Among them, 128 samples are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Caption, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"pairs from Conceptual Captions, and the remaining 128 samples are token sequences (at most 64 tokens per sequence) from BooksCorpus & English Wikipedia. 
In SGD, the Adam optimizer ","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"(Kingma & Ba, ","element":"a"},{"href":"#id-56","referenceIndex":18,"text":"2014) ","element":"a"},{"text":"is applied, with a base learning rate of ","element":"span"},{"style":{"height":16.99},"width":522.34,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-4.png","element":"img","alt":" 2 × 10−5, β1 = 0.9, β2 = 0.999","inline":true},{"text":", a weight decay of ","element":"span"},{"style":{"height":13.39},"width":80.76,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-5.png","element":"img","alt":" 10−4","inline":true},{"text":", a learning rate warmed up over the first 8,000 steps, and linear decay of the learning rate. All the parameters in VL-BERT and Fast R-CNN are jointly trained in both the pre-training and fine-tuning phases. The visual feature input for the text-only corpus is a learnable embedding shared by all words. In the task of Masked RoI Classification with Linguistic Clues, the pixels lying in all the masked RoIs are set to zero in the image. A box covering the whole image is added as an RoI and is never masked.","element":"span"}],[{"text":"For VCR, the fine-tuning is conducted on 16 Tesla V100 GPUs for 20 epochs. In each mini-batch, 256 triplets of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"are sampled. 
In SGD, the basic mini-batch gradient descent is conducted, with base learning rate of ","element":"span"},{"style":{"height":13.39},"width":151.25,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-6.png","element":"img","alt":" 5 × 10−3","inline":true},{"text":", momentum of 0.9, and weight decay of ","element":"span"},{"style":{"height":13.39},"width":80.76,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-7.png","element":"img","alt":"10−4","inline":true},{"text":". The learning rate is linearly warmed up in the first 1,000 steps from an initial learning rate of 0, and is decayed by 0.1 at the 14-th and the 18-th epochs.","element":"span"}],[{"text":"For VQA, the fine-tuning is conducted on 16 Tesla V100 GPUs for 20 epochs. In each mini-batch, 256 triplets of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Question, Answer, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"are sampled. In SGD, Adam optimizer is applied, with base learning rate of ","element":"span"},{"style":{"height":16.59},"width":522.34,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-8.png","element":"img","alt":" 1 × 10−4, β1 = 0.9, β2 = 0.999","inline":true},{"text":", weight decay of ","element":"span"},{"style":{"height":13.39},"width":80.76,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-9.png","element":"img","alt":" 10−4","inline":true},{"text":", learning rate warmed up over the first 2,000 steps, and linear decay of the learning rate.","element":"span"}],[{"text":"For RefCOCO+, the fine-tuning is conducted on 16 Tesla V100 GPUs for 20 epochs. 
In each mini-batch, 256 pairs of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"Query, Image","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"are sampled. In SGD, Adam optimizer is applied, with base learning rate of ","element":"span"},{"style":{"height":16.59},"width":537.74,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-10.png","element":"img","alt":" 1 × 10−4, β1 = 0.9, β2 = 0.999","inline":true},{"text":", weight decay of ","element":"span"},{"style":{"height":13.39},"width":80.76,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/13-11.png","element":"img","alt":" 10−4","inline":true},{"text":", learning rate warmed up over the first 500 steps, and linear decay of the learning rate.","element":"span"}],[{"style":{"width":"88%"},"width":1402,"height":2331,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/14-0.png","element":"img"}],[{"id":"id-58","text":"Figure 3: Visualization of attention maps in pre-trained VL-BERT","element":"figcaption","subtype":"caption"},{"text":"BASE","element":"figcaption","subtype":"caption"},{"text":". Line intensity indicates the magnitude of attention probability with the text token as query and the image RoI as key. The intensity is affinely rescaled to set the maximum value as 1 and the minimum as 0, across different heads in each layer. 
The indices of network layers and attention heads are counted from 0.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"61%"},"width":976,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1908.08530/images/15-0.png","element":"img"}],[{"text":"To better understand what VL-BERT learns from pre-training, we visualized the attention maps of pre-trained VL-BERT (without fine-tuning on downstream tasks) using BertViz","element":"span"},{"text":"3","element":"span"},{"href":"#id-57","referenceIndex":39,"text":"(Vig, ","element":"a"},{"href":"#id-57","referenceIndex":39,"text":"2019)","element":"a"},{"text":".","element":"span"}],[{"text":"Some visualization results on the COCO ","element":"span"},{"href":"#id-50","referenceIndex":25,"text":"(Lin et al., ","element":"a"},{"href":"#id-50","referenceIndex":25,"text":"2014; ","element":"a"},{"href":"#id-4","referenceIndex":4,"text":"Chen et al., ","element":"a"},{"href":"#id-4","referenceIndex":4,"text":"2015) ","element":"a"},{"text":"val2017 set are shown in Figure ","element":"span"},{"href":"#id-58","text":"3. ","element":"a"},{"text":"Attention patterns differ across attention heads. In some heads, text tokens attend more to the associated image RoIs, while in others, text tokens attend uniformly to all RoIs. This demonstrates the ability of VL-BERT to aggregate and align visual-linguistic content.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]