Given an image, the aim of scene graph parsing is to infer a visually grounded graph comprising localized entity categories, along with predicate edges denoting their pairwise relationships. This is often formulated as the detection of ⟨subject, predicate, object⟩ triplets within an image, e.g., the relationship shown in Figure 1b. Current state-of-the-art methods achieve this goal by a two-stage mechanism: first detecting entities, then predicting a predicate for each pair of entities.
Figure 1: Example of failure of models without our losses and success of our losses. (a) RelDN learned with only multi-class cross-entropy loss incorrectly relates the man with the microphone, while (b) RelDN learned with our Graphical Contrastive Losses detects the correct relationship.
We find that scene graph parsing models using such pipelines tend to struggle with two types of errors. The first is Entity Instance Confusion, in which the subject or object is related to one of many instances of the same class, and the model fails to distinguish between the target instance and the others. We show an example in Figure 2a, in which the model identifies that the man is holding a wine glass, but struggles to determine exactly which of the 3 visually similar wine glasses is being held. The incorrectly predicted wine glass is transparent and intersects with the left arm, which makes it look as though it is being held. The second type of error, Proximal Relationship Ambiguity, occurs when the image contains multiple subject-object pairs interacting in the same way, and the model fails to identify the correct pairing. An example can be seen in the multiple musicians “playing” their respective instruments in Figure 2b. Due to their close proximity, visual features for each musician-instrument pair overlap significantly, making it difficult for scene graph models to identify the correct pairings.
The primary cause of these two failures lies in the inherent difficulty of inferring relationships like “hold” and “play” from visual cues. Concretely, which glass is being held is determined by the small part of the hand that covers the glass. Whether a player is playing the drum can only be inferred by very subtle visual cues such as his standing pose or where his fingers are placed. It is challenging for any model to learn to attend to these details precisely, and it would be impractical to specify which details to focus on for all kinds of relationships, let alone to learn all these details. These challenges motivate the need for a mechanism that can automatically learn fine details that determine visual relationships, and explicitly discriminate related entities from unrelated ones, for all types of relationships. This is the goal of our work.
In this paper we propose a set of Graphical Contrastive Losses to tackle these issues. The losses use the form of the margin-based triplet loss, but are specifically designed to address the two aforementioned errors. They add supervision in the form of hard negatives specific to Entity Instance Confusion and Proximal Relationship Ambiguity. To demonstrate the effectiveness of our proposed losses, we design a relationship detection network named RelDN using the aforementioned pipeline with our losses. Figure 1 shows a result of RelDN with N-way cross-entropy loss only vs. with our additional contrastive losses. Our best model achieves 0.328 on the Private set of the OpenImages Relationship Detection Challenge, outperforming the winning model by 4.7% absolute (16.5% relative). It also attains state-of-the-art performance on the Visual Genome [10] and VRD [15] datasets.
In this paper, we denote subject, predicate, object and attribute with s, pred, o, a. We use “entity” to describe individual detected objects to distinguish from “object” in the semantic sense, and use “relationships” to describe the entire tuple, not to be confused with “predicate,” which is an element of said tuple.
Scene Graph Parsing: A large number of scene graph parsing approaches have emerged during the last couple of years. They share a pipeline that first detects entities, using either off-the-shelf detectors [15, 44, 38, 3, 35, 31] or detectors fine-tuned on relationship datasets [11, 29, 37, 40, 41, 32, 30], and then predicts the predicate using the proposed method. Most of them [15, 44, 38, 3, 35, 32, 11, 29, 37, 40, 42, 43] model the second step as a classification task that takes features of each entity pair as input and outputs a label independently of other pairs. [41] instead learns embeddings for subjects, predicates and objects and uses nearest neighbor search during testing to predict predicates.
Figure 2: Examples of Entity Instance Confusion and Proximal Relationship Ambiguity. Red boxes highlight relationships our baseline model predicts incorrectly. (a) the man is not holding the predicted wine glass. (b) the guitar player on the right is not playing drum.
Nevertheless, the prediction is still done on each entity pair individually. We show that this pipeline struggles in two major scenarios, and we find that the main cause is that it ignores the intrinsic graph structure of relationships and predicts each predicate separately. Our proposed losses compensate for this drawback by contrasting positive against negative edges for each node, providing global supervision to the classifier and significantly alleviating both issues.
The scene graph parsing work most related to ours is Associative Embedding [21]. They use a push and pull contrastive loss to train embeddings for entities within a Visual Genome scene graph. Our work differs in that we propose different sets of hard negatives to target specific error types within scene graph parsing.
Phrase Grounding and Referring Expressions: Phrase grounding and referring expression models aim to localize the region described by a given expression, with the latter focusing more on cases of possible reference confusion [33, 17, 34, 20, 7, 16, 25, 27, 14, 2, 6, 23]. The task can be abstracted as a bipartite graph matching problem, where nodes on the visual side are the regions and nodes on the language side are the expressions, and the goal is to find all matched pairs. In contrast, scene graphs are arbitrarily connected graphs whose nodes are visual entities and whose edges are predicates with rich semantic information. Our losses are designed to leverage that information to better discriminate between related and non-related entities.
Contrastive Training: Contrastive training using a triplet loss [8] has wide application in both computer vision and natural language processing. Representative works include Negative Sampling [18] and Noise Contrastive Sampling [19]. More recent work also utilizes it to solve multi-modal tasks such as phrase grounding, image captioning, VQA, and vector embeddings [27, 5, 33, 21]. Our setting differs in that we define hard negative contrastive margins along the known structure of the annotated scene graph, allowing us to specifically target entity instance and proximal relationship confusion. By adding our losses as additional supervision on top of the N-way cross-entropy loss, we are able to improve the model by significant margins.
Our Graphical Contrastive Losses encompass three types of loss, each addressing the two aforementioned issues in its own way: 1) Class Agnostic: contrasts positive/negative entity pairs regardless of their relation and adds contrastive supervision for generic cases; 2) Entity Class Aware: addresses the issue in Figure 2a by focusing on entities with the same class; 3) Predicate Class Aware: addresses the issue in Figure 2b by focusing on entity pairs with the same potential predicate. We define our contrastive losses over an affinity term $\Phi(s, o)$, which can be interpreted as the probability that subject s and object o have some relationship or interaction. Given a model that outputs the distribution over predicate classes conditioned on a subject and object pair p(pred|s, o), we define $\Phi(s, o)$ as:
$$\Phi(s, o) = 1 - p(\mathrm{pred} = \varnothing \mid s, o) \tag{1}$$
where $\varnothing$ is the class symbol representing no relationship. This is equivalent to summing over all predicate classes except $\varnothing$.
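As a small illustration of the affinity term (one minus the probability of the no-relationship class), the following sketch computes it from a predicate distribution and checks that it matches the sum over all other classes. The position of the no-relationship class in the output vector (`no_rel_index`) and the example probabilities are assumptions made for illustration only.

```python
def affinity(pred_probs, no_rel_index=0):
    """Affinity of a (subject, object) pair: 1 - p(no relationship).

    pred_probs: probabilities over predicate classes, summing to 1.
    no_rel_index: assumed position of the "no relationship" class.
    """
    return 1.0 - pred_probs[no_rel_index]


# Hypothetical distribution over [no relationship, "holds", "plays"].
probs = [0.25, 0.5, 0.25]
phi = affinity(probs)

# Equivalent view: sum over every class except "no relationship".
phi_sum = sum(p for k, p in enumerate(probs) if k != 0)
```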
3.1. Class Agnostic Loss
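Our first contrastive loss term aims to maximize the affinity of the lowest scoring positive pairing and minimize the affinity of the highest scoring negative pairing. For a subject indexed by i and an object indexed by j, the margins we wish to maximize can be written as:
$$m_1^s(i) = \min_{j \in V_i^{o+}} \Phi(s_i, o_j^+) - \max_{k \in V_i^{o-}} \Phi(s_i, o_k^-)$$
$$m_1^o(j) = \min_{i \in V_j^{s+}} \Phi(s_i^+, o_j) - \max_{k \in V_j^{s-}} \Phi(s_k^-, o_j) \tag{2}$$
where $V_i^{o+}$ and $V_i^{o-}$ represent the sets of objects related to and not related to subject $s_i$; $V_j^{s+}$ and $V_j^{s-}$ are defined similarly for object j as the sets of subjects related to and not related to $o_j$.
The class agnostic loss for all sampled positive subjects and objects is written as:
$$L_1 = \frac{1}{N} \sum_{i=1}^{N} \max(0, \alpha_1 - m_1^s(i)) + \frac{1}{N} \sum_{j=1}^{N} \max(0, \alpha_1 - m_1^o(j)) \tag{3}$$
where N is the number of annotated entities and $\alpha_1$ is the margin threshold.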
This loss contrasts positive and negative (s, o) pairs, ignoring any class information, and is similar to the triplet losses used in the referring expression and phrase grounding literature. We found that it works well in our scenario, and even better when combined with the following class-aware losses, as shown in Table 1.
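The class agnostic loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: affinities are given as a dense matrix `phi[i][j]` for subject i and object j, `positives` is the set of annotated (i, j) pairs, anchors without both a positive and a negative partner are skipped, and each side is averaged over the anchors that remain rather than over all N entities.

```python
def _mean_hinge(margins, alpha):
    """Average of max(0, alpha - m) over the given margins (0.0 if empty)."""
    if not margins:
        return 0.0
    return sum(max(0.0, alpha - m) for m in margins) / len(margins)


def class_agnostic_loss(phi, positives, alpha=0.2):
    """Hinge loss on (lowest positive affinity - highest negative affinity)."""
    n_subj, n_obj = len(phi), len(phi[0])
    subj_margins, obj_margins = [], []
    # Subject-anchored margins: fix subject i, vary the object.
    for i in range(n_subj):
        pos = [phi[i][j] for j in range(n_obj) if (i, j) in positives]
        neg = [phi[i][j] for j in range(n_obj) if (i, j) not in positives]
        if pos and neg:
            subj_margins.append(min(pos) - max(neg))
    # Object-anchored margins: fix object j, vary the subject.
    for j in range(n_obj):
        pos = [phi[i][j] for i in range(n_subj) if (i, j) in positives]
        neg = [phi[i][j] for i in range(n_subj) if (i, j) not in positives]
        if pos and neg:
            obj_margins.append(min(pos) - max(neg))
    return _mean_hinge(subj_margins, alpha) + _mean_hinge(obj_margins, alpha)


# Hypothetical 2x2 affinity matrix with positives on the diagonal.
phi = [[0.9, 0.8], [0.3, 0.7]]
positives = {(0, 0), (1, 1)}
loss = class_agnostic_loss(phi, positives, alpha=0.2)
```

In this toy case subject 0 separates its positive from its negative by only 0.1, and object 1 ranks a negative above its positive, so both contribute a non-zero hinge term.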
3.2. Entity Class Aware Loss
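The Entity Class Aware loss deals with entity instance confusion, in which the model struggles to determine interactions between a subject (object) and multiple instances of a same-class object (subject). It can be viewed as an extension of the Class Agnostic loss where we further specify a class c when populating the positive and negative sets $V^+$ and $V^-$. We extend the formulation in equation (3) as:
$$m_2^s(i, c) = \min_{j \in V_i^{c,o+}} \Phi(s_i, o_j^+) - \max_{k \in V_i^{c,o-}} \Phi(s_i, o_k^-)$$
$$m_2^o(j, c) = \min_{i \in V_j^{c,s+}} \Phi(s_i^+, o_j) - \max_{k \in V_j^{c,s-}} \Phi(s_k^-, o_j) \tag{4}$$
where the sets $V^{c,+}$ and $V^{c,-}$ are now constrained to instances of class c.
The entity class aware loss for all sampled positive subjects and objects is defined as
$$L_2 = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C(V_i^{o+})|} \sum_{c \in C(V_i^{o+})} \max(0, \alpha_2 - m_2^s(i, c)) + \frac{1}{N} \sum_{j=1}^{N} \frac{1}{|C(V_j^{s+})|} \sum_{c \in C(V_j^{s+})} \max(0, \alpha_2 - m_2^o(j, c)) \tag{5}$$
where C() returns the set of unique classes of the sets $V_i^{o+}$ and $V_j^{s+}$ as defined in the class agnostic loss. Compared to the class agnostic loss, which maximizes the margins across all instances, this loss maximizes the margins between instances of the same class. It forces a model to disentangle confusing entities as illustrated in Figure 2a, where the subject has several potentially related objects of the same class.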
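To make the class restriction concrete, here is a hedged sketch of the subject-anchored half of the entity class aware loss (the object-anchored half is symmetric). The input layout (`phi`, `positives`, `obj_classes`) is hypothetical, and as a simplification the sketch averages only over classes that have at least one same-class negative, since the margin is undefined otherwise.

```python
def entity_class_aware_loss_subject_side(phi, positives, obj_classes, alpha=0.2):
    """Subject-anchored entity class aware hinge loss (illustrative sketch).

    phi[i][j]: affinity of subject i and object j.
    positives: set of annotated (i, j) pairs.
    obj_classes[j]: entity class of object j.
    """
    per_subject = []
    for i, row in enumerate(phi):
        cols = range(len(row))
        # Classes of the objects that subject i is actually related to.
        pos_classes = {obj_classes[j] for j in cols if (i, j) in positives}
        hinges = []
        for c in pos_classes:
            same_class = [j for j in cols if obj_classes[j] == c]
            pos = [row[j] for j in same_class if (i, j) in positives]
            neg = [row[j] for j in same_class if (i, j) not in positives]
            if neg:  # margin only defined with a same-class negative
                hinges.append(max(0.0, alpha - (min(pos) - max(neg))))
        if hinges:
            per_subject.append(sum(hinges) / len(hinges))
    return sum(per_subject) / len(per_subject) if per_subject else 0.0


# One subject (a man) and four objects: three glasses and a table.
obj_classes = ["glass", "glass", "glass", "table"]
phi = [[0.6, 0.5, 0.2, 0.9]]
positives = {(0, 1), (0, 3)}  # he holds glass 1 and sits at the table
loss = entity_class_aware_loss_subject_side(phi, positives, obj_classes)
```

Here the unheld glass 0 scores higher than the held glass 1, so the "glass" group is penalized even though the highly scored table would dominate a class agnostic comparison.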
3.3. Predicate Class Aware Loss
Similar to the entity class aware loss, this loss maximizes the margins within groups of instances determined by their associated predicates. It is designed to deal with the proximal relationship ambiguity as exemplified in Figure 2b, where instances joined by the same predicate class are within close proximity of each other. In the context of Figure 2b, this loss would encourage the correct pairing of who is playing which instrument by penalizing wrong pairings, i.e., “man plays drum” in the red box. Replacing the class groupings in equation (4) with predicate groupings restricted to predicate class e, we define our margins to maximize as:
$$m_3^s(i, e) = \min_{j \in V_i^{e,o+}} \Phi(s_i, o_j^+) - \max_{k \in V_i^{e,o-}} \Phi(s_i, o_k^-)$$
$$m_3^o(j, e) = \min_{i \in V_j^{e,s+}} \Phi(s_i^+, o_j) - \max_{k \in V_j^{e,s-}} \Phi(s_k^-, o_j) \tag{6}$$
Here, we define the sets $V_i^{e,o+}$ and $V_j^{e,s+}$ as the sets of subject-object pairs where the ground truth predicate between $s_i$ and $o_j$ is e, anchored with respect to subject i and object j respectively. We define the sets $V_i^{e,o-}$ and $V_j^{e,s-}$ as the sets of instances where the model incorrectly predicts (via argmax) the predicate to be e, anchored with respect to subject i and object j respectively.
The predicate class aware loss for all sampled positive subjects and objects is defined as
$$L_3 = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|E(V_i^{o+})|} \sum_{e \in E(V_i^{o+})} \max(0, \alpha_3 - m_3^s(i, e)) + \frac{1}{N} \sum_{j=1}^{N} \frac{1}{|E(V_j^{s+})|} \sum_{e \in E(V_j^{s+})} \max(0, \alpha_3 - m_3^o(j, e)) \tag{7}$$
where E() returns the set of unique predicates associated with the input (excluding $\varnothing$). The final loss is expressed as:
$$L = L_0 + \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3 \tag{8}$$
where $L_0$ is the cross-entropy loss over predicate classes.
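The positive and negative sets of the predicate class aware loss can be illustrated for a single subject and a single predicate class e. This is only a sketch: the label `NO_REL`, the function name, and the list-based inputs are hypothetical, and treating a negative as "ground truth is no relationship but the argmax prediction is e" is one reading of "incorrectly predicts", adopted here as an assumption.

```python
NO_REL = "none"  # hypothetical label for the no-relationship class


def predicate_margin_for_subject(phi_row, gt_preds, argmax_preds, e):
    """Margin for one subject and predicate class e, or None if undefined.

    phi_row[j]: affinity between the fixed subject and object j.
    gt_preds[j]: ground truth predicate for (subject, object j).
    argmax_preds[j]: the model's argmax predicate for (subject, object j).
    """
    # Positives: objects truly joined to the subject by predicate e.
    pos = [phi_row[j] for j, g in enumerate(gt_preds) if g == e]
    # Negatives: unrelated objects that the model nevertheless labels e.
    neg = [phi_row[k]
           for k, (g, p) in enumerate(zip(gt_preds, argmax_preds))
           if g == NO_REL and p == e]
    if not pos or not neg:
        return None
    return min(pos) - max(neg)


# A musician with three nearby objects: guitar, drum, microphone.
phi_row = [0.7, 0.6, 0.4]
gt_preds = ["plays", NO_REL, NO_REL]
argmax_preds = ["plays", "plays", "holds"]
margin = predicate_margin_for_subject(phi_row, gt_preds, argmax_preds, "plays")
hinge = max(0.0, 0.2 - margin)  # alpha = 0.2 chosen for illustration
```

The unplayed drum is predicted as "plays" with an affinity close to that of the guitar, so the margin falls below the threshold and the pair is penalized.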
3.4. Complexity Analysis
We look at the case where the subject is fixed and we vary the object for positive/negative pairings. The reverse case (object fixed, subject varies) has the same complexity. All sampling is conducted on the entities of a single image per batch. The set of entities includes ground truth bounding boxes, as well as any detector output with at least 0.5 IoU with ground truth entities.
For the Class Agnostic Loss $L_1$, the computational complexity of the sampling procedure is $O(N^2)$, where N is the upper bound on the number of sampled entities per image. In practice, for each subject, we randomly sample at most K non-related objects (negative pairings), which makes the actual complexity O(NK).
For the Entity Class Aware Loss $L_2$, the sampling procedure is the same as with $L_1$, except that we need to keep only those non-related objects that are of class c, i.e., the object class of the current o in the sampled (s, o) pair. This involves a filtering operation on the K objects which takes O(K) time, therefore the overall complexity is still O(NK).
The analysis for the Predicate Class Aware Loss $L_3$ is similar to that of $L_2$, except that the filtering operation looks at the predicate class e instead of the object class c. The overall complexity is also O(NK).
We set N = 512 and K = 64 per batch in practice.
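The cap of K negatives per subject can be sketched as follows. This is an illustrative assumption about how such sampling might be written, not the authors' code: the function name and inputs are hypothetical, and building the candidate list is itself linear in the number of entities here, so the sketch only demonstrates that the number of negative pairings the loss must score is bounded by N·K.

```python
import random


def sample_negative_pairs(num_entities, related_pairs, k, rng):
    """For each subject index, sample at most k unrelated object indices."""
    negatives = {}
    for s in range(num_entities):
        candidates = [o for o in range(num_entities)
                      if o != s and (s, o) not in related_pairs]
        negatives[s] = rng.sample(candidates, min(k, len(candidates)))
    return negatives


rng = random.Random(0)            # fixed seed for reproducibility
related = {(0, 1), (2, 3)}        # hypothetical annotated pairs
negatives = sample_negative_pairs(6, related, 2, rng)
num_negative_pairs = sum(len(v) for v in negatives.values())
```

With the paper's settings (N = 512, K = 64), the same bound gives at most 512 × 64 = 32,768 negative pairings per image for the subject-anchored case.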
We demonstrate the efficacy of our proposed losses with our Relationship Detection Network (RelDN). The RelDN follows a two stage pipeline: it first identifies a proposal set of likely subject-object relationship pairs, then extracts features from these candidate regions to perform a fine-grained classification into a predicate class. We build a separate CNN branch for predicates (conv body rel) with the same structure as that of entity detector CNN (conv body det) to extract predicate features. The intuition for having a separate branch is that we want visual features for predicates to focus on the interactive areas of subjects and objects as opposed to individual entities. As Figure 4 illustrates, the predicate CNN clearly learns better features which concentrate on regions that strongly imply relationships.
The first stage of the RelDN exhaustively returns bounding box regions containing every pair. In the second stage, it computes three types of features for each relationship proposal: semantic, visual, and spatial. Each feature is used to output a set of class logits, which we combine via elementwise addition, and apply softmax normalization to attain a probability distribution over predicate classes. See Figure 3 for our model pipeline.
Semantic Module: The semantic module conditions the predicate class prediction on subject-object class co-occurrence frequencies. It is inspired by Zellers et al. [37], who introduced a frequency baseline that performs reasonably well on Visual Genome by counting frequencies of predicates given subject and object. The motivation is that, in general, the set of plausible relationships between two entities is very limited, e.g., the relationship between a person-horse subject-object pairing is most likely to be ride, walk, or feed, and unlikely to be stand on or wear. Over the training images, we count the occurrences of predicate class pred given subject and object classes s and o in the ground truth annotations. This gives us an empirical distribution p(pred|s, o). We assume that the test set is also drawn from the same distribution.
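The counting step behind the empirical distribution p(pred|s, o) can be sketched as below. The function name and the toy triplets are made up for illustration; how the paper converts these frequencies into class logits is not specified here and is left out.

```python
from collections import Counter, defaultdict


def predicate_prior(triplets):
    """Empirical p(pred | s, o) from (subject, predicate, object) triplets."""
    counts = defaultdict(Counter)
    for s, pred, o in triplets:
        counts[(s, o)][pred] += 1
    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values())
        prior[pair] = {p: n / total for p, n in pred_counts.items()}
    return prior


# Hypothetical ground truth annotations pooled over training images.
train_triplets = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "feed", "horse"),
    ("person", "wear", "hat"),
]
prior = predicate_prior(train_triplets)
```

For the person-horse pairing this yields a probability of 2/3 for "ride" and 1/3 for "feed", while predicates never seen with that pairing receive no mass.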
Spatial Module: The spatial module conditions the predicate class predictions on the relative positions of the subject and object. One major group of predicates describes positions, for example, “on”, “under”, or “inside of.” These predicate types can often be inferred using only relative spatial information. We capture spatial information by encoding the box coordinates of subjects and objects using the box delta [24] and normalized coordinates.
We define the delta feature between two sets of bounding box coordinates as follows:
$$\Delta(b_1, b_2) = \left\langle \frac{x_1 - x_2}{w_2}, \frac{y_1 - y_2}{h_2}, \log\frac{w_1}{w_2}, \log\frac{h_1}{h_2} \right\rangle \tag{9}$$
where $b_1$ and $b_2$ are two coordinate tuples in the form of (x, y, w, h).
We then compute the normalized coordinate features for a bounding box b as follows:
$$c(b) = \left\langle \frac{x}{w_{img}}, \frac{y}{h_{img}}, \frac{x + w}{w_{img}}, \frac{y + h}{h_{img}}, \frac{w h}{w_{img} h_{img}} \right\rangle \tag{10}$$
where $w_{img}$ and $h_{img}$ are the width and height dimensions of the image. Our spatial feature vector for the subject, object, and predicate bounding boxes $b_s$, $b_o$, $b_{pred}$ is represented as:
$$\langle \Delta(b_s, b_o), \Delta(b_s, b_{pred}), \Delta(b_{pred}, b_o), c(b_s), c(b_o) \rangle \tag{11}$$
Note that $b_{pred}$ is the tightest bounding box around $b_s$ and $b_o$. This feature vector is fed through an MLP to attain predicate class logit scores.
Figure 3: The RelDN model architecture. The structures of conv body det and conv body rel are identical. We freeze the weights of the former and only train the latter.
Figure 4: Visualization of CNN features by averaging over the channel dimension of convolution feature maps [36]. (a) shows the image ground truth relationships, (b) shows the convolution feature from the entity detector backbone, and (c) shows the feature from the predicate backbone. In all three examples there are clear shifts of salience from large entities to small areas that strongly indicate the predicates (highlighted in white boxes).
Visual Module: The visual module produces a set of class logits conditioned on ROI feature maps, as in the Fast R-CNN pipeline. We extract subject and object ROI features from the entity detector’s convolution layers (conv body det in Figure 3) and extract predicate ROI features from the relationship convolution layers (conv body rel in Figure 3). The subject, object, and predicate feature vectors are concatenated and passed through an MLP to attain the predicate class logits.
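The spatial encoding described for the Spatial Module can be sketched as follows. The exact formulas are assumptions: the delta follows the standard box regression encoding of [24], the normalized coordinates divide corner positions and area by the image size, and (x, y) is taken to be the top-left corner of an (x, y, w, h) box. All function names are hypothetical.

```python
import math


def box_delta(b1, b2):
    """Box delta between two (x, y, w, h) boxes (assumed encoding)."""
    x1, y1, w1, h1 = b1
    x2, y2, w2, h2 = b2
    return [(x1 - x2) / w2, (y1 - y2) / h2,
            math.log(w1 / w2), math.log(h1 / h2)]


def normalized_coords(b, img_w, img_h):
    """Corners and area of box b normalized by the image size (assumed form)."""
    x, y, w, h = b
    return [x / img_w, y / img_h, (x + w) / img_w, (y + h) / img_h,
            (w * h) / (img_w * img_h)]


def union_box(b1, b2):
    """Tightest (x, y, w, h) box enclosing both boxes."""
    x = min(b1[0], b2[0])
    y = min(b1[1], b2[1])
    x_max = max(b1[0] + b1[2], b2[0] + b2[2])
    y_max = max(b1[1] + b1[3], b2[1] + b2[3])
    return (x, y, x_max - x, y_max - y)


def spatial_feature(b_s, b_o, img_w, img_h):
    """Concatenate the deltas and normalized coordinates into one vector."""
    b_pred = union_box(b_s, b_o)
    return (box_delta(b_s, b_o) + box_delta(b_s, b_pred)
            + box_delta(b_pred, b_o)
            + normalized_coords(b_s, img_w, img_h)
            + normalized_coords(b_o, img_w, img_h))


b_s, b_o = (10, 20, 30, 40), (20, 10, 40, 20)  # hypothetical boxes
feat = spatial_feature(b_s, b_o, img_w=100, img_h=100)
```

Under these assumptions the feature has 3 × 4 + 2 × 5 = 22 dimensions, which would then be passed to the MLP.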
We also include two skip-connections projecting subject-only and object-only ROI features to the predicate class logits. These skip connections are inspired by the observation that many relationships, such as human interactions [4], can be accurately inferred from the appearance of only the subjects or objects. We show an improvement from adding these skip connections in Section 6.4.
Module Fusion: As illustrated in Figure 3, we obtain the final probability distribution over predicate classes by adding the three scores followed by softmax normalization:
$$p_{pred} = \mathrm{softmax}(f_{vis} + f_{spt} + f_{sem}) \tag{12}$$
where $f_{vis}$, $f_{spt}$, $f_{sem}$ are unnormalized class logits from the visual, spatial, and semantic modules.
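The fusion step (elementwise addition of the three logit vectors followed by softmax) is simple enough to show directly. The logit values below are made up, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(f_vis, f_spt, f_sem):
    """Add the three modules' logits elementwise, then normalize."""
    summed = [v + p + m for v, p, m in zip(f_vis, f_spt, f_sem)]
    return softmax(summed)


# Hypothetical logits over three predicate classes.
p_pred = fuse(f_vis=[1.0, 0.0, 0.0],
              f_spt=[0.0, 2.0, 0.0],
              f_sem=[0.0, 0.5, 0.0])
best = max(range(len(p_pred)), key=lambda k: p_pred[k])
```

Adding logits before normalization lets a confident module override a weak one: here the visual module alone prefers class 0, but the combined spatial and semantic evidence moves the prediction to class 1.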
We train the entity detector CNN (conv body det) independently using entity annotations, then fix it when training our model. While previous works [11, 3, 32] claim it is beneficial to fine-tune the entity detector end-to-end with the second stage of the pipeline, we opt to freeze our entity detector weights for simplicity. We initialize the predicate CNN (conv body rel) with the entity detector’s weights and fine-tune it end-to-end with the second stage.
During training, we independently sample positive and negative pairs for each loss, subject to their respective constraints. For $L_0$, we sample 512 pairs in total, of which 128 are positive. For our class-agnostic loss, we sample 128 positive subjects, then for each of them sample the two closest contrastive pairs according to Eq. 2; we do the sampling symmetrically for objects. For our entity and predicate class aware losses, we sample in the same way as for the class-agnostic loss, except that negative pairs are grouped by entity and predicate classes, as described in Eq. 4 and Eq. 6. We set the loss weights $\lambda_1$, $\lambda_2$, $\lambda_3$ by cross-validation and use the same values for all experiments.
During testing, we take up to 100 outputs from the entity detector and exhaustively group all pairs as relationship proposals/entity pairs. We rank relationship proposals by multiplying the predicted subject, object, and predicate probabilities as $p_{det}(s) \cdot p_{det}(o) \cdot p_{pred}(pred)$, where $p_{det}(s)$ and $p_{det}(o)$ are the probabilities of the predicted subject and object classes from the entity detector, and $p_{pred}(pred)$ is the probability of the predicted predicate class from the result of Eq. 12.
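The test-time ranking rule (product of subject, object, and predicate probabilities) can be sketched as below. The dictionary keys and the example proposals are hypothetical.

```python
def rank_proposals(proposals):
    """Sort relationship proposals by p(subject) * p(object) * p(predicate)."""
    def score(p):
        return p["p_subj"] * p["p_obj"] * p["p_pred"]
    return sorted(proposals, key=score, reverse=True)


# Two hypothetical proposals from the same image.
proposals = [
    {"name": "A", "p_subj": 0.9, "p_obj": 0.8, "p_pred": 0.5},  # score 0.36
    {"name": "B", "p_subj": 0.6, "p_obj": 0.9, "p_pred": 0.9},  # score 0.486
]
ranked = rank_proposals(proposals)
```

Proposal B ranks first despite its weaker subject detection, because a confident predicate prediction outweighs the difference in entity scores.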
To match the architectures of previous state-of-the-art methods, we use ResNeXt-101-FPN [28, 13] as our OpenImages backbone and VGG-16 on Visual Genome (VG) and Visual Relationship Detection (VRD).
We present experimental results on three datasets: OpenImages (OI) [9], Visual Genome (VG) [10] and Visual Relationship Detection (VRD) [15]. We first report evaluation settings, followed by ablation studies and finally external comparisons.
6.1. Evaluation Settings
OpenImages: The full train and val sets contain 53,953 and 3,234 images, respectively, on which our model takes 2 days to train. For quick comparisons, we sample a “mini” subset of 4,500 train and 1,000 validation images, where predicate classes are sampled proportionally with a minimum of one instance per class in train and val. We first conduct parameter searches on the mini set, then train and compare with the top model of the OpenImages VRD Challenge [1] on the full set. We show two types of results, one using the same entity detector as the top model, and the other using a detector that we train ourselves, initialized with COCO pre-trained weights.
In the OpenImages Challenge, results are evaluated by calculating Recall@50 (R@50), mean AP of relationships ($mAP_{rel}$), and mean AP of phrases ($mAP_{phr}$). The final score is obtained by $score = 0.2 \times R@50 + 0.4 \times mAP_{rel} + 0.4 \times mAP_{phr}$. The $mAP_{rel}$ evaluates AP of ⟨s, pred, o⟩ triplets where both the subject and object boxes have an IoU of at least 0.5 with ground truth. The $mAP_{phr}$ is similar, but applied to the enclosing relationship box. In practice, we find $mAP_{rel}$ and $mAP_{phr}$ to suffer from extreme predicate class imbalance. For example, 64.48% of the relationships in val have the predicate “at”, while only 0.03% of them are “under”. This means a single “under” relationship is worth much more than the more common “at” relationships. We address this by weighting each predicate category by its relative frequency in the val set, which we refer to as the weighted mAP (wmAP). We use wmAP in all of our ablation studies (Tables 1-4), in addition to reporting $score_{wtd}$, which replaces mAP with wmAP in the score formula.
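The effect of frequency weighting on the mean AP can be seen in a small worked example. The per-class AP values and counts are invented, the function names are hypothetical, and the 0.2/0.4/0.4 weights are the OpenImages Challenge score weights as understood here; treat them as an assumption if they differ from the official definition.

```python
def challenge_score(recall_at_50, map_rel, map_phr):
    """Weighted sum of R@50, relationship mAP and phrase mAP."""
    return 0.2 * recall_at_50 + 0.4 * map_rel + 0.4 * map_phr


def uniform_map(ap_per_class):
    """Standard mAP: every class counts equally."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def weighted_map(ap_per_class, count_per_class):
    """wmAP sketch: each class weighted by its relative frequency."""
    total = sum(count_per_class.values())
    return sum(ap * count_per_class[c] / total
               for c, ap in ap_per_class.items())


# Hypothetical validation set: 99 "at" relationships, 1 "under".
ap = {"at": 0.5, "under": 1.0}
counts = {"at": 99, "under": 1}
plain = uniform_map(ap)              # the single "under" instance is half the metric
weighted = weighted_map(ap, counts)  # dominated by the frequent class
```

A single lucky or unlucky prediction on the rare class moves the uniform mean by a large amount but barely changes the weighted one, which is the instability the weighted metric is meant to remove.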
We compare with other top models on the official evaluation server. The official test set is split into a Public and Private set with a 30%/70% split. The Public set is used as a dev set. We present individual results for both, as well as their weighted average under Overall in Table 9.
Visual Genome: We follow the same train/val splits and evaluation metrics as [37]. We train our entity detector initialized with COCO pre-trained weights. Following [37], we conduct three evaluations: scene graph detection (SGDET), scene graph classification (SGCLS), and predicate classification (PRDCLS). We report results for these tasks with and without the Graphical Contrastive Losses.
VRD: We evaluate our model with entity detectors initialized with ImageNet and COCO pre-trained weights. We use the same evaluation metrics as in [35], which reports R@50 and R@100 for relationship predictions at 1, 10, and 70 predicates per entity pair.
6.2. Loss Analysis
Loss Combinations: We now look at whether our proposed losses reduce the two aforementioned errors without affecting the overall performance, and whether all three losses are necessary. Results in Table 1 show that the combination of all three losses with the N-way cross-entropy loss ($L_0 + L_1 + L_2 + L_3$) has consistently superior performance over just $L_0$. Notably, AP on “holds” improves from 41.84 to 43.09 (+1.3). It improves even more significantly on “plays”, from 36.04 to 41.04 (+5.0), and on “interacts with”, from 40.43 to 44.16 (+3.7). These three classes suffer the most from the two aforementioned problems. Our results also show that any subset of the losses is worse than the entire ensemble: the partial combinations are inferior to $L_0 + L_1 + L_2 + L_3$, especially on “holds”, “plays”, and “interacts with”, where the largest margin is 3.87 (on “plays”).
To better verify the isolated impact of our losses, we carefully sample a subset of 100 images containing five predicates that significantly suffer from the two aforementioned problems. The five predicates are “at”, “holds”, “plays”, “interacts with”, and “wears”. We select the images by visual inspection of a random set of raw images, keeping those that exhibit either entity instance confusion or proximal relationship ambiguity. Example images can be found in Figure 7. Table 2 shows a comparison of our losses with $L_0$ only on this subset. The overall gap is 1.4 and the largest gap is 4.1 in AP on “holds”.
Table 1: Ablation Study on our losses. We report a frequency-balanced wmAP instead of mAP, as the test set is extremely imbalanced and would fluctuate wildly otherwise (see fluctuations in columns “under” and “hits”). We also report $score_{wtd}$, which is the official OI scoring formula but with wmAP in place of mAP. “Under” and “hits” are not highlighted due to having too few instances.
Table 2: Comparison of our model with Graphical Contrastive Loss vs. without the loss on 100 images containing the 5 classes that suffer from the two aforementioned confusions, selected via visual inspection on a random set of images.
Table 3: Ablation Study on RelDN modules. “sem only” means using only the semantic module without training any model; ⟨S,P,O⟩ means using only the ⟨S,P,O⟩ concatenation without the separate S,O layers in the visual module; “vis” means our full visual module, and “spt” means the spatial module. “Under” and “hits” are not highlighted due to having too few instances.
Figure 5: Example results of RelDN with only $L_0$ and with our losses. The top row shows RelDN outputs and the bottom row visualizes the learned predicate CNN features of the two models. Red and green boxes highlight the wrong and right outputs (the first row) or feature saliency (the second row). As shown, our losses force the model to attend to the representative regions that discriminate the correct relationships against unrelated entity pairs, and the model is thus able to disentangle entity instance confusion and proximal relationship ambiguity.
Table 4: Ablation Study on the margin threshold m. We use m = 0.2 everywhere in our experiments.
Table 5: Ablation Study on our losses with the official $mAP_{rel}$, $mAP_{phr}$ and score metrics. A metric marked with a * means “under” and “hits” are excluded from evaluation. The fluctuating numbers in mAP and score indicate that the mAP metrics are unstable and unreliable, while when “under” and “hits” are excluded, all the results become consistent with Table 1.
Table 6: Comparison of our model with Graphical Contrastive Loss vs. without the loss on 100 images containing the 5 classes that suffer from the two aforementioned confusions, selected via visual inspection on a random set of images. The metrics are the official mAP and the score. The “under” and “hits” predicates are not in this 100 image subset.
Figure 5 shows two examples from this subset, one containing entity instance confusion and the other containing proximal relationship ambiguity. In Figure 5a, the model with only $L_0$ fails to identify the wine glass being held, while with our losses added, the area surrounding the correct wine glass lights up. In Figure 5b, the relationship is incorrectly predicted because the $L_0$-only model mistakenly pairs the unplayed drum with the singer, a reasonable error considering the number of person-play-drum examples as well as the relative proximity of the singer and the drum. Our losses successfully suppress that region and attend to the correct microphone being held, demonstrating the effectiveness of our hard-negative sampling strategies.
Margin Thresholds: We study the effects of various values of the margin thresholds used in Eq. 3, 5, and 7. For each experiment, we set $\alpha_1 = \alpha_2 = \alpha_3 = m$ while varying m. As shown in Table 4, consistent with previous work [8, 26], m = 0.1 or m = 0.2 achieves the best performance. Note that m = 1.0 is the largest possible margin, as our affinity scores range from 0 to 1.
6.3. Loss Analysis with the Official mAP metrics
Here, we show our ablation studies using the official uniform-class-weighting evaluation metrics, $mAP_{rel}$, $mAP_{phr}$ and score. We also include $mAP^*_{rel}$, $mAP^*_{phr}$ and score*, which are the standard mAP and score computed with “under” and “hits” excluded from the evaluation. Table 5 presents ablation study results on loss components. Table 6 shows a comparison between the $L_0$-only model and the model with our losses on the 100 selected images. In Table 5, the variation of numbers using mAP and score demonstrates the necessity of de-emphasizing the extremely infrequent classes. Note that the mAP*-based columns show a similar trend to our wmAP-based results from the paper. In Table 6, the model with our losses is still better than the $L_0$-only model by a non-trivial margin, mainly because the former outperforms the latter on almost every per-class AP metric for those 5 selected classes. Note that since “under” and “hits” are not in the 100 image subset, there is no need to evaluate with $mAP^*_{rel}$, $mAP^*_{phr}$ and score*.
6.4. Model Analysis
We evaluate the effectiveness of the three modules of the RelDN. For the visual module, we also investigate the two skip-connections. As Table 3 shows, the semantic module alone cannot solve relationship detection by using language bias only. By adding the basic visual feature, i.e., the ⟨S,P,O⟩ concatenation, we see a significant 4.7 gain, which is further improved by adding the separate S,O skip-connections, especially on “plays” (+3.1), “interacts with” (+1.0), and “wears” (+2.0), where subjects’ or objects’ appearance and poses are highly representative of the interactions. Finally, adding the spatial module gives the best results, and the most obvious gaps are on spatial relationships, i.e., “at” (+0.2), “on” (+0.2), “inside of” (+2.4).
6.5. Comparison to State of the Art
OpenImages: We present results compared with the top 5 models from the Challenge in Table 9. We surpass the 1st place Seiji by 4.7% on the Private set and 2.9% on the full set, which is a significant margin considering the low absolute scores and the large number of test images (99,999 in total). Even when using the same entity detector as Seiji, we obtain noticeable gaps (1.4% and 0.8%) on the two sets.
Visual Genome: Table 7 shows that our model is better than the state of the art on all metrics. It outperforms the previous best, MotifNet-LeftRight, by a 2.4% gap on Scene Graph Detection (SGDET) with Recall@100 and by a 12.7% gap on Predicate Classification (PRDCLS) with Recall@50. Note that although our entity detector is better than MotifNet-LeftRight on mAP at 50% IoU (25.5 vs. 20.0), our implementation of the Frequency+Overlap baseline (Recall@20: 16.2, Recall@50: 19.8, Recall@100: 21.5) is not better than their version (Recall@20: 21.0, Recall@50: 26.2, Recall@100: 30.1), indicating that our better relationship performance mostly comes from our model design.
Table 7: Comparison with the state of the art on VG. $L_0$ is the RelDN without our losses. We also include results of our model with ResNeXt-101-FPN as the backbone for future work reference.
Table 8: Comparison with the state of the art on VRD (“-” means unavailable / unknown). As in Table 7, $L_0$ is the RelDN without our losses. “Free k” means considering k as a hyper-parameter that can be cross-validated.
Table 9: Comparison with models from the OpenImages Challenge. The marked RelDN entry uses the same entity detector as Seiji, the champion model. Overall is computed as 0.3*Public+0.7*Private. Note that this table uses the official $mAP_{rel}$ and $mAP_{phr}$.
We also observe that our losses achieve smaller gains over the standard cross-entropy loss setup than they do on OpenImages mini. The reasons are two-fold: 1) One of the few dominant relationship types in the Visual Genome dataset is possessive, e.g., “ear of man”, which has far fewer entity confusion issues; 2) The Recall@k metric is less strict than mAP. If an image has only one ground truth relationship, then Recall@100 will always be 100% as long as this ground truth target is within the top 100 model predictions, regardless of the ranking of the 100 outputs. As such, small improvements in ranking within the top 100 will not affect the score. Nevertheless, the improvements from our losses are still non-trivial and consistent on all metrics under different values of k.
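The point that Recall@k ignores ordering within the top k, while AP does not, can be checked with a toy example. Both functions are simplified single-image sketches written for illustration; the actual benchmark metrics are computed over the whole dataset and per class.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground truth items found among the top-k predictions."""
    top = set(ranked_predictions[:k])
    return sum(1 for gt in ground_truth if gt in top) / len(ground_truth)


def average_precision(ranked_predictions, ground_truth):
    """Mean of precision values at the ranks where ground truth is hit."""
    hits, total = 0, 0.0
    for rank, pred in enumerate(ranked_predictions, start=1):
        if pred in ground_truth:
            hits += 1
            total += hits / rank
    return total / len(ground_truth)


gt = {"man holds glass"}                       # single ground truth triplet
good = ["man holds glass", "x1", "x2"]         # correct triplet ranked first
poor = ["x1", "x2", "man holds glass"]         # correct triplet ranked last

same_recall = recall_at_k(good, gt, 3) == recall_at_k(poor, gt, 3)
ap_good, ap_poor = average_precision(good, gt), average_precision(poor, gt)
```

Both rankings reach full recall at k = 3, yet the AP drops from 1 to 1/3 when the correct triplet moves from first to last place, so ranking improvements show up in AP-style metrics but not in Recall@k.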
In addition, we show results using a better backbone, ResNeXt-101-FPN [28, 13], for the entity detector in Table 7.
VRD: Table 8 presents results on VRD compared with state-of-the-art methods. Note that only [32] specifically states that they use ImageNet pre-trained weights, while the pre-training of the others remains unknown. Therefore, we show results for pre-training on either ImageNet or COCO. Our model is competitive with those methods when pre-trained on ImageNet, and significantly outperforms them when pre-trained on COCO. The gap between the $L_0$-only model and the full model is smaller when pre-trained on ImageNet than on COCO. We believe the stronger localization features from pre-training on COCO are much easier for our model and losses to leverage.
6.6. Qualitative Results
In Figure 6 we provide four example images where our losses correct the false predictions made by the $L_0$-only model. Both the Entity Instance Confusion and the Proximal Relationship Ambiguity issues are included here. In the fourth row, the $L_0$-only model is confused between two entity instances, i.e., which person is holding the microphone, while our losses manage to refer to the correct one. In the third row, the relationship between the guitar player and the drum is ambiguous. Here, the $L_0$-only model fails by predicting a false positive, but our model trained with all losses correctly detects no relationship there.
In this work we present methods to overcome two major issues in scene graph parsing: Entity Instance Confusion and Proximal Relationship Ambiguity. We show that the traditional multi-class cross-entropy loss does not take advantage of the intrinsic knowledge of structured scene graphs and is therefore insufficient to handle these two issues. To address this, we propose Graphical Contrastive Losses, which effectively utilize semantic properties of scene graphs to contrast positive relationships against hard negatives. We carefully design three types of losses to address the issues from three aspects. We demonstrate the efficacy of our losses by adding them to a model built with the same pipeline, and we achieve state-of-the-art results on three datasets.
[1] Openimages vrd challenge. https://storage.googleapis.com/openimages/web/challenge.html.
[2] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.
[3] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017.
[4] G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. In CVPR, 2018.
[5] T. Gupta, K. J. Shih, S. Singh, and D. Hoiem. Aligned image-word representations improve inductive transfer across vision-language tasks. In ICCV, 2017.
[6] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.
[7] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.
[8] R. Kiros, R. Salakhutdinov, R. S. Zemel, et al. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
[9] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
[10] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[11] Y. Li, W. Ouyang, and X. Wang. Vip-cnn: A visual phrase reasoning convolutional neural network for visual relationship detection. In CVPR, 2017.
[12] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint arXiv:1703.03054, 2017.
[13] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[14] J. Liu, L. Wang, M.-H. Yang, et al. Referring expression generation and comprehension via attributes. In CVPR, 2017.
[15] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
[16] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017.
[17] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[19] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, 2013.
[20] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
[21] A. Newell and J. Deng. Pixels to graphs by associative embedding. In NIPS, 2017.
[22] J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. In ICCV, 2017.
[23] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[24] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[25] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
[26] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. Order-embeddings of images and language. In ICLR, 2016.
[27] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[28] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[29] D. Xu, Y. Zhu, C. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
[30] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In ECCV, 2018.
[31] X. Yang, H. Zhang, and J. Cai. Shuffle-then-assemble: Learning object-agnostic visual relationship features. In ECCV, 2018.
[32] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy. Zoom-net: Mining deep feature interactions for visual relationship recognition. In ECCV, 2018.
[33] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
[34] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.
[35] R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV, 2017.
[36] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[37] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
[38] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
[39] H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise rfcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4233–4241, 2017.
[40] J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal. Relationship proposal networks. In CVPR, 2017.
[41] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny. Large-scale visual relationship understanding. In AAAI, 2019.
[42] J. Zhang, K. Shih, A. Tao, B. Catanzaro, and A. Elgammal. An interpretable model for scene graph generation. arXiv preprint arXiv:1811.09543, 2018.
[43] J. Zhang, K. Shih, A. Tao, B. Catanzaro, and A. Elgammal. Introduction to the 1st place winning model of openimages relationship detection challenge. arXiv preprint arXiv:1811.00662, 2018.
[44] B. Zhuang, L. Liu, C. Shen, and I. Reid. Towards context-aware interaction recognition for visual relationship detection. In ICCV, 2017.
Figure 6: Example images where RelDN trained with only the cross-entropy loss predicts incorrectly while RelDN trained with our losses succeeds. For each image we count its ground truth relationships, then output the same number of top predictions from a model to assess its ranking accuracy. Red boxes in (b) highlight the false predictions from RelDN with the cross-entropy loss only, and green boxes in (c) highlight the correct ones from RelDN with all losses.
Figure 7: Example images from the 100-image subset with ground truth relationships. The subset contains five predicates for which Entity Instance Confusion and Proximal Relationship Ambiguity commonly occur.