Scene graph generation (SGG) [64], a visual detection task of objects and their relationships in an image, seems to have never fulfilled its promise: a comprehensive visual scene representation that supports graph reasoning for high-level tasks such as visual captioning [69, 67] and VQA [56, 14]. Once equipped with SGG, these high-level tasks have to abandon the ambiguous visual relationships (the very relationships on which our core efforts are made [71, 55, 6]), then pretend that there is a graph (nothing but a sparse object layout with binary links), and finally shroud it in graph neural networks [65] merely for more contextual object representations [67, 16, 56]. Although this is partly due to the research gap in graph reasoning [2, 51, 15], the crux lies in the biased relationship prediction.
Figure 1. An example of scene graph generation (SGG). (a) An input image with bounding boxes. (b) The distribution of sample fraction for the most frequent 20 predicates in Visual Genome [22]. (c) SGG from re-implemented MOTIFS [71]. (d) SGG by the proposed unbiased prediction from the same model.
Figure 1 visualizes the SGG results from a state-of-the-art model [71]. We can see a frustrating scene: among almost perfectly detected objects, most of their visual relationships are trivial and less informative. For example, in Figure 1(c), apart from the trivial 2D spatial layouts, we learn little about the image from near, on, and has. Such heavily biased generation comes from the biased training data, more specifically, as shown in Figure 1(b), the highly skewed long-tailed relationship annotations. For example, if a model is trained to predict on 1,000 times more often than standing on, then, at test time, the former is more likely to prevail over the latter. Therefore, to perform sensible graph reasoning, we need to distinguish more fine-grained relationships from the ostensibly probable but trivial ones, such as replacing near with behind/in front of, and on with parking on/driving on in Figure 1(d).
Figure 2. (a) The biased generation that directly predicts labels from likelihood. (b) An intuitive example of the proposed total direct effect, which calculates the difference between the real scene and the counterfactual one. Note that the “wipe-out” is only for the illustrative purpose but not considered as visual processing.
However, we should not blame the biased training because both our visual world per se and the way we describe it are biased: there are indeed more person carry bag than dog carry bag (i.e., the long-tail theory); it is easier for us to label person beside table rather than eating on (i.e., bounded rationality [52]); and we prefer to say person on bike rather than person ride on bike (i.e., language or reporting bias [35]). In fact, most of the biased annotations can help the model learn a good contextual prior [31, 71] to filter out unnecessary search candidates such as apple park on table and apple wear hat. A promising but embarrassing finding [71] is that, by using only the statistical prior of detected object classes in the Visual Genome benchmark [22], we can already achieve 30.1% on Recall@100 for Scene Graph Detection, which is only 1.1-1.5% lower than the state-of-the-art [5, 55, 74] and renders all the much more complex SGG models almost useless. Not surprisingly, as we will show in Section 5, conventional debiasing methods that do not respect the “good bias” during training, e.g., re-sampling [11] and re-weighting [29], fail to generalize to unseen relationships, i.e., zero-shot SGG [31].
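To make the idea of a purely statistical prior concrete, the following is a minimal sketch of a frequency baseline that ignores all visual features. It assumes training annotations are available as (subject class, predicate, object class) triplets; the function names `build_prior` and `predict_from_prior`, the fallback predicate, and the toy triplets are hypothetical and are not the implementation evaluated in [71].

```python
from collections import Counter, defaultdict

def build_prior(triplets):
    """Count predicates for every (subject class, object class) pair."""
    prior = defaultdict(Counter)
    for subj, pred, obj in triplets:
        prior[(subj, obj)][pred] += 1
    return prior

def predict_from_prior(prior, subj, obj, default="on"):
    """Return the most frequent training predicate for the class pair."""
    counts = prior.get((subj, obj))
    if not counts:
        # Unseen class pair: fall back to an assumed "head" predicate.
        return default
    return counts.most_common(1)[0][0]

# Toy annotations (hypothetical), skewed towards the generic "on".
train = [
    ("person", "on", "bike"),
    ("person", "on", "bike"),
    ("person", "riding", "bike"),
    ("dog", "near", "table"),
]
prior = build_prior(train)
print(predict_from_prior(prior, "person", "bike"))
```

Such a predictor can never output a triplet it has not counted, which is one way to see why a prior alone cannot handle zero-shot relationships.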
For both machines and humans, decision making is a collaboration of content (endogenous reasons) and context (exogenous reasons) [58]. Take SGG as an example: in most SGG models [71, 5, 74], the content is the visual features of the subject and object, and the context is the visual features of the subject-object union regions and the pairwise object classes. We humans, born and raised in the biased nature, are ambidextrous in embracing the good while avoiding the bad context, and making unbiased decisions together with the content. The underlying mechanism is causality-based: the decision is made by pursuing the main causal effect caused by the content but not the side-effect caused by the context. Machines, on the other hand, are usually likelihood-based: the prediction is analogous to looking up the content and its context in a huge likelihood table, interpolated by population training. We believe that the key is to teach machines how to distinguish between the “main effect” and “side-effect”.
Figure 3. (a) The example of total direct effect calculation and corresponding operations on the causal graph, where $\bar{X}$ represents wiped-out X. (b) Recall@100 of Predicate Classification for selected predicates ranking by sampling fraction. The biased generation refers to re-implemented MOTIFS [71] and the proposed unbiased generation is the result from the same model using TDE.
In this paper, we propose to empower machines with the ability of counterfactual causality [41] to pursue the “main effect” in unbiased prediction:
The counterfactual lies between the fact that “I see” and the imagination “I had not”, and the comparison between the factual and counterfactual will naturally remove the effect from the context bias, because the context is the only thing unchanged between the two alternatives.
To better illustrate the profound yet subtle difference between likelihood and counterfactual causality, we present a dog standing on surfboard example in Figure 2(a). Due to the biased training, the model will eventually predict on. Note that even though the remaining choices are not all exactly correct, thanks to the bias, they still help to filter out a large number of unreasonable ones. To take a closer look at what the relationship is within the context bias, we essentially compare the original scene with a counterfactual scene (Figure 2(b)): only the visual features of the dog and surfboard are wiped out, while the rest (the scene and the object classes) is kept untouched, as if the visual features had ever existed. By doing this, we can focus on the main visual effects of the relationship without losing the context.
We propose a novel unbiased SGG method based on the Total Direct Effect (TDE) analysis framework in causal inference [59, 39, 60]. Figure 3(a) shows the underlying causal graphs [40, 41] of the two alternate scenes: factual and counterfactual. Although a formal introduction of them is given in Sections 3 and 4, for now the nodes can simply be understood as data features and the directed links as (parametric) data flows. For example, $X \to Y$, $Z \to Y$, and $I \to Y$ indicate that the relationship Y is a combined effect caused by content: the pair of object visual features X, context: their object classes Z, and scene: the image I; the faded links denote that the wiped-out $\bar{X}$ is no longer caused by I or affects Z. These graphs offer an algorithmic formulation to calculate TDE, which exactly realizes the counterfactual thinking in Figure 2. As shown in Figure 3(b), the proposed TDE significantly improves most of the predicates, and impressively, the distribution of the improved performances is no longer long-tailed, indicating that our improvement is indeed from the proposed method, but NOT from the better exploitation of the context bias. A closer analysis in Figure 6 further shows that the worse predictions like on, though very few, are due to turning to more fine-grained results such as stand on and park on. We highlight that TDE is a model-agnostic prediction strategy and thus applicable to a variety of models and fusion tricks [73, 71, 55].
Last but not least, we propose a new standard SGG diagnosis toolkit (cf. Section 5.2) for more comprehensive SGG evaluations. Besides the traditional evaluation tasks, it includes a bias-sensitive metric, mean Recall [55, 6], and a new Sentence-to-Graph Retrieval task as a more comprehensive graph-level metric. By using this toolkit on the SGG benchmark Visual Genome [22] and several prevailing baselines, we verify the severe bias in existing models and demonstrate the effectiveness of the proposed unbiased prediction over other debiasing strategies.
Scene Graph Generation. SGG [64, 71] has received increasing attention in the computer vision community, due to the potential revolution that it would bring to down-stream visual reasoning tasks [51, 67, 21, 16]. Most of the existing methods [64, 62, 7, 25, 70, 55, 66, 10, 43, 61] strive for better feature extraction networks. Zellers et al. [71] first brought the bias problem of SGG to attention, and the followers [55, 6] proposed the unbiased metric (mean Recall), yet their approaches are still restricted to the feature extraction networks, leaving the biased SGG problem unsolved. The most related work [27] just prunes the dominant and easy-to-predict relationships in the training set.
Unbiased Training. The bias problem has long been investigated in machine learning [57]. Existing debiasing methods can be roughly categorized into three types: 1) data augmentation or re-sampling [9, 24, 26, 11, 3], 2) unbiased learning through elaborately designed training curriculums or learning losses [72, 29], 3) disentangling biased representations from the unbiased [35, 4]. The proposed TDE analysis can be regarded as the third category, but the main difference is that TDE does not require training additional layers like [35, 4] to model the bias; it directly separates the bias from existing models through counterfactual surgeries on causal graphs.
Figure 4. (a) The framework used in our biased training. (b) The causal graph of the SGG framework. (c) An illustration of the proposed TDE inference.
Mediation Analysis. It is also known as effect analysis [59, 41], which is widely adopted in medical, political or psychological research [45, 18, 8, 32, 20] as the tool of studying the effect of certain treatments or policies. However, it has been neglected in the community of computer vision for years. There are very few recent works [36, 23, 37, 42, 54, 68] trying to endow the model with the capability of causal reasoning. More detailed background knowledge can be found in [40, 41, 59].
As illustrated in Figure 4, we summarize the SGG framework in the form of a Causal Graph (a.k.a. structural causal model) [41, 38, 40]. It is a directed acyclic graph G = {N, E}, indicating how a set of variables N interact with each other through the causal links E. It provides a sketch of the causal relations behind the data and how variables obtain their values, e.g., $(I, X, Z) \to Y$. Before we conduct counterfactual analysis that deliberately manipulates the values of nodes and prunes the causal graph, we first revisit the conventional biased SGG model training in the graphical view.
The causal graph in Figure 4(b) is applicable to a variety of SGG methods, since it is highly general, imposing no constraints on the detailed implementations. We case-study three representative model formulations, the classic VTransE [73] and the state-of-the-art MOTIFS [71] and VCTree [55], using the language of nodes and links.
Node I (Input Image & Backbone). A Faster R-CNN [44] is pre-trained and frozen in this node. It outputs a set of bounding boxes $B = \{b_i\}$ and the feature map M from image I.
Link $I \to X$ (Object Feature Extractor). It first extracts RoIAlign features [12] $R = \{r_i\}$ and tentative object labels $L = \{l_i\}$ by the object classifier on Faster R-CNN. Then, like MOTIFS [71] or VCTree [55], we can use the following module to encode visual contexts for each object:
$$\text{Input: } \{(r_i, b_i, l_i)\} \Longrightarrow \text{Output: } \{x_i\}, \tag{1}$$
where MOTIFS implements it as bidirectional LSTMs (Bi-LSTMs) and VCTree [55] adopts bidirectional TreeLSTMs (Bi-TreeLSTMs) [53], while early works like VTransE [73] simply use fully connected layers.
Node X (Object Feature). The pairwise object feature X takes value from $\{(x_i, x_j) \mid i \neq j;\ i, j = 1 \dots n\}$. We slightly abuse the notation hereinafter, denoting the combination of representations from i and j as subscript $e$: $x_e = (x_i, x_j)$.
Link $X \to Z$ (Object Classification). The fine-tuned label of each object is decoded from the corresponding $x_i$ by:
$$\text{Input: } \{x_i\} \Longrightarrow \text{Output: } \{z_i\}, \tag{2}$$
where MOTIFS [71] and VCTree [55] utilize LSTM and TreeLSTM as decoders to capture the co-occurrence among object labels, respectively. The input of each LSTM/TreeLSTM cell is the concatenation of the feature and the previous label, $[x_i; z_{i-1}]$. VTransE [73] uses the conventional fully connected layer as the classifier.
Node Z (Object Class). It contains a pair of one-hot vectors for object labels $z_e = (z_i, z_j)$.
Link $X \to Y$ (Object Feature Input for SGG). For relationship classification, pairwise features X are merged into a joint representation by the module:
$$\text{Input: } \{x_e\} \Longrightarrow \text{Output: } \{x'_e\}, \tag{3}$$
where additional Bi-LSTM and Bi-TreeLSTM layers are applied in MOTIFS [71] and VCTree [55], respectively, before concatenating the pair of object features. VTransE [73] uses fully connected layers and element-wise subtraction for feature merging.
Link $Z \to Y$ (Object Class Input for SGG). The language prior is calculated in this link through a joint embedding layer $z'_e = W_z[z_i \otimes z_j]$, where $\otimes$ generates the unique one-hot vector in $\mathbb{R}^{N \times N}$ for the pair of N-way object labels.
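The joint embedding of a label pair can be illustrated in a few lines. The sketch below assumes the pair of N-way labels is encoded as a single one-hot vector of length N×N (position `i * N + j`) that is multiplied by a weight matrix, here called `W_z`; all names and the toy sizes are hypothetical. It shows that the dense product is equivalent to looking up one column of the matrix.

```python
import random

def pair_index(i, j, n_classes):
    """Position of the single 1 in the flattened N x N one-hot pair vector."""
    return i * n_classes + j

def pair_one_hot(i, j, n_classes):
    """One-hot vector of length N*N identifying the ordered label pair (i, j)."""
    vec = [0.0] * (n_classes * n_classes)
    vec[pair_index(i, j, n_classes)] = 1.0
    return vec

def joint_embedding(W_z, i, j, n_classes):
    """Dense matrix-vector product W_z @ one_hot, written out explicitly."""
    one_hot = pair_one_hot(i, j, n_classes)
    return [sum(w * o for w, o in zip(row, one_hot)) for row in W_z]

random.seed(0)
n_classes, n_predicates = 4, 3  # toy sizes, not the VG vocabulary
W_z = [[random.gauss(0, 1) for _ in range(n_classes * n_classes)]
       for _ in range(n_predicates)]

dense = joint_embedding(W_z, 1, 2, n_classes)
# Multiplying by a one-hot vector selects a single column of W_z.
lookup = [row[pair_index(1, 2, n_classes)] for row in W_z]
print(all(abs(a - b) < 1e-12 for a, b in zip(dense, lookup)))
```

In practice this is why such a layer is usually implemented as an embedding table rather than a dense product.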
Link $I \to Y$ (Visual Context Input for SGG). This link extracts the contextual union region features $v'_e = \text{Convs}(\text{RoIAlign}(M, b_i \cup b_j))$, where $b_i \cup b_j$ indicates the union box of two RoIs.
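The union box of two RoIs is simply the smallest box that encloses both. A minimal sketch, assuming boxes are given as corner coordinates `(x1, y1, x2, y2)` (the box format and the function name `union_box` are assumptions for illustration):

```python
def union_box(box_a, box_b):
    """Smallest (x1, y1, x2, y2) box enclosing both input boxes."""
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )

subject_box = (10, 20, 50, 80)
object_box = (40, 10, 90, 60)
print(union_box(subject_box, object_box))
```

The resulting box is what would be passed to RoIAlign to pool the context surrounding the subject-object pair.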
Figure 5. The original causal graph of SGG together with two interventional and counterfactual alternates.
Node Y (Predicate Classification). The final predicate logits Y, which take inputs from the three branches, are then generated by a fusion function. In Section 5, we test two general fusion functions: 1) SUM: $y_e = W_x x'_e + W_v v'_e + z'_e$, 2) GATE: $y_e = W_r x'_e \cdot \sigma(W_x x'_e + W_v v'_e + z'_e)$, where $\cdot$ is the element-wise product and $\sigma(\cdot)$ is a sigmoid function.
Training Loss. All models are trained using the conventional cross-entropy losses of object labels and predicate labels. To avoid any single link spontaneously dominating the generation of logits $y_e$, especially $Z \to Y$, we further add auxiliary cross-entropy losses that individually predict $y_e$ from each branch.
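The two fusion functions can be sketched on plain lists of per-predicate scores. This is only an illustration under stated assumptions: each branch is taken to have already been projected to one score per predicate, SUM is taken to add the three branch outputs, and GATE is taken to multiply a separately projected object-feature term by a sigmoid of that sum. The function and argument names are hypothetical.

```python
import math

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def fuse_sum(x_scores, v_scores, z_scores):
    """SUM fusion: element-wise sum of the three branch outputs."""
    return [x + v + z for x, v, z in zip(x_scores, v_scores, z_scores)]

def fuse_gate(r_scores, x_scores, v_scores, z_scores):
    """GATE fusion: the object-feature term is gated by sigmoid(SUM)."""
    gate = [sigmoid(s) for s in fuse_sum(x_scores, v_scores, z_scores)]
    return [r * g for r, g in zip(r_scores, gate)]

# Toy scores for two predicates (hypothetical numbers).
x_scores = [1.0, 2.0]   # object-feature branch
v_scores = [0.5, 0.5]   # union-region (visual context) branch
z_scores = [1.0, -1.0]  # object-class (language prior) branch
print(fuse_sum(x_scores, v_scores, z_scores))
print(fuse_gate([2.0, 2.0], x_scores, v_scores, z_scores))
```

Under SUM the branches contribute additively, so a large prior score can dominate the logits on its own, which is the situation the auxiliary per-branch losses are meant to discourage.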
Once the above training has been done, the causal dependencies among the variables are learned, in terms of the model parameters. The conventional biased prediction can only see the output of the entire graph given an image I = u, without any idea about how a specific pair of objects affects their predicate. However, causal inference [41] encourages us to think outside the black box. From the graphical point of view, we are no longer required to run the entire graph as a whole. We can directly manipulate the values of several nodes and see what would happen. For example, we can cut off the link $I \to X$ and assign a dummy value to X, then investigate what the predicate would be. The above operation is termed intervention in causal inference [40]. Next, we will make unbiased predictions by intervention and its induced counterfactuals.
4.1. Notations
Intervention. It can be denoted as $do(\cdot)$. It wipes out all the in-coming links of a variable and demands the variable to take a certain value, e.g., $do(X = \bar{x})$ in Figure 5(b), meaning X is no longer affected by its causal parents.
Counterfactual. It means “counter to the facts” [47], and goes one step further by assigning a “clash of worlds” combination of values to variables. Take Figure 5(c) as an example: if the intervention is conducted on X, the variable Z still takes the original z as if x had existed.
Causal Effect. Throughout this section, we will use the pairwise object feature X as our control variable where the intervention is conducted, aiming to assess its effects, because there would not be any valid relationship if the pair of objects did not exist. The observed X is denoted as $x$ while the intervened unseen value is $\bar{x}$, which is set to either the mean feature of the training set or the zero vector. The object label z in Figure 5(c) is calculated from Eq. (2), taking x as input. We denote the output logits Y after the intervention $X = \bar{x}$ as follows (Figure 5(b)):
$$Y_{\bar{x}}(u) = Y(do(X = \bar{x}) \mid u), \tag{4}$$
where u is the input image in SGG. Following the above notation, the original and counterfactual Y, i.e., Figure 5(a,c), can be re-written as $Y_x(u)$ and $Y_{\bar{x},z}(u)$, respectively.
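The three quantities (original, intervened, and counterfactual output) can be told apart on a toy causal graph. The sketch below is purely illustrative: the scalar "features", the functions `f_x`, `f_z`, `f_y`, and all numbers are hypothetical stand-ins for the learned links, chosen only so that the three outputs differ.

```python
def f_x(image):
    """Link I -> X: read the object feature from the image."""
    return image["object_feature"]

def f_z(x):
    """Link X -> Z: decode an object label from the object feature."""
    return "dog" if x > 0 else "unknown"

def f_y(image, x, z):
    """Links (I, X, Z) -> Y: fuse scene context, feature, and label prior."""
    label_prior = {"dog": 2.0, "unknown": 0.0}[z]
    return x + image["scene_score"] + label_prior

def factual(u):
    """Original graph: every node takes its observed value."""
    x = f_x(u)
    return f_y(u, x, f_z(x))

def intervened(u, x_bar):
    """do(X = x_bar): X ignores I, and its descendant Z follows x_bar."""
    return f_y(u, x_bar, f_z(x_bar))

def counterfactual(u, x_bar):
    """X is wiped out, but Z keeps the value caused by the observed x."""
    z = f_z(f_x(u))
    return f_y(u, x_bar, z)

u = {"object_feature": 1.5, "scene_score": 0.5}
x_bar = 0.0  # dummy value, e.g. a zero vector in the real model
print(factual(u), intervened(u, x_bar), counterfactual(u, x_bar))
```

The only difference between the last two functions is where the label z comes from, which is exactly the "clash of worlds" described above.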
4.2. Total Direct Effect
As we discussed in Section 1, instead of the static likelihood that tends to be biased, the unbiased prediction lies in the difference between the observed outcome $Y_x(u)$ and its counterfactual alternate $Y_{\bar{x},z}(u)$. The latter is a context-specific bias that we want to remove from the prediction. Intuitively, the unbiased prediction that we seek is the visual stimuli from blank to the observed real objects with specific attributes, states, and behaviors, but not merely from the surroundings and language priors. Those specific visual cues of objects are the key to the more fine-grained and informative unbiased predictions, because even if the overall prediction is biased towards a relationship like dog on surfboard, the “straight legs” would have more effect on standing on than on sitting on. In causal inference [59, 60], the above prediction process can be calculated as Total Direct Effect (TDE):
$$TDE = Y_x(u) - Y_{\bar{x},z}(u), \tag{5}$$
where the first term is from the original graph and the second one is from the counterfactual, as illustrated in Figure 5.
Note that there is another type of effect [59], Total Effect (TE), which is easily confused with TDE. Instead of deriving the counterfactual bias $Y_{\bar{x},z}(u)$, TE lets all the descendant nodes of X change with the intervention $do(X = \bar{x})$, as shown in Figure 5(b). TE is therefore formulated as:
$$TE = Y_x(u) - Y_{\bar{x}}(u). \tag{6}$$
The main difference lies in the fact that $Y_{\bar{x}}(u)$ is not conditioned on the original object labels (those caused by x), so TE only removes the general bias in the whole dataset (similar to the b in $y = k \cdot x + b$), rather than the specific bias caused by the mediator we care about. The subtle difference between TE and TDE is further defined as Natural Indirect Effect (NIE) [59] or Pure Indirect Effect (PIE) [60]. More experimental analyses of these three types of effect are given in Section 5.
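The gap between the two effects can be written out explicitly. The following is a short worked derivation, using $Y_x(u)$ for the factual output, $Y_{\bar{x}}(u)$ for the output after intervention, and $Y_{\bar{x},z}(u)$ for the counterfactual output, and taking NIE as TE minus TDE (the convention stated in Section 5.4):

```latex
\begin{aligned}
\mathrm{NIE} &= \mathrm{TE} - \mathrm{TDE} \\
             &= \bigl(Y_x(u) - Y_{\bar{x}}(u)\bigr) - \bigl(Y_x(u) - Y_{\bar{x},z}(u)\bigr) \\
             &= Y_{\bar{x},z}(u) - Y_{\bar{x}}(u).
\end{aligned}
```

Both remaining terms are computed with the object features wiped out and differ only in whether the object labels z caused by the observed x are kept, so NIE isolates the contribution that flows through the labels alone.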
Overall SGG. At last, the proposed unbiased prediction is obtained by replacing the conventional one-time prediction with TDE, which essentially “thinks” twice: once for the observational $Y_{x_e}(u) = y_e$, and once for the imaginary $Y_{\bar{x},z_e}(u) = y_e(\bar{x}, z_e)$. The unbiased logits of Y are therefore defined as follows:
$$y^{\dagger}_e = y_e - y_e(\bar{x}, z_e). \tag{7}$$
It is also worth mentioning that the proposed TDE doesn’t introduce any additional parameters and is widely applicable to a variety of models.
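The "think twice" inference can be sketched as two forward passes through the same scoring function followed by a subtraction. The example below is a toy under stated assumptions: a SUM-style fusion of three per-predicate score lists, a hypothetical three-predicate vocabulary, and made-up numbers in which the label prior strongly favours on while the object features favour standing on.

```python
PREDICATES = ["on", "standing on", "sitting on"]

def predicate_logits(x_scores, v_scores, z_scores):
    """Toy SUM fusion: each branch already gives one score per predicate."""
    return [x + v + z for x, v, z in zip(x_scores, v_scores, z_scores)]

def tde_logits(x_scores, x_bar_scores, v_scores, z_scores):
    """Factual logits minus counterfactual logits (object features wiped out)."""
    factual = predicate_logits(x_scores, v_scores, z_scores)
    counterfactual = predicate_logits(x_bar_scores, v_scores, z_scores)
    return [f - c for f, c in zip(factual, counterfactual)]

def argmax_label(logits):
    """Name of the highest-scoring predicate."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return PREDICATES[best]

x_scores = [0.5, 1.5, 0.2]      # observed object features ("straight legs")
x_bar_scores = [0.4, 0.4, 0.4]  # stand-in for the wiped-out (mean) feature
v_scores = [1.0, 0.3, 0.3]      # scene context, unchanged in both passes
z_scores = [3.0, 0.5, 0.6]      # object-class prior, unchanged in both passes

biased = predicate_logits(x_scores, v_scores, z_scores)
unbiased = tde_logits(x_scores, x_bar_scores, v_scores, z_scores)
print(argmax_label(biased), "->", argmax_label(unbiased))
```

With an additive fusion the context and prior terms cancel exactly in the subtraction; with a gated fusion they do not, which is one reason the two passes are computed in full rather than simplified by hand.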
5.1. Settings and Models
Dataset. For SGG, we used the Visual Genome (VG) [22] dataset to train and evaluate our models, which is composed of 108k images across 75k object categories and 37k predicate categories. However, as 92% of the predicates have no more than 10 instances, we followed the widely adopted VG split [64, 71, 55, 5] containing the most frequent 150 object categories and 50 predicate categories. The original split only has a training set (70%) and a test set (30%). We followed [71] to sample a 5k validation set from the training set for parameter tuning. For Sentence-to-Graph Retrieval (cf. Section 5.2), we selected the 41,859 images that overlap between VG and the MS-COCO Caption dataset [30] and divided them into train/test-1k/test-5k (35,859/1,000/5,000) sets. The latter two contain only images from the VG test set, to avoid exposure to ground-truth SGs. Each image has at least 5 captions serving as human queries, the same as how we use search engines.
Model Zoo. We evaluated three models, VTransE [73], MOTIFS [71], and VCTree [55], and two fusion functions, SUM and GATE. They were re-implemented using the same codebase as we proposed. All models shared the same hyper-parameters and the pre-trained detector backbone.
5.2. Scene Graph Generation Diagnosis
Our proposed SGG diagnosis has the following three evaluations:
1. Relationship Retrieval (RR). It can be further divided into three sub-tasks: (1) Predicate Classification (PredCls): taking ground-truth bounding boxes and labels as inputs, (2) Scene Graph Classification (SGCls): using ground-truth bounding boxes without labels, (3) Scene Graph Detection (SGDet): detecting SGs from scratch. The conventional metric of RR is Recall@K (R@K), which was abandoned in this paper due to the reporting bias [35]. As illustrated in Figure 3(b), previous methods like [71] with good performance on R@K unfairly cater to “head” predicates, e.g., on, while neglecting the “tail” ones, e.g., predicates like parked on and laying on have an embarrassing 0.0 Recall@100. To speak for the valuable “tail” rather than the trivial “head”, we adopted a recent replacement, mean Recall@K (mR@K), proposed by Tang et al. [55] and Chen et al. [6]. mR@K retrieves each predicate separately and then averages R@K over all predicates.
Table 1. The SGG performances of Relationship Retrieval on mean Recall@K [55, 6]. The SGG models re-implemented under our codebase are denoted by the superscript
2. Zero-Shot Relationship Retrieval (ZSRR). It was introduced by Lu et al. [31] as Zero-Shot Recall@K and is first evaluated on the VG dataset in this paper; it reports only the R@K of those subject-predicate-object triplets that have never been observed in the training set. ZSRR has the same three sub-tasks as RR.
3. Sentence-to-Graph Retrieval (S2GR). It uses the image caption sentence as the query to retrieve images represented as SGs. Both RR and ZSRR are triplet-level evaluations, ignoring the graph-level coherence. Therefore, we design S2GR, using human descriptions to retrieve detected SGs. We did not use proxy vision-language tasks like captioning [67, 69] and VQA [56, 14] as the diagnosis, because their implementations have too many components unrelated to SGG and their datasets are challenged by their own biases [1, 13, 33]. In S2GR, the detected SGs (using SGDet) are regarded as the only representations of images, cutting off all the dependencies on black-box visual features, so any bias in SGG would sensitively violate the coherence of SGs, resulting in worse retrieval results. For example, if walking on was detected as the biased alternative on, images would be mixed up with those that have sitting on or laying on. Note that S2GR is fundamentally different from the previous image retrieval with scene graphs [17, 50], because the latter still considers the images as visual features rather than SGs. Recall@20/100 (R@20/100) and the median ranking index of retrieved results (Med) on gallery sizes of 1,000 and 5,000 were evaluated. Note that S2GR may have diverse implementations as long as its spirit, graph-level symbolic retrieval, is fulfilled. We provide our implementation in the next sub-section.
Table 2. The results of Zero-Shot Relationship Retrieval.
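The difference between R@K, mR@K, and the zero-shot subset can be seen on a handful of triplets. The sketch below is a simplification, assuming all ground-truth and predicted triplets are pooled into flat lists of (subject, predicate, object) tuples and ignoring box matching and per-image accumulation; the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def recall_at_k(gt_triplets, ranked_predictions, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in gt_triplets if triplet in top_k)
    return hits / len(gt_triplets)

def mean_recall_at_k(gt_triplets, ranked_predictions, k):
    """Compute R@K separately per predicate, then average over predicates."""
    top_k = set(ranked_predictions[:k])
    per_predicate = defaultdict(lambda: [0, 0])  # predicate -> [hits, total]
    for triplet in gt_triplets:
        stats = per_predicate[triplet[1]]
        stats[0] += triplet in top_k
        stats[1] += 1
    recalls = [hits / total for hits, total in per_predicate.values()]
    return sum(recalls) / len(recalls)

def zero_shot_subset(gt_triplets, training_triplets):
    """Keep only ground-truth triplets never observed in training."""
    seen = set(training_triplets)
    return [triplet for triplet in gt_triplets if triplet not in seen]

gt = [
    ("man", "on", "street"), ("car", "on", "street"),
    ("bike", "on", "street"), ("car", "parked on", "street"),
]
predictions = [
    ("man", "on", "street"), ("car", "on", "street"),
    ("bike", "on", "street"), ("car", "near", "street"),
]
print(recall_at_k(gt, predictions, k=4))       # dominated by the head class
print(mean_recall_at_k(gt, predictions, k=4))  # tail class counts equally
```

A model that only ever predicts on scores well on R@K here but is pulled down on mR@K by the missed tail predicate.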
5.3. Implementation Details
Object Detector. Following the previous works [64, 71, 55], we pre-trained a Faster R-CNN [44] and froze it to be the underlying detector of our SGG models. We equipped the Faster R-CNN with a ResNeXt-101-FPN [28, 63] backbone and scaled the longer side of input images to 1k pixels. The detector was trained on the training set of VG using SGD as the optimizer. We set the batch size to 8 and the initial learning rate to , which was decayed by a factor of 10 at the 30k-th and 40k-th iterations. The final detector achieved 28.14 mAP on the VG test set (using a 0.5 IoU threshold). Four 2080 Ti GPUs were used for the pre-training.
Table 3. The results of Sentence-to-Graph Retrieval.
Scene Graph Generation. On top of the frozen detector, we trained SGG models using SGD as optimizer. Batch size and initial learning rate were set to be 12 and for PredCls and SGCls; 8 and
for SGDet. The learning rate was decayed by a factor of 10 twice, each time after the validation performance plateaued. For SGDet, 80 RoIs were sampled for each image and Per-Class NMS [48, 71] with 0.5 IoU was applied in object prediction. We sampled up to 1,024 subject-object pairs containing 75% background pairs during training. Different from previous works [71, 55, 5], we did not assume that non-overlapping subject-object pairs are invalid in SGDet, making SGG more general.
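The pair-sampling step can be sketched as follows. This is a hypothetical illustration of sampling at most 1,024 pairs with a 75% background share; the function name `sample_pairs`, the way a shortage of background pairs is handled, and the toy pair lists are assumptions, not details given in the text.

```python
import random

def sample_pairs(foreground, background, max_pairs=1024, bg_fraction=0.75, seed=None):
    """Sample foreground and background subject-object pairs for one image."""
    rng = random.Random(seed)
    # Reserve the background share first, capped by what is available.
    n_bg = min(len(background), int(max_pairs * bg_fraction))
    # Assumption: any unused background quota is given to foreground pairs.
    n_fg = min(len(foreground), max_pairs - n_bg)
    return rng.sample(foreground, n_fg), rng.sample(background, n_bg)

# Toy candidate pairs, identified by integer ids.
foreground = list(range(500))
background = list(range(10_000, 15_000))
fg_sample, bg_sample = sample_pairs(foreground, background, seed=0)
print(len(fg_sample), len(bg_sample))
```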
Sentence-to-Graph Retrieval. We handled S2GR as a graph-to-graph matching problem. The query captions of each image were concatenated and parsed into a text-SG using [50]. We set all the subjects/objects and predicates that appear fewer than 5 times to “UNKNOWN” tokens, obtaining dictionaries of 4,459 subject/object entities and 645 predicates, respectively. The original image SG generated from SGDet contains a fixed number of RoIs and forces all valid subject-object pairs to predict foreground relationships, to serve the K number in mR@K, which is inappropriate for S2GR. Therefore, we used a threshold of 0.1 to filter RoIs by the label probabilities and removed all background predicates from the graph. Recall that the vocabulary sizes of the entities and predicates for image SGs are 150 and 50, as mentioned before. To match the two heterogeneous graphs, image SG and text SG, in a unified space, we used BAN [19] to encode the two graph types into fixed-dimension vectors to facilitate the retrieval. More details can be found in the supplementary material.
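The two preprocessing steps described above, mapping rare tokens to an unknown token and filtering RoIs by label probability, can be sketched in a few lines. The helper names, the dictionary layout of an RoI, and the use of an inclusive comparison at the threshold are assumptions for illustration.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"

def build_vocab(tokens, min_count=5):
    """Keep tokens that appear at least min_count times."""
    counts = Counter(tokens)
    return {token for token, n in counts.items() if n >= min_count}

def encode(tokens, vocab):
    """Map out-of-vocabulary tokens to the UNKNOWN token."""
    return [token if token in vocab else UNKNOWN for token in tokens]

def filter_rois(rois, threshold=0.1):
    """Drop RoIs whose label probability is below the threshold."""
    return [roi for roi in rois if roi["score"] >= threshold]

# Toy text-SG tokens (hypothetical): "unicycle" is too rare to keep.
corpus = ["person"] * 6 + ["street"] * 5 + ["unicycle"] * 2
vocab = build_vocab(corpus)
print(encode(["person", "unicycle", "street"], vocab))

rois = [{"label": "person", "score": 0.92}, {"label": "kite", "score": 0.04}]
print(filter_rois(rois))
```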
Figure 6. The pie chart summarizes all the relationships that are correctly detected by the baseline model but considered “incorrect” by TDE. The right side of the pie chart shows the corresponding labels given by TDE. Combined with our qualitative examples, we believe that the drop in Recall@K is caused by two reasons: 1) the annotators' preference for simple annotations caused by bounded rationality [52], 2) TDE tends to predict more action-like relationships rather than vague prepositions.
5.4. Ablation Studies
Besides the models and fusion functions discussed above, we also investigated three conventional debiasing methods, two intuitive causal graph surgeries, and two other types of causal effects:
1) Focal: focal loss [29] automatically down-weights well-learned samples and focuses on the hard ones. We followed the hyper-parameters ($\gamma = 2.0$, $\alpha = 0.25$) optimized in [29].
2) Reweight: weighted cross-entropy is widely used in industry for biased data. The inverse sample fractions were assigned to each predicate category as weights.
3) Resample [3]: rare categories were up-sampled by the inverse sample fraction during training.
4) X2Y: since we argued that the unbiased effect comes from the object features X, it directly generated SGs from the outputs of the $X \to Y$ branch after biased training.
5) X2Y-Tr: it even cut off the other branches, using $X \to Y$ for both training and testing.
6) TE: as introduced in Section 4, TE is the debiasing method that is not conditioned on the contexts.
7) NIE: it is the marginal difference between TDE and TE, i.e., NIE = TE - TDE, which can be considered as the pure effect caused by introducing the bias $Z \to Y$.
NOTE: although the zero vector can also be used as the wiped-out input $\bar{x}$, we chose the mean feature of the training set for minor improvements.
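Two of the conventional debiasing baselines above are easy to state in code. The sketch below shows inverse-sample-fraction weights and the per-sample focal loss in its commonly used form; the default `gamma` and `alpha` values are the widely reported focal-loss settings and are an assumption here, as are the function names and the toy predicate counts.

```python
import math

def inverse_fraction_weights(counts):
    """Weight each predicate by the inverse of its sample fraction."""
    total = sum(counts.values())
    return {predicate: total / n for predicate, n in counts.items()}

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss given the probability assigned to the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Toy long-tailed predicate counts (hypothetical).
counts = {"on": 900, "standing on": 90, "parked on": 10}
weights = inverse_fraction_weights(counts)
print(weights["on"], weights["parked on"])

# A confident (well-learned) sample contributes far less than a hard one.
print(focal_loss(0.9), focal_loss(0.1))
```

Both schemes act only on the training objective, so they alter which predicates the context prior favours rather than separating the prior from the visual evidence at inference time.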
5.5. Quantitative Studies
RR & ZSRR. The results are listed in Tables 1 and 2. Although the conventional debiasing methods Reweight and Resample directly hack the mR@K metric, they gained only limited advantages in RR and none in ZSRR. In contrast to the high mR@K of Reweight in RR SGDet, it got an embarrassing 0.0/0.0 in ZSRR SGDet, indicating that such debiased training methods ruin the useful context prior. Focal loss [29] barely worked for either RR or ZSRR.
Figure 7. Results of scene graphs generated from the MOTIFS-SUM baseline (yellow) and the corresponding TDE (green). Top: relationship retrieval results. Mid: zero-shot relationship retrieval results. Red boxes indicate the zero-shot triplets. Bottom: results of S2GR. Red boxes mean the correctly retrieved SGs. Some of the trivial detected objects are removed from the graphs due to space limitations.
The causal graph surgeries, X2Y and X2Y-Tr, both improved RR and ZSRR over the baseline, yet their gains were limited. TE had a very similar performance to TDE, but as we discussed, it removed the general bias rather than the subject-object specific bias. NIE is the marginal improvement from TE to TDE, and was even worse than the baseline. Although R@K is not a qualified metric for RR as we discussed, we still reported the R@50/100 performance of MOTIFS-SUM in Figure 6. We can observe a performance drop from baseline to TDE, but a further analysis shows that the predictions considered correct in the baseline and “incorrect” in TDE were mainly the “head” predicates, which TDE classified into more fine-grained “tail” classes. Among all three models and two fusion functions, even the worst TDE performance outperforms previous state-of-the-art methods [55, 6] by a large margin on RR mR@K.
S2GR. In S2GR, Focal and Reweight were even worse than the baseline. Among the three conventional debiasing methods, Resample was the most stable one in our experiments. X2Y and X2Y-Tr had minor advantages over the baseline. TE took second place and was only slightly worse than TDE. NIE was the worst, as we expected, because it is based only on the pure context bias. It is worth highlighting that all three models and both fusion functions improved significantly after we applied TDE.
5.6. Qualitative Studies
We visualized several SGCls examples generated from the MOTIFS-SUM baseline and TDE in the top and mid rows of Figure 7. Scene graphs generated by TDE are much more discriminative than those of the baseline model, which prefers trivial predicates like on. The right half of the mid row shows that the baseline model would even generate holding due to the long-tail bias when the girl is not touching the kite, implying that the biased predictions can easily be “blind”, while TDE successfully predicted looking at. The bottom of Figure 7 is an example of S2GR, where the SGs detected by the baseline model lost the detailed actions of people, treating both person walking on street and person standing on street as person on street, which caused worse retrieval results. All the examples show a clear trend that TDE is much more sensitive to semantically informative relationships than to the trivially biased ones.
We presented a general framework for unbiased SGG from biased training, and this is the first work addressing the serious bias issue in SGG. With the power of counterfactual causality, we can remove the harmful bias from the good context bias, which cannot be easily identified by traditional debiasing methods such as data augmentation [9, 11] and unbiased learning [29]. We achieved the unbiasedness by calculating the Total Direct Effect (TDE) with the help of a causal graph, which is a roadmap for training any SGG model. By using the proposed Scene Graph Diagnosis toolkit, our unbiased SGG results are considerably better than their biased counterparts.
Acknowledgments We’d like to thank all reviewers for their constructive comments. This work was partially supported by the NTU-Alibaba JRI.
[1] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
[2] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[3] E. Burnaev, P. Erofeev, and A. Papanov. Influence of resampling on accuracy of imbalanced classification. In ICMV, 2015.
[4] R. Cadene, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh. Rubi: Reducing unimodal biases in visual question answering. arXiv preprint arXiv:1906.10169, 2019.
[5] L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang. Counterfactual critic multi-agent training for scene graph generation. In ICCV, 2019.
[6] T. Chen, W. Yu, R. Chen, and L. Lin. Knowledge-embedded routing network for scene graph generation. In CVPR, 2019.
[7] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017.
[8] G. Dunn, R. Emsley, H. Liu, S. Landau, J. Green, I. White, and A. Pickles. Evaluation and validation of social and psychological markers in randomised trials of complex interventions in mental health: a methodological research programme. 2015.
[9] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
[10] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.
[11] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 2009.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
[13] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV. Springer, 2018.
[14] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[15] D. A. Hudson and C. D. Manning. Learning by abstraction: The neural state machine. NeurIPS, 2019.
[16] J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
[17] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
[18] L. Keele. The statistics of causal inference: A view from political methodology. Political Analysis, 2015.
[19] J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, 2018.
[20] B. G. King. A political mediation model of corporate response to social movement activism. Administrative Science Quarterly, 2008.
[21] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In CVPR, 2018.
[22] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017.
[23] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, 2017.
[24] Y. Li, Y. Li, and N. Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, 2018.
[25] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and caption regions. In ICCV, 2017.
[26] Y. Li and N. Vasconcelos. Repair: Removing representation bias by dataset resampling. In CVPR, 2019.
[27] Y. Liang, Y. Bai, W. Zhang, X. Qian, L. Zhu, and T. Mei. Vrr-vg: Refocusing visually-relevant relationships. In ICCV, pages 10403–10412, 2019.
[28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
[31] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
[32] D. P. MacKinnon, A. J. Fairchild, and M. S. Fritz. Mediation analysis. Annu. Rev. Psychol., 2007.
[33] V. Manjunatha, N. Saini, and L. S. Davis. Explicit bias discovery in visual question answering models. In CVPR, 2019.
[34] F. Massa and R. Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch, 2018.
[35] I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR, 2016.
[36] S. Nair, Y. Zhu, S. Savarese, and L. Fei-Fei. Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751, 2019.
[37] Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J.-R. Wen. Counterfactual vqa: A cause-effect look at language bias. arXiv, 2020.
[38] J. Pearl. Causality: models, reasoning and inference. Springer, 2000.
[39] J. Pearl. Direct and indirect effects. In Proceedings of the 17th conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2001.
[40] J. Pearl, M. Glymour, and N. P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.
[41] J. Pearl and D. Mackenzie. The book of why: The new science of cause and effect. Basic Books, 2018.
[42] J. Qi, Y. Niu, J. Huang, and H. Zhang. Two causal principles for improving visual dialog. arXiv preprint arXiv:1911.10496, 2019.
[43] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo. Attentive relational networks for mapping images to scene graphs. In CVPR, 2019.
[44] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 2015.
[45] L. Richiardi, R. Bellocco, and D. Zugna. Mediation analysis in epidemiology: methods, interpretation and bias. International journal of epidemiology, 2013.
[46] J. M. Robins and S. Greenland. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 1992.
[47] N. J. Roese. Counterfactual thinking. Psychological bulletin, 1997.
[48] A. Rosenfeld and M. Thurston. Edge and curve detection for visual scene analysis. IEEE Transactions on computers, 1971.
[49] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[50] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, 2015.
[51] J. Shi, H. Zhang, and J. Li. Explainable and explicit visual reasoning over scene graphs. In CVPR, 2019.
[52] H. A. Simon. Bounded rationality. In Utility and probability. Springer, 1990.
[53] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 2015.
[54] T. Wang, J. Huang, H. Zhang, and Q. Sun. Visual commonsense r-cnn. In Conference on Computer Vision and Pattern Recognition, 2020.
[55] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu. Learning to compose dynamic tree structures for visual contexts. In CVPR, 2019.
[56] D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.
[57] A. Torralba, A. A. Efros, et al. Unbiased look at dataset bias. In CVPR, 2011.
[58] N. Van Hoeck, P. D. Watson, and A. K. Barbey. Cognitive neuroscience of human counterfactual reasoning. Frontiers in human neuroscience, 2015.
[59] T. VanderWeele. Explanation in causal inference: methods for mediation and interaction. Oxford University Press, 2015.
[60] T. J. VanderWeele. A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology (Cambridge, Mass.), 2013.
[61] W. Wang, R. Wang, S. Shan, and X. Chen. Exploring context and visual pattern of relationship for scene graph generation. In CVPR, 2019.
[62] S. Woo, D. Kim, D. Cho, and I. S. Kweon. Linknet: Relational embedding for scene graph. In Advances in Neural Information Processing Systems, 2018.
[63] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[64] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
[65] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
[66] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In ECCV, 2018.
[67] X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
[68] X. Yang, H. Zhang, and J. Cai. Deconfounded image captioning: A causal retrospect, 2020.
[69] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
[70] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy. Zoom-net: Mining deep feature interactions for visual relationship recognition. In ECCV, 2018.
[71] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
[72] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In ICML, 2013.
[73] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
[74] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro. Graphical contrastive losses for scene graph parsing. In CVPR, 2019.
This supplementary document is organized as follows: 1) Section A: a comprehensive review of causal effect analysis in causal inference; 2) Section B: more details of the simplified network structures in the original paper; 3) Section C: more quantitative studies; 4) Section D: more qualitative studies.
In this section, a comprehensive review of causal effect analysis is given in the form of the causal graph we proposed in Section 3, and we still follow the notations from the original paper. More detailed background knowledge about causal inference can be found in [40, 41] while the extension of effect analysis (a.k.a. mediation analysis) is given in [46, 39, 60, 59].
A.1. Mediator
Since an exhaustive introduction to causal inference would be beyond the scope of this paper, we simplified or skipped the definitions of several concepts in the original paper without affecting the understanding. One of the skipped concepts is the mediator. In a causal graph, when we care about the effect of a variable X on the output variable Y, a descendant node of X that is located on the path between them is a mediator. For example, in the study of carcinogenesis by smoking (Cigarette → Nicotine → Cancer), nicotine is the mediator. In our case, the object label Z is the mediator between X and Y, which can be considered as the side effect of X that also affects Y.
A.2. Total, Direct and Indirect Effects
As we discussed in Section 4.2, without further counterfactual intervention on the mediator Z, the overall effect of X on Y is regarded as the Total Effect (TE) of X on Y, which can be calculated as:
$$TE = Y_x(u) - Y_{\bar{x},\bar{z}}(u).$$
As illustrated in Figure 8, other than the path that is cut off by the intervention $do(X = \bar{x})$, all the other variables take their values through the links of the causal graph. In particular, the mediator Z takes the value $\bar{z}$, which is calculated from Eq. (2) given $\bar{x}$ as input.
Figure 8. The illustration of Total Effect on causal graph.
However, by only using the TE, we are still not able to separate the mediator-specific “causal effect” from the “side effect”, which limits the value of causal effect analysis. Fortunately, the TE can be decomposed [39, 60]. Generally, the TE of X is composed of the Direct Effect (DE) caused by the causal path $X \to Y$ and the Indirect Effect (IE) caused by the side-effect path $X \to Z \to Y$. Depending on whose effect we want to obtain, two kinds of decomposition can be applied.
Decomposition 1: The first kind of decomposition is what we used in Section 4.2, which separates the TE into the Total Direct Effect (TDE) and the Natural/Pure Indirect Effect (NIE/PIE). The former has already been defined in the original paper as:
$$TDE = Y_x(u) - Y_{\bar{x},z}(u),$$
which can be regarded as the effect of X in the real situation, i.e., Z always takes the value z as if it had seen the real x. Meanwhile, the NIE or PIE is the effect caused by the mediator Z under a pure/natural situation, i.e., X does not take the value x of the specific case and is instead assigned the general unactivated value $\bar{x}$. Therefore, the NIE of Z is denoted as:
$$NIE = Y_{\bar{x},z}(u) - Y_{\bar{x},\bar{z}}(u),$$
where we can easily identify that the NIE is the effect of Z when it changes from $\bar{z}$ to z in a pure environment, i.e., $X = \bar{x}$. The illustrations of TDE and NIE are given in Figure 9.
Decomposition 2: The second type of decomposition is the opposite of the first one. It is mainly adopted when the indirect effect of the mediator is what we are looking for. For example, in the study of carcinogenesis by smoking (Cigarette → Nicotine → Cancer), sometimes the side effect of nicotine is what researchers really care about. In this case, the TE can be decomposed into the Total Indirect Effect (TIE) and the Natural/Pure Direct Effect (NDE/PDE). The definition of the former is very similar to the NIE except that the environment is the real case X = x, and it is therefore formulated as:
$$TIE = Y_x(u) - Y_{x,\bar{z}}(u).$$
At the same time, since the direct effect is not the target, its pure/natural effect should be removed from the TE. The calculation of the NDE/PDE is as follows:
$$NDE = Y_{x,\bar{z}}(u) - Y_{\bar{x},\bar{z}}(u),$$
where the NDE is the effect of X changing from $\bar{x}$ to x under the pure environment $Z = \bar{z}$. In general, we should put the effect we care about under the real environment, i.e., TDE or TIE, so that we get results specific to each case.
Figure 9. The illustration of Total Direct Effect and Pure/Natural Indirect Effect on causal graph.
The above two types of decomposition are both commonly used in medical, political, or psychological research [45, 18, 8, 32, 20]; the choice depends on which effect we want to obtain, the main effect or the side effect. Note that if the system is purely linear, the two types of decomposition are exactly the same, i.e., TDE = NDE and TIE = NIE.
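The two decompositions can be checked numerically on a toy structural model. The sketch below is only an illustration: the functions `z_of` and `y_of`, their coefficients, and the interaction term are hypothetical and are not the SGG model of the paper. It shows that TE = TDE + NIE = TIE + NDE always holds, and that the two decompositions coincide once the interaction between X and Z is removed (a linear system).

```python
def z_of(x, d=2.0):
    """Hypothetical mediator: Z is a linear function of X."""
    return d * x


def y_of(x, z, a=1.0, b=1.0, c=0.3):
    """Hypothetical outcome with a direct term, a mediated term,
    and an X-Z interaction term (c = 0 gives a linear system)."""
    return a * x + b * z + c * x * z


def effects(x, x_bar, c):
    """Return TE, TDE, NIE, TIE, NDE for the toy model above."""
    z, z_bar = z_of(x), z_of(x_bar)

    def y(xx, zz):
        return y_of(xx, zz, c=c)

    return {
        "TE": y(x, z) - y(x_bar, z_bar),
        "TDE": y(x, z) - y(x_bar, z),          # Z keeps the real value z
        "NIE": y(x_bar, z) - y(x_bar, z_bar),  # Z: z_bar -> z under x_bar
        "TIE": y(x, z) - y(x, z_bar),          # Z: z_bar -> z under x
        "NDE": y(x, z_bar) - y(x_bar, z_bar),  # X: x_bar -> x under z_bar
    }


if __name__ == "__main__":
    print(effects(3.0, 1.0, c=0.3))  # with interaction: TDE != NDE
    print(effects(3.0, 1.0, c=0.0))  # linear: TDE == NDE, TIE == NIE
```

With a non-zero interaction term the direct effect depends on which value the mediator is held at, which is why the choice between TDE and NDE matters.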
B.1. Scene Graph Generation
In the original paper, we simplified the feature extraction module in Link $I \to X$ and the visual context module in Link $I \to Y$. Their details are given in this subsection.
Feature Extraction Module. Since we adopted ResNeXt-101-FPN [28, 63] as the backbone, the extracted M contains feature maps from 4 scales. Each bounding box is assigned to one of these feature maps based on its area [34]: given a bounding box and its area, the index k of the corresponding feature map is computed from that area following [34]. Then ROIAlign [12] is applied to the selected bounding box on the corresponding feature map to obtain its feature, as we described in Section 3.
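As a rough illustration of area-based assignment, the sketch below implements the standard FPN level-mapping rule of [28] with the default constants used in [34] (canonical scale 224, canonical level 4, levels clamped to 2 to 5). Whether the paper's own equation uses exactly these constants and this index offset is an assumption; the function names are hypothetical.

```python
import math


def fpn_level(width, height, k0=4, canonical=224.0, k_min=2, k_max=5):
    """Standard FPN rule: k = floor(k0 + log2(sqrt(w * h) / canonical)),
    clamped to the available pyramid levels [k_min, k_max]."""
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical + 1e-6))
    return min(max(k, k_min), k_max)


def feature_map_index(width, height):
    """Zero-based index into the 4 feature maps (assumed ordering:
    index 0 is the finest map, index 3 the coarsest)."""
    return fpn_level(width, height) - 2


if __name__ == "__main__":
    for w, h in [(32, 32), (112, 112), (224, 224), (1000, 1000)]:
        print((w, h), feature_map_index(w, h))
```

Small boxes are therefore pooled from the finer, higher-resolution maps and large boxes from the coarser ones.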
Figure 10. The illustration of Total Indirect Effect and Pure/Natural Direct Effect on causal graph.
Table 4. The details of Visual Context Module.
Visual Context Module. To extract the visual context feature for the union box of an object pair, we expect the 4 feature maps to provide complementary contextual information from different levels. Therefore, we extract ROIAlign [12] features on all 4 feature maps before we project the visual context feature into a common feature space. The entire module is summarized in Table 4, where the dummy mask operation in (7) generates two masks, one for each bounding box of the pair, independently, assigning 1.0 to the pixels inside the bounding box and 0.0 to the rest.
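The dummy mask operation can be sketched as follows. This is a minimal illustration only: the grid size, the half-open box convention, and the function names are assumptions, not the settings of Table 4.

```python
def box_mask(box, size):
    """Return a size x size grid with 1.0 inside `box` and 0.0 elsewhere.

    `box` is (x1, y1, x2, y2) in grid coordinates, half-open on the
    right and bottom edges (an assumed convention).
    """
    x1, y1, x2, y2 = box
    return [
        [1.0 if (x1 <= col < x2 and y1 <= row < y2) else 0.0
         for col in range(size)]
        for row in range(size)
    ]


def pair_masks(subject_box, object_box, size=8):
    """Generate the two masks independently, one per bounding box."""
    return box_mask(subject_box, size), box_mask(object_box, size)


if __name__ == "__main__":
    subj, obj = pair_masks((0, 0, 4, 8), (4, 0, 8, 8))
    for row in subj:
        print(row)
```

Keeping the two masks separate lets the module tell which part of the union box belongs to the subject and which to the object.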
The Special Treatment for PredCls. In the original paper, we skipped a special case of the causal graph, i.e., the causal graph for Predicate Classification (PredCls), for simplicity. In PredCls, the ground-truth object labels are given, which means the link $X \to Z$ is blocked by assigning the ground-truth labels. This does not affect the TDE calculation, where Z takes the real value z. However, it is involved in the ablation studies of TE and NIE, where Z could be assigned $\bar{z}$. In this case, $\bar{z}$ is directly set to the mean vector of the training set rather than calculated from Eq. (2). We also need to notice that, for MOTIFS [71], Eq. (3) takes z as input too, which is simplified in the original paper, because z itself is derived from x and this can be considered as an interaction between links of the causal graph.
B.2. Sentence-to-Graph Retrieval
As we mentioned in the original paper, we treated Sentence-to-Graph Retrieval (S2GR) as a graph-to-graph matching problem, parsing query captions into text-SGs with [50]. Both detected image-SGs and parsed text-SGs are composed of entities and relationships, where the subject and object categories of each relationship share the same dictionary with the entities, and each relationship also carries a one-hot vector of its predicate category.
The image-SGs and text-SGs are equipped with different embedding layers, because they have different dictionaries. These layers encode the entities and relationships into embedded features of dimension 512, and the numbers of entities and relationships depend on each image.
B.2.1 Bilinear Attention Scene Graph Encoding
Since entities and relationships are both important for SGs, we apply the Bilinear Attention Network (BAN) [19] to encode their multimodal interactions into the same representation space. The same BAN model is used for both text-SGs and image-SGs, hence we drop the index k hereinafter for simplicity. The original BAN involves two steps: 1) attention map generation, and 2) bilinear attended feature calculation. Because a scene graph already provides connections between entities and relationships, we skip the first step and use the normalized scene graph connection as the attention map, where the scene graph connection M links each relationship to the entities it involves.
The bilinear attended scene graph encoding is calculated as in Table 5, where steps (4-10) are repeated twice, and the final output is a feature vector representing the whole SG. The same BAN is used for both text-SGs and image-SGs, i.e., the parameters of the BAN are shared.
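Replacing a learned attention map by the graph connectivity can be sketched as below. The exact definition and normalization of M are not recoverable from the text, so two assumptions are made and labeled here: M is a binary entity-by-relationship incidence matrix, and "normalized" means dividing by the total so that the map sums to one (mirroring the softmax over the whole map in the original BAN). The function names and triple format are hypothetical.

```python
def connection_matrix(num_entities, relationships):
    """Binary incidence matrix (assumed form of M).

    `relationships` is a list of (subject_index, predicate, object_index)
    triples; entry [i][j] is 1.0 if entity i is the subject or object
    of relationship j.
    """
    m = [[0.0] * len(relationships) for _ in range(num_entities)]
    for j, (subj, _predicate, obj) in enumerate(relationships):
        m[subj][j] = 1.0
        m[obj][j] = 1.0
    return m


def normalize(m):
    """Scale the matrix so all entries sum to one (assumed normalization)."""
    total = sum(sum(row) for row in m)
    if total == 0:
        return m
    return [[v / total for v in row] for row in m]


if __name__ == "__main__":
    rels = [(0, "on", 1), (2, "near", 1)]
    attention = normalize(connection_matrix(3, rels))
    for row in attention:
        print(row)
```

The resulting map can then be plugged into the bilinear attended feature step in place of a learned attention map.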
The model was trained with the triplet loss [49] using the L1 distance, for 30 epochs with the SGD optimizer and a batch size of 12. The initial learning rate was decayed by a factor of 10 at the 10th and 25th epochs.
Table 5. The details of Bilinear Attention Scene Graph Encoding Module.
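The training objective and schedule described above can be sketched in plain Python. The margin value, the base learning rate passed in by the caller, and the zero-based epoch indexing are assumptions for illustration; they are not stated in the text.

```python
def l1_distance(a, b):
    """L1 (Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def triplet_loss_l1(anchor, positive, negative, margin=1.0):
    """Triplet loss with L1 distance: the matching pair should be closer
    than the non-matching pair by at least `margin` (assumed value)."""
    return max(
        0.0,
        l1_distance(anchor, positive) - l1_distance(anchor, negative) + margin,
    )


def learning_rate(epoch, base_lr, milestones=(10, 25), factor=10.0):
    """Step decay: divide by `factor` at each milestone epoch
    (zero-based epoch indexing is an assumption)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


if __name__ == "__main__":
    print(triplet_loss_l1([0.0, 0.0], [0.0, 1.0], [3.0, 3.0]))
    print([learning_rate(e, base_lr=1.0) for e in (0, 10, 25)])
```

Here the anchor would be the encoding of a query text-SG, the positive its matching image-SG, and the negative a non-matching image-SG.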
The full results of Relationship Retrieval, including both conventional Recall@K and the adopted mean Recall@K [55, 6], are given in Table 6. Although a performance drop on conventional Recall@K is observed for TDE, the detailed analysis of the “decreased” predicates in Figure 6 of the original paper implies that it is caused by a more fine-grained predicate classification.
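The difference between the two metrics can be illustrated with a small sketch. It is a simplification under stated assumptions: ground-truth triplets are pooled into one list and `hits` marks whether each was retrieved within the top K, whereas the actual evaluation is computed per image; the function name is hypothetical.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_predicates, hits):
    """Return (Recall, mean Recall) over pooled ground-truth triplets.

    Recall counts every triplet equally; mean Recall first computes
    recall per predicate category and then averages the categories.
    """
    recall = sum(1 for h in hits if h) / len(gt_predicates)

    per_predicate = defaultdict(lambda: [0, 0])  # predicate -> [hit, total]
    for predicate, hit in zip(gt_predicates, hits):
        per_predicate[predicate][0] += int(hit)
        per_predicate[predicate][1] += 1
    mean_recall = sum(h / n for h, n in per_predicate.values()) / len(per_predicate)
    return recall, mean_recall


if __name__ == "__main__":
    gt = ["on"] * 8 + ["standing on"] * 2
    biased = [True] * 8 + [False] * 2                    # only the head class
    balanced = [True] * 6 + [False] * 2 + [True] * 2     # both classes
    print(recall_and_mean_recall(gt, biased))
    print(recall_and_mean_recall(gt, balanced))
```

In this toy case both predictions reach the same Recall, but only the second one is rewarded by mean Recall, because it also retrieves the rare predicate.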
The detailed predicate-level Recall@100 on PredCls for all three models, two fusion functions, and baseline vs. TDE is given in Figures 12, 13, and 14. Notably, the distribution of the improved performance is no longer long-tailed, whereas the conventional debiasing methods illustrated in Figure 11 remain bound by the dataset distribution. For TDE, the few predicates whose recall decreases are mainly explained by the more fine-grained classification, and we observe significant improvements on their subclass predicates. Note that, unlike Reweight, which blindly hurts all frequent predicates, the proposed TDE even improves some of the top-10 frequent predicates, like behind and above, which are themselves subclasses of near. This further supports that the improvement of the proposed TDE does not come from hacking the distribution.
More Relationship Retrieval (RR) and Zero-Shot Relationship Retrieval (ZSRR) results are given in Figure 15, where the top 10 relationships under SGCls are selected for each image. As we can see, besides the trivial relationship problem, the conventional baseline barely distinguishes different entities. For example, in the bottom-left image, the same sign is on almost every pole in the baseline, while the TDE results are more sensitive to different entities. However, one problem of TDE is that it overemphasizes action predicates. It even uses holding for pole and sign, while the predicate on used by the baseline is more natural in this case.
Table 6. The SGG performances of Relationship Retrieval on both conventional Recall@K and mean Recall@K [55, 6]. The SGG models reimplemented under our codebase are marked with a superscript.
Figure 11. Conventional Debiasing Methods: Recall@100 on Predicate Classification for the most frequent 35 predicates.
Figure 12. MOTIFS (reimplemented): Recall@100 on Predicate Classification for the most frequent 35 predicates.
Figure 13. VCTree (reimplemented): Recall@100 on Predicate Classification for the most frequent 35 predicates.
Figure 14. VTransE (reimplemented): Recall@100 on Predicate Classification for the most frequent 35 predicates.
Another example of Sentence-to-Graph Retrieval (S2GR) is illustrated in Figure 16. Although we only report sub-graphs of the original SGDet results due to the limited space, we can still see that the conventional baseline model is not able to detect predicates like eating, so the detected SGs only provide spatial relationships and miss the most discriminative word, eating, in the query caption.
Figure 15. Top 10 Relationship Retrieval (RR) and Zero-Shot Relationship Retrieval (ZSRR) results of SGCls for MOTIFS+SUM baseline (yellow box) and corresponding TDE (green box). The red predicates indicate misclassified relationships, the purple predicates are those correctly classified relationships (in ground truth), the blue predicates are those not labeled in ground truth.
Figure 16. An example of Sentence-to-Graph Retrieval (S2GR) results for MOTIFS+SUM baseline (yellow box) and corresponding TDE (green box). The red boxes indicate ground truth matching results. Note that we only draw sub-graphs containing important objects and predicates, because the original detected scene graphs from SGDet have too many trivial objects and predicates.