Multimodal Representation Model based on Graph-Based Rank Fusion

2019·Arxiv

Abstract

Abstract

This paper proposes a learning model, based on rank-fusion graphs, for general applicability in multimodal prediction tasks, such as multimodal regression and image classification. Rank-fusion graphs encode information from multiple descriptors and retrieval models, thus being able to capture underlying relationships between modalities, samples, and the collection itself. The solution is based on the encoding of multiple ranks for a query (or test sample), defined according to different criteria, into a graph. Later, we project the generated graph into an induced vector space, creating fusion vectors, targeting broader generality and efficiency. A fusion vector estimator is then built to infer whether a multimodal input object refers to a class or not. Our method is capable of promoting a fusion model better than early-fusion and late-fusion alternatives. Performed experiments in the context of multiple multimodal and visual datasets, as well as several descriptors and retrieval models, demonstrate that our learning model is highly effective for different prediction scenarios involving visual, textual, and multimodal features, yielding better effectiveness than state-of-the-art methods.

Keywords: graph-based rank fusion, graph embedding, multimodal fusion, prediction tasks, rank aggregation,

representation model

1. Introduction

Nowadays, data analysis involving multimedia and heterogeneous content is a hot topic that attracts a lot of attention from not only public and private sectors, but also academia. The proliferation of digital content and social media is expanding substantially the volume and diversity of available digital content. Most of these data are unlabeled, heterogeneous, unstructured, and derived from multiple modalities.

Despite such challenges, such content is of great relevance to support the development of prediction and retrieval models. In particular, multimodal data analysis is required in several scenarios, such as content-based information retrieval, and multimedia event detection of natural disasters, such as flooding.

Massive digital content has demanded the development of textual, visual, and multimedia descriptors for content-based data analysis. Despite the continuous advance on feature extractors and machine learning techniques, a single descriptor or a single modality is often insufficient to achieve effective prediction results in real case scenarios. Descriptors have specific pros and cons because each one often focuses on a specific point of view of a single modality. For example, dedicated descriptors may be created to characterize scenes, textual descriptions, movement, symbols, signals, etc. For this reason, descriptors and retrieval models often provide complementary views, when adopted in combination.

Many works have been proposed to combine heterogeneous data sources (remotely sensed information and social media) to promote multimodal analysis. In [1], authors point out the benefit of exploring and fusing multimodal features with different models. Moreover, combining different kinds of features (local vs holistic) improves substantially the retrieval effectiveness [2], and most search fusion approaches are based on rank fusion [2–4].

Scenarios involving heterogeneous data impose a challenge of selecting features or models to combine, which is performed by either supervised or unsupervised approaches. Unsupervised approaches are necessary in the absence of labeled data, which is prominent nowadays, or in scenarios involving lower computation capabilities or a large amount of data.

Most previous initiatives for multimodal prediction are solely based on either CNN-based descriptors in isolation, feature concatenation [5–10], or graph-based feature fusion [11, 12]. These approaches still ignore the correlation between modalities, as well as object correlation, and they are not consistently better than ranking models that do not rely on fusion.

This paper proposes a learning model, based on rank-fusion graphs, for general applicability in multimodal prediction tasks, such as classification and regression. We explore and extend the concept of a rank-fusion graph – originally proposed as part of a rank aggregation approach for retrieval tasks [4] – for supervised learning and additional applications. Our fusion method relies on the representation of multiple ranks, defined according to different criteria, into a graph. Graphs provide an efficient representation of arbitrary structures and inter-relationships among different elements of a model. We embed the generated graph into a feature space, creating fusion vectors. Next, an estimator is trained to predict if an input multimodal object refers to a target label (or event) or not, following their fusion vectors.

In [13], a methodology to apply rank-fusion graphs for efficient retrieval was presented, in the context of retrieval tasks. Different from that work, here we propose a new learning paradigm over fusion vectors for multimodal prediction. We propose specific components for the training and prediction phases, as well as additional applications for the fusion vectors.

Our method performs unsupervised learning for the object representation, exploring and capturing relationships from the collection into the representation model. It works on top of any descriptors for multimodal data, such as visual or textual. By promoting a representation model solely based on base descriptors and unsupervised data analysis over the collection, we conjecture that our approach leads to a competitive multimodal representation model that explores and encodes information from multiple descriptors and underlying sample relationships automatically, while not requiring labeled data.

Experimental results over multiple multimodal and visual datasets demonstrate that the proposal is robust for different detection scenarios involving textual, visual, and multimodal features, yielding better detection results than state-of-the-art methods from both early-fusion and late-fusion approaches.

This paper extends the work presented in [14] where the notion of a representation model from rank-fusion graphs was introduced and evaluated for multimedia flood detection. Here we extend this approach, proposing and discussing alternative approaches for the representation model. We also show additional applications for it, promoting general applicability for prediction tasks. The method is evaluated over multiple multimodal and image scenarios. We also evaluate the method against additional baselines, either early-fusion and late-fusion approaches. Finally, we evaluate the influence of hyper-parameters and provide guidelines for their selection.

2. Related Work

Representation learning models have been developed and advanced for the data modalities individually, such as image, text, video, and audio. However, their combined exploration for multimodal tasks remains an open issue. Multimodality imposes even more challenges, depending on the scenario, such as translation between modalities, exploration of complementarity and redundancy, co-learning, and semantic alignment [15].

Some works have focused on multimodal events, which require modeling spatio-temporal characteristics of data [16]. Faria et al. [17] proposed a time series descriptor that generates recurrence plots for the series, coupled with a bio-inspired optimization of how to combine classifiers. In this paper, we focus on multimodal tasks that do not depend on temporal modeling as well as unsupervised models.

In order to achieve fusion capabilities, early-fusion approaches emphasize the generation of composite descriptions for samples, thus working at the feature level. Conversely, late-fusion approaches perform a combination of techniques focused on a target problem, fusing at score or decision levels. On a smaller scale, some papers propose hybrid solutions based on both approaches [18, 19]. As we propose a representation model, our solution can be seen as an early-fusion approach. Nevertheless, it is based on retrieval models, without the need to work directly on a feature level. In this sense, we categorize the method as a hybrid.

While early-fusion approaches are theoretically able to capture correlations between modalities, often a certain modality produces unsatisfactory performance and leads to biased or over-fitted models [18]. Most early-fusion methods work in a two-step procedure; first extracting features from different modalities, and then fusing them by strategies such as concatenation [20], singular values decomposition [21], or autoencoders [22]. A few others focus on multimodal features jointly [23, 24], although generally restricted to a pre-defined textual attribute set. Concatenation is a straightforward yet widely used early-fusion approach, which merges vectors obtained by different descriptors. As a drawback, concatenation does not explore inherent correlations between modalities.

Supervised early-fusion optimizes a weighted feature combination, either during or after feature extraction. A common strategy is to build a neural architecture with multiple separate input layers, then including a final supervised layer, such as a regressor [25]. Another approach is the design of a composite loss function, suited for the particular desired task [26]. Composite loss functions work well in practice, but, as they need both multimodal composition and supervision, they are tied to the domain of interest. Supervised early-fusion usually suffers from high memory and time consumption costs. Besides, they usually have difficulty in preserving feature-based similarities and semantic correlations [19].

Late-fusion approaches are useful when the raw data from the objects are not available. Besides, they are less prone to over-fit. Mixture of experts (MoE) approaches focus on performing decision fusion, combining predictors to address a supervised learning problem [27]. Majority voting of classifiers [25], rank aggregation functions [4], and matrix factorization [28] are examples of late-fusion methods. Both fusions based on rank aggregation functions and matrix factorization are based on manifold learning, i.e., the exploration of dataset geometry.

Majority voting is a well-known approach to combine multiple estimators, being effective due to bias reduction. It is applied to scenarios involving an odd number of estimators so that each predicted output is taken as the one most frequently predicted by the base estimators.

Rank aggregation functions combine results from different retrieval models. They target a good permutation of retrieved objects obtained from different input ranks, in order to promote more effective retrieval results [4].

In general, unsupervised learning needs more investigation, especially for multimodal representation models. We

explore how unsupervised rank aggregation capabilities can be applied to prediction tasks. Despite their original intent regarding better retrieval accuracy, we claim that unsupervised rank aggregation functions can provide effective dataset exploitation.

Considering previous works regarding fusion approaches for prediction tasks, a considerable amount of them are still based on classic visual descriptors [6, 8, 10, 11]. Most of them resorted to pre-trained CNN-based models for visual feature extraction [5, 7, 9, 12, 29–31], from which just a few fine-tuned their models [9, 31]. When dealing with specific tasks, especially for competitions, some works use preprocessing steps, such as image cropping and filtering [32], but this is beyond our general intent in this paper. In order to explore the textual modality, most initiatives used BoW, using either TF or TF-IDF weighting, while others presented more complex formulations, such as word embeddings [5, 10], Long Short-Term Memory (LSTM) networks [9], or relation networks [31]. Regarding multimodal scenarios, most works rely on early-fusion approaches, such as a concatenation of visual and textual feature vectors [5–10] or graph-based attribute fusion [11, 12], while only a few others adopted late-fusion approaches [29, 30].

3. Preliminaries

Let a sample s be any digital object, such as a document, an image, a video, or even a hybrid (multimodal) object. s is characterized by a descriptor D, which relies on a particular point of view to describe s as a vector, graph or any data structure . Descriptions allow samples to be compared to each other, thus composing the basis of retrieval and learning models.

A comparator C, applied over a tuple (), produces a (e.g., the Euclidean distance or the cosine similarity). Either similarity or dissimilarity functions can be used to implement C. A query sample, or just query q, refers to a particular sample taken as an input object in the context of a search, whose purpose is to retrieve response items from a response set (S) according to relevance criteria. A response set is a collection of n samples, where n is the collection size. In prediction tasks, a query is called a test sample, and a response set is called a train set. These terms can be used interchangeably, but a train set refers to labeled samples.

A ranker is a tuple R = (D, C), which is employed to compute a rank for q, denoted by to distinguish its corresponding query. A rank is defined as a permutation of , where in general, subject to providing the least dissimilar samples to q from S, in order. L is a cut-off parameter. The position of x in is expressed by ), starting by value 1. ) is the score between and with respect to the same descriptor and comparator from the particular ranker that produced the rank for the query q.

A ranker establishes a ranking system, but different descriptors and comparators can compose rankers. Besides, descriptors are commonly complementary, as well as comparators. Given m rankers, , used for query retrieval over a collection S, for every query q, we can obtain , from which a rank aggregation function f produces a combined rank ), presumably more effective than the individual ranks .

4. Representation and Prediction Based on Graph-Based Rank Fusion

Fig. 1 presents an overview of our method – a multimodal representation and estimator based on rank-fusion graphs. The solution is composed of three main generic components, introduced here and detailed in the next sections. Two phases are also defined.

Figure 1: Graph-based rank fusion for multimodal prediction.

Based on a train set and multiple rankers, a training phase generates fusion graphs and then fusion vectors for the train set. Then, a multimodal prediction model is built based on the fusion vectors and their corresponding train labels. The second phase, named inference phase, refers to the multimodal prediction. It first applies for the test sample a similar rank-based fusion approach for multimodal representation, then predicts an outcome for that test sample. The training phase is performed only once, while the inference phase is performed per prediction. The first two components – fusion

graph extraction and graph embedding – are used in both phases.

The fusion graph extraction (component 1 in Fig. 1) generates a fusion graph G for a given test sample q. G is an aggregated representation of multiple ranks for q that intends to encode information of multiple ranks. By doing so, this representation is capable of correlating any object to the others from train set, in an unsupervised manner, with respect to multiple modalities. This formulation is presented in Section 4.1. Graph Embedding (2) projects fusion graphs into a vector space model, producing a corresponding fusion vector V for G, then enabling better performance and efficiency to the representation model. We propose an embedding formulation in Section 4.2. At the end, a multimodal learning component (3) builds an estimator trained over the response fusion vectors, in order to predict for test samples (also modeled as fusion vectors). This is detailed in Section 4.3.

4.1. Rank-Fusion Graphs

The fusion graph extraction component produces G for a given q, in terms of m rankers and n response items. A fusion G graph is an encoding of ranks for q, aiming at encapsulating and correlating works as an underlying representation

Figure 2: Extraction of a fusion graph.

model that enables – by means of multiples modalities and rankers – multimodal objects to be compared and analyzed in a uniform and robust way. One G is generated for each query sample.

We follow the fusion graph formulation from [4], referred to as FG, that defines a mapping function , based on its ranks and ranks’ inter-relationships. The initial formulation includes a dissimilarity function for G, and a retrieval model based on fusion graphs. Here, however, we focus on the definition of FG itself, extending its use as part of a rank-based late-fusion approach for representation model in multimodal prediction tasks, without these components.

The process is illustrated in Fig. 2. Given a query (or test sample) q, m rankers, and a train set of size n, m ranks are generated. These ranks are then normalized to allow for producing the fusion graph G for q. In rank normalization, the scores in the ranks generated by dissimilarity-based comparators are converted to similarity-based scores. Besides, all ranks have their scores rescaled to the same interval.

In this formulation, G is built as a weighted directed graph. G, for q, and it includes all response items from each as vertices. In essence, the vertex set for G is composed of the union of all response samples found in all ranks defined for q. A vertex is associated with a response sample A. The vertex weight of , expressed by and given by Equation 1, is expected to encode how relevant A is to is taken as the sum of the similarities that A has in the ranks of q.

Edges are created between vertices by taking into account the degree of relationship of their response items, and the degree of their relationships to q. There is an edge , linking to , if A and B are both responses in any rank of q and if B occurs in any rank of A. The weight of , is the sum of the similarities that the response item B has in the ranks of A, divided by the position of A in each rank of q (Equation 2), considering position values starting by 1.

4.2. Embedding of Rank-Fusion Graphs

Let be the fusion graph set for the response set of a given collection. Based on G, an embedding function E defines a vector space in which a fusion graph G is projected as a fusion vector V, i.e., V = E(G) for any G. We claim V can be adopted as a suitable representation model of multi-ranked objects for multimodal prediction tasks. It encodes the use of multiple rankers and allows the fusion of multiple modalities.

Important advantages arrive from this embedding process to our prediction framework: (i) the vector domain brings broader availability of algorithms and techniques, compared to the graph domain; (ii) the objects can be indexed by structures such as search trees or inverted files [33], promoting overall efficiency; and (iii) improved compression and storage capabilities can be applied.

Dissimilarity scores between fusion vectors can be obtained by traditional vector comparators, ranging from correlation metrics to traditional distances and dissimilarity functions, such as Jaccard, cosine, or the Euclidean distance. Note also that fusion graph vector representations can be combined with indexing mechanisms to allow fast retrieval, in sub-linear time. E can be defined by either supervised or unsupervised or strategies. We explore three approaches in this paper, preliminarily presented in [13] in the context of retrieval tasks. Different from that work, here we extend its use on prediction tasks involving supervised learning. As our method is flexible in terms of its components, other embeddings could be further explored, based on node signatures [34, 35] or subgraph matching [36, 37].

This first embedding approach, , derives the vector space based on vertex analysis. Let ) be the weight of the vertex v, if , otherwise 0. Similarly, let ) be the weight of the edge e, if , otherwise 0. Also, let d be the dimensionality of the vector space induced from E, such that IR. For , one vector attribute is induced for each response object, therefore generates the weights in V according to the vertices from G, such that V[i] is the importance value of the i-th attribute, therefore V[i] = ). Despite the vector space increases linearly to the collection size, the fusion vectors are sparse, i.e., a few non-zero entries compose the vector, which makes this approach simple and efficient in practice.

is a hybrid embedding approach based on both vertex and edge analysis. encodes more information into the vector space, at a cost of a higher dimensionality d = . These two approaches explore node-based and edge-based local

signatures [34]. Comparing both, is expected to gain in effectiveness and lose in efficiency. V is defined as

The third approach, , extends the Bag of Graphs (BoG) archetype to embed graphs as a histogram of kernels, where the vectorial attributes are selected by unsupervised selection of common subgraph patterns. The kernels are obtained from the centroids of a graph clustering process. Then, a vector quantization process, consisting of assignment and pooling procedures, is adopted to embed an input graph to the vector space. extends BoG using the following definitions:

• Graph of Interest (GoI): a kernel sub-structure from G, such that a set of valid subgraphs from G can be induced.

• GoI Dissimilarity Function: provides a dissimilarity score between two subgraphs. We employ MCS (Equation 4),

• Codebook Generation: a subset of the training GoI’s must be selected to compose the codebook. BoG suggests

• Assignment: defines an activation value correlating a subgraph to a vector attribute. We adopt Soft Assignment,

assignment value between the subgraph and the attribute is the vocabulary size, ) computes the GoI dissimilarity between and , and K(x) =

• Pooling: summarizes assignments, producing the final vector. We adopt Average Pooling, which assigns to the j-th

is potentially more discriminative and concise than and in terms of the resulting embedding, but it requires more computation, domain specialization, and hyperparameter tuning. BoG has been successfully applied in scenarios involving graph classification, textual representation, and information retrieval [13, 38, 40]. This is so far the first extension of BoG in the context of multimodal representation for prediction tasks.

Among those embeddings, is expected to be the best in terms of effectiveness, but to be the worst with regard to efficiency. This tradeoff must be evaluated in the target scenario to guide the choice of the type of embedding. We refer to FV-V as the fusion vector generated by , while FV-H is generated by , and FV-K by .

4.3. Multimodal Prediction based on Graph Representations

Let S be a training corpus of size n. We refer to predictor as a statistical model that learns from historical data to make predictions about future or unknown events [42]. A predictor can be modeled as , where f is an approximation function, X are the independent variables, Y is the dependent variable (target), and are unknown parameters. A learning model explores S to find an f that minimizes a certain error metric. The training samples are generally labeled, so Y may be categorical. Still, a regressor can be built if the ground truth is numerical, as E(Y |X) = ), so that posterior probabilities are inferred in order to estimate a confidence score of a sample to refer to a class of not.

Fusion vectors allow the creation of predictors, such as classifiers or regressors, and also ad-hoc retrieval systems, depending on the underlying demanded task. In this work, we apply them to build multimodal predictors, where training objects – associated with ground-truth information – are used to train a supervised estimator. This estimator infers, for a given input object, whether it relates to a label (or event) or not. X, in our case, refers to the fusion vector, acting as a set of variables that intrinsically describes the sample in terms of its multiple multimodal information and its relationship to S.

From S and m rankers, we build , from which we derive . A classifier f is trained over V and their corresponding labels. Any standard learning model can be applied, such as multiclass SVM, a neural network, random forests, and so forth. We suggest the adoption of non-linear classifiers that can handle large dimensionality, such as SVM or gradient boosting.

4.4. Cost Analysis

Base rankers usually adopt indexing structures, such as search trees or inverted files, leading to rank generations in sub-linear time with respect to the response set, taking O(log n). One rank is stored in O(L), due to our hyper-parameter

The fusion graph extraction (Section 4.1) has an asymptotic cost of O(mL) to induce the vertices, and O(mLmL) to induce the edges, leading to ). As both small values of L and small number (m) of rankers are used, the cost of the graph extraction itself is negligible when compared to the prior generation of base ranks. The number of vertices in each fusion graph is O(mL) in the worst case. Therefore, the storage of a fusion graph is ) in the worst case.

As the adopted fusion graph does not repeat vertex labels, we can adopt efficient graph comparison functions based on minimum common subgraphs computed in ) [43]. The comparison between two fusion graphs is bounded to ). Both aspects lead to fairly efficient graph comparators in practice.

The inference phase relies on the existence of ranks, fusion graphs and fusion vectors from the response set plus the estimator. The rank generation for the response set, considering m rankers, takes O(n(m log n)), and requires O(n(mL)) for storage. As for the response fusion graphs, it takes ) for both generation and storage. The embedding of rank-fusion graphs take O(mLn) for the train set and O(mL) for a test sample, as for , and it depends on the underlying embedding approach. The cost storage of fusion vectors also depends on the embedding approach, as the dimensionality varies. The vectors produced by the proposed approaches are sparse, and it takes O(mLn) for the train set. Besides, as they are used for multimodal learning, they do not need to be stored. Only the estimator must be kept. The estimator can be trained in O(n) by learning models such as SVM. A multimodal prediction for an input fusion vector takes O(1).

The final cost of the training stage is O(n(m log )) for execution, and ) for storage, or approximately O(n log n) and O(n) for small values of m and L, which is overall efficient.

The cost of an inference is the sum of the costs for: generating the ranks for q; generating the query fusion graph ; generating the fusion vector V; and the prediction by the estimator over V. The cost from steps two to four are negligible when compared to the first. The generation of is asymptotically limited by the slowest ranker. In general, this step takes O(m log n). The second and third steps take ), and the final step is O(1). The method is efficient and flexible in terms of its components. The main trade-off is the eventual reduction in the prediction time in relation to the potential gain in accuracy.

5. Experimental Evaluation

This section presents the experimental protocol used to evaluate the method, and the results achieved comparatively to state-of-the-art baselines. We evaluate the effectiveness of our method as a representation model in prediction tasks. The focus is on validating our fusion method comparatively to the individual use of descriptors with no fusion, as well as to compare it to early-fusion and late-fusion approaches.

5.1. Evaluation Scenarios

We evaluate the proposed method on multiple datasets, comprised of heterogeneous multimodal data, in order to assess its general applicability. Our experimental evaluation comprises four scenarios, as follows:

• ME17-DIRSM dataset, the acronym for MediaEval 2017 Disaster Image Retrieval from Social Media [44], is a

• Brodatz [45] is a dataset of texture images, labeled across 111 classes. There are 16 samples per class, composing

• Soccer [46] is an image dataset, labeled across 7 categories (soccer teams), containing 40 images each.

• UW [47], or University of Washington dataset, is a multimodal collection of 1,109 images annotated by textual

5.2. Evaluation Protocol

For any dataset that does not explicitly define train and test sets, we initially split it into stratified train and test sets, at a proportion of 80%/20%. The proportions per class remain equal. The same train and test sets per dataset are adopted to evaluate all methods under the same circumstances, as well as the evaluation metrics.

For each representation model, we fit a multiclass SVM classifier, with the one-vs-all approach and linear kernel, as it is a good fit for general applicability. Hyper-parameters are selected by grid search on the train set, using 5-fold cross validation. We evaluate the effectiveness of each method by the balanced accuracy score – defined as the average of recall scores per class – which is suitable to evaluate on both balanced and imbalanced datasets. The methods are compared by their balanced accuracy.

The descriptors compose rankers, which are employed to generate ranks in our late-fusion representation. Our method varies with respect to which rankers are used, whether visual or textual rankers, or even their combinations for the multimodal scenario are applied. Besides that, it also varies with respect to which embedding approach is adopted. We evaluate these aspects experimentally.

We model our solution as a rank-fusion approach, followed by an estimator based on rank-fusion vectors. This approach intends to validate our hypothesis that unsupervised graph-based rank-fusion approaches can lead to effective representation models for prediction tasks in general.

We adopt the same experimental evaluation for all datasets but ME17-DIRSM, that defines its own procedure. In this case, the task imposes three evaluation scenarios, as follows. In the first one, called “visual”, only visual data can be used. In the second, called “textual”, only textual data are used. In the third, called “multimodal”, both visual and

Table 1: Datasets and descriptors for the experimental evaluation.

textual data are expected to be used. The correctness is evaluated, over the test set, by the metric Average Precision at K (AP@K) at various cutoffs (e.g., 50, 100, 250, and 480), and by their mean value (mAP).

Although the ME17-DIRSM task may be seen as a multimodal binary classification problem, the evaluation metrics require ranking-based solutions, or equivalently confidence-level regressors, so that the first positions are the most likely to refer to a flood event. For the estimator component in ME17-DIRSM, we adopt SVR, an L2-regularized logistic regression based on linear SVM in its dual form, with probabilistic output scores, and trained over the fusion vectors from devset. Probabilistic scores are used so that we can sort the test samples by confidence expectancy of being flood.

In ME17-DIRSM, our results are compared to those from state-of-the-art baselines. In Soccer, Brodatz, and UW, our results are compared to those from two major fusion approaches: concatenation, and majority vote. They cover baselines from both early-fusion and late-fusion families. For the concatenation procedure, we first normalize the vectors to the [0, 1] interval for each attribute, in order to avoid disparities due to different descriptor attribute ranges. We apply majority voting in the scenarios involving an odd number of descriptors, so that each predicted class is taken as the one most frequently predicted by the estimators constructed for each descriptor.

5.3. Descriptors and Rankers

For ME17-DIRSM, we selected three visual descriptors and three textual descriptors, for individual analysis in the designed evaluation scenarios, and to evaluate different possibilities of rank-fusion aggregations. We adopt the following state-of-the-art visual descriptors: (i) ResNet50IN : 2048-dimensional average pooling of the last convolutional layer of ResNet50 [48], pre-trained on ImageNet [49], a dataset of about 14M images labeled for object recognition; (ii) VGG16P365: 512-dimensional average pooling of the last convolutional layer of VGG16 [50], pre-trained on Places365-Standard [51], a dataset of about 10M images of labeled scenes; (iii) NASNetIN : 2048-dimensional average pooling of the last convolutional layer of NASNet [52], pre-trained on ImageNet dataset.

Based on the textual metadata in ME17-DIRSM, we adopt the following descriptors: (i) BoW : Bag of Words (BoW) with Term Frequency (TF) weighting; (ii) 2grams: 2grams with TF weighting; (iii) doc2vecWiki: 300-dimensional doc2vec [53] pre-trained on English Wikipedia dataset, of about 35M documents and dumped at 2015-12-01.

For the other datasets, we elected a number of heterogeneous descriptors. In Soccer: BIC, GCH, and ACC. In Brodatz: JCD, FCTH, and CCOM. In UW: JAC, ACC, JCD, and CEDD, as visual descriptors, and word2vecSum, word2vecAvg, doc2vecWiki, and doc2vecApnews, as textual descriptors.

For the deep networks used for CNN-based visual feature extraction, as well as in the textual feature extraction with doc2vec, we take advantage of pre-trained models. This practice, known as transfer learning, has been effective in many scenarios [54], and it is also particularly beneficial for datasets that are not large enough to generalize the training of such large architectures, as in our case. Because the problem requires prediction of flood images, we prioritize, in the selection of visual descriptors, datasets for pre-training that focus on images of scenes, aiming at better generality to the target problem.

We perform the same preprocessing steps for every textual descriptor: lower case conversion, digit and punctuation removal, and English stop word removal. For BoW and 2grams, we also apply Porter stemming.

The word2vecSum descriptor produces, for any input document, a vector corresponding to the sum of the word embedding vectors [55] related to each term within that document, while word2vecAvg computes the mean vector of them.

The doc2vec model promotes document-level embeddings for texts, and it is based on word embeddings [55], a preliminary work that assigns vector representations for words in order to capture their semantic relationships. doc2vecApnews stands for a 300-dimensional doc2vec [53] model, pre-trained over the Associated Press News textual dataset, of about 25M news articles from 2009 to 2015. The datasets and descriptors are summarized in Table 1.

From the descriptors, rankers are defined as tuples of (descriptor, comparator), where the comparator corresponds to a dissimilarity function. We compose a ranker for each descriptor by choosing an appropriate comparator. We define dissimilarity functions to be used along with those descriptors that are not explicitly associated with one. This is the case for the textual descriptors adopted in UW, as well as the descriptors adopted in ME17-DIRSM. All remaining descriptors define their own comparators. For the textual descriptors BoW and 2grams, we adopt the weighted Jaccard distance, defined as 1 ), where J is the Ruzicka similarity metric (Equation 7). Jaccard is a well-known and widely-used comparison metric for classic textual descriptors, especially for short texts, as in our case. For the remaining descriptors, we choose the Pearson correlation distance, defined as 1 ) (Equation 8), which is a general-purpose metric due to its suitability for high dimensional data and scale invariance.

5.4. Fusion Setups

For both visual and textual scenarios in ME17-DIRSM, we analyze three variants of our method with respect to the input rankers for late-fusion. For the visual scenario, the combinations are ResNet50IN + NASNetIN, ResNet50IN + VGG16P365, and ResNet50IN + NASNetIN + VGG16P365. For the textual scenario, the combinations are BoW + 2grams, BoW + doc2vecWiki, and BoW + 2grams + doc2vecWiki.

As for the multimodal scenario, we investigate some combinations taking one ranker of each type, two of each, and three of each. Six multimodal combinations are evaluated: ResNet50IN + BoW, ResNet50IN + NASNetIN + BoW + 2grams, ResNet50IN + NASNetIN + BoW + doc2vecWiki, ResNet50IN + VGG16P365 + BoW + 2grams, ResNet50IN + VGG16P365 + BoW + doc2vecWiki, and ResNet50IN + NASNetIN + VGG16P365 + BoW + 2grams + doc2vecWiki.

We report three results for the adoption of FV, in its different embedding approaches, as a representation model for prediction tasks, in Soccer, Brodatz, and UW. We report the results for multiple descriptor combinations, in order to analyze: (i) the method against baselines, (ii) the embedding approaches, and (iii) the comparative effectiveness between the descriptor combinations. The descriptor combinations selected, although not exhaustive, are targeted for a large

Figure 3: Base results of the descriptors, along with a SVR, in ME17-DIRSM, visual scenario.

Table 2: Base results of the descriptors, along with a SVM, in Soccer.

number of scenarios. In Soccer and Brodatz, all possible combinations were selected for evaluation. In UW, several visual combinations and multimodal combinations were selected. The descriptor combinations in Soccer are: ACC + BIC, BIC + GCH, ACC + GCH, and ACC + BIC + GCH. In Brodatz: CCOM + FCTH, CCOM + JCD, FCTH + JCD, and CCOM + FCTH + JCD. In UW: ACC + CEDD, ACC + JCD, CEDD + JAC, CEDD + JCD, ACC + CEDD + JCD, and CEDD + JAC + JCD, for visual fusion, and ACC + doc2vecApnews, JCD + doc2vecApnews, ACC + JCD + doc2vecApnews, ACC + JCD + doc2vecWiki, ACC + JCD + word2vecAvg, and ACC + JCD+word2vecSum, for multimodal fusion.

5.5. Results and Discussion

5.5.1. Base Results

Here we report results by the use of individual descriptors. They constitute an initial baseline for our method as well as for other fusion approaches. Fig. 3 and 4 present the results for the visual and textual scenarios in ME17-DIRSM, achieved by the visual and textual selected descriptors, along with a SVR regressor. As the task only showed AP@480 and mAP in their leaderboard, we focus our discussions on these two metrics. The correctness for the visual scenario is already high within these baselines, around 85% in AP@480. In the textual scenario, AP@480 is around 65%, which suggests more room for improvement. We report, in Tables 2, 3, 4, and 5, the results obtained in Soccer, Brodatz, UW (visual), and UW (textual), respectively, for the base descriptors along with a SVM classifier.

Table 3: Base results of the descriptors, along with a SVM, in Brodatz.

Figure 4: Base results of the descriptors, along with a SVR, in ME17-DIRSM, textual scenario.

Table 4: Base results of the descriptors, along with a SVM, in UW, visual scenario.

Table 5: Base results of the descriptors, along with a SVM, in UW, textual scenario.

Figure 5: Effect of the rank size limit (L) for the fusion graph extraction, in the mAP score, for different fusion scenarios in ME17-DIRSM.

5.5.2. Parameter Analysis

We begin the experimental evaluation of the method by evaluating its hyper-parameters and proposed variants. The resulting size of G is affected by the input rank sizes, defined by the hyper-parameter L. For the same reason, larger FG’s either increase the vocabulary sizes of FV or the complexity to generate them. Dourado et al. [4] showed that an increase in L leads to more discriminate graphs up to a saturation point. A practical upper bound for the choice of L tends to be the maximum rank size of users’ interest, indirectly expressed here by the evaluation metrics.

As ME17-DIRSM defines evaluation metrics for ranks up to 480, we start by empirically evaluating the influence of L in the mAP score, for values to up 480. Fig. 5 reports the influence of L for some of the elected fusion scenarios. The results were as expected: the effectiveness usually increased as L was larger. For the next evaluation scenarios in ME17-DIRSM, we adopt L = 480. For the other datasets, we adopt L = 10.

5.5.3. Fusion Results in Event Detection

We present our results achieved for the three scenarios in ME17-DIRSM, using the combinations proposed, along with the results of the teams that participated in the competition [5–10, 12, 29–32]. These results are presented in Fig. 6, 7, and 8, respectively for the visual, textual, and multimodal scenarios. In ME17-DIRSM, we focused on the embedding approach. For the other datasets, we evaluate all of them.

In the visual scenario, only Ahmad et al. [30] and Bischke et al. [5] performed better, in terms of AP@480 and mAP, than our preliminary base setup, based on individual descriptors along with the SVR regressor. As for the textual scenario, only Tkachenko et al.[10] surpassed BoW + SVR in AP@480, and only Bischke et al. [5], Nogueira et al. [31], and Zhao and Larson [32] surpassed BoW + SVR in mAP. This indicates that descriptors properly selected to the target problem can overcome more complex models, also requiring less effort.

Figure 6: Flood detection based on visual features, in ME17-DIRSM.

Figure 7: Flood detection based on textual features, in ME17-DIRSM.

Figure 8: Flood detection based on multimodal features, in ME17-DIRSM.

Table 6: Balanced accuracies by fusion methods in Soccer.

Our method was superior in the visual scenario to all baselines, for two of three proposed variants of ranker combinations. Compared to the strongest baselines considering this scenario, it presents gains from around 1 to 2% in AP@480, and 1% in mAP. Compared to the visual base results, from the individual descriptors, 3 to 4% in AP@480, and 1 to 2% in mAP.

In the textual scenario, our gains were expressive. It was superior in the textual scenario to all related works, for all three proposed variants of ranker combinations. Compared to the strongest baselines, it presents gains from 5 to 7% in AP@480, and 6 to 11% in mAP. Compared to the textual base results, from the individual descriptors, 6 to 8% in AP@480, and 14 to 16% in mAP. In the multimodal scenario, considered baselines were more competitive. Again, however, our method presents gains over them, of 0.5% in AP@480 and 0.23% in mAP.

5.5.4. Fusion Results in Classification Tasks

Tables 6, 7, 8, and 9 report the results obtained by our method variants, respectively in Soccer, Brodatz, UW for image data, and UW for multimodal data, besides the results obtained by the baselines.

Our method led to significant gains when compared to the best base result from the descriptors in each dataset: around 8 p.p. in Soccer, 13.2 p.p. in Brodatz, 2.2 p.p. in UW for visual fusion, and 5.9 p.p. in UW for multimodal fusion. The

Table 7: Balanced accuracies by fusion methods in Brodatz.

Table 8: Balanced accuracies for visual fusion in UW.

Table 9: Balanced accuracies for multimodal fusion in UW.

gains in UW were comparatively lower than others, yet consistent, because the base results were already higher, so that there was less room for improvement. Our method, when compared to the best baseline in each descriptor combination in each dataset, had gains in all cases: up to 12.4 p.p. in Soccer, up to 2.9 p.p. in Brodatz, up to 7.1 p.p. in UW for visual fusion, and up to 3.9 p.p. in UW for multimodal fusion. Overall, all the FV approaches outperformed all baselines in all datasets. In only 6 out of 20 descriptor combinations, some of the baselines surpassed any of the FV approaches.

The accuracy disparities, obtained by FV or any other fusion method, across the multiple descriptor combinations in each dataset, show that there is no obvious choice when dealing with which descriptors to be used together. As our representation model is meant to be unsupervised, this choice could only be guided by general heuristics, such as a selection of effective and low correlated descriptors, as discussed in [4]. We leave this exploration for future work.

FV-K usually outperformed FV-H and FV-V, and FV-H usually outperformed FV-V, although in both cases the gains were at most 3 p.p. On the opposite side, FV-V is the simplest among the three, and FV-K requires more computational steps. These two aspects combined impose that the practical choice among the three must take into account the trade-off between accuracy vs computational cost. In any case, our method is unsupervised and feasible for general applicability.

6. Conclusions

This paper presented an unsupervised graph-based rank-fusion approach as a representation model for multimodal prediction tasks. Our solution is based on encoding multiple ranks into a graph representation, which is later embedded into a vectorial representation. Next, an estimator is built to predict if an input multimodal object refers to a target event or not, given their graph-based fusion representations. The proposed method extends the fusion graphs – first introduced in [4] – for supervised learning tasks. It also applies a graph embedding mechanism in order to obtain the fusions vectors, a late-fusion vector representation that encodes multiple ranks and their inter-relationships automatically. Performed experiments in multiple prediction tasks, such as flood detection and multimodal classification, demonstrate that our solution leads to highly effective results overcoming state-of-the-art solutions from both early-fusion and late-fusion approaches. Future work focuses on investigating the impact of semi-supervised and supervised approaches for the fusion graph and fusion vector constructions. We also plan to investigate our method in other multimodal problems, such as recommendation and hierarchical clustering. Finally, we plan to evaluate the proposal for other multimedia data, such

as audio and video.

References

[1] W. Zhou, H. Li, Q. Tian, Recent advance in content-based image retrieval: A literature survey, arXiv:1706.06064 (2017).

[2] S. Zhang, M. Yang, T. Cour, K. Yu, D. N. Metaxas, Query specific rank fusion for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015) 803–815.

[3] I.-S. Dao, P. Q. N. Minh, A. Kasem, A context-aware late-fusion approach for disaster image retrieval from social media, in: Proc. 8th International Conference on Multimedia Retrieval, ACM, 2018, pp. 266–273.

[4] I. C. Dourado, D. C. G. Pedronette, R. S. Torres, Unsupervised graph-based rank aggregation for improved retrieval, Information Processing & Management 56 (2019) 1260–1279.

[5] B. Bischke, P. Bhardwaj, A. Gautam, P. Helber, D. Borth, A. Dengel, Detection of flooding events in social multimedia and satellite imagery using deep neural networks, in: MediaEval Workshop, 2017.

[6] M. S. Dao, Q. N. M. Pham, D. Nguyen, D. Tien, A domain-based late-fusion for disaster image retrieval from social media, in: MediaEval Workshop, 2017.

[7] X. Fu, Y. Bin, L. Peng, J. Zhou, Y. Yang, H. T. Shen, Bmc@mediaeval 2017 multimedia satellite task via regression random forest, in: MediaEval Workshop, 2017.

[8] M. Hanif, M. A. Tahir, M. Khan, M. Rafi, Flood detection using social media data and spectral regression based kernel discriminant analysis, in: MediaEval Workshop, 2017.

[9] L. Lopez-Fuentes, J. van de Weijer, M. Bolanos, H. Skinnemoen, Multi-modal deep learning approach for flood detection, in: MediaEval Workshop, 2017.

[10] N. Tkachenko, A. Zubiaga, R. Procter, Wisc at mmediaeval 2017: Multimedia satellite task, in: MediaEval Workshop, 2017.

[11] R. O. Werneck, I. C. Dourado, S. G. Fadel, S. Tabbone, R. S. Torres, Graph-based early-fusion for flood detection, in: 25th International Conference on Image Processing, IEEE, 2018, pp. 1048–1052.

[12] K. Avgerinakis, A. Moumtzidou, S. Andreadis, E. Michail, I. Gialampoukidis, S. Vrochidis, I. Kompatsiaris, Visual and textual analysis of social media and satellite images for flood detection@ multimedia satellite task mediaeval 2017, in: MediaEval Workshop, 2017.

[13] I. C. Dourado, R. S. Torres, Fusion vectors: Embedding graph fusions for efficient unsupervised rank aggregation, arXiv:1906.06011 (2019).

[14] I. C. Dourado, S. Tabbone, R. S. Torres, Event prediction based on unsupervised graph-based rank-fusion models, in: Graph-Based Representations in Pattern Recognition, Springer, 2019, pp. 88–98.

[15] T. Baltruˇsaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 423–443.

[16] C. Tzelepis, Z. Ma, V. Mezaris, B. Ionescu, I. Kompatsiaris, G. Boato, N. Sebe, S. Yan, Event-based media processing and analysis: A survey of the literature, Image and Vision Computing 53 (2016) 3–19.

[17] F. A. Faria, J. Almeida, B. Alberton, L. P. C. Morellato, R. S. Torres, Fusion of time series representations for plant recognition in phenology studies, Pattern Recognition Letters 83 (2016) 205–214.

[18] A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. Morency, Tensor fusion network for multimodal sentiment analysis, arXiv:1707.07250 (2017).

[19] Z.-Z. Lan, L. Bao, S.-I. Yu, W. Liu, A. G. Hauptmann, Multimedia classification and event detection using double fusion, Multimedia tools and applications 71 (2014) 333–347.

[20] D. Kiela, L. Bottou, Learning image embeddings using convolutional neural networks for improved multi-modal semantics, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 36–45.

[21] E. Bruni, N.-K. Tran, M. Baroni, Multimodal distributional semantics, Journal of Artificial Intelligence Research 49 (2014) 1–47.

[22] C. Silberer, M. Lapata, Learning grounded meaning representations with autoencoders, in: Proc. 52nd Association for Computational Linguistics, 2014, pp. 721–732.

[23] S. Kottur, R. Vedantam, J. M. Moura, D. Parikh, Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes, in: Computer Vision and Pattern Recognition, 2016, pp. 4985–4994.

[24] F. Hill, A. Korhonen, Learning abstract concept embeddings from multi-modal data: Since you probably can’t see what i mean, in: Proc. Empirical Methods in Natural Language Processing, 2014, pp. 255–265.

[25] K. Nogueira, S. G. Fadel, I. C. Dourado, R. O. Werneck, J. A. V. Mu˜noz, O. A. B. Penatti, R. T. Calumby, L. T. Li, J. A. dos Santos, R. S. Torres, Exploiting convnet diversity for flooding identification, IEEE Geoscience and Remote Sensing Letters 15 (2018) 1446–1450.

[26] G. Park, W. Im, Image-text multi-modal representation learning by adversarial backpropagation, arXiv:1612.08354 (2016).

[27] S. E. Yuksel, J. N. Wilson, P. D. Gader, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 1177–1193.

[28] X. Dong, Y. Yan, M. Tan, Y. Yang, I. W. Tsang, Late fusion via subspace search with consistency preservation, IEEE Transactions on Image Processing 28 (2018) 518–528.

[29] K. Ahmad, K. Pogorelov, M. Riegler, N. Conci, H. Pal, Cnn and gan based satellite and social media data fusion for disaster detection, in: MediaEval Workshop, 2017.

[30] S. Ahmad, K. Ahmad, N. Ahmad, N. Conci, Convolutional neural networks for disaster images retrieval, in: MediaEval Workshop, 2017.

[31] K. Nogueira, S. G. Fadel, I. C. Dourado, R. O. Werneck, J. A. Mu˜noz, O. A. B. Penatti, R. T. Calumby, L. T. Li, J. A. dos Santos, R. S. Torres, Data-driven flood detection using neural networks, in: MediaEval Workshop, 2017.

[32] Z. Zhao, M. Larson, Retrieving social flooding images based on multimodal information, in: MediaEval Workshop, 2017.

[33] C. D. Manning, P. Raghavan, H. Sch¨utze, Introduction to information retrieval, Cambridge university press, 2008.

[34] P. C. Wong, H. Foote, G. Chin, P. Mackey, K. Perrine, Graph signatures for visual analytics, IEEE Transactions on Visualization and Computer Graphics 12 (2006) 1399–1413.

[35] N. Hu, R. M. Rustamov, L. Guibas, Stable and informative spectral signatures for graph matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2305–2312.

[36] S. Jouili, S. Tabbone, Graph matching based on node signatures, in: Graph-Based Representations in Pattern Recognition, Springer, 2009, pp. 154–163.

[37] L. Zou, M. T. ¨Ozsu, L. Chen, X. Shen, R. Huang, D. Zhao, gstore: a graph-based sparql query engine, Very Large Data Bases 23 (2014) 565–590.

[38] I. C. Dourado, R. Galante, M. A. Gon¸calves, R. S. Torres, Bag of textual graphs (botg): A general graph-based text representation model, Journal of the Association for Information Science and Technology (2019).

[39] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 603–619.

[40] F. B. Silva, R. O. Werneck, S. Goldenstein, S. Tabbone, R. S. Torres, Graph-based bag-of-words for classification, Pattern Recognition 74 (2018) 266 – 285.

[41] J. C. Van Gemert, C. J. Veenman, A. W. Smeulders, J. M. Geusebroek, Visual word ambiguity, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 1271–1283.

[42] T. Mitchell, Machine Learning, McGraw-Hill, New York, NY, USA, 1997.

[43] P. J. Dickinson, H. Bunke, A. Dadej, M. Kraetzl, Matching graphs with unique node labels, Pattern Analysis and Applications 7 (2004) 243–254.

[44] B. Bischke, P. Helber, C. Schulze, S. Venkat, A. Dengel, D. Borth, The multimedia satellite task at mediaeval 2017: Emergence response for flooding events, in: MediaEval Workshop, 2017.

[45] P. Brodatz, Textures: A photographic album for artists and designers, Dover, 1966.

[46] J. Van De Weijer, C. Schmid, Coloring local feature extraction, Proc. 9th European Conference on Computer Vision (2006) 334–348.

[47] T. Deselaers, D. Keysers, H. Ney, Features for image retrieval: an experimental comparison, Information Retrieval 11 (2008) 77–107.

[48] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778.

[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. B., A. C. Berg, L. Fei-Fei, Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252.

[50] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556 (2014).

[51] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).

[52] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, arXiv:1707.07012 (2017).

[53] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, 2014, pp. 1188–1196.

[54] S. Kornblith, J. Shlens, Q. V. Le, Do better imagenet models transfer better?, arXiv:1805.08974 (2018).

[55] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.