Interactions in information spread: quantification and interpretation using stochastic block models

2020Β·arXiv

ABSTRACT

1 INTRODUCTION

With the Internet, people have begun to interact with each other as never before. Nowadays, social networks such as Facebook, Reddit or WhatsApp let us share and compare ideas. Modeling the dynamics of these exchanges can help us understanding why and how various pieces of information (e.g., hashtags, memes, ideas, etc.) will flow through a community. We refer to these pieces of information as entities. Understanding the underlying dynamics at work provides powerful means to predict and control entitiesβ spread. Up to now, little work has been done on investigating the role of interacting entities in usersβ decisions (retweet, share, comment answer, etc.), or more generally in the probability of outcomes. A number of previous works on information diffusion theory only consider a user acting on an isolated piece of information [20, 21]. On some occasions, theoretical frameworks have been developed to investigate how the presence of concurrent pieces of information affect the action a user exerts on them in a network [2]. However, a fundamental question to be answered is how pieces of information interact in the informational landscape. Considering a recommender system, a customer that bought a smartphone might be interested by side accessories such as headphones or selfies sticks, but a customer that bought a smartphone and a camera lens extension might be more interested in buying a professional camera. Another example considering a disease-symptoms corpus, the diagnosis of a disease is established from the interaction of several symptoms; the interaction between βrunning noseβ and βheadacheβ symptoms is likely to lead to a βfluβ diagnosis. An approach without interactions would be less successful here, since each of the products/symptoms can lead to a number of different recommendations/diagnoses. The same line of reasoning can be applied to the prediction of retweets (user exposures to A and B affects its retweet of C), music playlist building (same as before), detection of controversial posts (what combinations of words trigger what answers), etc. Beyond the state-of-the-art guess that a single piece of information is enough to trigger another one (in music recommendation for instance, a user might associate βACDCβ βMetallicaβ, or βChopinβ βMozartβ) [18], we believe that the world is more complex, and that such a mechanism should not be so simplistic. Our idea in the present work is to take the interaction of pairs of entities into account in the prediction of an outcome (βACDCβ + βMetallicaβ βPink Floydβ). Therefore, throughout the present article we develop a novel prediction method with multiple entity types.

The remainder of this article is organized as follows. In Section 2, we present an overview of the landscape of information interaction studies for both information spread and recommender systems. Then in Section 3, we detail our model and its mathematical derivation and describe an efficient algorithm to apply it. Next, in Section 4, we assess the model on real-world datasets. Finally, in Section 5, we quantify the role of interactions in each corpus and provide insight to analyzing the results from a semantic point of view.

2 BACKGROUND

2.1 Interaction in information diffusion

Much previous research was aimed at understanding the mechanisms of information propagation. Many studies focus on the importance of the network structure coupled with the intrinsic contagiousness, or virality, of the spreading entity (that depends on its nature and content [19, 25]) in the modeling of the spreading processes [13, 24].

Recent works proposed to model the diffusion of information as the result of an interaction between information [2, 22]. Following a similar idea, Myers and Leskovec [18] investigated interactions between contagions on Twitter. The authors aim to find the interaction factors between different tweets in oneβs activity feed. Their findings suggest that interactions between tweets play a determinant role in their retweet. In their work, they assume that there is an inherent virality for every tweet (that is an inherent probability to be retweeted) computed from the frequency of retweets, to which is added a small interaction term.

While this latter work opens paths for studying interactions, it also presents various limitations that we address here. Firstly, the method proposed by the authors makes predictions solely based on tweets that have been observed in the feed of a given user. It therefore limits the application range of the model uniquely to systems based on the retweets (or share) concept, where information has to appear first in order to be spread. This model is hardly applicable to systems that are based on exogenous reactions (e.g., online forums, playlist building and recommender systems) where information can appear as a consequence of different entities (βCapitalβ + βNetherlandsβ βAmsterdamβ). We address this problem by developing a more global framework allowing for outputs that are different from the inputs in the considered datasets. Secondly, interaction is defined as a correction to the frequency of retweets of a given tweet in any context. We argue this can lead to false conclusions about interactions. Imagine that interactions lead to a shift of on the base probability (virality) of a retweet, and that this interaction happens in a fraction of all observations of a given tweet being retweeted. The virality as defined in [18] then equals , which is by construct larger than the actual probability of a retweet in the absence of interaction. Therefore, defining interaction according to this quantity is wrong (for instance it results in adding an interaction term when there is in fact no interaction). Virality needs to be inferred by the model at the same time as the contribution of interactions to be properly defined, which is part of what we propose to do here.

2.2 A recommender system approach

Research in recommender systems applied to multiple pieces of information is motivated by numerous descriptive studies on multimodal networks structure [11, 23, 27, 29]. Typically in [29], the authors study interaction between multiple entity types via a heterogeneous network representation and define clusters of entities based on structural properties of the resulting graph. However, as pointed out by the authors, this method is heavily influenced by the structural clustering method used βin this case a meta-path-based clustering [28]. Moreover, defining edge weights in heterogeneous graphs is subjective and requires additional learning algorithms.

A more direct representation of real-world systems is based on collaborative filtering techniques, that directly mines clusters from pieces of information interaction patterns and generates a weighted interaction graph between entities from unit independent observations. Typically, a widely used method in commercial applications is based on a Matrix Factorization approach [14]. This method considers a large number of user-item pairs and identifies regularities to model them in a lower dimensional space (e.g., it groups regularities into clusters). It has been shown that this algorithm is in fact a particular case of a wider model family: the Mixed Membership Stochastic Block Model (MMSBM) [4, 16]. Another particular case of this family is the single membership Stochastic Block Model (SBM) [30], whose use in the discovery of underlying interaction dynamics has been suggested in recent years [8].

More recently, in [6] the authors develop an extension of the MMSBM considering bipartite graphs. This formulation generalizes MF models into a more global framework based on the MMSBM. In this approach, entities of different type (e.g. users and items in the case of online shopping recommendation) are grouped into distinct clusters whose interaction result in an outcome (e.g. buy or not buy). This model is optimized via a scalable EM algorithm; it outperforms state-of-the-art models such as SBM [8, 30], MF [14] and Mixed-Membership MF [17] both in performing predictions and in scalability.

However, none of the cited works consider the interaction between pieces of information of the same nature in the prediction of an outcome. A drug might interact with another one, but the joint interaction of two drugs on a third one cannot be investigated [8]. A user on Netflix is predicted to like a given movie because she is partially part of the group that liked the movie A and partially part of the group that disliked movie B, but all the user groups are independent from one another [6, 14]. Friendship between individuals is determined on the basis of the independent groups of friends they belong to, but not on the basis of the joint belonging to various groups [4, 12, 16]. Typically in [6], embedding pieces of information of same nature in a bipartite graph seems irrelevant: taking the example of a online purchase recommendation system, a product should not interact differently with an other (or belong to different clusters) because it is on the left or the right side of a bipartite graph. Instead, we want to enforce a symmetrical interaction and thus a single clustering for pieces of information that interact in this way.

Relaxing the independence assumption and looking at the probability of an outcome due to the joint membership to more than one group (while also allowing for outcomes resulting that do not result from any interaction) provides a better assumption for modeling the subtle interaction process. Taking back the previous example of medical diagnosis: a disease can seldom be diagnosed on the basis of a single symptom (e.g. without interaction between symptoms), but rather on a symmetric combination of symptoms (e.g. with interaction).

2.3 Contributions

β’ We develop a scalable model that accounts for interactions between entities: the Interacting Mixed Membership Stochastic Block Model (IMMSBM).

β’ We show by comparing our results with non-interacting MMSBM [6] that taking interaction into account leads to a more accurate recommendations in real-world situations on 4 different datasets.

β’ We provide a proper way to infer virality, which allows to correctly define and compute interaction terms.

β’ We quantify the role of interactions in various corpus and highlight the necessity of taking them into account when dealing with real-world spreading or recommendation problems.

3 THE IMMSBM MODEL

In this section, we develop the IMMSBM model. We propose an approach based on standard Mixed Membership Stochastic Block Modeling [4, 6], which we modify in order to take interactions into account. Building our model on the MMSBM allows to assume that each entity does not have only one membership, which is in line with the real situation. The IMMSBM requires no prior information on the system and its numerical implementation is possible via a scalable Expectation-Maximization algorithm of linear complexity with the size of the dataset. In addition to the state-of-the-art problems which our method answers, it also offers better predictive power than non-interacting baselines.

The goal of the model is to predict the most likely result of an interaction between two entities (i and j in Fig.1). In recommender systems, it would recommend the song the most likely to fit userβs tastes given previous songs they listened to, or the product the most likely to be bought given a userβs purchase history. Another example about assisted medical diagnosis: observing the words βfatigueβ and βcoughβ in a medical report is more likely to imply the observation of βfluβ than βanemiaβ, despite βanemiaβ being often associated with the βfatigueβ symptom. The model will group data into clusters (membership matrix Fig.1) that interact symmetrically with each other (interactions tensor p), resulting in a probability over the possible outputs to appear (histograms Fig.1-top). We have no prior knowledge of the content of the groups, and we only need to set the number of clusters T.

3.1 Interacting MMSBM

We refer to the interacting entities as input entities aim to predict an output entity is the input space (the entities that interact: products, symptoms or songs for instance) and is the output space (the entities resulting from the interaction in an answer, diagnosed diseases for instance). We illustrate in Figure 1-top how the input space and output space are related according to our model: input entities are clustered, and the interaction between those clusters give rise to probability distributions over the output entities. Note that and can be identical or radically different according to what we want to model. In the case of musical recommendation, accounts for observed artists in a userβs feed, and for recommended artists βthat do not necessarily appears in

Figure 1: Illustration of the model β (Top) Schema of IMMSBM for a single pair of entities (which could be βfeverβ and βcoughβ for instance). Input entities are grouped into T clusters in different proportions; the proportion to which they belong to each cluster is quantified by a (dimension is the input space). The clusters then interact to generate a probability distribution over the output entities defined by the interactions tensor p (dimension is the output space). (Bottom) Alternative representation of the IMMSBM as a graphical model. To generate each output, for each observation () in the set , a cluster () is drawn for each input entity (a distribution encoded in the matrix. The generated output is drawn from a multinomial distribution conditioned by the previously drawn clusters k and l encoded in

the userβs feed, unlike [18]. In the case of medical diagnosis, accounts for symptoms (fever, cough, anemia, etc.) and for diseases (Alzheimer, hepatitis, etc.). If one wants to predict named entities in the answer to a Reddit post based on interacting named entities in the original post, then since the predicted vocabulary can also be used as an input (e.g., the named entity βSunβ can either be an input that interacts with other inputs, or an output). As an alternative visualization, we present the graphical generative model of the IMMSBM in Figure 1-bottom. The observed data then takes the form of triplets () signifying that the combined presence of input entities i and j leads to the output entity x. A given triplet can obviously occur several times in the same dataset.

We assume there are regularities in the studied dataset, so that given inputs interact with each other in the same way. Their classification into clusters would therefore be relevant. For the medical dataset example, this means that symptoms such as βfeverβ and βpallorβ often come in pairs and therefore are considered as similar regarding the diagnosis; they would belong to the same cluster. We define the membership matrix associating each input entity with clusters in different proportions, such that is a vector with 1. Note that unlike in single membership stochastic block models an entity does not have to belong to only one group [5]. Given the possible semantic variation of entities (polysemy of words in natural languages -e.g. βlikeβ, βswallowβ-, symptoms with various causes in medicine -βheadacheβ, βfeverβ-, etc.), an approach via a mixed-membership clustering seems more relevant.

We then define the cluster interactions tensor of dimensions as the probability that the interaction between clusters gives rise to the output property of this definition is that . The role of those two quantities is schematized in Fig.1.

We choose to consider only one membership matrix for all of the inputs, instead of one per input entry as in [6]. It implies the assumption that interactions to be symmetric, which means that an observation () is equivalent to (). This follows the idea of [2] where it is assumed that the interaction between two viruses is symmetric, meaning the interaction influences both viruses with the same magnitude. Therefore, there is no need to consider a different clustering for inputs i and j, which motivates the use of a single membership matrix

We now propose to define the entities interactions tensor , representing the probability that the interaction between inputs i and j implies the output x as:

For the sake of brevity, from now on we will refer to as . We define the likelihood of the observations given the parameters as:

where denotes the set of triplets in the training set (input, input, output). Note that the remaining triplets are used as test set.

Importantly, this formulation differs from related work in that the model presented here considers a symmetric interaction between pieces of information in the prediction of an outcome instead of their asymmetric combination [6]. Considering a musics recommender system for instance, clustering the same songs using two independent membership matrices (which is the formulation introduced in [6]) would make few sense and might eventually lead to incorrect results. Indeed, since the convergence towards a global optimum is not guaranteed, such algorithm might assign different memberships to a song depending on whether it is on the left or right side of a triplet. Put differently, we would have different output probabilities for , which is something we avoid here. In our formulation, only the outputs can be of different nature than the inputs.

3.2 Inference of the parameters

We are now looking at inferring the model parameters, namely the components of the interactions tensor . As stated in the previous section, we decomposed this tensor into an algebraic combination of the matrix and the tensor . In this section, we introduce a method to infer them via an Expectation-Maximization (EM) algorithm. EM is a 2-step iterative algorithm. The first step consists of computing the expectation of the likelihood with respect to the latent variables, denoted , with parameters as constant. The second step consists of maximizing this expectation of the likelihood with respect to the parameters , p. Iterating this process guarantees that the likelihood converges towards a local maximum.

3.2.1 Expectation step. Taking the logarithm of the likelihood as defined in Eq.2, denoted

We used Jensenβs inequality to go from the 2nd to 3rd line. The inequality in Eq.3 becomes an equality for:

where is interpreted as the probability that the observation is due to i belonging to the group , that is the expectation of the likelihood of the observation with respect to the latent variables . Therefore, Eq.4 is the formula for the expectation step of the EM algorithm. An alternative derivation of the expectation step formula is discussed in [3].

3.2.2 Maximization step. This step consists in maximizing the likelihood using the parameters of the model and p, independently of the latent variables. In order to take into account the normalization constraints, we introduce the Lagrange multipliers and .

Following this, the constrained log-likelihood reads:

where stands for the set of observations in which the entry m appears . Note that summing over this set in line 2 of Eq.6 implies that the relation between two inputs entities is symmetric βif . Summing the last line of Eq.6 over then multiplying it by , we get an expression for

Where is the total number of times that m appears as an input in . Finally, plugging back this result in Eq.6, we get:

Following the same line of reasoning for p, we get:

The set of Equations 8 and 9 constitutes the maximization step of the EM algorithm. They hold only if the input entities interactions are symmetric (e.g. when ), which is what we aimed to do. It is worth noting that the proposed algorithm offers linear complexity with the size of the dataset provided the number of clusters is constant, while guaranteeing convergence to a local maximum.

4 EXPERIMENTS

4.1 Datasets and evaluation protocol

We test the performance of our model on 4 different datasets. The first dataset (PubMed dataset) is built with 15,809,271 medical reports collected from the PubMed database as a good approximation for human-disease network [31]. This dataset is not explicitly about recommender systems, but provides an intuitive way to understand how our recommendation approach works by suggesting likely diseases given a collection of symptoms. The second (Twitter dataset) with 139,098 retweets gathered in october 2010 associated with the 3 last tweets in the feed preceding each retweet [10]. The task is to infer tweets a user is the most likely to retweet. A possible application would be a personnalized recommendation of such tweets in the βTrends for youβ Twitter section. The third (Reddit dataset) with the entirety of posts in the subreddit r/news in May 2019 (225,485 message-answer relationships in total). We aim at predicting the content of the answer given the incoming message. A possible application would be similar as what Gmail does when suggesting automated answers to an email given keywords present in the message. Finally, the last (Spotify dataset) is built with 2,000 music playlists associated with keywords βenglishβ and βrockβ of random Spotify users. We predict the next song a user will add to a playlist given this userβs history. A RS application would suggest a ranked list of songs to the user is likely to add to a playlist. Each dataset is formed by associating every pair of inputs in a message (i.e. a list of symptoms, a userβs feed, a Reddit post, and a playlistβs last artists) with an answer (i.e. a disease, a retweet, a Reddit answer, an artist added to a playlist). The building process of datasets is further described in SI, Section 1, together with direct links to access them for possible replication studies 1.

From the raw datasets, we form the test set by randomly sampling 10% of the coupled structures (messageanswer). The 90% coupled structures left are used as a training set. The number of clusters is determined using the elbow method. For each corpus: 30 for PubMed, 15 for Twitter, 30 for Reddit, and 15 for Spotify. We perform 100 independent runs, each with independent random initialization of the parameters and . The EM loop stops once the relative variation of the likelihood between two iterations is less than 0.001%.

4.2 Baselines

4.2.1 Naive baseline. The naive baseline is simply the frequency of the outputs in the test set. This naive classifier predicts the value of every output independently of the inputs.

4.2.2 MMSBM. We use the classical MMSBM as a second baseline. In this formulation interactions are not taken into account and is identical to state-of-the-art work done in recommender systems [4, 6, 16]. Instead of considering triplets (input, input, output), we instead here apply the classical MMSBM on pairs (input, output). This baseline is exactly the same as the model introduced in [4]. We then train our baseline whose log-likelihood is defined as on the same datasets as in the main experiments. We infer the parameters via an ExpectationMaximization algorithm similar to the one described in Section 3. The number of clusters being the only parameter to be tuned. It is chosen in the same way as it is for the IMMSBM βby minimizing the AIC criterion [1]. We see further in the experiments that this criterion matches with the elbow of the various metrics. We make 100 independent runs with random initialization and discuss here the results of the highest-likelihood run. This baseline provides an alternative way to quantify the importance of interactions by comparison with the case where these are taken into account. We expect the baseline model to find results that are equivalent to the diagonal probabilities of the main model (i.e. similar to ). This is because the diagonal supposed to account for the apparition of x given only the presence of i, which is what this baseline computes. Furthermore, this model will provide insight to the generalization of the assumption made in

Figure 2: Performance variations on all the metrics for every dataset considered. Dashed blue line: upper limit to performances ; blue line: IMMSBM ; yellow dashed line: MMSBM ; red line: naive baseline. Top left: PubMed ; top right: Spotify ; bottom left: Twitter ; bottom right: Reddit. The vertical green line shows the selected number of clusters; it is chosen using the AIC criterion, which matches with the elbow of the various metrics considered.

Myers and Leskovec [18], stating that the probability of an output is greatly dominated by a prior quantity unrelated to any interaction (known as virality, see Section 2).

4.2.3 Perfect modeling - Upper limit to prediction. We also compare our results to a mathematical upper limit to our predictions. While in some systems the predictions could theoretically be perfect, in most situations the dataset simply does not allow it. Consider as an example a case where the test set contains twice the triplet (βfeverβ, βpallorβ, βinfluenzaβ) and once the triplet (βfeverβ, βpallorβ, βanemiaβ): the model cannot make a prediction better than 66%. In SI, Section 2, we develop a general method to derive such an upper limit to predictions for a given dataset. This upper limit is mathematically the best performance any model can do given a dataset structure.

4.3 Results

The metrics we use to assess the performance of our model are the max-F1 score, the area under the receiver operating characteristic curve (ROC) curve and the precision@10. For all of these quantities, the closer to 1 they are, the better the performance.

For evaluation, we adopt a guessing process as follows. For every pair of inputs, we compute the probability vector for the presence of every possible output. Then we predict all of the probabilities larger than a given threshold to be βpresentβ, and all the others to be βabsentβ. Comparing those predictions with the observations in the test set, we get the confusion matrix for the given threshold. We then lower the threshold and repeat the process to compute the various metrics.

We recall that we do not compare our approach to [18] because its formulation does not allow to make prediction on exogenous output, that is when the output is not part of the input pair. The more general model presented here is needed in order to consider the effect of interactions on the probability of exogenous outputs.

In Table 1, we show the performances of our model compared with the baselines introduced in Section 4.2. We see that IMMSBM outperforms all of the baselines in most cases. A remarkable feature is the good P@10 of our model, primordial in many applications such as diseases diagnosis or in recommender systems. As expected,

Table 1: Experimental results for the four metrics considered, from each model applied to each corpus. We see that our model outperforms the proposed baselines in every dataset for almost every evaluation metric β the error bars overlap for the AUC (ROC) on the PubMed corpus. The given error corresponds to the standard deviation over the 10 runs. The naive baseline and upper limit results are constant over the runs and therefore have no variance.

taking interactions between entities into account leads to an improved accuracy on the prediction of the missing data. This correlates to the conclusions drawn in Myers and Leskovec, stating the importance of interactions in real-world phenomenon modeling [18]. Overall, our model performs better in many different applications for most of the corpora considered. It systematically yields a better P@10 and F1 score, making it of use in recommender systems applications. The AUC (ROC) score can be interpreted as the quality of classification of low-probability events. Overall, little improvement is observed on this aspect compared to the non-interacting baseline.

We notice however that accounting for interactions did not lead to a significant improvement in performance over the two baselines on the Reddit corpus. Given the obvious fact that language is partially formed by interacting named entities, this result may look surprising. We attribute the lack of improvement when considering interactions to the dataset provided. Indeed the dataset contains few observations for every possible pair, due to the wide range of available vocabulary of natural language [15]. It is likely that we did not train the model with enough data for it to learn actual regularities in pair interactions. This can also be seen during the building of the test set: approximately one half of its pairs have never been observed in the training set. Therefore, we suppose that the model simply lacks enough data to identify generalities. In future work, it might be interesting to answer this problem by considering a corpus of pre-clustered entities instead of independent named entities, hence reducing the vocabulary range and adding to the regularity of the dataset.

5 DISCUSSION

5.1 Relative importance of interactions

5.1.1 Global importance of interactions. We recall the argument exposed in the introduction that interactions have to be inferred along with the virality of a piece of information, which cannot be calculated heuristically. Imagine that interactions lead to a shift of on the base probability (virality) of a retweet, and that this interaction happens in a fraction of all observations of a given tweet being retweeted. The virality as defined in [18] then equals , which is by construct larger than the actual probability of a retweet in the absence of interaction. Therefore, defining interaction according to this quantity is wrong (for instance it results in adding an interaction term when there is in fact no interaction). Virality needs to be inferred by the model at the same time as the contribution of interactions to be properly defined.

Our model infers virality along with interaction terms and yield better results than state-of-the-art methods (see Table 1), which provides solid ground for analyzing the effect of information interaction. Close analysis of interactions between pieces of informations has been little considered in literature βwhat lexical fields, groups of symptoms, musical genres, kind of tweets interact with each other. We can evaluate the importance of interactions between entities on the basis of inferred virality. We consider two quantities within each corpus: the overall relative impact of the interactions on the probability of an outcome and the contribution of each pair of groups in the modification of outcome probabilities.

To evaluate the global impact of taking the interactions into account, we compute the relative change of probability according to the virality for each triplet and average this quantity over all the

Figure 3: Importance of interactions - Contribution of each pair of clusters (heatmap) and average impact of the interactions Β―(on the right) in outcome probabilities for each corpus. Clusters typically interact with a limited number of others; these interactions still play a significant role in outcomes probabilities. The cluster have been annotated manually.

triplets in the corpus. We note this quantity Β―

where denotes the probability of out- come given the entities ; as shown in the previous section, the diagonal elements account for the virality of results are shown Fig.3. It results that for every corpus, interaction between entities exerts a non-negligible influence on the probability of an output. Those results confirm previous work done on interactions modeling, stating the importance of taking interactions into account when analyzing real-world datasets [18]. Interactions increase the virality of an output by a factor by 2.58 in the PubMed corpus, 2.78 in the Twitter corpus, 1.73 in the Spotify corpus and 1.57 in the Reddit corpus. Interactions have a greater effect on output probabilities for PubMed and Twitter corpora, and a lesser role for the Spotify and Reddit ones. Besides, our model applied to a dataset where interactions do not play any role ( Β―0) reduces to the non-interacting MMSBM baseline. This metric therefore allows to assert the importance of the interactions in a given corpus.

5.1.2 Which clusters interact. To evaluate the role of each pair of groups plays in the modification of output probability, we consider the following quantity:

where and are defined as before. This quantity is the weighted average of the change in output probability with respect to virality due the interaction between every pair of clusters to which belong entities in a proportion (). The results are shown in the heatmaps Fig3. Note that the clusters have been manually given a name on the basis of their composition; labeling is subjective.

We see that most of the clusters do not interact with each other; the interactions essentially take place between a limited number of clusters. Typically, a cluster interacts significantly with only one or two other clusters in every corpus (βVertigoβ and βSpeech disorderβ in PubMed, βStudentsβ and βSchoolsβ in Reddit, etc.). We also notice that in each corpus, the model forms some non-interacting cluster with low values of (βNeuropsychatryβ in PubMed, βRecent popβ in Spotify, etc.); for those, the probability of an output is essentially equal to the virality of this output. We also notice that the diagonal of thematrices have low values; this expresses that the interaction of a group with itself leads to an output probability close to its virality. To picture how this makes sense, we can imagine diagnosing a disease on the basis of two ear-related symptoms (βearacheβ and βhearing disordersβ): the diagnosis is likely to be related to the ear as we would have guessed with only one symptom (its probability equals virality). Now imagine two symptoms of different kind (βearacheβ and βspeech disorderβ): the diagnosis is then likely to be related to the brain and less to the ear, so the interaction lowers the base probability (virality) of the βear diseaseβ output and increases the one of the brain disease.

Being able to see in details the extent to which interactions exert an influence in a corpus and between which categories they take place open new perspectives in research. Understanding and explaining underlying mechanisms ruling corpora is a desirable property of AI models in social sciences for instance [7, 9, 26]. As we demonstrated, the present work follows this line by developing an explainable model for investigating interactions.

5.2 Entropy of membership

Another interesting quantity to look at when considering the results is the membership entropy of the entities. This quantity measures how entities are spread over all the clusters; when this value is low, it means that the model finds strong regularities in the corpus, and the clusters are likely to be readily interpretable. Therefore, we use the normalized Shannon entropy of memberships of user i, , whose formula is:

Here the lowest entropy reachable is 0, which corresponds to an entity belonging to only one group ; the largest is 1 corresponding to belonging to every cluster evenly (with probability 1

Overall, the entropy of memberships is low. The average entropy values per corpus are: 0.320 for PubMed (equivalent to belonging on average to 2-3 clusters), 0.324 for Twitter (2 clusters), 0.561 for Reddit (6-7 clusters) and 0.364 for Spotify (2-3 clusters). The low number of entities spread among clusters means that the clustering done by our model is easy to interpret βwhich eased the manual annotation of the clusters presented in the previous section.

6 CONCLUSION

In most previous approaches to information spreading, the effect of interactions between diffusing entities has been neglected. Here, we proposed a modified MMSBM that allows for investigating the role of interactions and quantifying them. By design, it also allows assessment of the importance of interactions compared with the virality of single pieces of information (i.e. their intrinsic ability to spread on their own). On this basis, we show that the effect of interactions on information spread is not trivial and that taking them into account increases predictive performance in several real-world applications. Following this observation, we proposed an implementation via a scalable EM algorithm allowing for its application on large datasets.

However, a major limitation to our model is that it only accounts for pair interactions. While our conclusions state their importance in some real-world systems, it might not be sufficient for a number of others. For instance, a disease is seldom diagnosed on the basis of a single pair of symptoms; a model accounting for the interaction between n entities might be more relevant in this case. The same line of reasoning can be applied to gene and protein interactions [31], bacteria symbiosis or even species co-evolution networks. We are currently working on a generalization of our model that will hopefully provide a better description of interacting processes at work everywhere in nature.

REFERENCES

[1] Hirotogu Akaike. 1973. Information Theory and an Extension of the Maximum Likelihood Principle. 2nd International Symposium on Information Theory (1973), 267β281.

[2] Alex Beutel, Aditya B. Prakash, Roni Rosenfeld, and Christos Faloutsos. 2012. Interacting viruses in networks: Can both survive? Proceedings of the ACM SIGKDD (2012).

[3] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Vol. 16. 450β455. https://doi.org/10.1117/1.2819119

[4] Airoldi E.M., Blei D.M., Fienberg S.E., and Xing E.P. 2008. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research 9 (2008), 1991β 1992.

[5] Thorben Funke and Till Becker. 2019. Stochastic block models: A comparison of variants and inference methods. PloS one (2019).

[6] Antonia Godoy-Lorite, Roger GuimerΓ , Cristopher Moore, and Marta Sales-Pardo. 2016. Accurate and scalable social recommendation using mixed-membership stochastic block models. PNAS 113, 50 (2016), 14207β14212.

[7] Roger Guimera, Alejandro Llorente, and Marta Sales-Pardo. 2012. Predicting Human Preferences Using the Block Structure of Complex Social Networks. PloS One 7, 9 (2012).

[8] Roger GuimerΓ and Marta Sales-Pardo. 2013. A Network Inference Method for Large-Scale Unsupervised Identification of Novel Drug-Drug Interactions. PLoS Comput Biol (2013).

[9] Mario GutiΓ©rres-Roig, Carlota Segura, Jordi Dutch, and Josep Perello. 2016. Market Imitation and Win-Stay Lose-Shift Strategies Emerge as Unintended Patterns in Market Direction Guesses. PLOS One (2016).

[10] Nathan O. Hodas and Kristina Lerman. 2014. The Simple Rules of Social Contagion. Scientific Reports 4, 4343 (2014).

[11] Zhao Huan, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. 2017. Metagraph based recommendation fusion over heterogeneous information networks. SIGKDD (2017).

[12] Mohsen Jamali, Tianle Huang, and Martin Ester. 2011. A Generalized Stochastic Block Model for Recommendation in Social Rating Networks (Rec- . Association for Computing Machinery, New York, NY, USA, 53β60. https://doi.org/10.1145/2043932.2043946

[13] David Kempe, Jon Kleinberg, and Eva Tardos. 2015. Maximizing the Spread of Influence through a Social Network. Theory of computing 11 (2015).

[14] Y. Koren, R. Bell, and C. Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30β37.

[15] Vittorio Loreto, Pietro Gravino, Vito D. P. Servedio, and Francesca Tria. 2016. On the Emergence of Syntactic Structures: Quantifying and Modeling Duality of Patterning. Topics in Cognitive Science 8, 2 (2016), 469β480.

[16] Airoldi E. M., Blei D. M., Fienberg S. E., E. P. Xing, and Jaakkola T. 2006. Mixed membership stochastic block models for relational data with application to protein-protein interactions. Proceedings of the international biometrics society annual meeting 15 (2006).

[17] LW Mackey, DJ Weiss, and MI Jordan. 2010. Mixed Membership Matrix Factorization. International Conference on Machine Learning (ICML) (2010).

[18] Seth Myers and Jure Leskovec. 2012. Clash of the Contagions: Cooperation and Competition in Information Diffusion. IEEE International Conference on Data Mining (ICDM) (2012), 539β548.

[19] Aastha Nigam, Kijung Shin, Ashwin Bahulkar, Bryan Hooi, David Hachen, Boleslaw Szymanski, Christos Faloutsos, and Nitesh Chawla. 2019. ONE-M: Modeling the Co-evolution of Opinions and Network Connections. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) (2019), 122β140.

[20] Romualdo Pastor-Satorras, Claudio Castellano, Piet Van Mieghem, and Alessandro Vespignani. 2015. Epidemic processes in complex networks. Reviews of Modern Physics 87, 3 (2015), 925β979.

[21] GaΓ«l Poux-MΓ©dard, Romualdo Pastor-Satorras, and Claudio Castellano. 2020. Influential spreaders for recurrent epidemics on networks. Physical Review Research 2, 2 (2020), 023332.

[22] Gael Poux-Medard, Julien Velcin, and Sabine Loudcher. 2021. Information Interaction Profile of Choice Adoption.

[23] Ahmed Rashed, Shayan Jawed, Lars Schmidt-Thieme, and Andre Hintsches. 2020. MultiRec: A Multi-Relational Approach for Unique Item Recommendation in Auction Systems. In Fourteenth ACM Conference on Recommender Systems (Virtual Event, Brazil) . Association for Computing Machinery, New York, NY, USA, 230β239. https://doi.org/10.1145/3383313.3412242

[24] Manuel Gomez Rodriguez, Jure Leskovec, and Bernhard Schoelkopf. 2013. Structure and Dynamics of Information Pathways in Online Media. ACM International Conference on Web Search and Data Mining (WSDM) (2013).

[25] Daniel Romero, Brendan Meeder, and Jon Kleinberg. 2011. Differences in the Mechanics of Information Diffusion Across Topics: Idioms, Political Hashtags, and Complex Contagion on Twitter. WWW 695-704 (2011), 695β704.

[26] Cobo-LΓ³pez S., Godoy-Lorite A., and Duch J. 2018. Optimal prediction of decisions and model selection in social dilemmas using block models. EPJ Data Sci 7(48) (2018).

[27] Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and Philip S. Yu. 2016. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering (2016), 17β37.

[28] Yizhou Sun. 2011. PathSim: meta path-based top-K similarity search in heterogeneous information networks. VLDB (2011).

[29] Yizhou Sun and Jiawei Han. 2012. Mining Heterogeneous Information Networks: A Structural Analysis Approach. SIGKDD Explorations 14 (2012).

[30] Yuchung J. Wang and George Y. Wong. 1987. Stochastic Blockmodels for Directed Graphs. J. Amer. Statist. Assoc. 82, 397 (1987), 8β19.

[31] Xuezhong Zhou, JΓΆrg Menche, Albert-Laszlo Barabasi, and Amitabh Sharma. 2014. Human symptomsβdisease network. Nature communications 5 (2014), 4212.

Supplementary Information for: Information Interactions in Outcome Prediction: Quantification and Interpretation using Stochastic Block Models

UniversitΓ© de Lyon, Lyon 2, UR 3083, France UniversitΓ© de Lyon, Lyon 2, UR 3083, France UniversitΓ© de Lyon, Lyon 2, UR 3083, France

For each result, we got a list of every publication in which the keyword is among the main topics. Then, we build the

triplets (symptom1, symptom2, disease) by looking at all the pairs of symptoms in an article and linking each one of

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

of 4 songs to build the dataset. The artist of the song immediately after the running window is the output we aim at

predicting, x, and the artists of the 4 songs within the running window are the interacting inputs. Once again, we

Secondly, we considered only the users who have not tweeted a given URL more than 5 times, this behavior being an

activity of human users. Then, we follow a dataset building process similar to [? ]. For each user, we slice her feed +

tweets temporal sequence in intervals separated by the tweets of the user. Every time a user tweets something, the

interval ends. An interval therefore consists of the tweet of the user and all the tweets she has been exposed to right

comments system work as a directed tree network, where each answer to a given comment initiates a new branch.

We derive an analytical expression for the upper-limit to our model for a given dataset. Explicitly, we analytically

We enforce the constraint that the sum over the output space of probabilities given any observations made has to

Where correspond to any given couple of inputs (i,j) in the model presented in the main paper. We use the following notation: , with the dataset entries. Therefore, we can define 1

This equation gives the probability maximizing the likelihood for any entry of the dataset. In the main model, it

translates to the number of times output x has been witnessed after a pair of inputs (i,j). Note that this is simply the frequency of an output given a pair of input entities.

Designed for Accessibility and to further Open Science