b

DiscoverSearch
About
My stuff
NEW: A Generic Learning Model for Tie Strength Prediction in Networks
2020·arXiv
Abstract
Abstract

Tie strength prediction, sometimes named weight prediction, is vital in exploring the diversity of connectivity pattern emerged in networks. Due to the fundamental significance, it has drawn much attention in the field of network analysis and mining. Some related works appeared in recent years have significantly advanced our understanding of how to predict the strong and weak ties in the social networks. However, most of the proposed approaches are scenario-aware methods heavily depending on some special contexts and even exclusively used in social networks. As a result, they are less applicable to various kinds of networks.

In contrast to the prior studies, here we propose a new computational framework called Neighborhood Estimating Weight (NEW) which is purely driven by the basic structure information of the network and has the flexi-bility for adapting to diverse types of networks. In NEW, we design a novel index, i.e., connection inclination, to generate the representative features of the network, which is capable of capturing the actual distribution of the tie strength. In order to obtain the optimized prediction results, we also propose a parameterized regression model which approximately has a linear time complexity and thus is readily extended to the implementation in large-scale networks. The experimental results on six real-world networks demonstrate that our proposed predictive model outperforms the state of the art methods, which is powerful for predicting the missing tie strengths when only a part of the network’s tie strength information is available.

Keywords: Tie strength prediction, Generic learning model, Parameter tuning, Weight distribution

Traditionally, tie strength is a purely sociological concept which is used to describe how close the relationships are between people in a social network [1]. In this study, we expand its connotation to address the connection strength in any kinds of networks. If the tie strength could be quantified by a value, a larger value would represent a tighter connection called a strong tie and a smaller one corresponds to a looser relationship called a weak tie. It is worth mentioning that the true meaning of the tie strength would be different on a case-by-case basis and highly depend on the specified scenario. For example, the friendships in social networks have some apparent difference between close friends and mere acquaintances, implying the degree of intimacy between two friends can be regarded as the ”tie strength” of friendship. In an email communication network, the frequency of email exchanges would be able to account for the ”tie strength” between the sender and the recipient, i.e., the more frequent email contacts represent the stronger tie strength.

In the field of network mining and analysis, tie strengths actually play an important role in reflecting the pattern of connections between the memberships in a network when we explore the mechanism of link formation, the evolution of the community and the growth of the network [2]. Yet, in some cases, part of the tie strengths would be invisible due to the information is missing or undetectable. So, it is of theoretical interest to restore or infer the unobservable tie strengths based on the known information in a network. How to accurately predict the tie strength in various networks is the main challenge in this study. Because the tie strength is commonly marked as weight over an edge when we formulate a network into a weighted graph, the tie strength prediction can be also regarded as a problem of weight prediction.

As aforementioned, the tie strengths would represent various meanings from distinct contexts. As a result, most proposed tie strength prediction models highly rely on the use of semantic features such as the profile information of the on-line users which are strongly correlated with the scenario. This means that the model would be effective in a context but fail to be applicable in a different context. Due to lack of generality for most existing tie strength prediction models, we are motivated to study a generic model which can be applied to fulfill the task of tie strength prediction in many kinds of networks. As most networks can be represented by the weighted graphs, we try to merely use the topological information of the graph to generate the features and design the predictive model while discarding the traditional idea of taking in account any possibly available semantic information existing in the networks. By doing so, the generality of the proposed model will be guaranteed.

The innovations and contributions of our study to tie strength prediction are three-fold. (i) We proposed a generic computational framework by simply using Neighborhood information to Estimate the Weights (NEW) of links in the network. Extensive experiments conducted on real world networks verify that the model has a better performance in prediction accuracy against the state of the art methods. (ii)The learning model approximately has a linear time complexity for its computations mainly rely on the magnitude of the links in the networks, which means that the model can be readily scaled up to handle networks with very large size in a nearly linear growth of time consumption . (iii) To figure out the role of the free parameter introduced in the learning model, we performed some comprehensive analyses and found that the parameter tuning actually can statistically allow the predicated weights having a closer center as well as a more similar distribution to the actual weights.

The rest of the paper has a layout organized as follows. In section 2, we will review the related studies in this field. Then, we will clearly define the problem of tie strength prediction and introduce the baseline methods in section 3. In the next section, we will investigate how to exploit the network features by using neighborhood information of the nodes. Further, we proposed a learning model to predict the unobserved tie strength. In section 5, we present the experimental results including the accuracy comparisons with some baseline methods. In section 6, some analyses on the proposed model are conducted. In the last section, we conclude our work and give some outlooks of the future works.

In this paper, our work is relevant to the tie strength determination and the study of the predictive model. Therefore, we mainly review some related literature from the two aspects.

2.1. The background knowledge of the tie strength

In Granovetter’s pioneering work [3], he had initially addressed the conception of tie strength in social relationships among people and highlighted the role of weak tie in the spread of information. In particular, Granovetter characterized the tie strength from four dimensions including amount of time, intimacy, intensity and reciprocal services. Thanks to the rich on-line features provided by the Facebook, Gilbert et al. [4] widened the scope of featuring the tie strength to seven dimensions with 74 variables. From the previous studies, to the best of our knowledge, tie strength would be a sociological notion mainly used in the analysis of human relationships. In this work, we borrow this concept as a unified statement to address any connection strength between entities in diverse networks like strength of interaction between proteins in a biological network.

On the other hand, how to obtain the ground-truth data of the tie strength in a given network is still an open problem. According to the literature [5, 6, 7], there are roughly three ways to fulfill this task. The first one is to collect feedbacks from the users who agree to participate in a survey regarding the degree of relationships with their friends which is initiated by the researchers. This is the most common method to get the social tie strength between people. The second one is to make use of a trusted network to determine strong ties. For example, the phone book network is considered as a trusted network and the tie strength between people who have phone contact with each other is recognized as the strong tie. Thirdly, in some other works, researchers define the tie strength in terms of its actual meaning in the network. One of the typical examples is the rating scores between users in the Bitcoin network, which represent the degree of trusting each other between the buyer and the seller in a transaction of the virtual coins. As the networks studied in this work are not limited to the social networks, we adopt the third way to define the tie strength in the networks for its adaptability to diverse networks.

2.2. Overview of the studies of tie strength prediction

As a branch of the link prediction, the study of tie strength prediction has received much attention from researchers in recent years and some interesting models regarding this topic are proposed [8, 9]. Here, we need to point out that most of them prefer to treat it as a binary prediction problem, namely classification of the strong and weak ties. For example, Xiang et al. proposed a model to infer relationship strength based on profile similarity and interaction activity, with the goal of automatically distinguishing strong relationships from weak ones [10]. Kahanda & Neville [11] applied a supervised learning approach to the problem by constructing a predictor that determines whether links in an online campus network are the strong or weak ties. They report that the network’s transactional features (such as communications or file transfers) are the most influential features for this task. Rotabi et al. tried to use motifs (subgraphs frequently appeared in a network) on multiple networks to detect strong ties [6]. The closed triadic structure previously was deemed quite useful for link prediction in the social networks[12, 13]. But, in this work, they found some motifs being larger than the triadic structure are even better in the context of strong tie prediction. Jones et al. used online interaction data (interactions on Facebook like comments, messages, wall post, etc.) to successfully identify real-world strong ties between some surveyed Facebook users via using classic classifiers including logistic regression, support vector machine and random forest [14].

Note that, almost all the mentioned works prefer to utilize the semantic features rather than the structural features or combine the two sides together in their models. Although the predictive model proposed by the Rotabi et al. is based upon the network structure, it focused on the tie strength prediction in multiple networks and additionally used a reference network with weak ties as the priori knowledge. For a given weighted network, we argue that the graph structure itself has already provided us the sufficient and useful information which can be utilized for modeling the tie strength prediction. Moreover, an advantage of merely using the structural information of the ego network can ensure the predictive model generic since it is free of the dependence on any particular contexts.

In addition, unlike the problem setting in previous works, our study does not simply treat the tie strength prediction as a binary classification problem. Instead, we formulate it as an issue of multi-value prediction which would be even more challenging. Similar to the rating prediction problems defined in the study of recommender system [15, 16], we define our prediction task as an explicit tie strength prediction while the task of binary classification for strong and weak ties belongs to the implicit tie strength prediction.

At the same time, there is an another active branch which is working on the tie strength prediction in weighted signed networks (WSNs) [13, 17]. The weighted signed networks refer to networks having edges labeled with positive and negative weights. Kumar et al. proposed two novel measures to quantify the user’s behaviors including so-called goodness and fairness and used them to predict the weights in the WSNs [18]. Wang et al. proposed a novel and flexible end-to-end Signed Heterogeneous Information Network Embedding (SHINE) framework to extract users latent representations from heterogeneous networks and predict the sign of unobserved sentiment links [19]. Despite that signed edges are not necessary for our problem setting, their research ideas are instructive and inspiring to our study.

To the best of our knowledge, we finally only find three works which have the similar motivation as our paper to study the explicit prediction of the tie strength by merely using the network’s topological characteristics. In Lv et al.’s study [20], they uncovered that weak ties play a significant role in the problem of link prediction in terms of the motif analysis made in some empirical networks and defined three weighted structural similarity indices which can be used for tie strength prediction. The second is the study conducted by Zhao et al. [21] who also proposed three weighted local measures based on the reliable-route to perform the tie strength prediction. The last one is found that Fu et al. [22] proposed the line graph indices by converting the edges to the nodes and utilizing the node centrality measures in line graph to define the importance of links in original graph. The approaches proposed by the three highly related works are adopted for comparisons with the new model to verify its superiority. The comparison results will be presented in detail in section 5.

In this section, to clearly address the problem of tie strength prediction studied in this paper, an example is introduced to explain the steps of the whole process. Meanwhile, aiming to this issue, some typical topology-based methods for tie strength prediction without considering any concrete scenarios are briefly introduced here as well.

3.1. Problem description

A weighted network can be represented by a 3-tuple G = (V, E, W) where V denotes the set of the nodes, E denotes the set of the edges or links and W denotes the set of the weights ( each  w ∈ R+) over edges which reflect the tie strengths of the links. Here, we give a toy example to illustrate the problem of tie strength prediction. The Fig. 1(a) shows a complete weighted network G. If we randomly remove a fraction of weights  W ∗ (e.g., 10% of weights in the network) but keep the edges  E∗ marked by red color in the network as shown in Fig. 1(b), the task of the tie strength prediction for the network  G∗ = (V, E, W − W ∗) is to restore the removed weights  W ∗over the edges  E∗ as accurately as possible. In practice, for a network with missing weights, one can use the current network as an observable network or a training network and develop a predictive model to predict the weights absent from the links.

image

Figure 1: (a) A weighted toy network. (b) The network with weights removed from three edges marked by red color.

3.2. Similarity based methods

As mentioned in the former section, only a few of the tie strength prediction approaches are independent to the scenario and mainly based on the calculations of the network’s local structure similarity. In a weighted graph, the two ends of an edge is called a node pair. For a given connected node pair (x, y) with missing tie strength, the calculated similarity between the two nodes can be directly used as an estimation of the tie strength. Here, some commonly used similarity measures for tie strength prediction are introduced and will be treated as baseline methods for comparison in the section of experimental results.

(1) Weighted Common Neighbor (WCN) measure:

image

(3) Weighted Resource Allocation (WRA) measure:

image

The s(z) shown in Eqs. (3) and (4) is the sum over weights of links attaching to node z which has a form as

image

(4) The reliable-route Weighted Common Neighbor (rWCN) measure:

image

where N(x) denotes the neighbors of node x. (5) The reliable-route Weighted Adamic Adar (rWAA) measure:

image

(6) The reliable-route Weighted Resource Allocation (rWRA) measure:

image

It is worth mentioning that, as a step of data pretreatment, one needs to map the weights into the range of (0,1) by using the following formula before calculating the above six similarity indices.

image

4.1. Implicit structure feature mining

image

Figure 2: Illustration of the common neighbors or 2-order paths between the nodes x and y.

Inspired by the theory of homophily from sociology [23] (i.e., people are more likely to form strong ties when they share more attributes such as tweets or followers in an on-line social network), we focus on the node pair’s common neighbors to exploit the useful local structure features which would potentially affect the tie strength. In light of this idea, as shown in Fig. 2, we consider that the existing common neighbors actually have shown us how many connections with 2-order paths exist between the nodes x and y. It is reasonable for us to pay attention to these paths because they are supposed to provide us some indirect evidences or clues about the tie strength between the two nodes. Here, we introduce a local structural feature based upon the common neighbors to reflect the tendencies of two nodes connecting to each other. The feature has a computable expression for each end of the node pair as

image

We call the feature as connection inclination which quantifies, for a node pair (x, y), how strong one node tends to connect to the other. We infer that there are three cases for the node pair including both x and y having strong connection inclinations, one node having strong connection inclination yet the other node having weak connection inclination, and both x an y having weak connection inclinations. So, from a perspective of network structure, the tie strength would be simultaneously impacted by the connection inclinations from the two ends of the node pair. We will analyze the significance of the connection inclination feature in Section of model analysis.

4.2. Parameterized regression model

In this section, we consider to propose a function to fit the tie strength by using the exploited structural features from an observable weighted network. First of all, we formulate the task of the tie strength prediction into a regression problem which has a mathematical form as

image

where  W = (w1, w2, ..., wM) is a vector containing the weights of M observed ties in the network, f(X, Θ) is a fitting function to ensure the best fit to the weight vector. In the fitting function, a  M×2 attribute matrix X corresponds to the node pairs of the ties in the network, in which the entries of the ith row of the matrix, i.e.,  xi1 and xi2, contain the values of the ith node pair’s structural attributes, namely the calculated connection inclinations for the two ends of the node pair. Θ is a row vector with unknown parameters, and ∥ · ∥2denotes the  L2norm. The objective function of the regression problem is to minimize the loss function as

image

To ensure the constant parameter can be included in the model training, we further add a column of ones to define the generalized form of the feature matrix X as

image

To fit the weights of ties in the training network, we define a fitting function as

image

where the operator  ◦krefers to the k Hadamard power [24] and ΘT meansthe transpose of the vector Θ. Substituting Eq. 13 into Eq. 11, we can get an expression as

image

where Θ = (θ1, θ2, θ3) is a vector of three unknown parameters to be determined and k is a free parameter which would have varied values along with different networks. To obtain the optimal parameters, we need to solve the above optimization problem. Meanwhile, to avoid the issue of over-fitting possibly existing in the learning model, we add a regulation term by slightly changing the Eq.(14) as

image

where  λis a constant commonly set as 1. To solve Eq.(15), we use the technique of stochastic gradient descent (SGD) [25] to calculate the parameters of Θ. The partial derivatives of the Eq.(15) on the Θ are

image

Thereby, we are able to update the parameters by calculating the Eq.(19) iteratively until they reach the state of convergence.

image

where  αis a learning rate which commonly has a value in the scope of (0, 1). Some detailed implementations of the model are formally described in Algorithms 1 and 2.

5.1. Data description

To evaluate the performance of the proposed model, six real-world networks are used for the tests. Some background information of the networks as well as the specified definitions of the tie strength are briefly introduced as follows.

image

image

Caenorhabditis elegans [26]: This is a metabolic network of the roundworm Caenorhabditis elegans. Nodes are metabolites (e.g., proteins), and edges are interactions between them which are undirected. There may be multiple interactions between any two metabolites, representing the tie strength between them.

UC social network [27]: This directed network contains sent messages between the users of an online community of students from the University of California, Irvine. A node represents a user. A directed edge represents a sent message. The weights denote multiple messages. Here, we simply treat it as an undirected and weighted graph .

Political blogs [28]: A network with directed hyperlinks between political web blogs on 2004 US Election is recorded in 2005 by Adamic and Glance. We treat it as an undirected graph and the weight for an edge is defined by the multiple hyperlinks between two blogs.

Coauthorships in network science [29]: A co-authorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 2006. Here, we merely use the largest component of this network and the weights represent the strength of the collaboration between scientists marked by M. Newman which are calculated via the information of co-authored papers and the number of the authors.

Neural network [30]: A directed, weighted graph represents the neural network of C. Elegans, in which an edge joins two neurons if they are

connected by either a synapse or a gap junction. The network is treated as the undirected graph and the weights on edges denote the multiple interactions between the neurons.

German Wikipedia [31]: This is a dataset of discussion threads on the German Wikipedia. Nodes of the network are users of the German Wikipedia. A directed link from user A to user B denotes that user A wrote a comment in a discussion as a reply to a comment of user B. Multiple comments between the users A and B form the tie strength which can be marked as the weight over the edge between A and B.

Some basic statistics of the six networks are summarized in Table 1.

Table 1: Statistics of the networks’ basic information. <w> denotes the average of the weights in the network. Max(w) and Min(w) represents the maximum and minimum of the weights in the network, respectively.

image

For setting up the experiment, we adopt the scheme of leave-10%-out prediction which is widely used for model assessment in the domain of link prediction [32, 33]. The weights from some randomly selected edges in the original network with a fixed proportion of 10% are removed. Then, the 10% links as well as the attached weights constitute a probe or test set while the remainder of the network is treated as the training set. Note that the topology information of the entire network, i.e., (V, E), is always available. Thus, a weighted network is properly divided into two parts which allow us to conduct the model parameter training and the test of the tie strength prediction, respectively. We call such a process as one partition of the network. In case of that a single network partition would result in an unexpected biased experimental result, we adopt ten independently random partitions on the original network to generate ten groups of training and probe sets and take the averaged results over the total of ten experiments as a criterion to assess the performance of the model.

5.2. Metric for evaluation

The frequently used evaluation metrics in the related literature [22, 21] include Pearson Correlation Coefficient (PCC) and Root Mean-Squared Error (RMSE). Due to the reason that the mentioned studies have verified both of them are able to consistently evaluate the accuracy of the explicit tie strength prediction, we merely choose one of two metrics, i.e., the Root Mean-Squared Error, for experimental results evaluation in this work, which has a mathematical form as

image

where  W = (w1, w2, ..., wN) is the actual weight vector for N existing links in the test set and ˆW = ( ˆw1, ˆw2, ..., ˆwN) is the predicted weight vector for the corresponding links. This metric can measure how close the values of the predicted weights and the actual ones are. Therefore, the smaller the value of RMSE is, the better the accuracy of the tie strength prediction.

5.3. Comparisons of the prediction results

Since the k introduced in Eq. (15) is a hyper-parameter, a process of parameter tuning is required to obtain the optimized model. For a given network, after performing tests on ten pretreated network partitions with the same setting of k value, we average the RMSE values obtained from the ten experiments and can derive the mean value of the RMSEs. Repeating this process by changing the value of k, we will derive a series of averaged RMSE values with a variety of k value settings. Putting the obtained RMSE mean values together, we will find out an optimal k value. Apparently, the optimal k value will result in the smallest RMSE mean value. It turns out that the optimal value of the k is commonly in the scope of (0,2] in terms of the experimental results on the six real-world networks and the procedure of the model training will become harder to converge when the k is larger than 2. The corresponding curves of RMSE against k for the six networks are illustrated in Fig. 3. The optimal k values are 0.4, 2, 1, 0.3, 0.6 and 1.2 for networks of UC social, C.elegans, P.blog, Netscience, Neural network and Wiki, respectively.

Here, the machine used for the test is a desktop with 16 Gigabyte memory and 3.0 GHz Intel core-i3 processor. The results reported in Table 2 are the

image

Figure 3: The average RMSE curves under the tuning of parameter k for the six networks.

RMSE mean values as well as their standard deviations over ten experimental results obtained by the NEW model and nine baseline methods on six real-world networks. Among the nine compared approaches, six indices have been introduced in Section 3.2 while the three other ones are line-graph based approaches. The main idea of line-graph model is to calculate node centrality with 8 Centrality-based measures such as Closeness Centrality, Betweenness Centrality, and Eigenvector Centrality, etc., in the line graph and convert them into 8 features of the edges in the original graph. Note that a node in the line graph is actually corresponding to an edge in the original graph. Thus, one can apply the edge features into some learning models like Random forest, SVM and GBDT and infer the link weight. The detailed description of line-graph model can be found in literature [22]. The outcomes shown here for NEW model are calculated with the optimized k settings. Overall, it is verified that our model performs better than the compared methods on the six tested networks. More specifically, compared to the second best method, the proposed model has achieved further declines of the RMSE by 5.19%, 4.21%, 16.49%, 48.32% 5.09% and 97.93% on the networks of UC social, C.elegans, P.blogs, Netscience, Neural network and Wiki, respectively. As for network Wiki, we failed to predict the link weight via line-graph based approaches due to the Centrality-based measures are too expensive to calculate on a large-sized network with our testing machine. Therefore, aside from having the relatively weaker predictive performance compared to our NEW model, the line-graph based methods are less attractive for owning another significant drawback, i.e., lack of computational scalability.

Table 2: Experimental results on prediction accuracy for the six networks. The best result for each network is marked in boldface.

image

5.4. Performance analysis

Based on the derived optimal k values via parameter tunning for the six networks, Fig. 4 illustrates the convergence curves of parameters  θ1, θ2 andθ3under model training on one partition for each network. The results of running time suggest that the training of model parameters is quite fast on real-world networks as, even on the large-sized network like Wiki, it only requires a couple of minutes to finish the training procedure.

According to Algorithm 1, the calculations for the proposed model are mainly related to the number of edges in the training network and the number of common neighbors between node pairs. For initializing the attribute matrix, it requires  O( ˆNcn|E−E∗|). Due to the fact of sparsity for the real-world network’s structure, the averaged common neighbors ˆNcnfor node pairs are far fewer than the number of the existing edges which can be regarded as a constant. For parameter training, one iteration needs  O(|E − E∗|). Yet,the number of iterations required for the parameter convergence depends on the values of  εand learning rate  α. Practically, proper settings of the two parameters will lead to a fast convergence of the model which means that the number of the iterations is far fewer than the amount of edges |E|. Therefore, we can also treat it as a constant c. Thus, the parameter training requires O(c|E−E∗|). As a result, the approximate time complexity of the Algorithm 1 is  O( ˆNcn|E − E∗|) + O(c|E − E∗|) ≈ O(|E − E∗|). As for Algorithm 2, the time complexity apparently is  O(|E∗|). Therefore, Algorithms 1 and 2 have the combined complexity of O(|E|) which is merely correlated with the size of the edges in the network. It proves that our model has very good performance which can be easily scaled up to handle very large sized networks with a linear time complexity. The accumulated training time on the ten training sets for the six networks shown in Fig. 5 also verified that the training time nearly grows linearly along with the size of the training sets.

6.1. The significance of the structure features

Before investigating why the proposed structure features are crucial to our model, we firstly analyze the weight distributions of the six networks as plotted in Fig. 6. If we take those weights smaller than the average value over all weights in a network as small weight and other weights larger than the average value as large weight, The distributions for different networks show a commonality that weights with relatively small values belong to the majority whereas the large ones are the minority. In particular, weights in the networks of UC social and Wiki even have the typical power law distribution which means that nearly 80% of the weights are small weight and 20% of

image

Figure 4: The convergence curves of parameters  θ1, θ2 and θ3under model training on one partition for each network. The specified training parameters including hyper-parameter k and learning rate  αas well as time consumption for model training are presented in the plots.

image

image

Figure 5: Scalability of the NEW framework with the sizes of edges in the training sets for the six networks.

them belong to the large weight. That is to say, the real world networks have heavily imbalanced weight distributions. For each link denoted by (x, y), we can use Eq. (9) to calculate the dual connection inclinations from both ends of the (x, y). So, we will be able to obtain two distributions of the connection inclinations from all links shown in Fig. 7. Interestingly, both of the distributions for the six networks are quite similar to the actual weight distributions. Moreover, with regard to the composition of the distributions, even the fractions of the small weight and large weight for the six networks are very close to their counter parts, i.e., the fractions of the small and large values of connection inclination. It suggests that the proposed two structure features have captured the characteristics of the weight distribution and can be used as good indicators to fit the weights accurately.

6.2. The role of the parameter k

In the model design, we initially consider to use a linear regression model. But, in view of that the weights in real world networks commonly having non-uniform distributions, a linear regression model may be useful but not a desired tool to well fit the weights. So, we specially design a non-linear regression model and the objective that we introduce the parameter k into the model is to nonlinearly regulate the process of weight fitting. Via a process of the parameter k tuning, we can obtain the optimized prediction results when an optimal value of the k is determined. Nevertheless, the role of the parameter k is still unclear. Here, we try to analyze this issue from

image

Figure 6: Weight distributions for the six networks. The dashed line corresponds to the average value of the weights in each plot. The value shown on the left side of the line is the percentage of the small weights and that shown on the right side of the line is the percentage of the large weights.

image

Figure 7: (a) x’s connection inclination distributions for the six networks where x denotes one end of each link. (b) y’s connection inclination distributions for the six networks where y denotes the other end of each link. The dashed line corresponds to the average value of the connection inclinations in each plot. The value shown on the left side of the line is the percentage of the small weights and that shown on the right side of the line is the percentage of the large weights.

two basic observations. As the amount of the small weights are dominated in the tested networks, most of the predicted weights are supposed to be small as well. This means that the center of the weights and that of the predicted weights would be quite close to each other if we take the mean values averaged over the weights and the predicted ones as the centers of the two distributions, respectively. On the other hand, the weight distributions for all tested networks are heavily imbalanced. As a result, the predicted weights should have the similar or matched distributions if the model has the potential to accurately fit the weights. Here, we introduce two measures to statistically quantify the difference between the real weights and the predicted weights from the two aspects mentioned above. The first measure is to quantify the distance between two distributions’ centers.

image

where  wi and ˆwiare the weight and the predicted one of the ith link among N edges in the test set of the network, respectively. As there are D test sets (D = 10 in this paper) available due to the multiple independent network partitions, we need to average the D results to ensure the wdiff score unbiased. The second one is to measure the difference of the distributions using the information entropy.

image

where S is the total number of the weight categories in which those weights having identical values represent a category and the ratio of the weight quantity of the ith category to the quantity of the total weight is denoted by  pi.Accordingly, ˆpiis the percentage of the predicted weights falling in the ith category. With respect to the weights and predicted weights, in our opinion, smaller absolute scores of the wdiff and hdiff suggest a better result of the prediction. So, for the six networks, we separately plot the perturbation curves of the wdiff scores and hdiff scores under the k tuning as shown in Fig. 8 and mark the corresponding point in red in each plot when k reaches the optimal value. From the Fig. 7, we observe that the optimal k results in three consequences. Firstly, the optimal k will lead to the least wdiff score like the networks of C.elegans, P.blogs and Wiki. Secondly, it will promote the score of hdiff to be very close to the least value and the networks of UC social and Wiki are the examples. Thirdly, it will reach a trade-off state between the scores of wdiff and hdiff, both of which have the relatively small values such as Netscience. We also notice that the scores of wdiff are impacted more significantly under the parameter tuning when putting the two measures together for comparison. Therefore, we conclude that the tuning of the parameter k mainly promotes the center of the predicted weights approaching the center of the actual weights as closely as possible and also partially ensure the distribution of the predicted weights is similar to the actual distribution of the weights.

image

Figure 8: (a) Perturbation analysis of the wdiff scores for the six networks. (b) Per- turbation analysis of the hdiff scores for the six networks. The red dot for each plot corresponds to the score of the metric when the parameter k reaches the optimal value for the network.

In this paper, we are motivated from a viewpoint of generic learning to study the problem of tie strength prediction in various networks, which is significantly different from the traditional research paradigm mainly focusing on the specified social networks. Following this idea, we do not need to pay attention to any explicit domain features provided by a network which will restrict the utility of the predictive model in a wide range of networks. Instead, we focus on the structure features owned by any kinds of networks. According to the empirical analysis, the proposed index called connection inclination has the capacity to capture the tie strength from both ends of the links which merely uses the shared neighborhood information of the node pair. Aiming to fit the weights having imbalanced distributions in the real world networks, we proposed a parameterized non-linear regression algorithm accordingly. Essentially, comparing to the linear regression model, the non-linear model has a better adaptability to the non-uniform weight distribution. So, both the structure-driven features and non-linear learning model ensure that the entire computational framework is more generic than the conventional methods. The extensive experiments demonstrate that our proposed model overall has better performance than the baseline methods in the accuracy of tie strength prediction on six tested networks. Meanwhile, both the theoretical analysis and time consumption tests on different networks verify that our model has an approximate linear time complexity and thus has a good scalability to the large sized networks. To the best of our knowledge, our work is the first attempt in systematically studying the generic learning model for tie strength prediction in networks which would have provided some insights for the future works in this domain. It is worth noting that the introduced free parameter k plays an important role for the model optimization but technically, parameter tuning would be a tedious task. So, estimation of the optimal k is preferred in practice and also can be treated as a future extension of the current model. Meanwhile, if there is no common neighbors between node pairs, our model simply treats that the features of rx and ryequal to 0. In the future extension, we need to handle this case to specially define the features when two nodes of an edge do not possess a common neighbor. It would be also interesting to extend our model to predict the tie strength in temporal networks as the weights of the links would be changing over time. Recently, Qu et al. [34] studied the issue of discovering vital nodes in temporal networks, their findings could enlighten us to exploit some new node features to capture the link weights in time-varying networks.

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions.

[1] M. Burke, R. E. Kraut, Growing closer on facebook: changes in tie strength through social network site use, in: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, 2014, pp. 4187–4196.

[2] G. Kossinets, D. J. Watts, Empirical analysis of an evolving social net- work, science 311 (5757) (2006) 88–90.

[3] M. S. Granovetter, The strength of weak ties, in: Social networks, Else- vier, 1977, pp. 347–367.

[4] E. Gilbert, K. Karahalios, Predicting tie strength with social media, in: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, 2009, pp. 211–220.

[5] P. V. Marsden, K. E. Campbell, Measuring tie strength, Social forces 63 (2) (1984) 482–501.

[6] R. Rotabi, K. Kamath, J. Kleinberg, A. Sharma, Detecting strong ties using network motifs, in: Proceedings of the 26th International Conference on World Wide Web Companion, International World Wide Web Conferences Steering Committee, 2017, pp. 983–992.

[7] T. Moore, N. Christin, Beware the middleman: Empirical analysis of bitcoin-exchange risk, in: International Conference on Financial Cryptography and Data Security, Springer, 2013, pp. 25–33.

[8] E. Gilbert, Predicting tie strength in a new medium, in: Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work, ACM, 2012, pp. 1047–1056.

[9] P. V. Marsden, K. E. Campbell, Reflections on conceptualizing and measuring tie strength, Social forces 91 (1) (2012) 17–23.

[10] R. Xiang, J. Neville, M. Rogati, Modeling relationship strength in online social networks, in: Proceedings of the 19th international conference on World wide web, ACM, 2010, pp. 981–990.

[11] I. Kahanda, J. Neville, Using transactional information to predict link strength in online social networks., ICWSM 9 (2009) 74–81.

[12] J. Leskovec, D. Huttenlocher, J. Kleinberg, Signed networks in social media, in: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, 2010, pp. 1361–1370.

[13] J. Leskovec, D. Huttenlocher, J. Kleinberg, Predicting positive and neg- ative links in online social networks, in: Proceedings of the 19th international conference on World wide web, ACM, 2010, pp. 641–650.

[14] J. J. Jones, J. E. Settle, R. M. Bond, C. J. Fariss, C. Marlow, J. H. Fowler, Inferring tie strength from online directed behavior, PloS one 8 (1) (2013) e52168.

[15] Y. Koren, Collaborative filtering with temporal dynamics, in: Proceed- ings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2009, pp. 447–456.

[16] Z. Liu, K. Tan, X.-Q. Wang, S.-H. Tang, A learning framework for tem- poral recommendation without explicit iterative optimization, Applied Soft Computing 67 (2018) 529–539.

[17] J. Tang, Y. Chang, C. Aggarwal, H. Liu, A survey of signed network mining in social media, ACM Computing Surveys (CSUR) 49 (3) (2016) 42.

[18] S. Kumar, F. Spezzano, V. Subrahmanian, C. Faloutsos, Edge weight prediction in weighted signed networks., in: ICDM, 2016, pp. 221–230.

[19] H. Wang, F. Zhang, M. Hou, X. Xie, M. Guo, Q. Liu, Shine: signed heterogeneous information network embedding for sentiment link prediction, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, ACM, 2018, pp. 592–600.

[20] L. L¨u, T. Zhou, Link prediction in weighted networks: The role of weak ties, EPL (Europhysics Letters) 89 (1) (2010) 18001.

[21] J. Zhao, L. Miao, J. Yang, H. Fang, Q.-M. Zhang, M. Nie, P. Holme, T. Zhou, Prediction of links and weights in networks by reliable routes, Scientific reports 5 (2015) 12261.

[22] C. Fu, M. Zhao, L. Fan, X. Chen, J. Chen, Z. Wu, Y. Xia, Q. Xuan, Link weight prediction using supervised learning methods and its application

to yelp layered network, IEEE Transactions on Knowledge and Data Engineering 30 (8) (2018) 1507–1518.

[23] P. N. Krivitsky, M. S. Handcock, A. E. Raftery, P. D. Hoff, Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models, Social networks 31 (3) (2009) 204– 213.

[24] G. P. Styan, Hadamard products and multivariate statistical analysis, Linear algebra and its applications 6 (1973) 217–240.

[25] A. ˇZilinskas, Practical mathematical optimization: An introduction to basic optimization theory and classical and new gradient-based algorithms (2006).

[26] Z. Liu, J.-L. He, K. Kapoor, J. Srivastava, Correlations between commu- nity structure and link formation in complex networks, PloS one 8 (9) (2013) e72908.

[27] T. Opsahl, P. Panzarasa, Clustering in weighted networks, Social net- works 31 (2) (2009) 155–163.

[28] L. A. Adamic, N. Glance, The political blogosphere and the 2004 us election: divided they blog, in: Proceedings of the 3rd international workshop on Link discovery, ACM, 2005, pp. 36–43.

[29] M. E. Newman, Finding community structure in networks using the eigenvectors of matrices, Physical review E 74 (3) (2006) 036104.

[30] D. J. Watts, S. H. Strogatz, Collective dynamics of small-worldnetworks, nature 393 (6684) (1998) 440.

[31] E. Margaretha, H. L¨ungen, Building linguistic corpora from wikipedia articles and discussions., JLCL 29 (2) (2014) 59–82.

[32] C. V. Cannistraci, G. Alanis-Lobato, T. Ravasi, From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks, Scientific reports 3 (2013) 1613.

[33] L. L¨u, T. Zhou, Link prediction in complex networks: A survey, Physica A: statistical mechanics and its applications 390 (6) (2011) 1150–1170.

[34] C. Qu, X. Zhan, G. Wang, J. Wu, Z.-k. Zhang, Temporal information gathering process for node ranking in time-varying networks, Chaos: An Interdisciplinary Journal of Nonlinear Science 29 (3) (2019) 033116.


Designed for Accessibility and to further Open Science