b

DiscoverSearch
About
My stuff
Kernel Node Embeddings
2019·arXiv
Abstract
Abstract

Learning representations of nodes in a low dimensional space is a crucial task with many interesting applications in network analysis, including link prediction and node classifi-cation. Two popular approaches for this problem include matrix factorization and random walk-based models. In this paper, we aim to bring together the best of both worlds, towards learning latent node representations. In particular, we propose a weighted matrix factorization model which encodes random walk-based information about the nodes of the graph. The main benefit of this formulation is that it allows to utilize kernel functions on the computation of the embeddings. We perform an empirical evaluation on real-world networks, showing that the proposed model outperforms baseline node embedding algorithms in two downstream machine learning tasks.

Index Terms—Network representation learning, node embedding, link prediction, node classification, kernel functions

With the advancements in data production, storage and consumption, networks are becoming omnipresent; data from diverse disciplines can be represented as graph structures with prominent examples here being various social, information, technological and biological networks. Developing machine learning algorithms to analyze, predict and make sense of the structure of graph data has become a crucial task with a plethora of cross-disciplinary applications [1], [2]. The major challenge in machine learning on graph data concerns the encoding of information about its structural properties into the learning model. To this direction, a recent paradigm in network analysis, known as network representation learning (NRL), aims at embedding the nodes of the graph into a lowerdimensional space, in such a way that similarity among nodes in the graph is captured by the similarity of the embeddings in the latent space [3], [4], [5], [6], [7], [8]. Many of the proposed models in network representation learning have mostly concentrated on computing node embeddings relying on matrix factorization techniques that encode information about structural node similarity [5], [6], [7]. Nevertheless, the majority of those approaches are not efficient for large scale networks, mainly due to the high computational cost required to perform matrix factorization [1], [2].

Being inspired by the field of natural language processing [9], random-walk based models have gained considerable attention [3], [4], [10], [11]. Typically, these methods first generate a set of node sequences (i.e., context nodes) for every node (i.e., center) in the network, based on some random walk strategy; then, node representations are learned by predicting context-center node co-occurrences within the random walks.

In this paper, we aim at combining the previously proposed broad modeling approaches for NRL – namely matrix factorization and random walks. In particular, we focus on modeling the interactions between nodes based on random walks, under a weighted matrix factorization framework. The potential advantage of such a modeling approach is that it allows to take advantage of and combine the elegant mathematical formulation that matrix factorization can offer with the expressive power of random walks to capture a notion of “stochastic” node similarity in an efficient way. More importantly, this formulation allows us to utilize kernel functions in the node representation learning task.

Kernel functions have mostly been introduced along with popular learning algorithms, such as PCA [12], SVMs [13], Spectral Clustering [14] and Collaborative Filtering [15]. The idea is to map non-linearly separable points into a (generally) higher dimensional feature space, so that the inner product in the new space can be computed without needing to compute the exact feature maps. Here, we aim at obtaining embeddings, given values that represent the relationships among nodes. Because of the nature of matrix factorization-based methods, these values are viewed as an inner product of vectors lying on a latent space, which allows us to utilize kernels interpreting the embeddings in a higher dimensional feature space using non-linear maps. The main contributions of the paper are the following:

We propose a novel approach for learning node embeddings by incorporating kernel functions with models relying on weighted matrix factorization, encoding random walk-based structural information of the graph.

We extensively evaluate the performance of the proposed method in the downstream tasks of node classification and link prediction and we show that the model generally outperforms the well-known baseline methods on various network datasets. Notation. We use the notation M to denote a matrix,  Mi,jpoints out the entry located at the i’th row and j’th column of the matrix, and  Mi,:indicates the i’th row of the matrix.

Source code. The C++ implementation of the proposed methodology and the networks used in the study, can be reached at: https://abdcelikkanat.github.io/projects/kernelNE/.

Let G = (V, E) be a graph where V = {1, ..., n} and E ⊆ V × Vare the vertex and edge sets, respectively. Our goal is to find node representations in a latent space, preserving properties of the network. More formally, we define the general objective function of our problem as a weighted matrix factorization [16], as follows:

image

where  M ∈ Rn×nis the target matrix constructed based on the desired properties of a given network, which is used to learn node embeddings  A, B ∈ Rn×d. W ∈ Rn×nis the weight matrix in which each element  Wv,ucaptures the importance of the approximation error between nodes v and u, and  ⊙indicates the Hadamard product. Depending on the desired graph properties that we are interested to encode, there are many possible alternatives to choose matrix M; such include the number of common neighbors between a pair of nodes, higher-order node proximity based on the Adamic-Adar or Katz indices [7], as well based on k-hop information [6]. Here, we will design M as a sparse binary matrix utilizing information of random walks over the network. Note that, matrices M and W do not need to be symmetric.

Random walk-based node embedding models [3], [4], [10], [17], [18], [19] have received great attention because of their good prediction performance and efficiency on large scale networks. Typically, those models generate a set of node sequences by simulating random walks; node representations are then learned by optimizing a model which defines the relationships between nodes and their contexts within the walks. More formally, for a random walk  w = (w1, ..., wℓ), the context of the center node  wl ∈ Vat position l in the walk w is defined as  Cw(wl) := (wl−γ, ..., wl−1, wl+1, ..., wl+γ), where  γis called the window size and it denotes the furthest distance between the center and context nodes  wk ∈ Vfor l − γ ≤ k ≤ l + γand  k ̸= l. The embedding vectors are then obtained by maximizing the likelihood of occurrences of nodes within the context of given center nodes. Here, we will also follow a similar random walk strategy, formulating the problem under a matrix factorization framework.

Let  Mv,ube a binary value representing if node u appears in the context of v in any walk. Also, let  Fv,ube the number of occurrences of node u in the contexts of v in the generated walks. Setting each term  Wv,uas the square root of  Fv,u, the objective function in (1) can be expressed under a random walk-based formulation as follows:

image

where each  w ∈ Vℓindicates a random walk of length  ℓin the collection W and  Mwwl,urepresents the occurrence of u in the context  Cw(wl). Matrix A in Eq. (2), contains the embedding vectors of nodes when they are considered as centers; those will be the embeddings that are used in the experimental evaluation. The choice of matrix M and the reformulation of the objective function as stated above, offers a computational advantage during the optimization step. More importantly, as we will present in the next section, we can further benefit from a kernelized version of the objective function.

Similar to other matrix factorization techniques that aim at finding latent representations in a lower dimensional space (d ≪ n)(e.g., [20], [21], [22]), one can adopt Singular Value Decomposition (SVD) to provide the best approximation of the objective function in (1), as long as the weight matrix is uniform [23]. It is also implicitly assumed that every element of the target matrix M can be written as inner product of vectors in the latent space, and in that case, it becomes difficult to obtain an exact low-rank decomposition. To overcome this limitation, in our approach we utilize kernel functions to learn node representations via matrix factorization.

Let  (X, dX)be a metric space and H be a Hilbert space of real-valued functions defined on X. A Hilbert space is called reproducing kernel Hilbert space (RKHS) if the point evaluation map over H is a continuous linear functional. Furthermore, a feature map is defined as a function  Φ : X → H, H is referred to as feature space and every feature map defines a kernel  κ : X × X → Ras follows:

image

It can be seen that  κ(·, ·)is symmetric and positive definite due to the properties of an inner product space.

A function  g : X → Ris called induced by  κ, if there exists h ∈ Hsuch that  g = ⟨h, Φ(·)⟩, for a feature vector  Φof kernel κ(note that, the definition is independent of the feature map Φand space H) [24]. Let  Iκ := {g : X → R | ∃h ∈ Hs.t. g = ⟨h, Φ(·)⟩}be the set of induced functions by kernel  κ. Then, a continuous kernel  κon a compact metric space  (X, dX)is called universal, if the set  Iκis dense in C(X). In other words, for any function  f ∈ C(X)and  ǫ > 0, there exists  gh ∈ Iκsatisfying

image

where  ghis defined as  ⟨h, Φ(·)⟩for some  h ∈ H. We use the next proposition as the basis for our approach.

Proposition 1 ([24]). Let (X, d) be a compact metric space and  κ(·, ·)be a universal kernel on X. Then, for all compact and mutually disjoint subsets  K1, ..., Kn ⊂ X, all  α1,...,αn∈ R, and all  ǫ > 0, there exists a function g induced by  κwith  ∥g∥∞ ≤ maxi |αi| + ǫsuch that

image

image

The universality property of a kernel helps us in finding the decomposition of matrix M in the feature space. Following Proposition (1), for each row of M, we can always find  h ∈ Hto approximate the row values in a higher dimensional inner product space. We can choose node representations from the disjoint subsets, but note that, each element  h ∈ Hdoes not have to be in the image of the feature map.

Based on the above, we move the inner product from space X to the feature space H, by reformulating Eq. (2) as follows:

image

That way, we obtain a kernelized matrix factorization model for node embeddings based on random walks. For the numerical evaluation of our method, we use the following universal kernels [25], [24]:

image

where  κGand  κScorrespond to the Gaussian and Schoenberg kernels respectively. We will refer to the proposed kernel-based node embeddings methodology as KERNELNE (the two different kernels will be denoted by GAUSS and SCH).

Model Optimization. For the optimization step, we employ Stochastic Gradient Descent (SGD) [26]. Note that, Eq. (3) can be divided into two parts with respect to the values of  Mwv,u ∈{0, 1}. That way, we apply negative sampling [9] which is a variant of noise-contrastive estimation [27], proposed as an alternative to solve the computational problem of hierarchical softmax. For each context node  u+ ∈ Cw(wl), we sample k negative instances  u−from the noise distribution  p−:

image

Each sample is generated proportionally to its frequency raised to the power of 0.75 and the number of negative instances is chosen as 5. In our experiments, we set the initial learning rate of SGD to 0.025; then it decreases linearly according to the number of processed nodes. The dimension of the embedding vectors is selected as d = 128 and the window size for the random walks as  γ = 10.

We evaluate the performance of our approach on the node classification and link prediction tasks. The experiments have been performed on a server with 60Gb RAM. Table I gives

TABLE I STATISTICS OF NETWORKS USED IN THE EXPERIMENTS.  |V|: NUMBER OFNODES,  |E|:NUMBER OF EDGES,  |K|:NUMBER OF LABELS AND  |C|:NUMBER OF CONNECTED COMPONENTS.

image

the statistics of the network datasets used in the experiments (all the networks are considered as undirected).

A. Baseline Methods

We consider five widely used baseline models to compare the performance of our approach. DEEPWALK [3] performs uniform random walks to generate the context of a node; then, the Skip-Gram model is used to learn node representations. NODE2VEC [4] combines Skip-Gram with biased random walks, using two extra parameters that control the walk to simulate a BFS or DFS exploration. In the experiments, we set those parameters to 1.0, the number of walks to 80 and walk length to 10. In our approach, we sample context nodes using NODE2VEC’s random walk strategy. LINE [21] learns embeddings relying on first-order and second-order proximity information of nodes. HOPE [7] is a matrix factorization approach aiming at capturing higher-order node similarity patterns based on the Katz index. Lastly, NETMF [22] targets to factorize the matrix approximated by the pointwise mutual information of center and context pairs. Those methods are compared against the KERNELNE-GAUSS and KERNELNESCH models.

B. Node Classification

Experimental set-up. In the node classification task, we have access to the labels of a certain fraction of nodes in the network (training set), and our goal is to predict the labels of the remaining nodes (test set). In the experiments, we learn embeddings on varying sizes of training data, ranging from 1% up to 90%. The experiments have been carried out by applying an one-vs-rest logistic regression classifier with L2regularization; the average scores of 50 experiments are reported.

Experimental results. Table II shows the Micro-F1scores for each network. For the CiteSeer network, KERNELNE outperforms the baselines for all training sizes. The SCH kernel with  α = 1gives gain of up to 7.0% against the best baseline model, while KERNELNE-GAUSS with  σ2 = 2has the best performance for larger training sizes. For the Cora network, the GAUSS kernel with  σ2 = 2also performs quite well especially for small training ratios (gain 0.91% up to 7.60%. Lastly, in the DBLP network, we choose  σ = 0.3for the GAUSS kernel, which is the best performing model. The SCH

TABLE II MICRO-F1SCORES FOR THE NODE CLASSIFICATION TASK.

image

TABLE III AREA UNDER CURVE (AUC) SCORES FOR THE LINK PREDICTION TASK.

image

kernel with  α = 3.0, also performs better than the baselines, especially for small training sizes.

C. Link Prediction

Experimental set-up. For the link prediction task, we remove half of the edges of the network in order to obtain positive samples for the test set; the same number of node pairs, not existing in the initial graph, are added to the test set. We then learn node embedding using the residual network; the feature vector of an edge (v, u) is formed with the operation  |xvi −yui |2for each coordinate i of the embedding vectors x and y corresponding to nodes v and u. In the experiments, we use logistic regression with  L2regularization.

Experimental results. Table III shows the area under curve (AUC) scores for the link prediction task. We choose kernel parameters as  σ = 0.3and  α = 2except the Gnutella network in which  σ = 3.0. In all cases, the largest connected component of the datasets is used. As we can observe, in almost all cases, the proposed kernel-based models outperform the baselines. The only exception is the Facebook dataset, where NODE2VEC is just slightly better than KERNELNE.

image

Fig. 1. Influence of the dimension size d on the CiteSeer network.

image

Fig. 2. Influence of kernel parameters on the CiteSeer network.

D. Parameter Sensitivity

The effect of dimension size. Figure 1 shows the Micro- F1scores of the proposed models for varying embedding dimension sizes, ranging from d = 32 up to d = 224. As it can be seen, both of the kernel instances have the same tendency, where the performance increases proportionally to the size of the embedding vectors.

The effect of kernel parameters. In Figure 2, we study the behaviour of kernel functions with respect to the chosen parameters. The GAUSS kernel shows comparable results for values of  σ2between 0.4 and 4.0. In addition, we observed that its performance is limited for very big or very small values of this parameter. The SCH kernel also behaves similarly.We have reached the highest score with parameter values around α = 1.0. Lastly, we observed poor performance for very small values of  α, which are not included in the figure.

We have introduced the KERNELNE model for learning node embeddings. We interpret our random-walk based method under a weighted matrix factorization framework, which is then generalized to kernel functions. The numerical evaluation showed that the proposed kernel-based models substantially outperform baseline NRL methods in both node classification and link prediction tasks. An interesting future research direction concerns the extension of the proposed methodology to the multiple kernel learning framework [33].

[1] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” IEEE Data Eng. Bull., vol. 40, no. 3, pp. 52–74, 2017.

[2] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” CoRR, 2018.

[3] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in KDD, 2014, pp. 701–710.

[4] A. Grover and J. Leskovec, “Node2Vec: Scalable feature learning for networks,” in KDD, 2016, pp. 855–864.

[5] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, 2001, pp. 585–591.

[6] S. Cao, W. Lu, and Q. Xu, “GraRep: Learning graph representations with global structural information,” in CIKM, 2015, pp. 891–900.

[7] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in KDD, 2016, pp. 1105–1114.

[8] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis, “Adaptive diffusions for scalable learning over graphs,” IEEE Transactions on Signal Processing, vol. 67, no. 5, pp. 1307–1321, March 2019.

[9] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.

[10] D. Nguyen and F. D. Malliaros, “BiasedWalk: Biased sampling for representation learning on graphs,” in IEEE Big Data, 2018, pp. 4045– 4053.

[11] A. C¸ elikkanat and F. D. Malliaros, “TNE: A latent model for repre- sentation learning on networks,” in NeurIPS Relational Representation Learning Workshop, 2018.

[12] B. Sch¨olkopf, A. Smola, and K.-R. M¨uller, “Kernel principal component analysis,” in ICANN, 1997, pp. 583–588.

[13] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.

[14] I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: Spectral clustering and normalized cuts,” in KDD. ACM, 2004, pp. 551–556.

[15] X. Liu, C. Aggarwal, Y.-F. Li, X. Kong, X. Sun, and S. Sathe, “Kernelized matrix factorization for collaborative filtering,” in SDM, 2016, pp. 378–386.

[16] N. Srebro and T. Jaakkola, “Weighted low-rank approximations,” in ICML, 2003, pp. 720–727.

[17] A. Epasto and B. Perozzi, “Is a single embedding enough? learning node representations that capture multiple social contexts,” in WWW, 2019, pp. 394–404.

[18] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim, “Dynamic network embeddings: From random walks to temporal random walks,” in IEEE Big Data, 2018, pp. 1085–1092.

[19] C. Lin, P. Ishwar, and W. Ding, “Node embedding for network commu- nity discovery,” in ICASSP, 2017, pp. 4129–4133.

[20] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang, “NetSMF: Large-scale network embedding as sparse matrix factorization,” in WWW, 2019, pp. 1509–1520.

[21] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large- scale information network embedding,” in WWW, 2015, pp. 1067–1077.

[22] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and Node2Vec,” in WSDM, 2018, pp. 459–467.

[23] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, vol. 1, no. 3, pp. 211–218, Sep 1936.

[24] I. Steinwart, “On the influence of the kernel on the consistency of support vector machines,” J. Mach. Learn. Res., vol. 2, pp. 67–93, Mar. 2002.

[25] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels,” J. Mach. Learn. Res., vol. 7, pp. 2651–2667, Dec. 2006.

[26] L. Bottou, “Stochastic gradient learning in neural networks,” in In Proceedings of Neuro-Nimes. EC2, 1991.

[27] A. Mnih and K. Kavukcuoglu, “Learning word embeddings efficiently with noise-contrastive estimation,” in NIPS, 2013, pp. 2265–2273.

[28] H. Chen, B. Perozzi, Y. Hu, and S. Skiena, “Harp: Hierarchical repre- sentation learning for networks,” in AAAI, 2018.

[29] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi- Rad, “Collective classification in network data,” AI magazine, 2008.

[30] B. Perozzi, V. Kulkarni, H. Chen, and S. Skiena, “Don’t walk, skip!: Online learning of multi-scale network embeddings,” in ASONAM, 2017, pp. 258–265.

[31] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graph evolution: Den- sification and shrinking diameters,” ACM Trans. Knowl. Discov. Data, vol. 1, no. 1, 2007.

[32] J. Leskovec and J. J. Mcauley, “Learning to discover social circles in ego networks,” in NIPS, 2012, pp. 539–547.

[33] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the smo algorithm,” in ICML, 2004, p. 6.


Designed for Accessibility and to further Open Science