With the advancements in data production, storage and consumption, networks are becoming omnipresent; data from diverse disciplines can be represented as graph structures with prominent examples here being various social, information, technological and biological networks. Developing machine learning algorithms to analyze, predict and make sense of the structure of graph data has become a crucial task with a plethora of cross-disciplinary applications [1], [2]. The major challenge in machine learning on graph data concerns the encoding of information about its structural properties into the learning model. To this direction, a recent paradigm in network analysis, known as network representation learning (NRL), aims at embedding the nodes of the graph into a lowerdimensional space, in such a way that similarity among nodes in the graph is captured by the similarity of the embeddings in the latent space [3], [4], [5], [6], [7], [8]. Many of the proposed models in network representation learning have mostly concentrated on computing node embeddings relying on matrix factorization techniques that encode information about structural node similarity [5], [6], [7]. Nevertheless, the majority of those approaches are not efficient for large scale networks, mainly due to the high computational cost required to perform matrix factorization [1], [2].
Being inspired by the field of natural language processing [9], random-walk based models have gained considerable attention [3], [4], [10], [11]. Typically, these methods first generate a set of node sequences (i.e., context nodes) for every node (i.e., center) in the network, based on some random walk strategy; then, node representations are learned by predicting context-center node co-occurrences within the random walks.
In this paper, we aim at combining the previously proposed broad modeling approaches for NRL – namely matrix factorization and random walks. In particular, we focus on modeling the interactions between nodes based on random walks, under a weighted matrix factorization framework. The potential advantage of such a modeling approach is that it allows to take advantage of and combine the elegant mathematical formulation that matrix factorization can offer with the expressive power of random walks to capture a notion of “stochastic” node similarity in an efficient way. More importantly, this formulation allows us to utilize kernel functions in the node representation learning task.
Kernel functions have mostly been introduced along with popular learning algorithms, such as PCA [12], SVMs [13], Spectral Clustering [14] and Collaborative Filtering [15]. The idea is to map non-linearly separable points into a (generally) higher dimensional feature space, so that the inner product in the new space can be computed without needing to compute the exact feature maps. Here, we aim at obtaining embeddings, given values that represent the relationships among nodes. Because of the nature of matrix factorization-based methods, these values are viewed as an inner product of vectors lying on a latent space, which allows us to utilize kernels interpreting the embeddings in a higher dimensional feature space using non-linear maps. The main contributions of the paper are the following:
• We propose a novel approach for learning node embeddings by incorporating kernel functions with models relying on weighted matrix factorization, encoding random walk-based structural information of the graph.
• We extensively evaluate the performance of the proposed method in the downstream tasks of node classification and link prediction and we show that the model generally outperforms the well-known baseline methods on various network datasets. Notation. We use the notation M to denote a matrix, points out the entry located at the i’th row and j’th column of the matrix, and
indicates the i’th row of the matrix.
Source code. The C++ implementation of the proposed methodology and the networks used in the study, can be reached at: https://abdcelikkanat.github.io/projects/kernelNE/.
Let G = (V, E) be a graph where V = {1, ..., n} and are the vertex and edge sets, respectively. Our goal is to find node representations in a latent space, preserving properties of the network. More formally, we define the general objective function of our problem as a weighted matrix factorization [16], as follows:
where is the target matrix constructed based on the desired properties of a given network, which is used to learn node embeddings
is the weight matrix in which each element
captures the importance of the approximation error between nodes v and u, and
indicates the Hadamard product. Depending on the desired graph properties that we are interested to encode, there are many possible alternatives to choose matrix M; such include the number of common neighbors between a pair of nodes, higher-order node proximity based on the Adamic-Adar or Katz indices [7], as well based on k-hop information [6]. Here, we will design M as a sparse binary matrix utilizing information of random walks over the network. Note that, matrices M and W do not need to be symmetric.
Random walk-based node embedding models [3], [4], [10], [17], [18], [19] have received great attention because of their good prediction performance and efficiency on large scale networks. Typically, those models generate a set of node sequences by simulating random walks; node representations are then learned by optimizing a model which defines the relationships between nodes and their contexts within the walks. More formally, for a random walk , the context of the center node
at position l in the walk w is defined as
, where
is called the window size and it denotes the furthest distance between the center and context nodes
for
and
. The embedding vectors are then obtained by maximizing the likelihood of occurrences of nodes within the context of given center nodes. Here, we will also follow a similar random walk strategy, formulating the problem under a matrix factorization framework.
Let be a binary value representing if node u appears in the context of v in any walk. Also, let
be the number of occurrences of node u in the contexts of v in the generated walks. Setting each term
as the square root of
, the objective function in (1) can be expressed under a random walk-based formulation as follows:
where each indicates a random walk of length
in the collection W and
represents the occurrence of u in the context
. Matrix A in Eq. (2), contains the embedding vectors of nodes when they are considered as centers; those will be the embeddings that are used in the experimental evaluation. The choice of matrix M and the reformulation of the objective function as stated above, offers a computational advantage during the optimization step. More importantly, as we will present in the next section, we can further benefit from a kernelized version of the objective function.
Similar to other matrix factorization techniques that aim at finding latent representations in a lower dimensional space (e.g., [20], [21], [22]), one can adopt Singular Value Decomposition (SVD) to provide the best approximation of the objective function in (1), as long as the weight matrix is uniform [23]. It is also implicitly assumed that every element of the target matrix M can be written as inner product of vectors in the latent space, and in that case, it becomes difficult to obtain an exact low-rank decomposition. To overcome this limitation, in our approach we utilize kernel functions to learn node representations via matrix factorization.
Let be a metric space and H be a Hilbert space of real-valued functions defined on X. A Hilbert space is called reproducing kernel Hilbert space (RKHS) if the point evaluation map over H is a continuous linear functional. Furthermore, a feature map is defined as a function
, H is referred to as feature space and every feature map defines a kernel
as follows:
It can be seen that is symmetric and positive definite due to the properties of an inner product space.
A function is called induced by
, if there exists
such that
, for a feature vector
of kernel
(note that, the definition is independent of the feature map
and space H) [24]. Let
s.t. g =
be the set of induced functions by kernel
. Then, a continuous kernel
on a compact metric space
is called universal, if the set
is dense in C(X). In other words, for any function
and
, there exists
satisfying
where is defined as
for some
. We use the next proposition as the basis for our approach.
Proposition 1 ([24]). Let (X, d) be a compact metric space and be a universal kernel on X. Then, for all compact and mutually disjoint subsets
, all
,...,
, and all
, there exists a function g induced by
with
such that
The universality property of a kernel helps us in finding the decomposition of matrix M in the feature space. Following Proposition (1), for each row of M, we can always find to approximate the row values in a higher dimensional inner product space. We can choose node representations from the disjoint subsets, but note that, each element
does not have to be in the image of the feature map.
Based on the above, we move the inner product from space X to the feature space H, by reformulating Eq. (2) as follows:
That way, we obtain a kernelized matrix factorization model for node embeddings based on random walks. For the numerical evaluation of our method, we use the following universal kernels [25], [24]:
where and
correspond to the Gaussian and Schoenberg kernels respectively. We will refer to the proposed kernel-based node embeddings methodology as KERNELNE (the two different kernels will be denoted by GAUSS and SCH).
Model Optimization. For the optimization step, we employ Stochastic Gradient Descent (SGD) [26]. Note that, Eq. (3) can be divided into two parts with respect to the values of {0, 1}. That way, we apply negative sampling [9] which is a variant of noise-contrastive estimation [27], proposed as an alternative to solve the computational problem of hierarchical softmax. For each context node
, we sample k negative instances
from the noise distribution
:
Each sample is generated proportionally to its frequency raised to the power of 0.75 and the number of negative instances is chosen as 5. In our experiments, we set the initial learning rate of SGD to 0.025; then it decreases linearly according to the number of processed nodes. The dimension of the embedding vectors is selected as d = 128 and the window size for the random walks as .
We evaluate the performance of our approach on the node classification and link prediction tasks. The experiments have been performed on a server with 60Gb RAM. Table I gives
TABLE I STATISTICS OF NETWORKS USED IN THE EXPERIMENTS. NODES,
NUMBER OF EDGES,
NUMBER OF LABELS AND
NUMBER OF CONNECTED COMPONENTS.
the statistics of the network datasets used in the experiments (all the networks are considered as undirected).
A. Baseline Methods
We consider five widely used baseline models to compare the performance of our approach. DEEPWALK [3] performs uniform random walks to generate the context of a node; then, the Skip-Gram model is used to learn node representations. NODE2VEC [4] combines Skip-Gram with biased random walks, using two extra parameters that control the walk to simulate a BFS or DFS exploration. In the experiments, we set those parameters to 1.0, the number of walks to 80 and walk length to 10. In our approach, we sample context nodes using NODE2VEC’s random walk strategy. LINE [21] learns embeddings relying on first-order and second-order proximity information of nodes. HOPE [7] is a matrix factorization approach aiming at capturing higher-order node similarity patterns based on the Katz index. Lastly, NETMF [22] targets to factorize the matrix approximated by the pointwise mutual information of center and context pairs. Those methods are compared against the KERNELNE-GAUSS and KERNELNESCH models.
B. Node Classification
Experimental set-up. In the node classification task, we have access to the labels of a certain fraction of nodes in the network (training set), and our goal is to predict the labels of the remaining nodes (test set). In the experiments, we learn embeddings on varying sizes of training data, ranging from 1% up to 90%. The experiments have been carried out by applying an one-vs-rest logistic regression classifier with regularization; the average scores of 50 experiments are reported.
Experimental results. Table II shows the Micro-scores for each network. For the CiteSeer network, KERNELNE outperforms the baselines for all training sizes. The SCH kernel with
gives gain of up to 7.0% against the best baseline model, while KERNELNE-GAUSS with
has the best performance for larger training sizes. For the Cora network, the GAUSS kernel with
also performs quite well especially for small training ratios (gain 0.91% up to 7.60%. Lastly, in the DBLP network, we choose
for the GAUSS kernel, which is the best performing model. The SCH
TABLE II MICRO-SCORES FOR THE NODE CLASSIFICATION TASK.
TABLE III AREA UNDER CURVE (AUC) SCORES FOR THE LINK PREDICTION TASK.
kernel with , also performs better than the baselines, especially for small training sizes.
C. Link Prediction
Experimental set-up. For the link prediction task, we remove half of the edges of the network in order to obtain positive samples for the test set; the same number of node pairs, not existing in the initial graph, are added to the test set. We then learn node embedding using the residual network; the feature vector of an edge (v, u) is formed with the operation for each coordinate i of the embedding vectors x and y corresponding to nodes v and u. In the experiments, we use logistic regression with
regularization.
Experimental results. Table III shows the area under curve (AUC) scores for the link prediction task. We choose kernel parameters as and
except the Gnutella network in which
. In all cases, the largest connected component of the datasets is used. As we can observe, in almost all cases, the proposed kernel-based models outperform the baselines. The only exception is the Facebook dataset, where NODE2VEC is just slightly better than KERNELNE.
Fig. 1. Influence of the dimension size d on the CiteSeer network.
Fig. 2. Influence of kernel parameters on the CiteSeer network.
D. Parameter Sensitivity
The effect of dimension size. Figure 1 shows the Micro- scores of the proposed models for varying embedding dimension sizes, ranging from d = 32 up to d = 224. As it can be seen, both of the kernel instances have the same tendency, where the performance increases proportionally to the size of the embedding vectors.
The effect of kernel parameters. In Figure 2, we study the behaviour of kernel functions with respect to the chosen parameters. The GAUSS kernel shows comparable results for values of between 0.4 and 4.0. In addition, we observed that its performance is limited for very big or very small values of this parameter. The SCH kernel also behaves similarly.We have reached the highest score with parameter values around
. Lastly, we observed poor performance for very small values of
, which are not included in the figure.
We have introduced the KERNELNE model for learning node embeddings. We interpret our random-walk based method under a weighted matrix factorization framework, which is then generalized to kernel functions. The numerical evaluation showed that the proposed kernel-based models substantially outperform baseline NRL methods in both node classification and link prediction tasks. An interesting future research direction concerns the extension of the proposed methodology to the multiple kernel learning framework [33].
[1] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” IEEE Data Eng. Bull., vol. 40, no. 3, pp. 52–74, 2017.
[2] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” CoRR, 2018.
[3] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in KDD, 2014, pp. 701–710.
[4] A. Grover and J. Leskovec, “Node2Vec: Scalable feature learning for networks,” in KDD, 2016, pp. 855–864.
[5] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, 2001, pp. 585–591.
[6] S. Cao, W. Lu, and Q. Xu, “GraRep: Learning graph representations with global structural information,” in CIKM, 2015, pp. 891–900.
[7] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in KDD, 2016, pp. 1105–1114.
[8] D. Berberidis, A. N. Nikolakopoulos, and G. B. Giannakis, “Adaptive diffusions for scalable learning over graphs,” IEEE Transactions on Signal Processing, vol. 67, no. 5, pp. 1307–1321, March 2019.
[9] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
[10] D. Nguyen and F. D. Malliaros, “BiasedWalk: Biased sampling for representation learning on graphs,” in IEEE Big Data, 2018, pp. 4045– 4053.
[11] A. C¸ elikkanat and F. D. Malliaros, “TNE: A latent model for repre- sentation learning on networks,” in NeurIPS Relational Representation Learning Workshop, 2018.
[12] B. Sch¨olkopf, A. Smola, and K.-R. M¨uller, “Kernel principal component analysis,” in ICANN, 1997, pp. 583–588.
[13] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[14] I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: Spectral clustering and normalized cuts,” in KDD. ACM, 2004, pp. 551–556.
[15] X. Liu, C. Aggarwal, Y.-F. Li, X. Kong, X. Sun, and S. Sathe, “Kernelized matrix factorization for collaborative filtering,” in SDM, 2016, pp. 378–386.
[16] N. Srebro and T. Jaakkola, “Weighted low-rank approximations,” in ICML, 2003, pp. 720–727.
[17] A. Epasto and B. Perozzi, “Is a single embedding enough? learning node representations that capture multiple social contexts,” in WWW, 2019, pp. 394–404.
[18] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim, “Dynamic network embeddings: From random walks to temporal random walks,” in IEEE Big Data, 2018, pp. 1085–1092.
[19] C. Lin, P. Ishwar, and W. Ding, “Node embedding for network commu- nity discovery,” in ICASSP, 2017, pp. 4129–4133.
[20] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang, “NetSMF: Large-scale network embedding as sparse matrix factorization,” in WWW, 2019, pp. 1509–1520.
[21] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large- scale information network embedding,” in WWW, 2015, pp. 1067–1077.
[22] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and Node2Vec,” in WSDM, 2018, pp. 459–467.
[23] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, vol. 1, no. 3, pp. 211–218, Sep 1936.
[24] I. Steinwart, “On the influence of the kernel on the consistency of support vector machines,” J. Mach. Learn. Res., vol. 2, pp. 67–93, Mar. 2002.
[25] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels,” J. Mach. Learn. Res., vol. 7, pp. 2651–2667, Dec. 2006.
[26] L. Bottou, “Stochastic gradient learning in neural networks,” in In Proceedings of Neuro-Nimes. EC2, 1991.
[27] A. Mnih and K. Kavukcuoglu, “Learning word embeddings efficiently with noise-contrastive estimation,” in NIPS, 2013, pp. 2265–2273.
[28] H. Chen, B. Perozzi, Y. Hu, and S. Skiena, “Harp: Hierarchical repre- sentation learning for networks,” in AAAI, 2018.
[29] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi- Rad, “Collective classification in network data,” AI magazine, 2008.
[30] B. Perozzi, V. Kulkarni, H. Chen, and S. Skiena, “Don’t walk, skip!: Online learning of multi-scale network embeddings,” in ASONAM, 2017, pp. 258–265.
[31] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graph evolution: Den- sification and shrinking diameters,” ACM Trans. Knowl. Discov. Data, vol. 1, no. 1, 2007.
[32] J. Leskovec and J. J. Mcauley, “Learning to discover social circles in ego networks,” in NIPS, 2012, pp. 539–547.
[33] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the smo algorithm,” in ICML, 2004, p. 6.