Graphs naturally arise as models in a variety of applications, ranging from social networks (Scott, 1988) and molecular biology (Higham et al., 2008) to recommendation systems (Ma et al., 2018) and transportation (Bell and Iida, 1997). In a variety of problems, graphs tend to be high-dimensional and highly entangled, and hence difficult to directly learn from. As a prominent remedy, graph representation learning aims to learn a mapping that represents each vertex as low-dimensional vector such that structural properties of the original graph are preserved. Those learned low-dimensional representations, also called embeddings, are further used as the input features in downstream machine learning tasks, such as link prediction (Taskar et al., 2004; Al Hasan and Zaki, 2011), node classification (Bhagat et al., 2011), and community detection (Fortunato, 2010).
There are three major approaches to graph embedding: matrix factorization-based algorithms (Belkin and Niyogi, 2002; Ahmed et al., 2013), random walk algorithms (Perozzi et al., 2014; Grover and Leskovec, 2016), and graph neural networks (Scarselli et al., 2008; Zhou et al., 2018; Wu et al., 2019). These approaches can be unified via the encoder-decoder framework proposed in Hamilton et al. (2017b). In this framework, the encoder is a mapping that projects each vertex or a subgraph to a low-dimensional vector, whereas the decoder is a probability model that infers the structural information of the graph from the embeddings generated by the encoder. The structural information here depends on the specific downstream tasks of interest, which also determine the loss function of the decoder. The desired graph representations are hence obtained by minimizing the loss function as a function of embedding vectors. For example, in the link prediction task, the decoder predicts whether an edge between two vertices exists or not using a Bernoulli model and logistic loss function, and the model parameter is a function of embeddings (Baldin and Berthet, 2018).
Such an encoder-decoder architecture motivates the study of graph representation learning through the lens of statistical estimation for generative models. In particular, suppose the observed graph is generated by a statistical model specified by the decoder with true graph representations as its inputs. We can then assess the performance of a graph embedding algorithm by examining the difference between the learned representation and the ground truth. Baldin and Berthet (2018) adopted this perspective to study the performance of a linear embedding method for the link prediction problem. The validity of their results hinges on the condition that both the linear model of the encoder and the Bernoulli model of the decoder are correctly specified. When either of these assumptions are violated, they would incur large estimation error. Recent advances in graph representation learning are attributed to more flexible decoders (Cho et al., 2014; Goodfellow et al., 2016; Badri- narayanan et al., 2017), which are based on deep neural networks and can handle graphs with edge attributes that can be categorical. These approaches are poorly understood from a theoretical point of view.
In the present paper, we focus on bipartite graphs, where there are two distinct sets of vertices, U and V , and only edges between two vertices in different sets are allowed. We study the semiparametric nonlinear bipartite graph representation learning problem under the encoder-decoder framework. We assume that each vertex is associated with a high-dimensional Gaussian vector . Similarly, each vertex is associated with a high-dimensional Gaussian vector . The encoder maps them via one-layer neural networks to low-dimensional vectors are weight matrices, are activation functions evaluated entrywise, and ). Furthermore, in the decoder, we consider the link prediction task under a semiparametric model. In particular, we assume that the attribute of an edge follows a natural exponential family distribution parameterized by the proximity between two vertices, which is defined as the inner product between the embedding vectors. Here, ) is the embedding vector of ) is the embedding vector of v. We do not specify the base measure of the exponential family distribution but, instead, treat it as a nuisance parameter. This gives us a semiparametric model for the decoder and robustness to model misspecification within the exponential family.
In the above described semiparametric nonlinear model, our goal is to recover weight matrices . Based on these weight matrices, we can then compute embeddings for all vertices. There are two main obstacles that make the estimation problem challenging. First, while the activation functions make the encoder model more flexible, their nonlinearity leads to a loss function that is nonconvex and nonsmooth. Second, while the unknown nonparametric nuisance component of the decoder model makes the graph representation learning robust to the model misspecification, it also makes the likelihood function not available. To overcome these obstacles, we propose a pseudo-likelihood objective, which is minimized at () locally. We analyze the landscape of the empirical objective and show that, in a neighborhood around the ground truth, the objective is strongly convex. Therefore, the vanilla gradient descent (GD) achieves linear convergence rate. Moreover, we prove that the sample complexity is linear in dimensions , up to logarithmic factors, which matches the best known result under the parametric model (Zhong et al., 2019). Experiments on synthetic and real data corroborate our theoretical results and illustrate flexibility of the proposed representation learning model.
Notations. For any positive integer n, [n] = {1, 2, . . . , n} denotes the index set, and Unif([n]) is a uniform sampling over the indices. We write constant , which equals to 1 if i = j and 0 otherwise. For any matrix U, vec(U) denotes the column vector obtained by vectorizing . As usual, refer to the Frobenius and operator norm, respectively, and ) denotes the p-th singular value of U. For a square matrix ) is a vector including all diagonal entries of U; when U is symmetric, )) denotes its maximum (minimum) eigenvalue. We write is positive semidefinite and if it is positive definite. For any vector is the minimal absolute value of its entries.
Structure of the paper. In Section 2, we formalize the semiparametric graph representation learning problem and introduce related work. In Section 3, we present our estimation method by proposing a pseudo-likelihood objective, and the theoretical analysis of such objective is provided in Section 4. In Section 5 we show experimental results and conclusions are summarized in Section 6. Section 7 provides proofs of main technical results, while the proofs of auxiliary results are given in the appendix.
We describe the setup of our problem and introduce the applications and related work. We particularly focus on the statistical literature on theory of semiparametric estimation and matrix completion, although bipartite graph representation learning has been routinely applied to varied deep neural networks (Nassar, 2018; Wu et al., 2018). We point reader to Zha et al. (2001) for a survey on bipartite graph.
2.1 Problem formulation
Let G = (U, V, E) be a bipartite graph where U and V are two sets of vertices and E denotes the set of edges between two vertex sets. For each vertex , we assume it is associated with a Gaussian vector , while for each we have a Gaussian vector An edge between u and v has an attribute that follows the following semiparametric exponential family model
which is parameterized by the base measure function and a scalar
In model (1), ) is the log-partition function (or normalizing function) that makes the density have unit integral. The parametric component of the exponential family, ), depends on the covariate coming from the set U and the covariate from the set V . The nonparametric component f is treated as a nuisance parameter, which gives us flexibility in modeling the edge attributes. To make notation concise, we will drop the subscript of hereinafter, and use x and z to denote covariates from set U and V , respectively. In our analysis, the activation functions have one of the following three forms:
We formalize the bipartite graph representation learning as a statistical parameter estimation problem of a generative model. In particular, suppose the graph is generated by the exponential family model (1) with some unknown base measure f, and we observe part of edge attributes, y, and associated covariates on two ends, x and z. Thus, we obtain data set index the vertices of two sets. The graph representation learning in our setup is then equivalent to recovering , which can be used to compute parametric component of the decoder model and estimate embedding vectors, ), for all vertices in two sets, since activation functions are user-chosen and known.
2.2 Applications and related work
Graph representation learning underlies a number of real world problems, including object recognition in image analysis (Bunke and Messmer, 1995; Fiorio, 1996), community detection in social science (Perozzi et al., 2014; Cavallari et al., 2017), and recommendation systems in machine learning (Kang et al., 2016; Jannach et al., 2016). See Bengio et al. (2013), Hamilton et al. (2017a), Hamilton et al. (2017b) for recent surveys and other applications. The bipartite graph is of particular interest since it classifies vertices into two types, which extensively appears in modern applications.
For concreteness, in user-item recommendation systems, the attribute of an edge between a user node and an item node represents the rating, which is modeled by the proximity of projected features onto the latent space. Specifically, each user is represented by a high-dimensional feature vector x and each item is represented by a high-dimensional feature vector z. A simple generative model for the rating y that a user gives to an item is y = 1) independent from x, z. Such a model is studied in the inductive matrix completion (IMC) literature (Abernethy et al., 2006; Jain and Dhillon, 2013; Si et al., 2016; Berg et al., 2017). Zhong et al. (2019) studied nonlinear IMC problem, where a generalized model for the rating is common activation function. In this generalized nonlinear model, one-layer neural network compresses the high-dimensional features into low-dimensional embeddings. Zhong et al. (2019) proposed to minimize the squared loss to recover weight matrices established consistency for their minimizer, with linear sample complexity in dimension , up to logarithmic factors.
Our work contributes to this line of research by enhancing the IMC model from two aspects. First, we allow for two separate neural networks to embed user and item covariates. Although this modification may seem minor, it makes theoretical analysis more challenging when two networks mismatch: one network has a smooth activation function while the other does not. Second, we consider an exponential family model with unknown base measure, which extends the applicability of the model and allows for model misspecification within the exponential family. In particular, the semiparametric setup makes our estimator independent of the specific form of f. For example, the model in Zhong et al. (2019) is a special case of (1) with 2), while the link prediction problem in Liben- Nowell and Kleinberg (2007) and Menon and Elkan (2011) is a special case with f(y) = 1.
Furthermore, our work contributes to the literature on graph embedding (Qiu et al., 2018; Goyal and Ferrara, 2018). Our paper studies the bipartite graph and casts the graph representation learning as the problem of parameter estimation in a generative model. This setup allows us to analyze statistical properties, such as consistency and convergence rate, of the learned embedding features. To the best of our knowledge, statistical view of representation learning is missing although it was successfully used in real experiments (see, e.g., Graepel et al., 2001; Yang et al., 2015). In addition, our work also contributes to a growing literature on semiparametric modeling (Fengler, 2005; Li and Liang, 2008; Fan et al., 2017), where the parametric component in (1) is given by and the goal is to estimate by regressing y on x, without knowing f. Fosdick and Hoff (2015) formalized the representation learning as a latent space network model, where the parameter given by the inner product of two latent vectors and 2), that is under a Gaussian noise setup, and proposed methodology for testing the dependence between nodal attributes and latent factors. Ma et al. (2019) studied a similar model with f(y) = 1 and proposed both convex and nonconvex approaches to recover latent factors. However, our work is more challenging due to the nonlinearity of activation functions and the missing knowledge of f.
Lastly, several estimation methods for pairwise measurements have been studied in related, but simpler, models (Chen and Goldsmith, 2014; Chen and Suh, 2015; Chen et al., 2016; Pananjady et al., 2017; Negahban et al., 2018; Chen and Cand`es, 2018; Chen et al., 2019). Chen et al. (2018) studied model (1) by assuming the parameter matrix of the graph to be low-rank, and estimated as a whole. As a comparison, our model is more complicated since each entry of in our setup is given by the inner product of two embedding vectors, which measures the proximity of two vertices. Our task is to recover two underlying weight matrices that are convolved by activation functions to generate
We propose a pseudo-likelihood objective function to estimate the unknown weight matrices and discuss identifiability of the parameters. The objective function is minimized by the gradient descent with a constant step size. Theoretical analysis of the iterates is provided in Section 4.
The likelihood of the model (1) is not available due to the presence of the infinite-dimensional nuisance parameter f. Using the rank-order decomposition technique (Ning et al., 2017), we focus on the pairwise differences and develop a pseudo-likelihood objective. Importantly, the differential pseudo-likelihood does not depend on f and, as a result, our estimator is valid for a wide range of distributions, without having to explicitly specify them in advance.
We follow the setup described in Section 2.1. To simplify the presentation, suppose we have 2vertices in vertices in V , denoted by and , respectively. For , and suppose that ), independent of each other. Further, we assume to observe m edge attributes, y, between vertices , and another m edge attributes, , both of which follow the distribution in (1) and are sampled with replacement from the set of all possible edges. We note that the sampling setup is commonly adopted in the literature on partially observed graphs and matrix completion problems (Zhong et al., 2019; Chen et al., 2018), which is equivalent to assuming edges on a graph are missing at random.
Denote sample sets
where ]). While the observations within Ω or Ωare not independent, as they may have common features x or z, the observations between Ω and Ωare independent. Such two independent sets of samples are obtained by sample splitting in practice. We stress that the sample splitting setup in our paper is used only to make the analysis concise without enhancing the order of sample complexity. In particular, it does not help us avoid the main difficulties of the problem.
Based on samples Ω and Ω, we consider pairwise differences and construct an empirical loss function. For
denote the true parameter associated with the k-th sample (similarly for ). Note that is the underlying parametric component of the model that generates key idea in constructing the pseudo-likelihood objective is to use rank-order decomposition to extract a factor, that is independently from the base measure. Given a pair of independent samples, , we denote their order statistics as and rank statistics as R. Then we know 1). Thus, (characterizes the pair (), and is hence a sufficient statistics. Note that
The first term is the density of the rank statistics given order statistics, which is only a function of unknown weight matrices . The second term is the density of order statistics, which relies on the specific base measure f. Thus, we omit the second term and sum over all paired samples for the first term to arrive at the following objective
The above loss function is similar to the logistic loss for the pairwise measurements. However, it is nonconvex in both components even for identity activation functions. When feature vectors x, z follow the multinomial distribution and activation functions not present, Chen et al. (2018) estimated the rank-as a whole by minimizing (3) with an additional nuclear norm penalty. Our goal is to recover both components , in the presence of nonlinear activation functions, resulting in a challenging nonconvex optimization problem.
3.1 Gradient Descent
We propose to minimize loss function (3) using the gradient descent with a constant step size. The iteration is given by
For future references, we provide explicit formulas of the gradient and the Hessian for loss (3). We introduce some definitions beforehand. Let us denote each column of weight matrices as ) (similar for To simplify notations, for a sequence of vectors ) be the long vector by stacking them up; for a sequence of matrices , we let diagthe block diagonal matrix with each block being specified by sequentially. Moreover, we define the following quantities:
The quantities on the left part are vectors or matrices calculated by using samples in Ω, which is indexed by k, while the quantities on the right part are calculated by using samples in Ω, which is indexed by l. We should mention that are the first derivative and the second derivative of the activation function is ReLU then = 0), while superscript of ) means the sample is from Ω(i.e. the sample index l is always used with superscript (). In addition, we define two scalars as
With above definitions and by simple calculations, one can show the gradient is given by
(5) Furthermore, ], one can show
To combine all blocks and form the Hessian matrix, we will vectorize weight matrices and further define long vectors block diagonal matrices (similar for ). Then, the Hessian matrix
3.2 Identifiability
In general, the weight matrices in loss function (3) are not identifiable as the function is bilinear in U, V. For example, when both activation functions are identity, and L(U, V) have the same value for any invertible matrix , which makes the Hessian at () indefinite. Similarly, for ReLU activation, this phenomenon reappears by letting Q be any diagonal matrix with positive entries. To resolve this issue, one can use a penalty function to balance two components U and V (Yi et al., 2016; Park et al., 2018; Na et al., 2019). Fortunately, in our problem, the identifiability issue disappears when a smooth nonlinear activation is used, such as sigmoid or tanh, although their nonconvexity brings other challenges.
We stress that, different from over-parameterized problems in neural networks (Sagun et al., 2017; Li and Liang, 2018; Allen-Zhu et al., 2018), the identifiability issue comes from the redundancy of parameters, which is also observed in inductive matrix completion problem (Zhong et al., 2019). Zhong et al. (2019) showed that by fixing the first row of both components are recoverable from the square loss even with ReLU activation. In our problem, when either one of activation functions is ReLU, we use a similar restriction on and show that the loss in (3) has positive definite Hessian at (), without adding any penalties.
In this section, we will show that the ground truth () is a stationary point of the loss (3) and then show that the loss is strongly convex in its neighborhood. Using these two observations, we further establish the local linear convergence rate for iterates in (4). Since the radius of the neighborhood is fixed in terms of (), a wart-start initialization can be obtained by a third-order tensor method (see, e.g., Zhong et al., 2017, 2019). In our simulations, due to high computational cost of a tensor method, we recommend a random initialization (Du et al., 2017; Cao and Gu, 2019).
4.1 Assumptions
We require two assumptions to establish our main results. The first assumption fixes the scale of weight matrices.
Assumption 1 The weight matrices and satisfy
The second assumption imposes a mild regularity condition.
complete subgraphs (the edges between are ignored). We assume
(a) (boundedness): There exist such that, for any sample (have
where , and assume is a continuous, positive two-dimensional function.
Assumption 2 is widely assumed in the analysis of logistic loss function (Chen et al., 2018). In particular, Assumption 2(a) restricts the parametric component into a compact set, which controls the range of proximity between two connected nodes. Intuitively, larger implies a harder estimation problem. We also add boundedness condition on the response y for simplicity. It can be replaced by assuming y to be subexponential (Ning et al., 2017). Boundedness holds deterministically for some distribution in exponential family, such as Bernoulli and Beta, and holds with high probability for a wide range of exponential family distributions, though may depend on the sample size . Assumption 2(b) is the regularity condition, which plays the key role when showing the strong convexity of the population loss at the ground truth. It can be shown to hold for all exponential family distributions with bounded support, and for some unbounded distributions, such as Gaussian and Poisson.
4.2 Properties of the Population Loss
With the above assumptions, our first result shows that the gradient of the population loss at () is zero. For all quantities defined in Section 3.1, we add superscript (to denote the underlying true quantities, which are obtained by replacing U, V with true weight matrices . For example, we have
The following lemma shows that the conditional expectation of given all covariates associated to two end vertices is zero.
, we have that the conditional expectation given all covariates
Since is a common factor of the gradients ), as shown in (5), and vectors only depend on covariates, one can first take conditional expectation given covariates and show the following result.
)] as an example, while )] can be shown similarly. By the formula in (5),
where, for the second term from the end, the outer expectation is taken over randomness in sampling and all covariate, and the last equality is due to Lemma 3. Doing same derivation for each column and we obtain . Similarly
We then study the local curvature of the population loss at (), which is obtained in the next two steps. We simplify the notation further by dropping the subscripts of sample index. We let , and their corresponding (version, denote general references of corresponding quantities, which may be computed by using any samples in . We stress that all samples in have the same distribution, so that independent from each other.
Proposition 5 Suppose Assumptions 1 and 2 hold. Define
then we have
Proof Recall the formula for the Hessian matrix in (6). The second term has zero expectation at () by Lemma 3. Therefore,
In our notations, is written as
where ) (cf. Section 2.1 for definition of () are two independent samples from , respectively. By Assumption 2, . Using the symmetry and monotonicity of ), defined in Assumption 2,
Therefore,
Taking conditional expectation in (7) and using the definition of
Note that ) is strictly positive on [], by continuity, ) attains its minimum value in the compact support, hence, 0. This completes the proof.
Note that in Proposition 5 depends on reciprocally. The next result lower bounds
the minimum eigenvalue of E
(2019) established a similar result when are not present and our result is based on pairwise measurements which allows for adaptivity to nonparametric (nuisance) parameter in the model and, further, also allows for mismatch in activation functions. These two differences make the proof more involved. We separate results into two cases: (1) ; (2) either
(Case 2) if either is ReLU, then by fixing the first row of (i.e. treating it as known),
Combining the results of Proposition 5 and Lemma 6, we immediately get the following result regarding the local curvature of the population loss at the ground truth.
Theorem 7 (Local curvature) Suppose Assumptions 1 and 2 hold. There exists a constant C > 0, independent of , such that: (Case 1) if
(Case 2) if either is ReLU, then by fixing the first row of
By symmetry one can alternatively fix the first row of in the second case. We realize that the lower bound of population Hessian in Case 2 is smaller than the bound in Case 1. This is due to nonsmoothness and unboundedness of ReLU activation function. In later analysis we will see the sample complexity when using ReLU for either networks will have larger logarithmic factor, while is linear in in both cases.
Combining Theorem 4 and 7, we obtain that () is a local minimizer of the population loss. In order to characterize how the empirical loss behaves near the ground truth, we study its local geometry via the concentration of the Hessian matrix.
4.3 Concentration of the Hessian Matrix
In this section, we characterize the concentration of the Hessian matrix. We show that ()) is sufficient to guarantee that the empirical loss also has positive curvature locally.
Let
and define
By the formula in (6), we have that
The concentration of each term will be built separately in next two lemmas. We let q = 1 if either is ReLU and q = 0 otherwise, and = 1 if both are ReLU and = 0 otherwise.
Lemma 8 Suppose Assumptions 1 and 2 hold. For any
then with probability at least 1
Lemma 9 Suppose Assumptions 1 and 2 hold. For any
then with probability at least 1
Comparing the sample complexity in Lemma 8 and 9, we see that (9) is dominated by (8). Technically, this is because are not present if = ReLU. Combining the above two lemmas and using the inequality that
we immediately obtain the following concentration on the Hessian matrix.
Theorem 10 (Concentration of the Hessian matrix) Suppose Assumptions 1 and 2
where q = 0 for Case 1 and q = 1 for Case 2, then with probability at least 1
Replacing () in the above inequality, one can show that is lower bounded away from zero when is sufficiently large. It turns out this observation is a fundamental condition for establishing local linear convergence rate for gradient descent.
Comparing the above sample complexity with the one established for inductive matrix completion problem (Zhong et al., 2019), our rate improves from are sigmoid or tanh. Moreover, we allow a semiparametric model with two different activation functions, which results in a more involved analysis.
4.4 Local Linear Convergence
The local geometry established for the loss function (3) in previous two subsections allows us to prove the local result: the gradient descent with constant step size converges to the ground truth linearly. For ease of notation, let
be the minimum and maximum eigenvalue of the population Hessian. The explicit lower bound of is provided in Theorem 7, while the upper bound of is provided in the following Lemma 12. We define the local neighborhood of (
with radius satisfying
for a sufficiently small constant . The above radius is determined by the concentration bound of Hessian in Theorem 10, based on which one can show that for any (). We also note that the above radius only depends on true weight matrices and is independent from sample sizes and dimensions. Thus, it will not vanish as dimension increases, provided () scale properly.
In preparation for the convergence analysis, next lemma characterizes the difference ) for any (
Lemma 11 Suppose the conditions of Theorem 10 hold. For any (
with probability at least 1
The next result provides an upper bound on and we then establish the local linear convergence rate.
Lemma 12 Under Assumption 2,
Theorem 13 (Local linear convergence rate) Suppose Assumptions 1 and 2 hold and the initial point (, if the sample sizes satisfies (10), then with probability at least 1 , the iterates in (4) with
where the contraction rate
Based on previous preparation work, the proof of Theorem 13 is standard for gradient descent. For completeness, we present the proof in Section 7.
In next section, we demonstrate the superiority and generality of the proposed representation learning model via extension simulations and real experiments.
We show experimental results on synthetic and real-world data. In the following, we call our model nonlinear semiparametric matrix completion (NSMC). We compare NSMC with the baseline nonlinear inductive matrix completion (NIMC) proposed by Zhong et al. (2019), where they assumed the generative model to be Gaussian and minimized the squared loss. The models obtained by removing non-linear activation functions in NSMC and NIMC are called SMC and IMC, respectively.
5.1 Local Linear Convergence
We verify the local linear convergence of GD on synthetic data sets sampled with ReLU activation functions. We fix = 3. The features , are independently sampled from a Gaussian distribution. We fix 400 and the number of observations m = 2000. We randomly initialize (ground truth () with fixed error in Frobenius norm. In particular, we fix = 1. For the Gaussian model, ). For the binomial model, . For Poisson model, )). To introduce some variations, as well as to verify that our model allows for two separate neural networks, we let = ReLU and ReLU, sigmoid, tanh}. The estimation error during training process is shown in Figure 1, which verifies the linear convergence rate of GD before reaching the local minima.
5.2 Robustness to Model Misspecification
We generate synthetic data with model misspecification and compare the performance of estimators given by NSMC, SMC, NIMC and IMC. We fix = 400, and use ReLU as the activation function for NSMC and NIMC. For NSMC and SMC, we randomly generate two independent sample sets with m = 1000 observations, which are denoted as Ω and Ω. The observed sample set for NIMC and IMC are set to be the union Ω . For NSMC and SMC, we minimize the proposed pseudo-likelihood objective. For NIMC and IMC, we minimize the square loss as suggested by Zhong et al. (2019). We apply gradient descent starting from a random initialization near the ground truth (), in order to guarantee convergence of all methods. We evaluate the estimated
Figure 1: Local linear convergence of gradient descent on synthetic data sets.
Figure 2: Relative Error of NSMC, SMC, NIMC and IMC. The plot shows how relative error of estimations given by each method varies with parameter , which introduces model misspecification in the Gaussian model. We see that NSMC gives accurate and robust estimation, while NIMC suffers from model misspecification. SMC, IMC fail to learn the non-linear embeddings and give unsatisfactory estimations for all
matrix ˆU using the relative approximation error similarly. We also evaluate the performance of a solution ( ˆ) on recovering the parametric component using relative test error where is a newly sampled test data set. For each setting below, we report results averaged over 10 runs.
Gaussian model. We introduce model misspecification by sampling . Parameter is introduced to modify the impact of model mis-specification. We summarize the relative errors in Table 1 and Figure 2. When = 0, there is no model misspecification and NSMC and NIMC achieve comparable relative approxi-
Table 1: Relative error in the Gaussian model.
Table 2: Relative error in the Binomial model.
mation errors. As increases, the relative approximation errors of NIMC grow rapidly due to the increase of model misspecification. However, NSMC gives robust estimations. SMC and IMC serve as bilinear modeling baselines that fail to learn in the nonlinear embedding setting.
) and apply NSMC and SMC with original attributes y. For NIMC and IMC, we first do variance-stabilizing transformation ˜y = arcsin ( ) as the data preprocessing step, inspired by what people might do for non- Gaussian data in practical applications. From Table 2, NSMC achieves the best estimating result in each setting, while other methods fail to learn the embeddings with a binomial model.
Poisson model. We generate )), where the activation functions are = ReLU and ReLU, sigmoid, tanh}. Due to model misspecification, we apply transformation ˜for NIMC and IMC. The activation function of NIMC is set to be the same as . We see from Table 3 that NSMC achieves the best estimating result, while other methods fail to recover the parameters.
5.3 Clustering of Embeddings
We generate synthetic data with clustered embeddings and compare the performance of NSMC and NIMC on learning the true embedding clustering. We fix = 400, and choose tanh as the activation function. We generate features x and z independently from a Gaussian mixture model with four components, resulting in the ground-truth embedding clustering with four components. We sample y from a binomial model with = 20. We fix observed sample size m = 1000 and apply NSMC and NIMC to get the estimated ˆ, respectively. We plot the top 2 left singular vectors (ˆ) for NSMC and NIMC, respectively, where the points are colored according to the ground-truth clustering. We also plot the top 2 left singular vectors () of the ground- truth embeddings ). Similar plots for feature z are shown as well. We see from
Table 3: Relative error in the Poison model.
Figure 3 that NIMC fails to find the ground-truth embeddings due to model misspecification, while NSMC gives robust estimation and recovers the ground-truth embeddings.
To quantitatively evaluate the performance, we apply the k-means clustering to the left singular vectors. We define the clustering error following Zhong et al. (2019) as
where is the ground-truth clustering and is the predicted clustering. As a result, NIMC attains clustering error 0.0596 and 0.1725 for x and z respectively. NSMC achieves a better performance with clustering error 0.0196 and 0.0147 for x and z respectively.
5.4 Semi-supervised Clustering
We further illustrate the superior performance of NSMC over NIMC with real-world data. Following the experimental setting in Zhong et al. (2019), we apply NSMC and NIMC to a semi-supervised clustering problem, where we only have one kind of features, a set of items. The edge attribute = 1, if the i-th item and j-th item are similar, and = 0, if they are dissimilar. To apply NSMC and NIMC, we set and assume . We initialize as the same random Gaussian matrix and apply gradient descent to ensure during training. After training, we apply k-means clustering to the top r left singular vectors of ). We follow Zhong et al. (2019) and again use the clustering error defined by (11). We set the activation function tanh for all data sets. For NSMC, we first uniformly sample two independent sets of items with = 1000. Then we generate independent observation sets Ω and Ωm = 5000. For NIMC, the observed dataset is set to be the union Ω. We consider three datasets: Mushroom, Segment and Covtype (Dua and Graff, 2017), and regard items with the same label as similar (dataset is subsampled first to balance the size of each cluster. As shown in Table 4, for linear separable dataset Mushroom, both NSMC and NIMC achieve perfect clustering. For the other two datasets, NSMC achieves better clustering results than NIMC.
We studied the nonlinear bipartite graph representation learning problem. We formalized the representation learning problem as a statistical parameter estimation problem in a semiparametric model. In particular, the edge attributes, given node features, are assumed to
Figure 3: The comparison of learned embeddings based on NIMC and NSMC, with the ground-truth embeddings. The first row shows embeddings of x, while the second row shows embeddings of z. The points are colored according to the ground-truth clustering.
Table 4: Clustering error on real-world data.
follow an exponential family distribution with unknown base measure. The parametric component of the model is assumed to be the proximity of outputs of one-layer neural network, whose inputs are node representations. In this setting, learning embedding vectors is equivalent to estimating two low-rank weight matrices (). Using the rank-order decomposition technique, we proposed a quasi-likelihood function, and proved that GD with constant step size achieves local linear convergence rate. The sample complexity is linear in dimensions up to a logarithmic factor, which matches existing results in matrix completion. However, our estimator is robust to model misspecification within exponential family due to the adaptivity to the base measure. We also provided numerical simulations and real experiments to corroborate the main theoretical results, which demonstrated superior performance of our method over existing approaches.
One potential extension is to consider a more general distribution for node representations. For example, when node representations follow a heavy-tailed distribution, it is not clear whether we can still recover () with the same convergence rate. In addition, using two-layer or even deep neural networks for encoders in our semiparametric model, while still providing theoretical guarantee is another interesting extension.
In this section, we provide proofs of lemmas in the main text. Auxiliary results are presented in the appendix.
7.1 Proof of Lemma 3
For any pair (denote the rank statistics, and denote the order statistics. We have
Moreover, as shown in (2),
Thus,
which completes the proof.
7.2 Proof of Lemma 6
Let us first introduce additional notations. Suppose QR decomposition of , respectively, with
be the orthogonal complement of . For any vectors ) such that = 1, we express each component by . Further, we let , and also let ¯denote the matrix that replaces the diagonal entries of by 0. Lastly, for i = 1, 2 and variable 1), we define following quantities
Using the above notations,
= 12
where we have used the independence among By Lemma 15, there exists a constant not depending on () such that
For term , let us denote the inside variable as
Using Lemma 21, Assumption 1, and independence among
Combining the above display with (12) and (13), Minimizing over the set
The above display, together with (12) and (13), leads to
Since the first row of is fixed, we minimize over the set 0. Equivalently, the right hand side has the following optimization problem
By Theorem D.6. in Zhong et al. (2018),
Thus,
This completes the proof.
7.3 Proof of Lemma 8
The concentration is shown by taking expectation hierarchically. In particular, we let , where the expectation is over the random sampling of the entries from . Then, we know )]. Moreover,
Using Lemma 17, for any
Using Lemma 19,
Combining the above two displays, using the fact that , and dropping high order terms, we know that, with probability at least 1
Noting that the proof.
7.4 Proof of Lemma 9
The proof is similar to that of Lemma 8. Let
Using Lemma 18 and noting that
Using Lemma 20,
Combining the last two displays, we complete the proof.
7.5 Proof of Lemma 11
Note that
Following the proof of Lemma 17, 18, 19 and 20, we can show that, for any probability 1
and
The proof follows by noting that
7.6 Proof of Lemma 12
By definition in Section 4.3, we have the following decomposition
By (34),
This completes the proof.
7.7 Proof of Theorem 13
Define Υfor sufficiently large constant two points (), if their distance satisfies
and the sample sizes satisfy (which is implied by the condition in Theorem 10)
by Lemma 11 we know
Using this result and letting
Moreover, for any (U, V) in this neighborhood, by Theorem 10, we have
with high probability. Thus, by Weyl’s theorem (Weyl, 1912), we have
and similarly 2. Let us consider doing one-step GD at (U, V). Let
Suppose the continuous line from () is parameterized by
of interval [0
] and have set . Taking the union bound over S,
Furthermore, since Ξ is a net of [0, 1], for any 1] there exists ] such that
Thus, by (16), (17), and Weyl’s theorem, we obtain
With this,
The last inequality is from Theorem 4 and the fact that only contributes higher-order terms by concentration. Let
which completes the proof.
This work is partially supported by the William S. Fishman Faculty Research Fund at the University of Chicago Booth School of Business. This work was completed in part with resources provided by the University of Chicago Research Computing Center.
Appendix A. Complementary Lemmas
In this section, we list intermediate results required for proving lemmas in Section 7. Notations in each lemma are introduced in the proofs of the corresponding lemmas.
(Case 2) if either is ReLU, then
Proof The notations in this proof are inherited from the proof of Lemma 6 in Section 7.2. By the definition of
Therefore,
and
Plugging the expressions of in Lemma 16 into (19), combining with (18), and using the fact that
Trace(¯
we obtain
Based on the above expression, we further provide the lower bound for Var(separate into two cases.
. By symmetry of activation functions, plugging into (20),
Var(
+
where, for
Further, by Stein’s identity (Stein, 1972), . We can also numerically check that 0. Therefore, the above display leads to
is ReLU. Without loss of generality, we assume is ReLU. Then,
plugging into (20),
Var(
+
+
+
Define
Then, we can numerically check
This completes the proof.
Lemma 15 Under conditions of Lemma 6, there exists a constant not depending on
Proof By symmetry, we only show the proof for . By the definition of in (12) and noting that the inner variable has mean zero,
where the third equality is due to the independence among inequality is due to Lemma 21 and Assumption 1. Here, ¯the p-th component of ¯, respectively. Moreover,
Combining with (21),
Lemma 16 Under the setup of Lemma 14, we have
+ 2
and
Proof By symmetry, we only show the proof for can be proved analogously. By the definition of
=
By simple derivations, we let -th entry of
and
Plugging into the formula of
Combining (22), (23), (24) together,
+ 2+ Trace(
Recall from Section 7.2 that ¯2, denotes the matrix that replaces the diagonal entries of by 0. Therefore, the above equation can be further simplified as
+ 2
This completes the proof for can be obtained analogously by changing the role of and . By the definition of
This completes the proof.
Lemma 17 Under conditions of Lemma 8, we let is ReLU and . Then for any
For any two samples (, let us define
where . To ease notations, we suppress the evaluation sample of . We apply Lemma 24 to bound . We first check all conditions of Lemma 24. By Assumption 2 and symmetry of (
For activation functions in {sigmoid, tanh, ReLU}, the last inequality is due to the fact that 1. Note that
thus we further obtain
By Lemma 22,
We bound the second term in (26) similarly and have
We next verify the second condition in Lemma 24. By the symmetry of , we only need bound the following quantity
We only bound as an example. can be bounded in the same sketch. . For any vectors ) such that
By Lemma 23 and we have
. We apply Lemma 26. Let us first define the random matrix
For the condition (a) in Lemma 26, we note that
By Lemma 22, for any constants 1 (in what follows we may keep using such notation, where the first superscript indexes the function we are dealing with; the second superscript indexes the times we have used for this notation),
For the condition (b) in Lemma 26, we apply the inequalities in Lemma 23 and have
For the condition (c) in Lemma 26, we consider the following quantity for any unit vector (a; b):
Combining (29), (30), (31), (32) together and defining
we know conditions in Lemma 26 hold for with parameters (up to constants)
Thus,
In the above inequality, for any constant
By simple calculation, we can let
and further have
Under the conditions of Lemma 17, we combine the above inequality with (29) and have 1. Dealing with in (28) similarly, one can show (29) and the above result hold for as well. So Plugging back into (28), we can define and then conditions of Lemma 24 hold for with parameters ) (defined in (27)) and ). Therefore, we have
For any
and have
The result follows by the definition of Υin (33) and noting that the first term in dominant term.
We apply Lemma 25 to bound . We check all conditions of Lemma 25. Some of steps are similar as above. By definition of
We first bound
where the last inequality is derived similarly to (31). For the condition (a) in Lemma 25, we apply (26) and Lemma 22 (similar to (30)),
For the condition (b) in Lemma 25,
For the condition (c) in Lemma 25,
Thus, conditions of Lemma 25 hold with parameters (up to constants)
Similar to the proof of
and then have
Noting that the first term in is the dominant term, we complete the proof.
Lemma 18 Under conditions of Lemma 9 and the definition of in Lemma 17, we have that for any
For any two samples (
where . We follow the same proof sketch as Lemma 17. We apply Lemma 24 to bound . We first check all conditions of Lemma 24. By Assumption 2,
Here, the third inequality is due to the fact that if is ReLU. Taking union bound over , noting that log(log ()), and applying Lemma 22, for any 1, we define
Υ
and have
Similarly to Lemma 17, we have two steps. . For any vectors ) such that
Using Lemma 23 and maximizing over set
. We still apply Lemma 26. Define the following random matrix
For the condition (a) in Lemma 26, we note that
and we have
For the condition (b) in Lemma 26, let us define
Then,
By simple calculations based on Lemma 23,
Combining the above displays together and maximizing over
For condition (c) in Lemma 26,
Applying Lemma 23,
Thus,
Combining (40), (42), (43), (44), and defining
then conditions in Lemma 26 hold for with parameters
Here, Υare defined in (41), (45), and are any constant. So,
For any
Then, Υ. Noting that , we can let
and then have
Combining the above inequality with (40), back into (39), combine with (38), and know Lemma 24 holds for with parameters . Finally we apply Lemma 24 and obtain that
For any
and have
This completes the proof for the first part. We apply Lemma 25 to bound . We check all conditions of Lemma 25. By definition of
We first bound as follows:
For the condition (a) in Lemma 25, we have shown in (36) that
Thus, similar to (42),
For the condition (b) in Lemma 25,
For the condition (c) in Lemma 25, we use Lemma 23 and obtain
Thus, conditions of Lemma 25 hold for with parameters (up to constants)
and then have
We finish the proof by noting that the first term of is the dominant.
Lemma 19 Under conditions of Lemma 8, we have
Proof By definition of
For
We only bound the first term. The second term has the same bound using the equation ] for any variable independent from x. Note that
We focus on the first term in the above equality. By simple calculations using the boundedness and Lipschitz continuity of
Plugging the above inequality back into (49), dealing with other terms similarly, and applying Lemma 27 by noting
+
+
+
where Υis defined in the same way as Υin (33) but calculated using bound in (48). Since is Lipschitz continuous,
Thus,
For the first term,
For the second term, from (32) we see maxbining with the above two displays, and (50) and (48),
This completes the proof.
Lemma 20 Under conditions of Lemma 9, we have
Proof We follow the same proof sketch as Lemma 19. By definition of
For
where Υhas the same form as Υdefined in (46) but is calculated using we use the Lipschitz continuity of 1/(1 + exp(x)), and simplify analogously to obtain
Combining the above three displays,
We complete the proof.
Appendix B. Auxiliary Results
be a full-column rank matrix. Let , then we have
Lemma 22 (Concentration of quadratic form and norm)
Proof Result in (a) directly comes from the Chernoff bound and Remark 2.3 in Hsu et al. (2012). We use union bound and (a) to prove (b). (c), (d) and (e) are directly from (a) and (b). (f) is from the Chapter 3 in Vershynin (2018). (g) is due to the fact that sub-Gaussian variable.
Lemma 23 (Expectation of product of quadratic form)
Proof Note that
This shows the part (a). (b) can be showed similarly using the H¨older’s inequality twice. For (c),
Here the first inequality is due to the H¨older’s inequality and the second equality is from Lemma 2.2 in Magnus (1978).
a sample set, and let Ω = be a collection of samples of D, where each (is sampled with replacement from D uniformly. Independently, we have another sets For any pair (a matrix following conditions hold with not depending on
then
Proof For any integer k, we define ¯k to be the remainder of k/m such that 1 (i.e. ¯m = m). Then we can express H as
Note that is the sum of m independent samples, and for any have the same distribution with conditional expectation
Therefore,
By the proof of Corollary 6.1.2 in Tropp et al. (2015), the right hand side satisfies
Combining the above two displays and using the fact that for any event A, we finish the proof.
be a sample set with size and each pair (x, z) follows the same distribution F; similarly but independently, let another sample set. Let be a random matrix corresponding to (the following conditions hold with
Proof For simplicity we suppress the evaluation point of ¯
For the first term,
For the third term,
For the second term, without loss of generality, we assume . For any integer k, we let , where integer 0 and remainder ¯k satisfies 1 . We also let , where integer satisfies 1
Based on this decomposition, we see that ¯is a sum of i.i.d. random matrices and that have the same distribution. Similar to the proof of Lemma 24, we have
We apply Corollary 6.1.2 in Tropp et al. (2015). Note that
A similar bound holds for . Therefore
Putting everything together finishes the proof.
random matrix corresponding to (the following conditions hold with
Proof The result follows directly from Lemma 25.
where is ReLU and q = 0 otherwise.
Proof By H¨older’s inequality,
If , we finish the proof by using the Lipschitz continuity of Lemma 23. If is ReLU, we apply Lemma E.17 in Zhong et al. (2018) to complete the proof.
J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.
A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web, pages 37–48. ACM, 2013.
M. Al Hasan and M. J. Zaki. A survey of link prediction in social networks. In Social network data analytics, pages 243–275. Springer, 2011.
Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder- decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
N. Baldin and Q. Berthet. Optimal link prediction with matrix logistic regression. arXiv preprint arXiv:1803.07054, 2018.
M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.
M. G. Bell and Y. Iida. Transportation network analysis. 1997.
Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828, 2013.
R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.
S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115–148. Springer, 2011.
H. Bunke and B. T. Messmer. Efficient attributed graph matching and its application to image analysis. In International Conference on Image Analysis and Processing, pages 44–55. Springer, 1995.
Y. Cao and Q. Gu. Tight sample complexity of learning one-hidden-layer convolutional neural networks. In Advances in Neural Information Processing Systems, pages 10611– 10621, 2019.
S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, and E. Cambria. Learning community embedding with community detection and node embedding on graphs. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 377–386. ACM, 2017.
Y. Chen, Z. Yang, Y. Xie, and Z. Wang. Contrastive learning from pairwise measurements. In Advances in Neural Information Processing Systems, pages 10909–10918, 2018.
Y. Chen and E. J. Cand`es. The projected power method: An efficient algorithm for joint alignment from pairwise differences. Communications on Pure and Applied Mathematics, 71(8):1648–1714, 2018.
Y. Chen and A. J. Goldsmith. Information recovery from pairwise measurements. In 2014 IEEE International Symposium on Information Theory, pages 2012–2016. IEEE, 2014.
Y. Chen and C. Suh. Spectral mle: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380, 2015.
Y. Chen, G. Kamath, C. Suh, and D. Tse. Community recovery in graphs with locality. In International Conference on Machine Learning, pages 689–698, 2016.
Y. Chen, J. Fan, C. Ma, K. Wang, et al. Spectral method and regularized mle are both optimal for top-k ranking. The Annals of Statistics, 47(4):2204–2235, 2019.
K. Cho, B. Van Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden- layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
D. Dua and C. Graff. UCI machine learning repository, 2017.
J. Fan, H. Liu, Y. Ning, and H. Zou. High dimensional semiparametric latent graphical model for mixed data. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(2):405–421, 2017.
M. R. Fengler. Semiparametric modeling of implied volatility. Springer Finance. SpringerVerlag, Berlin, 2005.
C. Fiorio. A topologically consistent representation for image analysis: the frontiers topo- logical graph. In International Conference on Discrete Geometry for Computer Imagery, pages 151–162. Springer, 1996.
S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
B. K. Fosdick and P. D. Hoff. Testing and modeling dependencies between a network and nodal attributes. J. Amer. Statist. Assoc., 110(511):1047–1056, 2015.
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
T. Graepel, M. Goutrie, M. Kr¨uger, and R. Herbrich. Learning on graphs in the game of go. In International Conference on Artificial Neural Networks, pages 347–352. Springer, 2001.
A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017a.
W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arxiv: 1709.05584, 2017b, arXiv:1709.05584v3.
D. J. Higham, M. Raˇsajski, and N. Prˇzulj. Fitting a geometric graph to a protein–protein interaction network. Bioinformatics, 24(8):1093–1099, 2008.
D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17(52):1–6, 2012.
P. Jain and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.
D. Jannach, P. Resnick, A. Tuzhilin, and M. Zanker. Recommender systemsbeyond matrix completion. Communications of the ACM, 59(11):94–102, 2016.
Z. Kang, C. Peng, and Q. Cheng. Top-n recommender system via matrix completion. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
R. Li and H. Liang. Variable selection in semiparametric regression modeling. Ann. Stat., 36(1):261–286, 2008.
Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
M. Ma, S. Na, C. Xu, and X. Fan. The graph-based broad behavior-aware recommendation system for interactive news. arXiv preprint arXiv:1812.00002, 2018.
Z. Ma, Z. Ma, and H. Yuan. Universal latent space model fitting for large networks with edge covariates. Journal of Machine Learning Research (to appear), 2019.
J. R. Magnus. The moments of products of quadratic forms in normal variables. Statist. Neerlandica, 32(4):201–210, 1978.
A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011.
S. Na, M. Kolar, and O. Koyejo. Estimating differential latent variable graphical models with applications to brain connectivity. arXiv preprint arXiv:1909.05892, 2019.
M. Nassar. Hierarchical bipartite graph convolution networks. arXiv preprint arXiv:1812.03813, 2018.
S. Negahban, S. Oh, K. K. Thekumparampil, and J. Xu. Learning from comparisons and choices. The Journal of Machine Learning Research, 19(1):1478–1572, 2018.
Y. Ning, T. Zhao, and H. Liu. A likelihood ratio framework for high-dimensional semipara- metric regression. Ann. Statist., 45(6):2299–2327, 2017.
A. Pananjady, C. Mao, V. Muthukumar, M. J. Wainwright, and T. A. Courtade. Worst- case vs average-case design for estimation from fixed pairwise comparisons. arXiv preprint arXiv:1707.06217, 2017.
D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi. Finding low-rank solutions via non- convex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4):2165–2204, 2018.
B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representa- tions. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 459– 467. ACM, 2018.
L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
J. Scott. Social network analysis. Sociology, 22(1):109–127, 1988.
S. Si, K.-Y. Chiang, C.-J. Hsieh, N. Rao, and I. S. Dhillon. Goal-directed inductive matrix completion. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM, 2016.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. pages 583–602, 1972.
B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Advances in neural information processing systems, pages 659–666, 2004.
J. A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and , 8(1-2):1–230, 2015.
R. Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. An introduction with applications in data science, With a foreword by Sara van de Geer.
H. Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differen- tialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Math. Ann., 71(4):441–479, 1912.
Y. Wu, H. Liu, and Y. Yang. Graph convolutional matrix completion for bipartite edge prediction. In KDIR, pages 49–58, 2018.
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang. Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
X. Yi, D. Park, Y. Chen, and C. Caramanis. Fast algorithms for robust PCA via gradient descent. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4152–4160. Curran Associates, Inc., 2016.
H. Zha, X. He, C. Ding, H. Simon, and M. Gu. Bipartite graph partitioning and data clus- tering. In Proceedings of the tenth international conference on Information and knowledge management, pages 25–32, 2001.
K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4140–4149. JMLR. org, 2017.
K. Zhong, Z. Song, P. Jain, and I. S. Dhillon. Nonlinear inductive matrix completion based on one-layer neural networks. arXiv preprint arXiv:1805.10477, 2018.
K. Zhong, Z. Song, P. Jain, and I. S. Dhillon. Provable non-linear inductive matrix comple- tion. In Advances in Neural Information Processing Systems, pages 11435–11445, 2019.
J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.