b

DiscoverSearch
About
My stuff
Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees
2020·arXiv
Abstract
Abstract

Graph representation learning is a ubiquitous task in machine learning where the goal is to embed each vertex into a low-dimensional vector space. We consider the bipartite graph and formalize its representation learning problem as a statistical estimation problem of parameters in a semiparametric exponential family distribution. The bipartite graph is assumed to be generated by a semiparametric exponential family distribution, whose parametric component is given by the proximity of outputs of two one-layer neural networks, while nonparametric (nuisance) component is the base measure. Neural networks take high-dimensional features as inputs and output embedding vectors. In this setting, the representation learning problem is equivalent to recovering the weight matrices. The main challenges of estimation arise from the nonlinearity of activation functions and the nonparametric nuisance component of the distribution. To overcome these challenges, we propose a pseudo-likelihood objective based on the rank-order decomposition technique and focus on its local geometry. We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves linear convergence rate. Moreover, we prove that the sample complexity of the problem is linear in dimensions (up to logarithmic factors), which is consistent with parametric Gaussian models. However, our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.

Keywords: bipartite graph, nonconvex optimization, representation learning, semiparametric estimation

image

Graphs naturally arise as models in a variety of applications, ranging from social networks (Scott, 1988) and molecular biology (Higham et al., 2008) to recommendation systems (Ma et al., 2018) and transportation (Bell and Iida, 1997). In a variety of problems, graphs tend to be high-dimensional and highly entangled, and hence difficult to directly learn from. As a prominent remedy, graph representation learning aims to learn a mapping that represents each vertex as low-dimensional vector such that structural properties of the original graph are preserved. Those learned low-dimensional representations, also called embeddings, are further used as the input features in downstream machine learning tasks, such as link prediction (Taskar et al., 2004; Al Hasan and Zaki, 2011), node classification (Bhagat et al., 2011), and community detection (Fortunato, 2010).

There are three major approaches to graph embedding: matrix factorization-based algorithms (Belkin and Niyogi, 2002; Ahmed et al., 2013), random walk algorithms (Perozzi et al., 2014; Grover and Leskovec, 2016), and graph neural networks (Scarselli et al., 2008; Zhou et al., 2018; Wu et al., 2019). These approaches can be unified via the encoder-decoder framework proposed in Hamilton et al. (2017b). In this framework, the encoder is a mapping that projects each vertex or a subgraph to a low-dimensional vector, whereas the decoder is a probability model that infers the structural information of the graph from the embeddings generated by the encoder. The structural information here depends on the specific downstream tasks of interest, which also determine the loss function of the decoder. The desired graph representations are hence obtained by minimizing the loss function as a function of embedding vectors. For example, in the link prediction task, the decoder predicts whether an edge between two vertices exists or not using a Bernoulli model and logistic loss function, and the model parameter is a function of embeddings (Baldin and Berthet, 2018).

Such an encoder-decoder architecture motivates the study of graph representation learning through the lens of statistical estimation for generative models. In particular, suppose the observed graph is generated by a statistical model specified by the decoder with true graph representations as its inputs. We can then assess the performance of a graph embedding algorithm by examining the difference between the learned representation and the ground truth. Baldin and Berthet (2018) adopted this perspective to study the performance of a linear embedding method for the link prediction problem. The validity of their results hinges on the condition that both the linear model of the encoder and the Bernoulli model of the decoder are correctly specified. When either of these assumptions are violated, they would incur large estimation error. Recent advances in graph representation learning are attributed to more flexible decoders (Cho et al., 2014; Goodfellow et al., 2016; Badri- narayanan et al., 2017), which are based on deep neural networks and can handle graphs with edge attributes that can be categorical. These approaches are poorly understood from a theoretical point of view.

In the present paper, we focus on bipartite graphs, where there are two distinct sets of vertices, U and V , and only edges between two vertices in different sets are allowed. We study the semiparametric nonlinear bipartite graph representation learning problem under the encoder-decoder framework. We assume that each vertex  u ∈ Uis associated with a high-dimensional Gaussian vector  xu ∈ Rd1. Similarly, each vertex  v ∈ Vis associated with a high-dimensional Gaussian vector  zv ∈ Rd2. The encoder maps them via one-layer neural networks to low-dimensional vectors  φ1(U⋆T xu), φ2(V⋆T zv) ∈ Rr, where U⋆ ∈ Rd1×r,V⋆ ∈ Rd2×r are weight matrices,  {φi}i=1,2are activation functions evaluated entrywise, and r ≪ (d1 ∧ d2). Furthermore, in the decoder, we consider the link prediction task under a semiparametric model. In particular, we assume that the attribute of an edge follows a natural exponential family distribution parameterized by the proximity between two vertices, which is defined as the inner product  ⟨φ1(U⋆T xu), φ2(V⋆T zv)⟩between the embedding vectors. Here,  φ1(U⋆T xu) is the embedding vector of  u, while φ2(V⋆T zv) is the embedding vector of v. We do not specify the base measure of the exponential family distribution but, instead, treat it as a nuisance parameter. This gives us a semiparametric model for the decoder and robustness to model misspecification within the exponential family.

In the above described semiparametric nonlinear model, our goal is to recover weight matrices  U⋆ and V⋆. Based on these weight matrices, we can then compute embeddings for all vertices. There are two main obstacles that make the estimation problem challenging. First, while the activation functions  {φi}i=1,2make the encoder model more flexible, their nonlinearity leads to a loss function that is nonconvex and nonsmooth. Second, while the unknown nonparametric nuisance component of the decoder model makes the graph representation learning robust to the model misspecification, it also makes the likelihood function not available. To overcome these obstacles, we propose a pseudo-likelihood objective, which is minimized at (U⋆, V⋆) locally. We analyze the landscape of the empirical objective and show that, in a neighborhood around the ground truth, the objective is strongly convex. Therefore, the vanilla gradient descent (GD) achieves linear convergence rate. Moreover, we prove that the sample complexity is linear in dimensions  d1 ∨ d2, up to logarithmic factors, which matches the best known result under the parametric model (Zhong et al., 2019). Experiments on synthetic and real data corroborate our theoretical results and illustrate flexibility of the proposed representation learning model.

Notations. For any positive integer n, [n] = {1, 2, . . . , n} denotes the index set, and Unif([n]) is a uniform sampling over the indices. We write  a ≲ b if a ≤ c · b for someconstant  c, and a ≍ b if a ≲ b and b ≲ a. We define δij = 111i=j, which equals to 1 if i = j and 0 otherwise. For any matrix U, vec(U) denotes the column vector obtained by vectorizing  U and ∥U∥p,q = (�j(�i |Uij|p)q/p)1/q. As usual,  ∥U∥F , ∥U∥2refer to the Frobenius and operator norm, respectively, and  σp(U) denotes the p-th singular value of U. For a square matrix  U, diag(U) = (U11; U22; . . .) is a vector including all diagonal entries of U; when U is symmetric,  λmax(U) (λmin(U)) denotes its maximum (minimum) eigenvalue. We write  A ⪰ B if A − Bis positive semidefinite and  A ≻ Bif it is positive definite. For any vector  a, ∥a∥min = mini |ai|is the minimal absolute value of its entries.

Structure of the paper. In Section 2, we formalize the semiparametric graph representation learning problem and introduce related work. In Section 3, we present our estimation method by proposing a pseudo-likelihood objective, and the theoretical analysis of such objective is provided in Section 4. In Section 5 we show experimental results and conclusions are summarized in Section 6. Section 7 provides proofs of main technical results, while the proofs of auxiliary results are given in the appendix.

We describe the setup of our problem and introduce the applications and related work. We particularly focus on the statistical literature on theory of semiparametric estimation and matrix completion, although bipartite graph representation learning has been routinely applied to varied deep neural networks (Nassar, 2018; Wu et al., 2018). We point reader to Zha et al. (2001) for a survey on bipartite graph.

2.1 Problem formulation

Let G = (U, V, E) be a bipartite graph where U and V are two sets of vertices and E denotes the set of edges between two vertex sets. For each vertex  u ∈ U, we assume it is associated with a Gaussian vector  xu ∈ Rd1, while for each  v ∈ Vwe have a Gaussian vector  zv ∈ Rd2.An edge between u and v has an attribute  y(u,v)that follows the following semiparametric exponential family model

image

which is parameterized by the base measure function  f : R → Rand a scalar

image

In model (1),  b(·, ·) is the log-partition function (or normalizing function) that makes the density have unit integral. The parametric component of the exponential family,  Θ⋆(u,v) =Θ⋆(xu, zv), depends on the covariate  xucoming from the set U and the covariate  zv comingfrom the set V . The nonparametric component f is treated as a nuisance parameter, which gives us flexibility in modeling the edge attributes. To make notation concise, we will drop the subscript of  xu and zvhereinafter, and use x and z to denote covariates from set U and V , respectively. In our analysis, the activation functions  {φi}i=1,2have one of the following three forms:

image

We formalize the bipartite graph representation learning as a statistical parameter estimation problem of a generative model. In particular, suppose the graph is generated by the exponential family model (1) with some unknown base measure f, and we observe part of edge attributes, y, and associated covariates on two ends, x and z. Thus, we obtain data set  {(yij, xi, zj)}i,j where i, jindex the vertices of two sets. The graph representation learning in our setup is then equivalent to recovering  U⋆ ∈ Rd1×r and V⋆ ∈ Rd2×r, which can be used to compute parametric component of the decoder model and estimate embedding vectors,  φ1( ˆUT x) and φ2( ˆVT z), for all vertices in two sets, since activation functions are user-chosen and known.

2.2 Applications and related work

Graph representation learning underlies a number of real world problems, including object recognition in image analysis (Bunke and Messmer, 1995; Fiorio, 1996), community detection in social science (Perozzi et al., 2014; Cavallari et al., 2017), and recommendation systems in machine learning (Kang et al., 2016; Jannach et al., 2016). See Bengio et al. (2013), Hamilton et al. (2017a), Hamilton et al. (2017b) for recent surveys and other applications. The bipartite graph is of particular interest since it classifies vertices into two types, which extensively appears in modern applications.

For concreteness, in user-item recommendation systems, the attribute of an edge between a user node and an item node represents the rating, which is modeled by the proximity of projected features onto the latent space. Specifically, each user is represented by a high-dimensional feature vector x and each item is represented by a high-dimensional feature vector z. A simple generative model for the rating y that a user gives to an item is y = ⟨U⋆T x, V⋆T z⟩ + ϵ with ϵ ∼ N(0,1) independent from x, z. Such a model is studied in the inductive matrix completion (IMC) literature (Abernethy et al., 2006; Jain and Dhillon, 2013; Si et al., 2016; Berg et al., 2017). Zhong et al. (2019) studied nonlinear IMC problem, where a generalized model for the rating is  y = ⟨φ(U⋆T x), φ(V⋆T z)⟩ + ϵ, with φ(·) being acommon activation function. In this generalized nonlinear model, one-layer neural network compresses the high-dimensional features into low-dimensional embeddings. Zhong et al. (2019) proposed to minimize the squared loss to recover weight matrices  U⋆ and V⋆, andestablished consistency for their minimizer, with linear sample complexity in dimension d1 ∨ d2, up to logarithmic factors.

Our work contributes to this line of research by enhancing the IMC model from two aspects. First, we allow for two separate neural networks to embed user and item covariates. Although this modification may seem minor, it makes theoretical analysis more challenging when two networks mismatch: one network has a smooth activation function while the other does not. Second, we consider an exponential family model with unknown base measure, which extends the applicability of the model and allows for model misspecification within the exponential family. In particular, the semiparametric setup makes our estimator independent of the specific form of f. For example, the model in Zhong et al. (2019) is a special case of (1) with  f(y) = exp(−y2/2), while the link prediction problem in Liben- Nowell and Kleinberg (2007) and Menon and Elkan (2011) is a special case with f(y) = 1.

Furthermore, our work contributes to the literature on graph embedding (Qiu et al., 2018; Goyal and Ferrara, 2018). Our paper studies the bipartite graph and casts the graph representation learning as the problem of parameter estimation in a generative model. This setup allows us to analyze statistical properties, such as consistency and convergence rate, of the learned embedding features. To the best of our knowledge, statistical view of representation learning is missing although it was successfully used in real experiments (see, e.g., Graepel et al., 2001; Yang et al., 2015). In addition, our work also contributes to a growing literature on semiparametric modeling (Fengler, 2005; Li and Liang, 2008; Fan et al., 2017), where the parametric component in (1) is given by  Θ⋆ = β⋆T xand the goal is to estimate  β⋆ by regressing y on x, without knowing f. Fosdick and Hoff (2015) formalized the representation learning as a latent space network model, where the parameter  Θ⋆ isgiven by the inner product of two latent vectors and  f(y) = exp(−y2/2), that is under a Gaussian noise setup, and proposed methodology for testing the dependence between nodal attributes and latent factors. Ma et al. (2019) studied a similar model with f(y) = 1 and proposed both convex and nonconvex approaches to recover latent factors. However, our work is more challenging due to the nonlinearity of activation functions and the missing knowledge of f.

Lastly, several estimation methods for pairwise measurements have been studied in related, but simpler, models (Chen and Goldsmith, 2014; Chen and Suh, 2015; Chen et al., 2016; Pananjady et al., 2017; Negahban et al., 2018; Chen and Cand`es, 2018; Chen et al., 2019). Chen et al. (2018) studied model (1) by assuming the parameter matrix of the graph to be low-rank, and estimated  Θ⋆ as a whole. As a comparison, our model is more complicated since each entry of  Θ⋆ in our setup is given by the inner product of two embedding vectors, which measures the proximity of two vertices. Our task is to recover two underlying weight matrices  U⋆, V⋆ that are convolved by activation functions to generate  Θ⋆.

We propose a pseudo-likelihood objective function to estimate the unknown weight matrices and discuss identifiability of the parameters. The objective function is minimized by the gradient descent with a constant step size. Theoretical analysis of the iterates is provided in Section 4.

The likelihood of the model (1) is not available due to the presence of the infinite-dimensional nuisance parameter f. Using the rank-order decomposition technique (Ning et al., 2017), we focus on the pairwise differences and develop a pseudo-likelihood objective. Importantly, the differential pseudo-likelihood does not depend on f and, as a result, our estimator is valid for a wide range of distributions, without having to explicitly specify them in advance.

We follow the setup described in Section 2.1. To simplify the presentation, suppose we have 2n1vertices in  U and 2n2vertices in V , denoted by  U = {u1, . . . , un1, u′1, . . . , u′n1}and  V = {v1, . . . , vn2, v′1, . . . , v′n2}, respectively. For  i ∈ [n1] and j ∈ [n2], we let xi =xui, x′i = xu′i, zj = zvj, z′j = zv′j, and suppose that  xi, x′i i.i.d∼ N(0, Id1) and zj, z′j i.i.d∼N(0, Id2), independent of each other. Further, we assume to observe m edge attributes, y, between vertices  {u1, . . . , un1} and {v1, . . . , vn2}, and another m edge attributes,  y′, between{u′1, . . . , u′n1} and {v′1, . . . , v′n2}, both of which follow the distribution in (1) and are sampled with replacement from the set of all possible  n1n2edges. We note that the sampling setup is commonly adopted in the literature on partially observed graphs and matrix completion problems (Zhong et al., 2019; Chen et al., 2018), which is equivalent to assuming edges on a graph are missing at random.

Denote sample sets

image

where  u(k), u′(l) = Unif([n1]) and v(k), v′(l) = Unif([n2]). While the observations within Ω or Ω′ are not independent, as they may have common features x or z, the observations between Ω and Ω′ are independent. Such two independent sets of samples are obtained by sample splitting in practice. We stress that the sample splitting setup in our paper is used only to make the analysis concise without enhancing the order of sample complexity. In particular, it does not help us avoid the main difficulties of the problem.

Based on samples Ω and Ω′, we consider  m2 pairwise differences and construct an empirical loss function. For  k ∈ [m], let k1 = u(k), k2 = v(k), and

image

denote the true parameter associated with the k-th sample (similarly for  Θ⋆′l1l2). Note that Θ⋆k1k2 is the underlying parametric component of the model that generates  yk = yk1,k2. Thekey idea in constructing the pseudo-likelihood objective is to use rank-order decomposition to extract a factor, that is independently from the base measure. Given a pair of independent samples,  yk and y′l, we denote their order statistics as  y(·)and rank statistics as R. Then we know  y(·) = (yk, y′l) or y(·) = (y′l, yk), and R = (1, 2) or R = (2,1). Thus, (y(·), R) fullycharacterizes the pair (yk, y′l), and is hence a sufficient statistics. Note that

image

The first term is the density of the rank statistics given order statistics, which is only a function of unknown weight matrices  U⋆ and V⋆. The second term is the density of order statistics, which relies on the specific base measure f. Thus, we omit the second term and sum over all  m2 paired samples for the first term to arrive at the following objective

image

The above loss function is similar to the logistic loss for the pairwise measurements. However, it is nonconvex in both components even for identity activation functions. When feature vectors x, z follow the multinomial distribution and activation functions  {φi}2i=1 arenot present, Chen et al. (2018) estimated the rank-r matrix U⋆V⋆T as a whole by minimizing (3) with an additional nuclear norm penalty. Our goal is to recover both components  U⋆,V⋆, in the presence of nonlinear activation functions, resulting in a challenging nonconvex optimization problem.

3.1 Gradient Descent

We propose to minimize loss function (3) using the gradient descent with a constant step size. The iteration is given by

image

For future references, we provide explicit formulas of the gradient and the Hessian for loss (3). We introduce some definitions beforehand. Let us denote each column of weight matrices as  U = (u1, . . . , ur) and V = (v1, . . . , vr) (similar for  U⋆, V⋆).To simplify notations, for a sequence of vectors  a1, . . . , an, we let (ai)ni=1 = (a1; . . . ; an) be the long vector by stacking them up; for a sequence of matrices  A1, . . . , An, we let diag�(Ai)ni=1�bethe block diagonal matrix with each block being specified by  Aisequentially. Moreover, we define the following quantities:  ∀k, l ∈ [m] and ∀i ∈ [r],

image

The quantities on the left part are vectors or matrices calculated by using samples in Ω, which is indexed by k, while the quantities on the right part are calculated by using samples in Ω′, which is indexed by l. We should mention that  φ′i, φ′′i are the first derivative and the second derivative of the activation function  φi (if φiis ReLU then  φ′′i = 0), while superscript of  x′l1 (and z′l2) means the sample is from Ω′(i.e. the sample index l is always used with superscript (·)′). In addition, we define two scalars as

Akl =(yk − y′l)2 · exp�(yk − y′l)(Θk1k2 − Θ′l1l2)��1 + exp�(yk − y′l)(Θk1k2 − Θ′l1l2)��2 , Bkl = yk − y′l1 + exp�(yk − y′l)(Θk1k2 − Θ′l1l2)�.

With above definitions and by simple calculations, one can show the gradient is given by

∇UL(U, V) =�∂L(U, V)∂u1 , . . . , ∂L(U, V)∂ur

image

∇VL(U, V) =�∂L(U, V)∂v1 , . . . , ∂L(U, V)∂vr

(5) Furthermore,  ∀i, j ∈ [r], one can show

image

To combine all blocks and form the Hessian matrix, we will vectorize weight matrices and further define long vectors  dk = (dki)ri=1, pk = (pki)ri=1, d′l = (d′li)ri=1, p′l = (p′li)ri=1, andblock diagonal matrices  Qk = diag ((Qki)ri=1), Rk = diag ((Rki)ri=1), Sk = diag ((Ski)ri=1)(similar for  Q′l, R′l, S′l). Then, the Hessian matrix  ∇2L(U, V) ∈ Rr(d1+d2)×r(d1+d2) is

image

3.2 Identifiability

In general, the weight matrices in loss function (3) are not identifiable as the function is bilinear in U, V. For example, when both activation functions are identity,  L(UQ, V(QT )−1)and L(U, V) have the same value for any invertible matrix  Q ∈ Rr×r, which makes the Hessian at (U⋆, V⋆) indefinite. Similarly, for ReLU activation, this phenomenon reappears by letting Q be any diagonal matrix with positive entries. To resolve this issue, one can use a penalty function  ∥UT U − VT V∥2F to balance two components U and V (Yi et al., 2016; Park et al., 2018; Na et al., 2019). Fortunately, in our problem, the identifiability issue disappears when a smooth nonlinear activation is used, such as sigmoid or tanh, although their nonconvexity brings other challenges.

We stress that, different from over-parameterized problems in neural networks (Sagun et al., 2017; Li and Liang, 2018; Allen-Zhu et al., 2018), the identifiability issue comes from the redundancy of parameters, which is also observed in inductive matrix completion problem (Zhong et al., 2019). Zhong et al. (2019) showed that by fixing the first row of  U⋆,both components are recoverable from the square loss even with ReLU activation. In our problem, when either one of activation functions is ReLU, we use a similar restriction on U⋆ and show that the loss in (3) has positive definite Hessian at (U⋆, V⋆), without adding any penalties.

In this section, we will show that the ground truth (U⋆, V⋆) is a stationary point of the loss (3) and then show that the loss is strongly convex in its neighborhood. Using these two observations, we further establish the local linear convergence rate for iterates in (4). Since the radius of the neighborhood is fixed in terms of (U⋆, V⋆), a wart-start initialization can be obtained by a third-order tensor method (see, e.g., Zhong et al., 2017, 2019). In our simulations, due to high computational cost of a tensor method, we recommend a random initialization (Du et al., 2017; Cao and Gu, 2019).

4.1 Assumptions

We require two assumptions to establish our main results. The first assumption fixes the scale of weight matrices.

Assumption 1 The weight matrices  U⋆, V⋆ have rank rand satisfy  σr(U⋆) = σr(V⋆) = 1.

The second assumption imposes a mild regularity condition.

image

complete subgraphs (the edges between  xi and z′j, and zj and x′i are ignored). We assume

(a) (boundedness): There exist  α, β > 0such that, for any sample (y, x, z) ∈ D ∪ D′, wehave  |Θ⋆| = |⟨φ1(U⋆T x), φ2(V⋆T z)⟩| ≤ α and |y| ≤ β;

image

where  ψ(x) = exp(x)/(1+exp(x))2, and assume  Mα(Θ⋆, Θ⋆′)is a continuous, positive two-dimensional function.

Assumption 2 is widely assumed in the analysis of logistic loss function (Chen et al., 2018). In particular, Assumption 2(a) restricts the parametric component  Θ⋆ into a compact set, which controls the range of proximity between two connected nodes. Intuitively, larger  αimplies a harder estimation problem. We also add boundedness condition on the response y for simplicity. It can be replaced by assuming y to be subexponential (Ning et al., 2017). Boundedness holds deterministically for some distribution in exponential family, such as Bernoulli and Beta, and holds with high probability for a wide range of exponential family distributions, though  βmay depend on the sample size  n1 and n2. Assumption 2(b) is the regularity condition, which plays the key role when showing the strong convexity of the population loss at the ground truth. It can be shown to hold for all exponential family distributions with bounded support, and for some unbounded distributions, such as Gaussian and Poisson.

4.2 Properties of the Population Loss

With the above assumptions, our first result shows that the gradient of the population loss at (U⋆, V⋆) is zero. For all quantities defined in Section 3.1, we add superscript (·)⋆to denote the underlying true quantities, which are obtained by replacing U, V with true weight matrices  U⋆, V⋆. For example, we have  A⋆kl, B⋆kl, d⋆ki, p⋆ki, Q⋆ki, R⋆ki, S⋆ki.

The following lemma shows that the conditional expectation of  B⋆kl given all covariates associated to two end vertices is zero.

Lemma 3 For any k, l ∈ [m], we have that the conditional expectation given all covariates E�B⋆kl | xk1, zk2, x′l1, z′l2�= 0.

Since  B⋆kl is a common factor of the gradients  ∇UL(U⋆, V⋆) and ∇VL(U⋆, V⋆), as shown in (5), and vectors  d⋆ki, d′⋆li, p⋆ki, p′⋆li only depend on covariates, one can first take conditional expectation given covariates and show the following result.

Theorem 4 The loss (3) satisfies E [∇L(U⋆, V⋆)] = 0.

Proof We take E[∇UL(U⋆, V⋆)] as an example, while  E[∇VL(U⋆, V⋆)] can be shown similarly. By the formula in (5),  ∀i ∈ [r], we have

image

image

where, for the second term from the end, the outer expectation is taken over randomness in sampling and all covariate, and the last equality is due to Lemma 3. Doing same derivation for each column and we obtain  E [∇UL(U⋆, V⋆)] = 0. Similarly  E [∇VL(U⋆, V⋆)] = 0.

We then study the local curvature of the population loss at (U⋆, V⋆), which is obtained in the next two steps. We simplify the notation further by dropping the subscripts of sample index. We let  A, B, d, q, d′, q′, . . ., and their corresponding (·)⋆ version, denote general references of corresponding quantities, which may be computed by using any samples in D and D′. We stress that all samples in  D and D′ have the same distribution, so that d1, . . . , dm ∼ d, p1, . . . , pm ∼ p, with d and d′, and p and p′ independent from each other.

Proposition 5 Suppose Assumptions 1 and 2 hold. Define

image

then we have  γα > 0 and

image

Proof Recall the formula for the Hessian matrix in (6). The second term has zero expectation at (U⋆, V⋆) by Lemma 3. Therefore,

image

In our notations,  A⋆ is written as

image

where  Θ⋆ = Θ⋆(x, z), Θ⋆′ = Θ⋆(x′, z′) (cf. Section 2.1 for definition of  Θ⋆(x, z)), and(y, x, z) and (y′, x′, z′) are two independent samples from  D and D′, respectively. By Assumption 2,  |Θ⋆| ∨ |Θ⋆′| ≤ α. Thus, |(y − y′)(Θ⋆ − Θ⋆′)| ≤ 2α|y − y′|. Using the symmetry and monotonicity of  ψ(x), defined in Assumption 2,

image

Therefore,

image

Taking conditional expectation in (7) and using the definition of  γα,

image

Note that  |Θ⋆|∨|Θ⋆′| ≤ α and Mα(·, ·) is strictly positive on [−α, α]×[−α, α], by continuity, Mα(·, ·) attains its minimum value in the compact support, hence,  γα >0. This completes the proof.

Note that  γαin Proposition 5 depends on  αreciprocally. The next result lower bounds

image

the minimum eigenvalue of E

(2019) established a similar result when  d⋆′, p⋆′ are not present and  φ1 = φ2. However,our result is based on pairwise measurements which allows for adaptivity to nonparametric (nuisance) parameter in the model and, further, also allows for mismatch in activation functions. These two differences make the proof more involved. We separate results into two cases: (1)  φ1, φ2 ∈ {sigmoid, tanh}; (2) either  φ1 or φ2 is ReLU.

image

(Case 2) if either  φ1 or φ2is ReLU, then by fixing the first row of  U⋆ (i.e. treating it as known),

image

Combining the results of Proposition 5 and Lemma 6, we immediately get the following result regarding the local curvature of the population loss at the ground truth.

Theorem 7 (Local curvature) Suppose Assumptions 1 and 2 hold. There exists a constant C > 0, independent of  U⋆ and V⋆, such that: (Case 1) if  φ1, φ2 ∈ {sigmoid, tanh}, then

image

(Case 2) if either  φ1 or φ2is ReLU, then by fixing the first row of  U⋆,

image

By symmetry one can alternatively fix the first row of  V⋆ in the second case. We realize that the lower bound of population Hessian in Case 2 is smaller than the bound in Case 1. This is due to nonsmoothness and unboundedness of ReLU activation function. In later analysis we will see the sample complexity when using ReLU for either networks will have larger logarithmic factor, while is linear in  d1 ∨ d2in both cases.

Combining Theorem 4 and 7, we obtain that (U⋆, V⋆) is a local minimizer of the population loss. In order to characterize how the empirical loss behaves near the ground truth, we study its local geometry via the concentration of the Hessian matrix.

4.3 Concentration of the Hessian Matrix

In this section, we characterize the concentration of the Hessian matrix. We show that (m ∧ n1 ∧ n2) ≳ (d1 ∨ d2)poly(log(d1 + d2)) is sufficient to guarantee that the empirical loss also has positive curvature locally.

Let

image

and define

image

By the formula in (6), we have that

image

The concentration of each term will be built separately in next two lemmas. We let q = 1 if either  φiis ReLU and q = 0 otherwise, and  q′ = 1 if both  φiare ReLU and  q′ = 0 otherwise.

Lemma 8 Suppose Assumptions 1 and 2 hold. For any  s ≥ 1, if

image

then with probability at least 1  − 1/(d1 + d2)s,

image

Lemma 9 Suppose Assumptions 1 and 2 hold. For any  s ≥ 1, if

image

then with probability at least 1  − 1/(d1 + d2)s,

image

Comparing the sample complexity in Lemma 8 and 9, we see that (9) is dominated by (8). Technically, this is because  Qk and Rkare not present if  φ1 = φ2= ReLU. Combining the above two lemmas and using the inequality that

image

we immediately obtain the following concentration on the Hessian matrix.

Theorem 10 (Concentration of the Hessian matrix) Suppose Assumptions 1 and 2

image

where q = 0 for Case 1 and q = 1 for Case 2, then with probability at least 1  − 1/(d1 + d2)s,

image

Replacing (U, V) with (U⋆, V⋆) in the above inequality, one can show that  ∇2L(U⋆, V⋆)is lower bounded away from zero when  m ∧ n1 ∧ n2is sufficiently large. It turns out this observation is a fundamental condition for establishing local linear convergence rate for gradient descent.

Comparing the above sample complexity with the one established for inductive matrix completion problem (Zhong et al., 2019), our rate improves from  d(log d)3 to d log d, whenφ1, φ2are sigmoid or tanh. Moreover, we allow a semiparametric model with two different activation functions, which results in a more involved analysis.

4.4 Local Linear Convergence

The local geometry established for the loss function (3) in previous two subsections allows us to prove the local result: the gradient descent with constant step size converges to the ground truth linearly. For ease of notation, let

image

be the minimum and maximum eigenvalue of the population Hessian. The explicit lower bound of  λ⋆min is provided in Theorem 7, while the upper bound of  λ⋆max is provided in the following Lemma 12. We define the local neighborhood of (U⋆, V⋆) as

image

with radius satisfying

image

for a sufficiently small constant  cB. The above radius is determined by the concentration bound of Hessian in Theorem 10, based on which one can show that  ∇2L(U, V) ⪰ λ⋆min/2·Ifor any (U, V) ∈ BR(U⋆, V⋆). We also note that the above radius only depends on true weight matrices and is independent from sample sizes and dimensions. Thus, it will not vanish as dimension increases, provided (U⋆, V⋆) scale properly.

In preparation for the convergence analysis, next lemma characterizes the difference ∇2L(U1, V1) − ∇2L(U2, V2) for any (U1, V1), (U2, V2) ∈ BR(U⋆, V⋆).

Lemma 11 Suppose the conditions of Theorem 10 hold. For any  s ≥ 1 and any (U1, V1),(U2, V2) ∈ BR(U⋆, V⋆),

image

with probability at least 1  − 1/(d1 + d2)s.

The next result provides an upper bound on  λ⋆max and we then establish the local linear convergence rate.

Lemma 12 Under Assumption 2,

image

Theorem 13 (Local linear convergence rate) Suppose Assumptions 1 and 2 hold and the initial point (U0, V0) ∈ BR(U⋆, V⋆). For any s ≥ 1, if the sample sizes satisfies (10), then with probability at least 1  − T/(d1 + d2)s, the iterates in (4) with  η = 1/λ⋆max satisfy

image

where the contraction rate  ρ = 1 − λ⋆min/(7λ⋆max).

Based on previous preparation work, the proof of Theorem 13 is standard for gradient descent. For completeness, we present the proof in Section 7.

In next section, we demonstrate the superiority and generality of the proposed representation learning model via extension simulations and real experiments.

We show experimental results on synthetic and real-world data. In the following, we call our model nonlinear semiparametric matrix completion (NSMC). We compare NSMC with the baseline nonlinear inductive matrix completion (NIMC) proposed by Zhong et al. (2019), where they assumed the generative model to be Gaussian and minimized the squared loss. The models obtained by removing non-linear activation functions in NSMC and NIMC are called SMC and IMC, respectively.

5.1 Local Linear Convergence

We verify the local linear convergence of GD on synthetic data sets sampled with ReLU activation functions. We fix  d = d1 = d2 = 10 and r= 3. The features  {xi, x′i}i∈[n1],{zj, z′j}j∈[n2], are independently sampled from a Gaussian distribution. We fix  n1 = n2 =400 and the number of observations m = 2000. We randomly initialize (U0, V0) near theground truth (U⋆, V⋆) with fixed error in Frobenius norm. In particular, we fix  ||U0 −U⋆||2F + ||V0 − V⋆||2F = 1. For the Gaussian model,  y ∼ N(Θ · σ2, σ2). For the binomial model,  y ∼ B�(NB, exp(Θ)1+exp(Θ)�. For Poisson model,  y ∼ Pois(exp(Θ)). To introduce some variations, as well as to verify that our model allows for two separate neural networks, we let φ1= ReLU and  φ2 ∈ {ReLU, sigmoid, tanh}. The estimation error during training process is shown in Figure 1, which verifies the linear convergence rate of GD before reaching the local minima.

5.2 Robustness to Model Misspecification

We generate synthetic data with model misspecification and compare the performance of estimators given by NSMC, SMC, NIMC and IMC. We fix  d = d1 = d2 = 50, r = 3,n1 = n2= 400, and use ReLU as the activation function for NSMC and NIMC. For NSMC and SMC, we randomly generate two independent sample sets with m = 1000 observations, which are denoted as Ω and Ω′. The observed sample set for NIMC and IMC are set to be the union Ω  ∪ Ω′. For NSMC and SMC, we minimize the proposed pseudo-likelihood objective. For NIMC and IMC, we minimize the square loss as suggested by Zhong et al. (2019). We apply gradient descent starting from a random initialization near the ground truth (U⋆, V⋆), in order to guarantee convergence of all methods. We evaluate the estimated

image

Figure 1: Local linear convergence of gradient descent on synthetic data sets.

image

Figure 2: Relative Error of NSMC, SMC, NIMC and IMC. The plot shows how relative error of estimations given by each method varies with parameter  τ, which introduces model misspecification in the Gaussian model. We see that NSMC gives accurate and robust estimation, while NIMC suffers from model misspecification. SMC, IMC fail to learn the non-linear embeddings and give unsatisfactory estimations for all  τ.

matrix ˆU using the relative approximation error  E ˆU = || ˆU−U⋆||F /||U⋆||F , with E ˆV definedsimilarly. We also evaluate the performance of a solution ( ˆU, ˆV) on recovering the parametric component ˆΘusing relative test error  E ˆΘ =��(x,z)∈Ωt( ˆΘ − Θ⋆)2/ �(x,z)∈Ωt Θ⋆2,where ˆΘ = ⟨φ( ˆUT x), φ( ˆVT z)⟩, Θ⋆ = ⟨φ(U⋆T x), φ(V⋆T z)⟩, and Ωtis a newly sampled test data set. For each setting below, we report results averaged over 10 runs.

Gaussian model. We introduce model misspecification by sampling  y from y ∼N�(1 − τ)2 · Θ, (1 − τ)2�. Parameter  τis introduced to modify the impact of model mis-specification. We summarize the relative errors in Table 1 and Figure 2. When  τ= 0, there is no model misspecification and NSMC and NIMC achieve comparable relative approxi-

image

Table 1: Relative error in the Gaussian model.

image

Table 2: Relative error in the Binomial model.

mation errors. As  τincreases, the relative approximation errors of NIMC grow rapidly due to the increase of model misspecification. However, NSMC gives robust estimations. SMC and IMC serve as bilinear modeling baselines that fail to learn in the nonlinear embedding setting.

Binomial model. We sample y ∼ B(NB, exp(Θ)1+exp(Θ)) and apply NSMC and SMC with original attributes y. For NIMC and IMC, we first do variance-stabilizing transformation ˜y = arcsin ( yNB ) as the data preprocessing step, inspired by what people might do for non- Gaussian data in practical applications. From Table 2, NSMC achieves the best estimating result in each setting, while other methods fail to learn the embeddings with a binomial model.

Poisson model. We generate  y ∼ Pois(exp(Θ)), where the activation functions are φ1= ReLU and  φ2 ∈ {ReLU, sigmoid, tanh}. Due to model misspecification, we apply transformation ˜y = √yfor NIMC and IMC. The activation function of NIMC is set to be the same as  φ2. We see from Table 3 that NSMC achieves the best estimating result, while other methods fail to recover the parameters.

5.3 Clustering of Embeddings

We generate synthetic data with clustered embeddings and compare the performance of NSMC and NIMC on learning the true embedding clustering. We fix  d = d1 = d2 = 30,r = 2, n1 = n2= 400, and choose tanh as the activation function. We generate features x and z independently from a Gaussian mixture model with four components, resulting in the ground-truth embedding clustering with four components. We sample y from a binomial model with  NB= 20. We fix observed sample size m = 1000 and apply NSMC and NIMC to get the estimated ˆU and ˆV, respectively. We plot the top 2 left singular vectors (ˆι1,ˆι2) ofφ1( ˆUT x) for NSMC and NIMC, respectively, where the points are colored according to the ground-truth clustering. We also plot the top 2 left singular vectors (ι⋆1, ι⋆2) of the ground- truth embeddings  φ1(U⋆T x). Similar plots for feature z are shown as well. We see from

image

Table 3: Relative error in the Poison model.

Figure 3 that NIMC fails to find the ground-truth embeddings due to model misspecification, while NSMC gives robust estimation and recovers the ground-truth embeddings.

To quantitatively evaluate the performance, we apply the k-means clustering to the left singular vectors. We define the clustering error following Zhong et al. (2019) as

image

where  ℵ⋆ is the ground-truth clustering and  ℵis the predicted clustering. As a result, NIMC attains clustering error 0.0596 and 0.1725 for x and z respectively. NSMC achieves a better performance with clustering error 0.0196 and 0.0147 for x and z respectively.

5.4 Semi-supervised Clustering

We further illustrate the superior performance of NSMC over NIMC with real-world data. Following the experimental setting in Zhong et al. (2019), we apply NSMC and NIMC to a semi-supervised clustering problem, where we only have one kind of features,  x ∈ Rd1, ona set of items. The edge attribute  yij= 1, if the i-th item and j-th item are similar, and yij= 0, if they are dissimilar. To apply NSMC and NIMC, we set  x = z, φ1 = φ2 = φ,and assume  U⋆ = V⋆. We initialize  U0 = V0 as the same random Gaussian matrix and apply gradient descent to ensure  Ut = Vt during training. After training, we apply k-means clustering to the top r left singular vectors of  φ( ˆUT x). We follow Zhong et al. (2019) and again use the clustering error defined by (11). We set the activation function  φ to betanh for all data sets. For NSMC, we first uniformly sample two independent sets of items with  n1 = n2= 1000. Then we generate independent observation sets Ω and Ω′ with sizem = 5000. For NIMC, the observed dataset is set to be the union Ω∪Ω′. We consider three datasets: Mushroom, Segment and Covtype (Dua and Graff, 2017), and regard items with the same label as similar (yij = 1). Covtypedataset is subsampled first to balance the size of each cluster. As shown in Table 4, for linear separable dataset Mushroom, both NSMC and NIMC achieve perfect clustering. For the other two datasets, NSMC achieves better clustering results than NIMC.

We studied the nonlinear bipartite graph representation learning problem. We formalized the representation learning problem as a statistical parameter estimation problem in a semiparametric model. In particular, the edge attributes, given node features, are assumed to

image

Figure 3: The comparison of learned embeddings based on NIMC and NSMC, with the ground-truth embeddings. The first row shows embeddings of x, while the second row shows embeddings of z. The points are colored according to the ground-truth clustering.

image

Table 4: Clustering error on real-world data.

follow an exponential family distribution with unknown base measure. The parametric component of the model is assumed to be the proximity of outputs of one-layer neural network, whose inputs are node representations. In this setting, learning embedding vectors is equivalent to estimating two low-rank weight matrices (U⋆, V⋆). Using the rank-order decomposition technique, we proposed a quasi-likelihood function, and proved that GD with constant step size achieves local linear convergence rate. The sample complexity is linear in dimensions up to a logarithmic factor, which matches existing results in matrix completion. However, our estimator is robust to model misspecification within exponential family due to the adaptivity to the base measure. We also provided numerical simulations and real experiments to corroborate the main theoretical results, which demonstrated superior performance of our method over existing approaches.

One potential extension is to consider a more general distribution for node representations. For example, when node representations follow a heavy-tailed distribution, it is not clear whether we can still recover (U⋆, V⋆) with the same convergence rate. In addition, using two-layer or even deep neural networks for encoders in our semiparametric model, while still providing theoretical guarantee is another interesting extension.

In this section, we provide proofs of lemmas in the main text. Auxiliary results are presented in the appendix.

7.1 Proof of Lemma 3

For any pair (yk, y′l), let Rkldenote the rank statistics, and  ykl(·) denote the order statistics. We have

image

Moreover, as shown in (2),

image

Thus,

image

which completes the proof.

7.2 Proof of Lemma 6

Let us first introduce additional notations. Suppose QR decomposition of  U⋆, V⋆ is U⋆ =Q1R1 and V⋆ = Q2R2, respectively, with  Qi ∈ Rdi×r and Ri ∈ Rr×r for i = 1, 2. Let

Q⊥i ∈ Rdi×(di−r)be the orthogonal complement of  Qi. For any vectors  a = (a1; . . . ; ar) andb = (b1; . . . ; br) such that  ap ∈ Rd1, bp ∈ Rd2 for p ∈ [r] and ∥a∥22+∥b∥22 = 1, we express each component by  ap = Q1r1p + Q⊥1 s1p and bp = Q2r2p + Q⊥2 s2p, and let ri = (ri1, . . . , rir) ∈Rr×r and si = (si1, . . . , sir) ∈ R(di−r)×r. Further, we let  ti = (ti1, . . . , tir) ∈ Rr×r withtip = R−1i rip, and also let ¯ti ∈ Rr×rdenote the matrix that replaces the diagonal entries of tiby 0. Lastly, for i = 1, 2 and variable  x ∼ N(0,1), we define following quantities

image

Using the above notations,

image

= 12E�� r�p=1

image

where we have used the independence among  xT Q1r1p, xT Q⊥1 s1p, zT Q2r2p and zT Q⊥2 s2p.By Lemma 15, there exists a constant  C1not depending on (U⋆, V⋆) such that

image

For term  I1, let us denote the inside variable as

image

Using Lemma 21, Assumption 1, and independence among  x, x′, z, z′,

image

image

Combining the above display with (12) and (13), Minimizing over the set  {(a, b) : ∥a∥2F +

image

The above display, together with (12) and (13), leads to

image

Since the first row of  U⋆ is fixed, we minimize over the set  {(a, b) : ∥a∥2F +∥b∥2F = 1, eT1 ap =0, ∀p ∈ [r]}. Equivalently, the right hand side has the following optimization problem

image

By Theorem D.6. in Zhong et al. (2018),

image

Thus,

image

This completes the proof.

7.3 Proof of Lemma 8

The concentration is shown by taking expectation hierarchically. In particular, we let ∇2 ¯L1(U, V) = E�∇2L1(U, V) | D, D′�, where the expectation is over the random sampling of the entries from  D and D′. Then, we know  E[∇2 ¯L1(U, V)] = E[∇2L1(U, V)]. Moreover,

image

Using Lemma 17, for any  s ≥ 1,

image

Using Lemma 19,

image

Combining the above two displays, using the fact that  ∥V∥2qF +∥U∥2qF ≲ ∥V−V⋆∥2qF +∥U−U⋆∥2qF + ∥V⋆∥2qF + ∥U⋆∥2qF , and dropping high order terms, we know that, with probability at least 1  − 1/(d1 + d2)s,

image

Noting that  ∥U − U⋆∥1−q/2F + ∥V − V⋆∥1−q/2F ≲�∥U − U⋆∥2F + ∥V − V⋆∥2F� 2−q4 completesthe proof.

7.4 Proof of Lemma 9

The proof is similar to that of Lemma 8. Let  ∇2 ¯L2(U, V) = E�∇2L2(U, V) | D, D′�. Then

image

Using Lemma 18 and noting that  ∥V∥q2(1−q1)2 + ∥U∥q1(1−q2)2 ≤ ∥V∥q2 + ∥U∥q2, for all s ≥ 1,

image

Using Lemma 20,

image

Combining the last two displays, we complete the proof.

7.5 Proof of Lemma 11

Note that

image

Following the proof of Lemma 17, 18, 19 and 20, we can show that, for any  s ≥ 1, withprobability 1  − 1/(d1 + d2)s,

image

and

image

The proof follows by noting that  ∥U∥2F +∥V∥2F ≲ ∥U⋆∥2F +∥V⋆∥2F for (U, V) ∈ BR(U⋆, V⋆).

7.6 Proof of Lemma 12

By definition in Section 4.3, we have the following decomposition

image

By (34),

image

This completes the proof.

7.7 Proof of Theorem 13

Define Υ⋆ = CBβ3r3(1−q)2 �∥U⋆∥3qF + ∥V⋆∥3qF�for sufficiently large constant  CB. For anytwo points (U1, V1), (U2, V2) ∈ BR(U⋆, V⋆), if their distance satisfies

image

and the sample sizes  m, n1, n2satisfy (which is implied by the condition in Theorem 10)

image

by Lemma 11 we know

image

Using this result and letting  cB ≤ 1/(4CB)

image

Moreover, for any (U, V) in this neighborhood, by Theorem 10, we have

image

with high probability. Thus, by Weyl’s theorem (Weyl, 1912), we have

image

and similarly  λmax(∇2L(U, V)) ≤ 3λ⋆max/2. Let us consider doing one-step GD at (U, V). Let

image

Suppose the continuous line from (U, V) to (U⋆, V⋆) is parameterized by  ξ ∈ [0, 1] with

image

of interval [0, 1] with |Ξ| = 5

i ∈ [|Ξ|] and have set  S = {(U1, V1), . . . , (U|Ξ|, V|Ξ|)}. Taking the union bound over S,

image

Furthermore, since Ξ is a net of [0, 1], for any  ξ ∈ [0,1] there exists  ξ′ ∈ [|Ξ|] such that

image

Thus, by (16), (17), and Weyl’s theorem, we obtain

image

With this,

image

The last inequality is from Theorem 4 and the fact that  ∥∇L(U⋆, V⋆) − E[∇L(U⋆, V⋆)]∥Fonly contributes higher-order terms by concentration. Let  η = 1/λ⋆max, then

image

which completes the proof.

This work is partially supported by the William S. Fishman Faculty Research Fund at the University of Chicago Booth School of Business. This work was completed in part with resources provided by the University of Chicago Research Computing Center.

Appendix A. Complementary Lemmas

In this section, we list intermediate results required for proving lemmas in Section 7. Notations in each lemma are introduced in the proofs of the corresponding lemmas.

image

(Case 2) if either  φ1 or φ2is ReLU, then

image

Proof The notations in this proof are inherited from the proof of Lemma 6 in Section 7.2. By the definition of  g(·, ·) in (14),

image

Therefore,

image

and

image

Plugging the expressions of  I4, I5, and I6in Lemma 16 into (19), combining with (18), and using the fact that

Trace(¯t21) = 12∥¯t1 + ¯tT1 ∥2F − ∥¯t1∥2F , 2Trace(¯t1¯t2) = ∥¯t1 + ¯tT2 ∥2F − ∥¯t1∥2F − ∥¯t2∥2F ,

we obtain

image

image

Based on the above expression, we further provide the lower bound for Var(g(¯x, ¯z)). Weseparate into two cases.

Case 1, φ1, φ2 ∈ {sigmoid, tanh}. By symmetry of activation functions,  τ ′i,1,1 = 0. Thus,plugging into (20),

Var(g(¯x, ¯z))

image

≥ τ1,1,1τ2,1,1τ ′1,1,0τ ′2,1,0∥¯t1 + ¯tT2 ∥2F + ρ1�∥¯t1∥2F + ∥¯t2∥2F�+ τ ′′1 τ ′′2 ∥diag(t1) + diag(t2)∥22+  ρ2�∥diag(t1)∥22 + ∥diag(t2)∥22�,

where, for  j = 1, 2, i = 1, 2 and ¯i = 3 − i, ρj = ρj1 ∧ ρj2 with

image

Further, by Stein’s identity (Stein, 1972),  τi,1,1 = τ ′i,1,0. We can also numerically check that τ ′′1 , τ ′′2 , ρ1, ρ2 >0. Therefore, the above display leads to

image

Case 2, either φ1 or φ2is ReLU. Without loss of generality, we assume  φ1is ReLU. Then,

image

plugging into (20),

Var(g(¯x, ¯z))

image

image

+�τ ′′2 − π + 22π τ2,1,0τ ′2,1,1 − 1πτ ′2,1,0τ ′2,1,2�diag(t1)T diag(t2)

image

+�τ2,2,0 − π + 22π τ 22,1,0 − 1π(τ ′2,1,0)2�∥diag(t1)∥22

image

+�τ ′′2 − π + 22π τ2,1,0τ ′2,1,1 − 1πτ ′2,1,0τ ′2,1,2�diag(t1)T diag(t2).

Define

image

Then, we can numerically check  ρ3, ρ4 > 0 when φ2 ∈ {sigmoid, tanh, ReLU} and hence

image

This completes the proof.

Lemma 15 Under conditions of Lemma 6, there exists a constant  C2 > 0not depending on  U⋆, V⋆ such that

image

Proof By symmetry, we only show the proof for  I2. By the definition of  I2in (12) and noting that the inner variable has mean zero,

image

image

where the third equality is due to the independence among  u⋆Tp x, xT Q⊥1 and z; the lastinequality is due to Lemma 21 and Assumption 1. Here, ¯x, ¯z i.i.d∼ N(0, Ir) and ¯xp, ¯zp denotethe p-th component of ¯x, ¯z, respectively. Moreover,

image

Combining with (21),

image

Lemma 16 Under the setup of Lemma 14, we have

I4 =�τ2,2,0τ ′1,2,0 − τ 22,1,0(τ ′1,1,0)2�∥¯t1∥2F + τ 22,1,0(τ ′1,1,1)2Trace(¯t21) + τ 22,1,0(τ ′1,1,0)2∥¯t1111∥22+ 2τ 22,1,0τ ′1,1,2τ ′1,1,0111T ¯tT1 diag(t1) + τ 22,1,0(τ ′1,1,1)2(111T diag(t1))2

image

and

image

Proof By symmetry, we only show the proof for  I4 and I5can be proved analogously. By the definition of  I4 in (19),

image

=

image

By simple derivations, we let  t1pp = [t1p]p be the p-th entry of  t1p, and have

image

and

image

image

Plugging into the formula of  I42,

image

Combining (22), (23), (24) together,

image

+ 2�τ ′1,1,2τ ′1,1,0 − (τ ′1,1,0)2� �111T tT1 diag(t1) − ∥diag(t1)∥22�+ (τ ′1,1,1)2�(111T diag(t1))2+ Trace(t21) − 2∥diag(t1)∥22��

image

Recall from Section 7.2 that ¯ti ∈ Rr×r, i = 1,2, denotes the matrix that replaces the diagonal entries of  tiby 0. Therefore, the above equation can be further simplified as

I4 =�τ2,2,0τ ′1,2,0 − τ 22,1,0(τ ′1,1,0)2�∥¯t1∥2F + τ 22,1,0(τ ′1,1,1)2Trace(¯t21) + τ 22,1,0(τ ′1,1,0)2∥¯t1111∥22+ 2τ 22,1,0τ ′1,1,2τ ′1,1,0111T ¯tT1 diag(t1) + τ 22,1,0(τ ′1,1,1)2(111T diag(t1))2

image

This completes the proof for  I4. I5can be obtained analogously by changing the role of  φ1and  φ2. By the definition of  I6 in (19),

image

I6 =

image

image

This completes the proof.

Lemma 17 Under conditions of Lemma 8, we let  qi = 1 if φiis ReLU and  qi = 0 ifφi ∈ {sigmoid, tanh}. Then for any  s ≥ 1,

image

Proof Proof of J1.For any two samples (y, x, z) ∈ D and (y′, x′, z′) ∈ D′, let us define

image

where  Θ = ⟨φ1(UT x), φ2(VT z)⟩. To ease notations, we suppress the evaluation sample of H1. We apply Lemma 24 to bound  J1. We first check all conditions of Lemma 24. By Assumption 2 and symmetry of (d, p) and (d′, p′),

image

For activation functions in {sigmoid, tanh, ReLU}, the last inequality is due to the fact that |φ′i| ≤1. Note that

image

thus we further obtain

image

image

By Lemma 22,  ∀s ≥ 1

image

We bound the second term in (26) similarly and have

image

We next verify the second condition in Lemma 24. By the symmetry of  H1, we only need bound the following quantity

image

We only bound  J11as an example.  J12can be bounded in the same sketch. Step 1. Bound ∥E[J11]∥2. For any vectors  a = (a1; . . . ; ar) and b = (b1; . . . ; br) such that

image

image

By Lemma 23 and we have

image

Step 2. Bound ∥J11 − E[J11]∥2. We apply Lemma 26. Let us first define the random matrix

image

For the condition (a) in Lemma 26, we note that

image

By Lemma 22, for any constants  K(1,1)1 ∧ K(1,1)2 ∧ K(1,1)3 ≥1 (in what follows we may keep using such notation, where the first superscript indexes the function  {Li}i=1,2we are dealing with; the second superscript indexes the times we have used for this notation),

image

≤ 2 exp�−(d1 ∧ d2)K(1,1)3 �+ q2 exp

For the condition (b) in Lemma 26, we apply the inequalities in Lemma 23 and have

∥E[J11(x, z)J11(x, z)T ]∥2 = max∥a∥2F +∥b∥2F =1 E��dT d + pT p�3�aT d + bT p�2�

image

For the condition (c) in Lemma 26, we consider the following quantity for any unit vector (a; b):

image

Combining (29), (30), (31), (32) together and defining

image

we know conditions in Lemma 26 hold for  J11with parameters (up to constants)

image

Thus,  ∀t > 0

image

In the above inequality, for any constant  s ≥ 1 we let

image

By simple calculation, we can let

image

and further have

image

Under the conditions of Lemma 17, we combine the above inequality with (29) and have P(∥J11∥2 ≳ Υ1Υ2) ≲ 1/(d1 + d2)s, ∀s ≥1. Dealing with  J12in (28) similarly, one can show (29) and the above result hold for  J12as well. So  P(∥J12∥2 ≳ Υ1Υ2) ≲ 1/(d1 + d2)s.Plugging back into (28), we can define  ν2(J1) = β4Υ1Υ2and then conditions of Lemma 24 hold for  J1with parameters  ν1(J1) (defined in (27)) and  ν2(J1). Therefore, we have  ∀t > 0

image

For any  s ≥ 1, we let

image

and have

image

The result follows by the definition of Υ2in (33) and noting that the first term in  ϵ2 is thedominant term.

Proof of J2.We apply Lemma 25 to bound  J2. We check all conditions of Lemma 25. Some of steps are similar as above. By definition of  H1,

image

We first bound  ∥E[H1]∥2. We have

image

where the last inequality is derived similarly to (31). For the condition (a) in Lemma 25, we apply (26) and Lemma 22 (similar to (30)),

image

For the condition (b) in Lemma 25,

image

For the condition (c) in Lemma 25,

image

Thus, conditions of Lemma 25 hold with parameters (up to constants)

image

Similar to the proof of  J1, for any s ≥ 1, we let K(1,2)1 = K(1,2)2 = 2 log n1n2 +s log(d1 +d2),

image

and then have

image

Noting that the first term in  ϵ3is the dominant term, we complete the proof.

Lemma 18 Under conditions of Lemma 9 and the definition of  q1, q2in Lemma 17, we have that for any  s ≥ 1,

image

Proof Proof of T1.For any two samples (y, x, z) ∈ D and (y′, x′, z′) ∈ D′, we define

image

where  Θ = ⟨φ1(UT x), φ2(VT z)⟩. We follow the same proof sketch as Lemma 17. We apply Lemma 24 to bound  T1. We first check all conditions of Lemma 24. By Assumption 2,

image

Here, the third inequality is due to the fact that  |φ′′i | ≤ 2 if φi ∈ {sigmoid, tanh} and φ′′i = 0if  φiis ReLU. Taking union bound over  D ∪ D′, noting that log(r(n1 + n2)(d1 + d2)) ≍log (r(d1 + d2)), and applying Lemma 22, for any  s ≥1, we define

Υ3 = (1 − q1)d1 (log(r(d1 + d2)))q2/2 ∥V∥q22 +�d1d2 + (1 − q2)d2 (log(r(d1 + d2)))q1/2 ∥U∥q12

image

and have

image

Similarly to Lemma 17, we have two steps. Step 1. Bound ∥E[T11]∥2. For any vectors  a = (a1; . . . ; ar) and b = (b1; . . . ; br) such that

image

image

Using Lemma 23 and maximizing over set  {(a, b) : ∥a∥22 + ∥b∥22 = 1}, we get

image

Step 2. Bound ∥T11 − E[T11]∥2. We still apply Lemma 26. Define the following random matrix

image

For the condition (a) in Lemma 26, we note that

image

image

and we have

image

For the condition (b) in Lemma 26, let us define

image

Then,

image

By simple calculations based on Lemma 23,

image

image

Combining the above displays together and maximizing over  {(a, b) : ∥a∥22 + ∥b∥22 = 1},

image

For condition (c) in Lemma 26,

image

Applying Lemma 23,

image

Thus,

image

Combining (40), (42), (43), (44), and defining

image

then conditions in Lemma 26 hold for  T11with parameters

image

Here, Υ4, Υ5, Υ6are defined in (41), (45), and  {K(2,1)i }i=1,2,3are any constant. So,  ∀t > 0,

image

image

For any  s ≥ 1, we let

image

Then, Υ4 ≍ Υ3. Noting that  q = q1 ∨ q2 and q′ = q1q2, we can let

image

and then have

image

Combining the above inequality with (40),  P(∥T11∥2 ≳ Υ6) ≲ 1/(d1 + d2)s. We plugback into (39), combine with (38), and know Lemma 24 holds for  T1with parameters ν1(T1) = βΥ3 and ν2(T1) = β2Υ6. Finally we apply Lemma 24 and obtain that  ∀t > 0

image

For any  s ≥ 1, we let

image

and have

image

This completes the proof for the first part. Proof of T2.We apply Lemma 25 to bound  T2. We check all conditions of Lemma 25. By definition of  H2 in (35),

image

We first bound  ∥E[H2]∥2as follows:

image

image

For the condition (a) in Lemma 25, we have shown in (36) that

image

Thus, similar to (42),

image

For the condition (b) in Lemma 25,

image

For the condition (c) in Lemma 25, we use Lemma 23 and obtain

image

Thus, conditions of Lemma 25 hold for  T2with parameters (up to constants)

image

and then have

image

We finish the proof by noting that the first term of  ϵ6is the dominant.

Lemma 19 Under conditions of Lemma 8, we have

image

Proof By definition of  J3,

image

For  J31,

image

We only bound the first term. The second term has the same bound using the equation E[x]E[x]T = E[xx′T ] for any variable  x′ independent from x. Note that

image

image

We focus on the first term in the above equality. By simple calculations using the boundedness and Lipschitz continuity of  φi, φ′i,

image

Plugging the above inequality back into (49), dealing with other terms similarly, and applying Lemma 27 by noting  σr(U⋆) ∧ σr(V⋆) ≥ 1,

image

+

image

+

image

+

image

image

where Υ⋆2 is defined in the same way as Υ2in (33) but calculated using  U⋆, V⋆. Next, webound  J32in (48). Since  ψis Lipschitz continuous,

image

Thus,

∥J32∥2 ≲ β3����E���φ1(UT x)T φ2(VT z) − φ1(U⋆T x)T φ2(V⋆T z)���d⋆ − d′⋆p⋆ − p′⋆

image

For the first term,

image

For the second term, from (32) we see max∥a∥2F +∥b∥2F =1�E[(d⋆T a + p⋆T b)4] ≲ Υ⋆2. Com-bining with the above two displays, and (50) and (48),

image

This completes the proof.

Lemma 20 Under conditions of Lemma 9, we have

image

Proof We follow the same proof sketch as Lemma 19. By definition of  T3,

image

For  T31,

image

image

where Υ⋆7 has the same form as Υ7defined in (46) but is calculated using  U⋆, V⋆. For T32,we use the Lipschitz continuity of 1/(1 + exp(x)), and simplify analogously to  J32. Weobtain

image

Combining the above three displays,

image

We complete the proof.

Appendix B. Auxiliary Results

Lemma 21 (Lemma D.4 in Zhong et al. (2018)) Let U ∈ Rd×r be a full-column rank matrix. Let  g : Rk → [0, ∞). Define ¯κ(U) = �rp=1σp(U)σr(U), then we have

image

Lemma 22 (Concentration of quadratic form and norm)  Suppose x1, x2, . . . , xniid∼

image

Proof Result in (a) directly comes from the Chernoff bound and Remark 2.3 in Hsu et al. (2012). We use union bound and (a) to prove (b). (c), (d) and (e) are directly from (a) and (b). (f) is from the Chapter 3 in Vershynin (2018). (g) is due to the fact that  |xT u| issub-Gaussian variable.

Lemma 23 (Expectation of product of quadratic form)  Suppose x ∼ N(0, Id), U ∈

image

Proof Note that

image

This shows the part (a). (b) can be showed similarly using the H¨older’s inequality twice. For (c),

image

Here the first inequality is due to the H¨older’s inequality and the second equality is from Lemma 2.2 in Magnus (1978).

image

a sample set, and let Ω =  {(xk, zk)}mk=1 be a collection of samples of D, where each (xk, zk)is sampled with replacement from D uniformly. Independently, we have another sets  D′ = {(x′, z′)} and Ω′ = {(x′k, z′k)}mk=1.For any pair (x, z) and (x′, z′), we havea matrix  A�(x, z), (x′, z′)�∈ Rd1×d2. Define H = 1m2�mk,l=1 A ((xk, zk), (x′l, z′l)). If thefollowing conditions hold with  ν1, ν2not depending on  D, D′:

image

then  ∀t > 0,

image

Proof For any integer k, we define ¯k to be the remainder of k/m such that 1  ≤ ¯k ≤ m(i.e. ¯m = m). Then we can express H as

image

Note that  Hkis the sum of m independent samples, and for any  k = 0, 1, ..., m − 1, theyhave the same distribution with conditional expectation

image

Therefore,

image

By the proof of Corollary 6.1.2 in Tropp et al. (2015), the right hand side satisfies

image

Combining the above two displays and using the fact that  P(A) = E [111A] = E [E [111A | D, D′]]for any event A, we finish the proof.

image

F : i ∈ [n1], j ∈ [n2]}be a sample set with size  n1n2and each pair (x, z) follows the same distribution F; similarly but independently, let  D′ = {(x′i, z′j) ∼ F′ : i ∈ [n1], j ∈ [n2]} beanother sample set. Let  A ((x, z), (x′, z′)) ∈ Rd1×d2 be a random matrix corresponding to (x, z) ∈ D, (x′, z′) ∈ D′, and let H = 1n21n22�(x,z)∈D�(x′,z′)∈D′ A ((x, z), (x′, z′)). Supposethe following conditions hold with  µ1, ν1, ν2, ν3:

image

image

Proof For simplicity we suppress the evaluation point of  A. Let ¯A = A · 111∥A∥2≤µ1 and¯H = 1n21n22�(x,z)∈D�(x′,z′)∈D′ ¯A. Then,

image

For the first term,

image

For the third term,

image

For the second term, without loss of generality, we assume  n1 ≤ n2. For any integer k, we let  k = s1n1 + ¯k, where integer  s1 ≥0 and remainder ¯k satisfies 1  ≤ ¯k ≤ n1. We also let k = s2n2 + ˜k, where integer  s2 ≥ 0 and ˜ksatisfies 1  ≤ ˜k ≤ n2. Then

image

Based on this decomposition, we see that ¯Hk,l,jis a sum of  n1i.i.d. random matrices and that  { ¯Hk,l,j}have the same distribution. Similar to the proof of Lemma 24, we have

image

We apply Corollary 6.1.2 in Tropp et al. (2015). Note that  ∥ ¯A − E[ ¯A]∥2 ≤ 2µ1 and

image

image

A similar bound holds for  ∥E[ ¯AT ¯A] − E[ ¯AT ]E[ ¯A]∥2. Therefore

image

Putting everything together finishes the proof.

image

random matrix corresponding to (x, z) ∈ D and let H = 1n1n2�(x,z)∈D A (x, z). Supposethe following conditions hold with  µ1, ν1, ν2, ν3:

image

Proof The result follows directly from Lemma 25.

Lemma 27 Suppose x ∼ N(0, Id), φ ∈ {sigmoid, tanh, ReLU}. For any u, u⋆, a, b ∈ Rd,

image

where  q = 1 if φis ReLU and q = 0 otherwise.

Proof By H¨older’s inequality,

image

If  φ ∈ {sigmoid, tanh}, we finish the proof by using the Lipschitz continuity of  φ′ andLemma 23. If  φis ReLU, we apply Lemma E.17 in Zhong et al. (2018) to complete the proof.

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.

A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web, pages 37–48. ACM, 2013.

M. Al Hasan and M. J. Zaki. A survey of link prediction in social networks. In Social network data analytics, pages 243–275. Springer, 2011.

Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.

V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder- decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.

N. Baldin and Q. Berthet. Optimal link prediction with matrix logistic regression. arXiv preprint arXiv:1803.07054, 2018.

M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.

M. G. Bell and Y. Iida. Transportation network analysis. 1997.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828, 2013.

R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.

S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115–148. Springer, 2011.

H. Bunke and B. T. Messmer. Efficient attributed graph matching and its application to image analysis. In International Conference on Image Analysis and Processing, pages 44–55. Springer, 1995.

Y. Cao and Q. Gu. Tight sample complexity of learning one-hidden-layer convolutional neural networks. In Advances in Neural Information Processing Systems, pages 10611– 10621, 2019.

S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, and E. Cambria. Learning community embedding with community detection and node embedding on graphs. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 377–386. ACM, 2017.

Y. Chen, Z. Yang, Y. Xie, and Z. Wang. Contrastive learning from pairwise measurements. In Advances in Neural Information Processing Systems, pages 10909–10918, 2018.

Y. Chen and E. J. Cand`es. The projected power method: An efficient algorithm for joint alignment from pairwise differences. Communications on Pure and Applied Mathematics, 71(8):1648–1714, 2018.

Y. Chen and A. J. Goldsmith. Information recovery from pairwise measurements. In 2014 IEEE International Symposium on Information Theory, pages 2012–2016. IEEE, 2014.

Y. Chen and C. Suh. Spectral mle: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380, 2015.

Y. Chen, G. Kamath, C. Suh, and D. Tse. Community recovery in graphs with locality. In International Conference on Machine Learning, pages 689–698, 2016.

Y. Chen, J. Fan, C. Ma, K. Wang, et al. Spectral method and regularized mle are both optimal for top-k ranking. The Annals of Statistics, 47(4):2204–2235, 2019.

K. Cho, B. Van Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden- layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

D. Dua and C. Graff. UCI machine learning repository, 2017.

J. Fan, H. Liu, Y. Ning, and H. Zou. High dimensional semiparametric latent graphical model for mixed data. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(2):405–421, 2017.

M. R. Fengler. Semiparametric modeling of implied volatility. Springer Finance. SpringerVerlag, Berlin, 2005.

C. Fiorio. A topologically consistent representation for image analysis: the frontiers topo- logical graph. In International Conference on Discrete Geometry for Computer Imagery, pages 151–162. Springer, 1996.

S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.

B. K. Fosdick and P. D. Hoff. Testing and modeling dependencies between a network and nodal attributes. J. Amer. Statist. Assoc., 110(511):1047–1056, 2015.

I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.

P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.

T. Graepel, M. Goutrie, M. Kr¨uger, and R. Herbrich. Learning on graphs in the game of go. In International Conference on Artificial Neural Networks, pages 347–352. Springer, 2001.

A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.

W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017a.

W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arxiv: 1709.05584, 2017b, arXiv:1709.05584v3.

D. J. Higham, M. Raˇsajski, and N. Prˇzulj. Fitting a geometric graph to a protein–protein interaction network. Bioinformatics, 24(8):1093–1099, 2008.

D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17(52):1–6, 2012.

P. Jain and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.

D. Jannach, P. Resnick, A. Tuzhilin, and M. Zanker. Recommender systemsbeyond matrix completion. Communications of the ACM, 59(11):94–102, 2016.

Z. Kang, C. Peng, and Q. Cheng. Top-n recommender system via matrix completion. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

R. Li and H. Liang. Variable selection in semiparametric regression modeling. Ann. Stat., 36(1):261–286, 2008.

Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.

D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.

M. Ma, S. Na, C. Xu, and X. Fan. The graph-based broad behavior-aware recommendation system for interactive news. arXiv preprint arXiv:1812.00002, 2018.

Z. Ma, Z. Ma, and H. Yuan. Universal latent space model fitting for large networks with edge covariates. Journal of Machine Learning Research (to appear), 2019.

J. R. Magnus. The moments of products of quadratic forms in normal variables. Statist. Neerlandica, 32(4):201–210, 1978.

A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011.

S. Na, M. Kolar, and O. Koyejo. Estimating differential latent variable graphical models with applications to brain connectivity. arXiv preprint arXiv:1909.05892, 2019.

M. Nassar. Hierarchical bipartite graph convolution networks. arXiv preprint arXiv:1812.03813, 2018.

S. Negahban, S. Oh, K. K. Thekumparampil, and J. Xu. Learning from comparisons and choices. The Journal of Machine Learning Research, 19(1):1478–1572, 2018.

Y. Ning, T. Zhao, and H. Liu. A likelihood ratio framework for high-dimensional semipara- metric regression. Ann. Statist., 45(6):2299–2327, 2017.

A. Pananjady, C. Mao, V. Muthukumar, M. J. Wainwright, and T. A. Courtade. Worst- case vs average-case design for estimation from fixed pairwise comparisons. arXiv preprint arXiv:1707.06217, 2017.

D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi. Finding low-rank solutions via non- convex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4):2165–2204, 2018.

B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representa- tions. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.

J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 459– 467. ACM, 2018.

L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

J. Scott. Social network analysis. Sociology, 22(1):109–127, 1988.

S. Si, K.-Y. Chiang, C.-J. Hsieh, N. Rao, and I. S. Dhillon. Goal-directed inductive matrix completion. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM, 2016.

C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. pages 583–602, 1972.

B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Advances in neural information processing systems, pages 659–666, 2004.

J. A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends R⃝ in Machine Learning, 8(1-2):1–230, 2015.

R. Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. An introduction with applications in data science, With a foreword by Sara van de Geer.

H. Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differen- tialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Math. Ann., 71(4):441–479, 1912.

Y. Wu, H. Liu, and Y. Yang. Graph convolutional matrix completion for bipartite edge prediction. In KDIR, pages 49–58, 2018.

Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang. Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

X. Yi, D. Park, Y. Chen, and C. Caramanis. Fast algorithms for robust PCA via gradient descent. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4152–4160. Curran Associates, Inc., 2016.

H. Zha, X. He, C. Ding, H. Simon, and M. Gu. Bipartite graph partitioning and data clus- tering. In Proceedings of the tenth international conference on Information and knowledge management, pages 25–32, 2001.

K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4140–4149. JMLR. org, 2017.

K. Zhong, Z. Song, P. Jain, and I. S. Dhillon. Nonlinear inductive matrix completion based on one-layer neural networks. arXiv preprint arXiv:1805.10477, 2018.

K. Zhong, Z. Song, P. Jain, and I. S. Dhillon. Provable non-linear inductive matrix comple- tion. In Advances in Neural Information Processing Systems, pages 11435–11445, 2019.

J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.


Designed for Accessibility and to further Open Science