Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees

2020·Arxiv

Abstract

Abstract

Graph representation learning is a ubiquitous task in machine learning where the goal is to embed each vertex into a low-dimensional vector space. We consider the bipartite graph and formalize its representation learning problem as a statistical estimation problem of parameters in a semiparametric exponential family distribution. The bipartite graph is assumed to be generated by a semiparametric exponential family distribution, whose parametric component is given by the proximity of outputs of two one-layer neural networks, while nonparametric (nuisance) component is the base measure. Neural networks take high-dimensional features as inputs and output embedding vectors. In this setting, the representation learning problem is equivalent to recovering the weight matrices. The main challenges of estimation arise from the nonlinearity of activation functions and the nonparametric nuisance component of the distribution. To overcome these challenges, we propose a pseudo-likelihood objective based on the rank-order decomposition technique and focus on its local geometry. We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves linear convergence rate. Moreover, we prove that the sample complexity of the problem is linear in dimensions (up to logarithmic factors), which is consistent with parametric Gaussian models. However, our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.

Keywords: bipartite graph, nonconvex optimization, representation learning, semiparametric estimation

1. Introduction

Graphs naturally arise as models in a variety of applications, ranging from social networks (Scott, 1988) and molecular biology (Higham et al., 2008) to recommendation systems (Ma et al., 2018) and transportation (Bell and Iida, 1997). In a variety of problems, graphs tend to be high-dimensional and highly entangled, and hence difficult to directly learn from. As a prominent remedy, graph representation learning aims to learn a mapping that represents each vertex as low-dimensional vector such that structural properties of the original graph are preserved. Those learned low-dimensional representations, also called embeddings, are further used as the input features in downstream machine learning tasks, such as link prediction (Taskar et al., 2004; Al Hasan and Zaki, 2011), node classification (Bhagat et al., 2011), and community detection (Fortunato, 2010).

There are three major approaches to graph embedding: matrix factorization-based algorithms (Belkin and Niyogi, 2002; Ahmed et al., 2013), random walk algorithms (Perozzi et al., 2014; Grover and Leskovec, 2016), and graph neural networks (Scarselli et al., 2008; Zhou et al., 2018; Wu et al., 2019). These approaches can be unified via the encoder-decoder framework proposed in Hamilton et al. (2017b). In this framework, the encoder is a mapping that projects each vertex or a subgraph to a low-dimensional vector, whereas the decoder is a probability model that infers the structural information of the graph from the embeddings generated by the encoder. The structural information here depends on the specific downstream tasks of interest, which also determine the loss function of the decoder. The desired graph representations are hence obtained by minimizing the loss function as a function of embedding vectors. For example, in the link prediction task, the decoder predicts whether an edge between two vertices exists or not using a Bernoulli model and logistic loss function, and the model parameter is a function of embeddings (Baldin and Berthet, 2018).

Such an encoder-decoder architecture motivates the study of graph representation learning through the lens of statistical estimation for generative models. In particular, suppose the observed graph is generated by a statistical model specified by the decoder with true graph representations as its inputs. We can then assess the performance of a graph embedding algorithm by examining the difference between the learned representation and the ground truth. Baldin and Berthet (2018) adopted this perspective to study the performance of a linear embedding method for the link prediction problem. The validity of their results hinges on the condition that both the linear model of the encoder and the Bernoulli model of the decoder are correctly specified. When either of these assumptions are violated, they would incur large estimation error. Recent advances in graph representation learning are attributed to more flexible decoders (Cho et al., 2014; Goodfellow et al., 2016; Badri- narayanan et al., 2017), which are based on deep neural networks and can handle graphs with edge attributes that can be categorical. These approaches are poorly understood from a theoretical point of view.

In the present paper, we focus on bipartite graphs, where there are two distinct sets of vertices, U and V , and only edges between two vertices in different sets are allowed. We study the semiparametric nonlinear bipartite graph representation learning problem under the encoder-decoder framework. We assume that each vertex is associated with a high-dimensional Gaussian vector . Similarly, each vertex is associated with a high-dimensional Gaussian vector . The encoder maps them via one-layer neural networks to low-dimensional vectors are weight matrices, are activation functions evaluated entrywise, and ). Furthermore, in the decoder, we consider the link prediction task under a semiparametric model. In particular, we assume that the attribute of an edge follows a natural exponential family distribution parameterized by the proximity between two vertices, which is defined as the inner product between the embedding vectors. Here, ) is the embedding vector of ) is the embedding vector of v. We do not specify the base measure of the exponential family distribution but, instead, treat it as a nuisance parameter. This gives us a semiparametric model for the decoder and robustness to model misspecification within the exponential family.

In the above described semiparametric nonlinear model, our goal is to recover weight matrices . Based on these weight matrices, we can then compute embeddings for all vertices. There are two main obstacles that make the estimation problem challenging. First, while the activation functions make the encoder model more flexible, their nonlinearity leads to a loss function that is nonconvex and nonsmooth. Second, while the unknown nonparametric nuisance component of the decoder model makes the graph representation learning robust to the model misspecification, it also makes the likelihood function not available. To overcome these obstacles, we propose a pseudo-likelihood objective, which is minimized at () locally. We analyze the landscape of the empirical objective and show that, in a neighborhood around the ground truth, the objective is strongly convex. Therefore, the vanilla gradient descent (GD) achieves linear convergence rate. Moreover, we prove that the sample complexity is linear in dimensions , up to logarithmic factors, which matches the best known result under the parametric model (Zhong et al., 2019). Experiments on synthetic and real data corroborate our theoretical results and illustrate flexibility of the proposed representation learning model.

Notations. For any positive integer n, [n] = {1, 2, . . . , n} denotes the index set, and Unif([n]) is a uniform sampling over the indices. We write constant , which equals to 1 if i = j and 0 otherwise. For any matrix U, vec(U) denotes the column vector obtained by vectorizing . As usual, refer to the Frobenius and operator norm, respectively, and ) denotes the p-th singular value of U. For a square matrix ) is a vector including all diagonal entries of U; when U is symmetric, )) denotes its maximum (minimum) eigenvalue. We write is positive semidefinite and if it is positive definite. For any vector is the minimal absolute value of its entries.

Structure of the paper. In Section 2, we formalize the semiparametric graph representation learning problem and introduce related work. In Section 3, we present our estimation method by proposing a pseudo-likelihood objective, and the theoretical analysis of such objective is provided in Section 4. In Section 5 we show experimental results and conclusions are summarized in Section 6. Section 7 provides proofs of main technical results, while the proofs of auxiliary results are given in the appendix.

2. Preliminaries and related work

We describe the setup of our problem and introduce the applications and related work. We particularly focus on the statistical literature on theory of semiparametric estimation and matrix completion, although bipartite graph representation learning has been routinely applied to varied deep neural networks (Nassar, 2018; Wu et al., 2018). We point reader to Zha et al. (2001) for a survey on bipartite graph.

2.1 Problem formulation

Let G = (U, V, E) be a bipartite graph where U and V are two sets of vertices and E denotes the set of edges between two vertex sets. For each vertex , we assume it is associated with a Gaussian vector , while for each we have a Gaussian vector An edge between u and v has an attribute that follows the following semiparametric exponential family model

which is parameterized by the base measure function and a scalar

In model (1), ) is the log-partition function (or normalizing function) that makes the density have unit integral. The parametric component of the exponential family, ), depends on the covariate coming from the set U and the covariate from the set V . The nonparametric component f is treated as a nuisance parameter, which gives us flexibility in modeling the edge attributes. To make notation concise, we will drop the subscript of hereinafter, and use x and z to denote covariates from set U and V , respectively. In our analysis, the activation functions have one of the following three forms:

We formalize the bipartite graph representation learning as a statistical parameter estimation problem of a generative model. In particular, suppose the graph is generated by the exponential family model (1) with some unknown base measure f, and we observe part of edge attributes, y, and associated covariates on two ends, x and z. Thus, we obtain data set index the vertices of two sets. The graph representation learning in our setup is then equivalent to recovering , which can be used to compute parametric component of the decoder model and estimate embedding vectors, ), for all vertices in two sets, since activation functions are user-chosen and known.

2.2 Applications and related work

Graph representation learning underlies a number of real world problems, including object recognition in image analysis (Bunke and Messmer, 1995; Fiorio, 1996), community detection in social science (Perozzi et al., 2014; Cavallari et al., 2017), and recommendation systems in machine learning (Kang et al., 2016; Jannach et al., 2016). See Bengio et al. (2013), Hamilton et al. (2017a), Hamilton et al. (2017b) for recent surveys and other applications. The bipartite graph is of particular interest since it classifies vertices into two types, which extensively appears in modern applications.

For concreteness, in user-item recommendation systems, the attribute of an edge between a user node and an item node represents the rating, which is modeled by the proximity of projected features onto the latent space. Specifically, each user is represented by a high-dimensional feature vector x and each item is represented by a high-dimensional feature vector z. A simple generative model for the rating y that a user gives to an item is y = 1) independent from x, z. Such a model is studied in the inductive matrix completion (IMC) literature (Abernethy et al., 2006; Jain and Dhillon, 2013; Si et al., 2016; Berg et al., 2017). Zhong et al. (2019) studied nonlinear IMC problem, where a generalized model for the rating is common activation function. In this generalized nonlinear model, one-layer neural network compresses the high-dimensional features into low-dimensional embeddings. Zhong et al. (2019) proposed to minimize the squared loss to recover weight matrices established consistency for their minimizer, with linear sample complexity in dimension , up to logarithmic factors.

Our work contributes to this line of research by enhancing the IMC model from two aspects. First, we allow for two separate neural networks to embed user and item covariates. Although this modification may seem minor, it makes theoretical analysis more challenging when two networks mismatch: one network has a smooth activation function while the other does not. Second, we consider an exponential family model with unknown base measure, which extends the applicability of the model and allows for model misspecification within the exponential family. In particular, the semiparametric setup makes our estimator independent of the specific form of f. For example, the model in Zhong et al. (2019) is a special case of (1) with 2), while the link prediction problem in Liben- Nowell and Kleinberg (2007) and Menon and Elkan (2011) is a special case with f(y) = 1.

Furthermore, our work contributes to the literature on graph embedding (Qiu et al., 2018; Goyal and Ferrara, 2018). Our paper studies the bipartite graph and casts the graph representation learning as the problem of parameter estimation in a generative model. This setup allows us to analyze statistical properties, such as consistency and convergence rate, of the learned embedding features. To the best of our knowledge, statistical view of representation learning is missing although it was successfully used in real experiments (see, e.g., Graepel et al., 2001; Yang et al., 2015). In addition, our work also contributes to a growing literature on semiparametric modeling (Fengler, 2005; Li and Liang, 2008; Fan et al., 2017), where the parametric component in (1) is given by and the goal is to estimate by regressing y on x, without knowing f. Fosdick and Hoff (2015) formalized the representation learning as a latent space network model, where the parameter given by the inner product of two latent vectors and 2), that is under a Gaussian noise setup, and proposed methodology for testing the dependence between nodal attributes and latent factors. Ma et al. (2019) studied a similar model with f(y) = 1 and proposed both convex and nonconvex approaches to recover latent factors. However, our work is more challenging due to the nonlinearity of activation functions and the missing knowledge of f.

Lastly, several estimation methods for pairwise measurements have been studied in related, but simpler, models (Chen and Goldsmith, 2014; Chen and Suh, 2015; Chen et al., 2016; Pananjady et al., 2017; Negahban et al., 2018; Chen and Cand`es, 2018; Chen et al., 2019). Chen et al. (2018) studied model (1) by assuming the parameter matrix of the graph to be low-rank, and estimated as a whole. As a comparison, our model is more complicated since each entry of in our setup is given by the inner product of two embedding vectors, which measures the proximity of two vertices. Our task is to recover two underlying weight matrices that are convolved by activation functions to generate

3. Methodology

We propose a pseudo-likelihood objective function to estimate the unknown weight matrices and discuss identifiability of the parameters. The objective function is minimized by the gradient descent with a constant step size. Theoretical analysis of the iterates is provided in Section 4.

The likelihood of the model (1) is not available due to the presence of the infinite-dimensional nuisance parameter f. Using the rank-order decomposition technique (Ning et al., 2017), we focus on the pairwise differences and develop a pseudo-likelihood objective. Importantly, the differential pseudo-likelihood does not depend on f and, as a result, our estimator is valid for a wide range of distributions, without having to explicitly specify them in advance.

We follow the setup described in Section 2.1. To simplify the presentation, suppose we have 2vertices in vertices in V , denoted by and , respectively. For , and suppose that ), independent of each other. Further, we assume to observe m edge attributes, y, between vertices , and another m edge attributes, , both of which follow the distribution in (1) and are sampled with replacement from the set of all possible edges. We note that the sampling setup is commonly adopted in the literature on partially observed graphs and matrix completion problems (Zhong et al., 2019; Chen et al., 2018), which is equivalent to assuming edges on a graph are missing at random.

Denote sample sets

where ]). While the observations within Ω or Ωare not independent, as they may have common features x or z, the observations between Ω and Ωare independent. Such two independent sets of samples are obtained by sample splitting in practice. We stress that the sample splitting setup in our paper is used only to make the analysis concise without enhancing the order of sample complexity. In particular, it does not help us avoid the main difficulties of the problem.

Based on samples Ω and Ω, we consider pairwise differences and construct an empirical loss function. For

denote the true parameter associated with the k-th sample (similarly for ). Note that is the underlying parametric component of the model that generates key idea in constructing the pseudo-likelihood objective is to use rank-order decomposition to extract a factor, that is independently from the base measure. Given a pair of independent samples, , we denote their order statistics as and rank statistics as R. Then we know 1). Thus, (characterizes the pair (), and is hence a sufficient statistics. Note that

The first term is the density of the rank statistics given order statistics, which is only a function of unknown weight matrices . The second term is the density of order statistics, which relies on the specific base measure f. Thus, we omit the second term and sum over all paired samples for the first term to arrive at the following objective

The above loss function is similar to the logistic loss for the pairwise measurements. However, it is nonconvex in both components even for identity activation functions. When feature vectors x, z follow the multinomial distribution and activation functions not present, Chen et al. (2018) estimated the rank-as a whole by minimizing (3) with an additional nuclear norm penalty. Our goal is to recover both components , in the presence of nonlinear activation functions, resulting in a challenging nonconvex optimization problem.

3.1 Gradient Descent

We propose to minimize loss function (3) using the gradient descent with a constant step size. The iteration is given by

For future references, we provide explicit formulas of the gradient and the Hessian for loss (3). We introduce some definitions beforehand. Let us denote each column of weight matrices as ) (similar for To simplify notations, for a sequence of vectors ) be the long vector by stacking them up; for a sequence of matrices , we let diagthe block diagonal matrix with each block being specified by sequentially. Moreover, we define the following quantities:

The quantities on the left part are vectors or matrices calculated by using samples in Ω, which is indexed by k, while the quantities on the right part are calculated by using samples in Ω, which is indexed by l. We should mention that are the first derivative and the second derivative of the activation function is ReLU then = 0), while superscript of ) means the sample is from Ω(i.e. the sample index l is always used with superscript (). In addition, we define two scalars as

With above definitions and by simple calculations, one can show the gradient is given by