Noisy Inductive Matrix Completion Under Sparse Factor Models

2016·Arxiv

ABSTRACT

ABSTRACT

Inductive Matrix Completion (IMC) is an important class of matrix completion problems that allows direct inclusion of available features to enhance estimation capabilities. These models have found applications in personalized recommendation systems, multilabel learning, dictionary learning, etc. This paper examines a general class of noisy matrix completion tasks where the underlying matrix is following an IMC model i.e., it is formed by a mixing matrix (a priori unknown) sandwiched between two known feature matrices. The mixing matrix here is assumed to be well approximated by the product of two sparse matrices—referred here to as “sparse factor models.” We leverage the main theorem of [1] and extend it to provide theoretical error bounds for the sparsity-regularized maximum likelihood estimators for the class of problems discussed in this paper. The main result is general in the sense that it can be used to derive error bounds for various noise models. In this paper, we instantiate our main result for the case of Gaussian noise and provide corresponding error bounds in terms of squared loss.

1. INTRODUCTION

In recent years, matrix completion has been a venue for constant research—both from theoretical as well as algorithmic front. The applications include collaborative filtering, multiclass learning, dimensionality reduction etc. In most general terms, the aim of these problems is to estimate the elements of a matrix from noisy observations collected on only a subset of its locations. Of course, such estimation problems are ill-posed without further assumptions, since the value of at the unobserved locations can be arbitrary. A common approach is to augment the inference method with the assumption that the underlying matrix to be estimated exhibits some form of intrinsic low-dimensional structure, making, in some cases, the problem feasible. For example, the assumption of low-rank on the target matrix has been the main focus of research for these problems. Even though the rank constrained problem is non-convex and difficult to analyze, convex relaxations to these problems have been well studied in noiseless [2], additive uncertainty [3], and even in settings where observations are nonlinear (e.g., one-bit quantized [4]). However, due to the computational complexity reasons, the most popular algorithms to solve these problems are variants of alternating minimization [1]; the low-rank structure is imposed explicitly via bilinear models i.e., is assumed to be a product of two factor-matrices as

where modifications of this formulation also allows for explicit control and facilitates imposing additional constraints like positivity [5], sparsity [6], or even tree-sparsity [7] on the factors.

In terms of the well-known user-item recommendation problem [8]: pertains to the rating matrix, matrices of user and item latent-factors, respectively. Apart from the rating information, generally, we also have other user and item side-features (like age, gender, etc) that can be used to further improve the quality of estimation. These features can be included via a Bayesian framework [9, 10] or by directly changing the model as in Inductive Matrix Completion (IMC) [11, 12] where we have the following model assumption on

where are known matrices, for example, user and item feature matrices, is a mixing matrix whose low-rank structure is explicitly imposed via a bilinear model

The goal of this paper is to provide estimation error bounds for matrix completion under inductive setting, as in (2), by using the framework of regularized maximum likelihood estimation [1]. For generality, we derive these bounds under the assumption that both can be sparse, and noise can follow any generic distribution. This kind of sparse-structure arises in recommendation problems where the desire is to have sparse latent vectors due to requirement of low computational and storage complexity [13], and variant of dictionary learning problems where apart from the coefficient matrix being sparse, dictionary itself is modeled to be sparse [14].

1.1. Related Works

Our inference tasks here essentially entails learning two structured factors in a bilinear model. With a few notable exceptions (e.g., low-rank matrices and certain non-negative matrices [15, 16]), the non-convex bilinear models studied in this paper are difficult to analyze. Several recent efforts in the dictionary learning literature have established theoretical guarantees on identifiability, as well as algorithmic analyses of a number of optimization approaches [17–20]. The most closely related to our work are [1, 6, 17], where authors provide a framework for deriving estimation error bounds for the models of form (1) under various noise distributions and sparsity structure on by using a regularized maximum-likelihood approach based on the ideas from [21]. The Overall theme of the estimation bounds obtained using this framework is that the error of their estimate, , decays as the ratio of number of degrees of freedom to estimate over the number of observations.

The idea of including side features in the factorization models to improve recommendations is considered using Bayesian formulations in [22] or via directly incorporating these features in a generative model in IMC [12], Factorization Machines [23]. The statistical performance of IMC has been theoretically analyzed via convex relaxations based on nuclear-norm in noiseless [12] and noisy settings [24]. From the algorithmic perspective, provable alternatingminimization based approach with bilinear models has been recently proposed and analyzed [11]. Our results in this paper provide additional insights into these problems under a general setting of noise and sparsity of the bilinear factors. It is to be noted here that the aim of all these related works is to accurately estimate the mixing matrix ; we provide error bounds for the estimation of

Many recent works have used IMC framework for large scale recommendations [25] or extended IMC to automatically construct better features [26].

1.2. Specific Contributions

In this paper we provide estimation error guarantees for matrix completion using regularized maximum likelihood estimator under the model described in (2) for the cases where the factor matrices are dense (either one or both), or when is itself sparse. In order to arrive at these bounds, we extend the fundamental theorem proposed in [1] to the inductive matrix completion setting studied here. This extension is detailed in Theorem 2.1. The main utility of this theorem is that it can provide estimation error guarantees for a variety of noise distributions; for example, Gaussian, Poisson, Laplace, and even under the extreme quantized setting of 1-bit observations. Here we instantiate it for the Gaussian noise and obtain corresponding error guarantees.

1.3. Preliminaries

To set the stage for the statement of our main result, we recall several information-theoretic preliminaries. When p(Y ) and q(Y ) denote the pdf (or pmf) of a real-valued random variable Y , the Kullback-Leibler divergence (or KL divergence) of q from p is denoted and given by

where the logarithm is taken to be the natural log. By definition, is finite only if the support of p is contained in the support of q. Further, the KL divergence satisfies when p(Y ) = q(Y ). We also use the Hellinger affinity denoted by A(p, q) and given by

It is known that are parameterized by elements of matrices tively, so that use the shorthand notation

. Finally, for a matrix M we denote by its number of nonzero elements, and magnitude of its largest element (in absolute value). Also,

2. PROBLEM STATEMENT AND RECOVERY RESULT

Our focus here is to estimate a matrix that admits a model as in (2) from noisy observations collected at a subset of its locations. Sampling locations are randomly chosen in the sense that for an integer m satisfying and , each location is observed independently via a Bernoulli(Going forward, we assume the following bounds on the participating matrices:

for some positive constants

Given , we write the joint pdf (or pmf) of the observations as

where denotes the corresponding scalar pdf (or pmf), and we use the shorthand to denote the collection of elements of . Then our task may be described succinctly as follows: given S and corresponding noisy observations distributed according to (3), our goal is to estimate

under the assumption that it admits a model as in (2). Our approach will be to estimate via sparsity-penalized maximum likelihood; we consider estimates of the form

where are user-specified regularization parameter, is shorthand for the collection of entries of X indexed by S, and X is an appropriately constructed class of candidate estimates. To facilitate our analysis here, we take X to be a countable class of estimates constructed as follows: for a specified we set and construct P to be the set of all matrices whose elements are either 0, or are discretized to one of uniformly spaced levels in the range be the set of all matrices whose elements either take the value zero, or are discretized to one of uniformly-spaced levels in the range . Then, we let

Our main result establishes error bounds under general noise or corruption models. We state the result here as a theorem; its proof appears in Section 3.

Theorem 2.1. Let be as defined above and is any constant satisfying

Then for any

the complexity penalized maximum likelihood estimator (4) satisfies the (normalized, per-element) error bound

Before proceeding to specific case of Gaussian noise, we note a few salient points about this result. First, as alluded above, our results is general and can be used to analyze the error performance under a variety of noise models. Specialization to a given noise model requires us to compute the upper bounds for KL divergences in terms of problem parameters, and the lower bound of negative log Hellinger affinities in terms of the required error function. Second, the above result is kind of an oracle bound, in that it is specified in terms of a minimum over . We can construct a valid upper bound by evaluating this for any . We construct a specific that is close to , and evaluate the oracle bound for the constructed point to get a non-trivial upper bound.

Finally, we note that the optimization problem (4) is non-convex, as often is the case, not only because of penalty, or discretized space (X ) used for proofs in this paper, but due the fundamental challenge posed by bilinear models.

2.1. Implications for Gaussian Noise

Below we provide scaling behavior of error bound with problem parameters for Gaussian noise with known variance by instantiating Theorem 2.1. The joint likelihood of observations can be written as

where we have used the shorthand notation . To that end, we fix

for describing the number of discretization levels in the elements of each factor matrix and regularization parameters right hand side of (6). In this setting we have the following result; its proof appears in Section 3.2.

Corollary 2.1. Let be as defined above with . The estimate obtained via (4) satisfies

A few comments are in order regarding these error guarantees. We may interpret the quantity as the number of degrees of freedom to be estimated, and in this sense we see that the error rate of the penalized maximum likelihood estimator exhibits characteristics of the well-known parametric rate (modulo the logarithmic factor). If are instead dense, then we get the familiar error bound that decays as this with the case where matrix is just low-rank (as in formal matrix completion tasks), the error rates are of the form [1]. Clearly, the IMC model has a better error performance if or even better if the factor matrices are sparse.

The presence of the factor is akin to the incoherence con- ditions prevalent in the matrix-completion literature. In essence, it pertains to the spikiness of the underlying matrix we are trying to estimate [27].

3. PROOFS

3.1. Proof of Theorem 2.1

We utilize the following Lemma from [1] and apply it to our problem.

Lemma 3.1. Let matrix whose elements we aim to estimate, and let X be a countable collection of candidate reconstructions , each with corresponding penalty so that the collection of penalties satisfies the summability condition as explained earlier, we have that for any , the complexity penalized maximum likelihood estimator

satisfies the (normalized, per-element) error bound

where, as denoted, the expectation is with respect to the joint distribution of

Now, consider any discretized matrix factors Q, as described in Section 2. For , we encode each nonzero element of bits to denote its loca- tion and bits for its amplitude. We can then encode P with nonzero entries by using bits. Similarly, we can encode where Now, we let be set of all such X = APQB, and let the binary code for each X be the concatenation of the binary code for P followed by the binary code for Q. It follows that we may assign penalties length satisfy

given that A and B are known matrices.

It is easy to see that such codes are (by construction) uniquely decodable, so we have that by the well- known Kraft-McMillan Inequality [28]. Now let X be any subset of . For randomly subsampled and noisy observations estimates take the form

Further, using Lemma 3.1 we have

where . Finally, letting

and using the fact that

which follows by our selection of and the fact that ; it follows (after some straightforward simplifica-tion) that () follow (6), and the estimate (4) satisfies

as claimed.

3.2. Proof of Corollary 2.1

Proof. For as specified and any , using the model (8) we have

for any fixed . It follows that , and using the fact that the amplitudes of entries of and all are no larger than , it is clear that we may choose . Further, for any and any fixed it is easy to show that in this case

Incorporating this into Theorem 2.1, we obtain that for regularization parameters as in (6) with substituted, the sparsity penalized maximum-likelihood estimate satisfies the per-element mean-square error bound

We now establish the error bound for the case where are sparse with corresponding cardinality Consider a candidate reconstruction of the form where the elements of are the closest discretized surrogates of the nonzero entries of , and the entries of and are the closest discretized surrogates of the nonzero entries of . Then it is easy to see that

Given the range limits on allowable P and Q and that each range is quantized to levels, we have that and . Now, we can obtain a bound on the magnitudes of the elements of that hold uniformly over all i, j, as follows

where the inequality follows from a application of triangle inequality followed by the bounds on entry-wise bounds on elements of allowable P and Q, and because . Now, it is straight-forward to show that our choice of (9) implies

so each entry of is bounded in magnitude by follows that each element of the candidate constructed above is bounded in magnitude by is indeed a valid element of the set X .

Further, the approximation error analysis above also implies directly that

where the last line follows from the fact that our specific choice of in (9) also implies

Now, evaluating the oracle term at the candidate and using the fact that have

4. REFERENCES

[1] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Noisy matrix com- pletion under sparse factor models,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3636–3661, June 2016.

[2] E. J. Cand`es and B. Recht, “Exact matrix completion via con- vex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.

[3] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from noisy entries,” Journal of Machine Learning Research, vol. 11, pp. 2057–2078, 2010.

[4] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bit matrix completion,” Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.

[5] A. Soni and J. Haupt, “Estimation error guarantees for poisson denoising with sparse and structured dictionary models,” in 2014 IEEE International Symposium on Information Theory, June 2014, pp. 2002–2006.

[6] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Error bounds for maximum likelihood matrix completion under sparse factor models,” in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, Dec 2014, pp. 399–403.

[7] A. Soni and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 133– 149, Jan 2014.

[8] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization tech- niques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.

[9] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King, “Sorec: social recommendation using probabilistic matrix factorization,” in Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008, pp. 931– 940.

[10] Ian Porteous, Arthur U Asuncion, and Max Welling, “Bayesian matrix factorization with side information and dirichlet process mixtures.,” in AAAI, 2010.

[11] Prateek Jain and Inderjit S Dhillon, “Provable inductive matrix completion,” arXiv preprint arXiv:1306.0626, 2013.

[12] Miao Xu, Rong Jin, and Zhi-Hua Zhou, “Speedup matrix com- pletion with side information: Application to multi-label learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2301–2309.

[13] E. Acar, G. Grdeniz, M. A. Rasmussen, D. Rago, L. O. Drag- sted, and R. Bro, “Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics,” in 2012 IEEE 12th International Conference on Data Mining Workshops, Dec 2012, pp. 1–8.

[14] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, March 2010.

[15] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra, “Computing a nonnegative matrix factorization– provably,” Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 145–162, 2012.

[16] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix factorization and dimensionality reduction on physical space,” IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3239–3252, 2012.

[17] J. D. Haupt, N. D. Sidiropoulos, and G. B. Giannakis, “Sparse dictionary learning from 1-bit data,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 7664–7668.

[18] R. Gribonval and K. Schnass, “Dictionary identification – Sparse matrix-factorization via minimization,” IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.

[19] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them,” Linear algebra and its applications, vol. 416, no. 1, pp. 48–67, 2006.

[20] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,” in Proceedings of the TwentyThird international joint conference on Artificial Intelligence, 2013, pp. 3087–3090.

[21] A. R. Barron, C. Huang, J. Q. Li, and X. Luo, “The MDL prin- ciple, penalized likelihoods, and statistical risk,” Festschrift for Jorma Rissanen. Tampere University Press, Tampere, Finland, 2008.

[22] Deepak Agarwal and Bee-Chung Chen, “flda: matrix factor- ization through latent dirichlet allocation,” in Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010, pp. 91–100.

[23] Steffen Rendle, “Factorization machines,” in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995– 1000.

[24] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S Dhillon, “Ma- trix completion with noisy side information,” in Advances in Neural Information Processing Systems, 2015, pp. 3447–3455.

[25] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and In- derjit S Dhillon, “Tumblr blog recommendation with boosted inductive matrix completion,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 203–212.

[26] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inder- jit S. Dhillon, “Goal-directed inductive matrix completion,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), aug 2016.

[27] S. Negahban and M. J. Wainwright, “Restricted strong con- vexity and weighted matrix completion: Optimal bounds with noise,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1665–1697, 2012.

[28] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley & Sons, 2006.

designed for accessibility and to further open science