b

DiscoverSearch
About
My stuff
Noisy Inductive Matrix Completion Under Sparse Factor Models
2016·arXiv
ABSTRACT
ABSTRACT

Inductive Matrix Completion (IMC) is an important class of matrix completion problems that allows direct inclusion of available features to enhance estimation capabilities. These models have found applications in personalized recommendation systems, multilabel learning, dictionary learning, etc. This paper examines a general class of noisy matrix completion tasks where the underlying matrix is following an IMC model i.e., it is formed by a mixing matrix (a priori unknown) sandwiched between two known feature matrices. The mixing matrix here is assumed to be well approximated by the product of two sparse matrices—referred here to as “sparse factor models.” We leverage the main theorem of [1] and extend it to provide theoretical error bounds for the sparsity-regularized maximum likelihood estimators for the class of problems discussed in this paper. The main result is general in the sense that it can be used to derive error bounds for various noise models. In this paper, we instantiate our main result for the case of Gaussian noise and provide corresponding error bounds in terms of squared loss.

In recent years, matrix completion has been a venue for constant research—both from theoretical as well as algorithmic front. The applications include collaborative filtering, multiclass learning, dimensionality reduction etc. In most general terms, the aim of these problems is to estimate the elements of a matrix  X∗ ∈ Rn1×n2from noisy observations collected on only a subset of its locations. Of course, such estimation problems are ill-posed without further assumptions, since the value of  X∗ at the unobserved locations can be arbitrary. A common approach is to augment the inference method with the assumption that the underlying matrix to be estimated exhibits some form of intrinsic low-dimensional structure, making, in some cases, the problem feasible. For example, the assumption of low-rank on the target matrix has been the main focus of research for these problems. Even though the rank constrained problem is non-convex and difficult to analyze, convex relaxations to these problems have been well studied in noiseless [2], additive uncertainty [3], and even in settings where observations are nonlinear (e.g., one-bit quantized [4]). However, due to the computational complexity reasons, the most popular algorithms to solve these problems are variants of alternating minimization [1]; the low-rank structure is imposed explicitly via bilinear models i.e.,  X∗ is assumed to be a product of two factor-matrices as

image

where  D∗ ∈ Rn1×r, A∗ ∈ Rr×n2 and r ≪ min(n1, n2). Slightmodifications of this formulation also allows for explicit control and facilitates imposing additional constraints like positivity [5], sparsity [6], or even tree-sparsity [7] on the factors.

In terms of the well-known user-item recommendation problem [8]:  X∗ pertains to the rating matrix,  D∗ and A∗ denote thematrices of user and item latent-factors, respectively. Apart from the rating information, generally, we also have other user and item side-features (like age, gender, etc) that can be used to further improve the quality of estimation. These features can be included via a Bayesian framework [9, 10] or by directly changing the model as in Inductive Matrix Completion (IMC) [11, 12] where we have the following model assumption on  X∗:

image

where  A ∈ Rn1×r1, B ∈ Rr2×n2 are known matrices, for example, user and item feature matrices,  W∗ ∈ Rr1×r2 is a mixing matrix whose low-rank structure is explicitly imposed via a bilinear model W∗ = P∗Q∗, P∗ ∈ Rr1×r and Q∗ ∈ Rr×r2.

The goal of this paper is to provide estimation error bounds for matrix completion under inductive setting, as in (2), by using the framework of regularized maximum likelihood estimation [1]. For generality, we derive these bounds under the assumption that both P∗ and Q∗ can be sparse, and noise can follow any generic distribution. This kind of sparse-structure arises in recommendation problems where the desire is to have sparse latent vectors due to requirement of low computational and storage complexity [13], and variant of dictionary learning problems where apart from the coefficient matrix being sparse, dictionary itself is modeled to be sparse [14].

1.1. Related Works

Our inference tasks here essentially entails learning two structured factors in a bilinear model. With a few notable exceptions (e.g., low-rank matrices and certain non-negative matrices [15, 16]), the non-convex bilinear models studied in this paper are difficult to analyze. Several recent efforts in the dictionary learning literature have established theoretical guarantees on identifiability, as well as algorithmic analyses of a number of optimization approaches [17–20]. The most closely related to our work are [1, 6, 17], where authors provide a framework for deriving estimation error bounds for the models of form (1) under various noise distributions and sparsity structure on A∗ by using a regularized maximum-likelihood approach based on the ideas from [21]. The Overall theme of the estimation bounds obtained using this framework is that the error of their estimate, ˆX ofX∗, decays as the ratio of number of degrees of freedom to estimate over the number of observations.

The idea of including side features in the factorization models to improve recommendations is considered using Bayesian formulations in [22] or via directly incorporating these features in a generative model in IMC [12], Factorization Machines [23]. The statistical performance of IMC has been theoretically analyzed via convex relaxations based on nuclear-norm in noiseless [12] and noisy settings [24]. From the algorithmic perspective, provable alternatingminimization based approach with bilinear models has been recently proposed and analyzed [11]. Our results in this paper provide additional insights into these problems under a general setting of noise and sparsity of the bilinear factors. It is to be noted here that the aim of all these related works is to accurately estimate the mixing matrix W∗; we provide error bounds for the estimation of  X∗.

Many recent works have used IMC framework for large scale recommendations [25] or extended IMC to automatically construct better features [26].

1.2. Specific Contributions

In this paper we provide estimation error guarantees for matrix completion using regularized maximum likelihood estimator under the model described in (2) for the cases where the factor matrices are dense (either one or both), or when  W∗ is itself sparse. In order to arrive at these bounds, we extend the fundamental theorem proposed in [1] to the inductive matrix completion setting studied here. This extension is detailed in Theorem 2.1. The main utility of this theorem is that it can provide estimation error guarantees for a variety of noise distributions; for example, Gaussian, Poisson, Laplace, and even under the extreme quantized setting of 1-bit observations. Here we instantiate it for the Gaussian noise and obtain corresponding error guarantees.

1.3. Preliminaries

To set the stage for the statement of our main result, we recall several information-theoretic preliminaries. When p(Y ) and q(Y ) denote the pdf (or pmf) of a real-valued random variable Y , the Kullback-Leibler divergence (or KL divergence) of q from p is denoted  D(p∥q)and given by

image

where the logarithm is taken to be the natural log. By definition, D(p∥q)is finite only if the support of p is contained in the support of q. Further, the KL divergence satisfies  D(p∥q) ≥ 0 and D(p∥q) = 0when p(Y ) = q(Y ). We also use the Hellinger affinity denoted by A(p, q) and given by

image

It is known that  0 ≤ A(p, q) ≤ 1. When p and qare parameterized by elements  Xi,j and �Xi,jof matrices  X and �X, respec-tively, so that  p(Yi,j) = pXi,j(Yi,j) and q(Yi,j) = q �Xi,j(Yi,j), weuse the shorthand notation  D(pX∥q �X) ≜ �i,j D(pXi,j∥q �Xi,j ) and

A(pX, q �X) ≜ �i,j A(pXi,j, q �Xi,j ). Finally, for a matrix M we denote by  ∥M∥0its number of nonzero elements, and  ∥M∥max themagnitude of its largest element (in absolute value). Also,  (a1 ∨a2) = max(a1, a2).

Our focus here is to estimate a matrix  X∗ that admits a model as in (2) from noisy observations  YS = {Yi,j}(i,j)∈Scollected at a subset  S ⊂ [n1]×[n2]of its locations. Sampling locations are randomly chosen in the sense that for an integer m satisfying  4 ≤ m ≤ n1n2and  γ = m(n1n2)−1, each location is observed independently via a Bernoulli(γ) model.Going forward, we assume the following bounds on the participating matrices:

∥X∗∥max ≤ Xmax/2, ∥P∗∥max ≤ 1, ∥Q∗∥max ≤ Qmax,∥A∥max ≤ Amax, and ∥B∥max ≤ Bmax,

for some positive constants  Qmax, Amax, and Bmax.

Given  S and YS, we write the joint pdf (or pmf) of the observations as

image

where  pX∗i,j(Yi,j)denotes the corresponding scalar pdf (or pmf), and we use the shorthand  X∗S to denote the collection of elements of  X∗ indexed by (i, j) ∈ S. Then our task may be described succinctly as follows: given S and corresponding noisy observations YS of X∗ distributed according to (3), our goal is to estimate  X∗

under the assumption that it admits a model as in (2). Our approach will be to estimate  X∗ via sparsity-penalized maximum likelihood; we consider estimates of the form

image

where  {λP, λQ} > 0are user-specified regularization parameter, XSis shorthand for the collection  {Xi,j}(i,j)∈Sof entries of X indexed by S, and X is an appropriately constructed class of candidate estimates. To facilitate our analysis here, we take X to be a countable class of estimates constructed as follows: for a specified  β ≥ 1,we set  Llev = 2⌈log2(n1∨n2)β⌉ and construct P to be the set of all matrices  P ∈ Rr1×r whose elements are either 0, or are discretized to one of  Llevuniformly spaced levels in the range  [−1, 1] and Q tobe the set of all matrices  Q ∈ Rr×r2 whose elements either take the value zero, or are discretized to one of  Llevuniformly-spaced levels in the range  [−Qmax, Qmax]. Then, we let

image

Our main result establishes error bounds under general noise or corruption models. We state the result here as a theorem; its proof appears in Section 3.

Theorem 2.1. Let  S, γ, m, β, X , and YSbe as defined above and CDis any constant satisfying

image

Then for any

image

the complexity penalized maximum likelihood estimator (4) satisfies the (normalized, per-element) error bound

image

Before proceeding to specific case of Gaussian noise, we note a few salient points about this result. First, as alluded above, our results is general and can be used to analyze the error performance under a variety of noise models. Specialization to a given noise model requires us to compute the upper bounds for KL divergences in terms of problem parameters, and the lower bound of negative log Hellinger affinities in terms of the required error function. Second, the above result is kind of an oracle bound, in that it is specified in terms of a minimum over  X ∈ X. We can construct a valid upper bound by evaluating this for any  X ∈ X. We construct a specific X ∈ Xthat is close to  X∗, and evaluate the oracle bound for the constructed point to get a non-trivial upper bound.

Finally, we note that the optimization problem (4) is non-convex, as often is the case, not only because of  ℓ0penalty, or discretized space (X ) used for proofs in this paper, but due the fundamental challenge posed by bilinear models.

2.1. Implications for Gaussian Noise

Below we provide scaling behavior of error bound with problem parameters for Gaussian noise with known variance by instantiating Theorem 2.1. The joint likelihood of observations can be written as

image

where we have used the shorthand notation  ∥YS − X∗S∥2F ≜�(i,j)∈S(Yi,j − X∗i,j)2. To that end, we fix  β as

image

for describing the number of discretization levels in the elements of each factor matrix and regularization parameters  (λP, λQ) equal toright hand side of (6). In this setting we have the following result; its proof appears in Section 3.2.

Corollary 2.1. Let  β, λP, and λQbe as defined above with  CD =2X2max/σ2. The estimate  �Xobtained via (4) satisfies

image

A few comments are in order regarding these error guarantees. We may interpret the quantity  ∥P∗∥0 + ∥Q∗∥0as the number of degrees of freedom to be estimated, and in this sense we see that the error rate of the penalized maximum likelihood estimator exhibits characteristics of the well-known parametric rate (modulo the logarithmic factor). If  P∗ and Q∗ are instead dense, then we get the familiar error bound that decays as  (r1+r2)r/m with m. Comparedthis with the case where matrix  X∗ is just low-rank (as in formal matrix completion tasks), the error rates are of the form  (n1 + n2)r/m[1]. Clearly, the IMC model has a better error performance if  r2 ≪n2or even better if the factor matrices are sparse.

The presence of the factor  X2max is akin to the incoherence con- ditions prevalent in the matrix-completion literature. In essence, it pertains to the spikiness of the underlying matrix we are trying to estimate [27].

3.1. Proof of Theorem 2.1

We utilize the following Lemma from [1] and apply it to our problem.

Lemma 3.1. Let  X∗ be an n1 × n2matrix whose elements we aim to estimate, and let X be a countable collection of candidate reconstructions  X of X∗, each with corresponding penalty  pen(X) ≥ 1,so that the collection of penalties satisfies the summability condition �X∈X 2−pen(X) ≤ 1. Given m, γ, S, YS, and CDas explained earlier, we have that for any  ξ ≥�1 + 2CD3 �·2 log 2, the complexity penalized maximum likelihood estimator

image

satisfies the (normalized, per-element) error bound

image

where, as denoted, the expectation is with respect to the joint distribution of  S and YS.

Now, consider any discretized matrix factors  P ∈ P and Q ∈Q, as described in Section 2. For  LPloc ≜ 2⌈log2(r1·r)⌉, we encode each nonzero element of  P using log2 LPloc bits to denote its loca- tion and  log2 Llevbits for its amplitude. We can then encode P with  ∥P∥0nonzero entries by using  ∥P∥0(log2 LPloc + log2 Llev)bits. Similarly, we can encode  Q using ∥Q∥0(log2 LQloc+log2 Llev)where  LQloc ≜ 2⌈log2(r2·r)⌉.Now, we let  X ′ be set of all such X = APQB, and let the binary code for each X be the concatenation of the binary code for P followed by the binary code for Q. It follows that we may assign penalties  pen(X) to all X ∈ X ′ whoselength satisfy

image

given that A and B are known matrices.

It is easy to see that such codes are (by construction) uniquely decodable, so we have that �X∈X ′ 2−pen(X) ≤ 1by the well- known Kraft-McMillan Inequality [28]. Now let X be any subset of  X ′. For randomly subsampled and noisy observations  YS ourestimates take the form

image

Further, using Lemma 3.1 we have

image

where  log2 Lloc =�log2 LPloc ∨ log2 LQloc�. Finally, letting

image

and using the fact that

image

which follows by our selection of  Llev and Llocand the fact that r1 < n1, r2 < n2; it follows (after some straightforward simplifica-tion) that (λP, λQ) follow (6), and the estimate (4) satisfies

image

as claimed.

3.2. Proof of Corollary 2.1

Proof. For  X∗ as specified and any  X ∈ X, using the model (8) we have

image

for any fixed  (i, j) ∈ S. It follows that  D(pX∗∥pX) = ∥X∗ −X∥2F /2σ2, and using the fact that the amplitudes of entries of  X∗and all  X ∈ Xare no larger than  Xmax, it is clear that we may choose  CD = 2X2max/σ2. Further, for any  X ∈ Xand any fixed (i, j) ∈ Sit is easy to show that in this case

image

Incorporating this into Theorem 2.1, we obtain that for regularization parameters as in (6) with  CDsubstituted, the sparsity penalized maximum-likelihood estimate satisfies the per-element mean-square error bound

image

We now establish the error bound for the case where  P∗ andQ∗ are sparse with corresponding cardinality  ∥P∗∥0 and ∥Q∗∥0.Consider a candidate reconstruction of the form  X∗q = AP∗qQ∗qB,where the elements of  P∗q are the closest discretized surrogates of the nonzero entries of  P∗, and the entries of and  Q∗q are the closest discretized surrogates of the nonzero entries of  Q∗. Denote P∗q =P∗ + △P∗ and Q∗q = Q∗ + △Q∗. Then it is easy to see that

image

Given the range limits on allowable P and Q and that each range is quantized to  Llevlevels, we have that  ∥△P∗∥max ≤ 1/(Llev − 1)and  ∥△Q∗∥max ≤ Qmax/(Llev − 1). Now, we can obtain a bound on the magnitudes of the elements of  X∗q − X∗that hold uniformly over all i, j, as follows

image

where the inequality follows from a application of triangle inequality followed by the bounds on  ∥△P∗∥max and ∥△Q∗∥max and theentry-wise bounds on elements of allowable P and Q, and because Llev ≥ 2. Now, it is straight-forward to show that our choice of  β in(9) implies

image

so each entry of  X∗q − X∗is bounded in magnitude by  Xmax/2. Itfollows that each element of the candidate  X∗q constructed above is bounded in magnitude by  Xmax, so X∗q is indeed a valid element of the set X .

Further, the approximation error analysis above also implies directly that

image

where the last line follows from the fact that our specific choice of  βin (9) also implies

image

Now, evaluating the oracle term at the candidate  X∗q = AP∗qQ∗qB,and using the fact that  ∥P∗q∥0 = ∥P∗∥0 and ∥Q∗q∥0 = ∥Q∗∥0, wehave

image

[1] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Noisy matrix com- pletion under sparse factor models,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3636–3661, June 2016.

[2] E. J. Cand`es and B. Recht, “Exact matrix completion via con- vex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.

[3] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from noisy entries,” Journal of Machine Learning Research, vol. 11, pp. 2057–2078, 2010.

[4] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bit matrix completion,” Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.

[5] A. Soni and J. Haupt, “Estimation error guarantees for poisson denoising with sparse and structured dictionary models,” in 2014 IEEE International Symposium on Information Theory, June 2014, pp. 2002–2006.

[6] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Error bounds for maximum likelihood matrix completion under sparse factor models,” in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, Dec 2014, pp. 399–403.

[7] A. Soni and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 133– 149, Jan 2014.

[8] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization tech- niques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.

[9] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King, “Sorec: social recommendation using probabilistic matrix factorization,” in Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008, pp. 931– 940.

[10] Ian Porteous, Arthur U Asuncion, and Max Welling, “Bayesian matrix factorization with side information and dirichlet process mixtures.,” in AAAI, 2010.

[11] Prateek Jain and Inderjit S Dhillon, “Provable inductive matrix completion,” arXiv preprint arXiv:1306.0626, 2013.

[12] Miao Xu, Rong Jin, and Zhi-Hua Zhou, “Speedup matrix com- pletion with side information: Application to multi-label learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2301–2309.

[13] E. Acar, G. Grdeniz, M. A. Rasmussen, D. Rago, L. O. Drag- sted, and R. Bro, “Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics,” in 2012 IEEE 12th International Conference on Data Mining Workshops, Dec 2012, pp. 1–8.

[14] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, March 2010.

[15] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra, “Computing a nonnegative matrix factorization– provably,” Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 145–162, 2012.

[16] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix factorization and dimensionality reduction on physical space,” IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3239–3252, 2012.

[17] J. D. Haupt, N. D. Sidiropoulos, and G. B. Giannakis, “Sparse dictionary learning from 1-bit data,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 7664–7668.

[18] R. Gribonval and K. Schnass, “Dictionary identification – Sparse matrix-factorization via  l1minimization,” IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.

[19] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them,” Linear algebra and its applications, vol. 416, no. 1, pp. 48–67, 2006.

[20] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,” in Proceedings of the TwentyThird international joint conference on Artificial Intelligence, 2013, pp. 3087–3090.

[21] A. R. Barron, C. Huang, J. Q. Li, and X. Luo, “The MDL prin- ciple, penalized likelihoods, and statistical risk,” Festschrift for Jorma Rissanen. Tampere University Press, Tampere, Finland, 2008.

[22] Deepak Agarwal and Bee-Chung Chen, “flda: matrix factor- ization through latent dirichlet allocation,” in Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010, pp. 91–100.

[23] Steffen Rendle, “Factorization machines,” in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995– 1000.

[24] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S Dhillon, “Ma- trix completion with noisy side information,” in Advances in Neural Information Processing Systems, 2015, pp. 3447–3455.

[25] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and In- derjit S Dhillon, “Tumblr blog recommendation with boosted inductive matrix completion,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 203–212.

[26] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inder- jit S. Dhillon, “Goal-directed inductive matrix completion,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), aug 2016.

[27] S. Negahban and M. J. Wainwright, “Restricted strong con- vexity and weighted matrix completion: Optimal bounds with noise,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1665–1697, 2012.

[28] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley & Sons, 2006.


Designed for Accessibility and to further Open Science