Transduction with Matrix Completion Using Smoothed Rank Function

2018·Arxiv

Abstract

Abstract

In this paper, we propose two new algorithms for transduction with Matrix Completion (MC) problem. The joint MC and prediction tasks are addressed simultaneously to enhance the accuracy, i.e., the label matrix is concatenated to the data matrix forming a stacked matrix. Assuming the data matrix is of low rank, we propose new recommendation methods by posing the problem as a constrained minimization of the Smoothed Rank Function (SRF). We provide convergence analysis for the proposed algorithms. The simulations are conducted on real datasets in two different scenarios of randomly missing pattern with and without block loss. The results confirm that the accuracy of our proposed methods outperforms those of state-of-the-art methods even up to 10% in low observation rates for the scenario without block loss. Our accuracy in the latter scenario, is comparable to state-of-the-art methods while the complexity of the proposed algorithms are reduced up to 4 times.

Keywords: Semi-Supervised Learning, Matrix Completion, Smoothed Rank Function, Multi-label Learning, Constrained Rank Minimization

1. Introduction

Prediction using labels known as supervised learning is tackled in many papers in the literature, and several efficient approaches are introduced to this end [1]. In many real-world applications, missing information or random sampling scenario is an inseparable part of the problem [2, 3]. One of the most common approaches to address unobserved or quantized attributes is utilizing Matrix Completion (MC) methods [4, 5], and [6]. Combining the two aforementioned concepts, the classification, prediction, or multi-label learning tasks are considered in the missing information scenario. This problem can be generally addressed both directly, or indirectly. The indirect approach addresses the imputation and prediction tasks separately [7, 8], while the direct approaches introduce a unique platform where both tasks are conducted simultaneously [[9, 10, 11].

The direct Transduction with MC task, introduced in [12], not only addresses the multi-label problem but it also imputes the unobserved entries in a unique framework, simultaneously. In their proposed method MC-1, the labels and data matrices are concatenated forming a larger stacked matrix. Then, they minimize the penalized nuclear norm of the stacked matrix assuming the low-rank property holds for the data matrix and for the stacked matrix consequently (in linear models). In their model, nuclear norm approximation of the rank function is utilized as its convex surrogate. In [13], the authors have suggested an algorithm which is more robust than the one proposed in [12], and also outperformed their accuracy in terms of the Average Precision (AP) measure.

In our paper, we introduce a novel direct method to impute the labels and missing data together. To this end, we pose a new optimization problem model, approximating the rank of the stacked matrix with a smoothed function. The Smoothed Rank Function (SRF) concept, leveraged in our proposed model, leads to the differentiability property [5]. Thus, we take the advantage of using the Projected Gradient (PG), and Spectral Projected Gradient (SPG) method [14], which are more robust and faster than Subgradient based methods derived from the penalized nuclear norm cost functions. It is worth noting that the problem model we introduce is different from a simple MC task since the hard labels force additional nonaffine constraints. In our work, we introduce two new algorithms based on projected GD and SPG. We have also achieved noticeable simulation results which illustrate our methods’ outperformance both in accuracy and complexity in most of the cases compared to state-of-the-art methods. Detailed simulation analysis is provided in Section 5. We also provide convergence analysis for our proposed algorithms.

Other authors have used the concatenation concept for different purposes such as the image classification scenario, and have leveraged the semi-supervised transduction with MC for tagging and classifying images [15], where the authors propose a novel Hashing approach for Tag Completion and Prediction. In [16] and [17], the applications of this model to social image tagging and image classification are investigated. In [18], a novel matrix-completion method called Inductive Matrix Completion is applied to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. In [19], the authors use the ADMM technique for optimizing the augmented Lagrangian function for the sake of MC. The matrix in their work is a concatenated version based on the similar idea introduced in [12]. Their purpose is to carry out head and body pose estimation which could be considered as one of the wearable device applications. In [20], the authors use ADMM to optimize a class of sub-modular cost functions in order to deal with the missing information and class imbalance in multi-linear learning simultaneously. In [21] two noise-tolerant optimization models, DRMC-b and DRMC-1, for distantly supervised relation extraction task from a novel perspective are introduced.

The rest of the paper is organized as follows: In Section 2, the problem formulation is provided. In Section 3, we review the smoothed rank function approximation and explain the motivation of the new objective function taken into account. We also include our proposed algorithms in this section. Next, in Section 4, we analyze the convergence of our proposed algorithms. We illustrate the performance of our method and compare it to several state-of-the-art methods in Section 5. Finally, we conclude the paper in Section 6.

2. Problem Formulation

Let be feature vectors associated with n items. These vectors are combined together in a row-wise fashion to create a feature matrix, ; ] . Let be n classification label vectors of size t. These vectors are combined together to create the label matrix, ; ] . In missing data scenario, some of the entries in X and Y are observed and the others are Missing Completely at Random (MCAR) [22]. We assume some of the entries in X and Y are randomly lost. Let Ωand Ωdenote the sets of observed entries in X and Y, respectively. If a specific feature relating to an item is not reported or in other words, is not observed in these matrices, then it is reported as 0 (In older literature on missing data, NA (Not Assigned) was used to denote the missing entries). Thus, the entries of the matrix Y are reported as 1 or 1 for classified labels and 0 for missing labels.

Our goal is to predict the missing labels for (as well as imputing the missing features in X. To solve this generally ill-posed problem, we assume that X and Y are jointly produced by an underlying low rank matrix [12]. We assume is the low rank pre-feature matrix. Let denote the soft labels associated with . By the assumption, is produced as + b, where is the weight matrix and is the bias vector. The hard labels are generated from soft labels via some function (In general, Sign function or the Logistic function is used). Let ] be the soft label matrix. Since + b, the columns of the soft label matrix are linear combinations of the columns in [] where 1 is the all-1 vector. Thus, rank([]). It is assumed that is low-rank, therefore [] is also low-rank, because rank([rank() + 1. Let Z be the + 1) stacked matrix []. Our goal is to recover this stacked matrix in which the unknown labels are also imputed as built-in parts of a global matrix. The recovered stacked matrix should be consistent with the observed data. Additionally, Z is desired to be of low-rank. Thus, the following constrained optimization problem is obtained:

In our proposed problem model, we substitute the function rank(Z) with our proposed smoothed function, and do not relax the hard constraints. Further elaborations on our model and algorithms are provided in the subsequent Section.

3. The Proposed Algorithm

The concept forming our algorithm is based on approximating the rank function with a smooth function and then improve this approximation by tuning the smoothing function. This concept is introduced in [5] to solve the MC problem. Generally, the rank function is not differentiable and gradient methods cannot be efficiently applied to problems containing the rank function. However, we use a smooth differentiable function to approximate the rank function. This will allow us to use the gradient methods in order to optimize the smooth function. Then, we update and tune the parameter of the smooth function to improve the accuracy of our approximation. Let denote the vector containing all of the singular values of the matrix Z where ) assuming . We have rank(. Also, we have )) where is the Kronecker delta function,

Next, we seek for a class of appropriate functions approximating the rank function. The following definition introduces this class of functions [5]:

Definition 1. Qualified Rank Approximation (QRA). A function [0, 1] is called a qualified rank approximation if

Further, we define ( ). Many functions may be found that satisfy the QRA conditions. Through this paper, we consider which satisfies the QRA conditions. It can be observed that ) converges in a pointwise fashion to the Kronecker delta function as 0.

Assume f(x) is a QRA function. Thus, we have

Now, we define

This is an approximation of the rank function. Instead of the rank function, we solve the optimization problem for ) which gives us an approximation of the solution of problem (1). Thus, we use the previous solution as a warm-start and the new value of to solve the new optimization problem. After iterating this procedure, we will obtain a sequence of matrices , where each term is obtained by optimizing ) for some fixed using the previous solution as the warm-start in each iteration. Since different values are close to each other and ) is continuous, we expect and be close to each other w.r.t the Frobenius norm. On the other hand, we improve accuracy in each step by shrinking the which leads to a better approximation of the rank function. Thus, we expect converges to the solution of problem (1) as . We will analytically show in Section 4 that this sequence of matrices converges to the solution of problem (1). In the rest of this section, we will describe the algorithm completely.

3.1. Constrained Optimization of the Rank Approximation Fδ(Z)

As explained before, for some fixed , we solve the following problem which is obtained by substituting the rank by ) in problem (1):

Let Z denote the feasible region in problem (5), and let . Assume 1 . If (we do not have any constraint on i.e. . Otherwise, regarding label constraints, we have 0 or 0 . (0 can be interpreted as 1 or 1). For t + 1 , if (, then can take any value; otherwise, . If j = t + d + 1, then ), we have lower and upper bounds such as , which means Z lies inside a box. Therefore, Z is a convex set. However, problem (5) is generally non-concave since ) is not concave. Recalling the third property of QRA, by choosing an appropriate value for , we can convert problem (5) to a locally concave problem, and solve it using robust methods. Thus, we assume the values of are chosen appropriately and problem (5) is locally concave. We use the PG technique [23] to solve this problem. We calculate the gradient of ) w.r.t the matrix Z. The gradient function is provided in 1 as follows:

Theorem 1. [5, Thm. 1] Suppose that is represented as )) where with the Singular Value Decomposition (SVD) Z = Udiag(, ) : contains the singular values of the matrix ), and is absolutely symmetric and differentiable. Then the gradient of F(Z) at Z is

where .

Recalling (4), we have ) and since f(x) is an even differentiable function, h(y) isabsolutely even(symmetric) and differentiable. Also, by the definition,

Denoting G as the direction of movement in gradient ascent step, we have

In the next step, we must project the point obtained by moving in the direction of gradient onto the feasible region of the problem. Projection onto the feasible region which is a box can be easily described as . Specifically, this projection can be described as in (10).

Now, we have described all of the components of PG. Solution of the problem (5) is obtained by iterating the PG procedure until convergence is reached. In each iteration, Z is updated as

where G is defined in (9) and is the gradient ascent step size. Choosing this step size can be done via cross-validation. We will discuss about choosing the step size in Section 5. Algorithm 1 includes the procedure of the PG method in order to solve the optimization problem in 5.

Algorithm 1 Transductive Imputation of Matrix using Smoothed Rank Function (PG based version) TIM-

In order to enhance the robustness and convergence rate of the proposed algorithm we have also used

the concept of Quasi-Newton minimization approach. We leverage the SPG method as introduced in [14] in algorithm 2. G(Z) in algorithm 2 is defined as in (9).

4. Convergence Analysis

In this section, we investigate convergence of the proposed algorithms in the previous section. We start

with finding reasonable conditions under which, the solution of problem (1) is unique. Unlike the problem in

[5] where all the constraints are affine, the first constraint in problem (1) is nonaffine. We define a secondary

problem as in (12) by just considering the affine constraints.

Let X denote the feasible region of problem (12). Define + 1 as

It can be verified that T is a linear operator. Also, we define the linear operator vec: ,

as an operator which gets a matrix and vectorizes it. Finally, we define as

This operator is linear as it can be considered as a composition of two linear operators. We can rewrite

problem (12) as

where c represents constraint constants. Now, consider the following definition.

Definition 2. Spherical Section Property[24]. The spherical section constant of a linear operator

is defined as

Further, A is said to have the ∆spherical section property if ∆(∆.

It has been proven in [25] that if all entries of the matrix representation of A are identically and inde-

pendently distributed from a zero-mean, unit-variance Gaussian distribution, then, A has the ∆spherical

section property with high probability under some reasonable conditions.

We add 2 assumptions to our problem.

Assumption 1: A has the ∆spherical section property.

Assumption 2: There exists such that rank(.

We have because we have ignored the first constraint in problem (1) to obtain X. Thus .

Recalling part (a) of theorem 2.1 of [24], , we have rank(Z) > rank(). Therefore, , we have

rank(Z) > rank() and this proves uniqueness of the global solution of problem (1).

Let denote the global solution of problem (5) for some fixed . Our next goal is to show that

lim. This is done in the following theorem.

Theorem 2. Assume has the ∆-spherical section property, ) is

a QRA, is defined as in assumption 2 and , and X are defined as before. If represents the

maximizer of ) over Z, then

Proof. We have

The first inequality is correct since is the maximizer of over Z. The second inequality is correct since

rank(and therefore has zero singular values. Considering the definition of ), we have

.

Taking lemmas 3 and 4 of [5] into account, (19) is resulted from (18) as:

This is followed immediately by

5. Simulation Results

In this section, we provide simulations to compare our proposed algorithms to state-of-the-art ones on

three well-known real datasets. Several studies have been conducted to address the transduction with MC

task. We explain about the datasets taken into account and the methods considered in our simulations in

the two following subsections.

5.1. Datasets

• Yeast: This biological dataset is studied for Yeast gene functional classification task by Elisseeff and

• CAL500: a collection of semantic information about music is provided in this dataset [27]. This

• Music Emotions: This dataset is utilized to discover the emotions existing inside different pieces of

5.2. Methods Investigated in the Simulations

We consider the following methods in our simulations as they have been proven to be the state-of-the-art

methods in the literature.

• MC-1: Goldberg, et al., formulated the problem for the first time in [12], and they leveraged low-rank

• Maxide: This method is introduced by Xu, et al., in [12]. Their proposed method called Maxide

• SRF+SVM: In this method, direct imputation by concatenation of labels and the data is not em-

• TIM-SRF: TIM-SRF is our proposed method. We have provided two algorithms for implementation

5.3. Missing Scenarios

Two main set of simulations are considered, each representing a different missing pattern. We provided

the results of these two scenarios in Tables I and II, respectively. We discuss the simulations results in two

subsections. The evaluation of our proposed methods and the other discussed algorithms is based on the

area under the curve (AUC). The computational runtime is also measured in seconds on an Intel(R) Core

(TM) i7-2600K CPU @3.40 GHz system.

5.3.1. Random Missing Pattern

First, we assume the missing entries are uniformly selected from the concatenated data. This setting is

considered in [12], where the sampling method on the labels is completely at random. The results of simu-

lations for this scenario are reflected in Table 5.3.2. The observation percentage values are: 80%, 60%, 40%,

and 20%. Let denote the observation percentage. We provide detailed analyses of the results as follows:

On the Music Emotions data, TIM-SRF2 outperforms other methods both in terms of accuracy and

computational runtime. In addition, TIM-SRF1 performs closely similar to TIM-SRF2 with slight infe-

riority and is second in terms of AUC except for method is the second best with

slight difference. On the CAL500 dataset, the best accuracy performance for

TIM-SRF2. For the rest of values, TIM-SRF1 outperforms the other methods. TIM-SRF1, however,

owns the minimum runtime complexity for the CAL500 case. On the Yeast dataset, TIM-SRF2 outper-

forms other methods for 60%, and 80%. TIM-SRF1 achieves the best accuracy for

while the best runtime is achieved by Maxide algorithm. It is worth noting that TIM-SRF2 is faster than

TIM-SRF1 when

5.3.2. Random Missing Pattern + Block loss on Labels

In this scenario, in addition to the random missing mask, 10% of the labels are chosen as a whole block

which is entirely missing, i.e., ten percents of the instances do not have any assigned labels, and are therefore

considered as the test part. Again, the values 20%, 40%, 60%, 80% are considered for in this scenario. It

is worth noting that, random label rows which are selected to be omitted could be merged together and

considered as a whole block loss. On the Music Emotions dataset, Maxide method outperforms the other

methods except for shows the best performance. The lowest time complexity belongs to TIM-SRF2. The accuracy measure of the method TIM-SRF2 is close to Maxide and both

TIM-SRF methods outperform the accuracy of MC-1. On the CAL500, Maxide algorithm achieves the

highest accuracy. The second best accuracy goes to TIM-SRF2. In terms of runtime, TIM-SRF1 and

TIM-SRF2 are the fastest methods of all. On the Yeast dataset, the method SRF+SVM has the highest

accuracy. This observation can be reasoned as follows: Knowing that there is a in the labels in this scenario,

the methods which concatenate the two matrices may not perform well since the adversely affects their

performance. However, the SRF+SVM method considers the initial phase of completion simply on the data

matrix and is therefore more efficient in completion since the is not taken into account. The second phase

is SVM implementation which is used for the prediction. SVM is computationally complex and as a result,

the runtime of this method is far larger than the other methods although the accuracy is improved. The

other methods show superior performance when the labels are not forced to have . The second best method

on . For the rest of values, TIM-SRF2 has the second best accuracy performance.

In terms of the complexity, Maxide goes to the second ranking.

6. Conclusion

In this paper, the general problem of semi-supervised multi-label learning is addressed. We have taken

the advantage of concatenating the label and feature matrix to enhance the accuracy of imputation. We

have proposed a new optimization model based on the Smoothed Rank Function (SRF) approximation.

Two novel algorithms (TIM-SRF1, and TIM-SRF2) are proposed using Projected Gradient (PG), and

Spectral Projected Gradient (SPG) methods. These methods are employed to reduce the complexity as they

are computationally efficient. We have provided convergence analysis for our algorithms as well.

Our simulation results reveal robustness and superiority of our proposed algorithms in prediction accuracy

in various settings. We have implemented simulations on real datasets in two main scenarios:

• Random Missing Pattern

• Random Missing Pattern + block loss on Labels

Low observation rates are common in practical settings. Our simulations in the first scenario, illustrate

that the proposed algorithms have improved the results of state-of-the-art methods even up to 10% in terms

of the accuracy in such cases. Moreover, for higher observation rates, the AUC is enhanced by 3% on

average. The computational runtime of TIM-SRF2 is up to 4 times lower than other mentioned methods in the first scenario. In the latter, in spite of slightly lower AUC in comparison to Maxide, TIM-SRF1

and TIM-SRF2 outperformed Maxide in terms of complexity in some cases.

Acknowledgements

This research did not receive any specific grant from funding agencies in the public, commercial, or

not-for-profit sectors.

References References

[11] M. A. Kiasari, G.-J. Jang, M. Lee, Novel iterative approach using generative and discriminative models for classification

[12] A. Goldberg, B. Recht, J. Xu, R. Nowak, X. Zhu, Transduction with matrix completion: Three birds with one stone, in:

[13] M. Xu, R. Jin, Z.-H. Zhou, Speedup matrix completion with side information: Application to multi-label learning, in:

[14] E. G. Birgin, J. M. Mart´ınez, M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal

[15] Q. Wang, L. Ruan, Z. Zhang, L. Si, Learning compact hashing codes for efficient tag completion and prediction, in:

[16] Y. Luo, T. Liu, D. Tao, C. Xu, Multiview matrix completion for multilabel image classification, IEEE Transactions on

[17] Z. Lin, G. Ding, M. Hu, J. Wang, X. Ye, Image tag completion via image-specific and tag-specific linear sparse reconstruc-

[19] X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz, N. Sebe, Analyzing free-standing conversational groups: A multimodal

[20] B. Wu, S. Lyu, B. Ghanem, Constrained submodular minimization for missing labels and class imbalance in multi-label

[21] M. Fan, D. Zhao, Q. Zhou, Z. Liu, T. F. Zheng, E. Y. Chang, Distant supervision for relation extraction with matrix

[24] K. Dvijotham, M. Fazel, A nullspace analysis of the nuclear norm heuristic for rank minimization, in: Acoustics Speech

[25] H. Mohimani, M. Babaie-Zadeh, C. Jutten, A fast approach for overcomplete sparse decomposition based on smoothed l0

[26] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Advances in neural information processing

[27] D. Turnbull, L. Barrington, D. Torres, G. Lanckriet, Semantic annotation and retrieval of music and sound effects, IEEE

[28] K. Trohidis, G. Tsoumakas, G. Kalliris, I. P. Vlahavas, Multi-label classification of music into emotions., in: ISMIR, Vol. 8,

Table 1: Simulation results for the scenario 5.3.1 in terms of AUC and simulation time. Observation rates 20%, 40%, 60%, 80%

Table 2: Simulation results for the scenario 5.3.2 in terms of AUC and simulation time. Observation rates 20%, 40%, 60%, 80%

designed for accessibility and to further open science