Prediction using labels known as supervised learning is tackled in many papers in the literature, and several efficient approaches are introduced to this end [1]. In many real-world applications, missing information or random sampling scenario is an inseparable part of the problem [2, 3]. One of the most common approaches to address unobserved or quantized attributes is utilizing Matrix Completion (MC) methods [4, 5], and [6]. Combining the two aforementioned concepts, the classification, prediction, or multi-label learning tasks are considered in the missing information scenario. This problem can be generally addressed both directly, or indirectly. The indirect approach addresses the imputation and prediction tasks separately [7, 8], while the direct approaches introduce a unique platform where both tasks are conducted simultaneously [[9, 10, 11].
The direct Transduction with MC task, introduced in [12], not only addresses the multi-label problem but it also imputes the unobserved entries in a unique framework, simultaneously. In their proposed method MC-1, the labels and data matrices are concatenated forming a larger stacked matrix. Then, they minimize the penalized nuclear norm of the stacked matrix assuming the low-rank property holds for the data matrix and for the stacked matrix consequently (in linear models). In their model, nuclear norm approximation of the rank function is utilized as its convex surrogate. In [13], the authors have suggested an algorithm which is more robust than the one proposed in [12], and also outperformed their accuracy in terms of the Average Precision (AP) measure.
In our paper, we introduce a novel direct method to impute the labels and missing data together. To this end, we pose a new optimization problem model, approximating the rank of the stacked matrix with a smoothed function. The Smoothed Rank Function (SRF) concept, leveraged in our proposed model, leads to the differentiability property [5]. Thus, we take the advantage of using the Projected Gradient (PG), and Spectral Projected Gradient (SPG) method [14], which are more robust and faster than Subgradient based methods derived from the penalized nuclear norm cost functions. It is worth noting that the problem model we introduce is different from a simple MC task since the hard labels force additional nonaffine constraints. In our work, we introduce two new algorithms based on projected GD and SPG. We have also achieved noticeable simulation results which illustrate our methods’ outperformance both in accuracy and complexity in most of the cases compared to state-of-the-art methods. Detailed simulation analysis is provided in Section 5. We also provide convergence analysis for our proposed algorithms.
Other authors have used the concatenation concept for different purposes such as the image classification scenario, and have leveraged the semi-supervised transduction with MC for tagging and classifying images [15], where the authors propose a novel Hashing approach for Tag Completion and Prediction. In [16] and [17], the applications of this model to social image tagging and image classification are investigated. In [18], a novel matrix-completion method called Inductive Matrix Completion is applied to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. In [19], the authors use the ADMM technique for optimizing the augmented Lagrangian function for the sake of MC. The matrix in their work is a concatenated version based on the similar idea introduced in [12]. Their purpose is to carry out head and body pose estimation which could be considered as one of the wearable device applications. In [20], the authors use ADMM to optimize a class of sub-modular cost functions in order to deal with the missing information and class imbalance in multi-linear learning simultaneously. In [21] two noise-tolerant optimization models, DRMC-b and DRMC-1, for distantly supervised relation extraction task from a novel perspective are introduced.
The rest of the paper is organized as follows: In Section 2, the problem formulation is provided. In Section 3, we review the smoothed rank function approximation and explain the motivation of the new objective function taken into account. We also include our proposed algorithms in this section. Next, in Section 4, we analyze the convergence of our proposed algorithms. We illustrate the performance of our method and compare it to several state-of-the-art methods in Section 5. Finally, we conclude the paper in Section 6.
Let be feature vectors associated with n items. These vectors are combined together in a row-wise fashion to create a feature matrix,
;
]
. Let
be n classification label vectors of size t. These vectors are combined together to create the label matrix,
;
]
. In missing data scenario, some of the entries in X and Y are observed and the others are Missing Completely at Random (MCAR) [22]. We assume some of the entries in X and Y are randomly lost. Let Ω
and Ω
denote the sets of observed entries in X and Y, respectively. If a specific feature relating to an item is not reported or in other words, is not observed in these matrices, then it is reported as 0 (In older literature on missing data, NA (Not Assigned) was used to denote the missing entries). Thus, the entries of the matrix Y are reported as
1 or 1 for classified labels and 0 for missing labels.
Our goal is to predict the missing labels for (
as well as imputing the missing features in X. To solve this generally ill-posed problem, we assume that X and Y are jointly produced by an underlying low rank matrix [12]. We assume
is the low rank pre-feature matrix. Let
denote the soft labels associated with
. By the assumption,
is produced as
+ b, where
is the weight matrix and
is the bias vector. The hard labels
are generated from soft labels via some function (In general, Sign function or the Logistic function is used). Let
] be the soft label matrix. Since
+ b, the columns of the soft label matrix
are linear combinations of the columns in [
] where 1 is the all-1 vector. Thus, rank([
]). It is assumed that
is low-rank, therefore [
] is also low-rank, because rank([
rank(
) + 1. Let Z be the
+ 1) stacked matrix [
]. Our goal is to recover this stacked matrix in which the unknown labels are also imputed as built-in parts of a global matrix. The recovered stacked matrix should be consistent with the observed data. Additionally, Z is desired to be of low-rank. Thus, the following constrained optimization problem is obtained:
In our proposed problem model, we substitute the function rank(Z) with our proposed smoothed function, and do not relax the hard constraints. Further elaborations on our model and algorithms are provided in the subsequent Section.
The concept forming our algorithm is based on approximating the rank function with a smooth function and then improve this approximation by tuning the smoothing function. This concept is introduced in [5] to solve the MC problem. Generally, the rank function is not differentiable and gradient methods cannot be efficiently applied to problems containing the rank function. However, we use a smooth differentiable function to approximate the rank function. This will allow us to use the gradient methods in order to optimize the smooth function. Then, we update and tune the parameter of the smooth function to improve the accuracy of our approximation. Let denote the vector containing all of the singular values of the matrix Z where
) assuming
. We have rank(
. Also, we have
)) where
is the Kronecker delta function,
Next, we seek for a class of appropriate functions approximating the rank function. The following definition introduces this class of functions [5]:
Definition 1. Qualified Rank Approximation (QRA). A function [0, 1] is called a qualified rank approximation if
Further, we define (
). Many functions may be found that satisfy the QRA conditions. Through this paper, we consider
which satisfies the QRA conditions. It can be observed that
) converges in a pointwise fashion to the Kronecker delta function as
0.
Assume f(x) is a QRA function. Thus, we have
Now, we define
This is an approximation of the rank function. Instead of the rank function, we solve the optimization problem for ) which gives us an approximation of the solution of problem (1). Thus, we use the previous solution as a warm-start and the new value of
to solve the new optimization problem. After iterating this procedure, we will obtain a sequence of matrices
, where each term is obtained by optimizing
) for some fixed
using the previous solution as the warm-start in each iteration. Since different
values are close to each other and
) is continuous, we expect
and
be close to each other w.r.t the Frobenius norm. On the other hand, we improve accuracy in each step by shrinking the
which leads to a better approximation of the rank function. Thus, we expect
converges to the solution of problem (1) as
. We will analytically show in Section 4 that this sequence of matrices converges to the solution of problem (1). In the rest of this section, we will describe the algorithm completely.
3.1. Constrained Optimization of the Rank Approximation Fδ(Z)
As explained before, for some fixed , we solve the following problem which is obtained by substituting the rank by
) in problem (1):
Let Z denote the feasible region in problem (5), and let . Assume 1
. If (
we do not have any constraint on
i.e.
. Otherwise, regarding label constraints, we have
0 or 0
. (0 can be interpreted as 1 or
1). For t + 1
, if (
, then
can take any value; otherwise,
. If j = t + d + 1, then
), we have lower and upper bounds such as
, which means Z lies inside a box. Therefore, Z is a convex set. However, problem (5) is generally non-concave since
) is not concave. Recalling the third property of QRA, by choosing an appropriate value for
, we can convert problem (5) to a locally concave problem, and solve it using robust methods. Thus, we assume the values of
are chosen appropriately and problem (5) is locally concave. We use the PG technique [23] to solve this problem. We calculate the gradient of
) w.r.t the matrix Z. The gradient function is provided in 1 as follows:
Theorem 1. [5, Thm. 1] Suppose that is represented as
)) where
with the Singular Value Decomposition (SVD) Z = Udiag(
,
) :
contains the singular values of the matrix
), and
is absolutely symmetric and differentiable. Then the gradient of F(Z) at Z is
where .
Recalling (4), we have ) and since f(x) is an even differentiable function, h(y) isabsolutely even(symmetric) and differentiable. Also, by the definition,
Denoting G as the direction of movement in gradient ascent step, we have
In the next step, we must project the point obtained by moving in the direction of gradient onto the feasible region of the problem. Projection onto the feasible region which is a box can be easily described as . Specifically, this projection can be described as in (10).
Now, we have described all of the components of PG. Solution of the problem (5) is obtained by iterating the PG procedure until convergence is reached. In each iteration, Z is updated as
where G is defined in (9) and is the gradient ascent step size. Choosing this step size can be done via cross-validation. We will discuss about choosing the step size in Section 5. Algorithm 1 includes the procedure of the PG method in order to solve the optimization problem in 5.
Algorithm 1 Transductive Imputation of Matrix using Smoothed Rank Function (PG based version) TIM-
In order to enhance the robustness and convergence rate of the proposed algorithm we have also used
the concept of Quasi-Newton minimization approach. We leverage the SPG method as introduced in [14] in algorithm 2. G(Z) in algorithm 2 is defined as in (9).
In this section, we investigate convergence of the proposed algorithms in the previous section. We start
with finding reasonable conditions under which, the solution of problem (1) is unique. Unlike the problem in
[5] where all the constraints are affine, the first constraint in problem (1) is nonaffine. We define a secondary
problem as in (12) by just considering the affine constraints.
Let X denote the feasible region of problem (12). Define + 1 as
It can be verified that T is a linear operator. Also, we define the linear operator vec: ,
as an operator which gets a matrix and vectorizes it. Finally, we define
as
This operator is linear as it can be considered as a composition of two linear operators. We can rewrite
problem (12) as
where c represents constraint constants. Now, consider the following definition.
Definition 2. Spherical Section Property[24]. The spherical section constant of a linear operator
is defined as
Further, A is said to have the ∆spherical section property if ∆(
∆.
It has been proven in [25] that if all entries of the matrix representation of A are identically and inde-
pendently distributed from a zero-mean, unit-variance Gaussian distribution, then, A has the ∆spherical
section property with high probability under some reasonable conditions.
We add 2 assumptions to our problem.
Assumption 1: A has the ∆spherical section property.
Assumption 2: There exists such that rank(
.
We have because we have ignored the first constraint in problem (1) to obtain X. Thus
.
Recalling part (a) of theorem 2.1 of [24], , we have rank(Z) > rank(
). Therefore,
, we have
rank(Z) > rank() and this proves uniqueness of the global solution of problem (1).
Let denote the global solution of problem (5) for some fixed
. Our next goal is to show that
lim. This is done in the following theorem.
Theorem 2. Assume has the ∆-spherical section property,
) is
a QRA, is defined as in assumption 2 and
, and X are defined as before. If
represents the
maximizer of ) over Z, then
Proof. We have
The first inequality is correct since is the maximizer of
over Z. The second inequality is correct since
rank(and therefore
has
zero singular values. Considering the definition of
), we have
.
Taking lemmas 3 and 4 of [5] into account, (19) is resulted from (18) as:
This is followed immediately by
In this section, we provide simulations to compare our proposed algorithms to state-of-the-art ones on
three well-known real datasets. Several studies have been conducted to address the transduction with MC
task. We explain about the datasets taken into account and the methods considered in our simulations in
the two following subsections.
5.1. Datasets
• Yeast: This biological dataset is studied for Yeast gene functional classification task by Elisseeff and
• CAL500: a collection of semantic information about music is provided in this dataset [27]. This
• Music Emotions: This dataset is utilized to discover the emotions existing inside different pieces of
5.2. Methods Investigated in the Simulations
We consider the following methods in our simulations as they have been proven to be the state-of-the-art
methods in the literature.
• MC-1: Goldberg, et al., formulated the problem for the first time in [12], and they leveraged low-rank
• Maxide: This method is introduced by Xu, et al., in [12]. Their proposed method called Maxide
• SRF+SVM: In this method, direct imputation by concatenation of labels and the data is not em-
• TIM-SRF: TIM-SRF is our proposed method. We have provided two algorithms for implementation
5.3. Missing Scenarios
Two main set of simulations are considered, each representing a different missing pattern. We provided
the results of these two scenarios in Tables I and II, respectively. We discuss the simulations results in two
subsections. The evaluation of our proposed methods and the other discussed algorithms is based on the
area under the curve (AUC). The computational runtime is also measured in seconds on an Intel(R) Core
(TM) i7-2600K CPU @3.40 GHz system.
5.3.1. Random Missing Pattern
First, we assume the missing entries are uniformly selected from the concatenated data. This setting is
considered in [12], where the sampling method on the labels is completely at random. The results of simu-
lations for this scenario are reflected in Table 5.3.2. The observation percentage values are: 80%, 60%, 40%,
and 20%. Let denote the observation percentage. We provide detailed analyses of the results as follows:
On the Music Emotions data, TIM-SRF2 outperforms other methods both in terms of accuracy and
computational runtime. In addition, TIM-SRF1 performs closely similar to TIM-SRF2 with slight infe-
riority and is second in terms of AUC except for method is the second best with
slight difference. On the CAL500 dataset, the best accuracy performance for
TIM-SRF2. For the rest of values, TIM-SRF1 outperforms the other methods. TIM-SRF1, however,
owns the minimum runtime complexity for the CAL500 case. On the Yeast dataset, TIM-SRF2 outper-
forms other methods for 60%, and 80%. TIM-SRF1 achieves the best accuracy for
while the best runtime is achieved by Maxide algorithm. It is worth noting that TIM-SRF2 is faster than
TIM-SRF1 when
5.3.2. Random Missing Pattern + Block loss on Labels
In this scenario, in addition to the random missing mask, 10% of the labels are chosen as a whole block
which is entirely missing, i.e., ten percents of the instances do not have any assigned labels, and are therefore
considered as the test part. Again, the values 20%, 40%, 60%, 80% are considered for in this scenario. It
is worth noting that, random label rows which are selected to be omitted could be merged together and
considered as a whole block loss. On the Music Emotions dataset, Maxide method outperforms the other
methods except for shows the best performance. The lowest time complexity belongs to TIM-SRF2. The accuracy measure of the method TIM-SRF2 is close to Maxide and both
TIM-SRF methods outperform the accuracy of MC-1. On the CAL500, Maxide algorithm achieves the
highest accuracy. The second best accuracy goes to TIM-SRF2. In terms of runtime, TIM-SRF1 and
TIM-SRF2 are the fastest methods of all. On the Yeast dataset, the method SRF+SVM has the highest
accuracy. This observation can be reasoned as follows: Knowing that there is a in the labels in this scenario,
the methods which concatenate the two matrices may not perform well since the adversely affects their
performance. However, the SRF+SVM method considers the initial phase of completion simply on the data
matrix and is therefore more efficient in completion since the is not taken into account. The second phase
is SVM implementation which is used for the prediction. SVM is computationally complex and as a result,
the runtime of this method is far larger than the other methods although the accuracy is improved. The
other methods show superior performance when the labels are not forced to have . The second best method
on . For the rest of
values, TIM-SRF2 has the second best accuracy performance.
In terms of the complexity, Maxide goes to the second ranking.
In this paper, the general problem of semi-supervised multi-label learning is addressed. We have taken
the advantage of concatenating the label and feature matrix to enhance the accuracy of imputation. We
have proposed a new optimization model based on the Smoothed Rank Function (SRF) approximation.
Two novel algorithms (TIM-SRF1, and TIM-SRF2) are proposed using Projected Gradient (PG), and
Spectral Projected Gradient (SPG) methods. These methods are employed to reduce the complexity as they
are computationally efficient. We have provided convergence analysis for our algorithms as well.
Our simulation results reveal robustness and superiority of our proposed algorithms in prediction accuracy
in various settings. We have implemented simulations on real datasets in two main scenarios:
• Random Missing Pattern
• Random Missing Pattern + block loss on Labels
Low observation rates are common in practical settings. Our simulations in the first scenario, illustrate
that the proposed algorithms have improved the results of state-of-the-art methods even up to 10% in terms
of the accuracy in such cases. Moreover, for higher observation rates, the AUC is enhanced by 3% on
average. The computational runtime of TIM-SRF2 is up to 4 times lower than other mentioned methods in the first scenario. In the latter, in spite of slightly lower AUC in comparison to Maxide, TIM-SRF1
and TIM-SRF2 outperformed Maxide in terms of complexity in some cases.
This research did not receive any specific grant from funding agencies in the public, commercial, or
not-for-profit sectors.
[11] M. A. Kiasari, G.-J. Jang, M. Lee, Novel iterative approach using generative and discriminative models for classification
[12] A. Goldberg, B. Recht, J. Xu, R. Nowak, X. Zhu, Transduction with matrix completion: Three birds with one stone, in:
[13] M. Xu, R. Jin, Z.-H. Zhou, Speedup matrix completion with side information: Application to multi-label learning, in:
[14] E. G. Birgin, J. M. Mart´ınez, M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal
[15] Q. Wang, L. Ruan, Z. Zhang, L. Si, Learning compact hashing codes for efficient tag completion and prediction, in:
[16] Y. Luo, T. Liu, D. Tao, C. Xu, Multiview matrix completion for multilabel image classification, IEEE Transactions on
[17] Z. Lin, G. Ding, M. Hu, J. Wang, X. Ye, Image tag completion via image-specific and tag-specific linear sparse reconstruc-
[19] X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz, N. Sebe, Analyzing free-standing conversational groups: A multimodal
[20] B. Wu, S. Lyu, B. Ghanem, Constrained submodular minimization for missing labels and class imbalance in multi-label
[21] M. Fan, D. Zhao, Q. Zhou, Z. Liu, T. F. Zheng, E. Y. Chang, Distant supervision for relation extraction with matrix
[24] K. Dvijotham, M. Fazel, A nullspace analysis of the nuclear norm heuristic for rank minimization, in: Acoustics Speech
[25] H. Mohimani, M. Babaie-Zadeh, C. Jutten, A fast approach for overcomplete sparse decomposition based on smoothed l0
[26] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Advances in neural information processing
[27] D. Turnbull, L. Barrington, D. Torres, G. Lanckriet, Semantic annotation and retrieval of music and sound effects, IEEE
[28] K. Trohidis, G. Tsoumakas, G. Kalliris, I. P. Vlahavas, Multi-label classification of music into emotions., in: ISMIR, Vol. 8,
Table 1: Simulation results for the scenario 5.3.1 in terms of AUC and simulation time. Observation rates 20%, 40%, 60%, 80%
Table 2: Simulation results for the scenario 5.3.2 in terms of AUC and simulation time. Observation rates 20%, 40%, 60%, 80%