Image Annotation Incorporating Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors

2015·Arxiv

Abstract

Abstract

Tag-based image retrieval (TBIR) has drawn much attention in recent years due to the explosive amount of digital images and crowdsourcing tags. However, TBIR is still suffering from the incomplete and inaccurate tags provided by users, posing a great challenge for tag-based image management applications. In this work, we proposed a novel method for image annotation, incorporating several priors: Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors. Highly representative CNN feature vectors are adopt to model the tag-visual correlation and narrow the semantic gap. And we extract word vectors for tags to measure similarity between tags in the semantic level, which is more accurate than traditional frequency-based or graph-based methods. We utilize the accelerated proximal gradient (APG) method to solve our model efficiently. Extensive experiments conducted on multiple benchmark datasets demonstrate the effectiveness and robustness of the proposed method.

1 Introduction

The prevalence of social network and digital photography in recent years makes image retrieval an urgent need. Image retrieval methods can be classified into two categories: content-based image retrieval (CBIR) and tag-based image retrieval (TBIR). The performance of CBIR algorithms are limited due to the semantic gap between the low-level visual features used to represent images and the high-level semantic meaning behind images. Tags can represent the semantics of images more precisely than low-level visual features, giving rise to research on TBIR.

However, tags are usually noisy and incomplete due to the arbitrariness of user tagging behaviors, leading to performance degradations of TBIR systems. What’s more, manual annotation is laborious, error prone, and subjective, making automatic image annotation an attractive research task.

Many machine learning methods have been developed for image annotation. They can be roughly grouped into three categories: supervised methods, unsupervised methods and semi-supervised methods.

Supervised methods use the tagged images to train a dictionary of concept models and formulate image annotation as a supervised learning problem. They annotate images using the likelihood between images and tags. [1] formulates the annotation problem in a probabilistic framework and images are represented as bags of localized feature vectors. [2] learns a two-dimensional Multi-resolution Hidden Markov Model (2D-MHMM) on a fixed-grid segmentation of all category examples. [3] models image annotation procedure as a translation problem between image blobs and tags.

Unsupervised methods, e.g. search based-methods, learn the distribution of images and tags and annotate tags among clusters. Search-based methods always search in the feature space to find the most relevant images to the query image, and transfer tags to it using various tag transfer algorithms [4,5,6]. JEC [5] demonstrates that simple baseline algorithm can achieve high performance. TagProp [4] applies metric learning in the neighborhood of the feature space to annotate query images.

In recent years, semi-supervised approaches have been proposed in this field [7,8,9]. Semi-supervised algorithms can exploit the unlabelled information to improve the learning procedure and achieve satisfactory performance. [8,10] models the annotation task as a matrix completion problem, assuming the low-rankness property of the underlying matrix. [11] combined language model with matrix completion by assuming the independency of tags. Kernel trick and metric learning are exploited in [12] to capture the nonlinear relationships between visual features and semantics of the images. Semisupervised relational topic model (ssRTM) is exploited to explicitly model the image content and their relations [13].

To utilize the large amount of unlabeled dataset for removing noisy tags and completing the missing ones, we propose a semi-supervised method. We formulate the annotation task as a transduction matrix completion problem, taking the following four priors into consideration:

1. Low-rankness. Many methods formulated the image annotation problem in a matrix completion framework by constructing and refining the image-tag matrix [7,9,12,11]. Existing works have demonstrated that semantic space spanned by tags can be approximated by a much smaller subset of words derived from the original space [14]. As text information, tags are consequently subjected to such subset property [7]. According to the subset property, we assume that the image-tag matrix is a low rank matrix. Thus we can exploit the low rank matrix completion techniques to complete the matrix, thereby annotating the images.

2. Tag Correlation. Tags have high level semantic meanings and often appear correlatively at the semantic level. However, traditional methods treat tags merely as labels, reducing the annotation task as a multi-label classification problem. In recent years, researchers have explored the relation among tags. Graph-based methods calculated the semantic correlation among tags using the WordNet distance [15]. Frequency-based methods [9,16] estimated tag correlations using Jaccard coefficient or co-occurrence in text search results. Jensen-Shannon divergence is introduced in the Flickr Distance to make the algorithm more precise and reasonable [17]. However, these methods are still imprecise and inefficient. In this work, we utilize the vector representations for tags instead of labels. Word vectors [18], which are seldom used in this field, can present a much higher level semantic meanings than labels, thus we can measure the tag similarity much more easily and precisely.

3. Tag-Visual Correlation. Tag-Visual Correlation describes the correlation between the content level and the semantic level. Visually similar images often belong to similar themes and thus are annotated with similar tags. This prior has been widely explored in the image classification field [19,20]. However, there still exists a seman-

tic gap between the content level and the semantic level. Traditional methods usually adopt low level image features, such as color, texture or shape descriptors, to represent the images, which are not so correlated with the semantic level. To narrow the semantic gap and make full use of the correlation property, we utilize high level visual features in our model, such as DeCAF6, which demonstrate much stronger tag-visual correlation than low level visual features.

4. Inhomogeneous Errors. Tagging errors come from two aspects: missing tags and noisy tags. Since human-beings are relatively reasonable, we should assume that the tagging results are reasonably accurate. We can observe from the datasets that one image usually has relation with only a few tags, but we have to calculate its association with hundreds or even thousands of tags. For example, images from the MIRFlickr-25K have about 12.7 tags on average [21], but the dataset has 1, 386 unique tags, which means that each image should only be annotated with less than 1% of all the tags. Hence users are more likely to adding noisy tags than missing noisy tags since there are too many unrelated tags. And the errors are mainly composed of noisy tags rather than missing tags. Thus we should treat these two errors with different strategies. We should put more emphasis on denoising rather than completing, paying more attention to the annotated tags rather than the unannotated ones. In other words, if an image is not originally annotated with a tag, it is more likely that they really have no relation at all.

Existing methods never model these two kinds of errors separately. They simply model the errors as Laplacian noise [7] or Gaussian noise [9]. To our knowledge, our model is the first to model the missing errors and noisy errors separately. The model can further adapt to different datasets according to their noise levels.

The novelties and main contributions of this paper are summarized as follows:

– We propose a new image annotation model that incorporates four priors: Lowrankness, Tag Correlation, Tag-Visual Correlation, and Inhomogeneous Errors.

– We utilize the word vectors and CNN features for the tag and the visual features, respectively. These high level features can narrow the semantic gap effectively. It is the first time to utilize both the features for image annotation.

– We model tag correlation and tag-visual correlation in different ways according to their semantic levels.

– We model two kinds of errors separately, the model can adapt to different datasets according to the noise level.

– We utilize the APG to solve our model efficiently.

The most related work to our model is LRES [7]. In their work, the authors formulated the image annotation task as a Robust PCA [22] framework, decomposing the original tag matrix into a refined tag matrix and a sparse error matrix. LRES also takes the tag correlation and tag-visual correlation into consideration and achieves good performance. However, our model is different from LRES in several aspects. First, our model measures tag correlation and tag-visual correlation using different models according to their different semantic levels, rather than using the same Graph Laplacian model in LRES. Second, we adopt more representative features such as CNN features and word vectors to narrow the semantic gap. Third, we do not model the error matrix simply as a sparse matrix, since thee errors are inhomogeneous and the distribution varies across different datasets.

2 Our Image Annotation Model

Here we model the aforementioned four priors one after another.

2.1 Low-Rankness

Denote the image collection I = {i1, i2, . . . , im}, where m is the size of the image set. All original tags appearing in the set form a tag set W = {w1, w2, . . . , wn}, where n denotes the total number of unique tags. We can construct a binary matrix ˆT whose element ˆTiindicates the relation between image ii and tag wj , i.e. if ii is annotated with tag wj, ˆTi1, otherwise ˆTi0. We use T to represent the refined tag matrix, where Ti(0, 1) the confidence score of assigning wj to ii. As mentioned above, we want the refined matrix T to be low rank. Since the low-rankness constraint on T is NP-hard to solve, we replace it with the standard relaxation, the trace norm, i.e. sum of singular values : .

2.2 Tag Correlation Using Word Vectors

To narrow the semantic gap, we extract word vectors for each tag rather than treating tags merely as labels. Word vectors contain rich semantic information, e.g. semantic similarity. We denote the word vectors as WV = {wv1, wv2, . . . , wvn}. Given the completed tag matrix, T i and T j are the ith and jth columns of the tag matrix T. Thus we can measure the correlation between tag i and tag j in two ways: 1) similarity between word vectors wv1 and wv2, 2) similarity between tag vectors T i and T j.

The tag correlation prior can be enforced by solving the following optimization

where T i T jmeasures the similarity between tag vectors T i and T j and S imeasures the similarity between word vectors wvi and wv j. The formulation forces tag vectors with large similarities also have large similarity in their corresponding word vectors and vice versa, which essentially embodies the tag correlation prior.

The formulation can be rewritten as Tr(T T LT), where L = diag(S T1) S is the Graph Laplacian [23]. In our formulation, we define S icos(wvi, wv j).

2.3 Tag-Visual Correlation Using CNN Features

The tag-visual correlation is not as strong as the correlation between tags owing to the semantic gap. Thus we just formulate the problem in a widely used model [9], which is much more simple and intuitive compared with the Graph Laplacian framework. Denote the image visual features as matrix V, where each visual image is represented as a row vector in V Rm. Given the visual feature matrix, we can compute the visual similarity between ii and i j as VTi V j, where VTi and VTj are the ith and jth rows of matrix V. Given the completed tag matrix, we can compute the similarity between ii and i j basing on the overlap between their corresponding tags, i.e., T Ti T j, where T Ti and T Tj are the ith and jth rows of the tag matrix T [9]. To model the aforementioned tag-visual correlation, we expect |T Ti T j VTi V jto be as small as possible. Thus we can model the tag-visual correlation as T Ti T j VTi V jTT T VVT.

2.4 Inhomogeneous Errors

To model the inhomogeneous errors, we set different weight to the annotated positions and unannotated positions separately: ( ˆT T)( ˆT T)representS the positions where the images are annotated with tags, Pand Pare projection operators, and are positive weighting parameters. and will change adaptively in different datasets according to their noisy levels. Different from the assumption of sparse errors [7], we model the errors using the Frobenius norm since we observe that large scale noisy datasets tend to be contaminated with dense Gaussian noises rather than Laplacian noises. Experiments on noisy datasets have confirmed our assumption.

2.5 Object Function Formulation:The Four Priors Model

Based on the terms regarding low-rankness, tag correlation, tag-visual correlation and inhomogeneous errors, we formulate the objective function as follows:

We set 1 for computational efficiency, and denote the nuclear norm as g(T) and the other terms together as f(T). And F(T) = g(T) + f(T), where g() is nonsmooth and f() is smooth. We pursuit an effective iterative procedure to solve this optimization based on Accelerated Proximal Gradient Method (APG) [24].

The proposed Four Priors method belongs to transductive learning category, which means it reasons from both labeled and unlabeled data. We can further turn it into a inductive model using traditional machine learning approaches [25].

3 Solving the Four Priors Model

3.1 Accelerated Proximal Gradient Method

Given the following unconstrained problem

where g() is nonsmooth, f() is smooth and its gradient is Lipschitz continuous. To avoid the computation of subgradient, proximal gradient algorithms minimize a sequence of separable quadratic approximations to F(X), denoted as Q(X, Y), formed at specially chosen points Y

APG set Yk = Xk (Xk Xk) for a sequence {bk} satisfying b2kbkb2k to get an O(k) convergence rate. The APG method is described in Algorithm 1.

The main advantage of the APG method is that the minimizer Xkhas a simple or even closed-form solution when the g() is norm or nuclear norm [7]. It is obvious that the APG method naturally fits for the Four Priors model. We estimate the L f using backtracking method and calculate the f(T):

where Pand Pare the adjoint operators of Pand P, respectively. Basing on Eq. (5) and Eq. (6) we can obtain the subproblem (Step 4 in Algorithm 1) for our model.

where UVT is the singular value decomposition (SVD) of Mk and S ) is the singular

value thresholding operator [26].

4 Experimental Evaluation

4.1 Datasets and Experimental Setup

The proposed algorithm is denoted as Four Priors and is evaluated on two well known benchmark datasets: MIRFlickr-25K , Corel5K and Labelme. MIRFlickr-25K is collected from Flickr. Compared to the Corel5K, tags in Labelme and MIRFlickr-25K are rather noisy and many of them are misspelled or meaningless words. Hence, a preprocessing is performed. We match each tag with entries in a Wikipedia thesaurus and only retain the tags in accordance with Wikipedia. We use the pre-trained word and phrase vectors [18] to extract tag vectors from the tags in these two datasets. To narrow the semantic gap, we utilized DeCAF [27] to extract the DeCAF6 features, which have high level semantic meanings.

Table 1: Statistics of 3 Datasets Statistics

We compare the proposed Four Priors model with the state-of-the-art methods, including matrix completion-based model LRES [7], TCMR [11], RKML [12], search-based algorithms (i.e. JEC [5], TagProp [4], and TagRelevance [6]), mixture models (i.e. CMRM [28] and MBRM [29]), tag recommendation approaches (i.e. Vote+ [30] and Folk [31]), co-regularized learning model FastTag [32] and Bayesian network model InfNet [33]. Note that the parameters of adopted baselines are also carefully tuned on the validation set of Corel5K with corresponding proposed tuning strategy.

We measure all the algorithms in terms of average precision@N (i.e. AP@N), average recall@N (i.e. AR@N) and coverage@N (i.e. C@N). In the top N completed tags, precision@N is to measure the ratio of correct tags in the top N competed tags and recall@N is to measure the ratio of missing ground-truth tags, both averaged over all test images. Coverage@N is to measure the ratio of test images with at least one correctly completed tag.

4.2 Evaluation of Tag Completion on Corel5K

We adopt the tuning strategy used in [9] to set and 8. Table 2 demonstrates the performance comparisons. Due to the space limit, we only report results when N = 2, 3, 5, 10.

4.3 Evaluation of Tag Completion on MIRFlickr-25K and Labelme

We tuned and 5 using cross validation on MIRFlickr-25K. The two datasets use the same parameters since they are both noisy. Note that as the

Table 2: Performance Comparison on Corel5K Dataset

datasets become large or noisy, the semantic gap expands, leading to the decrease of . And varies according to different noisy level.

Table 3 and Table 4 demonstrate the performance comparisons. Note that Folk and InfNet is unable to run on the large dataset MIRFlickr-25K. Besides, search-based baselines (JEC, TagProp, and TagRel) cost a lot of time to run on the dataset.

Table 3: Performance Comparison on Labelme Dataset

Table 4: Performance Comparison on MIRFlickr-25K Dataset

4.4 Observations on Experimental Results

We observe that: 1) Generally algorithms achieve better performance on Corel5K, since tags in MIRFlickr-25K are more noisy. 2) Matrix completion-based methods, such as Four Priors, LRES and TCMR, usually achieve the best performances. 3) Four Priors shows increasing advantage to LRES as the data become more and more noisy, justifying our assumption and model of the noises. 4) Four Priors nearly outperforms all the other algorithms in all cases. 5) Performance on MIRFlickr-25K in some sense provides an evidence for the robustness of Four Priors.

5 Conclusions and Future Work

We have proposed an effective mothod for image annotation. The model takes four priors into consideration: Low-Rankness, Tag Correlation, Tag-Visual Correlation and Inhomogeneous Errors. This is the first work to model inhomogeneous errors in the image annotation field. We utilize word vectors to calculate tag correlation and CNN features to measure tag-visual correlation. It achieves the state-of-the-art performance in extensive experiments conducted on benchmark datasets for image annotation.

Our future work focuses on several directions. First, we will explore other methods, e.g. Gaussian Mixture Model (GMM), to model the error term more precisely. Second, we want to exploit more efficient ways to optimize our model. Third, we may extend the range of candidate tags and search more proper tags by jointly learning the distributed representations of both images and corresponding tags, inspired by some embedding techniques [34,35,36,37,38].

References

1. Carneiro, G., Chan, A.B., Moreno, P.J., Vasconcelos, N.: Supervised learning of semantic classes for image annotation and retrieval. TPAMI (2007)

2. Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling ap- proach. TPAMI (2003)

3. Duygulu, P., Barnard, K., de Freitas, J.F., Forsyth, D.A.: Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In: ECCV. (2002)

4. Guillaumin, M., Mensink, T., Verbeek, J., Schmid, C.: Tagprop: Discriminative metric learn- ing in nearest neighbor models for image auto-annotation. In: ICCV. (2009)

5. Makadia, A., Pavlovic, V., Kumar, S.: A new baseline for image annotation. In: ECCV. (2008)

6. Li, X., Snoek, C.G., Worring, M.: Learning social tag relevance by neighbor voting. TM (2009)

7. Zhu, G., Yan, S., Ma, Y.: Image tag refinement towards low-rank, content-tag prior and error sparsity. In: ACM MM. (2010)

8. Goldberg, A., Recht, B., Xu, J., Nowak, R., Zhu, X.: Transduction with matrix completion: Three birds with one stone. In: NIPS. (2010)

9. Wu, L., Jin, R., Jain, A.K.: Tag completion for image retrieval. TPAMI (2013)

10. Fan, M., Zhao, D., Zhou, Q., Liu, Z., Zheng, T.F., Chang, E.Y.: Distant supervision for relation extraction with matrix completion. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, Association for Computational Linguistics (2014) 839–849

11. Feng, Z., Feng, S., Jin, R., Jain, A.K.: Image tag completion by noisy matrix recovery. In: ECCV. (2014)

12. Feng, Z., Jin, R., Jain, A.: Large-scale image annotation by efficient and robust kernel metric learning. In: ICCV. (2013)

13. Niu, Z., Hua, G., Gao, X., Tian, Q.: Semi-supervised relational topic model for weakly annotated image recognition in social media. In: CVPR. (2014)

14. Zhao, R., Grosky, W.I.: Narrowing the semantic gap-improved text-based web document retrieval using visual features. TM (2002)

15. Jin, Y., Khan, L., Wang, L., Awad, M.: Image annotations by combining multiple evidence & wordnet. In: ACM MM. (2005)

16. Cilibrasi, R.L., Vitanyi, P.: The google similarity distance. TKDE (2007)

17. Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr distance. In: ACM MM. (2008)

18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

19. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: A large data set for non- parametric object and scene recognition. TPAMI (2008)

20. Zhang, H., Berg, A.C., Maire, M., Malik, J.: Svm-knn: Discriminative nearest neighbor classification for visual category recognition. In: CVPR. (2006)

21. Huiskes, M.J., Lew, M.S.: The mir flickr retrieval evaluation. In: MIR ’08: Proceedings of the 2008 ACM ICMI. (2008)

22. Cand`es, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? JACM (2011)

23. Chung, F.R.: Spectral graph theory. (1997)

24. Toh, K.C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. PJO (2010)

25. Gammerman, A., Vovk, V., Vapnik, V.: Learning by transduction. In: UAI. (1998)

26. Cai, J.F., Cand`es, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization (2010)

27. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)

28. Jeon, J., Lavrenko, V., Manmatha, R.: Automatic image annotation and retrieval using cross- media relevance models. In: ACM SIGIR. (2003)

29. Feng, S., Manmatha, R., Lavrenko, V.: Multiple bernoulli relevance models for image and video annotation. In: CVPR. (2004)

30. Sigurbj¨ornsson, B., Van Zwol, R.: Flickr tag recommendation based on collective knowl- edge. In: ACM WWW. (2008)

31. Lee, S., De Neve, W., Plataniotis, K.N., Ro, Y.M.: Map-based image tag recommendation using a visual folksonomy. PRL (2010)

32. Chen, M., Zheng, A., Weinberger, K.: Fast image tagging. In: ICML. (2013)

33. Metzler, D., Manmatha, R.: An inference network approach to image retrieval. In: CIVR. (2004)

34. Fan, M., Zhou, Q., Chang, E., Zheng, T.F.: Transition-based knowledge graph embedding with relational mapping properties. (2014)

35. Fan, M., Cao, K., He, Y., Grishman, R.: Jointly embedding relations and mentions for knowl- edge population. arXiv preprint arXiv:1504.01683 (2015)

36. Fan, M., Zhou, Q., Zheng, T.F., Grishman, R.: Probabilistic belief embedding for knowledge base completion. arXiv preprint arXiv:1505.02433 (2015)

37. Fan, M., Zhou, Q., Zheng, T.F.: Learning embedding representations for knowledge infer- ence on imperfect and incomplete repositories. CoRR abs/1503.08155 (2015)

38. Fan, M., Zhou, Q., Zheng, T.F., Grishman, R.: Large margin nearest neighbor embedding for knowledge representation. CoRR abs/1504.01684 (2015)

designed for accessibility and to further open science