When any scene is presented to the human visual system, it rapidly summarizes it through eye fixation on salient regions of the scene. Visual attention, or more particularly selective visual attention is the main reason behind this phenomenon. For more than a decade, researchers are trying to develop computational models of selective attention as its modeling has numerous important applications across different fields like computer vision, robotics etc. [1][2]. Many methods of saliency detection have been reported in existing literature and they can be broadly categorized into two groups: low level or bottom-up methods and learning based methods. Low level methods generally seek inspirations from biological processes. Most of the models from this category follow a general pipeline which was first proposed by the seminal work of Itti et al. [3]. The authors extracted low level features such as color, orientation, texture etc., from images, computed feature specific saliency maps and finally integrated these to produce master saliency map. Center-surround difference operator is usually employed to construct feature-specific saliency maps. Gao et al. [4] also compared center and surround features, using KL-Divergence in order to measure distinctness of a specific pixel and subsequently its saliency. Bruce and Tsotsos [5] conjectured salient regions contain maximum selfinformation relative to their surroundings. Seo and Milanfar [6] proposed a local self-resemblance mechanism based saliency model. Among more recent bottom-up approaches, Murray et al. [7] modeled saliency from a color space perspective. Holzbach and Cheng [8] proposed a method which predicts saliency via calculating dissimilarity between multiple sampling templates. Goferman et al. [9] also exploited mainly low level features for saliency detection; however their model also incorporates face detector for high level feature detection.
Recently, machine learning based approaches have gained popularity because in addition to the low level features, these models also take high level contextual and semantic features into account. As high level features play an important role in driving visual attention, learning based models generally performs better. Judd et al. [10] trained a SVM (support vector machine) classifier based model directly from human eye tracking data by utilizing hand crafted low, mid and high level features. Vig et al. [11] also projected a similar SVM based algorithm but instead of using hand-tuned features their model learns the optimal saliency features automatically from the human eye fixation data. Kavak et al. [12] proposed a multiple kernel based learning approach to saliency detection.
In this paper, we propose an end to end convolutional neural network based model, WEPSAM, for accurate saliency detection. It is a well-known fact that convolutional neural networks (CNN) are very powerful learning systems. From semantic segmentation to object recognition, CNN based models have achieved state of the art performances in a wide range of computer vision tasks. However one major drawback associated with convolutional nets is that their performance critically depends on the size of the dataset. Often large scale datasets, required for proper training of convolutional nets,
Fig. 1: The CNN architecture used for visual saliency prediction.
are not available. To tackle this problem, we introduce a weak data driven pre-training paradigm which proves to be a simple but effective solution. The main objective of our work is not to endorse any particular CNN architecture, but rather to present a new training scheme which can help us to train a CNN much faster (compared to a randomly initialized network) for tasks such asIt saliency prediction where ground truth data is scarce.
The primary contributions of our paper are as follows:
• To the best of our knowledge, this is the first ever attempt of utilizing ImageNet data for weakly pre-training a CNN. Previous models [13, 14] have attempted to pre-train on ImageNet for object recognition task, but it is more prudent to pre-train a model for domain specific task of saliency prediction. The paper thus opens up a neoteric horizon of effectively leveraging enormous image datasets for visual saliency prediction.
• Pre-trained model is then fine-tuned on the actual ground truth fixations. We show that rate of decay of squared error loss of WEPSAM is much faster compared to a randomly initialized CNN network.
• We compare our model on the challenging MIT300 dataset with recent state-of-the-art methods on five popularly used metric for saliency prediction task.
In this section we briefly describe the CNN architecture used for both pre-training and fine tuning stage. We wish to reiterate that the purpose of this work is not propose or use a very deep CNN architecture, but to study the feasibility of leveraging domain specific pre-training for saliency prediction. We use a shallow CNN with only 5 layers inspired from [15] with subtle modifications. The network is shown in Fig. 1. The network consists of three stages of CONV-ReLU-MAX POOL followed by two fully-connected layers, the last of which is subjected to a maxout operation. The input to the network is a RGB image and the output is a 1024 dimensional vector that is resized to a
salience map. This is a 1024-D regression task with element wise squared error loss. Receptive fields of [5
5], [3
3] and [3
3] are used in 1
, 2
and 3
stage on convolution. Receptive field of MAX POOL layer is [2
2]. During pre-training, networks weights were initialized by uniform sampling from a zero mean Normal distribution with standard deviation of 0.01. The bias terms were set to 0.1 at beginning. We used stochastic gradient descent with Nestorov momentum for faster convergence.The learning rate was adaptively decreased from 0.3 at beginning to 10
at end of training. Upon culmination of training, 32
32 map later resized to exact resolution of input image using bilinear interpolation.
In this section we describe our proposed two step hierarchical framework for training a CNN for visual saliency prediction task.
3.1. Weakly Pretraining Stage on ImageNet
In supervised learning paradigm, weak training is a neoteric attempt of reducing human effort for meticulously creating enormous ground truth dataset for large scale learning frameworks. The key idea is to extract auxiliary information from unannotated data. ImageNet [16], for example, has about 1 million natural images. For generating ground truth eye fix-ations on ImageNet, a human operator has to look over the entire dataset; such task is definitely not prudent.
We propose an elegant solution to circumnavigate this
Fig. 2: Training and validation loss of training a CNN model on human eye fixation maps of iSUN and SALICON datasets. Loss is defined as the average of pixelwise squared error between ground truth saliency map and predicted map. It is evident that a weakly pretrained model such as WEPSAM fosters faster convergence rate.
problem. Our work is motivated by the fact that supervised pre-training followed by fine tuning fosters faster convergence rate in CNN[17]. Each RGB image is first down sampled to 128X128X3 and then we create a gray scale saliency map of 32X32 using a graph based saliency model [18]. It is to be noted that the maps produced by [18] only provide an approximation to actual saliency prediction behavior of human visual system. Saliency maps produced by [18] tend to much more diffused compared to actual eye fixation, specially in scarcity of salient objects in an image. We define a filter criterion based on entropy of the maps. Entropy of an image I(x, y) is defined as,where,
denotes the normalized histogram count of
bin. To imitate human eye fixation, it is desired to generate low entropy saliency maps. So, for pre-training, we sort the maps according to increasing order of entropy and select the top 10
entries for pre-training the CNN. We pre-train the CNN model for 500 epochs.
3.2. Fine Tuning of Weak Model
In this stage we use actual ground truth fixations from widely used public databases for fine tuning our previously developed weak trained CNN model. In this stage, we use the same CNN architecture but initialize the network with weights learnt in pre-training step. This ensures that we achieve faster rate of error convergence on training set and simultaneously manifest better generalization performance. Training in this stage has been run for 1200 epochs, after which, both training and validation loss begin to saturate.
In this section we demonstrate experimental results to evaluate the efficacy of the proposed approach. In the first part of our results, we show how pre-training using weak data helps us to train the convolutional network faster. In the second part, we test our model on the challenging MIT300 dataset
Table 1: Quantitative comparison between different saliency models on the challenging MIT300 dataset. Best results are marked in bold. Though MR-CNN has slightly better SIM and NSS metric compared to WEPSAM, complexity of MRCNN is much higher because it trains 3 CNNs at multiple scales. Also, the basic CNN architecture of MR-CNN is more complex compared to WEPSAM.
and compare its performance quantitatively and qualitatively with recent state of the art methods. For fine tuning our network after pre-training we have used ground truths from iSUN [25] and SALICON [26] datasets. iSUN contains 6000 training and 926 validation image-map pairs. SALICON dataset has 10000 training and 5000 validation pairs.
4.1. Effect of Pre-Training
In Fig. 2 we plot the training and validation loss on the combined ground truth maps of iSUN and SALICON datsets. The green lines show the train and validation loss for proposed WEPSAM model while black denotes the metrics for a randomly initialized net of same architecture.
It is evident that pre-training fosters faster decay of training loss compared to a randomly initialized net and simultaneously manifests better generalization accuracy.Specifically
Fig. 3: Comparsion of saliency maps on some exemplary MIT300 test set. Col 1: Actual RGB image, Col 2: Saliency map by our proposed WEPSAM, Col 3: CIW [7], Col 4: CAS [9], Col 5: RARE-2012 [21] . Our proposed model emphasizes only those regions in an image where a human would look at first glance. Competing models, instead, highlight mainly the edges and thereby produces much more diffused map. WEPSAM model is thus superior in identifying semantically important image regions.
at onset of training, train and validation loss for WEPSAM are 8.2and 8.4
respectively, while those of random initialized net are 10.6
and 9.6
. After 400 epochs, train and validation loss for WEPSAM are 7.2
and 7.3
respectively, while those of random initialized net are 7.7
and 7.8
. During weak pre-training weights of our network were learnt so as to approximately imitate human eye fixation model. Thus, during fine tuning, prediction of pre-learnt net is much more coherent with ground truth than a randomly initialized net.
4.2. Performance on MIT300 Database
Next we compare saliency prediction performance on the challenging MIT300 dataset [27, 28]. It is to be noted that WEPSAM was fine-tuned only on images of iSUN and SALICON datasets and thus the test images are substantially different than training images. We compare our model with recent state-of-the-art methods such as multi resolution CNN (MRCNN) [19], CNN-VLM [20], multiple kernel based learning (MKL)[12], RARE-2012[21], Context Aware Saliency Model(CAS) [9], Local+Global Saliency Model (LGS) [22], Generalized Nonlocal Mean Saliency (GNMS) [23], NARFI saliency (NARFI) [24], Sampled Template Collation (STC) [8] and Chromatic Induction Wavelet Model (CIW) [7]. The first three models are essentially learning based. In Table 1 we compare the performances of competing models based on six popularly used metrics, viz., AUC-Judd, AUC-Borji, CC (correlation coefficient), SIM (similarity metric), KL (Kullback- Leibler divergence) and NSS (normalized scanpath saliency). From Table 1 we see that proposed WEPSAM outperforms non learning based methods by significant margins on multiple metrics. MR-CNN outperforms our model on SIM and KL but it is to be noted that MR-CNN model is much more complex than WEPSAM. MR-CNN trains three different CNNs on [400 400], [250
250] and [150
150] scales with 6 layers of convolution. In contrast proposed WEPSAM only uses a single resolution of [128
128] using only 3 layers of convolution. In Fig. 3 we present the saliency maps on three images of MIT300. Ground truths have not been released to public but intuitively we can see that WEPSAM emphasizes only those regions in an image which are semantically important to a human. Competing methods mainly highlight the image gradients and thereby manifesting diffused, semantically insignificant maps.
In this paper we presented WEPSAM as a pioneering effort of developing a weakly pre-trained end-to-end CNN based model for saliency prediction. WEPSAM used an elegant approach of weakly learning saliency maps on ImageNet. Such pre-training acted as a regularizer and fostered in quicker convergence of validation loss on ground truth eye fixations. We hope that this work will instigate a new genre of research of using auxiliary data for saliency modeling. In future, we wish to test our model with more complex CNN models such as GoogleNet [29] and VGGnet [30] to exploit the benefit of prelearning on a larger scale.
[1] Rui Zhao, Wanli Ouyang, and Xiaogang Wang, “Unsupervised salience learning for person re-identification,” in CVPR. IEEE, 2013, pp. 3586–3593.
[2] Chin-Kai Chang, Christian Siagian, and Laurent Itti, “Mobile robot vision navigation & localization using gist and saliency,” in Intelligent Robots and Systems (IROS). IEEE, 2010, pp. 4147–4154.
[3] Laurent Itti, Christof Koch, and Ernst Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254– 1259, Nov. 1998.
[4] Dashan Gao, Vijay Mahadevan, and Nuno Vasconcelos, “On the plausibility of the discriminant center-surround hypothesis for visual saliency,” J. of Vis., vol. 8, no. 7, pp. 13–13, 2008.
[5] Neil Bruce and John Tsotsos, “Saliency based on information maximization,” in NIPS, Y. Weiss, B. Sch¨olkopf, and J.C. Platt, Eds., pp. 155–162. MIT Press, 2006.
[6] Hae Jong Seo and Peyman Milanfar, “Static and space-time visual saliency detection by self-resemblance,” J. of Vis., vol. 9, no. 12, pp. 15, 2009.
[7] N. Murray, M. Vanrell, X. Otazu, and C.A. Parraga, “Saliency estimation using a non-parametric low-level vision model,” in CVPR, June 2011, pp. 433–440.
[8] A. Holzbach and G. Cheng, “A scalable and efficient method for salient region detection using sampled template collation,” in ICIP, Oct 2014, pp. 1110–1114.
[9] Stas Goferman, Lihi Zelnik-Manor, and Ayellet Tal, “Context- aware saliency detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 10, pp. 1915–1926, Oct 2012.
[10] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in Computer Vis., Sept 2009, pp. 2106–2113.
[11] E. Vig, M. Dorr, and D. Cox, “Large-scale optimization of hierarchical features for saliency prediction in natural images,” in CVPR, June 2014, pp. 2798–2805.
[12] Yasin Kavak, Erkut Erdem, and Aykut Erdem, “Visual saliency estimation by integrating features using multiple kernel learning,” arXiv preprint arXiv:1307.5693, 2013.
[13] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao, “Salicon: Saliency in context,” in CVPR, 2015, pp. 1072–1080.
[14] Srinivas SS Kruthiventi, Kumar Ayush, and R Venkatesh Babu, “Deepfix: A fully convolutional neural network for predicting human eye fixations,” arXiv preprint arXiv:1510.02927, 2015.
[15] Junting Pan and Xavier Gir´o-i Nieto, “End-to-end convolutional network for saliency prediction,” arXiv preprint arXiv:1507.01422, 2015.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE, 2009, pp. 248–255.
[17] Geoffrey E Hinton and Ruslan R Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[18] Jonathan Harel, Christof Koch, and Pietro Perona, “Graphbased visual saliency,” in NIPS, 2006, pp. 545–552.
[19] Nian Liu, Junwei Han, Dingwen Zhang, Shifeng Wen, and Tianming Liu, “Predicting eye fixations using convolutional neural networks,” in CVPR, June 2015, pp. 362–370.
[20] Hiroharu Kato and Tatsuya Harada, “Visual language modeling on cnn image representations,” arXiv preprint arXiv:1511.02872, 2015.
[21] Nicolas Riche, Matei Mancas, Matthieu Duvinage, Makiese Mibulumukini, Bernard Gosselin, and Thierry Dutoit, “Rare2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis,” Signal Processing: Image Communication, vol. 28, no. 6, pp. 642 – 658, 2013.
[22] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in CVPR, June 2012, pp. 478–485.
[23] Guangyu Zhong, Risheng Liu, Junjie Cao, and Zhixun Su, “A generalized nonlocal mean framework with object-level cues for saliency detection,” The Vis. Computer, pp. 1–13, 2015.
[24] J. Chen, H. Cao, Z. Ju, L. Qin, and S. Su, “Non-attention region first initialisation of k-means clustering for saliency detection,” Electron. Lett., vol. 49, no. 22, pp. 1384–1386, Oct 2013.
[25] Pingmei Xu, Krista A Ehinger, Yinda Zhang, Adam Finkel- stein, Sanjeev R Kulkarni, and Jianxiong Xiao, “Turkergaze: Crowdsourcing saliency with webcam based eye tracking,” arXiv preprint arXiv:1504.06755, 2015.
[26] Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao, “Sal- icon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in CVPR, 2015, pp. 262–270.
[27] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Fr´edo Du- rand, Aude Oliva, and Antonio Torralba, “Mit saliency benchmark,” .
[28] Tilke Judd, Fr´edo Durand, and Antonio Torralba, “A bench- mark of computational models of saliency to predict human fixations,” in MIT Technical Report, 2012.
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in CVPR. IEEE, 2015, pp. 1–9.
[30] Karen Simonyan and Andrew Zisserman, “Very deep convo- lutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.