The proliferation of digital cameras has led to a significant increase in the number of photographs people capture. While capturing images, the camera is not always held or mounted at the correct angle, which results in the image being displayed in the wrong orientation. Modern digital cameras and smartphones have a built-in orientation sensor, which records the orientation of the camera during capture and writes it to the EXIF [1] data of the image. However, this technique is not consistently applied across applications; many applications, such as the default photo viewer in Windows 7, do not support the orientation tag. When an image is edited and saved using these applications, the orientation tag gets deleted, while in some cases the tag is not updated when the image is rotated manually. Also, the orientation sensor does not help when the camera is aimed towards the ground, e.g., while capturing photos of documents and pictures kept on a table. Images captured by first-person cameras, such as GoPro, which are mounted sideways and even upside-down, often require orientation correction. Automatic content creation applications assume that input images are correctly oriented. Automatic detection and correction of image orientation is also useful in several image processing and computer vision systems. It is shown in [2, 3] that spatial transformations such as translations, scaling and especially rotations reduce the accuracy of deep convolutional neural networks (CNNs). The traditional approach of making the systems transformation invariant does not work for large-scale transformations. Therefore, an accurate image orientation detection and correction method is required to tackle the aforementioned problems.
1.1. Related Work
Image orientation detection is a challenging task because digital images vary greatly in content (see Fig. 1c, 1d, 1h, 1i and 1j). As a result, existing methods, which mainly rely on hand-engineered features for orientation detection, are limited in their performance by the intrinsic semantic gap between low-level vision features and high-level image semantics. Vailaya et al. [4] first addressed the problem of image orientation detection using a Bayesian learning framework and spatial color moments as features. They reported an accuracy of 97% on a high-quality image set derived from the Corel photos dataset. However, as stated in other works [5, 6], their remarkable accuracy was an artifact of the test dataset, which mainly contained prototypical images. Later, Wang and Zhang [7], using color moments and edge direction histogram features, obtained an accuracy of 78% on another subset of the Corel photos dataset. Zhang et al. [8] treated indoor and outdoor images separately. Wang et al. [9] integrated human perception cues, such as orientation of faces, position of sky, etc., into a Bayesian framework to obtain the exact orientation angle of an image, reporting an accuracy of 94% on a small test set of 1287 images. Luo et al. [5] covered the psychophysical aspects of image orientation perception. Using the insights from [5], Luo and Boutell [6] integrated low-level features and several detectable semantic cues, such as faces, sky, grass, etc., into a Bayesian framework, obtaining an accuracy of 90% on a personal dataset of 3652 images. The approach described in [10] obtained an accuracy close to 60% on a personal test dataset. Baluja [11] used more than a hundred classifiers trained with AdaBoost and obtained a maximum accuracy of 80.3% on the Corel photos dataset. Ciocca et al. [12] incorporated faces as an additional cue and obtained an accuracy of 86% on a dataset of 4000 online images. Cingovska et al. [13] used a hierarchical approach, first classifying images into their semantic group, such as faces, sky, etc., and then using a separately trained classifier for each semantic group to classify images into their correct orientation.
Evidently, highly discrepant detection rates have been reported in the literature. The main reason for this discrepancy is the large differences between test datasets; in some cases [4, 9, 13] small and/or homogeneous datasets are used for evaluation. Except for [12], all approaches have used rejection criteria with different rejection rates to achieve their best accuracy. For these reasons, it is very difficult to ascertain the true performance of existing orientation detection methods. Furthermore, all existing methods have used highly imbalanced training and testing datasets in which more than 50% of the images were in the correct (0°) orientation; additionally, [6, 8, 10, 11, 12, 13] completely ignored 180° oriented images, calling them impractical. The psychophysical study [5] reports that humans are more likely to mis-orient images by 180°; therefore, removing this difficult case and using imbalanced datasets in which more than half of the images are in the correct orientation considerably simplifies the orientation detection task. We argue that the 180° orientation does occur in practice, for example with first-person cameras such as GoPro and when the camera is aimed towards the ground.
Recently, Ciocca et al. [14] used local binary pattern texture features with an SVM classifier and obtained 92% accuracy on the image orientation detection task. They addressed the problem of small and homogeneous datasets by using the SUN397 [15] dataset for training and testing. Their method also outperformed other existing methods in the literature by a significant margin. However, similar to other methods, they completely ignored the 180° orientation; the authors also used training and testing datasets that were highly imbalanced, with more than 72% of the images in the correct orientation. This made the orientation detection task considerably simpler. In this work, an extensive evaluation and comparison of Ciocca et al.’s method reveals that their impressive accuracy was an artifact of their imbalanced training and testing datasets. It is also shown that their method does not generalize well to images outside the SUN397 dataset. Apart from image orientation detection, there are other works in the literature which focus on estimating (regressing) the exact skew angle of images [16, 17, 18, 19]; however, that is a slightly different problem and its discussion is outside the scope of this paper.
In this preliminary work, we use the CNN architecture proposed by Krizhevsky et al. [20] (popularly known as AlexNet) and fine-tune it on the largest training dataset to date for the image orientation detection task. Our extensive cross-dataset evaluation on several challenging scene and object recognition benchmark datasets [21, 22, 23] reveals that our model generalizes to correct the orientation of a large variety of images with an accuracy of 95%, which is very close to that of humans [5]. Our model also significantly outperforms the current state-of-the-art method in [14], which relies on hand-engineered features.
1.2. Our Contributions
• As far as we know, this is the first work to leverage the representational power of CNNs exclusively for the image orientation detection task.
• Unlike existing methods, we do not ignore the 180° orientation, and we train as well as test our model on balanced datasets, so that no orientation class is favored.
• We perform an extensive evaluation of our model on challenging scene and object recognition benchmark datasets [15, 21, 22, 23] to show its generalization capability. We did not find such a rigorous evaluation in any of the existing works in the literature.
• Results show that our model significantly outperforms the current state-of-the-art method [14] and achieves 95% accuracy, which is very close to that of humans [5].
• Lastly, we show visualizations of local image regions which are considered important by our model for classification [24], helping us to compare its performance with human behavior.
In existing works, small and homogeneous datasets have been used for training as well as evaluation. To address this, we derive our training set from the challenging scene recognition benchmark dataset SUN397 [15] (similar to Ciocca et al. [14]) and perform extensive cross-dataset evaluation using other challenging benchmark datasets in computer vision. The SUN397 dataset has 397 scene categories, each with at least 100 images, for a total of 108,754 images. For cross-dataset evaluation, we consider the MIT Indoor [22], INRIA Holidays [21] and Pascal VOC 2012 [23] datasets. We chose the MIT Indoor dataset for testing because existing methods had difficulties with indoor images, which contain a lot of background clutter and lack discriminative features. The INRIA Holidays dataset is a very good representative of real-life images captured by people in their leisure time. Pascal VOC is an object-centric dataset, in contrast to the other datasets, which contain only scene-centric images. As stated earlier, we use balanced testing datasets for evaluation, i.e., an equal number of images for each orientation category, to eliminate the effect of any bias. Additionally, we compare the performance of our model with the current state-of-the-art method of Ciocca et al. [14] on the aforementioned datasets.
Our main focus in this work is to bridge the semantic gap between existing image orientation detection methods and human behavior. Encouraged by the recent success of CNNs in challenging computer vision tasks [20, 25, 26], we decided to leverage their representational power for image orientation detection. For this task, it is possible to create a large training dataset and train the network from scratch; however, it is well known that pre-training a CNN on a large corpus of outside data and fine-tuning it on the target data not only helps the model converge faster, but also results in a significant performance boost [26, 27], assuming that the outside and target data have similar visual characteristics. To this end, we chose the AlexNet CNN model proposed by Krizhevsky et al. [20], pre-trained on the MIT Places dataset [28], specifically the Places365 dataset, which comprises 1.8 million images from 365 scene categories. The ImageNet [29] dataset has object-centric images, which are quite dissimilar to our training dataset; moreover, Zhou et al. [25] discovered that CNNs trained to perform scene classification implicitly learn to detect objects. Therefore, we found a CNN pre-trained on the Places365 dataset to be a better choice for our task.
We restrict image rotation to the following four angles: 0°, 90°, 180° and 270°. We argue that this kind of coarse orientation correction suffices for the majority of user-centric use cases; moreover, orientation correction at a finer level has the detrimental side effect of cropping the image. However, when the exact skew angle of an image is required, we favor a hierarchical approach in which an image is first correctly oriented to one of the four aforementioned orientation angles. This approach restricts the search for the correct skew angle to a reasonable range around that angle, thus preventing erroneous estimations. This approach is also consistent with the assumption of the current skew detection algorithms in [16, 18].
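As an illustration of the four-angle restriction, the following minimal sketch rotates an image by multiples of 90° and undoes a rotation given a predicted class label. It assumes that label k stands for a rotation of k × 90° and that rotations are counter-clockwise; the rotation direction, the nested-list image representation and the names `rotate_ccw` and `correct_orientation` are illustrative assumptions, not taken from the method itself.

```python
from typing import List, Sequence

# Assumption: class label k corresponds to a rotation of ANGLES[k] degrees.
ANGLES = (0, 90, 180, 270)

Image = List[List[int]]


def rotate_ccw(image: Sequence[Sequence[int]], k: int) -> Image:
    """Rotate a 2-D grid counter-clockwise by k * 90 degrees."""
    grid = [list(row) for row in image]
    for _ in range(k % 4):
        # One counter-clockwise quarter turn: transpose, then reverse the row order.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def correct_orientation(image: Sequence[Sequence[int]], predicted_label: int) -> Image:
    """Undo a rotation of ANGLES[predicted_label] degrees."""
    return rotate_ccw(image, -predicted_label)


if __name__ == "__main__":
    upright = [[1, 2, 3], [4, 5, 6]]
    rotated = rotate_ccw(upright, 1)  # rotated by 90 degrees
    print(correct_orientation(rotated, 1) == upright)
```

Because only quarter turns are involved, the correction is lossless: no interpolation or cropping is needed, which is the practical advantage of the coarse scheme.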
3.1. CNN Architecture
Our CNN architecture is inspired by AlexNet [20] and pre-trained on the Places365 dataset [28]. The network has five convolution (conv) layers activated by rectified linear units (ReLU); max-pooling is applied after the 1st, 2nd and 5th convolution layers. Local response normalization is applied after the 1st and 2nd convolution layers. Layers 6 and 7 are fully connected (fc) layers and layer 8 is a softmax layer. Our network accepts images of size 256x256x3 as input. The last output layer, fc8, was removed and replaced by one with four outputs for our task. Similar to [20], dropout with rate 0.5 was applied after fc6 and fc7 to control overfitting. The remaining parameters of the network are the same as in [20].
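The layer layout described above can be summarized in a small data structure. This is only a sketch: the positions of max-pooling and local response normalization follow the standard AlexNet layout of [20], kernel sizes and channel counts are deliberately left out, and the `Layer` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str              # "conv" or "fc"
    lrn: bool = False      # local response normalization after this layer
    max_pool: bool = False # max-pooling after this layer
    dropout: float = 0.0   # dropout rate applied after this layer
    outputs: int = 0       # set only where the text states the output size


# Assumption: LRN and pooling positions follow standard AlexNet [20].
LAYERS: List[Layer] = [
    Layer("conv1", "conv", lrn=True, max_pool=True),
    Layer("conv2", "conv", lrn=True, max_pool=True),
    Layer("conv3", "conv"),
    Layer("conv4", "conv"),
    Layer("conv5", "conv", max_pool=True),
    Layer("fc6", "fc", dropout=0.5),
    Layer("fc7", "fc", dropout=0.5),
    Layer("fc8", "fc", outputs=4),  # replaced output layer: one unit per orientation
]

if __name__ == "__main__":
    for layer in LAYERS:
        print(layer)
```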
3.2. Training
We used 45,000 images from the SUN397 dataset for training; all images were initially in their correct orientation. For training, we additionally rotate each training image by 90°, 180° and 270° and label the rotated copies accordingly. Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be the training dataset. Here, N is the number of training samples, which is 180,000 in our case. The class label $y_i \in \{0, 1, 2, 3\}$ denotes the correct orientation of an input image: 0 for 0°, 1 for 90°, 2 for 180° and 3 for 270°. Let z be the four-dimensional vector representing the final softmax layer (fc8) of the network, with $z_j$ denoting the output at the $j^{th}$ unit. Therefore, the probability that the class label of the $i^{th}$ training sample is j is calculated as:
$$p(y_i = j \mid x_i) = \frac{e^{z_j}}{\sum_{k=0}^{3} e^{z_k}}$$
The corresponding simplified cross-entropy loss ($L_i$) for this $i^{th}$ training sample is given by:
$$L_i = -\log p(y_i \mid x_i)$$
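The softmax probability and the per-sample cross-entropy loss described above can be sketched in a few lines. The sketch assumes that z holds the four raw fc8 scores before normalization; the function names are illustrative, and the subtraction of the maximum score is a common numerical-stability step that is an addition here, not something stated in the text.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z: Sequence[float], label: int) -> float:
    """Negative log-probability assigned to the true orientation label."""
    return -math.log(softmax(z)[label])


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0, 0.1]  # hypothetical fc8 scores for one image
    print(softmax(scores))
    print(cross_entropy(scores, label=0))
```

With four equal scores every class receives probability 0.25, so an untrained network starts at a loss of log 4 per sample; training drives the loss below that value.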
Fig. 1. Qualitative results of our method. First row shows rotated input images. Second row images show discriminative regions from corresponding first row images identified by our model for orientation classification task. Third row shows images rotated according to predicted orientation label which is written in the caption. Images best viewed in color.
The learning task of our four-class classification problem is to minimize the above cross-entropy loss over the entire training dataset. Since conv1, conv2 and conv3 contain generic low-level features, such as Gabor filters and color blobs [27, 30], we kept these layers intact. First, we tried fine-tuning only the fully connected layers fc6 and fc7. It is shown in [26, 27] that higher layers (fc6, fc7 and fc8) learn features which are task-specific and non-transferable, while features learned by the middle-level convolution layers (conv4, conv5) are transferable and can be fine-tuned. Out of all our experiments, fine-tuning the conv4 and conv5 layers and training the fc6 and fc7 layers from scratch performed best. In one experiment, we removed the fc7 layer and reduced the dimension of fc6 to 1024; however, this led to a slight decrease in overall accuracy. We trained our model using stochastic gradient descent (SGD) with momentum 0.9 and a batch size of 256. The layers trained from scratch (fc6, fc7, fc8) were initialized from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the learning rate of the fine-tuned layers (conv4, conv5) as well as the layers trained from scratch (fc6, fc7, fc8) to 0.01 and used a low overall network learning rate (see [31]) with weight decay 0.0005. This was to prevent significant weight changes in the conv4 and conv5 layers during the initial phase of learning, when fc6, fc7 and fc8 had random initializations. We trained the model with this configuration for 10 epochs, after which we increased the overall learning rate. After this, we closely monitored the learning process and controlled the learning rates and weight decay manually. We stopped training after 30 epochs, when our validation loss plateaued.
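The training setup above combines a per-layer plan (frozen, fine-tuned, trained from scratch) with SGD, momentum and weight decay. The sketch below shows one scalar update step in the form commonly used by frameworks such as Caffe, with weight decay added to the gradient. The `LAYER_PLAN` dictionary, the function names and the learning rate passed in the example are illustrative assumptions; only the momentum (0.9) and weight decay (0.0005) values come from the text.

```python
from typing import Dict, List, Tuple

MOMENTUM = 0.9         # from the text
WEIGHT_DECAY = 0.0005  # from the text

# Which layers are left untouched, fine-tuned, or trained from scratch.
LAYER_PLAN: Dict[str, str] = {
    "conv1": "frozen", "conv2": "frozen", "conv3": "frozen",
    "conv4": "fine-tune", "conv5": "fine-tune",
    "fc6": "scratch", "fc7": "scratch", "fc8": "scratch",
}


def trainable_layers(plan: Dict[str, str]) -> List[str]:
    """Names of the layers whose weights are updated during training."""
    return [name for name, mode in plan.items() if mode != "frozen"]


def sgd_momentum_step(w: float, grad: float, velocity: float, lr: float,
                      momentum: float = MOMENTUM,
                      weight_decay: float = WEIGHT_DECAY) -> Tuple[float, float]:
    """One SGD update for a single weight; returns (new_weight, new_velocity)."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity


if __name__ == "__main__":
    print(trainable_layers(LAYER_PLAN))
    # lr=0.01 is used here purely as an example value.
    print(sgd_momentum_step(w=1.0, grad=0.5, velocity=0.0, lr=0.01))
```

A smaller `lr` shrinks every step proportionally, which is why a low overall learning rate protects the pre-trained conv4 and conv5 weights while the randomly initialized layers are still producing large gradients.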
Data Augmentation: To prevent overfitting, we augmented our training dataset by applying random brightness adjustment, contrast adjustment and Gaussian noise to each training image. We did not apply cropping because it often removed important semantic cues from images. Similar to [20], we subtracted the mean RGB pixel values computed over the entire training dataset from each input image.
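A minimal sketch of this augmentation for a single color channel is given below. The augmentation strengths, the order of operations, the [0, 1] pixel range and the function name `augment_channel` are all hypothetical choices made for illustration; the text only states which transformations were used and that the dataset mean was subtracted.

```python
import random
from typing import List, Sequence


def augment_channel(pixels: Sequence[float], channel_mean: float,
                    rng: random.Random, brightness: float = 0.1,
                    contrast: float = 0.2, noise_std: float = 0.02) -> List[float]:
    """Randomly perturb one channel (values in [0, 1]) and subtract the dataset mean."""
    shift = rng.uniform(-brightness, brightness)        # random brightness offset
    gain = rng.uniform(1.0 - contrast, 1.0 + contrast)  # random contrast factor
    centre = sum(pixels) / len(pixels)
    result = []
    for p in pixels:
        v = (p - centre) * gain + centre   # contrast: scale around the channel average
        v += shift                         # brightness: add a constant offset
        v += rng.gauss(0.0, noise_std)     # additive Gaussian noise
        v = min(1.0, max(0.0, v))          # keep the value in the valid range
        result.append(v - channel_mean)    # subtract the training-set mean
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment_channel([0.2, 0.4, 0.6, 0.8], channel_mean=0.5, rng=rng))
```

None of these operations moves or removes pixels, so the spatial cues that determine orientation are preserved, which matches the stated reason for avoiding cropping.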
Table 1. Comparison of accuracy with Ciocca et al. [14]
All experiments were performed using the Caffe [31] deep learning framework with NVIDIA GeForce GTX Titan X GPU support. The results of our model are compared with the current state-of-the-art method of Ciocca et al. [14], using the original code provided by the authors. For testing and comparison, we used 58,754 images from the SUN397 dataset. For cross-dataset evaluation, we consider the MIT Indoor [22], INRIA Holidays [21] and Pascal VOC 2012 [23] datasets. We use the recommended test set of 1340 images from 67 different indoor scene categories of the MIT Indoor dataset. The INRIA Holidays dataset originally had 1491 images; however, we removed several duplicate and orientation-ambiguous images, leading to a final test set of 1233 images. Lastly, we used the Pascal VOC 2012 training and validation dataset of 6233 images from 12 different object categories. In the test datasets, images were initially in the correct orientation and, for testing purposes, we rotated randomly selected images by 90°, 180° or 270°. The test datasets were balanced, i.e., each of the four orientation classes had an equal number of images. However, Ciocca et al.’s method was evaluated under three conditions:
1. CC-ORIG: All test datasets are created according to the scheme proposed in their paper; 72% of the images are in the 0° orientation, 14% in 90° and the remaining 14% in 270°.
2. CC-BAL: All test datasets are balanced, i.e., 34% of the images are in the 0° orientation, 33% in 90° and the remaining 33% in 270°.
3. CC-OUR: We modified their method to include the 180° orientation and trained as well as tested the method on balanced datasets.
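To see why the class distribution matters, consider a trivial classifier that always predicts the most frequent orientation. The sketch below computes its accuracy under each evaluation condition and shows one way to assign rotation labels evenly. It assumes the three orientations in CC-ORIG and CC-BAL are 0°, 90° and 270°; the helper names and the round-robin assignment are illustrative, not the authors' procedure.

```python
import random
from typing import Dict

# Share of test images per orientation (degrees) under each condition.
CC_ORIG = {0: 0.72, 90: 0.14, 270: 0.14}
CC_BAL = {0: 0.34, 90: 0.33, 270: 0.33}
OURS_BALANCED = {0: 0.25, 90: 0.25, 180: 0.25, 270: 0.25}


def majority_baseline_accuracy(class_shares: Dict[int, float]) -> float:
    """Accuracy of a classifier that always predicts the most frequent class."""
    return max(class_shares.values())


def assign_balanced_labels(num_images: int, num_classes: int,
                           rng: random.Random) -> Dict[int, int]:
    """Shuffle image indices and hand out class labels round-robin."""
    order = list(range(num_images))
    rng.shuffle(order)
    return {image: position % num_classes for position, image in enumerate(order)}


if __name__ == "__main__":
    for name, shares in [("CC-ORIG", CC_ORIG), ("CC-BAL", CC_BAL),
                         ("balanced, four classes", OURS_BALANCED)]:
        print(name, majority_baseline_accuracy(shares))
    labels = assign_balanced_labels(1340, 4, random.Random(0))
    print(sum(1 for label in labels.values() if label == 2))
```

Under the CC-ORIG distribution, always answering 0° is already correct for 72% of the images, whereas on a balanced four-class test set the same strategy reaches only 25%. Accuracies measured under the two schemes are therefore not directly comparable.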
Table 1 shows the quantitative results of our model compared to Ciocca et al.’s method. The lower accuracies obtained with CC-ORIG on the MIT Indoor (77.69%), Holidays (74.13%) and Pascal VOC (74.94%) datasets show that the hand-engineered low-level features do not help the method generalize to images outside the SUN397 training dataset. Further, when the test datasets are balanced (CC-BAL), the accuracy of the method drops drastically even on the SUN397 test dataset (from 92% to only 64.7%). This clearly shows that the method is biased towards correctly oriented images, which constitute 72% of the training and testing datasets. CC-OUR gives average results on all test datasets, with accuracy ranging from 70% to 82%. The quantitative results of CC-BAL and CC-ORIG reveal that the 92% accuracy obtained with CC-ORIG on the SUN397 dataset is an artifact of the imbalanced training and testing datasets. They also show that the problem of image orientation detection was considerably simplified in CC-ORIG by ignoring the 180° orientation.
In contrast, our model achieved an impressive accuracy of 95% on SUN397 and MIT Indoor datasets, while on INRIA Holidays and PascalVOC 2012 datasets it achieved approximately 91% accuracy. This shows the remarkable generalization capability of our model which detects correct orientation angle of a large variety of images outside the training dataset. The drop in accuracy in case of Holidays and Pascal VOC 2012 test datasets can be attributed to broad categories of objects, such as animals, cooking utensils, bicycles, etc., which were either absent or sparse in the training dataset derived from SUN397 dataset. The accuracy of 95% on MIT Indoor dataset is quite impressive because existing methods have reported problems with indoor images which contain lots of background clutter and lack discriminative features compared to outdoor images.
Fig. 1 shows the qualitative results obtained with our model for some of the challenging images from different test datasets. We have also presented visualization of the local image regions [24] which were considered discriminative by our model for the orientation detection task. Images shown in Fig. 1c, 1d, 1h, 1i, 1j are quite challenging. In Fig. 1h, the model recognizes ground to correctly orient the confusing image of bamboo trees, in Fig. 1i it identifies occluded person, while in Fig. 1j it discriminates between actual mountains and their reflections.
The qualitative as well as quantitative results clearly indicate the superiority of our CNN model over the current state-of-the-art method [14], which is based on hand-crafted features. It is evident from the evaluation results that the hand-engineered features used in [14] fail to capture the vast amount of semantic content in images that is required for the image orientation detection task. It is also evident from the results that the existing methods considerably simplified the image orientation detection problem. After a rigorous quantitative evaluation of our model on balanced test datasets which also include images in the 180° orientation, we obtain an average accuracy of 93%. This is quite close to human performance (98%), as reported in the psychophysical study [5]. We found that the performance of our model on noisy, tilted and underwater images was encouraging compared to the existing method. Overall, we observed that our model lacked orientation knowledge of objects which were absent or scarce in our training dataset. The performance of our model could be improved by extending the training dataset to include a wider variety of images.
In this work, for the first time, a deep learning based approach to the image orientation detection task was proposed. Our fine-tuned convolutional neural network model significantly outperformed the state-of-the-art method in the literature. It was shown that the existing methods, which mainly rely on hand-engineered features, fail to generalize properly to images outside the training dataset. It was also shown that the problem of image orientation detection was considerably simplified by existing methods and that their performance on real-life images is average. In contrast, the proposed model, after extensive evaluation, achieved a maximum accuracy of 95% and an average accuracy of 93%, which is the best to date and very close to the human performance reported in the literature. The quantitative as well as qualitative results show the generalization capability of the proposed deep learning based model on the challenging image orientation detection task. In future, we will work towards enhancing our training dataset, consider other deep learning architectures and work on estimating the exact skew angle of images using a hierarchical approach.
We are thankful to the anonymous reviewers, Dr. Narasinga Rao Miniskar, Dr. Pratibha Moogi and Mr. Anurag Mithalal Jain for their invaluable feedback and suggestions.
[1] “Exif.org - Exif and Related Resources,” 2017, http://www.exif.org/.
[2] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu, “Spatial transformer networks,” in NIPS, 2015.
[3] Vijay Chandrasekhar, Jie Lin, Olivier Morère, Hanlin Goh, and Antoine Veillard, “A practical guide to cnns and fisher vectors for image instance retrieval,” Signal Processing, 2016.
[4] Aditya Vailaya, HongJiang Zhang, and Anil K. Jain, “Automatic image orientation detection,” in ICIP, 1999.
[5] Jiebo Luo, David J. Crandall, Amit Singhal, and Robert T. Gray, “Psychophysical study of image orientation perception,” in Human Vision and Electronic Imaging VIII, 2003.
[6] Jiebo Luo and Matthew R. Boutell, “A probabilistic approach to image orientation detection via confidence-based integration of low-level and semantic cues,” in CVPR Workshops, 2004.
[7] Yongmei Wang and Hongjiang Zhang, “Content-based image orientation detection with support vector machines,” in Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL), 2001.
[8] Lei Zhang, Mingjing Li, and Hong-Jiang Zhang, “Boosting image orientation detection with indoor vs. outdoor classification,” in Proc. IEEE Workshop on Applications of Computer Vision (WACV), 2002.
[9] Lei Wang, Xu Liu, Lirong Xia, Guangyou Xu, and Alfred M. Bruckstein, “Image orientation detection with integrated human perception cues (or which way is up),” in ICIP, 2003.
[10] Siwei Lyu, “Automatic image orientation determination with natural image statistics,” in Proc. ACM International Conference on Multimedia, 2005.
[11] Shumeet Baluja, “Automated image-orientation detection: a scalable boosting approach,” Pattern Anal. Appl., 2007.
[12] Gianluigi Ciocca, Claudio Cusano, and Raimondo Schettini, “Image orientation detection using low-level features and faces,” in Digital Photography VI, IS&T-SPIE Electronic Imaging Symposium, 2010.
[13] Ivana Cingovska, Zoran A. Ivanovski, and François Martin, “Automatic image orientation detection with prior hierarchical content-based classification,” in ICIP, 2011.
[14] Gianluigi Ciocca, Claudio Cusano, and Raimondo Schettini, “Image orientation detection using lbp-based features and logistic regression,” Multimedia Tools Appl., 2015.
[15] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR, 2010.
[16] Hyung Il Koo and Nam Ik Cho, “Skew estimation of natural images based on a salient line detector,” J. Electronic Imaging, 2013.
[17] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox, “Image orientation estimation with convolutional networks,” in Proc. German Conf. on Pattern Recognition (GCPR), 2015.
[18] Zhiqiang Cao, Xilong Liu, Nong Gu, Saeid Nahavandi, De Xu, Chao Zhou, and Min Tan, “A fast orientation estimation approach of natural images,” IEEE Trans. Systems, Man, and Cybernetics: Systems, 2016.
[19] Lokesh Boominathan, Suraj Srinivas, and R. Venkatesh Babu, “Compensating for large in-plane rotations in natural images,” in Proc. Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2016.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[21] Herve Jegou, Matthijs Douze, and Cordelia Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in ECCV, 2008.
[22] Ariadna Quattoni and Antonio Torralba, “Recognizing indoor scenes,” in CVPR, 2009.
[23] Mark Everingham, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman, “The pascal visual object classes (VOC) challenge,” International Journal of Computer Vision, 2010.
[24] Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” arXiv preprint arXiv:1610.02391, 2016.
[25] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba, “Object detectors emerge in deep scene cnns,” CoRR, 2014.
[26] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
[27] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, “How transferable are features in deep neural networks?,” in NIPS, 2014.
[28] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Antonio Torralba, and Aude Oliva, “Places: An image database for deep scene understanding,” CoRR, 2016.
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, 2015.
[30] Matthew D. Zeiler and Rob Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014.
[31] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM International Conference on Multimedia, 2014.