This paper addresses the task of visualizing deep learning models. To improve Class Activation Mapping (CAM) based visualization methods, we offer two options. First, we propose Gaussian upsampling, an improved upsampling method that reflects the characteristics of deep learning models. Second, we identify and modify unnatural terms in the mathematical derivation of existing CAM studies. Based on these two options, we propose Extended-CAM, an advanced CAM-based visualization method that exhibits improved theoretical properties. Experimental results show that Extended-CAM provides more accurate visualization than existing methods.
Deep Convolutional Neural Networks (DCNNs) have exhibited remarkable accuracy in various image processing fields such as object detection [Redmon et al., 2016; Girshick, 2015], semantic segmentation [Long et al., 2015; Chen et al., 2017], and monocular depth estimation [Godard et al., 2017; Fu et al., 2018]. Various studies have confirmed that DCNNs possess strong modeling abilities in general machine learning problems. A DCNN, however, is a black-box model that does not expose the cause of, or the process behind, the results it outputs. As practical applications require stability and understanding, research has been conducted to visualize the inner behavior of DCNNs.
The visualization methods for DCNNs are largely divided into two popular branches. The first comprises the gradient-based methods [Zeiler and Fergus, 2014; Springenberg et al., 2014]. They produce activations by flowing the gradient of the DCNN to the learned weights, and they mainly capture fine detail. The second comprises the Class Activation Mapping (CAM) methods [Zhou et al., 2016; Selvaraju et al., 2017; Chattopadhay et al., 2018]. CAM methods operate on feature maps and represent rough shapes. The two branches were developed independently, but recently [Selvaraju et al., 2017] combined them so that the strengths of one offset the weaknesses of the other.
Figure 1: An example of visualizing DCNN’s inner behavior for a given image using our Extended-CAM.
In this paper, several problems in existing CAM studies are identified. We propose Extended-CAM, a new visualization method that improves on these problems. The main contributions of this paper are summarized as follows:
1. We propose an improved upsampling method, Gaussian upsampling. Gaussian upsampling offers a natural way to upsample a CAM and reflects the characteristics of DCNNs. We argue that bilinear upsampling is an inappropriate upsampling method for CAM.
2. We identify problematic steps in the mathematical derivations of existing CAM studies. We discuss the correct mathematical derivation for a DCNN that exhibits non-linearities.
3. Combining the two options, we propose Extended-CAM, a new CAM-based visualization method. We verified that Extended-CAM provides more accurate visualization than the existing methods.
This paper is organized as follows. Section 2 reviews the literature on CAM. Section 3 describes the two components that constitute Extended-CAM: Gaussian upsampling and the modified mathematical derivation. We verify the accuracy of the visualization mask with experiments in Section 4 and discuss the underlying meaning in Section 5.
This section reviews existing CAM studies. While the CAM approach applies to DCNNs in general, we mainly focus on DCNNs for classification of 2D images, such as the VGG models. A DCNN consists of a feature extractor and a classifier. The feature extractor at the front of the DCNN consists of convolutional layers and pooling layers and obtains feature maps from the input image. The final output of the feature extractor is the last feature map $A^k_{ij}$, where $i, j$ are spatial indices and $k$ is the channel index. The last feature map passes through the Fully-Connected layers, which form the classifier at the back, to obtain a score $Y^c$ for class $c$. Finally, $Y^c$ is passed through the softmax layer to obtain a probability for classification.
The goal of the CAM methods is to obtain $L^c_{ij}$ which satisfies $Y^c = \sum_{i,j} L^c_{ij}$. Here, $i, j$ are the indices of $A^k_{ij}$, the last feature map of the DCNN. In other words, CAM aims to calculate the contribution to the prediction $Y^c$ in units of $i, j$.
Original CAM [Zhou et al., 2016] obtains $L^c_{ij}$ by replacing the classifier architecture of the DCNN after the feature extractor with linear layers. The linear layers consist of Global Average Pooling (GAP) and a Fully-Connected (FC) layer. In this case, the last feature map $A^k_{ij}$ and the prediction $Y^c$ are related linearly. Specifically, GAP outputs $F^k = \sum_{i,j} A^k_{ij}$ (with the constant normalization factor absorbed into the weights), and the FC layer that holds weights $w^c_k$ outputs $Y^c = \sum_k w^c_k F^k$, so we derive,

$$Y^c = \sum_k w^c_k \sum_{i,j} A^k_{ij} = \sum_{i,j} \sum_k w^c_k A^k_{ij}. \tag{1}$$

As the goal is to find $L^c_{ij}$ that satisfies $Y^c = \sum_{i,j} L^c_{ij}$, we obtain $L^c_{ij} = \sum_k w^c_k A^k_{ij}$.

The disadvantages of original CAM lie in the linear layers. If linear layers replace the classifier architecture, retraining of the DCNN is required and the non-linearity of the classifier vanishes. To enable the direct use of the CAM approach in a DCNN without any modification of the architecture, Grad-CAM/Grad-CAM++ [Selvaraju et al., 2017; Chattopadhay et al., 2018] have been proposed. As in the original CAM, they also adopt the form $L^c_{ij} = \sum_k w^c_k A^k_{ij}$ but derive the coefficient $w^c_k$ mathematically under the assumption of a non-linear classifier. The resulting Grad-CAM equation is:
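$$w^c_k = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial Y^c}{\partial A^k_{ij}}, \tag{2}$$

where $Z$ is the area of the last feature map. Grad-CAM++ adds a term to generalize Grad-CAM and uses,

$$w^c_k = \sum_{i} \sum_{j} \alpha^{kc}_{ij} \, \mathrm{ReLU}\left(\frac{\partial Y^c}{\partial A^k_{ij}}\right), \tag{3}$$

where $\alpha^{kc}_{ij}$ is a weighting coefficient for each location $i, j$.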
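As a minimal sketch of the Grad-CAM computation reviewed above, the following NumPy code builds a grid-level map from a feature map and its gradient. The function name `grad_cam_grid` and the random placeholder arrays are assumptions made for illustration; in practice the gradient would come from automatic differentiation of a trained DCNN.

```python
import numpy as np


def grad_cam_grid(feature_map, gradient):
    """Grid-level Grad-CAM for one class.

    feature_map: array (K, u, v), the last feature map A^k_ij.
    gradient:    array (K, u, v), assumed to hold dY^c/dA^k_ij.
    """
    # Grad-CAM weight: average the gradient over i, j (one scalar per channel)
    weights = gradient.mean(axis=(1, 2))
    # Weighted sum over channels, then ReLU to discard negative contributions
    cam = np.einsum("k,kij->ij", weights, feature_map)
    return np.maximum(cam, 0.0)


rng = np.random.default_rng(0)
A = rng.random((512, 14, 14))            # placeholder feature map
G = rng.standard_normal((512, 14, 14))   # placeholder gradient
grid_cam = grad_cam_grid(A, G)
print(grid_cam.shape)  # (14, 14)
```

Note that the channel weight carries no dependence on $i, j$ once the gradient has been averaged, which is the property examined in Section 3.2.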
Figure 2: (a) When the pixel-level effective receptive field region passes through convolution and pooling layers, the value of one grid-level cell is determined. (b) Conversely, to interpret the grid-level value in the pixel-level, the value of one cell in the grid must spray a 2D Gaussian of the effective receptive field on the pixel-level.
3.1 Gaussian Upsampling
In a DCNN, downsampled feature maps appear due to the pooling layers or the strided convolution layers. Accordingly, the units $i, j$ of the last feature map differ from pixel-level units. Assume an image $I$ is given to a DCNN. Using a CAM method, we obtain a grid-level CAM $L^c_{ij}$. To obtain a pixel-level CAM $L^c_{xy}$, where $x, y$ are pixel-level units, an upsampling method is required.
Currently, bilinear upsampling is used in CAM, Grad-CAM, and Grad-CAM++. However, bilinear upsampling is a simple method that linearly interpolates only between adjacent values. Its critical problem is that it fails to link the physical meaning of the grid-level CAM $L^c_{ij}$ to that of the pixel-level CAM $L^c_{xy}$. We want to revise the upsampling method to reflect how the grid and the pixels are linked, i.e., how many pixels a grid-level cell covers.
Such a link can be investigated from the nature of the convolutional operation. Indeed, a cell value of the feature map is determined only by a local region at the pixel level, which is known as a receptive field. Meanwhile, [Luo et al., 2016] identified the effective receptive field, which differs from the previously known square receptive field. By checking which pixels are actually activated, the shape of the effective receptive field is shown to be a 2D Gaussian.
Thus, a pixel-level region of the effective receptive field determines the value of a grid-level cell as it passes through the feature extractor of the DCNN. Conversely, the pixel-level region covered by a grid-level cell would appear as the 2D Gaussian of the effective receptive field (Fig. 2). We describe this as spraying a 2D Gaussian on the pixel level. Therefore, to transform a grid-level CAM so that it holds meaning at the pixel level, all cell values in the grid must spray 2D Gaussians on the pixel level.
Now, we investigate the CAM value of a pixel. The CAM value of a pixel is determined by the combination of the Gaussian sprays of all the cells in the grid-level CAM. Due to the characteristics of the Gaussian spray, a pixel receives more from nearby cells and less from distant cells. We express this as a combination of 2D Gaussians, as follows.
where $u, v = 14$, $w, h = 224$, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the 2D Gaussian of the effective receptive field. We call this improved upsampling method, which links the physical meanings of the grid and the pixels, Gaussian upsampling.
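The combination of Gaussian sprays can be sketched as follows. This is an illustration under stated assumptions rather than the exact formula of the method: each cell centre is assumed to sit in the middle of its block of pixels, the Gaussians are left unnormalized, and the default `sigma` is a placeholder, not a fitted value. The function name `gaussian_upsample` is hypothetical.

```python
import numpy as np


def gaussian_upsample(grid_cam, out_hw=(224, 224), sigma=(16.0, 16.0)):
    """Spray one 2D Gaussian per grid cell onto the pixel plane and sum them."""
    rows, cols = grid_cam.shape
    h, w = out_hw
    sigma_y, sigma_x = sigma
    # Assumed cell centres: the middle of each cell's block of pixels
    cy = (np.arange(rows) + 0.5) * h / rows
    cx = (np.arange(cols) + 0.5) * w / cols
    y = np.arange(h)[:, None]
    x = np.arange(w)[:, None]
    # The 2D Gaussian is separable: gy[y, i] along rows, gx[x, j] along columns
    gy = np.exp(-((y - cy[None, :]) ** 2) / (2.0 * sigma_y ** 2))
    gx = np.exp(-((x - cx[None, :]) ** 2) / (2.0 * sigma_x ** 2))
    # out[y, x] = sum_ij grid_cam[i, j] * gy[y, i] * gx[x, j]
    return gy @ grid_cam @ gx.T


grid_cam = np.random.default_rng(0).standard_normal((14, 14))
pixel_cam = gaussian_upsample(grid_cam)
print(pixel_cam.shape)  # (224, 224)
```

Because every pixel sums over all grid cells, distant cells still contribute, only with exponentially small weight. Exploiting separability reduces the cost to two small matrix products.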
Although we use the term “upsampling”, Gaussian upsampling differs from the conventional upsampling method. Gaussian upsampling results in upsampling, but it aims to naturally link the physical meaning of the pixel and grid. Gaussian upsampling does not aim to increase resolution by making interpolation more accurate. According to Gaussian upsampling, the CAM value of a pixel is determined by a Gaussian combination of the entire grid. Even if a grid cell is far away, a pixel may be influenced. Thus, conventional upsampling methods that interpolate only through adjacent values are completely different from Gaussian upsampling. Conversely, such a conventional upsampling method is unsuitable for the upsampling of CAM because distant, nonadjacent grid cell values cannot influence the corresponding pixel.
3.2 Modified Mathematical Derivation
CAM replaced DCNN’s classifier architecture with linear layers of Global Average Pooling (GAP) and a Fully-Connected (FC) layer. However, the neural network is essentially a non-linear model. To apply the CAM approach and allow non-linearity of DCNN, Grad-CAM/Grad-CAM++ have been proposed.
However, Grad-CAM/Grad-CAM++ share two common problems. First, even Grad-CAM/Grad-CAM++ assume linear layers of GAP and FC in their mathematical derivation. This is inconsistent with a classifier that contains non-linear layers. Second, Grad-CAM/Grad-CAM++ use $w^c_k$ as the coefficient of $A^k_{ij}$, and it is assumed to be independent of $i, j$. We will demonstrate that this is also an unnecessary assumption and that the use of a coefficient $w^c_{ijk}$, which retains the dependence on $i, j$, is more natural.
The following mathematical derivation addresses these two problems. In a DCNN, the last feature map $A$ passes through the Fully-Connected layers, which output the prediction score $Y^c$. We represent the Fully-Connected layers by a non-linear function $f$,

$$Y^c = f(A). \tag{5}$$

Expanding $f$ as a Taylor series gives

$$Y^c = C + \sum_{i,j} \sum_k \frac{\partial Y^c}{\partial A^k_{ij}} A^k_{ij} + (\text{terms of quadratic and higher degree}). \tag{6}$$

The function $f$ is non-linear because the activation function of the DCNN is non-linear. We note the form of activation functions such as ReLU, LReLU, sigmoid, and tanh [Nair and Hinton, 2010; Maas et al., 2013]. We assume that a piecewise-linear function can approximate them. Using this approximation, all terms of quadratic and higher degree in Equation 6 are eliminated. We confirmed that the constant term $C$ was insignificant and ignored it. So, we approximate the non-linear function $f$ only through the first order,

$$Y^c \approx \sum_{i,j} \sum_k \frac{\partial Y^c}{\partial A^k_{ij}} A^k_{ij}. \tag{7}$$

The goal of the CAM approach is to obtain $L^c_{ij}$ such that $Y^c = \sum_{i,j} L^c_{ij}$. So,

$$L^c_{ij} = \sum_k \frac{\partial Y^c}{\partial A^k_{ij}} A^k_{ij}. \tag{8}$$

In the context of existing CAM studies which use $L^c_{ij} = \sum_k w^c_k A^k_{ij}$, Equation 8 can be interpreted as using the coefficient $w^c_{ijk} = \partial Y^c / \partial A^k_{ij}$, which includes $i, j$ units. We can understand that the use of $w^c_{ijk}$, which considers $i, j$ units, is mathematically natural. It is unnatural to additionally assume that the coefficient is independent of $i, j$ and to use $w^c_k$, which is averaged over $i, j$, as Grad-CAM does. The meaning of $w^c_{ijk}$ will be discussed again in the discussion section.
On the other hand, additional post-processing is applied in the existing CAM studies. Grad-CAM applies ReLU to $L^c_{ij}$ and clips negative values to 0 to ignore negative contributions. Grad-CAM++ also applies ReLU to the gradient term to ignore negative gradients. These manipulations using ReLU lead to several problems: the information in negative values is lost, an unnatural formula is created, and a design choice of where to apply ReLU is required. However, our method uses Gaussian upsampling, so even if a negative value appears in $L^c_{ij}$, it is smoothed by blending with the surrounding values. Thus, the information in a negative $L^c_{ij}$ is naturally reflected in the surrounding values. Experimenting with various design choices that apply ReLU to $L^c_{ij}$ and to the gradient term, we confirmed that the original Equation 8 without any ReLU yielded the best results.
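The first-order argument can be checked numerically on a toy classifier. The sketch below is illustrative only: the classifier $Y = v \cdot \mathrm{ReLU}(W a)$ is a hypothetical stand-in for the Fully-Connected layers, chosen bias-free so that the constant term $C$ vanishes, and the names `extended_cam_grid`, `W`, and `v_out` are assumptions. With a piecewise-linear $f$, the grid-level map built from the unaveraged gradient, with no ReLU, sums exactly to the score.

```python
import numpy as np


def extended_cam_grid(feature_map, gradient):
    """L^c_ij = sum_k (dY^c/dA^k_ij) * A^k_ij, with no averaging and no ReLU."""
    return (gradient * feature_map).sum(axis=0)


# Toy classifier (hypothetical): Y = v_out . ReLU(W a), bias-free so that C = 0
rng = np.random.default_rng(0)
A = rng.random((4, 3, 3))                  # feature map with K = 4 on a 3 x 3 grid
W = rng.standard_normal((8, A.size))
v_out = rng.standard_normal(8)

hidden = W @ A.ravel()
score = v_out @ np.maximum(hidden, 0.0)
# Analytic gradient of the score with respect to every A^k_ij
grad = (W.T @ (v_out * (hidden > 0))).reshape(A.shape)

cam = extended_cam_grid(A, grad)
print(np.isclose(cam.sum(), score))  # True
```

When biases are present, the sum differs from the score by the constant term that the derivation ignores.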
4.1 Estimation of Effective Receptive Field
To use Gaussian upsampling, we need to estimate $\sigma_x$ and $\sigma_y$, the standard deviations of the 2D Gaussian of the effective receptive field. First, we obtained the effective receptive field of the last feature map of the VGG-16 model [Simonyan and Zisserman, 2014]. This can be achieved by obtaining the gradient of the last feature map with respect to the input image and averaging it over various images (Fig. 3).
We used the LmFit library [Newville et al., 2016] to fit a 2D Gaussian to the effective receptive field. The unused pixels, which appear as microscopic holes in the effective receptive field, have negative values. We observed that ignoring them does not affect the fitting results.
The fitting results showed that the goodness-of-fit value is quite close to 1, indicating that the effective receptive field matches the shape of a 2D Gaussian well (Table 1). The resulting $\sigma_x$ and $\sigma_y$ are used in the Gaussian upsampling.
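For readers who want a quick estimate without a least-squares fit, the standard deviations can be approximated from second moments. Note that this is a swapped-in technique, not the LmFit fit used in this paper, and clipping negative values to zero is an assumption motivated by the observation that ignoring the negative holes does not affect the result. The function name `gaussian_sigma_from_moments` is hypothetical, and the map below is synthetic.

```python
import numpy as np


def gaussian_sigma_from_moments(erf_map):
    """Estimate (sigma_y, sigma_x) of a Gaussian-like map from its second moments."""
    # Clip negative values and normalize the map to a probability mass
    p = np.clip(erf_map, 0.0, None)
    p = p / p.sum()
    ys, xs = np.indices(p.shape)
    mu_y, mu_x = (p * ys).sum(), (p * xs).sum()
    sigma_y = np.sqrt((p * (ys - mu_y) ** 2).sum())
    sigma_x = np.sqrt((p * (xs - mu_x) ** 2).sum())
    return sigma_y, sigma_x


# Synthetic effective receptive field with known standard deviations
yy, xx = np.indices((224, 224))
synthetic = np.exp(-((yy - 112) ** 2) / (2 * 10.0 ** 2)
                   - ((xx - 112) ** 2) / (2 * 15.0 ** 2))
est_y, est_x = gaussian_sigma_from_moments(synthetic)
print(round(est_y, 1), round(est_x, 1))
```

A moment estimate is sensitive to background noise far from the centre, which is one reason a proper curve fit is preferable on measured gradients.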
Figure 3: The effective receptive field of the VGG-16 model. The bright areas are highly activated areas due to large gradients, and the dark areas are areas that are not.
Table 1: 2D Gaussian fitting result of the effective receptive field. A goodness-of-fit value close to 1 indicates that the effective receptive field matches the 2D Gaussian shape well. The resulting $\sigma_x$ and $\sigma_y$ are used for Gaussian upsampling.
4.2 Visualization with Extended-CAM
We implemented Extended-CAM, which uses Gaussian upsampling and the modified mathematical equation with the per-location coefficient $w^c_{ijk}$. We obtained several examples of visualization from Extended-CAM and existing CAM methods (Fig. 4). These visualizations represent the parts of a given image that are important for the VGG-16 model to classify it. The visualization results of Grad-CAM/Grad-CAM++ often failed to accurately represent important areas. They also suffer from various artifacts, such as cross or rhombus patterns, in the visualization results. These artifacts appear when bilinear upsampling is applied in 2D. In contrast, the visualization of Extended-CAM appears more accurate and more natural, without any artifacts. In addition, Extended-CAM exhibits less noise in other areas than Grad-CAM/Grad-CAM++.
Following the experiments in [Chattopadhay et al., 2018], we systematically evaluated the accuracy of these visualization masks. They measured the change in the DCNN's confidence when an image is masked. Such DCNN-based evaluation methods are often used in a variety of fields because a DCNN's characteristics resemble human perception [Johnson et al., 2016; Zhang et al., 2018; Salimans et al., 2016]. First, we masked the original image $I$ with the pixel-level CAM to obtain the masked image. The confidences that the DCNN outputs for $I$ and for the masked image are then compared. If the CAM is an inaccurate visualization, it masks out important parts of the image, resulting in a drop in confidence. Conversely, the less the confidence drops, the better the visualization is. A tendency for the confidence to increase also indicates an accurate visualization, as the mask removes unnecessary parts of the image. To compare the confidences, we used two indicators, Average Drop % and % Increase. We measured them on average using the PASCAL VOC 2007 validation dataset [Everingham et al., 2007].
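The two indicators can be sketched as below, assuming the definitions of [Chattopadhay et al., 2018]: Average Drop % averages the relative confidence drop, counting increases as zero drop, and % Increase is the share of images whose confidence rises after masking. The function names and the three confidence values are made up for illustration.

```python
import numpy as np


def average_drop_pct(conf_full, conf_masked):
    """Mean of max(0, Y - O) / Y over images, in percent (lower is better)."""
    conf_full = np.asarray(conf_full, dtype=float)
    conf_masked = np.asarray(conf_masked, dtype=float)
    drop = np.maximum(0.0, conf_full - conf_masked) / conf_full
    return 100.0 * drop.mean()


def increase_pct(conf_full, conf_masked):
    """Share of images whose confidence rises after masking, in percent."""
    conf_full = np.asarray(conf_full, dtype=float)
    conf_masked = np.asarray(conf_masked, dtype=float)
    return 100.0 * np.mean(conf_masked > conf_full)


full = [0.8, 0.5, 0.9]     # confidence on the original images (made-up values)
masked = [0.4, 0.6, 0.9]   # confidence on the masked images (made-up values)
print(average_drop_pct(full, masked), increase_pct(full, masked))
```

In this toy case only the first image drops (by half of its confidence) and only the second image increases.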
On both Average Drop % and % Increase, Extended-CAM outperforms the existing CAM methods (Table 2). The result indicates that the confidence drops less and the tendency to increase is larger. These confidence changes mean that important parts of the image are well preserved when applying the visualization mask suggested by Extended-CAM. In other words, the visualization of Extended-CAM highlights more of the important parts of the image than the existing CAM methods.
Meanwhile, the visualizations of Extended-CAM tend to be wider than those of Grad-CAM and Grad-CAM++ (Fig. 4). The area of the CAM mask can be an important factor. Depending on the distribution of the CAM values, the original image $I$ may be masked heavily or lightly. Thus, indicators such as Average Drop % or % Increase may be influenced. In the visualization task, the key is to find the relatively important parts of the image, regardless of the absolute area of the mask. In consideration of this, we designed additional experiments with relative masking.
Relative masking masks a fixed area of the image. Based on the pixel-level CAM values, we kept only the top 50 % of pixels in the image and masked out the remaining 50 %. Using the masked images from relative masking, Average Drop % and % Increase were measured again (Table 3). Extended-CAM also exhibits better results than the existing CAM methods under relative masking. This means that Extended-CAM finds the important parts of the image well regardless of the distribution of the CAM values.
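Relative masking can be sketched as follows. The sketch assumes that masked-out pixels are set to zero and that ties in CAM value are broken by sort order; the function name `relative_mask` and the toy inputs are hypothetical.

```python
import numpy as np


def relative_mask(image, pixel_cam, keep_fraction=0.5):
    """Keep the top `keep_fraction` of pixels ranked by CAM value; zero the rest."""
    h, w = pixel_cam.shape
    n_keep = int(round(keep_fraction * h * w))
    # Flat indices of the n_keep largest CAM values
    top = np.argsort(pixel_cam, axis=None)[::-1][:n_keep]
    mask = np.zeros(h * w, dtype=bool)
    mask[top] = True
    mask = mask.reshape(h, w)
    # Broadcast the (h, w) mask over the channel axis of an (h, w, c) image
    return image * mask[..., None], mask


image = np.ones((4, 4, 3))
pixel_cam = np.arange(16, dtype=float).reshape(4, 4)
masked_image, mask = relative_mask(image, pixel_cam)
print(int(mask.sum()))  # 8
```

Because the kept area is fixed, a wide and a narrow CAM are compared only on how they rank pixels, not on how much of the image they cover.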
We proposed two options to improve the CAM method: 1) bilinear upsampling or Gaussian upsampling in the upsampling process, and 2) the existing formula using $w^c_k$ or the generalized one using $w^c_{ijk}$ in the calculation. At first glance, the two options seem to be independent. However, in this section, we discuss in detail that the two options are not independent and that together they determine a single factor, smoothness.
First, Gaussian upsampling has the effect of making the CAM smoother. In terms of signal processing, Equation 4 is equivalent to applying zero-insertion followed by filtering with a Gaussian kernel. The latter is commonly known as Gaussian smoothing. In another aspect, as mentioned in Section 3, Gaussian upsampling differs from conventional upsampling methods and more naturally represents grid-pixel relationships. If we use Gaussian upsampling, a pixel-level CAM value is also influenced by distant grid cells. That is, Gaussian upsampling combines grid-level CAM values over a wide area. Thus, when bilinear upsampling is replaced with Gaussian upsampling, the pixel-level CAM becomes smoother overall. We call this additional smoothing explicit smoothing.
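The equivalence between the Gaussian combination and zero-insertion followed by Gaussian filtering can be illustrated in one dimension. The sizes, the standard deviation, and the integer cell centres below are assumptions chosen so that the two computations can be compared on the same pixel lattice.

```python
import numpy as np

u, w, sigma = 4, 32, 3.0                          # hypothetical sizes and std
stride = w // u
centres = np.arange(u) * stride + stride // 2     # assumed integer cell centres
grid = np.array([0.2, 1.0, -0.5, 0.7])            # grid-level values

# (a) Direct Gaussian combination over all grid cells
x = np.arange(w)
sprays = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2))
direct = sprays @ grid

# (b) Zero-insertion followed by Gaussian filtering
zero_inserted = np.zeros(w)
zero_inserted[centres] = grid
offsets = np.arange(-(w - 1), w)                  # covers every pixel-to-cell offset
kernel = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
# In "full" mode, output index x + (w - 1) aligns with pixel x
filtered = np.convolve(zero_inserted, kernel, mode="full")[w - 1 : 2 * w - 1]

print(np.allclose(direct, filtered))  # True
```

The kernel is deliberately untruncated here; a truncated kernel would reproduce the direct sum only approximately.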
Second, consider the setting of $w$. We illustrate an intuitive comparison of the use of $w^c_{ijk}$ and $w^c_k$ (Fig. 5). $w^c_k$ averages the gradient in the $i, j$ direction and ignores the $i, j$ units. Because $L^c_{ij}$ is calculated from $w^c_k$ and $A^k_{ij}$, the averaged $w^c_k$ creates a smoothing effect on $L^c_{ij}$. We call this effect implicit smoothing. In contrast, if $w^c_{ijk}$ is used, implicit smoothing does not appear because $L^c_{ij}$ is calculated using the gradient that is not averaged in the $i, j$ direction.
Figure 4: Visualization examples of existing CAM methods and Extended-CAM. We confirm that the visualization results of Extended-CAM represent more important areas.
Table 2: Comparison of visualization performance of CAM-based methods.
Table 3: Comparison of visualization performance of CAM-based methods with relative masking.
In summary, the existing CAM method of using bilinear upsampling and $w^c_k$ causes implicit smoothing. Our method of using Gaussian upsampling and $w^c_{ijk}$ leads to explicit smoothing. Thus, both the upsampling method and the $w$ setting determine smoothness. Indeed, Gaussian upsampling can mimic the effects of the use of $w^c_k$. We found that if we apply explicit smoothing with a certain smoothness other than the standard deviations of the effective receptive field, the results become quite similar to those of Grad-CAM, where implicit smoothing occurs.
So, how should smoothing be applied? Note that implicit smoothing has a smoothing effect on $L^c_{ij}$ through the averaged $w^c_k$. Here, the smoothness is arbitrarily determined. If we use implicit smoothing, we cannot control the smoothness as desired, even if we know the reasonable smoothness. Explicit smoothing, in contrast, allows us to set the smoothness directly. Note that Gaussian upsampling naturally links the grid-pixel relationship and that it amounts to smoothing. This implies that an appropriate amount of smoothness exists and is determined by explicit smoothing. Therefore, it is preferable to apply explicit smoothing by calculating the appropriate smoothness rather than to determine the smoothness arbitrarily through implicit smoothing.
Indeed, the amount of smoothing strongly relates to the accuracy of the visualization mask. We believe that the reason Extended-CAM provides a more accurate visualization mask lies in the valid smoothness that explicit smoothing determines. We present observations that other, incorrect smoothnesses degrade performance (Table 4). We found that applying Gaussian upsampling to Grad-CAM/Grad-CAM++ results in worse visualization than Extended-CAM. This means that using both implicit smoothing and explicit smoothing at the same time applies the smoothing twice and determines the wrong smoothness. Also, when bilinear upsampling and $w^c_{ijk}$ are used instead of Gaussian upsampling and $w^c_{ijk}$, the performance is degraded. We believe these options also determine the wrong smoothness because neither implicit nor explicit smoothing appears. These observations agree with our analysis that applying explicit smoothing alone determines valid smoothness, leading to improved performance.
Figure 5: Intuitive comparison of the $w$ setting. (Left) Our Extended-CAM uses $w^c_{ijk}$, which preserves $i, j$ information. Implicit smoothing does not occur. (Right) Existing CAM methods use the averaged $w^c_k$. Implicit smoothing occurs in $L^c_{ij}$. The product symbol in the figure denotes the element-wise product, i.e., the Hadamard product.
Table 4: We examined various smoothing options. The best result can be obtained using explicit smoothing only. If we use other smoothing options, the results become worse due to the wrong smoothness. We measured Average Drop % and % Increase using relative masking.
This paper proposes Extended-CAM, a new CAM-based visualization method of DCNN. First, the limitations of bilinear upsampling are identified in the pixel-level upsampling. We discussed the validity of Gaussian upsampling using an effective receptive field. Second, the problems of the mathematical derivation presented in the previous papers are identified and corrected. We found that these two options are factors that determine smoothness and can be interpreted as explicit smoothing and implicit smoothing, respectively. Experimental results showed that Extended-CAM represents a more accurate visualization than previous studies.
Some tasks remain. All CAM approaches have a limitation in that the last feature map is highly coarse ($14 \times 14$). In this study, we focused on the last feature map. In the future, we expect finer structures in the image to be investigated by applying a CAM that sets an earlier feature map as the target layer. Also, while we used DCNNs for 2D images, we expect Extended-CAM to be applicable to DCNNs in other domains, such as speech recognition.
[Chattopadhay et al., 2018] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE, 2018.
[Chen et al., 2017] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[Everingham et al., 2007] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge 2007 (voc2007) results. 2007.
[Fu et al., 2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
[Girshick, 2015] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[Godard et al., 2017] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.
[Johnson et al., 2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
[Long et al., 2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[Luo et al., 2016] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems, pages 4898–4906, 2016.
[Maas et al., 2013] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.
[Nair and Hinton, 2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
[Newville et al., 2016] Matthew Newville, Till Stensitzki, Daniel B Allen, Michal Rawlik, Antonino Ingargiola, and Andrew Nelson. Lmfit: Non-linear least-square minimization and curve-fitting for python. Astrophysics Source Code Library, 2016.
[Redmon et al., 2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[Salimans et al., 2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
[Selvaraju et al., 2017] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Springenberg et al., 2014] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[Zeiler and Fergus, 2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
[Zhang et al., 2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[Zhou et al., 2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.