Spatial-Spectral Residual Network for Hyperspectral Image Super-Resolution

2020·Arxiv

Abstract

Deep learning-based hyperspectral image super-resolution (SR) methods have achieved great success recently. However, most existing models cannot effectively explore spatial information and the spectral information between bands simultaneously, and therefore obtain relatively low performance. To address this issue, we propose a novel spectral-spatial residual network for hyperspectral image super-resolution (SSRNet). Our method effectively explores spatial-spectral information by using 3D convolution instead of 2D convolution, which enables the network to better extract potential information. Furthermore, we design a spectral-spatial residual module (SSRM) that adaptively learns more effective features from all the hierarchical features in its units through local feature fusion, significantly improving the performance of the algorithm. In each unit, we employ spatial and temporal separable 3D convolution to extract spatial and spectral information, which not only reduces the otherwise unaffordable memory usage and high computational cost, but also makes the network easier to train. Extensive evaluations and comparisons on three benchmark datasets demonstrate that the proposed approach achieves superior performance in comparison to existing state-of-the-art methods.

Index Terms—Hyperspectral image, super-resolution (SR), convolutional neural networks (CNNs), spatial-spectral residual, local feature fusion

I. INTRODUCTION

HYPERSPECTRAL imaging systems collect surface information in tens to hundreds of continuous spectral bands to acquire hyperspectral images. Compared with multispectral or natural images, hyperspectral images carry more abundant spectral information about ground objects, which reflects the subtle spectral properties of the measured objects in detail [1]. As a result, they are widely used in various fields, such as mineral exploration [2], medical diagnosis [3], and plant detection [4]. However, the acquired hyperspectral image is often low-resolution because of environmental interference and other factors, which limits the performance of high-level tasks such as change detection [5] and image classification [6].

To describe ground objects more accurately, hyperspectral image super-resolution (SR) has been proposed [7]–[9]. It aims to restore a high-resolution hyperspectral image from a degraded low-resolution hyperspectral image. In practical applications, the objects in the image are often detected or recognized according to their spectral reflectance. Therefore, spectral and spatial resolution should be considered simultaneously for hyperspectral image SR, which is different from natural image SR in computer vision [10].

The authors are with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, China (e-mail: crabwq@gmail.com, liqmges@gmail.com, xuelong li@nwpu.edu.cn) (Corresponding author: Xuelong Li.)

Fig. 1. Comparisons of our SSRNet with existing methods on hyperspectral image SR for scale factor 4. The absolute error map of one band is shown between the reconstructed hyperspectral image and the ground-truth. In general, the bluer the absolute error map is, the better the restored image is.

Since the spatial resolution of hyperspectral images is lower than that of RGB images [11], existing methods mainly fuse a high-resolution RGB image with a low-resolution hyperspectral image [12]–[14]. For instance, Kwon et al. [15] use the RGB image corresponding to the high-resolution hyperspectral image to obtain a coarse reconstruction, which is then refined locally by sparse coding to obtain a better SR image. Using prior knowledge of the spectral and spatial transform responses, Wycoff et al. [16] formulate the SR problem as a non-negative sparse factorization, which is solved effectively by the alternating direction method of multipliers [17]. These methods perform hyperspectral image SR under the guidance of RGB images generated with the same camera spectral response (CSR), ignoring the differences in CSR between datasets or scenes. Using the same CSR throughout reconstruction therefore leads to poor robustness. To address this issue, Fu et al. [18] design a CSR function selection layer, which automatically selects the optimal CSR for a particular scene. In addition to the CSR function selection mechanism, the method models the CSR as a convolutional layer to learn the optimal CSR function, significantly improving the performance of hyperspectral image SR. However, such a scheme requires the pair of images to be well registered, which is usually difficult to achieve in practice. Moreover, although these algorithms are claimed to be unsupervised, they are not actually unsupervised, because the ground-truth RGB image is used during reconstruction.

Research on natural image SR has achieved great success in recent years due to the powerful representational ability of convolutional neural networks (CNNs) [19], [20]. Its main principle is to learn the mapping function between low-resolution and high-resolution images in a supervised way. Typical methods include SRCNN [21], EDSR [22], and SRGAN [23]. Owing to their satisfying performance in natural image SR, researchers have applied these methods to hyperspectral image SR. Inspired by the deep recursive residual network [24], Li et al. [25] propose the grouped deep recursive residual network (GDRRN) to perform hyperspectral image SR in the spatial domain. As mentioned earlier, this method does not take spectral resolution into account and thus may lead to spectral distortion of the restored hyperspectral image. Considering this limitation, Mei et al. [26] present the 3D full convolution neural network (3D-FCNN) to explore the relationship between the spatial information and adjacent pixels between spectra. Although this method effectively uncovers spatial information and spectral information between bands, it changes the size of the estimated hyperspectral image, which is not suitable for image reconstruction.

To address these drawbacks, in this paper we propose a novel spectral-spatial residual network for hyperspectral image super-resolution (SSRNet). Our method learns the mapping function in a supervised way without using the RGB image corresponding to the high-resolution hyperspectral image. The whole network uses 3D convolution instead of 2D convolution to extract hyperspectral image features. In each spatial-spectral residual module (SSRM), the network adaptively learns more effective spatial and spectral features from all the hierarchical units. To reduce the otherwise unaffordable memory usage and high computational cost, we employ separable 3D convolution in each residual unit to extract spatial information and spectral information between bands. Using three evaluation indexes on three datasets, we demonstrate that SSRNet outperforms the state-of-the-art deep learning-based hyperspectral image SR approaches. Besides, our proposed SSRNet generates more realistic visual results than other methods, as shown in Fig. 1. In summary, our main contributions are as follows:

• A novel spatial-spectral residual network (SSRNet) is proposed to reconstruct hyperspectral images. The network explores the spatial information and the spectral information between bands without changing the size of the hyperspectral image, which significantly enhances performance.

• The spatial-spectral residual module (SSRM) is designed to adaptively preserve the accumulated features through local feature fusion. It makes full use of all the hierarchical features in the units, which enables the network to fully extract the features of hyperspectral images.

• Spatial and temporal separable 3D convolution is employed to extract spatial and spectral features in each unit, respectively. It reduces the otherwise unaffordable memory usage and high computational cost, and makes the network easier to train.

Fig. 2. Spatial and temporal separable 3D convolution.

The remainder of this paper is organized as follows: Section II describes existing hyperspectral image SR with CNNs and the detailed 3D convolution. Section III introduces our proposed SSRNet, including network structure, spectral-spatial residual module, skip connections, etc. Then, experiments on benchmark datasets are performed to verify our method in Section IV. Finally, Section V gives the conclusion.

II. RELATED WORK

There exists an extensive body of literature on hyperspectral image SR. Here we first outline several deep learning-based hyperspectral image SR methods. To make the proposed method easier to follow, we then give a brief introduction to 3D convolution.

A. Hyperspectral Image SR with CNNs

Recently, deep learning-based methods [27] have shown remarkable advantages in the field of hyperspectral image SR. Here, we briefly introduce several CNN-based methods. Li et al. [28] propose a deep spectral difference convolutional neural network (SDCNN) that uses five convolutional layers to improve spatial resolution. Under a spatial constraint strategy, it makes the reconstructed hyperspectral image preserve spectral information through post-processing. Jia et al. [29] present the spectral-spatial network (SSN), which includes spatial and spectral sections. They try to learn the mapping function between low-resolution and high-resolution images and fine-tune the spectrum. Yuan et al. [30] use knowledge from natural images to restore high-resolution hyperspectral images by transfer learning, and propose collaborative nonnegative matrix factorization to enforce collaboration between low-resolution and high-resolution hyperspectral images. All of these methods need two steps to achieve image reconstruction: the algorithm first improves the spatial resolution, and some constraint criteria are then employed to retain the spectral information and avoid spectral distortion. Clearly, the spatial resolution may be altered while the spectral information is being maintained.

Considering this issue, Li et al. [25] and Wang et al. [31] introduce spectral angle error and set a new loss function by combining it with the mean square error. When training the network, these methods combine two error functions and deliberately reduce the distortion of the spectrum. However, it affects the performance of the reconstructed spatial resolution. Unlike natural image, the hyperspectral image has tens to hundreds of continuous spectral bands. Mei et al. [26] take advantage of this property of hyperspectral image and adopt 3D convolution to extract the features, which effectively retains the information of the original spectrum and improves the performance of image SR. However, the size of reconstructed image is changed.

Fig. 3. Overall architecture of our proposed SSRNet.

B. 3D Convolution

For natural image SR, researchers usually employ 2D convolution to extract features and obtain good performance [32], [33]. As introduced earlier, a hyperspectral image contains many continuous bands, so adjacent bands are strongly correlated [34]. If we directly use 2D convolution for hyperspectral image SR, the potential features between bands cannot be exploited effectively. Therefore, in order to make full use of this characteristic, we design our network with 3D convolution to analyze the spatial and spectral features of the hyperspectral image.

Since 3D convolution takes into account the inter-frame motion information in the time dimension, it is widely used in video classification [35], action recognition [36], and other fields. Unlike 2D convolution, the 3D convolution operation is implemented by convolving a 3D kernel with feature maps. Intuitively, a network built on 3D convolution has an order of magnitude more parameters than one built on 2D convolution. To address this problem, Xie et al. [37] develop the separable 3D CNN (S3D) model to accelerate video classification. In this model, the standard 3D convolution is replaced by spatial and temporal separable 3D convolution (see Fig. 2), which effectively reduces the number of parameters while still maintaining good performance.
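To make the parameter saving concrete, the following minimal sketch counts the weights of one standard 3D convolution and of its separable replacement. It assumes 64 input and 64 output channels with k = 3 and ignores biases; the function names are illustrative and are not taken from any released code.

```python
def conv3d_weights(c_in, c_out, kernel):
    """Number of weights (biases ignored) in one 3D convolution layer."""
    kd, kh, kw = kernel
    return c_in * c_out * kd * kh * kw


def standard_vs_separable(channels=64, k=3):
    """Compare one k x k x k layer with a k x 1 x 1 plus 1 x k x k pair."""
    standard = conv3d_weights(channels, channels, (k, k, k))
    separable = (conv3d_weights(channels, channels, (k, 1, 1))
                 + conv3d_weights(channels, channels, (1, k, k)))
    return standard, separable


standard, separable = standard_vs_separable()
# Fraction of weights saved by the separable pair under these assumptions.
print(standard, separable, round(1 - separable / standard, 4))
# prints: 110592 49152 0.5556
```

Under these assumptions a separable pair needs 12/27 of the weights of the standard layer. The saving measured over a whole network can differ from this per-layer figure, since layers that keep standard 3D convolution are not affected.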

III. PROPOSED METHOD

A. Network Structure

In this section, we detail the overall architecture of our SSRNet, whose flowchart is shown in Fig. 3. As can be seen from this figure, our method mainly consists of three parts: the initial feature extraction (IFE) subnetwork, the spatial-spectral residual module (SSRM) subnetwork, and the image reconstruction (IR) subnetwork. Let $I_{LR} \in \mathbb{R}^{W \times H \times L}$ and $I_{SR} \in \mathbb{R}^{rW \times rH \times L}$ represent the input low-resolution hyperspectral image and the output reconstructed hyperspectral image, where W and H are the width and height of each band, L represents the total number of bands in the hyperspectral image, and r is the scale factor. In order to employ 3D convolution, we need to unsqueeze $I_{LR}$ into four dimensions ($1 \times W \times H \times L$) at the beginning of the network. Then, a standard 3D convolution is applied to extract shallow features $F_0$ from $I_{LR}$, i.e.,

$$F_0 = f_c(\mathrm{Unsqueeze}(I_{LR})),$$

where Unsqueeze(.) means the input hyperspectral image is expanded to four dimensions, and $f_c(\cdot)$ denotes the 3D convolution operation. The initial features $F_0$ are fed into the spatial-spectral residual modules, which are described in detail in Section III-B. After D residual modules and the global skip connection, the deep feature maps are denoted as

$$F_D = H_D(H_{D-1}(\cdots(H_1(F_0) + F_0)\cdots) + F_0) + F_0,$$

where $H_d(\cdot)$ denotes the operation of the d-th residual module. The impact of the number of residual modules D is analyzed in Section IV-D4. For the IR subnetwork, we use a transposed convolution layer to upsample these feature maps to the desired scale via the scale factor r, which is followed by a convolution layer. After the squeeze process, the output size becomes $rW \times rH \times L$. Finally, the output of SSRNet can be obtained by

$$I_{SR} = \mathrm{Squeeze}(f_c(f_{up}(F_D))),$$

where $f_{up}(\cdot)$ and Squeeze(.) are the functions for upsampling and squeeze, respectively.
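The data flow described above can be traced as a sequence of tensor shapes. The sketch below is bookkeeping only, not an implementation of the network: it assumes a channel-first layout, 64 feature channels, and that upsampling enlarges only the two spatial dimensions while the number of bands L is preserved. `ssrnet_shapes` is a hypothetical helper name.

```python
def ssrnet_shapes(width, height, bands, scale, channels=64):
    """Trace tensor shapes through IFE, the residual modules, and IR."""
    up_w, up_h = scale * width, scale * height
    return [
        ("input", (width, height, bands)),
        ("unsqueeze", (1, width, height, bands)),
        ("initial feature extraction", (channels, width, height, bands)),
        ("residual modules", (channels, width, height, bands)),
        ("upsampling", (channels, up_w, up_h, bands)),
        ("reconstruction convolution", (1, up_w, up_h, bands)),
        ("squeeze", (up_w, up_h, bands)),
    ]


# Example: a 32 x 32 patch with 31 bands, scale factor 4.
for stage, shape in ssrnet_shapes(32, 32, 31, scale=4):
    print(f"{stage:28s} {shape}")
```

The last entry shows the property emphasized in the text: the spatial size grows by the scale factor while the spectral dimension is left unchanged.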

B. Spatial-Spectral Residual Module

The architecture of the spatial-spectral residual module (SSRM) is illustrated in Fig. 4. As shown in this figure, the module mainly contains three residual units, local feature fusion, and a block. In the d-th SSRM, suppose $F_{d-1}$ and $F_d$ are the input and output feature maps, respectively. Under the local residual connection, the output of the d-th SSRM can be defined as

$$F_d = B(F_{d,LFF} + F_{d-1}),$$

where $B(\cdot)$ is the function of the block and $F_{d,LFF}$ is the output of local feature fusion. Next, we present the details of the proposed residual unit and block.

1) Residual Unit: As mentioned in Section II, previous work uses spatial and temporal separable 3D convolution in place of the standard 3D convolution for video classification, i.e., the filter of size $k \times k \times k$ is replaced by filters of size $k \times 1 \times 1$ and $1 \times k \times k$, which has been proven to perform better. To reduce the otherwise unaffordable memory usage and high computational cost, we use this method to replace the standard 3D convolution in the block. Specifically, the $k \times 1 \times 1$ filter is used to extract the features between spectra, and the $1 \times k \times k$ filter is adopted to extract the spatial features of each band. Moreover, we add a rectified linear unit (ReLU) after each convolution operation (see Fig. 5(a)). Finally, the block can be formulated as

$$B(x) = \sigma(f_2(\sigma(f_1(x)))),$$

where x is the input of the block, $\sigma$ denotes the ReLU activation function, and $f_1$ and $f_2$ denote the two separable convolutions. In this way, the block not only effectively mines the potential information between spectra, but also speeds up the algorithm. Since a $3 \times 3$ convolution kernel extracts image features well for natural image SR, the parameter k of the convolution is set to 3 in our work.

Fig. 4. Architecture of the d-th spectral-spatial residual module (SSRM). The module contains three units, local feature fusion, and a block. The feature maps from $F_{d-1}$ are first fed into the first unit. After two units, the output of each unit is concatenated together to fuse these features of different depths. More effective features are attached to block after local residual learning and the output of the module is finally obtained.

Fig. 5. Architecture of the n-th residual unit.

Now we present the proposed residual unit, which is shown in Fig. 5(b). Let $F_{d,n-1}$ and $F_{d,n}$ be the input and output feature maps of the n-th unit in the d-th SSRM, respectively. Through the local skip connection and two blocks, the output feature maps can be obtained by

$$F_{d,n} = B(B(F_{d,n-1})) + F_{d,n-1},$$

where $B(\cdot)$ denotes the block. In this way, the unit not only greatly reduces the computational cost, but also simultaneously learns the spectral and spatial information of the hyperspectral image.

2) Local Feature Fusion: To make the network learn more useful information, we design a local feature fusion strategy (see Fig. 4) to adaptively retain the cumulative features, which enables the network to fully extract hyperspectral image features. Specifically, the features from different units are first concatenated to learn fusion information. In order to perform local residual learning between the fused result and the input $F_{d-1}$, it is necessary to reduce the number of features. Thus, we add a convolution layer with the size $1 \times 1 \times 1$ after concatenation to adaptively retain valid information. Besides, we also apply the ReLU activation function after this convolution. As a result, the output of local feature fusion is formulated as

$$F_{d,LFF} = \sigma(f_{1 \times 1 \times 1}(\mathrm{Concat}(F_{d,1}, F_{d,2}, F_{d,3}))),$$

where Concat(.) denotes the concatenation of the different hierarchical features, $F_{d,n}$ is the output of the n-th unit in the d-th SSRM, and $\sigma$ is the ReLU activation function.
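The fusion step can be illustrated on a single voxel. The sketch below treats the fusion convolution as a pointwise (1 × 1 × 1) operation, which is an assumption of this sketch, so that it reduces to a matrix applied to the concatenated channel vector, followed by ReLU and the local residual addition. The function names and the toy weights are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def local_feature_fusion(unit_outputs, weights):
    """Concatenate unit outputs, mix channels with a pointwise layer, apply ReLU.

    unit_outputs: one channel vector per residual unit (a single voxel).
    weights: one row per output channel, one column per concatenated channel.
    """
    concatenated = [v for unit in unit_outputs for v in unit]
    mixed = [sum(w * x for w, x in zip(row, concatenated)) for row in weights]
    return relu(mixed)


def fuse_with_residual(unit_outputs, weights, module_input):
    """Local residual learning: add the module input to the fused features."""
    fused = local_feature_fusion(unit_outputs, weights)
    return [f + x for f, x in zip(fused, module_input)]


# Toy example: three units with two channels each (six concatenated channels).
units = [[1.0, -3.0], [2.0, -3.0], [3.0, -3.0]]
weights = [[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # sums channel 0 of every unit
           [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]  # sums channel 1 of every unit
print(fuse_with_residual(units, weights, [1.0, 1.0]))  # [7.0, 1.0]
```

With 64 filters per layer, the same reduction would map 3 × 64 = 192 concatenated channels back to 64, which is what allows the element-wise addition with the module input.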

C. Skip Connections

As the depth of the network increases, the weakening of information flow and the disappearance of gradient hinder the training of the network. Recently, there are many ways to solve these problems. For instance, He et al. [38] first utilize skip connection between layers so as to improve the information flow and make it easier to train. To fully explore the advantages of skip connection, Huang et al. [39] propose DenseNet. The network has the advantages of strengthening feature propagation, supporting feature reuse, and reducing the number of parameters.

For the SR task, the input low-resolution image is highly similar to the output high-resolution image, that is, the low-frequency information carried by the low-resolution image is similar to that of the high-resolution image [40]. Exploiting this characteristic, researchers use dense connections to enhance the information flow of the whole network and alleviate vanishing gradients for natural image SR, thus effectively improving the performance of the algorithm. Therefore, we add several global residual connections to our network. Since the shallow layers retain more edge and texture information of the hyperspectral image, the feature maps from IFE are fed to the back of each module, which enhances the performance of the entire network.

D. Network Learning

Fig. 6. Some RGB images corresponding to hyperspectral images on three datasets.

For network training, the SSRNet is optimized by minimizing the difference between the reconstructed hyperspectral image $I_{SR}$ and the corresponding ground-truth hyperspectral image $I_{HR}$. Mean square error (MSE) is often used as the loss function to learn the parameters of the network in deep learning-based hyperspectral image SR algorithms [28]. Additionally, some methods design a loss function with two terms to minimize the difference, namely MSE and spectral angle mapping (SAM) [25], [31]. In fact, these loss functions do not make the network converge better and yield poor results, as shown in the experiment section. For natural image SR, as far as we know, many recent networks use L1 as the loss function, and experiments also demonstrate that L1 can obtain more powerful performance and convergence [19]. Therefore, in this paper, we follow the natural image SR methods and adopt L1 as the loss function of our network. The loss function of SSRNet is

$$L(\Theta) = \frac{1}{M} \sum_{m=1}^{M} \left\| I_{HR}^{m} - I_{SR}^{m} \right\|_1,$$

where M is the number of training patches and $\Theta$ denotes the parameter set of the SSRNet network.
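As a small numerical illustration of the difference between the L1 and MSE criteria discussed above, the sketch below evaluates both on flattened toy patches. It assumes the L1 loss is the per-patch sum of absolute differences averaged over the M patches; framework implementations often average over all elements instead, which only changes a constant factor. The function names are illustrative.

```python
def l1_loss(reconstructed, ground_truth):
    """Per-patch L1 norm of the error, averaged over the M patches."""
    total = 0.0
    for sr_patch, hr_patch in zip(reconstructed, ground_truth):
        total += sum(abs(h - s) for s, h in zip(sr_patch, hr_patch))
    return total / len(reconstructed)


def mse_loss(reconstructed, ground_truth):
    """Per-patch sum of squared errors, averaged over the M patches."""
    total = 0.0
    for sr_patch, hr_patch in zip(reconstructed, ground_truth):
        total += sum((h - s) ** 2 for s, h in zip(sr_patch, hr_patch))
    return total / len(reconstructed)


# Two flattened toy patches; the second contains one large error.
sr = [[0.0, 1.0], [2.0, 2.0]]
hr = [[0.0, 0.0], [2.0, 4.0]]
print(l1_loss(sr, hr))   # 1.5
print(mse_loss(sr, hr))  # 2.5
```

The squared criterion weights the single large error more heavily than the absolute one, which is one commonly cited reason the two losses behave differently during training.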

IV. EXPERIMENT

To verify the effectiveness of the proposed SSRNet, in this section, we first introduce three public datasets, implementation details, and evaluation indexes. We then analyze the proposed method from many aspects, including loss function analysis, ablation study, etc. Finally, we assess the performance of our SSRNet by comparisons to the state-of-the-art methods.

A. Datasets

1) CAVE: The CAVE dataset is gathered by a cooled CCD camera at a 10 nm step from 400 nm to 700 nm (31 bands) [41]. The dataset contains 31 scenes, divided into 5 sections: real and fake, skin and hair, paints, food and drinks, and stuff. The size of every hyperspectral image in this dataset is $512 \times 512$. Each band is stored as a 16-bit grayscale PNG image.

2) Harvard: The Harvard dataset is obtained with a Nuance FX camera (CRI Inc.) in the wavelength range of 400 nm to 700 nm [42]. The dataset consists of 77 hyperspectral images of real-world indoor or outdoor scenes under daylight illumination. The size of each hyperspectral image in this dataset is $1040 \times 1392$. Unlike the CAVE dataset, this dataset is stored as .mat files.

3) Foster: The Foster dataset is collected using a low-noise Peltier-cooled digital camera (Hamamatsu, model C4742-95-12ER) [43]. The dataset includes 30 images from the Minho region of Portugal during late spring and summer of 2002 and 2003. Each hyperspectral image has 33 bands with the size of 1204 × 1344 pixels. Similarly, the dataset is also stored as .mat files. Some RGB images corresponding to hyperspectral images are shown in Fig. 6.

TABLE I LOSS FUNCTION ANALYSIS FOR SCALE FACTOR

B. Implementation Details

As mentioned earlier, different datasets are gathered by different hyperspectral cameras, so we need to train and test each dataset individually, which is different from the natural image SR. In our work, 80% of the samples are randomly selected as training set, and the rest are used for testing.

For the training phase, since these datasets contain too few images for a deep learning algorithm, we augment the training data by randomly selecting 24 patches. Each patch is horizontally flipped, rotated ($90^\circ$, $180^\circ$, and $270^\circ$), and scaled (1, 0.75, and 0.5). According to the scale factor r, these patches are downsampled to low-resolution hyperspectral images by bicubic interpolation. Before feeding a mini-batch into our network, we subtract the average value of the entire training set from the patches. In our work, we set the filter size to $3 \times 1 \times 1$ and $1 \times 3 \times 3$ in each convolution layer except those for initial feature extraction and image reconstruction (where the filter size is set to $3 \times 3 \times 3$), and the number of filters for all layers in our network is set to 64. We initialize each convolutional filter using the method in [44]. The ADAM optimizer is employed to train our network. The learning rate is initialized to the same value for all layers and is halved every 35 epochs.
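The step decay of the learning rate described above can be written as a one-line schedule. In the sketch below only the halving interval of 35 epochs comes from the text; the initial value passed in the example is a placeholder chosen for illustration, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(initial_lr, epoch, step=35, factor=0.5):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


# Placeholder initial value of 1.0, so the printed numbers are the decay factors.
for epoch in (0, 34, 35, 70, 105):
    print(epoch, learning_rate(1.0, epoch))
```

Epochs 0 to 34 use the initial rate, epochs 35 to 69 use half of it, and so on.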

For the test phase, in order to improve the efficiency of the test, we only use the top left region of each test image for evaluation. Our method is conducted using the PyTorch framework with NVIDIA GeForce GTX 1080 GPU.

C. Evaluation Metrics

To quantitatively evaluate the proposed SSRNet, three evaluation metrics are employed to verify the effectiveness of the algorithm: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and spectral angle mapping (SAM). In general, the larger the PSNR and SSIM and the smaller the SAM, the better the reconstructed hyperspectral image.
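For reference, two of these metrics can be computed in a few lines. The sketch below uses the standard definitions of PSNR and of the spectral angle; it assumes pixel values normalized to [0, peak], non-zero spectra, and an angle reported in degrees, and it does not claim to reproduce the exact evaluation protocol (for example, per-band averaging) used for the tables. SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two flat lists of pixel values in [0, peak]."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def spectral_angle(reference, estimate):
    """Angle in degrees between two spectra (one value per band)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_e = math.sqrt(sum(e * e for e in estimate))
    # Clamp to [-1, 1] to guard against rounding just outside the domain of acos.
    cosine = max(-1.0, min(1.0, dot / (norm_r * norm_e)))
    return math.degrees(math.acos(cosine))


def sam(reference_pixels, estimate_pixels):
    """Mean spectral angle over pixels, each given as a list of band values."""
    angles = [spectral_angle(r, e) for r, e in zip(reference_pixels, estimate_pixels)]
    return sum(angles) / len(angles)
```

A spectrum that is only rescaled, such as [1, 2, 3] versus [2, 4, 6], has a spectral angle of approximately zero, which is why SAM measures spectral shape rather than brightness.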

D. Model Analysis

In this section, to verify the effectiveness of our proposed method, we conduct sufficient experiments from the following four aspects.

1) Loss Function Analysis: To demonstrate the effect of different loss functions, the loss functions of [31] and [28] and the L1 loss used in our work are employed to train SSRNet on the CAVE dataset. The evaluation results are shown in Table I. When SAM is added to the loss function, the spatial resolution clearly changes and the spectral distortion becomes more serious. Moreover, the loss function containing MSE and SAM gets a lower PSNR value, mainly because it weakens the performance in spatial resolution. As seen from this table, L1 achieves the best performance among the loss functions on all three indexes. This verifies that our method can effectively minimize the difference between $I_{SR}$ and $I_{HR}$ using L1.

2) Ablation Study: Table II shows the ablation study on the impact of local feature fusion (LFF) in the module and of global residual learning (GRL). We use different combinations of these components to analyze the performance of the proposed SSRNet. For a fair comparison, our network with 3 modules is used for the ablation investigation, for scale factor 2 on the CAVE dataset.

First, without local feature fusion and global residual learning (LFF0GRL0), the network yields the worst performance. It lacks adequate learning of effective features, which shows that spectral and spatial features cannot be extracted well without these components. Thus, these components are required in our network. Then, we add one of these components, LFF, to the baseline (LFF0GRL0). The performance of the network improves in PSNR and SAM. Likewise, when only GRL (denoted as LFF0GRL1) is added to the baseline, the evaluation indexes are better than those of LFF0GRL0, except for SSIM. In short, the experiments demonstrate that each component clearly enhances the performance of the network, which indicates that each component plays a key role in making the network easier to train. Finally, both components (LFF1GRL1) are attached to the baseline. The table shows that the results with both components are significantly better than those with only one on every index, which reveals that the two components contribute to the flow of information and gradient transmission in the network.

We also provide a convergence analysis, using PSNR only, for the different combinations of components in Fig. 7. One can observe that the convergence curve for LFF0GRL1 is more stable than that of LFF1GRL0 in the early iterations. Compared with the baseline, LFF and GRL effectively improve the performance in PSNR, which is consistent with the above analyses. To sum up, the analyses reveal the effectiveness and benefits of the proposed LFF and GRL.

3) Study of Block: In this section, we study the efficiency of the proposed block by using different convolution types in the module, namely standard 3D convolution and separable 3D convolution. In the first variant, we use the block with separable 3D convolution; in the other, we use standard 3D convolution with the ReLU activation function removed. Note that the convolution operations in initial feature extraction and image reconstruction are not replaced by separable 3D convolution in our network. The comparison results are shown in Table III. Our proposed block greatly reduces the number of parameters (by 49.77%), which makes the network easier to train while reducing the memory footprint. In terms of PSNR, the network with standard 3D convolution scores lower than the one with separable 3D convolution. We think that there are two main reasons for this: 1) the network has too many parameters, which makes it more difficult to train; 2) the network uses standard 3D convolution throughout, which makes it pay too much attention to spectral information and weakens its ability to learn spatial features.

Fig. 7. Ablation study of the proposed method for scale factor 2 on CAVE dataset.

TABLE II ABLATION STUDY ABOUT THE COMPONENTS FOR SCALE FACTOR 2 ON CAVE DATASET.

4) Study of D: The structure of our proposed SSRNet is determined by the number of spatial-spectral residual modules D. To analyze the effect of this parameter on the performance, we vary D from 2 to 5, and the results are displayed in Table IV. One can observe that no matter what D is, the values of SAM remain basically the same. Moreover, the values of PSNR and SSIM do not increase significantly when D > 3. When D is set to 5, the value of each evaluation index improves, especially PSNR, but this leads to an obvious increase in the number of network parameters. Therefore, we empirically set the parameter D to 3 in our paper.

E. Comparisons with the State-of-the-art Methods

In this section, we adopt three public hyperspectral image datasets to evaluate the effectiveness of our SSRNet with existing SR approaches using three evaluation indexes. Table V depicts the quantitative evaluation of state-of-the-art SR algorithms by average PSNR/SSIM/SAM for different scale factors.

As shown in the table, our method achieves the best results among the compared algorithms on the CAVE dataset. Specifically, Bicubic produces the worst performance among these competitors. For the GDRRN algorithm, all results are slightly higher than those of Bicubic but lower than those of the other methods. This is caused by the SAM term added to its loss function, which prevents the network from fully optimizing the difference between the reconstructed and high-resolution images. Furthermore, the PSNR and SSIM results of 3D-FCNN are lower than those of EDSR, but its performance in SAM is clearly better than that of EDSR, because 3D-FCNN uses 3D convolution to extract the spectral features of the hyperspectral image. Thus, this algorithm avoids spectral distortion of the reconstructed hyperspectral image well. However, the image obtained by 3D-FCNN loses part of the bands (the algorithm only obtains 23 bands for a hyperspectral image with 31 bands), which is not suitable for image SR. Compared with the existing SR approaches, our method obtains excellent performance. The proposed method is significantly superior to the second-best algorithm (EDSR) in terms of the three evaluation metrics (+0.36 dB, +0.002, and -0.07).

Fig. 8. Absolute error map comparisons of our SSRNet with existing methods for scale factor

Fig. 9. Spectral distortion comparisons by randomly selecting a pixel. (a)-(f) Results of the spectral curves of six scenes, respectively.

TABLE III COMPARISON OF THE PERFORMANCE OF STANDARD 3D CONVOLUTION AND SEPARABLE 3D CONVOLUTION.

TABLE IV ANALYSIS OF THE INFLUENCE OF THE NUMBER OF SPATIAL-SPECTRAL RESIDUAL MODULES ON THE PERFORMANCE.

Similarly, except for 3D-FCNN, SSRNet outperforms the other competitors in all three aspects on the Harvard dataset. Concretely, unlike on the CAVE dataset, GDRRN and 3D-FCNN achieve approximately the same results, because the augmented Harvard dataset contains more hyperspectral images than the CAVE dataset. This is more beneficial for training networks with many parameters, such as EDSR. Moreover, it also enables our approach to achieve a larger improvement (+0.98 dB, +0.004, and -0.03) on this dataset than on the CAVE dataset. Likewise, the proposed approach achieves good performance in comparison to existing state-of-the-art methods on the Foster dataset, particularly in SSIM and SAM.

In Fig. 8, we show visual comparisons with different algorithms for scale factor on three datasets. The figure only provides visual results of the 27th band of six typical scenes. As shown in the figure, the ground truth is grey. To make the difference between the reconstructed hyperspectral image and the ground truth clearly visible, the absolute error map between them is presented. In general, the bluer the absolute error map is, the better the restored image is. Note that each hyperspectral image is normalized. From this figure, we can see that our proposed SSRNet obtains very low absolute errors. In some regions, especially along the edges of the image, the error maps of our method show only faint edge structure or none at all. This means that our proposed SSRNet generates more realistic visual results than the other methods, which is consistent with our analysis in Table V.

TABLE V QUANTITATIVE EVALUATION OF STATE-OF-THE-ART SR ALGORITHMS BY AVERAGE PSNR/SSIM/SAM FOR DIFFERENT SCALE FACTORS. THE BOLD INDICATES THE BEST PERFORMANCE.
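The absolute error map described for Fig. 8 can be sketched as follows for a single band. This is an illustrative sketch only: the helper names are hypothetical, both bands are assumed to be already normalized to [0, 1], and the colormap used in the figure is not reproduced.

```python
def abs_error_map(gt_band, rec_band):
    """Per-pixel absolute difference between two same-sized 2D bands."""
    return [[abs(g - r) for g, r in zip(g_row, r_row)]
            for g_row, r_row in zip(gt_band, rec_band)]

def mean_error(error_map):
    """Average of an error map, as a single-number summary."""
    flat = [v for row in error_map for v in row]
    return sum(flat) / len(flat)

# Toy 2x2 band, values assumed to be normalized to [0, 1].
gt_band = [[0.0, 0.5],
           [1.0, 0.25]]
rec_band = [[0.0, 0.25],
            [1.0, 0.25]]

err = abs_error_map(gt_band, rec_band)
print(err)              # only the top-right pixel differs
print(mean_error(err))  # small values correspond to "bluer" maps
```

Rendering such a map with a colormap that runs from blue (zero) to red (large) gives the kind of picture discussed above, where residual edge structure reveals where a method fails to restore fine detail.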

We also visualize the spectral distortion of the reconstructed image by drawing spectral curves for six scenes, which are presented in Fig. 9. Since 3D-FCNN loses some of the bands during reconstruction, we only show some of the bands. It can be seen from this figure that the distortion for 3D-FCNN is the most severe. The distortion of the spectral curve obtained by Bicubic is relatively small compared with 3D-FCNN. Moreover, among these competitors, the spectral curves of GDRRN, EDSR, and SSRNet are basically consistent with that of the ground truth, but the results of our method are much closer to the ground truth in most cases, which indicates that our algorithm attains higher spectral fidelity. In conclusion, SSRNet not only outperforms state-of-the-art SR algorithms in quantitative evaluation, but also yields more realistic visual results.

V. CONCLUSION

Considering that existing deep learning-based hyperspectral image super-resolution (SR) methods cannot simultaneously explore spatial information and spectral information between bands, we develop a novel spectral-spatial residual network (SSRNet) to reconstruct hyperspectral images, with the following contributions: 1) without changing the size of the hyperspectral image, our proposed network adopts 3D convolution instead of 2D convolution to effectively exploit spatial and spectral features; 2) we propose a spatial-spectral residual module (SSRM), which makes full use of the hierarchical features generated by each unit and learns effective features adaptively by way of local feature fusion; and 3) we employ separable 3D convolution to extract spatial and spectral features separately, which reduces the number of trainable parameters of the network, thus making the network easier to train. Extensive benchmark evaluations demonstrate that our SSRNet not only outperforms state-of-the-art SR algorithms, but also yields more realistic visual results.
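The parameter saving from separable 3D convolution can be illustrated with a short count. The kernel sizes (a 1x3x3 spatial kernel followed by a 3x1x1 spectral kernel, in place of one 3x3x3 kernel), the 64 channels, and the helper name are assumptions chosen for illustration, not values taken from the network configuration.

```python
def conv3d_params(c_in, c_out, kernel, bias=True):
    """Number of trainable parameters in one 3D convolution layer.
    kernel is (spectral, height, width)."""
    k_s, k_h, k_w = kernel
    return c_in * c_out * k_s * k_h * k_w + (c_out if bias else 0)

channels = 64  # assumed feature width

# One full 3x3x3 convolution.
full = conv3d_params(channels, channels, (3, 3, 3))

# Separable variant: spatial 1x3x3 followed by spectral 3x1x1.
separable = (conv3d_params(channels, channels, (1, 3, 3))
             + conv3d_params(channels, channels, (3, 1, 1)))

print(full, separable, full / separable)
```

Under these assumptions the weight count per unit drops from 27 to 9 + 3 = 12 kernel elements per input-output channel pair, a reduction of roughly 2.25x.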

In the future, we plan to improve the proposed SSRNet in two aspects. First, we will increase the spatial feature extraction capability of the network and reduce the feature extraction between spectra in each unit. Second, we will adopt hybrid 3D/2D convolution to reduce the training complexity of each module and thus accelerate the execution speed of the network.

REFERENCES

[1] Q. Li, Q. Wang, and X. Li, “An efficient clustering method for hyperspectral optimal band selection via shared nearest neighbor,” Remote Sens., vol. 11, no. 3, pp. 350, 2019.

[2] F. F. Sabins, “Remote sensing for mineral exploration,” Ore Geol. Rev., vol. 14, no. 3-4, pp. 157–183, 1999.

[3] J. Lin, N. T. Clancy, J. Qi, Y. Hu, T. Tatla, D. Stoyanov, L. Maier-Hein, and D. S. Elson, “Dual-modality endoscopic probe for tissue surface shape reconstruction and hyperspectral imaging enabled by deep neural networks,” Med. Image Anal., vol. 48, pp. 162–176, 2018.

[4] A. Lowe, N. Harrison, and A. P. French, “Hyperspectral image analysis techniques for the detection and classification of the early onset of plant disease and stress,” Plant Methods, vol. 13, no. 1, pp. 80, 2017.

[5] Q. Wang, Z. Yuan, and X. Li, “GETNET: A general end-to-end two-dimensional CNN framework for hyperspectral image change detection,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 1, pp. 3–13, 2019.

[6] Q. Wang, X. He, and X. Li, “Locality and structure regularized low rank representation for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 911–923, 2019.

[7] W. Xie, X. Jia, Y. Li, and J. Lei, “Hyperspectral image super-resolution using deep feature matrix factorization,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 6055–6067, 2019.

[8] W. Dong, F. Fu, G. Shi, X. Gao, J. Wu, G. Li, and X. Li, “Hyperspectral image super-resolution via non-negative structured sparse representation,” IEEE Trans. Image Process., vol. 25, no. 5, pp. 2337–2352, 2016.

[9] T. Akgun, Y. Altunbasak, and R. M. Mersereau, “Super-resolution reconstruction of hyperspectral images,” IEEE Trans. Image Process., vol. 14, no. 11, pp. 1860–1875, 2005.

[10] Y. Hu, J. Li, Y. Huang, and X. Gao, “Channel-wise and spatial feature modulation network for single image super-resolution,” IEEE Trans. Circuits Syst. Video Technol., 2019.

[11] Y. Qu, H. Qi, and C. Kwan, “Unsupervised sparse dirichlet-net for hyperspectral image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2511–2520.

[12] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Rgb-guided hyperspectral image upsampling,” in Proc. IEEE Int. Conf. on Comput. Vis., 2015, pp. 307–315.

[13] N. Akhtar, F. Shafait, and A. S. Mian, “Hierarchical beta process with gaussian process prior for hyperspectral image super resolution,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 103–120.

[14] N. Akhtar, F. Shafait, and A. S. Mian, “Bayesian sparse representation for hyperspectral image super resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3631–3640.

[15] H. Kwon and Y. Tai, “Rgb-guided hyperspectral image upsampling,” in Proc. IEEE Int. Conf. on Comput. Vis., 2015, pp. 307–315.

[16] E. Wycoff, T. Chan, K. Jia, W. Ma, and Y. Ma, “A non-negative sparse promoting algorithm for high resolution hyperspectral imaging,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 1409–1413.

[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[18] Y. Fu, T. Zhang, Y. Zheng, D. Zhang, and H. Huang, “Hyperspectral image super-resolution with optimized rgb guidance,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.

[19] S. Anwar, S. Khan, and N. Barnes, “A deep journey into super-resolution: A survey,” arXiv preprint arXiv:1904.07523, 2019.

[20] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2472–2481.

[21] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199.

[22] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1132–1140.

[23] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4681–4690.

[24] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3147–3155.

[25] Y. Li, L. Zhang, C. Ding, W. Wei, and Y. Zhang, “Single hyperspectral image super-resolution with grouped deep recursive residual network,” in Proc. IEEE International Conference on Multimedia Big Data, 2018, pp. 1–4.

[26] S. Mei, J. Ji, X. Yuan, Y. Zhang, S. Wan, and Q. Du, “Hyperspectral image spatial super-resolution via 3D full convolutional neural network,” Remote Sens., vol. 9, pp. 1139, 2017.

[27] X. Han, B. Shi, and Y. Zheng, “Ssf-cnn: Spatial and spectral fusion with cnn for hyperspectral image super-resolution,” in Proc. IEEE International Conference on Image Processing, 2018, pp. 2506–2510.

[28] R. Li, J. Hu, X. Zhao, W. Xie, and J. Li, “Hyperspectral image super-resolution using deep convolutional neural network,” Neurocomputing, vol. 266, pp. 29–41, 2017.

[29] J. Jia, L. Ji, Y. Zhao, and X. Geng, “Hyperspectral image super-resolution with spectral-spatial network,” in Proc. International Journal of Remote Sensing, 2018, pp. 7806–7829.

[30] Y. Yuan, X. Zheng, and X. Lu, “Hyperspectral image superresolution by transfer learning,” IEEE Trans. Geosci. Remote Sens., vol. 10, no. 5, pp. 1963–1974, 2017.

[31] C. Wang, Y. Liu, X. Bai, W. Tang, P. Lei, and J. Zhou, “Deep residual convolutional neural network for hyperspectral image super-resolution,” in Proc. International Conference on Image and Graphics, 2017, pp. 370–380.

[32] S. Li, F. He, B. Du, L. Zhang, Y. Xu, and D. Tao, “Fast spatiotemporal residual network for video super-resolution,” arXiv preprint arXiv:1904.02870, 2019.

[33] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 11065–11074.

[34] Q. Wang, Q. Li, and X. Li, “Hyperspectral band selection via adaptive subspace partition strategy,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., 2019.

[35] D. Tran, H. Wang, L. Torresani, and M. Feiszli, “Video classification with channel-separated convolutional networks,” arXiv preprint arXiv:1904.02811, 2019.

[36] Y. Zhou, X. Sun, Z. Zha, and W. Zeng, “Mict: Mixed 3D/2D convolutional tube for human action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 449–458.

[37] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 305–321.

[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.

[39] G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.

[40] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1646–1654.

[41] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum,” IEEE Trans. Image Process., vol. 19, no. 9, pp. 2241–2253, 2010.

[42] A. Chakrabarti and T. Zickler, “Statistics of real-world hyperspectral images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 193–200.

[43] S. M. Nascimento, K. Amano, and H. D. Foster, “Spatial distributions of local illumination color in natural scenes,” Vision Res., vol. 120, pp. 39–44, 2016.

[44] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T. Huang, “Wide activation for efficient and accurate image super-resolution,” arXiv:1808.08718v2, 2018.