3D Quasi-Recurrent Neural Network for Hyperspectral Image Denoising

2020·Arxiv

Abstract

In this paper, we propose an alternating directional 3D quasi-recurrent neural network for hyperspectral image (HSI) denoising, which can effectively embed the domain knowledge: structural spatio-spectral correlation and global correlation along spectrum. Specifically, 3D convolution is utilized to extract structural spatio-spectral correlation in an HSI, while a quasi-recurrent pooling function is employed to capture the global correlation along spectrum. Moreover, an alternating directional structure is introduced to eliminate the causal dependency with no additional computation cost. The proposed model is capable of modeling spatio-spectral dependency while preserving the flexibility towards HSIs with an arbitrary number of bands. Extensive experiments on HSI denoising demonstrate significant improvement over state-of-the-art methods under various noise settings, in terms of both restoration accuracy and computation time. Our code is available at https://github.com/Vandermode/QRNN3D.

Index Terms—Hyperspectral image denoising, structural spatio-spectral correlation, global correlation along spectrum, quasi-recurrent neural networks, alternating directional structure

I. INTRODUCTION

A hyperspectral image (HSI) is made up of massive discrete wavebands for each spatial position of real scenes and provides much richer information about scenes than RGB images, which has led to numerous applications in remote sensing [27], [34], classification [2], [6], [31], [38], [45], tracking [37], face recognition [36], and more. However, due to the limited light for each band, traditional HSIs are often degraded by various noises (e.g., Gaussian, stripe, deadline, and impulse noises) during the acquisition process. These degradations negatively influence the performance of all the subsequent HSI processing tasks mentioned above. Therefore, HSI denoising is an essential pre-processing step in the typical workflow of HSI analysis and processing.

Recently, more HSI denoising works pay attention to the domain knowledge of the HSI — structural spatio-spectral correlation and global correlation along spectrum (GCS) [42]. Top-performing classical methods [8], [9], [39], [41], [42] typically utilize non-local low-rank tensors to model them. Although these methods achieve higher accuracy by effectively considering these underlying characteristics, the performance of such methods is inherently determined by how well the human handcrafted prior (e.g. low-rank tensors) matches with the intrinsic characteristics of an HSI. Besides, such approaches generally formulate the HSI denoising as a complex optimization problem to be solved iteratively, making the denoising process time-consuming.

Alternative learning-based approaches rely on convolutional neural networks in lieu of the costly optimization and handcrafted priors [7], [46]. Promising results notwithstanding, these approaches model HSI by learned multichannel or bandwise 2D convolutions, which sacrifice either the flexibility with respect to the spectral dimension [7] (hence requiring the network to be retrained to adapt to HSIs with a mismatched spectral dimension), or the model capability to extract GCS knowledge [46] (thus leading to relatively low performance, as shown in Figure 1).

Fig. 1: Our QRNN3D outperforms all leading-edge methods on ICVL dataset in both Gaussian and complex noise cases.

In principle, the trade-off between the model capability and flexibility imposes a fundamental limit for real-world applications. In this paper, we find that combining domain knowledge with 3D deep learning (DL) can achieve both goals simultaneously. Unlike prior DL approaches [7], [46] that always utilize the 2D convolution as the basic building block of the network, we introduce a novel building block, namely the 3D quasi-recurrent unit (QRU3D), to model HSI from a 3D perspective. This unit contains a 3D convolutional subcomponent and a quasi-recurrent pooling function [5], enabling structural spatio-spectral correlation and GCS modeling respectively. The 3D convolutional subcomponent can extract spatio-spectral features from multiple adjacent bands, while the quasi-recurrent pooling recurrently merges these features over the whole spectrum, controlled by a dynamic gating mechanism. With this mechanism, the pooling weights are dynamically calculated from the input features, thereby allowing for adaptively modeling the GCS knowledge. To eliminate the unidirectional causal dependency (Figure 4) introduced by the vanilla recurrent structure, we furthermore propose an alternating directional structure with no additional computation cost.

Our network, called 3D quasi-recurrent neural network (QRNN3D), has been designed to make full use of the domain knowledge, especially the GCS. It makes significant improvements in model capability/accuracy while remaining agnostic to the spectral dimension of input HSIs, and thus can be applied to any HSIs captured by unknown sensors (with different spectral resolutions). In extensive experiments, QRNN3D outperforms all leading-edge methods on several benchmark datasets under various noise settings, as shown in Figure 1.

Our main contributions are summarized as follows. We

1) present a novel building block, namely QRU3D, that can effectively exploit the domain knowledge, i.e. structural spatio-spectral correlation and global correlation along spectrum (GCS), simultaneously.

2) introduce an alternating directional structure to eliminate the unreasonable causal dependency towards HSI modeling, with no additional computation cost.

3) demonstrate that our model pretrained on ICVL dataset can be directly utilized to tackle remotely sensed imagery, which is infeasible for conventional 2D DL approaches to HSI modeling.

The remainder of this paper is organized as follows. In Section II, we review related HSI denoising methods and DL approaches that inspire our work. Section III introduces the QRNN3D approach for HSI denoising. Extensive experimental results on an HSI database of natural scenes and on remotely sensed images are presented in Section IV, followed by more discussions that facilitate the understanding of QRNN3D in Section V. Conclusions are drawn in Section VI.

II. RELATED WORK

A. HSI Denoising

Existing methods towards HSI denoising can be roughly classified into two categories depending on the noise model.

The most frequently used noise model is zero-mean white and homogeneous Gaussian additive noise. Under this assumption, BM4D [28], an extension of the BM3D filter [13] to volumetric data, could be directly applied for HSI denoising. By considering the GCS and non-local self-similarity in HSI simultaneously, Peng et al. proposed a tensor dictionary learning (TDL) model [30] which achieved very promising performance. Following this line, more sophisticated methods have been successively proposed [8], [9], [14], [16], [19], [41], [42], [50]. Among these methods, the low-rank tensor based models ITSReg [42] and LLRT [9], and a new iterative projection and denoising algorithm, NG-meet [19], achieve state-of-the-art performance, owing to their elaborate efforts on modeling the intrinsic properties of the HSI.

Besides, several works [11], [20], [39], [43], [48] aim to resolve the realistic complex noise by modeling the noise with complicated non-i.i.d. statistical structures. They all frame the denoising problem into a low-rank based optimization scheme, and then utilize some constraints (e.g. total variation, and nuclear norm) to remove the complex noise (e.g. non-i.i.d. Gaussian, stripe, deadline, impulse).

Recently, leveraging the power of DL, Chang et al. [7] extended the 2D image denoising architecture DnCNN [49] to remove various noise in HSIs. They argued that the learned filters can well extract the structural spatial information. Yuan et al. [46] utilized a deep residual network to recover remotely sensed images under Gaussian noise, processing the HSI with a sliding window strategy. Concurrently with our work, Dong et al. [15] proposed a 3D factorizable U-net architecture to exploit spatial-spectral correlations in HSIs from the 3D perspective. All these DL-based methods insufficiently exploit the GCS knowledge, and they cannot adjust the learned parameters to adaptively fit input data, consequently lacking the freedom to discriminate the input-dependent spatio-spectral correlations.

In this paper, we leverage the power of DL to automatically learn the mapping purely from the data instead of relying on handcrafted priors and complex optimization, achieving orders-of-magnitude speedup in both Gaussian and complex noise contexts. Besides, our DL-based method can effectively exploit the underlying characteristics, i.e. structural spatio-spectral correlation and GCS, without sacrificing the flexibility towards HSIs with an arbitrary number of bands.

B. Deep Learning for Image Denoising

Research on gray/RGB image denoising has been dominated by discriminative learning based approaches, especially the deep convolutional neural network (CNN), in recent years [10], [29], [33], [49], [51], [52]. Zhang et al. [49] proposed a modern deep architecture, namely DnCNN, by embedding batch normalization [23] and residual learning [18]. Meanwhile, Mao et al. [29] presented a very deep fully convolutional encoding-decoding framework for image restoration such as denoising and super-resolution. Both of them yielded better Gaussian denoising results and less computation time than the highly-engineered benchmark BM3D [13]. Along this line, more works have been proposed to explore the deep architecture design for image denoising. For example, MemNet [33] introduces a memory block to exploit long-term information. Residual dense network [52] goes beyond that to build dense connections within blocks. Residual non-local attention network [51] utilizes local and non-local attention blocks to extract features that capture the long-range dependencies between pixels and pay more attention to the challenging parts.

Although all these networks can be directly extended to the HSI case, none of them specifically considers the domain knowledge of the HSI.

C. Deep Image Sequence Modeling

Modeling image sequence with various lengths is a fundamental problem in a variety of research fields such as precipitation nowcasting, video processing, and so on.

Bidirectional recurrent convolutional networks (BRCN) [22] and convolutional LSTM (ConvLSTM) [44] were proposed for resolving the multi-frame super-resolution and precipitation nowcasting problems respectively. The key insight of these models is to replace the commonly used recurrent full connections by weight-sharing convolutional connections, such that they can greatly reduce the number of network parameters and model the temporal dependency at a finer level (i.e. patch-based rather than frame-based). However, these patch-based operations cannot efficiently capture the spectral correlation; meanwhile, recurrently applying convolution along the spectrum would drastically increase the computational complexity. In contrast, our QRNN3D employs an elementwise recurrent mechanism, enabling good scaling to HSIs with a large number of bands. Besides, this mechanism naturally imposes a prior constraint over the spectrum, making it well-suited for extracting GCS knowledge.

TABLE I: Network configuration of our residual encoder-decoder style QRNN3D for HSI restoration.

Fig. 2: The overall architecture of our residual encoder-decoder QRNN3D. The network contains layers of symmetric QRU3D with convolution and deconvolution for encoder (blue) and decoder (orange) respectively. Symmetric skip connections are added in each layer. Besides, alternating directional structure is equipped in all layers except the top and bottom ones with bidirectional structure to avoid bias.

III. THE PROPOSED METHOD

An HSI degraded by additive noise can be linearly modeled as

$$Y = X + \epsilon, \qquad (1)$$

where $Y \in \mathbb{R}^{H \times W \times B}$ is the observed noisy image, $X$ is the original clean image, and $\epsilon$ denotes the additive random noise. $H$, $W$, $B$ indicate the spatial height, spatial width, and number of spectral bands respectively.

Here, we consider miscellaneous noise removal in the denoising context, where $\epsilon$ can represent different types of random noise including Gaussian noise, sparse noise (stripe, deadline and impulse) or a mixture of them. Given a noisy HSI $Y$, our goal is to obtain its noise-free counterpart $X$.

In this section, we introduce the residual encoder-decoder QRNN3D for HSI denoising. As shown in Figure 2, our network consists of six pairs of symmetric QRU3D with convolution and deconvolution for encoder and decoder respectively, leading to twelve layers in total. We use two layers with stride=2 convolution to downsample the input in the encoder part, and then two layers with stride=1/2 to upsample in the decoder part. The benefits of the downsampling and upsampling operations are that we can use a larger network under the same computational cost, and increase the receptive field size to make use of the context information in a larger image region. Table I illustrates our network configuration. Each layer contains a QRU3D whose kernel size is set empirically to maximize performance [35]. Stride and output channels in each layer are listed, and other configuration (e.g. padding) can be inferred implicitly.

In the following, we first present the QRU3D, which is the core building block in our method. Then, alternating directional structure used to eliminate the unreasonable causal dependency is introduced, and learning details are provided.

A. 3D Quasi-Recurrent Unit

QRU3D is the basic building block of QRNN3D. It consists of two subcomponents, i.e. a 3D convolutional subcomponent and quasi-recurrent pooling, as shown in Figure 3. Unlike the 2D convolution, neither subcomponent fixes the number of spectral bands, leaving the QRNN3D free to process HSIs with an arbitrary number of bands.

3D Convolutional Subcomponent. The 3D convolutional subcomponent of QRU3D performs two sets of 3D convolutions [24], [35] with separate filter banks, producing tensors passed through different activation functions,

$$Z = \tanh(W_z * I), \qquad F = \sigma(W_f * I), \qquad (2)$$

where $I \in \mathbb{R}^{C_{in} \times H \times W \times B}$ is the input feature maps coming from the previous layer (in the first layer, input $I = Y$ with $C_{in} = 1$); $Z \in \mathbb{R}^{C_{out} \times H \times W \times B}$ is a high dimensional candidate tensor. $F$ has the same dimension as $Z$, representing the neural forget gate that controls the behavior of dynamic memorization. Both $W_z$ and $W_f$ are 3D convolutional filter banks, $*$ denotes a 3D convolution, and $\sigma$ indicates a sigmoid non-linearity.

The 3D convolution is achieved by convolving a 3D kernel to a whole HSI in both spatial and spectral dimensions. The 3D convolution in the spatial domain can mimic numerous operations widely used in low-level vision (like image patch extraction and 2D patch transform in BM3D [13], [26]) and the 3D convolution in the spectral domain can model the local spectrum continuity to alleviate the spectral distortion. Consequently, the embedded C3D can effectively exploit the structural spatio-spectral correlation in HSIs.
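As a rough illustration of the two gated 3D convolutions, the following NumPy sketch computes a candidate tensor and a forget gate for a single-channel input. The 3×3×3 kernel size, the tanh activation on the candidate, the random filters and the helper names (`conv3d_same`, `qru3d_gates`) are assumptions made for this example; it is not the authors' implementation, which handles multiple channels and strides.

```python
import numpy as np

def conv3d_same(x, kernel):
    """Naive single-channel 3D cross-correlation with zero padding ("same" size).

    x:      array of shape (H, W, B)
    kernel: array of shape (3, 3, 3); the size is an assumption of this sketch
    """
    kh, kw, kb = kernel.shape
    padded = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2, (kb // 2,) * 2))
    H, W, B = x.shape
    out = np.zeros((H, W, B), dtype=float)
    # Accumulate one shifted copy of the input per kernel tap.
    for i in range(kh):
        for j in range(kw):
            for k in range(kb):
                out += kernel[i, j, k] * padded[i:i + H, j:j + W, k:k + B]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def qru3d_gates(x, w_z, w_f):
    """Return the candidate tensor Z and forget gate F for input x."""
    z = np.tanh(conv3d_same(x, w_z))   # candidate tensor, values in (-1, 1)
    f = sigmoid(conv3d_same(x, w_f))   # forget gate, values in (0, 1)
    return z, f

rng = np.random.default_rng(0)
w_z = rng.normal(scale=0.1, size=(3, 3, 3))  # hypothetical, untrained filters
w_f = rng.normal(scale=0.1, size=(3, 3, 3))

# The same filters apply to inputs with different numbers of bands.
for bands in (31, 103):
    z, f = qru3d_gates(rng.random((8, 8, bands)), w_z, w_f)
    print(bands, z.shape, f.shape)
```

Because the kernel slides along the spectral axis instead of spanning it, nothing in the filters depends on the number of bands, which is the flexibility the text attributes to the 3D formulation.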

Quasi-Recurrent Pooling. Although the 3D convolutional subcomponent has already exploited the inter-band relationship, it is computed in a local way and cannot explicitly exploit GCS. To effectively utilize the GCS, we present quasi-recurrent pooling, in which pooling operation and dynamic gating mechanism are introduced.

Fig. 3: The overall structure of QRU3D. It can be described in four steps. First, the input $I$ is transformed by two sets of 3D convolutions, generating a candidate tensor $Z$ and a neural forget gate $F$. Second, $Z$ and $F$ are split along the spectrum to produce sequences of $z_i$ and $f_i$. Third, the quasi-recurrent pooling function is applied recurrently to merge the previous hidden state $h_{i-1}$ and the current candidate $z_i$, controlled by the current neural gate $f_i$, resulting in a new hidden state $h_i$. Finally, the hidden states are concatenated together to form the whole output $H$ passed to the next layer.

In our QRU3D, the quasi-recurrent pooling is applied after the candidate tensor $Z$ and neural forget gate $F$ are obtained by the 3D convolutional subcomponent. We first split $Z$ and $F$ along the spectrum, generating sequences of $z_i$ and $f_i$ respectively, and then feed these states into a quasi-recurrent pooling function [5],

$$h_i = f_i \odot h_{i-1} + (1 - f_i) \odot z_i, \quad \forall i \in [1, B], \qquad (3)$$

where $\odot$ denotes an element-wise multiplication, $h_{i-1}$ is the hidden state merged through all previous states, $h_i$ also represents the $i$-th band in the output of this layer, and $h_0 = 0$ has all entries equal to zero. The forget gate $f_i$ balances the weight of the current candidate $z_i$ and the previous memory, i.e. hidden state $h_{i-1}$. Its value depends on the current input $I$ instead of being fixed like a convolutional filter, so it can adapt to the input image itself rather than relying solely on the parameters learned in the training stage. By this construction, the inter-band information is merged adaptively. Meanwhile, since this dynamic pooling recurrently operates across the whole spectrum, the GCS can be effectively exploited. The output feature maps $H$ are produced by concatenating all hidden states along the spectrum.
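To see why this recurrence reaches across the whole spectrum, it can help to unroll it. The derivation below assumes the f-pooling form of [5], $h_i = f_i \odot h_{i-1} + (1 - f_i) \odot z_i$ with $h_0 = 0$, where $z_i$, $f_i$ and $h_i$ denote the $i$-th band of the candidate tensor, forget gate and hidden state.

```latex
\begin{aligned}
h_1 &= (1 - f_1) \odot z_1, \\
h_2 &= f_2 \odot (1 - f_1) \odot z_1 + (1 - f_2) \odot z_2, \\
h_i &= \sum_{j=1}^{i} \Big( \prod_{k=j+1}^{i} f_k \Big) \odot (1 - f_j) \odot z_j .
\end{aligned}
```

Under this assumption, each hidden state is a weighted combination of every candidate up to band $i$, and the input-dependent weights sum element-wise to $1 - \prod_{k=1}^{i} f_k$. A gate close to one carries information from distant bands forward, while a gate close to zero lets the current band dominate.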

In addition, due to the independent neural gate and element-wise recurrent operations (multiplication), the QRU3D is highly parallel, enabling good scaling to HSIs with a large number of bands. More specifically, the calculation of the neural forget gate depends only on multiple contiguous bands of the input, instead of involving the previous hidden state as in typical RNNs (e.g. LSTM [21] and GRU [12]). Meanwhile, the elementwise multiplication is far more computationally economical than the convolution used by ConvLSTM [44], and thus can easily be applied recurrently hundreds of times.
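A minimal NumPy sketch of the pooling step may make the cost argument concrete. It assumes the f-pooling recurrence of [5], $h_i = f_i \odot h_{i-1} + (1 - f_i) \odot z_i$ with $h_0 = 0$, and takes the gates as precomputed inputs; the function name and the `reverse` flag are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def quasi_recurrent_pooling(z, f, reverse=False):
    """Merge candidates z under gates f along the last (spectral) axis.

    z, f: arrays of identical shape (..., B); f is expected to lie in (0, 1).
    Only element-wise multiply-adds sit inside the sequential loop.
    """
    bands = z.shape[-1]
    order = range(bands - 1, -1, -1) if reverse else range(bands)
    h = np.zeros(z.shape[:-1])              # h_0 = 0
    out = np.empty_like(z, dtype=float)
    for i in order:
        h = f[..., i] * h + (1.0 - f[..., i]) * z[..., i]
        out[..., i] = h
    return out

z = np.ones((2, 2, 3))
f = np.full((2, 2, 3), 0.5)
print(quasi_recurrent_pooling(z, f)[0, 0])                # 0.5, 0.75, 0.875
print(quasi_recurrent_pooling(z, f, reverse=True)[0, 0])  # 0.875, 0.75, 0.5
```

The gates never read the hidden state, so they can be computed for all bands at once before the loop starts; the loop itself works for any number of bands.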

B. Alternating Directional Structure

A forward 3D quasi-recurrent unit, as in Equation (3), reads the candidate tensor sequence in order from the first band $z_1$ to the last $z_B$, so that a hidden state $h_i$ only depends on the candidates $z_1, \dots, z_i$ (and their corresponding bands). This introduces a causal dependency, since the computing stream of the hidden state propagates unidirectionally as shown in Figure 4(a), which is not reasonable for an HSI.

Fig. 4: Directional structure overview. (a) Unidirectional structure: hidden states propagate unidirectionally. (b) Bidirectional structure: one layer contains two sublayers which propagate states in opposite directions, generating results by adding the sublayers' outputs. (c) Our proposed alternating directional structure: the direction of the network changes in each layer.

Fig. 5: Synthesized RGB image samples from ICVL dataset.

A typical solution is to use a bidirectional structure [4], [22], [32], in which a layer of the network contains two sublayers, i.e. a forward QRU3D and a backward QRU3D in our case, as shown in Figure 4(b). The forward QRU3D reads the candidate tensor sequence in order and calculates a sequence of forward hidden states. The backward QRU3D reads the sequence in reverse order, leading to a sequence of backward hidden states. The output of this layer is calculated by adding the forward and backward hidden states element-wise. However, this structure makes the computational burden unacceptable, since it nearly doubles the memory consumption.

To ease this issue, we present an alternating directional structure for HSIs. Specifically, a QRNN3D with alternating directional structure changes the direction of the computing stream of the hidden state in each layer, as shown in Figure 4(c). This structure is built by alternately stacking forward and backward QRU3D, in which a forward (or backward) state is merged by a backward (or forward) state in the next layer, such that the global context information can be propagated through the whole spectrum.

Compared with the typical bidirectional solution, our proposed alternating directional structure adds almost no additional computation cost, while keeping the ability to model the dependency on the whole spectrum of an HSI regardless of the position of the output.
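The effect of alternating the scan direction can be checked with a small perturbation experiment. The sketch below is a deliberately simplified stand-in: each "layer" is only a quasi-recurrent pooling with a constant gate of 0.5 and no convolution, and the helper names are hypothetical. It perturbs one input band and reports which output bands change.

```python
import numpy as np

def pool(z, f, reverse=False):
    """Quasi-recurrent pooling along the last axis, scanned forward or backward."""
    bands = z.shape[-1]
    order = range(bands - 1, -1, -1) if reverse else range(bands)
    h = np.zeros(z.shape[:-1])
    out = np.empty_like(z)
    for i in order:
        h = f[..., i] * h + (1.0 - f[..., i]) * z[..., i]
        out[..., i] = h
    return out

def stack(x, directions, gate=0.5):
    """Apply one pooling 'layer' per entry of directions (True means backward)."""
    h = x
    for reverse in directions:
        h = pool(h, np.full_like(h, gate), reverse=reverse)
    return h

def affected_bands(directions, bands=6, perturbed=3):
    """Indices of output bands that change when one input band is perturbed."""
    base = np.zeros((1, 1, bands))
    bumped = base.copy()
    bumped[..., perturbed] = 1.0
    diff = stack(bumped, directions) - stack(base, directions)
    return np.flatnonzero(diff[0, 0] != 0).tolist()

print(affected_bands((False, False)))  # two forward layers:   [3, 4, 5]
print(affected_bands((False, True)))   # forward then backward: [0, 1, 2, 3, 4, 5]
```

Both stacks run exactly two pooling passes, yet only the alternating one lets band 3 influence every output band. A bidirectional stack of the same depth would need four passes.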

IV. EXPERIMENTAL RESULTS

A. Experimental settings

Benchmark Datasets. We conduct several experiments using data from ICVL hyperspectral dataset [3], where 201 images were collected over 31 spectral bands. The simulated pseudo color image samples from this dataset are illustrated in Figure 5. We use 100 images for training and 5 images for validation, while the others are for testing. To enlarge the training set, we crop multiple overlapped volumes from training HSIs and then regard each volume as a training sample. During cropping, each volume keeps a spectral size of 31 for the purpose of preserving the complete spectrum of an HSI. Data augmentation schemes such as rotation and scaling are also employed, resulting in roughly 50k training samples in total. As for the testing set, we crop the main region of each image with size of 512×512×31 given the computation cost¹.

Besides, we evaluate the robustness and flexibility of our model on remotely sensed hyperspectral datasets including Pavia Centre, Pavia University, Indian Pines and Urban. Pavia Centre and Pavia University were acquired by the ROSIS sensor; the number of spectral bands is 102 for Pavia Centre and 103 for Pavia University. Indian Pines and Urban were gathered by the 224-band AVIRIS sensor and the 210-band HYDICE hyperspectral system respectively. Both have been used for real HSI denoising experiments [9], [20], [39].

Noise settings. Real-world HSIs are usually contaminated by several different types of noise, including the most common Gaussian noise, impulse noise, dead pixels or lines, and stripes [11], [17], [48]. We define five types of complex noise as follows, referred to as Cases 1-5 respectively.

Case 1: Non-i.i.d. Gaussian noise. Entries in all bands are corrupted by zero-mean Gaussian noise with different intensities, randomly selected from 10 to 70.

Case 2: Gaussian + Stripe noise. All bands are corrupted by non-i.i.d. Gaussian noise as in Case 1. One third of the bands (10 bands for ICVL dataset) are randomly chosen to add stripe noise (affecting 5% to 15% of columns).

Case 3: Gaussian + Deadline noise. The noise generation process is nearly the same as Case 2 except the stripe noise is replaced by deadline.

Case 4: Gaussian + Impulse noise. Each band is contaminated by Gaussian noise as in Case 1. One third of the bands are randomly selected to add impulse noise with intensity ranging from 10% to 70%.

Case 5: Mixture noise. Each band is randomly corrupted by at least one kind of noise mentioned in Case 1-4.
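The first two cases can be sketched in NumPy as follows. Several details are assumptions of this example rather than statements from the text: the clean image is taken to lie in [0, 1] with noise intensities expressed on an 8-bit (0 to 255) scale, per-band intensities are drawn uniformly, and stripes are modeled as a constant offset per affected column with a hypothetical `amplitude`.

```python
import numpy as np

def add_noniid_gaussian(x, rng, sigma_range=(10, 70)):
    """Case 1: zero-mean Gaussian noise with a different intensity per band."""
    bands = x.shape[-1]
    # Assumption: intensities are on a 0-255 scale and the image is in [0, 1].
    sigmas = rng.uniform(*sigma_range, size=bands) / 255.0
    return x + rng.normal(size=x.shape) * sigmas, sigmas

def add_stripes(x, rng, band_fraction=1 / 3, col_range=(0.05, 0.15), amplitude=0.25):
    """Stripe part of Case 2: offset 5% to 15% of the columns in a third of the bands."""
    out = x.copy()
    H, W, B = x.shape
    chosen = rng.choice(B, size=round(B * band_fraction), replace=False)
    for b in chosen:
        n_cols = int(rng.uniform(*col_range) * W)
        cols = rng.choice(W, size=n_cols, replace=False)
        # One constant offset per striped column (hypothetical stripe model).
        out[:, cols, b] += rng.uniform(-amplitude, amplitude, size=n_cols)
    return out, chosen

rng = np.random.default_rng(0)
clean = rng.random((32, 32, 31))
noisy, sigmas = add_noniid_gaussian(clean, rng)
noisy, striped_bands = add_stripes(noisy, rng)
print(noisy.shape, len(striped_bands))  # (32, 32, 31) 10
```

With 31 bands, rounding one third gives the 10 striped bands mentioned for ICVL. Deadline and impulse noise would follow the same pattern with a different per-band corruption.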

Competing Methods. We compare our method against both traditional and DL methods in both Gaussian and complex noise cases. In general, traditional methods are best suited to a specific noise setting, depending on their noise assumption, whereas DL methods can be applied in various noise settings by training multiple models to tackle miscellaneous noises. For the sake of fairness, we adopt different traditional baselines in these two noise contexts, given their noise assumptions.

In Gaussian noise case, we compare with several representative traditional methods including filtering-based approaches (BM4D [28]), dictionary learning approach (TDL [30]), and tensor-based approaches (ITSReg [42], LLRT [9]). In complex noise case, the competing traditional baselines include low-rank matrix recovery approaches (LRMR [48], LRTV [20], NMoG [11]), and low-rank tensor approach (TDTV [39]).

For DL approaches, we compare our model with HSID-CNN [46]. Besides, any DL method for single image denoising can be extended to the HSI denoising case (by modifying the first layer to adapt to the HSI, i.e. changing the number of input channels from 3 to 31). For completeness, we also compare with such a state-of-the-art 2D DL approach, i.e. MemNet [33] with 31 input channels in the first layer, which entails a fixed number of spectral bands. Since the training setting is different between ours and other DL approaches, we finetune/retrain their pretrained models with our well-designed training strategy to achieve better performance on our dataset.

Network learning. We develop an incremental training policy to stabilize and accelerate the training, which also avoids the network converging to a poor local minimum. The philosophy of our training policy is simple: learning to solve tasks in an easy-to-difficult way [1]. Networks are learned by minimizing the mean square error (MSE) between the predicted high-quality HSI and the ground truth. The network parameters are initialized as in [17], and optimized using the ADAM optimizer [25] with the deep learning framework PyTorch² on a machine with an NVIDIA GTX 1080Ti GPU, an Intel(R) Core(TM) i7-7700K CPU at 4.2GHz and 16 GB RAM. Rather than training networks independently to tackle each type of noise separately, we simply train two models, one for the Gaussian noise case and one for the complex noise case. Our network learning goes through three stages, from the easy task of Gaussian denoising with fixed noise level, to the difficult one of complex noise removal. The models are incrementally trained, reusing the prior state (pretrained parameters) to maximize the training efficiency (see discussions in Section V-A). We follow the previous image restoration work [29] to choose the hyper-parameters of the learning algorithm. These values were empirically set to make network learning fast yet stable. Specifically, the learning rate is decayed at the epochs where the validation performance no longer increases. A small batch size (i.e. 16) is used to accelerate training at the first stage, while a large batch size (i.e. 64) is adopted to stabilize training when tackling harder cases (e.g. complex noise case). The overview of our training procedures is shown in Table II, with detailed hyper-parameter settings.

TABLE II: Overview of our incremental training policy. Our network learning goes through three stages, from the easy task of Gaussian denoising with fixed noise level, to the difficult one of complex noise removal. In our implementation, the fixed noise level in stage 1 is set to 50. The unknown noise level in stage 2 is uniformly sampled from 30 to 70. Unknown complex noise in stage 3 denotes the complex noise randomly chosen from Case 1 to 4 (without Case 5: mixture noise). The models trained at the end of stage 2 (epoch 50) and 3 (epoch 100) are used in Gaussian denoising and complex noise removal tasks respectively.

Fig. 6: Simulated Gaussian noise removal results, with PSNR (dB), at one band of an image from ICVL dataset. (Best view on screen with zoom)

Fig. 7: Simulated complex noise removal results on ICVL dataset. Examples for non-i.i.d. Gaussian noise, Gaussian + stripes, Gaussian + deadline, Gaussian + impulse and mixture noise removal (Cases 1-5) are presented respectively. (Best view on screen with zoom)

Fig. 8: PSNR values across the spectrum corresponding to Gaussian and complex noise removal results in Figures 6 and 7 respectively.
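The three-stage curriculum can be summarized as a small scheduling function. The stage 2 and stage 3 end points (epochs 50 and 100), the noise settings and the batch sizes of 16 and 64 come from the description above; the stage 1 boundary, the zero-based epoch numbering, continuous sampling of the noise level, and the use of batch size 64 in stage 2 are assumptions of this sketch.

```python
import random

STAGE1_END = 30    # hypothetical boundary; not stated in the text
STAGE2_END = 50    # Gaussian model is taken at the end of stage 2
STAGE3_END = 100   # complex-noise model is taken at the end of stage 3

def noise_setting(epoch, rng=random):
    """Return the corruption and batch size used at a (zero-based) epoch."""
    if epoch < STAGE1_END:
        # Stage 1: Gaussian noise with a fixed level, small batches.
        return {"type": "gaussian", "sigma": 50, "batch_size": 16}
    if epoch < STAGE2_END:
        # Stage 2: Gaussian noise with an unknown level sampled from 30 to 70.
        return {"type": "gaussian", "sigma": rng.uniform(30, 70), "batch_size": 64}
    if epoch < STAGE3_END:
        # Stage 3: one complex noise case out of Cases 1-4 (no mixture).
        return {"type": "complex", "case": rng.randint(1, 4), "batch_size": 64}
    raise ValueError("epoch is beyond the last training stage")

for epoch in (0, 40, 75):
    print(epoch, noise_setting(epoch))
```

Each stage starts from the parameters left by the previous one, so in a real training loop the function would only change how the clean samples are corrupted, not the network being optimized.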

Quantitative Metrics. To give an overall evaluation, three quantitative quality indices are employed, i.e. PSNR, SSIM [40], and SAM [47]. PSNR and SSIM are two conventional spatial-based indexes, while SAM is spectral-based. Larger values of PSNR and SSIM imply better performance, while a smaller value of SAM suggests better performance.
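For reference, PSNR and SAM can be computed as in the sketch below (SSIM is omitted for brevity). It assumes images scaled to [0, 1] with the spectral axis last, a PSNR taken over the whole volume, and SAM reported in radians; the averaging conventions behind the reported tables may differ.

```python
import numpy as np

def psnr(clean, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB over the whole volume."""
    mse = np.mean((clean - restored) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def sam(clean, restored, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra."""
    dot = np.sum(clean * restored, axis=-1)
    norms = np.linalg.norm(clean, axis=-1) * np.linalg.norm(restored, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

rng = np.random.default_rng(0)
clean = rng.random((16, 16, 31))
noisy = clean + rng.normal(scale=50 / 255, size=clean.shape)
print(f"PSNR: {psnr(clean, noisy):.2f} dB, SAM: {sam(clean, noisy):.3f} rad")
```

SAM depends only on the direction of each pixel's spectrum, so a uniform brightness change leaves it near zero while still lowering PSNR; this is why the two kinds of index complement each other.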

B. Experiments on ICVL Dataset

Denoising in Gaussian Noise Case. Zero-mean additive white Gaussian noise with different variances is added to generate the noisy observations. The model trained at the end of stage 2 (epoch 50) is used to tackle all different levels of corruption³. Figure 6 shows the denoising results at one noise level. It can be easily observed that our method properly removes the Gaussian noise while finely preserving the structure underlying the HSI. Traditional methods like BM4D and TDL introduce evident artifacts in some areas. Other methods suppress the noise better, but still lose some fine-grained details and produce relatively low-quality results compared with ours. The quantitative assessment results are listed in Table III. Compared with all competing methods, the QRNN3D achieves better performance in most quantitative assessments, further confirming the high fidelity of our method.

Denoising in Complex Noise Case. Five types of complex noise are added to generate noisy samples. In brief, cases 1-5 represent non-i.i.d. Gaussian noise, Gaussian + stripes, Gaussian + deadline, Gaussian + impulse, and a mixture of them respectively (see Section IV-A for more details). As in the Gaussian noise case, a single model trained at the end of stage 3 (epoch 100) is utilized to deal with cases 1-5 simultaneously. It is worth noting that each sample in our training set is corrupted by one of the noise types (i.e. cases 1-4), while in case 5, each testing sample suffers from multiple types of noise, a combination not contained in the training set. We show the qualitative and quantitative results in Figure 7 and Table IV respectively, which show that our QRNN3D significantly outperforms the other methods. Furthermore, the results in the mixture noise case exhibit the strong generalization of our model, since the mixture noise is not seen by our model in the training stage.

In Figure 7, the observed images are corrupted by miscellaneous complex noises. Low-rank matrix recovery methods, i.e. LRMR and LRTV, which assume that the clean HSI lies in a low-rank subspace from the spectral perspective, successfully remove most of the noise, but at the cost of losing fine details. Our QRNN3D eliminates miscellaneous noises to a great extent, while preserving the fine-grained structure of the original image (e.g. the texture of the road in the second photo of Figure 7) more faithfully than the top-performing traditional low-rank tensor approach TDTV and other DL methods. Figure 8 shows the PSNR value of each band in these HSIs. It can be seen that the PSNR values of all bands obtained by our QRNN3D are clearly higher than those of the compared methods.

C. Experiments on Remotely Sensed Images

Synthetic Data. Here, we conduct experiments on Pavia University in the mixture noise case. Given the similarity between Pavia Centre and Pavia University, the model is first trained from scratch only on Pavia Centre. It can be seen that our train-from-scratch model (Ours-S in Table V) performs undesirably, even compared with the traditional method TDTV (29.64 vs. 30.06).

Fig. 9: Simulated complex noise removal results, with PSNR (dB), at one band of an image in case 5 (mixture noise) on Pavia University dataset. (Best view on screen with zoom)

TABLE III: Quantitative results of different methods under several noise levels on ICVL dataset. "Blind" suggests each sample is corrupted by Gaussian noise with an unknown noise level (ranging from 30 to 70).

TABLE IV: Quantitative results of different methods in five complex noise cases on ICVL dataset.

Fig. 10: Real-world unknown noise removal results at one band of an image on AVIRIS Indian Pines dataset. (Best view on screen with zoom)

Fig. 11: Real-world unknown noise removal results at one band of an image on HYDICE Urban dataset. (Best view on screen with zoom)

TABLE V: Quantitative results of different methods in mixture noise case on Pavia University dataset. "Ours-S" is our trained-from-scratch model which is only trained on Pavia Centre dataset; "Ours-P" denotes our pretrained model which is only trained on ICVL dataset; "Ours-F" indicates our fine-tuned model which is pretrained on ICVL dataset, and then is fine-tuned on Pavia Centre dataset.

Nevertheless, our method utilizes QRU3D, which allows it to be naturally applied to input data with various numbers of bands. On the basis of this flexibility, we directly apply our model pretrained on ICVL dataset (in the complex noise case) to Pavia University. Although Pavia University is recorded with a spectral curve totally distinct from that of ICVL dataset, our model, called Ours-P, performs much better than all compared methods⁴, which strongly verifies the robustness of our method.

TABLE VI: Ablations on ICVL HSI Gaussian denoising (under a fixed noise level). We evaluate the results by PSNR (dB), running time (sec) and the number of parameters (Params) of these networks. All running times are measured on an NVIDIA GTX 1080Ti by processing an HSI with size of 512×512×31. The direction of the network is denoted by initials, i.e. U: unidirectional; B: bidirectional; A: alternating directional. Our benchmark network is indicated by boldface. The results of MemNet are also provided as an additional reference.

Furthermore, we use a small number of samples from Pavia Centre to fine-tune the model learned only from the ICVL dataset. The resulting model (Ours-F in Table V) significantly boosts the performance. The visual comparison is provided in Figure 9. Interestingly, Gaussian-like residuals are still visible in the Ours-S result, while the Ours-P result suffers from stripes. The Ours-F model combines the strengths of the two, yielding a clear and clean result. This seems to indicate that the knowledge from the ICVL dataset is complementary to that from the Pavia Centre dataset, so the transfer learning enabled by this flexibility brings great benefits in performance.

Real-world Noisy Data. We also evaluate our model on the real-world noisy HSIs Indian Pines and Urban, for which no ground truth is available. It can be observed in Figure 10 and Figure 11 that severe atmospheric and water absorption obscures the real scene, severely degrading the quality of the images. The Gaussian denoising methods, e.g. BM4D and TDL, cannot accurately estimate the underlying clean image due to the non-Gaussian noise structure. Our QRNN3D successfully tackles this unknown noise and produces sharper and clearer results than the others, consistently demonstrating the robustness and flexibility of our model.

V. DISCUSSION AND ANALYSIS

In this section, we provide a broad discussion and analysis of QRNN3D to clarify where its performance comes from. We first demonstrate the efficacy of our incremental training policy, then analyze the functionality of each network component in QRNN3D (i.e. 3D convolution, quasi-recurrent pooling, and alternating directional structure). The selection of network hyper-parameters follows. Finally, we present the method used to visualize the GCS knowledge in QRU3D, together with the results.

Fig. 12: Average training loss (Left) and Validation PSNR (Right) of QRNN3D for complex noise removal. We show the results of the model trained from scratch, and the one that reuses the pretrained parameters in Gaussian denoising (incremental training).

A. Efficacy of Incremental Training Policy

The key idea of our training policy lies in the fact that knowledge can be efficiently learned in an easy-to-difficult way [1]. Our training policy enables reusing previously learned knowledge (pretrained parameters), which significantly stabilizes and accelerates the whole training process. As an example, we show the optimization curves with and without reusing the pretrained parameters when training the model in the complex noise case. As shown in Figure 12, training from scratch makes the optimization slow and unstable, and it converges to a poor local minimum, in contrast to training from a good initialization under our incremental training policy.

B. Component Analysis in QRNN3D

To thoroughly verify the functionality of each component in our QRNN3D, comprehensive ablation experiments are conducted on the HSI Gaussian denoising task on the ICVL dataset. We focus on the components associated with HSI modeling and domain knowledge embedding, and study the best trade-off between performance and computational burden. The evaluation measures include PSNR, running time and the total number of network parameters.

We choose our encoder-decoder QRNN3D as the benchmark. For a fair comparison, the same network architecture is used except for the modification to the investigated component. Ablation results are reported in Table VI and analyzed in the following.
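As a reminder of how a PSNR figure of the kind reported in these tables is typically computed, the following minimal sketch evaluates PSNR for a clean/denoised pair. It assumes intensities scaled to [0, 1] (peak value 1.0) and a single PSNR over all values; whether the reported numbers are averaged per band is not stated in this section, so the function name `psnr` and these conventions are illustrative assumptions.

```python
import math

def psnr(clean, denoised, peak=1.0):
    """PSNR in dB between two equally sized flat sequences of intensities.

    Assumption: values are scaled so that `peak` is the maximum intensity.
    """
    if len(clean) != len(denoised) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all values.
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)

clean = [0.2, 0.4, 0.6, 0.8]
denoised = [0.3, 0.3, 0.7, 0.7]  # every value is off by 0.1, so MSE is 0.01
print(round(psnr(clean, denoised), 2))
```

With an error of 0.1 on every value, the MSE is 0.01 and the PSNR is about 20 dB, which gives a sense of scale for the roughly 40 dB figures discussed in the ablations.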

Subcomponents Investigation. Table VI investigates the effect of the subcomponents (i.e. 3D convolution and the quasi-recurrent pooling function) of QRU3D, the basic building block of our QRNN3D. In the experiments, four variants of this basic block are tested, i.e. QRU2D, WQRU2D, C3D and WC3D.

QRU2D is instantiated by replacing the 3D convolution with a 2D convolution (implemented by simply setting the kernel size to ). A drastic performance loss (i.e. -1.6 dB) can be observed in Table VI, meaning that ignoring the structural spectral correlation severely impairs the model capacity.

WQRU2D is a wider QRU2D model whose number of parameters is comparable to that of QRU3D. Nevertheless, QRU3D still outperforms WQRU2D, even at a lower computation cost, which suggests that 3D convolution is more efficient than the 2D approach for HSI modeling.

Fig. 13: (a) The captured GCS in a bidirectional QRU3D layer. (b) The number of relative bands for the output of each band. Band i is defined as a "relative band" for band j if discarding it produces at least 10% perturbation (i.e. , where 1 has same size as with all entries equal to 1) to the output. Forward/Backward denotes the direction of dependency (i.e. i < j for the forward direction). (c) The empirical distribution of the number of relative bands.
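To make the parameter matching behind the WQRU2D comparison concrete, the sketch below counts the weights of a single convolution layer. The kernel sizes (3×3×3 for the 3D case, 1×3×3 for the 2D case) and the channel width of 16 are illustrative assumptions, not values taken from Table VI; the point is only that a 2D layer has to be roughly √3 times wider to reach the parameter count of a 3D layer with the same spatial kernel.

```python
import math

def conv_params(c_in, c_out, kernel, bias=True):
    """Number of learnable parameters of one convolution layer."""
    k = 1
    for size in kernel:
        k *= size
    return c_in * c_out * k + (c_out if bias else 0)

width = 16  # hypothetical channel width
p3d = conv_params(width, width, (3, 3, 3))  # assumed 3D kernel
p2d = conv_params(width, width, (1, 3, 3))  # assumed 2D kernel (spectral extent 1)

# Width at which a 2D layer (c_in = c_out) roughly matches the 3D layer,
# ignoring the bias term: wide^2 * 9 = width^2 * 27.
wide = round(width * math.sqrt(3))
p2d_wide = conv_params(wide, wide, (1, 3, 3))

print(p3d, p2d, wide, p2d_wide)
```

Under these assumptions the widened 2D layer has about as many parameters as the 3D layer, but it produces more feature maps per band and still sees no neighbouring bands through its kernel.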

C3D is constructed by removing the quasi-recurrent pooling (and the associated neural gates), leaving what is in effect a residual encoder-decoder 3D convolutional neural network. We find that lacking a mechanism to model the GCS degrades the performance by a large margin (-3.4 dB).

WC3D is a wider C3D model with more parameters (four times as many as the C3D model). The PSNR of QRU3D is 40.23 dB, higher than WC3D's 40.00 dB. This suggests that the improvement from quasi-recurrent pooling is not simply due to the width it adds to the C3D model. Besides, QRU3D is narrower than the WC3D model and requires only a fraction of its parameters and running time. This comparison shows that the improvement from quasi-recurrent pooling is complementary to going wider in standard ways.

Direction of Network. Table VI also shows the results of different directional structures, denoted by initials (e.g. U for unidirectional, etc.). Because it ignores the backward spectral dependency, the unidirectional architecture performs worst. With the causal dependency eliminated, both the alternating directional and the bidirectional architectures significantly exceed the unidirectional one and achieve similar performance (40.26 vs. 40.23). Nevertheless, the bidirectional version requires a much larger memory footprint than our alternating directional structure, indicating that the alternating directional structure can be used as a lightweight alternative to the typical bidirectional one.
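The effect of the directional structure on spectral dependency can be illustrated with a toy experiment on a single pixel. The sketch below is not the QRU3D implementation: it assumes pointwise gates (no convolution), the standard quasi-recurrent pooling h_i = f_i·h_prev + (1 − f_i)·z_i, and hypothetical helper names (`qru_layer`, `network`, `depends_on`). It perturbs each input band in turn and records which perturbations change the output of the middle band.

```python
import math

def qru_layer(x, reverse=False, wz=1.0, wf=0.5):
    """One toy quasi-recurrent layer over the spectral axis of one pixel.

    Assumptions: candidate z and forget gate f are pointwise functions of
    the input (no convolution), pooled as h_i = f_i*h_prev + (1-f_i)*z_i.
    """
    order = range(len(x) - 1, -1, -1) if reverse else range(len(x))
    h, out = 0.0, [0.0] * len(x)
    for i in order:
        z = math.tanh(wz * x[i])
        f = 1.0 / (1.0 + math.exp(-wf * x[i]))
        h = f * h + (1.0 - f) * z
        out[i] = h
    return out

def network(x, directions):
    """Stack layers; `directions` is a string of 'F' (forward) / 'B' (backward)."""
    for d in directions:
        x = qru_layer(x, reverse=(d == "B"))
    return x

def depends_on(directions, n_bands=7):
    """Set of input bands whose perturbation changes the middle output band."""
    base = [0.1 * (i + 1) for i in range(n_bands)]
    mid = n_bands // 2
    ref = network(base, directions)[mid]
    hit = set()
    for i in range(n_bands):
        pert = list(base)
        pert[i] += 1.0
        if abs(network(pert, directions)[mid] - ref) > 1e-12:
            hit.add(i)
    return hit

print("unidirectional:", sorted(depends_on("FF")))
print("alternating   :", sorted(depends_on("FB")))
```

Under these assumptions, two forward layers leave the middle band dependent only on the bands up to and including itself, whereas a forward layer followed by a backward layer makes it depend on the whole spectrum with the same number of layers and the same per-layer cost. Because the loop runs over however many bands are supplied, the same code also applies to inputs with a different number of bands.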

C. Network Hyperparameter Selection

Our principle for network hyper-parameter selection is to keep the network compact yet effective. Table VII shows the results of hyper-parameter selection on the Gaussian denoising task through a small grid search, where we select the depth and width of our QRNN3D for the best trade-off between performance and computational overhead.

Nonetheless, we note that the major goal of this work is to introduce a novel building block specially tailored to model HSIs. Such a building block can be naturally inserted into any network topology and is not restricted to the encoder-decoder network used in this paper. We mainly show the effectiveness of the proposed building block and do not pursue higher performance via an exhaustive search over other configurations. We have demonstrated state-of-the-art performance of our QRNN3D without heavy engineering effort on network hyper-parameter selection. Our current hyper-parameter setting might not be optimal, and the performance could potentially be boosted by further tuning, though this is not a major focus of this paper.

TABLE VII: Network hyper-parameter selection on ICVL HSI Gaussian denoising (under noise level ) through a small grid search. We evaluate the results by PSNR (dB), running time (sec) and the number of parameters (Params) of these networks. The selected parameters are indicated by boldface.

D. Visualizing GCS Knowledge

To visualize the captured GCS knowledge in QRNN3D, we first unfold Equation (3) and obtain

where / denotes element-wise division. It also indicates the effect of band i on band j. The captured GCS in each QRU3D layer can be calculated through a single inference pass by using Equation (5). To completely visualize the GCS⁵, we choose the first bidirectional QRU3D for this analysis⁶. Figure 13(a) exhibits the captured GCS of a randomly selected HSI, showing that the output of each band is highly affected by the whole spectrum. Figure 13(b) illustrates the number of relative bands for the output of each band. It can be seen that the 15th to 17th bands () are deeply correlated to almost all bands (Z). Figure 13(c) summarizes these statistics over all testing images on ICVL. It shows that a randomly selected band is typically related to at least 15 bands (31 in total), meaning that the GCS is effectively utilized by our model and that our method can automatically determine the most related bands across the global spectrum.
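The unfolding step can be sketched numerically. The code below assumes that Equation (3) is the standard quasi-recurrent pooling recurrence h_i = f_i ⊙ h_{i-1} + (1 − f_i) ⊙ z_i with h_0 = 0, as used in quasi-recurrent networks [5]; unrolling it gives h_j = Σ_{i≤j} w_i^j ⊙ z_i with w_i^j = (1 − f_i) ⊙ Π_{k=i+1..j} f_k. The function names, the single-pixel scalar setting, and the perturbation measure (the share of h_j lost when the term of band i is discarded) are assumptions for illustration and may differ from the paper's Equation (5).

```python
def f_pool(z, f):
    """Recurrent form: h_i = f_i * h_{i-1} + (1 - f_i) * z_i, with h_0 = 0."""
    h, out = 0.0, []
    for zi, fi in zip(z, f):
        h = fi * h + (1.0 - fi) * zi
        out.append(h)
    return out

def unfolded_weights(f):
    """w[j][i] = (1 - f_i) * prod_{k=i+1..j} f_k for i <= j, else 0."""
    n = len(f)
    w = [[0.0] * n for _ in range(n)]
    for j in range(n):
        decay = 1.0  # running product of f_k for k in (i, j]
        for i in range(j, -1, -1):
            w[j][i] = (1.0 - f[i]) * decay
            decay *= f[i]
    return w

def relative_band_counts(z, f, threshold=0.1):
    """For each output band j, count bands i with |w[j][i]*z_i| >= threshold*|h_j|."""
    h, w = f_pool(z, f), unfolded_weights(f)
    counts = []
    for j, hj in enumerate(h):
        terms = [w[j][i] * z[i] for i in range(j + 1)]
        counts.append(sum(1 for t in terms if abs(t) >= threshold * abs(hj)))
    return counts

z = [0.9, 0.5, 0.7, 0.2, 0.8]  # toy candidate values for one pixel
f = [0.6, 0.8, 0.5, 0.9, 0.7]  # toy forget gates in (0, 1)
h = f_pool(z, f)
w = unfolded_weights(f)
h_unfolded = [sum(w[j][i] * z[i] for i in range(j + 1)) for j in range(len(z))]
print(all(abs(a - b) < 1e-12 for a, b in zip(h, h_unfolded)))
print(relative_band_counts(z, f))
```

Because the weights depend only on the gates produced during a forward pass, this kind of analysis needs no extra training or backpropagation, which is consistent with the statement that the GCS can be obtained in a single inference pass.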

VI. CONCLUSIONS

In this paper, we have proposed an alternating directional 3D quasi-recurrent neural network for hyperspectral image denoising. Our main contribution is the novel use of a 3D convolution subcomponent, a quasi-recurrent pooling function, and an alternating directional scheme for efficient spatio-spectral dependency modeling. We have applied our model to HSI denoising beyond the Gaussian case, especially the very challenging real-world complex noise case, and achieved better performance and faster speed. We also show that our model pretrained on the ICVL dataset can be directly used to tackle remotely sensed images, which is infeasible for most existing DL approaches to HSI modeling.

In addition, the visualized results for the global correlation along spectrum (GCS) in our 3D quasi-recurrent unit (QRU3D) provide further experimental evidence that the GCS is effectively exploited by our model. It is also worth investigating the proposed QRU3D in other image sequence modeling tasks in the future.

REFERENCES

[1] M. Ahissar and S. Hochstein. The reverse hierarchy theory of visual perceptual learning. Trends in Cognitive Sciences, 8(10):457–464, 2004.

[2] N. Akhtar and A. Mian. Nonparametric coupled bayesian dictionary and classifier learning for hyperspectral classification. IEEE Transactions on Neural Networks and Learning Systems, 29(9):4038–4050, 2018.

[3] B. Arad and O. Ben-Shahar. Sparse recovery of hyperspectral signal from natural rgb images. In European Conference on Computer Vision, pages 19–34. Springer, 2016.

[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR), 2015.

[5] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. International Conference on Learning Representations (ICLR), 2017.

[6] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson. Advances in hyperspectral image classification: Earth monitoring with statistical learning methods. IEEE Signal Processing Magazine, 31(1):45–54, 2014.

[7] Y. Chang, L. Yan, H. Fang, S. Zhong, and W. Liao. Hsi-denet: Hyperspectral image restoration via convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing, pages 1–16, 2018.

[8] Y. Chang, L. Yan, H. Fang, S. Zhong, and Z. Zhang. Weighted low-rank tensor recovery for hyperspectral image restoration. arXiv preprint arXiv:1709.00192, 2017.

[9] Y. Chang, L. Yan, and S. Zhong. Hyper-laplacian regularized unidirectional low-rank tensor recovery for multispectral image denoising. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4260–4268, 2017.

[10] C. Chen, Z. Xiong, X. Tian, and F. Wu. Deep boosting for image denoising. In The European Conference on Computer Vision (ECCV), September 2018.

[11] Y. Chen, X. Cao, Q. Zhao, D. Meng, and Z. Xu. Denoising hyperspectral image with non-iid noise structure. arXiv preprint arXiv:1702.00098, 2017.

[12] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

[13] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.

[14] W. Dong, G. Li, G. Shi, X. Li, and Y. Ma. Low-rank tensor approximation with laplacian scale mixture modeling for multiframe image denoising. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 442–449, 2015.

[15] W. Dong, H. Wang, F. Wu, G. ming Shi, and X. Li. Deep spatial-spectral representation learning for hyperspectral image denoising. IEEE Transactions on Computational Imaging, pages 1–1, 2019.

[16] Y. Fu, A. Lam, I. Sato, and Y. Sato. Adaptive spatial-spectral dictionary learning for hyperspectral image restoration. International Journal of Computer Vision (IJCV), 122(2):228–245, 2017.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[19] W. He, Q. Yao, C. Li, N. Yokoya, and Q. Zhao. Non-local meets global: An integrated paradigm for hyperspectral denoising. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[20] W. He, H. Zhang, L. Zhang, and H. Shen. Total-variation-regularized low-rank matrix factorization for hyperspectral image restoration. IEEE Transactions on Geoscience and Remote Sensing, 54(1):178–188, 2016.

[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[22] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems (NIPS), pages 235–243, 2015.

[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.

[24] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(1):221–231, 2013.

[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[27] T. Lillesand, R. W. Kiefer, and J. Chipman. Remote sensing and image interpretation. John Wiley & Sons, 2014.

[28] M. Maggioni, V. Katkovnik, K. Egiazarian, and A. Foi. Nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Transactions on Image Processing, 22(1):119–133, 2013.

[29] X. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems (NIPS), pages 2802–2810, 2016.

[30] Y. Peng, D. Meng, Z. Xu, C. Gao, Y. Yang, and B. Zhang. Decomposable nonlocal tensor dictionary learning for multispectral image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2949–2956, 2014.

[31] Z. Ping and R. Wang. Jointly learning the hybrid crf and mlr model for simultaneous denoising and classification of hyperspectral imagery. IEEE Transactions on Neural Networks and Learning Systems, 25(7):1319–1334, 2014.

[32] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[33] Y. Tai, J. Yang, X. Liu, and C. Xu. Memnet: A persistent memory network for image restoration. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[34] P. S. Thenkabail and J. G. Lyon. Hyperspectral remote sensing of vegetation. CRC Press, 2016.

[35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.

[36] M. Uzair, A. Mahmood, and A. Mian. Hyperspectral face recognition with spatiospectral information fusion and pls regression. IEEE Transactions on Image Processing, 24(3):1127–1137, 2015.

[37] H. Van Nguyen, A. Banerjee, and R. Chellappa. Tracking via object reflectance using a hyperspectral video camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 44–51, 2010.

[38] Q. Wang, J. Lin, and Y. Yuan. Salient band selection for hyperspectral image classification via manifold ranking. IEEE Transactions on Neural Networks and Learning Systems, 27(6):1279–1289, 2017.

[39] Y. Wang, J. Peng, Q. Zhao, Y. Leung, X.-L. Zhao, and D. Meng. Hyperspectral image restoration via total variation regularized low-rank tensor decomposition. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2017.

[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[41] K. Wei and Y. Fu. Low-rank bayesian tensor factorization for hyperspectral image denoising. Neurocomputing, 331:412–423, 2019.

[42] Q. Xie, Q. Zhao, D. Meng, Z. Xu, S. Gu, W. Zuo, and L. Zhang. Multispectral images denoising by intrinsic tensor sparsity regularization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1692–1700, 2016.

[43] Y. Xie, Y. Qu, D. Tao, W. Wu, Q. Yuan, and W. Zhang. Hyperspectral image restoration via iteratively regularized weighted schatten p-norm minimization. IEEE Transactions on Geoscience and Remote Sensing, 54(8):4642–4659, 2016.

[44] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NIPS), pages 802–810, 2015.

[45] S. Yang, Z. Feng, M. Wang, and K. Zhang. Self-paced learning-based probability subspace projection for hyperspectral image classification. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–6, 2018.

[46] Q. Yuan, Q. Zhang, J. Li, H. Shen, and L. Zhang. Hyperspectral image denoising employing a spatial-spectral deep residual convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing, 57(2):1205–1218, 2019.

[47] R. H. Yuhas, J. W. Boardman, and A. F. Goetz. Determination of semi-arid landscape endmembers and seasonal trends using convex geometry spectral unmixing techniques. In Summaries of the 4-th Annual JPL Airborne Geoscience Workshop, 1993.

[48] H. Zhang, W. He, L. Zhang, H. Shen, and Q. Yuan. Hyperspectral image restoration using low-rank matrix recovery. IEEE Transactions on Geoscience and Remote Sensing, 52(8):4729–4743, 2014.

[49] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 2017.

[50] L. Zhang, W. Wei, Y. Zhang, C. Shen, A. van den Hengel, and Q. Shi. Cluster sparsity field for hyperspectral imagery denoising. In European Conference on Computer Vision (ECCV), pages 631–647. Springer, 2016.

[51] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention networks for image restoration. In International Conference on Learning Representations, 2019.

[52] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image restoration. arXiv preprint arXiv:1812.10477, 2018.