Spatial-Spectral Feature Extraction via Deep ConvLSTM Neural Networks for Hyperspectral Image Classification

2019·Arxiv

Abstract

In recent years, deep learning has brought great advances in hyperspectral image (HSI) classification. In particular, Long Short-Term Memory (LSTM), as a special deep learning structure, has shown great ability in modeling long-term dependencies in the time dimension of video or the spectral dimension of HSIs. However, the loss of spatial information makes it difficult for LSTM to achieve better performance. To address this problem, two novel deep models are proposed to extract more discriminative spatial-spectral features by exploiting the Convolutional LSTM (ConvLSTM). By taking the data patch in a local sliding window as the input of each memory cell band by band, the 2-D extended architecture of LSTM is used to build the spatial-spectral ConvLSTM 2-D Neural Network (SSCL2DNN), which models long-range dependencies in the spectral domain. To better preserve the intrinsic structure information of the hyperspectral data, the spatial-spectral ConvLSTM 3-D Neural Network (SSCL3DNN) is proposed by extending LSTM to a 3-D version, further improving the classification performance. Experiments conducted on three commonly used HSI data sets demonstrate that the proposed deep models have competitive advantages and can provide better classification performance than other state-of-the-art approaches.

Index Terms—Hyperspectral image, Convolutional Long Short-Term Memory, deep learning, feature extraction, classification.

I. INTRODUCTION

THE hyperspectral remote sensing image is a 3-D datacube, which integrates the spectral information with the 2-D spatial information of land covers. With the development of remote sensing technology, together with the continuous improvement of hyperspectral sensors, hyperspectral images (HSIs) can provide more opportunities to manage and analyze the information from the Earth's surface [1]-[3]. Correspondingly, HSIs have been widely used in many fields, such as environmental sciences [2], precision agriculture [4], [5], ecological science [6], and geological exploration [7].

Manuscript received July 23, 2019; revised October 20, 2019; accepted December 10, 2019. Date of publication January 15, 2020; date of current version May 21, 2020. This work was supported by the National Natural Science Foundation of China under Grant 61871335. (Corresponding author: Heng-Chao Li.) Wen-Shuai Hu, Heng-Chao Li, and Lei Pan are with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University, Chengdu 610031 China (e-mail: hcli@home.swjtu.edu.cn). Wei Li and Ran Tao are with the School of Information and Electronics, Beijing Institute of Technology, Beijing 100081 China. Qian Du is with the Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS 39762 USA.

As a basic and important research topic, HSI classification has attracted much attention. The support vector machine (SVM) is the most widely used classifier and has shown great success in HSI classification [8]. In particular, a composite kernel SVM (SVM-CK) was proposed to exploit the spatial and spectral information simultaneously [9]. Inspired by its successful application to face recognition [10], sparse representation (SR) [11] has also been applied to HSI classification, and was further improved to exploit the spatial information by Chen et al. in [12], resulting in a joint SR classifier (JSRC). Subsequently, more and more classifiers based on the joint sparse model were developed, such as non-local weighted JSRC (NLW-JSRC) [13], nearest regularized JSRC (NRJSRC) [14], and correntropy-based robust JSRC (RJSRC) [15].

In recent years, deep learning has brought great advances in feature extraction and classification in the field of computer vision, such as object detection [16], object tracking [17], and behavior recognition in crowd scenes [18], and has also shown its effectiveness in the HSI classification task [19]. An increasing number of deep learning-based feature extraction and classification methods have been designed for HSIs. In [20], a classification method based on the Stacked AutoEncoder (SAE) was exploited for HSI classification for the first time by extracting spatial-spectral features. Since Hu et al. [21] and Chen et al. [22] introduced the Convolutional Neural Network (CNN) into HSI classification, many new classification models based on CNN have emerged and provided satisfying performance in HSI classification. Li et al. [23] integrated the pixel-pair model into CNN to obtain more discriminative features. Zhao and Du [24] utilized CNN and the balanced local discriminant embedding to fuse the spatial and spectral information. For the joint learning of spatial-spectral features, many classification models [25]-[28] were built by taking the 3-D data cube as the input and deepening the network to obtain more discriminative features. In order to compute the distance between features of the same class and that between features of different classes more effectively and quickly, Fang et al. [29] integrated hashing learning into CNN to realize the transformation from high-dimensional features to low-dimensional features. In addition, to overcome the limited availability of labeled samples, two self-taught learning frameworks based on the multiscale Independent Component Analysis (MICA) and the Stacked Convolutional Autoencoder (SCAE) [30], a semisupervised CNN [31], and a supervised deep feature extraction method based on Siamese CNN (S-CNN) [32] were proposed. In addition to the single-channel deep network models, there are also some novel works [27], [33] based on multi-channel CNN to improve feature extraction and classification. In particular, Li et al. [34] made a systematic review of HSI classification methods, and not only comprehensively categorized the existing models based on deep learning, but also reviewed the strategies that are used to improve the performance when the labeled samples are limited, which is meaningful for future research.

CNN has provided an extremely effective and basic structure for various tasks and has become the core architecture of various models, such as VGG16 [16], ResNet [25], CapsNets [35], and DenseNet [36], in which the convolutional layer is the core backbone for feature extraction. To address the gradient vanishing problem caused by deeper and deeper structures [37], effective feature extraction and classification models [25], [35], [36] were proposed by improving the CNN filters to raise the classification performance for HSIs.

In order to analyze sequential data, the Recurrent Neural Network (RNN) [38], as an effective deep learning model, has attracted wide attention for modeling long-range dependencies, and is also expected to extract features for HSI classification. By using a special activation function and an improved Gated Recurrent Unit (GRU), Mou et al. [39] built an RNN model for pixel-level spectral classification. Considering that the spatial information has a positive impact on HSI classification performance, Liu et al. [40] used multiple RNNs to model long-term dependencies between central and neighborhood pixels. Zhang et al. [41] built a local spatial sequential RNN (LSS-RNN) model, in which low-level features, such as texture and different morphological profile features, were used to construct LSS features, which were then input to the RNN to extract high-level features. Based on GRU, Hang et al. [42] designed an end-to-end RNN model for extracting the useful information from adjacent and non-adjacent spectral bands, and integrated convolutional layers into the RNN to improve the classification accuracy. In addition, Mou et al. [43] used RNN to model the temporal dependency of the outputs of a convolutional sub-network for change detection in multispectral imagery.

Long Short-Term Memory (LSTM), as a special RNN structure, has demonstrated its stability and power in modeling long-term dependencies in various studies [44], [45]. The initially proposed structure of LSTM utilizes special "memory cells" instead of logistic or tanh hidden units [44], and there are three significant gate mechanisms in this kind of structure: input gate, output gate, and forget gate, which are used for implementing information protection, transmission, and control. Specifically, the input gate controls when the input is allowed to be added to the memory cell, the output gate decides when the input data has an influence on the output of the memory cell, and the forget gate is mainly used for modeling long-range dependencies. It is the design of this special memory cell with fixed weight and self-connected circular edges that ensures the gradient can pass across many time steps without gradient vanishing or explosion problems [46]. However, there is an inherent drawback in LSTM: the spatial structure information is lost when unfolding the input to the 1-D form. To better model the spatiotemporal relationships, Shi et al. [47] extended the data processing in LSTM to the convolution operation and proposed Convolutional LSTM (ConvLSTM), which uses convolution structures in both the input-to-state and state-to-state transitions. Owing to these characteristics, LSTM has received more and more attention, and the most common way is to use LSTM in combination with CNN. Wu et al. [48] built a hybrid model for HSI classification, in which the convolutional layers followed by the recurrent layers were used to extract spectrally-contextual features. By utilizing convolutional recurrent layers, Rußwurm et al. [49] built an encoder framework for land cover classification. Song et al. [50] proposed a recurrent 3-D fully convolutional network for change detection in HSIs, in which the ConvLSTM layer is used to model long-term dependencies on the outputs of the fully convolutional network in the temporal domain. Moreover, Seydgar et al. [51] proposed a two-stage model, in which ConvLSTM, cascaded with a 3-D CNN, was utilized to extract spatial-spectral features.

In addition to the above combinations with CNN, there are some works that build deep models using LSTM alone. Zhou et al. [52] designed a spatial-spectral LSTMs (SSLSTMs) model based on two independent LSTMs, namely a spatial LSTM (SaLSTM) and a spectral LSTM (SeLSTM). However, this approach not only fuses the spatial and spectral information insufficiently, but also loses the spatial structure. Rußwurm et al. [53] employed multi-level cascaded LSTMs for land cover classification. Inspired by spatial similarity measurements [54], Ma et al. [55] built an LSTM-based model, in which the spatial-spectral features are extracted by these measurement strategies. Moreover, a Bidirectional-ConvLSTM (Bi-CLSTM) model [56] was proposed by cascading all outputs in each layer for extracting spatial-spectral features. However, this simple cascade fails to fully utilize the correlation between different spectral bands.

Inspired by the above works, the main purpose of this paper is to construct two novel deep ConvLSTM neural networks for HSI feature extraction and classification. The main contributions of this paper are listed as follows:

(1) In order to address the problem of underutilization of the correlations between different spectral bands in Bi-CLSTM and the issues of insufficient fusion of the spatial and spectral information and the loss of the spatial structure information in SSLSTMs, an effective feature extraction model, i.e., spatial-spectral ConvLSTM 2-D Neural Network (SSCL2DNN), is proposed by modeling long-term dependencies in the spectral field for joint learning of spatial-spectral features, in which the local window patch is decomposed into a spectral sequence and then input to each memory cell band by band.

(2) To better preserve the intrinsic structure of hyperspectral data, the 3-D structure (namely ConvLSTM3D) is further developed from the basic ConvLSTM cell, with which a novel deep model, i.e., spatial-spectral ConvLSTM 3-D Neural Network (SSCL3DNN), is constructed. Different from the band-by-band processing, the local patch directly takes the form of a 3-D cube as the input of SSCL3DNN, which enables it to further improve on the classification performance of SSCL2DNN.

The remainder of this paper is organized as follows. Section II presents CNN, LSTM, and ConvLSTM, and discusses their applications to HSI feature extraction and classification. Sections III and IV describe in detail the two proposed deep models for extracting spatial-spectral features for HSI classification. Comprehensive quantitative analysis and evaluation of the proposed models are carried out in Section V. Finally, Section VI summarizes this paper.

II. RELATED WORK

A. Convolutional Neural Network

The fundamental CNN mainly consists of the following parts: convolutional layer, pooling layer, fully connected layer, and classification layer. Based on different convolution operations, various networks can be constructed to meet a variety of practical requirements. The calculation formula of each convolutional layer in CNN can be expressed as

$$\mathbf{X}^{r} = f\left(\mathbf{W}^{r} * \mathbf{X}^{r-1} + \mathbf{b}^{r}\right) \tag{1}$$

where $\mathbf{X}^{r}$ is the output with $M$ feature maps of the $r$th convolutional layer, $\mathbf{W}^{r}$ denotes the convolution filter and $\mathbf{b}^{r}$ the bias of the $r$th convolutional layer, $\mathbf{X}^{r-1}$ is the output of the $(r-1)$th convolutional layer, and $f(\cdot)$ is the nonlinear activation function. $\mathbf{W}^{r}$ has a size of $k \times k$ for 2-D CNN. With respect to 3-D CNN, $\mathbf{W}^{r}$ is a convolution filter in the $r$th convolutional layer with the size of $k \times k \times d$, in which $k$ and $d$ denote the size and depth of the convolution filter, respectively.
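As a rough illustration of the layer computation in (1), the following NumPy sketch applies one 2-D convolutional layer to a stack of feature maps. The function name `conv2d_layer`, the choice of ReLU as the activation, and the "valid" (no padding) border handling are assumptions made for this example only; the source does not specify them.

```python
import numpy as np

def conv2d_layer(x_prev, weights, bias):
    """One convolutional layer: activation(W * X_prev + b).

    x_prev:  (C_in, H, W) output of the previous layer
    weights: (M, C_in, k, k) filters producing M feature maps
    bias:    (M,) one bias per feature map
    Uses 'valid' borders and ReLU (both are assumptions of this sketch).
    """
    m, c_in, k, _ = weights.shape
    _, h, w = x_prev.shape
    out = np.zeros((m, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = x_prev[:, i:i + k, j:j + k]  # (C_in, k, k) window
            # Contract every filter with the window, giving M responses.
            out[:, i, j] = np.tensordot(
                weights, patch, axes=([1, 2, 3], [0, 1, 2])
            ) + bias
    return np.maximum(out, 0.0)  # ReLU nonlinearity

# Usage: one 5 x 5 input map, two 3 x 3 filters.
x = np.ones((1, 5, 5))
w = np.ones((2, 1, 3, 3))
b = np.zeros(2)
feature_maps = conv2d_layer(x, w, b)  # shape (2, 3, 3)
```

The explicit loops are kept for readability; a practical implementation would rely on a deep learning framework instead.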

Since Hu et al. [21] and Chen et al. [22] proposed HSI feature extraction and classification models based on CNN, more and more improved CNN-based models have been introduced. CNN has become an extremely effective and basic structure for various tasks. It is noteworthy that a sliding window is still used to extract spatial features in CNN, which is a traditional way of exploiting spatial information. Moreover, the data transmission in CNN only exists between adjacent layers, lacking information interaction inside each layer, which may make it difficult to extract more effective features.

B. Long Short-Term Memory

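LSTM was proposed to deal with the issues that RNN is not suitable for learning long-term dependencies and is prone to gradient vanishing and exploding problems [46]. The data transmission and processing in LSTM are realized by three key gate units: input gate, output gate, and forget gate, which are used for implementing information protection and control [44]. The calculation formulas of these three gate structures in LSTM are written as

$$\begin{aligned}
\mathbf{i}_t &= \sigma\left(\mathbf{W}_{xi}\mathbf{x}_t + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{b}_i\right) \\
\mathbf{f}_t &= \sigma\left(\mathbf{W}_{xf}\mathbf{x}_t + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{b}_f\right) \\
\tilde{\mathbf{c}}_t &= \tanh\left(\mathbf{W}_{xc}\mathbf{x}_t + \mathbf{W}_{hc}\mathbf{h}_{t-1} + \mathbf{b}_c\right) \\
\mathbf{c}_t &= \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \tilde{\mathbf{c}}_t \\
\mathbf{o}_t &= \sigma\left(\mathbf{W}_{xo}\mathbf{x}_t + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{b}_o\right) \\
\mathbf{h}_t &= \mathbf{o}_t \circ \tanh\left(\mathbf{c}_t\right)
\end{aligned} \tag{2}$$

where $\mathbf{x}_t$, $\mathbf{h}_{t-1}$, and $\mathbf{c}_{t-1}$ represent the input of the current cell, and the output and state of the last cell in LSTM, respectively. $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ denote the input gate, forget gate, and output gate of LSTM. $\mathbf{W}_{\cdot i}$ and $\mathbf{b}_i$ are the weight and bias of the input gate, $\mathbf{W}_{\cdot f}$ and $\mathbf{b}_f$ are the weight and bias of the forget gate, and $\mathbf{W}_{\cdot o}$ and $\mathbf{b}_o$ are the weight and bias of the output gate, where the subscript "$\cdot$" indicates $x$ or $h$, and $\circ$ denotes the Hadamard product. $\sigma$ is the nonlinear activation function.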
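To make the gate computations of the LSTM cell concrete, the sketch below runs a single time step in NumPy. It assumes the common formulation without peephole connections, a logistic sigmoid for the gates, and tanh for the candidate state; the parameter layout (a dictionary keyed by `"i"`, `"f"`, `"o"`, `"c"`) and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step.

    params maps each of "i", "f", "o", "c" to a tuple (W_x, W_h, b):
    input weights, recurrent weights, and bias (hypothetical layout).
    """
    pre = {}
    for name, (w_x, w_h, b) in params.items():
        pre[name] = w_x @ x_t + w_h @ h_prev + b
    i_t = sigmoid(pre["i"])           # input gate
    f_t = sigmoid(pre["f"])           # forget gate
    o_t = sigmoid(pre["o"])           # output gate
    c_tilde = np.tanh(pre["c"])       # candidate state
    c_t = f_t * c_prev + i_t * c_tilde  # Hadamard products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def zero_params(n_in, n_hidden):
    """All-zero parameters, handy for checking shapes and gate values."""
    return {
        g: (np.zeros((n_hidden, n_in)),
            np.zeros((n_hidden, n_hidden)),
            np.zeros(n_hidden))
        for g in ("i", "f", "o", "c")
    }

# Usage: with zero weights every gate equals 0.5, so the state is halved.
h, c = lstm_step(np.ones(4), np.zeros(3), np.ones(3), zero_params(4, 3))
```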

Zhou et al. [52] first attempted to apply LSTM to HSI classification and proposed the spatial-spectral LSTMs (SSLSTMs). The experimental results show that LSTM can also be used effectively for modeling long-range dependencies in the spectral domain. Nevertheless, it is worth noting that the two branches in SSLSTMs are independent of each other. In addition, unfolding the original HSI data into one-dimensional vectors as the input of SSLSTMs loses the intrinsic structure of the hyperspectral data, since spatial information is not considered in LSTM.

C. Convolutional LSTM

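Considering the shortcomings of LSTM, a modified and extended version of LSTM, i.e., ConvLSTM [47], was developed. Different from LSTM, the input-to-state and state-to-state transitions in ConvLSTM are realized by convolution. ConvLSTM otherwise holds the same structure as LSTM and can be used to model long-term dependencies in the time domain or spectral domain, in which the calculation formulas can be expressed as

$$\begin{aligned}
\mathbf{i}_t &= \sigma\left(\mathbf{W}_{xi} * \mathcal{X}_t + \mathbf{W}_{hi} * \mathcal{H}_{t-1} + \mathbf{b}_i\right) \\
\mathbf{f}_t &= \sigma\left(\mathbf{W}_{xf} * \mathcal{X}_t + \mathbf{W}_{hf} * \mathcal{H}_{t-1} + \mathbf{b}_f\right) \\
\tilde{\mathcal{C}}_t &= \tanh\left(\mathbf{W}_{xc} * \mathcal{X}_t + \mathbf{W}_{hc} * \mathcal{H}_{t-1} + \mathbf{b}_c\right) \\
\mathcal{C}_t &= \mathbf{f}_t \circ \mathcal{C}_{t-1} + \mathbf{i}_t \circ \tilde{\mathcal{C}}_t \\
\mathbf{o}_t &= \sigma\left(\mathbf{W}_{xo} * \mathcal{X}_t + \mathbf{W}_{ho} * \mathcal{H}_{t-1} + \mathbf{b}_o\right) \\
\mathcal{H}_t &= \mathbf{o}_t \circ \tanh\left(\mathcal{C}_t\right)
\end{aligned} \tag{3}$$

where $\mathcal{X}_t$ denotes the input of the current cell, and $\mathcal{C}_{t-1}$ and $\mathcal{H}_{t-1}$ are the state and output of the last cell, respectively. $*$ means the convolution operation. $\mathbf{W}$ denotes the 2-D convolution filter with a $k \times k$ kernel, where $k$ is the size of the convolution kernel. The definitions of $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are similar to those in (2); however, the data dimensions and processing methods are different.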
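The ConvLSTM cell can be sketched in NumPy by replacing the matrix products of LSTM with 2-D convolutions. The example below assumes zero ("same") padding so that the state keeps the spatial size of the input, an odd kernel size, and no peephole terms; these choices and all function names are assumptions of the sketch rather than details taken from the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Zero-padded 2-D convolution (cross-correlation form).

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for a in range(k):
        for b in range(k):
            # Contribution of kernel tap (a, b) at every spatial position.
            out += np.einsum("oc,chw->ohw", w[:, :, a, b],
                             xp[:, a:a + h, b:b + wd])
    return out

def convlstm2d_step(x_t, h_prev, c_prev, params):
    """One ConvLSTM2D step; params[g] = (W_x, W_h, b) for g in i, f, o, c."""
    pre = {
        g: conv2d_same(x_t, w_x) + conv2d_same(h_prev, w_h) + b[:, None, None]
        for g, (w_x, w_h, b) in params.items()
    }
    i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c_t = f_t * c_prev + i_t * np.tanh(pre["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_sequence(seq, params, n_maps):
    """Feed a sequence of 2-D inputs (T, H, W) through the cell one by one."""
    _, h, w = seq.shape
    h_t = np.zeros((n_maps, h, w))
    c_t = np.zeros((n_maps, h, w))
    outputs = []
    for x in seq:
        h_t, c_t = convlstm2d_step(x[None], h_t, c_t, params)
        outputs.append(h_t)
    return np.stack(outputs), c_t
```

Because the state-to-state transition is itself a convolution, each step mixes information from neighboring pixels as well as from earlier elements of the sequence.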

Different from the traditional way of extracting spatial features based on a sliding window in CNN, the ConvLSTM cell holds three special gate mechanisms to complete data transmission and processing, which makes it possible to utilize the spatial information of HSIs more effectively. At the same time, the ConvLSTM layer, constructed with the ConvLSTM cell in (3) as the basic unit, can not only implement inter-layer data transmission and processing, but also perform them within a layer, which is another major difference from CNN. This special structure enables the ConvLSTM layer to extract a more effective feature representation than CNN. Furthermore, compared with LSTM, the three gate mechanisms are extended from one-dimensional operations to multi-dimensional convolution operations, and this change can not only capture the spatial context information of the original data, similar to the convolutional layer in CNN, but also model the long-range dependencies in the time domain of video or the spectral domain of HSIs.

III. SPATIAL-SPECTRAL CONVLSTM 2-D NEURAL NETWORK

In SSLSTMs [52], the spatial and spectral features are not effectively fused, and the spatial structure information is not well preserved. As for Bi-CLSTM [56], the correlations between different spectral bands are not fully utilized. In order to address the above issues, a novel ConvLSTM-based spatial-spectral feature extraction model is proposed. It should be noted that ConvLSTM given in Section II-C is actually a 2-D extension of LSTM, and in what follows, to distinguish it from the 3-D extension in Section IV, the ConvLSTM involved in this Section is named ConvLSTM2D.

A. SSCL2DNN

On account of the above analysis, a joint spatial-spectral feature extraction and classification model based on ConvLSTM2D is constructed, which is shown in Fig. 1 labeled ① in blue. The core structure of the proposed SSCL2DNN model consists of the ConvLSTM2D layer and the pooling layer. Based on the basic ConvLSTM2D cell in (3), a ConvLSTM2D layer is built, and a 3-D data cube needs to be decomposed into a spectral sequence, which is then input into the memory cells of the ConvLSTM2D layer one by one. However, there is much redundant information in the original HSI data cubes, and the more data used, the greater the complexity involved. Unlike [22], PCA is selected as the preprocessing method to reduce the dimension of the data.

First of all, the size of the original HSI data cube can be denoted as $W \times H \times D$, where $D$ indicates the number of the spectral bands, and $W$ and $H$ are the width and height of the HSI, respectively. In order to reduce computational complexity, the first $K$ components after PCA are selected as the spectral information of each pixel $\mathbf{x}_i$. Furthermore, for each ConvLSTM2D layer in SSCL2DNN, considering that the spatial context information is beneficial to HSI classification, the data in a local spatial window with the size of $s \times s$ is extracted as the spatial information for extracting spatial-spectral features, which is the input of each memory cell in the ConvLSTM2D layer. After data preprocessing, a 3-D input denoted by $\mathbf{X}_i \in \mathbb{R}^{s \times s \times K}$ is constructed. In particular, the special timestep dimension in ConvLSTM2D is expressed as a variable $\tau$, which needs to be fixed as $K$ to match the dimension of the input data. Concretely, the 3-D input of each pixel is decomposed into $K$ 2-D components and converted into a sequence with the length of $K$, i.e., $\{\mathbf{X}_i^{1}, \mathbf{X}_i^{2}, \ldots, \mathbf{X}_i^{K}\}$, where $\mathbf{X}_i^{k}$ denotes the $k$th component of the pixel $\mathbf{x}_i$. Then, this sequence is fed into the ConvLSTM2D layer one by one.

Fig. 1. The proposed spatial-spectral ConvLSTM 2-D Neural Network (SSCL2DNN) is labeled ① in blue. In particular, SSCL2DNN can be transformed into another deep model when K equals 1 and τ equals 1, which is called spatial ConvLSTM 2-D Neural Network (SaCL2DNN), labeled ② with a red dotted box.
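The preprocessing described above (PCA along the spectral axis, extraction of a local window around a pixel, and decomposition into a band-by-band sequence) can be sketched as follows. The helper names, the eigen-decomposition route to PCA, and the reflect padding used at image borders are assumptions of this example; the source does not state how border pixels are handled.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an (H, W, D) cube onto its first n_components principal axes."""
    h, w, d = cube.shape
    flat = cube.reshape(-1, d).astype(float)
    flat -= flat.mean(axis=0)                      # center each band
    cov = flat.T @ flat / (flat.shape[0] - 1)      # (D, D) band covariance
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :n_components]       # largest variance first
    return (flat @ top).reshape(h, w, n_components)

def extract_patch(cube, row, col, s):
    """Return the s x s x K window centered on (row, col), s odd.

    Borders use reflect padding (an assumption of this sketch).
    """
    r = s // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + s, col:col + s, :]

def to_spectral_sequence(patch):
    """Turn an (s, s, K) patch into K 2-D components of shape (s, s)."""
    return np.moveaxis(patch, -1, 0)

# Usage on a small synthetic cube (values are random, not real HSI data).
rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 6))
reduced = pca_reduce(cube, n_components=3)       # (8, 8, 3)
patch = extract_patch(reduced, row=4, col=4, s=5)  # (5, 5, 3)
sequence = to_spectral_sequence(patch)           # (3, 5, 5)
```

Each element of `sequence` plays the role of one timestep input, so the sequence length equals the number of retained components.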

In order to compress the feature maps generated by the ConvLSTM2D layer and reduce the computational complexity, the pooling layer is also used in the proposed deep model. After cascading multiple ConvLSTM2D layers and pooling layers, the final output of the last ConvLSTM2D layer in the proposed SSCL2DNN model is the desired feature representation, which is subsequently fed into the classification layer to obtain the final result. Specifically, according to [22], [26], small convolution kernels are efficient for yielding better classification performance. As such, small convolution kernels are used in the proposed deep model, together with a small pooling kernel in the pooling layer; the candidate kernel sizes are given with the experimental settings in Section V-B.

Due to the special inner structure of LSTM and its extended architecture, if we set $\tau$ to 1 and $K$ to 1 in each ConvLSTM2D layer and convert the input from 3-D data to 2-D data, the proposed deep model reduces to the spatial ConvLSTM 2-D Neural Network (SaCL2DNN), which is shown in Fig. 1 labeled ② with the red dotted boxes. This transformation is similar to that from 3-D CNN to 2-D CNN; however, SaCL2DNN can utilize the spatial structure information of hyperspectral data more effectively and obtain better classification performance than 2-D CNN.

Particularly, the proposed deep models can not only implement the same data transmission between layers as CNN does, but also accomplish effective data transmission and processing within each ConvLSTM2D layer. It is known that adjacent spectral bands are highly correlated, and there may also be some correlations between non-adjacent spectral bands [48]. Therefore, by effectively modeling long-range dependencies in the spectral field, SSCL2DNN can jointly consider the spatial-spectral information, which can address the problems of insufficient feature fusion caused by using two subbranches independently and the loss of spatial information in SSLSTMs [52] and underutilization of correlation between different spectral bands in Bi-CLSTM [56].

B. Loss Function and Optimization Method

Suppose that there are L ConvLSTM2D layers and pooling layers in SSCL2DNN. In the first ConvLSTM2D layer, each 2-D component in the spectral sequence is the input of the kth memory cell, which corresponds to the cell input in (3). There are then two kinds of outputs, i.e., the outputs of the K memory cells, and the state and output yielded by modeling long-term dependencies, which correspond to the cell state and cell output in (3). The outputs of the K memory cells are retained as the inputs of the next layer to extract high-level features. After L ConvLSTM2D layers and pooling layers, the output of the last ConvLSTM2D layer generated by modeling long-range dependencies in the spectral domain is the desired spatial-spectral feature. After reducing dimensions through the last pooling layer, the spatial-spectral features are converted into 1-D vectors and then fed into a fully connected layer to map the feature space to the class label space. Finally, the feature vectors are input to a softmax function to predict the conditional probability $P(y = c)$ of each class $c$ given the input, where $c \in \{1, 2, \ldots, N\}$ and $N$ is the number of classes in the HSI data sets.

In addition, the cross entropy [52] is used as the loss function to obtain the final classification results, and it is optimized by the adaptive moment estimation (Adam) algorithm.
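The classification layer and loss can be illustrated with a short NumPy sketch of the softmax function and the mean cross entropy over a batch. The function names and the small constant added inside the logarithm for numerical safety are choices made for this example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true classes.

    probs:  (n, N) predicted class probabilities
    labels: (n,) integer class indices in [0, N)
    """
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

# Usage: uniform predictions over 4 classes give a loss close to log(4).
probs = softmax(np.zeros((2, 4)))
loss = cross_entropy(probs, np.array([0, 3]))
```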

IV. SPATIAL-SPECTRAL CONVLSTM 3-D NEURAL NETWORK

In order to better preserve the intrinsic structure information of the hyperspectral data, in this Section, the 3-D extended structure (ConvLSTM3D) is further developed, and another novel deep feature extraction and classification model, i.e., SSCL3DNN, is proposed, which can yield more discriminative spatial-spectral features for further improving the classification performance.

Fig. 2. The proposed spatial-spectral ConvLSTM 3-D Neural Network (SSCL3DNN) structure.

A. ConvLSTM3D

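On the basis of ConvLSTM2D, the 3-D extended version (ConvLSTM3D) is further developed, in which there are three gate units, and the calculation formulas are similar to those in ConvLSTM2D. However, different from ConvLSTM2D, the whole 3-D data cube is taken as the input of each memory cell in ConvLSTM3D. In particular, the input $\mathcal{X}_t$, the states $\mathcal{C}_{t-1}$ and $\mathcal{C}_t$, the outputs $\mathcal{H}_{t-1}$ and $\mathcal{H}_t$, and the gate units $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ of ConvLSTM3D are 4-D tensors, whose last three dimensions are the spectral dimension and two spatial dimensions, and the convolution filters $\mathcal{W}_{\cdot i}$, $\mathcal{W}_{\cdot f}$, and $\mathcal{W}_{\cdot o}$ are 3-D tensors. The structure of the ConvLSTM3D layer is illustrated in Fig. 3. Specifically, the equations of the ConvLSTM3D cell can be written as

$$\begin{aligned}
\mathbf{i}_t &= \sigma\left(\mathcal{W}_{xi} \circledast \mathcal{X}_t + \mathcal{W}_{hi} \circledast \mathcal{H}_{t-1} + \mathbf{b}_i\right) \\
\mathbf{f}_t &= \sigma\left(\mathcal{W}_{xf} \circledast \mathcal{X}_t + \mathcal{W}_{hf} \circledast \mathcal{H}_{t-1} + \mathbf{b}_f\right) \\
\tilde{\mathcal{C}}_t &= \tanh\left(\mathcal{W}_{xc} \circledast \mathcal{X}_t + \mathcal{W}_{hc} \circledast \mathcal{H}_{t-1} + \mathbf{b}_c\right) \\
\mathcal{C}_t &= \mathbf{f}_t \circ \mathcal{C}_{t-1} + \mathbf{i}_t \circ \tilde{\mathcal{C}}_t \\
\mathbf{o}_t &= \sigma\left(\mathcal{W}_{xo} \circledast \mathcal{X}_t + \mathcal{W}_{ho} \circledast \mathcal{H}_{t-1} + \mathbf{b}_o\right) \\
\mathcal{H}_t &= \mathbf{o}_t \circ \tanh\left(\mathcal{C}_t\right)
\end{aligned} \tag{4}$$

where $\circledast$ is the defined 3-D convolution between the 4-D input or output and the 3-D convolution filter.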

Compared with the LSTM cell in (2) and the ConvLSTM2D cell in (3), the extended ConvLSTM3D cell has a similar computational model; the main difference is the calculation of the convolution in each gate unit. Take the input gate of the ConvLSTM3D cell as an example. Suppose that $\mathcal{X}_t$ is its input, i.e., the $t$th component of the sequence obtained by decomposing the input of the ConvLSTM3D layer along the timestep dimension, where the layer input is indexed by timestep, width, height, and spectral band, and $k$ and $d$ are the kernel size and depth of the filter, respectively. We define the 3-D convolution of the filter $\mathcal{W}_{xi}$ and $\mathcal{X}_t$ as $\mathcal{W}_{xi} \circledast \mathcal{X}_t$, where the output at position $(x, y, z)$ in the input gate is defined as

$$\left(\mathcal{W}_{xi} \circledast \mathcal{X}_t\right)^{xyz} = \sum_{p=0}^{k-1} \sum_{q=0}^{k-1} \sum_{r=0}^{d-1} \mathcal{W}_{xi}^{pqr}\, \mathcal{X}_t^{(x+p)(y+q)(z+r)} \tag{5}$$

Fig. 3. The illustration of the inner structure of ConvLSTM3D.
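The 3-D convolution used inside each gate can be sketched for a single-channel volume as a triple sum over a kernel of spatial size k and depth d. The axis order (depth first, then the two spatial axes), the "valid" border handling, and the function names are assumptions of this illustration.

```python
import numpy as np

def conv3d_at(volume, kernel, pos):
    """Response of a 3-D filter at one output position.

    volume: (D, H, W) single-channel data cube
    kernel: (d, k, k) filter with depth d and spatial size k
    pos:    (z, y, x) index of the window's first element
    """
    d, k, _ = kernel.shape
    z, y, x = pos
    window = volume[z:z + d, y:y + k, x:x + k]
    return float(np.sum(kernel * window))

def conv3d_valid(volume, kernel):
    """Slide the filter over every position where it fits entirely."""
    d, k, _ = kernel.shape
    depth, h, w = volume.shape
    out = np.zeros((depth - d + 1, h - k + 1, w - k + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = conv3d_at(volume, kernel, (z, y, x))
    return out

# Usage: summing the integers 0..26 with an all-ones 3 x 3 x 3 filter.
vol = np.arange(27, dtype=float).reshape(3, 3, 3)
value = conv3d_at(vol, np.ones((3, 3, 3)), (0, 0, 0))
```

Unlike the band-by-band 2-D case, a single filter here spans several adjacent spectral components at once, which is what lets the cell see local spectral structure inside each gate.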

B. SSCL3DNN

Based on the structure of ConvLSTM3D, we further extend SSCL2DNN to a new deep model, i.e., SSCL3DNN. Different from SSCL2DNN, the whole data cube is taken as the input of the memory cell in SSCL3DNN. Therefore, the intrinsic structure of HSIs can be captured well, and the spatial-spectral features can be learned effectively to further improve the HSI classification performance. Fig. 2 shows the structure of the proposed deep model, in which the parameter $\tau$ is fixed as 1.

In particular, PCA is also used to reduce the redundant information in this Subsection, and similarly, the first $K$ components are chosen. With regard to the size of the convolution kernel, small kernels are again adopted in the proposed deep model, together with a small pooling kernel in the pooling layer. After the cascade of multiple ConvLSTM3D layers and pooling layers, the spatial-spectral features are extracted from the 3-D data cube and subsequently fed into the classification layer to obtain the classification results.

Although the loss function and optimization method in this Section are the same as those of SSCL2DNN in Section III, the spatial-spectral features of each pixel, their 1-D vector form, and the data processing at each layer in SSCL3DNN are quite different.

V. EXPERIMENTAL RESULTS

To quantitatively and qualitatively analyze the classification performance of the proposed models, SVM [57], 2-D CNN [22], 3-D CNN [22], SaLSTM [52], SSLSTMs [52], and Bi-CLSTM [56] are used as the comparative algorithms. In particular, the special case (SaCL2DNN) of SSCL2DNN is used for a fair comparison with SVM, 2-D CNN, and SaLSTM. Three commonly used quantitative metrics are adopted, i.e., overall accuracy (OA), average accuracy (AA), and Kappa coefficient ($\kappa$). To eliminate the bias introduced by randomly choosing training samples, each experiment is repeated 10 times, and the mean values of each evaluation criterion are presented.
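The three metrics can be computed from a confusion matrix as in the NumPy sketch below, using the standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and the Kappa coefficient corrects OA for chance agreement. The function names are illustrative, and the sketch assumes every class has at least one test sample.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def classification_metrics(cm):
    """Return (OA, AA, kappa); assumes each class has at least one sample."""
    total = cm.sum()
    oa = np.trace(cm) / total
    per_class = np.diag(cm) / cm.sum(axis=1)
    aa = per_class.mean()
    # Expected agreement by chance, from row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return float(oa), float(aa), float(kappa)

# Usage on a toy two-class example.
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
oa, aa, kappa = classification_metrics(cm)
```

Averaging these values over repeated runs with different random training sets, as done in the experiments, only requires calling `classification_metrics` once per run.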

A. Hyperspectral Data Sets

Three common HSI data sets, i.e., Indian Pines, Salinas Valley, and University of Pavia, are considered in our experiments, whose false color maps, ground-truth maps, and corresponding training sizes are shown in Fig. 4.

1) Indian Pines: The Indian Pines data set was captured in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in Northwestern Indiana, USA, and is mainly composed of multiple agricultural fields. The spatial size of this data set is $145 \times 145$ pixels with a spatial resolution of 20 meters per pixel (mpp), and there are 224 spectral bands in the wavelength range from 0.4 to 2.5 μm. After discarding the water absorption bands together with four null bands, 200 bands generally remain for study. After removing the background pixels, 10249 pixels are retained, which contain useful ground-truth information from 16 different class labels.

2) Salinas Valley: The Salinas Valley data set was collected by the 224-band AVIRIS sensor over Salinas Valley, California. This data set is made up of 512 lines and 217 columns, and contains 16 ground-truth classes. After removing the water absorption bands and noise-affected bands, 204 spectral bands are preserved.

3) University of Pavia: The University of Pavia data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Northern Italy. After removing several noise-corrupted bands, there are 103 spectral bands in the spectral range from 0.43 to 0.86 μm, and the image has a size of $610 \times 340$ pixels with a spatial resolution of 1.3 mpp. Different from the Indian Pines and Salinas Valley data sets, this data set contains 9 distinguishable classes.

B. Experimental Settings

In order to reduce the computational complexity, PCA is used as a preprocessing method to reduce data dimensions. For SVM, 2-D CNN, SaLSTM, and SaCL2DNN, the first principal component is preserved as the input. As far as SSCL2DNN and SSCL3DNN are concerned, the top K components are extracted as the spectral features.

TABLE I. SENSITIVITY COMPARISON AND ANALYSIS UNDER DIFFERENT SIZE ($s \times s$)

TABLE II. SENSITIVITY COMPARISON AND ANALYSIS UNDER DIFFERENT NUMBER ($K$)

All the parameters of the compared methods are set according to [22], [52], [56], [57] to achieve quasi-optimal performance. Specifically, for SVM, the radial basis function (RBF) kernel is used in the Libsvm toolbox [57]. There are two key parameters in SVM, i.e., $C$ and $\gamma$, which denote the regularization parameter and the kernel function parameter, respectively. According to [57], fivefold cross-validation is adopted to tune $C$ and $\gamma$ over their respective search ranges. For Bi-CLSTM [56], the kernel size and the number of feature maps in each ConvLSTM2D layer are fixed as 3 and 32, respectively. As for SeLSTM, SaLSTM, and SSLSTMs [52], the numbers of output nodes in SeLSTM and SaLSTM are set as 64 and 128, respectively, for the Indian Pines and Salinas Valley data sets, and as 128 and 256, respectively, for the University of Pavia data set.

For SSCL2DNN and SSCL3DNN, there are four key parameters to be determined, i.e., the size ($s \times s$) of the local window, the number ($K$) of the principal components after PCA, the size ($k \times k$) of the convolution kernel, and the number ($M$) of the feature maps at each ConvLSTM layer. Firstly, $K$ is fixed as 10. Then, $s$ is chosen from {21, 23, 25, 27, 29, 31, 33}, $k$ from {3, 4, 5}, and $M$ is searched from three given combinations {16, 32}, {32, 64}, and {64, 128}. The experimental results under different sizes of the local window are reported in Table I, from which it can be seen that for these three data sets, the optimal $s$ in SSCL2DNN is 27, while for SSCL3DNN, it is 27, 31, and 31, respectively. Although SSCL3DNN can obtain 0.46% and 0.32% gains when $s$ varies from 27 to 31 for the Salinas Valley and University of Pavia data sets, the computational complexity and runtime increase noticeably. Based on the above analysis, the size of the local window for the proposed models is fixed as $27 \times 27$, which not only achieves satisfactory classification performance, but also makes the proposed models more convenient for practical application. In particular, the size of the local window in SaCL2DNN is the same as that in SSCL2DNN.

TABLE VI

After that, the influence of K on the performance is analyzed. K is searched from {5, 10, 15, 20} for the Indian Pines and University of Pavia data sets, and from {5, 10, 15} for the Salinas Valley data set because of memory limitations. The results in Table II reveal that for these three data sets, the optimal K in SSCL3DNN is 10, while for SSCL2DNN, it is 5, 10, and 5, respectively.

In order to ensure the convergence of the loss function, the number of training epochs is fixed as 2000, and the learning rate is set as 0.0001 from epochs 1 to 2000 for the proposed deep models. More detailed parameter settings are summarized in Tables III-V.

For hardware system configuration, all the following experiments are completed on a desktop with an 8th-generation Intel Core i7-8700 processor (up to 3.7 GHz), 16 GB of DDR4 RAM at 2400 MHz, an Nvidia GeForce GTX 1080 Ti GPU with 11 GB of memory, and a 240 GB Intel SSD D3-S4510. For software system configuration, we adopt Ubuntu 16.04.4 x64 as the operating system for all experiments. CUDA 8.0, cuDNN 7.0.5, Tflearn 0.3.2, Tensorflow-gpu 1.4.0, and Python 3.5.4 form the main programming environment. In particular, all methods involved in our experiments are implemented in Anaconda 3.4.

TABLE IX

Fig. 5. Classification maps for the Indian Pines data set. (a) Ground-truth map. (b) SVM. (c) 2-D CNN. (d) SaLSTM. (e) SaCL2DNN. (f) SSLSTMs. (g) Bi-CLSTM. (h) SSCL2DNN. (i) 3-D CNN. (j) SSCL3DNN.

It should be noted that Tflearn is a modular and transparent deep learning library built on Tensorflow, and can provide a higher level API than Tensorflow. A combination of Tflearn and Tensorflow constructs the core framework of the proposed deep models.

C. Classification Performance

To demonstrate the superiority of the proposed deep models, we randomly select 1% of the available labeled samples for training for both the Salinas Valley and University of Pavia data sets, and 10% for the Indian Pines data set. The remaining samples are used for testing.
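The random training/testing split described above can be sketched as follows. This minimal example assumes that the fraction is applied within each class, that counts are rounded up, and that every class keeps at least one training sample; the text does not state these details, and the function name is hypothetical.

```python
import math
import random
from collections import defaultdict


def split_by_fraction(labels, fraction, seed=0, min_per_class=1):
    """Randomly assign `fraction` of each class to training, the rest to testing.

    labels is a list of class ids, one per labelled pixel; the returned lists
    hold pixel indices into that list.
    """
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(by_class):
        pixels = by_class[cls][:]
        rng.shuffle(pixels)
        # Round up (assumption); the small offset guards against float error.
        n_train = math.ceil(fraction * len(pixels) - 1e-9)
        n_train = min(max(min_per_class, n_train), len(pixels))
        train.extend(pixels[:n_train])
        test.extend(pixels[n_train:])
    return train, test
```

For example, `split_by_fraction(labels, 0.10)` would correspond to the 10% setting used for the Indian Pines data set, under the assumptions stated above.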

According to the experiment settings in Subsection B, the quantitative assessments of all models are reported in Tables VI-VIII, from which it can be seen that the proposed SSCL2DNN and SSCL3DNN models provide better classification performance than the other considered models. First of all, the three gate mechanisms allow ConvLSTM to make fuller use of both the spatial and spectral information of HSIs than CNN, which operates in the traditional sliding-window manner. Moreover, the gate mechanisms in ConvLSTM are extended from one-dimensional operations to multi-dimensional convolution operations, which enables ConvLSTM to better preserve and capture the spatial context information of HSIs, and to fuse the spatial and spectral information more effectively than LSTM. Specifically, on the one hand, compared with 2-D CNN, SaCL2DNN improves OA by 0.86%, 5.03%, and 4.80% for the Indian Pines, Salinas Valley, and University of Pavia data sets, respectively, indicating that it can better capture spatial context information from hyperspectral data. It should be noted that it is the loss of the spatial information that leads to the unsatisfactory performance of SVM and SaLSTM. In addition, SaCL2DNN is a special case of SSCL2DNN, and compared with it, SSCL2DNN obtains 0.96%, 1.62%, and 4.78% gains in OA for these three data sets, respectively. For Bi-CLSTM, although ConvLSTM2D is also the core backbone for feature extraction, simply cascading all the outputs not only fails to fully exploit the correlations between different spectral bands, but also easily leads to overfitting. In contrast, SSCL2DNN is built by alternately cascading ConvLSTM2D layers and pooling layers, in which the outputs yielded by modeling long-range dependencies are the final feature representations. To some extent, this approach not only makes good use of the characteristics of ConvLSTM2D, but also reduces the number of features and the complexity of the model. The gains in OA yielded by SSCL2DNN over Bi-CLSTM are 2.41%, 0.58%, and 6.25% for these three HSI data sets, respectively, which verifies the effectiveness of SSCL2DNN and demonstrates that joint learning of spatial-spectral features by modeling long-term dependencies in the spectral domain can provide higher classification performance.

TABLE X

Fig. 6. Classification maps for the Salinas Valley data set. (a) Ground-truth map. (b) SVM. (c) 2-D CNN. (d) SaLSTM. (e) SaCL2DNN. (f) SSLSTMs. (g) Bi-CLSTM. (h) SSCL2DNN. (i) 3-D CNN. (j) SSCL3DNN.

On the other hand, SSCL3DNN fuses the spatial and spectral information more effectively through the 3-D operation, which preserves the intrinsic structure of hyperspectral data and further improves on the classification performance of SSCL2DNN, obtaining 0.76%, 2.99%, and 5.63% gains in OA for these three HSI data sets, respectively. Compared with 3-D CNN, the improvements in OA provided by SSCL3DNN are 0.51%, 2.12%, and 7.96% for the three data sets, respectively. In particular, SSCL3DNN produces remarkable gains for the Salinas Valley and University of Pavia data sets. From Tables VI-VIII, it can be seen that the 3-D extended architecture of LSTM allows SSCL3DNN to achieve better classification performance by preserving the intrinsic structure of hyperspectral data, and the special gate structures enable SSCL3DNN to extract more discriminative spatial-spectral features. In addition, the classification performance of some highly correlated classes, such as class 10, class 11, and class 12 in the Indian Pines data set, class 8 and class 15 in the Salinas Valley data set, and class 1, class 6, and class 8 in the University of Pavia data set, is improved, which demonstrates the superiority of the proposed models.
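For reference, the OA and per-class accuracies discussed in this subsection can be computed from a confusion matrix as in the minimal sketch below. It assumes that the reported gains are differences in OA expressed in percentage points, which is the usual reading but is not stated explicitly; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count test samples: rows are true classes, columns are predicted classes."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def overall_accuracy(m):
    """OA (%): correctly classified samples over all test samples."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return 100.0 * correct / total


def per_class_accuracy(m):
    """Accuracy (%) of each class; NaN for classes without test samples."""
    return [100.0 * row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(m)]


def oa_gain(m_new, m_base):
    """Gain in OA (percentage points) of one model over a baseline."""
    return overall_accuracy(m_new) - overall_accuracy(m_base)
```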

Consistent with Tables VI-VIII, similar conclusions can be drawn from the classification maps presented in Figs. 5-7, from which it is clear that the maps provided by the proposed models are closest to the ground-truth maps for these three data sets. In addition, there are fewer misclassifications, and the boundaries of each class are better recognized, especially for class 2, class 10, and class 11 in Fig. 5, class 8 and class 15 in Fig. 6, and class 1, class 6, and class 8 in Fig. 7, which further verifies the effectiveness of the proposed models.

Labeled samples are expensive and difficult to obtain. Therefore, it is necessary to investigate the performance under a small training size.

D. Sensitivity Comparison and Analysis under Small Samples

In order to further demonstrate the performance of the proposed deep models, we randomly select 10 samples from each labeled class to construct smaller training sets.

The experimental results are reported in Tables IX-X, from which it can be observed that even with a small number of training samples, the proposed deep models still show better classification performance. Compared with 2-D CNN, the improvements in OA yielded by SaCL2DNN are 4.39%, 5.34%, and 16.46%, respectively, and SSCL2DNN further improves on SaCL2DNN with 3.12%, 15.43%, and 3.83% gains in OA for these three HSI data sets, respectively, which further verifies the validity of joint learning of the spatial-spectral features. However, Bi-CLSTM obtains better performance than SSCL2DNN, yielding 1.32%, 0.96%, and 1.90% gains in OA, respectively. The main reason is that cascading all the outputs of each ConvLSTM2D layer allows Bi-CLSTM to provide more available features than SSCL2DNN. A shortcoming shared by Bi-CLSTM and SSCL2DNN is that taking each component of the local patch as the input of each memory cell destroys the intrinsic structure of the hyperspectral data. SSCL3DNN overcomes this shortcoming and yields the best classification performance, obtaining 2.07%, 4.96%, and 22.61% gains in OA for these three data sets, respectively, when compared with 3-D CNN. More detailed results are reported in Tables IX-X, which further demonstrate the advantages of the proposed models.

To further show the effectiveness of the proposed deep models, 20, 30, and 40 samples from each class are randomly extracted for training in these three data sets. In particular, the number of training samples for class 7 and class 9 in the Indian Pines data set is fixed as 10. The OA curves of all models under different numbers of training samples are provided in Fig. 8, from which it can be seen that for these three HSI data sets, SSCL3DNN provides the highest classification accuracy. It is worth noting that SSCL3DNN obtains significant performance improvements on the Salinas Valley and University of Pavia data sets. In particular, compared with Bi-CLSTM, SSCL2DNN achieves higher accuracy when the number of training samples is greater than or equal to 20, which suggests that using ConvLSTM2D to model long-range dependencies between different spectral bands may be more effective than simply cascading all the outputs. The results in Tables IX-X and Fig. 8 further demonstrate the effectiveness of the proposed deep models, and the design of the special gate structure allows them to better capture spatial information and preserve the intrinsic structure information of the original data, which is extremely important for improving the classification performance.
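The small-sample protocol above, with a fixed number of training samples per class and a smaller fixed number for selected classes, can be sketched as follows. The function name and the overrides argument are hypothetical, and class ids are used directly as dictionary keys for illustration.

```python
import random
from collections import defaultdict


def sample_per_class(labels, n_per_class, overrides=None, seed=0):
    """Draw a fixed number of training samples from each class.

    labels is a list of class ids, one per labelled pixel. overrides maps a
    class id to a class-specific count, e.g. {7: 10, 9: 10} for the two
    Indian Pines classes whose training size is fixed as 10.
    """
    overrides = overrides or {}
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(by_class):
        pixels = by_class[cls][:]
        rng.shuffle(pixels)
        # Never request more samples than the class actually contains.
        n = min(overrides.get(cls, n_per_class), len(pixels))
        train.extend(pixels[:n])
        test.extend(pixels[n:])
    return train, test
```

Calling `sample_per_class(labels, 40, overrides={7: 10, 9: 10})` would reproduce the 40-sample setting under these assumptions; the remaining pixels form the test set.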

VI. CONCLUSION

In this paper, two novel deep ConvLSTM neural networks, i.e., SSCL2DNN and SSCL3DNN, have been proposed to extract more effective and discriminative spatial-spectral features for HSI classification. In SSCL2DNN, by taking the local patch as a spectral sequence and feeding it into each memory cell band by band, the outputs obtained by modeling long-range dependencies in the spectral domain serve as the spatial-spectral features, which reduces the number of features and alleviates the overfitting problem to a certain extent while achieving satisfactory classification performance. By further developing the 3-D extended structure of LSTM, SSCL3DNN can better preserve the intrinsic structure of hyperspectral data, and the special gate mechanisms enable it to extract more discriminative spatial-spectral features, which can further improve classification performance. The experimental results on three widely used HSI data sets show that the proposed deep models offer competitive advantages over state-of-the-art approaches, especially in the case of small training size.

Fig. 7. Classification maps for the University of Pavia data set. (a) Ground-truth map. (b) SVM. (c) 2-D CNN. (d) SaLSTM. (e) SaCL2DNN. (f) SSLSTMs. (g) Bi-CLSTM. (h) SSCL2DNN. (i) 3-D CNN. (j) SSCL3DNN.

Fig. 8. Overall accuracy (%) of considered methods with different number of training samples for three HSI data sets: (a) Indian Pines, (b) Salinas Valley, (c) University of Pavia.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the Anonymous Reviewers for their valuable comments and suggestions, which are greatly helpful to improve the quality and presentation of this paper.

REFERENCES

[1] D. Landgrebe, “Hyperspectral image data analysis,” IEEE Signal Process. Mag., vol. 19, no. 1, pp. 17-28, Jan. 2002.

[2] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” IEEE Geosci. Remote Sens. Mag., vol. 1, no. 2, pp. 6-36, Jun. 2013.

[3] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 45-54, Jan. 2014.

[4] X. Zhang, Y. Sun, K. Shang, L. Zhang, and S. Wang, “Crop classification based on feature band set construction and object-oriented approach using hyperspectral images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 9, pp. 4117-4128, Sep. 2016.

[5] F. M. Lacar, M. M. Lewis, and I. T. Grierson, “Use of hyperspectral imagery for mapping grape varieties in the Barossa Valley, South Australia,” in IEEE Int. Geosci. Remote Sens. Symp., Sydney, NSW, Australia, Jul. 2001, pp. 2875-2877.

[6] A. Ghiyamat and H. Z. Shafri, “A review on hyperspectral remote sensing for homogeneous and heterogeneous forest biodiversity assessment,” Int. J. Remote Sens., vol. 31, no. 7, pp. 1837-1856, 2010.

[7] F. van der Meer, “Analysis of spectral absorption features in hyperspectral imagery,” Int. J. Appl. Earth Observ. Geoinf., vol. 5, no. 1, pp. 55-68, Feb. 2004.

[8] L. Pan, H. Li, W. Li, X. Chen, G. Wu, and Q. Du, “Discriminant analysis of hyperspectral imagery using fast kernel sparse and low-rank graph,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6085-6098, Nov. 2017.

[9] G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, and J. Calpe-Maravilla, “Composite kernels for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 3, no. 1, pp. 93-97, Jan. 2006.

[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210-227, Feb. 2009.

[11] M. Cui and S. Prasad, “Class-dependent sparse representation classifier for robust hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2683-2695, May 2015.

[12] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3973-3985, Oct. 2011.

[13] H. Zhang, J. Li, Y. Huang, and L. Zhang, “A nonlocal weighted joint sparse representation classification method for hyperspectral imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2056-2065, Jun. 2014.

[14] C. Chen, N. Chen, and J. Peng, “Nearest regularized joint sparse representation for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 3, pp. 424-428, Mar. 2016.

[15] J. Peng and Q. Du, “Robust joint sparse representation based on maximum correntropy criterion for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 12, pp. 7152-7164, Dec. 2017.

[16] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137-1149, Jun. 2017.

[17] G. Zhu, F. Porikli, and H. Li, “Robust visual tracking with deep convolutional neural network based object proposals on pets,” in IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Las Vegas, NV, 2016, pp. 1265-1272.

[18] A. N. Shuaibu, A. S. Malik, and I. Faye, “Adaptive feature learning CNN for behavior recognition in crowd scene,” in IEEE Int. Conf. Signal Image Process. Appl., Kuching, 2017, pp. 357-361.

[19] F. Ratle, G. Camps-Valls, and J. Weston, “Semisupervised neural networks for efficient hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2271-2282, May 2010.

[20] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094-2107, Jun. 2014.

[21] W. Hu, Y. Huang, W. Li, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” Journal of Sensors, vol. 2015, no. 2, pp. 1-12, Jul. 2015.

[22] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 6232-6251, Oct. 2016.

[23] W. Li, G. Wu, F. Zhang, and Q. Du, “Hyperspectral image classification using deep pixel-pair features,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844-853, Feb. 2017.

[24] W. Shao and S. Du, “Spectral-spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4544-4554, Oct. 2016.

[25] Z. Zhong, J. Li, Z. Luo, and M. Chapman, “Spectral-spatial residual network for hyperspectral image classification: A 3-D deep learning framework,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 847-858, Feb. 2018.

[26] Y. Li, H. Zhang, and Q. Shen, “Spectral-spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sens., vol. 9, no. 1, pp. 67, Jan. 2017.

[27] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, “A new deep convolutional neural network for fast hyperspectral image classification,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 120-147, Nov. 2018.

[28] W. Song, S. Li, L. Fang, and T. Lu, “Hyperspectral image classification with deep feature fusion network,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 6, pp. 3173-3184, Jun. 2018.

[29] L. Fang, Z. Liu, and W. Song, “Deep hashing neural networks for hyperspectral image feature extraction,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 9, pp. 1412-1416, Sep. 2019, doi: 10.1109/LGRS.2019.2899823.

[30] R. Kemker and C. Kanan, “Self-taught feature learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2693-2705, May 2017.

[31] B. Liu, X. Yu, P. Zhang, X. Tan, A. Yu, and Z. Xue, “A semi-supervised convolutional neural network for hyperspectral image classification,” Remote Sens. Lett., vol. 8, no. 9, pp. 839-848, Sep. 2017.

[32] B. Liu, X. Yu, P. Zhang, A. Yu, Q. Fu, and X. Wei, “Supervised deep feature extraction for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 1909-1921, Apr. 2018.

[33] H. Zhang, Y. Li, Y. Zhang, and Q. Shen, “Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network,” Remote Sens. Lett., vol. 8, no. 5, pp. 438-447, 2017.

[34] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, “Deep learning for hyperspectral image classification: An overview,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6690-6709, Sep. 2019, doi: 10.1109/TGRS.2019.2907932.

[35] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. Adv. Neural Inf. Process. Syst., pp. 3859-3869, 2017.

[36] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, “Deep & dense convolutional neural network for hyperspectral image classification,” Remote Sens., vol. 10, no. 9, pp. 1454, Sep. 2018.

[37] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2377-2385.

[38] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855-868, May 2009.

[39] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3639-3655, Jul. 2017.

[40] B. Liu, X. Yu, A. Yu, P. Zhang, and G. Wan, “Spectral-spatial classification of hyperspectral imagery based on recurrent neural networks,” Remote Sens. Lett., vol. 9, no. 12, pp. 1118-1127, 2018.

[41] X. Zhang, Y. Sun, K. Jiang, C. Li, L. Jiao, and H. Zhou, “Spatial sequential recurrent neural network for hyperspectral image classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 11, pp. 4141-4155, Nov. 2018.

[42] R. Hang, Q. Liu, D. Hong, and P. Ghamisi, “Cascaded recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5384-5394, Aug. 2019, doi: 10.1109/TGRS.2019.2899129.

[43] L. Mou, Q. Liu, L. Bruzzone, and X. X. Zhu, “Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 924-935, Feb. 2019.

[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural comput., vol. 9, no. 8, pp. 1735-1780, 1997.

[45] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst., Montreal, Canada, 2014, pp. 3104-3112.

[46] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural network for sequence learning,” Computer Science, 2015.

[47] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Convolutional LSTM network: a machine learning approach for precipitation nowcasting,” in Proc. Adv. Neural Inf. Process. Syst., 2015.

[48] H. Wu and S. Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sens., vol. 9, no. 3, pp. 298, Mar. 2017.

[49] M. Rußwurm and M. Körner, “Multi-temporal land cover classification with sequential recurrent encoders,” ISPRS Int. J. Geo-Inf., vol. 7, no. 4, pp. 129, Mar. 2018.

[50] A. Song, J. Choi, Y. Han, and Y. Kim, “Change detection in hyperspectral images using recurrent 3D fully convolutional networks,” Remote Sens., vol. 10, no. 11, pp. 1827, Nov. 2018.

[51] M. Seydgar, A. A. Naeini, M. Zhang, W. Li, and M. Satari, “3-D convolution-recurrent networks for spectral-spatial classification of hyperspectral images,” Remote Sens., vol. 11, no. 7, pp. 883, Apr. 2019.

[52] F. Zhou, R. Hang, Q. Liu, and X. Yuan, “Hyperspectral image classification using spectral-spatial LSTMs,” Neurocomputing, vol. 328, no. 7, pp. 39-47, Feb. 2017.

[53] M. Rußwurm and M. Körner, “Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multi-spectral satellite images,” in Conf. Comput. Vis. Pattern Recognit. Workshops, Honolulu, HI, 2017, pp. 1496-1504.

[54] B. Romera-Paredes and P. H. S. Torr, “Recurrent instance segmentation,” in Proc. European Conf. Comput. Vis., Amsterdam, The Netherlands, 2016, pp. 312-329.

[55] A. Ma, A. M. Filippi, Z. Wang, and Z. Yin, “Hyperspectral image classification using similarity measurements-based deep recurrent neural networks,” Remote Sens., vol. 11, no. 2, pp. 194, Jan. 2019.

[56] Q. Liu, F. Zhou, R. Hang, and X. Yuan, “Bidirectional-convolutional LSTM based spectral-spatial feature learning for hyperspectral image classification,” Remote Sens., vol. 9, no. 12, pp. 17-28, Dec. 2017.

[57] C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1-27, Mar. 2011.
