MAPPING of land cover and its change has a critical role in the characterization of the current state of the environment. Changes in land cover can be caused either by human activities or by climate change on a regional scale. Land cover, in turn, affects climate through water and energy exchange with the atmosphere and by changing the carbon balance. Because of this, land cover belongs to the Essential Climate Variables [1]. Hence, timely assessment of land cover and its change is one of the most important applications in satellite remote sensing. Thematic maps are needed annually for various purposes in medium resolution (circa 250 m) with less than 15% measurement uncertainty and in high resolution (10-30 m) with less than 5% uncertainty. CORINE Land Cover (CLC) is a notable example of a consistent Pan-European land cover mapping initiative [2], [3] coordinated by the European Environment Agency (EEA).
CORINE stands for coordination of information on the environment. It is an ongoing long-term effort providing the most harmonized land cover classification data in Europe, with updates approximately every 4 years. The CORINE maps are an important source of land cover information, suitable for operational purposes for various customer groups in Europe. The nomenclature has altogether 44 classes, though many of them are not strictly ecological classes but rather land use classes. On the continental scale, CORINE provides a harmonized map with a 25 ha minimum mapping unit (MMU) for areal phenomena and a minimum width of 100 m for linear phenomena [4]. National land cover maps in the CORINE framework can exhibit smaller mapping units. In Finland, the latest revision of the CORINE land cover map at the time of this study was the 2018 version produced by the Finnish Environment Institute. The map has an MMU of and was produced by a combined automated and manual interpretation of high-resolution optical satellite data, followed by data integration with existing basic map layers [5].
The state-of-the-art approaches used for land cover mapping mainly rely on the satellite optical imagery. The key role is played by the Landsat imagery often augmented by the MODIS or SPOT-5 imagery [6]–[8]. Other sources of information employed for land cover mapping include Digital Elevation Models (DEM) and very high-resolution imagery [9]. When it comes to the large-scale and multitemporal land cover mapping, a more recent optical imagery source is Copernicus Sentinel-2. With a revisit of 5 days, it has become another key data source [10].
International programs, such as the European Space Agency’s (ESA’s) Copernicus [11] behind the Sentinel satellites, are making significant efforts to make Earth Observation (EO) data freely available for commercial and non-commercial purposes. The Copernicus programme is a multi-billion investment by the EU and ESA aiming to provide essential services based on accurate and timely data from satellites. Its main goals are to improve the ways of managing the environment, to help mitigate the effects of climate change, and to enable the creation of new applications and services, such as for environmental monitoring and urban development.
The provision of free satellite data for mapping in the framework of such programs also enables application of methods that could not be used earlier because they require vast and representative datasets for training, for example deep learning. In recent years, deep learning has brought about several breakthroughs in the pattern recognition and computer vision [12]–[14]. The success of the deep learning models can be attributed to both their deep multilayer structure creating nonlinear functions and, hence, allowing extraction of hierarchical sets of features from the data, and to their end-to-end training scheme allowing for simultaneous learning of the features from the raw input and predicting the task at hand. In this way, the heuristic feature design is removed. This is advantageous compared to the traditional machine learning methods (e.g., support vector machine (SVM) and random forest (RF)), which require a multistage feature engineering procedure. In deep learning, such a procedure is replaced with a simple end-to-end deep learning workflow. One of the key requirements for successful application of deep learning methods is a large amount of data available from which the model can automatically learn the representative features for the prediction task [15]. The availability of open satellite imagery, such as from Copernicus, offers just that.
The land cover mapping systems based solely on optical imagery suffer from issues with cloud cover and weather conditions, especially in tropical areas, and from a lack of illumination in the polar regions. Among the free satellite data offered by the Copernicus programme are synthetic aperture radar (SAR) images from the Sentinel-1 satellites. SAR is an active radar imaging technique that does not require illumination and is not hampered by cloud cover, because microwave radiation penetrates clouds. The utilisation of SAR imagery would hence allow mapping such challenging regions and increasing the mapping frequency in orchestrated efforts like CORINE. One of the significant issues previously was the absence of timely and consistent high-resolution wide-area SAR coverage. With the advent of the Copernicus Sentinel-1 satellites, operational use of imaging radar data becomes feasible for consistent wide-area mapping. The first Copernicus Sentinel-1 mission was launched in April 2014. Initially, Sentinel-1A alone was capable of providing C-band SAR data in up to four imaging modes with a revisit time of 12 days. Once Sentinel-1B was launched in 2016, the revisit time was reduced to 6 days [11].
We studied wide-area SAR-based land cover mapping by methodologically combining the two discussed recent advances: the improved methods for large-scale image processing using deep learning and the availability of SAR imagery from the Sentinel-1 satellites.
A. Land Cover Mapping with SAR Imagery
While optical satellite data is still the mainstream choice in land cover and land cover change mapping [5], [16]–[19], SAR data has been getting more attention as more suitable sensors appear. To date, several studies have investigated the suitability of SAR for land cover mapping, focusing primarily on L-band, C-band, and X-band polarimetric [20], [21], multitemporal, and multi-frequency SAR [22], [23], as well as on the combined use of SAR and optical data [24]–[28].
Independently of the imagery used, the majority of land cover mapping methods so far are based on traditional supervised classification techniques [29]. Widely used classifiers are support vector machines (SVM), decision trees, random forests (RF), and maximum likelihood classifiers (MLC) [7], [9], [29], [30]. However, extracting a large number of features needed for classification, i.e., the feature engineering process, is time intensive, and requires lots of expert work in developing and fine-tuning classification approaches. This limits the applications of the traditional supervised classification methods on a large scale.
Backscattered microwave radiation is composed of multiple fundamental scattering mechanisms determined by the vegetation water content, surface roughness, soil moisture, horizontal and vertical structure of the scatterers, as well as the imaging geometry during the datatake. Accordingly, a considerable number of classes can be differentiated in SAR images [20], [31]. However, the majority of SAR classification algorithms use fixed SAR observables (e.g., polarimetric features) to infer specific land cover classes, despite the large temporal, seasonal and environmental variability between different geographical sites. This leads to a lack of generalisation capability and a need to use extensive and representative reference data and SAR data. The latter means the need to account not only for all variation of SAR signatures for a specific class but also for seasonal effects, such as changes in the moisture of soil and vegetation, as well as the frozen state of land [32], which strongly affect SAR backscatter. On the other hand, when using multitemporal approaches, such seasonal variation can be used as an effective discriminator among different land cover classes.
When exclusively using SAR data for land cover mapping, reported accuracy often turns out to be relatively low for operational land cover mapping and change monitoring. Methodologically, reported solutions utilized supervised approaches, linking SAR observables and class labels to pixels, superpixels or objects in parametric or nonparametric manner [19]–[21], [31], [33]–[41].
However, a relatively large number of classes was tackled in only a few studies, often with relatively low reported accuracies. For instance, in [42] it was found that P-band PolSAR imagery was unsatisfactory for mapping more than five classes with the iterated conditional mode (ICM) contextual classifier applied to several polarimetric parameters. They achieved a Kappa value of 76.8% when mapping four classes. Classification performance of the L-band ALOS PALSAR and C-band RADARSAT-2 images was compared in the moist tropics [43]. L-band provided 72.2% classification accuracy for a coarse land cover classification system and C-band only 54.7%. In a similar study in Lao PDR, ALOS PALSAR data were found to be mostly useful as a back-up option to optical ALOS AVNIR data [19]. Multitemporal Radarsat-1 data with HH polarization and ENVISAT ASAR data with VV polarization (both C-band) were studied for classification of five land cover classes in Korea with moderate accuracy [44]. Waske et al. [30] applied boosted decision trees and random forests to multitemporal C-band SAR data, reaching accuracy up to 84%. Several studies [20], [21] specifically investigated SAR suitability for the boreal zone, with reported accuracy up to 83% depending on the classification technique (maximum likelihood, probabilistic neural networks, etc.) when five super-classes (based on CORINE data) were used.
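The accuracy and Kappa figures quoted above are typically derived from a confusion matrix. As a minimal sketch of that arithmetic, the following computes overall accuracy and Cohen's kappa; the 2-class matrix is a made-up illustration and is not taken from any of the cited studies.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    # Chance agreement from the row (reference) and column (predicted) marginals.
    p_expected = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n)) for i in range(n)
    ) / (total * total)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical confusion matrix (rows: reference class, columns: predicted class).
cm = [[40, 10],
      [5, 45]]
oa = overall_accuracy(cm)   # 85 of 100 samples are on the diagonal
kappa = cohen_kappa(cm)     # chance agreement here is 0.5
```

Kappa is lower than overall accuracy whenever part of the agreement can be explained by the class proportions alone, which is why the two figures are not directly comparable across studies.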
The potential of Sentinel-1 imagery for CORINE-type thematic mapping was assessed in a study that used Sentinel-1A data for mapping class composition in Thuringia [31]. Long time series of Sentinel-1 SAR data are considered especially suitable for crop type mapping [45]–[48], with an increasing number of studies attempting land cover mapping in general [49], [50].
Moreover, as Sentinel-1 data are presently the only SAR data routinely available for wide-area mapping at no cost to users, they seem the best candidate data for the development and testing of improved classification approaches. Previous studies indicate a necessity for developing and testing new methodological approaches that can be effectively applied on a large scale and deal with the variability of SAR observables concerning ecological land cover classes. We suggest adopting state-of-the-art deep learning approaches for this purpose.
B. Deep Learning in Remote Sensing
The advances in the deep learning techniques for computer vision, in particular, Convolutional Neural Networks (CNNs) [12], [51], have led to the application of deep learning in several domains that rely on computer vision. Examples are self-driving cars, image search engines, medical diagnostics, and augmented reality. Deep learning approaches are becoming extensively applied in the remote sensing domain, as well.
Zhu et al. [52] provide a discussion on the specificities of remote sensing imagery (compared to ordinary RGB images) that result in specific deep learning challenges in this area. For example, remote sensing data are georeferenced, often multi-modal, and acquired with particular imaging geometries; there are interpretation difficulties; and the ground-truth or labelled data needed for deep learning is still often lacking. Additionally, most of the state-of-the-art CNNs are developed for three-channel input images (i.e., RGB), and so certain adaptations are needed to apply them to remote sensing data [53].
Nevertheless, several research papers tackling remote sensing imagery with deep learning techniques were published in recent years. Zhang et al. [54] review the field and find applications to image preprocessing [55], target recognition [56], [57], classification [58]–[60], and semantic feature extraction and scene understanding [61]–[64]. The deep learning approaches are found to outperform standard methods applied up to several years ago, i.e., SVMs and RFs [65], [66].
When it comes to deep learning for land cover or land use mapping, applications have been limited to optical satellite [53], [59], [67] or aerial [68] imagery, and hyperspectral imagery [60], [67], owing to the similarity of these images to ordinary RGB images studied in computer vision [53].
When it comes to SAR images, Zhang et al. [54] found that there is already a significant success in applying deep learning techniques for object detection and scene understanding. However, for classification on SAR data, applications are scarce and advances are yet to be achieved [54]. Published research includes deep learning for crop type mapping using combined optical and SAR imagery [66], as well as the use of SAR images exclusively [69]. However, those methods applied deep learning only to some part of the task at hand and not in an end-to-end fashion. Wang et al. [59], for instance, used deep neural networks only for merging over-segmented elements, which are produced using traditional segmentation approaches. Similarly, Tuia et al. [60] applied deep learning to extract hierarchical features, which they further fed into a multiclass logistic classifier. Duan et al. [69] first used unsupervised deep learning and then continued with a couple of supervised labelling tasks. Chen et al. [67] applied a deep learning technique (stacked autoencoders) to discover the features, but then they still used traditional machine learning (SVM, logistic regression) for the image segmentation. Unlike those methods, we applied deep learning in an end-to-end fashion, i.e., from supervised feature extraction to the land class prediction. This makes our approach more flexible, robust, and adaptable to SAR data from new regions, as well as more efficient.
When it comes to the end-to-end approaches for SAR classification, there are several studies where the focus was on a small area and on a specific land cover mapping task. For instance, Mohammadimanesh et al. [70] used fully polarimetric SAR (PolSAR) imagery from RADARSAT-2 to classify wetland complexes, for which they developed a specifically tailored semantic segmentation model. However, the authors tackled a small test area (around ) and did not explore how their model generalizes to other types of areas. Similarly, Wang et al. [71] adapted existing CNN models into a fixed-feature-size CNN that they evaluated on small-scale RADARSAT-2 or AIRSAR (i.e., airborne SAR) data. In both cases, the authors used more advanced fully polarimetric SAR imagery at a better resolution than that of Sentinel-1, which means imagery that provides more input information to the deep learning models. Importantly, it is only Sentinel-1 that offers open operational data with a repeat cycle of up to every 6 days. Because of this, the discussed approaches, developed and tested specifically for PolSAR imagery at a higher resolution, cannot yet be considered applicable to wide-area mapping. Similarly, Ahishali et al. [72] applied end-to-end approaches to SAR data. They have also worked with single polarized COSMO-SkyMed imagery. However, all the imagery they considered was X-band SAR, contrary to the C-band imagery we use here, and again only on a small scale. The authors proposed a compact CNN model that they found had outperformed some of the off-the-shelf CNN methods, such as Xception and Inception-ResNet-v2. It is important to note that, compared to those, the off-the-shelf models that we consider here are more sophisticated semantic segmentation models, some of which employ Xception or ResNet but only as a module in their feature extraction parts.
In summary, the capabilities of the deep learning approaches for the classification have been investigated to a lesser extent for SAR imagery than for optical imagery. The attempts to use SAR data for land cover classification were relatively limited in scope, area, or the number of used SAR scenes. Particularly, wide-area land cover mapping was never addressed. The reasons for this include comparatively poor availability of SAR data compared to optical (greatly changed since the advent of Sentinel-1), complex scattering mechanisms leading to ambiguous SAR signatures for different classes (which makes SAR image segmentation more difficult than the optical image segmentation [73]), as well as the speckle noise caused by the coherent nature of the SAR imaging process.
C. Study goals
The present study addresses the identified research gap of a lack of wide-area land cover mapping using SAR data. We achieve this by training, fine-tuning, and evaluating a set of suitable state-of-the-art deep learning models from the class of semantic segmentation models, and demonstrating their suitability for land cover mapping. Moreover, our work is the first to examine and demonstrate the suitability of deep learning for land cover mapping from SAR images on a large scale, i.e., across the whole country.
Specifically, we applied the semantic segmentation models to SAR images taken over Finland. We focused on the images of Finland because a land cover mask of a suitable resolution exists that can be used for training labels (i.e., CORINE). The training is performed with the seven selected models (SegNet [74], PSPNet [75], BiSeNet [76], DeepLabV3+ [77], [78], U-Net [79], [80], FRRN-B [81], and FC-DenseNet [82]), which have encoder modules pre-trained on the large RGB image corpus ImageNet 2012. Those models are freely available. In other words, we reused semantic segmentation architectures developed for natural images with weights pre-trained on RGB images, and we fine-tuned them on the SAR images. Our results (with over 90% overall accuracy) demonstrate the effectiveness of the deep learning methods for land cover mapping with SAR data.
In addition to having the high-resolution CORINE map that can serve as ground truth (labels) for training the deep learning models, another reason that we selected Finland is that it is a northern country with frequent cloud cover, which means that using optical imagery for wide-area mapping is often not feasible. Hence, demonstrating the usability of radar imagery for land cover mapping is particularly useful here. Even though Finland is a relatively small country, there is still considerable heterogeneity present in terms of land cover types and how they appear in the SAR images. Namely, SAR backscattering is sensitive to several factors that likely differ between countries or between distant areas within a country. Examples of such factors are moisture levels, terrain variation and soil roughness, predominant forest biome and tree species proportions, types of shorter vegetation and crops in agricultural areas, and specific types of built environments. We did not confine our study to a particular area of Finland where the SAR signatures might be consistent; instead, we obtained the images across a wide area. Hence, demonstrating the suitability of our methods in this setting hints at their potential generalizability. Namely, it means that, similarly to what we did here, the semantic segmentation models can be fine-tuned and adapted to work on data from other regions or countries with different SAR signatures. On the other hand, we took into account that the same areas will appear somewhat different in the SAR images across different seasons. Scattering characteristics of many land cover classes change considerably between the summer and winter months, and sometimes even within weeks during seasonal changes [20], [83]. These changes include snow cover and melting, freeze/thaw of soils, ice on rivers and lakes, the crop growing cycle, and leaf-on and leaf-off conditions in deciduous trees. Because of this, in the present study, we focused only on the scenes acquired during the summer season. However, we did allow our training dataset to contain several images of the same area, taken at different times during the summer season. This way, not only spatial but also temporal variation of SAR signatures is introduced.
Our contributions can be summarised as follows:
C1: We thoroughly benchmarked seven selected state-of-the-art semantic segmentation models covering a diverse set of approaches for land cover mapping using Sentinel-1 SAR imagery. We provided insights on the best models in terms of both accuracy and efficiency.
C2: Our results demonstrated the power of deep learning models along with SAR imagery for accurate wide-area land cover mapping in the cloud-obscured boreal zone and polar regions. These results can serve as baselines when developing new, specialized approaches to SAR imagery.
As with other representation learning models, the power of deep learning models comes from their ability to learn rich features (representations) from the dataset automatically [15]. The automatically learned features are usually better suited for the classifier or other task at hand than hand-engineered features. Moreover, thanks to the large number of layers employed, deep learning networks have been shown to discover hierarchical representations, so that the higher-level representations are expressed in terms of the lower-level, simpler ones. For example, in the case of images, the low-level representations that can be discovered are edges; using them, mid-level representations such as corners and shapes can be expressed, and these in turn help to express the high-level representations, such as object elements and their identities [15].
The deep learning models in computer vision can be grouped according to their main task into three categories. In Table I, we provide a description of those categories. However, the deep learning terminology for those tasks does not always correspond well to the terminology used in the remote sensing community. Relevant to our task, a number of remote sensing studies use the term classification in the context of land cover mapping, inherently meaning pixel- or region-based classification, which in the deep learning terminology corresponds to semantic segmentation. In Table I, we list the corresponding terminology that we encountered being used for each task in both the deep learning and remote sensing communities. This helps to disambiguate when the two domains refer to different tasks and to recognize when they refer to the same one. In the present study, the focus is on land cover mapping. Hence, we tackle semantic segmentation in the deep learning terminology and image classification, i.e., pixel-wise classification, in the remote sensing terminology. Another terminology issue that often arises concerns the dataset types used. The dataset that is held out from the training set and used to give an estimate of the model’s performance during the training phase is referred to as a development dataset or validation dataset in the deep learning context. From the remote sensing viewpoint, both training and development/validation datasets belong to training phase data. Further, the term validation data in the remote sensing context is typically reserved for the datasets used during the final evaluation (accuracy assessment) on completely independent data not involved in the training phase, i.e., what is called a test dataset in deep learning. Hence, to avoid any confusion, we will avoid using the term validation in the text, referring to the respective datasets as training, development, and test (accuracy assessment) data.
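The three dataset roles described above can be illustrated with a minimal sketch of a random split. The split fractions, the seed, and the image identifiers are hypothetical and do not reflect the proportions used in this study.

```python
import random


def split_dataset(items, dev_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle items and split them into training, development, and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n_test = int(len(items) * test_fraction)
    n_dev = int(len(items) * dev_fraction)
    test = items[:n_test]                # held out for final accuracy assessment
    dev = items[n_test:n_test + n_dev]   # monitors performance during training
    train = items[n_test + n_dev:]       # used to fit the model weights
    return train, dev, test


# Hypothetical image identifiers.
image_ids = [f"image_{i:03d}" for i in range(10)]
train, dev, test = split_dataset(image_ids)
```

The key property is that the three subsets are disjoint, so the test (accuracy assessment) data never influences training or model selection.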
TABLE I: Terminology for the main tasks in computer vision and its use in the deep learning versus remote sensing communities.

Convolutional Neural Networks (CNNs) [12], [13] are the deep learning models that have transformed the computer vision field. CNNs were initially designed to tackle the image classification (deep learning terminology) task. Their structure is inspired by the visual perception of mammals [85]. CNNs are named after one of their most important operations, which distinguishes them from other neural networks, i.e., the convolution. Mathematically, a convolution is an operation that combines two functions to produce a third. A convolution is applied to an image by sliding a filter (kernel) of a given size, which is usually small compared to the original image size, across the image. Filters can serve different purposes; for example, a filter can serve as a vertical edge detector. Application of such a convolution operation to an image results in a feature map. Another common operation that is usually applied after a convolution is pooling. Pooling reduces the size of the feature map while providing robustness to the extracted features. Common CNNs end with a fully connected layer, which is used for final predictions, commonly for image classification. By employing a large number of convolutional layers (depth), CNNs are able to extract gradually more complex and abstract features. The first CNN model to demonstrate its impressive effectiveness in image classification (of handwritten digits) was LeNet [12]. Several years later, Krizhevsky et al. [13] developed AlexNet, the deep CNN that dramatically pushed the limits of classification accuracy on the famous ImageNet computer vision challenge [86]. Since then, a variety of CNN-based models have been proposed. Some notable examples are the VGG network [14], ResNet [87], DenseNet [88], and Inception V3 [89]. The effectiveness of CNNs has also been proven in various real-world applications [90], [91].
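The convolution and pooling operations described above can be sketched in a few lines of plain Python. This is only an illustration on a toy image with a hand-designed vertical edge filter; in a CNN the filter weights are learned, and the function names here are our own.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is what most CNN libraries compute under the name convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 6x6 image: bright left half, dark right half (vertical edge in the middle).
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
# Hand-designed vertical edge detector.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

feature_map = conv2d(image, vertical_edge)  # 4x4 map, each row is [0, 3, 3, 0]
pooled = max_pool2d(feature_map)            # 2x2 map, [[3, 3], [3, 3]]
```

The feature map responds only where the window straddles the edge, and pooling halves its size while keeping the strongest response in each 2x2 block.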
Once CNNs had proven their effectiveness in classifying images, Long et al. [84] were the first to show how a given CNN model can be augmented to make it suitable for the semantic segmentation task: they proposed the Fully Convolutional Neural Network (FCN) framework. This generic architecture can be used to adapt any CNN used for classification into a segmentation model. Namely, the authors have shown that by replacing the last, fully connected layer with appropriate convolutional layers that upsample and restore the resolution of the input at the output layer, CNNs can be transformed to classify each individual pixel (instead of the whole image). The basic idea is as follows. The encoder is used to learn the feature maps, and is usually based on a pre-trained deep CNN for classification, such as ResNet, VGG, or Inception. The decoder part serves to upsample the discriminative features that the encoder has learned from the coarse-level feature map to the fine, pixel level. Long et al. [84] have shown that this upsampling (backward) computation can be efficiently performed using backward convolutions (deconvolutions). Moreover, this means that the specific CNN models, such as those mentioned above, can all be incorporated in the FCN framework for segmentation, giving rise to FCN-AlexNet [84], FCN-ResNet [87], FCN-VGG16 [84], FCN-DenseNet [82], etc. We present a diagram of the generic FCN architecture in Figure 1.
Fig. 1: The architecture of Fully Convolutional Neural Networks (FCNs) [84]
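To make the decoder idea concrete, the following is a minimal sketch of upsampling coarse per-class score maps with a transposed (backward) convolution and then labelling each pixel with its highest-scoring class. The score values and the fixed kernel are hypothetical; in an actual FCN the upsampling weights are learned and the scores come from the encoder.

```python
def transposed_conv2d(fmap, kernel, stride):
    """Transposed convolution: every input value stamps a scaled copy of the
    kernel into the (larger) output, with overlapping stamps summed."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(fmap) - 1) * stride + kh
    out_w = (len(fmap[0]) - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for r, row in enumerate(fmap):
        for c, value in enumerate(row):
            for i in range(kh):
                for j in range(kw):
                    out[r * stride + i][c * stride + j] += value * kernel[i][j]
    return out


def argmax_per_pixel(score_maps):
    """For every pixel, return the index of the class with the highest score."""
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [
        [max(range(len(score_maps)), key=lambda k: score_maps[k][r][c])
         for c in range(w)]
        for r in range(h)
    ]


# Hypothetical coarse 2x2 score maps for two classes.
coarse_scores = [
    [[0.9, 0.2], [0.8, 0.1]],   # class 0
    [[0.1, 0.8], [0.2, 0.9]],   # class 1
]
# A 2x2 kernel of ones with stride 2 reproduces nearest-neighbour upsampling.
kernel = [[1.0, 1.0], [1.0, 1.0]]
upsampled = [transposed_conv2d(m, kernel, stride=2) for m in coarse_scores]
label_map = argmax_per_pixel(upsampled)   # 4x4 map of class indices
```

The output has one class label per pixel at the restored 4x4 resolution, which is the difference between semantic segmentation and whole-image classification.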
Here, we first describe the study site, SAR, and reference data. This is followed by an in-depth description of the deep learning terminology and the models used in the study. We finish with the description of the experimental setup and the evaluation metrics.
A. Study site
Our study site covers the territory of Finland located to the south of 66.0° latitude, that is, effectively the whole country without Finnish Lapland. The study area is shown in Figure 2. Southern Finland is primarily covered by boreal forests with lakes, marshes, open bogs, agricultural areas and urban settlements. We have omitted Lapland due to its considerably different land cover composition and topography compared to the rest of the country. The terrain height variation within the study area is moderate and mostly within meters range.
B. SAR data
Presently, Sentinel-1 is a C-band SAR dual-satellite system with two satellites orbiting apart [11], launched in 2014 and 2016, respectively. The operational acquisition modes are Stripmap (SM), Interferometric Wide-Swath (IW), Extra Wide Swath (EW), and Wave Mode (WV). The IW mode is the default mode over land, providing a 250 km wide swath composed of three sub-swaths, with single-look images at 5 m by 20 m spatial resolution. It uses the so-called TOPS (Terrain Observation with Progressive Scan) SAR mode.
The SAR data acquired by Sentinel-1 satellites in IW mode are used in our study. Specifically, we used only Sentinel-1A imagery acquired during the summer 2018.
Original scenes were downloaded as Level-1 Ground Range Detected (GRD) products. They represent focused SAR data that has been detected, multi-looked and projected to ground range using an Earth ellipsoid. The images were orthorectified using in-house software of the Technical Research Centre of Finland (VTT), employing the local digital terrain model (with 20 m resolution) available from the National Land Survey of Finland. The pixel spacing of the orthorectified scenes was set to 20 m. Orthorectification included terrain flattening to obtain the backscatter signal in gamma-nought format [92]. The scenes were further re-projected to the ETRS89 / ETRS-TM35FIN projection (EPSG:3067) and resampled to a final pixel size of 20 metres.
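For readers unfamiliar with the backscatter conventions, the sketch below shows the simple incidence-angle relation between sigma-nought and gamma-nought, together with the conversion to decibels. This is a simplification for illustration only: the terrain flattening used in this study [92] is terrain-model based, and this is not the VTT in-house processing. The input values are hypothetical and chosen for easy arithmetic.

```python
import math


def sigma0_to_gamma0(sigma0, incidence_angle_deg):
    """Simplified normalisation: gamma0 = sigma0 / cos(incidence angle)."""
    return sigma0 / math.cos(math.radians(incidence_angle_deg))


def to_decibels(value):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(value)


# Hypothetical linear sigma-nought value and incidence angle.
gamma0 = sigma0_to_gamma0(0.05, 60.0)   # approximately 0.1 in linear units
gamma0_db = to_decibels(gamma0)         # approximately -10 dB
```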
The Sentinel-1 images were mosaicked into seven homogeneous SAR mosaics covering the whole of Finland. Each mosaic was compiled from approximately 90 Sentinel-1 IW scenes (both ascending and descending paths), and it takes about 12 days to collect enough imagery to have the whole country covered. Altogether, seven SAR mosaics were produced during the summer of 2018. These SAR mosaics are further used for sampling the training, development, and testing images that are input to the deep learning models, as described in detail in Section III-F.
The geographical coverage of each SAR mosaic is shown in Figure 2.
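Sampling model inputs from a large mosaic is commonly done by cutting it into smaller image tiles. The following is a minimal sketch of such tiling on a toy array; the tile size and the choice to skip partial edge tiles are assumptions for illustration, and the sampling actually used is the one described in Section III-F.

```python
def tile_image(image, tile_size):
    """Cut a 2D raster into non-overlapping square tiles.
    Partial tiles at the right and bottom edges are skipped."""
    n_rows, n_cols = len(image), len(image[0])
    tiles = []
    for top in range(0, n_rows - tile_size + 1, tile_size):
        for left in range(0, n_cols - tile_size + 1, tile_size):
            tile = [row[left:left + tile_size]
                    for row in image[top:top + tile_size]]
            tiles.append(((top, left), tile))   # keep the tile's pixel offset
    return tiles


# Toy 5x7 "mosaic" whose pixel values encode their own row and column.
mosaic = [[r * 10 + c for c in range(7)] for r in range(5)]
tiles = tile_image(mosaic, tile_size=2)   # 2 rows x 3 columns of tiles
```

Keeping the pixel offset with each tile makes it possible to cut the matching label tile from the reference raster at the same position.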
C. Reference data
In Finland, the Finnish Environment Institute (SYKE) is responsible for the production of the CORINE maps. While for most of the EU territory the CORINE mask of spatial resolution is available, the national institutions might choose to create more precise maps, and SYKE, in particular, has produced a spatial resolution mask for Finland (Figure 3), with the first one in 2000. Since then, updates have been produced regularly, the latest being CLC2018, which corresponds well to the acquisition timing of our SAR data. There are 48 different land use classes in the map that can be hierarchically grouped into 4 CLC Levels. In detail, there are 30 classes on CLC Level-3, 15 classes on CLC Level-2, and 5 top CLC Level-1 classes. According to the information provided by SYKE for CLC2012, the accuracy of CLC Level-3 was 61%, of CLC Level-2 83%, and of CLC Level-1 93%. In this study, we use its updated and revised version, CLC2018, which showed good results in both internal and external quality control. The selected classes and their corresponding color codes used for our segmentation results are shown in Table II. Our superclasses generally correspond to the CLC Level-1 classes, with minor corrections for the “artificial surfaces” class, which is not fully included in the urban class: some of its elements are distributed to other classes. Most notably, green urban areas were included in the forest class in our study, as those are essentially parks and mixed boreal forestland enclosed within urban-designated areas.
Until the most recent CORINE production round, EEA member countries adopted national approaches for the production of CORINE. EEA Technical Guidelines include manual digitalization of land cover change based on visual interpretation of optical satellite imagery. In Finland, the European CLC was not applicable for the majority of national users due to large minimal mapping unit (MMU). Thus national version was produced with somewhat modified nomenclature
Fig. 2: Study area in Finland: (a) reference CORINE land cover data; (b) example of compiled Sentinel-1 SAR mosaic that includes the whole country.
TABLE II: Description of CORINE based land cover classes and their map color codes
Class | Color code (R,G,B) | Description
agriculture | 222,184,135 | agricultural and agro-forestry areas, fruit trees and berry plantations, pastures
forested areas | 127,255,0 | broad-leaved, coniferous and mixed forest, transitional woodland/shrub
peatland | 173,216,230 | peatland, bogs, inland marshes and salt marshes
water bodies | 0,191,255 | rivers, lakes, sea
of classes [93], [94]. The national high-resolution CLC2018 data is in raster format of 20 m, with corresponding MMU. In producing the 2018 update of CLC, obtaining optical imagery over Scandinavia and Britain was particularly challenging because of frequent cloud cover, which called for the use of radar imagery to meet user requirements on accuracy and coverage [31]. The CORINE map itself is normally built from high-resolution satellite images acquired primarily during the summer and, to a smaller extent, during the spring months [2].
D. Semantic Segmentation Models
We selected the following seven state-of-the-art [95] semantic segmentation models to test on our land cover mapping task: SegNet [74], PSPNet [75], BiSeNet [76], DeepLabV3+ [77], [78], U-Net [79], [80], FRRN-B [81], and FC-DenseNet [82]. The models were selected to cover a wide set of approaches to semantic segmentation. In the following, we describe the specific architecture of each of these DL models. We use the following common abbreviations: conv for the convolution operation, concat for concatenation, max pool for the max pooling operation, BN for batch normalisation, and ReLU for the rectified linear unit activation function.
1) BiSeNet (Bilateral Segmentation Network): BiSeNet model is designed to decouple the functions of encoding additional spatial information and enlarging the receptive field, which are fundamental to achieving good segmentation performance. As can be seen in Figure 4, there are two main components to this model: Spatial Path (SP) and Context Path (CP). Spatial Path serves to encode rich spatial information. Context Path serves to provide sufficient receptive field and uses global average pooling and pre-trained Xception [96] or ResNet [87] as the backbone. The goal of the creators was not only to obtain superior performance but to achieve a balance between the speed and performance. Hence, BiSeNet is a relatively fast semantic segmentation model.
2) SegNet (Encoder-Decoder-Skip): Similarly to BiSeNet, SegNet is also designed with computational performance in mind, this time, particularly during inference. Because of this, the network has a significantly smaller number of trainable parameters compared to most of the other architectures. The encoder in SegNet is based on VGG16: it consists of its first 13 convolutional layers, while the fully connected layers
Fig. 3: Zoomed in area fragment with our reference data, i.e., CORINE shown on top (left) along with the Google Earth layer (right).
Fig. 4: The architecture of BiSeNet. ARM stands for the Attention Refinement Module and FFM for the Feature Fusion Module introduced in the model’s paper [76].
Fig. 5: The architecture of SegNet-based Encoder-Decoder with Skip connections [74]. Blue tiles represent Convolution + Batch Normalisation + ReLU, green tiles represent Pooling, red – Upsampling, and yellow – a softmax operation.
are omitted. Hence, the novelty of this network lies in its decoder part, as follows. The decoder consists of one decoder layer for each encoder layer and so it also has 13 layers. Each individual decoder layer utilizes max-pooling indices memorized from its corresponding encoder feature map. The authors have shown that this enhances boundary delineation between classes. Finally, the decoder output is sent to a multiclass soft-max function yielding classification for each pixel (see Figure 5).
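The role of the memorized max-pooling indices can be illustrated with a minimal one-dimensional sketch. It is not the SegNet implementation: the function names are hypothetical, the window size of 2 is an assumption, and a real decoder works on 2-D feature maps and follows each unpooling step with trainable convolutions.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping 1-D max pooling that also records where each maximum was."""
    pooled, indices = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=lambda k: window[k])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, length):
    """Put each pooled value back at its remembered position; zeros elsewhere."""
    out = [0.0] * length
    for value, idx in zip(pooled, indices):
        out[idx] = value
    return out


signal = [0.1, 0.9, 0.4, 0.3, 0.7, 0.2]
pooled, indices = max_pool_with_indices(signal)      # encoder side
restored = max_unpool(pooled, indices, len(signal))  # decoder side
```

Because the decoder restores each maximum at its original position rather than interpolating, sharp transitions in the input keep their location in the upsampled map.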
3) Mobile U-Net: Mobile U-Net is based on the U-Net [97] semantic segmentation architecture shown in Figure 6. U-Net generally follows the fully convolutional approach, with the following modification: the upsampling part of the architecture has no fully connected layers and is nearly symmetrical to the feature extraction part, owing to the use of similar feature maps. This results in a u-shaped architecture (see Figure 6), hence the name of the model. While originally developed for biomedical images, the U-Net architecture has proven successful for image segmentation in other domains as well. Here, we somewhat modify the U-Net architecture according to the MobileNets [80] framework to improve its efficiency. In particular, the MobileNets framework uses depthwise separable convolutions, a form which factorizes a standard convolution (e.g., $3 \times 3$) into a depthwise convolution (applied separately to each input band) and a pointwise ($1 \times 1$) convolution that combines the outputs of the depthwise convolution.
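The efficiency gain of this factorization can be seen from a simple weight count. The sketch below is illustrative only: the kernel size of 3 and the channel counts (64 in, 128 out) are assumed values, not taken from our models, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a pointwise 1 x 1 one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Assumed example sizes.
k, c_in, c_out = 3, 64, 128
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
reduction = separable / standard   # equals 1 / c_out + 1 / k**2
```

For these assumed sizes, the separable form needs roughly 12% of the weights of the standard convolution.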
Fig. 6: The architecture of U-Net [97]
4) DeepLab-V3+: DeepLab-V3+ [77] is an improved version of DeepLab-V3 [98], which is in turn an improved version of the original DeepLab [78] model. This segmentation model does not follow the FCN framework like the previously discussed models. The main features that distinguish the DeepLab model from FCNs are the atrous convolutions for upsampling and the application of probabilistic machine learning models, concretely conditional random fields (CRFs), for finer localization accuracy in the final fully connected layer. Atrous convolutions, in particular, make it possible to enlarge the context from which the next-layer feature maps are learned while preserving the number of parameters (and, thus, the same efficiency). Using a chain of atrous convolutions makes it possible to compute the final output layer of a CNN at an arbitrarily high resolution (removing the need for the upsampling part used in FCNs). In the follow-up work proposing DeepLab-V3, Chen et al. [98] change the approach to atrous convolutions by gradually doubling the atrous rates, and show that with an adapted version, their new algorithm outperforms the previous one even without the fully connected CRF layer. Finally, in their newest adaptation of the model, called DeepLab-V3+, Chen et al. [77] turn to an approach similar to that of FCNs, i.e., they add a decoder module to the architecture (see Figure 7). That is, they employ the features extracted by the DeepLab-V3 module in the encoder part, and add a decoder module consisting of $1 \times 1$ and $3 \times 3$ convolutions.
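How an atrous convolution widens the context without adding parameters can be shown with a minimal one-dimensional sketch. The function names and the example filter are hypothetical; the real models apply the same idea to 2-D feature maps with learned weights.

```python
def effective_kernel_size(k, rate):
    """Span covered by a k-tap filter whose taps are `rate` samples apart."""
    return k + (k - 1) * (rate - 1)


def atrous_conv1d(x, weights, rate):
    """'Valid' 1-D atrous (dilated) convolution, written as a cross-correlation."""
    span = (len(weights) - 1) * rate
    return [
        sum(w * x[i + j * rate] for j, w in enumerate(weights))
        for i in range(len(x) - span)
    ]


x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]                          # the same three weights in both cases
dense = atrous_conv1d(x, w, rate=1)     # each output sees 3 neighbouring samples
dilated = atrous_conv1d(x, w, rate=2)   # each output sees a span of 5 samples
```

With rate 2, the same three weights cover a span of five samples, so the receptive field grows while the parameter count stays unchanged.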
Fig. 7: The architecture of DeepLabV3+ [77]
5) FRRN-B (Full-Resolution Residual Networks): As we have seen, most semantic segmentation architectures are based on some form of FCN, and so they utilize existing classification networks, such as ResNet or VGG16, as encoders. We also discussed the main reason for such approaches, which is to take advantage of the weights learned by those architectures when pretrained for the classification task. Nevertheless, one disadvantage of the FCN approach is that the resulting outputs of the encoder part (particularly after the pooling operations) are at a lower resolution, which deteriorates the localization performance of the overall segmentation model. Pohlen et al. [81] proposed to tackle this by having two parallel network streams process the input image: a pooling stream and a residual stream (Figure 8). As the
Fig. 8: The architecture of FRRN-B. RU n and FRRU n stand for residual units and full-resolution residual units with n-channel convolutions, respectively. FRRUs simultaneously operate on the two streams [81].
name says, the pooling stream performs successive pooling and then unpooling operations, and it serves to obtain good recognition of the objects and classes. The residual stream computes residuals at the full image resolution, which allows low-level features, i.e., pixel-level object locations, to be propagated to the network output. The name of the model comes from its building blocks, i.e., full-resolution residual units. Each such unit simultaneously operates on the pooling and the residual stream. In the original paper [81], the authors propose two alternative architectures, FRRN-A and FRRN-B, and show that FRRN-B achieves superior performance on the Cityscapes benchmark dataset. Hence, we employ the FRRN-B architecture.
Fig. 9: The architecture of PSPNet [75]
6) PSPNet (Pyramid Scene Parsing Network): Zhao et al. [75] propose Pyramid Scene Parsing as a solution to the problem of making predictions based on the local context only, without considering the global image scene. In remote sensing, an example of this problem would be a model that wrongly predicts water with waves as the dry vegetation class because the two appear similar and the model did not consider that these pixels are part of a larger water surface, i.e., it missed the global context. Similarly to the other FCN-based approaches, PSPNet uses a pre-trained classification architecture, in this case ResNet, to extract the feature map. The main module of this network is the pyramid pooling module, which is enclosed by a square in Figure 9. As can be seen in the figure, this module fuses features at four scales, from coarse (red) to fine (green). Hence, the output of each level in the pyramid pooling module contains a feature map of a different resolution. In the end, the different features are stacked together, yielding the final pyramid pooling global feature used for predictions.
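The pooling step of such a pyramid can be sketched on a single-channel feature map as follows. The bin sizes (1, 2, 3, 6) are an assumption made for illustration, as are the function names; the complete module additionally applies convolutions to each level, upsamples the levels to a common size and concatenates them with the input feature map.

```python
def adaptive_avg_pool(feature_map, bins):
    """Average a 2-D map (list of rows) into a bins x bins grid of regions."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for bi in range(bins):
        r0, r1 = bi * h // bins, (bi + 1) * h // bins
        row = []
        for bj in range(bins):
            c0, c1 = bj * w // bins, (bj + 1) * w // bins
            cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Pool the same map at several scales, from coarse (1 x 1) to fine."""
    return {b: adaptive_avg_pool(feature_map, b) for b in bin_sizes}


# Toy 6 x 6 feature map whose values are 0..35 in row-major order.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
pyramid = pyramid_pool(fmap)
```

The coarsest level summarizes the whole map in a single value (the global context), while the finest level keeps local detail.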
7) FC-DenseNet (Fully Convolutional DenseNets): This semantic segmentation algorithm is built using DenseNet CNN [88] as a basis for the encoder, followed by applying the FCN approach [82]. The specificity of the DenseNet architecture is the presence of blocks where each layer is connected to all other layers in a feed-forward manner. Figure 10 shows the architecture of FC-DenseNet where the blocks are represented by the Dense Block units. According to [88], such architecture scales well to hundreds of layers without any optimization issues, while yielding excellent results in classification tasks. In order to efficiently upsample the DenseNet feature maps, Jegou et al. [82] substitute the upsampling convolutions of FCNs by Dense Blocks and Transitions Up. The Transition Up modules consist of transposed convolutions, which are then concatenated with the outputs from the input skip connection (the dashed lines in Figure 10).
Fig. 10: The architecture of FC-DenseNet [82]
E. Training approach
To accomplish better segmentation performance, there is an option to pre-train the semantic segmentation models (in particular, their encoder modules) using a larger set of available images of another type (such as natural images). Using the model pre-trained with natural images to continue training with the limited set of SAR images, the knowledge becomes effectively transferred from the natural to the SAR task [99]. To accomplish such transfer, we used the models whose encoders were pre-trained for the ImageNet classification task and fine-tuned them using our SAR dataset (described next).
F. Experimental Setup
In this section, we first describe how we prepared the SAR images for training with the deep learning models that are originally designed for natural images, and then we provide the details of our models’ implementation and the hardware setup used.
1) SAR Data Preprocessing: Sentinel-1 imagery comes in two polarization channels (VH and VV), each of them being particularly informative about certain types of land cover. Hence, using their combination is expected to yield better land cover mapping results than using any of them independently. Moreover, previous works suggested benefits of employing DEM in land cover mapping [9], so we experimented also with topographic DEM from the National Land Survey. To assess the marginal utility of adding the DEM layer as compared to using solely SAR data, we prepared two training datasets: one with SAR data only, and another one with a DEM layer.
The backscatter amplitude for both polarizations (VH and VV) represented first two channels for both datasets. As the third channel, after some preliminary tests, we decided to include a VH-to-VV amplitude ratio (also known as cross-pol ratio). This dataset was called RGB SAR Ratio. For the second dataset, a DEM layer was used as the third channel, thus this dataset was called RGB SAR DEM. In addition, for the deep learning models, each band should be normalized so that the distribution of the pixel values would resemble a Gaussian distribution centered at zero. This is done to yield a faster convergence during the training. Hence, each channel was normalized by channel-specific calibration factor using percentile contrast stretching, with no more than 1% of pixel values clipped.
The naming of the two datasets comes from the process used to create the images in them. Namely, VH-pol data of a Sentinel-1 image is assigned to R and VV-pol to G channel. For the third, B channel, in each of the datasets we used either cross-pol ratio of Sentinel-1 data or the DEM layer, respectively. Given that the semantic segmentation models expect RGB pixel values in the range (0,255), we scaled the normalized channel values for both datasets to this range.
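The channel assignment and scaling described above can be sketched as follows. This is a simplified illustration rather than our processing chain: the function names are hypothetical, the 1% clipping budget is assumed to be split evenly between the two tails, and the input values are assumed to be positive linear amplitudes given as flat lists.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def stretch_to_byte(channel, clip_percent=1.0):
    """Percentile contrast stretch of one channel to integers in [0, 255]."""
    ordered = sorted(channel)
    low = percentile(ordered, clip_percent / 2)          # assumed: half per tail
    high = percentile(ordered, 100 - clip_percent / 2)
    scale = (high - low) or 1.0                          # guard constant channels
    return [round(255 * (min(max(v, low), high) - low) / scale) for v in channel]


def make_rgb(vh, vv, third=None):
    """R = VH, G = VV, B = VH/VV ratio, or a DEM layer if `third` is given."""
    if third is None:
        third = [a / b for a, b in zip(vh, vv)]          # cross-pol ratio
    channels = [stretch_to_byte(c) for c in (vh, vv, third)]
    return list(zip(*channels))                          # one (R, G, B) per pixel


# Toy amplitudes for four pixels (made-up numbers).
vh = [0.02, 0.05, 0.10, 0.04]
vv = [0.10, 0.20, 0.40, 0.08]
rgb_ratio = make_rgb(vh, vv)
rgb_dem = make_rgb(vh, vv, third=[10, 120, 300, 45])
```

Each channel is stretched independently, so the ratio or DEM layer occupies the same 0 to 255 range as the two backscatter channels.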
2) Train/Development and Test (Accuracy Assessment) Dataset: The original images needed to be split into partial images (referred to below as imagelets) used for model training and testing. Thus, each imagelet represented an area of roughly
. The first reason for this preprocessing is the square shape: some of the selected models require square input images. Some of the other models are flexible with respect to image shape and size, but we wanted the setup to be the same for all models so that their results are comparable. The second reason is computational capacity: with our hardware setup (described below), this was the largest image size that we could work with.
Given the geography of Finland, for representative training data, it is useful to include imagelets from the whole country (including the large cities) aside from the Finnish Lapland, where the land classes are distinctly different. On the other hand, some noticeable differences are found also in the gradient from east to west of the country. Hence, to
TABLE III: The properties of the examined semantic segmentation architectures
achieve a representative training dataset, we selected all imagelets between the longitudes of 25° and 29° for the accuracy assessment (so-called “unobserved data” for model testing), and used all the other imagelets for model training (that is, training/validation in computer vision terminology). In this way, we prevented the situation in which two images of the same area acquired at different times were used one for training and the other for testing. In other words, we kept our training/development and test sets completely independent from each other.
The areas for training/development and model testing are shown in Figure 11. From each of the seven SAR mosaics, 1000 imagelets were generated using random sampling, while controlling for no spatial overlap between the imagelets. Among those 1000 imagelets, 400 were sampled from the testing area and set aside for the accuracy assessment, while the remaining 600 were sampled from the training/development area. The procedure resulted in 4200 images in the training and development set and 2800 images in the test (accuracy assessment) set. Finally, we used 60% from the training/development set for training and the rest for the development of the deep learning models.
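The bookkeeping of this split can be sketched as below. The dictionary keys, the function names and the use of a single longitude value per imagelet are assumptions made for illustration; the synthetic longitudes (22.0 and 27.0) merely stand in for the two areas.

```python
import random

TEST_LON_MIN, TEST_LON_MAX = 25.0, 29.0   # longitude band reserved for testing


def in_test_band(lon):
    return TEST_LON_MIN <= lon <= TEST_LON_MAX


def split_imagelets(imagelets, train_fraction=0.6, seed=0):
    """Spatial test split by longitude, then a random train/development split."""
    test = [im for im in imagelets if in_test_band(im["lon"])]
    train_dev = [im for im in imagelets if not in_test_band(im["lon"])]
    random.Random(seed).shuffle(train_dev)
    n_train = round(train_fraction * len(train_dev))
    return train_dev[:n_train], train_dev[n_train:], test


# Synthetic stand-in: 7 mosaics, 600 imagelets outside and 400 inside the band.
imagelets = []
for mosaic in range(7):
    imagelets += [{"mosaic": mosaic, "lon": 22.0} for _ in range(600)]
    imagelets += [{"mosaic": mosaic, "lon": 27.0} for _ in range(400)]

train, dev, test = split_imagelets(imagelets)
```

With these counts, the sketch yields 2520 training, 1680 development and 2800 test imagelets, and no test longitude appears in the other two sets.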
3) Data Augmentation: Further, we employed data augmentation. The main idea behind data augmentation is to improve learning by reusing the original images with slight transformations such as rotation, flipping, adding Gaussian noise, or slightly changing the brightness. This provides additional information to the model and effectively increases the dataset size. An additional benefit of data augmentation is that it helps the model learn some invariant data properties for which no examples are present in the original dataset. Given the sensitivity of the SAR backscatter, we did not want to augment the images in terms of color or brightness, or by adding noise. However, we could safely employ rotations and flipping. For rotations, we only used 90° increments, giving three possible rotated versions of an image. For image flipping, we applied horizontal flipping, vertical flipping, or both at the same time, giving another three possible versions of the original image. Notice that our images are square, so the transformations did not change the image dimensions. Finally, we applied online augmentation, as opposed to the offline version. In the online process, each augmented image is seen only once, and so this process yields a network that generalises better.
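A minimal sketch of such online augmentation on square single-channel grids is given below. The function names and the sampling probabilities are assumptions, not the settings of the framework we used; the essential point is that the image and its label mask receive the same geometric transformation.

```python
import random


def rotate90(img):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    return [row[::-1] for row in img]


def flip_vertical(img):
    return [list(row) for row in img[::-1]]


def random_augment(image, label, rng=random):
    """Apply one randomly drawn rotation/flip to an image and its label mask."""
    k = rng.randrange(4)        # number of 90-degree rotations: 0, 1, 2 or 3
    do_h = rng.random() < 0.5   # assumed probability of a horizontal flip
    do_v = rng.random() < 0.5   # assumed probability of a vertical flip
    out = []
    for grid in (image, label):
        for _ in range(k):
            grid = rotate90(grid)
        if do_h:
            grid = flip_horizontal(grid)
        if do_v:
            grid = flip_vertical(grid)
        out.append(grid)
    return out[0], out[1]


tile = [[1, 2], [3, 4]]
aug_image, aug_label = random_augment(tile, tile, rng=random.Random(0))
```

Note that flipping both horizontally and vertically is equivalent to a 180° rotation, so the rotated and flipped versions are not all distinct.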
4) Implementation: To apply the described semantic segmentation models, we adapted the open-source Semantic Segmentation Suite. We used Python with TensorFlow [100] backend.
5) Hardware and Training Setup: We trained and tested separately each of the deep learning models on a single GPU (NVIDIA GeForce GTX 1080) on a machine with 32GB of RAM.
For all the models, we used the Adam optimisation method [101] with the learning rate of 0.0001, and with the exponential decay rate for the first moment estimates of 0.9, and for the second moment estimates of 0.999. We applied the early stopping criterion so that, for each model, the training would automatically stop after there was no improvement in the development (validation) loss for 10 epochs. Such early stopping criteria resulted in different models being trained for a different number of epochs: from 69 (for DeepLabV3+ on the RGB SAR Ratio dataset) up to 126 (for SegNet on the RGB SAR DEM dataset) epochs. In general, all the models took slightly longer to train on the RGB SAR DEM dataset. In each case, the checkpoint for the latest model with the best result prior to stopping was saved. Then we used that model for prediction on the test set and we report those results.
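The early stopping rule can be expressed in a few lines. The sketch below replays a list of development losses and is independent of any deep learning framework; the function name and the example loss values are hypothetical.

```python
def early_stopping_epoch(dev_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs have passed without the
    development loss improving on its best value so far; the checkpoint from
    `best_epoch` is the one that would be kept for prediction.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(dev_losses), best_epoch


# Made-up loss curve: improvement stops after the third epoch.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

In this toy curve the best loss occurs at epoch 3, so training stops at epoch 13 and the epoch-3 checkpoint is retained.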
The general processing flowchart (RGB SAR Ratio case) is shown in Figure 12. For the second dataset, a DEM layer is also used alongside the SAR data.
G. Evaluation Metrics
In their review of the metrics used in land cover classification, Costa et al. [102] found a lack of consistency, which complicates the intercomparison of different studies. To avoid such issues and ensure that our results are easily comparable with the literature, we thoroughly evaluated our models. For each model and class, we report the following measures of accuracy: precision, also known as user’s accuracy (UA); recall, also known as producer’s accuracy (PA); overall accuracy; and the Kappa coefficient. The formulas are as follows.
For each segmentation class (land cover type) c, we calculate precision (user’s accuracy):
$$P_c = \frac{TP_c}{TP_c + FP_c}$$
and recall (producer’s accuracy):
$$R_c = \frac{TP_c}{TP_c + FN_c}$$
where $TP_c$ represents true positive, $FP_c$ false positive, and $FN_c$ false negative pixels for the class c. When it comes to accuracy [103], we calculate per class accuracy:
$$Acc_c = \frac{C_{cc}}{G_c}$$
and overall pixel accuracy:
$$Acc = \frac{\sum_{i=1}^{L} C_{ii}}{\sum_{i=1}^{L} G_i}$$
where $C_{ij}$ is the number of pixels having a ground truth label i and being classified/predicted as j, $G_i$ is the total number of pixels labelled with i, and L is the number of classes. All these metrics can take values from 0 to 1.
Fig. 11: The sampling of SAR and land cover imagelets and division into training & development and testing datasets
Fig. 12: General processing flowchart for RGB SAR Ratio dataset
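Finally, we also use the Kappa statistic (Cohen’s measure of agreement), which indicates how the classification results compare to the values assigned by chance [104]. The Kappa statistic equals 1 for perfect agreement and 0 for agreement no better than chance; negative values are possible when agreement is worse than chance. Starting from a k by k confusion matrix with elements $f_{ij}$, the following calculations are done:
$$p_o = \frac{1}{N} \sum_{i=1}^{k} f_{ii}, \qquad r_i = \sum_{j=1}^{k} f_{ij}, \qquad c_j = \sum_{i=1}^{k} f_{ij}, \qquad p_e = \frac{1}{N^2} \sum_{i=1}^{k} r_i c_i$$
where N is the total number of pixels, $p_o$ is the observed proportional agreement (effectively the overall accuracy), $r_i$ and $c_j$ are the row and column totals for classes i and j, and $p_e$ is the expected proportion of agreement. The final measure of agreement is given by the statistic [104]
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$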
Depending on the value of Kappa, the observed agreement is considered as either poor (0.0 to 0.2), fair (0.2 to 0.4), moderate (0.4 to 0.6), good (0.6 to 0.8) or very good (0.8 to 1.0).
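The metrics above can all be derived from a single confusion matrix, as the following sketch shows. The function name and the two-class example matrix are made up for illustration; rows are assumed to hold the reference labels and columns the predicted labels.

```python
def classification_metrics(confusion):
    """Per-class precision and recall, overall accuracy and Cohen's kappa.

    confusion[i][j] = number of pixels with reference class i predicted as j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    diag = [confusion[c][c] for c in range(k)]

    # precision_c = TP / (TP + FP): correct pixels among those predicted as c
    precision = [diag[c] / col_totals[c] if col_totals[c] else 0.0 for c in range(k)]
    # recall_c = TP / (TP + FN): correct pixels among those labelled c
    recall = [diag[c] / row_totals[c] if row_totals[c] else 0.0 for c in range(k)]

    p_o = sum(diag) / total                                        # observed agreement
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"precision": precision, "recall": recall,
            "overall_accuracy": p_o, "kappa": kappa}


# Made-up two-class example with 100 pixels.
metrics = classification_metrics([[50, 10],
                                  [5, 35]])
```

For this example the overall accuracy is 0.85 and kappa is about 0.69, which falls in the "good" agreement band.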
Using the experimental setup described in the previous section, we evaluated the seven selected semantic segmentation models: SegNet [74], PSPNet [75], BiSeNet [76], DeepLabV3+ [77], [78], U-Net [79], [80], FRRN-B [81], and FC-DenseNet [82]. The overall classification performance statistics for all studied models are gathered in Table V. Figure 13 shows maps produced for several imagelets with the best performing model, FC-DenseNet. The obtained results are compared to prior work, and the classification performance for different land cover classes is discussed further below.
A. Classification Performance
All the models performed relatively well on both datasets, achieving an overall accuracy above 87% for each model. Four models performed particularly well, achieving an accuracy score above 92% on both datasets: FRRN-B, U-Net, SegNet, and FC-DenseNet. The latter two models were also somewhat better than the others in terms of the kappa statistic and, along with FRRN-B, were also the best models with respect to class-wise user’s and producer’s accuracy. The advantage of SegNet is that its training and inference times were 2.5 times better than those of FC-DenseNet, at similar accuracy. BiSeNet and DeepLabV3+ performed somewhat worse than the other five models, particularly in terms of agreement (kappa was 0.75-0.82), but their overall accuracy was also lower, most strongly for BiSeNet. Overall and class-wise accuracies obtained on the completely independent test dataset were still remarkably high compared to other results reported in the literature, where more conventional statistical or traditional machine learning approaches were used with C-band SAR data [105], [106]. Further in-depth comparison can be found in Section IV-C.
Before further analysis, let us recall that CORINE is not exclusively a land cover map, but rather a land cover and land use map; thus, some specific classes can differ from the ecological classes observed by Sentinel-1. Also, the aggregation to CLC Level-1 is sometimes not strictly “ecological”, nor does it always comply with the physics of surface scattering. For example, airports, major industrial areas and road networks often exhibit areas similar to fields; the presence of trees and green vegetation near summer cottages can cause them to exhibit signatures close to forest rather than urban; forest on rocky terrain can sometimes be misclassified as urban due to the presence of very bright targets and strong disruptive features; and confusion between peatland, agricultural and grassland areas is also common. Finally, the accuracy of the CORINE data is only somewhat higher than 90%.
As for the results across the different land classes, all the models performed particularly well in recognising water bodies and forested areas, while the urban fabric was the most challenging class for all the models. The urban class was particularly challenging for the following main reasons. First, it is still essentially a land use class, with continuous urban fabric (easy to recognize by radar) representing only a moderate fraction of the whole class. It also changes the most, as new houses, roads, and urban areas are built. Second, the CORINE map itself does not have perfect accuracy, nor are the aggregation rules perfect. As a matter of fact, in the majority of studies where SAR-based classification was compared against CLC or similar data, a poor or modest overall agreement was observed for urban land use areas [20], [21], [41], [83], while the user’s accuracy was considerably higher than the producer’s [107]. The latter is precisely because radar senses sharp boundaries and bright targets very well, whereas such bright targets often do not dominate the whole urban land-use class. Importantly, relatively good performance was obtained in mapping agricultural and wetland areas, and the two were differentiated particularly well, which is often problematic with other remote sensing instruments.
In addition to the VH-to-VV ratio, we tested a topographic DEM as a candidate for the third layer in the RGB imagelets. However, the difference in classification accuracy (summarized in Tables IV and V) was marginal. This limited gain in accuracy can be explained by several reasons. Firstly, the accuracies achieved using only SAR data were high overall, and relatively high for the majority of classes, particularly water. Additionally, DEM variation in the study area was limited, mostly within 0-300 meters asl. If the DEM variation were higher, this could affect land use and vegetation and result in a larger impact of the DEM on classification accuracy (e.g., in Scotland, Norway and many other countries). Moreover, the DEM used in the study is essentially a topographic digital terrain model and does not include forest canopy height models or high-contrast features of urban structures within settlements, which could potentially boost the classification accuracy for the urban and forest classes.
We mentioned the issues of SAR backscattering sensitivity to several ground factors so that the same classes might appear differently on the images between countries or between distant areas within a country. An interesting indication of our study,
TABLE IV: Summary of the classification performance and efficiency of deep learning models on the RGB SAR Ratio dataset (UA – user’s accuracy, PA – producer’s accuracy)
TABLE V: Summary of the classification performance and efficiency of deep learning models on the RGB SAR DEM dataset (UA – user’s accuracy, PA – producer’s accuracy)
Fig. 13: Illustration of the FC-DenseNet model performance: selection of classification results, i.e., direct output of the network, without any post-processing (bottom row) versus reference CORINE based land cover (upper row).
TABLE VI: Confusion matrix for classification with the FC-DenseNet model on the RGB SAR Ratio dataset.
however, is that the deep learning models might be able to deal with this issue. Namely, we used models pre-trained on ImageNet and fine-tuned them with a relatively small number of Sentinel-1 images. The models learned to recognize varying types of the backscattering signal across the whole of Finland. This indicates that, with a similar type of fine-tuning, the present models could be relatively easily adapted to other areas and countries with different SAR backscattering patterns. Such robustness and adaptability of the deep learning models come from their automatic learning
Fig. 14: Accuracy curves during training and development on both datasets for the fastest (BiSeNet) and the slowest (FC-DenseNet) model. The early-stopping criterion of 10 epochs with no improvement in development loss was applied.
of feature representation, without the need for a human expert pre-defining those features.
B. Computational Performance
The training times with our hardware configuration ranged from 1 to 2.5 days for the different models. This could be significantly improved by training each model on a multi-GPU system instead of the single GPU used in our experiments.
In terms of inference time, we also saw differences in performance. In Table V, we present the average inference time per imagelet. The results show that there is a trade-off between classification and computational performance: the best models in terms of classification results (i.e., FC-DenseNet and FRRN-B) have several times longer inference times than the rest. A positive exception in this regard is the SegNet model, which achieved the best classification results together with FC-DenseNet but with a 2.5 times better inference time. Depending on the application, this might or might not be of particular importance.
C. Comparison to Similar Work
The obtained results compare favourably to previous similar studies on land cover classification with SAR data [20], [21], [28], [31], [41], [83]. Depending on the level of class aggregation (4-5 major classes or more), the classification accuracies reported when only SAR imagery was used, mostly with statistical or classical machine learning approaches, ranged from as low as 30% to as high as 80-87%.
Two recent studies that employed neural networks to SAR imagery classification (albeit in combination with satellite optical data) for land cover mapping were [28] and [66], with reported classification accuracies of up to 97.5% and 94.6%, respectively.
The best models in our experiments achieved an overall accuracy of 93%. However, our results were obtained using solely SAR imagery. In contrast, SAR imagery (PALSAR) alone yielded an overall accuracy of 78.1% in [28]. The types of classes they studied are also different from ours (crops versus vegetation versus land cover types), and our study is performed on a larger area. Importantly, the previous studies applied different types of models (regular NNs versus CNN versus semantic segmentation). In particular, the CNN models work on the resolution windows, while we have applied more advanced semantic segmentation models, which work on the level of a pixel. Keeping in mind the finding from [28] that the addition of optical images on top of SAR improved the results by over 10%, we expect that our models would perform comparably well to, or outperform, these previous works if applied to combined SAR and optical imagery.
In terms of the deep learning setup, the studies most similar to ours are [53] and [70]. However, RapidEye optical imagery at 5 m spatial resolution was used in [53], and the test site was considerably smaller. Study [70], like our research, relied exclusively on SAR imagery, but used fully polarimetric images acquired by RADARSAT-2 at considerably better resolution. The authors developed an FCN-type semantic segmentation model ‘specifically designed for the classification of wetland complexes using PolSAR imagery’. Using this model to classify eight wetland map classes, they achieved an overall accuracy of 93%. However, because their model is designed specifically for wetland complexes, it is not clear whether it would generalize to other types of areas. Compared to our study, they focused on a considerably smaller area (nearly the size of a single imagelet we used) and on a very specific task (mapping of wetland types). Thus, it is not readily clear how general their approach is and how it compares to the approach presented here.
D. Outlook and Future Work
There are several lines for potential improvement based on the results of this study, as well as future work directions.
First, using even a larger set of Sentinel-1 images can be recommended since for the supervised deep learning models large amounts of data are crucial. Here, we processed only 6888 imagelets altogether, but deep learning algorithms become efficient typically only once they are trained with hundreds of thousands or millions of images.
Second, if SAR images and reference data of a higher resolution are used, we expect better classification performance, too, as smaller details could potentially be captured. Also, better agreement between the acquisition timing of the reference data and the SAR imagery can be recommended. The reference and training data should come from the same months or year if possible, and the reference maps should represent reality as accurately as possible. The models in our experiments were certainly limited by CORINE’s own limited accuracy.
Third, in this study we have tested the effectiveness of off-the-shelf deep learning models for land cover mapping from SAR data. While the results show their effectiveness, it is also likely that the novel types of models, specifically developed for the radar data (such as [70]), will yield even better results. Based on our results, we suggest DenseNet- and SegNet-based models as a starting point. In particular, one could develop the deep learning models to handle directly the SLC data which preserve the phase information.
Focusing on a single season is both an advantage and a limitation. Importantly, we have avoided confusion caused by SAR signatures that vary seasonally for several land cover classes. However, the multitemporal dynamics themselves can potentially be used as an additional class-discriminating parameter. Incorporating the seasonal dynamics of each land cover pixel (as a time series) is left for future work, possibly requiring recurrent neural networks to be incorporated into the approach.
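As an illustration of the per-pixel time series mentioned above, the following minimal sketch reorders a stack of co-registered acquisitions into one backscatter sequence per pixel, which is the kind of input a recurrent model would consume. The function name `pixel_time_series`, the nested-list data layout, and the backscatter values are assumptions made for this example only and are not part of the processing chain used in this study.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[float]]  # one acquisition: rows x columns of backscatter values


def pixel_time_series(stack: Sequence[Grid]) -> List[List[List[float]]]:
    """Reorder stack[t][row][col] into series[row][col][t].

    Every acquisition must be co-registered to the same grid, so all
    grids are required to share the same number of rows and columns.
    """
    if not stack:
        raise ValueError("at least one acquisition is required")
    n_rows = len(stack[0])
    n_cols = len(stack[0][0]) if n_rows else 0
    for grid in stack:
        if len(grid) != n_rows or any(len(row) != n_cols for row in grid):
            raise ValueError("all acquisitions must share the same grid shape")
    return [
        [[grid[r][c] for grid in stack] for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Three hypothetical acquisition dates over a 2 x 2 pixel grid (values in dB).
stack = [
    [[-12.0, -8.5], [-20.1, -9.0]],
    [[-11.5, -8.7], [-15.3, -9.2]],
    [[-12.2, -8.4], [-19.8, -9.1]],
]
series = pixel_time_series(stack)
```

In this made-up example, the pixel at row 1, column 0 shows a much larger change between dates than its neighbours; such temporal variation is the additional class-discriminating information referred to above.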
As discussed in Section 3.1.1, it could be suitable to use more detailed (specific) land cover classes, because the aggregation of smaller LC classes into CORINE super-classes is not strictly ecological: it mixes several distinct SAR signatures in one class and thus causes additional confusion for the classifier. The specific classes, once classified, can later be aggregated into larger classes, potentially showing improved performance [19].
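The classify-then-aggregate idea can be sketched as follows: predictions are made at the level of specific classes, and both reference and predicted labels are mapped to super-classes before the accuracy is recomputed. The class names, the mapping `FINE_TO_SUPER`, and the toy labels are hypothetical and chosen only for illustration; they are not the class scheme or results of this study.

```python
from typing import Dict, List, Sequence

# Hypothetical mapping from specific class names to aggregated super-classes.
# The names are illustrative only and are not the class scheme used in the study.
FINE_TO_SUPER: Dict[str, str] = {
    "coniferous_forest": "forest",
    "broadleaved_forest": "forest",
    "mixed_forest": "forest",
    "open_bog": "wetland",
    "inland_marsh": "wetland",
    "lake": "water",
    "river": "water",
}


def aggregate(labels: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Replace every specific label with its super-class label."""
    return [mapping[label] for label in labels]


def overall_accuracy(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of pixels whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one pixel is required")
    hits = sum(r == p for r, p in zip(reference, predicted))
    return hits / len(reference)


# Toy per-pixel labels (made up for the example).
reference = ["coniferous_forest", "mixed_forest", "open_bog", "lake", "river"]
predicted = ["mixed_forest", "mixed_forest", "inland_marsh", "lake", "lake"]

fine_accuracy = overall_accuracy(reference, predicted)
super_accuracy = overall_accuracy(
    aggregate(reference, FINE_TO_SUPER), aggregate(predicted, FINE_TO_SUPER)
)
```

In this toy case, confusions between specific classes that belong to the same super-class disappear after aggregation, so the super-class accuracy is higher than the specific-class accuracy. Whether a comparable gain appears with real SAR data would have to be verified experimentally.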
Finally, we have used only SAR images and a freely available DEM for the presented large-scale land cover mapping. If other types of remote sensing images, in particular optical images, were combined with them, we expect that the results would improve significantly. This holds for those areas where such imagery can be collected given the cloud coverage, while in an operational scenario it would potentially require the use of at least two models (with and without optical satellite imagery). It is also important to assess the added value of SAR imagery with deep learning models when optical satellite images are available, as well as possible data fusion and decision fusion scenarios, before a decision on the mapping approach is made [19].
Our study demonstrated the potential for applying state-of-the-art semantic segmentation models to SAR image classification with high accuracy. Several models were benchmarked in a countrywide classification experiment using Sentinel-1 IW mode SAR data, reaching nearly 93% overall classification accuracy with the best performing models (SegNet and FCDenseNet). This indicates strong potential for using pre-trained CNNs for further fine-tuning, an approach that seems particularly suitable when the number of training images is limited (to thousands or tens of thousands instead of millions). In addition to suggesting the best candidate semantic segmentation models for land cover mapping with SAR data (that is, the DenseNet-based models), our study offers baseline results against which newly proposed models should be evaluated. Several possible improvements for future work were identified, including the need to test multitemporal approaches, data fusion, and very high-resolution SAR imagery, as well as to develop models specifically for SAR.
The authors were supported by ICEYE Oy during the study. SŠ was also supported by EIT Digital, and OA was also supported by Aalto University and VTT. The authors thank the reviewers for their careful reading of the manuscript and their valuable comments.
The implementation scripts with documentation are available on GitHub,8 the original Sentinel-1 images can be downloaded from SciHub,9 and the processed train/development and test data are published on Zenodo10 and IEEE DataPort.
[1] S. Bojinski, M. Verstraete, T. C. Peterson, C. Richter, A. Simmons, and M. Zemp, “The concept of essential climate variables in support of climate research, applications, and policy,” Bulletin of the American Meteorological Society, vol. 95, no. 9, pp. 1431–1443, 2014.
[2] G. Büttner, J. Feranec, G. Jaffrain, L. Mari, G. Maucha, and T. Soukup, “The CORINE land cover 2000 project,” EARSeL eProceedings, vol. 3, no. 3, pp. 331–346, 2004.
[3] M. Bossard, J. Feranec, J. Otahel et al., “CORINE land cover technical guide: Addendum 2000,” 2000.
[4] G. Büttner, “CORINE land cover and land cover change products,” in Land Use and Land Cover Mapping in Europe. Springer, 2014, pp. 55–74.
[5] M. Törmä, T. Markkanen, S. Hatunen, P. Härmä, O.-P. Mattila, and A. Arslan, “Assessment of land-cover data for land-surface modelling in regional climate studies,” Boreal Environment Research, vol. 20, no. 2, pp. 243–260, 2015.
[6] J. Chen, J. Chen, A. Liao, X. Cao, L. Chen, X. Chen, C. He, G. Han, S. Peng, M. Lu et al., “Global land cover mapping at 30 m resolution: A pok-based operational approach,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 103, pp. 7–27, 2015.
[7] C. A. d. Almeida, A. C. Coutinho, J. C. D. M. Esquerdo, M. Adami, A. Venturieri, C. G. Diniz, N. Dessay, L. Durieux, and A. R. Gomes, “High spatial resolution land use and land cover mapping of the brazilian legal amazon in 2008 using landsat-5/tm and modis data,” Acta Amazonica, vol. 46, no. 3, pp. 291–302, 2016.
[8] C. Homer, J. Dewitz, L. Yang, S. Jin, P. Danielson, G. Xian, J. Coulston, N. Herold, J. Wickham, and K. Megown, “Completion of the 2011 national land cover database for the conterminous united states–representing a decade of land cover change information,” Photogrammetric Engineering & Remote Sensing, vol. 81, no. 5, pp. 345–354, 2015.
[9] Y. Zhao, D. Feng, L. Yu, X. Wang, Y. Chen, Y. Bai, H. J. Hernández, M. Galleguillos, C. Estades, G. S. Biging et al., “Detailed dynamic land cover mapping of chile: Accuracy improvement by integrating multi-temporal data,” Remote Sensing of Environment, vol. 183, pp. 170–185, 2016.
[10] P. Griffiths, C. Nendel, and P. Hostert, “Intra-annual reflectance composites from sentinel-2 and landsat for national-scale crop and land cover mapping,” Remote sensing of environment, vol. 220, pp. 135–151, 2019.
[11] R. Torres, P. Snoeij, D. Geudtner, D. Bibby, M. Davidson, E. Attema, P. Potin, B. Rommen, N. Floury, M. Brown, I. Traver, P. Deghaye, B. Duesmann, B. Rosich, N. Miranda, C. Bruno, M. L’Abbate, R. Croci, A. Pietropaolo, M. Huchler, and F. Rostan, “GMES Sentinel-1 mission,” Remote Sensing of Environment, vol. 120, pp. 9–24, 2012.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[15] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning,” 2016.
[16] W. Cohen and S. Goward, “Landsat’s role in ecological applications of remote sensing,” BioScience, vol. 54, no. 6, pp. 535–545, 2004.
[17] S. Goetz, A. Baccini, N. Laporte, T. Johns, W. Walker, J. Kellndorfer, R. Houghton, and M. Sun, “Mapping and monitoring carbon stocks with satellite observations: A comparison of methods,” Carbon Balance and Management, vol. 4, 2009.
[18] C. Atzberger, “Advances in remote sensing of agriculture: Context description, existing operational monitoring systems and major information needs,” Remote Sensing, vol. 5, no. 2, pp. 949–981, 2013.
[19] T. Hame, J. Kilpi, H. A. Ahola, Y. Rauste, O. Antropov, M. Rautiainen, L. Sirro, and S. Bounpone, “Improved mapping of tropical forests with optical and SAR imagery, part I: Forest cover and accuracy assessment using multi-resolution data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 1, pp. 74–91, Feb 2013.
[20] O. Antropov, Y. Rauste, H. Astola, J. Praks, T. Hame, and M. Hal- likainen, “Land cover and soil type mapping from spaceborne PolSAR data at L-band with probabilistic neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 9, pp. 5256–5270, 2014.
[21] A. Lonnqvist, Y. Rauste, M. Molinier, and T. Hame, “Polarimetric SAR data in land cover mapping in boreal zone,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 10, pp. 3652–3662, Oct 2010.
[22] B. Waske and M. Braun, “Classifier ensembles for land cover mapping using multitemporal sar imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 64, no. 5, pp. 450–457, 2009.
[23] L. Bruzzone, M. Marconcini, U. Wegmüller, and A. Wiesmann, “An advanced system for the automatic classification of multitemporal sar images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 6, pp. 1321–1334, 2004.
[24] T. Ullmann, A. Schmitt, A. Roth, J. Duffe, S. Dech, H.-W. Hubberten, and R. Baumhauer, “Land cover characterization and classification of arctic tundra environments by means of polarized synthetic aperture x- and c-band radar (polsar) and landsat 8 multispectral imagery — richards island, canada,” Remote Sensing, vol. 6, no. 9, pp. 8565–8593, 2014. [Online]. Available: http://www.mdpi.com/2072-4292/6/9/8565
[25] N. Clerici, C. A. V. Calderón, and J. M. Posada, “Fusion of sentinel-1a and sentinel-2a data for land cover mapping: a case study in the lower magdalena region, colombia,” Journal of Maps, vol. 13, no. 2, pp. 718–726, 2017.
[26] C. Castañeda and D. Ducrot, “Land cover mapping of wetland areas in an agricultural landscape using sar and landsat imagery,” Journal of Environmental Management, vol. 90, no. 7, pp. 2270–2277, 2009.
[27] Y. Ban, H. Hu, and I. M. Rangel, “Fusion of quickbird ms and radarsat sar data for urban land-cover mapping: Object-based and knowledge-based approach,” International Journal of Remote Sensing, vol. 31, no. 6, pp. 1391–1410, 2010.
[28] G. V. Laurin, V. Liesenberg, Q. Chen, L. Guerriero, F. Del Frate, A. Bartolini, D. Coomes, B. Wilebore, J. Lindsell, and R. Valentini, “Optical and sar sensor synergies for forest and land cover mapping in a tropical site in west africa,” International Journal of Applied Earth Observation and Geoinformation, vol. 21, pp. 7–16, 2013.
[29] R. Khatami, G. Mountrakis, and S. V. Stehman, “A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: General guidelines for practitioners and future research,” Remote Sensing of Environment, vol. 177, pp. 89–100, 2016.
[30] B. Waske and M. Braun, “Classifier ensembles for land cover mapping using multitemporal sar imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 64, no. 5, pp. 450–457, 2009.
[31] H. Balzter, B. Cole, C. Thiel, and C. Schmullius, “Mapping CORINE land cover from Sentinel-1A SAR and SRTM digital elevation model data using random forests,” Remote Sensing, vol. 7, no. 11, pp. 14 876–14 898, 2015.
[32] S.-E. Park, “Variations of microwave scattering properties by seasonal freeze/thaw transition in the permafrost active layer observed by ALOS PALSAR polarimetric data,” Remote Sensing, vol. 7, no. 12, pp. 17 135–17 148, 2015.
[33] M. C. Dobson, L. E. Pierce, and F. T. Ulaby, “Knowledge-based land-cover classification using ERS-1/JERS-1 SAR composites,” IEEE Transactions on Geoscience and Remote Sensing, vol. 34, no. 1, pp. 83–99, Jan 1996.
[34] L. Sirro, T. Häme, Y. Rauste, J. Kilpi, J. Hämäläinen, K. Gunia, B. de Jong, and F. Paz Pellat, “Potential of different optical and SAR data in forest and land cover classification to support redd+ mrv,” Remote Sensing, vol. 10, no. 6, 2018.
[35] J. D. T. De Alban, G. M. Connette, P. Oswald, and E. L. Webb, “Combined Landsat and L-band SAR data improves land cover classification and change detection in dynamic tropical landscapes,” Remote Sensing, vol. 10, no. 2, 2018.
[36] N. Longepe, P. Rakwatin, O. Isoguchi, M. Shimada, Y. Uryu, and K. Yulianto, “Assessment of alos palsar 50 m orthorectified fbd data for regional land cover classification by support vector machines,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2135–2150, June 2011.
[37] T. Esch, A. Schenk, T. Ullmann, M. Thiel, A. Roth, and S. Dech, “Characterization of land cover types in terrasar-x images by combined analysis of speckle statistics and intensity information,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 1911–1925, June 2011.
[38] J. W. Cable, J. M. Kovacs, J. Shang, and X. Jiao, “Multi-temporal polarimetric radarsat-2 for land cover monitoring in northeastern ontario, canada,” Remote Sensing, vol. 6, no. 3, pp. 2372–2392, 2014. [Online]. Available: http://www.mdpi.com/2072-4292/6/3/2372
[39] X. Niu and Y. Ban, “Multi-temporal radarsat-2 polarimetric sar data for urban land-cover classification using an object-based support vector machine and a rule-based approach,” International Journal of Remote Sensing, vol. 34, no. 1, pp. 1–26, 2013.
[40] T. L. Evans, M. Costa, K. Telmer, and T. S. F. Silva, “Using alos/palsar and radarsat-2 to map land cover and seasonal inundation in the brazilian pantanal,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 3, no. 4, pp. 560–575, Dec 2010.
[41] P. Lumsdon, S. R. Cloude, and G. Wright, “Polarimetric classification of land cover for Glen Affric radar project,” IEE Proceedings - Radar, Sonar and Navigation, vol. 152, no. 6, pp. 404–412, Dec 2005.
[42] C. da Costa Freitas, L. de Souza Soler, S. J. S. Sant’Anna, L. V. Dutra, J. R. Dos Santos, J. C. Mura, and A. H. Correia, “Land use and land cover mapping in the brazilian amazon using polarimetric airborne p-band sar data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 10, pp. 2956–2970, 2008.
[43] G. Li, D. Lu, E. Moran, L. Dutra, and M. Batistella, “A comparative analysis of alos palsar l-band and radarsat-2 c-band data for land-cover classification in a tropical moist region,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 70, pp. 26–38, 2012.
[44] N. Park and K. Chi, “Integration of multitemporal/polarization c-band sar data sets for land-cover classification,” International Journal of Remote Sensing, vol. 29, no. 16, pp. 4667–4688, 2008.
[45] E. Tomppo, O. Antropov, and J. Praks, “Cropland classification using Sentinel-1 time series: Methodological performance and prediction uncertainty assessment,” Remote Sensing, vol. 11, no. 21, 2019.
[46] D. B. Nguyen, A. Gruber, and W. Wagner, “Mapping rice extent and cropping scheme in the mekong delta using sentinel-1a data,” Remote Sensing Letters, vol. 7, no. 12, pp. 1209–1218, 2016.
[47] A. Veloso, S. Mermoz, A. Bouvet, T. Le Toan, M. Planells, J.-F. Dejoux, and E. Ceschia, “Understanding the temporal behavior of crops using Sentinel-1 and Sentinel-2-like data for agricultural applications,” Remote Sensing of Environment, vol. 199, pp. 415–426, 2017.
[48] G. Satalino, A. Balenzano, F. Mattia, and M. W. Davidson, “C-band SAR data for mapping crops dominated by surface or volume scattering,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 2, pp. 384–388, 2013.
[49] F. Vicente-Guijalba, A. Jacob, J. M. Lopez-Sanchez, C. Lopez-Martinez, J. Duro, C. Notarnicola, D. Ziolkowski, A. Mestre-Quereda, E. Pottier, J. J. Mallorquí, M. Lavalle, and M. Engdahl, “Sincohmap: Land-cover and vegetation mapping using multi-temporal Sentinel-1 interferometric coherence,” in IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, July 2018, pp. 6631–6634.
[50] S. Ge, O. Antropov, W. Su, H. Gu, and J. Praks, “Deep recurrent neural networks for land-cover classification using Sentinel-1 InSAR time series,” in IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, July 2019, pp. 473–476.
[51] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[52] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: a review,” arXiv preprint arXiv:1710.03959, 2017.
[53] M. Mahdianpari, B. Salehi, M. Rezaee, F. Mohammadimanesh, and Y. Zhang, “Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery,” Remote Sensing, vol. 10, no. 7, p. 1119, 2018.
[54] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
[55] J. Zhang, P. Zhong, Y. Chen, and S. Li, “L1/2-regularized deconvolution network for the representation and restoration of optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 5, pp. 2617–2627, 2014.
[56] X. Chen, S. Xiang, C.-L. Liu, and C.-H. Pan, “Aircraft detection by deep belief nets,” in Pattern Recognition (ACPR), 2013 2nd IAPR Asian Conference on. IEEE, 2013, pp. 54–58.
[57] ——, “Vehicle detection in satellite images by hybrid deep convolutional neural networks,” IEEE Geoscience and remote sensing letters, vol. 11, no. 10, pp. 1797–1801, 2014.
[58] Y. Liu, G. Cao, Q. Sun, and M. Siegel, “Hyperspectral classification via deep networks and superpixel segmentation,” International Journal of Remote Sensing, vol. 36, no. 13, pp. 3459–3482, 2015.
[59] J. Wang, Q. Qin, Z. Li, X. Ye, J. Wang, X. Yang, and X. Qin, “Deep hierarchical representation and segmentation of high resolution remote sensing images,” in Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International. IEEE, 2015, pp. 4320–4323.
[60] D. Tuia, R. Flamary, and N. Courty, “Multiclass feature learning for hyperspectral image classification: Sparse and hierarchical solutions,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 105, pp. 272–285, 2015.
[61] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, “Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery,” Remote Sensing, vol. 7, no. 11, pp. 14 680–14 707, 2015.
[62] O. A. B. Penatti, K. Nogueira, and J. A. dos Santos, “Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?” in 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2015, pp. 44–51.
[63] F. P. Luus, B. P. Salmon, F. Van den Bergh, and B. T. J. Maharaj, “Multiview deep learning for land-use classification,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 12, pp. 2448–2452, 2015.
[64] F. Zhang, B. Du, and L. Zhang, “Scene classification via a gradient boosting random convolutional network framework,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1793–1802, 2016.
[65] T. Ishii, R. Nakamura, H. Nakada, Y. Mochizuki, and H. Ishikawa, “Surface object recognition with cnn and svm in landsat 8 images,” in 2015 14th IAPR International Conference on Machine Vision Applications (MVA), May 2015, pp. 341–344.
[66] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov, “Deep learning classification of land cover and crop types using remote sensing data,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 778–782, 2017.
[67] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE Journal of Selected topics in applied earth observations and remote sensing, vol. 7, no. 6, pp. 2094–2107, 2014.
[68] G. Wu, X. Shao, Z. Guo, Q. Chen, W. Yuan, X. Shi, Y. Xu, and R. Shibasaki, “Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks,” Remote Sensing, vol. 10, no. 3, p. 407, 2018.
[69] Y. Duan, F. Liu, L. Jiao, P. Zhao, and L. Zhang, “Sar image segmentation based on convolutional-wavelet neural network and markov random field,” Pattern Recognition, vol. 64, pp. 255–267, 2017.
[70] F. Mohammadimanesh, B. Salehi, M. Mahdianpari, E. Gill, and M. Molinier, “A new fully convolutional neural network for semantic segmentation of polarimetric SAR imagery in complex land cover ecosystem,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 151, pp. 223–236, 2019.
[71] L. Wang, X. Xu, H. Dong, R. Gui, and F. Pu, “Multi-pixel simultaneous classification of polsar image using convolutional neural networks,” Sensors, vol. 18, no. 3, p. 769, 2018.
[72] M. Ahishali, S. Kiranyaz, T. Ince, and M. Gabbouj, “Dual and single polarized SAR image classification using compact convolutional neural networks,” Remote Sensing, vol. 11, no. 11, p. 1340, 2019.
[73] Z. Li, Z. Yang, and H. Xiong, “Homogeneous region segmentation for SAR images based on two steps segmentation algorithm,” in Computers, Communications, and Systems (ICCCS), International Conference on. IEEE, 2015, pp. 196–200.
[74] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[75] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[76] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
[77] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” arXiv preprint arXiv:1802.02611, 2018.
[78] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[79] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), ser. LNCS, vol. 9351. Springer, 2015, pp. 234–241, (available on arXiv:1505.04597 [cs.CV]). [Online]. Available: http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a
[80] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[81] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Full-resolution residual networks for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4151–4160.
[82] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 1175–1183.
[83] O. Antropov, Y. Rauste, A. Lonnqvist, and T. Hame, “PolSAR mosaic normalization for improved land-cover mapping,” IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 6, pp. 1074–1078, Nov 2012.
[84] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[85] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of physiology, vol. 160, no. 1, pp. 106–154, 1962.
[86] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[87] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[88] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in CVPR, vol. 1, no. 2, 2017, p. 3.
[89] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[90] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[91] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on. IEEE, 2013, pp. 8614–8618.
[92] D. Small, L. Zuberbühler, A. Schubert, and E. Meier, “Terrain-flattened gamma nought Radarsat-2 backscatter,” Canadian Journal of Remote Sensing, vol. 37, no. 5, pp. 493–499, 2012.
[93] P. Härmä, R. Teiniranta, M. Törmä, R. Repo, E. Järvenpää, and M. Kallio, “The production of finnish corine land cover 2000 classification.” XXth ISPRS Congress, Istanbul, Turkey, 2004.
[94] ——, “Finnish corine land cover 2000 classification.” XXth ISPRS Congress, Anchorage, US, 2004.
[95] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, “A review on deep learning techniques applied to semantic segmentation,” arXiv preprint arXiv:1704.06857, 2017.
[96] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[97] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[98] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[99] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
[100] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[101] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, ICLR 2015, Y. Bengio and Y. LeCun, Eds., 2015.
[102] H. Costa, G. M. Foody, and D. S. Boyd, “Supervised methods of image segmentation accuracy assessment in land cover mapping,” Remote sensing of environment, vol. 205, pp. 338–351, 2018.
[103] G. Csurka, D. Larlus, F. Perronnin, and F. Meylan, “What is a good evaluation measure for semantic segmentation?” in BMVC, vol. 27. Citeseer, 2013, p. 2013.
[104] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[105] H. Balzter, B. Cole, C. Thiel, and C. Schmullius, “Mapping CORINE land cover from Sentinel-1A SAR and SRTM digital elevation model data using random forests,” Remote Sensing, vol. 7, no. 11, pp. 14 876–14 898, 2015. [Online]. Available: http://www.mdpi.com/2072-4292/7/11/14876
[106] C. Thiel, O. Cartus, R. Eckardt, N. Richter, C. Thiel, and C. Schmullius, “Analysis of multi-temporal land observation at c-band,” in 2009 IEEE International Geoscience and Remote Sensing Symposium, vol. 3, 2009, pp. III–318–III–321.
[107] O. Antropov, Y. Rauste, and T. Hame, “Volume scattering modeling in PolSAR decompositions: Study of ALOS PALSAR data over boreal forest,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 10, pp. 3838–3848, Oct 2011.