Annotating images with short text descriptions or captions
provides one method to create and automate tagging for large
image repositories [Andres, et al., 2012]. Using keywords,
subsequent text searches then can find and group similar images,
thus rendering entire image collections as both indexable and
discoverable [Blanchart, et al., 2010]. Similarly, with an
effective caption generator specialized for satellite imagery, one
can catalogue and inventory large areas of the planet and assist
land use planners or first responders during a disaster recovery
effort [Bratasanu, et al., 2010; Kyzirakos, et al., 2014]. To
streamline this task, one needs to build a deep image search
engine [Mao, et al., 2018], one which characterizes pixel features
in words, extracts metadata from potentially millions of
overhead images, and returns queries either for nearest matches
in keywords or similar imagery.
1.1 Broad Motivation. To render the first functional map of the
world, many researchers have advocated for a combination of
automated labels with overhead imagery [Christie, et al., 2018;
Demir, et al., 2018]. This global inventory would enable complex change detection not previously available.
Furthermore, in cases when the annotation might occur on board satellites, the automated captioning might guide future download choices and minimize data transfer of useless images. Opportunities for obtaining higher priority images would increase with pre-filtering of images degraded by poor camera optics, cloud cover, night, or broad open ocean [Yao, et al., 2016]. We approach this satellite captioning problem for overhead images in two steps: first, recognizing all the objects in a complex satellite image, then second, describing all the discovered objects and their mutual relationships in a short sentence or annotation [Tanti, et al., 2018]. The final text serves as a captioned description that subsequently makes the image discoverable, shareable and searchable. For example, one might annotate an image like Figure 1 by combining scene recognition (“yellow beach”) with all other objects relative to its adjoining scene parts (“next to a green forest”). This caption contrasts markedly with the output of a generic model not specialized for overhead imagery, such as the Microsoft CaptionBot, which labels the same image as “I can't really describe the picture, but I do see black, looking, white” [Microsoft, 2019].
Figure 1. Example Overhead Imagery to Caption. Human annotators label this training image as: “Yellow ribbon beach is between green trees and dark green ocean with white waves.”
1.2 Anticipated Outcomes. For the present work, we apply multiple deep learning detection models to satellite images. We train these models using the technique of transfer learning, which leverages pre-training of feature extractors on much larger datasets, then extends the final image classification layer to previously undefined classes. Knowing the objects in the image, we generate captions using a recurrent neural network with long short-term memory (LSTM) [see Gers, et al., 2000; Brownlee, 2017]. The overall method associates words with those recognized objects and annotates them with complete caption sentences [Luo, et al., 2011; Luo, et al., 2013]. Compared to previous work, our approach attempts to minimize the image model’s overall size, thus making on-board
satellite processing potentially viable, while also correcting and expanding the vocabulary traditionally included for large caption training steps. This work also extends the benchmark text vocabularies to include spelling corrections, synonyms and alternative sentence structures. This re-examination of the initial captioning vocabulary yields two unexpected findings: a built-in annotation bias, namely a dominant sensitivity to color descriptions (which offer little new information over the raw red-green-blue pixels), and an inability to capture world knowledge that a human expert might offer, such as describing the physical relationship between ocean white-caps and wind in an image. We explore the implications of these two outcomes in more depth in the Discussion section.
Figure 2. Word cloud for frequently used annotation terms in the RSICD dataset. The notable overabundance of colors, trees and buildings from human annotators indicates some repetitive labeling, particularly in residential land types.
Figure 3. Categories of characteristic satellite scenes highlighting land use in RSICD data, including biomes, residential, zoning, cultural and transport classes.
of scene diversity (e.g. number of image classes). Various captioning benchmark efforts have appeared, some of which have offered a specialized earth-observation retrieval system based on the content of satellite imagery [Lu, et al. 2017] or have assembled a combined image and text dataset called the Remote Sensing Image Captioning Dataset (RSICD). To balance and diversify the objects recognized and described in previous captioning examples, the RSICD dataset reduces the relative number of included residential scenes. The authors suggest this residential imbalance biases previous captioning tasks, such as UCM [Yang, Newsam, 2010] and Sydney [Zhang, et al., 2019] datasets. While diversifying the available image classes (e.g. Figure 3), the RSICD authors follow a similar captioning format by providing five different sentences describing each image. They note that the caption diversity depends on two key factors, both how the five sentences differ from each other for a given image (textual depth, as shown in Figure 2) and how the different images differ in their respective captioning choices (image breadth, as shown in Figure 3). Both diversity types (the intra- and inter-image changes) may shape the trained algorithm’s ability to generalize to new test images [Dumitru, et al., 2014; Espinoza-Molina, et al., 2013].
1.4 Original Contributions. For the present work, we extend the RSICD dataset in three key ways. First, we supplement satellite image data [e.g. SIRI-WHU, Zhao, et al., 2015 and xView, Lam, et al., 2018]. Secondly, we deploy alternative pre-trained photo models, then modify the underlying captioning vocabulary. The latter attempts to cleanse and augment the overall vocabulary. Finally, we investigate how search discoverability might enable analysts to query large image repositories using both similar image and keyword matches [Shi, et al., 2017]. These results seek to discover if implementing such automated methods for assigning complex image metadata in situ might assist future satellite image analysts.
The research plan includes comparing many different image models to identify the best one for satellite images, then contrasting strategies for captioning them, some of which correct previous vocabulary shortcomings and others of which expand the baseline diversity of annotations. The initial RSICD dataset
[Lu, et al., 2017] consists of 10,921 satellite images broadly grouped into 30 characteristic scenes. RSICD particularly highlights land uses such as residential, urban or agricultural classes. To understand the image clusters, we group the scenes into five major categories as shown in Figure 3, including land biomes, residential density, zoning, cultural and transport-related classes. The dataset does not sub-class some satellite cases typically included elsewhere, like roads, construction, and other change detection scenes used for damage assessment from space.
Microsoft's generic caption bot and the present specialized satellite version. The generic bot describes the image as, “I think it’s a closeup of a building,” where the satellite training set specializes to: “many buildings and some green trees are in an industrial area.”
Figure 4. Word frequency distribution in RSICD skews to the top 30 terms and color-related adjacency annotations such as “green trees next to a blue roof”.
overhead mapping services such as Google or Baidu. The dataset does not try to specify either the resolution in ground sample distance (GSD), or lineage by camera or satellite services. Each image carries five human-annotated captions, a high proportion (60+%) of which include duplicate annotations. Thus, in total, the text portion of RSICD includes more than 50,000 sentences, 239,765 words, and the book equivalent of a 2100-page manuscript
describing the 30
baseline size with 2643 unique words. However, while this text labeling represents a massive machine learning enterprise, it nevertheless can be both pruned and augmented to address some of its challenges. For example, while having human annotators produce 50,000 descriptions when viewing satellite
imagery represents an undertaking, around sixty percent of the descriptions identically duplicate the others in the same series of five for each image (18,185 unique sentences, or 36,413 duplicates). It’s unclear if this general copying strategy helps the ultimate task of captioning previously unseen test images, or whether the authors sought to over-sample and balance the training text data.
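As a rough illustration of how a duplicate rate of this kind could be measured, the sketch below counts captions that repeat an earlier caption within the same image's set of five. The data layout (a dictionary mapping an image name to its caption list), the whitespace and case normalization, and the sample captions are all assumptions made for illustration; they are not the RSICD file format or the procedure used in this work.

```python
from typing import Dict, List, Tuple


def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(caption.lower().split())


def count_duplicates(captions_by_image: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (unique, duplicate) caption counts, judged within each image's own set."""
    unique = 0
    duplicate = 0
    for captions in captions_by_image.values():
        seen = set()
        for caption in captions:
            key = normalize(caption)
            if key in seen:
                duplicate += 1
            else:
                seen.add(key)
                unique += 1
    return unique, duplicate


# Hypothetical sample: two images with five captions each.
sample = {
    "beach_1.jpg": [
        "yellow beach is next to the green ocean .",
        "yellow beach is next to the green ocean .",
        "Yellow beach is next to the green ocean .",
        "a beach with white waves .",
        "a beach with white waves .",
    ],
    "airport_1.jpg": [
        "many planes are parked near a terminal .",
        "many planes are parked near a terminal .",
        "an airport with several buildings .",
        "an airport with several buildings .",
        "an airport with several buildings .",
    ],
}

unique, duplicate = count_duplicates(sample)
duplicate_share = duplicate / (unique + duplicate)
```

In this toy sample, six of the ten captions repeat an earlier caption for the same image, so `duplicate_share` is 0.6.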
Furthermore, we identified a substantial number of mis-spellings (14%), or broken syntax, such as “c shape” or “t road” treated as two distinct words, when in context, annotators might have better described “C-shaped” as a single token.
2.3. Text Limitations. As illustrated in Figure 4, the caption vocabulary follows a typical fat-tail word distribution that rapidly decays beyond the most frequent terms. In other contexts, this word commonality might make up a stop-word list. For example, half of the total word count (123,153 instances) comes from just the thirty most frequently used terms. In addition to the lack of descriptive diversity at the top rank of the distribution, there is a converse problem of many rare words at the long end; for instance, 42% of the annotators’ vocabulary is only used once. These limitations can be corrected by either cleansing with spell checking, pruning rare one-time uses with synonyms, or augmenting the overall vocabulary size with synonyms to reduce duplicates.
Figure 5. VGG-16 Layered Convolutional Neural Network Architecture.
2.4. Text Augmentation Strategies. We initiated all three strategies of textual augmentation [Wei, et al., 2019] and scored the results by both the models’ ability to learn the new nomenclature correctly (entropy loss) and the captioned correspondence to a reference sample (Table 2, Bilingual Evaluation Understudy score, or BLEU-1 to BLEU-4). Higher BLEU scores mean a greater match between generated and reference captions, tending toward a limit of 1 [Papineni, et al., 2002]. Lower scores represent a higher divergence (akin to temperature in other language generation models). In a position-independent way, BLEU counts the n-gram matches between candidate and reference. The mean test BLEU for other benchmarks, like Flickr8k, is typically 0.37 (see Vinyals, et al., 2015). In summary, we initially divided 10,921 RSICD images (224x224) with 50,000 captions into a traditional split: 70% training, 15% development, 15% testing (N=1528 images). We score both the image and text models simultaneously, with images scored by entropy loss and the captioning quality assessed by BLEU (generated text compared to reference).
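To make the BLEU-1 to BLEU-4 scoring concrete, the following is a minimal sketch of a sentence-level BLEU with clipped n-gram precision, uniform weights and a brevity penalty, in the spirit of Papineni, et al. (2002). It is a simplified illustration rather than the scoring code used for Table 2: it assumes whitespace-tokenized captions, applies no smoothing (so any zero n-gram precision yields a score of 0), and the function names and example captions are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    often as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical captions, tokenized on whitespace.
reference = "many planes are parked near the terminal".split()
candidate = "many planes are parked near a terminal".split()

bleu_1 = bleu(candidate, [reference], max_n=1)  # 6 of 7 unigrams match
bleu_2 = bleu(candidate, [reference], max_n=2)  # also uses 4 of 6 bigrams
```

Because one substituted word breaks two bigrams but only one unigram, the score drops as the n-gram order increases, which is why BLEU-4 is typically lower than BLEU-1 for the same caption.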
2.5. Large and Small Image Models for Feature Extraction. Our approach to the image model highlights mainly the size and architecture of each approach within traditional transfer learning, where the model is pre-trained except for fine-tuning on the final object classification layers. We initially applied VGG-16 and VGG-19 (Visual Geometry Group, Oxford), which stand out among other image classification networks because of their architectural simplicity [Simonyan, et al., 2014]. Only two building blocks are needed for VGG models, a 3x3 convolution and a 2x2 pooling layer, throughout the entire network. One drawback of
demands by 28% from previous state-of-the-art models [Zoph, et al., 2018]. The interest in these small but accurate models stems from the needs of low-power, edge computing applications with on-board satellite processors.
3. Results
We examined seven different cases for caption generation based on varying the underlying feature extraction model and extending the training vocabulary, either using synonyms, corrected syntax or reduced word diversity. Table 1 summarizes the experimental design along with overall results for different models and vocabulary approaches.
Figure 6. Airport scene from RSICD, useful for comparing the different caption generation strategies.
3.1 Promise of Smaller Image Models. Without any alteration of the captioning vocabulary, the best satellite image model for classification in RSICD was NASNetMobile, which was also the smallest in model size. The best captioning outcome came from the large VGG-19 model, but it was not substantially better than NASNetMobile. The success of this small
image annotation model offers a potentially
environments typically expected for edge computers.
3.2 Comparison to Previous Results. To compare the quality of generated and reference captions, Table 2 shows BLEU scores for each captioning strategy and model tested. Compared to Lu, et al. (2017), the BLEU scores shown in Table 2 exceed their best sequences for multimodal methods on RSICD. Harvesting just the best performers from their RNN and LSTM series (VLAD), we can compare the effects of variations in both the image and text models. The present version of transfer learning from the pretrained but large image model (VGG-19) and LSTM/RNN for its associated text closely corresponds to Lu, et al. (2017). The much reduced NASNetMobile model similarly ranks highly compared to the best previous models with attention but at 100-fold smaller model size both in RAM and on disk.
3.3 Comparative Example Between Model Cases. To illustrate the alternate captions that each model and vocabulary can generate, we excerpt the description of a common airport scene (image Airport_245 from RSICD). Table 3 shows that in this case, the VGG image models incorrectly identify the airport as a railway station or bridge, while NASNetMobile yields increasingly detailed and nuanced descriptions of the correct airport scene. This example highlights that improved syntax, whether corrected spelling or synonym diversity, does provide a richer descriptive vocabulary once the image model captures the key overhead objects of interest and orients them relative to each other. For instance, the baseline classification for many planes transforms after text augmentation steps to include other key potential landmarks, such as the terminal building, then finally identifying several buildings and green trees outside of the central tarmac identification [Wang, et al., 2019].
3.4 Challenge to Diversify the Vocabulary. The effect of augmented and corrected captions for training shows up as lower BLEU scores and higher image entropy losses. As one might expect for deep nets with a large set of tunable parameters, reducing the vocabulary can improve the memorization of input captions and lead to an apparently more accurate caption compared to a reference description. With that caveat, it’s still not clear that one wants to introduce mis-spellings and repetition as an intentional effect from the outset of building a training set. These results highlight the tradeoffs in developing the most expressive description set possible with the least amount of resources, either computationally in model size or syntactically in vocabulary diversity. One wants a rich but accurate initial vocabulary [Li, et al., 2007; Zhao, et al., 2015], particularly one that corrects the residential bias of previous image collections but also minimizes repetitive use of color, buildings and trees when describing future overhead imagery. One would anticipate that no amount of textual augmentation is likely to improve a mis-classified image just by offering more detailed and incorrect annotation.
Figure 7. Caption generator confusion matrix for NWPU test dataset with known scenes correctly labeled by caption mentions. The green diagonal shows correctly generated captions compared to the known NWPU scene labels. The lower table shows the descriptive attributes surrounding the overall scene annotations, such as white beach, yellow desert and trees in the forest.
3.5 Opportunities to Diversify and Augment the Test Dataset. The airport example in Figure 6 and Table 3 can be generalized to much larger, well-known satellite datasets with extensive labeling efforts to see if the generated captions from those images alone can match with the known objects. For this case, we catalogued eight similar scenes from RSICD classes (airport, beach, desert, forest, port, railway, river and stadium), then deployed that trained model on the NWPU (Figure 7) and xView (Figure 8) satellite datasets in those same groups. In this way, the matrix shows a good image model for the strong diagonal scene match and the improved vocabulary model for the lower (heatmap) table associating the subject with its accompanying adjectives. This bootstrap testing approach is widely employed in other machine learning contexts but seems not to have been applied previously to the satellite captioning problem. This application offers a way out of the labor-intensive task of manual annotation. Example annotations performed on the large xView repository, which were previously not captioned, are shown in Figure 9. One notable outcome for the analyst is now the ability to discover, search and share millions of these images either by keyword queries or caption and pixel similarities [Cordeiro, et al., 2010].
Figure 9. Example captions trained with NASNetMobile on RSICD and tested on xView. The red annotations highlight errors in either the image model’s ability to recognize the scene or objects outside of its class ontology such as transformer station or shipping.
3.6 Generalizing BLEU Scores with Novel Confusion Matrices. It is worth noting that unlike the 10,000 images captioned by RSICD or 800 images captioned by NWPU (REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU) [Cheng, et al., 2017]), the xView dataset alone collects bounding box detections on close to 1 million objects, many of which in image chips can be aggregated to form complex scenes like parking lots, airports and construction sites. As comparable ground truth, the “unknown” captions in xView cases can be confidently inferred because all the objects (yielding “nouns”) are known along with their spatial relationships (yielding “adjectives” and “verbs”). Together this triple of subject-predicate-object [Yang, et al., 2011] constitutes a well-formed caption so that xView scales well to a large repository of discoverable and searchable images. An example query for xView, in this case, might simply ask, “Show me all the airports near a river bridge”. After a monsoon or widespread flooding damage, that application might enable assessments of transport accessibility and availability by ground and air.
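A query such as “Show me all the airports near a river bridge” can be approximated, under simplifying assumptions, by keyword matching against generated captions. The sketch below builds a small inverted index from caption words to image identifiers and returns the images whose captions contain every query keyword. The stop-word list, the crude plural trimming, the image identifiers and the captions are all hypothetical, and the sketch matches keywords only; it does not verify spatial relations such as “near”.

```python
from collections import defaultdict
from typing import Dict, Set

# Assumed stop words, chosen only to strip filler from the example query.
STOP_WORDS = {"a", "an", "the", "all", "me", "show", "near", "next", "to",
              "of", "with", "and", "is", "are", "in", "over", "some"}


def tokenize(text: str) -> Set[str]:
    """Lowercase, strip punctuation, drop stop words, trim a trailing plural 's'."""
    words = set()
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalnum())
        if not word or word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]  # crude singularization, applied to captions and queries alike
        words.add(word)
    return words


def build_index(captions: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each caption keyword to the set of image ids whose caption mentions it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image_id, caption in captions.items():
        for word in tokenize(caption):
            index[word].add(image_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return image ids whose captions contain every keyword in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = None
    for term in terms:
        matches = index.get(term, set())
        results = set(matches) if results is None else results & matches
    return results


# Hypothetical generated captions keyed by hypothetical image chip ids.
captions = {
    "chip_001": "an airport with many planes next to a river and a bridge .",
    "chip_002": "an airport with several buildings and green trees .",
    "chip_003": "a bridge over a river near some green trees .",
}
index = build_index(captions)
hits = search(index, "Show me all the airports near a river bridge")
```

Here only `chip_001` mentions an airport, a river and a bridge together, so it is the single hit; a similar-image search would need pixel or feature similarity in addition to this keyword index.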
This approach generates a traditional multi-class confusion matrix, with the diagonal indicating the number of instances in which our trained caption generator can include the correct matching keywords for the actual scenes as labeled by either xView (Figure 8) or NWPU (Figure 7) object detectors. For instance, a true positive for a known xView airport scene should include a caption mentioning the term “airport”, which then would be counted as correctly assigned or true positive in these captioning error matrices.
This method effectively isolates the overall subject-noun relationship one might hope for in a decent sentence generator, and largely evaluates the accuracy of the image model itself. In other words, at a minimum, a good beach scene description should mention the noun, beach. The more nuanced language of a human analyst might provide additional adjectival support, such as describing the beach as white or the desert as yellow. The lower heatmaps in both Figures 7-8 isolate the keywords found common in both the bounding box description and the generated text output from the image annotator. Note that the annotation has no prior knowledge of the existing xView object detections; each image is processed as a new test case then the output is compared to what the human labelers for xView also marked with boxes. For example, the lower half of Figure 8 shows that a beach scene from xView includes in both our annotations and in its own labels, the expected elements of “sea”, “waves”, “white” and “yellow”. A reasonable caption thus could iterate variations on the short description: “A yellow beach next to the sea with waves”.
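The caption-based confusion matrix described above can be sketched as follows: each test image has a known scene label, and the generated caption is assigned to whichever scene keyword it mentions. The eight scene names come from the classes listed in Section 3.5; the rule of taking the first mentioned keyword, the "none" column for captions that mention no scene, and the sample captions are assumptions for illustration and not necessarily how Figures 7 and 8 were produced.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

SCENES = ["airport", "beach", "desert", "forest", "port", "railway", "river", "stadium"]


def mentioned_scene(caption: str, scenes: List[str] = SCENES) -> str:
    """Return the first scene keyword mentioned in the caption, or 'none'."""
    for word in caption.lower().split():
        token = word.strip(".,;:!?\"'")
        if token in scenes:  # whole-word match, so 'airport' never counts as 'port'
            return token
    return "none"


def confusion_matrix(pairs: Iterable[Tuple[str, str]],
                     scenes: List[str] = SCENES) -> Dict[str, Counter]:
    """Rows are true scene labels; columns count the scene mentioned in each caption."""
    matrix = {scene: Counter() for scene in scenes}
    for true_scene, caption in pairs:
        matrix[true_scene][mentioned_scene(caption, scenes)] += 1
    return matrix


def diagonal_accuracy(matrix: Dict[str, Counter]) -> float:
    """Share of captions that mention their own true scene label."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[scene] for scene, row in matrix.items())
    return correct / total if total else 0.0


# Hypothetical (true label, generated caption) pairs.
pairs = [
    ("airport", "many planes are parked at the airport ."),
    ("airport", "a railway station with many buildings ."),
    ("beach", "a yellow beach next to the sea with waves ."),
    ("desert", "a piece of yellow desert ."),
    ("river", "green trees are on both sides of the water ."),
]
matrix = confusion_matrix(pairs)
accuracy = diagonal_accuracy(matrix)
```

In this toy sample three of five captions name their true scene, one airport is described as a railway (an off-diagonal entry), and one river caption names no scene at all.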
This research has tested seven new RNN-LSTM transfer-learning models for satellite image captioning. The models were trained initially on the RSICD dataset, then compared to both pruned and augmented vocabularies for better annotations. The image models included both large models (VGG-16 and VGG-19), to compare with previous literature, and smaller models (NASNetMobile) that can
vocabularies, this metric highlights comparisons to a reference sentence, which in our case, becomes less well-defined as the vocabulary gets larger and more diverse (see Table 4). The major effect of augmenting and pruning caption vocabulary is to raise both the word- and reading-complexity while either reducing or keeping relatively constant the total number of unique words. This tradeoff highlights the min-max strategy for creating the richest descriptions in the fewest words. A secondary effect is to raise (and lower) selectively the total caption count by 60 percent. By examining the example output qualitatively, the resulting large vocabulary appears to provide a more convincing caption, although all text generation models will fail in the absence of a good image model.
4.2 Overall Summary. To make this point quantitatively, we increased the test images by more than a factor of five compared to the largest previous captioning task, then applied our trained annotator to generate descriptions and scored that output in multi-label confusion matrices. We believe this novel approach holds promise to leverage the large amount of existing satellite detection labels and locations. It also alleviates a bottleneck in the laborious task of human labelers required to generate the approximately 2100 pages of annotations published in RSICD. In summary, this work has introduced new image and text models, captioned a five-fold larger dataset for future work, and applied confusion matrices in a novel way to highlight caption success without needing a prior reference sentence for BLEU comparisons.
4.3 Further Research Opportunities. When considering
future approaches, note that current human annotators leave out
much of their associated world knowledge. For instance, Lu, et al.
(2017) offer annotation instructions to identify image parts
in six words, without using compass directions (North),
“There is…”, or vague descriptions like tall, large, or many.
However, even a young adult might note much more about
the physical world, such that the beach scene in Figure 1
looks like “a windy day” or “Californian rocky beach”. In
either case, the better caption requires some insight into
either physical association between waves and wind, or geo-
location familiarity.
Further promising work should augment captions by not
only substituting synonyms or correcting phrasing as done
here but also trying new methods such as passing each
caption multiple times through language translators and
back translation, which has proven to yield interesting
results [Ma, 2019; Wei, et al. 2019]. For example, after passing a complex translation cycle (English-to-Spanish, Spanish-to-German, German-to-French, French-to-English), the initial caption, “Island next to crashing waves”, simplifies and generalizes into “Island with waves”. This strategy illustrated in Figure 10 provides automatic caption generalizations without changing their true labels.
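The translation cycle above can be sketched as a chain of language hops applied to each caption. In the sketch below, the `translate` callable is a hypothetical placeholder for whatever translation service one chooses; no particular translation API is assumed. The toy lookup table exists only so the example runs offline: its intermediate strings are invented, and only the first and last captions come from the example in the text.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (text, source_language, target_language) -> translated text
Translator = Callable[[str, str, str], str]


def back_translate(caption: str, languages: Sequence[str], translate: Translator) -> str:
    """Pass a caption through consecutive language pairs, e.g. en -> es -> de -> fr -> en."""
    text = caption
    for source, target in zip(languages, languages[1:]):
        text = translate(text, source, target)
    return text


def augment(captions: List[str], languages: Sequence[str], translate: Translator) -> List[str]:
    """Keep each original caption and add its back-translated variant when it differs."""
    augmented = []
    for caption in captions:
        augmented.append(caption)
        variant = back_translate(caption, languages, translate)
        if variant.strip().lower() != caption.strip().lower():
            augmented.append(variant)
    return augmented


# Toy stand-in for a real translation service (intermediate strings are invented).
TOY_TABLE = {
    ("Island next to crashing waves", "en", "es"): "Isla junto a olas rompientes",
    ("Isla junto a olas rompientes", "es", "de"): "Insel mit Wellen",
    ("Insel mit Wellen", "de", "fr"): "Île avec des vagues",
    ("Île avec des vagues", "fr", "en"): "Island with waves",
}


def toy_translate(text: str, source: str, target: str) -> str:
    """Look up a canned translation; return the text unchanged when none is known."""
    return TOY_TABLE.get((text, source, target), text)


cycle = ["en", "es", "de", "fr", "en"]
expanded = augment(["Island next to crashing waves"], cycle, toy_translate)
```

Keeping the original caption alongside its variant preserves the image's label while adding a paraphrase; captions that come back unchanged add nothing and are skipped.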
Future work should also highlight alternate pre-trained image models (ResNet, U-Net, etc.) and text transformers (BERT and GPT variants) to improve the overall output [see Budzianowski & Vulić, 2019]. It is worth noting however that the model sizes may not be compatible with edge processors and will likely overfit and amplify many of the anomalies found here in existing annotation data, such as mis-spellings, incorrect tenses and repetitive labels.
Figure 10. The back translation from English input through 3 intermediate translation APIs and then returned to English as automated augmentations.
Acknowledgements. The authors would like to thank the PeopleTec Technical Fellows program for encouragement and project assistance. This research benefited from support from U.S. Army Space and Missile Defense Command/Army Forces Strategic Command.
Andres, S., Arvor, D., & Pierkot, C. (2012, November). Towards an ontological approach for classifying remote sensing images. In 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems (pp. 825-832). IEEE.
Blanchart, P., & Datcu, M. (2010). A semi-supervised algorithm for auto-annotation and unknown structures discovery in satellite image databases. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 3(4), 698-717.
Bratasanu, D., Nedelcu, I., & Datcu, M. (2010). Bridging the semantic gap for satellite image annotation and automatic mapping applications. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 4(1), 193-204.
Brownlee, J. (2017). Deep Learning for Natural Language Processing: Develop Deep Learning Models for your Natural Language Problems. Machine Learning Mastery.
Budzianowski, P., & Vulić, I. (2019). Hello, It's GPT-2--How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems. arXiv preprint arXiv:1907.05774.
Chen, J., Han, Y., Wan, L., Zhou, X., & Deng, M. (2019). Geospatial relation captioning for high-spatial-resolution images by using an attention-based neural network. International Journal of Remote Sensing, 40(16), 6482-6498.
Cheng, G., Han, J., & Lu, X. (2017). Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10), 1865-1883.
Christie, G., Fendley, N., Wilson, J., & Mukherjee, R. (2018). Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6172-6180).
Cordeiro, R. L., Guo, F., Haverkamp, D. S., Horne, J. H., Hughes, E. K., Kim, G., ... & Faloutsos, C. (2010, December). Qmas: Querying, mining and summarization of multi-modal databases. In 2010 IEEE International Conference on Data Mining (pp. 785-790). IEEE.
Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., ... & Raska, R. (2018, June). Deepglobe 2018: A challenge to parse the earth through satellite images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 172-17209). IEEE.
Dumitru, C. O., Cui, S., Schwarz, G., & Datcu, M. (2014). Information content of very-high-resolution SAR images: Semantics, geospatial context, and ontologies. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(4), 1635-1650.
Espinoza-Molina, D., & Datcu, M. (2013). Earth-observation image retrieval based on content, semantics, and metadata. IEEE Transactions on Geoscience and Remote Sensing, 51(11), 5145-5159.
Feng, Y., & Lapata, M. (2010, June). Topic models for image annotation and text illustration. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 831-839). Association for Computational Linguistics.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451-2471.
Kyzirakos, K., Karpathiotakis, M., Garbis, G., Nikolaou, C., Bereta, K., Papoutsis, I., ... & Kontoes, C. (2014). Wildfire monitoring using satellite images, ontologies and linked geospatial data. Web Semantics: Science, Services and Agents on the World Wide Web, 24, 18-26.
Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M., ... & McCord, B. (2018). xview: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856.
Li, Y., & Bretschneider, T. R. (2007). Semantic-sensitive satellite image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 45(4), 853-860.
Lu, X., Wang, B., Zheng, X., & Li, X. (2017). Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing, 56(4), 2183-2195. RSICD dataset available for download: https://github.com/201528014227051/RSICD_optimal
Luo, W., Li, H., & Liu, G. (2011). Automatic annotation of multispectral satellite images using author–topic model. IEEE Geoscience and Remote Sensing Letters, 9(4), 634-638.
Luo, W., Li, H., Liu, G., & Zeng, L. (2013). Semantic annotation of satellite images using author–genre–topic model. IEEE Transactions on Geoscience and Remote Sensing, 52(2), 1356-1368.
Ma, E. (2019). Data Augmentation in NLP. https://towardsdatascience.com/data-augmentation-in-NLP-2801a34dfc28
Mao, G., Yuan, Y., & Xiaoqiang, L. (2018, August). Deep Cross-Modal Retrieval for Remote Sensing Image and Audio. In 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS) (pp. 1-7). IEEE.
Microsoft. (2019). CaptionBot. https://www.captionbot.ai/
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Association for Computational Linguistics.
Shi, Z., & Zou, Z. (2017). Can a machine generate humanlike language descriptions for a remote sensing image? IEEE Transactions on Geoscience and Remote Sensing, 55(6), 3623-3634.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition (VGG). arXiv preprint arXiv:1409.1556.
Tanti, M., Gatt, A., & Camilleri, K. P. (2018). Where to put the image in an image caption generator. Natural Language Engineering, 24(3), 467-489.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156-3164).
Wang, B., Lu, X., Zheng, X., & Li, X. (2019). Semantic descriptions of high-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters.
Wei, J. W., & Zou, K. (2019). Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.
Yang, Y., Teo, C. L., Daumé III, H., & Aloimonos, Y. (2011, July). Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 444-454). Association for Computational Linguistics.
Yao, X., Han, J., Cheng, G., Qian, X., & Guo, L. (2016). Semantic annotation of high-resolution satellite images via weakly supervised learning. IEEE Transactions on Geoscience and Remote Sensing, 54(6), 3660-3671.
Zhang, X., Wang, X., Tang, X., Zhou, H., & Li, C. (2019). Description generation for remote sensing images using attribute attention mechanism. Remote Sensing, 11(6), 612.
Zhao, B., Zhong, Y., Xia, G. S., & Zhang, L. (2015). Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 54(4), 2108-2123.
Zhu, Q., Zhong, Y., Zhao, B., Xia, G. S., & Zhang, L. (2016). Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geoscience and Remote Sensing Letters, 13(6), 747-751.
Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition (NASNetMobile). In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8697-8710).