Computer vision (CV) is an interdisciplinary scientific field of artificial intelligence (AI), which deals with how computers can be made to gain high-level understanding from digital images or videos. The tasks of computer vision include methods for acquiring, processing, analyzing and understanding digital images, and the process of extracting numerical or symbolic information, e.g., in the forms of decisions or predictions, from high-dimensional raw image data in the real world.
As an interesting, fundamental and challenging problem in computer vision, fine-grained image analysis (FGIA) has been an active area of research for several decades. The goal of FGIA is to retrieve, recognize and generate images belonging to multiple subordinate categories of a super-category (aka meta-category), e.g., different species of animals/plants, different models of cars, different kinds of retail products, etc. (cf. Fig. 1). In the real world, FGIA enjoys a wide range of applications in both industry and research communities, such as automatic biodiversity monitoring, climate change evaluation, intelligent retail, intelligent transportation, and many more. In particular, a number of influential academic competitions on FGIA are frequently held on Kaggle. Representative competitions include the Nature Conservancy Fisheries Monitoring (for fish species categorization) and Humpback Whale Identification (for whale identity categorization). Each competition attracted more than 300 teams worldwide, and some even exceeded 2,000 teams.
Figure 1: Fine-grained image analysis vs. generic image analysis (taking the recognition task as an example).
On the other hand, deep learning techniques [LeCun et al., 2015] have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of FGIA. By a rough yearly count, around ten deep learning based FGIA papers are published on average at each of the premier AI and CV conferences, such as IJCAI, AAAI, CVPR, ICCV and ECCV. This shows that FGIA with deep learning is of notable research interest. Given this period of rapid evolution, the aim of this paper is to provide a comprehensive survey of the recent achievements in the FGIA field brought by deep learning techniques.
In the literature, there is an existing survey related to fine-grained tasks, i.e., [Zhao et al., 2017], which simply included several fine-grained recognition approaches for comparison. Our work differs from it in that ours is more comprehensive. Specifically, in addition to fine-grained recognition, we also analyze and discuss the other two central fine-grained analysis tasks, i.e., fine-grained image retrieval and fine-grained image generation, which cannot be overlooked as they are two integral aspects of FGIA. Additionally, at PRICAI, another important AI conference in the Pacific Rim nations, Wei and Wu organized a dedicated tutorial on the fine-grained image analysis topic. We refer interested readers to the tutorial, which provides additional detailed information.
Figure 2: Main aspects of our hierarchical and structural organization of fine-grained image analysis (FGIA) in this survey paper.
In this paper, our survey takes a unique deep learning based perspective to review the recent advances of FGIA in a systematic and comprehensive manner. The main contributions of this survey are three-fold:
• We give a comprehensive review of FGIA techniques based on deep learning, including problem backgrounds, benchmark datasets, a family of FGIA methods with deep learning, domain-specific FGIA applications, etc.
• We provide a systematic overview of recent advances of deep learning based FGIA techniques in a hierarchical and structural manner, cf. Fig. 2.
• We discuss the challenges and open issues, and identify the new trends and future directions to provide a potential road map for fine-grained researchers or other interested readers in the broad AI community.
Figure 3: Key challenge of fine-grained image analysis, i.e., small inter-class variations and large intra-class variations. Each row of the figure presents one of four Tern species.
The rest of the survey is organized as follows. Section 2 introduces the background of this paper, i.e., the FGIA problem and its main challenges. In Section 3, we review multiple commonly used fine-grained benchmark datasets. Section 4 analyzes the three main paradigms of fine-grained image recognition. Section 5 presents recent progress in fine-grained image retrieval. Section 6 discusses fine-grained image generation from a generative perspective. Furthermore, in Section 7, we introduce some other domain-specific real-world applications related to FGIA. Finally, we conclude this paper and discuss future directions and open issues in Section 8.
In this section, we summarize the related background of this paper, including the problem and its key challenges.
Fine-grained image analysis (FGIA) focuses on dealing with the objects belonging to multiple sub-categories of the same meta-category (e.g., birds, dogs and cars), and generally involves central tasks like fine-grained image recognition, fine-grained image retrieval, fine-grained image generation, etc.
What distinguishes FGIA from generic image analysis is the following: in generic image analysis, the target objects belong to coarse-grained meta-categories (e.g., birds, oranges and dogs), and thus are visually quite different. However, in FGIA, since objects come from sub-categories of one meta-category, the fine-grained nature makes them visually quite similar. We take image recognition for illustration. As shown in Fig. 1, in fine-grained recognition, the task is to identify multiple similar breeds of dogs, e.g., Husky, Samoyed and Alaska. For accurate recognition, it is desirable to distinguish them by capturing slight and subtle differences (e.g., ears, noses, tails), which also meets the demand of other FGIA tasks (e.g., retrieval and generation).
Figure 4: Example fine-grained images belonging to different species of flowers/vegetables, different models of cars/aircraft and different kinds of retail products. Accurate identification of these fine-grained objects depends on discriminative but subtle object parts or image regions. (Best viewed in color and zoomed in.)
Table 1: Summary of popular fine-grained image datasets. Note that “BBox” indicates whether the dataset provides object bounding box supervision. “Part anno.” means that key part localizations are provided. “HRCHY” corresponds to hierarchical labels. “ATR” represents attribute labels (e.g., wing color, male, female, etc.). “Texts” indicates whether fine-grained text descriptions of images are supplied.
Furthermore, the fine-grained nature also brings small inter-class variations caused by highly similar sub-categories, and large intra-class variations in poses, scales and rotations, as presented in Fig. 3. This is the opposite of generic image analysis (i.e., small intra-class variations and large inter-class variations), which makes fine-grained image analysis a challenging problem.
In the past decade, the vision community has released many benchmark fine-grained datasets covering diverse domains such as birds [Wah et al., 2011; Berg et al., 2014], dogs [Khosla et al., 2011], cars [Krause et al., 2013], airplanes [Maji et al., 2013], flowers [Nilsback and Zisserman, 2008], vegetables [Hou et al., 2017], fruits [Hou et al., 2017], retail products [Wei et al., 2019a], etc. (cf. Fig. 4). In Table 1, we list a number of image datasets commonly used by the fine-grained community, and specifically indicate their meta-category, the number of fine-grained images, the number of fine-grained categories, and the additional kinds of supervision available, i.e., bounding boxes, part annotations, hierarchical labels, attribute labels and text descriptions, cf. Fig. 5.
These datasets have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing the performance of competing approaches, but also by pushing this field towards increasingly complex, practical and challenging problems.
Specifically, among them, CUB200-2011 is one of the most popular fine-grained datasets. Almost all FGIA approaches choose it for comparisons with state-of-the-art methods. Moreover, contributions continue to be built upon CUB200-2011 for further research, e.g., collecting text descriptions of the fine-grained images for multi-modality analysis, cf. [Reed et al., 2016; He and Peng, 2017a].
Figure 5: An example image with its supervisions associated with CUB200-2011. As shown, the multiple types of supervision include: image labels, part annotations (aka key point localizations), object bounding boxes (i.e., the green one), attribute labels (i.e., “ATR”), and text descriptions in natural language. (Best viewed in color.)
Additionally, in recent years, more challenging and practical fine-grained datasets have been increasingly proposed, e.g., iNat2017 for natural species of plants and animals [Horn et al., 2017] and RPC for daily retail products [Wei et al., 2019a]. These datasets exhibit many novel features, to name a few: large scale, hierarchical structure, domain gap and long-tailed distribution. These features reflect practical real-world requirements and could stimulate studies of FGIA in more realistic settings.
Fine-grained image recognition has been the most active research area of FGIA in the past decade. In this section, we review the milestones of fine-grained recognition frameworks since deep learning entered the field. Broadly, these fine-grained recognition approaches can be organized into three main paradigms, i.e., fine-grained recognition (1) with localization-classification subnetworks; (2) with end-to-end feature encoding and (3) with external information. Among them, the first and second paradigms restrict themselves to the supervisions associated with fine-grained images, such as image labels, bounding boxes, part annotations, etc. Because automatic recognition systems relying on these supervisions alone cannot yet achieve excellent performance under the fine-grained challenges, researchers have gradually attempted to involve external but cheap information (e.g., web data, text descriptions) in fine-grained recognition to further improve accuracy, which corresponds to the third paradigm. The commonly used evaluation metric in fine-grained recognition is the classification accuracy averaged across all the subordinate categories of a dataset.
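As a rough illustration of the evaluation metric mentioned above, the following sketch computes the classification accuracy averaged over categories, i.e., the mean of the per-category accuracies. Reading "averaged across categories" in this way is an assumption; plain accuracy over all test images gives the same value when every category has the same number of test images. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-category accuracies (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = defaultdict(int)  # correctly classified images per category
    total = defaultdict(int)    # test images per category
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example: three test images of one sub-category, one of another.
y_true = ["husky", "husky", "husky", "samoyed"]
y_pred = ["husky", "husky", "samoyed", "samoyed"]
acc = mean_per_class_accuracy(y_true, y_pred)
print(f"class-averaged accuracy: {acc:.4f}")
```

In this toy case the per-category accuracies are 2/3 and 1, so the class-averaged value differs from the plain accuracy of 3/4, which shows why the choice of averaging matters on imbalanced test sets.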
4.1 By localization-classification subnetworks
To mitigate the challenge of intra-class variations, researchers in the fine-grained community pay attention to capturing discriminative semantic parts of fine-grained objects and then constructing a mid-level representation corresponding to these parts for the final classification. Specifically, a localization subnetwork is designed for locating these key parts. A classification subnetwork then follows and is employed for recognition. The framework of these two collaborative subnetworks forms the first paradigm, i.e., fine-grained recognition with localization-classification subnetworks.
Thanks to the localization information, e.g., part-level bounding boxes or segmentation masks, such a framework can obtain more discriminative mid-level (part-level) representations w.r.t. these fine-grained parts. It also further enhances the learning capability of the classification subnetwork, which could significantly boost the final recognition accuracy.
Earlier works belonging to this paradigm depend on additional dense part annotations (aka key point localizations) to locate semantic key parts (e.g., head, torso) of objects. Some of them learn part-based detectors [Zhang et al., 2014; Lin et al., 2015a], and some of them leverage segmentation methods for localizing parts [Wei et al., 2018a]. Then, these methods concatenate multiple part-level features into a whole-image representation, and feed it into the following classification subnetwork for final recognition. Thus, these approaches are also termed part-based recognition methods.
However, obtaining such dense part annotations is labor-intensive, which limits both the scalability and practicality of real-world fine-grained applications. Recently, a trend has emerged in which more techniques under this paradigm only require image labels [Jaderberg et al., 2015; Fu et al., 2017; Zheng et al., 2017; Sun et al., 2018] for accurate part localization. Their common motivation is to first find the corresponding parts and then compare their appearance. Concretely, it is desirable to capture semantic parts (e.g., head and torso) that are shared across fine-grained categories and, meanwhile, to discover the subtle differences between these part representations. Advanced techniques, like attention mechanisms [Yang et al., 2018] and multi-stage strategies [He and Peng, 2017b], complicate the joint training of the integrated localization-classification subnetworks.
4.2 By end-to-end feature encoding
Different from the first paradigm, the second paradigm, i.e., end-to-end feature encoding, aims to directly learn a more discriminative feature representation by developing powerful deep models for fine-grained recognition. The most representative method among them is Bilinear CNNs [Lin et al., 2015b], which represents an image as a pooled outer product of features derived from two deep CNNs, and thus encodes higher-order statistics of convolutional activations to enhance the mid-level learning capability. Thanks to their high model capacity, Bilinear CNNs achieve remarkable fine-grained recognition performance. However, the extremely high dimensionality of bilinear features still makes them impractical for realistic applications, especially large-scale ones.
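To make the "pooled outer product" concrete, the following minimal sketch sum-pools the outer products of two sets of local features taken at the same spatial locations. It is only an illustration under stated assumptions, not the implementation of [Lin et al., 2015b]: the toy feature values and function names are hypothetical, and the signed square root plus L2 normalization is included as an assumed post-processing step that is not described in this survey.

```python
import math


def bilinear_pool(feats_a, feats_b):
    """Sum-pool outer products of two local feature sets.

    feats_a: list of L vectors of length c_a (one per spatial location)
    feats_b: list of L vectors of length c_b (same locations)
    Returns a flat vector of length c_a * c_b.
    """
    if not feats_a or len(feats_a) != len(feats_b):
        raise ValueError("both streams need the same, non-zero number of locations")
    c_a, c_b = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (c_a * c_b)
    for a, b in zip(feats_a, feats_b):
        for i in range(c_a):
            for j in range(c_b):
                pooled[i * c_b + j] += a[i] * b[j]
    return pooled


def normalize(v):
    """Signed square root, then L2 normalization (assumed post-processing)."""
    v = [math.copysign(math.sqrt(abs(x)), x) for x in v]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


# Toy input: 2 spatial locations, stream A has 2 channels, stream B has 3.
feats_a = [[1.0, 2.0], [0.0, 1.0]]
feats_b = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
pooled = bilinear_pool(feats_a, feats_b)
descriptor = normalize(pooled)
print(len(descriptor), descriptor)
```

The descriptor length is the product of the two channel counts, so, for instance, two hypothetical 512-channel streams would yield 512 × 512 = 262,144 dimensions, which illustrates the dimensionality issue raised above.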
To address this problem, more recent attempts, e.g., [Gao et al., 2016; Kong and Fowlkes, 2017; Cui et al., 2017], try to aggregate low-dimensional embeddings by applying tensor sketching [Pham and Pagh, 2013; Charikar et al., 2002], which can approximate the bilinear features while maintaining comparable or higher recognition accuracy. Other works, e.g., [Dubey et al., 2018], focus on designing a specific loss function tailored for fine-grained recognition that can drive the whole deep model to learn discriminative fine-grained representations.
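The idea behind tensor sketching can be illustrated with a small sketch: each input vector is compressed with a Count Sketch [Charikar et al., 2002], and the circular convolution of the two sketches equals the Count Sketch of the full outer product under a combined hash [Pham and Pagh, 2013], so the outer product never has to be formed. The code below is an illustrative assumption, not the implementation of any cited method; all names are hypothetical, and the circular convolution is computed directly rather than with the FFT that practical implementations would typically use.

```python
import random


def make_hash(d, D, rng):
    """Draw a random bucket in [0, D) and a random sign for each of d dims."""
    h = [rng.randrange(D) for _ in range(d)]
    s = [rng.choice((-1, 1)) for _ in range(d)]
    return h, s


def count_sketch(x, h, s, D):
    """Project x to D dims: each entry is added, with its sign, to its bucket."""
    out = [0.0] * D
    for i, xi in enumerate(x):
        out[h[i]] += s[i] * xi
    return out


def circular_convolution(u, v):
    """Direct O(D^2) circular convolution (an FFT would be used in practice)."""
    D = len(u)
    out = [0.0] * D
    for i in range(D):
        for j in range(D):
            out[(i + j) % D] += u[i] * v[j]
    return out


def tensor_sketch(x, y, hx, sx, hy, sy, D):
    """Sketch of the outer product of x and y without forming it."""
    return circular_convolution(count_sketch(x, hx, sx, D),
                                count_sketch(y, hy, sy, D))


def explicit_sketch(x, y, hx, sx, hy, sy, D):
    """Reference: Count Sketch of the explicit outer product, combined hash."""
    out = [0.0] * D
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(hx[i] + hy[j]) % D] += sx[i] * sy[j] * xi * yj
    return out


rng = random.Random(0)
d, D = 6, 4  # the 6 x 6 = 36 bilinear entries are compressed to 4 dims
x = [rng.uniform(-1, 1) for _ in range(d)]
y = [rng.uniform(-1, 1) for _ in range(d)]
hx, sx = make_hash(d, D, rng)
hy, sy = make_hash(d, D, rng)
compact = tensor_sketch(x, y, hx, sx, hy, sy, D)
reference = explicit_sketch(x, y, hx, sx, hy, sy, D)
print(max(abs(a - b) for a, b in zip(compact, reference)))
```

The two routes agree up to floating-point error, which is the identity that lets compact variants approximate bilinear features at a much lower dimensionality.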
4.3 With external information
As mentioned above, beyond the conventional recognition paradigms, another paradigm is to leverage external information, e.g., web data, multi-modality data or human-computer interactions, to further assist fine-grained recognition.
With web data. To identify the minor distinctions among various fine-grained categories, sufficient well-labeled training images are in high demand. However, accurate human annotations for fine-grained categories are not easy to acquire, due to the difficulty of annotation (which typically requires domain experts) and the myriad fine-grained categories (i.e., often thousands of subordinate categories in a meta-category).
Therefore, some fine-grained recognition methods seek to utilize free but noisy web data to boost recognition performance. The majority of existing works in this line can be roughly grouped into two directions. One is to crawl noisily labeled web data for the test categories as training data, which is regarded as webly supervised learning [Zhuang et al., 2017; Sun et al., 2019]. The main efforts of these approaches concentrate on: (1) overcoming the dataset gap between easily acquired web images and the well-labeled data from standard datasets; and (2) reducing the negative effects caused by the noisy data. To deal with these problems, deep learning techniques such as adversarial learning [Goodfellow et al., 2014] and attention mechanisms [Zhuang et al., 2017] are frequently utilized. The other direction is to transfer knowledge from auxiliary categories with well-labeled training data to the test categories, which usually employs zero-shot learning [Niu et al., 2018] or meta learning [Zhang et al., 2018] to achieve that goal.
Figure 6: An example knowledge graph for modeling the category-attribute correlations on CUB200-2011.
With multi-modality data. Multi-modal analysis has attracted a lot of attention with the rapid growth of multi-media data (e.g., image, text, knowledge base, etc.). In fine-grained recognition, multi-modality data is used to establish joint representations/embeddings that incorporate multi-modality information, which can boost fine-grained recognition accuracy. In particular, frequently utilized multi-modality data includes text descriptions (e.g., sentences and phrases in natural language) and graph-structured knowledge bases. Compared with strong supervisions of fine-grained images, e.g., part annotations, text descriptions are weak supervisions. Besides, text descriptions can be provided relatively accurately by ordinary humans, rather than by experts in a specific domain. In addition, a high-level knowledge graph, such as DBpedia [Lehmann et al., 2015], is an existing resource that contains rich professional knowledge. In practice, both text descriptions and knowledge bases are effective as extra guidance for better fine-grained image representation learning.
Specifically, [Reed et al., 2016] collects text descriptions, and introduces a structured joint embedding for zero-shot fine-grained image recognition by combining texts and images. Later, [He and Peng, 2017a] combines the vision and language streams in a jointly trained end-to-end fashion to preserve the intra-modality and inter-modality information for generating complementary fine-grained representations. For fine-grained recognition with knowledge bases, some works, e.g., [Chen et al., 2018; Xu et al., 2018a], introduce knowledge base information (typically associated with attribute labels, cf. Fig. 6) to implicitly enrich the embedding space (while also reasoning about the discriminative attributes of fine-grained objects).
With humans in the loop. Fine-grained recognition with humans in the loop is usually an iterative system composed of a machine and a human user, which combines both human and machine efforts and intelligence. It also requires the system to use as little human labor as possible. Generally, for these kinds of recognition methods, the system in each round seeks to understand how humans perform recognition, e.g., by asking untrained humans to label the image class and pick out hard examples [Cui et al., 2016], or by identifying key part localizations and selecting discriminative features [Deng et al., 2016] for fine-grained recognition.
Figure 7: An illustration of fine-grained retrieval. Given a query image (aka probe) of a particular car model, fine-grained retrieval is required to return images of the same car model from a car database (aka gallery). In this figure, the top-4 returned image, marked with a red rectangle, is a wrong result, since it shows a different car model.
Beyond image recognition, fine-grained retrieval is another crucial aspect of FGIA and emerges as a hot topic. Its evaluation metric is the common mean average precision (mAP).
In fine-grained image retrieval, given database images of the same meta-category (e.g., birds or cars) and a query, the system should return images of the same variety as the query, without resorting to any other supervision signals, cf. Fig. 7. Generic image retrieval focuses on retrieving near-duplicate images based on similarities in their contents (e.g., textures, colors and shapes), whereas fine-grained retrieval focuses on retrieving images of the same type (e.g., the same subordinate species for animals and the same model for cars). Meanwhile, objects in fine-grained images have only subtle differences, and vary in poses, scales and rotations.
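For reference, the mAP metric used for retrieval can be sketched as follows. For each query, the returned list is reduced to a sequence of relevance flags (whether each returned image belongs to the same sub-category as the query), average precision is computed over that ranking, and the results are averaged over queries. The sketch assumes that each ranked list covers the whole database, so that all relevant images appear in it; the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query, given relevance flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query average precisions."""
    if not all_rankings:
        raise ValueError("at least one query is required")
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Toy example with two queries.
rankings = [
    [True, False, True, False],  # AP = (1/1 + 2/3) / 2
    [False, True],               # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
print(f"mAP: {map_score:.4f}")
```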
In the literature, [Wei et al., 2017] is the first attempt at fine-grained image retrieval using deep learning. It employs pre-trained CNN models to select meaningful deep descriptors by localizing the main object in fine-grained images in an unsupervised manner, and further reveals that selecting only useful deep descriptors, with background and noise removed, could significantly benefit retrieval tasks. Recently, to break through the limitation of unsupervised fine-grained retrieval with pre-trained models, some works [Zheng et al., 2018; Zheng et al., 2019] explore novel loss functions under the supervised metric learning paradigm. Meanwhile, they still design additional sub-modules tailored for fine-grained objects, e.g., the weakly-supervised localization module proposed in [Zheng et al., 2018], which is inspired by [Wei et al., 2017].
Apart from supervised learning tasks, image generation is a representative topic of unsupervised learning. It deploys deep generative models, e.g., GAN [Goodfellow et al., 2014], to learn to synthesize realistic images which look visually authentic. As the quality of generated images becomes higher, more challenging goals are expected, i.e., fine-grained image generation. As the term suggests, fine-grained generation synthesizes images in fine-grained categories such as faces of a specific person or objects in a subordinate category.
The first work in this line was CVAE-GAN proposed in [Bao et al., 2017], which combines a variational auto-encoder with a generative adversarial network under a conditional generative process to tackle this problem. Specifically, CVAE-GAN models an image as a composition of label and latent attributes in a probabilistic model. Then, by varying the fine-grained category fed into the resulting generative model, it can generate images in a specific category. More recently, generating images from text descriptions [Xu et al., 2018b] has become popular in light of its diverse and practical applications, e.g., art generation and computer-aided design. With an attention-equipped generative network, the model can synthesize fine-grained details of subtle regions by focusing on the relevant words of the text descriptions.
In the real world, deep learning based fine-grained image analysis techniques are also adopted in diverse domain-specific applications and show great performance, such as clothes/shoes retrieval [Song et al., 2017] in recommendation systems, fashion image recognition [Liu et al., 2016] in e-commerce platforms, product recognition [Wei et al., 2019a] in intelligent retail, etc. These applications are highly related to both fine-grained retrieval and recognition of FGIA.
Additionally, if we move down the spectrum of granularity to the extreme, face identification can be viewed as an instance of fine-grained recognition at the identity level. Moreover, person/vehicle re-identification is another fine-grained related task, which aims at determining whether two images show the same specific person/vehicle. Re-identification tasks therefore also operate at identity granularity.
In practice, these works solve the corresponding domain-specific tasks by following the motivations of FGIA, which include capturing the discriminative parts of objects (faces, persons and vehicles) [Suh et al., 2018], discovering coarse-to-fine structural information [Wei et al., 2018b], developing attribute-based models [Liu et al., 2016], and so on.
Fine-grained image analysis (FGIA) based on deep learning has made great progress in recent years. In this paper, we give an extensive survey of recent advances in FGIA with deep learning. We introduce the FGIA problem and its challenges, discuss the significant improvements in fine-grained image recognition/retrieval/generation, and also present some domain-specific applications related to FGIA. Despite this great success, there are still many unsolved problems. Thus, in this section, we point out these problems explicitly and introduce some research trends for future evolution. We hope that this survey not only provides a better understanding of FGIA but also facilitates future research activities and application developments in this field.
Automatic fine-grained models. Nowadays, automated machine learning (AutoML) [Feurer et al., 2015] and neural architecture search (NAS) [Elsken et al., 2018] are attracting fervent attention in the artificial intelligence community, especially in computer vision. AutoML targets automating the end-to-end process of applying machine learning to real-world tasks, while NAS, the process of automating neural network architecture design, is a logical next step in AutoML. Recent AutoML and NAS methods can be comparable to or even outperform hand-designed architectures in various computer vision applications. Thus, it is promising that AutoML or NAS techniques could find better and more tailor-made deep models for fine-grained tasks, which could in turn advance the study of AutoML and NAS.
Fine-grained few-shot learning. Humans are capable of learning a new fine-grained concept with very little supervision, e.g., a few exemplary images for a species of bird, yet our best deep learning fine-grained systems need hundreds or thousands of labeled examples. Even worse, the supervision of fine-grained images is both time-consuming and expensive to obtain, since fine-grained objects must always be accurately labeled by domain experts. Thus, it is desirable to develop fine-grained few-shot learning (FGFS) [Wei et al., 2019b]. The task of FGFS requires learning systems to build classifiers for novel fine-grained categories from few examples (only one, or fewer than five) in a meta-learning fashion. Robust FGFS methods could greatly strengthen the usability and scalability of fine-grained recognition.
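As a hedged illustration of what learning "in a meta-learning fashion" often looks like in practice, the sketch below samples one N-way K-shot episode: N novel categories are drawn, K labeled support images per category are used to build the classifier, and a few held-out query images per category are used to evaluate it. The episodic protocol itself is an assumption here (it is common in few-shot learning but is not spelled out in this survey), and the function name, category names and file names are hypothetical.

```python
import random


def sample_episode(images_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode as (support, query) lists of (image, label)."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw support and query images together so that they never overlap.
        picked = rng.sample(images_by_class[label], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query


# Toy data: 8 hypothetical sub-categories with 10 image file names each.
images_by_class = {
    f"species_{c}": [f"species_{c}_img{i}.jpg" for i in range(10)]
    for c in range(8)
}
support, query = sample_episode(
    images_by_class, n_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
print(len(support), "support images;", len(query), "query images")
```

With `k_shot=1` this corresponds to the one-example case mentioned above; a learner would be trained over many such episodes and evaluated on episodes drawn from categories it has never seen.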
Fine-grained hashing. As FGIA attracts growing attention, more large-scale and well-constructed fine-grained datasets have been released, e.g., [Berg et al., 2014; Horn et al., 2017; Wei et al., 2019a]. In real applications like fine-grained image retrieval, a natural problem is that the cost of finding the exact nearest neighbor is prohibitively high when the reference database is very large. Hashing [Wang et al., 2018; Li et al., 2016], as one of the most popular and effective techniques for approximate nearest neighbor search, has the potential to deal with large-scale fine-grained data. Therefore, fine-grained hashing is a promising direction worth further exploration.
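The retrieval side of hashing can be sketched in a few lines: each image feature is mapped to a short binary code, and the database is ranked by Hamming distance to the query code, which is far cheaper than exact nearest neighbor search on real-valued features. In the sketch below, random hyperplane projections stand in for the learned hash functions of the cited deep hashing methods; this substitution, the function names and the toy feature vectors are all assumptions made for illustration.

```python
import random


def random_hyperplanes(dim, n_bits, seed=0):
    """One random Gaussian direction per output bit (stand-in for learned hashing)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(x, planes):
    """Pack the signs of the projections of x into a single integer code."""
    code = 0
    for w in planes:
        bit = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
        code = (code << 1) | bit
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, gallery_codes):
    """Gallery indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(gallery_codes)),
                  key=lambda i: hamming(query_code, gallery_codes[i]))


planes = random_hyperplanes(dim=4, n_bits=16)
gallery = [
    [1.0, 0.2, 0.0, 0.1],
    [-1.0, 0.5, 0.3, 0.0],
    [0.9, 0.25, 0.05, 0.1],
]
query = [1.0, 0.2, 0.0, 0.1]
codes = [hash_code(g, planes) for g in gallery]
ranking = rank_by_hamming(hash_code(query, planes), codes)
print(ranking)
```

For fine-grained data, the open question raised above is how to learn codes that keep visually similar sub-categories apart, which random projections are not designed to do.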
Fine-grained analysis within more realistic settings. In the past decade, fine-grained image analysis techniques have been developed and have achieved good performance in traditional settings, e.g., the empirical protocols of [Wah et al., 2011; Khosla et al., 2011; Krause et al., 2013]. However, these settings cannot satisfy the daily requirements of various real-world applications nowadays, e.g., recognizing retail products on storage racks with models trained on images collected in controlled environments [Wei et al., 2019a] and recognizing/detecting natural species in the wild [Horn et al., 2017]. In consequence, novel fine-grained image analysis topics deserve considerable research effort towards more advanced and practical FGIA. To name a few: fine-grained analysis with domain adaptation, with knowledge transfer, with long-tailed distributions, and on resource-constrained embedded devices.
[Bao et al., 2017] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In ICCV, pages 2745–2754, 2017.
[Berg et al., 2014] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2019–2026, 2014.
[Charikar et al., 2002] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In ICALP, pages 693–703, 2002.
[Chen et al., 2018] T. Chen, L. Lin, R. Chen, Y. Wu, and X. Luo. Knowledge-embedded representation learning for fine-grained image recognition. In IJCAI, pages 627–634, 2018.
[Cui et al., 2016] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR, pages 1153–1162, 2016.
[Cui et al., 2017] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural network. In CVPR, pages 2921–2930, 2017.
[Deng et al., 2016] J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the wisdom of the crowd for fine-grained recognition. TPAMI, 38(4):666–676, 2016.
[Dubey et al., 2018] A. Dubey, O. Gupta, R. Raskar, and N. Naik. Maximum entropy fine-grained classification. In NeurIPS, pages 637–647, 2018.
[Elsken et al., 2018] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.
[Feurer et al., 2015] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS, pages 2962–2970, 2015.
[Fu et al., 2017] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.
[Gao et al., 2016] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, pages 317–326, 2016.
[Goodfellow et al., 2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[He and Peng, 2017a] X. He and Y. Peng. Fine-grained image classification via combining vision and language. In CVPR, pages 5994–6002, 2017.
[He and Peng, 2017b] X. He and Y. Peng. Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification. In AAAI, pages 4075–4081, 2017.
[Horn et al., 2017] G. Van Horn, O. M. Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2017.
[Hou et al., 2017] S. Hou, Y. Feng, and Z. Wang. VegFru: A domain-specific dataset for fine-grained visual categorization. In ICCV, pages 541–549, 2017.
[Jaderberg et al., 2015] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
[Khosla et al., 2011] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, pages 806–813, 2011.
[Kong and Fowlkes, 2017] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In CVPR, pages 365–374, 2017.
[Krause et al., 2013] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop on 3D Representation and Recognition, 2013.
[LeCun et al., 2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[Lehmann et al., 2015] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, pages 167–195, 2015.
[Li et al., 2016] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.
[Lin et al., 2015a] D. Lin, X. Shen, C. Lu, and J. Jia. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In CVPR, pages 1666–1674, 2015.
[Lin et al., 2015b] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.
[Liu et al., 2016] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.
[Maji et al., 2013] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[Nilsback and Zisserman, 2008] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conf. on Comput. Vision, Graph. and Image Process., pages 722–729, 2008.
[Niu et al., 2018] L. Niu, A. Veeraraghavan, and A. Sabharwal. Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification. In CVPR, pages 7171–7180, 2018.
[Pham and Pagh, 2013] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In KDD, pages 239–247, 2013.
[Reed et al., 2016] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.
[Song et al., 2017] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, pages 5551–5560, 2017.
[Suh et al., 2018] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee. Part-aligned bilinear representations for person re-identification. In ECCV, pages 402–419, 2018.
[Sun et al., 2018] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. In ECCV, pages 834–850, 2018.
[Sun et al., 2019] X. Sun, L. Chen, and J. Yang. Learning from web data using adversarial discriminative neural networks for fine-grained classification. In AAAI, 2019.
[Wah et al., 2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD birds-200-2011 dataset. Tech. Report CNS-TR-2011-001, 2011.
[Wang et al., 2018] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. A survey on learning to hash. TPAMI, 40(4):769–790, 2018.
[Wei et al., 2017] X.-S. Wei, J.-H. Luo, J. Wu, and Z.-H. Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. TIP, 26(6):2868–2881, 2017.
[Wei et al., 2018a] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
[Wei et al., 2018b] X.-S. Wei, C.-L. Zhang, L. Liu, C. Shen, and J. Wu. Coarse-to-fine: A RNN-based hierarchical attention model for vehicle re-identification. In ACCV, 2018.
[Wei et al., 2019a] X.-S. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu. RPC: A large-scale retail product checkout dataset. arXiv preprint arXiv:1901.07249, 2019.
[Wei et al., 2019b] X.-S. Wei, P. Wang, L. Liu, C. Shen, and J. Wu. Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. TIP, in press, 2019.
[Xu et al., 2018a] H. Xu, G. Qi, J. Li, M. Wang, K. Xu, and H. Gao. Fine-grained image classification by visual-semantic embedding. In IJCAI, pages 1043–1049, 2018.
[Xu et al., 2018b] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–1324, 2018.
[Yang et al., 2018] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang. Learning to navigate for fine-grained classification. In ECCV, pages 438–454, 2018.
[Zhang et al., 2014] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, pages 834–849, 2014.
[Zhang et al., 2018] Y. Zhang, H. Tang, and K. Jia. Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In ECCV, pages 233–248, 2018.
[Zhao et al., 2017] B. Zhao, J. Feng, X. Wu, and S. Yan. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing, 14(2):119–135, 2017.
[Zheng et al., 2017] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, pages 5209–5217, 2017.
[Zheng et al., 2018] X. Zheng, R. Ji, X. Sun, Y. Wu, F. Huang, and Y. Yang. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI, pages 1226–1233, 2018.
[Zheng et al., 2019] X. Zheng, R. Ji, X. Sun, B. Zhang, Y. Wu, and F. Huang. Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In AAAI, 2019.
[Zhuang et al., 2017] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In CVPR, pages 1878–1887, 2017.