Deep Convolutional Neural Networks (ConvNets) has achieved excellent performance in numerous computer vision tasks in recent years. Trained with a large amount of annotated data, the ConvNets can extract robust and effective representations for classification. However, ConvNets suffers from its weak generalization ability and poor performance when the annotated samples for training are very limited. In contrast, we humans can identify novel classes with only a single or few samples. Thus, recognizing novel categories from very few samples is an important and significant problem, which is often termed Few-shot learning (FSL).
Recently, the FSL problem has attracted increasing attention, and a number of methods have been proposed to tackle this task. Fine-tuning the pre-trained model on the novel datasets is a common and simple method. Besides, To generalize the model to novel datasets, an emerging direction is to apply the meta-learning paradigm on few-shot learning. Meta-learning trains an across-task meta-learner which can accumulate transferable knowledge in one task and generalize to other novel tasks quickly. Another common approach is based on metric-learning, which learns an informative similarity metric between the query and the support samples, thus performing few-shot classification.
Due to the very limited samples, most of the approaches encounter over-fitting and poor generalization. Thus, many data augmentation (DA) methods have been proposed. [Wang et al., 2018] and [Hariharan and Girshick, 2017] are both data generation based method, which can generate additional examples for data-starved classes. However, they need a large of extra annotated data to train such a specialized data generator or hallucinator. [Schwartz et al., 2019] leverages extra multiple semantic and feature fusion to train a more robust embedded module. Nevertheless, these methods contain complicated feature fusion networks, and refer to extra semantic information may limit application scenarios.
Different from DA, localization can distinguish the most discriminative regions from distractors without using extra annotated samples. Mask-CNN [Wei et al., 2016] utilized the fully convolutional network to locate the most discriminative parts to fulfill the fine-grained recognition. Inspirited by this, we argue that guiding to localize the discriminate regions to perform few-shot classification should make a significant improvement of the FSL task, especially the fine-grained few-shot (FGFS) task. However, many weakly-supervised object localization (WSOL) methods fail to localize the integral regions of the objects. For instance, [Zhou et al., 2016; Oquab et al., 2015] replace the last few layers of the clas-sification network with a global pooling layer and a fullyconnected layer to generate the discriminative class activation maps (CAM). However, CAM tends to cover only the most discriminative part which leads to classification accuracy improvement.
To bridge this gap, we propose a novel end-to-end net-
Figure 1: Illustration of Self-Attention Based Complementary Module (SCA module). Our method mainly contains: (a) Channel-Based Attention Module (CBAM) to generate the important mask, (b) Classifier to fulfill the classification, and output the class activation map (cam). The classifier learns to obtain two complementary discriminative regions by classifying two complementary features to the same classes. The complementary features contain the features spatial-wise multiplication with the important mask and the features spatial-wise multiplication with the complementary erased mask. Note that our SAC module is lightweight can be easily applied to convolutional feature maps of various models (e.g., VGG16) to improve the localization accuracy. Best viewed in color.
work to achieve weakly-supervised object localization which is shown in Figure 1. Inder to localize the integral objects and select the most discriminate features for few-shot classification, we design the Channel-Based Attention Module (CBAM). CBAM takes the feature maps as input and generates the important mask. The complementary erased mask is also produced using the threshold. Then the clas-sifier classifies the complementary features to the same class to obtain the completely class activation map (e.g., ). By fusing both the
and
, the SCA module can capture the more interval class activation map for the input images. After obtaining the most discriminative regions, we applied the interpolation to select the useful deep descriptors and feed them to the semantic alignment module to compute the semantic alignment distance for few-shot learning.
Many previous methods compute a prototype of each class by averaging all the support data of the class. Despite its effi-ciency, it is vulnerable to noise. Inspirited by NBNN [Boiman et al., 2008], we align each deep descriptor of the key object by its nearest neighbor among all of the support deep descriptors, which mean the distance between the query images and the support images can be computed in a relative high-data regime, making it more stable and biased-noisy. We encapsulate this part as a semantic alignment module to output the distance. The pipeline of our method for few-shot classifica-tion is shown in Figure 2.
Below, we list our main contributions: (1) We propose a lightweight and efficient module named SAC module that can localize the integral discriminative region. The designed CBAM efficiently helps the classifier to capture the discriminative part and complementary discriminative part. (2) We design the semantic alignment module to boost few-shot clas-sification over the selected deep descriptors since it effectively reduces background noise. (3) Extensive experiments on benchmark datasets and fine-grained few-shot datasets, and the generalization evaluation experiment all show the superiorities of our method. Meanwhile, the visualization shows the proposed method can localize the key objects accurately.
2.1 Meta-learning and Metric-learning
Meta-learning based method trains an across-task meta-learner with the meta-learning paradigm. MAML [Finn et al., 2017] trained a model agnostic meta-learner and found the initial parameters adapting to a variety of tasks with similar distribution, such that the model can quickly generalize to the new tasks. Meta-Learning LSTM [Ravi and Larochelle, 2017] proposed a model based on LSTM to learn an optimization method as well as the general initialization of the classifier. Metric-learning tackles the FSL by learning an embedding space where the input of the samples of the same categories is closer than those of different categories. Combining with attention mechanism, Matching Network [Vinyals et al., 2016] used the cosine distance to train a k-nearest neighbor classifier on the learned embedding space. [Snell et al., 2017] proposed a prototypical network to learn a prototype representation of each category and performed classification by the Euclidean distance between the query and prototype. Relation network [Zhang et al., 2018b] learned a nonlinear comparator to compare the distance metric between the query and the support images. Different from these metric-learning approaches that directly using the vector obtained by flatten-ing the embedded feature, we use a set of selected deep descriptors to represent the embedded feature.
2.2 Object Localization
Many works [Wei et al., 2017; Wei et al., 2017] have shown that utilizing localization can help to learn the more discriminative embedded feature for classification. In this context, [Wei et al., 2017] presented the Selective Convolutional Descriptor Aggregation method to achieve unsupervised localization in fine-grained datasets. Similarly, [Sun et al., 2019] used the class attention map (CAM) generated by classifica-tion networks to locate the key region and fused with other different scale features for fine-grained few-shot classifica-tion. [Wertheimer and Hariharan, 2019] used the assistant bounding box annotations to achieve the localization within the few-shot classification, thus to address the FSL task over heavy-tailed datasets. Leveraging localization can improve few-shot classification performance. However, in real-world applications, bounding box annotations may be hard to meet or impracticable. To address the FSL problem with realistic settings, we propose a novel method to achieve weakly-supervised object localization.
3.1 Problem Definition
There is the train dataset D, support dataset S, and the query dataset Q in the FSL task. contains N samples, where
is the label of image
. The support set
includes M examples in test phase and there is K labeled samples for each of C novel categories (C-way K-shot problem). The query dataset
shares the same label space with S. Their relationship is denoted as
. FSL aims to train a model from D, then classify the novel samples from Q based on the S during the testing phase. To mimic the few-shot learning task, the episodic training mechanism is adopted to train the model. Episode is a mini-batch includes
and
, where we randomly sample C categories from
and for each category of C categories, its labeled samples are randomly split into subset
with K samples and subset
with the rest samples. Through this training mechanism, the model can learn transferable knowledge.
3.2 Model
Deep descriptor For an image X, the activation of a convolution layer can be formatted as an 3D tensor denoted as On the one hand, E(X) includes d feature maps with the size of
and is denote as M =
also known as the feature map in nth channel. On the other hand, E(X) can be considered as including
deep descriptors and each deep descriptor is a d -dimension vector. We denote it as:
where (i, j) is the position of the descriptor and . Thus, a set of deep descriptors is the representation containing spatial information.
Channel-Based Attention Module (CBAM) From previous works, erasing the most discriminate part is the effective method to obtain the integral object in WSOL tasks. We design a very lightweight yet efficient module to capture the most discriminative part based on the channel attention, which is denoted as CBAM. Through the CBAM, we produce an important mask directly. Then a complementary erased mask is also produced by the threshold. The module can be easily incorporated into various existing models (e.g., vgg16).
Figure 3: Illustration of the CBAM. It contains the avgpooling, max-pooling, and 1x1 convolution operation. Then we concat the output of the three operations and apply a 1x1 convolution operation with a sigmoid function to produce the important mask of the input feature maps. Best viewed in color.
Classifier to obtain the cam Different from CAM which needs an extra step (e.g. CAM get the classification weight through gradient backhaul) to obtain the class activation maps after the forward, Acol [Zhang et al., 2018a] proposed a novel method to obtain the class activation map directly from the feature map of the last convolutional layer. Suppose there are C classes during the meta-train phase, the last convolutional layer of the classifier is a C channel with 1x1 kernel size. The output of the classifier is fed to the softmax for classification. Suppose the weight matrix of the 1x1 convolutional layer is . Then we can directly obtain the class activation map:
Figure 4: Illustration of the process to produce the cam through the classifier. The classifier contains 3 convolutional block and 1
convolutional block. Best vied in color.
Semantic Alignment Module (SAM) This module is designed to calculate the semantic alignment distance for the query/support pair. In this paper, we suppose that each deep descriptor of the key object is independent and has clear semantics. For instance, the selected set of deep descriptors can be interpreted as a bird, dog, etc. Based on the above suppose,
Figure 2: The architecture of our method for few-shot learning. We utilize Conv-64 or ResNet-12 as the backbone to obtain the convolutional activation tensor. Then we use the fused cam generated by the WSOL network (VGG16+SAC) to select the useful deep descriptors. The cam is resized to a suitable scale by nearest interpolation. Excepted the selected deep descriptors, the rest deep descriptors are set to zero vector. Finally, the selected is feed to the SAM module to compute the semantic alignment distance. We maximizing the distance if the input pair belongs to the same class while minimizing the score if comes from a different class. Best viewed in color.
we search the nearest-neighbor (NN) among the support set for each descriptor of the query image and accumulate them as the final distance. We define such distance between two discriminative regions as semantic alignment distance. By applying the NN algorithm over each deep descriptor can we guarantee the content between the query image and support images to be aligned. We chose the cosine distance to measure the distance between two descriptors and
For the query image of class k, its embedded features is denoted as: . Similarly, the deep descriptors of all support embedded features of class k are denoted as:
. In order to correctly classify the query image, the model needs to guarantee the semantic alignment distance between
and its support embedded feature set
to be highest (nearest), thus that each deep descriptor of the query image can be accurately aligned by its nearest neighbor deep descriptor (
) from support set. The semantic alignment distance between the embedded feature of the query image and the embedded feature of the support category k is :
where represents
is the nearest neighbor descriptor of
among
over cosine distance. Since the deep descriptors from the key regions of the query will be mostly activated by the deep descriptors from the regions belonging to the same class, Maximizing the score can guarantee each deep descriptor can be aligned correctly. We directly perform classification by the class of its nearest D distance without additional supervisory loss. For the C - Way K -Shot task, the probability of query image (x, y) belongs to the true category
is denoted as:
For N query images in each episode, we minimize the loss
4.1 dataset
miniImageNet As the mini-version of ImageNet, it contains 100 classes with 600 color images per class. The splits proposed in Matching Net is becoming the standard splitting rule of miniImageNet, where 64 categories are for training,16 categories for validation, and 20 categories for testing. In this work, we adopt this split to compare our approach with state-of-the-art methods.
Fine-Grained Datasets In this paper, we pick three fine-grained datasets, e.g., Stanford Dogs [Khosla et al., 2011], Stanford Cars [Makadia and Yumer, 2014] and CUB-200 [Wah et al., 2011] to conduct the fine-grained few-shot learning task. Stanford Dogs contains 120 categories with 20,580 color images, where 70, 20, and 30 categories are used for training, validation, and testing, respectively. Stanford Cars containing 16,185 color images of 196 classes of cars, which is divided into 130, 17, and 49 categories for training, validation, and testing, respectively. CUB-200 is an image dataset with 6033 color images of 200 bird species for the FGVC task. Similarly, we split it into 130, 20 and 50 categories for training, validation, and testing, respectively.
4.2 Settings and Experiments
Settings We adopt the episodic training mechanism to make the training phase more faithful to the testing phase.
Few-shot Classification on miniImageNet We report the experiment results in table 1. When adopting the Resnet as the embedding module, our model can achieve state-of-the-art results both in the 5-way 1-shot and 5-shot task, especially in the 5-shot task (3.29 % higher than the 74.44% reported by DN4 [Li et al., 2019a]). Besides, when using the Conv as embedding module, our model also achieves the highest accuracy on the 5-way 5-shot task, gaining the 4.40%, 1.03%, and 0.04% over CovaMNet [Li et al., 2019b], DN4, and SalNet [?]. We also obtain very competitive accuracy on 5-way 1-shot task with Conv embedding module, gaining 3.82%, 2.13%, 2.08% improvement over R2D2 [Bertinetto et al., 2019], CovaMNet, and DN4. As for Dynamic-Net and SalNet on the 5-way 1-shot task, they perform very complicated training steps to obtain state-of-the-art results. The former utilizes a two-stage model and needs to pre-train the model while our approach does not. The latter utilizes the state-of-the-art saliency detection model to generate the saliency map, thus to directly locate the key object. On the contrary, our approach achieves weakly-supervised localization with only image-level. Our approach is more simple but efficient and outperforms over state-of-the-art methods both on 5-way 1-shot and 5-shot.
Generalizing to other datasets To better reflect the generalization performance of the few-shot learning models, we evaluate the few-shot learning model on the completely different datasets. A new dataset that totally different from the training dataset may present data distribution shift [Recht et al., 2019], which will cause significant performance degradation of the model. According to section 3.1, the training classes and the testing classes do not share the same label space, but they still possess the same data distribution because of coming from the same dataset. In this section, we train the model on miniImageNet and conduct the testing on the novel datasets to evaluate the generalization capability. The experiment results show that our model outperforms previous work(Proto Net, Relation Net, and K-tuplet loss [Li et al., 2019c]) on the three novel datasets, which demonstrates the superior generalization capacity of our approach.
Table 1: The mean accuracies of the 5-way 1-shot and 5-shot tasks on the miniImageNet dataset, with 95% confidence intervals.
Model Embedding 5-Way Accuracy(%) 1-shot 5-shot
Table 2: The mean accuracies of the 5-way 1-shot and 5-shot accuracies (%) on three fined-grained datasets using the model trained on miniImageNet, with 95% confidence intervals. All the experiments are conducted with the same network for fair comparison.
Few-shot Classification on fine-grained datasets Compared with the generic few-shot classification task, it’s more challenging to perform fine-grained few-shot (FGFS) classifi-cation due to the smaller inter-class and larger intra-class variations of the fine-grained datasets. However, since FGFS receives very little attention, most of the exiting few-shot learning methods do not report their performance on such fine-grained datasets. Therefore, we implement and evaluate our approach on fine-grained datasets. Meanwhile, we also report the DN4, CovaMNet, GNN [Satorras and Estrach, 2018], Proto Net, MattML [Zhu et al., 2020], LRPABN [Huang et al., 2020] to make a comparison. As table 3 shown, compared with those methods, our method achieves the best performance on three fine-grained datasets under both the 5-way 1-shot and 5-shot tasks. In more detail, on the Stanford Dogs dataset, our method gains 4.18% and 15.79% improvement respectively over the second place under 1-shot and 5-shot settings. On the Stanford Cars dataset, our method achieves the state of the art performance both on 1-shot and 5-shot,
Table 3: The mean accuracies of the 5-way 1-shot and 5-shot tasks on three fine-grained datasets, with 95% confidence intervals.
gaining 16.13% and 5.83% improvement over the second place. As for the CUB 200-2011 dataset, our method gains competitive accuracy under the 1-shot setting and best performance under the 5-shot setting. The result demonstrates that our methos helps to boost the fine-grained few-shot classifi-cation performance by utilizing the localization information.
Weakly-supervised object localization performance CUB200-2011 dataset is a benchmark for wsol task. It contains 200 categories of birds with 5994 training images and 5794 testing images. For each, it provides the bounding box for localization. We train our model on the training set without using any bounding box. During the meta-test phase, we predict the bounding box and the label for each input image. We use the Top-1 localization accuracy(Top-1 Loc) and localization accuracy with known ground truth class (GT-Known Loc). GT-Known is correct when the intersection over union (IoU) between the ground truth box and predicted box is 50% or more. Top-1 Loc is correct when the Top-1 classification result (Top-1 Clas) and GT-Known Loc are both correct. We adopt the VGG-16 as our backbone for a fair comparison.
As table 4 shown, our method performs superior to the previous works both on Top-1 Loc acc and Top-1 Clas acc. Figure 5 also shows that the proposed method can locate greater regions of the objects than the CAM method.
Table 4: Object localization preformance on CUB 200-2011
Figure 5: Compared with the CAM method on few-shot fine-grained datasets. Our method can locate more interval regions to improve localization performance. (ground-true bounding boxes are in red and the predicted are in green). Best vied in color.
5.1 Ablation study
Influence of the backbone We execute the experiments to explore the influence of the backbone both on the few-shot learning dataset and fine-grained few-shot learning dataset. Conv-64 refers to a shallow network with 4 convolutional blocks and each block contains 64 filters with size , tailed with a max-pooling layer. The Resnet-12 refers to a ResNet-like network consisting of 4 residual blocks and each block contains 3 convolutional layers with
kernel. As the table shows, compares with Conv-64, ResNet-12 has made a significant improvement in all data sets under both 1-shot and 5-shot settings.
Influence of the SAC and SAM To demonstrate the in-fluence of various modules, we perform the ablation study and the results are shown in Table 4. Firstly, we replace the SAM with Euclidean distance (ED) both in the w/SAC and w/o SAC combination. The implementation ED classifier is similar to the prototype network, where we use the vector flatted from the embedded feature to represent the embedded feature. The (w/SAC+SAM) outperforms (w/SAC+ED) in different settings, especially with the Resnet backbone,
Table 5: Comparision the performance of different backbone (Conv-64 and ResNet-12) in our method. The mean accuracies of the 5-way 1-shot and 5-shot accuracies (%) on three fined-grained datasets and miniImageNet dataset, with 95% confidence intervals.
Table 6: Testing ecah module of the proposed method during training on miniImageNet. The mean accuracies of the Resnet and Conv embedding module, with 95% confidence intervals.
Module Combination Embedding Acc % 1-shot 5-shot
gaining approximately 25.11% improvement in 5-shot and 14.37% in 1-shot, which demonstrates that the proposed semantic alignment distance can work well on few-shot clas-sification. Secondly, without the SAC module to select the useful deep descriptor, the accuracy drops down according to the (w/SAC+SAM) and (w/o SAC+SAM). However, according to the (w/SAC+ED) and (w/o SAC+ED), it shows that the SAC can not work well with ED. This may because the ED can not utilize the semantic information provided by SAC. The ablation study shows our scheme can accurately utilize the weakly object localization to improve the few-shot learning performance.
5.2 Visualization
In this section, we visualize the class activation map of the input from the few-shot dataset and fine-grained few-shot datasets. Our method adopts the VGG-16 as the backbone to generate the class activation map for the input images. More specifically, we use the proposed SAC module to replace the last pooling layer and three fully connected layers of VGG16. Noted that the SAC module is a fully convolutional architecture, which means it can handle any size of the input easily. The model is trained end-to-end only on the training set. In the meta-test phase, we produce the cam for each novel images. As Figure 4 shown, compared with CAM, our model can more integral localize the key object both in the miniIm-
Figure 6: The class activation map generated by VGG-GAP-CAM and our method on miniImageNet dataset and three fine-grained few-shot datasets. The first row is input images, the second row is the class activation map generated by the CAM method, and the third row is the class attention map generate by ours. Compared with CAM, our method can capture more interval regions of the objects both in the miniImageNet and fine-grained datasets.
ageNet and the fine-grained datasets. It is worth mentioning that the model can generalize to novel categories (especially on fine-grained datasets) well since it can produce the cam for novel classes very well. This may because the unseen class always contains similar regions to the training set and the classifier will classify the novel sample to the most similar class in the training set.
This paper proposes a method that can deal with both the few-shot classification and fined-grained few-shot classifica-tion well. The proposed method introduces the SAC module to localize the key objects, and more importantly selecting the useful deep descriptors for classification and fine-grained classification. The SAM module can align the semantic content between the query images and the support images by performing the NN algorithm over each selected deep descriptor. Extensive experiments show the proposed method obtains superior performance over state-of-the-art methods on both few-shot classification and fine-grained few-shot classi-fication tasks. Furtherly, the ablation study shows that only our scheme can accurately utilize the subtle and local information to boost the performance of classification. The visualization shows the SAC module can localize the interval objects, which explains the high accuracies of our method.
[Bertinetto et al., 2019] Luca Bertinetto, Joao F. Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2019.
[Boiman et al., 2008] Oren Boiman, Eli Shechtman, and Michal Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[Finn et al., 2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML-Volume 70, 2017.
[Gidaris and Komodakis, 2018] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
[Hariharan and Girshick, 2017] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[Huang et al., 2020] Huaxi Huang, Junjie Zhang, Jian Zhang, Jingsong Xu, and Qiang Wu. Low-rank pairwise alignment bilinear network for few-shot fine-grained image classification, 2020.
[Khosla et al., 2011] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), 2011.
[Li et al., 2019a] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. In CVPR, 2019.
[Li et al., 2019b] Wenbin Li, Jinglin Xu, Jing Huo, Lei Wang, Yang Gao, and Jiebo Luo. Distribution consistency based covariance metric networks for few-shot learning. In AAAI, 2019.
[Li et al., 2019c] Xiaomeng Li, Lequan Yu, Chi-Wing Fu, Meng Fang, and Pheng-Ann Heng. Revisiting metric learning for few-shot image classification. arXiv preprint arXiv:1907.03123, 2019.
[Makadia and Yumer, 2014] Ameesh Makadia and Mehmet Ersin Yumer. Learning 3d part detection from sparsely labeled data. In 2014 2nd International Conference on 3D Vision, 2014.
[Oquab et al., 2015] Maxime Oquab, L´eon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
[Ravi and Larochelle, 2017] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. ICLR, 2017.
[Recht et al., 2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? arXiv preprint arXiv:1902.10811, 2019.
[Satorras and Estrach, 2018] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2018.
[Schwartz et al., 2019] Eli Schwartz, Leonid Karlinsky, Rogerio Feris, Raja Giryes, and Alex M Bronstein. Baby steps towards few-shot learning with multiple semantics. arXiv preprint arXiv:1906.01905, 2019.
[Snell et al., 2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
[Sun et al., 2019] Xin Sun, Hongwei Xv, Junyu Dong, Qiong Li, and Changrui Chen. Few-shot learning for domain-specfic fine-grained image classfication. arXiv preprint arXiv:1907.09647, 2019.
[Vinyals et al., 2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
[Wah et al., 2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltechucsd birds-200-2011 dataset. 2011.
[Wang et al., 2018] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
[Wei et al., 2016] Xiu-Shen Wei, Chen-Wei Xie, and Jianxin Wu. Mask-cnn: Localizing parts and selecting descriptors for fine-grained image recognition. arXiv preprint arXiv:1605.06878, 2016.
[Wei et al., 2017] Xiu-Shen Wei, Jian-Hao Luo, Jianxin Wu, and Zhi-Hua Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 2017.
[Wertheimer and Hariharan, 2019] Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings. In CVPR, 2019.
[Zhang et al., 2018a] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S Huang. Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325–1334, 2018.
[Zhang et al., 2018b] Xiaolin Zhang, Yunchao Wei, Guoliang Kang, Yi Yang, and Thomas Huang. Self-produced guidance for weakly-supervised object localization. In ECCV, 2018.
[Zhou et al., 2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
[Zhu et al., 2020] Yaohui Zhu, Chenlong Liu, and Shuqiang Jiang. Multi-attention meta learning for few-shot fine-grained image recognition. In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence, pages 1090–1096, 2020.