An AI agent working in the real world must be able to recognize the classes of things that it has seen or learned before, detect new things that it has not seen, and learn to accommodate them. This learning paradigm is called open-world learning (OWL) [2, 7, 9]. This is in contrast with the classic supervised learning paradigm, which makes the closed-world assumption that the classes seen in testing must have appeared in training. With the ever-changing Web and the popularity of AI agents such as intelligent assistants and self-driving cars that need to face the real-world open environment with unknowns, OWL capability is crucial.
For example, with the growing number of products sold on Amazon from various sellers, it is necessary to have an open-world model that can automatically classify a product based on a set S of product categories. An emerging product not belonging to any existing category in S should be classified as “unseen” rather than as one from S. Further, this unseen set may keep growing. When the number of products belonging to a new category is large enough, the category should be added to S. An open-world model should easily accommodate this addition at a low training cost, since it is impractical to retrain the model from scratch every time a new class is added. As another example, the very first interface for many intelligent personal assistants (IPAs) (such as Amazon Alexa, Google Assistant, and Microsoft Cortana) is to classify user utterances into existing known domain/intent classes (e.g., Alexa’s skills) and also reject/detect utterances from unknown domain/intent classes (that are currently not supported). However, with support for third parties to develop new skills (apps), such IPAs must recognize new/unseen domain or intent classes and include them in the classification model [16, 20]. These real-life examples present a major challenge to the maintenance of the deployed model.
Most existing solutions to OWL are built on top of closed-world models [2, 3, 9, 32], e.g., by setting thresholds on the logits (before the softmax/sigmoid functions) to reject examples from unseen classes, which tend to mix with existing seen classes. One major weakness of these models is that they cannot easily add new/unseen classes to the existing model without re-training or incremental training (e.g., OSDN [3] and DOC [32]). There are incremental learning techniques (e.g., iCaRL [26] and DEN [23]) that can incrementally learn to classify new classes. However, they lack the capability of rejecting examples from unseen classes. This paper proposes to solve OWL with both capabilities in a very different way via meta-learning.
Problem Statement: At any point in time, the learning system is aware of a set of seen classes $S = \{c_1, \dots, c_m\}$ and has an OWL model/classifier for $S$, but is unaware of a set of unseen classes $U = \{c_{m+1}, \dots\}$ (any class not in $S$ can be in $U$) that the model may encounter. The goal of an OWL model is two-fold: (1) classify examples from classes in $S$ and reject examples from classes in $U$, and (2) when a new class $c_{m+1}$ (without loss of generality) is removed from $U$ (now $U = \{c_{m+2}, \dots\}$) and added to $S$ (now $S = \{c_1, \dots, c_m, c_{m+1}\}$), still be able to perform (1) without re-training the model.

Two main challenges for solving this problem are: (1) how to enable the model to classify examples of seen classes into their respective classes and also detect/reject examples of unseen classes, and (2) how to incrementally include the new/unseen classes when they have enough data without re-training the model.

As discussed above, existing methods focus on either challenge (1) or challenge (2), but not both. To tackle both challenges in a unified approach, this paper proposes an entirely new OWL method based on meta-learning [1, 10–12, 34]. The method is called Learning to Accept Classes (L2AC). The key novelty of L2AC is that the model maintains a dynamic set S of seen classes that allows new classes to be added or deleted with no model re-training needed. Each class is represented by a small set of training examples. In testing, the meta-classifier only uses the examples of the maintained seen classes (including the newly added classes) on-the-fly for classification and rejection. That is, the learned meta-classifier classifies or rejects a test example by comparing it with its nearest examples from each seen class in S. Based on the comparison results, it determines whether the test example belongs to a seen class or not. If the test example is not classified as any seen class in S, it is rejected as unseen. Unlike existing OWL models, the parameters of the meta-classifier are not trained on the set of seen classes but on a large number of other classes, which can share a large number of features with seen and unseen classes. The meta-classifier can thus perform seen class classification and unseen class rejection for any seen class set without re-training.

We can see that the proposed method works like a nearest neighbor classifier (e.g., kNN). However, the key difference is that we train a meta-classifier to perform both classification and rejection based on a learned metric and a learned voting mechanism. Also, kNN cannot do rejection on unseen classes.

The main contributions of this paper are as follows.
(1) It proposes a novel approach (called L2AC) to OWL based on meta-learning, which is very different from existing approaches.
(2) The key advantage of L2AC is that with the meta-classifier, OWL becomes simply maintaining the seen class set S, because both seen class example classification and unseen class example rejection/detection are based on comparing the test example with the examples of each class in S. To be able to accept/classify any new class, we only need to put the class and its examples in S.
The proposed approach has been evaluated on product classification and the results show its competitive performance.
As an overview, Fig. 1 depicts how L2AC classifies a test example into an existing seen class or rejects it as from an unseen class. The training process for the meta-classifier is not shown in the figure; it is detailed in Sec. 2.2. The L2AC framework has two major components: a ranker and a meta-classifier. The ranker is used to retrieve some examples from a seen class that are similar/near to the test example. The meta-classifier performs classification after it reads the retrieved examples from the seen classes. The two components work together as follows.
Assume we have a set of seen classes $S$. Given a test example $x_t$ that may come from either a seen class or an unseen class, the ranker finds a list of top-k nearest examples to $x_t$ from each seen class $c \in S$, denoted as $x_{a_{1:k}|c}$. The meta-classifier produces the probability $p(c = 1 \mid x_t, x_{a_{1:k}|c})$ that the test example $x_t$ belongs to the seen class $c$ based on $c$'s top-k examples (most similar to $x_t$). If none of these probabilities from the seen classes in $S$ exceeds a threshold (e.g., 0.5 for the sigmoid function), L2AC decides that $x_t$ is from an unseen class (rejection); otherwise, it predicts $x_t$ as from the seen class with the highest probability (classification). We denote $p(c = 1 \mid x_t, x_{a_{1:k}|c})$ as $p(c \mid x_t, x_{a_{1:k}})$ for brevity when necessary. Note that although we use a threshold, this is a general threshold for the meta-classifier only, not a threshold for any specific class as in other OWL approaches. More practically, this threshold is pre-determined (not empirically tuned via hyper-parameter search) and the meta-classifier is trained based on this fixed threshold.
As we can see, the proposed framework works like a supervised lazy learning model, such as the k-nearest neighbor (kNN) classifier. Such a lazy learning mechanism allows the dynamic maintenance of a set of seen classes, where an unseen class can be easily added to the seen class set S. However, the key difference is that the metric space, the voting, and the rejection are all learned by the meta-classifier.
Retrieving the top-k nearest examples for a given test example $x_t$ needs a ranking model (the ranker). We will detail a sample implementation of the ranker in Sec. 3 and discuss the details of the meta-classifier in the next section.
2.1 Meta-Classifier
The meta-classifier serves as the core component of the L2AC framework. It is essentially a binary classifier on a given seen class. It takes the top-k nearest examples (to the test example $x_t$) of the seen class as the input and determines whether $x_t$ belongs to that seen class or not. In this section, we first describe how to represent examples of a seen class. Then we describe how the meta-classifier processes these examples together with the test example into an overall probability score (via a voting mechanism) for deciding whether the test example should belong to any seen class (classification) or not (rejection). Along with that, we also describe how a joint decision is made for open-world classification over a set of seen classes. Finally, we describe how to train the meta-classifier via another set of meta-training classes and their examples.
2.1.1 Example Representation and Memory. Representation learning lives at the heart of neural networks. Following the success of using pre-trained weights from large-scale image datasets (such as ImageNet [27]) as feature encoders, we assume there is an encoder that captures almost all features for text classification.
Figure 1: Overview of the L2AC framework (best viewed in color). Assume the seen class set S has 5 classes and their examples are indicated by 5 different colors. L2AC has two components: a ranker and a meta-classifier. Given a (green) test example from a seen class, the ranker first retrieves the top-k nearest examples (memory indexes) from each seen class. Then the meta-classifier takes both the test example and the top-k nearest examples for a seen class to produce a probability score for that class. The meta-classifier is applied 5 times (indicated by 5 rounded rectangles) over these 5 seen classes and yields 5 probability scores, where the 3rd (green) class attains the maximum score and becomes the final class (green) prediction. However, if the test example (grey) is from an unseen class (as indicated by the dashed box), none of the probability scores from the seen classes will predict positive, which leads to rejection.

Given an example x representing a text document (a sequence of tokens), we obtain its continuous representation (a vector) via an encoder $h = g(x)$, where the encoder $g(\cdot)$ is typically a neural network (e.g., CNN or LSTM). We will detail a simple encoder implementation in Sec. 3.

Further, we save the continuous representations of the examples into the memory of the meta-classifier, so that the top-k examples can later be efficiently retrieved via their indexes (addresses) in the memory. The memory is essentially a matrix $E \in \mathbb{R}^{n \times |h|}$, where n is the number of all examples from seen classes and |h| is the size of the hidden dimensions. Note that we will still use x instead of h to refer to an example for brevity. Given the test example $x_t$, the meta-classifier first looks up the actual continuous representations $h_{a_{1:k}}$ of the top-k examples for a seen class. Then the meta-classifier computes the similarity score between $h_t$ and each $h_{a_i}$ ($1 \le i \le k$) individually via a 1-vs-many matching layer as described next.
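The memory described above can be illustrated with a minimal sketch. It assumes the encoded vectors are plain Python lists; the class name `SeenClassMemory` and its methods are hypothetical and chosen only for illustration, not taken from the paper's implementation. The point it shows is that accepting or dropping a class only changes the stored rows, with no model parameters involved.

```python
class SeenClassMemory:
    """Row-indexed store of encoded examples (hypothetical illustration)."""

    def __init__(self, hidden_size):
        self.hidden_size = hidden_size
        self.rows = []    # one encoded vector per stored example
        self.labels = []  # class label of each row, parallel to self.rows

    def add_class(self, label, encoded_examples):
        """Accept a new seen class by appending its encoded examples."""
        vectors = [list(vec) for vec in encoded_examples]
        # Validate everything first so a bad vector leaves the memory unchanged.
        if any(len(vec) != self.hidden_size for vec in vectors):
            raise ValueError("vector size does not match hidden_size")
        self.rows.extend(vectors)
        self.labels.extend([label] * len(vectors))

    def remove_class(self, label):
        """Drop a class; the remaining rows are re-indexed from 0."""
        kept = [(r, l) for r, l in zip(self.rows, self.labels) if l != label]
        self.rows = [r for r, _ in kept]
        self.labels = [l for _, l in kept]

    def indexes_of(self, label):
        """Memory indexes (addresses) of all examples of one class."""
        return [i for i, l in enumerate(self.labels) if l == label]

    def lookup(self, indexes):
        """Fetch the stored representations for the given indexes."""
        return [self.rows[i] for i in indexes]


memory = SeenClassMemory(hidden_size=3)
memory.add_class("shoes", [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
memory.add_class("books", [[0.0, 1.0, 0.0]])
```

A ranker would return indexes into this store, and `lookup` would then supply the corresponding vectors to the matching layer.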
2.1.2 1-vs-many Matching Layer. To compute the overall probability between a test example and a seen class, a 1-vs-many matching layer in the meta-classifier first computes the individual similarity score between the test example and each of the top-k retrieved examples of the seen class. The 1-vs-many matching layer essentially consists of k shared matching networks, as indicated by the big yellow triangles in Fig. 1. We denote each matching network as $f(\cdot, \cdot)$ and compute similarity scores $r_{1:k} = f(x_t, x_{a_{1:k}})$ for all top-k examples $x_{a_{1:k}}$.

The matching network first transforms the test example $x_t$ and a retrieved example $x_{a_i}$ from the continuous representation space to a single example in a similarity space. We leverage two similarity functions to obtain the similarity space. The first function is the absolute value of the element-wise subtraction: $f_{\mathrm{abssub}}(h_t, h_{a_i}) = |h_t - h_{a_i}|$. The second one is the element-wise summation: $f_{\mathrm{sum}}(h_t, h_{a_i}) = h_t + h_{a_i}$. The final similarity space is the concatenation of these two functions' results: $f_{\mathrm{sim}}(h_t, h_{a_i}) = f_{\mathrm{abssub}}(h_t, h_{a_i}) \oplus f_{\mathrm{sum}}(h_t, h_{a_i})$, where $\oplus$ denotes the concatenation operation. We then pass the result to two fully-connected layers (one with ReLU activation) and a sigmoid function:

$$r_i = f(x_t, x_{a_i}) = \sigma\big(W_2 \cdot \mathrm{ReLU}(W_1 \cdot f_{\mathrm{sim}}(h_t, h_{a_i}) + b_1) + b_2\big). \qquad (1)$$

Since there are k nearest examples, we have k similarity scores denoted as $r_{1:k}$. The hyper-parameters are detailed in Sec. 3.
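The forward pass of one matching network can be sketched as follows. This is a minimal illustration in pure Python, assuming the layer order stated in the text (a fully-connected layer with ReLU, then a fully-connected layer with a sigmoid). The function names, the parameter dictionary keys (`W1`, `b1`, `W2`, `b2`), and the random untrained weights are assumptions made for the example.

```python
import math
import random


def similarity_features(h_test, h_example):
    """Concatenate |h_test - h_example| with h_test + h_example (element-wise)."""
    abs_sub = [abs(a - b) for a, b in zip(h_test, h_example)]
    summed = [a + b for a, b in zip(h_test, h_example)]
    return abs_sub + summed


def affine(weights, bias, vector):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def matching_score(h_test, h_example, params):
    """Similarity score in (0, 1) for one (test, retrieved example) pair."""
    features = similarity_features(h_test, h_example)
    hidden = [max(0.0, v) for v in affine(params["W1"], params["b1"], features)]
    (logit,) = affine(params["W2"], params["b2"], hidden)
    return 1.0 / (1.0 + math.exp(-logit))


def one_vs_many(h_test, top_k_examples, params):
    """Apply the same (shared) matching network to each retrieved example."""
    return [matching_score(h_test, h, params) for h in top_k_examples]


def random_params(hidden_size, fc_size, seed=0):
    """Untrained toy weights; hypothetical helper for the demo only."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": matrix(fc_size, 2 * hidden_size), "b1": [0.0] * fc_size,
            "W2": matrix(1, fc_size), "b2": [0.0]}


params = random_params(hidden_size=3, fc_size=4)
h_t = [0.2, -0.1, 0.7]
retrieved = [[0.1, 0.0, 0.6], [0.9, 0.4, -0.3]]
scores = one_vs_many(h_t, retrieved, params)  # k = 2 similarity scores
```

Both similarity functions are symmetric in their two arguments, so the score does not depend on which of the two examples is treated as the test example.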
2.1.3 Open-world Learning via Aggregation Layer. After getting the individual similarity scores, an aggregation layer in the meta-classifier merges the k similarity scores into a single probability indicating whether the test example belongs to the seen class. By having the aggregation layer, the meta-classifier essentially has a parametric voting mechanism, so that it can learn how to vote on multiple nearest examples (rather than a single example) from a seen class to decide the probability. As a result, the meta-classifier can make more reliable predictions, which is studied in Sec. 3.

We adopt a (many-to-one) BiLSTM [15, 29] as the aggregation layer. We set the output size of the BiLSTM to 2 (1 per direction of the LSTM). Then the output of the BiLSTM is connected to a fully-connected layer followed by a sigmoid function that outputs the probability. The computation of the meta-classifier for a given test example $x_t$ and the top-k examples $x_{a_{1:k}}$ of a seen class c can be summarized as:

$$p(c \mid x_t, x_{a_{1:k}}) = \sigma\big(W \cdot \mathrm{BiLSTM}(r_{1:k}) + b\big). \qquad (2)$$

Inspired by DOC [32], for each class $c \in S$, we evaluate Eq. 2 as:

$$\hat{y} = \begin{cases} \text{reject}, & \text{if } \max_{c \in S} p(c \mid x_t, x_{a_{1:k}|c}) \le 0.5; \\ \arg\max_{c \in S} p(c \mid x_t, x_{a_{1:k}|c}), & \text{otherwise}. \end{cases} \qquad (3)$$

If none of the existing seen classes in S gives a probability above 0.5, we reject $x_t$ as an example from some unseen class. Note that given a large number of classes, Eq. 3 can be efficiently implemented in parallel. We leave this to future work. To make L2AC an easily accessible approach, we naturally use 0.5 as the threshold and do not introduce an extra hyper-parameter that needs to be tuned. Note also that, as discussed earlier, the seen class set S and its examples can be dynamically maintained (e.g., one can add any class to S or remove any class from it). So the meta-classifier simply performs open-world classification over the current seen class set S.
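The joint decision over the current seen class set can be sketched as a short function. It assumes the per-class probabilities have already been produced by the meta-classifier; the function name and the `REJECTED` sentinel are hypothetical names used only for illustration.

```python
REJECTED = "<unseen>"  # hypothetical sentinel for the rejection outcome


def open_world_predict(class_probabilities, threshold=0.5):
    """Pick the most probable seen class, or reject if none exceeds the threshold.

    class_probabilities maps each seen class label to the probability that the
    test example belongs to it.
    """
    if not class_probabilities:
        return REJECTED  # an empty seen set can only reject
    best = max(class_probabilities, key=class_probabilities.get)
    if class_probabilities[best] > threshold:
        return best
    return REJECTED


prediction = open_world_predict({"shoes": 0.2, "books": 0.7})
```

Because the seen class set is just the key set of the dictionary, adding or removing a class changes only the input to this function, not the function or any trained parameters.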
2.2 Training of Meta-Classifier
Since the meta-classifier is a general classifier that is supposed to work for any class, training the meta-classifier requires examples from another set M of classes, called meta-training classes. A large |M| is desirable so that the meta-training classes have good coverage of features for seen and unseen classes in testing, which is similar in spirit to few-shot learning [21]. We also enforce $(S \cup U) \cap M = \emptyset$ in Sec. 3, so that all seen and unseen classes are totally unknown to the meta-classifier.
Next, we formulate the meta-training examples from M, which consist of a set of pairs (with positive and negative labels). The first component of a pair is a training document from a class in M, and the second component is a sequence of top-k nearest examples, also from a class in M.
We assume every example (document) of a class in M can be a training document $x_q$. Assume $x_q$ is from class $c \in M$. Then a positive training pair is $(x_q, x_{a_{1:k}|c})$, where $x_{a_{1:k}|c}$ are the top-k examples from class $c$ that are most similar or nearest to $x_q$; a negative training pair is $(x_q, x_{a_{1:k}|c'})$, where $c' \in M$, $c' \neq c$, and $x_{a_{1:k}|c'}$ are the top-k examples from class $c'$ that are nearest to $x_q$. We call $c'$ one negative class for $x_q$. Since there are many negative classes $c' \in M \setminus \{c\}$ for $x_q$, we keep the top-n negative classes for each training example $x_q$. That is, each $x_q$ has one positive training pair and n negative training pairs. To balance the classes in the training loss, we give a weight ratio of n : 1 for a positive and a negative pair, respectively.
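The construction of training pairs for a single meta-training document can be sketched as below. It assumes the negative classes have already been ranked by similarity and that the top-k example indexes per class are available; the function name, argument names, and dictionary fields are hypothetical and only illustrate the one-positive, n-negative layout with the n : 1 weighting.

```python
def build_training_pairs(query_id, query_class, ranked_negative_classes,
                         top_k_by_class, n):
    """Return one positive and up to n negative pairs for one training document.

    top_k_by_class maps a class label to the indexes of its top-k examples
    nearest to the query document.
    """
    negatives = [c for c in ranked_negative_classes if c != query_class][:n]
    # The positive pair is weighted by the number of negatives so that both
    # labels contribute equally to the loss (n : 1 ratio).
    pairs = [{"query": query_id, "class": query_class,
              "examples": top_k_by_class[query_class],
              "label": 1, "weight": float(max(1, len(negatives)))}]
    for neg_class in negatives:
        pairs.append({"query": query_id, "class": neg_class,
                      "examples": top_k_by_class[neg_class],
                      "label": 0, "weight": 1.0})
    return pairs


top_k = {"shoes": [3, 8], "boots": [1, 4], "books": [0, 2], "toys": [5, 9]}
pairs = build_training_pairs("doc-17", "shoes", ["boots", "toys", "books"],
                             top_k, n=2)
```

With n = 2 the document yields three pairs, and the positive pair's weight equals the total weight of its negative pairs.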
Training the meta-classifier also requires validation classes for model selection (during optimization) and for tuning the hyper-parameters k and n (as detailed in Experiments). Since the classes tested by the meta-classifier are unexpected, we further use a set of validation classes $M'$ with $M' \cap M = \emptyset$ (also $M' \cap (S \cup U) = \emptyset$) to ensure generalization on the seen/unseen classes.
We want to address the following Research Questions (RQs): RQ1 - What is the performance of the meta-classifier with different settings of top-k examples and n negative classes? RQ2 - How does the performance of L2AC compare with state-of-the-art text classifiers for open-world classification (which all need some form of re-training)?
3.1 Dataset
We leverage the huge amount of product descriptions from the Amazon Datasets [14] and formulate the OWL task as follows. Amazon.com maintains a tree-structured category system. We consider each path to a leaf node as a class. We removed products belonging to multiple classes to ensure that the classes do not overlap. This gives us 2598 classes, where 1018 classes have more than 400 products per class. We randomly choose 1000 classes from the 1018 classes, with 400 randomly selected products per class, as the encoder training set; 100 classes with 150 products per class are used as the (classification) test set, including both seen classes S and unseen classes U; another 1000 classes with 100 products per class are used as the meta-training set (including both the meta-training classes M and the validation classes M'). For the 100 classes of the test set, we further hold out 50 examples (products) from each class as test examples. The remaining 100 examples are training data for the baselines, or seen class examples to be read by the meta-classifier (which only reads those examples but is not trained on them). To train the meta-classifier, we further split the meta-training set into 900 meta-training classes (M) and 100 validation classes (M').
For all datasets, we use NLTK as the tokenizer, and regard all words that appear more than once as the vocabulary. This gives us 17,526 unique words. We take the maximum length of each document as 120 since the majority of product descriptions are under 100 words.
3.2 Ranker
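We use cosine similarity to rank the examples in each seen (or meta-training) class for a given test (or meta-training) example $x_t$ (or $x_q$). We apply cosine directly on the hidden representations of the encoder as

$$\mathrm{cosine}(h_*, h_{a_i}) = \frac{h_* \cdot h_{a_i}}{\lVert h_* \rVert_2 \, \lVert h_{a_i} \rVert_2},$$

where $*$ can be either t or q, $\lVert \cdot \rVert_2$ denotes the l-2 norm and $\cdot$ denotes the dot product of two examples.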
Training the meta-classifier also requires a ranking of negative classes for a meta-training example $x_q$, as discussed in Sec. 2.2. We first compute a class vector for each meta-training class. This class vector is the average of the encoded representations of all examples of that class. Then we rank classes by computing the cosine similarity between the class vectors and the meta-training example $x_q$. The top-n (defined in the previous section) classes are selected as negative classes for $x_q$. We explore different settings of n later.
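The two ranking steps described in this section (top-k examples within a class, and top-n negative classes via averaged class vectors) can be sketched as follows. This is a minimal pure-Python illustration on toy 2-dimensional vectors; all function names and the example data are hypothetical. Whether a meta-training document should be excluded from its own class's top-k list is not stated in the text and is left to the caller here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_nearest(query, class_examples, k):
    """Indexes of the k examples of one class most similar to the query."""
    order = sorted(range(len(class_examples)),
                   key=lambda i: cosine(query, class_examples[i]),
                   reverse=True)
    return order[:k]


def class_vector(examples):
    """Average of the encoded examples of one class."""
    return [sum(column) / len(examples) for column in zip(*examples)]


def top_n_negative_classes(query, query_class, examples_by_class, n):
    """The n classes (other than the query's own) whose class vectors are
    most similar to the query."""
    vectors = {c: class_vector(ex) for c, ex in examples_by_class.items()
               if c != query_class}
    ranked = sorted(vectors, key=lambda c: cosine(query, vectors[c]),
                    reverse=True)
    return ranked[:n]


examples_by_class = {
    "shoes": [[1.0, 0.0], [0.8, 0.6]],
    "boots": [[0.9, 0.1], [1.0, 0.2]],
    "books": [[0.0, 1.0], [0.1, 0.9]],
}
query = [1.0, 0.1]
nearest = top_k_nearest(query, examples_by_class["shoes"], k=1)
negatives = top_n_negative_classes(query, "shoes", examples_by_class, n=2)
```

Ranking negative classes by similarity keeps the hardest (most confusable) classes as negatives, which is why the nearby "boots" class is ranked ahead of "books" for a shoe-like query in this toy data.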
3.3 Evaluation
Similar to [32], we choose 25, 50, and 75 classes from the (classification) test set of 100 classes as the seen classes for three (3) experiments. Note that each class in the test set has 150 examples, where 100 examples are for the training of baseline methods or used as seen class examples for L2AC, and 50 examples are for testing both the baselines and L2AC. We evaluate the results on all 100 classes for those three (3) experiments. For example, when there are 25 seen classes, test examples from the remaining 75 unseen classes are treated as coming from one rejection class, as in [32].
Besides using macro F1 as used in [32], we also use the weighted F1 score over all classes (including the seen classes and the rejection class) as the evaluation metric. Weighted F1 is computed as

$$\sum_{c} \frac{N_c}{\sum_{c'} N_{c'}} \cdot \mathrm{F1}_c,$$

where $N_c$ is the number of examples for class c and $\mathrm{F1}_c$ is the F1 score of that class. We use this metric because macro F1 has a bias on the importance of rejection when the seen class set is small (macro F1 treats the rejection class as equally important as one seen class). For example, when the number of seen classes is small, the rejection class should have a higher weight, as a classifier on a small seen set is more likely to be challenged by examples from unseen classes. Further, to stabilize the results, we train all models with 10 different initializations and average the results.

Figure 2: Weighted F1 scores for different k (with n = 9) and different n (with k = 5).
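The difference between the two metrics can be checked with a small sketch. It assumes predictions and gold labels are plain label lists in which the rejection class is just another label; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """F1 score of every label that occurs in the gold or predicted labels."""
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by each class's share of the gold examples."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(support[c] / len(y_true) * scores[c] for c in support)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in the gold labels."""
    scores = f1_per_class(y_true, y_pred)
    classes = set(y_true)
    return sum(scores[c] for c in classes) / len(classes)


# One seen class with 2 examples and a rejection class with 4 examples.
y_true = ["shoes", "shoes", "reject", "reject", "reject", "reject"]
y_pred = ["shoes", "reject", "reject", "reject", "reject", "reject"]
wf1 = weighted_f1(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

In this toy case the rejection class holds two thirds of the examples, so weighted F1 gives it twice the weight of the seen class, whereas macro F1 counts both classes equally.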
3.4 Hyper-parameters
For simplicity, we leverage a BiLSTM [15, 29] on top of a GloVe [25] embedding (840b.300d) layer as the encoder (other choices are also possible). Similar to feature encoders trained from ImageNet [27], we train classification over the encoder training set with 1000 classes and use 5% of the encoder training data as encoder validation data. We apply a dropout rate of 0.5 to all layers of the encoder. The classification accuracy of the encoder on validation data is 81.76%. The matching network (the shared network within the 1-vs-many matching layer) has two fully-connected layers, where the size of the hidden dimension is 512 with a dropout rate of 0.5. We set the batch size of meta-training to 256.
To answer RQ1 on the two hyper-parameters k (number of nearest examples from each class) and n (number of negative classes), we use the 100 validation classes to determine these two hyper-parameters. We formulate the validation data similarly to the testing experiment on 50 seen classes. For each validation class, we select 50 examples for validation. The remaining 50 examples from each validation seen class are used to find the top-k nearest examples. We perform a grid search on the averaged weighted F1 over 10 runs for candidate values of k (including 10, 15, and 20) and of n, where k = 5 and n = 9 reach a reasonably good weighted F1 (87.60%). Further increasing n gives limited improvements (e.g., 87.69% for n = 14 and 87.68% for n = 19, when k = 5). But a large n significantly increases the number of training examples (e.g., n = 14 ended with more than 1 million meta-training examples) and thus the training time. So we select k = 5 and n = 9 for all ablation studies below. Note that the validation classes are also used to compute the validation loss (formulated in a way similar to the meta-training classes) for selecting the best model during Adam [17] optimization.
3.5 Compared Methods
To the best of our knowledge, DOC [32] is the only state-of-the-art baseline for open-world learning (with rejection) for text classification. It has been shown in [32] that DOC significantly outperforms the methods CL-cbsSVM and cbsSVM in [9] and OpenMax in [3]. OpenMax is a state-of-the-art method for image classification with rejection capability.
To answer RQ2, we use DOC and its variants to show that the proposed method has comparable performance with the best open-world learning method with re-training. Note that DOC cannot incrementally add new classes. So we re-train DOC over different sets of seen classes from scratch every time new classes are added to that set. It is thus actually unfair to compare our method with DOC because DOC is trained on the actual training examples of all classes. However, our method still performs better in general. We used the original code of DOC and created six (6) variants of it.
DOC-CNN: CNN implementation as in the original DOC paper without Gaussian fitting (using 0.5 as the threshold for rejection). It operates directly on a sequence of tokens.
DOC-LSTM: a variant of DOC-CNN, where we replace CNN with BiLSTM to encode the input sequence for fair comparison. BiLSTM is trainable and the input is still a sequence of tokens.
DOC-Enc: this is adapted from DOC-CNN, where we remove the feature learning part of DOC-CNN and feed the hidden representation from our encoder directly to the fully-connected layers of DOC for a fair comparison with L2AC.
DOC-*-Gaus: applying the Gaussian fitting proposed in [32] to the above three baselines, we have 3 more DOC baselines. Note that these 3 baselines have exactly the same models as above, respectively. They only differ in the thresholds used for rejection. Gaussian fitting in [32] is used to set a good threshold for rejection. We use these baselines to show that the Gaussian fitted threshold improves the rejection performance of DOC significantly but may lower the performance of seen class classification. The original DOC is DOC-CNN-Gaus here.
The following baselines are variants of L2AC.
L2AC-n9-NoVote: this is a variant of the proposed L2AC that only takes the one most similar example (from each class), i.e., k = 1, with one positive class paired with n = 9 negative classes in meta-training (n = 9 has the best performance as indicated in answering RQ1 above). We use this baseline to show that the performance of taking only one sample may not be good enough. This baseline clearly does not have/need the aggregation layer and only has a single matching network in the 1-vs-many layer.
L2AC-n9-Vote3: this baseline uses exactly the same model as L2AC-n9-NoVote. But during evaluation, we allow a non-parametric voting process (like kNN) for prediction. We report the results of voting over the top-3 examples per seen class as it has the best result (among voting sizes ranging from 3 to 10). If the average score of the top-3 similar examples in a seen class is more than 0.5, L2AC believes the test example belongs to that class. We use this baseline to show that the aggregation layer is effective in learning to vote and that L2AC can use more similar examples and get better performance.
L2AC-k5-n9-AbsSub/Sum: To show that using two similarity functions (AbsSub and Sum) gives better results, we further perform an ablation study by using only one of those similarity functions at a time, which gives us two baselines.
L2AC-k5-n9/14/19: this baseline has the best k = 5 and n = 9 on the validation classes, as indicated in the previous subsection. Interestingly, further increasing k may reduce the performance as L2AC may focus on not-so-similar examples. We also report results on n = 14 or 19 to show that the results do not get much better.
Table 1: Weighted F1 (WF1) and macro F1 (MF1) scores on a test set with 100 classes with 3 settings: 25, 50, and 75 seen classes. The set of seen classes is incrementally expanded from 25 to 75 classes (or gradually shrunk from 75 to 25 classes). The results are the averages over 10 runs with standard deviations in parentheses.
3.6 Results Analysis
From Table 1, we can see that L2AC outperforms DOC, especially when the number of seen classes is small. First, from Fig. 2 we can see that k = 5 and n = 9 give reasonably good results. Increasing k may harm the performance, as taking in more examples from a class may let L2AC focus on not-so-similar examples, which is bad for classification. More negative classes give L2AC better performance in general, but further increasing n beyond 9 has little impact.
Next, we can see that as we incrementally add more classes, the performance of L2AC gradually drops (which is reasonable given that there are more classes), but it still yields better performance than DOC. Considering that L2AC needs no training with additional classes, while DOC needs full training from scratch, L2AC represents a major advance. Note that testing on 25 seen classes is more about testing a model’s rejection capability, while testing on 75 seen classes is more about the classification performance of seen class examples. From Table 1, we notice that L2AC can effectively leverage multiple nearest examples and negative classes. In contrast, the non-parametric voting of L2AC-n9-Vote3 over the top-3 examples may not improve the performance but introduces higher variance. Our best k = 5 indicates that the meta-classifier can dynamically leverage multiple nearest examples instead of solely relying on a single example. As an ablation study on the choices of similarity functions, running L2AC on a single similarity function gives poorer results, as indicated by both L2AC-k5-n9-AbsSub and L2AC-k5-n9-Sum.
DOC without the encoder (DOC-CNN or DOC-LSTM) performs poorly when the number of seen classes is small. Without Gaussian fitting, the performance of DOC (DOC-CNN, DOC-LSTM or DOC-Enc) increases as more classes are added as seen classes. This is reasonable, as DOC is more challenged by fewer seen training classes and more unseen classes during testing. As such, Gaussian fitting (DOC-*-Gaus) alleviates the weakness of DOC on a small number of seen training classes.
Open-world learning has been studied in text mining and computer vision (where it is called open-set recognition) [2, 7, 9]. Most existing approaches focus on building a classifier that can predict examples from unseen classes into a (hidden) rejection class. These solutions are built on top of closed-world classification models [2, 3, 32]. Since a closed-world classifier cannot detect/reject examples from unseen classes (they will be classified into some seen classes), some thresholds are used so that these closed-world models can also be used to do rejection. However, as discussed earlier, when incrementally learning new classes, they also need some form of re-training, either full re-training from scratch [3, 32] or partial re-training in an incremental manner [2, 9].
Our work is also related to class incremental learning [23, 26, 28], where new classes can be added dynamically to the classifier. For example, iCaRL [26] maintains some exemplary data for each class and incrementally tunes the classifier to support more new classes. However, these methods also require training when each new class is added.
Our work is clearly related to meta-learning (or learning to learn) [34], which turns the machine learning tasks themselves into training data to train a meta-model and has been successfully applied to many machine learning tasks lately, such as [1, 8, 10–12]. Our proposed framework focuses on learning the similarity between an example and an arbitrary class, and we are not aware of any open-world learning work based on meta-learning.
The proposed framework is also related to zero-shot learning [22, 24, 33] (in that we do not require training but need to read training examples), k-nearest neighbors (kNN) (with additional rejection capability, metric learning [36] and learning to vote), and Siamese networks [4, 18, 35] (regarding processing a pair of examples). However, all those techniques work in closed worlds with no rejection capability.
Product classification has been studied in [5, 6, 13, 19, 30, 31], mostly in a multi-level (or hierarchical) setting. However, despite the dynamic nature of product taxonomies, product classification has not been studied as an open-world learning problem.
In this paper, we proposed a meta-learning framework called L2AC for open-world learning. L2AC has been applied to product classification. Compared to traditional closed-world classifiers, our meta-classifier can incrementally accept new classes by simply adding new class examples without re-training. Compared to other open-world learning methods, the rejection capability of L2AC is trained rather than realized using some empirically set thresholds. Our experiments showed superior performance to strong baselines.
Bing Liu’s work was partially supported by the National Science Foundation (NSF IIS 1838770) and by a research gift from Huawei.
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In NIPS. 3981–3989.
[2] Abhijit Bendale and Terrance Boult. 2015. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1893–1902.
[3] Abhijit Bendale and Terrance E Boult. 2016. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1563–1572.
[4] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems. 737–744.
[5] Ali Cevahir and Koji Murakami. 2016. Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 525–535.
[6] Jianfu Chen and David Warren. 2013. Cost-sensitive learning for large-scale hierarchical classification. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 1351–1360.
[7] Zhiyuan Chen and Bing Liu. 2018. Lifelong machine learning. Morgan & Claypool Publishers.
[8] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. 2018. Learning to Teach. arXiv preprint arXiv:1805.03643 (2018).
[9] Geli Fei, Shuai Wang, and Bing Liu. 2016. Learning cumulatively to become more knowledgeable. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1565–1574.
[10] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. 2017. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734 (2017).
[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning. 1126–1135.
[12] Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic Model-Agnostic Meta-Learning. arXiv preprint arXiv:1806.02817 (2018).
[13] Vivek Gupta, Harish Karnick, Ashendra Bansal, and Pradhuman Jhala. 2016. Product Classification in E-Commerce using Distributional Semantics. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 536–546.
[14] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web. International World Wide Web Conferences Steering Committee, 507–517.
[15] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[16] Young-Bum Kim, Dongchan Kim, Anjishnu Kumar, and Ruhi Sarikaya. 2018. Efficient Large-Scale Domain Classification with Personalized Attention. arXiv preprint arXiv:1804.08065 (2018).
[17] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[18] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
[19] Zornitsa Kozareva. 2015. Everyone likes shopping! multi-class product categorization for e-commerce. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1329–1333.
[20] Anjishnu Kumar, Pavankumar Reddy Muddireddy, Markus Dreyer, and Björn Hoffmeister. 2017. Zero-Shot Learning Across Heterogeneous Overlapping Domains. In INTERSPEECH. 2914–2918.
[21] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. 2011. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33.
[22] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 951–958.
[23] Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang. 2017. Lifelong Learning with Dynamically Expandable Networks. arXiv preprint arXiv:1708.01547 (2017).
[24] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. In NIPS. 1410–1418.
[25] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
[26] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental Classifier and Representation Learning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 5533–5542.
[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[28] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016).
[29] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[30] Dan Shen, Jean-David Ruvini, and Badrul Sarwar. 2012. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 595–604.
[31] Dan Shen, Jean David Ruvini, Manas Somaiya, and Neel Sundaresan. 2011. Item categorization in the e-commerce domain. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 1921–1924.
[32] Lei Shu, Hu Xu, and Bing Liu. 2017. DOC: Deep Open Classification of Text Documents. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 2911–2916. https://www.aclweb.org/anthology/D17-1314
[33] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NIPS. 935–943.
[34] Sebastian Thrun and Lorien Pratt. 2012. Learning to learn. Springer.
[35] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.
[36] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. 2003. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems. 521–528.