One of the biggest challenges in machine learning is to learn in non-stationary environments, in which the underlying data distribution (i.e., the joint distribution of the input data and labels P(X, Y)) changes over time, a phenomenon referred to as concept drift in previous literature (Schlimmer & Granger, 1986; Widmer & Kubat, 1996). One source of the change is the drift in the conditional distribution of labels given the input data (i.e., P(Y|X)), often resulting from a change in the task definition, where the predictive function from the input space to the label space may vary. We therefore name this drift in P(Y|X) task drift. Another source of the distribution change is the drift in the marginal distribution of the input data (i.e., P(X)), which we name domain drift here, with the additional assumption that P(Y|X) remains the same, in contrast to the aforementioned task drift. The domain drift problem has been identified in many practical scenarios that tackle stream data, often under different terminologies, e.g., virtual concept drift (Widmer & Kubat, 1993; Tsymbal, 2004), feature change (Gao et al., 2007), and many others summarized in survey papers (Gama et al., 2014; Ditzler et al., 2015).
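The distinction between the two drifts can be made concrete on a toy discrete distribution. The following is a minimal sketch, not part of the proposed method: the function names, the tolerance, and the example tables are illustrative assumptions. It factorizes P(X, Y) into P(X) and P(Y|X) and reports which factor changed.

```python
def marginal_x(joint):
    """Marginal P(X) from a joint table {(x, y): probability}."""
    px = {}
    for (x, _y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px


def conditional_y_given_x(joint):
    """Conditional P(Y|X) for every x that has non-zero mass."""
    px = marginal_x(joint)
    return {(x, y): p / px[x] for (x, y), p in joint.items() if px[x] > 0}


def _differs(a, b, tol):
    """True if two probability tables differ by more than tol on any key."""
    return any(abs(a.get(k, 0.0) - b.get(k, 0.0)) > tol for k in set(a) | set(b))


def classify_drift(joint_old, joint_new, tol=1e-9):
    """Label a change in P(X, Y) with the terminology used in the text."""
    if _differs(conditional_y_given_x(joint_old), conditional_y_given_x(joint_new), tol):
        return "task drift"      # P(Y|X) changed
    if _differs(marginal_x(joint_old), marginal_x(joint_new), tol):
        return "domain drift"    # only P(X) changed, P(Y|X) is the same
    return "no drift"


# Hypothetical binary example: same labeling rule, different input frequencies.
base = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
new_inputs = {(0, 0): 0.16, (0, 1): 0.04, (1, 0): 0.16, (1, 1): 0.64}
# Hypothetical example with the labeling rule flipped.
new_labels = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
```

Here `classify_drift(base, new_inputs)` reports domain drift and `classify_drift(base, new_labels)` reports task drift. In practice the distributions are unknown and high-dimensional, so this check only illustrates the definitions.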
Current research in continual learning (Kirkpatrick et al., 2017; Zenke et al., 2017; Rebuffi et al., 2017; Shin et al., 2017; Nguyen et al., 2018; Schwarz et al., 2018) assumes a single non-stationary data stream, without considering the drifts between the training data and test data. In real-world applications, however, we normally have two streams of data arriving simultaneously, i.e., a training or support stream and a test or query stream, where both task drift and domain drift can be present across streams. For example, self-taught learning (Raina et al., 2007) can be viewed as a one-step adaptation of the task drift between the support data and query data, while most unsupervised domain adaptation algorithms (Long et al., 2015; Ganin et al., 2016; Tzeng et al., 2017; Hoffman et al., 2018; Saito et al., 2018; Zhang et al., 2019) resolve a single step of the domain drift without considering stream data. There are works, however, that aim to tackle the domain drift for stream data, i.e., a stream of continuously evolving domains, but their proposed methodologies still do not consider task drift in the stream (Hoffman et al., 2014; Wulfmeier et al., 2018).
Aiming to tackle both drifts in non-stationary environments, this paper studies the problem of Continuous Domain Adaptation (ConDA) for real-life AI use cases. In ConDA, we assume data arriving in two streams (support and query), with the possibility of having both task drift and domain drift within and across the two streams. This is a practical scenario for many applications that use cloud services to ingest data. The goal for the AI model in the cloud is to continuously ingest the support data into some form of knowledge, and apply the learned knowledge to the queries that request predictive services. More specifically, the model needs to continuously accumulate knowledge from two perspectives: first, the knowledge should be filtered to be domain-agnostic; second, the knowledge should be captured in a transferable form that can be revisited at any time as needed, either to avoid catastrophic forgetting (McCloskey & Cohen, 1989) by retaining competence on previously seen environments, or as part of domain-independent generic prior knowledge to help solve the current query of interest. Unlike many previous works in continual learning (Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Shin et al., 2017) that address catastrophic forgetting by using unfiltered data (real or generated) to simulate a stationary environment, we emphasize high-level knowledge transfer for selective remembering. This is in line with the observation that although the human brain has a huge capacity, forgetting seems to be an evolutionarily sound mechanism. What we need is some form of abstract and transferable knowledge, such as a textbook or dictionary, rather than data-level raw information.
To achieve the above, we propose a variational domain-agnostic feature replay approach, which is composed of three modules: (1) the inference module that transforms the input data into filtered knowledge that is domain-agnostic; (2) the generative module as a means to enable knowledge transfer by replaying the learned knowledge, similar to the high-level replay found in animal brains (Skaggs & McNaughton, 1996); and (3) the solver module that applies the filtered and transferable knowledge to solve the queries. Intuitively, the synergy between the inference module and the solver module creates an information bottleneck, where the inference module minimizes the mutual information between the input and the domain-agnostic features while the solver module maximizes the mutual information between the domain-agnostic features and the labels.
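The information-bottleneck intuition can be illustrated on discrete variables. The sketch below is an assumption-laden toy rather than the training objective of the approach: `mutual_information` computes I(A; B) from a joint table, and `bottleneck_objective` combines the two quantities named above using a hypothetical trade-off weight `beta`, which does not appear in the text.

```python
import math


def mutual_information(joint):
    """I(A; B) in nats for a joint table {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(
        p * math.log(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def bottleneck_objective(joint_xh, joint_hy, beta):
    """Reward label information in h, penalize input information in h.

    beta is a hypothetical trade-off weight (an assumption of this sketch).
    """
    return mutual_information(joint_hy) - beta * mutual_information(joint_xh)


# Toy example: four equally likely inputs, feature h = x // 2, label y = h.
joint_xh = {(0, 0): 0.25, (1, 0): 0.25, (2, 1): 0.25, (3, 1): 0.25}
joint_hy = {(0, 0): 0.5, (1, 1): 0.5}
score = bottleneck_objective(joint_xh, joint_hy, beta=0.5)
```

In this toy case the feature keeps one bit about the input and one bit about the label, so both mutual information values equal ln 2.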
We validate the proposed approach on two fundamental scenarios of continuous domain adaptation, i.e., enforcing non-stationarity on the query stream either in the task space or in the domain space. Our experiments demonstrate the effectiveness of the proposed approach in addressing both task drift and domain drift. We also show the possibility of using generative domain-agnostic feature replay as a means of data augmentation towards better generalization.
Let S denote the support stream and Q the query stream, both composed of sequentially arriving datasets (Figure 1, top). The goal of continuous domain adaptation is to maintain the knowledge on S and transfer it to Q. Since the data streams can be acquired from any acquisition system for any task, they may be subject to both task drift and domain drift.
Figure 1: Streams of incoming data in the cloud, made up of training (i.e., support) sets and test (i.e., query) sets. We want to maintain our knowledge on S and transfer it to Q.
The task drift can be illustrated in Scenario (1) (shown in Figure 1, bottom left) with non-stationarity in the task space, where for the support stream S, the predictive mapping from the input data X to class labels Y can be continuously changing, i.e., P(Y|X) differs between time steps, therefore resulting in different tasks, and the same non-stationarity also applies to the query stream Q. The challenge here is that the model needs not only to retain knowledge learned from past data for solving previous tasks (or queries), but also to address the domain shift between the query and support streams, since their input distributions P(X) differ. The latter problem is well known as domain adaptation (Pan & Yang, 2009; Quionero-Candela et al., 2009), and in order for the query data to be solvable, a typical assumption is that there exists an unobserved latent variable on which the support and query data share the same labeling function (Quionero-Candela et al., 2009).
On the other hand, we can also assume non-stationarity in the domain space, where we have continuously arriving data in the query stream, each coming from a different domain (i.e., P(X) differs between any two query datasets while P(Y|X) remains the same), as shown in Scenario (2) (Figure 1, bottom right). This scenario also represents a group of common use cases in practice. For example, in the healthcare sector, the sequentially arriving query data from different hospitals may be subject to domain drift due to the acquisition system settings, patient demographics, etc. However, the underlying predictive mechanism should remain unchanged. As a result, the model is required to be continuously updated to bridge the new domain shift for the current query while maintaining the ability to solve previously seen queries. Note that, in both Scenario (1) and Scenario (2), we assume the usual restriction in continual learning that the data seen in previous environments is hidden, and we only have access to data in the current environment.
The above two scenarios compose the fundamental elements in building up most complex ConDA scenarios. Therefore, in this paper, we consider solving these two fundamental scenarios. Moreover, in ConDA, we are interested in continuously solving unlabeled queries, whose solvability also depends on whether the corresponding support data is available. To make sure that the queries are actually solvable, we therefore assume that for each query dataset, the corresponding support dataset (i.e., the one that shares the same labeling function with the query dataset) arrives earlier than the query dataset. In practice, if the support dataset arrives later, the adaptation of the query dataset can be held until the support dataset is available.
Figure 2: Decoupling continuous domain adaptation into three modules: Inference module (blue), Generative module (grey) and Solver module (green). The inference module infers the domain-agnostic features for the current task through variational inference; the generative module generates domain-agnostic feature samples from previously seen tasks; and the solver module is able to continuously solve all seen tasks.
Here, we propose to solve continuous domain adaptation by decoupling the problem into three modules (Figure 2): (1) Inference module (Section 3.1), which uses variational inference to train a domain-agnostic feature space for a given domain drift; (2) Generative module (Section 3.2), which allows us to sample from a previously learned domain-agnostic feature space. The sampled features can be used either as our filtered knowledge to reinforce the solver on remembering the knowledge required to solve the previous queries in a non-stationary environment, or as a data augmentation tool to augment the knowledge about the current query; and (3) Solver module (Section 3.3), which focuses only on the downstream task of interest, provided that the inferred or generated data is already domain-agnostic.
This decoupling also separates the concerns of domain drift from task drift in complex non-stationary environments, and brings up the possibility of transferring domain-agnostic knowledge among different environments, which to the best of our knowledge, has not yet been investigated in current domain adaptation research. As a first step, we will show later in our experiments that this is possible with our proposed approach.
3.1. Variational domain-agnostic feature inference
In this module, we are given some labeled input data from the support stream S and unlabeled query data from the query stream Q at time step t. We aim to map both support and query data into a shared stochastic feature space, whose features we denote by h, using a mapping function such that the following two conditions hold: (1) h maximally preserves the necessary information to predict the label y, and (2) the domain discrepancy between support and query on h is minimal. The first condition respects the current task in spite of task drift, while the second condition deals with domain drift. Note that for simplicity of notation, we ignore the subscripts for the random variables here.
To achieve this, we use variational inference (Hoffman et al., 2013) by first encoding x into a latent variable z, which is then decoded into h. We also introduce a conditioning factor c in order to enable conditional generation for the generative module (Section 3.2) in different environments, similar to Sohn et al. (2015). The two conditions on h can then be formulated as a constrained problem (1), in which the first condition is optimized subject to the second. The second condition is stated on the marginal feature distributions of the support domain and the query domain through d, the divergence that measures domain discrepancy, with f and f' denoting two hypotheses in the hypothesis space F (Ben-David et al., 2010).
Solving (1) is equivalent to minimizing an objective function that adds, to the error of satisfying the first condition in (1), the domain discrepancy of the second condition weighted by a Lagrange multiplier. Note that although derived from a new perspective, our objective function has a form similar to the upper bound of the target domain error in domain adaptation theory (Ben-David et al., 2010). In our case, we focus on constructing the domain-agnostic feature space that can be sampled from later on in the generative replay module (Section 3.2). Following the works in adversarial domain adaptation (Ganin & Lempitsky, 2015; Tzeng et al., 2017; Hoffman et al., 2018; Saito et al., 2018; Long et al., 2018; Zhang et al., 2019), we also minimize the domain disparity discrepancy via a minimax optimization process.
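For orientation, a Lagrangian objective of the kind described above, and its adversarial counterpart, are commonly written as follows. This is a sketch in assumed notation and should not be read as the exact equations of this section: the symbols \phi (the mapping from x to h), \epsilon_S (the error on labeled support data), and \lambda (the Lagrange multiplier) are introduced here for illustration only.

```latex
% Assumed notation: \phi maps x to h, \epsilon_S is the labeled support error,
% d is the divergence between feature marginals, \lambda is the Lagrange multiplier.
\min_{\phi,\, f} \;\; \epsilon_S(f \circ \phi) \;+\; \lambda \, d\big(P_S(h),\, P_Q(h)\big)

% Adversarial (minimax) form: an auxiliary hypothesis f' estimates the discrepancy,
% while \phi and f are trained to reduce it.
\min_{\phi,\, f} \; \max_{f' \in F} \;\; \epsilon_S(f \circ \phi) \;+\; \lambda \, d_{f, f'}\big(P_S(h),\, P_Q(h)\big)
```

The inner maximization tightens the estimate of the discrepancy, and the outer minimization learns features on which even the best auxiliary hypothesis finds little difference between support and query data.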
3.2. Generative domain-agnostic feature replay
Assuming a variational mapping from the input space to the domain-agnostic feature space has been learned, we can then use the decoder function g to conditionally generate domain-agnostic features based on the conditioning factor c for each learned environment. Since the generated features are domain-agnostic, they represent the knowledge of interest filtered from the support data, and this knowledge can be transferred through the generative replay process to further guide the training of the solver. Therefore, at each time step t > 1, even when the real data seen in previous environments (i.e., before time step t) is not available, we can still replay the generated features to address catastrophic forgetting, i.e., to regularize the solver to remember how to solve previous queries. This is similar to generative replay, or pseudo-rehearsal (Robins, 1995), which has been widely used as an effective approach to addressing catastrophic forgetting in continual learning. However, in most works (Shin et al., 2017; Wu et al., 2018), the original input data x is replayed. Since p(x) is often difficult to approximate, especially when x is high-dimensional, our feature replay can be viewed as a means of high-level knowledge transfer, i.e., filtered knowledge replay rather than data-level replay, and the information filtration is guided by the inference module described above (Section 3.1).
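The replay step can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used in the experiments: `make_toy_decoder` is a hypothetical linear stand-in for a trained decoder g(z, c), the latent prior is assumed to be a standard normal as is conventional for conditional variational autoencoders, and the conditioning factor c is assumed to be an integer environment index.

```python
import random


def make_toy_decoder(weights):
    """Hypothetical stand-in for a trained snapshot decoder g(z, c)."""
    def g(z, c):
        # Linear map of z, shifted by the conditioning factor c.
        return [sum(w * zi for w, zi in zip(row, z)) + float(c) for row in weights]
    return g


def replay_features(snapshot_decoders, latent_dim, n_per_env, rng):
    """Generate (c, feature) pairs for every previously learned environment."""
    replayed = []
    for c, g in snapshot_decoders.items():
        for _ in range(n_per_env):
            # Assumed prior: z ~ N(0, I).
            z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            replayed.append((c, g(z, c)))
    return replayed


# Two hypothetical environments, each with a frozen decoder snapshot.
snapshots = {
    0: make_toy_decoder([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
    1: make_toy_decoder([[0.2, -0.3], [1.0, 1.0], [0.0, 2.0]]),
}
features = replay_features(snapshots, latent_dim=2, n_per_env=3, rng=random.Random(0))
```

No raw input x from earlier environments is needed at this point: only the decoder snapshots and the conditioning factors are kept.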
In addition to addressing catastrophic forgetting, our feature replay can also act as an alternative way of transferring part of the domain-independent generic prior knowledge among different queries. More specifically, the domain-agnostic features resulting from solving a previous query can be used as augmented data, in addition to the inferred features, when solving the current query, given the assumption that the previous query shares some similarity with the current one. We discuss more details in Section 4.4.
3.3. Solver module
Our solver is a unified model that continuously integrates knowledge filtered from seen data in the support stream, and is designed to solve all the seen queries. Since the solver operates on the domain-agnostic feature space, i.e., there is no domain shift between the support and query data in the feature space, it is therefore able to solve all the unlabeled queries in the query stream, once trained on the support stream.
At time step t, our solver sees both the features inferred from the input data in the support stream and the features generated by the previously learned snapshot decoders. The parameters of the solver are trained on an objective defined over both sets of features, where the generated features come from the snapshot decoders learned in previously seen environments.
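A solver objective of this shape can be sketched as the sum of a loss on the inferred features and a loss on the replayed features. The sketch below rests on assumptions that the text does not confirm: the loss is taken to be cross-entropy, replayed features are assumed to carry labels, and `replay_weight` is a hypothetical balancing factor.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probabilities."""
    return -math.log(max(probs[label], 1e-12))


def mean_loss(solver, batch):
    """Average cross-entropy of solver over a batch of (feature, label) pairs."""
    if not batch:
        return 0.0
    return sum(cross_entropy(solver(h), y) for h, y in batch) / len(batch)


def solver_objective(solver, inferred_batch, replayed_batch, replay_weight=1.0):
    """Loss on current inferred features plus weighted loss on replayed features."""
    return mean_loss(solver, inferred_batch) + replay_weight * mean_loss(solver, replayed_batch)


# Hypothetical solver that always predicts a uniform distribution over two classes.
def uniform_solver(h):
    return [0.5, 0.5]


inferred = [([0.1], 0), ([0.2], 1)]   # features inferred at the current time step
replayed = [([0.3], 1)]               # features generated by snapshot decoders
loss = solver_objective(uniform_solver, inferred, replayed)
```

With an empty replay batch the objective reduces to the loss on the current features, which corresponds to training without replay.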
Note that our solver is independent of the hypothesis classifier f in the inference module. Although f also aims to predict the class label given a domain-agnostic feature h, it cannot replace the role of the solver in solving previous queries, even with the replay of generated features. We will show later in our experiment (Figure 3 (c)) that feature replay without the solver module can interfere with the adversarial learning when minimizing the domain disparity discrepancy for the current adaptation, thus resulting in impaired performance. Therefore, our solver module is indispensable as a way to confuse knowledge learned from all seen environments, i.e., to remove the boundaries between them. We train the three modules end-to-end for continuous domain adaptation.
3.4. Theoretical analysis
Here, we analyze the theoretical guarantee for continuous domain adaptation.
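Theorem 1. Consider the domain discrepancy between the marginal feature distributions of the support and query data, measured by the divergence d, and the conditional distributions of labels given the real features and given the generated features, respectively. The total error of the query stream at time step t is bounded by an expression composed of the solver's error on the support data, the domain discrepancy, a KL term that compares the real and generated features, and a term for the optimal solver for both the support and query streams.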
The proof is given in the supplementary material. The query stream error bound has different components, which can also be explained by our three different modules. For example, the domain discrepancy term represents how well our inference module has learned a domain-agnostic feature space; the KL term measures how well the generative module approximates the real feature distribution, since we assume the previous real data is inaccessible; and finally, the support error term evaluates the solver's performance on the support data, and the remaining term is the capacity of the solver in finding an optimal solution for both streams.
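As a reminder of what a KL term measures, the following sketch computes the KL divergence between two discrete distributions. Which pair of distributions enters the bound, and in which direction, is not recoverable from the text here, so the example is generic and the argument order is an assumption.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical label distributions given a real and a generated feature.
real = [0.5, 0.5]
generated = [0.25, 0.75]
gap = kl_divergence(real, generated)
```

The divergence is zero when the two distributions coincide and grows as the generated distribution departs from the real one.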
Figure 3: (a) The query accuracy on the first task during sequential training (D→A, Office-31); (b) The average query accuracy of all learned tasks during sequential training (Ar→Cl, Office-Home); (c) Ablation study on different components of our proposed approach. Baseline without warmup corresponds to baseline 2 in Table 1, and baseline without task confusion corresponds to baseline 3 in Table 1.
Figure 4: Comparison of features representing the support data (in light colors) and query data (in dark colors), before and after our inference module (left); The performance of the solver trained with generated features is comparable to that of the solver trained with real features (right).
We validate our proposed approach for continuous domain adaptation on two benchmark datasets. For Scenario (1), we split the datasets into different tasks based on the class label to simulate the task drift in a non-stationary environment, and we consider one domain as the support stream with labels and another domain as the query stream without labels. For Scenario (2), we choose one domain as the support stream and consider the remaining domains in the dataset as sequentially arriving queries in the query stream, so that the domain drift is present both within and across the two streams. We use margin disparity discrepancy (MDD) in our inference module for minimizing the domain disparity discrepancy, and also follow the same architecture choices as in Zhang et al. (2019). More implementation details are provided in the supplementary material.
Table 1: Components of our proposed approach.
Dataset Office-31 (Saenko et al., 2010) has three domains: Amazon (A), DSLR (D) and Webcam (W), which in total contain 31 classes and 4,652 images. We split the dataset into 5 tasks, with 6 classes in the first four tasks and 7 classes in the last task (split details in Table 4, supplementary material). Office-Home (Venkateswara et al., 2017) is a more challenging dataset that contains 65 classes and 15,500 images in four distinct domains: Artistic images (Ar), Clip art (Cl), Product images (Pr) and Real-World images (Rw). Similarly, we split the dataset into 13 tasks, each with 5 classes (split details in Table 5, supplementary material).
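The task sizes described above can be reproduced with a small helper. This is a sketch under an assumption: classes are assigned to tasks as contiguous index ranges, whereas the actual class-to-task assignment is the one listed in the supplementary tables. The function name is hypothetical.

```python
def split_classes_into_tasks(num_classes, num_tasks):
    """Split class indices into consecutive tasks; leftover classes go to the last tasks."""
    base, extra = divmod(num_classes, num_tasks)
    tasks, start = [], 0
    for t in range(num_tasks):
        # The final `extra` tasks each receive one additional class.
        size = base + (1 if t >= num_tasks - extra else 0)
        tasks.append(list(range(start, start + size)))
        start += size
    return tasks


office31_tasks = split_classes_into_tasks(31, 5)      # sizes 6, 6, 6, 6, 7
office_home_tasks = split_classes_into_tasks(65, 13)  # 13 tasks of 5 classes
```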
4.1. Domain-agnostic feature evaluation
We first evaluate the features from two perspectives: (1) whether the features can be domain-agnostic representations of filtered knowledge, and (2) whether the generated features can be a functional replacement of the real data for knowledge transfer, i.e., whether we can potentially use the proposed generative feature replay to facilitate the solver in remembering previously learned knowledge. To address the second question, we train a solver on the generated features, and evaluate it on the real features.
As shown in Figure 4 (left), the output features of the inference module are aligned between the support data (in light colors) and query data (in dark colors), as compared to the features directly extracted from a ResNet (He et al., 2016) model pretrained on ImageNet (Russakovsky et al., 2015), indicating that the features are indeed domain-agnostic. In addition, we also demonstrate in Figure 4 (right) that the solver trained with generated features can predict the class labels as well as the solver trained with real features, although the convergence is slower with generated features. This suggests that feature replay is effective in approximating the real features for the downstream solver. It is also worth mentioning that the introduction of the variational inference module does not impair the domain adaptation performance, with respect to MDD (Zhang et al., 2019) as the positive control (Figure 4, right).
Figure 5: t-SNE visualizations of domain-agnostic features from both the support data (in light colors) and query data (in dark colors). The features are being continuously aligned given sequentially arriving tasks on Office-31.
Table 2: Average query accuracy (%) of all learned tasks on Office-Home.
4.2. Non-stationarity in tasks
Having shown the effectiveness of both inference and generative modules in Figure 4, we now use the proposed variational domain-agnostic feature replay to address Scenario (1), where we assume task drift in both streams and domain drift across streams (Figure 1, bottom left). For the solver to be able to continuously solve the non-stationary queries, we replay the generated domain-agnostic features learned from previous tasks while learning the current task. This allows the solver to operate on both previous tasks and the current task simultaneously and thus function as a task confuser, i.e., removes task boundaries.
As shown in Figure 3 (a), the generative feature replay helps the solver remember the first task as training progresses across tasks, whereas the solver suffers from catastrophic forgetting without replay (more results on other domains are shown in Figure 8, supplementary material). Similarly, the average query accuracy on all learned tasks can also be improved by the replay process (Figure 3 (b)). Surprisingly, we also find that in some cases (e.g., the Office-31 dataset shown in Figure 3 (a)), the generative feature replay works better than the memory feature replay, where we store the features from real data in memory. One possible explanation is that the generative feature distribution has learned the missing data points and acts as a regularizer in the feature space, which helps mitigate overfitting, especially when the training examples are few (e.g., the Office-31 dataset). This is also evidenced by the comparable performance between generative feature replay and memory feature replay on the Office-Home dataset (Figure 3 (b)), where more data examples are available. As a negative control, we also experiment with noise replay, where the features are replaced with random noise, and as expected, the solver suffers from catastrophic forgetting (Figure 3 (a) and (b)).
Ablation study We analyze the different model components and strategies used for our approach in Table 1, e.g., the solver module for the task confusion, the snapshot for the generative feature replay module, and the warmup strategy. The warmup strategy is designed to first train the inference module independently for a few iterations, before integrating it with the training of the solver module in an end-to-end fashion. Figure 3 (c) shows the results of the ablation study on both our approach and the baseline. It is shown in the figure that both the task confusion component and warmup strategy improve the performance of our approach while all baselines suffer severely from forgetting.
Figure 5 shows the t-SNE visualizations of features from both the support data (in light colors) and query data (in dark colors) at each task step, where the class space is gradually expanding. The class-wise alignment between the support stream and query stream guarantees the solver’s performance on the query stream, since the solver is only trained with the support data in our approach. We summarize the performance comparisons of the average query accuracy between our approach and multiple baselines in Table 2 (Office-Home) and Table 6 (Office-31, supplementary material). The superior performance of our proposed approach demonstrates its effectiveness in addressing both task drift and domain drift that are present in Scenario (1).
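The average query accuracy reported here can be sketched as follows. The averaging scheme is an assumption of this sketch: each learned task is weighted equally, whereas the exact protocol is described in the supplementary material and may differ. The accuracy values are made up for illustration.

```python
import math


def average_query_accuracy(per_task_accuracy):
    """Unweighted mean of the query accuracies on all tasks learned so far."""
    if not per_task_accuracy:
        raise ValueError("at least one learned task is required")
    return sum(per_task_accuracy) / len(per_task_accuracy)


# Hypothetical accuracies on three learned tasks after training on the third one.
avg = average_query_accuracy([0.9, 0.7, 0.8])
```

Tracking this value after every task step gives curves of the kind plotted in Figure 3 (b), while the first entry alone corresponds to the first-task accuracy in Figure 3 (a).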
4.3. Non-stationarity in domains
In this section, we address Scenario (2), in which we assume a single domain in the support stream, and sequentially arriving queries from different domains in the query stream as shown in Figure 1 (bottom right). As such, the domain drift exists both within and across streams. We perform experiments on the Office-Home dataset by selecting one domain as the support data and the remaining three domains as queries in the query stream ordered by the adaptation difficulty level, either ascending (i.e., Rw, Pr, Ar, Cl) or descending (i.e., Cl, Ar, Pr, Rw). For example, if domain Ar is chosen as the support data, the adaptation of the query stream can be written as Ar→Rw, Pr, Cl in ascending order, and Ar→Cl, Pr, Rw in descending order. We evaluate the effectiveness of generative feature replay in the worst-case scenario of catastrophic forgetting, where the solver deteriorates into complete forgetting. To simulate this, we train the solver from scratch for each query domain, and investigate the effect of generative domain-agnostic feature replay on the overall performance of all seen domains.
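The construction of the query stream for a chosen support domain can be sketched as below. The difficulty ordering is taken from the text above, while the function and constant names are hypothetical.

```python
# Office-Home domains ordered by adaptation difficulty, easiest first (from the text).
DIFFICULTY_ASCENDING = ["Rw", "Pr", "Ar", "Cl"]


def query_stream(support, order="ascending"):
    """Return the query domains for a support domain in the requested order."""
    if support not in DIFFICULTY_ASCENDING:
        raise ValueError(f"unknown domain: {support}")
    if order not in ("ascending", "descending"):
        raise ValueError(f"unknown order: {order}")
    domains = [d for d in DIFFICULTY_ASCENDING if d != support]
    if order == "descending":
        domains.reverse()
    return domains


ascending = query_stream("Ar")                  # Ar→Rw, Pr, Cl
descending = query_stream("Ar", "descending")   # Ar→Cl, Pr, Rw
```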
Figure 6 shows the curves of the average query accuracy of all learned domains, and it demonstrates that with all listed permutations, generative feature replay helps the solver generalize to all previously seen domains when the data for previous query domains is inaccessible. This again shows the ability of generative feature replay to perform a de novo transfer of high-level knowledge to the solver without relying on example-level experiences. Table 3 compares the average query accuracy of our approach to that of different baselines, where we can see dramatic improvements with replay. We also find that the knowledge transfer can be further facilitated by keeping a snapshot of the encoder from the inference module, especially when the query stream is in ascending order (i.e., from easy to hard), suggesting that easy queries are more vulnerable to forgetting.
4.4. Generative feature replay for data augmentation
Note that in Scenario (2), as training progresses in the query stream, the solver module eventually captures generalized features that are agnostic to all seen domains. This is analogous to learning class-specific features that lie in the intersection of different domains as shown in Figure 7 (a). In our proposed approach, we learn domain-agnostic features between the support domain S and the query domain Q_i at time step i, e.g., the intersection of the support domain and query 1 in Figure 7 (a) (left). The generative feature replay module learned at time step i can be used when solving a subsequent query domain Q_j (j > i). However, whether replaying the generated features can be beneficial for solving Q_j depends on the similarity between Q_i and Q_j. For example, as illustrated in Figure 7 (a), query 1 shares more similarity with query 2 than with query 3, therefore the knowledge learned from solving query 1 would generally be more transferable to query 2.
Figure 6: The average query accuracy of all learned domains during sequential training (Office-Home).
To illustrate this, we first visualize the features of the first class (Table 5, supplementary material) from the four different domains in the Office-Home dataset as an estimation of domain relation. The features are extracted using a ResNet-50 model pretrained on ImageNet. As seen from the t-SNE plot in Figure 7 (b), domain Rw shares more overlap with domain Pr than with domain Cl. Correspondingly, we find in our experiment that, given domain Ar as the support domain, the generative replay of features learned from solving Ar→Rw improves Ar→Pr by around 3% in the query accuracy, while no significant improvement is observed for Ar→Cl (Figure 7 (c)). However, generally speaking, improvement is found to be the more common phenomenon: for example, Cl→Rw improves Cl→Pr by 1.35%, Ar→Cl improves Ar→Pr by 3.45%, and Pr→Cl improves Pr→Ar by 1.89% (results are shown in Figure 9, supplementary material). Given the observed improvements, it is possible that our generative feature replay, by providing more augmented feature samples that are domain-agnostic, imposes an additional regularization that constrains the solver to capture more generalized features. However, this is based on the assumption that the solver has a fixed amount of capacity.
Figure 7: (a) Examples of possible relation among domains; (b) t-SNE visualization of features from the four different domains in Office-Home dataset (showing the first class only); (c) Generative replay of the features learned from Ar→Rw improves Ar→Pr but not Ar→Cl.
Continuous domain adaptation The problem of continuous domain adaptation has been studied before but in different contexts with different emphases. For example, (Mancini et al., 2019) attempt to solve a specific scenario in continuous domain adaptation, where no target data is available, but with metadata provided for all domains; (Gong et al., 2019) propose to bridge two domains by generating a continuous flow of intermediate domains between the two original domains; (Hoffman et al., 2014; Wulfmeier et al., 2018) present continuous domain adaptation with the emphasis to generalize on a transitioning target domain. Closely related to our Scenario (2) is the recent work of (Bobu et al., 2018), where they also aim to address catastrophic forgetting, but with an implicit assumption that the domain drift follows a specific pattern, i.e., induced by gradually changing weather or lighting condition, which is a reasonable assumption in applications such as autonomous driving. We focus on more general use cases for solving any arriving queries in the cloud without imposing extra constraints on the relationship among the queries.
Variational information bottleneck Our work is also related to variational information bottleneck (Alemi et al., 2017), in the sense that we address the domain drift across streams via a variational inference that can be viewed as maximizing the mutual information between the domain-agnostic features and labels, while minimizing the mutual information between the input data and domain-agnostic features. A concurrent work (Song et al., 2019) adopts the idea of variational information bottleneck for domain adaptation, where the one-step domain adaptation performance is shown to be improved. Similarly, (Luo et al., 2019) show the integration of information bottleneck improves domain adaptive segmentation task. In our approach, we constrain the bottleneck on the decoded feature rather than directly on the latent code, and require no additional regularization on the query (target) data as in (Song et al., 2019).
Variational autoencoder (Kingma & Welling, 2013) has also been extensively exploited in domain adaptation to learn disentangled representations for better adaptation performance, where different types of latent variables are proposed to better capture the variations in the dataset (e.g., domain-relevant and class-relevant information), and the reconstruction is either on the image level (Ilse et al., 2019; Cai et al., 2019) or feature level (Peng et al., 2019). In our variational inference module, the domain-agnostic features are learned through supervision from labels instead of reconstruction.
Replay in continual learning Replay has been widely used as an effective approach to addressing catastrophic forgetting in continual learning research, such as example replay (Rebuffi et al., 2017), deep generative replay (Shin et al., 2017; Wu et al., 2018), and experience replay (Rolnick et al., 2019). However, in these approaches, the generative process is unfiltered and operates on the data level, while our generative process is domain-agnostic and operates on the abstract feature level. A concurrent work (Pellegrini et al., 2019) that uses latent replay is closely related to our feature replay, with both emphasizing high-level knowledge transfer. However, their latent replay stands for the replay of the activation volumes in some of the intermediate layers without stochasticity.
In this paper, we tackle the challenge of learning in non-stationary environments in the context of continuous domain adaptation, where we have two streams of data in the cloud (i.e., the support and query streams) that can be subject to both task drift and domain drift, within and across streams. We present two fundamental scenarios for continuous domain adaptation in the presence of across-stream domain drift, by assuming either task drift or domain drift in both streams. To address both drifts, we propose a variational domain-agnostic feature replay approach, which allows the model in the cloud to continuously accumulate the filtered and transferable knowledge for solving all queries. We demonstrate the effectiveness of the proposed approach on the two fundamental scenarios in continuous domain adaptation.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR) 2017, 2017.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
Bobu, A., Tzeng, E., Hoffman, J., and Darrell, T. Adapting to continuously shifting domains. In International Conference on Learning Representations Workshop, 2018.
Cai, R., Li, Z., Wei, P., Qiao, J., Zhang, K., and Hao, Z. Learning disentangled semantic representation for domain adaptation. In IJCAI: proceedings of the conference, volume 2019, pp. 2060. NIH Public Access, 2019.
Ditzler, G., Roveri, M., Alippi, C., and Polikar, R. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning-Volume 37, pp. 1180–1189, 2015.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
Gao, J., Fan, W., Han, J., and Yu, P. S. A general framework for mining concept-drifting data streams with skewed distributions. In Proceedings of the 2007 siam international conference on data mining, pp. 3–14. SIAM, 2007.
Gong, R., Li, W., Chen, Y., and Gool, L. V. Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2477–2486, 2019.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Hoffman, J., Darrell, T., and Saenko, K. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 867–874, 2014.
Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A. A., and Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
Ilse, M., Tomczak, J. M., Louizos, C., and Welling, M. Diva: Domain invariant variational autoencoders. arXiv preprint arXiv:1905.10427, 2019.
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning-Volume 37, pp. 97–105. JMLR.org, 2015.
Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650, 2018.
Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476, 2017.
Luo, Y., Liu, P., Guan, T., Yu, J., and Yang, Y. Significance-aware information bottleneck for domain adaptive semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6778–6787, 2019.
Mancini, M., Bulo, S. R., Caputo, B., and Ricci, E. Adagraph: Unifying predictive and continuous domain adaptation through graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6568–6577, 2019.
McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational continual learning. In International Conference on Learning Representations (ICLR) 2018, 2018.
Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
Pellegrini, L., Graffieti, G., Lomonaco, V., and Maltoni, D. Latent replay for real-time continual learning. arXiv preprint arXiv:1912.01100, 2019.
Peng, X., Huang, Z., Sun, X., and Saenko, K. Domain agnostic learning with disentangled representations. In International Conference on Machine Learning, pp. 5102–5112, 2019.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. The MIT Press, 2009.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pp. 759–766, 2007.
Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.
Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pp. 348–358, 2019.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226. Springer, 2010.
Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732, 2018.
Schlimmer, J. C. and Granger, R. H. Incremental learning from noisy data. Machine learning, 1(3):317–354, 1986.
Schwarz, J., Luketina, J., Czarnecki, W. M., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, 2018.
Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999, 2017.
Skaggs, W. E. and McNaughton, B. L. Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271(5257):1870–1873, 1996.
Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491, 2015.
Song, Y., Yu, L., Cao, Z., Zhou, Z., Shen, J., Shao, S., Zhang, W., and Yu, Y. Improving unsupervised domain adaptation with variational information bottleneck. arXiv preprint arXiv:1911.09310, 2019.
Tsymbal, A. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2):58, 2004.
Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176, 2017.
Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027, 2017.
Widmer, G. and Kubat, M. Effective learning in dynamic environments by explicit context tracking. In European Conference on Machine Learning, pp. 227–243. Springer, 1993.
Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine learning, 23(1):69–101, 1996.
Wu, C., Herranz, L., Liu, X., van de Weijer, J., Raducanu, B., et al. Memory replay gans: Learning to generate new categories without forgetting. In Advances In Neural Information Processing Systems, pp. 5962–5972, 2018.
Wulfmeier, M., Bewley, A., and Posner, I. Incremental adversarial domain adaptation for continually changing environments. In 2018 IEEE International conference on robotics and automation (ICRA), pp. 1–9. IEEE, 2018.
Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. JMLR.org, 2017.
Zhang, Y., Liu, T., Long, M., and Jordan, M. Bridging theory and algorithm for domain adaptation. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7404–7413, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
7.1. Proof of Theorem 1
In this subsection, we give the proof of Theorem 1 presented in Section 3.4, which analyzes the theoretical guarantee for continuous domain adaptation.
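Theorem 2. (Ben-David et al., 2010) Given a source domain $S$ and a target domain $T$ with input data distributions $P_S(X)$ and $P_T(X)$, the target domain error of a hypothesis $h \in \mathcal{H}$ is bounded by:
$$\epsilon_T(h) \le \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}\big(P_S(X), P_T(X)\big) + \lambda,$$
where $d_{\mathcal{H}\Delta\mathcal{H}}$ is the $\mathcal{H}\Delta\mathcal{H}$ divergence that measures the domain discrepancy between the source and target domains, and $\lambda$ is the error of an optimal classifier for both the source and target domains.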
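Corollary 2.1. For each time step $i$ in the support stream $S$ and query stream $Q$, let $d_i$ be the domain discrepancy of the feature distributions of the support domain $S_i$ and the query domain $Q_i$. The error of the query domain $Q_i$ is bounded by:
$$\epsilon_{Q_i}(h) \le \epsilon_{S_i}(h) + \frac{1}{2} d_i + \lambda_i,$$
where $\lambda_i$ is the error of an optimal solver for both the support domain $S_i$ and the query domain $Q_i$.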
With the setup introduced in Corollary 2.1, we now prove Theorem 1.
Theorem 1. (Theorem 1 in Section 3.4) Let $d_i$ be the domain discrepancy of the marginal feature distributions of the support domain $S_i$ and the query domain $Q_i$, measured by the $\mathcal{H}\Delta\mathcal{H}$ divergence, and consider the conditional distributions of labels given the real features and given the generated features, respectively. The total error of the query stream $Q$ at time step $t$ is then bounded as stated in Section 3.4, where $\lambda$ is the error of an optimal solver for both the support and query streams.
Proof. At time step $t$, the error of the previous query domains $Q_i$ for $i < t$ is estimated with the generated features, since we use the generated features to approximate the real features during the training of the solver. The total error of the query stream is:
On the other hand, for each support domain $S_i$, the KL divergence between the real and generated conditional distributions of labels given the features can be interpreted as:
Therefore, the error for each support domain at time step $t$ is estimated by:
Combining (11) and (13), the total error of the query stream at time step $t$ is:
7.2. Implementation details
7.2.1. DATASET TASK SPLIT
The two benchmark datasets are split into multiple tasks based on the class labels to simulate the task drift in non-stationary environments. Office-31 (Saenko et al., 2010) has three domains, Amazon (A), DSLR (D) and Webcam (W), which together contain 31 classes and 4,652 images. We split the dataset into 5 tasks, with 6 classes in each of the first four tasks and 7 classes in the last task (split details in Table 4). Office-Home (Venkateswara et al., 2017) is a more challenging dataset that contains 65 classes and 15,500 images in four distinct domains: Artistic images (Ar), Clip art (Cl), Product images (Pr) and Real-World images (Rw). Similarly, we split the dataset into 13 tasks, each with 5 classes (split details in Table 5).
Table 4: Office-31 task split.
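The class-based split described above can be sketched as follows. This is only an illustration of the task sizes (6, 6, 6, 6, 7 for Office-31 and 13 tasks of 5 classes for Office-Home); the function name `split_classes_into_tasks` is hypothetical, and the contiguous class ordering is an assumption, since the actual class-to-task assignment is the one given in the split tables.

```python
from typing import List


def split_classes_into_tasks(num_classes: int, num_tasks: int) -> List[List[int]]:
    """Partition class indices 0..num_classes-1 into contiguous tasks.

    Each task receives floor(num_classes / num_tasks) classes, and any
    leftover classes are appended to the last task. The contiguous ordering
    is an assumption made for illustration only.
    """
    if not 0 < num_tasks <= num_classes:
        raise ValueError("need 0 < num_tasks <= num_classes")
    base = num_classes // num_tasks
    tasks = [list(range(k * base, (k + 1) * base)) for k in range(num_tasks)]
    # Leftover classes (if any) go to the last task.
    tasks[-1].extend(range(num_tasks * base, num_classes))
    return tasks


office31_tasks = split_classes_into_tasks(31, 5)
office_home_tasks = split_classes_into_tasks(65, 13)
print([len(t) for t in office31_tasks])     # [6, 6, 6, 6, 7]
print([len(t) for t in office_home_tasks])  # thirteen tasks of 5 classes each
```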
7.2.2. MODEL ARCHITECTURES
We adopt ResNet (He et al., 2016) models pretrained on ImageNet (Russakovsky et al., 2015) as part of our encoder, e.g., ResNet-34 for the Office-31 dataset and ResNet-50 for the Office-Home dataset. The extracted features are used to infer the latent code with one additional linear layer. Our decoder consists of a linear layer, a ReLU layer and a Batch Norm layer. The dimension of the output domain-agnostic features is 1024 for Office-31 and 2048 for Office-Home. The solver module is a two-layer neural network with a width of 1024. The parameters of the ResNet model used in Scenario (2) are fine-tuned during training in order to learn better class representations, since more classes are involved within a single task than in Scenario (1), where the features extracted directly from the ResNet model are sufficient as class representations.
We use margin disparity discrepancy (MDD) in our inference module for minimizing the domain discrepancy, and follow the same architecture choices as in Zhang et al. (2019) for the two classifiers $f$ and $f'$, i.e., two-layer neural networks.
7.2.3. OPTIMIZATION
Two optimizers are used: an Adam optimizer for the inference module with learning rate 1e-4, and an SGD optimizer for the two classifiers ($f$ and $f'$) and the solver module, with Nesterov momentum 0.9 and weight decay 5e-4. The initial learning rate of the SGD optimizer is set to 4e-4 for Scenario (1) and 4e-3 for Scenario (2).
We use the gradient reversal strategy (Ganin et al., 2016) for the minmax optimization (Eq. 4), and the training scheduler for the coefficient in the gradient reversal layer is defined by:
where i is the iteration step number. The Lagrange multiplier is set to 1.
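As a hedged illustration of such a scheduler, the sketch below implements the ramp-up schedule commonly used with the gradient reversal layer in Ganin et al. (2016), $2 / (1 + \exp(-\gamma p)) - 1$, where $p \in [0, 1]$ is the training progress. The function name `grl_coefficient`, the normalization by `max_iters`, and the value $\gamma = 10$ are assumptions for illustration; the exact schedule used in this work may differ.

```python
import math


def grl_coefficient(i: int, max_iters: int, gamma: float = 10.0) -> float:
    """Gradient reversal coefficient at iteration step i.

    Follows the ramp-up form of Ganin et al. (2016): the coefficient starts
    at 0 and smoothly approaches 1 as training progresses. `max_iters` and
    `gamma` are assumed hyperparameters, not values taken from this paper.
    """
    if max_iters <= 0:
        raise ValueError("max_iters must be positive")
    # Training progress, clipped to [0, 1].
    p = min(max(i / max_iters, 0.0), 1.0)
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


for step in (0, 250, 500, 1000):
    print(step, round(grl_coefficient(step, max_iters=1000), 4))
```

The coefficient is small early in training, so noisy domain-discrepancy gradients have little effect on the features until the classifiers have started to fit the support data.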
We train the three modules (the inference module, generative module and solver module) end-to-end with 1000 iteration steps per task for Office-31, and 5000 iteration steps per task for Office-Home. We warm up the inference module for 500 iteration steps.
7.3. Additional results
7.3.1. ADDITIONAL RESULTS ON OFFICE-31 DATASET
Here, we show additional results that are referenced in Section 4.2 on the Office-31 dataset, illustrating the effectiveness of our proposed approach in addressing the task drift.
Figure 8 compares the query accuracy of the first task between the baseline and our proposed approach, as the training progresses across tasks (from task 1 to task 5). The proposed approach outperforms the baseline, where the solver is shown to consistently maintain the knowledge on how to solve the first task, regardless of the chosen domains (e.g., D→A or D→W). However, the improvement is the most evident in the D→A setting (Figure 8, left).
Figure 8: The query accuracy on the first task during sequential training. (Office-31)
Table 6: Average query accuracy (%) of all learned tasks on Office-31.
Table 6 shows the average query accuracy of all learned tasks on Office-31. On average, our approach gives better performance than multiple baselines, and is comparable to the upper bound.
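For concreteness, the metric reported in Table 6 and the mean ± std aggregation over independent runs can be sketched as below. The accuracy values are made-up placeholders, not results from this paper, and the helper names are hypothetical. The use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev
from typing import Sequence


def average_query_accuracy(per_task_acc: Sequence[float]) -> float:
    """Mean query accuracy (%) over all tasks learned so far."""
    if not per_task_acc:
        raise ValueError("need at least one task accuracy")
    return mean(per_task_acc)


def summarize_runs(run_scores: Sequence[float]) -> str:
    """Format scores from independent runs as 'mean ± std' (sample std)."""
    return f"{mean(run_scores):.1f} ± {stdev(run_scores):.1f}"


# Placeholder per-task query accuracies (%) for three independent runs
# of a five-task sequence; these numbers are illustrative only.
runs = [
    [90.0, 80.0, 70.0, 85.0, 75.0],
    [92.0, 80.0, 72.0, 84.0, 77.0],
    [88.0, 80.0, 68.0, 86.0, 73.0],
]
per_run = [average_query_accuracy(r) for r in runs]
print(summarize_runs(per_run))  # 80.0 ± 1.0
```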
7.3.2. EXAMPLE SCENARIO DERIVED FROM COMBINING SCENARIO (1) AND (2)
Scenarios (1) and (2) are the two most fundamental scenarios, from which most complex ConDA scenarios in real life can be built. Having addressed both scenarios in Sections 4.2 and 4.3, in this subsection we further show an example scenario derived from combining Scenarios (1) and (2). More specifically, we integrate the within-stream domain drift from Scenario (2) into Scenario (1), so that the new scenario has both task drift and domain drift within streams, as well as domain drift across the streams. We show the setup details in Table 9, where we make random combinations of the available tasks and domains.
Table 7 (Office-31) and Table 8 (Office-Home) show both the query accuracy of the first task and the average query accuracy of all learned tasks, evaluated on the new example scenario. We compare the proposed approach to both the optimal baseline (defined in Table 1) and the upper bound. Although our proposed approach significantly improves over the optimal baseline, there is still a margin between the proposed approach and the upper bound, suggesting that further improvement could be explored. We leave the exploration for future work.
Table 7: Evaluation of the proposed approach on an example scenario by combining Scenario (1) and (2) on the Office-31 dataset. The results are reported as mean ± std based on three independent experiments. The optimal baseline is defined in Table 1.
Table 8: Evaluation of the proposed approach on an example scenario by combining Scenario (1) and (2) on the Office-Home dataset. The results are reported as mean ± std based on three independent experiments. The optimal baseline is defined in Table 1.
Table 9: Setup of an example scenario derived from combining Scenario (1) and (2).
Figure 9: Additional examples of generative feature replay for data augmentation.
7.3.3. ADDITIONAL RESULTS ON GENERATIVE FEATURE REPLAY FOR DATA AUGMENTATION
In this subsection, we provide additional results that are referenced in Section 4.4, where we show the usage of generative feature replay as a data augmentation tool, in addition to addressing catastrophic forgetting.
Figure 9 shows more examples in which generative replay of previously learned features benefits the adaptation of the current query domain. For example, replaying the domain-agnostic features learned from Cl→Rw improves both Cl→Pr and Cl→Ar (Figure 9, left), and replaying the features from Ar→Cl also improves both Ar→Pr and Ar→Rw (Figure 9, middle). In some cases, however, no significant improvement is observed (e.g., from Pr→Cl to Pr→Rw in Figure 9, right), for the same reason as described in Section 4.4.