Incremental Unsupervised Domain-Adversarial Training of Neural Networks

2020·Arxiv

Abstract

Abstract

In the context of supervised statistical learning, it is typically assumed that the training set comes from the same distribution that draws the test samples. When this is not the case, the behavior of the learned model is unpredictable and becomes dependent upon the degree of similarity between the distribution of the training set and the distribution of the test set. One of the research topics that investigates this scenario is referred to as domain adaptation. Deep neural networks brought dramatic advances in pattern recognition and that is why there have been many attempts to provide good domain adaptation algorithms for these models. Here we take a different avenue and approach the problem from an incremental point of view, where the model is adapted to the new domain iteratively. We make use of an existing unsupervised domain-adaptation algorithm to identify the target samples on which there is greater confidence about their true label. The output of the model is analyzed in different ways to determine the candidate samples. The selected set is then added to the source training set by considering the labels provided by the network as ground truth, and the process is repeated until all target samples are labelled. Our results report a clear improvement with respect to the non-incremental case in several datasets, also outperforming other state-of-the-art domain adaptation algorithms.

Keywords: Domain Adaptation, Unsupervised learning, Neural Networks, Convolutional Neural

Networks, Incremental labelling, Machine learning

1. Introduction

Supervised learning is the most considered approach for dealing with classification tasks. This paradigm is based on a sufficiently representative training set to learn a classification model. This level of representativeness is usually defined by two criteria: on the one hand, the training samples must be varied, which allows the algorithm to generalize instead of memorizing; on the other hand, the application of the trained model is assumed to be carried out on samples that come from the same distribution as those of the training set [11].

Building a training set fulfilling these conditions is not always straightforward. Although obtaining samples might be easy, assigning their correct labels is costly. This is why there are efforts to alleviate the aforementioned requirements. However, while the conflict between memorization and generalization has been well studied, and there exist established mechanisms to deal with it such as regularization or data augmentation [15], learning a model that is able to correctly classify samples from a different target distribution remains open to further research. This problem is generally called transfer learning (TL) [26], and when the classification labels do not vary in the target distribution it is usually referred to as domain adaptation (DA) [35].

Within the context of supervised learning, deep learning represents an important breakthrough [19]. This term refers to the latest generation of artificial neural networks, for which novel mechanisms have been developed that allow training deeper networks, i.e., with many layers. These deep neural networks represent the state of the art in many classification tasks, and have managed to break the existing glass ceiling in many traditionally complex tasks. In turn, deep learning often requires a large amount of data, which makes the study of DA even more interesting.

As we will review in the next section, there are several alternatives to attempt DA, both general strategies and using deep neural networks. In this work we take a different avenue and study an incremental approach. We propose to use an existing DA algorithm for deep learning to classify those samples of the target domain for which the model is confident. Assuming the assigned labels as ground truth, the model is retrained. This added knowledge allows the network to refine its behavior to correctly classify other samples of the target set. This incremental process is repeated until the entire target set is completely annotated. We will show that this incremental approach achieves noticeable improvements with respect to the underlying DA algorithm. In addition, it is competitive on different benchmarks compared to other state-of-the-art DA algorithms.

The rest of the paper is structured as follows: we outline in Section 2 the existing literature about DA, with special emphasis to that based on deep neural networks; we present in Section 3 the proposed incremental methodology, as well as the underlying DA model that we consider in this work; we describe our experimental setting in Section 4, while the results are reported in Section 5; finally, the work is concluded in Section 6.

2. Background

Since the beginning of machine learning research, there exists the idea of exploiting a model beyond its use over unknown samples of the source distribution. In the literature we can find two main topics that pursue this objective: the aforementioned TL and DA strategies.

In TL, some knowledge of the model is used to solve a different classification task. For example, a pre-trained DNN model can be used as initialization [28, 17] or its feature extraction process can be considered as the basis of another classification model [37]. As a special case of TL, the DA challenge typically assumes that the classification task of the target distribution is the same (i.e., the set of labels is equal). In this work we focus on the latter case.

In a DA scenario, we can also distinguish between semi-supervised and unsupervised approaches. While semi-supervised DA considers that some labeled samples of the target distribution are available [8, 36, 25], unsupervised DA works with just unlabeled samples [18]. We will revisit in this section unsupervised DA techniques, as it is the case of the proposed approach.

Performing unsupervised DA is still considered an open problem from both theoretical and practical perspectives [5]. Most approaches consider that the key is to build a good feature representation that becomes invariant to the domain [4]. A good example is the Domain Adaptation Neural Network (DANN) proposed by Ganin et al. [13], which simultaneously learns domain-invariant features from both source and target data and discriminative features from the source domain. Following this line of research, many approaches have been proposed more recently: Virtual Adversarial Domain Adaptation (VADA) proposed by Shu et al. [27] added a penalty term to the loss function to penalize class boundaries that cross high-density feature regions. The Deep Reconstruction-Classification Networks (DRCN) [14] consists of a neural network that forces a common representation of both the source and target domains by sample reconstruction, while learning the classification task from the source samples. The Domain Separation Networks (DSN) proposed by Bousmalis et al. [6] are trained to map input representations onto both a domain-specific subspace and a domain-independent subspace, in order to improve the way that the domain-invariant features are learned. Haeusser et al. [16] proposed Associative Domain Adaptation (ADA), which is another domain-invariant feature learning approach that reinforces associations between source and target representations in an embedding space with neural networks. The Adversarial Discriminative Domain Adaptation (ADDA) strategy [33] follows the idea of Generative Adversarial Networks, along with discriminative modeling and untied weight sharing to learn domain-invariant features, while keeping a useful representation for the discriminative task. Drop to Adapt (DTA) [21] makes use of adversarial dropout to enforce discriminative domain-invariant features. Damodaran et al. [10] proposed the Deep Joint Distribution Optimal Transport (DeepJDOT) approach, which learns both the classifier and aligned data representations between the source and target domain following a single neural framework with a loss functions based on the Optimal Transport theory [34].

A different strategy to DA consists in learning how to transform features from one domain to another. Following this idea, the Subspace Alignment (SA) method [12] seeks to represent the source and target domains using subspaces modelled by eigenvectors. Then, it solves an optimization problem to align the source subspace with the target one. Also, Sun and Saenko proposed the Deep Correlation Alignment (D-CORAL) approach [31], which consists of a neural network that learns a nonlinear transformation to align correlations of layer activations from the source and target distributions.

While the methods outlined above seek for new ways to achieve the desired characteristics of a proper DA method, our proposed approach takes a different avenue. Specifically, we build upon the existing DANN approach, and we propose novel ways to improve its ability to adapt to the target domain by performing the adaptation incrementally.

3. Methodology

3.1. Preliminaries

Let X be the input space and Y be the output or label space. A classification task assumes that there exist a function that assigns a label to each possible sample of the input space. For supervised learning, the goal is to learn a hypothesis function h that models the unknown function f with the least possible error. We refer to h as label classifier. Quite often, the approach is to estimate a posterior probability P(Y |X) so that the label classifier follows a maximum a posteriori decision such that h(x) = arg max). This is the case with neural networks.

In the DA scenario, there exist two distributions over and , which are referred to as source domain and target domain, respectively. We focus on the case of unsupervised domain adaptation, for which DA is only provided with a labeled source set and a completely unlabeled target domain .

The goal of a DA algorithm is to build a label classifier for by using the information provided in both S and T.

3.2. Domain Adaptation Neural Network

Given its importance in the context of our work, we further describe here the operation of DANN, which will be considered as the backbone for our incremental approach.

DANN is based on the theory of learning from different domains discussed by [3, 2]. This suggests that the transfer of the knowledge gained from one domain to another must be based on learning features that do not allow to discriminate between the two domains (source and target) of the samples to be classified. For this, DANN learns a classification model from features that do not encode information about the domain of the sample to be classified, thus generalizing the knowledge from a source labeled domain to a target unlabeled domain.

More specifically, the proposed neural architecture includes a feature extractor module () and a ), which together build a standard feed-forward neural network that can be trained to classify an input sample x into one of the possible categories of the output space Y . The last layer of the label classifier uses a “softmax” activation, which models the posterior probability of a given input .

DANN adds a new domain classifier module () to the neural network, that classifies the domain to which the input sample x belongs. This classifier is built as a binary logistic regressor that models the probability that an input sample x comes from the source distribution (= 0 if ) or the target distribution (= 1 if ), where denotes a binary variable that indicates the domain of the sample.

The unsupervised adaptation to a target domain is achieved as follows: the domain classifier is connected to the feature extractor (which is shared with the label classifier ) through the so-called gradient reversal layer (GRL). This layer does nothing at prediction. However, while learning through back-propagation, it multiplies the gradient by a certain negative constant (). In other words, the GRL receives the gradient from the subsequent layer and multiplies it by , therefore changing its sign before passing it to the preceding layer. The idea of this operation is to force to learn generic features that do not allow discriminating the domain. In addition, since this training is carried out simultaneously with the training of (label classifier), the features must be adequate for discriminating the categories to classify, yet unbiased with respect to the input domain. According to the DA theory, this should cause to be able to correctly classify input samples regardless of their domain, given that the features from are forced to be invariant.

The DANN training simultaneously updates all modules, providing samples for both and . This can be done by using conventional mechanisms such as Stochastic Gradient Descent, from batches that include half of the examples from each domain. During the training process, the learning of pursues a trade-off between appropriate features for the classification () and inappropriate features for discriminating the domain of the input sample (). The hyper-parameter allows tuning this trade-off. The training is performed until the result converges to a saddle point, which can be found as a stationary point in the gradient update defined by the following equation:

where denotes the weights of denotes the learning rate, and and represent the loss functions for the label classifier and the domain classifier, respectively.

A graphical overview of the DANN architecture is depicted in Fig. 1.

3.3. Incremental DANN

Our main contribution within the context of DA is to propose an incremental approach to DANN (iDANN). This strategy is explained below.

Once the DANN model is trained as explained in the previous section, we can use both the feature extractor and the label classifier to predict the category of samples from both the

Figure 1: Graphical overview of the DANN architecture, consisting of three blocks: feature extractor (classifier (), and domain classifier (). The GRL circle denotes the gradient reversal layer that multiplies the gradient by a negative factor.

target domain and the source domain ())). The “softmax” activation used at the output of this classifier returns the posterior probability that the network considers x to belong to any of the classes of the output space Y .

Our main assumption is that we can use the subset of samples from the target domain for which is more confident, and then add them to the source labeled domain assuming the prediction as ground truth. These samples are thereafter considered as samples of the source domain completely. Afterwards, we can retrain the DANN network to fine-tune its weights using the new training set. This process is repeated iteratively, moving the labeled samples with greater confidence from the target domain to the source domain after each iteration. We stop when there are no more samples to move from the target domain.

The intuitive idea behind our approach is that by adding target domain information to the source (labeled) domain, the DANN learns new domain-invariant features that better fit the eventual clas-sification task, thereby becoming more accurate for other target domain samples. In each iteration, however, the task increases its complexity because it deals first with the simplest samples to classify (for which the DANN is more confident), leaving those that have more dissimilar features in the unlabeled target set. When the DANN is retrained with labeled samples that include target domain information, the domain classifier needs to be more specific. This forces the feature extraction module to forget the features that differentiate more complex samples from the target domain.

We formalize the process in Algorithm 1, where e and b represents the number of epochs and the batch size considered, respectively, denotes the number of epochs for the incremental stage of the algorithm, r indicates the size of the subset of target domain samples to select in each iteration, and is a constant that allows us to modify this size after each iteration.

In this algorithm, the samples of the target domain ( ˆB) are classified using the label classifier , and then it proceeds to select a subset ˆof size r to be moved from the target domain to the source domain. For this purpose, two selection criteria are proposed, which are described in the next section.

Once the iterative stage of the algorithm ends, the label classifier is used to classify the entire original target domain (see line 9 of Algorithm 1). This labeled target set is used to then train a neural network from scratch, which is therefore specialized in classifying target domain samples (more details in Section 3.5).

3.4. Selection policies

Below we describe in detail the two proposed policies to select samples during the iterative stage of Algorithm 1 (selection policy). One policy is directly based on the confidence level that the network provides to the prediction, while the other is based on geometric properties of the learned feature space.

3.4.1. Confidence policy

As mentioned above, the output of the label classifier uses a softmax activation. Let L denote the number of labels. Then, the standard softmax function is defined by Equation 2.

This function normalizes an L-dimensional vector z of unbounded real values into another L-dimensional vector ), for which values range between [0, 1] and add up to 1. This can be interpreted as a posterior probability over the different possible labels [7]. In order to turn these probabilities into the predicted class label, we simply take the argmax-index position of this output vector, following a Maximum a Posteriori probability criterion.

Taking advantage of this interpretation, the first policy for selecting samples to move from the target domain to the source is based on the probability provided by the label classifier , which can be seen as a measure of confidence in such classification.

With this criterion, we will keep the maximum predicted probability value for each sample of the target set among the possible labels. Then, we will order all samples based on this value—from highest to lowest—in order to select the first r samples to build the subset ˆ.

Algorithm 2 presents the algorithmic description of this process, where refers to the proba- bilistic output of the label classifier after the softmax activation, before applying argmax to select a label. The function sortr is used to sort the set in decreasing order.

Figure 2 shows an example of a set of probabilities obtained after predicting the target samples with DANN. The figure on the left shows the maximum probability values obtained for the classi-fication of each sample—without sorting—while in the figure on the right the sorted set is shown, where the threshold r has been highlighted.

3.4.2. kNN policy

As in the previous case, once the network has been trained, we use the label classifier to predict the labels of the whole target domain and then we sort them based on the confidence given by the network. However, in this case, instead of directly selecting a subset of samples according to this confidence, we will also evaluate the geometric properties of the feature space. This is performed following the k-nearest neighbor rule.

Figure 2: Example of probabilities obtained with DANN. Left: maximum probability of each sample. Right: ordered set of maximum probabilities, where the threshold r has been highlighted.

We first obtain the feature set from the source set S (using )). We then proceed to iterate the target set samples sorted by their level of confidence. Given a target sample, if the label of the k-nearest samples of the source domain matches the label assigned by the label classifier , then we will select the prototype. Otherwise, we will discard it. Therefore, samples are selected based on both the confidence provided by the DANN in their label and the extent they match the distribution of the source domain.

Algorithm 3 describes this process algorithmically. The ) function receives as parameters the query sample q, the set and the value k to be used, and yields the predicted label l and the number of samples m within its k-nearest neighbors from S that have the same label.

The idea of this policy is to select the samples of the target domain whose features are within the cluster of the source domain for the same class. An illustrative example of this condition is shown in Fig. 3 with k = 5. The example shows two labels of the source domain as green circles and blue squares. The red stars denote the target domain examples that are being evaluated to determine if they are selected. For instance, the star on the left would be selected if, and only if, the network classified it as a green circle, since its 5-nearest neighbors are green circles. Similarly, the star on the right would be selected if, and only if, the network classified it as a blue square. However, the central star would always be discarded because its 5-neighbors belong to two different classes.

If we increase k, the red star of the left would still be selected (if labeled as green circle) because it is located in the middle of the cluster. However, the red start of the right is closer to label boundaries, and so it would eventually be discarded.

3.5. Training a CNN with the new labeled target set

As described in Algorithm 1, once the iterative stage of the iDANN algorithm is completed, we use the label classifier to annotate the entire original target set T from scratch. Then, a new CNN is trained by conventional means considering the same neural architecture of )). This allows us to eventually get a neural network that is directly specialized in the classification of the

target domain.

However, we assume that some part of the iterative annotation of the target set will contain noise at the label level. To mitigate the possible efects of this noise, we consider label smoothing [32]. This is an efficient and theoretically-grounded strategy for dealing with label noise, which also makes the model less prone to overfitting.

Compared to classical one-hot output representation, label smoothing changes the construction

of the true probability to

where is a small constant (or smoothing parameter) and L is the total number of classes. Hence,

instead of minimizing cross-entropy with hard targets (0 or 1), it considers soft targets.

Figure 3: Example of sample selection using the kNN policy with k = 5. Green circles and blue squares represent samples of two different classes from the source domain. Red stars represent the samples of the target domain that are evaluated to determine whether they are chosen.

4. Experimental setup

4.1. Datasets

The proposed approach will be evaluated with two different classification tasks, that are common in the DA literature. The first one is that of digit classification, for which we consider the following datasets:

• MNIST [20]: this collection contains 28 28 images representing isolated handwritten digits.

• MNIST-M [13]: this dataset was synthetically generated by merging MNIST samples with

• Street View House Numbers (SVHN) [24]: it consists of images obtained from house numbers

• Synthetic Numbers [13]: images of digits generated using Windowsfonts, with varying

In addition, we also evaluate our approach for traffic sign classification with the following datasets:

• German Traffic Sign Recognition Benchmark (GTSRB) [30]: this dataset contains images of

• Synthetic Signs [23]: this dataset was synthetically generated by taking common street signs

Table 1 summarizes the information of our evaluation corpora, including the domain to which they belong, the number of labels, the image resolution, the number of samples, and the type of image indicating whether they are in color or grayscale format. Figure 4 shows some random examples from each of these datasets.

Table 1: Description of the datasets used in the experimentation.

Figure 4: Random examples from the datasets used in experimentation.

The images of each classification task were rescaled to the same size: the digits to 28 28 pixels, and the traffic signs to 40 40 pixels. Concerning the pre-processing of the input data, the images were normalized within the range [0, 1]. The train and test partitions were those proposed by the authors of each dataset, in order to ensure a fair comparison with the results obtained in the literature.

4.2. CNN architectures

To evaluate the proposed methodology, the same three CNN architectures used in the original DANN paper have been tested. Table 2 reports a summary of these architectures.

As the authors pointed out, these topologies are not necessarily optimal and better adaptation performance might be attained if they were tweaked. However, we chose to keep the same configuration to make a fairer comparison.

As the activation function, a Rectifier Linear Unit (ReLU) was used for each convolution layer and fully-connected layer, except for the the output layers. L neurons with softmax activation were used as output of the label classifier. For the output of the domain classifier, a single neuron with a logistic (sigmoid) activation function was used to discriminate between two possible categories (source domain or target domain).

Model 1 was used for all the experiments with digit datasets, except those using SVHN. This topology is inspired by the classical LeNet-5 architecture [20]. Model 2 was used to evaluate the experiments with digits that include SVHN. This architecture is inspired by [29]. Finally, Model 3 was used for the experiments with traffic sings. In this case, the single-CNN baseline obtained from [9] was used.

Table 2: CNN network configurations considered. Notation: Conv(f, w, h) stands for a layer with f convolution operators with a kernel of size ) stands for the max-pooling operator of dimensions pixels—with a 2 2 in all cases—and FC(n) represents a fully-connected layer of n neurons. In the output layer of the label classifier a fully-connected layer of L neurons with softmax activation is added, where L denotes the number of categories of the dataset at issue.

4.3. Training stage

To ensure a fair comparison with the original DANN algorithm, we set the same training config-uration: Stochastic Gradient Descent with a learning rate of 0.01, decay of 10, and momentum of 0.9, as well as the same number of epochs (300).

For the iterative stage of iDANN, we set to 25. This value was determined empirically. We observed that it allowed the network weights to be tuned with the new knowledge without taking too long to perform a new iteration. In each training iteration, the greater improvement occurs in the first epochs, after which the accuracy of the label classifier is stabilized.

Concerning the size of the subset to select from the target set (r), we decided to consider a percentage of the remaining samples rather than a fixed value. Initially, we set it to 5%, and it was increased after each iteration by 150% (= 1.5) until all target domain samples are selected. This value was also obtained empirically, by observing better results and more stable training if few samples are added in the first iterations.

Different values for both the batch size b and are evaluated, as will be reported in the experimentation section.

5. Results

In this section we evaluate the proposed method using the datasets, topologies, and settings described in Section 4. We first study the different hyper-parameterization, as well as the two prototype selection policies proposed. Next we show the performance results obtained over the datasets and, finally, we compare with other state-of-the-art methods.

5.1. Hyper-parameters evaluation

In this section, we start by analyzing the influence of the batch size and the value of on the performance of the method, as these hyper-parameters are those that affect the training stage the most. For this, we consider the batch sizes of {16, 32, 64, 128, 256, 512} and of 101010. This means that each result comes from a total of 336 experiments (14 combinations of dataset pairs 6 batch values 4 values of ). The rest of hyper-parameters are set as indicated in Section 4.3, that is: e = 300 (as in the original DANN paper), = 25, and r = 5%, which were empirically determined to favor stable training and obtain good results. In addition, we evaluate the results using only the prototype selection policy based on network’s confidence, as next section will be devoted to comparing the two proposed policies with the best hyper-parameters found.

As we are dealing with an unsupervised method, we mainly focus on analyzing the trend when modifying these parameters. Table 3 shows the results of this experiment, where each figure represents the average of the 14 possible combinations of source and target domain of the datasets considered and all the iterations performed by the iDANN algorithm.

The first thing to remark is that some of the hyper-parameter combinations evaluated in these experiments do not converge (h = 32/64, = 10for traffic signs). This could be detected automatically, since the accuracy is abruptly reduced to a value approximately equal to a random guess, for both the training set and evaluation set and for both the source and the target domain. However, these results have been kept in order to observe the general trend of the method and how these parameters affect it.

It can also be observed that the best performance is achieved with = 10in the two types of corpora, while a batch size of 64 and 32 are better for the digits and traffic signs, respectively. On average, better results are reported with low values and batch sizes between 32 and 256. When is greater (e.g., 10), the training becomes highly unstable, especially if combined with small batch sizes.

Table 3: Influence of hyper-parameter setting on the performance (accuracy, in %) of the iDANN algorithm. Figures report classification accuracy over the target set, averaging with the respective datasets and iterations of the algorithm.

Next, we analyze the influence of these parameters with respect to the iteration of the iDANN algorithm. Table 4 shows the average result obtained by grouping all combinations of datasets (numbers and traffic signs) and hyper-parameters considered. As in the previous analysis, better results are also observed for low values and batch sizes between 32 and 256 (see column ‘Avg.’). In this case, it can also be seen that low values are more appropriate in the first iterations, whereas greater values are more appropriate in the last iterations. It might happen that a more stable way of proceeding (low ) is preferred in the first iterations, even at the cost of being less aggressive in the domain adaptation. Therefore, we propose to start with a low and increase its value gradually (+10after each epoch).

Additionally, it is observed that each iteration of the algorithm leads to a better result than the previous one (except for 10), yielding the higher leap in the first iterations and reducing this difference towards the last iterations. Including all cases, the results improve by 5.19% between the first and the last iteration, on average. If we ignore those settings that do not converge, the average improvement obtained increases to 10.29%.

5.2. Model analysis

We now evaluate the effect of the incremental training process on the domain adaptation approach. Figure 5 shows the evolution of the accuracy obtained over the target test set during the training process for the case Syn Numbers MNIST-M combination of datasets, with a batch size of 64 and = 10. The training epochs are represented with the horizontal axis, while the

Table 4: Influence of hyper-parameter setting on the performance of the iDANN algorithm with respect to number of iterations. Figures report classification accuracy over the target set, averaging with the respective datasets.

iterations (i.e., when new training samples are added) are highlighted with blue lines and marked above. It can be observed that in the first iteration (spanning 300 epochs), the accuracy slowly improves until around 150 epochs, after which becomes stable. In the subsequent iterations, the accuracy further improves, especially during iterations 2, 3 and 4. Then, the performance increase is gradually reduced until it is hardly noticeable.

To provide further analysis, we also examine the representation space learned by the network in each of these iterations, using the same combination of datasets and training parameters. We use the t-Distributed Stochastic Neighbor Embedding (t-SNE) [22] projection to visualize the samples according to their representation by the last hidden layer of the label predictor. Figure 6 shows a visualization of the features learned after each of the iterations, where the red color represents the target domain, the blue color represents the source domain, and the green color represents the set ˆ(selected samples) using the confidence policy. This representation reveals 10 well-defined clusters—the 10 possible classes of the datasets considered for this analysis—around an additional central cluster. This central cluster groups the samples of the target domain (red color) whose representation does not correspond to any of the existing classes yet. This cluster would

Figure 5: Accuracy curve with respect to training epochs and iterations of the incremental approach.

therefore correspond to target samples whose representation has not been correctly mapped onto any of the source domain classes. Iteratively, the method is selecting samples (green points) of the target domain and moving them to the source domain. In the first iterations—until the 6th one, approximately—the method selects only samples that are well located in one of the source domain clusters (that is, those samples for which the network is more confident). Due to this process, the size of the central cluster is reduced. It is important to emphasize that this cluster becomes smaller although no samples out of it are selected, which indicates that the network is learning to better map those samples because of the selected samples of previous iterations. Towards the last iterations, the method begins to select the most complex samples that are still in this additional cluster. In Fig. 6(*) (which is the same as the Fig. 6(9) but highlighting each class with a different color), the additional cluster of target samples still appears without being mapped, yet with a very small size. This cluster contains almost all the classification errors, having mapped only some isolated prototypes to the actual class clusters incorrectly.

5.3. kNN policy

We compare in this section the two policies proposed for selecting the set of target prototypes ˆto be added to the source domain. To this end, we evaluate whether the label assigned to each of these prototypes is correct. In this case, we make use of the ground-truth of the target domain just for the sake of analysis.

We show in Fig. 7 a dotted line with the performance of the confidence policy, which may serve as a baseline here, and eight results for the kNN policy with varying k values. As in the previous experiments, the reported figures are obtained for all combinations of datasets and hyper-parameters considered.

It is observed that, as the number of iterations of the algorithm increases, the accuracy of the additional labels assigned to the selected prototypes decreases. However, the kNN policy generally

Figure 6: t-SNE representation of the feature space of the neural network (from the last hidden layer) with respect to the iteration of the approach. The red color represents the target domain, the blue color represents the source domain, and the green color represents the set ˆ. Last representation (*) depicts each sample according to its actual category by using different colors.

obtains better results from the first iteration, obtaining on average (for all iterations) an improvement of 6.36 % with respect to confidence policy. This improvement is significantly greater in the last iterations, obtaining an increase up to 24.85 % between the result of the confidence policy and the best result obtained with kNN policy.

The role of the parameter k is also illustrated in Fig. 7, where better results are attained as k is increased. It is shown that the impact of this parameter is more noticeable in the last iterations, where a difference of up to 8.91 % is obtained between k = 3 and k = 150.

Because the kNN selection policy worked better, this policy was used in all following experiments.

5.4. Accuracy on target

In this section, we evaluate the final result obtained through the proposed iDANN method with the best combination of hyper-parameters previously obtained for each of the dataset pairs. We will

Figure 7: Accuracy of the labels assigned to the selected prototypes set ( ˆ) from the target domain according to the selection policy.

compare this result with that obtained by the original DANN method in order to check the goodness of the incremental approach.

Table 5 reports the results of the experiment, where rows indicate the dataset pairs (source and target) and columns represent the DA method. Concerning iDANN, we report two results: the accuracy of the labels assigned during the iterative process itself (1), as well as the accuracy using the CNN trained from scratch using only the target samples (once all the target samples have been assigned a label). In addition to DANN and iDANN methods, we have also added the results obtained with the neural networks trained just with the source set (‘CNN Src.’), as well as the results obtained with the neural neural networks directly trained with the target set (‘CNN Tgt.’). The former serves as baseline, to better assess the impact of the domain-adaptation mechanisms, while the second represents the upper bound of accuracy.

The first thing to remark is that the worst results obtained by the baseline (‘CNN Src.’) come from the combinations of single-digit datasets (MNIST, MNIST-M) as source and complex digit datasets (SVHN, Syn Numbers) as target. Furthermore, the best results from the baseline are reported for combinations where the source and target are similar (MNIST-M MNIST, SVHN Syn Numbers).

The original DANN method outperforms the results obtained by using the baseline network (‘CNN Src.’) by 10.7 %, on average, obtaining the most significant improvement for the combinations of Syn Numbers MNIST (improving by 29.31 %). It is also noticeable the impact of DANN when the dataset pair consists of similar tasks with the most complex one as target, such as MNIST MNIST-M—improvement of 23 %—or Syn Signs GTSRB—improvement of 15.49 %. These results for DANN have been obtained using our own implementation, following the details given in the original paper. We observed that the accuracy matches approximately that reported by the authors (for the 4 combinations they considered), and so we assume that our implementation is correct. We can therefore faithfully report the performance in all source-target combinations of our experiments.

Concerning the labels assigned during the proposed incremental approach iDANN (Acc.), the first thing to note is its improvement with respect to the underlying DANN method, which is around 16 %, on average. In the best case, this improvement reaches values around 33 %, 35 % and 36 % for the Syn Numbers MNIST-M, MNIST-M Syn Numbers, and MNIST Syn Numbers pairs, respectively. This confirms the goodness of our strategy, which uses the same domain adaptation method in a novel way.

Finally, if the CNN is trained from scratch with the target labels that have been automatically assigned by the iDANN (Acc.), it can further improve the results up to 1.64 %, on average, and up to 5.5 % in the best case (MNIST-M Syn Numbers). It should be noted that in some specific combinations, this approach slightly outperforms the CNN trained with the correct target labels (for example, MNIST-M MNIST or Syn Numbers SVHN). It might happen that the incorrectly assigned labels of the iDANN process act as a regularizer that alleviates some overfitting.

Table 5: Accuracy (%) over the the target dataset for different strategies: ‘CNN Src.’ indicates a neural network

trained only with source samples; ‘DANN’ denotes the original DANN strategy; ‘iDANN’ yields two results from the incremental strategy: refers to the accuracy of the labels assigned to the target samples during the iterative process, while refers to the classification after training a new CNN from scratch using the labels assigned to the target samples; and ‘CNN Tgt.’ denotes a CNN trained using the ground truth of the target samples.

5.5. Comparison with the state of the art

To conclude the results section, we present below a comparison with other domain adaptation strategies from the state of the art. In these works, not all possible combinations of source-target pairs are considered but a few combinations of them. We show in Table 6 the results reported in the literature, along with the results obtained by our proposal (iDANN). A brief description of the competing methods was provided in Section 2. Readers are referred to the corresponding references for details.

These results reveal that our method yields the best performance in 5 out of 7 source-target pairs. The performance of iDANN is especially remarkable in the case of MNIST Syn Num, where the improvement reaches around 30 % compared to the literature. For the case in which our proposal does not attain the best result, we observe a dissimilar performance: it is still very competitive for the MNIST SVHN pair, whereas it is outperformed for the SVHN MNIST pair. When all the results are good, the improvement is relative, but when there is enough margin, the improvement is quite remarkable (as in the case of MNIST Syn Num).

Table 6: Comparison of accuracy (%) between state-of-the-art DA approaches and iDANN. The first two rows denote the source and target dataset, respectively. The best result for each combination (column) is highlighted in bold typeface. Empty cells indicate that the result is not reported in the literature.

Furthermore, it should be noted that many of the compared methods propose specific CNN architecture for each combination of datasets and/or focus on optimizing the result for a particular combination, such as DTA or DeepJDOT. In our case, we utilized the topologies proposed in the original DANN paper, so it could be assumed that if we pursue a specific architecture adapted to each of the source-target pairs, our results will probably improve.

6. Conclusions and Future Work

This paper proposes an incremental strategy to the problem of domain adaptation with artificial neural networks. Our approach is built upon an existing domain adaptation approach, combined with a heuristic that, in each iteration, decides which prototypes of the target set can be added to the training set by considering the label provided by the neural network. To this end, two selection policies have been proposed: one directly based on the confidence given by the network to the prediction and another based on geometric properties of the learned feature space. We observed that the latter reported a better performance, especially in the last iterations of the algorithm. In addition, we consider a final stage in which the labeled target set is used to train a new neural network with label smoothing.

Our experiments were performed on various corpora and using several configurations of the neural network. From the results, we conclude that the incremental approach outperforms the underlying DANN model, as well as other state-of-the-art methods. It is interesting to note that, in some cases, the iDANN approach improves the result obtained with the CNN trained directly with the ground-truth data of the target set, which could indicate that the incremental process also serves as a regularizer that leads to greater robustness. Furthermore, unlike the classic DANN, our approach improves results when domains are similar and helps keeping the accuracy for the source domain. We also observed a greater training stability and less dependence on the hyper-parameters set.

As future work, a primary objective would be to establish a well-principled stop criterion that allows us to detect when the prediction over the target samples is not reliable. In addition, we want to extend the experiments to other types of input types (such as sequences), as well as to study the behavior of the incremental strategy when the underlying DA method is different—given that there currently exist several architectures for this challenge. Note that our incremental approach is independent of the underlying DA model considered, and so it could be adopted as a generic strategy that might improve to the same extent as the underlying DA algorithm improves. Other avenues to further explore this proposal is to evaluate more neural network architectures, as well as adding data augmentation to the learning process.

References

[1] Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 898–916. doi:10.1109/TPAMI.2010.161.

[2] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79, 151–175. doi:10.1007/ s10994-009-5152-4.

[3] Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2006). Analysis of representations for domain adaptation. In NIPS (pp. 137–144).

[4] Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In B. Sch¨olkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19 (NIPS) (pp. 137–144). MIT Press.

[5] Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., & Krishnan, D. (2017). Unsupervised pixel-level domain adaptation with generative adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 (pp. 95–104).

[6] Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 343–351). Curran Associates, Inc.

[7] Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Souli´e, & J. H´erault (Eds.), Neurocomputing (pp. 227–236). Berlin, Heidelberg: Springer Berlin Heidelberg.

[8] Cheng, L., & Pan, S. J. (2014). Semi-supervised domain adaptation on manifolds. IEEE Transactions on Neural Networks and Learning Systems, 25, 2240–2249. doi:10.1109/TNNLS. 2014.2308325.

[9] Cire¸san, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333 – 338. Selected Papers from IJCNN 2011.

[10] Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 (pp. 467–483). Cham: Springer International Publishing.

[11] Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification, 2nd Edition. Wiley.

[12] Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In 2013 IEEE International Conference on Computer Vision (ICCV) (pp. 2960–2967).

[13] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., March, M., & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17, 1–35.

[14] Ghifary, M., Kleijn, W. B., Zhang, M., Balduzzi, D., & Li, W. (2016). Deep reconstruction- classification networks for unsupervised domain adaptation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 597–613). Cham: Springer International Publishing.

[15] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www. deeplearningbook.org.

[16] Haeusser, P., Frerix, T., Mordvintsev, A., & Cremers, D. (2017). Associative domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).

[17] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (pp. 770–778).

[18] Kouw, W. M., & Loog, M. (2019). A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, (pp. 1–1).

[19] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.

[20] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. In Proc. of the IEEE (pp. 2278–2324). volume 86.

[21] Lee, S., Kim, D., Kim, N., & Jeong, S.-G. (2019). Drop to adapt: Learning discriminative fea- tures for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV).

[22] van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

[23] Moiseev, B., Konev, A., Chigorin, A., & Konushin, A. (2013). Evaluation of traffic sign recognition methods trained on synthetically generated data. In J. Blanc-Talon, A. Kasinski, W. Philips, D. Popescu, & P. Scheunders (Eds.), Advanced Concepts for Intelligent Vision Systems (pp. 576–583). Cham: Springer International Publishing.

[24] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.

[25] Saito, K., Kim, D., Sclaroff, S., Darrell, T., & Saenko, K. (2019). Semi-supervised domain adaptation via minimax entropy. In The IEEE International Conference on Computer Vision (ICCV).

[26] Shao, L., Zhu, F., & Li, X. (2014). Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26, 1019–1034.

[27] Shu, R., Bui, H., Narui, H., & Ermon, S. (2018). A DIRT-t approach to unsupervised domain adaptation. In International Conference on Learning Representations.

[28] Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

[29] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.

[30] Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmark- ing machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323 – 332. doi:https://doi.org/10.1016/j.neunet.2012.02.016.

[31] Sun, B., & Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In G. Hua, & H. J´egou (Eds.), Computer Vision – ECCV 2016 Workshops (pp. 443–450). Cham: Springer International Publishing.

[32] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826). doi:10.1109/CVPR.2016.308.

[33] Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 2962–2971).

[34] Villani, C. (2009). Optimal Transport Old and New. Springer.

[35] Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135 – 153.

[36] Yao, T., Pan, Y., Ngo, C.-W., Li, H., & Mei, T. (2015). Semi-supervised domain adaptation with subspace learning for visual recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2142–2150).

[37] Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320–3328).