Improving Domain-Adapted Sentiment Classification by Deep Adversarial Mutual Learning
2020·arXiv
Abstract
Domain-adapted sentiment classification refers to training on a labeled source domain in order to accurately infer document-level sentiment on an unlabeled target domain. Most existing models involve a feature extractor and a sentiment classifier, where the feature extractor works towards learning domain-invariant features from both domains, and the sentiment classifier is trained only on the source domain to guide the feature extractor. As such, they lack a mechanism to use the sentiment polarity lying in the target domain. To improve domain-adapted sentiment classification by learning sentiment from the target domain as well, we devise a novel deep adversarial mutual learning approach involving two groups of feature extractors, domain discriminators, sentiment classifiers, and label probers. The domain discriminators enable the feature extractors to obtain domain-invariant features. Meanwhile, the label prober in each group explores the document sentiment polarity of the target domain through the sentiment predictions generated by the classifier in the peer group, and guides the learning of the feature extractor in its own group. The proposed approach achieves the mutual learning of the two groups in an end-to-end manner. Experiments on multiple public datasets indicate that our method achieves state-of-the-art performance, validating the effectiveness of mutual learning through label probers.

Domain-adapted sentiment classification aims at training on a labeled source domain in order to accurately infer document-level sentiment on an unlabeled target domain. It is a natural intersection of research on sentiment classification (Liu 2012) and unsupervised domain adaptation (Ben-David et al. 2010), and has attracted much attention with the flourishing of online review platforms such as Amazon and Yelp. On the one hand, the total number of reviews is huge, which brings opportunities to pursue effective sentiment classification models (Tang, Qin, and Liu 2015a). On the other hand, the large number of review domains (e.g., product categories in Amazon) makes it intractable to manually annotate enough data in each domain for training domain-specific models. Developing automatically domain-adapted methods is therefore imperative in this area. It is worth noting that this task is more challenging than other cross-domain sentiment classification tasks that require a small amount of labeled data in target domains (Peng et al. 2018).

A straightforward solution to the task is to directly apply a sentiment classification model trained on a source domain to a target domain. Such a model obtains no training guidance from the target domain, which is not ideal because it ignores the semantic gap between different domains. Taking the Movie and Food domains as examples, the first domain contains common sentiment words such as “romantic”, “violent”, and “dramatic”. By contrast, the second includes opinion words like “tasty” and “delicious”. The potentially small intersection of domain-dependent sentiment words indicates that there is considerable room for improving on the naive solution. As such, it is important to build a sentiment classification model that is explicitly adapted to the target domain.

Existing relevant studies can be divided into two categories: two-stage approaches (Blitzer, Dredze, and Pereira 2007; Glorot, Bordes, and Bengio 2011; Ziser and Reichart 2018; Ziser and Reichart 2019) and end-to-end models (Ganin et al. 2016; He et al. 2018; Qu et al. 2019). The two-stage approaches typically construct unsupervised feature extractors or manually select pivot features across domains in the first stage. In the second stage, a sentiment classifier is trained on the labeled source domain. Yet the first stage cannot be directly guided by the ground truth, and the selection of pivot features is empirical and costly. On the other hand, benefiting from advanced learning techniques such as adversarial learning (Ganin and Lempitsky 2015) and maximum mean discrepancy (Gretton et al. 2012a), end-to-end models overcome the above issues by training domain-invariant feature extractors and sentiment classifiers holistically, without relying on pivot features. However, despite their promising results, a major limitation remains in most end-to-end models: data from target domains is not fully utilized. That is, their sentiment classifiers and feature extractors overlook the sentiment polarity lying in the review text of target domains. One relevant study (He et al. 2018) obtains pseudo-labels in target domains through a self-ensemble bootstrapping technique to train its sentiment classifier and feature extractor. However, the pseudo-labels are generated asynchronously by an earlier version of the sentiment classifier, which is weaker than the current version and possibly limits the effectiveness of training.

In this paper, we propose a novel learning approach, named deep adversarial mutual learning (DAML). DAML improves domain-adapted sentiment classification by learning sentiment polarity from an unlabeled target domain. It is partially inspired by the recently proposed deep mutual learning (DML) (Zhang et al. 2018) for supervised single-domain tasks, where two classification models teach each other through their synchronously inferred label distributions, which complement the true labels. The rationale behind mutual learning is that the models can learn collaboratively and transfer their own learned knowledge to each other throughout the training process, which makes them robust to noise in the data. We extend mutual learning to the unsupervised cross-domain sentiment classification scenario through DAML. It involves two groups of feature extractors, domain discriminators, sentiment classifiers, and label probers. Different from standard mutual learning, we leverage the label prober in each group to learn the pseudo sentiment label distributions of documents in the target domain, which are generated by the classifier from the other group. Meanwhile, the label probers guide the learning of the feature extractors in their own groups by gradient back-propagation. In this manner, probers act as bridges between classifiers and extractors, ensuring that sentiment information in the target domain is used by them. Another advantage of probers is that they free classifiers from aligning with each other, which is required by standard mutual learning and harms performance in a fully unsupervised scenario (see Figure 3a). In addition, we leverage gradient reversal layers (GRL) (Ganin et al. 2016) to learn domain-invariant feature extractors with domain discriminators. The two groups can be regarded as two models that are mutually learned in an end-to-end manner to improve generalization on the target domain.
We summarize the main contributions of this work as follows:

We address the learning of sentiment polarity in the target domain, which is largely overlooked by previous studies. We propose DAML which combines the merits of adversarial learning and mutual learning, as well as the introduced label probers which learn from classifiers and guide the learning of extractors. To our knowledge, this is the first study of adapting mutual learning for domain-adapted sentiment classification.

We evaluate DAML on multiple datasets with different origins. Extensive experiments show that DAML achieves state-of-the-art performance. We also validate that mutual learning with label probers yields better results than directly applying standard mutual learning. As a byproduct, we will release the source code of our approach.

In this section, we briefly discuss domain adaptation, sentiment classification, and deep mutual learning, to highlight the key differences of this research.

Domain adaptation. Domain adaptation has long been an attractive research topic because of real applications in which labeled data is available only in a source domain (Pan and Yang 2009). It is widely recognized that the distribution gap between the source domain and the target domain is the fundamental challenge. To address this, early instance-based approaches reweight each source example, following the idea of importance sampling, to match the target data distribution (Huang et al. 2007). Recent advances in learning domain-invariant feature representations include adversarial learning (Ganin and Lempitsky 2015), which trains an extractor to fool a domain classifier, and maximum mean discrepancy (MMD) (Gretton et al. 2012b), which measures the degree of domain shift and minimizes it. In the specific scenario of domain-adapted sentiment classification, pivot-based methods (Blitzer, Dredze, and Pereira 2007; Yu and Jiang 2016; Ziser and Reichart 2018; Ziser and Reichart 2019) first heuristically select domain-shared pivot features and then use them to learn the correspondence of domain-specific sentiment words. However, such an empirical selection is costly, and its inaccuracy propagates to the subsequent learning step. Other studies, such as (Glorot, Bordes, and Bengio 2011), adopt a similar two-stage procedure: they learn an unsupervised domain-invariant feature extractor (e.g., a stacked denoising auto-encoder (Vincent et al. 2008)) in the first stage and then train a sentiment classifier that takes the obtained features as input. Because the first stage of these methods is not guided by sentiment labels, there is potential to improve them by learning in an end-to-end manner. Some recent approaches (Ganin et al. 2016; He et al. 2018; Qu et al. 2019) employ adversarial learning or MMD to achieve end-to-end learning. However, as discussed previously, they either ignore the sentiment polarity lying in target domains or do not leverage it effectively. In comparison, our work proposes the novel DAML approach to address this issue, which is the most important contribution of this paper.

Sentiment classification. Apart from the aforementioned studies on domain-adapted sentiment classification, a large amount of effort has been devoted to developing well-performing single-domain sentiment classification models (Zhang and Wang 2015), especially deep learning based ones, including recursive neural networks (Socher et al. 2013), convolutional neural networks (Kim 2014), and recurrent neural networks (Tang, Qin, and Liu 2015a). To our knowledge, the hierarchical attention network (HAN) (Yang et al. 2016) is the state-of-the-art sentiment classification model, according to the performance comparisons shown in (Chen et al. 2016; Wu et al. 2018). As such, our work takes the main part of HAN, except the last feedforward output layer, as our feature extractor.

Deep mutual learning. Deep mutual learning (Zhang et al. 2018) is a recently developed learning approach in the context of model distillation (Hinton, Vinyals, and Dean 2015), but it does not use a larger teacher model to guide the training process. Apart from learning to fit the true labels, DML makes classifiers simultaneously learn from each other by mimicking each other’s inferred label distributions, and it has found applications in other tasks (Kanaci et al. 2019; Wu et al. 2019). However, no existing study has investigated its power in domain-adapted classification tasks. Later we show in the experiments that a naive combination of adversarial learning with standard mutual learning does not improve performance on this task.

In domain-adapted sentiment classification, we are given a labeled source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with $n_s$ documents and an unlabeled target domain $D_t = \{x_i^t\}_{i=n_s+1}^{n_s+n_t}$ with $n_t$ documents, where $x$ is the original word sequence representation of a document and $y$ is the corresponding one-hot encoding of its sentiment label. The goal of the task is to learn a well-performing sentiment classification model that is adapted to the target domain by leveraging the examples from $D_s$ and $D_t$. To this end, we present a deep adversarial mutual learning approach with the architecture shown in Figure 1.

Approach Overview: The proposed approach consists of two groups of feature extractors, domain discriminators, sentiment classifiers, and label probers. The two groups have exactly the same structure but their own parameters, and each group is associated with both the source and target domains. The feature extractors take documents from the source domain and the target domain as input and produce document representations, which are fed into the other components in each group. Domain discriminators, together with gradient reversal layers, ensure that the extractors obtain domain-invariant features under the guidance of the domain discriminative loss $L_{DOM}$. The prober in each group is associated with the classifier in the other group through the optimization of the mutual learning based loss $L_{ML}$. In addition, each classifier also corresponds to a classification loss $L_{CLS}$ w.r.t. the labeled source domain.

Feature Extraction

We use HAN as the feature extractor to produce the representations of sentiment documents. HAN adopts a hierarchical structure that mirrors the word-level and sentence-level structure of documents. Assume the $i$-th document from either $D_s$ or $D_t$ contains $l_i$ sentences and each sentence $j \in \{1, \cdots, l_i\}$ has $m_{i,j}$ words. HAN first leverages a bidirectional GRU (Bahdanau, Cho, and Bengio 2015) to get contextualized word-level representations as follows,

image

image

Figure 1: The architecture of deep adversarial mutual learning. The red lines and blue lines indicate the information flow from the source domain and the target domain, respectively, while the black lines correspond to both domains. $L_{DOM}$, $L_{ML}$, and $L_{CLS}$ denote the domain discriminative loss, the mutual learning based loss, and the source domain sentiment classification loss, respectively.

To capture the importance of each word in constructing a sentence vector $s_{i,j}$, a word-level attention mechanism is used, defined as follows:

image

where $\alpha_{i,j,k}$ quantifies the importance weight of the $k$-th word.

Afterwards, another bidirectional GRU is used to model $s_{i,j}$ and obtain contextualized sentence representations $h_{i,j}$ ($j \in \{1, \cdots, l_i\}$). Given this, the document-level representation $d_i$ is obtained through a sentence-level attention mechanism in a similar fashion to Eq. 2, i.e., $d_i = \sum_j \alpha_{i,j} h_{i,j}$.

To summarize the procedure of feature extraction, we denote it as $d_i = FE(x_i; \Theta_{FE})$, where $\Theta_{FE}$ covers all the parameters involved in the above computations.
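The attention-based pooling step $d_i = \sum_j \alpha_{i,j} h_{i,j}$ can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the scoring function is simplified to a dot product with a fixed query vector (the name `context` is hypothetical), whereas HAN learns its scoring parameters, and the GRU encoders are omitted entirely.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden, context):
    """Weighted sum of hidden vectors.

    hidden:  list of vectors, one per word or sentence
    context: query vector used for scoring (assumed fixed here;
             in a trained model it would be learned)
    """
    # Simplified scoring: dot product between each hidden vector and the query.
    scores = [sum(h_k * c_k for h_k, c_k in zip(h, context)) for h in hidden]
    alpha = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(dim)]
    return pooled, alpha

# Three toy sentence vectors h_{i,j} pooled into one document vector d_i.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_vector, weights = attention_pool(sentence_vectors, context=[0.5, 0.5])
```

The same function can serve at both levels of the hierarchy: once over word vectors to build each sentence vector, and once over sentence vectors to build the document vector.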

Sentiment Classification We define a sentiment classifier $C(d_i; \Theta_C)$, which contains one or several perceptron layers and is given as follows:

image

where $W_c$ and $b_c$ are learnable parameters and the output dimension equals the number of sentiment labels. Given this, the classification loss on the labeled source domain is based on cross-entropy, given as:

image
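As an illustration of the classifier and its source-domain loss, the sketch below implements a single-layer softmax classifier and a cross-entropy loss summed over documents. The single layer, the toy parameter values, and the choice of summing rather than averaging are assumptions made for this example; the paper allows several layers and its exact normalization is not recoverable from the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(doc_vector, weights, bias):
    """Single-layer classifier: softmax(W d + b)."""
    logits = [sum(w * x for w, x in zip(row, doc_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def cross_entropy(one_hot_labels, predictions):
    """Sum over documents of -sum_c y_c * log(p_c)."""
    loss = 0.0
    for y, p in zip(one_hot_labels, predictions):
        loss -= sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p) if y_c > 0)
    return loss

# Toy example: one 2-dimensional document vector, three sentiment labels.
W_c = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # illustrative values only
b_c = [0.0, 0.0, 0.0]
probs = classify([2.0, 0.0], W_c, b_c)
loss = cross_entropy([[1, 0, 0]], [probs])
```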

Domain Adaptation To empower the feature extractors with the ability of learning domain-invariant representations, we consider the adversarial learning method. In particular, we introduce a domain discriminator D, which is trying to figure out from which domain a document vector comes. On the contrary, the feature extractor FE aims to fool D.

We define $D$ as a multi-layer perceptron with parameters $\Theta_D$, which takes the previously obtained document vector as input and outputs a scalar indicating the probability that the document comes from the source domain. For document $x_i$, we set $z_i = 1$ if it belongs to the source domain, and $z_i = 0$ if it belongs to the target domain. Based on this, a min-max game is played to optimize the parameters $\Theta_{FE}$ and $\Theta_D$ as follows:

image

where $L_{DOM} = -\sum_{i=1}^{n_s+n_t} \left[ z_i \log D(d_i) + (1 - z_i) \log\left(1 - D(d_i)\right) \right]$.

As suggested in (Ganin et al. 2016), this min-max game is implemented by a gradient reversal layer, which turns $\frac{\partial L_{DOM}}{\partial \Theta_{FE}}$ into $-\eta \frac{\partial L_{DOM}}{\partial \Theta_{FE}}$ during gradient back-propagation, where $\eta$ is a controllable hyper-parameter. Once $FE$ is able to prevent the discriminator from distinguishing whether data comes from the source or the target domain, it is considered to have been trained to retain, to some extent, only domain-independent information.
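The domain loss and the effect of the gradient reversal layer can be traced by hand in a small sketch. It assumes a linear logistic discriminator $D(d) = \sigma(w \cdot d)$ instead of the multi-layer perceptron used in the paper, and it computes the gradient with respect to the document vector $d$ (which then flows on into $\Theta_{FE}$), so it is an illustration of the mechanism rather than the actual model.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def domain_loss(doc_vectors, domain_flags, disc_weights):
    """L_DOM = -sum_i [ z_i log D(d_i) + (1 - z_i) log(1 - D(d_i)) ]."""
    loss = 0.0
    for d, z in zip(doc_vectors, domain_flags):
        p = sigmoid(sum(w * x for w, x in zip(disc_weights, d)))
        loss -= z * math.log(p) + (1 - z) * math.log(1.0 - p)
    return loss

def grad_wrt_feature(d, z, disc_weights):
    """Gradient of one document's domain loss w.r.t. its feature vector d.

    For the logistic discriminator assumed here this is (D(d) - z) * w.
    """
    p = sigmoid(sum(w * x for w, x in zip(disc_weights, d)))
    return [(p - z) * w for w in disc_weights]

def gradient_reversal(grad, eta):
    """Backward pass of a gradient reversal layer: scale the gradient by -eta."""
    return [-eta * g for g in grad]

# One source-domain document (z = 1) with an undecided discriminator.
w = [0.5, 0.5]                 # illustrative discriminator weights
d = [1.0, -1.0]
loss = domain_loss([d], [1], w)
grad = grad_wrt_feature(d, 1, w)
reversed_grad = gradient_reversal(grad, eta=0.005)
```

The discriminator follows `grad`-style updates that lower the loss, while the feature extractor receives `reversed_grad`, which pushes the features towards raising it.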

Integration with Mutual Learning So far, we are able to utilize the target domain from the aspect of modeling its data distribution. However, it is still far from satisfactory since the involved sentiment information is not explicitly captured. A natural solution is to generate pseudo-labels for the target domain to guide the learning of the sentiment classification model.

Motivated by mutual learning (Zhang et al. 2018) in supervised single-domain tasks, we first consider a direct way of incorporating standard mutual learning (sML) into our framework. We extend the aforementioned feature extractor, sentiment classifier, and domain discriminator to form two groups (models), i.e., $G_1 = (FE_1, C_1, D_1; \Theta_{G_1})$ and $G_2 = (FE_2, C_2, D_2; \Theta_{G_2})$, where $\Theta_{G_1}$ and $\Theta_{G_2}$ cover all the parameters of the groups, i.e., $\Theta_{G_1} = \{\Theta_{FE_1}, \Theta_{C_1}, \Theta_{D_1}\}$ and $\Theta_{G_2} = \{\Theta_{FE_2}, \Theta_{C_2}, \Theta_{D_2}\}$. Correspondingly, they have the losses $L_{CLS_1}$ and $L_{DOM_1}$, and $L_{CLS_2}$ and $L_{DOM_2}$, respectively. We then define the loss of standard mutual learning in our problem setting as follows:

image

where $D_{KL}$ is the Kullback-Leibler (KL) divergence. Moreover, the two groups have their own hybrid objectives, which are defined as follows:

image

where $\lambda_D$ and $\lambda_M$ are parameters that control the relative influence of the losses in the overall objectives. An alternating optimization strategy is used for mutually learning the two groups. That is, in each mini-batch, $\frac{\partial L_{G_1}}{\partial \Theta_{G_1}}$ and $\frac{\partial L_{G_2}}{\partial \Theta_{G_2}}$ are computed and propagated back to update the model parameters.
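A small sketch may help make the loss terms concrete. It computes a KL divergence between two predicted label distributions and combines the three losses with the weights $\lambda_D$ and $\lambda_M$. The direction of the divergence (peer prediction as the reference distribution, following the usual deep mutual learning formulation) and the additive form of the objective are assumptions here; the additive form is consistent with the feature-extractor gradients given later in the paper, with the sign of the domain term handled by the gradient reversal layer.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(p_c * math.log((p_c + eps) / (q_c + eps))
               for p_c, q_c in zip(p, q) if p_c > 0)

def hybrid_objective(l_cls, l_dom, l_sml, lambda_d=1.0, lambda_m=1.0):
    """Weighted sum of classification, domain, and mutual learning losses."""
    return l_cls + lambda_d * l_dom + lambda_m * l_sml

# Predictions of the two classifiers for one target-domain document.
peer_prediction = [0.7, 0.2, 0.1]   # from the other group
own_prediction = [0.5, 0.3, 0.2]    # from this group
l_sml = kl_divergence(peer_prediction, own_prediction)
total = hybrid_objective(l_cls=1.0, l_dom=0.5, l_sml=l_sml)
```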

Intuitively, because of their different parameter initializations, the feature extractor and sentiment classifier in each group acquire somewhat different abilities (e.g., focusing on particular parts of the latent feature space) while pursuing results similar to those of the peer group under the guidance of mutual learning. As such, they can transfer some exclusive knowledge to their peers. It is worth noting that Eq. 6 and 7 only guide each group with the other’s predicted labels, without telling it how to obtain these pseudo-labels, and thus do not force the groups to obtain very similar feature representations. However, the way in which these pseudo-labels are applied in the unsupervised domain-adaptation scenario is questionable. The requirement that the two classifiers trust the pseudo-labels generated by each other (see Eq. 6 and 7) is too restrictive in domain adaptation, where neither classifier sees any labeled data in the target domain. As a result, the classifiers might not be strong enough to teach each other: they are misled by their counterparts’ predictions and cannot achieve improved classification performance (see Figure 3 for validation).

To avoid degrading the classification performance while retaining the ability to learn sentiment polarity from the target domain, we introduce another component called the label prober (P) into each group, which has the same architecture as the sentiment classifier but comes with its own learnable parameters. Unlike the sentiment classifiers, the label probers are not used for sentiment classification. Instead, each prober probes the sentiment information learned by the other group in the target domain and transfers this knowledge to the feature extractor of its own group, acting as a bridge that connects the feature extractor to the sentiment classifier of the other group. In particular, our mutual learning based loss functions are defined as follows:

image

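Correspondingly, we have two new versions of the objectives $L_{G_1}$ and $L_{G_2}$, obtained by replacing $L_{sML_1}$ and $L_{sML_2}$ with $L_{ML_1}$ and $L_{ML_2}$. Taking $\Theta_{C_1}$ for illustration, compared with standard mutual learning, our approach changes the gradients of the classifiers from $\frac{\partial L_{CLS_1}}{\partial \Theta_{C_1}} + \lambda_M \frac{\partial L_{sML_1}}{\partial \Theta_{C_1}}$ to $\frac{\partial L_{CLS_1}}{\partial \Theta_{C_1}}$. Thus it frees the classifiers from learning the unreliable pseudo-labels mentioned before.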

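Thanks to the integration of adversarial learning and mutual learning, the gradients of the feature extractors are given as

$$\frac{\partial L_{G_1}}{\partial \Theta_{FE_1}} = \frac{\partial L_{CLS_1}}{\partial \Theta_{FE_1}} - \eta \lambda_D \frac{\partial L_{DOM_1}}{\partial \Theta_{FE_1}} + \lambda_M \frac{\partial L_{ML_1}}{\partial \Theta_{FE_1}}$$

and

$$\frac{\partial L_{G_2}}{\partial \Theta_{FE_2}} = \frac{\partial L_{CLS_2}}{\partial \Theta_{FE_2}} - \eta \lambda_D \frac{\partial L_{DOM_2}}{\partial \Theta_{FE_2}} + \lambda_M \frac{\partial L_{ML_2}}{\partial \Theta_{FE_2}},$$

enabling the feature extractors to learn data distributions and sentiment information from both the source domain and the target domain.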
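The way the three gradient signals are combined for a feature extractor can be written out directly. The sketch below assumes the per-loss gradients have already been computed elsewhere and are passed in as plain lists (the function and argument names are hypothetical); it only reproduces the weighting $\frac{\partial L_{CLS}}{\partial \Theta_{FE}} - \eta \lambda_D \frac{\partial L_{DOM}}{\partial \Theta_{FE}} + \lambda_M \frac{\partial L_{ML}}{\partial \Theta_{FE}}$, using the default hyper-parameter values reported in the experimental setup.

```python
def feature_extractor_gradient(g_cls, g_dom, g_ml,
                               eta=0.005, lambda_d=1.0, lambda_m=1.0):
    """Combine per-loss gradients w.r.t. the feature extractor parameters.

    g_cls: gradient of the source classification loss
    g_dom: gradient of the domain loss (reversed and scaled by eta via the GRL)
    g_ml:  gradient of the prober-based mutual learning loss
    """
    return [c - eta * lambda_d * d + lambda_m * m
            for c, d, m in zip(g_cls, g_dom, g_ml)]

def classifier_gradient(g_cls):
    """With probers, the classifier is updated by the classification loss only."""
    return list(g_cls)

# Toy gradients for a feature extractor with two parameters.
fe_grad = feature_extractor_gradient(
    g_cls=[0.2, -0.1], g_dom=[1.0, 2.0], g_ml=[0.05, 0.05])
```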

In this section, we assess the effectiveness of our approach DAML by first clarifying the experimental setup and afterwards analyzing the experimental results.

image

Table 1: Statistics of the experimental datasets.

Experimental Setup

Datasets We adopt multiple publicly available datasets with different origins to evaluate DAML. The first pair of source and target datasets consists of the Yelp and IMDB datasets built by (Tang, Qin, and Liu 2015b). One issue is that Yelp has 5 sentiment labels, while IMDB has 10. To align the sentiment label spaces for domain adaptation, we simply divide the scores of the sentiment labels in IMDB by 2 and round them up.
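The label alignment amounts to a ceiling division. A minimal sketch, assuming IMDB ratings are integers from 1 to 10 (the function name is hypothetical):

```python
import math

def align_imdb_rating(rating_10):
    """Map a 1-10 IMDB rating onto the 1-5 scale: divide by 2 and round up."""
    return math.ceil(rating_10 / 2)

# Ratings 1 and 2 map to 1, ratings 3 and 4 map to 2, and so on.
aligned = [align_imdb_rating(r) for r in range(1, 11)]
```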

Moreover, to investigate the performance on different domains with the same origin, we choose three domains, i.e., Electronics, CD, and Clothing, from the Amazon dataset (McAuley et al. 2015). For each domain, there are 80,000 documents in the training set and 10,000 in each of the test set and the development set. All of these statistics are summarized in Table 1.

Evaluation Following the study of (Tang, Qin, and Liu 2015b), the model performance in our experiments is compared according to Acc, short for Accuracy, which measures the classification performance, and RMSE, which shows the divergence between the predicted ratings and the ground truth ratings.
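Both metrics are straightforward to compute from predicted and gold ratings. The following sketch uses their standard definitions; the toy ratings are made up for illustration.

```python
import math

def accuracy(predicted, gold):
    """Fraction of documents whose predicted rating equals the gold rating."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def rmse(predicted, gold):
    """Root mean squared error between predicted and gold ratings."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))

predicted = [5, 3, 1, 4]
gold = [5, 4, 1, 2]
acc = accuracy(predicted, gold)
err = rmse(predicted, gold)
```

Accuracy treats every mistake equally, whereas RMSE penalizes a prediction more the further it lies from the true rating, which is why the two are reported together for ordinal sentiment labels.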

For the unsupervised domain-adaptation task, the unavailability of a development set in target domains results in two issues. First, it is impossible to tune hyper-parameters for every task according to the performance on the development set of each target domain. Therefore, as is done by (He et al. 2018), we build a development set that is available for only one task (Yelp → IMDB). The hyper-parameters of our model and the baselines are tuned to perform their best on this development set and are then fixed for all tasks. Second, we are not able to apply early stopping to a model according to its performance in a target domain. As an alternative, we use the development set of each source domain to determine early stopping. Specifically, for each model, we save its parameters whenever it achieves better results on the development set during training. The saved parameters with the best performance are finally used as the model parameters for testing on the target domain. All the models in our experiments determine their parameters in this manner.
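The checkpoint selection rule described above can be sketched as follows. The list of scores stands in for source-domain development accuracy measured at successive evaluation points; saving and loading of actual model parameters is omitted, and the function name is hypothetical.

```python
def select_checkpoint(source_dev_scores):
    """Return the index of the evaluation point with the best source-dev score.

    A checkpoint is (re)saved only when the score strictly improves,
    so ties keep the earlier checkpoint.
    """
    best_step, best_score = None, float("-inf")
    for step, score in enumerate(source_dev_scores):
        if score > best_score:
            best_step, best_score = step, score
    return best_step

# Source-domain development accuracy at six evaluation points (toy values).
chosen = select_checkpoint([0.60, 0.64, 0.63, 0.66, 0.66, 0.65])
```

The parameters saved at `chosen` would then be the ones evaluated on the target-domain test set.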

For each mutual learning based approach, the classifier that performs the best on the development set of a source domain is chosen to be eventually evaluated on a target domain.

Implementations As mentioned in the section on feature extraction, our approach uses HAN as the feature extractor because of its good performance. For a fair comparison, we make the other approaches use HAN as well. Although Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019) has been gaining more and more attention as a powerful text feature extractor, we did not find a setting in which it obtained significantly better results than HAN in our local experiments.

Before the training process starts, 200-dimensional word embeddings are learned by Word2vec (Mikolov et al. 2013). Word2vec is trained on all the available data in all the experimented domains. The hyper-parameters are fixed for all domains. By default, $\eta$, $\lambda_D$, and $\lambda_M$ are set to 0.005, 1.0, and 1.0, respectively. Adam is adopted as the optimizer for all the experiments.

For the baselines, if their source codes are available, we directly utilize them by just adjusting input and tuning parameters. Otherwise, we implement them by referring to the settings shown in their papers.

Baselines The following end-to-end domain-adapted sentiment classification models are adopted for comparison.

Naive: We experiment with a sentiment classification model implemented by HAN without any domain-adaptation technique. It is trained on a source domain and then directly tested on a target domain.

DANN (Ganin et al. 2016): An adversarial discriminator is introduced to distinguish data from different domains, so that the feature extractor eventually generates domain-invariant latent feature vectors by trying to confuse the discriminator.

MMD (Gretton et al. 2012a): As another common method for domain adaptation, MMD is applied to a feature extractor to make it domain-independent. The results of MMD are mainly compared with those of adversarial domain adaptation in order to examine the effectiveness of the latter.

WDGRL (Shen et al. 2017): This approach is similar to DANN, but uses the Wasserstein distance instead of the JS divergence as the loss function of the discriminator.

DAS (He et al. 2018): This approach employs entropy minimization and a self-ensemble bootstrapping method to refine its classifier while minimizing the domain divergence.

ACAN (Qu et al. 2019): ACAN introduces two classifier networks on top of the feature extractor. The discrepancy between the two classifiers is increased to provide diverse views, while the extractor learns to create better features away from the category boundaries by minimizing this discrepancy.

Table 2: Sentiment classification results in terms of Acc and RMSE. The best results are marked in bold.

image

To fully investigate the mechanism behind our approach, we consider the following variants.

Naive Ensemble (NE): To show that DAML performs well in terms of leveraging multiple parallel models, we design a naive ensemble framework that takes the probabilities returned by two domain-adapted models and uses their sum as its prediction.

ML (Zhang et al. 2018): This is the direct application of standard mutual learning with adversarial learning to our task, as discussed in the previous section.

Feature Alignment (FA): This method aligns two classification models at the level of the latent feature layer by minimizing the Euclidean distance between the outputs of the two feature extractors for each input document.
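Two of these variants reduce to very small computations, sketched below with made-up inputs. `naive_ensemble` adds the class probabilities of two models and predicts the class with the largest sum; `feature_alignment_distance` is the Euclidean distance that FA would minimize between the two extractors' outputs for the same document. The function names are hypothetical and the training loops around them are omitted.

```python
import math

def naive_ensemble(probs_a, probs_b):
    """Predict the class index with the largest summed probability."""
    summed = [a + b for a, b in zip(probs_a, probs_b)]
    return max(range(len(summed)), key=summed.__getitem__)

def feature_alignment_distance(feat_a, feat_b):
    """Euclidean distance between two feature vectors of the same document."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))

# The two models disagree on their top class; the ensemble picks class 1.
prediction = naive_ensemble([0.1, 0.6, 0.3], [0.2, 0.3, 0.5])
distance = feature_alignment_distance([1.0, 2.0], [4.0, 6.0])
```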

Experimental Results

Model comparison Table 2 compares our approaches with the baselines on different domain-adaptation tasks. Our final approach DAML outperforms all the other approaches on most tasks. Compared with the NE framework, which on average performs the best among the remaining approaches, DAML improves accuracy from 63.4% to 64.5% and reduces RMSE from 0.925 to 0.898. Moreover, NE requires multiple classification models at test time, while DAML only needs the single model that performs the best on the development set of a source domain.

Among the four proposed methods, the ML and FA learning frameworks perform significantly worse than NE and DAML, and even worse than single models. As analyzed before, ML requires each model to trust its weak peer. Its failure supports the intuition that models ought to treat ground truth and pseudo-labels differently. On the other hand, although FA tries to align multiple models at the feature representation layer without disturbing their classifiers, it does not give the models the capability to learn from each other effectively.

Mutual learning on different domain settings So far, one of the most important remaining questions is whether our learning framework works by offering pseudo-labels of the target domain or by only introducing a regularizer that pulls the two models closer. To answer this question, we train a group of DAML frameworks that apply probers to only the source domain, only the target domain, and both domains. These experiments are conducted on the Yelp, IMDB, Amazon Electronics, and CD datasets.

image

Table 3: Results of DAML trained on different domains, with its probers exclusively applied to the domain indicated by the column name.

image

Table 3 presents the corresponding results. We can see that training the probers of DAML exclusively on the source domain performs much worse than training them on the target domain or on both domains. The comparison emphasizes the importance of leveraging the sentiment information in the target domains. According to the results, we can conclude that what makes DAML perform well is its ability to take advantage of pseudo-labels.

Feature visualization Taking the task of Yelp → IMDB for illustration, we visualize the feature vectors of the target domain generated by our approach and by standard mutual learning (ML). As shown in Figure 2, the feature representations learned by ML have vague classification boundaries. By contrast, our approach successfully reveals the sentiment polarity in the obtained representations, since negative (red and blue), neutral (green), and positive (purple and yellow) examples are located in the left, middle, and right parts of the figure, respectively.

image

Figure 2: t-SNE visualization of the features in the target domain learned by standard mutual learning and the proposed DAML method. Different rating scores are indicated by colors and shapes. (Red = 1, Blue = 2, Green = 3, Purple = 4, Yellow = 5)

image

Figure 3: Accuracy on the test set, evaluated at each training step. In each picture, the two solid lines stand for the two models trained in the framework. The dashed line represents the plain adversarial domain-adaptation method DANN.

Analysis on training process It can help understand the effect of each mutual learning framework to see its performance in the target domain at each step. Therefore, we conduct a series of experiments on the Yelp→IMDB task, where a development set from the target domain is available. The performance on the development set of the chosen approaches is evaluated every several steps and the curves of accuracy are drawn in Figure 3.

We can see that DANN and the two sentiment classifiers in each of the two mutual learning based frameworks obtain high performance at around the twentieth step. After that, DANN keeps its performance relatively stable compared with the other models in the figure. In particular, the curves of standard ML start to drop rapidly, while our final approach gradually obtains even higher performance. These comparisons clearly illustrate the negative influence of trusting each other’s weak domain-adapted classifier too much.

Influence of hyper-parameters $\eta$ and $\lambda_M$ In order to observe the influence of the relative weights of the different parts of the total loss function, we conduct some experiments with the development sets of the target domains. Specifically, three tasks, i.e., Yelp → IMDB, Clothing → Electronics, and Electronics → CD, are chosen. For each task, our learning approach is trained and then validated on the development set of the target domain, solely to show the influence. This procedure is repeated several times with different $\eta$ and $\lambda_M$. $\lambda_D$ is not considered because, with $\eta$ controlling the gradient that flows to the feature extractors, it only influences the performance of the discriminators.

image

Figure 4: DAML trained on three tasks with different $\eta$ and $\lambda_M$. The vertical axis stands for the relative accuracy gained w.r.t. the minimum point of each curve.

Figure 4 depicts how changes in $\eta$ and $\lambda_M$ influence the performance. It can be seen from the left subfigure that DAML does not perform very well when adversarial learning is not applied (when $\eta$ is 0), which shows the contribution of adversarial learning to our learning framework. Moreover, we can observe from the right subfigure that DAML obtains good performance when $\lambda_M$ is in the range [1.0, 5.0] for all three tasks. Besides, compared with the standard domain-adaptation model without mutual learning (when $\lambda_M$ equals 0), DAML gains improvement over an even larger range of $\lambda_M$. These curves show the effectiveness of our learning framework and its robustness with respect to the critical hyper-parameters.

Influence of number of groups and models Finally, we briefly analyze how the results change when the number of groups in DAML, and of domain-adapted models in the baseline NE, is increased from 2 to 4. We take the Yelp and IMDB domains for testing and show the performance in Table 4.

Table 4: Sentiment classification results of more groups and models.

image

It can be seen that, in this work, involving more models does not always improve the performance of the approaches. This might be attributed to the fact that a development set from a target domain is hardly available in practice, as mentioned in the section of evaluation. Therefore, we follow previous work to train each model until it performs the best on a source domain. In consequence, though an approach with more groups or models may perform better on the target domain at some training points during the learning process, we are not able to finally obtain these peak points.
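The gap between the selected checkpoint and the best achievable one can be illustrated with a small sketch. It assumes that each training step records a source-domain and a target-domain accuracy; the helper name, the record layout, and all numbers are hypothetical and serve only to show the selection rule, not to report results.

```python
def select_checkpoint(history, key):
    """Return the recorded step with the highest accuracy under `key`."""
    return max(history, key=lambda record: record[key])

# Made-up training log, for illustration only.
history = [
    {"step": 10, "source_acc": 0.70, "target_acc": 0.58},
    {"step": 20, "source_acc": 0.74, "target_acc": 0.63},
    {"step": 30, "source_acc": 0.76, "target_acc": 0.61},
]

chosen = select_checkpoint(history, "source_acc")  # what is available in practice
oracle = select_checkpoint(history, "target_acc")  # needs target labels, so unavailable

print("selected step:", chosen["step"], "target accuracy:", chosen["target_acc"])
print("oracle step:", oracle["step"], "target accuracy:", oracle["target_acc"])
```

In this toy log the source-selected checkpoint is not the one with the highest target accuracy, which mirrors the situation described above: the peak on the target domain exists during training but cannot be identified without a target development set.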

In this work, we have devised DAML, a novel deep adversarial mutual learning framework through which two domain-adapted models can learn from each other to effectively utilize the sentiment information lying in a target domain. This advantage comes from incorporating probers into mutual learning, which act as bridges connecting feature extractors with sentiment classifiers. Unlike standard mutual learning, our learning framework introduces a flexible manner for each model to decide how to view its peer’s predictions, rather than aligning the output labels of sentiment classifiers, which is somewhat questionable in unsupervised domain-adaptation tasks. Experiments have shown that DAML significantly improves the performance over representative end-to-end models for domain-adapted sentiment classification.

We thank the anonymous reviewers for their valuable comments. This work is supported by National Key Research and Development Program (2019YFB2102600), NSFC (61702190), NSFC-Zhejiang (U1609220), and Shanghai Sailing Program (17YF1404500).

[Bahdanau, Cho, and Bengio 2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

[Ben-David et al. 2010] Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1-2):151–175.

[Blitzer, Dredze, and Pereira 2007] Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL.

[Chen et al. 2016] Chen, H.; Sun, M.; Tu, C.; Lin, Y.; and Liu, Z. 2016. Neural sentiment classification with user and product attention. In EMNLP, 1650–1659.

[Devlin et al. 2019] Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, 4171–4186.

[Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. S. 2015. Unsupervised domain adaptation by backpropagation. In ICML, 1180–1189.

[Ganin et al. 2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. JMLR 17(1):2096–2030.

[Glorot, Bordes, and Bengio 2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 513–520.

[Gretton et al. 2012a] Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012a. A kernel two-sample test. JMLR 13(Mar):723–773.

[Gretton et al. 2012b] Gretton, A.; Sriperumbudur, B. K.; Sejdinovic, D.; Strathmann, H.; Balakrishnan, S.; Pontil, M.; and Fukumizu, K. 2012b. Optimal kernel choice for large-scale two-sample tests. In NIPS, 1214–1222.

[He et al. 2018] He, R.; Lee, W. S.; Ng, H. T.; and Dahlmeier, D. 2018. Adaptive semi-supervised learning for cross-domain sentiment classification. In EMNLP, 3467–3476.

[Hinton, Vinyals, and Dean 2015] Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[Huang et al. 2007] Huang, J.; Gretton, A.; Borgwardt, K.; Schölkopf, B.; and Smola, A. J. 2007. Correcting sample selection bias by unlabeled data. In NIPS, 601–608.

[Kanaci et al. 2019] Kanaci, A.; Li, M.; Gong, S.; and Rajamanoharan, G. 2019. Multi-task mutual learning for vehicle re-identification. In CVPR Workshops, 62–70.

[Kim 2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP, 1746–1751.

[Liu 2012] Liu, B. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

[McAuley et al. 2015] McAuley, J. J.; Targett, C.; Shi, Q.; and van den Hengel, A. 2015. Image-based recommendations on styles and substitutes. In SIGIR, 43–52.

[Mikolov et al. 2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, 3111–3119.

[Pan and Yang 2009] Pan, S. J., and Yang, Q. 2009. A survey on transfer learning. IEEE TKDE 22(10):1345–1359.

[Peng et al. 2018] Peng, M.; Zhang, Q.; Jiang, Y.; and Huang, X. 2018. Cross-domain sentiment classification with target domain specific information. In ACL, 2505–2513.

[Qu et al. 2019] Qu, X.; Zou, Z.; Cheng, Y.; Yang, Y.; and Zhou, P. 2019. Adversarial category alignment network for cross-domain sentiment classification. In ACL, 2496–2508.

[Shen et al. 2017] Shen, J.; Qu, Y.; Zhang, W.; and Yu, Y. 2017. Wasserstein distance guided representation learning for domain adaptation. arXiv preprint arXiv:1707.01217.

[Socher et al. 2013] Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In ACL, 1631–1642.

[Tang, Qin, and Liu 2015a] Tang, D.; Qin, B.; and Liu, T. 2015a. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, 1422–1432.

[Tang, Qin, and Liu 2015b] Tang, D.; Qin, B.; and Liu, T. 2015b. Learning semantic representations of users and products for document level sentiment classification. In ACL, volume 1, 1014–1023.

[Vincent et al. 2008] Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In ICML, 1096–1103.

[Wu et al. 2018] Wu, Z.; Dai, X.; Yin, C.; Huang, S.; and Chen, J. 2018. Improving review representations with user attention and product attention for sentiment classification. In AAAI, 5989–5996.

[Wu et al. 2019] Wu, R.; Feng, M.; Guan, W.; Wang, D.; Lu, H.; and Ding, E. 2019. A mutual learning method for salient object detection with intertwined multi-supervision. In CVPR, 8150–8159.

[Yang et al. 2016] Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical attention networks for document classification. In NAACL, 1480–1489.

[Yu and Jiang 2016] Yu, J., and Jiang, J. 2016. Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification. In EMNLP, 236–246.

[Zhang and Wang 2015] Zhang, W., and Wang, J. 2015. Prior-based dual additive latent dirichlet allocation for user-item connected documents. In IJCAI, 1405–1411.

[Zhang et al. 2018] Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep mutual learning. In CVPR, 4320–4328.

[Ziser and Reichart 2018] Ziser, Y., and Reichart, R. 2018. Pivot based language modeling for improved neural domain adaptation. In NAACL, 1241–1251.

[Ziser and Reichart 2019] Ziser, Y., and Reichart, R. 2019. Task refinement learning for improved accuracy and stability of unsupervised domain adaptation. In ACL, 5895–5906.

